🧩 SpatialBench

Evaluating Spatial World Models in Large Language Models · Paper (ICLR 2026 Workshop) · Code

Do LLMs Build Spatial World Models? Evidence from Grid-World Maze Tasks

We systematically evaluate the spatial understanding of large language models through maze tasks: a controlled setting that requires multi-step planning and spatial abstraction. Across experiments with Gemini-2.5-Flash, GPT-5-mini, Claude-Haiku-4.5, and DeepSeek-Chat, we uncover significant discrepancies in spatial reasoning that challenge common assumptions about LLM planning capabilities.

Key findings:

  • Representation sensitivity: Gemini drops from 86% (raw tokenized) to 34% (visual grid) on 5×5 mazes with CoT
  • Prompting dependency: Claude-Haiku fails completely without CoT, recovers to 78% with it
  • No spatial memory: Models treat sequential questions independently, failing to reuse computed spatial knowledge

Task 1 — Maze Navigation (Planning)

Models are asked to find the shortest path through a maze, presented in one of two input formats: a raw tokenized adjacency list or a visual character grid.
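To make the two formats concrete, here is a minimal sketch of how a small maze might be serialized both ways. The function names and the exact wall/opening conventions are illustrative assumptions; SpatialBench's actual serialization may differ.

```python
# Illustrative sketch only: encodes a maze given as a dict mapping each
# cell (row, col) to the list of neighbor cells it has an open passage to.

def to_adjacency_list(passages):
    """Raw tokenized format: each cell lists its reachable neighbors."""
    lines = []
    for cell in sorted(passages):
        nbrs = " ".join(f"({r},{c})" for r, c in sorted(passages[cell]))
        lines.append(f"({cell[0]},{cell[1]}): {nbrs}")
    return "\n".join(lines)

def to_visual_grid(passages, size):
    """Visual format: '#' walls, '.' open cells/passages, on a (2*size+1)^2 grid."""
    n = 2 * size + 1
    grid = [["#"] * n for _ in range(n)]
    for (r, c), nbrs in passages.items():
        grid[2 * r + 1][2 * c + 1] = "."          # the cell itself
        for nr, nc in nbrs:
            grid[r + nr + 1][c + nc + 1] = "."    # midpoint between adjacent cells
    return "\n".join("".join(row) for row in grid)
```

Both encodings carry identical connectivity information, which is what makes the Gemini gap (86% on the tokenized form vs 34% on the visual grid) a representation effect rather than a task-difficulty effect.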

Results are broken down by K-shot (the number of in-context examples) and by input format.
Leaderboard (mean accuracy across grid sizes)

Task 2 — Sequential Reasoning with Point Reuse

Models answer four proximity questions in sequence, where Q3 repeats Q0 verbatim. Do models reuse their earlier computation, or start from scratch?
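The probe above can be sketched as follows. `ask_model` is an assumed stand-in for whatever chat-completion call is being evaluated; the real harness presumably also scores each answer against ground truth, which is omitted here.

```python
# Hypothetical sketch of the point-reuse probe: ask four questions where
# the last repeats the first, then check whether the two answers to the
# identical question agree (a necessary condition for reuse of the
# earlier computation).

def point_reuse_probe(ask_model, questions):
    """questions: three distinct proximity questions; Q3 repeats Q0."""
    sequence = questions + [questions[0]]        # Q0, Q1, Q2, Q3 (= Q0)
    answers = [ask_model(q) for q in sequence]
    return {
        "answers": answers,
        "consistent": answers[0] == answers[3],  # did Q3 match Q0?
    }
```

Note that agreement between Q0 and Q3 is only evidence of consistency, not proof of reuse; the "no spatial memory" finding rests on models recomputing (and sometimes contradicting) their own earlier answers.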


Leaderboard

Task 3 — Compositional Distance Comparison

Models answer 3 questions about maze corners (A, B, C, D) and center M. Q2 (B→C) can potentially be composed from Q0 (A→M) and Q1 (D→M). Δ = Q2 accuracy − avg(Q0, Q1).
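The Δ metric defined above is a simple difference of accuracies; a minimal sketch, with illustrative numbers that are not from the paper:

```python
# Compositional-benefit metric from the text:
# delta = acc(Q2) - mean(acc(Q0), acc(Q1)).
# A positive value would suggest the model benefits from having already
# computed the two component distances (A->M and D->M) when answering B->C.

def compositional_delta(acc_q0, acc_q1, acc_q2):
    return acc_q2 - (acc_q0 + acc_q1) / 2
```

For example, accuracies of 0.80 and 0.70 on Q0 and Q1 with 0.60 on Q2 give Δ = −0.15, i.e. no compositional benefit.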

Leaderboard (Δ shows compositional benefit)