🧩 SpatialBench
Evaluating Spatial World Models in Large Language Models · Paper (ICLR 2026 Workshop) · Code
Do LLMs Build Spatial World Models? Evidence from Grid-World Maze Tasks
We systematically evaluate the spatial understanding of large language models through maze tasks, a controlled setting that requires multi-step planning and spatial abstraction. Across experiments with Gemini-2.5-Flash, GPT-5-mini, Claude-Haiku-4.5, and DeepSeek-Chat, we uncover significant discrepancies in spatial reasoning that challenge assumptions about LLM planning capabilities.
Key findings:
- Representation sensitivity: Gemini drops from 86% (raw tokenized) to 34% (visual grid) on 5×5 mazes with CoT
- Prompting dependency: Claude-Haiku fails completely without CoT, recovers to 78% with it
- No spatial memory: Models treat sequential questions independently, failing to reuse computed spatial knowledge
Task 1 — Maze Navigation (Planning)
Models must find the shortest path from start to goal through a maze. Two input formats: raw tokenized adjacency lists vs. visual character grids.
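For reference scoring, the shortest-path length in a character-grid maze can be computed with a standard BFS. A minimal sketch, assuming the `'#'` wall and `'S'`/`'E'` markers from the visual-grid example under Input Representations (this is illustrative, not the repo's scorer):

```python
from collections import deque

def shortest_path_length(grid, start, goal):
    """BFS over a character grid ('#' = wall); returns step count or None."""
    rows, cols = len(grid), len(grid[0])
    queue = deque([(start, 0)])
    seen = {start}
    while queue:
        (r, c), dist = queue.popleft()
        if (r, c) == goal:
            return dist
        for dr, dc in ((1, 0), (-1, 0), (0, 1), (0, -1)):
            nr, nc = r + dr, c + dc
            if (0 <= nr < rows and 0 <= nc < cols
                    and grid[nr][nc] != '#' and (nr, nc) not in seen):
                seen.add((nr, nc))
                queue.append(((nr, nc), dist + 1))
    return None  # goal unreachable

grid = [['.', 'S', '.', '#'],
        ['#', '.', '.', 'E']]
print(shortest_path_length(grid, (0, 1), (1, 3)))  # 3
```

A model's proposed path can then be checked against this length to decide whether it found a true shortest path.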
Leaderboard (mean accuracy across grid sizes)
Task 2 — Sequential Reasoning with Point Reuse
Models answer 4 proximity questions. Q3 = Q0 (same question repeated). Do models reuse their earlier computation, or start from scratch?
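One simple signal of reuse is whether the model's Q3 answer matches its own Q0 answer, independent of correctness. A minimal consistency check (the function name and normalization are illustrative, not the repo's scoring code):

```python
def reuse_consistency(answers):
    """answers: the model's four answers [Q0, Q1, Q2, Q3], where Q3 repeats Q0.
    True if the model gave the same (case-insensitive) answer both times."""
    return answers[3].strip().lower() == answers[0].strip().lower()

print(reuse_consistency(["point B", "point A", "point C", "Point B"]))  # True
```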
Leaderboard
Task 3 — Compositional Distance Comparison
Models answer 3 questions about maze corners (A, B, C, D) and center M. Q2 (B→C) can potentially be composed from Q0 (A→M) and Q1 (D→M). Δ = Q2 accuracy − avg(Q0, Q1).
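The Δ metric follows directly from per-question accuracies. A one-line sketch (the numbers below are placeholders, not results from the paper):

```python
def compositional_delta(acc_q0, acc_q1, acc_q2):
    """Δ = Q2 accuracy − mean(Q0, Q1); positive Δ suggests the model
    benefits from composing the two corner→center distances."""
    return acc_q2 - (acc_q0 + acc_q1) / 2

delta = compositional_delta(0.80, 0.70, 0.60)  # Δ ≈ −0.15 for these placeholders
```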
Leaderboard (Δ shows compositional benefit)
Launch Experiments
Experiments call LLM APIs directly — no compute cluster needed.
Your API keys are used only for your session and are never stored or logged.
Enter keys only for the model(s) you want to evaluate. Jobs for models without a key will be skipped.
API Keys
Enter the key(s) for the model(s) you selected. Keys are used only for this session.
Job Status
About SpatialBench
SpatialBench is the evaluation platform accompanying the paper:
*Do LLMs Build Spatial World Models? Evidence from Grid-World Maze Tasks*
Under review at the ICLR 2026 Workshop.
Three Tasks
| Task | Type | What it tests |
|---|---|---|
| Task 1: Maze Navigation | Planning | Find shortest path from start to goal |
| Task 2: Sequential Point Reuse | Reasoning | Reuse Q0 computation when Q3=Q0 |
| Task 3: Compositional Distance | Reasoning | Compose corner→center distances for Q2 |
Input Representations
- Raw (tokenized): `<ADJLIST_START> (0,0) <--> (0,1) ... <ADJLIST_END>`
- Visual (grid):

  ```
  Row 0: ['.', 'S', '.', '#']
  Row 1: ['#', '.', '.', 'E']
  ```
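The raw tokenized format can be turned into an edge set for programmatic checks. A plausible parser sketch based only on the delimiter tokens shown above (the exact separators between edges in the real format may differ; this is not the repo's parser):

```python
import re

def parse_adjlist(raw):
    """Extract undirected edges from the raw tokenized adjacency-list format."""
    body = raw.split("<ADJLIST_START>")[1].split("<ADJLIST_END>")[0]
    pairs = re.findall(r"\((\d+),(\d+)\)\s*<-->\s*\((\d+),(\d+)\)", body)
    # Each edge is a frozenset so (a <--> b) and (b <--> a) compare equal.
    return {frozenset([(int(a), int(b)), (int(c), int(d))])
            for a, b, c, d in pairs}

raw = "<ADJLIST_START> (0,0) <--> (0,1) ; (0,1) <--> (1,1) <ADJLIST_END>"
print(len(parse_adjlist(raw)))  # 2
```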
Models Evaluated
| Model | Provider |
|---|---|
| Gemini 2.5 Flash | Google |
| GPT-5 Mini | OpenAI |
| Claude Haiku 4.5 | Anthropic |
| DeepSeek Chat | DeepSeek |
Grid Sizes
Experiments run on n×n grids for n ∈ {5, 6, 7, 8, 9} by default.
The underlying maze-dataset library supports larger grids — adjust in the Run tab.
Adding a New Model
Edit pipeline/configs/experiments.yaml:
```yaml
models:
  your-model-id:
    api_key_env: YOUR_API_KEY_ENV_VAR
    display_name: "Your Model Name"
```
Then add inference support in utils/llm_inference.py.
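The actual hook in `utils/llm_inference.py` is repo-specific, but for an OpenAI-compatible provider the inference function might look like the stdlib-only sketch below (the env-var names, base URL, and function name are all placeholders, not the repo's API):

```python
import json
import os
import urllib.request

def infer(prompt: str, model: str = "your-model-id") -> str:
    """POST a chat-completion request to an OpenAI-compatible endpoint.
    YOUR_BASE_URL / YOUR_API_KEY_ENV_VAR are placeholder env vars."""
    payload = json.dumps({
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
    }).encode()
    req = urllib.request.Request(
        os.environ["YOUR_BASE_URL"] + "/chat/completions",
        data=payload,
        headers={
            "Authorization": "Bearer " + os.environ["YOUR_API_KEY_ENV_VAR"],
            "Content-Type": "application/json",
        },
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)["choices"][0]["message"]["content"]
```

Providers with their own SDKs (Anthropic, Google, DeepSeek) would need the equivalent call through their client libraries.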
Citation
```bibtex
@inproceedings{spatialbench2026,
  title     = {Do {LLMs} Build Spatial World Models? Evidence from Grid-World Maze Tasks},
  author    = {Anonymous},
  booktitle = {ICLR 2026 Workshop},
  year      = {2026},
}
```