🧩 SpatialBench
Evaluating Spatial World Models in Large Language Models · Paper (ICLR 2026 Workshop) · Code
Do LLMs Build Spatial World Models? Evidence from Grid-World Maze Tasks
We systematically evaluate the spatial understanding of large language models through maze tasks, a controlled setting that requires multi-step planning and spatial abstraction. Across experiments with Gemini-2.5-Flash, GPT-5-mini, Claude-Haiku-4.5, and DeepSeek-Chat, we uncover large discrepancies in spatial reasoning that challenge common assumptions about LLM planning capabilities.
Key findings:
- Representation sensitivity: Gemini drops from 86% (raw tokenized) to 34% (visual grid) on 5×5 mazes with CoT
- Prompting dependency: Claude-Haiku fails completely without CoT but recovers to 78% with it
- No spatial memory: Models treat sequential questions independently, failing to reuse spatial knowledge they have already computed
Task 1 — Maze Navigation (Planning)
Models must find the shortest path through a maze. Two input formats are compared: raw tokenized adjacency lists versus visual character grids.
Leaderboard (mean accuracy across grid sizes)
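The planning objective in Task 1 can be sketched with a plain breadth-first search over the visual grid format. This is an illustrative reference solution, not the benchmark's scoring code; the `'S'`/`'E'`/`'#'`/`'.'` encoding follows the visual-grid example shown under Input Representations below.

```python
from collections import deque

def shortest_path_length(grid):
    """BFS over a character grid; returns the number of steps from S to E,
    or None if the exit is unreachable. Illustrative sketch only."""
    rows, cols = len(grid), len(grid[0])
    start = next((r, c) for r in range(rows) for c in range(cols)
                 if grid[r][c] == 'S')
    seen = {start}
    queue = deque([(start, 0)])
    while queue:
        (r, c), dist = queue.popleft()
        if grid[r][c] == 'E':
            return dist
        for dr, dc in ((1, 0), (-1, 0), (0, 1), (0, -1)):
            nr, nc = r + dr, c + dc
            if (0 <= nr < rows and 0 <= nc < cols
                    and grid[nr][nc] != '#' and (nr, nc) not in seen):
                seen.add((nr, nc))
                queue.append(((nr, nc), dist + 1))
    return None  # no path from S to E

maze = [
    ['.', 'S', '.', '#'],
    ['#', '.', '.', 'E'],
]
print(shortest_path_length(maze))  # 3
```

A model's proposed path can be checked against this BFS length to decide whether it found a true shortest path rather than merely a valid one.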
Task 2 — Sequential Reasoning with Point Reuse
Models answer 4 proximity questions. Q3 = Q0 (same question repeated). Do models reuse their earlier computation, or start from scratch?
Leaderboard
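Scoring for Task 2 separates plain accuracy from self-consistency on the repeated question. A minimal sketch, assuming per-question answers are collected as strings (the function and field names here are illustrative, not the benchmark's actual API):

```python
def score_reuse(answers, gold):
    """Score a 4-question session where Q3 repeats Q0.

    Returns overall accuracy plus a self-consistency flag: did the
    model give the same answer to the repeated question? Sketch only;
    field names are assumptions.
    """
    correct = [a == g for a, g in zip(answers, gold)]
    consistent = answers[3] == answers[0]
    return {"accuracy": sum(correct) / len(correct),
            "q3_matches_q0": consistent}

print(score_reuse(["A", "B", "C", "A"], ["A", "B", "C", "A"]))
# {'accuracy': 1.0, 'q3_matches_q0': True}
```

A model with genuine spatial memory should always have `q3_matches_q0` true, even when the shared answer is wrong; inconsistency indicates the computation was redone from scratch.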
Task 3 — Compositional Distance Comparison
Models answer 3 questions about maze corners (A, B, C, D) and center M. Q2 (B→C) can potentially be composed from Q0 (A→M) and Q1 (D→M). Δ = Q2 accuracy − avg(Q0, Q1).
Leaderboard (Δ shows compositional benefit)
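The Δ metric from the Task 3 description reduces to a one-line computation over per-question accuracies (illustrative helper; the name is not from the benchmark code):

```python
def compositional_delta(q0_acc, q1_acc, q2_acc):
    """Delta = Q2 accuracy minus the mean of Q0 and Q1 accuracy.

    A positive delta suggests the model benefits from composing the two
    corner-to-center distances it has already computed; a delta near zero
    or below suggests Q2 is solved independently.
    """
    return q2_acc - (q0_acc + q1_acc) / 2

# e.g. Q0 = 0.80, Q1 = 0.70, Q2 = 0.90 gives a delta of +0.15
delta = compositional_delta(0.80, 0.70, 0.90)
```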
Generate Experiment Scripts
Configure the experiments you want to run, then download a zip of ready-to-run shell scripts.
How to use:
- Select tasks, models, and settings below
- Enter the path to your local clone of the repo (so paths in the scripts are correct)
- Click Generate — a preview appears and a zip is ready to download
- Unzip on your cluster, set your API key(s) as environment variables, then:
```shell
export GEMINI_API_KEY=your_key_here
bash run_all.sh    # run all experiments sequentially
# or submit jobs individually:
sbatch Task_1__Maze_Navigation__gemini-2.5-flash__raw__cot.sh
```
About SpatialBench
SpatialBench is the evaluation platform accompanying the paper:
Do LLMs Build Spatial World Models? Evidence from Grid-World Maze Tasks Under review at ICLR 2026 Workshop
Three Tasks
| Task | Type | What it tests |
|---|---|---|
| Task 1: Maze Navigation | Planning | Find shortest path from start to goal |
| Task 2: Sequential Point Reuse | Reasoning | Reuse Q0 computation when Q3=Q0 |
| Task 3: Compositional Distance | Reasoning | Compose corner→center distances for Q2 |
Input Representations
- Raw (tokenized):

```
<ADJLIST_START> (0,0) <--> (0,1) ... <ADJLIST_END>
```

- Visual (grid):

```
Row 0: ['.', 'S', '.', '#']
Row 1: ['#', '.', '.', 'E']
```
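The raw tokenized format can be turned into an edge set with a small parser. This sketch assumes the `(r,c) <--> (r,c)` pair syntax shown in the excerpt above; the benchmark's full token vocabulary may include additional markers.

```python
import re

def parse_adjacency(raw):
    """Parse the raw tokenized adjacency format into a set of undirected
    edges, each a frozenset of two (row, col) cells. Format assumed from
    the excerpt above; sketch only."""
    edges = set()
    pattern = r'\((\d+),(\d+)\) <--> \((\d+),(\d+)\)'
    for a, b, c, d in re.findall(pattern, raw):
        edges.add(frozenset({(int(a), int(b)), (int(c), int(d))}))
    return edges

raw = "<ADJLIST_START> (0,0) <--> (0,1) (0,1) <--> (1,1) <ADJLIST_END>"
print(len(parse_adjacency(raw)))  # 2
```

Using `frozenset` pairs makes the edges undirected, matching the `<-->` notation, so `(0,0) <--> (0,1)` and `(0,1) <--> (0,0)` deduplicate to one edge.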
Models Evaluated
| Model | Provider |
|---|---|
| Gemini 2.5 Flash | Google |
| GPT-5 Mini | OpenAI |
| Claude Haiku 4.5 | Anthropic |
| DeepSeek Chat | DeepSeek |
Grid Sizes
Experiments run on n×n grids for n ∈ {5, 6, 7, 8, 9} by default.
Reproducing Experiments
Clone the repo and use the Get Scripts tab above to generate SLURM scripts, or use the CLI directly:
```shell
cd pipeline/
python run_experiments.py --tasks maze_navigation --models gemini-2.5-flash --mode slurm --dry-run
```
Citation
```bibtex
@inproceedings{spatialbench2026,
  title     = {Do {LLMs} Build Spatial World Models? Evidence from Grid-World Maze Tasks},
  author    = {Anonymous},
  booktitle = {ICLR 2026 Workshop},
  year      = {2026},
}
```