Preprint · Under review

Lost in Aggregation: A Multi-Scale Diagnostic Benchmark for LLM Spatial Navigation

¹Technical University of Munich  ·  ²Massachusetts Institute of Technology
✉ Corresponding author
From spatial input to navigation: where LLMs differ from human spatial cognition.
From spatial input to navigation. The benchmark diagnoses LLM navigation by decomposing it into the same representational levels that structure human spatial cognition — Fine (local passability), Meso (junction topology), and Macro (global goal direction) — and probes each in isolation.

Overview · Abstract

Large language models (LLMs) are increasingly deployed as planners and assistants in tasks with inherent spatial structure, such as navigation and route planning, yet they remain brittle in sequential spatial reasoning. We ask not merely whether LLMs fail at navigation but where in the spatial-cognition pipeline they get lost.

We introduce a multi-scale diagnostic benchmark that decomposes maze navigation into three cognitive levels drawn from human spatial cognition: Fine (local passability), Meso (junction topology), and Macro (global goal direction). We evaluate three instruction-tuned chat LLMs (GPT-4o, DeepSeek-V3, Llama-3.3-70B) on 1,050 topology-annotated mazes spanning seven sizes (3×3 to 30×30) and three difficulty tiers.

The central finding: end-to-end one-shot navigation collapses to near zero by 10×10 for every model, yet the same models still answer isolated single-level probes with 30–75% accuracy far beyond that size. A multi-hot first-error analysis localizes failures to Meso junction choices (59%) and Fine perception (39%), with global direction almost never at fault (1%). The barrier is the cross-scale aggregation of individually available competences over a long sequential plan, not any single perceptual deficit.

1,050 topology-annotated mazes  ·  7 sizes (3×3 → 30×30)  ·  3 LLMs × 3 difficulty tiers  ·  49,372 scored probe questions

Key Findings · Three modules, three questions

The benchmark mirrors the navigation pipeline: how space is read in, how it is represented across scales, and how control should be divided between an LLM and an algorithm.

RQ1 · INPUT ACQUISITION

Coordinates beat pictures

Among four input formats, structured coordinate text is the most navigable (mean success rate, SR, of 0.34), far ahead of rendered images (0.07). Local moves stay legal even as whole-path success vanishes, so the collapse is one of global path assembly.
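The gap between legal local moves and failed whole paths can be made concrete with a minimal sketch of the two scores. The function names, the toy grid, and the exact definitions (a move is valid if it enters an orthogonally adjacent open cell; success means a fully legal path from start to goal) are assumptions for illustration, not the benchmark's released scoring code.

```python
from typing import List, Tuple

Cell = Tuple[int, int]

# Toy grid, 0 = path, 1 = wall (same convention as the released "grid" field).
GRID = [
    [1, 1, 1, 1, 1],
    [1, 0, 0, 0, 1],
    [1, 1, 1, 0, 1],
    [1, 0, 0, 0, 1],
    [1, 1, 1, 1, 1],
]
START, GOAL = (1, 1), (3, 1)


def is_legal_step(grid: List[List[int]], a: Cell, b: Cell) -> bool:
    """True if b is an open cell orthogonally adjacent to a (assumed definition)."""
    (r1, c1), (r2, c2) = a, b
    in_bounds = 0 <= r2 < len(grid) and 0 <= c2 < len(grid[0])
    return in_bounds and abs(r1 - r2) + abs(c1 - c2) == 1 and grid[r2][c2] == 0


def valid_move_ratio(grid: List[List[int]], path: List[Cell]) -> float:
    """Fraction of consecutive steps in the path that are legal."""
    steps = list(zip(path, path[1:]))
    if not steps:
        return 0.0
    return sum(is_legal_step(grid, a, b) for a, b in steps) / len(steps)


def is_success(grid: List[List[int]], path: List[Cell], start: Cell, goal: Cell) -> bool:
    """True only if the path starts at start, ends at goal, and every step is legal."""
    return (
        len(path) > 0
        and path[0] == start
        and path[-1] == goal
        and all(is_legal_step(grid, a, b) for a, b in zip(path, path[1:]))
    )


FULL = [(1, 1), (1, 2), (1, 3), (2, 3), (3, 3), (3, 2), (3, 1)]
NEAR_MISS = FULL[:-1]  # every move is legal, but the path stops one cell short

print(valid_move_ratio(GRID, NEAR_MISS), is_success(GRID, NEAR_MISS, START, GOAL))
print(valid_move_ratio(GRID, FULL), is_success(GRID, FULL, START, GOAL))
```

Under these definitions a path can score a perfect valid-move ratio and still count as a failure, which is the pattern the paragraph above describes.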

RQ2 · MULTI-SCALE REPRESENTATION

Lost in aggregation

Isolated Fine / Meso / Macro probes survive far past where navigation collapses. First errors fall on Meso (59%) and Fine (39%), almost never Macro (1%). The deficit is cross-scale aggregation, not perception.

RQ3 · HIERARCHICAL ROUTE PLANNING

Delegate, but the wall returns

Handing per-step execution to a deterministic walker and querying the LLM only at junctions lifts GPT-4o by up to 92 percentage points at mid sizes (from 2% to 94% at 7×7), but the same scaling wall re-emerges by 30×30.
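The delegation idea can be sketched as a walker that moves on its own wherever only one onward cell exists and calls a chooser only where the path branches. This is a minimal sketch under stated assumptions: `walk_with_delegation`, the `choose` callback (a stand-in for the LLM query), and the toy grid are hypothetical and are not the authors' harness.

```python
from typing import Callable, Dict, List, Optional, Tuple

Cell = Tuple[int, int]
MOVES = {"up": (-1, 0), "down": (1, 0), "left": (0, -1), "right": (0, 1)}

# Toy T-shaped maze, 0 = path, 1 = wall; the only junction is at (1, 3).
GRID = [
    [1, 1, 1, 1, 1, 1, 1],
    [1, 0, 0, 0, 0, 0, 1],
    [1, 1, 1, 0, 1, 1, 1],
    [1, 1, 1, 0, 1, 1, 1],
    [1, 1, 1, 1, 1, 1, 1],
]


def open_neighbours(grid: List[List[int]], cell: Cell) -> Dict[str, Cell]:
    """Map each passable direction from cell to the neighbouring open cell."""
    r, c = cell
    out = {}
    for name, (dr, dc) in MOVES.items():
        nr, nc = r + dr, c + dc
        if 0 <= nr < len(grid) and 0 <= nc < len(grid[0]) and grid[nr][nc] == 0:
            out[name] = (nr, nc)
    return out


def walk_with_delegation(
    grid: List[List[int]],
    start: Cell,
    goal: Cell,
    choose: Callable[[Cell, List[str]], str],
    max_steps: int = 10_000,
):
    """Walk corridors deterministically; call choose() only at junctions.

    Returns (path, number_of_junction_queries, reached_goal).
    """
    path: List[Cell] = [start]
    prev: Optional[Cell] = None
    queries = 0
    while path[-1] != goal and len(path) <= max_steps:
        here = path[-1]
        # Never step straight back to the cell we just came from.
        options = {d: n for d, n in open_neighbours(grid, here).items() if n != prev}
        if not options:
            return path, queries, False  # dead end
        if len(options) == 1:
            nxt = next(iter(options.values()))  # corridor or corner: no query
        else:
            queries += 1
            direction = choose(here, sorted(options))
            if direction not in options:
                return path, queries, False  # chooser named an impassable direction
            nxt = options[direction]
        prev = here
        path.append(nxt)
    return path, queries, path[-1] == goal


# Stand-ins for a model that answers correctly or incorrectly at the junction.
print(walk_with_delegation(GRID, (3, 3), (1, 5), lambda cell, opts: "right"))
print(walk_with_delegation(GRID, (3, 3), (1, 5), lambda cell, opts: "left"))
```

Because the mazes are trees, one wrong answer at a junction leads to a dead end, so in this sketch the number of junction queries bounds the number of places the chooser can fail.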

The Benchmark · Design at a glance

Benchmark overview: maze corpus, three modules, and a metrics and failure-signature panel.
Benchmark overview. ① Input acquisition — each maze is rendered as Words, Coordinates, an ASCII Map, or an Image. ② Multi-scale representation — one-shot navigation is decomposed into Fine / Meso / Macro, each probed in isolation. ③ Hierarchical route planning — One-shot, Topology-blind Junction, and Topology-aided Junction delegation regimes. Trials are scored by success and a failure-signature vocabulary.

Mazes are 2-D grids generated by a mix of depth-first search (which yields long corridors) and randomized Prim's algorithm (which yields many branches), with fixed seeds for reproducibility. By construction they are trees: any two cells are connected by a unique path, so every junction choice is unambiguously right or wrong. Every maze ships with per-cell passable directions, cell types, the shortest path, and the goal-reaching branch at each junction.
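A minimal sketch of the depth-first-search half of this recipe is given below. It assumes that an n×n maze is carved into a (2n+1)×(2n+1) grid with wall cells between logical cells, which matches the 7 → 15 relation in the schema example but is otherwise an assumption; `generate_dfs_maze` is a hypothetical name, and the released generator may differ.

```python
import random
from typing import List


def generate_dfs_maze(n: int, seed: int) -> List[List[int]]:
    """Carve an n x n tree maze into a (2n+1) x (2n+1) grid; 0 = path, 1 = wall."""
    rng = random.Random(seed)  # fixed seed gives a reproducible maze
    size = 2 * n + 1
    grid = [[1] * size for _ in range(size)]
    grid[1][1] = 0  # logical cell (0, 0) sits at grid position (1, 1)
    stack = [(0, 0)]
    visited = {(0, 0)}
    while stack:
        r, c = stack[-1]
        unvisited = [
            (r + dr, c + dc)
            for dr, dc in ((-1, 0), (1, 0), (0, -1), (0, 1))
            if 0 <= r + dr < n and 0 <= c + dc < n and (r + dr, c + dc) not in visited
        ]
        if not unvisited:
            stack.pop()  # backtrack when every neighbour has been visited
            continue
        nr, nc = rng.choice(unvisited)
        grid[r + nr + 1][c + nc + 1] = 0  # open the wall cell between the two cells
        grid[2 * nr + 1][2 * nc + 1] = 0  # open the newly reached cell
        visited.add((nr, nc))
        stack.append((nr, nc))
    return grid


maze = generate_dfs_maze(7, seed=0)
open_cells = sum(v == 0 for row in maze for v in row)
print(len(maze), open_cells)
```

Each of the n² logical cells is reached exactly once and each visit opens exactly one passage, so the result has n² − 1 passages connecting n² cells: a spanning tree, which is why any two cells are joined by a unique path.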

The Corpus · Seven sizes, one example each

The corpus covers effective (open-cell) widths of 3, 5, 7, 10, 15, 20, and 30. Each thumbnail shows a representative medium-difficulty maze with its start (S), goal (G), and unique solution path.

[Figure: one representative maze per size (3×3, 5×5, 7×7, 10×10, 15×15, 20×20, 30×30). Legend: start, goal, unique solution path.]

In Motion · A robot walks the maze

The base map is a decision-point density heatmap: path cells darken (white → deep red) where junctions and dead-ends cluster — the spots where a navigator must choose. The grey dashed line is the unique correct solution; the robot's blue trail is the route it actually takes.

A robot successfully navigating the maze to the goal.

Solving it

When the robot keeps Fine, Meso, and Macro reasoning aligned step after step, it follows the unique path from S to G. On small mazes the models manage this; the benchmark's question is what breaks as the maze grows.

The three clips below replay the dominant ways navigation falls apart — the same failure signatures our first-error analysis counts: Meso junction choices (59%) and Fine perception (39%).

Robot takes the wrong branch at a junction and ends in a dead-end.

Wrong junction · MESO

At a branch point the robot picks a decoy branch (green arrow marks the goal-reaching one) and dies in a dead-end. The single most common first error.

Robot lunges into a wall.

Wall hit · FINE

The robot emits a move into a wall cell — a local-passability error. It misreads which neighbours are open.

Robot teleports to a non-adjacent cell.

Teleport · FINE

The robot jumps to a non-adjacent cell, losing track of its own position — the most frequent Fine error at larger sizes.
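The three failure signatures above can be told apart with a simple first-error check over a predicted path. The sketch below is a single-label simplification (the paper's analysis is multi-hot) and leaves Macro errors out, because the page does not say how they are attributed; `first_error`, the label strings, and the toy grid are assumptions for illustration.

```python
from typing import List, Optional, Tuple

Cell = Tuple[int, int]

# Toy T-shaped maze, 0 = path, 1 = wall; the only junction is at (1, 3).
GRID = [
    [1, 1, 1, 1, 1, 1, 1],
    [1, 0, 0, 0, 0, 0, 1],
    [1, 1, 1, 0, 1, 1, 1],
    [1, 1, 1, 0, 1, 1, 1],
    [1, 1, 1, 1, 1, 1, 1],
]
SOLUTION = [(3, 3), (2, 3), (1, 3), (1, 4), (1, 5)]


def open_degree(grid: List[List[int]], cell: Cell) -> int:
    """Number of open orthogonal neighbours of cell."""
    r, c = cell
    return sum(
        0 <= r + dr < len(grid) and 0 <= c + dc < len(grid[0]) and grid[r + dr][c + dc] == 0
        for dr, dc in ((-1, 0), (1, 0), (0, -1), (0, 1))
    )


def first_error(
    grid: List[List[int]], solution: List[Cell], path: List[Cell]
) -> Optional[Tuple[int, str, str]]:
    """Return (step index, signature, level) of the first error, or None."""
    on_solution = {cell: k for k, cell in enumerate(solution)}
    for step, (a, b) in enumerate(zip(path, path[1:])):
        (r1, c1), (r2, c2) = a, b
        if abs(r1 - r2) + abs(c1 - c2) != 1:
            return step, "teleport", "Fine"  # target is not an orthogonal neighbour
        inside = 0 <= r2 < len(grid) and 0 <= c2 < len(grid[0])
        if not inside or grid[r2][c2] == 1:
            return step, "wall_hit", "Fine"  # move into a wall cell
        k = on_solution.get(a)
        if k is not None and k + 1 < len(solution) and b != solution[k + 1]:
            if open_degree(grid, a) >= 3:
                return step, "wrong_junction", "Meso"  # left the solution at a branch point
            return step, "off_path", "unclassified"  # e.g. turning back in a corridor
    return None


print(first_error(GRID, SOLUTION, [(3, 3), (2, 3), (1, 3), (1, 2), (1, 1)]))
print(first_error(GRID, SOLUTION, [(3, 3), (3, 4)]))
print(first_error(GRID, SOLUTION, [(3, 3), (1, 3)]))
```

Checking adjacency before passability keeps the two Fine signatures separate: a jump to a distant cell is reported as a teleport even when the target cell happens to be open.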

Results · Where LLMs get lost

Structured textual input outperforms visual and grid input across every model and size.
RQ1. Structured coordinate text outperforms visual and grid input across every model. Valid-move ratio stays high even where success rate is near zero — the failure is global path assembly, not local legality.
Scaling failure is driven by aggregation, not local perception.
RQ2. One-shot navigation collapses to ≈0 by 10×10 for all three models, while isolated Fine / Meso / Macro probes persist at 30–75%. The coupled Meso×Macro probe decays faster than its Meso component.
First-step errors dominated by Meso topological choices and Fine perception.
First-error analysis. Over 1,484 failed trials, the centroid sits at Fine ≈39%, Meso ≈59%, Macro ≈1%.
Coupling ladder: cost of moving from any single probe to full sequential navigation.
Coupling ladder. The jump that breaks navigation is not between levels but from any single question to multi-step sequential execution.
Junction-level delegation lifts navigation but does not remove the scaling barrier.
RQ3. Junction-level delegation with an explicit topological prompt lifts GPT-4o from 2/0/0% to 94/80/70% at sizes 7/10/15 — but recovery is bounded: even aided navigation falls to near zero by 30×30. Delegation buys roughly a doubling of the navigable size; it does not remove the scaling wall.

Download · Benchmark data

All 1,050 mazes are released as JSON, one file per effective size (150 mazes each: 50 per difficulty tier). Every maze is fully topology-annotated — usable as a drop-in spatial-reasoning benchmark for any model.

File              Effective size   Mazes   File size
mazes_s3.json     3×3              150     0.7 MB
mazes_s5.json     5×5              150     1.7 MB
mazes_s7.json     7×7              150     3.2 MB
mazes_s10.json    10×10            150     6.3 MB
mazes_s15.json    15×15            150     13.5 MB
mazes_s20.json    20×20            150     23.4 MB
mazes_s30.json    30×30            150     51.2 MB
All files are published as assets on the repository's GitHub Releases (tag v0.1). The full corpus is ~100 MB.

Per-maze schema

// one entry of mazes_s{N}.json
{
  "id": "maze_s7_medium_000",
  "effective_size": 7,                      // open-cell width
  "grid_size": 15,                          // incl. outer wall
  "algorithm": "dfs",                       // "dfs" | "prim"
  "seed": 30042,
  "grid": [[1,1,1,...],[1,0,0,...], ...],   // 0 = path, 1 = wall
  "start": [5, 1],
  "goal": [1, 5],
  "difficulty": "medium",                   // "easy" | "medium" | "hard"
  "metrics": {
    "junction_count_on_path": ...,
    "dead_end_density": ...,
    "confusion_ratio": ...,
    "shortest_path_length": ...
  },
  "topology": {
    "cell_types": { "r,c": "corridor|corner|junction|dead-end" },
    "passability": { "r,c": {up,down,left,right: bool} },
    "shortest_path": [[r,c], ...],
    "junction_choices": { ... },            // goal-reaching branch per junction
    "dead_ends": [[r,c], ...]
  }
}
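To show how the `grid` field relates to the `topology` annotations, here is a minimal sketch that derives a passability table and cell types from a grid alone. The toy entry is hand-made rather than taken from the released files, and the cell-type rules (dead-end for at most one open neighbour, corridor versus corner for two, junction for three or more) and the use of grid coordinates in the "r,c" keys are assumptions about the annotation, not a documented specification.

```python
import json
from typing import Dict, List

# Hand-made toy entry using a subset of the schema's field names (not real data).
ENTRY_JSON = """
{
  "id": "toy_example",
  "effective_size": 2,
  "grid_size": 5,
  "grid": [[1,1,1,1,1],[1,0,0,0,1],[1,1,1,0,1],[1,0,0,0,1],[1,1,1,1,1]],
  "start": [1, 1],
  "goal": [3, 1]
}
"""

DIRS = {"up": (-1, 0), "down": (1, 0), "left": (0, -1), "right": (0, 1)}


def passability(grid: List[List[int]]) -> Dict[str, Dict[str, bool]]:
    """For every open cell, record which of the four directions is passable."""
    table = {}
    for r, row in enumerate(grid):
        for c, value in enumerate(row):
            if value != 0:
                continue  # skip wall cells
            table[f"{r},{c}"] = {
                name: (
                    0 <= r + dr < len(grid)
                    and 0 <= c + dc < len(row)
                    and grid[r + dr][c + dc] == 0
                )
                for name, (dr, dc) in DIRS.items()
            }
    return table


def cell_type(flags: Dict[str, bool]) -> str:
    """Classify a cell from its passable directions (assumed rules)."""
    open_dirs = {d for d, ok in flags.items() if ok}
    if len(open_dirs) >= 3:
        return "junction"
    if len(open_dirs) == 2:
        straight = open_dirs in ({"up", "down"}, {"left", "right"})
        return "corridor" if straight else "corner"
    return "dead-end"  # zero or one open neighbour


entry = json.loads(ENTRY_JSON)
table = passability(entry["grid"])
for key in sorted(table):
    print(key, cell_type(table[key]))
```

A check like this could be used to compare derived values against the shipped `passability` and `cell_types` fields before relying on them, provided the key convention assumed here matches the files.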

Reproduce · Code & harness

We will release the maze generator, the four input encoders, the isolated-probe generators with answer keys, the junction-delegation harness, and all evaluation and plotting code. The benchmark data is available now.

The benchmark is built to be extended: the topology annotation already supports cyclic mazes (where junction accuracy and shortest-path-branch accuracy diverge), and the delegation harness accepts new per-level oracle conditions.

Cite · BibTeX

If you use the benchmark, please cite:
@misc{jiang2026lostinaggregation,
  title        = {Lost in Aggregation: A Multi-Scale Diagnostic Benchmark
                  for LLM Spatial Navigation},
  author       = {Jiang, Yuhan and Luo, Peng and Meng, Liqiu},
  year         = {2026},
  note         = {Preprint, under review},
  howpublished = {\url{https://yuhanjiang415.github.io/lost-in-aggregation/}}
}