Spatial Reasoning Benchmark

Spatial reasoning? Frontier models have some distance to go.

We evaluated leading VLMs on 3D spatial reasoning tasks derived from a production Level 4 AV stack. The best model scored 50% on answer choice, barely above the 25% random baseline. When we scored the reasoning behind those answers, results fell to 35% at best, and to 10% for the larger model.

Built in partnership with PlusAI, using their multi-million-mile driving catalog spanning the U.S., Europe, and APAC.

50% best choice accuracy (Sonnet 4.6)
10% lowest rationale accuracy (Opus 4.7)
25% random baseline (4-way multiple choice)
10 task categories (depth, heading, lateral)
The Benchmark

Grounded in LiDAR-fused 3D annotations

Every answer comes from calibrated sensor data: sub-meter distances, yaw in radians, verified object labels. Sourced from PlusAI’s production perception stack.

Each sample pairs an annotated front-camera image (numbered bounding boxes on detected objects) with a 4-way multiple-choice question. The model must infer depth, lateral position, heading, and object type from a single 2D frame.

Questions are based on production 3D scene graphs, ensuring metric precision, consistency, and arbitrary scalability, and are human-reviewed for accuracy.
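As a sketch of what that programmatic generation could look like, consider the snippet below. The SceneObject schema, field names, and sampling logic are hypothetical stand-ins for illustration, not PlusAI's actual pipeline; distractor orderings and the answer_key letter are omitted for brevity.

import math
import random
from dataclasses import dataclass

@dataclass
class SceneObject:
    """Hypothetical ego-frame annotation from a 3D scene graph."""
    obj_id: int   # matches the numbered bounding box in the image
    label: str    # e.g. "truck", "suv", "barrier"
    x: float      # forward distance from ego (m)
    y: float      # lateral offset, left-positive (m)

def make_order_closest(objects: list[SceneObject]) -> dict:
    """Build one 'order_closest' sample from a scene graph."""
    picked = random.sample(objects, 4)
    ranked = sorted(picked, key=lambda o: math.hypot(o.x, o.y))
    ids = ", ".join(f"<{o.obj_id}>" for o in picked)
    answer = " ".join(
        f"<{o.obj_id}> ({o.label}, {math.hypot(o.x, o.y):.1f}m)" for o in ranked
    )
    return {
        "question": f"Order objects {ids} closest to furthest.",
        "answer": answer,
        "category": "order_closest",
    }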

Highway, urban, nighttime, construction. Multiple platforms, 2022–2025.

Annotated driving scene
Highway scene with numbered labels. Models reason about 3D relationships from this monocular view.
Category | Task | Ground Truth
order_closest | Rank 4 objects by distance from ego | Metric distances (m)
pick_closer | Which of two objects is nearer? | Distance pair (m)
identify_distance_long | Classify distance: Close / Medium / Far | Metric distance (m)
identify_nearest_ahead | Nearest object along the forward axis | Forward projection (m)
order_leftmost | Rank 4 objects left-to-right in 3D | Lateral offset (m)
identify_rightmost | Furthest-right object in 3D? | Lateral offset (m)
identify_position | Classify position (ahead-left, etc.) | Forward + lateral (m)
identify_heading | Object heading in clock notation | Yaw angle (rad)
relative_heading | Same, opposite, or perpendicular heading? | Yaw diff (deg)
identify_type | Classify the object type | 3D annotation label
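To make the Ground Truth column concrete, here is a minimal sketch of how each quantity can be derived from one ego-frame annotation. The frame convention (x forward, y left-positive, yaw counterclockwise with 0 = ego's heading) and the 45°/135° classification thresholds are assumptions, not the benchmark's documented spec.

import math

def spatial_ground_truth(x: float, y: float, yaw: float) -> dict:
    """Derive the table's ground-truth quantities for one object.
    x = forward (m), y = lateral, left-positive (m), yaw in radians."""
    distance = math.hypot(x, y)  # order_closest, pick_closer, identify_distance_long
    forward = x                  # identify_nearest_ahead
    lateral = y                  # order_leftmost, identify_rightmost
    # Clock notation: 12 o'clock = ego's heading, hours increase clockwise.
    clock = round((-yaw % (2 * math.pi)) / (2 * math.pi) * 12) % 12 or 12
    return {"distance_m": distance, "forward_m": forward,
            "lateral_m": lateral, "heading_clock": clock}

def relative_heading(yaw_a: float, yaw_b: float) -> str:
    """Classify relative heading from the wrapped yaw difference."""
    diff = abs((yaw_a - yaw_b + math.pi) % (2 * math.pi) - math.pi)  # in [0, pi]
    deg = math.degrees(diff)
    if deg < 45:
        return "same"
    if deg > 135:
        return "opposite"
    return "perpendicular"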
Eval Results

Barely above random — and the reasoning is worse

Two scoring layers: choice accuracy (right letter?) and rationale match (right reasoning, verified by a judge model against metric ground truth).
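A minimal sketch of the two layers, assuming the record fields shown in the dataset sample further down. The judge prompt and the call_judge client are placeholders, not the shipped harness.

def score_choice(predicted: str, answer_key: str) -> bool:
    """Layer 1: did the model pick the right letter?"""
    return predicted.strip().upper() == answer_key.strip().upper()

JUDGE_PROMPT = """You are grading a spatial-reasoning rationale.
Ground truth: {ground_truth}
Model rationale: {rationale}
Check object types, distance orderings, and angular relationships.
Reply with exactly one word: pass, partial, or fail."""

def score_rationale(rationale: str, ground_truth: str, call_judge) -> str:
    """Layer 2: judge-model check against metric ground truth.
    `call_judge` is a placeholder for an LLM completion call."""
    verdict = call_judge(JUDGE_PROMPT.format(
        ground_truth=ground_truth, rationale=rationale))
    return verdict.strip().lower()  # "pass" | "partial" | "fail"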

Choice Accuracy vs. Rationale Quality
Right answer? And right reasoning? The gap between the two reveals how much is guesswork.

Per-Category Results

[Per-category results table: choice accuracy and rationale quality for Sonnet 4.6 and Opus 4.7 across Dist. Order, Nearest Ahead, Rightmost, Dist. Bin, Position, Pick Closer, Lateral Order, Obj. Type, Heading, and Rel. Heading; each rationale cell is rated Pass, Partial, or Fail.]
Rationale Quality Breakdown
Of 10 samples each: how many had correct, partial, or incorrect reasoning?
Eval integrity: Oracle runs achieve 100% on both scorers, confirming calibration. Rationale scoring uses a judge model that checks for correct object types, distance orderings, and angular relationships. The eval harness ships with the dataset.
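Reusing the score_choice / score_rationale sketch above, an oracle run amounts to feeding the ground truth back through both scorers and demanding 100%:

def oracle_run(dataset: list[dict], call_judge) -> None:
    """Calibration check: score the ground truth against itself.
    Anything below 100% on either layer means the scorer, not the
    model, is broken. Field names follow the sample record below."""
    choice_ok = sum(
        score_choice(rec["answer_key"], rec["answer_key"]) for rec in dataset)
    rationale_ok = sum(
        score_rationale(rec["answer"], rec["answer"], call_judge) == "pass"
        for rec in dataset)
    assert choice_ok == rationale_ok == len(dataset), "scorer miscalibrated"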
Key Findings

What the eval reveals

Five patterns, each pointing to a specific, addressable gap in current VLM training.

01

Heading estimation is broken

Both models fail every heading task. Estimating yaw from a monocular frame demands 3D reasoning that current VLMs entirely lack.

02

Fragile 2D pixel heuristics

Even correct answers lean on "lower in the image = closer" rather than genuine 3D inference, a heuristic that breaks on slopes, elevated roads, and complex intersections.

03

Object ID degrades with range

Beyond ~50m, models misclassify objects (SUV → car → van). Wrong types cascade into wrong spatial reasoning.

04

Scale doesn’t solve it

Opus 4.7 underperforms Sonnet 4.6 (40% vs. 50%). Spatial reasoning requires targeted supervision, not more parameters.

05 — The headline

Correct answers ≠ correct reasoning

Sonnet: 50% choice, 35% rationale. Opus: 40% choice, 10% rationale. Models are frequently right for wrong reasons, creating false confidence that masks the true spatial reasoning deficit.

The Dataset

SFT & RL-ready spatial reasoning data

Every sample includes metric-grounded chain-of-thought rationales: the kind of spatial supervision current training corpora lack.

// Sample record
{
  "question": "Order objects <2>, <7>, <9>, <10> closest to furthest.",
  "answer": "<2> (barrier, 47.4m) <7> (truck, 90.3m) <9> (suv, 132m) <10> (suv, 165m)",
  "answer_key": "C",
  "category": "order_closest"
}

Chain-of-thought supervision

Step-by-step rationales with exact distances, yaw angles, object types. Directly usable for SFT.
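For instance, one record plus its rationale maps directly onto a chat-format training example. The OpenAI-style message schema and the "rationale" field name here are assumptions for illustration:

def to_sft_example(rec: dict, image_path: str) -> dict:
    """Turn one benchmark record into a chat-format SFT example.
    Assumes a "rationale" field holding the metric-grounded
    chain of thought alongside the sample-record fields."""
    return {
        "messages": [
            {"role": "user", "content": [
                {"type": "image", "path": image_path},
                {"type": "text", "text": rec["question"]},
            ]},
            {"role": "assistant",
             "content": f"{rec['rationale']}\n\nAnswer: {rec['answer_key']}"},
        ]
    }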

Arbitrarily scalable

Programmatic generation from a multi-year driving log archive. This release samples 10 per category; full runs produce thousands.

Diagnostic granularity

10 categories for precise capability profiling.

Production-grade ground truth

LiDAR-camera fusion, sub-meter precision. Deterministic, reproducible, no annotation noise.

Object & scene coverage

Car, SUV, truck, bus, bike, pedestrian, barriers. 10m–200m+. Highway, urban, night, construction.

Eval harness included

4 task variants, oracle validation, judge-model scorer. Ready to run.

Ready to close the spatial reasoning gap?

Expert-curated, RL & SFT-ready spatial training data at the scale and quality frontier labs require.

Get in touch →