We evaluated leading VLMs on 3D spatial reasoning tasks derived from a production Level 4 AV stack. The best model scored 50% on answer choice, barely above the 25% random baseline. When we checked the reasoning behind those answers, accuracy fell as low as 10%.
Built in partnership with PlusAI, using their multi-million-mile driving catalog spanning the U.S., Europe, and APAC.
Every answer comes from calibrated sensor data: sub-meter distances, yaw in radians, verified object labels. Sourced from PlusAI’s production perception stack.
Each sample pairs an annotated front-camera image (numbered bounding boxes on detected objects) with a 4-way multiple-choice question. The model must infer depth, lateral position, heading, and object type from a single 2D frame.
Questions are generated from production 3D scene graphs, ensuring metric precision, consistency, and scalability to arbitrary volume; every question is human-reviewed for accuracy.
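For concreteness, here is a minimal sketch of what one sample record could look like. The field names (`image_path`, `boxes`, `ground_truth`, and so on) are illustrative assumptions, not the benchmark's actual schema.

```python
from __future__ import annotations
from dataclasses import dataclass, field

@dataclass
class BoundingBox:
    """One numbered 2D box drawn on the front-camera frame."""
    index: int   # the number rendered on the image, e.g. 3
    x1: float
    y1: float
    x2: float
    y2: float

@dataclass
class Sample:
    """One benchmark item: an annotated image plus a 4-way multiple-choice question."""
    image_path: str                                        # front-camera frame with boxes drawn on
    boxes: list[BoundingBox] = field(default_factory=list)
    category: str = ""                                     # e.g. "pick_closer"
    question: str = ""                                     # natural-language prompt
    choices: dict[str, str] = field(default_factory=dict)  # {"A": "...", ..., "D": "..."}
    answer: str = ""                                       # gold letter
    ground_truth: dict[str, float] = field(default_factory=dict)  # metric values from the 3D scene graph
    rationale: str = ""                                    # metric-grounded chain of thought
```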
Highway, urban, nighttime, construction. Multiple platforms, 2022–2025.
| Category | Task | Ground Truth |
|---|---|---|
| order_closest | Rank 4 objects by distance from ego | Metric distances (m) |
| pick_closer | Which of two objects is nearer? | Distance pair (m) |
| identify_distance_long | Classify distance: Close / Medium / Far | Metric distance (m) |
| identify_nearest_ahead | Nearest object along the forward axis | Forward projection (m) |
| order_leftmost | Rank 4 objects left-to-right in 3D | Lateral offset (m) |
| identify_rightmost | Furthest-right object in 3D? | Lateral offset (m) |
| identify_position | Classify position (ahead-left, etc.) | Forward + lateral (m) |
| identify_heading | Object heading in clock notation | Yaw angle (rad) |
| relative_heading | Same, opposite, or perpendicular heading? | Yaw diff (deg) |
| identify_type | Classify the object type | 3D annotation label |
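To illustrate how questions can be derived programmatically from the 3D scene graph, here is a hedged sketch of two generators: `pick_closer` from metric distances, and `identify_heading` via a yaw-to-clock-notation conversion. The object fields, helper names, and yaw convention are assumptions made for the example, not the production pipeline.

```python
import math
import random

def make_pick_closer(objects: list[dict]) -> dict:
    """Draw two detected objects and ask which is nearer to the ego vehicle.

    Each object is assumed to carry an 'id' and a metric 'distance_m'
    taken directly from the 3D scene graph.
    """
    a, b = random.sample(objects, 2)
    nearer = a if a["distance_m"] < b["distance_m"] else b
    return {
        "category": "pick_closer",
        "question": f"Which object is nearer to the ego vehicle: {a['id']} or {b['id']}?",
        "answer": nearer["id"],
        "ground_truth": {str(a["id"]): a["distance_m"], str(b["id"]): b["distance_m"]},
    }

def yaw_to_clock(yaw_rad: float) -> int:
    """Map a relative yaw angle to one of 12 clock positions.

    Convention assumed here: 0 rad = same direction as ego (12 o'clock),
    positive = counter-clockwise, so -pi/2 (facing right) maps to 3 o'clock.
    """
    hours = (-yaw_rad / (2 * math.pi)) * 12 % 12
    hour = int(round(hours)) % 12
    return 12 if hour == 0 else hour

def make_identify_heading(obj: dict) -> dict:
    """Ask for an object's heading in clock notation, grounded in its yaw."""
    return {
        "category": "identify_heading",
        "question": f"In clock notation, which way is object {obj['id']} heading?",
        "answer": f"{yaw_to_clock(obj['yaw_rad'])} o'clock",
        "ground_truth": {"yaw_rad": obj["yaw_rad"]},
    }
```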
Two scoring layers: choice accuracy (right letter?) and rationale match (right reasoning, verified by a judge model against metric ground truth).
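A minimal sketch of what those two layers could look like in code: exact match on the chosen letter, plus a judge-model check of the rationale against the metric ground truth. The prompt wording and the `judge` callable are assumptions; the benchmark's actual judge prompt and API are not shown here.

```python
def score_choice(predicted_letter: str, gold_letter: str) -> bool:
    """Layer 1: did the model pick the right letter?"""
    return predicted_letter.strip().upper() == gold_letter.strip().upper()

def build_judge_prompt(rationale: str, ground_truth: dict) -> str:
    """Layer 2: check the rationale against metric ground truth with a judge model."""
    facts = "\n".join(f"- {k}: {v}" for k, v in ground_truth.items())
    return (
        "You are grading a spatial-reasoning rationale.\n"
        f"Ground-truth measurements:\n{facts}\n\n"
        f"Model rationale:\n{rationale}\n\n"
        "Reply YES if the rationale's spatial claims are consistent with the "
        "measurements, otherwise reply NO."
    )

def score_rationale(rationale: str, ground_truth: dict, judge) -> bool:
    """`judge` is any callable that sends a prompt to the judge model and returns its text reply."""
    reply = judge(build_judge_prompt(rationale, ground_truth))
    return reply.strip().upper().startswith("YES")
```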
| Category | Sonnet 4.6 Choice | Sonnet 4.6 Rationale | Opus 4.7 Choice | Opus 4.7 Rationale |
|---|---|---|---|---|
| Dist. Order | | | | |
| Nearest Ahead | | | | |
| Rightmost | | | | |
| Dist. Bin | | | | |
| Position | | | | |
| Pick Closer | | | | |
| Lateral Order | | | | |
| Obj. Type | | | | |
| Heading | | | | |
| Rel. Heading | | | | |
Five patterns, each pointing to a specific, addressable gap in current VLM training.
Both models fail every heading task. Estimating yaw from a monocular frame requires 3D reasoning current VLMs lack entirely.
Even correct answers rely on the heuristic "lower in the image = closer" rather than true 3D inference. That heuristic breaks on slopes, elevated roads, and complex intersections.
Beyond ~50m, models misclassify objects (SUV → car → van). Wrong types cascade into wrong spatial reasoning.
Opus 4.7 underperforms Sonnet 4.6 (40% vs. 50% choice accuracy). Spatial reasoning requires targeted supervision, not more parameters.
Sonnet: 50% choice, 35% rationale. Opus: 40% choice, 10% rationale. Models are frequently right for wrong reasons, creating false confidence that masks the true spatial reasoning deficit.
Every sample includes metric-grounded chain-of-thought rationales: the kind of spatial supervision current training corpora lack.
Step-by-step rationales with exact distances, yaw angles, object types. Directly usable for SFT.
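As an illustration of how these rationales could drop into a standard chat-format SFT pipeline, here is one sketched record. Every value in it is an invented placeholder, not real dataset content.

```python
# One chat-format SFT record built from a benchmark sample. All values below
# are invented placeholders that illustrate the structure only.
sft_example = {
    "messages": [
        {
            "role": "user",
            "content": (
                "Image: front_cam_000123.jpg (numbered boxes 1-4). "
                "Which object is nearest to the ego vehicle? "
                "A) 1  B) 2  C) 3  D) 4"
            ),
        },
        {
            "role": "assistant",
            "content": (
                "Object 2 is 14.3 m ahead and 1.1 m left of ego. Objects 1, 3, "
                "and 4 are 27.8 m, 41.5 m, and 63.0 m away. 14.3 m is the "
                "smallest distance, so the answer is B."
            ),
        },
    ]
}
```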
Programmatic generation from a multi-year driving log archive. This sample set contains 10 examples; full runs produce thousands.
10 categories for precise capability profiling.
LiDAR-camera fusion, sub-meter precision. Deterministic, reproducible, no annotation noise.
Car, SUV, truck, bus, bike, pedestrian, barriers. 10m–200m+. Highway, urban, night, construction.
4 task variants, oracle validation, judge-model scorer. Ready to run.
Expert-curated, RL & SFT-ready spatial training data at the scale and quality frontier labs require.
Get in touch →