Public benchmarks have largely saturated. The frontier now advances on real-world tasks the headline evals don’t measure — and the field needs robust, domain-grounded benchmarks to know where to push next. Better benchmarks are the bottleneck for better models.
We build with the people who own each domain. Production-grade ground truth from established category leaders, scenarios authored by senior practitioners, expert review on every sample. Each benchmark is a sharp, defensible look at what current frontier models can and can’t yet do.
Vision-language models reason about depth, heading, and 3D object position from monocular driving frames. The best frontier VLM scored 50% — and its reasoning held up only 10% of the time.
Sourced from PlusAI’s production Level 4 autonomous-vehicle stack — a multi-million-mile catalog across the U.S., Europe, and APAC. Every answer comes from LiDAR-camera fusion: sub-meter distances, verified object labels, yaw in radians.
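For orientation, here is a hypothetical shape of one fused ground-truth record. Every field name is an assumption of this sketch; PlusAI's actual schema is not public, and the only grounded parts are the units named above.

```python
from dataclasses import dataclass

@dataclass
class FusedGroundTruth:
    # Illustrative fields only; the real schema is not public.
    object_label: str   # verified object class, e.g. "passenger_car"
    range_m: float      # LiDAR-camera fused distance, sub-meter precision
    yaw_rad: float      # object heading in radians
    frame_id: str       # the monocular camera frame the question is asked over
```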
Ten task categories, from depth binning to heading estimation. Two scoring layers: did the model pick the right letter, and was its reasoning right?
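A minimal sketch of what two-layer scoring could look like. The names (`SpatialSample`, `rationale_judge`) are placeholders, and the assumption that reasoning is only credited when the answer is also correct is this sketch's, not necessarily the benchmark's.

```python
from dataclasses import dataclass

@dataclass
class SpatialSample:
    gold_letter: str     # expert-verified multiple-choice answer, e.g. "B"
    gold_rationale: str  # the fused-sensor reasoning behind that answer

def score(sample: SpatialSample, letter: str, rationale: str,
          rationale_judge) -> tuple[bool, bool]:
    """Layer 1: did the model pick the right letter?
    Layer 2: does its stated reasoning hold up against ground truth?

    `rationale_judge` stands in for whatever adjudication is used
    (expert review, rubric grading, etc.).
    """
    answer_correct = letter.strip().upper() == sample.gold_letter.upper()
    # Sketch assumption: a plausible rationale attached to a wrong
    # letter still scores zero on the reasoning layer.
    reasoning_correct = answer_correct and bool(
        rationale_judge(rationale, sample.gold_rationale)
    )
    return answer_correct, reasoning_correct
```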
View the spatial reasoning benchmark →

Four frontier LLMs on bank-account fraud scenarios from senior practitioners. Every model has at least one weak category — and those categories differ from model to model.
30 scenarios spanning Account Takeover and Money Mule typologies. Adversarial near-duplicates by design: two scenarios share almost every variable and still demand different risk classifications. Models that pattern-match fall apart; models that fuse device, behavioral, network, and transaction evidence hold up.
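To make the near-duplicate design concrete, here is an illustrative pair. The fields, values, and labels are invented for this sketch, not drawn from the benchmark:

```python
# Invented example: two scenarios share almost every variable, but the
# device and behavioral evidence flips the correct classification.
scenario_a = {
    "transfer_amount_usd": 4_900,
    "payee_age_days": 2,
    "login_device": "new device, never seen on this account",
    "session_behavior": "rapid navigation straight to transfers",
    "label": "account_takeover",
}
scenario_b = {
    **scenario_a,
    "login_device": "customer's usual phone, 3-year history",
    "session_behavior": "typical browsing before the transfer",
    "label": "legitimate",
}
```

A model that keys on the transfer pattern alone answers both the same way; only fusing the device and behavioral signals separates them.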
Pass@8 sampled. Exact-match scoring against expert-adjudicated labels. The failures are model-specific — and that asymmetry is the training signal.
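One minimal reading of "pass@8 sampled, exact-match": draw eight completions per scenario and credit the scenario if any matches the adjudicated label. The function names and normalization below are illustrative assumptions:

```python
def normalize(label: str) -> str:
    # Trivial normalization before exact match; the benchmark's real
    # normalization rules are not stated, so this is an assumption.
    return label.strip().lower()

def pass_at_8(sampled_labels: list[str], gold_label: str) -> bool:
    """A scenario passes if any of 8 sampled classifications
    exact-matches the expert-adjudicated label."""
    assert len(sampled_labels) == 8, "pass@8 expects exactly 8 samples"
    return any(normalize(s) == normalize(gold_label) for s in sampled_labels)
```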
View the financial fraud benchmark →

We partner with category leaders to build benchmarks where frontier models actually break — and where the post-training signal lives.
Get in touch →