reasonCore AI
Financial Fraud Benchmark

Frontier models have blind spots in financial fraud.

We evaluated four frontier LLMs on financial fraud risk classification across 30 expert-curated scenarios spanning Account Takeover and Money Mule typologies. Every model has at least one weak category — and the failures are different per model. That asymmetry is the training signal.

Built by senior financial-crimes practitioners with 15+ years average tenure in payments fraud, AML, and risk operations — former executives at Bank of America Payments, FICO/Falcon, Discover, and HSBC.

Eval at a glance
67% · Best Overall · Gemini 3.1 Pro
25% · Lowest Cell · Claude × Passive ATO
25% · Random Baseline · 4-way classification
30 · Scenarios · 4 sub-categories
The Benchmark

Adversarial near-duplicates, by design

Financial-crimes detection turns on subtle contextual signals: a transaction from a new device, an unusual counterparty, a velocity shift. Two scenarios can share almost every variable and still demand different risk classifications. This pack concentrates the training signal at exactly that boundary.

Each scenario presents a bank account profile and a single 4-way classification: High, Medium, Low, or No risk. Subtle contextual shifts (new vs. existing device, transaction velocity, counterparty patterns) flip the correct answer while leaving most context unchanged. Models that lean on coarse pattern-matching fall apart here; models that fuse device, behavioral, network, and transaction evidence hold up.

Every scenario is independently authored by a domain expert, passes two levels of expert review plus editorial compilation, and is sampled at Pass@8 to confirm adversarial validity and quality.

All scenarios are drawn from a common bank-account-holder profile. Tight cross-scenario similarity is intentional: single-variable shifts are the lesson, as the sketch below illustrates.
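A minimal illustration of that near-duplicate structure, in Python with hypothetical field names and values (the shipped schema is richer; see the sample record further down):

# Two hypothetical near-duplicate scenarios: every shared field is identical,
# and a single device-familiarity shift flips the expert label.
base = {
    "account_age_days": 1460,
    "avg_monthly_transfers": 4,
    "transfer_amount_usd": 2900,
    "counterparty": "previously_seen",
}

scenario_a = {**base, "device": "existing_enrolled", "label": "Low Risk"}
scenario_b = {**base, "device": "new_unrecognized", "label": "High Risk"}

# Surface pattern-matching on the shared context cannot separate these;
# only the shifted variable carries the label.
assert scenario_a["label"] != scenario_b["label"]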

Built by senior practitioners. Content, labeling, and adjudication are led by former executives in Payments, AML, and risk operations — backgrounds include Bank of America Payments, FICO / Falcon (card fraud), Discover, and HSBC. Average tenure 15+ years; 20+ AML / fraud engagements led.

Sub-categories evaluated

Sub-Category | Threat | Sample N | What models must reason about
ato_cash_out | Account Takeover | 19 | Unauthorized access through hijacked credentials or session; bad actor cashes out. Detection requires fusing device, network, geolocation, and behavioral-biometric signals.
passive_ato | Account Takeover | 3 | Credentials silently compromised; activity looks ordinary at the surface. Signal is in long-tail behavioral drift and counterparty subtleties.
mule_1st_party | Money Mule | 4 | Account holder knowingly forwards illicit funds. Profile looks legitimate; signal is transaction velocity and counterparty anomalies.
mule_3rd_party | Money Mule | 4 | Account holder is an unwitting intermediary; account itself is hijacked for laundering. Detection hinges on counterparty graph and timing.
Eval Results

Every model has at least one weak category

Aggregate pass rates flatter the models. The per-category breakdown shows asymmetric, model-specific failure modes — and that asymmetry is precisely where adversarial near-duplicates concentrate the training signal.

Overall pass rate by model
Mean Pass@8 across all 30 scenarios. Dashed line: 25% random baseline (4-way classification).

Per-category pass rate, by model

Sub-Category | Claude Opus 4.7 | Gemini 3.1 Pro | GPT 5.4 | Grok 4.2
ATO Cash Out | 45% | 65% | 45% | 51%
Passive ATO | 25% | 46% | 67% | 38%
Money Mule (1st Party Complicit) | 66% | 69% | 34% | 28%
Money Mule (3rd Party Hostile) | 100% | 94% | 47% | 72%
Overall | 53% | 67% | 46% | 50%
Legend: < 30% near random · 30–49% weak · 50–69% moderate · 70–89% strong · ≥ 90% near perfect
Eval integrity. Each question is sampled eight times per model (Pass@8). Scoring is exact-match against the expert-adjudicated label. Cross-scenario similarity within sub-categories is intentional: subtle single-variable shifts (new vs. existing device, velocity, counterparty) are the lesson the data pack teaches. Random-baseline reference is 25% (uniform across the four risk levels).
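A minimal sketch of that scoring rule, assuming one answer string per sample (illustrative only, not the shipped harness):

RISK_LABELS = {"High Risk", "Medium Risk", "Low Risk", "No Risk"}

def pass_at_8(samples: list[str], expert_label: str) -> float:
    """Fraction of the 8 sampled answers that exactly match the expert label."""
    assert len(samples) == 8 and expert_label in RISK_LABELS
    return sum(s.strip() == expert_label for s in samples) / 8

# 7 of 8 samples agree with the adjudicated label -> 0.875, the same
# fractional form as the pass@k values in the sample record further down.
print(pass_at_8(["Medium Risk"] * 7 + ["High Risk"], "Medium Risk"))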
Key Findings

Four models, four failure patterns

Aggregate ranking hides the structure. Each model is strong somewhere and weak somewhere else — and the "somewhere" is different for each.

01 · Gemini 3.1 Pro

Best overall, weakest on Passive ATO

67% mean across all scenarios — the leader. Holds up across Money Mule typologies (94% on 3rd-party-hostile, 69% on 1st-party-complicit). Drops to 46% on Passive ATO, where the signal is long-tail behavioral drift.

02 · Claude Opus 4.7

Perfect on hostile mule, random on Passive ATO

Scores 100% on Money Mule (3rd-Party Hostile) — the only category any model fully clears. Falls to 25% on Passive ATO, exactly the random baseline. That 75-point spread is the widest of any model.

03 · GPT 5.4

The mirror image: strong where others are weak

Best of any model on Passive ATO at 67%. But worst overall at 46%, dragged down by 34% on Money Mule (1st-Party) and 47% on Money Mule (3rd-Party). Underestimates fraud risk and mixes threat categories.

04 · Grok 4.2

Mid-pack with the deepest mule gap

72% on Money Mule (3rd-Party Hostile), at the strong end. Drops to 28% on Money Mule (1st-Party Complicit) — the only model below 30% on a Mule typology. Consistently underestimates fraud risk; failures vary by typology.

05 — The headline

The failures don’t overlap. The training signal is the asymmetry.

Each model has at least one weak category — but the categories are different. Claude is near-perfect on a category where GPT is mid; GPT clears Passive ATO at 67% where Claude is at random. Same data shape, different reasoning failures. That's the high-signal RL target: model-specific blind spots on adversarial near-duplicates.
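As a quick illustration of that asymmetry, a few lines of Python over the per-category table above surface each model's weakest cell and widest spread (values copied from the table; illustrative only):

# Per-category pass rates (percent), copied from the results table above.
RATES = {
    "Claude Opus 4.7": {"ATO Cash Out": 45, "Passive ATO": 25,
                        "Mule (1st Party)": 66, "Mule (3rd Party)": 100},
    "Gemini 3.1 Pro":  {"ATO Cash Out": 65, "Passive ATO": 46,
                        "Mule (1st Party)": 69, "Mule (3rd Party)": 94},
    "GPT 5.4":         {"ATO Cash Out": 45, "Passive ATO": 67,
                        "Mule (1st Party)": 34, "Mule (3rd Party)": 47},
    "Grok 4.2":        {"ATO Cash Out": 51, "Passive ATO": 38,
                        "Mule (1st Party)": 28, "Mule (3rd Party)": 72},
}

for model, by_cat in RATES.items():
    worst_cat, worst = min(by_cat.items(), key=lambda kv: kv[1])
    best_cat, best = max(by_cat.items(), key=lambda kv: kv[1])
    print(f"{model}: weakest on {worst_cat} at {worst}% "
          f"(spread {best - worst} pts vs {best_cat} at {best}%)")

Claude's 75-point spread and Grok's 28% mule floor fall straight out of the table.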

The Data Pack

RL- and SFT-ready training pack for financial fraud

Ships with full scenario schemas, expert rationales, and per-model failure taxonomies. Designed to move frontier models from pattern-matching to defensible, industry-expert judgment.

// Sample record (truncated)
{
  "record_id": 1,
  "prompt": "You are a fraud investigator within the operations department of a bank. Your task is to access the data provided on the account and perform a full investigation and determine risk from one of the following [A) High Risk; B) Medium Risk; C) Low Risk; D) No Risk]. ### Context Data … [truncated]",
  "answer": "Medium Risk",
  "category": "ATO",
  "sub_category": "Passive Account Take Over",
  "pass@k": {
    "claude_opus_4.7": 0.000,
    "gemini_3.1_pro": 0.000,
    "gpt_5.4": 0.875,
    "grok_4.2": 0.000
  }
}
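A minimal sketch of one way to mine the pack for model-specific blind spots, assuming records are delivered as JSON Lines with the fields shown above (the file name and format are assumptions; actual delivery is scoped per engagement):

import json

# Hypothetical file name; actual delivery format is scoped per engagement.
with open("fraud_benchmark.jsonl") as f:
    records = [json.loads(line) for line in f]

# Scenarios a given model failed on all 8 samples (pass@8 == 0.0):
# the model-specific blind spots this page calls the RL target.
MODEL = "claude_opus_4.7"
blind_spots = [r for r in records if r["pass@k"][MODEL] == 0.0]

for r in blind_spots:
    print(r["record_id"], r["sub_category"], "expected:", r["answer"])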

Adversarial near-duplicates

Scenarios share most context but flip on a single variable. Trains models to read fine-grained signals instead of pattern-matching surface features.

Multi-signal scenarios

Each prompt fuses device, network, behavioral, counterparty, and transaction evidence. Models must integrate, not pick.

Practitioner-built

Authored, reviewed, and adjudicated by senior fraud / AML / payments executives. Two-level expert review per scenario.

Pass@8 quality control

Every scenario validated by Pass@8 sampling. Adversarial validity confirmed against the expert-adjudicated label.

Calibrated risk scoring

4-way classification (High / Medium / Low / No risk). Addresses both over- and under-estimation of risk observed in frontier models.

Regulated-workflow ready

Fraud detection, AML alert triage, investigations, and new-account onboarding — the actual surfaces where LLMs are being deployed.

Want the full data pack?

Includes scenario schemas, expert rationales, per-model failure taxonomies, and the eval harness. Scoped to your post-training goals.

Get in touch →