We evaluated four frontier LLMs on financial fraud risk classification across 30 expert-curated scenarios spanning Account Takeover and Money Mule typologies. Every model has at least one weak category — and the failures are different per model. That asymmetry is the training signal.
Built by senior financial-crimes practitioners with 15+ years average tenure in payments fraud, AML, and risk operations — former executives at Bank of America Payments, FICO/Falcon, Discover, and HSBC.
Financial-crimes detection turns on subtle contextual signals: a transaction from a new device, an unusual counterparty, a velocity shift. Two scenarios can share almost every variable and still demand different risk classifications. This pack concentrates the training signal at exactly that boundary.
Each scenario presents a bank account profile and a single 4-way classification: High, Medium, Low, or No risk. Subtle contextual shifts (new vs. existing device, transaction velocity, counterparty patterns) flip the correct answer while leaving most context unchanged. Models that lean on coarse pattern-matching fall apart here; models that fuse device, behavioral, network, and transaction evidence hold up.
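The adversarial near-duplicate structure can be sketched in a few lines. This is a hypothetical, minimal schema for illustration only; the field names (`device`, `txn_velocity`, `counterparty`) and the example pair are assumptions, not the pack's actual schema or data.

```python
from dataclasses import dataclass, replace

# Hypothetical minimal scenario schema -- illustrative field names,
# not the pack's actual schema.
@dataclass(frozen=True)
class Scenario:
    device: str        # "existing" or "new"
    txn_velocity: str  # e.g. "baseline", "spiked"
    counterparty: str  # e.g. "known", "first_seen"
    label: str         # expert-adjudicated risk: High / Medium / Low / No

base = Scenario(device="existing", txn_velocity="baseline",
                counterparty="known", label="No")

# Adversarial near-duplicate: one variable flips, and the correct
# risk classification flips with it.
variant = replace(base, device="new", label="High")

changed = [f for f in ("device", "txn_velocity", "counterparty")
           if getattr(base, f) != getattr(variant, f)]
assert changed == ["device"]        # single-variable shift
assert base.label != variant.label  # different correct answer
```

A model that pattern-matches on the shared surface context scores the pair identically; a model that weighs the flipped variable does not.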
Every scenario is independently authored by a domain expert, passes two levels of expert review and editorial compilation, and is validated with Pass@8 sampling to confirm adversarial validity and quality.
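A Pass@8 validation loop can be sketched as below. The `classify` stub and the validity criterion are assumptions for illustration; in practice the stub would be a real model API call, and the exact acceptance thresholds are the pack's, not shown here.

```python
import random
from collections import Counter

LABELS = ["High", "Medium", "Low", "No"]

def classify(scenario, seed):
    """Stand-in for a model call -- replace with a real API request."""
    random.seed(seed)
    return random.choice(LABELS)

def pass_at_8(scenario, expert_label):
    """Sample the model 8 times against one scenario and report the
    fraction of samples matching the expert-adjudicated label, plus the
    full answer distribution. A low pass rate on a well-reviewed label
    is evidence the scenario is adversarially valid (not trivially
    solvable), which is the property being confirmed."""
    samples = [classify(scenario, seed=s) for s in range(8)]
    hits = sum(1 for s in samples if s == expert_label)
    return hits / 8, Counter(samples)

rate, dist = pass_at_8({"device": "new"}, expert_label="High")
assert 0.0 <= rate <= 1.0 and sum(dist.values()) == 8
```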
All scenarios are drawn from a common bank-account-holder profile. Tight cross-scenario similarity is intentional: single-variable shifts are the lesson.
| Sub-Category | Threat | Sample N | What models must reason about |
|---|---|---|---|
| ato_cash_out | Account Takeover | 19 | Unauthorized access through hijacked credentials or session; bad actor cashes out. Detection requires fusing device, network, geolocation, and behavioral-biometric signals. |
| passive_ato | Account Takeover | 3 | Credentials silently compromised; activity looks ordinary at the surface. Signal is in long-tail behavioral drift and counterparty subtleties. |
| mule_1st_party | Money Mule | 4 | Account holder knowingly forwards illicit funds. Profile looks legitimate; signal is transaction velocity and counterparty anomalies. |
| mule_3rd_party | Money Mule | 4 | Account holder is an unwitting intermediary; account itself is hijacked for laundering. Detection hinges on counterparty graph and timing. |
Aggregate pass rates flatter the models. The per-category breakdown shows asymmetric, model-specific failure modes — and that asymmetry is precisely where adversarial near-duplicates concentrate the training signal.
| Sub-Category | Claude Opus 4.7 | Gemini 3.1 Pro | GPT 5.4 | Grok 4.2 |
|---|---|---|---|---|
| ATO Cash Out | 45% | 65% | 45% | 51% |
| Passive ATO | 25% | 46% | 67% | 38% |
| Money Mule (1st Party Complicit) | 66% | 69% | 34% | 28% |
| Money Mule (3rd Party Hostile) | 100% | 94% | 47% | 72% |
| Overall | 53% | 67% | 46% | 50% |
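The Overall row is consistent with a sample-weighted mean over the four sub-categories (N = 19 / 3 / 4 / 4 from the sub-category table). That weighting is an assumption about how Overall was computed; the check below confirms the table's rows and the Overall row agree under it, to within rounding.

```python
# Sanity-check: Overall vs. sample-weighted mean of per-category rates.
# Sample sizes come from the sub-category table; the weighting scheme
# itself is an assumption, verified here only up to rounding.
N = [19, 3, 4, 4]  # ATO Cash Out, Passive ATO, Mule 1st, Mule 3rd

rates = {  # per-category pass rates, in table row order
    "Claude Opus 4.7": [0.45, 0.25, 0.66, 1.00],
    "Gemini 3.1 Pro":  [0.65, 0.46, 0.69, 0.94],
    "GPT 5.4":         [0.45, 0.67, 0.34, 0.47],
    "Grok 4.2":        [0.51, 0.38, 0.28, 0.72],
}
reported = {"Claude Opus 4.7": 53, "Gemini 3.1 Pro": 67,
            "GPT 5.4": 46, "Grok 4.2": 50}

for model, r in rates.items():
    weighted = 100 * sum(n * x for n, x in zip(N, r)) / sum(N)
    # Each weighted mean lands within one point of the reported Overall.
    assert abs(weighted - reported[model]) <= 1.0, (model, weighted)
```

Note how the weighting matters: with 19 of 30 scenarios in ATO Cash Out, that category dominates the aggregate, which is exactly why the per-category breakdown is more informative than the Overall row.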
Aggregate ranking hides the structure. Each model is strong somewhere and weak somewhere else — and the "somewhere" is different for each.
Gemini 3.1 Pro leads with a 67% mean across all scenarios. Holds up across Money Mule typologies (94% on 3rd-party-hostile, 69% on 1st-party-complicit). Drops to 46% on Passive ATO, where the signal is long-tail behavioral drift.
Claude Opus 4.7 scores 100% on Money Mule (3rd-Party Hostile), the only category any model fully clears. Falls to 25% on Passive ATO, exactly the random baseline for a 4-way classification. That 75-point spread is the widest of any model.
GPT 5.4 is the best of any model on Passive ATO at 67%, but the worst overall at 46%, dragged down by 34% on Money Mule (1st-Party) and 47% on Money Mule (3rd-Party). Underestimates fraud risk and mixes threat categories.
Grok 4.2 scores 72% on Money Mule (3rd-Party Hostile), at the strong end. Drops to 28% on Money Mule (1st-Party Complicit), the only model below 30% on a Mule typology. Consistently underestimates fraud risk; failures vary by typology.
Each model has at least one weak category — but the categories are different. Claude is near-perfect on a category where GPT is mid; GPT clears Passive ATO at 67% where Claude is at random. Same data shape, different reasoning failures. That's the high-signal RL target: model-specific blind spots on adversarial near-duplicates.
Ships with full scenario schemas, expert rationales, and per-model failure taxonomies. Designed to move frontier models from pattern-matching to defensible, industry-expert judgment.
Scenarios share most context but flip on a single variable. Trains models to read fine-grained signals instead of pattern-matching surface features.
Each prompt fuses device, network, behavioral, counterparty, and transaction evidence. Models must integrate, not pick.
Authored, reviewed, and adjudicated by senior fraud / AML / payments executives. Two-level expert review per scenario.
Every scenario validated by Pass@8 sampling. Adversarial validity confirmed against the expert-adjudicated label.
4-way classification (High / Medium / Low / No risk). Addresses both over- and under-indexing failures observed in frontier models.
Fraud detection, AML alert triage, investigations, and new-account onboarding — the actual surfaces where LLMs are being deployed.
Includes scenario schemas, expert rationales, per-model failure taxonomies, and the eval harness. Scoped to your post-training goals.
Get in touch →