Skip to main content

Benchmark suite for evaluating LLM-based corporate decision simulation

Project description

mimic-bench

The benchmark suite for LLM-based corporate decision simulation.

pip install mimic-bench

mimic-bench answers one question: "How accurately does a model predict what a real company actually did in response to a macro shock?"

It provides:

  1. The dataset — 200 historical events × 50 companies ≈ 2,800 labeled (event, company) pairs
  2. The scoring system — a 4-component fidelity metric that compares model predictions to documented corporate actions

How it fits

mimic         → builds company twins, simulates decisions
mimic-bench   → grades those simulations against reality   ← you are here
mimic-forecast → plugs in foundation models (TimesFM, Chronos)
mimic-world   → stress tests entire supply chains
mimic-sim     → Monte Carlo with LLM agents
mimic-signal  → real-time event detection

Without mimic-bench, mimic is a demo. With it, mimic is a research tool.


Quickstart

from mimic_bench import Benchmark
import anthropic

client = anthropic.Anthropic()

def my_predict(prompt: str) -> str:
    msg = client.messages.create(
        model="claude-opus-4-7",
        max_tokens=512,
        messages=[{"role": "user", "content": prompt}],
    )
    return msg.content[0].text

bench = Benchmark.load("v1", predict_fn=my_predict)
results = bench.run(verbose=True)

print(results)
# BenchmarkResult(n=200, errors=0, fidelity=0.67)

print(results.by_category())
# {'macro': 0.71, 'supply_chain': 0.68, 'energy': 0.65, 'geopolitical': 0.63, ...}

print(results.worst_events(n=3))
# [{'event_id': '2021_08_hurricane_ida', 'mean': 0.52, ...}, ...]

With a mimic.Twin:

from mimic import Twin
from mimic_bench import Benchmark

bench = Benchmark.load("v1")
twin = Twin.from_ticker("WMT")

results = bench.run(twin, subset="supply_chain")
print(results.fidelity_score)      # 0.71
print(results.worst_events(n=5))   # where the twin failed most

Fidelity Score

Component Weight Description
Action alignment 40% Cosine similarity of predicted vs actual action strings (sentence-transformers)
Financial accuracy 30% `1 - min(
Direction accuracy 20% Sign match on financial impact (+/-)
Timing accuracy 10% Whether the model identified the right response window (0-24h, 1-7d, 8-30d)

v0.1 target: 0.65+ average across labeled pairs.


Dataset

v0.1 Seed Set (this release)

  • 10 events (hand-curated, 2020–2023)
  • 20 companies (large-cap S&P 500, spread across 5 sectors)
  • 200 labeled pairs, all human-reviewed

v1.0 Target

  • 200 events across 8 categories (2010–2024)
  • 50 companies (10 per sector)
  • ~2,800 high-signal pairs (not every event affects every company)

Event categories

Category Count Examples
Supply chain shocks 30 Suez Canal, Port of LA, COVID factory shutdowns
Geopolitical shocks 30 Russia-Ukraine, US-China tariffs, TSMC export controls
Macro / monetary 25 Fed hikes, SVB collapse, 2022 inflation peak
Energy 20 Oil crash, European gas crisis, OPEC cuts
Natural disasters 20 Hurricane Ida, Texas freeze, Japan earthquake
Industry-specific 30 Chip shortage, EV battery crunch
Pandemic / health 20 COVID waves, China lockdowns and reopening
Regulatory / policy 25 EU DMA, SEC climate rules, IRA credits

Ground Truth Schema

// data/ground_truth/labels_v1.jsonl  (one JSON object per line)
{
  "event_id": "2021_03_suez_canal",
  "ticker": "FDX",
  "actual_action_0_24h": "Issued customer advisory; began contingency routing planning",
  "actual_action_1_7d": "Rerouted 14 ocean freight shipments via Cape of Good Hope",
  "actual_action_8_30d": "Added temporary fuel/route surcharge on affected trade lanes",
  "financial_impact_usdM": -45.0,
  "financial_impact_reported": true,
  "source_type": "earnings_call",
  "source_url": "https://seekingalpha.com/...",
  "extraction_method": "llm_claude-opus-4-7",
  "confidence": 0.82,
  "human_reviewed": false
}

Event Schema

// data/events/2021_03_suez_canal.json
{
  "id": "2021_03_suez_canal",
  "title": "Suez Canal blockage — Ever Given grounding",
  "date": "2021-03-23",
  "end_date": "2021-03-29",
  "category": "supply_chain",
  "severity": 0.60,           // 0.0 = minimal, 1.0 = extreme
  "description": "...",
  "affected_sectors": ["logistics", "retail", "energy"],
  "affected_geographies": ["global", "europe"],
  "keywords": ["suez", "ever given", "shipping"],
  "source": "Suez Canal Authority",
  "source_url": "https://..."
}

Ground Truth Extraction Pipeline

Stage 1  Signal detection (automated)
         ├── Fetch 8-K filings from SEC EDGAR within 30 days of event
         ├── Check earnings call transcripts for event keywords
         └── If signal found → proceed to Stage 2
             If no signal  → mark as "not materially affected"

Stage 2  LLM extraction (semi-automated)
         ├── Feed 8-K text + transcript excerpt to Claude
         ├── Extract: action_0_24h, action_1_7d, action_8_30d, financial_impact_usdM
         └── Store raw + structured output

Stage 3  Human review (spot-check 10%)
         ├── python scripts/validate_labels.py --sample 280
         └── Adjust confidence_in_label accordingly

Run it:

# Extract all events, all companies (requires ANTHROPIC_API_KEY)
python scripts/extract_ground_truth.py

# Single event
python scripts/extract_ground_truth.py --event 2021_03_suez_canal

# Dry run (print prompts, skip API calls)
python scripts/extract_ground_truth.py --dry-run

Leaderboard

from mimic_bench.leaderboard import submit, display

results = bench.run(my_twin)
submit(results, model_name="mimic-v0.2-rag", notes="Added SEC EDGAR RAG")
display()
Rank  Model                          Fidelity     Std      N   Submitted
------------------------------------------------------------------------
1     mimic-v0.2-rag                   0.7312  0.1201    200  2024-09-01
2     mimic-v0.1-baseline              0.6543  0.1389    200  2024-08-15
3     gpt-4o-zero-shot                 0.6201  0.1501    200  2024-08-20

Repo Structure

mimic-bench/
├── pyproject.toml
├── README.md
├── data/
│   ├── events/              ← 200 event JSON files (v0.1: 10 events)
│   ├── ground_truth/
│   │   └── labels_v1.jsonl  ← ~2,800 labeled pairs (v0.1: 200 pairs)
│   ├── companies.json        ← 50 company definitions
│   └── leaderboard.json      ← submitted scores
├── mimic_bench/
│   ├── benchmark.py          ← Benchmark class (main entry point)
│   ├── scoring.py            ← 4-component fidelity metric
│   ├── datasets.py           ← data loaders
│   ├── leaderboard.py        ← score submission + display
│   └── extraction/
│       ├── sec.py            ← SEC EDGAR 8-K fetcher
│       ├── transcripts.py    ← earnings call transcript parser
│       └── llm_extract.py    ← Claude extraction prompt
├── scripts/
│   ├── generate_seed_labels.py   ← generates v0.1 200-record JSONL
│   ├── extract_ground_truth.py   ← Stage 1 + 2 extraction pipeline
│   ├── validate_labels.py        ← Stage 3 human review CLI
│   └── build_event_list.py       ← scaffold new event JSON files
└── examples/
    ├── 01_run_benchmark.ipynb
    ├── 02_custom_events.ipynb
    └── 03_leaderboard.ipynb

Installation

# Core (no dependencies)
pip install mimic-bench

# With semantic similarity scoring (recommended)
pip install "mimic-bench[semantic]"

# With extraction pipeline
pip install "mimic-bench[extraction]"

# Everything
pip install "mimic-bench[all]"

Paper

MimicBench: A Benchmark for LLM-Based Corporate Decision Simulation

Target venue: NeurIPS 2025 Datasets & Benchmarks track, or ICLR 2026.


License

MIT © Mimic

Project details


Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

mimic_bench-0.1.0.tar.gz (20.4 kB view details)

Uploaded Source

Built Distribution

If you're not sure about the file name format, learn more about wheel file names.

mimic_bench-0.1.0-py3-none-any.whl (19.2 kB view details)

Uploaded Python 3

File details

Details for the file mimic_bench-0.1.0.tar.gz.

File metadata

  • Download URL: mimic_bench-0.1.0.tar.gz
  • Upload date:
  • Size: 20.4 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/6.2.0 CPython/3.9.6

File hashes

Hashes for mimic_bench-0.1.0.tar.gz
Algorithm Hash digest
SHA256 a508bd3d8a9f3263a6e040a89f74fdc835aed767fb8812b2d46bd7304012649d
MD5 1e598774879dc4449e5b3429c2c8af05
BLAKE2b-256 acd9ab57d386a75e371bdfb745393a932bad4430783ba486dd8fdfbcff8ea1ae

See more details on using hashes here.

File details

Details for the file mimic_bench-0.1.0-py3-none-any.whl.

File metadata

  • Download URL: mimic_bench-0.1.0-py3-none-any.whl
  • Upload date:
  • Size: 19.2 kB
  • Tags: Python 3
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/6.2.0 CPython/3.9.6

File hashes

Hashes for mimic_bench-0.1.0-py3-none-any.whl
Algorithm Hash digest
SHA256 27e3f36c3d93fe2b18341799f37b50320dc49292425d01ae789100a0a886c554
MD5 ba758d6865ea630732dc4b7c9289f8dd
BLAKE2b-256 505f9dec7987c0594533ce3417278faf21941d5654c4788ad882dc11f5956348

See more details on using hashes here.

Supported by

AWS Cloud computing and Security Sponsor Datadog Monitoring Depot Continuous Integration Fastly CDN Google Download Analytics Pingdom Monitoring Sentry Error logging StatusPage Status page