World-class document processing pipeline for the Epstein case files — OCR, NER, dedup, embeddings, knowledge graph, Neon Postgres export

Epstein Pipeline

Open-source document processing pipeline for the Jeffrey Epstein case files. Downloads, OCRs, extracts entities, deduplicates, embeds, and exports 2.1 million+ documents to Neon Postgres with pgvector semantic search.

This is the data engine behind epsteinexposed.com -- the most comprehensive searchable database of the Epstein files.

What It Does

DOJ EFTA Releases (DS1-DS12)  ─┐
Kaggle Datasets                ─┤
HuggingFace Collections        ─┼──► Download
Archive.org Mirrors            ─┘
        │
        ▼
┌──────────────────────────────────────────────────────────┐
│  OCR (multi-backend fallback chain)                      │
│  PyMuPDF → Surya → olmOCR 2 → Docling                   │
│  Per-page confidence scoring, automatic backend selection│
└──────────────────────┬───────────────────────────────────┘
                       │
    ┌──────────────────┼──────────────────┐
    ▼                  ▼                  ▼
┌────────────┐  ┌────────────┐  ┌──────────────────┐
│ NER        │  │ Dedup      │  │ Classifier       │
│ spaCy trf  │  │ Hash →     │  │ Zero-shot BART   │
│ + GLiNER   │  │ MinHash →  │  │ 12 doc categories│
│ + regex    │  │ Semantic   │  │                  │
└─────┬──────┘  └─────┬──────┘  └────────┬─────────┘
      │               │                  │
      ▼               ▼                  ▼
┌────────────┐  ┌────────────┐  ┌──────────────────┐
│ Summarizer │  │ Redaction  │  │ Image Extractor  │
│ LLM-based  │  │ Analysis   │  │ + AI description │
└─────┬──────┘  └─────┬──────┘  └────────┬─────────┘
      │               │                  │
      └───────────────┼──────────────────┘
                      ▼
┌──────────────────────────────────────────────────────────┐
│  Semantic Chunker → Embeddings (nomic-embed-text-v2-moe) │
│  Paragraph-aware splitting, 768-dim / 256-dim Matryoshka │
└──────────────────────┬───────────────────────────────────┘
                       │
    ┌──────────────────┼──────────────────┐
    ▼                  ▼                  ▼
┌────────────┐  ┌────────────┐  ┌──────────────────┐
│ Neon PG    │  │ JSON/CSV   │  │ Knowledge Graph  │
│ + pgvector │  │ SQLite     │  │ GEXF + JSON      │
│ cosine ANN │  │ NDJSON     │  │ LLM extraction   │
└────────────┘  └────────────┘  └──────────────────┘

Current Scale

Metric	Count
Documents ingested	2,145,000+
OCR texts extracted	2,014,000+
Persons identified	1,723
Document-person links	2,443,000+
SHA-256 integrity hashes	1,380,000+
DOJ datasets processed	12 of 12 (DS1-DS12)

Quickstart

# Install with all features
pip install "epstein-pipeline[all]"
python -m spacy download en_core_web_sm

# Download one DOJ dataset, or add --list-datasets to browse the catalog
epstein-pipeline download doj --dataset 9

# OCR with automatic backend selection
epstein-pipeline ocr ./raw-pdfs/ --output ./processed/

# Extract entities (spaCy + GLiNER)
epstein-pipeline extract-entities ./processed/ --output ./entities/

# Generate embeddings and push to Neon
epstein-pipeline embed ./processed/ --output ./embeddings/ --format neon

# Export everything to Neon Postgres
epstein-pipeline export-neon ./processed/

30-Second Smoke Test

epstein-pipeline --version
epstein-pipeline status --json
epstein-pipeline status --fail-on-unhealthy
epstein-pipeline validate ./processed/

Use epstein-pipeline status --fail-on-unhealthy in wrappers or CI when you want a fast stop signal before a long ingest or export run. Add --check-database when EPSTEIN_NEON_DATABASE_URL is configured and you want the smoke test to include a live Neon ping.
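A wrapper script could gate a long run on that check. The sketch below is a hypothetical helper, not part of the package: it assumes the command exits with a non-zero code when the pipeline is unhealthy, and it takes an injectable `run` callable so it can be exercised without the CLI installed.

```python
import subprocess
from typing import Callable

# Flags taken from the smoke test above; the wrapper itself is hypothetical.
STATUS_CMD = ["epstein-pipeline", "status", "--fail-on-unhealthy"]


def preflight(
    check_database: bool = False,
    run: Callable[..., subprocess.CompletedProcess] = subprocess.run,
) -> bool:
    """Return True when the status command exits with code 0.

    Assumption: --fail-on-unhealthy signals failure through a non-zero exit code.
    """
    cmd = list(STATUS_CMD)
    if check_database:
        # Only meaningful when EPSTEIN_NEON_DATABASE_URL is configured
        cmd.append("--check-database")
    result = run(cmd, check=False)
    return result.returncode == 0
```

A caller would typically abort before starting an ingest or export when `preflight()` returns False.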

Release operators should also read docs/production-readiness.md and the release handoff in docs/release-v1.0.4.md.

Neon Postgres Setup

# Set your Neon connection string
export EPSTEIN_NEON_DATABASE_URL="postgresql://user:pass@ep-xxx.us-east-2.aws.neon.tech/epstein"

# Run schema migration (idempotent, safe to re-run)
epstein-pipeline migrate

# Semantic search from the command line
epstein-pipeline search "financial transactions offshore accounts"

Processing Backends

Component	Backend	Speed	Accuracy	GPU Required
OCR	PyMuPDF	Instant	Text layers only	No
OCR	Surya	Fast	High (90+ langs)	Optional
OCR	olmOCR 2	Slow	Highest (VLM)	Yes (8GB+)
OCR	Docling (IBM)	Medium	High	No
NER	spaCy `en_core_web_trf`	Fast	High	Optional
NER	GLiNER	Medium	High (zero-shot)	Optional
Dedup	Content hash (SHA-256)	Instant	Exact only	No
Dedup	MinHash/LSH	O(n)	Near-duplicate	No
Dedup	Semantic embeddings	Slow	OCR-variant	Optional
Embeddings	nomic-embed-text-v2-moe	Fast	SOTA	Optional
Classifier	BART-large-mnli	Medium	Good	Optional

Installation

# Core only (no ML models)
pip install epstein-pipeline

# With OCR (CPU -- Surya)
pip install "epstein-pipeline[ocr-surya]"

# With OCR (GPU -- olmOCR 2, requires CUDA)
pip install "epstein-pipeline[ocr-gpu]"

# With NLP (spaCy + GLiNER)
pip install "epstein-pipeline[nlp,nlp-gliner]"

# With embeddings (sentence-transformers + torch)
pip install "epstein-pipeline[embeddings]"

# With Neon Postgres export (psycopg + pgvector)
pip install "epstein-pipeline[neon]"

# Everything (except GPU-only olmOCR)
pip install "epstein-pipeline[all]"

Docker

docker compose run pipeline --help
docker compose run pipeline ocr ./raw-pdfs/ --output ./output/
docker compose run pipeline migrate

CLI Reference

# -- Data Ingestion ------------------------------------------------
epstein-pipeline download doj --dataset 9       # Download DOJ EFTA dataset (1-12)
epstein-pipeline download kaggle                # Download Kaggle dataset
epstein-pipeline download huggingface           # Download HuggingFace datasets
epstein-pipeline download archive               # Download from Archive.org mirrors

# -- Processing ----------------------------------------------------
epstein-pipeline ocr ./pdfs/ -o ./out/          # OCR (auto backend selection)
epstein-pipeline ocr ./pdfs/ --backend surya    # OCR with specific backend
epstein-pipeline extract-entities ./out/ -o ./e/ # NER extraction (spaCy + GLiNER)
epstein-pipeline classify ./out/                # Zero-shot document classification
epstein-pipeline dedup ./out/ --mode all        # 3-pass deduplication
epstein-pipeline embed ./out/ -o ./emb/         # Generate embeddings

# -- Export --------------------------------------------------------
epstein-pipeline export json ./out/ -o ./site/  # JSON for website
epstein-pipeline export csv ./out/ -o docs.csv  # CSV for researchers
epstein-pipeline export sqlite ./out/ -o ep.db  # SQLite database
epstein-pipeline export-neon ./out/             # Push to Neon Postgres

# -- Database ------------------------------------------------------
epstein-pipeline migrate                        # Run Neon schema migration
epstein-pipeline search "query text here"       # Semantic search (pgvector)

# -- Quality -------------------------------------------------------
epstein-pipeline validate ./out/                # Data quality checks
epstein-pipeline stats ./out/                   # Show processing statistics

# -- Sanctions & PEP Cross-Reference --------------------------------
epstein-pipeline check-sanctions               # Cross-check all persons vs OpenSanctions
epstein-pipeline check-sanctions --threshold 0.3 --use-search  # Lower threshold, search API
epstein-pipeline import-sanctions ./output/sanctions/opensanctions-results.json

# -- Person Integrity Auditor -------------------------------------
epstein-pipeline audit-persons                  # Full 5-phase audit
epstein-pipeline audit-persons --phases dedup   # Single phase only
epstein-pipeline audit-persons --person bill-clinton --dry-run
epstein-pipeline audit-persons --min-severity 40 -o report.json

Processors

OCR (`processors/ocr.py`)

Multi-backend OCR with automatic fallback. Tries PyMuPDF (text-layer extraction) first, then falls back through Surya, olmOCR 2, and Docling based on per-page confidence scores. Handles scanned PDFs, image-only pages, and mixed documents.
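The fallback logic can be pictured roughly as follows. This is a minimal sketch, not the code in `processors/ocr.py`: the `Backend` signature, the `ocr_page` function, and the 0.8 threshold are all assumptions made for illustration.

```python
from typing import Callable, List, Optional, Tuple

# Hypothetical backend signature: page bytes in, (text, confidence in [0, 1]) out.
Backend = Callable[[bytes], Tuple[str, float]]


def ocr_page(
    page: bytes,
    backends: List[Tuple[str, Backend]],
    min_confidence: float = 0.8,  # assumed threshold, not taken from the package
) -> Tuple[str, str, float]:
    """Try backends in order and stop at the first confident result.

    Returns (backend_name, text, confidence). If no backend clears the
    threshold, the highest-confidence result seen is returned instead.
    """
    best: Optional[Tuple[str, str, float]] = None
    for name, backend in backends:
        text, confidence = backend(page)
        if best is None or confidence > best[2]:
            best = (name, text, confidence)
        if confidence >= min_confidence:
            break  # cheap backend was good enough; skip the slower ones
    if best is None:
        raise ValueError("no OCR backends configured")
    return best
```

Ordering the list from cheapest to most expensive means the slower backends only run on pages the faster ones could not read confidently.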

Entity Extraction (`processors/entities.py`)

Hybrid NER using spaCy transformer models + GLiNER zero-shot extraction + regex patterns. Extracts people, organizations, locations, dates, case numbers, flight IDs, financial amounts, and Bates numbers from legal documents.
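The regex layer handles identifiers with a fixed shape that statistical NER tends to miss. The patterns below are illustrative assumptions, not the ones shipped in `processors/entities.py`; the real Bates prefixes and case-number formats in the corpus may differ.

```python
import re
from typing import Dict, List

# Hypothetical patterns for fixed-format identifiers.
PATTERNS: Dict[str, "re.Pattern[str]"] = {
    # e.g. "1:15-cv-07433": office digit, two-digit year, case type, sequence
    "case_number": re.compile(r"\b\d:\d{2}-(?:cv|cr|mc)-\d{4,5}\b"),
    # e.g. "$1,250,000.00" or "$500"
    "amount": re.compile(r"\$\d{1,3}(?:,\d{3})*(?:\.\d{2})?"),
    # e.g. "ABC000123": uppercase prefix followed by a zero-padded counter
    "bates_number": re.compile(r"\b[A-Z]{2,10}\d{6,10}\b"),
}


def extract_patterns(text: str) -> Dict[str, List[str]]:
    """Return every match for each pattern, keyed by entity label."""
    return {label: pattern.findall(text) for label, pattern in PATTERNS.items()}
```

Matches from a pass like this would be merged with the spaCy and GLiNER spans before person linking.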

Deduplication (`processors/dedup.py`)

Three-pass deduplication pipeline:

1. Exact hash -- SHA-256 content hash for identical files
2. MinHash/LSH -- O(n) near-duplicate detection for OCR variants
3. Semantic similarity -- Embedding cosine similarity for reformatted duplicates
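The first two passes can be sketched with the standard library alone. This is an illustration under assumptions, not the code in `processors/dedup.py`: the shingle size, the number of permutations, and the salted-BLAKE2b trick are choices made here, and the LSH bucketing that makes the real pass scale is omitted.

```python
import hashlib
from typing import List, Set


def content_hash(data: bytes) -> str:
    """Pass 1: identical files share the same SHA-256 digest."""
    return hashlib.sha256(data).hexdigest()


def shingles(text: str, k: int = 5) -> Set[str]:
    """Overlapping k-word windows; short texts collapse to a single shingle."""
    tokens = text.lower().split()
    if len(tokens) < k:
        return {" ".join(tokens)}
    return {" ".join(tokens[i:i + k]) for i in range(len(tokens) - k + 1)}


def minhash_signature(items: Set[str], num_perm: int = 64) -> List[int]:
    """Pass 2: keep the minimum hash per salted hash function."""
    signature = []
    for seed in range(num_perm):
        # Salting one hash with the index stands in for num_perm hash functions
        salt = seed.to_bytes(8, "big")
        signature.append(min(
            int.from_bytes(
                hashlib.blake2b(s.encode("utf-8"), digest_size=8, salt=salt).digest(),
                "big",
            )
            for s in items
        ))
    return signature


def estimated_jaccard(a: List[int], b: List[int]) -> float:
    """Fraction of matching slots approximates the Jaccard similarity."""
    return sum(x == y for x, y in zip(a, b)) / len(a)
```

Two OCR variants of the same page share most shingles, so their signatures agree in most slots even though their SHA-256 digests differ.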

Document Classification (`processors/classifier.py`)

Zero-shot classification using BART-large-mnli into 12 legal document categories (court filings, depositions, correspondence, financial records, flight logs, etc.).

Semantic Chunking (`processors/chunker.py`)

Paragraph-aware text splitting with OCR noise cleaning. Respects sentence and paragraph boundaries, targets 450 tokens per chunk with 50-token overlap. Includes contextual prefixes (document title + source) per chunk.
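The window arithmetic behind those numbers looks roughly like this. The sketch is a plain fixed-size sliding window: it treats whitespace-separated words as tokens and skips the paragraph and sentence alignment that the real chunker performs, and the bracketed prefix format is an assumption.

```python
from typing import List


def chunk_tokens(tokens: List[str], target: int = 450, overlap: int = 50) -> List[List[str]]:
    """Split tokens into windows of `target` length that share `overlap` tokens."""
    if not 0 <= overlap < target:
        raise ValueError("overlap must be non-negative and smaller than target")
    step = target - overlap
    chunks: List[List[str]] = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + target])
        if start + target >= len(tokens):
            break  # this window already reached the end of the document
    return chunks


def chunk_document(text: str, title: str, source: str,
                   target: int = 450, overlap: int = 50) -> List[str]:
    """Attach a contextual prefix (hypothetical format) to every chunk."""
    prefix = f"[{title} | {source}] "
    return [prefix + " ".join(chunk)
            for chunk in chunk_tokens(text.split(), target, overlap)]
```

With the defaults, a 1,000-token document yields windows starting at tokens 0, 400, and 800, so each chunk repeats the last 50 tokens of the previous one.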

Embeddings (`processors/embeddings.py`)

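Generates vector embeddings using nomic-embed-text-v2-moe. Vectors are 768-dimensional and, because the model is trained with Matryoshka representation learning, can be truncated to 256 dimensions. Used for semantic deduplication and search indexing.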
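Matryoshka truncation amounts to keeping the leading dimensions and re-normalizing. The helpers below are a standard-library sketch with hypothetical names; they do not load the model and say nothing about how `processors/embeddings.py` stores vectors.

```python
import math
from typing import List, Sequence


def truncate_embedding(vector: Sequence[float], dim: int = 256) -> List[float]:
    """Keep the first `dim` components and rescale to unit length."""
    if dim > len(vector):
        raise ValueError("dim cannot exceed the original dimensionality")
    head = list(vector[:dim])
    norm = math.sqrt(sum(x * x for x in head))
    if norm == 0.0:
        return head  # nothing to normalize in an all-zero vector
    return [x / norm for x in head]


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two equal-length, non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)
```

Re-normalizing after truncation keeps cosine comparisons between 256-dim vectors well-behaved, at roughly a third of the storage of the full 768-dim form.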

Redaction Analysis (`processors/redaction.py`)

Detects redacted regions in PDFs and attempts text recovery where redactions are improperly applied (transparent overlays, recoverable text layers).

Image Extraction (`processors/image_extractor.py`)

Extracts embedded images from PDFs with optional AI-powered description via vision models.

Summarization (`processors/summarizer.py`)

LLM-based document summarization for generating concise descriptions of legal documents.

Person Linking (`processors/person_linker.py`)

Links extracted entity mentions to known persons in the database using fuzzy name matching. Matching respects word boundaries and is restricted to multi-word names to prevent false positives.
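The two safety rules can be shown with plain regular expressions. This sketch uses exact, case-insensitive matching rather than the fuzzy matching the linker performs, and the function names and placeholder registry are hypothetical.

```python
import re
from typing import Dict, List


def build_matchers(persons: Dict[str, List[str]]) -> Dict[str, "re.Pattern[str]"]:
    """Compile one pattern per person id from its multi-word name variants."""
    matchers: Dict[str, "re.Pattern[str]"] = {}
    for person_id, names in persons.items():
        # Single-word names are dropped: they match far too much unrelated text
        multi_word = [name for name in names if len(name.split()) >= 2]
        if not multi_word:
            continue
        # Longest first so the alternation prefers the most specific variant
        multi_word.sort(key=len, reverse=True)
        alternation = "|".join(re.escape(name) for name in multi_word)
        # \b on both sides stops "Jane Doe" from matching inside "Jane Doesmith"
        matchers[person_id] = re.compile(rf"\b(?:{alternation})\b", re.IGNORECASE)
    return matchers


def link_persons(text: str, matchers: Dict[str, "re.Pattern[str]"]) -> List[str]:
    """Return the ids of all persons whose name appears in the text."""
    return [pid for pid, pattern in matchers.items() if pattern.search(text)]
```

A person known only by a single-word name gets no matcher at all, which trades some recall for far fewer spurious links.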

Knowledge Graph (`processors/knowledge_graph.py`)

Builds entity relationship graphs from co-occurrence analysis and optional LLM-based relationship extraction. Exports to GEXF and JSON formats.
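Co-occurrence analysis boils down to counting how often two entities appear in the same document. The sketch below shows only that counting step and a node-link JSON shape chosen for illustration; the function names are hypothetical and the GEXF export and LLM extraction are not covered.

```python
from collections import Counter
from itertools import combinations
from typing import Dict, Iterable, Set, Tuple

Edge = Tuple[str, str]


def cooccurrence_edges(doc_entities: Iterable[Set[str]], min_weight: int = 1) -> Dict[Edge, int]:
    """Count, for every entity pair, the documents in which both appear."""
    counts: Counter = Counter()
    for entities in doc_entities:
        # Sorting gives each undirected pair one canonical (a, b) ordering
        for a, b in combinations(sorted(entities), 2):
            counts[(a, b)] += 1
    return {edge: weight for edge, weight in counts.items() if weight >= min_weight}


def to_json_graph(edges: Dict[Edge, int]) -> dict:
    """Convert weighted edges to a node-link dictionary ready for json.dumps."""
    nodes = sorted({name for edge in edges for name in edge})
    return {
        "nodes": [{"id": name} for name in nodes],
        "links": [
            {"source": a, "target": b, "weight": weight}
            for (a, b), weight in sorted(edges.items())
        ],
    }
```

Raising `min_weight` prunes pairs that co-occur only once, which is usually where most of the noise in a co-occurrence graph sits.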

Plist Forensics (`processors/plist_forensics.py`)

Parses Apple plist files found in the Epstein device data for contact and metadata extraction.
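Python's standard `plistlib` reads both XML and binary plists, so a parser of this kind can start from something like the sketch below. The `flatten` helper and its dotted-key layout are assumptions for illustration, not the output format of `processors/plist_forensics.py`.

```python
import plistlib
from typing import Any, Dict


def load_plist(data: bytes) -> Any:
    """Parse plist bytes; plistlib detects XML vs binary automatically."""
    return plistlib.loads(data)


def flatten(value: Any, path: str = "") -> Dict[str, Any]:
    """Flatten nested dicts and lists into dotted keys for easy searching."""
    if isinstance(value, dict):
        items = value.items()
    elif isinstance(value, list):
        items = enumerate(value)
    else:
        return {path: value}
    flat: Dict[str, Any] = {}
    for key, child in items:
        child_path = f"{path}.{key}" if path else str(key)
        flat.update(flatten(child, child_path))
    return flat
```

Flattened keys such as `contacts.0.name` make it straightforward to scan many plists for the same field.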

OpenSanctions Cross-Reference

Cross-references all 1,538 persons against 100+ global sanctions, PEP, and watchlist datasets via the OpenSanctions API.

Datasets checked include OFAC SDN, EU Financial Sanctions, UN Security Council, UK HMT, Interpol Red Notices, PEP registries (Every Politician), and ICIJ Offshore Leaks (mirrored), among others.

# Cross-check all persons (takes ~13 min at 0.5s/request rate limit)
export EPSTEIN_OPENSANCTIONS_API_KEY="your-api-key"
epstein-pipeline check-sanctions

# Import results into Neon Postgres
epstein-pipeline import-sanctions ./output/sanctions/opensanctions-results.json

What it does:

1. Loads all persons from data/persons-registry.json
2. Queries OpenSanctions /match endpoint for each person (fuzzy name matching)
3. Flags persons as is_sanctioned (on any sanctions list) or is_pep (politically exposed)
4. Saves detailed results to output/sanctions/opensanctions-results.json
5. import-sanctions writes flags to the persons table and creates a sanctions_matches table in Neon

Output: each person gets a best match score, sanctions/PEP flags, matched datasets, and individual match details. Results are displayed in a Rich summary table with top matches ranked by score.

Requires: EPSTEIN_OPENSANCTIONS_API_KEY (free for non-commercial use at opensanctions.org)

Person Integrity Auditor

Automated 5-phase data quality pipeline that checks every person record against the Neon database, Wikidata, and Wikipedia, using Claude for fact-checking, to detect issues before they reach users.

Phase	What It Does	Cost
Dedup	rapidfuzz name similarity + alias cross-check for duplicate entries	Free
Wikidata	Cross-reference occupation, dates, nationality against Wikidata + Wikipedia	Free
Fact-Check	Decompose bios into atomic claims and verify them against 2M+ documents via full-text search (FTS)	~$1-2
Coherence	Sample linked documents, detect merged identities (one record = two people)	~$0.50
Score	Calculate composite severity (0-100), create ai_leads for admin review	Free

Severity Tiers: Critical (70-100), High (40-69), Medium (20-39), Low (0-19)
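The tier boundaries translate directly into a lookup. The function below is a hypothetical helper written from the ranges listed above; how the auditor computes the composite score itself is not shown here.

```python
def severity_tier(score: float) -> str:
    """Map a composite severity score (0-100) to its tier name."""
    if not 0 <= score <= 100:
        raise ValueError("score must be between 0 and 100")
    if score >= 70:
        return "Critical"
    if score >= 40:
        return "High"
    if score >= 20:
        return "Medium"
    return "Low"
```

Under these boundaries, a report filtered with `--min-severity 40` would presumably contain only High and Critical findings.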

Issues detected: duplicate entries, merged identities, wrong categories, bio contradictions, ungrounded claims, stale data, external contradictions with Wikidata/Wikipedia.

Requires: EPSTEIN_AUDITOR_ANTHROPIC_API_KEY + EPSTEIN_NEON_DATABASE_URL

Optional: EPSTEIN_AUDITOR_VOYAGE_API_KEY (semantic search), EPSTEIN_AUDITOR_COHERE_API_KEY (reranking)

Environment Variables

All configuration is via environment variables prefixed with EPSTEIN_. No credentials are ever stored in code or config files.

Variable	Required	Purpose
`EPSTEIN_NEON_DATABASE_URL`	For DB export/search	Neon Postgres connection string
`EPSTEIN_OPENSANCTIONS_API_KEY`	For sanctions check	OpenSanctions API key (free for non-commercial)
`EPSTEIN_AUDITOR_ANTHROPIC_API_KEY`	For person audit	Claude API key (fact-checking)
`EPSTEIN_AUDITOR_VOYAGE_API_KEY`	Optional	Voyage AI (semantic search in auditor)
`EPSTEIN_AUDITOR_COHERE_API_KEY`	Optional	Cohere (reranking in auditor)
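A script that calls the pipeline may want to fail early when a required variable is missing. The sketch below reads the variables from the table above; the `Settings` class and the `load_settings` and `require` helpers are hypothetical and do not reflect the package's own configuration code.

```python
import os
from dataclasses import dataclass
from typing import Mapping, Optional

PREFIX = "EPSTEIN_"


@dataclass(frozen=True)
class Settings:
    neon_database_url: Optional[str]
    opensanctions_api_key: Optional[str]
    auditor_anthropic_api_key: Optional[str]
    auditor_voyage_api_key: Optional[str]
    auditor_cohere_api_key: Optional[str]


def load_settings(env: Mapping[str, str] = os.environ) -> Settings:
    """Read every EPSTEIN_-prefixed variable; unset or empty values become None."""
    def get(name: str) -> Optional[str]:
        return env.get(PREFIX + name) or None

    return Settings(
        neon_database_url=get("NEON_DATABASE_URL"),
        opensanctions_api_key=get("OPENSANCTIONS_API_KEY"),
        auditor_anthropic_api_key=get("AUDITOR_ANTHROPIC_API_KEY"),
        auditor_voyage_api_key=get("AUDITOR_VOYAGE_API_KEY"),
        auditor_cohere_api_key=get("AUDITOR_COHERE_API_KEY"),
    )


def require(settings: Settings, *fields: str) -> None:
    """Raise with the exact variable names that still need to be exported."""
    missing = [PREFIX + f.upper() for f in fields if getattr(settings, f) is None]
    if missing:
        raise RuntimeError("missing environment variables: " + ", ".join(missing))
```

For example, a person audit would call `require(settings, "auditor_anthropic_api_key", "neon_database_url")` before doing any work.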

Export Formats

Format	Use Case	Command
Neon Postgres	Production website, semantic search	`export-neon`
JSON	Static site generation, API consumption	`export json`
CSV	Research, spreadsheet analysis	`export csv`
SQLite	Local querying, offline research	`export sqlite`

Data Sources

All source data comes from publicly released government records and court documents:

Source	URL	Content
DOJ EFTA Library	https://www.justice.gov/epstein	12 datasets, 2M+ files
FBI Vault	https://vault.fbi.gov/jeffrey-epstein	FBI records
CourtListener	https://www.courtlistener.com/docket/4355835/giuffre-v-maxwell/	Court filings
House Oversight	https://oversight.house.gov	Congressional releases
DocumentCloud	https://www.documentcloud.org	Searchable court docs
Archive.org	https://archive.org/details/epstein-flight-logs-unredacted_202304	Flight logs, mirrors
Kaggle	Various	Community-compiled datasets

See docs/DATA_SOURCES.md for the complete list.

Documentation

Document	Description
CONTRIBUTING.md	How to contribute (setup, workflow, standards)
CODE_OF_CONDUCT.md	Community standards and expectations
SECURITY.md	Security policy and data handling guidelines
docs/ARCHITECTURE.md	System architecture and design decisions
docs/DATA_SOURCES.md	All known public data sources
docs/PROCESSORS.md	Processor reference (OCR, NER, dedup, etc.)
docs/production-readiness.md	Release checks, CI contract, and ship checklist
docs/SITE_SYNC.md	Syncing processed data to epsteinexposed.com
docs/SEA_DOUGHNUT.md	Sea_Doughnut research data integration

Contributing

We welcome contributions! See CONTRIBUTING.md for the full guide.

No coding required: Report data quality issues, suggest new data sources, review processed data.

Code contributions: Add downloaders, improve extraction accuracy, add export formats, fix bugs.

Related Projects

epsteinexposed.com -- The live website powered by this pipeline
rodrigopolo/epstein-doj-library-sha256 -- SHA-256 integrity hashes for DOJ files
Epstein-Files -- DOJ file mirrors
Epstein-doc-explorer -- Email graph explorer
Epstein-research-data -- Community research dataset

License

MIT License. See LICENSE.

This version

1.0.4

Mar 11, 2026

Download files

Source Distribution

epstein_pipeline-1.0.4.tar.gz (211.8 kB)

Uploaded Mar 11, 2026 Source

Built Distribution

epstein_pipeline-1.0.4-py3-none-any.whl (209.2 kB)

Uploaded Mar 11, 2026 Python 3

File details

Details for the file epstein_pipeline-1.0.4.tar.gz.

File metadata

Download URL: epstein_pipeline-1.0.4.tar.gz
Upload date: Mar 11, 2026
Size: 211.8 kB
Tags: Source
Uploaded using Trusted Publishing? No
Uploaded via: twine/6.1.0 CPython/3.13.7

File hashes

Hashes for epstein_pipeline-1.0.4.tar.gz
Algorithm	Hash digest
SHA256	`52f578cebcb823223c6d914e0d6a3554ade9787a5d8dc09245694c85dfd6b04d`
MD5	`c0977eaeb809c394dff8d96cd73f7ba9`
BLAKE2b-256	`26a4c263b7d2a60c6ea0f9e46702742f33e9ea5f8d7aa38e4215eafed3089f37`

File details

Details for the file epstein_pipeline-1.0.4-py3-none-any.whl.

File metadata

Download URL: epstein_pipeline-1.0.4-py3-none-any.whl
Upload date: Mar 11, 2026
Size: 209.2 kB
Tags: Python 3
Uploaded using Trusted Publishing? No
Uploaded via: twine/6.1.0 CPython/3.13.7

File hashes

Hashes for epstein_pipeline-1.0.4-py3-none-any.whl
Algorithm	Hash digest
SHA256	`136acf63def16de30bec8feaa0796a1248a6e77752ff62221f0837cf1d098f87`
MD5	`95f5706930fcc424f7ae85a6e1895027`
BLAKE2b-256	`b39f3b14020b6b76178f20274e44678048db47d004c5630a4bc43c687cb4edf4`
