extractor
Production-grade document extraction CLI + Python library.
Converts any file format into a versioned, schema-stable, RAG-ready JSONL stream.
Why extractor?
Most RAG pipelines treat documents as bags of text. extractor preserves structure:
- Every chunk carries a full breadcrumb (section_path) so your retriever knows where it came from
- Tables are atomic — never split across chunks; emitted in both Markdown and structured JSON
- Sections drive chunking — boundaries follow headings, not page numbers or character counts
- Content-addressed IDs — stable across re-runs, safe to use as vector store keys
- Streaming JSONL — process terabytes without loading files into memory
- Versioned protocol — schema_version on every record; breaking changes bump the major version
- Named entity extraction — GLiNER2 NER inline on every element and chunk; entity_types field for fast vector store payload filtering; local inference via fastino/gliner2-base-v1
Installation
pip install coresdk-extractor # core (PDF, DOCX, XLSX, PPTX, HTML, Markdown, EPUB, JSON, XML, CSV, LaTeX)
pip install "coresdk-extractor[audio]" # + audio transcription (faster-whisper + pyannote)
pip install "coresdk-extractor[ocr]" # + scanned PDF OCR (surya-ocr)
pip install "coresdk-extractor[ner]" # + GLiNER2 NER, classification, relations, structured extraction
pip install "coresdk-extractor[lang]" # + language detection (langdetect)
pip install "coresdk-extractor[otel]" # + OpenTelemetry tracing
# Source connectors
pip install "coresdk-extractor[s3]" # S3 / MinIO
pip install "coresdk-extractor[azure]" # Azure Blob Storage / ADLS Gen2
pip install "coresdk-extractor[gcs]" # Google Cloud Storage
pip install "coresdk-extractor[sources]" # all three cloud connectors
# HTTP/HTTPS and IMAP/email connectors are included in the core install
# Database sinks
pip install "coresdk-extractor[clickhouse]" # ClickHouse
pip install "coresdk-extractor[mongodb]" # MongoDB
pip install "coresdk-extractor[postgres]" # PostgreSQL
pip install "coresdk-extractor[elasticsearch]" # Elasticsearch
pip install "coresdk-extractor[qdrant]" # Qdrant
pip install "coresdk-extractor[weaviate]" # Weaviate
pip install "coresdk-extractor[kafka]" # Kafka (confluent-kafka)
# Webhook sink requires no extra install
pip install "coresdk-extractor[full]" # everything above
Scientific PDFs (GROBID): run a GROBID server and set GROBID_URL=http://localhost:8070.
Without it, scientific PDFs fall back to pymupdf4llm automatically.
Verify installation
extractor info # lists all supported formats
extractor run README.md # quick smoke test on any local file
Heavy optional dependencies:
extractor[audio] pulls PyTorch (~2 GB). extractor[ocr] requires surya-ocr with PyTorch. extractor[ner] pulls PyTorch (~2 GB) for local GLiNER2 inference. Install these only when needed.
Quick start
# Extract a PDF — stream elements to stdout
extractor run paper.pdf
# Extract in RAG-ready chunks mode, write to file
extractor run paper.pdf --mode chunks --out paper.chunks.jsonl
# Extract a whole directory, write each file alongside it
extractor run ./docs/ --mode chunks --out ./out/
# View what was extracted
extractor view paper.chunks.jsonl
extractor view paper.chunks.jsonl --count # element type breakdown
extractor view paper.chunks.jsonl --types table:simple,code:block
# Inspect a file before extracting
extractor info paper.pdf
# List all parsers and supported formats
extractor info
# Validate output against the schema
extractor validate paper.chunks.jsonl --level invariants
# List all element types
extractor schema
Source connectors
extractor can pull documents directly from cloud storage, HTTP, and email — no manual download step needed.
# S3 bucket or prefix
extractor run s3://my-bucket/docs/ --out ./output/
# MinIO (S3-compatible)
EXTRACTOR_S3_ENDPOINT_URL=http://minio:9000 extractor run s3://my-bucket/docs/ --out ./output/
# Azure Blob Storage
extractor run az://my-container/reports/ --out ./output/
# Azure Data Lake Storage Gen2
extractor run abfs://my-container/data/ --out ./output/
# Google Cloud Storage
extractor run gcs://my-bucket/papers/ --out ./output/
# Single file via HTTPS (no extra install needed)
extractor run https://example.com/report.pdf
# Email attachments via IMAP (no extra install needed)
extractor run imap://inbox --out ./output/
# Filter to PDF files only, download up to 8 files in parallel
extractor run s3://my-bucket/docs/ --source-filter "*.pdf" --source-concurrency 8 --out ./output/
Every downloaded file passes through the same quarantine gate as local files before extraction.
Auth env vars by connector
| Connector | Required env vars | Optional env vars |
|---|---|---|
| S3 | AWS_ACCESS_KEY_ID + AWS_SECRET_ACCESS_KEY (or AWS_PROFILE, or IAM role) | AWS_SESSION_TOKEN, EXTRACTOR_S3_ENDPOINT_URL, EXTRACTOR_S3_REGION |
| MinIO | AWS_ACCESS_KEY_ID + AWS_SECRET_ACCESS_KEY + EXTRACTOR_S3_ENDPOINT_URL | EXTRACTOR_S3_REGION |
| Azure Blob / ADLS | AZURE_STORAGE_CONNECTION_STRING or AZURE_STORAGE_ACCOUNT + AZURE_STORAGE_KEY | AZURE_STORAGE_ACCOUNT alone uses DefaultAzureCredential (managed identity / service principal) |
| GCS | GOOGLE_APPLICATION_CREDENTIALS (path to service account JSON) | On GKE/Cloud Run: Workload Identity — no env var needed |
| HTTP/HTTPS | none | EXTRACTOR_HTTP_HEADERS_JSON (JSON dict), EXTRACTOR_HTTP_VERIFY_SSL, EXTRACTOR_HTTP_MAX_BYTES |
| IMAP/email | EXTRACTOR_IMAP_HOST, EXTRACTOR_IMAP_USERNAME, EXTRACTOR_IMAP_PASSWORD | EXTRACTOR_IMAP_PORT (default: 993), EXTRACTOR_IMAP_FOLDER (default: INBOX), EXTRACTOR_IMAP_SEARCH (default: UNSEEN) |
See docs/sources.md for full connector documentation.
Database sinks
Stream extracted records directly into a database with --sink. The sink writes in batches alongside normal JSONL output.
# Write to Qdrant (chunk payloads only — add embeddings separately)
extractor run ./docs/ --mode chunks --sink qdrant --sink-uri http://localhost:6333
# Write to MongoDB
extractor run ./docs/ --mode chunks --sink mongodb --sink-uri mongodb://localhost:27017 --sink-database mydb --sink-table chunks
# Write to PostgreSQL (DSN from env: EXTRACTOR_PG_DSN)
extractor run ./docs/ --mode chunks --sink postgres
# Write to Elasticsearch
extractor run ./docs/ --mode chunks --sink elasticsearch --sink-uri http://localhost:9200 --sink-table my_index
# Write to ClickHouse
extractor run ./docs/ --mode chunks --sink clickhouse --sink-uri localhost:8123
# Write to Kafka topic
extractor run ./docs/ --mode chunks --sink kafka --sink-uri broker:9092 --sink-table my_topic
# POST batches to a webhook
extractor run ./docs/ --mode chunks --sink webhook --sink-uri https://my-api.example.com/ingest
# Adjust batch size (default: 1000)
extractor run ./docs/ --mode chunks --sink postgres --sink-batch 500
| Sink | Install | Connection |
|---|---|---|
| clickhouse | coresdk-extractor[clickhouse] | --sink-uri host:port or defaults to localhost:8123 |
| mongodb | coresdk-extractor[mongodb] | --sink-uri mongodb://... or defaults to mongodb://localhost:27017 |
| postgres / postgresql | coresdk-extractor[postgres] | --sink-uri postgresql://user:pass@host/db or EXTRACTOR_PG_DSN |
| elasticsearch / es | coresdk-extractor[elasticsearch] | --sink-uri http://... or EXTRACTOR_ES_URL |
| qdrant | coresdk-extractor[qdrant] | --sink-uri http://... or EXTRACTOR_QDRANT_URL |
| weaviate | coresdk-extractor[weaviate] | --sink-uri http://... or EXTRACTOR_WEAVIATE_URL |
| kafka | coresdk-extractor[kafka] | --sink-uri broker:9092 or EXTRACTOR_KAFKA_BROKERS |
| webhook / http_post | none (core) | --sink-uri https://... or EXTRACTOR_WEBHOOK_URL |
See docs/sinks.md for schema mapping details, auth env vars, and custom sink plugins.
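The qdrant sink stores chunk payloads only, so vectors are added downstream. A minimal sketch of one way to do that with the Python API, using qdrant-client and sentence-transformers as illustrative dependencies (neither is bundled with extractor):

from qdrant_client import QdrantClient
from qdrant_client.models import Distance, PointStruct, VectorParams
from sentence_transformers import SentenceTransformer
from extractor import extract

model = SentenceTransformer("all-MiniLM-L6-v2")   # illustrative encoder choice
client = QdrantClient(url="http://localhost:6333")
client.recreate_collection(
    collection_name="docs",
    vectors_config=VectorParams(size=model.get_sentence_embedding_dimension(), distance=Distance.COSINE),
)

points = []
for i, chunk in enumerate(extract("paper.pdf", mode="chunks", chunk_size=512)):
    # Qdrant point IDs must be integers or UUIDs, so the content-addressed
    # chunk.id is carried in the payload rather than used as the point ID.
    points.append(PointStruct(
        id=i,
        vector=model.encode(chunk.text).tolist(),
        payload={"chunk_id": chunk.id, "text": chunk.text},
    ))
client.upsert(collection_name="docs", points=points)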
Config file
Place .extractor.toml in your project directory (or pass --config path/to/extractor.toml) to set defaults without repeating CLI flags.
[run]
mode = "chunks"
chunk_size = 512
tokenizer = "cl100k_base"
[ner]
enabled = true
model = "fastino/gliner2-base-v1"
[sink]
type = "qdrant"
uri = "http://localhost:6333"
[source]
concurrency = 8
[quality_gates]
min_chunks = 1
max_extraction_error_rate = 0.05
New in v1.2.0
- GLiNER2 extended capabilities — classification (--classify-as), relation triples (--relations), structured field extraction (--extract-schema)
- Source connectors — pull documents from S3/MinIO, Azure Blob/ADLS, GCS, HTTP/HTTPS, and IMAP email directly via URI
- Database sinks — stream records into ClickHouse, MongoDB, PostgreSQL, Elasticsearch, Qdrant, Weaviate, Kafka, or any HTTP webhook
- Table serialization modes — --table-text-mode markdown|nl-rows|nl-columns|hybrid controls how tables are serialized into chunk text
- Chunk quality scoring — --quality emits ChunkQuality with lexical density, entity density, compression ratio, and heading coverage
- Language detection — metadata.language populated per element when extractor[lang] is installed
- Quality gates — configurable pass/fail thresholds in .extractor.toml under [quality_gates]; gate failures are recorded in the manifest
- Figure extraction — --figures-dir exports figure assets (PNG/JPEG) alongside JSONL output
- OpenTelemetry — extractor[otel] emits spans per document; configure via standard OTEL env vars (OTEL_EXPORTER_OTLP_ENDPOINT, etc.)
- Dual-chunk mode — --mode dual-chunks produces both coarse parent chunks and fine child chunks linked by parent_chunk_id (see the sketch after this list)
- Parallel workers — --workers N for local directory extraction; cloud sources use --source-concurrency N
- Incremental processing — --incremental skips files unchanged since last run (SHA256 + run-config keyed JSON cache)
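A minimal sketch of regrouping dual-chunk output downstream, assuming each child record carries a top-level parent_chunk_id as described above (check the protocol spec for the exact record layout):

import json
from collections import defaultdict

parents = {}
children_by_parent = defaultdict(list)
with open("paper.chunks.jsonl", encoding="utf-8") as f:   # produced with --mode dual-chunks
    for line in f:
        record = json.loads(line)
        if record.get("type") in ("envelope", "stream_end"):
            continue  # skip the stream envelope and terminator
        if record.get("parent_chunk_id"):
            children_by_parent[record["parent_chunk_id"]].append(record)
        else:
            parents[record["id"]] = record

for parent_id, kids in children_by_parent.items():
    print(parent_id, len(kids), "child chunks")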
Entity extraction (NER)
# Extract with named entities (requires extractor[ner])
extractor run paper.pdf --mode chunks --entities
# Disable NER
extractor run paper.pdf --mode chunks --no-entities
# Custom entity types
extractor run paper.pdf --entities-types "person,organization,product" --entities-threshold 0.6
from extractor import extract
for chunk in extract("paper.pdf", mode="chunks", entities=True):
    print(chunk.entities)                     # list of EntityAnnotation
    print(chunk.chunk_metadata.entity_types)  # ["organization", "person"]
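Because entity_types is precomputed per chunk, downstream filtering is cheap before anything reaches a vector store. A small sketch, assuming only the attributes shown above:

from collections import Counter
from extractor import extract

label_counts = Counter()
org_chunks = []
for chunk in extract("paper.pdf", mode="chunks", entities=True):
    label_counts.update(chunk.chunk_metadata.entity_types)
    if "organization" in chunk.chunk_metadata.entity_types:
        org_chunks.append(chunk)

print(label_counts.most_common(5))                 # most frequent entity labels
print(len(org_chunks), "chunks mention an organization")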
Python API
from extractor import extract
# Elements mode — fine-grained semantic units
for el in extract("paper.pdf", mode="elements"):
    print(el.element_type, el.section_path, el.text[:120])

# Chunks mode — pre-committed, RAG-ready chunks
for chunk in extract("paper.pdf", mode="chunks", chunk_size=512):
    print(chunk.id, chunk.token_count, chunk.text[:120])

# Filter to only tables and headings
for el in extract("report.docx", include_types=["table:simple", "structural:section_header"]):
    if el.table:
        print(el.table.structured)  # {"headers": [...], "rows": [[...]]}
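Because table.structured exposes headers and rows directly, extracted tables drop straight into analysis tools; a small sketch using pandas as an illustrative (not required) dependency:

import pandas as pd
from extractor import extract

for el in extract("report.docx", include_types=["table:simple"]):
    if el.table and el.table.structured:
        df = pd.DataFrame(el.table.structured["rows"], columns=el.table.structured["headers"])
        print(el.section_path, df.shape)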
Error handling
from extractor import extract, QuarantineError, UnsupportedFormatError, ParserError
try:
    for el in extract("untrusted_file.pdf"):
        print(el.element_type, el.text[:80])
except QuarantineError as e:
    print(f"File rejected by security check: {e}")
except UnsupportedFormatError as e:
    print(f"Format not supported: {e}. Run `extractor info` for the full list.")
except ParserError as e:
    print(f"Parser failed: {e}")
Output schema
Every record is a JSON object. Key fields:
| Field | Type | Description |
|---|---|---|
| id | string | Content-addressed ID (el_ + 16 hex chars) |
| element_type | string | One of 47 canonical types (see below) |
| text | string | Plain-text content |
| section_path | string[] | Heading breadcrumb, e.g. ["Introduction", "Methods"] |
| section_path_tier | int | Quality: 1=native, 2=font-heuristic, 3=keyword, 4=positional |
| sequence_index | int | Document order, 0-based |
| page | int \| null | Source page (1-based) |
| schema_version | string | "1.2.0" |
| source_filename | string | Source file name |
| source_sha256 | string | SHA-256 of source file |
| entities | EntityAnnotation[] \| absent | Named entity annotations (absent = NER not run; [] = NER ran, nothing found) |
| table | object \| null | {markdown, structured, has_header_row, row_count, col_count} |
| equation | object \| null | {latex, plain_text, mathml} |
| figure | object \| null | {caption, image_ref, image_sha256, ocr_text} |
| transcript | object \| null | {speaker, start_time_s, end_time_s, word_timestamps} |
| admonition | object \| null | {kind, title} |
Element types
structural: title section_header subtitle divider page_header page_footer
text: narrative abstract admonition pull_quote footnote caption sidebar transcript_segment
table: simple complex continuation
code: block cell inline
list: item item_ordered item_definition
media: figure image audio video
scientific: equation_display equation_inline citation reference_entry theorem definition proof
meta: document_title author date url email page_number extraction_error
form: field label checkbox
composite: chunk
Atomic types (never split across chunks): table:simple, table:complex, table:continuation, media:figure, media:image, code:block, scientific:equation_display
JSONL envelope format
{"type":"envelope","extractor_version":"1.2.0","source":{...},"run_config":{...},"created_at":"..."}
{"id":"el_a1b2c3d4e5f6a7b8","element_type":"structural:title","text":"Introduction",...}
{"id":"el_...","element_type":"text:narrative","text":"...",...}
...
{"type":"stream_end","status":"complete","total_elements":42,"schema_version":"1.0.0"}
A manifest.json companion file is written alongside every --out file with full stats.
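A minimal sketch of a downstream consumer for this stream: it skips the envelope and terminator records, collects the rest, and checks that the run finished cleanly (field names taken from the example above):

import json

records = []
status = None
with open("paper.chunks.jsonl", encoding="utf-8") as f:
    for line in f:
        obj = json.loads(line)
        if obj.get("type") == "envelope":
            print("extractor_version:", obj.get("extractor_version"))
        elif obj.get("type") == "stream_end":
            status = obj.get("status")
        else:
            records.append(obj)          # element or chunk records

if status != "complete":
    raise RuntimeError(f"stream ended with status {status!r}")
print(len(records), "records read")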
Supported formats
| Format | Library | Notes |
|---|---|---|
| PDF (digital) | pymupdf4llm | Fast, heading-aware |
| PDF (scientific) | GROBID TEI → pymupdf4llm fallback | Equations, citations, references |
| PDF (scanned) | surya-ocr → pymupdf fallback | Layout detection + OCR |
| DOCX | python-docx | Headings, tables, runs, images |
| XLSX | openpyxl | Sheet-per-section, dual-format tables |
| PPTX | python-pptx | Slide titles + body, speaker notes |
| HTML | trafilatura + lxml | Boilerplate removal, GFM alerts |
| Markdown | mistletoe (GFM) | Headings, tables, alerts, code fences |
| EPUB | ebooklib + BS4 | Spine-order chapter extraction |
| LaTeX | pure-regex parser | Sections, equations, tables, figures, bibliography |
| JSON | stdlib | Key-value pairs as narrative |
| XML | lxml | Title/paragraph heuristics |
| CSV | stdlib csv | Entire file as dual-format table |
| Plain text | heuristic | Heading pattern detection |
| Audio (mp3/wav/m4a/flac) | faster-whisper + pyannote | Diarization, word timestamps |
CLI reference
extractor run <target> [options]
Output
--mode elements | chunks | dual-chunks (default: elements)
--out Output file or directory (default: stdout)
--quiet / -q Suppress progress output
--debug Show full tracebacks on errors
--include-full-path Store absolute source path instead of filename
Chunking
--chunk-size Max tokens per chunk (default: 512)
--overlap Overlap tokens between chunks (default: 0)
--tokenizer tiktoken encoding (default: cl100k_base)
--context-prefix Prepend section breadcrumb to each chunk text
--parent-size Token budget for coarse chunks in dual-chunks mode (default: 512)
--child-size Token budget for fine chunks in dual-chunks mode (default: 128)
Extraction
--strategy fast | accurate | ocr (default: fast)
--include-types Comma-separated element types to emit
--exclude-types Comma-separated element types to suppress
--table-text-mode markdown | nl-rows | nl-columns | hybrid (default: markdown)
--figures-dir Directory to export figure assets (PNG/JPEG)
NER / GLiNER2
--entities/--no-entities Run GLiNER2 NER (default: on when extractor[ner] installed)
--entities-model Local GLiNER2 model (default: fastino/gliner2-base-v1)
--entities-types Comma-separated NER label list
--entities-threshold Min confidence score (default: 0.50)
--classify-as Comma-separated classification labels
--relations/--no-relations Extract (subject, predicate, object) triples
--relation-types Comma-separated relation predicates
--extract-schema JSON file with schema dict for structured field extraction
--extract-on Comma-separated element types for structured extraction
--canonicalize/--no-canonicalize Cross-document entity canonicalization
--registry-path Path to EntityRegistry JSON file
Quality
--quality/--no-quality Emit ChunkQuality scores on each chunk
Parallel / incremental
--workers / -w Parallel workers for local directory extraction (default: 1)
--incremental Skip files unchanged since last run (requires --out)
Source connectors
--source-filter Glob pattern to filter remote files, e.g. "*.pdf"
--source-concurrency Max parallel downloads from remote sources (default: 4)
--source-tmp-dir Directory for temp files during remote download
Database sinks
--sink clickhouse | mongodb | postgres | elasticsearch | qdrant | weaviate | kafka | webhook
--sink-uri Connection URI or host:port
--sink-database Database name (default: extractor)
--sink-table Table/collection/index/topic name (default: elements)
--sink-batch Batch size for database writes (default: 1000)
Config
--config Path to extractor.toml config file
extractor view <jsonl-file>
--max-text Max chars per element (default: 200)
--types / -t Comma-separated element types to show
--count / -n Print element counts by type and exit
--no-meta Hide envelope/manifest lines
extractor validate <jsonl-file>
--level basic | schema | invariants (default: schema)
extractor info [file]
extractor schema [element-type] [--json] [--type element|chunk|manifest] [--out file]
extractor cache clear [cache-file] [--older-than N]
Environment variables
All environment variables are optional. The library works out of the box without any of them — each one unlocks a specific optional capability.
PDF (scientific)
| Variable | Default | Description |
|---|---|---|
| GROBID_URL | http://localhost:8070 | URL of a running GROBID server. When set and reachable, scientific PDFs are parsed via GROBID TEI (better equation/citation/reference extraction). Without it, scientific PDFs automatically fall back to the standard digital PDF parser — no errors. |
Audio transcription — only relevant if you install extractor[audio] and pass audio files (MP3/WAV/M4A)
| Variable | Default | Description |
|---|---|---|
| HF_TOKEN | — | Hugging Face API token. Required only for speaker diarization ("who said what"). Without it, you still get a full transcript — just without speaker labels. Get a free token at huggingface.co/settings/tokens and accept the pyannote.audio model license. |
| WHISPER_MODEL | base | Whisper model size controlling accuracy vs. speed. base (~150 MB) is fast and good for most uses. Use large-v2 (~3 GB) for production-quality transcription. Options: tiny / base / small / medium / large-v2. |
| WHISPER_DEVICE | auto | Hardware to run Whisper on. Auto-detected: uses NVIDIA GPU (cuda), Apple Silicon (mps), or falls back to cpu. Set explicitly if auto-detection picks the wrong device. |
| WHISPER_COMPUTE | auto | Float precision for faster-whisper. int8 on CPU (faster, less RAM). float16 on GPU (fastest). float32 for maximum accuracy. Auto-set based on device. |
Entity extraction (NER) — only relevant if you install extractor[ner]
NER uses local inference only (fastino/gliner2-base-v1). No external API key required.
Exit codes
| Code | Meaning |
|---|---|
| 0 | Success |
| 1 | Partial success (some files failed) |
| 2 | Quarantine failure (file rejected) |
| 3 | Unsupported format |
| 4 | Parser crash |
| 5 | Output write error |
| 6 | Configuration error |
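A hedged sketch of handling these exit codes when driving the CLI from a script; the meanings come from the table above, and the handling policy is only illustrative:

import subprocess

result = subprocess.run(
    ["extractor", "run", "paper.pdf", "--mode", "chunks", "--out", "paper.chunks.jsonl"],
    capture_output=True,
    text=True,
)
if result.returncode == 0:
    print("extraction succeeded")
elif result.returncode == 1:
    print("partial success: some files failed, output is still usable")
elif result.returncode == 2:
    print("file rejected by the quarantine gate:", result.stderr.strip())
else:
    # 3-6 cover unsupported formats, parser crashes, write errors, and config errors
    raise RuntimeError(f"extractor failed with exit code {result.returncode}")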
Documentation
| Document | Description |
|---|---|
| Protocol Specification | Full schema and protocol spec v1.2.0 |
| Source Connectors | S3, Azure, GCS, HTTP, IMAP connectors |
| Database Sinks | ClickHouse, MongoDB, Postgres, ES, Qdrant, Weaviate, Kafka, Webhook |
| Architecture | System architecture and design decisions |
| Writing a Parser | How to add support for a new format |
| Contributing | Dev setup, test workflow, PR guidelines |
License
MIT — see LICENSE.