
extractor

Production-grade document extraction CLI + Python library.
Converts any file format into a versioned, schema-stable, RAG-ready JSONL stream.

Python 3.10+ · Schema v1.2.0 · License: MIT


Why extractor?

Most RAG pipelines treat documents as bags of text. extractor preserves structure:

  • Every chunk carries a full breadcrumb (section_path) so your retriever knows where it came from
  • Tables are atomic — never split across chunks; emitted in both Markdown and structured JSON
  • Sections drive chunking — boundaries follow headings, not page numbers or character counts
  • Content-addressed IDs — stable across re-runs, safe to use as vector store keys
  • Streaming JSONL — process terabytes without loading files into memory
  • Versioned protocol — schema_version on every record; breaking changes bump the major version
  • Named entity extraction — GLiNER2 NER inline on every element and chunk; entity_types field for fast vector store payload filtering; local inference via fastino/gliner2-base-v1
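
These guarantees are visible on every record. A minimal sketch using the Python API described further below (chunk.section_path mirroring the JSON field is an assumption here):

# Sketch: heading breadcrumbs and stable, content-addressed IDs on RAG-ready chunks
from extractor import extract

for chunk in extract("paper.pdf", mode="chunks"):
    # id is stable across re-runs, so it can double as the vector store key
    print(chunk.id, " > ".join(chunk.section_path), chunk.text[:80])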

Installation

pip install coresdk-extractor                      # core (PDF, DOCX, XLSX, PPTX, HTML, Markdown, EPUB, JSON, XML, CSV, LaTeX)
pip install "coresdk-extractor[audio]"             # + audio transcription (faster-whisper + pyannote)
pip install "coresdk-extractor[ocr]"               # + scanned PDF OCR (surya-ocr)
pip install "coresdk-extractor[ner]"               # + GLiNER2 NER, classification, relations, structured extraction
pip install "coresdk-extractor[lang]"              # + language detection (langdetect)
pip install "coresdk-extractor[otel]"              # + OpenTelemetry tracing

# Source connectors
pip install "coresdk-extractor[s3]"                # S3 / MinIO
pip install "coresdk-extractor[azure]"             # Azure Blob Storage / ADLS Gen2
pip install "coresdk-extractor[gcs]"               # Google Cloud Storage
pip install "coresdk-extractor[sources]"           # all three cloud connectors
# HTTP/HTTPS and IMAP/email connectors are included in the core install

# Database sinks
pip install "coresdk-extractor[clickhouse]"        # ClickHouse
pip install "coresdk-extractor[mongodb]"           # MongoDB
pip install "coresdk-extractor[postgres]"          # PostgreSQL
pip install "coresdk-extractor[elasticsearch]"     # Elasticsearch
pip install "coresdk-extractor[qdrant]"            # Qdrant
pip install "coresdk-extractor[weaviate]"          # Weaviate
pip install "coresdk-extractor[kafka]"             # Kafka (confluent-kafka)
# Webhook sink requires no extra install

pip install "coresdk-extractor[full]"              # everything above

Scientific PDFs (GROBID): run a GROBID server and set GROBID_URL=http://localhost:8070. Without it, scientific PDFs fall back to pymupdf4llm automatically.
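
A quick pre-flight check is easy to script. A sketch assuming GROBID's standard /api/isalive health endpoint (the fallback happens automatically either way, so this is purely informational):

# Sketch: check whether a GROBID server is reachable before a large scientific-PDF run
import os
import urllib.request

grobid = os.environ.get("GROBID_URL", "http://localhost:8070")
try:
    with urllib.request.urlopen(f"{grobid}/api/isalive", timeout=2) as resp:
        print("GROBID reachable:", resp.read().decode().strip())
except OSError:
    print("GROBID not reachable; scientific PDFs will fall back to pymupdf4llm")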

Verify installation

extractor info          # lists all supported formats
extractor run README.md # quick smoke test on any local file

Heavy optional dependencies: extractor[audio] pulls PyTorch (~2 GB). extractor[ocr] requires surya-ocr with PyTorch. extractor[ner] pulls PyTorch (~2 GB) for local GLiNER2 inference. Install these only when needed.


Quick start

# Extract a PDF — stream elements to stdout
extractor run paper.pdf

# Extract in RAG-ready chunks mode, write to file
extractor run paper.pdf --mode chunks --out paper.chunks.jsonl

# Extract a whole directory, write each file alongside it
extractor run ./docs/ --mode chunks --out ./out/

# View what was extracted
extractor view paper.chunks.jsonl
extractor view paper.chunks.jsonl --count        # element type breakdown
extractor view paper.chunks.jsonl --types table:simple,code:block

# Inspect a file before extracting
extractor info paper.pdf

# List all parsers and supported formats
extractor info

# Validate output against the schema
extractor validate paper.chunks.jsonl --level invariants

# List all element types
extractor schema

Source connectors

extractor can pull documents directly from cloud storage, HTTP, and email — no manual download step needed.

# S3 bucket or prefix
extractor run s3://my-bucket/docs/ --out ./output/

# MinIO (S3-compatible)
EXTRACTOR_S3_ENDPOINT_URL=http://minio:9000 extractor run s3://my-bucket/docs/ --out ./output/

# Azure Blob Storage
extractor run az://my-container/reports/ --out ./output/

# Azure Data Lake Storage Gen2
extractor run abfs://my-container/data/ --out ./output/

# Google Cloud Storage
extractor run gcs://my-bucket/papers/ --out ./output/

# Single file via HTTPS (no extra install needed)
extractor run https://example.com/report.pdf

# Email attachments via IMAP (no extra install needed)
extractor run imap://inbox --out ./output/

# Filter to PDF files only, download up to 8 files in parallel
extractor run s3://my-bucket/docs/ --source-filter "*.pdf" --source-concurrency 8 --out ./output/

Every downloaded file passes through the same quarantine gate as local files before extraction.

Auth env vars by connector

| Connector | Required env vars | Optional env vars |
|---|---|---|
| S3 | AWS_ACCESS_KEY_ID + AWS_SECRET_ACCESS_KEY (or AWS_PROFILE, or IAM role) | AWS_SESSION_TOKEN, EXTRACTOR_S3_ENDPOINT_URL, EXTRACTOR_S3_REGION |
| MinIO | AWS_ACCESS_KEY_ID + AWS_SECRET_ACCESS_KEY + EXTRACTOR_S3_ENDPOINT_URL | EXTRACTOR_S3_REGION |
| Azure Blob / ADLS | AZURE_STORAGE_CONNECTION_STRING or AZURE_STORAGE_ACCOUNT + AZURE_STORAGE_KEY | AZURE_STORAGE_ACCOUNT alone uses DefaultAzureCredential (managed identity / service principal) |
| GCS | GOOGLE_APPLICATION_CREDENTIALS (path to service account JSON) | On GKE/Cloud Run: Workload Identity — no env var needed |
| HTTP/HTTPS | none | EXTRACTOR_HTTP_HEADERS_JSON (JSON dict), EXTRACTOR_HTTP_VERIFY_SSL, EXTRACTOR_HTTP_MAX_BYTES |
| IMAP/email | EXTRACTOR_IMAP_HOST, EXTRACTOR_IMAP_USERNAME, EXTRACTOR_IMAP_PASSWORD | EXTRACTOR_IMAP_PORT (default: 993), EXTRACTOR_IMAP_FOLDER (default: INBOX), EXTRACTOR_IMAP_SEARCH (default: UNSEEN) |

See docs/sources.md for full connector documentation.
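
If you orchestrate runs from Python, one option is to set the credentials from the table above and invoke the CLI. A sketch targeting MinIO (the bucket name and credentials are placeholders):

# Sketch: point the S3 connector at MinIO and run a filtered extraction
import os
import subprocess

env = os.environ.copy()
env.update({
    "AWS_ACCESS_KEY_ID": "minio-access-key",           # placeholder credentials
    "AWS_SECRET_ACCESS_KEY": "minio-secret-key",
    "EXTRACTOR_S3_ENDPOINT_URL": "http://minio:9000",  # S3-compatible endpoint
})
subprocess.run(
    ["extractor", "run", "s3://my-bucket/docs/",
     "--source-filter", "*.pdf", "--source-concurrency", "8",
     "--out", "./output/"],
    env=env,
    check=True,
)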


Database sinks

Stream extracted records directly into a database with --sink. Records are written to the sink in batches, while normal JSONL output (stdout or --out) continues alongside.

# Write to Qdrant (chunk payloads only — add embeddings separately)
extractor run ./docs/ --mode chunks --sink qdrant --sink-uri http://localhost:6333

# Write to MongoDB
extractor run ./docs/ --mode chunks --sink mongodb --sink-uri mongodb://localhost:27017 --sink-database mydb --sink-table chunks

# Write to PostgreSQL (DSN from env: EXTRACTOR_PG_DSN)
extractor run ./docs/ --mode chunks --sink postgres

# Write to Elasticsearch
extractor run ./docs/ --mode chunks --sink elasticsearch --sink-uri http://localhost:9200 --sink-table my_index

# Write to ClickHouse
extractor run ./docs/ --mode chunks --sink clickhouse --sink-uri localhost:8123

# Write to Kafka topic
extractor run ./docs/ --mode chunks --sink kafka --sink-uri broker:9092 --sink-table my_topic

# POST batches to a webhook
extractor run ./docs/ --mode chunks --sink webhook --sink-uri https://my-api.example.com/ingest

# Adjust batch size (default: 1000)
extractor run ./docs/ --mode chunks --sink postgres --sink-batch 500

| Sink | Install | Connection |
|---|---|---|
| clickhouse | coresdk-extractor[clickhouse] | --sink-uri host:port or defaults to localhost:8123 |
| mongodb | coresdk-extractor[mongodb] | --sink-uri mongodb://... or defaults to mongodb://localhost:27017 |
| postgres / postgresql | coresdk-extractor[postgres] | --sink-uri postgresql://user:pass@host/db or EXTRACTOR_PG_DSN |
| elasticsearch / es | coresdk-extractor[elasticsearch] | --sink-uri http://... or EXTRACTOR_ES_URL |
| qdrant | coresdk-extractor[qdrant] | --sink-uri http://... or EXTRACTOR_QDRANT_URL |
| weaviate | coresdk-extractor[weaviate] | --sink-uri http://... or EXTRACTOR_WEAVIATE_URL |
| kafka | coresdk-extractor[kafka] | --sink-uri broker:9092 or EXTRACTOR_KAFKA_BROKERS |
| webhook / http_post | none (core) | --sink-uri https://... or EXTRACTOR_WEBHOOK_URL |

See docs/sinks.md for schema mapping details, auth env vars, and custom sink plugins.
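
The qdrant sink stores chunk payloads only, so vectors have to be added in a separate pass. A sketch of that pass, assuming a chunks JSONL written with --out, an existing collection named chunks, and an embed() function you supply (qdrant-client usage may vary by version):

# Sketch: add embeddings to extracted chunks and upsert them into Qdrant
import json
import uuid

from qdrant_client import QdrantClient
from qdrant_client.models import PointStruct

def embed(text: str) -> list[float]:
    raise NotImplementedError("plug in your embedding model here")

client = QdrantClient(url="http://localhost:6333")
points = []
with open("paper.chunks.jsonl") as f:
    for line in f:
        rec = json.loads(line)
        if rec.get("element_type") != "composite:chunk":
            continue  # keep chunk records only; skip envelope, stream_end, and non-chunk elements
        points.append(PointStruct(
            id=str(uuid.uuid5(uuid.NAMESPACE_URL, rec["id"])),  # deterministic UUID derived from the stable el_ id
            vector=embed(rec["text"]),
            payload={"id": rec["id"], "text": rec["text"], "section_path": rec["section_path"]},
        ))

client.upsert(collection_name="chunks", points=points)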


Config file

Place .extractor.toml in your project directory (or pass --config path/to/extractor.toml) to set defaults without repeating CLI flags.

[run]
mode = "chunks"
chunk_size = 512
tokenizer = "cl100k_base"

[ner]
enabled = true
model = "fastino/gliner2-base-v1"

[sink]
type = "qdrant"
uri = "http://localhost:6333"

[source]
concurrency = 8

[quality_gates]
min_chunks = 1
max_extraction_error_rate = 0.05

New in v1.2.0

  • GLiNER2 extended capabilities — classification (--classify-as), relation triples (--relations), structured field extraction (--extract-schema)
  • Source connectors — pull documents from S3/MinIO, Azure Blob/ADLS, GCS, HTTP/HTTPS, and IMAP email directly via URI
  • Database sinks — stream records into ClickHouse, MongoDB, PostgreSQL, Elasticsearch, Qdrant, Weaviate, Kafka, or any HTTP webhook
  • Table serialization modes — --table-text-mode markdown|nl-rows|nl-columns|hybrid controls how tables are serialized into chunk text
  • Chunk quality scoring — --quality emits ChunkQuality with lexical density, entity density, compression ratio, and heading coverage
  • Language detection — metadata.language populated per element when extractor[lang] is installed
  • Quality gates — configurable pass/fail thresholds in .extractor.toml under [quality_gates]; gate failures are recorded in the manifest
  • Figure extraction — --figures-dir exports figure assets (PNG/JPEG) alongside JSONL output
  • OpenTelemetry — extractor[otel] emits spans per document; configure via standard OTEL env vars (OTEL_EXPORTER_OTLP_ENDPOINT, etc.)
  • Dual-chunk mode — --mode dual-chunks produces both coarse parent chunks and fine child chunks linked by parent_chunk_id (sketched below)
  • Parallel workers — --workers N for local directory extraction; cloud sources use --source-concurrency N
  • Incremental processing — --incremental skips files unchanged since last run (SHA256 + run-config keyed JSON cache)
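
For dual-chunk mode, a sketch of re-linking children to their parents from the output JSONL (the exact placement of parent_chunk_id on child records is an assumption here; paths illustrative):

# Sketch: group fine child chunks under coarse parent chunks
import json
from collections import defaultdict

parents, children = {}, defaultdict(list)
with open("paper.dual.jsonl") as f:
    for line in f:
        rec = json.loads(line)
        if rec.get("type") in ("envelope", "stream_end"):
            continue
        parent_id = rec.get("parent_chunk_id")  # assumed to sit at the top level of child records
        if parent_id:
            children[parent_id].append(rec)
        else:
            parents[rec["id"]] = rec

for pid, parent in parents.items():
    print(pid, len(children[pid]), "child chunks")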

Entity extraction (NER)

# Extract with named entities (requires extractor[ner])
extractor run paper.pdf --mode chunks --entities

# Disable NER
extractor run paper.pdf --mode chunks --no-entities

# Custom entity types
extractor run paper.pdf --entities-types "person,organization,product" --entities-threshold 0.6

from extractor import extract

for chunk in extract("paper.pdf", mode="chunks", entities=True):
    print(chunk.entities)           # list of EntityAnnotation
    print(chunk.chunk_metadata.entity_types)  # ["organization", "person"]

Python API

from extractor import extract

# Elements mode — fine-grained semantic units
for el in extract("paper.pdf", mode="elements"):
    print(el.element_type, el.section_path, el.text[:120])

# Chunks mode — pre-committed, RAG-ready chunks
for chunk in extract("paper.pdf", mode="chunks", chunk_size=512):
    print(chunk.id, chunk.token_count, chunk.text[:120])

# Filter to only tables and headings
for el in extract("report.docx", include_types=["table:simple", "structural:section_header"]):
    if el.table:
        print(el.table.structured)  # {"headers": [...], "rows": [[...]]}

Error handling

from extractor import extract, QuarantineError, UnsupportedFormatError, ParserError

try:
    for el in extract("untrusted_file.pdf"):
        print(el.element_type, el.text[:80])
except QuarantineError as e:
    print(f"File rejected by security check: {e}")
except UnsupportedFormatError as e:
    print(f"Format not supported. Run `extractor info` for the full list.")
except ParserError as e:
    print(f"Parser failed: {e}")

Output schema

Every record is a JSON object. Key fields:

| Field | Type | Description |
|---|---|---|
| id | string | Content-addressed ID (el_ + 16 hex chars) |
| element_type | string | One of 47 canonical types (see below) |
| text | string | Plain-text content |
| section_path | string[] | Heading breadcrumb, e.g. ["Introduction", "Methods"] |
| section_path_tier | int | Quality: 1=native, 2=font-heuristic, 3=keyword, 4=positional |
| sequence_index | int | Document order, 0-based |
| page | int or null | Source page (1-based) |
| schema_version | string | "1.2.0" |
| source_filename | string | Source file name |
| source_sha256 | string | SHA-256 of source file |
| entities | EntityAnnotation[] or absent | Named entity annotations (absent = NER not run; [] = NER ran, nothing found) |
| table | object or null | {markdown, structured, has_header_row, row_count, col_count} |
| equation | object or null | {latex, plain_text, mathml} |
| figure | object or null | {caption, image_ref, image_sha256, ocr_text} |
| transcript | object or null | {speaker, start_time_s, end_time_s, word_timestamps} |
| admonition | object or null | {kind, title} |
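
One subtlety from the table above: entities absent means NER was not run, while an empty list means NER ran and found nothing. A minimal sketch of handling both cases (the file path is illustrative):

# Sketch: distinguish "NER not run" from "NER ran, nothing found" per record
import json

with open("paper.chunks.jsonl") as f:
    for line in f:
        rec = json.loads(line)
        if rec.get("type") in ("envelope", "stream_end"):
            continue
        if "entities" not in rec:
            status = "NER not run"
        elif not rec["entities"]:
            status = "NER ran, no entities found"
        else:
            status = f"{len(rec['entities'])} entities"
        print(rec["id"], status)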

Element types

structural:  title  section_header  subtitle  divider  page_header  page_footer
text:        narrative  abstract  admonition  pull_quote  footnote  caption  sidebar  transcript_segment
table:       simple  complex  continuation
code:        block  cell  inline
list:        item  item_ordered  item_definition
media:       figure  image  audio  video
scientific:  equation_display  equation_inline  citation  reference_entry  theorem  definition  proof
meta:        document_title  author  date  url  email  page_number  extraction_error
form:        field  label  checkbox
composite:   chunk

Atomic types (never split across chunks): table:simple, table:complex, table:continuation, media:figure, media:image, code:block, scientific:equation_display


JSONL envelope format

{"type":"envelope","extractor_version":"1.2.0","source":{...},"run_config":{...},"created_at":"..."}
{"id":"el_a1b2c3d4e5f6a7b8","element_type":"structural:title","text":"Introduction",...}
{"id":"el_...","element_type":"text:narrative","text":"...",...}
...
{"type":"stream_end","status":"complete","total_elements":42,"schema_version":"1.0.0"}

A manifest.json companion file is written alongside every --out file with full stats.
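
A sketch of consuming the stream in order: the first line is the envelope, element records follow, and the final stream_end line carries the completion status (the path is illustrative):

# Sketch: walk a JSONL stream, separating envelope, elements, and terminator
import json

elements = []
with open("paper.chunks.jsonl") as f:
    for line in f:
        rec = json.loads(line)
        if rec.get("type") == "envelope":
            print("extractor", rec["extractor_version"], "source:", rec["source"])
        elif rec.get("type") == "stream_end":
            assert rec["status"] == "complete", f"stream ended with status {rec['status']}"
            print("total elements:", rec["total_elements"])
        else:
            elements.append(rec)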


Supported formats

| Format | Library | Notes |
|---|---|---|
| PDF (digital) | pymupdf4llm | Fast, heading-aware |
| PDF (scientific) | GROBID TEI → pymupdf4llm fallback | Equations, citations, references |
| PDF (scanned) | surya-ocr → pymupdf fallback | Layout detection + OCR |
| DOCX | python-docx | Headings, tables, runs, images |
| XLSX | openpyxl | Sheet-per-section, dual-format tables |
| PPTX | python-pptx | Slide titles + body, speaker notes |
| HTML | trafilatura + lxml | Boilerplate removal, GFM alerts |
| Markdown | mistletoe (GFM) | Headings, tables, alerts, code fences |
| EPUB | ebooklib + BS4 | Spine-order chapter extraction |
| LaTeX | pure-regex parser | Sections, equations, tables, figures, bibliography |
| JSON | stdlib | Key-value pairs as narrative |
| XML | lxml | Title/paragraph heuristics |
| CSV | stdlib csv | Entire file as dual-format table |
| Plain text | heuristic | Heading pattern detection |
| Audio (mp3/wav/m4a/flac) | faster-whisper + pyannote | Diarization, word timestamps |

CLI reference

extractor run <target> [options]

  Output
  --mode            elements | chunks | dual-chunks  (default: elements)
  --out             Output file or directory (default: stdout)
  --quiet / -q      Suppress progress output
  --debug           Show full tracebacks on errors
  --include-full-path  Store absolute source path instead of filename

  Chunking
  --chunk-size      Max tokens per chunk (default: 512)
  --overlap         Overlap tokens between chunks (default: 0)
  --tokenizer       tiktoken encoding (default: cl100k_base)
  --context-prefix  Prepend section breadcrumb to each chunk text
  --parent-size     Token budget for coarse chunks in dual-chunks mode (default: 512)
  --child-size      Token budget for fine chunks in dual-chunks mode (default: 128)

  Extraction
  --strategy        fast | accurate | ocr  (default: fast)
  --include-types   Comma-separated element types to emit
  --exclude-types   Comma-separated element types to suppress
  --table-text-mode markdown | nl-rows | nl-columns | hybrid  (default: markdown)
  --figures-dir     Directory to export figure assets (PNG/JPEG)

  NER / GLiNER2
  --entities/--no-entities      Run GLiNER2 NER (default: on when extractor[ner] installed)
  --entities-model              Local GLiNER2 model (default: fastino/gliner2-base-v1)
  --entities-types              Comma-separated NER label list
  --entities-threshold          Min confidence score (default: 0.50)
  --classify-as                 Comma-separated classification labels
  --relations/--no-relations    Extract (subject, predicate, object) triples
  --relation-types              Comma-separated relation predicates
  --extract-schema              JSON file with schema dict for structured field extraction
  --extract-on                  Comma-separated element types for structured extraction
  --canonicalize/--no-canonicalize  Cross-document entity canonicalization
  --registry-path               Path to EntityRegistry JSON file

  Quality
  --quality/--no-quality        Emit ChunkQuality scores on each chunk

  Parallel / incremental
  --workers / -w    Parallel workers for local directory extraction (default: 1)
  --incremental     Skip files unchanged since last run (requires --out)

  Source connectors
  --source-filter       Glob pattern to filter remote files, e.g. "*.pdf"
  --source-concurrency  Max parallel downloads from remote sources (default: 4)
  --source-tmp-dir      Directory for temp files during remote download

  Database sinks
  --sink            clickhouse | mongodb | postgres | elasticsearch | qdrant | weaviate | kafka | webhook
  --sink-uri        Connection URI or host:port
  --sink-database   Database name (default: extractor)
  --sink-table      Table/collection/index/topic name (default: elements)
  --sink-batch      Batch size for database writes (default: 1000)

  Config
  --config          Path to extractor.toml config file

extractor view <jsonl-file>
  --max-text        Max chars per element (default: 200)
  --types / -t      Comma-separated element types to show
  --count / -n      Print element counts by type and exit
  --no-meta         Hide envelope/manifest lines

extractor validate <jsonl-file>
  --level           basic | schema | invariants  (default: schema)

extractor info [file]
extractor schema [element-type] [--json] [--type element|chunk|manifest] [--out file]
extractor cache clear [cache-file] [--older-than N]

Environment variables

All environment variables are optional. The library works out of the box without any of them — each one unlocks a specific optional capability.

PDF (scientific)

| Variable | Default | Description |
|---|---|---|
| GROBID_URL | http://localhost:8070 | URL of a running GROBID server. When set and reachable, scientific PDFs are parsed via GROBID TEI (better equation/citation/reference extraction). Without it, scientific PDFs automatically fall back to the standard digital PDF parser — no errors. |

Audio transcription — only relevant if you install extractor[audio] and pass audio files (MP3/WAV/M4A)

| Variable | Default | Description |
|---|---|---|
| HF_TOKEN | (none) | Hugging Face API token. Required only for speaker diarization ("who said what"). Without it, you still get a full transcript — just without speaker labels. Get a free token at huggingface.co/settings/tokens and accept the pyannote.audio model license. |
| WHISPER_MODEL | base | Whisper model size controlling accuracy vs. speed. base (~150 MB) is fast and good for most uses. Use large-v2 (~3 GB) for production-quality transcription. Options: tiny / base / small / medium / large-v2. |
| WHISPER_DEVICE | auto | Hardware to run Whisper on. Auto-detected: uses NVIDIA GPU (cuda), Apple Silicon (mps), or falls back to cpu. Set explicitly if auto-detection picks the wrong device. |
| WHISPER_COMPUTE | auto | Float precision for faster-whisper. int8 on CPU (faster, less RAM). float16 on GPU (fastest). float32 for maximum accuracy. Auto-set based on device. |
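
If you work with the transcripts from Python, a sketch of iterating segments (the transcript field names come from the schema table; attribute-style access via el.transcript is an assumption here):

# Sketch: print diarized transcript segments from an audio file
from extractor import extract

for el in extract("meeting.mp3", include_types=["text:transcript_segment"]):
    seg = el.transcript
    # speaker labels require HF_TOKEN for diarization; without it only timestamps are available
    print(f"[{seg.start_time_s:7.1f}s] {seg.speaker}: {el.text[:80]}")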

Entity extraction (NER) — only relevant if you install extractor[ner]

NER uses local inference only (fastino/gliner2-base-v1). No external API key required.


Exit codes

| Code | Meaning |
|---|---|
| 0 | Success |
| 1 | Partial success (some files failed) |
| 2 | Quarantine failure (file rejected) |
| 3 | Unsupported format |
| 4 | Parser crash |
| 5 | Output write error |
| 6 | Configuration error |
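
If you wrap the CLI in automation, these codes can be branched on directly. A sketch using subprocess (paths illustrative):

# Sketch: run the CLI and branch on the documented exit codes
import subprocess

result = subprocess.run(
    ["extractor", "run", "./docs/", "--mode", "chunks", "--out", "./out/"]
)
if result.returncode == 0:
    print("all files extracted")
elif result.returncode == 1:
    print("partial success: some files failed; check the manifest for details")
elif result.returncode == 2:
    print("a file was rejected by the quarantine gate")
else:
    print(f"extraction failed with exit code {result.returncode}")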

Documentation

| Document | Description |
|---|---|
| Protocol Specification | Full schema and protocol spec v1.2.0 |
| Source Connectors | S3, Azure, GCS, HTTP, IMAP connectors |
| Database Sinks | ClickHouse, MongoDB, Postgres, ES, Qdrant, Weaviate, Kafka, Webhook |
| Architecture | System architecture and design decisions |
| Writing a Parser | How to add support for a new format |
| Contributing | Dev setup, test workflow, PR guidelines |

License

MIT — see LICENSE.
