
extractor

Production-grade document extraction CLI + Python library.
Converts any file format into a versioned, schema-stable, RAG-ready JSONL stream.

Python 3.10+ · Schema v1.2.0 · License: MIT


Why extractor?

Most RAG pipelines treat documents as bags of text. extractor preserves structure:

  • Every chunk carries a full breadcrumb (section_path) so your retriever knows where it came from
  • Tables are atomic — never split across chunks; emitted in both Markdown and structured JSON
  • Sections drive chunking — boundaries follow headings, not page numbers or character counts
  • Content-addressed IDs — stable across re-runs, safe to use as vector store keys
  • Streaming JSONL — process terabytes without loading files into memory
  • Versioned protocol — schema_version on every record; breaking changes bump the major version
  • Named entity extraction — GLiNER2 NER inline on every element and chunk; entity_types field for fast vector store payload filtering; local inference via fastino/gliner2-base-v1
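
These guarantees are visible on every record. A minimal sketch using the Python API described further below (chunk.section_path mirroring the JSON field is an assumption here):

# Sketch: heading breadcrumbs and stable, content-addressed IDs on RAG-ready chunks
from extractor import extract

for chunk in extract("paper.pdf", mode="chunks"):
    # id is stable across re-runs, so it can double as the vector store key
    print(chunk.id, " > ".join(chunk.section_path), chunk.text[:80])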

Installation

pip install coresdk-extractor                      # core (PDF, DOCX, XLSX, PPTX, HTML, Markdown, EPUB, JSON, XML, CSV, LaTeX)
pip install "coresdk-extractor[audio]"             # + audio transcription (faster-whisper + pyannote)
pip install "coresdk-extractor[ocr]"               # + scanned PDF OCR (surya-ocr)
pip install "coresdk-extractor[ner]"               # + GLiNER2 NER, classification, relations, structured extraction
pip install "coresdk-extractor[lang]"              # + language detection (langdetect)
pip install "coresdk-extractor[otel]"              # + OpenTelemetry tracing

# Source connectors
pip install "coresdk-extractor[s3]"                # S3 / MinIO
pip install "coresdk-extractor[azure]"             # Azure Blob Storage / ADLS Gen2
pip install "coresdk-extractor[gcs]"               # Google Cloud Storage
pip install "coresdk-extractor[sources]"           # all three cloud connectors
# HTTP/HTTPS and IMAP/email connectors are included in the core install

# Database sinks
pip install "coresdk-extractor[clickhouse]"        # ClickHouse
pip install "coresdk-extractor[mongodb]"           # MongoDB
pip install "coresdk-extractor[postgres]"          # PostgreSQL
pip install "coresdk-extractor[elasticsearch]"     # Elasticsearch
pip install "coresdk-extractor[qdrant]"            # Qdrant
pip install "coresdk-extractor[weaviate]"          # Weaviate
pip install "coresdk-extractor[kafka]"             # Kafka (confluent-kafka)
# Webhook sink requires no extra install

pip install "coresdk-extractor[full]"              # everything above

Scientific PDFs (GROBID): run a GROBID server and set GROBID_URL=http://localhost:8070. Without it, scientific PDFs fall back to pymupdf4llm automatically.
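
A quick pre-flight check is easy to script. A sketch assuming GROBID's standard /api/isalive health endpoint (the fallback happens automatically either way, so this is purely informational):

# Sketch: check whether a GROBID server is reachable before a large scientific-PDF run
import os
import urllib.request

grobid = os.environ.get("GROBID_URL", "http://localhost:8070")
try:
    with urllib.request.urlopen(f"{grobid}/api/isalive", timeout=2) as resp:
        print("GROBID reachable:", resp.read().decode().strip())
except OSError:
    print("GROBID not reachable; scientific PDFs will fall back to pymupdf4llm")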

Verify installation

extractor info          # lists all supported formats
extractor run README.md # quick smoke test on any local file

Heavy optional dependencies: extractor[audio] pulls PyTorch (~2 GB). extractor[ocr] requires surya-ocr with PyTorch. extractor[ner] pulls PyTorch (~2 GB) for local GLiNER2 inference. Install these only when needed.


Quick start

# Extract a PDF — stream elements to stdout
extractor run paper.pdf

# Extract in RAG-ready chunks mode, write to file
extractor run paper.pdf --mode chunks --out paper.chunks.jsonl

# Extract a whole directory, write each file alongside it
extractor run ./docs/ --mode chunks --out ./out/

# View what was extracted
extractor view paper.chunks.jsonl
extractor view paper.chunks.jsonl --count        # element type breakdown
extractor view paper.chunks.jsonl --types table:simple,code:block

# Inspect a file before extracting
extractor info paper.pdf

# List all parsers and supported formats
extractor info

# Validate output against the schema
extractor validate paper.chunks.jsonl --level invariants

# List all element types
extractor schema

Source connectors

extractor can pull documents directly from cloud storage, HTTP, and email — no manual download step needed.

# S3 bucket or prefix
extractor run s3://my-bucket/docs/ --out ./output/

# MinIO (S3-compatible)
EXTRACTOR_S3_ENDPOINT_URL=http://minio:9000 extractor run s3://my-bucket/docs/ --out ./output/

# Azure Blob Storage
extractor run az://my-container/reports/ --out ./output/

# Azure Data Lake Storage Gen2
extractor run abfs://my-container/data/ --out ./output/

# Google Cloud Storage
extractor run gcs://my-bucket/papers/ --out ./output/

# Single file via HTTPS (no extra install needed)
extractor run https://example.com/report.pdf

# Email attachments via IMAP (no extra install needed)
extractor run imap://inbox --out ./output/

# Filter to PDF files only, download up to 8 files in parallel
extractor run s3://my-bucket/docs/ --source-filter "*.pdf" --source-concurrency 8 --out ./output/

Every downloaded file passes through the same quarantine gate as local files before extraction.

Auth env vars by connector

| Connector | Required env vars | Optional env vars |
|---|---|---|
| S3 | AWS_ACCESS_KEY_ID + AWS_SECRET_ACCESS_KEY (or AWS_PROFILE, or IAM role) | AWS_SESSION_TOKEN, EXTRACTOR_S3_ENDPOINT_URL, EXTRACTOR_S3_REGION |
| MinIO | AWS_ACCESS_KEY_ID + AWS_SECRET_ACCESS_KEY + EXTRACTOR_S3_ENDPOINT_URL | EXTRACTOR_S3_REGION |
| Azure Blob / ADLS | AZURE_STORAGE_CONNECTION_STRING or AZURE_STORAGE_ACCOUNT + AZURE_STORAGE_KEY | AZURE_STORAGE_ACCOUNT alone uses DefaultAzureCredential (managed identity / service principal) |
| GCS | GOOGLE_APPLICATION_CREDENTIALS (path to service account JSON) | On GKE/Cloud Run: Workload Identity — no env var needed |
| HTTP/HTTPS | none | EXTRACTOR_HTTP_HEADERS_JSON (JSON dict), EXTRACTOR_HTTP_VERIFY_SSL, EXTRACTOR_HTTP_MAX_BYTES |
| IMAP/email | EXTRACTOR_IMAP_HOST, EXTRACTOR_IMAP_USERNAME, EXTRACTOR_IMAP_PASSWORD | EXTRACTOR_IMAP_PORT (default: 993), EXTRACTOR_IMAP_FOLDER (default: INBOX), EXTRACTOR_IMAP_SEARCH (default: UNSEEN) |

See docs/sources.md for full connector documentation.
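
If you orchestrate runs from Python, one option is to set the credentials from the table above and invoke the CLI. A sketch targeting MinIO (the bucket name and credentials are placeholders):

# Sketch: point the S3 connector at MinIO and run a filtered extraction
import os
import subprocess

env = os.environ.copy()
env.update({
    "AWS_ACCESS_KEY_ID": "minio-access-key",           # placeholder credentials
    "AWS_SECRET_ACCESS_KEY": "minio-secret-key",
    "EXTRACTOR_S3_ENDPOINT_URL": "http://minio:9000",  # S3-compatible endpoint
})
subprocess.run(
    ["extractor", "run", "s3://my-bucket/docs/",
     "--source-filter", "*.pdf", "--source-concurrency", "8",
     "--out", "./output/"],
    env=env,
    check=True,
)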


Database sinks

Stream extracted records directly into a database with --sink. Records are written to the sink in batches, while normal JSONL output (stdout or --out) continues alongside.

# Write to Qdrant (chunk payloads only — add embeddings separately)
extractor run ./docs/ --mode chunks --sink qdrant --sink-uri http://localhost:6333

# Write to MongoDB
extractor run ./docs/ --mode chunks --sink mongodb --sink-uri mongodb://localhost:27017 --sink-database mydb --sink-table chunks

# Write to PostgreSQL (DSN from env: EXTRACTOR_PG_DSN)
extractor run ./docs/ --mode chunks --sink postgres

# Write to Elasticsearch
extractor run ./docs/ --mode chunks --sink elasticsearch --sink-uri http://localhost:9200 --sink-table my_index

# Write to ClickHouse
extractor run ./docs/ --mode chunks --sink clickhouse --sink-uri localhost:8123

# Write to Kafka topic
extractor run ./docs/ --mode chunks --sink kafka --sink-uri broker:9092 --sink-table my_topic

# POST batches to a webhook
extractor run ./docs/ --mode chunks --sink webhook --sink-uri https://my-api.example.com/ingest

# Adjust batch size (default: 1000)
extractor run ./docs/ --mode chunks --sink postgres --sink-batch 500

| Sink | Install | Connection |
|---|---|---|
| clickhouse | coresdk-extractor[clickhouse] | --sink-uri host:port or defaults to localhost:8123 |
| mongodb | coresdk-extractor[mongodb] | --sink-uri mongodb://... or defaults to mongodb://localhost:27017 |
| postgres / postgresql | coresdk-extractor[postgres] | --sink-uri postgresql://user:pass@host/db or EXTRACTOR_PG_DSN |
| elasticsearch / es | coresdk-extractor[elasticsearch] | --sink-uri http://... or EXTRACTOR_ES_URL |
| qdrant | coresdk-extractor[qdrant] | --sink-uri http://... or EXTRACTOR_QDRANT_URL |
| weaviate | coresdk-extractor[weaviate] | --sink-uri http://... or EXTRACTOR_WEAVIATE_URL |
| kafka | coresdk-extractor[kafka] | --sink-uri broker:9092 or EXTRACTOR_KAFKA_BROKERS |
| webhook / http_post | none (core) | --sink-uri https://... or EXTRACTOR_WEBHOOK_URL |

See docs/sinks.md for schema mapping details, auth env vars, and custom sink plugins.
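
The qdrant sink stores chunk payloads only, so vectors have to be added in a separate pass. A sketch of that pass, assuming a chunks JSONL written with --out, an existing collection named chunks, and an embed() function you supply (qdrant-client usage may vary by version):

# Sketch: add embeddings to extracted chunks and upsert them into Qdrant
import json
import uuid

from qdrant_client import QdrantClient
from qdrant_client.models import PointStruct

def embed(text: str) -> list[float]:
    raise NotImplementedError("plug in your embedding model here")

client = QdrantClient(url="http://localhost:6333")
points = []
with open("paper.chunks.jsonl") as f:
    for line in f:
        rec = json.loads(line)
        if rec.get("element_type") != "composite:chunk":
            continue  # keep chunk records only; skip envelope, stream_end, and non-chunk elements
        points.append(PointStruct(
            id=str(uuid.uuid5(uuid.NAMESPACE_URL, rec["id"])),  # deterministic UUID derived from the stable el_ id
            vector=embed(rec["text"]),
            payload={"id": rec["id"], "text": rec["text"], "section_path": rec["section_path"]},
        ))

client.upsert(collection_name="chunks", points=points)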


Config file

Place .extractor.toml in your project directory (or pass --config path/to/extractor.toml) to set defaults without repeating CLI flags.

[run]
mode = "chunks"
chunk_size = 512
tokenizer = "cl100k_base"

[ner]
enabled = true
model = "fastino/gliner2-base-v1"

[sink]
type = "qdrant"
uri = "http://localhost:6333"

[source]
concurrency = 8

[quality_gates]
min_chunks = 1
max_extraction_error_rate = 0.05

New in v1.2.0

  • GLiNER2 extended capabilities — classification (--classify-as), relation triples (--relations), structured field extraction (--extract-schema)
  • Source connectors — pull documents from S3/MinIO, Azure Blob/ADLS, GCS, HTTP/HTTPS, and IMAP email directly via URI
  • Database sinks — stream records into ClickHouse, MongoDB, PostgreSQL, Elasticsearch, Qdrant, Weaviate, Kafka, or any HTTP webhook
  • Table serialization modes — --table-text-mode markdown|nl-rows|nl-columns|hybrid controls how tables are serialized into chunk text
  • Chunk quality scoring — --quality emits ChunkQuality with lexical density, entity density, compression ratio, and heading coverage
  • Language detection — metadata.language populated per element when extractor[lang] is installed
  • Quality gates — configurable pass/fail thresholds in .extractor.toml under [quality_gates]; gate failures are recorded in the manifest
  • Figure extraction — --figures-dir exports figure assets (PNG/JPEG) alongside JSONL output
  • OpenTelemetry — extractor[otel] emits spans per document; configure via standard OTEL env vars (OTEL_EXPORTER_OTLP_ENDPOINT, etc.)
  • Dual-chunk mode — --mode dual-chunks produces both coarse parent chunks and fine child chunks linked by parent_chunk_id (sketched below)
  • Parallel workers — --workers N for local directory extraction; cloud sources use --source-concurrency N
  • Incremental processing — --incremental skips files unchanged since last run (SHA256 + run-config keyed JSON cache)
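
For dual-chunk mode, a sketch of re-linking children to their parents from the output JSONL (the exact placement of parent_chunk_id on child records is an assumption here; paths illustrative):

# Sketch: group fine child chunks under coarse parent chunks
import json
from collections import defaultdict

parents, children = {}, defaultdict(list)
with open("paper.dual.jsonl") as f:
    for line in f:
        rec = json.loads(line)
        if rec.get("type") in ("envelope", "stream_end"):
            continue
        parent_id = rec.get("parent_chunk_id")  # assumed to sit at the top level of child records
        if parent_id:
            children[parent_id].append(rec)
        else:
            parents[rec["id"]] = rec

for pid, parent in parents.items():
    print(pid, len(children[pid]), "child chunks")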

Entity extraction (NER)

# Extract with named entities (requires extractor[ner])
extractor run paper.pdf --mode chunks --entities

# Disable NER
extractor run paper.pdf --mode chunks --no-entities

# Custom entity types
extractor run paper.pdf --entities-types "person,organization,product" --entities-threshold 0.6

from extractor import extract

for chunk in extract("paper.pdf", mode="chunks", entities=True):
    print(chunk.entities)           # list of EntityAnnotation
    print(chunk.chunk_metadata.entity_types)  # ["organization", "person"]

Python API

from extractor import extract

# Elements mode — fine-grained semantic units
for el in extract("paper.pdf", mode="elements"):
    print(el.element_type, el.section_path, el.text[:120])

# Chunks mode — pre-committed, RAG-ready chunks
for chunk in extract("paper.pdf", mode="chunks", chunk_size=512):
    print(chunk.id, chunk.token_count, chunk.text[:120])

# Filter to only tables and headings
for el in extract("report.docx", include_types=["table:simple", "structural:section_header"]):
    if el.table:
        print(el.table.structured)  # {"headers": [...], "rows": [[...]]}

Error handling

from extractor import extract, QuarantineError, UnsupportedFormatError, ParserError

try:
    for el in extract("untrusted_file.pdf"):
        print(el.element_type, el.text[:80])
except QuarantineError as e:
    print(f"File rejected by security check: {e}")
except UnsupportedFormatError as e:
    print(f"Format not supported. Run `extractor info` for the full list.")
except ParserError as e:
    print(f"Parser failed: {e}")

Output schema

Every record is a JSON object. Key fields:

| Field | Type | Description |
|---|---|---|
| id | string | Content-addressed ID (el_ + 16 hex chars) |
| element_type | string | One of 47 canonical types (see below) |
| text | string | Plain-text content |
| section_path | string[] | Heading breadcrumb, e.g. ["Introduction", "Methods"] |
| section_path_tier | int | Quality: 1=native, 2=font-heuristic, 3=keyword, 4=positional |
| sequence_index | int | Document order, 0-based |
| page | int or null | Source page (1-based) |
| schema_version | string | "1.2.0" |
| source_filename | string | Source file name |
| source_sha256 | string | SHA-256 of source file |
| entities | EntityAnnotation[] or absent | Named entity annotations (absent = NER not run; [] = NER ran, nothing found) |
| table | object or null | {markdown, structured, has_header_row, row_count, col_count} |
| equation | object or null | {latex, plain_text, mathml} |
| figure | object or null | {caption, image_ref, image_sha256, ocr_text} |
| transcript | object or null | {speaker, start_time_s, end_time_s, word_timestamps} |
| admonition | object or null | {kind, title} |
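
One subtlety from the table above: entities absent means NER was not run, while an empty list means NER ran and found nothing. A minimal sketch of handling both cases (the file path is illustrative):

# Sketch: distinguish "NER not run" from "NER ran, nothing found" per record
import json

with open("paper.chunks.jsonl") as f:
    for line in f:
        rec = json.loads(line)
        if rec.get("type") in ("envelope", "stream_end"):
            continue
        if "entities" not in rec:
            status = "NER not run"
        elif not rec["entities"]:
            status = "NER ran, no entities found"
        else:
            status = f"{len(rec['entities'])} entities"
        print(rec["id"], status)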

Element types

structural:  title  section_header  subtitle  divider  page_header  page_footer
text:        narrative  abstract  admonition  pull_quote  footnote  caption  sidebar  transcript_segment
table:       simple  complex  continuation
code:        block  cell  inline
list:        item  item_ordered  item_definition
media:       figure  image  audio  video
scientific:  equation_display  equation_inline  citation  reference_entry  theorem  definition  proof
meta:        document_title  author  date  url  email  page_number  extraction_error
form:        field  label  checkbox
composite:   chunk

Atomic types (never split across chunks): table:simple, table:complex, table:continuation, media:figure, media:image, code:block, scientific:equation_display


JSONL envelope format

{"type":"envelope","extractor_version":"1.2.0","source":{...},"run_config":{...},"created_at":"..."}
{"id":"el_a1b2c3d4e5f6a7b8","element_type":"structural:title","text":"Introduction",...}
{"id":"el_...","element_type":"text:narrative","text":"...",...}
...
{"type":"stream_end","status":"complete","total_elements":42,"schema_version":"1.0.0"}

A manifest.json companion file is written alongside every --out file with full stats.
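
A sketch of consuming the stream in order: the first line is the envelope, element records follow, and the final stream_end line carries the completion status (the path is illustrative):

# Sketch: walk a JSONL stream, separating envelope, elements, and terminator
import json

elements = []
with open("paper.chunks.jsonl") as f:
    for line in f:
        rec = json.loads(line)
        if rec.get("type") == "envelope":
            print("extractor", rec["extractor_version"], "source:", rec["source"])
        elif rec.get("type") == "stream_end":
            assert rec["status"] == "complete", f"stream ended with status {rec['status']}"
            print("total elements:", rec["total_elements"])
        else:
            elements.append(rec)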


Supported formats

| Format | Library | Notes |
|---|---|---|
| PDF (digital) | pymupdf4llm | Fast, heading-aware |
| PDF (scientific) | GROBID TEI → pymupdf4llm fallback | Equations, citations, references |
| PDF (scanned) | surya-ocr → pymupdf fallback | Layout detection + OCR |
| DOCX | python-docx | Headings, tables, runs, images |
| XLSX | openpyxl | Sheet-per-section, dual-format tables |
| PPTX | python-pptx | Slide titles + body, speaker notes |
| HTML | trafilatura + lxml | Boilerplate removal, GFM alerts |
| Markdown | mistletoe (GFM) | Headings, tables, alerts, code fences |
| EPUB | ebooklib + BS4 | Spine-order chapter extraction |
| LaTeX | pure-regex parser | Sections, equations, tables, figures, bibliography |
| JSON | stdlib | Key-value pairs as narrative |
| XML | lxml | Title/paragraph heuristics |
| CSV | stdlib csv | Entire file as dual-format table |
| Plain text | heuristic | Heading pattern detection |
| Audio (mp3/wav/m4a/flac) | faster-whisper + pyannote | Diarization, word timestamps |

CLI reference

extractor run <target> [options]

  Output
  --mode            elements | chunks | dual-chunks  (default: elements)
  --out             Output file or directory (default: stdout)
  --quiet / -q      Suppress progress output
  --debug           Show full tracebacks on errors
  --include-full-path  Store absolute source path instead of filename

  Chunking
  --chunk-size      Max tokens per chunk (default: 512)
  --overlap         Overlap tokens between chunks (default: 0)
  --tokenizer       tiktoken encoding (default: cl100k_base)
  --context-prefix  Prepend section breadcrumb to each chunk text
  --parent-size     Token budget for coarse chunks in dual-chunks mode (default: 512)
  --child-size      Token budget for fine chunks in dual-chunks mode (default: 128)

  Extraction
  --strategy        fast | accurate | ocr  (default: fast)
  --include-types   Comma-separated element types to emit
  --exclude-types   Comma-separated element types to suppress
  --table-text-mode markdown | nl-rows | nl-columns | hybrid  (default: markdown)
  --figures-dir     Directory to export figure assets (PNG/JPEG)

  NER / GLiNER2
  --entities/--no-entities      Run GLiNER2 NER (default: on when extractor[ner] installed)
  --entities-model              Local GLiNER2 model (default: fastino/gliner2-base-v1)
  --entities-types              Comma-separated NER label list
  --entities-threshold          Min confidence score (default: 0.50)
  --classify-as                 Comma-separated classification labels
  --relations/--no-relations    Extract (subject, predicate, object) triples
  --relation-types              Comma-separated relation predicates
  --extract-schema              JSON file with schema dict for structured field extraction
  --extract-on                  Comma-separated element types for structured extraction
  --canonicalize/--no-canonicalize  Cross-document entity canonicalization
  --registry-path               Path to EntityRegistry JSON file

  Quality
  --quality/--no-quality        Emit ChunkQuality scores on each chunk

  Parallel / incremental
  --workers / -w    Parallel workers for local directory extraction (default: 1)
  --incremental     Skip files unchanged since last run (requires --out)

  Source connectors
  --source-filter       Glob pattern to filter remote files, e.g. "*.pdf"
  --source-concurrency  Max parallel downloads from remote sources (default: 4)
  --source-tmp-dir      Directory for temp files during remote download

  Database sinks
  --sink            clickhouse | mongodb | postgres | elasticsearch | qdrant | weaviate | kafka | webhook
  --sink-uri        Connection URI or host:port
  --sink-database   Database name (default: extractor)
  --sink-table      Table/collection/index/topic name (default: elements)
  --sink-batch      Batch size for database writes (default: 1000)

  Config
  --config          Path to extractor.toml config file

extractor view <jsonl-file>
  --max-text        Max chars per element (default: 200)
  --types / -t      Comma-separated element types to show
  --count / -n      Print element counts by type and exit
  --no-meta         Hide envelope/manifest lines

extractor validate <jsonl-file>
  --level           basic | schema | invariants  (default: schema)

extractor info [file]
extractor schema [element-type] [--json] [--type element|chunk|manifest] [--out file]
extractor cache clear [cache-file] [--older-than N]

Environment variables

All environment variables are optional. The library works out of the box without any of them — each one unlocks a specific optional capability.

PDF (scientific)

| Variable | Default | Description |
|---|---|---|
| GROBID_URL | http://localhost:8070 | URL of a running GROBID server. When set and reachable, scientific PDFs are parsed via GROBID TEI (better equation/citation/reference extraction). Without it, scientific PDFs automatically fall back to the standard digital PDF parser — no errors. |

Audio transcription — only relevant if you install extractor[audio] and pass audio files (MP3/WAV/M4A)

| Variable | Default | Description |
|---|---|---|
| HF_TOKEN | (none) | Hugging Face API token. Required only for speaker diarization ("who said what"). Without it, you still get a full transcript — just without speaker labels. Get a free token at huggingface.co/settings/tokens and accept the pyannote.audio model license. |
| WHISPER_MODEL | base | Whisper model size controlling accuracy vs. speed. base (~150 MB) is fast and good for most uses. Use large-v2 (~3 GB) for production-quality transcription. Options: tiny / base / small / medium / large-v2. |
| WHISPER_DEVICE | auto | Hardware to run Whisper on. Auto-detected: uses NVIDIA GPU (cuda), Apple Silicon (mps), or falls back to cpu. Set explicitly if auto-detection picks the wrong device. |
| WHISPER_COMPUTE | auto | Float precision for faster-whisper. int8 on CPU (faster, less RAM). float16 on GPU (fastest). float32 for maximum accuracy. Auto-set based on device. |
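
If you work with the transcripts from Python, a sketch of iterating segments (the transcript field names come from the schema table; attribute-style access via el.transcript is an assumption here):

# Sketch: print diarized transcript segments from an audio file
from extractor import extract

for el in extract("meeting.mp3", include_types=["text:transcript_segment"]):
    seg = el.transcript
    # speaker labels require HF_TOKEN for diarization; without it only timestamps are available
    print(f"[{seg.start_time_s:7.1f}s] {seg.speaker}: {el.text[:80]}")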

Entity extraction (NER) — only relevant if you install extractor[ner]

NER uses local inference only (fastino/gliner2-base-v1). No external API key required.


Exit codes

| Code | Meaning |
|---|---|
| 0 | Success |
| 1 | Partial success (some files failed) |
| 2 | Quarantine failure (file rejected) |
| 3 | Unsupported format |
| 4 | Parser crash |
| 5 | Output write error |
| 6 | Configuration error |
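
If you wrap the CLI in automation, these codes can be branched on directly. A sketch using subprocess (paths illustrative):

# Sketch: run the CLI and branch on the documented exit codes
import subprocess

result = subprocess.run(
    ["extractor", "run", "./docs/", "--mode", "chunks", "--out", "./out/"]
)
if result.returncode == 0:
    print("all files extracted")
elif result.returncode == 1:
    print("partial success: some files failed; check the manifest for details")
elif result.returncode == 2:
    print("a file was rejected by the quarantine gate")
else:
    print(f"extraction failed with exit code {result.returncode}")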

Documentation

| Document | Description |
|---|---|
| Protocol Specification | Full schema and protocol spec v1.2.0 |
| Source Connectors | S3, Azure, GCS, HTTP, IMAP connectors |
| Database Sinks | ClickHouse, MongoDB, Postgres, ES, Qdrant, Weaviate, Kafka, Webhook |
| Architecture | System architecture and design decisions |
| Writing a Parser | How to add support for a new format |
| Contributing | Dev setup, test workflow, PR guidelines |

License

MIT — see LICENSE.
