A modular, high-performance toolkit for extracting structured data from documents.

document-extraction-tools

A modular, high-performance toolkit for building document extraction pipelines. The library provides clear interfaces for every pipeline stage, plus orchestrators that wire the stages together with async I/O and CPU-bound parallelism.

This repo is intentionally implementation-light: you plug in your own components (readers, converters, extractors, exporters, evaluators) for each specific document type or data source.

Installation

Install from PyPI:

pip install document-extraction-tools

Or with uv:

uv add document-extraction-tools

Project layout

.
├── src                               
│   └── document_extraction_tools     
│       ├── base                      # abstract base classes you implement
│       │   ├── converter             # conversion interface definitions
│       │   ├── evaluator             # evaluation interface definitions
│       │   ├── exporter              # export interface definitions
│       │   ├── extractor             # extraction interface definitions
│       │   ├── file_lister           # file discovery interface definitions
│       │   ├── reader                # document read interface definitions
│       │   └── test_data_loader      # evaluation dataset loader interfaces
│       ├── config                    # Pydantic configs + YAML loader helpers
│       ├── runners                   # orchestrators that run pipelines
│       │   ├── evaluation            # evaluation pipeline orchestration
│       │   └── extraction            # extraction pipeline orchestration
│       ├── types                     # shared models/types used across modules
│       └── py.typed                  
├── tests                             
├── pull_request_template.md          
├── pyproject.toml                    
├── README.md                         
└── uv.lock

What this library gives you

A consistent set of interfaces for the entire document-extraction lifecycle.
A typed data model for documents, pages, and extraction results.
Orchestrators that run extraction and evaluation pipelines concurrently and safely.
A configuration system (Pydantic + YAML) for repeatable pipelines.

Core concepts and components

Data models

PathIdentifier: A uniform handle for file locations plus optional context.
DocumentBytes: Raw bytes + MIME type + path identifier.
Document: Parsed content (pages, text/image data, metadata).
ExtractionSchema: Your Pydantic model (the target output).
EvaluationExample: (path, ground truth) pair for evaluation runs.
EvaluationResult: Name + result + description for evaluation metrics.
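
As a rough mental model only, the first two types can be pictured as plain dataclasses. Apart from `path`, which appears in the evaluation usage example in this README, the field names below are assumptions made for illustration; the real definitions live in document_extraction_tools.types and may differ.

```python
from dataclasses import dataclass, field
from typing import Any


@dataclass(frozen=True)
class PathIdentifierSketch:
    """Stand-in for PathIdentifier: a file location plus optional context."""

    path: str
    context: dict[str, Any] = field(default_factory=dict)  # hypothetical field name


@dataclass(frozen=True)
class DocumentBytesSketch:
    """Stand-in for DocumentBytes: raw bytes, MIME type, and their origin."""

    file_bytes: bytes  # hypothetical field name
    mime_type: str  # hypothetical field name
    path_identifier: PathIdentifierSketch


pid = PathIdentifierSketch(path="invoices/0001.pdf", context={"batch": "2026-02"})
doc_bytes = DocumentBytesSketch(
    file_bytes=b"%PDF-1.7", mime_type="application/pdf", path_identifier=pid
)
print(doc_bytes.path_identifier.path)
```

The point of the sketch is the nesting: the bytes object carries its path identifier with it, so later stages can always trace a result back to its source file.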

Extraction pipeline

FileLister (BaseFileLister)
- Discovers input files and returns a list of PathIdentifier objects.
Reader (BaseReader)
- Reads raw bytes from the source and returns DocumentBytes.
Converter (BaseConverter)
- Converts raw bytes into a structured Document (pages, metadata, content type).
Extractor (BaseExtractor)
- Asynchronously extracts structured data into a Pydantic schema (ExtractionSchema).
ExtractionExporter (BaseExtractionExporter)
- Asynchronously persists extracted data to your desired destination (DB, files, API, etc.).
ExtractionOrchestrator
- Runs the pipeline with a thread pool for CPU-bound steps (read/convert) and async concurrency for I/O-bound steps (extract/export).

Evaluation pipeline

TestDataLoader (BaseTestDataLoader)
- Loads evaluation examples (ground truth + file path) as EvaluationExample.
Evaluator (BaseEvaluator)
- Computes a metric by comparing the ground-truth schema instance (true) with the predicted one (pred).
EvaluationExporter (BaseEvaluationExporter)
- Persists evaluation results.
EvaluationOrchestrator
- Runs extraction + evaluation across examples with the same concurrency model (thread pool + async I/O).

Configuration

Each component has a matching base config class (Pydantic model) that defines a default YAML filename and acts as the parent for your own config fields. You’ll subclass these to add settings specific to your implementation.

Extraction config base classes:

BaseFileListerConfig
BaseReaderConfig
BaseConverterConfig
BaseExtractorConfig
BaseExtractionExporterConfig
ExtractionOrchestratorConfig (you can use as-is; no need to subclass)

Evaluation-specific config base classes:

BaseTestDataLoaderConfig
BaseEvaluatorConfig
BaseEvaluationExporterConfig
EvaluationOrchestratorConfig (you can use as-is; no need to subclass)

How to implement an extraction pipeline

For a full worked example including evaluation, please see the document-extraction-examples repository. Below we outline the steps for a successful implementation.

1) Define your extraction schema

Create a Pydantic model that represents the structured data you want out of each document.

Example implementation:

from pydantic import BaseModel, Field

class InvoiceSchema(BaseModel):
    invoice_id: str = Field(..., description="Unique invoice identifier.")
    vendor: str = Field(..., description="Vendor or issuer name.")
    total: float = Field(..., description="Total invoice amount.")

2) Implement pipeline components

Subclass the base interfaces and implement the required methods.

Example implementations:

from document_extraction_tools.base import (
    BaseFileLister,
    BaseReader,
    BaseConverter,
    BaseExtractor,
    BaseExtractionExporter,
)
from document_extraction_tools.types import Document, DocumentBytes, PathIdentifier
from document_extraction_tools.config import (
    BaseFileListerConfig,
    BaseReaderConfig,
    BaseConverterConfig,
    BaseExtractorConfig,
    BaseExtractionExporterConfig,
)

class MyFileLister(BaseFileLister):
    def __init__(self, config: BaseFileListerConfig) -> None:
        super().__init__(config)

    def list_files(self) -> list[PathIdentifier]:
        # Discover and return file identifiers
        ...


class MyReader(BaseReader):
    def __init__(self, config: BaseReaderConfig) -> None:
        super().__init__(config)

    def read(self, path_identifier: PathIdentifier) -> DocumentBytes:
        # Read file bytes from disk, object storage, etc.
        ...


class MyConverter(BaseConverter):
    def __init__(self, config: BaseConverterConfig) -> None:
        super().__init__(config)

    def convert(self, document_bytes: DocumentBytes) -> Document:
        # Parse PDF, OCR, etc. and return a Document
        ...


class MyExtractor(BaseExtractor):
    def __init__(self, config: BaseExtractorConfig) -> None:
        super().__init__(config)

    async def extract(self, document: Document, schema: type[InvoiceSchema]) -> InvoiceSchema:
        # Call LLM or rules-based system
        ...


class MyExtractionExporter(BaseExtractionExporter):
    def __init__(self, config: BaseExtractionExporterConfig) -> None:
        super().__init__(config)

    async def export(self, document: Document, data: InvoiceSchema) -> None:
        # Persist data to DB, filesystem, etc.
        ...
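
To make the first two stubs more concrete, here is a minimal sketch of what the bodies of list_files and read might do for files on local disk. It uses only the standard library; the function names and the `*.pdf` pattern are hypothetical, and wrapping the results in PathIdentifier and DocumentBytes is left out because their constructors are not shown in this README.

```python
import mimetypes
from pathlib import Path


def list_files_sketch(root: Path, pattern: str = "*.pdf") -> list[Path]:
    """Return matching files under root in a stable (sorted) order."""
    return sorted(p for p in root.rglob(pattern) if p.is_file())


def read_sketch(path: Path) -> tuple[bytes, str]:
    """Return the file's raw bytes and a best-effort MIME type guess."""
    mime_type, _ = mimetypes.guess_type(path.name)
    # Fall back to a generic binary type when the extension is unknown.
    return path.read_bytes(), mime_type or "application/octet-stream"
```

Sorting the listing keeps runs reproducible, which helps when comparing outputs between pipeline runs.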

3) Create configuration models and YAML files

Each component has a base config class with a default filename (e.g. extractor.yaml). Subclass the config models to add your own fields, then provide YAML files in the directory you pass as config_dir to load_config (default is config/yaml/).

Default filenames:

extraction_orchestrator.yaml
file_lister.yaml
reader.yaml
converter.yaml
extractor.yaml
extraction_exporter.yaml

Example config model:

from document_extraction_tools.config import BaseExtractorConfig

class MyExtractorConfig(BaseExtractorConfig):
    model_name: str

Example YAML (config/yaml/extractor.yaml):

# add fields your Extractor config defines
model_name: "gemini-3-flash-preview"
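
The relationship between a config class, its default filename, and the YAML content can be sketched with a plain dataclass. This is a stand-in, not the library's mechanism: the real configs are Pydantic models loaded by load_config, and the `default_filename` attribute and `path_in` helper below are hypothetical names.

```python
from dataclasses import dataclass
from pathlib import Path
from typing import ClassVar


@dataclass
class ExtractorConfigSketch:
    """Stand-in for a BaseExtractorConfig subclass."""

    default_filename: ClassVar[str] = "extractor.yaml"  # hypothetical attribute name
    model_name: str  # your own field, filled from the YAML file

    @classmethod
    def path_in(cls, config_dir: Path) -> Path:
        """Where this config's YAML file is expected inside config_dir."""
        return config_dir / cls.default_filename


# The mapping a YAML parser would produce for the extractor.yaml shown above.
parsed = {"model_name": "gemini-3-flash-preview"}
config = ExtractorConfigSketch(**parsed)
print(ExtractorConfigSketch.path_in(Path("config/yaml")), config.model_name)
```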

4) Load config and run the pipeline

Example usage:

import asyncio
from pathlib import Path

from document_extraction_tools.config import ExtractionOrchestratorConfig, load_config
from document_extraction_tools.runners import ExtractionOrchestrator

# MyFileListerConfig, MyReaderConfig, etc. are your own config subclasses from step 3.

config = load_config(
    lister_config_cls=MyFileListerConfig,
    reader_config_cls=MyReaderConfig,
    converter_config_cls=MyConverterConfig,
    extractor_config_cls=MyExtractorConfig,
    exporter_config_cls=MyExtractionExporterConfig,
    orchestrator_config_cls=ExtractionOrchestratorConfig,
    config_dir=Path("config/yaml"),
)

orchestrator = ExtractionOrchestrator.from_config(
    config=config,
    schema=InvoiceSchema,
    reader_cls=MyReader,
    converter_cls=MyConverter,
    extractor_cls=MyExtractor,
    exporter_cls=MyExtractionExporter,
)

file_lister = MyFileLister(config.file_lister)
file_paths = file_lister.list_files()

asyncio.run(orchestrator.run(file_paths))

How to implement an evaluation pipeline

1) Implement evaluation pipeline components

The evaluation pipeline reuses your reader/converter/extractor and adds three pieces:

TestDataLoader: loads evaluation examples (file + ground truth)
Evaluator(s): compute metrics for each example
EvaluationExporter: persist results

Example implementations:

from document_extraction_tools.base import (
    BaseTestDataLoader,
    BaseEvaluator,
    BaseEvaluationExporter,
)
from document_extraction_tools.config import (
    BaseTestDataLoaderConfig,
    BaseEvaluatorConfig,
    BaseEvaluationExporterConfig,
)
from document_extraction_tools.types import (
    Document,
    EvaluationExample,
    EvaluationResult,
    PathIdentifier,
)


class MyTestDataLoader(BaseTestDataLoader[InvoiceSchema]):
    def __init__(self, config: BaseTestDataLoaderConfig) -> None:
        super().__init__(config)

    def load_test_data(
        self, path_identifier: PathIdentifier
    ) -> list[EvaluationExample[InvoiceSchema]]:
        # Load ground-truth + path pairs from disk/DB/etc.
        ...


class MyEvaluator(BaseEvaluator[InvoiceSchema]):
    def __init__(self, config: BaseEvaluatorConfig) -> None:
        super().__init__(config)

    def evaluate(
        self, true: InvoiceSchema, pred: InvoiceSchema
    ) -> EvaluationResult:
        # Compare true vs pred and return a metric
        ...


class MyEvaluationExporter(BaseEvaluationExporter):
    def __init__(self, config: BaseEvaluationExporterConfig) -> None:
        super().__init__(config)

    async def export(
        self, results: list[tuple[Document, list[EvaluationResult]]]
    ) -> None:
        # Persist evaluation results
        ...
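
As one possible metric, an evaluate body could report the fraction of schema fields that match exactly. The sketch below uses plain dataclasses as stand-ins for InvoiceSchema and EvaluationResult (whose name, result, and description fields follow the description in the data models list); exact-match field accuracy is an illustrative choice, not a metric the library prescribes.

```python
from dataclasses import asdict, dataclass


@dataclass
class InvoiceSketch:
    """Stand-in for the InvoiceSchema Pydantic model."""

    invoice_id: str
    vendor: str
    total: float


@dataclass
class ResultSketch:
    """Stand-in for EvaluationResult (name + result + description)."""

    name: str
    result: float
    description: str


def field_accuracy(true: InvoiceSketch, pred: InvoiceSketch) -> ResultSketch:
    """Score the share of fields where the prediction equals the ground truth."""
    true_fields, pred_fields = asdict(true), asdict(pred)
    matches = sum(pred_fields[key] == value for key, value in true_fields.items())
    return ResultSketch(
        name="field_accuracy",
        result=matches / len(true_fields),
        description="Fraction of schema fields where pred equals true exactly.",
    )


truth = InvoiceSketch("INV-1", "Acme Ltd", 120.0)
guess = InvoiceSketch("INV-1", "ACME", 120.0)
print(field_accuracy(truth, guess).result)  # two of three fields match
```

Exact matching is strict: the vendor strings above differ only in form, so a real evaluator might normalize text or allow a numeric tolerance before comparing.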

2) Create configuration models and YAML files

Implement your own config models by subclassing the base evaluation configs and adding any fields your components need.

Default YAML filenames for evaluation:

evaluation_orchestrator.yaml
test_data_loader.yaml
evaluator.yaml (one top-level key per evaluator config class name)
reader.yaml
converter.yaml
extractor.yaml
evaluation_exporter.yaml

Warning: Each top-level key in evaluator.yaml MUST match the name of an evaluator configuration class, and that class name MUST be the evaluator class name followed by the suffix Config. For example:

class MyEvaluator(BaseEvaluator):
    ...

class MyEvaluatorConfig(BaseEvaluatorConfig):
    ...

Example YAML (config/yaml/evaluator.yaml):

MyEvaluatorConfig:
  # add fields your Evaluator config defines
  threshold: 0.8
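
The naming rule in the warning above can be expressed in a few lines. This sketch only illustrates the convention; it is not the library's lookup code, and both helper functions are hypothetical.

```python
from typing import Any


class MyEvaluator:  # stand-in; the real class would subclass BaseEvaluator
    pass


def config_key_for(evaluator_cls: type) -> str:
    """Derive the expected top-level YAML key from the evaluator class name."""
    return f"{evaluator_cls.__name__}Config"


def select_evaluator_section(
    parsed_yaml: dict[str, dict[str, Any]], evaluator_cls: type
) -> dict[str, Any]:
    """Pick the settings block for one evaluator out of the parsed evaluator.yaml."""
    key = config_key_for(evaluator_cls)
    if key not in parsed_yaml:
        raise KeyError(
            f"evaluator.yaml has no top-level key {key!r}; found {sorted(parsed_yaml)}"
        )
    return parsed_yaml[key]


# The mapping a YAML parser would produce for the example file above.
parsed = {"MyEvaluatorConfig": {"threshold": 0.8}}
print(select_evaluator_section(parsed, MyEvaluator))
```

A misspelled key such as MyEvaluatorCfg fails the lookup, which is why the class names and the YAML keys have to be kept in sync.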

3) Load config and run the pipeline

Example usage:

import asyncio
from pathlib import Path

from document_extraction_tools.config import (
    EvaluationOrchestratorConfig,
    load_evaluation_config,
)
from document_extraction_tools.runners import EvaluationOrchestrator
from document_extraction_tools.types import PathIdentifier

config = load_evaluation_config(
    test_data_loader_config_cls=MyTestDataLoaderConfig,
    evaluator_config_classes=[MyEvaluatorConfig],
    reader_config_cls=MyReaderConfig,
    converter_config_cls=MyConverterConfig,
    extractor_config_cls=MyExtractorConfig,
    evaluation_exporter_config_cls=MyEvaluationExporterConfig,
    orchestrator_config_cls=EvaluationOrchestratorConfig,
    config_dir=Path("config/yaml"),
)

orchestrator = EvaluationOrchestrator.from_config(
    config=config,
    schema=InvoiceSchema,
    reader_cls=MyReader,
    converter_cls=MyConverter,
    extractor_cls=MyExtractor,
    test_data_loader_cls=MyTestDataLoader,
    evaluator_classes=[MyEvaluator],
    evaluation_exporter_cls=MyEvaluationExporter,
)

examples = MyTestDataLoader(config.test_data_loader).load_test_data(
    PathIdentifier(path="/path/to/eval-set")
)

asyncio.run(orchestrator.run(examples))

Concurrency model

Reader + Converter run in a thread pool (CPU-bound work).
Extractor + Exporter run concurrently in the event loop (I/O-bound work).
Tuning options live in extraction_orchestrator.yaml and evaluation_orchestrator.yaml:
- max_workers (thread pool size)
- max_concurrency (async I/O semaphore limit)
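
The two tuning options map onto a common asyncio pattern, sketched below with the standard library only. The stage functions are trivial stand-ins and run_sketch is a hypothetical name; the actual orchestrators may be structured differently.

```python
import asyncio
from concurrent.futures import ThreadPoolExecutor


def read_and_convert(name: str) -> str:
    """Blocking stage (stand-in for Reader + Converter)."""
    return name.upper()


async def extract_and_export(document: str, sink: list[str]) -> None:
    """Async stage (stand-in for Extractor + Exporter)."""
    await asyncio.sleep(0)  # placeholder for a network call
    sink.append(document)


async def run_sketch(
    names: list[str], max_workers: int = 4, max_concurrency: int = 2
) -> list[str]:
    loop = asyncio.get_running_loop()
    semaphore = asyncio.Semaphore(max_concurrency)
    sink: list[str] = []

    async def process(name: str, pool: ThreadPoolExecutor) -> None:
        # Blocking work goes to the thread pool so the event loop stays free.
        document = await loop.run_in_executor(pool, read_and_convert, name)
        # The semaphore caps how many extract/export calls are in flight at once.
        async with semaphore:
            await extract_and_export(document, sink)

    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        await asyncio.gather(*(process(name, pool) for name in names))
    return sorted(sink)


print(asyncio.run(run_sketch(["a.pdf", "b.pdf", "c.pdf"])))
```

In this pattern max_workers bounds the blocking read/convert work, while max_concurrency bounds the number of simultaneous calls to an external service such as an LLM API.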

Development

Install dependencies: uv sync
Run pre-commit: uv run pre-commit run --all-files
Run tests: uv run pytest

Releasing

Test release (TestPyPI)

Create a release branch and bump version:

git checkout -b release/v0.2.0-rc1
uv version --bump rc
# Or manually: uv version 0.2.0-rc1

Commit and push the branch:

VERSION=$(uv version --short)
git add pyproject.toml
git commit -m "Bump version to $VERSION"
git push -u origin release/v$VERSION

Create and merge a PR to main.

Tag the merge commit and push:

git checkout main && git pull
VERSION=$(uv version --short)
git tag "v$VERSION"
git push --tags

The publish-test.yaml workflow automatically publishes to TestPyPI.

Verify installation:

uv pip install --index-url https://test.pypi.org/simple/ --extra-index-url https://pypi.org/simple/ document-extraction-tools

Production release (PyPI)

Create a release branch and bump version:

git checkout -b release/v0.2.0
uv version --bump minor  # or: major, minor, patch

Commit and push the branch:

VERSION=$(uv version --short)
git add pyproject.toml
git commit -m "Bump version to $VERSION"
git push -u origin release/v$VERSION

Create and merge a PR to main.

Tag the merge commit and create the release:

git checkout main && git pull
VERSION=$(uv version --short)
git tag "v$VERSION"
git push --tags
gh release create "v$VERSION" --title "v$VERSION" --generate-notes

The publish.yaml workflow automatically builds, publishes to PyPI, and runs smoke tests.

Contributing

Contributions are welcome. Please:

Report bugs or feature requests by opening an issue.
Create a new branch using the following naming conventions: feat/short-description, fix/short-description, etc.
Describe the change clearly in the PR description.
Add or update tests in tests/.
Run linting and tests before pushing: uv run pre-commit run --all-files and uv run pytest.
If you open a PR, please notify the maintainers (Ollie Kemp or Nikolas Moatsos).

Project details

Maintainers: oliver.kemp

Release history

1.0.0: Feb 6, 2026
0.2.0: Feb 2, 2026
0.1.1: Feb 2, 2026 (this version)
0.1.0: Feb 2, 2026
0.0.1rc1: Feb 2, 2026 (pre-release)

Download files

Source Distribution: document_extraction_tools-0.1.1.tar.gz (16.6 kB), uploaded Feb 2, 2026
Built Distribution: document_extraction_tools-0.1.1-py3-none-any.whl (34.9 kB, Python 3), uploaded Feb 2, 2026

File details

Details for the file document_extraction_tools-0.1.1.tar.gz.

File metadata

Download URL: document_extraction_tools-0.1.1.tar.gz
Upload date: Feb 2, 2026
Size: 16.6 kB
Tags: Source
Uploaded using Trusted Publishing? Yes
Uploaded via: twine/6.1.0 CPython/3.13.7

File hashes

Hashes for document_extraction_tools-0.1.1.tar.gz
Algorithm	Hash digest
SHA256	`f35c62ec4791dad8ee3b6554102bec68956fd86ba3c131547fb02076523a5c65`
MD5	`423f69927e53ff68d4dce634ff2390b1`
BLAKE2b-256	`e978c2361edbdae218f857fcb75a17b5c5a99714beb2010ab8ff957445bda0a1`

Provenance

The following attestation bundles were made for document_extraction_tools-0.1.1.tar.gz:

Publisher: publish.yaml on artefactory-uk/document-extraction-tools

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.

Statement:
- Statement type: https://in-toto.io/Statement/v1
- Predicate type: https://docs.pypi.org/attestations/publish/v1
- Subject name: document_extraction_tools-0.1.1.tar.gz
- Subject digest: f35c62ec4791dad8ee3b6554102bec68956fd86ba3c131547fb02076523a5c65
- Sigstore transparency entry: 903553484
- Sigstore integration time: Feb 2, 2026
Source repository:
- Permalink: artefactory-uk/document-extraction-tools@5bf0dda0763d5142129094b1d3ada11fef4f103f
- Branch / Tag: refs/tags/v0.1.1
- Owner: https://github.com/artefactory-uk
- Access: public
Publication detail:
- Token Issuer: https://token.actions.githubusercontent.com
- Runner Environment: github-hosted
- Publication workflow: publish.yaml@5bf0dda0763d5142129094b1d3ada11fef4f103f
- Trigger Event: release

File details

Details for the file document_extraction_tools-0.1.1-py3-none-any.whl.

File metadata

Download URL: document_extraction_tools-0.1.1-py3-none-any.whl
Upload date: Feb 2, 2026
Size: 34.9 kB
Tags: Python 3
Uploaded using Trusted Publishing? Yes
Uploaded via: twine/6.1.0 CPython/3.13.7

File hashes

Hashes for document_extraction_tools-0.1.1-py3-none-any.whl
Algorithm	Hash digest
SHA256	`7602617f1373a67d21852f9e8aa84f2d3141b646b731b05b1e76ace903cde77c`
MD5	`86b375c8885a7c4e5198f8a725bc365d`
BLAKE2b-256	`23bfd0f54b7b827ba3fadd66f54b6b178ff7047425862853dc521e0b60ca3166`

Provenance

The following attestation bundles were made for document_extraction_tools-0.1.1-py3-none-any.whl:

Publisher: publish.yaml on artefactory-uk/document-extraction-tools

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.

Statement:
- Statement type: https://in-toto.io/Statement/v1
- Predicate type: https://docs.pypi.org/attestations/publish/v1
- Subject name: document_extraction_tools-0.1.1-py3-none-any.whl
- Subject digest: 7602617f1373a67d21852f9e8aa84f2d3141b646b731b05b1e76ace903cde77c
- Sigstore transparency entry: 903553553
- Sigstore integration time: Feb 2, 2026
Source repository:
- Permalink: artefactory-uk/document-extraction-tools@5bf0dda0763d5142129094b1d3ada11fef4f103f
- Branch / Tag: refs/tags/v0.1.1
- Owner: https://github.com/artefactory-uk
- Access: public
Publication detail:
- Token Issuer: https://token.actions.githubusercontent.com
- Runner Environment: github-hosted
- Publication workflow: publish.yaml@5bf0dda0763d5142129094b1d3ada11fef4f103f
- Trigger Event: release
