Document data extraction made by Dark Matter

Extractly

Extractly is a Python library for turning unstructured text into structured, typed data. Built on top of Pydantic AI, it orchestrates large language model calls, merges incremental field updates, and optionally performs OCR through Mistral so you can ingest PDFs or images.

Features

LLM-powered extraction: Uses Pydantic AI agents (default model google-gla:gemini-2.5-flash-lite) to turn raw text into structured fields.
Schema-aware control: Provide a Schema describing entities and fields, or enable discovery mode with identify_fields=True.
Action pipeline: Reconcile incremental model responses via configurable handlers (merge lists, upsert tables, etc.).
OCR integration: Convert images and documents into Markdown with the built-in Mistral OCR helper before extraction.
Chunking built in: Large inputs are automatically split into manageable chunks to stay within model limits.
Confidence scoring: Every ExtractedField includes a 0-1 confidence score for downstream quality checks.
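
The confidence score lends itself to a simple downstream gate. The sketch below is illustrative only: `FieldStub` is a hypothetical stand-in carrying the `name`, `value`, and `confidence` attributes used in the Quick Start examples, not the library's `ExtractedField` class, and the 0.8 threshold is an arbitrary assumption.

```python
from dataclasses import dataclass


@dataclass
class FieldStub:
    """Hypothetical stand-in for an extracted field (not the library class)."""
    name: str
    value: object
    confidence: float  # 0-1, as described in the feature list


def split_by_confidence(fields, threshold=0.8):
    """Return (accepted, needs_review) lists based on the confidence score."""
    accepted = [f for f in fields if f.confidence >= threshold]
    needs_review = [f for f in fields if f.confidence < threshold]
    return accepted, needs_review


fields = [
    FieldStub("invoice_number", "12345", 0.97),
    FieldStub("issue_date", "2024-01-15", 0.91),
    FieldStub("total_due", "$1,250.00", 0.62),
]
accepted, needs_review = split_by_confidence(fields)
print([f.name for f in accepted])      # fields safe to use automatically
print([f.name for f in needs_review])  # fields to route to a manual check
```

A suitable threshold depends on the document type and the model, so it is worth calibrating against a handful of hand-checked documents.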

Requirements

Python 3.13 or newer (per pyproject.toml)
Credentials for your LLM provider. The default model google-gla:gemini-2.5-flash-lite expects GOOGLE_API_KEY to be set or otherwise visible to Pydantic AI.
A Mistral API key (MISTRAL_API_KEY) when using the OCR helpers.

A local .env file is loaded automatically via python-dotenv, making it convenient to store credentials during development.
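
A quick pre-flight check can catch a missing key before any model call is made. This is a minimal sketch using only the standard library; `missing_credentials` is a hypothetical helper, not part of Extractly, and it assumes the default Google model is in use.

```python
import os


def missing_credentials(env, use_ocr=False):
    """List the required variable names that are absent or empty in env."""
    required = ["GOOGLE_API_KEY"]  # expected by the default model
    if use_ocr:
        required.append("MISTRAL_API_KEY")  # only needed for the OCR helpers
    return [name for name in required if not env.get(name)]


# Check the real environment; pass a plain dict instead to test the helper.
missing = missing_credentials(os.environ, use_ocr=True)
if missing:
    print("Missing credentials: " + ", ".join(missing))
```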

Installation

Install from PyPI using pip:

pip install extractly

Or with uv:

uv add extractly

For development, clone the repository and install with dev dependencies:

git clone https://github.com/Darkmatter-AI/extractly.git
cd extractly
pip install ".[dev]"

Quick Start

With the package installed and credentials exported, you can start extracting data programmatically.

Discover fields automatically

from extractly import Extractor

content = """
Invoice #12345
Issued on 2024-01-15
Total due: $1,250.00
"""

extractor = Extractor(content=content)
extracted_schema = extractor.extract_fields()

for entity in extracted_schema.entities:
    for field in entity.fields:
        print(f"{field.name}: {field.value} (confidence: {field.confidence:.2f})")

Extractor.extract_fields() returns an ExtractedSchema, which exposes the extracted fields via fields_by_id keyed by <entity name>.<field name>.
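
To make the key format concrete, here is a small sketch of how such an index could be built from entities and fields. `FieldStub`, `EntityStub`, and `index_fields` are hypothetical stand-ins written for this illustration; the library's own classes and implementation may differ.

```python
from dataclasses import dataclass, field


@dataclass
class FieldStub:
    """Hypothetical stand-in for an extracted field."""
    name: str
    value: object


@dataclass
class EntityStub:
    """Hypothetical stand-in for an extracted entity."""
    name: str
    fields: list = field(default_factory=list)


def index_fields(entities):
    """Map '<entity name>.<field name>' to the field object."""
    return {
        f"{entity.name}.{item.name}": item
        for entity in entities
        for item in entity.fields
    }


entities = [
    EntityStub("invoice", [FieldStub("invoice_number", "12345"),
                           FieldStub("total_due", "$1,250.00")]),
]
by_id = index_fields(entities)
print(by_id["invoice.total_due"].value)  # $1,250.00
```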

Schema-driven extraction

from extractly import Extractor
from extractly.schemas import Schema, SchemaEntity, SchemaField, Table

invoice_schema = Schema(
    name="Invoice",
    description="Fields expected in an invoice document.",
    entities=[
        SchemaEntity(
            name="invoice",
            description="Top-level invoice information.",
            fields=[
                SchemaField(
                    name="invoice_number",
                    description="Unique invoice identifier.",
                    data_type="string",
                    example="INV-12345",
                ),
                SchemaField(
                    name="amount_due",
                    description="Total amount owed.",
                    data_type="currency",
                    example="$1,250.00",
                ),
                SchemaField(
                    name="line_items",
                    description="Table of invoice line items.",
                    data_type="table<description, quantity, price>",
                    example=Table(
                        headers=["Description", "Quantity", "Price"],
                        rows=[["Design work", 10, "$125.00"]],
                    ),
                ),
            ],
        ),
    ],
)

invoice_text = """
Invoice INV-12345
Amount due: $1,250.00
Line items:
- Design work, 10 hours @ $125.00
"""

extractor = Extractor(
    content=invoice_text,
    schema=invoice_schema,
    identify_fields=False,  # Only return fields defined in the schema
)

extracted_schema = extractor.extract_fields()
for field_id, field in extracted_schema.fields_by_id.items():
    print(f"{field_id}: {field.value}")

Leave identify_fields=True (the default) if you want the agent to return schema fields and discover additional fields that look relevant.

Schema definitions can also be loaded from JSON. For example, the sample invoice schema used in the examples can be loaded with:

from pathlib import Path
from extractly.schemas import Schema

schema = Schema.model_validate_json(
    Path("samples/invoice/invoice_schema.json").read_text()
)

OCR then extract

from pathlib import Path
from extractly import Extractor
from extractly.ocr import OCR
from extractly.schemas import Schema

schema = Schema.model_validate_json(
    Path("samples/invoice/invoice_schema.json").read_text()
)

ocr = OCR()
extractor = Extractor.from_file(
    input_file_path=Path("samples/invoice/invoice_image.jpg"),
    use_ocr=True,
    ocr_service=ocr,
    # use ocr_filename if the file name is missing/hashed to improve type detection
    ocr_filename="invoice.jpg",
    schema=schema,
    identify_fields=False,
)

extracted_schema = extractor.extract_fields()
for entity in extracted_schema.entities:
    for field in entity.fields:
        print(f"{field.name}: {field.value} (confidence: {field.confidence})")

Set use_ocr=True to have the extractor run OCR before chunking; leave it False to read text files directly (you can pass encoding= for non-UTF-8 text). The OCR service automatically detects whether the file is an image when is_image is omitted. Pass is_image=True or False to override the detection, ocr_filename when the original file name is missing or extensionless, and ocr_output_file_path to save the rendered Markdown.

OCR.extract_text_from_file_path also accepts PDFs and can optionally write the Markdown output to disk.
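
How the OCR helper decides whether a file is an image is not described here. As a rough illustration of why a meaningful `ocr_filename` helps, the sketch below guesses the type from the file extension with the standard library's `mimetypes` module. `looks_like_image` is a hypothetical helper, not Extractly's detection logic.

```python
import mimetypes


def looks_like_image(filename, is_image=None):
    """Guess whether filename is an image; an explicit is_image wins."""
    if is_image is not None:
        return is_image
    mime_type, _ = mimetypes.guess_type(filename)
    return bool(mime_type) and mime_type.startswith("image/")


print(looks_like_image("invoice.jpg"))            # True
print(looks_like_image("invoice.pdf"))            # False
print(looks_like_image("3f9a1c", is_image=True))  # True: the override wins
print(looks_like_image("3f9a1c"))                 # False: no extension to go on
```

A hashed or extensionless name gives extension-based detection nothing to work with, which is the case the `ocr_filename` and `is_image` parameters are meant to cover.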

Batch extraction with multiple schemas

You can process multiple documents with different schemas in a single call:

from extractly import Extractor, DocumentInput
from extractly.schemas import Schema

# Define your schemas...
invoice_schema = Schema(name="Invoice", ...)
receipt_schema = Schema(name="Receipt", ...)

extractor = Extractor(
    schemas=[invoice_schema, receipt_schema],
)

documents = [
    DocumentInput(file_path="invoice.jpg", schema_name="Invoice", use_ocr=True),
    DocumentInput(file_path="receipt.pdf", schema_name="Receipt", use_ocr=True),
]

result = extractor.extract_fields(documents)
for doc_result in result.results:
    if doc_result.error:
        print(f"Error processing {doc_result.document_id}: {doc_result.error}")
    else:
        print(f"Extracted {doc_result.schema_name} from {doc_result.document_id}")

If you set infer_schema=True in DocumentInput, the extractor will attempt to identify the correct schema automatically by analyzing the document content and matching it against the provided schemas.

DocumentInput(
    file_path="unknown_receipt.jpg",
    # schema_name is omitted, so we must explicitly ask to infer it
    infer_schema=True,
    use_ocr=True,
)

Architecture

High-Level Extraction Flow

flowchart TD
    A[Input Content] --> B{OCR Needed?}
    B -->|Yes| C[OCR Service]
    B -->|No| D[Raw Text]
    C --> D
    D --> E[Extractor]
    S[Schema Optional] -.->|provided| E
    S -.->|not provided| E
    E --> F[Chunk Content]
    F --> G{More Chunks?}
    G -->|Yes| H[Process Chunk]
    G -->|No| M[Return Extracted Fields]
    H --> I[LLM Agent]
    I --> J[Field Responses]
    J --> K[Action Service]
    K --> L[Update Field Repository]
    L --> G

    style S fill:#f9f,stroke:#333,stroke-dasharray: 5 5

Component Architecture

classDiagram
    class Extractor {
        +content: str
        +fields: FieldRepository
        +model: Model
        +schema: Schema
        +extract_fields() ExtractedSchema
        +process_chunk() list
        +handle_field_response()
    }

    class FieldRepository {
        -_extracted_fields: dict
        -_schema: Schema
        +upsert_extracted_field()
        +get_extracted_field()
        +extracted_fields: dict
        +build_extracted_schema() ExtractedSchema
    }

    class ActionService {
        -_actions: dict
        +register()
        +dispatch()
        +available_actions()
    }

    class PromptService {
        +system_prompt: str
        +get_user_message()
    }

    class Agent {
        +run_sync()
    }

    class OCR {
        +extract_text_from_file_path()
        +extract_text_from_bytes()
        +extract_text_from_file_url()
    }

    Extractor --> FieldRepository
    Extractor --> ActionService
    Extractor --> PromptService
    Extractor --> Agent
    OCR --> Extractor: provides content

Chunk Processing Pipeline

sequenceDiagram
    participant S as Schema (optional)
    participant E as Extractor
    participant PS as PromptService
    participant C as Chunking
    participant A as Agent
    participant AS as ActionService
    participant FR as FieldRepository

    Note over S,E: Schema may or may not be provided
    S--)E: schema (optional)
    E->>PS: initialize with schema + identify_fields
    E->>FR: initialize with schema
    E->>C: chunk_markdown(content)
    C-->>E: list of chunks

    loop For each chunk
        E->>PS: get_user_message(chunk, fields)
        PS-->>E: prompt with schema constraints<br/>or field discovery mode
        E->>A: process_chunk(chunk)
        A->>A: Apply system prompt<br/>+ user message
        A-->>E: list[FieldResponse]

        loop For each response
            E->>AS: handle_field_response()
            AS->>AS: dispatch to action handler
            AS->>FR: update field
            FR-->>AS: updated
        end
    end

    E->>FR: build_extracted_schema
    FR-->>E: ExtractedSchema
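
The loop in the diagram can be sketched in a few lines. This is a toy model under stated assumptions, not the library's code: `run_pipeline`, `stub_agent`, and `store` are hypothetical names, the "agent" is a plain function that reads `key: value` lines instead of calling an LLM, and the repository is an ordinary dict.

```python
def run_pipeline(chunks, agent, apply_response):
    """Feed each chunk to the agent and apply every response it returns."""
    repository = {}
    for chunk in chunks:
        # The agent sees the fields gathered so far, as in the diagram.
        for response in agent(chunk, repository):
            apply_response(response, repository)
    return repository


def stub_agent(chunk, known_fields):
    """Stand-in for the LLM call: treats 'key: value' lines as fields."""
    responses = []
    for line in chunk.splitlines():
        if ":" in line:
            key, value = line.split(":", 1)
            field_id = key.strip().lower().replace(" ", "_")
            responses.append((field_id, value.strip()))
    return responses


def store(response, repository):
    """Simplest possible action: the last write wins."""
    field_id, value = response
    repository[field_id] = value


result = run_pipeline(
    ["Invoice number: INV-12345", "Amount due: $1,250.00"],
    stub_agent,
    store,
)
print(result)
```

Passing the repository into the agent on every iteration is what lets later chunks refine or extend fields found in earlier ones.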

Action Handling Flow

flowchart LR
    A[Field Response] --> B{Action Type}
    B -->|add_new_field| C[Create New Field]
    B -->|replace_value_in_existing_field| D[Replace Value]
    B -->|add_value_to_existing_list| E[Append to List]
    B -->|add_row_to_existing_table_field| F[Add Table Row]

    C --> G[Field Repository]
    D --> G
    E --> G
    F --> G
    G --> H[Updated Extracted Fields]
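
The four branches above amount to a lookup from action name to handler. The sketch below mirrors the action names from the diagram but is otherwise an assumption: the handlers operate on a plain dict rather than the library's field repository, and their signatures are invented for illustration.

```python
def add_new_field(fields, field_id, value):
    """Create a field that does not exist yet."""
    fields[field_id] = value


def replace_value_in_existing_field(fields, field_id, value):
    """Overwrite the value of a field that is already present."""
    if field_id not in fields:
        raise KeyError(field_id)
    fields[field_id] = value


def add_value_to_existing_list(fields, field_id, value):
    """Append one item to a list-valued field."""
    fields[field_id].append(value)


def add_row_to_existing_table_field(fields, field_id, value):
    """Append one row to a table-valued field."""
    fields[field_id]["rows"].append(value)


# Register each handler under its own function name.
HANDLERS = {
    handler.__name__: handler
    for handler in (
        add_new_field,
        replace_value_in_existing_field,
        add_value_to_existing_list,
        add_row_to_existing_table_field,
    )
}


def dispatch(fields, action, field_id, value):
    """Route one response to the handler registered under its action name."""
    HANDLERS[action](fields, field_id, value)


fields = {}
dispatch(fields, "add_new_field", "invoice.tags", ["design"])
dispatch(fields, "add_value_to_existing_list", "invoice.tags", "urgent")
dispatch(fields, "add_new_field", "invoice.line_items",
         {"headers": ["Description", "Quantity"], "rows": []})
dispatch(fields, "add_row_to_existing_table_field", "invoice.line_items",
         ["Design work", 10])
dispatch(fields, "add_new_field", "invoice.amount_due", "$1,200.00")
dispatch(fields, "replace_value_in_existing_field", "invoice.amount_due",
         "$1,250.00")
print(fields)
```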

Schema Modes

flowchart TD
    A[Extractor Configuration] --> B{Schema Provided?}
    B -->|Yes| C{identify_fields?}
    B -->|No| D[Pure Discovery Mode]

    C -->|True| E[Hybrid Mode]
    C -->|False| F[Schema-Only Mode]

    D --> D1[Agent discovers all fields<br/>from content]
    E --> E1[Agent extracts schema fields<br/>+ discovers additional fields]
    F --> F1[Agent only extracts<br/>predefined schema fields]

    D1 --> G[Field Responses]
    E1 --> G
    F1 --> G

    style D fill:#e1f5ff
    style E fill:#fff4e1
    style F fill:#ffe1f5
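
The decision tree above reduces to two inputs. The helper below is a hypothetical restatement of the flowchart for quick reference, not a function exposed by the library.

```python
def schema_mode(schema_provided, identify_fields=True):
    """Name the extraction mode implied by the two configuration inputs."""
    if not schema_provided:
        # Without a schema there is nothing to constrain the agent.
        return "pure discovery"
    return "hybrid" if identify_fields else "schema-only"


print(schema_mode(schema_provided=False))                        # pure discovery
print(schema_mode(schema_provided=True))                         # hybrid
print(schema_mode(schema_provided=True, identify_fields=False))  # schema-only
```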

Configuration

Export GOOGLE_API_KEY (or configure another Pydantic AI-supported model via the model parameter).
Export MISTRAL_API_KEY when using the OCR helpers.
Add either value to .env if you prefer not to export in the shell; the package loads it on import.

Tune the extraction behaviour via the Extractor parameters:

model: Override the default LLM (google-gla:gemini-2.5-flash-lite).
max_chunk_size: Control chunking; defaults to 3000 characters.
schema: Provide a Schema to constrain extraction targets.
identify_fields: Toggle auto-discovery of new fields (defaults to True).
actions: Supply custom Action handlers to override the default merge logic.
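
To give a feel for what `max_chunk_size` controls, here is one plausible way to split text on paragraph boundaries under a character limit. The library's actual chunking algorithm is not described in this document, so `chunk_text` is purely an illustrative assumption.

```python
def chunk_text(text, max_chunk_size=3000):
    """Greedily pack paragraphs into chunks of at most max_chunk_size chars.

    A single paragraph longer than the limit is kept whole, which makes the
    limit a soft one in this sketch.
    """
    chunks, current = [], ""
    for paragraph in text.split("\n\n"):
        candidate = f"{current}\n\n{paragraph}" if current else paragraph
        if current and len(candidate) > max_chunk_size:
            # Adding this paragraph would overflow: close the current chunk.
            chunks.append(current)
            current = paragraph
        else:
            current = candidate
    if current:
        chunks.append(current)
    return chunks


text = "\n\n".join(["a" * 40, "b" * 40, "c" * 40])
chunks = chunk_text(text, max_chunk_size=100)
print([len(chunk) for chunk in chunks])  # [82, 40]
```

Smaller chunks mean more model calls but less risk of exceeding context limits; larger chunks give the model more surrounding context per call.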

Schemas and data types

Schemas describe the fields you care about:

Schema groups one or more SchemaEntity definitions.
Each SchemaEntity contains SchemaField objects with a name, optional description, data_type string hint, and example value.
Schema.field_ids (and ExtractedSchema.field_ids) expose the canonical <entity>.<field> identifiers used as keys in extraction results.
Table represents tabular data with headers and rows. You can pass a Table or a JSON-serialisable dict with the same shape as the example value.

The data_type string is forwarded to the agent as a hint (for example, table<date, hours, project>). Use any structure that helps the model return the right shape.
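
The headers-plus-rows shape is easy to post-process. The sketch below works on a plain dict with `headers` and `rows` keys, matching the shape of the `Table` example in the Quick Start; `table_to_records` is a hypothetical helper and does not use the library's `Table` class.

```python
def table_to_records(table):
    """Convert {'headers': [...], 'rows': [[...], ...]} to a list of dicts."""
    headers = table["headers"]
    records = []
    for row in table["rows"]:
        if len(row) != len(headers):
            raise ValueError(f"row {row!r} does not match headers {headers!r}")
        records.append(dict(zip(headers, row)))
    return records


line_items = {
    "headers": ["Description", "Quantity", "Price"],
    "rows": [["Design work", 10, "$125.00"]],
}
print(table_to_records(line_items))
# [{'Description': 'Design work', 'Quantity': 10, 'Price': '$125.00'}]
```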

Actions

Extraction results can arrive incrementally across chunks. The action service resolves those updates using handler functions. The default handlers (registered by ActionService) are:

handle_add_new_field
handle_replace_value_in_existing_field
handle_add_value_to_existing_list
handle_add_row_to_existing_table_field

Each action name matches the handler’s qualified name and is surfaced to the agent so it can choose how to merge a response. You can append or replace handlers:

from extractly import Extractor
from extractly.actions.schemas import Action
from extractly.fields import FieldRepository
from extractly.schemas import FieldResponse

def custom_action(field_response: FieldResponse, fields: FieldRepository) -> None:
    # Custom merge logic goes here.
    ...

extractor = Extractor(
    content="Your content...",
    actions=[
        Action(
            handler=custom_action,
            description="Describe when the agent should call this action.",
        )
    ],
)

Examples

Ready-to-run scripts live under examples/:

extract_discover_fields.py – Discover fields without providing a schema (python examples/extract_discover_fields.py)
extract_from_invoice_text.py – Schema-driven extraction from sample invoice text (python examples/extract_from_invoice_text.py)
extract_from_contract_text.py – Apply a JSON schema to contract-like Markdown (python examples/extract_from_contract_text.py)
extract_given_fields_dry_run.py – Run in dry-run mode to inspect the generated prompts (python examples/extract_given_fields_dry_run.py)
extract_list_fields.py – Work with list-typed schema fields (python examples/extract_list_fields.py)
extract_table_data.py – Capture table-shaped data with schema hints (python examples/extract_table_data.py)
extract_with_ocr.py – Combine OCR with schema-based extraction (python examples/extract_with_ocr.py)
batch_extraction.py – Process multiple files with different schemas (python examples/batch_extraction.py)

API reference

`Extractor`

content: str – Text to analyse.
model: models.Model | str – Pydantic AI model to use (google-gla:gemini-2.5-flash-lite by default).
max_chunk_size: int – Soft limit for chunking (3000 by default).
schema: Schema | None – Schema describing the fields you want returned.
actions: list[Action] | None – Optional custom action handlers (defaults registered automatically).
identify_fields: bool – When True, the agent may return new fields beyond the schema.

Key methods:

extract_fields(dry_run: bool = False) -> ExtractedSchema
process_chunk(chunk: str, dry_run: bool = False) -> list[FieldResponse]
handle_field_response(field_response: FieldResponse) -> None

Set dry_run=True to inspect prompt construction without calling the model.

`OCR`

Helper around Mistral OCR:

extract_text_from_file_path(input_file_path, output_file_path=None, filename=None, is_image=None)
extract_text_from_bytes(content, filename, is_image=None)
extract_text_from_file_url(file_url, filename, is_image=None)

Schemas and responses

Schema, SchemaEntity, SchemaField
ExtractedSchema, ExtractedEntity, ExtractedField
FieldResponse
Table
DefaultActionsT – Literal union of the canonical action identifiers ("add_new_field", "replace_value_in_existing_field", "add_value_to_existing_list", "add_row_to_existing_table_field").

Development

Running tests

pytest tests/

Type checking

basedpyright

Formatting

Format code with ruff:

uv run ruff format .

Pre-commit hook

Install the development dependencies with uv (they include pre-commit) and install the hook so ruff runs before every commit:

uv sync --extra dev
uv run pre-commit install

Run the hook manually with uv run pre-commit run --all-files if you want to lint the entire repository before pushing changes.

Publishing (maintainers only)

These steps publish the extractly package itself via our CI pipeline. No personal access token (PAT) is required on your machine.

Bump the version in pyproject.toml.
Commit the change, then tag the release:
```
git tag -a v0.1.2 -m "Release 0.1.2"
```
Replace 0.1.2 with the version you just set.
Push the commit and tag so CI can build and publish:
```
git push --follow-tags
```

License

This project is licensed under the MIT License.

Contributing

Contributions are welcome! Please feel free to submit a Pull Request.

Project details

Release history

This version

0.4.0

Jan 13, 2026

0.3.0

Jan 13, 2026

0.2.3

Jan 13, 2026

0.2.2

Jan 13, 2026

0.2.1

Jan 13, 2026

Download files

Source Distribution

extractly-0.4.0.tar.gz (33.0 kB)

Uploaded Jan 13, 2026 Source

Built Distribution

extractly-0.4.0-py3-none-any.whl (22.1 kB)

Uploaded Jan 13, 2026 Python 3

File details

Details for the file extractly-0.4.0.tar.gz.

File metadata

Download URL: extractly-0.4.0.tar.gz
Upload date: Jan 13, 2026
Size: 33.0 kB
Tags: Source
Uploaded using Trusted Publishing? Yes
Uploaded via: twine/6.1.0 CPython/3.13.7

File hashes

Hashes for extractly-0.4.0.tar.gz
Algorithm	Hash digest
SHA256	`b4f4cc588754ff4da6d625bd9c761c3d4a93d8693cf4d2dfa937f4f2ef651606`
MD5	`9284d1ae4b15de6aa088fd53b7ca880a`
BLAKE2b-256	`718905dc8b6cab9bbf070cf853fc0a6946577dba45c0941d07d23e5c5a42cbb9`

Provenance

The following attestation bundles were made for extractly-0.4.0.tar.gz:

Publisher: release.yml on Darkmatter-AI/extractly

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.

Statement:
- Statement type: https://in-toto.io/Statement/v1
- Predicate type: https://docs.pypi.org/attestations/publish/v1
- Subject name: extractly-0.4.0.tar.gz
- Subject digest: b4f4cc588754ff4da6d625bd9c761c3d4a93d8693cf4d2dfa937f4f2ef651606
- Sigstore transparency entry: 819182273
- Sigstore integration time: Jan 13, 2026
Source repository:
- Permalink: Darkmatter-AI/extractly@a257d6d378d2a89f86fc62f55068c670f0b960d5
- Branch / Tag: refs/heads/main
- Owner: https://github.com/Darkmatter-AI
- Access: private
Publication detail:
- Token Issuer: https://token.actions.githubusercontent.com
- Runner Environment: github-hosted
- Publication workflow: release.yml@a257d6d378d2a89f86fc62f55068c670f0b960d5
- Trigger Event: workflow_dispatch

File details

Details for the file extractly-0.4.0-py3-none-any.whl.

File metadata

Download URL: extractly-0.4.0-py3-none-any.whl
Upload date: Jan 13, 2026
Size: 22.1 kB
Tags: Python 3
Uploaded using Trusted Publishing? Yes
Uploaded via: twine/6.1.0 CPython/3.13.7

File hashes

Hashes for extractly-0.4.0-py3-none-any.whl
Algorithm	Hash digest
SHA256	`99540fd15dc07561608b7ab3158289657869fcac528686f9cda9a15d0fd20ee5`
MD5	`43dce8e3ecd984a8878d69f230993e83`
BLAKE2b-256	`51e6997ca8784e1902c2cefcc8d744a06ba5a85b2bc3b9d4b43c14b55037092a`

Provenance

The following attestation bundles were made for extractly-0.4.0-py3-none-any.whl:

Publisher: release.yml on Darkmatter-AI/extractly

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.

Statement:
- Statement type: https://in-toto.io/Statement/v1
- Predicate type: https://docs.pypi.org/attestations/publish/v1
- Subject name: extractly-0.4.0-py3-none-any.whl
- Subject digest: 99540fd15dc07561608b7ab3158289657869fcac528686f9cda9a15d0fd20ee5
- Sigstore transparency entry: 819182311
- Sigstore integration time: Jan 13, 2026
Source repository:
- Permalink: Darkmatter-AI/extractly@a257d6d378d2a89f86fc62f55068c670f0b960d5
- Branch / Tag: refs/heads/main
- Owner: https://github.com/Darkmatter-AI
- Access: private
Publication detail:
- Token Issuer: https://token.actions.githubusercontent.com
- Runner Environment: github-hosted
- Publication workflow: release.yml@a257d6d378d2a89f86fc62f55068c670f0b960d5
- Trigger Event: workflow_dispatch
