Extractly
Extractly is a Python library for turning unstructured text into structured, typed data. Built on top of Pydantic AI, it orchestrates large language model calls, merges incremental field updates, and optionally performs OCR through Mistral so you can ingest PDFs or images.
Features
- LLM-powered extraction: Uses Pydantic AI agents (default model `google-gla:gemini-2.5-flash-lite`) to turn raw text into structured fields.
- Schema-aware control: Provide a `Schema` describing entities and fields, or enable discovery mode with `identify_fields=True`.
- Action pipeline: Reconcile incremental model responses via configurable handlers (merge lists, upsert tables, etc.).
- OCR integration: Convert images and documents into Markdown with the built-in Mistral OCR helper before extraction.
- Chunking built in: Large inputs are automatically split into manageable chunks to stay within model limits.
- Confidence scoring: Every `ExtractedField` includes a 0-1 confidence score for downstream quality checks.
Requirements
- Python 3.13 or newer (per `pyproject.toml`)
- Credentials for your LLM provider. The default model `google-gla:gemini-2.5-flash-lite` expects `GOOGLE_API_KEY` to be set or otherwise visible to Pydantic AI.
- A Mistral API key (`MISTRAL_API_KEY`) when using the OCR helpers.
A local `.env` file is loaded automatically via `python-dotenv`, making it convenient to store credentials during development.
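For example, a minimal `.env` might contain (placeholder values):

```
GOOGLE_API_KEY=your-google-api-key
MISTRAL_API_KEY=your-mistral-api-key
```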
Installation
Install from PyPI using pip:
```
pip install extractly
```
Or with uv:
```
uv add extractly
```
For development, clone the repository and install with dev dependencies:
```
git clone https://github.com/Darkmatter-AI/extractly.git
cd extractly
pip install ".[dev]"
```
Quick Start
With the package installed and credentials exported, you can start extracting data programmatically.
Discover fields automatically
```python
from extractly import Extractor

content = """
Invoice #12345
Issued on 2024-01-15
Total due: $1,250.00
"""

extractor = Extractor(content=content)
extracted_schema = extractor.extract_fields()

for entity in extracted_schema.entities:
    for field in entity.fields:
        print(f"{field.name}: {field.value} (confidence: {field.confidence:.2f})")
```
`Extractor.extract_fields()` returns an `ExtractedSchema`, which exposes the extracted fields via `fields_by_id`, keyed by `<entity name>.<field name>`.
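For direct lookups, something like the following should work (a sketch: the `invoice.amount_due` key assumes the agent discovered an entity named `invoice` with an `amount_due` field):

```python
# Look up a single field by its "<entity>.<field>" identifier.
field = extracted_schema.fields_by_id.get("invoice.amount_due")  # hypothetical key
if field is not None:
    print(field.value, field.confidence)
```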
Schema-driven extraction
```python
from extractly import Extractor
from extractly.schemas import Schema, SchemaEntity, SchemaField, Table

invoice_schema = Schema(
    name="Invoice",
    description="Fields expected in an invoice document.",
    entities=[
        SchemaEntity(
            name="invoice",
            description="Top-level invoice information.",
            fields=[
                SchemaField(
                    name="invoice_number",
                    description="Unique invoice identifier.",
                    data_type="string",
                    example="INV-12345",
                ),
                SchemaField(
                    name="amount_due",
                    description="Total amount owed.",
                    data_type="currency",
                    example="$1,250.00",
                ),
                SchemaField(
                    name="line_items",
                    description="Table of invoice line items.",
                    data_type="table<description, quantity, price>",
                    example=Table(
                        headers=["Description", "Quantity", "Price"],
                        rows=[["Design work", 10, "$125.00"]],
                    ),
                ),
            ],
        ),
    ],
)

invoice_text = """
Invoice INV-12345
Amount due: $1,250.00
Line items:
- Design work, 10 hours @ $125.00
"""

extractor = Extractor(
    content=invoice_text,
    schema=invoice_schema,
    identify_fields=False,  # Only return fields defined in the schema
)
extracted_schema = extractor.extract_fields()

for field_id, field in extracted_schema.fields_by_id.items():
    print(f"{field_id}: {field.value}")
```
Leave `identify_fields=True` (the default) if you want the agent to both return schema fields and discover additional fields that look relevant.
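In other words, dropping the flag from the example above switches to hybrid mode:

```python
# Hybrid mode (the default): extract the schema's fields and let the agent
# add any other fields it considers relevant. Reuses invoice_schema and
# invoice_text from the example above.
extractor = Extractor(content=invoice_text, schema=invoice_schema)
extracted_schema = extractor.extract_fields()
```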
Schema definitions can also be loaded from JSON. For example, the sample invoice schema used in the examples can be loaded with:
```python
from pathlib import Path

from extractly.schemas import Schema

schema = Schema.model_validate_json(
    Path("samples/invoice/invoice_schema.json").read_text()
)
```
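`Schema` is a Pydantic model, so the reverse direction works the same way (a sketch; the output path is arbitrary):

```python
from pathlib import Path

# Serialise a schema built in code back to JSON via standard Pydantic methods.
Path("invoice_schema_copy.json").write_text(schema.model_dump_json(indent=2))
```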
OCR then extract
```python
from pathlib import Path

from extractly import Extractor
from extractly.ocr import OCR
from extractly.schemas import Schema

schema = Schema.model_validate_json(
    Path("samples/invoice/invoice_schema.json").read_text()
)

ocr = OCR()
extractor = Extractor.from_file(
    input_file_path=Path("samples/invoice/invoice_image.jpg"),
    use_ocr=True,
    ocr_service=ocr,
    # Use ocr_filename if the file name is missing/hashed to improve type detection.
    ocr_filename="invoice.jpg",
    schema=schema,
    identify_fields=False,
)
extracted_schema = extractor.extract_fields()

for entity in extracted_schema.entities:
    for field in entity.fields:
        print(f"{field.name}: {field.value} (confidence: {field.confidence})")
```
Set `use_ocr=True` to have the extractor run OCR before chunking; leave it `False` to read text files directly (you can pass `encoding=` for non-UTF-8 text). The OCR service automatically detects whether the file is an image when `is_image` is omitted. Pass `is_image=True` or `False` to override the detection, `ocr_filename` when the original file name is missing or extensionless, and `ocr_output_file_path` to save the rendered Markdown.
`OCR.extract_text_from_file_path` also accepts PDFs and can optionally write the Markdown output to disk.
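Used standalone, that looks roughly like this (a sketch; the input and output paths are placeholders):

```python
from pathlib import Path

from extractly.ocr import OCR

ocr = OCR()
markdown = ocr.extract_text_from_file_path(
    input_file_path=Path("samples/invoice/invoice.pdf"),  # placeholder path
    output_file_path=Path("invoice.md"),  # optional: persist the rendered Markdown
)
```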
Batch extraction with multiple schemas
You can process multiple documents with different schemas in a single call:
```python
from extractly import Extractor, DocumentInput
from extractly.schemas import Schema

# Define your schemas...
invoice_schema = Schema(name="Invoice", ...)
receipt_schema = Schema(name="Receipt", ...)

extractor = Extractor(
    schemas=[invoice_schema, receipt_schema],
)

documents = [
    DocumentInput(file_path="invoice.jpg", schema_name="Invoice", use_ocr=True),
    DocumentInput(file_path="receipt.pdf", schema_name="Receipt", use_ocr=True),
]

result = extractor.extract_fields(documents)

for doc_result in result.results:
    if doc_result.error:
        print(f"Error processing {doc_result.document_id}: {doc_result.error}")
    else:
        print(f"Extracted {doc_result.schema_name} from {doc_result.document_id}")
```
Architecture
High-Level Extraction Flow
```mermaid
flowchart TD
    A[Input Content] --> B{OCR Needed?}
    B -->|Yes| C[OCR Service]
    B -->|No| D[Raw Text]
    C --> D
    D --> E[Extractor]
    S[Schema Optional] -.->|provided| E
    S -.->|not provided| E
    E --> F[Chunk Content]
    F --> G{More Chunks?}
    G -->|Yes| H[Process Chunk]
    G -->|No| M[Return Extracted Fields]
    H --> I[LLM Agent]
    I --> J[Field Responses]
    J --> K[Action Service]
    K --> L[Update Field Repository]
    L --> G

    style S fill:#f9f,stroke:#333,stroke-dasharray: 5 5
```
Component Architecture
```mermaid
classDiagram
    class Extractor {
        +content: str
        +fields: FieldRepository
        +model: Model
        +schema: Schema
        +extract_fields() ExtractedSchema
        +process_chunk() list
        +handle_field_response()
    }
    class FieldRepository {
        -_extracted_fields: dict
        -_schema: Schema
        +upsert_extracted_field()
        +get_extracted_field()
        +extracted_fields: dict
        +build_extracted_schema() ExtractedSchema
    }
    class ActionService {
        -_actions: dict
        +register()
        +dispatch()
        +available_actions()
    }
    class PromptService {
        +system_prompt: str
        +get_user_message()
    }
    class Agent {
        +run_sync()
    }
    class OCR {
        +extract_text_from_file_path()
        +extract_text_from_bytes()
        +extract_text_from_file_url()
    }

    Extractor --> FieldRepository
    Extractor --> ActionService
    Extractor --> PromptService
    Extractor --> Agent
    OCR --> Extractor : provides content
```
Chunk Processing Pipeline
```mermaid
sequenceDiagram
    participant S as Schema (optional)
    participant E as Extractor
    participant PS as PromptService
    participant C as Chunking
    participant A as Agent
    participant AS as ActionService
    participant FR as FieldRepository

    Note over S,E: Schema may or may not be provided
    S--)E: schema (optional)
    E->>PS: initialize with schema + identify_fields
    E->>FR: initialize with schema
    E->>C: chunk_markdown(content)
    C-->>E: list of chunks

    loop For each chunk
        E->>PS: get_user_message(chunk, fields)
        PS-->>E: prompt with schema constraints<br/>or field discovery mode
        E->>A: process_chunk(chunk)
        A->>A: Apply system prompt<br/>+ user message
        A-->>E: list[FieldResponse]
        loop For each response
            E->>AS: handle_field_response()
            AS->>AS: dispatch to action handler
            AS->>FR: update field
            FR-->>AS: updated
        end
    end

    E->>FR: build_extracted_schema
    FR-->>E: ExtractedSchema
```
Action Handling Flow
```mermaid
flowchart LR
    A[Field Response] --> B{Action Type}
    B -->|add_new_field| C[Create New Field]
    B -->|replace_value_in_existing_field| D[Replace Value]
    B -->|add_value_to_existing_list| E[Append to List]
    B -->|add_row_to_existing_table_field| F[Add Table Row]
    C --> G[Field Repository]
    D --> G
    E --> G
    F --> G
    G --> H[Updated Extracted Fields]
```
Schema Modes
```mermaid
flowchart TD
    A[Extractor Configuration] --> B{Schema Provided?}
    B -->|Yes| C{identify_fields?}
    B -->|No| D[Pure Discovery Mode]
    C -->|True| E[Hybrid Mode]
    C -->|False| F[Schema-Only Mode]
    D --> D1[Agent discovers all fields<br/>from content]
    E --> E1[Agent extracts schema fields<br/>+ discovers additional fields]
    F --> F1[Agent only extracts<br/>predefined schema fields]
    D1 --> G[Field Responses]
    E1 --> G
    F1 --> G

    style D fill:#e1f5ff
    style E fill:#fff4e1
    style F fill:#ffe1f5
```
Configuration
- Export `GOOGLE_API_KEY` (or configure another Pydantic AI-supported model via the `model` parameter).
- Export `MISTRAL_API_KEY` when using the OCR helpers.
- Add either value to `.env` if you prefer not to export in the shell; the package loads it on import.
Tune the extraction behaviour via the `Extractor` parameters:
- `model`: Override the default LLM (`google-gla:gemini-2.5-flash-lite`).
- `max_chunk_size`: Control chunking; defaults to `3000` characters.
- `schema`: Provide a `Schema` to constrain extraction targets.
- `identify_fields`: Toggle auto-discovery of new fields (defaults to `True`).
- `actions`: Supply custom `Action` handlers to override the default merge logic.
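For example, a run that swaps the model and tightens chunking might look like this (a sketch; the alternative model id is only an illustration of a Pydantic AI identifier):

```python
from extractly import Extractor

extractor = Extractor(
    content="Your content...",
    model="google-gla:gemini-2.5-flash",  # assumed: any Pydantic AI model id works here
    max_chunk_size=1500,  # smaller chunks mean more, but cheaper, model calls
)
```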
Schemas and data types
Schemas describe the fields you care about:
- `Schema` groups one or more `SchemaEntity` definitions.
- Each `SchemaEntity` contains `SchemaField` objects with a `name`, optional `description`, `data_type` string hint, and `example` value.
- `Schema.field_ids` (and `ExtractedSchema.field_ids`) expose the canonical `<entity>.<field>` identifiers used as keys in extraction results.
- `Table` represents tabular data with `headers` and `rows`. You can pass a `Table` or a JSON-serialisable dict with the same shape as the example value.
The `data_type` string is forwarded to the agent as a hint (for example, `table<date, hours, project>`). Use any structure that helps the model return the right shape.
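As a sketch, the dict form of a table example mentioned above mirrors the shape of `Table`:

```python
from extractly.schemas import SchemaField

line_items = SchemaField(
    name="line_items",
    description="Table of invoice line items.",
    data_type="table<description, quantity, price>",
    # A JSON-serialisable dict mirroring Table(headers=..., rows=...).
    example={
        "headers": ["Description", "Quantity", "Price"],
        "rows": [["Design work", 10, "$125.00"]],
    },
)
```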
Actions
Extraction results can arrive incrementally across chunks. The action service resolves those updates using handler functions. The default handlers (registered by `ActionService`) are:
- `handle_add_new_field`
- `handle_replace_value_in_existing_field`
- `handle_add_value_to_existing_list`
- `handle_add_row_to_existing_table_field`
Each action name matches the handler’s qualified name and is surfaced to the agent so it can choose how to merge a response. You can append or replace handlers:
```python
from extractly import Extractor
from extractly.actions.schemas import Action
from extractly.fields import FieldRepository
from extractly.schemas import FieldResponse


def custom_action(field_response: FieldResponse, fields: FieldRepository) -> None:
    # Custom merge logic goes here.
    ...


extractor = Extractor(
    content="Your content...",
    actions=[
        Action(
            handler=custom_action,
            description="Describe when the agent should call this action.",
        )
    ],
)
```
Examples
Ready-to-run scripts live under `examples/`:
- `extract_discover_fields.py` – Discover fields without providing a schema (`python examples/extract_discover_fields.py`)
- `extract_from_invoice_text.py` – Schema-driven extraction from sample invoice text (`python examples/extract_from_invoice_text.py`)
- `extract_from_contract_text.py` – Apply a JSON schema to contract-like Markdown (`python examples/extract_from_contract_text.py`)
- `extract_given_fields_dry_run.py` – Run in dry-run mode to inspect the generated prompts (`python examples/extract_given_fields_dry_run.py`)
- `extract_list_fields.py` – Work with list-typed schema fields (`python examples/extract_list_fields.py`)
- `extract_table_data.py` – Capture table-shaped data with schema hints (`python examples/extract_table_data.py`)
- `extract_with_ocr.py` – Combine OCR with schema-based extraction (`python examples/extract_with_ocr.py`)
- `batch_extraction.py` – Process multiple files with different schemas (`python examples/batch_extraction.py`)
API reference
Extractor
- `content: str` – Text to analyse.
- `model: models.Model | str` – Pydantic AI model to use (`google-gla:gemini-2.5-flash-lite` by default).
- `max_chunk_size: int` – Soft limit for chunking (`3000` by default).
- `schema: Schema | None` – Schema describing the fields you want returned.
- `actions: list[Action] | None` – Optional custom action handlers (defaults registered automatically).
- `identify_fields: bool` – When `True`, the agent may return new fields beyond the schema.
Key methods:
- `extract_fields(dry_run: bool = False) -> ExtractedSchema`
- `process_chunk(chunk: str, dry_run: bool = False) -> list[FieldResponse]`
- `handle_field_response(field_response: FieldResponse) -> None`
Set `dry_run=True` to inspect prompt construction without calling the model.
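A minimal dry-run sketch:

```python
from extractly import Extractor

extractor = Extractor(content="Invoice INV-12345, amount due $1,250.00")
# Per the signature above this still returns an ExtractedSchema,
# but no model call is made.
preview = extractor.extract_fields(dry_run=True)
```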
OCR
Helper around Mistral OCR:
- `extract_text_from_file_path(input_file_path, output_file_path=None, filename=None, is_image=None)`
- `extract_text_from_bytes(content, filename, is_image=None)`
- `extract_text_from_file_url(file_url, filename, is_image=None)`
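For example, OCR over in-memory bytes might look like this (a sketch; the file name is a placeholder):

```python
from extractly.ocr import OCR

ocr = OCR()
with open("scan.jpg", "rb") as f:  # placeholder file
    markdown = ocr.extract_text_from_bytes(f.read(), filename="scan.jpg", is_image=True)
```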
Schemas and responses
- `Schema`, `SchemaEntity`, `SchemaField`
- `ExtractedSchema`, `ExtractedEntity`, `ExtractedField`
- `FieldResponse`
- `Table`
- `DefaultActionsT` – Literal union of the canonical action identifiers (`"add_new_field"`, `"replace_value_in_existing_field"`, `"add_value_to_existing_list"`, `"add_row_to_existing_table_field"`).
Development
Running tests
```
pytest tests/
```
Type checking
```
basedpyright
```
Formatting
Format code with ruff:
```
uv run ruff format .
```
Pre-commit hook
Install the development dependencies with uv (they include pre-commit) and install the hook so ruff runs before every commit:
```
uv sync --extra dev
uv run pre-commit install
```
Run the hook manually with `uv run pre-commit run --all-files` if you want to lint the entire repository before pushing changes.
Publishing (maintainers only)
These steps publish the extractly package itself via our CI pipeline. No PAT is required on your machine.
1. Bump the version in `pyproject.toml`.
2. Commit the change, then tag the release, replacing `0.1.2` with the version you just set:

   ```
   git tag -a v0.1.2 -m "Release 0.1.2"
   ```

3. Push the commit and tag so CI can build and publish:

   ```
   git push --follow-tags
   ```
License
This project is licensed under the MIT License.
Contributing
Contributions are welcome! Please feel free to submit a Pull Request.