A provider-agnostic, entity-centric LLM-powered document entity extraction tool

Project description

entityxtract

Entity-first, schema-driven extraction of structured data from unstructured documents (PDF, DOCX, TXT, images). Define custom entities with schemas, few-shot examples, and instructions, then extract reliably using any local or SOTA LLM.

Built as an open-source alternative to Google Cloud Document AI, Azure AI Document Intelligence, and Adobe PDF Extract — but provider-agnostic and designed to work with any LLM.

Features

🎯 Entity-first extraction — Smart structured data extraction with pre-defined / auto-identified entities.
📄 Multiple document formats — Support for PDF, TXT, MD, and images.
🔀 Smart input modes — Extract information using text, OCR, or hybrid approaches.
🌐 Provider-agnostic design — Works with any LLM via OpenAI-compatible APIs.
🔄 Robust execution — Built-in retries, parallel extraction, strictly structured and typed output.
📊 Observability — Structured logs, token usage tracking, and optional cost tracking.
📦 PyPI Package — Easily install and use entityxtract in your projects.
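
The "robust execution" bullet above mentions retries and parallel extraction. As a rough illustration of that general pattern (not entityxtract's actual implementation), the following standard-library sketch retries a failing call with exponential backoff and runs several independent tasks concurrently. The names `with_retries` and `run_parallel` are hypothetical helpers invented for this example.

```python
import time
from concurrent.futures import ThreadPoolExecutor


def with_retries(task, attempts=3, base_delay=0.1):
    """Call task() until it succeeds or attempts run out, backing off exponentially."""
    for attempt in range(1, attempts + 1):
        try:
            return task()
        except Exception:
            if attempt == attempts:
                raise  # out of attempts: surface the last error
            time.sleep(base_delay * 2 ** (attempt - 1))


def run_parallel(tasks, max_workers=4):
    """Run independent zero-argument tasks concurrently, keeping input order."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(with_retries, tasks))
```

Each task here is a zero-argument callable, so one slow or flaky request does not block the others, and results come back in the order the tasks were given.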

Coming Soon

🌐 FastAPI REST API for remote extraction services.
🖥️ Web UI for visual entity/schema management and job monitoring.
🔍 Auto-detect mode to automatically identify extractable entities in documents.
💰 Cost optimization using PDF annotation caching and smart input data pruning.
👁️ DeepSeek OCR integration for enhanced document processing.
🔌 MCP server for agentic applications.

Installation

To use entityxtract, you'll need Python 3.12+ and uv (recommended):

# Install uv if you haven't already
curl -LsSf https://astral.sh/uv/install.sh | sh

# Clone the repository
git clone https://github.com/Prathamesh-Ghatole/entityxtract.git
cd entityxtract

# Install dependencies
uv sync

Getting Started

Extract pre-defined entities:

from pathlib import Path
import polars as pl
from entityxtract.extractor_types import (
    Document, TableToExtract, ObjectsToExtract, 
    ExtractionConfig, FileInputMode
)
from entityxtract.extractor import extract_objects

# 1. Load your document
doc = Document(Path("document.pdf"))

# 2. Define what to extract
table = TableToExtract(
    name="Events",
    example_table=pl.DataFrame([
        {"Time": "02:05", "Type": "Operation", "Description": "Example event"},
        {"Time": "03:25", "Type": "Transit", "Description": "Another event"}
    ]),
    instructions="Extract the events table with Time, Type, and Description columns.",
    required=True
)

# 3. Configure extraction
config = ExtractionConfig(
    model_name="google/gemini-2.5-flash",  # Recommended
    temperature=0.0,
    file_input_modes=[FileInputMode.FILE]
)

# 4. Extract!
results = extract_objects(doc, ObjectsToExtract(objects=[table], config=config))

# Use your results
for name, result in results.results.items():
    if result.success:
        df = pl.DataFrame(result.extracted_data)
        print(df)
    else:
        print(f"Failed: {result.message}")

Configuration

Copy the sample environment file .env.sample to .env, or set the following environment variables directly:

# For all OpenAI-compatible endpoints (OpenAI, OpenRouter, Ollama, LM Studio, etc.)
export OPENAI_API_KEY="your-api-key"
export OPENAI_API_BASE="https://openrouter.ai/api/v1"

# Default model
export OPENAI_DEFAULT_MODEL="google/gemini-2.5-flash"

Usage Examples

Complete Example with Multiple Entities

from pathlib import Path
import polars as pl

from entityxtract.extractor_types import (
    Document, ExtractionConfig, FileInputMode,
    TableToExtract, StringToExtract, ObjectsToExtract
)
from entityxtract.extractor import extract_objects

# Load document
doc = Document(Path("reports/quarterly_summary.pdf"))

# Define entities to extract
table = TableToExtract(
    name="Financial Summary",
    example_table=pl.DataFrame([
        {"Quarter": "Q1 2024", "Revenue": "$1.2M", "Expenses": "$800K", "Profit": "$400K"},
        {"Quarter": "Q2 2024", "Revenue": "$1.5M", "Expenses": "$900K", "Profit": "$600K"}
    ]),
    instructions="Extract the quarterly financial summary table with Quarter, Revenue, Expenses, and Profit columns.",
    required=True
)

report_id = StringToExtract(
    name="Report ID",
    example_string="RPT-2024-Q2-001",
    instructions="Extract the report identifier from the document header.",
    required=False
)

# Configure extraction with cost tracking
config = ExtractionConfig(
    model_name="google/gemini-2.5-flash",
    temperature=0.0,
    file_input_modes=[FileInputMode.FILE],
    parallel_requests=4,
    calculate_costs=True
)

# Run extraction
objects = ObjectsToExtract(objects=[table, report_id], config=config)
results = extract_objects(doc, objects)

# Process results
for name, res in results.results.items():
    if res.success:
        print(f"✓ [{name}] extracted successfully")
        print(f"  Tokens: {res.input_tokens} in / {res.output_tokens} out")
        print(f"  Cost: ${res.cost:.4f}")
        
        # Export table to CSV
        if isinstance(res.extracted_data, list):
            df = pl.DataFrame(res.extracted_data)
            df.write_csv(f"{name}.csv")
            print(f"  Saved to {name}.csv")
    else:
        print(f"✗ [{name}] failed: {res.message}")

print(f"\nTotals: {results.total_input_tokens} tokens in, {results.total_output_tokens} tokens out")
print(f"Total cost: ${results.total_cost:.4f}")
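
For intuition about what a per-request cost figure represents, the arithmetic is typically token counts multiplied by per-token prices. The sketch below shows that calculation in isolation; it is not entityxtract's cost logic, `estimate_cost` is a hypothetical helper, and the prices are made-up placeholders.

```python
from decimal import Decimal


def estimate_cost(input_tokens, output_tokens, price_in_per_million, price_out_per_million):
    """Return the cost in USD for one request, given per-million-token prices."""
    million = Decimal(1_000_000)
    cost_in = Decimal(input_tokens) / million * Decimal(str(price_in_per_million))
    cost_out = Decimal(output_tokens) / million * Decimal(str(price_out_per_million))
    return cost_in + cost_out


# Hypothetical prices, for illustration only; check your provider's price list.
cost = estimate_cost(12_000, 800, price_in_per_million=0.30, price_out_per_million=2.50)
print(f"Estimated cost: ${cost:.4f}")
```

`Decimal` is used instead of floats so that small per-request amounts add up without rounding drift when summed over many requests.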

Different Input Modes

# Pass document as file attachment
config = ExtractionConfig(
    model_name="google/gemini-2.5-flash",
    file_input_modes=[FileInputMode.FILE]
)

# Pass document as text content
config = ExtractionConfig(
    model_name="google/gemini-2.5-flash",
    file_input_modes=[FileInputMode.TEXT]
)

# Pass document as images (useful for scanned documents)
config = ExtractionConfig(
    model_name="google/gemini-2.5-flash",
    file_input_modes=[FileInputMode.IMAGE]
)

# Combine multiple input modes
config = ExtractionConfig(
    model_name="google/gemini-2.5-flash",
    file_input_modes=[FileInputMode.FILE, FileInputMode.TEXT]
)

See tests/test.py for more complete examples.
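
If you process mixed batches of files, you may want to pick a starting input mode per file. The sketch below is one possible heuristic, not library behavior: `InputMode` is a local stand-in that only mirrors the member names of `FileInputMode`, and the suffix mapping and the `suggest_modes` helper are assumptions made for this example.

```python
from enum import Enum
from pathlib import Path


class InputMode(Enum):
    """Stand-in for the library's FileInputMode; only the member names match."""
    FILE = "file"
    TEXT = "text"
    IMAGE = "image"


# Assumed mapping from file suffix to a sensible starting mode.
_SUFFIX_TO_MODES = {
    ".txt": [InputMode.TEXT],
    ".md": [InputMode.TEXT],
    ".pdf": [InputMode.FILE],
    ".png": [InputMode.IMAGE],
    ".jpg": [InputMode.IMAGE],
    ".jpeg": [InputMode.IMAGE],
}


def suggest_modes(path, scanned=False):
    """Suggest input modes for a file; scanned PDFs are sent as images."""
    suffix = Path(path).suffix.lower()
    if scanned and suffix == ".pdf":
        return [InputMode.IMAGE]
    try:
        return list(_SUFFIX_TO_MODES[suffix])
    except KeyError:
        raise ValueError(f"unsupported file type: {suffix or path!r}") from None
```

In real code you would map the suggested names onto the corresponding `FileInputMode` members before building an `ExtractionConfig`.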

Roadmap

Interfaces

🌐 FastAPI REST API for remote extraction services
🖥️ Web UI for entity management, job runs, and results review
🤖 Auto-detect mode: automatically identify entities in documents

Developer Experience

📦 Publish to PyPI for easy pip install entityxtract
⚡ ENV-first configuration (deprecate YAML)
💾 Document annotation caching to reduce token usage
🔧 JSON import/export for entity schemas and results
📝 Enhanced CLI with entityxtract command

Providers & Models

🏠 Local inference via Ollama
🔌 Native adapters for OpenAI, Gemini, Claude, and more
🌍 Support for additional LLM providers

Quality & Testing

✅ Expanded test coverage
📊 Benchmark suite for accuracy and performance
📚 Comprehensive documentation site

Comparisons

entityxtract positions itself as a flexible, open-source alternative to both commercial services and closed-source solutions:

Key Differentiators:

Provider Agnostic: Works with any LLM, not locked to a single provider
Open Source: Full transparency, customizable, and community-driven
Schema + Examples: Strong emphasis on structured entity definitions with few-shot learning
Complete Stack: Python SDK today, REST API and Web UI coming soon
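
To make the "Schema + Examples" idea concrete: an example table implies a set of columns and value types, and extracted rows can be checked against it. The following standard-library sketch illustrates that idea only; it is not entityxtract's validation code, and `infer_schema` and `validate_rows` are hypothetical helpers.

```python
def infer_schema(example_rows):
    """Map each column name to the Python type seen in the first example row."""
    first = example_rows[0]
    return {column: type(value) for column, value in first.items()}


def validate_rows(rows, schema):
    """Return a list of human-readable problems; an empty list means the rows conform."""
    problems = []
    for index, row in enumerate(rows):
        missing = schema.keys() - row.keys()
        extra = row.keys() - schema.keys()
        if missing:
            problems.append(f"row {index}: missing columns {sorted(missing)}")
        if extra:
            problems.append(f"row {index}: unexpected columns {sorted(extra)}")
        for column, expected in schema.items():
            if column in row and not isinstance(row[column], expected):
                problems.append(
                    f"row {index}: {column!r} should be {expected.__name__}, "
                    f"got {type(row[column]).__name__}"
                )
    return problems
```

For instance, a schema inferred from the `Events` example rows would flag an extracted row that lacks a `Description` column or carries a numeric `Time`.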

Contributing

We welcome contributions! entityxtract uses modern Python tooling:

# Use uv for environment management
uv sync

# Run tests
uv run pytest tests/

# Code formatting with Ruff
uv run ruff check .
uv run ruff format .

Guidelines:

Follow strict JSON output conventions
Include tests for new features
Update documentation as needed
Use structured logging patterns
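
As a rough picture of what a structured logging pattern can look like, here is a minimal standard-library sketch that emits one JSON object per log line. It does not reproduce the project's own logging setup; `JsonFormatter`, `get_logger`, and the `fields` convention are assumptions made for this example.

```python
import json
import logging


class JsonFormatter(logging.Formatter):
    """Render each log record as one JSON object per line."""

    def format(self, record):
        payload = {
            "level": record.levelname,
            "logger": record.name,
            "message": record.getMessage(),
        }
        # Fields passed via extra={"fields": {...}} are merged into the payload.
        payload.update(getattr(record, "fields", {}))
        return json.dumps(payload, sort_keys=True)


def get_logger(name="extraction-demo"):
    logger = logging.getLogger(name)
    if not logger.handlers:  # avoid attaching duplicate handlers on repeated calls
        handler = logging.StreamHandler()
        handler.setFormatter(JsonFormatter())
        logger.addHandler(handler)
        logger.setLevel(logging.INFO)
    return logger


log = get_logger()
log.info("entity extracted", extra={"fields": {"entity": "Events", "input_tokens": 1200}})
```

Keeping machine-readable fields separate from the message makes logs easy to filter by entity name or token count later.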

Open an issue or PR with a clear description and we'll be happy to review!

Get Help and Support

💬 GitHub Discussions - Ask questions and share ideas
🐛 Issues - Report bugs or request features
📧 Contact: prathamesh.s.ghatole@gmail.com

License

entityxtract is released under the MIT License. Free for commercial and personal use.

Built with ❤️ by Prathamesh Ghatole

entityxtract was built out of the need for intelligent entity extraction from documents using AI with minimal effort. Define what you need, and let AI handle the rest.

Project details

Maintainers

PrathameshGhatole

Release history

0.5.4 (Apr 7, 2026), this version
0.5.2 (Mar 29, 2026)

Download files

Source Distribution

entityxtract-0.5.4.tar.gz (2.7 MB)

Uploaded Apr 7, 2026 Source

Built Distribution

entityxtract-0.5.4-py3-none-any.whl (21.1 kB)

Uploaded Apr 7, 2026 Python 3

File details

Details for the file entityxtract-0.5.4.tar.gz.

File metadata

Download URL: entityxtract-0.5.4.tar.gz
Upload date: Apr 7, 2026
Size: 2.7 MB
Tags: Source
Uploaded using Trusted Publishing? Yes
Uploaded via: twine/6.1.0 CPython/3.13.12

File hashes

Hashes for entityxtract-0.5.4.tar.gz
Algorithm	Hash digest
SHA256	`395dec4246f07d4eb8b4f307455a0d7278eee1b1b46dfe4506571941c50595a2`
MD5	`3f26056c3c70543eae7964d5b097460e`
BLAKE2b-256	`fea64c0daf51961bad856b2459ef3bc4a95f55217054df6380294fcbe934ef4b`
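
To check a downloaded file against the SHA256 digest listed above, a short standard-library sketch such as the following can be used. The helper names `sha256_of` and `verify_download` are made up for this example, and the local file path in the commented usage line is an assumption about where you saved the download.

```python
import hashlib
from pathlib import Path


def sha256_of(path, chunk_size=1 << 20):
    """Hash a file in chunks so large archives do not need to fit in memory."""
    digest = hashlib.sha256()
    with Path(path).open("rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_download(path, expected_sha256):
    """Return True if the file's SHA256 matches the published hex digest."""
    return sha256_of(path) == expected_sha256.strip().lower()


# Example usage (assumes the sdist was saved to the current directory):
# verify_download(
#     "entityxtract-0.5.4.tar.gz",
#     "395dec4246f07d4eb8b4f307455a0d7278eee1b1b46dfe4506571941c50595a2",
# )
```

The same helper works for the wheel; pass the wheel's file name and its own SHA256 digest.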

Provenance

The following attestation bundles were made for entityxtract-0.5.4.tar.gz:

Publisher: publish.yml on Prathamesh-Ghatole/entityxtract

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.

Statement:
- Statement type: https://in-toto.io/Statement/v1
- Predicate type: https://docs.pypi.org/attestations/publish/v1
- Subject name: entityxtract-0.5.4.tar.gz
- Subject digest: 395dec4246f07d4eb8b4f307455a0d7278eee1b1b46dfe4506571941c50595a2
- Sigstore transparency entry: 1249473940
- Sigstore integration time: Apr 7, 2026
Source repository:
- Permalink: Prathamesh-Ghatole/entityxtract@616e80e58ec19019718c71d7d9068ebd3fec770d
- Branch / Tag: refs/tags/v0.5.4
- Owner: https://github.com/Prathamesh-Ghatole
- Access: public
Publication detail:
- Token Issuer: https://token.actions.githubusercontent.com
- Runner Environment: github-hosted
- Publication workflow: publish.yml@616e80e58ec19019718c71d7d9068ebd3fec770d
- Trigger Event: release

File details

Details for the file entityxtract-0.5.4-py3-none-any.whl.

File metadata

Download URL: entityxtract-0.5.4-py3-none-any.whl
Upload date: Apr 7, 2026
Size: 21.1 kB
Tags: Python 3
Uploaded using Trusted Publishing? Yes
Uploaded via: twine/6.1.0 CPython/3.13.12

File hashes

Hashes for entityxtract-0.5.4-py3-none-any.whl
Algorithm	Hash digest
SHA256	`3fdab908b6e4745ed87b3a52fa7f846465e53dc2abd61db6d58f0d4d2e37e302`
MD5	`a0e2122749b5c4664d5834010bc4871c`
BLAKE2b-256	`78e5660a689995a19de4aca5b6ee98e54e7badcd7b3cd07fdbcd102e73dce81f`

Provenance

The following attestation bundles were made for entityxtract-0.5.4-py3-none-any.whl:

Publisher: publish.yml on Prathamesh-Ghatole/entityxtract

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.

Statement:
- Statement type: https://in-toto.io/Statement/v1
- Predicate type: https://docs.pypi.org/attestations/publish/v1
- Subject name: entityxtract-0.5.4-py3-none-any.whl
- Subject digest: 3fdab908b6e4745ed87b3a52fa7f846465e53dc2abd61db6d58f0d4d2e37e302
- Sigstore transparency entry: 1249474147
- Sigstore integration time: Apr 7, 2026
Source repository:
- Permalink: Prathamesh-Ghatole/entityxtract@616e80e58ec19019718c71d7d9068ebd3fec770d
- Branch / Tag: refs/tags/v0.5.4
- Owner: https://github.com/Prathamesh-Ghatole
- Access: public
Publication detail:
- Token Issuer: https://token.actions.githubusercontent.com
- Runner Environment: github-hosted
- Publication workflow: publish.yml@616e80e58ec19019718c71d7d9068ebd3fec770d
- Trigger Event: release
