
A provider-agnostic, entity-centric LLM-powered document entity extraction tool


entityxtract

uv Ruff Pydantic v2 Python 3.12+ License MIT

Entity-first, schema-driven extraction of structured data from unstructured documents (PDF, DOCX, TXT, images). Define custom entities with schemas, few-shot examples, and instructions, then extract reliably using any local or SOTA LLM.

Built as an open-source alternative to Google Cloud Document AI, Azure AI Document Intelligence, and Adobe PDF Extract, but provider-agnostic and designed to work with any LLM.


Features

  • Entity-first extraction: Smart structured data extraction with pre-defined / auto-identified entities.
  • Multiple document formats: Support for PDF, TXT, MD, and images.
  • Smart input modes: Extract information using text, OCR, or hybrid approaches.
  • Provider-agnostic design: Works with any LLM via OpenAI-compatible APIs.
  • Robust execution: Built-in retries, parallel extraction, strictly structured and typed output.
  • Observability: Structured logs, token usage tracking, and optional cost tracking.
  • PyPI Package: Easily install and use entityxtract in your projects.

Coming Soon

  • FastAPI REST API for remote extraction services.
  • Web UI for visual entity/schema management and job monitoring.
  • Auto-detect mode to automatically identify extractable entities in documents.
  • Cost optimization using PDF annotation caching and smart input data pruning.
  • Deepseek OCR integration for enhanced document processing.
  • MCP server for agentic applications.

Installation

To use entityxtract, you'll need Python 3.12+ and uv (recommended):

# Install uv if you haven't already
curl -LsSf https://astral.sh/uv/install.sh | sh

# Clone the repository
git clone https://github.com/Prathamesh-Ghatole/entityxtract.git
cd entityxtract

# Install dependencies
uv sync

Getting Started

Extract pre-defined entities:

from pathlib import Path
import polars as pl
from entityxtract.extractor_types import (
    Document, TableToExtract, ObjectsToExtract, 
    ExtractionConfig, FileInputMode
)
from entityxtract.extractor import extract_objects

# 1. Load your document
doc = Document(Path("document.pdf"))

# 2. Define what to extract
table = TableToExtract(
    name="Events",
    example_table=pl.DataFrame([
        {"Time": "02:05", "Type": "Operation", "Description": "Example event"},
        {"Time": "03:25", "Type": "Transit", "Description": "Another event"}
    ]),
    instructions="Extract the events table with Time, Type, and Description columns.",
    required=True
)

# 3. Configure extraction
config = ExtractionConfig(
    model_name="google/gemini-2.5-flash",  # Recommended
    temperature=0.0,
    file_input_modes=[FileInputMode.FILE]
)

# 4. Extract!
results = extract_objects(doc, ObjectsToExtract(objects=[table], config=config))

# Use your results
for name, result in results.results.items():
    if result.success:
        df = pl.DataFrame(result.extracted_data)
        print(df)
    else:
        print(f"Failed: {result.message}")
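
The loop above branches on `result.success` for each entity. When several entities are requested, it can help to separate the extracted data from the failure messages first. The following is a minimal sketch, not part of entityxtract: `split_results` and `StubResult` are hypothetical names, and the stub only mimics the three attributes used above (`success`, `extracted_data`, `message`) so the example runs without an API key.

```python
from dataclasses import dataclass
from typing import Any


@dataclass
class StubResult:
    """Hypothetical stand-in exposing the attributes used in the loop above."""
    success: bool
    extracted_data: Any = None
    message: str = ""


def split_results(results):
    """Split a name -> result mapping into (data_by_name, error_by_name)."""
    data, errors = {}, {}
    for name, result in results.items():
        if result.success:
            data[name] = result.extracted_data
        else:
            errors[name] = result.message
    return data, errors


# Demo with stub objects instead of a real extraction run.
demo = {
    "Events": StubResult(True, [{"Time": "02:05", "Type": "Operation"}]),
    "Report ID": StubResult(False, message="not found"),
}
data, errors = split_results(demo)
print(sorted(data), errors)
```

With a real run, the mapping to pass would presumably be `results.results`, as iterated in the example above.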

Configuration

Copy the sample environment file .env.sample to .env, or set the following environment variables directly:

# For all OpenAI-compatible endpoints [OpenAI, OpenRouter, Ollama, lm-studio, etc.]
export OPENAI_API_KEY="your-api-key"
export OPENAI_API_BASE="https://openrouter.ai/api/v1"

# Default model
export OPENAI_DEFAULT_MODEL="google/gemini-2.5-flash"
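
A quick pre-flight check can catch a missing variable before any request is sent. This sketch uses only the standard library and the variable names listed above. Treating both `OPENAI_API_KEY` and `OPENAI_API_BASE` as required is an assumption made for illustration, and `missing_settings` is a hypothetical helper, not an entityxtract function.

```python
import os

# Names taken from the configuration section; which ones are strictly
# required is an assumption of this sketch.
REQUIRED_VARS = ("OPENAI_API_KEY", "OPENAI_API_BASE")


def missing_settings(env=None):
    """Return the required variable names that are unset or blank."""
    env = os.environ if env is None else env
    return [name for name in REQUIRED_VARS if not env.get(name, "").strip()]


# Demo with an explicit mapping rather than the real environment.
print(missing_settings({"OPENAI_API_KEY": "sk-demo"}))  # ['OPENAI_API_BASE']
```

Calling `missing_settings()` with no argument inspects the real process environment.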

Usage Examples

Complete Example with Multiple Entities

from pathlib import Path
import polars as pl

from entityxtract.extractor_types import (
    Document, ExtractionConfig, FileInputMode,
    TableToExtract, StringToExtract, ObjectsToExtract
)
from entityxtract.extractor import extract_objects

# Load document
doc = Document(Path("reports/quarterly_summary.pdf"))

# Define entities to extract
table = TableToExtract(
    name="Financial Summary",
    example_table=pl.DataFrame([
        {"Quarter": "Q1 2024", "Revenue": "$1.2M", "Expenses": "$800K", "Profit": "$400K"},
        {"Quarter": "Q2 2024", "Revenue": "$1.5M", "Expenses": "$900K", "Profit": "$600K"}
    ]),
    instructions="Extract the quarterly financial summary table with Quarter, Revenue, Expenses, and Profit columns.",
    required=True
)

report_id = StringToExtract(
    name="Report ID",
    example_string="RPT-2024-Q2-001",
    instructions="Extract the report identifier from the document header.",
    required=False
)

# Configure extraction with cost tracking
config = ExtractionConfig(
    model_name="google/gemini-2.5-flash",
    temperature=0.0,
    file_input_modes=[FileInputMode.FILE],
    parallel_requests=4,
    calculate_costs=True
)

# Run extraction
objects = ObjectsToExtract(objects=[table, report_id], config=config)
results = extract_objects(doc, objects)

# Process results
for name, res in results.results.items():
    if res.success:
        print(f"✓ [{name}] extracted successfully")
        print(f"  Tokens: {res.input_tokens} in / {res.output_tokens} out")
        print(f"  Cost: ${res.cost:.4f}")
        
        # Export table to CSV
        if isinstance(res.extracted_data, list):
            df = pl.DataFrame(res.extracted_data)
            df.write_csv(f"{name}.csv")
            print(f"  Saved to {name}.csv")
    else:
        print(f"✗ [{name}] failed: {res.message}")

print(f"\nTotals: {results.total_input_tokens} tokens in, {results.total_output_tokens} tokens out")
print(f"Total cost: ${results.total_cost:.4f}")
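
The example writes each table to `f"{name}.csv"`, so an entity named "Financial Summary" produces a file name containing a space. If that is inconvenient, the name can be normalized first. The helper below is a small sketch with a hypothetical name (`csv_filename`); it is not part of entityxtract.

```python
import re


def csv_filename(entity_name):
    """Turn an entity name into a lowercase, underscore-separated CSV file name."""
    # Collapse every run of non-alphanumeric characters into one underscore.
    slug = re.sub(r"[^A-Za-z0-9]+", "_", entity_name).strip("_").lower()
    # Fall back to a generic name if nothing usable is left.
    return f"{slug or 'entity'}.csv"


print(csv_filename("Financial Summary"))  # financial_summary.csv
print(csv_filename("Report ID"))          # report_id.csv
```

In the loop above, `df.write_csv(csv_filename(name))` would then replace `df.write_csv(f"{name}.csv")`.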

Different Input Modes

# Pass document as file attachment
config = ExtractionConfig(
    model_name="google/gemini-2.5-flash",
    file_input_modes=[FileInputMode.FILE]
)

# Pass document as text content
config = ExtractionConfig(
    model_name="google/gemini-2.5-flash",
    file_input_modes=[FileInputMode.TEXT]
)

# Pass document as images (useful for scanned documents)
config = ExtractionConfig(
    model_name="google/gemini-2.5-flash",
    file_input_modes=[FileInputMode.IMAGE]
)

# Combine multiple input modes
config = ExtractionConfig(
    model_name="google/gemini-2.5-flash",
    file_input_modes=[FileInputMode.FILE, FileInputMode.TEXT]
)
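
One possible way to use these modes is to try them in order and keep the first that succeeds, for example text first and images only for scanned documents. This is a sketch of that pattern, not documented library behavior: `Mode` is a hypothetical stand-in for `FileInputMode` so the code runs without entityxtract, and `first_successful`, `fake_run`, and `_Stub` are illustrative names.

```python
from enum import Enum


class Mode(Enum):
    """Hypothetical stand-in for FileInputMode."""
    FILE = "file"
    TEXT = "text"
    IMAGE = "image"


def first_successful(run, mode_sets):
    """Call run(modes) for each candidate list of modes, in order.

    `run` is any callable returning an object with a boolean `success`
    attribute. Returns (modes, result) for the first success, or
    (None, last_result) when every attempt fails.
    """
    result = None
    for modes in mode_sets:
        result = run(modes)
        if result.success:
            return modes, result
    return None, result


class _Stub:
    """Minimal result object for the demo."""
    def __init__(self, success):
        self.success = success


def fake_run(modes):
    # Pretend only image input works, as might happen with a scanned PDF.
    return _Stub(Mode.IMAGE in modes)


modes, result = first_successful(fake_run, [[Mode.TEXT], [Mode.FILE], [Mode.IMAGE]])
print(modes)
```

In real use, `run` would wrap a call to `extract_objects` with an `ExtractionConfig` built from the given modes and return the per-entity result of interest. Each failed attempt still consumes tokens, so ordering the candidates from cheapest to most expensive is a reasonable default.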

See tests/test.py for more complete examples.

Roadmap

Interfaces

  • FastAPI REST API for remote extraction services
  • Web UI for entity management, job runs, and results review
  • Auto-detect mode: automatically identify entities in documents

Developer Experience

  • Publish to PyPI for easy pip install entityxtract
  • ENV-first configuration (deprecate YAML)
  • Document annotation caching to reduce token usage
  • JSON import/export for entity schemas and results
  • Enhanced CLI with entityxtract command

Providers & Models

  • Local inference via Ollama
  • Native adapters for OpenAI, Gemini, Claude, and more
  • Support for additional LLM providers

Quality & Testing

  • Expanded test coverage
  • Benchmark suite for accuracy and performance
  • Comprehensive documentation site

Comparisons

entityxtract positions itself as a flexible, open-source alternative to both commercial services and closed-source solutions:

Key Differentiators:

  • Provider Agnostic: Works with any LLM, not locked to a single provider
  • Open Source: Full transparency, customizable, and community-driven
  • Schema + Examples: Strong emphasis on structured entity definitions with few-shot learning
  • Complete Stack: Python SDK today, REST API and Web UI coming soon

Contributing

We welcome contributions! entityxtract uses modern Python tooling:

# Use uv for environment management
uv sync

# Run tests
uv run pytest tests/

# Code formatting with Ruff
uv run ruff check .
uv run ruff format .

Guidelines:

  • Follow strict JSON output conventions
  • Include tests for new features
  • Update documentation as needed
  • Use structured logging patterns

Open an issue or PR with a clear description and we'll be happy to review!

Get Help and Support

License

entityxtract is released under the MIT License. Free for commercial and personal use.


Built with ❤️ by Prathamesh Ghatole

entityxtract was built out of the need for intelligent entity extraction from documents using AI with minimal effort. Define what you need, and let AI handle the rest.
