Structured AI document processing with robust fallback strategies.

strutex

Structured Text Extraction — Extract structured JSON from documents using LLMs

CI · License: GPL v3 · Python 3.10+ · PyPI

pip install strutex

The Simplest Example

from strutex import extract
from strutex.schemas import INVOICE_US

invoice = extract("invoice.pdf", model=INVOICE_US)
print(invoice.invoice_number, invoice.total)

That's it: two imports and a single extract() call. No custom schema to write.

Schemas are required — but you have options:

  • Built-in schemas — 10+ ready-to-use (invoices, receipts, shipping docs, resumes)
  • Native types — Object, String, Number, Array (lightweight, no dependencies)
  • Pydantic models — Full type safety and validation

What You Can Do

Level         | Features                | When to use
Basic         | extract(), schemas      | Most use cases — just extract data
Reliability   | verify=True, validators | Production — ensure accuracy
Scale         | caching, async, batch   | High volume — reduce costs
Extensibility | plugins, hooks, CLI     | Advanced — extend anything

Most users only need Level 1. The rest is there when you need it.


Level 1: Basic Extraction

With Pydantic (recommended)

import strutex
from pydantic import BaseModel

class Receipt(BaseModel):
    store: str
    date: str
    total: float

receipt = strutex.extract("receipt.jpg", model=Receipt)

With Native Schema

from strutex import extract, Object, String, Number

schema = Object(properties={
    "invoice_number": String,
    "total": Number,
})

result = extract("invoice.pdf", schema=schema)

With Built-in Schemas

from strutex import extract
from strutex.schemas import INVOICE_US, BILL_OF_LADING

invoice = extract("invoice.pdf", model=INVOICE_US)
bol = extract("bl.pdf", model=BILL_OF_LADING)

Available: INVOICE_GENERIC, INVOICE_US, INVOICE_EU, RECEIPT, PURCHASE_ORDER, BILL_OF_LADING, RESUME, BANK_STATEMENT, etc.


Level 2: Reliability Features

Optional Double-Check

Ask the LLM to review its own output. This step is optional and is intended to improve accuracy:

import strutex

# ContractSchema stands for your own schema or Pydantic model.
result = strutex.extract(
    "contract.pdf",
    model=ContractSchema,
    verify=True  # LLM reviews its own output
)
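The general idea of a self-review pass can be sketched without any LLM at all. The snippet below is a minimal illustration, not strutex's implementation: extract_with_review, extract_fn, and review_fn are hypothetical names, and the two callables stand in for a first extraction call and a reviewing call.

```python
from typing import Callable


def extract_with_review(
    extract_fn: Callable[[], dict],
    review_fn: Callable[[dict], dict],
    max_rounds: int = 2,
) -> dict:
    """Extract once, then let a reviewer correct the draft until it is stable."""
    draft = extract_fn()
    for _ in range(max_rounds):
        corrected = review_fn(draft)
        if corrected == draft:  # reviewer found nothing to change
            break
        draft = corrected
    return draft


if __name__ == "__main__":
    # Stand-ins for two LLM calls: a first pass and a correcting reviewer.
    first_pass = lambda: {"total": "1,200.00"}
    reviewer = lambda d: {**d, "total": d["total"].replace(",", "")}
    print(extract_with_review(first_pass, reviewer))
```

Capping the number of rounds keeps the cost bounded when the reviewer never settles on a stable answer.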

Choosing a Provider

Create a provider instance for full control over API keys and configuration:

from strutex import DocumentProcessor
from strutex import GeminiProvider, OpenAIProvider, AnthropicProvider, OllamaProvider
from strutex.schemas import INVOICE_US

# Google Gemini
processor = DocumentProcessor(provider=GeminiProvider(api_key="your-key"))

# OpenAI
processor = DocumentProcessor(provider=OpenAIProvider(api_key="your-key", model="gpt-4o"))

# Anthropic Claude
processor = DocumentProcessor(provider=AnthropicProvider(api_key="your-key"))

# Local with Ollama (no API key needed)
processor = DocumentProcessor(provider=OllamaProvider(model="llama3"))

result = processor.process("doc.pdf", "Extract data", model=INVOICE_US)

Note: String providers such as provider="gemini" are convenience shortcuts that assume the relevant environment variables are set correctly. For production, explicit provider instances are recommended.
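One way to make a missing environment variable fail early, before any document is processed, is to read the key yourself and pass it to an explicit provider instance. This is a minimal sketch: require_api_key and MissingCredentialError are hypothetical helpers, and EXAMPLE_LLM_API_KEY is a placeholder, not a variable name that strutex documents.

```python
import os


class MissingCredentialError(RuntimeError):
    """Raised when a required API key is absent from the environment."""


def require_api_key(env_var: str) -> str:
    """Return the key stored in `env_var`, or fail with a clear message."""
    key = os.environ.get(env_var, "").strip()
    if not key:
        raise MissingCredentialError(
            f"{env_var} is not set; export it or pass api_key explicitly"
        )
    return key


if __name__ == "__main__":
    # EXAMPLE_LLM_API_KEY is a placeholder name chosen for this sketch.
    os.environ["EXAMPLE_LLM_API_KEY"] = "demo-key"
    print(require_api_key("EXAMPLE_LLM_API_KEY"))
```

The returned string could then be passed as api_key= to one of the provider classes shown above.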


Level 3: Scale Features

Caching (reduce API costs)

from strutex import DocumentProcessor
from strutex.cache import SQLiteCache

processor = DocumentProcessor(
    provider="gemini",
    cache=SQLiteCache("cache.db")
)
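To show why caching reduces API costs, here is a rough sketch of a result cache keyed on everything that influences an extraction. It is an illustration using only the standard library, not strutex's SQLiteCache: TinySQLiteCache and cache_key are hypothetical names, and the choice of key fields (file bytes, prompt, schema name) is an assumption.

```python
import hashlib
import json
import sqlite3


def cache_key(file_bytes: bytes, prompt: str, schema_name: str) -> str:
    """Hash everything that influences the result into one stable key."""
    digest = hashlib.sha256()
    digest.update(file_bytes)
    digest.update(b"\x00" + prompt.encode("utf-8"))
    digest.update(b"\x00" + schema_name.encode("utf-8"))
    return digest.hexdigest()


class TinySQLiteCache:
    """Illustrative stand-in; not strutex's SQLiteCache."""

    def __init__(self, path: str = ":memory:") -> None:
        self._db = sqlite3.connect(path)
        self._db.execute(
            "CREATE TABLE IF NOT EXISTS results (key TEXT PRIMARY KEY, value TEXT)"
        )

    def get(self, key: str):
        """Return the cached dict for `key`, or None on a miss."""
        row = self._db.execute(
            "SELECT value FROM results WHERE key = ?", (key,)
        ).fetchone()
        return json.loads(row[0]) if row else None

    def set(self, key: str, value: dict) -> None:
        """Store `value` as JSON, overwriting any earlier entry."""
        self._db.execute(
            "INSERT OR REPLACE INTO results (key, value) VALUES (?, ?)",
            (key, json.dumps(value)),
        )
        self._db.commit()
```

Because the prompt and schema are part of the key, changing either one produces a miss instead of returning a stale result.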

Async Processing

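import asyncio
from strutex import DocumentProcessor, Object, String, Number

# Any schema works here; this one mirrors the native-schema example above.
schema = Object(properties={
    "invoice_number": String,
    "total": Number,
})

async def main():
    processor = DocumentProcessor(provider="anthropic")
    # Both documents are processed concurrently.
    results = await asyncio.gather(
        processor.aprocess("doc1.pdf", "Extract", schema),
        processor.aprocess("doc2.pdf", "Extract", schema)
    )
    return results

asyncio.run(main())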
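With many documents, an unbounded gather can run into provider rate limits. A common pattern is to cap the number of in-flight calls with a semaphore. The sketch below uses a stand-in coroutine, fake_extract, instead of a real extraction call; both it and extract_all are hypothetical names for illustration.

```python
import asyncio


async def fake_extract(path: str) -> dict:
    """Stand-in for an async extraction call; sleeps instead of calling an LLM."""
    await asyncio.sleep(0.01)
    return {"source": path}


async def extract_all(paths, limit: int = 2):
    """Run extractions concurrently, with at most `limit` in flight at once."""
    semaphore = asyncio.Semaphore(limit)

    async def bounded(path: str):
        async with semaphore:
            return await fake_extract(path)

    # return_exceptions=True returns failures as values instead of
    # raising the first one, so successful results are not lost.
    return await asyncio.gather(
        *(bounded(p) for p in paths), return_exceptions=True
    )


if __name__ == "__main__":
    print(asyncio.run(extract_all(["doc1.pdf", "doc2.pdf", "doc3.pdf"])))
```

Results come back in the same order as the input paths, so they can be zipped with the file names afterwards.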

Level 4: Extensibility

Plugin System

Everything is pluggable. Just inherit from a base class:

Type           | Purpose                 | Examples
Provider       | LLM backends            | Gemini, OpenAI, Claude, Ollama
Extractor      | Document parsing        | PDF, Image OCR, Excel
Validator      | Output validation       | Schema, sum checks, date formats
SecurityPlugin | Input/output protection | Injection detection, sanitization
Postprocessor  | Data transformation     | Date/number normalization

from strutex.plugins import Provider, Extractor, Validator

# Custom LLM Provider
class MyProvider(Provider):
    """Auto-registered as 'myprovider'"""
    def process(self, file_path, prompt, schema, mime_type, **kwargs):
        # Call your LLM API
        ...

# Custom Document Extractor
class WordExtractor(Extractor, name="word"):
    """Handle .docx files"""
    mime_types = ["application/vnd.openxmlformats-officedocument.wordprocessingml.document"]

    def extract(self, file_path: str) -> str:
        # Parse .docx and return text
        ...

# Custom Validator
# (ValidationResult must be imported as well; its import path is not shown here.)
class TotalValidator(Validator):
    """Verify line items sum to total"""
    def validate(self, data, schema, context):
        items_sum = sum(item["amount"] for item in data.get("items", []))
        return ValidationResult(
            valid=abs(items_sum - data["total"]) < 0.01,
            message="Line items must sum to total"
        )
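The sum check in TotalValidator can be tried out on its own. The sketch below is a standalone version under stated assumptions: CheckResult is a hypothetical stand-in for ValidationResult, and check_total is not part of strutex. It uses Decimal to avoid float rounding noise and reports a missing total instead of raising KeyError.

```python
from dataclasses import dataclass
from decimal import Decimal


@dataclass
class CheckResult:
    """Hypothetical stand-in for the ValidationResult used above."""
    valid: bool
    message: str = ""


def check_total(data: dict, tolerance: Decimal = Decimal("0.01")) -> CheckResult:
    """Compare the sum of line-item amounts against the stated total."""
    if "total" not in data:
        return CheckResult(False, "Missing 'total' field")
    # Convert via str() so floats like 0.1 do not carry binary rounding noise.
    items_sum = sum(
        (Decimal(str(item.get("amount", 0))) for item in data.get("items", [])),
        Decimal("0"),
    )
    total = Decimal(str(data["total"]))
    if abs(items_sum - total) <= tolerance:
        return CheckResult(True)
    return CheckResult(False, f"Line items sum to {items_sum}, but total is {total}")
```

Including both numbers in the failure message makes it easier to see whether the extraction dropped a line item or misread the total.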

CLI Commands

strutex plugins list              # List all plugins
strutex plugins list --type provider
strutex plugins info gemini --type provider

For Distributable Packages

# pyproject.toml
[project.entry-points."strutex.providers"]
my_provider = "my_package:MyProvider"

Hooks System

Inject logic at any point in the processing pipeline:

from strutex import DocumentProcessor

processor = DocumentProcessor(provider="gemini")

@processor.on_pre_process
def add_instructions(file_path, prompt, schema, mime_type, context):
    """Modify prompt before sending to LLM"""
    return {"prompt": prompt + "\nBe precise and thorough."}

@processor.on_post_process
def normalize_dates(result, context):
    """Transform output after extraction"""
    if "date" in result:
        result["date"] = parse_date(result["date"])
    return result

@processor.on_error
def handle_rate_limit(error, file_path, context):
    """Custom error handling"""
    if "rate limit" in str(error).lower():
        return {"error": "Rate limited, please retry"}
    return None  # Propagate other errors
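The normalize_dates hook calls parse_date, which the snippet does not define. A minimal version, assuming ISO 8601 output is wanted and that slash-separated dates are in US month/day order, could look like this. The list of formats is an assumption to adapt to your documents.

```python
from datetime import datetime

# Formats to try, in order; extend to match the documents you process.
_DATE_FORMATS = ("%Y-%m-%d", "%d.%m.%Y", "%m/%d/%Y", "%B %d, %Y")


def parse_date(value: str) -> str:
    """Return `value` as ISO 8601 (YYYY-MM-DD), or unchanged if unrecognized."""
    text = value.strip()
    for fmt in _DATE_FORMATS:
        try:
            return datetime.strptime(text, fmt).date().isoformat()
        except ValueError:
            continue
    return value
```

Returning the input unchanged on failure keeps the hook from discarding data that a later validator might still flag.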

Optional Extras

pip install strutex[cli]          # CLI commands
pip install strutex[ocr]          # OCR support
pip install strutex[langchain]    # LangChain integration
pip install strutex[llamaindex]   # LlamaIndex integration
pip install strutex[all]          # Everything

Supported Formats

Format | Extensions        | Method
PDF    | .pdf              | Text extraction with fallback chain
Images | .png, .jpg, .tiff | Direct vision or OCR
Excel  | .xlsx, .xls       | Converted to structured text
Text   | .txt, .csv        | Direct input
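When pre-sorting a folder of mixed files, the table above can be expressed as a simple lookup. This is an illustrative sketch only: METHOD_BY_EXTENSION and method_for are hypothetical names, the mapping is transcribed from the table, and strutex's own dispatch logic may differ.

```python
from pathlib import Path

# Extension -> method, transcribed from the table above.
METHOD_BY_EXTENSION = {
    ".pdf": "Text extraction with fallback chain",
    ".png": "Direct vision or OCR",
    ".jpg": "Direct vision or OCR",
    ".tiff": "Direct vision or OCR",
    ".xlsx": "Converted to structured text",
    ".xls": "Converted to structured text",
    ".txt": "Direct input",
    ".csv": "Direct input",
}


def method_for(path: str) -> str:
    """Look up the handling method for a file, ignoring extension case."""
    suffix = Path(path).suffix.lower()
    if suffix not in METHOD_BY_EXTENSION:
        raise ValueError(f"Unsupported file type: {path!r}")
    return METHOD_BY_EXTENSION[suffix]
```

Files that raise ValueError here are candidates for a custom Extractor plugin.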

Full Feature List

  • Plugin System v2 — Auto-registration via inheritance, lazy loading, entry points
  • Hooks — Callbacks and decorators for pre/post processing pipeline
  • CLI Tooling — strutex plugins list|info|refresh commands
  • Multi-Provider LLM Support — Gemini, OpenAI, Anthropic, Ollama, Groq, Langdock
  • Universal Document Support — PDFs, images, Excel, and custom formats
  • Schema-Driven Extraction — Define your output structure, get consistent JSON
  • Verification & Self-Correction — Built-in audit loop for high accuracy
  • Security First — Built-in input sanitization and output validation
  • Framework Integrations — LangChain, LlamaIndex, Haystack compatibility
  • Caching — Memory, SQLite, and file-based caching
  • Async & Batch — Process multiple documents in parallel
  • Streaming — Real-time extraction feedback

Documentation

📚 Read the Docs


Roadmap

See ROADMAP.md for the full development plan.

Recent releases:

  • v0.1.0 — Core functionality
  • v0.2.0 — Plugin registry + Security layer
  • v0.3.0 — Plugin System v2
  • v0.6.0 — Built-in Schemas & Logging
  • v0.7.0 — Providers & Retries
  • v0.8.0 — Async, Batch, Cache, Verification
  • v0.8.1 — Documentation & Coverage Fixes

License

This project is licensed under the GNU General Public License v3.0 — see LICENSE for details.


Contributing

Contributions welcome! Priority areas:

  1. New plugins — Providers, extractors, validators
  2. Documentation — Examples and tutorials
  3. Testing — Expand test coverage
