Structured AI document processing with robust fallback strategies.

strutex

Structured Text Extraction — Extract structured JSON from documents using LLMs

License: GPL v3 | Python 3.10+

pip install strutex

The Simplest Example

from strutex import extract
from strutex.schemas import INVOICE_US

invoice = extract("invoice.pdf", model=INVOICE_US)
print(invoice.invoice_number, invoice.total)

That's it. One call to extract(), and no custom schema to write.

Schemas are required — but you have options:

  • Built-in schemas — 10+ ready-to-use (invoices, receipts, shipping docs, resumes)
  • Native types — Object, String, Number, Array (lightweight, no dependencies)
  • Pydantic models — Full type safety and validation

What You Can Do

Level          Features                 When to use
Basic          extract(), schemas       Most use cases — just extract data
Reliability    verify=True, validators  Production — ensure accuracy
Scale          caching, async, batch    High volume — reduce costs
Extensibility  plugins, hooks, CLI      Advanced — extend anything

Most users only need Level 1. The rest is there when you need it.


Level 1: Basic Extraction

With Pydantic (recommended)

import strutex
from pydantic import BaseModel

class Receipt(BaseModel):
    store: str
    date: str
    total: float

receipt = strutex.extract("receipt.jpg", model=Receipt)

With Native Schema

from strutex import extract, Object, String, Number

schema = Object(properties={
    "invoice_number": String,
    "total": Number,
})

result = extract("invoice.pdf", schema=schema)

With Built-in Schemas

from strutex import extract
from strutex.schemas import INVOICE_US, BILL_OF_LADING

invoice = extract("invoice.pdf", model=INVOICE_US)
bol = extract("bl.pdf", model=BILL_OF_LADING)

Available: INVOICE_GENERIC, INVOICE_US, INVOICE_EU, RECEIPT, PURCHASE_ORDER, BILL_OF_LADING, RESUME, BANK_STATEMENT, etc.


Level 2: Reliability Features

Optional Double-Check

Pass verify=True to have the LLM review its own output after extraction. The check is optional and aims to improve accuracy:

result = strutex.extract(
    "contract.pdf",
    model=ContractSchema,  # your own schema class, defined elsewhere
    verify=True  # LLM reviews its own output
)
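
One way to picture a self-review pass is a short loop: extract once, ask a reviewer for corrections, and merge them until the reviewer has none. The sketch below is conceptual and is not strutex's implementation. The names extract_with_review, extract_fn, and review_fn are hypothetical, and the stub functions stand in for real LLM calls so the example runs offline.

```python
from typing import Any, Callable, Dict

Record = Dict[str, Any]


def extract_with_review(
    extract_fn: Callable[[str], Record],
    review_fn: Callable[[str, Record], Record],
    document: str,
    max_rounds: int = 2,
) -> Record:
    """Extract once, then let a reviewer propose corrections until it has none."""
    result = extract_fn(document)
    for _ in range(max_rounds):
        corrections = review_fn(document, result)
        if not corrections:  # reviewer agrees with the current result
            break
        result = {**result, **corrections}
    return result


# Stub "LLM" calls (hypothetical behaviour, for illustration only).
def fake_extract(document: str) -> Record:
    return {"party": "ACME Corp", "amount": 1000.0}


def fake_review(document: str, result: Record) -> Record:
    # Pretend the reviewer spots a misread amount exactly once.
    return {"amount": 1500.0} if result["amount"] == 1000.0 else {}


reviewed = extract_with_review(fake_extract, fake_review, "contract text")
```

Capping the loop with max_rounds keeps the cost bounded even if the reviewer never settles.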

Choosing a Provider

Create a provider instance for full control over API keys and configuration:

from strutex import DocumentProcessor
from strutex import GeminiProvider, OpenAIProvider, AnthropicProvider, OllamaProvider
from strutex.schemas import INVOICE_US

# Pick one provider; each assignment below replaces the previous one.

# Google Gemini
processor = DocumentProcessor(provider=GeminiProvider(api_key="your-key"))

# OpenAI
processor = DocumentProcessor(provider=OpenAIProvider(api_key="your-key", model="gpt-4o"))

# Anthropic Claude
processor = DocumentProcessor(provider=AnthropicProvider(api_key="your-key"))

# Local with Ollama (no API key needed)
processor = DocumentProcessor(provider=OllamaProvider(model="llama3"))

result = processor.process("doc.pdf", "Extract data", model=INVOICE_US)

Note: String providers like provider="gemini" are convenience shortcuts that assume correct environment variables. For production, explicit provider instances are recommended.
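
When passing keys explicitly, it can help to fail fast if a key is missing rather than discover it on the first API call. A minimal sketch using only the standard library follows. The helper require_env and the variable name GEMINI_API_KEY are assumptions for illustration, not names taken from strutex.

```python
import os


def require_env(name: str) -> str:
    """Return the value of an environment variable or fail with a clear message."""
    value = os.environ.get(name, "").strip()
    if not value:
        raise RuntimeError(f"Environment variable {name} is not set")
    return value


# Hypothetical usage (the variable name is an assumption):
# processor = DocumentProcessor(
#     provider=GeminiProvider(api_key=require_env("GEMINI_API_KEY"))
# )
```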


Level 3: Scale Features

Caching (reduce API costs)

from strutex import DocumentProcessor
from strutex.cache import SQLiteCache

processor = DocumentProcessor(
    provider="gemini",
    cache=SQLiteCache("cache.db")
)
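
To illustrate why caching reduces API costs, here is a rough sketch of a result cache keyed on the document bytes, the prompt, and a schema name. It is not how SQLiteCache is implemented. The class TinyResultCache, its table layout, and the key scheme are all hypothetical.

```python
import hashlib
import json
import sqlite3
from typing import Any, Optional


class TinyResultCache:
    """Store extraction results keyed by document bytes + prompt + schema name."""

    def __init__(self, path: str = ":memory:") -> None:
        self.conn = sqlite3.connect(path)
        self.conn.execute(
            "CREATE TABLE IF NOT EXISTS results (cache_key TEXT PRIMARY KEY, payload TEXT)"
        )

    @staticmethod
    def make_key(document: bytes, prompt: str, schema_name: str) -> str:
        digest = hashlib.sha256()
        for part in (document, prompt.encode(), schema_name.encode()):
            digest.update(len(part).to_bytes(8, "big"))  # length prefix avoids ambiguous joins
            digest.update(part)
        return digest.hexdigest()

    def get(self, key: str) -> Optional[Any]:
        row = self.conn.execute(
            "SELECT payload FROM results WHERE cache_key = ?", (key,)
        ).fetchone()
        return json.loads(row[0]) if row else None

    def set(self, key: str, value: Any) -> None:
        self.conn.execute(
            "INSERT OR REPLACE INTO results (cache_key, payload) VALUES (?, ?)",
            (key, json.dumps(value)),
        )
        self.conn.commit()


cache = TinyResultCache()
key = cache.make_key(b"%PDF-1.7 ...", "Extract data", "INVOICE_US")
if cache.get(key) is None:  # miss: this is where the LLM call would happen
    cache.set(key, {"invoice_number": "INV-001", "total": 42.5})
cached = cache.get(key)
```

Because the prompt and schema name are part of the key, changing either one produces a fresh entry instead of returning a stale result.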

Async Processing

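import asyncio
from strutex import DocumentProcessor, Object, String, Number

# Schema shared by both documents (same native types as in Level 1)
schema = Object(properties={
    "invoice_number": String,
    "total": Number,
})

async def main():
    processor = DocumentProcessor(provider="anthropic")
    # Both extractions run concurrently; results come back in call order
    results = await asyncio.gather(
        processor.aprocess("doc1.pdf", "Extract", schema),
        processor.aprocess("doc2.pdf", "Extract", schema)
    )
    return results

results = asyncio.run(main())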
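
For larger batches, launching every request at once can run into provider rate limits. One common pattern is to cap the number of calls in flight with a semaphore. The sketch below uses only asyncio. The names process_bounded and fake_aprocess are hypothetical, and fake_aprocess stands in for an async extraction call.

```python
import asyncio
from typing import Awaitable, Callable, List


async def process_bounded(
    paths: List[str],
    worker: Callable[[str], Awaitable[dict]],
    limit: int = 4,
) -> List[dict]:
    """Run worker(path) for every path with at most `limit` calls in flight."""
    semaphore = asyncio.Semaphore(limit)

    async def guarded(path: str) -> dict:
        async with semaphore:
            return await worker(path)

    # gather returns results in the same order as the inputs
    return await asyncio.gather(*(guarded(p) for p in paths))


async def fake_aprocess(path: str) -> dict:
    """Stand-in for an async extraction call (hypothetical)."""
    await asyncio.sleep(0.01)
    return {"source": path}


batch = asyncio.run(
    process_bounded(["doc1.pdf", "doc2.pdf", "doc3.pdf"], fake_aprocess, limit=2)
)
```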

Level 4: Extensibility

Plugin System

Everything is pluggable. Just inherit from a base class:

Type            Purpose                  Examples
Provider        LLM backends             Gemini, OpenAI, Claude, Ollama
Extractor       Document parsing         PDF, Image OCR, Excel
Validator       Output validation        Schema, sum checks, date formats
SecurityPlugin  Input/output protection  Injection detection, sanitization
Postprocessor   Data transformation      Date/number normalization

from strutex.plugins import Provider, Extractor, Validator

# Custom LLM Provider
class MyProvider(Provider):
    """Auto-registered as 'myprovider'"""
    def process(self, file_path, prompt, schema, mime_type, **kwargs):
        # Call your LLM API
        ...

# Custom Document Extractor
class WordExtractor(Extractor, name="word"):
    """Handle .docx files"""
    mime_types = ["application/vnd.openxmlformats-officedocument.wordprocessingml.document"]

    def extract(self, file_path: str) -> str:
        # Parse .docx and return text
        ...

# Custom Validator
# Note: ValidationResult must also be imported; its module is not shown here.
class TotalValidator(Validator):
    """Verify line items sum to total"""
    def validate(self, data, schema, context):
        items_sum = sum(item["amount"] for item in data.get("items", []))
        return ValidationResult(
            valid=abs(items_sum - data["total"]) < 0.01,
            message="Line items must sum to total"
        )
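
The sum check itself can be tested in isolation before wiring it into a Validator subclass. Below is a standalone sketch of the same rule that also handles a missing total instead of raising KeyError. The function check_line_items and its return shape are hypothetical and not part of strutex.

```python
from typing import Any, Dict, Tuple


def check_line_items(data: Dict[str, Any], tolerance: float = 0.01) -> Tuple[bool, str]:
    """Return (valid, message) for the 'line items sum to total' rule."""
    total = data.get("total")
    if total is None:
        return False, "Missing 'total' field"
    items_sum = sum(float(item.get("amount", 0)) for item in data.get("items", []))
    if abs(items_sum - float(total)) >= tolerance:
        return False, f"Line items sum to {items_sum:.2f}, expected {float(total):.2f}"
    return True, "OK"


ok = check_line_items({"items": [{"amount": 10.0}, {"amount": 5.5}], "total": 15.5})
bad = check_line_items({"items": [{"amount": 10.0}], "total": 15.5})
missing = check_line_items({"items": []})
```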

CLI Commands

strutex plugins list              # List all plugins
strutex plugins list --type provider
strutex plugins info gemini --type provider

For Distributable Packages

# pyproject.toml
[project.entry-points."strutex.providers"]
my_provider = "my_package:MyProvider"

Hooks System

Inject logic at any point in the processing pipeline:

from strutex import DocumentProcessor

processor = DocumentProcessor(provider="gemini")

@processor.on_pre_process
def add_instructions(file_path, prompt, schema, mime_type, context):
    """Modify prompt before sending to LLM"""
    return {"prompt": prompt + "\nBe precise and thorough."}

@processor.on_post_process
def normalize_dates(result, context):
    """Transform output after extraction"""
    if "date" in result:
        result["date"] = parse_date(result["date"])  # parse_date: user-supplied helper, not defined here
    return result

@processor.on_error
def handle_rate_limit(error, file_path, context):
    """Custom error handling"""
    if "rate limit" in str(error).lower():
        return {"error": "Rate limited, please retry"}
    return None  # Propagate other errors
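
The normalize_dates hook above relies on a parse_date helper. One possible version, using only the standard library, tries a few known formats and returns an ISO 8601 string. The format list is an assumption and should be adapted to the documents being processed.

```python
from datetime import datetime

# Formats to try, in order; an ambiguous input resolves to the first match.
DATE_FORMATS = ("%Y-%m-%d", "%m/%d/%Y", "%d.%m.%Y")


def parse_date(value: str) -> str:
    """Return the date as YYYY-MM-DD, or the input unchanged if no format fits."""
    text = value.strip()
    for fmt in DATE_FORMATS:
        try:
            return datetime.strptime(text, fmt).date().isoformat()
        except ValueError:
            continue
    return value


normalized = [parse_date(s) for s in ("2024-01-31", "12/31/2024", "31.12.2024", "n/a")]
```

Returning the input unchanged on failure keeps the hook from discarding data it cannot interpret.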

Optional Extras

pip install strutex[cli]          # CLI commands
pip install strutex[ocr]          # OCR support
pip install strutex[langchain]    # LangChain integration
pip install strutex[llamaindex]   # LlamaIndex integration
pip install strutex[all]          # Everything

Supported Formats

Format  Extensions         Method
PDF     .pdf               Text extraction with fallback chain
Images  .png, .jpg, .tiff  Direct vision or OCR
Excel   .xlsx, .xls        Converted to structured text
Text    .txt, .csv         Direct input
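
Before sending a batch, it can be useful to check which files fall into a supported format. The sketch below encodes the table above as a lookup by file extension. The names FORMAT_TABLE and classify are hypothetical, and strutex's own format detection may work differently.

```python
from pathlib import Path
from typing import Tuple

# Extension -> (format, handling method), copied from the table above.
FORMAT_TABLE = {
    ".pdf": ("PDF", "Text extraction with fallback chain"),
    ".png": ("Images", "Direct vision or OCR"),
    ".jpg": ("Images", "Direct vision or OCR"),
    ".tiff": ("Images", "Direct vision or OCR"),
    ".xlsx": ("Excel", "Converted to structured text"),
    ".xls": ("Excel", "Converted to structured text"),
    ".txt": ("Text", "Direct input"),
    ".csv": ("Text", "Direct input"),
}


def classify(path: str) -> Tuple[str, str]:
    """Look up the format and handling method for a file, by extension."""
    suffix = Path(path).suffix.lower()
    if suffix not in FORMAT_TABLE:
        raise ValueError(f"Unsupported file type: {suffix or path}")
    return FORMAT_TABLE[suffix]


kind, method = classify("scans/Receipt.JPG")
```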

Full Feature List

  • Plugin System v2 — Auto-registration via inheritance, lazy loading, entry points
  • Hooks — Callbacks and decorators for pre/post processing pipeline
  • CLI Tooling — strutex plugins list|info|refresh commands
  • Multi-Provider LLM Support — Gemini, OpenAI, Anthropic, Ollama, Groq, Langdock
  • Universal Document Support — PDFs, images, Excel, and custom formats
  • Schema-Driven Extraction — Define your output structure, get consistent JSON
  • Verification & Self-Correction — Built-in audit loop for high accuracy
  • Security First — Built-in input sanitization and output validation
  • Framework Integrations — LangChain, LlamaIndex, Haystack compatibility
  • Caching — Memory, SQLite, and file-based caching
  • Async & Batch — Process multiple documents in parallel
  • Streaming — Real-time extraction feedback

Documentation

📚 Read the Docs


Roadmap

See ROADMAP.md for the full development plan.

Recent releases:

  • v0.1.0 — Core functionality
  • v0.2.0 — Plugin registry + Security layer
  • v0.3.0 — Plugin System v2
  • v0.6.0 — Built-in Schemas & Logging
  • v0.7.0 — Providers & Retries
  • v0.8.0 — Async, Batch, Cache, Verification
  • v0.8.1 — Documentation & Coverage Fixes

License

This project is licensed under the GNU General Public License v3.0 — see LICENSE for details.


Contributing

Contributions welcome! Priority areas:

  1. New plugins — Providers, extractors, validators
  2. Documentation — Examples and tutorials
  3. Testing — Expand test coverage
