Structured AI document processing with robust fallback strategies.

strutex

Structured Text Extraction — Extract structured JSON from documents using LLMs

License: GPL v3 | Python 3.10+


The Simplest Example

import strutex
from pydantic import BaseModel

class Invoice(BaseModel):
    invoice_number: str
    total: float

result = strutex.extract("invoice.pdf", model=Invoice)
print(result.invoice_number, result.total)

That's it. Everything else in strutex is optional.


Installation

pip install strutex

What You Can Do

| Level | Features | When to use |
|---|---|---|
| Basic | extract(), schemas | Most use cases — just extract data |
| Reliability | verify=True, validators | Production — ensure accuracy |
| Scale | caching, async, batch | High volume — reduce costs |
| Extensibility | plugins, hooks, CLI | Custom needs — extend anything |

Most users only need Level 1. The rest is there when you need it.


Level 1: Basic Extraction

With Pydantic (recommended)

import strutex
from pydantic import BaseModel

class Receipt(BaseModel):
    store: str
    date: str
    total: float

receipt = strutex.extract("receipt.jpg", model=Receipt)

With Native Schema

from strutex import extract, Object, String, Number

schema = Object(properties={
    "invoice_number": String,
    "total": Number,
})

result = extract("invoice.pdf", schema=schema)

With Built-in Schemas

from strutex import extract
from strutex.schemas import INVOICE_US, BILL_OF_LADING

invoice = extract("invoice.pdf", model=INVOICE_US)
bol = extract("bl.pdf", model=BILL_OF_LADING)

Available: INVOICE_GENERIC, INVOICE_US, INVOICE_EU, RECEIPT, PURCHASE_ORDER, BILL_OF_LADING, RESUME, BANK_STATEMENT, etc.


Level 2: Reliability Features

Verification & Self-Correction

result = strutex.extract(
    "contract.pdf",
    model=ContractSchema,
    verify=True  # LLM double-checks its work
)
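
To make the idea behind verify=True concrete, here is a minimal, library-independent sketch of an extract-then-audit loop. It is not strutex's implementation: `extract_with_verification`, `extract_once`, `audit`, and `max_rounds` are hypothetical names, and the two stub functions stand in for real LLM calls.

```python
from typing import Callable, Optional


def extract_with_verification(
    extract_once: Callable[[Optional[str]], dict],
    audit: Callable[[dict], Optional[str]],
    max_rounds: int = 2,
) -> dict:
    """Extract, ask an auditor for feedback, and re-extract until it passes."""
    result = extract_once(None)  # first pass, no feedback yet
    for _ in range(max_rounds):
        feedback = audit(result)  # None means "looks correct"
        if feedback is None:
            return result
        result = extract_once(feedback)  # retry with the auditor's complaint
    return result  # best effort after max_rounds corrections


# Stubs standing in for LLM calls (hypothetical behaviour).
def fake_extract(feedback: Optional[str]) -> dict:
    return {"total": 10.0} if feedback is None else {"total": 100.0}


def fake_audit(result: dict) -> Optional[str]:
    return None if result["total"] == 100.0 else "total looks off by a factor of 10"


verified = extract_with_verification(fake_extract, fake_audit)
print(verified)
```

The loop is bounded so that a persistently failing audit cannot trigger unlimited API calls.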

Choosing a Provider

Create a provider instance for full control over API keys and configuration:

from strutex import DocumentProcessor
from strutex import GeminiProvider, OpenAIProvider, AnthropicProvider, OllamaProvider
from strutex.schemas import INVOICE_US

# Pick one provider; each assignment below replaces the previous processor.
# Google Gemini
processor = DocumentProcessor(provider=GeminiProvider(api_key="your-key"))

# OpenAI
processor = DocumentProcessor(provider=OpenAIProvider(api_key="your-key", model="gpt-4o"))

# Anthropic Claude
processor = DocumentProcessor(provider=AnthropicProvider(api_key="your-key"))

# Local with Ollama (no API key needed)
processor = DocumentProcessor(provider=OllamaProvider(model="llama3"))

result = processor.process("doc.pdf", "Extract data", model=INVOICE_US)

Level 3: Scale Features

Caching (reduce API costs)

from strutex import DocumentProcessor
from strutex.cache import SQLiteCache

processor = DocumentProcessor(
    provider="gemini",
    cache=SQLiteCache("cache.db")
)
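
Caching only saves money if the same document, prompt, and schema map to the same key. The sketch below shows one way such a cache could work, using only the standard library. It is an illustration of the idea, not the internals of SQLiteCache; `cache_key` and `TinyCache` are hypothetical names.

```python
import hashlib
import json
import sqlite3
from typing import Optional


def cache_key(file_bytes: bytes, prompt: str, schema: dict) -> str:
    """Hash everything that influences the result into one stable key."""
    h = hashlib.sha256()
    h.update(file_bytes)
    h.update(b"\x00")  # separator so adjacent fields cannot run together
    h.update(prompt.encode("utf-8"))
    h.update(b"\x00")
    h.update(json.dumps(schema, sort_keys=True).encode("utf-8"))
    return h.hexdigest()


class TinyCache:
    """Minimal key/value store backed by SQLite (in-memory by default)."""

    def __init__(self, path: str = ":memory:") -> None:
        self.conn = sqlite3.connect(path)
        self.conn.execute(
            "CREATE TABLE IF NOT EXISTS cache (key TEXT PRIMARY KEY, value TEXT)"
        )

    def get(self, key: str) -> Optional[dict]:
        row = self.conn.execute(
            "SELECT value FROM cache WHERE key = ?", (key,)
        ).fetchone()
        return json.loads(row[0]) if row else None

    def set(self, key: str, value: dict) -> None:
        self.conn.execute(
            "INSERT OR REPLACE INTO cache (key, value) VALUES (?, ?)",
            (key, json.dumps(value)),
        )
        self.conn.commit()


cache = TinyCache()
key = cache_key(b"fake document bytes", "Extract data", {"total": "number"})
miss = cache.get(key)            # nothing stored yet
cache.set(key, {"total": 42.5})  # store the (pretend) LLM result
hit = cache.get(key)             # served from SQLite, no API call needed
```

Hashing the file contents rather than the file name means a renamed copy still hits the cache, while an edited file does not.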

Async Processing

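import asyncio
from strutex import DocumentProcessor, Object, String, Number

# Schema shared by both documents (same native schema as in Level 1)
schema = Object(properties={
    "invoice_number": String,
    "total": Number,
})

async def main():
    processor = DocumentProcessor(provider="anthropic")
    # Run both extractions concurrently
    results = await asyncio.gather(
        processor.aprocess("doc1.pdf", "Extract", schema),
        processor.aprocess("doc2.pdf", "Extract", schema)
    )
    print(results)

asyncio.run(main())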
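
With many documents, an unbounded asyncio.gather can run into provider rate limits. The following sketch caps concurrency with a semaphore and collects failures instead of aborting the whole batch. It uses only asyncio; `process_all`, `fake_process`, and `max_concurrency` are hypothetical names, and `fake_process` stands in for a call such as processor.aprocess.

```python
import asyncio


async def process_all(paths, process_one, max_concurrency=3):
    """Run process_one over paths with at most max_concurrency in flight."""
    semaphore = asyncio.Semaphore(max_concurrency)

    async def guarded(path):
        async with semaphore:  # wait here if too many calls are running
            return await process_one(path)

    # return_exceptions=True keeps one bad file from cancelling the rest
    return await asyncio.gather(
        *(guarded(p) for p in paths), return_exceptions=True
    )


async def fake_process(path):
    """Stand-in for a real async extraction call."""
    await asyncio.sleep(0.01)
    if path.endswith(".bad"):
        raise ValueError(f"cannot parse {path}")
    return {"file": path}


batch_results = asyncio.run(
    process_all(["a.pdf", "b.pdf", "c.bad"], fake_process, max_concurrency=2)
)
for item in batch_results:
    print(item)
```

Results come back in input order, so each entry can be matched to its file, and exceptions appear in place of the failed result.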

Level 4: Extensibility

Plugin System

Everything is pluggable. Just inherit from a base class:

| Type | Purpose | Examples |
|---|---|---|
| Provider | LLM backends | Gemini, OpenAI, Claude, Ollama |
| Extractor | Document parsing | PDF, Image OCR, Excel |
| Validator | Output validation | Schema, sum checks, date formats |
| SecurityPlugin | Input/output protection | Injection detection, sanitization |
| Postprocessor | Data transformation | Date/number normalization |

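# ValidationResult is used by TotalValidator below; importing it from
# strutex.plugins is an assumption, so adjust the path if it lives elsewhere.
from strutex.plugins import Provider, Extractor, Validator, ValidationResult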

# Custom LLM Provider
class MyProvider(Provider):
    """Auto-registered as 'myprovider'"""
    def process(self, file_path, prompt, schema, mime_type, **kwargs):
        # Call your LLM API
        ...

# Custom Document Extractor
class WordExtractor(Extractor, name="word"):
    """Handle .docx files"""
    mime_types = ["application/vnd.openxmlformats-officedocument.wordprocessingml.document"]

    def extract(self, file_path: str) -> str:
        # Parse .docx and return text
        ...

# Custom Validator
class TotalValidator(Validator):
    """Verify line items sum to total"""
    def validate(self, data, schema, context):
        items_sum = sum(item["amount"] for item in data.get("items", []))
        return ValidationResult(
            valid=abs(items_sum - data["total"]) < 0.01,
            message="Line items must sum to total"
        )
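
The TotalValidator above indexes data["total"] directly, which raises KeyError when the model omits the field. Below is a sketch of the same sum check as a plain function that tolerates missing fields. It does not depend on strutex; `line_items_match_total` and `tolerance` are hypothetical names.

```python
def line_items_match_total(data: dict, tolerance: float = 0.01) -> bool:
    """Return True when the line item amounts add up to the stated total."""
    total = data.get("total")
    if total is None:
        return False  # nothing to compare against
    items_sum = sum(item.get("amount", 0.0) for item in data.get("items", []))
    # Compare with a tolerance because floats rarely sum exactly
    return abs(items_sum - total) < tolerance


ok = line_items_match_total(
    {"items": [{"amount": 19.99}, {"amount": 5.01}], "total": 25.00}
)
mismatch = line_items_match_total({"items": [{"amount": 10.0}], "total": 12.0})
missing_total = line_items_match_total({"items": [{"amount": 10.0}]})
print(ok, mismatch, missing_total)
```

A function like this could be called from a validate method, keeping the arithmetic testable on its own.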

CLI Commands

strutex plugins list              # List all plugins
strutex plugins list --type provider
strutex plugins info gemini --type provider

For Distributable Packages

# pyproject.toml
[project.entry-points."strutex.providers"]
my_provider = "my_package:MyProvider"

Hooks System

Inject logic at any point in the processing pipeline:

from strutex import DocumentProcessor

processor = DocumentProcessor(provider="gemini")

@processor.on_pre_process
def add_instructions(file_path, prompt, schema, mime_type, context):
    """Modify prompt before sending to LLM"""
    return {"prompt": prompt + "\nBe precise and thorough."}

@processor.on_post_process
def normalize_dates(result, context):
    """Transform output after extraction"""
    if "date" in result:
        # parse_date is a user-supplied helper (not shown here)
        result["date"] = parse_date(result["date"])
    return result

@processor.on_error
def handle_rate_limit(error, file_path, context):
    """Custom error handling"""
    if "rate limit" in str(error).lower():
        return {"error": "Rate limited, please retry"}
    return None  # Propagate other errors
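
To show how decorator-registered hooks can wrap an extraction step, here is a self-contained sketch of a tiny pipeline. It mirrors the shape of the example above but is not strutex's code: `MiniPipeline`, its `run` method, and the simplified hook signatures are hypothetical.

```python
class MiniPipeline:
    """Toy pipeline that runs pre hooks, an extraction step, then post hooks."""

    def __init__(self):
        self._pre = []
        self._post = []

    def on_pre_process(self, func):
        self._pre.append(func)
        return func  # return func so the decorated name stays usable

    def on_post_process(self, func):
        self._post.append(func)
        return func

    def run(self, prompt, extract):
        for hook in self._pre:
            update = hook(prompt) or {}
            prompt = update.get("prompt", prompt)  # hooks may override the prompt
        result = extract(prompt)
        for hook in self._post:
            result = hook(result) or result  # a hook returning None leaves result as is
        return result


pipeline = MiniPipeline()


@pipeline.on_pre_process
def add_instructions(prompt):
    return {"prompt": prompt + "\nBe precise."}


@pipeline.on_post_process
def uppercase_store(result):
    result["store"] = result["store"].upper()
    return result


# The lambda stands in for the LLM call and echoes the prompt it received
output = pipeline.run("Extract", lambda p: {"store": "acme", "prompt_used": p})
print(output)
```

Hooks run in registration order, so later hooks see the changes made by earlier ones.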

Optional Extras

pip install strutex[cli]          # CLI commands
pip install strutex[ocr]          # OCR support
pip install strutex[langchain]    # LangChain integration
pip install strutex[llamaindex]   # LlamaIndex integration
pip install strutex[all]          # Everything

Supported Formats

| Format | Extensions | Method |
|---|---|---|
| PDF | .pdf | Text extraction with fallback chain |
| Images | .png, .jpg, .tiff | Direct vision or OCR |
| Excel | .xlsx, .xls | Converted to structured text |
| Text | .txt, .csv | Direct input |
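
The "fallback chain" used for PDFs can be pictured as trying extractors in order until one yields usable text. The sketch below illustrates that pattern with stub extractors. It is not strutex's actual chain: `extract_with_fallback` and the three stub functions are hypothetical.

```python
from typing import Callable, Iterable, Optional


def extract_with_fallback(
    path: str, extractors: Iterable[Callable[[str], Optional[str]]]
) -> str:
    """Return text from the first extractor that succeeds with non-empty output."""
    errors = []
    for extractor in extractors:
        try:
            text = extractor(path)
        except Exception as exc:  # a failing extractor should not stop the chain
            errors.append(f"{extractor.__name__}: {exc}")
            continue
        if text and text.strip():
            return text
        errors.append(f"{extractor.__name__}: empty result")
    raise RuntimeError("all extractors failed: " + "; ".join(errors))


# Stub extractors standing in for real parsers
def broken_parser(path: str) -> str:
    raise OSError("corrupt cross-reference table")


def text_layer(path: str) -> str:
    return ""  # a scanned PDF has no embedded text layer


def ocr(path: str) -> str:
    return "INVOICE 42"


text = extract_with_fallback("scan.pdf", [broken_parser, text_layer, ocr])
print(text)
```

Collecting the per-extractor errors makes the final exception useful for debugging when every step fails.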

Full Feature List

  • Plugin System v2 — Auto-registration via inheritance, lazy loading, entry points
  • Hooks — Callbacks and decorators for pre/post processing pipeline
  • CLI Tooling — strutex plugins list|info|refresh commands
  • Multi-Provider LLM Support — Gemini, OpenAI, Anthropic, Ollama, Groq, Langdock
  • Universal Document Support — PDFs, images, Excel, and custom formats
  • Schema-Driven Extraction — Define your output structure, get consistent JSON
  • Verification & Self-Correction — Built-in audit loop for high accuracy
  • Security First — Built-in input sanitization and output validation
  • Framework Integrations — LangChain, LlamaIndex, Haystack compatibility
  • Caching — Memory, SQLite, and file-based caching
  • Async & Batch — Process multiple documents in parallel
  • Streaming — Real-time extraction feedback

Documentation

📚 Read the Docs


Roadmap

See ROADMAP.md for the full development plan.

Recent releases:

  • v0.1.0 — Core functionality
  • v0.2.0 — Plugin registry + Security layer
  • v0.3.0 — Plugin System v2
  • v0.6.0 — Built-in Schemas & Logging
  • v0.7.0 — Providers & Retries
  • v0.8.0 — Async, Batch, Cache, Verification
  • v0.8.1 — Documentation & Coverage Fixes

License

This project is licensed under the GNU General Public License v3.0 — see LICENSE for details.


Contributing

Contributions welcome! Priority areas:

  1. New plugins — Providers, extractors, validators
  2. Documentation — Examples and tutorials
  3. Testing — Expand test coverage
