Structured AI document processing with robust fallback strategies.

strutex

Structured Text Extraction — Extract structured JSON from documents using LLMs

pip install strutex

The Simplest Example

from strutex import extract
from strutex.schemas import INVOICE_US

invoice = extract("invoice.pdf", model=INVOICE_US)
print(invoice.invoice_number, invoice.total)

That's it: one call to extract() with a built-in schema. No custom schema to write.

Schemas are required — but you have options:

Built-in schemas — 10+ ready-to-use (invoices, receipts, shipping docs, resumes)

Native types — Object, String, Number, Array (lightweight, no dependencies)

Pydantic models — Full type safety and validation

What You Can Do

Level	Features	When to use
Basic	`extract()`, schemas	Most use cases — just extract data
Reliability	`verify=True`, validators	Production — ensure accuracy
Scale	caching, async, batch	High volume — reduce costs
Extensibility	plugins, hooks, CLI	Advanced — extend anything

Most users only need Level 1. The rest is there when you need it.

Level 1: Basic Extraction

With Pydantic (recommended)

import strutex
from pydantic import BaseModel

class Receipt(BaseModel):
    store: str
    date: str
    total: float

receipt = strutex.extract("receipt.jpg", model=Receipt)

With Native Schema

from strutex import extract, Object, String, Number

schema = Object(properties={
    "invoice_number": String,
    "total": Number,
})

result = extract("invoice.pdf", schema=schema)
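
To make the idea of a dependency-free schema concrete, here is a minimal sketch in plain Python. It does not use strutex: the `SCHEMA` dictionary and the `check` helper are hypothetical names, and the JSON-Schema-like layout is only an assumption about what an Object/String/Number schema expresses, not the library's internal representation.

```python
# Hypothetical illustration only: this is not strutex code.
# A schema written as plain data, in a JSON-Schema-like shape.
SCHEMA = {
    "type": "object",
    "properties": {
        "invoice_number": {"type": "string"},
        "total": {"type": "number"},
    },
}

# Map schema type names to the Python types that satisfy them.
PY_TYPES = {"string": str, "number": (int, float), "object": dict, "array": list}


def check(data, schema):
    """Return a list of human-readable problems; an empty list means the data fits."""
    kind = schema["type"]
    # bool is a subclass of int, so reject it explicitly for numbers.
    if not isinstance(data, PY_TYPES[kind]) or (kind == "number" and isinstance(data, bool)):
        return [f"expected {kind}, got {type(data).__name__}"]
    problems = []
    for key, sub_schema in schema.get("properties", {}).items():
        if key not in data:
            problems.append(f"missing field: {key}")
        else:
            problems.extend(f"{key}: {p}" for p in check(data[key], sub_schema))
    return problems


print(check({"invoice_number": "INV-1", "total": 12.5}, SCHEMA))
print(check({"invoice_number": 7}, SCHEMA))
```

The first call reports no problems; the second reports a wrong type for `invoice_number` and a missing `total`.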

With Built-in Schemas

from strutex import extract
from strutex.schemas import INVOICE_US, BILL_OF_LADING

invoice = extract("invoice.pdf", model=INVOICE_US)
bol = extract("bl.pdf", model=BILL_OF_LADING)

Available: INVOICE_GENERIC, INVOICE_US, INVOICE_EU, RECEIPT, PURCHASE_ORDER, BILL_OF_LADING, RESUME, BANK_STATEMENT, etc.

Level 2: Reliability Features

Optional Double-Check

Set `verify=True` to have the LLM review its own output after extraction. This step is optional and is intended to improve accuracy:

import strutex

# ContractSchema is a placeholder for your own Pydantic model or schema
result = strutex.extract(
    "contract.pdf",
    model=ContractSchema,
    verify=True  # LLM reviews its own output
)
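
The general shape of such a double-check can be sketched without any LLM at all. The code below is an assumption about how a review pass might be structured, not a description of what strutex does internally: `extract_with_review`, `fake_extract`, and `fake_review` are hypothetical names, and the two "LLM" calls are stubbed so the example runs offline.

```python
from typing import Callable


def extract_with_review(
    extract_fn: Callable[[str], dict],
    review_fn: Callable[[str, dict], dict],
    document: str,
    max_rounds: int = 2,
) -> dict:
    """Extract once, then let a reviewer correct the draft until it stops changing."""
    result = extract_fn(document)
    for _ in range(max_rounds):
        reviewed = review_fn(document, result)
        if reviewed == result:  # reviewer found nothing to change
            break
        result = reviewed
    return result


# Stubbed "LLM" calls (hypothetical) so the sketch runs without network access.
def fake_extract(document: str) -> dict:
    return {"total": "100"}  # draft has the wrong type: string instead of number


def fake_review(document: str, draft: dict) -> dict:
    return {**draft, "total": float(draft["total"])}


final = extract_with_review(fake_extract, fake_review, "contract text")
print(final)
```

Capping the number of rounds keeps the cost bounded: each review pass is another model call, so a small `max_rounds` is usually a sensible default.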

Choosing a Provider

Create a provider instance for full control over API keys and configuration:

from strutex import DocumentProcessor
from strutex import GeminiProvider, OpenAIProvider, AnthropicProvider, OllamaProvider
from strutex.schemas import INVOICE_US

# Google Gemini
processor = DocumentProcessor(provider=GeminiProvider(api_key="your-key"))

# OpenAI
processor = DocumentProcessor(provider=OpenAIProvider(api_key="your-key", model="gpt-4o"))

# Anthropic Claude
processor = DocumentProcessor(provider=AnthropicProvider(api_key="your-key"))

# Local with Ollama (no API key needed)
processor = DocumentProcessor(provider=OllamaProvider(model="llama3"))

result = processor.process("doc.pdf", "Extract data", model=INVOICE_US)

Note: String providers like provider="gemini" are convenience shortcuts that assume correct environment variables. For production, explicit provider instances are recommended.
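
If you do rely on environment variables, it can help to fail early with a clear message instead of discovering a missing key on the first API call. The helper below is a small sketch using only the standard library; `require_env` is a hypothetical name, and the variable name you pass to it depends on your own setup (this README does not state which names the string shortcuts read).

```python
import os


def require_env(name: str) -> str:
    """Return the value of an environment variable, or fail with a clear message."""
    value = os.environ.get(name, "").strip()
    if not value:
        raise RuntimeError(f"Environment variable {name} is not set")
    return value
```

The returned value can then be passed explicitly, for example as the `api_key` argument of a provider instance as shown above.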

Level 3: Scale Features

Caching (reduce API costs)

from strutex import DocumentProcessor
from strutex.cache import SQLiteCache

processor = DocumentProcessor(
    provider="gemini",
    cache=SQLiteCache("cache.db")
)
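
For intuition, a result cache of this kind can be sketched with the standard library alone. The class below is not the strutex `SQLiteCache`; `TinyResultCache` and its key scheme (a hash of file content, prompt, and schema name) are assumptions chosen for illustration.

```python
import hashlib
import json
import sqlite3


class TinyResultCache:
    """Hypothetical sketch: store extraction results keyed by a content hash."""

    def __init__(self, path: str = ":memory:"):
        self.conn = sqlite3.connect(path)
        self.conn.execute(
            "CREATE TABLE IF NOT EXISTS results (key TEXT PRIMARY KEY, value TEXT NOT NULL)"
        )

    @staticmethod
    def make_key(content: bytes, prompt: str, schema_name: str) -> str:
        digest = hashlib.sha256()
        for part in (content, prompt.encode("utf-8"), schema_name.encode("utf-8")):
            # Length-prefix each part so different splits cannot collide.
            digest.update(len(part).to_bytes(8, "big"))
            digest.update(part)
        return digest.hexdigest()

    def get(self, key: str):
        row = self.conn.execute(
            "SELECT value FROM results WHERE key = ?", (key,)
        ).fetchone()
        return json.loads(row[0]) if row else None

    def put(self, key: str, result: dict) -> None:
        with self.conn:  # commits on success, rolls back on error
            self.conn.execute(
                "INSERT OR REPLACE INTO results (key, value) VALUES (?, ?)",
                (key, json.dumps(result)),
            )


cache = TinyResultCache()
key = cache.make_key(b"%PDF-demo", "Extract data", "INVOICE_US")
if cache.get(key) is None:
    cache.put(key, {"invoice_number": "INV-1", "total": 12.5})
print(cache.get(key))
```

Hashing the file content rather than the file name means a renamed copy of the same document still hits the cache, while a changed prompt or schema produces a new key.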

Async Processing

import asyncio
from strutex import DocumentProcessor

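async def main():
    processor = DocumentProcessor(provider="anthropic")
    # `schema` is a schema defined as in the Level 1 examples above
    # gather() runs both extractions concurrently and returns results in order
    results = await asyncio.gather(
        processor.aprocess("doc1.pdf", "Extract", schema),
        processor.aprocess("doc2.pdf", "Extract", schema)
    )
    return results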

asyncio.run(main())
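
With many documents, an unbounded `gather` can run into provider rate limits. One common pattern is to cap concurrency with a semaphore and collect failures instead of aborting the whole batch. The sketch below uses only `asyncio`; `process_all` and `fake_worker` are hypothetical names, and the worker is a stub standing in for a call such as `processor.aprocess(...)`.

```python
import asyncio


async def process_all(paths, worker, limit=3):
    """Run worker(path) for every path, at most `limit` at a time, keeping input order."""
    semaphore = asyncio.Semaphore(limit)

    async def guarded(path):
        async with semaphore:
            return await worker(path)

    # return_exceptions=True puts failures in the result list instead of raising.
    return await asyncio.gather(*(guarded(p) for p in paths), return_exceptions=True)


async def fake_worker(path):
    """Stub for a real extraction call; fails for files ending in .bad."""
    await asyncio.sleep(0.01)
    if path.endswith(".bad"):
        raise ValueError(f"cannot read {path}")
    return {"file": path}


results = asyncio.run(process_all(["a.pdf", "b.bad", "c.pdf"], fake_worker, limit=2))
for item in results:
    print("failed:" if isinstance(item, Exception) else "ok:", item)
```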

Level 4: Extensibility

Plugin System

Everything is pluggable. Just inherit from a base class:

Type	Purpose	Examples
`Provider`	LLM backends	Gemini, OpenAI, Claude, Ollama
`Extractor`	Document parsing	PDF, Image OCR, Excel
`Validator`	Output validation	Schema, sum checks, date formats
`SecurityPlugin`	Input/output protection	Injection detection, sanitization
`Postprocessor`	Data transformation	Date/number normalization

from strutex.plugins import Provider, Extractor, Validator

# Custom LLM Provider
class MyProvider(Provider):
    """Auto-registered as 'myprovider'"""
    def process(self, file_path, prompt, schema, mime_type, **kwargs):
        # Call your LLM API
        ...

# Custom Document Extractor
class WordExtractor(Extractor, name="word"):
    """Handle .docx files"""
    mime_types = ["application/vnd.openxmlformats-officedocument.wordprocessingml.document"]

    def extract(self, file_path: str) -> str:
        # Parse .docx and return text
        ...

# Custom Validator
# Note: ValidationResult must also be imported from strutex; its import path is not shown here
class TotalValidator(Validator):
    """Verify line items sum to total"""
    def validate(self, data, schema, context):
        items_sum = sum(item["amount"] for item in data.get("items", []))
        return ValidationResult(
            valid=abs(items_sum - data["total"]) < 0.01,
            message="Line items must sum to total"
        )
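
Auto-registration by inheritance is commonly built on Python's `__init_subclass__` hook. The sketch below shows that general technique in isolation; it is an assumption about the mechanism, not strutex source code, and `PluginBase` and `registry` are hypothetical names.

```python
class PluginBase:
    """Hypothetical base class: every subclass registers itself on definition."""

    registry: dict = {}

    def __init_subclass__(cls, name=None, **kwargs):
        super().__init_subclass__(**kwargs)
        # Use the explicit name if given, otherwise the lowercased class name.
        key = name or cls.__name__.lower()
        PluginBase.registry[key] = cls


class MyProvider(PluginBase):
    """Registered as 'myprovider' (derived from the class name)."""


class WordExtractor(PluginBase, name="word"):
    """Registered as 'word' (explicit name passed as a class keyword)."""


print(sorted(PluginBase.registry))
```

Because registration happens when the class statement executes, a plugin only becomes visible once the module defining it has been imported.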

CLI Commands

strutex plugins list              # List all plugins
strutex plugins list --type provider
strutex plugins info gemini --type provider

For Distributable Packages

# pyproject.toml
[project.entry-points."strutex.providers"]
my_provider = "my_package:MyProvider"

Hooks System

Inject logic at any point in the processing pipeline:

from strutex import DocumentProcessor

processor = DocumentProcessor(provider="gemini")

@processor.on_pre_process
def add_instructions(file_path, prompt, schema, mime_type, context):
    """Modify prompt before sending to LLM"""
    return {"prompt": prompt + "\nBe precise and thorough."}

@processor.on_post_process
def normalize_dates(result, context):
    """Transform output after extraction"""
    # parse_date is a placeholder for your own date-parsing helper
    if "date" in result:
        result["date"] = parse_date(result["date"])
    return result

@processor.on_error
def handle_rate_limit(error, file_path, context):
    """Custom error handling"""
    if "rate limit" in str(error).lower():
        return {"error": "Rate limited, please retry"}
    return None  # Propagate other errors
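
To show how decorator-based hooks like these can fit together, here is a self-contained sketch of a tiny pipeline. It is an illustration under assumptions, not the strutex implementation: `HookedPipeline` and `fake_llm` are hypothetical, and the hook signatures are simplified to `(prompt, context)`, `(result, context)`, and `(error, context)`.

```python
class HookedPipeline:
    """Hypothetical mini-pipeline with pre-process, post-process, and error hooks."""

    def __init__(self, run):
        self._run = run  # callable(prompt) -> dict, standing in for the LLM call
        self._pre, self._post, self._err = [], [], []

    def on_pre_process(self, fn):
        self._pre.append(fn)
        return fn

    def on_post_process(self, fn):
        self._post.append(fn)
        return fn

    def on_error(self, fn):
        self._err.append(fn)
        return fn

    def process(self, prompt):
        context = {}
        try:
            for hook in self._pre:
                updates = hook(prompt, context) or {}
                prompt = updates.get("prompt", prompt)
            result = self._run(prompt)
            for hook in self._post:
                new_result = hook(result, context)
                if new_result is not None:
                    result = new_result
            return result
        except Exception as error:
            for hook in self._err:
                fallback = hook(error, context)
                if fallback is not None:  # first hook with an answer wins
                    return fallback
            raise  # no hook handled it: propagate


def fake_llm(prompt):
    if "fail" in prompt:
        raise RuntimeError("rate limit exceeded")
    return {"prompt_seen": prompt}


pipeline = HookedPipeline(fake_llm)


@pipeline.on_pre_process
def add_instructions(prompt, context):
    return {"prompt": prompt + "\nBe precise."}


@pipeline.on_post_process
def mark_checked(result, context):
    result["checked"] = True
    return result


@pipeline.on_error
def handle_rate_limit(error, context):
    if "rate limit" in str(error).lower():
        return {"error": "Rate limited, please retry"}
    return None


print(pipeline.process("Extract"))
print(pipeline.process("fail"))
```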

Optional Extras

pip install "strutex[cli]"          # CLI commands
pip install "strutex[ocr]"          # OCR support
pip install "strutex[langchain]"    # LangChain integration
pip install "strutex[llamaindex]"   # LlamaIndex integration
pip install "strutex[all]"          # Everything

Supported Formats

Format	Extensions	Method
PDF	`.pdf`	Text extraction with fallback chain
Images	`.png`, `.jpg`, `.tiff`	Direct vision or OCR
Excel	`.xlsx`, `.xls`	Converted to structured text
Text	`.txt`, `.csv`	Direct input
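
A fallback chain, as mentioned for PDFs, generally means trying several extraction methods in order and using the first one that yields usable text. The sketch below shows that pattern with stub extractors; the function names and the order of methods are assumptions for illustration, not the chain strutex actually uses.

```python
def extract_with_fallback(path, extractors):
    """Try each (name, extractor) in order; return the first non-empty text."""
    errors = []
    for name, extractor in extractors:
        try:
            text = extractor(path)
        except Exception as exc:
            errors.append(f"{name}: {exc}")
            continue
        if text and text.strip():
            return name, text
        errors.append(f"{name}: empty output")
    raise RuntimeError("all extractors failed: " + "; ".join(errors))


# Stub extractors (hypothetical) standing in for real parsing and OCR.
def broken_parser(path):
    raise ValueError("corrupt file")


def text_layer(path):
    return ""  # e.g. a scanned PDF with no embedded text


def ocr(path):
    return "Invoice 42"


chain = [("parser", broken_parser), ("text_layer", text_layer), ("ocr", ocr)]
used, text = extract_with_fallback("scan.pdf", chain)
print(used, text)
```

Collecting the per-method errors makes the final failure message actionable when every method fails.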

Full Feature List

Plugin System v2 — Auto-registration via inheritance, lazy loading, entry points
Hooks — Callbacks and decorators for pre/post processing pipeline
CLI Tooling — strutex plugins list|info|refresh commands
Multi-Provider LLM Support — Gemini, OpenAI, Anthropic, Ollama, Groq, Langdock
Universal Document Support — PDFs, images, Excel, and custom formats
Schema-Driven Extraction — Define your output structure, get consistent JSON
Verification & Self-Correction — Built-in audit loop for high accuracy
Security First — Built-in input sanitization and output validation
Framework Integrations — LangChain, LlamaIndex, Haystack compatibility
Caching — Memory, SQLite, and file-based caching
Async & Batch — Process multiple documents in parallel
Streaming — Real-time extraction feedback

Documentation

📚 Read the Docs

Roadmap

See ROADMAP.md for the full development plan.

Recent releases:

v0.1.0 — Core functionality
v0.2.0 — Plugin registry + Security layer
v0.3.0 — Plugin System v2
v0.6.0 — Built-in Schemas & Logging
v0.7.0 — Providers & Retries
v0.8.0 — Async, Batch, Cache, Verification
v0.8.1 — Documentation & Coverage Fixes

License

This project is licensed under the GNU General Public License v3.0 — see LICENSE for details.

Contributing

Contributions welcome! Priority areas:

New plugins — Providers, extractors, validators
Documentation — Examples and tutorials
Testing — Expand test coverage

Release history

Version	Released
1.3.7	Jan 10, 2026
1.2.0	Jan 10, 2026
1.1.0	Dec 31, 2025
1.0.1	Dec 28, 2025
1.0.0 (this version)	Dec 28, 2025
0.9.3	Dec 28, 2025
0.9.2	Dec 28, 2025
0.9.1	Dec 27, 2025
0.9.0	Dec 27, 2025
0.8.8	Dec 27, 2025
0.8.7	Dec 27, 2025
0.8.6	Dec 27, 2025
0.8.5	Dec 27, 2025
0.8.1	Dec 26, 2025
0.8.0	Dec 26, 2025
0.5.2	Dec 25, 2025
0.5.1	Dec 25, 2025
0.5.0	Dec 25, 2025
0.4.2	Dec 24, 2025
0.4.1	Dec 24, 2025

Download files

Source Distribution

strutex-1.0.0.tar.gz (107.1 kB), uploaded Dec 28, 2025, Source

Built Distribution

strutex-1.0.0-py3-none-any.whl (143.0 kB), uploaded Dec 28, 2025, Python 3

File details

Details for the file strutex-1.0.0.tar.gz.

File metadata

Download URL: strutex-1.0.0.tar.gz
Upload date: Dec 28, 2025
Size: 107.1 kB
Tags: Source
Uploaded using Trusted Publishing? No
Uploaded via: twine/6.1.0 CPython/3.13.7

File hashes

Hashes for strutex-1.0.0.tar.gz
Algorithm	Hash digest
SHA256	`9ad2758ee0f8d28604507c2c63a946d5fb87f2ed53e541ee6698231d573aae81`
MD5	`a1bf1d66c666cbdc4bf24be447a8d6bd`
BLAKE2b-256	`9c4e917ffb506ba71e1dae8baccdb6e2891f04a35d107d3b23123d1fccf6f68a`

File details

Details for the file strutex-1.0.0-py3-none-any.whl.

File metadata

Download URL: strutex-1.0.0-py3-none-any.whl
Upload date: Dec 28, 2025
Size: 143.0 kB
Tags: Python 3
Uploaded using Trusted Publishing? No
Uploaded via: twine/6.1.0 CPython/3.13.7

File hashes

Hashes for strutex-1.0.0-py3-none-any.whl
Algorithm	Hash digest
SHA256	`c878f0110105495e9c23d9c70e89197aad39e38b4db4067a4d60e514576f82dc`
MD5	`6e246d09299acd0c41c7fe001a6d62ca`
BLAKE2b-256	`7c2a7557244d8e25052392855a1400af7a6b3f82a3af773a2a68c851d234f70a`
