Structured AI document processing with robust fallback strategies.
strutex
Structured Text Extraction — Extract structured JSON from documents using LLMs
pip install strutex
The Simplest Example
from strutex import extract
from strutex.schemas import INVOICE_US
invoice = extract("invoice.pdf", model=INVOICE_US)
print(invoice.invoice_number, invoice.total)
That's it: two imports, two statements, and no custom schema to write.
Schemas are required — but you have options:
- Built-in schemas: 10+ ready-to-use (invoices, receipts, shipping docs, resumes)
- Native types: Object, String, Number, Array (lightweight, no dependencies)
- Pydantic models: full type safety and validation
What You Can Do
| Level | Features | When to use |
|---|---|---|
| Basic | extract(), schemas | Most use cases: just extract data |
| Reliability | verify=True, validators | Production: ensure accuracy |
| Scale | caching, async, batch | High volume: reduce costs |
| Extensibility | plugins, hooks, CLI | Advanced: extend anything |
Most users only need Level 1. The rest is there when you need it.
Level 1: Basic Extraction
With Pydantic (recommended)
import strutex
from pydantic import BaseModel
class Receipt(BaseModel):
    store: str
    date: str
    total: float
receipt = strutex.extract("receipt.jpg", model=Receipt)
With Native Schema
from strutex import extract, Object, String, Number
schema = Object(properties={
    "invoice_number": String,
    "total": Number,
})
result = extract("invoice.pdf", schema=schema)
With Built-in Schemas
from strutex import extract
from strutex.schemas import INVOICE_US, BILL_OF_LADING
invoice = extract("invoice.pdf", model=INVOICE_US)
bol = extract("bl.pdf", model=BILL_OF_LADING)
Available: INVOICE_GENERIC, INVOICE_US, INVOICE_EU, RECEIPT, PURCHASE_ORDER, BILL_OF_LADING, RESUME, BANK_STATEMENT, etc.
Level 2: Reliability Features
Optional Double-Check
Ask the LLM to review its own output in a second pass. This step is optional and aims to improve accuracy:
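import strutex

# ContractSchema is a placeholder for your own schema (for example, a Pydantic model).
result = strutex.extract(
    "contract.pdf",
    model=ContractSchema,
    verify=True,  # LLM reviews its own output
)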
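One way to picture what verify=True does is a second pass in which the model sees the source text together with its own draft and returns a confirmed or corrected result. The sketch below is not strutex's implementation: `extract_with_verification`, `extract_fn`, and `review_fn` are hypothetical names, and the two stub functions stand in for real LLM calls so the example runs without an API key.

```python
from typing import Any, Callable, Dict

Extraction = Dict[str, Any]


def extract_with_verification(
    document_text: str,
    extract_fn: Callable[[str], Extraction],
    review_fn: Callable[[str, Extraction], Extraction],
    verify: bool = False,
) -> Extraction:
    """Run one extraction pass, optionally followed by one review pass.

    Both callables are hypothetical stand-ins for LLM calls.
    """
    draft = extract_fn(document_text)
    if not verify:
        return draft
    # The reviewer receives the source text and the draft, and returns
    # a (possibly corrected) copy of the draft.
    return review_fn(document_text, draft)


def fake_extract(text: str) -> Extraction:
    # Deliberately wrong term so the review pass has something to fix.
    return {"party": "ACME Corp", "term_months": 21}


def fake_review(text: str, draft: Extraction) -> Extraction:
    corrected = dict(draft)
    if "12 months" in text:
        corrected["term_months"] = 12
    return corrected


contract = "Agreement between ACME Corp and Bolt Ltd for 12 months."
unverified = extract_with_verification(contract, fake_extract, fake_review)
verified = extract_with_verification(contract, fake_extract, fake_review, verify=True)
```

In this toy run the unverified result keeps the wrong term while the verified one is corrected, which is the trade-off the flag is meant to expose: an extra pass in exchange for a chance to catch mistakes.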
Choosing a Provider
Create a provider instance for full control over API keys and configuration:
from strutex import DocumentProcessor
from strutex import GeminiProvider, OpenAIProvider, AnthropicProvider, OllamaProvider
from strutex.schemas import INVOICE_US
# Google Gemini
processor = DocumentProcessor(provider=GeminiProvider(api_key="your-key"))
# OpenAI
processor = DocumentProcessor(provider=OpenAIProvider(api_key="your-key", model="gpt-4o"))
# Anthropic Claude
processor = DocumentProcessor(provider=AnthropicProvider(api_key="your-key"))
# Local with Ollama (no API key needed)
processor = DocumentProcessor(provider=OllamaProvider(model="llama3"))
result = processor.process("doc.pdf", "Extract data", model=INVOICE_US)
Note: String providers like provider="gemini" are convenience shortcuts that assume the relevant environment variables are set correctly. For production, explicit provider instances are recommended.
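To illustrate why string shortcuts are fragile, here is a minimal sketch of a resolver that fails fast when the expected environment variable is missing. Everything in it is an assumption made for illustration: `resolve_api_key` is a hypothetical helper, and the variable names in `ASSUMED_ENV_VARS` are guesses, not names strutex is documented to read.

```python
import os
from typing import Mapping, Optional

# Hypothetical mapping from provider shortcut to environment variable.
# These variable names are assumptions for illustration only.
ASSUMED_ENV_VARS = {
    "gemini": "GEMINI_API_KEY",
    "openai": "OPENAI_API_KEY",
    "anthropic": "ANTHROPIC_API_KEY",
}


def resolve_api_key(provider: str, env: Optional[Mapping[str, str]] = None) -> str:
    """Return the API key for a provider shortcut, or raise a clear error."""
    env = os.environ if env is None else env
    if provider not in ASSUMED_ENV_VARS:
        raise ValueError(f"Unknown provider shortcut: {provider!r}")
    var_name = ASSUMED_ENV_VARS[provider]
    key = env.get(var_name, "").strip()
    if not key:
        raise RuntimeError(
            f"{var_name} is not set; pass an explicit provider instance instead"
        )
    return key


# Passing a mapping instead of os.environ keeps the example deterministic.
key = resolve_api_key("openai", {"OPENAI_API_KEY": "sk-test"})
```

Passing the key explicitly to a provider instance, as in the examples above, avoids this lookup entirely.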
Level 3: Scale Features
Caching (reduce API costs)
from strutex import DocumentProcessor
from strutex.cache import SQLiteCache
processor = DocumentProcessor(
    provider="gemini",
    cache=SQLiteCache("cache.db"),
)
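The idea behind a result cache can be sketched with the standard library alone: key each result by a hash of the document bytes and the prompt, and call the LLM only on a miss. This is an illustration under stated assumptions, not how SQLiteCache is implemented; `ContentHashCache` and `fake_llm` are hypothetical names.

```python
import hashlib
import json
import sqlite3
from typing import Any, Callable, Dict, List


class ContentHashCache:
    """Hypothetical SQLite cache keyed by document content and prompt."""

    def __init__(self, path: str = ":memory:") -> None:
        self._db = sqlite3.connect(path)
        self._db.execute(
            "CREATE TABLE IF NOT EXISTS results "
            "(key TEXT PRIMARY KEY, value TEXT NOT NULL)"
        )

    @staticmethod
    def make_key(document: bytes, prompt: str) -> str:
        # Hash the document first; the fixed-length hex digest makes the
        # combined "digest:prompt" string unambiguous.
        doc_digest = hashlib.sha256(document).hexdigest()
        combined = f"{doc_digest}:{prompt}".encode("utf-8")
        return hashlib.sha256(combined).hexdigest()

    def get_or_compute(
        self,
        document: bytes,
        prompt: str,
        compute: Callable[[bytes, str], Dict[str, Any]],
    ) -> Dict[str, Any]:
        key = self.make_key(document, prompt)
        row = self._db.execute(
            "SELECT value FROM results WHERE key = ?", (key,)
        ).fetchone()
        if row is not None:
            return json.loads(row[0])  # cache hit: no LLM call
        result = compute(document, prompt)
        self._db.execute(
            "INSERT INTO results (key, value) VALUES (?, ?)",
            (key, json.dumps(result)),
        )
        self._db.commit()
        return result


calls: List[str] = []


def fake_llm(document: bytes, prompt: str) -> Dict[str, Any]:
    calls.append(prompt)  # record each real "API call"
    return {"total": 42.5}


cache = ContentHashCache()
first = cache.get_or_compute(b"%PDF-fake", "Extract", fake_llm)
second = cache.get_or_compute(b"%PDF-fake", "Extract", fake_llm)
```

The second lookup is served from SQLite, so `fake_llm` runs only once. Including the prompt in the key means a changed prompt triggers a fresh extraction.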
Async Processing
import asyncio
from strutex import DocumentProcessor
async def main():
    processor = DocumentProcessor(provider="anthropic")
    # `schema` is a schema defined as in the earlier examples.
    results = await asyncio.gather(
        processor.aprocess("doc1.pdf", "Extract", schema),
        processor.aprocess("doc2.pdf", "Extract", schema),
    )

asyncio.run(main())
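When the batch grows beyond a handful of files, unbounded `asyncio.gather` can run into provider rate limits. A common remedy is to cap the number of calls in flight with a semaphore. The sketch below shows that pattern with a stub coroutine; `process_all` and `fake_aprocess` are hypothetical names, and the stub replaces the real `aprocess` call so the example runs offline.

```python
import asyncio
from typing import Any, Awaitable, Callable, Dict, List, Sequence

# Shared counters used only to observe concurrency in this demo.
stats = {"in_flight": 0, "peak": 0}


async def process_all(
    paths: Sequence[str],
    worker: Callable[[str], Awaitable[Dict[str, Any]]],
    max_concurrency: int = 2,
) -> List[Dict[str, Any]]:
    """Run worker over paths with at most max_concurrency calls in flight."""
    semaphore = asyncio.Semaphore(max_concurrency)

    async def guarded(path: str) -> Dict[str, Any]:
        async with semaphore:
            return await worker(path)

    # gather returns results in the same order as the input paths.
    return await asyncio.gather(*(guarded(p) for p in paths))


async def fake_aprocess(path: str) -> Dict[str, Any]:
    stats["in_flight"] += 1
    stats["peak"] = max(stats["peak"], stats["in_flight"])
    await asyncio.sleep(0.01)  # simulate network latency
    stats["in_flight"] -= 1
    return {"file": path}


results = asyncio.run(
    process_all(["doc1.pdf", "doc2.pdf", "doc3.pdf"], fake_aprocess)
)
```

The limit of two is arbitrary here; a real value would depend on the provider's rate limits.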
Level 4: Extensibility
Plugin System
Everything is pluggable. Just inherit from a base class:
| Type | Purpose | Examples |
|---|---|---|
| Provider | LLM backends | Gemini, OpenAI, Claude, Ollama |
| Extractor | Document parsing | PDF, Image OCR, Excel |
| Validator | Output validation | Schema, sum checks, date formats |
| SecurityPlugin | Input/output protection | Injection detection, sanitization |
| Postprocessor | Data transformation | Date/number normalization |
from strutex.plugins import Provider, Extractor, Validator
# ValidationResult (used below) must also be imported; its import path is not shown here.

# Custom LLM Provider
class MyProvider(Provider):
    """Auto-registered as 'myprovider'"""

    def process(self, file_path, prompt, schema, mime_type, **kwargs):
        # Call your LLM API
        ...

# Custom Document Extractor
class WordExtractor(Extractor, name="word"):
    """Handle .docx files"""
    mime_types = ["application/vnd.openxmlformats-officedocument.wordprocessingml.document"]

    def extract(self, file_path: str) -> str:
        # Parse .docx and return text
        ...

# Custom Validator
class TotalValidator(Validator):
    """Verify line items sum to total"""

    def validate(self, data, schema, context):
        items_sum = sum(item["amount"] for item in data.get("items", []))
        return ValidationResult(
            valid=abs(items_sum - data["total"]) < 0.01,
            message="Line items must sum to total",
        )
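Auto-registration by inheritance, including the `name="word"` class keyword shown above, can be built on Python's `__init_subclass__` hook. The following is a minimal sketch of that technique, not strutex's actual code; the `Plugin` base class and its `registry` attribute are hypothetical.

```python
from typing import Dict, Optional, Type


class Plugin:
    """Hypothetical base class that records every subclass in a registry."""

    registry: Dict[str, Type["Plugin"]] = {}
    plugin_name: str = ""

    def __init_subclass__(cls, name: Optional[str] = None, **kwargs) -> None:
        super().__init_subclass__(**kwargs)
        # Use the explicit name= keyword if given, otherwise fall back
        # to the lowercased class name.
        cls.plugin_name = name or cls.__name__.lower()
        Plugin.registry[cls.plugin_name] = cls


class MyProvider(Plugin):
    """Registered under the default name 'myprovider'."""


class WordExtractor(Plugin, name="word"):
    """Registered under the explicit name 'word'."""


registered_names = sorted(Plugin.registry)
```

Because registration happens at class-definition time, importing the module that defines a plugin is enough to make it discoverable by name.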
CLI Commands
strutex plugins list # List all plugins
strutex plugins list --type provider
strutex plugins info gemini --type provider
For Distributable Packages
# pyproject.toml
[project.entry-points."strutex.providers"]
my_provider = "my_package:MyProvider"
Hooks System
Inject logic at any point in the processing pipeline:
from strutex import DocumentProcessor

processor = DocumentProcessor(provider="gemini")

@processor.on_pre_process
def add_instructions(file_path, prompt, schema, mime_type, context):
    """Modify prompt before sending to LLM"""
    return {"prompt": prompt + "\nBe precise and thorough."}

@processor.on_post_process
def normalize_dates(result, context):
    """Transform output after extraction"""
    if "date" in result:
        # parse_date is a placeholder for your own date-parsing helper
        result["date"] = parse_date(result["date"])
    return result

@processor.on_error
def handle_rate_limit(error, file_path, context):
    """Custom error handling"""
    if "rate limit" in str(error).lower():
        return {"error": "Rate limited, please retry"}
    return None  # Propagate other errors
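To make the control flow of these hooks concrete, here is a small self-contained pipeline that registers hooks through decorators and runs them around a stubbed LLM call. `HookedPipeline` and `fake_llm` are hypothetical, and the hook signatures are simplified compared with the ones above; the real DocumentProcessor may order or combine hooks differently.

```python
from typing import Any, Callable, Dict, List


class HookedPipeline:
    """Hypothetical pipeline with decorator-registered hooks."""

    def __init__(self, run: Callable[[str], Dict[str, Any]]) -> None:
        self._run = run
        self._pre: List[Callable] = []
        self._post: List[Callable] = []
        self._error: List[Callable] = []

    def on_pre_process(self, func: Callable) -> Callable:
        self._pre.append(func)
        return func  # return func so the decorated name stays usable

    def on_post_process(self, func: Callable) -> Callable:
        self._post.append(func)
        return func

    def on_error(self, func: Callable) -> Callable:
        self._error.append(func)
        return func

    def process(self, prompt: str) -> Dict[str, Any]:
        for hook in self._pre:
            overrides = hook(prompt) or {}
            prompt = overrides.get("prompt", prompt)
        try:
            result = self._run(prompt)
        except Exception as error:
            for hook in self._error:
                fallback = hook(error)
                if fallback is not None:
                    return fallback  # first handler with a result wins
            raise  # no handler produced a result: propagate
        for hook in self._post:
            result = hook(result)
        return result


def fake_llm(prompt: str) -> Dict[str, Any]:
    if "fail" in prompt:
        raise RuntimeError("Rate limit exceeded")
    return {"prompt_seen": prompt, "date": " 2024-01-31 "}


pipeline = HookedPipeline(fake_llm)


@pipeline.on_pre_process
def add_instructions(prompt: str) -> Dict[str, str]:
    return {"prompt": prompt + "\nBe precise."}


@pipeline.on_post_process
def strip_date(result: Dict[str, Any]) -> Dict[str, Any]:
    result["date"] = result["date"].strip()
    return result


@pipeline.on_error
def handle_rate_limit(error: Exception):
    if "rate limit" in str(error).lower():
        return {"error": "Rate limited, please retry"}
    return None


ok = pipeline.process("Extract")
limited = pipeline.process("fail")
```

Returning `None` from an error hook lets the exception propagate, which mirrors the comment in the `handle_rate_limit` example above.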
Optional Extras
pip install strutex[cli] # CLI commands
pip install strutex[ocr] # OCR support
pip install strutex[langchain] # LangChain integration
pip install strutex[llamaindex] # LlamaIndex integration
pip install strutex[all] # Everything
Supported Formats
| Format | Extensions | Method |
|---|---|---|
| PDF | .pdf | Text extraction with fallback chain |
| Images | .png, .jpg, .tiff | Direct vision or OCR |
| Excel | .xlsx, .xls | Converted to structured text |
| Text | .txt, .csv | Direct input |
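A fallback chain like the one listed for PDFs can be pictured as a list of extractors tried in order until one yields usable text. The sketch below shows the pattern only; `extract_with_fallback`, `ExtractionFailed`, and the two stub extractors are hypothetical and do not reflect which extractors strutex actually chains.

```python
from typing import Callable, List, Sequence, Tuple


class ExtractionFailed(Exception):
    """Raised when every extractor in the chain fails."""


def extract_with_fallback(
    path: str,
    extractors: Sequence[Tuple[str, Callable[[str], str]]],
) -> Tuple[str, str]:
    """Try each (name, extractor) in order; return the first non-empty text."""
    errors: List[str] = []
    for name, extractor in extractors:
        try:
            text = extractor(path)
        except Exception as exc:
            errors.append(f"{name}: {exc}")
            continue
        if text.strip():
            return name, text
        errors.append(f"{name}: empty output")
    raise ExtractionFailed("; ".join(errors))


def text_layer(path: str) -> str:
    # Stub for a scanned PDF that has no embedded text.
    raise ValueError("no text layer")


def ocr(path: str) -> str:
    # Stub for an OCR pass that succeeds.
    return "INVOICE 1001"


used, text = extract_with_fallback(
    "scan.pdf", [("text_layer", text_layer), ("ocr", ocr)]
)
```

Collecting the per-extractor errors makes the final exception useful for debugging when the whole chain fails.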
Full Feature List
- Plugin System v2: Auto-registration via inheritance, lazy loading, entry points
- Hooks: Callbacks and decorators for pre/post processing pipeline
- CLI Tooling: strutex plugins list|info|refresh commands
- Multi-Provider LLM Support: Gemini, OpenAI, Anthropic, Ollama, Groq, Langdock
- Universal Document Support: PDFs, images, Excel, and custom formats
- Schema-Driven Extraction: Define your output structure, get consistent JSON
- Verification & Self-Correction: Built-in audit loop for high accuracy
- Security First: Built-in input sanitization and output validation
- Framework Integrations: LangChain, LlamaIndex, Haystack compatibility
- Caching: Memory, SQLite, and file-based caching
- Async & Batch: Process multiple documents in parallel
- Streaming: Real-time extraction feedback
Roadmap
See ROADMAP.md for the full development plan.
Recent releases:
- v0.1.0 — Core functionality
- v0.2.0 — Plugin registry + Security layer
- v0.3.0 — Plugin System v2
- v0.6.0 — Built-in Schemas & Logging
- v0.7.0 — Providers & Retries
- v0.8.0 — Async, Batch, Cache, Verification
- v0.8.1 — Documentation & Coverage Fixes
License
This project is licensed under the GNU General Public License v3.0 — see LICENSE for details.
Contributing
Contributions welcome! Priority areas:
- New plugins — Providers, extractors, validators
- Documentation — Examples and tutorials
- Testing — Expand test coverage