Structured AI document processing with robust fallback strategies.
strutex
Structured Text Extraction — Extract structured JSON from documents using LLMs
Features
- Plugin System v2 — Auto-registration via inheritance, lazy loading, entry points
- Hooks — Callbacks and decorators for pre/post processing pipeline
- CLI Tooling — strutex plugins list|info|refresh commands
- Multi-Provider LLM Support — Gemini, OpenAI, Anthropic, and custom endpoints
- Universal Document Support — PDFs, images, Excel, and custom formats
- Schema-Driven Extraction — Define your output structure, get consistent JSON
- Verification & Self-Correction — Built-in audit loop for high accuracy
- Security First — Built-in input sanitization and output validation
- Framework Integrations — LangChain, LlamaIndex, Haystack compatibility
When to Choose Strutex
Good fit:
- Document → JSON (invoices, receipts, forms, tables)
- Schema-validated output, not free-form LLM text
- Security by default (injection detection, PII redaction)
- Local/air-gapped (Ollama, custom endpoints)
- Lightweight deps, pluggable architecture
- Production-ready: caching, batch/async, verification
- LangChain/LlamaIndex integration for RAG pipelines
Not a fit:
- Complex multi-step agents or autonomous workflows
- Vector search / embedding pipelines (use with LlamaIndex instead)
- Full LLM orchestration framework → combine with LangChain
TL;DR: strutex turns messy documents into trustworthy structured data. Use it standalone or plugged into your RAG stack.
What's New
- Framework Integrations: LangChain, LlamaIndex, Haystack, and Unstructured.io fallback
- DocumentInput: Unified handling for file paths and BytesIO (HTTP uploads)
- Optional Extras: Install only the integrations you need
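The idea behind unified input handling can be pictured with a small standard-library sketch. The helper name `read_document_bytes` is hypothetical and is not the strutex `DocumentInput` API; it only illustrates normalizing a file path or an in-memory upload to raw bytes.

```python
import io
import os
from typing import BinaryIO, Union


# Hypothetical helper for illustration only; not part of strutex.
def read_document_bytes(source: Union[str, os.PathLike, BinaryIO]) -> bytes:
    """Return the raw bytes of a document given a path or a binary stream."""
    if isinstance(source, (str, os.PathLike)):
        with open(source, "rb") as fh:
            return fh.read()
    # File-like object (e.g. io.BytesIO from an HTTP upload):
    # rewind first so a partially consumed stream is read in full.
    if source.seekable():
        source.seek(0)
    return source.read()


# An in-memory upload is handled the same way as a path on disk.
upload = io.BytesIO(b"%PDF-1.7 sample")
data = read_document_bytes(upload)
```

Rewinding the stream is a design choice: upload handlers often read a few bytes to sniff the MIME type before passing the object on.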
Quick Start
Installation
View on PyPI: https://pypi.org/project/strutex/
# Core only
pip install strutex
# With CLI commands (quotes keep shells such as zsh from expanding the brackets)
pip install "strutex[cli]"
# With OCR support
pip install "strutex[ocr]"
# Framework integrations
pip install "strutex[langchain]"   # LangChain
pip install "strutex[llamaindex]"  # LlamaIndex
pip install "strutex[haystack]"    # Haystack
pip install "strutex[fallback]"    # Unstructured.io
# Everything
pip install "strutex[all]"
Basic Usage
from strutex import DocumentProcessor, Object, String, Number, Array

# Define your output schema
invoice_schema = Object(
    description="Invoice data",
    properties={
        "invoice_number": String(description="The invoice ID"),
        "total": Number(),
        "items": Array(
            items=Object(
                properties={
                    "description": String(),
                    "amount": Number(),
                }
            )
        )
    }
)

# Process a document
processor = DocumentProcessor(provider="gemini")
result = processor.process(
    file_path="invoice.pdf",
    prompt="Extract the invoice details.",
    schema=invoice_schema
)
print(result["invoice_number"])  # "INV-2024-001"
print(result["total"])           # 1250.00
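To make "schema-driven" concrete, the sketch below checks a result dictionary against a plain-dict description of the same invoice shape. It is standalone Python; `INVOICE_SHAPE`, `conforms`, and the dict layout are illustrative assumptions and do not reflect how strutex represents or validates schemas internally.

```python
from typing import Any

# Illustrative plain-dict mirror of the invoice schema above
# (an assumption, not the strutex schema types).
INVOICE_SHAPE = {
    "type": "object",
    "properties": {
        "invoice_number": {"type": "string"},
        "total": {"type": "number"},
        "items": {
            "type": "array",
            "items": {
                "type": "object",
                "properties": {
                    "description": {"type": "string"},
                    "amount": {"type": "number"},
                },
            },
        },
    },
}


def conforms(value: Any, shape: dict) -> bool:
    """Return True if value matches the tiny shape language used here.

    Every listed property is treated as required in this sketch.
    """
    kind = shape["type"]
    if kind == "string":
        return isinstance(value, str)
    if kind == "number":
        # bool is a subclass of int, so exclude it explicitly.
        return isinstance(value, (int, float)) and not isinstance(value, bool)
    if kind == "array":
        return isinstance(value, list) and all(
            conforms(item, shape["items"]) for item in value
        )
    if kind == "object":
        return isinstance(value, dict) and all(
            key in value and conforms(value[key], sub)
            for key, sub in shape["properties"].items()
        )
    return False
```

A check of this kind is what separates schema-validated output from free-form LLM text: a result with `"total": "1250"` (a string) is rejected rather than passed downstream.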
Advanced Usage
1. Caching
Save API costs by caching results. Smart hashing avoids re-processing identical files/prompts.
from strutex import DocumentProcessor
from strutex.cache import SQLiteCache

# Persistent cache across runs
processor = DocumentProcessor(
    provider="openai",
    cache=SQLiteCache("strutex_cache.db")
)
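One way to picture content-based caching is a key derived from the file bytes, the prompt, and the schema, stored in SQLite. The following is a minimal sketch under that assumption; `cache_key` and `TinySQLiteCache` are hypothetical names, and strutex's actual hashing and storage layout may differ.

```python
import hashlib
import json
import sqlite3
from typing import Optional


def cache_key(file_bytes: bytes, prompt: str, schema: dict) -> str:
    """Derive a deterministic key from document content, prompt, and schema."""
    payload = json.dumps(
        [hashlib.sha256(file_bytes).hexdigest(), prompt, schema],
        sort_keys=True,  # dict key order must not change the key
    )
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


class TinySQLiteCache:
    """Illustrative key/value cache; not the strutex SQLiteCache class."""

    def __init__(self, path: str = ":memory:") -> None:
        self._db = sqlite3.connect(path)
        self._db.execute(
            "CREATE TABLE IF NOT EXISTS results "
            "(key TEXT PRIMARY KEY, value TEXT NOT NULL)"
        )

    def get(self, key: str) -> Optional[dict]:
        row = self._db.execute(
            "SELECT value FROM results WHERE key = ?", (key,)
        ).fetchone()
        return json.loads(row[0]) if row else None

    def set(self, key: str, value: dict) -> None:
        with self._db:  # commits on success, rolls back on error
            self._db.execute(
                "INSERT OR REPLACE INTO results (key, value) VALUES (?, ?)",
                (key, json.dumps(value)),
            )
```

Hashing the content rather than the file name means a renamed copy of the same document still hits the cache, while any change to the prompt or schema produces a new key.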
2. Async Processing
Process multiple documents in parallel.
import asyncio
from strutex import DocumentProcessor

# `schema` is any schema object, for example invoice_schema from Basic Usage
async def main():
    processor = DocumentProcessor(provider="anthropic")
    # Run in parallel
    results = await asyncio.gather(
        processor.aprocess("doc1.pdf", "Summary", schema),
        processor.aprocess("doc2.pdf", "Summary", schema)
    )

asyncio.run(main())
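With many documents, unbounded `asyncio.gather` can run into provider rate limits. The sketch below shows one common way to cap the number of calls in flight with a semaphore. `fake_extract` is a stand-in that sleeps instead of calling an LLM; both function names are hypothetical and nothing here is strutex API.

```python
import asyncio
from typing import List


async def fake_extract(name: str) -> dict:
    """Stand-in for an async extraction call (assumption: no real LLM)."""
    await asyncio.sleep(0.01)
    return {"source": name}


async def process_all(names: List[str], limit: int = 2) -> list:
    """Run fake_extract over names with at most `limit` calls in flight."""
    sem = asyncio.Semaphore(limit)

    async def guarded(name: str) -> dict:
        async with sem:
            return await fake_extract(name)

    # return_exceptions=True keeps one failure from discarding other results;
    # failed entries come back as exception objects in the result list.
    return await asyncio.gather(
        *(guarded(n) for n in names), return_exceptions=True
    )
```

`asyncio.gather` returns results in the order the awaitables were passed, so results can be matched back to their input files by position.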
3. Verification & Self-Correction
Enable the audit loop to have the LLM double-check its work.
# contract_schema is a schema object defined like invoice_schema in Basic Usage
result = processor.process(
    "contract.pdf",
    prompt="Extract clauses",
    schema=contract_schema,
    verify=True  # triggers self-correction loop
)
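Conceptually, an audit loop extracts, checks the result, and retries with feedback until the checks pass or a round limit is reached. The sketch below illustrates that control flow with caller-supplied callables. It is an assumption about the general shape of such a loop, not a description of what `verify=True` does inside strutex; `extract_with_audit` is a hypothetical name.

```python
from typing import Callable, List, Optional


def extract_with_audit(
    extract: Callable[[Optional[str]], dict],
    audit: Callable[[dict], List[str]],
    max_rounds: int = 3,
) -> dict:
    """Call extract, audit the result, and retry with feedback until clean."""
    if max_rounds < 1:
        raise ValueError("max_rounds must be at least 1")
    feedback: Optional[str] = None
    problems: List[str] = []
    for _ in range(max_rounds):
        result = extract(feedback)   # first round gets no feedback
        problems = audit(result)     # empty list means the result passed
        if not problems:
            return result
        feedback = "Fix these issues: " + "; ".join(problems)
    raise ValueError(
        f"audit still failing after {max_rounds} rounds: {problems}"
    )
```

Bounding the number of rounds matters because every retry is another LLM call; failing loudly after the limit is usually preferable to returning an unverified result silently.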
CLI Commands (v0.3.0+)
# List all plugins
strutex plugins list
# Filter by type
strutex plugins list --type provider
# Get plugin details
strutex plugins info gemini --type provider
# Refresh discovery cache
strutex plugins refresh
Plugin System
Everything is pluggable. Just inherit from a base class:
from strutex.plugins import Provider

class MyProvider(Provider):
    """Auto-registered as 'myprovider'"""
    capabilities = ["vision"]

    def process(self, file_path, prompt, schema, mime_type, **kwargs):
        # Your LLM logic
        ...

# Customize with class arguments
class FastProvider(Provider, name="fast"):
    """Registered as 'fast' with high priority"""
    priority = 90  # Class attribute
    cost = 0.5

    def process(self, file_path, prompt, schema, mime_type, **kwargs):
        ...
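Auto-registration via inheritance is typically built on Python's `__init_subclass__` hook, which also accepts class keyword arguments such as `name="fast"`. The sketch below shows the technique in isolation. `RegisteredBase`, `REGISTRY`, and the default priority of 50 are assumptions for illustration and may not match the strutex internals.

```python
from typing import Dict, Optional, Type

# Hypothetical registry for illustration; not the strutex plugin registry.
REGISTRY: Dict[str, Type["RegisteredBase"]] = {}


class RegisteredBase:
    """Base class that records every subclass under a name."""

    priority = 50  # assumed default, overridable per subclass

    def __init_subclass__(cls, name: Optional[str] = None, **kwargs) -> None:
        super().__init_subclass__(**kwargs)
        # Fall back to the lowercased class name when no name is given.
        REGISTRY[name or cls.__name__.lower()] = cls


class MyProvider(RegisteredBase):
    pass


class FastProvider(RegisteredBase, name="fast"):
    priority = 90


def highest_priority() -> Type[RegisteredBase]:
    """Return the registered class with the largest priority value."""
    return max(REGISTRY.values(), key=lambda cls: cls.priority)
```

Because registration happens when the class body is executed, importing the module that defines a plugin is enough to make it available; no explicit `register()` call is needed.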
For Distributable Packages
Use entry points in pyproject.toml:
[project.entry-points."strutex.providers"]
my_provider = "my_package:MyProvider"
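Entry points declared this way can be enumerated with the standard library's `importlib.metadata`. The sketch below lists them without importing the plugin modules, which is one way lazy loading can be achieved. The function name `discover` is hypothetical, and whether strutex performs discovery exactly like this is an assumption.

```python
from importlib.metadata import entry_points


def discover(group: str = "strutex.providers") -> dict:
    """Map entry-point names to EntryPoint objects without importing them."""
    eps = entry_points()
    if hasattr(eps, "select"):
        # Python 3.10+: selectable entry points.
        selected = eps.select(group=group)
    else:
        # Python 3.8/3.9: a plain dict keyed by group name.
        selected = eps.get(group, [])
    return {ep.name: ep for ep in selected}


# Loading is deferred until a plugin is actually needed:
# provider_cls = discover()["my_provider"].load()
```

Deferring `load()` keeps startup fast when many plugins are installed, since a plugin's heavy dependencies are imported only when it is first used.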
Plugin Types
| Type | Purpose | Examples |
|---|---|---|
| provider | LLM backends | Gemini, OpenAI, Claude, Ollama |
| security | Input/output protection | Injection detection, sanitization |
| extractor | Document parsing | PDF, Image OCR, Excel |
| validator | Output validation | Schema, sum checks, date formats |
| postprocessor | Data transformation | Date/number normalization |
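As an example of what a "sum check" validator might do, the sketch below compares the line-item amounts with the stated total, using the invoice field names from Basic Usage. It is a standalone function, not a strutex validator plugin; `check_line_items_sum` and the tolerance value are assumptions.

```python
import math
from typing import List


def check_line_items_sum(result: dict, tolerance: float = 0.01) -> List[str]:
    """Return a list of problems; an empty list means the check passed."""
    # fsum avoids accumulating floating-point error over many items.
    items_total = math.fsum(item["amount"] for item in result["items"])
    if abs(items_total - result["total"]) > tolerance:
        return [
            f"items sum to {items_total:.2f} "
            f"but total is {result['total']:.2f}"
        ]
    return []
```

Real invoices often include tax, discounts, or shipping outside the line items, so a production check would need to account for those fields before flagging a mismatch.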
Supported Formats
| Format | Extensions | Method |
|---|---|---|
| PDF | .pdf | Text extraction with fallback chain |
| Images | .png, .jpg, .tiff | Direct vision or OCR |
| Excel | .xlsx, .xls | Converted to structured text |
| Text | .txt, .csv | Direct input |
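A fallback chain tries extractors in order and moves on when one fails or returns nothing useful. The sketch below shows that pattern with caller-supplied extractor functions. The name `run_fallback_chain` is hypothetical, and the actual order and set of extractors strutex uses for PDFs is not specified here.

```python
from typing import Callable, Iterable, List, Tuple

Extractor = Callable[[bytes], str]


def run_fallback_chain(
    data: bytes, extractors: Iterable[Tuple[str, Extractor]]
) -> Tuple[str, str]:
    """Return (extractor_name, text) from the first extractor that succeeds."""
    errors: List[str] = []
    for name, extractor in extractors:
        try:
            text = extractor(data)
        except Exception as exc:  # any failure moves on to the next extractor
            errors.append(f"{name}: {exc}")
            continue
        if text.strip():
            return name, text
        # Empty output (e.g. a scanned PDF with no text layer) also falls through.
        errors.append(f"{name}: empty output")
    raise RuntimeError("all extractors failed: " + "; ".join(errors))
```

Returning the name of the extractor that succeeded makes it easy to log how often the cheaper methods fail and the slower fallback, such as OCR, is needed.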
Roadmap
See ROADMAP.md for the full development plan.
Recent releases:
- v0.1.0 — Core functionality
- v0.2.0 — Plugin registry + Security layer
- v0.3.0 — Plugin System v2
- v0.6.0 — Built-in Schemas & Logging
- v0.7.0 — Providers & Retries
- v0.8.0 — Async, Batch, Cache, Verification
- v0.8.1 — Documentation & Coverage Fixes
Documentation
# Install docs dependencies
pip install mkdocs mkdocs-material "mkdocstrings[python]" mike
# Serve locally
mkdocs serve
# Build static site
mkdocs build
# Deploy with versioning
mike deploy 0.3.0 latest --push
License
This project is licensed under the GNU General Public License v3.0 — see LICENSE for details.
Contributing
Contributions welcome! Priority areas:
- New plugins — Providers, extractors, validators
- Documentation — Examples and tutorials
- Testing — Expand test coverage