Structured AI document processing with robust fallback strategies.

strutex

Structured Text Extraction — Extract structured JSON from documents using LLMs

Features

Plugin System v2 — Auto-registration via inheritance, lazy loading, entry points
Hooks — Callbacks and decorators for pre/post processing pipeline
CLI Tooling — strutex plugins list|info|refresh commands
Multi-Provider LLM Support — Gemini, OpenAI, Anthropic, and custom endpoints
Universal Document Support — PDFs, images, Excel, and custom formats
Schema-Driven Extraction — Define your output structure, get consistent JSON
Verification & Self-Correction — built-in audit loop for high accuracy
Security First — Built-in input sanitization and output validation
Framework Integrations — LangChain, LlamaIndex, Haystack compatibility

When to Choose Strutex

Good fit:

Document → JSON (invoices, receipts, forms, tables)
Schema-validated output, not free-form LLM text
Security by default (injection detection, PII redaction)
Local/air-gapped (Ollama, custom endpoints)
Lightweight deps, pluggable architecture
Production-ready: caching, batch/async, verification
LangChain/LlamaIndex integration for RAG pipelines

Not a fit:

Complex multi-step agents or autonomous workflows
Vector search / embedding pipelines (use with LlamaIndex instead)
Full LLM orchestration framework → combine with LangChain

TL;DR: strutex turns messy documents into trustworthy structured data. Use it standalone or plugged into your RAG stack.

What's New

Framework Integrations: LangChain, LlamaIndex, Haystack
DocumentInput: Unified handling for file paths and BytesIO (HTTP uploads)
OCR Fallback: Automatic integration with Unstructured.io for complex layouts
Optional Extras: Install only the integrations you need

Quick Start

Installation

View on PyPI: https://pypi.org/project/strutex/

# Core only
pip install strutex

# Extras are quoted so that shells such as zsh do not treat the brackets as a glob pattern

# With CLI commands
pip install "strutex[cli]"

# With OCR support
pip install "strutex[ocr]"

# Framework integrations
pip install "strutex[langchain]"     # LangChain
pip install "strutex[llamaindex]"    # LlamaIndex
pip install "strutex[haystack]"      # Haystack
pip install "strutex[fallback]"      # Unstructured.io

# Everything
pip install "strutex[all]"

Basic Usage

from strutex import DocumentProcessor, Object, String, Number, Array

# Define your output schema
invoice_schema = Object(
    description="Invoice data",
    properties={
        "invoice_number": String(description="The invoice ID"),
        "total": Number,
        "items": Array(
            items=Object(
                properties={
                    "description": String,
                    "amount": Number,
                }
            )
        )
    }
)

# Process a document
# 'provider="gemini"' selects Google's Gemini models.
# You can also use "openai", "anthropic", or "ollama".
processor = DocumentProcessor(provider="gemini")
result = processor.process(
    file_path="invoice.pdf",
    prompt="Extract the invoice details.",
    schema=invoice_schema
)

print(result["invoice_number"])  # "INV-2024-001"
print(result["total"])           # 1250.00
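Since the result is accessed like a plain dictionary above, a light sanity check before passing it downstream can be useful. The following is a minimal sketch that runs on a hand-written sample rather than real output; the helper name `check_invoice` is hypothetical and not part of strutex.

```python
from numbers import Number as RealNumber

def check_invoice(result: dict) -> list:
    """Return a list of problems found in an extracted invoice dict."""
    problems = []
    if not isinstance(result.get("invoice_number"), str):
        problems.append("invoice_number should be a string")
    if not isinstance(result.get("total"), RealNumber):
        problems.append("total should be a number")
    for i, item in enumerate(result.get("items", [])):
        if not isinstance(item.get("amount"), RealNumber):
            problems.append(f"items[{i}].amount should be a number")
    return problems

# Hand-written sample shaped like the schema above (not real strutex output).
sample = {
    "invoice_number": "INV-2024-001",
    "total": 1250.00,
    "items": [{"description": "Consulting", "amount": 1250.00}],
}
print(check_invoice(sample))  # []
```

An empty list means the sample matches the expected shape; each string in a non-empty list names one field to inspect.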

Advanced Usage

1. Caching

Cache results to save API costs. Inputs are hashed, so identical file/prompt combinations are not processed again.

from strutex import DocumentProcessor
from strutex.cache import SQLiteCache

# Persistent cache across runs
processor = DocumentProcessor(
    provider="openai",
    cache=SQLiteCache("strutex_cache.db")
)
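To illustrate the idea of hashing inputs, here is a sketch of how a cache key could be derived from the document bytes, the prompt, and the schema. This is an assumption about the general technique, not strutex's actual implementation; the function name `cache_key` and the dict form of the schema are hypothetical.

```python
import hashlib
import json

def cache_key(file_bytes: bytes, prompt: str, schema: dict) -> str:
    """Derive a deterministic key from the document, prompt, and schema."""
    digest = hashlib.sha256()
    digest.update(file_bytes)
    digest.update(b"\x00")  # separator between fields
    digest.update(prompt.encode("utf-8"))
    digest.update(b"\x00")
    # sort_keys makes the JSON form independent of dict insertion order
    digest.update(json.dumps(schema, sort_keys=True).encode("utf-8"))
    return digest.hexdigest()

key_a = cache_key(b"%PDF-1.7 ...", "Extract the invoice details.", {"type": "object"})
key_b = cache_key(b"%PDF-1.7 ...", "Extract the invoice details.", {"type": "object"})
key_c = cache_key(b"%PDF-1.7 ...", "Extract the totals only.", {"type": "object"})
print(key_a == key_b)  # True: identical inputs map to the same entry
print(key_a == key_c)  # False: a different prompt is a different entry
```

Keying on file content rather than file name means a renamed copy of the same document would still map to the same entry.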

2. Async Processing

Process multiple documents concurrently.

import asyncio

from strutex import DocumentProcessor

async def main():
    processor = DocumentProcessor(provider="anthropic")

    # `schema` is a schema object defined as in Basic Usage.
    # gather() runs both extractions concurrently.
    results = await asyncio.gather(
        processor.aprocess("doc1.pdf", "Summary", schema),
        processor.aprocess("doc2.pdf", "Summary", schema)
    )
    return results

asyncio.run(main())
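With many documents, it is often worth capping how many requests are in flight at once. The sketch below shows one common way to do that with `asyncio.Semaphore`. It uses a stand-in coroutine, `fake_aprocess`, in place of a real extraction call, and both it and `process_all` are hypothetical names, not strutex functions.

```python
import asyncio

async def fake_aprocess(path: str) -> dict:
    """Stand-in for a real extraction call; sleeps briefly and returns a dict."""
    await asyncio.sleep(0.01)
    return {"source": path}

async def process_all(paths: list, limit: int = 2) -> list:
    semaphore = asyncio.Semaphore(limit)  # at most `limit` calls in flight

    async def worker(path: str) -> dict:
        async with semaphore:
            return await fake_aprocess(path)

    # gather() returns results in the same order as the inputs
    return await asyncio.gather(*(worker(p) for p in paths))

results = asyncio.run(process_all(["doc1.pdf", "doc2.pdf", "doc3.pdf"]))
print([r["source"] for r in results])  # ['doc1.pdf', 'doc2.pdf', 'doc3.pdf']
```

A small limit is a reasonable starting point when a provider enforces rate limits; the right value depends on the provider and account.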

3. Verification & Self-Correction

Enable the audit loop to have the LLM double-check its work.

# `processor` and `contract_schema` are assumed to be defined as in Basic Usage.
result = processor.process(
    "contract.pdf",
    prompt="Extract clauses",
    schema=contract_schema,
    verify=True  # triggers self-correction loop
)
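As a rough picture of what an audit loop does, the sketch below extracts, checks the result, and re-prompts with the findings until the check passes or a round limit is reached. It illustrates the general pattern only and makes no claim about how `verify=True` is implemented; `extract_with_audit`, `toy_extract`, and `toy_audit` are hypothetical stand-ins.

```python
from typing import Callable

def extract_with_audit(
    extract: Callable[[str], dict],
    audit: Callable[[dict], list],
    prompt: str,
    max_rounds: int = 2,
) -> dict:
    """Extract, audit, and re-prompt with the audit findings until clean."""
    result = extract(prompt)
    for _ in range(max_rounds):
        issues = audit(result)
        if not issues:
            break
        # Feed the findings back so the next attempt can correct them.
        result = extract(prompt + "\nFix these issues: " + "; ".join(issues))
    return result

# Toy stand-ins for the model call and the checker (both hypothetical).
def toy_extract(prompt: str) -> dict:
    return {"total": 100 if "Fix" in prompt else None}

def toy_audit(result: dict) -> list:
    return [] if result["total"] is not None else ["total is missing"]

print(extract_with_audit(toy_extract, toy_audit, "Extract clauses"))  # {'total': 100}
```

The round limit matters: each correction pass is another model call, so the loop trades cost for accuracy.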

CLI Commands (v0.3.0+)

# List all plugins
strutex plugins list

# Filter by type
strutex plugins list --type provider

# Get plugin details
strutex plugins info gemini --type provider

# Refresh discovery cache
strutex plugins refresh

Plugin System

Everything is pluggable. Just inherit from a base class:

from strutex.plugins import Provider

class MyProvider(Provider):
    """Auto-registered as 'myprovider'"""
    capabilities = ["vision"]

    def process(self, file_path, prompt, schema, mime_type, **kwargs):
        # Your LLM logic
        ...

# Customize with class arguments
class FastProvider(Provider, name="fast"):
    """Registered as 'fast' with high priority"""
    priority = 90  # Class attribute
    cost = 0.5

    def process(self, file_path, prompt, schema, mime_type, **kwargs): ...
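Auto-registration through inheritance, including class arguments such as `name="fast"`, can be built in plain Python with `__init_subclass__`. The following is a generic sketch of that mechanism, not strutex's own code; `REGISTRY`, `Plugin`, and the subclass names are hypothetical.

```python
from typing import Optional

REGISTRY: dict = {}

class Plugin:
    """Base class whose subclasses register themselves on definition."""
    priority = 50

    def __init_subclass__(cls, name: Optional[str] = None, **kwargs):
        super().__init_subclass__(**kwargs)
        # Fall back to the lowercased class name when no name is given.
        REGISTRY[name or cls.__name__.lower()] = cls

class MyPlugin(Plugin):
    pass

class FastPlugin(Plugin, name="fast"):
    priority = 90

print(sorted(REGISTRY))           # ['fast', 'myplugin']
print(REGISTRY["fast"].priority)  # 90
```

Registration happens when the class statement runs, so importing the module that defines a plugin is enough to make it discoverable.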

For Distributable Packages

Use entry points in pyproject.toml:

[project.entry-points."strutex.providers"]
my_provider = "my_package:MyProvider"

Plugin Types

| Type | Purpose | Examples |
| --- | --- | --- |
| `provider` | LLM backends | Gemini, OpenAI, Claude, Ollama |
| `security` | Input/output protection | Injection detection, sanitization |
| `extractor` | Document parsing | PDF, Image OCR, Excel |
| `validator` | Output validation | Schema, sum checks, date formats |
| `postprocessor` | Data transformation | Date/number normalization |
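To make the "sum checks" example concrete, here is a sketch of the kind of logic a validator might run on an extracted invoice: compare the sum of the line items with the stated total. It is a plain function for illustration and does not use the strutex validator base class; `check_line_items_sum` is a hypothetical name.

```python
import math

def check_line_items_sum(result: dict, tolerance: float = 0.01) -> bool:
    """True when the item amounts add up to the stated total."""
    items_total = sum(item["amount"] for item in result.get("items", []))
    return math.isclose(items_total, result["total"], abs_tol=tolerance)

good = {"total": 1250.00, "items": [{"amount": 1000.00}, {"amount": 250.00}]}
bad = {"total": 1250.00, "items": [{"amount": 1000.00}, {"amount": 200.00}]}
print(check_line_items_sum(good))  # True
print(check_line_items_sum(bad))   # False
```

The tolerance absorbs floating-point rounding; a stricter check could use `decimal.Decimal` for currency values.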

Supported Formats

| Format | Extensions | Method |
| --- | --- | --- |
| PDF | `.pdf` | Text extraction with fallback chain |
| Images | `.png`, `.jpg`, `.tiff` | Direct vision or OCR |
| Excel | `.xlsx`, `.xls` | Converted to structured text |
| Text | `.txt`, `.csv` | Direct input |
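The table can be read as a dispatch on file extension. The sketch below mirrors it with a simple lookup; the method strings are descriptive labels copied from the table rather than strutex API names, and `method_for` is a hypothetical helper.

```python
from pathlib import Path

# Mirrors the table above; the values are descriptive labels, not API names.
METHODS = {
    ".pdf": "text extraction with fallback chain",
    ".png": "direct vision or OCR",
    ".jpg": "direct vision or OCR",
    ".tiff": "direct vision or OCR",
    ".xlsx": "converted to structured text",
    ".xls": "converted to structured text",
    ".txt": "direct input",
    ".csv": "direct input",
}

def method_for(path: str) -> str:
    """Look up the handling method by file extension (case-insensitive)."""
    suffix = Path(path).suffix.lower()
    if suffix not in METHODS:
        raise ValueError(f"unsupported format: {suffix or path}")
    return METHODS[suffix]

print(method_for("Invoice.PDF"))  # text extraction with fallback chain
```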

Roadmap

See ROADMAP.md for the full development plan.

Recent releases:

v0.1.0 — Core functionality
v0.2.0 — Plugin registry + Security layer
v0.3.0 — Plugin System v2
v0.6.0 — Built-in Schemas & Logging
v0.7.0 — Providers & Retries
v0.8.0 — Async, Batch, Cache, Verification
v0.8.1 — Documentation & Coverage Fixes

Documentation

📚 Read the Docs

# Install docs dependencies
pip install mkdocs mkdocs-material "mkdocstrings[python]" mike

# Serve locally
mkdocs serve

# Build static site
mkdocs build

# Deploy with versioning
mike deploy 0.3.0 latest --push

License

This project is licensed under the GNU General Public License v3.0 — see LICENSE for details.

Contributing

Contributions welcome! Priority areas:

New plugins — Providers, extractors, validators
Documentation — Examples and tutorials
Testing — Expand test coverage
