
Structured AI document processing with robust fallback strategies.


strutex

Structured Text Extraction — Extract structured JSON from documents using LLMs

License: GPL v3 | Python 3.10+


Features

  • Plugin System v2 — Auto-registration via inheritance, lazy loading, entry points
  • Hooks — Callbacks and decorators for pre/post processing pipeline
  • CLI Tooling: strutex plugins list|info|refresh commands
  • Multi-Provider LLM Support — Gemini, OpenAI, Anthropic, and custom endpoints
  • Universal Document Support — PDFs, images, Excel, and custom formats
  • Schema-Driven Extraction — Define your output structure, get consistent JSON
  • Verification & Self-Correction — built-in audit loop for high accuracy
  • Security First — Built-in input sanitization and output validation
  • Framework Integrations — LangChain, LlamaIndex, Haystack compatibility

When to Choose Strutex

Good fit:

  • Document → JSON (invoices, receipts, forms, tables)
  • Schema-validated output, not free-form LLM text
  • Security by default (injection detection, PII redaction)
  • Local/air-gapped (Ollama, custom endpoints)
  • Lightweight deps, pluggable architecture
  • Production-ready: caching, batch/async, verification
  • LangChain/LlamaIndex integration for RAG pipelines

Not a fit:

  • Complex multi-step agents or autonomous workflows
  • Vector search / embedding pipelines (use with LlamaIndex instead)
  • Full LLM orchestration framework → combine with LangChain

TL;DR: strutex turns messy documents into trustworthy structured data. Use it standalone or plugged into your RAG stack.


What's New

  • Framework Integrations: LangChain, LlamaIndex, Haystack
  • DocumentInput: Unified handling for file paths and BytesIO (HTTP uploads)
  • OCR Fallback: Automatic integration with Unstructured.io for complex layouts
  • Optional Extras: Install only the integrations you need

Quick Start

Installation

View on PyPI: https://pypi.org/project/strutex/

# Core only
pip install strutex

# With CLI commands
pip install strutex[cli]

# With OCR support
pip install strutex[ocr]

# Framework integrations
pip install strutex[langchain]     # LangChain
pip install strutex[llamaindex]    # LlamaIndex
pip install strutex[haystack]      # Haystack
pip install strutex[fallback]      # Unstructured.io

# Everything
pip install strutex[all]

Basic Usage

from strutex import DocumentProcessor, Object, String, Number, Array

# Define your output schema
invoice_schema = Object(
    description="Invoice data",
    properties={
        "invoice_number": String(description="The invoice ID"),
        "total": Number,
        "items": Array(
            items=Object(
                properties={
                    "description": String,
                    "amount": Number,
                }
            )
        )
    }
)

# Process a document
# 'provider="gemini"' selects Google's Gemini models.
# You can also use "openai", "anthropic", or "ollama".
processor = DocumentProcessor(provider="gemini")
result = processor.process(
    file_path="invoice.pdf",
    prompt="Extract the invoice details.",
    schema=invoice_schema
)

print(result["invoice_number"])  # "INV-2024-001"
print(result["total"])           # 1250.00
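
The example above reads the result like a dict. As a minimal sketch, assuming the result really is a plain dict shaped like invoice_schema, it can be loaded into dataclasses so downstream code works with typed fields. The Invoice and LineItem classes and the sample data are hypothetical and not part of strutex.

```python
from dataclasses import dataclass, field


@dataclass
class LineItem:
    description: str
    amount: float


@dataclass
class Invoice:
    invoice_number: str
    total: float
    items: list[LineItem] = field(default_factory=list)


def invoice_from_result(result: dict) -> Invoice:
    """Build an Invoice from an extraction result shaped like invoice_schema."""
    return Invoice(
        invoice_number=str(result["invoice_number"]),
        total=float(result["total"]),
        items=[
            LineItem(str(item["description"]), float(item["amount"]))
            for item in result.get("items", [])
        ],
    )


# Hypothetical data standing in for the output of processor.process(...)
sample = {
    "invoice_number": "INV-2024-001",
    "total": 1250.00,
    "items": [{"description": "Consulting", "amount": 1250.00}],
}
invoice = invoice_from_result(sample)
```

A missing required key raises KeyError here, which surfaces schema drift early instead of passing None values along.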

Advanced Usage

1. Caching

Save API costs by caching results. Smart hashing avoids re-processing identical files/prompts.

from strutex import DocumentProcessor
from strutex.cache import SQLiteCache

# Persistent cache across runs
processor = DocumentProcessor(
    provider="openai",
    cache=SQLiteCache("strutex_cache.db")
)
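
To illustrate the idea behind hashing identical files and prompts, here is a standard-library sketch of a content-addressed cache. It is not strutex's implementation: cache_key and TinyCache are hypothetical names, and the assumption is that the file bytes, the prompt, and the schema together determine the result.

```python
import hashlib
import json
import sqlite3


def cache_key(file_bytes: bytes, prompt: str, schema: dict) -> str:
    """Hash everything that influences the result into one hex digest."""
    digest = hashlib.sha256()
    parts = (
        file_bytes,
        prompt.encode("utf-8"),
        json.dumps(schema, sort_keys=True).encode("utf-8"),
    )
    for part in parts:
        # A length prefix keeps ("ab", "") and ("a", "b") from colliding
        digest.update(len(part).to_bytes(8, "big"))
        digest.update(part)
    return digest.hexdigest()


class TinyCache:
    """Hypothetical key/value cache backed by SQLite."""

    def __init__(self, path: str = ":memory:"):
        self.db = sqlite3.connect(path)
        self.db.execute(
            "CREATE TABLE IF NOT EXISTS cache (key TEXT PRIMARY KEY, value TEXT)"
        )

    def get(self, key: str):
        row = self.db.execute(
            "SELECT value FROM cache WHERE key = ?", (key,)
        ).fetchone()
        return json.loads(row[0]) if row else None

    def set(self, key: str, value: dict) -> None:
        self.db.execute(
            "INSERT OR REPLACE INTO cache VALUES (?, ?)", (key, json.dumps(value))
        )
        self.db.commit()


cache = TinyCache()
key = cache_key(b"%PDF-1.7 ...", "Extract the invoice details.", {"type": "object"})
if cache.get(key) is None:
    cache.set(key, {"total": 1250.0})  # stand-in for a real LLM call
```

Hashing the file contents rather than the path means a renamed copy of the same document still hits the cache.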

2. Async Processing

Process multiple documents in parallel.

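import asyncio

from strutex import DocumentProcessor, Object, String

# Minimal schema for this example; replace it with your own
schema = Object(properties={"summary": String})

async def main():
    processor = DocumentProcessor(provider="anthropic")

    # Run both extractions concurrently
    results = await asyncio.gather(
        processor.aprocess("doc1.pdf", "Summary", schema),
        processor.aprocess("doc2.pdf", "Summary", schema)
    )
    return results

results = asyncio.run(main())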
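
With many documents, an unbounded gather can run into provider rate limits. The sketch below caps concurrency with asyncio.Semaphore. It uses a stub coroutine, fake_aprocess, in place of processor.aprocess so that it runs without API keys; the function names and the limit are assumptions for illustration.

```python
import asyncio


async def fake_aprocess(path: str) -> dict:
    """Stand-in for processor.aprocess(...); sleeps instead of calling an LLM."""
    await asyncio.sleep(0.01)
    return {"file": path}


async def process_all(paths: list[str], limit: int = 2) -> list[dict]:
    semaphore = asyncio.Semaphore(limit)

    async def bounded(path: str) -> dict:
        async with semaphore:  # at most `limit` calls in flight at once
            return await fake_aprocess(path)

    # gather returns results in the same order as the inputs
    return await asyncio.gather(*(bounded(p) for p in paths))


results = asyncio.run(process_all(["doc1.pdf", "doc2.pdf", "doc3.pdf"]))
```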

3. Verification & Self-Correction

Enable the audit loop to have the LLM double-check its work.

# Assumes `processor` and `contract_schema` are defined as in Basic Usage
result = processor.process(
    "contract.pdf",
    prompt="Extract clauses",
    schema=contract_schema,
    verify=True  # triggers self-correction loop
)
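
Conceptually, an audit loop extracts, checks the result, and retries with the findings fed back into the prompt. The sketch below shows that control flow with stub functions. It is an assumption about the general technique, not a description of what verify=True does internally; extract_with_audit, fake_extract, and fake_audit are hypothetical.

```python
from typing import Callable


def extract_with_audit(
    extract: Callable[[str], dict],
    audit: Callable[[dict], list[str]],
    prompt: str,
    max_rounds: int = 2,
) -> dict:
    """Re-run extraction with the auditor's findings appended to the prompt."""
    result = extract(prompt)
    for _ in range(max_rounds):
        problems = audit(result)
        if not problems:
            break  # the auditor is satisfied
        feedback = "; ".join(problems)
        result = extract(f"{prompt}\nFix these issues: {feedback}")
    return result


def fake_extract(prompt: str) -> dict:
    # Pretends the model only finds the total once it receives feedback
    return {"total": 1250.0} if "Fix these issues" in prompt else {"total": None}


def fake_audit(result: dict) -> list[str]:
    return [] if result["total"] is not None else ["total is missing"]


final = extract_with_audit(fake_extract, fake_audit, "Extract clauses")
```

Capping the number of rounds bounds the extra API cost when the model cannot satisfy the auditor.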

CLI Commands (v0.3.0+)

# List all plugins
strutex plugins list

# Filter by type
strutex plugins list --type provider

# Get plugin details
strutex plugins info gemini --type provider

# Refresh discovery cache
strutex plugins refresh

Plugin System

Everything is pluggable. Just inherit from a base class:

from strutex.plugins import Provider

class MyProvider(Provider):
    """Auto-registered as 'myprovider'"""
    capabilities = ["vision"]

    def process(self, file_path, prompt, schema, mime_type, **kwargs):
        # Your LLM logic
        ...

# Customize with class arguments
class FastProvider(Provider, name="fast"):
    """Registered as 'fast' with high priority"""
    priority = 90  # Class attribute
    cost = 0.5

    def process(self, file_path, prompt, schema, mime_type, **kwargs): ...
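
Auto-registration via inheritance, including class arguments such as name="fast", can be built on Python's __init_subclass__ hook. The sketch below shows the mechanism with a hypothetical BaseProvider class. It is one plausible way to get this behavior and not necessarily how strutex implements it.

```python
from typing import Optional


class BaseProvider:
    """Hypothetical base class; subclasses register themselves when defined."""

    registry = {}   # maps registered name -> provider class
    priority = 50   # default; subclasses may override

    def __init_subclass__(cls, name: Optional[str] = None, **kwargs):
        super().__init_subclass__(**kwargs)
        # Fall back to the lowercased class name when no name is given
        BaseProvider.registry[name or cls.__name__.lower()] = cls


class MyProvider(BaseProvider):
    pass


class FastProvider(BaseProvider, name="fast"):
    priority = 90


def best_provider():
    """Pick the registered class with the highest priority."""
    return max(BaseProvider.registry.values(), key=lambda c: c.priority)
```

Defining the subclass is the only registration step; no decorator or explicit register() call is needed.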

For Distributable Packages

Use entry points in pyproject.toml:

[project.entry-points."strutex.providers"]
my_provider = "my_package:MyProvider"
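
Entry points declared this way can be discovered with the standard library's importlib.metadata. The sketch below lists entry points in the strutex.providers group without importing the plugins, which matches the lazy-loading idea; discover_providers is a hypothetical helper, and the hasattr branch only exists to support older Python versions.

```python
from importlib.metadata import entry_points


def discover_providers(group: str = "strutex.providers") -> dict:
    """Return {entry point name: entry point} without importing any plugin."""
    eps = entry_points()
    if hasattr(eps, "select"):        # Python 3.10 and later
        selected = eps.select(group=group)
    else:                             # older versions return a plain dict
        selected = eps.get(group, [])
    return {ep.name: ep for ep in selected}


found = discover_providers()
# found["my_provider"].load() would import my_package and return MyProvider,
# assuming a package with the entry point above is installed.
```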

Plugin Types

| Type | Purpose | Examples |
|---|---|---|
| provider | LLM backends | Gemini, OpenAI, Claude, Ollama |
| security | Input/output protection | Injection detection, sanitization |
| extractor | Document parsing | PDF, Image OCR, Excel |
| validator | Output validation | Schema, sum checks, date formats |
| postprocessor | Data transformation | Date/number normalization |
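
As an example of what a sum check in a validator might do, the sketch below compares the line-item amounts against the extracted total. The check_sum function and its return convention (a list of problems, empty when valid) are assumptions for illustration, not the strutex validator interface.

```python
import math


def check_sum(result: dict, tolerance: float = 0.01) -> list[str]:
    """Return a list of problems; an empty list means the totals agree."""
    items_total = sum(item["amount"] for item in result.get("items", []))
    if math.isclose(items_total, result["total"], abs_tol=tolerance):
        return []
    return [f"items sum to {items_total:.2f}, but total is {result['total']:.2f}"]


good = {"total": 30.0, "items": [{"amount": 10.0}, {"amount": 20.0}]}
bad = {"total": 35.0, "items": [{"amount": 10.0}, {"amount": 20.0}]}
```

The tolerance absorbs floating-point and rounding noise; invoices with tax or discounts would need those fields included in the comparison.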

Supported Formats

| Format | Extensions | Method |
|---|---|---|
| PDF | .pdf | Text extraction with fallback chain |
| Images | .png, .jpg, .tiff | Direct vision or OCR |
| Excel | .xlsx, .xls | Converted to structured text |
| Text | .txt, .csv | Direct input |
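
A fallback chain tries extractors in order until one yields usable text. The sketch below shows the pattern with two stub extractors. The names text_layer and ocr are hypothetical, and the actual order and set of extractors in strutex may differ.

```python
from typing import Callable, Optional


def extract_text(path: str, extractors: list[Callable[[str], Optional[str]]]) -> str:
    """Try each extractor in order and return the first non-empty text."""
    errors = []
    for extractor in extractors:
        try:
            text = extractor(path)
        except Exception as exc:  # one failing extractor should not stop the chain
            errors.append(f"{extractor.__name__}: {exc}")
            continue
        if text and text.strip():
            return text
    raise RuntimeError(f"all extractors failed for {path}: {errors}")


def text_layer(path: str) -> str:
    """Hypothetical: read embedded PDF text. A scanned PDF has none."""
    return ""


def ocr(path: str) -> str:
    """Hypothetical: run OCR on the rendered pages."""
    return "Invoice INV-2024-001"


text = extract_text("scan.pdf", [text_layer, ocr])
```

Ordering the cheap extractor first means OCR only runs when the text layer is empty or unreadable.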

Roadmap

See ROADMAP.md for the full development plan.

Recent releases:

  • v0.1.0 — Core functionality
  • v0.2.0 — Plugin registry + Security layer
  • v0.3.0 — Plugin System v2
  • v0.6.0 — Built-in Schemas & Logging
  • v0.7.0 — Providers & Retries
  • v0.8.0 — Async, Batch, Cache, Verification
  • v0.8.1 — Documentation & Coverage Fixes

Documentation

📚 Read the Docs

# Install docs dependencies
pip install mkdocs mkdocs-material mkdocstrings[python] mike

# Serve locally
mkdocs serve

# Build static site
mkdocs build

# Deploy with versioning
mike deploy 0.3.0 latest --push

License

This project is licensed under the GNU General Public License v3.0 — see LICENSE for details.


Contributing

Contributions welcome! Priority areas:

  1. New plugins — Providers, extractors, validators
  2. Documentation — Examples and tutorials
  3. Testing — Expand test coverage
