Structured AI document processing with robust fallback strategies.

strutex

Structured Text Extraction — Extract structured JSON from documents using LLMs

License: GPL v3 | Python 3.10+


Features

  • Plugin System v2 — Auto-registration via inheritance, lazy loading, entry points
  • Hooks — Callbacks and decorators for pre/post processing pipeline
  • CLI Tooling — strutex plugins list|info|refresh commands
  • Multi-Provider LLM Support — Gemini, OpenAI, Anthropic, and custom endpoints
  • Universal Document Support — PDFs, images, Excel, and custom formats
  • Schema-Driven Extraction — Define your output structure, get consistent JSON
  • Security First — Built-in input sanitization and output validation

Quick Start

Installation

# Core only
pip install strutex

# With CLI commands (extras are quoted so shells like zsh do not glob the brackets)
pip install "strutex[cli]"

# With OCR support
pip install "strutex[ocr]"

# Everything
pip install "strutex[all]"

Basic Usage

from strutex import DocumentProcessor, Object, String, Number, Array

# Define your output schema
invoice_schema = Object(
    description="Invoice data",
    properties={
        "invoice_number": String(description="The invoice ID"),
        "total": Number(),
        "items": Array(
            items=Object(
                properties={
                    "description": String(),
                    "amount": Number(),
                }
            )
        )
    }
)

# Process a document
processor = DocumentProcessor(provider="gemini")
result = processor.process(
    file_path="invoice.pdf",
    prompt="Extract the invoice details.",
    schema=invoice_schema
)

print(result["invoice_number"])  # "INV-2024-001"
print(result["total"])           # 1250.00
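
Because the values are read from a document by an LLM, a cheap consistency check on the result can catch misreads before they reach downstream code. The sketch below assumes the result is a plain dict shaped like invoice_schema above; the helper name items_match_total and the sample data are hypothetical and not part of strutex.

```python
import math


def items_match_total(result: dict, tolerance: float = 0.01) -> bool:
    """Return True when the line-item amounts add up to the reported total."""
    total = result.get("total")
    if total is None:
        return False
    items = result.get("items") or []
    # fsum avoids accumulating floating-point error over many items.
    item_sum = math.fsum(item.get("amount", 0.0) for item in items)
    return math.isclose(item_sum, total, rel_tol=0.0, abs_tol=tolerance)


# Hypothetical extraction result with the same shape as invoice_schema.
sample = {
    "invoice_number": "INV-2024-001",
    "total": 1250.00,
    "items": [
        {"description": "Consulting", "amount": 1000.00},
        {"description": "Travel", "amount": 250.00},
    ],
}

print(items_match_total(sample))  # True
```

A failed check does not say which field is wrong, only that the extraction is worth a second look.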

CLI Commands (v0.3.0+)

# List all plugins
strutex plugins list

# Filter by type
strutex plugins list --type provider

# Get plugin details
strutex plugins info gemini --type provider

# Refresh discovery cache
strutex plugins refresh

Plugin System

Everything is pluggable. Just inherit from a base class:

from strutex.plugins import Provider

class MyProvider(Provider):
    """Auto-registered as 'myprovider'"""
    capabilities = ["vision"]

    def process(self, file_path, prompt, schema, mime_type, **kwargs):
        # Your LLM logic
        ...

# Customize with class arguments
class FastProvider(Provider, name="fast"):
    """Registered as 'fast' with high priority"""
    priority = 90  # Class attribute
    cost = 0.5

    def process(self, file_path, prompt, schema, mime_type, **kwargs):
        ...
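
For readers wondering how registration by inheritance can work at all, here is a minimal sketch built on Python's __init_subclass__ hook. It is an illustration of the general technique, not strutex's implementation: the Provider class below is a stand-in, and the registry dict, plugin_name attribute, best_provider helper, and the rule that a larger priority wins are all assumptions.

```python
from typing import ClassVar, Optional


class Provider:
    """Stand-in base class for illustration; NOT strutex's Provider."""

    registry: ClassVar[dict] = {}   # hypothetical name -> class mapping
    priority: ClassVar[int] = 50    # assumed default priority
    plugin_name: ClassVar[str] = ""

    def __init_subclass__(cls, name: Optional[str] = None, **kwargs):
        # Runs once for every subclass, at class-definition time.
        super().__init_subclass__(**kwargs)
        cls.plugin_name = name or cls.__name__.lower()
        Provider.registry[cls.plugin_name] = cls


class MyProvider(Provider):
    pass


class FastProvider(Provider, name="fast"):
    priority = 90


def best_provider() -> type:
    """Pick the registered class with the highest priority (assumed rule)."""
    return max(Provider.registry.values(), key=lambda c: c.priority)


print(sorted(Provider.registry))   # ['fast', 'myprovider']
print(best_provider().__name__)    # FastProvider
```

The class keyword argument (name="fast") is passed straight to __init_subclass__, which is why no decorator or explicit register call is needed.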

For Distributable Packages

Use entry points in pyproject.toml:

[project.entry-points."strutex.providers"]
my_provider = "my_package:MyProvider"

Plugin Types

Type           Purpose                  Examples
provider       LLM backends             Gemini, OpenAI, Claude, Ollama
security       Input/output protection  Injection detection, sanitization
extractor      Document parsing         PDF, Image OCR, Excel
validator      Output validation        Schema, sum checks, date formats
postprocessor  Data transformation      Date/number normalization
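
As an illustration of the postprocessor row, the following sketch shows the kind of date and number normalization such a plugin might perform. It is plain Python and does not use the strutex plugin base classes; the function names, the accepted date formats, and the assumption of US-style thousands separators are all hypothetical.

```python
from datetime import datetime

# Assumed input formats, tried in order.
DATE_FORMATS = ("%Y-%m-%d", "%d.%m.%Y", "%m/%d/%Y")


def normalize_date(value: str) -> str:
    """Return an ISO 8601 date, or the input unchanged if no format matches."""
    for fmt in DATE_FORMATS:
        try:
            return datetime.strptime(value.strip(), fmt).date().isoformat()
        except ValueError:
            continue
    return value


def normalize_number(value) -> float:
    """Convert '1,250.00' or '$ 1,250.00' style strings to a float."""
    if isinstance(value, (int, float)):
        return float(value)
    cleaned = value.replace(",", "").replace("$", "").strip()
    return float(cleaned)


print(normalize_date("31.12.2024"))  # 2024-12-31
print(normalize_number("1,250.00"))  # 1250.0
```

Leaving unrecognized dates untouched keeps the original text available for a later validator to flag.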

Supported Formats

Format  Extensions         Method
PDF     .pdf               Text extraction with fallback chain
Images  .png, .jpg, .tiff  Direct vision or OCR
Excel   .xlsx, .xls        Converted to structured text
Text    .txt, .csv         Direct input
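
The "fallback chain" in the PDF row can be pictured as trying extractors in order until one returns usable text. The sketch below shows that general pattern only; extract_with_fallback and the two toy extractors are hypothetical and do not reflect strutex's actual extractor classes or their ordering.

```python
from typing import Callable, Iterable, Optional

Extractor = Callable[[str], Optional[str]]


def extract_with_fallback(path: str, extractors: Iterable[Extractor]) -> str:
    """Return the first non-empty result; raise if every extractor fails."""
    errors = []
    for extractor in extractors:
        try:
            text = extractor(path)
        except Exception as exc:
            # Record the failure and move on to the next extractor.
            errors.append(f"{extractor.__name__}: {exc}")
            continue
        if text and text.strip():
            return text
        errors.append(f"{extractor.__name__}: empty result")
    raise RuntimeError("all extractors failed: " + "; ".join(errors))


# Toy extractors standing in for a text-layer reader and an OCR step.
def text_layer(path: str) -> Optional[str]:
    raise ValueError("no text layer")


def ocr(path: str) -> Optional[str]:
    return "Invoice INV-2024-001"


print(extract_with_fallback("invoice.pdf", [text_layer, ocr]))
# Invoice INV-2024-001
```

Collecting the per-extractor errors makes the final exception useful when a scanned or damaged file defeats every method.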

Roadmap

See ROADMAP.md for the full development plan.

Recent releases:

  • v0.1.0 — Core functionality
  • v0.2.0 — Plugin registry + Security layer
  • v0.3.0 — Plugin System v2 (lazy loading, CLI, hooks)
  • v0.4.0 — Additional providers (OpenAI, Anthropic, Ollama)

Documentation

📚 Read the Docs

# Install docs dependencies (the extra is quoted so shells like zsh do not glob the brackets)
pip install mkdocs mkdocs-material "mkdocstrings[python]" mike

# Serve locally
mkdocs serve

# Build static site
mkdocs build

# Deploy with versioning
mike deploy 0.3.0 latest --push

License

This project is licensed under the GNU General Public License v3.0 — see LICENSE for details.

For commercial use, please contact me.


Contributing

Contributions welcome! Priority areas:

  1. New plugins — Providers, extractors, validators
  2. Documentation — Examples and tutorials
  3. Testing — Expand test coverage
