Structured AI document processing with robust fallback strategies.

strutex

Structured Text Extraction — Extract structured JSON from documents using LLMs

CI License: GPL v3 Python 3.10+ PyPI


Features

  • Plugin System v2 — Auto-registration via inheritance, lazy loading, entry points
  • CLI Tooling — strutex plugins list|info|refresh commands
  • Multi-Provider LLM Support — Gemini, OpenAI, Anthropic, and custom endpoints
  • Universal Document Support — PDFs, images, Excel, and custom formats
  • Schema-Driven Extraction — Define your output structure, get consistent JSON
  • Security First — Built-in input sanitization and output validation

Quick Start

Installation

# Core only
pip install strutex

# With CLI commands (quotes keep shells such as zsh from expanding the brackets)
pip install "strutex[cli]"

# With OCR support
pip install "strutex[ocr]"

# Everything
pip install "strutex[all]"

Basic Usage

from strutex import DocumentProcessor, Object, String, Number, Array

# Define your output schema
invoice_schema = Object(
    description="Invoice data",
    properties={
        "invoice_number": String(description="The invoice ID"),
        "total": Number(),
        "items": Array(
            items=Object(
                properties={
                    "description": String(),
                    "amount": Number(),
                }
            )
        )
    }
)

# Process a document
processor = DocumentProcessor(provider="gemini")
result = processor.process(
    file_path="invoice.pdf",
    prompt="Extract the invoice details.",
    schema=invoice_schema
)

print(result["invoice_number"])  # "INV-2024-001"
print(result["total"])           # 1250.00

CLI Commands (v0.3.0+)

# List all plugins
strutex plugins list

# Filter by type
strutex plugins list --type provider

# Get plugin details
strutex plugins info gemini --type provider

# Refresh discovery cache
strutex plugins refresh

Plugin System

Everything is pluggable. Just inherit from a base class:

from strutex.plugins import Provider

class MyProvider(Provider):
    """Auto-registered as 'myprovider'"""
    capabilities = ["vision"]

    def process(self, file_path, prompt, schema, mime_type, **kwargs):
        # Your LLM logic
        ...

# Customize with class arguments
class FastProvider(Provider, name="fast"):
    """Registered as 'fast' with high priority"""
    priority = 90  # Class attribute
    cost = 0.5

    def process(self, file_path, prompt, schema, mime_type, **kwargs): ...

For Distributable Packages

Use entry points in pyproject.toml:

[project.entry-points."strutex.providers"]
my_provider = "my_package:MyProvider"

Plugin Types

| Type          | Purpose                 | Examples                          |
|---------------|-------------------------|-----------------------------------|
| provider      | LLM backends            | Gemini, OpenAI, Claude, Ollama    |
| security      | Input/output protection | Injection detection, sanitization |
| extractor     | Document parsing        | PDF, Image OCR, Excel             |
| validator     | Output validation       | Schema, sum checks, date formats  |
| postprocessor | Data transformation     | Date/number normalization         |
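
As an illustration of the "sum checks" a validator might perform, the sketch below verifies that line-item amounts add up to the invoice total. It is plain Python operating on a dict shaped like the invoice schema from Basic Usage. `check_invoice_total`, the tolerance, and the sample line items are invented for this example and are not part of the strutex API.

```python
import math


def check_invoice_total(result: dict, tolerance: float = 0.01) -> bool:
    """Return True when the line-item amounts add up to the stated total."""
    items_sum = sum(item["amount"] for item in result.get("items", []))
    # Compare with an absolute tolerance to absorb float rounding.
    return math.isclose(items_sum, result["total"], abs_tol=tolerance)


# Made-up sample data shaped like the invoice schema.
sample = {
    "invoice_number": "INV-2024-001",
    "total": 1250.00,
    "items": [
        {"description": "Consulting", "amount": 1000.00},
        {"description": "Travel", "amount": 250.00},
    ],
}

print(check_invoice_total(sample))  # True
```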

Supported Formats

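| Format | Extensions         | Method                              |
|--------|--------------------|-------------------------------------|
| PDF    | .pdf               | Text extraction with fallback chain |
| Images | .png, .jpg, .tiff  | Direct vision or OCR                |
| Excel  | .xlsx, .xls        | Converted to structured text        |
| Text   | .txt, .csv         | Direct input                        |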
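
The "fallback chain" idea for PDFs can be sketched generically: try extraction steps in order and keep the first one that yields text. The code below is an assumption-laden illustration, not strutex's extractor code. `extract_with_fallback`, `embedded_text`, and `ocr_text` are hypothetical names, and the two steps are stubs that return fixed strings.

```python
from typing import Callable, Iterable, Optional

Extractor = Callable[[str], Optional[str]]


def extract_with_fallback(path: str, extractors: Iterable[Extractor]) -> str:
    """Try each extractor in order and return the first non-empty text."""
    errors = []
    for extractor in extractors:
        try:
            text = extractor(path)
        except Exception as exc:  # one failing step should not stop the chain
            errors.append(f"{extractor.__name__}: {exc}")
            continue
        if text and text.strip():
            return text
    raise RuntimeError(f"No extractor produced text for {path!r}: {errors}")


# Hypothetical stand-ins for real extraction steps.
def embedded_text(path: str) -> str:
    return ""  # e.g. a scanned PDF with no text layer


def ocr_text(path: str) -> str:
    return "Invoice INV-2024-001"


print(extract_with_fallback("invoice.pdf", [embedded_text, ocr_text]))
# Invoice INV-2024-001
```

Ordering the cheapest step first means slower steps such as OCR run only when the earlier ones come back empty or fail.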

Roadmap

See ROADMAP.md for the full development plan.

Recent releases:

  • v0.1.0 — Core functionality
  • v0.2.0 — Plugin registry + Security layer
  • v0.3.0 — Plugin System v2 (lazy loading, CLI, hooks)
  • v0.4.0 — Additional providers (OpenAI, Anthropic, Ollama)

Documentation

📚 Read the Docs

# Install docs dependencies
pip install mkdocs mkdocs-material "mkdocstrings[python]" mike

# Serve locally
mkdocs serve

# Build static site
mkdocs build

# Deploy with versioning
mike deploy 0.3.0 latest --push

License

This project is licensed under the GNU General Public License v3.0 — see LICENSE for details.

For commercial use, please contact me.


Contributing

Contributions welcome! Priority areas:

  1. New plugins — Providers, extractors, validators
  2. Documentation — Examples and tutorials
  3. Testing — Expand test coverage

