Structured data extraction from text using LLMs and dynamic model generation

Project description

structx

Advanced structured data extraction from any document using LLMs with multimodal support.

structx is a powerful Python library for extracting structured data from text, tables, and documents using Large Language Models (LLMs). It passes existing PDFs directly to vision-capable models; the optional docs extra converts other document formats to PDF first.

🔔 Package rename notice (PyPI)

The PyPI distribution has been renamed from structx-llm to structx (September 2025).

Imports are unchanged: continue using import structx
Document processing now lives in the optional docs extra
Please update your environments and requirement files to use the new name

Upgrade commands:

pip uninstall -y structx-llm
pip install -U structx

If you previously pinned structx-llm in requirements or lock files, replace it with structx. Install structx[docs] for non-PDF document conversion.
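For projects with several requirement files, the rename can be scripted. The following is a minimal sketch using only the standard library; the helper names `rename_requirement` and `rename_requirements` are hypothetical and not part of structx. It rewrites only a leading `structx-llm` and leaves version pins, comments, and other packages alone.

```python
import re

# Matches the old distribution name at the start of a requirement line,
# followed by end of line, whitespace, or the start of an extras/version/marker part.
_OLD_NAME = re.compile(r"^structx-llm(?=$|[\s\[<>=!~;])", re.IGNORECASE)


def rename_requirement(line: str) -> str:
    """Replace a leading `structx-llm` with `structx`, keeping any pin or marker."""
    return _OLD_NAME.sub("structx", line)


def rename_requirements(text: str) -> str:
    """Apply the rename to every line of a requirements file's contents."""
    return "\n".join(rename_requirement(line) for line in text.splitlines())


example = "pandas>=2.0\nstructx-llm==0.4.8\n# structx-llm is mentioned in a comment"
renamed = rename_requirements(example)
print(renamed)
```

Lock files generated by tools such as uv or Poetry are better regenerated than edited by hand.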

✨ Key Features

🎯 Advanced Document Processing

Multimodal PDF Pipeline: Passes PDFs directly to vision-capable models and converts supported non-PDF documents to PDF
🖼️ Vision-Enabled Extraction: Native instructor multimodal support for PDFs and images
🔄 Smart Format Detection: Automatic processing mode selection for best results
📊 Flexible File Support: CSV, Excel, JSON, Parquet, raw text, and existing PDFs in the base install; DOCX, PPTX, images, and more via structx[docs]

🚀 Intelligent Data Extraction

🔄 Dynamic Model Generation: Create type-safe Pydantic models from natural language queries
🎯 Automatic Schema Inference: Intelligent schema generation and refinement
📊 Complex Data Structures: Support for nested and hierarchical data
🔄 Natural Language Refinement: Improve models with conversational instructions
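To make the idea of dynamic model generation concrete, here is a small illustration of building a record type at runtime from a field specification. It deliberately uses the standard library's `dataclasses.make_dataclass` as a stand-in so that it runs without extra dependencies; structx generates Pydantic models, and with Pydantic installed `pydantic.create_model` plays the analogous role. The field names below are hypothetical, and unlike Pydantic, dataclasses do not validate field types.

```python
from dataclasses import asdict, make_dataclass

# Hypothetical field specification, of the kind a model might propose for the
# query "extract incident date and details".
field_spec = {"incident_date": str, "details": str, "cpu_percent": int}

# Build the record type at runtime from (name, type) pairs.
Incident = make_dataclass("Incident", list(field_spec.items()))

record = Incident(
    incident_date="2024-01-15",
    details="high CPU usage on server-01",
    cpu_percent=92,
)
print(asdict(record))
```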

⚡ Performance & Reliability

🚀 High-Performance Processing: Threaded sync and native async row requests
🔄 Robust Error Handling: Automatic retry mechanism with exponential backoff
📈 Token Usage Tracking: Detailed step-by-step metrics for cost monitoring
Flexible Configuration: Model settings from arguments, YAML, environment variables, dotenv files, and secrets through Pydantic Settings
🔌 Multiple LLM Providers: Support through litellm integration
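The "threaded sync and native async row requests" feature can be pictured with the standard library alone. This is a sketch of the general pattern, not structx's internals: `extract_row` and `extract_row_async` are hypothetical stand-ins for a per-row LLM request.

```python
import asyncio
from concurrent.futures import ThreadPoolExecutor


def extract_row(row: str) -> dict:
    """Hypothetical stand-in for one per-row LLM request."""
    return {"row": row, "length": len(row)}


def run_threaded(rows: list[str], max_workers: int = 4) -> list[dict]:
    """Synchronous path: fan rows out to a thread pool, keeping input order."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(extract_row, rows))


async def extract_row_async(row: str) -> dict:
    """Hypothetical async stand-in; a real call would await network I/O here."""
    await asyncio.sleep(0)
    return extract_row(row)


async def run_async(rows: list[str]) -> list[dict]:
    """Async path: schedule all rows concurrently; gather preserves input order."""
    return await asyncio.gather(*(extract_row_async(r) for r in rows))


rows = ["disk full on server-02", "high CPU usage on server-01"]
threaded_results = run_threaded(rows)
async_results = asyncio.run(run_async(rows))
```

Both paths return results in input order, which keeps each output aligned with the row it came from.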

Installation

pip install structx

For converting DOCX, PowerPoint, OpenDocument, markup, image, and other non-PDF document formats:

pip install "structx[docs]"

🔧 What The Package Provides

Structured readers for CSV, Excel, JSON, Parquet, and Feather
Instructor multimodal vision support
Optional Docling document parsing with CPU-only PyTorch resolution for uv on Linux
Optional WeasyPrint PDF rendering for non-PDF document formats

Quick Start

Basic Text Extraction

from structx import Extractor

# Initialize extractor
extractor = Extractor.from_litellm(
    model="gpt-4o",
    api_key="your-api-key",
    max_retries=3,      # Automatically retry on transient errors
    min_wait=1,         # Start with 1 second wait
    max_wait=10         # Maximum 10 seconds between retries
)

# Extract from text
result = extractor.extract(
    data="System check on 2024-01-15 detected high CPU usage (92%) on server-01.",
    query="extract incident date and details"
)

# Access results
print(f"Successful rows: {result.success_count}")
print(result.data[0].model_dump_json(indent=2))
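The `max_retries`, `min_wait`, and `max_wait` arguments above describe a bounded exponential backoff. The sketch below shows one common way such a policy behaves, assuming the delay doubles on each attempt and is capped at `max_wait`; it is an illustration, not structx's retry code, and `call_with_retry`, `backoff_delays`, and `flaky` are hypothetical names.

```python
import time


def backoff_delays(max_retries: int, min_wait: float, max_wait: float) -> list[float]:
    """Delay before each retry: doubles from min_wait, capped at max_wait."""
    return [min(max_wait, min_wait * 2 ** attempt) for attempt in range(max_retries)]


def call_with_retry(func, max_retries=3, min_wait=1.0, max_wait=10.0, sleep=time.sleep):
    """Call func, retrying on ConnectionError with capped exponential backoff."""
    delays = backoff_delays(max_retries, min_wait, max_wait)
    for attempt in range(max_retries + 1):
        try:
            return func()
        except ConnectionError:
            if attempt == max_retries:
                raise  # retries exhausted; surface the last error
            sleep(delays[attempt])


# Demo: fail twice, then succeed. A recording fake replaces time.sleep
# so the example finishes instantly.
attempts = {"count": 0}
slept = []


def flaky():
    attempts["count"] += 1
    if attempts["count"] < 3:
        raise ConnectionError("transient failure")
    return "ok"


outcome = call_with_retry(flaky, sleep=slept.append)
```

Production retry policies often add random jitter to each delay so that many clients do not retry in lockstep.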

📄 Document Processing with Multimodal Support

Install structx[docs] before using non-PDF document formats. Existing PDFs can be passed directly through the multimodal path with the base installation.

# Reuses the extractor created in Basic Text Extraction.
# Process a PDF invoice through the multimodal pipeline
result = extractor.extract(
    data="scripts/example_input/S0305SampleInvoice.pdf",
    query="extract the invoice number, total amount, and line items"
)

# Convert a DOCX contract and process with multimodal support
result = extractor.extract(
    data="scripts/example_input/free-consultancy-agreement.docx",
    query="extract parties, effective date, and payment terms"
)

📊 Token Usage Monitoring

# `result` is the value returned by an earlier extractor.extract(...) call.
# Check token usage for cost monitoring
usage = result.usage
if usage:
    print(f"Total tokens: {usage.total_tokens}")
    for step, calls in usage.steps.items():
        print(step.value, [call.total_tokens for call in calls])
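Per-step totals like these can be rolled up into a rough cost estimate. The sketch below mirrors only the shape visible in the snippet above (`usage.total_tokens`, `usage.steps`, `step.value`, `call.total_tokens`); the `Step`, `Call`, and `Usage` classes, the step names, and the flat per-1k-token price are all hypothetical stand-ins, not structx types.

```python
from dataclasses import dataclass, field
from enum import Enum


class Step(Enum):
    """Hypothetical pipeline steps; the real step names may differ."""
    SCHEMA = "schema_generation"
    EXTRACTION = "extraction"


@dataclass
class Call:
    prompt_tokens: int
    completion_tokens: int

    @property
    def total_tokens(self) -> int:
        return self.prompt_tokens + self.completion_tokens


@dataclass
class Usage:
    steps: dict = field(default_factory=dict)

    @property
    def total_tokens(self) -> int:
        return sum(c.total_tokens for calls in self.steps.values() for c in calls)


def estimate_cost(usage: Usage, usd_per_1k_tokens: float) -> float:
    """Rough cost estimate from a flat, assumed per-1k-token price."""
    return usage.total_tokens / 1000 * usd_per_1k_tokens


usage = Usage(steps={
    Step.SCHEMA: [Call(120, 80)],
    Step.EXTRACTION: [Call(300, 150), Call(280, 70)],
})
# Sum the calls within each step, keyed by the step's string value.
per_step = {
    step.value: sum(c.total_tokens for c in calls)
    for step, calls in usage.steps.items()
}
```

Real provider pricing usually differs between prompt and completion tokens, so a flat rate gives only an order-of-magnitude figure.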

🚀 Why Multimodal PDF Processing?

Compared with text-only extraction, sending the document itself to a vision-capable model has several advantages:

📄 Context Preservation: Full document layout and structure are maintained
🎯 Higher Accuracy: Vision models can interpret tables, charts, and complex layouts
🔄 No Chunking Issues: Eliminates problems with information split across chunks
📊 Universal Format: Existing PDFs are passed through directly; supported non-PDF documents become processable through PDF conversion
🖼️ Visual Understanding: Handles documents with visual elements, formatting, and structure

📚 Documentation

For comprehensive documentation, examples, and guides, visit our documentation site.

Examples

Check out our example gallery for real-world use cases.

📁 Supported File Formats

📊 Structured Data (Direct Processing)

CSV: Comma-separated values with custom delimiters
Excel: .xlsx/.xls with sheet selection and custom options
JSON: JavaScript Object Notation with nested support
Parquet: Columnar storage format for large datasets
Feather: Fast binary format for data frames

📄 Unstructured Documents (Multimodal Pipeline)

Format	Extensions	Processing Method	Quality
PDF	`.pdf`	PDF → Multimodal	⭐⭐⭐⭐⭐
Word	`.docx`, `.doc`	Docling → HTML → PDF → Multimodal	⭐⭐⭐⭐⭐
PowerPoint	`.pptx`, `.ppt`	Docling → HTML → PDF → Multimodal	⭐⭐⭐⭐
Text	`.txt`, `.md`, `.py`, `.log`, `.xml`, `.html`	Docling → HTML → PDF → Multimodal	⭐⭐⭐⭐

🔄 Processing Pipeline

PDF passthrough: Existing PDFs are sent directly to multimodal extraction
Docling parsing: Reads non-PDF document-like inputs into a structured document model
WeasyPrint rendering: Converts Docling HTML to a temporary PDF for non-PDF inputs
Multimodal extraction: Sends the rendered PDF to instructor's multimodal API
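The routing described above can be sketched as a lookup on the file extension. This is an illustration under stated assumptions, not structx's dispatch code: `plan_processing` is a hypothetical function, the extension sets are copied from the format lists in this README, and the `.parquet` and `.feather` suffixes are assumed since the README names only the formats.

```python
from pathlib import Path

# Extension sets based on the format lists in this README (assumed suffixes
# for Parquet and Feather).
STRUCTURED = {".csv", ".xlsx", ".xls", ".json", ".parquet", ".feather"}
CONVERTIBLE = {".docx", ".doc", ".pptx", ".ppt", ".txt", ".md", ".py", ".log", ".xml", ".html"}


def plan_processing(path: str) -> list[str]:
    """Return the ordered processing stages for a file, based on its suffix."""
    suffix = Path(path).suffix.lower()
    if suffix in STRUCTURED:
        return ["structured reader"]
    if suffix == ".pdf":
        # Existing PDFs skip conversion entirely.
        return ["multimodal extraction"]
    if suffix in CONVERTIBLE:
        return ["docling parsing", "weasyprint rendering", "multimodal extraction"]
    raise ValueError(f"unsupported file type: {suffix or path}")


for name in ("S0305SampleInvoice.pdf", "free-consultancy-agreement.docx", "metrics.csv"):
    print(name, "->", " -> ".join(plan_processing(name)))
```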

Contributing

Contributions are welcome! Please read our Contributing Guidelines for details.

License

This project is licensed under the MIT License - see the LICENSE file for details.

Project details

Release history

0.6.0 (this version): Jul 11, 2026
0.5.1: Jul 10, 2026
0.5.0: Jul 10, 2026
0.4.11: Jul 10, 2026
0.4.10: Sep 24, 2025
0.4.8: Sep 24, 2025

Download files

Source Distribution

structx-0.6.0.tar.gz (48.8 kB view details)

Uploaded Jul 11, 2026 Source

Built Distribution

structx-0.6.0-py3-none-any.whl (40.9 kB view details)

Uploaded Jul 11, 2026 Python 3

File details

Details for the file structx-0.6.0.tar.gz.

File metadata

Download URL: structx-0.6.0.tar.gz
Upload date: Jul 11, 2026
Size: 48.8 kB
Tags: Source
Uploaded using Trusted Publishing? Yes
Uploaded via: twine/6.1.0 CPython/3.13.12

File hashes

Hashes for structx-0.6.0.tar.gz
Algorithm	Hash digest
SHA256	`fffb5a7aa035230e39f9eacf93b253d5e2598f057cf96b22b9504f7cf209dafc`
MD5	`ebf34325c3240d0d8132b561a89c2aee`
BLAKE2b-256	`da88203690b446a8d84321e9e31bb96f469fe825844308b0fa2f4cd739da50f0`

See more details on using hashes here.
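A downloaded file can be checked against the SHA256 digest in the table above with the standard library. This is a minimal sketch; `sha256_of_file` and `matches_published_digest` are hypothetical helper names, and the demo uses a temporary file rather than the real archive so that it runs anywhere.

```python
import hashlib
import hmac
import os
import tempfile


def sha256_of_file(path: str, chunk_size: int = 1 << 20) -> str:
    """Hash a file in chunks so large archives need not fit in memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_published_digest(path: str, expected_hex: str) -> bool:
    """Compare the file's SHA256 with a published hex digest (case-insensitive)."""
    return hmac.compare_digest(sha256_of_file(path), expected_hex.strip().lower())


# Demo on a temporary file with known contents.
payload = b"example payload"
with tempfile.NamedTemporaryFile(delete=False) as tmp:
    tmp.write(payload)
    demo_path = tmp.name

demo_ok = matches_published_digest(demo_path, hashlib.sha256(payload).hexdigest())
demo_bad = matches_published_digest(demo_path, "0" * 64)
os.remove(demo_path)
```

For the source distribution, the same call would take the path to the downloaded `structx-0.6.0.tar.gz` and the SHA256 value listed in the table.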

Provenance

The following attestation bundles were made for structx-0.6.0.tar.gz:

Publisher: publish.yml on Blacksuan19/structx

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.

Statement:
- Statement type: https://in-toto.io/Statement/v1
- Predicate type: https://docs.pypi.org/attestations/publish/v1
- Subject name: structx-0.6.0.tar.gz
- Subject digest: fffb5a7aa035230e39f9eacf93b253d5e2598f057cf96b22b9504f7cf209dafc
- Sigstore transparency entry: 2141617184
- Sigstore integration time: Jul 11, 2026
Source repository:
- Permalink: Blacksuan19/structx@45292cbcfe35981f74a3b556a8867523e691f20b
- Branch / Tag: refs/tags/0.6.0
- Owner: https://github.com/Blacksuan19
- Access: public
Publication detail:
- Token Issuer: https://token.actions.githubusercontent.com
- Runner Environment: github-hosted
- Publication workflow: publish.yml@45292cbcfe35981f74a3b556a8867523e691f20b
- Trigger Event: push

File details

Details for the file structx-0.6.0-py3-none-any.whl.

File metadata

Download URL: structx-0.6.0-py3-none-any.whl
Upload date: Jul 11, 2026
Size: 40.9 kB
Tags: Python 3
Uploaded using Trusted Publishing? Yes
Uploaded via: twine/6.1.0 CPython/3.13.12

File hashes

Hashes for structx-0.6.0-py3-none-any.whl
Algorithm	Hash digest
SHA256	`09d7a25e6dcc6df04c041d17d71cf9d6e34d3f2fcdc9d3cfe6848841b2b09e35`
MD5	`25b7bc49150460bcac8c88c1d3a4a136`
BLAKE2b-256	`d445bf6292e6a6da2abc375d3d444ce44c4d5dc4ac28cc2a582f747dd478390d`

See more details on using hashes here.

Provenance

The following attestation bundles were made for structx-0.6.0-py3-none-any.whl:

Publisher: publish.yml on Blacksuan19/structx

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.

Statement:
- Statement type: https://in-toto.io/Statement/v1
- Predicate type: https://docs.pypi.org/attestations/publish/v1
- Subject name: structx-0.6.0-py3-none-any.whl
- Subject digest: 09d7a25e6dcc6df04c041d17d71cf9d6e34d3f2fcdc9d3cfe6848841b2b09e35
- Sigstore transparency entry: 2141617199
- Sigstore integration time: Jul 11, 2026
Source repository:
- Permalink: Blacksuan19/structx@45292cbcfe35981f74a3b556a8867523e691f20b
- Branch / Tag: refs/tags/0.6.0
- Owner: https://github.com/Blacksuan19
- Access: public
Publication detail:
- Token Issuer: https://token.actions.githubusercontent.com
- Runner Environment: github-hosted
- Publication workflow: publish.yml@45292cbcfe35981f74a3b556a8867523e691f20b
- Trigger Event: push
