Structured data extraction from text using LLMs and dynamic model generation

structx

Advanced structured data extraction from any document using LLMs with multimodal support.

structx is a Python library for extracting structured data from documents or text using Large Language Models (LLMs). Its multimodal pipeline converts documents to PDF and passes them to instructor's vision capabilities, so that layout, tables, and other visual structure remain available to the model during extraction.

✨ Key Features

🎯 Advanced Document Processing

📄 Multimodal PDF Pipeline: Converts any document (TXT, DOCX, etc.) to PDF for optimal extraction
🖼️ Vision-Enabled Extraction: Native instructor multimodal support for PDFs and images
🔄 Smart Format Detection: Automatic processing mode selection for best results
📊 Universal File Support: CSV, Excel, JSON, Parquet, PDF, DOCX, TXT, Markdown, and more

🚀 Intelligent Data Extraction

🔄 Dynamic Model Generation: Create type-safe Pydantic models from natural language queries
🎯 Automatic Schema Inference: Intelligent schema generation and refinement
📊 Complex Data Structures: Support for nested and hierarchical data
🔄 Natural Language Refinement: Improve models with conversational instructions

⚡ Performance & Reliability

🚀 High-Performance Processing: Multi-threaded and async operations
🔄 Robust Error Handling: Automatic retry mechanism with exponential backoff
📈 Token Usage Tracking: Detailed step-by-step metrics for cost monitoring
🔧 Flexible Configuration: Configurable extraction using OmegaConf
🔌 Multiple LLM Providers: Support through litellm integration
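The retry behaviour listed above can be pictured with a small, generic exponential-backoff loop. This is a sketch of the general technique, not structx's internal code: `retry_with_backoff` is a hypothetical helper, and its parameter names only mirror the `max_retries`, `min_wait`, and `max_wait` options; the doubling schedule is an assumption.

```python
import time


def retry_with_backoff(func, max_retries=3, min_wait=1.0, max_wait=10.0, sleep=time.sleep):
    """Call func(); after each failure wait min_wait * 2**attempt seconds, capped at max_wait."""
    for attempt in range(max_retries + 1):
        try:
            return func()
        except Exception:
            if attempt == max_retries:
                raise  # out of retries: surface the last error
            sleep(min(max_wait, min_wait * 2 ** attempt))


# Usage: a function that fails twice, then succeeds.
calls = {"n": 0}


def flaky():
    calls["n"] += 1
    if calls["n"] < 3:
        raise ConnectionError("transient failure")
    return "ok"


waits = []  # record the waits instead of actually sleeping
result = retry_with_backoff(flaky, sleep=waits.append)
print(result, waits)
```

Passing `sleep` as a parameter keeps the example fast to run and makes the wait schedule easy to inspect.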

Installation

# Core package with basic extraction capabilities
pip install structx-llm

📄 Enhanced Document Processing (Recommended)

For the best experience with all document types including advanced multimodal PDF processing:

# Complete document processing support
# (quotes keep shells such as zsh from treating the brackets as a glob pattern)
pip install "structx-llm[docs]"

# Individual components
pip install "structx-llm[pdf]"   # PDF processing with multimodal support
pip install "structx-llm[docx]"  # Advanced DOCX conversion via docling

🔧 What Each Extra Provides

[docs]: Complete multimodal document processing pipeline
- PDF conversion from any document type
- Instructor multimodal vision support
- Advanced DOCX processing via docling
- Enhanced extraction quality
[pdf]: PDF-specific processing
- Multimodal PDF support via instructor
- PDF generation capabilities
- Basic PDF text extraction fallback
[docx]: Advanced DOCX support
- Document conversion via docling
- Structure preservation
- Markdown-based processing pipeline

Quick Start

Basic Text Extraction

from structx import Extractor

# Initialize extractor
extractor = Extractor.from_litellm(
    model="gpt-4o-mini",
    api_key="your-api-key",
    max_retries=3,      # Automatically retry on transient errors
    min_wait=1,         # Start with 1 second wait
    max_wait=10         # Maximum 10 seconds between retries
)

# Extract from text
result = extractor.extract(
    data="System check on 2024-01-15 detected high CPU usage (92%) on server-01.",
    query="extract incident date and details"
)

# Access results
print(f"Extracted {result.success_count} items")
print(result.data[0].model_dump_json(indent=2))

📄 Document Processing with Multimodal Support

# Process PDF documents directly with vision capabilities
result = extractor.extract(
    data="financial_report.pdf",      # Direct multimodal processing
    query="extract revenue figures, profit margins, and key financial metrics"
)

# Convert DOCX to PDF and process with multimodal support
result = extractor.extract(
    data="contract.docx",             # Auto-converted via docling → PDF → multimodal
    query="extract parties, dates, payment terms, and key obligations"
)

# Process any text file with enhanced PDF conversion
result = extractor.extract(
    data="meeting_notes.txt",         # Converted to styled PDF → multimodal
    query="extract action items, deadlines, and responsible parties"
)

📊 Token Usage Monitoring

# Check token usage for cost monitoring
usage = result.get_token_usage()
if usage:
    print(f"Total tokens: {usage.total_tokens}")
    print(f"By step: {[(s.name, s.tokens) for s in usage.steps]}")
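The per-step figures can be turned into a rough cost estimate. The sketch below does not call structx: `StepUsage` and `Usage` are hypothetical stand-ins that mirror only the attributes used in the snippet above (`total_tokens`, `steps`, `name`, `tokens`), and the step names, token counts, and price are invented for illustration.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class StepUsage:
    name: str
    tokens: int


@dataclass
class Usage:
    total_tokens: int
    steps: List[StepUsage]


# Hypothetical flat price in USD per 1,000 tokens; real pricing varies by model.
PRICE_PER_1K_TOKENS = 0.0005


def cost_by_step(usage: Usage, price_per_1k: float = PRICE_PER_1K_TOKENS) -> Dict[str, float]:
    """Return the estimated cost of each step, keyed by step name."""
    return {step.name: step.tokens / 1000 * price_per_1k for step in usage.steps}


# Made-up numbers standing in for a real extraction run.
usage = Usage(
    total_tokens=5000,
    steps=[StepUsage("schema_generation", 1200), StepUsage("extraction", 3800)],
)
costs = cost_by_step(usage)
for name, cost in costs.items():
    print(f"{name}: ${cost:.6f}")
print(f"total: ${sum(costs.values()):.6f}")
```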

🚀 Why Multimodal PDF Processing?

Compared with text-only extraction, the multimodal approach has the following advantages:

📄 Context Preservation: Full document layout and structure are maintained
🎯 Higher Accuracy: Vision models can interpret tables, charts, and complex layouts
🔄 No Chunking Issues: Eliminates problems with information split across chunks
📊 Universal Format: Any document type becomes processable through PDF conversion
🖼️ Visual Understanding: Handles documents with visual elements, formatting, and structure
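The chunking problem mentioned above is easy to reproduce. The following sketch uses a naive fixed-size splitter (a hypothetical `chunk` helper, not the splitter structx uses in its fallback mode) to show how a single fact can end up divided between two chunks, so that neither chunk contains it in full.

```python
def chunk(text, size):
    """Split text into consecutive pieces of at most `size` characters."""
    return [text[i:i + size] for i in range(0, len(text), size)]


text = "System check on 2024-01-15 detected high CPU usage (92%) on server-01."
fact = "2024-01-15"

chunks = chunk(text, 20)
for piece in chunks:
    print(repr(piece))

# The date is present in the full text but in none of the individual chunks.
print(fact in text)
print(any(fact in piece for piece in chunks))
```

Here the first chunk ends with `2024` and the second begins with `-01-15`, so a model that sees the chunks one at a time never sees the complete date.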

📚 Documentation

For comprehensive documentation, examples, and guides, visit our documentation site.

Examples

Check out our example gallery for real-world use cases.

📁 Supported File Formats

📊 Structured Data (Direct Processing)

CSV: Comma-separated values with custom delimiters
Excel: .xlsx/.xls with sheet selection and custom options
JSON: JavaScript Object Notation with nested support
Parquet: Columnar storage format for large datasets
Feather: Fast binary format for data frames

📄 Unstructured Documents (Multimodal Pipeline)

Format	Extensions	Processing Method	Quality
PDF	`.pdf`	Direct multimodal processing	⭐⭐⭐⭐⭐
Word	`.docx`, `.doc`	Docling → Markdown → PDF → Multimodal	⭐⭐⭐⭐⭐
Text	`.txt`, `.md`, `.py`, `.log`, `.xml`, `.html`	Styled PDF → Multimodal	⭐⭐⭐⭐
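The routing described in the lists above can be summarised as a lookup by file extension. This is an illustrative sketch, not structx's detection logic: `processing_method` is a hypothetical function, and apart from `.xlsx`/`.xls` the structured-data extensions are conventional ones assumed here.

```python
from pathlib import Path

# Structured formats are processed directly. Only .xlsx/.xls are named in the
# list above; the remaining extensions are assumptions for this sketch.
STRUCTURED = {".csv", ".xlsx", ".xls", ".json", ".parquet", ".feather"}

# Processing method per extension, copied from the table above.
TEXT_EXTENSIONS = (".txt", ".md", ".py", ".log", ".xml", ".html")
UNSTRUCTURED = {
    ".pdf": "Direct multimodal processing",
    ".docx": "Docling -> Markdown -> PDF -> Multimodal",
    ".doc": "Docling -> Markdown -> PDF -> Multimodal",
    **{ext: "Styled PDF -> Multimodal" for ext in TEXT_EXTENSIONS},
}


def processing_method(path: str) -> str:
    """Return the processing route for a file, based on its extension only."""
    ext = Path(path).suffix.lower()
    if ext in STRUCTURED:
        return "Direct processing"
    return UNSTRUCTURED.get(ext, "Unknown format")


for name in ("financial_report.pdf", "contract.docx", "meeting_notes.txt", "sales.csv"):
    print(f"{name}: {processing_method(name)}")
```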

🔄 Processing Modes

Multimodal PDF (default): Best quality, preserves layout and context
Simple Text: Fallback mode with chunking for memory-constrained environments
Simple PDF: Basic PDF text extraction without vision capabilities

Contributing

Contributions are welcome! Please read our Contributing Guidelines for details.

License

This project is licensed under the MIT License - see the LICENSE file for details.

Release history

Version	Release date
0.4.9	Sep 24, 2025
0.4.4	Aug 1, 2025
0.4.3	Aug 1, 2025
0.4.2	Aug 1, 2025
0.4.1	Aug 1, 2025
0.4.0 (this version)	Jul 31, 2025
0.3.1	May 18, 2025
0.3.0	May 18, 2025
0.2.28	May 18, 2025
0.2.27	May 8, 2025
0.2.26	Apr 5, 2025
0.2.25	Apr 5, 2025
0.2.24	Apr 4, 2025
0.2.23	Apr 4, 2025
0.2.22	Apr 4, 2025
0.2.21	Mar 10, 2025
0.2.20	Mar 10, 2025
0.2.19	Mar 5, 2025
0.2.18	Mar 5, 2025
0.2.17	Mar 5, 2025
0.2.16	Mar 5, 2025
0.2.15	Mar 5, 2025
0.2.14	Mar 4, 2025
0.2.12	Mar 4, 2025

Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

structx_llm-0.4.0.tar.gz (39.4 kB view details)

Uploaded Jul 31, 2025 Source

Built Distribution

structx_llm-0.4.0-py3-none-any.whl (44.1 kB view details)

Uploaded Jul 31, 2025 Python 3
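The wheel's file name encodes its compatibility tags. As a small sketch, the hypothetical helper below splits a wheel file name into the components defined by the wheel naming convention (distribution, version, optional build tag, Python tag, ABI tag, platform tag).

```python
def parse_wheel_filename(filename):
    """Split a wheel file name into its dash-separated components."""
    if not filename.endswith(".whl"):
        raise ValueError(f"not a wheel file name: {filename!r}")
    parts = filename[: -len(".whl")].split("-")
    if len(parts) == 5:
        distribution, version, python_tag, abi_tag, platform_tag = parts
        build_tag = None
    elif len(parts) == 6:
        distribution, version, build_tag, python_tag, abi_tag, platform_tag = parts
    else:
        raise ValueError(f"unexpected number of components in {filename!r}")
    return {
        "distribution": distribution,
        "version": version,
        "build_tag": build_tag,
        "python_tag": python_tag,
        "abi_tag": abi_tag,
        "platform_tag": platform_tag,
    }


info = parse_wheel_filename("structx_llm-0.4.0-py3-none-any.whl")
print(info)
```

For this file the tags `py3`, `none`, and `any` indicate a pure-Python wheel that is not tied to a specific ABI or platform.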

File details

Details for the file structx_llm-0.4.0.tar.gz.

File metadata

Download URL: structx_llm-0.4.0.tar.gz
Upload date: Jul 31, 2025
Size: 39.4 kB
Tags: Source
Uploaded using Trusted Publishing? Yes
Uploaded via: twine/6.1.0 CPython/3.12.9

File hashes

Hashes for structx_llm-0.4.0.tar.gz
Algorithm	Hash digest
SHA256	`dfca407a3588b05d0d23a883611eea61090e46da20e3248b17036d383b608cfb`
MD5	`f1870f6c3ee10533f8debc6e3445e950`
BLAKE2b-256	`bf69be4325b84caef77ac699083b50cf02f1b9c21464262c3ea38114af1e3f11`

See more details on using hashes here.
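A downloaded file can be checked against the SHA256 digest listed above with Python's standard `hashlib` module. This is a minimal sketch: `sha256_of` and `verify` are hypothetical helper names, and the local path assumes the archive was saved in the current directory.

```python
import hashlib
import os


def sha256_of(path, chunk_size=65536):
    """Return the hex SHA256 digest of a file, reading it in blocks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for block in iter(lambda: handle.read(chunk_size), b""):
            digest.update(block)
    return digest.hexdigest()


def verify(path, expected_hex):
    """Return True if the file's SHA256 digest matches the expected value."""
    return sha256_of(path) == expected_hex.strip().lower()


# SHA256 digest copied from the table above.
EXPECTED = "dfca407a3588b05d0d23a883611eea61090e46da20e3248b17036d383b608cfb"
ARCHIVE = "structx_llm-0.4.0.tar.gz"  # assumed download location

if os.path.exists(ARCHIVE):
    print("hash matches" if verify(ARCHIVE, EXPECTED) else "HASH MISMATCH")
else:
    print(f"{ARCHIVE} not found; download it first")
```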

Provenance

The following attestation bundles were made for structx_llm-0.4.0.tar.gz:

Publisher: publish.yml on Blacksuan19/structx

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.

Statement:
- Statement type: https://in-toto.io/Statement/v1
- Predicate type: https://docs.pypi.org/attestations/publish/v1
- Subject name: structx_llm-0.4.0.tar.gz
- Subject digest: dfca407a3588b05d0d23a883611eea61090e46da20e3248b17036d383b608cfb
- Sigstore transparency entry: 337870527
- Sigstore integration time: Jul 31, 2025
Source repository:
- Permalink: Blacksuan19/structx@d5f02560667420cb2a74a8e58e1c66db440b4992
- Branch / Tag: refs/tags/0.4.0
- Owner: https://github.com/Blacksuan19
- Access: public
Publication detail:
- Token Issuer: https://token.actions.githubusercontent.com
- Runner Environment: github-hosted
- Publication workflow: publish.yml@d5f02560667420cb2a74a8e58e1c66db440b4992
- Trigger Event: push

File details

Details for the file structx_llm-0.4.0-py3-none-any.whl.

File metadata

Download URL: structx_llm-0.4.0-py3-none-any.whl
Upload date: Jul 31, 2025
Size: 44.1 kB
Tags: Python 3
Uploaded using Trusted Publishing? Yes
Uploaded via: twine/6.1.0 CPython/3.12.9

File hashes

Hashes for structx_llm-0.4.0-py3-none-any.whl
Algorithm	Hash digest
SHA256	`33c1ea7b74b7da7a8944a6f61f06c694d01336a62b29535b670ada063c5d9b43`
MD5	`5cf314e0873f22c6bca3e9eccaa4be41`
BLAKE2b-256	`c31b9c734366f3e90e8eace874b3f4f74fbc25c65b41b72e9f1866678f70eeae`

See more details on using hashes here.

Provenance

The following attestation bundles were made for structx_llm-0.4.0-py3-none-any.whl:

Publisher: publish.yml on Blacksuan19/structx

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.

Statement:
- Statement type: https://in-toto.io/Statement/v1
- Predicate type: https://docs.pypi.org/attestations/publish/v1
- Subject name: structx_llm-0.4.0-py3-none-any.whl
- Subject digest: 33c1ea7b74b7da7a8944a6f61f06c694d01336a62b29535b670ada063c5d9b43
- Sigstore transparency entry: 337870539
- Sigstore integration time: Jul 31, 2025
Source repository:
- Permalink: Blacksuan19/structx@d5f02560667420cb2a74a8e58e1c66db440b4992
- Branch / Tag: refs/tags/0.4.0
- Owner: https://github.com/Blacksuan19
- Access: public
Publication detail:
- Token Issuer: https://token.actions.githubusercontent.com
- Runner Environment: github-hosted
- Publication workflow: publish.yml@d5f02560667420cb2a74a8e58e1c66db440b4992
- Trigger Event: push
