Structured data extraction from text using LLMs and dynamic model generation
Project description
structx
Advanced structured data extraction from any document using LLMs with multimodal support.
structx is a powerful Python library for extracting structured data from any
document or text using Large Language Models (LLMs). It features an innovative
multimodal PDF processing pipeline that converts any document to PDF and uses
instructor's vision capabilities for superior extraction quality.
🔔 Package rename notice (PyPI)
The PyPI distribution has been renamed from structx-llm to structx
(September 2025).
- Imports are unchanged: continue using import structx
- Extras are unchanged: structx[docs], structx[pdf], structx[docx]
- Please update your environments and requirement files to use the new name
Upgrade commands:
pip uninstall -y structx-llm
pip install -U structx
If you previously pinned structx-llm in requirements or lock files, replace it
with structx.
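For projects with many requirement files, the replacement can be scripted. The following is a minimal sketch, not part of structx: the helper names `rename_requirement` and `rename_requirements` are hypothetical, and the pattern assumes one requirement per line with the package name at the start of the line.

```python
import re

# Old distribution name at the start of a requirement line, followed by an
# extras bracket, a version operator, a marker, whitespace, or end of line.
_OLD_NAME = re.compile(r"^(\s*)structx-llm(?=$|[\s\[=<>!~;@])", re.IGNORECASE)


def rename_requirement(line: str) -> str:
    """Return the line with a leading 'structx-llm' replaced by 'structx'."""
    return _OLD_NAME.sub(r"\1structx", line)


def rename_requirements(text: str) -> str:
    """Apply the rename to every line of a requirements file's contents."""
    return "\n".join(rename_requirement(line) for line in text.split("\n"))


if __name__ == "__main__":
    sample = "structx-llm[docs]>=0.4\nrequests\n# structx-llm in a comment\n"
    print(rename_requirements(sample))
```

Comments and unrelated packages whose names merely begin with structx-llm are left untouched, because the pattern only matches when the name is followed by a delimiter or the end of the line.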
✨ Key Features
🎯 Advanced Document Processing
- Multimodal PDF Pipeline: Converts any document (TXT, DOCX, etc.) to PDF for optimal extraction
- 🖼️ Vision-Enabled Extraction: Native instructor multimodal support for PDFs and images
- 🔄 Smart Format Detection: Automatic processing mode selection for best results
- 📊 Universal File Support: CSV, Excel, JSON, Parquet, PDF, DOCX, TXT, Markdown, and more
🚀 Intelligent Data Extraction
- 🔄 Dynamic Model Generation: Create type-safe Pydantic models from natural language queries
- 🎯 Automatic Schema Inference: Intelligent schema generation and refinement
- 📊 Complex Data Structures: Support for nested and hierarchical data
- 🔄 Natural Language Refinement: Improve models with conversational instructions
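To illustrate what dynamic model generation means, the sketch below builds a typed record class at runtime from a field specification. It is not structx's implementation: structx generates Pydantic models, whereas this example swaps in the standard library's `dataclasses.make_dataclass` so that it runs without dependencies. The `build_model` helper and the field specification are hypothetical; in practice the specification would be derived from the natural language query.

```python
from dataclasses import asdict, field, make_dataclass
from typing import Optional


def build_model(name: str, spec: dict[str, type]):
    """Create a dataclass whose fields are all optional and default to None."""
    fields = [
        (field_name, Optional[field_type], field(default=None))
        for field_name, field_type in spec.items()
    ]
    return make_dataclass(name, fields)


# Hypothetical field specification for the query
# "extract incident date and details".
spec = {"incident_date": str, "system": str, "cpu_usage_percent": float}
Incident = build_model("Incident", spec)

record = Incident(
    incident_date="2024-01-15",
    system="server-01",
    cpu_usage_percent=92.0,
)
print(asdict(record))
```

Unlike a Pydantic model, a dataclass does not validate field types at runtime, so this shows only the runtime construction step, not the type safety.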
⚡ Performance & Reliability
- 🚀 High-Performance Processing: Multi-threaded and async operations
- 🔄 Robust Error Handling: Automatic retry mechanism with exponential backoff
- 📈 Token Usage Tracking: Detailed step-by-step metrics for cost monitoring
- Flexible Configuration: Configurable extraction using OmegaConf
- 🔌 Multiple LLM Providers: Support through litellm integration
Installation
# Core package with basic extraction capabilities
pip install structx
📄 Enhanced Document Processing (Recommended)
For the best experience with all document types including advanced multimodal PDF processing:
# Complete document processing support
# (quotes keep shells such as zsh from expanding the brackets)
pip install "structx[docs]"
# Individual components
pip install "structx[pdf]"   # PDF processing with multimodal support
pip install "structx[docx]"  # Advanced DOCX conversion via docling
🔧 What Each Extra Provides
- [docs]: Complete multimodal document processing pipeline
  - PDF conversion from any document type
  - Instructor multimodal vision support
  - Advanced DOCX processing via docling
  - Enhanced extraction quality
- [pdf]: PDF-specific processing
  - Multimodal PDF support via instructor
  - PDF generation capabilities
  - Basic PDF text extraction fallback
- [docx]: Advanced DOCX support
  - Document conversion via docling
  - Structure preservation
  - Markdown-based processing pipeline
Quick Start
Basic Text Extraction
from structx import Extractor

# Initialize extractor
extractor = Extractor.from_litellm(
    model="gpt-4o",
    api_key="your-api-key",
    max_retries=3,  # Automatically retry on transient errors
    min_wait=1,     # Start with 1 second wait
    max_wait=10,    # Maximum 10 seconds between retries
)

# Extract from text
result = extractor.extract(
    data="System check on 2024-01-15 detected high CPU usage (92%) on server-01.",
    query="extract incident date and details",
)

# Access results
print(f"Extracted {result.success_count} items")
print(result.data[0].model_dump_json(indent=2))
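The max_retries, min_wait, and max_wait parameters above describe a bounded exponential backoff. The sketch below shows one way such a schedule could be computed and applied. It is an illustration only, not structx's internal retry code: the doubling factor and the helper names `backoff_delays` and `call_with_retries` are assumptions.

```python
import time


def backoff_delays(max_retries: int, min_wait: float, max_wait: float) -> list[float]:
    """Delay before each retry: min_wait doubled per attempt, capped at max_wait."""
    return [min(min_wait * (2 ** attempt), max_wait) for attempt in range(max_retries)]


def call_with_retries(func, max_retries=3, min_wait=1.0, max_wait=10.0, sleep=time.sleep):
    """Call func(); on an exception, wait and retry up to max_retries times."""
    delays = backoff_delays(max_retries, min_wait, max_wait)
    for attempt in range(max_retries + 1):
        try:
            return func()
        except Exception:
            if attempt == max_retries:
                raise  # retries exhausted, surface the last error
            sleep(delays[attempt])


print(backoff_delays(3, 1, 10))  # [1, 2, 4]
```

The `sleep` argument is injectable so the schedule can be checked in tests without actually waiting.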
📄 Document Processing with Multimodal Support
# Process a PDF invoice directly with vision capabilities
result = extractor.extract(
    data="scripts/example_input/S0305SampleInvoice.pdf",  # Direct multimodal processing
    query="extract the invoice number, total amount, and line items",
)

# Convert a DOCX contract and process with multimodal support
result = extractor.extract(
    data="scripts/example_input/free-consultancy-agreement.docx",  # Auto-converted to PDF -> multimodal
    query="extract parties, effective date, and payment terms",
)
📊 Token Usage Monitoring
# Check token usage for cost monitoring
usage = result.get_token_usage()
if usage:
    print(f"Total tokens: {usage.total_tokens}")
    print(f"By step: {[(s.name, s.tokens) for s in usage.steps]}")
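Per-step token counts can be rolled up into a rough cost estimate. The sketch below uses a stand-in `StepUsage` dataclass that mirrors only the name and tokens attributes used above, so it runs without structx or an API key. The step names, the `summarize` helper, and the price per 1,000 tokens are all hypothetical; substitute your provider's actual rates.

```python
from dataclasses import dataclass


@dataclass
class StepUsage:
    """Stand-in for one entry of usage.steps (name and token count only)."""
    name: str
    tokens: int


def summarize(steps, usd_per_1k_tokens: float) -> dict:
    """Total tokens, estimated cost, and each step's share of the total."""
    total = sum(step.tokens for step in steps)
    shares = {step.name: step.tokens / total for step in steps} if total else {}
    return {
        "total_tokens": total,
        "estimated_usd": total / 1000 * usd_per_1k_tokens,
        "share_by_step": shares,
    }


# Hypothetical step names and price, for illustration only.
steps = [StepUsage("schema_generation", 500), StepUsage("extraction", 1500)]
report = summarize(steps, usd_per_1k_tokens=0.01)
print(report)
```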
🚀 Why Multimodal PDF Processing?
The innovative multimodal approach provides significant advantages over traditional text-based extraction:
- 📄 Context Preservation: Full document layout and structure are maintained
- 🎯 Higher Accuracy: Vision models can interpret tables, charts, and complex layouts
- 🔄 No Chunking Issues: Eliminates problems with information split across chunks
- 📊 Universal Format: Any document type becomes processable through PDF conversion
- 🖼️ Visual Understanding: Handles documents with visual elements, formatting, and structure
📚 Documentation
For comprehensive documentation, examples, and guides, visit our documentation site.
- Getting Started
- Basic Extraction
- Unstructured Text Processing
- Async Operations
- Multiple Queries
- Custom Models
- Token Usage Tracking
- API Reference
Examples
Check out our example gallery for real-world use cases.
📁 Supported File Formats
📊 Structured Data (Direct Processing)
- CSV: Comma-separated values with custom delimiters
- Excel: .xlsx/.xls with sheet selection and custom options
- JSON: JavaScript Object Notation with nested support
- Parquet: Columnar storage format for large datasets
- Feather: Fast binary format for data frames
📄 Unstructured Documents (Multimodal Pipeline)
| Format | Extensions | Processing Method | Quality |
|---|---|---|---|
| PDF | .pdf | Direct multimodal processing | ⭐⭐⭐⭐⭐ |
| Word | .docx, .doc | Docling → Markdown → PDF → Multimodal | ⭐⭐⭐⭐⭐ |
| Text | .txt, .md, .py, .log, .xml, .html | Styled PDF → Multimodal | ⭐⭐⭐⭐ |
🔄 Processing Modes
- Multimodal PDF (default): Best quality, preserves layout and context
- Simple Text: Fallback mode with chunking for memory-constrained environments
- Simple PDF: Basic PDF text extraction without vision capabilities
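The routing described in this section can be pictured as a lookup on the file extension. The sketch below is illustrative only and does not reflect how structx performs format detection internally. The `pick_route` function is hypothetical, and the extensions for CSV, JSON, Parquet, and Feather are assumed from common convention rather than taken from the lists above.

```python
from pathlib import Path

# Extension groups; the structured-data extensions other than .xlsx/.xls
# are assumptions based on common file naming.
STRUCTURED = {".csv", ".xlsx", ".xls", ".json", ".parquet", ".feather"}
WORD = {".docx", ".doc"}
TEXT = {".txt", ".md", ".py", ".log", ".xml", ".html"}


def pick_route(path: str) -> str:
    """Return a label for the processing route implied by the file extension."""
    ext = Path(path).suffix.lower()
    if ext in STRUCTURED:
        return "direct structured processing"
    if ext == ".pdf":
        return "direct multimodal processing"
    if ext in WORD:
        return "docling -> markdown -> pdf -> multimodal"
    if ext in TEXT:
        return "styled pdf -> multimodal"
    raise ValueError(f"unsupported file extension: {ext!r}")


for name in ("invoice.pdf", "contract.docx", "notes.md", "sales.csv"):
    print(name, "->", pick_route(name))
```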
Contributing
Contributions are welcome! Please read our Contributing Guidelines for details.
License
This project is licensed under the MIT License - see the LICENSE file for details.
Download files
File details
Details for the file structx-0.4.10.tar.gz.
File metadata
- Download URL: structx-0.4.10.tar.gz
- Upload date:
- Size: 38.8 kB
- Tags: Source
- Uploaded using Trusted Publishing? Yes
- Uploaded via: twine/6.1.0 CPython/3.13.7
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | 133e647056c5c3211d96c745218cefd9956d035efa62d6a577e8be775cb00104 |
| MD5 | 87139215c3878cb234e881ab933c192f |
| BLAKE2b-256 | b20a2901f0678a2de03f492a1a8e852b5f7cfb1b01237ceed3c987a4a0778043 |
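A downloaded file can be checked against the SHA256 digest listed above with the standard library's `hashlib`. This is a minimal sketch: the helper names `sha256_of` and `matches` are hypothetical, and the file path is whatever location you downloaded the archive to.

```python
import hashlib


def sha256_of(path: str, chunk_size: int = 1 << 20) -> str:
    """Return the hex SHA256 digest of a file, read in chunks to bound memory use."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches(path: str, expected_hex: str) -> bool:
    """Compare a file's SHA256 digest with an expected hex string."""
    return sha256_of(path) == expected_hex.strip().lower()
```

For example, `matches("structx-0.4.10.tar.gz", "133e647056c5c3211d96c745218cefd9956d035efa62d6a577e8be775cb00104")` should return True for an intact copy of the source distribution.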
Provenance
The following attestation bundles were made for structx-0.4.10.tar.gz:
Publisher: publish.yml on Blacksuan19/structx
- Statement:
  - Statement type: https://in-toto.io/Statement/v1
  - Predicate type: https://docs.pypi.org/attestations/publish/v1
  - Subject name: structx-0.4.10.tar.gz
  - Subject digest: 133e647056c5c3211d96c745218cefd9956d035efa62d6a577e8be775cb00104
  - Sigstore transparency entry: 554327837
  - Sigstore integration time:
- Permalink: Blacksuan19/structx@44c2460dddf4886d18dc84559a78c9b4cb2bd7a2
- Branch / Tag: refs/tags/0.4.10
- Owner: https://github.com/Blacksuan19
- Access: public
- Token Issuer: https://token.actions.githubusercontent.com
- Runner Environment: github-hosted
- Publication workflow: publish.yml@44c2460dddf4886d18dc84559a78c9b4cb2bd7a2
- Trigger Event: push
File details
Details for the file structx-0.4.10-py3-none-any.whl.
File metadata
- Download URL: structx-0.4.10-py3-none-any.whl
- Upload date:
- Size: 42.6 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? Yes
- Uploaded via: twine/6.1.0 CPython/3.13.7
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | dee70b0da655f0d76dffd3f040360706459d5e924e95d540113fc8083bf5dc2e |
| MD5 | f346f3eba098a3dd879d98cb650c1450 |
| BLAKE2b-256 | 4f88929eb3fb50a5dd6dbd1fe35df398cae3320d82e208ef88d46198f2ffc313 |
Provenance
The following attestation bundles were made for structx-0.4.10-py3-none-any.whl:
Publisher: publish.yml on Blacksuan19/structx
- Statement:
  - Statement type: https://in-toto.io/Statement/v1
  - Predicate type: https://docs.pypi.org/attestations/publish/v1
  - Subject name: structx-0.4.10-py3-none-any.whl
  - Subject digest: dee70b0da655f0d76dffd3f040360706459d5e924e95d540113fc8083bf5dc2e
  - Sigstore transparency entry: 554327852
  - Sigstore integration time:
- Permalink: Blacksuan19/structx@44c2460dddf4886d18dc84559a78c9b4cb2bd7a2
- Branch / Tag: refs/tags/0.4.10
- Owner: https://github.com/Blacksuan19
- Access: public
- Token Issuer: https://token.actions.githubusercontent.com
- Runner Environment: github-hosted
- Publication workflow: publish.yml@44c2460dddf4886d18dc84559a78c9b4cb2bd7a2
- Trigger Event: push