LLM-powered structured information extraction using DSPy optimization
LangStruct
Extract structured data from any text, no prompt engineering required
TL;DR: Extract structured information from any text - documents, emails, reports, transcripts - into clean JSON data. No prompt engineering required. Built on DSPy 3.0 for automatic optimization.
LangStruct turns messy, unstructured text into clean, typed, validated data. Whether you're processing medical records, financial documents, customer feedback, or legal contracts, LangStruct extracts exactly what you need with source tracking and confidence scores.
What LangStruct Does
LangStruct extracts structured information from unstructured text:
Input (messy text):
"Dr. Smith diagnosed the 45-year-old patient with Type 2 diabetes and prescribed metformin 500mg twice daily. Follow-up in 3 months."
Output (clean data):
{
  "physician": "Dr. Smith",
  "patient_age": 45,
  "diagnosis": "Type 2 diabetes",
  "medication": "metformin",
  "dosage": "500mg",
  "frequency": "twice daily",
  "followup": "3 months"
}
Key Features
Core Capabilities
- Automatic Optimization: Uses DSPy MIPROv2 for prompt optimization
- Refinement System: Best-of-N + iterative improvement for 15-30% accuracy boost
- Source Tracking: Character-level mapping of extracted data to source text
- Schema Generation: Create Pydantic schemas from examples
- Type Safety: Full Pydantic validation and type hints
- Model Support: Compatible with OpenAI, Anthropic, Google, Ollama, and other LLMs
- Persistence: Save and load extractors with full state preservation
- Visualization: HTML output with source highlighting
Quick Example
Warning: API Key Required. You need an API key to run LangStruct; see the setup options under Quick Start below.
from langstruct import LangStruct
# Define what you want to extract with a simple example
extractor = LangStruct(example={
    "invoice_number": "INV-001",
    "amount": 1250.00,
    "due_date": "2024-03-15",
    "line_items": ["Widget A", "Service B"]
})
# Extract from any text
text = """
Dear Customer,
Your invoice INV-2024-789 for $3,450.00 is due on April 20th, 2024.
Items:
- Premium Widget Set
- Installation Service
- Extended Warranty
Thank you for your business!
"""
result = extractor.extract(text)
print(result.entities)
# {
# "invoice_number": "INV-2024-789",
# "amount": 3450.00,
# "due_date": "2024-04-20",
# "line_items": ["Premium Widget Set", "Installation Service", "Extended Warranty"]
# }
# Boost accuracy with refinement (15-30% improvement)
result = extractor.extract(text, refine=True)
print(f"Confidence: {result.confidence:.1%}") # Higher confidence score
Quick Start
1. Get an API Key (Required)
Choose one option:
| Provider | Get Key | Best For |
|---|---|---|
| Google Gemini | Get Free Key → | Fast & generous free tier |
| OpenAI | Get Key → | GPT models |
| Anthropic | Get Key → | Claude models |
| Local (Ollama) | Install Ollama → | Privacy, no API needed |
Set your API key:
# Google Gemini (free)
export GOOGLE_API_KEY="your-key-here"
# Or use others:
export OPENAI_API_KEY="your-key-here"
export ANTHROPIC_API_KEY="your-key-here"
2. Installation
Install from PyPI:
# uv (recommended)
uv add langstruct
# or pip
pip install langstruct
# Optional extras
# Quote the extras so shells such as zsh do not expand the brackets
pip install "langstruct[viz]"       # Visualization tools (HTML helpers)
pip install "langstruct[examples]"  # Example integrations (ChromaDB, LangChain)
pip install "langstruct[parallel]"  # tqdm for nicer progress bars
pip install "langstruct[dev]"       # Test and lint toolchain
pip install "langstruct[all]"       # Everything above
3. Basic Usage
from langstruct import LangStruct
# Create an extractor from an example (simplest approach)
extractor = LangStruct(example={
    "name": "Dr. Sarah Johnson",
    "age": 34,
    "location": "Cambridge, Massachusetts",
    "occupation": "cardiologist"
})
# Extract structured data from text
text = """
Dr. Sarah Johnson is a 34-year-old cardiologist working at Boston General Hospital.
She currently lives in Cambridge, Massachusetts, with her family.
"""
result = extractor.extract(text)
print(result.entities)
# Output: {
# "name": "Dr. Sarah Johnson",
# "age": 34,
# "location": "Cambridge, Massachusetts",
# "occupation": "cardiologist"
# }
print(f"Confidence: {result.confidence:.2%}")
# Output: Confidence: 94.00%
That's it! LangStruct automatically handles schema generation, optimization, and source tracking.
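How an example dictionary becomes a typed schema is not spelled out here. The following is a minimal sketch of the idea using only the standard library: infer a type for each field from its example value and build a class from the result. `infer_type` and `schema_from_example` are hypothetical helpers, not LangStruct functions, and the sketch builds a dataclass where LangStruct generates Pydantic models.

```python
from dataclasses import fields, make_dataclass
from typing import Any, List


def infer_type(value: Any) -> Any:
    """Map an example value to a type annotation (hypothetical rule set)."""
    if isinstance(value, bool):  # check bool first: bool is a subclass of int
        return bool
    if isinstance(value, int):
        return int
    if isinstance(value, float):
        return float
    if isinstance(value, list):
        # Guess the item type from the first element; fall back to Any.
        return List[infer_type(value[0])] if value else List[Any]
    return str


def schema_from_example(name: str, example: dict) -> type:
    """Build a dataclass whose fields mirror the keys of the example."""
    return make_dataclass(name, [(key, infer_type(val)) for key, val in example.items()])


Invoice = schema_from_example("Invoice", {
    "invoice_number": "INV-001",
    "amount": 1250.00,
    "line_items": ["Widget A", "Service B"],
})
print({f.name: f.type for f in fields(Invoice)})
```

A single example cannot reveal optional fields or value constraints, which is one reason to define a schema explicitly when you need that control.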
Common Applications
1. Data Pipeline Automation
Extract structured data from documents for databases, analytics, or APIs:
# Process invoices, receipts, reports, emails
invoice_data = extractor.extract(invoice_pdf_text)
# → {"invoice_no": "INV-2024-001", "amount": 5420.00, "due_date": "2024-03-15"}
2. Content Analysis & Research
Analyze transcripts, reviews, surveys, or social media:
# Extract insights from customer feedback
feedback = extractor.extract(review_text)
# → {"sentiment": "positive", "product_issues": [], "feature_requests": ["dark mode"]}
3. Compliance & Validation
Extract and validate required information from legal or regulatory documents:
# Check contracts for specific clauses
contract_data = extractor.extract(contract_text)
# → {"term_length": "2 years", "termination_clause": True, "liability_cap": 1000000}
RAG System Enhancement
Transform your RAG system from simple search to intelligent retrieval:
Note: LangStruct enhances ANY vector database or search system (Pinecone, Weaviate, Elasticsearch, etc.).
1. Document → Structured Metadata
# Extract structured metadata from documents
extractor = LangStruct(example={
    "company": "Apple Inc.",
    "revenue": 125.3,
    "quarter": "Q3 2024"
})
metadata = extractor.extract(document).entities
# Now your documents have precise, filterable metadata
2. Query → Structured Filters
from langstruct import LangStruct
# Parse natural language queries into filters
# same_schema stands for the example dict passed to the extractor in step 1
ls = LangStruct(example=same_schema)  # Same schema as extraction!
query = "Show me Q3 2024 tech companies with revenue over $100B discussing AI investments"
parsed = ls.query(query)
print(parsed.semantic_terms)
# ["tech companies", "AI investments", "artificial intelligence"]
print(parsed.structured_filters)
# {"quarter": "Q3 2024", "revenue": {"$gte": 100.0}}
3. Precise Retrieval
# Combine semantic search with exact filters
rag_results = vector_store.similarity_search(
    query=' '.join(parsed.semantic_terms),  # Semantic search
    where=parsed.structured_filters         # Exact filters
)
# Returns only docs matching BOTH semantic AND structural requirements
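To make the filter step concrete, here is a minimal sketch of how a filter dictionary such as `{"quarter": "Q3 2024", "revenue": {"$gte": 100.0}}` could be applied to extracted metadata in plain Python. `matches_filters` is a hypothetical helper, and only the `$gte` operator appears in the example above; the other operators are assumed for illustration. In practice the vector store evaluates these filters itself.

```python
from typing import Any, Dict

# Comparison operators in the style of the "$gte" filter shown above.
# Only "$gte" appears in the source; the rest are assumptions.
OPERATORS = {
    "$gte": lambda value, bound: value >= bound,
    "$lte": lambda value, bound: value <= bound,
    "$gt": lambda value, bound: value > bound,
    "$lt": lambda value, bound: value < bound,
}


def matches_filters(metadata: Dict[str, Any], filters: Dict[str, Any]) -> bool:
    """Return True if a document's metadata satisfies every filter."""
    for field, condition in filters.items():
        if field not in metadata:
            return False
        value = metadata[field]
        if isinstance(condition, dict):
            # Operator form, e.g. {"$gte": 100.0}
            if not all(OPERATORS[op](value, bound) for op, bound in condition.items()):
                return False
        elif value != condition:
            # Plain form, e.g. "Q3 2024": exact match
            return False
    return True


docs = [
    {"company": "Apple Inc.", "revenue": 125.3, "quarter": "Q3 2024"},
    {"company": "Small Co.", "revenue": 2.1, "quarter": "Q3 2024"},
    {"company": "Big Co.", "revenue": 140.0, "quarter": "Q2 2024"},
]
filters = {"quarter": "Q3 2024", "revenue": {"$gte": 100.0}}
matching = [d["company"] for d in docs if matches_filters(d, filters)]
print(matching)  # ['Apple Inc.']
```

Because documents and queries share one schema, a filter key always refers to a metadata field that extraction can actually produce.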
Why RAG + LangStruct?
Traditional RAG systems struggle with structured requirements. LangStruct solves this:
| Query | Traditional RAG | With LangStruct |
|---|---|---|
| "invoices over $10k from Q3" | Returns any document with "invoice" OR "Q3" | Returns ONLY invoices >$10k from Q3 |
| "patients over 65 with diabetes" | Returns any medical document | Returns ONLY matching patient records |
| "contracts expiring in 2024" | Returns any contract | Returns ONLY 2024 expirations |
See our complete RAG integration guide for implementation.
Where LangStruct Excels
Perfect for:
- Document Processing: Invoices, reports, forms, emails
- Healthcare: Medical records, clinical notes, lab results
- Financial: Statements, filings, contracts, reports
- Legal: Contracts, agreements, regulations, cases
- Research: Papers, patents, technical documentation
- Customer Data: Reviews, feedback, support tickets
Key Advantages:
- No prompt engineering: DSPy handles optimization automatically
- Type safety: Pydantic schemas with full validation
- Source grounding: Know exactly where each field came from
- Confidence scores: Understand extraction reliability
- Model agnostic: Works with any LLM provider
Comparison with Alternatives
LangStruct vs LangExtract
Both are excellent tools for structured extraction with different strengths:
| Feature | LangStruct | LangExtract |
|---|---|---|
| Optimization | Automatic (DSPy MIPROv2) | Manual prompt tuning |
| Refinement | Best-of-N + iterative improvement | Multi-pass extraction; no Best-of-N/judge pipeline |
| Schema Definition | From examples OR Pydantic | Prompt + examples (no Pydantic models) |
| Source Grounding | Character-level tracking | Character-level tracking |
| Confidence Scores | Built-in | Not surfaced as scores |
| Query Parsing | Bidirectional (docs + queries) | Documents only |
| Model Support | Any LLM (via DSPy/LiteLLM) | Gemini, OpenAI, local via Ollama; extensible |
| Learning Curve | Simple (example-based) | Requires prompt + example design |
| Performance | Self-optimizing | Depends on manual tuning |
| Project Type | Community open-source | Google open-source |
Note: Comparison verified on 2025-09-10 against the latest LangExtract README and examples. See LangExtract: https://github.com/google/langextract and example walkthroughs (e.g., longer text extraction): https://github.com/google/langextract/blob/main/docs/examples/longer_text_example.md
Choose LangStruct if you want:
- Automatic optimization without prompt engineering
- Best-of-N refinement for higher accuracy
- Flexibility to define schemas from examples
- Query parsing for RAG systems
- Confidence scores for extraction quality
- Support for any LLM provider
Choose LangExtract if you prefer:
- Direct control over prompts
- Google's backing and support
- Simpler architecture without DSPy
Getting Started
Once you're comfortable with the basics, you can:
Define Custom Schemas for more control:
from pydantic import BaseModel, Field
from langstruct import LangStruct
class PersonSchema(BaseModel):
    name: str = Field(description="Full name of the person")
    age: int = Field(description="Age in years")
    location: str = Field(description="Current location")

extractor = LangStruct(schema=PersonSchema)
Process Multiple Documents at once:
documents = [doc1, doc2, doc3]
results = extractor.extract(documents) # Handles batch processing automatically
Save and Load Extractors for reuse:
# Save an optimized extractor (preserves all state)
extractor.save("./my_extractor")
# Load anywhere (API keys must be available in environment)
loaded_extractor = LangStruct.load("./my_extractor")
result = loaded_extractor.extract("New text")
View Source Locations to see where data came from:
for field, spans in result.sources.items():
    for span in spans:
        print(f"{field}: '{span.text}' at chars {span.start}-{span.end}")
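To illustrate what character-level source spans look like, the sketch below locates each extracted value in the source with plain string search. `Span` and `locate` are hypothetical names that mirror the `text`, `start`, and `end` attributes used above; LangStruct's own grounding logic is not shown in this README and may work differently, for example by handling paraphrased or normalized values.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class Span:
    text: str
    start: int  # inclusive character offset
    end: int    # exclusive character offset


def locate(source: str, entities: Dict[str, str]) -> Dict[str, List[Span]]:
    """Find every verbatim occurrence of each extracted value in the source."""
    sources: Dict[str, List[Span]] = {}
    for field, value in entities.items():
        spans = []
        start = source.find(value)
        while start != -1:
            spans.append(Span(value, start, start + len(value)))
            start = source.find(value, start + 1)
        sources[field] = spans
    return sources


text = "Dr. Smith prescribed metformin 500mg twice daily."
found = locate(text, {"physician": "Dr. Smith", "medication": "metformin"})
for field, spans in found.items():
    for span in spans:
        print(f"{field}: '{span.text}' at chars {span.start}-{span.end}")
```

Slicing the source with `text[span.start:span.end]` returns the matched value, which is the property that makes highlighting in a viewer possible.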
Supported Models
LangStruct works with any LLM provider:
- Google Gemini: gemini/gemini-2.5-flash, gemini/gemini-2.5-pro
- OpenAI: gpt-5-pro, gpt-5-mini, gpt-4o, gpt-4o-mini
- Anthropic: claude-opus-4-1, claude-sonnet-4-0, claude-3-7-sonnet-latest, claude-3-5-haiku-latest
- Local: Any model via Ollama (llama3, mistral, etc.)
Visualization & Export
Create Interactive Visualizations:
from langstruct import HTMLVisualizer
viz = HTMLVisualizer()
viz.save_visualization(text, result, "results.html") # Shows highlighted sources
Export Results:
# Save to various formats
result.save_json("data.json")
extractor.export_batch(results, "data.csv") # CSV, Excel, Parquet supported
JSONL Round-Trip + Visualization:
# Save annotated documents to JSONL
results = extractor.extract(texts, validate=False)
extractor.save_annotated_documents(results, "extractions.jsonl")
# Load later
loaded = extractor.load_annotated_documents("extractions.jsonl")
# Generate interactive HTML
extractor.visualize(loaded, "results.html")
Batch, Rate Limits, Retries
LangStruct batches efficiently and helps respect provider quotas.
# Control concurrency and quotas
results = extractor.extract(
    texts,
    max_workers=8,       # Thread workers
    show_progress=True,  # Requires langstruct[parallel]
    rate_limit=60,       # Calls per minute
    retry_failed=True    # True: raise on failures; False: warn and skip
)
- Retries: exponential backoff (3 attempts by default) for transient errors.
- Rate limiting: simple token-bucket; set rate_limit=None for unlimited.
- Failures: when retry_failed=False, failures are warned and skipped; otherwise an exception summarizes the first errors.
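For readers unfamiliar with the two mechanisms named above, here is a minimal standard-library sketch of a token bucket and of retries with exponential backoff. `TokenBucket` and `with_retries` are hypothetical names, and the delay values are assumptions; this is not LangStruct's internal implementation.

```python
import time
from typing import Callable, TypeVar

T = TypeVar("T")


class TokenBucket:
    """Allow at most `rate_per_minute` calls per minute on average."""

    def __init__(self, rate_per_minute: float, clock: Callable[[], float] = time.monotonic):
        self.capacity = rate_per_minute
        self.tokens = rate_per_minute  # start with a full bucket
        self.refill_per_second = rate_per_minute / 60.0
        self.clock = clock
        self.last = clock()

    def try_acquire(self) -> bool:
        """Take one token if available; return False if the caller should wait."""
        now = self.clock()
        elapsed = now - self.last
        self.last = now
        self.tokens = min(self.capacity, self.tokens + elapsed * self.refill_per_second)
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False


def with_retries(call: Callable[[], T], attempts: int = 3, base_delay: float = 0.5,
                 sleep: Callable[[float], None] = time.sleep) -> T:
    """Retry `call`, doubling the delay after each failure."""
    for attempt in range(attempts):
        try:
            return call()
        except Exception:
            if attempt == attempts - 1:
                raise  # out of attempts: surface the last error
            sleep(base_delay * 2 ** attempt)
    raise RuntimeError("attempts must be at least 1")
```

The `clock` and `sleep` parameters exist only so the behavior can be checked without real waiting; a worker would call `try_acquire()` before each request and pause briefly whenever it returns False.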
Advanced Features
Optimization (For Power Users)
LangStruct optimizes automatically, but you can fine-tune for your specific data:
# Train on your examples
training_texts = ["Your domain-specific texts..."]
expected_results = [{"name": "Expected outputs..."}]
extractor.optimize(
    texts=training_texts,
    expected_results=expected_results,
    num_trials=50  # More trials usually help, at higher cost
)
# Evaluate performance
scores = extractor.evaluate(test_texts, test_expected)
print(f"Accuracy: {scores['accuracy']:.2%}")
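The README does not define how the `accuracy` score is computed. One plausible reading is field-level exact match, sketched below with a hypothetical `field_accuracy` helper; treat it as an illustration of the metric, not as what `evaluate` actually does.

```python
from typing import Any, Dict, List


def field_accuracy(predicted: List[Dict[str, Any]], expected: List[Dict[str, Any]]) -> float:
    """Fraction of expected fields whose predicted value matches exactly."""
    correct = total = 0
    for pred, gold in zip(predicted, expected):
        for field, gold_value in gold.items():
            total += 1
            if pred.get(field) == gold_value:
                correct += 1
    return correct / total if total else 0.0


predicted = [{"name": "Dr. Sarah Johnson", "age": 34}, {"name": "J. Doe", "age": 51}]
expected = [{"name": "Dr. Sarah Johnson", "age": 34}, {"name": "John Doe", "age": 51}]
print(f"Accuracy: {field_accuracy(predicted, expected):.2%}")  # Accuracy: 75.00%
```

Exact match is strict: "J. Doe" and "John Doe" count as a miss, so a fuzzier comparison may be appropriate for free-text fields.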
Refinement for Higher Accuracy
Boost extraction accuracy by 15-30% with Best-of-N candidate selection and iterative improvement:
# Simple refinement
result = extractor.extract(text, refine=True)
# Advanced refinement with custom configuration
result = extractor.extract(text, refine={
    "strategy": "bon_then_refine",  # Best-of-N + iterative improvement
    "n_candidates": 5,              # Generate 5 candidates
    "judge": "Prefer candidates that exactly match cited text spans",
    "max_refine_steps": 2,
    "budget": {"max_calls": 10}     # Cost control
})
print(f"Confidence: {result.confidence:.1%}")
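The selection half of Best-of-N can be sketched without any model calls. In the sketch below, candidates are hard-coded and the judge is replaced by a deterministic score that rewards values found verbatim in the source, in the spirit of the judge instruction above. `grounding_score` and `best_of_n` are hypothetical names; in LangStruct the candidates come from the LLM and the judge is itself model-driven.

```python
from typing import Dict, List


def grounding_score(candidate: Dict[str, str], source: str) -> float:
    """Share of extracted values that appear verbatim in the source text."""
    if not candidate:
        return 0.0
    hits = sum(1 for value in candidate.values() if str(value) in source)
    return hits / len(candidate)


def best_of_n(candidates: List[Dict[str, str]], source: str) -> Dict[str, str]:
    """Pick the candidate with the highest grounding score (first wins ties)."""
    return max(candidates, key=lambda c: grounding_score(c, source))


source = "Your invoice INV-2024-789 for $3,450.00 is due on April 20th, 2024."
candidates = [
    {"invoice_number": "INV-789", "due_date": "April 20th, 2024"},       # wrong number
    {"invoice_number": "INV-2024-789", "due_date": "April 20th, 2024"},  # fully grounded
    {"invoice_number": "INV-2024-789", "due_date": "May 1st, 2024"},     # wrong date
]
best = best_of_n(candidates, source)
print(best)
```

Generating more candidates raises the chance that one is fully grounded but also multiplies model calls, which is what the `budget` setting above is meant to cap.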
Custom Configuration
from langstruct import ChunkingConfig
# For large documents
config = ChunkingConfig(
    max_tokens=1500,
    overlap_tokens=150,
    preserve_sentences=True
)
extractor = LangStruct(
    schema=YourSchema,
    model="gemini/gemini-2.5-flash",
    chunking_config=config,
    optimize=True  # Enabled for training data
)
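To show what `max_tokens` and `overlap_tokens` control, here is a minimal sketch of overlapping chunking. It assumes whitespace-separated words as tokens and ignores sentence boundaries, so it does not cover `preserve_sentences`; `chunk_tokens` is a hypothetical helper, not the library's chunker.

```python
from typing import List


def chunk_tokens(tokens: List[str], max_tokens: int, overlap_tokens: int) -> List[List[str]]:
    """Split tokens into windows of max_tokens that overlap by overlap_tokens."""
    if not 0 <= overlap_tokens < max_tokens:
        raise ValueError("overlap_tokens must be non-negative and smaller than max_tokens")
    step = max_tokens - overlap_tokens
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + max_tokens])
        if start + max_tokens >= len(tokens):
            break  # the last window already reaches the end
    return chunks


words = "one two three four five six seven eight nine ten".split()
for chunk in chunk_tokens(words, max_tokens=4, overlap_tokens=1):
    print(" ".join(chunk))
```

The overlap means an entity that straddles a chunk boundary still appears whole in at least one chunk, at the cost of processing some tokens twice.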
Troubleshooting
API Key Issues
Error: "No API keys found" or "Authentication failed"
- Check your API key is set:
echo $GOOGLE_API_KEY  # Should show your key
- Common fixes:
# Make sure you're using the right format
export GOOGLE_API_KEY="your-actual-key-here"  # No quotes in the key itself
# For persistent setup, add to your shell profile:
echo 'export GOOGLE_API_KEY="your-key"' >> ~/.bashrc
source ~/.bashrc
- Test your key works:
import os
print("API key set:", bool(os.getenv("GOOGLE_API_KEY")))
# Quick test
from langstruct import LangStruct
ls = LangStruct(example={"name": "test"})
result = ls.extract("Hello John")  # Should work without errors
Error: "Model not found" or "Rate limit exceeded"
- Model not found: Your API key might be for a different provider
- Rate limits: Try a different model or wait a few minutes
- Billing: Check your account has credits (OpenAI/Anthropic)
Installation Issues
Error: Package not found on PyPI
If you encounter package installation issues, try:
# Update pip and try again
pip install --upgrade pip
pip install langstruct
# Or install from source for development
git clone https://github.com/langstruct-ai/langstruct.git
cd langstruct
uv sync --extra dev
uv pip install -e .
Import errors or missing dependencies
# From a source checkout, reinstall with all dependencies
pip install -e ".[dev,examples,viz,parallel]"
Getting Help
- Bug reports: GitHub Issues
- Questions: GitHub Discussions
- Documentation: langstruct.dev
Contributing
We welcome contributions! Please see our contributing guide for details.
Development Setup
# Clone the repository (replace yourusername with the account that hosts your fork)
git clone https://github.com/yourusername/langstruct.git
cd langstruct
# Install dependencies with uv
uv sync --extra dev
# Run tests
uv run pytest
# Format code
uv run black . && uv run isort .
License
MIT License - see LICENSE for details.
Acknowledgments