LLM-powered structured information extraction using DSPy optimization

🧠 LangStruct

Python 3.12+ MIT License DSPy 3.0 Docs

Extract structured data from any text – no prompt engineering required

TL;DR: Extract structured information from any text - documents, emails, reports, transcripts - into clean JSON data. No prompt engineering required. Built on DSPy 3.0 for automatic optimization.

LangStruct turns messy, unstructured text into clean, typed, validated data. Whether you're processing medical records, financial documents, customer feedback, or legal contracts, LangStruct extracts exactly what you need with source tracking and confidence scores.

What LangStruct Does

LangStruct extracts structured information from unstructured text:

Input (messy text)                    →  Output (clean data)
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
"Dr. Smith diagnosed the 45-year-old  →  {
 patient with Type 2 diabetes and     →    "physician": "Dr. Smith",
 prescribed metformin 500mg twice     →    "patient_age": 45,
 daily. Follow-up in 3 months."       →    "diagnosis": "Type 2 diabetes",
                                      →    "medication": "metformin",
                                      →    "dosage": "500mg",
                                      →    "frequency": "twice daily",
                                      →    "followup": "3 months"
                                      →  }

Key Features

Core Capabilities

  • Automatic Optimization: Uses DSPy MIPROv2 for prompt optimization
  • Refinement System: Best-of-N + iterative improvement for 15-30% accuracy boost
  • Source Tracking: Character-level mapping of extracted data to source text
  • Schema Generation: Create Pydantic schemas from examples
  • Type Safety: Full Pydantic validation and type hints
  • Model Support: Compatible with OpenAI, Anthropic, Google, Ollama, and other LLMs
  • Persistence: Save and load extractors with full state preservation
  • Visualization: HTML output with source highlighting

Quick Example

⚠️ API Key Required: You need an API key to run LangStruct. Get one free here → or see setup options below.

from langstruct import LangStruct

# Define what you want to extract with a simple example
extractor = LangStruct(example={
    "invoice_number": "INV-001",
    "amount": 1250.00,
    "due_date": "2024-03-15",
    "line_items": ["Widget A", "Service B"]
})

# Extract from any text
text = """
Dear Customer,

Your invoice INV-2024-789 for $3,450.00 is due on April 20th, 2024.

Items:
- Premium Widget Set
- Installation Service
- Extended Warranty

Thank you for your business!
"""

result = extractor.extract(text)
print(result.entities)
# {
#   "invoice_number": "INV-2024-789",
#   "amount": 3450.00,
#   "due_date": "2024-04-20",
#   "line_items": ["Premium Widget Set", "Installation Service", "Extended Warranty"]
# }

# Boost accuracy with refinement (15-30% improvement)
result = extractor.extract(text, refine=True)
print(f"Confidence: {result.confidence:.1%}")  # Higher confidence score

Quick Start

1. Get an API Key (Required)

Choose one option:

Provider | Get Key | Best For
Google Gemini | Get Free Key → | Fast & generous free tier
OpenAI | Get Key → | GPT models
Anthropic | Get Key → | Claude models
Local (Ollama) | Install Ollama → | Privacy, no API needed

Set your API key:

# Google Gemini (free)
export GOOGLE_API_KEY="your-key-here"

# Or use others:
export OPENAI_API_KEY="your-key-here"
export ANTHROPIC_API_KEY="your-key-here"

2. Installation

Install from PyPI:

# uv (recommended)
uv add langstruct

# or pip
pip install langstruct

# Optional extras
pip install langstruct[viz]        # Visualization tools (HTML helpers)
pip install langstruct[examples]   # Example integrations (ChromaDB, LangChain)
pip install langstruct[parallel]   # tqdm for nicer progress bars
pip install langstruct[dev]        # Test and lint toolchain
pip install langstruct[all]        # Everything above

3. Basic Usage

from langstruct import LangStruct

# Create an extractor from an example (simplest approach)
extractor = LangStruct(example={
    "name": "Dr. Sarah Johnson",
    "age": 34,
    "location": "Cambridge, Massachusetts",
    "occupation": "cardiologist"
})

# Extract structured data from text
text = """
Dr. Sarah Johnson is a 34-year-old cardiologist working at Boston General Hospital.
She currently lives in Cambridge, Massachusetts, with her family.
"""

result = extractor.extract(text)

print(result.entities)
# Output: {
#   "name": "Dr. Sarah Johnson",
#   "age": 34,
#   "location": "Cambridge, Massachusetts",
#   "occupation": "cardiologist"
# }

print(f"Confidence: {result.confidence:.0%}")
# Example output: Confidence: 94%

That's it! LangStruct automatically handles schema generation, optimization, and source tracking.
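To make "schema generation from an example" more concrete, here is a minimal, stdlib-only sketch of inferring field types from an example dict. The helper name `infer_field_types` is hypothetical and is not part of LangStruct's API; the library generates Pydantic schemas, and its actual inference rules may differ.

```python
from typing import Any


def infer_field_types(example: dict[str, Any]) -> dict[str, str]:
    """Map each example field to a readable type name (hypothetical helper)."""

    def type_name(value: Any) -> str:
        # bool must be checked before int because bool is a subclass of int
        if isinstance(value, bool):
            return "bool"
        if isinstance(value, int):
            return "int"
        if isinstance(value, float):
            return "float"
        if isinstance(value, str):
            return "str"
        if isinstance(value, list):
            # Guess the item type from the first element; empty lists stay generic
            return f"list[{type_name(value[0])}]" if value else "list"
        if isinstance(value, dict):
            return "dict"
        return type(value).__name__

    return {field: type_name(value) for field, value in example.items()}


# Same example dict as in the Basic Usage snippet above
schema = infer_field_types({
    "name": "Dr. Sarah Johnson",
    "age": 34,
    "location": "Cambridge, Massachusetts",
    "occupation": "cardiologist",
})
print(schema)
```

The point of the sketch is only that concrete example values carry enough information to derive a typed schema; field descriptions and validation rules would still need to come from the library or from an explicit Pydantic model.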

📚 Common Applications

1. Data Pipeline Automation

Extract structured data from documents for databases, analytics, or APIs:

# Process invoices, receipts, reports, emails
invoice_data = extractor.extract(invoice_pdf_text)
# → {"invoice_no": "INV-2024-001", "amount": 5420.00, "due_date": "2024-03-15"}

2. Content Analysis & Research

Analyze transcripts, reviews, surveys, or social media:

# Extract insights from customer feedback
feedback = extractor.extract(review_text)
# → {"sentiment": "positive", "product_issues": [], "feature_requests": ["dark mode"]}

3. Compliance & Validation

Extract and validate required information from legal or regulatory documents:

# Check contracts for specific clauses
contract_data = extractor.extract(contract_text)
# → {"term_length": "2 years", "termination_clause": true, "liability_cap": 1000000}

🚀 RAG System Enhancement

Transform your RAG system from simple search to intelligent retrieval:

Note: LangStruct enhances ANY vector database or search system (Pinecone, Weaviate, Elasticsearch, etc.).

1. Document → Structured Metadata

# Extract structured metadata from documents
extractor = LangStruct(example={
    "company": "Apple Inc.",
    "revenue": 125.3,
    "quarter": "Q3 2024"
})

metadata = extractor.extract(document).entities
# Now your documents have precise, filterable metadata

2. Query → Structured Filters

from langstruct import LangStruct

# Parse natural language queries into filters
same_schema = {"company": "Apple Inc.", "revenue": 125.3, "quarter": "Q3 2024"}
ls = LangStruct(example=same_schema)  # Same schema as extraction!

query = "Show me Q3 2024 tech companies with revenue over $100B discussing AI investments"
parsed = ls.query(query)

print(parsed.semantic_terms)
# ["tech companies", "AI investments", "artificial intelligence"]

print(parsed.structured_filters)
# {"quarter": "Q3 2024", "revenue": {"$gte": 100.0}}

3. Precise Retrieval

# Combine semantic search with exact filters
rag_results = vector_store.similarity_search(
    query=' '.join(parsed.semantic_terms),  # Semantic search
    where=parsed.structured_filters         # Exact filters
)
# Returns only docs matching BOTH semantic AND structural requirements
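The parsed filters above use a MongoDB-style `$gte` operator. As a rough illustration of what "exact filters" means, the sketch below applies such a filter dict to plain metadata records in memory. `matches_filters` and the operator table are hypothetical: only `$gte` appears in the example output above, the other operators are assumptions, and a real vector store would evaluate the filter itself.

```python
import operator
from typing import Any

# Operator names are assumptions modeled on MongoDB-style filters;
# only "$gte" appears in the parsed output shown above.
_OPERATORS = {
    "$gte": operator.ge,
    "$gt": operator.gt,
    "$lte": operator.le,
    "$lt": operator.lt,
    "$eq": operator.eq,
}


def matches_filters(metadata: dict[str, Any], filters: dict[str, Any]) -> bool:
    """Return True if metadata satisfies every filter (hypothetical helper)."""
    for field, condition in filters.items():
        if field not in metadata:
            return False
        value = metadata[field]
        if isinstance(condition, dict):
            # Operator form, e.g. {"$gte": 100.0}
            for op_name, target in condition.items():
                if not _OPERATORS[op_name](value, target):
                    return False
        elif value != condition:
            # Plain form means exact equality
            return False
    return True


# Illustrative metadata records, made up for this sketch
docs = [
    {"company": "Company A", "revenue": 125.3, "quarter": "Q3 2024"},
    {"company": "Company B", "revenue": 2.1, "quarter": "Q3 2024"},
    {"company": "Company C", "revenue": 140.0, "quarter": "Q2 2024"},
]
filters = {"quarter": "Q3 2024", "revenue": {"$gte": 100.0}}
matching = [d["company"] for d in docs if matches_filters(d, filters)]
print(matching)  # ['Company A']
```

Only the first record passes both conditions: the second fails the revenue threshold and the third is from a different quarter.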

Why RAG + LangStruct?

Traditional RAG systems struggle with structured requirements. LangStruct solves this:

Query | Traditional RAG | With LangStruct
"invoices over $10k from Q3" | Returns any document with "invoice" OR "Q3" | Returns ONLY invoices >$10k from Q3
"patients over 65 with diabetes" | Returns any medical document | Returns ONLY matching patient records
"contracts expiring in 2024" | Returns any contract | Returns ONLY 2024 expirations

See our complete RAG integration guide for implementation.

🌟 Where LangStruct Excels

Perfect for:

  • 📄 Document Processing: Invoices, reports, forms, emails
  • 🏥 Healthcare: Medical records, clinical notes, lab results
  • 💼 Financial: Statements, filings, contracts, reports
  • ⚖️ Legal: Contracts, agreements, regulations, cases
  • 🔬 Research: Papers, patents, technical documentation
  • 🎯 Customer Data: Reviews, feedback, support tickets

Key Advantages:

  • No prompt engineering: DSPy handles optimization automatically
  • Type safety: Pydantic schemas with full validation
  • Source grounding: Know exactly where each field came from
  • Confidence scores: Understand extraction reliability
  • Model agnostic: Works with any LLM provider

📊 Comparison with Alternatives

LangStruct vs LangExtract

Both are excellent tools for structured extraction with different strengths:

Feature | LangStruct | LangExtract
Optimization | ✅ Automatic (DSPy MIPROv2) | ❌ Manual prompt tuning
Refinement | ✅ Best-of-N + iterative improvement | ⚠️ Multi-pass extraction; no Best-of-N/judge pipeline
Schema Definition | ✅ From examples OR Pydantic | ⚠️ Prompt + examples (no Pydantic models)
Source Grounding | ✅ Character-level tracking | ✅ Character-level tracking
Confidence Scores | ✅ Built-in | ⚠️ Not surfaced as scores
Query Parsing | ✅ Bidirectional (docs + queries) | ❌ Documents only
Model Support | ✅ Any LLM (via DSPy/LiteLLM) | ✅ Gemini, OpenAI, local via Ollama; extensible
Learning Curve | ✅ Simple (example-based) | ⚠️ Requires prompt + example design
Performance | ✅ Self-optimizing | Depends on manual tuning
Project Type | Community open-source | Google open-source

Note: Comparison verified on 2025-09-10 against the latest LangExtract README and examples. See LangExtract: https://github.com/google/langextract and example walkthroughs (e.g., longer text extraction): https://github.com/google/langextract/blob/main/docs/examples/longer_text_example.md

Choose LangStruct if you want:

  • Automatic optimization without prompt engineering
  • Best-of-N refinement for higher accuracy
  • Flexibility to define schemas from examples
  • Query parsing for RAG systems
  • Confidence scores for extraction quality
  • Support for any LLM provider

Choose LangExtract if you prefer:

  • Direct control over prompts
  • Google's backing and support
  • Simpler architecture without DSPy

🎯 Getting Started

Once you're comfortable with the basics, you can:

Define Custom Schemas for more control:

from pydantic import BaseModel, Field
from langstruct import LangStruct

class PersonSchema(BaseModel):
    name: str = Field(description="Full name of the person")
    age: int = Field(description="Age in years")
    location: str = Field(description="Current location")

extractor = LangStruct(schema=PersonSchema)

Process Multiple Documents at once:

documents = [doc1, doc2, doc3]
results = extractor.extract(documents)  # Handles batch processing automatically

Save and Load Extractors for reuse:

# Save an optimized extractor (preserves all state)
extractor.save("./my_extractor")

# Load anywhere (API keys must be available in environment)
loaded_extractor = LangStruct.load("./my_extractor")
result = loaded_extractor.extract("New text")

View Source Locations to see where data came from:

for field, spans in result.sources.items():
    for span in spans:
        print(f"{field}: '{span.text}' at chars {span.start}-{span.end}")
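For intuition about character-level source tracking, the following sketch locates verbatim occurrences of an extracted value in the source text and records their offsets. The `Span` dataclass and `find_spans` helper are hypothetical stand-ins, not LangStruct's classes; in particular, treating `end` as an exclusive offset is an assumption, and the library may ground values that are paraphrased rather than copied verbatim.

```python
from dataclasses import dataclass


@dataclass
class Span:
    text: str
    start: int  # Inclusive character offset (assumed convention)
    end: int    # Exclusive character offset (assumed convention)


def find_spans(source: str, value: str) -> list[Span]:
    """Locate every verbatim occurrence of value in source (hypothetical helper)."""
    spans: list[Span] = []
    if not value:
        return spans
    start = source.find(value)
    while start != -1:
        spans.append(Span(text=value, start=start, end=start + len(value)))
        # Continue searching just past the start of the previous hit
        start = source.find(value, start + 1)
    return spans


source = "Dr. Smith diagnosed the 45-year-old patient with Type 2 diabetes."
spans = find_spans(source, "Type 2 diabetes")
for span in spans:
    print(f"'{span.text}' at chars {span.start}-{span.end}")
    # Slicing the source with the offsets recovers the extracted text
    assert source[span.start:span.end] == span.text
```

Offsets like these are what make highlighting possible: any consumer can slice the original text and show exactly which characters support a field.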

📋 Supported Models

LangStruct works with any LLM provider:

  • Google Gemini: gemini/gemini-2.5-flash, gemini/gemini-2.5-pro
  • OpenAI: gpt-5-pro, gpt-5-mini, gpt-4o, gpt-4o-mini
  • Anthropic: claude-opus-4-1, claude-sonnet-4-0, claude-3-7-sonnet-latest, claude-3-5-haiku-latest
  • Local: Any model via Ollama (llama3, mistral, etc.)

🎨 Visualization & Export

Create Interactive Visualizations:

from langstruct import HTMLVisualizer

viz = HTMLVisualizer()
viz.save_visualization(text, result, "results.html")  # Shows highlighted sources

Export Results:

# Save to various formats
result.save_json("data.json")
extractor.export_batch(results, "data.csv")  # CSV, Excel, Parquet supported

JSONL Round-Trip + Visualization:

# Save annotated documents to JSONL
results = extractor.extract(texts, validate=False)
extractor.save_annotated_documents(results, "extractions.jsonl")

# Load later
loaded = extractor.load_annotated_documents("extractions.jsonl")

# Generate interactive HTML
extractor.visualize(loaded, "results.html")

🧵 Batch, Rate Limits, Retries

LangStruct batches efficiently and helps respect provider quotas.

# Control concurrency and quotas
results = extractor.extract(
    texts,
    max_workers=8,        # Thread workers
    show_progress=True,   # Requires langstruct[parallel]
    rate_limit=60,        # Calls per minute
    retry_failed=True     # True: raise on failures; False: warn and skip
)

  • Retries: exponential backoff (3 attempts by default) for transient errors.
  • Rate limiting: simple token-bucket; set rate_limit=None for unlimited.
  • Failures: when retry_failed=False, failures are warned and skipped; otherwise an exception summarizes first errors.
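The two mechanisms named above, a token bucket and exponential backoff, can be sketched with the standard library alone. This is an illustrative sketch, not LangStruct's internal implementation: `TokenBucket`, `retry_with_backoff`, and their parameters are hypothetical names, and the 1-second base delay is an assumption (the source only states three attempts by default).

```python
import time
from typing import Callable, TypeVar

T = TypeVar("T")


class TokenBucket:
    """Minimal token bucket allowing about rate_per_minute calls per minute."""

    def __init__(self, rate_per_minute: float, clock: Callable[[], float] = time.monotonic):
        self.rate_per_minute = float(rate_per_minute)
        self.tokens = self.rate_per_minute  # Start with a full bucket
        self.clock = clock
        self.last = clock()

    def try_acquire(self) -> bool:
        """Take one token if available; False means the caller should wait."""
        now = self.clock()
        elapsed = now - self.last
        self.last = now
        # Refill in proportion to elapsed time, capped at the bucket size
        refill = elapsed * self.rate_per_minute / 60.0
        self.tokens = min(self.rate_per_minute, self.tokens + refill)
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False


def retry_with_backoff(
    func: Callable[[], T],
    attempts: int = 3,
    base_delay: float = 1.0,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Call func, retrying on any exception with delays of base_delay * 2**n."""
    for attempt in range(attempts):
        try:
            return func()
        except Exception:
            if attempt == attempts - 1:
                raise  # Out of attempts: surface the last error
            sleep(base_delay * 2 ** attempt)
    raise ValueError("attempts must be at least 1")


# Demo with a fake clock so the example runs instantly
fake_now = [0.0]
bucket = TokenBucket(rate_per_minute=2, clock=lambda: fake_now[0])
first, second, third = bucket.try_acquire(), bucket.try_acquire(), bucket.try_acquire()
fake_now[0] += 30.0  # 30 s at 2 calls/min refills exactly one token
fourth = bucket.try_acquire()
print(first, second, third, fourth)  # True True False True

# Demo: a call that fails twice, then succeeds on the third attempt
calls = {"n": 0}
delays: list[float] = []


def flaky() -> str:
    calls["n"] += 1
    if calls["n"] < 3:
        raise RuntimeError("transient")
    return "ok"


outcome = retry_with_backoff(flaky, attempts=3, sleep=delays.append)
print(outcome, delays)  # ok [1.0, 2.0]
```

Injecting the clock and the sleep function keeps the demo deterministic; in real use the defaults (`time.monotonic` and `time.sleep`) would apply.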

🚀 Advanced Features

Optimization (For Power Users)

LangStruct optimizes automatically, but you can fine-tune for your specific data:

# Train on your examples
training_texts = ["Your domain-specific texts..."]
expected_results = [{"name": "Expected outputs..."}]

extractor.optimize(
    texts=training_texts,
    expected_results=expected_results,
    num_trials=50  # More trials = better results
)

# Evaluate performance
scores = extractor.evaluate(test_texts, test_expected)
print(f"Accuracy: {scores['accuracy']:.2%}")
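The source does not say how the `accuracy` score is computed. One plausible definition, shown here purely as an assumption, is field-level exact match: the share of expected fields whose predicted value is identical. The helper `field_accuracy` is hypothetical and the records are made up for illustration.

```python
from typing import Any


def field_accuracy(predicted: list[dict[str, Any]], expected: list[dict[str, Any]]) -> float:
    """Share of expected fields whose predicted value matches exactly."""
    if len(predicted) != len(expected):
        raise ValueError("predicted and expected must have the same length")
    total = 0
    correct = 0
    for pred, gold in zip(predicted, expected):
        for field, gold_value in gold.items():
            total += 1
            # A missing field counts as wrong
            if field in pred and pred[field] == gold_value:
                correct += 1
    return correct / total if total else 0.0


# Made-up predictions and gold labels: 3 of 4 fields match
predicted = [{"name": "Sarah Johnson", "age": 34}, {"name": "J. Doe", "age": 51}]
expected = [{"name": "Sarah Johnson", "age": 34}, {"name": "John Doe", "age": 51}]
accuracy = field_accuracy(predicted, expected)
print(f"Accuracy: {accuracy:.2%}")  # Accuracy: 75.00%
```

Exact match is strict: "J. Doe" versus "John Doe" counts as a miss, so a production metric might normalize strings or use fuzzy matching instead.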

Refinement for Higher Accuracy

Boost extraction accuracy by 15-30% with Best-of-N candidate selection and iterative improvement:

# Simple refinement
result = extractor.extract(text, refine=True)

# Advanced refinement with custom configuration
result = extractor.extract(text, refine={
    "strategy": "bon_then_refine",  # Best-of-N + iterative improvement
    "n_candidates": 5,              # Generate 5 candidates
    "judge": "Prefer candidates that exactly match cited text spans",
    "max_refine_steps": 2,
    "budget": {"max_calls": 10}     # Cost control
})

print(f"Confidence after refinement: {result.confidence:.1%}")
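Best-of-N selection, in general terms, means generating several candidate extractions and keeping the one a judge scores highest. The sketch below shows that selection step only. `best_of_n` and `grounding_score` are hypothetical helpers; the scoring rule (fraction of values found verbatim in the text) loosely mirrors the judge instruction above but is an assumption, and in real use `generate` would call an LLM rather than return canned candidates.

```python
from typing import Any, Callable

Candidate = dict[str, Any]


def grounding_score(candidate: Candidate, text: str) -> float:
    """Fraction of candidate values that appear verbatim in the source text."""
    if not candidate:
        return 0.0
    hits = sum(1 for value in candidate.values() if str(value) in text)
    return hits / len(candidate)


def best_of_n(
    generate: Callable[[int], Candidate],
    score: Callable[[Candidate], float],
    n_candidates: int = 5,
) -> tuple[Candidate, float]:
    """Generate n candidates and return the highest-scoring one with its score."""
    candidates = [generate(i) for i in range(n_candidates)]
    scored = [(score(c), i, c) for i, c in enumerate(candidates)]
    # On ties, prefer the earliest candidate
    best_score, _, best = max(scored, key=lambda item: (item[0], -item[1]))
    return best, best_score


text = "Dr. Smith prescribed metformin 500mg twice daily."
# Canned candidates stand in for repeated LLM samples
canned = [
    {"medication": "metformin", "dosage": "500 mg"},    # dosage not verbatim
    {"medication": "metformin", "dosage": "500mg"},     # fully grounded
    {"medication": "Metformin HCl", "dosage": "0.5g"},  # nothing verbatim
]
best, best_score = best_of_n(
    generate=lambda i: canned[i],
    score=lambda c: grounding_score(c, text),
    n_candidates=len(canned),
)
print(best, best_score)  # {'medication': 'metformin', 'dosage': '500mg'} 1.0
```

The iterative "refine" stage described above would then take the winning candidate and ask the model to improve it; that step is omitted here.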

Custom Configuration

from langstruct import ChunkingConfig

# For large documents
config = ChunkingConfig(
    max_tokens=1500,
    overlap_tokens=150,
    preserve_sentences=True
)

extractor = LangStruct(
    schema=YourSchema,
    model="gemini/gemini-2.5-flash",
    chunking_config=config,
    optimize=True  # Enabled for training data
)
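To show how `max_tokens` and `overlap_tokens` interact, here is a simplified sketch of overlapping chunking. It is not LangStruct's chunker: `chunk_tokens` is a hypothetical helper, whitespace-separated words stand in for model tokens, and sentence preservation (`preserve_sentences`) is not modeled.

```python
def chunk_tokens(text: str, max_tokens: int = 1500, overlap_tokens: int = 150) -> list[str]:
    """Split text into chunks of at most max_tokens words that share overlap_tokens words."""
    if max_tokens <= 0 or not 0 <= overlap_tokens < max_tokens:
        raise ValueError("need max_tokens > 0 and 0 <= overlap_tokens < max_tokens")
    tokens = text.split()  # Whitespace words stand in for model tokens
    step = max_tokens - overlap_tokens
    chunks: list[str] = []
    for start in range(0, len(tokens), step):
        chunks.append(" ".join(tokens[start:start + max_tokens]))
        if start + max_tokens >= len(tokens):
            break  # This chunk already reaches the end of the text
    return chunks


words = " ".join(f"w{i}" for i in range(10))
chunks = chunk_tokens(words, max_tokens=4, overlap_tokens=1)
print(chunks)
# ['w0 w1 w2 w3', 'w3 w4 w5 w6', 'w6 w7 w8 w9']
```

Each chunk starts `max_tokens - overlap_tokens` words after the previous one, so the last word of one chunk reappears as the first word of the next. The overlap gives entities that straddle a boundary a chance to appear whole in at least one chunk.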

🔧 Troubleshooting

API Key Issues

Error: "No API keys found" or "Authentication failed"

  1. Check your API key is set:

    echo $GOOGLE_API_KEY  # Should show your key
    
  2. Common fixes:

    # Make sure you're using the right format
    export GOOGLE_API_KEY="your-actual-key-here"  # No quotes in the key itself
    
    # For persistent setup, add to your shell profile:
    echo 'export GOOGLE_API_KEY="your-key"' >> ~/.bashrc
    source ~/.bashrc
    
  3. Test your key works:

    import os
    print("API key set:", bool(os.getenv("GOOGLE_API_KEY")))
    
    # Quick test
    from langstruct import LangStruct
    ls = LangStruct(example={"name": "test"})
    result = ls.extract("Hello John")  # Should work without errors
    

Error: "Model not found" or "Rate limit exceeded"

  • Model not found: Your API key might be for a different provider
  • Rate limits: Try a different model or wait a few minutes
  • Billing: Check your account has credits (OpenAI/Anthropic)
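When several providers are in play, it can help to see at a glance which keys are actually visible to Python. The sketch below checks the three environment variables named in the setup section; `configured_providers` is a hypothetical helper, not part of LangStruct, and it only tests that a variable is set and non-empty, not that the key is valid.

```python
import os
from typing import Mapping, Optional

# Environment variable names taken from the setup section above
PROVIDER_KEYS = {
    "Google Gemini": "GOOGLE_API_KEY",
    "OpenAI": "OPENAI_API_KEY",
    "Anthropic": "ANTHROPIC_API_KEY",
}


def configured_providers(env: Optional[Mapping[str, str]] = None) -> list[str]:
    """Return provider names whose API key variable is set and non-empty."""
    env = os.environ if env is None else env
    return [name for name, var in PROVIDER_KEYS.items() if env.get(var, "").strip()]


print("Providers with keys:", configured_providers() or "none found")
```

If the list is empty in Python but `echo` shows the key in your shell, the variable was probably set without `export` or in a different shell session.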

Installation Issues

Error: Package not found on PyPI

If you encounter package installation issues, try:

# Update pip and try again
pip install --upgrade pip
pip install langstruct

# Or install from source for development
git clone https://github.com/langstruct-ai/langstruct.git
cd langstruct
uv sync --extra dev
uv pip install -e .

Import errors or missing dependencies

# Reinstall with all extras (run from the root of a source checkout)
pip install -e ".[dev,examples,viz,parallel]"

Getting Help

🤝 Contributing

We welcome contributions! Please see our contributing guide for details.

Development Setup

# Clone your fork of the repository (replace yourusername with your GitHub username)
git clone https://github.com/yourusername/langstruct.git
cd langstruct

# Install dependencies with uv
uv sync --extra dev

# Run tests
uv run pytest

# Format code
uv run black . && uv run isort .

📄 License

MIT License - see LICENSE for details.

🙏 Acknowledgments

  • Built on DSPy for self-optimizing LM pipelines
  • Uses Pydantic for type-safe schemas
