# LangCore: LLM-powered structured information extraction from text
## Overview
LangCore is a Python library for LLM-powered structured information extraction from unstructured text. It is built on top of Google's open-source LangExtract library (Apache 2.0), extending it with additional capabilities for production document processing workflows.
Attribution: The core extraction engine is derived from LangExtract by Google LLC. See the NOTICE file for full attribution details.
## Table of Contents
- Overview
- What's Added Over LangExtract
- Core Capabilities
- Feature Comparison
- Quick Start
- Schema-First Extraction with Pydantic
- Confidence Scoring
- Extraction Hooks & Events
- Quality Metrics & Evaluation
- Installation
- API Key Setup for Cloud Models
- Adding Custom Model Providers
- Using OpenAI Models
- Using Local LLMs with Ollama
- More Examples
- Ecosystem Plugins
- Contributing
- Testing
- License
## What's Added Over LangExtract
LangCore extends Google's LangExtract with the following features developed by Veritas Lex:
| Feature | Description |
|---|---|
| Pydantic Schema Extraction | Define extraction targets as Pydantic models with auto-generated prompts and JSON schema constraints (`schema_adapter`, `schema_generator`) |
| Confidence Scoring | Per-extraction confidence (0.0–1.0) combining alignment quality + token overlap, with configurable weights |
| Extraction Hooks & Events | 6 lifecycle events (`extraction:start`, `chunk`, `llm_call`, `alignment`, `complete`, `error`) with fault-tolerant callbacks |
| Quality Metrics & Evaluation | Built-in P/R/F1/accuracy metrics with per-field and per-document breakdowns |
| Multi-pass Confidence | Cross-pass frequency augmentation for higher-confidence extractions |
| Prompt Alignment Validation | Warnings when few-shot examples contain non-verbatim text |
| Plugin Ecosystem | First-party plugins: `langcore-rag`, `langextract-guardrails`, `langextract-dspy`, `langextract-litellm`, `langextract-audit` |
## Core Capabilities
- Precise Source Grounding: Maps every extraction to its exact location in the source text, enabling visual highlighting for easy traceability and verification.
- Reliable Structured Outputs: Enforces a consistent output schema based on your few-shot examples, leveraging controlled generation in supported models like Gemini to guarantee robust, structured results.
- Optimized for Long Documents: Overcomes the "needle-in-a-haystack" challenge of large document extraction by using an optimized strategy of text chunking, parallel processing, and multiple passes for higher recall.
- Interactive Visualization: Instantly generates a self-contained, interactive HTML file to visualize and review thousands of extracted entities in their original context.
- Flexible LLM Support: Supports your preferred models, from cloud-based LLMs like the Google Gemini family to local open-source models via the built-in Ollama interface.
- Adaptable to Any Domain: Define extraction tasks for any domain using just a few examples — no model fine-tuning required.
- Leverages LLM World Knowledge: Use precise prompt wording and few-shot examples to control how much the extraction draws on the model's world knowledge.
## Feature Comparison

How LangCore and its plugin ecosystem compare to LangStruct, Instructor, and Guardrails AI.

### Core Extraction Capabilities
| Feature | LangCore | LangStruct | Instructor | Guardrails AI |
|---|---|---|---|---|
| Structured extraction from text | ✅ Few-shot + prompt-driven | ✅ Schema-driven | ✅ Pydantic response model | ⚠️ Guard wrapping |
| Source grounding / alignment | ✅ Exact span mapping with char offsets | ❌ | ❌ | ❌ |
| Long document chunking | ✅ Optimized chunking + parallel processing | ⚠️ Basic chunking | ❌ | ❌ |
| Multi-pass extraction | ✅ Configurable `extraction_passes` for higher recall | ❌ | ❌ | ❌ |
| Interactive HTML visualization | ✅ Built-in entity-in-context viewer | ❌ | ❌ | ❌ |
| URL/file document input | ✅ Accepts URLs, file paths, and raw text | ❌ | ❌ | ❌ |
| Batch API support | ✅ Vertex AI Batch API | ❌ | ❌ | ❌ |
### Schema & Typing

| Feature | LangCore | LangStruct | Instructor | Guardrails AI |
|---|---|---|---|---|
| Pydantic schema extraction | ✅ `schema=MyModel` with auto-prompt generation | ✅ Native | ✅ Native response model | ⚠️ Via Pydantic integration |
| Few-shot examples | ✅ `ExampleData` with text + extractions | ⚠️ Limited | ❌ | ❌ |
| Pydantic ↔ ExampleData bridge | ✅ `to_pydantic()` / `schema_from_pydantic()` | ❌ | N/A | N/A |
| Schema from dict | ✅ `schema_from_example({"key": "val"})` | ❌ | ❌ | ❌ |
| Controlled generation | ✅ JSON schema constraints via supported models | ⚠️ | ⚠️ Mode-dependent | ❌ |
| Union type support | ❌ | ❌ | ✅ `Union[A, B]` | ❌ |
### Confidence & Quality

| Feature | LangCore | LangStruct | Instructor | Guardrails AI |
|---|---|---|---|---|
| Confidence scoring | ✅ Per-extraction (alignment quality + token overlap) | ❌ | ❌ | ❌ |
| Document-level confidence | ✅ `result.average_confidence` | ❌ | ❌ | ❌ |
| Multi-pass confidence boost | ✅ Cross-pass frequency augmentation | ❌ | ❌ | ❌ |
| Prompt alignment validation | ✅ Warnings for non-verbatim examples | ❌ | ❌ | ❌ |
### Hooks & Observability

| Feature | LangCore | LangStruct | Instructor | Guardrails AI |
|---|---|---|---|---|
| Event hook system | ✅ 6 lifecycle events via `Hooks` class | ❌ | ✅ `completion:kwargs`, `parse:error`, etc. | ❌ |
| Hook composition | ✅ Merge with `hooks_a + hooks_b` | ❌ | ❌ | ❌ |
| Fault-tolerant callbacks | ✅ Exceptions logged & swallowed | ❌ | ❌ | ❌ |
| Token usage tracking | ⚠️ Via API layer | ❌ | ✅ `response.usage` | ❌ |
### Validation & Guardrails (langextract-guardrails plugin)

| Feature | LangCore + Guardrails | LangStruct | Instructor | Guardrails AI |
|---|---|---|---|---|
| Validation + retry loop | ✅ Corrective prompts with error feedback | ❌ | ✅ Auto retry on Pydantic failure | ✅ Guard wrapping with retry |
| Pydantic schema validation | ✅ `SchemaValidator` (strict or coercive) | ❌ | ✅ Native | ⚠️ Via integration |
| JSON Schema validation | ✅ `JsonSchemaValidator` with strict mode | ❌ | ❌ | ✅ JSON Schema guard |
| Confidence threshold | ✅ `ConfidenceThresholdValidator` | ❌ | ❌ | ❌ |
| Field completeness | ✅ `FieldCompletenessValidator` | ❌ | ❌ | ⚠️ Custom validators |
| Consistency rules | ✅ `ConsistencyValidator` | ❌ | ❌ | ⚠️ Custom validators |
| Regex validation | ✅ `RegexValidator` | ❌ | ❌ | ✅ Regex guard |
| On-fail actions | ✅ EXCEPTION / REASK / FILTER / NOOP | ❌ | ⚠️ Exception only | ✅ EXCEPTION / REASK / FIX / NOOP |
| Validator registry | ✅ `@register_validator` decorator | ❌ | ❌ | ✅ Hub (67+ validators) |
| Validator chaining | ✅ `ValidatorChain` with per-validator actions | ❌ | ❌ | ✅ Guard chaining |
| Error-only correction mode | ✅ Omit invalid output from retry prompt | ❌ | ❌ | ❌ |
| Batch-independent retries | ✅ Each prompt retries independently | ❌ | ❌ | ❌ |
| Async concurrency control | ✅ `max_concurrency` semaphore | ❌ | ✅ | ❌ |
### DSPy Prompt Optimization (langextract-dspy plugin)

| Feature | LangCore + DSPy | LangStruct | Instructor | Guardrails AI |
|---|---|---|---|---|
| MIPROv2 optimizer | ✅ Fast, general-purpose | ✅ | ❌ | ❌ |
| GEPA optimizer | ✅ Reflective / feedback-driven | ✅ | ❌ | ❌ |
| Persist optimized configs | ✅ `save()` / `load()` to directory | ✅ | ❌ | ❌ |
| Evaluation (P/R/F1) | ✅ `evaluate()` with per-document details | ⚠️ Basic | ❌ | ❌ |
| Native pipeline integration | ✅ `optimized_config` param in `extract()` | ❌ Separate pipeline | ❌ | ❌ |
### Model Support

| Feature | LangCore | LangStruct | Instructor | Guardrails AI |
|---|---|---|---|---|
| Google Gemini | ✅ Built-in | ❌ | ✅ | ✅ |
| OpenAI / GPT | ✅ Via providers | ❌ | ✅ Native | ✅ |
| Local LLMs (Ollama) | ✅ Built-in | ❌ | ⚠️ Via patches | ❌ |
| LiteLLM (100+ models) | ✅ Via `langextract-litellm` | ✅ | ❌ | ✅ |
| Custom model providers | ✅ `BaseLanguageModel` ABC | ❌ | ❌ | ❌ |
| Community provider plugins | ✅ Plugin registry | ❌ | ❌ | ❌ |
### Async & Performance

| Feature | LangCore | LangStruct | Instructor | Guardrails AI |
|---|---|---|---|---|
| Async extraction | ✅ `async_extract()` | ⚠️ | ✅ | ⚠️ |
| Parallel workers | ✅ `max_workers` for concurrent chunk processing | ❌ | ❌ | ❌ |
| Response caching | ✅ Built-in with cache-busting for multi-pass | ⚠️ | ✅ | ❌ |
### Quality Metrics & Evaluation

| Feature | LangCore | LangStruct | Instructor | Guardrails AI |
|---|---|---|---|---|
| Precision / Recall / F1 | ✅ `ExtractionMetrics` static helpers + `.evaluate()` | ✅ `ExtractionMetrics` | ❌ | ❌ |
| Accuracy (exact-match ratio) | ✅ | ✅ | ❌ | ❌ |
| Per-field breakdown | ✅ `FieldReport` per schema field | ⚠️ Basic | ❌ | ❌ |
| Per-document breakdown | ✅ Per-document P/R/F1 dicts | ❌ | ❌ | ❌ |
| Pydantic schema integration | ✅ `ExtractionMetrics(schema=Invoice)` | ❌ | ❌ | ❌ |
| Strict attribute matching | ✅ `strict_attributes=True` | ❌ | ❌ | ❌ |
| Top-level convenience | ✅ `lx.evaluate()` | ❌ | ❌ | ❌ |
### RAG Query Parsing (langcore-rag plugin)

| Feature | LangCore + RAG | LangStruct | Instructor | Guardrails AI |
|---|---|---|---|---|
| Query → semantic terms + filters | ✅ `QueryParser.parse()` | ✅ `.query()` | ❌ | ❌ |
| Async parsing | ✅ `async_parse()` | ✅ | ❌ | ❌ |
| Pydantic schema introspection | ✅ Auto-discovers filterable fields | ✅ | ❌ | ❌ |
| MongoDB-style operators | ✅ `$eq`, `$gte`, `$lte`, `$in`, etc. | ✅ | ❌ | ❌ |
| Parse confidence score | ✅ 0.0 – 1.0 | ❌ | ❌ | ❌ |
| Explanation / rationale | ✅ Human-readable | ❌ | ❌ | ❌ |
| Any LLM backend | ✅ Via LiteLLM (100+ providers) | ✅ | ❌ | ❌ |
## Quick Start

Note: Using cloud-hosted models like Gemini requires an API key. See the API Key Setup section for instructions on how to get and configure your key.

Extract structured information with just a few lines of code.

### 1. Define Your Extraction Task

First, create a prompt that clearly describes what you want to extract. Then, provide a high-quality example to guide the model.
```python
import langcore as lx
import textwrap

# 1. Define the prompt and extraction rules
prompt = textwrap.dedent("""\
    Extract characters, emotions, and relationships in order of appearance.
    Use exact text for extractions. Do not paraphrase or overlap entities.
    Provide meaningful attributes for each entity to add context.""")

# 2. Provide a high-quality example to guide the model
examples = [
    lx.data.ExampleData(
        text="ROMEO. But soft! What light through yonder window breaks? It is the east, and Juliet is the sun.",
        extractions=[
            lx.data.Extraction(
                extraction_class="character",
                extraction_text="ROMEO",
                attributes={"emotional_state": "wonder"},
            ),
            lx.data.Extraction(
                extraction_class="emotion",
                extraction_text="But soft!",
                attributes={"feeling": "gentle awe"},
            ),
            lx.data.Extraction(
                extraction_class="relationship",
                extraction_text="Juliet is the sun",
                attributes={"type": "metaphor"},
            ),
        ],
    )
]
```
Note: Examples drive model behavior. Each `extraction_text` should ideally be verbatim from the example's `text` (no paraphrasing), listed in order of appearance. LangCore raises prompt-alignment warnings by default if examples don't follow this pattern; resolve these for best results.
### 2. Run the Extraction

Provide your input text and the prompt materials to the `lx.extract` function.

```python
# The input text to be processed
input_text = "Lady Juliet gazed longingly at the stars, her heart aching for Romeo"

# Run the extraction
result = lx.extract(
    text_or_documents=input_text,
    prompt_description=prompt,
    examples=examples,
    model_id="gemini-2.5-flash",
)
```
Model Selection: `gemini-2.5-flash` is the recommended default, offering an excellent balance of speed, cost, and quality. For highly complex tasks requiring deeper reasoning, `gemini-2.5-pro` may provide superior results. For large-scale or production use, a Tier 2 Gemini quota is suggested to increase throughput and avoid rate limits. See the rate-limit documentation for details.

Model Lifecycle: Gemini models have a lifecycle with defined retirement dates. Consult the official model version documentation to stay informed about the latest stable and legacy versions.
### 3. Visualize the Results

The extractions can be saved to a `.jsonl` file, a popular format for working with language model data. LangCore can then generate an interactive HTML visualization from this file to review the entities in context.

```python
# Save the results to a JSONL file
lx.io.save_annotated_documents(
    [result], output_name="extraction_results.jsonl", output_dir="."
)

# Generate the visualization from the file
html_content = lx.visualize("extraction_results.jsonl")
with open("visualization.html", "w") as f:
    if hasattr(html_content, "data"):
        f.write(html_content.data)  # For Jupyter/Colab
    else:
        f.write(html_content)
```
This creates an animated, interactive HTML file for reviewing every extraction in its original context.
Note on LLM Knowledge Utilization: This example demonstrates extractions that stay close to the text evidence, extracting "longing" for Lady Juliet's emotional state and identifying "yearning" from "gazed longingly at the stars." The task could be modified to generate attributes that draw more heavily from the LLM's world knowledge (e.g., adding `"identity": "Capulet family daughter"` or `"literary_context": "tragic heroine"`). The balance between text evidence and knowledge inference is controlled by your prompt instructions and example attributes.
### Scaling to Longer Documents

For larger texts, you can process entire documents directly from URLs with parallel processing and enhanced sensitivity:

```python
# Process Romeo & Juliet directly from Project Gutenberg
result = lx.extract(
    text_or_documents="https://www.gutenberg.org/files/1513/1513-0.txt",
    prompt_description=prompt,
    examples=examples,
    model_id="gemini-2.5-flash",
    extraction_passes=3,    # Improves recall through multiple passes
    max_workers=20,         # Parallel processing for speed
    max_char_buffer=1000,   # Smaller contexts for better accuracy
)
```
Multi-pass & caching: When `extraction_passes > 1`, the first pass uses normal caching behaviour, while subsequent passes include a `pass_num` keyword argument that providers can use to bypass response caches. The langextract-litellm provider does this automatically: passes ≥ 2 always hit the live LLM API.
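As an illustration, a custom provider (see Adding Custom Model Providers below) could honor this keyword roughly as follows. This is a sketch only: the base-class path and `infer` signature are assumptions, and the backend call is a stub.

```python
import langcore as lx


class CachingProvider(lx.inference.BaseLanguageModel):  # assumed base-class path
    """Sketch: serve cached responses on pass 1, go live on later passes."""

    def __init__(self, **kwargs):
        super().__init__()
        self._cache = {}

    def infer(self, batch_prompts, **kwargs):
        pass_num = kwargs.get("pass_num", 1)  # injected for passes >= 2
        for prompt in batch_prompts:
            if pass_num == 1 and prompt in self._cache:
                yield self._cache[prompt]  # first pass may reuse cached output
            else:
                response = self._call_backend(prompt)
                self._cache[prompt] = response
                yield response

    def _call_backend(self, prompt: str):
        raise NotImplementedError("call your model API here")
```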
This approach can extract hundreds of entities from full novels while maintaining high accuracy. The interactive visualization seamlessly handles large result sets, making it easy to explore hundreds of entities from the output JSONL file. See the full Romeo and Juliet extraction example → for detailed results and performance insights.
### Vertex AI Batch Processing

Save costs on large-scale tasks by enabling the Vertex AI Batch API: `language_model_params={"vertexai": True, "batch": {"enabled": True}}`.

See this example for a full Vertex AI Batch API walkthrough.
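For instance, combining this with the service-account parameters shown under API Key Setup below, the call might look like this (a sketch; the project ID and inputs are placeholders):

```python
import langcore as lx

# Enable the Vertex AI Batch API for a large-scale extraction job
result = lx.extract(
    text_or_documents=documents,  # placeholder: your documents or URLs
    prompt_description=prompt,
    examples=examples,
    model_id="gemini-2.5-flash",
    language_model_params={
        "vertexai": True,
        "project": "your-project-id",
        "batch": {"enabled": True},
    },
)
```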
## Schema-First Extraction with Pydantic

Instead of manually constructing `ExampleData` objects, you can define your extraction schema as a Pydantic model. LangCore will auto-generate the prompt and schema constraints for you.

```python
from pydantic import BaseModel, Field

import langcore as lx


class Invoice(BaseModel):
    invoice_number: str = Field(description="Invoice ID like INV-001")
    amount: float = Field(description="Total amount in dollars")
    due_date: str = Field(description="Due date in YYYY-MM-DD format")


result = lx.extract(
    text="Invoice INV-2024-789 for $3,450 is due April 20th, 2024",
    schema=Invoice,
    model_id="gemini-2.5-flash",
)

# Convert extractions back to typed Pydantic instances
invoices = result.to_pydantic(Invoice)
for inv in invoices:
    print(f"{inv.invoice_number}: ${inv.amount} due {inv.due_date}")
```
You can also combine schema with explicit examples for the best of both worlds — the Pydantic model defines the structure, and examples provide few-shot guidance:
```python
result = lx.extract(
    text="...",
    schema=Invoice,
    examples=[
        lx.data.ExampleData(
            text="Invoice INV-001 for $100 due Jan 1, 2024",
            extractions=[
                lx.data.Extraction(
                    extraction_class="Invoice",
                    extraction_text="INV-001",
                    attributes={"amount": "100.0", "due_date": "2024-01-01"},
                )
            ],
        )
    ],
    model_id="gemini-2.5-flash",
)
```
Tip: Use `lx.schema_from_pydantic(Invoice)` to inspect the auto-generated prompt and JSON schema before running extraction. Use `lx.schema_from_example({"name": "John", "age": 30})` to auto-generate a Pydantic model from a plain dict.
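For example (a sketch; the exact return types of these helpers aren't shown here, so the prints are illustrative):

```python
import langcore as lx

# Inspect the prompt/JSON schema auto-generated from the Pydantic model
generated = lx.schema_from_pydantic(Invoice)
print(generated)

# Auto-generate a Pydantic model class from a plain dict
Person = lx.schema_from_example({"name": "John", "age": 30})
print(Person.model_fields.keys())  # assuming a standard Pydantic v2 model
```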
## Confidence Scoring

Every extraction is automatically assigned a `confidence_score` between 0.0 and 1.0 after alignment. The score combines two signals:

- Alignment quality (70% weight): how well the extraction text matched the source (exact match = 1.0, lesser = 0.8, greater = 0.7, fuzzy = 0.5, unaligned = 0.2).
- Token overlap ratio (30% weight): the ratio of overlapping tokens between the extraction text and the matched source span.
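Assuming the straightforward weighted sum those percentages suggest (the weights are configurable), a fuzzy-matched extraction whose tokens all appear in the matched span would score:

```python
# Illustrative arithmetic only, using the default 70/30 weights from above
alignment_quality = 0.5  # fuzzy match, per the scale above
token_overlap = 1.0      # every extraction token appears in the source span

confidence = 0.7 * alignment_quality + 0.3 * token_overlap
print(confidence)  # 0.65
```

At runtime you read the score directly off each extraction: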
```python
result = lx.extract(
    text="Patient Jane Doe received Lisinopril for hypertension.",
    examples=[...],
    model_id="gemini-2.5-flash",
)

for extraction in result.extractions:
    print(
        f"{extraction.extraction_class}: {extraction.extraction_text} "
        f"(confidence: {extraction.confidence_score})"
    )

# Document-level average confidence
print(f"Average confidence: {result.average_confidence}")
```
For multi-pass extraction, confidence is further augmented by cross-pass appearance frequency — extractions confirmed across multiple passes receive higher scores.
## Extraction Hooks & Events

The `langcore.hooks` module provides a lightweight event system, inspired by Instructor hooks, for injecting custom logic at every stage of the extraction pipeline without modifying core code.

Lifecycle events:

| Event | Fires when | Payload keys |
|---|---|---|
| `extraction:start` | Pipeline begins (after components are built) | `text`, `examples`, `model_id` |
| `extraction:chunk` | A document chunk has been processed | `chunk_index`, `num_chunks`, `chunk_text`, `extractions` |
| `extraction:llm_call` | An LLM inference call completes | `prompt`, `response` |
| `extraction:alignment` | Extraction alignment is performed | `extractions` |
| `extraction:complete` | Pipeline finishes successfully | `result` |
| `extraction:error` | An exception is raised | `error` |
Quick example:
```python
import langcore as lx
from langcore.hooks import Hooks

hooks = Hooks()
hooks.on("extraction:start", lambda payload: print("Starting extraction…"))
hooks.on("extraction:llm_call", lambda payload: print("LLM responded"))
hooks.on("extraction:error", lambda payload: alert_team(payload["error"]))

result = lx.extract(
    text="Patient received Lisinopril 10mg daily.",
    examples=[...],
    model_id="gemini-2.5-flash",
    hooks=hooks,
)
```
Composing hooks: merge two `Hooks` instances with `+`:

```python
logging_hooks = Hooks().on("extraction:llm_call", log_llm_call)
metrics_hooks = Hooks().on("extraction:complete", record_metrics)

combined = logging_hooks + metrics_hooks
```
Callbacks are fault-tolerant: if a handler raises an exception it is logged and swallowed so it never breaks the extraction pipeline.
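So a buggy handler degrades to a log line rather than a failed run. A small sketch of that documented behavior (the failing lambda is deliberate):

```python
import langcore as lx
from langcore.hooks import Hooks

# This handler always raises, yet extraction still completes:
# the exception is logged and swallowed by the hook dispatcher.
noisy_hooks = Hooks().on("extraction:complete", lambda payload: 1 / 0)

result = lx.extract(
    text="Patient received Lisinopril 10mg daily.",
    examples=[...],
    model_id="gemini-2.5-flash",
    hooks=noisy_hooks,
)
```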
## Quality Metrics & Evaluation

The `langcore.evaluation` module provides built-in quality metrics for measuring extraction accuracy against ground truth. Compute precision, recall, F1, and accuracy at both the extraction level and the per-field level.

```python
from langcore.evaluation import ExtractionMetrics

# Quick static helpers
print(ExtractionMetrics.f1(predictions=results, ground_truth=expected))
print(ExtractionMetrics.precision(predictions=results, ground_truth=expected))
```
Full evaluation with per-field breakdown: pass a Pydantic schema for field-level metrics:

```python
from pydantic import BaseModel, Field

from langcore.evaluation import ExtractionMetrics


class Invoice(BaseModel):
    invoice_number: str = Field(description="Invoice ID")
    amount: str = Field(description="Total amount")
    due_date: str = Field(description="Due date YYYY-MM-DD")


metrics = ExtractionMetrics(schema=Invoice)
report = metrics.evaluate(predictions=results, ground_truth=expected)

print(report.f1)         # 0.92
print(report.per_field)  # {"invoice_number": FieldReport(...), "amount": ...}
```
Convenience function: `lx.evaluate()` wraps `ExtractionMetrics` for quick one-liners:

```python
import langcore as lx

report = lx.evaluate(predictions=results, ground_truth=expected, schema=Invoice)
```
The `EvaluationReport` includes:

- Aggregate `precision`, `recall`, `f1`, and `accuracy`
- `per_document`: a list of per-document metric dicts
- `per_field`: a dict of `FieldReport` objects with field-level P/R/F1 and support counts
- A `strict_attributes=True` mode for matching on attribute values (not just class + text)
## Installation

### From Source

LangCore uses modern Python packaging with `pyproject.toml` for dependency management:

```bash
git clone https://github.com/IgnatG/langcore.git
cd langcore

# For basic installation:
pip install -e .

# For development (includes linting tools):
pip install -e ".[dev]"

# For testing (includes pytest):
pip install -e ".[test]"
```

### Docker

```bash
docker build -t langcore .
docker run --rm -e LANGEXTRACT_API_KEY="your-api-key" langcore python your_script.py
```
## API Key Setup for Cloud Models

When using LangCore with cloud-hosted models (like Gemini or OpenAI), you'll need to set up an API key. On-device models don't require one. For developers using local LLMs, LangCore offers built-in support for Ollama and can be extended to other third-party APIs by updating the inference endpoints.

### API Key Sources

Get API keys from:

- AI Studio for Gemini models
- Vertex AI for enterprise use
- OpenAI Platform for OpenAI models
### Setting up an API key in your environment

**Option 1: Environment Variable**

```bash
export LANGEXTRACT_API_KEY="your-api-key-here"
```

**Option 2: .env File (Recommended)**

Add your API key to a `.env` file:

```bash
# Add API key to .env file
cat >> .env << 'EOF'
LANGEXTRACT_API_KEY=your-api-key-here
EOF

# Keep your API key secure
echo '.env' >> .gitignore
```
In your Python code:

```python
import langcore as lx

result = lx.extract(
    text_or_documents=input_text,
    prompt_description="Extract information...",
    examples=[...],
    model_id="gemini-2.5-flash",
)
```
**Option 3: Direct API Key (Not Recommended for Production)**

You can also provide the API key directly in your code, though this is not recommended for production use:

```python
result = lx.extract(
    text_or_documents=input_text,
    prompt_description="Extract information...",
    examples=[...],
    model_id="gemini-2.5-flash",
    api_key="your-api-key-here",  # Only use this for testing/development
)
```
**Option 4: Vertex AI (Service Accounts)**

Use Vertex AI for authentication with service accounts:

```python
result = lx.extract(
    text_or_documents=input_text,
    prompt_description="Extract information...",
    examples=[...],
    model_id="gemini-2.5-flash",
    language_model_params={
        "vertexai": True,
        "project": "your-project-id",
        "location": "global",  # or a regional endpoint
    },
)
```
## Adding Custom Model Providers

LangCore supports custom LLM providers via a lightweight plugin system, so you can add support for new models without changing core code:

- Add new model support independently of the core library
- Distribute your provider as a separate Python package
- Keep custom dependencies isolated
- Override or extend built-in providers via priority-based resolution

See the detailed guide in the Provider System Documentation to learn how to:

- Register a provider with `@registry.register(...)`
- Publish an entry point for discovery
- Optionally provide a schema with `get_schema_class()` for structured output
- Integrate with the factory via `create_model(...)`
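Putting those pieces together, a provider might look roughly like the sketch below. The import paths, decorator arguments, and `infer` signature are assumptions modeled on the registration steps above; check the Provider System Documentation for the authoritative shapes.

```python
import langcore as lx
from langcore.providers import registry  # assumed module path


@registry.register(r"^my-model")  # pattern of model_ids this provider claims
class MyProvider(lx.inference.BaseLanguageModel):  # assumed base-class path
    """Minimal custom-provider sketch."""

    def __init__(self, model_id: str, **kwargs):
        super().__init__()
        self.model_id = model_id

    def infer(self, batch_prompts, **kwargs):
        # One result per prompt; the return shape must match what
        # BaseLanguageModel expects (see the provider docs).
        for prompt in batch_prompts:
            yield self._call_backend(prompt)

    def _call_backend(self, prompt: str):
        raise NotImplementedError("wire up your model API here")
```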
## Using OpenAI Models

LangCore supports OpenAI models (requires the optional dependency: `pip install langcore[openai]`):

```python
import os

import langcore as lx

result = lx.extract(
    text_or_documents=input_text,
    prompt_description=prompt,
    examples=examples,
    model_id="gpt-4o",  # Automatically selects the OpenAI provider
    api_key=os.environ.get("OPENAI_API_KEY"),
    fence_output=True,
    use_schema_constraints=False,
)
```

Note: OpenAI models require `fence_output=True` and `use_schema_constraints=False` because LangCore doesn't implement schema constraints for OpenAI yet.
## Using Local LLMs with Ollama

LangCore supports local inference with Ollama, allowing you to run models without API keys:

```python
import langcore as lx

result = lx.extract(
    text_or_documents=input_text,
    prompt_description=prompt,
    examples=examples,
    model_id="gemma2:2b",  # Automatically selects the Ollama provider
    model_url="http://localhost:11434",
    fence_output=False,
    use_schema_constraints=False,
)
```

Quick setup: Install Ollama from ollama.com, run `ollama pull gemma2:2b`, then `ollama serve`.
For detailed installation, Docker setup, and examples, see examples/ollama/.
## More Examples

Additional examples of LangCore in action:

### Romeo and Juliet Full Text Extraction

LangCore can process complete documents directly from URLs. This example demonstrates extraction from the full text of Romeo and Juliet from Project Gutenberg (147,843 characters), showing parallel processing, sequential extraction passes, and performance optimization for long-document processing.

View Romeo and Juliet Full Text Example →

### Medication Extraction

Disclaimer: This demonstration is for illustrative purposes of LangCore's baseline capability only. It does not represent a finished or approved product, is not intended to diagnose or suggest treatment of any disease or condition, and should not be used for medical advice.

LangCore excels at extracting structured medical information from clinical text. These examples demonstrate both basic entity recognition (medication names, dosages, routes) and relationship extraction (connecting medications to their attributes), showing LangCore's effectiveness for healthcare applications.

### Radiology Report Structuring: RadExtract

Explore RadExtract, a live interactive demo on HuggingFace Spaces that shows how LangExtract can automatically structure radiology reports. Try it directly in your browser with no setup required.
## Ecosystem Plugins

LangCore has a growing plugin ecosystem:

| Plugin | Description |
|---|---|
| `langcore-rag` | RAG query parsing: converts natural language queries into semantic terms + structured filters |
| `langextract-litellm` | LiteLLM provider for 100+ LLM backends (OpenAI, Azure, Anthropic, etc.) |
| `langextract-guardrails` | Validation + retry loop with Pydantic, JSON Schema, confidence threshold, and consistency validators |
| `langextract-dspy` | DSPy prompt optimization (MIPROv2, GEPA) with evaluation and persist/load support |
| `langextract-audit` | Audit and compliance tooling for extraction pipelines |

For detailed instructions on creating a provider plugin, see the Custom Provider Plugin Example.
## Contributing
Contributions are welcome! See CONTRIBUTING.md to get started with development, testing, and pull requests.
Testing
# Install with test dependencies
pip install -e ".[test]"
# Run all tests
pytest tests
Or reproduce the full CI matrix locally with tox:
tox
Ollama Integration Testing
If you have Ollama installed locally, you can run integration tests:
# Test Ollama integration (requires Ollama running with gemma2:2b model)
tox -e ollama-integration
## Development

### Code Formatting

```bash
# Auto-format all code
./autoformat.sh

# Or run formatters separately
isort langcore tests --profile google --line-length 80
pyink langcore tests --config pyproject.toml
```

### Pre-commit Hooks

```bash
pre-commit install          # One-time setup
pre-commit run --all-files  # Manual run
```

### Linting

```bash
pylint --rcfile=.pylintrc langcore tests
```

See CONTRIBUTING.md for full development guidelines.
## License
Licensed under the Apache License, Version 2.0. See LICENSE for full terms.
This project includes code originally developed by Google LLC as LangExtract. See NOTICE for attribution details.
Happy Extracting!