A library to extract structured information from unstructured text using LLMs, powered by LangChain.

llmextract

llmextract is a small, pragmatic Python library for extracting structured information from unstructured text using large language models (LLMs). It aims to be stable and easy to integrate, and to produce grounded results: each extraction can carry the character interval it occupies in the source text. The library uses LangChain provider wrappers for model access and Pydantic models for typed, validated outputs.

Highlights

Multi-provider support (OpenRouter / OpenAI-compatible and Ollama) via LangChain.
Structured outputs represented with Pydantic models (Extraction, AnnotatedDocument).
Source grounding: extractions are aligned to character offsets in the source text.
Long-document support via chunking with configurable size & overlap.
Synchronous and asynchronous APIs.
Self-contained HTML visualization of extraction results.

Table of contents

Features
Installation
Environment variables
Quick start (sync)
Quick start (async)
Visualization
Demo CLI
Advanced usage and options
API reference
Data models
Troubleshooting & tips
Development / testing
License

Features

Multi-provider support: OpenRouter/OpenAI-compatible and Ollama.
Robust parsing and alignment with fallbacks and logging.
Configurable chunking and concurrency for large documents.
Logging at three verbosity levels; raw prompts and raw LLM outputs are visible at full verbosity (-vv).
Simple HTML visualization that shows extractions in context, with tooltips and a downloadable JSON export.
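
The "alignment with fallbacks" item can be pictured with a small standalone sketch. It follows the three stages named under Troubleshooting (plain search, flexible-whitespace regex, lightweight fuzzy fallback) but is not the library's implementation: the function name align, the min_ratio threshold, and the use of difflib are assumptions made for illustration.

```python
import re
from difflib import SequenceMatcher
from typing import Optional, Tuple


def align(source: str, snippet: str, min_ratio: float = 0.85) -> Optional[Tuple[int, int]]:
    """Return (start, end) offsets of snippet in source, or None if no stage matches."""
    if not snippet.strip():
        return None
    # Stage 1: exact substring search.
    pos = source.find(snippet)
    if pos != -1:
        return pos, pos + len(snippet)
    # Stage 2: case-insensitive search where any whitespace run matches any other.
    pattern = r"\s+".join(re.escape(token) for token in snippet.split())
    match = re.search(pattern, source, flags=re.IGNORECASE)
    if match:
        return match.start(), match.end()
    # Stage 3: fuzzy fallback. Slide a window of the snippet's length over the
    # source and keep the first window with the highest similarity ratio.
    width = len(snippet)
    best_ratio, best_start = 0.0, -1
    for start in range(0, max(1, len(source) - width + 1)):
        window = source[start:start + width].lower()
        ratio = SequenceMatcher(None, window, snippet.lower()).ratio()
        if ratio > best_ratio:
            best_ratio, best_start = ratio, start
    if best_ratio >= min_ratio:
        return best_start, min(len(source), best_start + width)
    return None
```

The fuzzy stage is quadratic in the worst case, which is acceptable for a per-chunk sketch but is one reason a dedicated matcher such as rapidfuzz is listed as an optional dependency.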

Installation

Recommended: install from PyPI:

pip install llmextract

Core runtime dependencies (declared in pyproject.toml):

python >= 3.10
pydantic >= 2.0
langchain, langchain-openai, langchain-ollama (provider integrations)
python-dotenv (for the demo)

Optional (recommended for development / improved matching):

tenacity (better retry primitives)
rapidfuzz (faster / more accurate fuzzy matching)
pytest, pytest-asyncio, ruff (development & testing)

You can add optional packages to your project or to pyproject.toml extras.

Environment variables

llmextract reads provider credentials and demo controls from environment variables:

OPENROUTER_API_KEY — API key used for OpenRouter / OpenAI-compatible providers.
OPENAI_API_KEY — alternative API key environment variable.
OLLAMA_BASE_URL — base URL for a remote or local Ollama server.
Demo-specific:
- LLME_FORCE_SIMULATE=1 — force demo to run in simulation mode (no network).
- LLME_MODELS — comma-separated default models used by the demo.
- LLME_VERBOSE — default verbosity (0,1,2) used by demo if CLI flag is not provided.
- LLME_CHUNK_SIZE, LLME_CHUNK_OVERLAP, LLME_RETRIES, LLME_RETRY_BACKOFF, LLME_MAX_WORKERS, LLME_MAX_CONCURRENCY — demo tuning variables.
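
As a rough illustration of how a script might consume the demo variables, the sketch below reads them into a plain dictionary. The helper read_demo_settings is hypothetical and is not part of llmextract; the fallback values are borrowed from the extract() defaults in the API reference, and whether demo.py uses the same defaults is an assumption.

```python
import os
from typing import Any, Dict, Mapping, Optional


def read_demo_settings(env: Optional[Mapping[str, str]] = None) -> Dict[str, Any]:
    """Collect LLME_* variables into a dict (hypothetical helper, not library API)."""
    env = os.environ if env is None else env
    # LLME_MODELS is comma-separated; drop empty entries and stray whitespace.
    models = [m.strip() for m in env.get("LLME_MODELS", "").split(",") if m.strip()]
    return {
        "simulate": env.get("LLME_FORCE_SIMULATE") == "1",
        "models": models,
        "verbose": int(env.get("LLME_VERBOSE", "0")),
        # Fallbacks below mirror the extract() defaults (assumed for the demo).
        "chunk_size": int(env.get("LLME_CHUNK_SIZE", "4000")),
        "chunk_overlap": int(env.get("LLME_CHUNK_OVERLAP", "200")),
        "retries": int(env.get("LLME_RETRIES", "2")),
        "retry_backoff": float(env.get("LLME_RETRY_BACKOFF", "1.0")),
    }
```

Passing a mapping instead of relying on os.environ keeps the helper easy to test without touching the real environment.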

Quick start — synchronous

Minimal synchronous example. This shows how to call extract() and read the results.

from dotenv import load_dotenv
from llmextract import extract, ExampleData, Extraction, configure_logging

load_dotenv()
# Optionally configure package-level logging (0=warning,1=info,2=debug)
configure_logging(1)

prompt = "Extract patient names, medications, and conditions."
examples = [
    ExampleData(
        text="Jane took 20mg of Zoloft for depression.",
        extractions=[
            Extraction(extraction_class="patient", extraction_text="Jane"),
            Extraction(extraction_class="medication", extraction_text="Zoloft"),
            Extraction(extraction_class="condition", extraction_text="depression"),
        ],
    )
]

text = "The patient, John Doe, was prescribed Lisinopril for hypertension."

result_doc = extract(
    text=text,
    prompt_description=prompt,
    examples=examples,
    model_name="mistralai/mistral-7b-instruct:free",
    provider_kwargs={"provider": "openrouter", "api_key": "<YOUR_KEY_HERE>"},
    chunk_size=1000,
    chunk_overlap=100,
    verbose=1,            # 0=quiet, 1=info, 2=debug (raw prompt & response)
    error_mode="return",  # or "raise"
    retries=2,
    retry_backoff=0.5,
    dedupe=True,
)

for ext in result_doc.extractions:
    print(ext.extraction_class, ":", ext.extraction_text)
    if ext.char_interval:
        print("  interval:", ext.char_interval.start, ext.char_interval.end)

Notes:

verbose=2 (equivalently, calling configure_logging(2) or running the demo with -vv) logs the full prompt and the raw LLM response at DEBUG level, which is useful for debugging model behavior.
If error_mode="return", extract() collects per-chunk errors into result_doc.metadata["errors"] instead of raising on the first failure.

Quick start — asynchronous

Use the async entry-point aextract() for concurrency-friendly environments.

import asyncio
from llmextract import aextract, ExampleData, Extraction

async def run():
    examples = [ ... ]  # same as above
    doc = await aextract(
        text="...",
        prompt_description="...",
        examples=examples,
        model_name="ollama-model",
        provider_kwargs={"provider": "ollama", "ollama_base_url": "http://localhost:11434"},
        max_concurrency=4,
        verbose=2,
    )
    print(len(doc.extractions))

asyncio.run(run())
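
The effect of max_concurrency can be sketched with a semaphore that caps how many chunk requests are in flight at once. This is a generic asyncio pattern shown under the assumption that aextract() behaves similarly; map_bounded and the worker callable are hypothetical names, not library API.

```python
import asyncio
from typing import Awaitable, Callable, List, Sequence


async def map_bounded(
    worker: Callable[[str], Awaitable[str]],
    chunks: Sequence[str],
    max_concurrency: int,
) -> List[str]:
    """Run worker over chunks with at most max_concurrency calls active at a time."""
    semaphore = asyncio.Semaphore(max_concurrency)

    async def guarded(chunk: str) -> str:
        # Each task waits here until one of the max_concurrency slots is free.
        async with semaphore:
            return await worker(chunk)

    # gather() returns results in the order of the input chunks.
    return await asyncio.gather(*(guarded(chunk) for chunk in chunks))
```

Because results come back in input order, per-chunk outputs can be mapped to their positions in the source text regardless of which request finishes first.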

Visualization

visualize() produces a self-contained HTML string that you can save and open in a browser.

from llmextract import visualize

html = visualize(result_doc)            # single document
# or visualize([doc1, doc2]) for multiple
with open("report.html", "w", encoding="utf-8") as fh:
    fh.write(html)

The generated HTML:

Lets you select document and model (if multiple were provided).
Shows a metadata panel and a legend for extraction classes.
Highlights extracted spans in the text and shows attributes in tooltips.
Includes a "Download JSON" button with the underlying serialized results.

Security note: the visualization HTML escapes content and sanitizes script-closing sequences, but you should only open generated HTML from trusted sources.
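
To make the security note concrete, here is a minimal sketch of the two techniques it names: HTML-escaping displayed text and neutralizing script-closing sequences in embedded JSON. It is not the code used by visualize(); the function names and the element id are made up for illustration.

```python
import html
import json
from typing import Any, Dict


def embed_json_for_script(payload: Dict[str, Any]) -> str:
    """Serialize payload so it cannot terminate an enclosing <script> element."""
    # "\/" is a valid JSON escape for "/", so the data still parses to the
    # same value while "</script>" never appears literally.
    return json.dumps(payload).replace("</", "<\\/")


def render_snippet(text: str, payload: Dict[str, Any]) -> str:
    """Build a tiny HTML fragment with escaped text and embedded JSON data."""
    return (
        "<pre>" + html.escape(text) + "</pre>\n"
        '<script type="application/json" id="data">'
        + embed_json_for_script(payload)
        + "</script>"
    )
```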

Demo CLI

A ready-to-run demo is provided at demo.py. Example usage:

# run demo with default verbosity (info-level suppressed)
python demo.py

# increase verbosity to INFO
python demo.py -v

# full verbosity (DEBUG): logs full prompt + raw LLM response
python demo.py -vv

# force the demo to work offline (use the built-in simulator)
LLME_FORCE_SIMULATE=1 python demo.py -vv

Demo environment variables (optional):

LLME_FORCE_SIMULATE=1 — always use the deterministic simulator (no network).
LLME_MODELS — comma-separated models used by the demo.
LLME_OUTPUT_FILE — path for the generated visualization HTML (default llmextract_report.html).

Advanced usage & options

When calling extract() / aextract() you can tune behavior:

chunking
- chunk_size (default 4000): characters per chunk
- chunk_overlap (default 200)
concurrency / workers
- max_workers (sync) and max_concurrency (async)
error handling
- error_mode: "return" (collect per-chunk errors in metadata) or "raise" (raise on first chunk failure)
- retries and retry_backoff (simple retry/backoff per chunk)
verbose
- 0: minimal (warnings and above)
- 1: info-level package logs
- 2: debug — logs full prompt and raw LLM outputs for each chunk (useful for debugging)
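
How chunk_size and chunk_overlap interact is easiest to see in code. The sketch below is a plain character-based splitter written for illustration; chunk_text is a hypothetical name and the library's own chunker may split differently (for example, at sentence boundaries).

```python
from typing import List, Tuple


def chunk_text(text: str, chunk_size: int = 4000, chunk_overlap: int = 200) -> List[Tuple[int, str]]:
    """Split text into (start_offset, chunk) pairs that overlap by chunk_overlap characters."""
    if chunk_size <= 0 or not 0 <= chunk_overlap < chunk_size:
        raise ValueError("need chunk_size > 0 and 0 <= chunk_overlap < chunk_size")
    step = chunk_size - chunk_overlap  # distance between consecutive chunk starts
    chunks: List[Tuple[int, str]] = []
    for start in range(0, len(text), step):
        chunks.append((start, text[start:start + chunk_size]))
        if start + chunk_size >= len(text):
            break  # this chunk already reaches the end of the text
    return chunks
```

Keeping the start offset with each chunk is what lets a chunk-local match be shifted back to a character interval in the full document. The overlap reduces the chance that an entity straddling a boundary is missed, at the cost of possible duplicates.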

Returned AnnotatedDocument.metadata will include operational info:

model_name, chunk_size, chunk_overlap, num_chunks, num_extractions
provider kwargs merged in (if supplied)
errors key when error_mode="return" and chunks failed
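
The interplay of retries, retry_backoff, and error_mode can be sketched as a per-chunk retry loop. Everything here is an assumption for illustration: run_chunks is a hypothetical helper, the exponential backoff schedule is one plausible reading of "simple retry/backoff", and the shape of each error entry is invented (the actual contents of metadata["errors"] are not documented above).

```python
import time
from typing import Any, Callable, Dict, List, Sequence, Tuple


def run_chunks(
    chunks: Sequence[str],
    worker: Callable[[str], Any],
    retries: int = 2,
    retry_backoff: float = 1.0,
    error_mode: str = "return",
    sleep: Callable[[float], None] = time.sleep,
) -> Tuple[List[Any], List[Dict[str, Any]]]:
    """Process chunks with retries; collect or raise failures depending on error_mode."""
    results: List[Any] = []
    errors: List[Dict[str, Any]] = []
    for index, chunk in enumerate(chunks):
        for attempt in range(retries + 1):
            try:
                results.append(worker(chunk))
                break
            except Exception as exc:
                if attempt < retries:
                    # Assumed schedule: backoff doubles after each failed attempt.
                    sleep(retry_backoff * (2 ** attempt))
                    continue
                if error_mode == "raise":
                    raise
                # Hypothetical error record; the real entry format may differ.
                errors.append({"chunk_index": index, "error": str(exc)})
    return results, errors
```

With error_mode="return", a caller would check whether the errors list is empty before trusting that every chunk contributed extractions.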

Dedupe: set dedupe=True to perform a simple deduplication pass (by extraction class and normalized text).
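
A deduplication pass keyed on class and normalized text might look like the sketch below. It works on plain dictionaries rather than the library's Extraction model, and the normalization used (collapsing whitespace and case-folding) is an assumption, since the exact rule is not specified above.

```python
from typing import Dict, Iterable, List, Set, Tuple


def dedupe(extractions: Iterable[Dict[str, str]]) -> List[Dict[str, str]]:
    """Keep the first extraction for each (class, normalized text) pair."""
    seen: Set[Tuple[str, str]] = set()
    unique: List[Dict[str, str]] = []
    for ext in extractions:
        # Assumed normalization: collapse whitespace runs and ignore case.
        normalized = " ".join(ext["extraction_text"].split()).casefold()
        key = (ext["extraction_class"], normalized)
        if key not in seen:
            seen.add(key)
            unique.append(ext)
    return unique
```

Duplicates typically arise when the same entity falls inside the overlap between two adjacent chunks.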

Provider configuration

Pass provider_kwargs to extract() / aextract() to control provider and connection details:
- provider: "openrouter" or "ollama" (if omitted, provider is inferred)
- api_key: for OpenRouter/OpenAI-compatible providers
- base_url: OpenRouter base URL (example: "https://openrouter.ai/api/v1")
- ollama_base_url: base URL for Ollama (default: "http://localhost:11434")
- default_headers: additional headers for OpenRouter/OpenAI-compatible clients

Example provider usage:

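import os

from llmextract import extract

result = extract(
    ...,  # text, prompt_description, and examples as in the quick start
    model_name="mistralai/mistral-7b-instruct:free",
    provider_kwargs={"provider": "openrouter", "api_key": os.getenv("OPENROUTER_API_KEY")},
)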

API reference (summary)

extract(text, prompt_description, examples, model_name, provider_kwargs=None, chunk_size=4000, chunk_overlap=200, max_workers=10, verbose=0, error_mode="return", retries=2, retry_backoff=1.0, dedupe=False) -> AnnotatedDocument
aextract(text, prompt_description, examples, model_name, provider_kwargs=None, chunk_size=4000, chunk_overlap=200, max_concurrency=None, verbose=0, error_mode="return", retries=2, retry_backoff=1.0, dedupe=False) -> AnnotatedDocument (async)
visualize(AnnotatedDocument | List[AnnotatedDocument]) -> str (HTML)
configure_logging(verbosity: int = 0) — convenience to set llmextract package logging level:
- 0 => WARNING
- 1 => INFO
- 2 => DEBUG
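
The verbosity mapping above corresponds to standard library logging levels. The sketch below shows an equivalent mapping using only the logging module; set_package_verbosity is a hypothetical stand-in for configure_logging(), and both the logger name "llmextract" and the clamping of out-of-range values are assumptions.

```python
import logging

# 0 => WARNING, 1 => INFO, 2 => DEBUG, as listed in the API summary.
_LEVELS = {0: logging.WARNING, 1: logging.INFO, 2: logging.DEBUG}


def set_package_verbosity(verbosity: int = 0, logger_name: str = "llmextract") -> logging.Logger:
    """Set the named logger's level from a 0-2 verbosity value (clamped)."""
    level = _LEVELS[max(0, min(2, verbosity))]
    logger = logging.getLogger(logger_name)
    logger.setLevel(level)
    return logger
```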

Data models

All models are Pydantic v2 models exported from the package:

Extraction
- extraction_class: str — label or category (e.g., "medication")
- extraction_text: str — the exact extracted substring
- attributes: Dict[str, Any] — optional structured attributes
- char_interval: Optional[CharInterval] — optional character interval (start inclusive, end exclusive)
CharInterval
- start: int — inclusive start index
- end: int — exclusive end index (must be > start)
ExampleData
- text: str — example text
- extractions: List[Extraction] — correct extractions for the example
AnnotatedDocument
- text: str — original document text
- extractions: List[Extraction]
- metadata: Dict[str, Any] — operation metadata (e.g., model_name, chunk sizes, provider info)
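
The half-open interval convention (start inclusive, end exclusive) matches Python slicing, so a grounded extraction can be checked with text[start:end]. The sketch below uses standard-library dataclasses as stand-ins for the Pydantic models so that it runs without the package installed; the field names follow the list above, while is_grounded and the start >= 0 check are assumptions.

```python
from dataclasses import dataclass, field
from typing import Any, Dict, Optional


@dataclass(frozen=True)
class CharInterval:
    start: int  # inclusive
    end: int    # exclusive

    def __post_init__(self) -> None:
        # The docs require end > start; rejecting negative starts is an assumption.
        if self.start < 0 or self.end <= self.start:
            raise ValueError("expected 0 <= start < end")


@dataclass
class Extraction:
    extraction_class: str
    extraction_text: str
    attributes: Dict[str, Any] = field(default_factory=dict)
    char_interval: Optional[CharInterval] = None


def is_grounded(source: str, ext: Extraction) -> bool:
    """True when the interval slices out exactly the extracted text."""
    if ext.char_interval is None:
        return False
    span = source[ext.char_interval.start:ext.char_interval.end]
    return span == ext.extraction_text
```

This check is stricter than the library's alignment: an extraction aligned through whitespace-flexible or fuzzy matching can have a valid interval whose slice differs slightly from extraction_text.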

Troubleshooting & tips

Empty visualization dropdown / no results:
- Confirm you passed one or more AnnotatedDocument(s) to visualize() and that each has text and metadata. The visualizer defaults doc_id to Document_1 and model_name to Unknown_Model if missing, so check the generated JSON in the HTML (open "View source").
Debugging model outputs:
- Use verbose=2 (or configure_logging(2) / demo -vv) to see the full prompt and raw LLM output for each chunk.
Alignment failures:
- Alignment uses normalized text searches, flexible whitespace regex, and a lightweight fuzzy fallback. If many extractions fail to align, try:
  - Improving examples in prompts to match expected formats
  - Inspecting raw LLM output (verbose=2) for unexpected formatting (e.g., extra prose, missing fields)
Prompt engineering:
- Encourage the model to respond strictly with JSON. We instruct models to return only a single JSON object with an "extractions" key, but some models still emit human explanations — use verbose logging to inspect and adjust.
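
When a model wraps its JSON in explanatory prose, a tolerant parser can still recover the payload. The sketch below scans the raw response for the first JSON object that has an "extractions" list. It is offered as an illustration of the idea, not as the library's parser, and parse_extractions is a hypothetical name.

```python
import json
from typing import Any, Dict, List


def parse_extractions(raw: str) -> List[Dict[str, Any]]:
    """Return the "extractions" list from the first matching JSON object in raw."""
    decoder = json.JSONDecoder()
    for pos, char in enumerate(raw):
        if char != "{":
            continue
        try:
            # raw_decode parses one JSON value starting at pos and ignores trailing text.
            obj, _ = decoder.raw_decode(raw, pos)
        except json.JSONDecodeError:
            continue
        if isinstance(obj, dict) and isinstance(obj.get("extractions"), list):
            return obj["extractions"]
    return []  # nothing usable found
```

Returning an empty list on failure keeps a single malformed chunk from aborting a long document, though a caller would probably also want to log the raw text for inspection.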

Development & testing

Recommended dev dependencies (add to project.optional-dependencies or your dev env):

ruff (linting)
pytest & pytest-asyncio (tests)
rapidfuzz (optional fuzzy matching)
tenacity (optional retries)

Suggested local workflow:

Create a virtual environment.
Install dev deps: pip install -e ".[dev]" or pip install ruff pytest pytest-asyncio.
Run tests: pytest.
Lint: ruff check . (the single dot is the path to lint, here the current directory).

I encourage contributions — open an issue to discuss changes.

Roadmap / future work

Optional persistence of raw LLM responses to metadata (opt-in).
Pluggable alignment strategies (configurable fuzzy match engines).
Additional provider adapters (OpenAI official SDK adapters, other local engines).
More advanced visualization options (stacked overlapping spans, multiple highlight styles).

License

Apache-2.0 — see the LICENSE file for details.

Release history

0.2.0 (this version): Oct 26, 2025
0.1.0: Oct 21, 2025

Download files

Source Distribution

llmextract-0.2.0.tar.gz (31.4 kB view details)

Uploaded Oct 26, 2025 Source

Built Distribution

llmextract-0.2.0-py3-none-any.whl (28.2 kB view details)

Uploaded Oct 26, 2025 Python 3

File details

Details for the file llmextract-0.2.0.tar.gz.

File metadata

Download URL: llmextract-0.2.0.tar.gz
Upload date: Oct 26, 2025
Size: 31.4 kB
Tags: Source
Uploaded using Trusted Publishing? No
Uploaded via: uv/0.9.5

File hashes

Hashes for llmextract-0.2.0.tar.gz
Algorithm	Hash digest
SHA256	`d2fcd2c85a81789aaef8647fde8989ab12c602e062836093e65d03e5868d09f8`
MD5	`c61538f45ec51037a737146d48306080`
BLAKE2b-256	`42ede7b7192c787b398fcc2bc766f610048a150eab84a217ebe8dd06b2fc31f4`

File details

Details for the file llmextract-0.2.0-py3-none-any.whl.

File metadata

Download URL: llmextract-0.2.0-py3-none-any.whl
Upload date: Oct 26, 2025
Size: 28.2 kB
Tags: Python 3
Uploaded using Trusted Publishing? No
Uploaded via: uv/0.9.5

File hashes

Hashes for llmextract-0.2.0-py3-none-any.whl
Algorithm	Hash digest
SHA256	`743541931c38916e7c918267179dab885e1b4a3f704d16f7abce181e54ef6b1e`
MD5	`0cd5c3c511454fec72a1ba0a28d943a0`
BLAKE2b-256	`80a0bc04ec4e7f1017f4159f0aa7e7d985d465e115842e5576e25e7d620300d2`
