A lightweight library for extracting structured data from language models
LangExtract (Lightweight Fork)
A lightweight fork of Google's langextract optimized for serverless deployment (AWS Lambda, Cloud Functions, etc.).
What's Different in This Fork?
This fork removes heavy dependencies and unused features:
| Removed | Reason |
|---|---|
| numpy, pandas | Only used in visualization/benchmarks and CSV reading |
| google-genai | Users provide their own providers |
| google-cloud-storage | Batch processing removed |
| absl-py | Replaced with stdlib logging |
| tqdm | Progress bars removed |
| requests, aiohttp | URL fetching removed |
Key changes:
- No built-in providers (Gemini, OpenAI, Ollama) - you register your own
- No visualization module
- No URL fetching - pass text directly
- No progress bars
- Stdlib logging with optional logger injection
Installation
pip install bioscope-langextract
Or with uv:
uv add bioscope-langextract
Quick Start
1. Create a Custom Provider
Since this fork has no built-in providers, you must register your own:
import langextract as lx
from langextract.core.base_model import BaseLanguageModel
from langextract.core.types import ScoredOutput
from langextract.providers import router

@router.register(r"^my-model", priority=100)
class MyProvider(BaseLanguageModel):
    """Custom provider that wraps your LLM API."""

    def infer(self, batch_prompts, **kwargs):
        # Implement your LLM call here
        for prompt in batch_prompts:
            response = call_your_llm(prompt)  # Your implementation
            yield [ScoredOutput(score=1.0, output=response)]
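The provider above is a generator that yields one list of scored outputs per prompt, in input order. A minimal offline sketch of that shape can be handy in unit tests. `FakeScoredOutput` and `EchoProvider` are hypothetical stand-ins written for this illustration, not part of the library, so the snippet runs without langextract installed:

```python
from dataclasses import dataclass
from typing import Iterator, List, Sequence


@dataclass
class FakeScoredOutput:
    """Hypothetical stand-in for langextract's ScoredOutput."""
    score: float
    output: str


class EchoProvider:
    """Hypothetical offline provider that returns a canned response per prompt."""

    def __init__(self, canned_response: str) -> None:
        self.canned_response = canned_response

    def infer(
        self, batch_prompts: Sequence[str], **kwargs
    ) -> Iterator[List[FakeScoredOutput]]:
        # Yield one list of candidates per prompt, preserving input order.
        for _prompt in batch_prompts:
            yield [FakeScoredOutput(score=1.0, output=self.canned_response)]


provider = EchoProvider("stub response")
results = list(provider.infer(["first prompt", "second prompt"]))
```

In a real provider, the class would subclass `BaseLanguageModel` and be registered with `router.register` as shown above; only the body of `infer` changes.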
2. Define Your Extraction Task
import textwrap

prompt = textwrap.dedent("""\
    Extract characters, emotions, and relationships in order of appearance.
    Use exact text for extractions. Do not paraphrase or overlap entities.
    Provide meaningful attributes for each entity to add context.""")

examples = [
    lx.data.ExampleData(
        text="ROMEO. But soft! What light through yonder window breaks?",
        extractions=[
            lx.data.Extraction(
                extraction_class="character",
                extraction_text="ROMEO",
                attributes={"emotional_state": "wonder"},
            ),
        ],
    )
]
3. Run the Extraction
input_text = "Lady Juliet gazed longingly at the stars, her heart aching for Romeo"

result = lx.extract(
    text_or_documents=input_text,
    prompt_description=prompt,
    examples=examples,
    model_id="my-model",  # Matches your registered pattern
)
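Once the call returns, you will usually want to walk the extractions. The sketch below assumes the returned object exposes an `extractions` list whose items carry the same fields as `lx.data.Extraction`, as upstream langextract does; that is an assumption for this fork, so check it against your installed version. `summarize` is a hypothetical helper, and the `SimpleNamespace` object is a hand-built stand-in, not real library output:

```python
from types import SimpleNamespace


def summarize(result):
    """Return (class, text, attributes) tuples for each extraction."""
    rows = []
    # Tolerate a missing or None `extractions` attribute.
    for extraction in getattr(result, "extractions", None) or []:
        rows.append((
            extraction.extraction_class,
            extraction.extraction_text,
            dict(extraction.attributes or {}),
        ))
    return rows


# Stand-in shaped like the assumed result object (illustrative values only).
fake_result = SimpleNamespace(extractions=[
    SimpleNamespace(
        extraction_class="character",
        extraction_text="Lady Juliet",
        attributes={"emotional_state": "longing"},
    ),
])
rows = summarize(fake_result)
```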
4. Optional: Inject a Custom Logger
For serverless environments, you can inject your own logger:
import logging

logger = logging.getLogger("my-app")

result = lx.extract(
    text_or_documents=input_text,
    prompt_description=prompt,
    examples=examples,
    model_id="my-model",
    logger=logger,
)
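What the injected logger looks like is up to you. A minimal stdlib-only sketch of one way to build it, assuming you want single-line records on a stream that your platform captures; `build_logger` is a hypothetical helper, and the `io.StringIO` buffer stands in for `sys.stdout` so the output can be inspected:

```python
import io
import logging


def build_logger(name, stream, level=logging.INFO):
    """Create a logger that writes one formatted line per record to `stream`."""
    logger = logging.getLogger(name)
    logger.setLevel(level)
    logger.propagate = False  # avoid duplicate lines via the root logger
    if not logger.handlers:  # warm starts may run this code more than once
        handler = logging.StreamHandler(stream)
        handler.setFormatter(
            logging.Formatter("%(levelname)s %(name)s %(message)s")
        )
        logger.addHandler(handler)
    return logger


buffer = io.StringIO()  # in a deployed function this would typically be sys.stdout
logger = build_logger("my-app", buffer)
logger.info("extraction started")
logger.debug("suppressed at INFO level")
captured = buffer.getvalue()
```

The resulting `logger` can then be passed as `logger=logger` to `lx.extract()` exactly as in the example above.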
Provider Registration
Register providers using regex patterns:
from langextract.core.base_model import BaseLanguageModel
from langextract.providers import router

# Decorator style
@router.register(r"^claude-", r"^anthropic/", priority=100)
class ClaudeProvider(BaseLanguageModel):
    ...

# Or register after definition
# (OpenAIProvider is a BaseLanguageModel subclass you define elsewhere)
router.register(r"^gpt-")(OpenAIProvider)

# Lazy registration (defers import)
router.register_lazy(
    r"^bedrock/",
    target="my_package.providers:BedrockProvider",
    priority=50,
)
Higher priority wins when multiple patterns match.
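To make the priority rule concrete, here is a small stdlib-only sketch of a regex registry with priorities and lazy `module:attribute` targets. It illustrates the idea only and is not the library's router: `register`, `resolve`, and the provider classes are hypothetical, and tie-breaking between equal priorities (first registered wins here) is an assumption the text above does not specify.

```python
import importlib
import re

_REGISTRY = []  # entries are (priority, compiled_pattern, target)


def register(*patterns, target, priority=0):
    """Associate one or more regex patterns with a target class or import path."""
    for pattern in patterns:
        _REGISTRY.append((priority, re.compile(pattern), target))


def resolve(model_id):
    """Return the target of the highest-priority pattern that matches model_id."""
    matches = [
        (priority, target)
        for priority, regex, target in _REGISTRY
        if regex.search(model_id)
    ]
    if not matches:
        raise ValueError(f"No provider registered for {model_id!r}")
    _, target = max(matches, key=lambda item: item[0])
    if isinstance(target, str):
        # Lazy entry: "module:attribute" is imported only when it is needed.
        module_name, _, attribute = target.partition(":")
        target = getattr(importlib.import_module(module_name), attribute)
    return target


class GenericProvider:
    pass


class ClaudeProvider:
    pass


register(r"^claude-", r"^anthropic/", target=ClaudeProvider, priority=100)
register(r".*", target=GenericProvider, priority=1)
# A stdlib class stands in for a lazily imported provider.
register(r"^lazy/", target="collections:OrderedDict", priority=50)
```

With these registrations, `resolve("claude-3")` matches both `^claude-` and `.*` and returns `ClaudeProvider` because 100 beats 1, while an unrelated ID falls through to `GenericProvider`.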
API Reference
lx.extract()
Main extraction function:
lx.extract(
    text_or_documents,     # str or list of documents
    prompt_description,    # Task description
    examples,              # List of ExampleData
    model_id,              # Model ID (matched against registered patterns)
    *,
    extraction_passes=1,   # Number of extraction passes
    max_workers=1,         # Parallel workers for chunked processing
    max_char_buffer=4000,  # Chunk size for long documents
    logger=None,           # Optional custom logger
    **kwargs
)
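To give a feel for how `max_char_buffer` and `max_workers` interact, the sketch below splits a text into chunks of at most `max_char_buffer` characters, preferring to break at a space, and then processes the chunks in a thread pool. This is an illustration under stated assumptions, not the library's chunker, which may well respect sentence or token boundaries; `chunk_text` is a hypothetical helper and `len` stands in for a per-chunk model call.

```python
from concurrent.futures import ThreadPoolExecutor


def chunk_text(text, max_char_buffer=4000):
    """Split text into consecutive chunks of at most max_char_buffer characters."""
    if max_char_buffer <= 0:
        raise ValueError("max_char_buffer must be positive")
    chunks = []
    start = 0
    while start < len(text):
        end = min(start + max_char_buffer, len(text))
        if end < len(text):
            # Prefer to break just after the last space inside the window.
            split = text.rfind(" ", start, end)
            if split > start:
                end = split + 1
        chunks.append(text[start:end])
        start = end
    return chunks


chunks = chunk_text("alpha beta gamma delta", max_char_buffer=12)

# Process chunks in parallel; map() returns results in chunk order.
with ThreadPoolExecutor(max_workers=2) as pool:
    lengths = list(pool.map(len, chunks))
```

Concatenating the chunks reproduces the original text, so character offsets in a chunk can be mapped back to the source by adding the chunk's start position.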
Data Classes
# Example for few-shot prompting
lx.data.ExampleData(
    text="...",
    extractions=[...]
)

# Single extraction
lx.data.Extraction(
    extraction_class="category",
    extraction_text="exact text from source",
    attributes={"key": "value"}
)
License
This is a modified fork of langextract by Google LLC, licensed under the Apache License 2.0. This fork removes heavy dependencies (numpy, pandas, google-genai, etc.) for serverless deployment environments.
See LICENSE for the full license text.
Disclaimer
This is not an officially supported Google product. Use is subject to the Apache 2.0 License.