DELM: Data Extraction with Language Models
DELM is a Python toolkit for extracting structured data from unstructured text using language models.
Features
- Multiple input formats: TXT, HTML, MD, DOCX, PDF, CSV, Excel, Parquet, Feather
- Flexible schemas: Simple key-value → nested objects → multiple schemas
- Multiple LLM providers: OpenAI, Anthropic, Google, Groq, Together AI, Fireworks AI
- Cost management: Automatic cost tracking, caching, and budget limits
- Built for scale: Batch processing with parallel execution and checkpointing
Installation
pip install delm
Quick Start
Define your extraction schema and extract structured data in just a few lines:
from delm import DELM, Schema, ExtractionVariable

# Define what to extract
schema = Schema.simple(
    variables_list=[
        ExtractionVariable(
            name="company",
            description="Company name mentioned",
            data_type="string",
            required=True,
        ),
        ExtractionVariable(
            name="price",
            description="Price value if mentioned",
            data_type="number",
            required=False,
        ),
    ]
)

# Initialize and extract
delm = DELM(
    schema=schema,
    provider="openai",
    model="gpt-4o-mini",
)

# Extract from any supported file format
results = delm.extract("data/earnings_calls.txt")
print(results)

# Check costs
print(delm.get_cost_summary())
Schema Types
DELM supports three schema types for different extraction needs:
Simple Schema
Extract key-value pairs from text:
schema = Schema.simple(
    variables_list=[
        ExtractionVariable(name="author", data_type="string"),
        ExtractionVariable(name="date", data_type="date"),
    ]
)
Nested Schema
Extract lists of structured objects:
schema = Schema.nested(
    container_name="products",
    variables_list=[
        ExtractionVariable(name="name", data_type="string"),
        ExtractionVariable(name="price", data_type="number"),
        ExtractionVariable(name="features", data_type="[string]"),
    ]
)
Multiple Schemas
Extract multiple different schemas simultaneously:
schema = Schema.multiple({
    "companies": Schema.nested(
        container_name="companies",
        variables_list=[...],
    ),
    "products": Schema.nested(
        container_name="products",
        variables_list=[...],
    ),
})
Supported Data Types
| Type | Description | Example |
|---|---|---|
| `string` | Text values | "Apple Inc." |
| `number` | Floating-point | 150.5 |
| `integer` | Whole numbers | 2024 |
| `boolean` | True/False | true |
| `date` | Date strings | "2025-09-15" |
| `[string]` | List of strings | ["oil", "gas"] |
| `[number]` | List of numbers | [100, 200] |
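To make the table concrete, the sketch below checks plain Python values against the type names listed above. It is an illustration only and not DELM's own validation logic: `matches_type` is a hypothetical helper, and it assumes that `date` means an ISO 8601 `YYYY-MM-DD` string, as in the example column.

```python
from datetime import date


def matches_type(value, data_type):
    """Return True if `value` fits one of the type names from the table.

    Hypothetical helper for illustration; not part of DELM.
    """
    if data_type.startswith("[") and data_type.endswith("]"):
        # List types such as "[string]" wrap an element type.
        inner = data_type[1:-1]
        return isinstance(value, list) and all(matches_type(v, inner) for v in value)
    if data_type == "string":
        return isinstance(value, str)
    if data_type == "boolean":
        return isinstance(value, bool)
    if data_type == "integer":
        # bool is a subclass of int in Python, so exclude it explicitly.
        return isinstance(value, int) and not isinstance(value, bool)
    if data_type == "number":
        return isinstance(value, (int, float)) and not isinstance(value, bool)
    if data_type == "date":
        # Assumption: dates arrive as ISO 8601 strings.
        if not isinstance(value, str):
            return False
        try:
            date.fromisoformat(value)
        except ValueError:
            return False
        return True
    raise ValueError(f"unknown data type: {data_type}")


print(matches_type(["oil", "gas"], "[string]"))
print(matches_type("next week", "date"))
```

A check like this can be useful when post-processing extracted records, for example to flag rows where a model returned free text in a `date` field.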
Advanced Features
Custom Prompts
delm = DELM(
    schema=schema,
    provider="openai",
    model="gpt-4o-mini",
    prompt_template="""You are a financial data extraction expert.
Extract the following information:
{variables}
Text to analyze:
{text}""",
)
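The template above contains two placeholders, `{variables}` and `{text}`. This README does not say how DELM fills them, so the following is only a sketch under the assumption that they behave like standard Python format fields. `render_prompt` and the one-line-per-variable layout are hypothetical.

```python
PROMPT_TEMPLATE = """You are a financial data extraction expert.
Extract the following information:
{variables}
Text to analyze:
{text}"""


def render_prompt(template, variables, text):
    """Fill the {variables} and {text} placeholders (hypothetical helper)."""
    # One bullet per variable: "- name (type): description".
    lines = [f"- {name} ({dtype}): {desc}" for name, dtype, desc in variables]
    return template.format(variables="\n".join(lines), text=text)


prompt = render_prompt(
    PROMPT_TEMPLATE,
    variables=[
        ("company", "string", "Company name mentioned"),
        ("price", "number", "Price value if mentioned"),
    ],
    text="Apple Inc. shares closed at 150.5.",
)
print(prompt)
```

If the placeholders do work this way, any literal braces in a custom template (for instance a JSON example) would need to be doubled as `{{` and `}}`.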
Process CSV/Structured Data
delm = DELM(
    schema=schema,
    provider="openai",
    model="gpt-4o-mini",
    target_column="transcript_text",  # Column containing text to process
)
results = delm.extract("earnings_data.csv")
Cost Tracking & Limits
delm = DELM(
    schema=schema,
    provider="openai",
    model="gpt-4o-mini",
    track_cost=True,
    max_budget=10.0,  # Stop if cost exceeds $10
)
results = delm.extract("data.txt")
summary = delm.get_cost_summary()
print(f"Total cost: ${summary['total_cost']:.2f}")
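For intuition about what a budget limit involves, here is a minimal standalone sketch of a running cost total with a hard cap. It is not DELM's implementation: `CostTracker` and `BudgetExceeded` are hypothetical names, and the per-million-token prices are placeholders rather than real provider pricing.

```python
class BudgetExceeded(RuntimeError):
    """Raised when accumulated cost passes the configured limit."""


class CostTracker:
    """Hypothetical cost tracker for illustration; not part of DELM."""

    def __init__(self, max_budget=None):
        self.max_budget = max_budget
        self.total_cost = 0.0

    def add(self, input_tokens, output_tokens, price_in, price_out):
        """Record one call; prices are dollars per million tokens."""
        self.total_cost += (
            input_tokens * price_in + output_tokens * price_out
        ) / 1_000_000
        if self.max_budget is not None and self.total_cost > self.max_budget:
            raise BudgetExceeded(
                f"spent ${self.total_cost:.2f}, limit is ${self.max_budget:.2f}"
            )

    def summary(self):
        return {"total_cost": self.total_cost}


tracker = CostTracker(max_budget=10.0)
# Placeholder prices: $1 per million input tokens, $4 per million output tokens.
tracker.add(2_000_000, 500_000, price_in=1.0, price_out=4.0)
print(f"Total cost: ${tracker.summary()['total_cost']:.2f}")
```

Note that this sketch checks the limit after each call is recorded, so the final total can overshoot the cap by the cost of one call.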
Batch Processing
delm = DELM(
    schema=schema,
    provider="openai",
    model="gpt-4o-mini",
    batch_size=50,  # Process 50 records per batch
    max_workers=5,  # Use 5 parallel workers
)
results = delm.extract("large_dataset.csv")
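The interaction between `batch_size` and `max_workers` can be pictured with a small standalone sketch: records are cut into fixed-size batches, and a thread pool works on several batches at once. This uses only the standard library and is an assumption about the general pattern, not a description of DELM's internals. `chunked`, `process_batch`, and `run_in_batches` are hypothetical names, and `process_batch` stands in for the LLM call.

```python
from concurrent.futures import ThreadPoolExecutor


def chunked(records, batch_size):
    """Yield consecutive slices of at most `batch_size` records."""
    for start in range(0, len(records), batch_size):
        yield records[start:start + batch_size]


def process_batch(batch):
    """Stand-in for an LLM call; here it only upper-cases each record."""
    return [record.upper() for record in batch]


def run_in_batches(records, batch_size=50, max_workers=5):
    """Process batches concurrently and return results in input order."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # Executor.map yields results in the order the batches were submitted.
        batch_results = pool.map(process_batch, chunked(records, batch_size))
        return [item for batch in batch_results for item in batch]


records = [f"record {i}" for i in range(120)]
results = run_in_batches(records, batch_size=50, max_workers=5)
print(len(results), results[0])
```

With 120 records and a batch size of 50, this produces three batches (50, 50, and 20 records), so only three of the five workers would have anything to do.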
Configuration Options
For a complete list of configuration options, see the documentation.
Common parameters:
- `provider`: LLM provider (`"openai"`, `"anthropic"`, `"google"`, etc.)
- `model`: Model name (`"gpt-4o-mini"`, `"claude-3-sonnet-20240229"`, etc.)
- `temperature`: Generation temperature (default: `0.0`)
- `batch_size`: Records per batch (default: `10`)
- `max_workers`: Concurrent workers (default: `1`)
- `track_cost`: Enable cost tracking (default: `True`)
- `max_budget`: Maximum cost limit in dollars (default: `None`)
- `target_column`: Column name for CSV/tabular data (default: `None`)
File Format Support
| Format | Extensions | Additional Dependencies |
|---|---|---|
| Text | .txt | None |
| HTML/Markdown | .html, .htm, .md | beautifulsoup4 |
| Word | .docx | python-docx |
| PDF | .pdf | marker-pdf |
| CSV | .csv | pandas |
| Excel | .xlsx | openpyxl |
| Parquet | .parquet | pyarrow |
| Feather | .feather | pyarrow |
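Before processing a file, it can help to check whether the extra dependency for its format is installed. The sketch below mirrors the table using only the standard library. It is not part of DELM: `missing_dependency` is a hypothetical helper, and the import names (`bs4` for beautifulsoup4, `docx` for python-docx, `marker` for marker-pdf) are assumptions about how those packages are imported.

```python
from importlib.util import find_spec
from pathlib import Path

# Extension -> importable module for the optional dependency (None = no extra).
# Import names are assumptions; they differ from the pip package names.
REQUIRED_MODULE = {
    ".txt": None,
    ".html": "bs4",
    ".htm": "bs4",
    ".md": "bs4",
    ".docx": "docx",
    ".pdf": "marker",
    ".csv": "pandas",
    ".xlsx": "openpyxl",
    ".parquet": "pyarrow",
    ".feather": "pyarrow",
}


def missing_dependency(path):
    """Return the module still to be installed for `path`, or None."""
    suffix = Path(path).suffix.lower()
    if suffix not in REQUIRED_MODULE:
        raise ValueError(f"unsupported file extension: {suffix!r}")
    module = REQUIRED_MODULE[suffix]
    if module is None or find_spec(module) is not None:
        return None
    return module


for name in ["notes.txt", "report.pdf", "earnings_data.csv"]:
    print(name, "->", missing_dependency(name) or "ready")
```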
Contributing
We welcome contributions! Please see our documentation for guidelines.
License
This project is licensed under the MIT License - see the LICENSE.md file for details.
Acknowledgments
- Built on Instructor for structured outputs
- Uses Marker for PDF processing
- Developed at the Center for Applied AI at Chicago Booth