DELM: Data Extraction with Language Models

Maintainers

ericfithian skblv


Data Extraction with Language Models

DELM is a Python toolkit for extracting structured data from unstructured text using language models.

📖 Full Documentation

Features

Multiple input formats: TXT, HTML, MD, DOCX, PDF, CSV, Excel, Parquet, Feather
Flexible schemas: Simple key-value → nested objects → multiple schemas
Multiple LLM providers: OpenAI, Anthropic, Google, Groq, Together AI, Fireworks AI
Cost management: Automatic cost tracking, caching, and budget limits
Built for scale: Batch processing with parallel execution and checkpointing

Installation

pip install delm

Quick Start

Define your extraction schema and extract structured data in just a few lines:

from delm import DELM, Schema, ExtractionVariable

# Define what to extract
schema = Schema.simple(
    variables_list=[
        ExtractionVariable(
            name="company",
            description="Company name mentioned",
            data_type="string",
            required=True,
        ),
        ExtractionVariable(
            name="price",
            description="Price value if mentioned",
            data_type="number",
            required=False,
        ),
    ]
)

# Initialize and extract
delm = DELM(
    schema=schema,
    provider="openai",
    model="gpt-4o-mini",
)

# Extract from any supported file format
results = delm.extract("data/earnings_calls.txt")
print(results)

# Check costs
print(delm.get_cost_summary())

Schema Types

DELM supports three schema types for different extraction needs:

Simple Schema

Extract key-value pairs from text:

schema = Schema.simple(
    variables_list=[
        ExtractionVariable(name="author", data_type="string"),
        ExtractionVariable(name="date", data_type="date"),
    ]
)

Nested Schema

Extract lists of structured objects:

schema = Schema.nested(
    container_name="products",
    variables_list=[
        ExtractionVariable(name="name", data_type="string"),
        ExtractionVariable(name="price", data_type="number"),
        ExtractionVariable(name="features", data_type="[string]"),
    ]
)

Multiple Schemas

Extract several different schemas at once:

schema = Schema.multiple({
    "companies": Schema.nested(
        container_name="companies",
        variables_list=[...],
    ),
    "products": Schema.nested(
        container_name="products",
        variables_list=[...],
    ),
})

Supported Data Types

| Type | Description | Example |
| --- | --- | --- |
| `string` | Text values | `"Apple Inc."` |
| `number` | Floating-point | `150.5` |
| `integer` | Whole numbers | `2024` |
| `boolean` | True/False | `true` |
| `date` | Date strings | `"2025-09-15"` |
| `[string]` | List of strings | `["oil", "gas"]` |
| `[number]` | List of numbers | `[100, 200]` |
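As a rough illustration of what these type names mean on the Python side, the sketch below checks a value against one of the type strings. The `matches_type` helper is hypothetical and not part of DELM, and treating `date` as an ISO 8601 string is an assumption based on the example value; DELM's own validation may differ.

```python
from datetime import date


# Hypothetical checker for the type names listed above; not DELM's API.
def matches_type(value, data_type: str) -> bool:
    """Return True if `value` is a plausible instance of `data_type`."""
    # List types are written as "[inner]", e.g. "[string]"
    if data_type.startswith("[") and data_type.endswith("]"):
        inner = data_type[1:-1]
        return isinstance(value, list) and all(matches_type(v, inner) for v in value)
    if data_type == "string":
        return isinstance(value, str)
    if data_type == "boolean":
        return isinstance(value, bool)
    if data_type == "integer":
        # bool is a subclass of int in Python, so exclude it explicitly
        return isinstance(value, int) and not isinstance(value, bool)
    if data_type == "number":
        return isinstance(value, (int, float)) and not isinstance(value, bool)
    if data_type == "date":
        # Assumption: dates are ISO 8601 strings such as "2025-09-15"
        if not isinstance(value, str):
            return False
        try:
            date.fromisoformat(value)
            return True
        except ValueError:
            return False
    raise ValueError(f"unknown data type: {data_type}")


print(matches_type(["oil", "gas"], "[string]"))
print(matches_type(True, "integer"))
```

A check like this can be handy in tests that assert on extracted records before they are written to disk.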

Advanced Features

Custom Prompts

delm = DELM(
    schema=schema,
    provider="openai",
    model="gpt-4o-mini",
    prompt_template="""You are a financial data extraction expert.

Extract the following information:
{variables}

Text to analyze:
{text}""",
)
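The template above relies on two placeholders, `{variables}` and `{text}`. The following sketch shows one way such a template could be filled with `str.format`. How DELM renders the variable list internally is not stated here, so `render_variables` and `build_prompt` are hypothetical helpers for illustration only.

```python
# Hypothetical rendering of a template with {variables} and {text} placeholders.
# The bullet format used for variables is an assumption, not DELM's behavior.
TEMPLATE = """You are a financial data extraction expert.

Extract the following information:
{variables}

Text to analyze:
{text}"""


def render_variables(variables: list[dict]) -> str:
    """Format each variable as a '- name (type): description' line."""
    return "\n".join(
        f"- {v['name']} ({v['data_type']}): {v['description']}" for v in variables
    )


def build_prompt(template: str, variables: list[dict], text: str) -> str:
    """Substitute the rendered variable list and the input text into the template."""
    return template.format(variables=render_variables(variables), text=text)


prompt = build_prompt(
    TEMPLATE,
    [
        {"name": "company", "data_type": "string", "description": "Company name mentioned"},
        {"name": "price", "data_type": "number", "description": "Price value if mentioned"},
    ],
    "Acme Corp raised its price to 150.5 dollars.",
)
print(prompt)
```

With `str.format`, any literal braces in the template itself would need to be doubled (`{{` and `}}`); braces in the input text are passed through unchanged.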

Process CSV/Structured Data

delm = DELM(
    schema=schema,
    provider="openai",
    model="gpt-4o-mini",
    target_column="transcript_text",  # Column containing text to process
)

results = delm.extract("earnings_data.csv")
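To make the role of `target_column` concrete, here is a small sketch that pulls one text column out of CSV rows using only the standard library. It illustrates the idea rather than DELM's loader; the sample data and the `iter_target_text` helper are made up for this example.

```python
import csv
import io

# Made-up sample data standing in for a file such as earnings_data.csv
SAMPLE_CSV = """call_id,transcript_text
1,"Revenue grew 12% this quarter."
2,"We expect prices to hold steady."
"""


def iter_target_text(csv_file, target_column: str):
    """Yield the text stored in `target_column` for each row of a CSV file object."""
    reader = csv.DictReader(csv_file)
    if target_column not in (reader.fieldnames or []):
        raise KeyError(f"column {target_column!r} not found in {reader.fieldnames}")
    for row in reader:
        yield row[target_column]


texts = list(iter_target_text(io.StringIO(SAMPLE_CSV), "transcript_text"))
print(texts)
```

Each yielded string corresponds to one record that would be sent for extraction; the other columns are left untouched.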

Cost Tracking & Limits

delm = DELM(
    schema=schema,
    provider="openai",
    model="gpt-4o-mini",
    track_cost=True,
    max_budget=10.0,  # Stop if cost exceeds $10
)

results = delm.extract("data.txt")
summary = delm.get_cost_summary()
print(f"Total cost: ${summary['total_cost']:.2f}")
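For intuition about how a budget limit can work, the sketch below accumulates per-request cost from token counts and stops once a limit is passed. It is a minimal illustration under assumed names and placeholder prices; `CostTracker` and `BudgetExceeded` are hypothetical and do not describe DELM's internals.

```python
from typing import Optional


class BudgetExceeded(RuntimeError):
    """Raised when accumulated cost passes the configured limit."""


class CostTracker:
    """Hypothetical cost accumulator; prices are placeholders, not real rates."""

    def __init__(self, input_price_per_1m: float, output_price_per_1m: float,
                 max_budget: Optional[float] = None):
        self.input_price = input_price_per_1m
        self.output_price = output_price_per_1m
        self.max_budget = max_budget
        self.total_cost = 0.0

    def record(self, input_tokens: int, output_tokens: int) -> float:
        """Add one request's cost; raise if the running total exceeds the budget."""
        cost = (input_tokens * self.input_price
                + output_tokens * self.output_price) / 1_000_000
        self.total_cost += cost
        if self.max_budget is not None and self.total_cost > self.max_budget:
            raise BudgetExceeded(
                f"spent ${self.total_cost:.4f}, limit is ${self.max_budget:.4f}"
            )
        return cost


tracker = CostTracker(input_price_per_1m=1.0, output_price_per_1m=2.0, max_budget=0.005)
try:
    for _ in range(10):
        tracker.record(input_tokens=1000, output_tokens=500)
except BudgetExceeded as exc:
    print(f"Stopped: {exc}")
```

Because this sketch checks the limit after a request has been recorded, spend can overshoot the budget by up to one request. A stricter design would estimate the cost before sending.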

Batch Processing

delm = DELM(
    schema=schema,
    provider="openai",
    model="gpt-4o-mini",
    batch_size=50,      # Process 50 records per batch
    max_workers=5,      # Use 5 parallel workers
)

results = delm.extract("large_dataset.csv")
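The interplay of `batch_size` and `max_workers` can be pictured with a thread pool: records are split into fixed-size batches and several batches are processed at once. The sketch below is an assumption about the general pattern, not DELM's implementation, and `process_batch` is a stand-in for a real LLM call.

```python
from concurrent.futures import ThreadPoolExecutor


def make_batches(records: list, batch_size: int) -> list[list]:
    """Split `records` into consecutive chunks of at most `batch_size` items."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    return [records[i:i + batch_size] for i in range(0, len(records), batch_size)]


def process_batch(batch: list[str]) -> list[int]:
    """Stand-in for an LLM call: returns the word count of each record."""
    return [len(text.split()) for text in batch]


def process_all(records: list[str], batch_size: int, max_workers: int) -> list[int]:
    """Process batches in parallel and return results in the original record order."""
    batches = make_batches(records, batch_size)
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # Executor.map yields results in submission order,
        # even when batches finish out of order
        results = pool.map(process_batch, batches)
        return [item for batch_result in results for item in batch_result]


records = [f"record number {i}" for i in range(7)]
print(process_all(records, batch_size=3, max_workers=2))
```

Threads suit this kind of workload because the time is spent waiting on network responses rather than on CPU work.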

Configuration Options

For a complete list of configuration options, see the documentation.

Common parameters:

provider: LLM provider ("openai", "anthropic", "google", etc.)
model: Model name ("gpt-4o-mini", "claude-3-sonnet-20240229", etc.)
temperature: Generation temperature (default: 0.0)
batch_size: Records per batch (default: 10)
max_workers: Concurrent workers (default: 1)
track_cost: Enable cost tracking (default: True)
max_budget: Maximum cost limit in dollars (default: None)
target_column: Column name for CSV/tabular data (default: None)
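One way to keep these parameters in a single place is a small settings object that mirrors the defaults listed above. The `ExtractionSettings` class below is hypothetical and not part of DELM, and passing its fields as keyword arguments assumes the `DELM` constructor accepts them under these names.

```python
from dataclasses import asdict, dataclass
from typing import Optional


@dataclass(frozen=True)
class ExtractionSettings:
    """Hypothetical container mirroring the documented parameter defaults."""

    provider: str
    model: str
    temperature: float = 0.0
    batch_size: int = 10
    max_workers: int = 1
    track_cost: bool = True
    max_budget: Optional[float] = None
    target_column: Optional[str] = None

    def __post_init__(self):
        # Basic sanity checks; these rules are assumptions, not DELM's validation
        if self.batch_size < 1:
            raise ValueError("batch_size must be at least 1")
        if self.max_workers < 1:
            raise ValueError("max_workers must be at least 1")
        if self.max_budget is not None and self.max_budget <= 0:
            raise ValueError("max_budget must be positive when set")

    def as_kwargs(self) -> dict:
        """Return the settings as a plain dict of keyword arguments."""
        return asdict(self)


settings = ExtractionSettings(provider="openai", model="gpt-4o-mini", max_budget=10.0)
print(settings.as_kwargs())
```

Under that assumption, the settings could be unpacked into the constructor as `DELM(schema=schema, **settings.as_kwargs())`.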

Documentation

📖 Full Documentation


File Format Support

| Format | Extensions | Additional Dependencies |
| --- | --- | --- |
| Text | `.txt` | None |
| HTML/Markdown | `.html`, `.htm`, `.md` | `beautifulsoup4` |
| Word | `.docx` | `python-docx` |
| PDF | `.pdf` | `marker-pdf` |
| CSV | `.csv` | `pandas` |
| Excel | `.xlsx` | `openpyxl` |
| Parquet | `.parquet` | `pyarrow` |
| Feather | `.feather` | `pyarrow` |
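Before running an extraction, it can help to check whether the optional package for a given file type is importable. The sketch below maps the extensions from the table to the usual import names of those packages (for example, `beautifulsoup4` is imported as `bs4`). The `missing_module` helper is hypothetical, and the import names should be verified against your environment.

```python
import os
from importlib.util import find_spec
from typing import Optional

# Extension -> import name of the optional package (None means no extra package).
# Assumed import names: beautifulsoup4 -> bs4, python-docx -> docx, marker-pdf -> marker.
EXTRA_MODULE_BY_EXTENSION = {
    ".txt": None,
    ".html": "bs4",
    ".htm": "bs4",
    ".md": "bs4",
    ".docx": "docx",
    ".pdf": "marker",
    ".csv": "pandas",
    ".xlsx": "openpyxl",
    ".parquet": "pyarrow",
    ".feather": "pyarrow",
}


def missing_module(path: str) -> Optional[str]:
    """Return the import name that is missing for `path`, or None if nothing is missing."""
    ext = os.path.splitext(path)[1].lower()
    if ext not in EXTRA_MODULE_BY_EXTENSION:
        raise ValueError(f"unsupported file extension: {ext!r}")
    module = EXTRA_MODULE_BY_EXTENSION[ext]
    # find_spec returns None when a top-level module cannot be found
    if module is None or find_spec(module) is not None:
        return None
    return module


for name in ["notes.txt", "report.pdf", "calls.csv"]:
    print(name, "->", missing_module(name) or "ready")
```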

Contributing

We welcome contributions! Please see our documentation for guidelines.

License

This project is licensed under the MIT License - see the LICENSE.md file for details.

Acknowledgments

Built on Instructor for structured outputs
Uses Marker for PDF processing
Developed at the Center for Applied AI at Chicago Booth

Release history

1.1.0 (this version): Feb 24, 2026
1.0.3: Feb 13, 2026
1.0.1: Jan 13, 2026
1.0.0: Nov 29, 2025
0.1.4: Sep 23, 2025
0.1.3: Sep 16, 2025

Download files

Source Distribution

delm-1.1.0.tar.gz (69.4 kB)

Uploaded Feb 24, 2026 Source

Built Distribution

delm-1.1.0-py3-none-any.whl (77.7 kB)

Uploaded Feb 24, 2026 Python 3

File details

Details for the file delm-1.1.0.tar.gz.

File metadata

Download URL: delm-1.1.0.tar.gz
Upload date: Feb 24, 2026
Size: 69.4 kB
Tags: Source
Uploaded using Trusted Publishing? Yes
Uploaded via: twine/6.1.0 CPython/3.13.7

File hashes

Hashes for delm-1.1.0.tar.gz
Algorithm	Hash digest
SHA256	`807e8bceb0ace9eba8f986dc80cb9109b0b18a4acb6589b4d45d532ad0b80186`
MD5	`18ef9cd5e4de68aa525ef57ee94946e6`
BLAKE2b-256	`66e985c908d080b96c036447048f8014a7f84cbf3c48d15e7e52d133801fe614`
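A downloaded file can be checked against the SHA256 digest listed above with the standard library's `hashlib`. The sketch below is a generic verification routine; the `sha256_of` and `verify_download` helpers are illustrative names, and the local file path is whatever location you saved the download to.

```python
import hashlib

# SHA256 digest for delm-1.1.0.tar.gz as listed in the table above
EXPECTED_SHA256 = "807e8bceb0ace9eba8f986dc80cb9109b0b18a4acb6589b4d45d532ad0b80186"


def sha256_of(path: str, chunk_size: int = 1 << 20) -> str:
    """Compute the SHA256 hex digest of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as fh:
        # iter() with a sentinel keeps reading until read() returns empty bytes
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_download(path: str, expected_hex: str) -> bool:
    """Return True if the file's SHA256 digest matches `expected_hex`."""
    return sha256_of(path) == expected_hex.strip().lower()


# Example call, assuming the sdist was saved in the current directory:
# verify_download("delm-1.1.0.tar.gz", EXPECTED_SHA256)
```

Reading in chunks keeps memory use constant regardless of file size, which matters little for a 69.4 kB archive but makes the helper reusable for larger files.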

Provenance

The following attestation bundles were made for delm-1.1.0.tar.gz:

Publisher: publish.yml on Center-for-Applied-AI/delm

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.

Statement:
- Statement type: https://in-toto.io/Statement/v1
- Predicate type: https://docs.pypi.org/attestations/publish/v1
- Subject name: delm-1.1.0.tar.gz
- Subject digest: 807e8bceb0ace9eba8f986dc80cb9109b0b18a4acb6589b4d45d532ad0b80186
- Sigstore transparency entry: 990231898
- Sigstore integration time: Feb 24, 2026
Source repository:
- Permalink: Center-for-Applied-AI/delm@6eb0e75a466757f07f0d9025d62fd4dcbc6c2547
- Branch / Tag: refs/tags/v1.1.0
- Owner: https://github.com/Center-for-Applied-AI
- Access: public
Publication detail:
- Token Issuer: https://token.actions.githubusercontent.com
- Runner Environment: github-hosted
- Publication workflow: publish.yml@6eb0e75a466757f07f0d9025d62fd4dcbc6c2547
- Trigger Event: release

File details

Details for the file delm-1.1.0-py3-none-any.whl.

File metadata

Download URL: delm-1.1.0-py3-none-any.whl
Upload date: Feb 24, 2026
Size: 77.7 kB
Tags: Python 3
Uploaded using Trusted Publishing? Yes
Uploaded via: twine/6.1.0 CPython/3.13.7

File hashes

Hashes for delm-1.1.0-py3-none-any.whl
Algorithm	Hash digest
SHA256	`952df18ad3388cdb684dc927317ee2897a8b6580d126466c9bb28789ddbfb027`
MD5	`65228cc761ed0e254b0e21356850b961`
BLAKE2b-256	`68d9c9631db9dac5bf3018c1bb078bdaf5cdf45e36c1ec2601f5cec4de9dd48b`

Provenance

The following attestation bundles were made for delm-1.1.0-py3-none-any.whl:

Publisher: publish.yml on Center-for-Applied-AI/delm

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.

Statement:
- Statement type: https://in-toto.io/Statement/v1
- Predicate type: https://docs.pypi.org/attestations/publish/v1
- Subject name: delm-1.1.0-py3-none-any.whl
- Subject digest: 952df18ad3388cdb684dc927317ee2897a8b6580d126466c9bb28789ddbfb027
- Sigstore transparency entry: 990231899
- Sigstore integration time: Feb 24, 2026
Source repository:
- Permalink: Center-for-Applied-AI/delm@6eb0e75a466757f07f0d9025d62fd4dcbc6c2547
- Branch / Tag: refs/tags/v1.1.0
- Owner: https://github.com/Center-for-Applied-AI
- Access: public
Publication detail:
- Token Issuer: https://token.actions.githubusercontent.com
- Runner Environment: github-hosted
- Publication workflow: publish.yml@6eb0e75a466757f07f0d9025d62fd4dcbc6c2547
- Trigger Event: release
