DELM: Data Extraction with Language Models

DELM is a Python toolkit for extracting structured data from unstructured text using language models.

📖 Full Documentation

Features

  • Multiple input formats: TXT, HTML, MD, DOCX, PDF, CSV, Excel, Parquet, Feather
  • Flexible schemas: Simple key-value pairs, nested objects, or multiple schemas combined
  • Multiple LLM providers: OpenAI, Anthropic, Google, Groq, Together AI, Fireworks AI
  • Cost management: Automatic cost tracking, caching, and budget limits
  • Built for scale: Batch processing with parallel execution and checkpointing

Installation

pip install delm

Quick Start

Define your extraction schema and extract structured data in just a few lines:

from delm import DELM, Schema, ExtractionVariable

# Define what to extract
schema = Schema.simple(
    variables_list=[
        ExtractionVariable(
            name="company",
            description="Company name mentioned",
            data_type="string",
            required=True,
        ),
        ExtractionVariable(
            name="price",
            description="Price value if mentioned",
            data_type="number",
            required=False,
        ),
    ]
)

# Initialize and extract
delm = DELM(
    schema=schema,
    provider="openai",
    model="gpt-4o-mini",
)

# Extract from any supported file format
results = delm.extract("data/earnings_calls.txt")
print(results)

# Check costs
print(delm.get_cost_summary())

Schema Types

DELM supports three schema types for different extraction needs:

Simple Schema

Extract key-value pairs from text:

schema = Schema.simple(
    variables_list=[
        ExtractionVariable(name="author", data_type="string"),
        ExtractionVariable(name="date", data_type="date"),
    ]
)

Nested Schema

Extract lists of structured objects:

schema = Schema.nested(
    container_name="products",
    variables_list=[
        ExtractionVariable(name="name", data_type="string"),
        ExtractionVariable(name="price", data_type="number"),
        ExtractionVariable(name="features", data_type="[string]"),
    ]
)

Multiple Schemas

Extract data for several schemas simultaneously:

schema = Schema.multiple({
    "companies": Schema.nested(
        container_name="companies",
        variables_list=[...],
    ),
    "products": Schema.nested(
        container_name="products",
        variables_list=[...],
    ),
})

Supported Data Types

Type        Description       Example
string      Text values       "Apple Inc."
number      Floating-point    150.5
integer     Whole numbers     2024
boolean     True/False        true
date        Date strings      "2025-09-15"
[string]    List of strings   ["oil", "gas"]
[number]    List of numbers   [100, 200]
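
To make the type names concrete, here is a minimal sketch of checking a Python value against one of them. `matches_type` is a hypothetical helper written for this illustration; it is not part of DELM. It assumes that `date` values are ISO-formatted strings such as the `"2025-09-15"` example, which the table suggests but does not state.

```python
from datetime import date


def matches_type(value, data_type):
    """Return True if `value` fits one of the type names listed above.

    Hypothetical helper for illustration only; not part of DELM.
    """
    # List types are written as "[inner]": check every element recursively.
    if data_type.startswith("[") and data_type.endswith("]"):
        inner = data_type[1:-1]
        return isinstance(value, list) and all(
            matches_type(item, inner) for item in value
        )
    if data_type == "string":
        return isinstance(value, str)
    if data_type == "boolean":
        return isinstance(value, bool)
    if data_type == "integer":
        # bool is a subclass of int in Python, so exclude it explicitly.
        return isinstance(value, int) and not isinstance(value, bool)
    if data_type == "number":
        return isinstance(value, (int, float)) and not isinstance(value, bool)
    if data_type == "date":
        # Assumption: dates are ISO strings such as "2025-09-15".
        if not isinstance(value, str):
            return False
        try:
            date.fromisoformat(value)
        except ValueError:
            return False
        return True
    raise ValueError(f"unknown data type: {data_type!r}")
```

A check like this can be useful when post-processing extracted values, for example to flag a `price` that came back as text instead of a number.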

Advanced Features

Custom Prompts

delm = DELM(
    schema=schema,
    provider="openai",
    model="gpt-4o-mini",
    prompt_template="""You are a financial data extraction expert.

Extract the following information:
{variables}

Text to analyze:
{text}""",
)
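
The template above contains two placeholders, `{variables}` and `{text}`. How DELM fills them internally is not described here; the sketch below only illustrates the idea with Python's `str.format`, which is an assumption. `render_prompt` and the `- name: description` line format are hypothetical.

```python
PROMPT_TEMPLATE = """You are a financial data extraction expert.

Extract the following information:
{variables}

Text to analyze:
{text}"""


def render_prompt(template, variables, text):
    """Fill the {variables} and {text} placeholders of a prompt template.

    Hypothetical helper; `variables` is a list of (name, description) pairs.
    """
    variable_lines = "\n".join(
        f"- {name}: {description}" for name, description in variables
    )
    return template.format(variables=variable_lines, text=text)


prompt = render_prompt(
    PROMPT_TEMPLATE,
    [("company", "Company name mentioned"), ("price", "Price value if mentioned")],
    "Acme Corp raised its price to 150.5 dollars.",
)
print(prompt)
```

If substitution does work like `str.format`, any literal braces in a custom template would need to be doubled (`{{` and `}}`) to survive rendering.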

Process CSV/Structured Data

delm = DELM(
    schema=schema,
    provider="openai",
    model="gpt-4o-mini",
    target_column="transcript_text",  # Column containing text to process
)

results = delm.extract("earnings_data.csv")
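
For tabular input, `target_column` names the column whose text is processed. The sketch below shows that idea with the standard-library `csv` module: it reads a CSV file and returns only the values of one column. `read_target_column` is a hypothetical helper for illustration and does not reflect how DELM loads files.

```python
import csv


def read_target_column(path, target_column):
    """Return the values of `target_column` from a CSV file, one per row.

    Hypothetical helper for illustration; not part of DELM.
    """
    with open(path, newline="", encoding="utf-8") as handle:
        reader = csv.DictReader(handle)
        # Fail early with a clear message if the column is missing.
        if target_column not in (reader.fieldnames or []):
            raise KeyError(f"column {target_column!r} not found in {path}")
        return [row[target_column] for row in reader]
```

Checking that the column exists before starting a run avoids spending API calls on a misconfigured file.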

Cost Tracking & Limits

delm = DELM(
    schema=schema,
    provider="openai",
    model="gpt-4o-mini",
    track_cost=True,
    max_budget=10.0,  # Stop if cost exceeds $10
)

results = delm.extract("data.txt")
summary = delm.get_cost_summary()
print(f"Total cost: ${summary['total_cost']:.2f}")
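
As an illustration of what a budget limit involves, the sketch below accumulates per-request cost from token counts and stops once a limit is crossed. It is not DELM's implementation: `BudgetTracker`, `BudgetExceeded`, and the per-million-token prices are all hypothetical placeholders, not real provider rates.

```python
class BudgetExceeded(RuntimeError):
    """Raised when accumulated cost passes the configured limit."""


class BudgetTracker:
    """Accumulate request costs and enforce an optional budget.

    Hypothetical class for illustration; not part of DELM.
    """

    def __init__(self, max_budget=None):
        self.max_budget = max_budget  # dollars; None disables the limit
        self.total_cost = 0.0

    def add(self, input_tokens, output_tokens, input_price, output_price):
        """Record one request; prices are dollars per million tokens."""
        cost = (input_tokens * input_price + output_tokens * output_price) / 1_000_000
        self.total_cost += cost
        if self.max_budget is not None and self.total_cost > self.max_budget:
            raise BudgetExceeded(
                f"spent ${self.total_cost:.4f}, budget is ${self.max_budget:.4f}"
            )
        return cost


tracker = BudgetTracker(max_budget=0.01)
# Placeholder prices: $1 per million input tokens, $2 per million output tokens.
tracker.add(input_tokens=1000, output_tokens=500, input_price=1.0, output_price=2.0)
print(f"Total cost: ${tracker.total_cost:.4f}")
```

Note that in this sketch the limit is checked after each request is recorded, so the final total can slightly exceed the budget by the cost of the last request.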

Batch Processing

delm = DELM(
    schema=schema,
    provider="openai",
    model="gpt-4o-mini",
    batch_size=50,      # Process 50 records per batch
    max_workers=5,      # Use 5 parallel workers
)

results = delm.extract("large_dataset.csv")
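
The interplay of `batch_size` and `max_workers` can be pictured as follows: records are split into batches, and batches are handed to a pool of workers. This is a generic sketch built on the standard library's `ThreadPoolExecutor`, not DELM's actual scheduler; `chunked`, `process_in_batches`, and `handle_batch` are hypothetical names.

```python
from concurrent.futures import ThreadPoolExecutor


def chunked(records, batch_size):
    """Yield consecutive slices of `records` with at most `batch_size` items."""
    for start in range(0, len(records), batch_size):
        yield records[start:start + batch_size]


def process_in_batches(records, handle_batch, batch_size=10, max_workers=1):
    """Apply `handle_batch` to each batch in parallel and flatten the results.

    `handle_batch` takes a list of records and returns a list of results.
    """
    batches = list(chunked(records, batch_size))
    results = []
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # pool.map returns results in submission order, so output order
        # matches input order even when batches finish out of order.
        for batch_result in pool.map(handle_batch, batches):
            results.extend(batch_result)
    return results
```

Threads suit this workload because each batch spends most of its time waiting on network responses rather than using the CPU.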

Configuration Options

For a complete list of configuration options, see the documentation.

Common parameters:

  • provider: LLM provider ("openai", "anthropic", "google", etc.)
  • model: Model name ("gpt-4o-mini", "claude-3-sonnet-20240229", etc.)
  • temperature: Generation temperature (default: 0.0)
  • batch_size: Records per batch (default: 10)
  • max_workers: Concurrent workers (default: 1)
  • track_cost: Enable cost tracking (default: True)
  • max_budget: Maximum cost limit in dollars (default: None)
  • target_column: Column name for CSV/tabular data (default: None)

Documentation

📖 Full Documentation

File Format Support

Format          Extensions           Additional Dependencies
Text            .txt                 None
HTML/Markdown   .html, .htm, .md     beautifulsoup4
Word            .docx                python-docx
PDF             .pdf                 marker-pdf
CSV             .csv                 pandas
Excel           .xlsx                openpyxl
Parquet         .parquet             pyarrow
Feather         .feather             pyarrow
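
Before processing a file, it can help to check whether the dependency for its format is installed. The sketch below maps each extension from the table to the import name of its package and tests for it with `importlib`. `missing_dependency` is a hypothetical helper, not part of DELM; the import names (`bs4` for beautifulsoup4, `docx` for python-docx, `marker` for marker-pdf) are the modules those packages install.

```python
from importlib.util import find_spec
from pathlib import Path

# Extension -> importable module name (None means no extra dependency).
REQUIRED_MODULE = {
    ".txt": None,
    ".html": "bs4",
    ".htm": "bs4",
    ".md": "bs4",
    ".docx": "docx",
    ".pdf": "marker",
    ".csv": "pandas",
    ".xlsx": "openpyxl",
    ".parquet": "pyarrow",
    ".feather": "pyarrow",
}


def missing_dependency(path):
    """Return the module name that must be installed to read `path`, or None.

    Hypothetical helper for illustration; not part of DELM.
    """
    suffix = Path(path).suffix.lower()
    if suffix not in REQUIRED_MODULE:
        raise ValueError(f"unsupported file extension: {suffix!r}")
    module = REQUIRED_MODULE[suffix]
    if module is None or find_spec(module) is not None:
        return None
    return module
```

For example, `missing_dependency("report.pdf")` returns `"marker"` when marker-pdf is not installed and `None` when it is.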

Contributing

We welcome contributions! Please see our documentation for guidelines.

License

This project is licensed under the MIT License - see the LICENSE.md file for details.
