DELM: Data Extraction with Language Models

DELM is a Python toolkit for extracting structured data from unstructured text using language models.

📖 Full Documentation

Features

  • Multiple input formats: TXT, HTML, MD, DOCX, PDF, CSV, Excel, Parquet, Feather
  • Flexible schemas: Simple key-value → nested objects → multiple schemas
  • Multiple LLM providers: OpenAI, Anthropic, Google, Groq, Together AI, Fireworks AI
  • Cost management: Automatic cost tracking, caching, and budget limits
  • Built for scale: Batch processing with parallel execution and checkpointing

Installation

pip install delm

Quick Start

Define your extraction schema and extract structured data in just a few lines:

from delm import DELM, Schema, ExtractionVariable

# Define what to extract
schema = Schema.simple(
    variables_list=[
        ExtractionVariable(
            name="company",
            description="Company name mentioned",
            data_type="string",
            required=True,
        ),
        ExtractionVariable(
            name="price",
            description="Price value if mentioned",
            data_type="number",
            required=False,
        ),
    ]
)

# Initialize and extract
delm = DELM(
    schema=schema,
    provider="openai",
    model="gpt-4o-mini",
)

# Extract from any supported file format
results = delm.extract("data/earnings_calls.txt")
print(results)

# Check costs
print(delm.get_cost_summary())

Schema Types

DELM supports three schema types for different extraction needs:

Simple Schema

Extract key-value pairs from text:

schema = Schema.simple(
    variables_list=[
        ExtractionVariable(name="author", data_type="string"),
        ExtractionVariable(name="date", data_type="date"),
    ]
)

Nested Schema

Extract lists of structured objects:

schema = Schema.nested(
    container_name="products",
    variables_list=[
        ExtractionVariable(name="name", data_type="string"),
        ExtractionVariable(name="price", data_type="number"),
        ExtractionVariable(name="features", data_type="[string]"),
    ]
)

Multiple Schemas

Extract multiple different schemas simultaneously:

schema = Schema.multiple({
    "companies": Schema.nested(
        container_name="companies",
        variables_list=[...],
    ),
    "products": Schema.nested(
        container_name="products",
        variables_list=[...],
    ),
})

Supported Data Types

Type       Description       Example
string     Text values       "Apple Inc."
number     Floating-point    150.5
integer    Whole numbers     2024
boolean    True/False        true
date       Date strings      "2025-09-15"
[string]   List of strings   ["oil", "gas"]
[number]   List of numbers   [100, 200]
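
To make the type names above concrete, here is a minimal sketch of how a value could be checked against them in plain Python. The `matches_type` helper is hypothetical and not part of DELM, and it assumes that date strings are ISO 8601 (as in the example column); DELM's own validation may differ.

```python
from datetime import date


def matches_type(value, data_type: str) -> bool:
    """Hypothetical helper: check a value against a type name from the table."""
    # "[string]" and "[number]" describe lists whose items share one type.
    if data_type.startswith("[") and data_type.endswith("]"):
        inner = data_type[1:-1]
        return isinstance(value, list) and all(matches_type(v, inner) for v in value)
    if data_type == "string":
        return isinstance(value, str)
    if data_type == "boolean":
        return isinstance(value, bool)
    if data_type == "integer":
        # bool is a subclass of int in Python, so exclude it explicitly.
        return isinstance(value, int) and not isinstance(value, bool)
    if data_type == "number":
        return isinstance(value, (int, float)) and not isinstance(value, bool)
    if data_type == "date":
        # Assumption: dates arrive as ISO 8601 strings such as "2025-09-15".
        if not isinstance(value, str):
            return False
        try:
            date.fromisoformat(value)
        except ValueError:
            return False
        return True
    raise ValueError(f"unknown data type: {data_type}")
```

A check like this can be useful for sanity-testing extracted values before loading them into a database or DataFrame.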

Advanced Features

Custom Prompts

delm = DELM(
    schema=schema,
    provider="openai",
    model="gpt-4o-mini",
    prompt_template="""You are a financial data extraction expert.

Extract the following information:
{variables}

Text to analyze:
{text}""",
)
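
The template above contains two placeholders, `{variables}` and `{text}`. The sketch below shows one plausible way such placeholders get filled, assuming they behave like `str.format` fields. The `render_prompt` function and the bullet formatting of the variables are illustrative assumptions, not DELM's actual implementation.

```python
TEMPLATE = """You are a financial data extraction expert.

Extract the following information:
{variables}

Text to analyze:
{text}"""


def render_prompt(template: str, variables: dict, text: str) -> str:
    """Hypothetical helper: fill {variables} and {text} in a prompt template."""
    # One bullet per variable: "- name: description".
    variable_lines = "\n".join(f"- {name}: {desc}" for name, desc in variables.items())
    return template.format(variables=variable_lines, text=text)


prompt = render_prompt(
    TEMPLATE,
    {"company": "Company name mentioned", "price": "Price value if mentioned"},
    "Acme raised prices to $12.",
)
print(prompt)
```

If placeholders do work this way, any literal braces in a custom template would need to be doubled (`{{` and `}}`); check the documentation before relying on that.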

Process CSV/Structured Data

delm = DELM(
    schema=schema,
    provider="openai",
    model="gpt-4o-mini",
    target_column="transcript_text",  # Column containing text to process
)

results = delm.extract("earnings_data.csv")

Cost Tracking & Limits

delm = DELM(
    schema=schema,
    provider="openai",
    model="gpt-4o-mini",
    track_cost=True,
    max_budget=10.0,  # Stop if cost exceeds $10
)

results = delm.extract("data.txt")
summary = delm.get_cost_summary()
print(f"Total cost: ${summary['total_cost']:.2f}")
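
As a rough illustration of what a budget limit like `max_budget` implies, the sketch below accumulates a per-call cost from token counts and stops once the total exceeds the limit. The `CostTracker` class, the `BudgetExceeded` exception, and the per-million-token prices are all hypothetical; DELM's internal accounting may work differently.

```python
class BudgetExceeded(RuntimeError):
    """Raised when the accumulated cost goes over the configured budget."""


class CostTracker:
    """Hypothetical cost accumulator with an optional hard limit."""

    def __init__(self, max_budget=None):
        self.max_budget = max_budget
        self.total_cost = 0.0

    def add(self, input_tokens, output_tokens, input_price_per_1m, output_price_per_1m):
        # Prices are expressed in dollars per one million tokens.
        cost = (
            input_tokens / 1_000_000 * input_price_per_1m
            + output_tokens / 1_000_000 * output_price_per_1m
        )
        self.total_cost += cost
        if self.max_budget is not None and self.total_cost > self.max_budget:
            raise BudgetExceeded(
                f"spent ${self.total_cost:.2f}, budget is ${self.max_budget:.2f}"
            )
        return cost
```

Checking the limit after every call means the run can overshoot by at most the cost of one request, which is usually an acceptable trade-off for a simple guard.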

Batch Processing

delm = DELM(
    schema=schema,
    provider="openai",
    model="gpt-4o-mini",
    batch_size=50,      # Process 50 records per batch
    max_workers=5,      # Use 5 parallel workers
)

results = delm.extract("large_dataset.csv")
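
To show how `batch_size` and `max_workers` interact, here is a minimal sketch of batched parallel processing using only the standard library. The `chunked`, `process_batch`, and `run_batches` functions are hypothetical stand-ins; `process_batch` simply measures text length where a real pipeline would call a language model.

```python
from concurrent.futures import ThreadPoolExecutor


def chunked(records, batch_size):
    """Split a list of records into consecutive batches of at most batch_size."""
    return [records[i:i + batch_size] for i in range(0, len(records), batch_size)]


def process_batch(batch):
    # Stand-in for an LLM call: return one result per record in the batch.
    return [len(text) for text in batch]


def run_batches(records, batch_size=50, max_workers=5):
    """Process batches concurrently and return results in the original order."""
    batches = chunked(records, batch_size)
    results = []
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # pool.map yields batch results in submission order.
        for batch_result in pool.map(process_batch, batches):
            results.extend(batch_result)
    return results
```

Threads suit this workload because the time is spent waiting on network responses rather than on CPU-bound work.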

Configuration Options

For a complete list of configuration options, see the documentation.

Common parameters:

  • provider: LLM provider ("openai", "anthropic", "google", etc.)
  • model: Model name ("gpt-4o-mini", "claude-3-sonnet-20240229", etc.)
  • temperature: Generation temperature (default: 0.0)
  • batch_size: Records per batch (default: 10)
  • max_workers: Concurrent workers (default: 1)
  • track_cost: Enable cost tracking (default: True)
  • max_budget: Maximum cost limit in dollars (default: None)
  • target_column: Column name for CSV/tabular data (default: None)

Documentation

📖 Full Documentation

File Format Support

Format          Extensions          Additional Dependencies
Text            .txt                None
HTML/Markdown   .html, .htm, .md    beautifulsoup4
Word            .docx               python-docx
PDF             .pdf                marker-pdf
CSV             .csv                pandas
Excel           .xlsx               openpyxl
Parquet         .parquet            pyarrow
Feather         .feather            pyarrow
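
Before processing a file, it can help to confirm that the dependency listed above is installed. The sketch below maps each extension to the module name its package is normally imported under and checks for it with `importlib`. The `missing_dependency` helper is hypothetical, and the import names (`bs4` for beautifulsoup4, `docx` for python-docx, `marker` for marker-pdf) are assumptions about those packages rather than anything DELM documents.

```python
from importlib.util import find_spec
from pathlib import Path

# Extension -> import name of the required package (None means no extra dependency).
REQUIRED_MODULE = {
    ".txt": None,
    ".html": "bs4",
    ".htm": "bs4",
    ".md": "bs4",
    ".docx": "docx",
    ".pdf": "marker",
    ".csv": "pandas",
    ".xlsx": "openpyxl",
    ".parquet": "pyarrow",
    ".feather": "pyarrow",
}


def missing_dependency(path):
    """Return the module name that is missing for this file, or None if all is well."""
    suffix = Path(path).suffix.lower()
    if suffix not in REQUIRED_MODULE:
        raise ValueError(f"unsupported file extension: {suffix!r}")
    module = REQUIRED_MODULE[suffix]
    if module is None or find_spec(module) is not None:
        return None
    return module
```

For example, `missing_dependency("report.pdf")` returns `"marker"` when marker-pdf is not installed and `None` when it is.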

Contributing

We welcome contributions! Please see our documentation for guidelines.

License

This project is licensed under the MIT License - see the LICENSE.md file for details.
