Data Extraction Language Model - A pipeline for extracting structured data from text using language models

DELM (Data Extraction with Language Models)

A comprehensive Python toolkit for extracting structured data from unstructured text using language models. DELM provides a configurable, scalable pipeline with built-in cost tracking, caching, and evaluation capabilities.

Features

Multi-format Support: TXT, HTML, MD, DOCX, PDF, CSV, Excel, Parquet, Feather
Progressive Schema System: Simple → Nested → Multiple schemas for any complexity
Multi-Provider Support: OpenAI, Anthropic, Google, Groq, Together AI, Fireworks AI
Smart Processing: Configurable text splitting, relevance scoring, and filtering
Cost Optimization: Built-in cost tracking, caching, and budget management
Batch Processing: Parallel execution with checkpointing and resume capabilities
Comprehensive Evaluation: Performance metrics and cost analysis tools

Installation

# Clone the repository
git clone https://github.com/Center-for-Applied-AI/delm.git
cd delm

# Install from source
pip install -e .

Quick Start

Basic Usage

from pathlib import Path
from delm import DELM

# Initialize DELM from a pipeline config YAML
delm = DELM.from_yaml(
    config_path="example.config.yaml",
    experiment_name="my_experiment",
    experiment_directory=Path("experiments"),
)

# Process data
df = delm.prep_data("data/input.txt")
results = delm.process_via_llm()

# Get results
final_df = delm.get_extraction_results()
cost_summary = delm.get_cost_summary()

Configuration Files

DELM uses two configuration files:

1. Pipeline Configuration (config.yaml)

llm_extraction:
  provider: "openai"
  name: "gpt-4o-mini"
  temperature: 0.0
  batch_size: 10
  track_cost: true
  max_budget: 50.0

data_preprocessing:
  target_column: "text"
  splitting:
    type: "ParagraphSplit"
  scoring:
    type: "KeywordScorer"
    keywords: ["price", "forecast", "guidance"]

schema:
  spec_path: "schema_spec.yaml"

2. Schema Specification (schema_spec.yaml)

schema_type: "nested"
container_name: "commodities"

variables:
  - name: "commodity_type"
    description: "Type of commodity mentioned"
    data_type: "string"
    required: true
    allowed_values: ["oil", "gas", "copper", "gold"]
  
  - name: "price_value"
    description: "Price mentioned in text"
    data_type: "number"
    required: false
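
To make the `required` and `allowed_values` fields concrete, here is a minimal sketch of how an extracted record could be checked against the two variables above. It is an illustration only, not DELM's validation code: `VARIABLES` and `check_record` are hypothetical names, and the schema is written as plain dicts so the example needs no YAML parser.

```python
from typing import Any

# Mirror of the two variables in the schema specification above,
# written as plain dicts (hypothetical representation, not DELM's own).
VARIABLES = [
    {
        "name": "commodity_type",
        "data_type": "string",
        "required": True,
        "allowed_values": ["oil", "gas", "copper", "gold"],
    },
    {"name": "price_value", "data_type": "number", "required": False},
]


def check_record(record: dict[str, Any], variables: list[dict[str, Any]]) -> list[str]:
    """Return a list of problems; an empty list means the record passes."""
    problems = []
    for var in variables:
        name = var["name"]
        value = record.get(name)
        if value is None:
            # A missing value is only a problem for required variables.
            if var.get("required", False):
                problems.append(f"{name}: required but missing")
            continue
        allowed = var.get("allowed_values")
        if allowed is not None and value not in allowed:
            problems.append(f"{name}: {value!r} not in allowed values")
    return problems


print(check_record({"commodity_type": "oil", "price_value": 82.5}, VARIABLES))
print(check_record({"price_value": 82.5}, VARIABLES))
print(check_record({"commodity_type": "wheat"}, VARIABLES))
```

The first record passes, the second is missing a required variable, and the third uses a value outside `allowed_values`.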

Schema Types

DELM supports three levels of schema complexity:

Simple Schema (Level 1)

Extract key-value pairs from each text chunk:

schema_type: "simple"
variables:
  - name: "price"
    description: "Price mentioned"
    data_type: "number"
  - name: "company"
    description: "Company name"
    data_type: "string"

Nested Schema (Level 2)

Extract structured objects with multiple fields:

schema_type: "nested"
container_name: "commodities"
variables:
  - name: "type"
    description: "Commodity type"
    data_type: "string"
  - name: "price"
    description: "Price value"
    data_type: "number"

Multiple Schema (Level 3)

Extract multiple independent schemas simultaneously:

schema_type: "multiple"
commodities:
  schema_type: "nested"
  container_name: "commodities"
  variables: [...]
companies:
  schema_type: "nested"
  container_name: "companies"
  variables: [...]

Supported Data Types

Type	Description	Example
`string`	Text values	`"Apple Inc."`
`number`	Floating-point numbers	`150.5`
`integer`	Whole numbers	`2024`
`boolean`	True/False values	`true`
`[string]`	List of strings	`["oil", "gas"]`
`[number]`	List of numbers	`[100, 200, 300]`
`[integer]`	List of integers	`[1, 2, 3, 4]`
`[boolean]`	List of booleans	`[true, false, true]`
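
The type names in the table map naturally onto Python checks. The sketch below shows one way to test a value against a type name, including the bracketed list forms. `matches_type` and `SCALAR_CHECKS` are hypothetical helpers written for this illustration, and it assumes that whole numbers are acceptable where `number` is declared (as the `[number]` example in the table suggests); DELM's own handling may differ.

```python
from typing import Any, Callable

# One predicate per scalar type name. bool is a subclass of int in Python,
# so it is excluded explicitly from the numeric checks.
SCALAR_CHECKS: dict[str, Callable[[Any], bool]] = {
    "string": lambda v: isinstance(v, str),
    "boolean": lambda v: isinstance(v, bool),
    "integer": lambda v: isinstance(v, int) and not isinstance(v, bool),
    "number": lambda v: isinstance(v, (int, float)) and not isinstance(v, bool),
}


def matches_type(value: Any, type_name: str) -> bool:
    """Return True if value conforms to a type name such as 'number' or '[string]'."""
    if type_name.startswith("[") and type_name.endswith("]"):
        inner = type_name[1:-1]
        return isinstance(value, list) and all(matches_type(v, inner) for v in value)
    if type_name not in SCALAR_CHECKS:
        raise ValueError(f"unknown type name: {type_name}")
    return SCALAR_CHECKS[type_name](value)


print(matches_type("Apple Inc.", "string"))
print(matches_type([100, 200, 300], "[number]"))
print(matches_type(True, "integer"))
```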

Advanced Features

Cost Summary

# Get cost summary after extraction
cost_summary = delm.get_cost_summary()
print(f"Total cost: ${cost_summary['total_cost']}")

Semantic Caching

Reuses API responses from identical calls, so re-running an experiment does not spend API credits on requests that have already been answered.

semantic_cache:
  backend: "sqlite"        # sqlite, lmdb, filesystem
  path: ".delm_cache"
  max_size_mb: 512
  synchronous: "normal"    # sqlite only: "normal" or "full"
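
The idea behind reusing responses from identical calls can be sketched with the standard library alone. The example below keys each response on a hash of the request and stores it in SQLite. It is a simplified illustration, not DELM's cache: `ExactMatchCache`, `cached_call`, and the choice of fields that make up the key are all assumptions.

```python
import hashlib
import json
import sqlite3
from typing import Callable, Optional


class ExactMatchCache:
    """Tiny SQLite-backed cache keyed on a hash of the request (hypothetical)."""

    def __init__(self, path: str = ":memory:") -> None:
        self.conn = sqlite3.connect(path)
        self.conn.execute(
            "CREATE TABLE IF NOT EXISTS cache (key TEXT PRIMARY KEY, response TEXT NOT NULL)"
        )

    @staticmethod
    def make_key(model: str, prompt: str, temperature: float) -> str:
        # Serialize with sorted keys so identical requests always hash the same.
        payload = json.dumps(
            {"model": model, "prompt": prompt, "temperature": temperature},
            sort_keys=True,
        )
        return hashlib.sha256(payload.encode("utf-8")).hexdigest()

    def get(self, key: str) -> Optional[str]:
        row = self.conn.execute(
            "SELECT response FROM cache WHERE key = ?", (key,)
        ).fetchone()
        return None if row is None else row[0]

    def put(self, key: str, response: str) -> None:
        with self.conn:  # commits on success
            self.conn.execute(
                "INSERT OR REPLACE INTO cache (key, response) VALUES (?, ?)",
                (key, response),
            )


def cached_call(
    cache: ExactMatchCache,
    call_model: Callable[[str, str, float], str],
    model: str,
    prompt: str,
    temperature: float = 0.0,
) -> str:
    """Return the cached response if present; otherwise call the model and store it."""
    key = cache.make_key(model, prompt, temperature)
    hit = cache.get(key)
    if hit is not None:
        return hit
    response = call_model(model, prompt, temperature)
    cache.put(key, response)
    return response
```

Passing a file path instead of `":memory:"` makes the cache survive across runs, which is what allows a re-run to skip calls it has already paid for.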

Relevance Filtering

data_preprocessing:
  scoring:
    type: "KeywordScorer"
    keywords: ["price", "forecast", "guidance"]
  pandas_score_filter: "delm_score >= 0.7"

If a scorer is configured but no pandas_score_filter is provided, all chunks are kept (a warning is logged).
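
As a rough illustration of scoring followed by filtering, the sketch below scores each chunk by the fraction of configured keywords it contains and keeps chunks at or above a threshold. The scoring formula is an assumption made for this example (the actual `KeywordScorer` may compute `delm_score` differently), and `keyword_score` and `filter_chunks` are hypothetical names.

```python
def keyword_score(text: str, keywords: list[str]) -> float:
    """Fraction of keywords found in the text, case-insensitive (assumed formula)."""
    if not keywords:
        return 0.0
    lowered = text.lower()
    hits = sum(1 for kw in keywords if kw.lower() in lowered)
    return hits / len(keywords)


def filter_chunks(
    chunks: list[str], keywords: list[str], min_score: float
) -> list[tuple[str, float]]:
    """Score every chunk and keep those with score >= min_score."""
    scored = [(chunk, keyword_score(chunk, keywords)) for chunk in chunks]
    return [(chunk, score) for chunk, score in scored if score >= min_score]


keywords = ["price", "forecast", "guidance"]
chunks = [
    "The price forecast and guidance were both raised.",
    "Oil price rose on Tuesday.",
    "The weather was mild.",
]
for chunk, score in filter_chunks(chunks, keywords, min_score=0.7):
    print(f"{score:.2f}  {chunk}")
```

With these inputs only the first chunk survives a 0.7 threshold, since it is the only one that mentions all three keywords.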

Text Splitting Strategies

data_preprocessing:
  splitting:
    type: "ParagraphSplit"      # Split by paragraphs
    # type: "FixedWindowSplit"  # Split by sentence count
    # window: 5
    # stride: 2
    # type: "RegexSplit"        # Custom regex pattern
    # pattern: "\n\n"
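
The three strategies can be pictured with a few lines of standard-library Python. This is a sketch under stated assumptions rather than DELM's implementation: the function names are hypothetical, sentence boundaries are detected with a naive regex, and `stride` is taken to mean the number of sentences the window advances each step.

```python
import re


def paragraph_split(text: str) -> list[str]:
    """Split on blank lines and drop empty pieces."""
    return [p.strip() for p in re.split(r"\n\s*\n", text) if p.strip()]


def regex_split(text: str, pattern: str) -> list[str]:
    """Split on a custom regex (the pattern should not contain capturing groups)."""
    return [p.strip() for p in re.split(pattern, text) if p and p.strip()]


def fixed_window_split(text: str, window: int, stride: int) -> list[str]:
    """Slide a window of `window` sentences forward by `stride` sentences."""
    if window <= 0 or stride <= 0:
        raise ValueError("window and stride must be positive")
    # Naive sentence segmentation: split after ., ! or ? followed by whitespace.
    sentences = [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]
    chunks = []
    for start in range(0, len(sentences), stride):
        chunks.append(" ".join(sentences[start:start + window]))
        if start + window >= len(sentences):
            break  # the last window already reached the end of the text
    return chunks


print(paragraph_split("First para.\n\nSecond para.\n\n\nThird."))
print(fixed_window_split("A. B. C. D. E.", window=3, stride=2))
```

A stride smaller than the window produces overlapping chunks, which can help when the relevant context straddles a chunk boundary.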

Performance & Evaluation

Cost Estimation

Estimate the total cost of your current configuration before running the full extraction.

from delm.utils.cost_estimation import estimate_input_token_cost, estimate_total_cost

# Estimate input token costs without API calls
input_cost = estimate_input_token_cost(
    config="config.yaml",
    data_source="data.csv"
)
print(f"Input token cost: ${input_cost:.2f}")

# Estimate total costs using API calls on a sample
total_cost = estimate_total_cost(
    config="config.yaml",
    data_source="data.csv",
    sample_size=100
)
print(f"Estimated total cost: ${total_cost:.2f}")
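
The arithmetic behind a sample-based estimate is a simple scale-up. The sketch below assumes the estimate is the average cost per sampled record multiplied by the total number of records; whether `estimate_total_cost` does exactly this is not stated here, and `extrapolate_cost` plus the figures used are hypothetical.

```python
def extrapolate_cost(sample_cost: float, sample_size: int, total_records: int) -> float:
    """Scale the cost observed on a sample up to the full dataset."""
    if sample_size <= 0:
        raise ValueError("sample_size must be positive")
    return sample_cost / sample_size * total_records


# Hypothetical figures: a 100-record sample cost $0.42 and the dataset has 5,000 records.
estimate = extrapolate_cost(sample_cost=0.42, sample_size=100, total_records=5000)
max_budget = 50.0  # same value as max_budget in the pipeline configuration above
print(f"Estimated total cost: ${estimate:.2f}")
print(f"Within budget: {estimate <= max_budget}")
```

The estimate is only as good as the sample: if the sampled records are shorter or longer than typical, the projection shifts accordingly.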

Performance Evaluation

Estimate the performance of your current configuration before running the full extraction.

from delm.utils.performance_estimation import estimate_performance

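# Evaluate against human-labeled data.
# human_labeled_df must already be defined: a DataFrame of expected extractions
# that contains the columns named by true_json_column and matching_id_column.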
metrics, expected_and_extracted_df = estimate_performance(
    config="config.yaml",
    data_source="test_data.csv",
    expected_extraction_output_df=human_labeled_df,
    true_json_column="expected_json",
    matching_id_column="id",
    record_sample_size=50  # Optional: limit sample size
)

# Display performance metrics
for key, value in metrics.items():
    precision = value.get("precision", 0)
    recall = value.get("recall", 0)
    f1 = value.get("f1", 0)
    print(f"{key:<30} Precision: {precision:.3f}  Recall: {recall:.3f}  F1: {f1:.3f}")
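
For readers unfamiliar with the three metrics printed above, this sketch computes them for a single field by comparing the set of expected values with the set of extracted values. It uses the standard definitions; how `estimate_performance` matches values internally is not described here, and `precision_recall_f1` is a hypothetical helper.

```python
def precision_recall_f1(expected: set, extracted: set) -> dict[str, float]:
    """Standard precision, recall, and F1 over two sets of values."""
    true_positives = len(expected & extracted)
    precision = true_positives / len(extracted) if extracted else 0.0
    recall = true_positives / len(expected) if expected else 0.0
    if precision + recall == 0:
        f1 = 0.0
    else:
        f1 = 2 * precision * recall / (precision + recall)
    return {"precision": precision, "recall": recall, "f1": f1}


expected = {"oil", "gas", "gold"}
extracted = {"oil", "gas", "copper", "silver"}
scores = precision_recall_f1(expected, extracted)
print(
    f"Precision: {scores['precision']:.3f}  "
    f"Recall: {scores['recall']:.3f}  F1: {scores['f1']:.3f}"
)
```

Here two of the four extracted values are correct (precision 0.5) and two of the three expected values were found (recall about 0.667).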

Configuration Reference

Required Fields

llm_extraction.provider: LLM provider (openai, anthropic, google, etc.)
llm_extraction.name: Model name (gpt-4o-mini, claude-3-sonnet, etc.)
schema.spec_path: Path to schema specification file

Optional Fields with Defaults

llm_extraction.temperature: 0.0 (deterministic)
llm_extraction.batch_size: 10 (records per batch)
llm_extraction.max_workers: 1 (concurrent workers)
llm_extraction.track_cost: true (cost tracking)
semantic_cache.backend: "sqlite" (cache backend)

Additional LLM Fields

llm_extraction.max_retries: 3 (retry attempts)
llm_extraction.base_delay: 1.0 (seconds, exponential backoff base)
llm_extraction.dotenv_path: null (path to ".env" for credentials)
llm_extraction.model_input_cost_per_1M_tokens: null (override pricing)
llm_extraction.model_output_cost_per_1M_tokens: null (override pricing)
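
To show how `max_retries` and `base_delay` typically interact, here is a minimal retry loop with exponential backoff. It assumes the delay doubles after each failure and that `max_retries` counts retries after the first attempt; DELM's exact policy may differ, and `call_with_retries` is a hypothetical helper. The `sleep` parameter is injectable so the logic can be exercised without waiting.

```python
import time
from typing import Callable, TypeVar

T = TypeVar("T")


def call_with_retries(
    func: Callable[[], T],
    max_retries: int = 3,
    base_delay: float = 1.0,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Call func, retrying on any exception with delays of base_delay * 2**attempt."""
    attempt = 0
    while True:
        try:
            return func()
        except Exception:
            if attempt >= max_retries:
                raise  # out of retries: surface the last error
            sleep(base_delay * (2 ** attempt))
            attempt += 1
```

With the defaults shown in the list above, a call that keeps failing is attempted four times in total, with waits of 1, 2, and 4 seconds between attempts.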

If using providers not present in the built-in pricing DB, set both model_input_cost_per_1M_tokens and model_output_cost_per_1M_tokens, or set track_cost: false.
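
The two pricing overrides are expressed per one million tokens, so the cost of a run follows directly from token counts. The sketch below shows that arithmetic; `request_cost` is a hypothetical helper and the prices are illustrative placeholders, not figures from DELM's pricing database.

```python
def request_cost(
    input_tokens: int,
    output_tokens: int,
    input_cost_per_1m: float,
    output_cost_per_1m: float,
) -> float:
    """Cost in dollars given token counts and per-1M-token prices."""
    return (
        input_tokens / 1_000_000 * input_cost_per_1m
        + output_tokens / 1_000_000 * output_cost_per_1m
    )


# Illustrative prices only: $0.15 per 1M input tokens, $0.60 per 1M output tokens.
cost = request_cost(
    input_tokens=2_000_000,
    output_tokens=500_000,
    input_cost_per_1m=0.15,
    output_cost_per_1m=0.60,
)
print(f"Total cost: ${cost:.2f}")
```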

Data Preprocessing Fields

data_preprocessing.drop_target_column: false
data_preprocessing.pandas_score_filter: null (e.g., "delm_score >= 0.7")
data_preprocessing.preprocessed_data_path: null (path to ".feather" with delm_text_chunk and delm_chunk_id; when set, omit splitting/scoring/filter fields)

Semantic Cache Fields

semantic_cache.backend: "sqlite" | "lmdb" | "filesystem"
semantic_cache.path: ".delm_cache"
semantic_cache.max_size_mb: 512
semantic_cache.synchronous: "normal" | "full" (sqlite only)

Experiment Storage & Logging

Disk storage (default): checkpointing, resume, and results persisted under delm_experiments/<experiment_name>/.
In-memory storage: use_disk_storage=False for fast prototyping (no persistence, no resume).
Logging: by default, rotating file logs under delm_logs/<experiment_name>/ when save_file_log=True.
- Tunables: save_file_log, log_dir, console_log_level, file_log_level, override_logging.
- Or call delm.logging.configure(...) directly.
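
Checkpointing and resume can be understood as "persist each finished batch, skip it next time". The sketch below illustrates that pattern with a JSON file. It is not how DELM lays out its experiment directory: `run_with_checkpoints`, the checkpoint file name, and the JSON format are all assumptions made for the example.

```python
import json
import tempfile
from pathlib import Path
from typing import Callable


def run_with_checkpoints(
    batches: list[list[str]],
    process_batch: Callable[[list[str]], list[str]],
    checkpoint_path: Path,
) -> list[list[str]]:
    """Process batches in order, saving results after each one so a rerun can resume."""
    done: dict[str, list[str]] = {}
    if checkpoint_path.exists():
        done = json.loads(checkpoint_path.read_text())
    for batch_id, batch in enumerate(batches):
        key = str(batch_id)  # JSON object keys are strings
        if key in done:
            continue  # finished in an earlier run
        done[key] = process_batch(batch)
        checkpoint_path.write_text(json.dumps(done))
    return [done[str(i)] for i in range(len(batches))]


# Demo: the second call finds every batch in the checkpoint and does no new work.
demo_path = Path(tempfile.mkdtemp()) / "checkpoint.json"
demo_batches = [["a", "b"], ["c"]]
print(run_with_checkpoints(demo_batches, lambda b: [s.upper() for s in b], demo_path))
print(run_with_checkpoints(demo_batches, lambda b: [s.upper() for s in b], demo_path))
```

An in-memory variant would simply keep `done` in a dict and skip the file, which matches the trade-off described above: faster to set up, but nothing survives an interruption.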

Architecture

Core Components

DataProcessor: Handles loading, splitting, and scoring
SchemaManager: Manages schema loading and validation
ExtractionManager: Orchestrates LLM extraction
ExperimentManager: Handles experiment state and checkpointing
CostTracker: Monitors API costs and budgets

Strategy Classes

SplitStrategy: Text chunking (Paragraph, FixedWindow, Regex)
RelevanceScorer: Content scoring (Keyword, Fuzzy)
SchemaRegistry: Schema type management

Estimation Functions

estimate_input_token_cost: Estimate input token costs without API calls
estimate_total_cost: Estimate total costs using API calls on a sample
estimate_performance: Evaluate extraction performance against human-labeled data

File Format Support

Format	Extension	Requirements
Text	`.txt`	Built-in
HTML/Markdown	`.html`, `.htm`, `.md`	`beautifulsoup4`
Word Documents	`.docx`	`python-docx`
PDF	`.pdf`	`marker` (OCR)
CSV	`.csv`	`pandas`
Excel	`.xlsx`	`openpyxl`
Parquet	`.parquet`	`pyarrow`
Feather	`.feather`	`pyarrow`

Documentation

Local MkDocs Site

Install the documentation dependencies: pip install -e .[docs]
Serve the docs locally: mkdocs serve
Open http://127.0.0.1:8000/ in your browser to explore the site.

Use mkdocs build to generate a static site in the site/ directory when you need a distributable bundle.

Reference Materials

Schema Reference - Detailed schema configuration guide
Configuration Examples - Complete configuration templates
Schema Examples - Schema specification templates

Acknowledgments

Built on Instructor for structured outputs
Uses Marker for PDF processing
Developed at the Center for Applied AI at Chicago Booth

Project details

Release history

1.1.0

Feb 24, 2026

1.0.3

Feb 13, 2026

1.0.1

Jan 13, 2026

1.0.0

Nov 29, 2025

0.1.4

Sep 23, 2025

This version

0.1.3

Sep 16, 2025

Download files


Source Distribution

delm-0.1.3.tar.gz (68.1 kB view details)

Uploaded Sep 16, 2025 Source

Built Distribution


delm-0.1.3-py3-none-any.whl (74.9 kB view details)

Uploaded Sep 16, 2025 Python 3

File details

Details for the file delm-0.1.3.tar.gz.

File metadata

Download URL: delm-0.1.3.tar.gz
Upload date: Sep 16, 2025
Size: 68.1 kB
Tags: Source
Uploaded using Trusted Publishing? Yes
Uploaded via: twine/6.1.0 CPython/3.13.7

File hashes

Hashes for delm-0.1.3.tar.gz
Algorithm	Hash digest
SHA256	`9db467bade15823618260c372f5d659058e299c759b5f7ae841dc771b5af8c09`
MD5	`49eb3fdc8885e96aa80ed380838b28df`
BLAKE2b-256	`3e91e1e6ac4b582a3c19700676804c7365f7ceb217cf3d1940b3d39cd9274bfb`


Provenance

The following attestation bundles were made for delm-0.1.3.tar.gz:

Publisher: publish.yml on Center-for-Applied-AI/delm

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.

Statement:
- Statement type: https://in-toto.io/Statement/v1
- Predicate type: https://docs.pypi.org/attestations/publish/v1
- Subject name: delm-0.1.3.tar.gz
- Subject digest: 9db467bade15823618260c372f5d659058e299c759b5f7ae841dc771b5af8c09
- Sigstore transparency entry: 525846206
- Sigstore integration time: Sep 16, 2025
Source repository:
- Permalink: Center-for-Applied-AI/delm@b361180fa10765e76aa29e7e68a55e7ffe9f7aab
- Branch / Tag: refs/tags/v0.1.3
- Owner: https://github.com/Center-for-Applied-AI
- Access: public
Publication detail:
- Token Issuer: https://token.actions.githubusercontent.com
- Runner Environment: github-hosted
- Publication workflow: publish.yml@b361180fa10765e76aa29e7e68a55e7ffe9f7aab
- Trigger Event: release

File details

Details for the file delm-0.1.3-py3-none-any.whl.

File metadata

Download URL: delm-0.1.3-py3-none-any.whl
Upload date: Sep 16, 2025
Size: 74.9 kB
Tags: Python 3
Uploaded using Trusted Publishing? Yes
Uploaded via: twine/6.1.0 CPython/3.13.7

File hashes

Hashes for delm-0.1.3-py3-none-any.whl
Algorithm	Hash digest
SHA256	`36d054f3af29968e8cdf7c05bb55c77f986baa9f69c227b7c17cd962f340f86a`
MD5	`dd7c6f0a6539011bcb3223f39c113730`
BLAKE2b-256	`c6f617c97332132499ca1a818c733d5584e7b5220f04c995bcc4bcf633d1af4c`


Provenance

The following attestation bundles were made for delm-0.1.3-py3-none-any.whl:

Publisher: publish.yml on Center-for-Applied-AI/delm

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.

Statement:
- Statement type: https://in-toto.io/Statement/v1
- Predicate type: https://docs.pypi.org/attestations/publish/v1
- Subject name: delm-0.1.3-py3-none-any.whl
- Subject digest: 36d054f3af29968e8cdf7c05bb55c77f986baa9f69c227b7c17cd962f340f86a
- Sigstore transparency entry: 525846476
- Sigstore integration time: Sep 16, 2025
Source repository:
- Permalink: Center-for-Applied-AI/delm@b361180fa10765e76aa29e7e68a55e7ffe9f7aab
- Branch / Tag: refs/tags/v0.1.3
- Owner: https://github.com/Center-for-Applied-AI
- Access: public
Publication detail:
- Token Issuer: https://token.actions.githubusercontent.com
- Runner Environment: github-hosted
- Publication workflow: publish.yml@b361180fa10765e76aa29e7e68a55e7ffe9f7aab
- Trigger Event: release
