
Data Element Extractor

An adaptable Python framework for extracting structured data from unstructured text using Large Language Models (LLMs). This library provides a flexible system for managing and extracting data across multiple topics, with support for categorical classification, number extraction, date parsing, and custom text extraction.

Features

  • Multiple LLM Backends: Support for Hugging Face Transformers models, OpenAI API, DeepInfra, and custom inference servers
  • Topic Management: Define and manage multiple extraction topics with custom prompts and categories
  • Conditional Extraction: Extract data elements conditionally based on previous extraction results
  • Prompt Optimization: Iterative prompt improvement and performance evaluation tools
  • Flexible Data Types: Support for categorical (value list), number, date, and text extraction
  • Batch Processing: Extract data from CSV files with batch processing capabilities
  • Server Integration: Load data elements and lists from remote CDE (Common Data Element) servers
  • Graphical UI: Built-in Tkinter-based user interface for interactive data extraction
  • Prompt Generation: Automatic prompt creation and few-shot learning support

Installation

Install the package using pip:

pip install data-element-extractor

Dependencies

The package requires:

  • torch - PyTorch for local model inference
  • transformers - Hugging Face Transformers library
  • openai - OpenAI Python client
  • dateparser - Date parsing utilities
  • requests - HTTP client for server communication
  • tk - Tkinter for the UI (usually included with Python)

Quick Start

Basic Usage

from data_element_extractor import DataElementExtractor

# Initialize the extractor
extractor = DataElementExtractor()

# Set up your LLM model
# Option 1: Use a local Transformers model
extractor.set_model("microsoft/Phi-3-mini-4k-instruct", model_type="Transformers")

# Option 2: Use OpenAI API
# extractor.set_model("gpt-3.5-turbo", model_type="OpenAI", api_key="your-api-key")

# Add a categorical topic (classification)
topic_id = extractor.add_topic(
    topic_name="Sentiment",
    topic_data=["Positive", "Negative", "Neutral"]
)

# Add a number extraction topic
topic_id_2 = extractor.add_topic(
    topic_name="Age",
    topic_data="number"
)

# Extract data from text
text = "The customer was very happy with the service. They are 25 years old."
results, probabilities = extractor.extract(text)

print(f"Sentiment: {results[0]} (confidence: {probabilities[0]:.2%})")
print(f"Age: {results[1]} (confidence: {probabilities[1]:.2%})")

Working with CSV Files

# Extract data from a CSV file
results = extractor.extract_from_table(
    csv_file_path="data.csv",
    delimiter=";",
    batch_size=100,
    constrained_output=True
)

Core Concepts

Topics

A topic defines what data you want to extract from text. Each topic has:

  • Name: A descriptive name for the data element (e.g., "Sentiment", "Age")
  • Data Type: The type of extraction:
    • List of categories (categorical/classification)
    • "number" - Extract numeric values
    • "date" - Extract dates
    • "text" - Extract free-form text
  • Prompt: Custom instruction template for the LLM
  • Condition: Optional condition to control when extraction runs
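The four `topic_data` forms above can be illustrated with plain Python values. This is a sketch of the accepted shapes and how they could be told apart, not the library's internal code:

```python
# The four shapes add_topic accepts for topic_data:
topic_specs = {
    "Sentiment": ["Positive", "Negative", "Neutral"],  # categorical
    "Age": "number",                                   # numeric extraction
    "Admission Date": "date",                          # date extraction
    "Chief Complaint": "text",                         # free-form text
}

def data_kind(topic_data):
    """Classify a topic_data value into one of the four documented kinds."""
    if isinstance(topic_data, list):
        return "categorical"
    if topic_data in ("number", "date", "text"):
        return topic_data
    raise ValueError(f"unsupported topic_data: {topic_data!r}")

for name, spec in topic_specs.items():
    print(f"{name}: {data_kind(spec)}")
```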

Model Configuration

The library supports multiple inference backends:

Local Transformers Models

extractor.set_model(
    model_name="microsoft/Phi-3-mini-4k-instruct",
    model_type="Transformers",
    inference_type="transformers",
    attn_implementation="flash_attention_2",
    move_to_gpu=True
)

OpenAI API

extractor.set_model(
    model_name="gpt-3.5-turbo",
    model_type="OpenAI",
    api_key="your-api-key",
    inference_type="cloud"
)

DeepInfra

extractor.set_model(
    model_name="meta-llama/Llama-3-70b-chat-hf",
    model_type="DeepInfra",
    api_key="your-api-key",
    inference_type="cloud"
)

Custom Inference Server

extractor.set_inference_server_url("http://127.0.0.1:5000")
extractor.set_model(
    model_name="your-model-name",
    model_type="Transformers",
    inference_type="server"
)

Conditional Extraction

You can make topic extraction conditional on previous results:

# First topic extracts a category
topic1_id = extractor.add_topic(
    topic_name="Document Type",
    topic_data=["Contract", "Invoice", "Receipt"]
)

# Second topic only extracts if first topic was "Contract"
topic2_id = extractor.add_topic(
    topic_name="Contract Date",
    topic_data="date",
    condition="T1 == 'Contract'"
)

Conditions reference topic IDs (e.g., T1, T2) and can check for:

  • Category matches: T1 == 'CategoryName'
  • Non-empty values: T1 != ''
  • Complex expressions using and, or, not
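The intended semantics can be sketched with a small evaluator over previous results. The library's actual condition engine may differ; this only illustrates that conditions are plain boolean expressions over topic IDs:

```python
def evaluate_condition(condition, results):
    """Evaluate a condition string such as "T1 == 'Contract'" against a
    mapping of topic IDs to extracted values. Illustrative only: a
    restricted eval with the topic results as the only visible names."""
    if not condition:
        return True  # no condition means the topic always runs
    return bool(eval(condition, {"__builtins__": {}}, dict(results)))

results = {"T1": "Contract", "T2": ""}
print(evaluate_condition("T1 == 'Contract'", results))             # True
print(evaluate_condition("T2 != ''", results))                     # False
print(evaluate_condition("T1 == 'Contract' and T2 == ''", results))  # True
```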

Prompt Customization

Customize prompts for each topic:

topic_id = extractor.add_topic(
    topic_name="Medical Condition",
    topic_data=["Diabetes", "Hypertension", "Asthma"],
    prompt="You are a medical expert. Classify the following medical text into one of the categories: [CATEGORIES]. Text: [TEXT]. Category:"
)

The library automatically replaces:

  • [TOPIC] - The topic name
  • [TEXT] - The input text
  • [CATEGORIES] - The list of categories (for categorical topics)
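The substitution itself amounts to string replacement of the three documented placeholders. A minimal sketch (the `render_prompt` helper is hypothetical, not part of the library's API):

```python
def render_prompt(template, topic_name, text, categories=None):
    """Fill the documented placeholders into a prompt template."""
    prompt = template.replace("[TOPIC]", topic_name).replace("[TEXT]", text)
    if categories is not None:
        # Categories only apply to categorical (value list) topics.
        prompt = prompt.replace("[CATEGORIES]", ", ".join(categories))
    return prompt

template = ("Classify the text about [TOPIC] into one of the categories: "
            "[CATEGORIES]. Text: [TEXT]. Category:")
print(render_prompt(template, "Sentiment", "Great service!",
                    ["Positive", "Negative", "Neutral"]))
```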

Advanced Features

Prompt Optimization

Evaluate and improve prompt performance:

# Evaluate current prompt performance
performance = extractor.evaluate_prompt_performance_for_topic(
    topic_id="T1",
    dataset_path="evaluation_data.csv",
    truth_col=1,
    text_col=0,
    delimiter=";"
)

# Iteratively improve a prompt
extractor.iteratively_improve_prompt(
    topic_id="T1",
    dataset_path="training_data.csv",
    text_column_index=0,
    ground_truth_column_index=1,
    num_iterations=3,
    delimiter=";"
)

Few-Shot Learning

Generate few-shot prompts automatically:

# Create few-shot prompt for a single topic
extractor.create_few_shot_prompt(
    topic_id="T1",
    csv_path="examples.csv",
    text_col_idx=0,
    label_col_idx=1,
    delimiter=";",
    num_examples=3
)

# Create few-shot prompts for all topics
extractor.create_few_shot_prompts_for_all_topics(
    csv_path="examples.csv",
    delimiter=";",
    num_examples=3
)

Server Integration

Load data elements from remote servers:

# Get all available CDEs (Common Data Elements)
all_cdes = extractor.get_all_cdes_from_server()

# Get CDE lists
cde_lists = extractor.get_cde_lists_from_server()

# Load a data element list
topics = extractor.load_data_element_list_from_server(cde_list_id="list-123")

# Load a single data element
topics = extractor.load_data_element_from_server(cde_id="cde-456")

Topic Management

# Get topic information
topic = extractor.get_topic_by_id("T1")
topic_id = extractor.get_topic_id_by_name("Sentiment")

# Modify topics
extractor.set_prompt(topic_id="T1", new_prompt="New prompt text")
extractor.increase_topic_order("T1")
extractor.decrease_topic_order("T1")

# Category management
extractor.add_category(topic_id="T1", category_name="New Category")
extractor.remove_category(topic_id="T1", category_id="cat-uuid")

# Category conditions
extractor.add_category_condition(topic_id="T1", category_id="cat-uuid", condition_str="T2 > 10")

# Save and load topics
extractor.save_topics("topics.json")
extractor.load_topics("topics.json")

# Display all topics
extractor.show_topics_and_categories()

Thinking/Chain-of-Thought

Configure chain-of-thought reasoning for improved accuracy:

# Set global thinking config
extractor.thinking_config = {
    "enabled": True,
    "temperature": 0.7,
    "max_length": 500
}

# Or configure per topic when adding
topic_id = extractor.add_topic(
    topic_name="Complex Classification",
    topic_data=["Category A", "Category B"],
    thinking_config={
        "enabled": True,
        "temperature": 0.5
    }
)

Configuration

# Configure choice symbols for categorical output
# Options: "none", "alphabetical", "numerical", or custom list like "A,B,C,D"
extractor.set_choice_symbol_config("alphabetical")

# Set inference server URL
extractor.set_inference_server_url("http://127.0.0.1:5000")

User Interface

The library includes a graphical user interface built with Tkinter:

from data_element_extractor.ui.main_app import ExtractorApp
import tkinter as tk

root = tk.Tk()
app = ExtractorApp(root)
root.mainloop()

Or use the UI module directly:

from data_element_extractor import ui
# Launch UI (if available)

The UI provides:

  • Model configuration management
  • Topic creation and editing
  • Interactive extraction interface
  • Prompt editing and optimization
  • CSV file processing

API Reference

Main Class

DataElementExtractor()

Main class for data extraction.

Model Management:

  • set_model(model_name, model_type="Transformers", api_key="", inference_type="transformers", ...) - Configure the main extraction model
  • set_prompt_model(model_name, model_type="OpenAI", ...) - Configure model for prompt generation
  • set_model_as_prompt_model() - Use main model for prompt generation

Topic Management:

  • add_topic(topic_name, topic_data, condition="", prompt="", thinking_config={}) - Add a new extraction topic
  • get_topic_by_id(topic_id) - Get topic by ID
  • get_topic_id_by_name(topic_name) - Get topic ID by name
  • update_topics(topics) - Update all topics
  • remove_topic(topic_id_str) - Remove a topic
  • save_topics(filename) - Save topics to file
  • load_topics(filename) - Load topics from file
  • show_topics_and_categories() - Display all topics

Extraction:

  • extract(text, is_single_extraction=True, constrained_output=True, with_evaluation=False, ground_truth_row=None) - Extract data from text
  • extract_element(topic_id, text, constrained_output=False, thinking_data=None) - Extract a single element
  • extract_from_table(csv_file_path, delimiter=';', batch_size=100, ...) - Extract from CSV file

Prompt Optimization:

  • evaluate_prompt_performance_for_topic(topic_id, truth_col, dataset_path, ...) - Evaluate prompt performance
  • iteratively_improve_prompt(topic_id, dataset_path, ...) - Improve prompt iteratively
  • create_few_shot_prompt(topic_id, csv_path, ...) - Generate few-shot prompt

Examples

Example 1: Document Classification

from data_element_extractor import DataElementExtractor

extractor = DataElementExtractor()
extractor.set_model("microsoft/Phi-3-mini-4k-instruct", model_type="Transformers")

# Classify document type
extractor.add_topic(
    topic_name="Document Type",
    topic_data=["Invoice", "Receipt", "Contract", "Letter"]
)

# Extract date (conditional)
extractor.add_topic(
    topic_name="Document Date",
    topic_data="date",
    condition="T1 != ''"
)

text = "Invoice dated March 15, 2024 for services rendered."
results, probs = extractor.extract(text)
print(f"Type: {results[0]}, Date: {results[1]}")

Example 2: Medical Data Extraction

extractor = DataElementExtractor()
extractor.set_model("gpt-3.5-turbo", model_type="OpenAI", api_key="your-key")

# Extract diagnosis
extractor.add_topic(
    topic_name="Diagnosis",
    topic_data=["Diabetes", "Hypertension", "Asthma", "Other"]
)

# Extract age (number)
extractor.add_topic(
    topic_name="Patient Age",
    topic_data="number"
)

# Extract date of diagnosis
extractor.add_topic(
    topic_name="Diagnosis Date",
    topic_data="date",
    condition="T1 != 'Other'"
)

medical_text = "Patient is 45 years old, diagnosed with Diabetes on 2023-01-15."
results, _ = extractor.extract(medical_text)

Requirements

  • Python >= 3.7
  • PyTorch (for local models)
  • transformers
  • openai
  • dateparser
  • requests

License

MIT License

Author

Fabio Dennstädt (fabiodennstaedt@gmx.de)

Contributing

Contributions are welcome! Please feel free to submit a Pull Request.

Support

For issues, questions, or contributions, please open an issue on the project repository.
