Data Element Extractor
An adaptable Python framework for extracting structured data from unstructured text using Large Language Models (LLMs). This library provides a flexible system for managing and extracting data across multiple topics, with support for categorical classification, number extraction, date parsing, and custom text extraction.
Features
- Multiple LLM Backends: Support for Hugging Face Transformers models, OpenAI API, DeepInfra, and custom inference servers
- Topic Management: Define and manage multiple extraction topics with custom prompts and categories
- Conditional Extraction: Extract data elements conditionally based on previous extraction results
- Prompt Optimization: Iterative prompt improvement and performance evaluation tools
- Flexible Data Types: Support for categorical (value list), number, date, and text extraction
- Batch Processing: Extract data from CSV files with batch processing capabilities
- Server Integration: Load data elements and lists from remote CDE (Common Data Element) servers
- Graphical UI: Built-in Tkinter-based user interface for interactive data extraction
- Prompt Generation: Automatic prompt creation and few-shot learning support
Installation
Install the package using pip:
pip install data-element-extractor
Dependencies
The package requires:
- torch - PyTorch for local model inference
- transformers - Hugging Face Transformers library
- openai - OpenAI Python client
- dateparser - Date parsing utilities
- requests - HTTP client for server communication
- tk - Tkinter for the UI (usually included with Python)
Quick Start
Basic Usage
from data_element_extractor import DataElementExtractor
# Initialize the extractor
extractor = DataElementExtractor()
# Set up your LLM model
# Option 1: Use a local Transformers model
extractor.set_model("microsoft/Phi-3-mini-4k-instruct", model_type="Transformers")
# Option 2: Use OpenAI API
# extractor.set_model("gpt-3.5-turbo", model_type="OpenAI", api_key="your-api-key")
# Add a categorical topic (classification)
topic_id = extractor.add_topic(
    topic_name="Sentiment",
    topic_data=["Positive", "Negative", "Neutral"]
)
# Add a number extraction topic
topic_id_2 = extractor.add_topic(
    topic_name="Age",
    topic_data="number"
)
# Extract data from text
text = "The customer was very happy with the service. She is 25 years old."
results, probabilities = extractor.extract(text)
print(f"Sentiment: {results[0]} (confidence: {probabilities[0]:.2%})")
print(f"Age: {results[1]} (confidence: {probabilities[1]:.2%})")
Working with CSV Files
# Extract data from a CSV file
results = extractor.extract_from_table(
    csv_file_path="data.csv",
    delimiter=";",
    batch_size=100,
    constrained_output=True
)
Core Concepts
Topics
A topic defines what data you want to extract from text. Each topic has:
- Name: A descriptive name for the data element (e.g., "Sentiment", "Age")
- Data Type: The type of extraction:
- List of categories (categorical/classification)
- "number" - Extract numeric values
- "date" - Extract dates
- "text" - Extract free-form text
- Prompt: Custom instruction template for the LLM
- Condition: Optional condition to control when extraction runs
Model Configuration
The library supports multiple inference backends:
Local Transformers Models
extractor.set_model(
    model_name="microsoft/Phi-3-mini-4k-instruct",
    model_type="Transformers",
    inference_type="transformers",
    attn_implementation="flash_attention_2",
    move_to_gpu=True
)
OpenAI API
extractor.set_model(
    model_name="gpt-3.5-turbo",
    model_type="OpenAI",
    api_key="your-api-key",
    inference_type="cloud"
)
DeepInfra
extractor.set_model(
    model_name="meta-llama/Llama-3-70b-chat-hf",
    model_type="DeepInfra",
    api_key="your-api-key",
    inference_type="cloud"
)
Custom Inference Server
extractor.set_inference_server_url("http://127.0.0.1:5000")
extractor.set_model(
    model_name="your-model-name",
    model_type="Transformers",
    inference_type="server"
)
Conditional Extraction
You can make topic extraction conditional on previous results:
# First topic extracts a category
topic1_id = extractor.add_topic(
    topic_name="Document Type",
    topic_data=["Contract", "Invoice", "Receipt"]
)
# Second topic only extracts if first topic was "Contract"
topic2_id = extractor.add_topic(
    topic_name="Contract Date",
    topic_data="date",
    condition="T1 == 'Contract'"
)
Conditions reference topic IDs (e.g., T1, T2) and can check for:
- Category matches: T1 == 'CategoryName'
- Non-empty values: T1 != ''
- Complex expressions using and, or, not
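One way to picture how such condition strings can be evaluated (this evaluator is a hypothetical sketch, not the library's actual implementation): substitute each topic ID with its previously extracted value, then evaluate the resulting boolean expression.

```python
import re

def evaluate_condition(condition, results):
    """Evaluate a condition string like "T1 == 'Contract'" against a
    dict of previous extraction results, e.g. {"T1": "Contract"}."""
    # Replace each topic ID (T1, T2, ...) with its value as a string literal
    expr = re.sub(
        r"\bT\d+\b",
        lambda m: repr(results.get(m.group(0), "")),
        condition,
    )
    # Evaluate with no builtins available, keeping the expression inert
    return bool(eval(expr, {"__builtins__": {}}, {}))

results = {"T1": "Contract"}
evaluate_condition("T1 == 'Contract'", results)                # True
evaluate_condition("T1 != '' and T1 != 'Invoice'", results)    # True
```

Because the comparison operators and and/or/not are plain Python syntax, all three condition forms listed above fall out of the same substitution step.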
Prompt Customization
Customize prompts for each topic:
topic_id = extractor.add_topic(
    topic_name="Medical Condition",
    topic_data=["Diabetes", "Hypertension", "Asthma"],
    prompt="You are a medical expert. Classify the following medical text into one of the categories: [CATEGORIES]. Text: [TEXT]. Category:"
)
The library automatically replaces:
- [TOPIC] - The topic name
- [TEXT] - The input text
- [CATEGORIES] - The list of categories (for categorical topics)
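The substitution itself is plain string replacement; a minimal sketch of the idea (the helper name and the joined category format are assumptions, not the library's internals):

```python
def fill_prompt(template, topic, text, categories=None):
    """Replace the [TOPIC], [TEXT], and [CATEGORIES] placeholders."""
    prompt = template.replace("[TOPIC]", topic).replace("[TEXT]", text)
    if categories is not None:
        prompt = prompt.replace("[CATEGORIES]", ", ".join(categories))
    return prompt

template = "Classify the [TOPIC] of the text into [CATEGORIES]. Text: [TEXT]. Category:"
prompt = fill_prompt(template, "Sentiment", "Great service!",
                     ["Positive", "Negative", "Neutral"])
# → "Classify the Sentiment of the text into Positive, Negative, Neutral. Text: Great service!. Category:"
```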
Advanced Features
Prompt Optimization
Evaluate and improve prompt performance:
# Evaluate current prompt performance
performance = extractor.evaluate_prompt_performance_for_topic(
    topic_id="T1",
    dataset_path="evaluation_data.csv",
    truth_col=1,
    text_col=0,
    delimiter=";"
)
# Iteratively improve a prompt
extractor.iteratively_improve_prompt(
    topic_id="T1",
    dataset_path="training_data.csv",
    text_column_index=0,
    ground_truth_column_index=1,
    num_iterations=3,
    delimiter=";"
)
Few-Shot Learning
Generate few-shot prompts automatically:
# Create few-shot prompt for a single topic
extractor.create_few_shot_prompt(
    topic_id="T1",
    csv_path="examples.csv",
    text_col_idx=0,
    label_col_idx=1,
    delimiter=";",
    num_examples=3
)
# Create few-shot prompts for all topics
extractor.create_few_shot_prompts_for_all_topics(
    csv_path="examples.csv",
    delimiter=";",
    num_examples=3
)
Server Integration
Load data elements from remote servers:
# Get all available CDEs (Common Data Elements)
all_cdes = extractor.get_all_cdes_from_server()
# Get CDE lists
cde_lists = extractor.get_cde_lists_from_server()
# Load a data element list
topics = extractor.load_data_element_list_from_server(cde_list_id="list-123")
# Load a single data element
topics = extractor.load_data_element_from_server(cde_id="cde-456")
Topic Management
# Get topic information
topic = extractor.get_topic_by_id("T1")
topic_id = extractor.get_topic_id_by_name("Sentiment")
# Modify topics
extractor.set_prompt(topic_id="T1", new_prompt="New prompt text")
extractor.increase_topic_order("T1")
extractor.decrease_topic_order("T1")
# Category management
extractor.add_category(topic_id="T1", category_name="New Category")
extractor.remove_category(topic_id="T1", category_id="cat-uuid")
# Category conditions
extractor.add_category_condition(topic_id="T1", category_id="cat-uuid", condition_str="T2 > 10")
# Save and load topics
extractor.save_topics("topics.json")
extractor.load_topics("topics.json")
# Display all topics
extractor.show_topics_and_categories()
Thinking/Chain-of-Thought
Configure chain-of-thought reasoning for improved accuracy:
# Set global thinking config
extractor.thinking_config = {
    "enabled": True,
    "temperature": 0.7,
    "max_length": 500
}
# Or configure per topic when adding
topic_id = extractor.add_topic(
    topic_name="Complex Classification",
    topic_data=["Category A", "Category B"],
    thinking_config={
        "enabled": True,
        "temperature": 0.5
    }
)
Configuration
# Configure choice symbols for categorical output
# Options: "none", "alphabetical", "numerical", or custom list like "A,B,C,D"
extractor.set_choice_symbol_config("alphabetical")
# Set inference server URL
extractor.set_inference_server_url("http://127.0.0.1:5000")
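For illustration, here is how the four symbol schemes might map onto a category list (a sketch only; the library's exact rendering of choices is not documented here):

```python
import string

def choice_symbols(categories, config):
    """Pair each category with a choice symbol per the chosen scheme."""
    if config == "none":
        symbols = categories  # categories stand for themselves
    elif config == "alphabetical":
        symbols = string.ascii_uppercase[:len(categories)]
    elif config == "numerical":
        symbols = [str(i + 1) for i in range(len(categories))]
    else:  # custom comma-separated list like "A,B,C,D"
        symbols = config.split(",")[:len(categories)]
    return list(zip(symbols, categories))

cats = ["Positive", "Negative", "Neutral"]
choice_symbols(cats, "alphabetical")
# [('A', 'Positive'), ('B', 'Negative'), ('C', 'Neutral')]
```

Short symbols like "A"/"B"/"C" give constrained decoding a small, unambiguous token set to choose from, which is the usual motivation for this option.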
User Interface
The library includes a graphical user interface built with Tkinter:
from data_element_extractor.ui.main_app import ExtractorApp
import tkinter as tk
root = tk.Tk()
app = ExtractorApp(root)
root.mainloop()
Or import the UI module directly:
from data_element_extractor import ui
The UI provides:
- Model configuration management
- Topic creation and editing
- Interactive extraction interface
- Prompt editing and optimization
- CSV file processing
API Reference
Main Class
DataElementExtractor()
Main class for data extraction.
Model Management:
- set_model(model_name, model_type="Transformers", api_key="", inference_type="transformers", ...) - Configure the main extraction model
- set_prompt_model(model_name, model_type="OpenAI", ...) - Configure the model used for prompt generation
- set_model_as_prompt_model() - Use the main model for prompt generation
Topic Management:
- add_topic(topic_name, topic_data, condition="", prompt="", thinking_config={}) - Add a new extraction topic
- get_topic_by_id(topic_id) - Get a topic by ID
- get_topic_id_by_name(topic_name) - Get a topic ID by name
- update_topics(topics) - Update all topics
- remove_topic(topic_id_str) - Remove a topic
- save_topics(filename) - Save topics to a file
- load_topics(filename) - Load topics from a file
- show_topics_and_categories() - Display all topics
Extraction:
- extract(text, is_single_extraction=True, constrained_output=True, with_evaluation=False, ground_truth_row=None) - Extract data from text
- extract_element(topic_id, text, constrained_output=False, thinking_data=None) - Extract a single element
- extract_from_table(csv_file_path, delimiter=';', batch_size=100, ...) - Extract from a CSV file
Prompt Optimization:
- evaluate_prompt_performance_for_topic(topic_id, truth_col, dataset_path, ...) - Evaluate prompt performance
- iteratively_improve_prompt(topic_id, dataset_path, ...) - Improve a prompt iteratively
- create_few_shot_prompt(topic_id, csv_path, ...) - Generate a few-shot prompt
Examples
Example 1: Document Classification
from data_element_extractor import DataElementExtractor
extractor = DataElementExtractor()
extractor.set_model("microsoft/Phi-3-mini-4k-instruct", model_type="Transformers")
# Classify document type
extractor.add_topic(
    topic_name="Document Type",
    topic_data=["Invoice", "Receipt", "Contract", "Letter"]
)
# Extract date (conditional)
extractor.add_topic(
    topic_name="Document Date",
    topic_data="date",
    condition="T1 != ''"
)
text = "Invoice dated March 15, 2024 for services rendered."
results, probs = extractor.extract(text)
print(f"Type: {results[0]}, Date: {results[1]}")
Example 2: Medical Data Extraction
extractor = DataElementExtractor()
extractor.set_model("gpt-3.5-turbo", model_type="OpenAI", api_key="your-key")
# Extract diagnosis
extractor.add_topic(
    topic_name="Diagnosis",
    topic_data=["Diabetes", "Hypertension", "Asthma", "Other"]
)
# Extract age (number)
extractor.add_topic(
    topic_name="Patient Age",
    topic_data="number"
)
# Extract date of diagnosis
extractor.add_topic(
    topic_name="Diagnosis Date",
    topic_data="date",
    condition="T1 != 'Other'"
)
medical_text = "Patient is 45 years old, diagnosed with Diabetes on 2023-01-15."
results, _ = extractor.extract(medical_text)
Requirements
- Python >= 3.7
- PyTorch (for local models)
- transformers
- openai
- dateparser
- requests
License
MIT License
Author
Fabio Dennstädt (fabiodennstaedt@gmx.de)
Contributing
Contributions are welcome! Please feel free to submit a Pull Request.