
Data Element Extractor

An adaptable Python framework for extracting structured data from unstructured text using Large Language Models (LLMs). This library provides a flexible system for managing and extracting data across multiple topics, with support for categorical classification, number extraction, date parsing, and custom text extraction.

Features

  • Multiple LLM Backends: Support for Hugging Face Transformers models, OpenAI API, DeepInfra, and custom inference servers
  • Topic Management: Define and manage multiple extraction topics with custom prompts and categories
  • Conditional Extraction: Extract data elements conditionally based on previous extraction results
  • Prompt Optimization: Iterative prompt improvement and performance evaluation tools
  • Flexible Data Types: Support for categorical (value list), number, date, and text extraction
  • Batch Processing: Extract data from CSV files with batch processing capabilities
  • Server Integration: Load data elements and lists from remote CDE (Common Data Element) servers
  • Graphical UI: Built-in Tkinter-based user interface for interactive data extraction
  • Prompt Generation: Automatic prompt creation and few-shot learning support

Installation

Install the package using pip:

pip install data-element-extractor

Dependencies

The package requires:

  • torch - PyTorch for local model inference
  • transformers - Hugging Face Transformers library
  • openai - OpenAI Python client
  • dateparser - Date parsing utilities
  • requests - HTTP client for server communication
  • tk - Tkinter for the UI (usually included with Python)

Quick Start

Basic Usage

from data_element_extractor import DataElementExtractor

# Initialize the extractor
extractor = DataElementExtractor()

# Set up your LLM model
# Option 1: Use a local Transformers model
extractor.set_model("microsoft/Phi-3-mini-4k-instruct", model_type="Transformers")

# Option 2: Use OpenAI API
# extractor.set_model("gpt-3.5-turbo", model_type="OpenAI", api_key="your-api-key")

# Add a categorical topic (classification)
topic_id = extractor.add_topic(
    topic_name="Sentiment",
    topic_data=["Positive", "Negative", "Neutral"]
)

# Add a number extraction topic
topic_id_2 = extractor.add_topic(
    topic_name="Age",
    topic_data="number"
)

# Extract data from text
text = "The customer was very happy with the service. They are 25 years old."
results, probabilities = extractor.extract(text)

print(f"Sentiment: {results[0]} (confidence: {probabilities[0]:.2%})")
print(f"Age: {results[1]} (confidence: {probabilities[1]:.2%})")

Working with CSV Files

# Extract data from a CSV file
results = extractor.extract_from_table(
    csv_file_path="data.csv",
    delimiter=";",
    batch_size=100,
    constrained_output=True
)

Core Concepts

Topics

A topic defines what data you want to extract from text. Each topic has:

  • Name: A descriptive name for the data element (e.g., "Sentiment", "Age")
  • Data Type: The type of extraction:
    • List of categories (categorical/classification)
    • "number" - Extract numeric values
    • "date" - Extract dates
    • "text" - Extract free-form text
  • Prompt: Custom instruction template for the LLM
  • Condition: Optional condition to control when extraction runs
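The four `topic_data` forms above can be illustrated with plain Python values. This is a sketch of the accepted shapes and how they could be told apart, not the library's internal code:

```python
# The four shapes add_topic accepts for topic_data:
topic_specs = {
    "Sentiment": ["Positive", "Negative", "Neutral"],  # categorical
    "Age": "number",                                   # numeric extraction
    "Admission Date": "date",                          # date extraction
    "Chief Complaint": "text",                         # free-form text
}

def data_kind(topic_data):
    """Classify a topic_data value into one of the four documented kinds."""
    if isinstance(topic_data, list):
        return "categorical"
    if topic_data in ("number", "date", "text"):
        return topic_data
    raise ValueError(f"unsupported topic_data: {topic_data!r}")

for name, spec in topic_specs.items():
    print(f"{name}: {data_kind(spec)}")
```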

Model Configuration

The library supports multiple inference backends:

Local Transformers Models

extractor.set_model(
    model_name="microsoft/Phi-3-mini-4k-instruct",
    model_type="Transformers",
    inference_type="transformers",
    attn_implementation="flash_attention_2",
    move_to_gpu=True
)

OpenAI API

extractor.set_model(
    model_name="gpt-3.5-turbo",
    model_type="OpenAI",
    api_key="your-api-key",
    inference_type="cloud"
)

DeepInfra

extractor.set_model(
    model_name="meta-llama/Llama-3-70b-chat-hf",
    model_type="DeepInfra",
    api_key="your-api-key",
    inference_type="cloud"
)

Custom Inference Server

extractor.set_inference_server_url("http://127.0.0.1:5000")
extractor.set_model(
    model_name="your-model-name",
    model_type="Transformers",
    inference_type="server"
)

Conditional Extraction

You can make topic extraction conditional on previous results:

# First topic extracts a category
topic1_id = extractor.add_topic(
    topic_name="Document Type",
    topic_data=["Contract", "Invoice", "Receipt"]
)

# Second topic only extracts if first topic was "Contract"
topic2_id = extractor.add_topic(
    topic_name="Contract Date",
    topic_data="date",
    condition="T1 == 'Contract'"
)

Conditions reference topic IDs (e.g., T1, T2) and can check for:

  • Category matches: T1 == 'CategoryName'
  • Non-empty values: T1 != ''
  • Complex expressions using and, or, not
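The intended semantics can be sketched with a small evaluator over previous results. The library's actual condition engine may differ; this only illustrates that conditions are plain boolean expressions over topic IDs:

```python
def evaluate_condition(condition, results):
    """Evaluate a condition string such as "T1 == 'Contract'" against a
    mapping of topic IDs to extracted values. Illustrative only: a
    restricted eval with the topic results as the only visible names."""
    if not condition:
        return True  # no condition means the topic always runs
    return bool(eval(condition, {"__builtins__": {}}, dict(results)))

results = {"T1": "Contract", "T2": ""}
print(evaluate_condition("T1 == 'Contract'", results))             # True
print(evaluate_condition("T2 != ''", results))                     # False
print(evaluate_condition("T1 == 'Contract' and T2 == ''", results))  # True
```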

Prompt Customization

Customize prompts for each topic:

topic_id = extractor.add_topic(
    topic_name="Medical Condition",
    topic_data=["Diabetes", "Hypertension", "Asthma"],
    prompt="You are a medical expert. Classify the following medical text into one of the categories: [CATEGORIES]. Text: [TEXT]. Category:"
)

The library automatically replaces:

  • [TOPIC] - The topic name
  • [TEXT] - The input text
  • [CATEGORIES] - The list of categories (for categorical topics)
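The substitution itself amounts to string replacement of the three documented placeholders. A minimal sketch (the `render_prompt` helper is hypothetical, not part of the library's API):

```python
def render_prompt(template, topic_name, text, categories=None):
    """Fill the documented placeholders into a prompt template."""
    prompt = template.replace("[TOPIC]", topic_name).replace("[TEXT]", text)
    if categories is not None:
        # Categories only apply to categorical (value list) topics.
        prompt = prompt.replace("[CATEGORIES]", ", ".join(categories))
    return prompt

template = ("Classify the text about [TOPIC] into one of the categories: "
            "[CATEGORIES]. Text: [TEXT]. Category:")
print(render_prompt(template, "Sentiment", "Great service!",
                    ["Positive", "Negative", "Neutral"]))
```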

Advanced Features

Prompt Optimization

Evaluate and improve prompt performance:

# Evaluate current prompt performance
performance = extractor.evaluate_prompt_performance_for_topic(
    topic_id="T1",
    dataset_path="evaluation_data.csv",
    truth_col=1,
    text_col=0,
    delimiter=";"
)

# Iteratively improve a prompt
extractor.iteratively_improve_prompt(
    topic_id="T1",
    dataset_path="training_data.csv",
    text_column_index=0,
    ground_truth_column_index=1,
    num_iterations=3,
    delimiter=";"
)

Few-Shot Learning

Generate few-shot prompts automatically:

# Create few-shot prompt for a single topic
extractor.create_few_shot_prompt(
    topic_id="T1",
    csv_path="examples.csv",
    text_col_idx=0,
    label_col_idx=1,
    delimiter=";",
    num_examples=3
)

# Create few-shot prompts for all topics
extractor.create_few_shot_prompts_for_all_topics(
    csv_path="examples.csv",
    delimiter=";",
    num_examples=3
)

Server Integration

Load data elements from remote servers:

# Get all available CDEs (Common Data Elements)
all_cdes = extractor.get_all_cdes_from_server()

# Get CDE lists
cde_lists = extractor.get_cde_lists_from_server()

# Load a data element list
topics = extractor.load_data_element_list_from_server(cde_list_id="list-123")

# Load a single data element
topics = extractor.load_data_element_from_server(cde_id="cde-456")

Topic Management

# Get topic information
topic = extractor.get_topic_by_id("T1")
topic_id = extractor.get_topic_id_by_name("Sentiment")

# Modify topics
extractor.set_prompt(topic_id="T1", new_prompt="New prompt text")
extractor.increase_topic_order("T1")
extractor.decrease_topic_order("T1")

# Category management
extractor.add_category(topic_id="T1", category_name="New Category")
extractor.remove_category(topic_id="T1", category_id="cat-uuid")

# Category conditions
extractor.add_category_condition(topic_id="T1", category_id="cat-uuid", condition_str="T2 > 10")

# Save and load topics
extractor.save_topics("topics.json")
extractor.load_topics("topics.json")

# Display all topics
extractor.show_topics_and_categories()

Thinking/Chain-of-Thought

Configure chain-of-thought reasoning for improved accuracy:

# Set global thinking config
extractor.thinking_config = {
    "enabled": True,
    "temperature": 0.7,
    "max_length": 500
}

# Or configure per topic when adding
topic_id = extractor.add_topic(
    topic_name="Complex Classification",
    topic_data=["Category A", "Category B"],
    thinking_config={
        "enabled": True,
        "temperature": 0.5
    }
)

Configuration

# Configure choice symbols for categorical output
# Options: "none", "alphabetical", "numerical", or custom list like "A,B,C,D"
extractor.set_choice_symbol_config("alphabetical")

# Set inference server URL
extractor.set_inference_server_url("http://127.0.0.1:5000")

User Interface

The library includes a graphical user interface built with Tkinter:

from data_element_extractor.ui.main_app import ExtractorApp
import tkinter as tk

root = tk.Tk()
app = ExtractorApp(root)
root.mainloop()

Or use the UI module directly:

from data_element_extractor import ui
# Launch UI (if available)

The UI provides:

  • Model configuration management
  • Topic creation and editing
  • Interactive extraction interface
  • Prompt editing and optimization
  • CSV file processing

API Reference

Main Class

DataElementExtractor()

Main class for data extraction.

Model Management:

  • set_model(model_name, model_type="Transformers", api_key="", inference_type="transformers", ...) - Configure the main extraction model
  • set_prompt_model(model_name, model_type="OpenAI", ...) - Configure model for prompt generation
  • set_model_as_prompt_model() - Use main model for prompt generation

Topic Management:

  • add_topic(topic_name, topic_data, condition="", prompt="", thinking_config={}) - Add a new extraction topic
  • get_topic_by_id(topic_id) - Get topic by ID
  • get_topic_id_by_name(topic_name) - Get topic ID by name
  • update_topics(topics) - Update all topics
  • remove_topic(topic_id_str) - Remove a topic
  • save_topics(filename) - Save topics to file
  • load_topics(filename) - Load topics from file
  • show_topics_and_categories() - Display all topics

Extraction:

  • extract(text, is_single_extraction=True, constrained_output=True, with_evaluation=False, ground_truth_row=None) - Extract data from text
  • extract_element(topic_id, text, constrained_output=False, thinking_data=None) - Extract a single element
  • extract_from_table(csv_file_path, delimiter=';', batch_size=100, ...) - Extract from CSV file

Prompt Optimization:

  • evaluate_prompt_performance_for_topic(topic_id, truth_col, dataset_path, ...) - Evaluate prompt performance
  • iteratively_improve_prompt(topic_id, dataset_path, ...) - Improve prompt iteratively
  • create_few_shot_prompt(topic_id, csv_path, ...) - Generate few-shot prompt

Examples

Example 1: Document Classification

from data_element_extractor import DataElementExtractor

extractor = DataElementExtractor()
extractor.set_model("microsoft/Phi-3-mini-4k-instruct", model_type="Transformers")

# Classify document type
extractor.add_topic(
    topic_name="Document Type",
    topic_data=["Invoice", "Receipt", "Contract", "Letter"]
)

# Extract date (conditional)
extractor.add_topic(
    topic_name="Document Date",
    topic_data="date",
    condition="T1 != ''"
)

text = "Invoice dated March 15, 2024 for services rendered."
results, probs = extractor.extract(text)
print(f"Type: {results[0]}, Date: {results[1]}")

Example 2: Medical Data Extraction

extractor = DataElementExtractor()
extractor.set_model("gpt-3.5-turbo", model_type="OpenAI", api_key="your-key")

# Extract diagnosis
extractor.add_topic(
    topic_name="Diagnosis",
    topic_data=["Diabetes", "Hypertension", "Asthma", "Other"]
)

# Extract age (number)
extractor.add_topic(
    topic_name="Patient Age",
    topic_data="number"
)

# Extract date of diagnosis
extractor.add_topic(
    topic_name="Diagnosis Date",
    topic_data="date",
    condition="T1 != 'Other'"
)

medical_text = "Patient is 45 years old, diagnosed with Diabetes on 2023-01-15."
results, _ = extractor.extract(medical_text)

Requirements

  • Python >= 3.7
  • PyTorch (for local models)
  • transformers
  • openai
  • dateparser
  • requests

License

MIT License

Author

Fabio Dennstädt (fabiodennstaedt@gmx.de)

Contributing

Contributions are welcome! Please feel free to submit a Pull Request.

Support

For issues, questions, or contributions, please open an issue on the project repository.
