DataFrame column analysis and matching with LLM

These details have not been verified by PyPI

Project links

Project description

ColumnsGPT

DataFrame column analysis and matching with LLM

Overview

ColumnsGPT is a Python package that leverages large language models (LLMs) to analyze and match columns in pandas DataFrames against template schemas. It helps data analysts and scientists automate the tedious task of mapping columns between different data sources, especially when dealing with varied naming conventions.

Features

Semantic column name matching using LLMs
Data type inference and validation
Support for multiple LLM providers (OpenAI, Google, Anthropic, DeepSeek)
Interactive mode for user feedback on column mappings
Detailed analysis reports with confidence scores
Easy-to-use Python API and command-line interface

Installation

pip install columns-gpt

Requirements

Python 3.8+
pandas
numpy
langchain and related packages
python-dotenv
pydantic

Setting Up Environment Variables (IMPORTANT)

ColumnsGPT requires environment variables to be set for LLM access. Follow these steps:

Copy the template: A template file is provided in the repository:
```
cp .env.template .env
```

Edit the file: Open .env and add your API key for the LLM provider you want to use:

LLM_PROVIDER=openai  # Choose: openai, google, anthropic, deepseek
OPENAI_API_KEY=your-api-key-here
LLM_MODEL_OPENAI=gpt-4o  # Optional, will use default if not specified

Alternative: Set environment variables directly:

# For OpenAI
export LLM_PROVIDER=openai
export OPENAI_API_KEY=your-api-key-here
export LLM_MODEL_OPENAI=gpt-4o

# For Google
export LLM_PROVIDER=google
export GOOGLE_API_KEY=your-api-key-here
export LLM_MODEL_GOOGLE=gemini-2.0-flash-exp

# For Anthropic
export LLM_PROVIDER=anthropic
export ANTHROPIC_API_KEY=your-api-key-here
export LLM_MODEL_ANTHROPIC=claude-3-5-sonnet-latest

# For DeepSeek
export LLM_PROVIDER=deepseek
export DEEPSEEK_API_KEY=your-api-key-here
export LLM_MODEL_DEEPSEEK=deepseek-chat

IMPORTANT: Always set environment variables BEFORE importing the package. In production environments, use secure environment variable management rather than .env files.

Basic Usage

# IMPORTANT: Set up environment variables first
import os
import pandas as pd
from dotenv import load_dotenv

# Load environment variables (in development)
load_dotenv()  # Will look for .env in current directory

# Or set them directly (better for production)
os.environ["LLM_PROVIDER"] = "openai"
os.environ["OPENAI_API_KEY"] = "your-api-key-here"
os.environ["LLM_MODEL_OPENAI"] = "gpt-4o"  # Optional

# Create a sample DataFrame
data = {
    "user_id": [1, 2, 3, 4, 5],
    "full_name": ["Alice Johnson", "Bob Smith", "Charlie Brown", "David Miller", "Eve Davis"],
    "customer_age": [25, 30, 35, 40, 45],
    "annual_salary": [50000.0, 60000.0, 70000.0, 80000.0, 90000.0],
    "is_active": [True, True, False, True, False],
    "registration_date": ["2020-01-01", "2019-05-15", "2021-03-10", "2018-11-20", "2022-02-28"],
}
df = pd.DataFrame(data)

# Define a template with different column names but semantically similar concepts
template = {
    "transaction_id": "int",
    "athlete": "str", 
    "years_old": "int",
    "salary": "float",
    "employed": "bool",
    "begin_date": "date",
    "weight": "int",
}

# Import and use the package
from columns_gpt import analyze_dataframe, format_analysis_results

# Analyze the DataFrame
analysis_results, summary, rename_dict = analyze_dataframe(
    df, template, llm_provider="openai", match_threshold=0.5
)

# Print formatted results
print(format_analysis_results(summary, rename_dict))

# Apply the rename dictionary to the DataFrame
df_renamed = df.rename(columns=rename_dict)
print(df_renamed.head())

Command Line Interface

The command-line interface automatically attempts to load environment variables from a .env file in the current directory:

# Non-interactive analysis
columns-gpt --input data.csv --template template.json --provider openai --threshold 0.5 --output results.json

# Interactive analysis (allows you to adjust column mappings)
columns-gpt --input data.csv --template template.json --interactive

API Reference

Main Functions

# Non-interactive analysis
analyze_dataframe(
    df, template, llm_provider="openai", sample_size=10, 
    match_threshold=0.7, max_columns=None, rename_mapping=None
)

# Interactive analysis with user feedback
interactive_analyze_dataframe(
    df, template, llm_provider="openai"
)

Data Models

ColumnAnalysis

A data model representing analysis of a single DataFrame column.

Attributes:

df_column_name: The name of the column in the DataFrame
template_column_name: The matched template column name (if any)
inferred_type: The inferred data type
confidence: Confidence score for type inference (0-1)
sample_values: Sample values from the column
matches_template: Whether the column matches template requirements
template_type: The expected type from the template
match_confidence: Confidence score for column name match (0-1)
notes: Additional observations

AnalysisSummary

A data model containing the overall analysis results.

Attributes:

total_df_columns: Total number of columns in the DataFrame
total_template_columns: Total number of columns in the template
analyzed_columns: Number of columns analyzed
matching_columns: Number of columns matching the template
non_matching_columns: Number of columns not matching the template
unmatched_df_columns: Number of DataFrame columns not matched to template
unmatched_template_columns: Number of template columns not matched to DataFrame
column_analyses: List of detailed column analyses

Environment Variables

These environment variables configure ColumnsGPT's behavior:

Required Variables

For using ColumnsGPT, you MUST set:

LLM_PROVIDER: Which LLM provider to use (required)
- Options: openai, google, anthropic, deepseek
API key for your chosen provider (one of these is required):
- OPENAI_API_KEY: Your OpenAI API key
- GOOGLE_API_KEY: Your Google API key
- ANTHROPIC_API_KEY: Your Anthropic API key
- DEEPSEEK_API_KEY: Your DeepSeek API key

Optional Variables

Model name for your chosen provider (optional, defaults provided):
- LLM_MODEL_OPENAI: OpenAI model (default: "gpt-4o")
- LLM_MODEL_GOOGLE: Google model (default: "gemini-2.0-flash-exp")
- LLM_MODEL_ANTHROPIC: Anthropic model (default: "claude-3-5-sonnet-latest")
- LLM_MODEL_DEEPSEEK: DeepSeek model (default: "deepseek-chat")

Examples

Template Format

The template should be a JSON object with column names as keys and expected data types as values:

{
  "user_id": "int",
  "name": "str",
  "age": "int",
  "salary": "float",
  "active": "bool",
  "start_date": "date"
}

Interactive Analysis

import pandas as pd
from columns_gpt import interactive_analyze_dataframe

# Load your DataFrame
df = pd.read_csv("data.csv")

# Define your template
template = {
    "id": "int",
    "full_name": "str",
    "age": "int",
    "income": "float",
    "is_employed": "bool",
    "registration_date": "date"
}

# Run interactive analysis
interactive_analyze_dataframe(df, template, llm_provider="openai")

Troubleshooting

API Key errors: Make sure your API key is correct and has appropriate permissions
Pydantic version issues: This package uses Pydantic for data models. If you get errors related to Pydantic, try:
```
pip install "pydantic>=2.0.0"
```
LLM provider errors: Check that you've set the correct environment variables for your chosen provider
"No module named dotenv": Install python-dotenv if you're using .env files
```
pip install python-dotenv
```

License

MIT

Project details

These details have not been verified by PyPI

Project links

Release history Release notifications | RSS feed

0.1.2

Mar 8, 2025

This version

0.1.1

Mar 8, 2025

0.1.0

Mar 8, 2025

Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

columns_gpt-0.1.1.tar.gz (19.7 kB view details)

Uploaded Mar 8, 2025 Source

Built Distribution

If you're not sure about the file name format, learn more about wheel file names.

The dropdown lists show the available interpreters, ABIs, and platforms. Enable javascript to be able to filter the list of wheel files.

columns_gpt-0.1.1-py3-none-any.whl (17.8 kB view details)

Uploaded Mar 8, 2025 Python 3

File details

Details for the file columns_gpt-0.1.1.tar.gz.

File metadata

Download URL: columns_gpt-0.1.1.tar.gz
Upload date: Mar 8, 2025
Size: 19.7 kB
Tags: Source
Uploaded using Trusted Publishing? No
Uploaded via: twine/6.1.0 CPython/3.9.2

File hashes

Hashes for columns_gpt-0.1.1.tar.gz
Algorithm	Hash digest
SHA256	`cd645b6d141b0249b0983420486dfee23bf8934e119dd6e69ebd4204249ddfda`
MD5	`cfacd865502f38e52eb0b5fc464c6024`
BLAKE2b-256	`d2928cc638520d5d0e4d1808b95d584199345b47c89a924ed5627df7601398b5`

See more details on using hashes here.

File details

Details for the file columns_gpt-0.1.1-py3-none-any.whl.

File metadata

Download URL: columns_gpt-0.1.1-py3-none-any.whl
Upload date: Mar 8, 2025
Size: 17.8 kB
Tags: Python 3
Uploaded using Trusted Publishing? No
Uploaded via: twine/6.1.0 CPython/3.9.2

File hashes

Hashes for columns_gpt-0.1.1-py3-none-any.whl
Algorithm	Hash digest
SHA256	`2c0da42a968273d1b775cad9195e9480136ded9e35113cde3a2749d0ce9c597a`
MD5	`d6ba7651a9f39376bd37ff2d8475b4e4`
BLAKE2b-256	`7cef36b086e524e0b9f89c67ed9590a0fd7f393a753b1f68737451f5adab3083`

See more details on using hashes here.

columns-gpt 0.1.1

Navigation

Verified details

Maintainers

Unverified details

Project links

Meta

Classifiers

Project description

ColumnsGPT

Overview

Features

Installation

Requirements

Setting Up Environment Variables (IMPORTANT)

Basic Usage

Command Line Interface

API Reference

Main Functions

Data Models

ColumnAnalysis

AnalysisSummary

Environment Variables

Required Variables

Optional Variables

Examples

Template Format

Interactive Analysis

Troubleshooting

License

Project details

Verified details

Maintainers

Unverified details

Project links

Meta

Classifiers

Release history Release notifications | RSS feed

Download files

Source Distribution

Built Distribution

File details

File metadata

File hashes

File details

File metadata

File hashes