DataFrame column analysis and matching with LLM
Project description
ColumnsGPT
DataFrame column analysis and matching with LLM
Overview
ColumnsGPT is a Python package that leverages large language models (LLMs) to analyze and match columns in pandas DataFrames against template schemas. It helps data analysts and scientists automate the tedious task of mapping columns between different data sources, especially when dealing with varied naming conventions.
Features
- Semantic column name matching using LLMs
- Data type inference and validation
- Support for multiple LLM providers (OpenAI, Google, Anthropic, DeepSeek)
- Interactive mode for user feedback on column mappings
- Detailed analysis reports with confidence scores
- Easy-to-use Python API and command-line interface
Installation
pip install columns-gpt
Requirements
- Python 3.8+
- pandas
- numpy
- langchain and related packages
- python-dotenv
- pydantic
Setting Up Environment Variables (IMPORTANT)
ColumnsGPT requires environment variables to be set for LLM access. Follow these steps:
-
Copy the template: A template file is provided in the repository:
cp .env.template .env
-
Edit the file: Open
.envand add your API key for the LLM provider you want to use:LLM_PROVIDER=openai # Choose: openai, google, anthropic, deepseek OPENAI_API_KEY=your-api-key-here LLM_MODEL_OPENAI=gpt-4o # Optional, will use default if not specified -
Alternative: Set environment variables directly:
# For OpenAI export LLM_PROVIDER=openai export OPENAI_API_KEY=your-api-key-here export LLM_MODEL_OPENAI=gpt-4o # For Google export LLM_PROVIDER=google export GOOGLE_API_KEY=your-api-key-here export LLM_MODEL_GOOGLE=gemini-2.0-flash-exp # For Anthropic export LLM_PROVIDER=anthropic export ANTHROPIC_API_KEY=your-api-key-here export LLM_MODEL_ANTHROPIC=claude-3-5-sonnet-latest # For DeepSeek export LLM_PROVIDER=deepseek export DEEPSEEK_API_KEY=your-api-key-here export LLM_MODEL_DEEPSEEK=deepseek-chat
IMPORTANT: Always set environment variables BEFORE importing the package. In production environments, use secure environment variable management rather than .env files.
Basic Usage
# IMPORTANT: Set up environment variables first
import os
import pandas as pd
from dotenv import load_dotenv
# Load environment variables (in development)
load_dotenv() # Will look for .env in current directory
# Or set them directly (better for production)
os.environ["LLM_PROVIDER"] = "openai"
os.environ["OPENAI_API_KEY"] = "your-api-key-here"
os.environ["LLM_MODEL_OPENAI"] = "gpt-4o" # Optional
# Create a sample DataFrame
data = {
"user_id": [1, 2, 3, 4, 5],
"full_name": ["Alice Johnson", "Bob Smith", "Charlie Brown", "David Miller", "Eve Davis"],
"customer_age": [25, 30, 35, 40, 45],
"annual_salary": [50000.0, 60000.0, 70000.0, 80000.0, 90000.0],
"is_active": [True, True, False, True, False],
"registration_date": ["2020-01-01", "2019-05-15", "2021-03-10", "2018-11-20", "2022-02-28"],
}
df = pd.DataFrame(data)
# Define a template with different column names but semantically similar concepts
template = {
"transaction_id": "int",
"athlete": "str",
"years_old": "int",
"salary": "float",
"employed": "bool",
"begin_date": "date",
"weight": "int",
}
# Import and use the package
from columns_gpt import analyze_dataframe, format_analysis_results
# Analyze the DataFrame
analysis_results, summary, rename_dict = analyze_dataframe(
df, template, llm_provider="openai", match_threshold=0.5
)
# Print formatted results
print(format_analysis_results(summary, rename_dict))
# Apply the rename dictionary to the DataFrame
df_renamed = df.rename(columns=rename_dict)
print(df_renamed.head())
Command Line Interface
The command-line interface automatically attempts to load environment variables from a .env file in the current directory:
# Non-interactive analysis
columns-gpt --input data.csv --template template.json --provider openai --threshold 0.5 --output results.json
# Interactive analysis (allows you to adjust column mappings)
columns-gpt --input data.csv --template template.json --interactive
API Reference
Main Functions
# Non-interactive analysis
analyze_dataframe(
df, template, llm_provider="openai", sample_size=10,
match_threshold=0.7, max_columns=None, rename_mapping=None
)
# Interactive analysis with user feedback
interactive_analyze_dataframe(
df, template, llm_provider="openai"
)
Data Models
ColumnAnalysis
A data model representing analysis of a single DataFrame column.
Attributes:
df_column_name: The name of the column in the DataFrametemplate_column_name: The matched template column name (if any)inferred_type: The inferred data typeconfidence: Confidence score for type inference (0-1)sample_values: Sample values from the columnmatches_template: Whether the column matches template requirementstemplate_type: The expected type from the templatematch_confidence: Confidence score for column name match (0-1)notes: Additional observations
AnalysisSummary
A data model containing the overall analysis results.
Attributes:
total_df_columns: Total number of columns in the DataFrametotal_template_columns: Total number of columns in the templateanalyzed_columns: Number of columns analyzedmatching_columns: Number of columns matching the templatenon_matching_columns: Number of columns not matching the templateunmatched_df_columns: Number of DataFrame columns not matched to templateunmatched_template_columns: Number of template columns not matched to DataFramecolumn_analyses: List of detailed column analyses
Environment Variables
These environment variables configure ColumnsGPT's behavior:
Required Variables
For using ColumnsGPT, you MUST set:
-
LLM_PROVIDER: Which LLM provider to use (required)- Options:
openai,google,anthropic,deepseek
- Options:
-
API key for your chosen provider (one of these is required):
OPENAI_API_KEY: Your OpenAI API keyGOOGLE_API_KEY: Your Google API keyANTHROPIC_API_KEY: Your Anthropic API keyDEEPSEEK_API_KEY: Your DeepSeek API key
Optional Variables
- Model name for your chosen provider (optional, defaults provided):
LLM_MODEL_OPENAI: OpenAI model (default: "gpt-4o")LLM_MODEL_GOOGLE: Google model (default: "gemini-2.0-flash-exp")LLM_MODEL_ANTHROPIC: Anthropic model (default: "claude-3-5-sonnet-latest")LLM_MODEL_DEEPSEEK: DeepSeek model (default: "deepseek-chat")
Examples
Template Format
The template should be a JSON object with column names as keys and expected data types as values:
{
"user_id": "int",
"name": "str",
"age": "int",
"salary": "float",
"active": "bool",
"start_date": "date"
}
Interactive Analysis
import pandas as pd
from columns_gpt import interactive_analyze_dataframe
# Load your DataFrame
df = pd.read_csv("data.csv")
# Define your template
template = {
"id": "int",
"full_name": "str",
"age": "int",
"income": "float",
"is_employed": "bool",
"registration_date": "date"
}
# Run interactive analysis
interactive_analyze_dataframe(df, template, llm_provider="openai")
Troubleshooting
-
API Key errors: Make sure your API key is correct and has appropriate permissions
-
Pydantic version issues: This package uses Pydantic for data models. If you get errors related to Pydantic, try:
pip install "pydantic>=2.0.0"
-
LLM provider errors: Check that you've set the correct environment variables for your chosen provider
-
"No module named dotenv": Install python-dotenv if you're using .env files
pip install python-dotenv
License
MIT
Project details
Release history Release notifications | RSS feed
Download files
Download the file for your platform. If you're not sure which to choose, learn more about installing packages.
Source Distribution
Built Distribution
Filter files by name, interpreter, ABI, and platform.
If you're not sure about the file name format, learn more about wheel file names.
Copy a direct link to the current filters
File details
Details for the file columns_gpt-0.1.1.tar.gz.
File metadata
- Download URL: columns_gpt-0.1.1.tar.gz
- Upload date:
- Size: 19.7 kB
- Tags: Source
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.1.0 CPython/3.9.2
File hashes
| Algorithm | Hash digest | |
|---|---|---|
| SHA256 |
cd645b6d141b0249b0983420486dfee23bf8934e119dd6e69ebd4204249ddfda
|
|
| MD5 |
cfacd865502f38e52eb0b5fc464c6024
|
|
| BLAKE2b-256 |
d2928cc638520d5d0e4d1808b95d584199345b47c89a924ed5627df7601398b5
|
File details
Details for the file columns_gpt-0.1.1-py3-none-any.whl.
File metadata
- Download URL: columns_gpt-0.1.1-py3-none-any.whl
- Upload date:
- Size: 17.8 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.1.0 CPython/3.9.2
File hashes
| Algorithm | Hash digest | |
|---|---|---|
| SHA256 |
2c0da42a968273d1b775cad9195e9480136ded9e35113cde3a2749d0ce9c597a
|
|
| MD5 |
d6ba7651a9f39376bd37ff2d8475b4e4
|
|
| BLAKE2b-256 |
7cef36b086e524e0b9f89c67ed9590a0fd7f393a753b1f68737451f5adab3083
|