AILLMCleaner
An AI-powered Python library for intelligent, context-aware automated data cleaning using both Local LLMs (Ollama) and Cloud AI APIs (Google Gemini, OpenAI, Groq).
What Does It Do?
Real-world datasets are messy: names in inconsistent case, the same city spelled several ways, missing values, invalid emails. aillmcleaner uses Large Language Models to interpret each value in context and fix it, rather than relying only on fixed rules.
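For contrast, here is a minimal sketch of what fixed rules alone can do, using only the Python standard library. It is not part of aillmcleaner; the helper name rule_based_city and the sample values are illustrative assumptions. Title-casing and fuzzy string matching handle the simple cases, but an abbreviation such as "LA" has no close string match, which is the kind of gap context-aware cleaning is meant to cover.

```python
import difflib

# Canonical categories, chosen here purely for illustration.
CITIES = ["New York", "Los Angeles", "Chicago"]


def rule_based_city(value, categories=CITIES, cutoff=0.6):
    """Map a raw value to a category by string similarity, or None if no match."""
    lowered = {c.lower(): c for c in categories}
    matches = difflib.get_close_matches(value.lower(), list(lowered), n=1, cutoff=cutoff)
    return lowered[matches[0]] if matches else None


# Simple casing rules work well for names.
print("BOB SMITH".title())           # Bob Smith

# Fuzzy matching catches near-duplicates but not abbreviations.
print(rule_based_city("newyork"))    # New York
print(rule_based_city("LA"))         # None
```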
Requirements
- Python 3.8 or higher
- pandas
- requests
- openai
- google-genai
- One AI provider (Ollama, Gemini, OpenAI, or Groq)
Install all dependencies:
pip install aillmcleaner
AI Provider Options
Ollama — Free, runs locally on your computer. No API key needed. Download from: https://ollama.com
Google Gemini — Free tier available. Get API key from: https://aistudio.google.com
OpenAI (ChatGPT) — Paid service. Get API key from: https://platform.openai.com
Groq — Free tier available, very fast. Get API key from: https://console.groq.com
Installation
pip install aillmcleaner
Quick Start
Using Ollama (Free, No API Key Needed)
First start Ollama on your computer:
ollama serve
ollama pull llama3.2
Then use in Python:
import pandas as pd
from aillmcleaner import clean_column, standardize_column, fill_missing, detect_anomalies

data = {
    "name": ["alice johnson", "BOB SMITH", "Charlie Brown"],
    "city": ["new york", "newyork", "LA"],
    "country": [None, "USA", "United States"],
}
df = pd.DataFrame(data)

# Fix name casing
df["name"] = clean_column(df, "name",
    instruction="Convert to proper title case. Return only the name.")

# Standardize city names
df["city"] = standardize_column(df, "city",
    categories=["New York", "Los Angeles", "Chicago"])

# Fill missing country values
df["country"] = fill_missing(df, "country", context_columns=["city"])

print(df)
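Before running the example above against the default local provider, it can help to confirm that Ollama is reachable and that the model has been pulled. The sketch below is independent of aillmcleaner and uses only the standard library. It assumes Ollama's default address, http://localhost:11434, and its /api/tags endpoint, which lists locally available models; the function names are illustrative.

```python
import json
import urllib.request


def parse_model_names(payload):
    """Extract model names from an /api/tags response body."""
    return [m["name"] for m in payload.get("models", [])]


def ollama_models(base_url="http://localhost:11434", timeout=3):
    """Return locally available model names, or None if the server is unreachable."""
    try:
        with urllib.request.urlopen(f"{base_url}/api/tags", timeout=timeout) as resp:
            return parse_model_names(json.load(resp))
    except (OSError, ValueError):
        # OSError covers refused connections and timeouts; ValueError covers bad JSON.
        return None


models = ollama_models()
if models is None:
    print("Ollama is not reachable; run 'ollama serve' first.")
elif not any(name.startswith("llama3.2") for name in models):
    print("Model not found; run 'ollama pull llama3.2'.")
else:
    print("Ollama is ready.")
```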
Using Google Gemini
from aillmcleaner import clean_column

df["name"] = clean_column(df, "name",
    provider="gemini",
    api_key="YOUR_GEMINI_API_KEY")
Or set as environment variable:
export GEMINI_API_KEY="your_key_here"
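If you would rather pass the key explicitly than rely on the environment being read for you, one option is to look it up with os.environ and fail early when it is missing. This is a minimal sketch: the helper name require_env is hypothetical, and GEMINI_API_KEY is the variable name shown above.

```python
import os


def require_env(name):
    """Return the value of an environment variable, or raise a clear error."""
    value = os.environ.get(name, "").strip()
    if not value:
        raise RuntimeError(f"Set {name} before using this provider.")
    return value


# Usage (assumes df from the Quick Start example):
# df["name"] = clean_column(df, "name", provider="gemini",
#                           api_key=require_env("GEMINI_API_KEY"))
```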
Using OpenAI
from aillmcleaner import clean_column

df["name"] = clean_column(df, "name",
    provider="openai",
    model="gpt-4o-mini",
    api_key="YOUR_OPENAI_API_KEY")
Using Groq
from aillmcleaner import clean_column

df["name"] = clean_column(df, "name",
    provider="groq",
    model="llama3-8b-8192",
    api_key="YOUR_GROQ_API_KEY")
All Available Functions
clean_text — Clean a single text value.
from aillmcleaner import clean_text
result = clean_text("helo wrold", provider="gemini", api_key="YOUR_KEY")
clean_column — Clean an entire DataFrame column.
df["name"] = clean_column(df, "name",
    instruction="Fix spelling and capitalize properly.")
standardize_column — Map column values to a fixed list of categories.
df["city"] = standardize_column(df, "city",
    categories=["New York", "Los Angeles", "Chicago"])
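Because the mapping is produced by a language model, it is worth checking that every returned value is actually one of the allowed categories. The following is a sketch using plain pandas, with a hand-written column standing in for the output of standardize_column; the helper name unmapped_values is an assumption.

```python
import pandas as pd


def unmapped_values(series, categories):
    """Return the non-null values in a column that are outside the allowed categories."""
    present = series.dropna()
    return present[~present.isin(categories)]


categories = ["New York", "Los Angeles", "Chicago"]
# Stand-in for a standardized column; "NYC" simulates a value the model left unmapped.
city = pd.Series(["New York", "NYC", "Los Angeles"])

leftovers = unmapped_values(city, categories)
print(leftovers.tolist())  # ['NYC']
```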
fill_missing — Fill missing values using context from other columns.
df["country"] = fill_missing(df, "country",
    context_columns=["city", "state"])
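Filled values are model inferences, so it can be useful to record which rows were empty beforehand and review only those afterwards. A sketch with plain pandas follows; the filled column is simulated here rather than produced by fill_missing.

```python
import pandas as pd

df = pd.DataFrame({
    "city": ["New York", "Los Angeles", "Chicago"],
    "country": [None, "USA", None],
})

# Remember which rows were missing before any filling happens.
was_missing = df["country"].isna()

# Simulated result; in practice this would come from fill_missing(...).
df["country"] = ["USA", "USA", "USA"]

# Show only the rows whose value was inferred, alongside the context column.
review = df.loc[was_missing, ["city", "country"]]
print(review)
```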
detect_anomalies — Find suspicious or invalid values in a column.
bad_values = detect_anomalies(df, "email")
print(bad_values)
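For a column with a well-defined format such as email, a deterministic check makes a useful cross-reference for whatever detect_anomalies reports. The pattern below is deliberately simple: it is not a full RFC 5322 validator and not the library's method, and looks_like_email is a hypothetical helper.

```python
import re

# Loose shape check: something@something.tld, with no spaces or extra '@'.
EMAIL_RE = re.compile(r"^[^@\s]+@[^@\s]+\.[^@\s]+$")


def looks_like_email(value):
    """Return True if value is a string with a plausible email shape."""
    return isinstance(value, str) and EMAIL_RE.match(value) is not None


emails = ["alice@example.com", "bob@example", "not an email", None]
suspicious = [e for e in emails if not looks_like_email(e)]
print(suspicious)  # ['bob@example', 'not an email', None]
```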
clean_dataframe — Clean all text columns in a DataFrame at once.
from aillmcleaner import clean_dataframe
cleaned_df = clean_dataframe(df, provider="groq", api_key="YOUR_KEY")
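To preview which columns such a call is likely to touch, you can list the columns that hold string values. How clean_dataframe selects text columns is not documented here, so the rule below (any non-null string value) is an assumption, and text_columns is a hypothetical helper.

```python
import pandas as pd


def text_columns(df):
    """Return names of columns containing at least one non-null string value."""
    return [
        col for col in df.columns
        if df[col].dropna().map(lambda v: isinstance(v, str)).any()
    ]


df = pd.DataFrame({
    "name": ["alice johnson", "BOB SMITH"],
    "age": [31, 45],
    "country": [None, "USA"],
})
print(text_columns(df))  # ['name', 'country']
```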
Provider and Model Options
- provider can be ollama (default), gemini, openai, or groq.
- model is optional; when omitted, the default model for the chosen provider is used.
- api_key is required only for the cloud providers (gemini, openai, groq).
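The rules above can be encoded as a small pre-flight check that builds the keyword arguments before calling any of the functions. This is a sketch: provider_kwargs is a hypothetical helper, only the parameter names provider, model, and api_key come from this document, and the check does not account for keys supplied through environment variables.

```python
CLOUD_PROVIDERS = {"gemini", "openai", "groq"}
ALL_PROVIDERS = CLOUD_PROVIDERS | {"ollama"}


def provider_kwargs(provider="ollama", model=None, api_key=None):
    """Validate provider settings and return them as keyword arguments."""
    if provider not in ALL_PROVIDERS:
        raise ValueError(f"Unknown provider: {provider!r}")
    if provider in CLOUD_PROVIDERS and not api_key:
        raise ValueError(f"{provider} requires an api_key")
    kwargs = {"provider": provider}
    if model is not None:
        kwargs["model"] = model  # omit to use the provider's default model
    if api_key:
        kwargs["api_key"] = api_key
    return kwargs


print(provider_kwargs())  # {'provider': 'ollama'}
print(provider_kwargs("openai", model="gpt-4o-mini", api_key="YOUR_OPENAI_API_KEY"))
```

The returned dictionary can then be unpacked into a call, for example clean_column(df, "name", **provider_kwargs("groq", api_key="YOUR_GROQ_API_KEY")).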
License
MIT License — free to use, modify, and distribute.
Author
Sujoy Panigrahi (spanigrahidev)
GitHub: https://github.com/spanigrahidev
PyPI: https://pypi.org/project/aillmcleaner/
File details
Details for the file aillmcleaner-0.2.1.tar.gz.
File metadata
- Download URL: aillmcleaner-0.2.1.tar.gz
- Upload date:
- Size: 6.6 kB
- Tags: Source
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.2.0 CPython/3.13.13
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | 2797f408225991babfcf21799eba5fdaa7a1114d52c14745b4b67ba15891291d |
| MD5 | 3ab5a98d07e0933bd000ae2710b4c40d |
| BLAKE2b-256 | 6966496d64bd1195df4d30f70a78591a29c0f19f708ff86f4cf3e1e6d04ff4d7 |
File details
Details for the file aillmcleaner-0.2.1-py3-none-any.whl.
File metadata
- Download URL: aillmcleaner-0.2.1-py3-none-any.whl
- Upload date:
- Size: 7.1 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.2.0 CPython/3.13.13
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | 31cff5eefe9af2f7ca2205f687ca7dfc5df185f660c136ec5318145f0ea87137 |
| MD5 | bdc789855690c11591327bc569991e7e |
| BLAKE2b-256 | 0a16a6545416bbcf6dae4f15a08bb07cad5be297f93dd952167db1b27cfc9b72 |