An AI-powered Python library for context-aware data cleaning using local LLMs

AILLMCleaner

An AI-powered Python library for intelligent, context-aware automated data cleaning using both Local LLMs (Ollama) and Cloud AI APIs (Google Gemini, OpenAI, Groq).


What Does It Do?

Real-world datasets are messy: names in the wrong case, inconsistent spellings of the same city, missing values, invalid emails. aillmcleaner uses Large Language Models to interpret each value in context and fix it, rather than relying only on fixed rules.


Requirements

  • Python 3.8 or higher
  • pandas
  • requests
  • openai
  • google-genai
  • One AI provider (Ollama, Gemini, OpenAI, or Groq)

Installing the package with pip should also pull in the Python dependencies listed above:

pip install aillmcleaner

AI Provider Options

Ollama — Free, runs locally on your computer. No API key needed. Download from: https://ollama.com

Google Gemini — Free tier available. Get API key from: https://aistudio.google.com

OpenAI (ChatGPT) — Paid service. Get API key from: https://platform.openai.com

Groq — Free tier available, very fast. Get API key from: https://console.groq.com


Installation

pip install aillmcleaner

Quick Start

Using Ollama (Free, No API Key Needed)

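First start Ollama on your computer and pull a model. ollama serve keeps running in the foreground, so run ollama pull in a second terminal: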

ollama serve
ollama pull llama3.2
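
Before calling the library, it can help to confirm that the local Ollama server is reachable and that the model has been pulled. The sketch below is not part of aillmcleaner. It assumes Ollama's default address (http://localhost:11434) and its /api/tags endpoint, which lists locally available models; the function name ollama_models is hypothetical.

```python
import json
import urllib.request

OLLAMA_URL = "http://localhost:11434"  # Ollama's default address (assumed unchanged)


def ollama_models(base_url=OLLAMA_URL, timeout=2.0):
    """Return the names of locally pulled models, or None if the server is unreachable."""
    try:
        with urllib.request.urlopen(f"{base_url}/api/tags", timeout=timeout) as resp:
            payload = json.load(resp)
    except (OSError, ValueError):
        # OSError covers refused connections and timeouts; ValueError covers bad JSON.
        return None
    if not isinstance(payload, dict):
        return None
    return [m.get("name", "") for m in payload.get("models", [])]


if __name__ == "__main__":
    models = ollama_models()
    if models is None:
        print("Ollama is not reachable; start it with: ollama serve")
    elif not any(name.startswith("llama3.2") for name in models):
        print("Model not found locally; run: ollama pull llama3.2")
    else:
        print("Ollama is ready:", models)
```
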

Then use in Python:

import pandas as pd
from aillmcleaner import clean_column, standardize_column, fill_missing, detect_anomalies

data = {
    "name":    ["alice johnson", "BOB SMITH", "Charlie  Brown"],
    "city":    ["new york", "newyork", "LA"],
    "country": [None, "USA", "United States"]
}

df = pd.DataFrame(data)

# Fix name casing
df["name"] = clean_column(df, "name",
    instruction="Convert to proper title case. Return only the name.")

# Standardize city names
df["city"] = standardize_column(df, "city",
    categories=["New York", "Los Angeles", "Chicago"])

# Fill missing country values
df["country"] = fill_missing(df, "country", context_columns=["city"])

print(df)

Using Google Gemini

from aillmcleaner import clean_column

df["name"] = clean_column(df, "name",
    provider="gemini",
    api_key="YOUR_GEMINI_API_KEY")

Or set the key as an environment variable:

export GEMINI_API_KEY="your_key_here"
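
To keep keys out of source code, one option is to read the variable yourself and pass it through the documented api_key argument. This is a minimal sketch, not the library's own lookup logic: the helper names require_api_key and clean_names_with_gemini are hypothetical, and only the clean_column call comes from the examples above.

```python
import os


def require_api_key(env_var="GEMINI_API_KEY"):
    """Return the API key stored in env_var, or raise a clear error if it is unset."""
    key = os.environ.get(env_var, "").strip()
    if not key:
        raise RuntimeError(f"Set {env_var} before using a cloud provider.")
    return key


def clean_names_with_gemini(df):
    """Clean the 'name' column with Gemini, passing the key explicitly."""
    # Imported lazily so the helper above works without the package installed.
    from aillmcleaner import clean_column

    return clean_column(df, "name",
                        provider="gemini",
                        api_key=require_api_key("GEMINI_API_KEY"))
```

Failing early with a readable message avoids sending requests with an empty key.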

Using OpenAI

from aillmcleaner import clean_column

df["name"] = clean_column(df, "name",
    provider="openai",
    model="gpt-4o-mini",
    api_key="YOUR_OPENAI_API_KEY")

Using Groq

from aillmcleaner import clean_column

df["name"] = clean_column(df, "name",
    provider="groq",
    model="llama3-8b-8192",
    api_key="YOUR_GROQ_API_KEY")

All Available Functions

clean_text — Clean a single text value.

from aillmcleaner import clean_text
result = clean_text("helo wrold", provider="gemini", api_key="YOUR_KEY")

clean_column — Clean an entire DataFrame column.

df["name"] = clean_column(df, "name",
    instruction="Fix spelling and capitalize properly.")
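
Because an LLM rewrites the values, it is worth reviewing what changed before overwriting a column. The helper below is a hypothetical sketch that works on any two equal-length sequences, for example a copy of df["name"] taken before cleaning and the result of clean_column. The sample values are illustrative, not real model output.

```python
def changed_values(before, after):
    """Return (position, old, new) for every value that differs after cleaning."""
    before, after = list(before), list(after)
    if len(before) != len(after):
        raise ValueError("before and after must have the same length")
    # Note: NaN never equals itself, so NaN entries are always reported.
    return [(i, old, new)
            for i, (old, new) in enumerate(zip(before, after))
            if old != new]


# Illustrative values only; a real run would compare the column before and after.
original = ["alice johnson", "Bob Smith"]
cleaned = ["Alice Johnson", "Bob Smith"]
for position, old, new in changed_values(original, cleaned):
    print(f"row {position}: {old!r} -> {new!r}")
```
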

standardize_column — Map column values to a fixed list of categories.

df["city"] = standardize_column(df, "city",
    categories=["New York", "Los Angeles", "Chicago"])
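
A model may occasionally return a value that is not in the category list. The following sketch, which is independent of the library, checks a standardized column against the allowed categories. The function name outside_categories is hypothetical, and it accepts any iterable, including a pandas Series.

```python
def outside_categories(values, categories):
    """Return the distinct non-missing values that are not in the allowed categories."""
    allowed = set(categories)
    unexpected = []
    for value in values:
        if value is None or value != value:  # skip None and NaN
            continue
        if value not in allowed and value not in unexpected:
            unexpected.append(value)
    return unexpected


categories = ["New York", "Los Angeles", "Chicago"]
leftovers = outside_categories(["New York", "LA", None, "LA"], categories)
if leftovers:
    print("Values still outside the category list:", leftovers)
```
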

fill_missing — Fill missing values using context from other columns.

df["country"] = fill_missing(df, "country",
    context_columns=["city", "state"])

detect_anomalies — Find suspicious or invalid values in a column.

bad_values = detect_anomalies(df, "email")
print(bad_values)

clean_dataframe — Clean all text columns in a DataFrame at once.

from aillmcleaner import clean_dataframe
cleaned_df = clean_dataframe(df, provider="groq", api_key="YOUR_KEY")

Provider and Model Options

  • provider can be: ollama (default), gemini, openai, groq
  • model is optional (a default is used for each provider)
  • api_key is required only for cloud providers (gemini, openai, groq)
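
The rules above can be sketched as a small helper that assembles keyword arguments before calling any of the cleaning functions. This is an assumption-laden illustration rather than the library's own validation: provider_kwargs is a hypothetical name, and the check is stricter than the library, which according to the Gemini section can also take the key from an environment variable.

```python
CLOUD_PROVIDERS = {"gemini", "openai", "groq"}
ALL_PROVIDERS = CLOUD_PROVIDERS | {"ollama"}


def provider_kwargs(provider="ollama", model=None, api_key=None):
    """Build keyword arguments for the cleaning functions, checking the rules above."""
    if provider not in ALL_PROVIDERS:
        raise ValueError(
            f"Unknown provider {provider!r}; expected one of {sorted(ALL_PROVIDERS)}")
    if provider in CLOUD_PROVIDERS and not api_key:
        raise ValueError(f"provider={provider!r} needs an api_key")
    kwargs = {"provider": provider}
    if model is not None:
        kwargs["model"] = model  # omit to let the library choose its default
    if provider in CLOUD_PROVIDERS:
        kwargs["api_key"] = api_key
    return kwargs
```

The result can then be unpacked into a call, for example clean_column(df, "name", **provider_kwargs("openai", model="gpt-4o-mini", api_key="YOUR_OPENAI_API_KEY")).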


License

MIT License — free to use, modify, and distribute.


Author

Sujoy Panigrahi (spanigrahidev)
GitHub: https://github.com/spanigrahidev
PyPI: https://pypi.org/project/aillmcleaner/
