AI toolkit for tabular data — auto EDA, data profiling, anomaly detection, and smart transformations on DataFrames.
tableai
Profile, clean, and query tabular data with one-liners — plus natural-language DataFrame analysis.
tableai is a toolkit for making sense of DataFrames fast. Profile any DataFrame and get column types, null counts, descriptive statistics, correlations, and a data-quality score. Clean it with a single call that imputes missing values, drops duplicates, and clips outliers. Detect anomalies with IQR or Isolation Forest. Get rule-based natural-language insights — or ask questions in plain English and have anyllm generate the pandas code for you.
Built by Viet-Anh Nguyen at NRL.ai.
Why tableai?
- One-liner API: `tableai.profile(df)` gives you everything in one call
- Plugin architecture: register custom profilers, cleaners, and anomaly detectors
- Local-first: all core features work without any cloud or LLM call
- Minimal core deps: `pandas` and `numpy`; sklearn and anyllm are optional
- Production-ready: structured dataclass results, JSON export, reproducible
Installation
pip install tableai
For optional features:
pip install tableai[sklearn] # Isolation Forest + KMeans clustering
pip install tableai[llm] # NL querying via anyllm
pip install tableai[all] # everything
Python 3.8+ supported (tested on 3.8, 3.9, 3.10, 3.11, 3.12, 3.13)
Quick Start
import tableai
import pandas as pd
df = pd.read_csv("sales.csv")
# 1. Profile the DataFrame (dtypes, nulls, stats, correlations, quality score)
report = tableai.profile(df)
print(report.quality_score) # 0.0 - 1.0
print(report.nulls) # per-column null counts
print(report.correlations.head()) # top correlated pairs
# 2. Clean the DataFrame (impute, dedupe, clip outliers)
clean = tableai.clean(df, impute=True, dedupe=True, clip_outliers=True)
# 3. Detect anomalies (IQR by default, Isolation Forest if sklearn installed)
anomalies = tableai.anomalies(df, method="iqr")
print(f"{len(anomalies)} anomalous rows")
# 4. Rule-based insights
for insight in tableai.insights(df):
    print("-", insight)
# 5. Natural-language querying (requires tableai[llm] + anyllm)
result = tableai.ask(df, "what is the average revenue by region?")
print(result)
Models & Methods
Profiling
- Dtype detection — numeric / categorical / datetime / text / boolean / ID
- Null analysis — per-column null counts, percentages, and null patterns
- Descriptive statistics — mean, std, min, 25/50/75 percentiles, max, skew, kurtosis
- Cardinality — unique counts and top-K value frequencies
- Correlation matrix — Pearson for numerics, Cramer's V for categoricals
- Duplicate detection — exact and near-duplicate row counts
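The correlation bullet pairs Pearson (numeric columns) with Cramér's V (categorical columns). As a rough illustration, and not tableai's own implementation, both can be computed with plain pandas and numpy. The helper name `cramers_v`, the sample columns, and the choice of the uncorrected form of Cramér's V are assumptions made for this sketch.

```python
import numpy as np
import pandas as pd


def cramers_v(a: pd.Series, b: pd.Series) -> float:
    """Uncorrected Cramér's V between two categorical Series (hypothetical helper)."""
    table = pd.crosstab(a, b).to_numpy(dtype=float)
    n = table.sum()
    # Expected counts under independence: row totals x column totals / n.
    expected = np.outer(table.sum(axis=1), table.sum(axis=0)) / n
    chi2 = ((table - expected) ** 2 / expected).sum()
    k = min(table.shape) - 1  # smaller table dimension minus one
    return float(np.sqrt(chi2 / (n * k))) if k > 0 else 0.0


# Made-up sample data for illustration only.
df = pd.DataFrame({
    "price": [10.0, 12.0, 14.0, 16.0],
    "quantity": [1, 2, 3, 4],
    "region": ["north", "north", "south", "south"],
    "tier": ["gold", "gold", "basic", "basic"],
})

# Pearson correlation matrix over the numeric columns only.
pearson = df.select_dtypes("number").corr(method="pearson")
# Association between the two categorical columns.
v = cramers_v(df["region"], df["tier"])
```

Here `region` fully determines `tier`, so `v` comes out at the maximum value of 1, while two unrelated categoricals would score near 0.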
Cleaning
Configurable pipeline applied in order:
- Drop constant columns — zero variance
- Impute: `median` for numerics, `mode` for categoricals (configurable)
- Deduplicate: drop exact-duplicate rows
- Clip outliers: IQR method (`[Q1 - 1.5*IQR, Q3 + 1.5*IQR]`)
- Type coercion: auto-convert date-like strings to datetime
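To make the order of the steps concrete, here is a minimal pandas sketch of the first four of them (the type-coercion step is left out). It is an illustration under stated assumptions, not tableai's code: the function name `clean_sketch` and the sample data are hypothetical.

```python
import pandas as pd


def clean_sketch(df: pd.DataFrame) -> pd.DataFrame:
    """Illustrative cleaning pipeline (hypothetical, not tableai's implementation)."""
    # 1. Drop constant columns: at most one distinct non-null value.
    out = df.loc[:, df.nunique(dropna=True) > 1].copy()
    # 2. Impute: median for numeric columns, mode for everything else.
    for col in out.columns:
        if out[col].isna().any():
            if pd.api.types.is_numeric_dtype(out[col]):
                out[col] = out[col].fillna(out[col].median())
            else:
                out[col] = out[col].fillna(out[col].mode().iloc[0])
    # 3. Deduplicate: drop exact-duplicate rows.
    out = out.drop_duplicates().reset_index(drop=True)
    # 4. Clip numeric columns to [Q1 - 1.5*IQR, Q3 + 1.5*IQR].
    for col in out.select_dtypes("number").columns:
        q1, q3 = out[col].quantile([0.25, 0.75])
        iqr = q3 - q1
        out[col] = out[col].clip(q1 - 1.5 * iqr, q3 + 1.5 * iqr)
    return out


# Made-up sample: one constant column, two nulls, one duplicate row, one outlier.
raw = pd.DataFrame({
    "amount": [10.0, 12.0, None, 11.0, 13.0, 500.0, 10.0],
    "region": ["north", "south", "north", None, "south", "north", "north"],
    "currency": ["USD"] * 7,
})
cleaned = clean_sketch(raw)
```

Order matters: imputing before deduplicating means rows that only differed by a missing value can collapse into duplicates, and clipping last keeps the quartiles from being computed on duplicated rows.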
Anomaly detection
| Method | Algorithm | Notes |
|---|---|---|
| `iqr` (default) | 1.5 x IQR per numeric column | Zero deps |
| `zscore` | Threshold on the absolute z-score, \|z\| | |
| `isolation_forest` | sklearn `IsolationForest` | Needs `tableai[sklearn]` |
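The two dependency-free methods can be sketched in a few lines of pandas. This is an illustration, not tableai's implementation: the function names are hypothetical, and the z-score cutoff of 3.0 is an assumed, commonly used default rather than a value taken from tableai.

```python
import pandas as pd


def iqr_flags(df: pd.DataFrame, k: float = 1.5) -> pd.Series:
    """True for rows with any numeric value outside [Q1 - k*IQR, Q3 + k*IQR]."""
    num = df.select_dtypes("number")
    q1, q3 = num.quantile(0.25), num.quantile(0.75)
    iqr = q3 - q1
    outside = (num < q1 - k * iqr) | (num > q3 + k * iqr)
    return outside.any(axis=1)


def zscore_flags(df: pd.DataFrame, threshold: float = 3.0) -> pd.Series:
    """True for rows with any numeric |z| above threshold (3.0 is an assumption)."""
    num = df.select_dtypes("number")
    z = (num - num.mean()) / num.std(ddof=0)
    return (z.abs() > threshold).any(axis=1)


# Made-up data: 19 ordinary values between 100 and 104, then one extreme value.
df = pd.DataFrame({"revenue": [100 + (i % 5) for i in range(19)] + [1000]})
iqr_rows = df[iqr_flags(df)]
z_rows = df[zscore_flags(df)]
```

One design note: with a population standard deviation, a single point in a sample of n rows can never reach a z-score above sqrt(n - 1), so a fixed cutoff of 3 cannot fire on very small tables. The IQR rule does not have that limitation.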
Data quality score
Weighted average (0.0 - 1.0) of four sub-scores:
- Completeness: `1 - null_ratio`
- Uniqueness: ratio of distinct rows
- Consistency — fraction of columns with a dominant dtype
- Validity — fraction of values inside expected ranges / formats
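The first two sub-scores are simple enough to sketch directly in pandas. The functions below are hypothetical helpers, not part of tableai, and they assume that `null_ratio` is measured over all cells of a non-empty DataFrame.

```python
import pandas as pd


def completeness(df: pd.DataFrame) -> float:
    """1 - null_ratio, with null_ratio taken as the share of missing cells (assumption)."""
    return 1.0 - float(df.isna().to_numpy().mean())


def uniqueness(df: pd.DataFrame) -> float:
    """Ratio of distinct rows to total rows."""
    return len(df.drop_duplicates()) / len(df)


# Made-up sample: 8 cells with 1 null, 4 rows with 1 exact duplicate.
df = pd.DataFrame({
    "a": [1.0, 1.0, 2.0, None],
    "b": ["x", "x", "y", "z"],
})
c = completeness(df)  # 1 - 1/8
u = uniqueness(df)    # 3 distinct rows out of 4
```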
Insights (rule-based NL)
Pattern-driven natural-language observations, for example:
- "Column 'age' has 23.4% missing values"
- "'price' and 'quantity' are strongly positively correlated (r=0.87)"
- "Column 'id' appears to be a unique identifier"
- "12 rows are exact duplicates"
Natural-language querying (optional)
tableai.ask(df, "…") uses anyllm to generate pandas code for your question, executes it in a sandboxed namespace, and returns the result. Works with any local or cloud LLM that anyllm supports.
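The "execute in a sandboxed namespace" step can be pictured roughly as follows. This is a sketch under assumptions, not tableai's implementation: the function name is hypothetical, the generated snippet is hard-coded in place of a real LLM response, and the convention that the snippet assigns its answer to `result` is invented for the example.

```python
import pandas as pd


def run_generated_code(code: str, df: pd.DataFrame):
    """Run a code snippet with access to `df` only (hypothetical helper)."""
    # An empty __builtins__ hides open(), __import__ and friends from the snippet.
    # The snippet receives a copy, so the caller's DataFrame is left untouched.
    namespace = {"__builtins__": {}, "df": df.copy()}
    exec(code, namespace)
    # Assumed convention: the snippet stores its answer in `result`.
    return namespace.get("result")


# Stand-in for code an LLM might return for "average revenue by region".
generated = 'result = df.groupby("region")["revenue"].mean()'

df = pd.DataFrame({
    "region": ["north", "north", "south"],
    "revenue": [10.0, 20.0, 30.0],
})
answer = run_generated_code(generated, df)
```

A restricted `exec` namespace like this limits accidents but is not a security boundary, so untrusted models or prompts still call for process-level isolation.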
Models & Methods
tableai uses pure pandas/numpy for core operations — no ML dependencies required.
Profiling (tableai.profile) — Computes per-column:
- Dtype detection (numeric, categorical, datetime, string)
- Null counts and percentages
- Unique value counts
- Numeric statistics: mean, median, std, min, max, quartiles, skewness, kurtosis
- Top categorical values
- Pearson correlation matrix between numeric columns
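For orientation, the per-column part of such a profile can be approximated with plain pandas. The function name `profile_sketch` and the shape of the returned dictionary are hypothetical; tableai's `ProfileReport` may be organised differently.

```python
import pandas as pd


def profile_sketch(df: pd.DataFrame) -> dict:
    """Per-column summary as a plain dict (hypothetical, not tableai's ProfileReport)."""
    report = {}
    for col in df.columns:
        s = df[col]
        info = {
            "dtype": str(s.dtype),
            "nulls": int(s.isna().sum()),
            "null_pct": float(s.isna().mean() * 100),
            "unique": int(s.nunique()),
        }
        if pd.api.types.is_numeric_dtype(s):
            # Numeric statistics; pandas skips missing values by default.
            info.update(
                mean=float(s.mean()),
                median=float(s.median()),
                std=float(s.std()),
                min=float(s.min()),
                max=float(s.max()),
                skew=float(s.skew()),
                kurtosis=float(s.kurt()),
            )
        else:
            # Most frequent values for non-numeric columns.
            info["top_values"] = s.value_counts().head(3).to_dict()
        report[col] = info
    return report


# Made-up sample data.
df = pd.DataFrame({
    "age": [20.0, 30.0, None, 40.0, 50.0],
    "city": ["a", "b", "a", "c", "a"],
})
profile = profile_sketch(df)
```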
Cleaning (tableai.clean) — Configurable strategies:
- Missing values: median (numeric), mode (categorical), drop, or zero
- Duplicate removal
- Outlier handling: IQR-based clipping or removal
Anomaly Detection (tableai.anomalies):
- IQR method (default, no deps) — flags points outside Q1-1.5·IQR / Q3+1.5·IQR
- Isolation Forest (optional via the `[sklearn]` extra, requires scikit-learn)
Quality Scoring (tableai.quality_score): weighted score from 0.0 to 1.0, built from four weighted sub-scores:
- Completeness 35% (1 - null_ratio)
- Validity 25% (IQR-based outlier ratio)
- Uniqueness 20% (duplicate detection)
- Consistency 20% (mixed-type detection)
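As a worked example of the weighting, the sketch below combines four sub-scores using the percentages listed above. The function name and the sample sub-score values are made up; only the weights come from this README.

```python
# Weights as listed above (35% / 25% / 20% / 20%).
WEIGHTS = {
    "completeness": 0.35,
    "validity": 0.25,
    "uniqueness": 0.20,
    "consistency": 0.20,
}


def weighted_quality(sub_scores: dict) -> float:
    """Weighted average of sub-scores that each lie in [0, 1] (hypothetical helper)."""
    missing = set(WEIGHTS) - set(sub_scores)
    if missing:
        raise ValueError(f"missing sub-scores: {sorted(missing)}")
    return sum(WEIGHTS[name] * sub_scores[name] for name in WEIGHTS)


# Made-up sub-scores: 0.35*0.9 + 0.25*0.8 + 0.20*1.0 + 0.20*1.0 = 0.915
score = weighted_quality({
    "completeness": 0.9,
    "validity": 0.8,
    "uniqueness": 1.0,
    "consistency": 1.0,
})
```

Because the weights sum to 1, the result stays in the same range as the sub-scores; multiplying by 100 gives a percentage-style figure.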
Insights (tableai.insights) — Rule-based natural language insights about missing values, correlations, skewness, cardinality, duplicates, and class imbalance.
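Rules of this kind are straightforward to express in pandas. The sketch below covers only three of them (missing values, identifier-like columns, duplicates). The function name, the 20% missing-value threshold, and the message wording are assumptions for illustration, not tableai's actual rules.

```python
import pandas as pd


def insights_sketch(df: pd.DataFrame, null_threshold: float = 0.2) -> list:
    """Generate a few rule-based messages (hypothetical rules and thresholds)."""
    messages = []
    for col in df.columns:
        ratio = df[col].isna().mean()
        # Rule 1: report columns with a large share of missing values.
        if ratio >= null_threshold:
            messages.append(f"Column '{col}' has {ratio:.1%} missing values")
        # Rule 2: fully populated columns with all-distinct values look like IDs.
        if df[col].notna().all() and df[col].is_unique:
            messages.append(f"Column '{col}' appears to be a unique identifier")
    # Rule 3: count exact-duplicate rows.
    dupes = int(df.duplicated().sum())
    if dupes:
        messages.append(f"{dupes} rows are exact duplicates")
    return messages


# Made-up sample data.
df = pd.DataFrame({
    "id": [1, 2, 3, 4],
    "age": [30.0, None, None, 41.0],
})
messages = insights_sketch(df)
```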
Natural Language Querying (tableai.ask, tableai.query): optional via the `[llm]` extra. Uses anyllm to generate pandas code from natural language, and falls back to keyword matching when no LLM is available.
API Reference
| Function | Purpose |
|---|---|
| `tableai.profile(df)` | Returns `ProfileReport` dataclass |
| `tableai.clean(df, **opts)` | Returns a cleaned DataFrame |
| `tableai.anomalies(df, method="iqr")` | Returns rows flagged as anomalous |
| `tableai.quality_score(df)` | Returns float 0.0 - 1.0 |
| `tableai.insights(df)` | Returns `list[str]` of NL insights |
| `tableai.ask(df, question, model=None)` | NL query via LLM |
| `tableai.compare(df1, df2)` | Diff two DataFrames (schema + data) |
CLI Usage
tableai profile data.csv --out report.json
tableai clean data.csv --out clean.csv
tableai anomalies data.csv --method isolation_forest
tableai ask data.csv "average sales by region"
tableai quality data.csv
Examples
Full profiling report to JSON
import tableai, pandas as pd
df = pd.read_csv("customers.csv")
report = tableai.profile(df)
report.to_json("customers_report.json")
print(f"Quality: {report.quality_score:.2f}")
Custom cleaning pipeline
import tableai
# df is an existing pandas DataFrame, e.g. one loaded with pd.read_csv
clean = tableai.clean(
    df,
    impute_numeric="median",
    impute_categorical="mode",
    dedupe=True,
    clip_outliers=True,
    drop_constant=True,
)
Ask questions in English (with Ollama)
import tableai
# Uses anyllm; defaults to Ollama if running locally
answer = tableai.ask(
    df,
    "which customer spent the most last quarter?",
    model="llama3.1:8b",
)
print(answer)
License
MIT (c) Viet-Anh Nguyen
File details
Details for the file tableai-0.2.3.tar.gz.
File metadata
- Download URL: tableai-0.2.3.tar.gz
- Upload date:
- Size: 35.4 kB
- Tags: Source
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.2.0 CPython/3.13.9
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | `6803a699a0837c83039b18a5e1cff5d9a0c7e7ea1adc3e8737b19c9284b01c2f` |
| MD5 | `bfe8b327b127f28f557aea966eb3ba42` |
| BLAKE2b-256 | `668659d1ca4cee8db4b7963d1bf9f4bf835f6347a5f319e049d9c25aa6c88b41` |
File details
Details for the file tableai-0.2.3-py3-none-any.whl.
File metadata
- Download URL: tableai-0.2.3-py3-none-any.whl
- Upload date:
- Size: 29.7 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.2.0 CPython/3.13.9
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | `64e11480d10fdb3e56505ba84a6a9e9d246c278cb384b4d437c9bc036cf30be5` |
| MD5 | `396ccf8fc13b0ef9f577846b9771ee09` |
| BLAKE2b-256 | `770602b723c4357c9c7a36b0f5b1eda7286a6ca0c11c1ea49ea3b4071952c926` |