AI toolkit for tabular data — auto EDA, data profiling, anomaly detection, and smart transformations on DataFrames.

tableai

Profile, clean, and query tabular data with one-liners — plus natural-language DataFrame analysis.

tableai is a toolkit for making sense of DataFrames fast. Profile any DataFrame and get column types, null counts, descriptive statistics, correlations, and a data-quality score. Clean it with a single call that imputes missing values, drops duplicates, and clips outliers. Detect anomalies with IQR or Isolation Forest. Get rule-based natural-language insights — or ask questions in plain English and have anyllm generate the pandas code for you.

Built by Viet-Anh Nguyen at NRL.ai.

Why tableai?

  • One-liner API: tableai.profile(df) gives you everything in one call
  • Plugin architecture: register custom profilers, cleaners, and anomaly detectors
  • Local-first: all core features work without any cloud or LLM call
  • Minimal core deps: pandas and numpy; sklearn and anyllm are optional
  • Production-ready: structured dataclass results, JSON export, reproducible

Installation

pip install tableai

For optional features:

pip install tableai[sklearn]   # Isolation Forest + KMeans clustering
pip install tableai[llm]       # NL querying via anyllm
pip install tableai[all]       # everything

Python 3.8+ supported (tested on 3.8, 3.9, 3.10, 3.11, 3.12, 3.13)

Quick Start

import tableai
import pandas as pd

df = pd.read_csv("sales.csv")

# 1. Profile the DataFrame (dtypes, nulls, stats, correlations, quality score)
report = tableai.profile(df)
print(report.quality_score)              # 0.0 - 1.0
print(report.nulls)                      # per-column null counts
print(report.correlations.head())        # top correlated pairs

# 2. Clean the DataFrame (impute, dedupe, clip outliers)
clean = tableai.clean(df, impute=True, dedupe=True, clip_outliers=True)

# 3. Detect anomalies (IQR by default, Isolation Forest if sklearn installed)
anomalies = tableai.anomalies(df, method="iqr")
print(f"{len(anomalies)} anomalous rows")

# 4. Rule-based insights
for insight in tableai.insights(df):
    print("-", insight)

# 5. Natural-language querying (requires tableai[llm] + anyllm)
result = tableai.ask(df, "what is the average revenue by region?")
print(result)

Models & Methods

Profiling

  • Dtype detection — numeric / categorical / datetime / text / boolean / ID
  • Null analysis — per-column null counts, percentages, and null patterns
  • Descriptive statistics — mean, std, min, 25/50/75 percentiles, max, skew, kurtosis
  • Cardinality — unique counts and top-K value frequencies
  • Correlation matrix — Pearson for numerics, Cramer's V for categoricals
  • Duplicate detection — exact and near-duplicate row counts
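
Most of the per-column checks above map onto plain pandas calls. The following is a minimal sketch of that idea, not tableai's implementation; the helper name basic_profile and the layout of the returned dict are assumptions made for illustration.

```python
import pandas as pd


def basic_profile(df: pd.DataFrame, top_k: int = 3) -> dict:
    """Hypothetical helper: collect a few profiling facts with plain pandas."""
    numeric = df.select_dtypes(include="number")
    return {
        # Per-column null counts and percentages
        "null_counts": df.isna().sum().to_dict(),
        "null_pct": (df.isna().mean() * 100).round(1).to_dict(),
        # Cardinality: distinct non-null values per column
        "unique_counts": df.nunique().to_dict(),
        # Top-K value frequencies for the non-numeric columns
        "top_values": {
            col: df[col].value_counts().head(top_k).to_dict()
            for col in df.columns.difference(numeric.columns)
        },
        # Descriptive statistics, skew, and Pearson correlation for numerics
        "stats": numeric.describe(),
        "skew": numeric.skew().to_dict(),
        "correlations": numeric.corr(method="pearson"),
        # Exact duplicate rows (every occurrence after the first)
        "duplicate_rows": int(df.duplicated().sum()),
    }


df = pd.DataFrame(
    {
        "price": [10.0, 20.0, 30.0, 40.0, 40.0],
        "quantity": [1, 2, 3, 4, 4],
        "region": ["north", "south", None, "north", "north"],
    }
)
profile = basic_profile(df)
print(profile["null_counts"], profile["duplicate_rows"])
```

Cramer's V and near-duplicate detection are left out of the sketch because the source does not say how tableai computes them.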

Cleaning

Configurable pipeline applied in order:

  1. Drop constant columns: zero variance
  2. Impute: median for numerics, mode for categoricals (configurable)
  3. Deduplicate: drop exact-duplicate rows
  4. Clip outliers: IQR method ([Q1 - 1.5*IQR, Q3 + 1.5*IQR])
  5. Type coercion: auto-convert date-like strings to datetime
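
To make the order of operations concrete, here is a rough pandas sketch of the five steps. It is an illustration under assumptions, not the library's code: the function name clean_sketch is hypothetical, and the rule "treat a column as dates only if every value parses" is one possible reading of "date-like".

```python
import pandas as pd
from pandas.api import types as ptypes


def clean_sketch(df: pd.DataFrame) -> pd.DataFrame:
    """Hypothetical helper that follows the five steps listed above."""
    out = df.copy()

    # 1. Drop constant columns (at most one distinct value, counting NaN)
    constant = [c for c in out.columns if out[c].nunique(dropna=False) <= 1]
    out = out.drop(columns=constant)

    # 2. Impute: median for numerics, mode for everything else
    for col in out.columns:
        if not out[col].isna().any():
            continue
        if ptypes.is_numeric_dtype(out[col]):
            out[col] = out[col].fillna(out[col].median())
        else:
            modes = out[col].mode()
            if not modes.empty:
                out[col] = out[col].fillna(modes.iloc[0])

    # 3. Deduplicate: drop exact-duplicate rows
    out = out.drop_duplicates().reset_index(drop=True)

    # 4. Clip numeric outliers to [Q1 - 1.5*IQR, Q3 + 1.5*IQR]
    for col in out.select_dtypes(include="number").columns:
        q1, q3 = out[col].quantile([0.25, 0.75])
        iqr = q3 - q1
        out[col] = out[col].clip(lower=q1 - 1.5 * iqr, upper=q3 + 1.5 * iqr)

    # 5. Type coercion: convert text columns whose values all parse as dates
    for col in out.columns:
        if ptypes.is_numeric_dtype(out[col]) or ptypes.is_datetime64_any_dtype(out[col]):
            continue
        parsed = pd.to_datetime(out[col], errors="coerce")
        if parsed.notna().all():
            out[col] = parsed

    return out


raw = pd.DataFrame(
    {
        "amount": [10.0, 12.0, None, 11.0, 500.0, 10.0],
        "region": ["north", "south", "north", None, "north", "north"],
        "source": ["web"] * 6,
        "day": ["2024-01-01", "2024-01-02", "2024-01-03",
                "2024-01-04", "2024-01-05", "2024-01-01"],
    }
)
cleaned = clean_sketch(raw)
print(cleaned)
```

Running the steps in this order matters: imputing before deduplicating can turn rows that differed only by a missing value into exact duplicates, and clipping after deduplicating keeps repeated rows from skewing the quartiles.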

Anomaly detection

Method              Algorithm                       Notes
iqr (default)       1.5 x IQR per numeric column    Zero deps
zscore              z-score per numeric column
isolation_forest    sklearn IsolationForest         Needs tableai[sklearn]
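
The two dependency-free methods can be sketched as boolean row masks in pandas. This is an illustration only: the function names are hypothetical, and the z-score cutoff of 3.0 is a common convention assumed here, since the source does not state the value tableai uses.

```python
import pandas as pd


def iqr_mask(df: pd.DataFrame, k: float = 1.5) -> pd.Series:
    """True for rows with any numeric value outside [Q1 - k*IQR, Q3 + k*IQR]."""
    num = df.select_dtypes(include="number")
    q1, q3 = num.quantile(0.25), num.quantile(0.75)
    iqr = q3 - q1
    outside = (num < q1 - k * iqr) | (num > q3 + k * iqr)
    return outside.any(axis=1)


def zscore_mask(df: pd.DataFrame, threshold: float = 3.0) -> pd.Series:
    """True for rows with any numeric |z| above `threshold` (assumed cutoff)."""
    num = df.select_dtypes(include="number")
    z = (num - num.mean()) / num.std(ddof=0)
    return (z.abs() > threshold).any(axis=1)


# 19 ordinary values plus one extreme value in the last row
df = pd.DataFrame({"amount": [10, 11, 12] * 6 + [11, 300]})
print(df[iqr_mask(df)])
print(df[zscore_mask(df)])
```

One design note: with very few rows a z-score cutoff of 3 can never trigger, because a single outlier among n points has a population z-score of at most sqrt(n - 1). The IQR fences do not have that limitation, which may be one reason to prefer them as a default.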

Data quality score

Weighted average (0.0 - 1.0) of four sub-scores:

  • Completeness: 1 - null_ratio
  • Uniqueness: ratio of distinct rows
  • Consistency: fraction of columns with a dominant dtype
  • Validity: fraction of values inside expected ranges / formats
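
The first two sub-scores follow directly from their definitions. A small sketch, with hypothetical function names and the assumption that an empty frame scores 1.0:

```python
import pandas as pd


def completeness(df: pd.DataFrame) -> float:
    """1 - null_ratio, where null_ratio is the share of missing cells."""
    if df.size == 0:
        return 1.0  # assumption: nothing is missing from an empty frame
    return 1.0 - float(df.isna().to_numpy().mean())


def uniqueness(df: pd.DataFrame) -> float:
    """Share of rows that remain after dropping exact duplicates."""
    if len(df) == 0:
        return 1.0  # assumption: an empty frame has no duplicates
    return len(df.drop_duplicates()) / len(df)


df = pd.DataFrame({"a": [1.0, 2.0, 2.0, None], "b": ["x", "y", "y", "z"]})
scores = {"completeness": completeness(df), "uniqueness": uniqueness(df)}
print(scores)
```

In this frame one of eight cells is missing and one of four rows is a repeat, so the two values come out as 0.875 and 0.75. Consistency and validity depend on dtype and range rules that the source does not spell out, so they are not sketched here.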

Insights (rule-based NL)

Pattern-driven natural-language observations, for example:

  • "Column 'age' has 23.4% missing values"
  • "'price' and 'quantity' are strongly positively correlated (r=0.87)"
  • "Column 'id' appears to be a unique identifier"
  • "12 rows are exact duplicates"
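
Observations like these can be produced by simple threshold rules over pandas summaries. The sketch below is an assumption about how such rules might look, not the library's rule set; the function name, both thresholds, and the identifier heuristic are hypothetical.

```python
import pandas as pd
from pandas.api import types as ptypes


def insights_sketch(df, null_threshold=0.05, corr_threshold=0.8):
    """Hypothetical rule set that emits messages in the style shown above."""
    messages = []

    # Rule 1: columns with a notable share of missing values
    for col, ratio in df.isna().mean().items():
        if ratio > null_threshold:
            messages.append(f"Column '{col}' has {ratio:.1%} missing values")

    # Rule 2: strongly correlated numeric pairs (each pair reported once)
    corr = df.select_dtypes(include="number").corr()
    cols = list(corr.columns)
    for i, a in enumerate(cols):
        for b in cols[i + 1:]:
            r = corr.loc[a, b]
            if pd.notna(r) and abs(r) >= corr_threshold:
                direction = "positively" if r > 0 else "negatively"
                messages.append(
                    f"'{a}' and '{b}' are strongly {direction} correlated (r={r:.2f})"
                )

    # Rule 3: fully populated, all-distinct, non-float columns look like IDs
    for col in df.columns:
        series = df[col]
        if (len(df) > 1 and series.notna().all() and series.is_unique
                and not ptypes.is_float_dtype(series)):
            messages.append(f"Column '{col}' appears to be a unique identifier")

    # Rule 4: exact duplicate rows
    dupes = int(df.duplicated().sum())
    if dupes:
        messages.append(f"{dupes} rows are exact duplicates")

    return messages


df = pd.DataFrame(
    {
        "id": ["c1", "c2", "c3", "c4", "c5"],
        "price": [10.0, 20.0, 30.0, 40.0, 50.0],
        "quantity": [1.0, 2.0, 3.0, 4.0, 6.0],
        "age": [34.0, None, 29.0, None, 41.0],
    }
)
for message in insights_sketch(df):
    print("-", message)
```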

Natural-language querying (optional)

tableai.ask(df, "…") uses anyllm to generate pandas code for your question, executes it in a sandboxed namespace, and returns the result. Works with any local or cloud LLM that anyllm supports.
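
The "execute in a sandboxed namespace" step can be pictured as follows. This is a sketch under assumptions, not tableai's code: the function name, the allowed builtins, and the convention that generated code assigns to a variable called result are all hypothetical, and the generated string is hard-coded instead of coming from an LLM.

```python
import pandas as pd


def run_generated_code(code: str, df: pd.DataFrame):
    """Run generated code with only `df`, `pd`, and a few builtins visible."""
    # Replacing __builtins__ removes open(), __import__(), eval(), and so on.
    safe_builtins = {"len": len, "min": min, "max": max,
                     "sum": sum, "round": round, "sorted": sorted}
    namespace = {"__builtins__": safe_builtins, "df": df.copy(), "pd": pd}
    exec(code, namespace)
    if "result" not in namespace:
        raise ValueError("generated code did not assign to `result`")
    return namespace["result"]


sales = pd.DataFrame(
    {"region": ["north", "north", "south"], "revenue": [100.0, 200.0, 50.0]}
)
generated = 'result = df.groupby("region")["revenue"].mean()'
answer = run_generated_code(generated, sales)
print(answer)
```

Passing a copy of the DataFrame keeps generated code from mutating the caller's data. A restricted namespace like this limits accidents but is not a real security boundary in Python, so untrusted code still calls for process-level isolation.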

Models & Methods

tableai uses pure pandas/numpy for core operations — no ML dependencies required.

Profiling (tableai.profile) — Computes per-column:

  • Dtype detection (numeric, categorical, datetime, string)
  • Null counts and percentages
  • Unique value counts
  • Numeric statistics: mean, median, std, min, max, quartiles, skewness, kurtosis
  • Top categorical values
  • Pearson correlation matrix between numeric columns

Cleaning (tableai.clean) — Configurable strategies:

  • Missing values: median (numeric), mode (categorical), drop, or zero
  • Duplicate removal
  • Outlier handling: IQR-based clipping or removal

Anomaly Detection (tableai.anomalies):

  • IQR method (default, no deps): flags points outside [Q1 - 1.5·IQR, Q3 + 1.5·IQR]
  • Isolation Forest (optional via [sklearn], requires scikit-learn)

Quality Scoring (tableai.quality_score): weighted score from 0.0 to 1.0, with these weights:

  • Completeness 35% (1 - null_ratio)
  • Validity 25% (IQR-based outlier ratio)
  • Uniqueness 20% (duplicate detection)
  • Consistency 20% (mixed-type detection)
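
With the weights above, the overall score is a plain weighted average. A minimal sketch, assuming each sub-score is already a fraction between 0 and 1; the names WEIGHTS and weighted_quality are hypothetical:

```python
# Weights as listed above; they sum to 1.0
WEIGHTS = {
    "completeness": 0.35,
    "validity": 0.25,
    "uniqueness": 0.20,
    "consistency": 0.20,
}


def weighted_quality(sub_scores: dict) -> float:
    """Combine four sub-scores in [0, 1] into one weighted score."""
    missing = WEIGHTS.keys() - sub_scores.keys()
    if missing:
        raise ValueError(f"missing sub-scores: {sorted(missing)}")
    return sum(WEIGHTS[name] * sub_scores[name] for name in WEIGHTS)


score = weighted_quality(
    {"completeness": 0.9, "validity": 1.0, "uniqueness": 0.8, "consistency": 1.0}
)
print(round(score, 3))
```

For these inputs the terms are 0.315, 0.25, 0.16, and 0.2, so the combined score is about 0.925. Completeness carries the largest weight, so missing data pulls the score down faster than duplicates or mixed types do.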

Insights (tableai.insights) — Rule-based natural language insights about missing values, correlations, skewness, cardinality, duplicates, and class imbalance.

Natural Language Querying (tableai.ask, tableai.query): optional via the [llm] extra. Uses anyllm to generate pandas code from natural language, and falls back to keyword matching when no LLM is available.
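
The source does not describe the keyword-matching fallback, so the following is only a guess at the general shape of such a matcher. The function name, the keyword table, and the "first mentioned column is the value, second is the group" rule are all assumptions.

```python
import re

import pandas as pd

# Hypothetical mapping from question keywords to pandas aggregations
AGGREGATIONS = {"average": "mean", "mean": "mean", "total": "sum",
                "sum": "sum", "max": "max", "min": "min"}


def keyword_query(df: pd.DataFrame, question: str):
    """Answer simple '<aggregation> <column> [by <column>]' questions."""
    words = re.findall(r"[a-z0-9_]+", question.lower())
    agg = next((AGGREGATIONS[w] for w in words if w in AGGREGATIONS), None)
    columns = {str(c).lower(): c for c in df.columns}
    mentioned = [columns[w] for w in words if w in columns]
    if agg is None or not mentioned:
        raise ValueError("question does not match any known keyword pattern")
    target = mentioned[0]
    # "... by <column>" means: group by the second mentioned column
    if "by" in words and len(mentioned) > 1:
        return df.groupby(mentioned[1])[target].agg(agg)
    return df[target].agg(agg)


sales = pd.DataFrame(
    {"region": ["north", "north", "south"], "revenue": [100.0, 200.0, 50.0]}
)
by_region = keyword_query(sales, "what is the average revenue by region?")
total = keyword_query(sales, "total revenue")
print(by_region)
print(total)
```

A matcher like this only covers a narrow set of phrasings, which is why it makes sense as a fallback and not as the primary path.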

API Reference

Function                                 Purpose
tableai.profile(df)                      Returns ProfileReport dataclass
tableai.clean(df, **opts)                Returns a cleaned DataFrame
tableai.anomalies(df, method="iqr")      Returns rows flagged as anomalous
tableai.quality_score(df)                Returns float 0.0 - 1.0
tableai.insights(df)                     Returns list[str] of NL insights
tableai.ask(df, question, model=None)    NL query via LLM
tableai.compare(df1, df2)                Diff two DataFrames (schema + data)

CLI Usage

tableai profile data.csv --out report.json
tableai clean data.csv --out clean.csv
tableai anomalies data.csv --method isolation_forest
tableai ask data.csv "average sales by region"
tableai quality data.csv

Examples

Full profiling report to JSON

import tableai
import pandas as pd

df = pd.read_csv("customers.csv")
report = tableai.profile(df)
report.to_json("customers_report.json")
print(f"Quality: {report.quality_score:.2f}")
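
For readers curious how a dataclass result can offer a to_json method, here is a small standard-library sketch. MiniReport is a hypothetical stand-in with made-up fields; it is not tableai's ProfileReport, whose actual fields are not documented here.

```python
import json
import os
import tempfile
from dataclasses import asdict, dataclass, field
from typing import Dict


@dataclass
class MiniReport:
    """Hypothetical stand-in for a profiling result."""
    n_rows: int
    quality_score: float
    nulls: Dict[str, int] = field(default_factory=dict)

    def to_json(self, path: str) -> None:
        # asdict() converts the dataclass (and nested dataclasses) to plain dicts
        with open(path, "w", encoding="utf-8") as fh:
            json.dump(asdict(self), fh, indent=2)


report = MiniReport(n_rows=3, quality_score=0.92, nulls={"age": 1})
path = os.path.join(tempfile.mkdtemp(), "mini_report.json")
report.to_json(path)

with open(path, encoding="utf-8") as fh:
    loaded = json.load(fh)
print(loaded)
```

A real report that holds DataFrames or NumPy values would need those converted to lists or plain numbers first, since json.dump only handles built-in types.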

Custom cleaning pipeline

import tableai
import pandas as pd

df = pd.read_csv("customers.csv")  # same file as the profiling example

clean = tableai.clean(
    df,
    impute_numeric="median",
    impute_categorical="mode",
    dedupe=True,
    clip_outliers=True,
    drop_constant=True,
)

Ask questions in English (with Ollama)

import tableai
import pandas as pd

df = pd.read_csv("customers.csv")  # same file as the profiling example

# Uses anyllm; defaults to Ollama if running locally
answer = tableai.ask(df, "which customer spent the most last quarter?",
                     model="llama3.1:8b")
print(answer)

License

MIT (c) Viet-Anh Nguyen
