AI toolkit for tabular data — auto EDA, data profiling, anomaly detection, and smart transformations on DataFrames.

tableai

Profile, clean, and query tabular data with one-liners — plus natural-language DataFrame analysis.

tableai is a toolkit for making sense of DataFrames fast. Profile any DataFrame and get column types, null counts, descriptive statistics, correlations, and a data-quality score. Clean it with a single call that imputes missing values, drops duplicates, and clips outliers. Detect anomalies with IQR or Isolation Forest. Get rule-based natural-language insights — or ask questions in plain English and have anyllm generate the pandas code for you.

Built by Viet-Anh Nguyen at NRL.ai.

Why tableai?

  • One-liner API: tableai.profile(df) gives you everything in one call
  • Plugin architecture: register custom profilers, cleaners, and anomaly detectors
  • Local-first: all core features work without any cloud or LLM call
  • Minimal core deps: pandas and numpy; sklearn and anyllm are optional
  • Production-ready: structured dataclass results, JSON export, reproducible output

Installation

pip install tableai

For optional features:

pip install tableai[sklearn]   # Isolation Forest + KMeans clustering
pip install tableai[llm]       # NL querying via anyllm
pip install tableai[all]       # everything

Python 3.8+ is supported (tested on 3.8, 3.9, 3.10, 3.11, 3.12, and 3.13).

Quick Start

import tableai
import pandas as pd

df = pd.read_csv("sales.csv")

# 1. Profile the DataFrame (dtypes, nulls, stats, correlations, quality score)
report = tableai.profile(df)
print(report.quality_score)              # 0.0 - 1.0
print(report.nulls)                      # per-column null counts
print(report.correlations.head())        # top correlated pairs

# 2. Clean the DataFrame (impute, dedupe, clip outliers)
clean = tableai.clean(df, impute=True, dedupe=True, clip_outliers=True)

# 3. Detect anomalies (IQR by default, Isolation Forest if sklearn installed)
anomalies = tableai.anomalies(df, method="iqr")
print(f"{len(anomalies)} anomalous rows")

# 4. Rule-based insights
for insight in tableai.insights(df):
    print("-", insight)

# 5. Natural-language querying (requires tableai[llm] + anyllm)
result = tableai.ask(df, "what is the average revenue by region?")
print(result)

Models & Methods

Profiling

  • Dtype detection — numeric / categorical / datetime / text / boolean / ID
  • Null analysis — per-column null counts, percentages, and null patterns
  • Descriptive statistics — mean, std, min, 25/50/75 percentiles, max, skew, kurtosis
  • Cardinality — unique counts and top-K value frequencies
  • Correlation matrix — Pearson for numerics, Cramer's V for categoricals
  • Duplicate detection — exact and near-duplicate row counts
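
The two correlation measures named above can be sketched with plain pandas and numpy. This is an illustration of the standard formulas, not tableai's internal code; the helper name `cramers_v` and the sample columns are assumptions made for the example.

```python
import numpy as np
import pandas as pd


def cramers_v(a: pd.Series, b: pd.Series) -> float:
    """Cramer's V between two categorical columns (0 = no association, 1 = perfect)."""
    table = pd.crosstab(a, b).to_numpy(dtype=float)
    n = table.sum()
    if n == 0 or min(table.shape) < 2:
        return 0.0
    # Expected counts under independence: outer product of the marginals, divided by n.
    expected = np.outer(table.sum(axis=1), table.sum(axis=0)) / n
    mask = expected > 0
    chi2 = (((table - expected) ** 2)[mask] / expected[mask]).sum()
    return float(np.sqrt(chi2 / (n * (min(table.shape) - 1))))


# Hypothetical sample data.
df = pd.DataFrame({
    "price": [10.0, 12.0, 14.0, 16.0],
    "quantity": [1, 2, 3, 4],
    "region": ["north", "north", "south", "south"],
    "tier": ["a", "a", "b", "b"],
})

pearson = df[["price", "quantity"]].corr(method="pearson")  # numeric columns
association = cramers_v(df["region"], df["tier"])           # categorical columns
```

This version applies no bias correction, so small samples tend to overstate the association.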

Cleaning

Configurable pipeline applied in order:

  1. Drop constant columns: zero variance
  2. Impute: median for numerics, mode for categoricals (configurable)
  3. Deduplicate: drop exact-duplicate rows
  4. Clip outliers: IQR method ([Q1 - 1.5*IQR, Q3 + 1.5*IQR])
  5. Type coercion: auto-convert date-like strings to datetime
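
To make the ordering concrete, here is a minimal pandas sketch of steps 1 to 4. It is an approximation written for illustration, not the library's implementation; `clean_sketch` is a hypothetical name, and step 5 (type coercion) is left out because the date-detection heuristic is not described here.

```python
import pandas as pd


def clean_sketch(df: pd.DataFrame) -> pd.DataFrame:
    out = df.copy()

    # 1. Drop constant columns (at most one distinct non-null value).
    constant = [c for c in out.columns if out[c].nunique(dropna=True) <= 1]
    out = out.drop(columns=constant)

    # 2. Impute: median for numeric columns, mode for everything else.
    for col in out.columns:
        if not out[col].isna().any():
            continue
        if pd.api.types.is_numeric_dtype(out[col]):
            out[col] = out[col].fillna(out[col].median())
        else:
            modes = out[col].mode(dropna=True)
            if not modes.empty:
                out[col] = out[col].fillna(modes.iloc[0])

    # 3. Deduplicate: keep the first of each group of identical rows.
    out = out.drop_duplicates().reset_index(drop=True)

    # 4. Clip numeric values to the fences [Q1 - 1.5*IQR, Q3 + 1.5*IQR].
    for col in out.select_dtypes(include="number").columns:
        q1, q3 = out[col].quantile(0.25), out[col].quantile(0.75)
        iqr = q3 - q1
        out[col] = out[col].clip(lower=q1 - 1.5 * iqr, upper=q3 + 1.5 * iqr)

    return out
```

Order matters in this sketch: imputing before deduplicating can turn rows that differed only by a missing value into exact duplicates, which are then dropped.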

Anomaly detection

Method             Algorithm                      Notes
iqr (default)      1.5 x IQR per numeric column   Zero deps
zscore             |z| ...
isolation_forest   sklearn IsolationForest        Needs tableai[sklearn]
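
As a rough illustration of the default method, the sketch below flags any row in which at least one numeric value falls outside the 1.5 x IQR fences. The function name `iqr_anomalies` and the parameter `k` are assumptions for this example; the library's actual return shape may differ.

```python
import pandas as pd


def iqr_anomalies(df: pd.DataFrame, k: float = 1.5) -> pd.DataFrame:
    """Return the rows with at least one numeric value outside the IQR fences."""
    numeric = df.select_dtypes(include="number")
    q1 = numeric.quantile(0.25)
    q3 = numeric.quantile(0.75)
    iqr = q3 - q1
    # Compare every column against its own lower and upper fence.
    below = numeric.lt(q1 - k * iqr, axis="columns")
    above = numeric.gt(q3 + k * iqr, axis="columns")
    return df[(below | above).any(axis=1)]
```

A column with zero IQR (for example, a constant column) produces fences equal to its single value, so only values that differ from it are flagged.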

Data quality score

Weighted average (0.0 - 1.0) of four sub-scores:

  • Completeness: 1 - null_ratio
  • Uniqueness: ratio of distinct rows
  • Consistency: fraction of columns with a dominant dtype
  • Validity: fraction of values inside expected ranges / formats

Insights (rule-based NL)

Pattern-driven natural-language observations, for example:

  • "Column 'age' has 23.4% missing values"
  • "'price' and 'quantity' are strongly positively correlated (r=0.87)"
  • "Column 'id' appears to be a unique identifier"
  • "12 rows are exact duplicates"
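
Observations like these can be produced by simple threshold rules. The sketch below is one possible set of rules, assuming hypothetical thresholds (`null_threshold`, `corr_threshold`) and a hypothetical function name; the rules tableai actually applies are not specified here.

```python
import pandas as pd


def insights_sketch(df: pd.DataFrame, null_threshold: float = 0.05,
                    corr_threshold: float = 0.8) -> list:
    notes = []

    # Rule 1: columns with a notable share of missing values.
    for col in df.columns:
        ratio = df[col].isna().mean()
        if ratio > null_threshold:
            notes.append(f"Column '{col}' has {ratio:.1%} missing values")

    # Rule 2: strongly correlated numeric pairs (each pair reported once).
    corr = df.select_dtypes(include="number").corr()
    cols = list(corr.columns)
    for i, a in enumerate(cols):
        for b in cols[i + 1:]:
            r = corr.loc[a, b]
            if pd.notna(r) and abs(r) >= corr_threshold:
                sign = "positively" if r > 0 else "negatively"
                notes.append(f"'{a}' and '{b}' are strongly {sign} correlated (r={r:.2f})")

    # Rule 3: fully populated, all-distinct, non-float columns look like identifiers.
    for col in df.columns:
        s = df[col]
        if len(s) > 1 and s.notna().all() and s.is_unique \
                and not pd.api.types.is_float_dtype(s):
            notes.append(f"Column '{col}' appears to be a unique identifier")

    # Rule 4: exact duplicate rows.
    dupes = int(df.duplicated().sum())
    if dupes:
        notes.append(f"{dupes} rows are exact duplicates")

    return notes
```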

Natural-language querying (optional)

tableai.ask(df, "…") uses anyllm to generate pandas code for your question, executes it in a sandboxed namespace, and returns the result. Works with any local or cloud LLM that anyllm supports.
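
The "generate, then execute in a restricted namespace" flow can be sketched as follows. The generated code string is hard-coded here to stand in for LLM output, and `run_generated` and `SAFE_BUILTINS` are hypothetical names. Restricting builtins this way only limits accidental damage; it is not a real security boundary, and the library's own sandboxing may work differently.

```python
import pandas as pd

# A deliberately small set of builtins exposed to the generated code.
SAFE_BUILTINS = {"len": len, "min": min, "max": max, "sum": sum,
                 "round": round, "sorted": sorted}


def run_generated(code: str, df: pd.DataFrame):
    """Run generated pandas code against a copy of df and return its `result`."""
    namespace = {"__builtins__": SAFE_BUILTINS, "pd": pd, "df": df.copy()}
    exec(code, namespace)  # `import` fails because __import__ is not exposed
    return namespace.get("result")


# Hypothetical code an LLM might return for
# "what is the average revenue by region?"
generated = "result = df.groupby('region')['revenue'].mean()"

sales = pd.DataFrame({
    "region": ["north", "north", "south"],
    "revenue": [100.0, 200.0, 50.0],
})
answer = run_generated(generated, sales)
```

Passing a copy of the DataFrame keeps the caller's data unchanged even if the generated code mutates `df`.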

Models & Methods

tableai uses pure pandas/numpy for core operations — no ML dependencies required.

Profiling (tableai.profile) — Computes per-column:

  • Dtype detection (numeric, categorical, datetime, string)
  • Null counts and percentages
  • Unique value counts
  • Numeric statistics: mean, median, std, min, max, quartiles, skewness, kurtosis
  • Top categorical values
  • Pearson correlation matrix between numeric columns
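
For orientation, the per-column quantities listed above map onto standard pandas calls. The sketch below collects them into one table; `profile_sketch` and its output layout are assumptions for illustration and do not reflect the structure of the actual ProfileReport.

```python
import pandas as pd


def profile_sketch(df: pd.DataFrame) -> pd.DataFrame:
    rows = []
    for col in df.columns:
        s = df[col]
        row = {
            "column": col,
            "dtype": str(s.dtype),
            "nulls": int(s.isna().sum()),
            "null_pct": float(s.isna().mean() * 100),
            "unique": int(s.nunique(dropna=True)),
        }
        is_number = (pd.api.types.is_numeric_dtype(s)
                     and not pd.api.types.is_bool_dtype(s))
        if is_number:
            # Numeric statistics, all computed on non-null values.
            row.update(
                mean=float(s.mean()), median=float(s.median()), std=float(s.std()),
                min=float(s.min()), max=float(s.max()),
                q1=float(s.quantile(0.25)), q3=float(s.quantile(0.75)),
                skew=float(s.skew()), kurtosis=float(s.kurt()),
            )
        else:
            # Most frequent value for non-numeric columns.
            counts = s.value_counts()
            row["top"] = counts.index[0] if not counts.empty else None
        rows.append(row)
    return pd.DataFrame(rows).set_index("column")
```

The Pearson matrix itself is a one-liner on the numeric columns: `df.select_dtypes(include="number").corr()`.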

Cleaning (tableai.clean) — Configurable strategies:

  • Missing values: median (numeric), mode (categorical), drop, or zero
  • Duplicate removal
  • Outlier handling: IQR-based clipping or removal

Anomaly Detection (tableai.anomalies):

  • IQR method (default, no deps) — flags points outside Q1-1.5·IQR / Q3+1.5·IQR
  • Isolation Forest (optional via [sklearn], requires scikit-learn)

Quality Scoring (tableai.quality_score): weighted score from 0.0 to 1.0, with these weights:

  • Completeness 35% (1 - null_ratio)
  • Validity 25% (IQR-based outlier ratio)
  • Uniqueness 20% (duplicate detection)
  • Consistency 20% (mixed-type detection)
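
The weighting can be illustrated with a small sketch. The weights come from the list above; how each sub-score is computed is an assumption based on the parenthetical hints, and `quality_sketch` is a hypothetical name. The function returns a fraction between 0 and 1, which can be multiplied by 100 for a percentage-style score.

```python
import pandas as pd

WEIGHTS = {"completeness": 0.35, "validity": 0.25,
           "uniqueness": 0.20, "consistency": 0.20}


def quality_sketch(df: pd.DataFrame) -> float:
    # Completeness: share of non-null cells.
    completeness = 1.0 - df.isna().sum().sum() / df.size if df.size else 1.0

    # Validity: share of non-null numeric values inside the IQR fences.
    numeric = df.select_dtypes(include="number")
    n_numeric = int(numeric.notna().sum().sum()) if numeric.shape[1] else 0
    if n_numeric:
        q1, q3 = numeric.quantile(0.25), numeric.quantile(0.75)
        iqr = q3 - q1
        outliers = (numeric.lt(q1 - 1.5 * iqr, axis="columns")
                    | numeric.gt(q3 + 1.5 * iqr, axis="columns"))
        validity = 1.0 - outliers.sum().sum() / n_numeric
    else:
        validity = 1.0

    # Uniqueness: share of rows that do not repeat an earlier row.
    uniqueness = 1.0 - df.duplicated().mean() if len(df) else 1.0

    # Consistency: share of columns whose non-null values have one Python type.
    if len(df.columns):
        single = [df[c].dropna().map(type).nunique() <= 1 for c in df.columns]
        consistency = sum(single) / len(single)
    else:
        consistency = 1.0

    parts = {"completeness": completeness, "validity": validity,
             "uniqueness": uniqueness, "consistency": consistency}
    return float(sum(WEIGHTS[k] * parts[k] for k in WEIGHTS))
```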

Insights (tableai.insights) — Rule-based natural language insights about missing values, correlations, skewness, cardinality, duplicates, and class imbalance.

Natural Language Querying (tableai.ask, tableai.query): optional via the [llm] extra. Uses anyllm to generate pandas code from natural language, and falls back to keyword matching when no LLM is available.
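
A keyword-matching fallback could look roughly like the sketch below: find an aggregation word, find column names mentioned in the question, and treat the column after "by" as the grouping key. The vocabulary, the function name `keyword_query`, and the matching rules are all assumptions; the library's fallback logic is not documented here.

```python
import re
import pandas as pd

# Hypothetical vocabulary mapping question words to pandas aggregations.
AGGREGATIONS = {"average": "mean", "mean": "mean", "total": "sum", "sum": "sum",
                "maximum": "max", "max": "max", "minimum": "min", "min": "min"}


def keyword_query(df: pd.DataFrame, question: str):
    words = re.findall(r"[a-z0-9_]+", question.lower())
    agg = next((AGGREGATIONS[w] for w in words if w in AGGREGATIONS), None)
    columns = {str(c).lower(): c for c in df.columns}
    mentioned = [columns[w] for w in words if w in columns]
    if agg is None or not mentioned:
        return None  # nothing recognisable in the question

    # "... by <column>": group on the first column named after "by".
    group_col = None
    if "by" in words:
        after_by = words[words.index("by") + 1:]
        group_col = next((columns[w] for w in after_by if w in columns), None)

    targets = [c for c in mentioned if c != group_col]
    if not targets:
        return None
    if group_col is not None:
        return df.groupby(group_col)[targets[0]].agg(agg)
    return df[targets[0]].agg(agg)
```

For a frame with `region` and `revenue` columns, `keyword_query(df, "average revenue by region")` returns the per-region mean, while an unrecognised question returns `None`.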

API Reference

Function                                 Purpose
tableai.profile(df)                      Returns ProfileReport dataclass
tableai.clean(df, **opts)                Returns a cleaned DataFrame
tableai.anomalies(df, method="iqr")      Returns rows flagged as anomalous
tableai.quality_score(df)                Returns float 0.0 - 1.0
tableai.insights(df)                     Returns list[str] of NL insights
tableai.ask(df, question, model=None)    NL query via LLM
tableai.compare(df1, df2)                Diff two DataFrames (schema + data)
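
To give a sense of what a schema-plus-data diff involves, here is a small pandas sketch. The function name `compare_sketch` and the shape of the returned dictionary are assumptions for illustration; the result returned by tableai.compare is not specified above.

```python
import pandas as pd


def compare_sketch(df1: pd.DataFrame, df2: pd.DataFrame) -> dict:
    # Schema diff: columns added, removed, or changed in dtype.
    added = [c for c in df2.columns if c not in df1.columns]
    removed = [c for c in df1.columns if c not in df2.columns]
    shared = [c for c in df1.columns if c in df2.columns]
    dtype_changes = {c: (str(df1[c].dtype), str(df2[c].dtype))
                     for c in shared if df1[c].dtype != df2[c].dtype}

    # Data diff: count differing cells, only when rows can be paired by position.
    changed_cells = None
    if shared and len(df1) == len(df2):
        a = df1[shared].reset_index(drop=True)
        b = df2[shared].reset_index(drop=True)
        differs = (a != b) & ~(a.isna() & b.isna())  # NaN vs NaN counts as equal
        changed_cells = int(differs.sum().sum())

    return {
        "added_columns": added,
        "removed_columns": removed,
        "dtype_changes": dtype_changes,
        "row_delta": len(df2) - len(df1),
        "changed_cells": changed_cells,
    }
```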

CLI Usage

tableai profile data.csv --out report.json
tableai clean data.csv --out clean.csv
tableai anomalies data.csv --method isolation_forest
tableai ask data.csv "average sales by region"
tableai quality data.csv

Examples

Full profiling report to JSON

import pandas as pd
import tableai

df = pd.read_csv("customers.csv")
report = tableai.profile(df)
report.to_json("customers_report.json")
print(f"Quality: {report.quality_score:.2f}")

Custom cleaning pipeline

import tableai

clean = tableai.clean(
    df,
    impute_numeric="median",
    impute_categorical="mode",
    dedupe=True,
    clip_outliers=True,
    drop_constant=True,
)

Ask questions in English (with Ollama)

import tableai

# Uses anyllm; defaults to Ollama if running locally
answer = tableai.ask(df, "which customer spent the most last quarter?",
                     model="llama3.1:8b")
print(answer)

License

MIT (c) Viet-Anh Nguyen
