AutoEDA for Thai-language data: exploratory data analysis that speaks Thai

ThaiEDA

Exploratory data analysis that actually understands Thai.

Quick Start

pip install thaieda

import thaieda
import pandas as pd

df = pd.read_csv("data.csv")
result = thaieda.run(df)          # full EDA in one line
result.to_html("report.html")     # self-contained HTML report

That's it. pip install thaieda installs everything: Thai tokenizer, NER, ML, Excel, stats, encoding detection, and interactive charts. No extras needed.

Why ThaiEDA?

You already have ydata-profiling and sweetviz. Here's why you'd reach for ThaiEDA instead:

1. Thai text doesn't break. Generic tools render Thai as tofu boxes (□□□) in every chart. They miss Buddhist Era dates (พ.ศ. 2567), Thai numerals (๑๒๓), zero-width spaces, and mojibake from TIS-620 encoding. ThaiEDA detects and fixes all of these automatically — no font config, no manual cleanup.

2. Insights, not just stats. ydata-profiling gives you distributions and correlation matrices. ThaiEDA finds actionable cross-column patterns — "column A strongly predicts column B", "this group is 3× higher than average" — ranked by statistical interestingness with Benjamini-Hochberg correction. Plus anomaly detection, quality scoring, and data type classification.

3. One call, everything done. run(df) chains the full pipeline: type detection → smart cleaning → quality checks → anomaly detection → insight discovery → visualization → HTML report. With ydata you'd still need a separate anomaly detector, a Thai font config, a cleaner, and manual interpretation.

4. Privacy-first LLM. Ask an LLM about your data without sending raw rows to a cloud API. Five privacy modes as of v1.9; the default (insight_only) sends zero raw data. PDPA-ready.

5. Smaller reports on big data. ydata-profiling produces a 71 MB HTML on a 171-column dataset. ThaiEDA produces 0.48 MB — 148× smaller — because it caps charts, collapses tables, and samples intelligently on wide/tall data.

How It Works

DataFrame → thaieda.run(df) → EDAResult

  Step 0  pre-analyze    data type + language detection
  Step 1  detect         column types + Thai months + addresses
  Step 2  clean          smart cleaning (auto-decide what to fix)
  Step 3  quality        language-aware checks + 0–100 score
  Step 4  anomaly        IQR + ML + text anomaly detection
  Step 5  insights       6 cross-column patterns (BH-corrected)
  Step 6  viz            static (matplotlib) + interactive (Plotly)
  Step 7  report         executive HTML narrative

  + optional: LLM analysis (5 privacy modes)
  + optional: run_folder("data/") → multi-file master HTML
  + optional: compare(df1, df2) → drift detection

result = thaieda.run(df)

result.to_html()         # → report.html (self-contained)
result.to_dict()          # → Python dict
result.to_json()          # → JSON string
result.insights           # → insight cards
result.cleaned_df         # → cleaned DataFrame
result.quality_issues     # → list of issues
result.quality_score      # → 0–100 score with grade
result.anomalies          # → anomaly findings
result.llm_response       # → LLM analysis (if enabled)
result                    # → Jupyter rich display
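
Step 5 ranks insights after Benjamini-Hochberg (BH) correction. The sketch below illustrates what that correction does in general; it is not ThaiEDA's internal code, and `benjamini_hochberg` is a hypothetical helper name. The procedure sorts the p-values, compares the k-th smallest against k/m × alpha, and rejects everything up to the largest k that passes.

```python
def benjamini_hochberg(p_values, alpha=0.05):
    """Return one boolean per p-value: True where the finding survives BH at level alpha."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])

    # Find the largest rank k whose p-value is at or below (k / m) * alpha.
    cutoff_rank = 0
    for rank, i in enumerate(order, start=1):
        if p_values[i] <= rank / m * alpha:
            cutoff_rank = rank

    # Step-up rule: every hypothesis ranked at or below the cutoff is rejected.
    rejected = [False] * m
    for rank, i in enumerate(order, start=1):
        if rank <= cutoff_rank:
            rejected[i] = True
    return rejected


p = [0.001, 0.008, 0.039, 0.041, 0.20, 0.74]
print(benjamini_hochberg(p))  # [True, True, False, False, False, False]
```

With six candidate findings, only the two smallest p-values survive, even though four of them are below 0.05 on their own.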

Benchmarks — ThaiEDA vs ydata-profiling vs sweetviz

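We ran ydata-profiling, sweetviz, and ThaiEDA on 6 representative datasets (small/large/wide, Thai + non-Thai). The seventh row, synthetic, is the 2,000-row dataset from the quality benchmark below and is the only run that also includes Evidently: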

Capability comparison

Feature	ydata-profiling	sweetviz	ThaiEDA
Standalone HTML report	✅	✅	✅
Cross-column insights	❌	❌	✅ 6 patterns + BH correction
Anomaly detection	❌	❌	✅ IQR + ML + text
Quality score (0–100)	❌	❌	✅
Language detection	❌	❌	✅ Thai/English/mixed
Thai font in charts	❌ tofu	❌ tofu	✅ Sarabun auto
Buddhist Era (พ.ศ.)	❌	❌	✅ → CE
Thai numerals (๑๒๓)	❌	❌	✅ → 123
Zero-width space fix	❌	❌	✅
Mojibake repair	❌	❌	✅
Smart cleaning	❌	❌	✅ auto-decide
Thai NER	❌	❌	✅
Privacy LLM modes	❌	❌	✅ 5 modes (PDPA)
Folder mode	❌	❌	✅ `run_folder()`

Speed & report size

Dataset	Rows	Cols	ydata time	ydata size	sweetviz time	sweetviz size	Evidently time	Evidently size	ThaiEDA time	ThaiEDA size
titanic	891	12	5.3s	1.95 MB	3.3s	0.92 MB	—	—	8.2s	0.82 MB
superstore	10,800	21	9.3s	5.16 MB	5.4s	1.49 MB	—	—	26.0s	1.50 MB
adult	32,561	15	5.4s	1.65 MB	8.0s	1.26 MB	—	—	17.2s	1.05 MB
dirty-thai-retail	500	8	3.1s	0.90 MB	2.1s	0.68 MB	—	—	2.1s	0.53 MB
wisesight	26,737	2	2.6s	0.68 MB	0.8s	0.50 MB	—	—	18.8s	0.42 MB
aps-failure	16,000	171	99.8s	71.2 MB	15.8s	8.2 MB	—	—	93.0s	0.48 MB
synthetic	2,000	12	45s	7.2 MB	3s	0.9 MB	1s	3.7 MB	16s	1.5 MB

Quality benchmark — 4 tools on synthetic dataset (10 known issues)

We injected 10 known defects into a 2,000-row synthetic dataset and measured detection. All tools processed identically: HTML output stripped to plain text, same keyword detection applied uniformly.

Table A — General EDA quality (6 issues all tools can detect)

Metric	ydata (default)	sweetviz	Evidently	ThaiEDA
GTR — Ground-Truth Recall	100%	83%	100%	100%
ITB — Issue Type Breadth (11)	73%	64%	91%	91%
RC — Report Completeness (10)	70%	50%	70%	100%
Time	45s	3s	1s	16s
HTML size	7.2 MB	0.9 MB	3.7 MB	1.5 MB

On general EDA, ThaiEDA and Evidently both achieve 100% recall and 91% breadth. ThaiEDA leads on report completeness (100% vs 70% for ydata and Evidently, 50% for sweetviz) while producing a report about 2.5× smaller than Evidently's (1.5 MB vs 3.7 MB) and about 4.8× smaller than ydata's (7.2 MB).

Table B — Thai-specific detection (4 Thai issues — competitors don't claim Thai support)

Thai issue	ydata	sweetviz	Evidently	ThaiEDA
Buddhist Era dates (พ.ศ.)	✅*	0%	✅*	✅
Thai numerals (๑๒๓)	0%	0%	✅*	✅
Zero-width spaces	0%	0%	0%	❌
Mojibake (TIS-620)	0%	0%	✅*	✅
Thai GTR	25%	0%	100%*	75%

* Evidently scores high on Thai keywords via generic matches ("encoding" appears in its CSS/JS framework), not Thai-specific recognition. ydata detected BE dates via generic encoding keywords. These are keyword-matching artifacts, not genuine Thai detection — competitors do not claim Thai language support.

ThaiEDA's 75% (3/4) reflects purpose-built Thai detection. The one miss (zero-width space in category_text) is a known gap. Competitors score 0% on genuine Thai detection by design.

What ThaiEDA Catches

Thai-specific problems

Problem	Example	What ThaiEDA does
Buddhist Era dates	`15/03/2567`	Detects พ.ศ. → converts to CE
Thai numerals	`๑๒๓` in numeric column	Converts to `123`
Zero-width spaces	`สม\u200bชาย`	Strips invisible chars + reports
Thai vowel/tone marks	`อร่อยค่ะ`	Counts U+0E30–U+0E4D for detection
Mixed Thai/English cells	`อร่อยมาก 5/5 stars`	Detects as mixed, not English/numeric
Thai month names	`มกราคม`	Parses to ISO date
Mojibake encoding	`Ã ¬Â¸Â¡Â¹`	Auto-detects TIS-620 → UTF-8
National ID cards	`1-1234-56789-01-2`	Checksum validation
Thai addresses	`123 ม.4 ต.บางบัว อ.บางบัว จ.กรุงเทพฯ`	Parses to structured fields
Phone numbers	`081-234-5678`	Detects + normalizes
Thai holidays	Spike on Dec 5	Attributes to Father's Day
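
To make the first three rows of the table concrete, here is a minimal standard-library sketch of the underlying conversions. It is not ThaiEDA's implementation: the helper names are hypothetical, and it assumes the date is already known to be a dd/mm/yyyy Buddhist Era date (how ThaiEDA decides that is not shown here).

```python
import re

THAI_DIGITS = str.maketrans("๐๑๒๓๔๕๖๗๘๙", "0123456789")
ZERO_WIDTH = re.compile("[\u200b\u200c\u200d\ufeff]")  # ZWSP, ZWNJ, ZWJ, BOM
BE_OFFSET = 543  # Buddhist Era year = Common Era year + 543


def thai_digits_to_arabic(text):
    """Replace Thai numerals with ASCII digits."""
    return text.translate(THAI_DIGITS)


def strip_zero_width(text):
    """Remove invisible zero-width characters."""
    return ZERO_WIDTH.sub("", text)


def be_date_to_ce(text):
    """Convert a dd/mm/yyyy Buddhist Era date to an ISO 8601 CE date string."""
    day, month, year = (int(part) for part in thai_digits_to_arabic(text).split("/"))
    return f"{year - BE_OFFSET:04d}-{month:02d}-{day:02d}"


print(thai_digits_to_arabic("๑๒๓"))            # 123
print(be_date_to_ce("15/03/2567"))              # 2024-03-15
print(len(strip_zero_width("สม\u200bชาย")))     # 5
```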

Data quality & intelligence

Problem	What ThaiEDA does
Placeholder values (`-`, `N/A`, `ไม่มี`)	Flags as missing
Constant columns	Flags as useless
High-NA columns (>80%)	Flags `mostly_missing`, preserves NaN
Missing % per column	Severity threshold (warning >5%, info 1–5%)
Smart data type	Pre-classifies transaction/registry/survey/timeseries/mixed
Language-aware checks	English-only skips Thai พ.ศ./เลขไทย warnings
ID/FK semantics	`order_id`, `store_id` excluded from category anomaly
Numeric string preservation	`1.00005` left alone — not "spam"
Keyboard layout guard	`Floyd` in English column not converted to Thai
Index artifacts	`Unnamed: 0` ignored + flagged
CSV delimiter mismatch	`;`-delimited file warns to re-read

Opt-in operations (not in default pipeline)

Operation	Function	Effect
Abbreviation expansion	`expand_abbreviations()`	กทม. → กรุงเทพมหานคร
Spell correction	`spell_correct()`	ขอบคุน → ขอบคุณ
NFKC normalization	`normalize_nfkc()`	Ａ→A, ９→9
Fast tokenizer	`engine="auto-fast"`	nlpo3 (Rust, 3–4× faster)
Quality tokenizer	`engine="auto-quality"`	AttaCut (neural, better for OOV)
Keyboard layout anomaly	report-only	Detects suspicious Latin/Thai mixing
Grapheme validation	report-only	Detects abnormal stacked tone marks

v1.8 — Statistical Accuracy Improvements

Five new techniques that improve detection accuracy across different data patterns:

1. Spearman rank correlation (non-linear relationships)

Previously only Pearson (linear) correlation was computed. Now also computes Spearman ρ to catch monotonic non-linear relationships that Pearson misses (e.g., y = x⁵). The method with the highest |coefficient| is reported automatically.

from thaieda.insight_engine import discover_insights
# Now detects both linear AND non-linear strong correlations

2. Cramér's V effect size (categorical association)

Chi-square test only tells you if two categorical variables are associated (p-value). Cramér's V tells you how strongly — a 0–1 effect size with bias correction:

V range	Strength
< 0.3	เบาบาง (weak)
0.3–0.5	ปานกลาง (moderate)
> 0.5	ชัดเจน (strong)

from thaieda.analysis import analyze_target
results = analyze_target(df, "category_column")
# Each chi_square result now includes effect_size (Cramér's V)
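
For readers who want to see where the effect size comes from, the following sketch computes a bias-corrected Cramér's V from two categorical sequences. It assumes the commonly used Bergsma-style correction; the source does not say which correction `analyze_target` applies, and `cramers_v` is a hypothetical helper name.

```python
from collections import Counter
from math import sqrt


def cramers_v(xs, ys):
    """Bias-corrected Cramér's V for two equal-length categorical sequences."""
    n = len(xs)
    row_totals = Counter(xs)
    col_totals = Counter(ys)
    observed = Counter(zip(xs, ys))
    r, k = len(row_totals), len(col_totals)
    if n < 2 or r < 2 or k < 2:
        return 0.0

    # Pearson chi-square statistic over the full contingency table.
    chi2 = 0.0
    for a, row_n in row_totals.items():
        for b, col_n in col_totals.items():
            expected = row_n * col_n / n
            chi2 += (observed[(a, b)] - expected) ** 2 / expected

    # Bias correction: shrink phi^2 and the table dimensions.
    phi2_corr = max(0.0, chi2 / n - (r - 1) * (k - 1) / (n - 1))
    r_corr = r - (r - 1) ** 2 / (n - 1)
    k_corr = k - (k - 1) ** 2 / (n - 1)
    denom = min(r_corr - 1, k_corr - 1)
    return sqrt(phi2_corr / denom) if denom > 0 else 0.0


region = ["north"] * 50 + ["south"] * 50
plan = ["basic"] * 50 + ["premium"] * 50
print(round(cramers_v(region, plan), 3))  # 1.0 (perfect association)
```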

3. Generalized ESD test (multiple outlier detection)

The existing z-score/IQR/MAD methods detect outliers one at a time, suffering from masking (outliers hide each other). The Generalized Extreme Studentized Deviate (Rosner 1983) test detects multiple outliers simultaneously with controlled Type I error:

Automatically selected when data is approximately normal (skew < 0.5, n ≥ 25)
Falls back to z-score/IQR/MAD for skewed or small datasets
Detects up to 10 outliers in one pass
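
The procedure can be sketched with NumPy and SciPy as follows. This is a textbook version of the generalized ESD test, not ThaiEDA's code; the function name and defaults (up to 10 outliers, alpha = 0.05) are assumptions chosen to mirror the description above.

```python
import numpy as np
from scipy import stats


def generalized_esd(values, max_outliers=10, alpha=0.05):
    """Return sorted indices flagged as outliers by the generalized ESD test."""
    x = np.asarray(values, dtype=float)
    n = len(x)
    keep = np.arange(n)   # indices still in the sample
    removed = []          # indices removed so far, most extreme first
    n_outliers = 0
    for i in range(1, min(max_outliers, n - 3) + 1):
        current = x[keep]
        std = current.std(ddof=1)
        if std == 0:
            break
        deviations = np.abs(current - current.mean())
        j = int(np.argmax(deviations))
        r_i = deviations[j] / std            # test statistic R_i
        removed.append(int(keep[j]))
        keep = np.delete(keep, j)
        # Critical value lambda_i from the t distribution.
        p = 1 - alpha / (2 * (n - i + 1))
        t = stats.t.ppf(p, n - i - 1)
        lam = (n - i) * t / np.sqrt((n - i - 1 + t ** 2) * (n - i + 1))
        if r_i > lam:
            n_outliers = i                   # keep the largest i that exceeds lambda_i
    return sorted(removed[:n_outliers])


rng = np.random.default_rng(0)
data = np.concatenate([rng.normal(0, 1, 100), [50.0, 60.0]])
found = generalized_esd(data)
print(found)  # includes the two planted outliers at indices 100 and 101
```

Taking the largest i whose statistic exceeds its critical value, rather than stopping at the first failure, is what protects the test against masking.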

4. Missing data mechanism detection (MCAR / MAR / MNAR)

Beyond counting missing values, ThaiEDA now classifies the missing data mechanism:

Mechanism	Meaning	Implication
MCAR	Missing Completely at Random	Safe to drop or impute simply
MAR_likely	Missing at Random	Imputation should use observed predictors
MNAR_likely	Missing Not at Random	Missing depends on unobserved values — needs domain model

from thaieda.quality import detect_missing_mechanism
result = detect_missing_mechanism(df)
print(result.mechanism)  # "MCAR", "MAR_likely", or "MNAR_likely"

5. Distribution fitting + Kolmogorov-Smirnov test

Automatically fits 4 distributions (normal, lognormal, exponential, uniform) to each numeric column and reports the best fit via KS goodness-of-fit test:

from thaieda.quality import fit_distributions
result = fit_distributions(df["column"], "column")
# result.best_fit → "normal", result.p_value, result.parameters
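
A rough equivalent using SciPy directly is shown below. It is a sketch under stated assumptions rather than the body of `fit_distributions`: the `best_fit` helper is hypothetical, the lognormal is fitted with its location fixed at 0 and only for strictly positive data, and the "best" distribution is simply the one with the highest KS p-value.

```python
import numpy as np
from scipy import stats

CANDIDATES = {
    "normal": (stats.norm, {}),
    "lognormal": (stats.lognorm, {"floc": 0}),  # two-parameter form, positive data only
    "exponential": (stats.expon, {}),
    "uniform": (stats.uniform, {}),
}


def best_fit(values):
    """Fit each candidate by maximum likelihood and rank by KS p-value."""
    data = np.asarray(values, dtype=float)
    data = data[~np.isnan(data)]
    results = {}
    for name, (dist, fit_kwargs) in CANDIDATES.items():
        if name == "lognormal" and data.min() <= 0:
            continue
        params = dist.fit(data, **fit_kwargs)
        ks = stats.kstest(data, dist.cdf, args=params)
        results[name] = (ks.pvalue, params)
    name = max(results, key=lambda key: results[key][0])
    return name, results[name][0], results[name][1]


rng = np.random.default_rng(42)
name, p_value, params = best_fit(rng.uniform(10, 20, 1000))
print(name, params)
```

One caveat: KS p-values are optimistic when the parameters were estimated from the same data, so they are better read as a ranking between candidates than as exact significance levels.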

v1.9 — Privacy-First LLM Pipeline

New synthetic mode for enterprise-grade privacy. Generate mock data with real statistical properties — zero real values leave the machine.

5 Privacy Modes (ordered by risk)

Mode	What's sent to LLM	Risk	Use case
`insight_only`	Statistics + insights only	Low	Safe default
`synthetic` (v1.9)	Mock data from fitted distributions	Low	LLM sees realistic data shape
`dp_noise`	Stats + Laplace noise (ε parameter)	Low	Statistical queries
`anonymized`	PII replaced with tokens	Medium	Need data structure, no PII
`full`	Raw data	High	User accepts risk
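
To show what "Stats + Laplace noise (ε parameter)" means in practice, here is a minimal sketch of the Laplace mechanism applied to a bounded mean. It is a generic differential-privacy example, not ThaiEDA's `dp_noise` code; the function names and the clipping bounds are assumptions.

```python
import random


def laplace_noise(scale, rng):
    """Draw one Laplace(0, scale) sample as the difference of two exponentials."""
    return scale * (rng.expovariate(1.0) - rng.expovariate(1.0))


def dp_mean(values, lower, upper, epsilon, rng=None):
    """Mean of values clipped to [lower, upper], released with epsilon-DP Laplace noise."""
    rng = rng or random.Random()
    clipped = [min(max(v, lower), upper) for v in values]
    # Changing one row moves the clipped mean by at most (upper - lower) / n.
    sensitivity = (upper - lower) / len(clipped)
    return sum(clipped) / len(clipped) + laplace_noise(sensitivity / epsilon, rng)


ages = [23, 35, 41, 29, 52, 38, 47, 31]
print(dp_mean(ages, lower=0, upper=100, epsilon=1.0, rng=random.Random(7)))
```

A smaller ε means a larger noise scale and stronger privacy; a larger ε gives answers closer to the true statistic.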

Synthetic Data Pipeline

from thaieda.llm import generate_synthetic_data, privacy_audit_report, analyze_with_llm

# 1. Generate synthetic data (no real values)
synthetic_df = generate_synthetic_data(df)
# Numeric → sampled from fitted distribution (normal/lognormal/exponential/uniform)
# Categorical → sampled from proportions (PII replaced with placeholders)
# Datetime → sampled from date range
# Text → length-based placeholders (no real text)

# 2. Audit before sending
audit = privacy_audit_report(df, privacy_mode="synthetic")
# Detects: phone, email, Thai national ID, IP, Thai address
# Returns risk level + recommendations

# 3. Analyze with LLM using synthetic mode
response = analyze_with_llm(df, privacy="synthetic", provider="ollama")

Privacy Audit Report

Automatic PII detection before any data leaves the machine:

PII type	Detection method	Risk
Phone numbers	Regex (Thai + international)	High
Email addresses	Regex	High
Thai national ID	Regex (x-xxxx-xxxxx-xx-x)	Critical
IP addresses	Regex	Medium
Thai addresses	Keyword matching (ตำบล/อำเภอ/จังหวัด)	Medium
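
The regex-based rows of the table can be approximated as follows. The patterns are simplified assumptions written for illustration (for example, the phone pattern only covers Thai numbers starting with 0), and `scan_pii` is a hypothetical helper, not the implementation behind `privacy_audit_report`.

```python
import re

PII_PATTERNS = {
    "thai_national_id": re.compile(r"\b\d-\d{4}-\d{5}-\d{2}-\d\b"),
    "email": re.compile(r"\b[\w.+-]+@[\w-]+(?:\.[\w-]+)+\b"),
    "ipv4": re.compile(r"\b(?:\d{1,3}\.){3}\d{1,3}\b"),
    "thai_phone": re.compile(r"\b0\d{1,2}-?\d{3}-?\d{4}\b"),
}


def scan_pii(values):
    """Count how many cells contain a match for each PII pattern."""
    counts = {name: 0 for name in PII_PATTERNS}
    for value in values:
        text = str(value)
        for name, pattern in PII_PATTERNS.items():
            if pattern.search(text):
                counts[name] += 1
    return counts


cells = ["1-1234-56789-01-2", "somchai@example.com", "192.168.1.10", "081-234-5678", "hello"]
print(scan_pii(cells))
# {'thai_national_id': 1, 'email': 1, 'ipv4': 1, 'thai_phone': 1}
```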

Scale & Performance

Tested across 19 public datasets — 500 to 541K rows, 2 to 171 columns:

Insight capping — surfaces the 30 most important findings. Executive summary shows the true count ("679 found, showing top 30").
HTML bloat control — 40 charts max, 1.6 MB max. Quality/anomaly tables collapse after 50 rows. Wide tables switch to summary view past 60 columns.
Wide-table fast path — insight engine samples when columns exceed 100. Heatmaps and scatter matrices skip on very wide data.
Tall-table fast path — anomaly/quality/outlier checks sample 50K rows when data exceeds 100K. Timeseries decomposition skips past 200K rows.
High-NA handling — columns >80% missing flagged as mostly_missing. >40% gets a warning. <40% unchanged.
Smarter type detection — Thai low-cardinality text → categorical, not free text. review/feedback stay text.
Cleaning safeguards — numeric strings untouched. Keyboard conversion only when Thai chars present.

Examples

One-Line EDA

import thaieda
import pandas as pd

df = pd.read_csv("data.csv")
result = thaieda.run(df)

result.to_html("report.html")
print(result.quality_issues)
print(result.insights)

# In Jupyter: just display the result
result  # renders HTML inline

Folder Mode — Analyze Every File at Once

import thaieda

results = thaieda.run_folder("data/")

print(results.summary())
# ThaiEDA FolderResult — data/
#   Files: 5 (✅ 5 / ❌ 0)
#   ✅ customers.csv — 10,000 rows × 8 cols, 15 insights
#   ...

results.to_html("reports/")
results.to_master_html("master-report.html")  # single HTML with sidebar

Supported formats: CSV, Excel (.xlsx/.xls), JSON, JSONL, TSV. Pass recursive=True to include subfolders. Errors are isolated per file, so one broken file doesn't crash the rest.

LLM Analysis (Privacy-Safe)

result = thaieda.run(df, llm=True, privacy="insight_only", provider="ollama")
print(result.llm_response)
# Default: zero raw data leaves your machine

Mode	What Leaves	When to Use
`insight_only` (default)	Stats + insights only	Government, medical, PDPA
`synthetic`	Mock data from fitted distributions	LLM needs a realistic data shape
`anonymized`	PII → tokens	Need structure without raw data
`dp_noise`	Stats + Laplace noise	Small datasets where stats leak
`full`	Everything	Public data, demos

Compare Two Datasets

from thaieda.compare import compare_datasets

diff = compare_datasets(df_train, df_test, labels=("train", "test"))
print(diff["schema_diff"])       # columns added/removed
print(diff["drift"]["numeric"])  # KS statistic per column
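
The KS statistic reported for numeric drift is the largest gap between the two empirical distribution functions. A dependency-free sketch of that quantity is shown below; `ks_statistic` is a hypothetical helper for illustration, not the function `compare_datasets` calls.

```python
from bisect import bisect_right


def ks_statistic(sample_a, sample_b):
    """Two-sample KS statistic: the largest gap between the two empirical CDFs."""
    a, b = sorted(sample_a), sorted(sample_b)
    gap = 0.0
    for x in a + b:  # the maximum gap is always reached at an observed value
        cdf_a = bisect_right(a, x) / len(a)
        cdf_b = bisect_right(b, x) / len(b)
        gap = max(gap, abs(cdf_a - cdf_b))
    return gap


print(ks_statistic([1, 2, 3, 4], [1, 2, 3, 4]))  # 0.0 (no drift)
print(ks_statistic([1, 2, 3, 4], [3, 4, 5, 6]))  # 0.5 (shifted)
print(ks_statistic([1, 2, 3], [10, 11, 12]))     # 1.0 (no overlap)
```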

Thai ID Card Validation

from thaieda.quality import validate_thai_id, validate_thai_id_column

validate_thai_id("1-1234-56789-01-2")           # → True/False
result = validate_thai_id_column(df["id_card"]) # entire column
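
For reference, the standard Thai national ID checksum weights the first 12 digits by 13 down to 2 and compares the result with the 13th digit. The sketch below implements that public rule; `thai_id_checksum_ok` is a hypothetical name and may not match how `validate_thai_id` handles formatting or edge cases.

```python
def thai_id_checksum_ok(raw):
    """Check the 13th digit of a Thai national ID against the mod-11 checksum."""
    digits = [int(ch) for ch in raw if ch.isdigit()]  # ignore dashes and spaces
    if len(digits) != 13:
        return False
    total = sum(d * w for d, w in zip(digits[:12], range(13, 1, -1)))
    return (11 - total % 11) % 10 == digits[12]


print(thai_id_checksum_ok("1-1234-56789-01-2"))  # False (check digit should be 4)
print(thai_id_checksum_ok("1-1234-56789-01-4"))  # True
```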

Thai Address Parsing

from thaieda.detect import parse_thai_address

addr = parse_thai_address("123 หมู่ 4 ต.บางบัว อ.บางบัว จ.กรุงเทพฯ 10230")
# {'house_number': '123', 'moo': '4', 'subdistrict': 'บางบัว',
#  'district': 'บางบัว', 'province': 'กรุงเทพฯ', 'postal_code': '10230'}

Language Detection

import pandas as pd
from thaieda.detect import _detect_language  # leading underscore: private helper

df = pd.DataFrame({
    "product": ["กาแฟ", "ชาไทย", "ขนม"],
    "review": ["อร่อยมาก 5/5 stars", "ดีครับ", "ไม่ดี"],
    "sku": ["SKU001", "SKU002", "SKU003"],
})

info = _detect_language(df)
print(info["language"], info["confidence"])
# thai/mixed/english/numeric + per-column language map

Features: Unicode Thai block analysis (U+0E00–U+0E7F), zero-width-space aware, mixed-cell detection, common Thai word hints, per-column column_details + dataset-level confidence (0.0–1.0), sample-based scan for large DataFrames.
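
The core of the Unicode block analysis can be sketched as a per-cell ratio of Thai characters. This is a simplified illustration, not `_detect_language` itself: the `classify_cell` helper and its 0.2/0.8 thresholds are assumptions, and it ignores the word hints and sampling mentioned above.

```python
def is_thai(ch):
    """True for any character in the Thai Unicode block (U+0E00-U+0E7F)."""
    return "\u0e00" <= ch <= "\u0e7f"


def classify_cell(text, low=0.2, high=0.8):
    """Label a cell as thai, english, mixed, numeric, or unknown."""
    scripted = [ch for ch in text if ch.isalpha() or is_thai(ch)]
    if not scripted:
        return "numeric" if any(ch.isdigit() for ch in text) else "unknown"
    ratio = sum(is_thai(ch) for ch in scripted) / len(scripted)
    if ratio >= high:
        return "thai"
    if ratio <= low:
        return "english"
    return "mixed"


for cell in ["กาแฟ", "อร่อยมาก 5/5 stars", "SKU001", "12345"]:
    print(cell, "->", classify_cell(cell))
```

Counting combining vowel and tone marks through `is_thai` matters here, because `str.isalpha()` alone would skip them and understate the Thai share of a cell.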

Smart Pre-Analysis

from thaieda.report import _detect_data_type

pre = _detect_data_type(df)
print(pre["label"], pre["language"]["language"])
# Detects: transaction, registry, survey, timeseries, or mixed

Data Quality Score

from thaieda.quality import compute_quality_score

# quality_issues: the issue list from an earlier run, e.g. result.quality_issues
score = compute_quality_score(quality_issues, n_columns=10, n_rows=1000)
print(f"Score: {score.score}/100 ({score.grade})")  # Score: 85/100 (B)

Smart Cleaning

from thaieda.clean._smart import plan_cleaning

plan = plan_cleaning(df)
print(plan.actions)   # ['zwspace', 'numerals', 'duplicates']
print(plan.skipped)   # ['encoding', 'whitespace']

Visualization

Both static and interactive charts, all with Thai font support:

Static (matplotlib): correlation heatmap, distribution, box/violin, missing matrix, scatter matrix, wordcloud, timeseries, pair plot, KDE, QQ plot, sunburst
Interactive (Plotly): hover tooltips, zoom, pan — Thai font (Sarabun) via Google Fonts
Color palette: Okabe-Ito colorblind-safe (7 colors)

from thaieda.viz._interactive import create_correlation_heatmap_interactive

html_div = create_correlation_heatmap_interactive(df)  # → HTML <div>

Installation

pip install thaieda

No extras needed. This installs everything: Thai tokenizer, NER, ML, interactive charts, Excel, stats, encoding detection.

LLM providers (optional, lazy-imported):

pip install openai       # OpenAI GPT
pip install anthropic    # Anthropic Claude
pip install ollama       # Ollama local LLM (or use the HTTP fallback)

Requirements: Python 3.10+

Modules

Module	What It Does
`run()` / `EDA()`	One-liner API — full pipeline in one call
`run_folder()`	Analyze every CSV/Excel/JSON in a folder + master HTML
`compare()`	Side-by-side dataset comparison with drift detection
`io/`	Auto-read CSV/JSON/JSONL/Excel + encoding detection
`detect/`	Column type detection + Thai months + address parsing + language detection
`clean/`	Smart cleaning: auto-decide what to fix (encoding, numerals, BE, zwspace)
`quality/`	Language-aware quality checks + score 0–100 + Thai ID card validation
`anomaly/`	Statistical + ML + text anomaly detection
`ner/`	Thai NER: person/place/organization
`insight_engine/`	6 cross-column insight patterns (BH-corrected)
`viz/`	Static + interactive charts with colorblind-safe palette
`report/`	Executive HTML report + smart pre-analysis
`llm/`	Privacy-preserving LLM analysis (5 modes, 3 providers)
`timeseries/`	Trend/seasonality/STL/ACF + Thai holiday awareness
`schema/`	Multi-file PK/FK discovery + relationship matching

Testing

pytest tests/ -v                    # all tests (691 passed)
ruff check src/ tests/              # lint
ruff format src/ tests/             # format

License

Apache-2.0 © Peet Wannasarnmetha

