AutoEDA for Thai-language data: exploratory data analysis that speaks Thai

ThaiEDA

Exploratory data analysis that actually understands Thai.

What is ThaiEDA?

ThaiEDA is a Python library that automates exploratory data analysis for Thai and mixed Thai/English datasets. Give it a DataFrame and it returns a full report in one line: smart pre-analysis, language detection, column types, quality issues, anomalies, cross-column insights, charts, and an executive-style HTML report.

It handles the things generic EDA tools miss: Buddhist Era dates, Thai numerals, zero-width spaces, Thai vowel/tone marks, mixed Thai/English cells, mojibake encoding, Thai month names, national ID card validation, Thai address parsing, and PII like phone numbers.

Quick Start

pip install thaieda

import thaieda
import pandas as pd

df = pd.read_csv("data.csv")
result = thaieda.run(df)          # full EDA in one line
result.to_html("report.html")     # self-contained HTML report

pip install thaieda installs everything: Thai tokenizer, NER, ML, Excel, stats, encoding detection, and interactive charts. No extras needed.

Why ThaiEDA?

Generic tools don't understand Thai data. ydata-profiling (formerly Pandas Profiling) and Sweetviz are great until you feed them Thai data. They miss Buddhist Era years (พ.ศ.), Thai numerals (๑๒๓), zero-width spaces that break tokenization, and mojibake from TIS-620 text decoded with the wrong encoding. ThaiEDA catches all of these.
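To make three of those pitfalls concrete, here is a minimal standard-library sketch of what handling them involves. It is an illustration only, not ThaiEDA's implementation; the helper names `normalize_cell` and `be_to_ce` are hypothetical.

```python
import re

# Map Thai digits ๐-๙ (U+0E50 to U+0E59) to ASCII 0-9.
THAI_DIGITS = str.maketrans("๐๑๒๓๔๕๖๗๘๙", "0123456789")
# Zero-width space, BOM / zero-width no-break space, word joiner.
INVISIBLE = re.compile("[\u200b\ufeff\u2060]")
BE_OFFSET = 543  # Buddhist Era year = Common Era year + 543


def normalize_cell(text: str) -> str:
    """Hypothetical helper: strip invisible characters, then convert Thai digits."""
    return INVISIBLE.sub("", text).translate(THAI_DIGITS)


def be_to_ce(year: int) -> int:
    """Hypothetical helper: convert a Buddhist Era year to a Common Era year."""
    return year - BE_OFFSET


print(normalize_cell("๑๒๓"))                     # 123
print(normalize_cell("สม\u200bชาย") == "สมชาย")  # True
print(be_to_ce(2567))                             # 2024
```

A real pipeline also has to decide whether a four-digit number is a year at all, which is why the BE check described later is restricted to date-like columns.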

Privacy-first LLM analysis. Want to ask an LLM about your data but can't send raw rows to a cloud API? ThaiEDA has 4 privacy modes — the default sends zero raw data off your machine. Perfect for government, finance, and medical data under PDPA.

Insights, not just summaries. A cross-column insight engine finds non-obvious patterns — "column A strongly predicts column B", "this group is 3× higher than average" — ranked by statistical interestingness with Benjamini-Hochberg correction.
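For readers unfamiliar with Benjamini-Hochberg correction, the sketch below shows the standard step-up procedure in plain Python. It assumes each candidate insight comes with a p-value; how ThaiEDA computes and ranks those values is not described here, and the function name is hypothetical.

```python
def benjamini_hochberg(p_values, alpha=0.05):
    """Return one boolean per p-value: True if it survives BH correction."""
    m = len(p_values)
    # Indices of the p-values, sorted from smallest to largest.
    order = sorted(range(m), key=lambda i: p_values[i])
    # Step-up rule: find the largest rank k with p_(k) <= (k / m) * alpha.
    cutoff = 0
    for rank, idx in enumerate(order, start=1):
        if p_values[idx] <= rank / m * alpha:
            cutoff = rank
    keep = [False] * m
    for idx in order[:cutoff]:
        keep[idx] = True
    return keep


p = [0.001, 0.008, 0.039, 0.041, 0.20]
print(benjamini_hochberg(p))  # [True, True, False, False, False]
```

The correction matters because a cross-column scan tests many column pairs at once; without it, a fixed 0.05 threshold would let through a predictable share of spurious patterns.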

Thai-specific validation. National ID card checksum validation, Thai address parsing (province/district/subdistrict), Thai holiday awareness for timeseries spike attribution. No other EDA tool does this.

One line to get everything. thaieda.run(df) chains the full pipeline: type detection → smart cleaning → quality checks → anomaly detection → insight discovery → visualization → HTML report. No config needed.

How It Works

DataFrame
    │
    ▼
┌────────────────────────────────────────────────┐
│  thaieda.run(df)                               │
│                                                │
│  0. pre-analyze → data type + language         │
│  1. detect      → column types + Thai months   │
│  2. clean       → smart cleaning (auto-decide) │
│  3. quality     → language-aware checks        │
│  4. anomaly     → statistical + ML + text      │
│  5. insights    → 6 cross-column patterns      │
│  6. viz         → interactive + static charts  │
│  7. report      → executive HTML narrative     │
│                                                │
│  + optional: LLM analysis (4 privacy modes)    │
│  + optional: compare(df1, df2) side-by-side    │
└────────────────────────────────────────────────┘
    │
    ▼
EDAResult
  .to_html()        → report.html
  .to_dict()        → Python dict
  .to_json()        → JSON string
  .insights         → insight cards
  .cleaned_df       → cleaned DataFrame
  .quality_issues   → list of issues
  .quality_score    → 0-100 score with grade
  .anomalies        → anomaly findings
  .llm_response     → LLM analysis (if enabled)
  ._repr_html_()    → Jupyter rich display

run_folder("data/")  → FolderResult
  .to_html("dir/")      → individual HTML per file
  .to_master_html()     → single master HTML with sidebar
  .summary()            → text summary
  ._repr_html_()        → Jupyter rich display

What's New

Scale & Performance

Tested across 14 public datasets — from 500 rows to 541K rows, 8 to 171 columns. Every report stays under 2 MB and finishes under 120 seconds.

Insight capping — reports surface the 30 most important findings instead of hundreds. Critical insights are always kept; warnings and info fill the rest. The executive summary shows the true count ("679 found, showing top 30").
HTML bloat control — dual chart budget (40 charts max, 1.6 MB max). Quality and anomaly tables collapse after 50 rows. Wide tables switch to a summary view past 60 columns.
Wide-table fast path — the insight engine samples breakdowns and measures when columns exceed 100. Correlation heatmaps and scatter matrices skip automatically on very wide data.
Tall-table fast path — anomaly, quality, and outlier checks sample 50K rows when data exceeds 100K. Correlation computes on a sample. Timeseries decomposition skips past 200K rows.
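The insight-capping rule above can be sketched as follows. This is an assumption-laden illustration, not the library's code: the `Insight` record, the `score` field, and the tie-breaking order are hypothetical, and the sketch keeps every critical insight even if that alone exceeds the cap.

```python
from dataclasses import dataclass

SEVERITY_RANK = {"critical": 0, "warning": 1, "info": 2}


@dataclass
class Insight:  # hypothetical record, for illustration only
    title: str
    severity: str
    score: float


def cap_insights(insights, limit=30):
    """Keep every critical insight, then fill remaining slots by severity and score."""
    critical = [i for i in insights if i.severity == "critical"]
    rest = sorted(
        (i for i in insights if i.severity != "critical"),
        key=lambda i: (SEVERITY_RANK[i.severity], -i.score),
    )
    room = max(limit - len(critical), 0)
    shown = critical + rest[:room]
    # The summary reports the true total, not just the capped count.
    summary = f"{len(insights)} found, showing top {len(shown)}"
    return shown, summary


found = (
    [Insight(f"c{n}", "critical", 1.0) for n in range(5)]
    + [Insight(f"w{n}", "warning", n / 100) for n in range(100)]
    + [Insight(f"i{n}", "info", n / 574) for n in range(574)]
)
shown, summary = cap_insights(found)
print(summary)  # 679 found, showing top 30
```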

Data Quality & Cleaning

High-NA handling — columns over 80% missing are flagged as mostly_missing with NaN preserved. Columns over 40% get a warning to drop or impute with domain knowledge. Below 40% is unchanged.
Smarter type detection — Thai low-cardinality text is classified as categorical, not free text. Text-named columns like review and feedback stay text even with few unique values.
Cleaning safeguards — numeric strings like 1.00005 are left alone. Keyboard layout conversion only runs when Thai characters are present. Repeated-character spam on short codes is suppressed.
ID/FK awareness — ID columns are excluded from categorical anomaly checks. *_id columns are detected even with low unique ratio. Buddhist Era checks skip IDs. Timeseries excludes ID/FK/code columns from measures.
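The high-NA thresholds in the first item can be read as a simple three-way rule. The sketch below assumes the ratio is the fraction of missing cells in a column; only the `mostly_missing` label comes from the description above, and the other return values and the function name are hypothetical.

```python
def missing_action(missing_ratio: float) -> str:
    """Map a column's missing-value ratio (0.0 to 1.0) to a handling decision."""
    if not 0.0 <= missing_ratio <= 1.0:
        raise ValueError("missing_ratio must be between 0.0 and 1.0")
    if missing_ratio > 0.80:
        return "mostly_missing"        # flagged, NaN preserved
    if missing_ratio > 0.40:
        return "warn_drop_or_impute"   # hypothetical label for the warning tier
    return "unchanged"                 # hypothetical label for the default tier


column = [1.5, None, None, 2.0, None]
ratio = sum(v is None for v in column) / len(column)
print(ratio, missing_action(ratio))  # 0.6 warn_drop_or_impute
```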

Reporting

Executive briefing format — reports flow from executive summary to key findings, business translation, priority actions, and plain-language explanations.
Template pagination — Key Insights shows the top 20 with a collapsible section for the rest. Count badges are preserved.
Fewer false positives — fuzzy duplicate guard skips short near-identical labels. Script mixing is skipped on low-cardinality columns. Outliers on heavy-tail distributions (skew > 2.0) are downgraded to info.
Folder reports — run_folder() analyzes CSV, Excel, JSON, JSONL, and TSV folders. FolderResult.to_master_html() builds one master HTML with sidebar navigation.

Smart Pre-Analysis

Language detection — Thai, English, mixed, and numeric data detected with confidence and per-column detail.
Data type classification — transaction, registry, survey, timeseries, and mixed datasets classified before EDA.
Language-aware quality — English-only data skips Thai-specific warnings automatically.

Accuracy Improvements (opt-in)

Abbreviation expansion: expand_abbreviations() expands Thai abbreviations (กทม. → กรุงเทพมหานคร, บจ. → บริษัทจำกัด) via pythainlp.util.abbreviation_to_full_text. This is an opt-in operation, not enabled in the default pipeline, because it changes the semantics of the text.
Spell correction: spell_correct() fixes misspelled Thai words (ขอบคุน → ขอบคุณ) via pythainlp.spell.correct_sent. Opt-in operation.
NFKC normalization: normalize_nfkc() converts full-width characters (Ａ→A, ９→9) via the stdlib unicodedata.normalize("NFKC"). Opt-in operation.
Tokenizer selection modes: engine="auto-fast" selects nlpo3 (Rust, 3-4x faster) and engine="auto-quality" selects AttaCut (neural, more accurate on social media/OOV text).
Keyboard layout anomaly detection: flags cells suspected of being typed in the wrong keyboard layout (Latin characters mixed into a Thai column). Report-only; nothing is fixed automatically.
Thai grapheme validation: flags abnormal stacked tone marks (e.g. ก่้, mai ek + mai tho on a single consonant). Report-only.
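Two of the items above rest on standard-library building blocks, sketched here. The NFKC call is the real `unicodedata` API; the stacked-tone-mark regex is only an assumption about how such a check could be written, and both helper names are hypothetical.

```python
import re
import unicodedata

# Thai tone marks: mai ek, mai tho, mai tri, mai chattawa (U+0E48 to U+0E4B).
STACKED_TONES = re.compile("[\u0e48-\u0e4b]{2,}")


def nfkc(text: str) -> str:
    """Apply Unicode NFKC normalization (folds full-width forms to ASCII)."""
    return unicodedata.normalize("NFKC", text)


def has_stacked_tone_marks(text: str) -> bool:
    """Hypothetical check: two or more tone marks in a row on one base character."""
    return STACKED_TONES.search(text) is not None


print(nfkc("Ａ９"))                              # A9
print(has_stacked_tone_marks("ก\u0e48\u0e49"))   # True  (mai ek + mai tho)
print(has_stacked_tone_marks("อร่อย"))           # False (a single mai ek)
```

One thing worth checking before applying NFKC to Thai text: it also rewrites SARA AM (U+0E33) into the two-character sequence U+0E4D U+0E32, so string comparisons against un-normalized data can stop matching.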

Benchmarks

ThaiEDA is tested on 14 public datasets ranging from 500 rows to 541K rows, 8 to 171 columns. Every dataset produces a report under 2 MB in under 120 seconds.

Dataset	Rows	Cols	Time	HTML	Insights
titanic	891	12	8 s	0.79 MB	27
telco-churn	7,043	21	11 s	0.84 MB	11
wine-quality	1,599	12	7 s	0.93 MB	29
california-housing	20,640	10	15 s	0.99 MB	30
superstore	10,800	21	31 s	1.46 MB	30
adult	32,561	15	22 s	1.03 MB	29
bank-marketing	41,188	21	21 s	0.94 MB	30
online-retail	541,909	8	81 s	0.96 MB	30
dirty-thai-retail	500	8	2 s	0.51 MB	15
absenteeism	740	21	10 s	1.25 MB	30
online-shoppers	12,330	18	18 s	1.06 MB	30
aps-failure	16,000	171	100 s	0.48 MB	30
beijing-pm25	43,824	13	12 s	0.76 MB	19
bike-sharing	17,379	17	42 s	1.55 MB	30

All 14 datasets pass QA with 0 defects. The datasets come from the UCI ML Repository and other public sources.

Examples

One-Line EDA

import thaieda
import pandas as pd

df = pd.read_csv("data.csv")

# Full pipeline in one call
result = thaieda.run(df)

# Access results
result.to_html("report.html")
print(result.quality_issues)
print(result.insights)

# In Jupyter: just display the result
result  # renders HTML report inline

Folder Mode — Analyze Every File at Once

import thaieda

# One line — analyzes every CSV/Excel/JSON in the folder
results = thaieda.run_folder("data/")

# Print summary
print(results.summary())
# ThaiEDA FolderResult — data/
#   Files: 5 (✅ 5 / ❌ 0)
#   ✅ customers.csv — 10,000 rows × 8 cols, 15 insights
#   ✅ orders.csv    — 50,000 rows × 12 cols, 28 insights
#   ...

# Save individual HTML reports
results.to_html("reports/")

# Generate a single master HTML with sidebar navigation
results.to_master_html("master-report.html")

run_folder() features:

Auto-scans for CSV, Excel (.xlsx/.xls), JSON, JSONL, TSV
recursive=True to include subfolders
output_dir= to specify where HTML goes
Error isolation — one broken file doesn't crash the rest
progress= callback for progress tracking
All run() kwargs supported (lang, clean, llm, etc.)
to_master_html() — combines all reports into one page with sidebar nav + summary table

With LLM Analysis (Privacy-Safe)

import pandas as pd
import thaieda

df = pd.read_csv("data.csv")

# Default mode: zero raw data leaves your machine
result = thaieda.run(df, llm=True, privacy="insight_only", provider="ollama")
print(result.llm_response)

Compare Two Datasets

from thaieda.compare import compare_datasets

diff = compare_datasets(df_train, df_test, labels=("train", "test"))
print(diff["schema_diff"])      # columns added/removed
print(diff["drift"]["numeric"]) # KS statistic per column
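For context on the drift output, the two-sample Kolmogorov-Smirnov statistic is the largest vertical gap between the empirical CDFs of two samples. The plain-Python sketch below computes it for one numeric column; it is a generic illustration under that definition, not the code behind `compare_datasets`, and the function name is hypothetical.

```python
def ks_statistic(a, b):
    """Two-sample KS statistic: max |ECDF_a(x) - ECDF_b(x)| over all x."""
    a, b = sorted(a), sorted(b)
    i = j = 0
    gap = 0.0
    while i < len(a) and j < len(b):
        x = min(a[i], b[j])
        # Advance both pointers past every value <= x, so ties are handled together.
        while i < len(a) and a[i] <= x:
            i += 1
        while j < len(b) and b[j] <= x:
            j += 1
        gap = max(gap, abs(i / len(a) - j / len(b)))
    return gap


print(ks_statistic([1, 2, 3], [1, 2, 3]))        # 0.0 (identical samples)
print(ks_statistic([1, 2, 3], [4, 5, 6]))        # 1.0 (no overlap)
print(ks_statistic([1, 2, 3, 4], [3, 4, 5, 6]))  # 0.5
```

A value near 0 suggests the column is distributed similarly in both datasets; a value near 1 suggests strong drift.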

Thai ID Card Validation

from thaieda.quality import validate_thai_id, validate_thai_id_column

# Single ID
validate_thai_id("1-1234-56789-01-2")  # → True/False

# Entire column
result = validate_thai_id_column(df["id_card"])
print(f"Valid: {result['valid_count']}, Invalid: {result['invalid_count']}")
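The checksum behind Thai national ID numbers is the standard mod-11 scheme: multiply the first 12 digits by weights 13 down to 2, sum them, and compare `(11 - sum % 11) % 10` with the 13th digit. The sketch below is a standalone illustration of that rule, not the body of `validate_thai_id`; the function name is hypothetical.

```python
def thai_id_checksum_ok(raw: str) -> bool:
    """Return True if a 13-digit Thai national ID passes the mod-11 checksum."""
    cleaned = raw.replace("-", "").replace(" ", "")
    if len(cleaned) != 13 or not (cleaned.isascii() and cleaned.isdigit()):
        return False
    digits = [int(ch) for ch in cleaned]
    # Weights 13, 12, ..., 2 applied to the first 12 digits.
    total = sum(d * w for d, w in zip(digits[:12], range(13, 1, -1)))
    return (11 - total % 11) % 10 == digits[12]


print(thai_id_checksum_ok("1-1234-56789-01-4"))  # True
print(thai_id_checksum_ok("1-1234-56789-01-2"))  # False
```

Under this rule the placeholder ID used in the example above (`1-1234-56789-01-2`) does not pass: its weighted sum is 315, which gives a check digit of 4.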

Thai Address Parsing

from thaieda.detect import parse_thai_address

addr = parse_thai_address("123 หมู่ 4 ต.บางบัว อ.บางบัว จ.กรุงเทพฯ 10230")
print(addr)
# {'house_number': '123', 'moo': '4', 'subdistrict': 'บางบัว',
#  'district': 'บางบัว', 'province': 'กรุงเทพฯ', 'postal_code': '10230'}

Language Detection

import pandas as pd
from thaieda.detect import _detect_language

df = pd.DataFrame({
    "product": ["กาแฟ", "ชาไทย", "ขนม"],
    "review": ["อร่อยมาก 5/5 stars", "ดีครับ", "ไม่ดี"],
    "sku": ["SKU001", "SKU002", "SKU003"],
})

info = _detect_language(df)
print(info["language"], info["confidence"])
print(info["columns"])
# thai/mixed/english/numeric + per-column language map

Language Detection v2 features:

Unicode Thai block analysis (U+0E00–U+0E7F) including vowels/tone marks (U+0E30–U+0E4D)
Zero-width-space aware (\u200b, BOM, word joiner)
Mixed-cell detection, such as "อร่อยมาก 5/5 stars"
Common Thai word hints: ครับ, ค่ะ, ไทย, อร่อย, ดี, ไม่, มี, และ
Lazy pythainlp tokenizer when installed; regex fallback when unavailable
Per-column column_details + dataset-level confidence (0.0–1.0)
Sample-based scan (first 500 rows/column) for large DataFrames
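The first three features can be approximated with a character-ratio heuristic. The sketch below is a simplified stand-in, not the library's detector: the thresholds (0.9 and 0.1), the function names, and the decision to label every non-Thai alphabetic script as "english" are all assumptions made for illustration.

```python
# Characters to delete before counting: zero-width space, BOM, word joiner.
ZERO_WIDTH = dict.fromkeys(map(ord, "\u200b\ufeff\u2060"))


def is_thai(ch: str) -> bool:
    """True if the character falls in the Unicode Thai block (U+0E00 to U+0E7F)."""
    return "\u0e00" <= ch <= "\u0e7f"


def classify_cell(text: str, thai_min: float = 0.9, english_max: float = 0.1) -> str:
    """Classify one cell as 'thai', 'english', 'mixed', or 'numeric'."""
    cleaned = text.translate(ZERO_WIDTH)
    # Thai letters, vowels, and tone marks count; Thai digits do not.
    thai = sum(1 for ch in cleaned if is_thai(ch) and not ch.isdigit())
    other = sum(1 for ch in cleaned if ch.isalpha() and not is_thai(ch))
    total = thai + other
    if total == 0:
        return "numeric"  # no letters at all (digits, punctuation, or empty)
    ratio = thai / total
    if ratio >= thai_min:
        return "thai"
    if ratio <= english_max:
        return "english"
    return "mixed"


for cell in ["กาแฟ", "อร่อยมาก 5/5 stars", "SKU001", "12345", "สม\u200bชาย"]:
    print(repr(cell), classify_cell(cell))
```

Counting combining vowels and tone marks explicitly matters because `str.isalpha()` returns False for them; a naive letter count would underweight Thai text.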

Smart Pre-Analysis

ThaiEDA profiles the dataset before running the full report, so the narrative and quality checks match the data:

from thaieda.report import _detect_data_type

pre = _detect_data_type(df)
print(pre["label"], pre["language"]["language"])
print(pre["focus"])

Smart pre-analysis detects:

Transaction data — orders, payments, revenue, invoices
Registry/master data — customers, products, stores, entity attributes
Survey/review data — ratings, comments, feedback text
Timeseries data — datetime index/columns + numeric measures
Mixed data — conservative fallback when signals overlap
Language impact — Thai/mixed data enables Thai-specific checks; English-only data skips พ.ศ./เลขไทย checks automatically

Data Quality Score

from thaieda.quality import compute_quality_score

score = compute_quality_score(quality_issues, n_columns=10, n_rows=1000)
print(f"Score: {score.score}/100 ({score.grade})")
# Score: 85/100 (B)

Smart Cleaning

from thaieda.clean._smart import plan_cleaning

plan = plan_cleaning(df)
print(f"Actions: {plan.actions}")    # ['zwspace', 'numerals', 'duplicates']
print(f"Skipped: {plan.skipped}")    # ['encoding', 'whitespace']

Privacy Modes

Control exactly what data leaves your machine when using LLM analysis:

Mode	What Leaves	Guarantee	When to Use
`insight_only` (default)	Stats + insights only	Raw data never leaves	Government, medical, PDPA data
`anonymized`	Data with PII → tokens	Names/phones/ID cards masked	Need structure without raw PII
`dp_noise`	Stats + Laplace noise	Prevents re-identification	Small datasets where stats leak
`full`	Everything	None — you accept the risk	Public data, demos
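To illustrate the idea behind the `dp_noise` row, here is a minimal sketch of the Laplace mechanism applied to a mean. It assumes values are clipped to a known range and that a privacy budget `epsilon` is chosen by the user; these parameters, the function names, and the choice of statistic are illustrative assumptions, not ThaiEDA's actual interface.

```python
import random


def laplace_noise(scale: float, rng: random.Random) -> float:
    """Sample Laplace(0, scale) as the difference of two exponential draws."""
    return rng.expovariate(1.0 / scale) - rng.expovariate(1.0 / scale)


def noisy_mean(values, lower, upper, epsilon, rng):
    """Mean of values clipped to [lower, upper], plus Laplace noise."""
    clipped = [min(max(v, lower), upper) for v in values]
    true_mean = sum(clipped) / len(clipped)
    # Changing one record moves the clipped mean by at most this much.
    sensitivity = (upper - lower) / len(clipped)
    return true_mean + laplace_noise(sensitivity / epsilon, rng)


rng = random.Random(0)
ages = [23, 35, 41, 29, 52, 38, 47, 31]
print(noisy_mean(ages, lower=0, upper=100, epsilon=1.0, rng=rng))
```

Smaller `epsilon` means more noise and stronger protection. On small datasets the sensitivity term is large, which is exactly the case where un-noised statistics can leak individual values.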

What ThaiEDA Catches

Problem	Example	What Happens
Buddhist Era dates	`15/03/2567`	Auto-detects พ.ศ. → converts to CE
Thai numerals	`๑๒๓` in numeric column	Converts to `123`
Zero-width spaces	`สม\u200bชาย`	Strips invisible chars and reports language evidence
Thai vowel/tone marks	`อร่อยค่ะ`	Counts U+0E30–U+0E4D for better Thai detection
Mixed Thai/English cells	`อร่อยมาก 5/5 stars`	Detects as mixed language instead of English/numeric
Thai text in English-heavy tables	Thai product column + English IDs	Column-level language detection preserves Thai checks
Common Thai words	`ครับ`, `ค่ะ`, `ไม่ดี`	Boosts confidence for short Thai text
Mojibake encoding	`Ã ¬Â¸Â¡Â¹`	Auto-detects TIS-620 → UTF-8
Thai month names	`มกราคม`	Parses to ISO date
Phone numbers	`081-234-5678`	Detects + normalizes
National ID cards	`1-1234-56789-01-2`	Checksum validation
Thai addresses	`123 ม.4 ต.บางบัว อ.บางบัว จ.กรุงเทพฯ`	Parses to structured fields
Placeholder values	`-`, `N/A`, `ไม่มี`	Flags as missing
Constant columns	All same value	Flags as useless
Smart data type	Orders/reviews/timeseries	Pre-classifies transaction/registry/survey/timeseries/mixed
Language-aware checks	English-only DataFrame	Skips Thai-specific พ.ศ./เลขไทย warnings automatically
Thai holidays	Spike on Dec 5	Attributes to Father's Day
ID/FK semantics	`order_id`, `store_id`	Detected as ID even with low unique ratio; excluded from category anomaly
BE on ID	`order_id=2531`	No longer flagged as พ.ศ. (BE check requires date-like column)
Numeric string preservation	`1.00005` in numeric data	`fix_repeated_chars` skips decimal/numeric strings
Keyboard layout guard	`Floyd` in English column	Not converted to Thai (requires Thai chars in column first)
Payment method detection	`payment_method` column	Classified as categorical, not amount column
Index artifact cleanup	`Unnamed: 0` column	Ignored in analysis, flagged as index artifact
Per-column missing values	`Age` 20% missing	Quality issue with severity threshold (warning >5%, info 1-5%)
CSV delimiter warning	`;`-delimited file read as 1 column	Warns to re-read with `sep=';'`
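Two rows of the table, mojibake repair and Thai month names, can be sketched with the standard library alone. The example assumes the mojibake came from TIS-620 bytes decoded as Latin-1 and that dates look like "day month-name BE-year"; both helper names are hypothetical and real data will need more cases than this.

```python
from datetime import date

THAI_MONTHS = {
    name: number
    for number, name in enumerate(
        ["มกราคม", "กุมภาพันธ์", "มีนาคม", "เมษายน", "พฤษภาคม", "มิถุนายน",
         "กรกฎาคม", "สิงหาคม", "กันยายน", "ตุลาคม", "พฤศจิกายน", "ธันวาคม"],
        start=1,
    )
}


def repair_tis620_mojibake(text: str) -> str:
    """Undo TIS-620 bytes that were mistakenly decoded as Latin-1.

    Only apply this to text already suspected of being garbled: genuine
    Latin-1 text with accented letters would be turned into Thai characters.
    """
    try:
        return text.encode("latin-1").decode("tis-620")
    except (UnicodeEncodeError, UnicodeDecodeError):
        return text  # not recoverable this way, so leave it untouched


def parse_thai_date(text: str) -> date:
    """Parse 'day month-name BE-year', e.g. '15 มกราคม 2567'."""
    day, month_name, be_year = text.split()
    return date(int(be_year) - 543, THAI_MONTHS[month_name], int(day))


garbled = "สวัสดี".encode("tis-620").decode("latin-1")
print(repair_tis620_mojibake(garbled))              # สวัสดี
print(parse_thai_date("15 มกราคม 2567").isoformat())  # 2024-01-15
```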

Visualization

ThaiEDA generates both static (matplotlib) and interactive (Plotly) charts:

Static: correlation heatmap, distribution, box/violin, missing matrix, scatter matrix, wordcloud, timeseries, pair plot, KDE, QQ plot, sunburst
Interactive: hover tooltips, zoom, pan — using Plotly with Thai font (Sarabun) via Google Fonts
Color palette: Okabe-Ito colorblind-safe (7 colors)
Thai font: auto-detected for matplotlib, CSS-loaded for Plotly

from thaieda.viz._interactive import create_correlation_heatmap_interactive

html_div = create_correlation_heatmap_interactive(df)  # → HTML <div> for reports

Installation

# Install everything with a single command
pip install thaieda

No extras needed: pip install thaieda installs everything, including the Thai tokenizer, NER, ML, interactive charts, Excel, stats, and encoding detection.

LLM providers remain optional (lazy-imported, so there is nothing to install if you don't use them):

pip install openai       # OpenAI GPT
pip install anthropic    # Anthropic Claude
pip install ollama       # Ollama local LLM (or use the HTTP fallback, no install needed)

Requirements: Python 3.10+

Modules

Module	What It Does
`run()` / `EDA()`	One-liner API — full pipeline in one call
`run_folder()`	Analyze every CSV/Excel/JSON in a folder + master HTML
`compare()`	Side-by-side dataset comparison with drift detection
`io/`	Auto-read CSV/JSON/JSONL/Excel + encoding detection
`detect/`	Column type detection + Thai month names + address parsing + language detection v2
`clean/`	Smart cleaning: auto-decide what to fix (encoding, numerals, BE, zwspace)
`quality/`	Language-aware quality checks + score 0-100 + Thai ID card validation
`anomaly/`	Statistical + ML + text anomaly detection
`ner/`	Thai NER: person/place/organization
`insight_engine/`	6 cross-column insight patterns (BH-corrected)
`viz/`	Static + interactive charts with colorblind-safe palette
`report/`	Executive HTML report + smart pre-analysis (`_detect_data_type`)
`llm/`	Privacy-preserving LLM analysis (4 modes, 3 providers)
`timeseries/`	Trend/seasonality/STL/ACF + Thai holiday awareness
`schema/`	Multi-file PK/FK discovery + relationship matching

Testing

pytest tests/ -v                    # all tests (631 passed)
pytest tests/test_language_detection.py  # language detection + language-aware quality
pytest tests/test_thai_id.py        # ID card validation
pytest tests/test_thai_address.py   # address parsing
pytest tests/test_compare.py        # dataset comparison
pytest tests/test_run_folder.py     # folder mode + master HTML
pytest tests/test_llm.py            # LLM + privacy modes
ruff check src/ tests/              # lint
ruff format src/ tests/             # format

License

Apache-2.0 © Peet Wannasarnmetha

Release history

Version	Released
2.3.0	Jun 28, 2026
2.2.0	Jun 27, 2026
2.1.1	Jun 27, 2026
2.1.0	Jun 27, 2026
2.0.0	Jun 26, 2026
1.9.3	Jun 26, 2026
1.9.2	Jun 26, 2026
1.9.1	Jun 26, 2026
1.9.0	Jun 26, 2026
1.8.0	Jun 26, 2026
1.7.1	Jun 26, 2026
1.7.0	Jun 26, 2026
1.6.0 (this version)	Jun 26, 2026
1.5.0	Jun 26, 2026
1.1.0	Jun 25, 2026
1.0.1	Jun 25, 2026
1.0.0	Jun 25, 2026

Download files

Source Distribution

thaieda-1.6.0.tar.gz (462.1 kB)

Uploaded Jun 26, 2026 Source

Built Distribution

thaieda-1.6.0-py3-none-any.whl (253.6 kB)

Uploaded Jun 26, 2026 Python 3

File details

Details for the file thaieda-1.6.0.tar.gz.

File metadata

Download URL: thaieda-1.6.0.tar.gz
Upload date: Jun 26, 2026
Size: 462.1 kB
Tags: Source
Uploaded using Trusted Publishing? No
Uploaded via: twine/6.2.0 CPython/3.11.9

File hashes

Hashes for thaieda-1.6.0.tar.gz
Algorithm	Hash digest
SHA256	`94746ac199f3ce6d02c836bc1a7a9fa21e509fa56e5ca2f25c809503a5019095`
MD5	`accc44d0c53a77b35c59c4ef83510019`
BLAKE2b-256	`8aad2f064ba4571f3a25d588c453836d8fbc0975a1654c94676f84b41833cefd`

File details

Details for the file thaieda-1.6.0-py3-none-any.whl.

File metadata

Download URL: thaieda-1.6.0-py3-none-any.whl
Upload date: Jun 26, 2026
Size: 253.6 kB
Tags: Python 3
Uploaded using Trusted Publishing? No
Uploaded via: twine/6.2.0 CPython/3.11.9

File hashes

Hashes for thaieda-1.6.0-py3-none-any.whl
Algorithm	Hash digest
SHA256	`94f7322e0064ef83582cfd732ed81c3f7f2a68d22f5651b121f13c58d9147155`
MD5	`ce8f4474aef919dc8160758475e03d97`
BLAKE2b-256	`90f7838f303594499047bde600b0822eca848eb365e4df1c14222a1736d6ce41`
