Skip to main content

AutoEDA สำหรับข้อมูลภาษาไทย — Exploratory data analysis that speaks Thai

Project description

ThaiEDA

Exploratory data analysis that actually understands Thai.

PyPI Python 3.10+ License: Apache-2.0 Tests: 691 passed Code Style: ruff Language aware


Quick Start

pip install thaieda
import thaieda
import pandas as pd

df = pd.read_csv("data.csv")
result = thaieda.run(df)          # full EDA in one line
result.to_html("report.html")     # self-contained HTML report

That's it. pip install thaieda ติดตั้งทุกอย่างเลย — Thai tokenizer, NER, ML, Excel, stats, encoding detection, interactive charts ไม่ต้องใส่ extras


Why ThaiEDA?

You already have ydata-profiling and sweetviz. Here's why you'd reach for ThaiEDA instead:

1. Thai text doesn't break. Generic tools render Thai as tofu boxes (□□□) in every chart. They miss Buddhist Era dates (พ.ศ. 2567), Thai numerals (๑๒๓), zero-width spaces, and mojibake from TIS-620 encoding. ThaiEDA detects and fixes all of these automatically — no font config, no manual cleanup.

2. Insights, not just stats. ydata-profiling gives you distributions and correlation matrices. ThaiEDA finds actionable cross-column patterns — "column A strongly predicts column B", "this group is 3× higher than average" — ranked by statistical interestingness with Benjamini-Hochberg correction. Plus anomaly detection, quality scoring, and data type classification.

3. One call, everything done. run(df) chains the full pipeline: type detection → smart cleaning → quality checks → anomaly detection → insight discovery → visualization → HTML report. With ydata you'd still need a separate anomaly detector, a Thai font config, a cleaner, and manual interpretation.

4. Privacy-first LLM. Ask an LLM about your data without sending raw rows to a cloud API. 4 privacy modes — the default sends zero raw data. PDPA-ready.

5. Smaller reports on big data. ydata-profiling produces a 71 MB HTML on a 171-column dataset. ThaiEDA produces 0.48 MB — 148× smaller — because it caps charts, collapses tables, and samples intelligently on wide/tall data.


How It Works

DataFrame → thaieda.run(df) → EDAResult

  Step 0  pre-analyze    data type + language detection
  Step 1  detect         column types + Thai months + addresses
  Step 2  clean          smart cleaning (auto-decide what to fix)
  Step 3  quality        language-aware checks + 0–100 score
  Step 4  anomaly        IQR + ML + text anomaly detection
  Step 5  insights       6 cross-column patterns (BH-corrected)
  Step 6  viz            static (matplotlib) + interactive (Plotly)
  Step 7  report         executive HTML narrative

  + optional: LLM analysis (4 privacy modes)
  + optional: run_folder("data/") → multi-file master HTML
  + optional: compare(df1, df2) → drift detection
result = thaieda.run(df)

result.to_html()         # → report.html (self-contained)
result.to_dict()          # → Python dict
result.to_json()          # → JSON string
result.insights           # → insight cards
result.cleaned_df         # → cleaned DataFrame
result.quality_issues     # → list of issues
result.quality_score      # → 0–100 score with grade
result.anomalies          # → anomaly findings
result.llm_response       # → LLM analysis (if enabled)
result                    # → Jupyter rich display

Benchmarks — ThaiEDA vs ydata-profiling vs sweetviz

We ran all three on 6 representative datasets (small/large/wide, Thai + non-Thai):

Capability comparison

Feature ydata-profiling sweetviz ThaiEDA
Standalone HTML report
Cross-column insights ✅ 6 patterns + BH correction
Anomaly detection ✅ IQR + ML + text
Quality score (0–100)
Language detection ✅ Thai/English/mixed
Thai font in charts ❌ tofu ❌ tofu ✅ Sarabun auto
Buddhist Era (พ.ศ.) ✅ → CE
Thai numerals (๑๒๓) ✅ → 123
Zero-width space fix
Mojibake repair
Smart cleaning ✅ auto-decide
Thai NER
Privacy LLM modes ✅ 4 modes (PDPA)
Folder mode run_folder()

Speed & report size

Dataset Rows Cols ydata ydata size sweetviz sv size Evidently ev size ThaiEDA EDA size
titanic 891 12 5.3s 1.95 MB 3.3s 0.92 MB 8.2s 0.82 MB
superstore 10,800 21 9.3s 5.16 MB 5.4s 1.49 MB 26.0s 1.50 MB
adult 32,561 15 5.4s 1.65 MB 8.0s 1.26 MB 17.2s 1.05 MB
dirty-thai-retail 500 8 3.1s 0.90 MB 2.1s 0.68 MB 2.1s 0.53 MB
wisesight 26,737 2 2.6s 0.68 MB 0.8s 0.50 MB 18.8s 0.42 MB
aps-failure 16,000 171 99.8s 71.2 MB 15.8s 8.2 MB 93.0s 0.48 MB
synthetic 2,000 12 45s 7.2 MB 3s 0.9 MB 1s 3.7 MB 16s 1.5 MB

Quality benchmark — 5 tools on synthetic dataset (10 known issues)

We injected 10 known defects into a 2,000-row synthetic dataset and measured detection. All tools processed identically: HTML output stripped to plain text, same keyword detection applied uniformly.

Table A — General EDA quality (6 issues all tools can detect)

Metric ydata (default) sweetviz Evidently PyGWalker ThaiEDA
GTR — Ground-Truth Recall 100% 83% 100% 0%* 100%
ITB — Issue Type Breadth (11) 73% 64% 91% 0%* 91%
RC — Report Completeness (10) 70% 50% 70% 10% 100%
Time 45s 3s 1s <1s 16s
HTML size 7.2 MB 0.9 MB 3.7 MB 2.9 MB 1.5 MB

* PyGWalker is a visual exploration tool, not a report generator — its HTML contains a JS app shell with no extractable findings. Included for completeness, not as a quality comparison.

On general EDA, ThaiEDA and Evidently both achieve 100% recall and 91% breadth. ThaiEDA wins on report completeness (100% vs 70%) while producing a 2× smaller report than Evidently and 5× smaller than ydata.

Table B — Thai-specific detection (4 Thai issues — competitors don't claim Thai support)

Thai issue ydata sweetviz Evidently ThaiEDA
Buddhist Era dates (พ.ศ.) ✅* 0% ✅*
Thai numerals (๑๒๓) 0% 0% ✅*
Zero-width spaces 0% 0% 0%
Mojibake (TIS-620) 0% 0% ✅*
Thai GTR 25% 0% 100%* 75%

* Evidently scores high on Thai keywords via generic matches ("encoding" appears in its CSS/JS framework), not Thai-specific recognition. ydata detected BE dates via generic encoding keywords. These are keyword-matching artifacts, not genuine Thai detection — competitors do not claim Thai language support.

ThaiEDA's 75% (3/4) reflects purpose-built Thai detection. The one miss (zero-width space in category_text) is a known gap. Competitors score 0% on genuine Thai detection by design.


What ThaiEDA Catches

Thai-specific problems

Problem Example What ThaiEDA does
Buddhist Era dates 15/03/2567 Detects พ.ศ. → converts to CE
Thai numerals ๑๒๓ in numeric column Converts to 123
Zero-width spaces สม\u200bชาย Strips invisible chars + reports
Thai vowel/tone marks อร่อยค่ะ Counts U+0E30–U+0E4D for detection
Mixed Thai/English cells อร่อยมาก 5/5 stars Detects as mixed, not English/numeric
Thai month names มกราคม Parses to ISO date
Mojibake encoding à ¬Â¸Â¡Â¹ Auto-detects TIS-620 → UTF-8
National ID cards 1-1234-56789-01-2 Checksum validation
Thai addresses 123 ม.4 ต.บางบัว อ.บางบัว จ.กรุงเทพฯ Parses to structured fields
Phone numbers 081-234-5678 Detects + normalizes
Thai holidays Spike on Dec 5 Attributes to Father's Day

Data quality & intelligence

Problem What ThaiEDA does
Placeholder values (-, N/A, ไม่มี) Flags as missing
Constant columns Flags as useless
High-NA columns (>80%) Flags mostly_missing, preserves NaN
Missing % per column Severity threshold (warning >5%, info 1–5%)
Smart data type Pre-classifies transaction/registry/survey/timeseries/mixed
Language-aware checks English-only skips Thai พ.ศ./เลขไทย warnings
ID/FK semantics order_id, store_id excluded from category anomaly
Numeric string preservation 1.00005 left alone — not "spam"
Keyboard layout guard Floyd in English column not converted to Thai
Index artifacts Unnamed: 0 ignored + flagged
CSV delimiter mismatch ;-delimited file warns to re-read

Opt-in operations (not in default pipeline)

Operation Function Effect
Abbreviation expansion expand_abbreviations() กทม. → กรุงเทพมหานคร
Spell correction spell_correct() ขอบคุน → ขอบคุณ
NFKC normalization normalize_nfkc() A→A, 9→9
Fast tokenizer engine="auto-fast" nlpo3 (Rust, 3–4× faster)
Quality tokenizer engine="auto-quality" AttaCut (neural, better for OOV)
Keyboard layout anomaly report-only Detects suspicious Latin/Thai mixing
Grapheme validation report-only Detects abnormal stacked tone marks

Scale & Performance

Tested across 19 public datasets — 500 to 541K rows, 2 to 171 columns:

  • Insight capping — surfaces the 30 most important findings. Executive summary shows the true count ("679 found, showing top 30").
  • HTML bloat control — 40 charts max, 1.6 MB max. Quality/anomaly tables collapse after 50 rows. Wide tables switch to summary view past 60 columns.
  • Wide-table fast path — insight engine samples when columns exceed 100. Heatmaps and scatter matrices skip on very wide data.
  • Tall-table fast path — anomaly/quality/outlier checks sample 50K rows when data exceeds 100K. Timeseries decomposition skips past 200K rows.
  • High-NA handling — columns >80% missing flagged as mostly_missing. >40% gets a warning. <40% unchanged.
  • Smarter type detection — Thai low-cardinality text → categorical, not free text. review/feedback stay text.
  • Cleaning safeguards — numeric strings untouched. Keyboard conversion only when Thai chars present.

Examples

One-Line EDA

import thaieda
import pandas as pd

df = pd.read_csv("data.csv")
result = thaieda.run(df)

result.to_html("report.html")
print(result.quality_issues)
print(result.insights)

# In Jupyter: just display the result
result  # renders HTML inline

Folder Mode — Analyze Every File at Once

import thaieda

results = thaieda.run_folder("data/")

print(results.summary())
# ThaiEDA FolderResult — data/
#   Files: 5 (✅ 5 / ❌ 0)
#   ✅ customers.csv — 10,000 rows × 8 cols, 15 insights
#   ...

results.to_html("reports/")
results.to_master_html("master-report.html")  # single HTML with sidebar

Supported formats: CSV, Excel (.xlsx/.xls), JSON, JSONL, TSV. recursive=True for subfolders. Error isolation — one broken file doesn't crash the rest.

LLM Analysis (Privacy-Safe)

result = thaieda.run(df, llm=True, privacy="insight_only", provider="ollama")
print(result.llm_response)
# Default: zero raw data leaves your machine
Mode What Leaves When to Use
insight_only (default) Stats + insights only Government, medical, PDPA
anonymized PII → tokens Need structure without raw data
dp_noise Stats + Laplace noise Small datasets where stats leak
full Everything Public data, demos

Compare Two Datasets

from thaieda.compare import compare_datasets

diff = compare_datasets(df_train, df_test, labels=("train", "test"))
print(diff["schema_diff"])       # columns added/removed
print(diff["drift"]["numeric"])  # KS statistic per column

Thai ID Card Validation

from thaieda.quality import validate_thai_id, validate_thai_id_column

validate_thai_id("1-1234-56789-01-2")           # → True/False
result = validate_thai_id_column(df["id_card"]) # entire column

Thai Address Parsing

from thaieda.detect import parse_thai_address

addr = parse_thai_address("123 หมู่ 4 ต.บางบัว อ.บางบัว จ.กรุงเทพฯ 10230")
# {'house_number': '123', 'moo': '4', 'subdistrict': 'บางบัว',
#  'district': 'บางบัว', 'province': 'กรุงเทพฯ', 'postal_code': '10230'}

Language Detection

from thaieda.detect import _detect_language

df = pd.DataFrame({
    "product": ["กาแฟ", "ชาไทย", "ขนม"],
    "review": ["อร่อยมาก 5/5 stars", "ดีครับ", "ไม่ดี"],
    "sku": ["SKU001", "SKU002", "SKU003"],
})

info = _detect_language(df)
print(info["language"], info["confidence"])
# thai/mixed/english/numeric + per-column language map

Features: Unicode Thai block analysis (U+0E00–U+0E7F), zero-width-space aware, mixed-cell detection, common Thai word hints, per-column column_details + dataset-level confidence (0.0–1.0), sample-based scan for large DataFrames.

Smart Pre-Analysis

from thaieda.report import _detect_data_type

pre = _detect_data_type(df)
print(pre["label"], pre["language"]["language"])
# Detects: transaction, registry, survey, timeseries, or mixed

Data Quality Score

from thaieda.quality import compute_quality_score

score = compute_quality_score(quality_issues, n_columns=10, n_rows=1000)
print(f"Score: {score.score}/100 ({score.grade})")  # Score: 85/100 (B)

Smart Cleaning

from thaieda.clean._smart import plan_cleaning

plan = plan_cleaning(df)
print(plan.actions)   # ['zwspace', 'numerals', 'duplicates']
print(plan.skipped)   # ['encoding', 'whitespace']

Visualization

Both static and interactive charts, all with Thai font support:

  • Static (matplotlib): correlation heatmap, distribution, box/violin, missing matrix, scatter matrix, wordcloud, timeseries, pair plot, KDE, QQ plot, sunburst
  • Interactive (Plotly): hover tooltips, zoom, pan — Thai font (Sarabun) via Google Fonts
  • Color palette: Okabe-Ito colorblind-safe (7 colors)
from thaieda.viz._interactive import create_correlation_heatmap_interactive

html_div = create_correlation_heatmap_interactive(df)  # → HTML <div>

Installation

pip install thaieda

ไม่ต้องใส่ extras — ติดตั้งทั้งหมด: Thai tokenizer, NER, ML, interactive charts, Excel, stats, encoding detection

LLM providers (optional, lazy-imported):

pip install openai       # OpenAI GPT
pip install anthropic    # Anthropic Claude
pip install ollama       # Ollama local LLM (หรือใช้ HTTP fallback)

Requirements: Python 3.10+


Modules

Module What It Does
run() / EDA() One-liner API — full pipeline in one call
run_folder() Analyze every CSV/Excel/JSON in a folder + master HTML
compare() Side-by-side dataset comparison with drift detection
io/ Auto-read CSV/JSON/JSONL/Excel + encoding detection
detect/ Column type detection + Thai months + address parsing + language detection
clean/ Smart cleaning: auto-decide what to fix (encoding, numerals, BE, zwspace)
quality/ Language-aware quality checks + score 0–100 + Thai ID card validation
anomaly/ Statistical + ML + text anomaly detection
ner/ Thai NER: person/place/organization
insight_engine/ 6 cross-column insight patterns (BH-corrected)
viz/ Static + interactive charts with colorblind-safe palette
report/ Executive HTML report + smart pre-analysis
llm/ Privacy-preserving LLM analysis (4 modes, 3 providers)
timeseries/ Trend/seasonality/STL/ACF + Thai holiday awareness
schema/ Multi-file PK/FK discovery + relationship matching

Testing

pytest tests/ -v                    # all tests (691 passed)
ruff check src/ tests/              # lint
ruff format src/ tests/             # format

License

Apache-2.0 © Peet Wannasarnmetha

Project details


Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

thaieda-1.7.1.tar.gz (460.7 kB view details)

Uploaded Source

Built Distribution

If you're not sure about the file name format, learn more about wheel file names.

thaieda-1.7.1-py3-none-any.whl (252.3 kB view details)

Uploaded Python 3

File details

Details for the file thaieda-1.7.1.tar.gz.

File metadata

  • Download URL: thaieda-1.7.1.tar.gz
  • Upload date:
  • Size: 460.7 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/6.2.0 CPython/3.11.9

File hashes

Hashes for thaieda-1.7.1.tar.gz
Algorithm Hash digest
SHA256 cd8d1f8637f7f0ddb7b04c5cb0e10e40d384aa9e208e832083e0f5052bf033a3
MD5 72d02dcb03116ccdb324652eaeb317cb
BLAKE2b-256 c917b4a33749628c289b89977a91e17b6fcf6accbbd4a5ffad96fd1a7f53a45c

See more details on using hashes here.

File details

Details for the file thaieda-1.7.1-py3-none-any.whl.

File metadata

  • Download URL: thaieda-1.7.1-py3-none-any.whl
  • Upload date:
  • Size: 252.3 kB
  • Tags: Python 3
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/6.2.0 CPython/3.11.9

File hashes

Hashes for thaieda-1.7.1-py3-none-any.whl
Algorithm Hash digest
SHA256 8469f60e5be18ffeac16021a976f6ff36d7ff78403197c752ba2b60d70621514
MD5 86722e6a59413fb0c139f598001f1138
BLAKE2b-256 fc16198eb56a75bc422cc8c9b6ea804f6ea7f29801bf420c2fd84742a37f394d

See more details on using hashes here.

Supported by

AWS Cloud computing and Security Sponsor Datadog Monitoring Depot Continuous Integration Fastly CDN Google Download Analytics Pingdom Monitoring Sentry Error logging StatusPage Status page