AutoEDA for Thai-language data: exploratory data analysis that speaks Thai

ThaiEDA

Exploratory data analysis that actually understands Thai.

PyPI Python 3.10+ License: Apache-2.0 Tests: 691 passed Code Style: ruff Language aware


Quick Start

pip install thaieda

import thaieda
import pandas as pd

df = pd.read_csv("data.csv")
result = thaieda.run(df)          # full EDA in one line
result.to_html("report.html")     # self-contained HTML report

That's it. pip install thaieda installs everything (Thai tokenizer, NER, ML, Excel, stats, encoding detection, interactive charts), with no extras to specify.


Why ThaiEDA?

You already have ydata-profiling and sweetviz. Here's why you'd reach for ThaiEDA instead:

1. Thai text doesn't break. Generic tools render Thai as tofu boxes (□□□) in every chart. They miss Buddhist Era dates (พ.ศ. 2567), Thai numerals (๑๒๓), zero-width spaces, and mojibake from TIS-620 encoding. ThaiEDA detects and fixes all of these automatically — no font config, no manual cleanup.

2. Insights, not just stats. ydata-profiling gives you distributions and correlation matrices. ThaiEDA finds actionable cross-column patterns — "column A strongly predicts column B", "this group is 3× higher than average" — ranked by statistical interestingness with Benjamini-Hochberg correction. Plus anomaly detection, quality scoring, and data type classification.

3. One call, everything done. run(df) chains the full pipeline: type detection → smart cleaning → quality checks → anomaly detection → insight discovery → visualization → HTML report. With ydata you'd still need a separate anomaly detector, a Thai font config, a cleaner, and manual interpretation.

4. Privacy-first LLM. Ask an LLM about your data without sending raw rows to a cloud API. 4 privacy modes — the default sends zero raw data. PDPA-ready.

5. Smaller reports on big data. ydata-profiling produces a 71 MB HTML on a 171-column dataset. ThaiEDA produces 0.48 MB — 148× smaller — because it caps charts, collapses tables, and samples intelligently on wide/tall data.
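Point 2 mentions Benjamini-Hochberg correction for ranking insights. As a rough illustration of what that procedure does (this is not ThaiEDA's implementation, which the README does not show), the sketch below marks which p-values survive at a chosen false discovery rate. The function name `benjamini_hochberg` and the default `alpha` are assumptions made for this example.

```python
def benjamini_hochberg(p_values, alpha=0.05):
    """Return one boolean per p-value: True if rejected at FDR level alpha."""
    m = len(p_values)
    if m == 0:
        return []
    # Rank hypotheses by ascending p-value.
    order = sorted(range(m), key=lambda i: p_values[i])
    # Step-up rule: find the largest rank k with p_(k) <= (k / m) * alpha.
    max_k = 0
    for rank, idx in enumerate(order, start=1):
        if p_values[idx] <= rank / m * alpha:
            max_k = rank
    # Reject every hypothesis ranked at or below max_k.
    rejected = [False] * m
    for rank, idx in enumerate(order, start=1):
        rejected[idx] = rank <= max_k
    return rejected


flags = benjamini_hochberg([0.001, 0.008, 0.039, 0.041, 0.60])
print(flags)  # [True, True, False, False, False]
```

Because the rule is step-up, a p-value that misses its own threshold can still be rejected when a larger p-value further down the ranking meets its threshold.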


How It Works

DataFrame → thaieda.run(df) → EDAResult

  Step 0  pre-analyze    data type + language detection
  Step 1  detect         column types + Thai months + addresses
  Step 2  clean          smart cleaning (auto-decide what to fix)
  Step 3  quality        language-aware checks + 0–100 score
  Step 4  anomaly        IQR + ML + text anomaly detection
  Step 5  insights       6 cross-column patterns (BH-corrected)
  Step 6  viz            static (matplotlib) + interactive (Plotly)
  Step 7  report         executive HTML narrative

  + optional: LLM analysis (4 privacy modes)
  + optional: run_folder("data/") → multi-file master HTML
  + optional: compare(df1, df2) → drift detection

result = thaieda.run(df)

result.to_html()         # → report.html (self-contained)
result.to_dict()          # → Python dict
result.to_json()          # → JSON string
result.insights           # → insight cards
result.cleaned_df         # → cleaned DataFrame
result.quality_issues     # → list of issues
result.quality_score      # → 0–100 score with grade
result.anomalies          # → anomaly findings
result.llm_response       # → LLM analysis (if enabled)
result                    # → Jupyter rich display
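Step 4 of the pipeline lists IQR as one of its anomaly detectors. The README does not show how ThaiEDA applies it, so the following is only a minimal sketch of the textbook rule (values outside 1.5 × IQR of the quartiles), using the standard library. The helper name `iqr_outliers` and the multiplier `k` are assumptions for illustration.

```python
import statistics


def iqr_outliers(values, k=1.5):
    """Return the values lying outside [Q1 - k*IQR, Q3 + k*IQR]."""
    # statistics.quantiles with n=4 returns the three quartile cut points.
    q1, _, q3 = statistics.quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    lower, upper = q1 - k * iqr, q3 + k * iqr
    return [v for v in values if v < lower or v > upper]


sales = [10, 12, 11, 13, 12, 11, 95]
print(iqr_outliers(sales))  # [95]
```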

Benchmarks — ThaiEDA vs ydata-profiling vs sweetviz

We ran all three on 6 representative datasets (small/large/wide, Thai + non-Thai):

Capability comparison

Feature ydata-profiling sweetviz ThaiEDA
Standalone HTML report
Cross-column insights ✅ 6 patterns + BH correction
Anomaly detection ✅ IQR + ML + text
Quality score (0–100)
Language detection ✅ Thai/English/mixed
Thai font in charts ❌ tofu ❌ tofu ✅ Sarabun auto
Buddhist Era (พ.ศ.) ✅ → CE
Thai numerals (๑๒๓) ✅ → 123
Zero-width space fix
Mojibake repair
Smart cleaning ✅ auto-decide
Thai NER
Privacy LLM modes ✅ 4 modes (PDPA)
Folder mode run_folder()

Speed & report size

Dataset | Rows | Cols | ydata time | ydata size | sweetviz time | sweetviz size | ThaiEDA time | ThaiEDA size
titanic | 891 | 12 | 5.3s | 1.95 MB | 3.3s | 0.92 MB | 8.2s | 0.82 MB
superstore | 10,800 | 21 | 9.3s | 5.16 MB | 5.4s | 1.49 MB | 26.0s | 1.50 MB
adult | 32,561 | 15 | 5.4s | 1.65 MB | 8.0s | 1.26 MB | 17.2s | 1.05 MB
dirty-thai-retail | 500 | 8 | 3.1s | 0.90 MB | 2.1s | 0.68 MB | 2.1s | 0.53 MB
wisesight | 26,737 | 2 | 2.6s | 0.68 MB | 0.8s | 0.50 MB | 18.8s | 0.42 MB
aps-failure | 16,000 | 171 | 99.8s | 71.2 MB | 15.8s | 8.2 MB | 93.0s | 0.48 MB

Quality benchmark — synthetic dataset with 10 known issues

We injected 10 known defects into a 2,000-row synthetic dataset (including outliers, missing values, duplicates, constants, placeholders, Buddhist Era dates, Thai numerals, zero-width spaces, and mojibake) and measured how many each tool detected. All tools were processed identically: HTML output was stripped to plain text and the same keyword detection was applied uniformly.

Table A — General EDA quality (6 issues all tools can reasonably detect)

Metric | ydata-profiling (default) | sweetviz | ThaiEDA
GTR (Ground-Truth Recall) | 100% | 83% | 100%
ITB (Issue Type Breadth, 11 categories) | 73% | 64% | 91%
RC (Report Completeness, 10 sections) | 70% | 50% | 100%
Time | 45s | 3s | 16s
HTML size | 7.2 MB | 0.9 MB | 1.5 MB

On general EDA, ThaiEDA and ydata-profiling both achieve 100% recall. ThaiEDA wins on breadth (91% vs 73%) and report completeness (100% vs 70%), while producing a 5× smaller report. sweetviz misses constant column detection.

Table B — Thai-specific detection (4 issues — competitors don't claim Thai support)

Thai issue | ydata | sweetviz | ThaiEDA
Buddhist Era dates (พ.ศ.) | 25%* | 0% | ✅ detected
Thai numerals (๑๒๓) | 0% | 0% | ✅ detected
Zero-width spaces | 0% | 0% | ❌ missed
Mojibake (TIS-620) | 0% | 0% | ✅ detected
Thai GTR | 25% | 0% | 75%

* ydata detected BE dates via generic encoding keywords, not Thai-specific recognition.

sweetviz scores 0% on the Thai issues and ydata-profiling scores 25%, which is expected: neither tool claims Thai language support. ThaiEDA's 75% (3/4) reflects its purpose-built Thai detection engine. The one miss (zero-width space in category_text) is a known gap being addressed.


What ThaiEDA Catches

Thai-specific problems

Problem | Example | What ThaiEDA does
Buddhist Era dates | 15/03/2567 | Detects พ.ศ. → converts to CE
Thai numerals | ๑๒๓ in numeric column | Converts to 123
Zero-width spaces | สม\u200bชาย | Strips invisible chars + reports
Thai vowel/tone marks | อร่อยค่ะ | Counts U+0E30–U+0E4D for detection
Mixed Thai/English cells | อร่อยมาก 5/5 stars | Detects as mixed, not English/numeric
Thai month names | มกราคม | Parses to ISO date
Mojibake encoding | à ¬Â¸Â¡Â¹ | Auto-detects TIS-620 → UTF-8
National ID cards | 1-1234-56789-01-2 | Checksum validation
Thai addresses | 123 ม.4 ต.บางบัว อ.บางบัว จ.กรุงเทพฯ | Parses to structured fields
Phone numbers | 081-234-5678 | Detects + normalizes
Thai holidays | Spike on Dec 5 | Attributes to Father's Day
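To make a few of the rows above concrete, here is a standard-library sketch of four of the fixes: Thai numerals, zero-width characters, Buddhist Era dates, and TIS-620 mojibake. It is not ThaiEDA's code. All function names, the `min_be_year` heuristic, and the assumption that the mojibake came from TIS-620 bytes misread as Latin-1 are illustrative choices for this example.

```python
import datetime
import re

# Thai digits (U+0E50 to U+0E59) mapped to ASCII digits.
THAI_DIGITS = str.maketrans("๐๑๒๓๔๕๖๗๘๙", "0123456789")
# Invisible characters to delete: ZWSP, ZWNJ, ZWJ, BOM.
ZERO_WIDTH = {0x200B: None, 0x200C: None, 0x200D: None, 0xFEFF: None}
BE_OFFSET = 543  # Buddhist Era year = Common Era year + 543


def thai_digits_to_ascii(text):
    return text.translate(THAI_DIGITS)


def strip_zero_width(text):
    return text.translate(ZERO_WIDTH)


def be_date_to_iso(text, min_be_year=2400):
    """Convert dd/mm/yyyy with a BE year to an ISO CE date, else None."""
    match = re.fullmatch(r"(\d{1,2})/(\d{1,2})/(\d{4})", text.strip())
    if match is None:
        return None
    day, month, year = (int(g) for g in match.groups())
    if year < min_be_year:  # heuristic: the year already looks like CE
        return None
    try:
        return datetime.date(year - BE_OFFSET, month, day).isoformat()
    except ValueError:  # impossible calendar date
        return None


def repair_tis620_mojibake(text):
    """Undo TIS-620 bytes that were wrongly decoded as Latin-1."""
    try:
        return text.encode("latin-1").decode("tis-620")
    except (UnicodeEncodeError, UnicodeDecodeError):
        return text  # not this kind of mojibake; leave untouched


print(thai_digits_to_ascii("๑๒๓"))   # 123
print(be_date_to_iso("15/03/2567"))  # 2024-03-15
```

Text that is already valid Thai cannot be encoded as Latin-1, so `repair_tis620_mojibake` returns it unchanged rather than corrupting it.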

Data quality & intelligence

Problem | What ThaiEDA does
Placeholder values (-, N/A, ไม่มี) | Flags as missing
Constant columns | Flags as useless
High-NA columns (>80%) | Flags mostly_missing, preserves NaN
Missing % per column | Severity threshold (warning >5%, info 1–5%)
Smart data type | Pre-classifies transaction/registry/survey/timeseries/mixed
Language-aware checks | English-only data skips the Thai Buddhist Era (พ.ศ.) and Thai numeral (เลขไทย) warnings
ID/FK semantics | order_id, store_id excluded from category anomaly
Numeric string preservation | 1.00005 left alone, not treated as "spam"
Keyboard layout guard | Floyd in English column not converted to Thai
Index artifacts | Unnamed: 0 ignored + flagged
CSV delimiter mismatch | ;-delimited file warns to re-read

Opt-in operations (not in default pipeline)

Operation | Function | Effect
Abbreviation expansion | expand_abbreviations() | กทม. → กรุงเทพมหานคร
Spell correction | spell_correct() | ขอบคุน → ขอบคุณ
NFKC normalization | normalize_nfkc() | Ａ→A, ９→9 (fullwidth to ASCII)
Fast tokenizer | engine="auto-fast" | nlpo3 (Rust, 3–4× faster)
Quality tokenizer | engine="auto-quality" | AttaCut (neural, better for OOV)
Keyboard layout anomaly | report-only | Detects suspicious Latin/Thai mixing
Grapheme validation | report-only | Detects abnormal stacked tone marks
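For readers unfamiliar with NFKC, the effect listed in the table can be reproduced with Python's built-in `unicodedata` module. This is a sketch of the Unicode operation itself, not of ThaiEDA's normalize_nfkc(), whose signature the README does not show. The helper name `nfkc` is hypothetical.

```python
import unicodedata


def nfkc(text):
    """Apply Unicode NFKC (compatibility composition) normalization."""
    return unicodedata.normalize("NFKC", text)


print(nfkc("Ａ９"))  # A9  (fullwidth letter and digit become ASCII)
```

One thing worth checking before enabling this on Thai text: NFKC also rewrites THAI CHARACTER SARA AM (U+0E33) into the two code points U+0E4D U+0E32, so strings can change even when they contain no fullwidth characters.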

Scale & Performance

Tested across 19 public datasets — 500 to 541K rows, 2 to 171 columns:

  • Insight capping — surfaces the 30 most important findings. Executive summary shows the true count ("679 found, showing top 30").
  • HTML bloat control — 40 charts max, 1.6 MB max. Quality/anomaly tables collapse after 50 rows. Wide tables switch to summary view past 60 columns.
  • Wide-table fast path — insight engine samples when columns exceed 100. Heatmaps and scatter matrices skip on very wide data.
  • Tall-table fast path — anomaly/quality/outlier checks sample 50K rows when data exceeds 100K. Timeseries decomposition skips past 200K rows.
  • High-NA handling — columns >80% missing flagged as mostly_missing. >40% gets a warning. <40% unchanged.
  • Smarter type detection — Thai low-cardinality text → categorical, not free text. review/feedback stay text.
  • Cleaning safeguards — numeric strings untouched. Keyboard conversion only when Thai chars present.

Examples

One-Line EDA

import thaieda
import pandas as pd

df = pd.read_csv("data.csv")
result = thaieda.run(df)

result.to_html("report.html")
print(result.quality_issues)
print(result.insights)

# In Jupyter: just display the result
result  # renders HTML inline

Folder Mode — Analyze Every File at Once

import thaieda

results = thaieda.run_folder("data/")

print(results.summary())
# ThaiEDA FolderResult — data/
#   Files: 5 (✅ 5 / ❌ 0)
#   ✅ customers.csv — 10,000 rows × 8 cols, 15 insights
#   ...

results.to_html("reports/")
results.to_master_html("master-report.html")  # single HTML with sidebar

Supported formats: CSV, Excel (.xlsx/.xls), JSON, JSONL, TSV. recursive=True for subfolders. Error isolation — one broken file doesn't crash the rest.

LLM Analysis (Privacy-Safe)

result = thaieda.run(df, llm=True, privacy="insight_only", provider="ollama")
print(result.llm_response)
# Default: zero raw data leaves your machine

Mode | What Leaves | When to Use
insight_only (default) | Stats + insights only | Government, medical, PDPA
anonymized | PII → tokens | Need structure without raw data
dp_noise | Stats + Laplace noise | Small datasets where stats leak
full | Everything | Public data, demos
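The dp_noise mode is described as adding Laplace noise to statistics. The README does not say how the noise is calibrated, so the sketch below shows only the textbook Laplace mechanism applied to a bounded mean. The function names, the clipping bounds, and the `epsilon` parameter are assumptions for illustration, not ThaiEDA's API.

```python
import random


def laplace_noise(scale, rng):
    """Draw Laplace(0, scale) noise as the difference of two exponentials."""
    return rng.expovariate(1.0 / scale) - rng.expovariate(1.0 / scale)


def noisy_mean(values, lower, upper, epsilon, rng):
    """Mean of values clipped to [lower, upper], plus Laplace noise."""
    clipped = [min(max(v, lower), upper) for v in values]
    true_mean = sum(clipped) / len(clipped)
    # Changing one record moves the clipped mean by at most this much.
    sensitivity = (upper - lower) / len(clipped)
    return true_mean + laplace_noise(sensitivity / epsilon, rng)


ages = [23, 35, 41, 29, 52]
print(noisy_mean(ages, lower=0, upper=100, epsilon=1.0, rng=random.Random(0)))
```

A smaller `epsilon` means more noise and stronger privacy. With only five rows the noise is large relative to the mean, which matches the table's note that small datasets are where plain statistics leak the most.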

Compare Two Datasets

from thaieda.compare import compare_datasets

diff = compare_datasets(df_train, df_test, labels=("train", "test"))
print(diff["schema_diff"])       # columns added/removed
print(diff["drift"]["numeric"])  # KS statistic per column
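The comment above says numeric drift is reported as a KS statistic per column. As a reminder of what that number means, here is a dependency-free sketch of the two-sample Kolmogorov-Smirnov statistic: the largest gap between the two empirical CDFs. It is an illustration only. `ks_statistic` is a hypothetical helper, and ThaiEDA may well compute this differently.

```python
def ks_statistic(a, b):
    """Two-sample KS statistic: max distance between empirical CDFs."""
    if not a or not b:
        raise ValueError("both samples must be non-empty")
    a, b = sorted(a), sorted(b)
    n, m = len(a), len(b)
    i = j = 0
    d = 0.0
    while i < n and j < m:
        x = min(a[i], b[j])
        # Advance both pointers past every value <= x (handles ties).
        while i < n and a[i] <= x:
            i += 1
        while j < m and b[j] <= x:
            j += 1
        d = max(d, abs(i / n - j / m))
    return d


print(ks_statistic([1, 2, 3], [1, 2, 3]))     # 0.0
print(ks_statistic([1, 2, 3], [10, 11, 12]))  # 1.0
```

A value near 0 means the train and test distributions overlap almost completely; a value near 1 means they barely overlap.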

Thai ID Card Validation

from thaieda.quality import validate_thai_id, validate_thai_id_column

validate_thai_id("1-1234-56789-01-2")           # → True/False
result = validate_thai_id_column(df["id_card"]) # entire column
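For context, the widely documented checksum for 13-digit Thai national IDs weights the first 12 digits by 13 down to 2 and derives the last digit from the sum modulo 11. The sketch below implements that rule with a hypothetical helper, `thai_id_checksum_ok`. Whether validate_thai_id applies additional rules is not stated in this README.

```python
def thai_id_checksum_ok(raw):
    """Check the mod-11 check digit of a 13-digit Thai national ID."""
    cleaned = raw.replace("-", "").replace(" ", "")
    if len(cleaned) != 13 or any(c not in "0123456789" for c in cleaned):
        return False
    digits = [int(c) for c in cleaned]
    # Weights 13, 12, ..., 2 for the first twelve digits.
    total = sum(d * w for d, w in zip(digits[:12], range(13, 1, -1)))
    expected = (11 - total % 11) % 10
    return digits[12] == expected


print(thai_id_checksum_ok("1-1234-56789-01-4"))  # True
print(thai_id_checksum_ok("1-1234-56789-01-2"))  # False
```

Under this rule the placeholder ID used in the example above, 1-1234-56789-01-2, does not validate (the check digit would need to be 4), so it is a reasonable input for exercising the failure path.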

Thai Address Parsing

from thaieda.detect import parse_thai_address

addr = parse_thai_address("123 หมู่ 4 ต.บางบัว อ.บางบัว จ.กรุงเทพฯ 10230")
# {'house_number': '123', 'moo': '4', 'subdistrict': 'บางบัว',
#  'district': 'บางบัว', 'province': 'กรุงเทพฯ', 'postal_code': '10230'}
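To show the kind of pattern matching such a parser involves, here is a deliberately small regex sketch that extracts the same six fields from the example string. It is not how parse_thai_address is implemented. The patterns, the `FIELD_PATTERNS` table, and the function `parse_address_sketch` are all assumptions, and real addresses have many more variants than this handles.

```python
import re

# One capture group per field; prefixes cover short and long Thai forms.
FIELD_PATTERNS = {
    "house_number": r"^\s*(\d+(?:/\d+)?)",
    "moo": r"(?:หมู่(?:ที่)?|ม\.)\s*(\d+)",
    "subdistrict": r"(?:ต\.|ตำบล)\s*(\S+)",
    "district": r"(?:อ\.|อำเภอ)\s*(\S+)",
    "province": r"(?:จ\.|จังหวัด)\s*(\S+)",
    "postal_code": r"(\d{5})\s*$",
}


def parse_address_sketch(text):
    """Return a dict of fields; a field is None when its pattern is absent."""
    result = {}
    for field, pattern in FIELD_PATTERNS.items():
        match = re.search(pattern, text)
        result[field] = match.group(1) if match else None
    return result


addr = parse_address_sketch("123 หมู่ 4 ต.บางบัว อ.บางบัว จ.กรุงเทพฯ 10230")
print(addr["moo"], addr["postal_code"])  # 4 10230
```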

Language Detection

import pandas as pd
from thaieda.detect import _detect_language

df = pd.DataFrame({
    "product": ["กาแฟ", "ชาไทย", "ขนม"],
    "review": ["อร่อยมาก 5/5 stars", "ดีครับ", "ไม่ดี"],
    "sku": ["SKU001", "SKU002", "SKU003"],
})

info = _detect_language(df)
print(info["language"], info["confidence"])
# thai/mixed/english/numeric + per-column language map

Features: Unicode Thai block analysis (U+0E00–U+0E7F), zero-width-space aware, mixed-cell detection, common Thai word hints, per-column column_details + dataset-level confidence (0.0–1.0), sample-based scan for large DataFrames.
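The Unicode block analysis mentioned above can be approximated in a few lines: count characters in U+0E00–U+0E7F against ASCII letters and classify by the ratio. This sketch is not the library's detector. The function names and the 0.9 / 0.1 thresholds are assumptions picked for illustration.

```python
def thai_ratio(text):
    """Share of Thai characters among Thai + ASCII letters, or None."""
    thai = sum(1 for ch in text if "\u0e00" <= ch <= "\u0e7f")
    latin = sum(1 for ch in text if ch.isascii() and ch.isalpha())
    total = thai + latin
    return thai / total if total else None


def classify_cell(text, high=0.9, low=0.1):
    """Label a cell as thai, english, mixed, numeric, or unknown."""
    ratio = thai_ratio(text)
    if ratio is None:
        return "numeric" if any(ch.isdigit() for ch in text) else "unknown"
    if ratio >= high:
        return "thai"
    if ratio <= low:
        return "english"
    return "mixed"


for cell in ["อร่อยมาก 5/5 stars", "ดีครับ", "SKU001", "12345"]:
    print(cell, "->", classify_cell(cell))
```

Digits and punctuation are ignored when computing the ratio, which is why a review such as "อร่อยมาก 5/5 stars" comes out as mixed rather than numeric.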

Smart Pre-Analysis

from thaieda.report import _detect_data_type

pre = _detect_data_type(df)
print(pre["label"], pre["language"]["language"])
# Detects: transaction, registry, survey, timeseries, or mixed

Data Quality Score

from thaieda.quality import compute_quality_score

quality_issues = result.quality_issues  # assumption: the issue list from thaieda.run(df)
score = compute_quality_score(quality_issues, n_columns=10, n_rows=1000)
print(f"Score: {score.score}/100 ({score.grade})")  # Score: 85/100 (B)

Smart Cleaning

from thaieda.clean._smart import plan_cleaning

plan = plan_cleaning(df)
print(plan.actions)   # ['zwspace', 'numerals', 'duplicates']
print(plan.skipped)   # ['encoding', 'whitespace']

Visualization

Both static and interactive charts, all with Thai font support:

  • Static (matplotlib): correlation heatmap, distribution, box/violin, missing matrix, scatter matrix, wordcloud, timeseries, pair plot, KDE, QQ plot, sunburst
  • Interactive (Plotly): hover tooltips, zoom, pan — Thai font (Sarabun) via Google Fonts
  • Color palette: Okabe-Ito colorblind-safe (7 colors)

from thaieda.viz._interactive import create_correlation_heatmap_interactive

html_div = create_correlation_heatmap_interactive(df)  # → HTML <div>

Installation

pip install thaieda

No extras needed. One install covers everything: Thai tokenizer, NER, ML, interactive charts, Excel, stats, encoding detection.

LLM providers (optional, lazy-imported):

pip install openai       # OpenAI GPT
pip install anthropic    # Anthropic Claude
pip install ollama       # Ollama local LLM (or use the HTTP fallback)

Requirements: Python 3.10+


Modules

Module | What It Does
run() / EDA() | One-liner API: full pipeline in one call
run_folder() | Analyze every CSV/Excel/JSON in a folder + master HTML
compare() | Side-by-side dataset comparison with drift detection
io/ | Auto-read CSV/JSON/JSONL/Excel + encoding detection
detect/ | Column type detection + Thai months + address parsing + language detection
clean/ | Smart cleaning: auto-decide what to fix (encoding, numerals, BE, zwspace)
quality/ | Language-aware quality checks + score 0–100 + Thai ID card validation
anomaly/ | Statistical + ML + text anomaly detection
ner/ | Thai NER: person/place/organization
insight_engine/ | 6 cross-column insight patterns (BH-corrected)
viz/ | Static + interactive charts with colorblind-safe palette
report/ | Executive HTML report + smart pre-analysis
llm/ | Privacy-preserving LLM analysis (4 modes, 3 providers)
timeseries/ | Trend/seasonality/STL/ACF + Thai holiday awareness
schema/ | Multi-file PK/FK discovery + relationship matching

Testing

pytest tests/ -v                    # all tests (691 passed)
ruff check src/ tests/              # lint
ruff format src/ tests/             # format

License

Apache-2.0 © Peet Wannasarnmetha
