Skip to main content

AutoEDA สำหรับข้อมูลภาษาไทย — Exploratory data analysis that speaks Thai

Project description

ThaiEDA

Exploratory data analysis that actually understands Thai.

PyPI Python 3.10+ License: Apache-2.0 Tests: 645 passed Code Style: ruff Language aware


What is ThaiEDA?

ThaiEDA is a Python library that automates exploratory data analysis for Thai and mixed Thai/English datasets. You give it a DataFrame, it gives you back a full report — smart pre-analysis, language detection, column types, quality issues, anomalies, cross-column insights, charts, and an executive-style HTML report. All in one line.

It handles the things generic EDA tools miss: Buddhist Era dates, Thai numerals, zero-width spaces, Thai vowel/tone marks, mixed Thai/English cells, mojibake encoding, Thai month names, national ID card validation, Thai address parsing, and PII like phone numbers.


Quick Start

pip install thaieda
import thaieda
import pandas as pd

df = pd.read_csv("data.csv")
result = thaieda.run(df)          # full EDA in one line
result.to_html("report.html")     # self-contained HTML report

pip install thaieda ติดตั้งทุกอย่างเลย — Thai tokenizer, NER, ML, Excel, stats, encoding detection, interactive charts ไม่ต้องใส่ extras


Why ThaiEDA?

Generic tools don't understand Thai data. Pandas Profiling, ydata-profiling, and Sweetviz are great — until you feed them Thai data. They miss Buddhist Era years (พ.ศ.), Thai numerals (๑๒๓), zero-width spaces that break tokenization, and mojibake from TIS-620 encoding. ThaiEDA catches all of these.

Privacy-first LLM analysis. Want to ask an LLM about your data but can't send raw rows to a cloud API? ThaiEDA has 4 privacy modes — the default sends zero raw data off your machine. Perfect for government, finance, and medical data under PDPA.

Insights, not just summaries. A cross-column insight engine finds non-obvious patterns — "column A strongly predicts column B", "this group is 3× higher than average" — ranked by statistical interestingness with Benjamini-Hochberg correction.

Thai-specific validation. National ID card checksum validation, Thai address parsing (province/district/subdistrict), Thai holiday awareness for timeseries spike attribution. No other EDA tool does this.

One line to get everything. thaieda.run(df) chains the full pipeline: type detection → smart cleaning → quality checks → anomaly detection → insight discovery → visualization → HTML report. No config needed.


How It Works

DataFrame
    │
    ▼
┌──────────────────────────────────────────────┐
│  thaieda.run(df)                             │
│                                              │
│  0. pre-analyze → data type + language       │
│  1. detect      → column types + Thai months │
│  2. clean       → smart cleaning (auto-decide)│
│  3. quality     → language-aware checks      │
│  4. anomaly     → statistical + ML + text    │
│  5. insights    → 6 cross-column patterns    │
│  6. viz         → interactive + static charts│
│  7. report      → executive HTML narrative   │

│  + optional: LLM analysis (4 privacy modes)  │
│  + optional: compare(df1, df2) side-by-side  │
└──────────────────────────────────────────────┘
    │
    ▼
EDAResult
  .to_html()        → report.html
  .to_dict()        → Python dict
  .to_json()        → JSON string
  .insights         → insight cards
  .cleaned_df       → cleaned DataFrame
  .quality_issues   → list of issues
  .quality_score    → 0-100 score with grade
  .anomalies        → anomaly findings
  .llm_response     → LLM analysis (if enabled)
  ._repr_html_()    → Jupyter rich display

run_folder("data/")  → FolderResult
  .to_html("dir/")      → individual HTML per file
  .to_master_html()     → single master HTML with sidebar
  .summary()            → text summary
  ._repr_html_()        → Jupyter rich display

What's New

Scale & Performance

Tested across 14 public datasets — from 500 rows to 541K rows, 8 to 171 columns. Every report stays under 2 MB and finishes under 120 seconds.

  • Insight capping — reports surface the 30 most important findings instead of hundreds. Critical insights are always kept; warnings and info fill the rest. The executive summary shows the true count ("679 found, showing top 30").
  • HTML bloat control — dual chart budget (40 charts max, 1.6 MB max). Quality and anomaly tables collapse after 50 rows. Wide tables switch to a summary view past 60 columns.
  • Wide-table fast path — the insight engine samples breakdowns and measures when columns exceed 100. Correlation heatmaps and scatter matrices skip automatically on very wide data.
  • Tall-table fast path — anomaly, quality, and outlier checks sample 50K rows when data exceeds 100K. Correlation computes on a sample. Timeseries decomposition skips past 200K rows.

Data Quality & Cleaning

  • High-NA handling — columns over 80% missing are flagged as mostly_missing with NaN preserved. Columns over 40% get a warning to drop or impute with domain knowledge. Below 40% is unchanged.
  • Smarter type detection — Thai low-cardinality text is classified as categorical, not free text. Text-named columns like review and feedback stay text even with few unique values.
  • Cleaning safeguards — numeric strings like 1.00005 are left alone. Keyboard layout conversion only runs when Thai characters are present. Repeated-character spam on short codes is suppressed.
  • ID/FK awareness — ID columns are excluded from categorical anomaly checks. *_id columns are detected even with low unique ratio. Buddhist Era checks skip IDs. Timeseries excludes ID/FK/code columns from measures.

Reporting

  • Executive briefing format — reports flow from executive summary to key findings, business translation, priority actions, and plain-language explanations.
  • Template pagination — Key Insights shows the top 20 with a collapsible section for the rest. Count badges are preserved.
  • Fewer false positives — fuzzy duplicate guard skips short near-identical labels. Script mixing is skipped on low-cardinality columns. Outliers on heavy-tail distributions (skew > 2.0) are downgraded to info.
  • Folder reportsrun_folder() analyzes CSV, Excel, JSON, JSONL, and TSV folders. FolderResult.to_master_html() builds one master HTML with sidebar navigation.

Smart Pre-Analysis

  • Language detection — Thai, English, mixed, and numeric data detected with confidence and per-column detail.
  • Data type classification — transaction, registry, survey, timeseries, and mixed datasets classified before EDA.
  • Language-aware quality — English-only data skips Thai-specific warnings automatically.

Benchmarks

ThaiEDA is tested on 14 public datasets ranging from 500 rows to 541K rows, 8 to 171 columns. Every dataset produces a report under 2 MB in under 120 seconds.

Dataset Rows Cols Time HTML Insights
titanic 891 12 8 s 0.79 MB 27
telco-churn 7,043 21 11 s 0.84 MB 11
wine-quality 1,599 12 7 s 0.93 MB 29
california-housing 20,640 10 15 s 0.99 MB 30
superstore 10,800 21 31 s 1.46 MB 30
adult 32,561 15 22 s 1.03 MB 29
bank-marketing 41,188 21 21 s 0.94 MB 30
online-retail 541,909 8 81 s 0.96 MB 30
dirty-thai-retail 500 8 2 s 0.51 MB 15
absenteeism 740 21 10 s 1.25 MB 30
online-shoppers 12,330 18 18 s 1.06 MB 30
aps-failure 16,000 171 100 s 0.48 MB 30
beijing-pm25 43,824 13 12 s 0.76 MB 19
bike-sharing 17,379 17 42 s 1.55 MB 30

All 14 datasets pass QA with 0 defects. Datasets from UCI ML Repository and public sources.


Examples

One-Line EDA

import thaieda
import pandas as pd

df = pd.read_csv("data.csv")

# Full pipeline in one call
result = thaieda.run(df)

# Access results
result.to_html("report.html")
print(result.quality_issues)
print(result.insights)

# In Jupyter: just display the result
result  # renders HTML report inline

Folder Mode — Analyze Every File at Once

import thaieda

# One line — analyzes every CSV/Excel/JSON in the folder
results = thaieda.run_folder("data/")

# Print summary
print(results.summary())
# ThaiEDA FolderResult — data/
#   Files: 5 (✅ 5 / ❌ 0)
#   ✅ customers.csv — 10,000 rows × 8 cols, 15 insights
#   ✅ orders.csv    — 50,000 rows × 12 cols, 28 insights
#   ...

# Save individual HTML reports
results.to_html("reports/")

# Generate a single master HTML with sidebar navigation
results.to_master_html("master-report.html")

run_folder() features:

  • Auto-scans for CSV, Excel (.xlsx/.xls), JSON, JSONL, TSV
  • recursive=True to include subfolders
  • output_dir= to specify where HTML goes
  • Error isolation — one broken file doesn't crash the rest
  • progress= callback for progress tracking
  • All run() kwargs supported (lang, clean, llm, etc.)
  • to_master_html() — combines all reports into one page with sidebar nav + summary table

With LLM Analysis (Privacy-Safe)

import thaieda

# Default: zero raw data leaves your machine
result = thaieda.run(df, llm=True, privacy="insight_only", provider="ollama")
print(result.llm_response)

Compare Two Datasets

from thaieda.compare import compare_datasets

diff = compare_datasets(df_train, df_test, labels=("train", "test"))
print(diff["schema_diff"])      # columns added/removed
print(diff["drift"]["numeric"]) # KS statistic per column

Thai ID Card Validation

from thaieda.quality import validate_thai_id, validate_thai_id_column

# Single ID
validate_thai_id("1-1234-56789-01-2")  # → True/False

# Entire column
result = validate_thai_id_column(df["id_card"])
print(f"Valid: {result['valid_count']}, Invalid: {result['invalid_count']}")

Thai Address Parsing

from thaieda.detect import parse_thai_address

addr = parse_thai_address("123 หมู่ 4 ต.บางบัว อ.บางบัว จ.กรุงเทพฯ 10230")
print(addr)
# {'house_number': '123', 'moo': '4', 'subdistrict': 'บางบัว',
#  'district': 'บางบัว', 'province': 'กรุงเทพฯ', 'postal_code': '10230'}

Language Detection

import pandas as pd
from thaieda.detect import _detect_language

df = pd.DataFrame({
    "product": ["กาแฟ", "ชาไทย", "ขนม"],
    "review": ["อร่อยมาก 5/5 stars", "ดีครับ", "ไม่ดี"],
    "sku": ["SKU001", "SKU002", "SKU003"],
})

info = _detect_language(df)
print(info["language"], info["confidence"])
print(info["columns"])
# thai/mixed/english/numeric + per-column language map

Language Detection v2 features:

  • Unicode Thai block analysis (U+0E00–U+0E7F) including vowels/tone marks (U+0E30–U+0E4D)
  • Zero-width-space aware (\u200b, BOM, word joiner)
  • Mixed-cell detection เช่น "อร่อยมาก 5/5 stars"
  • Common Thai word hints: ครับ, ค่ะ, ไทย, อร่อย, ดี, ไม่, มี, และ
  • Lazy pythainlp tokenizer when installed; regex fallback when unavailable
  • Per-column column_details + dataset-level confidence (0.0–1.0)
  • Sample-based scan (first 500 rows/column) for large DataFrames

Smart Pre-Analysis

ThaiEDA profiles the dataset before running the full report, so the narrative and quality checks match the data:

from thaieda.report import _detect_data_type

pre = _detect_data_type(df)
print(pre["label"], pre["language"]["language"])
print(pre["focus"])

Smart pre-analysis detects:

  • Transaction data — orders, payments, revenue, invoices
  • Registry/master data — customers, products, stores, entity attributes
  • Survey/review data — ratings, comments, feedback text
  • Timeseries data — datetime index/columns + numeric measures
  • Mixed data — conservative fallback when signals overlap
  • Language impact — Thai/mixed data enables Thai-specific checks; English-only data skips พ.ศ./เลขไทย checks automatically

Data Quality Score

from thaieda.quality import compute_quality_score

score = compute_quality_score(quality_issues, n_columns=10, n_rows=1000)
print(f"Score: {score.score}/100 ({score.grade})")
# Score: 85/100 (B)

Smart Cleaning

from thaieda.clean._smart import plan_cleaning

plan = plan_cleaning(df)
print(f"Actions: {plan.actions}")    # ['zwspace', 'numerals', 'duplicates']
print(f"Skipped: {plan.skipped}")    # ['encoding', 'whitespace']

Privacy Modes

Control exactly what data leaves your machine when using LLM analysis:

Mode What Leaves Guarantee When to Use
insight_only (default) Stats + insights only Raw data never leaves Government, medical, PDPA data
anonymized Data with PII → tokens Names/phones/ID cards masked Need structure without raw PII
dp_noise Stats + Laplace noise Prevents re-identification Small datasets where stats leak
full Everything None — you accept the risk Public data, demos

What ThaiEDA Catches

Problem Example What Happens
Buddhist Era dates 15/03/2567 Auto-detects พ.ศ. → converts to CE
Thai numerals ๑๒๓ in numeric column Converts to 123
Zero-width spaces สม\u200bชาย Strips invisible chars and reports language evidence
Thai vowel/tone marks อร่อยค่ะ Counts U+0E30–U+0E4D for better Thai detection
Mixed Thai/English cells อร่อยมาก 5/5 stars Detects as mixed language instead of English/numeric
Thai text in English-heavy tables Thai product column + English IDs Column-level language detection preserves Thai checks
Common Thai words ครับ, ค่ะ, ไม่ดี Boosts confidence for short Thai text
Mojibake encoding à ¬Â¸Â¡Â¹ Auto-detects TIS-620 → UTF-8
Thai month names มกราคม Parses to ISO date
Phone numbers 081-234-5678 Detects + normalizes
National ID cards 1-1234-56789-01-2 Checksum validation
Thai addresses 123 ม.4 ต.บางบัว อ.บางบัว จ.กรุงเทพฯ Parses to structured fields
Placeholder values -, N/A, ไม่มี Flags as missing
Constant columns All same value Flags as useless
Smart data type Orders/reviews/timeseries Pre-classifies transaction/registry/survey/timeseries/mixed
Language-aware checks English-only DataFrame Skips Thai-specific พ.ศ./เลขไทย warnings automatically
Thai holidays Spike on Dec 5 Attributes to Father's Day
ID/FK semantics order_id, store_id Detected as ID even with low unique ratio; excluded from category anomaly
BE on ID order_id=2531 No longer flagged as พ.ศ. (BE check requires date-like column)
Numeric string preservation 1.00005 in numeric data fix_repeated_chars skips decimal/numeric strings
Keyboard layout guard Floyd in English column Not converted to Thai (requires Thai chars in column first)
Payment method detection payment_method column Classified as categorical, not amount column
Index artifact cleanup Unnamed: 0 column Ignored in analysis, flagged as index artifact
Per-column missing values Age 20% missing Quality issue with severity threshold (warning >5%, info 1-5%)
CSV delimiter warning ;-delimited file read as 1 column Warns to re-read with sep=';'

Visualization

ThaiEDA generates both static (matplotlib) and interactive (Plotly) charts:

  • Static: correlation heatmap, distribution, box/violin, missing matrix, scatter matrix, wordcloud, timeseries, pair plot, KDE, QQ plot, sunburst
  • Interactive: hover tooltips, zoom, pan — using Plotly with Thai font (Sarabun) via Google Fonts
  • Color palette: Okabe-Ito colorblind-safe (7 colors)
  • Thai font: auto-detected for matplotlib, CSS-loaded for Plotly
from thaieda.viz._interactive import create_correlation_heatmap_interactive

html_div = create_correlation_heatmap_interactive(df)  # → HTML <div> for reports

Installation

# ติดตั้งทุกอย่างในคำสั่งเดียว
pip install thaieda

ไม่ต้องใส่ extras — pip install thaieda ติดตั้งทั้งหมด: Thai tokenizer, NER, ML, interactive charts, Excel, stats, encoding detection

LLM providers ยังเป็น optional (lazy-imported — ไม่ต้องติดตั้งถ้าไม่ใช้):

pip install openai       # OpenAI GPT
pip install anthropic    # Anthropic Claude
pip install ollama       # Ollama local LLM (หรือใช้ HTTP fallback ไม่ต้องติดตั้ง)

Requirements: Python 3.10+


Modules

Module What It Does
run() / EDA() One-liner API — full pipeline in one call
run_folder() Analyze every CSV/Excel/JSON in a folder + master HTML
compare() Side-by-side dataset comparison with drift detection
io/ Auto-read CSV/JSON/JSONL/Excel + encoding detection
detect/ Column type detection + Thai month names + address parsing + language detection v2
clean/ Smart cleaning: auto-decide what to fix (encoding, numerals, BE, zwspace)
quality/ Language-aware quality checks + score 0-100 + Thai ID card validation
anomaly/ Statistical + ML + text anomaly detection
ner/ Thai NER: person/place/organization
insight_engine/ 6 cross-column insight patterns (BH-corrected)
viz/ Static + interactive charts with colorblind-safe palette
report/ Executive HTML report + smart pre-analysis (_detect_data_type)
llm/ Privacy-preserving LLM analysis (4 modes, 3 providers)
timeseries/ Trend/seasonality/STL/ACF + Thai holiday awareness
schema/ Multi-file PK/FK discovery + relationship matching

Testing

pytest tests/ -v                    # all tests (631 passed)
pytest tests/test_language_detection.py  # language detection + language-aware quality
pytest tests/test_thai_id.py        # ID card validation
pytest tests/test_thai_address.py   # address parsing
pytest tests/test_compare.py        # dataset comparison
pytest tests/test_run_folder.py     # folder mode + master HTML
pytest tests/test_llm.py            # LLM + privacy modes
ruff check src/ tests/              # lint
ruff format src/ tests/             # format

License

Apache-2.0 © Peet Wannasarnmetha

Project details


Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

thaieda-1.5.0.tar.gz (448.8 kB view details)

Uploaded Source

Built Distribution

If you're not sure about the file name format, learn more about wheel file names.

thaieda-1.5.0-py3-none-any.whl (244.0 kB view details)

Uploaded Python 3

File details

Details for the file thaieda-1.5.0.tar.gz.

File metadata

  • Download URL: thaieda-1.5.0.tar.gz
  • Upload date:
  • Size: 448.8 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/6.2.0 CPython/3.11.9

File hashes

Hashes for thaieda-1.5.0.tar.gz
Algorithm Hash digest
SHA256 d6241c7f4dedc302cfdcaa6d7116ba8aae3e0f08f3f3ae4c7bb111742ab481b7
MD5 29c306fedd3fef4b0fcaa8ab1a1dd7f1
BLAKE2b-256 d32b99145a1848b893bc7b1b45affa5a2731e6549b8919848d487a5b88440036

See more details on using hashes here.

File details

Details for the file thaieda-1.5.0-py3-none-any.whl.

File metadata

  • Download URL: thaieda-1.5.0-py3-none-any.whl
  • Upload date:
  • Size: 244.0 kB
  • Tags: Python 3
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/6.2.0 CPython/3.11.9

File hashes

Hashes for thaieda-1.5.0-py3-none-any.whl
Algorithm Hash digest
SHA256 c6f291c2feade86be10b1149c9b4f5beb23d93a9e06f4532edfd3c789da5b52c
MD5 6e01861a6cf88b1d549a2da8d9a6858d
BLAKE2b-256 e992a52cecff1e9084213d8533d9c33db76fc362bf0c791e5fcccca17007a587

See more details on using hashes here.

Supported by

AWS Cloud computing and Security Sponsor Datadog Monitoring Depot Continuous Integration Fastly CDN Google Download Analytics Pingdom Monitoring Sentry Error logging StatusPage Status page