AutoEDA for Thai-language data: exploratory data analysis that speaks Thai
ThaiEDA
Exploratory data analysis that actually understands Thai.
What is ThaiEDA?
ThaiEDA is a Python library that automates exploratory data analysis for Thai-language datasets. Pass it a DataFrame and it returns a full analysis: column types, quality issues, anomalies, cross-column insights, charts, and an HTML report, all from a single call.
It handles the things generic EDA tools miss: Buddhist Era dates, Thai numerals, zero-width spaces, mojibake encoding, Thai month names, and PII like phone numbers and national ID cards.
Quick Start
pip install thaieda
import thaieda
import pandas as pd
df = pd.read_csv("data.csv")
result = thaieda.run(df) # that's it — full EDA in one line
result.to_html("report.html") # self-contained HTML report
Want everything (Thai tokenizer, NER, ML, Excel, stats, LLM)?
pip install "thaieda[all]"
Why ThaiEDA?
Generic tools don't understand Thai data. ydata-profiling (formerly Pandas Profiling) and Sweetviz are great until you feed them Thai data. They miss Buddhist Era years (พ.ศ., 543 years ahead of CE), Thai numerals (๑๒๓), zero-width spaces that break tokenization, and mojibake from mis-decoded TIS-620 text. ThaiEDA catches all of these.
Privacy-first LLM analysis. Want to ask an LLM about your data but can't send raw rows to a cloud API? ThaiEDA has four privacy modes, and the default (insight_only) sends only statistics and insights, never raw rows. That suits government, financial, and medical data covered by Thailand's Personal Data Protection Act (PDPA).
Insights, not just summaries. Most EDA tools show you df.describe() with nicer formatting. ThaiEDA has a cross-column insight engine that finds non-obvious patterns — "column A strongly predicts column B", "this group is 3× higher than average", "this column has outliers at row 47" — ranked by statistical interestingness with Benjamini-Hochberg correction.
One line to get everything. thaieda.run(df) chains the full pipeline: type detection → data cleaning → quality checks → anomaly detection → insight discovery → visualization → HTML report. No config needed.
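The Benjamini-Hochberg correction mentioned above controls the false discovery rate when many column pairs are tested at once. The following is a minimal, standard-library sketch of the textbook step-up procedure, not ThaiEDA's own code; the function name `benjamini_hochberg` and its `alpha` parameter are assumptions made for illustration.

```python
def benjamini_hochberg(p_values, alpha=0.05):
    """Return one boolean per p-value: True if rejected at FDR level alpha.

    Illustrative sketch of the standard step-up procedure; not ThaiEDA's API.
    """
    m = len(p_values)
    if m == 0:
        return []

    # Indices of the p-values, sorted from smallest to largest.
    order = sorted(range(m), key=lambda i: p_values[i])

    # Find the largest rank k such that p_(k) <= (k / m) * alpha.
    max_rank = 0
    for rank, idx in enumerate(order, start=1):
        if p_values[idx] <= rank / m * alpha:
            max_rank = rank

    # Reject every hypothesis ranked at or below that k (step-up rule).
    rejected = [False] * m
    for rank, idx in enumerate(order, start=1):
        if rank <= max_rank:
            rejected[idx] = True
    return rejected


if __name__ == "__main__":
    # Hypothetical p-values from four cross-column tests.
    print(benjamini_hochberg([0.039, 0.001, 0.216, 0.008]))
```

Because the rule is step-up, a p-value that misses its own threshold can still be rejected when a larger p-value further down the ranking meets its threshold.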
How It Works
DataFrame
    │
    ▼
┌─────────────────────────────────────────┐
│  thaieda.run(df)                        │
│                                         │
│  1. detect   → column types             │
│  2. clean    → fix encoding/numerals/BE │
│  3. quality  → nulls, placeholders, BE  │
│  4. anomaly  → statistical + text       │
│  5. insights → 6 cross-column patterns  │
│  6. viz      → auto charts (Thai font)  │
│  7. report   → self-contained HTML      │
│                                         │
│  + optional: LLM analysis (4 modes)     │
└─────────────────────────────────────────┘
    │
    ▼
EDAResult
  .to_html()       → report.html
  .to_dict()       → Python dict
  .to_json()       → JSON string
  .insights        → insight cards
  .cleaned_df      → cleaned DataFrame
  .quality_issues  → list of issues
  .anomalies       → anomaly findings
  .llm_response    → LLM analysis (if enabled)
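To make step 3 more concrete, here is a rough sketch of two of the quality checks it names: placeholder values and constant columns. It uses only the standard library and a plain dict of lists instead of a DataFrame. The placeholder set, the function names, and the issue tuples are all assumptions for this example and do not reflect how ThaiEDA's quality module is actually written.

```python
# Hypothetical placeholder spellings; a real list would likely be longer.
PLACEHOLDERS = {"", "-", "n/a", "na", "null", "none", "ไม่มี", "ไม่ระบุ"}


def is_placeholder(value):
    """True if the value is None or a known 'missing' marker."""
    if value is None:
        return True
    return str(value).strip().lower() in PLACEHOLDERS


def quality_issues(columns):
    """Scan {column_name: [values]} and return (column, issue, count) tuples."""
    issues = []
    for name, values in columns.items():
        missing = sum(is_placeholder(v) for v in values)
        if missing:
            issues.append((name, "placeholder_or_missing", missing))

        # A column with at most one distinct real value carries no signal.
        distinct = {v for v in values if not is_placeholder(v)}
        if values and len(distinct) <= 1:
            issues.append((name, "constant", len(distinct)))
    return issues


if __name__ == "__main__":
    sample = {
        "name": ["สมชาย", "สมหญิง", "-", "ไม่มี"],
        "country": ["TH", "TH", "TH", "TH"],
        "age": [30, 41, 25, 38],
    }
    for issue in quality_issues(sample):
        print(issue)
```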
Examples
One-Line EDA
import thaieda
import pandas as pd
df = pd.read_csv("data.csv")
# Full pipeline: detect → clean → quality → anomaly → insights → viz → report
result = thaieda.run(df)
# Access results
result.to_html("report.html")
print(result.insights) # cross-column insight cards
print(result.quality_issues) # data quality findings
print(result.notes) # pipeline notes/warnings
# Alias works too
result = thaieda.EDA(df)
With LLM Analysis (Privacy-Safe)
import thaieda
# Default: zero raw data leaves your machine
result = thaieda.run(df, llm=True, privacy="insight_only", provider="ollama")
print(result.llm_response)
# Or use OpenAI/Anthropic — still safe with insight_only
result = thaieda.run(df, llm=True, privacy="insight_only", provider="openai")
Privacy Modes
Control exactly what data leaves your machine:
| Mode | What Leaves | Guarantee | When to Use |
|---|---|---|---|
| insight_only (default) | Stats + insights only | Raw data never leaves | Government, medical, PDPA data |
| anonymized | Data with PII → tokens | Names/phones/ID cards masked | Need structure without raw PII |
| dp_noise | Stats + Laplace noise | Prevents re-identification | Small datasets where stats leak |
| full | Everything | None; you accept the risk | Public data, demos |
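For intuition about the anonymized mode, the sketch below replaces phone numbers and national ID numbers with stable tokens before text would be shared. It is only an illustration under stated assumptions: the two regular expressions, the `<PHONE_n>` / `<NATIONAL_ID_n>` token format, and the `mask_pii` helper are invented here and are not ThaiEDA's implementation. Name masking, which the table also lists, is out of scope for a regex-only sketch.

```python
import re

# Hypothetical patterns. Digit lookarounds are used instead of \b because
# Thai text often has no spaces between words and numbers.
NATIONAL_ID_RE = re.compile(r"(?<!\d)\d-?\d{4}-?\d{5}-?\d{2}-?\d(?!\d)")
PHONE_RE = re.compile(r"(?<!\d)0[689]\d-?\d{3}-?\d{4}(?!\d)")


def mask_pii(text, mapping=None):
    """Replace phone and national ID numbers with tokens.

    The same underlying number always maps to the same token, so joins and
    duplicates stay visible without exposing the raw value.
    """
    mapping = {} if mapping is None else mapping

    def make_replacer(label):
        def replace(match):
            digits = re.sub(r"\D", "", match.group(0))  # ignore hyphens
            key = (label, digits)
            if key not in mapping:
                seen = sum(1 for k in mapping if k[0] == label)
                mapping[key] = f"<{label}_{seen + 1}>"
            return mapping[key]
        return replace

    # Mask 13-digit IDs first so a phone-like run inside one is not split out.
    text = NATIONAL_ID_RE.sub(make_replacer("NATIONAL_ID"), text)
    text = PHONE_RE.sub(make_replacer("PHONE"), text)
    return text, mapping


if __name__ == "__main__":
    masked, _ = mask_pii("โทร 081-234-5678 บัตร 1-1234-56789-01-2")
    print(masked)
```

Passing the returned `mapping` back in on later calls keeps tokens consistent across rows.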
from thaieda.llm import analyze_with_llm
# Each mode as a standalone call
answer = analyze_with_llm(df, privacy="insight_only", provider="ollama")
answer = analyze_with_llm(df, privacy="anonymized", provider="openai")
answer = analyze_with_llm(df, privacy="dp_noise", provider="anthropic", epsilon=0.5)
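The `epsilon` argument in the dp_noise call is the usual differential-privacy budget: smaller values mean more noise. As a rough illustration of the Laplace mechanism that mode is named after, the sketch below releases a noisy mean of bounded values using only the standard library. The helper names, the clipping bounds, and the sensitivity formula are assumptions for this example; which statistics ThaiEDA perturbs, and how, is not described here.

```python
import random


def laplace_noise(scale, rng=random):
    """Draw one Laplace(0, scale) sample.

    The difference of two independent Exp(1) draws is Laplace(0, 1).
    """
    return scale * (rng.expovariate(1.0) - rng.expovariate(1.0))


def dp_mean(values, lower, upper, epsilon, rng=random):
    """Release the mean of `values` with epsilon-differential privacy.

    Values are clipped to [lower, upper], so changing one record moves the
    mean by at most (upper - lower) / n. That bound is the sensitivity.
    """
    if not values:
        raise ValueError("values must not be empty")
    if epsilon <= 0:
        raise ValueError("epsilon must be positive")

    clipped = [min(max(v, lower), upper) for v in values]
    true_mean = sum(clipped) / len(clipped)
    sensitivity = (upper - lower) / len(clipped)
    return true_mean + laplace_noise(sensitivity / epsilon, rng)


if __name__ == "__main__":
    salaries = [52_000, 48_500, 61_000, 39_900]  # made-up numbers
    print(dp_mean(salaries, lower=0, upper=100_000, epsilon=0.5))
```

With few rows the sensitivity is large relative to the mean, so the noise is large too, which is why the table recommends this mode for small datasets.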
Manual Pipeline (Full Control)
from thaieda import profile, discover_insights
from thaieda.detect import detect_all
# Step-by-step if you want control
report = profile(df, clean=True)
report.to_html("report.html")
result = discover_insights(df, detect_all(df), top_n=8)
for card in result.cards:
    print(f"[{card.pattern}] {card.description_th}")
    print(f" → {card.recommendation_th}")
What ThaiEDA Catches
| Problem | Example | What Happens |
|---|---|---|
| Buddhist Era dates | 15/03/2567 | Auto-detects พ.ศ. → converts to CE |
| Thai numerals | ๑๒๓ in numeric column | Converts to 123 |
| Zero-width spaces | สม\u200bชาย | Strips invisible chars |
| Mojibake encoding | à ¬Â¸Â¡Â¹ | Auto-detects TIS-620 → UTF-8 |
| Thai month names | มกราคม | Parses to ISO date |
| Phone numbers | 081-234-5678 | Detects + normalizes |
| National ID cards | 1-1234-56789-01-2 | Detects via regex |
| Placeholder values | -, N/A, ไม่มี | Flags as missing |
| Constant columns | All same value | Flags as useless |
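Several rows in the table come down to small, well-defined transformations. The sketch below shows one way to do four of them in plain Python: stripping zero-width characters, converting Thai numerals, shifting Buddhist Era years to CE, and parsing Thai month names. It is not ThaiEDA's code. The function names and the `be_threshold` heuristic (treat any year of 2400 or later as Buddhist Era) are assumptions chosen for illustration.

```python
import re
from datetime import date

THAI_DIGITS = str.maketrans("๐๑๒๓๔๕๖๗๘๙", "0123456789")
ZERO_WIDTH = re.compile("[\u200b\ufeff]")  # zero-width space and BOM
BE_OFFSET = 543  # Buddhist Era year minus 543 gives the CE year
THAI_MONTHS = {
    "มกราคม": 1, "กุมภาพันธ์": 2, "มีนาคม": 3, "เมษายน": 4,
    "พฤษภาคม": 5, "มิถุนายน": 6, "กรกฎาคม": 7, "สิงหาคม": 8,
    "กันยายน": 9, "ตุลาคม": 10, "พฤศจิกายน": 11, "ธันวาคม": 12,
}


def normalize_text(value):
    """Remove zero-width characters and map Thai digits to ASCII digits."""
    return ZERO_WIDTH.sub("", value).translate(THAI_DIGITS)


def parse_thai_date(value, be_threshold=2400):
    """Parse 'DD/MM/YYYY' or 'DD <Thai month> YYYY' into a CE date.

    Returns None when the text does not look like a date.
    """
    text = normalize_text(value).strip()

    numeric = re.fullmatch(r"(\d{1,2})/(\d{1,2})/(\d{4})", text)
    if numeric:
        day, month, year = (int(part) for part in numeric.groups())
    else:
        named = re.fullmatch(r"(\d{1,2})\s+(\S+)\s+(\d{4})", text)
        if not named or named.group(2) not in THAI_MONTHS:
            return None
        day = int(named.group(1))
        month = THAI_MONTHS[named.group(2)]
        year = int(named.group(3))

    # Heuristic: a year this large is assumed to be Buddhist Era.
    if year >= be_threshold:
        year -= BE_OFFSET

    try:
        return date(year, month, day)
    except ValueError:  # e.g. month 13 or day 32
        return None


if __name__ == "__main__":
    print(normalize_text("สม\u200bชาย"))
    print(parse_thai_date("15/03/2567"))
    print(parse_thai_date("๑๕ มกราคม ๒๕๖๗"))
```

A fixed year threshold is the simplest possible detector; a column-level check (for example, most years falling between 2400 and 2600) would be more robust for mixed data.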
Installation
# Basic — works immediately
pip install thaieda
# Everything in one command
pip install "thaieda[all]"
# Or pick what you need
pip install "thaieda[thai]" # Thai tokenizer
pip install "thaieda[ner]" # Thai NER
pip install "thaieda[ml]" # ML anomaly detection
pip install "thaieda[timeseries]" # STL decomposition
pip install "thaieda[excel]" # Excel support
pip install "thaieda[stats]" # p-values (scipy)
# LLM providers (all optional, lazy-imported)
pip install openai # OpenAI GPT
pip install anthropic # Anthropic Claude
pip install ollama # Ollama local LLM
Requirements: Python 3.10+, pandas, numpy, matplotlib, Jinja2
Modules
| Module | What It Does |
|---|---|
| run() / EDA() | One-liner API: full pipeline in one call |
| io/ | Auto-read CSV/JSON/JSONL/Excel + encoding detection |
| detect/ | Column type detection + Thai month names |
| clean/ | Encoding fix, numerals, BE→CE, dates, duplicates, missing |
| quality/ | Thai quality checks + placeholder/constant detection |
| anomaly/ | Statistical + ML + text anomaly detection |
| ner/ | Thai NER: person/place/organization |
| insight_engine/ | 6 cross-column insight patterns (BH-corrected) |
| viz/ | Auto charts with Thai font support |
| report/ | Self-contained HTML report (Jinja2) |
| llm/ | Privacy-preserving LLM analysis (4 modes, 3 providers) |
| schema/ | Multi-file PK/FK discovery + relationship matching |
| timeseries/ | Trend/seasonality/STL/ACF/gap detection |
Testing
pytest tests/ -v # all tests (424 passed)
pytest tests/test_oneliner.py # one-liner API tests
pytest tests/test_llm.py # LLM + privacy mode tests
ruff check src/ tests/ # lint
ruff format src/ tests/ # format
License
Apache-2.0 © Peet Wannasarnmetha