AutoEDA for Thai-language data: exploratory data analysis that speaks Thai
ThaiEDA
AutoEDA for Thai-language data — exploratory data analysis that understands Thai.
Quick Start
```bash
pip install "thaieda[thai]"
```

```python
import pandas as pd

from thaieda import profile
from thaieda.llm import analyze_with_llm

df = pd.read_csv("data.csv")
report = profile(df, clean=True)
report.to_html("report.html")

# Ask an LLM about the data (privacy-safe by default)
answer = analyze_with_llm(df, privacy="insight_only", provider="ollama")
print(answer)
```
Why ThaiEDA?
- Thai-specific — catches Buddhist Era dates, Thai numerals, zero-width spaces, mojibake, and Thai month names that generic tools miss.
- Privacy-first — LLM analysis with 4 privacy modes; the default sends zero raw data off your machine.
- Auto insights — a cross-column insight engine surfaces non-obvious findings, ranked by statistical interestingness (BH-corrected).
- No lock-in — generates a self-contained HTML report; works as a library or CLI; all LLM providers are optional and lazy-imported.
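To make the Thai-specific checks concrete, here is a minimal sketch of three of the normalizations named above: Thai numerals, zero-width characters, and Buddhist Era years. It uses only the standard library. The function names `normalize_text` and `parse_be_date` are hypothetical and are not ThaiEDA's API, and the sketch assumes dates are written as day/month/year.

```python
import re
from datetime import date

# Thai digits map one-to-one onto ASCII 0-9.
THAI_DIGITS = str.maketrans("๐๑๒๓๔๕๖๗๘๙", "0123456789")

# Zero-width space, non-joiner, joiner, and BOM often hide inside Thai text.
ZERO_WIDTH = re.compile("[\u200b\u200c\u200d\ufeff]")

BE_OFFSET = 543  # Buddhist Era year = Common Era year + 543


def normalize_text(value: str) -> str:
    """Strip zero-width characters and convert Thai digits to ASCII digits."""
    return ZERO_WIDTH.sub("", value).translate(THAI_DIGITS)


def parse_be_date(value: str) -> date:
    """Parse a day/month/year string whose year is in the Buddhist Era."""
    day, month, year = (int(part) for part in normalize_text(value).split("/"))
    return date(year - BE_OFFSET, month, day)


print(normalize_text("ราคา\u200b ๑,๒๕๐ บาท"))  # ราคา 1,250 บาท
print(parse_be_date("๑๕/๐๘/๒๕๖๗"))  # 2024-08-15
```

A generic profiler would typically treat `๑,๒๕๐` as text and `2567` as a far-future year, which is why these conversions need to happen before type inference.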
Features by Version
| Version | Feature | Description |
|---|---|---|
| v0.9 | Privacy-preserving LLM analysis | 4 privacy modes + 3 LLM providers (OpenAI / Anthropic / Ollama) |
| v0.8 | Data cleaning + actionable insights | Thai numeral→numeric, BE→CE, date standardization, correlation/outlier patterns, Excel support |
| v0.7 | Insight visualization | Auto-generated charts for each cross-column finding (bar, donut, box plot, trend line) |
| v0.6 | Cross-column insight engine | 6 patterns: outstanding / attribution / comparison / trend / correlation / outlier (BH-corrected) |
| v0.5 | Multi-file schema discovery | PK/FK matching, ER diagram, relationship validation, orphan detection |
| v0.4 | Timeseries analysis | Trend/seasonality/STL/ACF/gaps + distribution & correlation insights |
| v0.3 | Single-command pipeline | JSON input, auto encoding detection, auto insights, cleaning diff |
| v0.2 | Thai NER + target analysis | pythainlp normalize, auto chart selection, unified anomaly API |
| v0.1 | Thai text profiling | Column type detection, quality checks, HTML report, CLI |
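The orphan detection listed under v0.5 can be pictured as a set-membership check between a child table's foreign key and a parent table's primary key. The sketch below works on plain lists of dicts. `find_orphans` and `fk_coverage` are hypothetical helpers for illustration and do not show how ThaiEDA's `schema` module is implemented.

```python
def find_orphans(child_rows, fk, parent_rows, pk):
    """Return child rows whose foreign key matches no parent primary key."""
    parent_keys = {row[pk] for row in parent_rows}
    return [row for row in child_rows if row[fk] not in parent_keys]


def fk_coverage(child_rows, fk, parent_rows, pk):
    """Fraction of child rows whose foreign key resolves to a parent row."""
    if not child_rows:
        return 1.0
    orphans = find_orphans(child_rows, fk, parent_rows, pk)
    return 1 - len(orphans) / len(child_rows)


customers = [{"customer_id": 1}, {"customer_id": 2}]
orders = [
    {"order_id": "A", "customer_id": 1},
    {"order_id": "B", "customer_id": 2},
    {"order_id": "C", "customer_id": 9},  # no such customer
]

print(find_orphans(orders, "customer_id", customers, "customer_id"))
# [{'order_id': 'C', 'customer_id': 9}]
print(round(fk_coverage(orders, "customer_id", customers, "customer_id"), 2))
# 0.67
```

A coverage close to 1.0 is one plausible signal that two columns form a PK/FK pair; the threshold a real tool uses is a design choice.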
Privacy Modes (v0.9)
Control exactly what data leaves your machine when calling analyze_with_llm():
| Mode | What Leaves Machine | Privacy Guarantee | Use Case |
|---|---|---|---|
| insight_only (default) | Summary statistics + insight cards only | Raw data never leaves | Regulated / PDPA data, cautious users |
| anonymized | Data with PII replaced by tokens ([PHONE_1], [NAME_1]) | Names/phones/ID cards masked; token_map returned for local reversal | LLM needs to see structure without raw PII |
| dp_noise | Statistics with Laplace noise (configurable ε) | DP noise prevents re-identification from small stats | Small datasets where stats alone may leak identity |
| full | All raw data sent | None; user accepts the tradeoff | Public data, dev/demo workflows |
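As a rough illustration of what the dp_noise mode's Laplace noise means, the sketch below releases a noisy mean under the standard Laplace mechanism. It is not ThaiEDA's implementation. `laplace_noise` and `dp_mean` are hypothetical names, and the sketch assumes values are clipped to a known range and that the row count is public.

```python
import random


def laplace_noise(scale: float, rng: random.Random) -> float:
    """Draw one Laplace(0, scale) sample as the difference of two exponentials."""
    return rng.expovariate(1 / scale) - rng.expovariate(1 / scale)


def dp_mean(values, lower, upper, epsilon, rng=None):
    """Release the mean of values clipped to [lower, upper] with Laplace noise."""
    rng = rng or random.Random()
    clipped = [min(max(v, lower), upper) for v in values]
    # Changing one record moves the clipped mean by at most (upper - lower) / n.
    sensitivity = (upper - lower) / len(clipped)
    return sum(clipped) / len(clipped) + laplace_noise(sensitivity / epsilon, rng)


ages = [23, 35, 41, 29, 52, 38]
print(dp_mean(ages, lower=0, upper=100, epsilon=0.5))
```

A smaller ε gives a larger noise scale and therefore stronger privacy at the cost of accuracy. On small datasets the sensitivity term `(upper - lower) / n` is large, so the noise is large too.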
Examples
Basic EDA
```python
import pandas as pd

from thaieda import profile, read_data

# Reads CSV/JSON/JSONL/Excel with automatic encoding detection
df = read_data("data.xlsx")

# Profile + clean + auto insights in one call
report = profile(df, clean=True)
report.to_html("report.html")

# Get the Thai-language executive summary
print(report.insights.executive_summary_th)
```
Insight Discovery
```python
from thaieda import discover_insights
from thaieda.detect import detect_all

# df is a pandas DataFrame, e.g. loaded with read_data() as in Basic EDA
result = discover_insights(df, detect_all(df), top_n=8)
for card in result.cards:
    print(f"[{card.pattern}] score={card.score:.2f} {card.description_th}")
    print(f" → {card.recommendation_th}")
```
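The insight engine is described as BH-corrected, which refers to the Benjamini-Hochberg procedure for controlling the false discovery rate across many tests. The following is a minimal standard-library sketch of that procedure. The function name and the `alpha` default are assumptions, and this is not ThaiEDA's own code.

```python
def benjamini_hochberg(p_values, alpha=0.05):
    """Return one boolean per p-value: True where the hypothesis is rejected."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    # Find the largest 1-based rank k with p_(k) <= k / m * alpha.
    cutoff = 0
    for rank, idx in enumerate(order, start=1):
        if p_values[idx] <= rank / m * alpha:
            cutoff = rank
    # Reject every hypothesis ranked at or below the cutoff.
    rejected = [False] * m
    for idx in order[:cutoff]:
        rejected[idx] = True
    return rejected


print(benjamini_hochberg([0.03, 0.02, 0.036, 0.9]))
# [True, True, True, False]
```

In this example the two smallest p-values miss their own rank thresholds (0.0125 and 0.025), but the third passes its threshold of 0.0375, so all three are rejected. That step-up behavior is what separates BH from a per-test cutoff.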
LLM Analysis — All 4 Privacy Modes
```python
from thaieda.llm import analyze_with_llm

# Mode 1: safest; only stats leave, no raw data (default)
answer = analyze_with_llm(df, privacy="insight_only", provider="ollama")

# Mode 2: anonymized; PII replaced with tokens before sending
answer = analyze_with_llm(df, privacy="anonymized", provider="openai", model="gpt-4o-mini")

# Mode 3: differential privacy; Laplace noise on stats
answer = analyze_with_llm(df, privacy="dp_noise", provider="anthropic", epsilon=0.5)

# Mode 4: full raw data (user accepts risk)
answer = analyze_with_llm(df, privacy="full", provider="ollama", language="en")
```
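To show what the anonymized mode's token replacement and local reversal could look like, here is a small regex-based sketch. The patterns, the `anonymize` and `restore` helpers, and the shape of the token map are all assumptions for illustration. ThaiEDA's actual PII detection (including names) is not shown in this README and is likely more thorough.

```python
import re

# Hypothetical patterns, for illustration only: 13-digit Thai national ID
# and 10-digit mobile numbers starting with 06, 08, or 09.
PATTERNS = {
    "ID": re.compile(r"\b\d{13}\b"),
    "PHONE": re.compile(r"\b0[689]\d{8}\b"),
}


def anonymize(text: str):
    """Replace matched PII with numbered tokens; return the text and token map."""
    token_map = {}
    for label, pattern in PATTERNS.items():
        seen = {}

        def substitute(match, label=label, seen=seen):
            value = match.group(0)
            if value not in seen:
                seen[value] = f"[{label}_{len(seen) + 1}]"
                token_map[seen[value]] = value
            return seen[value]

        text = pattern.sub(substitute, text)
    return text, token_map


def restore(text: str, token_map: dict) -> str:
    """Reverse the substitution locally using the token map."""
    for token, value in token_map.items():
        text = text.replace(token, value)
    return text


masked, token_map = anonymize("Call 0812345678 or 0812345678, ID 1101700203451")
print(masked)  # Call [PHONE_1] or [PHONE_1], ID [ID_1]
print(restore(masked, token_map))
```

Repeated values map to the same token, so the LLM can still see that two cells refer to the same phone number without seeing the number itself. Note that `\b` treats Thai letters as word characters, so a number glued directly to Thai text would need a different boundary rule.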
Architecture
```text
src/thaieda/
    io/              # Auto-read CSV/JSON/JSONL/Excel + encoding detection
    detect/          # Column type detection + Thai month name detection
    tokenize/        # Tokenizer adapter: pythainlp / nlpo3 / attacut
    text/            # Text metrics: length, frequency, n-grams, TF-IDF
    quality/         # Thai quality checks + placeholder/constant detection
    anomaly/         # Anomaly detection: statistical + ML + text + unified API
    clean/           # Data cleaning: encoding, zwspace, numerals, BE→CE, dates, duplicates, missing
    ner/             # Thai NER: person/place/organization extraction
    analysis/        # Target variable analysis: Pearson/ANOVA/Chi-square
    insight/         # Auto insight summary in Thai (interpreter)
    insight_engine/  # Cross-column insight discovery: 6 patterns + BH correction
    timeseries/      # Timeseries analysis: trend/seasonality/STL/ACF/gaps
    schema/          # Multi-file schema discovery: PK/FK detection + relationship matching
    viz/             # Visualization + auto chart + Thai font + insight charts
    report/          # HTML report generation (Jinja2) + DatasetReport
    i18n/            # Bilingual labels (Thai/English)
    llm/             # Privacy-preserving LLM analysis (4 modes, 3 providers), v0.9
```
Installation
```bash
# Core library (no Thai tokenizer)
pip install thaieda

# Recommended: with Thai tokenizer
pip install "thaieda[thai]"

# Optional extras (all lazy-imported)
pip install "thaieda[ner]"         # pythainlp NER
pip install "thaieda[ml]"          # Isolation Forest / LOF anomaly detection
pip install "thaieda[timeseries]"  # STL decomposition (statsmodels)
pip install "thaieda[excel]"       # Excel support (openpyxl)
pip install "thaieda[stats]"       # p-values (scipy)
pip install "thaieda[detect]"      # auto encoding detection (chardet)

# LLM providers (v0.9, all optional)
pip install openai     # OpenAI GPT
pip install anthropic  # Anthropic Claude
pip install ollama     # Ollama local server (or use built-in HTTP fallback)
```
Requirements: Python 3.10+, pandas, numpy, matplotlib, Jinja2
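The extras above are described as lazy-imported, meaning an optional dependency is only loaded when a feature that needs it is first used. A common way to do this is sketched below. The `require` helper and its error message are hypothetical and are not taken from ThaiEDA's source.

```python
import importlib


def require(module_name: str, extra: str):
    """Import an optional dependency on first use, with an install hint on failure."""
    try:
        return importlib.import_module(module_name)
    except ImportError as exc:
        raise ImportError(
            f"{module_name} is required for this feature. "
            f'Install it with: pip install "thaieda[{extra}]"'
        ) from exc


# Hypothetical usage inside a feature function:
# statsmodels = require("statsmodels", extra="timeseries")
```

With this pattern, `import thaieda` stays fast and succeeds with only the core requirements installed. A missing extra surfaces as a clear error at the point where the feature is called.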
Testing
```bash
# Run all tests
pytest tests/ -v

# Run only LLM module tests (v0.9)
pytest tests/test_llm.py -v

# Lint
ruff check src/ tests/
ruff format src/ tests/
```
License
Apache-2.0 © Peet Wannasarnmetha