Skip to main content

AutoEDA สำหรับข้อมูลภาษาไทย — Exploratory data analysis that speaks Thai

Project description

ThaiEDA

Exploratory data analysis that actually understands Thai.

PyPI Python 3.10+ License: Apache-2.0 Tests: 424 passed Code Style: ruff


What is ThaiEDA?

ThaiEDA is a Python library that automates exploratory data analysis for Thai-language datasets. You give it a DataFrame, it gives you back a full report — column types, quality issues, anomalies, cross-column insights, charts, and an HTML report. All in one line.

It handles the things generic EDA tools miss: Buddhist Era dates, Thai numerals, zero-width spaces, mojibake encoding, Thai month names, and PII like phone numbers and national ID cards.


Quick Start

pip install thaieda
import thaieda
import pandas as pd

df = pd.read_csv("data.csv")
result = thaieda.run(df)          # that's it — full EDA in one line
result.to_html("report.html")     # self-contained HTML report

pip install thaieda ติดตั้งทุกอย่างเลย — Thai tokenizer, NER, ML, Excel, stats, encoding detection ไม่ต้องใส่ extras


Why ThaiEDA?

Generic tools don't understand Thai data. Pandas Profiling, ydata-profiling, and Sweetviz are great — until you feed them Thai data. They miss Buddhist Era years (พ.ศ.), Thai numerals (๑๒๓), zero-width spaces that break tokenization, and mojibake from TIS-620 encoding. ThaiEDA catches all of these.

Privacy-first LLM analysis. Want to ask an LLM about your data but can't send raw rows to a cloud API? ThaiEDA has 4 privacy modes — the default sends zero raw data off your machine. Just stats and insights. Perfect for government, finance, and medical data under PDPA.

Insights, not just summaries. Most EDA tools show you df.describe() with nicer formatting. ThaiEDA has a cross-column insight engine that finds non-obvious patterns — "column A strongly predicts column B", "this group is 3× higher than average", "this column has outliers at row 47" — ranked by statistical interestingness with Benjamini-Hochberg correction.

One line to get everything. thaieda.run(df) chains the full pipeline: type detection → data cleaning → quality checks → anomaly detection → insight discovery → visualization → HTML report. No config needed.


How It Works

DataFrame
    │
    ▼
┌─────────────────────────────────────────┐
│  thaieda.run(df)                        │
│                                         │
│  1. detect    → column types             │
│  2. clean     → fix encoding/numerals/BE │
│  3. quality   → nulls, placeholders, BE │
│  4. anomaly   → statistical + text      │
│  5. insights  → 6 cross-column patterns │
│  6. viz       → auto charts (Thai font) │
│  7. report    → self-contained HTML     │
│                                         │
│  + optional: LLM analysis (4 modes)     │
└─────────────────────────────────────────┘
    │
    ▼
EDAResult
  .to_html()      → report.html
  .to_dict()      → Python dict
  .to_json()      → JSON string
  .insights       → insight cards
  .cleaned_df     → cleaned DataFrame
  .quality_issues → list of issues
  .anomalies      → anomaly findings
  .llm_response   → LLM analysis (if enabled)

Examples

One-Line EDA

import thaieda
import pandas as pd

df = pd.read_csv("data.csv")

# Full pipeline: detect → clean → quality → insights → viz → report
result = thaieda.run(df)

# Access results
result.to_html("report.html")
print(result.insights)           # cross-column insight cards
print(result.quality_issues)     # data quality findings
print(result.notes)              # pipeline notes/warnings

# Alias works too
result = thaieda.EDA(df)

With LLM Analysis (Privacy-Safe)

import thaieda

# Default: zero raw data leaves your machine
result = thaieda.run(df, llm=True, privacy="insight_only", provider="ollama")
print(result.llm_response)

# Or use OpenAI/Anthropic — still safe with insight_only
result = thaieda.run(df, llm=True, privacy="insight_only", provider="openai")

Privacy Modes

Control exactly what data leaves your machine:

Mode What Leaves Guarantee When to Use
insight_only (default) Stats + insights only Raw data never leaves Government, medical, PDPA data
anonymized Data with PII → tokens Names/phones/ID cards masked Need structure without raw PII
dp_noise Stats + Laplace noise Prevents re-identification Small datasets where stats leak
full Everything None — you accept the risk Public data, demos
from thaieda.llm import analyze_with_llm

# Each mode as a standalone call
answer = analyze_with_llm(df, privacy="insight_only", provider="ollama")
answer = analyze_with_llm(df, privacy="anonymized", provider="openai")
answer = analyze_with_llm(df, privacy="dp_noise", provider="anthropic", epsilon=0.5)

Manual Pipeline (Full Control)

from thaieda import profile, discover_insights
from thaieda.detect import detect_all

# Step-by-step if you want control
report = profile(df, clean=True)
report.to_html("report.html")

result = discover_insights(df, detect_all(df), top_n=8)
for card in result.cards:
    print(f"[{card.pattern}] {card.description_th}")
    print(f"  → {card.recommendation_th}")

What ThaiEDA Catches

Problem Example What Happens
Buddhist Era dates 15/03/2567 Auto-detects พ.ศ. → converts to CE
Thai numerals ๑๒๓ in numeric column Converts to 123
Zero-width spaces สม\u200bชาย Strips invisible chars
Mojibake encoding à ¬Â¸Â¡Â¹ Auto-detects TIS-620 → UTF-8
Thai month names มกราคม Parses to ISO date
Phone numbers 081-234-5678 Detects + normalizes
National ID cards 1-1234-56789-01-2 Detects via regex
Placeholder values -, N/A, ไม่มี Flags as missing
Constant columns All same value Flags as useless

Installation

# ติดตั้งทุกอย่างในคำสั่งเดียว
pip install thaieda

ไม่ต้องใส่ extras — pip install thaieda ติดตั้งทั้งหมดเลย: Thai tokenizer, NER, ML anomaly detection, timeseries, Excel, stats, encoding detection

LLM providers ยังเป็น optional (lazy-imported — ไม่ต้องติดตั้งถ้าไม่ใช้):

pip install openai       # OpenAI GPT
pip install anthropic    # Anthropic Claude
pip install ollama       # Ollama local LLM (หรือใช้ HTTP fallback ไม่ต้องติดตั้ง)

Requirements: Python 3.10+


Modules

Module What It Does
run() / EDA() One-liner API — full pipeline in one call
io/ Auto-read CSV/JSON/JSONL/Excel + encoding detection
detect/ Column type detection + Thai month names
clean/ Encoding fix, numerals, BE→CE, dates, duplicates, missing
quality/ Thai quality checks + placeholder/constant detection
anomaly/ Statistical + ML + text anomaly detection
ner/ Thai NER: person/place/organization
insight_engine/ 6 cross-column insight patterns (BH-corrected)
viz/ Auto charts with Thai font support
report/ Self-contained HTML report (Jinja2)
llm/ Privacy-preserving LLM analysis (4 modes, 3 providers)
schema/ Multi-file PK/FK discovery + relationship matching
timeseries/ Trend/seasonality/STL/ACF/gap detection

Testing

pytest tests/ -v              # all tests (424 passed)
pytest tests/test_oneliner.py # one-liner API tests
pytest tests/test_llm.py      # LLM + privacy mode tests
ruff check src/ tests/        # lint
ruff format src/ tests/       # format

License

Apache-2.0 © Peet Wannasarnmetha

Project details


Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

thaieda-1.1.0.tar.gz (1.4 MB view details)

Uploaded Source

Built Distribution

If you're not sure about the file name format, learn more about wheel file names.

thaieda-1.1.0-py3-none-any.whl (203.5 kB view details)

Uploaded Python 3

File details

Details for the file thaieda-1.1.0.tar.gz.

File metadata

  • Download URL: thaieda-1.1.0.tar.gz
  • Upload date:
  • Size: 1.4 MB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/6.2.0 CPython/3.11.9

File hashes

Hashes for thaieda-1.1.0.tar.gz
Algorithm Hash digest
SHA256 362df7fce4bc3bc350fe6d5c3dbe01efda1e4918d128466171e113a86f8273e8
MD5 1b8ba81738a3565c925f8645474ce35e
BLAKE2b-256 ba6d364a2188ab9d30832e0e5de9baebba332c581420d6ae97da3891e234d009

See more details on using hashes here.

File details

Details for the file thaieda-1.1.0-py3-none-any.whl.

File metadata

  • Download URL: thaieda-1.1.0-py3-none-any.whl
  • Upload date:
  • Size: 203.5 kB
  • Tags: Python 3
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/6.2.0 CPython/3.11.9

File hashes

Hashes for thaieda-1.1.0-py3-none-any.whl
Algorithm Hash digest
SHA256 f57f4a9bc573eee8a2e795a9183bf1e4b2f7052bf7d2ffe732e4908f0dc8a26c
MD5 79c4137f338802bcea9770c356136d3f
BLAKE2b-256 dc51f6c7697d51735b4a5267da77fa3022500f227f364c4908369dbe66dcdc0b

See more details on using hashes here.

Supported by

AWS Cloud computing and Security Sponsor Datadog Monitoring Depot Continuous Integration Fastly CDN Google Download Analytics Pingdom Monitoring Sentry Error logging StatusPage Status page