
AutoEDA for Thai-language data: exploratory data analysis that speaks Thai


ThaiEDA

Exploratory data analysis that actually understands Thai.

Python 3.10+ | License: Apache-2.0 | Tests: 424 passed | Code style: ruff


What is ThaiEDA?

ThaiEDA is a Python library that automates exploratory data analysis for Thai-language datasets. You pass it a DataFrame and get back a full analysis: column types, quality issues, anomalies, cross-column insights, charts, and a self-contained HTML report, all from a single function call.

It handles the things generic EDA tools miss: Buddhist Era dates, Thai numerals, zero-width spaces, mojibake encoding, Thai month names, and PII like phone numbers and national ID cards.
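To make three of these cases concrete, here is a minimal standard-library sketch of Thai numeral conversion, zero-width character stripping, and Buddhist Era date parsing. The helper names (to_ascii_digits, strip_zero_width, parse_be_date) are hypothetical and are not part of ThaiEDA's API; the library's own implementation may differ.

```python
import re
from datetime import date

# Thai digits ๐-๙ map one-to-one onto ASCII 0-9.
THAI_DIGITS = str.maketrans("๐๑๒๓๔๕๖๗๘๙", "0123456789")

# Invisible characters that often hide inside Thai text
# (zero-width space, non-joiner, joiner, byte-order mark).
ZERO_WIDTH = re.compile("[\u200b\u200c\u200d\ufeff]")

BE_OFFSET = 543  # Buddhist Era year = Common Era year + 543


def to_ascii_digits(text: str) -> str:
    """Replace Thai digits with their ASCII equivalents."""
    return text.translate(THAI_DIGITS)


def strip_zero_width(text: str) -> str:
    """Remove zero-width characters that break matching and tokenization."""
    return ZERO_WIDTH.sub("", text)


def parse_be_date(text: str) -> date:
    """Parse DD/MM/YYYY where YYYY is assumed to be a Buddhist Era year."""
    day, month, year = (int(part) for part in to_ascii_digits(text).split("/"))
    return date(year - BE_OFFSET, month, day)


print(to_ascii_digits("๑๒๓"))
print(strip_zero_width("สม\u200bชาย") == "สมชาย")
print(parse_be_date("15/03/2567").isoformat())
```

The sketch assumes every date passed to parse_be_date is already known to be in the Buddhist Era; deciding that automatically is the harder problem and is not attempted here.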


Quick Start

pip install thaieda

import thaieda
import pandas as pd

df = pd.read_csv("data.csv")
result = thaieda.run(df)          # that's it — full EDA in one line
result.to_html("report.html")     # self-contained HTML report

Want everything (Thai tokenizer, NER, ML, Excel, stats, LLM)?

pip install "thaieda[all]"

Why ThaiEDA?

Generic tools don't understand Thai data. ydata-profiling (formerly Pandas Profiling) and Sweetviz are great until you feed them Thai data. They miss Buddhist Era years (พ.ศ.), Thai numerals (๑๒๓), zero-width spaces that break tokenization, and mojibake from mis-decoded TIS-620 text. ThaiEDA detects each of these.
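As an illustration of the TIS-620 case, the following sketch shows how one common kind of mojibake arises (TIS-620 bytes decoded as Latin-1) and how it can be reversed with Python's standard codecs. The function name fix_tis620_mojibake is hypothetical and is not ThaiEDA's API; how ThaiEDA detects and repairs encodings is not described in this README.

```python
def fix_tis620_mojibake(text: str) -> str:
    """Undo TIS-620 bytes that were wrongly decoded as Latin-1."""
    try:
        return text.encode("latin-1").decode("tis-620")
    except (UnicodeEncodeError, UnicodeDecodeError):
        # Not Latin-1 mojibake (for example, the text is already proper Thai).
        return text


original = "สมชาย"
# Simulate the damage: bytes written as TIS-620, read back as Latin-1.
garbled = original.encode("tis-620").decode("latin-1")
repaired = fix_tis620_mojibake(garbled)

print(repr(garbled))
print(repaired == original)
```

This round trip is only safe when the text is known to be damaged: genuine Latin-1 text with accented letters would also be converted, so a real detector needs a heuristic to decide when to apply the repair.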

Privacy-first LLM analysis. Want to ask an LLM about your data but can't send raw rows to a cloud API? ThaiEDA has four privacy modes, and the default (insight_only) sends no raw data off your machine, only aggregate statistics and insights. That suits government, finance, and medical data subject to Thailand's Personal Data Protection Act (PDPA).

Insights, not just summaries. Most EDA tools show you df.describe() with nicer formatting. ThaiEDA has a cross-column insight engine that finds non-obvious patterns, such as "column A strongly predicts column B", "this group is 3× higher than average", or "this column has outliers at row 47". Findings are ranked by statistical interestingness, with Benjamini-Hochberg correction applied for multiple comparisons.
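For readers unfamiliar with the Benjamini-Hochberg procedure, here is a minimal pure-Python sketch of the standard step-up rule. It illustrates the general technique only; the function name and the alpha default are assumptions, and ThaiEDA's insight engine may apply the correction differently.

```python
def benjamini_hochberg(p_values: list, alpha: float = 0.05) -> list:
    """Return one reject flag per hypothesis, controlling FDR at alpha."""
    m = len(p_values)
    # Indices of the p-values, from smallest to largest.
    order = sorted(range(m), key=lambda i: p_values[i])

    # Find the largest 1-based rank k with p_(k) <= (k / m) * alpha.
    cutoff = 0
    for rank, idx in enumerate(order, start=1):
        if p_values[idx] <= rank / m * alpha:
            cutoff = rank

    # Reject every hypothesis ranked at or below the cutoff.
    rejected = [False] * m
    for idx in order[:cutoff]:
        rejected[idx] = True
    return rejected


print(benjamini_hochberg([0.02, 0.03, 0.035, 0.9]))
```

Note the step-up behavior: in the example, the two smallest p-values miss their own thresholds, but they are still rejected because the third-ranked p-value passes its threshold.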

One line to get everything. thaieda.run(df) chains the full pipeline: type detection → data cleaning → quality checks → anomaly detection → insight discovery → visualization → HTML report. No config needed.


How It Works

DataFrame
    │
    ▼
┌─────────────────────────────────────────┐
│  thaieda.run(df)                        │
│                                         │
│  1. detect    → column types             │
│  2. clean     → fix encoding/numerals/BE │
│  3. quality   → nulls, placeholders, BE │
│  4. anomaly   → statistical + text      │
│  5. insights  → 6 cross-column patterns │
│  6. viz       → auto charts (Thai font) │
│  7. report    → self-contained HTML     │
│                                         │
│  + optional: LLM analysis (4 modes)     │
└─────────────────────────────────────────┘
    │
    ▼
EDAResult
  .to_html()      → report.html
  .to_dict()      → Python dict
  .to_json()      → JSON string
  .insights       → insight cards
  .cleaned_df     → cleaned DataFrame
  .quality_issues → list of issues
  .anomalies      → anomaly findings
  .llm_response   → LLM analysis (if enabled)

Examples

One-Line EDA

import thaieda
import pandas as pd

df = pd.read_csv("data.csv")

# Full pipeline: detect → clean → quality → insights → viz → report
result = thaieda.run(df)

# Access results
result.to_html("report.html")
print(result.insights)           # cross-column insight cards
print(result.quality_issues)     # data quality findings
print(result.notes)              # pipeline notes/warnings

# Alias works too
result = thaieda.EDA(df)

With LLM Analysis (Privacy-Safe)

import pandas as pd
import thaieda

df = pd.read_csv("data.csv")

# Default: zero raw data leaves your machine
result = thaieda.run(df, llm=True, privacy="insight_only", provider="ollama")
print(result.llm_response)

# Or use OpenAI/Anthropic — still safe with insight_only
result = thaieda.run(df, llm=True, privacy="insight_only", provider="openai")

Privacy Modes

Control exactly what data leaves your machine:

| Mode | What Leaves | Guarantee | When to Use |
|---|---|---|---|
| insight_only (default) | Stats + insights only | Raw data never leaves | Government, medical, PDPA data |
| anonymized | Data with PII → tokens | Names/phones/ID cards masked | Need structure without raw PII |
| dp_noise | Stats + Laplace noise | Limits re-identification risk | Small datasets where stats leak |
| full | Everything | None (you accept the risk) | Public data, demos |

import pandas as pd
from thaieda.llm import analyze_with_llm

df = pd.read_csv("data.csv")

# Each mode as a standalone call
answer = analyze_with_llm(df, privacy="insight_only", provider="ollama")
answer = analyze_with_llm(df, privacy="anonymized", provider="openai")
answer = analyze_with_llm(df, privacy="dp_noise", provider="anthropic", epsilon=0.5)
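To show what "Stats + Laplace noise" means in general terms, here is a minimal sketch of the textbook Laplace mechanism applied to a mean. The functions laplace_noise and private_mean, the clipping bounds, and the sample data are all hypothetical; this is not ThaiEDA's implementation, and only the epsilon parameter name comes from the call above.

```python
import random


def laplace_noise(scale: float, rng: random.Random) -> float:
    """Sample Laplace(0, scale) as the difference of two exponential draws."""
    return rng.expovariate(1.0 / scale) - rng.expovariate(1.0 / scale)


def private_mean(values, lower, upper, epsilon, rng):
    """Epsilon-differentially-private mean of values clipped to [lower, upper]."""
    clipped = [min(max(v, lower), upper) for v in values]
    true_mean = sum(clipped) / len(clipped)
    # Changing one record moves the clipped mean by at most this much.
    sensitivity = (upper - lower) / len(clipped)
    return true_mean + laplace_noise(sensitivity / epsilon, rng)


ages = [23, 35, 41, 29, 52]  # illustrative data only
noisy = private_mean(ages, lower=0, upper=100, epsilon=0.5, rng=random.Random(42))
print(noisy)
```

A smaller epsilon means a larger noise scale and therefore stronger privacy at the cost of accuracy, which is why small datasets are where this trade-off matters most.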

Manual Pipeline (Full Control)

import pandas as pd
from thaieda import profile, discover_insights
from thaieda.detect import detect_all

df = pd.read_csv("data.csv")

# Step-by-step if you want control
report = profile(df, clean=True)
report.to_html("report.html")

result = discover_insights(df, detect_all(df), top_n=8)
for card in result.cards:
    print(f"[{card.pattern}] {card.description_th}")
    print(f"  → {card.recommendation_th}")

What ThaiEDA Catches

| Problem | Example | What Happens |
|---|---|---|
| Buddhist Era dates | 15/03/2567 | Auto-detects พ.ศ. (BE) → converts to CE |
| Thai numerals | ๑๒๓ in numeric column | Converts to 123 |
| Zero-width spaces | สม\u200bชาย | Strips invisible chars |
| Mojibake encoding | à ¬Â¸Â¡Â¹ | Auto-detects TIS-620 → UTF-8 |
| Thai month names | มกราคม | Parses to ISO date |
| Phone numbers | 081-234-5678 | Detects + normalizes |
| National ID cards | 1-1234-56789-01-2 | Detects via regex |
| Placeholder values | -, N/A, ไม่มี | Flags as missing |
| Constant columns | All same value | Flags as useless |
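The phone and national ID rows can be sketched with two regular expressions. The patterns below are assumptions written for the example formats in the table (10-digit numbers starting with 0, and 13-digit IDs grouped 1-4-5-2-1); they are not ThaiEDA's actual patterns, and the token names and helper functions are hypothetical.

```python
import re

# 10 digits starting with 0, optionally split 3-3-4 by dashes or spaces.
PHONE = re.compile(r"(?<!\d)0\d{2}[- ]?\d{3}[- ]?\d{4}(?!\d)")

# 13 digits, optionally grouped 1-4-5-2-1 with dashes.
NATIONAL_ID = re.compile(r"(?<!\d)\d-?\d{4}-?\d{5}-?\d{2}-?\d(?!\d)")


def mask_pii(text: str) -> str:
    """Replace national IDs and phone numbers with placeholder tokens."""
    # Mask the longer ID pattern first so it wins if the two ever overlap.
    text = NATIONAL_ID.sub("<NATIONAL_ID>", text)
    return PHONE.sub("<PHONE>", text)


def normalize_phone(raw: str) -> str:
    """Keep only the digits of a phone number."""
    return re.sub(r"\D", "", raw)


print(mask_pii("โทร 081-234-5678 บัตร 1-1234-56789-01-2"))
print(normalize_phone("081-234-5678"))
```

The phone pattern deliberately ignores 9-digit landline numbers and international prefixes, so a production detector would need broader patterns.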

Installation

# Basic — works immediately
pip install thaieda

# Everything in one command
pip install "thaieda[all]"

# Or pick what you need
pip install "thaieda[thai]"        # Thai tokenizer
pip install "thaieda[ner]"         # Thai NER
pip install "thaieda[ml]"          # ML anomaly detection
pip install "thaieda[timeseries]"  # STL decomposition
pip install "thaieda[excel]"       # Excel support
pip install "thaieda[stats]"       # p-values (scipy)

# LLM providers (all optional, lazy-imported)
pip install openai                 # OpenAI GPT
pip install anthropic              # Anthropic Claude
pip install ollama                 # Ollama local LLM

Requirements: Python 3.10+, pandas, numpy, matplotlib, Jinja2


Modules

| Module | What It Does |
|---|---|
| run() / EDA() | One-liner API: full pipeline in one call |
| io/ | Auto-read CSV/JSON/JSONL/Excel + encoding detection |
| detect/ | Column type detection + Thai month names |
| clean/ | Encoding fix, numerals, BE→CE, dates, duplicates, missing |
| quality/ | Thai quality checks + placeholder/constant detection |
| anomaly/ | Statistical + ML + text anomaly detection |
| ner/ | Thai NER: person/place/organization |
| insight_engine/ | 6 cross-column insight patterns (BH-corrected) |
| viz/ | Auto charts with Thai font support |
| report/ | Self-contained HTML report (Jinja2) |
| llm/ | Privacy-preserving LLM analysis (4 modes, 3 providers) |
| schema/ | Multi-file PK/FK discovery + relationship matching |
| timeseries/ | Trend/seasonality/STL/ACF/gap detection |
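As a rough idea of the Thai month name handling listed for detect/, here is a standard-library sketch that parses a "day month-name year" string into a date. The function parse_thai_date and the rule that years above 2400 are treated as Buddhist Era are assumptions made for this example, not ThaiEDA's documented behavior.

```python
import re
from datetime import date
from typing import Optional

# Full Thai month names, January through December.
THAI_MONTHS = {
    name: number
    for number, name in enumerate(
        [
            "มกราคม", "กุมภาพันธ์", "มีนาคม", "เมษายน", "พฤษภาคม", "มิถุนายน",
            "กรกฎาคม", "สิงหาคม", "กันยายน", "ตุลาคม", "พฤศจิกายน", "ธันวาคม",
        ],
        start=1,
    )
}

DATE_PATTERN = re.compile(r"(\d{1,2})\s+(\S+)\s+(\d{4})")


def parse_thai_date(text: str) -> Optional[date]:
    """Parse 'DD <Thai month name> YYYY'; return None if the shape is unknown."""
    match = DATE_PATTERN.fullmatch(text.strip())
    if match is None:
        return None
    day, month_name, year = match.groups()
    month = THAI_MONTHS.get(month_name)
    if month is None:
        return None
    year_number = int(year)
    # Assumption for this sketch: years above 2400 are Buddhist Era.
    if year_number > 2400:
        year_number -= 543
    return date(year_number, month, int(day))


print(parse_thai_date("15 มกราคม 2567"))
```

Abbreviated month names (such as ม.ค.) and impossible days are not handled; an impossible day raises ValueError from the date constructor.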

Testing

pytest tests/ -v              # all tests (424 passed)
pytest tests/test_oneliner.py # one-liner API tests
pytest tests/test_llm.py      # LLM + privacy mode tests
ruff check src/ tests/        # lint
ruff format src/ tests/       # format

License

Apache-2.0 © Peet Wannasarnmetha
