AutoEDA for Thai-language data: exploratory data analysis that speaks Thai

ThaiEDA

AutoEDA for Thai-language data — exploratory data analysis that understands Thai.

Quick Start

pip install "thaieda[thai]"

import pandas as pd
from thaieda import profile
from thaieda.llm import analyze_with_llm

df = pd.read_csv("data.csv")
report = profile(df, clean=True)
report.to_html("report.html")

# Ask an LLM about the data — privacy-safe by default
answer = analyze_with_llm(df, privacy="insight_only", provider="ollama")
print(answer)

Why ThaiEDA?

Thai-specific — catches Buddhist Era dates, Thai numerals, zero-width spaces, mojibake, and Thai month names that generic tools miss.
Privacy-first — LLM analysis with 4 privacy modes; the default sends zero raw data off your machine.
Auto insights — a cross-column insight engine surfaces non-obvious findings, ranked by statistical interestingness (BH-corrected).
No lock-in — generates a self-contained HTML report; works as a library or CLI; all LLM providers are optional and lazy-imported.
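To make the Thai-specific checks concrete, here is a minimal standard-library sketch of three of the normalizations listed above: stripping zero-width characters, converting Thai numerals, and shifting Buddhist Era years to Common Era. The helper names `normalize_text` and `be_to_ce` are hypothetical and are not part of the ThaiEDA API; the library's own cleaning logic may work differently.

```python
import re

# Thai digits map one-to-one onto ASCII 0-9.
THAI_DIGITS = str.maketrans("๐๑๒๓๔๕๖๗๘๙", "0123456789")

# Zero-width space and byte-order mark: invisible, but they break parsing.
ZERO_WIDTH = re.compile("[\u200b\ufeff]")

BE_OFFSET = 543  # Buddhist Era year = Common Era year + 543


def normalize_text(value: str) -> str:
    """Strip zero-width characters, then convert Thai digits to ASCII."""
    return ZERO_WIDTH.sub("", value).translate(THAI_DIGITS)


def be_to_ce(year: int) -> int:
    """Convert a Buddhist Era year to a Common Era year."""
    return year - BE_OFFSET


raw = "๒๕๖๗\u200b"  # a BE year written in Thai numerals, with a trailing zero-width space
year_be = int(normalize_text(raw))
print(year_be, be_to_ce(year_be))  # 2567 2024
```

A generic profiler would typically treat the raw value as free text; after normalization it can be parsed as an integer year.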

Features by Version

Version	Feature	Description
v0.9	Privacy-preserving LLM analysis	4 privacy modes + 3 LLM providers (OpenAI / Anthropic / Ollama)
v0.8	Data cleaning + actionable insights	Thai numeral→numeric, BE→CE, date standardization, correlation/outlier patterns, Excel support
v0.7	Insight visualization	Auto-generated charts for each cross-column finding (bar, donut, box plot, trend line)
v0.6	Cross-column insight engine	6 patterns: outstanding / attribution / comparison / trend / correlation / outlier (BH-corrected)
v0.5	Multi-file schema discovery	PK/FK matching, ER diagram, relationship validation, orphan detection
v0.4	Timeseries analysis	Trend/seasonality/STL/ACF/gaps + distribution & correlation insights
v0.3	Single-command pipeline	JSON input, auto encoding detection, auto insights, cleaning diff
v0.2	Thai NER + target analysis	pythainlp normalize, auto chart selection, unified anomaly API
v0.1	Thai text profiling	Column type detection, quality checks, HTML report, CLI
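"BH-corrected" in the table refers to the Benjamini-Hochberg procedure, which controls the false discovery rate when many candidate findings are tested at once. The sketch below is a plain-Python illustration of that procedure, not ThaiEDA's implementation; the function name `benjamini_hochberg` and the sample p-values are made up for the example.

```python
def benjamini_hochberg(p_values: list[float], alpha: float = 0.05) -> list[bool]:
    """Return, for each p-value, whether it is rejected at FDR level alpha."""
    m = len(p_values)
    # Indices of the p-values, sorted from smallest to largest p-value.
    order = sorted(range(m), key=lambda i: p_values[i])

    # Find the largest 1-based rank k with p_(k) <= (k / m) * alpha.
    max_k = 0
    for rank, idx in enumerate(order, start=1):
        if p_values[idx] <= rank / m * alpha:
            max_k = rank

    # Reject every hypothesis ranked at or below that k.
    rejected = [False] * m
    for idx in order[:max_k]:
        rejected[idx] = True
    return rejected


p = [0.041, 0.001, 0.6, 0.008, 0.039]
print(benjamini_hochberg(p))  # [False, True, False, True, False]
```

Note that 0.039 and 0.041 would pass an uncorrected 0.05 threshold but are not kept here, which is the point of correcting when many insight candidates are scored together.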

Privacy Modes (v0.9)

Control exactly what data leaves your machine when calling analyze_with_llm():

Mode	What Leaves Machine	Privacy Guarantee	Use Case
`insight_only` (default)	Summary statistics + insight cards only	Raw data never leaves	Regulated / PDPA data, cautious users
`anonymized`	Data with PII replaced by tokens (`[PHONE_1]`, `[NAME_1]`)	Names/phones/ID cards masked; `token_map` returned for local reversal	LLM needs to see structure without raw PII
`dp_noise`	Statistics with Laplace noise (configurable ε)	Noise scaled by ε limits re-identification from small-sample statistics; smaller ε means more noise	Small datasets where stats alone may leak identity
`full`	All raw data sent	None — user accepts tradeoff	Public data, dev/demo workflows
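The `anonymized` mode is described as replacing PII with tokens and returning a `token_map` for local reversal. The following is a rough sketch of that idea for phone numbers only, using the standard library. The regex, the `mask_phones` and `unmask` helpers, and the sample strings are all assumptions for illustration; ThaiEDA's actual detection rules and return types are not shown in this README.

```python
import re

# Assumed pattern: 10-digit Thai mobile numbers, with or without dashes.
# Lookarounds are used instead of \b so a number glued to Thai letters still matches.
PHONE = re.compile(r"(?<!\d)0\d{2}-?\d{3}-?\d{4}(?!\d)")


def mask_phones(texts: list[str]) -> tuple[list[str], dict[str, str]]:
    """Replace each distinct phone string with a stable token."""
    token_map: dict[str, str] = {}  # token -> original value (stays local)
    seen: dict[str, str] = {}       # original value -> token

    def replace(match: re.Match) -> str:
        value = match.group(0)
        if value not in seen:
            token = f"[PHONE_{len(seen) + 1}]"
            seen[value] = token
            token_map[token] = value
        return seen[value]

    return [PHONE.sub(replace, t) for t in texts], token_map


def unmask(text: str, token_map: dict[str, str]) -> str:
    """Restore original values in text that contains tokens."""
    for token, value in token_map.items():
        text = text.replace(token, value)
    return text


texts = ["call 081-234-5678", "alt 0898765432 or call 081-234-5678"]
masked, token_map = mask_phones(texts)
print(masked)  # ['call [PHONE_1]', 'alt [PHONE_2] or call [PHONE_1]']
print(unmask("Customer at [PHONE_2] churned", token_map))
```

Only `masked` would be sent to a remote provider in this sketch; `token_map` never leaves the machine and is used to translate the answer back.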

Examples

Basic EDA

from thaieda import profile, read_data

# Auto-reads CSV/JSON/JSONL/Excel with auto encoding detection
df = read_data("data.xlsx")

# Profile + clean + auto insights in one call
report = profile(df, clean=True)
report.to_html("report.html")

# Get Thai-language executive summary
print(report.insights.executive_summary_th)

Insight Discovery

from thaieda import discover_insights
from thaieda.detect import detect_all

# `df` is the DataFrame loaded in the Basic EDA example
result = discover_insights(df, detect_all(df), top_n=8)
for card in result.cards:
    print(f"[{card.pattern}] score={card.score:.2f}  {card.description_th}")
    print(f"  → {card.recommendation_th}")

LLM Analysis — All 4 Privacy Modes

from thaieda.llm import analyze_with_llm

# `df` is the DataFrame loaded in the Basic EDA example

# Mode 1: safest — only stats leave, no raw data (default)
answer = analyze_with_llm(df, privacy="insight_only", provider="ollama")

# Mode 2: anonymized — PII replaced with tokens before sending
answer = analyze_with_llm(df, privacy="anonymized", provider="openai", model="gpt-4o-mini")

# Mode 3: differential privacy — Laplace noise on stats
answer = analyze_with_llm(df, privacy="dp_noise", provider="anthropic", epsilon=0.5)

# Mode 4: full raw data (user accepts risk)
answer = analyze_with_llm(df, privacy="full", provider="ollama", language="en")
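For intuition about what `epsilon` controls in the `dp_noise` mode, here is a minimal sketch of the Laplace mechanism applied to a mean. It is not ThaiEDA's implementation: the helper names, the clipping bounds, and the sample values are assumptions. The noise scale is sensitivity divided by ε, so a smaller `epsilon` adds more noise.

```python
import random


def laplace_noise(scale: float, rng: random.Random) -> float:
    """Sample Laplace(0, scale) as the difference of two exponential draws."""
    return rng.expovariate(1.0 / scale) - rng.expovariate(1.0 / scale)


def noisy_mean(values: list[float], lower: float, upper: float,
               epsilon: float, rng: random.Random) -> float:
    """Release the mean of values with epsilon-differential privacy."""
    # Clip each value so one record can move the mean by a bounded amount.
    clipped = [min(max(v, lower), upper) for v in values]
    n = len(clipped)
    sensitivity = (upper - lower) / n  # max change from replacing one record
    scale = sensitivity / epsilon      # smaller epsilon -> larger noise
    return sum(clipped) / n + laplace_noise(scale, rng)


rng = random.Random(42)
ages = [23.0, 35.0, 41.0, 29.0, 52.0]
print(noisy_mean(ages, lower=0.0, upper=100.0, epsilon=0.5, rng=rng))
print(noisy_mean(ages, lower=0.0, upper=100.0, epsilon=5.0, rng=rng))
```

With only five rows the ε = 0.5 result can land far from the true mean of 36, which illustrates the privacy and accuracy tradeoff on small datasets.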

Architecture

src/thaieda/
  io/             # Auto-read CSV/JSON/JSONL/Excel + encoding detection
  detect/         # Column type detection + Thai month name detection
  tokenize/       # Tokenizer adapter: pythainlp / nlpo3 / attacut
  text/           # Text metrics: length, frequency, n-grams, TF-IDF
  quality/        # Thai quality checks + placeholder/constant detection
  anomaly/        # Anomaly detection: statistical + ML + text + unified API
  clean/          # Data cleaning: encoding, zwspace, numerals, BE→CE, dates, duplicates, missing
  ner/            # Thai NER: person/place/organization extraction
  analysis/       # Target variable analysis: Pearson/ANOVA/Chi-square
  insight/        # Auto insight summary in Thai (interpreter)
  insight_engine/ # Cross-column insight discovery: 6 patterns + BH correction
  timeseries/     # Timeseries analysis: trend/seasonality/STL/ACF/gaps
  schema/         # Multi-file schema discovery: PK/FK detection + relationship matching
  viz/            # Visualization + auto chart + Thai font + insight charts
  report/         # HTML report generation (Jinja2) + DatasetReport
  i18n/           # Bilingual labels (Thai/English)
  llm/            # Privacy-preserving LLM analysis (4 modes, 3 providers) — v0.9
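As an illustration of what the `schema/` module's orphan detection refers to, the sketch below finds child rows whose foreign key has no matching primary key in the parent table. It uses plain dictionaries rather than ThaiEDA's API; `find_orphans`, the table contents, and the column names are hypothetical.

```python
def find_orphans(child_rows: list[dict], fk: str,
                 parent_rows: list[dict], pk: str) -> list[dict]:
    """Return child rows whose non-null foreign key is missing from the parent."""
    parent_keys = {row[pk] for row in parent_rows}
    return [
        row for row in child_rows
        if row[fk] is not None and row[fk] not in parent_keys
    ]


customers = [{"customer_id": 1}, {"customer_id": 2}]
orders = [
    {"order_id": "A", "customer_id": 1},
    {"order_id": "B", "customer_id": 3},     # no such customer: orphan
    {"order_id": "C", "customer_id": None},  # missing key, not counted as orphan
]
print(find_orphans(orders, "customer_id", customers, "customer_id"))
# [{'order_id': 'B', 'customer_id': 3}]
```

Treating null keys separately from orphans is a design choice in this sketch; a real validator might report them as their own category.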

Installation

# Core library (no Thai tokenizer)
pip install thaieda

# Recommended — with Thai tokenizer
pip install "thaieda[thai]"

# Optional extras (all lazy-imported)
pip install "thaieda[ner]"          # pythainlp NER
pip install "thaieda[ml]"           # Isolation Forest / LOF anomaly detection
pip install "thaieda[timeseries]"   # STL decomposition (statsmodels)
pip install "thaieda[excel]"        # Excel support (openpyxl)
pip install "thaieda[stats]"        # p-values (scipy)
pip install "thaieda[detect]"       # auto encoding detection (chardet)

# LLM providers (v0.9 — all optional)
pip install openai                 # OpenAI GPT
pip install anthropic              # Anthropic Claude
pip install ollama                 # Ollama local server (or use built-in HTTP fallback)

Requirements: Python 3.10+, pandas, numpy, matplotlib, Jinja2
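"Lazy-imported" means an optional dependency is only imported when the feature that needs it is called, so the core install stays light. A common way to do this is sketched below; the `require` helper and its error message are assumptions for illustration and may not match how ThaiEDA handles missing extras.

```python
import importlib
from types import ModuleType


def require(module_name: str, extra: str) -> ModuleType:
    """Import an optional dependency on first use, with an actionable error."""
    try:
        return importlib.import_module(module_name)
    except ImportError as exc:
        raise ImportError(
            f"{module_name} is required for this feature; "
            f'install it with: pip install "thaieda[{extra}]"'
        ) from exc


def p_value_backend() -> ModuleType:
    # The import happens here, inside the feature, not at package import time.
    return require("scipy", "stats")


# A standard-library module is used here so the demo runs without extras.
json_module = require("json", "stats")
print(json_module.dumps({"ok": True}))  # {"ok": true}
```

Calling `p_value_backend()` without scipy installed would raise an `ImportError` that names the extra to install.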

Testing

# Run all tests
pytest tests/ -v

# Run only LLM module tests (v0.9)
pytest tests/test_llm.py -v

# Lint
ruff check src/ tests/

# Format
ruff format src/ tests/

License

Apache-2.0 © Peet Wannasarnmetha

Release history

Version	Released
2.3.0	Jun 28, 2026
2.2.0	Jun 27, 2026
2.1.1	Jun 27, 2026
2.1.0	Jun 27, 2026
2.0.0	Jun 26, 2026
1.9.3	Jun 26, 2026
1.9.2	Jun 26, 2026
1.9.1	Jun 26, 2026
1.9.0	Jun 26, 2026
1.8.0	Jun 26, 2026
1.7.1	Jun 26, 2026
1.7.0	Jun 26, 2026
1.6.0	Jun 26, 2026
1.5.0	Jun 26, 2026
1.1.0	Jun 25, 2026
1.0.1	Jun 25, 2026
1.0.0 (this version)	Jun 25, 2026

Download files

Source Distribution

thaieda-1.0.0.tar.gz (1.3 MB)

Uploaded Jun 25, 2026 Source

Built Distribution

thaieda-1.0.0-py3-none-any.whl (176.1 kB)

Uploaded Jun 25, 2026 Python 3

File details

Details for the file thaieda-1.0.0.tar.gz.

File metadata

Download URL: thaieda-1.0.0.tar.gz
Upload date: Jun 25, 2026
Size: 1.3 MB
Tags: Source
Uploaded using Trusted Publishing? No
Uploaded via: twine/6.2.0 CPython/3.11.9

File hashes

Hashes for thaieda-1.0.0.tar.gz
Algorithm	Hash digest
SHA256	`292f8b864c91bd92870a5a83349aba181e5c29e7a882630a85a7c9b574d0719a`
MD5	`4b674337a7dfbe4ca6b481d5a2655299`
BLAKE2b-256	`e14c07695820f87c9b8d6470ce83f3522c6bd526f7ea67cc5423ba2e3bde3db5`

File details

Details for the file thaieda-1.0.0-py3-none-any.whl.

File metadata

Download URL: thaieda-1.0.0-py3-none-any.whl
Upload date: Jun 25, 2026
Size: 176.1 kB
Tags: Python 3
Uploaded using Trusted Publishing? No
Uploaded via: twine/6.2.0 CPython/3.11.9

File hashes

Hashes for thaieda-1.0.0-py3-none-any.whl
Algorithm	Hash digest
SHA256	`2998139a50926c6e1f8e540d46bcb0570a82560852f21966dfa81be1ac54e8a2`
MD5	`2f232e647d9716ec36ee75f84fc9ffff`
BLAKE2b-256	`ee7e3ba6e1fcc52c4a7ee36facb6583658c4812be4714b0751c77a545e086ada`
