AutoEDA for Thai-language data: exploratory data analysis that speaks Thai

These details have not been verified by PyPI

ThaiEDA

One-line Exploratory Data Analysis and Smart Data Cleaning for Thai and Mixed-Language Datasets.

ThaiEDA answers one simple question: "Can I trust this dataset, and what should I explore first?"

Generic profiling tools count missing values and draw standard charts, but they often fail on Thai text and mixed-language data. ThaiEDA treats Thai-specific complexities (Buddhist Era (BE) dates, Thai numerals, invisible zero-width spaces, encoding errors (mojibake), local phone formats, and Thai fonts in charts) as normal data problems, so they no longer require tedious manual preprocessing.

🚀 Key Features

Smart Column & Type Detection: Identifies Thai/English text, numbers masquerading as text, Buddhist Era years, Thai phone numbers, Thai national IDs, and mixed-language columns.
One-Line AutoEDA (run): A complete pipeline that auto-detects, cleans, checks quality, finds anomalies, performs time-series/target analysis, runs a cross-column insight engine, generates charts, builds offline executive narratives, and generates HTML reports.
Thai-Aware Cleaning Pipeline (clean): Normalizes Unicode forms, removes zero-width and other invisible characters, normalizes currency and number strings, converts Buddhist Era (BE) years to Common Era (CE), corrects keyboard-layout mistakes (e.g., l;ylfu ➔ สวัสดี), protects product IDs and codes from over-normalization using a heuristic ID-likeness score, and imputes missing values with a machine-learning (ML) model.
Cross-Column Insight Engine: Automatically discovers complex relationships, outlier influences, trend evidence, Simpson's paradox, and target leakage with statistical scoring.
Multi-File Schema Discovery: Scans folders of files (profile_dataset) to discover primary/foreign key candidates and orphans, then renders schema relationships as interactive reports and Mermaid diagrams.
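Privacy-Preserving LLM Integration: Generates LLM summaries under one of 5 privacy modes (insight_only, synthetic, anonymized, dp_noise, and full) that control how much of the data the LLM provider sees. Only the full mode sends the original raw rows, so the other modes reduce the risk of exposing raw PII.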

📦 Requirements & Installation

ThaiEDA requires Python 3.10+.

Install the lightweight core package (depends on pandas, numpy, matplotlib, and Jinja2):

pip install thaieda

For advanced features, install extras for Thai NLP (tokenizers), visual enhancements, and Excel/Parquet I/O:

pip install "thaieda[thai,viz,excel,parquet]"

Or install all optional backends and dependencies at once:

pip install "thaieda[all]"

⚡ Quickstart

Here is a fully reproducible example. Copy and run this script to see ThaiEDA in action:

import pandas as pd
import thaieda

# 1. Create a messy DataFrame simulating real Thai data issues
data = {
    "name": ["สมชาย\u200bรักไทย", "สมหญิง   ใจดี", "นายดำ ๐๑"],  # Has zero-width space and multiple spaces
    "birth_year": [2530, 2532, 2528],                          # Buddhist Era (BE) years
    "sales": ["฿1,200", "฿3,500", "฿10,000"],                  # Currency formatting as text
    "phone": ["081-234-5678", "+66898765432", "๐๒-๓๔๕-๖๗๘๙"]     # Phone formats
}
df = pd.DataFrame(data)

# 2. Run the full EDA pipeline with cleaning enabled
# By default, lang="th" produces Thai reports. Set lang="en" for English.
result = thaieda.run(df, clean=True, lang="en")

# 3. Save the interactive report
result.to_html("quickstart-report.html")

# 4. Extract the clean DataFrame
cleaned_df = result.cleaned_df
print(cleaned_df)

💡 Core Recipes

1. One-Line EDA & Result Inspection

The run() function (also aliased as EDA()) performs the analysis and returns an EDAResult object:

result = thaieda.run(df, lang="en")

# Access details in Python
print(result.overview)          # Dataset metadata
print(result.quality_issues)    # Quality flags (e.g., constant columns, BE years)
print(result.anomalies)         # Outliers and text anomalies
print(result.insights)          # Discovered statistical insights
print(result.narrative)         # Offline, rule-based executive summary
print(result.llm_response)      # Optional LLM analysis response (if llm=True)

# Export options
result.to_html("report.html")   # Save HTML report
result.to_json("report.json")   # Export structured report metadata
result.to_dict()                # Convert result to Python dict

2. Standalone Cleaning Pipeline

clean() works on a copy of the DataFrame and returns both the cleaned DataFrame and a structured cleaning report.

cleaned_df, report = thaieda.clean(
    df,
    handle_missing="ml",        # Imputation strategy: flag, median, mode, drop, unknown, or ml
    remove_duplicates=True,
    fix_dates=True,             # Converts BE to CE, normalizes formats
    fix_numerals=True,          # Normalizes Thai digits to Arabic
    fix_encoding=True,          # Repairs mojibake and spacing
    downcast=True               # Optimizes data types for memory efficiency
)

# Export the audit trail of modifications
report.to_json("cleaning-audit.json")
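To make the fix_numerals, fix_dates, and fix_encoding options more concrete, here is a minimal standard-library sketch of the kinds of transformations they describe. It is an illustration only, not ThaiEDA's code: the helper names, the set of invisible characters, and the 2400 to 2700 window used to guess that a year is Buddhist Era are all assumptions made for this example.

```python
import re

# Thai digits ๐-๙ map one-to-one onto ASCII 0-9.
THAI_DIGITS = str.maketrans("๐๑๒๓๔๕๖๗๘๙", "0123456789")
# Zero-width space/non-joiner/joiner, word joiner, and BOM (assumed list).
INVISIBLE = re.compile("[\u200b\u200c\u200d\u2060\ufeff]")
BE_OFFSET = 543  # Buddhist Era year = Common Era year + 543


def normalize_digits(text: str) -> str:
    """Replace Thai digits with their ASCII equivalents."""
    return text.translate(THAI_DIGITS)


def strip_invisible(text: str) -> str:
    """Drop zero-width characters, then collapse runs of whitespace."""
    return re.sub(r"\s+", " ", INVISIBLE.sub("", text)).strip()


def be_to_ce(year: int, low: int = 2400, high: int = 2700) -> int:
    """Convert a year to CE if it falls in the assumed BE window."""
    return year - BE_OFFSET if low <= year <= high else year


def parse_baht(text: str) -> float:
    """Turn a currency string such as '฿1,200' into a float."""
    cleaned = normalize_digits(text).replace("฿", "").replace(",", "").strip()
    return float(cleaned)


print(normalize_digits("๐๒-๓๔๕-๖๗๘๙"))  # 02-345-6789
print(be_to_ce(2530))                   # 1987
print(parse_baht("฿10,000"))            # 10000.0
```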

🧼 Smart Cleaning & ID Protection

When normalizing text (for example, using fix_repeated_chars to collapse repeated characters), purely rule-based approaches can unintentionally mangle product codes, serial numbers, or model names. ThaiEDA guards against this with a heuristic protection system.

skip_id_like Parameter (default True): Under fix_repeated_chars, setting this to True protects strings and sub-tokens that look like identifier codes from being modified.
Token-Level Protection: Instead of analyzing the entire text block globally, ThaiEDA splits text into individual tokens and applies the safeguard locally. This ensures a product ID embedded within a chat or review text remains completely untouched, while surrounding natural text is cleaned.
_id_likeness_score Heuristic: Determines if a token is an ID using a 7-criteria scoring algorithm (0.0 to 1.0):
1. digit_ratio: The ratio of numeric digits to length (IDs generally have $\ge 0.3$).
2. upper_ratio: The ratio of uppercase characters to all alphabet characters (IDs generally have $> 0.5$).
3. separator: Presence of symbols like hyphens, underscores, or dots in the middle.
4. length: Usually short strings ($\le 20$ characters).
5. no_spaces: No spaces within the token.
6. alnum_mix: Mixtures of both letters and digits.
7. entropy: Lower character entropy due to repeated structures.

Examples in Action:

'55555' ➔ '555' (a run of 5s, Thai chat shorthand for laughter, is shortened)
'มากกกกกก' ➔ 'มากกก' (exaggerated Thai text is collapsed)
'SKU-AAA111' ➔ 'SKU-AAA111' (safely kept intact as an ID)
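The token-level protection described above can be sketched as follows. This is a hedged approximation rather than the library's `_id_likeness_score`: the equal weighting of the seven criteria, the 3.0-bit entropy cutoff, the 0.7 protection threshold, and the function names are all assumptions chosen so that the examples above behave as listed.

```python
import math
import re
from collections import Counter


def id_likeness_score(token: str) -> float:
    """Fraction of the seven listed criteria that a token satisfies."""
    if not token:
        return 0.0
    letters = [c for c in token if c.isalpha()]
    digits = [c for c in token if c.isdigit()]
    counts = Counter(token)
    entropy = -sum(
        (n / len(token)) * math.log2(n / len(token)) for n in counts.values()
    )
    checks = [
        len(digits) / len(token) >= 0.3,                       # digit_ratio
        bool(letters)
        and sum(c.isupper() for c in letters) / len(letters) > 0.5,  # upper_ratio
        re.search(r"(?<=\w)[-_.](?=\w)", token) is not None,   # separator
        len(token) <= 20,                                      # length
        not any(c.isspace() for c in token),                   # no_spaces
        bool(letters) and bool(digits),                        # alnum_mix
        entropy < 3.0,                                         # entropy (assumed cutoff)
    ]
    return sum(checks) / len(checks)


def collapse_repeats(text: str, keep: int = 3, threshold: float = 0.7) -> str:
    """Shorten runs of more than `keep` identical characters, token by token,
    leaving tokens that score as ID-like untouched."""
    run = re.compile(r"(.)\1{%d,}" % keep)
    parts = re.split(r"(\s+)", text)  # keep the whitespace separators
    cleaned = []
    for part in parts:
        if part.strip() and id_likeness_score(part) < threshold:
            part = run.sub(lambda m: m.group(1) * keep, part)
        cleaned.append(part)
    return "".join(cleaned)


print(collapse_repeats("รุ่น SKU-AAAA1111 ดีมากกกกกก 55555"))
# รุ่น SKU-AAAA1111 ดีมากกก 555
```

Scoring each token separately is what lets the code `SKU-AAAA1111` survive inside a sentence whose other tokens are still normalized.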

3. Comparing Two Datasets (Drift & Schema)

Use compare() to detect schema changes and statistical distribution drift between two datasets (e.g., training vs. production data).

from thaieda import compare

diff = compare(train_df, current_df, labels=("train", "current"))

print(diff["schema_diff"])          # Mismatched column names or types
print(diff["missing_diff"])         # Changes in missing value rates
print(diff["distribution_drift"])    # Numerical distribution shift (using statistical tests)
print(diff["categorical_drift"])     # Categorical frequency drift
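The README does not say which statistics compare() uses. As one plausible way to quantify categorical frequency drift, the sketch below computes the total variation distance between two samples of category labels. The function name and the sample data are hypothetical, and ThaiEDA's actual metric may differ.

```python
from collections import Counter


def total_variation_distance(reference, current):
    """Return 0.0 for identical category frequencies and 1.0 for no overlap."""
    if not reference or not current:
        raise ValueError("both samples must be non-empty")
    ref_counts, cur_counts = Counter(reference), Counter(current)
    categories = set(ref_counts) | set(cur_counts)
    # Half the sum of absolute differences between relative frequencies.
    return 0.5 * sum(
        abs(ref_counts[c] / len(reference) - cur_counts[c] / len(current))
        for c in categories
    )


train = ["กรุงเทพ", "กรุงเทพ", "เชียงใหม่", "เชียงใหม่"]
current = ["กรุงเทพ", "กรุงเทพ", "กรุงเทพ", "เชียงใหม่"]
print(total_variation_distance(train, current))  # 0.25
```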

4. Folder Schema Discovery (Multi-File)

If your folder contains multiple tables with relationships, profile_dataset() identifies key connections and outputs Mermaid schemas.

from thaieda import DatasetReport, profile_dataset

# Scan directory for CSV/JSON/TSV/Excel/Parquet tables
dataset = profile_dataset("data/warehouse", validate_values=True)

# Export interactive multi-table relationship report
DatasetReport(dataset, lang="en").to_html("schema-report.html")

# Output Mermaid diagram representing PK/FK relationships
print(dataset.to_mermaid())
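For intuition about what primary-key candidates, orphans, and the Mermaid output mean, here is a small plain-Python sketch. It does not call ThaiEDA and makes no claim about how profile_dataset() works internally: the helper names and the toy tables are hypothetical.

```python
def is_primary_key_candidate(values):
    """A column qualifies if it is non-empty, has no missing values, and no duplicates."""
    return bool(values) and None not in values and len(set(values)) == len(values)


def find_orphans(child_values, parent_values):
    """Child values (ignoring missing ones) with no match in the parent key."""
    parent = set(parent_values)
    return sorted({v for v in child_values if v is not None and v not in parent})


def mermaid_relationship(parent_table, child_table, key):
    """Render a one-to-many relationship as a Mermaid erDiagram."""
    return f'erDiagram\n    {parent_table} ||--o{{ {child_table} : "{key}"'


customers = {"customer_id": [1, 2, 3]}
orders = {"order_id": [10, 11, 12, 13], "customer_id": [1, 1, 3, 9]}

print(is_primary_key_candidate(customers["customer_id"]))             # True
print(is_primary_key_candidate(orders["customer_id"]))                # False
print(find_orphans(orders["customer_id"], customers["customer_id"]))  # [9]
print(mermaid_relationship("customers", "orders", "customer_id"))
```

Here order 13 points at customer 9, which does not exist in the parent table, so 9 is reported as an orphan.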

5. Multi-File Batch EDA

Analyze every file in a directory and compile them into a unified master report with a navigation sidebar.

import thaieda

# Scans supported file formats in the folder
folder = thaieda.run_folder(
    "data/",
    recursive=True,
    output_dir="reports",
    lang="en"
)

# Generate master index report containing all summaries
folder.to_master_html("reports/index.html")
print(folder.summary())

🔒 Privacy-Preserving LLM Analysis

ThaiEDA offers privacy-first, local-first LLM analysis. When llm=True, the data is transformed according to the chosen privacy mode before anything is sent to the provider. Only the full mode sends the original raw rows.

result = thaieda.run(
    df,
    llm=True,
    provider="openai",        # openai, anthropic, or ollama
    privacy="insight_only",    # insight_only, synthetic, anonymized, dp_noise, or full
    lang="en"
)
print(result.llm_response)

Privacy Modes Overview

Privacy Mode	What the LLM Sees	Best Used For
`insight_only`	Summary stats and statistical insights only. No raw rows.	Highly sensitive datasets; default safe setting.
`synthetic`	Generated synthetic rows that mimic the original patterns.	Sharing realistic dataset structure without raw records.
`anonymized`	Data with PII (phone, ID, name) replaced by placeholders.	Masking obvious personal identifier columns.
`dp_noise`	Aggregated summaries with Differential Privacy noise.	Protecting aggregated statistical distributions.
`full`	Original raw dataset.	Non-sensitive, public datasets.

Note: The LLM module can also be invoked independently using thaieda.llm.analyze_with_llm(...).
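To illustrate the idea behind the anonymized mode (PII replaced by placeholders), the following sketch masks Thai phone numbers and 13-digit national IDs in free text with regular expressions. It is an assumption-laden example, not ThaiEDA's anonymizer: the patterns, placeholder strings, and function name are hypothetical, names are not handled, and Thai-digit input would need to be normalized to ASCII digits first.

```python
import re

# Thai national IDs are 13 digits; mask them before phone numbers so a
# phone-like prefix inside an ID is never matched on its own.
NATIONAL_ID = re.compile(r"(?<!\d)\d{13}(?!\d)")
# Domestic (0x-xxx-xxxx, 0xx-xxx-xxxx) or +66 international form.
PHONE = re.compile(r"(?<!\d)(?:\+66|0)\d{1,2}-?\d{3}-?\d{4}(?!\d)")


def mask_pii(text: str) -> str:
    """Replace national IDs and phone numbers with fixed placeholders."""
    text = NATIONAL_ID.sub("<NATIONAL_ID>", text)
    return PHONE.sub("<PHONE>", text)


print(mask_pii("โทร 081-234-5678 หรือ +66898765432"))
# โทร <PHONE> หรือ <PHONE>
```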

💻 Command Line Interface (CLI)

ThaiEDA ships with a command-line tool.

# Get version info
thaieda --version

# Run AutoEDA report on a file
thaieda run data.csv -o report.html --lang en

# Generate report with explicit cleaning
thaieda profile data.xlsx -o profile.html --clean --lang en

# Clean data and output clean file
thaieda clean inputs.csv -o cleaned.csv

# Multi-file schema profiling
thaieda dataset data/warehouse/ -o schema-report.html --lang en

Command Reference

Command	Usage
`thaieda run`	Generates a quick HTML report from a file (includes default cleaning).
`thaieda profile`	Generates a full profile report with granular `--clean` options.
`thaieda clean`	Performs data cleaning and outputs a sanitized data file.
`thaieda dataset`	Discovers primary/foreign keys and relationships across folders.

⚖️ How ThaiEDA Compares

ThaiEDA does not replace generic profiling packages; it complements them by handling the unique nuances of Thai data.

Capability	Generic Profiling Tools (e.g., YData-Profiling)	ThaiEDA
Basic Data Profiling	✅ Excellent, detailed standard statistics	✅ Lightweight statistics and distributions
Thai-Specific Quality Checks	❌ Manual (treats BE years as outliers)	✅ Out-of-the-box (detects BE, Thai numerals)
Report-Level Cleaning	❌ None or minimal	✅ Auto-cleaning embedded in `run(clean=True)`
Interactive Viz Thai Font Support	❌ Shows unreadable squares (`[]`)	✅ Pre-configured Thai font fallback
Cross-Column Insights	⚠️ Basic correlations	✅ Scoring engine (leakage, Simpson's paradox)
Multi-File Schema Discovery	❌ Single-file focus	✅ Automatic PK/FK detection & Mermaid schemas
Dataset Comparison & Drift	⚠️ Basic comparison	✅ Detailed statistical schema & drift comparison
Privacy-First LLM Summaries	❌ None	✅ 5 levels of privacy-preserving modes

💬 FAQ

Q: Why do the charts in my report show empty boxes instead of Thai characters?

A: This happens if your OS lacks standard Thai fonts or if matplotlib cannot locate them. ThaiEDA configures fallback fonts automatically. If the issue persists, install the visualization extra (pip install "thaieda[viz]"), which packages appropriate open-source Thai fonts.

Q: Does ThaiEDA send my data to external servers for LLM analysis?

A: No, unless you explicitly enable llm=True. By default, all operations run locally. When LLM analysis is enabled, data is aggregated, synthesized, or anonymized according to your chosen privacy mode before being sent to the provider; the exception is the full mode, which sends the original dataset. You can also use Ollama as the provider to keep LLM processing on your own machine.

🛠️ Development & Contributing

To run the test suite and verify formatting, use the following commands:

# Run tests
python -m pytest tests/

# Code quality checks
ruff check src/ tests/
ruff format src/ tests/

For coding-style and community guidelines, see CONTRIBUTING.md and CODE_OF_CONDUCT.md.

📄 License

ThaiEDA is released under the Apache-2.0 License.

Release history

Version	Release date
2.3.0	Jun 28, 2026
2.2.0	Jun 27, 2026
2.1.1	Jun 27, 2026
2.1.0 (this version)	Jun 27, 2026
2.0.0	Jun 26, 2026
1.9.3	Jun 26, 2026
1.9.2	Jun 26, 2026
1.9.1	Jun 26, 2026
1.9.0	Jun 26, 2026
1.8.0	Jun 26, 2026
1.7.1	Jun 26, 2026
1.7.0	Jun 26, 2026
1.6.0	Jun 26, 2026
1.5.0	Jun 26, 2026
1.1.0	Jun 25, 2026
1.0.1	Jun 25, 2026
1.0.0	Jun 25, 2026

Download files

Source Distribution

thaieda-2.1.0.tar.gz (6.9 MB)

Uploaded Jun 27, 2026 Source

Built Distribution

thaieda-2.1.0-py3-none-any.whl (295.3 kB)

Uploaded Jun 27, 2026 Python 3

File details

Details for the file thaieda-2.1.0.tar.gz.

File metadata

Download URL: thaieda-2.1.0.tar.gz
Upload date: Jun 27, 2026
Size: 6.9 MB
Tags: Source
Uploaded using Trusted Publishing? No
Uploaded via: twine/6.2.0 CPython/3.11.9

File hashes

Hashes for thaieda-2.1.0.tar.gz
Algorithm	Hash digest
SHA256	`f7f8482ce794df67d94d4eb9d8385ca6d528a37bfc8cb47541b03ff293be50c3`
MD5	`7596e7b77fd02739d2058bce670cb62f`
BLAKE2b-256	`a38926c9355e985101275e6cddc90abd9ba512633394e089543629fc4a5b8e05`

File details

Details for the file thaieda-2.1.0-py3-none-any.whl.

File metadata

Download URL: thaieda-2.1.0-py3-none-any.whl
Upload date: Jun 27, 2026
Size: 295.3 kB
Tags: Python 3
Uploaded using Trusted Publishing? No
Uploaded via: twine/6.2.0 CPython/3.11.9

File hashes

Hashes for thaieda-2.1.0-py3-none-any.whl
Algorithm	Hash digest
SHA256	`fa7dab974a13e03b772a2063aea2b6791c0ca9a3c8e8e2e98f6d955a7176d336`
MD5	`6cd7823a236ca1e852f8967076663731`
BLAKE2b-256	`56543dbb378e561f8c02dfba8f1a2031dd1b44bacee585bfd524417ac069dc26`
