Automated, local exploratory data analysis: stats, charts, correlations, outliers, a chat assistant, and self-contained HTML reports.

eda-k

Automated, local exploratory data analysis — as a Python library you can import, with an optional Streamlit UI on top.

Runs 100% locally. Your data never leaves your machine, and no API key is needed.

Install

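From the root of a source checkout (the -e . form installs the current directory in editable mode):

pip install -e .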

Want everything in one shot (library + Streamlit app + OLS trendlines)?

pip install -e ".[app,trend]"

Or pick extras individually:

Need the bundled Streamlit app too?

pip install -e ".[app]"

Need OLS trendlines on scatter plots (charts.pairwise_scatter_with_trendline)?

pip install -e ".[trend]"

Quick start (recommended)

The simplest way to use the library — one function call analyzes your data, one method call exports a report:

import eda_k

result = eda_k.analyze("dataset.csv")     # path, file-like object, or DataFrame all work

result.summary()                          # quick text overview
result.ask("which columns have missing values?")

result.to_html("dataset_report.html")      # self-contained HTML report
result.to_csv_zip("dataset_tables.zip")    # every summary table as CSVs in one ZIP

That's the entire workflow for most use cases. Everything below explains what each piece does and how to drop down to the lower-level modules if you need more control.

The `analyze()` function and `EDAResult` object

`eda_k.analyze(source, filename=None, outlier_method="Both", correlation_method="pearson", max_sample_size=5000)`

Loads your data and runs the complete EDA pipeline in one call.

source — a file path (str/Path), an open file-like object, or an already-loaded pandas.DataFrame.
filename — only needed if source is a file-like object without a .name attribute; used to detect file type and for report titles.
outlier_method — "IQR", "Z-score", or "Both".
correlation_method — "pearson", "spearman", or "kendall".
max_sample_size — row cap used when sampling for the Shapiro-Wilk normality test (large columns get sampled down for speed).

Returns an EDAResult.
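
The filename rule above can be pictured with a small sketch. `resolve_extension` is a hypothetical helper, not part of eda_k; it assumes the file type comes from the extension of a path, of a file-like object's `.name`, or of the explicit `filename` fallback, in that order.

```python
import io
from pathlib import Path


def resolve_extension(source, filename=None):
    """Return the lowercase extension used to pick a loader (hypothetical helper)."""
    if isinstance(source, (str, Path)):
        name = str(source)  # a path carries its own name
    else:
        # File-like objects may or may not expose .name (io.BytesIO does not).
        name = getattr(source, "name", None) or filename
    if not name:
        raise ValueError("pass filename= for file-like objects without a .name")
    return Path(str(name)).suffix.lower().lstrip(".")


print(resolve_extension("data/sales.CSV"))                      # csv
print(resolve_extension(io.BytesIO(b"a,b\n1,2\n"), "up.xlsx"))  # xlsx
```

An in-memory upload such as `io.BytesIO` is the typical case where `filename` has to be supplied by the caller.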

`EDAResult` — what you get back

Member	What it does
`result.df`	The loaded `pandas.DataFrame`.
`result.results`	Raw dict of every computed table — see below.
`result.summary()`	Plain-text dataset overview (rows, columns, missing %, dtypes, memory).
`result.ask(question)`	Ask a natural-language question; see chat assistant section below.
`result.build_figures()`	Builds the full dict of Plotly figures used in the HTML report.
`result.to_html(path=None, ...)`	Builds the self-contained HTML report. Returns the HTML string; writes to `path` if given.
`result.to_csv_zip(path=None)`	Builds a ZIP of every summary table as CSV. Returns the ZIP bytes; writes to `path` if given.
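
The "returns the bytes, writes to path if given" pattern used by to_html() and to_csv_zip() can be sketched with the standard library alone. This is an illustration under assumptions, not eda_k's code: `tables_to_zip` and the table layout (lists of rows) are hypothetical stand-ins for the real summary tables.

```python
import csv
import io
import zipfile


def tables_to_zip(tables, path=None):
    """Pack {name: rows} into a ZIP of CSV files; rows are lists of lists."""
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w", zipfile.ZIP_DEFLATED) as archive:
        for name, rows in tables.items():
            text = io.StringIO()
            csv.writer(text).writerows(rows)
            archive.writestr(f"{name}.csv", text.getvalue())
    data = buffer.getvalue()
    if path is not None:  # optional side effect: also persist to disk
        with open(path, "wb") as handle:
            handle.write(data)
    return data


zip_bytes = tables_to_zip({
    "missing_summary": [["column", "missing"], ["price", 3]],
    "overview": [["rows", "columns"], [100, 4]],
})
with zipfile.ZipFile(io.BytesIO(zip_bytes)) as archive:
    print(archive.namelist())  # ['missing_summary.csv', 'overview.csv']
```

Returning the bytes even when a path is given lets the same call feed a download button or an HTTP response without re-reading the file.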

result.results contains these keys, each produced by eda_engine:

overview — shape, dtypes, missing %, duplicates, memory usage, likely datetime/ID columns
dtype_table — per-column dtype, missing count, uniqueness
missing_summary — missing count/% per column
numeric_summary — mean/median/std/skew/kurtosis/normality per numeric column
outliers — IQR and Z-score outlier counts per numeric column
categorical_summary — unique count, mode, top values per categorical column
correlation — full correlation matrix
top_correlations — strongest correlated pairs, ranked
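
To make the top_correlations entry concrete, here is a minimal pure-Python sketch of ranking column pairs by absolute Pearson correlation. The helper names and the sample columns are made up for illustration; eda_k computes this on a pandas correlation matrix, and its exact tie-breaking is not documented here.

```python
import math
from itertools import combinations


def pearson(xs, ys):
    """Pearson correlation of two equal-length, non-constant numeric sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


def top_pairs(columns, top_n=10):
    """Rank each unordered column pair once by absolute correlation."""
    pairs = [
        (a, b, pearson(columns[a], columns[b]))
        for a, b in combinations(columns, 2)
    ]
    return sorted(pairs, key=lambda p: abs(p[2]), reverse=True)[:top_n]


columns = {
    "price": [10, 20, 30, 40, 50],
    "discount": [50, 40, 30, 20, 10],  # perfectly anti-correlated with price
    "rating": [3, 5, 2, 4, 4],
}
for a, b, r in top_pairs(columns, top_n=2):
    print(f"{a} vs {b}: r = {r:+.2f}")
```

Ranking by the absolute value is what puts a strong negative relationship (price vs discount here) ahead of a weak positive one.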

Lower-level modules

If you want to call the underlying functions directly instead of using analyze(), every module is importable on its own. Mind the signatures: a common mistake is passing two Series into a chart function. Column-level chart functions take the DataFrame plus a column name string, not a Series; the summary charts (such as missing_values_bar and correlation_heatmap) take a precomputed table from the results dict.

from eda_k import eda_engine, charts, chat_assistant, report_builder

# 1. Load the file (note: pass an open file object, not just the path string)
with open("dataset.csv", "rb") as f:   # the context manager closes the handle
    df = eda_engine.load_file(f, "dataset.csv")

# 2. Run the full pipeline
results = eda_engine.run_full_eda(df)

# 3. Build individual charts: pass (df, "column_name"), not df["column_name"]
fig = charts.histogram(df, "discounted_price")        # correct
fig = charts.bar_categorical(df, "category")          # correct
# fig = charts.histogram(df["product_name"], df["category"])  # wrong: two Series, not df + col name

# 4. Build the figures dict report_builder.build_html_report() expects
ov = results["overview"]
figures = {
    "missing_bar": charts.missing_values_bar(results["missing_summary"]),
    "histograms": {c: charts.histogram(df, c) for c in ov["numeric_cols"]},
    "boxplots": {c: charts.boxplot(df, c) for c in ov["numeric_cols"]},
    "categorical_bars": {c: charts.bar_categorical(df, c) for c in ov["categorical_cols"]},
    "corr_heatmap": (
        charts.correlation_heatmap(results["correlation"])
        if not results["correlation"].empty else None
    ),
}

# 5. Build and save the HTML report
html = report_builder.build_html_report(df, results, figures, filename="dataset.csv")
with open("dataset_report.html", "w", encoding="utf-8") as f:
    f.write(html)

This is exactly what result.to_html() does internally — use the high-level analyze() API unless you specifically need this manual control.

`eda_engine` — core analysis (pandas/numpy/scipy, no UI)

Function	What it does
`load_file(file, filename)`	Loads CSV, TSV, TXT, XLSX, XLS, JSON, or Parquet into a DataFrame based on the filename extension.
`get_overview(df)`	Row/column counts, missing %, duplicate rows, numeric/categorical/datetime column lists, likely-datetime and ID-like column detection, memory usage.
`get_dtype_table(df)`	Per-column dtype, missing count/%, unique count/%, potential-ID flag.
`get_missing_summary(df)`	Missing count and % per column, sorted worst-first.
`get_numeric_summary(df, numeric_cols, max_sample_size=5000)`	Mean, median, std, min/max, IQR, CV%, skew, kurtosis, and a Shapiro-Wilk normality flag per numeric column.
`detect_outliers(df, numeric_cols, method="Both")`	IQR-fence and/or Z-score (|z| > 3) outlier counts per numeric column.
`get_categorical_summary(df, categorical_cols, top_n=10)`	Unique count, missing count, mode, mode %, and top-N value counts per categorical column.
`get_correlation(df, numeric_cols, method="pearson")`	Correlation matrix (`pearson`, `spearman`, or `kendall`).
`get_top_correlated_pairs(corr_df, top_n=10)`	Strongest correlated column pairs, ranked by absolute correlation.
`run_full_eda(df, outlier_method="Both", correlation_method="pearson", max_sample_size=5000)`	Runs everything above and returns it all as one results dict — this is what `analyze()` calls.
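
The two outlier rules named in the table can be sketched without pandas. This is an illustration, not eda_k's implementation: the 1.5 × IQR fence multiplier, the quartile convention, and the use of the population standard deviation are assumptions; only the |z| > 3 threshold comes from the table above.

```python
import statistics


def iqr_outliers(values, k=1.5):
    """Values outside [Q1 - k*IQR, Q3 + k*IQR] (k=1.5 is an assumed default)."""
    q1, _, q3 = statistics.quantiles(values, n=4, method="inclusive")
    low, high = q1 - k * (q3 - q1), q3 + k * (q3 - q1)
    return [v for v in values if v < low or v > high]


def zscore_outliers(values, threshold=3.0):
    """Values whose |z| exceeds the threshold (population std, an assumption)."""
    mean = statistics.fmean(values)
    std = statistics.pstdev(values)
    if std == 0:  # constant column: no value can be an outlier
        return []
    return [v for v in values if abs((v - mean) / std) > threshold]


prices = [12, 13, 12, 14, 13, 15, 14, 13, 12, 14, 13, 15, 14, 13, 95]
print(iqr_outliers(prices))     # [95]
print(zscore_outliers(prices))  # [95]
```

The two rules often disagree on skewed or small columns, because a single extreme value inflates the standard deviation and shrinks its own z-score, which is one reason to report both.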

`charts` — Plotly chart builders

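Column-level functions take (df, column_name) (or a column list), not raw Series; the summary charts (missing_values_bar, missing_pattern_heatmap, correlation_heatmap) take a precomputed table instead. Every function returns a Plotly figure, or None if there isn't enough data to plot.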

Function	Chart
`missing_values_bar(missing_df)`	Bar chart of missing values by column.
`missing_pattern_heatmap(missing_matrix)`	Heatmap of where missing values cluster across rows/columns.
`correlation_heatmap(corr_df)`	Annotated correlation heatmap.
`histogram(df, col, bins=40)`	Histogram with marginal boxplot, mean/median lines. (Numeric columns only.)
`boxplot(df, col)`	Boxplot with IQR fences annotated, outliers highlighted. (Numeric columns only.)
`qq_plot(df, col)`	Q-Q plot against a normal distribution (needs scipy).
`bar_categorical(df, col, top_n=15)`	Horizontal bar chart of the top-N most frequent values. (Categorical columns.)
`scatter_matrix(df, numeric_cols, max_cols=5)`	Pairwise scatter matrix across several numeric columns at once.
`pairwise_scatter(df, col_x, col_y)`	Scatter plot of two numeric columns with a correlation annotation.
`pairwise_scatter_with_trendline(df, col_x, col_y)`	Same as above, plus an OLS trendline (needs the `[trend]` extra / `statsmodels`; falls back to a plain scatter if not installed).
`time_series_plot(df, date_col, value_col)`	Line chart of a value over a datetime column.
`multi_histogram(df, numeric_cols, max_cols=4)`	Grid of histograms for several numeric columns at once.
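
For readers unfamiliar with Q-Q plots, the points behind a chart like qq_plot can be computed as follows. This is a standard-library sketch, not the code in charts; the (i + 0.5) / n plotting positions are one common convention and may differ slightly from what scipy-based implementations use.

```python
from statistics import NormalDist, fmean, stdev


def qq_points(values):
    """Pair each sorted value with the normal quantile at the same rank."""
    ordered = sorted(values)
    n = len(ordered)
    # Reference normal fitted to the sample (needs >= 2 distinct values).
    reference = NormalDist(fmean(ordered), stdev(ordered))
    # Plotting positions (i + 0.5) / n stay strictly inside (0, 1).
    theoretical = [reference.inv_cdf((i + 0.5) / n) for i in range(n)]
    return list(zip(theoretical, ordered))


sample = [4.1, 5.0, 5.2, 4.8, 5.5, 4.6, 5.1, 4.9, 5.3, 4.7]
for expected, observed in qq_points(sample):
    print(f"{expected:6.2f}  {observed:6.2f}")
```

If the data are close to normal, the pairs fall near the line y = x; systematic curvature at the ends points to skew or heavy tails.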

`chat_assistant` — local rule-based Q&A

answer_question(question, df, results) answers natural-language questions about the dataset using the results dict — no API key, no internet, pure keyword matching against the EDA results. Recognized topics:

Summary/overview — "give me a summary of this dataset", "tell me about this data"
Missing values — "which columns have missing values?", "any nulls?"
Correlations — "what are the top correlated pairs?", "any relationships?"
Outliers — "which columns have the most outliers?"
Duplicates — "are there duplicate rows?"
Numeric columns — "describe the numeric columns"
Categorical columns — "what categorical columns are in the data?"
Skewness — "which column has the highest skewness?"
Normality — "is this data normally distributed?"
A specific column by name — e.g. "describe discounted_price" (fuzzy-matches column names, including ones in quotes)
Row/column counts — "how many rows?", "how many columns?"
Help — "help", "what can you do?"

SUGGESTED_QUESTIONS is a ready-made list of example prompts (used to populate quick-reply buttons in the Streamlit UI, but usable anywhere).
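
The keyword-matching idea, including fuzzy column lookup, can be sketched as below. Everything here is hypothetical (the `TOPICS` table, `classify`, the 0.85 cutoff, and checking columns before topics); it shows the general technique rather than the rules chat_assistant actually ships.

```python
import difflib
import re

# Hypothetical topic table: each topic maps to substrings that trigger it.
TOPICS = {
    "missing": ("missing", "null", "nan"),
    "duplicates": ("duplicate",),
    "outliers": ("outlier",),
    "correlation": ("correlat", "relationship"),
}


def classify(question, column_names):
    """Return ("column", name), ("topic", name), or ("unknown", None)."""
    text = question.lower()
    words = re.findall(r"[a-z0-9_]+", text)  # also strips surrounding quotes
    # Fuzzy column match first, so "describe discounted_price" hits the column.
    lowered = {c.lower(): c for c in column_names}
    for word in words:
        close = difflib.get_close_matches(word, list(lowered), n=1, cutoff=0.85)
        if close:
            return "column", lowered[close[0]]
    for topic, keywords in TOPICS.items():
        if any(keyword in text for keyword in keywords):
            return "topic", topic
    return "unknown", None


cols = ["discounted_price", "category", "rating"]
print(classify("which columns have missing values?", cols))  # ('topic', 'missing')
print(classify("describe discounted_price", cols))           # ('column', 'discounted_price')
print(classify("stats for 'discount_price' please", cols))   # ('column', 'discounted_price')
```

A rule table like this is deterministic and needs no network access, at the cost of only understanding phrasings its keywords anticipate.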

`report_builder` — self-contained HTML report

build_html_report(df, results, figures, filename="dataset", include_advanced_stats=True) assembles one standalone HTML file (Plotly JS embedded inline, so it works fully offline — open it in any browser, or print to PDF). It includes:

Header with dataset name and generation timestamp
Stat cards (rows, columns, missing %, duplicates, numeric/categorical counts, memory usage)
Column type & completeness table
Missing values chart + table
Numeric summary table, plus skew/kurtosis/normality table
Outlier detection table (IQR + Z-score) with method explanations
A histogram + boxplot pair for every numeric column
Correlation heatmap + top correlated pairs table
A bar chart + top-values table for every categorical column
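
As a rough picture of what "self-contained" means for the header and stat cards, the sketch below renders a page whose markup and CSS live in one string with no external references. `build_report` and the sample numbers are invented for illustration; the real report additionally inlines Plotly JS and the figures.

```python
import html
from datetime import datetime


def build_report(title, stats):
    """Render a header plus one stat card per (label, value) pair, CSS inline."""
    cards = "\n".join(
        f'<div class="card"><b>{html.escape(str(value))}</b>'
        f"<span>{html.escape(label)}</span></div>"
        for label, value in stats.items()
    )
    stamp = datetime.now().strftime("%Y-%m-%d %H:%M")
    return (
        "<!DOCTYPE html>\n<html><head><meta charset='utf-8'>"
        f"<title>{html.escape(title)}</title>"
        "<style>.card{display:inline-block;margin:8px;padding:12px;"
        "border:1px solid #ccc}.card span{display:block}</style></head>"
        f"<body><h1>{html.escape(title)}</h1><p>Generated {stamp}</p>"
        f"{cards}</body></html>"
    )


# Illustrative values only, not taken from a real dataset.
page = build_report("dataset.csv", {"Rows": 1465, "Columns": 16, "Missing %": 0.2})
with open("stat_cards_demo.html", "w", encoding="utf-8") as handle:
    handle.write(page)
```

Escaping every interpolated value matters here because column names and file names come straight from user data.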

The optional Streamlit app (installed with the [app] extra) opens a browser tab with file upload, tabs (Overview / Missing / Numeric / Outliers / Categorical / Correlation / Chat / Download), and one-click export of the HTML report or a ZIP of CSVs. It behaves the same as before, but is now built on top of the installed eda_k package instead of loose scripts.

Supported file types

CSV, TSV, TXT (auto-delimiter-detect), XLSX, XLS, JSON, Parquet.
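
Delimiter auto-detection for TXT files can be done in several ways; one common approach, shown here as an assumption rather than as eda_k's method, is the standard library's `csv.Sniffer`. The `read_delimited` helper and the candidate list are hypothetical.

```python
import csv
import io


def read_delimited(text, candidates=",;\t|"):
    """Guess the delimiter from a sample of the text, then parse every row."""
    dialect = csv.Sniffer().sniff(text[:4096], delimiters=candidates)
    rows = list(csv.reader(io.StringIO(text), dialect))
    return dialect.delimiter, rows


delimiter, rows = read_delimited("id;name;price\n1;pen;2.5\n2;ink;7.0\n")
print(repr(delimiter), rows[0])  # ';' ['id', 'name', 'price']
```

`Sniffer.sniff` raises `csv.Error` when it cannot decide, so production code would typically catch that and fall back to a default delimiter.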

Notes / known limits

Very large files (millions of rows) will be slower to chart; consider sampling first if you hit performance issues.
The "likely datetime column" detector is a heuristic that inspects only a small sample, so check its result against the Overview before relying on it.
The Shapiro-Wilk normality test samples large columns down to max_sample_size rows (5,000 by default) for speed.
Column-level chart functions take a DataFrame plus a column name (charts.histogram(df, "col")), not a Series (charts.histogram(df["col"])). Passing a Series will raise an error or silently misbehave, depending on the function.
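
For the "consider sampling first" advice, one option that never loads the whole file is single-pass reservoir sampling. The sketch below uses only the standard library; `sample_rows` is a hypothetical helper, not an eda_k function.

```python
import csv
import io
import random


def sample_rows(lines, k, seed=0):
    """Reservoir-sample k data rows from CSV lines in one pass; keeps the header."""
    reader = csv.reader(lines)
    header = next(reader)
    rng = random.Random(seed)  # seeded so the sample is reproducible
    reservoir = []
    for index, row in enumerate(reader):
        if index < k:
            reservoir.append(row)  # fill the reservoir first
        else:
            slot = rng.randint(0, index)  # inclusive on both ends
            if slot < k:
                reservoir[slot] = row
    return header, reservoir


big_csv = io.StringIO("id,value\n" + "".join(f"{i},{i * i}\n" for i in range(100_000)))
header, rows = sample_rows(big_csv, k=1_000)
print(header, len(rows))  # ['id', 'value'] 1000
```

The sampled rows can then be turned into a DataFrame and passed to analyze(), which accepts an already-loaded DataFrame. If the data already fits in memory, pandas' own `DataFrame.sample(n=..., random_state=...)` is the simpler route.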

Release history

0.1.2 (this version): Jun 29, 2026
0.1.1: Jun 25, 2026
0.1.0: Jun 25, 2026

