Automated, local exploratory data analysis: stats, charts, correlations, outliers, a chat assistant, and self-contained HTML reports.

eda-k

Automated, local exploratory data analysis — as a Python library you can import, with an optional Streamlit UI on top.

Runs 100% locally. Your data never leaves your machine, no API key needed.


Install

pip install -e .

Want everything in one shot (library + Streamlit app + OLS trendlines)?

pip install -e ".[app,trend]"

Or pick extras individually:

Need the bundled Streamlit app too?

pip install -e ".[app]"

Need OLS trendlines on scatter plots (charts.pairwise_scatter_with_trendline)?

pip install -e ".[trend]"

Quick start (recommended)

The simplest way to use the library — one function call analyzes your data, one method call exports a report:

import eda_k

result = eda_k.analyze("dataset.csv")     # path, file-like object, or DataFrame all work

result.summary()                          # quick text overview
result.ask("which columns have missing values?")

result.to_html("dataset_report.html")      # self-contained HTML report
result.to_csv_zip("dataset_tables.zip")    # every summary table as CSVs in one ZIP

That's the entire workflow for most use cases. Everything below explains what each piece does and how to drop down to the lower-level modules if you need more control.


The analyze() function and EDAResult object

eda_k.analyze(source, filename=None, outlier_method="Both", correlation_method="pearson", max_sample_size=5000)

Loads your data and runs the complete EDA pipeline in one call.

  • source — a file path (str/Path), an open file-like object, or an already-loaded pandas.DataFrame.
  • filename — only needed if source is a file-like object without a .name attribute; used to detect file type and for report titles.
  • outlier_method — "IQR", "Z-score", or "Both".
  • correlation_method — "pearson", "spearman", or "kendall".
  • max_sample_size — row cap used when sampling for the Shapiro-Wilk normality test (large columns get sampled down for speed).

Returns an EDAResult.
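
The max_sample_size cap can be pictured as a random down-sampling step applied to a column before the normality test runs. The sketch below is an illustration under assumptions, not eda_k's own code: the helper name cap_sample, the fixed seed, and the treatment of None as missing are all hypothetical, and it uses only the standard library.

```python
import random


def cap_sample(values, max_sample_size=5000, seed=0):
    """Return at most max_sample_size non-missing values, sampled without replacement."""
    clean = [v for v in values if v is not None]  # assumption: None marks a missing value
    if len(clean) <= max_sample_size:
        return clean  # small columns are passed through untouched
    rng = random.Random(seed)  # fixed seed so repeated runs pick the same rows
    return rng.sample(clean, max_sample_size)


# A 20,000-row column with a few missing entries is cut down to the cap.
column = list(range(20_000)) + [None] * 10
sample = cap_sample(column)
print(len(sample))
```

A cap of 5,000 is a sensible default for speed, and SciPy's documentation for scipy.stats.shapiro also notes that p-values may be inaccurate above that size.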

EDAResult — what you get back

Member | What it does
result.df | The loaded pandas.DataFrame.
result.results | Raw dict of every computed table (see below).
result.summary() | Plain-text dataset overview (rows, columns, missing %, dtypes, memory).
result.ask(question) | Ask a natural-language question; see the chat assistant section below.
result.build_figures() | Builds the full dict of Plotly figures used in the HTML report.
result.to_html(path=None, ...) | Builds the self-contained HTML report. Returns the HTML string; writes to path if given.
result.to_csv_zip(path=None) | Builds a ZIP of every summary table as CSV. Returns the ZIP bytes; writes to path if given.
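
To make the "returns the bytes, writes to path if given" pattern of to_csv_zip concrete, here is a minimal standard-library sketch. It is an assumption about the general shape, not the package's implementation: the function name tables_to_zip_bytes and the rows-as-lists input format are hypothetical (the real method works from the pandas tables in result.results).

```python
import csv
import io
import zipfile


def tables_to_zip_bytes(tables, path=None):
    """Pack {name: rows} into an in-memory ZIP holding one CSV per table.

    Each `rows` value is a list of lists whose first entry is the header row.
    """
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w", zipfile.ZIP_DEFLATED) as archive:
        for name, rows in tables.items():
            text = io.StringIO()
            csv.writer(text).writerows(rows)
            archive.writestr(f"{name}.csv", text.getvalue())
    data = buffer.getvalue()  # read after the archive is closed and finalized
    if path is not None:
        with open(path, "wb") as handle:
            handle.write(data)
    return data


zip_bytes = tables_to_zip_bytes({
    "missing_summary": [["column", "missing_count"], ["price", 3]],
    "overview": [["rows", "columns"], [100, 4]],
})
```

Returning the bytes as well as optionally writing them lets the same function serve both a script (write to disk) and a web UI download button (serve from memory).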

result.results contains these keys, each produced by eda_engine:

  • overview — shape, dtypes, missing %, duplicates, memory usage, likely datetime/ID columns
  • dtype_table — per-column dtype, missing count, uniqueness
  • missing_summary — missing count/% per column
  • numeric_summary — mean/median/std/skew/kurtosis/normality per numeric column
  • outliers — IQR and Z-score outlier counts per numeric column
  • categorical_summary — unique count, mode, top values per categorical column
  • correlation — full correlation matrix
  • top_correlations — strongest correlated pairs, ranked

Lower-level modules

If you want to call the underlying functions directly instead of using analyze(), every module is importable on its own. Mind the signatures: a common mistake is passing two Series into a chart function. Column-level chart functions take the DataFrame plus a column name string, not Series.

from eda_k import eda_engine, charts, chat_assistant, report_builder

# 1. Load the file (note: pass an open file object, not just the path string)
with open("dataset.csv", "rb") as f:   # the with block closes the file handle afterwards
    df = eda_engine.load_file(f, "dataset.csv")

# 2. Run the full pipeline
results = eda_engine.run_full_eda(df)

# 3. Build individual charts: pass (df, "column_name"), not df["column_name"]
fig = charts.histogram(df, "discounted_price")    # correct
fig = charts.bar_categorical(df, "category")      # correct
# fig = charts.histogram(df["product_name"], df["category"])  # wrong: two Series, not df + column name

# 4. Build the figures dict report_builder.build_html_report() expects
ov = results["overview"]
figures = {
    "missing_bar": charts.missing_values_bar(results["missing_summary"]),
    "histograms": {c: charts.histogram(df, c) for c in ov["numeric_cols"]},
    "boxplots": {c: charts.boxplot(df, c) for c in ov["numeric_cols"]},
    "categorical_bars": {c: charts.bar_categorical(df, c) for c in ov["categorical_cols"]},
    "corr_heatmap": (
        charts.correlation_heatmap(results["correlation"])
        if not results["correlation"].empty else None
    ),
}

# 5. Build and save the HTML report
html = report_builder.build_html_report(df, results, figures, filename="dataset.csv")
with open("dataset_report.html", "w", encoding="utf-8") as f:
    f.write(html)

This is exactly what result.to_html() does internally — use the high-level analyze() API unless you specifically need this manual control.

eda_engine — core analysis (pandas/numpy/scipy, no UI)

Function | What it does
load_file(file, filename) | Loads CSV, TSV, TXT, XLSX, XLS, JSON, or Parquet into a DataFrame based on the filename extension.
get_overview(df) | Row/column counts, missing %, duplicate rows, numeric/categorical/datetime column lists, likely-datetime and ID-like column detection, memory usage.
get_dtype_table(df) | Per-column dtype, missing count/%, unique count/%, potential-ID flag.
get_missing_summary(df) | Missing count and % per column, sorted worst-first.
get_numeric_summary(df, numeric_cols, max_sample_size=5000) | Mean, median, std, min/max, IQR, CV%, skew, kurtosis, and a Shapiro-Wilk normality flag per numeric column.
detect_outliers(df, numeric_cols, method="Both") | IQR-fence and/or Z-score (|z|>3) outlier counts per numeric column.
get_categorical_summary(df, categorical_cols, top_n=10) | Unique count, missing count, mode, mode %, and top-N value counts per categorical column.
get_correlation(df, numeric_cols, method="pearson") | Correlation matrix (pearson, spearman, or kendall).
get_top_correlated_pairs(corr_df, top_n=10) | Strongest correlated column pairs, ranked by absolute correlation.
run_full_eda(df, outlier_method="Both", correlation_method="pearson", max_sample_size=5000) | Runs everything above and returns it all as one results dict; this is what analyze() calls.
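
The two outlier rules named for detect_outliers can be sketched with the standard library alone. This is an illustration, not eda_engine's implementation: the function names are hypothetical, and the 1.5 × IQR fence multiplier, the "inclusive" quartile method, and the population standard deviation are assumptions (the README states only "IQR-fence" and |z|>3).

```python
import statistics


def iqr_outlier_count(values, k=1.5):
    """Count values outside [Q1 - k*IQR, Q3 + k*IQR]."""
    q1, _, q3 = statistics.quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    lower, upper = q1 - k * iqr, q3 + k * iqr
    return sum(1 for v in values if v < lower or v > upper)


def zscore_outlier_count(values, threshold=3.0):
    """Count values whose |z| exceeds the threshold."""
    mean = statistics.fmean(values)
    std = statistics.pstdev(values)  # assumption: population std (ddof=0)
    if std == 0:
        return 0  # a constant column has no outliers under this rule
    return sum(1 for v in values if abs((v - mean) / std) > threshold)


def count_outliers(values, method="Both"):
    """Mirror the "IQR" / "Z-score" / "Both" switch described above."""
    counts = {}
    if method in ("IQR", "Both"):
        counts["iqr"] = iqr_outlier_count(values)
    if method in ("Z-score", "Both"):
        counts["zscore"] = zscore_outlier_count(values)
    return counts


prices = [1, 2, 3, 4, 5, 6, 7, 8, 9, 100]
print(count_outliers(prices))
```

On this ten-value sample the IQR rule flags 100 but the Z-score rule flags nothing. A single extreme value inflates the standard deviation, and with a population standard deviation no point in a sample of n values can have |z| above sqrt(n - 1), which is exactly 3 here. That is one reason running both methods is a useful default.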

charts — Plotly chart builders

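Column-level functions take (df, column_name) (or a column list), not raw Series; missing_values_bar, missing_pattern_heatmap, and correlation_heatmap take a precomputed table instead. Each returns a Plotly figure (or None if there isn't enough data to plot).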

Function | Chart
missing_values_bar(missing_df) | Bar chart of missing values by column.
missing_pattern_heatmap(missing_matrix) | Heatmap of where missing values cluster across rows/columns.
correlation_heatmap(corr_df) | Annotated correlation heatmap.
histogram(df, col, bins=40) | Histogram with marginal boxplot, mean/median lines. (Numeric columns only.)
boxplot(df, col) | Boxplot with IQR fences annotated, outliers highlighted. (Numeric columns only.)
qq_plot(df, col) | Q-Q plot against a normal distribution (needs scipy).
bar_categorical(df, col, top_n=15) | Horizontal bar chart of the top-N most frequent values. (Categorical columns.)
scatter_matrix(df, numeric_cols, max_cols=5) | Pairwise scatter matrix across several numeric columns at once.
pairwise_scatter(df, col_x, col_y) | Scatter plot of two numeric columns with a correlation annotation.
pairwise_scatter_with_trendline(df, col_x, col_y) | Same as above, plus an OLS trendline (needs the [trend] extra / statsmodels; falls back to a plain scatter if not installed).
time_series_plot(df, date_col, value_col) | Line chart of a value over a datetime column.
multi_histogram(df, numeric_cols, max_cols=4) | Grid of histograms for several numeric columns at once.

chat_assistant — local rule-based Q&A

answer_question(question, df, results) answers natural-language questions about the dataset using the results dict — no API key, no internet, pure keyword matching against the EDA results. Recognized topics:

  • Summary/overview — "give me a summary of this dataset", "tell me about this data"
  • Missing values — "which columns have missing values?", "any nulls?"
  • Correlations — "what are the top correlated pairs?", "any relationships?"
  • Outliers — "which columns have the most outliers?"
  • Duplicates — "are there duplicate rows?"
  • Numeric columns — "describe the numeric columns"
  • Categorical columns — "what categorical columns are in the data?"
  • Skewness — "which column has the highest skewness?"
  • Normality — "is this data normally distributed?"
  • A specific column by name — e.g. "describe discounted_price" (fuzzy-matches column names, including ones in quotes)
  • Row/column counts — "how many rows?", "how many columns?"
  • Help — "help", "what can you do?"

SUGGESTED_QUESTIONS is a ready-made list of example prompts (used to populate quick-reply buttons in the Streamlit UI, but usable anywhere).
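
Keyword matching with a fuzzy column-name fallback, as described for answer_question, can be pictured roughly as follows. This is a hedged sketch rather than the assistant's actual logic: the TOPIC_KEYWORDS table, the route_question name, and the 0.8 similarity cutoff are all hypothetical, and only four of the topics are shown.

```python
import difflib

# Hypothetical topic -> trigger substrings; the real keyword lists are not shown in this README.
TOPIC_KEYWORDS = {
    "missing": ("missing", "null", "nan"),
    "correlation": ("correlat", "relationship"),
    "outliers": ("outlier",),
    "duplicates": ("duplicate",),
}


def route_question(question, column_names):
    """Return ("topic", name), ("column", name), or ("help", None)."""
    text = question.lower()
    # 1. Plain substring matching against each topic's keywords.
    for topic, keywords in TOPIC_KEYWORDS.items():
        if any(keyword in text for keyword in keywords):
            return ("topic", topic)
    # 2. Otherwise, fuzzy-match each word (quotes stripped) against column names.
    lowered = {name.lower(): name for name in column_names}
    for word in text.replace('"', " ").replace("'", " ").split():
        match = difflib.get_close_matches(word, list(lowered), n=1, cutoff=0.8)
        if match:
            return ("column", lowered[match[0]])
    # 3. Nothing recognized: fall back to the help text.
    return ("help", None)


columns = ["discounted_price", "category"]
print(route_question("Any nulls?", columns))
print(route_question('describe "discountd_price"', columns))
```

Because everything is string matching on the question and lookups in already-computed results, an approach like this needs no model, API key, or network access.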

report_builder — self-contained HTML report

build_html_report(df, results, figures, filename="dataset", include_advanced_stats=True) assembles one standalone HTML file (Plotly JS embedded inline, so it works fully offline — open it in any browser, or print to PDF). It includes:

  • Header with dataset name and generation timestamp
  • Stat cards (rows, columns, missing %, duplicates, numeric/categorical counts, memory usage)
  • Column type & completeness table
  • Missing values chart + table
  • Numeric summary table, plus skew/kurtosis/normality table
  • Outlier detection table (IQR + Z-score) with method explanations
  • A histogram + boxplot pair for every numeric column
  • Correlation heatmap + top correlated pairs table
  • A bar chart + top-values table for every categorical column

Streamlit app (optional)

The bundled Streamlit app (installed with the [app] extra) opens a browser tab with file upload, tabs (Overview / Missing / Numeric / Outliers / Categorical / Correlation / Chat / Download), and one-click export of the HTML report or a ZIP of CSVs. It is built on top of the installed eda_k package rather than loose scripts.

Supported file types

CSV, TSV, TXT (auto-delimiter-detect), XLSX, XLS, JSON, Parquet.
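
Delimiter auto-detection for TXT files can be approximated with the standard library's csv.Sniffer. The sketch below is one possible approach, not necessarily what load_file does: the detect_delimiter helper, the candidate set, and the comma fallback are assumptions.

```python
import csv


def detect_delimiter(sample_text, candidates=",\t;|", default=","):
    """Guess the delimiter from a text sample, falling back to `default`."""
    try:
        return csv.Sniffer().sniff(sample_text, delimiters=candidates).delimiter
    except csv.Error:
        # Raised when no candidate splits the lines consistently.
        return default


semicolon_sample = "name;age;city\nAda;36;London\nLin;29;Taipei\n"
print(repr(detect_delimiter(semicolon_sample)))
```

In practice you would pass only the first few kilobytes of the file as the sample, then hand the detected delimiter to the actual reader.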

Notes / known limits

  • Very large files (millions of rows) will be slower to chart; consider sampling first if you hit performance issues.
  • The "likely datetime column" detector is a heuristic that inspects only a small sample, so confirm its guesses against the Overview before relying on them.
  • Normality test (Shapiro-Wilk) auto-samples to 5,000 rows for large columns for speed.
  • Chart functions take a DataFrame + column name (charts.histogram(df, "col")), not a Series (charts.histogram(df["col"])). Passing a Series will raise an error or silently misbehave, depending on the function.
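
To see why a sample-based datetime detector deserves a second look, consider this minimal sketch of such a heuristic. It is purely illustrative: the candidate formats, the 20-value sample, the 80% threshold, and the function names are assumptions, and eda_k's detector may work quite differently.

```python
from datetime import datetime

# Assumed candidate formats; the README does not list the ones eda_k checks.
CANDIDATE_FORMATS = ("%Y-%m-%d", "%Y-%m-%d %H:%M:%S", "%d/%m/%Y", "%m/%d/%Y")


def parses_as_datetime(text):
    """True if the text matches any candidate format."""
    for fmt in CANDIDATE_FORMATS:
        try:
            datetime.strptime(text, fmt)
            return True
        except ValueError:
            continue
    return False


def looks_like_datetime(values, sample_size=20, min_share=0.8):
    """Flag a column when most of the first few non-empty values parse as dates."""
    sample = [str(v).strip() for v in values if v not in (None, "")][:sample_size]
    if not sample:
        return False
    hits = sum(parses_as_datetime(v) for v in sample)
    return hits / len(sample) >= min_share


print(looks_like_datetime(["2024-01-05", "2024-02-11", None, "2024-03-09"]))
print(looks_like_datetime(["A12", "B07", "C33"]))
# Only the first 20 values are inspected, so later junk goes unnoticed:
print(looks_like_datetime(["2024-01-01"] * 20 + ["n/a"] * 980))
```

The last call flags the column as a datetime even though 98% of its values are not dates, which is the kind of false positive worth checking for by hand.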
