Automated, local exploratory data analysis: stats, charts, correlations, outliers, a chat assistant, and self-contained HTML reports.
eda-k
Automated, local exploratory data analysis — as a Python library you can
import, with an optional Streamlit UI on top.
Runs 100% locally. Your data never leaves your machine, no API key needed.
Install
pip install -e .
Want everything in one shot (library + Streamlit app + OLS trendlines)?
pip install -e ".[app,trend]"
Or pick extras individually:
Need the bundled Streamlit app too?
pip install -e ".[app]"
Need OLS trendlines on scatter plots (charts.pairwise_scatter_with_trendline)?
pip install -e ".[trend]"
Quick start (recommended)
The simplest way to use the library — one function call analyzes your data, one method call exports a report:
```python
import eda_k

result = eda_k.analyze("dataset.csv")    # path, file-like object, or DataFrame all work
result.summary()                         # quick text overview
result.ask("which columns have missing values?")
result.to_html("dataset_report.html")    # self-contained HTML report
result.to_csv_zip("dataset_tables.zip")  # every summary table as CSVs in one ZIP
```
That's the entire workflow for most use cases. Everything below explains what each piece does and how to drop down to the lower-level modules if you need more control.
The analyze() function and EDAResult object
eda_k.analyze(source, filename=None, outlier_method="Both", correlation_method="pearson", max_sample_size=5000)
Loads your data and runs the complete EDA pipeline in one call.
- source: a file path (str/Path), an open file-like object, or an already-loaded pandas.DataFrame.
- filename: only needed if source is a file-like object without a .name attribute; used to detect the file type and for report titles.
- outlier_method: "IQR", "Z-score", or "Both".
- correlation_method: "pearson", "spearman", or "kendall".
- max_sample_size: row cap used when sampling for the Shapiro-Wilk normality test (large columns get sampled down for speed).
Returns an EDAResult.
EDAResult — what you get back
| Member | What it does |
|---|---|
| result.df | The loaded pandas.DataFrame. |
| result.results | Raw dict of every computed table (see below). |
| result.summary() | Plain-text dataset overview (rows, columns, missing %, dtypes, memory). |
| result.ask(question) | Ask a natural-language question; see chat assistant section below. |
| result.build_figures() | Builds the full dict of Plotly figures used in the HTML report. |
| result.to_html(path=None, ...) | Builds the self-contained HTML report. Returns the HTML string; writes to path if given. |
| result.to_csv_zip(path=None) | Builds a ZIP of every summary table as CSV. Returns the ZIP bytes; writes to path if given. |
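
The "returns the bytes, writes to path if given" behaviour of to_csv_zip() can be illustrated with the standard library alone. This is a minimal sketch and not the eda_k implementation: the helper name tables_to_zip_bytes and the list-of-rows table format are assumptions made for the example.

```python
import csv
import io
import zipfile


def tables_to_zip_bytes(tables, path=None):
    """Pack named tables into a ZIP of CSV files and return the ZIP bytes.

    `tables` maps a table name to a list of rows (first row is the header).
    Hypothetical helper for illustration only, not part of eda_k.
    """
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w", zipfile.ZIP_DEFLATED) as archive:
        for name, rows in tables.items():
            text = io.StringIO()
            csv.writer(text).writerows(rows)
            archive.writestr(f"{name}.csv", text.getvalue())
    data = buffer.getvalue()
    if path is not None:  # only touch the disk when a path is given
        with open(path, "wb") as handle:
            handle.write(data)
    return data


zip_bytes = tables_to_zip_bytes({
    "missing_summary": [["column", "missing_count"], ["price", 3]],
    "overview": [["rows", "columns"], [100, 5]],
})
names = sorted(zipfile.ZipFile(io.BytesIO(zip_bytes)).namelist())
```

Returning the bytes even when a path is given lets the same call feed a download button or an in-memory pipeline without re-reading the file.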
result.results contains these keys, each produced by eda_engine:
- overview: shape, dtypes, missing %, duplicates, memory usage, likely datetime/ID columns
- dtype_table: per-column dtype, missing count, uniqueness
- missing_summary: missing count/% per column
- numeric_summary: mean/median/std/skew/kurtosis/normality per numeric column
- outliers: IQR and Z-score outlier counts per numeric column
- categorical_summary: unique count, mode, top values per categorical column
- correlation: full correlation matrix
- top_correlations: strongest correlated pairs, ranked
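
To make the top_correlations entry more concrete, here is a rough, standard-library sketch of ranking column pairs by absolute Pearson correlation. It is not how eda_engine computes it (the library works on pandas DataFrames); the functions pearson and top_pairs and the sample data are made up for illustration.

```python
from itertools import combinations
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation of two equal-length numeric sequences.

    Assumes neither sequence is constant (zero variance would divide by zero).
    """
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


def top_pairs(columns, top_n=10):
    """Rank column pairs by absolute correlation, strongest first."""
    pairs = [
        (a, b, pearson(columns[a], columns[b]))
        for a, b in combinations(columns, 2)
    ]
    pairs.sort(key=lambda item: abs(item[2]), reverse=True)
    return pairs[:top_n]


columns = {
    "price": [10, 20, 30, 40, 50],
    "discount": [5, 4, 3, 2, 1],  # moves exactly opposite to price
    "rating": [3, 1, 4, 1, 5],
}
ranked = top_pairs(columns, top_n=2)
```

Sorting on the absolute value is what puts a strong negative relationship (price vs. discount here) ahead of a weak positive one.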
Lower-level modules
If you want to call the underlying functions directly instead of using
analyze(), every module is importable on its own. Note the correct
signatures — a common mistake is passing two Series into a chart function;
every chart function takes the DataFrame plus a column name string, not
Series.
```python
from eda_k import eda_engine, charts, chat_assistant, report_builder

# 1. Load the file (note: pass an open file object, not just the path string)
with open("dataset.csv", "rb") as f:
    df = eda_engine.load_file(f, "dataset.csv")

# 2. Run the full pipeline
results = eda_engine.run_full_eda(df)

# 3. Build individual charts: pass (df, "column_name"), not df["column_name"]
fig = charts.histogram(df, "discounted_price")   # correct
fig = charts.bar_categorical(df, "category")     # correct
# fig = charts.histogram(df["product_name"], df["category"])  # wrong: two Series, not df + col name

# 4. Build the figures dict report_builder.build_html_report() expects
ov = results["overview"]
figures = {
    "missing_bar": charts.missing_values_bar(results["missing_summary"]),
    "histograms": {c: charts.histogram(df, c) for c in ov["numeric_cols"]},
    "boxplots": {c: charts.boxplot(df, c) for c in ov["numeric_cols"]},
    "categorical_bars": {c: charts.bar_categorical(df, c) for c in ov["categorical_cols"]},
    "corr_heatmap": (
        charts.correlation_heatmap(results["correlation"])
        if not results["correlation"].empty else None
    ),
}

# 5. Build and save the HTML report
html = report_builder.build_html_report(df, results, figures, filename="dataset.csv")
with open("dataset_report.html", "w", encoding="utf-8") as f:
    f.write(html)
```
This is exactly what result.to_html() does internally — use the high-level
analyze() API unless you specifically need this manual control.
eda_engine — core analysis (pandas/numpy/scipy, no UI)
| Function | What it does |
|---|---|
| load_file(file, filename) | Loads CSV, TSV, TXT, XLSX, XLS, JSON, or Parquet into a DataFrame based on the filename extension. |
| get_overview(df) | Row/column counts, missing %, duplicate rows, numeric/categorical/datetime column lists, likely-datetime and ID-like column detection, memory usage. |
| get_dtype_table(df) | Per-column dtype, missing count/%, unique count/%, potential-ID flag. |
| get_missing_summary(df) | Missing count and % per column, sorted worst-first. |
| get_numeric_summary(df, numeric_cols, max_sample_size=5000) | Mean, median, std, min/max, IQR, CV%, skew, kurtosis, and a Shapiro-Wilk normality flag per numeric column. |
| detect_outliers(df, numeric_cols, method="Both") | IQR-fence and/or Z-score (\|z\| > 3) outlier counts per numeric column. |
| get_categorical_summary(df, categorical_cols, top_n=10) | Unique count, missing count, mode, mode %, and top-N value counts per categorical column. |
| get_correlation(df, numeric_cols, method="pearson") | Correlation matrix (pearson, spearman, or kendall). |
| get_top_correlated_pairs(corr_df, top_n=10) | Strongest correlated column pairs, ranked by absolute correlation. |
| run_full_eda(df, outlier_method="Both", correlation_method="pearson", max_sample_size=5000) | Runs everything above and returns it all as one results dict; this is what analyze() calls. |
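
For readers unfamiliar with the two outlier rules named above, the following standard-library sketch shows the general idea of IQR fences and a Z-score cutoff of 3. It is an illustration under assumptions, not detect_outliers() itself: the 1.5 × IQR multiplier, the quartile method, and the use of the population standard deviation are conventional choices that the library may or may not share.

```python
import statistics


def iqr_outliers(values, k=1.5):
    """Values outside the fences Q1 - k*IQR and Q3 + k*IQR."""
    q1, _, q3 = statistics.quantiles(values, n=4)  # quartile cut points
    iqr = q3 - q1
    low, high = q1 - k * iqr, q3 + k * iqr
    return [v for v in values if v < low or v > high]


def zscore_outliers(values, threshold=3.0):
    """Values whose absolute Z-score exceeds the threshold."""
    mean = statistics.fmean(values)
    std = statistics.pstdev(values)  # population standard deviation
    if std == 0:
        return []  # a constant column has no outliers under this rule
    return [v for v in values if abs((v - mean) / std) > threshold]


values = [10, 11, 12, 12, 13, 13, 14, 14, 15, 15, 16, 17, 200]
counts = {
    "IQR": len(iqr_outliers(values)),
    "Z-score": len(zscore_outliers(values)),
}
```

The two rules can disagree on skewed or small columns, because a single extreme value inflates the standard deviation and shrinks every Z-score, which is one reason to report both.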
charts — Plotly chart builders
Every function takes (df, column_name) (or a column list), not raw Series,
and returns a Plotly figure (or None if there isn't enough data to plot).
| Function | Chart |
|---|---|
| missing_values_bar(missing_df) | Bar chart of missing values by column. |
| missing_pattern_heatmap(missing_matrix) | Heatmap of where missing values cluster across rows/columns. |
| correlation_heatmap(corr_df) | Annotated correlation heatmap. |
| histogram(df, col, bins=40) | Histogram with marginal boxplot, mean/median lines. (Numeric columns only.) |
| boxplot(df, col) | Boxplot with IQR fences annotated, outliers highlighted. (Numeric columns only.) |
| qq_plot(df, col) | Q-Q plot against a normal distribution (needs scipy). |
| bar_categorical(df, col, top_n=15) | Horizontal bar chart of the top-N most frequent values. (Categorical columns.) |
| scatter_matrix(df, numeric_cols, max_cols=5) | Pairwise scatter matrix across several numeric columns at once. |
| pairwise_scatter(df, col_x, col_y) | Scatter plot of two numeric columns with a correlation annotation. |
| pairwise_scatter_with_trendline(df, col_x, col_y) | Same as above, plus an OLS trendline (needs the [trend] extra / statsmodels; falls back to a plain scatter if not installed). |
| time_series_plot(df, date_col, value_col) | Line chart of a value over a datetime column. |
| multi_histogram(df, numeric_cols, max_cols=4) | Grid of histograms for several numeric columns at once. |
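
As background for the Q-Q plot entry, the points on a normal Q-Q plot pair each sorted observation with the standard-normal quantile at the same rank. The sketch below uses only the standard library and the common (i - 0.5) / n plotting positions; qq_plot() itself relies on scipy and may choose positions differently, so treat qq_points as a hypothetical helper.

```python
from statistics import NormalDist


def qq_points(values):
    """Return (theoretical, observed) quantile pairs for a normal Q-Q plot."""
    ordered = sorted(values)
    n = len(ordered)
    standard_normal = NormalDist()  # mean 0, standard deviation 1
    # Plotting positions (i - 0.5) / n stay strictly inside (0, 1),
    # which keeps inv_cdf defined for every rank.
    theoretical = [
        standard_normal.inv_cdf((i - 0.5) / n) for i in range(1, n + 1)
    ]
    return list(zip(theoretical, ordered))


points = qq_points([2.1, 3.4, 1.9, 2.8, 3.0])
```

If the observed values are roughly normal, these points fall close to a straight line; curvature at the ends suggests skew or heavy tails.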
chat_assistant — local rule-based Q&A
answer_question(question, df, results) answers natural-language questions
about the dataset using the results dict — no API key, no internet, pure
keyword matching against the EDA results. Recognized topics:
- Summary/overview — "give me a summary of this dataset", "tell me about this data"
- Missing values — "which columns have missing values?", "any nulls?"
- Correlations — "what are the top correlated pairs?", "any relationships?"
- Outliers — "which columns have the most outliers?"
- Duplicates — "are there duplicate rows?"
- Numeric columns — "describe the numeric columns"
- Categorical columns — "what categorical columns are in the data?"
- Skewness — "which column has the highest skewness?"
- Normality — "is this data normally distributed?"
- A specific column by name — e.g. "describe discounted_price" (fuzzy-matches column names, including ones in quotes)
- Row/column counts — "how many rows?", "how many columns?"
- Help — "help", "what can you do?"
SUGGESTED_QUESTIONS is a ready-made list of example prompts (used to
populate quick-reply buttons in the Streamlit UI, but usable anywhere).
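
The keyword-matching approach described above can be sketched in a few lines. This is a simplified illustration and not the chat_assistant source: the TOPIC_KEYWORDS table, the route_question function, and the 0.8 fuzzy-match cutoff are all assumptions chosen for the example.

```python
import difflib

# Hypothetical topic table: each topic is triggered by any of its keywords.
TOPIC_KEYWORDS = {
    "missing": ("missing", "null"),
    "duplicates": ("duplicate",),
    "correlation": ("correlat", "relationship"),
    "outliers": ("outlier",),
}


def route_question(question, column_names):
    """Map a question to ("topic", name), ("column", name), or ("help", None)."""
    text = question.lower()
    for topic, keywords in TOPIC_KEYWORDS.items():
        if any(keyword in text for keyword in keywords):
            return ("topic", topic)
    # No topic keyword found: fuzzy-match each word against the column names.
    lowered = {name.lower(): name for name in column_names}
    for word in text.replace('"', " ").replace("'", " ").split():
        match = difflib.get_close_matches(word, list(lowered), n=1, cutoff=0.8)
        if match:
            return ("column", lowered[match[0]])
    return ("help", None)


cols = ["discounted_price", "category"]
answers = [
    route_question("any nulls?", cols),
    route_question("describe discountd_price", cols),  # typo on purpose
    route_question("what can you do?", cols),
]
```

Because the routing is plain string matching, it needs no model, API key, or network access, at the cost of only understanding phrasings that contain a known keyword or a recognizable column name.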
report_builder — self-contained HTML report
build_html_report(df, results, figures, filename="dataset", include_advanced_stats=True)
assembles one standalone HTML file (Plotly JS embedded inline, so it works
fully offline — open it in any browser, or print to PDF). It includes:
- Header with dataset name and generation timestamp
- Stat cards (rows, columns, missing %, duplicates, numeric/categorical counts, memory usage)
- Column type & completeness table
- Missing values chart + table
- Numeric summary table, plus skew/kurtosis/normality table
- Outlier detection table (IQR + Z-score) with method explanations
- A histogram + boxplot pair for every numeric column
- Correlation heatmap + top correlated pairs table
- A bar chart + top-values table for every categorical column
Streamlit app
The optional Streamlit app (installed with the [app] extra) opens a browser tab with upload, tabs (Overview / Missing / Numeric / Outliers / Categorical / Correlation / Chat / Download), and one-click export of the HTML report or a ZIP of CSVs. It works the same as before, but is now built on top of the installed eda_k package instead of loose scripts.
Supported file types
CSV, TSV, TXT (auto-delimiter-detect), XLSX, XLS, JSON, Parquet.
Notes / known limits
- Very large files (millions of rows) will be slower to chart; consider sampling first if you hit performance issues.
- The "likely datetime column" detector is a heuristic on a small sample — always double check it against the Overview before trusting it blindly.
- Normality test (Shapiro-Wilk) auto-samples to 5,000 rows for large columns for speed.
- Chart functions take a DataFrame + column name (charts.histogram(df, "col")), not a Series (charts.histogram(df["col"])); passing a Series-only call will raise an error or silently misbehave depending on the function.
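
The sampling mentioned in the notes above (both for very large files and for the normality test) follows a simple pattern: leave small inputs alone and draw a fixed-size random sample from large ones. The sketch below is a standard-library illustration; sample_for_test is a hypothetical helper, and the fixed seed is an assumption made here for reproducibility, not something the library is known to do.

```python
import random


def sample_for_test(values, max_sample_size=5000, seed=0):
    """Return `values` unchanged if small enough, else a random sample."""
    values = list(values)
    if len(values) <= max_sample_size:
        return values
    rng = random.Random(seed)  # fixed seed keeps repeated runs comparable
    return rng.sample(values, max_sample_size)  # sampling without replacement


column = list(range(20_000))
sampled = sample_for_test(column)
small = sample_for_test([1, 2, 3])
```

With pandas, the equivalent idea is to call the DataFrame's sample method before passing the data on for charting.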