scikit-na

Missing Values Analysis for Data Science

scikit-na is a comprehensive Python package for missing data (NA) analysis and exploration. It provides statistical functions, interactive visualizations, and export capabilities to help data scientists understand and handle missing values in their datasets.

Why scikit-na?

Comprehensive Analysis: Get detailed statistics on missing data patterns
Interactive Reports: Generate widget-based reports for Jupyter notebooks
Multiple Export Formats: Share results as CSV, JSON, HTML, or Excel files
Statistical Modeling: Build logistic regression models to understand missingness
Rich Visualizations: Create heatmaps, correlation plots, and distribution charts
Hypothesis Testing: Test for missing completely at random (MCAR) patterns

Features

Interactive report (based on ipywidgets)
Export functionality (CSV, JSON, HTML, XLSX formats)
Descriptive statistics
Regression modeling
Hypothesis tests
Data visualization

Donate

If you find this package useful, please consider donating any amount of money. This will help me spend more time on supporting open-source software.

Installation

Basic installation

pip install scikit-na

With optional dependencies

# For export functionality (Excel support)
# Quotes keep shells such as zsh from expanding the square brackets
pip install "scikit-na[export]"

# For development
pip install "scikit-na[dev]"

# Install from source
pip install git+https://github.com/maximtrp/scikit-na.git

Quick Start

import scikit_na as na
import pandas as pd

# Load your data
data = pd.read_csv('your_dataset.csv')

# Get missing data summary
summary = na.summary(data)
print(summary)

# Create interactive report
report = na.report(data)

# Export results
na.export_summary(data, 'missing_data_analysis.csv', format='csv')

Examples

The following examples use the Titanic dataset (from Kaggle), which contains NA values in three columns: Age, Cabin, and Embarked.

Core Functions

Function	Description
`na.summary()`	Comprehensive missing data statistics
`na.correlate()`	Correlations between missing values
`na.describe()`	Descriptive stats grouped by missingness
`na.model()`	Logistic regression for missing patterns
`na.test_hypothesis()`	Statistical tests for MCAR
`na.report()`	Interactive widget-based report
`na.export_summary()`	Export analysis to files
`na.export_report()`	Export interactive reports

Summary

Per column

By default, the summary() function returns results for each column.

import scikit_na as na
import pandas as pd

data = pd.read_csv('titanic_dataset.csv')

# Excluding three columns without NA to fit the table here
na.summary(data, columns=data.columns.difference(['SibSp', 'Parch', 'Ticket']))

	Age	Cabin	Embarked	Fare	Name	PassengerId	Pclass	Sex	Survived
na_count	177	687	2	0	0	0	0	0	0
na_pct_per_col	19.87	77.1	0.22	0	0	0	0	0	0
na_pct_total	20.44	79.33	0.23	0	0	0	0	0	0
na_unique_per_col	19	529	2	0	0	0	0	0	0
na_unique_pct_per_col	10.73	77	100	0	0	0	0	0	0
rows_after_dropna	714	204	889	891	891	891	891	891	891
rows_after_dropna_pct	80.13	22.9	99.78	100	100	100	100	100	100

NA unique (na_unique_per_col) is the number of NA values in a column that do not coincide with NA values in any other column, i.e. the NA values that would remain in the dataset after dropping the rows with NA values in the other columns.
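The definition of NA unique can be illustrated with plain pandas. This is a minimal sketch on made-up toy data, not the library's own implementation; the variable names are chosen for illustration only.

```python
import pandas as pd

# Toy data (hypothetical, not the Titanic dataset)
df = pd.DataFrame({
    "a": [1.0, None, None, 4.0, 5.0],
    "b": [None, None, 3.0, 4.0, 5.0],
    "c": [1.0, 2.0, 3.0, 4.0, 5.0],
})

na_mask = df.isna()

# Number of NA values per column
na_count = na_mask.sum()

# An NA value is "unique" to its column when no other column is
# missing in the same row, i.e. the row contains exactly one NA.
single_na_rows = na_mask.sum(axis=1) == 1
na_unique = na_mask[single_na_rows].sum()

# Rows left after dropping the NA values of each column separately
rows_after_dropna = len(df) - na_count

print(pd.DataFrame({
    "na_count": na_count,
    "na_unique": na_unique,
    "rows_after_dropna": rows_after_dropna,
}).T)
```

In the toy data, columns a and b share one row where both are missing, so each has two NA values but only one that is unique to it.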

Whole dataset

We can also get a summary of missing data for the whole dataset:

na.summary(data, per_column=False)

	dataset
total_columns	12
total_rows	891
na_rows	708
non_na_rows	183
total_cells	10692
na_cells	866
na_cells_pct	8.1
non_na_cells	9826
non_na_cells_pct	91.9
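The dataset-level figures follow from simple counts: in the table above, 12 columns times 891 rows gives 10692 cells, and 866 NA cells out of 10692 is about 8.1%. The sketch below reproduces the same kind of counts with plain pandas on made-up toy data; it is an illustration of the arithmetic, not the library's code.

```python
import pandas as pd

# Toy data (hypothetical)
df = pd.DataFrame({
    "a": [1.0, None, None, 4.0, 5.0],
    "b": [None, None, 3.0, 4.0, 5.0],
    "c": ["x", "y", "z", None, "w"],
})

na_mask = df.isna()

total_rows, total_columns = df.shape
total_cells = df.size

# Cells that are missing, and rows containing at least one missing cell
na_cells = int(na_mask.to_numpy().sum())
na_rows = int(na_mask.any(axis=1).sum())
non_na_rows = total_rows - na_rows

na_cells_pct = round(100 * na_cells / total_cells, 2)

print(total_cells, na_cells, na_rows, non_na_rows, na_cells_pct)
```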

Correlations

To calculate correlations between columns in terms of missing data, call the correlate() function with your DataFrame as the first argument:

na.correlate(data, method="spearman").round(3)

	Embarked	Age	Cabin
Embarked	1	-0.024	-0.087
Age	-0.024	1	0.144
Cabin	-0.087	0.144	1

This method can be used to uncover hidden patterns in missing data across many columns in a dataset. Columns with no missing data are automatically excluded.
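Conceptually, a correlation between columns "in terms of missing data" is a correlation between their 0/1 missingness indicators. The following pandas-only sketch shows that idea on made-up toy data. It assumes this is roughly what correlate() computes; it is not taken from the library's source.

```python
import pandas as pd

# Toy data (hypothetical)
df = pd.DataFrame({
    "a": [1.0, None, None, 4.0, 5.0, None],
    "b": [None, None, 3.0, 4.0, 5.0, None],
    "c": [1.0, 2.0, 3.0, 4.0, 5.0, 6.0],
})

# 1 where a value is missing, 0 otherwise
indicators = df.isna().astype(int)

# Drop columns without NA values: a constant indicator
# has no defined correlation with anything.
indicators = indicators.loc[:, indicators.any()]

corr = indicators.corr(method="spearman").round(3)
print(corr)
```

For binary indicators, Spearman and Pearson correlations coincide, so the choice of method matters less here than it does for continuous data.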

There is a function to visualize correlations with a heatmap:

na.altair\
    .plot_corr(data, corr_kws={'method': 'spearman'})\
    .properties(width=150, height=150)

Figure: NA correlations heatmap

Visualization

Heatmap

Now, let's visualize NA values on a heatmap. The plots use the Altair + Vega backend:

na.altair.plot_heatmap(data)

Figure: NA heatmap

Droppables are the values that would be dropped by simply calling pandas.DataFrame.dropna() on the entire dataset.

Stairs plot

The stairs plot is another useful visualization: it shows how the dataset shrinks when the pandas.Series.dropna() method is applied to each column sequentially (by default, columns are sorted by their number of NA values):

na.altair.plot_stairs(data)

Figure: NA stairs plot

After dropping all rows with NAs in the Cabin column, 21 NAs remain (in the Age and Embarked columns). The plot also shows tooltips with the exact number of NA values dropped for each column.
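The shrinkage shown by a stairs plot can be computed by hand with pandas. This is a minimal sketch on made-up toy data, assuming the default ordering (most NA values first); it is not the library's stairs() implementation.

```python
import pandas as pd

# Toy data (hypothetical)
df = pd.DataFrame({
    "a": [1.0, None, None, 4.0, 5.0, 6.0],
    "b": [None, None, None, 4.0, 5.0, None],
    "c": [1.0, 2.0, 3.0, None, 5.0, 6.0],
})

# Process columns from the most to the fewest NA values
order = df.isna().sum().sort_values(ascending=False).index

remaining = df
steps = []
for col in order:
    before = len(remaining)
    # Drop the rows where this column is missing
    remaining = remaining.dropna(subset=[col])
    steps.append((col, before - len(remaining), len(remaining)))

for col, dropped, left in steps:
    print(f"{col}: dropped {dropped} rows, {left} rows left")
```

Column a has two NA values overall, but both sit in rows already removed when b was processed, so its step drops nothing. This is the same effect as the 21 remaining NAs in the Titanic example.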

Histogram

You may need to adjust some parameters before a histogram starts looking as you expect:

chart = na.altair.plot_hist(data, col='Pclass', col_na='Age')\
    .properties(width=200, height=200)
chart.configure_axisX(labelAngle=0)

Figure: NA histogram

Regression model

We can build a logistic regression model with the missingness of Age as the dependent variable and Fare, Parch, Pclass, SibSp, and Survived as independent variables. Internally, the pandas.Series.isna() method is called on the Age column, and the resulting boolean values are converted to integers (True/False becomes 1/0). The logistic model is then fitted with the statsmodels package:

# Selecting columns with numeric data
# Dropping "PassengerId" column
subset = data.loc[:, data.dtypes != object].drop(columns=['PassengerId'])
model = na.model(subset, col_na='Age')
model.summary()

Optimization terminated successfully.
Current function value: 0.467801
Iterations 7
                        Logit Regression Results
===============================================================================
Dep. Variable:                    Age   No. Observations:                   891
Model:                          Logit   Df Residuals:                       885
Method:                           MLE   Df Model:                             5
Date:                Sat, 05 Jun 2021   Pseudo R-squ.:                  0.06164
Time:                        17:51:31   Log-Likelihood:                 -416.81
converged:                       True   LL-Null:                        -444.19
Covariance Type:            nonrobust   LLR p-value:                  1.463e-10
===============================================================================
                coef    std err          z      P>|z|      [0.025      0.975]
-------------------------------------------------------------------------------
(intercept)    -2.7294      0.429     -6.369      0.000      -3.569      -1.890
Fare            0.0010      0.003      0.376      0.707      -0.004       0.006
Parch          -0.8874      0.223     -3.984      0.000      -1.324      -0.451
Pclass          0.5953      0.147      4.046      0.000       0.307       0.884
SibSp           0.2548      0.095      2.684      0.007       0.069       0.441
Survived       -0.1026      0.198     -0.519      0.604      -0.490       0.285
===============================================================================
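Logit coefficients are on the log-odds scale, so exponentiating them gives odds ratios, which are often easier to read. The sketch below does this for the coefficients copied from the output above, using only the standard library. The interpretation assumes the usual reading of a logistic model in which the outcome is "Age is missing".

```python
import math

# Coefficients copied from the Logit output above (intercept omitted)
coefs = {
    "Fare": 0.0010,
    "Parch": -0.8874,
    "Pclass": 0.5953,
    "SibSp": 0.2548,
    "Survived": -0.1026,
}

# exp(coef) is the multiplicative change in the odds of Age being
# missing for a one-unit increase in the predictor, other predictors fixed.
odds_ratios = {name: round(math.exp(b), 2) for name, b in coefs.items()}

for name, ratio in odds_ratios.items():
    print(f"{name}: {ratio}")
```

For example, each one-step increase in Pclass multiplies the odds of a missing Age by roughly 1.8, while each additional parent or child aboard (Parch) multiplies them by roughly 0.4.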

Interactive report

Use the scikit_na.report() function to show the interactive report interface:

na.report(data)

Figure: interactive report interface

Export functionality

Export your analysis results to various formats for sharing and further processing:

Export summary statistics

# Export to CSV
na.export_summary(data, filename='missing_data_summary.csv', format='csv')

# Export to JSON
na.export_summary(data, filename='summary.json', format='json')

# Export to Excel
na.export_summary(data, filename='analysis.xlsx', format='xlsx')

Export interactive reports

# Export complete report to HTML
na.export_report(data, filename='missing_data_report.html', format='html')

# Export with custom columns
na.export_report(
    data, 
    columns=['Age', 'Cabin', 'Embarked'],
    filename='focused_analysis.html', 
    format='html'
)

The export functionality supports:

CSV: Summary statistics in tabular format
JSON: Structured data for programmatic access
HTML: Interactive reports for web viewing
XLSX: Excel-compatible spreadsheets

API Reference

Statistical Functions

summary(data, columns=None, per_column=True, round_dec=2) - Missing data statistics
correlate(data, columns=None, drop=True, **kwargs) - Correlation analysis
describe(data, col_na, columns=None, na_mapping=None) - Grouped descriptive stats
model(data, col_na, columns=None, intercept=True, **kwargs) - Logistic regression
test_hypothesis(data, col_na, test_fn, columns=None, **kwargs) - Hypothesis testing
stairs(data, columns=None, **kwargs) - Dataset shrinkage analysis
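As an illustration of what "descriptive stats grouped by missingness" means for describe(), the pandas-only sketch below splits one column by whether another column is missing. The toy data and the "NA"/"Filled" labels are assumptions made for this example; they are not taken from the library.

```python
import pandas as pd

# Toy data (hypothetical)
df = pd.DataFrame({
    "age": [22.0, None, 35.0, None, 54.0],
    "fare": [7.25, 8.05, 53.10, 8.46, 51.86],
})

# Label each row by whether "age" is missing (labels are illustrative)
group = df["age"].isna().map({True: "NA", False: "Filled"})

# Descriptive statistics of "fare" within each missingness group
stats = df.groupby(group)["fare"].describe()
print(stats)
```

A large difference between the two groups suggests that missingness in one column is related to the values of another, which is the kind of pattern the regression model and the hypothesis tests examine more formally.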

Visualization Functions

altair.plot_heatmap(data, **kwargs) - Missing data heatmap
altair.plot_corr(data, **kwargs) - Correlation heatmap
altair.plot_stairs(data, **kwargs) - Stairs plot
altair.plot_hist(data, col, col_na, **kwargs) - Missing data histogram

Export Functions

export_summary(data, filename, format, **kwargs) - Export summary statistics
export_report(data, filename, format, **kwargs) - Export interactive reports

Interactive Reports

report(data, columns=None, **kwargs) - Generate interactive widget-based report

Contribution

Any contribution is highly appreciated: pull requests, suggestions, or bug reports.


Release history

0.3.0 (this version): Sep 13, 2025
0.2.0: Feb 15, 2025
0.1.1: Mar 1, 2023
0.1.0: Mar 11, 2022
0.0.7: Jul 17, 2021
0.0.6: Jun 21, 2021
0.0.5: Jun 13, 2021
0.0.4: Jun 11, 2021
0.0.2: Jun 11, 2021
0.0.1: Jun 11, 2021

