Missing Values Analysis for Data Science
scikit-na is a comprehensive Python package for missing data (NA) analysis and exploration. It provides statistical functions, interactive visualizations, and export capabilities to help data scientists understand and handle missing values in their datasets.
Why scikit-na?
- Comprehensive Analysis: Get detailed statistics on missing data patterns
- Interactive Reports: Generate widget-based reports for Jupyter notebooks
- Multiple Export Formats: Share results as CSV, JSON, HTML, or Excel files
- Statistical Modeling: Build logistic regression models to understand missingness
- Rich Visualizations: Create heatmaps, correlation plots, and distribution charts
- Hypothesis Testing: Test for missing completely at random (MCAR) patterns
Features
- Interactive report (based on ipywidgets)
- Export functionality (CSV, JSON, HTML, XLSX formats)
- Descriptive statistics
- Regression modeling
- Hypothesis tests
- Data visualization
Donate
If you find this package useful, please consider donating any amount of money. This will help me spend more time on supporting open-source software.
Installation
Basic installation
pip install scikit-na
With optional dependencies
# For export functionality (Excel support)
pip install scikit-na[export]
# For development
pip install scikit-na[dev]
# Install from source
pip install git+https://github.com/maximtrp/scikit-na.git
Quick Start
import scikit_na as na
import pandas as pd
# Load your data
data = pd.read_csv('your_dataset.csv')
# Get missing data summary
summary = na.summary(data)
print(summary)
# Create interactive report
report = na.report(data)
# Export results
na.export_summary(data, 'missing_data_analysis.csv', format='csv')
Examples
The following examples use the Titanic dataset (from Kaggle) that contains NA values in three columns: Age, Cabin, and Embarked.
Core Functions
| Function | Description |
|---|---|
| na.summary() | Comprehensive missing data statistics |
| na.correlate() | Correlations between missing values |
| na.describe() | Descriptive stats grouped by missingness |
| na.model() | Logistic regression for missing patterns |
| na.test_hypothesis() | Statistical tests for MCAR |
| na.report() | Interactive widget-based report |
| na.export_summary() | Export analysis to files |
| na.export_report() | Export interactive reports |
Summary
Per column
By default, the summary() function returns results for each column.
import scikit_na as na
import pandas as pd
data = pd.read_csv('titanic_dataset.csv')
# Excluding three columns without NA to fit the table here
na.summary(data, columns=data.columns.difference(['SibSp', 'Parch', 'Ticket']))
| | Age | Cabin | Embarked | Fare | Name | PassengerId | Pclass | Sex | Survived |
|---|---|---|---|---|---|---|---|---|---|
| na_count | 177 | 687 | 2 | 0 | 0 | 0 | 0 | 0 | 0 |
| na_pct_per_col | 19.87 | 77.1 | 0.22 | 0 | 0 | 0 | 0 | 0 | 0 |
| na_pct_total | 20.44 | 79.33 | 0.23 | 0 | 0 | 0 | 0 | 0 | 0 |
| na_unique_per_col | 19 | 529 | 2 | 0 | 0 | 0 | 0 | 0 | 0 |
| na_unique_pct_per_col | 10.73 | 77 | 100 | 0 | 0 | 0 | 0 | 0 | 0 |
| rows_after_dropna | 714 | 204 | 889 | 891 | 891 | 891 | 891 | 891 | 891 |
| rows_after_dropna_pct | 80.13 | 22.9 | 99.78 | 100 | 100 | 100 | 100 | 100 | 100 |
NA unique is the number of NA values in a column that are unique to it, i.e. they occur in rows where no other column has an NA value (these are the NA values that would remain in the dataset if we dropped the rows with NA values in the other columns).
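The "NA unique" definition above can be sketched with plain pandas. This is a minimal illustration on a made-up toy DataFrame, not the package's own implementation; the helper name `na_unique_counts` is hypothetical.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data (not the Titanic dataset)
toy = pd.DataFrame({
    "a": [1.0, np.nan, np.nan, 4.0],
    "b": [np.nan, np.nan, 3.0, 4.0],
    "c": [1.0, 2.0, 3.0, 4.0],
})


def na_unique_counts(df: pd.DataFrame) -> pd.Series:
    """Count, per column, the rows in which only that column is missing."""
    na_mask = df.isna()
    # Keep rows that contain exactly one missing cell
    single_na_rows = na_mask.sum(axis=1) == 1
    # In those rows the missing cell belongs to a single column,
    # so a column-wise sum gives the "unique" NA count per column
    return na_mask[single_na_rows].sum()


counts = na_unique_counts(toy)
print(counts)
```

In the toy data, the row where both `a` and `b` are missing contributes to neither column's count, which is the intersection the definition excludes.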
Whole dataset
We can also get a summary of missing data for the whole dataset:
na.summary(data, per_column=False)
| | dataset |
|---|---|
| total_columns | 12 |
| total_rows | 891 |
| na_rows | 708 |
| non_na_rows | 183 |
| total_cells | 10692 |
| na_cells | 866 |
| na_cells_pct | 8.1 |
| non_na_cells | 9826 |
| non_na_cells_pct | 91.9 |
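For orientation, the dataset-level figures in this table can be reproduced from a boolean NA mask. The sketch below uses plain pandas on a hypothetical toy DataFrame and only mirrors the row names shown above; it is not the package's code.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data (not the Titanic dataset)
toy = pd.DataFrame({
    "a": [1.0, np.nan, np.nan, 4.0],
    "b": [np.nan, np.nan, 3.0, 4.0],
    "c": [1.0, 2.0, 3.0, 4.0],
})

na_mask = toy.isna()
rows_with_na = na_mask.any(axis=1)  # True for rows with at least one NA
total_cells = int(na_mask.size)
na_cells = int(na_mask.to_numpy().sum())

totals = {
    "total_columns": toy.shape[1],
    "total_rows": toy.shape[0],
    "na_rows": int(rows_with_na.sum()),
    "non_na_rows": int((~rows_with_na).sum()),
    "total_cells": total_cells,
    "na_cells": na_cells,
    "na_cells_pct": round(100 * na_cells / total_cells, 2),
    "non_na_cells": total_cells - na_cells,
}
print(totals)
```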
Correlations
To calculate correlations between columns in terms of missing data, call the
correlate() function with your DataFrame as the first argument:
na.correlate(data, method="spearman").round(3)
| | Embarked | Age | Cabin |
|---|---|---|---|
| Embarked | 1 | -0.024 | -0.087 |
| Age | -0.024 | 1 | 0.144 |
| Cabin | -0.087 | 0.144 | 1 |
This method can be used to uncover hidden patterns in missing data across many columns in a dataset. Columns with no missing data are automatically excluded.
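The idea of correlating missingness can be illustrated with pandas alone: turn each column into a 0/1 indicator of missing values, drop the columns that have none, and correlate the indicators. This is a sketch under the assumption that the package works along these lines; the toy data is made up, and Pearson (the pandas default) is used here, so pass `method=` to `corr()` for a different coefficient.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data (not the Titanic dataset)
toy = pd.DataFrame({
    "a": [1.0, np.nan, np.nan, 4.0, 5.0, np.nan],
    "b": [np.nan, np.nan, 3.0, 4.0, 5.0, np.nan],
    "c": [1.0, 2.0, 3.0, 4.0, 5.0, 6.0],
})

# 1 where a value is missing, 0 otherwise
indicators = toy.isna().astype(int)
# A column without any NA has a constant indicator, so exclude it
indicators = indicators.loc[:, indicators.sum() > 0]

na_corr = indicators.corr()
print(na_corr.round(3))
```

A positive coefficient means two columns tend to be missing in the same rows; a negative one means that a missing value in one column tends to coincide with a present value in the other.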
There is a function to visualize correlations with a heatmap:
na.altair.plot_corr(data, corr_kws={'method': 'spearman'})\
    .properties(width=150, height=150)
Visualization
Heatmap
Now, let's visualize NA values on a heatmap. We will be using the Altair + Vega backend:
na.altair.plot_heatmap(data)
Droppables are those values that will be dropped if we simply use
pandas.DataFrame.dropna() on the entire dataset.
Stairs plot
A stairs plot is another useful visualization: it shows how the dataset shrinks
when the pandas.Series.dropna() method is applied to each column sequentially
(columns are sorted by the number of NA values, by default):
na.altair.plot_stairs(data)
After dropping all NAs in the Cabin column, we are left with 21 more NAs (in the Age
and Embarked columns). This plot also shows tooltips with the exact number of NA
values dropped for each column.
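The sequential shrinkage behind the stairs plot can be sketched as follows. This is an illustration with plain pandas on a hypothetical toy DataFrame; the helper name `shrinkage` is made up, and `DataFrame.dropna(subset=...)` stands in for whatever the package does internally.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data with 3, 2, and 1 missing values per column
toy = pd.DataFrame({
    "a": [np.nan, np.nan, np.nan, 4.0, 5.0, 6.0],
    "b": [np.nan, 2.0, 3.0, np.nan, 5.0, 6.0],
    "c": [1.0, 2.0, 3.0, 4.0, np.nan, 6.0],
})


def shrinkage(df: pd.DataFrame) -> pd.DataFrame:
    """Drop NA rows column by column and record how many rows each step removes."""
    # Visit columns from most to fewest missing values
    order = df.isna().sum().sort_values(ascending=False).index
    remaining = df
    steps = []
    for col in order:
        before = len(remaining)
        remaining = remaining.dropna(subset=[col])
        steps.append((col, before - len(remaining), len(remaining)))
    return pd.DataFrame(steps, columns=["column", "rows_dropped", "rows_left"])


stairs = shrinkage(toy)
print(stairs)
```

Each step only counts rows that earlier steps did not already remove, which is why column `b` drops one row here even though it has two missing values.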
Histogram
You may need to adjust some parameters before a histogram starts looking as you expect:
chart = na.altair.plot_hist(data, col='Pclass', col_na='Age')\
    .properties(width=200, height=200)
chart.configure_axisX(labelAngle=0)
Regression model
We can build a logistic regression model with the missingness of Age as the
dependent variable and Fare, Parch, Pclass, SibSp, and Survived as independent
variables. Internally, the pandas.Series.isna() method is called on the Age
column, and the resulting boolean values are converted to integers (True/False
becomes 1/0). The logistic model is then fitted with the
statsmodels package:
# Selecting columns with numeric data
# Dropping "PassengerId" column
subset = data.loc[:, data.dtypes != object].drop(columns=['PassengerId'])
model = na.model(subset, col_na='Age')
model.summary()
Optimization terminated successfully.
Current function value: 0.467801
Iterations 7
Logit Regression Results
===============================================================================
Dep. Variable: Age No. Observations: 891
Model: Logit Df Residuals: 885
Method: MLE Df Model: 5
Date: Sat, 05 Jun 2021 Pseudo R-squ.: 0.06164
Time: 17:51:31 Log-Likelihood: -416.81
converged: True LL-Null: -444.19
Covariance Type: nonrobust LLR p-value: 1.463e-10
===============================================================================
coef std err z P>|z| [0.025 0.975]
-------------------------------------------------------------------------------
(intercept) -2.7294 0.429 -6.369 0.000 -3.569 -1.890
Fare 0.0010 0.003 0.376 0.707 -0.004 0.006
Parch -0.8874 0.223 -3.984 0.000 -1.324 -0.451
Pclass 0.5953 0.147 4.046 0.000 0.307 0.884
SibSp 0.2548 0.095 2.684 0.007 0.069 0.441
Survived -0.1026 0.198 -0.519 0.604 -0.490 0.285
===============================================================================
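The target construction described above can be sketched without fitting anything. The toy data and the variable names `y` and `X` are assumptions made for illustration; only the isna-to-integer step is taken from the description.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data (not the Titanic dataset)
toy = pd.DataFrame({
    "Age": [22.0, np.nan, 26.0, np.nan],
    "Fare": [7.25, 71.28, 7.92, 8.05],
    "Pclass": [3, 1, 3, 3],
})

col_na = "Age"
# Dependent variable: 1 where Age is missing, 0 where it is present
y = toy[col_na].isna().astype(int)
# Independent variables: the remaining columns plus a constant intercept term
X = toy.drop(columns=[col_na]).assign(intercept=1.0)

print(y.tolist())
print(X.columns.tolist())
# y and X could then be handed to a logistic regression fitter,
# for example statsmodels.api.Logit(y, X).fit()
```

A positive coefficient in the fitted model then means that higher values of that variable are associated with a higher probability of Age being missing.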
Interactive report
Use the scikit_na.report() function to show the interactive report interface:
na.report(data)
Export functionality
Export your analysis results to various formats for sharing and further processing:
Export summary statistics
# Export to CSV
na.export_summary(data, filename='missing_data_summary.csv', format='csv')
# Export to JSON
na.export_summary(data, filename='summary.json', format='json')
# Export to Excel
na.export_summary(data, filename='analysis.xlsx', format='xlsx')
Export interactive reports
# Export complete report to HTML
na.export_report(data, filename='missing_data_report.html', format='html')
# Export with custom columns
na.export_report(
data,
columns=['Age', 'Cabin', 'Embarked'],
filename='focused_analysis.html',
format='html'
)
The export functionality supports:
- CSV: Summary statistics in tabular format
- JSON: Structured data for programmatic access
- HTML: Interactive reports for web viewing
- XLSX: Excel-compatible spreadsheets
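If you only need a quick file without the package's export helpers, a summary-like table can be written with pandas directly. The sketch below builds a small two-row summary from a hypothetical toy DataFrame and writes it as CSV and JSON into a temporary directory; the file names and table layout are assumptions, not the output format of export_summary().

```python
import json
import tempfile
from pathlib import Path

import numpy as np
import pandas as pd

# Hypothetical toy data (not the Titanic dataset)
toy = pd.DataFrame({"a": [1.0, np.nan, 3.0], "b": [np.nan, np.nan, 3.0]})

# Statistics as rows, dataset columns as columns
na_counts = toy.isna().sum()
summary = pd.DataFrame({
    "na_count": na_counts,
    "na_pct_per_col": (100 * na_counts / len(toy)).round(2),
}).T

out_dir = Path(tempfile.mkdtemp())
csv_path = out_dir / "summary.csv"
json_path = out_dir / "summary.json"

summary.to_csv(csv_path)
summary.to_json(json_path, orient="index")

# Read both files back to confirm that they round-trip
csv_back = pd.read_csv(csv_path, index_col=0)
json_back = json.loads(json_path.read_text())
print(csv_back)
```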
API Reference
Statistical Functions
- summary(data, columns=None, per_column=True, round_dec=2): Missing data statistics
- correlate(data, columns=None, drop=True, **kwargs): Correlation analysis
- describe(data, col_na, columns=None, na_mapping=None): Grouped descriptive stats
- model(data, col_na, columns=None, intercept=True, **kwargs): Logistic regression
- test_hypothesis(data, col_na, test_fn, columns=None, **kwargs): Hypothesis testing
- stairs(data, columns=None, **kwargs): Dataset shrinkage analysis
Visualization Functions
- altair.plot_heatmap(data, **kwargs): Missing data heatmap
- altair.plot_corr(data, **kwargs): Correlation heatmap
- altair.plot_stairs(data, **kwargs): Stairs plot
- altair.plot_hist(data, col, col_na, **kwargs): Missing data histogram
Export Functions
- export_summary(data, filename, format, **kwargs): Export summary statistics
- export_report(data, filename, format, **kwargs): Export interactive reports
Interactive Reports
- report(data, columns=None, **kwargs): Generate interactive widget-based report
Contribution
Any contribution is highly appreciated: pull requests, suggestions, or bug reports.