
An exploratory data science toolkit for analysis, machine learning, multimodal AI agents for text and image processing, and visualization (Apache Superset)

Project description

datashadric - Python Toolkit for Machine Learning and Advanced Data Analytics

Last updated: November 10, 2025

Author: Paul Namalomba

  • SESKA Computational Engineer
  • Software Developer
  • PhD Candidate (Civil Engineering, Spec. Computational and Applied Mechanics)

Version: 0.3.3 (10 April 2026)
Contact: kabwenzenamalomba@gmail.com



datashadric provides a collection of well-organized modules for common data science tasks, from data cleaning and exploration to machine learning model building, supervised and unsupervised classification, and statistical analysis and testing. The package is designed with readability and ease of use in mind, making complex data science workflows more accessible and easier to write for end-use analysts.

Features

  • Machine Learning: Model training, data sampling for ensembling, model evaluation, and prediction tools.
  • Regression Analysis: Linear and Logistic regression modeling with diagnostic checks.
  • Data Manipulation: Pandas-based utilities for cleaning and transforming data and computing descriptive statistics.
  • Statistical Analysis: Hypothesis testing, confidence intervals, normal/Gaussian distribution checks, Bayesian methods, and sampling utilities.
  • Visualization: Plotting functions for data exploration, visualization and presentation.
  • Multiple Imputation: MICE (PMM, norm, logistic regression), Random Forest, and KNN imputation for handling missing data.

Installation

From PyPI (recommended)

pip install datashadric

From Source

git clone https://github.com/paulnamalomba/datashadric.git
cd datashadric
pip install .

Development Installation

git clone https://github.com/paulnamalomba/datashadric.git
cd datashadric
pip install -e ".[dev]"

Quick Start

import pandas as pd
from datashadric.mlearning import ml_naive_bayes_model
from datashadric.regression import lr_ols_model
from datashadric.dataframing import df_check_na_values
from datashadric.stochastics import df_gaussian_checks
from datashadric.plotters import df_boxplotter
from datashadric.aiagents import ai_analyze_plot_data_with_vision
from datashadric.aiagents import ai_data_insights_summary
from datashadric.imputation import df_mice_impute_pmm, df_impute_knn

# load your data
df = pd.read_csv('your_data.csv')

# check for missing values
na_summary = df_check_na_values(df)

# test for normality
normality_results = df_gaussian_checks(df, 'your_column')

# create visualizations
df_boxplotter(df, 'category_col', 'numeric_col', type_plot=0)

# build machine learning models
model, metrics = ml_naive_bayes_model(df, 'target_column', test_size=0.2)

# perform regression analysis
ols_results = lr_ols_model(df, 'dependent_var', ['independent_var1', 'independent_var2'])

Module Overview

mlearning - Machine Learning

  • ml_naive_bayes_model(): Train and evaluate Naive Bayes classifiers
  • ml_naive_bayes_metrics(): Calculate detailed model performance metrics
  • logr_predictor(): Logistic regression modeling and prediction
  • confusion_matrix_from_predictions(): Generate confusion matrices
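
The metrics these helpers compute can be sketched with scikit-learn directly. This is a minimal sketch, not the package's implementation: the dataset, the GaussianNB variant, and the split parameters are illustrative assumptions.

```python
# Sketch of Naive Bayes training + confusion-matrix evaluation using
# scikit-learn; dataset and model variant are illustrative, not datashadric's.
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB
from sklearn.metrics import confusion_matrix, accuracy_score

X, y = make_classification(n_samples=200, n_features=5, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0
)

model = GaussianNB().fit(X_train, y_train)
y_pred = model.predict(X_test)

cm = confusion_matrix(y_test, y_pred)  # rows: true class, columns: predicted
acc = accuracy_score(y_test, y_pred)
```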

regression - Regression Analysis

  • lr_ols_model(): Ordinary Least Squares regression modeling
  • lr_check_homoscedasticity(): Test regression assumptions
  • lr_check_normality(): Check residual normality
  • lr_post_hoc_test(): Post-hoc regression diagnostics

dataframing - Data Manipulation

  • df_check_na_values(): Comprehensive missing value analysis
  • df_drop_dupes(): Remove duplicate rows with reporting
  • df_one_hot_encoding(): Convert categorical variables to dummy variables
  • df_check_correlation(): Correlation analysis and visualization
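
These utilities plausibly build on standard pandas operations; an equivalent plain-pandas sketch on toy data (column names are illustrative):

```python
# Plain-pandas equivalents of missing-value checks, de-duplication,
# and one-hot encoding.
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "city": ["Lusaka", "Kitwe", "Lusaka", "Lusaka"],
    "value": [1.0, np.nan, 3.0, 1.0],
})

na_counts = df.isna().sum()                          # missing values per column
deduped = df.drop_duplicates()                       # rows 0 and 3 are identical
dummies = pd.get_dummies(df, columns=["city"])       # one-hot encode "city"
```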

stochastics - Statistical Analysis

  • df_gaussian_checks(): Test data normality with Shapiro-Wilk and Q-Q plots
  • df_calc_conf_interval(): Calculate confidence intervals
  • df_calc_moe(): Compute margin of error
  • df_calc_zscore(): Z-score calculations
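
A sketch of the underlying statistics with SciPy and NumPy directly (synthetic sample; the datashadric functions themselves may take DataFrames and column names, as in Quick Start):

```python
# Shapiro-Wilk normality test, a t-based 95% confidence interval for the
# mean, and z-score standardization.
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)
sample = rng.normal(loc=10.0, scale=2.0, size=200)

# Normality check
w_stat, p_value = stats.shapiro(sample)

# 95% confidence interval for the mean (t distribution)
mean = sample.mean()
sem = stats.sem(sample)
moe = sem * stats.t.ppf(0.975, df=len(sample) - 1)  # margin of error
ci = (mean - moe, mean + moe)

# Z-scores (sample standard deviation)
z = (sample - mean) / sample.std(ddof=1)
```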

plotters - Visualization

  • df_boxplotter(): Box plots for outlier detection
  • df_histplotter(): Histogram creation with customization
  • df_scatterplotter(): Scatter plot generation
  • df_pairplot(): Comprehensive pairwise plotting
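
A plain pandas/matplotlib sketch of the kind of box plot df_boxplotter produces; the Agg backend and in-memory buffer make it runnable in scripts, and the data is illustrative.

```python
# Grouped box plot rendered off-screen with the non-interactive Agg backend.
import io
import matplotlib
matplotlib.use("Agg")
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
df = pd.DataFrame({
    "group": np.repeat(["A", "B"], 50),
    "value": np.concatenate([rng.normal(0, 1, 50), rng.normal(2, 1, 50)]),
})

fig, ax = plt.subplots()
df.boxplot(column="value", by="group", ax=ax)  # one box per category
ax.set_ylabel("value")
buf = io.BytesIO()
fig.savefig(buf, format="png")
plt.close(fig)
```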

imputation - Multiple Imputation Methods (new in v0.3.3)

  • df_mice_impute_pmm(): MICE with Predictive Mean Matching — imputes from observed donor values
  • df_mice_impute_norm(): MICE with Bayesian Linear Regression (norm) — smooth posterior-predictive draws
  • df_mice_impute_logistic(): MICE with Logistic Regression for binary/categorical columns
  • df_impute_random_forest(): Iterative Random Forest imputation (missForest-style)
  • df_impute_knn(): K-Nearest Neighbours imputation
  • df_impute_summary(): Before/after comparison of NaN counts and descriptive statistics
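
The same techniques are available in scikit-learn, which these functions plausibly wrap; a minimal sketch of KNN and MICE-style iterative imputation on toy data (datashadric's own signatures may differ):

```python
# KNN imputation and MICE-style iterative imputation with scikit-learn.
import numpy as np
import pandas as pd
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import IterativeImputer, KNNImputer

df = pd.DataFrame({
    "a": [1.0, 2.0, np.nan, 4.0, 5.0],
    "b": [2.0, np.nan, 6.0, 8.0, 10.0],
})

# KNN: fill each missing value from its k nearest complete rows
knn_filled = pd.DataFrame(
    KNNImputer(n_neighbors=2).fit_transform(df), columns=df.columns
)

# Iterative (MICE-style): model each column on the others, repeat until stable
mice_filled = pd.DataFrame(
    IterativeImputer(random_state=0, max_iter=10).fit_transform(df),
    columns=df.columns,
)
```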

Dependencies

Core Dependencies

  • pandas >= 1.3.0
  • numpy >= 1.20.0
  • scikit-learn >= 1.0.0
  • matplotlib >= 3.4.0
  • seaborn >= 0.11.0
  • scipy >= 1.7.0
  • statsmodels >= 0.12.0
  • plotly >= 5.0.0

To install them all at once:

pip install -r requirements/requirements-core.txt

Testing Dependencies

For running tests, you'll need to install additional packages:

pip install pytest pytest-cov

Testing the Modules

To run the test suite:

# Install testing dependencies first
pip install pytest pytest-cov

# Run all tests
python -m pytest tests/ -v

# Run tests with coverage report
python -m pytest tests/ --cov=datashadric --cov-report=html --cov-report=term-missing

Examples

Data Cleaning and Exploration

from datashadric.dataframing import df_check_na_values, df_drop_dupes
from datashadric.plotters import df_histplotter

# check data quality
na_report = df_check_na_values(df)
df_clean = df_drop_dupes(df)

# visualize distributions
df_histplotter(df_clean, 'numeric_column', type_plot=0, bins=30)

Statistical Testing

from datashadric.stochastics import df_gaussian_checks, df_calc_conf_interval

# test normality
normality_test = df_gaussian_checks(df, 'measurement_column')

# calculate confidence intervals
ci = df_calc_conf_interval(df['measurement_column'], confidence=0.95)

Machine Learning Workflows

from datashadric.mlearning import ml_naive_bayes_model, ml_naive_bayes_metrics

# train model
model, initial_metrics = ml_naive_bayes_model(df, 'target', test_size=0.3)

# detailed evaluation (X_test and y_test come from your own train/test split)
detailed_metrics = ml_naive_bayes_metrics(model, X_test, y_test)
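
Note that X_test and y_test must come from a held-out split; a typical way to produce one with scikit-learn (the dataset here is illustrative):

```python
# Hold out 30% of the data, stratified so class proportions are preserved.
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=150, n_features=4, random_state=1)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=1, stratify=y
)
```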

Contributing to the Project

Contributions are welcome! Please feel free to submit a Pull Request. For major changes, please open an issue first to discuss what you would like to change.

  1. Fork the repository
  2. Create your feature branch (git checkout -b feature/AmazingFeature)
  3. Commit your changes (git commit -m 'Add some AmazingFeature')
  4. Push to the branch (git push origin feature/AmazingFeature)
  5. Open a Pull Request

Licensing & Copyright

This project is licensed under the MIT License - see the LICENSE file for details.

The author retains all rights to the code and documentation in this repository. You are free to use, modify, and distribute the code as long as you comply with the terms of the MIT License.

Have issues or questions?

If you encounter any problems or have questions, please file an issue on the datashadric GitHub repository - Issues Page.

Build, Release & Deploy Instructions (v0.3.3)

The full build-to-publish workflow is captured in datashadric-build-test-upload_instructions.ps1 (PowerShell) and datashadric-build-test-upload_instructions.bat (CMD). The steps below can be run manually in order.

1. Clean previous build artefacts

# Remove old distributions and egg-info
rm -rf dist/ build/ src/*.egg-info

2. Build the package

python -m build

This produces .tar.gz and .whl files in the dist/ directory.

3. Validate the build

twine check dist/*

Ensure the output reports no errors or warnings.

4. Quick smoke-test

import datashadric
print(datashadric.__version__) # should print 0.3.3 as of 13 March 2026

5. Run the test suite

python -m pytest tests/ -v --cov=datashadric --cov-report=term-missing

6. Publish to TestPyPI (optional, recommended)

twine upload --repository testpypi dist/*
pip install --index-url https://test.pypi.org/simple/ datashadric==0.3.3

7. Publish to PyPI

twine upload --repository pypi dist/*

8. Install locally in editable mode

pip install -e .

9. Tag the release in Git

git add .
git commit -m "Release v0.3.3 — multiple imputation methods"
git tag -a v0.3.3 -m "v0.3.3"
git push origin main --tags

Note: If you use the Manage-GitHub PowerShell function, you can replace steps 8-9 with:

Manage-GitHub -commitMessage "Release v0.3.3" -TagName v0.3.3 -TagMessage "v0.3.3"

Changelog

Iterative releases are usually the same release re-bundled with minor improvements, so they are grouped together below.

Version: 0.3.0 - 0.3.3 (Iterative Releases)

Release Date: 12 March 2026 - 13 March 2026

  • New module: imputation — comprehensive multiple imputation methods for handling missing data
    • MICE with Predictive Mean Matching (PMM)
    • MICE with Bayesian Linear Regression (norm)
    • MICE with Logistic Regression for binary/categorical columns
    • Iterative Random Forest imputation (missForest-style, supports numeric and categorical)
    • K-Nearest Neighbours (KNN) imputation
    • Imputation summary utility for before/after comparison
  • Added MODULE_NOTES.md in src/datashadric/ documenting every module and function
  • Added build, release, and deploy instructions to README
  • Version bumps across the 0.3.x series for minor fixes and documentation updates
  • Fixed README formatting and typos
  • Fixed broken anova function in stochastics module (was using wrong statsmodels submodules)
  • Fixed VIF calculation function in stochastics module to ensure it works correctly with pandas DataFrames and handles constant term properly
  • Fixed broken ols regression function in regression module (was using wrong statsmodels submodules)
  • Updated documentation in MODULE_NOTES.md for all modules, especially the new imputation module

Version: 0.2.0 - 0.2.3 (Iterative Releases)

Release Date: 4 November 2025 - 10 November 2025

  • Added image annotation when detecting outliers using AI-assisted bounding box generation
  • Enhanced outlier detection and removal functions in data-processor module
  • Added AI agents to assist with data analysis and visualization tasks (requires the user to store API keys in system environment variables)
  • Added Apache Superset as an additional visualization dependency
  • Minor bug fixes and enhancements in dataframing and plotters modules
  • Updated documentation

Version: 0.1.4

Release Date: 9 October 2025

  • Minor bug fixes
  • Minor enhancements to user optionality in many functions for mlearning, stochastics and dataframing modules
  • Added user optionality for saving plots to files in plotters module
  • Updated documentation

Version: 0.1.3

Release Date: 8 October 2025

  • Minor bug fixes
  • Added print statements for better process tracking in data processing functions
  • Added stochastic and machine-learning-based outlier detection and removal
  • Updated documentation

Version: 0.1.2

Release Date: 6 October 2025

  • Enhanced dataframe utilities
  • New functions for index and column name retrieval
  • Improved documentation and examples

Version: 0.1.1

Release Date: 3 October 2025

  • Supplemental release
  • Additional functions for outlier detection
  • Additional functions for plotting (LOWESS meanline plotter)
  • Additional functions for data clustering based on k-means

Version: 0.1.0

Release Date: 2 October 2025

  • Initial release
  • Core modules: mlearning, regression, dataframing, stochastics, plotters
  • Comprehensive documentation and examples
  • Minimal test coverage

