An exploratory data science toolkit for analysis, machine learning, multimodal AI agents for text and image processing, and visualization (Apache Superset).

datashadric - Python Toolkit for Machine Learning and Advanced Data Analytics

An exploratory Python toolkit for data science, machine learning, statistical analysis, and visualization.

Author

Paul Namalomba - University of Cape Town

  • SESKA Computational Engineer
  • Software Developer
  • PhD Candidate (Civil Engineering Spec. Computational and Applied Mechanics)
  • Email: kabwenzenamalomba@gmail.com

Overview

datashadric provides a collection of well-organized modules for common data science tasks: data cleaning and exploration, machine learning model building, supervised and unsupervised classification, and statistical analysis and testing. The package is designed for readability and ease of use, so that analysts can write complex data science workflows with less effort.

Features

  • Machine Learning: Model training, data ensembling (sampling), model evaluation, and prediction tools.
  • Regression Analysis: Linear and Logistic regression modeling with diagnostic checks.
  • Data Manipulation: Pandas-based utilities for cleaning and transforming data, and for summarizing its descriptive characteristics.
  • Statistical Analysis: Hypothesis testing, confidence intervals, and normal (Gaussian) and Bayesian distribution checks. Sampling utilities are also included.
  • Visualization: Plotting functions for data exploration, visualization and presentation.

Installation

From PyPI (recommended)

pip install datashadric

From Source

git clone https://github.com/paulnamalomba/datashadric.git
cd datashadric
pip install .

Development Installation

git clone https://github.com/paulnamalomba/datashadric.git
cd datashadric
pip install -e ".[dev]"

Quick Start

import pandas as pd
from datashadric.mlearning import ml_naive_bayes_model
from datashadric.regression import lr_ols_model
from datashadric.dataframing import df_check_na_values
from datashadric.stochastics import df_gaussian_checks
from datashadric.plotters import df_boxplotter
from datashadric.aiagents import ai_analyze_plot_data_with_vision
from datashadric.aiagents import ai_data_insights_summary

# load your data
df = pd.read_csv('your_data.csv')

# check for missing values
na_summary = df_check_na_values(df)

# test for normality
normality_results = df_gaussian_checks(df, 'your_column')

# create visualizations
df_boxplotter(df, 'category_col', 'numeric_col', type_plot=0)

# build machine learning models
model, metrics = ml_naive_bayes_model(df, 'target_column', test_size=0.2)

# perform regression analysis
ols_results = lr_ols_model(df, 'dependent_var', ['independent_var1', 'independent_var2'])

Module Overview

mlearning - Machine Learning

  • ml_naive_bayes_model(): Train and evaluate Naive Bayes classifiers
  • ml_naive_bayes_metrics(): Calculate detailed model performance metrics
  • logr_predictor(): Logistic regression modeling and prediction
  • confusion_matrix_from_predictions(): Generate confusion matrices
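
For readers who want to see the kind of workflow these helpers wrap, here is a minimal sketch written directly against scikit-learn. It is not datashadric's implementation: the helper name naive_bayes_sketch, the choice of GaussianNB, and the synthetic data are assumptions made purely for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.metrics import accuracy_score, confusion_matrix
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB


def naive_bayes_sketch(df, target, test_size=0.2, seed=0):
    """Hypothetical helper: split the data, fit GaussianNB, return model and metrics."""
    X = df.drop(columns=[target])
    y = df[target]
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=test_size, random_state=seed, stratify=y
    )
    model = GaussianNB().fit(X_train, y_train)
    y_pred = model.predict(X_test)
    metrics = {
        "accuracy": accuracy_score(y_test, y_pred),
        "confusion_matrix": confusion_matrix(y_test, y_pred),
    }
    return model, metrics


# synthetic, well-separated two-class data (illustration only)
rng = np.random.default_rng(0)
demo = pd.DataFrame({
    "feature_a": np.concatenate([rng.normal(0, 1, 50), rng.normal(8, 1, 50)]),
    "feature_b": np.concatenate([rng.normal(0, 1, 50), rng.normal(8, 1, 50)]),
    "target": [0] * 50 + [1] * 50,
})

model, metrics = naive_bayes_sketch(demo, "target")
print(metrics["accuracy"])
print(metrics["confusion_matrix"])
```

Returning the model together with a metrics dictionary mirrors the (model, metrics) pair shown in the Quick Start, but the exact contents of datashadric's metrics object may differ.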

regression - Regression Analysis

  • lr_ols_model(): Ordinary Least Squares regression modeling
  • lr_check_homoscedasticity(): Test regression assumptions
  • lr_check_normality(): Check residual normality
  • lr_post_hoc_test(): Post-hoc regression diagnostics

dataframing - Data Manipulation

  • df_check_na_values(): Comprehensive missing value analysis
  • df_drop_dupes(): Remove duplicate rows with reporting
  • df_one_hot_encoding(): Convert categorical variables to dummy variables
  • df_check_correlation(): Correlation analysis and visualization
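
As a rough picture of the operations these utilities cover, the following sketch uses plain pandas. It does not call datashadric and makes no claim about its return types; the DataFrame contents and variable names are invented for the example.

```python
import numpy as np
import pandas as pd

raw = pd.DataFrame({
    "city": ["Cape Town", "Durban", "Durban", None, "Cape Town"],
    "temp_c": [21.5, 27.0, 27.0, 19.0, np.nan],
})

# missing-value summary: count and percentage per column
na_summary = pd.DataFrame({
    "n_missing": raw.isna().sum(),
    "pct_missing": raw.isna().mean() * 100,
})

# remove exact duplicate rows and report how many were dropped
deduped = raw.drop_duplicates()
n_dropped = len(raw) - len(deduped)

# one-hot encode the categorical column (missing categories get no dummy column)
encoded = pd.get_dummies(deduped, columns=["city"])

# pairwise Pearson correlation; cast to float so dummy columns are treated as numeric
corr = encoded.astype(float).corr()

print(na_summary)
print(f"dropped {n_dropped} duplicate row(s)")
print(corr.round(2))
```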

stochastics - Statistical Analysis

  • df_gaussian_checks(): Test data normality with Shapiro-Wilk and Q-Q plots
  • df_calc_conf_interval(): Calculate confidence intervals
  • df_calc_moe(): Compute margin of error
  • df_calc_zscore(): Z-score calculations
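
The quantities named above follow standard formulas: a confidence interval for the mean is the sample mean plus or minus a margin of error, and a z-score is a value's distance from the mean in units of standard deviation. The sketch below computes them with NumPy and SciPy. The helper mean_conf_interval is hypothetical, and the use of a t-based critical value is an assumption; datashadric may use a different convention.

```python
import numpy as np
from scipy import stats


def mean_conf_interval(values, confidence=0.95):
    """Hypothetical helper: t-based interval; returns (mean, margin_of_error, (low, high))."""
    values = np.asarray(values, dtype=float)
    n = values.size
    mean = values.mean()
    sem = values.std(ddof=1) / np.sqrt(n)  # standard error of the mean
    t_crit = stats.t.ppf((1 + confidence) / 2, df=n - 1)
    moe = t_crit * sem
    return mean, moe, (mean - moe, mean + moe)


sample = np.array([9.8, 10.1, 10.0, 10.3, 9.9, 10.2, 10.0, 9.7])

mean, moe, (lower, upper) = mean_conf_interval(sample, confidence=0.95)

# z-scores relative to the sample mean and sample standard deviation
z_scores = (sample - sample.mean()) / sample.std(ddof=1)

# Shapiro-Wilk normality test; a small p-value suggests non-normal data
shapiro_stat, shapiro_p = stats.shapiro(sample)

print(f"mean = {mean:.3f}, margin of error = {moe:.3f}")
print(f"95% CI = ({lower:.3f}, {upper:.3f})")
print(f"Shapiro-Wilk p = {shapiro_p:.3f}")
```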

plotters - Visualization

  • df_boxplotter(): Box plots for outlier detection
  • df_histplotter(): Histogram creation with customization
  • df_scatterplotter(): Scatter plot generation
  • df_pairplot(): Comprehensive pairwise plotting
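
To show how a box plot surfaces outliers, here is a small matplotlib sketch. It is independent of datashadric's plotting functions, whose parameters (such as type_plot) are not documented here. The data, the variable names, and the in-memory PNG buffer are assumptions for illustration.

```python
import io

import matplotlib
matplotlib.use("Agg")  # non-interactive backend so the sketch runs without a display
import matplotlib.pyplot as plt
import pandas as pd

df_demo = pd.DataFrame({
    "group": ["a"] * 6 + ["b"] * 6,
    "value": [1.0, 1.2, 0.9, 1.1, 1.0, 9.0, 2.0, 2.1, 1.9, 2.2, 2.0, 2.1],
})

names = sorted(df_demo["group"].unique())
series = [df_demo.loc[df_demo["group"] == g, "value"].to_numpy() for g in names]

fig, ax = plt.subplots(figsize=(5, 3))
parts = ax.boxplot(series)  # points beyond 1.5 * IQR from the box are drawn as fliers
ax.set_xticks(range(1, len(names) + 1))
ax.set_xticklabels(names)
ax.set_xlabel("group")
ax.set_ylabel("value")

# collect the flier (outlier) values that matplotlib identified for each group
outliers = {g: line.get_ydata().tolist() for g, line in zip(names, parts["fliers"])}

# save to an in-memory buffer; a file path such as "boxplot.png" works the same way
buffer = io.BytesIO()
fig.savefig(buffer, format="png", dpi=100)
plt.close(fig)

print(outliers)
```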

Dependencies

Core Dependencies

  • pandas >= 1.3.0
  • numpy >= 1.20.0
  • scikit-learn >= 1.0.0
  • matplotlib >= 3.4.0
  • seaborn >= 0.11.0
  • scipy >= 1.7.0
  • statsmodels >= 0.12.0
  • plotly >= 5.0.0

To install them all at once from a source checkout:

pip install -r requirements/requirements-core.txt

Testing Dependencies

For running tests, you'll need to install additional packages:

pip install pytest pytest-cov

Testing

To run the test suite:

# Install testing dependencies first
pip install pytest pytest-cov

# Run all tests
python -m pytest tests/ -v

# Run tests with coverage report
python -m pytest tests/ --cov=datashadric --cov-report=html --cov-report=term-missing

Examples

Data Cleaning and Exploration

from datashadric.dataframing import df_check_na_values, df_drop_dupes
from datashadric.plotters import df_histplotter

# check data quality
na_report = df_check_na_values(df)
df_clean = df_drop_dupes(df)

# visualize distributions
df_histplotter(df_clean, 'numeric_column', type_plot=0, bins=30)

Statistical Testing

from datashadric.stochastics import df_gaussian_checks, df_calc_conf_interval

# test normality
normality_test = df_gaussian_checks(df, 'measurement_column')

# calculate confidence intervals
ci = df_calc_conf_interval(df['measurement_column'], confidence=0.95)

Machine Learning Workflow

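from sklearn.model_selection import train_test_split
from datashadric.mlearning import ml_naive_bayes_model, ml_naive_bayes_metrics

# train model
model, initial_metrics = ml_naive_bayes_model(df, 'target', test_size=0.3)

# build X_test / y_test for the detailed evaluation
# (assumption: illustrative split only; it may overlap the rows used for training)
_, X_test, _, y_test = train_test_split(df.drop(columns=['target']), df['target'], test_size=0.3)

# detailed evaluation
detailed_metrics = ml_naive_bayes_metrics(model, X_test, y_test)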

Contributing

Contributions are welcome! Please feel free to submit a Pull Request. For major changes, please open an issue first to discuss what you would like to change.

  1. Fork the repository
  2. Create your feature branch (git checkout -b feature/AmazingFeature)
  3. Commit your changes (git commit -m 'Add some AmazingFeature')
  4. Push to the branch (git push origin feature/AmazingFeature)
  5. Open a Pull Request

License

This project is licensed under the MIT License - see the LICENSE file for details.

Support

If you encounter any problems or have questions, please file an issue on the GitHub repository.

Changelog

Version: 0.1.0

Release Date: 2 October 2025

  • Initial release
  • Core modules: mlearning, regression, dataframing, stochastics, plotters
  • Comprehensive documentation and examples
  • Minimal test coverage

Version: 0.1.1

Release Date: 3 October 2025

  • Supplemental release
  • Additional functions for outlier detection
  • Additional functions for plotting (LOWESS meanline plotter)
  • Additional functions for data clustering based on k-means

Version: 0.1.2

Release Date: 6 October 2025

  • Enhanced dataframe utilities
  • New functions for index and column name retrieval
  • Improved documentation and examples

Version: 0.1.3

Release Date: 8 October 2025

  • Minor bug fixes
  • Added print statements for better process tracking in data processing functions
  • Added functions for stochastic and machine-learning-based outlier detection and removal
  • Updated documentation

Version: 0.1.4

Release Date: 9 October 2025

  • Minor bug fixes
  • Minor enhancements to user-configurable options in many functions of the mlearning, stochastics and dataframing modules
  • Added an option to save plots to files in the plotters module
  • Updated documentation

Version: 0.2.0

Release Date: 24 October 2025

  • Added image annotation when detecting outliers using AI-assisted bounding box generation
  • Enhanced outlier detection and removal functions in data-processor module
  • Added AI agents to assist with data analysis and visualization tasks (requires users to store their API keys in system environment variables)
  • Updated documentation
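
Since the AI agent features expect API keys in environment variables, a caller might guard against a missing key as sketched below. The variable name EXAMPLE_AI_API_KEY and the helper get_api_key are hypothetical; the names datashadric actually reads are not stated in this document.

```python
import os


def get_api_key(var_name="EXAMPLE_AI_API_KEY"):
    """Return the key stored in `var_name`, or raise a clear error if it is unset."""
    key = os.environ.get(var_name, "").strip()
    if not key:
        raise RuntimeError(
            f"Set the {var_name} environment variable before calling AI helpers."
        )
    return key


# demonstration only: real keys should be set in the shell, never hard-coded
os.environ["EXAMPLE_AI_API_KEY"] = "dummy-key-for-illustration"
print(get_api_key()[:5] + "...")
```

Failing early with a descriptive message is usually easier to debug than an authentication error raised deep inside a request.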

Version: 0.2.1

Release Date: 10 November 2025

  • Added Apache Superset as an additional visualization dependency
  • Minor bug fixes and enhancements in dataframing and plotters modules
  • Updated documentation
