A Pyspark companion for data science tasks.

Pyspark DS Toolbox

Lifecycle: experimental

The package provides a set of tools to support day-to-day data science work with Spark. Documentation and notebooks with usage examples are available.

Feel free to contribute :)

Installation

Directly from PyPI:

pip install pyspark-ds-toolbox

Or from GitHub. Note that this installs the latest development version:

pip install git+https://github.com/viniciusmsousa/pyspark-ds-toolbox.git
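After either install command, it can be useful to confirm which version ended up in the environment. The following is a minimal sketch using only the Python standard library; the helper name `installed_version` is hypothetical and not part of the package, and the distribution name is taken from the pip command above.

```python
from importlib.metadata import PackageNotFoundError, version


def installed_version(dist_name="pyspark-ds-toolbox"):
    """Return the installed version string of `dist_name`, or None if absent.

    `installed_version` is an illustrative helper, not part of the package.
    """
    try:
        return version(dist_name)
    except PackageNotFoundError:
        # The distribution is not installed in the current environment.
        return None


if __name__ == "__main__":
    found = installed_version()
    if found is None:
        print("pyspark-ds-toolbox is not installed")
    else:
        print(f"pyspark-ds-toolbox {found} is installed")
```

A development install from GitHub may report a different version string than the latest PyPI release.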

Organization

The package is organized by the nature of the task, such as data wrangling, model/prediction evaluation, and so on.

pyspark_ds_toolbox         # Main Package
├─ causal_inference           # Sub-package dedicated to Causal Inference
│  ├─ diff_in_diff.py   
│  └─ ps_matching.py    
├─ ml                         # Sub-package dedicated to ML
│  ├─ data_prep                  # Sub-package with ML data preparation tools
│  │  ├─ class_weights.py     
│  │  └─ features_vector.py 
│  ├─ classification             # Sub-package dedicated to classification tasks
│  │  ├─ eval.py
│  │  └─ baseline_classifiers.py 
│  ├─ feature_importance         # Sub-package with feature importance tools
│  │  ├─ native_spark.py
│  │  └─ shap_values.py 
│  └─ feature_selection         # Sub-package with feature selection tools
│     └─ information_value.py    
├─ wrangling                  # Sub-package dedicated to data wrangling tasks
│  ├─ reshape.py               
│  └─ data_quality.py         
└─ stats                      # Sub-package dedicated to basic statistical functionality
   └─ association.py    
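The tree above maps directly to dotted import paths. As a rough sketch, the snippet below lists those paths and checks which of them can be located in the current environment. Only the module paths come from the tree; the names `MODULES`, `is_importable`, and `availability` are hypothetical helpers, and no functions inside the modules are assumed.

```python
import importlib.util

# Dotted module paths derived from the directory tree above.
MODULES = [
    "pyspark_ds_toolbox.causal_inference.diff_in_diff",
    "pyspark_ds_toolbox.causal_inference.ps_matching",
    "pyspark_ds_toolbox.ml.data_prep.class_weights",
    "pyspark_ds_toolbox.ml.data_prep.features_vector",
    "pyspark_ds_toolbox.ml.classification.eval",
    "pyspark_ds_toolbox.ml.classification.baseline_classifiers",
    "pyspark_ds_toolbox.ml.feature_importance.native_spark",
    "pyspark_ds_toolbox.ml.feature_importance.shap_values",
    "pyspark_ds_toolbox.ml.feature_selection.information_value",
    "pyspark_ds_toolbox.wrangling.reshape",
    "pyspark_ds_toolbox.wrangling.data_quality",
    "pyspark_ds_toolbox.stats.association",
]


def is_importable(name):
    """Return True if module `name` can be located.

    Note: for dotted names, find_spec imports the parent packages
    as a side effect.
    """
    try:
        return importlib.util.find_spec(name) is not None
    except (ImportError, ValueError):
        # ImportError: a parent package is missing or failed to import.
        # ValueError: a module was found but has no usable spec.
        return False


def availability(modules=MODULES):
    """Map each module path to whether it can be located."""
    return {name: is_importable(name) for name in modules}


if __name__ == "__main__":
    for name, ok in availability().items():
        print(f"{'ok' if ok else 'missing':8} {name}")
```

If the package is not installed, every entry is reported as missing; the layout may also differ between released versions.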

Download files

Source Distribution: pyspark-ds-toolbox-0.4.3.tar.gz (33.6 kB)

Built Distribution: pyspark_ds_toolbox-0.4.3-py3-none-any.whl (40.4 kB, Python 3)
