Skip to main content

Datawaza is a collection of tools for data exploration, visualization, data cleaning, pipeline creation, model iteration, and evaluation.

Project description


datawaza_logo_name_trans.svg

PyPI Version License Last Commit Documentation Status Coverage Status Python Version

Datawaza streamlines common Data Science tasks. It's a collection of tools for data exploration, visualization, data cleaning, pipeline creation, hyper-parameter searching, model iteration, and evaluation. It builds upon core libraries like Pandas, Matplotlib, Seaborn, and Scikit-Learn.

Installation

The latest release can be found on PyPI. See the Change Log for a history of changes. Install Datawaza with pip:

pip install datawaza

Documentation

Online documentation is available at Datawaza.com.

The User Guide is a Jupyter notebook that walks through how to use the Datawaza functions. It's probably the best place to start. There is also an API reference for the major modules: Clean, Explore, Model, and Tools.

Development

The Datawaza repo is on GitHub.

Please submit bugs that you encounter to the Issue Tracker. Contributions and ideas for enhancements are welcome! So far this is a solo effort, but I would love to collaborate.

Dependencies

Datawaza supports Python 3.10. It may support other versions, but these have not been tested yet.

Due to the breadth of use cases, installation requires NumPy, Pandas, Matplotlib, Seaborn, Plotly, Scikit-Learn, SciPy, Cartopy, GeoPandas, StatsModels, and a few other supporting packages. See the Requirements.txt.

What is Waza?

Waza (技) means "technique" in Japanese. In martial arts like Aikido, it is paired with words like "suwari-waza" (sitting techniques) or "kaeshi-waza" (reversal techniques). So we've paired it with "data" to represent Data Science techniques: データ技 "data-waza".

Origin Story

Most of these functions were created while I was pursuing a Professional Certificate in Machine Learning & Artificial Intelligence from U.C. Berkeley. With every assignment, I tried to simplify repetitive tasks and streamline my workflow. They served me well, and I hope you will find some value in them.

Quick Start

The User Guide will show you how to use Datawaza's functions in depth. Assuming you already have data loaded, here are some examples of what it can do:

>>> import datawaza as dw

Show the unique values of each variable below the threshold of n = 12:

>>> dw.get_unique(df, 12, count=True, percent=True)

CATEGORICAL: Variables with unique values equal to or below: 12

job has 12 unique values:

    admin.              10422   25.3%
    blue-collar         9254    22.47%
    technician          6743    16.37%
    services            3969    9.64%
    management          2924    7.1%
    retired             1720    4.18%
    entrepreneur        1456    3.54%
    self-employed       1421    3.45%
    housemaid           1060    2.57%
    unemployed          1014    2.46%
    student             875     2.12%
    unknown             330     0.8%

marital has 4 unique values:

    married        24928   60.52%
    single         11568   28.09%
    divorced       4612    11.2%
    unknown        80      0.19%

Plot bar charts of categorical variables, dimensioned by the target variable:

>>> dw.plot_charts(df, plot_type='cat', cat_cols=cat_columns, hue='y', rotation=90)

plot_charts output

Get the top positive and negative correlations with the target variable, and save to lists:

>>> pos_features, neg_features = dw.get_corr(df_enc, n=10, var='subscribed_enc', return_arrays=True)

Top 10 positive correlations:
Variable 1      Variable 2  Correlation
0               duration  subscribed_enc         0.41
1       poutcome_success  subscribed_enc         0.32
2   previously_contacted  subscribed_enc         0.32
3                  pdays  subscribed_enc         0.27
4               previous  subscribed_enc         0.23
5              month_mar  subscribed_enc         0.14
6              month_oct  subscribed_enc         0.14
7              month_sep  subscribed_enc         0.12
8           no_default_1  subscribed_enc         0.10
9            job_student  subscribed_enc         0.09

Top 10 negative correlations:
Variable 1      Variable 2  Correlation
0            nr.employed  subscribed_enc        -0.35
1              euribor3m  subscribed_enc        -0.31
2           emp.var.rate  subscribed_enc        -0.30
3   poutcome_nonexistent  subscribed_enc        -0.19
4      contact_telephone  subscribed_enc        -0.14
5         cons.price.idx  subscribed_enc        -0.14
6              month_may  subscribed_enc        -0.11
7               campaign  subscribed_enc        -0.07
8        job_blue-collar  subscribed_enc        -0.07
9     education_basic.9y  subscribed_enc        -0.05

Plot a chart showing the top correlations with the target variable:

>>> dw.plot_corr(df_enc, 'subscribed_enc', n=16, size=(12,6), rotation=90)

plot_corr output

Run a model iteration, which dynamically assembles a pipeline and evaluates the model, including charts of residuals, predicted vs. actual, and coefficients:

>>> results_df, iteration_6 = dw.iterate_model(X2_train, X2_test, y2_train, y2_test,
...     transformers=['ohe', 'log', 'poly3'], model='linreg',
...     iteration='6', note='X2. Test size: 0.25, Pipeline: OHE > Log > Poly3 > LinReg',
...     plot=True, lowess=True, coef=True, perm=True, vif=True, decimal=2,
...     save=True, save_df=results_df, config=my_config)

iterate_model output 1 of 3 iterate_model output 2 of 3 iterate_model output 3 of 3

Compare train/test scores across model iterations, and select the best result:

>>> dw.plot_results(results_df, metrics=['Train MAE', 'Test MAE'], y_label='Mean Absolute Error',
...     select_metric='Test MAE', select_criteria='min', decimal=0)

plot_results output

This was just a sample of some Datawaza tools. Download userguide.ipynb and explore the full breadth of the library in your Jupyter environment.

Project details


Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

datawaza-0.1.2.tar.gz (2.9 MB view hashes)

Uploaded Source

Built Distribution

datawaza-0.1.2-py3-none-any.whl (2.9 MB view hashes)

Uploaded Python 3

Supported by

AWS AWS Cloud computing and Security Sponsor Datadog Datadog Monitoring Fastly Fastly CDN Google Google Download Analytics Microsoft Microsoft PSF Sponsor Pingdom Pingdom Monitoring Sentry Sentry Error logging StatusPage StatusPage Status page