Library of useful data structures for data preprocessing, visualizations and grid search

Rearden

Rearden is a Python package that provides a faster and more convenient way of doing data science and running machine learning algorithms. It builds on the most popular libraries for data analysis (pandas, numpy, statsmodels), data visualization (matplotlib, seaborn) and grid search (scikit-learn) to help reach conclusions about the data more quickly and clearly.

Modules and API

The package is designed to aid data scientists in quickly getting insights about the data during the following stages of data analysis/machine learning:

Data preprocessing
Data visualization
Time-series analysis
Grid search

Accordingly, the data structures that make up the Rearden package are divided into Python modules, one for each of the stages above:

preprocessings.py
vizualizations.py
time_series.py
grid_search.py

Data preprocessing

The functions in preprocessings.py help with handling missing and anomalous values, removing duplicates and preparing data for machine learning algorithms (e.g. splitting data into sets). The module currently includes the following functions:

Name	Kind	Description
`identify_missing_values`	function	Display of the number and share of missing values
`preprocess_duplicates`	function	Deletion of duplicated rows with a message
`filter_data`	function	Filters data according to the predetermined ranges
`prepare_sets`	function	Data split into sets depending on target name and sets proportions

The functions can be imported from the module like so:

from rearden.preprocessings import prepare_sets
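To make the table above more concrete, here is a minimal pandas-only sketch of the kind of summary and cleaning these helpers describe. It does not call Rearden itself: the helper names `summarize_missing` and `drop_duplicates_with_message`, the sample frame and its column names are all hypothetical, and the actual signatures in preprocessings.py may differ.

```python
import pandas as pd


def summarize_missing(data: pd.DataFrame) -> pd.DataFrame:
    """Return the count and share of missing values per column (hypothetical helper)."""
    summary = pd.DataFrame(
        {
            "missing_count": data.isna().sum(),
            "missing_share": data.isna().mean(),
        }
    )
    # Keep only columns that actually contain gaps, largest share first
    return summary[summary["missing_count"] > 0].sort_values(
        "missing_share", ascending=False
    )


def drop_duplicates_with_message(data: pd.DataFrame) -> pd.DataFrame:
    """Drop fully duplicated rows and report how many were removed (hypothetical helper)."""
    n_duplicates = int(data.duplicated().sum())
    print(f"Duplicated rows removed: {n_duplicates}")
    return data.drop_duplicates().reset_index(drop=True)


# Made-up sample data: one gap in "age", two gaps in "city", one duplicated row
sample = pd.DataFrame(
    {
        "age": [25, None, 31, 31, 40],
        "city": ["Oslo", None, "Kyiv", "Kyiv", None],
        "income": [10.0, 20.0, 30.0, 30.0, 40.0],
    }
)

missing_summary = summarize_missing(sample)
deduplicated = drop_duplicates_with_message(sample)
print(missing_summary)
```

Reporting both the count and the share is convenient because the share makes columns of different lengths or datasets of different sizes directly comparable.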

Data visualization

Enhanced data visualization tools are located in the vizualizations.py module. The following functions are available:

Name	Kind	Description
`plot_model_comparison`	function	Visualization of ML model performance based on model names and scores
`plot_corr_heatmap`	function	Plotting correlation matrix heatmap in one go
`plot_class_structure`	function	Plotting the shares of different classes for a target vector in classification problems

Models performance comparison

The plot_model_comparison function makes it easy to show how models perform according to a chosen metric. One would just run:

import seaborn as sns

from rearden.vizualizations import plot_model_comparison

sns.set_theme()

models_performance = [
    ("Decision Tree", 30.8343),
    ("Random Forest", 29.3127),
    ("Catboost", 26.4651),
    ("Xgboost", 26.7804),
    ("LightGBM", 26.6084),
]

plot_model_comparison(
    results=models_performance,
    metric_name="RMSE",
    title_name="Grid search results",
)

The result is the following figure:

[Figure: Models performance comparison]
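For readers who want to see what such a comparison involves, the following is a rough matplotlib-only sketch that ranks the same scores and draws them as horizontal bars. It is not Rearden's implementation, and the layout choices (horizontal bars, best model on top) are assumptions made for illustration.

```python
import matplotlib

matplotlib.use("Agg")  # headless backend so the sketch runs without a display
import matplotlib.pyplot as plt

models_performance = [
    ("Decision Tree", 30.8343),
    ("Random Forest", 29.3127),
    ("Catboost", 26.4651),
    ("Xgboost", 26.7804),
    ("LightGBM", 26.6084),
]

# Lower RMSE is better, so sort ascending to put the best model first
ranked = sorted(models_performance, key=lambda pair: pair[1])
names = [name for name, _ in ranked]
scores = [score for _, score in ranked]

fig, ax = plt.subplots(figsize=(8, 4))
bars = ax.barh(names, scores)
ax.invert_yaxis()  # draw the best model at the top of the chart
ax.set_xlabel("RMSE")
ax.set_title("Grid search results")
fig.tight_layout()

best_name, best_score = ranked[0]
print(f"Best model: {best_name} (RMSE = {best_score})")
```

Sorting before plotting keeps the ranking readable at a glance; for metrics where higher is better, the sort order would need to be reversed.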

Correlation matrix heatmap

The plot_corr_heatmap function plots the heatmap of the correlation matrix of a dataset in a single call. Here is how we would do that:

import pandas as pd
import seaborn as sns

from rearden.vizualizations import plot_corr_heatmap

sns.set_theme()

test_data = pd.read_csv("datasets/test_data.csv")

plot_corr_heatmap(
    data=test_data,
    heatmap_coloring="Oranges",
    annotation=True,
    lower_triangle=True,
)

The code above results in the following plot:

[Figure: Correlation matrix heatmap]
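The idea behind a lower-triangle heatmap can be sketched with pandas and numpy alone: a correlation matrix is symmetric, so the cells above the diagonal repeat the ones below it and can be masked. The snippet below is an illustration on synthetic data, not the code behind plot_corr_heatmap. Whether Rearden keeps or hides the diagonal is an assumption here (it is kept).

```python
import numpy as np
import pandas as pd

# Synthetic data: "y" depends strongly on "x", "z" is independent noise
rng = np.random.default_rng(12345)
x = rng.normal(size=200)
demo = pd.DataFrame(
    {
        "x": x,
        "y": 2.0 * x + rng.normal(scale=0.5, size=200),
        "z": rng.normal(size=200),
    }
)

corr = demo.corr()

# True strictly above the main diagonal: these cells mirror the lower triangle
upper_mask = np.triu(np.ones(corr.shape, dtype=bool), k=1)

# Replace the mirrored cells with NaN, keeping the diagonal and lower triangle
lower_corr = corr.mask(upper_mask)
print(lower_corr.round(2))

# With seaborn installed, the same mask can be passed straight to the plot:
# sns.heatmap(corr, mask=upper_mask, cmap="Oranges", annot=True)
```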

Time-series analysis

The time_series.py module provides the following tools for time-series analysis:

Name	Kind	Description
`FeaturesExtractor`	class	Extraction of time variables from a one-dimensional time-series depending on lag and rolling mean order values
`prepare_ts`	function	Data split of a time-series data into sets depending on target name and sets proportions
`plot_time_series`	function	Plotting the original time-series or a decomposed one

A typical workflow is to first generate the features with FeaturesExtractor, then inspect the series with plot_time_series, and finally divide the data into sets with prepare_ts.
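The first and last steps of that workflow can be illustrated with plain pandas. The sketch below builds calendar, lag and rolling-mean features and then splits the result chronologically. The function name `make_ts_features`, its parameters and the synthetic series are hypothetical, and the actual interfaces of FeaturesExtractor and prepare_ts may look different.

```python
import numpy as np
import pandas as pd


def make_ts_features(series: pd.Series, max_lag: int, rolling_order: int) -> pd.DataFrame:
    """Build calendar, lag and rolling-mean features (hypothetical helper)."""
    features = pd.DataFrame({"target": series})
    features["hour"] = series.index.hour
    features["dayofweek"] = series.index.dayofweek
    for lag in range(1, max_lag + 1):
        features[f"lag_{lag}"] = series.shift(lag)
    # Shift before rolling so the mean only uses past observations (no leakage)
    features["rolling_mean"] = series.shift(1).rolling(rolling_order).mean()
    # The first rows have no history yet, so they contain NaN and are dropped
    return features.dropna()


# Synthetic hourly series standing in for real data
index = pd.date_range("2018-03-01", periods=96, freq=pd.Timedelta(hours=1))
series = pd.Series(np.arange(96, dtype=float), index=index, name="num_orders")

features = make_ts_features(series, max_lag=3, rolling_order=4)

# Chronological split without shuffling: the test set lies strictly in the future
split_point = int(len(features) * 0.8)
train, test = features.iloc[:split_point], features.iloc[split_point:]
print(len(train), len(test))
```

Unlike an ordinary random split, a time-series split must preserve order; otherwise the model would be trained on observations that come after the ones it is tested on.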

Time-series plot and its decomposition

The plot_time_series function provides two ways of plotting a time-series:

Plain time-series
Decomposed time-series (trend, seasonality, residual)

Take, for example:

import pandas as pd
import seaborn as sns

from rearden.time_series import plot_time_series

sns.set_theme()

ts_data_test = pd.read_csv("datasets/ts_data_test.csv", parse_dates=[0], index_col=[0])
ts_data_test_resampled = ts_data_test.resample("1H").sum()

plot_time_series(
    data=ts_data_test_resampled,
    col="num_orders",
    period_start="2018-03-01",
    period_end="2018-03-03",
    ylabel_name="Number of orders",
    title_name="Time-series plot",
)

In this case we plot the evolution of the number of orders against time. We obtain the following plot:

[Figure: Time-series plot]

We could also decompose this time series by adding kind="decomposed" to the above function call:

plot_time_series(
    data=ts_data_test_resampled,
    col="num_orders",
    kind="decomposed",
    period_start="2018-03-01",
    period_end="2018-03-03",
    ylabel_name="Number of orders",
)

The result is as follows:

[Figure: Decomposed time-series]
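To show what trend, seasonality and residual mean in practice, here is a simplified additive decomposition written with pandas only. It is a sketch on a synthetic series and not the routine used by plot_time_series. statsmodels, which Rearden lists as a dependency, offers `seasonal_decompose` for this task, but whether Rearden relies on it is an assumption.

```python
import numpy as np
import pandas as pd

period = 24  # hourly data with a daily cycle

# Synthetic series: linear trend plus a daily sine wave
index = pd.date_range("2018-03-01", periods=24 * 14, freq=pd.Timedelta(hours=1))
hours = np.arange(len(index))
series = pd.Series(
    100 + 0.1 * hours + 10 * np.sin(2 * np.pi * hours / period),
    index=index,
    name="num_orders",
)

# Trend: centered moving average over one full period (a simplification of
# the classical 2 x period moving average used for even periods)
trend = series.rolling(window=period, center=True).mean()

# Seasonality: average detrended value for each hour of the day
detrended = series - trend
seasonal = detrended.groupby(detrended.index.hour).transform("mean")

# Residual: whatever is left after removing trend and seasonality
resid = series - trend - seasonal

reconstructed = trend + seasonal + resid
print(resid.dropna().abs().max())
```

By construction, the three components add back up to the original series wherever the trend is defined; the edges of the series are NaN because the moving average needs a full window.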

Grid search

The grid_search.py module builds on the RandomizedSearchCV class from sklearn.model_selection, wrapping it in two classes that add extra methods, custom defaults and other functionality:

Name	Kind	Description
`RandomizedHyperoptRegression`	class	Wrapper for `RandomizedSearchCV` with functionality to quickly compute regression metrics and conveniently display tuning process
`RandomizedHyperoptClassification`	class	Wrapper for `RandomizedSearchCV` with functionality to quickly compute classification metrics, conveniently display tuning process and quickly plot confusion matrix

Confusion matrix

The RandomizedHyperoptClassification wrapper helps to quickly draw conclusions from the results of the grid search. For instance, suppose the data has already been split into features_train and features_test as well as target_train and target_test. We can now run the grid search and immediately plot the confusion matrix:

import numpy as np
from sklearn.tree import DecisionTreeClassifier

from rearden.grid_search import RandomizedHyperoptClassification

dtc_model = DecisionTreeClassifier(random_state=12345)
param_grid_dtc = {"max_depth": np.arange(1, 12)}

dtc_grid_search = RandomizedHyperoptClassification(
    estimator=dtc_model,
    param_distributions=param_grid_dtc,
    train_dataset=(features_train, target_train),
    eval_dataset=(features_test, target_test),
    random_state=12345,
    cv=5,
    n_iter=5,
    scoring="f1",
    n_jobs=None,
)
dtc_grid_search.train_crossvalidate()

dtc_grid_search.plot_confusion_matrix(label_names=("label_1", "label_2"))

Thanks to the additional eval_dataset argument, the resulting plot is the confusion matrix of the best model found during cross-validation, computed from its predictions on the test data:

[Figure: Confusion matrix]
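For comparison, the same steps can be written out with scikit-learn alone, which is roughly the work the wrapper saves. This is a sketch on synthetic data generated with `make_classification`, not a description of how RandomizedHyperoptClassification is implemented internally.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.metrics import confusion_matrix
from sklearn.model_selection import RandomizedSearchCV, train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic binary classification data standing in for a real dataset
features, target = make_classification(
    n_samples=500, n_features=8, random_state=12345
)
features_train, features_test, target_train, target_test = train_test_split(
    features, target, test_size=0.25, stratify=target, random_state=12345
)

# Randomized search over tree depth with 5-fold cross-validation
search = RandomizedSearchCV(
    estimator=DecisionTreeClassifier(random_state=12345),
    param_distributions={"max_depth": np.arange(1, 12)},
    n_iter=5,
    cv=5,
    scoring="f1",
    random_state=12345,
)
search.fit(features_train, target_train)

# Evaluate the refitted best model on the held-out data
predictions = search.best_estimator_.predict(features_test)
matrix = confusion_matrix(target_test, predictions)
print(search.best_params_)
print(matrix)
```

Rows of the matrix correspond to true classes and columns to predicted classes, so the diagonal holds the correctly classified test samples.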

Installation

Package dependencies

The Rearden library requires the following dependencies:

Package	Version
Matplotlib	>= 3.3.4
Pandas	>= 1.2.4
NumPy	>= 1.24.3
Scikit-learn	>= 1.1.3
Seaborn	>= 0.11.1
Statsmodels	>= 0.13.2

NOTE: The package currently requires Python 3.7 or higher.
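The minimum versions in the table can be checked against the current environment with a few lines of standard-library Python. The sketch below is only an illustration: the helper names are hypothetical, the distribution names are the usual PyPI names for the packages in the table, and the comparison looks only at the leading numeric part of each version. Note that `importlib.metadata` is available from Python 3.8.

```python
import re
from importlib import metadata

# Minimum versions copied from the dependency table above
MINIMUM_VERSIONS = {
    "matplotlib": "3.3.4",
    "pandas": "1.2.4",
    "numpy": "1.24.3",
    "scikit-learn": "1.1.3",
    "seaborn": "0.11.1",
    "statsmodels": "0.13.2",
}


def release_tuple(version: str) -> tuple:
    """Extract the leading numeric components, e.g. '1.24.3rc1' -> (1, 24, 3)."""
    match = re.match(r"\d+(?:\.\d+)*", version)
    return tuple(int(part) for part in match.group().split(".")) if match else ()


def meets_minimum(installed: str, minimum: str) -> bool:
    """Compare two version strings by their numeric release components."""
    return release_tuple(installed) >= release_tuple(minimum)


def check_environment() -> dict:
    """Map each dependency to True if it is installed and recent enough."""
    report = {}
    for name, minimum in MINIMUM_VERSIONS.items():
        try:
            report[name] = meets_minimum(metadata.version(name), minimum)
        except metadata.PackageNotFoundError:
            report[name] = False  # not installed at all
    return report


for package, ok in check_environment().items():
    print(f"{package}: {'OK' if ok else 'missing or too old'}")
```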

Installation using `pip`

The package is available on PyPI and can be installed using pip:

pip install rearden

The dependencies are installed automatically by the above command. Alternatively, after cloning the repo, they can be installed manually:

pip install -r requirements.txt

Building the package

Thanks to the build system requirements and other metadata specified in pyproject.toml it is easy to build and install the package. Firstly, clone the repository:

git clone https://github.com/spolivin/rearden.git

cd rearden

Then, one can simply run the following:

pip install -e .

Automatic code style checks

Installation of `pre-commit`

Before the changed code is pushed to the remote GitHub repository, it undergoes a number of checks run by the pre-commit hooks specified in .pre-commit-config.yaml. To make use of this feature, first install the pre-commit package:

pip install pre-commit

or if rearden package has already been installed:

pip install rearden[precommit]

Afterwards, run the following command inside the git repository to install the hooks:

pre-commit install

Now, the pre-commit hooks can be easily used for verifying the code style.

Pre-commit hooks

After running git commit -m "<Commit message>" in the terminal, the files to be committed go through several checks before the commit is allowed. As specified in .pre-commit-config.yaml, the following hooks are used:

Hooks	Version
Pre-commit-hooks	4.3.0
Autoflake	2.1.1
Isort	5.12.0
Black	23.3.0
Flake8	5.0.0

NOTE: Check .pre-commit-config.yaml for more information about the repos and hooks used.

It is also possible to download the required dependencies for pre-commit hooks:

pip install -r requirements-dev.txt

or:

pip install rearden[formatters]

pip install rearden[linters]

Release history

Version	Release date
0.0.3	Dec 15, 2023
0.0.2	Aug 2, 2023
0.0.1	Jul 29, 2023

