Pandas extension to enhance your data analysis.

BambooTools

BambooTools is a Python library designed to enhance your data analysis workflows. Built as an extension to the widely used pandas library, BambooTools provides one-line methods for outlier detection and for investigating missing values.

With BambooTools, you can easily identify and handle outliers in your data, enabling more accurate analyses and predictions. The library also offers a completeness summary feature, which provides a quick and efficient way to assess the completeness of your dataset.

Installation

Install from PyPI

pip install BambooTools

Install from source

pip install git+https://github.com/KwstasMCPU/BambooTools

Usage

You can find examples in the bin\examples.py file. Some of them are illustrated below.

Completeness summary

completeness() returns a completeness summary table, stating the count and ratio of complete (not NULL) values, along with the total number of records, for each column:

from bambootools import bambootools
import pandas as pd
import numpy as np

df = pd.DataFrame({'Animal': ['Falcon', 'Falcon',
                              'Parrot', 'Parrot',
                              'Lama', 'Falcon'],
                   'Max Speed': [380, 370,
                                 24, 26,
                                 np.nan, np.nan],
                   'Weight': [np.nan, 2,
                              1.5, np.nan,
                              80, 2.2]
                   })
# check the completeness of the dataset per column
print(df.bbt.completeness())
           complete values  completeness ratio  total
Animal                   6                 1.0      6
Max Speed                4  0.6666666666666666      6
Weight                   4  0.6666666666666666      6
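
For readers who want to see what such a summary involves, here is a minimal sketch in plain pandas. It is not BambooTools' implementation, only one way to arrive at the same three columns, assuming "complete" means "not NaN".

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'Animal': ['Falcon', 'Falcon', 'Parrot', 'Parrot', 'Lama', 'Falcon'],
    'Max Speed': [380, 370, 24, 26, np.nan, np.nan],
    'Weight': [np.nan, 2, 1.5, np.nan, 80, 2.2],
})

# Count the non-null values per column, then divide by the number of rows.
complete = df.notna().sum()
summary = pd.DataFrame({
    'complete values': complete,
    'completeness ratio': complete / len(df),
    'total': len(df),
})
print(summary)
```

On the small dataset above, this gives 4 complete values out of 6 for both 'Max Speed' and 'Weight'.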

Specifying a list of categorical columns returns the completeness per category:

# check the completeness of the dataset per category
print(df.bbt.completeness(by=['Animal']))
        Max Speed                                   Weight
Animal  complete values  completeness ratio  total  complete values  completeness ratio  total
Falcon                2  0.6666666666666666      3                2  0.6666666666666666      3
Lama                  0                 0.0      1                1                 1.0      1
Parrot                2                 1.0      2                1                 0.5      2
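
The per-category numbers can be cross-checked with a plain pandas groupby. This is a sketch under the assumption that completeness per category is the non-null count divided by the group size; it does not reproduce the library's internals or its exact column layout.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'Animal': ['Falcon', 'Falcon', 'Parrot', 'Parrot', 'Lama', 'Falcon'],
    'Max Speed': [380, 370, 24, 26, np.nan, np.nan],
    'Weight': [np.nan, 2, 1.5, np.nan, 80, 2.2],
})

grouped = df.groupby('Animal')
complete = grouped.count()   # non-null values per group and column
total = grouped.size()       # number of rows per group
ratio = complete.div(total, axis=0)

print(complete)
print(ratio)
```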

Missing values correlation matrix

missing_corr_matrix() returns a matrix that helps pinpoint relationships between the missing values of different columns. It calculates the conditional probability of a record's value being NaN in a specific column, given that the same record is missing a value in another column.

For a dataset with two columns 'A' and 'B', the conditional probability of a record having a missing value in column 'A', given that its value in column 'B' is missing, is:

$$P(A \text{ is NULL } | B \text{ is NULL}) = \frac{P(A \text{ is NULL } \cap B \text{ is NULL})}{P(B \text{ is NULL})}$$

Note: The matrix alone will not tell the whole story. Additional metrics, such as the dataset's completeness, can help determine whether a relationship exists.

# Generate a bigger dataset
# Set a seed for reproducibility
np.random.seed(0)

# Define the number of records
n_records = 50

# Define the categories for the 'animal' column
animals = ['cat', 'dog', 'lama']

# Generate random data
df = pd.DataFrame({
    'animal': np.random.choice(animals, n_records),
    'color': np.random.choice(['black', 'white', 'brown', 'gray'], n_records),
    'weight': np.random.randint(1, 100, n_records),
    'tail length': np.random.randint(1, 50, n_records),
    'height': np.random.randint(10, 500, n_records)
})

# Insert NULL values in the 'animal', 'color', 'weight', 'tail length' and 'height' columns
for col, n_nulls in zip(df.columns, [2, 15, 20, 48, 17]):
    null_indices = np.random.choice(df.index, n_nulls, replace=False)
    df.loc[null_indices, col] = np.nan

# missing values correlations
print(df.bbt.missing_corr_matrix())
               animal     color    weight  tail length    height
animal            NaN       0.5       0.5            1         0
color        0.066667       NaN  0.333333            1       0.4
weight           0.05      0.25       NaN         0.95      0.25
tail length  0.041667    0.3125  0.395833          NaN  0.354167
height              0  0.352941  0.294118            1       NaN
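
To make the formula concrete, here is a small sketch that computes the same kind of conditional probability with plain pandas. The helper name missing_conditional_matrix is hypothetical and this is not the library's code. It assumes that each row is the "given" column and each column is the one whose missingness is measured, which appears to match the output above (for instance, 0.066667 in the color row equals 1/15, and color has 15 NULLs).

```python
import numpy as np
import pandas as pd

def missing_conditional_matrix(frame):
    """Return P(column is NULL | row is NULL) for every pair of columns."""
    is_null = frame.isna().astype(int)
    # joint.loc[b, a] = number of records where both b and a are NULL
    joint = is_null.T.dot(is_null)
    # Divide each row by the number of NULLs in the conditioning column
    matrix = joint.div(is_null.sum(), axis=0)
    # The diagonal carries no information, so blank it out
    return matrix.mask(np.eye(len(matrix), dtype=bool))

df = pd.DataFrame({
    'A': [np.nan, np.nan, np.nan, 1.0],
    'B': [np.nan, 2.0, 2.0, 2.0],
})
matrix = missing_conditional_matrix(df)
print(matrix)
```

In this toy dataset, every record missing 'B' is also missing 'A', so the entry in row 'B', column 'A' is 1.0, while only one of the three records missing 'A' is also missing 'B', giving 1/3 in row 'A', column 'B'.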

Outlier summary

outlier_summary() returns a summary of the outliers found in the dataset based on a specific method (e.g. IQR). It returns the number of outliers below and above the boundaries calculated by that method.

import seaborn as sns

penguins = sns.load_dataset("penguins")
# identify outliers using the Inter Quartile Range approach
print(penguins.bbt.outlier_summary('iqr', factor=1))
                   n_outliers_upper  n_outliers_lower  n_non_outliers  n_total_outliers  total_records
bill_depth_mm                     0                 0             342                 0            342
bill_length_mm                    2                 0             340                 2            342
body_mass_g                       4                 0             338                 4            342
flipper_length_mm                 0                 0             342                 0            342
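
As background on the IQR method, the sketch below uses the common Tukey-fence rule: values below Q1 - factor * IQR or above Q3 + factor * IQR count as outliers. Whether BambooTools computes its boundaries exactly this way is an assumption, and iqr_bounds is a hypothetical helper, not part of the library.

```python
import pandas as pd

def iqr_bounds(series, factor=1.5):
    """Lower and upper fences based on the interquartile range."""
    q1 = series.quantile(0.25)
    q3 = series.quantile(0.75)
    iqr = q3 - q1
    return q1 - factor * iqr, q3 + factor * iqr

values = pd.Series([10, 12, 12, 13, 12, 11, 14, 13, 15, 102])
lower, upper = iqr_bounds(values)

# Count the values falling outside each fence
n_outliers_lower = int((values < lower).sum())
n_outliers_upper = int((values > upper).sum())
print(lower, upper, n_outliers_lower, n_outliers_upper)
```

A smaller factor narrows the fences and flags more values, which is why the examples here pass factor=1 explicitly.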

You can also get the summary per group:

# outliers per category
print(penguins.bbt.outlier_summary(method='iqr', by=['sex', 'species'], factor=1))
                                            n_non_outliers  n_outliers_lower  n_outliers_upper  n_total_outliers  total_records
('Female', 'Adelie')     bill_depth_mm                  71                 1                 1                 2             73
('Female', 'Adelie')     bill_length_mm                 71                 1                 1                 2             73
('Female', 'Adelie')     body_mass_g                    73                 0                 0                 0             73
('Female', 'Adelie')     flipper_length_mm              65                 5                 3                 8             73
('Female', 'Chinstrap')  bill_depth_mm                  33                 0                 1                 1             34
('Female', 'Chinstrap')  bill_length_mm                 23                 5                 6                11             34
('Female', 'Chinstrap')  body_mass_g                    31                 2                 1                 3             34
('Female', 'Chinstrap')  flipper_length_mm              33                 1                 0                 1             34
('Female', 'Gentoo')     bill_depth_mm                  57                 0                 1                 1             58
('Female', 'Gentoo')     bill_length_mm                 57                 0                 1                 1             58
('Female', 'Gentoo')     body_mass_g                    57                 1                 0                 1             58
('Female', 'Gentoo')     flipper_length_mm              56                 1                 1                 2             58
('Male', 'Adelie')       bill_depth_mm                  64                 3                 6                 9             73
('Male', 'Adelie')       bill_length_mm                 65                 3                 5                 8             73
('Male', 'Adelie')       body_mass_g                    73                 0                 0                 0             73
('Male', 'Adelie')       flipper_length_mm              67                 4                 2                 6             73
('Male', 'Chinstrap')    bill_depth_mm                  33                 1                 0                 1             34
('Male', 'Chinstrap')    bill_length_mm                 32                 0                 2                 2             34
('Male', 'Chinstrap')    body_mass_g                    29                 2                 3                 5             34
('Male', 'Chinstrap')    flipper_length_mm              32                 1                 1                 2             34
('Male', 'Gentoo')       bill_depth_mm                  56                 2                 3                 5             61
('Male', 'Gentoo')       bill_length_mm                 51                 5                 5                10             61
('Male', 'Gentoo')       body_mass_g                    59                 1                 1                 2             61
('Male', 'Gentoo')       flipper_length_mm              59                 2                 0                 2             61

Outlier boundaries

outlier_bounds() returns the boundary values below or above which a value is considered an outlier:

print(penguins.bbt.outlier_bounds(method='iqr', by=['sex', 'species'], factor=1))
                       bill_length_mm     bill_depth_mm flipper_length_mm       body_mass_g
                       lower    upper    lower    upper    lower    upper    lower    upper
sex     species
Female  Adelie            33     41.7     15.7     19.6      179      197     2800     3925
Female  Chinstrap     43.475   49.325    15.95     19.1   178.75   204.25  3031.25     4025
Female  Gentoo        40.825     49.9       13     15.4      205      220     4050   5287.5
Male    Adelie          36.5       44     17.4     20.7      181      205     3300     4800
Male    Chinstrap     48.125     53.9     17.8     20.8      189      210   3362.5  4468.75
Male    Gentoo          45.7     52.9     14.3       17      211      232     4900     6100
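
Per-group boundaries of this kind can be sketched with a groupby over quantiles. As before, this assumes the usual Q1 - factor * IQR and Q3 + factor * IQR fences, and group_iqr_bounds is a hypothetical helper for illustration rather than the library's function.

```python
import pandas as pd

def group_iqr_bounds(frame, by, column, factor=1.5):
    """IQR-based lower and upper fences of one column, per group."""
    grouped = frame.groupby(by)[column]
    q1 = grouped.quantile(0.25)
    q3 = grouped.quantile(0.75)
    iqr = q3 - q1
    return pd.DataFrame({
        'lower': q1 - factor * iqr,
        'upper': q3 + factor * iqr,
    })

df = pd.DataFrame({
    'group': ['a', 'a', 'a', 'a', 'b', 'b', 'b', 'b'],
    'value': [1.0, 2.0, 3.0, 4.0, 10.0, 20.0, 30.0, 40.0],
})
bounds = group_iqr_bounds(df, by='group', column='value', factor=1)
print(bounds)
```

Computing the fences per group matters when groups differ in scale: in the toy data, a value of 10 is far outside group 'a' but well inside group 'b'.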

Duplication summary

duplication_summary() returns metrics regarding the duplicate records of the given dataset. It states the number of total records, unique records, unique records without duplications, unique records with duplications, and total duplicated records:

print(penguins.bbt.duplication_summary(subset=['sex',
                                               'species',
                                               'island']))
                                     counts
total records                           344
unique records                           13
unique records without duplications       1
unique records with duplications         12
total duplicated records                343
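
The relationship between these five numbers can be illustrated with plain pandas. The sketch below is an assumption about what each metric means, inferred from the output above (13 = 1 + 12 and 344 = 343 + 1), not the library's implementation. It keeps rows with NaN in the subset columns by passing dropna=False.

```python
import pandas as pd

df = pd.DataFrame({
    'sex': ['F', 'F', 'M', 'M', 'M', 'F'],
    'species': ['Adelie', 'Adelie', 'Adelie', 'Adelie', 'Adelie', 'Gentoo'],
})
subset = ['sex', 'species']

# Number of rows sharing each distinct combination of the subset columns
sizes = df.groupby(subset, dropna=False).size()

summary = pd.Series({
    'total records': len(df),
    'unique records': len(sizes),
    'unique records without duplications': int((sizes == 1).sum()),
    'unique records with duplications': int((sizes > 1).sum()),
    'total duplicated records': int(sizes[sizes > 1].sum()),
}, name='counts')
print(summary)
```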

Duplication frequency table

duplication_frequency_table() generates a table stating the frequency of records with duplications. It categorizes the duplicated records according to their number of duplications and reports the frequency of each category.

In the example below, we notice that there are 2 cases of 5 identical records.

print(penguins.bbt.duplication_frequency_table(subset=['sex',
                                                       'species',
                                                       'island']))
n identical bins  frequency  sum of duplications  percentage to total duplications
2                         0                    0                                 0
3                         0                    0                                 0
4                         0                    0                                 0
5                         2                   10                       0.029154519
[6, 10)                   0                    0                                 0
[10, 15)                  0                    0                                 0
[15, 50)                  8                  214                       0.623906706
50>                       2                  119                       0.346938776
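
A simplified version of such a table can be built by grouping the duplicate-group sizes by their own value. This sketch is only an approximation of the idea: it does not bin the sizes into ranges like the output above, and the column names are copied from that output for readability.

```python
import pandas as pd

df = pd.DataFrame({'key': ['a', 'a', 'b', 'b', 'c', 'c', 'c', 'd']})

# Size of each group of identical records, keeping only real duplicates
sizes = df.groupby(['key'], dropna=False).size()
dup_sizes = sizes[sizes > 1]

# For every group size: how many groups have it, and how many rows they hold
table = dup_sizes.groupby(dup_sizes).agg(['count', 'sum'])
table.columns = ['frequency', 'sum of duplications']
table.index.name = 'n identical'
table['percentage to total duplications'] = (
    table['sum of duplications'] / dup_sizes.sum()
)
print(table)
```

Here two keys appear twice and one key appears three times, so the row for 2 identical records has a frequency of 2 and accounts for 4 of the 7 duplicated rows.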

Contributing

Contributions are more than welcome! You can contribute in several ways:

  • Bug reports and bug fixes
  • Recommendations for new features and implementation of those
  • Writing new tests or improving existing ones, to ensure quality

Before contributing, opening an issue is recommended.

It is also recommended to install the package in "development mode" while working on it. When installed as editable, a project can be edited in-place without reinstallation.

To install the Python package in "editable"/"development" mode, change directory to the root of the project directory and run:

pip install -e .
pip install -r requirements-dev.txt # this will install the development dependencies (e.g. pytest)

Or, to install the package and the development dependencies with a one-liner, run:

pip install -e ".[dev]"

To ensure that the development workflow is followed, please also set up the pre-commit hooks:

pre-commit install

General Guidelines

  1. Fork the repository on GitHub.
  2. Clone the forked repository to your local machine.
  3. Make a new branch from the develop branch for your feature or bug fix.
  4. Implement your changes.
    • It is recommended to write tests and examples for them in tests\test_bambootols.py and bin\examples.py respectively.
  5. Create a Pull Request. Link it to the issue you have opened.

Credits

Special thanks to danikavu for the code reviews.
