
A package designed to simplify data preprocessing for use with Pandas

Project description


pandashape: a simpleish Python package for easy data cleanup and preparation of Pandas dataframes

I made pandashape because I found myself doing a lot of the same repetitive cleanup for simple modeling with scikit-learn. I've intentionally designed it to make data preparation expressive, concise, and easily repeatable - just wrap your frame in a PandaShaper and declare your transformations in a single, ordered pipeline.

Getting started

Just install with pip!

pip install pandashape

Using pandashape

Create your dataframe however you choose - from a CSV, a .txt file, random generation, whatever. Then wrap your frame in a PandaShaper.

# import packages
import numpy as np
import pandas as pd
from pandashape import PandaShaper, Columns
from pandashape.transformers import CategoricalEncoder, NullColumnsDropper

# create your frame
my_df = pd.read_csv('./my_data.csv')

# wrap it in a shaper
shaper = PandaShaper(my_df)

From here, you can use PandaShape to inspect and transform your data.

Data inspection

PandaShape provides a .describe() method similar to the one pandas offers, but with richer output and support for extension.

shaper.describe()
#########################################
###         PANDASHAPE REPORT         ###
#########################################

### General frame info ###
-----------------------------------------
Shape: (1000, 12)
Columns with one or more null values: ['History']
Columns of type "object" (may need label encoding): ['Age' 'Gender' 'OwnHome' 'Married' 'Location' 'History']

### Data types ###
-----------------------------------------
Columns by data type
- Numeric: 6
- Objects/strings: 6

### Distribution ###
-----------------------------------------
These columns have significant outlier values (more than +/- 2 standard deviations from the mean).
- Salary (34)
- AmountSpent (42)
- AmountSpent_HighCorrelation (42)
- Salary_HighCorr (34)

These columns are skewed beyond the threshold of 1 +/- 0.4. You may want to scale them somehow.
 - Salary (0.41909498781999727)
 - Catalogs (0.0920540150758884)
 - AmountSpent (1.4692769120373967)
 - AmountSpent_HighCorrelation (1.4692769120373967)
 - Salary_HighCorr (0.41909498781999727)

### Correlated columns ###
-----------------------------------------
The following columns are highly correlated (r² > 0.8): ['AmountSpent_HighCorrelation', 'Salary_HighCorr']

If you have questions that you often ask about your datasets, you can encapsulate them in classes that inherit PandaShape's Describer for reuse. See the wiki for documentation.
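For example, a reusable describer might look roughly like the sketch below. This is a hedged illustration: the import path for the Describer base class, the method to override, and the way the instance is passed to .describe() are assumptions here, so check the wiki for the actual interface.

from pandashape import PandaShaper
from pandashape.describers import Describer  # import path assumed; see the wiki

# hypothetical reusable question: "which columns are dominated by a single value?"
class DominantValueDescriber(Describer):
    def describe(self, df):  # method name and signature assumed for illustration
        for column in df.columns:
            top_share = df[column].value_counts(normalize=True).iloc[0]
            if top_share > 0.9:
                print(f'- {column}: {top_share:.0%} of rows share one value')

# invocation shape assumed; the point is that the question is now reusable
shaper = PandaShaper(my_df)
shaper.describe(describers=[DominantValueDescriber()])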

Data transformation

PandaShape's data preparation and cleanup features are where it really shines. It provides an expressive syntax that you can use to describe, order, and even dynamically modify transformations to your data:

# import packages
import numpy as np
import pandas as pd
from pandashape import PandaShaper, Columns
from pandashape.transformers import (
    CategoricalEncoder,
    MassScaler,
    NullColumnsDropper
)

# create your frame
my_df = pd.read_csv('./my_data.csv')

# wrap it in a shaper
shaper = PandaShaper(my_df)

# create a pipeline of transform operations (these will happen in order)
# and assign the output to a new (transformed) frame!
transformed_df = shaper.transform(
    {
        # drop columns that are at least 80% null
        'columns': Columns.All,
        'transformers': [
            NullColumnsDropper(null_values=[np.nan, None, ''], threshold=0.8),
            MassScaler()
        ]
    },
    {
        # CategoricalEncoder one-hot-encodes the targeted categorical columns if they
        # have at least `label_encoding_breakpoint` distinct values, and label-encodes them otherwise
        'columns': ['Education', 'SES'], 
        'transformers': CategoricalEncoder(label_encoding_breakpoint=4)
    }
)

# inspect the new frame to see the fruits of your labors!
transformed_df.head()
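
Because the result of .transform() is an ordinary pandas DataFrame, it drops straight into a scikit-learn workflow. Here's a minimal sketch, assuming the transformed frame is fully numeric and still contains the AmountSpent column from the sample report above to use as a prediction target:

from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression

# 'AmountSpent' as the target is an assumption based on the sample report above
X = transformed_df.drop(columns=['AmountSpent'])
y = transformed_df['AmountSpent']

# hold out 20% of rows for evaluation
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

model = LinearRegression().fit(X_train, y_train)
print(f'R^2 on held-out data: {model.score(X_test, y_test):.3f}')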

Upcoming improvements

  • Allow the user to constrain describers to specific columns (by name or Columns enum value)
  • A describer that summarizes discrete column values for columns that appear to be categorical
  • Allow the user to pass types to the 'transformers' property when shaping

Features being evaluated

  • Improvements to .describe so it returns all frames generated during transformation for inspection

Acknowledgments

Special thanks to the other members of the Sustainable Social Computing Lab at the University of Pittsburgh for their support, ideas, and contributions.


Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

pandashape-0.0.7.tar.gz (8.4 kB)

Uploaded Source

Built Distribution

pandashape-0.0.7-py3-none-any.whl (24.3 kB)

Uploaded Python 3

File details

Details for the file pandashape-0.0.7.tar.gz.

File metadata

  • Download URL: pandashape-0.0.7.tar.gz
  • Upload date:
  • Size: 8.4 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/3.2.0 pkginfo/1.5.0.1 requests/2.24.0 setuptools/49.6.0.post20200917 requests-toolbelt/0.9.1 tqdm/4.50.0 CPython/3.8.5

File hashes

Hashes for pandashape-0.0.7.tar.gz

  • SHA256: eb1f3d4a0e2c815a31189585d35580a798be313532b84bc98a4b8e59105c62a9
  • MD5: 3d20fe8267db1878a271b4bf673cab9c
  • BLAKE2b-256: a79efce8a4365976747c944fa0c51038ec9c213d995b2a2223ce6e276f3cf1ec

See more details on using hashes here.
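
If you want to check a downloaded archive against the digests above, one way is to recompute the SHA256 locally with Python's standard library. A small sketch, assuming the source distribution listed above sits in the current directory:

import hashlib

expected = 'eb1f3d4a0e2c815a31189585d35580a798be313532b84bc98a4b8e59105c62a9'

# hash the downloaded archive and compare it to the published digest
with open('pandashape-0.0.7.tar.gz', 'rb') as f:
    digest = hashlib.sha256(f.read()).hexdigest()

print('OK' if digest == expected else 'MISMATCH')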

File details

Details for the file pandashape-0.0.7-py3-none-any.whl.

File metadata

  • Download URL: pandashape-0.0.7-py3-none-any.whl
  • Upload date:
  • Size: 24.3 kB
  • Tags: Python 3
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/3.2.0 pkginfo/1.5.0.1 requests/2.24.0 setuptools/49.6.0.post20200917 requests-toolbelt/0.9.1 tqdm/4.50.0 CPython/3.8.5

File hashes

Hashes for pandashape-0.0.7-py3-none-any.whl

  • SHA256: 5995f323fd19b650f18e7b31fb44a6c3f31897d8735b0ee9614201142a0fb418
  • MD5: 960fc5853ebf7e5603b341084536780a
  • BLAKE2b-256: e63a7c2d53eb5dbc034fa20046a05f438f5d22450d6582ba9481611c689b2c46

See more details on using hashes here.
