The utilities pack for data science and analytics tasks

Project description

<< Data Science Utilities >>

A collection of wrapper functions that simplify common operations in data analytics tasks. The core module, pd_utils (Pandas Utility), is designed to work with Pandas DataFrames and simplifies common tasks such as generating metadata for a DataFrame and validating a merged DataFrame.

The package can be used in the IPython console, Jupyter Notebook, and scripts.

Installation

pip install dsx

1. Core Module: "pd_utils"

The core module is "pd_utils". It contains functions that accomplish common data analytics tasks with less code. These functions are wrappers for commonly used Pandas methods, particularly methods of the DataFrame object.

Some of the key features of the DataFrame utility functions are as follows:

  • Generate metadata of columns in a DataFrame
    • Number & percentage of missing values
    • Number & percentage of unique values
    • Data Type
  • Generate accumulated percentage of values in a column
  • Quick Rename of a single column
  • Reorder columns of a DataFrame
  • Standardize column names into IPython-friendly names
  • Retrieve column name(s) by a partial keyword
  • Expand a concatenated string in a column into a child table
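
To make the metadata and accumulated-percentage features above concrete, here is a minimal sketch written in plain Pandas. It is not the dsx implementation; the function names column_metadata and accumulated_percentage, the output column names, and the sample data are assumptions made for illustration only.

```python
import pandas as pd


def column_metadata(df: pd.DataFrame) -> pd.DataFrame:
    """Hypothetical helper: one row per column with dtype, missing and unique counts."""
    meta = pd.DataFrame({
        "dtype": df.dtypes.astype(str),
        "n_missing": df.isna().sum(),
        "n_unique": df.nunique(dropna=True),
    })
    # Avoid dividing by zero when the frame has no rows.
    denom = len(df) if len(df) else 1
    meta["pct_missing"] = meta["n_missing"] / denom * 100
    meta["pct_unique"] = meta["n_unique"] / denom * 100
    return meta


def accumulated_percentage(series: pd.Series) -> pd.DataFrame:
    """Hypothetical helper: value counts with running (accumulated) percentage."""
    counts = series.value_counts()  # most frequent value first, NaN excluded
    pct = counts / counts.sum() * 100
    return pd.DataFrame({"count": counts, "pct": pct, "cum_pct": pct.cumsum()})


# Illustrative data only.
df = pd.DataFrame({"city": ["NY", "NY", "LA", None], "sales": [10, 20, 20, 40]})
meta = column_metadata(df)
acc = accumulated_percentage(df["city"])
```

Here meta is indexed by column name and acc is indexed by the distinct values of the chosen column, ordered from most to least frequent.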

1.1 Usage

Below is example code for importing the module:

from dsx.pd_utils import *

There are two ways of calling the functions. Using the "pd_Missing_Rows" function as an example:

  1. Through the extended domain ('ds') of the native DataFrame object (Recommended)
import os
import pandas as pd
df = pd.read_excel(os.path.join(os.getcwd(), "data.xlsx"))
df.ds.missing_rows("Column_Name")
  2. As a static function of the pd_utils class
df = pd.read_excel(os.path.join(os.getcwd(), "data.xlsx"))
dsx.missing_rows(df, "Column_Name")
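
The two calling styles above can be illustrated with Pandas' own accessor registration API. The sketch below is not the dsx source: the accessor name "demo", the class DemoAccessor, and the body of missing_rows (returning the rows where the given column is null) are assumptions chosen so the example runs without dsx installed.

```python
import pandas as pd


def missing_rows(df: pd.DataFrame, column: str) -> pd.DataFrame:
    """Hypothetical stand-in: return the rows where `column` is missing."""
    return df[df[column].isna()]


# "demo" is an illustrative accessor name, chosen to avoid clashing with 'ds'.
@pd.api.extensions.register_dataframe_accessor("demo")
class DemoAccessor:
    def __init__(self, pandas_obj: pd.DataFrame) -> None:
        self._df = pandas_obj

    def missing_rows(self, column: str) -> pd.DataFrame:
        # Delegate to the plain function so both call styles share one implementation.
        return missing_rows(self._df, column)


df = pd.DataFrame({"Column_Name": [1.0, None, 3.0], "other": ["a", "b", "c"]})

via_accessor = df.demo.missing_rows("Column_Name")  # accessor style
via_function = missing_rows(df, "Column_Name")      # plain-function style
```

Having the accessor method delegate to the plain function keeps the two entry points consistent, which is one plausible reason a package would offer both.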

2. Data Science Workflow "ds_workflow" (Active Development / Work-In-Progress)

The "ds_workflow" module contains the methods for simplifying common tasks in a data science workflow. The methods are built on top of the functions in the core module "pd_utils".

Some of the key features of the module are as follows:

  • Get the column name of the features that are categorical
  • Get the column name of the features that are numerical
  • Create or merge the dummy variables created from categorical features, with an option to use k-1 dummification
  • Data Exploration
    • Generate barplot and accumulated percentage report for all the categorical features
    • Generate distribution plot for all the numerical features
    • Generate heatmap of the correlation matrix
  • Preprocessing
    • Create a dataframe with all standardized features merged with other features
    • Generate features list
  • Model Assessment
    • Generate Recall-Precision-Threshold Curve
    • Generate truepositive_falsepositive Curve
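
As a rough illustration of the feature-typing, k-1 dummification, and standardization items above, the following sketch uses only standard Pandas calls. It does not reproduce the ds_workflow methods; the variable names, the sample data, and the choice of population standard deviation (ddof=0) are assumptions.

```python
import pandas as pd

# Illustrative data only.
df = pd.DataFrame({
    "color": ["red", "blue", "green", "red"],
    "size": [1.0, 2.0, 3.0, 4.0],
})

# Split the columns by type: numeric columns versus everything else.
numerical_cols = [c for c in df.columns if pd.api.types.is_numeric_dtype(df[c])]
categorical_cols = [c for c in df.columns if c not in numerical_cols]

# k-1 dummification: drop_first=True drops one level per categorical feature,
# so a feature with k levels yields k-1 dummy columns.
dummies = pd.get_dummies(df, columns=categorical_cols, drop_first=True)

# Standardize the numerical features (zero mean, unit variance) and keep
# the remaining columns alongside them.
scaled = dummies.copy()
means = df[numerical_cols].mean()
stds = df[numerical_cols].std(ddof=0)
scaled[numerical_cols] = (df[numerical_cols] - means) / stds
```

Dropping one dummy level avoids perfectly collinear columns, which matters for models such as linear regression that are sensitive to it.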

2.1 Usage

The methods in the module are callable only through the extended domain 'ml' of the native Pandas DataFrame object. These methods are registered using the official accessor mechanism described at https://pandas.pydata.org/pandas-docs/stable/development/extending.html.

Calling a method in "ds_workflow":

df = pd.read_excel(os.path.join(os.getcwd(), "data.xlsx"))

cols_categorical = df.ml.get_features_categorical()

Files for dsx, version 0.9.1.3
dsx-0.9.1.3.tar.gz (12.4 kB), file type: Source, Python version: None
