
Provides features to store and execute your operations for standardizing pandas dataframes.


Standardize DataFrame

standardize_df provides features to store and execute your operations for standardizing pandas dataframes.

Installation

pip install standardize-df

Overview

You can think of standardizing a dataframe as conforming its values to some standard so that they can be compared. standardize_df contains several modules to help standardize a dataframe:

  • adjust_df: Functions for altering a dataframe based off an altered subset of the dataframe.
  • df_operations: General functions to help with standardizing a dataframe.
  • pipeline: Pipeline classes for chaining the output of one callable to the next.
  • standards: Provides the Standards class to store and execute your operations for standardizing a dataframe's column(s).
  • utils: Utility functions for evaluating empty values and flattening iterables.

Here's a tour of some of the main features with examples from these modules:

Adjusting DataFrames

When standardizing a dataframe, you'll often create, drop, or override columns and rows in a subset of it. The adjust_df module offers the adjust_df function, which reflects those changes from the subset back onto the dataframe.

Adjusting rows:

import pandas as pd
from standardize_df.adjust_df import adjust_df
df = pd.DataFrame({'letters': ['a', 'b', 'c'], 'symbols': ['!', '@', '#']})
altered_subset = pd.Series(['!', '@'], name='symbols')
adjust_df(df, altered_subset, fields=['symbols'])
  letters symbols
0       a       !
1       b       @

Adjusting columns:

altered_subset = pd.DataFrame({'symbols': ['!', '@', '#'], 'numbers': [1, 2, 3]})
adjust_df(df, altered_subset, fields=['symbols'])
  letters symbols  numbers
0       a       !        1
1       b       @        2
2       c       #        3
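
The behavior shown in the two examples above can be approximated in plain pandas. What follows is a minimal sketch, not the library's implementation: the name adjust_sketch is hypothetical, and the rules it applies (keep only the rows present in the subset, drop any listed field missing from the subset, copy over every subset column) are assumptions inferred from the outputs shown.

```python
import pandas as pd


def adjust_sketch(df, subset, fields):
    """Hypothetical stand-in for adjust_df, inferred from the examples above."""
    if isinstance(subset, pd.Series):
        subset = subset.to_frame()
    # Keep only the rows that survive in the subset.
    adjusted = df.loc[df.index.intersection(subset.index)].copy()
    # Drop any listed field that the subset no longer contains.
    missing = [name for name in fields if name not in subset.columns]
    adjusted = adjusted.drop(columns=missing)
    # Override existing columns and append new ones, aligned on the index.
    for name in subset.columns:
        adjusted[name] = subset[name]
    return adjusted


df = pd.DataFrame({'letters': ['a', 'b', 'c'], 'symbols': ['!', '@', '#']})
fewer_rows = adjust_sketch(df, pd.Series(['!', '@'], name='symbols'), ['symbols'])
extra_column = adjust_sketch(
    df,
    pd.DataFrame({'symbols': ['!', '@', '#'], 'numbers': [1, 2, 3]}),
    ['symbols'],
)
```

The sketch copies the dataframe before changing it, so the input is left untouched. Whether adjust_df itself works on a copy or in place is not stated here.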

Pipeline

You can store operations (callables) in a Pipeline or PipelineMapping to create a data pipeline. PipelineMapping also offers the reorder method to easily reorder, replace, or drop operations.

Creating a data pipeline with the Pipeline class:

from standardize_df.pipeline import Pipeline

def add_one(num): return num + 1
def add_two(num): return num + 2

pipe = Pipeline([add_one, add_two])
pipe(7)
10
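
Chaining the output of one callable to the next can be illustrated with the standard library alone. This is a sketch of the idea under the assumption that operations run left to right. The helper name chain is hypothetical and is not part of standardize_df.

```python
from functools import reduce


def chain(funcs):
    """Return a callable that feeds each function's result into the next one."""
    def run(value):
        # reduce threads the running value through funcs in order.
        return reduce(lambda acc, func: func(acc), funcs, value)
    return run


def add_one(num): return num + 1
def add_two(num): return num + 2


chained = chain([add_one, add_two])
result = chained(7)  # add_one(7) == 8, then add_two(8) == 10
```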

Creating a data pipeline with the PipelineMapping class:

from standardize_df.pipeline import PipelineMapping

def add_one(num): return num + 1
def square_two(num): return num ** 2

pipe = PipelineMapping({'add_one': add_one, 'square_two': square_two})
pipe(2)
9

Reordering a PipelineMapping instance with an iterable:

reordered = pipe.reorder(['square_two', 'add_one'])
reordered
PipelineMapping([('square_two', <function square_two at 0x7f002b86e820>), ('add_one', <function add_one at 0x7f0044be15e0>)])
reordered(2)
5

Reordering a PipelineMapping instance with a mapping:

reordered = pipe.reorder({'square_two': None})  # None denotes 'leave func as is'
reordered
PipelineMapping([('square_two', <function square_two at 0x7f002b86e820>)])
reordered(2)
4
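
The reordering rules used in the two examples above can be modeled with an ordinary dict. This is a hedged sketch, not PipelineMapping's actual code: reorder_sketch and run are hypothetical names, and treating a non-None mapping value as a replacement callable is an assumption based on the statement that reorder can replace operations.

```python
from collections.abc import Mapping


def reorder_sketch(ops, order):
    """Build a new name-to-callable dict in the requested order.

    Names left out of `order` are dropped. In a mapping, None keeps the
    existing callable; any other value is assumed to replace it.
    """
    if isinstance(order, Mapping):
        return {name: ops[name] if func is None else func
                for name, func in order.items()}
    return {name: ops[name] for name in order}


def run(ops, value):
    """Apply each callable in insertion order, passing the result along."""
    for func in ops.values():
        value = func(value)
    return value


def add_one(num): return num + 1
def square_two(num): return num ** 2


ops = {'add_one': add_one, 'square_two': square_two}
swapped = run(reorder_sketch(ops, ['square_two', 'add_one']), 2)
dropped = run(reorder_sketch(ops, {'square_two': None}), 2)
replaced = run(reorder_sketch(ops, {'add_one': lambda num: num + 10}), 2)
```

The sketch relies on dicts preserving insertion order, which Python guarantees from version 3.7 onward.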

Standards

The standards module offers the Standards class for storing field name(s) and an operation to standardize those fields of a dataframe. The Standards.standardize_df method passes a subset containing the field name(s) to the operation, then adjusts the dataframe to match the altered subset that the operation returns.

A single field name (a string key) results in a series subset being passed into the operation. The operation can return a series or a dataframe, and the original dataframe is adjusted to match it.

import pandas as pd
from standardize_df.standards import Standards

def drop_first(col: pd.Series) -> pd.Series: 
    '''drops the first row from the original dataframe.'''
    return col.drop(index=0)

def add_one(col: pd.Series) -> pd.DataFrame:
    '''adds the plus_one column to the original dataframe.'''
    df = col.to_frame()
    df['plus_one'] = df['numbers'] + 1
    return df

df = pd.DataFrame({'letters': ['a', 'b', 'c'], 'numbers': [1, 2, 3]})
standards_mapping = {
    'letters': drop_first, 
    'numbers': add_one
}
standards = Standards(standards_mapping)
standards.standardize_df(df)
  letters  numbers  plus_one
1       b        2         3
2       c        3         4

Multiple field names (a tuple key) result in a dataframe subset being passed into the operation. The operation can return a series or a dataframe, and the original dataframe is adjusted to match it.

def increment_one(df: pd.DataFrame) -> pd.DataFrame:
    '''adds columns letters_plus and numbers_plus to the original dataframe.'''
    df['letters_plus'] = df['letters'].apply(lambda x: chr(ord(x) + 1))
    df['numbers_plus'] = df['numbers'] + 1
    return df

def drop_numbers(df: pd.DataFrame) -> pd.Series:
    '''drops the numbers column from the original dataframe.'''
    return df['numbers_plus']


standards_mapping = {
    ('letters', 'numbers'): increment_one, 
    ('numbers', 'numbers_plus'): drop_numbers  # numbers_plus column added in increment_one func
}
standards = Standards(standards_mapping)
standards.standardize_df(df)
  letters letters_plus  numbers_plus
0       a            b             2
1       b            c             3
2       c            d             4
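
The flow described in this section (take a subset, run the operation, adjust the dataframe to the result) can be sketched in plain pandas. This is an illustration under assumptions, not the Standards implementation: apply_standards_sketch is a hypothetical name, and the adjustment rules are inferred from the outputs shown above.

```python
import pandas as pd


def apply_standards_sketch(df, mapping):
    """Hypothetical model of standardize_df: run each operation on its subset."""
    result = df.copy()
    for fields, operation in mapping.items():
        single = isinstance(fields, str)
        names = [fields] if single else list(fields)
        # A string key yields a Series subset; a tuple key yields a DataFrame.
        subset = result[fields].copy() if single else result[names].copy()
        altered = operation(subset)
        if isinstance(altered, pd.Series):
            altered = altered.to_frame()
        # Keep surviving rows, drop fields the operation removed,
        # then override or append the returned columns.
        result = result.loc[result.index.intersection(altered.index)].copy()
        result = result.drop(columns=[n for n in names if n not in altered.columns])
        for name in altered.columns:
            result[name] = altered[name]
    return result


def drop_first(col):
    return col.drop(index=0)


def add_plus_one(col):
    frame = col.to_frame()
    frame['plus_one'] = frame['numbers'] + 1
    return frame


def keep_numbers_plus(frame):
    frame['numbers_plus'] = frame['numbers'] + 1
    return frame['numbers_plus']


df = pd.DataFrame({'letters': ['a', 'b', 'c'], 'numbers': [1, 2, 3]})
single_keys = apply_standards_sketch(
    df, {'letters': drop_first, 'numbers': add_plus_one})
tuple_key = apply_standards_sketch(
    df, {('letters', 'numbers'): keep_numbers_plus})
```

Because each operation sees the result of the previous one, later keys can refer to columns that earlier operations added, as the drop_numbers example above does.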
