Skip to main content

A series of utility functions to help with tidy dataframe.

Project description

sichuanMojo

Built on the top of pandas, sichuanMojo is a grammar of data manipulation with data frame, providing a consistent a series of utility functions that help you solve the most common data manipulation challenges:

  • unite_col: unite cols into one column
  • separate_col: separate column by a pattern into individual column
  • groupby_col: provide a summary table of a selected column which grouped by desired column(s)
  • groupby_across: provide a summary table of a selected column with basic statistical infomation (min, max, sum, median) which grouped by desired column(s)
  • pivot_tb: provide a pivot (count/frequency) table of a selected column which grouped by desired column(s)
  • pivot_rate: provide a pivot (rate/percentage) table of a selected column which grouped by desired column(s)

Installation from github

Use the package manager pip to install foobar.

pip install sichuanMojo

Backgroud

The Sichuan province is pandas' hometown which explains the name of this library!

Usage

import sichuanMojo as sm
import pandas as pd

# test_dict as an example input
data = {'name':['Tom', 'nick', 'krish', 'jack', 'Mike','Jan'],
        'age':[20, 21, 19, 18, 21, 33],
       'group':['A','A', 'A', 'B', 'B', "B"],
       'major':['biology', 'english', 'biology', 'english', 'biology', 'biology'],
       'response':['good', 'good', 'ok', 'bad', 'bad','bad']}
 
# Create DataFrame
test_df= pd.DataFrame(data)
name age group major response
0 Tom 20 A biology good
1 nick 21 A english good
2 krish 19 A biology ok
3 jack 18 B english bad
4 Mike 21 B biology bad

unite cols into one column

Note: if arg output_json = True, output will be json format. Otherwise, output will be a data frame.

sm.unite_col(test_df, unite_by = ["group", "major", "response"], pattern="; ", united_col_name="New_col", output_json=False)
name age group major response New_col
0 Tom 20 A biology good A; biology; good
1 nick 21 A english good A; english; good
2 krish 19 A biology ok A; biology; ok
3 jack 18 B english bad B; english; bad
4 Mike 21 B biology bad B; biology; bad

separate column by a pattern into individual column

Note: if arg output_json = True, output will be json format. Otherwise, output will be a data frame.

# Test data
data2 = {'name':['Tom', 'nick', 'krish', 'jack', 'Mike'],
        'age':[20, 21, 19, 18, 21],
       'group':['A','A', 'A', 'B', 'B'],
       'major':['biology', 'english', 'biology', 'english', 'biology'],
       'date':['2022-01-13', '2022-06-23', '2022-03-12', '2022-08-23', '2022-09-13']}
 
# Create DataFrame
test_df2= pd.DataFrame(data2)
name age group major date
0 Tom 20 A biology 2022-01-13
1 nick 21 A english 2022-06-23
2 krish 19 A biology 2022-03-12
3 jack 18 B english 2022-08-23
4 Mike 21 B biology 2022-09-13
sm.separate_col(test_df2, sep_by="date", pattern="-", sep_to_names=["year", "month", "date"], output_json=False)
name age group major date year month
0 Tom 20 A biology 13 2022 01
1 nick 21 A english 23 2022 06
2 krish 19 A biology 12 2022 03
3 jack 18 B english 23 2022 08
4 Mike 21 B biology 13 2022 09

provide a summary table of a selected column which grouped by desired column(s)

Note: if arg output_json = True, output will be json format. Otherwise, output will be a data frame.

operator should be one of mean, sum, min, max, median, unique, count, nunique

sm.groupby_col(test_df, group_by = ["group", "major"], summmarize_at = "name", operator = "nunique", output_json = True)
#returns [{'group': 'A', 'major': 'biology', 'name': 2},
# {'group': 'A', 'major': 'english', 'name': 1},
# {'group': 'B', 'major': 'biology', 'name': 2},
# {'group': 'B', 'major': 'english', 'name': 1}]
sm.groupby_col(test_df, group_by = ["group", "major"], summmarize_at = "name", operator = "nunique")
group major name
0 A biology 2
1 A english 1
2 B biology 2
3 B english 1

provide a summary table of a selected column with basic statistical infomation (min, max, sum, median) which grouped by desired column(s)

Note: if arg output_json = True, output will be json format. Otherwise, output will be a data frame.

sm.groupby_across(test_df, ["group", "major"], summmarize_at="age", operation=None, output_json = False)
group major age_min age_max age_sum age_median
0 A biology 19 20 39 19.5
1 A english 21 21 21 21
2 B biology 21 33 54 27
3 B english 18 18 18 18

provide a pivot (count/frequency) table of a selected column which grouped by desired column(s)

Note: if arg output_json = True, output will be json format. Otherwise, output will be a data frame.

sm.pivot_tb(test_df, group_by=["group", "major"], summmarize_at="response", operation="count", output_json = False, na_fill=0)
group major bad good ok
0 A biology 0 1 1
1 A english 0 1 0
2 B biology 2 0 0
3 B english 1 0 0

provide a pivot (rate/percentage) table of a selected column which grouped by desired column(s)

Note: if arg output_json = True, output will be json format. Otherwise, output will be a data frame.

sm.pivot_rate(test_df, group_by=["major", "group"], summmarize_at="response", output_json = False, na_fill=0)
major group response_count_bad response_count_good response_count_ok response_perc_bad response_perc_good response_perc_ok
0 biology A 0 1 1 0 50 50
1 biology B 2 0 0 100 0 0
2 english A 0 1 0 0 100 0
3 english B 1 0 0 100 0 0

Project details


Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

sichuanMojo-0.1.1.tar.gz (4.4 kB view hashes)

Uploaded Source

Built Distribution

sichuanMojo-0.1.1-py3-none-any.whl (5.4 kB view hashes)

Uploaded Python 3

Supported by

AWS AWS Cloud computing and Security Sponsor Datadog Datadog Monitoring Fastly Fastly CDN Google Google Download Analytics Microsoft Microsoft PSF Sponsor Pingdom Pingdom Monitoring Sentry Sentry Error logging StatusPage StatusPage Status page