A series of utility functions to help with tidy data frames.
sichuanMojo
Built on top of pandas, sichuanMojo is a grammar of data manipulation for data frames, providing a consistent set of utility functions that help you solve the most common data manipulation challenges:
- unite_col: unite columns into one column
- separate_col: separate a column by a pattern into individual columns
- groupby_col: provide a summary table of a selected column grouped by desired column(s)
- groupby_across: provide a summary table of a selected column with basic statistical information (min, max, sum, median) grouped by desired column(s)
- pivot_tb: provide a pivot (count/frequency) table of a selected column grouped by desired column(s)
- pivot_rate: provide a pivot (rate/percentage) table of a selected column grouped by desired column(s)
- simplify_network_df: simplify a network data frame from directed to undirected
Installation
Use the package manager pip to install sichuanMojo.
```bash
pip install sichuanMojo
```
Background
Sichuan province is the panda's hometown, which explains the name of this library!
Usage
```python
import sichuanMojo as sm
import pandas as pd

# Example input data
data = {'name': ['Tom', 'nick', 'krish', 'jack', 'Mike', 'Jan'],
        'age': [20, 21, 19, 18, 21, 33],
        'group': ['A', 'A', 'A', 'B', 'B', 'B'],
        'major': ['biology', 'english', 'biology', 'english', 'biology', 'biology'],
        'response': ['good', 'good', 'ok', 'bad', 'bad', 'bad']}

# Create DataFrame
test_df = pd.DataFrame(data)
```
|   | name  | age | group | major   | response |
|---|-------|-----|-------|---------|----------|
| 0 | Tom   | 20  | A     | biology | good     |
| 1 | nick  | 21  | A     | english | good     |
| 2 | krish | 19  | A     | biology | ok       |
| 3 | jack  | 18  | B     | english | bad      |
| 4 | Mike  | 21  | B     | biology | bad      |
| 5 | Jan   | 33  | B     | biology | bad      |
unite columns into one column
Note: if output_json = True, the output will be in JSON format; otherwise, the output will be a data frame.
```python
sm.unite_col(test_df, unite_by=["group", "major", "response"], pattern="; ", united_col_name="New_col", output_json=False)
```
|   | name  | age | group | major   | response | New_col          |
|---|-------|-----|-------|---------|----------|------------------|
| 0 | Tom   | 20  | A     | biology | good     | A; biology; good |
| 1 | nick  | 21  | A     | english | good     | A; english; good |
| 2 | krish | 19  | A     | biology | ok       | A; biology; ok   |
| 3 | jack  | 18  | B     | english | bad      | B; english; bad  |
| 4 | Mike  | 21  | B     | biology | bad      | B; biology; bad  |
| 5 | Jan   | 33  | B     | biology | bad      | B; biology; bad  |
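For orientation, roughly the same result can be reproduced with plain pandas. This is only an illustrative sketch, not sichuanMojo's actual implementation:

```python
# Rough plain-pandas equivalent of the unite_col call above (illustrative sketch only),
# reusing test_df from the Usage section
out = test_df.copy()
out["New_col"] = out[["group", "major", "response"]].astype(str).agg("; ".join, axis=1)

# Rough analogue of output_json=True
out_json = out.to_dict(orient="records")
```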
separate a column by a pattern into individual columns
Note: if output_json = True, the output will be in JSON format; otherwise, the output will be a data frame.
```python
# Test data
data2 = {'name': ['Tom', 'nick', 'krish', 'jack', 'Mike'],
         'age': [20, 21, 19, 18, 21],
         'group': ['A', 'A', 'A', 'B', 'B'],
         'major': ['biology', 'english', 'biology', 'english', 'biology'],
         'date': ['2022-01-13', '2022-06-23', '2022-03-12', '2022-08-23', '2022-09-13']}

# Create DataFrame
test_df2 = pd.DataFrame(data2)
```
|   | name  | age | group | major   | date       |
|---|-------|-----|-------|---------|------------|
| 0 | Tom   | 20  | A     | biology | 2022-01-13 |
| 1 | nick  | 21  | A     | english | 2022-06-23 |
| 2 | krish | 19  | A     | biology | 2022-03-12 |
| 3 | jack  | 18  | B     | english | 2022-08-23 |
| 4 | Mike  | 21  | B     | biology | 2022-09-13 |
```python
sm.separate_col(test_df2, sep_by="date", pattern="-", sep_to_names=["year", "month", "date"], output_json=False)
```
|   | name  | age | group | major   | date | year | month |
|---|-------|-----|-------|---------|------|------|-------|
| 0 | Tom   | 20  | A     | biology | 13   | 2022 | 01    |
| 1 | nick  | 21  | A     | english | 23   | 2022 | 06    |
| 2 | krish | 19  | A     | biology | 12   | 2022 | 03    |
| 3 | jack  | 18  | B     | english | 23   | 2022 | 08    |
| 4 | Mike  | 21  | B     | biology | 13   | 2022 | 09    |
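A rough plain-pandas equivalent of the call above (an illustrative sketch, not the library's implementation; the resulting column order may differ from sichuanMojo's output):

```python
# Rough plain-pandas equivalent of separate_col (illustrative sketch only)
parts = test_df2["date"].str.split("-", expand=True)   # split "2022-01-13" into three parts
parts.columns = ["year", "month", "date"]               # names taken from sep_to_names
out = test_df2.drop(columns="date").join(parts)         # replace the original date column
```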
provide a summary table of a selected column grouped by desired column(s)
Note: if output_json = True, the output will be in JSON format; otherwise, the output will be a data frame. The operator should be one of mean, sum, min, max, median, unique, count, nunique.
```python
sm.groupby_col(test_df, group_by=["group", "major"], summmarize_at="name", operator="nunique", output_json=True)
# returns [{'group': 'A', 'major': 'biology', 'name': 2},
#          {'group': 'A', 'major': 'english', 'name': 1},
#          {'group': 'B', 'major': 'biology', 'name': 2},
#          {'group': 'B', 'major': 'english', 'name': 1}]

sm.groupby_col(test_df, group_by=["group", "major"], summmarize_at="name", operator="nunique")
```
|   | group | major   | name |
|---|-------|---------|------|
| 0 | A     | biology | 2    |
| 1 | A     | english | 1    |
| 2 | B     | biology | 2    |
| 3 | B     | english | 1    |
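For orientation, a roughly equivalent plain-pandas call (illustrative sketch only, not sichuanMojo's implementation):

```python
# Rough plain-pandas equivalent of the groupby_col call above (illustrative sketch only)
out = (test_df.groupby(["group", "major"])["name"]
              .nunique()
              .reset_index())

# Rough analogue of output_json=True
out_json = out.to_dict(orient="records")
```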
provide a summary table of a selected column with basic statistical information (min, max, sum, median) grouped by desired column(s)
Note: if output_json = True, the output will be in JSON format; otherwise, the output will be a data frame.
```python
sm.groupby_across(test_df, ["group", "major"], summmarize_at="age", operation=None, output_json=False)
```
|   | group | major   | age_min | age_max | age_sum | age_median |
|---|-------|---------|---------|---------|---------|------------|
| 0 | A     | biology | 19      | 20      | 39      | 19.5       |
| 1 | A     | english | 21      | 21      | 21      | 21         |
| 2 | B     | biology | 21      | 33      | 54      | 27         |
| 3 | B     | english | 18      | 18      | 18      | 18         |
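A roughly equivalent plain-pandas version of the same summary (illustrative sketch only, not the library's implementation):

```python
# Rough plain-pandas equivalent of the groupby_across call above (illustrative sketch only)
out = (test_df.groupby(["group", "major"])["age"]
              .agg(["min", "max", "sum", "median"])
              .add_prefix("age_")
              .reset_index())
```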
provide a pivot (count/frequency) table of a selected column grouped by desired column(s)
Note: if output_json = True, the output will be in JSON format; otherwise, the output will be a data frame.
```python
sm.pivot_tb(test_df, group_by=["group", "major"], summmarize_at="response", operation="count", output_json=False, na_fill=0)
```
|   | group | major   | bad | good | ok |
|---|-------|---------|-----|------|----|
| 0 | A     | biology | 0   | 1    | 1  |
| 1 | A     | english | 0   | 1    | 0  |
| 2 | B     | biology | 2   | 0    | 0  |
| 3 | B     | english | 1   | 0    | 0  |
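For orientation, a roughly equivalent count table can be built with plain pandas (illustrative sketch only, not sichuanMojo's implementation):

```python
# Rough plain-pandas equivalent of the pivot_tb call above (illustrative sketch only);
# crosstab fills absent count combinations with 0, similar in spirit to na_fill=0
out = (pd.crosstab([test_df["group"], test_df["major"]], test_df["response"])
         .reset_index())
```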
provide a pivot (rate/percentage) table of a selected column grouped by desired column(s)
Note: if output_json = True, the output will be in JSON format; otherwise, the output will be a data frame.
```python
sm.pivot_rate(test_df, group_by=["major", "group"], summmarize_at="response", output_json=False, na_fill=0)
```
|   | major   | group | response_count_bad | response_count_good | response_count_ok | response_perc_bad | response_perc_good | response_perc_ok |
|---|---------|-------|--------------------|---------------------|-------------------|-------------------|--------------------|------------------|
| 0 | biology | A     | 0                  | 1                   | 1                 | 0                 | 50                 | 50               |
| 1 | biology | B     | 2                  | 0                   | 0                 | 100               | 0                  | 0                |
| 2 | english | A     | 0                  | 1                   | 0                 | 0                 | 100                | 0                |
| 3 | english | B     | 1                  | 0                   | 0                 | 100               | 0                  | 0                |
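A rough plain-pandas sketch of the same counts-plus-percentages table (illustrative only, not the library's implementation):

```python
# Rough plain-pandas equivalent of the pivot_rate call above (illustrative sketch only)
counts = pd.crosstab([test_df["major"], test_df["group"]], test_df["response"])
perc = counts.div(counts.sum(axis=1), axis=0) * 100     # row-wise percentages
out = pd.concat([counts.add_prefix("response_count_"),
                 perc.add_prefix("response_perc_")], axis=1).reset_index()
```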
provide an easy way to simplify a network data frame from directed to undirected
Note: if keep = "last", the output data frame keeps the last row of each set of duplicated rows; if keep = "first", it keeps the first row.
```python
data = {'from': ['Tom', 'Jack', 'Jen', 'Sam'],
        'overlap': [20, 21, 19, 18],
        'to': ['Jack', 'Tom', 'Emily', 'John']}
test_df = pd.DataFrame(data)
```
|   | from | overlap | to    |
|---|------|---------|-------|
| 0 | Tom  | 20      | Jack  |
| 1 | Jack | 21      | Tom   |
| 2 | Jen  | 19      | Emily |
| 3 | Sam  | 18      | John  |
In test_df, the Tom <--> Jack pair appears twice (once in each direction).
```python
sm.simplify_network_df(test_df, from_col="from", to_col="to", keep="first")
```
|   | from | overlap | to    |
|---|------|---------|-------|
| 0 | Tom  | 20      | Jack  |
| 2 | Jen  | 19      | Emily |
| 3 | Sam  | 18      | John  |
The output keeps only one Tom <--> Jack pair.
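For orientation, one way to get the same effect with plain pandas is to normalize the endpoint order of each edge and drop duplicates (an illustrative sketch only, not sichuanMojo's implementation):

```python
# Rough plain-pandas sketch of deduplicating undirected edges (illustrative only)
edge_key = test_df[["from", "to"]].apply(lambda row: tuple(sorted(row)), axis=1)
out = test_df.loc[~edge_key.duplicated(keep="first")]
```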