A series of utility functions to help with tidy data frames.
sichuanMojo
Built on top of pandas, sichuanMojo is a grammar of data manipulation for data frames, providing a consistent set of utility functions that help you solve the most common data manipulation challenges:
- unite_col: unite columns into one column
- separate_col: separate a column by a pattern into individual columns
- groupby_col: provide a summary table of a selected column grouped by desired column(s)
- groupby_across: provide a summary table of a selected column with basic statistical information (min, max, sum, median) grouped by desired column(s)
- pivot_tb: provide a pivot (count/frequency) table of a selected column grouped by desired column(s)
- pivot_rate: provide a pivot (rate/percentage) table of a selected column grouped by desired column(s)
- simplify_network_df: simplify a network data frame from directed to undirected
Installation
Use the package manager pip to install sichuanMojo.
```bash
pip install sichuanMojo
```
Background
Sichuan province is the panda's hometown, which explains the name of this library!
Usage
```python
import sichuanMojo as sm
import pandas as pd

# Example input data
data = {'name': ['Tom', 'nick', 'krish', 'jack', 'Mike', 'Jan'],
        'age': [20, 21, 19, 18, 21, 33],
        'group': ['A', 'A', 'A', 'B', 'B', 'B'],
        'major': ['biology', 'english', 'biology', 'english', 'biology', 'biology'],
        'response': ['good', 'good', 'ok', 'bad', 'bad', 'bad']}

# Create DataFrame
test_df = pd.DataFrame(data)
```
|   | name  | age | group | major   | response |
|---|-------|-----|-------|---------|----------|
| 0 | Tom   | 20  | A     | biology | good     |
| 1 | nick  | 21  | A     | english | good     |
| 2 | krish | 19  | A     | biology | ok       |
| 3 | jack  | 18  | B     | english | bad      |
| 4 | Mike  | 21  | B     | biology | bad      |
| 5 | Jan   | 33  | B     | biology | bad      |
unite columns into one column
Note: if output_json = True, the output will be in JSON format; otherwise, the output will be a data frame.
```python
sm.unite_col(test_df, unite_by=["group", "major", "response"], pattern="; ", united_col_name="New_col", output_json=False)
```
|   | name  | age | group | major   | response | New_col          |
|---|-------|-----|-------|---------|----------|------------------|
| 0 | Tom   | 20  | A     | biology | good     | A; biology; good |
| 1 | nick  | 21  | A     | english | good     | A; english; good |
| 2 | krish | 19  | A     | biology | ok       | A; biology; ok   |
| 3 | jack  | 18  | B     | english | bad      | B; english; bad  |
| 4 | Mike  | 21  | B     | biology | bad      | B; biology; bad  |
| 5 | Jan   | 33  | B     | biology | bad      | B; biology; bad  |
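For orientation, roughly the same result can be reproduced with plain pandas. This is only an illustrative sketch, not sichuanMojo's actual implementation:

```python
# Rough plain-pandas equivalent of the unite_col call above (illustrative sketch only),
# reusing test_df from the Usage section
out = test_df.copy()
out["New_col"] = out[["group", "major", "response"]].astype(str).agg("; ".join, axis=1)

# Rough analogue of output_json=True
out_json = out.to_dict(orient="records")
```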
separate a column by a pattern into individual columns
Note: if output_json = True, the output will be in JSON format; otherwise, the output will be a data frame.
```python
# Test data
data2 = {'name': ['Tom', 'nick', 'krish', 'jack', 'Mike'],
         'age': [20, 21, 19, 18, 21],
         'group': ['A', 'A', 'A', 'B', 'B'],
         'major': ['biology', 'english', 'biology', 'english', 'biology'],
         'date': ['2022-01-13', '2022-06-23', '2022-03-12', '2022-08-23', '2022-09-13']}

# Create DataFrame
test_df2 = pd.DataFrame(data2)
```
|   | name  | age | group | major   | date       |
|---|-------|-----|-------|---------|------------|
| 0 | Tom   | 20  | A     | biology | 2022-01-13 |
| 1 | nick  | 21  | A     | english | 2022-06-23 |
| 2 | krish | 19  | A     | biology | 2022-03-12 |
| 3 | jack  | 18  | B     | english | 2022-08-23 |
| 4 | Mike  | 21  | B     | biology | 2022-09-13 |
```python
sm.separate_col(test_df2, sep_by="date", pattern="-", sep_to_names=["year", "month", "date"], output_json=False)
```
|   | name  | age | group | major   | date | year | month |
|---|-------|-----|-------|---------|------|------|-------|
| 0 | Tom   | 20  | A     | biology | 13   | 2022 | 01    |
| 1 | nick  | 21  | A     | english | 23   | 2022 | 06    |
| 2 | krish | 19  | A     | biology | 12   | 2022 | 03    |
| 3 | jack  | 18  | B     | english | 23   | 2022 | 08    |
| 4 | Mike  | 21  | B     | biology | 13   | 2022 | 09    |
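A rough plain-pandas equivalent of the call above (an illustrative sketch, not the library's implementation; the resulting column order may differ from sichuanMojo's output):

```python
# Rough plain-pandas equivalent of separate_col (illustrative sketch only)
parts = test_df2["date"].str.split("-", expand=True)   # split "2022-01-13" into three parts
parts.columns = ["year", "month", "date"]               # names taken from sep_to_names
out = test_df2.drop(columns="date").join(parts)         # replace the original date column
```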
provide a summary table of a selected column grouped by desired column(s)
Note: if output_json = True, the output will be in JSON format; otherwise, the output will be a data frame. The operator should be one of mean, sum, min, max, median, unique, count, nunique.
```python
sm.groupby_col(test_df, group_by=["group", "major"], summmarize_at="name", operator="nunique", output_json=True)
# returns [{'group': 'A', 'major': 'biology', 'name': 2},
#          {'group': 'A', 'major': 'english', 'name': 1},
#          {'group': 'B', 'major': 'biology', 'name': 2},
#          {'group': 'B', 'major': 'english', 'name': 1}]

sm.groupby_col(test_df, group_by=["group", "major"], summmarize_at="name", operator="nunique")
```
|   | group | major   | name |
|---|-------|---------|------|
| 0 | A     | biology | 2    |
| 1 | A     | english | 1    |
| 2 | B     | biology | 2    |
| 3 | B     | english | 1    |
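For orientation, a roughly equivalent plain-pandas call (illustrative sketch only, not sichuanMojo's implementation):

```python
# Rough plain-pandas equivalent of the groupby_col call above (illustrative sketch only)
out = (test_df.groupby(["group", "major"])["name"]
              .nunique()
              .reset_index())

# Rough analogue of output_json=True
out_json = out.to_dict(orient="records")
```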
provide a summary table of a selected column with basic statistical information (min, max, sum, median) grouped by desired column(s)
Note: if output_json = True, the output will be in JSON format; otherwise, the output will be a data frame.
```python
sm.groupby_across(test_df, ["group", "major"], summmarize_at="age", operation=None, output_json=False)
```
|   | group | major   | age_min | age_max | age_sum | age_median |
|---|-------|---------|---------|---------|---------|------------|
| 0 | A     | biology | 19      | 20      | 39      | 19.5       |
| 1 | A     | english | 21      | 21      | 21      | 21         |
| 2 | B     | biology | 21      | 33      | 54      | 27         |
| 3 | B     | english | 18      | 18      | 18      | 18         |
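A roughly equivalent plain-pandas version of the same summary (illustrative sketch only, not the library's implementation):

```python
# Rough plain-pandas equivalent of the groupby_across call above (illustrative sketch only)
out = (test_df.groupby(["group", "major"])["age"]
              .agg(["min", "max", "sum", "median"])
              .add_prefix("age_")
              .reset_index())
```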
provide a pivot (count/frequency) table of a selected column grouped by desired column(s)
Note: if output_json = True, the output will be in JSON format; otherwise, the output will be a data frame.
```python
sm.pivot_tb(test_df, group_by=["group", "major"], summmarize_at="response", operation="count", output_json=False, na_fill=0)
```
|   | group | major   | bad | good | ok |
|---|-------|---------|-----|------|----|
| 0 | A     | biology | 0   | 1    | 1  |
| 1 | A     | english | 0   | 1    | 0  |
| 2 | B     | biology | 2   | 0    | 0  |
| 3 | B     | english | 1   | 0    | 0  |
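For orientation, a roughly equivalent count table can be built with plain pandas (illustrative sketch only, not sichuanMojo's implementation):

```python
# Rough plain-pandas equivalent of the pivot_tb call above (illustrative sketch only);
# crosstab fills absent count combinations with 0, similar in spirit to na_fill=0
out = (pd.crosstab([test_df["group"], test_df["major"]], test_df["response"])
         .reset_index())
```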
provide a pivot (rate/percentage) table of a selected column grouped by desired column(s)
Note: if output_json = True, the output will be in JSON format; otherwise, the output will be a data frame.
```python
sm.pivot_rate(test_df, group_by=["major", "group"], summmarize_at="response", output_json=False, na_fill=0)
```
|   | major   | group | response_count_bad | response_count_good | response_count_ok | response_perc_bad | response_perc_good | response_perc_ok |
|---|---------|-------|--------------------|---------------------|-------------------|-------------------|--------------------|------------------|
| 0 | biology | A     | 0                  | 1                   | 1                 | 0                 | 50                 | 50               |
| 1 | biology | B     | 2                  | 0                   | 0                 | 100               | 0                  | 0                |
| 2 | english | A     | 0                  | 1                   | 0                 | 0                 | 100                | 0                |
| 3 | english | B     | 1                  | 0                   | 0                 | 100               | 0                  | 0                |
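A rough plain-pandas sketch of the same counts-plus-percentages table (illustrative only, not the library's implementation):

```python
# Rough plain-pandas equivalent of the pivot_rate call above (illustrative sketch only)
counts = pd.crosstab([test_df["major"], test_df["group"]], test_df["response"])
perc = counts.div(counts.sum(axis=1), axis=0) * 100     # row-wise percentages
out = pd.concat([counts.add_prefix("response_count_"),
                 perc.add_prefix("response_perc_")], axis=1).reset_index()
```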
provide an easy way to simplify a network data frame from directed to undirected
Note: if keep = "last", the output data frame keeps the last row of each set of duplicated rows; if keep = "first", it keeps the first row.
```python
data = {'from': ['Tom', 'Jack', 'Jen', 'Sam'],
        'overlap': [20, 21, 19, 18],
        'to': ['Jack', 'Tom', 'Emily', 'John']}
test_df = pd.DataFrame(data)
```
|   | from | overlap | to    |
|---|------|---------|-------|
| 0 | Tom  | 20      | Jack  |
| 1 | Jack | 21      | Tom   |
| 2 | Jen  | 19      | Emily |
| 3 | Sam  | 18      | John  |
In test_df, the Tom <--> Jack pair appears twice (once in each direction).
```python
sm.simplify_network_df(test_df, from_col="from", to_col="to", keep="first")
```
|   | from | overlap | to    |
|---|------|---------|-------|
| 0 | Tom  | 20      | Jack  |
| 2 | Jen  | 19      | Emily |
| 3 | Sam  | 18      | John  |
The output keeps only one Tom <--> Jack pair.
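For orientation, one way to get the same effect with plain pandas is to normalize the endpoint order of each edge and drop duplicates (an illustrative sketch only, not sichuanMojo's implementation):

```python
# Rough plain-pandas sketch of deduplicating undirected edges (illustrative only)
edge_key = test_df[["from", "to"]].apply(lambda row: tuple(sorted(row)), axis=1)
out = test_df.loc[~edge_key.duplicated(keep="first")]
```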