Project description

PandaPlyr

PandaPlyr is a Python package designed to provide a familiar and efficient data manipulation experience similar to the popular dplyr package in R. It aims to simplify and streamline the process of working with tabular data by providing a concise and intuitive syntax.

Overview and vision for PandaPlyr

The purpose of PandaPlyr is to make chained operations on pandas DataFrames easier and more readable.

Use case and example

I have two DataFrames, grade_df (left) and subject_df (right).

grade_df:

StudentID  Subject    Grade
1          CompSci    80
1          English    85
1          History    75
1          LinearAlg  75
...        ...        ...

subject_df:

Subject     SubjectType
CompSci     STEM
LinearAlg   STEM
Philosophy  Humanities
English     Humanities
History     Humanities

The task: merge the DataFrames and find the student with the highest average grade in Humanities classes, excluding any students ('StudentID') enrolled in fewer than two Humanities courses.

import pandas as pd
import pandaplyr as pp
grade_df = pp.read_grades_dataset()
subject_df = pp.read_subject_dataset()

In pandas

merged_df = pd.merge(grade_df, subject_df, on='Subject')

humanities_df = merged_df[merged_df['SubjectType'] == 'Humanities']

course_counts = (
    humanities_df
    .groupby('StudentID', as_index=False)
    .agg(CourseCount=('Subject', 'count'))
)

filtered_students = course_counts.loc[course_counts['CourseCount'] >= 2, ['StudentID']]

top_student_pandas = (
    humanities_df.loc[humanities_df['StudentID'].isin(filtered_students['StudentID'])]
    .groupby('StudentID', as_index=False)
    .agg(AverageGrade=('Grade', 'mean'))
    .sort_values('AverageGrade', ascending=False)
    .head(1)
)

In PandaPlyr

top_student_pp = (
    grade_df >>
    pp.inner_join(subject_df, 'Subject') >>
    pp.where('SubjectType == "Humanities"') >>
    pp.group_by('StudentID', 'SubjectType') >>
    pp.summarise(AverageGrade=('Grade', 'mean'), CourseCount=('Grade', 'count')) >>
    pp.where('CourseCount >= 2') >>
    pp.mutate(MaxAverageGrade='AverageGrade.max()') >>
    pp.where('MaxAverageGrade == AverageGrade') >>
    pp.select('StudentID', 'AverageGrade')
)

The same answer takes 8 fewer lines, ~145 fewer characters, and is a lot more readable.

StudentID AverageGrade
2 87.0

Features

Here's a quick summary of the classes, methods, and functions we'll cover:

group_by() and summarise() / summarize()

These functions allow group-wise aggregations on your DataFrame for one or more columns. The syntax is as follows:

import pandas as pd
from pandaplyr import *
df = pd.DataFrame({'A': ['foo', 'foo', 'foo', 'bar', 'bar', 'bar'],
                   'Z' : ['x', 'x', 'y', 'x', 'x', 'y'],
                   'B': [10, 20, 30, 40, 50, 60]})
new_df = df >> group_by('A', 'Z') >> summarise(AVG_B = ('B', 'mean'))
print(new_df)
A Z B
0 foo x 10
1 foo x 20
2 foo y 30
3 bar x 40
4 bar x 50
5 bar y 60
------>
A Z AVG_B
foo x 15.0
foo y 30.0
bar x 45.0
bar y 60.0

Note that you can pass the columns as separate arguments or inside a list. By default, the grouping columns are returned as regular columns rather than as the index. The functions summarize() and summarise() are identical.
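
For example, the list form and the summarize() spelling give the same result as the call above (a small self-contained sketch):

import pandas as pd
from pandaplyr import *
df = pd.DataFrame({'A': ['foo', 'foo', 'foo', 'bar', 'bar', 'bar'],
                   'Z': ['x', 'x', 'y', 'x', 'x', 'y'],
                   'B': [10, 20, 30, 40, 50, 60]})
# Grouping columns passed as a list, and summarize() spelled with a 'z'
new_df = df >> group_by(['A', 'Z']) >> summarize(AVG_B=('B', 'mean'))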


mutate()

The mutate function lets you add new columns or modify existing ones.

import pandas as pd
from pandaplyr import *
df = pd.DataFrame({'A': ['foo', 'foo', 'foo', 'bar', 'bar', 'bar'],
                   'B': [10, 20, 30, 40, 50, 60],
                   'C': [1, 2, 3, 4, 5, 6]})
new_df = df >> mutate(B_X_2 = 'B * 2',
                      B_PLUS_C = 'B + C',
                      CONST = 1)
print(new_df)
A B C B_X_2 B_PLUS_C CONST
0 foo 10 1 20 11 1
1 foo 20 2 40 22 1
2 foo 30 3 60 33 1
3 bar 40 4 80 44 1
4 bar 50 5 100 55 1
5 bar 60 6 120 66 1

where()

This function allows you to filter rows in your DataFrame based on a condition.

import pandas as pd
from pandaplyr import *
df = pd.DataFrame({'A': ['foo', 'foo', 'foo', 'bar', 'bar', 'bar'],
                   'B': [10, 20, 30, 40, 50, 60],
                   'C': [1, 2, 3, 4, 5, 6]})
new_df = df >> where('A == "foo" | C == 6')
print(new_df)
A B C
0 foo 10 1
1 foo 20 2
2 foo 30 3
3 bar 40 4
4 bar 50 5
5 bar 60 6

------>

A B C
0 foo 10 1
1 foo 20 2
2 foo 30 3
5 bar 60 6

select()

The select function can be used to select specific columns in your DataFrame.

import pandas as pd
from pandaplyr import *
df = pd.DataFrame({'A': ['foo', 'foo', 'foo', 'bar', 'bar', 'bar'],
                   'B': [10, 20, 30, 40, 50, 60],
                   'C': [1, 2, 3, 4, 5, 6]})
new_df = df >> select('A', 'B')
print(new_df)
A B C
0 foo 10 1
1 foo 20 2
2 foo 30 3
3 bar 40 4
4 bar 50 5
5 bar 60 6

------>

A B
0 foo 10
1 foo 20
2 foo 30
3 bar 40
4 bar 50
5 bar 60

rename()

You can rename columns in your DataFrame using the rename function.

import pandas as pd
from pandaplyr import *
df = pd.DataFrame({'A': ['foo', 'foo', 'foo', 'bar', 'bar', 'bar'],
                   'B': [10, 20, 30, 40, 50, 60],
                   'C': [1, 2, 3, 4, 5, 6]})
new_df = df >> rename(Z = 'C')
print(new_df)
A B C
0 foo 10 1
1 foo 20 2
2 foo 30 3
3 bar 40 4
4 bar 50 5
5 bar 60 6

------>

A B Z
0 foo 10 1
1 foo 20 2
2 foo 30 3
3 bar 40 4
4 bar 50 5
5 bar 60 6

arrange() and order_by()

Use arrange or order_by (which are 100% identical) to sort your DataFrame by one or more columns.

import pandas as pd
from pandaplyr import *
df = pd.DataFrame({'A': ['foo', 'foo', 'foo', 'bar', 'bar', 'bar'],
                   'B': [10, 20, 30, 40, 50, 60]})
new_df = df >> arrange('B', 'desc')
print(new_df)
A B
0 foo 10
1 foo 20
2 foo 30
3 bar 40
4 bar 50
5 bar 60

------>

A B
5 bar 60
4 bar 50
3 bar 40
2 foo 30
1 foo 20
0 foo 10
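
Because order_by is identical to arrange, the same descending sort can also be written as below (a small sketch reusing the DataFrame from this example):

import pandas as pd
from pandaplyr import *
df = pd.DataFrame({'A': ['foo', 'foo', 'foo', 'bar', 'bar', 'bar'],
                   'B': [10, 20, 30, 40, 50, 60]})
# Same result as arrange('B', 'desc') above
new_df = df >> order_by('B', 'desc')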

left_join(), right_join(), full_join()

These functions allow you to join multiple DataFrames together.

import pandas as pd
from pandaplyr import *
df1 = pd.DataFrame({'A': ['foo', 'bar', 'other'],
                    'B': [1, 2, 3]})

df2 = pd.DataFrame({'A': ['foo', 'bar', 'foo'],
                    'C': [10, 20, 30]})

new_df = df1 >> left_join(df2, on = 'A', fill_na = 0)
print(new_df)
A B C
0 foo 1 10
1 foo 1 30
2 bar 2 20
3 other 3 0

Note that left_join and full_join have an optional fill_na argument to replace numpy.nan values from merged fields.
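
For instance, the same option should work with a full join (a sketch reusing df1 and df2 from above; the 'on' and 'fill_na' arguments are assumed to mirror left_join):

import pandas as pd
from pandaplyr import *
df1 = pd.DataFrame({'A': ['foo', 'bar', 'other'],
                    'B': [1, 2, 3]})
df2 = pd.DataFrame({'A': ['foo', 'bar', 'foo'],
                    'C': [10, 20, 30]})
# 'other' has no match in df2, so its C value should be filled with 0 instead of NaN
new_df = df1 >> full_join(df2, on='A', fill_na=0)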


union() and union_all()

union and union_all let you concatenate two DataFrames together.

Note that union removes duplicates while union_all doesn't.

import pandas as pd
from pandaplyr import *
df1 = pd.DataFrame({'A': ['foo', 'bar', 'other'],
                    'B': [1, 2, 3]})

df2 = pd.DataFrame({'A': ['other', 'bar', 'foo'],
                    'B': [3, 4, 5]})

new_df = df1 >> union(df2)
print(new_df)
A B
0 foo 1
1 bar 2
2 other 3
3 bar 4
4 foo 5

import pandas as pd
from pandaplyr import *
df1 = pd.DataFrame({'A': ['foo', 'bar', 'other'],
                    'B': [1, 2, 3]})

df2 = pd.DataFrame({'A': ['other', 'bar', 'foo'],
                    'B': [3, 4, 5]})

new_df = df1 >> union_all(df2)
print(new_df)
A B
0 foo 1
1 bar 2
2 other 3
3 other 3
4 bar 4
5 foo 5

distinct()

distinct removes duplicate rows in your DataFrame.

import pandas as pd
from pandaplyr import *
df = pd.DataFrame({'A': ['foo', 'bar', 'other', 'other']})

new_df = df >> distinct()
print(new_df)
A
0 foo
1 bar
2 other
3 other

------>

A
0 foo
1 bar
2 other

fillna()

Replaces numpy.nan and None values and, unlike pandas fillna, also handles numpy.inf and -numpy.inf.

import pandas as pd
import numpy as np
from pandaplyr import *
df = pd.DataFrame({'A': [1, np.nan, None, np.inf, -np.inf]})
new_df = df >> fillna('A', 0)
print(new_df)
A
0 1.0
1 0.0
2 0.0
3 0.0
4 0.0

dropna()

Drops rows containing numpy.nan or None and, unlike pandas dropna, also rows containing numpy.inf and -numpy.inf.

import pandas as pd
import numpy as np
from pandaplyr import *
df = pd.DataFrame({'A': [1, np.nan, None, np.inf, -np.inf]})
new_df = df >> dropna('A')  # column argument assumed here by analogy with fillna('A', 0)
print(new_df)
A
0 1.0

The Pipe class allows us to use the '>>' operator to chain operations together in a pipeline.
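
For illustration, here is a minimal sketch of how a >>-friendly wrapper can be built with the reflected __rrshift__ operator. This is illustrative only and not necessarily how PandaPlyr's own Pipe class is implemented:

import pandas as pd

class Pipe:
    """Wrap a function so that `df >> wrapped(...)` applies it to the DataFrame."""
    def __init__(self, func):
        self.func = func

    def __call__(self, *args, **kwargs):
        # Calling the wrapped function captures its arguments and returns a new
        # Pipe whose stored function only needs the incoming DataFrame.
        return Pipe(lambda df: self.func(df, *args, **kwargs))

    def __rrshift__(self, df):
        # `df >> pipe_obj` falls back to this reflected operator, so the
        # DataFrame on the left-hand side is passed to the stored function.
        return self.func(df)

@Pipe
def double_column(df, col, new_col):
    """Hypothetical example step: add a column equal to twice an existing one."""
    return df.assign(**{new_col: df[col] * 2})

df = pd.DataFrame({'B': [1, 2, 3]})
print(df >> double_column('B', 'B_X_2'))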


User-defined functions

You can define your own functions using the @Pipe decorator

import pandas as pd
import numpy as np
from pandaplyr import *

@Pipe
def median_impute(df, *args):
    """Replace missing values with the median value in the column"""
    for col in args:
        col_median = np.nanmedian(df[col])
        df = df >> fillna(col, value = col_median)
    return df

df = pd.DataFrame({'A': ['X', None, 'Y', np.inf, 'X', 'Y'],
                   'B': [1, 2, 3, None, 5, 6]})

new_df = df >> fillna('A', 'Missing') >> median_impute('B')
print(new_df)
A B
0 X 1.0
1 None 2.0
2 Y 3.0
3 inf NaN
4 X 5.0
5 Y 6.0

------>

A B
0 X 1.0
1 Missing 2.0
2 Y 3.0
3 Missing 3.0
4 X 5.0
5 Y 6.0

Install

pip install PandaPlyr

To-do list

  • Add error handling to pyplyr.py
  • Add unit tests for new functions

Future features

  • Polars backend
  • Intelligent multiprocessing
