# PandaPlyr
![PyPI version](https://pypi-camo.freetls.fastly.net/fe91bc95e99b4663b181264ab04be4b426e41113/68747470733a2f2f62616467652e667572792e696f2f70792f70616e6461706c79722e737667)
pandaplyr is a Python package designed to provide a familiar and efficient data manipulation experience similar to the popular dplyr package in R. It aims to simplify and streamline the process of working with tabular data by providing a concise and intuitive syntax.
## Install

Installers for the latest version are available at the Python Package Index (PyPI).

```shell
pip install PandaPlyr
```
## Overview and vision for pandaplyr

The purpose of pandaplyr is to make chained operations on pandas DataFrames easier and more readable.

## Use case and example
We have two DataFrames: `grade_df` (left) and `subject_df` (right).

| StudentID | Subject | Grade |
|---|---|---|
| 1 | CompSci | 80 |
| 1 | English | 85 |
| 1 | History | 75 |
| 1 | LinearAlg | 75 |
| ... | ... | ... |

| Subject | SubjectType |
|---|---|
| CompSci | STEM |
| LinearAlg | STEM |
| Philosophy | Humanities |
| English | Humanities |
| History | Humanities |
Merge the DataFrames and find the student with the highest average grade for Humanities classes only, but exclude any students (`StudentID`) who are not enrolled in at least 2 Humanities courses.

```python
import pandas as pd
import PandaPlyr as pp

grade_df = pp.utils.read_grades_dataset()
subject_df = pp.utils.read_subject_dataset()
```
### In pandas

```python
merged_df = pd.merge(grade_df, subject_df, on='Subject')
humanities_df = merged_df[merged_df['SubjectType'] == 'Humanities']
course_counts = (
    humanities_df
    .groupby('StudentID', as_index=False)
    .agg(CourseCount=('Subject', 'count'))
)
filtered_students = course_counts.loc[course_counts['CourseCount'] >= 2][['StudentID']]
top_student_pandas = (
    humanities_df.loc[humanities_df['StudentID'].isin(filtered_students['StudentID'])]
    .groupby('StudentID', as_index=False)
    .agg(AverageGrade=('Grade', 'mean'))
    .sort_values('AverageGrade', ascending=False)
    .head(1)
)
```
### In pandaplyr

```python
top_student_pp = (
    grade_df >>
    pp.inner_join(subject_df, 'Subject') >>
    pp.where('SubjectType == "Humanities"') >>
    pp.group_by('StudentID', 'SubjectType') >>
    pp.summarise(AverageGrade=('Grade', 'mean'), CourseCount=('Grade', 'count')) >>
    pp.where('CourseCount >= 2') >>
    pp.mutate(MaxAverageGrade='AverageGrade.max()') >>
    pp.where('MaxAverageGrade == AverageGrade') >>
    pp.select('StudentID', 'AverageGrade')
)
```
The same answer takes 8 fewer lines, roughly 145 fewer characters, and is a lot more readable.
| StudentID | AverageGrade |
|---|---|
| 2 | 87.0 |
## Features

Here's a quick summary of the classes, methods, and functions we'll cover:

### group_by() and summarise() / summarize()
These functions allow group-wise aggregations on your DataFrame for one or more columns. The syntax is as follows:
```python
import pandas as pd
from PandaPlyr import *

df = pd.DataFrame({'A': ['foo', 'foo', 'foo', 'bar', 'bar', 'bar'],
                   'Z': ['x', 'x', 'y', 'x', 'x', 'y'],
                   'B': [10, 20, 30, 40, 50, 60]})
new_df = df >> group_by('A', 'Z') >> summarise(AVG_B=('B', 'mean'))
print(new_df)
```
|   | A | Z | B |
|---|---|---|---|
| 0 | foo | x | 10 |
| 1 | foo | x | 20 |
| 2 | foo | y | 30 |
| 3 | bar | x | 40 |
| 4 | bar | x | 50 |
| 5 | bar | y | 60 |

------>

| A | Z | AVG_B |
|---|---|---|
| foo | x | 15.0 |
| foo | y | 30.0 |
| bar | x | 45.0 |
| bar | y | 60.0 |
Note that you can pass the grouping columns as separate arguments or inside a list. By default, the grouping columns are returned as regular columns rather than as an index. The functions `summarize()` and `summarise()` are identical.
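For comparison, here is a plain-pandas rendering of the same aggregation (an equivalence sketch, not a claim about PandaPlyr's internals):

```python
import pandas as pd

df = pd.DataFrame({'A': ['foo', 'foo', 'foo', 'bar', 'bar', 'bar'],
                   'Z': ['x', 'x', 'y', 'x', 'x', 'y'],
                   'B': [10, 20, 30, 40, 50, 60]})

# as_index=False keeps 'A' and 'Z' as regular columns rather than an
# index, matching the no-index behaviour described above.
new_df = df.groupby(['A', 'Z'], as_index=False).agg(AVG_B=('B', 'mean'))
```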
### mutate()
The mutate function lets you add new columns or modify existing ones.
```python
import pandas as pd
from PandaPlyr import *

df = pd.DataFrame({'A': ['foo', 'foo', 'foo', 'bar', 'bar', 'bar'],
                   'B': [10, 20, 30, 40, 50, 60],
                   'C': [1, 2, 3, 4, 5, 6]})
new_df = df >> mutate(B_X_2='B * 2',
                      B_PLUS_C='B + C',
                      CONST=1)
print(new_df)
```
|   | A | B | C | B_X_2 | B_PLUS_C | CONST |
|---|---|---|---|---|---|---|
| 0 | foo | 10 | 1 | 20 | 11 | 1 |
| 1 | foo | 20 | 2 | 40 | 22 | 1 |
| 2 | foo | 30 | 3 | 60 | 33 | 1 |
| 3 | bar | 40 | 4 | 80 | 44 | 1 |
| 4 | bar | 50 | 5 | 100 | 55 | 1 |
| 5 | bar | 60 | 6 | 120 | 66 | 1 |
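For comparison, one plain-pandas way to build the same columns: `DataFrame.eval` evaluates expression strings like the ones above, while `assign` handles the constant.

```python
import pandas as pd

df = pd.DataFrame({'A': ['foo', 'foo', 'foo', 'bar', 'bar', 'bar'],
                   'B': [10, 20, 30, 40, 50, 60],
                   'C': [1, 2, 3, 4, 5, 6]})

# eval with an assignment expression returns a new DataFrame with the
# computed column; assign attaches the scalar column.
new_df = df.eval('B_X_2 = B * 2').eval('B_PLUS_C = B + C').assign(CONST=1)
```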
### where()
This function allows you to filter rows in your DataFrame based on a condition.
```python
import pandas as pd
from PandaPlyr import *

df = pd.DataFrame({'A': ['foo', 'foo', 'foo', 'bar', 'bar', 'bar'],
                   'B': [10, 20, 30, 40, 50, 60],
                   'C': [1, 2, 3, 4, 5, 6]})
new_df = df >> where('A == "foo" | C == 6')
print(new_df)
```
|   | A | B | C |
|---|---|---|---|
| 0 | foo | 10 | 1 |
| 1 | foo | 20 | 2 |
| 2 | foo | 30 | 3 |
| 3 | bar | 40 | 4 |
| 4 | bar | 50 | 5 |
| 5 | bar | 60 | 6 |

------>

|   | A | B | C |
|---|---|---|---|
| 0 | foo | 10 | 1 |
| 1 | foo | 20 | 2 |
| 2 | foo | 30 | 3 |
| 5 | bar | 60 | 6 |
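The condition string has the same shape as `DataFrame.query` syntax, so the same filter can be written in plain pandas:

```python
import pandas as pd

df = pd.DataFrame({'A': ['foo', 'foo', 'foo', 'bar', 'bar', 'bar'],
                   'B': [10, 20, 30, 40, 50, 60],
                   'C': [1, 2, 3, 4, 5, 6]})

# query accepts the identical expression string, including '|' for "or".
new_df = df.query('A == "foo" | C == 6')
```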
### select()
The select function can be used to select specific columns in your DataFrame.
```python
import pandas as pd
from PandaPlyr import *

df = pd.DataFrame({'A': ['foo', 'foo', 'foo', 'bar', 'bar', 'bar'],
                   'B': [10, 20, 30, 40, 50, 60],
                   'C': [1, 2, 3, 4, 5, 6]})
new_df = df >> select('A', 'B')
print(new_df)
```
|   | A | B | C |
|---|---|---|---|
| 0 | foo | 10 | 1 |
| 1 | foo | 20 | 2 |
| 2 | foo | 30 | 3 |
| 3 | bar | 40 | 4 |
| 4 | bar | 50 | 5 |
| 5 | bar | 60 | 6 |

------>

|   | A | B |
|---|---|---|
| 0 | foo | 10 |
| 1 | foo | 20 |
| 2 | foo | 30 |
| 3 | bar | 40 |
| 4 | bar | 50 |
| 5 | bar | 60 |
### rename()
You can rename columns in your DataFrame using the rename function.
```python
import pandas as pd
from PandaPlyr import *

df = pd.DataFrame({'A': ['foo', 'foo', 'foo', 'bar', 'bar', 'bar'],
                   'B': [10, 20, 30, 40, 50, 60],
                   'C': [1, 2, 3, 4, 5, 6]})
new_df = df >> rename(Z='C')
print(new_df)
```
|   | A | B | C |
|---|---|---|---|
| 0 | foo | 10 | 1 |
| 1 | foo | 20 | 2 |
| 2 | foo | 30 | 3 |
| 3 | bar | 40 | 4 |
| 4 | bar | 50 | 5 |
| 5 | bar | 60 | 6 |

------>

|   | A | B | Z |
|---|---|---|---|
| 0 | foo | 10 | 1 |
| 1 | foo | 20 | 2 |
| 2 | foo | 30 | 3 |
| 3 | bar | 40 | 4 |
| 4 | bar | 50 | 5 |
| 5 | bar | 60 | 6 |
### arrange() and order_by()

Use arrange or order_by (which are identical) to sort your DataFrame by one or more columns.
```python
import pandas as pd
from PandaPlyr import *

df = pd.DataFrame({'A': ['foo', 'foo', 'foo', 'bar', 'bar', 'bar'],
                   'B': [10, 20, 30, 40, 50, 60]})
new_df = df >> arrange('B', 'desc')
print(new_df)
```
|   | A | B |
|---|---|---|
| 0 | foo | 10 |
| 1 | foo | 20 |
| 2 | foo | 30 |
| 3 | bar | 40 |
| 4 | bar | 50 |
| 5 | bar | 60 |

------>

|   | A | B |
|---|---|---|
| 5 | bar | 60 |
| 4 | bar | 50 |
| 3 | bar | 40 |
| 2 | foo | 30 |
| 1 | foo | 20 |
| 0 | foo | 10 |
### left_join(), right_join(), full_join()
These functions allow you to join multiple DataFrames together.
```python
import pandas as pd
from PandaPlyr import *

df1 = pd.DataFrame({'A': ['foo', 'bar', 'other'],
                    'B': [1, 2, 3]})
df2 = pd.DataFrame({'A': ['foo', 'bar', 'foo'],
                    'C': [10, 20, 30]})
new_df = df1 >> left_join(df2, on='A', fill_na=0)
print(new_df)
```
|   | A | B | C |
|---|---|---|---|
| 0 | foo | 1 | 10 |
| 1 | foo | 1 | 30 |
| 2 | bar | 2 | 20 |
| 3 | other | 3 | 0 |
Note that left_join and full_join have an optional fill_na argument to replace numpy.nan values in merged fields.
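For comparison, the fill_na behaviour corresponds to a left merge followed by a fillna in plain pandas:

```python
import pandas as pd

df1 = pd.DataFrame({'A': ['foo', 'bar', 'other'], 'B': [1, 2, 3]})
df2 = pd.DataFrame({'A': ['foo', 'bar', 'foo'], 'C': [10, 20, 30]})

# Unmatched left rows get NaN in 'C', which is then replaced with 0.
new_df = pd.merge(df1, df2, on='A', how='left').fillna({'C': 0})
```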
### union() and union_all()
union and union_all let you concatenate two DataFrames together.
Note that union removes duplicates while union_all doesn't.
```python
import pandas as pd
from PandaPlyr import *

df1 = pd.DataFrame({'A': ['foo', 'bar', 'other'],
                    'B': [1, 2, 3]})
df2 = pd.DataFrame({'A': ['other', 'bar', 'foo'],
                    'B': [3, 4, 5]})
new_df = df1 >> union(df2)
print(new_df)
```
|   | A | B |
|---|---|---|
| 0 | foo | 1 |
| 1 | bar | 2 |
| 2 | other | 3 |
| 3 | bar | 4 |
| 4 | foo | 5 |
```python
import pandas as pd
from PandaPlyr import *

df1 = pd.DataFrame({'A': ['foo', 'bar', 'other'],
                    'B': [1, 2, 3]})
df2 = pd.DataFrame({'A': ['other', 'bar', 'foo'],
                    'B': [3, 4, 5]})
new_df = df1 >> union_all(df2)
print(new_df)
```
|   | A | B |
|---|---|---|
| 0 | foo | 1 |
| 1 | bar | 2 |
| 2 | other | 3 |
| 3 | other | 3 |
| 4 | bar | 4 |
| 5 | foo | 5 |
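For comparison, both operations can be expressed in plain pandas with concat, with union additionally dropping duplicate rows:

```python
import pandas as pd

df1 = pd.DataFrame({'A': ['foo', 'bar', 'other'], 'B': [1, 2, 3]})
df2 = pd.DataFrame({'A': ['other', 'bar', 'foo'], 'B': [3, 4, 5]})

# union ~ concat + drop_duplicates; union_all ~ concat alone.
union_df = pd.concat([df1, df2]).drop_duplicates().reset_index(drop=True)
union_all_df = pd.concat([df1, df2]).reset_index(drop=True)
```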
### distinct()
distinct removes duplicate rows in your DataFrame.
```python
import pandas as pd
from PandaPlyr import *

df = pd.DataFrame({'A': ['foo', 'bar', 'other', 'other']})
new_df = df >> distinct()
print(new_df)
```
|   | A |
|---|---|
| 0 | foo |
| 1 | bar |
| 2 | other |
| 3 | other |

------>

|   | A |
|---|---|
| 0 | foo |
| 1 | bar |
| 2 | other |
### fill_na()

fill_na replaces numpy.nan and None values, and (unlike pandas fillna) it also works on numpy.inf and -numpy.inf.

```python
import numpy as np
import pandas as pd
from PandaPlyr import *

df = pd.DataFrame({'A': [1, np.nan, None, np.inf, -np.inf]})
new_df = df >> fill_na('A', 0)
print(new_df)
```
|   | A |
|---|---|
| 0 | 1.0 |
| 1 | 0.0 |
| 2 | 0.0 |
| 3 | 0.0 |
| 4 | 0.0 |
### drop_na()

drop_na removes records with numpy.nan or None values, and (unlike pandas dropna) it also works on numpy.inf and -numpy.inf.

```python
import numpy as np
import pandas as pd
from PandaPlyr import *

df = pd.DataFrame({'A': [1, np.nan, None, np.inf, -np.inf]})
new_df = df >> drop_na()
print(new_df)
```
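For comparison, plain pandas needs the infinities converted to NaN before dropna will remove them:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'A': [1, np.nan, None, np.inf, -np.inf]})

# dropna alone ignores +/-inf, so map them to NaN first.
clean = df.replace([np.inf, -np.inf], np.nan).dropna()
```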
The Pipe class allows us to use the `>>` operator to chain operations together in a pipeline.
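For the curious, here is a minimal sketch of how such a class can be built; this shows the general pattern (overloading the reflected `__rrshift__` so a plain DataFrame can appear on the left of `>>`), not PandaPlyr's actual source:

```python
import pandas as pd

class Pipe:
    """Wrap a function so it can be used as `df >> func(args)`."""
    def __init__(self, func):
        self.func = func

    def __call__(self, *args, **kwargs):
        # Calling the pipe with arguments returns a new Pipe that
        # waits for the DataFrame from the left-hand side of '>>'.
        return Pipe(lambda df: self.func(df, *args, **kwargs))

    def __rrshift__(self, df):
        # `df >> pipe` triggers this, feeding the DataFrame in.
        return self.func(df)

@Pipe
def keep_positive(df, col):
    """Keep rows where `col` is positive."""
    return df[df[col] > 0]

df = pd.DataFrame({'x': [-1, 2, 3]})
result = df >> keep_positive('x')
```

Because pandas DataFrames do not define `>>` themselves, Python falls back to the wrapper's reflected `__rrshift__`, which is what makes the pipeline syntax work.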
### User-defined functions

You can define your own functions using the @Pipe decorator:
```python
import numpy as np
import pandas as pd
from PandaPlyr import *

@Pipe
def median_impute(df, *args):
    """Replace missing values with the median value in the column."""
    for col in args:
        col_median = np.nanmedian(df[col])
        df = df >> fill_na(col, value=col_median)
    return df

df = pd.DataFrame({'A': ['X', None, 'Y', np.inf, 'X', 'Y'],
                   'B': [1, 2, 3, None, 5, 6]})
new_df = df >> fill_na('A', 'Missing') >> median_impute('B')
print(new_df)
```
|   | A | B |
|---|---|---|
| 0 | X | 1.0 |
| 1 | None | 2.0 |
| 2 | Y | 3.0 |
| 3 | inf | NaN |
| 4 | X | 5.0 |
| 5 | Y | 6.0 |

------>

|   | A | B |
|---|---|---|
| 0 | X | 1.0 |
| 1 | Missing | 2.0 |
| 2 | Y | 3.0 |
| 3 | Missing | 3.0 |
| 4 | X | 5.0 |
| 5 | Y | 6.0 |
## To-do list

- Future features

## Contact
![LinkedIn Badge](https://pypi-camo.freetls.fastly.net/534fe0e6f6ad2cd4bf3706cf1d3dd2d3a1130284/68747470733a2f2f696d672e736869656c64732e696f2f62616467652f4c696e6b6564496e2d3030373742353f7374796c653d666f722d7468652d6261646765266c6f676f3d6c696e6b6564696e266c6f676f436f6c6f723d7768697465)