Classes and methods for easily executing Stata-like commands on pandas DataFrames.

# EasyFrames

## Summary

This package makes it easier to perform some basic operations on a pandas DataFrame. For example, suppose you have the following dataset:


       age       educ  fridge  has_car  hh  house_rooms  id  male     prov  weighthh
    0   44  secondary     yes        1   1            3   1     1       BC         2
    1   43   bachelor     yes        1   1            3   2     0       BC         2
    2   13    primary     yes        1   1            3   3     1       BC         2
    3   70     higher      no        1   2            2   1     1  Alberta         3
    4   23   bachelor     yes        0   3            1   1     1       BC         2
    5   20  secondary     yes        0   3            1   2     0       BC         2
    6   37     higher      no        1   4            3   1     1  Alberta         3
    7   35     higher      no        1   4            3   2     0  Alberta         3
    8    8    primary      no        1   4            3   3     0  Alberta         3
    9   15    primary      no        1   4            3   4     0  Alberta         3


In Stata, adding a column with the household size takes a single command:

    egen hhsize = count(id), by(hh)

In pandas, with the dataset loaded as df, you might have to do something like:

    import pandas as pd

    # Count the rows in each household, then merge the counts back onto df.
    result = df.groupby('hh')['hh'].agg(['count'])
    result.rename(columns={'count': 'hhsize'}, inplace=True)
    merged = pd.merge(df, result, left_on='hh', right_index=True, how='left')
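
As a point of comparison, plain pandas can also broadcast a per-group count back to every row with `transform`, which avoids the separate merge step. The sketch below rebuilds only the `hh`, `id` and `age` columns of the example dataset; it is an alternative to the merge-based snippet, not a description of how this package works internally.

```python
import pandas as pd

# Subset of the example dataset (only the columns needed here).
df = pd.DataFrame({
    "hh":  [1, 1, 1, 2, 3, 3, 4, 4, 4, 4],
    "id":  [1, 2, 3, 1, 1, 2, 1, 2, 3, 4],
    "age": [44, 43, 13, 70, 23, 20, 37, 35, 8, 15],
})

# transform() returns one value per original row, so no merge is needed.
df["hhsize"] = df.groupby("hh")["id"].transform("count")

print(df["hhsize"].tolist())  # [3, 3, 3, 1, 2, 2, 4, 4, 4, 4]
```

The values match the hhsize column shown in the package output for the same data.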


Using this package, the command would be:


    from easyframes.easyframes import hhkit

    myhhkit = hhkit()
    df = myhhkit.egen(df, operation='count', groupby='hh', col='hh', column_label='hhsize')



       id  hh  fridge  age  male  house_rooms  has_car  weighthh     prov       educ  hhsize
    0   1   1     yes   44     1            3        1         2       BC  secondary       3
    1   2   1     yes   43     0            3        1         2       BC   bachelor       3
    2   3   1     yes   13     1            3        1         2       BC    primary       3
    3   1   2      no   70     1            2        1         3  Alberta     higher       1
    4   1   3     yes   23     1            1        0         2       BC   bachelor       2
    5   2   3     yes   20     0            1        0         2       BC  secondary       2
    6   1   4      no   37     1            3        1         3  Alberta     higher       4
    7   2   4      no   35     0            3        1         3  Alberta     higher       4
    8   3   4      no    8     0            3        1         3  Alberta    primary       4
    9   4   4      no   15     0            3        1         3  Alberta    primary       4


This does not save much typing or space, but suppose you also want the average age in each household. You would simply add:

    df = myhhkit.egen(df, operation='mean', groupby='hh', col='age', column_label='mean age in hh')

The result:

       id  hh  fridge  age  male  house_rooms  has_car  weighthh     prov       educ  hhsize  mean age in hh
    0   1   1     yes   44     1            3        1         2       BC  secondary       3       33.333333
    1   2   1     yes   43     0            3        1         2       BC   bachelor       3       33.333333
    2   3   1     yes   13     1            3        1         2       BC    primary       3       33.333333
    3   1   2      no   70     1            2        1         3  Alberta     higher       1       70.000000
    4   1   3     yes   23     1            1        0         2       BC   bachelor       2       21.500000
    5   2   3     yes   20     0            1        0         2       BC  secondary       2       21.500000
    6   1   4      no   37     1            3        1         3  Alberta     higher       4       23.750000
    7   2   4      no   35     0            3        1         3  Alberta     higher       4       23.750000
    8   3   4      no    8     0            3        1         3  Alberta    primary       4       23.750000
    9   4   4      no   15     0            3        1         3  Alberta    primary       4       23.750000
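
To check the figures in the table by hand, the same per-household mean can be computed in plain pandas. This is only a cross-check on a two-column subset of the example data, and the variable names are illustrative.

```python
import pandas as pd

# Household id and age for the ten example rows.
df = pd.DataFrame({
    "hh":  [1, 1, 1, 2, 3, 3, 4, 4, 4, 4],
    "age": [44, 43, 13, 70, 23, 20, 37, 35, 8, 15],
})

# Mean age per household, repeated on every row of that household.
df["mean age in hh"] = df.groupby("hh")["age"].transform("mean")

print(df)
```

Household 1, for instance, has ages 44, 43 and 13, which average to 100 / 3, or about 33.33.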


You can also include or exclude certain rows. For example, suppose the household size should count only members over the age of 22:

    df = myhhkit.egen(df, operation='count', groupby='hh', col='hh', column_label='hhs_o22', include=df['age']>22)


The result:

       id  hh  fridge  age  male  house_rooms  has_car  weighthh     prov       educ  hhs_o22
    0   1   1     yes   44     1            3        1         2       BC  secondary        2
    1   2   1     yes   43     0            3        1         2       BC   bachelor        2
    2   3   1     yes   13     1            3        1         2       BC    primary        2
    3   1   2      no   70     1            2        1         3  Alberta     higher        1
    4   1   3     yes   23     1            1        0         2       BC   bachelor        1
    5   2   3     yes   20     0            1        0         2       BC  secondary        1
    6   1   4      no   37     1            3        1         3  Alberta     higher        2
    7   2   4      no   35     0            3        1         3  Alberta     higher        2
    8   3   4      no    8     0            3        1         3  Alberta    primary        2
    9   4   4      no   15     0            3        1         3  Alberta    primary        2
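
Note that every member of a household receives the count, including members who do not satisfy the condition themselves. A plain-pandas sketch that gives the same numbers is shown below; it sums a 0/1 indicator within each household. It illustrates the result only and is not taken from the package's implementation.

```python
import pandas as pd

df = pd.DataFrame({
    "hh":  [1, 1, 1, 2, 3, 3, 4, 4, 4, 4],
    "age": [44, 43, 13, 70, 23, 20, 37, 35, 8, 15],
})

# 1 for members over 22, 0 otherwise.
over_22 = (df["age"] > 22).astype(int)

# Sum the indicator within each household and repeat it on every row.
df["hhs_o22"] = over_22.groupby(df["hh"]).transform("sum")

print(df["hhs_o22"].tolist())  # [2, 2, 2, 1, 1, 1, 2, 2, 2, 2]
```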

You can also exclude rows. For example, to leave members over 22 years of age out of the count:

    df = myhhkit.egen(df, operation='count', groupby='hh', col='hh', column_label='hhs_o22',
                      exclude=df['age']>22)
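
The README does not show the output of the exclude call, so the following is only a plain-pandas sketch of what counting members aged 22 or under looks like on the example data. The column name `hhs_22_or_under` is made up for this illustration. Whether egen reports 0 or a missing value for a household with no remaining members (household 2 here) is not stated in the source, so treat the 0 below as pandas behaviour, not the package's.

```python
import pandas as pd

df = pd.DataFrame({
    "hh":  [1, 1, 1, 2, 3, 3, 4, 4, 4, 4],
    "age": [44, 43, 13, 70, 23, 20, 37, 35, 8, 15],
})

# Rows to leave out of the count.
exclude = df["age"] > 22

# Count the remaining rows per household (hypothetical column name).
kept = (~exclude).astype(int)
df["hhs_22_or_under"] = kept.groupby(df["hh"]).transform("sum")

print(df["hhs_22_or_under"].tolist())  # [1, 1, 1, 0, 1, 1, 2, 2, 2, 2]
```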

If you don't specify a column label, a default is constructed from the operation, the column, and the grouping column:

    df = myhhkit.egen(df, operation='mean', groupby='hh', col='age')


       id  hh  fridge  age  male  house_rooms  has_car  weighthh     prov       educ  (mean) age by hh
    0   1   1     yes   44     1            3        1         2       BC  secondary         33.333333
    1   2   1     yes   43     0            3        1         2       BC   bachelor         33.333333
    2   3   1     yes   13     1            3        1         2       BC    primary         33.333333
    3   1   2      no   70     1            2        1         3  Alberta     higher         70.000000
    4   1   3     yes   23     1            1        0         2       BC   bachelor         21.500000
    5   2   3     yes   20     0            1        0         2       BC  secondary         21.500000
    6   1   4      no   37     1            3        1         3  Alberta     higher         23.750000
    7   2   4      no   35     0            3        1         3  Alberta     higher         23.750000
    8   3   4      no    8     0            3        1         3  Alberta    primary         23.750000
    9   4   4      no   15     0            3        1         3  Alberta    primary         23.750000
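
Judging from this one example, the default label appears to follow the pattern "(operation) col by groupby". The helper below is hypothetical and only mimics that pattern; the package may build the label differently in other cases, for instance when grouping by several columns.

```python
def default_label(operation: str, col: str, groupby: str) -> str:
    """Hypothetical helper: mimic the default label pattern seen above."""
    return f"({operation}) {col} by {groupby}"


print(default_label("mean", "age", "hh"))  # (mean) age by hh
```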