How to use pandas_cub
The README.ipynb notebook will serve as the documentation and usage guide to pandas_cub.
Installation
pip install pandas-cub
What is pandas_cub?
pandas_cub is a simple data analysis library that emulates the functionality of the pandas library. The library is not meant for serious work. It was built as an assignment for one of Ted Petrou's Python classes. If you would like to complete the assignment on your own, visit this repository. There are about 40 steps and 100 tests that you must pass in order to rebuild the library. It is a good challenge and teaches you the fundamentals of how to build your own data analysis library.
pandas_cub functionality
pandas_cub has limited functionality but is still capable of a wide variety of data analysis tasks.
- Subset selection with the brackets
- Arithmetic and comparison operators (+, -, <, !=, etc...)
- Aggregation of columns with most of the common functions (min, max, mean, median, etc...)
- Grouping via pivot tables
- String-only methods for columns containing strings
- Reading in simple comma-separated value files
- Several other methods
pandas_cub DataFrame
pandas_cub has a single main object, the DataFrame, to hold all of the data. The DataFrame is capable of holding 4 data types - booleans, integers, floats, and strings. All data is stored in NumPy arrays. panda_cub DataFrames have no index (as in pandas). The columns must be strings.
Missing value representation
Boolean and integer columns will have no missing value representation. The NumPy NaN is used for float columns and the Python None is used for string columns.
Code Examples
pandas_cub syntax is very similar to pandas, but implements much fewer methods. The below examples will cover just about all of the API.
Reading data with read_csv
pandas_cub consists of a single function, read_csv
, that has a single parameter, the location of the file you would like to read in as a DataFrame. This function can only handle simple CSV's and the delimiter must be a comma. A sample employee dataset is provided in the data directory. Notice that the visual output of the DataFrame is nearly identical to that of a pandas DataFrame. The head
method returns the first 5 rows by default.
import pandas_cub as pdc
df = pdc.read_csv('data/employee.csv')
df.head()
DataFrame properties
The shape
property returns a tuple of the number of rows and columns
df.shape
(1535, 4)
The len
function returns just the number of rows.
len(df)
1535
The dtypes
property returns a DataFrame of the column names and their respective data type.
df.dtypes
| dept | race | gender | salary |
---|
0 | Houston Police Department-HPD | White | Male | 45279 |
1 | Houston Fire Department (HFD) | White | Male | 63166 |
2 | Houston Police Department-HPD | Black | Male | 66614 |
3 | Public Works & Engineering-PWE | Asian | Male | 71680 |
4 | Houston Airport System (HAS) | White | Male | 42390 |
The columns
property returns a list of the columns.
df.columns
['dept', 'race', 'gender', 'salary']
Set new columns by assigning the columns
property to a list.
df.columns = ['department', 'race', 'gender', 'salary']
df.head()
| Column Name | Data Type |
---|
0 | dept | string |
1 | race | string |
2 | gender | string |
3 | salary | int |
The values
property returns a single numpy array of all the data.
df.values
array([['Houston Police Department-HPD', 'White', 'Male', 45279],
['Houston Fire Department (HFD)', 'White', 'Male', 63166],
['Houston Police Department-HPD', 'Black', 'Male', 66614],
...,
['Houston Police Department-HPD', 'White', 'Male', 43443],
['Houston Police Department-HPD', 'Asian', 'Male', 55461],
['Houston Fire Department (HFD)', 'Hispanic', 'Male', 51194]],
dtype=object)
Subset selection
Subset selection is handled with the brackets. To select a single column, place that column name in the brackets.
df['race'].head()
| department | race | gender | salary |
---|
0 | Houston Police Department-HPD | White | Male | 45279 |
1 | Houston Fire Department (HFD) | White | Male | 63166 |
2 | Houston Police Department-HPD | Black | Male | 66614 |
3 | Public Works & Engineering-PWE | Asian | Male | 71680 |
4 | Houston Airport System (HAS) | White | Male | 42390 |
Select multiple columns with a list of strings.
df[['race', 'salary']].head()
| race |
---|
0 | White |
1 | White |
2 | Black |
3 | Asian |
4 | White |
Simultaneously select rows and columns by passing the brackets the row selection followed by the column selection separated by a comma. Here we use integers for rows and strings for columns.
rows = [10, 50, 100]
cols = ['salary', 'race']
df[rows, cols]
| race | salary |
---|
0 | White | 45279 |
1 | White | 63166 |
2 | Black | 66614 |
3 | Asian | 71680 |
4 | White | 42390 |
You can use integers for the columns as well.
rows = [10, 50, 100]
cols = [2, 0]
df[rows, cols]
| salary | race |
---|
0 | 77076 | Black |
1 | 81239 | White |
2 | 81239 | White |
You can use a single integer and not just a list.
df[99, 3]
| gender | department |
---|
0 | Male | Houston Police Department-HPD |
1 | Male | Houston Police Department-HPD |
2 | Male | Houston Police Department-HPD |
Or a single string for the columns
df[99, 'salary']
You can use a slice for the rows
df[20:100:10, ['race', 'gender']]
You can also slice the columns with either integers or strings
df[20:100:10, :2]
| race | gender |
---|
0 | White | Male |
1 | White | Male |
2 | Hispanic | Male |
3 | White | Male |
4 | White | Male |
5 | Hispanic | Male |
6 | Hispanic | Male |
7 | Black | Female |
df[20:100:10, 'department':'gender']
| department | race |
---|
0 | Houston Police Department-HPD | White |
1 | Houston Fire Department (HFD) | White |
2 | Houston Police Department-HPD | Hispanic |
3 | Houston Police Department-HPD | White |
4 | Houston Fire Department (HFD) | White |
5 | Houston Police Department-HPD | Hispanic |
6 | Houston Fire Department (HFD) | Hispanic |
7 | Houston Police Department-HPD | Black |
You can do boolean selection if you pass the brackets a one-column boolean DataFrame.
filt = df['salary'] > 100000
filt.head()
| department | race | gender |
---|
0 | Houston Police Department-HPD | White | Male |
1 | Houston Fire Department (HFD) | White | Male |
2 | Houston Police Department-HPD | Hispanic | Male |
3 | Houston Police Department-HPD | White | Male |
4 | Houston Fire Department (HFD) | White | Male |
5 | Houston Police Department-HPD | Hispanic | Male |
6 | Houston Fire Department (HFD) | Hispanic | Male |
7 | Houston Police Department-HPD | Black | Female |
df[filt].head()
| salary |
---|
0 | False |
1 | False |
2 | False |
3 | False |
4 | False |
df[filt, ['race', 'salary']].head()
| department | race | gender | salary |
---|
0 | Public Works & Engineering-PWE | White | Male | 107962 |
1 | Health & Human Services | Black | Male | 180416 |
2 | Houston Fire Department (HFD) | Hispanic | Male | 165216 |
3 | Health & Human Services | White | Female | 100791 |
4 | Houston Airport System (HAS) | White | Male | 120916 |
Assigning Columns
You can only assign an entire new column or overwrite an old one. You cannot assign a subset of the data. You can assign a new column with a single value like this:
df['bonus'] = 1000
df.head()
| race | salary |
---|
0 | White | 107962 |
1 | Black | 180416 |
2 | Hispanic | 165216 |
3 | White | 100791 |
4 | White | 120916 |
You can assign with a numpy array the same length as a column.
import numpy as np
df['bonus'] = np.random.randint(100, 5000, len(df))
df.head()
| department | race | gender | salary | bonus |
---|
0 | Houston Police Department-HPD | White | Male | 45279 | 1000 |
1 | Houston Fire Department (HFD) | White | Male | 63166 | 1000 |
2 | Houston Police Department-HPD | Black | Male | 66614 | 1000 |
3 | Public Works & Engineering-PWE | Asian | Male | 71680 | 1000 |
4 | Houston Airport System (HAS) | White | Male | 42390 | 1000 |
You can assign a new column with a one column DataFrame.
df['salary'] + df['bonus']
| department | race | gender | salary | bonus |
---|
0 | Houston Police Department-HPD | White | Male | 45279 | 1539 |
1 | Houston Fire Department (HFD) | White | Male | 63166 | 2885 |
2 | Houston Police Department-HPD | Black | Male | 66614 | 619 |
3 | Public Works & Engineering-PWE | Asian | Male | 71680 | 3010 |
4 | Houston Airport System (HAS) | White | Male | 42390 | 3180 |
| salary |
---|
0 | 46818 |
1 | 66051 |
2 | 67233 |
3 | 74690 |
4 | 45570 |
5 | 108399 |
6 | 53162 |
7 | 181993 |
8 | 33989 |
9 | 58163 |
... | ... |
1525 | 31775 |
1526 | 45543 |
1527 | 31705 |
1528 | 81998 |
1529 | 105211 |
1530 | 47273 |
1531 | 69367 |
1532 | 46605 |
1533 | 55721 |
1534 | 54934 |
df['total salary'] = df['salary'] + df['bonus']
df.head()
Arithmetic and comparison operators
df1 = df[['salary', 'bonus']] * 5
df1.head()
| department | race | gender | salary | bonus | total salary |
---|
0 | Houston Police Department-HPD | White | Male | 45279 | 1539 | 46818 |
1 | Houston Fire Department (HFD) | White | Male | 63166 | 2885 | 66051 |
2 | Houston Police Department-HPD | Black | Male | 66614 | 619 | 67233 |
3 | Public Works & Engineering-PWE | Asian | Male | 71680 | 3010 | 74690 |
4 | Houston Airport System (HAS) | White | Male | 42390 | 3180 | 45570 |
df1 = df[['salary', 'bonus']] > 100000
df1.head()
| salary | bonus |
---|
0 | 226395 | 7695 |
1 | 315830 | 14425 |
2 | 333070 | 3095 |
3 | 358400 | 15050 |
4 | 211950 | 15900 |
df1 = df['race'] == 'White'
df1.head()
| salary | bonus |
---|
0 | False | False |
1 | False | False |
2 | False | False |
3 | False | False |
4 | False | False |
Aggregation
Most of the common aggregation methods are available. They only work down the columns and not across the rows.
df.min()
| race |
---|
0 | True |
1 | True |
2 | False |
3 | False |
4 | True |
Columns that the aggregation does not work are dropped.
df.mean()
| department | race | gender | salary | bonus | total salary |
---|
0 | Health & Human Services | Asian | Female | 24960 | 108 | 26249 |
df.argmax()
| salary | bonus | total salary |
---|
0 | 56278.746 | 2532.181 | 58810.927 |
df['salary'].argmin()
| department | race | gender | salary | bonus | total salary |
---|
0 | 3 | 0 | 0 | 145 | 786 | 145 |
Check if all salaries are greater than 20000
df1 = df['salary'] > 20000
df1.all()
Count the number of non-missing values
df.count()
Get number of unique values.
df.nunique()
| department | race | gender | salary | bonus | total salary |
---|
0 | 1535 | 1535 | 1535 | 1535 | 1535 | 1535 |
Non-Aggregating Methods
These are methods that do not return a single value.
Get the unique values of each column. The unique
method returns a list of DataFrames containing the unique values for each column.
dfs = df.unique()
dfs[0]
| department | race | gender | salary | bonus | total salary |
---|
0 | 6 | 5 | 2 | 548 | 1321 | 1513 |
dfs[1]
| department |
---|
0 | Health & Human Services |
1 | Houston Airport System (HAS) |
2 | Houston Fire Department (HFD) |
3 | Houston Police Department-HPD |
4 | Parks & Recreation |
5 | Public Works & Engineering-PWE |
dfs[2]
| race |
---|
0 | Asian |
1 | Black |
2 | Hispanic |
3 | Native American |
4 | White |
Rename columns with a dictionary.
df.rename({'department':'dept', 'bonus':'BONUS'}).head()
Drop columns with a string or list of strings.
df.drop('race').head()
| dept | race | gender | salary | BONUS | total salary |
---|
0 | Houston Police Department-HPD | White | Male | 45279 | 1539 | 46818 |
1 | Houston Fire Department (HFD) | White | Male | 63166 | 2885 | 66051 |
2 | Houston Police Department-HPD | Black | Male | 66614 | 619 | 67233 |
3 | Public Works & Engineering-PWE | Asian | Male | 71680 | 3010 | 74690 |
4 | Houston Airport System (HAS) | White | Male | 42390 | 3180 | 45570 |
df.drop(['race', 'gender']).head()
| department | gender | salary | bonus | total salary |
---|
0 | Houston Police Department-HPD | Male | 45279 | 1539 | 46818 |
1 | Houston Fire Department (HFD) | Male | 63166 | 2885 | 66051 |
2 | Houston Police Department-HPD | Male | 66614 | 619 | 67233 |
3 | Public Works & Engineering-PWE | Male | 71680 | 3010 | 74690 |
4 | Houston Airport System (HAS) | Male | 42390 | 3180 | 45570 |
Absolute value
df.abs().head()
| department | salary | bonus | total salary |
---|
0 | Houston Police Department-HPD | 45279 | 1539 | 46818 |
1 | Houston Fire Department (HFD) | 63166 | 2885 | 66051 |
2 | Houston Police Department-HPD | 66614 | 619 | 67233 |
3 | Public Works & Engineering-PWE | 71680 | 3010 | 74690 |
4 | Houston Airport System (HAS) | 42390 | 3180 | 45570 |
Cumulative min, max, and sum
df.cummax().head()
| salary | bonus | total salary |
---|
0 | 45279 | 1539 | 46818 |
1 | 63166 | 2885 | 66051 |
2 | 66614 | 619 | 67233 |
3 | 71680 | 3010 | 74690 |
4 | 42390 | 3180 | 45570 |
Clip values to be within a range.
df.clip(40000, 60000).head()
| department | race | gender | salary | bonus | total salary |
---|
0 | Houston Police Department-HPD | White | Male | 45279 | 1539 | 46818 |
1 | Houston Police Department-HPD | White | Male | 63166 | 2885 | 66051 |
2 | Houston Police Department-HPD | White | Male | 66614 | 2885 | 67233 |
3 | Public Works & Engineering-PWE | White | Male | 71680 | 3010 | 74690 |
4 | Public Works & Engineering-PWE | White | Male | 71680 | 3180 | 74690 |
Round numeric columns
df.round(-3).head()
| salary | bonus | total salary |
---|
0 | 45279 | 40000 | 46818 |
1 | 60000 | 40000 | 60000 |
2 | 60000 | 40000 | 60000 |
3 | 60000 | 40000 | 60000 |
4 | 42390 | 40000 | 45570 |
Copy the DataFrame
df.copy().head()
| salary | bonus | total salary |
---|
0 | 45000 | 2000 | 47000 |
1 | 63000 | 3000 | 66000 |
2 | 67000 | 1000 | 67000 |
3 | 72000 | 3000 | 75000 |
4 | 42000 | 3000 | 46000 |
Take the nth difference.
df['salary'].diff(2).head(10)
| department | race | gender | salary | bonus | total salary |
---|
0 | Houston Police Department-HPD | White | Male | 45279 | 1539 | 46818 |
1 | Houston Fire Department (HFD) | White | Male | 63166 | 2885 | 66051 |
2 | Houston Police Department-HPD | Black | Male | 66614 | 619 | 67233 |
3 | Public Works & Engineering-PWE | Asian | Male | 71680 | 3010 | 74690 |
4 | Houston Airport System (HAS) | White | Male | 42390 | 3180 | 45570 |
Find the nth percentage change.
df['salary'].pct_change(2).head(10)
| salary |
---|
0 | nan |
1 | nan |
2 | 21335.000 |
3 | 8514.000 |
4 | -24224.000 |
5 | 36282.000 |
6 | 10254.000 |
7 | 72454.000 |
8 | -22297.000 |
9 | -125147.000 |
Sort the DataFrame by one or more columns
df.sort_values('salary').head()
| salary |
---|
0 | nan |
1 | nan |
2 | 0.471 |
3 | 0.135 |
4 | -0.364 |
5 | 0.506 |
6 | 0.242 |
7 | 0.671 |
8 | -0.424 |
9 | -0.694 |
Sort descending
df.sort_values('salary', asc=False).head()
| department | race | gender | salary | bonus | total salary |
---|
0 | Houston Police Department-HPD | Black | Female | 24960 | 1335 | 26295 |
1 | Public Works & Engineering-PWE | Hispanic | Male | 26104 | 666 | 26770 |
2 | Public Works & Engineering-PWE | Black | Female | 26125 | 3904 | 30029 |
3 | Houston Airport System (HAS) | Hispanic | Female | 26125 | 3352 | 29477 |
4 | Houston Airport System (HAS) | Black | Female | 26125 | 4150 | 30275 |
Sort by multiple columns
df.sort_values(['race', 'salary']).head()
| department | race | gender | salary | bonus | total salary |
---|
0 | Houston Fire Department (HFD) | White | Male | 210588 | 364 | 210952 |
1 | Houston Police Department-HPD | White | Male | 199596 | 2164 | 201760 |
2 | Houston Airport System (HAS) | Black | Male | 186192 | 948 | 187140 |
3 | Health & Human Services | Black | Male | 180416 | 1577 | 181993 |
4 | Public Works & Engineering-PWE | White | Female | 178331 | 4891 | 183222 |
Randomly sample the DataFrame
df.sample(n=3)
| department | race | gender | salary | bonus | total salary |
---|
0 | Houston Airport System (HAS) | Asian | Female | 26125 | 4665 | 30790 |
1 | Houston Police Department-HPD | Asian | Male | 27914 | 2068 | 29982 |
2 | Houston Police Department-HPD | Asian | Male | 28169 | 246 | 28415 |
3 | Public Works & Engineering-PWE | Asian | Male | 28995 | 3747 | 32742 |
4 | Public Works & Engineering-PWE | Asian | Male | 30347 | 1445 | 31792 |
Randomly sample a fraction
df.sample(frac=.005)
| department | race | gender | salary | bonus | total salary |
---|
0 | Houston Fire Department (HFD) | Black | Male | 61226 | 4609 | 65835 |
1 | Public Works & Engineering-PWE | Hispanic | Male | 31158 | 3846 | 35004 |
2 | Houston Police Department-HPD | Asian | Male | 60347 | 1086 | 61433 |
Sample with replacement
df.sample(n=10000, replace=True).head()
| department | race | gender | salary | bonus | total salary |
---|
0 | Public Works & Engineering-PWE | Asian | Male | 32635 | 3802 | 36437 |
1 | Houston Airport System (HAS) | Hispanic | Male | 42099 | 4378 | 46477 |
2 | Public Works & Engineering-PWE | Black | Male | 104389 | 3602 | 107991 |
3 | Public Works & Engineering-PWE | White | Male | 43514 | 2222 | 45736 |
4 | Public Works & Engineering-PWE | Black | Female | 33488 | 848 | 34336 |
5 | Houston Fire Department (HFD) | White | Male | 70181 | 1109 | 71290 |
6 | Public Works & Engineering-PWE | White | Male | 60715 | 4480 | 65195 |
String-only methods
Use the str
accessor to call methods available just to string columns. Pass the name of the string column as the first parameter for all these methods.
df.str.count('department', 'P').head()
| department | race | gender | salary | bonus | total salary |
---|
0 | Houston Fire Department (HFD) | White | Male | 61921 | 172 | 62093 |
1 | Houston Police Department-HPD | Asian | Male | 61643 | 659 | 62302 |
2 | Houston Police Department-HPD | White | Male | 43443 | 3539 | 46982 |
3 | Health & Human Services | White | Male | 120799 | 2447 | 123246 |
4 | Public Works & Engineering-PWE | White | Male | 98895 | 4571 | 103466 |
df.str.lower('department').head()
| department |
---|
0 | 2 |
1 | 0 |
2 | 2 |
3 | 2 |
4 | 0 |
df.str.find('department', 'Houston').head()
| department |
---|
0 | houston police department-hpd |
1 | houston fire department (hfd) |
2 | houston police department-hpd |
3 | public works & engineering-pwe |
4 | houston airport system (has) |
Grouping
pandas_cub provides the value_counts
method for simple frequency counting of unique values and pivot_table
for grouping and aggregating.
The value_counts
method returns a list of DataFrames, one for each column.
dfs = df[['department', 'race', 'gender']].value_counts()
dfs[0]
| department |
---|
0 | 0 |
1 | 0 |
2 | 0 |
3 | -1 |
4 | 0 |
dfs[1]
| department | count |
---|
0 | Houston Police Department-HPD | 570 |
1 | Houston Fire Department (HFD) | 365 |
2 | Public Works & Engineering-PWE | 341 |
3 | Health & Human Services | 103 |
4 | Houston Airport System (HAS) | 103 |
5 | Parks & Recreation | 53 |
dfs[2]
| race | count |
---|
0 | White | 542 |
1 | Black | 518 |
2 | Hispanic | 381 |
3 | Asian | 87 |
4 | Native American | 7 |
If your DataFrame has one column, a DataFrame and not a list is returned. You can also return the relative frequency by setting the normalize
parameter to True
.
df['race'].value_counts(normalize=True)
| gender | count |
---|
0 | Male | 1135 |
1 | Female | 400 |
The pivot_table
method allows to group by one or two columns and aggregate values from another column. Let's find the average salary for each race and gender. All parameters must be strings.
df.pivot_table(rows='race', columns='gender', values='salary', aggfunc='mean')
| race | count |
---|
0 | White | 0.353 |
1 | Black | 0.337 |
2 | Hispanic | 0.248 |
3 | Asian | 0.057 |
4 | Native American | 0.005 |
If you don't provide values
or aggfunc
then by default it will return frequency (a contingency table).
df.pivot_table(rows='race', columns='gender')
| race | Female | Male |
---|
0 | Asian | 58304.222 | 60622.957 |
1 | Black | 48133.382 | 51853.000 |
2 | Hispanic | 44216.960 | 55493.064 |
3 | Native American | 58844.333 | 68850.500 |
4 | White | 66415.528 | 63439.196 |
You can group by just a single column.
df.pivot_table(rows='department', values='salary', aggfunc='mean')
| race | Female | Male |
---|
0 | Asian | 18 | 69 |
1 | Black | 207 | 311 |
2 | Hispanic | 100 | 281 |
3 | Native American | 3 | 4 |
4 | White | 72 | 470 |
df.pivot_table(columns='department', values='salary', aggfunc='mean')
| department | mean |
---|
0 | Health & Human Services | 51324.981 |
1 | Houston Airport System (HAS) | 53990.369 |
2 | Houston Fire Department (HFD) | 59960.441 |
3 | Houston Police Department-HPD | 60428.746 |
4 | Parks & Recreation | 39426.151 |
5 | Public Works & Engineering-PWE | 50207.806 |
| Health & Human Services | Houston Airport System (HAS) | Houston Fire Department (HFD) | Houston Police Department-HPD | Parks & Recreation | Public Works & Engineering-PWE |
---|
0 | 51324.981 | 53990.369 | 59960.441 | 60428.746 | 39426.151 | 50207.806 |