Skip to main content

A simple data analysis library similar to pandas

Project description

How to use pandas_cub

The README.ipynb notebook will serve as the documentation and usage guide to pandas_cub.

Installation

pip install pandas-cub

What is pandas_cub?

pandas_cub is a simple data analysis library that emulates the functionality of the pandas library. The library is not meant for serious work. It was built as an assignment for one of Ted Petrou's Python classes. If you would like to complete the assignment on your own, visit this repository. There are about 40 steps and 100 tests that you must pass in order to rebuild the library. It is a good challenge and teaches you the fundamentals of how to build your own data analysis library.

pandas_cub functionality

pandas_cub has limited functionality but is still capable of a wide variety of data analysis tasks.

  • Subset selection with the brackets
  • Arithmetic and comparison operators (+, -, <, !=, etc...)
  • Aggregation of columns with most of the common functions (min, max, mean, median, etc...)
  • Grouping via pivot tables
  • String-only methods for columns containing strings
  • Reading in simple comma-separated value files
  • Several other methods

pandas_cub DataFrame

pandas_cub has a single main object, the DataFrame, to hold all of the data. The DataFrame is capable of holding 4 data types - booleans, integers, floats, and strings. All data is stored in NumPy arrays. panda_cub DataFrames have no index (as in pandas). The columns must be strings.

Missing value representation

Boolean and integer columns will have no missing value representation. The NumPy NaN is used for float columns and the Python None is used for string columns.

Code Examples

pandas_cub syntax is very similar to pandas, but implements much fewer methods. The below examples will cover just about all of the API.

Reading data with read_csv

pandas_cub consists of a single function, read_csv, that has a single parameter, the location of the file you would like to read in as a DataFrame. This function can only handle simple CSV's and the delimiter must be a comma. A sample employee dataset is provided in the data directory. Notice that the visual output of the DataFrame is nearly identical to that of a pandas DataFrame. The head method returns the first 5 rows by default.

import pandas_cub as pdc
df = pdc.read_csv('data/employee.csv')
df.head()

DataFrame properties

The shape property returns a tuple of the number of rows and columns

df.shape
(1535, 4)

The len function returns just the number of rows.

len(df)
1535

The dtypes property returns a DataFrame of the column names and their respective data type.

df.dtypes
dept race gender salary
0Houston Police Department-HPDWhite Male 45279
1Houston Fire Department (HFD)White Male 63166
2Houston Police Department-HPDBlack Male 66614
3Public Works & Engineering-PWEAsian Male 71680
4Houston Airport System (HAS)White Male 42390

The columns property returns a list of the columns.

df.columns
['dept', 'race', 'gender', 'salary']

Set new columns by assigning the columns property to a list.

df.columns = ['department', 'race', 'gender', 'salary']
df.head()
Column NameData Type
0dept string
1race string
2gender string
3salary int

The values property returns a single numpy array of all the data.

df.values
array([['Houston Police Department-HPD', 'White', 'Male', 45279],
       ['Houston Fire Department (HFD)', 'White', 'Male', 63166],
       ['Houston Police Department-HPD', 'Black', 'Male', 66614],
       ...,
       ['Houston Police Department-HPD', 'White', 'Male', 43443],
       ['Houston Police Department-HPD', 'Asian', 'Male', 55461],
       ['Houston Fire Department (HFD)', 'Hispanic', 'Male', 51194]],
      dtype=object)

Subset selection

Subset selection is handled with the brackets. To select a single column, place that column name in the brackets.

df['race'].head()
departmentrace gender salary
0Houston Police Department-HPDWhite Male 45279
1Houston Fire Department (HFD)White Male 63166
2Houston Police Department-HPDBlack Male 66614
3Public Works & Engineering-PWEAsian Male 71680
4Houston Airport System (HAS)White Male 42390

Select multiple columns with a list of strings.

df[['race', 'salary']].head()
race
0White
1White
2Black
3Asian
4White

Simultaneously select rows and columns by passing the brackets the row selection followed by the column selection separated by a comma. Here we use integers for rows and strings for columns.

rows = [10, 50, 100]
cols = ['salary', 'race']
df[rows, cols]
race salary
0White 45279
1White 63166
2Black 66614
3Asian 71680
4White 42390

You can use integers for the columns as well.

rows = [10, 50, 100]
cols = [2, 0]
df[rows, cols]
salary race
0 77076Black
1 81239White
2 81239White

You can use a single integer and not just a list.

df[99, 3]
gender department
0Male Houston Police Department-HPD
1Male Houston Police Department-HPD
2Male Houston Police Department-HPD

Or a single string for the columns

df[99, 'salary']
salary
0 66614

You can use a slice for the rows

df[20:100:10, ['race', 'gender']]
salary
0 66614

You can also slice the columns with either integers or strings

df[20:100:10, :2]
race gender
0White Male
1White Male
2Hispanic Male
3White Male
4White Male
5Hispanic Male
6Hispanic Male
7Black Female
df[20:100:10, 'department':'gender']
departmentrace
0Houston Police Department-HPDWhite
1Houston Fire Department (HFD)White
2Houston Police Department-HPDHispanic
3Houston Police Department-HPDWhite
4Houston Fire Department (HFD)White
5Houston Police Department-HPDHispanic
6Houston Fire Department (HFD)Hispanic
7Houston Police Department-HPDBlack

You can do boolean selection if you pass the brackets a one-column boolean DataFrame.

filt = df['salary'] > 100000
filt.head()
departmentrace gender
0Houston Police Department-HPDWhite Male
1Houston Fire Department (HFD)White Male
2Houston Police Department-HPDHispanic Male
3Houston Police Department-HPDWhite Male
4Houston Fire Department (HFD)White Male
5Houston Police Department-HPDHispanic Male
6Houston Fire Department (HFD)Hispanic Male
7Houston Police Department-HPDBlack Female
df[filt].head()
salary
0False
1False
2False
3False
4False
df[filt, ['race', 'salary']].head()
departmentrace gender salary
0Public Works & Engineering-PWEWhite Male 107962
1Health & Human ServicesBlack Male 180416
2Houston Fire Department (HFD)Hispanic Male 165216
3Health & Human ServicesWhite Female 100791
4Houston Airport System (HAS)White Male 120916

Assigning Columns

You can only assign an entire new column or overwrite an old one. You cannot assign a subset of the data. You can assign a new column with a single value like this:

df['bonus'] = 1000
df.head()
race salary
0White 107962
1Black 180416
2Hispanic 165216
3White 100791
4White 120916

You can assign with a numpy array the same length as a column.

import numpy as np
df['bonus'] = np.random.randint(100, 5000, len(df))
df.head()
departmentrace gender salary bonus
0Houston Police Department-HPDWhite Male 45279 1000
1Houston Fire Department (HFD)White Male 63166 1000
2Houston Police Department-HPDBlack Male 66614 1000
3Public Works & Engineering-PWEAsian Male 71680 1000
4Houston Airport System (HAS)White Male 42390 1000

You can assign a new column with a one column DataFrame.

df['salary'] + df['bonus']
departmentrace gender salary bonus
0Houston Police Department-HPDWhite Male 45279 1539
1Houston Fire Department (HFD)White Male 63166 2885
2Houston Police Department-HPDBlack Male 66614 619
3Public Works & Engineering-PWEAsian Male 71680 3010
4Houston Airport System (HAS)White Male 42390 3180
salary
0 46818
1 66051
2 67233
3 74690
4 45570
5 108399
6 53162
7 181993
8 33989
9 58163
......
1525 31775
1526 45543
1527 31705
1528 81998
1529 105211
1530 47273
1531 69367
1532 46605
1533 55721
1534 54934
df['total salary'] = df['salary'] + df['bonus']
df.head()

Arithmetic and comparison operators

df1 = df[['salary', 'bonus']] * 5
df1.head()
departmentrace gender salary bonus total salary
0Houston Police Department-HPDWhite Male 45279 1539 46818
1Houston Fire Department (HFD)White Male 63166 2885 66051
2Houston Police Department-HPDBlack Male 66614 619 67233
3Public Works & Engineering-PWEAsian Male 71680 3010 74690
4Houston Airport System (HAS)White Male 42390 3180 45570
df1 = df[['salary', 'bonus']] > 100000
df1.head()
salary bonus
0 226395 7695
1 315830 14425
2 333070 3095
3 358400 15050
4 211950 15900
df1 = df['race'] == 'White'
df1.head()
salary bonus
0FalseFalse
1FalseFalse
2FalseFalse
3FalseFalse
4FalseFalse

Aggregation

Most of the common aggregation methods are available. They only work down the columns and not across the rows.

df.min()
race
0True
1True
2False
3False
4True

Columns that the aggregation does not work are dropped.

df.mean()
departmentrace gender salary bonus total salary
0Health & Human ServicesAsian Female 24960 108 26249
df.argmax()
salary bonus total salary
0 56278.746 2532.181 58810.927
df['salary'].argmin()
departmentrace gender salary bonus total salary
0 3 0 0 145 786 145

Check if all salaries are greater than 20000

df1 = df['salary'] > 20000
df1.all()
salary
0 347

Count the number of non-missing values

df.count()
salary
0True

Get number of unique values.

df.nunique()
departmentrace gender salary bonus total salary
0 1535 1535 1535 1535 1535 1535

Non-Aggregating Methods

These are methods that do not return a single value.

Get the unique values of each column. The unique method returns a list of DataFrames containing the unique values for each column.

dfs = df.unique()
dfs[0]
departmentrace gender salary bonus total salary
0 6 5 2 548 1321 1513
dfs[1]
department
0Health & Human Services
1Houston Airport System (HAS)
2Houston Fire Department (HFD)
3Houston Police Department-HPD
4Parks & Recreation
5Public Works & Engineering-PWE
dfs[2]
race
0Asian
1Black
2Hispanic
3Native American
4White

Rename columns with a dictionary.

df.rename({'department':'dept', 'bonus':'BONUS'}).head()
gender
0Female
1Male

Drop columns with a string or list of strings.

df.drop('race').head()
dept race gender salary BONUS total salary
0Houston Police Department-HPDWhite Male 45279 1539 46818
1Houston Fire Department (HFD)White Male 63166 2885 66051
2Houston Police Department-HPDBlack Male 66614 619 67233
3Public Works & Engineering-PWEAsian Male 71680 3010 74690
4Houston Airport System (HAS)White Male 42390 3180 45570
df.drop(['race', 'gender']).head()
departmentgender salary bonus total salary
0Houston Police Department-HPDMale 45279 1539 46818
1Houston Fire Department (HFD)Male 63166 2885 66051
2Houston Police Department-HPDMale 66614 619 67233
3Public Works & Engineering-PWEMale 71680 3010 74690
4Houston Airport System (HAS)Male 42390 3180 45570

Absolute value

df.abs().head()
departmentsalary bonus total salary
0Houston Police Department-HPD 45279 1539 46818
1Houston Fire Department (HFD) 63166 2885 66051
2Houston Police Department-HPD 66614 619 67233
3Public Works & Engineering-PWE 71680 3010 74690
4Houston Airport System (HAS) 42390 3180 45570

Cumulative min, max, and sum

df.cummax().head()
salary bonus total salary
0 45279 1539 46818
1 63166 2885 66051
2 66614 619 67233
3 71680 3010 74690
4 42390 3180 45570

Clip values to be within a range.

df.clip(40000, 60000).head()
departmentrace gender salary bonus total salary
0Houston Police Department-HPDWhite Male 45279 1539 46818
1Houston Police Department-HPDWhite Male 63166 2885 66051
2Houston Police Department-HPDWhite Male 66614 2885 67233
3Public Works & Engineering-PWEWhite Male 71680 3010 74690
4Public Works & Engineering-PWEWhite Male 71680 3180 74690

Round numeric columns

df.round(-3).head()
salary bonus total salary
0 45279 40000 46818
1 60000 40000 60000
2 60000 40000 60000
3 60000 40000 60000
4 42390 40000 45570

Copy the DataFrame

df.copy().head()
salary bonus total salary
0 45000 2000 47000
1 63000 3000 66000
2 67000 1000 67000
3 72000 3000 75000
4 42000 3000 46000

Take the nth difference.

df['salary'].diff(2).head(10)
departmentrace gender salary bonus total salary
0Houston Police Department-HPDWhite Male 45279 1539 46818
1Houston Fire Department (HFD)White Male 63166 2885 66051
2Houston Police Department-HPDBlack Male 66614 619 67233
3Public Works & Engineering-PWEAsian Male 71680 3010 74690
4Houston Airport System (HAS)White Male 42390 3180 45570

Find the nth percentage change.

df['salary'].pct_change(2).head(10)
salary
0 nan
1 nan
2 21335.000
3 8514.000
4-24224.000
5 36282.000
6 10254.000
7 72454.000
8-22297.000
9-125147.000

Sort the DataFrame by one or more columns

df.sort_values('salary').head()
salary
0 nan
1 nan
2 0.471
3 0.135
4 -0.364
5 0.506
6 0.242
7 0.671
8 -0.424
9 -0.694

Sort descending

df.sort_values('salary', asc=False).head()
departmentrace gender salary bonus total salary
0Houston Police Department-HPDBlack Female 24960 1335 26295
1Public Works & Engineering-PWEHispanic Male 26104 666 26770
2Public Works & Engineering-PWEBlack Female 26125 3904 30029
3Houston Airport System (HAS)Hispanic Female 26125 3352 29477
4Houston Airport System (HAS)Black Female 26125 4150 30275

Sort by multiple columns

df.sort_values(['race', 'salary']).head()
departmentrace gender salary bonus total salary
0Houston Fire Department (HFD)White Male 210588 364 210952
1Houston Police Department-HPDWhite Male 199596 2164 201760
2Houston Airport System (HAS)Black Male 186192 948 187140
3Health & Human ServicesBlack Male 180416 1577 181993
4Public Works & Engineering-PWEWhite Female 178331 4891 183222

Randomly sample the DataFrame

df.sample(n=3)
departmentrace gender salary bonus total salary
0Houston Airport System (HAS)Asian Female 26125 4665 30790
1Houston Police Department-HPDAsian Male 27914 2068 29982
2Houston Police Department-HPDAsian Male 28169 246 28415
3Public Works & Engineering-PWEAsian Male 28995 3747 32742
4Public Works & Engineering-PWEAsian Male 30347 1445 31792

Randomly sample a fraction

df.sample(frac=.005)
departmentrace gender salary bonus total salary
0Houston Fire Department (HFD)Black Male 61226 4609 65835
1Public Works & Engineering-PWEHispanic Male 31158 3846 35004
2Houston Police Department-HPDAsian Male 60347 1086 61433

Sample with replacement

df.sample(n=10000, replace=True).head()
departmentrace gender salary bonus total salary
0Public Works & Engineering-PWEAsian Male 32635 3802 36437
1Houston Airport System (HAS)Hispanic Male 42099 4378 46477
2Public Works & Engineering-PWEBlack Male 104389 3602 107991
3Public Works & Engineering-PWEWhite Male 43514 2222 45736
4Public Works & Engineering-PWEBlack Female 33488 848 34336
5Houston Fire Department (HFD)White Male 70181 1109 71290
6Public Works & Engineering-PWEWhite Male 60715 4480 65195

String-only methods

Use the str accessor to call methods available just to string columns. Pass the name of the string column as the first parameter for all these methods.

df.str.count('department', 'P').head()
departmentrace gender salary bonus total salary
0Houston Fire Department (HFD)White Male 61921 172 62093
1Houston Police Department-HPDAsian Male 61643 659 62302
2Houston Police Department-HPDWhite Male 43443 3539 46982
3Health & Human ServicesWhite Male 120799 2447 123246
4Public Works & Engineering-PWEWhite Male 98895 4571 103466
df.str.lower('department').head()
department
0 2
1 0
2 2
3 2
4 0
df.str.find('department', 'Houston').head()
department
0houston police department-hpd
1houston fire department (hfd)
2houston police department-hpd
3public works & engineering-pwe
4houston airport system (has)

Grouping

pandas_cub provides the value_counts method for simple frequency counting of unique values and pivot_table for grouping and aggregating.

The value_counts method returns a list of DataFrames, one for each column.

dfs = df[['department', 'race', 'gender']].value_counts()
dfs[0]
department
0 0
1 0
2 0
3 -1
4 0
dfs[1]
departmentcount
0Houston Police Department-HPD 570
1Houston Fire Department (HFD) 365
2Public Works & Engineering-PWE 341
3Health & Human Services 103
4Houston Airport System (HAS) 103
5Parks & Recreation 53
dfs[2]
race count
0White 542
1Black 518
2Hispanic 381
3Asian 87
4Native American 7

If your DataFrame has one column, a DataFrame and not a list is returned. You can also return the relative frequency by setting the normalize parameter to True.

df['race'].value_counts(normalize=True)
gender count
0Male 1135
1Female 400

The pivot_table method allows to group by one or two columns and aggregate values from another column. Let's find the average salary for each race and gender. All parameters must be strings.

df.pivot_table(rows='race', columns='gender', values='salary', aggfunc='mean')
race count
0White 0.353
1Black 0.337
2Hispanic 0.248
3Asian 0.057
4Native American 0.005

If you don't provide values or aggfunc then by default it will return frequency (a contingency table).

df.pivot_table(rows='race', columns='gender')
race Female Male
0Asian 58304.222 60622.957
1Black 48133.382 51853.000
2Hispanic 44216.960 55493.064
3Native American 58844.333 68850.500
4White 66415.528 63439.196

You can group by just a single column.

df.pivot_table(rows='department', values='salary', aggfunc='mean')
race Female Male
0Asian 18 69
1Black 207 311
2Hispanic 100 281
3Native American 3 4
4White 72 470
df.pivot_table(columns='department', values='salary', aggfunc='mean')
departmentmean
0Health & Human Services 51324.981
1Houston Airport System (HAS) 53990.369
2Houston Fire Department (HFD) 59960.441
3Houston Police Department-HPD 60428.746
4Parks & Recreation 39426.151
5Public Works & Engineering-PWE 50207.806
Health & Human ServicesHouston Airport System (HAS)Houston Fire Department (HFD)Houston Police Department-HPDParks & RecreationPublic Works & Engineering-PWE
0 51324.981 53990.369 59960.441 60428.746 39426.151 50207.806

Project details


Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

pandas_cub-0.0.3.tar.gz (29.5 kB view hashes)

Uploaded Source

Built Distribution

pandas_cub-0.0.3-py3-none-any.whl (19.4 kB view hashes)

Uploaded Python 3

Supported by

AWS AWS Cloud computing and Security Sponsor Datadog Datadog Monitoring Fastly Fastly CDN Google Google Download Analytics Microsoft Microsoft PSF Sponsor Pingdom Pingdom Monitoring Sentry Sentry Error logging StatusPage StatusPage Status page