A simple data analysis library similar to pandas
Project description
How to use pandas_cub
=====================
The README.ipynb notebook will serve as the documentation and usage
guide to pandas_cub.
Installation
------------
``pip install pandas-cub``
What is pandas_cub?
-------------------
pandas_cub is a simple data analysis library that emulates the
functionality of the pandas library. The library is not meant for
serious work. It was built as an assignment for one of Ted Petrou’s
Python classes. If you would like to complete the assignment on your
own, visit `this repository <https://github.com/tdpetrou/pandas_cub>`__.
There are about 40 steps and 100 tests that you must pass in order to
rebuild the library. It is a good challenge and teaches you the
fundamentals of how to build your own data analysis library.
pandas_cub functionality
------------------------
pandas_cub has limited functionality but is still capable of a wide
variety of data analysis tasks.
- Subset selection with the brackets
- Arithmetic and comparison operators (+, -, <, !=, etc…)
- Aggregation of columns with most of the common functions (min, max,
mean, median, etc…)
- Grouping via pivot tables
- String-only methods for columns containing strings
- Reading in simple comma-separated value files
- Several other methods
pandas_cub DataFrame
--------------------
pandas_cub has a single main object, the DataFrame, to hold all of the
data. The DataFrame is capable of holding 4 data types - booleans,
integers, floats, and strings. All data is stored in NumPy arrays.
panda_cub DataFrames have no index (as in pandas). The columns must be
strings.
Missing value representation
~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Boolean and integer columns will have no missing value representation.
The NumPy NaN is used for float columns and the Python None is used for
string columns.
Code Examples
-------------
pandas_cub syntax is very similar to pandas, but implements much fewer
methods. The below examples will cover just about all of the API.
Reading data with ``read_csv``
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
pandas_cub consists of a single function, ``read_csv``, that has a
single parameter, the location of the file you would like to read in as
a DataFrame. This function can only handle simple CSV’s and the
delimiter must be a comma. A sample employee dataset is provided in the
data directory. Notice that the visual output of the DataFrame is nearly
identical to that of a pandas DataFrame. The ``head`` method returns the
first 5 rows by default.
.. code:: ipython3
import pandas_cub as pdc
.. code:: ipython3
df = pdc.read_csv('data/employee.csv')
df.head()
.. raw:: html
<table><thead><tr><th></th><th>dept </th><th>race </th><th>gender </th><th>salary </th></tr></thead><tbody><tr><td><strong>0</strong></td><td>Houston Police Department-HPD</td><td>White </td><td>Male </td><td> 45279</td></tr><tr><td><strong>1</strong></td><td>Houston Fire Department (HFD)</td><td>White </td><td>Male </td><td> 63166</td></tr><tr><td><strong>2</strong></td><td>Houston Police Department-HPD</td><td>Black </td><td>Male </td><td> 66614</td></tr><tr><td><strong>3</strong></td><td>Public Works & Engineering-PWE</td><td>Asian </td><td>Male </td><td> 71680</td></tr><tr><td><strong>4</strong></td><td>Houston Airport System (HAS)</td><td>White </td><td>Male </td><td> 42390</td></tr>
DataFrame properties
~~~~~~~~~~~~~~~~~~~~
The ``shape`` property returns a tuple of the number of rows and columns
.. code:: ipython3
df.shape
.. parsed-literal::
(1535, 4)
The ``len`` function returns just the number of rows.
.. code:: ipython3
len(df)
.. parsed-literal::
1535
The ``dtypes`` property returns a DataFrame of the column names and
their respective data type.
.. code:: ipython3
df.dtypes
.. raw:: html
<table><thead><tr><th></th><th>Column Name</th><th>Data Type </th></tr></thead><tbody><tr><td><strong>0</strong></td><td>dept </td><td>string </td></tr><tr><td><strong>1</strong></td><td>race </td><td>string </td></tr><tr><td><strong>2</strong></td><td>gender </td><td>string </td></tr><tr><td><strong>3</strong></td><td>salary </td><td>int </td></tr>
The ``columns`` property returns a list of the columns.
.. code:: ipython3
df.columns
.. parsed-literal::
['dept', 'race', 'gender', 'salary']
Set new columns by assigning the ``columns`` property to a list.
.. code:: ipython3
df.columns = ['department', 'race', 'gender', 'salary']
df.head()
.. raw:: html
<table><thead><tr><th></th><th>department</th><th>race </th><th>gender </th><th>salary </th></tr></thead><tbody><tr><td><strong>0</strong></td><td>Houston Police Department-HPD</td><td>White </td><td>Male </td><td> 45279</td></tr><tr><td><strong>1</strong></td><td>Houston Fire Department (HFD)</td><td>White </td><td>Male </td><td> 63166</td></tr><tr><td><strong>2</strong></td><td>Houston Police Department-HPD</td><td>Black </td><td>Male </td><td> 66614</td></tr><tr><td><strong>3</strong></td><td>Public Works & Engineering-PWE</td><td>Asian </td><td>Male </td><td> 71680</td></tr><tr><td><strong>4</strong></td><td>Houston Airport System (HAS)</td><td>White </td><td>Male </td><td> 42390</td></tr>
The ``values`` property returns a single numpy array of all the data.
.. code:: ipython3
df.values
.. parsed-literal::
array([['Houston Police Department-HPD', 'White', 'Male', 45279],
['Houston Fire Department (HFD)', 'White', 'Male', 63166],
['Houston Police Department-HPD', 'Black', 'Male', 66614],
...,
['Houston Police Department-HPD', 'White', 'Male', 43443],
['Houston Police Department-HPD', 'Asian', 'Male', 55461],
['Houston Fire Department (HFD)', 'Hispanic', 'Male', 51194]],
dtype=object)
Subset selection
~~~~~~~~~~~~~~~~
Subset selection is handled with the brackets. To select a single
column, place that column name in the brackets.
.. code:: ipython3
df['race'].head()
.. raw:: html
<table><thead><tr><th></th><th>race </th></tr></thead><tbody><tr><td><strong>0</strong></td><td>White </td></tr><tr><td><strong>1</strong></td><td>White </td></tr><tr><td><strong>2</strong></td><td>Black </td></tr><tr><td><strong>3</strong></td><td>Asian </td></tr><tr><td><strong>4</strong></td><td>White </td></tr>
Select multiple columns with a list of strings.
.. code:: ipython3
df[['race', 'salary']].head()
.. raw:: html
<table><thead><tr><th></th><th>race </th><th>salary </th></tr></thead><tbody><tr><td><strong>0</strong></td><td>White </td><td> 45279</td></tr><tr><td><strong>1</strong></td><td>White </td><td> 63166</td></tr><tr><td><strong>2</strong></td><td>Black </td><td> 66614</td></tr><tr><td><strong>3</strong></td><td>Asian </td><td> 71680</td></tr><tr><td><strong>4</strong></td><td>White </td><td> 42390</td></tr>
Simultaneously select rows and columns by passing the brackets the row
selection followed by the column selection separated by a comma. Here we
use integers for rows and strings for columns.
.. code:: ipython3
rows = [10, 50, 100]
cols = ['salary', 'race']
df[rows, cols]
.. raw:: html
<table><thead><tr><th></th><th>salary </th><th>race </th></tr></thead><tbody><tr><td><strong>0</strong></td><td> 77076</td><td>Black </td></tr><tr><td><strong>1</strong></td><td> 81239</td><td>White </td></tr><tr><td><strong>2</strong></td><td> 81239</td><td>White </td></tr>
You can use integers for the columns as well.
.. code:: ipython3
rows = [10, 50, 100]
cols = [2, 0]
df[rows, cols]
.. raw:: html
<table><thead><tr><th></th><th>gender </th><th>department</th></tr></thead><tbody><tr><td><strong>0</strong></td><td>Male </td><td>Houston Police Department-HPD</td></tr><tr><td><strong>1</strong></td><td>Male </td><td>Houston Police Department-HPD</td></tr><tr><td><strong>2</strong></td><td>Male </td><td>Houston Police Department-HPD</td></tr>
You can use a single integer and not just a list.
.. code:: ipython3
df[99, 3]
.. raw:: html
<table><thead><tr><th></th><th>salary </th></tr></thead><tbody><tr><td><strong>0</strong></td><td> 66614</td></tr>
Or a single string for the columns
.. code:: ipython3
df[99, 'salary']
.. raw:: html
<table><thead><tr><th></th><th>salary </th></tr></thead><tbody><tr><td><strong>0</strong></td><td> 66614</td></tr>
You can use a slice for the rows
.. code:: ipython3
df[20:100:10, ['race', 'gender']]
.. raw:: html
<table><thead><tr><th></th><th>race </th><th>gender </th></tr></thead><tbody><tr><td><strong>0</strong></td><td>White </td><td>Male </td></tr><tr><td><strong>1</strong></td><td>White </td><td>Male </td></tr><tr><td><strong>2</strong></td><td>Hispanic </td><td>Male </td></tr><tr><td><strong>3</strong></td><td>White </td><td>Male </td></tr><tr><td><strong>4</strong></td><td>White </td><td>Male </td></tr><tr><td><strong>5</strong></td><td>Hispanic </td><td>Male </td></tr><tr><td><strong>6</strong></td><td>Hispanic </td><td>Male </td></tr><tr><td><strong>7</strong></td><td>Black </td><td>Female </td></tr>
You can also slice the columns with either integers or strings
.. code:: ipython3
df[20:100:10, :2]
.. raw:: html
<table><thead><tr><th></th><th>department</th><th>race </th></tr></thead><tbody><tr><td><strong>0</strong></td><td>Houston Police Department-HPD</td><td>White </td></tr><tr><td><strong>1</strong></td><td>Houston Fire Department (HFD)</td><td>White </td></tr><tr><td><strong>2</strong></td><td>Houston Police Department-HPD</td><td>Hispanic </td></tr><tr><td><strong>3</strong></td><td>Houston Police Department-HPD</td><td>White </td></tr><tr><td><strong>4</strong></td><td>Houston Fire Department (HFD)</td><td>White </td></tr><tr><td><strong>5</strong></td><td>Houston Police Department-HPD</td><td>Hispanic </td></tr><tr><td><strong>6</strong></td><td>Houston Fire Department (HFD)</td><td>Hispanic </td></tr><tr><td><strong>7</strong></td><td>Houston Police Department-HPD</td><td>Black </td></tr>
.. code:: ipython3
df[20:100:10, 'department':'gender']
.. raw:: html
<table><thead><tr><th></th><th>department</th><th>race </th><th>gender </th></tr></thead><tbody><tr><td><strong>0</strong></td><td>Houston Police Department-HPD</td><td>White </td><td>Male </td></tr><tr><td><strong>1</strong></td><td>Houston Fire Department (HFD)</td><td>White </td><td>Male </td></tr><tr><td><strong>2</strong></td><td>Houston Police Department-HPD</td><td>Hispanic </td><td>Male </td></tr><tr><td><strong>3</strong></td><td>Houston Police Department-HPD</td><td>White </td><td>Male </td></tr><tr><td><strong>4</strong></td><td>Houston Fire Department (HFD)</td><td>White </td><td>Male </td></tr><tr><td><strong>5</strong></td><td>Houston Police Department-HPD</td><td>Hispanic </td><td>Male </td></tr><tr><td><strong>6</strong></td><td>Houston Fire Department (HFD)</td><td>Hispanic </td><td>Male </td></tr><tr><td><strong>7</strong></td><td>Houston Police Department-HPD</td><td>Black </td><td>Female </td></tr>
You can do boolean selection if you pass the brackets a one-column
boolean DataFrame.
.. code:: ipython3
filt = df['salary'] > 100000
filt.head()
.. raw:: html
<table><thead><tr><th></th><th>salary </th></tr></thead><tbody><tr><td><strong>0</strong></td><td>False</td></tr><tr><td><strong>1</strong></td><td>False</td></tr><tr><td><strong>2</strong></td><td>False</td></tr><tr><td><strong>3</strong></td><td>False</td></tr><tr><td><strong>4</strong></td><td>False</td></tr>
.. code:: ipython3
df[filt].head()
.. raw:: html
<table><thead><tr><th></th><th>department</th><th>race </th><th>gender </th><th>salary </th></tr></thead><tbody><tr><td><strong>0</strong></td><td>Public Works & Engineering-PWE</td><td>White </td><td>Male </td><td> 107962</td></tr><tr><td><strong>1</strong></td><td>Health & Human Services</td><td>Black </td><td>Male </td><td> 180416</td></tr><tr><td><strong>2</strong></td><td>Houston Fire Department (HFD)</td><td>Hispanic </td><td>Male </td><td> 165216</td></tr><tr><td><strong>3</strong></td><td>Health & Human Services</td><td>White </td><td>Female </td><td> 100791</td></tr><tr><td><strong>4</strong></td><td>Houston Airport System (HAS)</td><td>White </td><td>Male </td><td> 120916</td></tr>
.. code:: ipython3
df[filt, ['race', 'salary']].head()
.. raw:: html
<table><thead><tr><th></th><th>race </th><th>salary </th></tr></thead><tbody><tr><td><strong>0</strong></td><td>White </td><td> 107962</td></tr><tr><td><strong>1</strong></td><td>Black </td><td> 180416</td></tr><tr><td><strong>2</strong></td><td>Hispanic </td><td> 165216</td></tr><tr><td><strong>3</strong></td><td>White </td><td> 100791</td></tr><tr><td><strong>4</strong></td><td>White </td><td> 120916</td></tr>
Assigning Columns
~~~~~~~~~~~~~~~~~
You can only assign an entire new column or overwrite an old one. You
cannot assign a subset of the data. You can assign a new column with a
single value like this:
.. code:: ipython3
df['bonus'] = 1000
df.head()
.. raw:: html
<table><thead><tr><th></th><th>department</th><th>race </th><th>gender </th><th>salary </th><th>bonus </th></tr></thead><tbody><tr><td><strong>0</strong></td><td>Houston Police Department-HPD</td><td>White </td><td>Male </td><td> 45279</td><td> 1000</td></tr><tr><td><strong>1</strong></td><td>Houston Fire Department (HFD)</td><td>White </td><td>Male </td><td> 63166</td><td> 1000</td></tr><tr><td><strong>2</strong></td><td>Houston Police Department-HPD</td><td>Black </td><td>Male </td><td> 66614</td><td> 1000</td></tr><tr><td><strong>3</strong></td><td>Public Works & Engineering-PWE</td><td>Asian </td><td>Male </td><td> 71680</td><td> 1000</td></tr><tr><td><strong>4</strong></td><td>Houston Airport System (HAS)</td><td>White </td><td>Male </td><td> 42390</td><td> 1000</td></tr>
You can assign with a numpy array the same length as a column.
.. code:: ipython3
import numpy as np
df['bonus'] = np.random.randint(100, 5000, len(df))
df.head()
.. raw:: html
<table><thead><tr><th></th><th>department</th><th>race </th><th>gender </th><th>salary </th><th>bonus </th></tr></thead><tbody><tr><td><strong>0</strong></td><td>Houston Police Department-HPD</td><td>White </td><td>Male </td><td> 45279</td><td> 1539</td></tr><tr><td><strong>1</strong></td><td>Houston Fire Department (HFD)</td><td>White </td><td>Male </td><td> 63166</td><td> 2885</td></tr><tr><td><strong>2</strong></td><td>Houston Police Department-HPD</td><td>Black </td><td>Male </td><td> 66614</td><td> 619</td></tr><tr><td><strong>3</strong></td><td>Public Works & Engineering-PWE</td><td>Asian </td><td>Male </td><td> 71680</td><td> 3010</td></tr><tr><td><strong>4</strong></td><td>Houston Airport System (HAS)</td><td>White </td><td>Male </td><td> 42390</td><td> 3180</td></tr>
You can assign a new column with a one column DataFrame.
.. code:: ipython3
df['salary'] + df['bonus']
.. raw:: html
<table><thead><tr><th></th><th>salary </th></tr></thead><tbody><tr><td><strong>0</strong></td><td> 46818</td></tr><tr><td><strong>1</strong></td><td> 66051</td></tr><tr><td><strong>2</strong></td><td> 67233</td></tr><tr><td><strong>3</strong></td><td> 74690</td></tr><tr><td><strong>4</strong></td><td> 45570</td></tr><tr><td><strong>5</strong></td><td> 108399</td></tr><tr><td><strong>6</strong></td><td> 53162</td></tr><tr><td><strong>7</strong></td><td> 181993</td></tr><tr><td><strong>8</strong></td><td> 33989</td></tr><tr><td><strong>9</strong></td><td> 58163</td></tr><tr><strong><td>...</td></strong><td>...</td></tr><tr><td><strong>1525</strong></td><td> 31775</td></tr><tr><td><strong>1526</strong></td><td> 45543</td></tr><tr><td><strong>1527</strong></td><td> 31705</td></tr><tr><td><strong>1528</strong></td><td> 81998</td></tr><tr><td><strong>1529</strong></td><td> 105211</td></tr><tr><td><strong>1530</strong></td><td> 47273</td></tr><tr><td><strong>1531</strong></td><td> 69367</td></tr><tr><td><strong>1532</strong></td><td> 46605</td></tr><tr><td><strong>1533</strong></td><td> 55721</td></tr><tr><td><strong>1534</strong></td><td> 54934</td></tr></tbody></table>
.. code:: ipython3
df['total salary'] = df['salary'] + df['bonus']
df.head()
.. raw:: html
<table><thead><tr><th></th><th>department</th><th>race </th><th>gender </th><th>salary </th><th>bonus </th><th>total salary</th></tr></thead><tbody><tr><td><strong>0</strong></td><td>Houston Police Department-HPD</td><td>White </td><td>Male </td><td> 45279</td><td> 1539</td><td> 46818</td></tr><tr><td><strong>1</strong></td><td>Houston Fire Department (HFD)</td><td>White </td><td>Male </td><td> 63166</td><td> 2885</td><td> 66051</td></tr><tr><td><strong>2</strong></td><td>Houston Police Department-HPD</td><td>Black </td><td>Male </td><td> 66614</td><td> 619</td><td> 67233</td></tr><tr><td><strong>3</strong></td><td>Public Works & Engineering-PWE</td><td>Asian </td><td>Male </td><td> 71680</td><td> 3010</td><td> 74690</td></tr><tr><td><strong>4</strong></td><td>Houston Airport System (HAS)</td><td>White </td><td>Male </td><td> 42390</td><td> 3180</td><td> 45570</td></tr>
Arithmetic and comparison operators
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
.. code:: ipython3
df1 = df[['salary', 'bonus']] * 5
df1.head()
.. raw:: html
<table><thead><tr><th></th><th>salary </th><th>bonus </th></tr></thead><tbody><tr><td><strong>0</strong></td><td> 226395</td><td> 7695</td></tr><tr><td><strong>1</strong></td><td> 315830</td><td> 14425</td></tr><tr><td><strong>2</strong></td><td> 333070</td><td> 3095</td></tr><tr><td><strong>3</strong></td><td> 358400</td><td> 15050</td></tr><tr><td><strong>4</strong></td><td> 211950</td><td> 15900</td></tr>
.. code:: ipython3
df1 = df[['salary', 'bonus']] > 100000
df1.head()
.. raw:: html
<table><thead><tr><th></th><th>salary </th><th>bonus </th></tr></thead><tbody><tr><td><strong>0</strong></td><td>False</td><td>False</td></tr><tr><td><strong>1</strong></td><td>False</td><td>False</td></tr><tr><td><strong>2</strong></td><td>False</td><td>False</td></tr><tr><td><strong>3</strong></td><td>False</td><td>False</td></tr><tr><td><strong>4</strong></td><td>False</td><td>False</td></tr>
.. code:: ipython3
df1 = df['race'] == 'White'
df1.head()
.. raw:: html
<table><thead><tr><th></th><th>race </th></tr></thead><tbody><tr><td><strong>0</strong></td><td>True</td></tr><tr><td><strong>1</strong></td><td>True</td></tr><tr><td><strong>2</strong></td><td>False</td></tr><tr><td><strong>3</strong></td><td>False</td></tr><tr><td><strong>4</strong></td><td>True</td></tr>
Aggregation
~~~~~~~~~~~
Most of the common aggregation methods are available. They only work
down the columns and not across the rows.
.. code:: ipython3
df.min()
.. raw:: html
<table><thead><tr><th></th><th>department</th><th>race </th><th>gender </th><th>salary </th><th>bonus </th><th>total salary</th></tr></thead><tbody><tr><td><strong>0</strong></td><td>Health & Human Services</td><td>Asian </td><td>Female </td><td> 24960</td><td> 108</td><td> 26249</td></tr>
Columns that the aggregation does not work are dropped.
.. code:: ipython3
df.mean()
.. raw:: html
<table><thead><tr><th></th><th>salary </th><th>bonus </th><th>total salary</th></tr></thead><tbody><tr><td><strong>0</strong></td><td> 56278.746</td><td> 2532.181</td><td> 58810.927</td></tr>
.. code:: ipython3
df.argmax()
.. raw:: html
<table><thead><tr><th></th><th>department</th><th>race </th><th>gender </th><th>salary </th><th>bonus </th><th>total salary</th></tr></thead><tbody><tr><td><strong>0</strong></td><td> 3</td><td> 0</td><td> 0</td><td> 145</td><td> 786</td><td> 145</td></tr>
.. code:: ipython3
df['salary'].argmin()
.. raw:: html
<table><thead><tr><th></th><th>salary </th></tr></thead><tbody><tr><td><strong>0</strong></td><td> 347</td></tr>
Check if all salaries are greater than 20000
.. code:: ipython3
df1 = df['salary'] > 20000
df1.all()
.. raw:: html
<table><thead><tr><th></th><th>salary </th></tr></thead><tbody><tr><td><strong>0</strong></td><td>True</td></tr>
Count the number of non-missing values
.. code:: ipython3
df.count()
.. raw:: html
<table><thead><tr><th></th><th>department</th><th>race </th><th>gender </th><th>salary </th><th>bonus </th><th>total salary</th></tr></thead><tbody><tr><td><strong>0</strong></td><td> 1535</td><td> 1535</td><td> 1535</td><td> 1535</td><td> 1535</td><td> 1535</td></tr>
Get number of unique values.
.. code:: ipython3
df.nunique()
.. raw:: html
<table><thead><tr><th></th><th>department</th><th>race </th><th>gender </th><th>salary </th><th>bonus </th><th>total salary</th></tr></thead><tbody><tr><td><strong>0</strong></td><td> 6</td><td> 5</td><td> 2</td><td> 548</td><td> 1321</td><td> 1513</td></tr>
Non-Aggregating Methods
~~~~~~~~~~~~~~~~~~~~~~~
These are methods that do not return a single value.
Get the unique values of each column. The ``unique`` method returns a
list of DataFrames containing the unique values for each column.
.. code:: ipython3
dfs = df.unique()
.. code:: ipython3
dfs[0]
.. raw:: html
<table><thead><tr><th></th><th>department</th></tr></thead><tbody><tr><td><strong>0</strong></td><td>Health & Human Services</td></tr><tr><td><strong>1</strong></td><td>Houston Airport System (HAS)</td></tr><tr><td><strong>2</strong></td><td>Houston Fire Department (HFD)</td></tr><tr><td><strong>3</strong></td><td>Houston Police Department-HPD</td></tr><tr><td><strong>4</strong></td><td>Parks & Recreation</td></tr><tr><td><strong>5</strong></td><td>Public Works & Engineering-PWE</td></tr>
.. code:: ipython3
dfs[1]
.. raw:: html
<table><thead><tr><th></th><th>race </th></tr></thead><tbody><tr><td><strong>0</strong></td><td>Asian </td></tr><tr><td><strong>1</strong></td><td>Black </td></tr><tr><td><strong>2</strong></td><td>Hispanic </td></tr><tr><td><strong>3</strong></td><td>Native American</td></tr><tr><td><strong>4</strong></td><td>White </td></tr>
.. code:: ipython3
dfs[2]
.. raw:: html
<table><thead><tr><th></th><th>gender </th></tr></thead><tbody><tr><td><strong>0</strong></td><td>Female </td></tr><tr><td><strong>1</strong></td><td>Male </td></tr>
Rename columns with a dictionary.
.. code:: ipython3
df.rename({'department':'dept', 'bonus':'BONUS'}).head()
.. raw:: html
<table><thead><tr><th></th><th>dept </th><th>race </th><th>gender </th><th>salary </th><th>BONUS </th><th>total salary</th></tr></thead><tbody><tr><td><strong>0</strong></td><td>Houston Police Department-HPD</td><td>White </td><td>Male </td><td> 45279</td><td> 1539</td><td> 46818</td></tr><tr><td><strong>1</strong></td><td>Houston Fire Department (HFD)</td><td>White </td><td>Male </td><td> 63166</td><td> 2885</td><td> 66051</td></tr><tr><td><strong>2</strong></td><td>Houston Police Department-HPD</td><td>Black </td><td>Male </td><td> 66614</td><td> 619</td><td> 67233</td></tr><tr><td><strong>3</strong></td><td>Public Works & Engineering-PWE</td><td>Asian </td><td>Male </td><td> 71680</td><td> 3010</td><td> 74690</td></tr><tr><td><strong>4</strong></td><td>Houston Airport System (HAS)</td><td>White </td><td>Male </td><td> 42390</td><td> 3180</td><td> 45570</td></tr>
Drop columns with a string or list of strings.
.. code:: ipython3
df.drop('race').head()
.. raw:: html
<table><thead><tr><th></th><th>department</th><th>gender </th><th>salary </th><th>bonus </th><th>total salary</th></tr></thead><tbody><tr><td><strong>0</strong></td><td>Houston Police Department-HPD</td><td>Male </td><td> 45279</td><td> 1539</td><td> 46818</td></tr><tr><td><strong>1</strong></td><td>Houston Fire Department (HFD)</td><td>Male </td><td> 63166</td><td> 2885</td><td> 66051</td></tr><tr><td><strong>2</strong></td><td>Houston Police Department-HPD</td><td>Male </td><td> 66614</td><td> 619</td><td> 67233</td></tr><tr><td><strong>3</strong></td><td>Public Works & Engineering-PWE</td><td>Male </td><td> 71680</td><td> 3010</td><td> 74690</td></tr><tr><td><strong>4</strong></td><td>Houston Airport System (HAS)</td><td>Male </td><td> 42390</td><td> 3180</td><td> 45570</td></tr>
.. code:: ipython3
df.drop(['race', 'gender']).head()
.. raw:: html
<table><thead><tr><th></th><th>department</th><th>salary </th><th>bonus </th><th>total salary</th></tr></thead><tbody><tr><td><strong>0</strong></td><td>Houston Police Department-HPD</td><td> 45279</td><td> 1539</td><td> 46818</td></tr><tr><td><strong>1</strong></td><td>Houston Fire Department (HFD)</td><td> 63166</td><td> 2885</td><td> 66051</td></tr><tr><td><strong>2</strong></td><td>Houston Police Department-HPD</td><td> 66614</td><td> 619</td><td> 67233</td></tr><tr><td><strong>3</strong></td><td>Public Works & Engineering-PWE</td><td> 71680</td><td> 3010</td><td> 74690</td></tr><tr><td><strong>4</strong></td><td>Houston Airport System (HAS)</td><td> 42390</td><td> 3180</td><td> 45570</td></tr>
Absolute value
.. code:: ipython3
df.abs().head()
.. raw:: html
<table><thead><tr><th></th><th>salary </th><th>bonus </th><th>total salary</th></tr></thead><tbody><tr><td><strong>0</strong></td><td> 45279</td><td> 1539</td><td> 46818</td></tr><tr><td><strong>1</strong></td><td> 63166</td><td> 2885</td><td> 66051</td></tr><tr><td><strong>2</strong></td><td> 66614</td><td> 619</td><td> 67233</td></tr><tr><td><strong>3</strong></td><td> 71680</td><td> 3010</td><td> 74690</td></tr><tr><td><strong>4</strong></td><td> 42390</td><td> 3180</td><td> 45570</td></tr>
Cumulative min, max, and sum
.. code:: ipython3
df.cummax().head()
.. raw:: html
<table><thead><tr><th></th><th>department</th><th>race </th><th>gender </th><th>salary </th><th>bonus </th><th>total salary</th></tr></thead><tbody><tr><td><strong>0</strong></td><td>Houston Police Department-HPD</td><td>White </td><td>Male </td><td> 45279</td><td> 1539</td><td> 46818</td></tr><tr><td><strong>1</strong></td><td>Houston Police Department-HPD</td><td>White </td><td>Male </td><td> 63166</td><td> 2885</td><td> 66051</td></tr><tr><td><strong>2</strong></td><td>Houston Police Department-HPD</td><td>White </td><td>Male </td><td> 66614</td><td> 2885</td><td> 67233</td></tr><tr><td><strong>3</strong></td><td>Public Works & Engineering-PWE</td><td>White </td><td>Male </td><td> 71680</td><td> 3010</td><td> 74690</td></tr><tr><td><strong>4</strong></td><td>Public Works & Engineering-PWE</td><td>White </td><td>Male </td><td> 71680</td><td> 3180</td><td> 74690</td></tr>
Clip values to be within a range.
.. code:: ipython3
df.clip(40000, 60000).head()
.. raw:: html
<table><thead><tr><th></th><th>salary </th><th>bonus </th><th>total salary</th></tr></thead><tbody><tr><td><strong>0</strong></td><td> 45279</td><td> 40000</td><td> 46818</td></tr><tr><td><strong>1</strong></td><td> 60000</td><td> 40000</td><td> 60000</td></tr><tr><td><strong>2</strong></td><td> 60000</td><td> 40000</td><td> 60000</td></tr><tr><td><strong>3</strong></td><td> 60000</td><td> 40000</td><td> 60000</td></tr><tr><td><strong>4</strong></td><td> 42390</td><td> 40000</td><td> 45570</td></tr>
Round numeric columns
.. code:: ipython3
df.round(-3).head()
.. raw:: html
<table><thead><tr><th></th><th>salary </th><th>bonus </th><th>total salary</th></tr></thead><tbody><tr><td><strong>0</strong></td><td> 45000</td><td> 2000</td><td> 47000</td></tr><tr><td><strong>1</strong></td><td> 63000</td><td> 3000</td><td> 66000</td></tr><tr><td><strong>2</strong></td><td> 67000</td><td> 1000</td><td> 67000</td></tr><tr><td><strong>3</strong></td><td> 72000</td><td> 3000</td><td> 75000</td></tr><tr><td><strong>4</strong></td><td> 42000</td><td> 3000</td><td> 46000</td></tr>
Copy the DataFrame
.. code:: ipython3
df.copy().head()
.. raw:: html
<table><thead><tr><th></th><th>department</th><th>race </th><th>gender </th><th>salary </th><th>bonus </th><th>total salary</th></tr></thead><tbody><tr><td><strong>0</strong></td><td>Houston Police Department-HPD</td><td>White </td><td>Male </td><td> 45279</td><td> 1539</td><td> 46818</td></tr><tr><td><strong>1</strong></td><td>Houston Fire Department (HFD)</td><td>White </td><td>Male </td><td> 63166</td><td> 2885</td><td> 66051</td></tr><tr><td><strong>2</strong></td><td>Houston Police Department-HPD</td><td>Black </td><td>Male </td><td> 66614</td><td> 619</td><td> 67233</td></tr><tr><td><strong>3</strong></td><td>Public Works & Engineering-PWE</td><td>Asian </td><td>Male </td><td> 71680</td><td> 3010</td><td> 74690</td></tr><tr><td><strong>4</strong></td><td>Houston Airport System (HAS)</td><td>White </td><td>Male </td><td> 42390</td><td> 3180</td><td> 45570</td></tr>
Take the nth difference.
.. code:: ipython3
df['salary'].diff(2).head(10)
.. raw:: html
<table><thead><tr><th></th><th>salary </th></tr></thead><tbody><tr><td><strong>0</strong></td><td> nan</td></tr><tr><td><strong>1</strong></td><td> nan</td></tr><tr><td><strong>2</strong></td><td> 21335.000</td></tr><tr><td><strong>3</strong></td><td> 8514.000</td></tr><tr><td><strong>4</strong></td><td>-24224.000</td></tr><tr><td><strong>5</strong></td><td> 36282.000</td></tr><tr><td><strong>6</strong></td><td> 10254.000</td></tr><tr><td><strong>7</strong></td><td> 72454.000</td></tr><tr><td><strong>8</strong></td><td>-22297.000</td></tr><tr><td><strong>9</strong></td><td>-125147.000</td></tr>
Find the nth percentage change.
.. code:: ipython3
df['salary'].pct_change(2).head(10)
.. raw:: html
<table><thead><tr><th></th><th>salary </th></tr></thead><tbody><tr><td><strong>0</strong></td><td> nan</td></tr><tr><td><strong>1</strong></td><td> nan</td></tr><tr><td><strong>2</strong></td><td> 0.471</td></tr><tr><td><strong>3</strong></td><td> 0.135</td></tr><tr><td><strong>4</strong></td><td> -0.364</td></tr><tr><td><strong>5</strong></td><td> 0.506</td></tr><tr><td><strong>6</strong></td><td> 0.242</td></tr><tr><td><strong>7</strong></td><td> 0.671</td></tr><tr><td><strong>8</strong></td><td> -0.424</td></tr><tr><td><strong>9</strong></td><td> -0.694</td></tr>
Sort the DataFrame by one or more columns
.. code:: ipython3
df.sort_values('salary').head()
.. raw:: html
<table><thead><tr><th></th><th>department</th><th>race </th><th>gender </th><th>salary </th><th>bonus </th><th>total salary</th></tr></thead><tbody><tr><td><strong>0</strong></td><td>Houston Police Department-HPD</td><td>Black </td><td>Female </td><td> 24960</td><td> 1335</td><td> 26295</td></tr><tr><td><strong>1</strong></td><td>Public Works & Engineering-PWE</td><td>Hispanic </td><td>Male </td><td> 26104</td><td> 666</td><td> 26770</td></tr><tr><td><strong>2</strong></td><td>Public Works & Engineering-PWE</td><td>Black </td><td>Female </td><td> 26125</td><td> 3904</td><td> 30029</td></tr><tr><td><strong>3</strong></td><td>Houston Airport System (HAS)</td><td>Hispanic </td><td>Female </td><td> 26125</td><td> 3352</td><td> 29477</td></tr><tr><td><strong>4</strong></td><td>Houston Airport System (HAS)</td><td>Black </td><td>Female </td><td> 26125</td><td> 4150</td><td> 30275</td></tr>
Sort descending
.. code:: ipython3
df.sort_values('salary', asc=False).head()
.. raw:: html
<table><thead><tr><th></th><th>department</th><th>race </th><th>gender </th><th>salary </th><th>bonus </th><th>total salary</th></tr></thead><tbody><tr><td><strong>0</strong></td><td>Houston Fire Department (HFD)</td><td>White </td><td>Male </td><td> 210588</td><td> 364</td><td> 210952</td></tr><tr><td><strong>1</strong></td><td>Houston Police Department-HPD</td><td>White </td><td>Male </td><td> 199596</td><td> 2164</td><td> 201760</td></tr><tr><td><strong>2</strong></td><td>Houston Airport System (HAS)</td><td>Black </td><td>Male </td><td> 186192</td><td> 948</td><td> 187140</td></tr><tr><td><strong>3</strong></td><td>Health & Human Services</td><td>Black </td><td>Male </td><td> 180416</td><td> 1577</td><td> 181993</td></tr><tr><td><strong>4</strong></td><td>Public Works & Engineering-PWE</td><td>White </td><td>Female </td><td> 178331</td><td> 4891</td><td> 183222</td></tr>
Sort by multiple columns
.. code:: ipython3
df.sort_values(['race', 'salary']).head()
.. raw:: html
<table><thead><tr><th></th><th>department</th><th>race </th><th>gender </th><th>salary </th><th>bonus </th><th>total salary</th></tr></thead><tbody><tr><td><strong>0</strong></td><td>Houston Airport System (HAS)</td><td>Asian </td><td>Female </td><td> 26125</td><td> 4665</td><td> 30790</td></tr><tr><td><strong>1</strong></td><td>Houston Police Department-HPD</td><td>Asian </td><td>Male </td><td> 27914</td><td> 2068</td><td> 29982</td></tr><tr><td><strong>2</strong></td><td>Houston Police Department-HPD</td><td>Asian </td><td>Male </td><td> 28169</td><td> 246</td><td> 28415</td></tr><tr><td><strong>3</strong></td><td>Public Works & Engineering-PWE</td><td>Asian </td><td>Male </td><td> 28995</td><td> 3747</td><td> 32742</td></tr><tr><td><strong>4</strong></td><td>Public Works & Engineering-PWE</td><td>Asian </td><td>Male </td><td> 30347</td><td> 1445</td><td> 31792</td></tr>
Randomly sample the DataFrame
.. code:: ipython3
df.sample(n=3)
.. raw:: html
<table><thead><tr><th></th><th>department</th><th>race </th><th>gender </th><th>salary </th><th>bonus </th><th>total salary</th></tr></thead><tbody><tr><td><strong>0</strong></td><td>Houston Fire Department (HFD)</td><td>Black </td><td>Male </td><td> 61226</td><td> 4609</td><td> 65835</td></tr><tr><td><strong>1</strong></td><td>Public Works & Engineering-PWE</td><td>Hispanic </td><td>Male </td><td> 31158</td><td> 3846</td><td> 35004</td></tr><tr><td><strong>2</strong></td><td>Houston Police Department-HPD</td><td>Asian </td><td>Male </td><td> 60347</td><td> 1086</td><td> 61433</td></tr>
Randomly sample a fraction
.. code:: ipython3
df.sample(frac=.005)
.. raw:: html
<table><thead><tr><th></th><th>department</th><th>race </th><th>gender </th><th>salary </th><th>bonus </th><th>total salary</th></tr></thead><tbody><tr><td><strong>0</strong></td><td>Public Works & Engineering-PWE</td><td>Asian </td><td>Male </td><td> 32635</td><td> 3802</td><td> 36437</td></tr><tr><td><strong>1</strong></td><td>Houston Airport System (HAS)</td><td>Hispanic </td><td>Male </td><td> 42099</td><td> 4378</td><td> 46477</td></tr><tr><td><strong>2</strong></td><td>Public Works & Engineering-PWE</td><td>Black </td><td>Male </td><td> 104389</td><td> 3602</td><td> 107991</td></tr><tr><td><strong>3</strong></td><td>Public Works & Engineering-PWE</td><td>White </td><td>Male </td><td> 43514</td><td> 2222</td><td> 45736</td></tr><tr><td><strong>4</strong></td><td>Public Works & Engineering-PWE</td><td>Black </td><td>Female </td><td> 33488</td><td> 848</td><td> 34336</td></tr><tr><td><strong>5</strong></td><td>Houston Fire Department (HFD)</td><td>White </td><td>Male </td><td> 70181</td><td> 1109</td><td> 71290</td></tr><tr><td><strong>6</strong></td><td>Public Works & Engineering-PWE</td><td>White </td><td>Male </td><td> 60715</td><td> 4480</td><td> 65195</td></tr>
Sample with replacement
.. code:: ipython3
df.sample(n=10000, replace=True).head()
.. raw:: html
<table><thead><tr><th></th><th>department</th><th>race </th><th>gender </th><th>salary </th><th>bonus </th><th>total salary</th></tr></thead><tbody><tr><td><strong>0</strong></td><td>Houston Fire Department (HFD)</td><td>White </td><td>Male </td><td> 61921</td><td> 172</td><td> 62093</td></tr><tr><td><strong>1</strong></td><td>Houston Police Department-HPD</td><td>Asian </td><td>Male </td><td> 61643</td><td> 659</td><td> 62302</td></tr><tr><td><strong>2</strong></td><td>Houston Police Department-HPD</td><td>White </td><td>Male </td><td> 43443</td><td> 3539</td><td> 46982</td></tr><tr><td><strong>3</strong></td><td>Health & Human Services</td><td>White </td><td>Male </td><td> 120799</td><td> 2447</td><td> 123246</td></tr><tr><td><strong>4</strong></td><td>Public Works & Engineering-PWE</td><td>White </td><td>Male </td><td> 98895</td><td> 4571</td><td> 103466</td></tr>
String-only methods
~~~~~~~~~~~~~~~~~~~
Use the ``str`` accessor to call methods available just to string
columns. Pass the name of the string column as the first parameter for
all these methods.
.. code:: ipython3
df.str.count('department', 'P').head()
.. raw:: html
<table><thead><tr><th></th><th>department</th></tr></thead><tbody><tr><td><strong>0</strong></td><td> 2</td></tr><tr><td><strong>1</strong></td><td> 0</td></tr><tr><td><strong>2</strong></td><td> 2</td></tr><tr><td><strong>3</strong></td><td> 2</td></tr><tr><td><strong>4</strong></td><td> 0</td></tr>
.. code:: ipython3
df.str.lower('department').head()
.. raw:: html
<table><thead><tr><th></th><th>department</th></tr></thead><tbody><tr><td><strong>0</strong></td><td>houston police department-hpd</td></tr><tr><td><strong>1</strong></td><td>houston fire department (hfd)</td></tr><tr><td><strong>2</strong></td><td>houston police department-hpd</td></tr><tr><td><strong>3</strong></td><td>public works & engineering-pwe</td></tr><tr><td><strong>4</strong></td><td>houston airport system (has)</td></tr>
.. code:: ipython3
df.str.find('department', 'Houston').head()
.. raw:: html
<table><thead><tr><th></th><th>department</th></tr></thead><tbody><tr><td><strong>0</strong></td><td> 0</td></tr><tr><td><strong>1</strong></td><td> 0</td></tr><tr><td><strong>2</strong></td><td> 0</td></tr><tr><td><strong>3</strong></td><td> -1</td></tr><tr><td><strong>4</strong></td><td> 0</td></tr>
Grouping
~~~~~~~~
pandas_cub provides the ``value_counts`` method for simple frequency
counting of unique values and ``pivot_table`` for grouping and
aggregating.
The ``value_counts`` method returns a list of DataFrames, one for each
column.
.. code:: ipython3
dfs = df[['department', 'race', 'gender']].value_counts()
.. code:: ipython3
dfs[0]
.. raw:: html
<table><thead><tr><th></th><th>department</th><th>count </th></tr></thead><tbody><tr><td><strong>0</strong></td><td>Houston Police Department-HPD</td><td> 570</td></tr><tr><td><strong>1</strong></td><td>Houston Fire Department (HFD)</td><td> 365</td></tr><tr><td><strong>2</strong></td><td>Public Works & Engineering-PWE</td><td> 341</td></tr><tr><td><strong>3</strong></td><td>Health & Human Services</td><td> 103</td></tr><tr><td><strong>4</strong></td><td>Houston Airport System (HAS)</td><td> 103</td></tr><tr><td><strong>5</strong></td><td>Parks & Recreation</td><td> 53</td></tr>
.. code:: ipython3
dfs[1]
.. raw:: html
<table><thead><tr><th></th><th>race </th><th>count </th></tr></thead><tbody><tr><td><strong>0</strong></td><td>White </td><td> 542</td></tr><tr><td><strong>1</strong></td><td>Black </td><td> 518</td></tr><tr><td><strong>2</strong></td><td>Hispanic </td><td> 381</td></tr><tr><td><strong>3</strong></td><td>Asian </td><td> 87</td></tr><tr><td><strong>4</strong></td><td>Native American</td><td> 7</td></tr>
.. code:: ipython3
dfs[2]
.. raw:: html
<table><thead><tr><th></th><th>gender </th><th>count </th></tr></thead><tbody><tr><td><strong>0</strong></td><td>Male </td><td> 1135</td></tr><tr><td><strong>1</strong></td><td>Female </td><td> 400</td></tr>
If your DataFrame has one column, a DataFrame and not a list is
returned. You can also return the relative frequency by setting the
``normalize`` parameter to ``True``.
.. code:: ipython3
df['race'].value_counts(normalize=True)
.. raw:: html
<table><thead><tr><th></th><th>race </th><th>count </th></tr></thead><tbody><tr><td><strong>0</strong></td><td>White </td><td> 0.353</td></tr><tr><td><strong>1</strong></td><td>Black </td><td> 0.337</td></tr><tr><td><strong>2</strong></td><td>Hispanic </td><td> 0.248</td></tr><tr><td><strong>3</strong></td><td>Asian </td><td> 0.057</td></tr><tr><td><strong>4</strong></td><td>Native American</td><td> 0.005</td></tr>
The ``pivot_table`` method allows to group by one or two columns and
aggregate values from another column. Let’s find the average salary for
each race and gender. All parameters must be strings.
.. code:: ipython3
df.pivot_table(rows='race', columns='gender', values='salary', aggfunc='mean')
.. raw:: html
<table><thead><tr><th></th><th>race </th><th>Female </th><th>Male </th></tr></thead><tbody><tr><td><strong>0</strong></td><td>Asian </td><td> 58304.222</td><td> 60622.957</td></tr><tr><td><strong>1</strong></td><td>Black </td><td> 48133.382</td><td> 51853.000</td></tr><tr><td><strong>2</strong></td><td>Hispanic </td><td> 44216.960</td><td> 55493.064</td></tr><tr><td><strong>3</strong></td><td>Native American</td><td> 58844.333</td><td> 68850.500</td></tr><tr><td><strong>4</strong></td><td>White </td><td> 66415.528</td><td> 63439.196</td></tr>
If you don’t provide ``values`` or ``aggfunc`` then by default it will
return frequency (a contingency table).
.. code:: ipython3
df.pivot_table(rows='race', columns='gender')
.. raw:: html
<table><thead><tr><th></th><th>race </th><th>Female </th><th>Male </th></tr></thead><tbody><tr><td><strong>0</strong></td><td>Asian </td><td> 18</td><td> 69</td></tr><tr><td><strong>1</strong></td><td>Black </td><td> 207</td><td> 311</td></tr><tr><td><strong>2</strong></td><td>Hispanic </td><td> 100</td><td> 281</td></tr><tr><td><strong>3</strong></td><td>Native American</td><td> 3</td><td> 4</td></tr><tr><td><strong>4</strong></td><td>White </td><td> 72</td><td> 470</td></tr>
You can group by just a single column.
.. code:: ipython3
df.pivot_table(rows='department', values='salary', aggfunc='mean')
.. raw:: html
<table><thead><tr><th></th><th>department</th><th>mean </th></tr></thead><tbody><tr><td><strong>0</strong></td><td>Health & Human Services</td><td> 51324.981</td></tr><tr><td><strong>1</strong></td><td>Houston Airport System (HAS)</td><td> 53990.369</td></tr><tr><td><strong>2</strong></td><td>Houston Fire Department (HFD)</td><td> 59960.441</td></tr><tr><td><strong>3</strong></td><td>Houston Police Department-HPD</td><td> 60428.746</td></tr><tr><td><strong>4</strong></td><td>Parks & Recreation</td><td> 39426.151</td></tr><tr><td><strong>5</strong></td><td>Public Works & Engineering-PWE</td><td> 50207.806</td></tr>
.. code:: ipython3
df.pivot_table(columns='department', values='salary', aggfunc='mean')
.. raw:: html
<table><thead><tr><th></th><th>Health & Human Services</th><th>Houston Airport System (HAS)</th><th>Houston Fire Department (HFD)</th><th>Houston Police Department-HPD</th><th>Parks & Recreation</th><th>Public Works & Engineering-PWE</th></tr></thead><tbody><tr><td><strong>0</strong></td><td> 51324.981</td><td> 53990.369</td><td> 59960.441</td><td> 60428.746</td><td> 39426.151</td><td> 50207.806</td></tr>
Project details
Download files
Download the file for your platform. If you're not sure which to choose, learn more about installing packages.
Source Distribution
pandas_cub-0.0.4.tar.gz
(36.0 kB
view hashes)
Built Distribution
pandas_cub-0.0.4-py3-none-any.whl
(19.6 kB
view hashes)
Close
Hashes for pandas_cub-0.0.4-py3-none-any.whl
Algorithm | Hash digest | |
---|---|---|
SHA256 | 904fd9f954278429b0f1993b63d65689f845ad3f95b0c9b684c2c8bcc9ded814 |
|
MD5 | 56dc4e3890bb0c731656484b5c115ebd |
|
BLAKE2b-256 | c5feaf5e0ce8a142c4711898eda492caa643b8da3183eb08c42e0892a7ff1c29 |