
A set of functions for working with categorical columns in pandas

Project description


The package contains a few functions that make pandas categorical types easier to use. The main purpose of categorical types is to reduce RAM consumption when working with large datasets. Experience shows roughly a twofold reduction on average (for datasets of several GB, this is very significant). The full justification and examples are given below.

Quickstart

pip install pandas-categorical
import pandas as pd
import pandas_categorical as pdc
df.astype('category')     ->     pdc.cat_astype(df, ...)
pd.concat()               ->     pdc.concat_categorical()
pd.merge()                ->     pdc.merge_categorical()
df.groupby(...)           ->     df.groupby(..., observed=True)

cat_astype

df = pd.read_csv("path_to_dataframe.csv")

SUB_DTYPES = {
	'cat_col_with_int_values': int,
	'cat_col_with_string_values': 'string',
	'ordered_cat_col_with_bool_values': bool,
}
pdc.cat_astype(
	data=df,
	cat_cols=SUB_DTYPES.keys(),
	sub_dtypes=SUB_DTYPES,
	ordered_cols=['ordered_cat_col_with_bool_values']
)
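For readers without the package, the call above can be approximated with plain pandas. This is a rough sketch of the semantics only, not the package's actual implementation, and the column names are made up:

```python
import pandas as pd

df = pd.DataFrame({"a": ["1", "2"], "b": ["x", "y"]})
SUB_DTYPES = {"a": "int64", "b": "string"}
ORDERED = {"a"}  # columns whose categories should be ordered

for col, sub_dtype in SUB_DTYPES.items():
    # Cast the underlying values first, then wrap them in a categorical dtype.
    df[col] = df[col].astype(sub_dtype).astype(
        pd.CategoricalDtype(ordered=col in ORDERED)
    )

print(df.dtypes)  # both columns are now 'category'
```

Note the intermediate astype(sub_dtype) briefly materializes a non-categorical column; avoiding that copy is part of what the package handles for you.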

concat_categorical

df_1 = ...  # dataset with some categorical columns
df_2 = ...  # dataset with some categorical columns (categories are not equal)

df_res = pdc.concat_categorical((df_1, df_2), axis=0, ignore_index=True)

merge_categorical

df_1 = ...  # dataset with some categorical columns
df_2 = ...  # dataset with some categorical columns (categories are not equal)

df_res = pdc.merge_categorical(df_1, df_2, on=['cat_col_1', 'cat_col_2'])
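Without the package, merging on categorical key columns whose category sets differ silently drops the categorical dtype. A minimal demonstration with toy data:

```python
import pandas as pd

left = pd.DataFrame({"key": pd.Series(["a", "b"], dtype="category"), "l": [1, 2]})
right = pd.DataFrame({"key": pd.Series(["b", "c"], dtype="category"), "r": [3, 4]})

out = pd.merge(left, right, on="key")
# The key column loses its categorical dtype when the category sets differ.
print(out["key"].dtype)
```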

A bit of theory

The advantages are discussed in detail in the articles listed in the Links section below.

A categorical column stores a set of unique values (categories) that are often repeated. By storing each unique value once instead of copying it for every row, the size of the column can be reduced (the larger the dataset, the more likely repetitions are). By default, categories have no order, i.e. they are not comparable to each other, but it is possible to make them ordered.
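A quick way to see the savings described above (exact sizes will vary with the data, but the ratio is typical for low-cardinality string columns):

```python
import numpy as np
import pandas as pd

# A column with many repeated values: a good candidate for 'category'.
s = pd.Series(np.random.choice(["red", "green", "blue"], size=100_000))

mem_object = s.memory_usage(deep=True)
mem_category = s.astype("category").memory_usage(deep=True)

print(f"object:   {mem_object:>10,} bytes")
print(f"category: {mem_category:>10,} bytes")
```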

Pandas already provides everything needed for this (for example, .astype('category')). However, the standard methods, in my opinion, have a high entry threshold and are therefore rarely used.

Let's try to outline a number of problems and ways to solve them.

1. Categorical types are easy to lose

Suppose you want to combine two datasets into one using pd.concat(..., axis=0), and the datasets contain columns with categorical types. If the category sets of the source columns differ, pandas does not take their union; it simply falls back to the default type (for example, object, int, ...). In other words, $$\textcolor{red}{category1} + \textcolor{green}{category2} = object$$ $$\textcolor{red}{category1} + \textcolor{red}{category1} = \textcolor{red}{category1}$$ But we would like to see a different behavior: $$\textcolor{red}{category1} + \textcolor{green}{category2} = \textcolor{blue}{category3}$$ $$(\textcolor{blue}{category3} = \textcolor{red}{category1} \cup \textcolor{green}{category2})$$ As a result, you need to align the categories before operations such as merge or concat.
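The desired union behavior can be reproduced with plain pandas via union_categoricals. A minimal sketch:

```python
import pandas as pd
from pandas.api.types import union_categoricals

a = pd.Series(["x", "y"], dtype="category")
b = pd.Series(["y", "z"], dtype="category")

# Different category sets -> pandas falls back to object on concat.
naive = pd.concat([a, b], ignore_index=True)
print(naive.dtype)  # object

# Unioning the categories first keeps the result categorical.
combined = pd.Series(union_categoricals([a, b]))
print(combined.dtype, list(combined.cat.categories))
```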

2. Categories type control

When you do a type conversion

df['col_name'] = df['col_name'].astype('category')

the type of the categories equals the type of the source column. But if you want to change the type of the categories, you would probably write something like

df['col_name'] = df['col_name'].astype('some_new_type').astype('category')

That is, you temporarily lose the categorical type (and with it the memory savings). By the way, the usual way of checking types

df.dtypes

does not display the type of the categories themselves. You will only see category next to the column in question.
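Both issues can be handled with the plain-pandas cat accessor: it exposes the categories' own dtype, and rename_categories lets you recast the categories without ever leaving the categorical type. A minimal sketch:

```python
import pandas as pd

s = pd.Series(["1", "2", "1"], dtype="category")

# df.dtypes would only say 'category'; the accessor shows the element type.
print(s.cat.categories.dtype)  # object (string categories)

# Recast just the categories; the codes (and the memory savings) stay intact.
s = s.cat.rename_categories(s.cat.categories.astype("int64"))
print(s.dtype, s.cat.categories.dtype)
```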

3. Unused values

Suppose you have filtered the dataset. The actual set of values in a categorical column may have shrunk, but the data type is not updated. This can hurt, for example, when using groupby on such a column: grouping will also be performed over the unused categories. To prevent this, specify the observed=True parameter. For example,

df.groupby(['cat_col_1', 'cat_col_2'], observed=True).agg('mean')
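A quick illustration of the stale categories left behind by filtering, and pandas' built-in remove_unused_categories as an alternative fix:

```python
import pandas as pd

s = pd.Series(["a", "b", "c"], dtype="category")
filtered = s[s != "c"]

# The filtered column still remembers the unused category 'c'.
print(list(filtered.cat.categories))  # ['a', 'b', 'c']

# Dropping it explicitly shrinks the dtype back to the observed values.
print(list(filtered.cat.remove_unused_categories().cat.categories))  # ['a', 'b']
```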

4. Ordered categories

There is a straightforward command for converting a column to an (unordered) categorical type

df[col] = df[col].astype('category')

But there is no similar command to convert to an ordered categorical type. There are two non-obvious ways:

df[col] = df[col].astype('category').cat.as_ordered()

Or

df[col] = df[col].astype(pd.CategoricalDtype(ordered=True))
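Once ordered, the categories become comparable, e.g.:

```python
import pandas as pd

size = pd.Series(["S", "M", "L", "M"]).astype(
    pd.CategoricalDtype(categories=["S", "M", "L"], ordered=True)
)

# Ordered categories support comparisons and min()/max().
print(size > "S")
print(size.max())  # 'L'
```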

5. Minimum copying

To process large datasets, you need to minimize copying, even of parts of the data. Therefore, the functions in this package perform their transformations in place.

6. Data storage in parquet format

When using pd.to_parquet(path, engine='pyarrow') and pd.read_parquet(path, engine='pyarrow'), the categorical types of some columns can be reset to their base types. To solve this problem, you can use engine='fastparquet'.

Note 1: fastparquet usually runs a little slower than pyarrow.

Note 2: pyarrow and fastparquet cannot be used together (for example, save by one and read by the other). This can lead to data loss.

import pandas as pd


df = pd.DataFrame(
	{
		"Date": pd.date_range('2023-01-01', periods=10),
		"Object": ["a"]*5+["b"]+["c"]*4,
		"Int": [1, 1, 1, 2, 3, 1, 2, 4, 3, 2],
		"Float": [1.1]*5+[2.2]*5,
	}
)

print(df.dtypes)
df = df.astype('category')
print(df.dtypes)
df.to_parquet('test.parquet', engine='pyarrow')
df = pd.read_parquet('test.parquet', engine='pyarrow')
print(df.dtypes)

Output:

Date      datetime64[ns]
Object    object
Int       int64
Float     float64
dtype: object

Date      category
Object    category
Int       category
Float     category
dtype: object

Date      datetime64[ns]
Object    category
Int       int64
Float     float64
dtype: object


Remarks

  1. Processing of categorical indexes is not yet implemented.
  2. A pdc.join_categorical() function will be added in the future.
  3. The cat_astype function is designed so that the type information may be redundant (for example, it can be specified once for all possible column names in the project). In the future, it will be possible to set default values for this function.

Links

  1. Official pandas documentation.
  2. https://towardsdatascience.com/staying-sane-while-adopting-pandas-categorical-datatypes-78dbd19dcd8a
  3. https://towardsdatascience.com/pandas-groupby-aggregate-transform-filter-c95ba3444bbb
  4. The source of the idea that I wanted to develop.
