Skip to main content

Shrink Pandas DataFrames with precision safe schema inference.

Project description

Pandas Downcast

image PyPI pyversions Build Status codecov

Shrink Pandas DataFrames with precision safe schema inference. pandas-downcast finds the minimum viable type for each column, ensuring that resulting values are within tolerance of original values.

Installation

pip install pandas-downcast

Dependencies

  • python >= 3.6
  • pandas
  • numpy

License

MIT

Usage

import pdcast as pdc
import numpy as np
import pandas as pd

data = {
    "integers": np.linspace(1, 100, 100),
    "floats": np.linspace(1, 1000, 100).round(2),
    "booleans": np.random.choice([1, 0], 100),
    "categories": np.random.choice(["foo", "bar", "baz"], 100),
}

df = pd.DataFrame(data)

# Downcast DataFrame to minimum viable schema.
df_downcast = pdc.downcast(df)

# Infer minimum schema for DataFrame.
schema = pdc.infer_schema(df)

# Coerce DataFrame to schema - required if converting float to Pandas Integer.
df_new = pdc.coerce_df(df, schema)

Smaller data types $\Rightarrow$ smaller memory footprint.

df.info()
# <class 'pandas.core.frame.DataFrame'>
# RangeIndex: 100 entries, 0 to 99
# Data columns (total 4 columns):
#  #   Column      Non-Null Count  Dtype  
# ---  ------      --------------  -----  
#  0   integers    100 non-null    float64
#  1   floats      100 non-null    float64
#  2   booleans    100 non-null    int64  
#  3   categories  100 non-null    object 
# dtypes: float64(2), int64(1), object(1)
# memory usage: 3.2+ KB

df_downcast.info()
# <class 'pandas.core.frame.DataFrame'>
# RangeIndex: 100 entries, 0 to 99
# Data columns (total 4 columns):
#  #   Column      Non-Null Count  Dtype   
# ---  ------      --------------  -----   
#  0   integers    100 non-null    uint8   
#  1   floats      100 non-null    float32 
#  2   booleans    100 non-null    bool    
#  3   categories  100 non-null    category
# dtypes: bool(1), category(1), float32(1), uint8(1)
# memory usage: 932.0 bytes

Numerical data types will be downcast if the resulting values are within tolerance of the original values. For details on tolerance for numeric comparison, see the notes on np.allclose.

print(df.head())
#    integers  floats  booleans categories
# 0       1.0    1.00         1        foo
# 1       2.0   11.09         0        baz
# 2       3.0   21.18         1        bar
# 3       4.0   31.27         0        bar
# 4       5.0   41.36         0        foo

print(df_downcast.head())
#    integers     floats  booleans categories
# 0         1   1.000000      True        foo
# 1         2  11.090000     False        baz
# 2         3  21.180000      True        bar
# 3         4  31.270000     False        bar
# 4         5  41.360001     False        foo


print(pdc.options.ATOL)
# >>> 1e-08

print(pdc.options.RTOL)
# >>> 1e-05

Tolerance can be set at the module level or passed in function arguments.

pdc.options.ATOL = 1e-10
pdc.options.RTOL = 1e-10
df_downcast_new = pdc.downcast(df)

Or

infer_dtype_kws = {
    "ATOL": 1e-10,
    "RTOL": 1e-10
}
df_downcast_new = pdc.downcast(df, infer_dtype_kws=infer_dtype_kws)

The floats column is now kept as float64 to meet the tolerance requirement. Values in the integers column are still safely cast to uint8.

df_downcast_new.info()
# <class 'pandas.core.frame.DataFrame'>
# RangeIndex: 100 entries, 0 to 99
# Data columns (total 4 columns):
#  #   Column      Non-Null Count  Dtype   
# ---  ------      --------------  -----   
#  0   integers    100 non-null    uint8   
#  1   floats      100 non-null    float64 
#  2   booleans    100 non-null    bool    
#  3   categories  100 non-null    category
# dtypes: bool(1), category(1), float64(1), uint8(1)
# memory usage: 1.3 KB

Inferred schemas can be restricted to Numpy data types only.

# Downcast DataFrame to minimum viable Numpy schema.
df_downcast = pdc.downcast(df, numpy_dtypes_only=True)

# Infer minimum Numpy schema for DataFrame.
schema = pdc.infer_schema(df, numpy_dtypes_only=True)

Example

The following example shows how downcasting data often leads to size reductions of greater than 70%, depending on the original types.

import pdcast as pdc
import pandas as pd
import seaborn as sns

df_dict = {df: sns.load_dataset(df) for df in sns.get_dataset_names()}

results = []

for name, df in df_dict.items():
    size_pre = df.memory_usage(deep=True).sum()
    df_post = pdc.downcast(df)
    size_post = df_post.memory_usage(deep=True).sum()
    shrinkage = int((1 - (size_post / size_pre)) * 100)
    results.append(
        {"dataset": name, "size_pre": size_pre, "size_post": size_post, "shrink_pct": shrinkage}
    )

results_df = pd.DataFrame(results).sort_values("shrink_pct", ascending=False).reset_index(drop=True)
print(results_df)
           dataset  size_pre  size_post  shrink_pct
0             fmri    213232      14776          93
1          titanic    321240      28162          91
2        attention      5888        696          88
3         penguins     75711       9131          87
4             dots    122240      17488          85
5           geyser     21172       3051          85
6           gammas    500128     108386          78
7         anagrams      2048        456          77
8          planets    112663      30168          73
9         anscombe      3428        964          71
10            iris     14728       5354          63
11        exercise      3302       1412          57
12         flights      3616       1888          47
13             mpg     75756      43842          42
14            tips      7969       6261          21
15        diamonds   3184588    2860948          10
16  brain_networks   4330642    4330642           0
17     car_crashes      5993       5993           0

Project details


Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

pandas_downcast-1.2.4.tar.gz (7.6 kB view details)

Uploaded Source

Built Distribution

pandas_downcast-1.2.4-py3-none-any.whl (8.5 kB view details)

Uploaded Python 3

File details

Details for the file pandas_downcast-1.2.4.tar.gz.

File metadata

  • Download URL: pandas_downcast-1.2.4.tar.gz
  • Upload date:
  • Size: 7.6 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: poetry/1.4.0 CPython/3.10.4 Linux/5.19.0-38-generic

File hashes

Hashes for pandas_downcast-1.2.4.tar.gz
Algorithm Hash digest
SHA256 0cadbfe1a57ffdbb9b4c8c9aa1266c2d50f65cb9579e03956140e1db747e1efa
MD5 6f60626078df7c406f9c25cea61bb882
BLAKE2b-256 88801cf3e8ac26523c26fdaf4b50c951d3af0123cdd4d1706c1ece7e721d0097

See more details on using hashes here.

File details

Details for the file pandas_downcast-1.2.4-py3-none-any.whl.

File metadata

  • Download URL: pandas_downcast-1.2.4-py3-none-any.whl
  • Upload date:
  • Size: 8.5 kB
  • Tags: Python 3
  • Uploaded using Trusted Publishing? No
  • Uploaded via: poetry/1.4.0 CPython/3.10.4 Linux/5.19.0-38-generic

File hashes

Hashes for pandas_downcast-1.2.4-py3-none-any.whl
Algorithm Hash digest
SHA256 b972f05fa98af04a6dca496c305406cb15607c41521d97328e2bc4390e30c14c
MD5 4ceb86665603eb5178c6b3218233720b
BLAKE2b-256 5ddee6f661357011bfe7e909595f46941d1e99018cdd749a6c2b1d2247203502

See more details on using hashes here.

Supported by

AWS AWS Cloud computing and Security Sponsor Datadog Datadog Monitoring Fastly Fastly CDN Google Google Download Analytics Microsoft Microsoft PSF Sponsor Pingdom Pingdom Monitoring Sentry Sentry Error logging StatusPage StatusPage Status page