
dtype_diet

Attempt to shrink Pandas dtypes without losing data so you have more RAM (and maybe more speed)


Install

pip install dtype_diet

Documentation

https://noklam.github.io/dtype_diet/

How to use

This is a fork of https://github.com/ianozsvald/dtype_diet to continue support and development of the library, with the approval of the original author @ianozsvald.

This tool checks each column to see if larger dtypes (e.g. 8 byte float64 and int64) could be shrunk to smaller dtypes without causing any data loss. Dropping an 8 byte type to a 4 (or 2 or 1 byte) type will keep halving the RAM requirement for that column. Categoricals are proposed for object columns which can bring significant speed and RAM benefits.
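The kind of check the tool automates can be sketched with plain pandas on a made-up frame: downcast a numeric column to the smallest dtype that holds its values, and convert a low-cardinality object column to a categorical.

```python
import pandas as pd

# Made-up frame for illustration: both columns start at 8 bytes per value
# (int64) or worse (object strings).
df = pd.DataFrame({
    "small_ints": pd.Series([1, 2, 3, 250], dtype="int64"),
    "city": pd.Series(["NYC", "NYC", "LA", "LA"], dtype="object"),
})

# int64 -> smallest unsigned integer dtype that fits the values (uint8 here,
# since 250 <= 255); no data is lost.
shrunk_ints = pd.to_numeric(df["small_ints"], downcast="unsigned")
print(shrunk_ints.dtype)  # uint8

# Low-cardinality object column -> category: each row stores a small integer
# code instead of a full Python string.
shrunk_city = df["city"].astype("category")
print(df["city"].memory_usage(deep=True),
      shrunk_city.memory_usage(deep=True))
```

This is only the manual version of the idea; dtype_diet additionally verifies that the round trip is lossless before proposing a conversion.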

Here's a minimal example with 3 lines of code running on a Kaggle dataset, showing a reduction from 957 MB to 85 MB; you can find the notebook in the repository:

# sell_prices.csv.zip
# Source data: https://www.kaggle.com/c/m5-forecasting-uncertainty/
import pandas as pd
from dtype_diet import report_on_dataframe, optimize_dtypes
df = pd.read_csv('data/sell_prices.csv')
proposed_df = report_on_dataframe(df, unit="MB")
new_df = optimize_dtypes(df, proposed_df)
print(f'Original df memory: {df.memory_usage(deep=True).sum()/1024/1024} MB')
print(f'Proposed df memory: {new_df.memory_usage(deep=True).sum()/1024/1024} MB')
Original df memory: 957.5197134017944 MB
Proposed df memory: 85.09655094146729 MB

proposed_df
| Column | Current dtype | Proposed dtype | Current Memory (MB) | Proposed Memory (MB) | RAM Usage Improvement (MB) | RAM Usage Improvement (%) |
|---|---|---|---|---|---|---|
| store_id | object | category | 203763.920410 | 3340.907715 | 200423.012695 | 98.360403 |
| item_id | object | category | 233039.977539 | 6824.677734 | 226215.299805 | 97.071456 |
| wm_yr_wk | int64 | int16 | 26723.191406 | 6680.844727 | 20042.346680 | 74.999825 |
| sell_price | float64 | None | 26723.191406 | NaN | NaN | NaN |

Recommendations:

  • Run report_on_dataframe(your_df) to get recommendations
  • Run optimize_dtypes(df, proposed_df) to convert to the recommended dtypes.
  • Consider if Categoricals will save you RAM (see Caveats below)
  • Consider if f32 or f16 will be useful (see Caveats - f32 is probably a reasonable choice unless you have huge ranges of floats)
  • Consider if int32, int16, int8 will be useful (see Caveats - overflow may be an issue)
  • Look at https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.convert_dtypes.html which recommends Pandas nullable dtype alternatives (e.g. to avoid promoting an int64 with NaN items to float64, instead you get Int64 with NaNs and no data loss)
  • Look at Extension arrays like https://github.com/JDASoftwareGroup/rle-array (thanks @repererum for the tweet)
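The nullable-dtype point above can be seen directly with convert_dtypes: by default an integer column containing a missing value is promoted to float64, while the nullable Int64 extension dtype keeps the integers intact alongside the missing value.

```python
import pandas as pd

# An int column with a missing value is promoted to float64 by default...
s = pd.Series([1, 2, None])
print(s.dtype)  # float64

# ...whereas convert_dtypes proposes the nullable Int64 extension dtype,
# preserving integers and the missing value with no data loss.
s2 = s.convert_dtypes()
print(s2.dtype)  # Int64
```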

Look at report_on_dataframe(your_df) to get a printed report - no changes are made to your dataframe.

Caveats

  • reduced numeric ranges might lead to overflow (TODO document)
  • category dtype can have unexpected effects e.g. need for observed=True in groupby (TODO document)
  • f16 is likely to be simulated on modern hardware so calculations will be 2-3× slower than on f32 or f64
  • we could do with a link that explains binary representation of float & int for those wanting to learn more
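The first two caveats can be demonstrated on a made-up frame: integer overflow silently wraps once values no longer fit the smaller dtype, and a groupby over a categorical column emits every category (including unobserved ones) unless observed=True is passed.

```python
import pandas as pd

# Overflow caveat: int8 holds -128..127, so a later computation can
# silently wrap around rather than raise an error.
s = pd.Series([100, 120], dtype="int8")
total = s + s          # 200 and 240 do not fit in int8
print(total.tolist())  # [-56, -16] -- wrapped, not 200 and 240

# Category caveat: groupby includes unobserved categories by default;
# observed=True restricts the result to categories actually present.
df = pd.DataFrame({
    "c": pd.Categorical(["a"], categories=["a", "b"]),
    "v": [1],
})
grouped = df.groupby("c", observed=True)["v"].sum()
print(grouped.index.tolist())  # ['a'] -- 'b' is not emitted
```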

Development

Contributors

Local Setup

$ conda create -n dtype_diet python=3.8 pandas jupyter pyarrow pytest
$ conda activate dtype_diet

Release

make release

Contributing

The repository is developed with nbdev, a system for developing libraries with notebooks.

Make sure you run the following command if you want to contribute to the library. For details, please refer to the nbdev documentation (https://github.com/fastai/nbdev):

nbdev_install_git_hooks

Some other useful commands

nbdev_build_docs
nbdev_build_lib
nbdev_test_nbs
