Automated downcasting for Pandas DataFrames.
Pandas Downcast
Safely infer the minimum viable schema for a Pandas DataFrame or Series.
Installation
pip install pandas-downcast
Dependencies
- python >= 3.6
- pandas
- numpy
License
Usage
import pdcast as pdc
import numpy as np
import pandas as pd
data = {
"integers": np.linspace(1, 100, 100),
"floats": np.linspace(1, 1000, 100).round(2),
"booleans": np.random.choice([1, 0], 100),
"categories": np.random.choice(["foo", "bar", "baz"], 100),
}
df = pd.DataFrame(data)
# Downcast DataFrame to minimum viable schema.
df_downcast = pdc.downcast(df)
# Infer minimum schema from DataFrame.
schema = pdc.infer_schema(df)
# Coerce DataFrame to schema - required if converting float to Pandas Integer.
df_new = pdc.coerce_df(df)
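The "Pandas Integer" in the comment above refers to pandas' nullable integer extension dtypes (`Int64`, `UInt8`, and so on). The sketch below uses plain pandas rather than `pdcast`, so it says nothing about how `coerce_df` works internally. It only illustrates why a float column of whole numbers with missing values cannot be cast to a NumPy integer dtype but can be cast to a nullable one.

```python
import numpy as np
import pandas as pd

# A float column whose non-missing values are all whole numbers.
s = pd.Series([1.0, 2.0, np.nan])

# NumPy integer dtypes cannot represent NaN, so this cast raises.
try:
    s.astype("uint8")
    numpy_cast_ok = True
except (ValueError, TypeError):
    numpy_cast_ok = False

# The nullable extension dtype stores the missing value as pd.NA instead.
nullable = s.astype("UInt8")

print(numpy_cast_ok)   # False
print(nullable.dtype)  # UInt8
```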
Additional Notes
Smaller types == smaller memory footprint.
df.info()
# <class 'pandas.core.frame.DataFrame'>
# RangeIndex: 100 entries, 0 to 99
# Data columns (total 4 columns):
# # Column Non-Null Count Dtype
# --- ------ -------------- -----
# 0 integers 100 non-null float64
# 1 floats 100 non-null float64
# 2 booleans 100 non-null int64
# 3 categories 100 non-null object
# dtypes: float64(2), int64(1), object(1)
# memory usage: 3.2+ KB
df_downcast.info()
# <class 'pandas.core.frame.DataFrame'>
# RangeIndex: 100 entries, 0 to 99
# Data columns (total 4 columns):
# # Column Non-Null Count Dtype
# --- ------ -------------- -----
# 0 integers 100 non-null uint8
# 1 floats 100 non-null float32
# 2 booleans 100 non-null bool
# 3 categories 100 non-null category
# dtypes: bool(1), category(1), float32(1), uint8(1)
# memory usage: 932.0 bytes
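To measure the saving directly rather than reading it off `df.info()`, one option is `DataFrame.memory_usage(deep=True)`. The sketch below applies two of the casts from the output above by hand with `astype`; it does not call `pdcast`, and the two-column frame is only an illustration. Note that `df.info()` reports shallow usage for object columns (hence the `+` in `3.2+ KB`), while `deep=True` also counts the Python string objects.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "integers": np.linspace(1, 100, 100),
    "categories": np.random.choice(["foo", "bar", "baz"], 100),
})

# Manual equivalent of two of the casts shown above (no pdcast involved).
df_small = df.astype({"integers": "uint8", "categories": "category"})

# deep=True includes the memory held by the strings in object columns.
before = df.memory_usage(deep=True).sum()
after = df_small.memory_usage(deep=True).sum()

print(f"before: {before} bytes, after: {after} bytes")
```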
Numerical data types will be downcast if the resulting values are within tolerance of the original values.
For details on tolerance for numeric comparison, see the notes on np.allclose.
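As a rough illustration of the tolerance check, the sketch below casts an array to a smaller dtype and compares the result with the original using `np.allclose`. The helper `survives_cast` is hypothetical and is not part of `pdcast`; the default `rtol` and `atol` simply mirror NumPy's own defaults. Under these assumptions the `floats` data passes for `float32` but not for `float16`.

```python
import numpy as np


def survives_cast(values, dtype, rtol=1e-05, atol=1e-08):
    """Hypothetical helper: True if casting to dtype stays within tolerance."""
    cast = values.astype(dtype)
    # np.allclose checks |cast - values| <= atol + rtol * |values| element-wise.
    return bool(np.allclose(cast, values, rtol=rtol, atol=atol))


floats = np.linspace(1, 1000, 100).round(2)

fits_float32 = survives_cast(floats, np.float32)
fits_float16 = survives_cast(floats, np.float16)

print(fits_float32)  # True: float32 rounding error is well below rtol=1e-05
print(fits_float16)  # False: float16 keeps only about three decimal digits
```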
print(df.head())
# integers floats booleans categories
# 0 1.0 1.00 1 foo
# 1 2.0 11.09 0 baz
# 2 3.0 21.18 1 bar
# 3 4.0 31.27 0 bar
# 4 5.0 41.36 0 foo
print(df_downcast.head())
# integers floats booleans categories
# 0 1 1.000000 True foo
# 1 2 11.090000 False baz
# 2 3 21.180000 True bar
# 3 4 31.270000 False bar
# 4 5 41.360001 False foo
print(pdc.options.ATOL)
# >>> 1e-08
print(pdc.options.RTOL)
# >>> 1e-05
Tolerance can be set at module level or passed in function arguments:
pdc.options.ATOL = 1e-10
pdc.options.RTOL = 1e-10
df_downcast_new = pdc.downcast(df)
Or pass them for a single call:
infer_dtype_kws = {
"ATOL": 1e-10,
"RTOL": 1e-10
}
df_downcast_new = pdc.downcast(df, infer_dtype_kws=infer_dtype_kws)
The floats column is now kept as float64 to meet the tolerance requirement.
Values in the integers column are still safely cast to uint8.
df_downcast_new.info()
# <class 'pandas.core.frame.DataFrame'>
# RangeIndex: 100 entries, 0 to 99
# Data columns (total 4 columns):
# # Column Non-Null Count Dtype
# --- ------ -------------- -----
# 0 integers 100 non-null uint8
# 1 floats 100 non-null float64
# 2 booleans 100 non-null bool
# 3 categories 100 non-null category
# dtypes: bool(1), category(1), float64(1), uint8(1)
# memory usage: 1.3 KB
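The result above can be checked by hand with NumPy alone. The sketch below does not use `pdcast`; it measures the largest absolute error introduced by each cast and compares it against the stricter tolerances using `np.allclose`. The variable names are illustrative.

```python
import numpy as np

rtol = atol = 1e-10  # the stricter tolerances used above

floats = np.linspace(1, 1000, 100).round(2)
integers = np.linspace(1, 100, 100)

# Values such as 41.36 have no exact float32 representation, so the cast
# introduces a rounding error far larger than 1e-10.
floats32 = floats.astype(np.float32)
float_err = np.abs(floats32 - floats).max()
float_ok = bool(np.allclose(floats32, floats, rtol=rtol, atol=atol))

# Whole numbers from 1 to 100 are represented exactly by uint8.
ints8 = integers.astype(np.uint8)
int_err = np.abs(ints8 - integers).max()
int_ok = bool(np.allclose(ints8, integers, rtol=rtol, atol=atol))

print(float_ok, float_err)  # False, with a clearly non-zero error
print(int_ok, int_err)      # True 0.0
```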