A Python package to compress pandas DataFrames akin to Stata's `compress` command
df-compress
A Python package to compress pandas DataFrames akin to Stata's `compress` command. This function may prove particularly helpful to those dealing with large datasets.
Installation
You can install df-compress by running the following command:
pip install df_compress
How to use
After installing the package, use the following import:
from df_compress import compress
Example
The following is a reproducible example of df-compress usage:
from df_compress import compress
import pandas as pd
import numpy as np

size = 1_000_000
df = pd.DataFrame(columns=["Year", "State", "Value", "Int_value"])
df["Year"] = np.random.randint(low=2000, high=2023, size=size).astype(str)  # years stored as strings
df["State"] = np.random.choice(["RJ", "SP", "ES", "MT"], size=size)
df["Value"] = np.random.rand(size)
df["Int_value"] = df["Value"] * 10 // 1  # whole numbers stored as floats

# compress modifies df in place, so there is no need to reassign it
compress(df, show_conversions=True, parallel=False)
This prints the conversions made and the memory saved:
Initial memory usage: 114.44 MB
Final memory usage: 7.63 MB
Memory reduced by: 106.81 MB (93.3%)
Variable type conversions:
column     from     to        memory saved (MB)
Year       object   int16     48.637264
State      object   category  47.683231
Value      float64  float32   3.814571
Int_value  float64  int8      6.675594
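The conversions in the table can be reproduced by hand with plain pandas, which helps when checking what a given downcast saves. The sketch below is not df-compress's implementation; it only assumes that the reported dtypes (int16, category, float32, int8) are the targets, uses a smaller `size` to stay quick, and measures memory with `memory_usage(deep=True)`.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
size = 100_000  # smaller than the example above, to keep the sketch quick

df = pd.DataFrame({
    "Year": rng.integers(2000, 2023, size=size).astype(str),
    "State": rng.choice(["RJ", "SP", "ES", "MT"], size=size),
    "Value": rng.random(size),
})
df["Int_value"] = df["Value"] * 10 // 1  # whole numbers stored as float64

before_mb = df.memory_usage(deep=True).sum() / 1024**2

# Manual equivalents of the conversions reported in the table (an assumption
# about the targets, not the package's actual code path).
manual = pd.DataFrame({
    "Year": pd.to_numeric(df["Year"], downcast="integer"),   # strings -> int16
    "State": df["State"].astype("category"),                 # strings -> category
    "Value": df["Value"].astype("float32"),                  # float64 -> float32
    "Int_value": pd.to_numeric(df["Int_value"].astype("int64"), downcast="integer"),
})

after_mb = manual.memory_usage(deep=True).sum() / 1024**2

print(manual.dtypes)
print(f"{before_mb:.2f} MB -> {after_mb:.2f} MB")
```

Note that the float32 cast keeps only about seven significant digits of `Value`, so it is worth confirming that this precision is acceptable for your data.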
Optional Parameters
The function has four optional parameters (arguments):
- convert_strings (bool): whether to attempt to parse object columns as numbers. Defaults to True.
- numeric_threshold (float): the proportion of valid numeric entries needed to convert a string column to numeric. Defaults to 0.999.
- show_conversions (bool): whether to report the changes made column by column. Defaults to False.
- parallel (bool): whether to compress the columns in parallel. Defaults to False.
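To make `numeric_threshold` concrete: one plausible reading is that a string column is converted only when the share of its entries that parse as numbers reaches the threshold. The sketch below illustrates that reading with plain pandas; `numeric_share` is a hypothetical helper, and how df-compress itself computes the share (for example, how it treats missing values) is not specified here.

```python
import pandas as pd


def numeric_share(column: pd.Series) -> float:
    """Share of entries that parse as numbers (hypothetical helper)."""
    parsed = pd.to_numeric(column, errors="coerce")  # unparseable entries become NaN
    return float(parsed.notna().mean())


clean = pd.Series(["2001", "2002", "2003", "2004"], dtype=object)
dirty = pd.Series(["2001", "2002", "n/a", "2004"], dtype=object)

threshold = 0.999  # the documented default of numeric_threshold

print(numeric_share(clean) >= threshold)  # True: every entry is numeric
print(numeric_share(dirty) >= threshold)  # False: only 3 of 4 entries parse
```

In this sketch, missing values count against the share, which may or may not match the package's behavior.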
Parallelization Caveats
Parallelization is implemented using Dask with a local client, and the work is parallelized across columns. Opting for parallel compression therefore does not guarantee performance improvements and should be a conscious decision made on a case-by-case basis. To illustrate, the example above runs significantly slower with parallel compression (a speedup of 0.29x, that is, it takes more than three times as long).
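Parallelizing across columns means that each column is handled as an independent task. The sketch below shows that pattern with the standard library's `ThreadPoolExecutor` instead of Dask, so that it stays self-contained; `shrink_column` is a hypothetical stand-in for the per-column compression step and is not the package's code.

```python
from concurrent.futures import ThreadPoolExecutor

import numpy as np
import pandas as pd


def shrink_column(column: pd.Series) -> pd.Series:
    """Hypothetical per-column step: downcast integers, leave other dtypes alone."""
    if pd.api.types.is_integer_dtype(column):
        return pd.to_numeric(column, downcast="integer")
    return column


df = pd.DataFrame({
    "small": np.arange(100, dtype="int64"),         # values fit in int8
    "medium": np.arange(100, dtype="int64") * 300,  # values need int16
    "text": ["a"] * 100,                            # left untouched
})

# One task per column; pool.map preserves the column order.
with ThreadPoolExecutor(max_workers=3) as pool:
    shrunk = list(pool.map(shrink_column, [df[name] for name in df.columns]))

result = pd.concat(shrunk, axis=1)
print(result.dtypes)
```

With only three cheap columns, setting up the pool likely costs more than it saves, which is the point made in this section.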
As far as I know, parallelization does not guarantee efficiency because of overhead time. Whenever you run code in parallel, you must "organize" it before computing the operation, which takes some time. If the efficiency gains from parallelizing the operation do not cover the overhead, you incur an efficiency loss. Therefore, my recommendation is to opt for parallel compression only when you have a DataFrame with many columns.
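The trade-off can be put into a simple back-of-the-envelope model: assume a fixed setup cost and a perfectly even split of the work across workers. Both assumptions are idealizations, and the inputs below are hypothetical, chosen only to show where the break-even point lies.

```python
def parallel_speedup(serial_seconds: float, workers: int, overhead_seconds: float) -> float:
    """Speedup under an idealized model: fixed overhead plus evenly split work."""
    parallel_seconds = overhead_seconds + serial_seconds / workers
    return serial_seconds / parallel_seconds


def break_even_seconds(workers: int, overhead_seconds: float) -> float:
    """Serial runtime at which the modeled speedup is exactly 1 (needs workers > 1)."""
    return overhead_seconds * workers / (workers - 1)


# Hypothetical inputs: 12 workers and 2 seconds of setup overhead.
for serial in (1.0, 5.0, 20.0):
    speedup = parallel_speedup(serial, workers=12, overhead_seconds=2.0)
    print(f"serial {serial:4.1f}s -> modeled speedup {speedup:.2f}x")

print(f"break-even at {break_even_seconds(12, 2.0):.2f}s of serial work")
```

Under this model, short jobs lose to the overhead and long jobs win, which is consistent with reserving parallel compression for DataFrames with many columns.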
The following quick benchmark, run on a 12-CPU computer, gives some perspective on when to use parallel compression:
import pandas as pd
from df_compress import compress
import sys, os
import numpy as np
from time import time

class HiddenPrints:
    def __enter__(self):
        self._original_stdout = sys.stdout
        sys.stdout = open(os.devnull, 'w')

    def __exit__(self, exc_type, exc_val, exc_tb):
        sys.stdout.close()
        sys.stdout = self._original_stdout

def timereps(reps, func):
    start = time()
    for i in range(0, reps):
        func()
    end = time()
    return (end - start) / reps

def benchmark_compression(df):
    print("Running benchmark on DataFrame with shape:", df.shape, "\n")
    # Non-parallel
    print("Testing non-parallel compression...")
    with HiddenPrints():
        time_non_parallel = timereps(10, lambda: compress(df.copy(deep=True), parallel=False, show_conversions=False))
    print(f"Non-parallel time: {time_non_parallel:.2f} seconds\n")
    # Parallel
    print("Testing parallel compression...")
    with HiddenPrints():
        time_parallel = timereps(10, lambda: compress(df.copy(deep=True), parallel=True, show_conversions=False))
    print(f"Parallel time: {time_parallel:.2f} seconds\n")
    # Summary
    speedup = time_non_parallel / time_parallel if time_parallel > 0 else float('inf')
    print(f"Parallel speedup: {speedup:.2f}x")

def generate_test_dataframe(n_rows=1_000_000, n_object_cols=10, n_numeric_cols=10):
    data = {}
    for i in range(n_object_cols):
        data[f"obj_{i}"] = np.random.choice(['A', 'B', 'C', 'D', 'E'], size=n_rows)
    for i in range(n_numeric_cols):
        data[f"num_{i}"] = np.random.randn(n_rows)
    return pd.DataFrame(data)
When testing on a 40-column DataFrame (`benchmark_compression(generate_test_dataframe(n_object_cols=20, n_numeric_cols=20))`), I find:
Running benchmark on DataFrame with shape: (1000000, 40)
Testing non-parallel compression...
Non-parallel time: 17.60 seconds
Testing parallel compression...
Parallel time: 12.06 seconds
Parallel speedup: 1.46x