Project description

df-compress

A Python package to compress pandas DataFrames, akin to Stata's compress command. It may prove particularly helpful to anyone dealing with large datasets.

Installation

You can install df-compress by running the following command:

pip install df_compress

How to use

After installing the package, use the following import:

from df_compress import compress

Example

Here is a reproducible example of df-compress usage:

from df_compress import compress
import pandas as pd
import numpy as np

size = 1_000_000
# Build the columns in one step: assigning full-length arrays to an empty
# DataFrame column by column would raise a length-mismatch error.
df = pd.DataFrame({
    "Year": np.random.randint(low=2000, high=2023, size=size).astype(str),
    "State": np.random.choice(["RJ", "SP", "ES", "MT"], size=size),
    "Value": np.random.rand(size),
})
df["Int_value"] = df["Value"] * 10 // 1

compress(df, show_conversions=True, parallel=False)  # modifies df in place; no reassignment needed

This prints the conversions performed and the memory saved:

Initial memory usage: 114.44 MB
Final memory usage: 7.63 MB
Memory reduced by: 106.81 MB (93.3%)

Variable type conversions:
   column    from       to  memory saved (MB)
     Year  object    int16          48.637264
    State  object category          47.683231
    Value float64  float32           3.814571
Int_value float64     int8           6.675594
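
After the call you can verify the result with pandas itself (a quick check, not part of the package's output):

print(df.dtypes)  # Year int16, State category, Value float32, Int_value int8
print(df.memory_usage(deep=True).sum() / 1024**2)  # ~7.63, matching the report above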

Optional Parameters

The function has four optional parameters (arguments):

  • convert_strings (bool): whether to attempt to parse object columns as numbers
    • defaults to True
  • numeric_threshold (float): the proportion of entries in a string column that must parse as numeric for the column to be converted (see the sketch after this list)
    • defaults to 0.999
  • show_conversions (bool): whether to report the changes made column by column
    • defaults to False
  • parallel (bool): whether to compress the columns in parallel
    • defaults to False
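
For example, to relax the conversion rule so a string column converts when at least 95% of its entries parse as numbers, and to report what changed (a usage sketch built from the parameters above, reusing the df from the earlier example):

compress(df, numeric_threshold=0.95, show_conversions=True)

# Or skip string parsing entirely and compress quietly:
compress(df, convert_strings=False)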

Parallelization Caveats

Parallelization is implemented using Dask with a local client, and the work is split at the column level. Opting for parallel compression therefore does not guarantee a performance improvement and should be a conscious, case-by-case decision. To illustrate, the example above actually runs significantly slower with parallel compression enabled (a 0.29x "speedup").

As far as I know, the reason parallelization does not guarantee a speedup is overhead: before the operation itself runs, the work must be organized and dispatched, which takes time. If the gains from running in parallel do not cover that overhead, you incur a net loss. My recommendation is therefore to opt for parallel compression only when your DataFrame has many columns, as sketched below.
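
To make the column-level split concrete, here is a minimal sketch of per-column work dispatched through Dask's delayed interface. It is illustrative only, not df-compress's actual internals, and the toy compress_column helper is hypothetical:

import dask
import numpy as np
import pandas as pd

def compress_column(col):
    # Toy stand-in for per-column compression: downcast numeric dtypes.
    if pd.api.types.is_numeric_dtype(col):
        kind = "float" if pd.api.types.is_float_dtype(col) else "integer"
        return pd.to_numeric(col, downcast=kind)
    return col

df = pd.DataFrame({"a": np.random.rand(10_000), "b": np.arange(10_000)})

# One delayed task per column: the scheduler must set up and coordinate
# these tasks, and that fixed cost is the overhead described above.
tasks = [dask.delayed(compress_column)(df[c]) for c in df.columns]
result = pd.concat(dask.compute(*tasks), axis=1)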

Below is a quick benchmark, run on a 12-CPU machine, to give you a sense of when to use parallel compression:

import pandas as pd
from df_compress import compress
import sys, os
import numpy as np
from time import time

# Context manager that silences stdout so compress()'s reports do not
# clutter the benchmark output.
class HiddenPrints:
    def __enter__(self):
        self._original_stdout = sys.stdout
        sys.stdout = open(os.devnull, 'w')

    def __exit__(self, exc_type, exc_val, exc_tb):
        sys.stdout.close()
        sys.stdout = self._original_stdout

# Average wall-clock time of func over reps runs.
def timereps(reps, func):
    start = time()
    for _ in range(reps):
        func()
    return (time() - start) / reps

def benchmark_compression(df):
    print("Running benchmark on DataFrame with shape:", df.shape, "\n")
    
    # Non-parallel
    print("Testing non-parallel compression...")
    with HiddenPrints():
        time_non_parallel = timereps(10, lambda: compress(df.copy(deep=True), parallel=False, show_conversions=False))
    print(f"Non-parallel time: {time_non_parallel:.2f} seconds\n")
    
    # Parallel
    print("Testing parallel compression...")
    with HiddenPrints():
        time_parallel = timereps(10, lambda: compress(df.copy(deep=True), parallel=True, show_conversions=False))
    print(f"Parallel time: {time_parallel:.2f} seconds\n")
    
    # Summary
    speedup = time_non_parallel / time_parallel if time_parallel > 0 else float('inf')
    print(f"Parallel speedup: {speedup:.2f}x")

def generate_test_dataframe(n_rows=1_000_000, n_object_cols=10, n_numeric_cols=10):
    # Mix object (string) and numeric columns to exercise both conversion paths.
    data = {}
    for i in range(n_object_cols):
        data[f"obj_{i}"] = np.random.choice(['A', 'B', 'C', 'D', 'E'], size=n_rows)
    for i in range(n_numeric_cols):
        data[f"num_{i}"] = np.random.randn(n_rows)
    return pd.DataFrame(data)

Testing a 40-column DataFrame with benchmark_compression(generate_test_dataframe(n_object_cols=20, n_numeric_cols=20)), I find:

Running benchmark on DataFrame with shape: (1000000, 40) 

Testing non-parallel compression...
Non-parallel time: 17.60 seconds

Testing parallel compression...
Parallel time: 12.06 seconds

Parallel speedup: 1.46x
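
With 40 columns there is enough per-column work to amortize the scheduler's fixed overhead, which is why the parallel run comes out ahead here (1.46x) while the 4-column example above does not (0.29x).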

Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

df_compress-0.7.0.tar.gz (6.0 kB)


Built Distribution

If you're not sure about the file name format, learn more about wheel file names.

df_compress-0.7.0-py3-none-any.whl (6.5 kB)


File details

Details for the file df_compress-0.7.0.tar.gz.

File metadata

  • Download URL: df_compress-0.7.0.tar.gz
  • Size: 6.0 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/6.1.0 CPython/3.13.3

File hashes

Hashes for df_compress-0.7.0.tar.gz:

  • SHA256: 02c5a0424aa37a085a7346abbc7f278c79cc50cd214b0eccfc37941ed9a05c8c
  • MD5: 988d56ae28efabb2734b132638039b9f
  • BLAKE2b-256: 211702a5f01bc81ea6090a86d668930b32bb7450c51e516778eda58aaa6aec22

See more details on using hashes here.

File details

Details for the file df_compress-0.7.0-py3-none-any.whl.

File metadata

  • Download URL: df_compress-0.7.0-py3-none-any.whl
  • Size: 6.5 kB
  • Tags: Python 3
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/6.1.0 CPython/3.13.3

File hashes

Hashes for df_compress-0.7.0-py3-none-any.whl:

  • SHA256: 3f505ce7f2034a5b9917965d04f66098bcb9603fe5b1c2af08d793a78e4c47c7
  • MD5: 29a228abb6aff3a6986dfe84d77b8aad
  • BLAKE2b-256: 47ebcb89c8a44eddc500bdc84e0b997f4ed51af3b9564f7e93b5bce32c210ff7

See more details on using hashes here.
