A Python package for efficiently loading CSV files with optimized data types.

Project description

csv_optimizer (v0.20)

csv_optimizer is a Python utility for loading CSV files into Pandas while optimizing memory usage. It assigns appropriate data types based on a sample of the dataset, which can substantially reduce memory consumption for large files. Instead of loading the full dataset at once, it first reads the file in chunks (default: 1000 rows per chunk) and determines the most appropriate dtype from a sampled fraction (default: 10% of the complete dataset).
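The chunked-sampling idea can be sketched in a few lines of plain pandas (a minimal illustration of the approach, not the package's actual implementation; the function name is made up):

```python
import pandas as pd

def sample_chunks(path, chunksize=1000, sample_fraction=0.1):
    """Read a CSV in chunks and keep a random fraction of each chunk.

    The concatenated sample is what type detection would then run on.
    """
    samples = []
    for chunk in pd.read_csv(path, chunksize=chunksize):
        samples.append(chunk.sample(frac=sample_fraction))
    return pd.concat(samples, ignore_index=True)
```

Sampling per chunk rather than loading everything first keeps peak memory bounded by the chunk size plus the accumulated sample.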

Features

  • Uses chunking to efficiently process large datasets.
  • Detects and assigns int, float, category, datetime, and boolean data types.
  • Handles missing values in integer and boolean columns with configurable options.
  • Reduces memory usage compared to Pandas' default read_csv() behavior.
  • Supports different encodings.

Installation

Install the package locally:

pip install -e .

Or install from PyPI:

pip install csv_optimizer

Usage

Basic Example

from csv_optimizer import load_optimized_dataframe

df = load_optimized_dataframe("data.csv")
print(df.info())

Additional Options

df = load_optimized_dataframe(
    "data.csv",
    sample_fraction=0.1,          # Sample size for type detection (default: 10%)
    chunksize=1000,               # Number of rows per chunk when reading the CSV (default: 1000)
    use_float_for_nan_ints=False, # NaN-containing integer columns: False = nullable Int64, True = float32
    use_float_for_nan_bools=False,# NaN-containing Boolean columns: False = nullable boolean, True = float32
    encoding="utf-8"              # File encoding (default: 'latin1')
)

Note on use_float_for_nan_ints and use_float_for_nan_bools: NumPy's native integer and Boolean dtypes cannot hold NaN, so a column containing missing values must either be stored as float32, which uses less memory than Pandas' nullable Int64 or boolean dtypes, or as the nullable dtype, which preserves the correct type at the cost of extra memory. Users can choose whichever trade-off matters more in their workflow: less memory with an 'incorrect' dtype, or the correct dtype with additional memory overhead.
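The trade-off is easy to see directly in pandas (a quick illustration; the example data is made up):

```python
import pandas as pd

# Integer-valued data with missing entries; pandas infers float64 here.
s = pd.Series([1, 2, None] * 100_000)

as_float = s.astype("float32")  # compact: 4 bytes per value, but no longer an integer dtype
as_int = s.astype("Int64")      # correct nullable integer: 8 bytes per value plus a validity mask

print(as_float.memory_usage(deep=True))
print(as_int.memory_usage(deep=True))
```

For this column, float32 needs roughly half the memory of nullable Int64, which is the saving the `use_float_for_nan_*` flags trade against dtype correctness.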

CSV processing

  1. Reads the CSV file in chunks (default: 1000 rows per chunk) to improve efficiency when handling large files.
  2. Loads a sample (default: 10%) of the dataset to determine optimal data types.
  3. Detects column types and assigns the most efficient dtype:
    • Converts categorical-like columns to category
    • Optimizes integer columns (int8, int16, int32, int64)
    • Uses float32 where possible for floating-point numbers
    • Supports datetime parsing (detection still relies on trial-and-error parsing, which can emit warnings)
    • Detects Boolean columns (bool or Pandas nullable boolean)
    • Allows user-defined handling for NaN-containing columns
  4. Applies optimized dtypes when loading the full dataset.
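Steps 3 and 4 can be approximated with plain pandas, e.g. `to_numeric(downcast=...)` for numeric columns and a uniqueness threshold for categories (a rough sketch under assumed heuristics, not the package's exact logic; the 0.5 threshold is illustrative):

```python
import pandas as pd

def infer_dtype(col: pd.Series):
    """Pick a compact dtype for one sampled column (illustrative heuristic)."""
    if pd.api.types.is_integer_dtype(col):
        # Downcast to the smallest integer type that fits: int8/int16/int32/int64.
        return pd.to_numeric(col, downcast="integer").dtype
    if pd.api.types.is_float_dtype(col):
        # Downcast to float32 where the values allow it.
        return pd.to_numeric(col, downcast="float").dtype
    if col.nunique(dropna=True) / max(len(col), 1) < 0.5:
        # Few distinct values relative to length: treat as categorical.
        return "category"
    return col.dtype

# The resulting mapping can then be passed to the full load (step 4):
# dtypes = {c: infer_dtype(sample[c]) for c in sample.columns}
# df = pd.read_csv("data.csv", dtype=dtypes)
```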

Development & Contributions

Feel free to contribute!

git clone https://github.com/timmueller0/csv_optimizer.git
cd csv_optimizer
pip install -e .

Project details


Download files

Download the file for your platform.

Source Distribution

csv_optimizer-0.20.tar.gz (3.6 kB)

Uploaded Source

Built Distribution

csv_optimizer-0.20-py3-none-any.whl (4.5 kB)

Uploaded Python 3

File details

Details for the file csv_optimizer-0.20.tar.gz.

File metadata

  • Download URL: csv_optimizer-0.20.tar.gz
  • Upload date:
  • Size: 3.6 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/6.1.0 CPython/3.12.2

File hashes

Hashes for csv_optimizer-0.20.tar.gz
  • SHA256: 03f2eeaa729cd91574a5e046cd33922b2e458f839cb67008c67c80da4e4ff352
  • MD5: 8f3c433551f782cfd6543ffdcde382d0
  • BLAKE2b-256: d0333efa054656eb67e6b76f459b7dffe84cb7f6d86272fbc19ce888cfbf78ea


File details

Details for the file csv_optimizer-0.20-py3-none-any.whl.

File metadata

  • Download URL: csv_optimizer-0.20-py3-none-any.whl
  • Upload date:
  • Size: 4.5 kB
  • Tags: Python 3
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/6.1.0 CPython/3.12.2

File hashes

Hashes for csv_optimizer-0.20-py3-none-any.whl
  • SHA256: 93132a748194767c72475db2ccb315b240d7b9a34f323c37f81ab1acee1b54de
  • MD5: 558efbab43fbcd8a8bbe17ab7352d8b3
  • BLAKE2b-256: 26756efd23a453de78e3a48cc41aecbb5b2b182b620eaac1d745af2267ef2836

