A Python package for efficiently loading CSV files with optimized data types.

Project description

csv_optimizer (v0.20)

csv_optimizer is a Python utility for loading CSV files into Pandas while optimizing memory usage. It assigns appropriate data types based on a sample of the dataset, which can substantially reduce memory consumption for large files. Instead of loading the full dataset at once, it first reads the file in chunks (default: 1000 rows per chunk) and determines the most appropriate dtype from a sampled fraction (default: 10% of the complete dataset).
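The chunked-sampling idea can be sketched in a few lines of plain pandas (a minimal illustration of the approach, not the package's actual implementation; the function name is made up):

```python
import pandas as pd

def sample_chunks(path, chunksize=1000, sample_fraction=0.1):
    """Read a CSV in chunks and keep a random fraction of each chunk.

    The concatenated sample is what type detection would then run on.
    """
    samples = []
    for chunk in pd.read_csv(path, chunksize=chunksize):
        samples.append(chunk.sample(frac=sample_fraction))
    return pd.concat(samples, ignore_index=True)
```

Sampling per chunk rather than loading everything first keeps peak memory bounded by the chunk size plus the accumulated sample.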

Features

  • Uses chunking to efficiently process large datasets.
  • Detects and assigns int, float, category, datetime, and boolean data types.
  • Handles missing values in integer and boolean columns with configurable options.
  • Reduces memory usage compared to Pandas' default read_csv() behavior.
  • Supports different encodings.

Installation

Install the package locally:

pip install -e .

Or install from PyPI:

pip install csv_optimizer

Usage

Basic Example

from csv_optimizer import load_optimized_dataframe

df = load_optimized_dataframe("data.csv")
print(df.info())

Additional Options

df = load_optimized_dataframe(
    "data.csv",
    sample_fraction=0.1,          # Sample size for type detection (default: 10%)
    chunksize=1000,               # Number of rows per chunk when reading the CSV (default: 1000)
    use_float_for_nan_ints=False, # NaN-containing integer columns: False = nullable Int64, True = float32
    use_float_for_nan_bools=False,# NaN-containing Boolean columns: False = nullable boolean, True = float32
    encoding="utf-8"              # File encoding (default: 'latin1')
)

Note on use_float_for_nan_ints and use_float_for_nan_bools: NumPy's native integer and Boolean dtypes cannot hold NaN, so a column containing missing values must either be stored as float32, which uses less memory than Pandas' nullable Int64 or boolean dtypes, or as the nullable dtype, which preserves the correct type at the cost of extra memory. Users can choose whichever trade-off matters more in their workflow: less memory with an 'incorrect' dtype, or the correct dtype with additional memory overhead.
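The trade-off is easy to see directly in pandas (a quick illustration; the example data is made up):

```python
import pandas as pd

# Integer-valued data with missing entries; pandas infers float64 here.
s = pd.Series([1, 2, None] * 100_000)

as_float = s.astype("float32")  # compact: 4 bytes per value, but no longer an integer dtype
as_int = s.astype("Int64")      # correct nullable integer: 8 bytes per value plus a validity mask

print(as_float.memory_usage(deep=True))
print(as_int.memory_usage(deep=True))
```

For this column, float32 needs roughly half the memory of nullable Int64, which is the saving the `use_float_for_nan_*` flags trade against dtype correctness.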

CSV processing

  1. Reads the CSV file in chunks (default: 1000 rows per chunk) to improve efficiency when handling large files.
  2. Loads a sample (default: 10%) of the dataset to determine optimal data types.
  3. Detects column types and assigns the most efficient dtype:
    • Converts categorical-like columns to category
    • Optimizes integer columns (int8, int16, int32, int64)
    • Uses float32 where possible for floating-point numbers
    • Supports datetime parsing (detection still relies on trial-and-error parsing, which can emit warnings)
    • Detects Boolean columns (bool or Pandas nullable boolean)
    • Allows user-defined handling for NaN-containing columns
  4. Applies optimized dtypes when loading the full dataset.
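Steps 3 and 4 can be approximated with plain pandas, e.g. `to_numeric(downcast=...)` for numeric columns and a uniqueness threshold for categories (a rough sketch under assumed heuristics, not the package's exact logic; the 0.5 threshold is illustrative):

```python
import pandas as pd

def infer_dtype(col: pd.Series):
    """Pick a compact dtype for one sampled column (illustrative heuristic)."""
    if pd.api.types.is_integer_dtype(col):
        # Downcast to the smallest integer type that fits: int8/int16/int32/int64.
        return pd.to_numeric(col, downcast="integer").dtype
    if pd.api.types.is_float_dtype(col):
        # Downcast to float32 where the values allow it.
        return pd.to_numeric(col, downcast="float").dtype
    if col.nunique(dropna=True) / max(len(col), 1) < 0.5:
        # Few distinct values relative to length: treat as categorical.
        return "category"
    return col.dtype

# The resulting mapping can then be passed to the full load (step 4):
# dtypes = {c: infer_dtype(sample[c]) for c in sample.columns}
# df = pd.read_csv("data.csv", dtype=dtypes)
```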

Development & Contributions

Feel free to contribute!

git clone https://github.com/timmueller0/csv_optimizer.git
cd csv_optimizer
pip install -e .

Project details


Download files

Download the file for your platform.

Source Distribution

csv_optimizer-0.20.tar.gz (3.6 kB)

Uploaded Source

Built Distribution

csv_optimizer-0.20-py3-none-any.whl (4.5 kB)

Uploaded Python 3

File details

Details for the file csv_optimizer-0.20.tar.gz.

File metadata

  • Download URL: csv_optimizer-0.20.tar.gz
  • Upload date:
  • Size: 3.6 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/6.1.0 CPython/3.12.2

File hashes

Hashes for csv_optimizer-0.20.tar.gz
  • SHA256: 03f2eeaa729cd91574a5e046cd33922b2e458f839cb67008c67c80da4e4ff352
  • MD5: 8f3c433551f782cfd6543ffdcde382d0
  • BLAKE2b-256: d0333efa054656eb67e6b76f459b7dffe84cb7f6d86272fbc19ce888cfbf78ea


File details

Details for the file csv_optimizer-0.20-py3-none-any.whl.

File metadata

  • Download URL: csv_optimizer-0.20-py3-none-any.whl
  • Upload date:
  • Size: 4.5 kB
  • Tags: Python 3
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/6.1.0 CPython/3.12.2

File hashes

Hashes for csv_optimizer-0.20-py3-none-any.whl
  • SHA256: 93132a748194767c72475db2ccb315b240d7b9a34f323c37f81ab1acee1b54de
  • MD5: 558efbab43fbcd8a8bbe17ab7352d8b3
  • BLAKE2b-256: 26756efd23a453de78e3a48cc41aecbb5b2b182b620eaac1d745af2267ef2836

