A Python package for efficiently loading CSV files with optimized data types.
Project description
csv_optimizer (v0.20)
csv_optimizer is a Python utility for loading CSV files into Pandas while optimizing memory usage. It assigns appropriate data types based on a dataset sample, reducing unnecessary memory consumption, which can be enormous for large datasets. Instead of loading the full dataset at once, it processes the file in chunks (default: 1000 rows per chunk) and determines the most appropriate dtype from a sampled fraction (default: 10% of the complete dataset).
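The sample-then-load idea described above can be sketched with plain Pandas. This is a minimal illustration, not the package's actual implementation: the helper name `read_in_chunks` is hypothetical, the "sample" is simply the first 10% of the rows, and `pd.to_numeric(..., downcast=...)` stands in for whatever detection logic csv_optimizer uses internally.

```python
import io

import pandas as pd


def read_in_chunks(source, dtypes, chunksize=1000):
    """Read a CSV chunk by chunk with fixed dtypes and concatenate the chunks."""
    chunks = pd.read_csv(source, dtype=dtypes, chunksize=chunksize)
    return pd.concat(chunks, ignore_index=True)


# Small in-memory CSV standing in for a large file on disk.
csv_text = "id,score\n" + "\n".join(f"{i},{i * 0.5}" for i in range(5000))

# Step 1: read only a sample (assumption: the first 10% of the rows).
sample = pd.read_csv(io.StringIO(csv_text), nrows=500)

# Step 2: derive narrower dtypes from the sample.
dtypes = {
    "id": pd.to_numeric(sample["id"], downcast="integer").dtype,
    "score": pd.to_numeric(sample["score"], downcast="float").dtype,
}

# Step 3: load the full file in chunks using those dtypes.
optimized = read_in_chunks(io.StringIO(csv_text), dtypes)
default = pd.read_csv(io.StringIO(csv_text))

print(optimized.dtypes)
print("default bytes:  ", default.memory_usage(index=False).sum())
print("optimized bytes:", optimized.memory_usage(index=False).sum())
```

One caveat of any sample-based approach: a dtype inferred from a sample can turn out to be too narrow for values that only appear later in the file, so the sample should be representative of the full value range.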
Features
- Uses chunking to efficiently process large datasets.
- Detects and assigns int, float, category, datetime, and boolean data types.
- Handles missing values in integer and boolean columns with configurable options.
- Reduces memory usage compared to Pandas' default read_csv() behavior.
- Supports different encodings.
Installation
Install the package locally:
pip install -e .
Or install it from PyPI:
pip install csv_optimizer
Usage
Basic Example
from csv_optimizer import load_optimized_dataframe
df = load_optimized_dataframe("data.csv")
df.info()  # prints the summary itself; wrapping it in print() would also print "None"
Additional Options
df = load_optimized_dataframe(
    "data.csv",
    sample_fraction=0.1,            # Fraction of rows sampled for type detection (default: 10%)
    chunksize=1000,                 # Number of rows per chunk when reading the CSV (default: 1000)
    use_float_for_nan_ints=False,   # NaN-containing integer columns: True -> float32, False -> nullable Int64
    use_float_for_nan_bools=False,  # NaN-containing Boolean columns: True -> float32, False -> nullable boolean
    encoding="utf-8",               # File encoding (default: 'latin1')
)
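Note on use_float_for_nan_ints and use_float_for_nan_bools: NumPy's integer and Boolean dtypes cannot represent NaN, so a column with missing values has to be stored either as a float (float32 in this package; plain Pandas falls back to float64 for integer columns) or as a Pandas nullable dtype (Int64 or boolean). float32 needs 4 bytes per value, while nullable Int64 needs 9 (8 for the value plus a 1-byte mask), so the float option saves memory for integer columns. Nullable boolean needs only 2 bytes per value, so for Boolean columns the float option does not save memory. Users can choose what is more important in their workflow: a plain NumPy float column with an 'incorrect' dtype, or the correct dtype with the overhead of a nullable extension type.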
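The trade-off can be checked directly in Pandas. The snippet below is a small sketch that does not call csv_optimizer; it builds the candidate representations by hand and compares their memory footprint. The last lines show a further cost of the float option that is worth keeping in mind: float32 cannot represent every integer above 2**24 exactly.

```python
import numpy as np
import pandas as pd

n = 1000  # number of repetitions of a 3-value pattern with one missing entry

# Integer column with missing values: float32 vs. nullable Int64.
ints_float32 = pd.Series([1.0, 2.0, np.nan] * n, dtype="float32")
ints_nullable = pd.Series(pd.array([1, 2, None] * n, dtype="Int64"))

# Boolean column with missing values: float32 vs. nullable boolean.
bools_float32 = pd.Series([1.0, 0.0, np.nan] * n, dtype="float32")
bools_nullable = pd.Series(pd.array([True, False, None] * n, dtype="boolean"))

memory = {
    "ints as float32": ints_float32.memory_usage(index=False),
    "ints as Int64": ints_nullable.memory_usage(index=False),
    "bools as float32": bools_float32.memory_usage(index=False),
    "bools as boolean": bools_nullable.memory_usage(index=False),
}
for label, size in memory.items():
    print(f"{label:>17}: {size} bytes")

# float32 has a 24-bit significand, so large integers get rounded.
big = 2**24 + 1
print(int(np.float32(big)) == big)  # False: the value was rounded
```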
CSV processing
- Reads the CSV file in chunks (default: 1000 rows per chunk) to improve efficiency when handling large files.
- Loads a sample (default: 10%) of the dataset to determine optimal data types.
- Detects column types and assigns the most efficient dtype:
  - Converts categorical-like columns to category
  - Optimizes integer columns (int8, int16, int32, int64)
  - Uses float32 where possible for floating-point numbers
  - Supports datetime parsing (still relies on trial and error, leading to some warning messages)
  - Detects Boolean columns (bool or Pandas nullable boolean)
  - Allows user-defined handling for NaN-containing columns
- Applies optimized dtypes when loading the full dataset.
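As a rough illustration of the detection step, the sketch below suggests a dtype per column from a sample. The function `suggest_dtype` and its `max_category_ratio` threshold are hypothetical and chosen for this example only; the package's real heuristics may differ. Datetime detection and NaN handling are left out to keep the sketch short.

```python
import numpy as np
import pandas as pd


def suggest_dtype(col: pd.Series, max_category_ratio: float = 0.5) -> str:
    """Suggest a compact dtype for one sampled column (illustrative heuristic)."""
    if pd.api.types.is_bool_dtype(col):
        return "bool"
    if pd.api.types.is_integer_dtype(col):
        # Pick the narrowest signed integer type that holds the observed range.
        for name in ("int8", "int16", "int32", "int64"):
            info = np.iinfo(name)
            if info.min <= col.min() and col.max() <= info.max:
                return name
    if pd.api.types.is_float_dtype(col):
        return "float32"
    # Text column: use category when the share of distinct values is small.
    if col.nunique() / max(len(col), 1) <= max_category_ratio:
        return "category"
    return "object"


sample = pd.DataFrame(
    {
        "age": [23, 35, 61, 47],
        "price": [9.99, 14.5, 3.25, 7.0],
        "city": ["Berlin", "Paris", "Berlin", "Paris"],
        "active": [True, False, True, True],
    }
)

dtypes = {name: suggest_dtype(sample[name]) for name in sample.columns}
print(dtypes)
# {'age': 'int8', 'price': 'float32', 'city': 'category', 'active': 'bool'}
```

A mapping like this can then be passed to `pd.read_csv(..., dtype=dtypes)` when loading the full file.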
Development & Contributions
Feel free to contribute!
git clone https://github.com/timmueller0/csv_optimizer.git
cd csv_optimizer
pip install -e .
Download files
Source Distribution
Built Distribution
File details
Details for the file csv_optimizer-0.20.tar.gz.
File metadata
- Download URL: csv_optimizer-0.20.tar.gz
- Upload date:
- Size: 3.6 kB
- Tags: Source
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.1.0 CPython/3.12.2
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | 03f2eeaa729cd91574a5e046cd33922b2e458f839cb67008c67c80da4e4ff352 |
| MD5 | 8f3c433551f782cfd6543ffdcde382d0 |
| BLAKE2b-256 | d0333efa054656eb67e6b76f459b7dffe84cb7f6d86272fbc19ce888cfbf78ea |
File details
Details for the file csv_optimizer-0.20-py3-none-any.whl.
File metadata
- Download URL: csv_optimizer-0.20-py3-none-any.whl
- Upload date:
- Size: 4.5 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.1.0 CPython/3.12.2
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | 93132a748194767c72475db2ccb315b240d7b9a34f323c37f81ab1acee1b54de |
| MD5 | 558efbab43fbcd8a8bbe17ab7352d8b3 |
| BLAKE2b-256 | 26756efd23a453de78e3a48cc41aecbb5b2b182b620eaac1d745af2267ef2836 |