deepcsv
A Python library that automatically walks through folders and subfolders, finds all CSV and XLSX files, detects and fixes data issues, and saves the results as Parquet files while keeping the exact same folder structure.
Installation
pip install deepcsv
What it does
- Walks through all folders and subfolders automatically
- Finds every CSV and XLSX file
- Detects columns that contain list strings like "['item1', 'item2']" and converts them into real Python arrays for faster performance
- Detects columns with mixed data types and tries to fix them automatically
- Warns you when a column has mixed types so you know what was changed
- Saves the results as Parquet files to preserve the converted data types
Why Parquet? CSV files cannot store arrays or preserve data types. Parquet solves this by keeping the exact types after conversion.
Why arrays instead of Python lists? Arrays are significantly faster for numerical operations and machine learning workflows.
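As a rough illustration of the list-string detection described above, the sketch below uses the standard library's `ast.literal_eval` to turn a string such as "['item1', 'item2']" into a real list. The helper names `parse_list_string` and `is_list_string_column` are hypothetical and are not part of deepcsv's API. How the library itself performs this step, including the conversion from lists to arrays, is not shown here.

```python
import ast
import math


def parse_list_string(value):
    """Return a list if `value` is a string holding a list literal, else `value` unchanged."""
    if not isinstance(value, str):
        return value  # NaN, numbers, etc. pass through untouched
    text = value.strip()
    if not (text.startswith("[") and text.endswith("]")):
        return value
    try:
        parsed = ast.literal_eval(text)
    except (ValueError, SyntaxError):
        return value  # looks like a list but is not a valid Python literal
    return parsed if isinstance(parsed, list) else value


def is_missing(value):
    """True for None and float NaN, the usual markers for an empty cell."""
    return value is None or (isinstance(value, float) and math.isnan(value))


def is_list_string_column(values):
    """True if the column has data and every non-missing value parses to a list."""
    present = [v for v in values if not is_missing(v)]
    return bool(present) and all(isinstance(parse_list_string(v), list) for v in present)


# Hypothetical column: two list strings and one missing value.
column = ["['item1', 'item2']", float("nan"), "[1, 2, 3]"]
converted = [parse_list_string(v) for v in column]
```

`ast.literal_eval` only evaluates Python literals, so it is a safer choice than `eval` for text read from a file.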
Functions
ConvertListStrToList(file_path)
Reads a CSV file, converts list strings to arrays, fixes mixed-type columns, and returns a clean DataFrame.
import deepcsv
df = deepcsv.ConvertListStrToList("path/to/file.csv")
ReadAllCSVData(path)
Walks through all folders and subfolders, applies ConvertListStrToList to every CSV and XLSX file, and saves the results as Parquet files in a new folder called "All CSV Data is Converted Here".
import deepcsv
deepcsv.ReadAllCSVData("path/to/folder")
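To make the "same folder structure" behaviour concrete, here is a minimal sketch of how source files could be mapped to mirrored Parquet paths using only the standard library. The function `plan_parquet_outputs` is hypothetical and not part of deepcsv, and placing the output folder inside the scanned root is an assumption; the library may put it elsewhere.

```python
import os
from pathlib import Path

# Folder name taken from the description of ReadAllCSVData.
OUTPUT_DIR_NAME = "All CSV Data is Converted Here"


def plan_parquet_outputs(root):
    """Map every CSV/XLSX file under `root` to a mirrored .parquet path."""
    root = Path(root)
    out_root = root / OUTPUT_DIR_NAME  # assumption: output folder lives inside `root`
    plan = {}
    for dirpath, dirnames, filenames in os.walk(root):
        # Do not descend into the output folder on repeated runs.
        dirnames[:] = [d for d in dirnames if d != OUTPUT_DIR_NAME]
        for name in sorted(filenames):
            source = Path(dirpath) / name
            if source.suffix.lower() not in {".csv", ".xlsx"}:
                continue
            # Keep the path relative to `root`, swapping only the extension.
            relative = source.relative_to(root)
            plan[source] = (out_root / relative).with_suffix(".parquet")
    return plan
```

Note that in this sketch `data.csv` and `data.xlsx` in the same folder would map to the same output path; a real implementation would need a rule for that case.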
Notes
- Only files that contain list string columns are saved as Parquet
- Mixed-type columns are converted to float automatically when possible
- NaN values are skipped without raising errors
- Requires pyarrow for Parquet support
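The mixed-type note above can be illustrated with a small standard-library sketch: warn when a column holds more than one type, then convert to float only if every non-missing value allows it. The function `coerce_mixed_column` is hypothetical and is not deepcsv's implementation; with pandas, `pandas.to_numeric` is a common tool for the same job.

```python
import math
import warnings


def coerce_mixed_column(name, values):
    """Convert a mixed-type column to floats when every non-missing value is numeric-like."""

    def is_missing(v):
        return v is None or (isinstance(v, float) and math.isnan(v))

    kinds = {type(v).__name__ for v in values if not is_missing(v)}
    if len(kinds) <= 1:
        return list(values)  # a single type: nothing to fix

    # Tell the caller that this column is about to be changed.
    warnings.warn(f"Column {name!r} has mixed types: {sorted(kinds)}")
    try:
        return [float("nan") if is_missing(v) else float(v) for v in values]
    except (TypeError, ValueError):
        return list(values)  # not convertible: leave the column unchanged


# Hypothetical column mixing an int, a numeric string, NaN, and a float.
with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    fixed = coerce_mixed_column("price", [1, "2.5", float("nan"), 3.0])
```

Missing values stay as NaN in the result rather than stopping the conversion, which matches the note that NaN values are skipped.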
Requirements
- Python >= 3.7
- pandas
- pyarrow
Download files
File details
Details for the file deepcsv-0.3.0.tar.gz.
File metadata
- Download URL: deepcsv-0.3.0.tar.gz
- Upload date:
- Size: 3.7 kB
- Tags: Source
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.2.0 CPython/3.14.3
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | b3783d17a5bf104e271a02b091caa6dbc49317237f1c8e3671a140f92082e654 |
| MD5 | 84c9122c7294bd3008d92219240c9fea |
| BLAKE2b-256 | 557d8180f3d0ac7a8ac9b1b06c2a230fa69de56ea75fc4a50fcbf75dae575250 |
File details
Details for the file deepcsv-0.3.0-py3-none-any.whl.
File metadata
- Download URL: deepcsv-0.3.0-py3-none-any.whl
- Upload date:
- Size: 4.1 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.2.0 CPython/3.14.3
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | 2250bbd46416b3b3ca3599ac69fad30d0ed23ec39d25b3be791890b5af0a75db |
| MD5 | 637392543400385d2c55580de000f11a |
| BLAKE2b-256 | 5e3a94a139db8b0e822affd745291d4f9ca7375194dfa8307c98b5b89a939999 |