Automatically walks through folders and subfolders, finds all CSV and XLSX files, detects and fixes data issues, and saves the results as Parquet files while keeping the exact same folder structure.
deepcsv
Stop losing your data types when working with CSV files.
deepcsv automatically cleans messy CSV/XLSX data and converts it into ML-ready Parquet format.
Installation
pip install deepcsv
Example
Before
# CSV column value
"['a', 'b', 'c']"
After
# Automatically converted
['a', 'b', 'c']
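The conversion shown above can be sketched with the standard library alone. This is a minimal illustration, not deepcsv's actual implementation: the helper name `parse_list_string` is hypothetical, and the use of `ast.literal_eval` is an assumption about how such a conversion could be done safely.

```python
import ast

def parse_list_string(value):
    """Return a real list if `value` is a string holding a list literal.

    Hypothetical helper for illustration; anything that is not a
    list-like string (numbers, NaN, plain text) is returned unchanged.
    """
    if not isinstance(value, str):
        return value  # e.g. NaN floats or numbers pass through untouched
    text = value.strip()
    if not (text.startswith("[") and text.endswith("]")):
        return value  # does not look like a list literal
    try:
        parsed = ast.literal_eval(text)  # evaluates literals only, never arbitrary code
    except (ValueError, SyntaxError):
        return value  # not a valid Python literal, leave as-is
    return parsed if isinstance(parsed, list) else value

print(parse_list_string("['a', 'b', 'c']"))  # ['a', 'b', 'c']
```

`ast.literal_eval` is used here rather than `eval` because it only accepts literals, so a malicious cell value cannot run code.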
Usage
import deepcsv
df = deepcsv.ConvertListStrToList("path/to/file.csv")
What it does
- Walks through all folders and subfolders automatically
- Finds every CSV and XLSX file
- Detects columns that contain list strings like "['item1', 'item2']" and converts them into real Python arrays for faster performance
- Detects columns with mixed data types and tries to fix them automatically
- Warns you when a column has mixed types so you know what was changed
- Saves the results as Parquet files to preserve the converted data types
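As a rough illustration of the mixed-type step in the list above, the sketch below checks whether a column holds more than one Python type and, if every non-missing value is numeric, converts the column to float. The function name `coerce_column_to_float` and the exact rules are assumptions made for this example; deepcsv's own logic may differ.

```python
import math
import warnings

def coerce_column_to_float(values):
    """Return (new_values, changed) for one column given as a list.

    Hypothetical helper: a column is treated as mixed when its non-missing
    values have more than one Python type. Missing values (None or NaN)
    are kept in place and never cause a failure.
    """
    def is_missing(v):
        return v is None or (isinstance(v, float) and math.isnan(v))

    kinds = {type(v).__name__ for v in values if not is_missing(v)}
    if len(kinds) <= 1:
        return list(values), False  # single type, nothing to fix

    converted = []
    for v in values:
        if is_missing(v):
            converted.append(v)
            continue
        try:
            converted.append(float(v))
        except (TypeError, ValueError):
            # At least one value is not numeric, so leave the column alone.
            return list(values), False

    warnings.warn(f"mixed types {sorted(kinds)} converted to float")
    return converted, True

print(coerce_column_to_float([1, "2.5", 3.0]))  # ([1.0, 2.5, 3.0], True)
```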
Why Parquet?
CSV files cannot store arrays or preserve data types.
Parquet solves this by keeping the exact types after conversion and is much faster for data processing workflows.
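The type loss in CSV is easy to reproduce with the standard library. The small round trip below writes a list and an integer to an in-memory CSV and reads them back; both come back as plain strings. It does not use deepcsv or Parquet and only demonstrates the problem being described.

```python
import csv
import io

# Write one row that contains an integer and a real Python list.
buffer = io.StringIO()
writer = csv.writer(buffer)
writer.writerow(["id", "tags"])
writer.writerow([1, ["a", "b", "c"]])

# Read the same data back: CSV has no type information, so everything is text.
buffer.seek(0)
rows = list(csv.reader(buffer))
restored_id, restored_tags = rows[1]

print(type(restored_id).__name__, type(restored_tags).__name__)  # str str
print(restored_tags)  # ['a', 'b', 'c']  (a string, not a list)
```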
Why arrays instead of Python lists?
Arrays are significantly faster for numerical operations and machine learning workflows, especially when working with large datasets.
Functions
ConvertListStrToList(file_path)
Reads a CSV file, converts list strings to arrays, fixes mixed-type columns, and returns a clean DataFrame.
import deepcsv
df = deepcsv.ConvertListStrToList("path/to/file.csv")
ReadAllCSVData(path)
Walks through all folders and subfolders, applies ConvertListStrToList to every CSV and XLSX file, and saves the results as Parquet files in a new folder called "All CSV Data is Converted Here".
import deepcsv
deepcsv.ReadAllCSVData("path/to/folder")
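To make the "same folder structure" behaviour concrete, here is a sketch of how input files could be mapped to mirrored Parquet paths. The function `plan_parquet_outputs` and its parameters are hypothetical and are not part of deepcsv's API; the sketch only computes paths and does not read or write any data.

```python
from pathlib import Path

def plan_parquet_outputs(source_root, output_root):
    """Map every CSV/XLSX file under source_root to a mirrored .parquet path.

    Hypothetical helper for illustration. Returns a list of
    (input_path, output_path) pairs; nothing is converted or written.
    """
    source_root = Path(source_root)
    output_root = Path(output_root)
    plan = []
    for path in sorted(source_root.rglob("*")):  # recurse into all subfolders
        if path.is_file() and path.suffix.lower() in {".csv", ".xlsx"}:
            relative = path.relative_to(source_root)  # keeps the subfolder layout
            target = (output_root / relative).with_suffix(".parquet")
            plan.append((path, target))
    return plan

for src, dst in plan_parquet_outputs("path/to/folder", "All CSV Data is Converted Here"):
    print(src, "->", dst)
```

A file at `path/to/folder/2023/sales.csv` would map to `All CSV Data is Converted Here/2023/sales.parquet` under this scheme.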
Notes
- Only files that contain list string columns are saved as Parquet
- Mixed-type columns are converted to float automatically when possible
- Skips NaN values without breaking
- Requires pyarrow for Parquet support
Requirements
- Python >= 3.7
- pandas
- pyarrow
File details
Details for the file deepcsv-0.4.0.tar.gz.
File metadata
- Download URL: deepcsv-0.4.0.tar.gz
- Upload date:
- Size: 4.1 kB
- Tags: Source
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.2.0 CPython/3.14.3
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | 30d19114ed3dc17216bf37a3b183f8e2c19ed81facbe9fd52a4b3a40ea409ab6 |
| MD5 | 5f2aee1a2a8644fb2bc3a26d4ea687a6 |
| BLAKE2b-256 | a7c034eba24a1d33bdad4eb087ef6792c797de5178e7c2aedf2cef6f684a7e7c |
File details
Details for the file deepcsv-0.4.0-py3-none-any.whl.
File metadata
- Download URL: deepcsv-0.4.0-py3-none-any.whl
- Upload date:
- Size: 4.4 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.2.0 CPython/3.14.3
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | 7452b6316a402bd35c0f9f5c46641f21c8aceccbe709a6524314ada907e8528f |
| MD5 | 99e79d52a639e98e994b7d6f62f5fb9c |
| BLAKE2b-256 | 8a3039318812eea38a0df236ef51840a2b175892a2cee4545abfe00e267cf791 |