Automatically processes data files in directories, converts array-like strings to NumPy arrays, detects and fixes data type issues, and saves results as optimized Parquet files.
# deepcsv
## The Problem

Ever loaded a CSV file and found your carefully structured lists turned into useless strings?

```python
"['Action', 'Sci-Fi', 'Thriller']"  # This is a string, not a list
```

deepcsv fixes this automatically.
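The round trip that causes this is easy to reproduce with plain pandas (a minimal sketch; the column and file contents are made up):

```python
import io

import pandas as pd

# A DataFrame whose 'genres' column holds real Python lists.
df = pd.DataFrame({"title": ["Inception"],
                   "genres": [["Action", "Sci-Fi", "Thriller"]]})

# Round-trip through CSV: the list is serialized as its repr...
buf = io.StringIO()
df.to_csv(buf, index=False)
buf.seek(0)
loaded = pd.read_csv(buf)

# ...and comes back as a plain string, not a list.
print(type(loaded.loc[0, "genres"]))  # <class 'str'>
print(loaded.loc[0, "genres"])        # ['Action', 'Sci-Fi', 'Thriller']
```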
## The Solution
deepcsv handles these cases automatically:

- Reads CSV/XLSX files or existing DataFrames
- Converts any stringified list value (e.g. `"['a', 'b']"`) into a real NumPy array (fast and lightweight)
- Detects and fixes mixed-type columns by safely converting them to numeric
- Recursively processes all CSV/XLSX files in each directory
- Saves results in Parquet format to preserve types and speed up analysis
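The string-to-array conversion described above can be sketched with `ast.literal_eval`; this is an illustration of the idea, not deepcsv's actual implementation (`to_array` is a hypothetical helper):

```python
import ast

import numpy as np
import pandas as pd

def to_array(value):
    """Parse a stringified list into a NumPy array; leave other values untouched."""
    if isinstance(value, str) and value.strip().startswith("["):
        try:
            return np.array(ast.literal_eval(value))
        except (ValueError, SyntaxError):
            return value  # not a valid Python literal, keep as-is
    return value

df = pd.DataFrame({"genres": ["['Action', 'Sci-Fi']", "['Drama']"]})
df["genres"] = df["genres"].map(to_array)
print(type(df.loc[0, "genres"]))  # <class 'numpy.ndarray'>
```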
## Installation

```shell
pip install deepcsv
```
## Usage
### Single file processing (`process_file`)

```python
import deepcsv

df = deepcsv.process_file('path/to/file.csv')
```

- Accepts a `str` (file path) or a `pd.DataFrame`
- Returns a `pd.DataFrame` with columns converted to arrays
### Batch directory processing (`process_all_files`)

```python
import deepcsv

deepcsv.process_all_files('path/to/folder')
```

- Processes all `.csv` and `.xlsx` files recursively
- Saves the converted files as Parquet in a subfolder named `All CSV Files is Converted Here`
## Utilities
### `read_any(file_path)`

Reads any supported file and returns a pandas DataFrame. No need to manually pick the reader.

```python
from deepcsv import read_any

df = read_any('data/users.csv')
df = read_any('reports/sales.xlsx')
df = read_any('warehouse/orders.parquet')
```

Supported formats: `.csv`, `.txt`, `.tsv`, `.xls`, `.xlsx`, `.json`, `.parquet`, `.pkl`, `.feather`, `.db`, `.sqlite`
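Internally, such a reader just dispatches on the file extension. A minimal sketch covering a few of the formats (`read_any_sketch` and `_READERS` are hypothetical names, not deepcsv's internals):

```python
from pathlib import Path

import pandas as pd

# Hypothetical dispatch table mapping extensions to pandas readers.
_READERS = {
    ".csv": pd.read_csv,
    ".tsv": lambda p: pd.read_csv(p, sep="\t"),
    ".xlsx": pd.read_excel,
    ".json": pd.read_json,
    ".parquet": pd.read_parquet,
    ".pkl": pd.read_pickle,
}

def read_any_sketch(file_path: str) -> pd.DataFrame:
    """Pick the reader from the file extension and return a DataFrame."""
    suffix = Path(file_path).suffix.lower()
    reader = _READERS.get(suffix)
    if reader is None:
        raise ValueError(f"Unsupported file extension: {suffix!r}")
    return reader(file_path)
```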
### `clean_values(data_input, ...)`

Cleans a DataFrame by removing nulls from specific columns or rows, or dropping rows by index.

```python
from deepcsv import clean_values

# Drop fully-null columns among specific cols
df = clean_values('data.csv', cols=['age', 'salary'])

# Drop rows that have nulls in specific cols
df = clean_values('data.csv', cols=['age', 'salary'], ax_0=True)

# Drop rows by index
df = clean_values(df, index=[0, 5, 12])

# Apply to all columns except some
df = clean_values('data.csv', all_cols_except=['id', 'name'])
```
Parameters:

| Parameter | Type | Default | Description |
|---|---|---|---|
| `data_input` | `str` or `DataFrame` | required | File path or DataFrame |
| `cols` | `list` | `None` | Columns to apply on |
| `ax_0` | `bool` | `False` | If `True`: drop rows with nulls. If `False`: drop fully-null columns |
| `index` | `list` | `None` | Row indexes to drop |
| `all_cols_except` | `list` | `None` | Apply to all columns except these |
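For reference, the two null-dropping modes correspond to plain pandas `dropna` calls (an equivalence sketch, not deepcsv's code):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "age": [25, np.nan, 40],
    "salary": [50000, 60000, np.nan],
    "notes": [np.nan, np.nan, np.nan],  # fully null column
})

# ax_0=False behaviour: drop columns that are entirely null.
no_empty_cols = df.dropna(axis=1, how="all")

# ax_0=True behaviour: drop rows with nulls in the given columns.
no_null_rows = df.dropna(subset=["age", "salary"])

print(no_empty_cols.columns.tolist())  # ['age', 'salary']
print(len(no_null_rows))               # 1
```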
## What it does

- Auto-detects files in the directory and its subdirectories
- Converts values like `"['item1', 'item2']"` into `array(['item1', 'item2'])` (a NumPy array)
- Converts mixed numeric/string columns to a single numeric type (float)
- Handles NaN values without breaking
- Stores results in Parquet format for type safety and performance
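A safe numeric coercion of the kind described above can be done with pandas' `to_numeric`; deepcsv's exact rules may differ, so treat this as a sketch of the idea:

```python
import pandas as pd

# A mixed numeric/string column as it often arrives from a messy CSV.
s = pd.Series([1, "2", 3.5, "oops"])

# Safe coercion: unparseable values become NaN instead of raising.
numeric = pd.to_numeric(s, errors="coerce")
print(numeric.dtype)  # float64
```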
## Function Signatures

```python
process_file(data_input: Union[str, pd.DataFrame]) -> pd.DataFrame
process_all_files(directory_path: str) -> None
read_any(file_path: str) -> pd.DataFrame
clean_values(data_input, cols=None, ax_0=False, index=None, all_cols_except=None) -> pd.DataFrame
```
Output arrays are NumPy arrays for optimal performance in machine learning workflows.
## Key Features

- Fast NumPy array conversion instead of slow Python lists
- Mixed-type detection with automatic fixes
- Parquet storage for data integrity
- Recursive directory traversal
- Warning messages for transparency
- Built-in file reader supporting 10+ formats (`read_any`)
- Flexible null/index cleaning (`clean_values`)
## Notes

- Requires `pyarrow` for Parquet support
- Only saves files that contain converted array columns
## Requirements

- Python >= 3.7
- pandas
- pyarrow
## Changelog

### Added

- `finding_value` parameter in `clean_values(data_input, finding_value)`: finds and removes rows that contain this specific value
- `finding_type` parameter in `clean_values(data_input, finding_type)`: finds and removes rows whose values have this specific type (e.g. `str`, `int`)
- `condition` parameter in `clean_values(data_input, condition=[operator, value])`, e.g. `['>=', 500]`; applied only together with `finding_value` or `finding_type`
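The `[operator, value]` form of the `condition` parameter can be sketched with Python's `operator` module (`drop_rows_where` is a hypothetical illustration, not deepcsv's API):

```python
import operator

import pandas as pd

# Illustrative mapping from the operator strings the changelog mentions.
_OPS = {">=": operator.ge, "<=": operator.le,
        ">": operator.gt, "<": operator.lt, "==": operator.eq}

def drop_rows_where(df: pd.DataFrame, col: str, condition: list) -> pd.DataFrame:
    """Remove rows where `col` satisfies [operator, value], e.g. ['>=', 500]."""
    op, value = condition
    return df[~_OPS[op](df[col], value)]

df = pd.DataFrame({"price": [100, 500, 900]})
print(drop_rows_where(df, "price", [">=", 500])["price"].tolist())  # [100]
```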