Automatically processes data files in directories, converts array-like strings to NumPy arrays, detects and fixes data type issues, and saves results as optimized Parquet files.

Project description

deepcsv

Ever loaded a CSV file and found your carefully structured lists turned into useless strings?

"['Action', 'Sci-Fi', 'Thriller']"  # This is a string, not a list

deepcsv fixes this automatically.


The Solution

deepcsv handles these cases automatically:

  • Reads CSV/XLSX files or existing DataFrames
  • Converts string values that look like lists (anything beginning with "[") into real NumPy arrays (fast and lightweight)
  • Detects and fixes mixed-type columns by safely converting them to numeric
  • Recursively processes all CSV/XLSX files in each directory
  • Saves results as Parquet format to preserve types and speed up analysis
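The package's internals aren't shown here, but the core conversion idea can be sketched in a few lines with `ast.literal_eval` and NumPy (`parse_list_strings` is a hypothetical helper name, not part of deepcsv's API):

```python
import ast

import numpy as np
import pandas as pd

def parse_list_strings(series: pd.Series) -> pd.Series:
    """Turn string-encoded lists like "['a', 'b']" into NumPy arrays."""
    def convert(value):
        if isinstance(value, str) and value.strip().startswith("["):
            try:
                return np.array(ast.literal_eval(value))
            except (ValueError, SyntaxError):
                return value  # leave unparseable strings untouched
        return value
    return series.map(convert)

df = pd.DataFrame({"genres": ["['Action', 'Sci-Fi']", "['Drama']"]})
df["genres"] = parse_list_strings(df["genres"])
```

After the call, each cell holds a real `numpy.ndarray` instead of a string, so element access and vectorized operations work as expected.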

Installation

pip install deepcsv

Usage

Single file processing (process_file)

import deepcsv

df = deepcsv.process_file('path/to/file.csv')
  • Accepts str (file path) or pd.DataFrame
  • Returns pd.DataFrame with columns converted to arrays

Batch directory processing (process_all_files)

import deepcsv

deepcsv.process_all_files('path/to/folder')
  • Processes all .csv and .xlsx files recursively
  • Saves converted files as Parquet in a folder named "All CSV Files is Converted Here"
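The recursive discovery step can be sketched with `pathlib` (the function name and return type here are assumptions for illustration, not deepcsv's implementation):

```python
from pathlib import Path

def find_data_files(directory: str) -> list:
    """Recursively collect every .csv and .xlsx file under `directory`."""
    root = Path(directory)
    return sorted(p for p in root.rglob("*")
                  if p.suffix.lower() in {".csv", ".xlsx"})
```

Each discovered file would then be read, converted, and written out as Parquet into the output folder.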

Utilities

read_any(file_path)

Reads any supported file and returns a pandas DataFrame. No need to manually pick the reader.

from deepcsv import read_any

df = read_any('data/users.csv')
df = read_any('reports/sales.xlsx')
df = read_any('warehouse/orders.parquet')

Supported formats: .csv, .txt, .tsv, .xls, .xlsx, .json, .parquet, .pkl, .feather, .db, .sqlite
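Conceptually, this is extension-based dispatch to the matching pandas reader. A minimal sketch (database formats omitted; `read_any_sketch` and its reader table are assumptions, not deepcsv's code):

```python
from pathlib import Path

import pandas as pd

# Map file extensions to the pandas reader that handles them.
_READERS = {
    ".csv": pd.read_csv,
    ".txt": pd.read_csv,
    ".tsv": lambda p: pd.read_csv(p, sep="\t"),
    ".xls": pd.read_excel,
    ".xlsx": pd.read_excel,
    ".json": pd.read_json,
    ".parquet": pd.read_parquet,
    ".pkl": pd.read_pickle,
    ".feather": pd.read_feather,
}

def read_any_sketch(file_path: str) -> pd.DataFrame:
    suffix = Path(file_path).suffix.lower()
    try:
        reader = _READERS[suffix]
    except KeyError:
        raise ValueError(f"Unsupported file extension: {suffix}")
    return reader(file_path)
```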


clean_values(data_input, ...)

Cleans a DataFrame by removing nulls from specific columns or rows, or dropping rows by index.

from deepcsv import clean_values

# Drop the listed columns if they are entirely null (default ax_0=False)
df = clean_values('data.csv', cols=['age', 'salary'])

# Drop rows that have nulls in specific cols
df = clean_values('data.csv', cols=['age', 'salary'], ax_0=True)

# Drop rows by index
df = clean_values(df, index=[0, 5, 12])

# Apply on all columns except some
df = clean_values('data.csv', all_cols_except=['id', 'name'])

Parameters:

Parameter        Type             Default   Description
data_input       str | DataFrame  required  File path or DataFrame
cols             list             None      Columns to apply on
ax_0             bool             False     If True, drop rows with nulls; if False, drop fully-null columns
index            list             None      Row indexes to drop
all_cols_except  list             None      Apply to all columns except these
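The parameter interactions above can be made concrete with a small re-implementation of the documented behaviour (a sketch under the assumption that the table describes the full logic; `clean_values_sketch` is not deepcsv's function):

```python
import pandas as pd

def clean_values_sketch(df, cols=None, ax_0=False, index=None,
                        all_cols_except=None):
    # Dropping rows by index takes a plain index list.
    if index is not None:
        return df.drop(index=index)
    # all_cols_except inverts the column selection.
    if all_cols_except is not None:
        cols = [c for c in df.columns if c not in all_cols_except]
    if cols is None:
        cols = list(df.columns)
    if ax_0:
        # ax_0=True: drop rows with nulls in the selected columns.
        return df.dropna(subset=cols)
    # ax_0=False: drop selected columns that are entirely null.
    empty = [c for c in cols if df[c].isna().all()]
    return df.drop(columns=empty)
```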

What it does

  • Auto-detects files in directory and subdirectories
  • Converts values like:
    • "['item1', 'item2']" → array(['item1', 'item2']) (NumPy array)
    • Mixed numeric/string columns → single numeric type (float)
  • Handles NaN values without breaking
  • Stores results in Parquet format for type safety and performance
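deepcsv's exact coercion logic isn't documented here, but pandas' own `to_numeric` illustrates the mixed-type fix described above:

```python
import pandas as pd

# A column that mixes ints, floats, and numeric strings ends up as dtype object.
s = pd.Series([1, "2", 3.5, "oops"])

# errors="coerce" converts what it can and turns the rest into NaN,
# giving the whole column a single float dtype.
fixed = pd.to_numeric(s, errors="coerce")
```

The NaN introduced for unconvertible values is why a numeric column produced this way is float rather than int.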

Function Signatures

  • process_file(data_input: Union[str, pd.DataFrame]) -> pd.DataFrame
  • process_all_files(directory_path: str) -> None
  • read_any(file_path: str) -> pd.DataFrame
  • clean_values(data_input, cols=None, ax_0=False, index=None, all_cols_except=None) -> pd.DataFrame

Output arrays are NumPy arrays for optimal performance in machine learning workflows.


Key Features

  • Fast NumPy array conversion instead of slow Python lists
  • Mixed-type detection with automatic fixes
  • Parquet storage for data integrity
  • Recursive directory traversal
  • Warning messages for transparency
  • Built-in file reader supporting 10+ formats (read_any)
  • Flexible null/index cleaning (clean_values)

Notes

  • Requires pyarrow for Parquet support
  • Only saves files that contain converted array columns

Requirements

  • Python >= 3.7
  • pandas
  • pyarrow

Changelog


Added

  • finding_value parameter in clean_values(data_input, finding_value=...): finds and removes rows that contain this specific value
  • finding_type parameter in clean_values(data_input, finding_type=...): finds and removes rows whose values are of this specific type (e.g. str, int)
  • condition parameter in clean_values(data_input, condition=[operator, value]), e.g. ['>=', 500]: applied only together with finding_value or finding_type
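One plausible reading of the [operator, value] condition form, sketched with the `operator` module (the helper name and exact semantics are assumptions, not deepcsv's API):

```python
import operator

import pandas as pd

# Map the documented operator strings to their Python equivalents.
_OPS = {">": operator.gt, ">=": operator.ge, "<": operator.lt,
        "<=": operator.le, "==": operator.eq, "!=": operator.ne}

def drop_rows_matching(df: pd.DataFrame, column: str,
                       condition: list) -> pd.DataFrame:
    op_symbol, value = condition          # e.g. ['>=', 500]
    mask = _OPS[op_symbol](df[column], value)
    return df[~mask]                      # keep rows that do NOT match
```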

Project details


Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

deepcsv-0.6.2b1.tar.gz (8.6 kB)

Uploaded Source

Built Distribution

If you're not sure about the file name format, learn more about wheel file names.

deepcsv-0.6.2b1-py3-none-any.whl (8.6 kB)

Uploaded Python 3

File details

Details for the file deepcsv-0.6.2b1.tar.gz.

File metadata

  • Download URL: deepcsv-0.6.2b1.tar.gz
  • Upload date:
  • Size: 8.6 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/6.2.0 CPython/3.14.3

File hashes

Hashes for deepcsv-0.6.2b1.tar.gz
Algorithm Hash digest
SHA256 deaa6697def36092027e5789585ce11a062db07183d0777e2437a2b5bc730be5
MD5 c7feeb64dc25b76c9f6024da3c015db1
BLAKE2b-256 3078dcef40ff424a84f9e785d39a5f46954ef9a5f8507de41b27367ac5aabe40

See more details on using hashes here.

File details

Details for the file deepcsv-0.6.2b1-py3-none-any.whl.

File metadata

  • Download URL: deepcsv-0.6.2b1-py3-none-any.whl
  • Upload date:
  • Size: 8.6 kB
  • Tags: Python 3
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/6.2.0 CPython/3.14.3

File hashes

Hashes for deepcsv-0.6.2b1-py3-none-any.whl
Algorithm Hash digest
SHA256 c92bbf75bffb24cbc69d640768951d872f09e18630cc6d742c85505bc79adff8
MD5 969089096d5b67578848f26d6a9447a8
BLAKE2b-256 ba74d4611befde3db4029bb24859f05dba4878f064636fe251704cd08e36c30a

See more details on using hashes here.
