Automatically processes data files in directories, converts array-like strings to NumPy arrays, detects and fixes data type issues, and saves results as optimized Parquet files.

Project description

deepcsv (v0.5.0)

Ever loaded a CSV file and found your carefully structured lists turned into useless strings?

"['Action', 'Sci-Fi', 'Thriller']"  # This is a string, not a list

deepcsv fixes this automatically.
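The underlying issue is easy to reproduce with pandas alone: write a list column to CSV and it comes back as a string, which then has to be parsed. A minimal illustration of the problem and the recovery step, using ast.literal_eval and NumPy directly (not deepcsv's internals, which may differ):

```python
import io
import ast
import numpy as np
import pandas as pd

# Round-trip a list column through CSV.
df = pd.DataFrame({"genres": [["Action", "Sci-Fi", "Thriller"]]})
buf = io.StringIO()
df.to_csv(buf, index=False)
buf.seek(0)

loaded = pd.read_csv(buf)
cell = loaded["genres"].iloc[0]
print(type(cell))  # <class 'str'> -- the list came back as text

# Recover the structure: parse the string, then wrap it in a NumPy array.
arr = np.array(ast.literal_eval(cell))
print(arr)  # ['Action' 'Sci-Fi' 'Thriller']
```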


The Solution

deepcsv handles these cases automatically:

  • Reads CSV/XLSX files or existing DataFrames
  • Converts any string value starting with [ into real NumPy arrays (fast and lightweight)
  • Detects and fixes mixed-type columns by safely converting them to numeric (float)
  • Recursively processes all CSV/XLSX files in subdirectories
  • Saves results in Parquet format to preserve dtypes and speed up analysis

Installation

pip install deepcsv

Usage

Single file processing (process_file)

import deepcsv

df = deepcsv.process_file('path/to/file.csv')
  • Accepts str (file path) or pd.DataFrame
  • Returns pd.DataFrame with columns converted to arrays

Batch directory processing (process_all_files)

import deepcsv

deepcsv.process_all_files('path/to/folder')
  • Processes all .csv and .xlsx files recursively
  • Saves converted files as Parquet in the output folder "All CSV Files is Converted Here"
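The recursive discovery step can be approximated with pathlib. This is a sketch of the traversal only, not deepcsv's actual implementation; the helper name find_data_files is hypothetical:

```python
import tempfile
from pathlib import Path

def find_data_files(directory):
    """Recursively collect .csv and .xlsx files, mirroring what a
    batch processor like process_all_files would need to discover."""
    root = Path(directory)
    return sorted(
        p for p in root.rglob("*")
        if p.suffix.lower() in {".csv", ".xlsx"}
    )

# Example: build a small tree and list what would be processed.
with tempfile.TemporaryDirectory() as tmp:
    (Path(tmp) / "sub").mkdir()
    (Path(tmp) / "a.csv").touch()
    (Path(tmp) / "sub" / "b.xlsx").touch()
    (Path(tmp) / "notes.txt").touch()  # ignored: not a data file
    found = find_data_files(tmp)
    print([p.name for p in found])  # ['a.csv', 'b.xlsx']
```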

What it does

  • Auto-detects files in directory and subdirectories
  • Converts values like:
    • "['item1', 'item2']" → array(['item1', 'item2']) (NumPy array)
    • Mixed numeric/string columns → single numeric type (float)
  • Handles NaN values without breaking
  • Stores results in Parquet format for type safety and performance
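The mixed-type fix described above behaves like pandas' own numeric coercion. A sketch under the assumption that deepcsv does something equivalent to pd.to_numeric:

```python
import numpy as np
import pandas as pd

# A column mixing numbers and numeric strings -- a common artifact
# of hand-edited or concatenated CSVs.
s = pd.Series([1, "2", 3.5, None])
print(s.dtype)  # object

# Coerce everything to a single numeric dtype; unparseable values
# become NaN instead of raising, so NaNs don't break the conversion.
fixed = pd.to_numeric(s, errors="coerce")
print(fixed.dtype)       # float64
print(fixed.tolist())    # [1.0, 2.0, 3.5, nan]
```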

Function Signatures

  • process_file(data_input: Union[str, pd.DataFrame]) -> pd.DataFrame
  • process_all_files(directory_path: str) -> None

Output arrays are NumPy arrays for optimal performance in machine learning workflows.


Key Features

  • Fast NumPy array conversion instead of slow Python lists
  • Mixed-type detection with automatic fixes
  • Parquet storage for data integrity
  • Recursive directory traversal
  • Warning messages for transparency

Notes

  • Requires pyarrow for Parquet support
  • Only saves files that contain converted array columns

Requirements

  • Python >= 3.7
  • pandas
  • pyarrow

