Automatically walks through folders and subfolders, finds all CSV and XLSX files, detects and fixes data issues, and saves the results as Parquet files while keeping the exact same folder structure.

deepcsv

Stop losing your data types when working with CSV files.
deepcsv automatically cleans messy CSV/XLSX data and converts it into ML-ready Parquet format.

Installation

pip install deepcsv

Example

Before

# CSV column value
"['a', 'b', 'c']"

After

# Automatically converted
['a', 'b', 'c']
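The conversion above can be illustrated with the standard library alone. This is a minimal sketch of the general technique, not deepcsv's actual implementation: the helper name `parse_list_string` is hypothetical, and using `ast.literal_eval` is an assumption about how such a conversion could be done safely.

```python
import ast


def parse_list_string(value):
    """Return a list if `value` is a string holding a list literal.

    Anything else (non-strings, plain text, malformed literals) is
    returned unchanged. Hypothetical helper, for illustration only.
    """
    if not isinstance(value, str):
        return value
    text = value.strip()
    if not (text.startswith("[") and text.endswith("]")):
        return value
    try:
        # literal_eval only evaluates Python literals, never arbitrary code.
        parsed = ast.literal_eval(text)
    except (ValueError, SyntaxError):
        return value
    return parsed if isinstance(parsed, list) else value


cell = "['a', 'b', 'c']"
print(parse_list_string(cell))
```

Values that do not look like a list literal pass through untouched, so the helper can be applied to every cell of a column without a separate check.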

Usage

import deepcsv

df = deepcsv.ConvertListStrToList("path/to/file.csv")

What it does

  • Walks through all folders and subfolders automatically
  • Finds every CSV and XLSX file
  • Detects columns that contain list strings such as "['item1', 'item2']" and converts them into real arrays for faster processing
  • Detects columns with mixed data types and converts them to float when possible
  • Warns you when a column has mixed types so you know what was changed
  • Saves the results as Parquet files to preserve the converted data types
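The first two steps in the list above (walking the folder tree and picking out CSV and XLSX files) can be sketched with `os.walk`. The function name `find_data_files` and the case-insensitive extension match are assumptions made for illustration; deepcsv may do this differently.

```python
import os

# Extensions treated as data files in this sketch (matched case-insensitively).
DATA_EXTENSIONS = {".csv", ".xlsx"}


def find_data_files(root):
    """Yield the path of every CSV/XLSX file under `root`, recursively."""
    for folder, _subfolders, filenames in os.walk(root):
        for name in sorted(filenames):
            if os.path.splitext(name)[1].lower() in DATA_EXTENSIONS:
                yield os.path.join(folder, name)


# "path/to/folder" is a placeholder; a missing folder simply yields nothing.
for path in find_data_files("path/to/folder"):
    print(path)
```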

Why Parquet?

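CSV stores every value as plain text, so arrays and column data types are lost on save and have to be re-inferred on load.
Parquet stores a typed schema alongside the data, so converted columns keep their types when read back. As a columnar binary format, it is also typically faster to read than CSV in data processing workflows.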
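The type loss on the CSV side is easy to reproduce with the standard library. The sketch below writes one row containing an integer and a list through the `csv` module and reads it back; the row contents are made up for illustration. The Parquet side is not shown because it needs pandas and pyarrow installed.

```python
import csv
import io

# A row with a real integer and a real list.
row = {"id": 1, "tags": ["a", "b", "c"]}

buffer = io.StringIO()
writer = csv.DictWriter(buffer, fieldnames=["id", "tags"])
writer.writeheader()
writer.writerow(row)  # non-string values are written as their str() form

buffer.seek(0)
restored = next(csv.DictReader(buffer))

# Both values come back as plain strings.
print(type(restored["id"]).__name__, repr(restored["id"]))
print(type(restored["tags"]).__name__, repr(restored["tags"]))
```

After the round trip, `restored["tags"]` is the text "['a', 'b', 'c']" rather than a list, which is exactly the kind of value shown under "Before".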


Why arrays instead of Python lists?

Arrays are typically faster than Python lists for numerical operations and machine learning workflows, and the difference grows with the size of the dataset.


Functions

ConvertListStrToList(file_path)

Reads a CSV file, converts list strings to arrays, fixes mixed-type columns, and returns a clean DataFrame.

import deepcsv

df = deepcsv.ConvertListStrToList("path/to/file.csv")

ReadAllCSVData(path)

Walks through all folders and subfolders, applies ConvertListStrToList to every CSV and XLSX file, and saves the results as Parquet files in a new folder named "All CSV Data is Converted Here", mirroring the original folder structure.

import deepcsv

deepcsv.ReadAllCSVData("path/to/folder")
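Keeping the same folder structure comes down to computing each output path relative to the input root. The sketch below shows one way to do that with `pathlib`. The function name `mirrored_parquet_path` is hypothetical, and placing the output folder wherever `output_root` points is an assumption, since the location of the "All CSV Data is Converted Here" folder is not stated here.

```python
from pathlib import Path

# Folder name taken from the description above; its location is an assumption.
OUTPUT_FOLDER_NAME = "All CSV Data is Converted Here"


def mirrored_parquet_path(source_file, input_root, output_root):
    """Map a source file to a .parquet path with the same relative location."""
    relative = Path(source_file).relative_to(input_root)
    return Path(output_root) / relative.with_suffix(".parquet")


target = mirrored_parquet_path(
    "data/2024/jan/sales.csv", "data", Path("out") / OUTPUT_FOLDER_NAME
)
print(target)
```

One design point to watch with this approach: a folder holding both `sales.csv` and `sales.xlsx` would map to the same output path, so a real implementation needs a rule for that case.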

Notes

  • Only files that contain list string columns are saved as Parquet
  • Mixed-type columns are converted to float automatically when possible
  • NaN values are skipped instead of raising an error
  • Requires pyarrow for Parquet support
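The mixed-type and NaN notes above can be sketched in plain Python. This is an illustration of the idea only, under stated assumptions: the function name `coerce_mixed_column` is hypothetical, a column is treated as "mixed" when its non-missing values have more than one Python type, and the conversion is abandoned if any value cannot be parsed as a float.

```python
import math
import warnings


def _is_missing(value):
    """True for None and float NaN, the missing markers used in this sketch."""
    return value is None or (isinstance(value, float) and math.isnan(value))


def coerce_mixed_column(values, name):
    """Return (values, changed); convert a mixed-type column to floats if possible."""
    kinds = {type(v).__name__ for v in values if not _is_missing(v)}
    if len(kinds) <= 1:
        return list(values), False  # not mixed, nothing to do

    converted = []
    for v in values:
        if _is_missing(v):
            converted.append(float("nan"))  # missing values stay missing
            continue
        try:
            converted.append(float(v))
        except (TypeError, ValueError):
            return list(values), False  # not fully numeric, leave as-is

    warnings.warn(f"column {name!r} had mixed types {sorted(kinds)}; converted to float")
    return converted, True


print(coerce_mixed_column([1, "2.5", 3.0], "price"))
```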

Requirements

  • Python >= 3.7
  • pandas
  • pyarrow
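A quick way to confirm these requirements before running a conversion is to check the interpreter version and look up each package without importing it. The helper name `missing_modules` is hypothetical; `importlib.util.find_spec` is part of the standard library.

```python
import importlib.util
import sys


def missing_modules(names):
    """Return the names from `names` that cannot be found in this environment."""
    return [name for name in names if importlib.util.find_spec(name) is None]


if sys.version_info < (3, 7):
    print("Python 3.7 or newer is required")

missing = missing_modules(["pandas", "pyarrow"])
if missing:
    print("Missing packages:", ", ".join(missing))
```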
