Automatically processes data files in directories, converts array-like strings to NumPy arrays, detects and fixes data type issues, saves results as Parquet or other formats, reads many file formats with one function, and cleans null values.

Project description

deepcsv

"You think you saved a list. You open it tomorrow — and it's a string."

deepcsv was built to solve exactly this problem.


The Problem

Your CSV files are lying to you.

You save a list — you open it tomorrow and it's a string. Your column has numbers — it secretly has 3 different data types. You have 200 CSV files across 40 folders — and you process them one by one. You load a file and spend 20 minutes just picking the right reader. You have nulls scattered everywhere with no clean way to handle them.

This is the silent killer of every data pipeline.


The Solution

deepcsv handles all of this in one import.

  • Walks through every folder and subfolder automatically
  • Finds every CSV and XLSX file
  • Detects columns storing lists as strings and converts them to real NumPy arrays
  • Catches mixed-type columns and fixes them automatically
  • Saves everything in any format you choose — not just Parquet
  • Reads any supported file format with one function, so you no longer have to pick the right reader
  • Cleans nulls with full control over columns, rows, indexes, values, and types
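
To make the list-as-string problem concrete, here is a minimal sketch of the round trip and one way to undo it. It is not deepcsv's implementation: `parse_array_like` is a hypothetical helper, and using `ast.literal_eval` is an assumption about how such a conversion could be done safely.

```python
import ast
import csv
import io

import numpy as np


def parse_array_like(value):
    """Return a NumPy array if `value` is a string holding a list literal.

    Anything else (plain text, numbers, malformed literals) is returned unchanged.
    Hypothetical helper for illustration, not part of deepcsv.
    """
    if not isinstance(value, str):
        return value
    text = value.strip()
    if not (text.startswith("[") and text.endswith("]")):
        return value
    try:
        parsed = ast.literal_eval(text)  # parses literals only, never runs code
        if not isinstance(parsed, list):
            return value
        return np.array(parsed)
    except (ValueError, SyntaxError, TypeError):
        return value  # looked like a list but could not be converted


# Round trip through CSV: the list is written as its string representation.
buffer = io.StringIO()
writer = csv.writer(buffer)
writer.writerow(["id", "scores"])
writer.writerow([1, [1, 2, 3]])
buffer.seek(0)
rows = list(csv.DictReader(buffer))

raw = rows[0]["scores"]           # a string, not a list
restored = parse_array_like(raw)  # a NumPy array again
```

Values that do not parse are passed through untouched, so the helper can be mapped over a whole column without raising on ordinary text.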

Installation

pip install deepcsv

Functions

process_file(data_input, save_file_extension=None)

Reads a file or DataFrame, converts array-like strings to NumPy arrays, fixes mixed-type columns, and optionally saves the result in any format you choose.

import deepcsv

# Process only
df = deepcsv.process_file('path/to/file.csv')

# Process and save as parquet
df = deepcsv.process_file('path/to/file.csv', save_file_extension='parquet')

# Process and save as Excel
df = deepcsv.process_file('path/to/file.csv', save_file_extension='xlsx')

Supported save formats: .csv .tsv .txt .xlsx .json .parquet .pkl .feather .html .xml
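
The README does not say how mixed-type columns are detected or fixed, so the following is only a sketch of the idea using plain pandas. The helper names `find_mixed_type_columns` and `coerce_numeric` are hypothetical, and coercing to numbers with `pd.to_numeric` is one assumed repair strategy among several.

```python
import pandas as pd


def find_mixed_type_columns(df):
    """Return {column: set of type names} for columns holding more than one Python type."""
    mixed = {}
    for name in df.columns:
        type_names = {type(v).__name__ for v in df[name].dropna()}
        if len(type_names) > 1:
            mixed[name] = type_names
    return mixed


def coerce_numeric(df, column):
    """Return a copy with `column` converted to numbers; unparseable values become NaN."""
    fixed = df.copy()
    fixed[column] = pd.to_numeric(fixed[column], errors="coerce")
    return fixed


df = pd.DataFrame(
    {
        "amount": [10, "20", 3.5, "n/a"],  # int, str, and float in one column
        "city": ["Cairo", "Giza", "Luxor", "Aswan"],
    }
)
mixed = find_mixed_type_columns(df)
clean = coerce_numeric(df, "amount")
```

Reporting the offending columns before changing them keeps the repair step explicit, which matters when coercion would turn values such as "n/a" into NaN.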


process_all_files(directory_path, output_dir="All CSV Files is Converted Here", file_extension="parquet")

Walks through all folders and subfolders, applies process_file on every supported file, and saves results in the format you choose.

import deepcsv

# Default — saves as parquet
deepcsv.process_all_files('path/to/folder')

# Custom output folder
deepcsv.process_all_files('path/to/folder', output_dir='Converted Files')

# Save as CSV instead
deepcsv.process_all_files('path/to/folder', file_extension='csv')

Supported input formats: .csv .txt .tsv .xls .xlsx .json .parquet .pkl .feather .db .sqlite
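
The recursive walk described above can be sketched with the standard library alone. This is an illustration under assumptions, not deepcsv's code: `find_supported_files` is a hypothetical helper, and the extension set is simply copied from the list of supported input formats.

```python
import tempfile
from pathlib import Path

# Input extensions listed in this README for process_all_files.
SUPPORTED_EXTENSIONS = {
    ".csv", ".txt", ".tsv", ".xls", ".xlsx", ".json",
    ".parquet", ".pkl", ".feather", ".db", ".sqlite",
}


def find_supported_files(directory_path):
    """Return every supported file under directory_path, subfolders included, sorted."""
    root = Path(directory_path)
    return sorted(
        path
        for path in root.rglob("*")
        if path.is_file() and path.suffix.lower() in SUPPORTED_EXTENSIONS
    )


# Build a small nested folder tree and list what would be processed.
with tempfile.TemporaryDirectory() as tmp:
    (Path(tmp) / "2023" / "q1").mkdir(parents=True)
    (Path(tmp) / "2023" / "q1" / "sales.csv").write_text("a,b\n1,2\n")
    (Path(tmp) / "2023" / "report.XLSX").write_text("")
    (Path(tmp) / "notes.md").write_text("ignored")
    found = [p.relative_to(tmp).as_posix() for p in find_supported_files(tmp)]
```

Lower-casing the suffix lets files such as `report.XLSX` match, while unsupported files such as `notes.md` are skipped.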


read_any(file_path)

Reads any supported file format and returns a pandas DataFrame — one function for everything.

from deepcsv import read_any

df = read_any('data/users.csv')
df = read_any('reports/sales.xlsx')
df = read_any('warehouse/orders.parquet')
df = read_any('local.db')

Supported formats: .csv .txt .tsv .xls .xlsx .json .parquet .pkl .feather .db .sqlite
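
A single reader for many formats usually comes down to dispatching on the file extension. The sketch below shows that pattern with standard pandas readers. The `READERS` table and `read_by_extension` are hypothetical; deepcsv's real mapping may differ, and `.txt`, `.db`, and `.sqlite` are left out because they need extra decisions (delimiter, table name) that the README does not describe.

```python
import tempfile
from functools import partial
from pathlib import Path

import pandas as pd

# Hypothetical extension-to-reader table for illustration only.
READERS = {
    ".csv": pd.read_csv,
    ".tsv": partial(pd.read_csv, sep="\t"),
    ".xls": pd.read_excel,
    ".xlsx": pd.read_excel,
    ".json": pd.read_json,
    ".parquet": pd.read_parquet,
    ".pkl": pd.read_pickle,
    ".feather": pd.read_feather,
}


def read_by_extension(file_path):
    """Pick a pandas reader from the file extension and return a DataFrame."""
    suffix = Path(file_path).suffix.lower()
    try:
        reader = READERS[suffix]
    except KeyError:
        raise ValueError(f"Unsupported file extension: {suffix!r}") from None
    return reader(file_path)


# The same data stored as CSV and TSV loads through one entry point.
with tempfile.TemporaryDirectory() as tmp:
    csv_path = Path(tmp) / "users.csv"
    tsv_path = Path(tmp) / "users.tsv"
    csv_path.write_text("name,age\nAda,36\n")
    tsv_path.write_text("name\tage\nAda\t36\n")
    from_csv = read_by_extension(csv_path)
    from_tsv = read_by_extension(tsv_path)
```

Raising a clear error for unknown extensions is a design choice of this sketch; a library could equally fall back to a default reader.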


clean_values(data_input, ...)

Cleans a DataFrame by removing nulls, specific values, specific types, or rows by index — with full control over which columns to target and optional conditions.

from deepcsv import clean_values

# Drop fully-null columns
df = clean_values('data.csv', cols=['age', 'salary'])

# Drop rows that have nulls in specific cols
df = clean_values('data.csv', cols=['age', 'salary'], ax_0=True)

# Drop rows by index
df = clean_values(df, index=[0, 5, 12])

# Remove rows where a specific value exists
df = clean_values(df, cols=['status'], finding_value='N/A')

# Remove rows where value meets a condition
df = clean_values(df, cols=['score'], finding_value='N/A', condition=['>=', 500])

# Remove rows by Python type
df = clean_values(df, cols=['age'], finding_type=str)

# Apply on all columns except some
df = clean_values('data.csv', all_cols_except=['id', 'name'])

Parameter        Type             Default   Description
data_input       str | DataFrame  required  File path or DataFrame
cols             list             None      Columns to apply on
ax_0             bool             False     True: drop rows with nulls; False: drop fully-null columns
index            list             None      Row indexes to drop
condition        list             None      [operator, value], e.g. ['>=', 500]
all_cols_except  list             None      Apply on all columns except these
finding_value    any              None      Find and remove rows containing this value
finding_type     type             None      Find and remove rows matching this Python type
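
The two `ax_0` modes correspond to familiar pandas operations. The sketch below reproduces the documented behaviour with `dropna` and `drop`; `drop_nulls` is a hypothetical stand-in written for illustration and says nothing about how clean_values is implemented internally.

```python
import pandas as pd


def drop_nulls(df, cols, ax_0=False):
    """Mimic the documented ax_0 switch using plain pandas (illustrative only).

    ax_0=True  -> drop rows that have a null in any of `cols`
    ax_0=False -> drop those columns in `cols` that are entirely null
    """
    if ax_0:
        return df.dropna(subset=cols)
    fully_null = [c for c in cols if df[c].isna().all()]
    return df.drop(columns=fully_null)


df = pd.DataFrame(
    {
        "age": [30, None, 41],
        "salary": [None, None, None],  # entirely null
        "name": ["a", "b", "c"],
    }
)
without_empty_cols = drop_nulls(df, ["age", "salary"])
without_null_rows = drop_nulls(df, ["age"], ax_0=True)
```

Both calls return new DataFrames and leave `df` unchanged.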

Supported condition operators: >= <= > < == !=
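
A condition such as ['>=', 500] can be turned into a callable with the standard `operator` module. This is a sketch of the idea only: `OPERATORS` and `parse_condition` are hypothetical names, and the validation rules are assumptions rather than deepcsv's actual checks.

```python
import operator

# The six operator strings listed above, mapped to standard-library functions.
OPERATORS = {
    ">=": operator.ge,
    "<=": operator.le,
    ">": operator.gt,
    "<": operator.lt,
    "==": operator.eq,
    "!=": operator.ne,
}


def parse_condition(condition):
    """Validate an [operator, value] pair and return (operator_function, value)."""
    if not isinstance(condition, (list, tuple)) or len(condition) != 2:
        raise ValueError("condition must look like [operator, value], e.g. ['>=', 500]")
    symbol, value = condition
    if not isinstance(symbol, str) or symbol not in OPERATORS:
        raise ValueError(f"Unsupported operator: {symbol!r}")
    return OPERATORS[symbol], value


compare, threshold = parse_condition([">=", 500])
scores = [120, 500, 980]
# Keep only the values that do NOT meet the condition, i.e. remove matches.
kept = [s for s in scores if not compare(s, threshold)]
```

A lookup table keeps the accepted operators explicit and avoids evaluating user-supplied strings as code.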


Function Signatures

process_file(data_input: Union[str, pd.DataFrame], save_file_extension: str = None) -> pd.DataFrame
process_all_files(directory_path: str, output_dir: str = "All CSV Files is Converted Here", file_extension: str = "parquet") -> None
read_any(file_path: str) -> pd.DataFrame
clean_values(data_input, cols=None, ax_0=False, index=None, condition=None, all_cols_except=None, finding_value=None, finding_type=None) -> pd.DataFrame

Key Features

  • String list → real NumPy array conversion (fast, no manual parsing)
  • Mixed-type column detection and auto-fix
  • Save in any format — CSV, Excel, JSON, Parquet, Feather, and more
  • One universal file reader for 10+ formats
  • Flexible null cleaning by column, row, index, value, or type
  • Conditional filtering with 6 operators
  • Recursive directory traversal
  • Warning messages for full transparency

Notes

  • Requires pyarrow for Parquet and Feather support
  • process_all_files saves a file only if its DataFrame contains converted array columns

Requirements

  • Python >= 3.7
  • pandas
  • pyarrow

By: Abdullah Bakr

Changelog


Added

  • process_all_files — Added option for user to customize the output folder name in
  • read_any() — Reads any supported file format and returns a pandas DataFrame automatically. Supports: .csv, .txt, .tsv, .xls, .xlsx, .json, .parquet, .pkl, .feather, .db, .sqlite
  • clean_values() — Cleans a DataFrame by removing nulls, specific values, specific types, or rows by index. Supports optional condition filtering with 6 operators
  • _validate_cols() — Internal helper: validates cols is a non-empty list and all columns exist in the DataFrame
  • _validate_index() — Internal helper: validates index is a non-empty list and all indexes exist in the DataFrame. Supports optional reset_index before validation
  • _validate_condition() — Internal helper: validates condition list and returns (operator_func, value)
  • _parse_operator() — Internal helper: converts operator string like '>=' into its Python operator function
  • finding_value parameter in clean_values(): finds and removes rows that contain this specific value
  • finding_type parameter in clean_values(): finds and removes rows whose value has this specific Python type (ex: str, int)
  • condition parameter in clean_values(): a list of the form [operator, value] (ex: ['>=', 500]), applied only together with finding_value or finding_type

Changed

  • process_file() — Added save_file_extension parameter. Now supports saving the processed DataFrame in any format after conversion, not just returning it
  • process_all_files() — Added file_extension parameter. Now supports saving converted files in any format instead of always saving as Parquet. Also expanded supported input formats beyond .csv and .xlsx to cover all formats supported by read_any()

Download files

Source Distribution: deepcsv-0.6.2.tar.gz (11.8 kB)

Built Distribution: deepcsv-0.6.2-py3-none-any.whl (10.0 kB, Python 3)

