
A minimalistic solution to messy CSV files.


tidyCSV.py

Code style: black. License: MIT. Status: experimental.

Tired of pseudo-CSV files full of invalid entries? Me too; this is my solution.

You have probably run into this error, as I have, when reading a CSV into Python using pandas:

ParserError: Error tokenizing data. C error: Expected 8 fields in line 7, saw 47

This happens because some lines in your file have more columns than the header, or because of other inconsistencies such as intermediate blank lines or lines containing random tokens.
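To see this kind of inconsistency for yourself, here is a minimal sketch that counts the fields on each line using only the standard library. The sample text and the helper name `field_counts` are made up for illustration; they are not part of tidyCSV.

```python
import csv
import io
from collections import Counter

# Hypothetical messy input: a stray token line, a blank line,
# and a row with more fields than the header.
messy = """id,name,value
1,a,10
2,b,20
garbage

3,c,30,unexpected,extra
4,d,40
"""


def field_counts(text):
    """Return the number of fields on each line (0 for a blank line)."""
    return [len(row) for row in csv.reader(io.StringIO(text))]


counts = field_counts(messy)
histogram = Counter(counts)  # how many lines have each field count
print(counts)
print(histogram.most_common())
```

Any line whose count differs from the header's count is a candidate for the tokenizing error shown above.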

Fear no more: tidyCSV provides a simple, clear interface to the semantically coherent chunks of your CSV file (if there are any). By default it selects the biggest group found, that is, the one containing the most lines.

I may add an option to specify how many columns you expect, so that groups can be filtered by a predefined criterion. Eventually I would like this project to become a command-line tool with a richer set of features, but it currently serves its purpose, so this is not a priority.
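The idea of picking the biggest coherent group can be sketched roughly as follows. This is only an illustration, not tidyCSV's actual implementation: it assumes a "group" is a run of consecutive non-blank lines that share the same number of fields, and the function name `largest_coherent_chunk` and the sample text are hypothetical.

```python
import csv
from itertools import groupby

# Hypothetical messy input used only for this illustration.
messy = """id,name,value
1,a,10
2,b,20
garbage

3,c,30,unexpected,extra
4,d,40
"""


def n_fields(line):
    """Count the fields on one line; a blank line counts as 0."""
    return len(next(csv.reader([line]), []))


def largest_coherent_chunk(text):
    """Return the longest run of consecutive lines with the same field count.

    Simplification: the text is split on newlines first, so quoted
    fields that span several lines are not handled.
    """
    best = []
    for count, run in groupby(text.splitlines(), key=n_fields):
        run = list(run)
        # Skip blank lines; keep the first run seen when lengths tie.
        if count > 0 and len(run) > len(best):
            best = run
    return "\n".join(best)


chunk = largest_coherent_chunk(messy)
print(chunk)
```

Under this assumption the trailing row `4,d,40` is left out, because the malformed lines before it break the run.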

Installation

The package has been published to PyPI! You can install it like any other package using pip (I recommend installing it within a virtual environment created on a per-project basis).

pip install tidycsv

Otherwise you can install the latest development version using:

pip install git+https://github.com/gmagannaDevelop/tidyCSV.py

Usage

Use the context manager provided at the top level of the package to read an otherwise unreadable CSV as follows:

import pandas as pd
from tidycsv import TidyCSV as tidycsv

with tidycsv("your-messy-csv-file.csv") as tidy:
    df = pd.read_csv(tidy)

Now you have a DataFrame ready to use instead of an exception.

Bugs and feature requests

If you find that tidyCSV is not behaving as you would expect it to, please feel free to open an issue. The same goes for feature requests.
