A Python package for handling messy CSV files

CleverCSV: A Clever CSV Package

CleverCSV provides a drop-in replacement for Python's csv module with improved dialect detection for messy CSV files. It also provides a handy command line tool that can standardize a messy file or generate Python code to import it.

Introduction

  • CSV files are awesome: they are lightweight, easy to share, human-readable, version-controllable, and supported by many systems and tools!
  • CSV files are terrible: they can have many different formats, multiple tables, headers or no headers, escape characters, and there's no support for data dictionaries.

CleverCSV is a Python package that aims to solve many of the pain points of CSV files, while maintaining many of the good things. The package automatically detects (with high accuracy) the format (dialect) of CSV files, thus making it easier to simply point to a CSV file and load it, without the need for human inspection. In the future, we hope to solve some of the other issues of CSV files too.

CleverCSV is based on science. We investigated thousands of real-world CSV files to find a robust way to automatically detect the dialect of a file. This may seem like an easy problem, but to a computer a CSV file is simply a long string, and every dialect will give you some table. In CleverCSV we use a technique based on the patterns of the parsed file and the data type of the parsed cells. With our method we achieve a 97% accuracy for dialect detection, with a 21% improvement on non-standard (messy) CSV files.
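To make the point that "every dialect will give you some table" concrete, here is a small sketch using only the standard library csv module. It is not CleverCSV's detection algorithm; the sample text and the helper names parse and row_lengths are made up for illustration. It parses the same string with three candidate delimiters and prints the row lengths that result.

```python
import csv
import io

# Made-up sample: semicolon-delimited, with decimal commas in the cells.
text = "name;price\napple;1,5\npear;2,25\n"


def parse(text, delimiter):
    """Parse the text with the given delimiter and return all rows."""
    return list(csv.reader(io.StringIO(text), delimiter=delimiter))


def row_lengths(rows):
    """Number of cells in each row: a crude view of the row pattern."""
    return [len(row) for row in rows]


# Every candidate delimiter yields *some* table without raising an error.
for delimiter in [";", ",", "|"]:
    rows = parse(text, delimiter)
    print(repr(delimiter), row_lengths(rows), rows)
```

In this toy case the comma gives ragged rows, while the semicolon and the pipe both give rows of equal length, so row lengths alone would not single out the semicolon here. That suggests why looking at the contents of the parsed cells as well can help.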

We think this kind of work can be very valuable for working data scientists and programmers, and we hope that you find CleverCSV useful (if there's a problem, please open an issue!). Since the academic world counts citations, please cite CleverCSV if you use the package. Here's a BibTeX entry you can use:

@article{van2019wrangling,
        title = {Wrangling Messy {CSV} Files by Detecting Row and Type Patterns},
        author = {{van den Burg}, G. J. J. and Nazabal, A. and Sutton, C.},
        journal = {Data Mining and Knowledge Discovery},
        year = {2019},
        month = {Jul},
        day = {26},
        issn = {1573-756X},
        doi = {10.1007/s10618-019-00646-y},
}

And of course, if you like the package please spread the word! You can do this by Tweeting about it (#CleverCSV) or clicking the ⭐️ on GitHub!

Installation

The package is available on PyPI:

$ pip install clevercsv

Usage

CleverCSV consists of a Python library and a command line tool called clevercsv.

Library

We designed CleverCSV to provide a drop-in replacement for the built-in csv module, with some useful functionality added to it. Therefore, if you simply want to replace the built-in csv module with CleverCSV, you can import CleverCSV as follows and use it as you would use the built-in csv module.

import clevercsv

CleverCSV provides an improved version of the dialect sniffer in the csv module, but it also adds some useful wrapper functions. These functions automatically detect the dialect and aim to make working with CSV files easier. We currently have the following helper functions:

  • detect_dialect: takes a path to a CSV file and returns the detected dialect.
  • read_csv: automatically detects the dialect and encoding of the file, and returns the data as a list of rows.
  • csv2df: detects the dialect and encoding of the file and then uses Pandas to read the CSV into a DataFrame.

Of course, you can also use the traditional way of loading a CSV file, as in the Python CSV module:

# importing this way makes it easy to port existing code to CleverCSV
import clevercsv as csv

with open("data.csv", "r", newline="") as fp:
    # you can use verbose=True to see what CleverCSV does:
    dialect = csv.Sniffer().sniff(fp.read(), verbose=False)
    # rewind the file so the reader starts from the first row
    fp.seek(0)
    reader = csv.reader(fp, dialect)
    rows = list(reader)

That's the basics! If you want more details, you can look at the code of the package, the test suite, or the API documentation.

Command-Line Tool

The clevercsv command line application has a number of handy features to make working with CSV files easier. For instance, it can be used to view a CSV file on the command line while automatically detecting the dialect. It can also generate Python code for importing data from a file with the correct dialect. The full help text is as follows:

USAGE
  clevercsv [-h] [-v] [-V] <command> [<arg1>] ... [<argN>]

ARGUMENTS
  <command>       The command to execute
  <arg>           The arguments of the command

GLOBAL OPTIONS
  -h (--help)     Display this help message.
  -v (--verbose)  Enable verbose mode.
  -V (--version)  Display the application version.

AVAILABLE COMMANDS
  code            Generate Python code for importing the CSV file.
  detect          Detect the dialect of a CSV file
  help            Display the manual of a command
  standardize     Convert a CSV file to one that conforms to RFC-4180.
  view            View the CSV file on the command line using TabView

Each of the commands has further options (for instance, the code command can generate code for importing a Pandas DataFrame). Use clevercsv help <command> for more information. Below are some examples for each command:

Code

Code generation is useful when you don't want to detect the dialect of the same file over and over again. You simply run the following command and copy the generated code to a Python script!

$ clevercsv code imdb.csv

# Code generated with CleverCSV

import clevercsv

with open("imdb.csv", "r", newline="", encoding="utf-8") as fp:
    reader = clevercsv.reader(fp, delimiter=",", quotechar="", escapechar="\\")
    rows = list(reader)

We also have a version that reads a Pandas dataframe:

$ clevercsv code --pandas imdb.csv

# Code generated with CleverCSV

import clevercsv

df = clevercsv.csv2df("imdb.csv", delimiter=",", quotechar="", escapechar="\\")

Detect

Detection is useful when you only want to know the dialect.

$ clevercsv detect imdb.csv
Detected: SimpleDialect(',', '', '\\')

The --plain flag gives the components of the dialect on separate lines, which makes combining it with grep easier.

$ clevercsv detect --plain imdb.csv
delimiter = ,
quotechar =
escapechar = \
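If you would rather consume the --plain output from a script than from grep, the lines can be split into key/value pairs. The snippet below is only a sketch: parse_plain is a hypothetical helper, not part of CleverCSV, and the input string is copied from the example output above.

```python
# Output copied from the `clevercsv detect --plain` example above.
plain_output = """delimiter = ,
quotechar =
escapechar = \\
"""


def parse_plain(output):
    """Turn 'key = value' lines into a dict (hypothetical helper)."""
    dialect = {}
    for line in output.splitlines():
        if not line.strip():
            continue
        # Split on the first '=' only, so a value of '=' survives.
        key, _, value = line.partition("=")
        # strip() would also remove a whitespace delimiter (tab or space),
        # so this sketch only suits non-whitespace dialect characters.
        dialect[key.strip()] = value.strip()
    return dialect


dialect = parse_plain(plain_output)
print(dialect)
```

An empty value, as for quotechar here, ends up as an empty string in the resulting dict.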

Standardize

Use the standardize command when you want to rewrite a file using the RFC-4180 standard:

$ clevercsv standardize --output imdb_standard.csv imdb.csv

In this particular example the use of the escape character is replaced by using quotes.
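The idea of replacing an escape character with quotes can be sketched with the standard library csv module. This is an illustration of the concept under assumed input, not the code that the standardize command runs: the sample text is made up, and the dialect (comma delimiter, no quote character, backslash escape) mirrors the one detected for imdb.csv above.

```python
import csv
import io

# Made-up input: no quote character, backslash used as escape character.
messy = "title,year\nHello\\, World,2019\n"

# Read with the "messy" dialect: quoting disabled, backslash escapes.
reader = csv.reader(
    io.StringIO(messy),
    delimiter=",",
    quoting=csv.QUOTE_NONE,
    escapechar="\\",
)
rows = list(reader)

# Write with RFC 4180 style settings: comma delimiter, double quotes
# around fields that need them, and CRLF line endings.
out = io.StringIO()
writer = csv.writer(
    out,
    delimiter=",",
    quotechar='"',
    quoting=csv.QUOTE_MINIMAL,
    lineterminator="\r\n",
)
writer.writerows(rows)
standardized = out.getvalue()
print(standardized)
```

The escaped comma in the input becomes part of a quoted field in the output, so the file no longer needs an escape character.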

View

This command allows you to view the file in the terminal. The dialect is of course detected using CleverCSV! Both this command and the standardize command support the --transpose flag, if you want to transpose the file before viewing or saving:

$ clevercsv view --transpose imdb.csv
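For readers unfamiliar with transposing a table, the operation turns rows into columns. The sketch below shows one way to do this on a list of rows in plain Python. The transpose helper and the sample rows are made up, and padding short rows with empty strings is an assumption of this sketch; how CleverCSV itself treats rows of unequal length is not described here.

```python
from itertools import zip_longest

# Made-up rows, as they might come out of a CSV reader.
rows = [
    ["name", "year"],
    ["Alien", "1979"],
    ["Heat", "1995"],
]


def transpose(rows, fill=""):
    """Swap rows and columns, padding short rows with `fill` (an assumption)."""
    return [list(column) for column in zip_longest(*rows, fillvalue=fill)]


transposed = transpose(rows)
for row in transposed:
    print(row)
```

Each output row now holds one column of the original table, which can be easier to read in a terminal when a file has many columns and few rows.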

Contributing

If you want to encourage development of CleverCSV, the best thing to do now is to spread the word!

If you encounter an issue in CleverCSV, please open an issue or submit a pull request!

Notes

License: MIT (see LICENSE file).

Copyright (c) 2019 The Alan Turing Institute.
