Skip to main content

A clever CSV parser

Project description

CleverCSV: A Clever CSV Parser

Build Status PyPI version Documentation Status Binder

This package is currently in beta. If you encounter any problems, please open an issue or submit a pull request!

Handy links:

Introduction

  • CSV files are awesome: they are lightweight, easy to share, human-readable, version-controllable, and supported by many systems and tools!
  • CSV files are terrible: they can have many different formats, multiple tables, headers or no headers, escape characters, and there's no support for data dictionaries.

CleverCSV is a Python package that aims to solve many of the pain points of CSV files, while maintaining many of the good things. The package automatically detects (with high accuracy) the format (dialect) of CSV files, thus making it easier to simply point to a CSV file and load it, without the need for human inspection. In the future, we hope to solve some of the other issues of CSV files too.

CleverCSV is based on science. We investigated thousands of real-world CSV files to find a robust way to automatically detect the dialect of a file. This may seem like an easy problem, but to a computer a CSV file is simply a long string, and every dialect will give you some table. In CleverCSV we use a technique based on the patterns of the parsed file and the data type of the parsed cells. With our method we achieve a 97% accuracy for dialect detection, with a 21% improvement on non-standard (messy) CSV files.

We think this kind of work can be very valuable for working data scientists and programmers and we hope that you find CleverCSV useful (if there's a problem, please open an issue!) Since the academic world counts citations, please cite CleverCSV if you use the package. Here's a BibTeX entry you can use:

@article{van2019wrangling,
        title = {Wrangling Messy {CSV} Files by Detecting Row and Type Patterns},
        author = {{van den Burg}, G. J. J. and Nazabal, A. and Sutton, C.},
        journal = {Data Mining and Knowledge Discovery},
        year = {2019},
        month = {Jul},
        day = {26},
        issn = {1573-756X},
        doi = {10.1007/s10618-019-00646-y},
}

And of course, if you like the package please spread the word! You can do this by Tweeting about it (#CleverCSV) or clicking the ⭐️ on GitHub!

Installation

The package is available on PyPI:

$ pip install clevercsv

Usage

CleverCSV consists of a Python library and a command line tool called clevercsv.

Library

We designed CleverCSV to provide a drop-in replacement for the built-in CSV module, with some useful functionality added to it. Therefore, if you simply want to replace the builtin CSV module with CleverCSV, you can import CleverCSV as follows, and use it as you would use the builtin csv module.

import clevercsv

CleverCSV provides an improved version of the dialect sniffer in the CSV module, but it also adds some useful wrapper functions. These functions automatically detect the dialect and aim to make working with CSV files easier. We currently have the following helper functions:

  • detect_dialect: takes a path to a CSV file and returns the detected dialect
  • read_csv: automatically detects the dialect and encoding of the file, and returns the data as a list of rows.
  • csv2df: detects the dialect and encoding of the file and then uses Pandas to read the CSV into a DataFrame.

Of course, you can also use the traditional way of loading a CSV file, as in the Python CSV module:

# importing this way makes it easy to port existing code to CleverCsv
import clevercsv as csv

with open("data.csv", "r", newline="") as fp:
  # you can use verbose=True to see what CleverCSV does:
  dialect = csv.Sniffer().sniff(fid.read(), verbose=False)
  fp.seek(0)
  reader = csv.reader(fp, dialect)
  rows = list(reader)

That's the basics! If you want more details, you can look at the code of the package, the test suite, or the API documentation.

Command-Line Tool

The clevercsv command line application has a number of handy features to make working with CSV files easier. For instance, it can be used to view a CSV file on the command line while automatically detecting the dialect. It can also generate Python code for importing data from a file with the correct dialect. The full help text is as follows:

USAGE
  clevercsv [-h] [-v] [-V] <command> [<arg1>] ... [<argN>]

ARGUMENTS
  <command>       The command to execute
  <arg>           The arguments of the command

GLOBAL OPTIONS
  -h (--help)     Display this help message.
  -v (--verbose)  Enable verbose mode.
  -V (--version)  Display the application version.

AVAILABLE COMMANDS
  code            Generate Python code for importing the CSV file.
  detect          Detect the dialect of a CSV file
  help            Display the manual of a command
  standardize     Convert a CSV file to one that conforms to RFC-4180.
  view            View the CSV file on the command line using TabView

Each of the commands has further options (for instance, the code command can generate code for importing a Pandas DataFrame). Use clevercsv help <command> for more information. Below are some examples for each command:

Code

Code generation is useful when you don't want to detect the dialect of the same file over and over again. You simply run the following command and copy the generated code to a Python script!

$ clevercsv code imdb.csv

# Code generated with CleverCSV

import clevercsv

with open("imdb.csv", "r", newline="", encoding="utf-8") as fp:
    reader = clevercsv.reader(fp, delimiter=",", quotechar="", escapechar="\\")
    rows = list(reader)

We also have a version that reads a Pandas dataframe:

$ clevercsv code --pandas imdb.csv

# Code generated with CleverCSV

import clevercsv

df = clevercsv.csv2df("imdb.csv", delimiter=",", quotechar="", escapechar="\\")

Detect

Detection is useful when you only want to know the dialect.

$ clevercsv detect imdb.csv
Detected: SimpleDialect(',', '', '\\')

The --plain flag gives the components of the dialect on separate lines, which makes combining it with grep easier.

$ clevercsv detect --plain imdb.csv
delimiter = ,
quotechar =
escapechar = \

Standardize

Use the standardize command when you want to rewrite a file using the RFC-4180 standard:

$ clevercsv standardize --output imdb_standard.csv imdb.csv

In this particular example the use of the escape character is replaced by using quotes.

View

This command allows you to view the file in the terminal. The dialect is of course detected using CleverCSV! Both this command and the standardize command support the --transpose flag, if you want to transpose the file before viewing or saving:

$ clevercsv view --transpose imdb.csv

Contributors

Code:

Scientific work:

Contributing

If you want to encourage development of CleverCSV, the best thing to do now is to spread the word!

If you encounter an issue in CleverCSV, please open an issue or submit a pull request!

Notes

License: MIT (see LICENSE file).

Copyright (c) 2019 The Alan Turing Institute.

Project details


Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

clevercsv-0.4.0.tar.gz (78.0 kB view details)

Uploaded Source

Built Distribution

clevercsv-0.4.0-py3.7-linux-x86_64.egg (123.3 kB view details)

Uploaded Source

File details

Details for the file clevercsv-0.4.0.tar.gz.

File metadata

  • Download URL: clevercsv-0.4.0.tar.gz
  • Upload date:
  • Size: 78.0 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/1.12.1 pkginfo/1.4.2 requests/2.22.0 setuptools/40.6.3 requests-toolbelt/0.8.0 tqdm/4.24.0 CPython/3.7.4

File hashes

Hashes for clevercsv-0.4.0.tar.gz
Algorithm Hash digest
SHA256 ccd8dc20d62cc445940f54617594d6cb8dd8e7ec543ab5feadb8d90f293c053b
MD5 175f8e1b9d67ad4d51393604c63e9a2d
BLAKE2b-256 063d361b194c0bece8318983e5c330976e8776528edd2a8b25f3868e757929cc

See more details on using hashes here.

File details

Details for the file clevercsv-0.4.0-py3.7-linux-x86_64.egg.

File metadata

  • Download URL: clevercsv-0.4.0-py3.7-linux-x86_64.egg
  • Upload date:
  • Size: 123.3 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/1.12.1 pkginfo/1.4.2 requests/2.22.0 setuptools/40.6.3 requests-toolbelt/0.8.0 tqdm/4.24.0 CPython/3.7.4

File hashes

Hashes for clevercsv-0.4.0-py3.7-linux-x86_64.egg
Algorithm Hash digest
SHA256 ee03572a2917dbbb541965f82206fd2ea204d291897e0672c4f8ac2d7479c98d
MD5 e678e296a2c000fa05c0ce79841fb28f
BLAKE2b-256 af1f8db10cdadbfdb0b7329ab43a53411c43dec1bd9570e9b134b290ce78c5b7

See more details on using hashes here.

Supported by

AWS AWS Cloud computing and Security Sponsor Datadog Datadog Monitoring Fastly Fastly CDN Google Google Download Analytics Microsoft Microsoft PSF Sponsor Pingdom Pingdom Monitoring Sentry Sentry Error logging StatusPage StatusPage Status page