A clever CSV parser
CleverCSV: A Clever CSV Parser
This package is currently in beta. If you encounter any problems, please open an issue or submit a pull request!
Introduction
- CSV files are awesome: they are lightweight, easy to share, human-readable, version-controllable, and supported by many systems and tools!
- CSV files are terrible: they can have many different formats, multiple tables, headers or no headers, escape characters, and there's no support for data dictionaries.
CleverCSV is a Python package that aims to solve many of the pain points of CSV files, while maintaining many of the good things. The package automatically detects (with high accuracy) the format (dialect) of CSV files, thus making it easier to simply point to a CSV file and load it, without the need for human inspection. In the future, we hope to solve some of the other issues of CSV files too.
A demo of CleverCSV is available on BinderHub.
CleverCSV is based on science. We investigated thousands of real-world CSV files to find a robust way to automatically detect the dialect of a file. This may seem like an easy problem, but to a computer a CSV file is simply a long string, and every dialect will give you some table. In CleverCSV we use a technique based on the patterns of the parsed file and the data type of the parsed cells. With our method we achieve a 97% accuracy for dialect detection, with a 21% improvement on non-standard (messy) CSV files.
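To make the point that "every dialect will give you some table" concrete, here is a minimal sketch that uses only Python's standard csv module. It is not CleverCSV's detection code; the sample string and the helper names parse and row_lengths are made up for illustration.

```python
import csv
import io

# Hypothetical sample: ';' is the delimiter and ',' is a decimal mark.
sample = "a;b;c\n1,5;2,0;3\n4,2;5;6,1\n"


def parse(text, delimiter):
    """Parse text with the given delimiter and return a list of rows."""
    return list(csv.reader(io.StringIO(text), delimiter=delimiter))


def row_lengths(text, delimiter):
    """Return the number of cells in each parsed row."""
    return [len(row) for row in parse(text, delimiter)]


# Every candidate delimiter produces *some* table; only the shapes differ.
for delim in (",", ";", "\t"):
    print(repr(delim), row_lengths(sample, delim))
```

Both the semicolon and the tab character give rows of equal length here, so the shape of the rows alone cannot pick the dialect. Looking at the cell contents as well (for example, whether a cell such as 1,5 reads as a number) is the kind of extra signal that separates the two.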
We think this kind of work can be very valuable for working data scientists and programmers, and we hope that you find CleverCSV useful (if there's a problem, please open an issue!). Since the academic world counts citations, please cite CleverCSV if you use the package. Here's a BibTeX entry you can use:
@article{van2018wrangling,
title={Wrangling Messy {CSV} Files by Detecting Row and Type Patterns},
author={{van den Burg}, G. J. J. and Naz{\'a}bal, A. and Sutton, C.},
journal={arXiv preprint arXiv:1811.11242},
year={2018}
}
Installation
The package is available on PyPI:
$ pip install clevercsv
Usage
CleverCSV consists of a Python library and a command line tool (clevercsv).
Library
We designed CleverCSV to provide a drop-in replacement for the built-in CSV module, with some useful functionality added to it. Therefore, if you simply want to replace the built-in CSV module with CleverCSV, you only have to change the import:
import clevercsv
CleverCSV provides an improved version of the dialect sniffer in the CSV module, but it also adds some useful wrapper functions. For instance, there's a wrapper for loading a CSV file with Pandas that uses CleverCSV to detect the dialect of the file:
from clevercsv import csv2df
df = csv2df("data.csv")
Of course, you can also use the traditional way of loading a CSV file, as in the Python CSV module:
# importing this way makes it easy to port existing code to CleverCSV
import clevercsv as csv

with open("data.csv", "r", newline="") as fp:
    # you can use verbose=True to see what CleverCSV does:
    dialect = csv.Sniffer().sniff(fp.read(), verbose=False)
    fp.seek(0)
    reader = csv.reader(fp, dialect)
    rows = list(reader)
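Because the interface mirrors the CSV module, the same sniff-then-read pattern can be wrapped in a small helper. The following is only a sketch: the helper name load_rows and the sample file are made up, and the fallback to the standard csv module is an assumption added here so that the example runs even where CleverCSV is not installed.

```python
import os
import tempfile

try:
    import clevercsv as csv  # preferred when the package is installed
except ImportError:
    import csv  # standard-library fallback, only so the sketch runs anywhere


def load_rows(path):
    """Detect the dialect of the file at `path` and return its rows."""
    with open(path, "r", newline="") as fp:
        dialect = csv.Sniffer().sniff(fp.read())
        fp.seek(0)
        # Consume the reader while the file is still open.
        return list(csv.reader(fp, dialect))


# Hypothetical sample file that uses ';' as the delimiter.
with tempfile.NamedTemporaryFile(
    "w", suffix=".csv", delete=False, newline=""
) as tmp:
    tmp.write("name;age;city\nAlice;30;Paris\nBob;25;Berlin\n")
    sample_path = tmp.name

rows = load_rows(sample_path)
os.remove(sample_path)
print(rows)
```

Note that the rows are collected inside the with block: the reader pulls lines from the file lazily, so the file has to stay open until the last row has been read.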
That's the basics! If you want more details, you can look at the code of the package or the test suite. Documentation will be provided in the future (but a lot of the functionality is similar to the CSV package in Python!)
Command-Line Tool
The clevercsv command line application has a number of handy features to make working with CSV files easier. For instance, it can be used to view a CSV file on the command line while automatically detecting the dialect. It can also generate Python code for importing data from a file with the correct dialect. The full help text is as follows:
CleverCSV version 0.3.1

USAGE
  clevercsv [-h] [-v] [-V] <command> [<arg1>] ... [<argN>]

ARGUMENTS
  <command>       The command to execute
  <arg>           The arguments of the command

GLOBAL OPTIONS
  -h (--help)     Display this help message.
  -v (--verbose)  Enable verbose mode.
  -V (--version)  Display the application version.

AVAILABLE COMMANDS
  code            Generate Python code for importing the CSV file.
  detect          Detect the dialect of a CSV file
  help            Display the manual of a command
  standardize     Convert a CSV file to one that conforms to RFC-4180.
  view            View the CSV file on the command line using TabView
Each of the commands has further options (for instance, the code command can generate code for importing a Pandas DataFrame). Use clevercsv help <command> for more information.
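To give an idea of what converting a file to RFC-4180 style involves, here is a rough sketch using only Python's standard csv module. It is not the implementation behind the standardize command: the function name standardize, the sample text, and the assumption that the input dialect is already known are all illustrative.

```python
import csv
import io


def standardize(text, delimiter, quotechar='"'):
    """Re-encode CSV text of a known dialect with comma delimiters,
    double-quote quoting, and CRLF line endings."""
    reader = csv.reader(
        io.StringIO(text, newline=""),
        delimiter=delimiter,
        quotechar=quotechar,
    )
    out = io.StringIO(newline="")
    writer = csv.writer(
        out,
        delimiter=",",
        quotechar='"',
        doublequote=True,  # embedded quotes are written as ""
        quoting=csv.QUOTE_MINIMAL,  # quote only the cells that need it
        lineterminator="\r\n",
    )
    writer.writerows(reader)
    return out.getvalue()


# Hypothetical input: ';' as delimiter and single quotes as quote character.
messy = "name;note\nBob;'hi, there'\nEve;'say \"ok\"'\n"
print(repr(standardize(messy, delimiter=";", quotechar="'")))
```

Cells that contain a comma or a double quote end up quoted in the output, while all other cells are written as they are.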
Contributors
Code:
Scientific work:
Contributing
If you want to encourage development of CleverCSV, the best thing to do now is to spread the word!
If you encounter an issue in CleverCSV, please open an issue or submit a pull request!
Notes
License: MIT (see LICENSE file).
Copyright (c) 2019 The Alan Turing Institute.