Penny
=====
Inspect your data. Find the truth.
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
.. figure:: http://www.martianwatches.com/wp-content/uploads/2013/10/InspectorGadget.jpg
   :alt: alt tag

   alt tag

Uncle Gadget was great and all, but when it came to real detective work,
we all know Penny did the heavy lifting. Hence, Penny, the Python module
that inspects stuff. Feed it rows or columns from a dataset, and get
information about the column types -- including whether or not a given
column represents a category or date. Penny also finds column headers
(waaaay more reliably than the ``Sniffer`` class in the standard
``csv`` module).
Why?
~~~~
If you're working with a few datasets, it's easy to figure out which
columns are supposed to be dates, integers and even categories just by
looking at the raw csv files. But if you need to programmatically deal
with lots of datasets, this gets tedious fast.
Setup
~~~~~
Grab the package.
::

    pip install penny

Or grab the code from GitHub.
::

    git clone https://github.com/gati/penny
    cd penny
    pip install -r requirements.txt

Getting Started
~~~~~~~~~~~~~~~
Guess the headers of a csv file.
.. code:: python

    from penny.headers import get_headers

    with open('your-awesome-file.csv') as csvfile:
        has_header, headers = get_headers(csvfile)

    # Prints True/False depending on whether or not headers were found
    print(has_header)

    # Prints column headers, or placeholders if real headers weren't found
    print(headers)  # ['Example Header A', 'Example Header B']

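For comparison, here is a minimal sketch of the standard-library baseline mentioned above, ``csv.Sniffer.has_header``. The sample data is made up for illustration, and nothing here calls Penny itself; it only shows what the ``csv`` module offers on its own.

.. code:: python

    import csv

    # A small in-memory sample; the column names and values are made up.
    sample = (
        "name,joined,visits\n"
        "Ada,2014-01-03,12\n"
        "Grace,2014-02-11,7\n"
        "Linus,2014-03-21,31\n"
    )

    sniffer = csv.Sniffer()
    # has_header() guesses a dialect, then compares the first row
    # with the rows below it and returns a single True/False guess.
    stdlib_guess = sniffer.has_header(sample)
    print(stdlib_guess)

Note that the standard library only returns a boolean here, whereas ``get_headers`` also hands back the header names (or placeholders).
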
Guess the data type of a column in your dataset.
.. code:: python

    import csv

    from penny.inspectors import column_types_probabilities

    with open('your-awesome-file.csv') as fileobj:
        rows = list(csv.reader(fileobj))

    # Get the values from column 0
    column_0 = [x[0] for x in rows]

    probs = column_types_probabilities(column_0)

    # Prints something like {'date': 1, 'int': .75, 'category': 0 ...}
    print(probs)

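Once you have a dictionary of probabilities, you will probably want to turn it into a single guess. The sketch below assumes the result is a plain dict of type name to score, as in the comment above; ``best_type`` and its ``threshold`` parameter are hypothetical helpers, not part of Penny.

.. code:: python

    def best_type(probs, threshold=0.5):
        """Return the highest-scoring type name, or None if no score
        reaches the threshold. (Hypothetical helper, not Penny's API.)"""
        if not probs:
            return None
        name = max(probs, key=probs.get)
        return name if probs[name] >= threshold else None


    # Example scores shaped like the output shown above (values made up).
    example = {'date': 1, 'int': 0.75, 'category': 0}
    guess = best_type(example)
    print(guess)

When two types tie for the top score, ``max`` returns whichever it meets first, so you may want your own tie-breaking rule.
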
Or get type guesses for all the rows in your dataset at once.
.. code:: python

    import csv

    from penny.inspectors import rows_types_probabilities

    with open('your-awesome-file.csv') as fileobj:
        rows = list(csv.reader(fileobj))

    probs = rows_types_probabilities(rows)

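If you would rather work column by column, the rows can be regrouped by column first. This is a plain-Python sketch using only the standard library; the in-memory file and its contents are made up, and it assumes the first row holds the headers.

.. code:: python

    import csv
    import io

    # Stand-in for an open CSV file; the contents are made up.
    fileobj = io.StringIO("city,population\nAustin,885400\nBoise,214237\n")
    rows = list(csv.reader(fileobj))

    header, body = rows[0], rows[1:]
    # zip(*body) turns a list of rows into a sequence of columns.
    columns = dict(zip(header, (list(col) for col in zip(*body))))
    print(columns)

``zip`` stops at the shortest row, so ragged files would need padding before this step.
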
Penny checks for a lot of data "types," not just the standard ``int``,
``str``, etc. Here's the list (for now):
- **date** something ``dateutil.parser`` can parse into a ``datetime``
  object
- **int** a whole number
- **bool** y/n or yes/no or something true/falsey
- **float** a number with a decimal
- **category** something you might want to group records by
- **text** string longer than 90 characters (something you could get
  names/places/sentiment/etc from)
- **id** unique for each row
- **coord** a float that might be a latitude or longitude
- **coord\_pair** string that looks like "coord,coord"
- **proportion** column where all values sum to 1 or 100
- **street** house number + street name
- **city** one of the world's 80,000 largest cities
- **region** smaller than a country, bigger than a city. state,
  province, etc
- **country** a country name on the `ISO 3166
  list <http://en.wikipedia.org/wiki/ISO_3166-1#Current_codes>`__
- **phone** a phone number
- **email** an email address
- **url** web address with or without http:// (so http://google.com or
  google.com)
- **address** a full address you could geocode with a service like
  Google Maps

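To make a few of these definitions concrete, here is a rough sketch of how **id**, **proportion** and **coord\_pair** could be checked by hand. These functions are illustrative assumptions based only on the descriptions in the list; they are not Penny's actual implementation, and the names and tolerance are made up.

.. code:: python

    def looks_like_id(values):
        """True when the column is non-empty and every value is distinct."""
        return len(values) > 0 and len(set(values)) == len(values)


    def looks_like_proportion(values, tolerance=1e-6):
        """True when the values are numeric and sum to roughly 1 or 100."""
        try:
            total = sum(float(v) for v in values)
        except (TypeError, ValueError):
            return False
        return any(abs(total - target) <= tolerance for target in (1.0, 100.0))


    def looks_like_coord_pair(value):
        """True for strings like "33.52,-86.80" within lat/lon ranges."""
        parts = value.split(',')
        if len(parts) != 2:
            return False
        try:
            lat, lon = float(parts[0]), float(parts[1])
        except ValueError:
            return False
        return -90 <= lat <= 90 and -180 <= lon <= 180


    print(looks_like_id(['a1', 'b2', 'c3']))
    print(looks_like_proportion(['40', '35', '25']))
    print(looks_like_coord_pair('33.52,-86.80'))

Checks like these give a yes/no answer per column or value; Penny reports a probability instead, which is more forgiving of a few messy rows.
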
Last but not least, you can also inspect a column for a single type.
.. code:: python

    import csv

    from penny.list_check import column_probability_for_type

    with open('your-awesome-file.csv') as fileobj:
        rows = list(csv.reader(fileobj))

    # Get the values from column 0
    column_0 = [x[0] for x in rows]

    prob = column_probability_for_type(column_0, 'date')

    # Prints something like 0.78
    print(prob)

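A single-type check is handy for sweeping a whole dataset, for example to find every column that looks like an integer. The sketch below is an assumption-laden illustration: ``columns_matching`` and ``fake_scorer`` are hypothetical names, and the scorer is a stand-in so the example runs without Penny installed.

.. code:: python

    def columns_matching(columns, type_name, scorer, threshold=0.75):
        """Return names of columns whose score for type_name meets the
        threshold. scorer is any callable taking (values, type_name)
        and returning a number between 0 and 1."""
        return [name for name, values in columns.items()
                if scorer(values, type_name) >= threshold]


    def fake_scorer(values, type_name):
        """Stand-in scorer for the demo: share of values that parse as ints."""
        if type_name != 'int' or not values:
            return 0.0
        hits = sum(1 for v in values if v.strip().lstrip('-').isdigit())
        return hits / float(len(values))


    # Made-up data, keyed by column name.
    columns = {'age': ['31', '45', '27'], 'name': ['Ada', 'Grace', 'Linus']}
    int_columns = columns_matching(columns, 'int', fake_scorer)
    print(int_columns)

With Penny installed, passing ``column_probability_for_type`` as ``scorer`` should slot in, since it is called with a list of values and a type name in the example above.
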
Contributing & Credits
~~~~~~~~~~~~~~~~~~~~~~
This is a work in progress, so pull request at will. Some of this work
was inspired by `messytables <https://github.com/okfn/messytables>`__,
which looks great for xls files but wasn't quite what I needed. Thanks
to `Chris Albon <http://twitter.com/chrisalbon>`__ for putting together
a `repo of useful test
datasets <https://github.com/chrisalbon/Variable-Type-Identification-Test-Datasets>`__.
Questions, concerns, devoted fan mail to
`@jonathonmorgan <http://twitter.com/jonathonmorgan>`__ on Twitter.