Skip to main content

Extract flat data and load it as relational data

Project description

|Build Status|

W-Drive Extractor (wextractor)
==============================

The W-Drive Extractor (or the wextractor), named after the home of the
shared list of contracts in the City of Pittsburgh, is an attempt to
extract and standardize data from spreadsheets, .csvs, and other files
for a relational destination.

Using the W-Drive-Extractor
---------------------------

Getting Started
~~~~~~~~~~~~~~~

Installation
^^^^^^^^^^^^

W-Drive Extractor is available as a pre-release package via pypi. You
can install via pip:

::

pip install wextractor --pre

Usage
^^^^^

Here's a simple example of extracting data from the Pittsburgh Police
blotter:

::

>>> from wextractor.extractors import CsvExtractor
>>> extractor = CsvExtractor('http://apps.pittsburghpa.gov/police/arrest_blotter/arrest_blotter_Monday.csv')
>>> data = extractor.extract()
>>> print data
>>> [{u'NEIGHBORHOOD': u'Spring Garden', u'DESCRIPTION': u'Flight to Avoid Apprehension Tri...

For a more complex example of usage, please see the bottom of this file.

TODO Features:
''''''''''''''

- Add cli support
- Change ``loader`` and ``extractor`` methods to use kwargs
- Add better exception messaging for the ``load`` and ``extract``
methods

Developing W-Drive-Extractor
----------------------------

Getting Started
~~~~~~~~~~~~~~~

W-Drive Extractor has some external dependencies, which can be installed
via pip. It is recommended that you use a
`virtualenv <https://github.com/codeforamerica/howto/blob/master/Python-Virtualenv.md>`__
to manage these.

The W-Drive-Extractor is an object-oriented application. In order to use
it, you must first *extract* data from its original source using an
``Extractor``'s ``extract`` method (the ``ExcelExtractor`` is currently
the only supported example). Once the data is extracted, it can then be
*loaded* back into some other datasource using a ``Loader``'s ``load``
method. (only ``PostgresLoader`` has been implemented thus far). For a
more detailed example on how this works, check out the sample usage at
the bottom of this file.

Extractors
~~~~~~~~~~

The Extractor base class is an interface for implementing data
extraction from different sources. It requires taking in a ``target``
which can be a file or URL and two optional params. Headers is the title
of the columns that will ultimately be extracted from your store, and
dtypes is a list of native python types that each column should have.

Current Implementations:
''''''''''''''''''''''''

- Excel (.xls, .xlsx)
- Comma-Separated Values (.csv)

TODO implementations:
'''''''''''''''''''''

- Generic Text Files (.txt)
- Postgres
- MS Access

Loaders
~~~~~~~

The Loader base class is an interface for implementing data loading into
new sources. It requires connection parameters (a python dictionary of
connection params) and optional schema. The goal is for a single input
source (spreadsheet, denormalized table, etc.) to be split into many
tables.

Current Implementations:
''''''''''''''''''''''''

- Postgres [with relationships and simple deduplication!]

TODO Implementations:
'''''''''''''''''''''

- Simple key/value cache (Memcached/Redis)
- Other relational data stores

Tests
~~~~~

Tests are located in the ``test`` directory. To run the tests, run

::

PYTHONPATH=. nosetests test/

from inside the root directory. For more coverage information, run

::

PYTHONPATH=. nosetests test/ -vs --with-coverage --cover-package=wextractor --cover-erase

Detailed Sample Usage
~~~~~~~~~~~~~~~~~~~~~

Below is an example of extracting data from Excel and loading it into a
local `postgres database <http://postgresapp.com/>`__ with defined
relationships. NOTE: This implementation is still fragile and likely to
be dependent on the fact that to\_relations is the last table in the
list below.

::

import datetime

from wextractor.extractors import ExcelExtractor
from wextractor.loaders import PostgresLoader

one_sheet = ExcelExtractor(
'files/one sheet contract list.xlsx',
dtypes=[
unicode, unicode, unicode, int, unicode,
unicode, datetime.datetime, int, unicode, unicode,
unicode, unicode, unicode, unicode, unicode,
unicode, unicode, unicode, unicode
]
)
data = one_sheet.extract()

loader = PostgresLoader(
{'database': 'w_drive', 'user': 'bensmithgall', 'host': 'localhost'},
[{
'table_name': 'contract',
'to_relations': [],
'from_relations': ['company'],
'pkey': None,
'columns': (
('description', 'TEXT'),
('notes', 'TEXT'),
('contract_number', 'VARCHAR(255)'),
('county', 'VARCHAR(255)'),
('type_of_contract', 'VARCHAR(255)'),
('pa', 'VARCHAR(255)'),
('expiration', 'TIMESTAMP'),
('spec_number', 'VARCHAR(255)'),
('controller_number', 'INTEGER'),
('commcode', 'INTEGER')
)
},
{
'table_name': 'company_contact',
'to_relations': [],
'from_relations': ['company'],
'pkey': None,
'columns': (
('contact_name', 'VARCHAR(255)'),
('address_1', 'VARCHAR(255)'),
('address_2', 'VARCHAR(255)'),
('phone_number', 'VARCHAR(255)'),
('email', 'VARCHAR(255)'),
('fax_number', 'VARCHAR(255)'),
('fin', 'VARCHAR(255)'),
)
},
{
'table_name': 'company',
'to_relations': ['company_contact', 'contract'],
'from_relations': [],
'pkey': None,
'columns': (
('company', 'VARCHAR(255)'),
('bus_type', 'VARCHAR(255)'),
)
}]
)

loader.load(data, True)

.. |Build Status| image:: https://travis-ci.org/codeforamerica/w-drive-extractor.svg?branch=master
:target: https://travis-ci.org/codeforamerica/w-drive-extractor

Project details


Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

wextractor-0.1.dev2.tar.gz (16.3 kB view hashes)

Uploaded Source

Supported by

AWS AWS Cloud computing and Security Sponsor Datadog Datadog Monitoring Fastly Fastly CDN Google Google Download Analytics Microsoft Microsoft PSF Sponsor Pingdom Pingdom Monitoring Sentry Sentry Error logging StatusPage StatusPage Status page