This is a pre-production deployment of Warehouse, however changes made here WILL affect the production instance of PyPI.
Latest Version Dependencies status unknown Test status unknown Test coverage unknown
Project Description
|Build Status|

W-Drive Extractor (wextractor)
==============================

The W-Drive Extractor (or the wextractor), named after the home of the
shared list of contracts in the City of Pittsburgh, is an attempt to
extract and standardize data from spreadsheets, .csvs, and other files
for a relational destination.

Using the W-Drive-Extractor
---------------------------

Getting Started
~~~~~~~~~~~~~~~

Installation
^^^^^^^^^^^^

W-Drive Extractor is available as a pre-release package via pypi. You
can install via pip:

::

pip install wextractor --pre

Usage
^^^^^

Here's a simple example of extracting data from the Pittsburgh Police
blotter:

::

>>> from wextractor.extractors import CsvExtractor
>>> extractor = CsvExtractor('http://apps.pittsburghpa.gov/police/arrest_blotter/arrest_blotter_Monday.csv')
>>> data = extractor.extract()
>>> print data
>>> [{u'NEIGHBORHOOD': u'Spring Garden', u'DESCRIPTION': u'Flight to Avoid Apprehension Tri...

For a more complex example of usage, please see the bottom of this file.

TODO Features:
''''''''''''''

- Add cli support
- Change ``loader`` and ``extractor`` methods to use kwargs
- Add better exception messaging for the ``load`` and ``extract``
methods

Developing W-Drive-Extractor
----------------------------

Getting Started
~~~~~~~~~~~~~~~

W-Drive Extractor has some external dependencies, which can be installed
via pip. It is recommended that you use a
`virtualenv <https: github.com="" codeforamerica="" howto="" blob="" master="" python-virtualenv.md="">`__
to manage these.

The W-Drive-Extractor is an object-oriented application. In order to use
it, you must first *extract* data from its original source using an
``Extractor``'s ``extract`` method (the ``ExcelExtractor`` is currently
the only supported example). Once the data is extracted, it can then be
*loaded* back into some other datasource using a ``Loader``'s ``load``
method. (only ``PostgresLoader`` has been implemented thus far). For a
more detailed example on how this works, check out the sample usage at
the bottom of this file.

Extractors
~~~~~~~~~~

The Extractor base class is an interface for implementing data
extraction from different sources. It requires taking in a ``target``
which can be a file or URL and two optional params. Headers is the title
of the columns that will ultimately be extracted from your store, and
dtypes is a list of native python types that each column should have.

Current Implementations:
''''''''''''''''''''''''

- Excel (.xls, .xlsx)
- Comma-Separated Values (.csv)

TODO implementations:
'''''''''''''''''''''

- Generic Text Files (.txt)
- Postgres
- MS Access

Loaders
~~~~~~~

The Loader base class is an interface for implementing data loading into
new sources. It requires connection parameters (a python dictionary of
connection params) and optional schema. The goal is for a single input
source (spreadsheet, denormalized table, etc.) to be split into many
tables.

Current Implementations:
''''''''''''''''''''''''

- Postgres [with relationships and simple deduplication!]

TODO Implementations:
'''''''''''''''''''''

- Simple key/value cache (Memcached/Redis)
- Other relational data stores

Tests
~~~~~

Tests are located in the ``test`` directory. To run the tests, run

::

PYTHONPATH=. nosetests test/

from inside the root directory. For more coverage information, run

::

PYTHONPATH=. nosetests test/ -vs --with-coverage --cover-package=wextractor --cover-erase

Detailed Sample Usage
~~~~~~~~~~~~~~~~~~~~~

Below is an example of extracting data from Excel and loading it into a
local `postgres database <http: postgresapp.com=""/>`__ with defined
relationships. NOTE: This implementation is still fragile and likely to
be dependent on the fact that to\_relations is the last table in the
list below.

::

import datetime

from wextractor.extractors import ExcelExtractor
from wextractor.loaders import PostgresLoader

one_sheet = ExcelExtractor(
'files/one sheet contract list.xlsx',
dtypes=[
unicode, unicode, unicode, int, unicode,
unicode, datetime.datetime, int, unicode, unicode,
unicode, unicode, unicode, unicode, unicode,
unicode, unicode, unicode, unicode
]
)
data = one_sheet.extract()

loader = PostgresLoader(
{'database': 'w_drive', 'user': 'bensmithgall', 'host': 'localhost'},
[{
'table_name': 'contract',
'to_relations': [],
'from_relations': ['company'],
'pkey': None,
'columns': (
('description', 'TEXT'),
('notes', 'TEXT'),
('contract_number', 'VARCHAR(255)'),
('county', 'VARCHAR(255)'),
('type_of_contract', 'VARCHAR(255)'),
('pa', 'VARCHAR(255)'),
('expiration', 'TIMESTAMP'),
('spec_number', 'VARCHAR(255)'),
('controller_number', 'INTEGER'),
('commcode', 'INTEGER')
)
},
{
'table_name': 'company_contact',
'to_relations': [],
'from_relations': ['company'],
'pkey': None,
'columns': (
('contact_name', 'VARCHAR(255)'),
('address_1', 'VARCHAR(255)'),
('address_2', 'VARCHAR(255)'),
('phone_number', 'VARCHAR(255)'),
('email', 'VARCHAR(255)'),
('fax_number', 'VARCHAR(255)'),
('fin', 'VARCHAR(255)'),
)
},
{
'table_name': 'company',
'to_relations': ['company_contact', 'contract'],
'from_relations': [],
'pkey': None,
'columns': (
('company', 'VARCHAR(255)'),
('bus_type', 'VARCHAR(255)'),
)
}]
)

loader.load(data, True)

.. |Build Status| image:: https://travis-ci.org/codeforamerica/w-drive-extractor.svg?branch=master
:target: https://travis-ci.org/codeforamerica/w-drive-extractor
Release History

Release History

0.1.dev4

This version

History Node

TODO: Figure out how to actually get changelog content.

Changelog content for this version goes here.

Donec et mollis dolor. Praesent et diam eget libero egestas mattis sit amet vitae augue. Nam tincidunt congue enim, ut porta lorem lacinia consectetur. Donec ut libero sed arcu vehicula ultricies a non tortor. Lorem ipsum dolor sit amet, consectetur adipiscing elit.

Show More

0.1.dev3

History Node

TODO: Figure out how to actually get changelog content.

Changelog content for this version goes here.

Donec et mollis dolor. Praesent et diam eget libero egestas mattis sit amet vitae augue. Nam tincidunt congue enim, ut porta lorem lacinia consectetur. Donec ut libero sed arcu vehicula ultricies a non tortor. Lorem ipsum dolor sit amet, consectetur adipiscing elit.

Show More

0.1.dev2

History Node

TODO: Figure out how to actually get changelog content.

Changelog content for this version goes here.

Donec et mollis dolor. Praesent et diam eget libero egestas mattis sit amet vitae augue. Nam tincidunt congue enim, ut porta lorem lacinia consectetur. Donec ut libero sed arcu vehicula ultricies a non tortor. Lorem ipsum dolor sit amet, consectetur adipiscing elit.

Show More

0.1.dev1

History Node

TODO: Figure out how to actually get changelog content.

Changelog content for this version goes here.

Donec et mollis dolor. Praesent et diam eget libero egestas mattis sit amet vitae augue. Nam tincidunt congue enim, ut porta lorem lacinia consectetur. Donec ut libero sed arcu vehicula ultricies a non tortor. Lorem ipsum dolor sit amet, consectetur adipiscing elit.

Show More

Download Files

Download Files

TODO: Brief introduction on what you do with files - including link to relevant help section.

File Name & Checksum SHA256 Checksum Help Version File Type Upload Date
wextractor-0.1.dev4.zip (26.1 kB) Copy SHA256 Checksum SHA256 Source May 12, 2015

Supported By

WebFaction WebFaction Technical Writing Elastic Elastic Search Pingdom Pingdom Monitoring Dyn Dyn DNS HPE HPE Development Sentry Sentry Error Logging CloudAMQP CloudAMQP RabbitMQ Heroku Heroku PaaS Kabu Creative Kabu Creative UX & Design Fastly Fastly CDN DigiCert DigiCert EV Certificate Rackspace Rackspace Cloud Servers DreamHost DreamHost Log Hosting