Project Description

This package provides a framework for collaborative, test-driven data cleaning. The framework enables a reproducible method for data cleaning that can be easily validated.

For a given tabular data set, a Trello board is populated with cards for each column so that team members can tag themselves to a column and ensure that work does not overlap. The cards include summary statistics of the columns that can be useful for writing methods to clean the column. Method stubs and test stubs are also scaffolded out for team members to fill out.

Usage:

This works on Linux with Python 2.7, 3.3, 3.4 and 3.5, and on OS X with Python 2.7 and 3.5 (probably 3.3 and 3.4 as well, but those haven’t been tested). It also works on Windows (tested with Python 3.5.2 :: Anaconda 4.1.1 (64-bit)), although Trello integration through tddc has not yet been tested on Windows.

Install the package with:

$ pip install tddc

You can download a tiny example CSV file at: https://github.com/DataKind-SG/test-driven-data-cleaning/raw/master/input/foobar_data.csv

In the same directory as the file, run:

$ tddc summarize foobar_data.csv

This takes the CSV data set and summarizes it, writing the summary to a JSON file in a newly created output/ directory.
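To give a feel for what a per-column summary can contain, here is a minimal Python 3 sketch using only the standard library. It is not tddc's implementation and does not reproduce its JSON layout; the function name `summarize_columns`, the statistics chosen, and the sample data are all assumptions made for illustration.

```python
import csv
import io
import json
from collections import Counter


def summarize_columns(csv_text, top_n=3):
    """Return per-column summary statistics for CSV text with a header row.

    Hypothetical helper: the statistics here are illustrative and are not
    taken from tddc's actual output.
    """
    reader = csv.DictReader(io.StringIO(csv_text))
    columns = {name: [] for name in reader.fieldnames}
    for row in reader:
        for name in reader.fieldnames:
            columns[name].append(row[name])

    summary = {}
    for name, values in columns.items():
        # Treat empty strings as missing values.
        non_empty = [v for v in values if v != ""]
        summary[name] = {
            "count": len(values),
            "missing": len(values) - len(non_empty),
            "unique": len(set(non_empty)),
            "most_common": Counter(non_empty).most_common(top_n),
        }
    return summary


# Made-up sample data; these are not the contents of foobar_data.csv.
SAMPLE = "foo,bar\n1,a\n2,\n2,b\n"

if __name__ == "__main__":
    print(json.dumps(summarize_columns(SAMPLE), indent=2, sort_keys=True))
```

Statistics like these (missing counts, distinct values, most frequent values) are the kind of information that helps when deciding how a column should be cleaned.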

Next, you can run:

$ tddc build_trello foobar_data.csv

The first time you run this, it will fail and print instructions for creating a Trello configuration file in your root directory (in the future, this file should probably be created through the CLI). Once you have created it, run the step again to create a Trello board. The board my run created is here: https://trello.com/b/cqP9VZal/data-cleaning-board-for-foobar-data

Finally, you can run:

$ tddc build foobar_data.csv

This outputs a script into the output/ folder that contains method stubs and glue code to clean the data set. It also outputs stubs for tests in output/.
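As a rough illustration of the test-driven workflow, the sketch below shows what one filled-in cleaning method, a piece of glue code, and its tests might look like. The names `clean_bar` and `clean_row`, the column name `bar`, and the cleaning rules are hypothetical; the stubs tddc actually generates may be named and structured differently.

```python
def clean_bar(value):
    """Hypothetical cleaning method for a column named 'bar'.

    A generated stub would start out empty; this is one possible way a
    team member might fill it in.
    """
    if value is None:
        return None
    cleaned = value.strip().lower()
    # Map blank strings to None so missing values are represented uniformly.
    return cleaned if cleaned else None


def clean_row(row):
    """Hypothetical glue code: apply each column's cleaner to one row (a dict).

    Columns without a registered cleaner are passed through unchanged.
    """
    cleaners = {"bar": clean_bar}
    return {
        name: cleaners[name](value) if name in cleaners else value
        for name, value in row.items()
    }


# Tests written as plain functions so that py.test can discover them.
def test_clean_bar_normalizes_case_and_whitespace():
    assert clean_bar("  Yes ") == "yes"


def test_clean_bar_maps_blank_to_none():
    assert clean_bar("   ") is None
    assert clean_bar(None) is None


def test_clean_row_leaves_other_columns_untouched():
    assert clean_row({"foo": "1", "bar": " B "}) == {"foo": "1", "bar": "b"}
```

Writing the tests first, from values seen in the column summary, makes each cleaning rule explicit and easy for a reviewer to validate.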

Contributing:

Before running the tests, you’ll need to run:

$ pip install pytest pytest-cov mock

Then, in the root of the project directory you can run the tests with:

$ py.test

We’re trying out the new GitHub projects feature. The project we’re currently working on is https://github.com/DataKind-SG/test-driven-data-cleaning/projects/1

Each card is an issue that you can click through to. If you’d like to take a card (thank you!), move the card to the “In progress” column and assign yourself to the issue. Once you’re finished, issue a pull request and move the card to “For review”.

If you think of a new issue, create the card in the appropriate project and convert the card to an issue in the pull-down menu (it’s currently not possible to link to an already created issue from a card).

Release History

0.1.1 (this version)

0.1.0

Download Files

tddc-0.1.1.tar.gz (8.3 kB), source distribution, uploaded Sep 17, 2016
