
Yellowbrick Datasets

Yellowbrick datasets management and deployment scripts.

Yellowbrick datasets are hosted in an S3 drive in the cloud to allow easy access to the data for examples. This repository manages those datasets, their data structure, and interactions with the cloud.

Getting Started

The ybdata script is installed as an entry point in setup.py. You can install the package and the script using pip install yellowbrick-data. If you've downloaded the source code from GitHub, you can install the package in editable mode with pip. From the root directory of the project, use:

$ pip install -e .

At this point you should have a ybdata command on your $PATH. Like git, this utility has many subcommands for various data-related management tasks. To see a list of the commands and their descriptions:

$ ybdata --help

Datasets Basics

All datasets must have the following properties:

  • a unique name that identifies the dataset to a user (e.g. "bikeshare")
  • a README.md that describes the provenance and contents of the data
  • one or more data files that can be read by the yellowbrick library
  • an optional citation.bib file to cite the source of the data

Datasets are stored in the fixtures/ directory in a subdirectory with the same name as the dataset. This subdirectory contains both the data and metadata files that make up the data package structure. The uploads/ directory contains the most recent version of the compressed datasets found in the fixtures/ directory and is the content that is uploaded to S3 for use in Yellowbrick.
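
For example, a hypothetical dataset named "mydata" would be laid out as follows (the data file names shown are those of a standard dataset, described below):

fixtures/
└── mydata/
    ├── README.md          # provenance and contents of the data
    ├── citation.bib       # optional citation for the data source
    ├── meta.json          # feature and target metadata
    ├── mydata.csv.gz      # data file(s) readable by the yellowbrick library
    └── mydata.npz
uploads/
├── manifest.json          # package and signature information
└── mydata.zip             # compressed package uploaded to S3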

Currently there are two kinds of datasets:

  1. Standard: A single data table containing both features and targets.
  2. Corpus: A text corpus for natural language processing.

Both kinds of datasets have their own specific package structures as defined in the following sections. Note that the ybdata validate command can be used to check if a dataset is ready to be uploaded.

Standard Datasets

A standard dataset is composed of a single data table that can be loaded into a data frame or numpy array for machine learning with scikit-learn. In addition to the files mentioned in dataset basics, the data and metadata files that make up the standard dataset package are as follows (where "name" is the unique dataset name):

  • fixtures/name/name.csv.gz: The gzip compressed CSV file with header row to be loaded with pd.read_csv (no index column).
  • fixtures/name/name.npz: The compressed numpy matrix representation of X and y to be loaded with np.load.
  • fixtures/name/meta.json: A metadata file that identifies the features and the target column names of the data in the CSV file.

Consider the following example CSV file:

datetime,temperature,relative humidity,light,CO2,humidity,occupancy
2015-02-04 17:51:00,23.18,27.272,426,721.25,0.00479298817650529,1
2015-02-04 17:51:59,23.15,27.2675,429.5,714,0.00478344094931065,1
2015-02-04 17:53:00,23.15,27.245,426,713.5,0.00477946352442199,1
2015-02-04 17:54:00,23.15,27.2,426,708.25,0.00477150882608175,1
2015-02-04 17:55:00,23.1,27.2,426,704.5,0.00475699293331518,1
2015-02-04 17:55:59,23.1,27.2,419,701,0.00475699293331518,1
2015-02-04 17:57:00,23.1,27.2,419,701.666666666667,0.00475699293331518,1
2015-02-04 17:57:59,23.1,27.2,419,699,0.00475699293331518,1
2015-02-04 17:58:59,23.1,27.2,419,689.333333333333,0.00475699293331518,1

An example meta.json for this file would be as follows:

{
  "features": [
    "temperature",
    "relative humidity",
    "light",
    "CO2",
    "humidity"
  ],
  "target": "occupancy",
  "labels": {
    "occupied": 1,
    "not occupied": 0
  }
}

This will ensure that the dataset X is a pd.DataFrame with columns corresponding to the features list and that y is a pd.Series from the column described in the target key. The labels key is used to transform numerically encoded categorical variables for a classification target.
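
As an illustration, these files can be read back with pandas and numpy along the lines of the following sketch (this is not the loader used inside Yellowbrick, and the paths and the "X"/"y" array names in the .npz file are assumptions based on the example above):

import json

import numpy as np
import pandas as pd

# Read the gzip-compressed CSV; compression is inferred from the file extension
df = pd.read_csv("fixtures/occupancy/occupancy.csv.gz")

# Use meta.json to split the table into features and target
with open("fixtures/occupancy/meta.json") as f:
    meta = json.load(f)

X = df[meta["features"]]  # pd.DataFrame of the feature columns
y = df[meta["target"]]    # pd.Series of the target column

# The .npz file stores the same data as numpy arrays (key names assumed)
with np.load("fixtures/occupancy/occupancy.npz", allow_pickle=False) as data:
    X_array, y_array = data["X"], data["y"]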

Corpus Datasets

A corpus dataset contains plain text files stored in subdirectories of the dataset directory that correspond to the class or category the plain text files belong to. Note that these text files should be only one level deep and that each document should be stored in its own file.

At the moment, individual corpus files should be uncompressed (the directory as a whole is compressed). The text corpus is read in a manner similar to the following:

import os
import glob

# data_dir is the path to the uncompressed corpus directory, which contains
# one subdirectory per category, each holding the .txt documents
paths = os.path.join(data_dir, "*", "*.txt")
documents = glob.glob(paths)

# the label for each document is the name of its parent (category) directory
labels = [os.path.basename(os.path.dirname(path)) for path in documents]

Documents and labels can then be directly passed to scikit-learn text feature extraction transformers for analysis.
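
For instance, because the glob above yields file paths rather than file contents, a scikit-learn vectorizer can be configured to read each document from disk directly (a minimal sketch):

from sklearn.feature_extraction.text import TfidfVectorizer

# input="filename" instructs the vectorizer to open and read each path in documents
vectorizer = TfidfVectorizer(input="filename", encoding="utf-8")
X = vectorizer.fit_transform(documents)  # document-term matrix
y = labels                               # class names from the subdirectory names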

Creating and Uploading Datasets

This section outlines how to create and upload a dataset for use in Yellowbrick examples and testing. More detailed steps follow, but in brief, these are the actions required to package a dataset:

  1. Create a dataset in fixtures/
  2. Convert dataset to all appropriate types using ybdata convert
  3. Validate the dataset is ready using ybdata validate
  4. Package the dataset using ybdata package
  5. Upload the dataset using ybdata upload
  6. Update yellowbrick.datasets with uploads/manifest.json

Most of the datasets in this repository are from the UCI Machine Learning Repository. A basic methodology for creating a dataset is to use the unique name of the UCIML Repository entry as the unique name of the dataset stored in fixtures/. Wrangle the data so that it exists as a pandas-readable CSV file with a header row (usually by joining the target with the features or extracting data from a TSV, etc.). Make sure that the CSV is gzip-compressed when done!
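
As an example of the wrangling step, a hypothetical UCIML download shipped as tab-separated feature and target files could be turned into a gzip-compressed CSV roughly as follows (the raw file names are assumptions):

import pandas as pd

# Hypothetical raw files as downloaded from the UCIML repository
features = pd.read_csv("raw/mydata.data", sep="\t")
target = pd.read_csv("raw/mydata.labels", sep="\t")

# Join the target onto the features so a single table holds both
df = features.join(target)

# Write a gzip-compressed CSV with a header row and no index column
df.to_csv("fixtures/mydata/mydata.csv.gz", index=False, compression="gzip")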

Once the pandas CSV file is created, create the README.md, meta.json, and citation.bib files manually. It is usually also fairly simple to copy and paste the README.md from the UCIML page description (wrangling it where necessary to include as many details as possible). The citation.bib file can be found by searching with Google Scholar and selecting "cite as bibtex". The meta.json usually has to be manually written. Once done, you can convert the CSV into the .npz objects using ybdata convert as follows:

$ ybdata convert fixtures/mydata/mydata.csv.gz fixtures/mydata/mydata.npz

Note that you can convert from .npz to .csv.gz as well, but when wrangling it is usually easier to go from CSV to NPZ.

Once done, validate that the dataset is ready to be packaged using:

$ ybdata validate fixtures/mydata

This should print out a table of both required and optional items for validation, with the validation status listed at the bottom. Once validated, package the dataset:

$ ybdata package fixtures/mydata

By default this will create a package in uploads/mydata.zip and update uploads/manifest.json with the package and signature information. Note that if you're updating a previously created dataset, you can use the -f flag to overwrite the old data and create a new package.

Finally, upload the datasets to our S3 storage in the cloud. You will need valid AWS access keys to do this (see the environment or aws-configure options). If you would like to upload the datasets elsewhere, use the --bucket flag.

$ ybdata upload --pending v1.0

The upload process also updates uploads/manifest.json with the final download URL in a format that can be added to the Yellowbrick library. Make sure yellowbrick.datasets is updated in the correct Yellowbrick version, otherwise downloads from the library will fail!
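
Once the manifest in the Yellowbrick library has been updated, the dataset is available to users through the yellowbrick.datasets loaders; for example, the occupancy data shown above is loaded with an existing loader as:

from yellowbrick.datasets import load_occupancy

# Downloads the packaged dataset from S3 on first use, then loads it from the local data directory
X, y = load_occupancy()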

Release history

This version: 1.0

Download files

Download the file for your platform.

Source Distribution

yellowbrick-datasets-1.0.tar.gz (13.1 MB, Source)

Built Distribution

yellowbrick_datasets-1.0-py3-none-any.whl (21.1 kB, Python 3)

File details

Details for the file yellowbrick-datasets-1.0.tar.gz.

File metadata

  • Download URL: yellowbrick-datasets-1.0.tar.gz
  • Size: 13.1 MB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/1.12.1 pkginfo/1.4.2 requests/2.21.0 setuptools/40.6.3 requests-toolbelt/0.8.0 tqdm/4.28.1 CPython/3.6.2

File hashes

Hashes for yellowbrick-datasets-1.0.tar.gz

  • SHA256: e809a222811e1de8345a8017f3191db0d167b916f7a180d96bbf0a8462dcd7a9
  • MD5: 76ec933f7b3e68480760d07b266e71b7
  • BLAKE2b-256: 505b551fb4b1321fe0ed51560ebeb8d9fce4a5505d41eec15398bc3a160181c5

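A downloaded archive can be verified against the published digests with Python's hashlib, for instance (the local file path assumes the archive was saved to the working directory):

import hashlib

# Compute the local file's SHA256 digest and compare it to the published value above
with open("yellowbrick-datasets-1.0.tar.gz", "rb") as f:
    digest = hashlib.sha256(f.read()).hexdigest()

assert digest == "e809a222811e1de8345a8017f3191db0d167b916f7a180d96bbf0a8462dcd7a9"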

File details

Details for the file yellowbrick_datasets-1.0-py3-none-any.whl.

File metadata

  • Download URL: yellowbrick_datasets-1.0-py3-none-any.whl
  • Size: 21.1 kB
  • Tags: Python 3
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/1.12.1 pkginfo/1.4.2 requests/2.21.0 setuptools/40.6.3 requests-toolbelt/0.8.0 tqdm/4.28.1 CPython/3.6.2

File hashes

Hashes for yellowbrick_datasets-1.0-py3-none-any.whl

  • SHA256: ede8ad77026465023bcb6a38dfaacecd86c6bcc0847d72d336f4a0e68afa1c57
  • MD5: f8458da72e42a640ded274af90912091
  • BLAKE2b-256: 0b9c4014f6c0d88ee9be55a03c9ccaa084b0405c97c720e01d53443a08bcc3cf

