Trane is a software package for automatically generating prediction problems and generating labels for supervised learning.

Trane

Trane is a software package that automatically generates prediction problems and labels for supervised learning. It is designed to advance the automation of the machine learning problem-solving pipeline.

Prediction Problems

In data science, people usually have a few records of an entity and want to predict what will happen to that entity in the future. Trane is designed to generate time-related prediction problems. Trane transforms data meta information into lists of relevant prediction problems and cutoff times. Prediction problems are structured in a formal language described under Built-in Operations below. A cutoff time is the last time in the data used for training the classifier. Data after the cutoff time is used for evaluating the classifier's accuracy. Cutoff times are necessary to prevent the classifier from training on test data.

Example

A bank wants to predict how many transactions over $100 a customer will make in the next year. Assume we have all the transaction records for each client from 2015 to 2017. We want to build a machine learning model to solve the prediction problem. Here is the example database.

User_id  Time  Transaction_id  Amount
u1       2015  1-2015-1        10
u1       2015  1-2015-2        200
u2       2015  2-2015-1        50
u1       2016  1-2016-1        10
u1       2017  1-2017-1        1000
u1       2017  1-2017-2        20
u2       2017  2-2017-1        10

First, we separate the data by entity. Here the entity is user_id. User u1, for example, has

User_id  Time  Transaction_id  Amount
u1       2015  1-2015-1        10
u1       2015  1-2015-2        200
u1       2016  1-2016-1        10
u1       2017  1-2017-1        1000
u1       2017  1-2017-2        20

Let's consider a cutoff time of 2016. The data from 2015-2016 will be used as training data for the machine learning model. Data after 2016, that is, data from 2016-2017, will be used to evaluate the trained model. Trane outputs a tuple of (entity, cutoff, label) for each prediction problem. The label is generated by applying the prediction problem to the entity's data. The output from Trane can be fed directly into Featuretools to perform feature engineering.
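
The example above can be sketched in plain Python to show how (entity, cutoff, label) tuples might be produced. This is an illustrative sketch only and does not use Trane's API: the names RECORDS, split_by_entity and make_labels are hypothetical, and it assumes that a record falls in the evaluation window when its time is strictly greater than the cutoff.

```python
from collections import defaultdict

# Example records from the table above: (user_id, time, transaction_id, amount).
RECORDS = [
    ("u1", 2015, "1-2015-1", 10),
    ("u1", 2015, "1-2015-2", 200),
    ("u2", 2015, "2-2015-1", 50),
    ("u1", 2016, "1-2016-1", 10),
    ("u1", 2017, "1-2017-1", 1000),
    ("u1", 2017, "1-2017-2", 20),
    ("u2", 2017, "2-2017-1", 10),
]


def split_by_entity(records):
    """Group records by user_id (the entity)."""
    by_entity = defaultdict(list)
    for record in records:
        by_entity[record[0]].append(record)
    return dict(by_entity)


def make_labels(records, cutoff, threshold=100):
    """Return one (entity, cutoff, label) tuple per entity.

    Assumption (not taken from Trane): a record is in the evaluation
    window when its time is strictly greater than the cutoff. The label
    counts the transactions in that window whose amount exceeds threshold.
    """
    labels = []
    for entity, rows in sorted(split_by_entity(records).items()):
        future_rows = [row for row in rows if row[1] > cutoff]
        label = sum(1 for row in future_rows if row[3] > threshold)
        labels.append((entity, cutoff, label))
    return labels


print(make_labels(RECORDS, cutoff=2016))
# [('u1', 2016, 1), ('u2', 2016, 0)]
```

Under these assumptions, u1 has one transaction over $100 after the cutoff (the 1000 in 2017) and u2 has none.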

Prediction Problem Generation

As shown in the example, a prediction problem consists of a sequence of operations applied to the data, together with a cutoff time.

In Trane, we generate prediction problems with four types of operations: Filter Operations, Row Operations, Transformation Operations and Aggregation Operations. Filter Operations are applied on the filter_column. Row, Transformation and Aggregation Operations are applied on the label_generating_column.
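
To make the four-step structure concrete, here is a minimal sketch of one such pipeline in plain Python. The functions below are hypothetical stand-ins, not Trane's operation classes, and the behavior given to each step (a greater-than filter, an identity row step, a consecutive-difference transformation, a sum aggregation) is an assumption made for illustration.

```python
def greater_filter(rows, column, threshold):
    """Filter step: keep rows whose filter column exceeds the threshold."""
    return [row for row in rows if row[column] > threshold]


def identity_row(values):
    """Row step: leave every value unchanged."""
    return list(values)


def diff_transformation(values):
    """Transformation step: differences between consecutive values."""
    return [b - a for a, b in zip(values, values[1:])]


def sum_aggregation(values):
    """Aggregation step: collapse the values into a single label."""
    return sum(values)


def generate_label(rows, filter_column, label_generating_column, threshold):
    """Apply filter, row, transformation and aggregation steps in order."""
    kept = greater_filter(rows, filter_column, threshold)
    values = [row[label_generating_column] for row in kept]
    values = identity_row(values)
    values = diff_transformation(values)
    return sum_aggregation(values)


# Records for user u1 from the example above.
rows = [
    {"Time": 2015, "Amount": 10},
    {"Time": 2015, "Amount": 200},
    {"Time": 2016, "Amount": 10},
    {"Time": 2017, "Amount": 1000},
    {"Time": 2017, "Amount": 20},
]
label = generate_label(rows, "Time", "Amount", threshold=2015)
print(label)  # 10
```

Here the filter runs on Time (the filter_column) while the remaining steps run on Amount (the label_generating_column), mirroring the split described above.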

Workflow

The workflow of using Trane on a database is as follows:

  • The data scientist writes a meta.json file describing the columns and data types in the new database.
  • PredictionProblemGenerator reads the metadata and generates possible prediction problems. The prediction problems are saved to problems.json.
  • The data scientist can adjust the parameters of the prediction problems in problems.json.
  • The labeler applies the prediction problems in problems.json to the database data.csv.

Built-in Operations

  • FilterOp
    • IdentityFilterOp
    • GreaterFilterOp
  • RowOp
    • IdentityRowOp
    • GreaterRowOp
  • TransformationOp
    • IdentityTransformationOp
    • DiffTransformationOp
  • AggregationOp
    • FirstAggregationOp
    • CountAggregationOp
    • SumAggregationOp
    • LastAggregationOp
    • LMFAggregationOp

Unit Testing

We use pytest to automatically collect unit tests and pytest-cov to measure test coverage. The application code is in Trane/trane/. The unit test code is in Trane/tests/. To run all unit tests, change directory to Trane and execute

> pytest --cov=trane tests

Setup/Install

Clone from Git

> git clone https://github.com/HDI-Project/Trane.git

Run pip install

> pip3 install Trane/

Quick Usage

We have a tutorial notebook here.

TODO

  • Need an easier way to add custom operations. Currently, external plugin operations are not allowed. The bottleneck is that we need to maintain a list of operations so that we can save, load, and iterate over operations. It's not easy to add an external operation to the operation list.
  • Currently, all operations are in-place operations. The aggregation ops simply take a record, change the value in the column and return it. This may not be a good design.
  • API for setting thresholds.
  • Resolve the remaining NotImplementedError cases.
  • The natural language (NL) system should be independent of Trane. It seems better to generate NL from JSON.
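
As a possible direction for the first item above, one common pattern is a name-based registry that lets operations be saved, loaded, and iterated over. The sketch below is hypothetical and is not how Trane currently works: OPERATION_REGISTRY, register_operation, save_operation, load_operation and the two example classes are all invented for illustration.

```python
import json

# Hypothetical registry: maps an operation name to its class.
OPERATION_REGISTRY = {}


def register_operation(cls):
    """Class decorator that adds an operation class to the registry."""
    OPERATION_REGISTRY[cls.__name__] = cls
    return cls


@register_operation
class CountAggregation:
    def __call__(self, values):
        return len(values)


@register_operation
class SumAggregation:
    def __call__(self, values):
        return sum(values)


def save_operation(op):
    """Serialize an operation by its registered class name."""
    return json.dumps({"op": type(op).__name__})


def load_operation(text):
    """Rebuild an operation by looking its name up in the registry."""
    return OPERATION_REGISTRY[json.loads(text)["op"]]()


restored = load_operation(save_operation(SumAggregation()))
print(restored([10, 200, 10]))  # 220
```

With this kind of design, an external operation would only need to apply the decorator to become saveable and loadable, and iterating over OPERATION_REGISTRY would list every known operation.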

History

0.1.0 (2018-04-12)

  • First release on PyPI.
