Implements temporal validation for machine learning.

Project description

Copyright ©2017. The University of Chicago (“Chicago”). All Rights Reserved.

Permission to use, copy, modify, and distribute this software, including all object code and source code, and any accompanying documentation (together the “Program”) for educational and not-for-profit research purposes, without fee and without a signed licensing agreement, is hereby granted, provided that the above copyright notice, this paragraph and the following three paragraphs appear in all copies, modifications, and distributions. For the avoidance of doubt, educational and not-for-profit research purposes excludes any service or part of selling a service that uses the Program. To obtain a commercial license for the Program, contact the Technology Commercialization and Licensing, Polsky Center for Entrepreneurship and Innovation, University of Chicago, 1452 East 53rd Street, 2nd floor, Chicago, IL 60615.

Created by Data Science and Public Policy, University of Chicago

The Program is copyrighted by Chicago. The Program is supplied "as is", without any accompanying services from Chicago. Chicago does not warrant that the operation of the Program will be uninterrupted or error-free. The end-user understands that the Program was developed for research purposes and is advised not to rely exclusively on the Program for any reason.

IN NO EVENT SHALL CHICAGO BE LIABLE TO ANY PARTY FOR DIRECT, INDIRECT, SPECIAL, INCIDENTAL, OR CONSEQUENTIAL DAMAGES, INCLUDING LOST PROFITS, ARISING OUT OF THE USE OF THE PROGRAM, EVEN IF CHICAGO HAS BEEN ADVISED OF THE POSSIBILITY OF SUCH DAMAGE. CHICAGO SPECIFICALLY DISCLAIMS ANY WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE. THE PROGRAM PROVIDED HEREUNDER IS PROVIDED "AS IS". CHICAGO HAS NO OBLIGATION TO PROVIDE MAINTENANCE, SUPPORT, UPDATES, ENHANCEMENTS, OR MODIFICATIONS.

# timechop
Generate temporal validation time windows for matrix creation


[![Build Status](https://travis-ci.org/dssg/timechop.svg?branch=master)](https://travis-ci.org/dssg/timechop)

[![codecov](https://codecov.io/gh/dssg/timechop/branch/master/graph/badge.svg)](https://codecov.io/gh/dssg/timechop)

[![codeclimate](https://codeclimate.com/github/dssg/timechop.png)](https://codeclimate.com/github/dssg/timechop)

In predictive analytics, temporal validation can be complicated. There are a variety of questions to balance: How frequently to retrain models? Should the time between rows for the same entity in the train and test matrices be different? Keeping track of how to create matrix time windows that successfully answer all of these questions is difficult.

That's why we created timechop. Timechop takes in high-level time configuration (e.g. lists of train label spans, test data frequencies) and returns all matrix time definitions.


Timechop currently works with the following:

- feature_start_time - data aggregated into features begins at this point
- feature_end_time - data aggregated into features is from *before* this point
- label_start_time - data aggregated into labels begins at this point
- label_end_time - data aggregated into labels is from *before* this point
- model_update_frequency - amount of time between train/test splits
- training_as_of_date_frequencies - how much time between rows for a single entity in a training matrix
- max_training_histories - the maximum amount of history for each entity to train on (early matrices may contain less than this time if it goes past label/feature start times)
- training_label_timespans - how much time is covered by training labels (e.g., outcomes in the next 1 year? 3 days? 2 months?)
- test_as_of_date_frequencies - how much time between rows for a single entity in a test matrix
- test_durations - how far into the future should a model be used to make predictions (in the typical case of wanting a single prediction set immediately after model training, this should be set to 0 days)
- test_label_timespans - how much time is covered by test predictions (e.g., outcomes in the next 1 year? 3 days? 2 months?)
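
The interval parameters above are given as strings such as `'6 months'` or `'1 days'`. As a rough illustration of what such a string means when applied to a date, here is a minimal standard-library sketch. The helper names `parse_interval` and `add_interval` are hypothetical and are not part of timechop; the library's own parsing may differ (for instance in which units it accepts).

```python
import calendar
import datetime

# Hypothetical helpers for illustration only; this is not timechop's parser.
_UNIT_ALIASES = {
    "day": "days", "days": "days",
    "week": "weeks", "weeks": "weeks",
    "month": "months", "months": "months",
    "year": "years", "years": "years",
}


def parse_interval(text):
    """Split a string such as '6 months' into (6, 'months')."""
    count, unit = text.split()
    return int(count), _UNIT_ALIASES[unit.lower()]


def add_interval(moment, text):
    """Return `moment` shifted forward by the interval described in `text`."""
    count, unit = parse_interval(text)
    if unit == "days":
        return moment + datetime.timedelta(days=count)
    if unit == "weeks":
        return moment + datetime.timedelta(weeks=count)
    # Months and years need calendar arithmetic: timedelta has no month unit.
    months = count * 12 if unit == "years" else count
    month_index = moment.month - 1 + months
    year = moment.year + month_index // 12
    month = month_index % 12 + 1
    # Clamp the day so that e.g. Jan 31 + 1 month lands on the last day of Feb.
    day = min(moment.day, calendar.monthrange(year, month)[1])
    return moment.replace(year=year, month=month, day=day)


print(add_interval(datetime.datetime(2014, 1, 1), "6 months"))  # 2014-07-01 00:00:00
print(add_interval(datetime.datetime(2010, 1, 6), "3 days"))    # 2010-01-09 00:00:00
```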

Here's an example of a typical set-up with a single prediction immediately after training and models built at an annual frequency:
```
import datetime

from timechop.timechop import Timechop

# Annual model updates, with a single test as-of time right after training
chopper = Timechop(
    feature_start_time=datetime.datetime(1990, 1, 1, 0, 0),
    feature_end_time=datetime.datetime(2017, 1, 1, 0, 0),
    label_start_time=datetime.datetime(2014, 1, 1, 0, 0),
    label_end_time=datetime.datetime(2017, 1, 1, 0, 0),
    model_update_frequency='1 year',
    training_as_of_date_frequencies=['6 months'],
    max_training_histories=['2 years'],
    training_label_timespans=['6 months'],
    test_as_of_date_frequencies=['1 days'],
    test_durations=['0 days'],
    test_label_timespans=['6 months']
)
# chop_time() returns a list with one dict per train/test split
result = chopper.chop_time()
print(result)
```
```
[
{
'feature_end_time': datetime.datetime(2017, 1, 1, 0, 0),
'feature_start_time': datetime.datetime(1990, 1, 1, 0, 0),
'label_end_time': datetime.datetime(2017, 1, 1, 0, 0),
'label_start_time': datetime.datetime(2014, 1, 1, 0, 0),
'test_matrices': [{
'as_of_times': [
datetime.datetime(2014, 7, 1, 0, 0)
],
'last_as_of_time': datetime.datetime(2014, 7, 1, 0, 0),
'first_as_of_time': datetime.datetime(2014, 7, 1, 0, 0),
'matrix_info_end_time': datetime.datetime(2015, 1, 1, 0, 0),
'test_as_of_date_frequency': '1 days',
'test_label_timespan': '6 months',
'test_duration': '0 days'
}],
'train_matrix': {
'as_of_times': [
datetime.datetime(2014, 1, 1, 0, 0)
],
'last_as_of_time': datetime.datetime(2014, 1, 1, 0, 0),
'first_as_of_time': datetime.datetime(2014, 1, 1, 0, 0),
'matrix_info_end_time': datetime.datetime(2014, 7, 1, 0, 0),
'max_training_history': '2 years',
'training_as_of_date_frequency': '6 months',
'training_label_timespan': '6 months'
}
},
{
'feature_end_time': datetime.datetime(2017, 1, 1, 0, 0),
'feature_start_time': datetime.datetime(1990, 1, 1, 0, 0),
'label_end_time': datetime.datetime(2017, 1, 1, 0, 0),
'label_start_time': datetime.datetime(2014, 1, 1, 0, 0),
'test_matrices': [{
'as_of_times': [
datetime.datetime(2015, 7, 1, 0, 0)
],
'last_as_of_time': datetime.datetime(2015, 7, 1, 0, 0),
'first_as_of_time': datetime.datetime(2015, 7, 1, 0, 0),
'matrix_info_end_time': datetime.datetime(2016, 1, 1, 0, 0),
'test_as_of_date_frequency': '1 days',
'test_label_timespan': '6 months',
'test_duration': '0 days'
}],
'train_matrix': {
'as_of_times': [
datetime.datetime(2014, 1, 1, 0, 0),
datetime.datetime(2014, 7, 1, 0, 0),
datetime.datetime(2015, 1, 1, 0, 0)
],
'last_as_of_time': datetime.datetime(2015, 1, 1, 0, 0),
'first_as_of_time': datetime.datetime(2014, 1, 1, 0, 0),
'matrix_info_end_time': datetime.datetime(2015, 7, 1, 0, 0),
'max_training_history': '2 years',
'training_as_of_date_frequency': '6 months',
'training_label_timespan': '6 months'
}
},
{
'feature_end_time': datetime.datetime(2017, 1, 1, 0, 0),
'feature_start_time': datetime.datetime(1990, 1, 1, 0, 0),
'label_end_time': datetime.datetime(2017, 1, 1, 0, 0),
'label_start_time': datetime.datetime(2014, 1, 1, 0, 0),
'test_matrices': [{
'as_of_times': [
datetime.datetime(2016, 7, 1, 0, 0)
],
'last_as_of_time': datetime.datetime(2016, 7, 1, 0, 0),
'first_as_of_time': datetime.datetime(2016, 7, 1, 0, 0),
'matrix_info_end_time': datetime.datetime(2017, 1, 1, 0, 0),
'test_as_of_date_frequency': '1 days',
'test_label_timespan': '6 months',
'test_duration': '0 days'
}],
'train_matrix': {
'as_of_times': [
datetime.datetime(2014, 1, 1, 0, 0),
datetime.datetime(2014, 7, 1, 0, 0),
datetime.datetime(2015, 1, 1, 0, 0),
datetime.datetime(2015, 7, 1, 0, 0),
datetime.datetime(2016, 1, 1, 0, 0)
],
'last_as_of_time': datetime.datetime(2016, 1, 1, 0, 0),
'first_as_of_time': datetime.datetime(2014, 1, 1, 0, 0),
'matrix_info_end_time': datetime.datetime(2016, 7, 1, 0, 0),
'max_training_history': '2 years',
'training_as_of_date_frequency': '6 months',
'training_label_timespan': '6 months'
}
}
]
```
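
In the output above, each training matrix's `matrix_info_end_time` coincides with the `first_as_of_time` of the matching test matrix. A sanity check along these lines can be sketched as follows. It assumes that `matrix_info_end_time` marks the end of the label window, which is consistent with the printed values but is an interpretation; `find_leaky_splits` is a hypothetical helper, not part of timechop.

```python
import datetime


def find_leaky_splits(splits):
    """Return indices of splits whose training window ends after a test as-of time."""
    leaky = []
    for index, split in enumerate(splits):
        train_end = split["train_matrix"]["matrix_info_end_time"]
        for test_matrix in split["test_matrices"]:
            if train_end > test_matrix["first_as_of_time"]:
                leaky.append(index)
                break
    return leaky


# Abbreviated copy of the first two splits printed above (only the keys used here).
splits = [
    {
        "train_matrix": {"matrix_info_end_time": datetime.datetime(2014, 7, 1)},
        "test_matrices": [{"first_as_of_time": datetime.datetime(2014, 7, 1)}],
    },
    {
        "train_matrix": {"matrix_info_end_time": datetime.datetime(2015, 7, 1)},
        "test_matrices": [{"first_as_of_time": datetime.datetime(2015, 7, 1)}],
    },
]

print(find_leaky_splits(splits))  # []
```

With a real `Timechop` instance, the same function could be applied directly to the list returned by `chop_time()`, since it only reads keys that appear in the output shown above.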


A second example uses multiple testing dates and shows how the train matrices behave at the edge cases, illustrating the effects of some of the other parameters:

```
import datetime

from timechop.timechop import Timechop

# Models updated every 5 days; each test matrix spans 5 days of as-of times
chopper = Timechop(
    feature_start_time=datetime.datetime(1990, 1, 1, 0, 0),
    feature_end_time=datetime.datetime(2010, 1, 16, 0, 0),
    label_start_time=datetime.datetime(2010, 1, 1, 0, 0),
    label_end_time=datetime.datetime(2010, 1, 16, 0, 0),
    model_update_frequency='5 days',
    training_as_of_date_frequencies=['1 days'],
    max_training_histories=['5 days'],
    training_label_timespans=['1 day'],
    test_as_of_date_frequencies=['3 days'],
    test_durations=['5 days'],
    test_label_timespans=['3 days']
)
# chop_time() returns a list with one dict per train/test split
result = chopper.chop_time()
print(result)
```

```
[
{
'feature_end_time': datetime.datetime(2010, 1, 16, 0, 0),
'feature_start_time': datetime.datetime(1990, 1, 1, 0, 0),
'label_end_time': datetime.datetime(2010, 1, 16, 0, 0),
'label_start_time': datetime.datetime(2010, 1, 1, 0, 0),
'test_matrices': [{
'as_of_times': [
datetime.datetime(2010, 1, 3, 0, 0),
datetime.datetime(2010, 1, 6, 0, 0)
],
'last_as_of_time': datetime.datetime(2010, 1, 6, 0, 0),
'first_as_of_time': datetime.datetime(2010, 1, 3, 0, 0),
'matrix_info_end_time': datetime.datetime(2010, 1, 9, 0, 0),
'test_as_of_date_frequency': '3 days',
'test_label_timespan': '3 days',
'test_duration': '5 days'
}],
'train_matrix': {
'as_of_times': [
datetime.datetime(2010, 1, 1, 0, 0),
datetime.datetime(2010, 1, 2, 0, 0)
],
'last_as_of_time': datetime.datetime(2010, 1, 2, 0, 0),
'first_as_of_time': datetime.datetime(2010, 1, 1, 0, 0),
'matrix_info_end_time': datetime.datetime(2010, 1, 3, 0, 0),
'max_training_history': '5 days',
'training_as_of_date_frequency': '1 days',
'training_label_timespan': '1 day'
}
},
{
'feature_end_time': datetime.datetime(2010, 1, 16, 0, 0),
'feature_start_time': datetime.datetime(1990, 1, 1, 0, 0),
'label_end_time': datetime.datetime(2010, 1, 16, 0, 0),
'label_start_time': datetime.datetime(2010, 1, 1, 0, 0),
'test_matrices': [{
'as_of_times': [
datetime.datetime(2010, 1, 8, 0, 0),
datetime.datetime(2010, 1, 11, 0, 0)
],
'last_as_of_time': datetime.datetime(2010, 1, 11, 0, 0),
'first_as_of_time': datetime.datetime(2010, 1, 8, 0, 0),
'matrix_info_end_time': datetime.datetime(2010, 1, 14, 0, 0),
'test_as_of_date_frequency': '3 days',
'test_label_timespan': '3 days',
'test_duration': '5 days'
}],
'train_matrix': {
'as_of_times': [
datetime.datetime(2010, 1, 2, 0, 0),
datetime.datetime(2010, 1, 3, 0, 0),
datetime.datetime(2010, 1, 4, 0, 0),
datetime.datetime(2010, 1, 5, 0, 0),
datetime.datetime(2010, 1, 6, 0, 0),
datetime.datetime(2010, 1, 7, 0, 0)
],
'last_as_of_time': datetime.datetime(2010, 1, 7, 0, 0),
'first_as_of_time': datetime.datetime(2010, 1, 2, 0, 0),
'matrix_info_end_time': datetime.datetime(2010, 1, 8, 0, 0),
'max_training_history': '5 days',
'training_as_of_date_frequency': '1 days',
'training_label_timespan': '1 day'
}
}
]
```
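
The `as_of_times` lists in this output follow a pattern that can be reproduced by hand. The sketch below is a reconstruction that matches the printed values, not timechop's actual algorithm; the function names are hypothetical. Training as-of times step backward from the last one, bounded by `max_training_history` and `label_start_time`, while test as-of times step forward from the first one and stay inside `test_duration`.

```python
import datetime

DAY = datetime.timedelta(days=1)


def backward_as_of_times(last_as_of, frequency, max_history, label_start):
    """Step backward from last_as_of, bounded by max_history and label_start."""
    earliest = max(last_as_of - max_history, label_start)
    times = []
    current = last_as_of
    while current >= earliest:
        times.append(current)
        current -= frequency
    return sorted(times)


def forward_as_of_times(first_as_of, frequency, duration):
    """Step forward from first_as_of while staying inside the test duration."""
    times = [first_as_of]  # a zero-length duration still yields one as-of time
    current = first_as_of + frequency
    while current < first_as_of + duration:
        times.append(current)
        current += frequency
    return times


label_start = datetime.datetime(2010, 1, 1)

# First split: history is cut short by label_start_time, leaving two as-of times.
early = backward_as_of_times(datetime.datetime(2010, 1, 2), DAY, 5 * DAY, label_start)
print([t.day for t in early])  # [1, 2]

# Second split: the full 5 days of history fit, giving Jan 2 through Jan 7.
later = backward_as_of_times(datetime.datetime(2010, 1, 7), DAY, 5 * DAY, label_start)
print([t.day for t in later])  # [2, 3, 4, 5, 6, 7]

# Test matrix of the first split: every 3 days within a 5-day duration.
tests = forward_as_of_times(datetime.datetime(2010, 1, 3), 3 * DAY, 5 * DAY)
print([t.day for t in tests])  # [3, 6]
```

In the printed output, each `matrix_info_end_time` also equals the matrix's `last_as_of_time` plus its label timespan (for example, January 6 plus 3 days gives January 9 for the first test matrix).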

The output of Timechop works as input to the [architect.Planner](https://github.com/dssg/architect/blob/master/architect/planner.py).

Keywords: timechop
Classifier: Development Status :: 2 - Pre-Alpha
Classifier: Intended Audience :: Developers
Classifier: License :: OSI Approved :: MIT License
Classifier: Natural Language :: English
Classifier: Programming Language :: Python :: 2
Classifier: Programming Language :: Python :: 2.6
Classifier: Programming Language :: Python :: 2.7
Classifier: Programming Language :: Python :: 3
Classifier: Programming Language :: Python :: 3.3
Classifier: Programming Language :: Python :: 3.4
Classifier: Programming Language :: Python :: 3.5


Download files

Source Distribution

timechop-1.0.0.tar.gz (9.9 kB view details)

Uploaded Source

Built Distributions

timechop-1.0.0-py3-none-any.whl (14.3 kB view details)

Uploaded Python 3

timechop-1.0.0-py2-none-any.whl (14.3 kB view details)

Uploaded Python 2

File details

Details for the file timechop-1.0.0.tar.gz.

File metadata

  • Download URL: timechop-1.0.0.tar.gz
  • Size: 9.9 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? No

File hashes

Hashes for timechop-1.0.0.tar.gz
Algorithm Hash digest
SHA256 b29cc7210cffe5227c4474f243b8b62518dc58559986d29e1ff6d77a4906c035
MD5 1260bbec2b7560ecb74dcca01f8e8784
BLAKE2b-256 e43f0b813bb299ce33b4d55e390e5f25bddbb8ca4993350d403889aacfbf4bd3

File details

Details for the file timechop-1.0.0-py3-none-any.whl.

File hashes

Hashes for timechop-1.0.0-py3-none-any.whl
Algorithm Hash digest
SHA256 5cadf2659ccf4882a9fe74d575a46b33d0f2d2331e0aa1e0d5b1bd8faefcf341
MD5 1add8d4c9b642e6b9ae01a4305b2872b
BLAKE2b-256 467d4a190ab061701afcba1e18d54ceff0e2a11690723714325dc9395b06b963

File details

Details for the file timechop-1.0.0-py2-none-any.whl.

File hashes

Hashes for timechop-1.0.0-py2-none-any.whl
Algorithm Hash digest
SHA256 be2c093c566b7665bf3bb97acf45fe36761068996b245c714020553e1ad62484
MD5 6172595dc16b0451d9c0803108531d8b
BLAKE2b-256 db120733599cd6be7033f2a4236d497f475c24da0e61b089447e8a0cd32bf0ca
