UBIWEAR

A Python library for pre-processing ubiquitous aggregated self-tracking data.

What is this library about

This library grew out of our own work with in-the-wild data from the "MyHeart Counts" study [1].

Through extensive experimentation with these real-world data, we derived a set of prescriptive guidelines for pre-processing aggregated data gathered from wearable devices.

We hope UBIWEAR serves the research community as a starting point in the largely unexplored domain of physical activity prediction, and that it promotes a standardized definition of pre-processing for data from wearables and self-tracking devices.

When to use this library

To the best of our knowledge, when this library was written there were no suggested techniques for handling time-series data coming from self-tracking devices.

UBIWEAR offers pre-processing methods for univariate time-series problems, with slight modifications made specifically for wearables data.

It handles univariate, aggregated time-series data and processes it into a structure suitable for predictive modeling.

Usage of UBIWEAR

Install the library

Create virtual environment

$ python3 -m venv venv
$ source venv/bin/activate

Upgrade pip

$ python -m pip install --upgrade pip

Install UBIWEAR

$ pip install ubiwear

Load your data

The input to UBIWEAR is always a pandas DataFrame whose index is a DatetimeIndex and which has a column named value, of type float or int, holding the recorded observations of your time series.

For illustration, we include an example of such data in .csv format in the assets/ directory.

import pandas as pd

df = pd.read_csv('assets/df-wearable-time-series-example.csv', index_col='startTime', parse_dates=True)

The df must have the same format as the example:

                     value
startTime                 
2015-08-07 05:37:31   59.0
2015-08-07 05:43:31  139.0
2015-08-07 07:06:16  245.0
2015-08-07 07:11:18  148.0
2015-08-07 07:15:49   43.0
                    ...
2015-08-25 04:52:35   18.0
2015-08-25 05:03:11   15.0
2015-08-25 05:04:51   44.0
2015-08-25 05:06:13   80.0
2015-08-25 05:41:19  112.0
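
Before handing a DataFrame to the library, it can help to check that it matches the layout described above. The helper below is a minimal sketch and not part of UBIWEAR: the name check_input_frame is hypothetical, and it only tests the three stated requirements (a DatetimeIndex, a column named value, numeric observations).

```python
import pandas as pd
from pandas.api.types import is_numeric_dtype


def check_input_frame(df: pd.DataFrame) -> None:
    """Raise ValueError if df does not match the layout described above.

    Hypothetical helper for illustration; UBIWEAR does not provide it.
    """
    if not isinstance(df.index, pd.DatetimeIndex):
        raise ValueError("index must be a pandas DatetimeIndex")
    if "value" not in df.columns:
        raise ValueError("a column named 'value' is required")
    if not is_numeric_dtype(df["value"]):
        raise ValueError("'value' must hold int or float observations")


# Two rows copied from the example data above.
example = pd.DataFrame(
    {"value": [59.0, 139.0]},
    index=pd.DatetimeIndex(
        ["2015-08-07 05:37:31", "2015-08-07 05:43:31"], name="startTime"
    ),
)
check_input_frame(example)  # passes silently
```

A frame loaded without index_col and parse_dates would fail the first check, which is the most common way to end up with the wrong layout.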

Clean and process the data

Import the Processor class. Its purpose is to pre-process aggregated time-series data from wearables.

The methods of the class are designed to be used in a chaining style.

It also offers a "magic" method, process, that runs the data through a pre-defined, suggested pipeline that works especially well for physical activity data.

from ubiwear.processor import Processor

ubiwear_processor = Processor(df=df)

# Call the magic method
df = ubiwear_processor.process(granularity='1H', q=0.05, impute_start=8, impute_end=24)

The df has the following format:

                          value  dayofweek_sin  ...  hour_sin      hour_cos
startTime                                       ...                        
2015-08-07 05:00:00  198.000000      -0.433884  ...  0.965926  2.588190e-01
2015-08-07 06:00:00    0.000000      -0.433884  ...  1.000000  6.123234e-17
2015-08-07 07:00:00  467.000000      -0.433884  ...  0.965926 -2.588190e-01
2015-08-07 08:00:00  544.333333      -0.433884  ...  0.866025 -5.000000e-01
2015-08-07 09:00:00  621.666667      -0.433884  ...  0.707107 -7.071068e-01
                         ...            ...  ...       ...           ...
2015-08-25 01:00:00    0.000000       0.781831  ...  0.258819  9.659258e-01
2015-08-25 02:00:00   82.000000       0.781831  ...  0.500000  8.660254e-01
2015-08-25 03:00:00    0.000000       0.781831  ...  0.707107  7.071068e-01
2015-08-25 04:00:00    0.000000       0.781831  ...  0.866025  5.000000e-01
2015-08-25 05:00:00   95.000000       0.781831  ...  0.965926  2.588190e-01

What has happened?

  • removed duplicate observations recorded at the same timestamp
  • removed NaN/NaT records
  • removed outlier values using the quantile method
  • resampled the data to a uniform granularity (here, hourly)
  • imputed missing values during active hours (08:00 - 24:00), a step specific to wearables data
  • enriched the feature space with date features, converted to their cyclical (sine/cosine) representation
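
The cyclical transformation in the last step can be illustrated with the standard sine/cosine encoding. This is a sketch using only the standard library, not UBIWEAR's own code; the function name cyclical is hypothetical. The values it produces agree with the hour_sin, hour_cos and dayofweek_sin columns shown for the first row of the output above.

```python
import math
from datetime import datetime


def cyclical(value, period):
    """Map a periodic value onto the unit circle as (sin, cos)."""
    angle = 2 * math.pi * value / period
    return math.sin(angle), math.cos(angle)


ts = datetime(2015, 8, 7, 5)  # first row of the processed example
hour_sin, hour_cos = cyclical(ts.hour, 24)
dow_sin, _ = cyclical(ts.weekday(), 7)  # Monday == 0, as in pandas

print(round(hour_sin, 6), round(hour_cos, 6), round(dow_sin, 6))
# 0.965926 0.258819 -0.433884
```

The point of the encoding is that hour 23 and hour 0 end up close together on the circle, which a plain integer hour feature cannot express.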

All of the above methods can be called individually, so you can select those that fit your problem.

You can also implement your own methods in the Processor class and call them in your pre-processing pipeline in a chaining manner.
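
For intuition, the first four steps can be approximated in plain pandas. This is a rough sketch under stated assumptions, not the library's implementation: the function name basic_clean is hypothetical, outliers are assumed to be values outside the [q, 1 - q] quantile band, and resampling is assumed to sum observations within each bin (which is consistent with the example, where 59.0 and 139.0 recorded in hour 05 become 198.0). Imputation is left out because its exact method is not described here.

```python
import numpy as np
import pandas as pd


def basic_clean(df: pd.DataFrame, q: float = 0.05, rule: str = "60min") -> pd.DataFrame:
    """Deduplicate, drop gaps, trim quantile outliers, resample by summing.

    Hypothetical helper; the quantile band and the sum are assumptions.
    """
    out = df[~df.index.duplicated(keep="first")]            # one row per timestamp
    out = out[out.index.notna()].dropna(subset=["value"])   # drop NaT / NaN records
    low, high = out["value"].quantile([q, 1 - q])
    out = out[out["value"].between(low, high)]              # keep the central band
    return out.resample(rule).sum()                         # uniform granularity


raw = pd.DataFrame(
    {"value": [59.0, 139.0, 139.0, np.nan, 245.0]},
    index=pd.to_datetime([
        "2015-08-07 05:37:31",
        "2015-08-07 05:43:31",
        "2015-08-07 05:43:31",  # duplicate timestamp
        "2015-08-07 06:10:00",  # missing value
        "2015-08-07 07:06:16",
    ]),
)
hourly = basic_clean(raw, q=0.0)  # q=0.0 keeps every value
print(hourly["value"].tolist())
# [198.0, 0.0, 245.0]
```

Note that an hour without observations sums to 0.0 rather than NaN, so any imputation of missing hours has to decide separately which zeros are genuine.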

For example:

from ubiwear.processor import Processor

ubiwear_processor = Processor(df=df)

# Parentheses let the chain span several lines without backslashes
(
    ubiwear_processor
    .remove_nan()
    .remove_duplicate_values_at_same_timestamp()
    .add_date_features()
    # ...
    # .your_own_method()
)

# Get the processed data
df = ubiwear_processor.df
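
What makes a method chainable is that it updates the stored DataFrame and returns the object itself. The class below is a hypothetical stand-in written only to show that pattern; MiniProcessor and drop_negative are not part of UBIWEAR, and the real Processor may be organized differently.

```python
import pandas as pd


class MiniProcessor:
    """Hypothetical stand-in that only illustrates the chaining pattern."""

    def __init__(self, df: pd.DataFrame):
        self.df = df.copy()  # leave the caller's frame untouched

    def remove_nan(self) -> "MiniProcessor":
        self.df = self.df.dropna()
        return self  # returning self is what enables chaining

    def drop_negative(self) -> "MiniProcessor":
        # Example of a custom step: activity counts cannot be negative.
        self.df = self.df[self.df["value"] >= 0]
        return self


frame = pd.DataFrame(
    {"value": [59.0, None, -1.0, 139.0]},
    index=pd.date_range("2015-08-07 05:00", periods=4, freq="60min"),
)
cleaned = MiniProcessor(frame).remove_nan().drop_negative().df
print(cleaned["value"].tolist())
# [59.0, 139.0]
```

A custom method written this way can be placed anywhere in a chain, as long as it keeps the DatetimeIndex and the value column intact for the steps that follow.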

Re-frame the problem from time-series to a supervised dataset

Use the Window class, which provides two main functionalities that transform a time-series problem into a supervised dataset ready to be used by machine learning algorithms:

  • Sliding window, which transforms a time series into a supervised dataset
  • Our novel aggregated tumbling window

from ubiwear.window import Window

# Transform from time-series to supervised dataset for ML
window = Window(n_in=2 * 24)
dataset = window.sliding_window(data=df)

# OR aggregated tumbling window
# dataset = window.tumbling_window(data=df, freq='1D')

The dataset has the following format:

                     var1(t-48)  var2(t-48)  ...  var11(t)  var1(t)
startTime                                    ...                   
2015-08-09 05:00:00       198.0   -0.433884  ...  0.258819      0.0
2015-08-10 05:00:00         0.0   -0.974928  ...  0.258819      0.0
                                                    ...
2015-08-11 05:00:00         0.0   -0.781831  ...  0.258819      0.0
2015-08-22 05:00:00         0.0    0.433884  ...  0.258819      0.0
2015-08-23 05:00:00         0.0   -0.433884  ...  0.258819   4562.0
2015-08-24 05:00:00         0.0   -0.974928  ...  0.258819   1861.5
2015-08-25 05:00:00       450.0   -0.781831  ...  0.258819    177.0
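
The general idea behind a sliding window can be sketched with pandas shift: each row is paired with the n_in rows that precede it, and rows without a full history are dropped. This is an illustration of the common technique with the same var(t-k) naming, not UBIWEAR's implementation; the function below is hypothetical, and the library's output above may differ in stride and column order.

```python
import pandas as pd


def sliding_window(data: pd.DataFrame, n_in: int) -> pd.DataFrame:
    """Pair each row with the n_in rows that precede it (hypothetical sketch)."""
    names = [f"var{j + 1}" for j in range(data.shape[1])]
    parts = []
    for lag in range(n_in, -1, -1):  # t-n_in, ..., t-1, t
        shifted = data.shift(lag)
        suffix = "(t)" if lag == 0 else f"(t-{lag})"
        shifted.columns = [name + suffix for name in names]
        parts.append(shifted)
    framed = pd.concat(parts, axis=1)
    return framed.dropna()  # the first n_in rows lack a full history


series = pd.DataFrame(
    {"value": [10.0, 20.0, 30.0, 40.0]},
    index=pd.date_range("2015-08-07 05:00", periods=4, freq="60min"),
)
supervised = sliding_window(series, n_in=2)
print(supervised.columns.tolist())
# ['var1(t-2)', 'var1(t-1)', 'var1(t)']
print(supervised.values.tolist())
# [[10.0, 20.0, 30.0], [20.0, 30.0, 40.0]]
```

With four observations and n_in=2, only two complete rows remain, which shows why a long history such as n_in=2 * 24 needs a correspondingly long series.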

Convert dataset for ML

Dataset is a class that provides sub-datasets for training ML models. It takes as input the dataset created by UBIWEAR's Window class.

from ubiwear.dataset import Dataset

ubiwear_dataset = Dataset(dataset=dataset)

# Get train/test sub-datasets
x_train, x_test, y_train, y_test = ubiwear_dataset.get_train_test(train_ratio=0.75)

# OR train/validation/test sub-datasets
x_train, x_val, x_test, y_train, y_val, y_test = ubiwear_dataset.get_train_val_test(train_ratio=0.75, val_ratio=0.2)
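
For time-series data, a split is usually made in time order so that the test rows lie strictly after the training rows. The sketch below shows that idea in plain pandas; it assumes a chronological split and a known target column name, and whether UBIWEAR's Dataset splits in exactly this way is not stated here. The function name chronological_split is hypothetical.

```python
import pandas as pd


def chronological_split(dataset: pd.DataFrame, target: str, train_ratio: float = 0.75):
    """Split rows in time order; the earliest share is used for training.

    Hypothetical helper; assumes rows are already sorted by time.
    """
    cut = int(len(dataset) * train_ratio)
    features = dataset.drop(columns=[target])
    labels = dataset[target]
    return features.iloc[:cut], features.iloc[cut:], labels.iloc[:cut], labels.iloc[cut:]


dataset = pd.DataFrame(
    {"var1(t-1)": range(8), "var1(t)": range(1, 9)},
    index=pd.date_range("2015-08-07", periods=8, freq="D"),
)
x_train, x_test, y_train, y_test = chronological_split(dataset, target="var1(t)")
print(len(x_train), len(x_test))
# 6 2
```

Keeping the order avoids training on observations that come after the ones being predicted.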

Apply your favorite ML or DL model

You now have clean, pre-processed X's and y's ready for your ML problem!

You can call your favorite model and record its performance on your favorite regression metrics.
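
As a point of comparison for any model, two common regression metrics and a naive persistence baseline (predict each value as the one before it) can be written with the standard library alone. This is a generic sketch with illustrative numbers; the function names are not part of UBIWEAR.

```python
import math


def mae(y_true, y_pred):
    """Mean absolute error."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


def rmse(y_true, y_pred):
    """Root mean squared error."""
    return math.sqrt(sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true))


# Illustrative hourly values, not real results.
observed = [198.0, 0.0, 467.0, 544.0]
actual = observed[1:]       # what we try to predict
predicted = observed[:-1]   # persistence: repeat the previous value

print(round(mae(actual, predicted), 2))
# 247.33
```

A trained model is only worth keeping if it beats this kind of baseline on the held-out test set.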

Literature

[1] Hershman, Steven G., et al. "Physical activity, sleep and cardiovascular health data for 50,000 individuals from the MyHeart Counts Study." Scientific Data 6.1 (2019): 1-10.
