
Affective Research Dataset Toolkit (ARDT): an extensible utility package for working with AER datasets such as ASCERTAIN, CUADS, DREAMER, and more


Affective Research Dataset Toolkit (ARDT)

ARDT, pronounced "art," is a utility library for working with AER datasets available to the academic community for research in automated emotion recognition. While it may well apply to datasets in other research areas, the author(s) are primarily focused on AER.


Quick Start

Step 1: Installation

pip install ardt

Step 2: Configuration

Configure the paths to your AER datasets in the ardt_config.yaml file. In your project root, create a file named ardt_config.yaml like so:

# Some ARDT dataset implementations may need to preprocess the raw data.
# When this happens, the intermediate outputs are stored in working_dir.
working_dir: /mnt/datasets/ardt/working_storage

# Configure any datasets you want to use... each key is defined by the
# AERDataset implementation itself. Templates for the three dataset
# implementations provided out of the box are shown below; add or remove
# entries as needed.
datasets:
  # For ardt.datasets.ascertain.AscertainDataset:
  ascertain:
    # Path to the expanded ASCERTAIN dataset:
    path: /mnt/datasets/ascertain

    # Names of the subfolders under ASCERTAIN where you expanded
    # ASCERTAIN_Raw.zip and ASCERTAIN_Features.zip respectively:
    raw_data_path: ASCERTAIN_Raw
    features_data_path: ASCERTAIN_Features

  # For ardt.datasets.dreamer.DreamerDataset:
  dreamer:
    path: /mnt/datasets/dreamer
    dreamer_data_filename: DREAMER_Data.json

  # For ardt.datasets.cuads.CuadsDataset:
  cuads:
    path: /mnt/datasets/cuads

Step 3: Consume a Dataset

In the simplest possible case, you just want to load a single dataset and iterate over its trials. Most likely you also want to process one of the trial's recorded signals. The following example prints trial data and does something with that trial's ECG signal data...

from ardt.datasets.cuads import CuadsDataset

# Loads cuads from the datasets.cuads.path in ardt_config.yaml
dataset = CuadsDataset()
dataset.preload()           # always call preload prior to load_trials
dataset.load_trials()       # loads the dataset trial data...

for trial in dataset.trials:
    print(f'Participant {trial.participant_id} viewed media file {trial.media_name} '
          f'and evaluated it into quadrant {trial.participant_response}. '
          f'Expected response was {trial.expected_response}')
    
    process_ecg_signal(trial.load_signal_data('ECG'))    

Step 4: Learn About What Else You Can Do

ARDT is a versatile framework that allows you to work with multiple datasets simultaneously. It provides APIs to wrap the datasets in TensorFlow Datasets for machine learning, and a comprehensive preprocessing pipeline for signal filtering and manipulation.

Much of this is covered in this README. For additional assistance you can open an issue on our GitHub, or reach out to the authors directly.

You will also find comprehensive examples in the CUADS Data Quality Notebook.

Intended Use and License

This library is intended for use only by academic researchers to facilitate advancements in emotion research. It is not for commercial use under any circumstances.

This library is licensed under the CC BY-NC-SA 4.0 International License.

You are free to:

  • Share — copy and redistribute the material in any medium or format
  • Adapt — remix, transform, and build upon the material

Under the following terms:

  • Attribution — You must give appropriate credit, provide a link to the license, and indicate if changes were made. You may do so in any reasonable manner, but not in any way that suggests the licensor endorses you or your use.
  • NonCommercial — You may not use the material for commercial purposes.
  • ShareAlike — If you remix, transform, or build upon the material, you must distribute your contributions under the same license as the original.
  • No additional restrictions — You may not apply legal terms or technological measures that legally restrict others from doing anything the license permits.


Concepts

ARDT is designed around a few simple concepts:

  1. A trial is a single session in which a participant is exposed to an emotional stimulus, and includes data from one or more sensors captured during the session. This may include ECG, EEG, video or audio recordings of the participant, or whatever else you can think of.
  2. A dataset is a collection of trials from multiple participants.
  3. Sensor data from a trial may need to be processed before being used; you can do so using the preprocessor pipeline.

Most importantly, AER datasets are not distributed with this library. You need to request access to the datasets from the dataset authors and download them before following this guide.

Loading signals from dataset trials

In this example we assume that you have downloaded DREAMER, which is provided in a single JSON file, and that it is stored at ${DREAMER_HOME}/DREAMER_Data.json.

Step 1 - Instantiate an AERDataset: The AERDataset is the base class for all AER datasets, and the details of interacting with each one are encapsulated in its subclasses, which currently include ardt.datasets.ascertain.AscertainDataset, ardt.datasets.cuads.CuadsDataset, and ardt.datasets.dreamer.DreamerDataset. Instantiate a DreamerDataset like so:

import os
from ardt.datasets.dreamer import DreamerDataset

# Typically you'd load this from a configuration file... we'll get to that later.
dreamer_home = os.environ['DREAMER_HOME']
ecg_dataset = DreamerDataset(dreamer_home, signals=['ECG'])

The signals argument takes a list of signals to load into the AERDataset, and can be any non-empty subset of the signals available within the dataset in question. DREAMER provides ECG and EEG recordings, so you can specify any of ['EEG'], ['ECG'], or ['EEG','ECG']. The order specified does not matter.

Step 2 - preload and load the dataset: Now that you have the DreamerDataset, there are two steps to get it ready for use: preload, and load.

The preload step performs any preprocessing of the raw dataset provided by the dataset authors that is necessary to get it ready for use in ARDT. DREAMER, for example, is provided as a single JSON file that is several gigabytes in size. ARDT's preload breaks the JSON into individual NumPy files for each trial, without ever loading the entire JSON file into memory. This allows it to be used on memory-constrained systems, and enables efficient prefetching from NAS storage. The preload mechanism is cached, and therefore only runs the first time it is invoked on a given dataset. It only preloads the signals listed when the dataset was constructed, and will automatically re-run if a new signal is requested that was not included in the previous preload.
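
The run-once, signal-aware caching rule described above can be sketched in pure Python (conceptual illustration only; this is not ARDT's actual implementation, and PreloadCache is a made-up name):

```python
class PreloadCache:
    """Sketch of the preload caching rule: preprocessing runs only when
    the requested signal set is not covered by a previous preload."""

    def __init__(self):
        self._preloaded = set()   # signals covered by previous preloads
        self.run_count = 0        # how many times preload actually ran

    def preload(self, signals):
        requested = set(signals)
        if requested <= self._preloaded:
            return                # cache hit: nothing new to preprocess
        self.run_count += 1       # cache miss: (re)run the preprocessing
        self._preloaded |= requested

cache = PreloadCache()
cache.preload(['ECG'])           # first call: runs
cache.preload(['ECG'])           # already covered: skipped
cache.preload(['ECG', 'EEG'])    # new signal requested: re-runs
print(cache.run_count)           # → 2
```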

The load step populates the dataset's list of trials with metadata only. Signal data is lazy-loaded later.

# preload only runs once, regardless of how many times you call it
# so there is no need to check. 
ecg_dataset.preload()

# after preloading, you can load the trials
ecg_dataset.load_trials()

Step 3 - obtain signal data from the trials: With the trials loaded, you can now obtain the signal data and do your analysis on it.

for trial in ecg_dataset.trials:
    ecg_signal = trial.load_signal_data('ECG')
    process_ecg(ecg_signal)

That's it! And it's the same regardless of which AER dataset you are using. If you want to use ASCERTAIN instead of DREAMER, just replace ardt.datasets.dreamer.DreamerDataset with ardt.datasets.ascertain.AscertainDataset; everything else remains the same.

Preprocessing signals

The first step in virtually all workloads is to preprocess the signal data, and you can use ARDT's preprocessors to build an automated pipeline that does this when signals are loaded from a trial.

For example, let's assume you want to trim each ECG signal in the DREAMER dataset to the final 30 seconds of the sample. You can use the FixedDurationPreprocessor to do this automatically, like so:

import os
from ardt.datasets.dreamer import DreamerDataset
from ardt.preprocessors import FixedDurationPreprocessor

# Typically you'd load this from a configuration file... we'll get to that later.
dreamer_home = os.environ['DREAMER_HOME']
ecg_dataset = DreamerDataset(dreamer_home, signals=['ECG'])

# Add the preprocessor pipeline to the dataset, for the signal it should be applied to.
# Each signal type can have its own preprocessor pipeline.
ecg_dataset.signal_preprocessors['ECG'] = FixedDurationPreprocessor(signal_duration=30, sample_rate=256,
                                                                    padding_value=0)

# Preload and load the dataset...
ecg_dataset.preload()
ecg_dataset.load_trials()

for trial in ecg_dataset.trials:
    # When you request the signal data from the trial, if the dataset
    # has a preprocessor for that signal type, it will be applied to the
    # signal before it is returned. You are guaranteed to have a 30s 
    # sample here.
    #
    # If the signal was less than 30s originally, it was padded on the left 
    # with 0 values. 
    ecg_signal_30s = trial.load_signal_data('ECG')

    # Do something with ecg_signal_30s

Creating your own preprocessor, and preprocessor chaining

You can subclass SignalPreprocessor to create your own, and preprocessors can be chained together. For example, let's say we want to normalize the signal to values between 0 and 1, and also trim them to 30 seconds fixed duration.

import os
import numpy as np
from sklearn import preprocessing as p

from ardt.datasets.dreamer import DreamerDataset
from ardt.preprocessors import FixedDurationPreprocessor, SignalPreprocessor


class MyNormalizer(SignalPreprocessor):
    def __init__(self, parent_preprocessor=None):
        super().__init__(parent_preprocessor)

    def process_signal(self, signal):
        min_max_scaler = p.MinMaxScaler()
        return min_max_scaler.fit_transform(signal)


dreamer_home = os.environ['DREAMER_HOME']
ecg_dataset = DreamerDataset(dreamer_home, signals=['ECG'])

# Create a pipeline by instantiating MyNormalizer, and passing in a 
# FixedDurationPreprocessor as its parent. You can chain as many 
# preprocessors together as you need. The parent will always be called
# first - so the outermost preprocessor is the last one to execute.
pipeline = MyNormalizer(
    FixedDurationPreprocessor(signal_duration=30, sample_rate=256, padding_value=0)
)

ecg_dataset.signal_preprocessors['ECG'] = pipeline

# Preload and load the dataset...
ecg_dataset.preload()
ecg_dataset.load_trials()

for trial in ecg_dataset.trials:
    # Here, the signal data is already trimmed or padded to be 30s long, 
    # and has been normalized using the MinMaxScaler to values between 
    # 0 and 1.
    ecg_signal = trial.load_signal_data('ECG')

Note that the order of your pipeline is critically important. Here, we apply FixedDurationPreprocessor first, before we normalize the values. This may be problematic, since ECG signals are prone to baseline wander. Padding zero values in before normalization will artificially skew the normalization results. It would be better to normalize the signal first, then apply the FixedDurationPreprocessor:

pipeline = FixedDurationPreprocessor(
    signal_duration=30, 
    sample_rate=256, 
    padding_value=0, 
    parent_preprocessor=MyNormalizer()    
)

Alternatively you can use the child_preprocessor to chain the other way:

pipeline = MyNormalizer(
    child_preprocessor=FixedDurationPreprocessor(signal_duration=30, sample_rate=256, padding_value=0)
)

A child_preprocessor will be invoked after the preprocessor completes, so this achieves the same effect of normalizing the signal first, then truncating or padding it to 30 seconds.
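
The effect of ordering on normalization can be seen with a toy min-max example (pure illustration; real preprocessing runs on multi-channel NumPy arrays, not Python lists):

```python
def min_max(xs):
    """Scale a list of values to the range [0, 1]."""
    lo, hi = min(xs), max(xs)
    return [(x - lo) / (hi - lo) for x in xs]

sig = [4.0, 6.0, 8.0]   # toy "signal" whose values sit well above zero

# Pad first, then normalize: the padding zeros become the minimum,
# compressing the real samples into the upper part of [0, 1].
padded_first = min_max([0.0, 0.0] + sig)

# Normalize first, then pad: the real samples span the full [0, 1] range
# and the padding stays at the padding value.
normalized_first = [0.0, 0.0] + min_max(sig)

print(padded_first)      # → [0.0, 0.0, 0.5, 0.75, 1.0]
print(normalized_first)  # → [0.0, 0.0, 0.0, 0.5, 1.0]
```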

Using with TensorFlow

To facilitate use with TensorFlow, use the TFDatasetWrapper to decorate your AERDataset as a tf.data.Dataset suitable for use with tf.keras.Model.fit().

import ardt.datasets

# Don't forget to setup your preprocessor pipelines, then preload and 
# load the dataset first!
tfdsw = ardt.datasets.TFDatasetWrapper(ecg_dataset)

# Create the tf.data.Dataset 
tfdataset = tfdsw('ECG', batch_size=64, buffer_size=500, repeat=1)

# Setup your tensorflow model, then use the tfdataset:
myModel = get_tensorflow_model()

# Train your model using preprocessed signals from the AERDataset
myModel.fit(tfdataset)

To separate training, validation and test splits, you can specify the splits to the TFDatasetWrapper and then indicate which split you intend when you call it.

import ardt.datasets

# Don't forget to setup your preprocessor pipelines, then preload and 
# load the dataset first!

# Specify 60% of participants for the training split, 30% for validation and 10% for testing.
tfdsw = ardt.datasets.TFDatasetWrapper(ecg_dataset, splits=[.6, .3, .1])

# Setup your tensorflow model, then use the tfdataset:
myModel = get_tensorflow_model()

# Train your model using preprocessed signals from the AERDataset, using trials from the split at index 0 (60%)
myModel.fit(
    x=tfdsw('ECG', n_split=0),
    validation_data=tfdsw('ECG', n_split=1),
    ...
)

# Later, evaluate against the test set
results = myModel.evaluate(
    x=tfdsw('ECG', n_split=2)
)
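
Fraction-based participant splits like [.6, .3, .1] can be sketched as follows (a hypothetical illustration of the idea; ARDT's actual split logic may differ in detail, and split_participants is not part of its API):

```python
def split_participants(participant_ids, fractions):
    """Partition participant IDs into consecutive splits by fraction."""
    assert abs(sum(fractions) - 1.0) < 1e-9, "fractions must sum to 1"
    splits, start = [], 0
    for i, fraction in enumerate(fractions):
        # The last split takes whatever remains, avoiding rounding gaps.
        if i == len(fractions) - 1:
            end = len(participant_ids)
        else:
            end = start + round(len(participant_ids) * fraction)
        splits.append(participant_ids[start:end])
        start = end
    return splits

train, val, test = split_participants(list(range(1, 11)), [.6, .3, .1])
print(len(train), len(val), len(test))   # → 6 3 1
```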

TFDatasetWrapper provides a tf.data.Dataset which will prefetch up to buffer_size trials at random, creating batches of size batch_size, and will iterate the dataset repeat times. The prefetch queue uses tf.data.AUTOTUNE to self-optimize.
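
Conceptually, the shuffle-buffer batching behaves something like this pure-Python sketch (illustrative only; the real wrapper builds a tf.data.Dataset, and the batches function here is invented for the example):

```python
import random

def batches(trials, batch_size, buffer_size, repeat, seed=0):
    """Emit shuffled batches using a bounded shuffle buffer."""
    rng = random.Random(seed)
    for _ in range(repeat):
        buffer, shuffled = [], []
        for trial in trials:
            buffer.append(trial)
            if len(buffer) > buffer_size:
                # once the buffer is full, emit a random element from it
                shuffled.append(buffer.pop(rng.randrange(len(buffer))))
        rng.shuffle(buffer)       # drain whatever remains at the end
        shuffled.extend(buffer)
        for i in range(0, len(shuffled), batch_size):
            yield shuffled[i:i + batch_size]

all_batches = list(batches(range(10), batch_size=4, buffer_size=3, repeat=2))
print([len(b) for b in all_batches])   # → [4, 4, 2, 4, 4, 2]
```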

Adding New Datasets

Whether you are creating your own dataset, or just want to use one that isn't already included, ARDT is designed to be extensible, allowing you to integrate additional datasets as needed. This section serves as a guide to help you do this.

Step 1: Dataset Paths

Dataset paths are configured in the ardt_config.yaml file. Each dataset has its own section, and you can add new ones as needed. For example, to add the CUADS dataset, we did this:

working_dir: /mnt/affectsai/aerds/

datasets:
  # ... your other datasets ...
  cuads:
    path: /mnt/affectsai/datasets/cuads

Any additional properties you need can be added under the cuads element.
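
For instance, a dataset implementation that needs extra settings could read them from its own section; the ecg_sample_rate key below is purely hypothetical:

```yaml
datasets:
  cuads:
    path: /mnt/affectsai/datasets/cuads
    # hypothetical extra property your implementation could read:
    ecg_sample_rate: 256
```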

Step 2: Implement AERDataset and AERTrial Subclasses

The AERDataset is the base class for all dataset implementations in ARDT. It is primarily responsible for loading instances of AERTrial.

All the implementation details, including dataset layout and access details, are encapsulated in your implementation of this base class. See any of the existing implementations for examples. We provide implementations for ASCERTAIN, CUADS, and DREAMER, each of which is thoroughly commented. See

  • src/ardt/datasets/ascertain/AscertainDataset.py,
  • src/ardt/datasets/dreamer/DreamerDataset.py,
  • src/ardt/datasets/cuads/CuadsDataset.py.

To extend AERDataset do the following:

  1. Create a new class as a subclass of AERDataset like so:

    from ardt.datasets import AERDataset
    
    class MyAwesomeDataset(AERDataset):
        def __init__(self, signals):
            super().__init__(signals)
    

    You should minimally provide a list of signal types to super().__init__(). This is a list of signal types provided by this dataset, e.g.: ['ECG','EEG']. Feel free to add whatever additional arguments you might need to support your implementation.

  2. Override the load_trials(self) and get_signal_metadata(self, signal_type) methods from AERDataset. load_trials(self) is where all the hard work of implementing a dataset is done... here, you will parse the dataset to produce individual AERTrial instances. get_signal_metadata(self, signal_type) returns a map of metadata about the requested signal. Minimally this should include:

    • n_channels: the number of channels for this signal, and
    • sample_rate: the sample rate in Hz for this signal

    from ardt.datasets import AERDataset
        
    class MyAwesomeDataset(AERDataset):
        def __init__(self, signals):
            if signals is None:
                signals = ['ECG']       # If not specified, let's load ECG signals from MyAwesomeDataset...
                    
            super().__init__(signals)
        
        def load_trials(self):
            """
            Loads the AERTrials from the preloaded dataset into memory. This method should load all relevant trials from
            the dataset. To avoid memory utilization issues, it is strongly recommended to defer loading signal data into
            the AERTrial until that AERTrial's load_signal_data method is called.
        
            During load_trials, implementations should populate `self.trials`. Trial participant and media identifiers must
            be numbered sequentially from 1 to N where N is the number of participants or media files in the dataset
        
            See subclasses for dataset-specific details.
            :return:
            """
            mytrials = []  # actually load your trial data...
            self.trials.extend( mytrials )
        
        def get_signal_metadata(self, signal_type):
            """
            Returns a dict containing the requested signal's metadata. Mandatory keys include:
            - 'signal_type' (the signal type)
            - 'sample_rate' (in samples per second)
            - 'n_channels' (the number of channels in the signal)
        
            See subclasses for implementation-specific keys that may also be present.
        
            :param signal_type: the type of signal for which to retrieve the metadata.
            :return: a dict containing the requested signal's metadata
            """
            if signal_type not in self._signal_types:
                raise ValueError('Signal type {} is not known in this AERTrial'.format(signal_type))
        
            if signal_type == 'ECG':
                return {
                    'n_channels': 2,
                    'sample_rate': 256
                }
                
            return {}
    
  3. Create a new class as a subclass of AERTrial like so:

      import numpy as np

      from ardt.datasets import AERTrial
      
      class MyAwesomeDatasetTrial(AERTrial):
          def __init__(self, dataset, participant_id, media_id):
              super().__init__(dataset, participant_id, media_id)
      
          def load_signal_data(self, signal_type):
              """
              Loads and returns the requested signal as an (N+1)xM numpy array, where N is the number of channels, and M is
              the number of samples in the signal. The row at N=0 represents the timestamp of each sample. The value is
              given in epoch time if a real start time is available, otherwise it is in elapsed milliseconds with 0
              representing the start of the sample.
      
              :param signal_type:
              :return:
              """
              if signal_type not in self._signal_types:
                  raise ValueError('Signal type {} is not known in this AERTrial'.format(signal_type))
      
              return np.empty(0)
      
          def load_ground_truth(self):
              """
              Returns the ground truth label for this trial. For AER trials, this is the quadrant within the A/V space,
              numbered 0 through 3 as follows:
              - 0: High Arousal, High Valence
              - 1: High Arousal, Low Valence
              - 2: Low Arousal, Low Valence
              - 3: Low Arousal, High Valence
      
              :return: The ground truth label for this trial
              """
              return 0
      
          def get_signal_metadata(self, signal_type):
              """
              Returns a dict containing the requested signal's metadata. Mandatory keys include:
              - 'signal_type' (the signal type)
              - 'sample_rate' (in samples per second)
              - 'n_channels' (the number of channels in the signal)
      
              See subclasses for implementation-specific keys that may also be present.
      
              :param signal_type: the type of signal for which to retrieve the metadata.
              :return: a dict containing the requested signal's metadata
              """
              if signal_type not in self._signal_types:
                  raise ValueError('Signal type {} is not known in this AERTrial'.format(signal_type))
      
              response = self.dataset.get_signal_metadata(signal_type)
              response['duration'] = 60  # e.g., the duration of this trial's signal, in seconds
      
              return response    
      

      The AERTrial takes a reference to the dataset that created it, and the participant_id and media_id that this trial represents. It must implement load_signal_data and load_ground_truth as documented. It may optionally override get_signal_metadata to augment the response from the dataset, for example, to include signal duration.
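
The quadrant convention documented for load_ground_truth can be expressed as a small helper (illustrative only; av_quadrant is not part of the ARDT API):

```python
def av_quadrant(arousal_high, valence_high):
    """Map high/low arousal and valence to the quadrant labels 0-3
    documented for load_ground_truth."""
    if arousal_high and valence_high:
        return 0      # High Arousal, High Valence
    if arousal_high:
        return 1      # High Arousal, Low Valence
    if not valence_high:
        return 2      # Low Arousal, Low Valence
    return 3          # Low Arousal, High Valence

print([av_quadrant(a, v) for a, v in
       [(True, True), (True, False), (False, False), (False, True)]])
# → [0, 1, 2, 3]
```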

There is more to it than this, but this should be enough to get you started. See the AERDataset and AERTrial classes for method documentation, and then the CUADS, ASCERTAIN, and DREAMER examples for guidance.

Contributing

We are happy to support you by accepting pull requests that make this library more broadly applicable, or by accepting issues to do the same. If you have an AER dataset you would like us to integrate, please open an issue for that as well, but we will be unable to process issues requesting integration with non-AER datasets at this time.

If you would like to get involved by maintaining dataset integrations in other areas of research, please get in touch and we'd be happy to have the help!
