
Affective Research Dataset Toolkit (ARDT): an extensible utility package for working with AER datasets such as ASCERTAIN, CUADS, DREAMER, and more


Affective Research Dataset Toolkit (ARDT)

ARDT, pronounced "art," is a utility library for working with AER datasets available to the academic community for research in automated emotion recognition. While it may well apply to datasets in other research areas, the author(s) are primarily focused on AER.


Quick Start

Step 1: Installation

pip install ardt

Step 2: Configuration

Configure the paths to your AER datasets in the ardt_config.yaml file. In your project root, create a file named ardt_config.yaml like so:

# Some ARDT dataset implementations may need to preprocess the raw data.
# When this happens, the intermediate outputs are stored in working_dir.
working_dir: /mnt/datasets/ardt/working_storage

# Configure any datasets you want to use... each key is defined by the
# AERDataset implementation itself. Templates for the three dataset
# implementations provided out of the box are shown below; add or remove
# entries as needed.
datasets:
  # For ardt.datasets.ascertain.AscertainDataset:
  ascertain:
    # Path to the expanded ASCERTAIN dataset:
    path: /mnt/datasets/ascertain

    # Names of the subfolders under ASCERTAIN where you expanded
    # ASCERTAIN_Raw.zip and ASCERTAIN_Features.zip respectively:
    raw_data_path: ASCERTAIN_Raw
    features_data_path: ASCERTAIN_Features

  # For ardt.datasets.dreamer.DreamerDataset:
  dreamer:
    path: /mnt/datasets/dreamer
    dreamer_data_filename: DREAMER_Data.json

  # For ardt.datasets.cuads.CuadsDataset:
  cuads:
    path: /mnt/datasets/cuads

Step 3: Consume a Dataset

In the simplest possible case, you just want to load a single dataset and iterate over its trials. Most likely you also want to process one of the trial's recorded signals. The following example prints trial data and does something with that trial's ECG signal data...

from ardt.datasets.cuads import CuadsDataset

# Loads cuads from the datasets.cuads.path in ardt_config.yaml
dataset = CuadsDataset()
dataset.preload()           # always call preload prior to load_trials
dataset.load_trials()       # loads the dataset trial data...

for trial in dataset.trials:
    print(f'Participant {trial.participant_id} viewed media file {trial.media_name} '
          f'and evaluated it into quadrant {trial.participant_response}. '
          f'Expected response was {trial.expected_response}')
    
    process_ecg_signal(trial.load_signal_data('ECG'))    

Step 4: Learn About What Else You Can Do

ARDT is a versatile framework that allows you to work with multiple datasets simultaneously. It provides APIs to wrap the datasets in TensorFlow Datasets for machine learning, and a comprehensive preprocessing pipeline for signal filtering and manipulation.

Much of this is covered in this README. For additional assistance you can open an issue on our GitHub, or reach out to the authors directly.

You will also find comprehensive examples in the CUADS Data Quality Notebook.

Intended Use and License

This library is intended for use only by academic researchers to facilitate advancements in emotion research. It is not for commercial use under any circumstances.

This library is licensed under the CC BY-NC-SA 4.0 International License.

You are free to:

  • Share — copy and redistribute the material in any medium or format
  • Adapt — remix, transform, and build upon the material

Under the following terms:

  • Attribution — You must give appropriate credit, provide a link to the license, and indicate if changes were made. You may do so in any reasonable manner, but not in any way that suggests the licensor endorses you or your use.
  • NonCommercial — You may not use the material for commercial purposes.
  • ShareAlike — If you remix, transform, or build upon the material, you must distribute your contributions under the same license as the original.
  • No additional restrictions — You may not apply legal terms or technological measures that legally restrict others from doing anything the license permits.


Concepts

ARDT is designed around a few simple concepts:

  1. A trial is a single session in which a participant is exposed to an emotional stimulus, and includes data from one or more sensors captured during the session. This may include ECG, EEG, video or audio recordings of the participant, or whatever else you can think of.
  2. A dataset is a collection of trials from multiple participants.
  3. Sensor data from a trial may need to be processed before being used; you can do so using the preprocessor pipeline.

Most importantly, AER datasets are not distributed with this library. You need to request access to the datasets from the dataset authors and download them before following this guide.

Loading signals from dataset trials

In this example we assume that you have downloaded DREAMER, which is provided in a single JSON file, and that it is stored at ${DREAMER_HOME}/DREAMER_Data.json.

Step 1 - Instantiate an AERDataset: The AERDataset is the base class for all AER datasets, and the details of interacting with each one are encapsulated in its subclasses, which currently include ardt.datasets.ascertain.AscertainDataset, ardt.datasets.cuads.CuadsDataset, and ardt.datasets.dreamer.DreamerDataset. Instantiate a DreamerDataset like so:

import os
from ardt.datasets.dreamer import DreamerDataset

# Typically you'd load this from a configuration file... we'll get to that later.
dreamer_home = os.environ['DREAMER_HOME']
ecg_dataset = DreamerDataset(dreamer_home, signals=['ECG'])

The signals argument takes a list of signals to load into the AERDataset, and can be any non-empty subset of the signals available within the dataset in question. DREAMER provides ECG and EEG recordings, so you can specify any of ['EEG'], ['ECG'], or ['EEG','ECG']. The order specified does not matter.

Step 2 - preload and load the dataset: Now that you have the DreamerDataset, there are two steps to get it ready for use: preload, and load.

The preload step performs any preprocessing of the raw dataset provided by the dataset authors that is necessary to get it ready for use in ARDT. DREAMER, for example, is provided as a single JSON file that is several gigabytes in size. ARDT's preload breaks the JSON into individual NumPy files for each trial, without ever loading the entire JSON file into memory. This allows it to be used on memory-constrained systems, and enables efficient prefetching from NAS storage. The preload mechanism is cached, and therefore only runs the first time it is invoked on a given dataset. It only preloads the signals listed when the dataset was constructed, and will automatically re-run if a new signal is requested that was not included in the previous preload.
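
The run-once, signal-aware caching rule described above can be sketched in pure Python (conceptual illustration only; this is not ARDT's actual implementation, and PreloadCache is a made-up name):

```python
class PreloadCache:
    """Sketch of the preload caching rule: preprocessing runs only when
    the requested signal set is not covered by a previous preload."""

    def __init__(self):
        self._preloaded = set()   # signals covered by previous preloads
        self.run_count = 0        # how many times preload actually ran

    def preload(self, signals):
        requested = set(signals)
        if requested <= self._preloaded:
            return                # cache hit: nothing new to preprocess
        self.run_count += 1       # cache miss: (re)run the preprocessing
        self._preloaded |= requested

cache = PreloadCache()
cache.preload(['ECG'])           # first call: runs
cache.preload(['ECG'])           # already covered: skipped
cache.preload(['ECG', 'EEG'])    # new signal requested: re-runs
print(cache.run_count)           # → 2
```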

The load step populates the dataset's list of trials with metadata only. Signal data is lazy-loaded later.

# preload only runs once, regardless of how many times you call it
# so there is no need to check. 
ecg_dataset.preload()

# after preloading, you can load the trials
ecg_dataset.load_trials()

Step 3 - obtain signal data from the trials: With the trials loaded, you can now obtain the signal data and do your analysis on it.

for trial in ecg_dataset.trials:
    ecg_signal = trial.load_signal_data('ECG')
    process_ecg(ecg_signal)

That's it! And it's the same regardless of which AER dataset you are using. If you want to use ASCERTAIN instead of DREAMER, just replace ardt.datasets.dreamer.DreamerDataset with ardt.datasets.ascertain.AscertainDataset; everything else remains the same.

Preprocessing signals

The first step in virtually all workloads is to preprocess the signal data, and you can use ARDT's preprocessors to build an automated pipeline that does this when signals are loaded from a trial.

For example, let's assume you want to trim each ECG signal in the DREAMER dataset to the final 30 seconds of the sample. You can use the FixedDurationPreprocessor to do this automatically, like so:

import os
from ardt.datasets.dreamer import DreamerDataset
from ardt.preprocessors import FixedDurationPreprocessor

# Typically you'd load this from a configuration file... we'll get to that later.
dreamer_home = os.environ['DREAMER_HOME']
ecg_dataset = DreamerDataset(dreamer_home, signals=['ECG'])

# Add the preprocessor pipeline to the dataset, for the signal it should be applied to.
# Each signal type can have its own preprocessor pipeline.
ecg_dataset.signal_preprocessors['ECG'] = FixedDurationPreprocessor(signal_duration=30, sample_rate=256,
                                                                    padding_value=0)

# Preload and load the dataset...
ecg_dataset.preload()
ecg_dataset.load_trials()

for trial in ecg_dataset.trials:
    # When you request the signal data from the trial, if the dataset
    # has a preprocessor for that signal type, it will be applied to the
    # signal before it is returned. You are guaranteed to have a 30s 
    # sample here.
    #
    # If the signal was less than 30s originally, it was padded on the left 
    # with 0 values. 
    ecg_signal_30s = trial.load_signal_data('ECG')

    # Do something with ecg_signal_30s

Creating your own preprocessor, and preprocessor chaining

You can subclass SignalPreprocessor to create your own, and preprocessors can be chained together. For example, let's say we want to normalize the signal to values between 0 and 1, and also trim them to 30 seconds fixed duration.

import os
import numpy as np
from sklearn import preprocessing as p

from ardt.datasets.dreamer import DreamerDataset
from ardt.preprocessors import FixedDurationPreprocessor, SignalPreprocessor


class MyNormalizer(SignalPreprocessor):
    def __init__(self, parent_preprocessor=None):
        super().__init__(parent_preprocessor)

    def process_signal(self, signal):
        min_max_scaler = p.MinMaxScaler()
        return min_max_scaler.fit_transform(signal)


dreamer_home = os.environ['DREAMER_HOME']
ecg_dataset = DreamerDataset(dreamer_home, signals=['ECG'])

# Create a pipeline by instantiating MyNormalizer, and passing in a 
# FixedDurationPreprocessor as its parent. You can chain as many 
# preprocessors together as you need. The parent will always be called
# first - so the outermost preprocessor is the last one to execute.
pipeline = MyNormalizer(
    FixedDurationPreprocessor(signal_duration=30, sample_rate=256, padding_value=0)
)

ecg_dataset.signal_preprocessors['ECG'] = pipeline

# Preload and load the dataset...
ecg_dataset.preload()
ecg_dataset.load_trials()

for trial in ecg_dataset.trials:
    # Here, the signal data is already trimmed or padded to be 30s long, 
    # and has been normalized using the MinMaxScaler to values between 
    # 0 and 1.
    ecg_signal = trial.load_signal_data('ECG')

Note that the order of your pipeline is critically important. Here, we apply FixedDurationPreprocessor first, before we normalize the values. This may be problematic, since ECG signals are prone to baseline wander. Padding zero values in before normalization will artificially skew the normalization results. It would be better to normalize the signal first, then apply the FixedDurationPreprocessor:

pipeline = FixedDurationPreprocessor(
    signal_duration=30, 
    sample_rate=256, 
    padding_value=0, 
    parent_preprocessor=MyNormalizer()    
)

Alternatively you can use the child_preprocessor to chain the other way:

pipeline = MyNormalizer(
    child_preprocessor=FixedDurationPreprocessor(signal_duration=30, sample_rate=256, padding_value=0)
)

A child_preprocessor will be invoked after the preprocessor completes, so this achieves the same effect of normalizing the signal first, then truncating or padding it to 30 seconds.
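
The effect of ordering on normalization can be seen with a toy min-max example (pure illustration; real preprocessing runs on multi-channel NumPy arrays, not Python lists):

```python
def min_max(xs):
    """Scale a list of values to the range [0, 1]."""
    lo, hi = min(xs), max(xs)
    return [(x - lo) / (hi - lo) for x in xs]

sig = [4.0, 6.0, 8.0]   # toy "signal" whose values sit well above zero

# Pad first, then normalize: the padding zeros become the minimum,
# compressing the real samples into the upper part of [0, 1].
padded_first = min_max([0.0, 0.0] + sig)

# Normalize first, then pad: the real samples span the full [0, 1] range
# and the padding stays at the padding value.
normalized_first = [0.0, 0.0] + min_max(sig)

print(padded_first)      # → [0.0, 0.0, 0.5, 0.75, 1.0]
print(normalized_first)  # → [0.0, 0.0, 0.0, 0.5, 1.0]
```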

Using with TensorFlow

To facilitate use with TensorFlow, use the TFDatasetWrapper to decorate your AERDataset as a tf.data.Dataset suitable for use with tf.keras.Model.fit().

import ardt.datasets

# Don't forget to setup your preprocessor pipelines, then preload and 
# load the dataset first!
tfdsw = ardt.datasets.TFDatasetWrapper(ecg_dataset)

# Create the tf.data.Dataset 
tfdataset = tfdsw('ECG', batch_size=64, buffer_size=500, repeat=1)

# Setup your tensorflow model, then use the tfdataset:
myModel = get_tensorflow_model()

# Train your model using preprocessed signals from the AERDataset
myModel.fit(tfdataset)

To separate training, validation and test splits, you can specify the splits to the TFDatasetWrapper and then indicate which split you intend when you call it.

import ardt.datasets

# Don't forget to setup your preprocessor pipelines, then preload and 
# load the dataset first!

# Specify 60% of participants for the training split, 30% for validation and 10% for testing.
tfdsw = ardt.datasets.TFDatasetWrapper(ecg_dataset, splits=[.6, .3, .1])

# Setup your tensorflow model, then use the tfdataset:
myModel = get_tensorflow_model()

# Train your model using preprocessed signals from the AERDataset, using trials from the split at index 0 (60%)
myModel.fit(
    x=tfdsw('ECG', n_split=0),
    validation_data=tfdsw('ECG', n_split=1),
    ...
)

# Later, evaluate against the test set
results = myModel.evaluate(
    x=tfdsw('ECG', n_split=2)
)
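
Fraction-based participant splits like [.6, .3, .1] can be sketched as follows (a hypothetical illustration of the idea; ARDT's actual split logic may differ in detail, and split_participants is not part of its API):

```python
def split_participants(participant_ids, fractions):
    """Partition participant IDs into consecutive splits by fraction."""
    assert abs(sum(fractions) - 1.0) < 1e-9, "fractions must sum to 1"
    splits, start = [], 0
    for i, fraction in enumerate(fractions):
        # The last split takes whatever remains, avoiding rounding gaps.
        if i == len(fractions) - 1:
            end = len(participant_ids)
        else:
            end = start + round(len(participant_ids) * fraction)
        splits.append(participant_ids[start:end])
        start = end
    return splits

train, val, test = split_participants(list(range(1, 11)), [.6, .3, .1])
print(len(train), len(val), len(test))   # → 6 3 1
```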

TFDatasetWrapper provides a tf.data.Dataset which will prefetch up to buffer_size trials at random, creating batches of size batch_size, and will iterate the dataset repeat times. The prefetch queue uses tf.data.AUTOTUNE to self-optimize.
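
Conceptually, the shuffle-buffer batching behaves something like this pure-Python sketch (illustrative only; the real wrapper builds a tf.data.Dataset, and the batches function here is invented for the example):

```python
import random

def batches(trials, batch_size, buffer_size, repeat, seed=0):
    """Emit shuffled batches using a bounded shuffle buffer."""
    rng = random.Random(seed)
    for _ in range(repeat):
        buffer, shuffled = [], []
        for trial in trials:
            buffer.append(trial)
            if len(buffer) > buffer_size:
                # once the buffer is full, emit a random element from it
                shuffled.append(buffer.pop(rng.randrange(len(buffer))))
        rng.shuffle(buffer)       # drain whatever remains at the end
        shuffled.extend(buffer)
        for i in range(0, len(shuffled), batch_size):
            yield shuffled[i:i + batch_size]

all_batches = list(batches(range(10), batch_size=4, buffer_size=3, repeat=2))
print([len(b) for b in all_batches])   # → [4, 4, 2, 4, 4, 2]
```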

Adding New Datasets

Whether you are creating your own dataset, or just want to use one that isn't already included, ARDT is designed to be extensible, allowing you to integrate additional datasets as needed. This section serves as a guide to help you do this.

Step 1: Dataset Paths

Dataset paths are configured in the ardt_config.yaml file. Each dataset has its own section, and you can add new ones as needed. For example, to add the CUADS dataset, we did this:

working_dir: /mnt/affectsai/aerds/

datasets:
  # ... your other datasets ...
  cuads:
    path: /mnt/affectsai/datasets/cuads

Any additional properties you need can be added under the cuads element.
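
For instance, a dataset implementation that needs extra settings could read them from its own section; the ecg_sample_rate key below is purely hypothetical:

```yaml
datasets:
  cuads:
    path: /mnt/affectsai/datasets/cuads
    # hypothetical extra property your implementation could read:
    ecg_sample_rate: 256
```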

Step 2: Implement AERDataset and AERTrial Subclasses

The AERDataset is the base class for all dataset implementations in ARDT. It is primarily responsible for loading instances of AERTrial.

All the implementation details, including dataset layout and access details, are encapsulated in your implementation of this base class. See any of the existing implementations for examples. We provide implementations for ASCERTAIN, CUADS, and DREAMER, each of which is thoroughly commented. See

  • src/ardt/datasets/ascertain/AscertainDataset.py,
  • src/ardt/datasets/dreamer/DreamerDataset.py,
  • src/ardt/datasets/cuads/CuadsDataset.py.

To extend AERDataset do the following:

  1. Create a new class as a subclass of AERDataset like so:

    from ardt.datasets import AERDataset
    
    class MyAwesomeDataset(AERDataset):
        def __init__(self, signals):
            super().__init__(signals)
    

    You should minimally provide a list of signal types to super().__init__(). This is a list of signal types provided by this dataset, e.g.: ['ECG','EEG']. Feel free to add whatever additional arguments you might need to support your implementation.

  2. Override the load_trials(self) and get_signal_metadata(self, signal_type) methods from AERDataset. load_trials(self) is where all the hard work of implementing a dataset is done... here, you will parse the dataset to produce individual AERTrial instances. get_signal_metadata(self, signal_type) returns a map of metadata about the requested signal. Minimally this should include:

    • n_channels: the number of channels for this signal, and
    • sample_rate: the sample rate in Hz for this signal

    from ardt.datasets import AERDataset
        
    class MyAwesomeDataset(AERDataset):
        def __init__(self, signals):
            if signals is None:
                signals = ['ECG']       # If not specified, let's load ECG signals from MyAwesomeDataset...
                    
            super().__init__(signals)
        
        def load_trials(self):
            """
            Loads the AERTrials from the preloaded dataset into memory. This method should load all relevant trials from
            the dataset. To avoid memory utilization issues, it is strongly recommended to defer loading signal data into
            the AERTrial until that AERTrial's load_signal_data method is called.
        
            During load_trials, implementations should populate `self.trials`. Trial participant and media identifiers must
            be numbered sequentially from 1 to N where N is the number of participants or media files in the dataset
        
            See subclasses for dataset-specific details.
            :return:
            """
            mytrials = []  # actually load your trial data...
            self.trials.extend( mytrials )
        
        def get_signal_metadata(self, signal_type):
            """
            Returns a dict containing the requested signal's metadata. Mandatory keys include:
            - 'signal_type' (the signal type)
            - 'sample_rate' (in samples per second)
            - 'n_channels' (the number of channels in the signal)
        
            See subclasses for implementation-specific keys that may also be present.
        
            :param signal_type: the type of signal for which to retrieve the metadata.
            :return: a dict containing the requested signal's metadata
            """
            if signal_type not in self._signal_types:
                raise ValueError('Signal type {} is not known in this AERTrial'.format(signal_type))
        
            if signal_type == 'ECG':
                return {
                    'n_channels': 2,
                    'sample_rate': 256
                }
                
            return {}
    
  3. Create a new class as a subclass of AERTrial like so:

      import numpy as np

      from ardt.datasets import AERTrial
      
      class MyAwesomeDatasetTrial(AERTrial):
          def __init__(self, dataset, participant_id, media_id):
              super().__init__(dataset, participant_id, media_id)
      
          def load_signal_data(self, signal_type):
              """
              Loads and returns the requested signal as an (N+1)xM numpy array, where N is the number of channels, and M is
              the number of samples in the signal. The row at N=0 represents the timestamp of each sample. The value is
              given in epoch time if a real start time is available, otherwise it is in elapsed milliseconds with 0
              representing the start of the sample.
      
              :param signal_type:
              :return:
              """
              if signal_type not in self._signal_types:
                  raise ValueError('Signal type {} is not known in this AERTrial'.format(signal_type))
      
              return np.empty(0)
      
          def load_ground_truth(self):
              """
              Returns the ground truth label for this trial. For AER trials, this is the quadrant within the A/V space,
              numbered 0 through 3 as follows:
              - 0: High Arousal, High Valence
              - 1: High Arousal, Low Valence
              - 2: Low Arousal, Low Valence
              - 3: Low Arousal, High Valence
      
              :return: The ground truth label for this trial
              """
              return 0
      
          def get_signal_metadata(self, signal_type):
              """
              Returns a dict containing the requested signal's metadata. Mandatory keys include:
              - 'signal_type' (the signal type)
              - 'sample_rate' (in samples per second)
              - 'n_channels' (the number of channels in the signal)
      
              See subclasses for implementation-specific keys that may also be present.
      
              :param signal_type: the type of signal for which to retrieve the metadata.
              :return: a dict containing the requested signal's metadata
              """
              if signal_type not in self._signal_types:
                  raise ValueError('Signal type {} is not known in this AERTrial'.format(signal_type))
      
              response = self.dataset.get_signal_metadata(signal_type)
              response['duration'] = 60  # e.g., the duration of this trial's signal, in seconds
      
              return response    
      

      The AERTrial takes a reference to the dataset that created it, and the participant_id and media_id that this trial represents. It must implement load_signal_data and load_ground_truth as documented. It may optionally override get_signal_metadata to augment the response from the dataset, for example, to include signal duration.
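
The quadrant convention documented for load_ground_truth can be expressed as a small helper (illustrative only; av_quadrant is not part of the ARDT API):

```python
def av_quadrant(arousal_high, valence_high):
    """Map high/low arousal and valence to the quadrant labels 0-3
    documented for load_ground_truth."""
    if arousal_high and valence_high:
        return 0      # High Arousal, High Valence
    if arousal_high:
        return 1      # High Arousal, Low Valence
    if not valence_high:
        return 2      # Low Arousal, Low Valence
    return 3          # Low Arousal, High Valence

print([av_quadrant(a, v) for a, v in
       [(True, True), (True, False), (False, False), (False, True)]])
# → [0, 1, 2, 3]
```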

There is more to it than this, but this should be enough to get you started. See the AERDataset and AERTrial classes for method documentation, and then the CUADS, ASCERTAIN, and DREAMER examples for guidance.

Contributing

We are happy to support you by accepting pull requests that make this library more broadly applicable, or by accepting issues to do the same. If you have an AER dataset you would like us to integrate, please open an issue for that as well, but we will be unable to process issues requesting integration with non-AER datasets at this time.

If you would like to get involved by maintaining dataset integrations in other areas of research, please get in touch and we'd be happy to have the help!
