Skip to main content

Comparison of real and simulated AIRR datasets

Project description

evalAIRR

A tool that allows comparison of real and simulated AIRR datasets by providing different statistical indicators and dataset visualizations in one report.

Installation

It is recommended to use a virtual python environment to run evalAIRR if another python environment is used. Here is a quick guide on how you can set up a virtual environment:

https://docs.python.org/3/tutorial/venv.html#creating-virtual-environments

Install using pip

Run this command to install the evalAIRR package:

pip install evalairr

Quickstart

evalAIRR uses a YAML file for configuration. If you are unfamiliar with how YAML files are structured, read this guide to the syntax:

https://docs.fileformat.com/programming/yaml/#syntax

This is the stucture of a sample report configuration file you can use to start off with (it is included in the repository location ./yaml_files/quickstart.yaml):

datasets:
  real:
    path: ./data/real_data.csv
  sim:
    path: ./data/sim_data.csv
reports:
  feature_based:
    report1:
      features:
        - CAS
        - SAS
      report_types:
        - ks
        - distr_densityplot
        - distance
        - statistics
  observation_based:
    report1:
      observations:
        - all
      report_types:
        - distr_densityplot
  general:
    feat_mean_vs_variance:
    obs_mean_vs_variance:
    pca_2d_feat:
    pca_2d_obs:
    corr_feat_hist:
    corr_obs_hist:
output:
  path: ./output/report.html

This report will process the two provided datasets (real and simulated) with encoded kmer data, and create an HTML report with multiple report types. These include feature-based report types - Kolmogorov–Smirnov test (indicated by report type ks), a feature distribution density plot (indicated by report type distr_densityplot), Euclidean distance measures (indicated by report type distance) and descriptive statistics (mean, median, variance and standard deviation)(indicated by report type statistics) for the features CAS and SAS. It will then export the report to the path ./output/report.html. It will also create an observation-based report with the feature distribution density plot (indicated by report type distr_densityplot) for all observations (indicated by keyword all in the observation list). Finally, these general reports will be generated: feature mean compared with feature variance (indicated by report type feat_mean_vs_variance), observation mean compared with observation variance (indicated by report type obs_mean_vs_variance), two dimensional representation of all features using PCA (indicated by report type pca_2d_feat), two dimensional representation of all observations using PCA (indicated by report type pca_2d_obs), feature correlation coefficient distribution histogram (indicated by report type corr_feat_hist) and an observation correlation coefficient distribution histogram (indicated by report type corr_obs_hist). More details on what reports can be created can be found in the YAML Configuration Guidelines section.

The repository contains sample datafiles and a quickstart YAML configuration files. You can clone the repository and run evalAIRR within it to use sample data.

Within the cloned repository run the command:

evalairr -i ./yaml_files/quickstart.yaml

The report will be generated in the specified output path in the configuration file or, if a specific path is not provided, in <CURRENT_DIRECTORY>/output/report.html. The report is exported in the HTML format.

YAML Configuration Guidelines

The configuration YAML file consists of 3 main sections: datasets, reports and output.

Datasets

In the datasets section, you have to provide paths to a real and a simulated datasets that you are comparing. CSV files with encoded kmer data are supported. This can be done by specifying the file path of each file in the path variable under the sections real and sim respectively. Here is an example of how a configured datasets section looks like:

datasets:
  real:
    path: ./data/real_data.csv
  sim:
    path: ./data/sim_data.csv

Reports

In the reports section, you can provide the list of report types you want to create and their parameters. There are three types of report groups depending on the different parameters: feature_based, observation_based and generic. Here is the list of reports you can create that compare the features of the real dataset with the simulated dataset:

Feature-based reports

  • ks - Kolmogorov–Smirnov statistic. Parameters: list of features you are creating the report for.
  • distr_histogram - feature distribution histogram. Parameters: list of features you are creating the report for.
  • distr_boxplot - feature distribution boxplot. Parameters: list of features you are creating the report for.
  • distr_violinplot - feature distribution violin plot. Parameters: list of features you are creating the report for.
  • distr_densityplot - feature distribution density plot. Parameters: list of features you are creating the report for.
  • distance - Euclidean distance between the real and simulated feature. Parameters: list of features you are creating the report for.
  • statistics - statistical indicators (mean, median, standard deviation and variance) of a feature in both real and simulated datasets. Parameters: list of features you are creating the report for.

Observation-based reports

  • ks - Kolmogorov–Smirnov statistic. Parameters: list of observations you are creating the report for.
  • distr_histogram - observation distribution histogram. Parameters: list of observations you are creating the report for.
  • distr_boxplot - observation distribution boxplot. Parameters: list of observations you are creating the report for.
  • distr_violinplot - observation distribution violin plot. Parameters: list of observations you are creating the report for.
  • distr_densityplot - observation distribution density plot. Parameters: list of observations you are creating the report for. The observation index all can be used to report on all observations in one plot. with_ml_sim - optional parameter, which if True, instructs the report to include a comparison with a generated dataset using a GaussianProcessRegressor machine learning model trained on the real dataset (only applies in reports with all observations). ml_random_state - optional integer parameter, relevant only if with_ml_sim is set to True, which sets a seed in the machine learning model random number generation.
  • distance - Euclidean distance between the real and simulated observation. Parameters: list of observations you are creating the report for.
  • statistics - statistical indicators (mean, median, standard deviation and variance) of an observation in both real and simulated datasets. Parameters: list of observations you are creating the report for.

General reports

  • ks_feat - Kolmogorov–Smirnov statistic for all features. Parameters: output - optional parameter, that specifies the path of text/csv file the results will be exported to (default value is set to ./output/ks_feat.csv). The csv file contains two rows, with the first row containing the ks-statistic and the second one - the p-values.
  • ks_obs - Kolmogorov–Smirnov statistic for all observations. Parameters: output - optional parameter, that specifies the path of text/csv file the results will be exported to (default value is set to ./output/ks_obs.csv). The csv file contains two rows, with the first row containing the ks-statistic and the second one - the p-values.
  • copula_2d - a 2D scatter plot that displays two features in a Gausian Multivariate copula space. Parameters: a report section of any name, under which the compared features are specified.
  • copula_3d - a 3D scatter plot that displays three features in a Gausian Multivariate copula space. Parameters: a report section of any name, under which the compared features are specified.
  • feat_mean_vs_variance - a scatter plot that displays the mean value of every feature on one axis and the variance of every feature on the other axis. Parameters: with_ml_sim - optional parameter, which if True, instructs the report to include a comparison with a generated dataset using a GaussianProcessRegressor machine learning model trained on the real dataset. ml_random_state - optional integer parameter, relevant only if with_ml_sim is set to True, which sets a seed in the machine learning model random number generation.
  • obs_mean_vs_variance - a scatter plot that displays the mean value of every observation on one axis and the variance of every observation on the other axis. Parameters: with_ml_sim - optional parameter, which if True, instructs the report to include a comparison with a generated dataset using a GaussianProcessRegressor machine learning model trained on the real dataset. ml_random_state - optional integer parameter, relevant only if with_ml_sim is set to True, which sets a seed in the machine learning model random number generation.
  • corr - correlation matrix heatmaps of the real and simulated datasets. Parameters: reduce_to_n_features - an optional parameter for dimensionality reduction using PCA. The number of features to reduce the dataset to (must be reduce_to_n_features < min(n_observations, n_features)). with_ml_sim - optional parameter, which if True, instructs the report to include a comparison with a generated dataset using a GaussianProcessRegressor machine learning model trained on the real dataset. ml_random_state - optional integer parameter, relevant only if with_ml_sim is set to True, which sets a seed in the machine learning model random number generation.
  • corr_feat_hist - feature correlation matrix distribution histogram for the real and simulated datasets. Parameters: n_bins - an optional parameter that sets the number of bins in the histogram (default value is 30). reduce_to_n_features - an optional parameter for dimensionality reduction using PCA. The number of features to reduce the dataset to (must be reduce_to_n_features < min(n_observations, n_features)). with_ml_sim - optional parameter, which if True, instructs the report to include a comparison with a generated dataset using a GaussianProcessRegressor machine learning model trained on the real dataset. ml_random_state - optional integer parameter, relevant only if with_ml_sim is set to True, which sets a seed in the machine learning model random number generation.
  • corr_obs_hist - observation correlation matrix distribution histogram for the real and simulated datasets. Parameters: n_bins - an optional parameter that sets the number of bins in the histogram (default value is 30). reduce_to_n_obs - an optional parameter for dimensionality reduction using PCA. The number of observations to reduce the dataset to (must be reduce_to_n_obs < min(n_observations, n_features)). with_ml_sim - optional parameter, which if True, instructs the report to include a comparison with a generated dataset using a GaussianProcessRegressor machine learning model trained on the real dataset. ml_random_state - optional integer parameter, relevant only if with_ml_sim is set to True, which sets a seed in the machine learning model random number generation.
  • corr_csv - CSV file exporting of the difference between correlation matrices of the real and simulated datasets. Parameters: output - optional parameter, that specifies the path of text/csv file the results will be exported to (default value is set to ./output/corr.csv).
  • pca_2d_feat - two feature-level scatter plots with both datasets reduced to two dimensions using PCA. Parameters: with_ml_sim - optional parameter, which if True, instructs the report to include a comparison with a generated dataset using a GaussianProcessRegressor machine learning model trained on the real dataset. ml_random_state - optional integer parameter, relevant only if with_ml_sim is set to True, which sets a seed in the machine learning model random number generation.
  • pca_2d_obs - two observation-level scatter plots with both datasets reduced to two dimensions using PCA. Parameters: with_ml_sim - optional parameter, which if True, instructs the report to include a comparison with a generated dataset using a GaussianProcessRegressor machine learning model trained on the real dataset. ml_random_state - optional integer parameter, relevant only if with_ml_sim is set to True, which sets a seed in the machine learning model random number generation.
  • distance_feat - Euclidean distance between the real and simulated features. Parameters: output - optional parameter, that specifies the path of text/csv file the results will be exported to (default value is set to ./output/dist.csv).
  • statistics_feat - statistical indicators (mean, median, standard deviation and variance) of all features in both real and simulated datasets. Parameters: output_dir - optional parameter, that specifies the directory for the csv files in which the csv result files real_stat.csv and sim_stat.csv will be exported to (default value is set to ./output/). Each csv file contain four rows, each with a different statistic: 1 - mean, 2 - median, 3 - standard deviation, 4 - variance.
  • distance_obs - Euclidean distance between the real and simulated observations. Parameters: output - optional parameter, that specifies the path of text/csv file the results will be exported to (default value is set to ./output/obs_dist.csv).
  • statistics_obs - statistical indicators (mean, median, standard deviation and variance) of all observation in both real and simulated datasets. Parameters: output_dir - optional parameter, that specifies the directory for the csv files in which the csv result files real_obs_stat.csv and sim_obs_stat.csv will be exported to (default value is set to ./output/). Each csv file contain four rows, each with a different statistic: 1 - mean, 2 - median, 3 - standard deviation, 4 - variance.
  • jensen_shannon_feat - Jensen-Shannon divergence metric between the real and simulated features. Parameters: output - optional parameter, that specifies the path of text/csv file the results will be exported to (default value is set to ./output/jenshan.csv).
  • jensen_shannon_obs - Jensen-Shannon divergence metric between the real and simulated observations. Parameters: output - optional parameter, that specifies the path of text/csv file the results will be exported to (default value is set to ./output/obs_jenshan.csv).

Here is a sample reports section of a configuration file containing all of the reports:

reports:
  feature_based:
    report1:
      features:
        - CAS
        - SAS
      report_types:
        - ks
        - distr_histogram
        - distr_boxplot
        - distr_violinplot
        - distr_densityplot
        - distance
        - statistics
  observation_based:
    report1:
      observations:
        - 20
      report_types:
        - ks
        - distr_histogram
        - distr_boxplot
        - distr_violinplot
        - distr_densityplot
        - distance
        - statistics
    report2:
      observations:
        - all
      report_types:
        - distr_densityplot
      with_ml_sim: True
      ml_random_state: 0
  general:
    copula_2d:
      report1:
        - CAS
        - SAS
    copula_3d:
      report1:
        - CAS
        - SAS
        - TGT
    feat_mean_vs_variance:
      with_ml_sim: True
      ml_random_state: 0
    obs_mean_vs_variance:
      with_ml_sim: True
      ml_random_state: 0
    corr_hist:
      n_bins: 30
      with_ml_sim: True
      ml_random_state: 0
      reduce_to_n_features: 200
    corr:
      reduce_to_n_features: 200
      with_ml_sim: True
      ml_random_state: 0
    pca_2d_feat:
      with_ml_sim: True
      ml_random_state: 0
    pca_2d_obs:
      with_ml_sim: True
      ml_random_state: 0
    corr_csv:
      output: ./output/corr.csv
    ks_feat:
      output: ./output/ks_feat.csv
    ks_obs:
      output: ./output/ks_obs.csv
    statistics_feat:
      output_dir: ./output/
    statistics_obs:
      output_dir: ./output/
    distance_feat:
      output: ./output/dist.csv
    distance_obs:
      output: ./output/obs_dist.csv
    jensen_shannon_feat:
      output: ./output/jenshan.csv
    jensen_shannon_obs:
      output: ./output/obs_jenshan.csv

Output

An optional section where you can specify the file path of the generated report. The default path of the generated report is <CURRENT_DIRECTORY>/output/report.html. The report is exported in the HTML format. If you declare the path as 'NONE', the report will not be created.

An example output section:

output:
  path: ./output/report.html

For example, this output section would result in a report file not being created:

output:
  path: NONE

Project details


Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

evalAIRR-0.0.44.tar.gz (18.1 kB view details)

Uploaded Source

Built Distribution

If you're not sure about the file name format, learn more about wheel file names.

evalAIRR-0.0.44-py3-none-any.whl (17.5 kB view details)

Uploaded Python 3

File details

Details for the file evalAIRR-0.0.44.tar.gz.

File metadata

  • Download URL: evalAIRR-0.0.44.tar.gz
  • Upload date:
  • Size: 18.1 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/4.0.1 CPython/3.11.3

File hashes

Hashes for evalAIRR-0.0.44.tar.gz
Algorithm Hash digest
SHA256 71c3b3b88a7fb26c77c0e9b9fb9baee7a244cfbb33478f818da05d5ee28be0bc
MD5 d05cdcb7660d9e1b348ba6589a336b82
BLAKE2b-256 7b201f0019d6e578c7b833a525a4fa7506597c9ed88618943b386bf0f1acb433

See more details on using hashes here.

File details

Details for the file evalAIRR-0.0.44-py3-none-any.whl.

File metadata

  • Download URL: evalAIRR-0.0.44-py3-none-any.whl
  • Upload date:
  • Size: 17.5 kB
  • Tags: Python 3
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/4.0.1 CPython/3.11.3

File hashes

Hashes for evalAIRR-0.0.44-py3-none-any.whl
Algorithm Hash digest
SHA256 1191185348e52c08a82016edb07504a3e5b3338e96ae87061c2839ef67d9a25a
MD5 f157129640a74065135d9a4fad3f667e
BLAKE2b-256 29571bd119b7192359c616cdfd8fa4342382a2bf75989df3ad2a8fbf47b4b4df

See more details on using hashes here.

Supported by

AWS Cloud computing and Security Sponsor Datadog Monitoring Depot Continuous Integration Fastly CDN Google Download Analytics Pingdom Monitoring Sentry Error logging StatusPage Status page