Comparison of real and simulated AIRR datasets
Project description
evalAIRR
A tool that allows comparison of real and simulated datasets by providing different statistical indicators and dataset visualizations in one report.
Installation
It is recommended to use a virtual python environment to run evalAIRR if another python environment is used. Here is a quick guide on how you can set up a virtual environment:
https://docs.python.org/3/tutorial/venv.html#creating-virtual-environments
Install using pip
Run this command to install the evalAIRR package:
pip install evalairr
Quickstart
evalAIRR uses a YAML file for configuration. If you are unfamiliar with how YAML files are structured, read this guide to the syntax:
https://docs.fileformat.com/programming/yaml/#syntax
This is the stucture of a sample report configuration file you can use to start off with (it is included in the repository location ./yaml_files/quickstart.yaml):
datasets:
real:
path: ./data/encoded_real_1000_200.csv
sim:
path: ./data/encoded_sim_1000_200.csv
reports:
feature_based:
report1:
features:
- TGT
- ANV
report_types:
- ks
- distr_densityplot
output:
path: ./output/report.html
This report will process the two provided datasets (real and simulated) with encoded kmer data, and create an HTML report with feature-based report types - Kolmogorov–Smirnov test (indicated by ks
) and a feature distribution density plot (indicated by distr_densityplot
) for the features TGT
and ANV
. It will then export the report to the path ./output/report.html
. More details on what reports can be created can be found in the YAML Configuration Guidelines section.
The repository contains sample datafiles and a quickstart YAML configuration files. You can clone the repository and run evalAIRR within it to use sample data.
Within the cloned repository run the command:
evalairr -i ./yaml_files/quickstart.yaml
The report will be generated in the specified output path in the configuration file or, if a specific path is not provided, in <CURRENT_DIRECTORY>/output/report.html
. The report is exported in the HTML format.
YAML Configuration Guidelines
The configuration YAML file consists of 3 main sections: datasets
, reports
and output
.
Datasets
In the datasets
section, you have to provide paths to a real and a simulated datasets that you are comparing. CSV files with encoded kmer data are supported. This can be done by specifying the file path of each file in the path
variable under the sections real
and sim
respectively. Here is an example of how a configured datasets
section looks like:
datasets:
real:
path: ./data/encoded_real_1000_200.csv
sim:
path: ./data/encoded_sim_1000_200.csv
Reports
In the reports
section, you can provide the list of report types you want to create and their parameters. There are three types of report groups depending on the different parameters: feature_based
, observation_based
and generic
. Here is the list of reports you can create that compare the features of the real dataset with the simulated dataset:
Feature-based reports
ks
- Kolmogorov–Smirnov statistic. Parameters: list of features you are creating the report for.distr_histogram
- feature distribution histogram. Parameters: list of features you are creating the report for.distr_boxplot
- feature distribution boxplot. Parameters: list of features you are creating the report for.distr_violinplot
- feature distribution violin plot. Parameters: list of features you are creating the report for.distr_densityplot
- feature distribution density plot. Parameters: list of features you are creating the report for.distance
- Euclidean distance between the real and simulated feature. Parameters: list of features you are creating the report for.statistics
- statistical indicators (average, median, standard deviation and variance) of a feature in both real and simulated datasets. Parameters: list of features you are creating the report for.
Observation-based reports
observation_distr_histogram
- observation distribution histogram. Parameters: list of observations you are creating the report for.observation_distr_boxplot
- observation distribution boxplot. Parameters: list of observations you are creating the report for.observation_distr_violinplot
- observation distribution violin plot. Parameters: list of observations you are creating the report for.observation_distr_densityplot
- observation distribution density plot. Parameters: list of observations you are creating the report for. The observation indexall
can be used to report on all observations in one plot.observation_distance
- Euclidean distance between the real and simulated observation. Parameters: list of observations you are creating the report for.observation_statistics
- statistical indicators (average, median, standard deviation and variance) of an observation in both real and simulated datasets. Parameters: list of observations you are creating the report for.
Additional parameters: with_ml_sim
- optional parameter, which if True, instructs the report to include a comparison with a generated dataset using a GaussianProcessRegressor machine learning model trained on the real dataset (currently, only applies to the report type observation_distr_densityplot
in reports with all
observations). ml_random_state
- optional integer parameter, relevant only if with_ml_sim
is set to True, which sets a seed in the machine learning model random number generation.
General reports
ks
- Kolmogorov–Smirnov statistic for all features. Parameters:output
- optional parameter, that specifies the path of text/csv file the results will be exported to (default value is set to./output/ks.csv
).copula_2d
- a 2D scatter plot that displays two features in a Gausian Multivariate copula space. Parameters: a report section of any name, under which the compared features are specified.copula_3d
- a 3D scatter plot that displays three features in a Gausian Multivariate copula space. Parameters: a report section of any name, under which the compared features are specified.feature_average_vs_variance
- a scatter plot that displays the average value of every feature on one axis and the variance of every feature on the other axis. Parameters:with_ml_sim
- optional parameter, which if True, instructs the report to include a comparison with a generated dataset using a GaussianProcessRegressor machine learning model trained on the real dataset.ml_random_state
- optional integer parameter, relevant only ifwith_ml_sim
is set to True, which sets a seed in the machine learning model random number generation.observation_average_vs_variance
- a scatter plot that displays the average value of every observation on one axis and the variance of every observation on the other axis. Parameters:with_ml_sim
- optional parameter, which if True, instructs the report to include a comparison with a generated dataset using a GaussianProcessRegressor machine learning model trained on the real dataset.ml_random_state
- optional integer parameter, relevant only ifwith_ml_sim
is set to True, which sets a seed in the machine learning model random number generation.corr
- correlation matrix heatmaps of the real and simulated datasets. Parameters:reduce_to_n_features
- an optional parameter for dimensionality reduction using PCA. The number of features to reduce the dataset to (must be reduce_to_n_features < min(n_observations, n_features)).with_ml_sim
- optional parameter, which if True, instructs the report to include a comparison with a generated dataset using a GaussianProcessRegressor machine learning model trained on the real dataset.ml_random_state
- optional integer parameter, relevant only ifwith_ml_sim
is set to True, which sets a seed in the machine learning model random number generation.pca_2d
- two scatter plots with both datasets reduced to two dimensions using PCA. Parameters:with_ml_sim
- optional parameter, which if True, instructs the report to include a comparison with a generated dataset using a GaussianProcessRegressor machine learning model trained on the real dataset.ml_random_state
- optional integer parameter, relevant only ifwith_ml_sim
is set to True, which sets a seed in the machine learning model random number generation.distance
- Euclidean distance between the real and simulated features. Parameters:output
- optional parameter, that specifies the path of text/csv file the results will be exported to (default value is set to./output/dist.csv
).statistics
- statistical indicators (average, median, standard deviation and variance) of all features in both real and simulated datasets. Parameters:output_dir
- optional parameter, that specifies the directory for the csv files in which the csv result filesreal_stat.csv
andsim_stat.csv
will be exported to (default value is set to./output/
). Each csv file contain four rows, each with a different statistic: 1 - average, 2 - median, 3 - standard deviation, 4 - variance.observation_distance
- Euclidean distance between the real and simulated observations. Parameters:output
- optional parameter, that specifies the path of text/csv file the results will be exported to (default value is set to./output/obs_dist.csv
).observation_statistics
- statistical indicators (average, median, standard deviation and variance) of all observation in both real and simulated datasets. Parameters:output_dir
- optional parameter, that specifies the directory for the csv files in which the csv result filesreal_obs_stat.csv
andsim_obs_stat.csv
will be exported to (default value is set to./output/
). Each csv file contain four rows, each with a different statistic: 1 - average, 2 - median, 3 - standard deviation, 4 - variance.
Here is a sample reports
section of a configuration file containing all of the reports:
reports:
feature_based:
report1:
features:
- TGT
- ANV
report_types:
- ks
- distr_histogram
- distr_boxplot
- distr_violinplot
- distr_densityplot
- distance
- statistics
observation_based:
report1:
observations:
- 0
report_types:
- observation_distr_histogram
- observation_distr_boxplot
- observation_distr_violinplot
- observation_distr_densityplot
- observation_distance
- observation_statistics
report2:
observations:
- all
report_types:
- observation_distr_densityplot
with_ml_sim: True
ml_random_state: 0
general:
copula_2d:
report1:
- TGT
- ANV
copula_3d:
report1:
- TGT
- ANV
- CAS
feature_average_vs_variance:
with_ml_sim: True
ml_random_state: 0
observation_average_vs_variance:
with_ml_sim: True
ml_random_state: 0
corr:
reduce_to_n_features: 150
with_ml_sim: True
ml_random_state: 0
pca_2d:
with_ml_sim: True
ml_random_state: 0
ks:
output: ./output/ks.csv
statistics:
output_dir: ./output/
observation_statistics:
output_dir: ./output/
distance:
output: ./output/dist.csv
observation_distance:
output: ./output/obs_dist.csv
Output
An optional section where you can specify the file path of the generated report. The default path of the generated report is <CURRENT_DIRECTORY>/output/report.html
. The report is exported in the HTML format. If you declare the path as 'NONE', the report will not be created.
An example output section:
output:
path: ./output/report.html
Project details
Release history Release notifications | RSS feed
Download files
Download the file for your platform. If you're not sure which to choose, learn more about installing packages.
Source Distribution
Built Distribution
Hashes for evalAIRR-0.0.13-py3-none-any.whl
Algorithm | Hash digest | |
---|---|---|
SHA256 | af50992535858fe91cc315ae9842b3f7276e8bc32d9758fe6eda7857f7f2ba01 |
|
MD5 | 11a98b2453dee3e848d1de543feb3843 |
|
BLAKE2b-256 | 7f7e99262ad37e7086254472358782e1d249811f7037bc932d1dbef8fbbb6e2d |