A python library to develop genomic data simulators
Project description
HaploDynamics
The python library HaploDynamics, or HaploDX for short, provides a collection of functions that can be used to simulate population-specific genomic data. This package is part of the Genetic Simulator Resources (GSR) catalog (click the link below for more information).
New features recently added
- Installation via
pip
; - A class-based module
Framework
for software development and experimentations.
Installation
Manual installation
Download the github package by using the following command in a terminal.
$ git clone https://github.com/remytuyeras/HaploDynamics.git
Then, from your current directory, record the absolute path leading to the downloaded package, as shown below.
$ ls
HaploDynamics
$ cd HaploDynamics/
$ pwd
absolute/path/to/HaploDynamics
To import the modules of the library to your script, use the following syntax where you need replace absolute/path/to/HaploDynamics
with the path that you obtained above.
import sys
sys.path.insert (0,"absolute/path/to/HaploDynamics")
import HaploDynamics.HaploDX as hdx
import HaploDynamics.Framework as hdx_frm
Installation via pip
Install the package by using pip install
as follows.
$ pip install HaploDynamics
Then, you can import the modules of the library to your script as follows.
import HaploDynamics.HaploDX as hdx
import HaploDynamics.Framework as hdx_frm
Quick start
The following script generates a VCF file containing simulated diploid genotypes for a population of 1000 individuals with LD-blocks of length 20kb, 5kb, 20kb, 35kb, 30kb and 15kb.
import HaploDynamics.HaploDX as hdx
simulated_data = hdx.genmatrix([20,5,20,35,30,15],strength=1,population=0.1,Npop=1000)
hdx.create_vcfgz("genomic-data.simulation.v1",*simulated_data)
The equation strength=1
forces a high amount of linkage disequilibrium and the equation population=0.1
increases the likelyhood of the simulated population to have rare mutations (e.g. to simulate a population profile close to African and South-Asian populations).
More generally, the function genmatrix()
takes the following types of parameters:
Parameters | Type | Values |
---|---|---|
blocks |
list[int] |
List of positive integers, ideally between 1 and 40. |
strength |
float |
From -1 (little linkage) to 1 (high linkage) |
population |
float |
From 0 (for more rare mutations) to 1 (for less rare mutations) |
Npop |
int |
Positive integer specifying the number of individuals in the genomic matrix |
The generation of each locus in a VCF file tends to be linear in the parameter Npop
. On average, a genetic variant can take from 0.3 to 0.8 seconds to be generated when Npop=100000
(this may vary depending on your machine). The estimated time complexity for an average machine is shown below.
Use cases
The following script shows how to display linkage disequilibirum correlations for the simulated data.
import matplotlib.pyplot as plt
import HaploDynamics.HaploDX as hdx
simulated_data = hdx.genmatrix([20,20,20,20,20,20],strength=1,population=0.1,Npop=1000)
hdx.create_vcfgz("genomic-data.simulation.v1",*simulated_data)
rel, m, _ = hdx.LD_corr_matrix(simulated_data[0])
plt.imshow(hdx.display(rel,m))
plt.show()
A typical output for the previous script should look as follows.
The following script shows how you can control linkage disequilibrium by using LD-blocks of varying legnths. You can display the graph relating distances between pairs of SNPS to average correlation scores by using the last output of the function LD_corr_matrix()
.
import matplotlib.pyplot as plt
import HaploDynamics.HaploDX as hdx
ld_blocks = [5,5,5,10,20,5,5,5,5,5,5,1,1,1,2,2,10,20,40]
strength=1
population=0.1
Npop = 1000
simulated_data = hdx.genmatrix(ld_blocks,strength,population,Npop)
hdx.create_vcfgz("genomic-data.simulation.v1",*simulated_data)
#Correlations
rel, m, dist = hdx.LD_corr_matrix(simulated_data[0])
plt.imshow(hdx.display(rel,m))
plt.show()
#from SNP-distance to average correlaions
plt.plot([i for i in range(len(dist)-1)],dist[1:])
plt.ylim([0, 1])
plt.show()
Typical outputs for the previous script should look as follows.
Correlations | SNP-distance to average correlations |
---|---|
Finally, the following script shows how you can generate large regions of linkage.
import matplotlib.pyplot as plt
import HaploDynamics.HaploDX as hdx
ld_blocks = [1] * 250
strength=1
population=0.1
Npop = 1000
simulated_data = hdx.genmatrix(ld_blocks,strength,population,Npop)
hdx.create_vcfgz("genomic-data.simulation.v1",*simulated_data)
#Correlations
rel, m, dist = hdx.LD_corr_matrix(simulated_data[0])
plt.imshow(hdx.display(rel,m))
plt.show()
#from SNP-distance to average correlaions
plt.plot([i for i in range(len(dist)-1)],dist[1:])
plt.ylim([0, 1])
plt.show()
Typical outputs for the previous script should look as follows.
Correlations | SNP-distance to average correlations |
---|---|
Functions from HaploDX
You can find a complete presentation (or in fact a thorough tutorial) of the HaploDX module on my personal webpage (here). Below is the list of all functions accessible from the library. It is recommended to first experiment with the functions presented in the Data Generation section.
Population and Allele Frequency Spectrum Modeling
def stochastic_line(a: float,b: float,sigma: float) -> Callable[float,float]
population_mut = stochastic_line(0.08,0.17,0.01)
def afs_distribution(index: int,alpha: float = 4/30) -> float
def afs_intervals(pick: float,alpha: float = 4/30) -> list[float]
def afs_sample(alpha: float = 4/30) -> float
def genotype_schema(alpha: float = 4/30) -> tuple[float,list[float]]
def genotype(hwp: list[float],minor: int) -> tuple[int,int]
gref = lambda g: g[0] if g[0] in [2,0] else g[1]
def population_mld(t: float) -> tuple[float,float,float]
LD and Hardy–Weinberg Principle Modeling
def decay(initial: float,halfwidth: float,shift: float) -> Callable[float,float]
def ref_alt_function(y: float,x: float) -> float
def alt_alt_function(y: float,z: float,x: float) -> float
def amplifier(beta: float,p: float,q: float,s: float = 1) -> float
def lb_freq(beta: float,gamma: float,previous_freq: float,distance: float,shift: float) -> float
def ub_freq(beta: float,gamma: float,previous_freq: float,distance: float,shift: float) -> float
def linkage_disequilibrium(alpha: float,beta: float,gamma: float,strength: float = -1) -> Callable[float,Callabel[float,tuple[float,float]]]
def cond_genotype_schema(previous_maf: float,distance: float,alpha: float,beta: float,gamma: float,strength: float = -1) -> tuple[float,list[float],float]
Data Generation
def SNP_distribution(reference: float,length: float) -> list[float]
Description
- generates the list of positions for the VCF file
def initiate_block(reference: float,alpha: float,Npop: int = 1000) -> tuple[float,list[list],list[list]]
Description
- initializes the first LD-block of the simulation
- the parameter
reference
refers to the first locus position at which the generation starts
def continue_block(maf0: float,pre_matrix: list[list],matrix: list[list],positions: list[float],alpha: float,beta: float,gamma: float,strength: int = -1,Npop: int = 1000) -> tuple[float,list[list],list[list]]
Description
- generates an LD-block from a given position with a given minor allele frequency
- augments the genomic matrix with further genetic variants as specified by the arguments
def genmatrix(blocks: list[int],strength: float,population: float,Npop: int)
Description
- implements a basic genomic matrix generator using
SNP_distribution()
,initiate_block()
andcontinue_block()
- note: many possible improvements or variations of this function are possible (see tutorial here for more detail)
def gt_vcf(value: int)-> str
def create_vcfgz(vcf_name: str,matrix: list[list],alpha: float,beta: float,gamma: float,system: str = "unix") -> None
Description
- generates a VCF file from a
matrix
generated byinitiate_block()
,continue_block()
orgenmatrix()
LD Analytics
def LD_corr_matrix(matrix: list[list]) -> tuple[list[list],float,list[float]]
Description
- generates a (non-normalized) correlation matrix from a
matrix
generated byinitiate_block()
,continue_block()
orgenmatrix()
def LD_r2_matrix(pre_matrix: list[list]) -> tuple[list[list],float,list[float]]
Description
- generates a (non-normalized) LD-r2 matrix from a
pre_matrix
generated byinitiate_block()
,continue_block()
- note: this function cannot be used with
genmatrix()
since thepre_matrix
is not returned
def display(rel: list[list],m: float) -> list[list[tuple[float,float,float]]]
def minor_haplotype(sub_pre_matrix: list[list]) -> float
Classes from Framework
To be added.
Project details
Release history Release notifications | RSS feed
Download files
Download the file for your platform. If you're not sure which to choose, learn more about installing packages.