A python library to develop genomic data simulators
Project description
HaploDynamics
The python library HaploDynamics, or HaploDX for short, provides a collection of functions that can be used to simulate population-specific genomic data. This package is part of the Genetic Simulator Resources (GSR) catalog (click the link below for more information).
News
- Adding installation procedure via
pip
- Adding the module
Framework
for a class-based development framework
Installation
Manual installation
Download the github package using the following command.
$ git clone https://github.com/remytuyeras/HaploDynamics.git
Then, from your current directory, record the absolute path leading to the main module of the package, as shown below.
$ ls
HaploDynamics
$ cd HaploDynamics/
$ pwd
path/to/main/module/HaploDynamics
To import the modules of the library to your script, use the following syntax where you must replace path/to/main/module/HaploDynamics
with the path that you obtained above.
import sys
sys.path.insert (0,"path/to/main/module/HaploDynamics")
import HaploDynamics.HaploDX as hdx
import HaploDynamics.Framework as hdx_frm
Pip [still in progress]
Install the package using pip install
as follows.
$ pip install HaploDynamics
Then, you can import the modules of the library to your script as follows.
import HaploDynamics.HaploDX as hdx
import HaploDynamics.Framework as hdx_frm
Quick start
The following script generates a VCF file containing simulated diploid genotypes for a population of 1000 individuals with LD-blocks of length 20kb, 5kb, 20kb, 35kb, 30kb and 15kb.
import HaploDynamics.HaploDX as hdx
simulated_data = hdx.genmatrix([20,5,20,35,30,15],strength=1,population=0.1,Npop=1000)
hdx.create_vcfgz("genomic-data.simulation.v1",*simulated_data)
The equation strength=1
forces a high amount of linkage disequilibrium and the equation population=0.1
increases the likelyhood of the simulated population to have rare mutations (e.g. to simulate a population profile close to African and South-Asian populations).
More generally, the function genmatrix()
takes the following types of parameters:
Parameters | Type | Values |
---|---|---|
blocks |
list[int] |
List of positive integers, ideally between 1 and 40. |
strength |
float |
From -1 (little linkage) to 1 (high linkage) |
population |
float |
From 0 (for more rare mutations) to 1 (for less rare mutations) |
Npop |
int |
Positive integer specifying the number of individuals in the genomic matrix |
The generation of each locus in a VCF file tend to be linear in the parameter Npop
. For example, one genetic variant can take from 0.3 to 0.8 seconds to be generated/simulated when we set Npop=100000
(this may vary depending on your machine). The estimated time complexity for an average machine is shown below.
The following script shows how to display the linkage disequilibirum correlations associated with the simulated data.
import matplotlib.pyplot as plt
import HaploDynamics.HaploDX as hdx
simulated_data = hdx.genmatrix([20,5,20,35,30,15],strength=1,population=0.1,Npop=1000)
hdx.create_vcfgz("genomic-data.simulation.v1",*simulated_data)
rel, m, _ = hdx.LD_corr_matrix(simulated_data[0])
plt.imshow(hdx.display(rel,m))
plt.show()
The following plot is an example of the output that can be returned by the previous script when using 6 LD-blocks of 20kb each.
HaploDX Functions
You can find a complete presentation (or in fact a thorough tutorial) of the HaploDX library on my personal webpage (here). Below is the list of all functions accessible from the library. It is recommended to first experiment with the functions presented in the Data Generation section.
Population and Allele Frequency Spectrum Modeling
def stochastic_line(a: float,b: float,sigma: float) -> Callable[float,float]
population_mut = stochastic_line(0.08,0.17,0.01)
def afs_distribution(index: int,alpha: float = 4/30) -> float
def afs_intervals(pick: float,alpha: float = 4/30) -> list[float]
def afs_sample(alpha: float = 4/30) -> float
def genotype_schema(alpha: float = 4/30) -> tuple[float,list[float]]
def genotype(hwp: list[float],minor: int) -> tuple[int,int]
gref = lambda g: g[0] if g[0] in [2,0] else g[1]
def population_mld(t: float) -> tuple[float,float,float]
LD and Hardy–Weinberg Principle Modeling
def decay(initial: float,halfwidth: float,shift: float) -> Callable[float,float]
def ref_alt_function(y: float,x: float) -> float
def alt_alt_function(y: float,z: float,x: float) -> float
def amplifier(beta: float,p: float,q: float,s: float = 1) -> float
def lb_freq(beta: float,gamma: float,previous_freq: float,distance: float,shift: float) -> float
def ub_freq(beta: float,gamma: float,previous_freq: float,distance: float,shift: float) -> float
def linkage_disequilibrium(alpha: float,beta: float,gamma: float,strength: float = -1) -> Callable[float,Callabel[float,tuple[float,float]]]
def cond_genotype_schema(previous_maf: float,distance: float,alpha: float,beta: float,gamma: float,strength: float = -1) -> tuple[float,list[float],float]
Data Generation
def SNP_distribution(reference: float,length: float) -> list[float]
Description
- generates the list of positions for the VCF file
def initiate_block(reference: float,alpha: float,Npop: int = 1000) -> tuple[float,list[list],list[list]]
Description
- initializes the first LD-block of the simulation
- the parameter
reference
refers to the first locus position at which the generation starts
def continue_block(maf0: float,pre_matrix: list[list],matrix: list[list],positions: list[float],alpha: float,beta: float,gamma: float,strength: int = -1,Npop: int = 1000) -> tuple[float,list[list],list[list]]
Description
- generates an LD-block from a given position with a given minor allele frequency
- augments the genomic matrix with further genetic variants as specified by the arguments
def genmatrix(blocks: list[int],strength: float,population: float,Npop: int)
Description
- implements a basic genomic matrix generator using
SNP_distribution()
,initiate_block()
andcontinue_block()
- note: many possible improvements or variations of this function are possible (see tutorial here for more detail)
def gt_vcf(value: int)-> str
def create_vcfgz(vcf_name: str,matrix: list[list],alpha: float,beta: float,gamma: float,system: str = "unix") -> None
Description
- generates a VCF file from a
matrix
generated byinitiate_block()
,continue_block()
orgenmatrix()
LD Analytics
def LD_corr_matrix(matrix: list[list]) -> tuple[list[list],float,list[float]]
Description
- generates a (non-normalized) correlation matrix from a
matrix
generated byinitiate_block()
,continue_block()
orgenmatrix()
def LD_r2_matrix(pre_matrix: list[list]) -> tuple[list[list],float,list[float]]
Description
- generates a (non-normalized) LD-r2 matrix from a
pre_matrix
generated byinitiate_block()
,continue_block()
- note: this function cannot be used with
genmatrix()
since thepre_matrix
is not returned
def display(rel: list[list],m: float) -> list[list[tuple[float,float,float]]]
def minor_haplotype(sub_pre_matrix: list[list]) -> float
Project details
Release history Release notifications | RSS feed
Download files
Download the file for your platform. If you're not sure which to choose, learn more about installing packages.