Skip to main content

A python library to develop genomic data simulators

Project description

HaploDynamics

The python library HaploDynamics, or HaploDX for short, provides a collection of functions that can be used to simulate population-specific genomic data. This package is part of the Genetic Simulator Resources (GSR) catalog (click the link below for more information).

Catalogued on GSR

News

  • Adding installation procedure via pip
  • Adding the module Framework for a class-based development framework

Installation

Manual installation

Download the github package using the following command.

$ git clone https://github.com/remytuyeras/HaploDynamics.git

Then, from your current directory, record the absolute path leading to the main module of the package, as shown below.

$ ls
HaploDynamics
$ cd HaploDynamics/
$ pwd
path/to/main/module/HaploDynamics

To import the modules of the library to your script, use the following syntax where you must replace path/to/main/module/HaploDynamics with the path that you obtained above.

import sys
sys.path.insert (0,"path/to/main/module/HaploDynamics")
import HaploDynamics.HaploDX as hdx
import HaploDynamics.Framework as hdx_frm

Pip [still in progress]

Install the package using pip install as follows.

$ pip install HaploDynamics

Then, you can import the modules of the library to your script as follows.

import HaploDynamics.HaploDX as hdx
import HaploDynamics.Framework as hdx_frm

Quick start

The following script generates a VCF file containing simulated diploid genotypes for a population of 1000 individuals with LD-blocks of length 20kb, 5kb, 20kb, 35kb, 30kb and 15kb.

import HaploDynamics.HaploDX as hdx

simulated_data = hdx.genmatrix([20,5,20,35,30,15],strength=1,population=0.1,Npop=1000)
hdx.create_vcfgz("genomic-data.simulation.v1",*simulated_data)

The equation strength=1 forces a high amount of linkage disequilibrium and the equation population=0.1 increases the likelyhood of the simulated population to have rare mutations (e.g. to simulate a population profile close to African and South-Asian populations).

More generally, the function genmatrix() takes the following types of parameters:

Parameters Type Values
blocks list[int] List of positive integers, ideally between 1 and 40.
strength float From -1 (little linkage) to 1 (high linkage)
population float From 0 (for more rare mutations) to 1 (for less rare mutations)
Npop int Positive integer specifying the number of individuals in the genomic matrix

The generation of each locus in a VCF file tend to be linear in the parameter Npop. For example, one genetic variant can take from 0.3 to 0.8 seconds to be generated/simulated when we set Npop=100000 (this may vary depending on your machine). The estimated time complexity for an average machine is shown below.

GitHub Logo

The following script shows how to display the linkage disequilibirum correlations associated with the simulated data.

import matplotlib.pyplot as plt
import HaploDynamics.HaploDX as hdx

simulated_data = hdx.genmatrix([20,5,20,35,30,15],strength=1,population=0.1,Npop=1000)
hdx.create_vcfgz("genomic-data.simulation.v1",*simulated_data)

rel, m, _ = hdx.LD_corr_matrix(simulated_data[0])
plt.imshow(hdx.display(rel,m))
plt.show()

The following plot is an example of the output that can be returned by the previous script when using 6 LD-blocks of 20kb each.

alt text

HaploDX Functions

You can find a complete presentation (or in fact a thorough tutorial) of the HaploDX library on my personal webpage (here). Below is the list of all functions accessible from the library. It is recommended to first experiment with the functions presented in the Data Generation section.

Population and Allele Frequency Spectrum Modeling

def stochastic_line(a: float,b: float,sigma: float) -> Callable[float,float]
population_mut = stochastic_line(0.08,0.17,0.01)
def afs_distribution(index: int,alpha: float = 4/30) -> float
def afs_intervals(pick: float,alpha: float = 4/30) -> list[float]
def afs_sample(alpha: float = 4/30) -> float
def genotype_schema(alpha: float = 4/30) -> tuple[float,list[float]]
def genotype(hwp: list[float],minor: int) -> tuple[int,int]
gref = lambda g: g[0] if g[0] in [2,0] else g[1]
def population_mld(t: float) -> tuple[float,float,float]

LD and Hardy–Weinberg Principle Modeling

def decay(initial: float,halfwidth: float,shift: float) -> Callable[float,float]
def ref_alt_function(y: float,x: float) -> float
def alt_alt_function(y: float,z: float,x: float) -> float
def amplifier(beta: float,p: float,q: float,s: float = 1) -> float
def lb_freq(beta: float,gamma: float,previous_freq: float,distance: float,shift: float) -> float
def ub_freq(beta: float,gamma: float,previous_freq: float,distance: float,shift: float) -> float
def linkage_disequilibrium(alpha: float,beta: float,gamma: float,strength: float = -1) -> Callable[float,Callabel[float,tuple[float,float]]]
def cond_genotype_schema(previous_maf: float,distance: float,alpha: float,beta: float,gamma: float,strength: float = -1) -> tuple[float,list[float],float]

Data Generation

def SNP_distribution(reference: float,length: float) -> list[float]

Description

  • generates the list of positions for the VCF file
def initiate_block(reference: float,alpha: float,Npop: int = 1000) -> tuple[float,list[list],list[list]]

Description

  • initializes the first LD-block of the simulation
  • the parameter reference refers to the first locus position at which the generation starts
def continue_block(maf0: float,pre_matrix: list[list],matrix: list[list],positions: list[float],alpha: float,beta: float,gamma: float,strength: int = -1,Npop: int = 1000) -> tuple[float,list[list],list[list]]

Description

  • generates an LD-block from a given position with a given minor allele frequency
  • augments the genomic matrix with further genetic variants as specified by the arguments
def genmatrix(blocks: list[int],strength: float,population: float,Npop: int)

Description

  • implements a basic genomic matrix generator using SNP_distribution(), initiate_block() and continue_block()
  • note: many possible improvements or variations of this function are possible (see tutorial here for more detail)
def gt_vcf(value: int)-> str
def create_vcfgz(vcf_name: str,matrix: list[list],alpha: float,beta: float,gamma: float,system: str = "unix") -> None

Description

  • generates a VCF file from a matrix generated by initiate_block(), continue_block() or genmatrix()

LD Analytics

def LD_corr_matrix(matrix: list[list]) -> tuple[list[list],float,list[float]]

Description

  • generates a (non-normalized) correlation matrix from a matrix generated by initiate_block(), continue_block() or genmatrix()
def LD_r2_matrix(pre_matrix: list[list]) -> tuple[list[list],float,list[float]]

Description

  • generates a (non-normalized) LD-r2 matrix from a pre_matrix generated by initiate_block(), continue_block()
  • note: this function cannot be used with genmatrix() since the pre_matrix is not returned
def display(rel: list[list],m: float) -> list[list[tuple[float,float,float]]]
def minor_haplotype(sub_pre_matrix: list[list]) -> float

Project details


Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

HaploDynamics-0.1b7.tar.gz (24.4 kB view hashes)

Uploaded Source

Supported by

AWS AWS Cloud computing and Security Sponsor Datadog Datadog Monitoring Fastly Fastly CDN Google Google Download Analytics Microsoft Microsoft PSF Sponsor Pingdom Pingdom Monitoring Sentry Sentry Error logging StatusPage StatusPage Status page