A python library to develop genomic data simulators

These details have not been verified by PyPI

Project links

GitHub Statistics

View statistics for this project via Libraries.io, or by using our public dataset on Google BigQuery

Project description

HaploDynamics

The python library HaploDynamics, or HaploDX for short, provides a collection of functions that can be used to simulate population-specific genomic data. This package is part of the Genetic Simulator Resources (GSR) catalog (click the link below for more information).

News

Adding installation procedure via pip
Adding the module Framework for a class-based development framework

Installation

Manual installation

Download the github package using the following command.

$ git clone https://github.com/remytuyeras/HaploDynamics.git

Then, from your current directory, record the absolute path leading to the main module of the package, as shown below.

$ ls
HaploDynamics
$ cd HaploDynamics/
$ pwd
path/to/main/module/HaploDynamics

To import the modules of the library to your script, use the following syntax where you must replace path/to/main/module/HaploDynamics with the path that you obtained above.

import sys
sys.path.insert (0,"path/to/main/module/HaploDynamics")
import HaploDynamics.HaploDX as hdx
import HaploDynamics.Framework as hdx_frm

Pip [still in progress]

Install the package using pip install as follows.

$ pip install HaploDynamics

Then, you can import the modules of the library to your script as follows.

import HaploDynamics.HaploDX as hdx
import HaploDynamics.Framework as hdx_frm

Quick start

The following script generates a VCF file containing simulated diploid genotypes for a population of 1000 individuals with LD-blocks of length 20kb, 5kb, 20kb, 35kb, 30kb and 15kb.

import HaploDynamics.HaploDX as hdx

simulated_data = hdx.genmatrix([20,5,20,35,30,15],strength=1,population=0.1,Npop=1000)
hdx.create_vcfgz("genomic-data.simulation.v1",*simulated_data)

The equation strength=1 forces a high amount of linkage disequilibrium and the equation population=0.1 increases the likelyhood of the simulated population to have rare mutations (e.g. to simulate a population profile close to African and South-Asian populations).

More generally, the function genmatrix() takes the following types of parameters:

Parameters	Type	Values
`blocks`	`list[int]`	List of positive integers, ideally between 1 and 40.
`strength`	`float`	From -1 (little linkage) to 1 (high linkage)
`population`	`float`	From 0 (for more rare mutations) to 1 (for less rare mutations)
`Npop`	`int`	Positive integer specifying the number of individuals in the genomic matrix

The generation of each locus in a VCF file tend to be linear in the parameter Npop. For example, one genetic variant can take from 0.3 to 0.8 seconds to be generated/simulated when we set Npop=100000 (this may vary depending on your machine). The estimated time complexity for an average machine is shown below.

GitHub Logo

The following script shows how to display the linkage disequilibirum correlations associated with the simulated data.

import matplotlib.pyplot as plt
import HaploDynamics.HaploDX as hdx

simulated_data = hdx.genmatrix([20,5,20,35,30,15],strength=1,population=0.1,Npop=1000)
hdx.create_vcfgz("genomic-data.simulation.v1",*simulated_data)

rel, m, _ = hdx.LD_corr_matrix(simulated_data[0])
plt.imshow(hdx.display(rel,m))
plt.show()

The following plot is an example of the output that can be returned by the previous script when using 6 LD-blocks of 20kb each.

alt text

HaploDX Functions

You can find a complete presentation (or in fact a thorough tutorial) of the HaploDX library on my personal webpage (here). Below is the list of all functions accessible from the library. It is recommended to first experiment with the functions presented in the Data Generation section.

Population and Allele Frequency Spectrum Modeling

def stochastic_line(a: float,b: float,sigma: float) -> Callable[float,float]

population_mut = stochastic_line(0.08,0.17,0.01)

def afs_distribution(index: int,alpha: float = 4/30) -> float

def afs_intervals(pick: float,alpha: float = 4/30) -> list[float]

def afs_sample(alpha: float = 4/30) -> float

def genotype_schema(alpha: float = 4/30) -> tuple[float,list[float]]

def genotype(hwp: list[float],minor: int) -> tuple[int,int]

gref = lambda g: g[0] if g[0] in [2,0] else g[1]

def population_mld(t: float) -> tuple[float,float,float]

LD and Hardy–Weinberg Principle Modeling

def decay(initial: float,halfwidth: float,shift: float) -> Callable[float,float]

def ref_alt_function(y: float,x: float) -> float

def alt_alt_function(y: float,z: float,x: float) -> float

def amplifier(beta: float,p: float,q: float,s: float = 1) -> float

def lb_freq(beta: float,gamma: float,previous_freq: float,distance: float,shift: float) -> float

def ub_freq(beta: float,gamma: float,previous_freq: float,distance: float,shift: float) -> float

def linkage_disequilibrium(alpha: float,beta: float,gamma: float,strength: float = -1) -> Callable[float,Callabel[float,tuple[float,float]]]

def cond_genotype_schema(previous_maf: float,distance: float,alpha: float,beta: float,gamma: float,strength: float = -1) -> tuple[float,list[float],float]

Data Generation

def SNP_distribution(reference: float,length: float) -> list[float]

Description

generates the list of positions for the VCF file

def initiate_block(reference: float,alpha: float,Npop: int = 1000) -> tuple[float,list[list],list[list]]

Description

initializes the first LD-block of the simulation
the parameter reference refers to the first locus position at which the generation starts

def continue_block(maf0: float,pre_matrix: list[list],matrix: list[list],positions: list[float],alpha: float,beta: float,gamma: float,strength: int = -1,Npop: int = 1000) -> tuple[float,list[list],list[list]]

Description

generates an LD-block from a given position with a given minor allele frequency
augments the genomic matrix with further genetic variants as specified by the arguments

def genmatrix(blocks: list[int],strength: float,population: float,Npop: int)

Description

implements a basic genomic matrix generator using SNP_distribution(), initiate_block() and continue_block()
note: many possible improvements or variations of this function are possible (see tutorial here for more detail)

def gt_vcf(value: int)-> str

def create_vcfgz(vcf_name: str,matrix: list[list],alpha: float,beta: float,gamma: float,system: str = "unix") -> None

Description

generates a VCF file from a matrix generated by initiate_block(), continue_block() or genmatrix()

LD Analytics

def LD_corr_matrix(matrix: list[list]) -> tuple[list[list],float,list[float]]

Description

generates a (non-normalized) correlation matrix from a matrix generated by initiate_block(), continue_block() or genmatrix()

def LD_r2_matrix(pre_matrix: list[list]) -> tuple[list[list],float,list[float]]

Description

generates a (non-normalized) LD-r2 matrix from a pre_matrix generated by initiate_block(), continue_block()
note: this function cannot be used with genmatrix() since the pre_matrix is not returned

def display(rel: list[list],m: float) -> list[list[tuple[float,float,float]]]

def minor_haplotype(sub_pre_matrix: list[list]) -> float

Project details

These details have not been verified by PyPI

Project links

GitHub Statistics

View statistics for this project via Libraries.io, or by using our public dataset on Google BigQuery

Release history Release notifications | RSS feed

0.4b1 pre-release

Aug 25, 2023

0.4b0 pre-release

Aug 24, 2023

0.3b1 pre-release

Aug 14, 2023

0.3b0 pre-release

Aug 14, 2023

0.2b4 pre-release

Aug 9, 2023

0.2b1 pre-release

Aug 7, 2023

0.2b0 pre-release

Aug 7, 2023

0.1b9 pre-release

Aug 7, 2023

This version

0.1b7 pre-release

Aug 7, 2023

0.1b6 pre-release

Aug 7, 2023

0.1b5 pre-release

Aug 7, 2023

0.1b4 pre-release

Aug 7, 2023

0.1b3 pre-release

Aug 6, 2023

0.1b2 pre-release

Aug 6, 2023

0.1b1 pre-release

Aug 6, 2023

Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

HaploDynamics-0.1b7.tar.gz (24.4 kB view hashes)

Uploaded Aug 7, 2023 Source

Hashes for HaploDynamics-0.1b7.tar.gz

Hashes for HaploDynamics-0.1b7.tar.gz
Algorithm	Hash digest
SHA256	`70795a520e84f3203e44d4085de5548c5aeba43bae682211cb8986027a5ff621`
MD5	`9d9fd40ecae8ac509d4b11e1fcd72638`
BLAKE2b-256	`4377681dd636417a1a943c68a8c648c2f6e52099f25a65101103b185d9f3a12b`