ASMC is a method to efficiently estimate pairwise coalescence time along the genome
█████╗ ███████╗ ███╗ ███╗ ██████╗
██╔══██╗ ██╔════╝ ████╗ ████║ ██╔════╝
███████║ ███████╗ ██╔████╔██║ ██║
██╔══██║ ╚════██║ ██║╚██╔╝██║ ██║
██║ ██║ ███████║ ██║ ╚═╝ ██║ ╚██████╗
╚═╝ ╚═╝ ╚══════╝ ╚═╝ ╚═╝ ╚═════╝
The Ascertained Sequentially Markovian Coalescent is a method to efficiently estimate pairwise coalescence time along the genome. It can be run using SNP array or whole-genome sequencing (WGS) data.
This repository contains code, installation instructions, and example files for the ASMC program. A user manual can be found here; data and annotations from the ASMC paper can be found here.
Installation
ASMC is regularly built and tested on Ubuntu and macOS. It is a C++ library with optional Python bindings.
The ASMC C++ library requires:
- A C++ compiler (C++14 or later)
- CMake (3.12 or later)
- Boost (1.62 or later)
- Eigen (3.3.4 or later)
The Python bindings additionally require:
- Python (3.5 or later)
- PyBind11 (distributed with ASMC as a submodule)
Install dependencies
Ubuntu (using the package manager)
sudo apt install g++ cmake libboost-all-dev libeigen3-dev
macOS (using homebrew and assuming Xcode is installed)
brew install cmake boost eigen
Getting and compiling FastSMC
C++ library and executable
First, get the code.
For code users
git clone https://github.com/PalamaraLab/FastSMC
cd FastSMC
For code developers
git clone https://github.com/PalamaraLab/FastSMC_dev
cd FastSMC_dev
Then, build the library and executable
mkdir FASTSMC_BUILD_DIR && cd FASTSMC_BUILD_DIR
cmake ..
cmake --build .
Note: you can locate the build directory outside the FastSMC directory if you wish: just run cmake /path/to/FastSMC from any directory you like.
C++ library and Python bindings
First, get the code.
For code users
git clone --recurse-submodules https://github.com/PalamaraLab/FastSMC
cd FastSMC
For code developers
git clone --recurse-submodules https://github.com/PalamaraLab/FastSMC_dev
cd FastSMC_dev
Then, build and install the library and Python bindings
pip install .
Note: the --recurse-submodules flag is important, as PyBind11 is distributed as a submodule. You will not get PyBind11 if you download the zip archive from GitHub.
Decoding Quantities
To generate decoding quantities, several additional dependencies are required.
Ubuntu (using the package manager)
sudo apt install libgmp-dev libmpfr-dev libgsl0-dev default-jdk jblas
macOS (using homebrew and assuming cask is installed)
brew install mpfr gmp gsl
brew cask install java
Install python dependencies
pip install cython numpy
pip install -r TOOLS/PREPARE_DECODING/requirements.txt
Basic functionality for generating decoding quantities can be seen in:
./prepare.sh
███████╗ █████╗ ███████╗ ████████╗ ███████╗ ███╗ ███╗ ██████╗
██╔════╝ ██╔══██╗ ██╔════╝ ╚══██╔══╝ ██╔════╝ ████╗ ████║ ██╔════╝
█████╗ ███████║ ███████╗ ██║ ███████╗ ██╔████╔██║ ██║
██╔══╝ ██╔══██║ ╚════██║ ██║ ╚════██║ ██║╚██╔╝██║ ██║
██║ ██║ ██║ ███████║ ██║ ███████║ ██║ ╚═╝ ██║ ╚██████╗
╚═╝ ╚═╝ ╚═╝ ╚══════╝ ╚═╝ ╚══════╝ ╚═╝ ╚═╝ ╚═════╝
The Fast Sequentially Markovian Coalescent (FastSMC) algorithm is an extension to the ASMC algorithm, adding an identification step by hashing (currently using an improved version of the GERMLINE algorithm). FastSMC is an accurate method to detect Identical-By-Descent segments which enables estimating the time to most recent common ancestor for IBD individuals, and provides an estimate of uncertainty for detected IBD regions.
This document is not intended as an extensive guide; a more detailed user manual is under development. Data and annotations from the FastSMC paper can be found here.
Installation
FastSMC is compiled with ASMC, using the same instructions as above.
Running FastSMC
You can run FastSMC as a C++ compiled executable or using Python (see below for examples).
Detailed command line options
See ASMC's documentation for parameters related to the validation step. Additional parameters related to the identification step are listed below. Note: default parameter values are likely to change in future versions.
--inFileRoot Prefix of input files (.hap, .samples, .map).
[mandatory]
--decodingQuantFile Decoding quantities file.
[mandatory]
--outFileRoot Prefix of output file.
[mandatory]
--GERMLINE Use of GERMLINE to pre-process IBD segments. If off, no identification step will be performed.
[default 0/off]
--min_m arg (=1) Minimum match length (in cM).
[default = 1.0]
--time arg (=100) Time threshold to define IBD in number of generations.
[default = 100]
--skip arg (=0) Skip words with (seeds/samples) less than this value
[default 0.0]
--min_maf arg (=0) Minimum minor allele frequency
[default 0.0]
--gap arg (=1) Allowed gaps
[default 1]
--max_seeds arg (=0) Dynamic hash seed cutoff
[default 0/off]
--recall arg (=3) Recall level from 0 to 3 (higher value means higher recall).
[default = 3]
--segmentLength Output length in centimorgans of each IBD segment.
[default 0/off]
--perPairMAP Output MAP age estimate for each IBD segment.
[default 0/off]
--perPairPosteriorMeans Output posterior mean age estimate for each IBD segment.
[default 0/off]
--noConditionalAgeEstimates Do not condition the age estimates on the TMRCA being between present time and t generations ago (where t is the time threshold).
[default 0/off]
--bin Binary output
[default off]
--batchSize Size of batches to be decoded.
[default = 32]
Suggested optimal parameters for IBD detection within the past 25, 50, 100, 150 and 200 generations are provided in the FastSMC paper.
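As a rough illustration of how the options above fit together, the following Python sketch assembles an argument list for the FastSMC executable. It only uses option names listed above. The function name build_fastsmc_args and all paths are hypothetical placeholders, and the sketch assumes that on/off options such as --GERMLINE are passed as bare switches without a value; check the executable's help output before relying on that.

```python
import shlex
from typing import List, Optional


def build_fastsmc_args(
    exe: str,
    in_file_root: str,
    decoding_quant_file: str,
    out_file_root: str,
    time: Optional[int] = None,
    min_m: Optional[float] = None,
    germline: bool = True,
    segment_length: bool = False,
    per_pair_posterior_means: bool = False,
) -> List[str]:
    """Assemble an argument list for FastSMC (hypothetical helper)."""
    # The three mandatory options always come first.
    args = [
        exe,
        "--inFileRoot", in_file_root,
        "--decodingQuantFile", decoding_quant_file,
        "--outFileRoot", out_file_root,
    ]
    # Valued options are only added when explicitly requested,
    # so the program's own defaults apply otherwise.
    if time is not None:
        args += ["--time", str(time)]
    if min_m is not None:
        args += ["--min_m", str(min_m)]
    # Assumption: on/off options are bare switches (no value).
    if germline:
        args.append("--GERMLINE")
    if segment_length:
        args.append("--segmentLength")
    if per_pair_posterior_means:
        args.append("--perPairPosteriorMeans")
    return args


if __name__ == "__main__":
    # All paths below are placeholders, not files shipped with the project.
    args = build_fastsmc_args(
        exe="path/to/FastSMC_exe",
        in_file_root="path/to/input_root",
        decoding_quant_file="path/to/input.decodingQuantities.gz",
        out_file_root="path/to/output_root",
        time=50,
        segment_length=True,
    )
    # Print a shell-quoted command line for inspection; it could be run
    # by passing `args` to subprocess.run(args, check=True).
    print(" ".join(shlex.quote(a) for a in args))
```

Building the list rather than a single string avoids shell quoting problems if a path contains spaces.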
Input file formats
Input files are provided to FastSMC with the --inFileRoot option. You may want to look at files in FILES/FASTSMC_EXAMPLE/* for examples of the file formats described below.
Phased haplotypes in Oxford haps/sample format (.hap/.hap.gz, .samples)
These files are provided as input to FastSMC. The file format is explained here. These files are output by phasing programs like Eagle and Shapeit.
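For a quick sanity check of a haplotype file before running FastSMC, a reader along the following lines may help. This is a minimal sketch, not part of FastSMC: it assumes the usual Oxford haps layout (chromosome, variant ID, position, two alleles, then two 0/1 haplotype columns per individual, whitespace-separated) and fully phased data with no missing calls. The names HapsRecord, open_text and parse_haps are hypothetical.

```python
import gzip
import io
from typing import IO, Iterable, Iterator, List, NamedTuple


class HapsRecord(NamedTuple):
    chrom: str
    snp_id: str
    position: int
    allele0: str
    allele1: str
    haplotypes: List[int]  # two entries per individual, coded 0/1


def open_text(path: str) -> IO[str]:
    """Open a plain or gzip-compressed file in text mode."""
    if path.endswith(".gz"):
        return gzip.open(path, "rt")
    return open(path, "r")


def parse_haps(lines: Iterable[str]) -> Iterator[HapsRecord]:
    """Yield one record per variant from Oxford haps formatted lines."""
    for line_number, line in enumerate(lines, start=1):
        fields = line.split()
        if not fields:
            continue  # ignore blank lines
        n_hap_columns = len(fields) - 5
        if n_hap_columns <= 0 or n_hap_columns % 2 != 0:
            raise ValueError(
                f"line {line_number}: expected 5 leading columns "
                "followed by two haplotype columns per individual"
            )
        yield HapsRecord(
            chrom=fields[0],
            snp_id=fields[1],
            position=int(fields[2]),
            allele0=fields[3],
            allele1=fields[4],
            haplotypes=[int(x) for x in fields[5:]],
        )


if __name__ == "__main__":
    # Two made-up variants for two individuals, for illustration only.
    demo = io.StringIO("1 rs1 100 A G 0 1 1 0\n1 rs2 250 C T 1 1 0 0\n")
    for record in parse_haps(demo):
        n_individuals = len(record.haplotypes) // 2
        print(record.snp_id, record.position, n_individuals)
```

To read a real file, pass the handle returned by open_text to parse_haps.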
Genetic map (.map)
The genetic map provided as input to FastSMC has 4 columns, in the format "Physical_position Recombination_rate Genetic_position Mutation_rate". Genetic positions are in centimorgans and physical positions are in base pairs (bp). The map can optionally be compressed using gzip.
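The four-column layout described above can be read with a few lines of Python. The sketch below is illustrative only: the names MapRow, parse_genetic_map and genetic_length_cm are hypothetical, columns are assumed to be whitespace-separated, and a non-numeric first line is assumed to be a header and skipped (the example files may or may not contain one).

```python
import io
from typing import Iterable, List, NamedTuple


class MapRow(NamedTuple):
    physical_position: int    # base pairs
    recombination_rate: float
    genetic_position: float   # centimorgans
    mutation_rate: float


def parse_genetic_map(lines: Iterable[str]) -> List[MapRow]:
    """Parse a 4-column genetic map into a list of rows."""
    rows: List[MapRow] = []
    for line_number, line in enumerate(lines, start=1):
        fields = line.split()
        if not fields:
            continue  # ignore blank lines
        if len(fields) != 4:
            raise ValueError(f"line {line_number}: expected 4 columns")
        try:
            row = MapRow(
                int(fields[0]),
                float(fields[1]),
                float(fields[2]),
                float(fields[3]),
            )
        except ValueError:
            if line_number == 1:
                continue  # assumption: non-numeric first line is a header
            raise
        rows.append(row)
    return rows


def genetic_length_cm(rows: List[MapRow]) -> float:
    """Genetic distance in centimorgans spanned by the map."""
    if not rows:
        return 0.0
    return rows[-1].genetic_position - rows[0].genetic_position


if __name__ == "__main__":
    # Made-up values, for illustration only.
    demo = io.StringIO(
        "Physical_position Recombination_rate Genetic_position Mutation_rate\n"
        "1000 1.0 0.0 1.65e-8\n"
        "2000 1.0 0.001 1.65e-8\n"
    )
    rows = parse_genetic_map(demo)
    print(len(rows), genetic_length_cm(rows))
```

For a gzip-compressed map, open the file with gzip.open(path, "rt") and pass the handle to parse_genetic_map.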
Decoding quantities (.decodingQuantities.gz)
See the instructions above to generate decoding quantities files and the ASMC manual here for more details.
Output format
FastSMC generates an .ibd.gz (or .bibd.gz if binary output) file in the specified location. Each line corresponds to a pairwise shared segment, with the following fields:
0. First individual's family identifier
1. First individual identifier
2. First individual haplotype identifier (1 or 2)
3. Second individual's family identifier
4. Second individual identifier
5. Second individual haplotype identifier (1 or 2)
6. Chromosome number
7. Starting position of the IBD segment (inclusive)
8. Ending position of the IBD segment (inclusive)
9. (optional) Length in centimorgans of IBD segment
10. IBD score
11. (optional) Average mean posterior age estimate of the IBD segment
12. (optional) Average MAP age estimate of the IBD segment
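Because fields 9, 11 and 12 are optional, the number of columns in a text output line depends on which options were used. The sketch below shows one way to map a line to named fields in Python. It is not part of FastSMC: the function and column names are hypothetical, lines are assumed to be whitespace-separated, positions are assumed to be integers, and the caller is assumed to know which of --segmentLength, --perPairPosteriorMeans and --perPairMAP were enabled.

```python
from typing import Dict, List, Union

Value = Union[str, int, float]

# Converters for numeric columns; every other column is kept as a string.
_CONVERTERS = {
    "hap1": int, "hap2": int, "start": int, "end": int,
    "length_cm": float, "score": float,
    "post_mean_age": float, "map_age": float,
}


def ibd_columns(
    segment_length: bool = False,
    posterior_means: bool = False,
    map_estimates: bool = False,
) -> List[str]:
    """Column names in output order, given which optional fields exist."""
    cols = ["fam1", "id1", "hap1", "fam2", "id2", "hap2",
            "chrom", "start", "end"]
    if segment_length:       # field 9
        cols.append("length_cm")
    cols.append("score")     # field 10
    if posterior_means:      # field 11
        cols.append("post_mean_age")
    if map_estimates:        # field 12
        cols.append("map_age")
    return cols


def parse_ibd_line(line: str, columns: List[str]) -> Dict[str, Value]:
    """Split one output line and convert numeric fields."""
    fields = line.split()
    if len(fields) != len(columns):
        raise ValueError(
            f"expected {len(columns)} fields, found {len(fields)}"
        )
    return {
        name: _CONVERTERS.get(name, str)(value)
        for name, value in zip(columns, fields)
    }


if __name__ == "__main__":
    # A made-up line for illustration, not real FastSMC output.
    cols = ibd_columns(segment_length=True, posterior_means=True)
    line = "fam1 ind1 1 fam2 ind2 2 1 1000 5000 1.5 0.9 35.2"
    segment = parse_ibd_line(line, cols)
    print(segment["id1"], segment["id2"], segment["end"] - segment["start"])
```

Checking the field count against the expected columns catches the common mistake of parsing a file with the wrong set of optional fields.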
Binary output
If you use the --bin option, FastSMC will generate a compressed binary (.bibd.gz) output file. This can then be converted to text format using the BinaryDataReader class in Python (see notebooks for an example) or the convertBinary executable in C++ (see the C++ example below).
Examples using the Python bindings (Python)
FastSMC can also be run using Python. See the notebooks directory for an example.
There are two Jupyter notebooks:
- a minimal working example, where sensible defaults for parameters are chosen automatically
- a more detailed example that demonstrates how to customise parameters, how to convert the binary file to text format, and how to analyse the output if it is too large to fit in memory.
Example using the compiled FastSMC executable (C++)
Following the compilation instructions above will create an executable
FASTSMC_BUILD_DIR/FastSMC_exe
which can be run with the command line arguments summarised above. For an example of IBD detection within the past 50 generations, run the following command:
sh c++_example/FastSMC_example.sh
A binary output file will be generated and then converted to text format using the convertBinary executable. The first 10 lines will be printed.
Either way of running FastSMC (Python bindings or C++) will run it on a simulated dataset as described in the FastSMC paper. An output file with IBD segments will be generated (in notebooks/ or c++_example/ respectively), and run time should be less than 4s.
License
ASMC and FastSMC are distributed under the GNU General Public License v3.0 (GPLv3). For any questions or comments on ASMC, please contact Pier Palamara using <lastname>@stats.ox.ac.uk.
Reference
If you use this software, please cite the appropriate reference(s) below.
The ASMC algorithm and software were developed in
- P. Palamara, J. Terhorst, Y. Song, A. Price. High-throughput inference of pairwise coalescence times identifies signals of selection and enriched disease heritability. Nature Genetics, 2018.
The FastSMC algorithm and software were developed in
- J. Nait Saada, G. Kalantzis, D. Shyr, F. Cooper, M. Robinson, A. Gusev, P. F. Palamara. Identity-by-descent detection across 487,409 British samples reveals fine-scale evolutionary history and trait associations. Nature Communications, in press.