Expected Genetic Relationship Matrix

The expected Genetic Relationship Matrix (eGRM) is the expected value of the Genetic Relationship Matrix (GRM) over unknown SNPs, given the complete genealogical tree of a sample of individuals. It is derived under the coalescent theory framework.

This method is described in the following paper: Fan, Caoqi, Nicholas Mancuso, and Charleston WK Chiang. "A genealogical estimate of genetic relationships." bioRxiv (2021). Please cite our paper if you use our method.
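For orientation, the quantity whose expectation the eGRM takes can be illustrated on observed genotypes. The sketch below assumes the standard GCTA-style GRM definition (each SNP centered by twice its allele frequency and scaled by its binomial variance, then averaged over SNPs). It is not code from this package, and the function name is hypothetical.

```python
def genetic_relationship_matrix(genotypes):
    """Standard (GCTA-style) GRM from a small genotype table.

    genotypes[i][s] is the allele count (0, 1 or 2) of individual i at SNP s.
    Assumption: this is the textbook definition, not the egrm implementation.
    """
    n = len(genotypes)
    n_snps = len(genotypes[0])
    grm = [[0.0] * n for _ in range(n)]
    used = 0
    for s in range(n_snps):
        column = [row[s] for row in genotypes]
        p = sum(column) / (2 * n)  # sample allele frequency
        if p == 0.0 or p == 1.0:
            continue  # monomorphic SNPs carry no information
        variance = 2 * p * (1 - p)
        centered = [x - 2 * p for x in column]
        for i in range(n):
            for j in range(n):
                grm[i][j] += centered[i] * centered[j] / variance
        used += 1
    if used == 0:
        raise ValueError("no polymorphic SNPs")
    # Average the per-SNP contributions.
    return [[value / used for value in row] for row in grm]


# Toy data: two diploid individuals, two SNPs.
toy_grm = genetic_relationship_matrix([[0, 1], [2, 1]])
print(toy_grm)
```

The eGRM replaces the observed SNPs in this average with an expectation over mutations that could have arisen on the branches of the genealogy.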

Installation

Install from PyPI (not available yet):

pip install egrm

Or download the package and install from local:

git clone https://github.com/Ephraim-usc/egrm.git

pip install ./egrm

Command Line Tools

The trees2egrm command computes the eGRM from a tskit tree sequence. Its usage is:

usage: trees2egrm [-h] [--output OUTPUT] [--c-extension] [--skip-first-tree] [--run-var] [--genetic-map GENETIC_MAP] [--left LEFT]
                  [--right RIGHT] [--rlim RLIM] [--alim ALIM] [--verbose] [--haploid] [--output-format {gcta,numpy}]
                  input

Construct eGRM matrix from tree sequence data

positional arguments:
  input                 path to ts-kit tree sequence file

optional arguments:
  -h, --help            show this help message and exit
  --output OUTPUT, --o OUTPUT
                        output file prefix (default: egrm)
  --c-extension, --c    acceleration by C extension (default: False)
  --skip-first-tree, --sft
                        discard the first tree in the tree sequence (default: False)
  --run-var, --var      compute varGRM in addition to eGRM (default: False)
  --genetic-map GENETIC_MAP, --map GENETIC_MAP
                        map file fullname (default: None)
  --left LEFT, --l LEFT
                        leftmost genomic position to be included (default: 0)
  --right RIGHT, --r RIGHT
                        rightmost genomic position to be included (default: inf)
  --rlim RLIM           most recent time limit (default: 0)
  --alim ALIM           most ancient time limit (default: inf)
  --verbose             verbose logging. Includes debug info. (default: False)
  --haploid             output eGRM over haploids. Default is diploid/genotype eGRM. (default: False)
  --output-format {gcta,numpy}, --f {gcta,numpy}
                        output format of eGRM (default: gcta)

Here, input is the tree sequence file prefix (so that the full file name is "INPUT.trees"), and OUTPUT is the output file prefix.
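When trees2egrm is driven from a script, the argument list can be assembled programmatically. The sketch below uses only flags that appear in the help text above; the helper name trees2egrm_command and the file names are hypothetical, and whether the positional argument should be a prefix or a full path is worth checking against the installed version.

```python
import shlex


def trees2egrm_command(input_path, output="egrm", c_extension=False,
                       skip_first_tree=False, run_var=False, genetic_map=None,
                       left=None, right=None, rlim=None, alim=None,
                       haploid=False, output_format="gcta"):
    """Build an argument list for trees2egrm (hypothetical helper)."""
    if output_format not in ("gcta", "numpy"):
        raise ValueError("output_format must be 'gcta' or 'numpy'")
    cmd = ["trees2egrm", str(input_path),
           "--output", str(output),
           "--output-format", output_format]
    # Boolean switches are appended only when enabled.
    switches = (("--c-extension", c_extension),
                ("--skip-first-tree", skip_first_tree),
                ("--run-var", run_var),
                ("--haploid", haploid))
    for flag, enabled in switches:
        if enabled:
            cmd.append(flag)
    # Valued options are appended only when a value is given,
    # so the tool's own defaults apply otherwise.
    valued = (("--genetic-map", genetic_map), ("--left", left),
              ("--right", right), ("--rlim", rlim), ("--alim", alim))
    for flag, value in valued:
        if value is not None:
            cmd += [flag, str(value)]
    return cmd


# "example" and "out" are placeholder names, not files shipped with the package.
example_cmd = trees2egrm_command("example", output="out", c_extension=True,
                                 run_var=True, left=1000000,
                                 output_format="numpy")
print(shlex.join(example_cmd))
```

The resulting list can be passed to subprocess.run(example_cmd, check=True) once trees2egrm is installed.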

Optional parameters:

[--c-extension] or [--c]

This specifies whether to use the C extension module to accelerate the algorithm. Usually this makes it about 10 times faster. Recommended whenever a C build environment is available.

[--skip-first-tree] or [--sft]

This option skips the first tree in the tree sequence. This is often useful because RELATE and some other tools always output tree sequences starting from 0 bp, even when the genotype data start partway along the chromosome.

[--run-var] or [--var]

With this option turned on, the algorithm outputs the varGRM in addition to the eGRM, at the cost of roughly doubling the computation time.

[--genetic-map] or [--map]

A three-column file (comma-, space-, or tab-separated) whose first column gives the physical position in bp and whose third column gives the genetic position in cM. The second column is not used. The first line is always ignored as a header.
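The map file layout described above can be checked before a long run with a few lines of Python. This is a minimal sketch of a reader for that layout, not the parser used inside trees2egrm; the function name parse_genetic_map is hypothetical.

```python
import re


def parse_genetic_map(lines):
    """Return (physical position in bp, genetic position in cM) pairs.

    `lines` is any iterable of text lines, e.g. an open file handle.
    Hypothetical helper; it only mirrors the documented file layout.
    """
    pairs = []
    for line_number, line in enumerate(lines):
        if line_number == 0:
            continue  # the first line is always treated as a header
        # Split on commas, spaces, or tabs.
        fields = [field for field in re.split(r"[,\s]+", line.strip()) if field]
        if not fields:
            continue  # tolerate blank lines
        if len(fields) < 3:
            raise ValueError("expected three columns on line %d" % (line_number + 1))
        # Column 1 is bp, column 3 is cM; column 2 is ignored.
        pairs.append((float(fields[0]), float(fields[2])))
    return pairs


# Toy input mixing the three allowed separators.
toy_map = parse_genetic_map([
    "pos rate cM\n",
    "100,0.1,0.0\n",
    "200\t0.1\t0.5\n",
    "300 0.2 1.25\n",
])
print(toy_map)
```

For a real file, the same function can be called as parse_genetic_map(open(path)) inside a with block.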

[--left LEFT] [--right RIGHT]

The leftmost and rightmost positions (in bp) between which the eGRM is computed.

[--rlim RLIM] [--alim ALIM]

RLIM and ALIM are the most recent and most ancient times (in generations) between which the eGRM is computed.

If --output-format is set to gcta, the eGRM is written in GCTA format (.grm.bin, .grm.N.bin and .grm.id files):

  • OUTPUT.grm.bin, which contains the eGRM matrix;

  • OUTPUT.grm.N.bin, in which every entry is the same number: the measure of the tree sequence (i.e., the expected number of mutations on this tree sequence);

  • OUTPUT.grm.id, which contains dummy ids for the samples.

  • OUTPUT_var(.grm.bin/.grm.N.bin/.grm.id), for the varGRM matrix, if the --var option is selected.
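The GCTA files listed above can be read back without GCTA itself. The sketch below assumes the usual GCTA binary layout: .grm.id has one sample per line, and .grm.bin stores the lower triangle of the matrix (diagonal included), row by row, as little-endian 4-byte floats. The function name read_gcta_grm and the toy file are hypothetical, and the layout should be confirmed against the files actually produced.

```python
import os
import struct
import tempfile


def read_gcta_grm(prefix):
    """Read PREFIX.grm.id and PREFIX.grm.bin into (ids, symmetric matrix)."""
    with open(prefix + ".grm.id") as handle:
        ids = [line.split() for line in handle if line.strip()]
    n = len(ids)
    n_values = n * (n + 1) // 2  # lower triangle including the diagonal
    with open(prefix + ".grm.bin", "rb") as handle:
        data = handle.read()
    if len(data) != 4 * n_values:
        raise ValueError("file size does not match the number of ids")
    # Assumption: values are little-endian 4-byte floats.
    values = struct.unpack("<%df" % n_values, data)
    matrix = [[0.0] * n for _ in range(n)]
    k = 0
    for i in range(n):
        for j in range(i + 1):
            # Mirror each stored value across the diagonal.
            matrix[i][j] = matrix[j][i] = values[k]
            k += 1
    return ids, matrix


# Round trip on a toy two-sample file written to a temporary directory.
with tempfile.TemporaryDirectory() as tmp:
    toy_prefix = os.path.join(tmp, "toy")
    with open(toy_prefix + ".grm.id", "w") as handle:
        handle.write("0\t0\n1\t1\n")
    with open(toy_prefix + ".grm.bin", "wb") as handle:
        handle.write(struct.pack("<3f", 1.0, 0.5, 2.0))
    toy_ids, toy_matrix = read_gcta_grm(toy_prefix)
print(toy_ids, toy_matrix)
```

OUTPUT.grm.N.bin is assumed to share the same layout, so the same unpacking applies to it.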

If --output-format is set to numpy, the output will be two (or three, if using the --var option) files in NumPy NPY format:

  • OUTPUT.npy, which contains the eGRM matrix;

  • OUTPUT_mu.npy, which contains a single number, the measure of the tree sequence (i.e., the expected number of mutations on this tree sequence);

  • OUTPUT_var.npy, which contains the varGRM matrix, if the --var option is selected.

Python Functions

varGRM_C(trees)

varGRM(trees)

The C and non-C versions of the eGRM algorithm. The input is a tskit TreeSequence object. See the source code for a complete explanation of their parameters.

mTMRCA_C(trees)

mTMRCA(trees)

The C and non-C versions of the mTMRCA algorithm. The input is a tskit TreeSequence object. See the source code for a complete explanation of their parameters.

Reproducing Results in the Paper

There is an additional command-line tool

manuscript/simulate 

which is included in the package, but not installed by default. You may manually run this script.

A complete explanation of its parameters and output files can be found at

manuscript/reproduce.rst

Support

If you are having issues, please let us know. Email the author: caoqifan@usc.edu


This version

0.1
