Genotype Representation Graph Library

Genotype Representation Graphs

A Genotype Representation Graph (GRG) is a compact way to store reference-aligned genotype data for large genetic datasets. These datasets are typically stored in tabular formats (VCF, BCF, BGEN, etc.) and then compressed using off-the-shelf compression. In contrast, a GRG contains Mutation nodes (representing variants) and Sample nodes (representing haploid samples), where there is a path from a Mutation node to a Sample node if and only if that sample contains that mutation. These paths go through internal nodes that represent common ancestry between multiple samples, which can result in significant compression (30-50x smaller than .vcf.gz). Calculations on the whole dataset can be performed very quickly on a GRG using GRGL.
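
To make the path property concrete, here is a minimal pure-Python sketch of a toy graph. It does not use GRGL, and the node names (mut_A, anc_1, s0, ...) and the dictionary layout are hypothetical choices for illustration, not GRGL's data structures. It shows that a sample carries a mutation exactly when it is reachable from that mutation, and that shared internal nodes let the graph use fewer edges than the number of non-zero genotype entries.

```python
# Toy illustration only: hypothetical node names, not GRGL's data structures or API.
from typing import Dict, List, Set

# Edges point "down" from Mutation nodes, through shared internal nodes, to Sample nodes.
children: Dict[str, List[str]] = {
    "mut_A": ["anc_1"],
    "mut_B": ["anc_2"],
    "mut_C": ["anc_1", "s4"],
    "mut_D": ["anc_2"],
    "anc_1": ["s0", "s1", "s2"],
    "anc_2": ["anc_1", "s3"],
}
mutations = ["mut_A", "mut_B", "mut_C", "mut_D"]


def samples_with(mutation: str) -> Set[str]:
    """Return the samples reachable from a mutation (nodes with no children)."""
    found: Set[str] = set()
    stack = [mutation]
    while stack:
        node = stack.pop()
        kids = children.get(node, [])
        if not kids:
            found.add(node)  # a node without children is a sample in this toy graph
        stack.extend(kids)
    return found


# Edges stored by the graph versus non-zero entries of the equivalent genotype matrix.
edge_count = sum(len(kids) for kids in children.values())
nonzero_count = sum(len(samples_with(m)) for m in mutations)

if __name__ == "__main__":
    for m in mutations:
        print(m, sorted(samples_with(m)))
    print("edges:", edge_count, "non-zero genotypes:", nonzero_count)
```

In this toy graph, 10 edges encode 15 non-zero genotype entries. Real datasets share far more ancestry, which is where the reported compression comes from.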

Recent releases (after v2.3) support the following improvements over the initial paper:

  1. Graphs are more than 2x smaller (in RAM and on disk)
  2. Graph construction is 10-25x faster
  3. Loading graphs from disk is 10-20x faster
  4. First-class matrix multiplication API matmul
  5. (Prototype) unphased data is supported
  6. GWAS, GWAS with covariates, PCA, and other analyses are available with grapp (pip install grapp)
  7. Phenotype simulation is available with grg_pheno_sim (pip install grg_pheno_sim)
  8. Construction from .vcf.gz now supports tabix indexes, making that input format feasible for large datasets
  9. Better support for missing data; see the documentation
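
The idea of multiplying a vector by the genotype matrix without ever materializing that matrix can be sketched in plain Python. This is an illustration of the general technique on a toy graph, not GRGL's matmul function or its signature. The node names are hypothetical, and the sketch assumes there is at most one path between any two nodes, so that summing over children never counts a sample twice.

```python
# Toy illustration only: not GRGL's matmul API. Assumes at most one path
# between any pair of nodes, so per-child sums never double count a sample.
from typing import Dict, List

children: Dict[str, List[str]] = {
    "mut_A": ["anc_1"],
    "mut_B": ["anc_2"],
    "mut_C": ["anc_1", "s4"],
    "mut_D": ["anc_2"],
    "anc_1": ["s0", "s1", "s2"],
    "anc_2": ["anc_1", "s3"],
}
mutations = ["mut_A", "mut_B", "mut_C", "mut_D"]


def propagate_up(sample_values: Dict[str, float]) -> Dict[str, float]:
    """For each mutation, sum sample_values over the samples that carry it.

    Equivalent to a (1 x samples) vector times the (samples x mutations)
    genotype matrix, with each graph node evaluated only once.
    """
    cache: Dict[str, float] = {}

    def value(node: str) -> float:
        if node not in cache:
            kids = children.get(node, [])
            if kids:
                cache[node] = sum(value(k) for k in kids)
            else:
                cache[node] = sample_values[node]  # sample node: use its input value
        return cache[node]

    return {m: value(m) for m in mutations}


# A vector of ones yields the allele count of every mutation.
ones = {f"s{i}": 1.0 for i in range(5)}
allele_counts = propagate_up(ones)

# Arbitrary per-sample values (for example, a phenotype) work the same way.
weights = {"s0": 1.0, "s1": 2.0, "s2": 3.0, "s3": 4.0, "s4": 5.0}
weighted_sums = propagate_up(weights)

if __name__ == "__main__":
    print(allele_counts)
    print(weighted_sums)
```

Because intermediate sums are cached per node, the work is proportional to the number of edges in the graph rather than the number of non-zero genotypes.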

If you need to cite something, use "Enabling efficient analysis of biobank-scale data with genotype representation graphs".

Genotype Representation Graph Library (GRGL)

GRGL can be used as a library in both C++ and Python. Support is currently limited to Linux and macOS. It contains both an API (see docs) and a set of command-line tools.

Installing from pip

If you just want to use the tools (e.g., constructing a GRG or converting a tree-sequence to a GRG) and the Python API, then you can install via pip (from PyPI).

pip install pygrgl

This will use prebuilt packages on most modern Linux systems, and will build from source on macOS. Building from source requires CMake (at least v3.14), zlib development headers, and a clang or GCC compiler that supports C++11.

Building (Python)

The Python installation installs the command line tools and Python libraries (the C++ executables are packaged as part of this). Make sure you clone with git clone --recursive!

Requires Python 3.7 or newer to be installed (including development headers). It is recommended that you build/install in a virtual environment.

python3 -m venv /path/to/MyEnv
source /path/to/MyEnv/bin/activate
python setup.py bdist_wheel               # Compiles C++, builds a wheel in the dist/ directory
pip install --force-reinstall dist/*.whl  # Install from wheel

Build and installation should take at most a few minutes on a typical computer. For more details on build options, see DEVELOPING.md.

Building (C++ only)

The C++ build is only necessary for folks who want to include GRGL as a library in their C++ project. Typically, you would include our CMake into your project via add_subdirectory, but you can also build standalone as below. Make sure you clone with git clone --recursive!

If you only intend to use GRGL from C++, you can just build it via CMake:

mkdir build && cd build
cmake .. -DCMAKE_BUILD_TYPE=Release
make -j4

See below to install the libraries to your system. It is recommended to install it to a custom location (prefix) since removing packages installed via make install is a pain otherwise. Example:

mkdir /path/to/grgl_installation/
mkdir build && cd build
cmake .. -DCMAKE_BUILD_TYPE=Release -DCMAKE_INSTALL_PREFIX=/path/to/grgl_installation/
make -j4
make install
# There should now be bin/, lib/, etc., directories under /path/to/grgl_installation/

Building (Docker)

We've included a Dockerfile if you want to use GRGL in a container.

Example to build:

docker build . -t grgl:latest

Example to run, constructing a GRG from an example VCF file:

docker run -v $PWD:/working -it grgl:latest bash -c "cd /working && grg construct --force /working/test/inputs/msprime.example.vcf"

Usage (Command line)

There is a command line tool that is mostly for file format conversion and performing common computations on the GRG. For more flexibility, use the Python or C++ APIs. After building and installing the Python version, run grg --help to see all the command options. Some examples are below.

Convert a tskit tree-sequence into a GRG. This creates my_arg_data.grg from my_arg_data.trees:

grg convert /path/to/my_arg_data.trees my_arg_data.grg

Load a GRG and emit some simple statistics about the GRG itself:

grg process stats my_arg_data.grg

To construct a GRG from a VCF file, use the grg construct command. (NOTE: raw VCF is incredibly slow for non-trivial datasets; use either BGZF-compressed VCF indexed with tabix, or IGD.)

grg construct -j 1 path/to/foo.vcf.gz

To convert a VCF(.gz) to an IGD and then build a GRG:

pip install igdtools
igdtools path/to/foo.vcf -o foo.igd
grg construct -j 1 foo.igd

Increase -j to the number of threads you have. Construction for small datasets (such as those included as tests in this repository) should be very fast, on the order of seconds. Really large datasets (such as Biobank-scale whole genome sequences) can take on the order of hours when using lots of threads (e.g., 70). 1,000 Genomes Project chromosomes usually take on the order of a few minutes.
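
If you script construction from Python, one way to pick the -j value is to ask the operating system for its CPU count. The sketch below is a hypothetical helper (construct_command is not part of GRGL) that only assembles the grg construct -j N invocation shown above. Note that os.cpu_count() reports logical CPUs, which can be more than a container or job scheduler actually grants you.

```python
# Hypothetical helper, not part of GRGL: builds the "grg construct" command line.
import os
import shutil
import subprocess
from typing import List, Optional


def construct_command(input_path: str, threads: Optional[int] = None) -> List[str]:
    """Return the argument list for `grg construct -j <threads> <input_path>`."""
    if threads is None:
        # cpu_count() may return None on unusual platforms; fall back to 1 thread.
        threads = os.cpu_count() or 1
    return ["grg", "construct", "-j", str(threads), input_path]


if __name__ == "__main__":
    cmd = construct_command("foo.igd")
    print(" ".join(cmd))
    # Only run the tool when it is installed and the input file exists.
    if shutil.which("grg") and os.path.exists("foo.igd"):
        subprocess.run(cmd, check=True)
```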

Usage (Python API)

See the provided Jupyter notebooks and GettingStarted.md for more examples.

Limits

Quantity                        Limit
Haploid samples                 2,147,483,646
Total nodes                     2,147,483,646
Total mutations (variants)      4,294,967,294
Total edges                     18,446,744,073,709,551,615
Edges to/from a single node     4,294,967,295

Note: Node limits can theoretically be expanded to about a trillion, by turning off the COMPACT_NODE_IDS preprocessor flag, but this mode is not well tested.
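
Each limit above is a power of two minus one or two, which is what you would expect from fixed-width integer identifiers with a few reserved values. That reading is an assumption, since the text does not say how the identifiers are stored. The sketch below reproduces the arithmetic and adds a hypothetical helper (exceeded is not part of GRGL) for checking planned dataset sizes against the table.

```python
# Limits from the table above, written as powers of two.
# The fixed-width-integer interpretation is an assumption, not stated by GRGL.
from typing import List

LIMITS = {
    "haploid_samples": 2**31 - 2,
    "total_nodes": 2**31 - 2,
    "total_mutations": 2**32 - 2,
    "total_edges": 2**64 - 1,
    "edges_per_node": 2**32 - 1,
}


def exceeded(**quantities: int) -> List[str]:
    """Return the names of the quantities that are above their limit."""
    return [name for name, value in quantities.items() if value > LIMITS[name]]


if __name__ == "__main__":
    for name, limit in LIMITS.items():
        print(f"{name:16s} {limit:,}")
    # Example: one million haploid samples, five billion variants.
    print(exceeded(haploid_samples=1_000_000, total_mutations=5_000_000_000))
```

In the example call, only total_mutations is reported, because five billion variants is above the 4,294,967,294 limit.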
