Genotype Representation Graph Library

Genotype Representation Graphs

A Genotype Representation Graph (GRG) is a compact way to store reference-aligned genotype data for large genetic datasets. These datasets are typically stored in tabular formats (VCF, BCF, BGEN, etc.) and then compressed using off-the-shelf compression. In contrast, a GRG contains Mutation nodes (representing variants) and Sample nodes (representing haploid samples), where there is a path from a Mutation node to a Sample node if and only if that sample contains that mutation. These paths go through internal nodes that represent common ancestry between multiple samples, which can result in significant compression (30-50x smaller than .vcf.gz). Calculations over the whole dataset can be performed very quickly on a GRG using GRGL.
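The path rule described above can be illustrated with a toy graph in plain Python. This is only a sketch of the idea, not GRGL's data structure or API: the node names, the `EDGES` dictionary, and the helper `samples_with_mutation` are all hypothetical.

```python
# Toy graph (hypothetical names): edges point from mutations, through a
# shared internal node, down to haploid samples.
EDGES = {
    "mut_A": ["anc_1"],
    "mut_B": ["anc_1", "s4"],
    "mut_C": ["anc_1", "s5"],
    "anc_1": ["s0", "s1", "s2", "s3"],
}
SAMPLES = {"s0", "s1", "s2", "s3", "s4", "s5"}


def samples_with_mutation(mutation):
    """Return every sample reachable from the given mutation node."""
    reached, seen, stack = set(), set(), [mutation]
    while stack:
        node = stack.pop()
        if node in seen:
            continue
        seen.add(node)
        if node in SAMPLES:
            reached.add(node)
        stack.extend(EDGES.get(node, []))
    return reached


# A sample carries a mutation if and only if a path connects them.
carriers = {m: samples_with_mutation(m) for m in ("mut_A", "mut_B", "mut_C")}

# Compare storage: one table entry per (mutation, sample) pair versus
# one graph edge. The shared node "anc_1" is stored once and reused.
table_entries = sum(len(c) for c in carriers.values())
graph_edges = sum(len(children) for children in EDGES.values())
print(carriers)
print("table entries:", table_entries, "graph edges:", graph_edges)
```

In this toy case the shared internal node already saves a few entries; the compression figures quoted above come from real datasets, where much more ancestry is shared.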

Recent releases (after v2.3) include the following improvements over the version described in the initial paper:

  1. Graphs are more than 2x smaller (in RAM and on disk)
  2. Graph construction is 10-25x faster
  3. Loading graphs from disk is 10-20x faster
  4. First-class matrix multiplication API matmul
  5. (Prototype) unphased data is supported
  6. GWAS, GWAS with covariates, PCA, and other analyses are available with grapp (pip install grapp)
  7. Phenotype simulation is available with grg_pheno_sim (pip install grg_pheno_sim)
  8. Construction from .vcf.gz now supports tabix indexes, making that input format feasible for large datasets
  9. Better support for missing data (see the documentation)
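To give a feel for why matrix-style products suit a graph representation (item 4), here is a small pure-Python sketch. It does not call the pygrgl matmul API, whose signature is not shown here; the graph, the names, and the function `sums_over_samples` are hypothetical. It assumes at most one path between any mutation and any sample.

```python
from functools import lru_cache

# Hypothetical toy graph: parent -> children, from mutations toward samples.
CHILDREN = {
    "mut_A": ("anc_1",),
    "mut_B": ("anc_1", "s4"),
    "mut_C": ("anc_1", "s5"),
    "anc_1": ("s0", "s1", "s2", "s3"),
}
SAMPLES = ("s0", "s1", "s2", "s3", "s4", "s5")
MUTATIONS = ("mut_A", "mut_B", "mut_C")


def sums_over_samples(sample_values):
    """For each mutation, sum sample_values over the samples carrying it.

    The subtotal of each node is computed once and reused by every
    mutation above it, instead of re-reading a full row of a table.
    """
    @lru_cache(maxsize=None)
    def subtotal(node):
        if node in sample_values:
            return sample_values[node]
        return sum(subtotal(child) for child in CHILDREN.get(node, ()))

    return {m: subtotal(m) for m in MUTATIONS}


# With all-ones input, the result is the allele count of each mutation.
allele_counts = sums_over_samples({s: 1 for s in SAMPLES})

# Cross-check against the equivalent dense 0/1 table (rows = mutations,
# columns in SAMPLES order), using arbitrary per-sample values.
DENSE = {
    "mut_A": [1, 1, 1, 1, 0, 0],
    "mut_B": [1, 1, 1, 1, 1, 0],
    "mut_C": [1, 1, 1, 1, 0, 1],
}
values = dict(zip(SAMPLES, [2, 0, 1, 3, 5, 7]))
graph_result = sums_over_samples(values)
dense_result = {
    m: sum(x * values[s] for x, s in zip(row, SAMPLES))
    for m, row in DENSE.items()
}
print(allele_counts, graph_result == dense_result)
```

For real data, use the library's own matmul entry point as described in the documentation rather than a hand-written traversal like this one.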

If you need to cite something, use "Enabling efficient analysis of biobank-scale data with genotype representation graphs".

Documentation

Check out the main documentation for core API documentation, examples, tutorials, etc. Things covered in the documentation include:

  • Creating and using GRGs
  • Performing GWAS, PCA, GWAS with covariates, or other analyses with GRG
  • Simulating phenotypes with GRG
  • Using GRG with Python (integration with numpy, pandas, scipy, etc.)

You can also download the tutorials as Jupyter Notebooks and work through them interactively.

Genotype Representation Graph Library (GRGL)

GRGL can be used as a library in both C++ and Python. Support is currently limited to Linux and macOS. It contains both an API (see docs) and a set of command-line tools.

Installing from pip

If you just want to use the tools (e.g., constructing a GRG or converting a tree-sequence to GRG) and the Python API, you can install via pip (from PyPI).

pip install pygrgl

This uses prebuilt packages on most modern Linux systems and builds from source on macOS. Building from source requires CMake (at least v3.14), zlib development headers, and a Clang or GCC compiler that supports C++11.
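Before a source build, it can help to confirm that the prerequisites are visible. The following is a minimal sketch, not part of GRGL: the function names are hypothetical, it only looks for CMake and a compiler on PATH, and it does not check for zlib development headers or C++11 support.

```python
import re
import shutil
import subprocess

MIN_CMAKE = (3, 14)  # minimum CMake version stated above


def parse_cmake_version(output):
    """Extract (major, minor) from `cmake --version` output, or None."""
    match = re.search(r"cmake version (\d+)\.(\d+)", output)
    return (int(match.group(1)), int(match.group(2))) if match else None


def cmake_is_new_enough(output, minimum=MIN_CMAKE):
    """True if the reported CMake version is at least `minimum`."""
    version = parse_cmake_version(output)
    return version is not None and version >= minimum


def check_build_tools():
    """Report which source-build prerequisites are visible on PATH."""
    report = {}
    cmake = shutil.which("cmake")
    if cmake is None:
        report["cmake"] = "not found"
    else:
        out = subprocess.run(
            [cmake, "--version"], capture_output=True, text=True
        ).stdout
        ok = cmake_is_new_enough(out)
        report["cmake"] = "ok" if ok else "too old or unrecognized"
    compilers = [c for c in ("gcc", "g++", "clang", "clang++") if shutil.which(c)]
    report["compiler"] = ", ".join(compilers) if compilers else "not found"
    return report


if __name__ == "__main__":
    for tool, status in check_build_tools().items():
        print(f"{tool}: {status}")
```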

Installing from conda

You can also install the conda package from the bioconda channel: conda install -c bioconda pygrgl (the -c option can be dropped if bioconda is already among your configured channels).

Building (Python)

The Python installation installs the command line tools and Python libraries (the C++ executables are packaged as part of this). Make sure you clone with git clone --recursive!

Requires Python 3.7 or newer to be installed (including development headers). It is recommended that you build/install in a virtual environment.

python3 -m venv /path/to/MyEnv
source /path/to/MyEnv/bin/activate
python setup.py bdist_wheel               # Compiles C++, builds a wheel in the dist/ directory
pip install --force-reinstall dist/*.whl  # Install from wheel

Build and installation should take at most a few minutes on a typical computer. For more details on build options, see DEVELOPING.md.

Building (C++ only)

The C++ build is only necessary for folks who want to include GRGL as a library in their C++ project. Typically, you would include our CMake into your project via add_subdirectory, but you can also build standalone as below. Make sure you clone with git clone --recursive!

If you only intend to use GRGL from C++, you can just build it via CMake:

mkdir build && cd build
cmake .. -DCMAKE_BUILD_TYPE=Release
make -j4

To install the libraries on your system, follow the example below. Installing to a custom location (prefix) is recommended, because packages installed via make install are otherwise a pain to remove. Example:

mkdir /path/to/grgl_installation/
mkdir build && cd build
cmake .. -DCMAKE_BUILD_TYPE=Release -DCMAKE_INSTALL_PREFIX=/path/to/grgl_installation/
make -j4
make install
# There should now be bin/, lib/, etc., directories under /path/to/grgl_installation/

Building (Docker)

We've included a Dockerfile if you want to use GRGL in a container.

Example to build:

docker build . -t grgl:latest

Example to run, constructing a GRG from an example VCF file:

docker run -v $PWD:/working -it grgl:latest bash -c "cd /working && grg construct --force /working/test/inputs/msprime.example.vcf"

Usage (Command line)

There is a command line tool that is mostly for file format conversion and performing common computations on the GRG. For more flexibility, use the Python or C++ APIs. After building and installing the Python version, run grg --help to see all the command options. Some examples are below.

Convert a tskit tree-sequence into a GRG. This creates my_arg_data.grg from my_arg_data.trees:

grg convert /path/to/my_arg_data.trees my_arg_data.grg

Load a GRG and emit some simple statistics about the GRG itself:

grg process stats my_arg_data.grg

To construct a GRG from a VCF file, use the grg construct command. (NOTE: raw VCF is very slow for non-trivial datasets; use BGZF-compressed VCF indexed with tabix, or IGD.)

grg construct -j 1 path/to/foo.vcf.gz

To convert a VCF(.gz) to an IGD and then build a GRG:

pip install igdtools
igdtools path/to/foo.vcf -o foo.igd
grg construct -j 1 foo.igd

Increase -j to the number of threads you have. Construction for small datasets (such as those included as tests in this repository) should be very fast, on the order of seconds. Really large datasets (such as Biobank-scale whole genome sequences) can take on the order of hours when using lots of threads (e.g., 70). 1,000 Genomes Project chromosomes usually take on the order of a few minutes.
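If you drive the command-line tools from a script, the commands above can be assembled in Python. This is a sketch under assumptions: the helper names are hypothetical, only the flags shown above are used, and `os.cpu_count()` (logical CPUs) stands in for "the number of threads you have".

```python
import os
import shlex


def construct_command(input_path, threads=None):
    """Build the argument list for `grg construct` with a -j thread count."""
    if threads is None:
        # Logical CPU count; fall back to 1 if it cannot be determined.
        threads = os.cpu_count() or 1
    return ["grg", "construct", "-j", str(threads), input_path]


def vcf_to_grg_commands(vcf_path, igd_path, threads=None):
    """Two steps from above: VCF -> IGD with igdtools, then IGD -> GRG."""
    return [
        ["igdtools", vcf_path, "-o", igd_path],
        construct_command(igd_path, threads),
    ]


# Print the commands rather than running them.
for argv in vcf_to_grg_commands("path/to/foo.vcf", "foo.igd", threads=8):
    print(" ".join(shlex.quote(arg) for arg in argv))
```

To actually execute a step, each argument list can be passed to `subprocess.run(argv, check=True)`, provided grg and igdtools are installed and on PATH.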

Usage (Python API)

See the provided Jupyter notebooks and GettingStarted.md for more examples.

Limits

Quantity                      Limit
Haploid samples               2,147,483,646
Total nodes                   2,147,483,646
Total mutations (variants)    4,294,967,294
Total edges                   18,446,744,073,709,551,615
Edges to/from a single node   4,294,967,295

Note: Node limits can theoretically be expanded to about a trillion, by turning off the COMPACT_NODE_IDS preprocessor flag, but this mode is not well tested.
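The numbers in the table coincide with the ranges of fixed-width integers; that reading is an inference from the values, not something stated above. A short sketch, with hypothetical names, that reproduces the limits and checks a planned dataset against them:

```python
# Limits from the table, written as powers of two (inferred, see above).
LIMITS = {
    "haploid_samples": 2**31 - 2,
    "total_nodes": 2**31 - 2,
    "total_mutations": 2**32 - 2,
    "total_edges": 2**64 - 1,
    "edges_per_node": 2**32 - 1,
}


def exceeded_limits(counts):
    """Return the names of any quantities in `counts` above their limit."""
    return [name for name, value in counts.items() if value > LIMITS[name]]


# Hypothetical dataset: each diploid individual contributes two haploid samples.
diploid_individuals = 500_000
planned = {
    "haploid_samples": 2 * diploid_individuals,
    "total_mutations": 600_000_000,
}
print(exceeded_limits(planned))
```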
