
Genotype Representation Graph Library



Genotype Representation Graphs

A Genotype Representation Graph (GRG) is a compact way to store reference-aligned genotype data for large genetic datasets. These datasets are typically stored in tabular formats (VCF, BCF, BGEN, etc.) and then compressed using off-the-shelf compression. In contrast, a GRG contains Mutation nodes (representing variants) and Sample nodes (representing haploid samples), where there is a path from a Mutation node to a Sample node if and only if that sample contains that mutation. These paths pass through internal nodes that represent common ancestry among multiple samples, which can result in significant compression (30-50x smaller than .vcf.gz). Calculations on the whole dataset can be performed very quickly on a GRG using GRGL.
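The path property described above can be illustrated with a toy graph in plain Python. This is only a sketch of the idea, not GRGL's data structure or API: the node names (`m1`, `anc_a`, `s0`, ...) and the `samples_with` helper are hypothetical, chosen for illustration.

```python
from collections import deque

# Toy graph: edges point "down" from mutation nodes toward sample nodes.
# All node names here are hypothetical and exist only for this sketch.
EDGES = {
    "m1": ["anc_a"],        # mutation m1 sits above a shared ancestor
    "m2": ["anc_a", "s3"],  # m2 reaches the same ancestor plus sample s3
    "m3": ["s0"],
    "anc_a": ["s1", "s2"],  # internal node shared by m1 and m2
}
SAMPLES = {"s0", "s1", "s2", "s3"}


def samples_with(mutation):
    """Return the set of sample nodes reachable from a mutation node."""
    seen, queue, found = {mutation}, deque([mutation]), set()
    while queue:
        node = queue.popleft()
        if node in SAMPLES:
            found.add(node)
        for child in EDGES.get(node, []):
            if child not in seen:
                seen.add(child)
                queue.append(child)
    return found


for m in ("m1", "m2", "m3"):
    print(m, sorted(samples_with(m)))
```

The internal node `anc_a` is stored once but serves both `m1` and `m2`, which is the kind of sharing that lets a graph take less space than a table with one row per variant.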

Recent releases (after v2.3) include the following improvements over the initial paper:

  1. Graphs are more than 2x smaller (in RAM and on disk)
  2. Graph construction is 10-25x faster
  3. Loading graphs from disk is 10-20x faster
  4. A first-class matrix multiplication API (matmul)
  5. (Prototype) Support for unphased data
  6. GWAS, GWAS with covariates, PCA, and other analyses are available with grapp (pip install grapp)
  7. Phenotype simulation is available with grg_pheno_sim (pip install grg_pheno_sim)
  8. Construction from .vcf.gz now supports tabix indexes, making that input format feasible for large datasets
  9. Better support for missing data (see the documentation)
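To give a feel for why graph-based multiplication (item 4) can avoid materializing the genotype matrix, here is a toy sketch in plain Python. It does not use the GRGL matmul API. The graph, the node names, and the `mutation_dot` helper are hypothetical, and the sketch assumes that every sample is reached by at most one path from any node, so that summing child values never double-counts.

```python
# Toy graph: edges point from parent to children; samples are leaves.
# Assumption for this sketch: each sample is reached by at most one path
# from any node, so summing over children never counts a sample twice.
CHILDREN = {
    "m1": ["anc"],
    "m2": ["anc", "s3"],
    "anc": ["s1", "s2"],
    "s0": [], "s1": [], "s2": [], "s3": [],
}
MUTATIONS = ["m1", "m2"]


def mutation_dot(sample_values):
    """For each mutation, sum sample_values over the samples beneath it."""
    memo = {}

    def value(node):
        if node not in memo:
            kids = CHILDREN[node]
            if kids:
                # Inner nodes: sum of child values, each computed only once.
                memo[node] = sum(value(k) for k in kids)
            else:
                # Leaves (samples) carry the input value.
                memo[node] = sample_values.get(node, 0.0)
        return memo[node]

    return [value(m) for m in MUTATIONS]


# An all-ones input yields the allele count of each mutation.
print(mutation_dot({"s0": 1, "s1": 1, "s2": 1, "s3": 1}))
```

The value at `anc` is computed once and reused by both mutations. This is equivalent to multiplying the dense mutation-by-sample 0/1 matrix by the input vector, but the work scales with the number of edges rather than the number of matrix entries.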

If you need to cite something, use "Enabling efficient analysis of biobank-scale data with genotype representation graphs".

Documentation

Check out the main documentation for core API documentation, examples, tutorials, etc. Things covered in the documentation include:

  • Creating and using GRGs
  • Performing GWAS, PCA, GWAS with covariates, or other analyses with GRG
  • Simulating phenotypes with GRG
  • Using GRG with Python (integration with numpy, pandas, scipy, etc.)

You can also download the tutorials as Jupyter Notebooks and work through them interactively.

Genotype Representation Graph Library (GRGL)

GRGL can be used as a library in both C++ and Python. Support is currently limited to Linux and macOS. It contains both an API (see docs) and a set of command-line tools.

Installing from pip

If you just want to use the tools (e.g., constructing a GRG or converting a tree-sequence to GRG) and the Python API, you can install via pip (from PyPI).

pip install pygrgl

This will use prebuilt packages on most modern Linux systems, and will build from source on macOS. Building from source requires CMake (at least v3.14), zlib development headers, and a clang or GCC compiler that supports C++11.
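If pip falls back to a source build, it can help to check the CMake requirement up front. The following is a minimal sketch, not part of GRGL: the helper names `parse_cmake_version` and `cmake_is_new_enough` are hypothetical, and the check only covers CMake (not zlib headers or the compiler).

```python
import re
import shutil
import subprocess


def parse_cmake_version(text):
    """Extract (major, minor, patch) from `cmake --version` output, or None."""
    match = re.search(r"cmake version (\d+)\.(\d+)(?:\.(\d+))?", text)
    if match is None:
        return None
    return tuple(int(part or 0) for part in match.groups())


def cmake_is_new_enough(minimum=(3, 14, 0)):
    """True if a cmake executable on PATH reports at least `minimum`."""
    exe = shutil.which("cmake")
    if exe is None:
        return False
    result = subprocess.run([exe, "--version"], capture_output=True, text=True)
    version = parse_cmake_version(result.stdout)
    return version is not None and version >= minimum


print("CMake >= 3.14 available:", cmake_is_new_enough())
```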

Installing from conda

You can also install the conda package from the bioconda channel: conda install -c bioconda pygrgl (the -c bioconda flag can be dropped if that channel is already configured).

Building (Python)

The Python installation installs the command line tools and Python libraries (the C++ executables are packaged as part of this). Make sure you clone with git clone --recursive!

Building requires Python 3.7 or newer (including development headers). It is recommended that you build and install in a virtual environment.

python3 -m venv /path/to/MyEnv
source /path/to/MyEnv/bin/activate
python setup.py bdist_wheel               # Compiles C++, builds a wheel in the dist/ directory
pip install --force-reinstall dist/*.whl  # Install from wheel

Build and installation should take at most a few minutes on a typical computer. For more details on build options, see DEVELOPING.md.

Building (C++ only)

The C++ build is only necessary for folks who want to include GRGL as a library in their C++ project. Typically, you would include our CMake into your project via add_subdirectory, but you can also build standalone as below. Make sure you clone with git clone --recursive!

If you only intend to use GRGL from C++, you can just build it via CMake:

mkdir build && cd build
cmake .. -DCMAKE_BUILD_TYPE=Release
make -j4

To install the libraries on your system, see the example below. Installing to a custom location (prefix) is recommended, since removing packages installed via make install is a pain otherwise. Example:

mkdir /path/to/grgl_installation/
mkdir build && cd build
cmake .. -DCMAKE_BUILD_TYPE=Release -DCMAKE_INSTALL_PREFIX=/path/to/grgl_installation/
make -j4
make install
# There should now be bin/, lib/, etc., directories under /path/to/grgl_installation/

Building (Docker)

We've included a Dockerfile if you want to use GRGL in a container.

Example to build:

docker build . -t grgl:latest

Example to run, constructing a GRG from an example VCF file:

docker run -v $PWD:/working -it grgl:latest bash -c "cd /working && grg construct --force /working/test/inputs/msprime.example.vcf"

Usage (Command line)

There is a command line tool that is mostly for file format conversion and performing common computations on the GRG. For more flexibility, use the Python or C++ APIs. After building and installing the Python version, run grg --help to see all the command options. Some examples are below.

Convert a tskit tree-sequence into a GRG. This creates my_arg_data.grg from my_arg_data.trees:

grg convert /path/to/my_arg_data.trees my_arg_data.grg

Load a GRG and emit some simple statistics about the GRG itself:

grg process stats my_arg_data.grg

To construct a GRG from a VCF file, use the grg construct command. (NOTE: raw VCF is incredibly slow for non-trivial datasets; use either BGZF-compressed VCF indexed with tabix, or IGD.)

grg construct -j 1 path/to/foo.vcf.gz

To convert a VCF(.gz) to an IGD and then build a GRG:

pip install igdtools
igdtools path/to/foo.vcf -o foo.igd
grg construct -j 1 foo.igd

Increase -j to the number of threads you have. Construction for small datasets (such as those included as tests in this repository) should be very fast, on the order of seconds. Really large datasets (such as Biobank-scale whole genome sequences) can take on the order of hours when using lots of threads (e.g., 70). 1,000 Genomes Project chromosomes usually take on the order of a few minutes.
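When scripting construction, the thread count can be filled in from the machine rather than hard-coded. The sketch below only builds the command line using the `grg construct -j` form shown above. The `construct_command` helper is hypothetical, and using `os.cpu_count()` as "the number of threads you have" is an assumption that may overcount on shared or containerized machines.

```python
import os
import shlex


def construct_command(input_path, threads=None):
    """Build the argv list for `grg construct -j <threads> <input_path>`."""
    if threads is None:
        threads = os.cpu_count() or 1  # cpu_count() may return None
    if threads < 1:
        raise ValueError("threads must be at least 1")
    return ["grg", "construct", "-j", str(threads), str(input_path)]


cmd = construct_command("foo.igd", threads=4)
print(shlex.join(cmd))  # grg construct -j 4 foo.igd

# To actually run it (requires pygrgl to be installed):
#   import subprocess
#   subprocess.run(cmd, check=True)
```

Passing the command as a list avoids shell quoting problems when input paths contain spaces.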

Usage (Python API)

See the provided Jupyter notebooks and GettingStarted.md for more examples.

Limits

Quantity                       Limit
Haploid samples                2,147,483,646
Total nodes                    2,147,483,646
Total mutations (variants)     4,294,967,294
Total edges                    18,446,744,073,709,551,615
Edges to/from a single node    4,294,967,295

Note: Node limits can theoretically be expanded to about a trillion, by turning on the LARGE_NODE_IDS preprocessor flag, but this mode is not well tested.
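The quoted limits line up with fixed-width integer ranges, which the following check makes explicit. Reading them as 31-, 32-, and 64-bit counters with a few reserved values is an interpretation, not something stated above; the snippet only verifies the arithmetic.

```python
# Each entry maps a quantity to (limit quoted in the table, power-of-two form).
# The power-of-two forms are an interpretation, checked here numerically.
LIMITS = {
    "Haploid samples": (2_147_483_646, 2**31 - 2),
    "Total nodes": (2_147_483_646, 2**31 - 2),
    "Total mutations (variants)": (4_294_967_294, 2**32 - 2),
    "Total edges": (18_446_744_073_709_551_615, 2**64 - 1),
    "Edges to/from a single node": (4_294_967_295, 2**32 - 1),
}

for name, (quoted, expression) in LIMITS.items():
    assert quoted == expression, name
    print(f"{name}: {quoted:,}")
```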

