Genotype Representation Graph Library

Genotype Representation Graphs

A Genotype Representation Graph (GRG) is a compact way to store reference-aligned genotype data for large genetic datasets. These datasets are typically stored in tabular formats (VCF, BCF, BGEN, etc.) and then compressed with off-the-shelf compression. In contrast, a GRG contains Mutation nodes (representing variants) and Sample nodes (representing haploid samples), where there is a path from a Mutation node to a Sample node if and only if that sample contains that mutation. These paths go through internal nodes that represent common ancestry between multiple samples, which can result in significant compression (30-50x smaller than .vcf.gz). Calculations on the whole dataset can be performed very quickly on a GRG using GRGL. See our paper "Enabling efficient analysis of biobank-scale data with genotype representation graphs" for more details.

Since the publication of the paper, version 2.0 has been released, which further reduced the GRG size (by about half) and significantly sped up graph load time (by about 20x).
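
The path rule described above (a sample carries a mutation if and only if its Sample node is reachable from that Mutation node) can be illustrated with a small toy graph. This is only a sketch in plain Python: the ToyGRG class, its methods, and the node numbering are hypothetical and are not part of the GRGL API or its file format.

```python
from collections import defaultdict


class ToyGRG:
    """Toy DAG (not GRGL): nodes 0..num_samples-1 are haploid Sample nodes,
    and edges point from Mutation/internal nodes down toward samples."""

    def __init__(self, num_samples):
        self.num_samples = num_samples
        self.children = defaultdict(list)

    def add_edge(self, parent, child):
        self.children[parent].append(child)

    def samples_below(self, node, cache=None):
        """Return the set of Sample nodes reachable from `node`."""
        if cache is None:
            cache = {}
        if node in cache:
            return cache[node]
        if node < self.num_samples:
            result = frozenset([node])
        else:
            result = frozenset().union(
                *(self.samples_below(c, cache) for c in self.children[node])
            )
        cache[node] = result
        return result


# Four haploid samples (0-3). Node 4 is an internal node shared by samples 0 and 1.
grg = ToyGRG(num_samples=4)
for parent, child in [(4, 0), (4, 1), (5, 4), (5, 2), (6, 4), (7, 4), (7, 3)]:
    grg.add_edge(parent, child)

# Hypothetical mutation labels mapped to their Mutation nodes.
mutations = {"A": 5, "B": 6, "C": 7}
carriers = {name: sorted(grg.samples_below(node)) for name, node in mutations.items()}

# The graph stores 7 edges, while a sample-by-mutation table needs 8 non-zero entries.
num_edges = sum(len(kids) for kids in grg.children.values())
num_genotypes = sum(len(samples) for samples in carriers.values())
print(carriers, num_edges, num_genotypes)
```

The shared internal node is what saves space here: three mutations reuse the edges to samples 0 and 1 instead of each listing them separately.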

Genotype Representation Graph Library (GRGL)

GRGL can be used as a library in both C++ and Python. Support is currently limited to Linux and macOS. It contains both an API (see docs) and a set of command-line tools.

Installing from pip

If you just want to use the tools (e.g., constructing a GRG or converting a tree-sequence to a GRG) and the Python API, you can install via pip (from PyPI).

pip install pygrgl

This will use prebuilt packages on most modern Linux distributions, and will build from source on macOS. Building from source requires CMake (at least v3.14), zlib development headers, and a Clang or GCC compiler that supports C++11.
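
Before a source build, it can help to confirm that some of these prerequisites are on the PATH. The following is a minimal sketch, not part of GRGL: the function names are hypothetical, it assumes the usual "cmake version X.Y.Z" output of cmake --version, and it does not check for zlib headers or C++11 support.

```python
import re
import shutil
import subprocess

MIN_CMAKE = (3, 14)  # minimum CMake version stated above


def parse_cmake_version(text):
    """Extract (major, minor) from `cmake --version` output, or None."""
    match = re.search(r"cmake version (\d+)\.(\d+)", text)
    return (int(match.group(1)), int(match.group(2))) if match else None


def check_build_prereqs():
    """Report whether CMake and a C++ compiler appear to be available."""
    report = {}
    cmake = shutil.which("cmake")
    if cmake is None:
        report["cmake"] = "not found"
    else:
        out = subprocess.run(
            [cmake, "--version"], capture_output=True, text=True
        ).stdout
        version = parse_cmake_version(out)
        ok = version is not None and version >= MIN_CMAKE
        report["cmake"] = "ok" if ok else f"too old or unrecognized: {version}"
    compilers = [c for c in ("g++", "clang++") if shutil.which(c)]
    report["compiler"] = ", ".join(compilers) if compilers else "not found"
    return report


if __name__ == "__main__":
    for tool, status in check_build_prereqs().items():
        print(f"{tool}: {status}")
```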

Building (Python)

The Python installation installs the command line tools and Python libraries (the C++ executables are packaged as part of this). Make sure you clone with git clone --recursive!

Requires Python 3.7 or newer to be installed (including development headers). It is recommended that you build/install in a virtual environment.

python3 -m venv /path/to/MyEnv
source /path/to/MyEnv/bin/activate
python setup.py bdist_wheel               # Compiles C++, builds a wheel in the dist/ directory
pip install --force-reinstall dist/*.whl  # Install from wheel

Build and installation should take at most a few minutes on a typical computer. For more details on build options, see DEVELOPING.md.

Building (C++ only)

The C++ build is only necessary if you want to include GRGL as a library in your C++ project. Typically, you would include our CMake build in your project via add_subdirectory, but you can also build standalone as shown below. Make sure you clone with git clone --recursive!

To build standalone via CMake:

mkdir build && cd build
cmake .. -DCMAKE_BUILD_TYPE=Release
make -j4

To install the libraries to your system, use make install. It is recommended to install to a custom location (prefix), since packages installed via make install to the default location are hard to remove. Example:

mkdir /path/to/grgl_installation/
mkdir build && cd build
cmake .. -DCMAKE_BUILD_TYPE=Release -DCMAKE_INSTALL_PREFIX=/path/to/grgl_installation/
make -j4
make install
# There should now be bin/, lib/, etc., directories under /path/to/grgl_installation/

Building (Docker)

We've included a Dockerfile if you want to use GRGL in a container.

Example to build:

docker build . -t grgl:latest

Example to run, constructing a GRG from an example VCF file:

docker run -v "$PWD":/working -it grgl:latest bash -c "cd /working && grg construct /working/test/inputs/msprime.example.vcf"

Usage (Command line)

There is a command line tool that is mostly for file format conversion and performing common computations on the GRG. For more flexibility, use the Python or C++ APIs. After building and installing the Python version, run grg --help to see all the command options. Some examples are below.

Convert a tskit tree-sequence into a GRG. This creates my_arg_data.grg from my_arg_data.trees:

grg convert /path/to/my_arg_data.trees my_arg_data.grg

Load a GRG and emit some simple statistics about the GRG itself:

grg process stats my_arg_data.grg

To construct a GRG from a VCF file, use the grg construct command:

grg construct --parts 20 -j 1 path/to/foo.vcf

WARNING: VCF access for GRG is not indexed and is generally very slow. For anything beyond toy datasets, it is recommended to convert VCF files to IGD first. You can use the grg convert tool (available as part of GRGL) or igdtools from picovcf.

To convert a VCF(.gz) to an IGD and then build a GRG:

grg convert path/to/foo.vcf foo.igd
grg construct --parts 20 -j 1 foo.igd
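
The two-step conversion above can also be scripted. The sketch below only assembles the same two command lines shown above and optionally runs them. The helper names build_grg_commands and run_pipeline are hypothetical, and it assumes grg is on the PATH and that the IGD file is written to the current directory, as in the example.

```python
import subprocess
from pathlib import Path


def build_grg_commands(vcf_path, parts=20, jobs=1):
    """Return the `grg convert` and `grg construct` argument lists for a VCF(.gz)."""
    name = Path(vcf_path).name
    # Derive foo.igd from foo.vcf or foo.vcf.gz, mirroring the example above.
    for suffix in (".vcf.gz", ".vcf"):
        if name.endswith(suffix):
            name = name[: -len(suffix)]
            break
    igd_path = f"{name}.igd"
    convert = ["grg", "convert", str(vcf_path), igd_path]
    construct = ["grg", "construct", "--parts", str(parts), "-j", str(jobs), igd_path]
    return [convert, construct]


def run_pipeline(vcf_path, parts=20, jobs=1):
    """Run both steps in order, stopping at the first failure."""
    for cmd in build_grg_commands(vcf_path, parts=parts, jobs=jobs):
        subprocess.run(cmd, check=True)


# Print the commands without executing them.
for cmd in build_grg_commands("path/to/foo.vcf"):
    print(" ".join(cmd))
```

Building the argument lists separately from running them makes it easy to inspect or log the commands before starting a long construction job.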

Construction for small datasets (such as those included as tests in this repository) should be very fast, taking a few minutes at most. Very large datasets (such as biobank-scale data) can take on the order of a day, even when using many threads (e.g., 70).

Usage (Python API)

See the provided Jupyter notebooks and GettingStarted.md for more examples.
