Genotype Representation Graph Library

Genotype Representation Graphs

A Genotype Representation Graph (GRG) is a compact way to store reference-aligned genotype data for large genetic datasets. Such datasets are typically stored in tabular formats (VCF, BCF, BGEN, etc.) and then compressed with off-the-shelf compression. In contrast, a GRG contains Mutation nodes (representing variants) and Sample nodes (representing haploid samples), with a path from a Mutation node to a Sample node if and only if that sample carries that mutation. These paths pass through internal nodes that represent common ancestry shared by multiple samples, which can yield significant compression (10-15x smaller than .vcf.gz). Calculations over the whole dataset can be performed very quickly on a GRG using GRGL. See our paper "Enabling efficient analysis of biobank-scale data with genotype representation graphs" for more details.
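The path rule described above can be illustrated with a toy model. The sketch below is not GRGL's data structure or API; the class `ToyGRG`, its methods, and the node names are hypothetical and exist only to show how "sample carries mutation" reduces to graph reachability.

```python
from collections import defaultdict


class ToyGRG:
    """Toy directed acyclic graph; edges point from a node toward the samples below it."""

    def __init__(self):
        self.children = defaultdict(set)

    def add_edge(self, parent, child):
        self.children[parent].add(child)

    def samples_below(self, node, sample_ids):
        """Return the set of sample nodes reachable from `node`."""
        seen, found, stack = set(), set(), [node]
        while stack:
            current = stack.pop()
            if current in seen:
                continue
            seen.add(current)
            if current in sample_ids:
                found.add(current)
            stack.extend(self.children.get(current, ()))
        return found


samples = {"s0", "s1", "s2", "s3"}
grg = ToyGRG()

# Internal node "a" stands for ancestry shared by s0 and s1.
grg.add_edge("a", "s0")
grg.add_edge("a", "s1")
# Internal node "b" sits above "a" and s2.
grg.add_edge("b", "a")
grg.add_edge("b", "s2")

# Mutation nodes attach to the node whose descendants carry them.
grg.add_edge("m1", "a")   # carried by s0 and s1
grg.add_edge("m2", "b")   # carried by s0, s1 and s2
grg.add_edge("m3", "s3")  # carried by s3 only

carriers = {m: grg.samples_below(m, samples) for m in ("m1", "m2", "m3")}
for mutation, carried_by in sorted(carriers.items()):
    print(mutation, sorted(carried_by))
```

Both m1 and m2 reuse the edges under node "a", which is the kind of sharing the internal nodes are meant to provide.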

Genotype Representation Graph Library (GRGL)

GRGL can be used as a library in both C++ and Python. Support is currently limited to Linux and macOS. GRGL provides both an API (see docs) and a set of command-line tools.

Installing from pip

If you just want to use the tools (e.g., constructing a GRG or converting a tree-sequence to a GRG) and the Python API, you can install via pip (from PyPI).

pip install pygrgl

This will use prebuilt packages on most modern Linux systems, and will build from source on macOS. Building from source requires CMake (at least v3.14), zlib development headers, and a clang or GCC compiler that supports C++11.
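Before a source build, it can help to confirm that the CMake on your PATH meets the v3.14 minimum. The following is a minimal sketch using only the Python standard library; the helper names `parse_cmake_version` and `cmake_is_new_enough` are hypothetical, and the check assumes the usual "cmake version X.Y.Z" first line printed by `cmake --version`.

```python
import re
import shutil
import subprocess

MIN_CMAKE = (3, 14)  # minimum version stated in the requirements above


def parse_cmake_version(output):
    """Extract (major, minor) from the text printed by `cmake --version`."""
    match = re.search(r"cmake version (\d+)\.(\d+)", output)
    if match is None:
        return None
    return int(match.group(1)), int(match.group(2))


def cmake_is_new_enough(minimum=MIN_CMAKE):
    """Return True if a cmake executable on PATH reports at least `minimum`."""
    exe = shutil.which("cmake")
    if exe is None:
        return False
    result = subprocess.run(
        [exe, "--version"], capture_output=True, text=True, check=False
    )
    version = parse_cmake_version(result.stdout)
    return version is not None and version >= minimum


if __name__ == "__main__":
    print("CMake OK" if cmake_is_new_enough() else "CMake missing or older than 3.14")
```

This does not check for the zlib headers or the compiler; those are still easiest to verify by attempting the build.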

Building (C++ only)

Make sure you clone with git clone --recursive!

If you only intend to use GRGL from C++, you can just build it via CMake:

mkdir build && cd build
cmake .. -DCMAKE_BUILD_TYPE=Release
make -j4

To install the libraries to your system, it is recommended to use a custom location (prefix), since packages installed via make install are otherwise tedious to remove. Example:

mkdir /path/to/grgl_installation/
mkdir build && cd build
cmake .. -DCMAKE_BUILD_TYPE=Release -DCMAKE_INSTALL_PREFIX=/path/to/grgl_installation/
make -j4
make install
# There should now be bin/, lib/, etc., directories under /path/to/grgl_installation/

Building (Python)

Make sure you clone with git clone --recursive!

Requires Python 3.7 or newer to be installed (including development headers). It is recommended that you build/install in a virtual environment.

python3 -m venv /path/to/MyEnv
source /path/to/MyEnv/bin/activate
python setup.py bdist_wheel               # Compiles C++, builds a wheel in the dist/ directory
pip install --force-reinstall dist/*.whl  # Install from wheel

Build and installation should take at most a few minutes on a typical computer. For more details on build options, see DEVELOPING.md.
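After installing the wheel, a quick check from inside the virtual environment can confirm that both the Python module and the grg tool are visible. This is a sketch under one assumption: that the importable module shares the name of the pip package, pygrgl. The helper names are hypothetical.

```python
import importlib.util
import shutil


def module_available(name):
    """Return True if a top-level module called `name` can be located."""
    return importlib.util.find_spec(name) is not None


def report_install(module_name="pygrgl", tool_name="grg"):
    """Report whether the Python module and the command-line tool are both visible."""
    return {
        "module": module_available(module_name),
        "tool": shutil.which(tool_name) is not None,
    }


if __name__ == "__main__":
    for item, found in report_install().items():
        print(f"{item}: {'found' if found else 'NOT found'}")
```

If the tool is reported as missing, the virtual environment is probably not activated in the current shell.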

Building (Docker)

We've included a Dockerfile if you want to use GRGL in a container.

Example to build:

docker build . -t grgl:latest

Example to run, constructing a GRG from an example VCF file:

docker run -v "$PWD":/working -it grgl:latest bash -c "cd /working && grg construct /working/test/inputs/msprime.example.vcf"

Usage (Command line)

The command-line tool is mostly for file format conversion and for performing common computations on a GRG. For more flexibility, use the Python or C++ APIs. After building and installing the Python version, run grg --help to see all the command options. Some examples are below.

Convert a tskit tree-sequence into a GRG. This creates my_arg_data.grg from my_arg_data.trees:

grg convert /path/to/my_arg_data.trees my_arg_data.grg

Load a GRG and emit some simple statistics about the GRG itself:

grg process stats my_arg_data.grg

To construct a GRG from a VCF file, use the grg construct command:

grg construct --parts 20 -j 1 path/to/foo.vcf

WARNING: VCF access for GRG is not indexed and is generally very slow. For anything beyond toy datasets, it is recommended to convert VCF files to IGD first. You can use the grg convert tool (available as part of GRGL) or igdtools from picovcf.

To convert a VCF(.gz) to an IGD and then build a GRG:

grg convert path/to/foo.vcf foo.igd
grg construct --parts 20 -j 1 foo.igd
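When the two steps above need to run over many files, they can be scripted. The sketch below builds the same grg convert and grg construct command lines from Python and runs them with subprocess. It uses only the subcommands and flags shown above (--parts, -j); the function names are hypothetical, and deriving the IGD file name from the VCF file name is an assumption made for illustration.

```python
import subprocess
from pathlib import Path


def convert_command(vcf_path, igd_path):
    """Argument list for: grg convert <vcf> <igd>"""
    return ["grg", "convert", str(vcf_path), str(igd_path)]


def construct_command(input_path, parts=20, jobs=1):
    """Argument list for: grg construct --parts <parts> -j <jobs> <input>"""
    return ["grg", "construct", "--parts", str(parts), "-j", str(jobs), str(input_path)]


def vcf_to_grg_commands(vcf_path, parts=20, jobs=1):
    """Return the convert-then-construct commands for one VCF or VCF.gz file."""
    vcf_path = Path(vcf_path)
    name = vcf_path.name
    for suffix in (".vcf.gz", ".vcf"):
        if name.endswith(suffix):
            name = name[: -len(suffix)]
            break
    igd_path = Path(name + ".igd")  # written to the current directory
    return [
        convert_command(vcf_path, igd_path),
        construct_command(igd_path, parts, jobs),
    ]


def run_all(commands):
    """Run each command in order, stopping at the first failure."""
    for argv in commands:
        subprocess.run(argv, check=True)


if __name__ == "__main__":
    for argv in vcf_to_grg_commands("path/to/foo.vcf"):
        print(" ".join(argv))
```

Passing the commands to run_all executes them; printing them first, as the example does, is a cheap dry run.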

Construction for small datasets (such as those included as tests in this repository) should be fast, taking a few minutes at most. Very large datasets (such as biobank-scale data) can take on the order of a day, even when using many threads (e.g., 70).

Usage (Python API)

See the provided Jupyter notebooks and GettingStarted.md for more examples.
