Statistical and population genetics methods on GRG

grapp

A library and command-line tool for tackling problems in statistical and population genetics, implemented on top of the Genotype Representation Graph (GRG) format. GRG is a file format and data structure that losslessly represents a genetic dataset. It compresses large datasets significantly and also makes calculations over them very fast (see the paper and the core library).

Some notes on usage are below, and you can also browse the Python documentation for grapp.

Check out the command-line cheatsheet for examples on how to perform many tasks.

Installation

pip install grapp

Modules

assoc

Perform association tests between phenotypes and genotypes.

Command Line

usage: grapp assoc [-h] [-p PHENOTYPES] [-c COVARIATES] [-o OUT_FILE] grg_input

positional arguments:
  grg_input             The input GRG file

options:
  -h, --help            show this help message and exit
  -p PHENOTYPES, --phenotypes PHENOTYPES
                        The file containing the phenotypes. If no file is provided, random phenotype values are used.
  -c COVARIATES, --covariates COVARIATES
                        Covariates text file to load
  -o OUT_FILE, --out-file OUT_FILE
                        Tab-separated output file (with header); exported Pandas DataFrame. Default: <grg_input>.assoc.tsv

Library

The library provides GWAS methods with covariates (grapp.assoc.linear_assoc_covar) and without covariates (grapp.assoc.linear_assoc_no_covar).
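
As a rough illustration of what a per-variant linear association test without covariates computes, the sketch below regresses a phenotype on a single genotype column using ordinary least squares. It is plain Python and does not call grapp: the helper name `simple_linear_assoc` and the toy data are hypothetical, and the actual signature and output of grapp.assoc.linear_assoc_no_covar should be taken from the Python documentation.

```python
import math

def simple_linear_assoc(genotypes, phenotypes):
    """Regress phenotype on one variant's genotype (OLS with intercept).

    Returns (beta, standard_error, t_statistic). Hypothetical helper for
    illustration only; this is not the grapp API.
    """
    n = len(genotypes)
    if n != len(phenotypes) or n < 3:
        raise ValueError("need at least 3 paired observations")
    mean_x = sum(genotypes) / n
    mean_y = sum(phenotypes) / n
    sxx = sum((x - mean_x) ** 2 for x in genotypes)
    if sxx == 0:
        raise ValueError("variant is monomorphic in this sample")
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(genotypes, phenotypes))
    beta = sxy / sxx
    intercept = mean_y - beta * mean_x
    # Residual sum of squares, with n - 2 degrees of freedom.
    rss = sum((y - intercept - beta * x) ** 2 for x, y in zip(genotypes, phenotypes))
    se = math.sqrt(rss / (n - 2) / sxx)
    t_stat = beta / se if se > 0 else math.inf
    return beta, se, t_stat

# Toy data: diploid genotype counts (0, 1, 2) and a quantitative phenotype.
genotypes = [0, 1, 2, 0, 1, 2]
phenotypes = [1.0, 2.1, 2.9, 0.9, 2.0, 3.1]
beta, se, t_stat = simple_linear_assoc(genotypes, phenotypes)
```

A genome-wide scan repeats this fit for every variant. Doing so one dense column at a time scales poorly to large datasets.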

linalg

Linear algebra functionality that integrates GRG with numpy and scipy. The main workhorses behind this module are the operators compatible with scipy.sparse.linalg.LinearOperator.

Command Line

Principal Component Analysis (PCA) is available via the command line:

usage: grapp pca [-h] [-d DIMENSIONS] [-o PCS_OUT] [--normalize] [--pro-pca] grg_input

positional arguments:
  grg_input             The input GRG file

options:
  -h, --help            show this help message and exit
  -d DIMENSIONS, --dimensions DIMENSIONS
                        The number of PCs to extract. Default: 10.
  -o PCS_OUT, --pcs-out PCS_OUT
                        Output filename to write the PCs to. Default: "<grg_input>.pcs.tsv"
  --normalize           Normalize the PCs according to sqrt(eigenvalue) for each.
  --pro-pca             Use the ProPCA algorithm to compute principal components.

Library

The core of the library is the set of LinearOperators that operate on GRGs:

grapp.linalg.ops_scipy.SciPyXOperator: An operator that performs matrix multiplication against the genotype matrix X (NxM) or its transpose (MxN).
grapp.linalg.ops_scipy.SciPyXTXOperator: An operator that performs matrix multiplication against the MxM product transpose(X) * X.
grapp.linalg.ops_scipy.SciPyStdXOperator: The same as SciPyXOperator, except the genotype matrix is standardized by using the allele frequencies (standard deviation via binomial distribution).
grapp.linalg.ops_scipy.SciPyStdXTXOperator: The same as SciPyXTXOperator, except the genotype matrix is standardized by using the allele frequencies (standard deviation via binomial distribution).
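
The idea behind these operators can be sketched without grapp: a product with transpose(X) * X never needs the MxM matrix, because it can be computed as two matrix-vector products. The pure-Python sketch below uses a small dense list-of-lists in place of a GRG. All function names are hypothetical. The standardization assumes diploid individuals with genotype counts 0/1/2 and a binomial variance of 2p(1-p), which may not match the exact convention used by grapp.

```python
import math

def matvec(X, v):
    """Compute X @ v for a row-major list-of-lists matrix X (N x M)."""
    return [sum(x * w for x, w in zip(row, v)) for row in X]

def rmatvec(X, u):
    """Compute transpose(X) @ u without building the transpose."""
    out = [0.0] * len(X[0])
    for row, weight in zip(X, u):
        for j, x in enumerate(row):
            out[j] += x * weight
    return out

def xtx_matvec(X, v):
    """Compute (transpose(X) @ X) @ v as two matrix-vector products."""
    return rmatvec(X, matvec(X, v))

def standardize(X, ploidy=2):
    """Center and scale each column using its allele frequency.

    Assumes binomial variance ploidy * p * (1 - p); monomorphic columns are
    set to zero. The exact convention used by grapp may differ.
    """
    n = len(X)
    m = len(X[0])
    freqs = [sum(row[j] for row in X) / (ploidy * n) for j in range(m)]
    result = []
    for row in X:
        new_row = []
        for j, x in enumerate(row):
            sd = math.sqrt(ploidy * freqs[j] * (1.0 - freqs[j]))
            new_row.append((x - ploidy * freqs[j]) / sd if sd > 0 else 0.0)
        result.append(new_row)
    return result

# Toy genotype matrix: 4 individuals (rows) by 3 variants (columns).
X = [
    [0, 1, 2],
    [1, 1, 0],
    [2, 0, 1],
    [1, 2, 1],
]
v = [1.0, 0.0, -1.0]
xtx_v = xtx_matvec(X, v)
std_X = standardize(X)
```

Solvers that accept a scipy.sparse.linalg.LinearOperator only need these two products, which is why an operator interface is enough to drive eigensolvers and similar routines.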

Additionally, there are helper methods for eigendecomposition (grapp.linalg.eigs) and PCA (grapp.linalg.PCs).
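
To show how an eigendecomposition can run against an operator rather than an explicit matrix, here is a minimal matrix-free sketch using power iteration. Power iteration is swapped in because it is the simplest such method; it is not a description of how grapp.linalg.eigs or grapp.linalg.PCs are implemented. The names `xtx_matvec` and `top_eigenpair` and the toy matrix are hypothetical.

```python
import math

def xtx_matvec(X, v):
    """Apply transpose(X) @ X to v using two passes over X."""
    xv = [sum(x * w for x, w in zip(row, v)) for row in X]
    out = [0.0] * len(v)
    for row, weight in zip(X, xv):
        for j, x in enumerate(row):
            out[j] += x * weight
    return out

def top_eigenpair(apply_op, dim, iterations=500, tol=1e-12):
    """Power iteration for the largest eigenpair of a symmetric PSD operator.

    Uses a fixed, non-uniform start vector; it fails to converge to the top
    eigenvector only if that start is orthogonal to it.
    """
    start = [float(i + 1) for i in range(dim)]
    norm = math.sqrt(sum(c * c for c in start))
    vec = [c / norm for c in start]
    eigenvalue = 0.0
    for _ in range(iterations):
        nxt = apply_op(vec)
        norm = math.sqrt(sum(c * c for c in nxt))
        if norm == 0.0:
            return 0.0, vec
        vec = [c / norm for c in nxt]
        # Rayleigh quotient of the normalized iterate estimates the eigenvalue.
        new_value = sum(a * b for a, b in zip(vec, apply_op(vec)))
        if abs(new_value - eigenvalue) < tol:
            eigenvalue = new_value
            break
        eigenvalue = new_value
    return eigenvalue, vec

# Column-centered toy matrix, so transpose(X) @ X is a scatter matrix.
X = [
    [1.0, 1.0],
    [-1.0, -1.0],
    [1.0, -1.0],
    [-1.0, 1.0],
    [2.0, 2.0],
    [-2.0, -2.0],
]
eigenvalue, eigenvector = top_eigenpair(lambda v: xtx_matvec(X, v), dim=2)
```

For this toy matrix, transpose(X) * X is [[12, 8], [8, 12]], whose largest eigenvalue is 20 with eigenvector proportional to (1, 1).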

util

Common utility functions for working with the GRG format.

Command Line

GRG can be exported to tabular data formats. Export to .vcf.gz is supported but slow. IGD is recommended instead: it is dramatically faster and more compact, while being structured similarly to VCF. The export command:

usage: grapp export [-h] [--igd IGD | --vcf VCF] [-f] [-j JOBS] [--temp-dir TEMP_DIR] [--contig CONTIG] grg_input

positional arguments:
  grg_input             The input GRG file

options:
  -h, --help            show this help message and exit
  --igd IGD             Export the entire dataset to the given IGD filename.
  --vcf VCF             Export the entire dataset to the given VCF filename. Use '-' to write to stdout (and, e.g., pipe through bgzip). If the filename ends with .gz then the Python GZIP codec will be used
                        (not bgzip). Otherwise, a plaintext VCF file will be created.
  -f, --force           Force overwrite of the output file, if it exists.
  -j JOBS, --jobs JOBS  Number of processes/threads to use, if possible. Default: 1.
  --temp-dir TEMP_DIR   Put all temporary files in the given directory, instead of creating a directory in the system temporary location. WARNING: Intermediate/temporary files will not be cleaned up when
                        this is specified.
  --contig CONTIG       Use the given contig name when exporting to VCF. Default: "unknown".
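
When exports are driven from a script, the command can be assembled from the options listed in the help text above. The following is a small sketch using Python's subprocess module. The helper names `build_export_command` and `run_export` and the file names are hypothetical, and only flags that appear in the help text are used.

```python
import subprocess

def build_export_command(grg_input, igd=None, vcf=None, force=False, jobs=1, contig=None):
    """Build the argument list for `grapp export` from the documented flags."""
    if igd is not None and vcf is not None:
        raise ValueError("--igd and --vcf are mutually exclusive")
    cmd = ["grapp", "export"]
    if igd is not None:
        cmd += ["--igd", str(igd)]
    if vcf is not None:
        cmd += ["--vcf", str(vcf)]
    if force:
        cmd.append("-f")
    if jobs != 1:
        cmd += ["-j", str(jobs)]
    if contig is not None:
        cmd += ["--contig", str(contig)]
    cmd.append(str(grg_input))
    return cmd

def run_export(*args, **kwargs):
    """Run the export; raises CalledProcessError on a non-zero exit status."""
    return subprocess.run(build_export_command(*args, **kwargs), check=True)

# Hypothetical file names, used only to show the resulting command line.
example_cmd = build_export_command("example.grg", igd="example.igd", jobs=4)
```

Passing the arguments as a list avoids shell quoting problems with unusual file names.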

GRG files can be filtered prior to performing analysis on them. See the filter command:

usage: grapp filter [-h] [-S INDIVIDUALS | --hap-samples HAP_SAMPLES | -P POPULATIONS] [-r RANGE] [-c MIN_AC] [-C MAX_AC] [-q MIN_AF] [-Q MAX_AF] [-v TYPES] [-A] [-m MIN_ALLELES] [-M MAX_ALLELES]
                    grg_input grg_output

positional arguments:
  grg_input             The input GRG file
  grg_output            The output GRG file

options:
  -h, --help            show this help message and exit

sample filters:
  -S INDIVIDUALS, --individuals INDIVIDUALS
                        Keep only the individuals with the IDs given as a comma-separated list or in the given filename.
  --hap-samples HAP_SAMPLES
                        Keep only the haploid samples with the NodeIDs (indexes) given as a comma-separated list or in the given filename.
  -P POPULATIONS, --populations POPULATIONS
                        Keep only the individuals with populations matching the comma-separated list or in the given filename.

mutation filters:
  -r RANGE, --range RANGE
                        Keep only the variants within the given range, in base pairs. Example: "lower-upper", where both are integers and lower is inclusive, upper is exclusive.
  -c MIN_AC, --min-ac MIN_AC
                        Minimum allele count to keep. All Mutations with count below this value will be dropped
  -C MAX_AC, --max-ac MAX_AC
                        Maximum allele count to keep. All Mutations with count above this value will be dropped
  -q MIN_AF, --min-af MIN_AF
                        Minimum allele frequency to keep. All Mutations with frequency below this value will be dropped
  -Q MAX_AF, --max-af MAX_AF
                        Maximum allele frequency to keep. All Mutations with frequency above this value will be dropped
  -v TYPES, --types TYPES
                        Comma-separated list of variant types to select. Site is selected if any of the ALT alleles is of the type requested. Types are determined by comparing the REF and ALT alleles.
  -A, --apply-to-sites  By default, all filters apply to each variant independently. This flag will cause an entire site to be dropped if any variants are filtered out.
  -m MIN_ALLELES, --min-alleles MIN_ALLELES
                        Only keep sites with at least this many alleles. Counts all REF alleles as 1.
  -M MAX_ALLELES, --max-alleles MAX_ALLELES
                        Only keep sites with at most this many alleles. Counts all REF alleles as 1. Use '-m 2 -M 2 -v snps' to view only biallelic SNPs.
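
The boundary behavior of the range, allele count, and allele frequency filters can be made concrete with a small sketch. This is plain Python that mirrors the wording of the help text (lower bound inclusive, upper bound exclusive; values strictly below a minimum or strictly above a maximum are dropped). It is not grapp's implementation: the function names are hypothetical, the allele frequency is assumed to be the allele count divided by the number of haplotypes, and the -v, -A, -m, and -M options are not modeled.

```python
def parse_range(text):
    """Parse "lower-upper" into integers (lower inclusive, upper exclusive)."""
    lower_text, upper_text = text.split("-", 1)
    lower, upper = int(lower_text), int(upper_text)
    if lower > upper:
        raise ValueError("lower bound exceeds upper bound")
    return lower, upper

def keep_variant(position, allele_count, num_haplotypes, bp_range=None,
                 min_ac=None, max_ac=None, min_af=None, max_af=None):
    """Return True if a variant passes every filter that was supplied."""
    if bp_range is not None:
        lower, upper = bp_range
        if not (lower <= position < upper):
            return False
    if min_ac is not None and allele_count < min_ac:
        return False
    if max_ac is not None and allele_count > max_ac:
        return False
    # Assumption: frequency is the count divided by the number of haplotypes.
    frequency = allele_count / num_haplotypes
    if min_af is not None and frequency < min_af:
        return False
    if max_af is not None and frequency > max_af:
        return False
    return True

# Toy variants as (position in base pairs, allele count) among 10 haplotypes.
variants = [(100, 1), (150, 5), (200, 9), (250, 10)]
bp_range = parse_range("100-250")
kept = [
    pos for pos, ac in variants
    if keep_variant(pos, ac, 10, bp_range=bp_range, min_ac=2, max_af=0.95)
]
```

Here the variant at position 100 is dropped by the minimum allele count, and the variant at position 250 is dropped because the upper bound of the range is exclusive.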

Library

The library API reference is here. There are also some tutorials in the GRG documentation (and corresponding Jupyter notebooks).

nn

Experimental library for nearest neighbors search over GRG.

The nn module lets you search a dataset stored as a GRG for nearest neighbors in a variety of ways:

Similarity is either between samples (haplotypes) or mutations (variants). I.e., you can ask to find samples that are similar to a given sample, or mutations that are similar to a given mutation.
Similarity is measured by the Hamming distance between items, which is the number of positions at which they differ. For two samples, this is the number of mutations carried by exactly one of them: Hamming(A, B) = |Muts(A)| + |Muts(B)| - 2*|Muts(A) intersect Muts(B)|.
There are APIs that let you query using a sample/mutation already in the dataset (GRG) or more generally you can query using an external sample/mutation that is not in the dataset, though your options may be slightly more limited in the latter case.
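
The Hamming distance formula above can be checked with a few lines of Python, treating each sample as the set of mutation IDs it carries. The brute-force search shown here only illustrates the definition and is not how the nn module searches a GRG. The function names, sample names, and mutation IDs are hypothetical.

```python
def hamming_distance(muts_a, muts_b):
    """Hamming(A, B) = |Muts(A)| + |Muts(B)| - 2*|Muts(A) intersect Muts(B)|."""
    a, b = set(muts_a), set(muts_b)
    return len(a) + len(b) - 2 * len(a & b)

def nearest_neighbors(query, samples, k=1):
    """Brute-force k nearest samples to the query, ties broken by name."""
    scored = [(hamming_distance(query, muts), name) for name, muts in samples.items()]
    scored.sort()
    return [(name, dist) for dist, name in scored[:k]]

# Toy dataset: each sample is the set of mutation IDs it carries.
samples = {
    "s1": {1, 2, 3},
    "s2": {2, 3, 4, 5},
    "s3": {7},
}
query = {1, 2, 3, 4}  # an external sample that is not in the dataset
neighbors = nearest_neighbors(query, samples, k=2)
```

The same distance equals the size of the symmetric difference of the two sets, which is a convenient cross-check.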

Release history

0.4 (this version): Apr 9, 2026
0.3: Feb 6, 2026
0.2: Jan 28, 2026
0.1: Aug 18, 2025
