Skip to main content

Add your description here

Project description

genoray

If you want to use NumPy with genetic variant data, genoray is for you! genoray enables ergonomic and efficient range queries of genotypes and dosages from VCF and PGEN (PLINK 2.0) files. genoray is also fully type-safe and has minimal dependencies.

Summary

The genoray API more-or-less boils down to just two classes and up to five methods:

  • VCF and PGEN classes for reading VCF and PGEN files, respectively.
  • read read variants for a single range.
  • chunk read variants for a single range in chunks.
  • read_ranges read multiple ranges of variants at once.
  • chunk_ranges read multiple ranges of variants in chunks.
  • set_samples subset and/or re-order the samples.

The other important arguments to know are mode (and phasing for VCF) to set the return type and max_mem for chunking. The modes that are available for each file format are always accessible from the class itself, e.g. VCF.Genos16, PGEN.GenosDosages, etc. You can also filter variants on the fly using the filter argument to class constructors.

Examples

VCF

We work with VCFs using the (you guessed it) VCF class:

from genoray import VCF

vcf = VCF("file.vcf.gz")

Querying data for a region is as simple as:

# shape: (samples ploidy variants)
genos = vcf.read("1")  # read all variants on chromosome 1

You can also change the return type to be either genotypes and/or dosages by providing a mode argument:

vcf = VCF("file.vcf.gz", dosage_field="DS")  # need a dosage_field to read dosages

genos, dosages = vcf.read("1", mode=VCF.Genos16Dosages)

Dosages have shape (samples, variants) and dtype np.float32.

[!NOTE] VCFs must also be provided a FORMAT dosage_field to read dosages and this field must have Number=A in the header, meaning there is one value for each ALT allele.

A key feature of genoray is letting you work with data that is too large to fit into memory. For example:

vcf = VCF("file.vcf.gz", phasing=True)  # include phasing status

# max_mem defaults to "4g", can also be capitalized or be "GB", for example
# Genos8 reduces precision to int8 from the default int16 that cyvcf2 uses
genos = vcf.chunk("1", max_mem="4g", mode=VCF.Genos8)

for chunk in genos:
    # do something with chunk, each chunk is a NumPy array of shape (samples, ploidy+1, variants)
    ...

The chunk method will automatically chunk the data along the variants axis to respect the memory limit, returning a generator of data instead of everything at once.

[!NOTE] We also set phasing=True and changed the mode to VCF.Genos8 to read phased genotypes as int8. The phasing argument lets us have access to the phase of the genotype in the format the cyvcf2 adheres to: the 3rd entry along the ploidy axis is the phase: 0 for unphased, 1 for phased. Reducing precision to int8 instead of int16 reduces the memory per variant by half -- we would only need higher precision if we expected to have more than 128 alleles at a variant site.

PGEN

from genoray import PGEN

pgen = PGEN("file.pgen")

[!IMPORTANT] PGEN files are automatically indexed on construction, creating a <prefix>.gvi file. This is a one-time cost to enable fast range queries, but it takes longer for larger files. Don't delete this index file unless you want to re-index the PGEN file.

We can query data for a region in the same way as VCF:

# shape: (samples ploidy variants)
genos = pgen.read("1")  # read all variants on chromosome 1
genos = pgen.chunk("1")  # read all variants on chromosome 1

However, PGEN files also support reading multiple ranges at once since this improves throughput substantially:

# shape: (samples, ploidy, variants), shape: (n_ranges+1)
genos, offsets = pgen.read_ranges('1', starts=[1, 1000, 2000], ends=[1000, 2000, 3000])
first_range_genos = genos[..., offsets[0]:offsets[1]]

genos = pgen.chunk_ranges('1', starts=[1, 1000, 2000], ends=[1000, 2000, 3000])
for range_ in genos:
    if range_ is None:
        # no data for this range
        continue
    for chunk in range_:
        # do something with chunk, each chunk is a NumPy array of shape (samples, ploidy, variants)
        ...

The read_ranges method takes starts and ends and returns data for each range and the offsets to slice out the variants for each range. Since the data is allocated as a single array, the offsets let you slice out the data for each range from the variants axis.

[!NOTE] We do not provide an API for multi-range queries of VCFs because benchmarking showed that this provided no benefit to throughput.

Like VCF, methods for PGENs accept a mode argument to change the return type to include genotypes, phasing, and/or dosages:

genos, phasing, dosages = pgen.read("1", mode=PGEN.GenosPhasingDosages)

The PGEN reader adheres to pgenlib's API, so the phasing information is in a separate boolean array instead of using an extra column like VCF/cyvcf2. The phasing information is a boolean array of shape (samples, variants) where True indicates that the genotype is phased and False indicates that it is unphased.

[!IMPORTANT] PGEN files either store hardcalls (genotypes) or dosages, not both, and dosage PGENs infer hardcalls based on a hardcall threshold. Thus, if you want to read hardcalls that do not correspond to inferred hardcalls from a dosage PGEN, you can provide two different PGEN files to the constructor. This will read hardcalls from hardcalls.pgen and dosages from dosage.pgen. The two PGEN files must have the same samples and variants in the same order. The dosage_path argument is optional, and if not provided, both hardcalls and dosages will be sourced from the path argument ("hardcalls.pgen" in the example):

pgen = PGEN("hardcalls.pgen", dosage_path="dosage.pgen", ...)

Filtering

You can filter variants from VCF or PGEN files by a providing a function or polars expression to the constructor, respectively.

For VCFs, the function must accept a cyvcf2.Variant and return a boolean indicating whether to include the site.

# only include variants that are common in EUR
vcf = VCF("file.vcf.gz", filter=lambda v: v.INFO['AF_EUR'] > 0.05)

For PGENs, the expression will operate on a polars DataFrame with all of the columns available in the underlying PVAR except #CHROM, POS, and ALT, which are superseded by columns added by genoray:

  • Chromosome
  • Start
  • End
  • ALT as a list of strings
  • ilen (indel length)
  • kind a list of strings that indicates the type of each ALT allele as "SNP", "INDEL", "MNP", or "OTHER"

The expression should return a boolean mask indicating which variants to include.

# only include SNPs
pgen = PGEN("file.pgen", filter=pl.col("kind").list.eval(pl.element() == "SNP").list.all())

⚠️ Important ⚠️

  • For the time being, ploidy is 2 for all classes in genoray, but this could be more flexible for VCFs in the future. The PGEN format does not support ploidy other than 2.
  • Different file formats may use different data types for their respective representations of genotypes, phasing, and dosages.
  • Ranges are 0-based, so starts begin at 0 and ends are exclusive.
  • Missing genotypes and dosages are encoded as -1 and np.nan, respectively.
  • Dosages from PGEN files may not exactly match VCF files (up to a fraction of a percent) because PLINK 2.0 must encode dosages with fixed precision which can not match what can be represented by text in a VCF (may also disagree with how BCF encodes dosage).

Contributing

To contribute to genoray, please fork the repository and create a pull request. We welcome contributions of all kinds, including bug fixes, new features, and documentation improvements. Please make sure to run the tests before submitting a pull request. We provide a Pixi environment that includes all development dependencies. To use the environment, install Pixi and run pixi run pre-commit to activate pre-commit in your clone of the repo, and then run pixi s in the repository root directory. pixi s will activate the development environment and install all dependencies. You can then run the tests using pytest. ❗Note that all commits must adhere to conventional commits. If you have any questions or suggestions, please open an issue on the repository.

Project details


Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

genoray-0.17.0.tar.gz (116.2 kB view details)

Uploaded Source

Built Distribution

If you're not sure about the file name format, learn more about wheel file names.

genoray-0.17.0-py3-none-any.whl (38.4 kB view details)

Uploaded Python 3

File details

Details for the file genoray-0.17.0.tar.gz.

File metadata

  • Download URL: genoray-0.17.0.tar.gz
  • Upload date:
  • Size: 116.2 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? Yes
  • Uploaded via: uv/0.9.15 {"installer":{"name":"uv","version":"0.9.15","subcommand":["publish"]},"python":null,"implementation":{"name":null,"version":null},"distro":{"name":"Ubuntu","version":"24.04","id":"noble","libc":null},"system":{"name":null,"release":null},"cpu":null,"openssl_version":null,"setuptools_version":null,"rustc_version":null,"ci":true}

File hashes

Hashes for genoray-0.17.0.tar.gz
Algorithm Hash digest
SHA256 1c370273df7212c429966ecff8149fbf80f28ee77cbcaae0454268ee2c2b6b21
MD5 21fee29729e8f49159c26269d5a67a52
BLAKE2b-256 2574a8d79b655d82b73b8c430b84e95d872af9908df8653fedf66be9039506e2

See more details on using hashes here.

File details

Details for the file genoray-0.17.0-py3-none-any.whl.

File metadata

  • Download URL: genoray-0.17.0-py3-none-any.whl
  • Upload date:
  • Size: 38.4 kB
  • Tags: Python 3
  • Uploaded using Trusted Publishing? Yes
  • Uploaded via: uv/0.9.15 {"installer":{"name":"uv","version":"0.9.15","subcommand":["publish"]},"python":null,"implementation":{"name":null,"version":null},"distro":{"name":"Ubuntu","version":"24.04","id":"noble","libc":null},"system":{"name":null,"release":null},"cpu":null,"openssl_version":null,"setuptools_version":null,"rustc_version":null,"ci":true}

File hashes

Hashes for genoray-0.17.0-py3-none-any.whl
Algorithm Hash digest
SHA256 571fdf7fb563fba64f8f504f08b3f745d6b7b5f690a3196a12fc79c0c57e6ed4
MD5 28848c15ff0f59a16b2d2fdeef6a8211
BLAKE2b-256 878dbbf18153a67b4b5f669fd800e4215c59b0219b1f07195e33c8e06ef2be24

See more details on using hashes here.

Supported by

AWS Cloud computing and Security Sponsor Datadog Monitoring Depot Continuous Integration Fastly CDN Google Download Analytics Pingdom Monitoring Sentry Error logging StatusPage Status page