Add your description here
Project description
genoray
If you want to use NumPy with genetic variant data, genoray is for you! genoray enables ergonomic and efficient range queries of genotypes and dosages from VCF and PGEN (PLINK 2.0) files. genoray is also fully type-safe and has minimal dependencies.
Summary
The genoray API more-or-less boils down to just two classes and up to five methods:
VCFandPGENclasses for reading VCF and PGEN files, respectively.readread variants for a single range.chunkread variants for a single range in chunks.read_rangesread multiple ranges of variants at once.chunk_rangesread multiple ranges of variants in chunks.set_samplessubset and/or re-order the samples.
The other important arguments to know are mode (and phasing for VCF) to set the return type and max_mem for chunking. The modes that are available for each file format are always accessible from the class itself, e.g. VCF.Genos16, PGEN.GenosDosages, etc. You can also filter variants on the fly using the filter argument to class constructors.
Examples
VCF
We work with VCFs using the (you guessed it) VCF class:
from genoray import VCF
vcf = VCF("file.vcf.gz")
Querying data for a region is as simple as:
# shape: (samples ploidy variants)
genos = vcf.read("1") # read all variants on chromosome 1
You can also change the return type to be either genotypes and/or dosages by providing a mode argument:
vcf = VCF("file.vcf.gz", dosage_field="DS") # need a dosage_field to read dosages
genos, dosages = vcf.read("1", mode=VCF.Genos16Dosages)
Dosages have shape (samples, variants) and dtype np.float32.
[!NOTE] VCFs must also be provided a FORMAT
dosage_fieldto read dosages and this field must haveNumber=Ain the header, meaning there is one value for each ALT allele.
A key feature of genoray is letting you work with data that is too large to fit into memory. For example:
vcf = VCF("file.vcf.gz", phasing=True) # include phasing status
# max_mem defaults to "4g", can also be capitalized or be "GB", for example
# Genos8 reduces precision to int8 from the default int16 that cyvcf2 uses
genos = vcf.chunk("1", max_mem="4g", mode=VCF.Genos8)
for chunk in genos:
# do something with chunk, each chunk is a NumPy array of shape (samples, ploidy+1, variants)
...
The chunk method will automatically chunk the data along the variants axis to respect the memory limit, returning a generator of data instead of everything at once.
[!NOTE] We also set
phasing=Trueand changed the mode toVCF.Genos8to read phased genotypes asint8. Thephasingargument lets us have access to the phase of the genotype in the format the cyvcf2 adheres to: the 3rd entry along the ploidy axis is the phase: 0 for unphased, 1 for phased. Reducing precision toint8instead ofint16reduces the memory per variant by half -- we would only need higher precision if we expected to have more than 128 alleles at a variant site.
PGEN
from genoray import PGEN
pgen = PGEN("file.pgen")
[!IMPORTANT] PGEN files are automatically indexed on construction, creating a
<prefix>.gvifile. This is a one-time cost to enable fast range queries, but it takes longer for larger files. Don't delete this index file unless you want to re-index the PGEN file.
We can query data for a region in the same way as VCF:
# shape: (samples ploidy variants)
genos = pgen.read("1") # read all variants on chromosome 1
genos = pgen.chunk("1") # read all variants on chromosome 1
However, PGEN files also support reading multiple ranges at once since this improves throughput substantially:
# shape: (samples, ploidy, variants), shape: (n_ranges+1)
genos, offsets = pgen.read_ranges('1', starts=[1, 1000, 2000], ends=[1000, 2000, 3000])
first_range_genos = genos[..., offsets[0]:offsets[1]]
genos = pgen.chunk_ranges('1', starts=[1, 1000, 2000], ends=[1000, 2000, 3000])
for range_ in genos:
if range_ is None:
# no data for this range
continue
for chunk in range_:
# do something with chunk, each chunk is a NumPy array of shape (samples, ploidy, variants)
...
The read_ranges method takes starts and ends and returns data for each range and the offsets to slice out the variants for each range. Since the data is allocated as a single array, the offsets let you slice out the data for each range from the variants axis.
[!NOTE] We do not provide an API for multi-range queries of VCFs because benchmarking showed that this provided no benefit to throughput.
Like VCF, methods for PGENs accept a mode argument to change the return type to include genotypes, phasing, and/or dosages:
genos, phasing, dosages = pgen.read("1", mode=PGEN.GenosPhasingDosages)
The PGEN reader adheres to pgenlib's API, so the phasing information is in a separate boolean array instead of using an extra column like VCF/cyvcf2. The phasing information is a boolean array of shape (samples, variants) where True indicates that the genotype is phased and False indicates that it is unphased.
[!IMPORTANT] PGEN files either store hardcalls (genotypes) or dosages, not both, and dosage PGENs infer hardcalls based on a hardcall threshold. Thus, if you want to read hardcalls that do not correspond to inferred hardcalls from a dosage PGEN, you can provide two different PGEN files to the constructor. This will read hardcalls from
hardcalls.pgenand dosages fromdosage.pgen. The two PGEN files must have the same samples and variants in the same order. Thedosage_pathargument is optional, and if not provided, both hardcalls and dosages will be sourced from the path argument ("hardcalls.pgen"in the example):
pgen = PGEN("hardcalls.pgen", dosage_path="dosage.pgen", ...)
Filtering
You can filter variants from VCF or PGEN files by a providing a function or polars expression to the constructor, respectively.
For VCFs, the function must accept a cyvcf2.Variant and return a boolean indicating whether to include the site.
# only include variants that are common in EUR
vcf = VCF("file.vcf.gz", filter=lambda v: v.INFO['AF_EUR'] > 0.05)
For PGENs, the expression will operate on a polars DataFrame with all of the columns available in the underlying PVAR except #CHROM, POS, and ALT, which are superseded by columns added by genoray:
ChromosomeStartEndALTas a list of stringsilen(indel length)kinda list of strings that indicates the type of each ALT allele as"SNP","INDEL","MNP", or"OTHER"
The expression should return a boolean mask indicating which variants to include.
# only include SNPs
pgen = PGEN("file.pgen", filter=pl.col("kind").list.eval(pl.element() == "SNP").list.all())
⚠️ Important ⚠️
- For the time being, ploidy is 2 for all classes in
genoray, but this could be more flexible for VCFs in the future. The PGEN format does not support ploidy other than 2. - Different file formats may use different data types for their respective representations of genotypes, phasing, and dosages.
- Ranges are 0-based, so starts begin at 0 and ends are exclusive.
- Missing genotypes and dosages are encoded as -1 and
np.nan, respectively. - Dosages from PGEN files may not exactly match VCF files (up to a fraction of a percent) because PLINK 2.0 must encode dosages with fixed precision which can not match what can be represented by text in a VCF (may also disagree with how BCF encodes dosage).
Contributing
To contribute to genoray, please fork the repository and create a pull request. We welcome contributions of all kinds, including bug fixes, new features, and documentation improvements. Please make sure to run the tests before submitting a pull request. We provide a Pixi environment that includes all development dependencies. To use the environment, install Pixi and run pixi run pre-commit to activate pre-commit in your clone of the repo, and then run pixi s in the repository root directory. pixi s will activate the development environment and install all dependencies. You can then run the tests using pytest. ❗Note that all commits must adhere to conventional commits. If you have any questions or suggestions, please open an issue on the repository.
Project details
Release history Release notifications | RSS feed
Download files
Download the file for your platform. If you're not sure which to choose, learn more about installing packages.
Source Distribution
Built Distribution
Filter files by name, interpreter, ABI, and platform.
If you're not sure about the file name format, learn more about wheel file names.
Copy a direct link to the current filters
File details
Details for the file genoray-0.11.0.tar.gz.
File metadata
- Download URL: genoray-0.11.0.tar.gz
- Upload date:
- Size: 87.4 kB
- Tags: Source
- Uploaded using Trusted Publishing? No
- Uploaded via: uv/0.6.14
File hashes
| Algorithm | Hash digest | |
|---|---|---|
| SHA256 |
dffebda7895a61e7f3941568032d6588406bbeb8eeba8f632d155470a0e21ce1
|
|
| MD5 |
04b038d5568b8e4c9fdc2aa6ef0d32c7
|
|
| BLAKE2b-256 |
3e8c5dde0424aeb9b214c630a60d7e6b2ed748c4cb5585a8357c97e186dad207
|
File details
Details for the file genoray-0.11.0-py3-none-any.whl.
File metadata
- Download URL: genoray-0.11.0-py3-none-any.whl
- Upload date:
- Size: 33.4 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? No
- Uploaded via: uv/0.6.14
File hashes
| Algorithm | Hash digest | |
|---|---|---|
| SHA256 |
e4aad12a1511ebe2cc078a48275df000c3c2a27260d1ba71cfa916af4126af60
|
|
| MD5 |
c8554505831d816f55b2d09bdc9cdbed
|
|
| BLAKE2b-256 |
e38738ae7c8ede83c8ff57215efcfe2d23f00771309e37661034c003c596e425
|