Add your description here

Project description

`genoray`

If you want to use NumPy with genetic variant data, genoray is for you! genoray provides a uniform API for efficient range queries of genotypes and dosages from VCF and PGEN (PLINK 2.0) files. genoray is also fully type-safe and has minimal dependencies.

The API is minimal, with only three core methods: single range queries, automatically chunking a range query to respect a memory limit, and reading multiple range queries.

`read`

Read genotypes and/or dosages for a single range (contig, start, end). Returns a NumPy array of shape (samples, ploidy, variants). For VCF, the dtype is np.int8, and for PGEN, the dtype is np.int32.

from genoray import VCF, PGEN

vcf = VCF("file.vcf.gz")
pgen = PGEN("file.pgen")

# shape: (samples ploidy variants)
genos = vcf.read("1")  # read all variants on chromosome 1
genos = pgen.read("1")

You can also change the return type to be either genotypes and/or dosages using the read_as argument in the constructor:

from genoray import VCF, PGEN

vcf = VCF("file.vcf.gz", read_as=VCF.GenosDosages)  # can be VCF.Genos, VCF.Dosages, or VCF.GenosDosages
pgen = PGEN("file.pgen", read_as=PGEN.GenosDosages)

# shape: (samples ploidy variants), shape: (samples variants)
genos, dosages = vcf.read("1")
genos, dosages = pgen.read("1")

Dosages have shape (samples, variants) and dtype np.float32 for both VCF and PGEN.

Note that VCFs must also be provided a FORMAT dosage_field to read dosages and this field must have Number=A in the header, meaning there is one value for each ALT allele. PGEN files either store hardcalls (genotypes) or dosages, not both, and dosage PGENs infer hardcalls based on a hardcall threshold. Thus, if you want to read hardcalls that do not correspond to inferred hardcalls from a dosage PGEN, you can provide two different PGEN files to the constructor:

pgen = PGEN("hardcalls.pgen", dosage_path="dosage.pgen", ...)

This will read hardcalls from hardcalls.pgen and dosages from dosage.pgen. The two PGEN files must have the same samples and variants in the same order. The dosage_path argument is optional, and if not provided, both hardcalls and dosages will be sourced from the path argument ("hardcalls.pgen" in the example).

[!IMPORTANT] PGEN files are automatically indexed on construction, creating a <prefix>.gvi file. This is a one-time cost to enable much faster range queries, but it takes longer for larger files. Don't delete this index file unless you want to re-index the PGEN file.

`read_chunks`

If you don't have enough memory to read a range of variants, you can split the read into chunks. This method returns a generator that yields NumPy arrays of shape (samples, ploidy, variants) (and/or (samples, variants) for dosages) that are automatically chunked along the variants axis such that each chunk of data is no larger than the specified maximum memory.

genos = vcf.read_chunks("1", max_mem="4g")  # default is "4g", can also be capitalized or be "GB", for example

`read_ranges`

If you want to read multiple ranges on the same contig at once, you can use the read_ranges method. This method takes starts and ends and returns a NumPy array of shape (samples, ploidy, variants) (and/or (samples, variants) for dosages) for each range, as well as the offsets to slice out the variants for each range. Since the data is allocated as a single array, the offsets let you slice out the data for each range from the variants axis.

# shape: (samples, ploidy, variants), shape: (n_ranges+1)
genos, offsets = vcf.read_ranges('1', starts=[1, 1000, 2000], ends=[1000, 2000, 3000])
first_range_genos = genos[..., offsets[0]:offsets[1]]

Filtering

You can filter variants from VCF or PGEN files by a providing a function or polars expression to the constructor, respectively. For VCFs, the function must accept a cyvcf2.Variant and return a boolean indicating whether to include the variant. For PGENs, the expression will operate on a polars DataFrame with four columns: "Chromosome", "Start", "End", and "kind". The expression should return a boolean mask indicating which variants to include. The "kind" column is a string that indicates the type of variant, which can be "SNP", "INDEL", or "MNP".

vcf = VCF("file.vcf.gz", filter=lambda v: v.QUAL > 20)  # only include variants with quality > 20
pgen = PGEN("file.pgen", filter=pl.col("kind") == "SNP")  # only include SNPs

Type Safety

genoray is fully type-safe, meaning that type checkers like mypy and pyright can infer the types of the input and output data for everything in genoray and provide compile-time errors. In addition, if you want to abstract your code to work with any genoray reader, you can use the genoray.Reader base class, which is the typing.Protocol that all genoray readers adhere to. This allows you to write code that works with either VCF or PGEN files without having to worry about the underlying implementation details.

⚠️ Important ⚠️

For the time being, ploidy is always 2, but this could be more flexible for VCFs in the future. PGEN does not support ploidy other than 2.
Multi-allelic variants are not supported and any multi-allelic sites will only have their first allele returned. As a workaround, you can split multi-allelic sites into bi-allelic ones using bcftools norm or plink2 --make-bpgen. You can then pass the PLINK 1 .bed file to genoray.PGEN, although this will erase phasing information. If you need the phasing information, you should export the file to BCF and split the sites with bcftools, then convert the BCF back to PGEN with plink2 --make-pgen.
PGEN returns genotypes with dtype np.int32 instead of np.int8 because this is the native dtype for pgenlib.
Ranges are 0-based, so starts begin at 0 and ends are exclusive.
Missing genotypes and dosages are encoded as -1 and np.nan, respectively.

Contributing

To contribute to genoray, please fork the repository and create a pull request. We welcome contributions of all kinds, including bug fixes, new features, and documentation improvements. Please make sure to run the tests before submitting a pull request. We provide a Pixi environment that includes all development dependencies. To use the environment, install Pixi and run pixi run pre-commit to activate pre-commit in your clone of the repo, and then run pixi s in the repository root directory. pixi s will activate the development environment and install all dependencies. You can then run the tests using pytest. ❗Note that all commits must adhere to conventional commits. If you have any questions or suggestions, please open an issue on the repository.

Project details

Release history Release notifications | RSS feed

2.2.3

Apr 22, 2026

2.2.2

Mar 31, 2026

2.2.1

Mar 9, 2026

2.2.0

Mar 9, 2026

2.1.3

Feb 10, 2026

2.1.2

Feb 6, 2026

2.1.1

Feb 5, 2026

2.1.0

Feb 4, 2026

2.0.1

Feb 4, 2026

2.0.0

Jan 28, 2026

1.0.1

Dec 20, 2025

1.0.0

Dec 20, 2025

0.17.0

Dec 3, 2025

0.16.1

Nov 19, 2025

0.16.0

Oct 21, 2025

0.15.0

Aug 22, 2025

0.14.6

Jul 20, 2025

0.14.5

Jul 20, 2025

0.14.4

Jul 9, 2025

0.14.3

Jul 9, 2025

0.14.2

Jun 27, 2025

0.14.1

Jun 12, 2025

0.14.0

Jun 12, 2025

0.13.1

Jun 11, 2025

0.13.0

Jun 10, 2025

0.12.2

Jun 5, 2025

0.12.1

Jun 5, 2025

0.12.0

May 28, 2025

0.11.3

May 19, 2025

0.11.2

May 18, 2025

0.11.1

May 17, 2025

0.11.0

May 17, 2025

0.10.8

May 15, 2025

0.10.7

May 13, 2025

0.10.6

May 13, 2025

0.10.5

May 13, 2025

0.10.4

May 13, 2025

0.10.3

May 10, 2025

0.10.2

May 7, 2025

0.10.1

May 7, 2025

0.10.0

May 5, 2025

0.9.0

May 2, 2025

0.8.0

Apr 30, 2025

0.7.1

Apr 25, 2025

0.7.0

Apr 21, 2025

0.6.0

Apr 21, 2025

0.5.1

Apr 19, 2025

0.5.0

Apr 18, 2025

0.4.4

Apr 16, 2025

0.4.3

Apr 16, 2025

0.4.2

Apr 16, 2025

0.4.1

Apr 16, 2025

0.4.0

Apr 16, 2025

0.3.0

Apr 15, 2025

This version

0.2.0

Apr 13, 2025

0.1.0

Apr 12, 2025

Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

genoray-0.2.0.tar.gz (69.7 kB view details)

Uploaded Apr 13, 2025 Source

Built Distribution

If you're not sure about the file name format, learn more about wheel file names.

The dropdown lists show the available interpreters, ABIs, and platforms. Enable javascript to be able to filter the list of wheel files.

genoray-0.2.0-py3-none-any.whl (15.5 kB view details)

Uploaded Apr 13, 2025 Python 3

File details

Details for the file genoray-0.2.0.tar.gz.

File metadata

Download URL: genoray-0.2.0.tar.gz
Upload date: Apr 13, 2025
Size: 69.7 kB
Tags: Source
Uploaded using Trusted Publishing? No
Uploaded via: uv/0.6.7

File hashes

Hashes for genoray-0.2.0.tar.gz
Algorithm	Hash digest
SHA256	`05ed07f822c7cf69502fcf0713c69a7a3ebca7103bb1324ae50dee936d9a1ca9`
MD5	`2290e140fdadb6ddd80296dd10e0382c`
BLAKE2b-256	`038347de6efe91d3f4ffcc77c0fbcb58882390e0cdcd673c14b0327d82ff1768`

See more details on using hashes here.

File details

Details for the file genoray-0.2.0-py3-none-any.whl.

File metadata

Download URL: genoray-0.2.0-py3-none-any.whl
Upload date: Apr 13, 2025
Size: 15.5 kB
Tags: Python 3
Uploaded using Trusted Publishing? No
Uploaded via: uv/0.6.7

File hashes

Hashes for genoray-0.2.0-py3-none-any.whl
Algorithm	Hash digest
SHA256	`58c1968a663e373ac8403a58b7a424e8a4b59ec7c319f5ada6cb7cf94709146f`
MD5	`da0473fc5961c4c92f1d6a060f857884`
BLAKE2b-256	`0e45a92bd77b90e9f4e3fef42b65bcaf89ba45345cef63bdc170ed6f122cc22e`

See more details on using hashes here.

genoray 0.2.0

Navigation

Verified details

Maintainers

Unverified details

Meta

Project description

`genoray`

`read`

`read_chunks`

`read_ranges`

Filtering

Type Safety

⚠️ Important ⚠️

Contributing

Project details

Verified details

Maintainers

Unverified details

Meta

Release history Release notifications | RSS feed

Download files

Source Distribution

Built Distribution

File details

File metadata

File hashes

File details

File metadata

File hashes