pyigd

PyIGD is a Python-only parser for the Indexable Genotype Data (IGD) format.

For a C++ library that supports creating and parsing IGD, see picovcf (which also supports VCF -> IGD conversion).

Installation

Clone the code and then either install for development:

pip install -e pyigd/

or build and install via the wheel:

cd pyigd/ && python setup.py bdist_wheel
pip install --force-reinstall dist/*.whl

Usage

The pyigd.IGDFile class is a context manager, so the recommended way to use it is via the with statement. Below is an example script that loads an IGD file, prints some metadata, and then iterates over the genotype data for all variants.

import pyigd
import sys

if len(sys.argv) < 2:
    print("Pass in IGD filename")
    sys.exit(1)

with pyigd.IGDFile(sys.argv[1]) as igd_file:
    print(f"Version: {igd_file.version}")
    print(f"Ploidy: {igd_file.ploidy}")
    print(f"Variants: {igd_file.num_variants}")
    print(f"Individuals: {igd_file.num_individuals}")
    print(f"Source: {igd_file.source}")
    print(f"Description: {igd_file.description}")
    for variant_index in range(igd_file.num_variants):
        print(f"REF: {igd_file.get_ref_allele(variant_index)}, ALT: {igd_file.get_alt_allele(variant_index)}")
        # Approach 1: Get the samples as a list
        position, is_missing, sample_list = igd_file.get_samples(variant_index)
        print( (position, is_missing, len(sample_list))  )

        # Approach 2: Get the samples as a BitVector object
        # See https://engineering.purdue.edu/kak/dist/BitVector-3.5.0.html
        position, is_missing, bitvect = igd_file.get_samples_bv(variant_index)
        print( (position, is_missing, bitvect.count_bits())  )
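
As a follow-up to the script above, here is a minimal sketch of how the sample lists might be turned into per-variant alternate allele frequencies. The helper name alt_allele_frequencies is hypothetical and not part of pyigd. It assumes phased data in which each entry of a sample list is one haploid sample, so the denominator is ploidy * num_individuals, and it assumes that a row with is_missing set lists samples with missing data rather than carriers of the alternate allele.

```python
def alt_allele_frequencies(igd_file):
    """Yield (position, frequency) for every variant row not flagged as missing.

    `igd_file` is expected to be an open pyigd.IGDFile (or any object with the
    same attributes). Hypothetical helper, not part of the pyigd API.
    """
    # Assumption: one sample-list entry per haploid sample (phased data).
    total_haplotypes = igd_file.ploidy * igd_file.num_individuals
    for variant_index in range(igd_file.num_variants):
        position, is_missing, sample_list = igd_file.get_samples(variant_index)
        if is_missing:
            # Assumption: this row lists samples with missing data, so skip it.
            continue
        yield position, len(sample_list) / total_haplotypes
```

Inside the with block of the script above, this could be used as `for position, freq in alt_allele_frequencies(igd_file): ...`.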

IGD can be highly performant for a few reasons:

- It stores sparse data sparsely. Low-frequency variants are stored as sample lists; medium- and high-frequency variants are stored as bit vectors.
- It is indexable: you can jump directly to the data for the ith variant. Because the index is stored in its own section of the file, scanning the index is fast, so restricting a query to a particular range of the genome is fast as well. In that case, use pyigd.IGDFile.get_position_and_flags() to find the first variant index within the range, and then call pyigd.IGDFile.get_samples() from that index onward.
- The genotype data is stored in one of two very simple binary formats. This makes parsing fast, and the compact nature of the file makes reading from disk/memory fast as well.
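
The range lookup described in the list above could be sketched roughly as follows. This is an illustration under stated assumptions, not the library's own method: the helper names first_variant_at_or_after and samples_in_range are hypothetical, variants are assumed to be stored in ascending position order, and get_position_and_flags() is assumed to return a tuple whose first element is the position. Check both assumptions against your pyigd version before relying on this.

```python
def first_variant_at_or_after(igd_file, target_position):
    """Binary-search the index for the first variant with position >= target.

    Hypothetical helper. Assumes ascending positions, and that
    get_position_and_flags(i)[0] is the position of variant i.
    Returns igd_file.num_variants if no such variant exists.
    """
    lo, hi = 0, igd_file.num_variants
    while lo < hi:
        mid = (lo + hi) // 2
        position = igd_file.get_position_and_flags(mid)[0]
        if position < target_position:
            lo = mid + 1
        else:
            hi = mid
    return lo


def samples_in_range(igd_file, start, end):
    """Yield (position, is_missing, sample_list) for start <= position < end."""
    index = first_variant_at_or_after(igd_file, start)
    while index < igd_file.num_variants:
        position, is_missing, sample_list = igd_file.get_samples(index)
        if position >= end:
            break  # past the end of the requested range
        yield position, is_missing, sample_list
        index += 1
```

Only the index is touched during the search; genotype data is read solely for variants that fall inside the range.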

How do I use IGD in my project?

Clone picovcf and follow the instructions in its README to build the tools for that library.
- If you want to be able to convert .vcf.gz (compressed VCF) to IGD, make sure you build with -DENABLE_VCF_GZ=ON
One of the built tools will be igdtools, which can convert from VCF to IGD, among other things (such as filtering IGD files).
Do one of the following:
- If your project is C++, copy picovcf.hpp into your project, #include it somewhere, and then use it according to the documentation.
- If your project is Python, clone pyigd and install it per the README instructions.

This version

0.1

Aug 16, 2024

Download files

Source Distribution

pyigd-0.1.tar.gz (8.0 kB)

Uploaded Aug 16, 2024 Source

Built Distribution

pyigd-0.1-py3-none-any.whl (6.4 kB)

Uploaded Aug 16, 2024 Python 3

Hashes for pyigd-0.1.tar.gz
Algorithm	Hash digest
SHA256	`85155fdb10254efb68e8b206a5e3e22e34751bc15e8bbafaca46c584a4b78790`
MD5	`3d73aab45ebf5bfe2e3d343b065bd7a5`
BLAKE2b-256	`205509785d9c491c0329baf2473f1e965d66aa99cde7d82f1c86dc20b07e02b4`

Hashes for pyigd-0.1-py3-none-any.whl
Algorithm	Hash digest
SHA256	`751ace4fe2d4413650b007928412d544d3469d3c2a59ccb6a266fb02088e90ee`
MD5	`883f0f2d17324247ce88b6a6308c4d5f`
BLAKE2b-256	`38d8787b8ff4754011ef50c625940a17bfdd07c2cdbb4ebada5e634326abc59c`