
Pure Python implementation of a Tabix reader.


Pure Tabix


This is a pure-Python Tabix index parser. It is useful as an alternative to PySAM and PyTabix for rapid, position-based read access to Tabix-indexed, block-gzipped files such as VCFs and other common bioinformatics formats.

See https://samtools.github.io/hts-specs/tabix.pdf and https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3042176 for information about Tabix and the detailed file format specification.
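For orientation, the fixed-size header at the start of a `.tbi` file, as laid out in the specification linked above, can be decoded with the standard library alone. The following is a minimal sketch and not puretabix's own implementation; `parse_tbi_header` and `read_tbi_header` are hypothetical helper names, and only the header fields are read (the bins and linear index that follow are ignored).

```python
import gzip
import struct

HEADER_LAYOUT = "<4s8i"  # 4-byte magic, then eight little-endian int32 fields


def parse_tbi_header(raw: bytes) -> dict:
    """Decode the header of an already decompressed .tbi index (hypothetical helper)."""
    magic, n_ref, fmt, col_seq, col_beg, col_end, meta, skip, l_nm = struct.unpack_from(
        HEADER_LAYOUT, raw, 0
    )
    if magic != b"TBI\x01":
        raise ValueError("not a Tabix index")
    # Sequence names are NUL-terminated and concatenated, l_nm bytes in total.
    offset = struct.calcsize(HEADER_LAYOUT)
    names = raw[offset : offset + l_nm].split(b"\x00")[:n_ref]
    return {
        "n_ref": n_ref,
        "format": fmt,
        "col_seq": col_seq,
        "col_beg": col_beg,
        "col_end": col_end,
        "meta": chr(meta),
        "skip": skip,
        "names": [name.decode("ascii") for name in names],
    }


def read_tbi_header(path: str) -> dict:
    """Decompress a .tbi file and decode its header (hypothetical helper)."""
    # BGZF is gzip-compatible, so the standard gzip module can decompress it.
    with gzip.open(path, "rb") as handle:
        return parse_tbi_header(handle.read())
```

In normal use none of this is needed, because `TabixIndexedFile` takes care of reading the index.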

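from puretabix import TabixIndexedFile

# open both files in binary mode; the context manager closes them afterwards
with open('somefile.vcf.gz', 'rb') as vcf, open('somefile.vcf.gz.tbi', 'rb') as tbi:
    tabix_indexed_file = TabixIndexedFile.from_files(vcf, tbi)
    # fetch the region 1000 to 5000 on sequence "1"
    result = tabix_indexed_file.fetch("1", 1000, 5000)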

Documentation is available through Python's built-in pydoc module: python3 -m pydoc -b puretabix

VCF

Included in this package is tooling for reading and writing VCF lines.

To read a file:

from puretabix.vcf import read_vcf_lines

with open("source.vcf") as input:
    for vcfline in read_vcf_lines(input):
        if vcfline.is_comment:
            # it's a comment or meta-information
            pass
        else:
            # access the parsed information
            if "PASS" not in vcfline._filter:
                print(f"{vcfline.chrom} {vcfline.pos} {vcfline.get_genotype()}")

To write some lines:

from puretabix.vcf import VCFLine

with open("output.vcf", "w") as output:
    output.write(str(VCFLine.as_comment_key_dict("fileformat", "VCFv4.2")))
    output.write("\n")
    output.write(
        str(
            VCFLine.as_comment_raw(
                "\t".join(
                    (
                        "CHROM",
                        "POS",
                        "ID",
                        "REF",
                        "ALT",
                        "QUAL",
                        "FILTER",
                        "INFO",
                        "FORMAT",
                        "SAMPLE",
                    )
                )
            )
        )
    )
    output.write("\n")
    output.write(
        str(
            VCFLine.as_data(
                "chr1",
                123,
                ("rs123",),
                "A",
                ("C",),
                ".",
                ("PASS",),
                {},
                ({"GT": "1/0"},),
            )
        )
    )
    output.write("\n")
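To make the target format concrete, the sketch below builds the same three lines by hand using only the standard library, assuming the standard VCF 4.2 text layout (tab-separated columns, `##` for meta-information, `#` for the column header). `format_vcf_data_line` is a hypothetical helper, not part of puretabix, and the exact text puretabix emits may differ in detail.

```python
def format_vcf_data_line(chrom, pos, ids, ref, alts, qual, filters, info, samples):
    """Join VCF fields into one tab-separated data line (hypothetical helper)."""
    # Empty fields are written as "."; INFO flags without a value are not handled here.
    info_text = ";".join(f"{key}={value}" for key, value in info.items()) or "."
    columns = [
        chrom,
        str(pos),
        ";".join(ids) or ".",
        ref,
        ",".join(alts) or ".",
        str(qual),
        ";".join(filters) or ".",
        info_text,
    ]
    if samples:
        # The FORMAT column lists the keys; each sample column lists the values.
        format_keys = list(samples[0])
        columns.append(":".join(format_keys))
        columns.extend(":".join(sample[key] for key in format_keys) for sample in samples)
    return "\t".join(columns)


header_columns = (
    "CHROM", "POS", "ID", "REF", "ALT", "QUAL", "FILTER", "INFO", "FORMAT", "SAMPLE",
)
lines = [
    "##fileformat=VCFv4.2",
    "#" + "\t".join(header_columns),
    format_vcf_data_line(
        "chr1", 123, ("rs123",), "A", ("C",), ".", ("PASS",), {}, ({"GT": "1/0"},)
    ),
]
print("\n".join(lines))
```

Comparing this output with the file written through `VCFLine` is a quick way to sanity-check generated records.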

VCF with index

If there is a Tabix index for a block-gzipped VCF file, that index can be used for fast random access:

import puretabix

with open("input.vcf.gz", "rb") as vcf:
    with open("input.vcf.gz.tbi", "rb") as vcf_tbi:
        indexed = puretabix.TabixIndexedVCFFile.from_files(vcf, vcf_tbi)
        # fetch_vcf_lines returns an iterable of parsed lines, so loop over it
        for vcfline in indexed.fetch_vcf_lines("chr1", 1108138):
            assert vcfline.chrom == "chr1"
            assert vcfline.pos == 1108138
            print(f"gt = {vcfline.get_genotype()}")
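Indexed access only works when the data file is block gzipped (BGZF) rather than plain gzip. As a rough sketch, the check below inspects the first gzip member for the `BC` extra subfield that marks a BGZF block. `is_bgzf` is a hypothetical helper written against the BGZF block layout, not a puretabix function.

```python
import gzip
import struct


def is_bgzf(header: bytes) -> bool:
    """Return True if the bytes start with a BGZF block header (hypothetical helper)."""
    if len(header) < 12 or header[:2] != b"\x1f\x8b":
        return False  # too short, or not gzip at all
    compression_method, flags = header[2], header[3]
    # BGZF blocks use deflate (8) and set the FEXTRA flag (bit 2).
    if compression_method != 8 or not flags & 4:
        return False
    (xlen,) = struct.unpack_from("<H", header, 10)
    extra = header[12 : 12 + xlen]
    # Walk the extra subfields looking for "BC" with a 2-byte payload.
    pos = 0
    while pos + 4 <= len(extra):
        si1, si2, slen = struct.unpack_from("<BBH", extra, pos)
        if (si1, si2, slen) == (66, 67, 2):
            return True
        pos += 4 + slen
    return False


# The empty BGZF block that terminates a block-gzipped file.
BGZF_EOF = bytes.fromhex("1f8b08040000000000ff0600424302001b0003000000000000000000")

print(is_bgzf(BGZF_EOF))
print(is_bgzf(gzip.compress(b"plain gzip")))
```

For a file on disk, passing the first few kilobytes read in binary mode is normally enough, since the extra field sits directly after the fixed 12-byte gzip header.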

development

TL;DR: pip install -e '.[dev]' && pre-commit install

pip install -e '.[dev]'  # Install using pip including development extras
pre-commit install  # Enable pre-commit hooks
pre-commit run --all-files  # Run pre-commit hooks without committing
# Note pre-commit is configured to use:
# - seed-isort-config to better categorise third party imports
# - isort to sort imports
# - black to format code
pip-compile  # Freeze dependencies
pytest  # Run tests
coverage run --source=puretabix -m pytest && coverage report -m  # Run tests, print coverage
mypy .  # Type checking
pipdeptree  # Print dependencies
scalene --outfile tests/perf_test.txt --profile-all --cpu-sampling-rate 0.0001 tests/perf_test.py  # performance measurements

Configure global git ignores as described at https://help.github.com/en/github/using-git/ignoring-files#configuring-ignored-files-for-all-repositories-on-your-computer

For release to PyPI see https://packaging.python.org/tutorials/packaging-projects/

For information about packaging wheels see https://realpython.com/python-wheels/

git checkout master
git pull
git add setup.py CHANGES.txt
git commit -m"prepare for x.x.x"
git push
git tag x.x.x
git push origin x.x.x
python3 setup.py sdist bdist_wheel && python3 -m twine upload dist/*

acknowledgements

Inspired by @yangmqglobe's code in https://github.com/cggh/scikit-allel/pull/297


