Tabix reader written 100% in Python

tabixpy.py

Tabix parser written in Python 3.

Tabix

https://samtools.github.io/hts-specs/tabix.pdf

Field                   Description                                     Type     Value
---------------------------------------------------------------------------------------
magic                   Magic string                                    char[4]  TBI\1
n_ref                   # sequences                                     int32_t
format                  Format (0: generic; 1: SAM; 2: VCF)             int32_t
col_seq                 Column for the sequence name                    int32_t
col_beg                 Column for the start of a region                int32_t
col_end                 Column for the end of a region                  int32_t
meta                    Leading character for comment lines             int32_t
skip                    # lines to skip at the beginning                int32_t
l_nm                    Length of concatenated sequence names           int32_t
names                   Concatenated names, each zero terminated        char[l_nm]
======================= List of indices (n=n_ref )            =======================
    n_bin               # distinct bins (for the binning index)         int32_t
======================= List of distinct bins (n=n_bin)       =======================
        bin             Distinct bin number                             uint32_t
        n_chunk         # chunks                                        int32_t
======================= List of chunks (n=n_chunk)            =======================
            cnk_beg     Virtual file offset of the start of the chunk   uint64_t
            cnk_end     Virtual file offset of the end of the chunk     uint64_t
    n_intv              # 16kb intervals (for the linear index)         int32_t
======================= List of distinct intervals (n=n_intv) =======================
        ioff            File offset of the first record in the interval uint64_t
n_no_coor (optional)    # unmapped reads without coordinates set        uint64_t
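
The fixed part of the layout above can be read with the standard library alone. The following is a minimal sketch, not the parser shipped in tabixpy.py: the names `read_tbi_header` and `load_tbi_header` are hypothetical, and it assumes the index is BGZF compressed with little-endian integers, as the specification states. BGZF blocks are ordinary gzip members, so Python's `gzip` module can decompress them.

```python
import gzip
import struct

# Fixed-size part of the header: the 4-byte magic followed by eight
# int32_t fields, all little-endian ("<" in struct notation).
HEADER = struct.Struct("<4s8i")
FIELDS = ("n_ref", "format", "col_seq", "col_beg",
          "col_end", "meta", "skip", "l_nm")


def read_tbi_header(raw):
    """Parse the header of an already decompressed .tbi payload (hypothetical helper)."""
    magic, *values = HEADER.unpack_from(raw, 0)
    if magic != b"TBI\x01":
        raise ValueError(f"not a tabix index: magic={magic!r}")
    header = dict(zip(FIELDS, values))
    # 'meta' is stored as an int32_t holding the comment character code.
    header["meta"] = chr(header["meta"])
    # 'names' is l_nm bytes of zero-terminated strings, one per sequence.
    start = HEADER.size
    names = raw[start:start + header["l_nm"]]
    header["names"] = [n.decode("ascii") for n in names.split(b"\x00")[:-1]]
    return header


def load_tbi_header(path):
    """Decompress a .tbi file and parse its header (hypothetical helper)."""
    with gzip.open(path, "rb") as fh:
        return read_tbi_header(fh.read())


# Synthetic payload for illustration only: one VCF sequence named "SL2.50ch00".
example = struct.pack("<4s8i", b"TBI\x01", 1, 2, 1, 2, 0, ord("#"), 0, 11)
example += b"SL2.50ch00\x00"
print(read_tbi_header(example))
```

The per-sequence bin and interval lists follow the names and can be read the same way, one `struct` unpack per field.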

Notes:
- The index file is BGZF compressed.

- All integers are little-endian.

- When (format&0x10000) is true, the coordinate follows the BED rule (i.e. half-closed-half-open and
zero based); otherwise, the coordinate follows the GFF rule (closed and one based).

- For the SAM format, the end of a region equals POS plus the reference length in the alignment, inferred
from CIGAR. For the VCF format, the end of a region equals POS plus the size of the deletion.

- Field col_beg may equal col_end, and in this case, the end of a region is end=beg+1.

- Example:
  For GFF, format=0      , col_seq=1, col_beg=4, col_end=5, meta='#' and skip=0.
  For BED, format=0x10000, col_seq=1, col_beg=2, col_end=3, meta='#' and skip=0.

- Given a zero-based, half-closed and half-open region [beg, end), the bin number is calculated with
the following C function:
    int reg2bin(int beg, int end) {
        --end;
        if (beg>>14 == end>>14) return ((1<<15)-1)/7 + (beg>>14);
        if (beg>>17 == end>>17) return ((1<<12)-1)/7 + (beg>>17);
        if (beg>>20 == end>>20) return ((1<< 9)-1)/7 + (beg>>20);
        if (beg>>23 == end>>23) return ((1<< 6)-1)/7 + (beg>>23);
        if (beg>>26 == end>>26) return ((1<< 3)-1)/7 + (beg>>26);
        return 0;
    }

- The list of bins that may overlap a region [beg, end) can be obtained with the following C function:
    #include <stdint.h> /* uint16_t */
    #define MAX_BIN (((1<<18)-1)/7)
    int reg2bins(int rbeg, int rend, uint16_t list[MAX_BIN]) {
        int i = 0, k;
        --rend;
        list[i++] = 0;
        for (k =    1 + (rbeg>>26); k <=    1 + (rend>>26); ++k) list[i++] = k;
        for (k =    9 + (rbeg>>23); k <=    9 + (rend>>23); ++k) list[i++] = k;
        for (k =   73 + (rbeg>>20); k <=   73 + (rend>>20); ++k) list[i++] = k;
        for (k =  585 + (rbeg>>17); k <=  585 + (rend>>17); ++k) list[i++] = k;
        for (k = 4681 + (rbeg>>14); k <= 4681 + (rend>>14); ++k) list[i++] = k;
        return i; // #elements in list[]
    }
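
For readers working in Python, the two C functions above translate almost line for line. This is a sketch rather than the code used by tabixpy.py: `to_zero_based_half_open` is a hypothetical helper that applies the coordinate rule from the notes (BED when `format & 0x10000` is set, GFF otherwise), and coordinates are assumed to stay below 2^29, the range the five-level binning scheme covers.

```python
MAX_BIN = ((1 << 18) - 1) // 7  # total number of bins, as in the C macro


def to_zero_based_half_open(beg, end, fmt):
    """Convert file coordinates to [beg, end) according to the format flag."""
    if fmt & 0x10000:
        return beg, end      # BED rule: already zero-based, half-open
    return beg - 1, end      # GFF rule: one-based, closed


def reg2bin(beg, end):
    """Bin number for the zero-based, half-open region [beg, end)."""
    end -= 1
    # Try the finest level first (16 kb windows, shift 14), then coarser ones.
    for shift, offset in ((14, 4681), (17, 585), (20, 73), (23, 9), (26, 1)):
        if beg >> shift == end >> shift:
            return offset + (beg >> shift)
    return 0


def reg2bins(beg, end):
    """All bins that may overlap the zero-based, half-open region [beg, end)."""
    end -= 1
    bins = [0]
    for shift, offset in ((26, 1), (23, 9), (20, 73), (17, 585), (14, 4681)):
        first = offset + (beg >> shift)
        last = offset + (end >> shift)
        bins.extend(range(first, last + 1))
    return bins


# A one-based, closed GFF-style region covering positions 1..100.
beg, end = to_zero_based_half_open(1, 100, 0)
print(reg2bin(beg, end))   # 4681
print(reg2bins(beg, end))  # [0, 1, 9, 73, 585, 4681]
```

The offsets 1, 9, 73, 585 and 4681 are the values of `((1<<k)-1)/7` for k = 3, 6, 9, 12 and 15, so both functions agree with the C versions on which bin starts each level.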

Schema

https://jsonschema.net/home

Example output

JSON

{
    "__format_name__": "TBJ",
    "__format_ver__": 2,
    "n_ref": 1,
    "format": 2,
    "col_seq": 1,
    "col_beg": 2,
    "col_end": 0,
    "meta": "#",
    "skip": 0,
    "l_nm": 11,
    "names": [
        "SL2.50ch00"
    ],
    "refs": [{
        "ref_n": 0,
        "ref_name": "SL2.50ch00",
        "n_bin": 86,
        "bins": [{
                "bin_n": 0,
                "bin": 4681,
                "n_chunk": 1,
                "chunks": [
                  [29542, 8160890030]
                ]
            },
            {
                "bin_n": 85,
                "bin": 4766,
                "n_chunk": 1,
                "chunks": [
                    [460168303127, 461352730624]
                ]
            }
        ],
        "n_intv": 86,
        "intvs": [29542, 460168303127]
    }],
    "n_no_coor": null
}
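
The values in "chunks" and "intvs" are virtual file offsets. In BGZF, a virtual offset packs the byte offset of a compressed block into its upper 48 bits and the position inside the decompressed block into its lower 16 bits. A minimal sketch of unpacking them, using hypothetical helper names and the first chunk from the example above:

```python
def split_virtual_offset(voffset):
    """Return (compressed block offset, offset within the decompressed block)."""
    return voffset >> 16, voffset & 0xFFFF


def join_virtual_offset(block_offset, within_block):
    """Inverse of split_virtual_offset."""
    return (block_offset << 16) | within_block


# First chunk of bin 4681 in the example output.
chunk_beg, chunk_end = 29542, 8160890030
print(split_virtual_offset(chunk_beg))  # (0, 29542)
print(split_virtual_offset(chunk_end))  # (124525, 19630)
```

To fetch the records of a chunk, a reader would seek to the block offset in the compressed file, decompress that block, and skip the given number of bytes inside it.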

File Sizes

Compressed

1.1K annotated_tomato_150.100000.vcf.gz.tbi
2.0K annotated_tomato_150.100000.vcf.gz.tbj
727K annotated_tomato_150.vcf.bgz.tbi
1.2M annotated_tomato_150.vcf.bgz.tbj

Uncompressed

1.1K annotated_tomato_150.100000.vcf.gz.tbi
 15K annotated_tomato_150.100000.vcf.gz.tbj
727K annotated_tomato_150.vcf.bgz.tbi
8.4M annotated_tomato_150.vcf.bgz.tbj
