Python package for writing and reading a local collection of biological sequences. The repository is non-redundant, compressed, and journalled, making it efficient to store and transfer incremental snapshots.
biocommons.seqrepo
Python package for writing and reading a local collection of biological sequences. The repository is non-redundant, compressed, and journalled, making it efficient to store and transfer multiple snapshots.
Released under the Apache License, 2.0.
Features
Timestamped snapshots of read-only sequence repository
Space-efficient storage of sequences within a single snapshot and across snapshots
Bandwidth-efficient transfer of incremental updates
Fast fetching of sequence slices on chromosome-scale sequences
Precomputed digests that may be used as sequence aliases
Mappings of external aliases (i.e., accessions or identifiers like NM_013305.4) to sequences
The above features are achieved by storing sequences non-redundantly and compressed, using an add-only journalled filesystem structure within a single snapshot, and by using hard links across snapshots. Each sequence is associated with namespaced aliases such as <seguid,rvvuhY0FxFLNwf10FXFIrSQ7AvQ>, <ncbi,NP_004009.1>, <gi,5032303>, <ensembl-75,ENSP00000354464>, and <ensembl-85,ENSP00000354464.4> (all of which refer to the same sequence). The block gzipped format (BGZF) enables pysam to provide fast random access to compressed sequences.
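To make the idea of digest-based aliases concrete, here is a minimal sketch of how a SEGUID-style alias and a SHA-1 hex alias can be computed from a sequence using only the standard library. The function names are hypothetical, and the uppercase normalization is an assumption; this is not necessarily how seqrepo computes or stores its digests.

```python
import base64
import hashlib


def seguid(seq: str) -> str:
    """SEGUID-style digest: base64 of the SHA-1 of the sequence, '=' padding removed."""
    # Assumption: the sequence is normalized to uppercase before hashing.
    digest = hashlib.sha1(seq.upper().encode("ascii")).digest()
    return base64.b64encode(digest).decode("ascii").rstrip("=")


def sha1_hex(seq: str) -> str:
    """Hexadecimal SHA-1 digest of the (uppercased) sequence."""
    return hashlib.sha1(seq.upper().encode("ascii")).hexdigest()


def digest_aliases(seq: str) -> list:
    """Return (namespace, alias) pairs derived purely from the sequence itself."""
    return [("seguid", seguid(seq)), ("sha1", sha1_hex(seq))]


if __name__ == "__main__":
    for namespace, alias in digest_aliases("TGGTGGCACGCGCTTGTAGT"):
        print(f"<{namespace},{alias}>")
```

Because such aliases depend only on the sequence content, two records with identical residues map to the same digest, which is what allows a sequence to be stored once and referenced by many external aliases.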
For more information, see doc/design.rst.
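The space savings from hard links across snapshots can be illustrated with a small, self-contained sketch. It creates a throwaway "snapshot" directory, hard-links its files into a second directory, and checks that both names refer to the same data on disk. The directory name 20160907, the file names, and the helper functions are hypothetical; this only demonstrates the filesystem mechanism, not seqrepo's actual snapshot code, and it assumes a filesystem that supports hard links.

```python
import os
import tempfile


def snapshot_with_hard_links(src_dir: str, dst_dir: str) -> None:
    """Create dst_dir and hard-link every regular file from src_dir into it."""
    os.makedirs(dst_dir)
    for name in os.listdir(src_dir):
        src = os.path.join(src_dir, name)
        if os.path.isfile(src):
            # A hard link adds a second directory entry for the same data blocks.
            os.link(src, os.path.join(dst_dir, name))


def demo():
    """Return (same_inode, link_count) for a file shared by two snapshots."""
    with tempfile.TemporaryDirectory() as root:
        old = os.path.join(root, "20160906")
        new = os.path.join(root, "20160907")  # hypothetical later snapshot
        os.makedirs(old)
        with open(os.path.join(old, "seqs.fa.bgz"), "wb") as fh:
            fh.write(b"ACGT" * 1000)  # placeholder bytes, not real BGZF data

        snapshot_with_hard_links(old, new)

        # Only files that are new in the later snapshot consume extra space.
        with open(os.path.join(new, "added.fa.bgz"), "wb") as fh:
            fh.write(b"TTTT" * 10)

        a = os.stat(os.path.join(old, "seqs.fa.bgz"))
        b = os.stat(os.path.join(new, "seqs.fa.bgz"))
        return a.st_ino == b.st_ino, b.st_nlink


if __name__ == "__main__":
    print(demo())
```

Since existing files are never modified in an add-only layout, sharing them between snapshots through hard links is safe, and a new snapshot only costs the space of the files added to it.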
Deployment Scenarios
Available now: Local read-only archive, mirrored from public site, accessed via Python API (see Mirroring documentation)
Available now: Local read-write archive, maintained with command line utility and/or API (see Command Line Interface documentation).
Planned: Docker-based data-only container that may be linked to application container
Planned: Docker image that provides REST interface for local or remote access
Requirements
Reading a sequence repository requires several packages, all of which are available from PyPI. Installation should be as simple as pip install biocommons.seqrepo.
Writing sequence files also requires bgzip, which is provided in the htslib repo. Ubuntu users should install the tabix package with sudo apt install tabix.
Development and deployments are on Ubuntu. Other systems may work but are not tested. Patches to get other systems working would be welcomed.
Quick Start
On Ubuntu 16.04:
    $ sudo apt install -y python3-dev gcc zlib1g-dev tabix
    $ pip install seqrepo
    $ seqrepo pull -i 20160906
    $ seqrepo show-status -i 20160906
    seqrepo 0.2.3.post3.dev8+nb8298bd62283
    root directory: /usr/local/share/seqrepo/20160906, 7.9 GB
    backends: fastadir (schema 1), seqaliasdb (schema 1)
    sequences: 773587 sequences, 93051609959 residues, 192 files
    aliases: 5579572 aliases, 5480085 current, 26 namespaces, 773587 sequences

    $ seqrepo start-shell -i 20160906
    In [1]: sr["NC_000001.11"][780000:780020]
    Out[1]: 'TGGTGGCACGCGCTTGTAGT'

    # N.B. The following output is edited
    $ seqrepo export -i 20160906 | head -n100
    >sha1:9a2acba3dd7603f... seguid:mirLo912A/MppLuS1cUyFMduLUQ ensembl-85:GENSCAN00000003538 sh:---7nAwbv5Fs2Ml2-k3X6Zvj-6ZcjeD3 ...
    MDSPLREDDSQTCARLWEAEVKRHSLEGLTVFGTAVQIHNVQRRAIRAKGTQEAQAELLCRGPRLLDRFLEDACILKEGRGTDTGQHCRGDARISSHLEA
    SGTHIQLLALFLVSSSDTPPSLLRFCHALEHDIRYNSSFDSYYPLSPHSRHNDDLQTPSSHLGYIITVPDPTLPLTFASLYLGMAPCTSMGSSSMGIFQS
    QRIHAFMKGKNKWDEYEGRKESWKIRSNSQTGEPTF
    >sha1:ca996b263102b1... seguid:yplrJjECsVqQufeYy0HkDD16z58 ncbi:XR_001733142.1 sh:---WkVUs3IT3_ZZM-ReDjypLo6d_vJx6 gi:1034683989
    TTTACGTCTTTCTGGGAATTTATACTGGAAGTATACTTACCTCTGTGCAAAATTGCAAATATATAAGGTAATTCATTCCAGCATTGCTTATATTAGGTTG
    AACTATGTAACATTGACATTGATGTGAATCAAAAATGGTTGAAGGCTGGCAGTTTCATATGATTCAGCCTATAATAGCAAAAGATTGAAAAAATCCATTA
    ATACAGTGTGGTTCAAAAAAATTTGTTGTATCAAGGTAAAATAATAGCCTGAATATAATTAAGATAGTCTGTGTATACATCGATGAAAACATTGCCAATA
See Installation and Mirroring for more information.
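The interactive lookup in the Quick Start session can also be done from a script. The sketch below wraps it in a small helper, assuming the package exposes a SeqRepo class that is constructed from a snapshot directory and supports alias lookup followed by slicing, as the sr[...][...] expression in the session suggests. The helper name fetch_slice and its argument checking are hypothetical additions; check the import path and constructor against the installed version.

```python
def fetch_slice(root_dir: str, alias: str, start: int, end: int) -> str:
    """Return residues [start, end) of the sequence known by `alias`.

    Coordinates are 0-based and end-exclusive, like Python slices.
    """
    if start < 0 or end < start:
        raise ValueError(f"invalid interval: [{start}, {end})")

    # Imported lazily so this sketch can be loaded without the package installed.
    # Assumption: the class is importable from biocommons.seqrepo.
    from biocommons.seqrepo import SeqRepo

    sr = SeqRepo(root_dir)        # open a local snapshot directory
    return sr[alias][start:end]   # alias lookup, then slice


# Example call, mirroring the shell session in the Quick Start
# (requires a pulled snapshot at this path):
#   fetch_slice("/usr/local/share/seqrepo/20160906", "NC_000001.11", 780000, 780020)
```

A slice of 780000:780020 covers 20 residues, which matches the 20-character string shown in the session output.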
Hashes for biocommons.seqrepo-0.3.0a1.tar.gz
Algorithm | Hash digest
---|---
SHA256 | 8f1466fd8c15cee65e266279a348e8ff2f119c1ec2505996e9585b9a3248a068
MD5 | 70e0e4a905595c328add0f1518478082
BLAKE2b-256 | 913d7ab8f24d5dd3da8aa7bab56a778ed0c10e830d6c134e636e7c2339fd7998
Hashes for biocommons.seqrepo-0.3.0a1-py2.py3-none-any.whl
Algorithm | Hash digest
---|---
SHA256 | 72a71ef74fb19f6f51673eb55bed0d5cde50309fc79161e6e1c735015126b34b
MD5 | 0c6dcf5274ded60297b75b2d9ed62199
BLAKE2b-256 | 4987e428c81f693bc29fac0f9fbb155c541e60bff6f029119d3ad8c0b3fe6b04
Hashes for biocommons.seqrepo-0.3.0a1-py2.7.egg
Algorithm | Hash digest
---|---
SHA256 | e93dd3ab3a19ba24e9daf5cda85831c5eee4b0569586ff784b147cf3e7bf3273
MD5 | 1915dc87f5e8c38d1fa00abdbfbc79f8
BLAKE2b-256 | c6d9cdaa85acca8735ebf7af237d07dfc60fa6649f9c33d881d8b656f8478f0f