biocommons.seqrepo

Python package for writing and reading a local collection of biological sequences. The repository is non-redundant, compressed, and journalled, making it efficient to store and transfer multiple snapshots.

Released under the Apache License, 2.0.

Features

Timestamped snapshots of read-only sequence repository
Space-efficient storage of sequences within a single snapshot and across snapshots
Bandwidth-efficient transfer of incremental updates
Fast fetching of sequence slices on chromosome-scale sequences
Precomputed digests that may be used as sequence aliases
Mappings of external aliases (e.g., accessions or identifiers like NM_013305.4) to sequences
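
The digest and alias features listed above can be illustrated with a toy in-memory model. This is a minimal sketch, not the seqrepo implementation: the ToyRepo class, its methods, and the "example"/"other" aliases are hypothetical names invented for illustration. The only thing taken as given is the standard SEGUID definition (base64 of the SHA-1 of the uppercased sequence, with the trailing "=" padding removed).

```python
import base64
import hashlib


def seguid(sequence: str) -> str:
    """Return base64(SHA-1(uppercased sequence)) with '=' padding stripped."""
    digest = hashlib.sha1(sequence.upper().encode("ascii")).digest()
    return base64.b64encode(digest).decode("ascii").rstrip("=")


class ToyRepo:
    """Hypothetical in-memory model of non-redundant storage with aliases."""

    def __init__(self):
        self._sequences = {}  # digest -> sequence; each distinct sequence is stored once
        self._aliases = {}    # (namespace, alias) -> digest

    def store(self, sequence, aliases):
        """Store a sequence once and register every (namespace, alias) pair for it."""
        seq_id = seguid(sequence)
        self._sequences.setdefault(seq_id, sequence.upper())
        self._aliases[("seguid", seq_id)] = seq_id  # the digest is itself an alias
        for namespace, alias in aliases:
            self._aliases[(namespace, alias)] = seq_id
        return seq_id

    def fetch(self, namespace, alias, start=None, end=None):
        """Look up a sequence by namespaced alias and return the requested slice."""
        return self._sequences[self._aliases[(namespace, alias)]][start:end]


repo = ToyRepo()
first = repo.store("ACGTACGT", [("example", "seq-A")])
second = repo.store("acgtacgt", [("other", "seq-B")])  # same residues, different alias
print(first == second, len(repo._sequences))
print(repo.fetch("other", "seq-B", 2, 5))
```

Because both aliases resolve to the same digest, the second call to store adds an alias but no new sequence data.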

The above features are achieved by storing sequences non-redundantly and compressed, using an add-only journalled filesystem structure within a single snapshot, and by using hard links across snapshots. Each sequence is associated with a namespaced alias such as <seguid,rvvuhY0FxFLNwf10FXFIrSQ7AvQ>, <ncbi,NP_004009.1>, <gi,5032303>, <ensembl-75,ENSP00000354464>, <ensembl-85,ENSP00000354464.4> (all of which refer to the same sequence). Block gzipped format (BGZF) enables pysam to provide fast random access to compressed sequences.
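
To see why block-wise compression permits fast slicing, consider the simplified sketch below. It is not BGZF and does not use pysam: it compresses fixed-size chunks independently with zlib and keeps them in a list, so a slice only requires decompressing the chunks that overlap it. The function names and the block size are assumptions chosen for illustration.

```python
import zlib

BLOCK_SIZE = 1 << 16  # uncompressed bytes per block (illustrative value)


def compress_blocks(sequence: bytes, block_size: int = BLOCK_SIZE) -> list:
    """Compress each fixed-size chunk independently so it can be read on its own."""
    return [
        zlib.compress(sequence[i:i + block_size])
        for i in range(0, len(sequence), block_size)
    ]


def fetch_slice(blocks: list, start: int, end: int, block_size: int = BLOCK_SIZE) -> bytes:
    """Return sequence[start:end] (non-negative indices), touching only overlapping blocks."""
    if start >= end:
        return b""
    first = start // block_size        # index of the block containing `start`
    last = (end - 1) // block_size     # index of the block containing `end - 1`
    data = b"".join(zlib.decompress(block) for block in blocks[first:last + 1])
    offset = first * block_size        # position of data[0] in the full sequence
    return data[start - offset:end - offset]


sequence = b"ACGT" * 50000             # 200,000 residues
blocks = compress_blocks(sequence)
print(len(blocks), fetch_slice(blocks, 65530, 65550))
```

The cost of a fetch depends on the length of the slice rather than on the length of the whole sequence, which is the property that matters for chromosome-scale sequences.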

For more information, see doc/design.rst.

Anticipated deployments

Local read-only archive, mirrored from public site, accessed via Python API
Local read-only archive, mirrored from public site, accessed via REST interface (not yet available)
Local read-write archive, maintained with command line utility and/or API

Requirements

Reading a sequence repository requires several packages, all of which are available from PyPI. Installation should be as simple as pip install biocommons.seqrepo.

Writing sequence files also requires bgzip, which is provided in the htslib repo. Ubuntu users should install the tabix package with sudo apt install tabix.

Development and deployments are on Ubuntu. Other systems may work but are not tested. Patches to get other systems working would be welcomed.

Quick Start

On Ubuntu 16.04:

$ sudo apt install -y python3-dev gcc zlib1g-dev tabix
$ pip install seqrepo
$ rsync -HRavP rsync.biocommons.org::seqrepo/20160828 /usr/local/share/seqrepo/
$ seqrepo -d /usr/local/share/seqrepo/20160828 start-shell
seqrepo 0.2.3.dev2+neeca95d3ae6e.d20160830
root directory: /opt/seqrepo/20160828, 7.9 GB
backends: fastadir (schema 1), seqaliasdb (schema 1)
sequences: 773511 sequences, 93005806376 residues, 189 files
aliases: 5572724 aliases, 5473237 current, 9 namespaces, 773511 sequences

In [1]: sr["NC_000001.11"][780000:780020]
Out[1]: 'TGGTGGCACGCGCTTGTAGT'

See Installation and Mirroring for more information.
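
The rsync command in the Quick Start uses -H, which tells rsync to preserve hard links. The sketch below shows, under simplifying assumptions, why hard links make multiple snapshots cheap: a new snapshot directory can point at the same on-disk files as the previous one instead of copying them. The snapshot function and the directory and file names are hypothetical and are not part of the seqrepo command line or API.

```python
import os
import tempfile


def snapshot(previous_dir: str, new_dir: str) -> None:
    """Create new_dir as a snapshot of previous_dir by hard-linking every file."""
    for root, _dirs, files in os.walk(previous_dir):
        relative = os.path.relpath(root, previous_dir)
        target_root = os.path.normpath(os.path.join(new_dir, relative))
        os.makedirs(target_root, exist_ok=True)
        for name in files:
            # A hard link adds a second directory entry for the same data on disk.
            os.link(os.path.join(root, name), os.path.join(target_root, name))


with tempfile.TemporaryDirectory() as base:
    old = os.path.join(base, "snap-A")
    new = os.path.join(base, "snap-B")
    os.makedirs(os.path.join(old, "sequences"))
    path_old = os.path.join(old, "sequences", "chunk.bin")
    with open(path_old, "wb") as fh:
        fh.write(b"ACGT" * 1000)

    snapshot(old, new)

    path_new = os.path.join(new, "sequences", "chunk.bin")
    same_inode = os.stat(path_old).st_ino == os.stat(path_new).st_ino
    link_count = os.stat(path_old).st_nlink
    print(same_inode, link_count)
```

Both paths refer to one inode, so the file content occupies disk space only once no matter how many snapshots include it. This requires a filesystem that supports hard links and both snapshots on the same filesystem.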

Project details

Release history

Version	Released	Notes
0.6.11	Mar 19, 2025
0.6.10.post1	Mar 3, 2025
0.6.10a1	Mar 3, 2025	pre-release
0.6.9	Feb 20, 2024
0.6.8	Feb 20, 2024
0.6.7	Feb 14, 2024
0.6.6	Nov 13, 2023
0.6.5	Dec 10, 2021
0.6.5a0	Dec 10, 2021	pre-release
0.6.4	Jun 14, 2021
0.6.3.post1	May 25, 2021
0.6.3	Sep 11, 2020
0.6.2	Jul 13, 2020
0.6.1	Jul 8, 2020
0.6.0	Jul 6, 2020	yanked: the wheel in this package is broken. Modules in implicit namespaces are not found/packaged correctly by wheel.
0.5.6	Apr 12, 2020
0.5.5	Apr 11, 2020
0.5.4	Apr 7, 2020
0.5.3	Apr 3, 2020
0.5.2	Aug 30, 2019
0.5.1	May 27, 2019
0.5.0	May 20, 2019
0.4.5	May 14, 2019
0.4.4	Nov 26, 2018
0.4.3	Nov 26, 2018
0.4.2	Oct 21, 2018
0.4.1	Oct 21, 2018
0.4.0	Oct 21, 2018
0.3.9	Oct 1, 2018
0.3.8	Oct 1, 2018
0.3.7.post0	Sep 29, 2018
0.3.7	Sep 29, 2018
0.3.6	Aug 19, 2018
0.3.5	Jul 16, 2017
0.3.4	Jul 4, 2017
0.3.3	Jul 3, 2017
0.3.2	Jul 3, 2017
0.3.1	Dec 14, 2016
0.3.0	Oct 6, 2016
0.3.0a1	Sep 22, 2016	pre-release
0.3.0.dev2	Sep 13, 2016	pre-release
0.3.0.dev0	Sep 13, 2016	pre-release
0.2.3.post2	Sep 6, 2016	this version
0.2.2	Aug 30, 2016
0.2.1.post1	Aug 24, 2016
0.2.1	Aug 24, 2016
0.2.0	Aug 24, 2016
0.1.9	Aug 23, 2016

Download files

Source Distribution

File metadata

Download URL: biocommons.seqrepo-0.2.3.post2.tar.gz
Upload date: Sep 6, 2016
Size: 40.8 kB
Tags: Source
Uploaded using Trusted Publishing? No

File hashes

Hashes for biocommons.seqrepo-0.2.3.post2.tar.gz
Algorithm	Hash digest
SHA256	`c45ea6b301c82fdcf90194a637e9f36a8cdb5889a37d8b67c5b5534e430bc81b`
MD5	`0c181cef8073857d59e45ff81b3dafc6`
BLAKE2b-256	`830383961b64dca6bcc802f42dfb641e3f9a8a3448cfc1b853d8d8dc93ff7a82`
