Hash-based phonemic sequence identifiers
Konstel(lations)
Not yet stable, proceed with caution
An extensible command-line tool and library for generating memorable, pronounceable hash-based identifiers for sequences, biological or otherwise. For further details and my SARS-CoV-2 naming proposal, please read my blog post. Requires Python 3.6+.
SARS-CoV-2 naming
Phonemic and truncated cbase32 identifiers provide 36 and 40 bits of entropy respectively, producing no collisions within publicly deposited SARS-CoV-2 spike protein sequences as of 2021-04-12.
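The entropy figures above can be checked with a little arithmetic. The sketch below assumes each phonemic syllable is drawn from 16 consonants and 4 vowels (inferred from the example identifiers, not confirmed here) and that the truncated hash is 8 base32 characters. The `collision_probability` helper is a hypothetical illustration using the standard birthday approximation; it is not part of konstel.

```python
import math


def entropy_bits(alphabet_size: int, length: int) -> float:
    """Bits of entropy in `length` symbols drawn uniformly from an alphabet."""
    return length * math.log2(alphabet_size)


def collision_probability(n_items: int, bits: float) -> float:
    """Birthday approximation: chance that any two of n_items share an ID."""
    space = 2.0 ** bits
    return -math.expm1(-n_items * (n_items - 1) / (2.0 * space))


# Assumption: 16 consonants x 4 vowels = 64 possible syllables, 6 syllables per ID
phonemic_bits = entropy_bits(16 * 4, 6)

# 8 base32 characters at 5 bits each, matching the 'hash-8' field
base32_bits = entropy_bits(32, 8)

print(phonemic_bits, base32_bits)
```

Under these assumptions the two schemes come out at 36 and 40 bits. Calling `collision_probability` with a dataset size of your own shows how quickly the collision risk grows as more distinct sequences are named.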
Install
Ideally inside a new virtualenv or conda environment:
# Latest release
pip install konstel
# Development version
git clone https://github.com/bede/konstel
pip install --editable konstel
Usage
Command line
$ konstel gen sars-cov-2-s.genome konstel/tests/data/spike.genome.fa --output table
scheme sars-cov-2-s
hash S:0k8n9hjh5xh5kbef1k6ye7e2d4brhpry5r985avrtf69v6amrbc0
hash-8 S:0k8n9hjh
id S:huhiji-gakihi
$ echo "ACGT" | konstel gen generic.nucl - --output table
scheme generic
hash 3qzkx17yf1vy0ssvd6xxvkt02973jvhzk51xv28cj6va16pvkbr0
id bituzu-gupahu-zolodu-lumaki-suripi-rozitu-guhabi-figogo
Python
>>> from konstel import konstel
>>> konstel.generate('sars-cov-2-s.protein', 'konstel/tests/data/spike.prot.fa')
{'scheme': 'sars-cov-2-s', 'hash': 'S:0k8n9hjh5xh5kbef1k6ye7e2d4brhpry5r985avrtf69v6amrbc0', 'hash-8': 'S:0k8n9hjh', 'id': 'S:huhiji-gakihi'}
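To give a feel for how a hash can become a pronounceable identifier, here is a minimal sketch of the general idea. It is not konstel's actual implementation and will not reproduce the identifiers shown above. It assumes SHA-256 as the hash, and the `CONSONANTS` and `VOWELS` alphabets and the `phonemic_id` function are hypothetical names chosen for illustration.

```python
import hashlib

# Hypothetical alphabets: 16 consonants (4 bits) and 4 vowels (2 bits)
CONSONANTS = "bdfghjklmnprstvz"
VOWELS = "aiou"


def phonemic_id(sequence: str, syllables: int = 6, group: int = 3) -> str:
    """Map the leading bits of a SHA-256 digest to consonant-vowel syllables."""
    digest = hashlib.sha256(sequence.upper().encode("ascii")).digest()
    total_bits = len(digest) * 8
    if syllables * 6 > total_bits:
        raise ValueError("not enough hash bits for that many syllables")
    value = int.from_bytes(digest, "big")
    parts = []
    for i in range(syllables):
        # Take the next 6 bits, starting from the most significant end
        chunk = (value >> (total_bits - 6 * (i + 1))) & 0b111111
        parts.append(CONSONANTS[chunk >> 2] + VOWELS[chunk & 0b11])
    # Join syllables into hyphen-separated words of `group` syllables each
    words = ["".join(parts[i:i + group]) for i in range(0, syllables, group)]
    return "-".join(words)


print(phonemic_id("ACGT"))
```

Because the identifier depends only on the hash of the normalised sequence, the same sequence always yields the same name, and anyone can recompute it independently.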