[DEPRECATED] Select non-redundant subset of DNA or protein-sequences
[DEPRECATED] hobohm
[DEPRECATED] Note: hobohm has been merged into greedysub, which should be used instead. greedysub implements a better algorithm (typically giving larger subsets).
The hobohm program aims to select a non-redundant subset of DNA- or protein-sequences, such that all pairwise sequence identities are below a given threshold.
The program takes as input (1) a text-file containing a list of pairwise similarities between sequences (name1 name2 similarity), and (2) a cutoff for deciding when two sequences are too similar (i.e., when they are "neighbors").
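To make the input format concrete, here is a minimal sketch of how such a pairwise list could be turned into a neighbor map. It is not the package's own code: the function name read_neighbors is hypothetical, fields are assumed to be whitespace-separated, and the comparison direction (similarity at or above the cutoff, distance at or below it) is an assumption.

```python
def read_neighbors(lines, cutoff, valuetype="sim"):
    """Build a neighbor map from 'name1 name2 value' lines.

    Assumption: a pair counts as neighbors when its similarity is at or
    above the cutoff, or when its distance is at or below the cutoff.
    """
    neighbors = {}
    for line in lines:
        fields = line.split()
        if len(fields) != 3:
            continue  # skip blank or malformed lines
        name1, name2, value = fields[0], fields[1], float(fields[2])
        # Register both names so items without neighbors still appear
        neighbors.setdefault(name1, set())
        neighbors.setdefault(name2, set())
        if name1 == name2:
            continue  # an item is never its own neighbor
        if valuetype == "sim":
            too_close = value >= cutoff
        else:
            too_close = value <= cutoff
        if too_close:
            neighbors[name1].add(name2)
            neighbors[name2].add(name1)
    return neighbors


example = [
    "seqA seqB 0.95",
    "seqA seqC 0.40",
    "seqB seqC 0.35",
]
graph = read_neighbors(example, cutoff=0.9)
```

With a cutoff of 0.9, only seqA and seqB end up as neighbors in this example; seqC has none.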
The output (written to file) is a list of names that should be kept in the subset. No two retained items are neighbors, and the algorithm aims to pick the largest such set, given the cutoff. (Note that this is a hard problem, and this heuristic is not optimal. See the notes on the computational intractability of the problem and the performance of heuristics in the greedysub README.)
The "Hobohm" algorithm was originally created with the purpose of selecting homology-reduced sets of protein data from larger datasets. "Homology-reduced" here means that the resulting data set should contain no pairs of sequences with high sequence identity:
"Selection of representative protein data sets", Protein Sci. 1992. 1(3):409-17.
This command-line program implements algorithm 2 from that paper, and can be applied to any type of data for which pairwise similarities (or distances) can be defined.
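As an illustration, the following is a rough sketch of a "remove until done" heuristic of the kind usually associated with algorithm 2: repeatedly drop the item that currently has the most neighbors until no neighbor pairs remain. It is a simplified sketch, not the package's implementation. The function name, the tie-breaking rule, and the handling of the keep argument are all assumptions.

```python
def remove_until_done(neighbors, keep=()):
    """Greedy sketch: repeatedly drop the item with the most neighbors.

    neighbors: dict mapping each name to the set of its neighbors.
    keep: names that are never dropped (hypothetical handling; two kept
          items that are neighbors of each other will both be retained).
    """
    # Work on a copy so the caller's map is left untouched
    graph = {name: set(nbrs) for name, nbrs in neighbors.items()}
    keep = set(keep)
    while True:
        candidates = [n for n in graph if graph[n] and n not in keep]
        if not candidates:
            break
        # Most neighbors first; ties are broken by name for reproducibility
        worst = max(candidates, key=lambda n: (len(graph[n]), n))
        for other in graph.pop(worst):
            graph[other].discard(worst)
    return sorted(graph)


toy = {"a": {"b", "c"}, "b": {"a"}, "c": {"a"}, "d": set()}
subset = remove_until_done(toy)
subset_keeping_a = remove_until_done(toy, keep=["a"])
```

In this toy case, dropping "a" alone resolves every conflict, so three items are retained. Forcing "a" to be kept costs both of its neighbors, which shows how a keep list can shrink the final subset.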
Availability
The hobohm source code is available on GitHub: https://github.com/agormp/hobohm. The executable can be installed from PyPI: https://pypi.org/project/hobohm/
Installation
python3 -m pip install hobohm
Upgrading to latest version:
python3 -m pip install --upgrade hobohm
Dependencies
hobohm relies on the pandas package, which is automatically included when using pip to install.
Usage
usage: hobohm [-h] [--val VALUETYPE] [-c CUTOFF] [-k KEEPFILE] INFILE OUTFILE

Select non-redundant subset of DNA or protein-sequences, such that all pairwise
sequence identities are below threshold.

positional arguments:
  INFILE           input file containing similarity or distance for each pair of
                   items: name1 name2 value
  OUTFILE          output file containing neighborless subset of items (one name per
                   line)

options:
  -h, --help       show this help message and exit
  --val VALUETYPE  specify whether values in INFILE are distances (--val dist) or
                   similarities (--val sim)
  -c CUTOFF        cutoff value for deciding which pairs are neighbors
  -k KEEPFILE      (optional) file with names of items that must be kept (one name per
                   line)
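The options above can be combined as in the following sketch, which writes an INFILE (and optionally a KEEPFILE) and assembles the matching command line. The helper name prepare_run and the file names pairs.txt, keep.txt and subset.txt are hypothetical, and whitespace-separated fields are assumed; only the flags themselves come from the usage text.

```python
from pathlib import Path
import tempfile


def prepare_run(pairs, cutoff, workdir, keep=None, valuetype="sim"):
    """Write INFILE (and optionally KEEPFILE); return the hobohm argument list."""
    workdir = Path(workdir)
    infile = workdir / "pairs.txt"    # hypothetical file name
    outfile = workdir / "subset.txt"  # hypothetical file name
    # One pair per line: name1 name2 value
    infile.write_text("".join(f"{a} {b} {v}\n" for a, b, v in pairs))
    cmd = ["hobohm", "--val", valuetype, "-c", str(cutoff)]
    if keep:
        keepfile = workdir / "keep.txt"  # hypothetical file name
        keepfile.write_text("".join(f"{name}\n" for name in keep))
        cmd += ["-k", str(keepfile)]
    cmd += [str(infile), str(outfile)]
    return cmd


with tempfile.TemporaryDirectory() as tmp:
    cmd = prepare_run(
        [("seqA", "seqB", 0.95), ("seqA", "seqC", 0.40)],
        cutoff=0.9,
        workdir=tmp,
        keep=["seqA"],
    )
    infile_text = Path(cmd[-2]).read_text()
    # With hobohm installed, the command could then be run with:
    # subprocess.run(cmd, check=True)
```

Building the argument list separately from running it makes it easy to inspect the exact command before anything is executed.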