(NOT YET FUNCTIONAL) Select a representative, non-redundant data set from a larger set, based on a list of pairwise similarities (or distances).
Project description
hobohm: command-line program for selecting a representative, non-redundant data set from a larger set, based on a list of pairwise similarities (or distances).
(placeholder - not yet working as described)
The "Hobohm" algorithm was originally created to select representative, non-redundant sets of protein data from a larger data set. It is described in the paper "Selection of representative protein data sets". Non-redundant here means that the resulting data set should contain no pairs of sequences with high similarity. This command-line program implements algorithm 2 from that paper.
The hobohm program takes as input (1) a text file containing a list of pairwise similarities and (2) a cutoff for deciding when two sequences are too similar.
The output (written to stdout) is a list of names that should be kept in the homology-reduced set. The algorithm aims to pick the largest such set, given the cutoff.
It is also possible to use a list of pairwise distances instead of similarities. The cutoff is then interpreted as the minimum distance required in the output data set.
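The selection step can be illustrated with a minimal Python sketch of a greedy "remove until done" procedure: count each item's neighbors (items it is too similar to), repeatedly remove the item with the most neighbors, and stop when no neighbors remain. This is not the hobohm source code. The function name hobohm2_sketch, the in-memory tuple input, and the alphabetical tie-breaking are assumptions made for illustration only.

```python
def hobohm2_sketch(pairs, cutoff):
    """Greedy selection sketch (hypothetical helper, not the hobohm API).

    pairs:  iterable of (name1, name2, similarity) tuples
    cutoff: pairs with similarity > cutoff are treated as redundant
    Returns the set of names that are kept.
    """
    neighbors = {}
    for name1, name2, similarity in pairs:
        # Register both names so items without neighbors are still kept
        neighbors.setdefault(name1, set())
        neighbors.setdefault(name2, set())
        if name1 != name2 and similarity > cutoff:
            neighbors[name1].add(name2)
            neighbors[name2].add(name1)

    while neighbors:
        # Item with the most remaining neighbors; ties go to the
        # alphabetically first name (an arbitrary choice for this sketch)
        worst = max(sorted(neighbors), key=lambda name: len(neighbors[name]))
        if not neighbors[worst]:
            break  # no redundant pairs are left
        # Remove the item and update the neighbor sets of the others
        for other in neighbors.pop(worst):
            neighbors[other].discard(worst)
    return set(neighbors)


example = [("a", "b", 0.9), ("b", "c", 0.8), ("c", "d", 0.2)]
kept = hobohm2_sketch(example, cutoff=0.5)
print(sorted(kept))
```

In this example "b" is too similar to both "a" and "c", so removing "b" alone resolves every redundant pair. Only names that appear in at least one input pair can end up in the result.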
Availability
The hobohm source code is available on GitHub: https://github.com/agormp/hobohm. The executable can be installed from PyPI: https://pypi.org/project/hobohm/
Installation
python3 -m pip install hobohm
Dependencies
There are no dependencies (apart from the Python standard library).
Overview
Input:
Option -s: pairwise similarities
(1) A text file containing pairwise similarities, one pair per line:
name1 name2 similarity
name1 name3 similarity
...
(2) A cutoff value. Pairs of items that are more similar than this cutoff are taken to be redundant, and one of them will be removed in the final output.
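As an illustration of the input format, the following sketch parses such lines into tuples. It assumes the three fields are separated by whitespace and that blank lines can be skipped; neither detail is stated above, and read_pairs is a hypothetical helper, not part of hobohm.

```python
def read_pairs(lines):
    """Parse 'name1 name2 value' lines into (name1, name2, float) tuples."""
    pairs = []
    for lineno, line in enumerate(lines, start=1):
        fields = line.split()
        if not fields:
            continue  # assumption: blank lines are ignored
        if len(fields) != 3:
            raise ValueError(f"line {lineno}: expected 3 fields, got {len(fields)}")
        name1, name2, value = fields
        pairs.append((name1, name2, float(value)))
    return pairs


parsed = read_pairs(["seq1 seq2 0.93\n", "\n", "seq1 seq3 0.12\n"])
print(parsed)
```

Any iterable of strings works as input, so an open file object can be passed directly.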
Option -d: pairwise distances
(1) A text file containing pairwise distances, one pair per line:
name1 name2 distance
name1 name3 distance
...
(2) A cutoff value. Pairs of items that are less distant than this cutoff are taken to be redundant, and one of them will be removed in the final output.
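Under the descriptions above, the two options differ only in the direction of the comparison against the cutoff. A small sketch of that rule follows; the function and parameter names are hypothetical, and the strict comparisons follow the wording "more similar than" and "less distant than", so a value exactly equal to the cutoff is treated as non-redundant here.

```python
def is_redundant(value, cutoff, values_are_distances=False):
    """Decide whether one pair counts as redundant (illustrative helper)."""
    if values_are_distances:
        return value < cutoff  # too close together
    return value > cutoff      # too similar


print(is_redundant(0.9, 0.5))                             # similarity mode
print(is_redundant(0.9, 0.5, values_are_distances=True))  # distance mode
```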
Output:
A list of names of items that should be kept in the non-redundant set, written to stdout. This set contains no pairs of items that are more similar (less distant) than the cutoff. The algorithm aims to make the set as large as possible. This can occasionally fail if there are multiple items with the same number of "neighbors" (items they are redundant with), because the choice of which tied item to remove can affect the final size.
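The effect of ties can be seen in a toy example. The sketch below runs a greedy "remove the item with the most neighbors" loop on a chain of five items (a-b-c-d-e, where each adjacent pair is redundant) with two different tie-breaking orders. The function greedy_keep and its tie_order parameter are hypothetical and only serve this illustration; how hobohm itself resolves ties is not specified here.

```python
def greedy_keep(edges, tie_order):
    """Drop the item with the most neighbors until none remain.

    edges:     redundant pairs as (name1, name2) tuples
    tie_order: list of all names; earlier names win ties
    """
    neighbors = {name: set() for name in tie_order}
    for a, b in edges:
        neighbors[a].add(b)
        neighbors[b].add(a)
    while True:
        candidates = [n for n in tie_order if n in neighbors]
        # max() returns the first maximal item, so tie_order breaks ties
        worst = max(candidates, key=lambda n: len(neighbors[n]))
        if not neighbors[worst]:
            return sorted(neighbors)
        for other in neighbors.pop(worst):
            neighbors[other].discard(worst)


chain = [("a", "b"), ("b", "c"), ("c", "d"), ("d", "e")]
print(greedy_keep(chain, ["b", "a", "c", "d", "e"]))  # ['a', 'c', 'e']
print(greedy_keep(chain, ["c", "a", "b", "d", "e"]))  # ['b', 'e']
```

Items b, c, and d all start with two neighbors. Removing b first leads to the largest possible set (three items), while removing c first leaves only two.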
Usage
To do