(NOT YET FUNCTIONAL) Select representative, non-redundant data set from larger set, based on list of pairwise similarities (or distances).

hobohm: command line program for selecting representative, non-redundant data set from larger set, based on list of pairwise similarities (or distances).

The "Hobohm" algorithm was originally created to select representative, non-redundant sets of protein data from a larger data set. Non-redundant here means that the resulting data set contains no pairs of sequences with high similarity. The algorithm is described in:

"Selection of representative protein data sets", Protein Sci. 1992. 1(3):409-17.

This command-line program implements algorithm 2 from that paper.

The hobohm program takes as input (1) a text file containing a list of pairwise similarities and (2) a cutoff for deciding when two sequences are too similar.

The output (written to stdout) is a list of names that should be kept in the homology-reduced set. The algorithm aims to make this set as large as possible for the given cutoff.
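To illustrate the idea for the similarity case, here is a minimal Python sketch of a greedy "remove until done" selection: the item with the most neighbors is removed repeatedly until no neighbors remain. It is not the hobohm source code. The function name `hobohm2_sketch`, the tuple-based input, and the alphabetical tie-breaking are assumptions made for this example.

```python
from collections import defaultdict


def hobohm2_sketch(pairs, cutoff):
    """Greedy 'remove until done' selection (illustrative sketch only).

    pairs: iterable of (name1, name2, similarity) tuples.
    Returns the sorted list of names that are kept.
    """
    neighbors = defaultdict(set)
    for a, b, similarity in pairs:
        # Touch both keys so every name is registered, even without neighbors.
        neighbors[a]
        neighbors[b]
        if a != b and similarity > cutoff:
            neighbors[a].add(b)
            neighbors[b].add(a)

    while neighbors:
        # sorted() makes tie-breaking alphabetical (an assumption of this sketch).
        worst = max(sorted(neighbors), key=lambda name: len(neighbors[name]))
        if not neighbors[worst]:
            break  # no item has any neighbor left
        for other in neighbors.pop(worst):
            neighbors[other].discard(worst)
    return sorted(neighbors)


pairs = [("a", "b", 0.95), ("a", "c", 0.40), ("b", "c", 0.92)]
print(hobohm2_sketch(pairs, cutoff=0.9))  # ['a', 'c']
```

In this example "b" is too similar to both "a" and "c", so removing it alone leaves a set with no redundant pairs.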

It is also possible to use a list of pairwise distances instead of similarities. The cutoff is then interpreted as the minimum distance required in the output data set.

Availability

The hobohm source code is available on GitHub: https://github.com/agormp/hobohm. The executable can be installed from PyPI: https://pypi.org/project/hobohm/

Installation

python3 -m pip install hobohm

Upgrading to latest version:

python3 -m pip install --upgrade hobohm

Dependencies

There are no dependencies (apart from the python standard library).

Overview

Input:

Option -s: pairwise similarities

(1) A text file containing pairwise similarities, one pair per line. All pairs of names must be listed.

name1 name2 similarity
name1 name3 similarity
...

(2) A cutoff value. Pairs of items that are more similar than this cutoff are taken to be redundant, and one of them will be removed in the final output.
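Because all pairs of names must be listed, it can be useful to check an input file before running the program. The following Python sketch parses lines in the `name1 name2 similarity` layout and reports any missing pairs. The helper names `read_pairs` and `missing_pairs` are hypothetical, and whitespace-separated fields and ignored blank lines are assumptions of this example, not documented behavior of hobohm.

```python
from itertools import combinations


def read_pairs(lines):
    """Parse 'name1 name2 value' lines into a dict keyed by unordered name pair."""
    values = {}
    for line in lines:
        fields = line.split()
        if not fields:
            continue  # skip blank lines (an assumption of this sketch)
        name1, name2, value = fields
        values[frozenset((name1, name2))] = float(value)
    return values


def missing_pairs(values):
    """Return the name pairs that are absent from the parsed input."""
    names = sorted({name for pair in values for name in pair})
    return [(a, b) for a, b in combinations(names, 2)
            if frozenset((a, b)) not in values]


example = """\
seq1 seq2 0.95
seq1 seq3 0.40
""".splitlines()

values = read_pairs(example)
print(missing_pairs(values))  # [('seq2', 'seq3')]
```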

Option -d: pairwise distances

(1) A text file containing pairwise distances, one pair per line. All pairs of names must be listed.

name1 name2 distance
name1 name3 distance
...

(2) A cutoff value. Pairs of items that are less distant than this cutoff are taken to be redundant, and one of them will be removed in the final output.
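The two input modes differ only in the direction of the comparison against the cutoff. Here is a small Python sketch of that rule, following the wording above ("more similar than" and "less distant than"). The function name `is_redundant` and the strict inequalities are assumptions; how hobohm treats values exactly equal to the cutoff is not stated here.

```python
def is_redundant(value, cutoff, mode):
    """Decide whether a pair counts as redundant.

    mode "s": value is a similarity, redundant when above the cutoff.
    mode "d": value is a distance, redundant when below the cutoff.
    """
    if mode == "s":
        return value > cutoff
    if mode == "d":
        return value < cutoff
    raise ValueError("mode must be 's' or 'd'")


# A similarity of 0.95 with cutoff 0.9 is redundant ...
print(is_redundant(0.95, 0.9, "s"))          # True
# ... and so is the matching distance 1 - 0.95 with cutoff 1 - 0.9.
print(is_redundant(1 - 0.95, 1 - 0.9, "d"))  # True
# A distant pair is kept in distance mode.
print(is_redundant(0.60, 0.1, "d"))          # False
```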

Output:

A list of names of items that should be kept in the non-redundant set, written to stdout. This set contains no pairs of items that are more similar (less distant) than the cutoff. The algorithm aims to make the set as large as possible, but the result can occasionally be smaller than the true maximum when multiple items have the same number of "neighbors".
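The effect of ties can be seen on a small chain of five items in which each item neighbors only the next one. The Python sketch below runs a greedy removal with an explicit tie-breaking order. The function name `remove_until_done` and its `prefer` parameter are hypothetical and exist only for this illustration; they do not describe how hobohm itself breaks ties.

```python
def remove_until_done(neighbors, prefer):
    """Greedy removal on a neighbor graph.

    Among the items with the most neighbors, the one that appears first
    in `prefer` is removed (a hypothetical tie-breaking order).
    """
    graph = {name: set(adjacent) for name, adjacent in neighbors.items()}
    while graph:
        most = max(len(adjacent) for adjacent in graph.values())
        if most == 0:
            break  # no neighbors left, the set is non-redundant
        tied = [n for n in prefer if n in graph and len(graph[n]) == most]
        victim = tied[0]
        for other in graph.pop(victim):
            graph[other].discard(victim)
    return sorted(graph)


# Chain a-b-c-d-e: items b, c and d all have two neighbors.
chain = {"a": {"b"}, "b": {"a", "c"}, "c": {"b", "d"},
         "d": {"c", "e"}, "e": {"d"}}

print(remove_until_done(chain, prefer="cabde"))  # ['b', 'e']
print(remove_until_done(chain, prefer="bdace"))  # ['a', 'c', 'e']
```

Removing the middle item "c" first splits the chain into two pairs and leaves only two items, whereas removing "b" first leads to the largest possible set of three.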

Usage

Usage: hobohm [-s|-d] FILE -c CUTOFF [-k KEEPFILE]

Options:
  --version    show program's version number and exit
  -h, --help   show this help message and exit
  -s SIMFILE   file with pairwise similarities: name1 name2 sim
  -d DISTFILE  file with pairwise distances: name1 name2 dist
  -c CUTOFF    cutoff for deciding which pairs are neighbors
  -k KEEPFILE  file with names that must be kept (one name per line)
