Select a representative subset of data, based on a list of pairwise similarities (or distances) between items.

hobohm: command-line program for selecting a representative subset of data, based on a list of pairwise similarities (or distances) between items.

The hobohm program aims to select a representative subset from a collection of items for which the pairwise similarities are known.

The program takes as input (1) a text file containing a list of pairwise similarities between items in a data set, and (2) a cutoff for deciding when two items are too similar (i.e., when they are "neighbors").

The output (written to stdout) is a list of the names that should be kept in the subset. No two retained items are neighbors, and the algorithm aims to make this set as large as possible, given the cutoff.

It is also possible to use a list of pairwise distances instead of similarities. The cutoff is then interpreted as the minimum distance required in the selected subset.

The "Hobohm" algorithm was originally created with the purpose of selecting homology-reduced sets of protein data from larger datasets. "Homology-reduced" here means that the resulting data set should contain no pairs of sequences with high sequence identity:

"Selection of representative protein data sets", Protein Sci. 1992. 1(3):409-17.

This command-line program implements algorithm 2 from that paper, and can be applied to any type of data for which pairwise similarities (or distances) can be defined.
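The neighbor-removal idea can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the program's actual source: the function name `select_representatives`, the dict-of-sets input, and the alphabetical tie-breaking are all choices made for this example. The sketch assumes the neighbor relation is symmetric.

```python
def select_representatives(neighbors):
    """Greedy neighbor-removal sketch (hypothetical helper, not hobohm's API).

    neighbors: dict mapping each item name to the set of names it is
               "too similar" to. The relation must be symmetric.
    Returns the set of names that are kept.
    """
    # Work on a copy so the caller's graph is left untouched
    graph = {name: set(nbrs) for name, nbrs in neighbors.items()}
    while graph:
        # Pick the item with the most remaining neighbors.
        # Ties go to the alphabetically first name (an arbitrary choice).
        worst = max(sorted(graph), key=lambda name: len(graph[name]))
        if not graph[worst]:
            break  # no neighbor pairs left: everything remaining is kept
        # Remove that item and erase it from its neighbors' sets
        for other in graph.pop(worst):
            graph[other].discard(worst)
    return set(graph)


example = {"a": {"b", "c"}, "b": {"a"}, "c": {"a"}, "d": set()}
kept = select_representatives(example)
```

Here `a` has two neighbors and is removed first, after which `b`, `c`, and `d` have none, so all three are kept.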

Availability

The hobohm source code is available on GitHub: https://github.com/agormp/hobohm. The executable can be installed from PyPI: https://pypi.org/project/hobohm/

Installation

python3 -m pip install hobohm

Upgrading to latest version:

python3 -m pip install --upgrade hobohm

Dependencies

There are no dependencies (apart from the Python standard library).

Overview

Input:

Option -s: pairwise similarities

(1) A text file containing pairwise similarities, one pair per line. Every pair of names must be listed. The similarity matrix is assumed to be symmetric, so each pair only needs to be listed in one direction (either "name1 name2" or "name2 name1").

name1 name2 similarity
name1 name3 similarity
...

(2) A cutoff value. Pairs of items that are more similar than this cutoff are taken to be redundant, and at least one of them will be removed in the final output.
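As a rough illustration of how such input might be turned into a neighbor graph, the sketch below parses lines of the form shown above. It assumes whitespace-separated fields and treats a pair as neighbors only when the similarity is strictly greater than the cutoff, following the wording "more similar than this cutoff". The function name `build_neighbors` is hypothetical and is not part of hobohm.

```python
def build_neighbors(lines, cutoff):
    """Parse 'name1 name2 similarity' lines into a symmetric neighbor graph.

    Hypothetical helper for illustration; assumes whitespace-separated fields.
    """
    neighbors = {}
    for line in lines:
        fields = line.split()
        if not fields:
            continue  # skip blank lines
        name1, name2, value = fields[0], fields[1], float(fields[2])
        # Register both names, even if they end up with no neighbors
        neighbors.setdefault(name1, set())
        neighbors.setdefault(name2, set())
        # Only one direction is listed, so record the pair both ways
        if name1 != name2 and value > cutoff:
            neighbors[name1].add(name2)
            neighbors[name2].add(name1)
    return neighbors


graph = build_neighbors(["a b 0.9", "a c 0.2", "b c 0.7"], cutoff=0.65)
```

With a cutoff of 0.65, the pairs `a`/`b` and `b`/`c` are neighbors, while `a`/`c` is not.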

Option -d: pairwise distances

(1) A text file containing pairwise distances, one pair per line. Every pair of names must be listed. The distance matrix is assumed to be symmetric, so each pair only needs to be listed in one direction (either "name1 name2" or "name2 name1").

name1 name2 distance
name1 name3 distance
...

(2) A cutoff value. Pairs of items that are less distant than this cutoff are taken to be redundant, and at least one of them will be removed in the final output.
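A pair file in this format can be generated from raw data with a short script. The sketch below is only an example under assumptions: the items are 2D points, the distance is Euclidean, and the names `points` and `pair_distance_lines` are made up for illustration. It uses `itertools.combinations` so that each unordered pair is listed exactly once.

```python
import math
from itertools import combinations


def pair_distance_lines(points):
    """Yield 'name1 name2 distance' lines, one per unordered pair of items.

    points: dict mapping item name to a coordinate tuple (illustrative data).
    """
    # combinations(..., 2) visits every unordered pair exactly once
    for (name1, p1), (name2, p2) in combinations(sorted(points.items()), 2):
        yield f"{name1} {name2} {math.dist(p1, p2):.4f}"


points = {"a": (0.0, 0.0), "b": (3.0, 4.0), "c": (0.0, 1.0)}
lines = list(pair_distance_lines(points))
```

Writing `"\n".join(lines)` to a file then gives input suitable for the `-d` option. Note that `math.dist` requires Python 3.8 or later.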

Output:

A list of names of items that should be kept in the non-redundant set, written to stdout. This set contains no pairs of items that are more similar (less distant) than the cutoff. The algorithm aims to make the set as large as possible. It can occasionally fall short of the maximum when several items have the same number of "neighbors", because the order in which tied items are removed then affects the result.
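The effect of removal order among tied items can be seen on a small chain of five items, where each item is a neighbor only of the items next to it. The sketch below is a simplified greedy removal written for this illustration (the function `greedy_keep` and its `order` parameter are hypothetical, not hobohm options); ties in neighbor count go to whichever name comes first in `order`.

```python
def greedy_keep(neighbors, order):
    """Greedy removal; ties in neighbor count go to the name earliest in `order`.

    Hypothetical illustration, not the program's implementation.
    """
    graph = {name: set(nbrs) for name, nbrs in neighbors.items()}
    while graph:
        candidates = [name for name in order if name in graph]
        # max() returns the first maximal element, so `order` breaks ties
        worst = max(candidates, key=lambda name: len(graph[name]))
        if not graph[worst]:
            break  # no neighbor pairs remain
        for other in graph.pop(worst):
            graph[other].discard(worst)
    return set(graph)


# Chain a-b-c-d-e: b, c, and d are tied with two neighbors each
chain = {"a": {"b"}, "b": {"a", "c"}, "c": {"b", "d"}, "d": {"c", "e"}, "e": {"d"}}

kept_b_first = greedy_keep(chain, ["a", "b", "c", "d", "e"])
kept_c_first = greedy_keep(chain, ["c", "a", "b", "d", "e"])
```

Removing `b` first leads to three kept items (`a`, `c`, `e`), which is the largest possible set for this chain. Removing `c` first splits the chain into two neighbor pairs and leaves only two kept items (`b`, `e`).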

Usage

usage: hobohm.py [-h] [-s | -d] [-c CUTOFF] [-k KEEPFILE] PAIRFILE

Selects representative subset of data based on list of pairwise similarities (or
distances), such that no retained items are close neighbors

positional arguments:
  PAIRFILE     file containing the similarity (option -s) or distance (option -d) for each
               pair of items: name1 name2 value

optional arguments:
  -h, --help   show this help message and exit
  -s           values in PAIRFILE are similarities (larger values = more similar)
  -d           values in PAIRFILE are distances (smaller values = more similar)
  -c CUTOFF    cutoff for deciding which pairs are neighbors
  -k KEEPFILE  file with names of items that must be kept (one name per line)

Usage examples

Select items such that the maximum pairwise similarity is 0.65

hobohm -s -c 0.65 pairsims.txt > nonredundant.txt

Select items such that minimum pairwise distance is 10

hobohm -d -c 10 pairdist.txt > nonredundant.txt

Select items such that the maximum pairwise similarity is 0.3, while keeping all items listed in keeplist.txt

hobohm -s -c 0.3 -k keeplist.txt pairsims.txt > nonredundant.txt
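After a run with a keep file, it can be useful to confirm that every must-keep name is present in the result. The sketch below is one possible check, assuming that the output file contains one name per line (the keep file is documented as one name per line; the output layout is an assumption here). The function name `missing_keep_names` is hypothetical.

```python
def missing_keep_names(keep_lines, output_lines):
    """Return the sorted must-keep names that are absent from the output.

    Hypothetical post-run check; assumes one name per line in both inputs.
    """
    keep = {line.strip() for line in keep_lines if line.strip()}
    kept = {line.strip() for line in output_lines if line.strip()}
    return sorted(keep - kept)


# In practice the lines would come from open("keeplist.txt") and
# open("nonredundant.txt"); inline lists are used here for illustration.
missing = missing_keep_names(["seq1\n", "seq7\n"], ["seq1\n", "seq3\n", "seq7\n"])
```

An empty list means that all names from the keep file were retained.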
