Select non-redundant subset of DNA or protein sequences, such that all pairwise sequence identities are below threshold.

hobohm: command-line program for selecting a representative subset of data, based on a list of pairwise similarities (or distances) between items.

Note: The greedysub program implements a better algorithm (typically giving larger subsets) and should be used instead. The hobohm program works but is no longer maintained.


The hobohm program aims to select a non-redundant subset of DNA or protein sequences, such that all pairwise sequence identities are below a given threshold.

The program takes as input (1) a text file containing a list of pairwise similarities (or distances) between sequences, one pair per line (name1 name2 value), and (2) a cutoff for deciding when two sequences are too similar (i.e., when they are "neighbors").
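To make the input format concrete, here is a minimal Python sketch that parses "name1 name2 value" lines and lists the pairs that would count as neighbors. It is not taken from the hobohm source. The function names and example sequence names are hypothetical. Treating a similarity at or above the cutoff as a neighbor follows from the goal that retained identities are "below" the threshold; the handling of the exact boundary, and the mirrored rule for distances, are assumptions.

```python
from io import StringIO


def read_pairs(handle):
    """Parse whitespace-separated lines of the form: name1 name2 value."""
    pairs = []
    for line in handle:
        fields = line.split()
        if not fields:
            continue  # skip blank lines
        name1, name2, value = fields
        pairs.append((name1, name2, float(value)))
    return pairs


def neighbor_pairs(pairs, cutoff, values_are_distances=False):
    """Return the (name1, name2) pairs that are too close, given the cutoff.

    Assumption: similarity >= cutoff (or distance <= cutoff) means "neighbors".
    """
    result = []
    for name1, name2, value in pairs:
        if values_are_distances:
            too_close = value <= cutoff
        else:
            too_close = value >= cutoff
        if too_close:
            result.append((name1, name2))
    return result


# Hypothetical input: three sequences with pairwise similarities
example = StringIO("seqA seqB 0.95\nseqA seqC 0.40\nseqB seqC 0.91\n")
pairs = read_pairs(example)
print(neighbor_pairs(pairs, cutoff=0.9))
# [('seqA', 'seqB'), ('seqB', 'seqC')]
```

With a cutoff of 0.9, seqB is a neighbor of both seqA and seqC, while seqA and seqC are not neighbors of each other.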

The output (written to file) is a list of names that should be kept in the subset. No two retained items are neighbors, and the algorithm aims to pick the largest such set for the given cutoff. (Note that this is a hard problem, and this heuristic is not optimal. See the notes on the computational intractability of the problem and the performance of heuristics in the greedysub README.)
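The "no retained items are neighbors" property can be checked independently of the program. The sketch below is one way to do that for similarity data. The function name and data are hypothetical, and the rule that similarity at or above the cutoff means "neighbors" is an assumption about the boundary case.

```python
def redundant_pairs(kept_names, pairs, cutoff):
    """Return pairs of kept items whose similarity is at or above the cutoff.

    An empty result means the kept set contains no neighbors
    (assuming similarity >= cutoff defines a neighbor pair).
    """
    kept = set(kept_names)
    return [
        (name1, name2, sim)
        for name1, name2, sim in pairs
        if name1 != name2 and name1 in kept and name2 in kept and sim >= cutoff
    ]


# Hypothetical pairwise similarities
pairs = [
    ("seqA", "seqB", 0.95),
    ("seqA", "seqC", 0.40),
    ("seqB", "seqC", 0.91),
]

print(redundant_pairs(["seqA", "seqC"], pairs, cutoff=0.9))  # valid subset
print(redundant_pairs(["seqA", "seqB"], pairs, cutoff=0.9))  # still redundant
```

The first call prints an empty list, because seqA and seqC are not neighbors. The second reports the offending pair.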

The "Hobohm" algorithm was originally created with the purpose of selecting homology-reduced sets of protein data from larger datasets. "Homology-reduced" here means that the resulting data set should contain no pairs of sequences with high sequence identity:

"Selection of representative protein data sets", Protein Sci. 1992. 1(3):409-17.

This command-line program implements algorithm 2 from that paper, and can be applied to any type of data for which pairwise similarities (or distances) can be defined.
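As a rough illustration, algorithm 2 is commonly described as: build a neighbor list for every item, then repeatedly remove the item with the most neighbors until no item has any neighbors left. The Python sketch below follows that description for similarity data. It is not the hobohm package's actual code: the function name, the tie-breaking rule, and the choice of similarity >= cutoff as the neighbor rule are all assumptions, and it does not handle a list of items that must be kept.

```python
def hobohm2_sketch(pairs, cutoff):
    """Greedy neighbor removal on (name1, name2, similarity) tuples.

    Returns a sorted list of names in which no two items are neighbors.
    """
    # Build the neighbor sets; every item seen in the input gets an entry.
    neighbors = {}
    for name1, name2, sim in pairs:
        neighbors.setdefault(name1, set())
        neighbors.setdefault(name2, set())
        if name1 != name2 and sim >= cutoff:  # assumed neighbor rule
            neighbors[name1].add(name2)
            neighbors[name2].add(name1)

    while neighbors:
        # Item with the most neighbors; ties are broken by name (an
        # arbitrary choice made here only to keep the result deterministic).
        worst = max(neighbors, key=lambda name: (len(neighbors[name]), name))
        if not neighbors[worst]:
            break  # nobody has neighbors any more
        # Remove the item and erase it from its neighbors' sets.
        for other in neighbors.pop(worst):
            neighbors[other].discard(worst)

    return sorted(neighbors)


# Hypothetical data: seqB is similar to all three other sequences
pairs = [
    ("seqA", "seqB", 0.95),
    ("seqB", "seqC", 0.93),
    ("seqB", "seqD", 0.91),
    ("seqA", "seqC", 0.20),
    ("seqA", "seqD", 0.30),
    ("seqC", "seqD", 0.10),
]
print(hobohm2_sketch(pairs, cutoff=0.9))
# ['seqA', 'seqC', 'seqD']
```

Removing the single most-connected item (seqB) leaves three mutually dissimilar sequences, whereas keeping seqB would force the other three out.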

Availability

The hobohm source code is available on GitHub: https://github.com/agormp/hobohm. The executable can be installed from PyPI: https://pypi.org/project/hobohm/

Installation

python3 -m pip install hobohm

Upgrading to latest version:

python3 -m pip install --upgrade hobohm

Dependencies

hobohm relies on the pandas package, which is installed automatically when hobohm is installed with pip.

Usage

usage: hobohm [-h] [--val VALUETYPE] [-c CUTOFF] [-k KEEPFILE] INFILE OUTFILE

Select non-redundant subset of DNA or protein-sequences, such that all pairwise sequence
identities are below threshold.

positional arguments:
  INFILE           input file containing similarity or distance for each pair of items: name1
                   name2 value
  OUTFILE          output file containing neighborless subset of items (one name per line)

options:
  -h, --help       show this help message and exit
  --val VALUETYPE  specify whether values in INFILE are distances (--val dist) or similarities
                   (--val sim)
  -c CUTOFF        cutoff value for deciding which pairs are neighbors
  -k KEEPFILE      (optional) file with names of items that must be kept (one name per line)
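
For scripted use, the command line can be assembled from Python. The sketch below uses only the options listed in the usage message above. The helper names, the file names (pairs.txt, kept.txt) and the cutoff of 0.9 are hypothetical, and the sketch always passes --val explicitly because the default is not stated here.

```python
import subprocess


def build_hobohm_command(infile, outfile, cutoff, valuetype="sim", keepfile=None):
    """Return the hobohm command line as a list of arguments."""
    if valuetype not in ("sim", "dist"):
        raise ValueError("valuetype must be 'sim' or 'dist'")
    cmd = ["hobohm", "--val", valuetype, "-c", str(cutoff)]
    if keepfile is not None:
        cmd += ["-k", str(keepfile)]  # names that must be kept
    cmd += [str(infile), str(outfile)]
    return cmd


def run_hobohm(infile, outfile, cutoff, valuetype="sim", keepfile=None):
    """Run hobohm (must be installed); raises CalledProcessError on failure."""
    cmd = build_hobohm_command(infile, outfile, cutoff, valuetype, keepfile)
    subprocess.run(cmd, check=True)


print(" ".join(build_hobohm_command("pairs.txt", "kept.txt", cutoff=0.9)))
# hobohm --val sim -c 0.9 pairs.txt kept.txt
```

Only build_hobohm_command is called in the example, so the snippet runs without hobohm installed; run_hobohm would invoke the real program.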
