🐍🟡♦️🟦 pyHMMER

Cython bindings and Python interface to HMMER3.

🗺️ Overview

HMMER is a biological sequence analysis tool that uses profile hidden Markov models to search for sequence homologs. HMMER3 is maintained by members of the Eddy/Rivas Laboratory at Harvard University.

pyhmmer is a Python module, implemented using the Cython language, that provides bindings to HMMER3. It directly interacts with the HMMER internals, which has the following advantages over CLI wrappers (like hmmer-py):

  • single dependency: If your software or your analysis pipeline is distributed as a Python package, you can add pyhmmer as a dependency to your project, and stop worrying about the HMMER binaries being properly set up on the end user's machine.
  • no intermediate files: Everything happens in memory, in Python objects you have control over, making it easier to format your inputs to pass to HMMER without needing to write them to a file. Output retrieval is also done in memory, through instances of the pyhmmer.plan7.TopHits class.
  • no input formatting: The Easel object model is exposed in the pyhmmer.easel module, so you can build a Sequence object yourself to pass to the HMMER pipeline. This is useful if your sequences are already loaded in memory, for instance because you obtained them from another Python library (such as Pyrodigal or Biopython).
  • no output formatting: HMMER3 is notorious for its numerous output files and its fixed-width tabular output, which is hard to parse (even Bio.SearchIO.HmmerIO struggles with some sequences).
  • efficient: Using pyhmmer to launch hmmsearch on sequences and HMMs in disk storage is typically not slower than directly using the hmmsearch binary (see the Benchmarks section). pyhmmer.hmmsearch uses a different parallelisation strategy compared to the hmmsearch binary from HMMER, which helps get the most out of multiple CPUs.

This library is still a work-in-progress, and in a very experimental stage, but it should already pack enough features to run simple biological analyses involving hmmsearch.

🔧 Installing

pyhmmer can be installed from PyPI, which hosts some pre-built CPython wheels for x86-64 Linux, as well as the code required to compile from source with Cython:

$ pip install pyhmmer

Compilation for UNIX PowerPC is not tested in CI, but should work out of the box. Other architectures (e.g. Arm) and OSes (e.g. Windows) are not supported by HMMER.

A bioconda package is planned once this package leaves alpha status.

📖 Documentation

A complete API reference can be found in the online documentation, or directly from the command line using pydoc:

$ pydoc pyhmmer.easel
$ pydoc pyhmmer.plan7

💡 Example

Use pyhmmer to run hmmsearch, and obtain an iterable over TopHits that can be used for further sorting/querying in Python:

import pyhmmer

with pyhmmer.easel.SequenceFile("938293.PRJEB85.HG003687.faa") as file:
    alphabet = file.guess_alphabet()
    sequences = [seq.digitize(alphabet) for seq in file]

with pyhmmer.plan7.HMMFile("Pfam.hmm") as hmms:
    all_hits = list(pyhmmer.hmmsearch(hmms, sequences, cpus=4))

Processing happens in parallel using Python threads, and a TopHits object is yielded for every HMM passed in the input iterable. Note that for optimal performance, you should pass the number of physical cores to the cpus argument of the pyhmmer.hmmsearch function, as HMMER requires too many SIMD registers to benefit from hyperthreading.
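To pick a value for the cpus argument, the number of physical cores has to be found first. The following is a minimal sketch, not part of pyhmmer: the helper name physical_core_count is hypothetical, and it assumes the optional third-party psutil package for the physical count, falling back to the logical CPU count from the standard library when psutil is unavailable.

```python
import os


def physical_core_count():
    """Best-effort count of physical cores (hypothetical helper, not a pyhmmer API)."""
    try:
        # psutil is an optional third-party package; it can tell physical
        # cores apart from logical (hyperthreaded) ones.
        import psutil
    except ImportError:
        psutil = None

    if psutil is not None:
        count = psutil.cpu_count(logical=False)
        if count:  # may be None when the platform does not expose it
            return count

    # Fallback: logical CPUs, which may overcount on hyperthreaded machines.
    return os.cpu_count() or 1


cpus = physical_core_count()
print(cpus)
```

The resulting value can then be passed as the cpus argument, as in the example above.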

💭 Feedback

⚠️ Issue Tracker

Found a bug? Have an enhancement request? Head over to the GitHub issue tracker if you need to report or ask something. If you are filing a bug report, please include as much information as you can about the issue, and try to recreate the same bug in a simple, easily reproducible situation.

🏗️ Contributing

Contributions are more than welcome! See CONTRIBUTING.md for more details.

⏱️ Benchmarks

Benchmarks were run on an i7-8550U CPU running at 1.80 GHz, using a FASTA file containing 2100 protein sequences (tests/data/seqs/938293.PRJEB85.HG003687.faa) and a subset of the Pfam HMM library containing 2873 domains. Each command was run 20 times.

Command                      # CPUs  mean (s)  σ (ms)  min (s)  max (s)  Relative time (vs. fastest)
python -m pyhmmer hmmsearch  4       20.706    316     19.960   42.457   x1.00
python -m pyhmmer hmmsearch  2       24.076    842     22.289   21.118   x1.16
hmmsearch                    2       35.046    161     34.734   35.183   x1.69
hmmsearch                    4       37.721    78      37.605   37.847   x1.82
python -m pyhmmer hmmsearch  1       39.022    1346    36.081   40.644   x1.88
hmmsearch                    1       44.360    243     44.184   45.018   x2.14
hmmscan                      2       102.248   381     101.479  102.765  x4.93
hmmscan                      4       106.779   375     106.197  107.482  x5.15
hmmscan                      1       107.945   326     107.460  108.502  x5.21
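
The statistics in the table above can be gathered with a small timing harness. The following is a minimal sketch under stated assumptions, not the script used for these benchmarks: the function names benchmark and relative_to_fastest are hypothetical, the sample standard deviation is assumed for σ, and the last column is assumed to be each mean divided by the fastest mean.

```python
import statistics
import time


def benchmark(func, runs=20):
    """Call func `runs` times (at least 2) and summarise wall-clock timings."""
    timings = []
    for _ in range(runs):
        start = time.perf_counter()
        func()
        timings.append(time.perf_counter() - start)
    return {
        "mean_s": statistics.mean(timings),
        # Assumption: sample standard deviation, reported in milliseconds.
        "sigma_ms": statistics.stdev(timings) * 1000,
        "min_s": min(timings),
        "max_s": max(timings),
    }


def relative_to_fastest(means):
    """Express each mean runtime as a multiple of the fastest one."""
    fastest = min(means.values())
    return {name: round(mean / fastest, 2) for name, mean in means.items()}


# Mean runtimes (in seconds) copied from three rows of the table.
means = {
    "pyhmmer hmmsearch, 4 CPUs": 20.706,
    "hmmsearch, 4 CPUs": 37.721,
    "hmmscan, 1 CPU": 107.945,
}
print(relative_to_fastest(means))
```

With these three means, the ratios come out as 1.0, 1.82 and 5.21, matching the corresponding rows. To time an external command, func could wrap a subprocess.run call with the command line of your choice.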

⚖️ License

This library is provided under the MIT License. The HMMER3 and Easel code is available under the BSD 3-clause license. See vendor/hmmer/LICENSE and vendor/easel/LICENSE for more information.

This project is in no way affiliated with, sponsored, or otherwise endorsed by the original HMMER authors. It was developed by Martin Larralde during his PhD project at the European Molecular Biology Laboratory in the Zeller team.