Skip to main content

Cython bindings and Python interface to HMMER3.

Project description

🐍🟡♦️🟦 PyHMMER Stars

Cython bindings and Python interface to HMMER3.

Actions Coverage PyPI Bioconda AUR Wheel Python Versions Python Implementations License Source Mirror GitHub issues Docs Changelog Downloads DOI

🗺️ Overview

HMMER is a biological sequence analysis tool that uses profile hidden Markov models to search for sequence homologs. HMMER3 is developed and maintained by the Eddy/Rivas Laboratory at Harvard University.

pyhmmer is a Python package, implemented using the Cython language, that provides bindings to HMMER3. It directly interacts with the HMMER internals, which has the following advantages over CLI wrappers (like hmmer-py):

  • single dependency: If your software or your analysis pipeline is distributed as a Python package, you can add pyhmmer as a dependency to your project, and stop worrying about the HMMER binaries being properly setup on the end-user machine.
  • no intermediate files: Everything happens in memory, in Python objects you have control on, making it easier to pass your inputs to HMMER without needing to write them to a temporary file. Output retrieval is also done in memory, via instances of the pyhmmer.plan7.TopHits class.
  • no input formatting: The Easel object model is exposed in the pyhmmer.easel module, and you have the possibility to build a DigitalSequence object yourself to pass to the HMMER pipeline. This is useful if your sequences are already loaded in memory, for instance because you obtained them from another Python library (such as Pyrodigal or Biopython).
  • no output formatting: HMMER3 is notorious for its numerous output files and its fixed-width tabular output, which is hard to parse (even Bio.SearchIO.HmmerIO is struggling on some sequences).
  • efficient: Using pyhmmer to launch hmmsearch on sequences and HMMs in disk storage is typically as fast as directly using the hmmsearch binary (see the Benchmarks section). pyhmmer.hmmer.hmmsearch uses a different parallelisation strategy compared to the hmmsearch binary from HMMER, which can help getting the most of multiple CPUs when annotating smaller sequence databases.

This library is still a work-in-progress, and in an experimental stage, but it should already pack enough features to run biological analyses or workflows involving hmmsearch, hmmscan, nhmmer, phmmer, hmmbuild and hmmalign.

🔧 Installing

pyhmmer can be installed from PyPI, which hosts some pre-built CPython wheels for x86-64 Linux, as well as the code required to compile from source with Cython:

$ pip install pyhmmer

Compilation for UNIX PowerPC is not tested in CI, but should work out of the box. Other architectures (e.g. Arm) and OSes (e.g. Windows) are not supported by HMMER.

A Bioconda package is also available:

$ conda install -c bioconda pyhmmer

📖 Documentation

A complete API reference can be found in the online documentation, or directly from the command line using pydoc:

$ pydoc pyhmmer.easel
$ pydoc pyhmmer.plan7

💡 Example

Use pyhmmer to run hmmsearch to search for Type 2 PKS domains (t2pks.hmm) inside proteins extracted from the genome of Anaerococcus provencensis (938293.PRJEB85.HG003687.faa). This will produce an iterable over TopHits that can be used for further sorting/querying in Python. Processing happens in parallel using Python threads, and a TopHits object is yielded for every HMM passed in the input iterable.

import pyhmmer

with pyhmmer.easel.SequenceFile("pyhmmer/tests/data/seqs/938293.PRJEB85.HG003687.faa", digital=True) as seq_file:
    sequences = list(seq_file)

with pyhmmer.plan7.HMMFile("pyhmmer/tests/data/hmms/txt/t2pks.hmm") as hmm_file:
    for hits in pyhmmer.hmmsearch(hmm_file, sequences, cpus=4):
      print(f"HMM {hits.query_name.decode()} found {len(hits)} hits in the target sequences")

Have a look at more in-depth examples such as building a HMM from an alignment, analysing the active site of a hit, or fetching marker genes from a genome in the Examples page of the online documentation.

💭 Feedback

⚠️ Issue Tracker

Found a bug ? Have an enhancement request ? Head over to the GitHub issue tracker if you need to report or ask something. If you are filing in on a bug, please include as much information as you can about the issue, and try to recreate the same bug in a simple, easily reproducible situation.

🏗️ Contributing

Contributions are more than welcome! See CONTRIBUTING.md for more details.

⏱️ Benchmarks

Benchmarks were run on a i7-10710U CPU running @1.10GHz with 6 physical / 12 logical cores, using a FASTA file containing 4,489 protein sequences extracted from the genome of Escherichia coli (562.PRJEB4685) and the version 33.1 of the Pfam HMM library containing 18,259 domains. Commands were run 3 times on a warm SSD. Plain lines show the times for pressed HMMs, and dashed-lines the times for HMMs in text format.

Benchmarks

Raw numbers can be found in the benches folder. They suggest that phmmer should be run with the number of logical cores, while hmmsearch should be run with the number of physical cores (or less). A possible explanation for this observation would be that HMMER platform-specific code requires too many SIMD registers per thread to benefit from simultaneous multi-threading.

To read more about how PyHMMER achieves better parallelism than HMMER for many-to-many searches, have a look at the Performance page of the documentation.

🔍 See Also

Building a HMM from scratch? Then you may be interested in the pyfamsa package, providing bindings to FAMSA, a very fast multiple sequence aligner. In addition, you may want to trim alignments: in that case, consider pytrimal, which wraps trimAl 2.0.

If despite of all the advantages listed earlier, you would rather use HMMER through its CLI, this package will not be of great help. You can instead check the hmmer-py package developed by Danilo Horta at the EMBL-EBI.

⚖️ License

This library is provided under the MIT License. The HMMER3 and Easel code is available under the BSD 3-clause license. See vendor/hmmer/LICENSE and vendor/easel/LICENSE for more information.

This project is in no way affiliated, sponsored, or otherwise endorsed by the original HMMER authors. It was developed by Martin Larralde during his PhD project at the European Molecular Biology Laboratory in the Zeller team.

Project details


Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distributions

No source distribution files available for this release.See tutorial on generating distribution archives.

Built Distributions

pyhmmer_arm-0.7.5-cp311-cp311-macosx_11_0_arm64.whl (10.7 MB view details)

Uploaded CPython 3.11 macOS 11.0+ ARM64

pyhmmer_arm-0.7.5-cp310-cp310-macosx_11_0_arm64.whl (10.7 MB view details)

Uploaded CPython 3.10 macOS 11.0+ ARM64

pyhmmer_arm-0.7.5-cp39-cp39-macosx_11_0_arm64.whl (10.8 MB view details)

Uploaded CPython 3.9 macOS 11.0+ ARM64

pyhmmer_arm-0.7.5-cp38-cp38-macosx_11_0_arm64.whl (10.8 MB view details)

Uploaded CPython 3.8 macOS 11.0+ ARM64

File details

Details for the file pyhmmer_arm-0.7.5-cp311-cp311-macosx_11_0_arm64.whl.

File metadata

File hashes

Hashes for pyhmmer_arm-0.7.5-cp311-cp311-macosx_11_0_arm64.whl
Algorithm Hash digest
SHA256 719bc2c883f3ca01beccce4b0e46a4a5ff038e7e7e4804f22670efaff3b2ea37
MD5 a40c4e47414406cb9475479cd742fcd2
BLAKE2b-256 5c45208f8e7c93bfec45fd5c3f07cd58c474e95cc7ae11ffb56fc32ee893b866

See more details on using hashes here.

File details

Details for the file pyhmmer_arm-0.7.5-cp310-cp310-macosx_11_0_arm64.whl.

File metadata

File hashes

Hashes for pyhmmer_arm-0.7.5-cp310-cp310-macosx_11_0_arm64.whl
Algorithm Hash digest
SHA256 d62c32b78dbb6763e5e9051c39635e573541c962df623a7e6c7e32a68d46513b
MD5 e7a09f84bb55881a91e58165105172eb
BLAKE2b-256 ce3a1dec8e34f3d588768c09e1405ef900032b3d95365efd72a8a37bb0b2979e

See more details on using hashes here.

File details

Details for the file pyhmmer_arm-0.7.5-cp39-cp39-macosx_11_0_arm64.whl.

File metadata

File hashes

Hashes for pyhmmer_arm-0.7.5-cp39-cp39-macosx_11_0_arm64.whl
Algorithm Hash digest
SHA256 0e41474efb27cd239e50b1d9a0c34182351d795ea0010b8568a3f5437ed1649e
MD5 e18795114c04048e30875e78f4fabddc
BLAKE2b-256 1a69a77e2ef5e8bc927477a9cba67e90fc41cc37f8f96cd131a44501908ccf70

See more details on using hashes here.

File details

Details for the file pyhmmer_arm-0.7.5-cp38-cp38-macosx_11_0_arm64.whl.

File metadata

File hashes

Hashes for pyhmmer_arm-0.7.5-cp38-cp38-macosx_11_0_arm64.whl
Algorithm Hash digest
SHA256 d9fb49305069f176a4f3cd518d970dcd1f90a26ea41f955cbf59f5f74cc7145f
MD5 2b91704068d7160842203efe811970de
BLAKE2b-256 739316ac4f541a57b7aa021e50286e3334abbbbe638701c1b047cda3e715eeb7

See more details on using hashes here.

Supported by

AWS AWS Cloud computing and Security Sponsor Datadog Datadog Monitoring Fastly Fastly CDN Google Google Download Analytics Microsoft Microsoft PSF Sponsor Pingdom Pingdom Monitoring Sentry Sentry Error logging StatusPage StatusPage Status page