SonicParanoid: fast, easy and accurate orthology inference
SonicParanoid
A fast, accurate and easy to use orthology inference tool.
Description
SonicParanoid is a stand-alone software tool for the identification of orthologous relationships among multiple species. It is open-source software released under the GNU GENERAL PUBLIC LICENSE, Version 3.0 (GPLv3), and is implemented in Python3, Cython, and C++. It works on Linux and Mac OSX.
Fast and Scalable
SonicParanoid is able to infer the orthologs for dozens of prokaryotes in minutes, or hours for eukaryotes, using a desktop computer with 8 CPUs. This figure is much smaller when running on HPC servers with dozens of CPUs (e.g. <1h for the QfO benchmark). It is also highly scalable, as it inferred the orthologs for 2000 MAGs in only 1 day using 128 CPUs.
Accurate
SonicParanoid was tested using a benchmark proteome dataset from the Quest for Orthologs consortium, and the correctness of its predictions was evaluated using a standardized Orthology Benchmarking service.
SonicParanoid showed a balanced trade-off between precision and recall, with an accuracy comparable to those of well-established inference methods.
Easy to use
Thanks to its speed, accuracy, and usability SonicParanoid substantially relieves the difficulties of orthology inference for biologists who need to construct and maintain their own genomic datasets.
Installation
For more details on how to install and use SonicParanoid, go to its web-page:
http://iwasakilab.k.u-tokyo.ac.jp/sonicparanoid.
Citation
Salvatore Cosentino and Wataru Iwasaki (2018) SonicParanoid: fast, accurate and easy orthology inference.
Bioinformatics. Volume 35, Issue 1, 1 January 2019, Pages 149–151,
https://doi.org/10.1093/bioinformatics/bty631
Changelog
For a more complete changelog visit the release page on GitLab
2.0.0 (May 2, 2023)
This is a massive update which introduces a lot of new features and improvements.
SonicParanoid2 uses machine learning for faster and more comprehensive orthology inference.
Visit the web-page for more details.
- New: reduced execution time for all-vs-all alignments by 20~50% (depending on the dataset).
- New: domain-aware orthology inference
- Enhancement: you can now see the state of your run in real-time through status bars
- Breaking change: many parameters have been removed or added; check the web-page for more details.
- Breaking change: removed single-linkage clustering for OGs
1.3.8 (November 10, 2021)
- Summary: fixed some important issues related to Diamond introduced with version v1.3.7.
- Hot-fix: Missing ortholog table.
- Hot-fix: Error when using Diamond and index files.
- Others: The minimum required memory per thread was reduced to 1 gigabyte.
1.3.7 (November 8, 2021)
- Maintenance: upgraded to Diamond (v2.0.12)
- Breaking change: the ortholog tables do not have their own directory anymore. For example, for species 1 and 2 the ortholog table will be stored under /project/orthologs_db/1/table.1-2
- Breaking change: the ortholog matrixes are now stored under the directory '/project/ortholog_matrixes/'
- Enhancement: more efficient directory structure for the orthologs_db directory.
- Fix: Inconsistent OG counts with the same input dataset.
- Others: set default value for the --max-len-diff parameter to 0.75.
1.3.6 (September 17, 2021)
- Feature: BLAST can now be selected using the parameter --aln-tool
- Feature: Diamond (v2.0.11) can now be selected using the parameter --aln-tool
- Feature: added parameter --min-bitscore to set minimum bitscore for all-vs-all alignments (default is 40)
- Usability: ANACONDA should now be used for installation on MacOS (and on Linux where needed). Check the web-page for more details
- Enhancement: added support for Python 3.9
- Enhancement: retrained Adaboost model with new training data
- Maintenance: upgraded to MMseqs2 version 13-45111
- Fix: Throw an ERROR when empty files are input
- Fix: Wrong automatic project naming
- Breaking change: binaries (e.g., of MMSeqs) are now inside a single directory called software_packages
- Breaking change: the -ml parameter is set to 1 by default
- Breaking change: single linkage clustering was removed. The -slc parameter was accordingly removed
- Breaking change: the parameter --max-gene-per-sp was removed
- Others: minimum coverages for orthologs set to 20% and 20%
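The effect of a minimum-bitscore cutoff such as the one set by --min-bitscore can be illustrated with a minimal sketch. This is not SonicParanoid's code: the `Hit` record, the `filter_by_bitscore` function, and the choice of an inclusive threshold are all assumptions made for illustration. Only the default value of 40 comes from the note above.

```python
from typing import List, NamedTuple


class Hit(NamedTuple):
    """Hypothetical record for one all-vs-all alignment hit."""
    query: str
    target: str
    bitscore: float


def filter_by_bitscore(hits: List[Hit], min_bitscore: float = 40.0) -> List[Hit]:
    """Keep hits whose bitscore is at least min_bitscore (inclusive by assumption)."""
    return [h for h in hits if h.bitscore >= min_bitscore]


hits = [
    Hit("a1", "b1", 250.0),
    Hit("a2", "b7", 38.5),   # below the cutoff, dropped
    Hit("a3", "b2", 40.0),   # exactly at the cutoff, kept
]
kept = filter_by_bitscore(hits)
print([h.query for h in kept])  # ['a1', 'a3']
```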
1.3.5 (December 11, 2020)
- Enhancement: by default alignments are now compressed using the DEFLATE method in order to save storage space. The default compression level is 5 but it can be changed using the --compression-lev parameter.
- Enhancement: reduces the I/O operations.
- Usability: Added guide for the installation using CONDA to the web-page
- Usability: removed homebrew as a requirement on MacOS
- Usability: general improvements to the web-page
- Maintenance: added filetype as a dependency
- Fix: Execution error when using python 3.6
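As a rough illustration of DEFLATE compression at level 5, the sketch below uses Python's standard `zlib` module, which implements DEFLATE. The function names and the sample alignment rows are hypothetical, and the on-disk format SonicParanoid actually writes is not described here, so this shows only the general technique.

```python
import zlib


def compress_alignments(raw: bytes, level: int = 5) -> bytes:
    """Compress raw alignment text with DEFLATE (via zlib) at the given level."""
    return zlib.compress(raw, level)


def decompress_alignments(blob: bytes) -> bytes:
    """Restore the original bytes from a zlib-compressed blob."""
    return zlib.decompress(blob)


# Made-up, highly repetitive tab-separated rows standing in for alignment output.
raw = b"geneA\tgeneB\t98.5\t250\n" * 1000
blob = compress_alignments(raw)

# The round trip is lossless, and repetitive text shrinks considerably.
assert decompress_alignments(blob) == raw
print(len(raw), len(blob))
```

Higher levels (up to 9) trade CPU time for smaller output, while lower levels do the opposite.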
1.3.4 (July 25, 2020)
- Enhancement: execution is 5~10% faster when many small proteomes are given as input (e.g. > 1000)
- Enhancement: considerably reduced IO when generating the alignments
- Enhancement: when the available CPUs are more than the required alignment jobs these will be equally split between jobs instead of using 1 thread per job. This considerably reduces execution times when few big proteomes are in input, and many threads are available.
- Enhancement: more informative output from the command line
- Enhancement: output directories are now easier to browse even when many input files are provided
- Enhancement: MCL binaries automatically installed for Linux and MacOS
- Enhancement: warnings are shown only in debug mode
- Enhancement: prevent users from restarting a run using a different MMseqs sensitivity
- Enhancement: automatically remove incomplete alignments when restarting a run
- Maintenance: added wheel as a dependency and removed sh
- Maintenance: upgraded to MMseqs2 version 11-e1a1c
- Fix: Inconsistent results when using non-indexed target databases. Big thanks to Keito for providing the dataset.
- Fix: wrongly formatted execution times in the alignments stats file.
- Breaking change: alignments and ortholog tables are now organized into subdirectories, please check the web-page for details
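The thread-splitting enhancement listed above (sharing spare CPUs equally between alignment jobs instead of giving each job one thread) can be sketched as follows. The function name and the use of integer division are assumptions for illustration, not the tool's actual implementation.

```python
def threads_per_job(available_cpus: int, n_jobs: int) -> int:
    """Split CPUs equally between jobs, never going below 1 thread per job."""
    if available_cpus < 1 or n_jobs < 1:
        raise ValueError("available_cpus and n_jobs must be positive")
    # With more CPUs than jobs, each job gets an equal share;
    # otherwise every job falls back to a single thread.
    return max(1, available_cpus // n_jobs)


print(threads_per_job(32, 3))   # 10: few big proteomes, many threads
print(threads_per_job(8, 20))   # 1: more jobs than CPUs
```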
1.3.2 (April 23, 2020)
- Enhancement: Added support for Python 3.8
- Maintenance: Increased minimum version for packages, Cython(0.29); pandas(1.0); numpy(1.18); scikit-learn(0.22); scipy(1.2.1); mypy(0.720); biopython(1.73)
- Maintenance: Retrained prediction models using the latest version of scikit-learn (0.22)
- Fix: Too many open files error. Big thanks to Eva Deutekom
- Fix: Removed scikit-learn warnings
1.3.0 (November 26, 2019)
- Enhancement: SonicParanoid is much faster when using high sensitivity modes! Check the web-page
- Enhancement: run directory names embed information about the run settings
- Enhancement: generated temporary files are much smaller now
- Fix: error with only 2 input species. Big thanks to Benjamin Hume
- Fix: force overwriting of MMseqs2 binaries if the version is different from the supported one
- Usability: Tested on Arch-based Manjaro Linux
- Others: Big thanks to Shun Yamanouchi for providing some challenging datasets used for testing
- Maintenance: upgraded to MMseqs2 version 10-6d92c
1.2.6 (August 26, 2019)
- Fix: "too many files open" error which sometimes happened when using more than 20 threads
1.2.5 (August 7, 2019)
- Fix: Logical threads are considered instead of physical cores in the adjustment of the threads number
- Requirements: a minimum of 1.75 gigabytes per thread is required (the number of threads is automatically adjusted)
- Enhancement: added parameter --force-all-threads to bypass the check for minimum per-thread memory
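The per-thread memory check described for this release can be approximated with a short sketch. The function `adjust_threads`, its arguments, and the interpretation of a gigabyte as 1024**3 bytes are assumptions. Only the 1.75 gigabyte minimum and the existence of a bypass option come from the notes above.

```python
GIB = 1024 ** 3  # assumption: "gigabyte" is treated as 2**30 bytes


def adjust_threads(requested: int, available_bytes: int,
                   min_gb_per_thread: float = 1.75,
                   force_all_threads: bool = False) -> int:
    """Lower the thread count so each thread has the minimum memory available."""
    if force_all_threads:
        # Mirrors the idea of bypassing the check: use what was requested.
        return requested
    affordable = int(available_bytes // (min_gb_per_thread * GIB))
    return max(1, min(requested, affordable))


print(adjust_threads(16, 8 * GIB))                          # 4
print(adjust_threads(16, 8 * GIB, force_all_threads=True))  # 16
print(adjust_threads(2, 64 * GIB))                          # 2
```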
1.2.4 (July 14, 2019)
- Enhancement: Added control to avoid selecting a number of threads higher than the available physical CPU cores (big thanks to Shun Yamanouchi)
- Fix: Removed some scipy warnings, now shown only in debug mode (thanks to Alexie Papanicolaou)
- Requirements: psutil>=5.6.0 is now required
- Requirements: mypy>=0.701 is now required
- Requirements: at least Python 3.6 is now required
1.2.3 (June 7, 2019)
- Enhancement: some error messages are more informative (big thanks to Jeff Stein)
1.2.2 (May 13, 2019)
- Fix: solved a bug that prevented MCL from being compiled properly on some Linux distributions
- Info: source code migrated to GitLab
1.2.1 (May 10, 2019)
- Fix: solved bug related to random missing alignments
- Info: this issue was first described in here
1.2.0 (April 26, 2019)
- Change: Markov Clustering (MCL) is now used by default for the creation of ortholog groups
- Enhancement: the MCL inflation can be controlled through the parameter --inflation
- Enhancement: Output file with single-copy ortholog groups
- Feature: single-linkage clustering for ortholog groups creation through the --single-linkage parameter
- Enhancement: added secondary program to filter ortholog groups
- Info: type sonicparanoid-extract --help to see the list of options
- Enhancement: Filter ortholog groups by species ID
- Enhancement: Filter ortholog groups by species composition (e.g. only groups with a given number of species)
- Enhancement: Extract FASTA sequences of orthologs in selected groups
- Fix: The correct version of SonicParanoid is now shown in the help
- Others: General bug fixes and under-the-hood improvements
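To make the filtering ideas in this release concrete (single-copy ortholog groups and filtering by species composition), here is a minimal sketch. It does not use sonicparanoid-extract or its file formats. The in-memory layout, the group and gene IDs, and both function names are hypothetical.

```python
from typing import Dict, List

# Hypothetical layout: group ID -> {species ID -> genes of that species in the group}
Group = Dict[str, List[str]]


def is_single_copy(group: Group, species: List[str]) -> bool:
    """True if every listed species has exactly one gene in the group."""
    return all(len(group.get(sp, [])) == 1 for sp in species)


def filter_by_species_count(groups: Dict[str, Group], min_species: int) -> Dict[str, Group]:
    """Keep groups with genes from at least min_species species."""
    return {
        gid: g for gid, g in groups.items()
        if sum(1 for genes in g.values() if genes) >= min_species
    }


groups = {
    "OG1": {"sp1": ["a1"], "sp2": ["b1"], "sp3": ["c1"]},
    "OG2": {"sp1": ["a2", "a3"], "sp2": ["b2"], "sp3": []},
    "OG3": {"sp1": ["a4"], "sp2": [], "sp3": []},
}
species = ["sp1", "sp2", "sp3"]

single_copy = [gid for gid, g in groups.items() if is_single_copy(g, species)]
print(single_copy)                                # ['OG1']
print(list(filter_by_species_count(groups, 2)))   # ['OG1', 'OG2']
```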
1.1.2 (March, 2019)
- Enhancement: Filter ortholog groups by species ID
- Enhancement: Filter ortholog groups by species composition (e.g. only groups with a given number of species)
- Enhancement: Extract FASTA files corresponding to orthologs in selected groups
- Fix: The correct version of SonicParanoid is now shown in the help
1.1.1 (January 24, 2019)
- Enhancement: No restriction on file names
- Enhancement: No restriction on symbols used in FASTA headers
- Enhancement: Added file with genes that could not be inserted in any group (not orthologs)
- Enhancement: Added some statistics on the predicted ortholog groups
- Enhancement: Update runs are automatically detected
- Enhancement: Improved inference of in-paralogs
- Enhancement: The directory structure has been redesigned to better support run updates
1.0.14 (October 19, 2018)
- Enhancement: a warning is shown if non-protein sequences are given as input
- Enhancement: upgraded to MMseqs2 6-f5a1c
- Enhancement: SonicParanoid is now available through Bioconda
1.0.13 (September 18, 2018)
- Fix: allow FASTA headers containing the '@' symbol
1.0.12 (September 7, 2018)
- Improved accuracy
- Added new sensitivity mode (most-sensitive)
- Fix: internal input directory is wiped at every new run
- Fix: available disk space calculation
1.0.11 (August 7, 2018)
- Added new program (sonicparanoid-extract) to process output multi-species clusters
- Added the possibility to analyse only 2 proteomes
- Added support for Python3.7
- Python3 versions: 3.5, 3.6, 3.7
- Upgraded MMseqs2 (commit: a856ce, August 6, 2018)
1.0.9 (May 10, 2018)
- First public release
- Python3 versions: 3.4, 3.5, 3.6