Accurate host read removal
Hostile
Hostile accurately removes host sequences from short and long read (meta)genomes, consuming paired or unpaired fastq[.gz] input. Batteries are included: a human reference genome is downloaded when run for the first time. Hostile is precise by default, removing an order of magnitude fewer microbial reads than existing approaches while removing >99.5% of real human reads from 1000 Genomes Project samples. For the best possible retention of microbial reads, use an existing index masked against bacterial and/or viral genomes, or make your own using the built-in masking utility. Read headers can be replaced with integers (using --rename) for privacy and smaller FASTQs. Heavy lifting is done with fast existing tools (Minimap2/Bowtie2 and Samtools). Bowtie2 is the default aligner for short (paired) reads while Minimap2 is the default aligner for long reads. In benchmarks, bacterial Illumina reads were decontaminated at 32Mbp/s (210k reads/sec) and bacterial ONT reads at 22Mbp/s, using 8 alignment threads. By default, Hostile requires 4GB of RAM for decontaminating short reads and 13GB for long reads (Minimap2). Further information and benchmarks can be found in the paper and blog post. Please open an issue to report problems, or otherwise reach out for help and advice.
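As a rough illustration of what the benchmark figures above mean in practice, the following sketch converts them into an estimated run time. It assumes throughput stays constant regardless of dataset size and ignores start-up costs such as index loading. The function name and the 10 Gbp dataset are hypothetical.

```python
# Back-of-envelope run time estimate from the benchmark throughputs quoted above.
# Assumption: throughput is constant for any dataset size; start-up overhead
# (e.g. index loading) is ignored.

ILLUMINA_MBP_PER_S = 32  # bacterial Illumina reads, 8 alignment threads
ONT_MBP_PER_S = 22       # bacterial ONT reads, 8 alignment threads


def estimate_seconds(total_bases: int, mbp_per_s: float) -> float:
    """Return the estimated decontamination time in seconds."""
    return total_bases / (mbp_per_s * 1_000_000)


if __name__ == "__main__":
    dataset_bases = 10_000_000_000  # hypothetical 10 Gbp dataset
    print(f"Illumina: {estimate_seconds(dataset_bases, ILLUMINA_MBP_PER_S):.1f} s")
    print(f"ONT: {estimate_seconds(dataset_bases, ONT_MBP_PER_S):.1f} s")
    # Mean Illumina read length implied by 32 Mbp/s and 210k reads/s
    print(f"Implied mean read length: {32_000_000 / 210_000:.1f} bp")
```

Real run times depend on hardware, thread count and the proportion of host reads, so treat the result as an order-of-magnitude guide only.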
Reference genomes (indexes)
The default index human-t2t-hla comprises T2T-CHM13v2.0 and IPD-IMGT/HLA v3.51, and is downloaded automatically when running Hostile unless another index is specified. Slightly higher microbial sequence retention may be possible using the masked indexes listed below. The index human-t2t-hla-argos985 is masked against 985 reference grade bacterial genomes including common human pathogens, while human-t2t-hla.argos-bacteria-985_rs-viral-202401_ml-phage-202401 is further masked comprehensively against all known virus and phage genomes. The latter should be used when retention of viral sequences is a priority. To use a standard index, simply pass its name as the value of the --index argument, which takes care of downloading and caching the relevant index. Automatic download can be disabled using the --offline flag, and --index can also accept a path to a custom reference genome or Bowtie2 index. Object storage is provided by the ModMedMicro research unit at the University of Oxford.
Name | Composition | Date | Masked positions |
---|---|---|---|
human-t2t-hla (default) | T2T-CHM13v2.0 + IPD-IMGT/HLA v3.51 | 2023-07 | 0 (0%) |
human-t2t-hla-argos985 | human-t2t-hla masked with 150mers for 985 FDA-ARGOS bacterial genomes | 2023-07 | 317,973 (0.010%) |
human-t2t-hla.rs-viral-202401_ml-phage-202401 | human-t2t-hla masked with 150mers for 18,719 RefSeq viral and 26,928 Millard Lab phage genomes | 2024-01 | 1,172,993 (0.037%) |
human-t2t-hla.argos-bacteria-985_rs-viral-202401_ml-phage-202401 | human-t2t-hla masked with 150mers for 985 FDA-ARGOS bacterial, 18,719 RefSeq viral, and 26,928 Millard Lab phage genomes | 2024-01 | 1,473,260 (0.046%) |
human-t2t-hla-argos985-mycob140 | human-t2t-hla masked with 150mers for 985 FDA-ARGOS bacterial & 140 mycobacterial genomes | 2023-07 | 319,752 (0.010%) |
Performance of human-t2t-hla and human-t2t-hla-argos985-mycob140 was evaluated in the paper.
Install
Installation with conda/mamba or Docker is recommended due to non-Python dependencies (Bowtie2, Minimap2, Samtools and Bedtools). Hostile is tested with Ubuntu Linux 22.04, macOS 12, and under WSL for Windows.
Conda/mamba
conda create -y -n hostile -c conda-forge -c bioconda hostile
conda activate hostile
Docker
wget https://raw.githubusercontent.com/bede/hostile/main/Dockerfile
docker build . --platform linux/amd64
A Biocontainer image is also available, but beware that this often lags behind the latest released version.
Index installation (optional)
Hostile automatically downloads and caches the default index human-t2t-hla when run for the first time, meaning that there is no need to download an index in advance. Nevertheless:
- To download and cache the default index (human-t2t-hla), run hostile fetch
- To list available indexes, run hostile fetch --list
- To download and cache another standard index, run e.g. hostile fetch --name human-t2t-hla-argos985
- To use a custom genome (made with e.g. hostile mask), run hostile clean with --index path/to/genome.fa (minimap2) or --index path/to/index (without file extensions; Bowtie2)
- To change where indexes are stored, set the environment variable HOSTILE_CACHE_DIR to a directory of your choice. Run hostile fetch --list to verify.
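When driving these steps from a script, the commands and the cache location can be assembled programmatically. The sketch below only builds the argument list and environment using the options listed above; the helper names and the example directory are hypothetical, and nothing is executed.

```python
import os


def hostile_env(cache_dir: str) -> dict:
    """Copy the current environment and point HOSTILE_CACHE_DIR at cache_dir."""
    env = dict(os.environ)
    env["HOSTILE_CACHE_DIR"] = cache_dir
    return env


def fetch_command(name=None, list_only=False) -> list:
    """Build a `hostile fetch` command from the options documented above."""
    cmd = ["hostile", "fetch"]
    if list_only:
        cmd.append("--list")
    elif name is not None:
        cmd += ["--name", name]
    return cmd


if __name__ == "__main__":
    env = hostile_env("/data/hostile-indexes")  # hypothetical directory
    print("HOSTILE_CACHE_DIR=" + env["HOSTILE_CACHE_DIR"])
    print(" ".join(fetch_command(name="human-t2t-hla-argos985")))
    print(" ".join(fetch_command(list_only=True)))
```

With Hostile installed, passing the result to subprocess.run(cmd, env=env, check=True) would run the command with the chosen cache directory.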
Command line usage
$ hostile clean -h
usage: hostile clean [-h] --fastq1 FASTQ1 [--fastq2 FASTQ2] [--aligner {bowtie2,minimap2,auto}] [--index INDEX]
[--invert] [--rename] [--reorder] [--out-dir OUT_DIR] [--threads THREADS]
[--aligner-args ALIGNER_ARGS] [--force] [--offline] [--debug]
Remove reads aligning to an index from fastq[.gz] input files
options:
-h, --help show this help message and exit
--fastq1 FASTQ1 path to forward fastq[.gz] file
--fastq2 FASTQ2 optional path to reverse fastq[.gz] file
(default: None)
--aligner {bowtie2,minimap2,auto}
alignment algorithm. Default is Bowtie2 (paired reads) & Minimap2 (unpaired reads)
(default: auto)
--index INDEX name of standard index or path to custom genome/index
(default: human-t2t-hla)
--invert keep only reads aligning to the target genome (and their mates if applicable)
(default: False)
--rename replace read names with incrementing integers
(default: False)
--reorder ensure deterministic output order
(default: False)
--out-dir OUT_DIR path to output directory
(default: /Users/bede/Research/Git/hostile)
--threads THREADS number of alignment threads. A sensible default is chosen automatically
(default: 5)
--aligner-args ALIGNER_ARGS
additional arguments for alignment
(default: )
--force overwrite existing output files
(default: False)
--offline disable automatic index download
(default: False)
--debug show debug messages
(default: False)
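To make the effect of the --rename option in the help text above concrete, here is a minimal sketch of replacing FASTQ read names with incrementing integers. It is not Hostile's implementation: the exact header format Hostile writes is an assumption here (plain integers starting at 1), and the function name is hypothetical.

```python
def rename_fastq(lines):
    """Yield FASTQ lines with each header replaced by an incrementing integer.

    Assumes well-formed four-line records (header, sequence, '+', quality).
    Assumption: numbering starts at 1 and headers carry nothing else.
    """
    for i, line in enumerate(lines):
        if i % 4 == 0:
            # First line of each record is the header
            yield f"@{i // 4 + 1}\n"
        else:
            yield line


if __name__ == "__main__":
    records = [
        "@instrument:run:flowcell:1:1101:1000:2000 1:N:0:ATCACG\n",
        "ACGT\n",
        "+\n",
        "IIII\n",
        "@instrument:run:flowcell:1:1101:1000:2001 1:N:0:ATCACG\n",
        "TTGA\n",
        "+\n",
        "IIII\n",
    ]
    print("".join(rename_fastq(records)), end="")
```

Original headers can embed instrument, run and flowcell identifiers, which is why replacing them helps both privacy and file size.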
Short reads, default index
$ hostile clean --fastq1 human_1_1.fastq.gz --fastq2 human_1_2.fastq.gz
INFO: Hostile version 1.0.0. Mode: paired short read (Bowtie2)
INFO: Found cached standard index human-t2t-hla
INFO: Cleaning…
INFO: Cleaning complete
[
    {
        "version": "1.0.0",
        "aligner": "bowtie2",
        "index": "human-t2t-hla",
        "options": [],
        "fastq1_in_name": "human_1_1.fastq.gz",
        "fastq1_in_path": "/Users/bede/human_1_1.fastq.gz",
        "fastq1_out_name": "human_1_1.clean_1.fastq.gz",
        "fastq1_out_path": "/Users/bede/human_1_1.clean_1.fastq.gz",
        "reads_in": 2,
        "reads_out": 0,
        "reads_removed": 2,
        "reads_removed_proportion": 1.0,
        "fastq2_in_name": "human_1_2.fastq.gz",
        "fastq2_in_path": "/Users/bede/human_1_2.fastq.gz",
        "fastq2_out_name": "human_1_2.clean_2.fastq.gz",
        "fastq2_out_path": "/Users/bede/human_1_2.clean_2.fastq.gz"
    }
]
Short reads, masked index, save log
$ hostile clean --fastq1 human_1_1.fastq.gz --fastq2 human_1_2.fastq.gz --index human-t2t-hla-argos985 > log.json
INFO: Hostile version 1.0.0. Mode: paired short read (Bowtie2)
INFO: Found cached standard index human-t2t-hla
INFO: Cleaning…
INFO: Cleaning complete
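A saved log can then be inspected with any JSON parser. The following sketch summarises per-sample read removal using field names taken from the example log shown earlier; the sample text is a trimmed copy of that log and the function name is hypothetical.

```python
import json

# Trimmed copy of the paired short-read log shown earlier
SAMPLE_LOG = """
[
    {
        "version": "1.0.0",
        "aligner": "bowtie2",
        "index": "human-t2t-hla",
        "fastq1_in_name": "human_1_1.fastq.gz",
        "reads_in": 2,
        "reads_out": 0,
        "reads_removed": 2,
        "reads_removed_proportion": 1.0
    }
]
"""


def summarise(log_text: str) -> list:
    """Return one summary line per entry in a Hostile JSON log."""
    summaries = []
    for entry in json.loads(log_text):
        summaries.append(
            f"{entry['fastq1_in_name']}: removed {entry['reads_removed']} of "
            f"{entry['reads_in']} reads ({entry['reads_removed_proportion']:.1%})"
        )
    return summaries


if __name__ == "__main__":
    # For a saved log, use: summarise(Path("log.json").read_text())
    for line in summarise(SAMPLE_LOG):
        print(line)
```

Because the log is a list, the same loop would handle a log containing more than one entry.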
Short unpaired reads, save log
By default, single fastqs are assumed to be long reads. Override this by specifying --aligner bowtie2 when decontaminating unpaired short reads.
$ hostile clean --aligner bowtie2 --fastq1 tests/data/human_1_1.fastq.gz > log.json
INFO: Hostile version 1.0.0. Mode: short read (Bowtie2)
INFO: Found cached standard index human-t2t-hla
INFO: Cleaning…
INFO: Cleaning complete
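The default aligner selection described above can be summarised as a small decision function. This is an illustrative sketch of the documented behaviour, not Hostile's own code, and the function name is hypothetical.

```python
from typing import Optional


def choose_aligner(fastq2: Optional[str] = None, aligner: str = "auto") -> str:
    """Mirror the documented default: Bowtie2 for paired reads, Minimap2 otherwise."""
    if aligner not in {"auto", "bowtie2", "minimap2"}:
        raise ValueError(f"unknown aligner: {aligner}")
    if aligner != "auto":
        return aligner  # an explicit choice always wins
    # In auto mode, a second fastq implies paired short reads
    return "bowtie2" if fastq2 is not None else "minimap2"


if __name__ == "__main__":
    print(choose_aligner())                       # single fastq, auto
    print(choose_aligner("reads_2.fastq.gz"))     # paired fastqs, auto
    print(choose_aligner(aligner="bowtie2"))      # unpaired short reads
```

The third call corresponds to the unpaired short read case, where the aligner has to be set explicitly because a single fastq is otherwise treated as long reads.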
Long reads
$ hostile clean --fastq1 tests/data/tuberculosis_1_1.fastq.gz
INFO: Hostile version 1.0.0. Mode: long read (Minimap2)
INFO: Found cached standard index human-t2t-hla
INFO: Cleaning…
INFO: Cleaning complete
[
    {
        "version": "1.0.0",
        "aligner": "minimap2",
        "index": "human-t2t-hla",
        "options": [],
        "fastq1_in_name": "tuberculosis_1_1.fastq.gz",
        "fastq1_in_path": "/Users/bede/Research/Git/hostile/tests/data/tuberculosis_1_1.fastq.gz",
        "fastq1_out_name": "tuberculosis_1_1.clean.fastq.gz",
        "fastq1_out_path": "/Users/bede/Research/Git/hostile/tuberculosis_1_1.clean.fastq.gz",
        "reads_in": 1,
        "reads_out": 1,
        "reads_removed": 0,
        "reads_removed_proportion": 0.0
    }
]
Python usage
from pathlib import Path
from hostile.lib import clean_fastqs, clean_paired_fastqs

# Long reads, defaults
clean_fastqs(
    fastqs=[Path("reads.fastq.gz")],
)

# Paired short reads, various options, capture log
log = clean_paired_fastqs(
    fastqs=[(Path("reads_1.fastq.gz"), Path("reads_2.fastq.gz"))],
    index="human-t2t-hla-argos985",
    out_dir=Path("decontaminated-reads"),
    rename=True,
    force=True,
    threads=4
)
print(log)
Masking reference genomes
The mask subcommand makes it easy to create custom-masked reference genomes and achieve maximum retention of specific target organisms:
hostile mask human.fasta lots-of-bacterial-genomes.fasta --threads 8
You may wish to use one of the existing reference genomes as a starting point. Masking uses Minimap2 to align 150mers of the supplied target genomes with the reference genome, and bedtools to mask all aligned regions with N. Both a masked genome (for Minimap2) and a masked Bowtie2 index are created.
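The masking idea can be illustrated with a toy example on short strings. This sketch is a deliberate simplification and not how hostile mask works internally: exact forward-strand substring matching stands in for Minimap2 alignment, in-memory string editing stands in for bedtools, and the function name and sequences are hypothetical.

```python
def mask_reference(reference: str, targets: list, k: int = 150) -> str:
    """Replace every reference position covered by a target k-mer with N.

    Toy stand-in: exact forward-strand matching replaces Minimap2 alignment,
    and string editing replaces bedtools.
    """
    # Collect all k-mers present in the target genomes
    kmers = set()
    for target in targets:
        kmers.update(target[i:i + k] for i in range(len(target) - k + 1))

    masked = list(reference)
    for start in range(len(reference) - k + 1):
        # Compare against the original reference, not the partially masked copy
        if reference[start:start + k] in kmers:
            masked[start:start + k] = "N" * k
    return "".join(masked)


if __name__ == "__main__":
    host = "TTTTACGTGGGG"   # hypothetical host sequence
    microbe = "CCACGTCC"    # hypothetical target genome sharing ACGT
    result = mask_reference(host, [microbe], k=4)
    print(result)
    print(f"Masked positions: {result.count('N')}")
```

Reads from the target organism that would otherwise align to the shared region no longer match the masked reference, which is how masking improves microbial read retention.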
Limitations
- Hostile prioritises retaining microbial sequences above discarding host sequences. If you strive to remove every last human sequence, other approaches may serve you better.
- Performance is not always improved by using all available CPU cores. A sensible default is therefore chosen automatically at runtime based on the number of available CPU cores.
- Minimap2 has an overhead of 30-90s for human genome indexing prior to starting decontamination. Surprisingly, loading a prebuilt index is not significantly faster. I hope to mitigate this in a future release.
Citation
Bede Constantinides, Martin Hunt, Derrick W Crook, Hostile: accurate decontamination of microbial host sequences, Bioinformatics, 2023; btad728, https://doi.org/10.1093/bioinformatics/btad728
@article{10.1093/bioinformatics/btad728,
author = {Constantinides, Bede and Hunt, Martin and Crook, Derrick W},
title = {Hostile: accurate decontamination of microbial host sequences},
journal = {Bioinformatics},
volume = {39},
number = {12},
pages = {btad728},
year = {2023},
month = {12},
issn = {1367-4811},
doi = {10.1093/bioinformatics/btad728},
url = {https://doi.org/10.1093/bioinformatics/btad728},
eprint = {https://academic.oup.com/bioinformatics/article-pdf/39/12/btad728/54850422/btad728.pdf},
}
Development install
git clone https://github.com/bede/hostile.git
cd hostile
conda env create -y -f environment.yml
conda activate hostile
pip install --editable '.[dev]'
pytest