hifi_trimmer
hifi_trimmer is a command-line tool for filtering and trimming extraneous adapter hits
from a HiFi read set using a BLAST search against a fasta file of adapter sequences. It is
designed to be highly configurable, with per-adapter settings to determine actions if
the adapter is found at the ends of a read or in the middle. To improve reproducibility,
the primary output of the tool is a BED file that describes the region of each read to be
excluded. The tool also includes a command to filter the reads to disk using the produced
BED file.
The polars backend used for BLAST file processing should respect the number of cores
allocated by your scheduler. If it does not, the number of threads can be limited by
setting the environment variable POLARS_MAX_THREADS to an integer before running the
software. The number of threads used by the bgzip backend for writing compressed
BED files and the filtered FASTA files can be adjusted using the command-line option
--threads.
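As a minimal sketch of how the two thread settings fit together, the snippet below builds a process_blast command line and an environment with POLARS_MAX_THREADS set. The helper name `build_run` and the file names are hypothetical; only the environment variable, the subcommand and the --threads option come from the text above.

```python
import os
import shlex


def build_run(blastout, adapter_yaml, polars_threads, bgzip_threads):
    """Return (argv, env) for a process_blast run with explicit thread limits.

    build_run is a hypothetical helper, not part of hifi_trimmer.
    """
    env = dict(os.environ)
    # POLARS_MAX_THREADS has to be in the environment before the tool starts.
    env["POLARS_MAX_THREADS"] = str(polars_threads)
    argv = [
        "hifi_trimmer", "process_blast",
        "--threads", str(bgzip_threads),  # threads for bgzip compression
        blastout, adapter_yaml,
    ]
    return argv, env


argv, env = build_run("blastout.gz", "adapters.yaml",
                      polars_threads=4, bgzip_threads=2)
print("POLARS_MAX_THREADS=" + env["POLARS_MAX_THREADS"], shlex.join(argv))
# To launch the tool: subprocess.run(argv, env=env, check=True)
```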
Installation
pip install hifi_trimmer
Usage
Usage: hifi_trimmer [OPTIONS] COMMAND [ARGS]...
Main entry point for the tool.
Options:
--version Show the version and exit.
--help Show this message and exit.
Commands:
filter_bam Filter the reads stored in a BAM file using the...
process_blast Processes the input blastout file according to the...
To process a BLAST output TSV into a BED file:
Usage: hifi_trimmer process_blast [OPTIONS] BLASTOUT ADAPTER_YAML
Processes the input blastout file according to the adapter yaml key.
BLASTOUT: tabular file resulting from a BLAST query of a readset against a
BLAST database of adapter sequences, run with -outfmt "6 std qlen". If the
qlen column is missing, lengths can be calculated by passing the --bam
option.
ADAPTER_YAML: yaml file containing a list with the following fields per
adapter:
- name: (name of adapter. can be a regular expression)
discard_middle: True/False (discard read if adapter found in middle)
discard_end: True/False (discard read if adapter found in end)
trim_end: True/False (trim read if adapter found in end)
middle_pident: int (minimum pident required to identify adapter in middle of read)
middle_length: int (minimum match length required to identify adapter in middle of read)
end_pident: int (minimum pident required to identify adapter in end window)
end_length: int (minimum match length required to identify adapter in end window)
Output: By default, writes bgzipped BED to [prefix].bed.gz, and a JSON
summary file with raw counts of adapter hits detected, counts identified
after processing, and the total length of removed sequences per adapter to
[prefix].summary.json.
Options:
-p, --prefix TEXT Output prefix for results. Defaults to the
basename of the blastout if not provided.
-ml, --min_length_after_trimming INTEGER
Minimum length of a read after trimming the
ends in order not to be discarded [default:
300]
-el, --end_length INTEGER Window size at either end of the read to be
considered as 'ends' for searching
[default: 150]
-hf, --hits Write the hits identified using the given
adapter specifications to TSV. The format is
standard BLAST outfmt 6 with the following
extra columns: read_length (int), discard
(bool), trim_l (bool), trim_r (bool)
--no-summary Skip writing the JSON summary with the number
of hits for each adapter
-t, --threads INTEGER Number of threads to use for compression
[default: 1]
--help Show this message and exit.
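To make the input format and the end window concrete, here is a rough sketch that parses one line of -outfmt "6 std qlen" output and labels the hit as lying at an end or in the middle of the read. The column order is the standard BLAST tabular layout followed by qlen. The classification rule (a hit counts as an end hit only if it lies entirely inside the window) and the names `Hit`, `parse_hit` and `location` are assumptions for illustration, not the tool's actual logic.

```python
from typing import NamedTuple


class Hit(NamedTuple):
    qseqid: str    # read name
    sseqid: str    # adapter name
    pident: float  # percent identity
    length: int    # alignment length
    qstart: int    # 1-based start on the read
    qend: int      # 1-based inclusive end on the read
    qlen: int      # read length


def parse_hit(line):
    """Parse one tab-separated line of BLAST outfmt '6 std qlen'."""
    f = line.rstrip("\n").split("\t")
    # std: qseqid sseqid pident length mismatch gapopen
    #      qstart qend sstart send evalue bitscore, then qlen
    return Hit(f[0], f[1], float(f[2]), int(f[3]),
               int(f[6]), int(f[7]), int(f[12]))


def location(hit, end_length=150):
    """Assumed rule: an end hit lies entirely within the end window."""
    lo, hi = min(hit.qstart, hit.qend), max(hit.qstart, hit.qend)
    if hi <= end_length:
        return "left_end"
    if lo > hit.qlen - end_length:
        return "right_end"
    return "middle"


line = "read1\tNGB00972.1\t100.000\t45\t0\t0\t1\t45\t1\t45\t1e-10\t84.2\t12000"
hit = parse_hit(line)
print(hit.sseqid, location(hit))
```

The same hit placed at positions 5000 to 5044 of this read would be labelled "middle" under this rule, and the middle_pident and middle_length thresholds would then apply instead of the end thresholds.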
To filter a BAM file using the BED file:
Usage: hifi_trimmer filter_bam [OPTIONS] BAM BED OUTFILE
Filter the reads stored in a BAM file using the appropriate BED file
produced by process_blast and write to a bgzipped fasta file.
BAM: BAM file in which to filter reads
BED: BED file describing regions of the read set to exclude.
OUTFILE: File to write the filtered reads to (bgzipped).
Options:
-f, --fastq Write FASTQ instead of FASTA
-p, --preserve-sam-tags Preserve SAM tags in the output FASTX headers. Equivalent to -t when using samtools fastq.
-t, --threads INTEGER Number of threads to use for compression [default:
1]
--help Show this message and exit.
Example
First, BLAST your reads against an adapter database:
blastn -query <(samtools fasta /path/to/bam) \
-db /path/to/adapter/blast/db \
-reward 1 -penalty -5 -gapopen 3 -gapextend 3 \
-dust no -soft_masking true -evalue 700 \
-searchsp 1750000000000 \
-outfmt "6 std qlen" |\
bgzip > blastout.gz
To create a BED file, you then need to create a YAML file describing the actions to take for each adapter. The adapter name can be a regular expression, but note that each adapter name in the BLAST file must match only one entry in the YAML.
- adapter: "^NGB00972"  # regular expression matching adapter names
  discard_middle: True  # discard read if adapter is found in the middle
  discard_end: False    # discard read if adapter found at end
  trim_end: True        # trim read if adapter is found at end (overridden by discard choice)
  middle_pident: 95     # minimum percent identity for a match in the middle of the read
  middle_length: 44     # minimum match length for a match in the middle of the read
  end_pident: 90        # minimum percent identity for a match at the end of the read
  end_length: 18        # minimum match length for a match at the end of the read
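Because each adapter name in the BLAST output must match only one YAML entry, it can be worth checking the patterns before a run. The sketch below is one way to do that. The function name `match_adapters` is hypothetical, and the use of `re.search` and the handling of unmatched names (mapped to None) are assumptions, since the matching semantics used by the tool are not stated here.

```python
import re


def match_adapters(adapter_names, patterns):
    """Map each adapter name to the one pattern it matches.

    Raises ValueError if a name matches more than one pattern.
    Names matching no pattern map to None (an assumption).
    """
    assigned = {}
    for name in adapter_names:
        matches = [p for p in patterns if re.search(p, name)]
        if len(matches) > 1:
            raise ValueError(f"{name!r} matches several entries: {matches}")
        assigned[name] = matches[0] if matches else None
    return assigned


# Hypothetical adapter names and patterns, for illustration only.
names = ["NGB00972.1", "NGB00973.1"]
assigned = match_adapters(names, ["^NGB00972", "^NGB00973"])
print(assigned)
```

With overlapping patterns such as "^NGB" and "^NGB00972", the same call raises ValueError, which flags the ambiguity before any reads are processed.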
Then run process_blast to generate a BED file:
hifi_trimmer process_blast /path/to/blastout.gz /path/to/yaml
Then filter the BAM file using the BED file:
hifi_trimmer filter_bam /path/to/bam /path/to/bed /path/to/final/fasta.gz
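To illustrate what excluding BED regions from a read means, the following sketch removes a set of intervals from a sequence and returns the segments that remain. It assumes standard BED coordinates (0-based, half-open). The function `apply_exclusions` is hypothetical and is not how filter_bam is implemented. In particular, how a fully discarded read is encoded in the BED file is an assumption (here, a region spanning the whole read).

```python
def apply_exclusions(sequence, regions):
    """Return the segments of `sequence` not covered by any region.

    `regions` is a list of (start, end) pairs in BED coordinates
    (0-based, half-open). Overlapping regions are tolerated.
    """
    kept = []
    pos = 0
    for start, end in sorted(regions):
        if start > pos:
            kept.append(sequence[pos:start])
        pos = max(pos, end)
    if pos < len(sequence):
        kept.append(sequence[pos:])
    return kept


# Toy read: 4 adapter bases, 10 bases of insert, 3 adapter bases.
read = "AAAA" + "C" * 10 + "TTT"
trimmed = apply_exclusions(read, [(0, 4), (14, 17)])
print(trimmed)
```

A region covering the full read, such as (0, 17) here, leaves no segments, which corresponds to discarding the read.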