hifi_trimmer

hifi_trimmer is a command-line tool for filtering and trimming extraneous adapter hits from a HiFi read set using a BLAST search against a fasta file of adapter sequences. It is designed to be highly configurable, with per-adapter settings to determine actions if the adapter is found at the ends of a read or in the middle. To improve reproducibility, the primary output of the tool is a BED file that describes the region of each read to be excluded. The tool also includes a command to filter the reads to disk using the produced BED file.

The polars backend for BLAST file processing should respect the number of cores set by your scheduler. If it does not, set the environment variable POLARS_MAX_THREADS to an integer before running the software. The number of threads used by the bgzip backend, which writes the compressed BED files and the filtered FASTA files, can be set with the command-line option --threads.
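As a minimal sketch of the two thread controls described above, the following Python snippet builds a command line and an environment for a process_blast run without executing anything. The helper name build_run and the file paths are hypothetical; only POLARS_MAX_THREADS and --threads come from the text.

```python
import os


def build_run(blastout, adapter_yaml, polars_threads, bgzip_threads):
    """Return (command, environment) for a process_blast run; nothing is executed."""
    # Copy the current environment and cap the polars thread pool.
    env = dict(os.environ)
    env["POLARS_MAX_THREADS"] = str(polars_threads)
    # --threads only controls the bgzip compression threads.
    cmd = [
        "hifi_trimmer", "process_blast",
        "--threads", str(bgzip_threads),
        blastout, adapter_yaml,
    ]
    return cmd, env


# Hypothetical input paths, for illustration only.
cmd, env = build_run("blastout.gz", "adapters.yaml", polars_threads=4, bgzip_threads=2)
print(" ".join(cmd))
```

The pair can then be handed to subprocess.run(cmd, env=env, check=True) on a machine where hifi_trimmer is installed.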

Installation

git clone git@github.com:sanger-tol/hifi-trimmer.git
cd hifi-trimmer
pip install .

Usage

Usage: hifi_trimmer [OPTIONS] COMMAND [ARGS]...

  Main entry point for the tool.

Options:
  --version  Show the version and exit.
  --help     Show this message and exit.

Commands:
  filter_bam     Filter the reads stored in a BAM file using the...
  process_blast  Processes the input blastout file according to the...

To convert a BLAST output TSV into a BED file:

Usage: hifi_trimmer process_blast [OPTIONS] BLASTOUT ADAPTER_YAML

  Processes the input blastout file according to the adapter yaml key.

  BLASTOUT: tabular file resulting from a BLAST query of a readset against a
  BLAST database of adapter sequences, run with -outfmt "6 std qlen". If the
  qlen column is missing, lengths can be calculated by passing the --bam
  option.

  ADAPTER_YAML: yaml file containing a list with the following fields per
  adapter:

  - name: (name of adapter. can be a regular expression)
    discard_middle: True/False (discard read if adapter found in middle)
    discard_end: True/False (discard read if adapter found in end)
    trim_end: True/False (trim read if adapter found in end)
    middle_pident: int (minimum pident required to identify adapter in middle of read)
    middle_length: int (minimum match length required to identify adapter in middle of read)
    end_pident: int (minimum pident required to identify adapter in end window)
    end_length: int (minimum match length required to identify adapter in end window)

  Output: By default, writes bgzipped BED to [prefix].bed.gz, and a JSON
  summary file with raw counts of adapter hits detected, counts identified
  after processing, and the total length of removed sequences per adapter to
  [prefix].summary.json.

Options:
  -p, --prefix TEXT               Output prefix for results. Defaults to the
                                  basename of the blastout if not provided.
  -ml, --min_length_after_trimming INTEGER
                                  Minimum length of a read after trimming the
                                  ends in order not to be discarded  [default:
                                  300]
  -el, --end_length INTEGER       Window size at either end of the read to be
                                  considered as 'ends' for searching
                                  [default: 150]
  -hf, --hits                     Write the hits identified using the given
                                  adapter specifications to TSV. The format is
                                  standard BLAST outfmt 6 with the following
                                  extra columns: read_length (int), discard
                                  (bool), trim_l (bool), trim_r (bool)
  --no-summary                    Skip writing a summary JSON with the number
                                  of hits for each adapter
  -t, --threads INTEGER           Number of threads to use for compression
                                  [default: 1]
  --help                          Show this message and exit.
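The options above distinguish hits in the end windows from hits in the middle of a read, each with its own thresholds. The sketch below shows one plausible reading of that rule; it is an assumption made for illustration, not the tool's actual implementation. Note that the command-line --end_length is a window size, while the YAML end_length is a minimum match length, so the sketch calls the window end_window to keep them apart. The function names are hypothetical, and treating a hit that straddles the window boundary as a middle hit is also an assumption.

```python
def classify_hit(qstart, qend, qlen, end_window=150):
    """Classify a hit by its query coordinates (1-based, inclusive, as BLAST reports them)."""
    start, end = min(qstart, qend), max(qstart, qend)
    if end <= end_window:
        return "left_end"
    if start > qlen - end_window:
        return "right_end"
    # Assumption: anything not fully inside an end window counts as a middle hit.
    return "middle"


def hit_passes(hit, adapter, end_window=150):
    """Return (location, passed) using the per-adapter thresholds for that location."""
    location = classify_hit(hit["qstart"], hit["qend"], hit["qlen"], end_window)
    if location == "middle":
        min_pident, min_length = adapter["middle_pident"], adapter["middle_length"]
    else:
        min_pident, min_length = adapter["end_pident"], adapter["end_length"]
    passed = hit["pident"] >= min_pident and hit["length"] >= min_length
    return location, passed


# Threshold values taken from the example adapter entry in this README.
adapter = {"middle_pident": 95, "middle_length": 44, "end_pident": 90, "end_length": 18}
hit = {"qstart": 1, "qend": 20, "qlen": 15000, "pident": 92.0, "length": 20}
print(hit_passes(hit, adapter))
```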

To filter a BAM file using the BED file:

Usage: hifi_trimmer filter_bam [OPTIONS] BAM BED OUTFILE

  Filter the reads stored in a BAM file using the appropriate BED file
  produced by process_blast and write to a bgzipped fasta file.

  BAM: BAM file in which to filter reads
  BED: BED file describing regions of the read set to exclude.
  OUTFILE: File to write the filtered reads to (bgzipped).

Options:
  -f, --fastq            Write FASTQ instead of FASTA
  -t, --threads INTEGER  Number of threads to use for compression  [default:
                         1]
  --help                 Show this message and exit.

Example

First, BLAST your reads against an adapter database:

blastn -query <(samtools fasta /path/to/bam) \
  -db /path/to/adapter/blast/db \
  -reward 1 -penalty -5 -gapopen 3 -gapextend 3 \
  -dust no -soft_masking true -evalue 700 \
  -searchsp 1750000000000 \
  -outfmt "6 std qlen" |\
  bgzip > blastout.gz
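To make the layout of the resulting file concrete, here is a small sketch that reads rows in the "6 std qlen" format: the twelve standard BLAST tabular columns followed by qlen. The function name parse_blast and the example row are invented for illustration; the sketch is independent of how hifi_trimmer itself reads the file.

```python
import csv
import io

# The 12 "std" columns of BLAST tabular output, followed by the extra qlen column.
COLUMNS = [
    "qseqid", "sseqid", "pident", "length", "mismatch", "gapopen",
    "qstart", "qend", "sstart", "send", "evalue", "bitscore", "qlen",
]
INT_COLUMNS = {"length", "mismatch", "gapopen", "qstart", "qend", "sstart", "send", "qlen"}
FLOAT_COLUMNS = {"pident", "evalue", "bitscore"}


def parse_blast(handle):
    """Yield one dict per hit from a tab-separated 'outfmt 6 std qlen' stream."""
    for row in csv.reader(handle, delimiter="\t"):
        if not row:
            continue  # skip blank lines
        if len(row) != len(COLUMNS):
            raise ValueError(f"expected {len(COLUMNS)} columns, got {len(row)}")
        hit = dict(zip(COLUMNS, row))
        for key in INT_COLUMNS:
            hit[key] = int(hit[key])
        for key in FLOAT_COLUMNS:
            hit[key] = float(hit[key])
        yield hit


# Invented example row: a 45 bp perfect match at the start of a 15 kb read.
example = "read1\tNGB00972.1\t100.000\t45\t0\t0\t1\t45\t1\t45\t1e-20\t84.2\t15000\n"
hits = list(parse_blast(io.StringIO(example)))
print(hits[0]["qseqid"], hits[0]["qend"], hits[0]["qlen"])
```

Because bgzip output is gzip-compatible, a real file can be read the same way by passing gzip.open("blastout.gz", "rt", newline="") as the handle.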

To create a BED file, you then need to create a YAML file describing the actions to take for each adapter. The adapter name can be a regular expression, but note that each adapter name in the BLAST file must match only one entry in the YAML.

- adapter: "^NGB00972"  # regular expression matching adapter names
  discard_middle: True  # discard read if adapter is found in the middle
  discard_end: False    # discard read if adapter found at end
  trim_end: True        # trim read if adapter is found at end (overridden by discard choice)
  middle_pident: 95     # minimum percent identity for a match in the middle of the read
  middle_length: 44     # minimum match length for a match in the middle of the read
  end_pident: 90        # minimum percent identity for a match at the end of the read
  end_length: 18        # minimum match length for a match at the end of the read
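The requirement that each adapter name in the BLAST file matches only one YAML entry can be checked before running the tool. The sketch below does this over entries that have already been parsed into Python dictionaries. The second entry, the helper name match_adapter, and the use of re.search are assumptions for illustration; the tool's own matching semantics may differ.

```python
import re

# Entries as they might look after parsing the YAML; the second one is invented.
adapters = [
    {"adapter": "^NGB00972", "discard_middle": True, "discard_end": False, "trim_end": True},
    {"adapter": "^NGB00973", "discard_middle": True, "discard_end": True, "trim_end": False},
]


def match_adapter(name, adapters):
    """Return the single entry whose pattern matches name, or raise ValueError."""
    matches = [entry for entry in adapters if re.search(entry["adapter"], name)]
    if len(matches) != 1:
        raise ValueError(f"{name!r} matches {len(matches)} entries, expected exactly 1")
    return matches[0]


entry = match_adapter("NGB00972.1", adapters)
print(entry["adapter"], entry["trim_end"])
```

Running the check over every distinct subject name in the BLAST output would surface both unmatched and ambiguously matched adapters early.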

Then run process_blast to generate a BED file:

hifi_trimmer process_blast /path/to/blastout.gz /path/to/yaml

Then filter the BAM file using the BED file:

hifi_trimmer filter_bam /path/to/bam /path/to/bed /path/to/final/fasta.gz
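Conceptually, the filtering step removes the BED regions from each read. The following sketch illustrates that idea on a single sequence. It assumes BED intervals are 0-based and half-open, that a region spanning the whole read means the read is discarded, and that every other region touches one end of the read. These are assumptions for illustration, and the function name apply_exclusions is hypothetical; this is not the tool's code.

```python
def apply_exclusions(sequence, regions):
    """Remove excluded (start, end) regions from a read; return None if it is discarded."""
    keep_start, keep_end = 0, len(sequence)
    for start, end in regions:
        if start <= 0 and end >= len(sequence):
            return None  # whole read excluded
        if start <= 0:
            keep_start = max(keep_start, end)  # trim from the left
        elif end >= len(sequence):
            keep_end = min(keep_end, start)  # trim from the right
        else:
            raise ValueError("sketch only handles regions that touch a read end")
    if keep_start >= keep_end:
        return None  # nothing left after trimming both ends
    return sequence[keep_start:keep_end]


read = "ACGTACGTAC"
print(apply_exclusions(read, [(0, 3)]))
print(apply_exclusions(read, [(0, 2), (8, 10)]))
print(apply_exclusions(read, [(0, 10)]))
```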
