Skip to main content

hifi-trimmer is a tool for filtering and trimming extraneous adapter hits from a HiFi read set using a BLAST search.

Project description

hifi-trimmer

hifi-trimmer is a command-line tool for filtering and trimming extraneous adapter hits from a HiFi read set using a BLAST search against a fasta file of adapter sequences. It is designed to be highly configurable, with per-adapter settings to determine actions if the adapter is found at the ends of a read or in the middle. To improve reproducibility, the primary output of the tool is a BED file that describes the region of each read to be excluded. The tool also includes a command to filter the reads to disk using the produced BED file.

The polars backend for BLAST file processing should respect the number of cores set by your scheduler; however, if this is not the case the number of threads used can be adjusted by setting the environment variable POLARS_MAX_THREADS=int before running the software. The number of threads used by the bgzip backend for writing compressed BED files and the filtered FASTA files can be adjusted using the command-line option --threads.

Installation

pip install hifi_trimmer

Usage

Usage: hifi-trimmer [OPTIONS] COMMAND [ARGS]...

  Main entry point for the tool.

Options:
  --version  Show the version and exit.
  --help     Show this message and exit.

Commands:
  process-blast  Processes the input blastout file according to the...
  trim           Filter the reads stored in a BAM file using the...

To process a blast output TSV to a BED file:

Usage: hifi-trimmer process-blast [OPTIONS] BLASTOUT ADAPTER_YAML

  Processes the input blastout file according to the adapter yaml key.

  BLASTOUT: tabular file resulting from a BLAST query of a readset against a
  BLAST database of adapter sequences, run with -outfmt "6 std qlen".

  ADAPTER_YAML: yaml file containing a list with the following fields per
  adapter:

  - name: (name of adapter. can be a regular expression)
    discard_middle: True/False (discard read if adapter found in middle)
    discard_end: True/False (discard read if adapter found in end)
    trim_end: True/False (trim read if adapter found in end)
    middle_pident: int (minimum pident requred to identify adapter in middle of read)
    middle_length: int (minimum match length required to identify adapter in middle of read)
    end_pident: int (minimum pident requred to identify adapter in end window)
    end_length: int (minimum match length requred to identify adapter in end window)

  Output: By default, writes bgzipped BED to [prefix].bed.gz, and a JSON
  summary file with raw counts of adapter hits detected, counts identified
  after processing, and the total length of removed sequences per adapter to
  [prefix].summary.json.

Options:
  -p, --prefix TEXT               Output prefix for results. Defaults to the
                                  basename of the blastout if not provided.
  -ml, --min-length-after-trimming INTEGER
                                  Minumum length of a read after trimming the
                                  ends in order not to be discarded  [default:
                                  300]
  -el, --end-length INTEGER       Window size at either end of the read to be
                                  considered as 'ends' for searching
                                  [default: 150]
  --hits / --no-hits              Write the hits identified using the given
                                  adapter specifications to TSV. The format is
                                  standard BLAST outfmt 6 with the following
                                  extra columns: read_length (int), discard
                                  (bool), trim_l (bool), trim_r (bool)
  --no-summary                    Skip writing a summary JSON with the number
                                  of hits for each adapter
  -t, --threads INTEGER           Number of threads to use for compression
                                  [default: 1]
  --help                          Show this message and exit.

To filter a BAM file using the BED file:

Usage: hifi-trimmer trim [OPTIONS] BAM BED

  Filter the reads stored in a BAM file using the appropriate BED file
  produced by blastout_to_bed and write to a bgzipped fasta file.

  BAM: BAM file in which to filter reads
  BED: BED file describing regions of the read set to exclude.

Options:
  -o, --outfile FILENAME       The output file to write to. Defaults to stdout
                               if not specified.
  -t, --threads INTEGER        Number of threads to use for compression
                               [default: 1]
  -f, --format [sam|bam|cram]  Output file format.  [default: bam]
  -h, --format-opts TEXT       Comma-separated list of format options to pass
                               to htslib for writing the output (see https://w
                               ww.htslib.org/doc/samtools.html#GLOBAL_COMMAND_
                               OPTIONS).
  --help                       Show this message and exit.

Example

First, BLAST your reads against an adapter database:

blastn -query <(samtools fasta /path/to/bam) \
  -db /path/to/adapter/blast/db \
  -reward 1 -penalty -5 -gapopen 3 -gapextend 3 \
  -dust no -soft_masking true -evalue 700 \
  -searchsp 1750000000000 \
  -outfmt "6 std qlen" |\
  bgzip > blastout.gz

To create a BED file, you then need to create a YAML file describing the actions to take for each adapter. The adapter name can be a regular expression, but note that each adapter name in the BLAST file must match only one entry in the YAML.

- adapter: "^NGB00972"  // regular expression matching adapter names
  discard_middle: True  // discard read if adapter is found in the middle
  discard_end: False    // discard read if adapter found at end
  trim_end: True        // trim read if adapter is found at end (overridden by discard choice)
  middle_pident: 95     // minimum percent identity for a match in the middle of the read
  middle_length: 44     // minimum match length for a match in the middle of the read
  end_pident: 90        // minimum percent identity for a match at the end of the read
  end_length: 18        // minimum match length for a match at the end of the read

Then run process-blast to generate a BED file:

hifi-trimmer process-blast /path/to/blastout.gz /path/to/yaml

Then filter the BAM file using the BED file:

hifi-trimmer trim /path/to/bam /path/to/bed /path/to/final/file.bam

Project details


Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

hifi_trimmer-3.1.0.tar.gz (259.3 kB view details)

Uploaded Source

Built Distribution

If you're not sure about the file name format, learn more about wheel file names.

hifi_trimmer-3.1.0-py3-none-any.whl (13.4 kB view details)

Uploaded Python 3

File details

Details for the file hifi_trimmer-3.1.0.tar.gz.

File metadata

  • Download URL: hifi_trimmer-3.1.0.tar.gz
  • Upload date:
  • Size: 259.3 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/6.2.0 CPython/3.12.6

File hashes

Hashes for hifi_trimmer-3.1.0.tar.gz
Algorithm Hash digest
SHA256 d1d05c95acc65cec296132d69f8747e731666cb1640105843ad14138a82dd7a0
MD5 f07a573856553541f4537589e32dade7
BLAKE2b-256 819b9454cdfc3a4f0921966a99ccf3f53508342e0db4e5c0c9848f0544b92082

See more details on using hashes here.

File details

Details for the file hifi_trimmer-3.1.0-py3-none-any.whl.

File metadata

  • Download URL: hifi_trimmer-3.1.0-py3-none-any.whl
  • Upload date:
  • Size: 13.4 kB
  • Tags: Python 3
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/6.2.0 CPython/3.12.6

File hashes

Hashes for hifi_trimmer-3.1.0-py3-none-any.whl
Algorithm Hash digest
SHA256 994fc19e2d255bd98ed648b909412be3c41de6ba7288b5029a6a21eb5c81c6cd
MD5 f9ce0378d975ee0543228b75d6bd4682
BLAKE2b-256 d26f77fd05d0b655aaa5faa70005592406c1c1151ceb011550067344fa2c8221

See more details on using hashes here.

Supported by

AWS Cloud computing and Security Sponsor Datadog Monitoring Depot Continuous Integration Fastly CDN Google Download Analytics Pingdom Monitoring Sentry Error logging StatusPage Status page