hifi_trimmer is a tool for filtering and trimming extraneous adapter hits from a HiFi read set using a BLAST search.

hifi_trimmer

hifi_trimmer is a command-line tool for filtering and trimming extraneous adapter hits from a HiFi read set using a BLAST search against a fasta file of adapter sequences. It is designed to be highly configurable, with per-adapter settings to determine actions if the adapter is found at the ends of a read or in the middle. To improve reproducibility, the primary output of the tool is a BED file that describes the region of each read to be excluded. The tool also includes a command to filter the reads to disk using the produced BED file.

The polars backend used for BLAST file processing should respect the number of cores allocated by your scheduler. If it does not, set the environment variable POLARS_MAX_THREADS to an integer before running the software. The number of threads used by the bgzip backend, which writes the compressed BED files and the filtered FASTA files, is set with the command-line option --threads.
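
As a minimal sketch of the two thread controls described above, the following Python snippet builds a process_blast invocation with both limits set explicitly. The helper name build_invocation and the file names blastout.gz and adapters.yaml are placeholders chosen for illustration; only POLARS_MAX_THREADS and --threads come from the tool's documentation.

```python
import os
import shlex


def build_invocation(blastout, adapter_yaml, polars_threads, bgzip_threads):
    """Return (command, environment) for a process_blast run with explicit thread limits.

    Hypothetical helper for illustration; it is not part of hifi_trimmer.
    """
    env = dict(os.environ)
    # The variable has to be in the environment before the tool starts.
    env["POLARS_MAX_THREADS"] = str(polars_threads)
    cmd = [
        "hifi_trimmer", "process_blast",
        "--threads", str(bgzip_threads),  # bgzip compression threads
        blastout, adapter_yaml,
    ]
    return cmd, env


cmd, env = build_invocation("blastout.gz", "adapters.yaml",
                            polars_threads=4, bgzip_threads=2)
# Show the equivalent shell command line.
print(f"POLARS_MAX_THREADS={env['POLARS_MAX_THREADS']} {shlex.join(cmd)}")
```

To launch the tool from Python, the pair can be passed to subprocess.run(cmd, env=env, check=True).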

Installation

pip install hifi_trimmer

Usage

Usage: hifi_trimmer [OPTIONS] COMMAND [ARGS]...

  Main entry point for the tool.

Options:
  --version  Show the version and exit.
  --help     Show this message and exit.

Commands:
  filter_bam     Filter the reads stored in a BAM file using the...
  process_blast  Processes the input blastout file according to the...

To process a blast output TSV to a BED file:

Usage: hifi_trimmer process_blast [OPTIONS] BLASTOUT ADAPTER_YAML

  Processes the input blastout file according to the adapter yaml key.

  BLASTOUT: tabular file resulting from a BLAST query of a readset against a
  BLAST database of adapter sequences, run with -outfmt "6 std qlen". If the
  qlen column is missing, lengths can be calculated by passing the --bam
  option.

  ADAPTER_YAML: yaml file containing a list with the following fields per
  adapter:

  - name: (name of adapter. can be a regular expression)
    discard_middle: True/False (discard read if adapter found in middle)
    discard_end: True/False (discard read if adapter found in end)
    trim_end: True/False (trim read if adapter found in end)
    middle_pident: int (minimum pident required to identify adapter in middle of read)
    middle_length: int (minimum match length required to identify adapter in middle of read)
    end_pident: int (minimum pident required to identify adapter in end window)
    end_length: int (minimum match length required to identify adapter in end window)

  Output: By default, writes bgzipped BED to [prefix].bed.gz, and a JSON
  summary file with raw counts of adapter hits detected, counts identified
  after processing, and the total length of removed sequences per adapter to
  [prefix].summary.json.

Options:
  -p, --prefix TEXT               Output prefix for results. Defaults to the
                                  basename of the blastout if not provided.
  -ml, --min_length_after_trimming INTEGER
                                  Minimum length of a read after trimming the
                                  ends in order not to be discarded  [default:
                                  300]
  -el, --end_length INTEGER       Window size at either end of the read to be
                                  considered as 'ends' for searching
                                  [default: 150]
  -hf, --hits                     Write the hits identified using the given
                                  adapter specifications to TSV. The format is
                                  standard BLAST outfmt 6 with the following
                                  extra columns: read_length (int), discard
                                  (bool), trim_l (bool), trim_r (bool)
  --no-summary                    Skip writing a summary file with the number
                                  of hits for each adapter
  -t, --threads INTEGER           Number of threads to use for compression
                                  [default: 1]
  --help                          Show this message and exit.
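
The interaction between the per-adapter YAML fields and the end window can be sketched in Python as follows. This is an illustration under stated assumptions, not the tool's implementation: it assumes a hit counts as an "end" hit only when it lies entirely inside the window, and the names AdapterRule and classify_hit are invented for the example. Note that the YAML field end_length is a minimum match length, while the command-line option --end_length is the window size (called end_window here to keep the two apart).

```python
from dataclasses import dataclass


@dataclass
class AdapterRule:
    """One YAML entry (hypothetical container, fields as documented above)."""
    discard_middle: bool
    discard_end: bool
    trim_end: bool
    middle_pident: float
    middle_length: int
    end_pident: float
    end_length: int  # minimum match length in the end window


def classify_hit(rule, pident, length, qstart, qend, qlen, end_window=150):
    """Return 'discard', 'trim_l', 'trim_r' or None for one BLAST hit.

    qstart/qend are 1-based inclusive query coordinates, as reported by BLAST.
    Assumption: a hit is an end hit only if it sits wholly inside the window.
    """
    lo, hi = min(qstart, qend), max(qstart, qend)
    in_left = hi <= end_window
    in_right = lo > qlen - end_window
    if in_left or in_right:
        if pident < rule.end_pident or length < rule.end_length:
            return None
        if rule.discard_end:  # discarding takes precedence over trimming
            return "discard"
        if rule.trim_end:
            return "trim_l" if in_left else "trim_r"
        return None
    passes = pident >= rule.middle_pident and length >= rule.middle_length
    return "discard" if passes and rule.discard_middle else None


rule = AdapterRule(discard_middle=True, discard_end=False, trim_end=True,
                   middle_pident=95, middle_length=44,
                   end_pident=90, end_length=18)
print(classify_hit(rule, 100.0, 45, 1, 45, 10000))       # hit at the left end
print(classify_hit(rule, 96.0, 45, 5000, 5044, 10000))   # hit in the middle
```

The looser end thresholds in the example rule reflect that partial adapters at read ends are trimmed, while a middle hit must be long and near-identical before the whole read is discarded.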

To filter a bam file using the BED file:

Usage: hifi_trimmer filter_bam [OPTIONS] BAM BED OUTFILE

  Filter the reads stored in a BAM file using the appropriate BED file
  produced by process_blast and write to a bgzipped fasta file.

  BAM: BAM file in which to filter reads
  BED: BED file describing regions of the read set to exclude.
  OUTFILE: File to write the filtered reads to (bgzipped).

Options:
  -f, --fastq            Write FASTQ instead of FASTA
  -t, --threads INTEGER  Number of threads to use for compression  [default:
                         1]
  --help                 Show this message and exit.

Example

First, BLAST your reads against an adapter database:

blastn -query <(samtools fasta /path/to/bam) \
  -db /path/to/adapter/blast/db \
  -reward 1 -penalty -5 -gapopen 3 -gapextend 3 \
  -dust no -soft_masking true -evalue 700 \
  -searchsp 1750000000000 \
  -outfmt "6 std qlen" |\
  bgzip > blastout.gz
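
To inspect the resulting table before processing it, the 13 columns produced by -outfmt "6 std qlen" can be read with the Python standard library. The sketch below is for illustration only: the function name read_blast_hits and the sample row (including the subject name NGB00972.1 and all values in it) are made up.

```python
import csv
import io

# Column order of "6 std qlen": the twelve standard fields plus the query length.
BLAST_COLUMNS = ["qseqid", "sseqid", "pident", "length", "mismatch", "gapopen",
                 "qstart", "qend", "sstart", "send", "evalue", "bitscore", "qlen"]
INT_COLUMNS = {"length", "mismatch", "gapopen", "qstart", "qend",
               "sstart", "send", "qlen"}
FLOAT_COLUMNS = {"pident", "evalue", "bitscore"}


def read_blast_hits(handle):
    """Yield one dict per BLAST hit from a tab-separated text handle."""
    for row in csv.reader(handle, delimiter="\t"):
        if not row:
            continue  # skip blank lines
        if len(row) != len(BLAST_COLUMNS):
            raise ValueError(f"expected {len(BLAST_COLUMNS)} columns, got {len(row)}")
        hit = dict(zip(BLAST_COLUMNS, row))
        for col in INT_COLUMNS:
            hit[col] = int(hit[col])
        for col in FLOAT_COLUMNS:
            hit[col] = float(hit[col])
        yield hit


# Invented example row standing in for real BLAST output.
sample = "read1\tNGB00972.1\t100.000\t45\t0\t0\t1\t45\t1\t45\t1.2e-20\t84.2\t15234\n"
hits = list(read_blast_hits(io.StringIO(sample)))
print(hits[0]["sseqid"], hits[0]["pident"], hits[0]["qlen"])
```

Because bgzip output is gzip-compatible, a real file can be opened with gzip.open("blastout.gz", "rt") and passed to the same function.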

To create a BED file, you first need a YAML file describing the actions to take for each adapter. The adapter name can be a regular expression, but note that each adapter name in the BLAST file must match only one entry in the YAML.

- adapter: "^NGB00972"  # regular expression matching adapter names
  discard_middle: True  # discard read if adapter is found in the middle
  discard_end: False    # discard read if adapter is found at end
  trim_end: True        # trim read if adapter is found at end (overridden by discard choice)
  middle_pident: 95     # minimum percent identity for a match in the middle of the read
  middle_length: 44     # minimum match length for a match in the middle of the read
  end_pident: 90        # minimum percent identity for a match at the end of the read
  end_length: 18        # minimum match length for a match at the end of the read
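
The requirement that each adapter name matches exactly one YAML entry can be checked up front. The following Python sketch assumes the patterns are applied with re.search, which may differ from how the tool itself matches names; the function find_rule, the second pattern and the adapter name NGB00972.1 are hypothetical.

```python
import re


def find_rule(adapter_name, patterns):
    """Return the single pattern matching adapter_name, or raise ValueError.

    Assumption: patterns are tested with re.search; the tool may match differently.
    """
    matches = [p for p in patterns if re.search(p, adapter_name)]
    if len(matches) != 1:
        raise ValueError(
            f"{adapter_name!r} matched {len(matches)} YAML entries: {matches}"
        )
    return matches[0]


patterns = ["^NGB00972", "^NGB00973"]  # second pattern is illustrative
print(find_rule("NGB00972.1", patterns))

# An overly broad pattern makes the same name ambiguous.
try:
    find_rule("NGB00972.1", ["^NGB", "^NGB00972"])
except ValueError as err:
    print("ambiguous:", err)
```

Running a check like this over the distinct subject names in the BLAST output shows overlapping or missing patterns before the BED file is generated.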

Then run process_blast to generate a BED file:

hifi_trimmer process_blast /path/to/blastout.gz /path/to/yaml

Then filter the bam file using the BED file:

hifi_trimmer filter_bam /path/to/bam /path/to/bed /path/to/final/fasta.gz
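
Conceptually, the filtering step removes the BED regions from each read. The Python sketch below illustrates that idea for a single read using standard BED coordinates (0-based, half-open). It is not the tool's implementation: it assumes intervals are already merged, that a discarded read is represented by an interval spanning the whole read, and that every other interval touches one end of the read. The function name apply_exclusions is invented for the example.

```python
def apply_exclusions(sequence, intervals):
    """Trim excluded (start, end) regions from a read; return None if nothing remains.

    Assumptions: intervals are 0-based half-open, already merged, and each one
    touches the start or the end of the read (or covers it completely).
    """
    left, right = 0, len(sequence)
    for lo, hi in intervals:
        if lo <= 0:
            left = max(left, hi)     # excluded region at the left end
        elif hi >= len(sequence):
            right = min(right, lo)   # excluded region at the right end
        else:
            raise ValueError(f"interior interval not handled in this sketch: {(lo, hi)}")
    return sequence[left:right] if left < right else None


read = "ACGTACGTAC"
print(apply_exclusions(read, [(0, 3)]))            # trim the left end
print(apply_exclusions(read, [(0, 3), (7, 10)]))   # trim both ends
print(apply_exclusions(read, [(0, 10)]))           # whole read excluded
```

Keeping the exclusions in a BED file rather than editing reads directly means the same trimming decisions can be re-applied or audited later without rerunning BLAST.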
