HIFI-SE

HIFI-barcode-SE400

The BGISEQ-500 platform has launched a new test sequencing kit capable of single-end 400 bp sequencing (SE400), which offers a simple and reliable way to obtain DNA barcodes efficiently. In this study, we explored the potential of BGISEQ-500 SE400 sequencing for DNA barcode reference construction and provided an updated HIFI-Barcode software package that can generate COI barcode assemblies from HTS reads of length >= 400 bp.

Manual

manual book

Versions

  • v1.0.1
    HIFI-SE v1.0.1 2018/12/2. Changes from previous version:

    • Added the "polish" function.
    • Fixed several small bugs.
  • v1.0.0
    HIFI-SE v1.0.0 2018/11/22. Changes from previous version:

    • Formatted the Python code to follow PEP8 style.
    • Fixed several small bugs.
  • v0.0.3
    HIFI-SE v0.0.3 2018/11/15. Changes from previous version:

    • Modified the descriptions of some arguments for better understanding.
  • v0.0.1
    HIFI-SE v0.0.1 2018/11/03, beta version. Established the framework and achieved almost complete functionality.

Installation

System requirement and dependencies

Operating system: HIFI-SE is designed to run on most platforms, including UNIX, Linux, MacOS/X and Microsoft Windows. We have tested it on Linux and MacOS/X, because these are the machines we develop on. HIFI-SE is written in Python and requires Python 3.5 or higher.

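Dependencies: the Python packages biopython and bold_identification, plus vsearch for the assembly step (either in $PATH or passed with -vsearch).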

Install

Installation with pip is recommended because it resolves package dependencies automatically, including the biopython and bold_identification packages.

pip install HIFI-SE

Usage (latest==1.0.1)

HIFI-SE
usage: HIFI-SE [-h] [-v]
               {all,filter,assign,assembly,polish,bold_identification} ...

Description

    An automatic pipeline for the HIFI-SE400 project, including filtering
    raw reads, assigning reads to samples, and assembling HIFI barcodes
    (COI sequences).

Version

    1.0.1 2018-12-2  Add "polish" function
    1.0.0 2018-11-22 formatted as PEP8 style
    0.0.1 2018-11-3

Author
    yangchentao at genomics.cn, BGI.
    mengguanliang at genomics.cn, BGI.

positional arguments:
  {all,filter,assign,assembly,polish,bold_identification}
    all                 run filter, assign and assembly
    filter              filter raw reads
    assign              assign reads to samples
    assembly            do assembly from input fastq
                        reads, output HIFI barcodes.
    polish              polish COI barcode assemblies,
                        output confident barcodes.
    bold_identification
                        do taxa identification
                        on the BOLD system.
optional arguments:
  -h, --help            show this help message and exit
  -v, --version         show program's version number and exit

Run by steps [filter -> assign -> assembly]

  • python3 HIFI-SE.py filter
usage: HIFI-SE filter [-h] -outpre <STR> -raw <STR> [-e <INT>]
                      [-q <INT> <INT>] [-n <INT>]

optional arguments:
  -h, --help      show this help message and exit

common arguments:
  -outpre <STR>   prefix for output files

filter arguments:
  -raw <STR>      input raw single-end fastq file; only adapters
                  should have been removed; assumed to use the
                  Phred33 score system (BGISEQ-500)
  -e <INT>        expected error threshold, default=10
                  see more: http://drive5.com/usearch/manual/exp_errs.html
  -q <INT> <INT>  filter by base quality; for example, '20 5' means
                  dropping reads in which more than 5 percent of
                  bases have a quality score < 20.
  -n <INT>        remove reads containing [INT] Ns, default=1
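
The filter options above can be read as three per-read checks. The following is a minimal Python sketch of that logic, not the HIFI-SE implementation. The function names are hypothetical. The expected error is computed as the sum of per-base error probabilities 10^(-Q/10), as described on the linked USEARCH page. Applying all three checks together, and treating -n as "at least INT Ns", are assumptions made for illustration.

```python
def phred33_scores(quality_string):
    """Decode a Phred+33 quality string into integer quality scores."""
    return [ord(ch) - 33 for ch in quality_string]


def expected_errors(scores):
    """Sum of per-base error probabilities, P = 10 ** (-Q / 10)."""
    return sum(10 ** (-q / 10) for q in scores)


def passes_filter(seq, quality_string, max_ee=10, min_q=20,
                  max_low_pct=5, max_n=1):
    """Return True if a read survives the three checks (assumed semantics).

    max_ee      mirrors -e: maximum expected errors allowed
    min_q       mirrors the first value of -q: quality cutoff
    max_low_pct mirrors the second value of -q: percent of bases below cutoff
    max_n       mirrors -n: reads with at least this many Ns are dropped
    """
    scores = phred33_scores(quality_string)
    if not scores:
        return False  # empty reads carry no information
    if seq.upper().count("N") >= max_n:
        return False
    if expected_errors(scores) > max_ee:
        return False
    low_quality = sum(1 for q in scores if q < min_q)
    if 100 * low_quality / len(scores) > max_low_pct:
        return False
    return True
```

For example, a 20-base read with two bases of quality 0 has 10 percent low-quality bases, so it would be dropped under '-q 20 5' in this sketch.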
  • python3 HIFI-SE.py assign
usage: HIFI-SE assign [-h] -outpre <STR> -index INT -fq <STR> -primer <STR>
                      [-outdir <STR>]

optional arguments:
  -h, --help     show this help message and exit

common arguments:
  -outpre <STR>  prefix for output files

index arguments:
  -index INT     the length of the tag sequence at the ends of primers

arguments when only running assign:
  -fq <STR>      cleaned fastq file

assign arguments:
  -primer <STR>  tagged-primer list, in the following format:
                 Rev001   AAGCTAAACTTCAGGGTGACCAAAAAATCA
                 For001   AAGCGGTCAACAAATCATAAAGATATTGG
                 ...
                 this format is required!
  -outdir <STR>  output directory for assignment, default="assigned"
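
To illustrate how the tagged-primer list and the -index value fit together, here is a small Python sketch that splits each entry into its tag and primer and assigns a read by its leading sequence. It is an illustration under assumptions, not the tool's code: the function names are hypothetical, the "For"/"Rev" plus sample number naming is inferred from the example entries above, and exact prefix matching is a simplification.

```python
def parse_tagged_primers(lines, index_len):
    """Map primer name -> (tag, primer), splitting at the tag length."""
    primers = {}
    for line in lines:
        fields = line.split()
        if len(fields) != 2:
            continue  # skip blank, placeholder, or malformed lines
        name, seq = fields
        seq = seq.upper()
        primers[name] = (seq[:index_len], seq[index_len:])
    return primers


def assign_read(read, primers):
    """Return (orientation, sample_id) if the read starts with tag + primer.

    Assumes names such as 'For001' or 'Rev001': a three-letter orientation
    followed by the sample identifier.
    """
    read = read.upper()
    for name, (tag, primer) in primers.items():
        if read.startswith(tag + primer):
            return name[:3], name[3:]
    return None


primer_list = [
    "Rev001   AAGCTAAACTTCAGGGTGACCAAAAAATCA",
    "For001   AAGCGGTCAACAAATCATAAAGATATTGG",
]
primers = parse_tagged_primers(primer_list, index_len=4)
```

With -index 4, both example entries carry the tag AAGC, and the remaining bases are the primer itself.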
  • python3 HIFI-SE.py assembly
usage: HIFI-SE assembly [-h] -outpre <STR> -index INT -list FILE
                        [-vsearch <STR>] [-threads <INT>] [-cid FLOAT]
                        [-min INT] [-max INT] [-oid FLOAT] [-tp INT] [-ab INT]
                        [-seqs_lim INT] [-len INT] [-mode INT] [-rc]
                        [-codon INT] [-frame INT]

optional arguments:
  -h, --help      show this help message and exit

common arguments:
  -outpre <STR>   prefix for output files

index arguments:
  -index INT      the length of the tag sequence at the ends of primers

arguments when only running assembly (not all):
  -list FILE      input file, fastq file list. [required]

software path:
  -vsearch <STR>  vsearch path (only needed if vsearch is not in $PATH)
  -threads <INT>  threads for vsearch, default=2
  -cid FLOAT      identity for clustering, default=0.98

assembly arguments:
  -min INT        minimum length of overlap, default=80
  -max INT        maximum length of overlap, default=90
  -oid FLOAT      minimum similarity of overlap region, default=0.95
  -tp INT         how many clusters will be used in assembly, recommendation=2
  -ab INT         keep a cluster for assembly if its abundance >=INT
  -seqs_lim INT   limit on the number of reads. by default,
                  no limit on input reads
  -len INT        standard read length, default=400
  -mode INT       1 or 2; mode 1 clusters the reads and keeps
                  the [-tp] most abundant clusters, or clusters
                  with abundance of at least [-ab], and then makes a
                  consensus sequence for each cluster.
                  mode 2 directly makes a single consensus
                  sequence without clustering. default=1
  -rc             whether to check amino acid translation
                  for reads, default not

translation arguments (when -rc or -cc is set):
  -codon INT      codon usage table used to check translation, default=5
  -frame INT      start codon shift for amino acid translation, default=1
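
The -min, -max and -oid options describe an overlap search between the forward and reverse consensus sequences. The Python sketch below shows one way such a search could work; it is not the HIFI-SE algorithm. The function names are hypothetical, and the assumptions are that the reverse sequence has already been reverse complemented, that the overlap is ungapped, and that the candidate with the highest identity wins.

```python
def reverse_complement(seq):
    """Reverse complement of a DNA sequence (A, C, G, T, N only)."""
    complement = str.maketrans("ACGTN", "TGCAN")
    return seq.upper().translate(complement)[::-1]


def identity(a, b):
    """Fraction of matching positions between two equal-length strings."""
    matches = sum(1 for x, y in zip(a, b) if x == y)
    return matches / len(a)


def find_overlap(forward, reverse_rc, min_len=80, max_len=90,
                 min_identity=0.95):
    """Return (overlap_length, identity) for the best overlap, or None.

    Compares the last `length` bases of `forward` with the first `length`
    bases of `reverse_rc` for every length in [min_len, max_len].
    """
    best = None
    upper = min(max_len, len(forward), len(reverse_rc))
    for length in range(min_len, upper + 1):
        ident = identity(forward[-length:], reverse_rc[:length])
        if ident >= min_identity and (best is None or ident > best[1]):
            best = (length, ident)
    return best


def merge(forward, reverse_rc, overlap_len):
    """Join the two sequences, keeping the overlap region only once."""
    return forward + reverse_rc[overlap_len:]
```

With two 400 bp sequences and an overlap of 80 to 90 bases, the merged contig in this sketch would be 710 to 720 bp long.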

  • python3 HIFI-SE.py polish
usage: HIFI-SE polish [-h] -i STR [-cc] [-cov INT] [-l INT] -index INT
                      [-codon INT] [-frame INT]

optional arguments:
  -h, --help  show this help message and exit

polish arguments:
  -i STR      COI barcode assemblies
  -cc         whether to check the final COI contig's
              amino acid translation, default yes
  -cov INT    minimum coverage of 5' or 3' end allowed, default=5
  -l INT      minimum length of COI barcode allowed, default=650

index arguments:
  -index INT  the length of the tag sequence at the ends of primers

translation arguments (when -rc or -cc is set):
  -codon INT  codon usage table used to check translation, default=5
  -frame INT  start codon shift for amino acid translation, default=1
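
The translation check behind -rc and -cc can be pictured as translating a sequence and looking for stop codons. The sketch below is a self-contained Python illustration, not the tool's code, which presumably relies on Biopython. It assumes that -codon 5 refers to NCBI translation table 5 (invertebrate mitochondrial) and that -frame is the number of leading bases skipped before the first codon; the function names are hypothetical.

```python
BASES = "TCAG"
# NCBI translation table 5 (invertebrate mitochondrial); amino acids are
# listed for codons in the order TTT, TTC, TTA, TTG, TCT, ... GGG.
TABLE5_AA = ("FFLLSSSSYY**CCWW" "LLLLPPPPHHQQRRRR"
             "IIMMTTTTNNKKSSSS" "VVVVAAAADDEEGGGG")
CODON_TABLE_5 = {
    a + b + c: TABLE5_AA[16 * i + 4 * j + k]
    for i, a in enumerate(BASES)
    for j, b in enumerate(BASES)
    for k, c in enumerate(BASES)
}


def translate(seq, frame=1, table=CODON_TABLE_5):
    """Translate `seq` after skipping `frame` leading bases.

    Codons containing characters other than A, C, G, T become 'X';
    trailing bases that do not fill a codon are ignored.
    """
    seq = seq.upper()[frame:]
    usable = len(seq) - len(seq) % 3
    return "".join(table.get(seq[i:i + 3], "X") for i in range(0, usable, 3))


def has_stop_codon(seq, frame=1):
    """True if the translation contains a stop codon ('*')."""
    return "*" in translate(seq, frame)
```

In table 5, TGA codes for tryptophan and AGA/AGG for serine, so a COI sequence that looks interrupted under the standard code can still translate cleanly here.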

Quickstart

Files used in tutorial

All related files can be found on the GitHub page. The important files for the tutorial are:

  • raw.fastq.gz, the raw output fastq file generated by the BGISEQ-500 SE400 module.
  • indexed_primer.list, the tagged primer list.

Run in "all" mode

Example:

python3 HIFI-SE.py all -outpre hifi -raw test.raw.fastq -index 4 -primer index_primer.list -mode 1 -cid 0.98 -oid 0.95 -seqs_lim 50000 -threads 4 -tp 2

Citation

This work has not been published yet, but a publication is coming soon. I will update this section after publication.
