An HPV integration site detection tool for targeted capture sequencing data

SearcHPV

An HPV integration point detection tool for targeted capture sequencing data

Introduction

  • SearcHPV detects HPV fusion sites on both the human genome and the HPV genome.
  • SearcHPV provides locally assembled contigs for each integration event. It reports at least one and at most two contigs for each integration site; the two contigs capture the left and right sides of the event.

Getting started

  1. Required resources
  • Unix-like environment
  1. Download and install

    First, download and install the required resources.

    1. Download Anaconda >=4.11.0: https://docs.conda.io/projects/conda/en/latest/user-guide/install/linux.html#install-linux-silent

    2. Download the "environment.yaml" file from this repository

    3. Create the conda environment for SearcHPV:

      conda env create -f [your_path]/environment.yaml
      
      

      This command automatically sets up all the third-party tools and packages required for SearcHPV and installs the latest version of SearcHPV. The name of the environment is "searcHPV".

      You can check the packages and tools in this environment by:

      conda list -n searcHPV
      
      

      You can update the environment by:

      conda env update -f [your_path]/environment.yaml
      
      
  2. Usage

SearcHPV has four main steps. You can either run it start-to-finish or run it step-by-step.

  • Before running SearcHPV, activate the conda environment:
conda activate searcHPV

If you are running commands in a bash script, start with:

#!/bin/bash
source ~/anaconda3/etc/profile.d/conda.sh;
conda activate searcHPV; 
#[searcHPV commands...]

Note: Adjust the path to "conda.sh" if Anaconda is not installed in your home directory.
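
When the location of "conda.sh" is not known in advance, it can be looked up before activating the environment. The following is a minimal sketch, not part of SearcHPV: the function name find_conda_sh and the fallback directories (~/anaconda3, ~/miniconda3) are assumptions, and it relies on `conda info --base` printing the base install directory when conda is on the PATH.

```bash
#!/bin/bash
# Hypothetical helper: print the path of conda.sh, or return 1 if none is found.
find_conda_sh() {
    local base
    # Assumed default install locations; adjust for your system.
    set -- "$HOME/anaconda3" "$HOME/miniconda3"
    if command -v conda >/dev/null 2>&1; then
        # Prefer the base directory reported by conda itself, if available.
        base=$(conda info --base 2>/dev/null) || base=""
        if [ -n "$base" ]; then
            set -- "$base" "$@"
        fi
    fi
    for base in "$@"; do
        if [ -f "$base/etc/profile.d/conda.sh" ]; then
            printf '%s\n' "$base/etc/profile.d/conda.sh"
            return 0
        fi
    done
    return 1
}

# Example use in a job script:
#   source "$(find_conda_sh)" && conda activate searcHPV
```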

  • Usage of searcHPV:
searcHPV <options> ...
  • Standard options:
 -fastq1 <str>  sequencing data: fastq/fq.gz file
 -fastq2 <str>  sequencing data: fastq/fq.gz file
 -humRef <str>  human reference genome: fasta file
 -virRef <str>  HPV reference genome: fasta file
  • Optional options:
-h, --help            show this help message and exit
-window <int>         the length of region searching for informative reads, default=300
-output <str>         output directory, default "./"
-alignment            run the alignment step, step1
-genomeFusion         call the genome fusion points, step2
-assemble             local assemble for each integration event, step3
-hpvFusion            call the HPV fusion points, step4
-clusterWindow <int>  the length of window of clustering integration sites, default=100
-gz                   if fastq files are in gz format
-poly(dn) N           poly(n), n*d(A/T/C/G), will report low confidence if contig contains poly(n), default=20
-index                index the original human and virus reference files, default=False
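
To make the poly(n) option more concrete: a contig is reported as low confidence if it contains a run of n identical bases (default 20). The sketch below shows one way to test a sequence for such a run outside the tool. It is an illustration only, not SearcHPV's own implementation, and the function name has_poly_n is hypothetical.

```bash
#!/bin/bash
# Hypothetical helper: succeed (exit 0) if the sequence contains a run of
# at least n identical bases (A, C, G or T), case-insensitive.
# Usage: has_poly_n SEQUENCE [n]   (n defaults to 20, matching the option default)
has_poly_n() {
    local sequence="$1" n="${2:-20}"
    printf '%s\n' "$sequence" | grep -Eiq "A{$n}|C{$n}|G{$n}|T{$n}"
}

# Example:
#   if has_poly_n "$contig"; then echo "contains poly(n)"; fi
```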

Note: If you have already indexed the virus and human reference files for BWA, Samtools, and Picard, you do not need the "-index" option. This is especially useful when running a batch of samples that share the same virus and human reference files, because the references are not re-indexed for every sample. The commands for indexing the virus and human reference files:

# Activate the SearcHPV conda environment first so that the correct tool versions are used
ref='[path_of_your_reference_file]'
bwa index "$ref"
samtools faidx "$ref"
# Write the dictionary next to the reference, replacing the ".fa" suffix with ".dict"
picard CreateSequenceDictionary R="$ref" O="${ref%.fa}.dict"
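
For batch runs it can help to check whether a reference already has its index files before deciding to pass "-index". The sketch below assumes the default file names written by the three commands above (the five BWA index files, the ".fai" file from samtools, and a ".dict" file with the ".fa" suffix replaced). The function name needs_index is hypothetical, and whether SearcHPV looks for exactly these files is an assumption.

```bash
#!/bin/bash
# Hypothetical helper: succeed (exit 0) if any expected index file for the
# reference is missing or empty, i.e. the reference still needs indexing.
needs_index() {
    local ref="$1" f
    for f in "$ref.amb" "$ref.ann" "$ref.bwt" "$ref.pac" "$ref.sa" \
             "$ref.fai" "${ref%.fa}.dict"; do
        if [ ! -s "$f" ]; then
            return 0
        fi
    done
    return 1
}

# Example: add -index only when one of the references is not indexed yet.
#   index_flag=""
#   if needs_index hs37d5.fa || needs_index HPV.fa; then index_flag="-index"; fi
```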
  1. Examples:

    1. Run it start-to-finish as an SBATCH job:

      #!/bin/bash
      #SBATCH --job-name=searcHPV
      #SBATCH --mail-user=wenjingu@umich.edu
      #SBATCH --mail-type=BEGIN,END
      #SBATCH --cpus-per-task=1
      #SBATCH --nodes=1
      #SBATCH --ntasks-per-node=8
      #SBATCH --mem=40gb
      #SBATCH --time=100:00:00
      #SBATCH --account=XXXXX
      #SBATCH --partition=standard
      #SBATCH --output=searcHPV.log
      #SBATCH --error=searcHPV.err
      source ~/anaconda3/etc/profile.d/conda.sh;
      conda activate searcHPV;      
      searcHPV -fastq1 Sample_81279.R1.fastq.gz -fastq2 Sample_81279.R2.fastq.gz -humRef hs37d5.fa -virRef HPV.fa -output /home/scratch/HPV_fusion/Sample_81279 -gz -index;
      
    2. Run it step-by-step:

      searcHPV -alignment -fastq1 Sample_81279.R1.fastq.gz -fastq2 Sample_81279.R2.fastq.gz -humRef hs37d5.fa -virRef HPV.fa -output /home/scratch/HPV_fusion/Sample_81279 -gz -index
      searcHPV -genomeFusion -fastq1 Sample_81279.R1.fastq.gz -fastq2 Sample_81279.R2.fastq.gz -humRef hs37d5.fa -virRef HPV.fa -output /home/scratch/HPV_fusion/Sample_81279 -gz
      searcHPV -assemble -fastq1 Sample_81279.R1.fastq.gz -fastq2 Sample_81279.R2.fastq.gz -humRef hs37d5.fa -virRef HPV.fa -output /home/scratch/HPV_fusion/Sample_81279 -gz
      searcHPV -hpvFusion -fastq1 Sample_81279.R1.fastq.gz -fastq2 Sample_81279.R2.fastq.gz -humRef hs37d5.fa -virRef HPV.fa -output /home/scratch/HPV_fusion/Sample_81279 -gz
      
      

      Note: when running step-by-step, make sure that all steps use the same output directory.
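
One way to keep the output directory identical across steps is to set it once and loop over the four step options. The following is a sketch under assumptions: the function name run_steps and its "dry-run" mode are hypothetical, the inputs are assumed to be gzipped (hence -gz), and -index is passed only to the alignment step as in the example above.

```bash
#!/bin/bash
# Hypothetical wrapper: run (or just print) the four steps in order with one
# shared output directory. Stops at the first step that fails.
# Usage: run_steps MODE OUTDIR FASTQ1 FASTQ2 HUMREF VIRREF
#   MODE "dry-run" prints the commands; any other value executes them.
run_steps() {
    local mode="$1" outdir="$2" fq1="$3" fq2="$4" hum="$5" vir="$6" step
    for step in -alignment -genomeFusion -assemble -hpvFusion; do
        # Rebuild the argument list for this step.
        set -- searcHPV "$step" -fastq1 "$fq1" -fastq2 "$fq2" \
               -humRef "$hum" -virRef "$vir" -output "$outdir" -gz
        if [ "$step" = "-alignment" ]; then
            set -- "$@" -index
        fi
        if [ "$mode" = "dry-run" ]; then
            printf '%s\n' "$*"
        else
            "$@" || return 1
        fi
    done
}

# Example:
#   run_steps dry-run /home/scratch/HPV_fusion/Sample_81279 \
#       Sample_81279.R1.fastq.gz Sample_81279.R2.fastq.gz hs37d5.fa HPV.fa
```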

Output

  1. Alignment: the duplicate-marked alignment bam file and the customized reference genome.
  2. Genome Fusion Point Calling: original callset, filtered callset, filtered clustered callset.
  3. Assemble: supporting reads, contigs for each integration event (unfiltered).
  4. HPV Fusion Point Calling: alignment bam file for contigs against the human and HPV genomes.

Final outputs are under the folder "call_fusion_virus":

  • summary of all the integration events: "HPVfusionPointContig.txt"
  • contig sequences for all the integration events: "ContigsSequence.fa"
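
As a quick sanity check after a run, the number of reported contigs can be counted from the final FASTA file. This sketch assumes that "call_fusion_virus" sits directly under the output directory and that "ContigsSequence.fa" is a standard FASTA file with one ">" header line per contig; the function name count_contigs is hypothetical.

```bash
#!/bin/bash
# Hypothetical helper: print the number of FASTA records in the final contig file.
# Usage: count_contigs OUTPUT_DIR
count_contigs() {
    local fasta="$1/call_fusion_virus/ContigsSequence.fa"
    if [ ! -f "$fasta" ]; then
        echo "missing: $fasta" >&2
        return 1
    fi
    # grep -c prints 0 but exits non-zero when nothing matches; ignore that status.
    grep -c '^>' "$fasta" || true
}

# Example:
#   count_contigs /home/scratch/HPV_fusion/Sample_81279
```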

Citation

SearcHPV: a novel approach to identify and assemble human papillomavirus-host genomic integration events in cancer (accepted by Cancer)

Contact

wenjingu@umich.edu
