Skip to main content

A Structural Error-Aware Refinement Framework for Long-read Genome Assemblies

Project description

Genome Assembly Polishing & Scaffolding Pipeline

This repository provides an automated/step-by-step pipeline to detect assembly errors, scaffold contigs using a reference genome, and perform hybrid polishing utilizing both PacBio HiFi long reads and Illumina short reads.


Dependencies

Ensure the following tools are installed and available in your $PATH:


Step-by-Step Usage

Please export your environmental variables or modify the paths in the commands below before running.

Environment Setup (Example)

# Define your project directory and sample ID
export PROJECT_DIR="/path/to/your/workspace"
export SAMPLE_ID="YourSampleName"

# Define raw data inputs
export HIFI_READS="${PROJECT_DIR}/data/hifi.fastq.gz"
export NGS_R1="${PROJECT_DIR}/ngs/short_reads_1.fastq"
export NGS_R2="${PROJECT_DIR}/ngs/short_reads_2.fastq"
export REF_FASTA="/path/to/reference/T2T-CHM13v2.0.fasta"

# Define initial assembly input
export INIT_ASM="${PROJECT_DIR}/data/initial_assembly.fasta"

Step 1: Detect Assembly Errors (Parallel Inspector)

Run the parallelized inspector script to identify structural errors in the initial assembly.

Bash

nohup python parallel_inspector.py \
    -i ${HIFI_READS} \
    -c ${INIT_ASM} \
    -o sear \
    --num_subsets 4 \
    --proportion 0.3 \
    --max_concurrent 4 \
    > inspector.log 2>&1 &

Step 2: Filter and Cut Assembly Errors

Filter the detected error regions and slice the problematic contigs.

Bash

# Filter errors
./filter_errors.sh \
    -p inspector_out_sear_part \
    -m ${SAMPLE_ID} \
    -n 4 \
    -v 2

# Cut assembly at error positions
bash cut_error.sh \
    -m ${SAMPLE_ID} \
    -i final_best_list_${SAMPLE_ID}.bed \
    -r ${INIT_ASM}
    
# Expected Output: ${PROJECT_DIR}/data/${SAMPLE_ID}_FINAL_CORRECTED.fasta

Step 3: Reference-guided Scaffolding (RagTag)

Scaffold the corrected assembly using a reference genome.

Bash

export CORRECTED_ASM="${PROJECT_DIR}/data/${SAMPLE_ID}_FINAL_CORRECTED.fasta"

nohup ragtag.py scaffold \
    -o ragtag_output \
    ${REF_FASTA} \
    ${CORRECTED_ASM} \
    -t 32 > ragtag.log 2>&1 &

Step 4: Hybrid Polishing (NextPolish2)

This step performs the final hybrid accuracy polishing using both PacBio HiFi reads and Illumina short reads (NGS). It includes repetitive K-mer filtering (Meryl), long-read alignment (Winnowmap), short-read K-mer profiling (Yak), and the final consensus correction (NextPolish2).

export SCAFFOLD_ASM="ragtag_output/ragtag.scaffold.fasta"

# ---------------------------------------------------------------------
# Part 4.1: Long-Read Alignment with Repetitive K-mer Filtering (Winnowmap)
# ---------------------------------------------------------------------
# Count and filter high-frequency 15-mers to mask repetitive regions during alignment
meryl count k=15 output merylDB ${SCAFFOLD_ASM}
meryl print greater-than distinct=0.9998 merylDB > repetitive_k15.txt

# Align HiFi reads using Winnowmap
nohup winnowmap -t 64 -W repetitive_k15.txt -ax map-pb \
    ${SCAFFOLD_ASM} \
    ${HIFI_READS} \
    > hifi.winnowmap.sam 2>winnowmap.log &

# Convert SAM to sorted BAM and index it
samtools view -@ 16 -b hifi.winnowmap.sam | samtools sort -@ 16 -o hifi.winnowmap.bam
samtools index hifi.winnowmap.bam


# ---------------------------------------------------------------------
# Part 4.2: Short-Read K-mer Counting (Yak)
# ---------------------------------------------------------------------
# Generate K-mer profiles (K=21 and K=31) from short reads for accuracy calibration
nohup yak count -o sr.k21.yak -k 21 -b 37 ${NGS_R1} ${NGS_R2} > yak_k21.log 2>&1 &
nohup yak count -o sr.k31.yak -k 31 -b 37 ${NGS_R1} ${NGS_R2} > yak_k31.log 2>&1 &


# ---------------------------------------------------------------------
# Part 4.3: Final Polish Execution
# ---------------------------------------------------------------------
# Ensure hifi.winnowmap.bam, sr.k21.yak, and sr.k31.yak are completely generated before running
nohup nextPolish2 -t 32 \
    -r hifi.winnowmap.bam \
    ${SCAFFOLD_ASM} \
    sr.k21.yak sr.k31.yak \
    -o ${SAMPLE_ID}_ragtag.np2.fa \
    > np2.log 2>&1 &

📂 Expected Output Directory Structure

.
├── inspector_out/              # Error detection outputs
├── ragtag_output/              # RagTag scaffolding results
│   └── ragtag.scaffold.fasta   # Scaffolded genome
├── merylDB/                    # Meryl K-mer database
├── repetitive_k15.txt          # Filtered repetitive K-mers
├── hifi.winnowmap.bam          # Sorted and indexed BAM alignment
├── sr.k21.yak / sr.k31.yak     # Yak short-read K-mer profiles
└── [SampleID]_ragtag.np2.fa    # Final Polished Assembly (Success 🏁)

Project details


Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

sear_pipeline-0.1.0.tar.gz (10.6 kB view details)

Uploaded Source

Built Distribution

If you're not sure about the file name format, learn more about wheel file names.

sear_pipeline-0.1.0-py3-none-any.whl (9.7 kB view details)

Uploaded Python 3

File details

Details for the file sear_pipeline-0.1.0.tar.gz.

File metadata

  • Download URL: sear_pipeline-0.1.0.tar.gz
  • Upload date:
  • Size: 10.6 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/6.2.0 CPython/3.13.3

File hashes

Hashes for sear_pipeline-0.1.0.tar.gz
Algorithm Hash digest
SHA256 bdeadd18f4a1750e9e2f37f33ff11c681334963a8a028e703d17cfac6f8a38d1
MD5 29d86b40809705ba814f60bcbad5b359
BLAKE2b-256 bddc28f9ed909b5e54bceab2a6d79f8d31053e6500791e8e77cccda9f38727f1

See more details on using hashes here.

File details

Details for the file sear_pipeline-0.1.0-py3-none-any.whl.

File metadata

  • Download URL: sear_pipeline-0.1.0-py3-none-any.whl
  • Upload date:
  • Size: 9.7 kB
  • Tags: Python 3
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/6.2.0 CPython/3.13.3

File hashes

Hashes for sear_pipeline-0.1.0-py3-none-any.whl
Algorithm Hash digest
SHA256 c4aaa7e2529fb3fa1e2a34bbaa3526e82911698a4315923bb7222140946283cf
MD5 66a034063cf971ad236a1d8d03f7e5c0
BLAKE2b-256 7fccaed61a7adb401e6fb305ff33abdc440db18a47a2a58ba74858bfef5ee514

See more details on using hashes here.

Supported by

AWS Cloud computing and Security Sponsor Datadog Monitoring Depot Continuous Integration Fastly CDN Google Download Analytics Pingdom Monitoring Sentry Error logging StatusPage Status page