A Structural Error-Aware Refinement Framework for Long-read Genome Assemblies
Project description
Genome Assembly Polishing & Scaffolding Pipeline
This repository provides an automated/step-by-step pipeline to detect assembly errors, scaffold contigs using a reference genome, and perform hybrid polishing utilizing both PacBio HiFi long reads and Illumina short reads.
Dependencies
Ensure the following tools are installed and available in your $PATH:
Step-by-Step Usage
Please export your environmental variables or modify the paths in the commands below before running.
Environment Setup (Example)
# Define your project directory and sample ID
export PROJECT_DIR="/path/to/your/workspace"
export SAMPLE_ID="YourSampleName"
# Define raw data inputs
export HIFI_READS="${PROJECT_DIR}/data/hifi.fastq.gz"
export NGS_R1="${PROJECT_DIR}/ngs/short_reads_1.fastq"
export NGS_R2="${PROJECT_DIR}/ngs/short_reads_2.fastq"
export REF_FASTA="/path/to/reference/T2T-CHM13v2.0.fasta"
# Define initial assembly input
export INIT_ASM="${PROJECT_DIR}/data/initial_assembly.fasta"
Step 1: Detect Assembly Errors (Parallel Inspector)
Run the parallelized inspector script to identify structural errors in the initial assembly.
Bash
nohup python parallel_inspector.py \
-i ${HIFI_READS} \
-c ${INIT_ASM} \
-o sear \
--num_subsets 4 \
--proportion 0.3 \
--max_concurrent 4 \
> inspector.log 2>&1 &
Step 2: Filter and Cut Assembly Errors
Filter the detected error regions and slice the problematic contigs.
Bash
# Filter errors
./filter_errors.sh \
-p inspector_out_sear_part \
-m ${SAMPLE_ID} \
-n 4 \
-v 2
# Cut assembly at error positions
bash cut_error.sh \
-m ${SAMPLE_ID} \
-i final_best_list_${SAMPLE_ID}.bed \
-r ${INIT_ASM}
# Expected Output: ${PROJECT_DIR}/data/${SAMPLE_ID}_FINAL_CORRECTED.fasta
Step 3: Reference-guided Scaffolding (RagTag)
Scaffold the corrected assembly using a reference genome.
Bash
export CORRECTED_ASM="${PROJECT_DIR}/data/${SAMPLE_ID}_FINAL_CORRECTED.fasta"
nohup ragtag.py scaffold \
-o ragtag_output \
${REF_FASTA} \
${CORRECTED_ASM} \
-t 32 > ragtag.log 2>&1 &
Step 4: Hybrid Polishing (NextPolish2)
This step performs the final hybrid accuracy polishing using both PacBio HiFi reads and Illumina short reads (NGS). It includes repetitive K-mer filtering (Meryl), long-read alignment (Winnowmap), short-read K-mer profiling (Yak), and the final consensus correction (NextPolish2).
export SCAFFOLD_ASM="ragtag_output/ragtag.scaffold.fasta"
# ---------------------------------------------------------------------
# Part 4.1: Long-Read Alignment with Repetitive K-mer Filtering (Winnowmap)
# ---------------------------------------------------------------------
# Count and filter high-frequency 15-mers to mask repetitive regions during alignment
meryl count k=15 output merylDB ${SCAFFOLD_ASM}
meryl print greater-than distinct=0.9998 merylDB > repetitive_k15.txt
# Align HiFi reads using Winnowmap
nohup winnowmap -t 64 -W repetitive_k15.txt -ax map-pb \
${SCAFFOLD_ASM} \
${HIFI_READS} \
> hifi.winnowmap.sam 2>winnowmap.log &
# Convert SAM to sorted BAM and index it
samtools view -@ 16 -b hifi.winnowmap.sam | samtools sort -@ 16 -o hifi.winnowmap.bam
samtools index hifi.winnowmap.bam
# ---------------------------------------------------------------------
# Part 4.2: Short-Read K-mer Counting (Yak)
# ---------------------------------------------------------------------
# Generate K-mer profiles (K=21 and K=31) from short reads for accuracy calibration
nohup yak count -o sr.k21.yak -k 21 -b 37 ${NGS_R1} ${NGS_R2} > yak_k21.log 2>&1 &
nohup yak count -o sr.k31.yak -k 31 -b 37 ${NGS_R1} ${NGS_R2} > yak_k31.log 2>&1 &
# ---------------------------------------------------------------------
# Part 4.3: Final Polish Execution
# ---------------------------------------------------------------------
# Ensure hifi.winnowmap.bam, sr.k21.yak, and sr.k31.yak are completely generated before running
nohup nextPolish2 -t 32 \
-r hifi.winnowmap.bam \
${SCAFFOLD_ASM} \
sr.k21.yak sr.k31.yak \
-o ${SAMPLE_ID}_ragtag.np2.fa \
> np2.log 2>&1 &
📂 Expected Output Directory Structure
.
├── inspector_out/ # Error detection outputs
├── ragtag_output/ # RagTag scaffolding results
│ └── ragtag.scaffold.fasta # Scaffolded genome
├── merylDB/ # Meryl K-mer database
├── repetitive_k15.txt # Filtered repetitive K-mers
├── hifi.winnowmap.bam # Sorted and indexed BAM alignment
├── sr.k21.yak / sr.k31.yak # Yak short-read K-mer profiles
└── [SampleID]_ragtag.np2.fa # Final Polished Assembly (Success 🏁)
Project details
Release history Release notifications | RSS feed
Download files
Download the file for your platform. If you're not sure which to choose, learn more about installing packages.
Source Distribution
Built Distribution
Filter files by name, interpreter, ABI, and platform.
If you're not sure about the file name format, learn more about wheel file names.
Copy a direct link to the current filters
File details
Details for the file sear_pipeline-0.1.0.tar.gz.
File metadata
- Download URL: sear_pipeline-0.1.0.tar.gz
- Upload date:
- Size: 10.6 kB
- Tags: Source
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.2.0 CPython/3.13.3
File hashes
| Algorithm | Hash digest | |
|---|---|---|
| SHA256 |
bdeadd18f4a1750e9e2f37f33ff11c681334963a8a028e703d17cfac6f8a38d1
|
|
| MD5 |
29d86b40809705ba814f60bcbad5b359
|
|
| BLAKE2b-256 |
bddc28f9ed909b5e54bceab2a6d79f8d31053e6500791e8e77cccda9f38727f1
|
File details
Details for the file sear_pipeline-0.1.0-py3-none-any.whl.
File metadata
- Download URL: sear_pipeline-0.1.0-py3-none-any.whl
- Upload date:
- Size: 9.7 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.2.0 CPython/3.13.3
File hashes
| Algorithm | Hash digest | |
|---|---|---|
| SHA256 |
c4aaa7e2529fb3fa1e2a34bbaa3526e82911698a4315923bb7222140946283cf
|
|
| MD5 |
66a034063cf971ad236a1d8d03f7e5c0
|
|
| BLAKE2b-256 |
7fccaed61a7adb401e6fb305ff33abdc440db18a47a2a58ba74858bfef5ee514
|