
uSeq2Tracks: Universal Sequencing to Browser Tracks Pipeline


A comprehensive pipeline for processing diverse sequencing datasets and generating standardized genomic tracks for UCSC Genome Browser visualization. uSeq2Tracks handles everything from raw sequencing data to publication-ready browser tracks with minimal user intervention.

Available in two implementations:

  • Snakemake — stable, feature-complete
  • Nextflow DSL2 — cloud/HPC-optimized with Docker & Singularity support

🌟 Overview

uSeq2Tracks is designed to standardize the processing of heterogeneous sequencing datasets, transforming raw sequencing data into organized, genome browser-ready tracks. Whether you're working with public datasets from ENCODE or your own experimental data, uSeq2Tracks provides a unified workflow that handles quality control, alignment, track generation, and browser hub creation.

Key Features

  • Universal Input Support: Handles both local FASTQ files and SRA accessions
  • Multiple Assay Types: ChIP-seq, ATAC-seq, CUT&RUN, RNA-seq, WGS, Long-reads, etc.
  • Two Processing Modes: Standard (full QC) and Rapid (streamlined for public data)
  • Genome-ID Organization: All outputs tagged with unique genome identifiers
  • UCSC Integration: Automatic track hub generation for browser visualization
  • Quality Control: Comprehensive QC reports with FastQC, MultiQC, and assay-specific metrics
  • Flexible Configuration: Extensive parameter customization for each assay type

🧬 Supported Assay Types

| Assay Type | Purpose | Key Outputs | Peak Calling |
|------------|---------|-------------|--------------|
| ChIP-seq | Histone modifications, TF binding | BigWig tracks, narrowPeak files | MACS3 |
| ATAC-seq | Chromatin accessibility | BigWig tracks, narrowPeak files | MACS3 |
| CUT&RUN | Low-input chromatin profiling | BigWig tracks, narrowPeak files | MACS3 |
| RNA-seq | Gene expression | BigWig tracks, count matrices | N/A |
| WGS | Genome-wide sequencing | BigWig coverage, variant calls | N/A |
| Ancient DNA | Historical/archaeological samples | BigWig tracks, damage analysis | N/A |
| Nanopore | Long-read sequencing | BigWig tracks, structural variants | N/A |
| PacBio | Long-read sequencing | BigWig tracks, high-accuracy variants | N/A |

🚀 Quick Start

1. Installation

# Clone the repository
git clone https://github.com/pgrady1322/uSeq2Tracks.git
cd uSeq2Tracks

# Create the conda environment (bioinformatics tools + dependencies)
conda env create -f envs/useq2tracks.yml
conda activate useq2tracks

# Install the useq2tracks CLI
pip install -e .

# Or install the CLI from PyPI (without pipeline files)
# pip install useq2tracks

2. Configuration

Edit the main configuration file:

# config.yaml
samplesheet: "samples.csv"           # Your sample metadata
genome: "/path/to/genome.fa"         # Reference genome
genome_id: "galGal6"                 # REQUIRED: Unique genome identifier
gtf: "/path/to/annotations.gtf"      # Gene annotations (optional)
outdir: "./results"                  # Output directory
rapid_mode: false                    # Set to true for streamlined processing

3. Sample Sheet Setup

Create a sample sheet describing your data:

Standard Mode Example (comprehensive QC):

sample_id,type,sra_id,read1,read2,experiment_group,replicate_group,condition
sample1,chipseq,SRR123456,,,H3K27ac,replicate1,treatment
sample2,atacseq,,data/sample2_R1.fastq.gz,data/sample2_R2.fastq.gz,accessibility,replicate1,control
sample3,rnaseq,,data/sample3_R1.fastq.gz,data/sample3_R2.fastq.gz,expression,timepoint1,control

Rapid Mode Example (public datasets):

sample_id,type,sra_id,read1,read2,experiment_group,replicate_group,condition
ENCODE_H3K27ac_rep1,chipseq,SRR1536404,,,H3K27ac,ENCSR000EWQ,K562
ENCODE_H3K27ac_rep2,chipseq,SRR1536405,,,H3K27ac,ENCSR000EWQ,K562
ENCODE_Input,chipseq,SRR1536406,,,H3K27ac,input,input

4. Run the Pipeline

# ── Using the useq2tracks CLI (recommended) ──

# Snakemake (default engine)
useq2tracks run                              # all cores, config.yaml
useq2tracks run --cores 16                   # limit to 16 cores
useq2tracks run --dryrun                     # preview what will run
useq2tracks run --rapid                      # skip QC, essential tracks only
useq2tracks run --profile slurm              # submit to SLURM
useq2tracks run --resume                     # resume interrupted run

# Nextflow
useq2tracks run --engine nextflow
useq2tracks run --engine nextflow --profile docker
useq2tracks run --engine nextflow --resume

# Forward extra engine-specific flags after '--'
useq2tracks run -- --printshellcmds          # Snakemake shell echo
useq2tracks run --engine nextflow -- -with-tower   # Nextflow Tower

# ── Other CLI subcommands ──

# Validate inputs before running
useq2tracks validate samples.csv --configfile config.yaml

# Show pipeline info
useq2tracks info

# Generate a standalone UCSC track hub
useq2tracks hub --genome-id galGal6 --hub-name MyHub \
  --hub-short-label "My Hub" --hub-long-label "My Sequencing Hub" \
  --genome-name galGal6 --hub-email user@example.com \
  --bigwigs *.bw --output-dir ucsc_hub/

# ── Or invoke engines directly (advanced) ──
snakemake --use-conda --cores all
nextflow run main.nf -profile docker --genome_id galGal6

🔄 Nextflow Implementation

uSeq2Tracks includes a Nextflow DSL2 pipeline for improved scalability, cloud integration, and HPC support. The Nextflow implementation provides all core functionality with enhanced portability and parallelization.

Why Use Nextflow?

Advantages:

  • ✅ Better Parallelization: Automatic task-level optimization
  • ✅ Improved Resume: More robust caching and resume capability
  • ✅ Cloud Native: Built-in support for AWS, Google Cloud, Azure
  • ✅ HPC Ready: First-class SLURM, SGE, PBS support
  • ✅ Container First: Excellent Docker/Singularity integration
  • ✅ Portable: Works identically across different systems

Nextflow Quick Start

1. Install Nextflow

curl -s https://get.nextflow.io | bash
sudo mv nextflow /usr/local/bin/

2. Configure Pipeline

Create a params.yaml file:

# Input/output
samplesheet: "samples.csv"
genome: "/path/to/genome.fa"
genome_id: "galGal6"  # REQUIRED
outdir: "./results"

# Pipeline mode
rapid_mode: false

# ATAC-seq settings
atacseq:
  mapper: "bowtie2"
  markdup: false
  macs3_opts: "--qval 0.05"
  bw_norm: "CPM"

# ChIP-seq settings
chipseq:
  mapper: "bowtie2"
  control_tag: "input"
  markdup: false

# UCSC hub
ucsc:
  hub_name: "myHub"
  hub_short_label: "My Data"
  hub_long_label: "My Sequencing Data Hub"
  genome_name: "galGal6"
  hub_email: "pgrady1322@gmail.com"

3. Run Nextflow Pipeline

# Navigate to nextflow directory
cd nextflow/

# With Docker (local)
nextflow run main.nf -profile docker -params-file params.yaml

# With Singularity (HPC)
nextflow run main.nf -profile singularity -params-file params.yaml

# With SLURM + Singularity
nextflow run main.nf -profile slurm,singularity -params-file params.yaml

# Rapid mode
nextflow run main.nf -profile rapid,docker -params-file params.yaml

# Resume failed run
nextflow run main.nf -profile docker -resume

Nextflow Features

Implemented Workflows:

  • ✅ ATAC-seq workflow with Tn5 shift correction
  • ✅ ChIP-seq workflow with control matching
  • ✅ CUT&RUN workflow
  • ✅ Genome preparation and indexing
  • ✅ UCSC track hub generation
  • ✅ Samplesheet validation
  • ✅ Dynamic resource allocation
  • ✅ Multiple execution profiles

Snakemake-only (not yet ported to Nextflow):

  • RNA-seq workflow
  • WGS workflow
  • Long-read workflows (Nanopore, PacBio)
  • Ancient DNA workflow
  • SRA download integration
  • FastQC/MultiQC integration
  • Replicate merging

Nextflow Profiles

Container Profiles:

  • docker: Use Docker containers (recommended for local)
  • singularity: Use Singularity containers (recommended for HPC)
  • conda: Use Conda environments

Executor Profiles:

  • local: Run on local machine (default)
  • slurm: Submit to SLURM scheduler
  • sge: Submit to SGE scheduler
  • pbs: Submit to PBS scheduler

Special Profiles:

  • test: Run with minimal test dataset
  • rapid: Skip QC, generate essential tracks only

Combine profiles with commas:

nextflow run main.nf -profile slurm,singularity  # SLURM + Singularity
nextflow run main.nf -profile docker,rapid       # Docker + Rapid mode

Nextflow Configuration Example

# params.yaml for Nextflow
samplesheet: "samples.csv"
genome: "/path/to/genome.fa"
genome_id: "galGal6"

# Resource limits
max_cpus: 64
max_memory: "256.GB"
max_time: "240.h"

# ATAC-seq parameters
atacseq:
  mapper: "bowtie2"
  bowtie2_opts: "--very-sensitive"
  markdup: false
  shift: -75
  extsize: 150
  bw_norm: "CPM"

# ChIP-seq parameters  
chipseq:
  mapper: "bowtie2"
  control_tag: "input"
  macs3_opts: "--qval 0.05 --keep-dup all"
  bw_norm: "CPM"

Nextflow Output Structure

results/
└── galGal6/                    # Genome ID
    ├── atacseq/
    │   ├── bam/
    │   ├── peaks/
    │   └── bigwig/
    ├── chipseq/
    │   ├── bam/
    │   ├── peaks/
    │   └── bigwig/
    ├── ucsc/
    │   ├── hub.txt
    │   ├── genomes.txt
    │   └── trackDb.txt
    └── pipeline_info/          # Execution reports
        ├── execution_report.html
        ├── execution_timeline.html
        └── execution_trace.txt

Nextflow Quick Reference

Basic Commands:

# Run pipeline
nextflow run main.nf -profile docker -params-file params.yaml

# Resume failed run
nextflow run main.nf -profile docker -resume

# Override parameters
nextflow run main.nf --genome my_genome.fa --genome_id myGenome

# Generate reports
nextflow run main.nf -with-report -with-timeline -with-dag

# Limit concurrent jobs
nextflow run main.nf -profile slurm -qs 50

Troubleshooting:

# Check configuration
nextflow config main.nf

# Test run
nextflow run main.nf -profile test,docker

# Clean and restart
rm -rf work/ && nextflow run main.nf

Choosing Between Snakemake and Nextflow

| Feature | Snakemake | Nextflow |
|---------|-----------|----------|
| Maturity | Stable, feature-complete | Stable (core assays) |
| Learning Curve | Python-based (easier for most) | Groovy-based |
| Parallelization | Good | Better (automatic) |
| Cloud Support | Via plugins | Native |
| HPC Integration | Good | Excellent |
| Resume Capability | Good | Excellent |
| Container Support | Good | Excellent |
| Best For | General bioinformatics | HPC/Cloud deployments |

Recommendation:

  • Use Snakemake for: All assay types, Python familiarity, full feature set
  • Use Nextflow for: HPC/cloud environments, containerized execution, core epigenomic assays

Nextflow Documentation

For complete Nextflow documentation, see:

  • nextflow/QUICK_REFERENCE.md โ€” Quick command reference
  • nextflow/examples/ โ€” Example configurations

📊 Pipeline Modes

Standard Mode (Default)

Purpose: Comprehensive analysis with full quality control

Features:

  • ✅ Complete FastQC reports for all samples
  • ✅ Adapter trimming with detailed summaries
  • ✅ MultiQC comprehensive aggregate report
  • ✅ Comprehensive track generation
  • ✅ Parameter sweep analysis (optional)
  • ✅ Replicate-merged composite tracks
  • ✅ Detailed QC metrics and visualizations
  • ✅ Genrich additional outputs (when enabled)

Pipeline Components:

  • FastQC quality reports for raw reads
  • Adapter trimming outputs and summaries
  • MultiQC aggregated QC dashboard
  • Primary pipeline outputs (BigWig, BAM, peaks)
  • Composite tracks for replicate groups
  • Parameter sweep comparisons
  • Complete UCSC hub files

Use Cases:

  • ✨ Novel experimental datasets requiring comprehensive QC
  • ✨ Unknown sample quality scenarios
  • ✨ Publication-ready analysis with full documentation
  • ✨ Parameter optimization studies
  • ✨ Research datasets needing complete audit trail

Output Structure:

results/{genome_id}/
├── qc/
│   ├── fastqc/           # Individual FastQC reports
│   └── multiqc_report.html
├── trimmed/              # Adapter trimming outputs
├── {assay}/              # Assay-specific tracks
├── ucsc/
│   ├── hub.txt
│   └── composite_trackDb.txt
└── genrich_sweep/        # Parameter sweeps (if enabled)

Rapid Mode

Purpose: Streamlined processing for high-confidence datasets

Features:

  • ⚡ Essential track generation only
  • ⚡ Skipped QC reports (FastQC, MultiQC)
  • ⚡ No adapter trimming summaries
  • ⚡ No composite tracks for replicate groups
  • ⚡ No parameter sweep outputs
  • ⚡ No Genrich additional outputs
  • ⚡ Core outputs: BigWig tracks, peak calls, basic UCSC hubs
  • ⚡ 30-50% faster processing
  • ⚡ Reduced storage footprint
  • ⚡ Simplified output structure

Pipeline Components:

  • SRA downloads (if needed)
  • Genome indexing
  • Primary alignment and track generation
  • Essential peak calling
  • Basic UCSC hub files (no composites)
  • Rapid completion tracking

Use Cases:

  • 🚀 Public datasets (ENCODE, TCGA, GEO) with known quality
  • 🚀 Quick browser track generation for visualization
  • 🚀 Streamlined processing when QC is unnecessary
  • 🚀 Fast turnaround for track sharing
  • 🚀 Time-sensitive analyses

Output Structure:

results/{genome_id}/
├── {assay}/              # Essential tracks only
├── ucsc/
│   └── hub.txt           # Basic hub (no composites)
└── rapid/
    └── rapid_tracks_complete.txt  # Completion summary

Activation:

rapid_mode: true

Configuration Example:

# For public ENCODE datasets
genome_id: "hg38_ENCODE"
samplesheet: "encode_samples.csv"
rapid_mode: true              # Enable rapid mode
genome: "/data/hg38.fa"

# Pipeline will skip:
# - FastQC reports
# - MultiQC aggregation
# - Adapter trimming summaries
# - Composite track generation

Mode Comparison Table

| Feature | Standard Mode | Rapid Mode |
|---------|---------------|------------|
| FastQC Reports | ✅ Yes | ❌ Skipped |
| MultiQC Dashboard | ✅ Yes | ❌ Skipped |
| Adapter Trimming | ✅ Full summaries | ❌ No summaries |
| BigWig Tracks | ✅ Yes | ✅ Yes |
| Peak Calling | ✅ Yes | ✅ Yes |
| UCSC Hub | ✅ With composites | ✅ Basic only |
| Parameter Sweeps | ✅ Optional | ❌ Skipped |
| Composite Tracks | ✅ Yes | ❌ Skipped |
| Processing Time | Baseline | 30-50% faster |
| Storage Usage | Full | Reduced |
| Best For | Research data | Public datasets |

Backward Compatibility

  • Default behavior unchanged: rapid_mode defaults to false
  • Existing configs work: No breaking changes to current setups
  • Progressive enhancement: Rapid mode is opt-in feature
  • Output tagging: All outputs tagged with genome_id regardless of mode

✨ Benefits of Dual-Mode Architecture

Rapid Mode Benefits:

  1. ⚡ 30-50% faster processing for public datasets
  2. 💾 Reduced storage footprint without QC intermediates
  3. 🎯 Cleaner output focused on essential browser tracks
  4. 🚀 Quick turnaround for time-sensitive visualization needs
  5. 📊 Streamlined workflows for known high-quality data

Standard Mode Benefits:

  1. 📈 Complete audit trail for research datasets
  2. 🔬 Comprehensive quality assessment for novel samples
  3. 📚 Publication-ready with full documentation
  4. 🔍 Deep quality insights via MultiQC aggregation
  5. 🎛️ Parameter optimization capabilities

Flexibility Advantages:

  • Switch between modes based on data source
  • Mix rapid and standard processing in same project
  • Maintain quality standards while optimizing efficiency
  • Preserve comprehensive analysis when needed

๐Ÿ“ Output Structure

The pipeline generates genome-tagged outputs with the following structure:

Standard Mode Output (rapid_mode: false)

results/
└── {genome_id}/                    # e.g., galGal6/
    ├── genome/                     # Genome indices
    │   ├── genome.fa               # Reference genome
    │   ├── star/                   # STAR index
    │   ├── bowtie2/                # Bowtie2 index
    │   └── bwa_mem2/               # BWA-MEM2 index
    ├── qc/                         # Quality control
    │   ├── fastqc/                 # FastQC reports (HTML + zip)
    │   │   ├── sample1_fastqc.html
    │   │   └── sample1_fastqc.zip
    │   └── multiqc_report.html     # Aggregated QC report
    ├── trimmed/                    # Adapter trimming (if enabled)
    │   ├── sample1_R1_trimmed.fq.gz
    │   ├── sample1_R2_trimmed.fq.gz
    │   └── trimming_reports/
    ├── {assay_type}/               # Per-assay outputs
    │   ├── bam/                    # Aligned reads
    │   │   ├── sample1.sorted.bam
    │   │   └── sample1.sorted.bam.bai
    │   ├── bigwig/                 # Coverage tracks
    │   │   ├── sample1.bw
    │   │   └── merged_replicates.bw
    │   └── peaks/                  # Peak calls (when applicable)
    │       ├── sample1_peaks.narrowPeak
    │       └── merged_peaks.bed
    ├── ucsc/                       # UCSC track hubs
    │   ├── hub.txt                 # Main hub file
    │   ├── genomes.txt             # Genome specification
    │   ├── trackDb.txt             # Track definitions
    │   └── composite_trackDb.txt   # Composite track definitions
    ├── genrich_sweep/              # Parameter sweeps (if enabled)
    │   ├── qval_0.05/
    │   ├── qval_0.01/
    │   └── qval_0.001/
    └── logs/                       # Pipeline logs
        ├── alignment/
        ├── peak_calling/
        └── track_generation/

Rapid Mode Output (rapid_mode: true)

results/
└── {genome_id}/                    # e.g., galGal6/
    ├── genome/                     # Genome indices (same as standard)
    │   ├── genome.fa
    │   ├── star/
    │   ├── bowtie2/
    │   └── bwa_mem2/
    ├── {assay_type}/               # Essential tracks only
    │   ├── bam/                    # Aligned reads
    │   │   └── sample1.sorted.bam
    │   ├── bigwig/                 # Coverage tracks
    │   │   └── sample1.bw
    │   └── peaks/                  # Peak calls (when applicable)
    │       └── sample1_peaks.narrowPeak
    ├── ucsc/                       # Basic UCSC hub
    │   ├── hub.txt                 # Main hub file
    │   ├── genomes.txt             # Genome specification
    │   └── trackDb.txt             # Track definitions (no composites)
    ├── rapid/                      # Rapid mode tracking
    │   └── rapid_tracks_complete.txt  # Completion summary with stats
    └── logs/                       # Pipeline logs
        ├── alignment/
        ├── peak_calling/
        └── track_generation/

Output Differences Summary

| Output Component | Standard Mode | Rapid Mode |
|------------------|---------------|------------|
| QC Reports | Full FastQC + MultiQC | Skipped |
| Trimming Outputs | Detailed summaries | No summaries |
| Composite Tracks | Generated | Skipped |
| Parameter Sweeps | Optional | Skipped |
| Hub Complexity | With composites | Basic only |
| Storage Footprint | Full | ~30-40% smaller |
| Completion Marker | Standard | rapid_tracks_complete.txt |

Key Output Files

BigWig Tracks (.bw):

  • Genome-wide coverage tracks for UCSC browser
  • Normalized by CPM, RPKM, or custom methods
  • Strand-specific for RNA-seq (when applicable)
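For reference, CPM (counts per million) normalization simply scales each bin's read count by the library size in millions. A minimal sketch of the arithmetic (illustrative only, not the pipeline's BigWig tooling):

```python
def cpm_scale_factor(total_mapped_reads: int) -> float:
    """Scale factor that converts raw per-bin counts to counts per million."""
    return 1_000_000 / total_mapped_reads

def to_cpm(bin_counts, total_mapped_reads):
    """Normalize a list of per-bin read counts to CPM."""
    factor = cpm_scale_factor(total_mapped_reads)
    return [c * factor for c in bin_counts]

# A library of 20 million mapped reads: a bin with 40 reads becomes 2.0 CPM.
print(to_cpm([40, 100], 20_000_000))
```

This is why CPM tracks from libraries of different depths are directly comparable in the browser.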

Peak Files (.narrowPeak, .broadPeak):

  • BED-format files with enriched regions
  • MACS3 output with q-values and fold-enrichment
  • Optional Genrich peaks for ATAC-seq

BAM Files (.bam):

  • Sorted and indexed aligned reads
  • Optional duplicate marking
  • Quality filtered (MAPQ thresholds applied)

UCSC Hub Files:

  • hub.txt: Hub metadata and contact info
  • genomes.txt: Genome assembly specifications
  • trackDb.txt: Individual track configurations
  • composite_trackDb.txt: Grouped track configurations (standard mode only)

🔧 Configuration Details

Required Parameters

# Essential settings - must be configured
samplesheet: "samples.csv"          # Sample metadata file
genome: "/path/to/genome.fa"        # Reference genome FASTA
genome_id: "your_genome_id"         # REQUIRED: Unique identifier
outdir: "./results"                 # Output directory

Sample Sheet Format

| Column | Description | Example | Required |
|--------|-------------|---------|----------|
| sample_id | Unique sample identifier | H3K27ac_rep1 | Yes |
| type | Assay type | chipseq, atacseq, rnaseq | Yes |
| sra_id | SRA accession (if downloading) | SRR123456 | If no local files |
| read1 | Path to R1 FASTQ | data/sample_R1.fastq.gz | If no SRA |
| read2 | Path to R2 FASTQ | data/sample_R2.fastq.gz | For paired-end |
| experiment_group | Experimental grouping | H3K27ac, timepoint1 | No |
| replicate_group | Replicate grouping | replicate1 | No |
| condition | Sample condition | treatment, control | No |

Assay-Specific Parameters

ChIP-seq Configuration

chipseq:
  mapper: "bowtie2"                 # bowtie2 or bwa_mem2
  bowtie2_opts: "--very-sensitive"
  markdup: false                    # Mark duplicates
  macs3_opts: "--qval 0.05 --keep-dup all"
  control_tag: "input"              # Identify control samples
  bw_norm: "CPM"                    # BigWig normalization

ATAC-seq Configuration

atacseq:
  mapper: "bowtie2"
  markdup: false                    # Recommended: false for accessibility
  shift: -75                        # MACS3 shift for ATAC-seq
  extsize: 150                      # MACS3 extension size
  macs3_opts: "--qval 0.05"
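For context on these values: ATAC-seq reads are conventionally shifted +4 bp on the forward strand and -5 bp on the reverse strand so that each read's 5' end marks the Tn5 insertion site, and shift -75 with extsize 150 then centers a 150 bp window on that site during peak calling. A sketch of the conventional coordinate correction (the standard Tn5 offsets, not necessarily the pipeline's exact implementation):

```python
def tn5_correct(start: int, end: int, strand: str) -> tuple[int, int]:
    """Shift a read interval so its 5' end marks the Tn5 insertion site.

    Conventional ATAC-seq correction: +4 bp on the forward strand,
    -5 bp on the reverse strand (0-based, half-open coordinates).
    """
    if strand == "+":
        return start + 4, end + 4
    return start - 5, end - 5

print(tn5_correct(100, 150, "+"))  # forward-strand read
print(tn5_correct(100, 150, "-"))  # reverse-strand read
```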

RNA-seq Configuration

rnaseq:
  mapper: "star"                    # star or hisat2
  star_opts: "--outFilterMultimapNmax 20"
  strand_specific: false
  gene_bed: "genes.bed"             # For QC analysis

Advanced Features

Parameter Sweep

Test multiple peak-calling thresholds:

parameter_sweep:
  enabled: true
  qvalues: [0.05, 0.01, 0.005, 0.001]

Adapter Trimming

Enable quality-based trimming:

adapter_trimming:
  enabled: true
  min_length: 20
  quality_cutoff: 20

Alternative Peak Callers

Enable Genrich for ATAC-seq:

🧮 Computational Requirements

Resource Recommendations

| Dataset Size | CPU Cores | Memory | Storage | Time Estimate |
|--------------|-----------|--------|---------|---------------|
| Small (< 10 samples) | 8-16 | 32 GB | 100 GB | 2-6 hours |
| Medium (10-50 samples) | 16-32 | 64 GB | 500 GB | 6-24 hours |
| Large (50+ samples) | 32-64 | 128 GB | 1 TB+ | 1-3 days |

Cluster Configuration

The pipeline includes SLURM integration via Executor.sh. Customize for your cluster:

#SBATCH --job-name=uSeq2Tracks
#SBATCH --partition=general
#SBATCH --qos=general
#SBATCH --mem=10g
#SBATCH --cpus-per-task=4
📖 Detailed Workflows

ChIP-seq/ATAC-seq/CUT&RUN Workflow

  1. Quality Control: FastQC analysis of raw reads
  2. Adapter Trimming: Optional quality-based trimming with fastp
  3. Genome Indexing: Build mapper-specific indices (Bowtie2/BWA-MEM2)
  4. Read Mapping: Align reads to reference genome
  5. Post-processing: Sort, index, optional duplicate marking
  6. Coverage Tracks: Generate normalized BigWig files
  7. Peak Calling: Identify enriched regions with MACS3
  8. Quality Metrics: Generate assay-specific QC reports
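The mapping and peak-calling steps above are plain tool invocations driven by the config. As an illustration of how the ChIP-seq settings translate into commands, here is a sketch with assumed file names and layout, not the pipeline's generated shell:

```python
def build_chipseq_commands(cfg: dict, sample: str) -> list[str]:
    """Assemble the core alignment and peak-calling commands for one sample."""
    align = (
        f"bowtie2 {cfg['bowtie2_opts']} -x genome/bowtie2/index "
        f"-1 {sample}_R1.fastq.gz -2 {sample}_R2.fastq.gz "
        f"| samtools sort -o bam/{sample}.sorted.bam -"
    )
    peaks = (
        f"macs3 callpeak -t bam/{sample}.sorted.bam "
        f"-c bam/{cfg['control_tag']}.sorted.bam "
        f"{cfg['macs3_opts']} -n {sample} --outdir peaks/"
    )
    return [align, peaks]

cfg = {"bowtie2_opts": "--very-sensitive",
       "control_tag": "input",
       "macs3_opts": "--qval 0.05 --keep-dup all"}
for cmd in build_chipseq_commands(cfg, "sample1"):
    print(cmd)
```

Note how control_tag selects the input sample passed to MACS3 via -c.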

RNA-seq Workflow

  1. Quality Control: FastQC analysis of raw reads
  2. Adapter Trimming: Optional preprocessing with fastp
  3. Genome Indexing: Build STAR or HISAT2 indices
  4. Read Mapping: Splice-aware alignment to reference
  5. Quantification: Generate gene-level count matrices
  6. Coverage Tracks: Create strand-specific BigWig files
  7. Quality Assessment: RNA-seq specific QC with RSeQC

Long-read Workflow (Nanopore/PacBio)

  1. Quality Assessment: Basic statistics and length distributions
  2. Genome Indexing: Build Minimap2 index
  3. Read Mapping: Long-read aware alignment
  4. Coverage Analysis: Generate coverage tracks
  5. Variant Calling: Optional structural variant detection

🎨 UCSC Browser Integration

Automatic Track Hub Generation

uSeq2Tracks automatically creates UCSC-compatible track hubs:

Hub Structure:

ucsc/
├── hub.txt                 # Hub metadata
├── genomes.txt             # Genome definitions
└── trackDb.txt             # Track configurations
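These three files follow the plain-text UCSC track hub format. A minimal sketch of their contents (field names follow the UCSC hub spec; the values are illustrative, not what uSeq2Tracks emits verbatim):

```python
def write_hub(hub_name, short_label, long_label, email, genome, tracks):
    """Return hub.txt, genomes.txt and trackDb.txt contents as strings."""
    hub = (f"hub {hub_name}\nshortLabel {short_label}\n"
           f"longLabel {long_label}\ngenomesFile genomes.txt\nemail {email}\n")
    genomes = f"genome {genome}\ntrackDb trackDb.txt\n"
    stanzas = []
    for name, url in tracks:  # one stanza per BigWig track
        stanzas.append(
            f"track {name}\nbigDataUrl {url}\nshortLabel {name}\n"
            f"longLabel {name} coverage\ntype bigWig\nvisibility full\n"
        )
    return hub, genomes, "\n".join(stanzas)

hub, genomes, trackdb = write_hub(
    "myHub", "My Data", "My Sequencing Data Hub",
    "user@example.com", "galGal6", [("sample1", "bigwig/sample1.bw")])
print(trackdb)
```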

Track Organization:

  • Composite Tracks: Group related samples (e.g., same experiment)
  • Subgroups: Organize by condition, replicate, timepoint
  • Color Coding: Consistent color schemes per assay type
  • Metadata Integration: Sample information in track descriptions

Loading in UCSC Browser

  1. Upload track hub files to web-accessible location
  2. In UCSC Browser: My Data → Track Hubs → My Hubs
  3. Enter hub URL: https://your-server.com/path/to/ucsc/hub.txt
  4. Browse your data with full metadata integration

๐Ÿ” Quality Control Features

Standard QC Reports

  • FastQC: Per-sample quality metrics
  • MultiQC: Aggregated quality dashboard
  • Mapping Statistics: Alignment rates and quality scores
  • Library Complexity: Duplication rates and insert sizes

Assay-Specific QC

  • ChIP-seq: Fragment length distributions, enrichment metrics
  • ATAC-seq: TSS enrichment, fragment size profiles
  • RNA-seq: Gene body coverage, junction analysis
  • WGS: Coverage uniformity, variant quality metrics

Quality Thresholds

The pipeline includes built-in quality checks:

  • Minimum mapping rates
  • Fragment count requirements for peak calling
  • Insert size validation
  • Strand specificity assessment
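A minimum-mapping-rate check like the one above can be reproduced from samtools flagstat output. The sketch below parses the flagstat text and applies a threshold; the 80% cutoff and the sample text are illustrative, not the pipeline's defaults:

```python
import re

def mapping_rate(flagstat_text: str) -> float:
    """Extract the mapping percentage from `samtools flagstat` output."""
    for line in flagstat_text.splitlines():
        if " mapped (" in line:
            return float(re.search(r"\(([\d.]+)%", line).group(1))
    raise ValueError("no 'mapped' line found in flagstat output")

# Abbreviated flagstat output for illustration.
FLAGSTAT = """\
2000000 + 0 in total (QC-passed reads + QC-failed reads)
1900000 + 0 mapped (95.00% : N/A)
"""

rate = mapping_rate(FLAGSTAT)
ok = rate >= 80.0  # illustrative threshold, not the pipeline default
print(rate, ok)
```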

🛠 Troubleshooting

Common Issues

Genome ID Not Set

ERROR: genome_id must be set in config.yaml

Solution: Add genome_id: "your_genome" to config.yaml

Sample Sheet Formatting

ERROR: Missing required columns in sample sheet

Solution: Ensure sample sheet includes sample_id, type, and either sra_id or local file paths

Memory Issues

ERROR: Job exceeded memory limit

Solution: Increase memory allocation in config.yaml:

memory:
  large: 128000    # Increase for memory-intensive jobs

Disk Space

ERROR: No space left on device

Solution:

  • Clean up intermediate files: snakemake --delete-temp-output
  • Use scratch storage for temporary files
  • Monitor disk usage during execution

Performance Optimization

Speed Up Processing

  1. Enable Rapid Mode for public datasets
  2. Increase Parallelization: More --jobs in Snakemake
  3. Use SSDs for scratch space
  4. Optimize Resource Allocation: Match CPU/memory to job requirements

Reduce Storage

  1. Delete Intermediate Files: Use --delete-temp-output
  2. Compress Outputs: Enable compression for BAM files
  3. Archive Unused Data: Move completed analyses to long-term storage

📚 Examples

Example 1: ENCODE ChIP-seq Analysis (Rapid Mode)

Scenario: Processing public ENCODE data for quick visualization

# config.yaml
genome_id: "hg38_ENCODE_H3K27ac"
genome: "/data/genomes/hg38.fa"
samplesheet: "encode_chipseq.csv"
rapid_mode: true                     # Skip QC for public data

chipseq:
  mapper: "bowtie2"
  macs3_opts: "--qval 0.01"

# encode_chipseq.csv
sample_id,type,sra_id,read1,read2,experiment_group,replicate_group,condition
ENCSR000EWQ_rep1,chipseq,SRR1536404,,,H3K27ac,ENCSR000EWQ,K562
ENCSR000EWQ_rep2,chipseq,SRR1536405,,,H3K27ac,ENCSR000EWQ,K562
ENCSR000EWQ_input,chipseq,SRR1536406,,,H3K27ac,input,input

Expected Output:

results/hg38_ENCODE_H3K27ac/
├── chipseq/
│   ├── bam/
│   ├── bigwig/
│   └── peaks/
├── ucsc/
│   └── hub.txt
└── rapid/
    └── rapid_tracks_complete.txt

Processing Time: ~2-3 hours (vs 4-6 hours in standard mode)

Example 2: Multi-assay Developmental Study (Standard Mode)

Scenario: Novel experimental data requiring comprehensive QC

# config.yaml
genome_id: "mm10_development"
genome: "/data/genomes/mm10.fa"
gtf: "/data/annotations/mm10.gtf"
samplesheet: "development_study.csv"
rapid_mode: false                    # Full QC for novel data

parameter_sweep:
  enabled: true
  qvalues: [0.1, 0.05, 0.01, 0.001]
  
adapter_trimming:
  enabled: true
  min_length: 20

# development_study.csv
sample_id,type,sra_id,read1,read2,experiment_group,replicate_group,condition
E10_ATAC_rep1,atacseq,,data/E10_ATAC_1_R1.fq.gz,data/E10_ATAC_1_R2.fq.gz,E10,ATAC_rep1,E10
E10_ATAC_rep2,atacseq,,data/E10_ATAC_2_R1.fq.gz,data/E10_ATAC_2_R2.fq.gz,E10,ATAC_rep2,E10
E10_RNA_rep1,rnaseq,,data/E10_RNA_1_R1.fq.gz,data/E10_RNA_1_R2.fq.gz,E10,RNA_rep1,E10
E12_ATAC_rep1,atacseq,,data/E12_ATAC_1_R1.fq.gz,data/E12_ATAC_1_R2.fq.gz,E12,ATAC_rep1,E12
E12_RNA_rep1,rnaseq,,data/E12_RNA_1_R1.fq.gz,data/E12_RNA_1_R2.fq.gz,E12,RNA_rep1,E12

Expected Output:

results/mm10_development/
├── qc/
│   ├── fastqc/
│   └── multiqc_report.html
├── trimmed/
├── atacseq/
├── rnaseq/
├── ucsc/
│   ├── hub.txt
│   └── composite_trackDb.txt
└── genrich_sweep/

Processing Time: ~8-12 hours with full QC

Example 3: TCGA Cancer Atlas (Rapid Mode)

Scenario: Rapid processing of TCGA RNA-seq data

# config.yaml
genome_id: "hg38_TCGA_BRCA"
genome: "/data/genomes/hg38.fa"
gtf: "/data/annotations/gencode.v38.gtf"
samplesheet: "tcga_rnaseq.csv"
rapid_mode: true                     # Fast track generation

rnaseq:
  mapper: "star"
  strand_specific: true

# tcga_rnaseq.csv
sample_id,type,sra_id,read1,read2,experiment_group,replicate_group,condition
TCGA_BRCA_01,rnaseq,SRR8494716,,,BRCA,tumor_01,tumor
TCGA_BRCA_02,rnaseq,SRR8494717,,,BRCA,tumor_02,tumor
TCGA_BRCA_normal,rnaseq,SRR8494718,,,BRCA,normal_01,normal

Benefits:

  • ⚡ 40% faster processing
  • 💾 50% less storage (no QC intermediates)
  • 🎯 Clean output for browser visualization

Example 4: Ancient DNA Analysis (Standard Mode)

Scenario: Archaeological samples requiring quality validation

# config.yaml
genome_id: "ancientDNA_sample"
genome: "/data/reference/ancient_ref.fa"
samplesheet: "ancient_samples.csv"
rapid_mode: false                    # Need full QC for damage assessment

ancientdna:
  mapper: "bwa_aln"
  markdup: true
  damage_analysis: true
  min_mapq: 30

Why Standard Mode:

  • Ancient DNA has unique quality issues (damage patterns)
  • Need comprehensive QC to assess sample preservation
  • Publication requires full quality documentation

Example 5: Mixed-Mode Project

Scenario: Combining public and experimental data

# config_public.yaml (rapid mode)
genome_id: "mm10_public_controls"
samplesheet: "public_controls.csv"
rapid_mode: true

# config_experimental.yaml (standard mode)
genome_id: "mm10_experimental"
samplesheet: "experimental_samples.csv"
rapid_mode: false

Workflow:

  1. Process public controls rapidly for quick validation
  2. Process experimental data with full QC
  3. Combine tracks in single UCSC hub
  4. Maintain appropriate quality standards for each dataset type

๐Ÿค Contributing

Contributions are welcome! To get started:

git clone https://github.com/pgrady1322/uSeq2Tracks.git
cd uSeq2Tracks
conda env create -f envs/useq2tracks.yml
conda activate useq2tracks
pip install -e ".[dev]"
make test        # run tests
make check       # lint + format check

📄 License

This project is licensed under the MIT License - see the LICENSE file for details.

📞 Support

๐Ÿ† Citation

If you use uSeq2Tracks in your research, please cite:

Patrick Grady (2026). uSeq2Tracks: A universal pipeline for sequencing
data to genome browser tracks. https://github.com/pgrady1322/uSeq2Tracks

๐Ÿ™ Acknowledgments

  • Snakemake Community: For the excellent workflow management system
  • Bioconda: For streamlined software distribution
  • UCSC Genome Browser: For track hub specifications
  • Tool Developers: FastQC, MultiQC, STAR, Bowtie2, BWA, MACS3, and all other integrated tools

uSeq2Tracks: From raw sequencing data to publication-ready browser tracks in one streamlined workflow.
