uSeq2Tracks: Universal Sequencing to Browser Tracks Pipeline
A comprehensive pipeline for processing diverse sequencing datasets and generating standardized genomic tracks for UCSC Genome Browser visualization. uSeq2Tracks handles everything from raw sequencing data to publication-ready browser tracks with minimal user intervention.
Available in two implementations:
- Snakemake: stable, feature-complete
- Nextflow DSL2: cloud/HPC-optimized with Docker & Singularity support
Overview
uSeq2Tracks is designed to standardize the processing of heterogeneous sequencing datasets, transforming raw sequencing data into organized, genome browser-ready tracks. Whether you're working with public datasets from ENCODE or your own experimental data, uSeq2Tracks provides a unified workflow that handles quality control, alignment, track generation, and browser hub creation.
Key Features
- Universal Input Support: Handles both local FASTQ files and SRA accessions
- Multiple Assay Types: ChIP-seq, ATAC-seq, CUT&RUN, RNA-seq, WGS, Long-reads, etc.
- Two Processing Modes: Standard (full QC) and Rapid (streamlined for public data)
- Genome-ID Organization: All outputs tagged with unique genome identifiers
- UCSC Integration: Automatic track hub generation for browser visualization
- Quality Control: Comprehensive QC reports with FastQC, MultiQC, and assay-specific metrics
- Flexible Configuration: Extensive parameter customization for each assay type
Supported Assay Types
| Assay Type | Purpose | Key Outputs | Peak Calling |
|---|---|---|---|
| ChIP-seq | Histone modifications, TF binding | BigWig tracks, narrowPeak files | MACS3 |
| ATAC-seq | Chromatin accessibility | BigWig tracks, narrowPeak files | MACS3 |
| CUT&RUN | Low-input chromatin profiling | BigWig tracks, narrowPeak files | MACS3 |
| RNA-seq | Gene expression | BigWig tracks, count matrices | N/A |
| WGS | Genome-wide sequencing | BigWig coverage, variant calls | N/A |
| Ancient DNA | Historical/archaeological samples | BigWig tracks, damage analysis | N/A |
| Nanopore | Long-read sequencing | BigWig tracks, structural variants | N/A |
| PacBio | Long-read sequencing | BigWig tracks, high-accuracy variants | N/A |
Quick Start
1. Installation
# Clone the repository
git clone https://github.com/pgrady1322/uSeq2Tracks.git
cd uSeq2Tracks
# Create the conda environment (bioinformatics tools + dependencies)
conda env create -f envs/useq2tracks.yml
conda activate useq2tracks
# Install the useq2tracks CLI
pip install -e .
# Or install the CLI from PyPI (without pipeline files)
# pip install useq2tracks
2. Configuration
Edit the main configuration file:
# config.yaml
samplesheet: "samples.csv" # Your sample metadata
genome: "/path/to/genome.fa" # Reference genome
genome_id: "galGal6" # REQUIRED: Unique genome identifier
gtf: "/path/to/annotations.gtf" # Gene annotations (optional)
outdir: "./results" # Output directory
rapid_mode: false # Set to true for streamlined processing
3. Sample Sheet Setup
Create a sample sheet describing your data:
Standard Mode Example (comprehensive QC):
sample_id,type,sra_id,read1,read2,experiment_group,replicate_group,condition
sample1,chipseq,SRR123456,,,H3K27ac,replicate1,treatment
sample2,atacseq,,data/sample2_R1.fastq.gz,data/sample2_R2.fastq.gz,accessibility,replicate1,control
sample3,rnaseq,,data/sample3_R1.fastq.gz,data/sample3_R2.fastq.gz,expression,timepoint1,control
Rapid Mode Example (public datasets):
sample_id,type,sra_id,read1,read2,experiment_group,replicate_group,condition
ENCODE_H3K27ac_rep1,chipseq,SRR1536404,,,H3K27ac,ENCSR000EWQ,K562
ENCODE_H3K27ac_rep2,chipseq,SRR1536405,,,H3K27ac,ENCSR000EWQ,K562
ENCODE_Input,chipseq,SRR1536406,,,H3K27ac,input,input
4. Run the Pipeline
# ── Using the useq2tracks CLI (recommended) ──
# Snakemake (default engine)
useq2tracks run # all cores, config.yaml
useq2tracks run --cores 16 # limit to 16 cores
useq2tracks run --dryrun # preview what will run
useq2tracks run --rapid # skip QC, essential tracks only
useq2tracks run --profile slurm # submit to SLURM
useq2tracks run --resume # resume interrupted run
# Nextflow
useq2tracks run --engine nextflow
useq2tracks run --engine nextflow --profile docker
useq2tracks run --engine nextflow --resume
# Forward extra engine-specific flags after '--'
useq2tracks run -- --printshellcmds # Snakemake shell echo
useq2tracks run --engine nextflow -- -with-tower # Nextflow Tower
# ── Other CLI subcommands ──
# Validate inputs before running
useq2tracks validate samples.csv --configfile config.yaml
# Show pipeline info
useq2tracks info
# Generate a standalone UCSC track hub
useq2tracks hub --genome-id galGal6 --hub-name MyHub \
--hub-short-label "My Hub" --hub-long-label "My Sequencing Hub" \
--genome-name galGal6 --hub-email user@example.com \
--bigwigs *.bw --output-dir ucsc_hub/
# ── Or invoke engines directly (advanced) ──
snakemake --use-conda --cores all
nextflow run main.nf -profile docker --genome_id galGal6
Nextflow Implementation
uSeq2Tracks includes a Nextflow DSL2 pipeline for improved scalability, cloud integration, and HPC support. The Nextflow implementation provides all core functionality with enhanced portability and parallelization.
Why Use Nextflow?
Advantages:
- Better Parallelization: Automatic task-level optimization
- Improved Resume: More robust caching and resume capability
- Cloud Native: Built-in support for AWS, Google Cloud, Azure
- HPC Ready: First-class SLURM, SGE, PBS support
- Container First: Excellent Docker/Singularity integration
- Portable: Works identically across different systems
Nextflow Quick Start
1. Install Nextflow
curl -s https://get.nextflow.io | bash
sudo mv nextflow /usr/local/bin/
2. Configure Pipeline
Create a params.yaml file:
# Input/output
samplesheet: "samples.csv"
genome: "/path/to/genome.fa"
genome_id: "galGal6" # REQUIRED
outdir: "./results"
# Pipeline mode
rapid_mode: false
# ATAC-seq settings
atacseq:
mapper: "bowtie2"
markdup: false
macs3_opts: "--qval 0.05"
bw_norm: "CPM"
# ChIP-seq settings
chipseq:
mapper: "bowtie2"
control_tag: "input"
markdup: false
# UCSC hub
ucsc:
hub_name: "myHub"
hub_short_label: "My Data"
hub_long_label: "My Sequencing Data Hub"
genome_name: "galGal6"
hub_email: "pgrady1322@gmail.com"
3. Run Nextflow Pipeline
# Navigate to nextflow directory
cd nextflow/
# With Docker (local)
nextflow run main.nf -profile docker -params-file params.yaml
# With Singularity (HPC)
nextflow run main.nf -profile singularity -params-file params.yaml
# With SLURM + Singularity
nextflow run main.nf -profile slurm,singularity -params-file params.yaml
# Rapid mode
nextflow run main.nf -profile rapid,docker -params-file params.yaml
# Resume failed run
nextflow run main.nf -profile docker -resume
Nextflow Features
Implemented Workflows:
- ATAC-seq workflow with Tn5 shift correction
- ChIP-seq workflow with control matching
- CUT&RUN workflow
- Genome preparation and indexing
- UCSC track hub generation
- Samplesheet validation
- Dynamic resource allocation
- Multiple execution profiles
Snakemake-only (not yet ported to Nextflow):
- RNA-seq workflow
- WGS workflow
- Long-read workflows (Nanopore, PacBio)
- Ancient DNA workflow
- SRA download integration
- FastQC/MultiQC integration
- Replicate merging
Nextflow Profiles
Container Profiles:
- `docker`: Use Docker containers (recommended for local)
- `singularity`: Use Singularity containers (recommended for HPC)
- `conda`: Use Conda environments
Executor Profiles:
- `local`: Run on local machine (default)
- `slurm`: Submit to SLURM scheduler
- `sge`: Submit to SGE scheduler
- `pbs`: Submit to PBS scheduler
Special Profiles:
- `test`: Run with minimal test dataset
- `rapid`: Skip QC, generate essential tracks only
Combine profiles with commas:
nextflow run main.nf -profile slurm,singularity # SLURM + Singularity
nextflow run main.nf -profile docker,rapid # Docker + Rapid mode
Nextflow Configuration Example
# params.yaml for Nextflow
samplesheet: "samples.csv"
genome: "/path/to/genome.fa"
genome_id: "galGal6"
# Resource limits
max_cpus: 64
max_memory: "256.GB"
max_time: "240.h"
# ATAC-seq parameters
atacseq:
mapper: "bowtie2"
bowtie2_opts: "--very-sensitive"
markdup: false
shift: -75
extsize: 150
bw_norm: "CPM"
# ChIP-seq parameters
chipseq:
mapper: "bowtie2"
control_tag: "input"
macs3_opts: "--qval 0.05 --keep-dup all"
bw_norm: "CPM"
Nextflow Output Structure
results/
└── galGal6/                        # Genome ID
    ├── atacseq/
    │   ├── bam/
    │   ├── peaks/
    │   └── bigwig/
    ├── chipseq/
    │   ├── bam/
    │   ├── peaks/
    │   └── bigwig/
    ├── ucsc/
    │   ├── hub.txt
    │   ├── genomes.txt
    │   └── trackDb.txt
    └── pipeline_info/              # Execution reports
        ├── execution_report.html
        ├── execution_timeline.html
        └── execution_trace.txt
Nextflow Quick Reference
Basic Commands:
# Run pipeline
nextflow run main.nf -profile docker -params-file params.yaml
# Resume failed run
nextflow run main.nf -profile docker -resume
# Override parameters
nextflow run main.nf --genome my_genome.fa --genome_id myGenome
# Generate reports
nextflow run main.nf -with-report -with-timeline -with-dag
# Limit concurrent jobs
nextflow run main.nf -profile slurm -qs 50
Troubleshooting:
# Check configuration
nextflow config main.nf
# Test run
nextflow run main.nf -profile test,docker
# Clean and restart
rm -rf work/ && nextflow run main.nf
Choosing Between Snakemake and Nextflow
| Feature | Snakemake | Nextflow |
|---|---|---|
| Maturity | Stable, feature-complete | Stable (core assays) |
| Learning Curve | Python-based (easier for most) | Groovy-based |
| Parallelization | Good | Better (automatic) |
| Cloud Support | Via plugins | Native |
| HPC Integration | Good | Excellent |
| Resume Capability | Good | Excellent |
| Container Support | Good | Excellent |
| Best For | General bioinformatics | HPC/Cloud deployments |
Recommendation:
- Use Snakemake for: All assay types, Python familiarity, full feature set
- Use Nextflow for: HPC/cloud environments, containerized execution, core epigenomic assays
Nextflow Documentation
For complete Nextflow documentation, see:
- `nextflow/QUICK_REFERENCE.md`: Quick command reference
- `nextflow/examples/`: Example configurations
Pipeline Modes
Standard Mode (Default)
Purpose: Comprehensive analysis with full quality control
Features:
- Complete FastQC reports for all samples
- Adapter trimming with detailed summaries
- MultiQC comprehensive aggregate report
- Comprehensive track generation
- Parameter sweep analysis (optional)
- Replicate-merged composite tracks
- Detailed QC metrics and visualizations
- Additional Genrich outputs (when enabled)
Pipeline Components:
- FastQC quality reports for raw reads
- Adapter trimming outputs and summaries
- MultiQC aggregated QC dashboard
- Primary pipeline outputs (BigWig, BAM, peaks)
- Composite tracks for replicate groups
- Parameter sweep comparisons
- Complete UCSC hub files
Use Cases:
- Novel experimental datasets requiring comprehensive QC
- Unknown sample quality scenarios
- Publication-ready analysis with full documentation
- Parameter optimization studies
- Research datasets needing a complete audit trail
Output Structure:
results/{genome_id}/
├── qc/
│   ├── fastqc/                    # Individual FastQC reports
│   └── multiqc_report.html
├── trimmed/                       # Adapter trimming outputs
├── {assay}/                       # Assay-specific tracks
├── ucsc/
│   ├── hub.txt
│   └── composite_trackDb.txt
└── genrich_sweep/                 # Parameter sweeps (if enabled)
Rapid Mode
Purpose: Streamlined processing for high-confidence datasets
Features:
- Essential track generation only
- QC reports skipped (FastQC, MultiQC)
- No adapter trimming summaries
- No composite tracks for replicate groups
- No parameter sweep outputs
- No additional Genrich outputs
- Core outputs: BigWig tracks, peak calls, basic UCSC hubs
- 30-50% faster processing
- Reduced storage footprint
- Simplified output structure
Pipeline Components:
- SRA downloads (if needed)
- Genome indexing
- Primary alignment and track generation
- Essential peak calling
- Basic UCSC hub files (no composites)
- Rapid completion tracking
Use Cases:
- Public datasets (ENCODE, TCGA, GEO) with known quality
- Quick browser track generation for visualization
- Streamlined processing when QC is unnecessary
- Fast turnaround for track sharing
- Time-sensitive analyses
Output Structure:
results/{genome_id}/
├── {assay}/                       # Essential tracks only
├── ucsc/
│   └── hub.txt                    # Basic hub (no composites)
└── rapid/
    └── rapid_tracks_complete.txt  # Completion summary
Activation:
rapid_mode: true
Configuration Example:
# For public ENCODE datasets
genome_id: "hg38_ENCODE"
samplesheet: "encode_samples.csv"
rapid_mode: true # Enable rapid mode
genome: "/data/hg38.fa"
# Pipeline will skip:
# - FastQC reports
# - MultiQC aggregation
# - Adapter trimming summaries
# - Composite track generation
Mode Comparison Table
| Feature | Standard Mode | Rapid Mode |
|---|---|---|
| FastQC Reports | Yes | Skipped |
| MultiQC Dashboard | Yes | Skipped |
| Adapter Trimming | Full summaries | No summaries |
| BigWig Tracks | Yes | Yes |
| Peak Calling | Yes | Yes |
| UCSC Hub | With composites | Basic only |
| Parameter Sweeps | Optional | Skipped |
| Composite Tracks | Yes | Skipped |
| Processing Time | Baseline | 30-50% faster |
| Storage Usage | Full | Reduced |
| Best For | Research data | Public datasets |
Backward Compatibility
- Default behavior unchanged: `rapid_mode` defaults to `false`
- Existing configs work: No breaking changes to current setups
- Progressive enhancement: Rapid mode is an opt-in feature
- Output tagging: All outputs tagged with `genome_id` regardless of mode
Benefits of Dual-Mode Architecture
Rapid Mode Benefits:
- 30-50% faster processing for public datasets
- Reduced storage footprint without QC intermediates
- Cleaner output focused on essential browser tracks
- Quick turnaround for time-sensitive visualization needs
- Streamlined workflows for known high-quality data
Standard Mode Benefits:
- Complete audit trail for research datasets
- Comprehensive quality assessment for novel samples
- Publication-ready with full documentation
- Deep quality insights via MultiQC aggregation
- Parameter optimization capabilities
Flexibility Advantages:
- Switch between modes based on data source
- Mix rapid and standard processing in same project
- Maintain quality standards while optimizing efficiency
- Preserve comprehensive analysis when needed
Output Structure
The pipeline generates genome-tagged outputs with the following structure:
Standard Mode Output (rapid_mode: false)
results/
└── {genome_id}/                       # e.g., galGal6/
    ├── genome/                        # Genome indices
    │   ├── genome.fa                  # Reference genome
    │   ├── star/                      # STAR index
    │   ├── bowtie2/                   # Bowtie2 index
    │   └── bwa_mem2/                  # BWA-MEM2 index
    ├── qc/                            # Quality control
    │   ├── fastqc/                    # FastQC reports (HTML + zip)
    │   │   ├── sample1_fastqc.html
    │   │   └── sample1_fastqc.zip
    │   └── multiqc_report.html        # Aggregated QC report
    ├── trimmed/                       # Adapter trimming (if enabled)
    │   ├── sample1_R1_trimmed.fq.gz
    │   ├── sample1_R2_trimmed.fq.gz
    │   └── trimming_reports/
    ├── {assay_type}/                  # Per-assay outputs
    │   ├── bam/                       # Aligned reads
    │   │   ├── sample1.sorted.bam
    │   │   └── sample1.sorted.bam.bai
    │   ├── bigwig/                    # Coverage tracks
    │   │   ├── sample1.bw
    │   │   └── merged_replicates.bw
    │   └── peaks/                     # Peak calls (when applicable)
    │       ├── sample1_peaks.narrowPeak
    │       └── merged_peaks.bed
    ├── ucsc/                          # UCSC track hubs
    │   ├── hub.txt                    # Main hub file
    │   ├── genomes.txt                # Genome specification
    │   ├── trackDb.txt                # Track definitions
    │   └── composite_trackDb.txt      # Composite track definitions
    ├── genrich_sweep/                 # Parameter sweeps (if enabled)
    │   ├── qval_0.05/
    │   ├── qval_0.01/
    │   └── qval_0.001/
    └── logs/                          # Pipeline logs
        ├── alignment/
        ├── peak_calling/
        └── track_generation/
Rapid Mode Output (rapid_mode: true)
results/
└── {genome_id}/                       # e.g., galGal6/
    ├── genome/                        # Genome indices (same as standard)
    │   ├── genome.fa
    │   ├── star/
    │   ├── bowtie2/
    │   └── bwa_mem2/
    ├── {assay_type}/                  # Essential tracks only
    │   ├── bam/                       # Aligned reads
    │   │   └── sample1.sorted.bam
    │   ├── bigwig/                    # Coverage tracks
    │   │   └── sample1.bw
    │   └── peaks/                     # Peak calls (when applicable)
    │       └── sample1_peaks.narrowPeak
    ├── ucsc/                          # Basic UCSC hub
    │   ├── hub.txt                    # Main hub file
    │   ├── genomes.txt                # Genome specification
    │   └── trackDb.txt                # Track definitions (no composites)
    ├── rapid/                         # Rapid mode tracking
    │   └── rapid_tracks_complete.txt  # Completion summary with stats
    └── logs/                          # Pipeline logs
        ├── alignment/
        ├── peak_calling/
        └── track_generation/
Output Differences Summary
| Output Component | Standard Mode | Rapid Mode |
|---|---|---|
| QC Reports | Full FastQC + MultiQC | Skipped |
| Trimming Outputs | Detailed summaries | No summaries |
| Composite Tracks | Generated | Skipped |
| Parameter Sweeps | Optional | Skipped |
| Hub Complexity | With composites | Basic only |
| Storage Footprint | Full | ~30-40% smaller |
| Completion Marker | Standard | rapid_tracks_complete.txt |
Key Output Files
BigWig Tracks (.bw):
- Genome-wide coverage tracks for UCSC browser
- Normalized by CPM, RPKM, or custom methods
- Strand-specific for RNA-seq (when applicable)
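CPM, the default normalization named above, scales raw coverage by reads-per-million. A minimal sketch of that arithmetic (illustrative only, not the pipeline's own track-generation code):

```python
def cpm_scale_factor(total_mapped_reads):
    """Scale factor s such that raw per-base counts * s = counts per million."""
    if total_mapped_reads <= 0:
        raise ValueError("total_mapped_reads must be positive")
    return 1_000_000 / total_mapped_reads

# A library with 25 M mapped reads: each raw count contributes 0.04 CPM,
# so two samples of different depth become directly comparable.
factor = cpm_scale_factor(25_000_000)
assert abs(factor - 0.04) < 1e-12
```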
Peak Files (.narrowPeak, .broadPeak):
- BED-format files with enriched regions
- MACS3 output with q-values and fold-enrichment
- Optional Genrich peaks for ATAC-seq
BAM Files (.bam):
- Sorted and indexed aligned reads
- Optional duplicate marking
- Quality filtered (MAPQ thresholds applied)
UCSC Hub Files:
- `hub.txt`: Hub metadata and contact info
- `genomes.txt`: Genome assembly specifications
- `trackDb.txt`: Individual track configurations
- `composite_trackDb.txt`: Grouped track configurations (standard mode only)
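These files follow the plain-text UCSC track hub format. A minimal three-file hub can be sketched as below; the `useq2tracks hub` subcommand is the supported generator, and the labels and layout here are illustrative:

```python
from pathlib import Path

def write_minimal_hub(outdir, genome_id, bigwig_names, email="user@example.com"):
    """Write the three-file skeleton of a UCSC track hub."""
    out = Path(outdir)
    out.mkdir(parents=True, exist_ok=True)
    # hub.txt: hub metadata and pointer to the genomes file
    (out / "hub.txt").write_text(
        "hub myHub\nshortLabel My Data\nlongLabel My Sequencing Data Hub\n"
        f"genomesFile genomes.txt\nemail {email}\n"
    )
    # genomes.txt: one stanza per assembly, pointing at its trackDb
    (out / "genomes.txt").write_text(f"genome {genome_id}\ntrackDb trackDb.txt\n")
    # trackDb.txt: one stanza per bigWig track
    stanzas = []
    for name in bigwig_names:
        track = name.removesuffix(".bw")
        stanzas.append(
            f"track {track}\nbigDataUrl {name}\nshortLabel {track}\n"
            f"longLabel {track} coverage\ntype bigWig\nvisibility full\n"
        )
    (out / "trackDb.txt").write_text("\n".join(stanzas))

write_minimal_hub("ucsc_hub", "galGal6", ["sample1.bw"])
```

The resulting `ucsc_hub/` directory can be served from any web-accessible location and loaded by URL in the browser.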
Configuration Details
Required Parameters
# Essential settings - must be configured
samplesheet: "samples.csv" # Sample metadata file
genome: "/path/to/genome.fa" # Reference genome FASTA
genome_id: "your_genome_id" # REQUIRED: Unique identifier
outdir: "./results" # Output directory
Sample Sheet Format
| Column | Description | Example | Required |
|---|---|---|---|
| `sample_id` | Unique sample identifier | `H3K27ac_rep1` | Yes |
| `type` | Assay type | `chipseq`, `atacseq`, `rnaseq` | Yes |
| `sra_id` | SRA accession (if downloading) | `SRR123456` | If no local files |
| `read1` | Path to R1 FASTQ | `data/sample_R1.fastq.gz` | If no SRA |
| `read2` | Path to R2 FASTQ | `data/sample_R2.fastq.gz` | For paired-end |
| `experiment_group` | Experimental grouping | `H3K27ac`, `timepoint1` | No |
| `replicate_group` | Replicate grouping | `replicate1` | No |
| `condition` | Sample condition | `treatment`, `control` | No |
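The rule in the Required column, that every sample needs either an SRA accession or local FASTQ paths, can be checked before launching anything. A rough sketch of such a check (the shipped `useq2tracks validate` command is the supported way to do this; function and message wording are illustrative):

```python
import csv
import io

REQUIRED = {"sample_id", "type"}

def check_samplesheet(text):
    """Return a list of human-readable problems; an empty list means the sheet passes."""
    rows = list(csv.DictReader(io.StringIO(text)))
    problems = []
    if rows and not REQUIRED.issubset(rows[0].keys()):
        problems.append(f"missing required columns: {REQUIRED - set(rows[0])}")
        return problems
    for i, row in enumerate(rows, start=2):  # line 1 is the header
        if not row["sample_id"]:
            problems.append(f"line {i}: empty sample_id")
        if not (row.get("sra_id") or row.get("read1")):
            problems.append(f"line {i}: need sra_id or read1")
    return problems

sheet = (
    "sample_id,type,sra_id,read1,read2,experiment_group,replicate_group,condition\n"
    "sample1,chipseq,SRR123456,,,H3K27ac,replicate1,treatment\n"
    "bad_sample,atacseq,,,,accessibility,replicate1,control\n"
)
print(check_samplesheet(sheet))  # flags the second data row: no sra_id and no read1
```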
Assay-Specific Parameters
ChIP-seq Configuration
chipseq:
mapper: "bowtie2" # bowtie2 or bwa_mem2
bowtie2_opts: "--very-sensitive"
markdup: false # Mark duplicates
macs3_opts: "--qval 0.05 --keep-dup all"
control_tag: "input" # Identify control samples
bw_norm: "CPM" # BigWig normalization
ATAC-seq Configuration
atacseq:
mapper: "bowtie2"
markdup: false # Recommended: false for accessibility
shift: -75 # MACS3 shift for ATAC-seq
extsize: 150 # MACS3 extension size
macs3_opts: "--qval 0.05"
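The `shift`/`extsize` pair above follows the common ATAC-seq convention shift = -extsize/2, which centers a smoothing window on each Tn5 cut site rather than extending downstream of it. A small sketch of that arithmetic (not MACS3's actual code, and ignoring strand handling):

```python
def pileup_window(cut_site, shift=-75, extsize=150):
    """Interval a read's 5' end is piled up over under --shift/--extsize."""
    start = cut_site + shift
    return start, start + extsize

# A cut at position 1000 contributes coverage over [925, 1075):
# a 150 bp window centered on the cut site itself.
start, end = pileup_window(1000)
assert (start, end) == (925, 1075)
assert (start + end) // 2 == 1000
```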
RNA-seq Configuration
rnaseq:
mapper: "star" # star or hisat2
star_opts: "--outFilterMultimapNmax 20"
strand_specific: false
gene_bed: "genes.bed" # For QC analysis
Advanced Features
Parameter Sweep
Test multiple peak-calling thresholds:
parameter_sweep:
enabled: true
qvalues: [0.05, 0.01, 0.005, 0.001]
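Conceptually, a sweep just repeats peak calling once per threshold, writing each result into its own qval-tagged directory. A hypothetical sketch of how those invocations could be enumerated (the command template and output paths are assumptions, not the pipeline's actual rule):

```python
QVALUES = [0.05, 0.01, 0.005, 0.001]

def sweep_commands(sample, qvalues=QVALUES):
    """One MACS3 invocation per q-value, each writing to its own directory."""
    return [
        f"macs3 callpeak -t {sample}.bam -n {sample} "
        f"--qval {q} --outdir genrich_sweep/qval_{q}"
        for q in qvalues
    ]

for cmd in sweep_commands("sample1"):
    print(cmd)
```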
Adapter Trimming
Enable quality-based trimming:
adapter_trimming:
enabled: true
min_length: 20
quality_cutoff: 20
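The two parameters map onto the classic trimming rule: clip low-quality bases from the 3' end, then discard reads that end up shorter than `min_length`. A toy illustration of that rule (the real trimming is done by fastp, whose algorithm is more sophisticated than this):

```python
def quality_trim(seq, quals, quality_cutoff=20, min_length=20):
    """Trim 3' bases below the cutoff; return None if the read gets too short."""
    end = len(seq)
    while end > 0 and quals[end - 1] < quality_cutoff:
        end -= 1
    trimmed = seq[:end]
    return trimmed if len(trimmed) >= min_length else None

read = "ACGT" * 7                   # 28 bp read
quals = [30] * 24 + [10, 8, 5, 2]   # last four bases are low quality
assert quality_trim(read, quals) == read[:24]       # tail clipped
assert quality_trim("ACGTACGT", [5] * 8) is None    # too short after trimming
```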
Alternative Peak Callers
Enable Genrich for ATAC-seq:
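The configuration block for this option is not shown on this page; a plausible shape, mirroring the other assay sections (these key names are assumptions, not the pipeline's documented schema; `-j` and `-q` are real Genrich flags for ATAC mode and q-value cutoff):

```yaml
atacseq:
  genrich:
    enabled: true          # run Genrich alongside MACS3
    opts: "-j -q 0.05"     # ATAC mode with a q-value cutoff
```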
Computational Requirements
Resource Recommendations
| Dataset Size | CPU Cores | Memory | Storage | Time Estimate |
|--------------|-----------|---------|---------|---------------|
| Small (< 10 samples) | 8-16 | 32 GB | 100 GB | 2-6 hours |
| Medium (10-50 samples) | 16-32 | 64 GB | 500 GB | 6-24 hours |
| Large (50+ samples) | 32-64 | 128 GB | 1 TB+ | 1-3 days |
Cluster Configuration
The pipeline includes SLURM integration via `Executor.sh`. Customize for your cluster:
#SBATCH --job-name=uSeq2Tracks
#SBATCH --partition=general
#SBATCH --qos=general
#SBATCH --mem=10g
#SBATCH --cpus-per-task=4
Detailed Workflows
ChIP-seq/ATAC-seq/CUT&RUN Workflow
- Quality Control: FastQC analysis of raw reads
- Adapter Trimming: Optional quality-based trimming with fastp
- Genome Indexing: Build mapper-specific indices (Bowtie2/BWA-MEM2)
- Read Mapping: Align reads to reference genome
- Post-processing: Sort, index, optional duplicate marking
- Coverage Tracks: Generate normalized BigWig files
- Peak Calling: Identify enriched regions with MACS3
- Quality Metrics: Generate assay-specific QC reports
RNA-seq Workflow
- Quality Control: FastQC analysis of raw reads
- Adapter Trimming: Optional preprocessing with fastp
- Genome Indexing: Build STAR or HISAT2 indices
- Read Mapping: Splice-aware alignment to reference
- Quantification: Generate gene-level count matrices
- Coverage Tracks: Create strand-specific BigWig files
- Quality Assessment: RNA-seq specific QC with RSeQC
Long-read Workflow (Nanopore/PacBio)
- Quality Assessment: Basic statistics and length distributions
- Genome Indexing: Build Minimap2 index
- Read Mapping: Long-read aware alignment
- Coverage Analysis: Generate coverage tracks
- Variant Calling: Optional structural variant detection
UCSC Browser Integration
Automatic Track Hub Generation
uSeq2Tracks automatically creates UCSC-compatible track hubs:
Hub Structure:
ucsc/
├── hub.txt          # Hub metadata
├── genomes.txt      # Genome definitions
└── trackDb.txt      # Track configurations
Track Organization:
- Composite Tracks: Group related samples (e.g., same experiment)
- Subgroups: Organize by condition, replicate, timepoint
- Color Coding: Consistent color schemes per assay type
- Metadata Integration: Sample information in track descriptions
Loading in UCSC Browser
1. Upload the track hub files to a web-accessible location
2. In the UCSC Browser, open My Data → Track Hubs → My Hubs
3. Enter the hub URL: `https://your-server.com/path/to/ucsc/hub.txt`
4. Browse your data with full metadata integration
Quality Control Features
Standard QC Reports
- FastQC: Per-sample quality metrics
- MultiQC: Aggregated quality dashboard
- Mapping Statistics: Alignment rates and quality scores
- Library Complexity: Duplication rates and insert sizes
Assay-Specific QC
- ChIP-seq: Fragment length distributions, enrichment metrics
- ATAC-seq: TSS enrichment, fragment size profiles
- RNA-seq: Gene body coverage, junction analysis
- WGS: Coverage uniformity, variant quality metrics
Quality Thresholds
The pipeline includes built-in quality checks:
- Minimum mapping rates
- Fragment count requirements for peak calling
- Insert size validation
- Strand specificity assessment
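A built-in check like the mapping-rate floor above might look like the sketch below; the threshold value and function name are illustrative, not the pipeline's actual defaults:

```python
def passes_mapping_rate(mapped, total, min_rate=0.7):
    """Flag samples whose alignment rate falls below a floor."""
    if total == 0:
        return False
    return mapped / total >= min_rate

assert passes_mapping_rate(9_000_000, 10_000_000)        # 90% mapped: pass
assert not passes_mapping_rate(5_000_000, 10_000_000)    # 50% mapped: fail
```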
Troubleshooting
Common Issues
Genome ID Not Set
ERROR: genome_id must be set in config.yaml
Solution: Add genome_id: "your_genome" to config.yaml
Sample Sheet Formatting
ERROR: Missing required columns in sample sheet
Solution: Ensure sample sheet includes sample_id, type, and either sra_id or local file paths
Memory Issues
ERROR: Job exceeded memory limit
Solution: Increase memory allocation in config.yaml:
memory:
large: 128000 # Increase for memory-intensive jobs
Disk Space
ERROR: No space left on device
Solution:
- Clean up intermediate files: `snakemake --delete-temp-output`
- Use scratch storage for temporary files
- Monitor disk usage during execution
Performance Optimization
Speed Up Processing
- Enable Rapid Mode for public datasets
- Increase Parallelization: Use more `--jobs` in Snakemake
- Use SSDs for scratch space
- Optimize Resource Allocation: Match CPU/memory to job requirements
Reduce Storage
- Delete Intermediate Files: Use `--delete-temp-output`
- Compress Outputs: Enable compression for BAM files
- Archive Unused Data: Move completed analyses to long-term storage
Examples
Example 1: ENCODE ChIP-seq Analysis (Rapid Mode)
Scenario: Processing public ENCODE data for quick visualization
# config.yaml
genome_id: "hg38_ENCODE_H3K27ac"
genome: "/data/genomes/hg38.fa"
samplesheet: "encode_chipseq.csv"
rapid_mode: true # Skip QC for public data
chipseq:
mapper: "bowtie2"
macs3_opts: "--qval 0.01"
# encode_chipseq.csv
sample_id,type,sra_id,read1,read2,experiment_group,replicate_group,condition
ENCSR000EWQ_rep1,chipseq,SRR1536404,,,H3K27ac,ENCSR000EWQ,K562
ENCSR000EWQ_rep2,chipseq,SRR1536405,,,H3K27ac,ENCSR000EWQ,K562
ENCSR000EWQ_input,chipseq,SRR1536406,,,H3K27ac,input,input
Expected Output:
results/hg38_ENCODE_H3K27ac/
├── chipseq/
│   ├── bam/
│   ├── bigwig/
│   └── peaks/
├── ucsc/
│   └── hub.txt
└── rapid/
    └── rapid_tracks_complete.txt
Processing Time: ~2-3 hours (vs 4-6 hours in standard mode)
Example 2: Multi-assay Developmental Study (Standard Mode)
Scenario: Novel experimental data requiring comprehensive QC
# config.yaml
genome_id: "mm10_development"
genome: "/data/genomes/mm10.fa"
gtf: "/data/annotations/mm10.gtf"
samplesheet: "development_study.csv"
rapid_mode: false # Full QC for novel data
parameter_sweep:
enabled: true
qvalues: [0.1, 0.05, 0.01, 0.001]
adapter_trimming:
enabled: true
min_length: 20
# development_study.csv
sample_id,type,sra_id,read1,read2,experiment_group,replicate_group,condition
E10_ATAC_rep1,atacseq,,data/E10_ATAC_1_R1.fq.gz,data/E10_ATAC_1_R2.fq.gz,E10,ATAC_rep1,E10
E10_ATAC_rep2,atacseq,,data/E10_ATAC_2_R1.fq.gz,data/E10_ATAC_2_R2.fq.gz,E10,ATAC_rep2,E10
E10_RNA_rep1,rnaseq,,data/E10_RNA_1_R1.fq.gz,data/E10_RNA_1_R2.fq.gz,E10,RNA_rep1,E10
E12_ATAC_rep1,atacseq,,data/E12_ATAC_1_R1.fq.gz,data/E12_ATAC_1_R2.fq.gz,E12,ATAC_rep1,E12
E12_RNA_rep1,rnaseq,,data/E12_RNA_1_R1.fq.gz,data/E12_RNA_1_R2.fq.gz,E12,RNA_rep1,E12
Expected Output:
results/mm10_development/
├── qc/
│   ├── fastqc/
│   └── multiqc_report.html
├── trimmed/
├── atacseq/
├── rnaseq/
├── ucsc/
│   ├── hub.txt
│   └── composite_trackDb.txt
└── genrich_sweep/
Processing Time: ~8-12 hours with full QC
Example 3: TCGA Cancer Atlas (Rapid Mode)
Scenario: Rapid processing of TCGA RNA-seq data
# config.yaml
genome_id: "hg38_TCGA_BRCA"
genome: "/data/genomes/hg38.fa"
gtf: "/data/annotations/gencode.v38.gtf"
samplesheet: "tcga_rnaseq.csv"
rapid_mode: true # Fast track generation
rnaseq:
mapper: "star"
strand_specific: true
# tcga_rnaseq.csv
sample_id,type,sra_id,read1,read2,experiment_group,replicate_group,condition
TCGA_BRCA_01,rnaseq,SRR8494716,,,BRCA,tumor_01,tumor
TCGA_BRCA_02,rnaseq,SRR8494717,,,BRCA,tumor_02,tumor
TCGA_BRCA_normal,rnaseq,SRR8494718,,,BRCA,normal_01,normal
Benefits:
- ~40% faster processing
- ~50% less storage (no QC intermediates)
- Clean output for browser visualization
Example 4: Ancient DNA Analysis (Standard Mode)
Scenario: Archaeological samples requiring quality validation
# config.yaml
genome_id: "ancientDNA_sample"
genome: "/data/reference/ancient_ref.fa"
samplesheet: "ancient_samples.csv"
rapid_mode: false # Need full QC for damage assessment
ancientdna:
mapper: "bwa_aln"
markdup: true
damage_analysis: true
min_mapq: 30
Why Standard Mode:
- Ancient DNA has unique quality issues (damage patterns)
- Need comprehensive QC to assess sample preservation
- Publication requires full quality documentation
Example 5: Mixed-Mode Project
Scenario: Combining public and experimental data
# config_public.yaml (rapid mode)
genome_id: "mm10_public_controls"
samplesheet: "public_controls.csv"
rapid_mode: true
# config_experimental.yaml (standard mode)
genome_id: "mm10_experimental"
samplesheet: "experimental_samples.csv"
rapid_mode: false
Workflow:
- Process public controls rapidly for quick validation
- Process experimental data with full QC
- Combine tracks in single UCSC hub
- Maintain appropriate quality standards for each dataset type
Contributing
Contributions are welcome! To get started:
git clone https://github.com/pgrady1322/uSeq2Tracks.git
cd uSeq2Tracks
conda env create -f envs/useq2tracks.yml
conda activate useq2tracks
pip install -e ".[dev]"
make test # run tests
make check # lint + format check
License
This project is licensed under the MIT License - see the LICENSE file for details.
Support
- Documentation: GitHub Wiki
- Issues: GitHub Issues
- Discussions: GitHub Discussions
Citation
If you use uSeq2Tracks in your research, please cite:
Patrick Grady (2026). uSeq2Tracks: A universal pipeline for sequencing
data to genome browser tracks. https://github.com/pgrady1322/uSeq2Tracks
Acknowledgments
- Snakemake Community: For the excellent workflow management system
- Bioconda: For streamlined software distribution
- UCSC Genome Browser: For track hub specifications
- Tool Developers: FastQC, MultiQC, STAR, Bowtie2, BWA, MACS3, and all other integrated tools
uSeq2Tracks: From raw sequencing data to publication-ready browser tracks in one streamlined workflow.