
uSeq2Tracks: Universal Sequencing to Browser Tracks Pipeline


A comprehensive pipeline for processing diverse sequencing datasets and generating standardized genomic tracks for UCSC Genome Browser visualization. uSeq2Tracks handles everything from raw sequencing data to publication-ready browser tracks with minimal user intervention.

Available in two implementations:

  • Snakemake — stable, feature-complete
  • Nextflow DSL2 — cloud/HPC-optimized with Docker & Singularity support

🌟 Overview

uSeq2Tracks is designed to standardize the processing of heterogeneous sequencing datasets, transforming raw sequencing data into organized, genome browser-ready tracks. Whether you're working with public datasets from ENCODE or your own experimental data, uSeq2Tracks provides a unified workflow that handles quality control, alignment, track generation, and browser hub creation.

Key Features

  • Universal Input Support: Handles both local FASTQ files and SRA accessions
  • Multiple Assay Types: ChIP-seq, ATAC-seq, CUT&RUN, RNA-seq, WGS, Long-reads, etc.
  • Two Processing Modes: Standard (full QC) and Rapid (streamlined for public data)
  • Genome-ID Organization: All outputs tagged with unique genome identifiers
  • UCSC Integration: Automatic track hub generation for browser visualization
  • Quality Control: Comprehensive QC reports with FastQC, MultiQC, and assay-specific metrics
  • Flexible Configuration: Extensive parameter customization for each assay type

🧬 Supported Assay Types

| Assay Type | Purpose | Key Outputs | Peak Calling |
|------------|---------|-------------|--------------|
| ChIP-seq | Histone modifications, TF binding | BigWig tracks, narrowPeak files | MACS3 |
| ATAC-seq | Chromatin accessibility | BigWig tracks, narrowPeak files | MACS3 |
| CUT&RUN | Low-input chromatin profiling | BigWig tracks, narrowPeak files | MACS3 |
| RNA-seq | Gene expression | BigWig tracks, count matrices | N/A |
| WGS | Genome-wide sequencing | BigWig coverage, variant calls | N/A |
| Ancient DNA | Historical/archaeological samples | BigWig tracks, damage analysis | N/A |
| Nanopore | Long-read sequencing | BigWig tracks, structural variants | N/A |
| PacBio | Long-read sequencing | BigWig tracks, high-accuracy variants | N/A |

🚀 Quick Start

1. Installation

# Clone the repository
git clone https://github.com/pgrady1322/uSeq2Tracks.git
cd uSeq2Tracks

# Create the conda environment (bioinformatics tools + dependencies)
conda env create -f envs/useq2tracks.yml
conda activate useq2tracks

# Install the useq2tracks CLI
pip install -e .

# Or install the CLI from PyPI (without pipeline files)
# pip install useq2tracks

2. Configuration

Edit the main configuration file:

# config.yaml
samplesheet: "samples.csv"           # Your sample metadata
genome: "/path/to/genome.fa"         # Reference genome
genome_id: "galGal6"                 # REQUIRED: Unique genome identifier
gtf: "/path/to/annotations.gtf"      # Gene annotations (optional)
outdir: "./results"                  # Output directory
rapid_mode: false                    # Set to true for streamlined processing

3. Sample Sheet Setup

Create a sample sheet describing your data:

Standard Mode Example (comprehensive QC):

sample_id,type,sra_id,read1,read2,experiment_group,replicate_group,condition
sample1,chipseq,SRR123456,,,H3K27ac,replicate1,treatment
sample2,atacseq,,data/sample2_R1.fastq.gz,data/sample2_R2.fastq.gz,accessibility,replicate1,control
sample3,rnaseq,,data/sample3_R1.fastq.gz,data/sample3_R2.fastq.gz,expression,timepoint1,control

Rapid Mode Example (public datasets):

sample_id,type,sra_id,read1,read2,experiment_group,replicate_group,condition
ENCODE_H3K27ac_rep1,chipseq,SRR1536404,,,H3K27ac,ENCSR000EWQ,K562
ENCODE_H3K27ac_rep2,chipseq,SRR1536405,,,H3K27ac,ENCSR000EWQ,K562
ENCODE_Input,chipseq,SRR1536406,,,H3K27ac,input,input

4. Run the Pipeline

# ── Using the useq2tracks CLI (recommended) ──

# Snakemake (default engine)
useq2tracks run                              # all cores, config.yaml
useq2tracks run --cores 16                   # limit to 16 cores
useq2tracks run --dryrun                     # preview what will run
useq2tracks run --rapid                      # skip QC, essential tracks only
useq2tracks run --profile slurm              # submit to SLURM
useq2tracks run --resume                     # resume interrupted run

# Nextflow
useq2tracks run --engine nextflow
useq2tracks run --engine nextflow --profile docker
useq2tracks run --engine nextflow --resume

# Forward extra engine-specific flags after '--'
useq2tracks run -- --printshellcmds          # Snakemake shell echo
useq2tracks run --engine nextflow -- -with-tower   # Nextflow Tower

# ── Other CLI subcommands ──

# Validate inputs before running
useq2tracks validate samples.csv --configfile config.yaml

# Show pipeline info
useq2tracks info

# Generate a standalone UCSC track hub
useq2tracks hub --genome-id galGal6 --hub-name MyHub \
  --hub-short-label "My Hub" --hub-long-label "My Sequencing Hub" \
  --genome-name galGal6 --hub-email user@example.com \
  --bigwigs *.bw --output-dir ucsc_hub/

# ── Or invoke engines directly (advanced) ──
snakemake --use-conda --cores all
nextflow run main.nf -profile docker --genome_id galGal6

🔄 Nextflow Implementation

uSeq2Tracks includes a Nextflow DSL2 pipeline for improved scalability, cloud integration, and HPC support. The Nextflow implementation provides all core functionality with enhanced portability and parallelization.

Why Use Nextflow?

Advantages:

  • ✅ Better Parallelization: Automatic task-level optimization
  • ✅ Improved Resume: More robust caching and resume capability
  • ✅ Cloud Native: Built-in support for AWS, Google Cloud, Azure
  • ✅ HPC Ready: First-class SLURM, SGE, PBS support
  • ✅ Container First: Excellent Docker/Singularity integration
  • ✅ Portable: Works identically across different systems

Nextflow Quick Start

1. Install Nextflow

curl -s https://get.nextflow.io | bash
sudo mv nextflow /usr/local/bin/

2. Configure Pipeline

Create a params.yaml file:

# Input/output
samplesheet: "samples.csv"
genome: "/path/to/genome.fa"
genome_id: "galGal6"  # REQUIRED
outdir: "./results"

# Pipeline mode
rapid_mode: false

# ATAC-seq settings
atacseq:
  mapper: "bowtie2"
  markdup: false
  macs3_opts: "--qval 0.05"
  bw_norm: "CPM"

# ChIP-seq settings
chipseq:
  mapper: "bowtie2"
  control_tag: "input"
  markdup: false

# UCSC hub
ucsc:
  hub_name: "myHub"
  hub_short_label: "My Data"
  hub_long_label: "My Sequencing Data Hub"
  genome_name: "galGal6"
  hub_email: "pgrady1322@gmail.com"

3. Run Nextflow Pipeline

# Navigate to nextflow directory
cd nextflow/

# With Docker (local)
nextflow run main.nf -profile docker -params-file params.yaml

# With Singularity (HPC)
nextflow run main.nf -profile singularity -params-file params.yaml

# With SLURM + Singularity
nextflow run main.nf -profile slurm,singularity -params-file params.yaml

# Rapid mode
nextflow run main.nf -profile rapid,docker -params-file params.yaml

# Resume failed run
nextflow run main.nf -profile docker -resume

Nextflow Features

Implemented Workflows:

  • ✅ ATAC-seq workflow with Tn5 shift correction
  • ✅ ChIP-seq workflow with control matching
  • ✅ CUT&RUN workflow
  • ✅ Genome preparation and indexing
  • ✅ UCSC track hub generation
  • ✅ Samplesheet validation
  • ✅ Dynamic resource allocation
  • ✅ Multiple execution profiles

Snakemake-only (not yet ported to Nextflow):

  • RNA-seq workflow
  • WGS workflow
  • Long-read workflows (Nanopore, PacBio)
  • Ancient DNA workflow
  • SRA download integration
  • FastQC/MultiQC integration
  • Replicate merging

Nextflow Profiles

Container Profiles:

  • docker: Use Docker containers (recommended for local)
  • singularity: Use Singularity containers (recommended for HPC)
  • conda: Use Conda environments

Executor Profiles:

  • local: Run on local machine (default)
  • slurm: Submit to SLURM scheduler
  • sge: Submit to SGE scheduler
  • pbs: Submit to PBS scheduler

Special Profiles:

  • test: Run with minimal test dataset
  • rapid: Skip QC, generate essential tracks only

Combine profiles with commas:

nextflow run main.nf -profile slurm,singularity  # SLURM + Singularity
nextflow run main.nf -profile docker,rapid       # Docker + Rapid mode

Nextflow Configuration Example

# params.yaml for Nextflow
samplesheet: "samples.csv"
genome: "/path/to/genome.fa"
genome_id: "galGal6"

# Resource limits
max_cpus: 64
max_memory: "256.GB"
max_time: "240.h"

# ATAC-seq parameters
atacseq:
  mapper: "bowtie2"
  bowtie2_opts: "--very-sensitive"
  markdup: false
  shift: -75
  extsize: 150
  bw_norm: "CPM"

# ChIP-seq parameters  
chipseq:
  mapper: "bowtie2"
  control_tag: "input"
  macs3_opts: "--qval 0.05 --keep-dup all"
  bw_norm: "CPM"

Nextflow Output Structure

results/
└── galGal6/                    # Genome ID
    ├── atacseq/
    │   ├── bam/
    │   ├── peaks/
    │   └── bigwig/
    ├── chipseq/
    │   ├── bam/
    │   ├── peaks/
    │   └── bigwig/
    ├── ucsc/
    │   ├── hub.txt
    │   ├── genomes.txt
    │   └── trackDb.txt
    └── pipeline_info/          # Execution reports
        ├── execution_report.html
        ├── execution_timeline.html
        └── execution_trace.txt

Nextflow Quick Reference

Basic Commands:

# Run pipeline
nextflow run main.nf -profile docker -params-file params.yaml

# Resume failed run
nextflow run main.nf -profile docker -resume

# Override parameters
nextflow run main.nf --genome my_genome.fa --genome_id myGenome

# Generate reports
nextflow run main.nf -with-report -with-timeline -with-dag

# Limit concurrent jobs
nextflow run main.nf -profile slurm -qs 50

Troubleshooting:

# Check configuration
nextflow config main.nf

# Test run
nextflow run main.nf -profile test,docker

# Clean and restart
rm -rf work/ && nextflow run main.nf

Choosing Between Snakemake and Nextflow

| Feature | Snakemake | Nextflow |
|---------|-----------|----------|
| Maturity | Stable, feature-complete | Stable (core assays) |
| Learning Curve | Python-based (easier for most) | Groovy-based |
| Parallelization | Good | Better (automatic) |
| Cloud Support | Via plugins | Native |
| HPC Integration | Good | Excellent |
| Resume Capability | Good | Excellent |
| Container Support | Good | Excellent |
| Best For | General bioinformatics | HPC/Cloud deployments |

Recommendation:

  • Use Snakemake for: All assay types, Python familiarity, full feature set
  • Use Nextflow for: HPC/cloud environments, containerized execution, core epigenomic assays

Nextflow Documentation

For complete Nextflow documentation, see:

  • nextflow/QUICK_REFERENCE.md โ€” Quick command reference
  • nextflow/examples/ โ€” Example configurations

📊 Pipeline Modes

Standard Mode (Default)

Purpose: Comprehensive analysis with full quality control

Features:

  • ✅ Complete FastQC reports for all samples
  • ✅ Adapter trimming with detailed summaries
  • ✅ MultiQC comprehensive aggregate report
  • ✅ Comprehensive track generation
  • ✅ Parameter sweep analysis (optional)
  • ✅ Replicate-merged composite tracks
  • ✅ Detailed QC metrics and visualizations
  • ✅ Genrich additional outputs (when enabled)

Pipeline Components:

  • FastQC quality reports for raw reads
  • Adapter trimming outputs and summaries
  • MultiQC aggregated QC dashboard
  • Primary pipeline outputs (BigWig, BAM, peaks)
  • Composite tracks for replicate groups
  • Parameter sweep comparisons
  • Complete UCSC hub files

Use Cases:

  • ✨ Novel experimental datasets requiring comprehensive QC
  • ✨ Unknown sample quality scenarios
  • ✨ Publication-ready analysis with full documentation
  • ✨ Parameter optimization studies
  • ✨ Research datasets needing complete audit trail

Output Structure:

results/{genome_id}/
├── qc/
│   ├── fastqc/           # Individual FastQC reports
│   └── multiqc_report.html
├── trimmed/              # Adapter trimming outputs
├── {assay}/              # Assay-specific tracks
├── ucsc/
│   ├── hub.txt
│   └── composite_trackDb.txt
└── genrich_sweep/        # Parameter sweeps (if enabled)

Rapid Mode

Purpose: Streamlined processing for high-confidence datasets

Features:

  • ⚡ Essential track generation only
  • ⚡ Skipped QC reports (FastQC, MultiQC)
  • ⚡ No adapter trimming summaries
  • ⚡ No composite tracks for replicate groups
  • ⚡ No parameter sweep outputs
  • ⚡ No Genrich additional outputs
  • ⚡ Core outputs: BigWig tracks, peak calls, basic UCSC hubs
  • ⚡ 30-50% faster processing
  • ⚡ Reduced storage footprint
  • ⚡ Simplified output structure

Pipeline Components:

  • SRA downloads (if needed)
  • Genome indexing
  • Primary alignment and track generation
  • Essential peak calling
  • Basic UCSC hub files (no composites)
  • Rapid completion tracking

Use Cases:

  • 🚀 Public datasets (ENCODE, TCGA, GEO) with known quality
  • 🚀 Quick browser track generation for visualization
  • 🚀 Streamlined processing when QC is unnecessary
  • 🚀 Fast turnaround for track sharing
  • 🚀 Time-sensitive analyses

Output Structure:

results/{genome_id}/
├── {assay}/              # Essential tracks only
├── ucsc/
│   └── hub.txt           # Basic hub (no composites)
└── rapid/
    └── rapid_tracks_complete.txt  # Completion summary

Activation:

rapid_mode: true

Configuration Example:

# For public ENCODE datasets
genome_id: "hg38_ENCODE"
samplesheet: "encode_samples.csv"
rapid_mode: true              # Enable rapid mode
genome: "/data/hg38.fa"

# Pipeline will skip:
# - FastQC reports
# - MultiQC aggregation
# - Adapter trimming summaries
# - Composite track generation

Mode Comparison Table

| Feature | Standard Mode | Rapid Mode |
|---------|---------------|------------|
| FastQC Reports | ✅ Yes | ❌ Skipped |
| MultiQC Dashboard | ✅ Yes | ❌ Skipped |
| Adapter Trimming | ✅ Full summaries | ❌ No summaries |
| BigWig Tracks | ✅ Yes | ✅ Yes |
| Peak Calling | ✅ Yes | ✅ Yes |
| UCSC Hub | ✅ With composites | ✅ Basic only |
| Parameter Sweeps | ✅ Optional | ❌ Skipped |
| Composite Tracks | ✅ Yes | ❌ Skipped |
| Processing Time | Baseline | 30-50% faster |
| Storage Usage | Full | Reduced |
| Best For | Research data | Public datasets |

Backward Compatibility

  • Default behavior unchanged: rapid_mode defaults to false
  • Existing configs work: No breaking changes to current setups
  • Progressive enhancement: Rapid mode is opt-in feature
  • Output tagging: All outputs tagged with genome_id regardless of mode

✨ Benefits of Dual-Mode Architecture

Rapid Mode Benefits:

  1. ⚡ 30-50% faster processing for public datasets
  2. 💾 Reduced storage footprint without QC intermediates
  3. 🎯 Cleaner output focused on essential browser tracks
  4. 🚀 Quick turnaround for time-sensitive visualization needs
  5. 📊 Streamlined workflows for known high-quality data

Standard Mode Benefits:

  1. 📈 Complete audit trail for research datasets
  2. 🔬 Comprehensive quality assessment for novel samples
  3. 📚 Publication-ready with full documentation
  4. 🔍 Deep quality insights via MultiQC aggregation
  5. 🎛️ Parameter optimization capabilities

Flexibility Advantages:

  • Switch between modes based on data source
  • Mix rapid and standard processing in same project
  • Maintain quality standards while optimizing efficiency
  • Preserve comprehensive analysis when needed

๐Ÿ“ Output Structure

The pipeline generates genome-tagged outputs with the following structure:

Standard Mode Output (rapid_mode: false)

results/
└── {genome_id}/                    # e.g., galGal6/
    ├── genome/                     # Genome indices
    │   ├── genome.fa               # Reference genome
    │   ├── star/                   # STAR index
    │   ├── bowtie2/                # Bowtie2 index
    │   └── bwa_mem2/               # BWA-MEM2 index
    ├── qc/                         # Quality control
    │   ├── fastqc/                 # FastQC reports (HTML + zip)
    │   │   ├── sample1_fastqc.html
    │   │   └── sample1_fastqc.zip
    │   └── multiqc_report.html     # Aggregated QC report
    ├── trimmed/                    # Adapter trimming (if enabled)
    │   ├── sample1_R1_trimmed.fq.gz
    │   ├── sample1_R2_trimmed.fq.gz
    │   └── trimming_reports/
    ├── {assay_type}/               # Per-assay outputs
    │   ├── bam/                    # Aligned reads
    │   │   ├── sample1.sorted.bam
    │   │   └── sample1.sorted.bam.bai
    │   ├── bigwig/                 # Coverage tracks
    │   │   ├── sample1.bw
    │   │   └── merged_replicates.bw
    │   └── peaks/                  # Peak calls (when applicable)
    │       ├── sample1_peaks.narrowPeak
    │       └── merged_peaks.bed
    ├── ucsc/                       # UCSC track hubs
    │   ├── hub.txt                 # Main hub file
    │   ├── genomes.txt             # Genome specification
    │   ├── trackDb.txt             # Track definitions
    │   └── composite_trackDb.txt   # Composite track definitions
    ├── genrich_sweep/              # Parameter sweeps (if enabled)
    │   ├── qval_0.05/
    │   ├── qval_0.01/
    │   └── qval_0.001/
    └── logs/                       # Pipeline logs
        ├── alignment/
        ├── peak_calling/
        └── track_generation/

Rapid Mode Output (rapid_mode: true)

results/
└── {genome_id}/                    # e.g., galGal6/
    ├── genome/                     # Genome indices (same as standard)
    │   ├── genome.fa
    │   ├── star/
    │   ├── bowtie2/
    │   └── bwa_mem2/
    ├── {assay_type}/               # Essential tracks only
    │   ├── bam/                    # Aligned reads
    │   │   └── sample1.sorted.bam
    │   ├── bigwig/                 # Coverage tracks
    │   │   └── sample1.bw
    │   └── peaks/                  # Peak calls (when applicable)
    │       └── sample1_peaks.narrowPeak
    ├── ucsc/                       # Basic UCSC hub
    │   ├── hub.txt                 # Main hub file
    │   ├── genomes.txt             # Genome specification
    │   └── trackDb.txt             # Track definitions (no composites)
    ├── rapid/                      # Rapid mode tracking
    │   └── rapid_tracks_complete.txt  # Completion summary with stats
    └── logs/                       # Pipeline logs
        ├── alignment/
        ├── peak_calling/
        └── track_generation/

Output Differences Summary

| Output Component | Standard Mode | Rapid Mode |
|------------------|---------------|------------|
| QC Reports | Full FastQC + MultiQC | Skipped |
| Trimming Outputs | Detailed summaries | No summaries |
| Composite Tracks | Generated | Skipped |
| Parameter Sweeps | Optional | Skipped |
| Hub Complexity | With composites | Basic only |
| Storage Footprint | Full | ~30-40% smaller |
| Completion Marker | Standard | rapid_tracks_complete.txt |

Key Output Files

BigWig Tracks (.bw):

  • Genome-wide coverage tracks for UCSC browser
  • Normalized by CPM, RPKM, or custom methods
  • Strand-specific for RNA-seq (when applicable)
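For reference, CPM (counts per million) normalization simply scales each bin's read count by the library size in millions. A minimal sketch of the arithmetic (illustrative only, not the pipeline's BigWig tooling):

```python
def cpm_scale_factor(total_mapped_reads: int) -> float:
    """Scale factor that converts raw per-bin counts to counts per million."""
    return 1_000_000 / total_mapped_reads

def to_cpm(bin_counts, total_mapped_reads):
    """Normalize a list of per-bin read counts to CPM."""
    factor = cpm_scale_factor(total_mapped_reads)
    return [c * factor for c in bin_counts]

# A library of 20 million mapped reads: a bin with 40 reads becomes 2.0 CPM.
print(to_cpm([40, 100], 20_000_000))
```

This is why CPM tracks from libraries of different depths are directly comparable in the browser.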

Peak Files (.narrowPeak, .broadPeak):

  • BED-format files with enriched regions
  • MACS3 output with q-values and fold-enrichment
  • Optional Genrich peaks for ATAC-seq

BAM Files (.bam):

  • Sorted and indexed aligned reads
  • Optional duplicate marking
  • Quality filtered (MAPQ thresholds applied)

UCSC Hub Files:

  • hub.txt: Hub metadata and contact info
  • genomes.txt: Genome assembly specifications
  • trackDb.txt: Individual track configurations
  • composite_trackDb.txt: Grouped track configurations (standard mode only)

🔧 Configuration Details

Required Parameters

# Essential settings - must be configured
samplesheet: "samples.csv"          # Sample metadata file
genome: "/path/to/genome.fa"        # Reference genome FASTA
genome_id: "your_genome_id"         # REQUIRED: Unique identifier
outdir: "./results"                 # Output directory

Sample Sheet Format

| Column | Description | Example | Required |
|--------|-------------|---------|----------|
| sample_id | Unique sample identifier | H3K27ac_rep1 | Yes |
| type | Assay type | chipseq, atacseq, rnaseq | Yes |
| sra_id | SRA accession (if downloading) | SRR123456 | If no local files |
| read1 | Path to R1 FASTQ | data/sample_R1.fastq.gz | If no SRA |
| read2 | Path to R2 FASTQ | data/sample_R2.fastq.gz | For paired-end |
| experiment_group | Experimental grouping | H3K27ac, timepoint1 | No |
| replicate_group | Replicate grouping | replicate1 | No |
| condition | Sample condition | treatment, control | No |

Assay-Specific Parameters

ChIP-seq Configuration

chipseq:
  mapper: "bowtie2"                 # bowtie2 or bwa_mem2
  bowtie2_opts: "--very-sensitive"
  markdup: false                    # Mark duplicates
  macs3_opts: "--qval 0.05 --keep-dup all"
  control_tag: "input"              # Identify control samples
  bw_norm: "CPM"                    # BigWig normalization

ATAC-seq Configuration

atacseq:
  mapper: "bowtie2"
  markdup: false                    # Recommended: false for accessibility
  shift: -75                        # MACS3 shift for ATAC-seq
  extsize: 150                      # MACS3 extension size
  macs3_opts: "--qval 0.05"
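For context on these values: ATAC-seq reads are conventionally shifted +4 bp on the forward strand and -5 bp on the reverse strand so that each read's 5' end marks the Tn5 insertion site, and shift -75 with extsize 150 then centers a 150 bp window on that site during peak calling. A sketch of the conventional coordinate correction (the standard Tn5 offsets, not necessarily the pipeline's exact implementation):

```python
def tn5_correct(start: int, end: int, strand: str) -> tuple[int, int]:
    """Shift a read interval so its 5' end marks the Tn5 insertion site.

    Conventional ATAC-seq correction: +4 bp on the forward strand,
    -5 bp on the reverse strand (0-based, half-open coordinates).
    """
    if strand == "+":
        return start + 4, end + 4
    return start - 5, end - 5

print(tn5_correct(100, 150, "+"))  # forward-strand read
print(tn5_correct(100, 150, "-"))  # reverse-strand read
```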

RNA-seq Configuration

rnaseq:
  mapper: "star"                    # star or hisat2
  star_opts: "--outFilterMultimapNmax 20"
  strand_specific: false
  gene_bed: "genes.bed"             # For QC analysis

Advanced Features

Parameter Sweep

Test multiple peak-calling thresholds:

parameter_sweep:
  enabled: true
  qvalues: [0.05, 0.01, 0.005, 0.001]

Adapter Trimming

Enable quality-based trimming:

adapter_trimming:
  enabled: true
  min_length: 20
  quality_cutoff: 20

Alternative Peak Callers

Enable Genrich for ATAC-seq:

🧮 Computational Requirements

Resource Recommendations

| Dataset Size | CPU Cores | Memory | Storage | Time Estimate |
|--------------|-----------|--------|---------|---------------|
| Small (< 10 samples) | 8-16 | 32 GB | 100 GB | 2-6 hours |
| Medium (10-50 samples) | 16-32 | 64 GB | 500 GB | 6-24 hours |
| Large (50+ samples) | 32-64 | 128 GB | 1 TB+ | 1-3 days |

Cluster Configuration

The pipeline includes SLURM integration via Executor.sh. Customize for your cluster:

#SBATCH --job-name=uSeq2Tracks
#SBATCH --partition=general
#SBATCH --qos=general
#SBATCH --mem=10g
#SBATCH --cpus-per-task=4
📖 Detailed Workflows

ChIP-seq/ATAC-seq/CUT&RUN Workflow

  1. Quality Control: FastQC analysis of raw reads
  2. Adapter Trimming: Optional quality-based trimming with fastp
  3. Genome Indexing: Build mapper-specific indices (Bowtie2/BWA-MEM2)
  4. Read Mapping: Align reads to reference genome
  5. Post-processing: Sort, index, optional duplicate marking
  6. Coverage Tracks: Generate normalized BigWig files
  7. Peak Calling: Identify enriched regions with MACS3
  8. Quality Metrics: Generate assay-specific QC reports
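The mapping and peak-calling steps above are plain tool invocations driven by the config. As an illustration of how the ChIP-seq settings translate into commands, here is a sketch with assumed file names and layout, not the pipeline's generated shell:

```python
def build_chipseq_commands(cfg: dict, sample: str) -> list[str]:
    """Assemble the core alignment and peak-calling commands for one sample."""
    align = (
        f"bowtie2 {cfg['bowtie2_opts']} -x genome/bowtie2/index "
        f"-1 {sample}_R1.fastq.gz -2 {sample}_R2.fastq.gz "
        f"| samtools sort -o bam/{sample}.sorted.bam -"
    )
    peaks = (
        f"macs3 callpeak -t bam/{sample}.sorted.bam "
        f"-c bam/{cfg['control_tag']}.sorted.bam "
        f"{cfg['macs3_opts']} -n {sample} --outdir peaks/"
    )
    return [align, peaks]

cfg = {"bowtie2_opts": "--very-sensitive",
       "control_tag": "input",
       "macs3_opts": "--qval 0.05 --keep-dup all"}
for cmd in build_chipseq_commands(cfg, "sample1"):
    print(cmd)
```

Note how control_tag selects the input sample passed to MACS3 via -c.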

RNA-seq Workflow

  1. Quality Control: FastQC analysis of raw reads
  2. Adapter Trimming: Optional preprocessing with fastp
  3. Genome Indexing: Build STAR or HISAT2 indices
  4. Read Mapping: Splice-aware alignment to reference
  5. Quantification: Generate gene-level count matrices
  6. Coverage Tracks: Create strand-specific BigWig files
  7. Quality Assessment: RNA-seq specific QC with RSeQC

Long-read Workflow (Nanopore/PacBio)

  1. Quality Assessment: Basic statistics and length distributions
  2. Genome Indexing: Build Minimap2 index
  3. Read Mapping: Long-read aware alignment
  4. Coverage Analysis: Generate coverage tracks
  5. Variant Calling: Optional structural variant detection

🎨 UCSC Browser Integration

Automatic Track Hub Generation

uSeq2Tracks automatically creates UCSC-compatible track hubs:

Hub Structure:

ucsc/
├── hub.txt                 # Hub metadata
├── genomes.txt             # Genome definitions
└── trackDb.txt             # Track configurations
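These three files follow the plain-text UCSC track hub format. A minimal sketch of their contents (field names follow the UCSC hub spec; the values are illustrative, not what uSeq2Tracks emits verbatim):

```python
def write_hub(hub_name, short_label, long_label, email, genome, tracks):
    """Return hub.txt, genomes.txt and trackDb.txt contents as strings."""
    hub = (f"hub {hub_name}\nshortLabel {short_label}\n"
           f"longLabel {long_label}\ngenomesFile genomes.txt\nemail {email}\n")
    genomes = f"genome {genome}\ntrackDb trackDb.txt\n"
    stanzas = []
    for name, url in tracks:  # one stanza per BigWig track
        stanzas.append(
            f"track {name}\nbigDataUrl {url}\nshortLabel {name}\n"
            f"longLabel {name} coverage\ntype bigWig\nvisibility full\n"
        )
    return hub, genomes, "\n".join(stanzas)

hub, genomes, trackdb = write_hub(
    "myHub", "My Data", "My Sequencing Data Hub",
    "user@example.com", "galGal6", [("sample1", "bigwig/sample1.bw")])
print(trackdb)
```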

Track Organization:

  • Composite Tracks: Group related samples (e.g., same experiment)
  • Subgroups: Organize by condition, replicate, timepoint
  • Color Coding: Consistent color schemes per assay type
  • Metadata Integration: Sample information in track descriptions

Loading in UCSC Browser

  1. Upload track hub files to web-accessible location
  2. In UCSC Browser: My Data → Track Hubs → My Hubs
  3. Enter hub URL: https://your-server.com/path/to/ucsc/hub.txt
  4. Browse your data with full metadata integration

๐Ÿ” Quality Control Features

Standard QC Reports

  • FastQC: Per-sample quality metrics
  • MultiQC: Aggregated quality dashboard
  • Mapping Statistics: Alignment rates and quality scores
  • Library Complexity: Duplication rates and insert sizes

Assay-Specific QC

  • ChIP-seq: Fragment length distributions, enrichment metrics
  • ATAC-seq: TSS enrichment, fragment size profiles
  • RNA-seq: Gene body coverage, junction analysis
  • WGS: Coverage uniformity, variant quality metrics

Quality Thresholds

The pipeline includes built-in quality checks:

  • Minimum mapping rates
  • Fragment count requirements for peak calling
  • Insert size validation
  • Strand specificity assessment
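A minimum-mapping-rate check like the one above can be reproduced from samtools flagstat output. The sketch below parses the flagstat text and applies a threshold; the 80% cutoff and the sample text are illustrative, not the pipeline's defaults:

```python
import re

def mapping_rate(flagstat_text: str) -> float:
    """Extract the mapping percentage from `samtools flagstat` output."""
    for line in flagstat_text.splitlines():
        if " mapped (" in line:
            return float(re.search(r"\(([\d.]+)%", line).group(1))
    raise ValueError("no 'mapped' line found in flagstat output")

# Abbreviated flagstat output for illustration.
FLAGSTAT = """\
2000000 + 0 in total (QC-passed reads + QC-failed reads)
1900000 + 0 mapped (95.00% : N/A)
"""

rate = mapping_rate(FLAGSTAT)
ok = rate >= 80.0  # illustrative threshold, not the pipeline default
print(rate, ok)
```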

🛠 Troubleshooting

Common Issues

Genome ID Not Set

ERROR: genome_id must be set in config.yaml

Solution: Add genome_id: "your_genome" to config.yaml

Sample Sheet Formatting

ERROR: Missing required columns in sample sheet

Solution: Ensure sample sheet includes sample_id, type, and either sra_id or local file paths

Memory Issues

ERROR: Job exceeded memory limit

Solution: Increase memory allocation in config.yaml:

memory:
  large: 128000    # Increase for memory-intensive jobs

Disk Space

ERROR: No space left on device

Solution:

  • Clean up intermediate files: snakemake --delete-temp-output
  • Use scratch storage for temporary files
  • Monitor disk usage during execution

Performance Optimization

Speed Up Processing

  1. Enable Rapid Mode for public datasets
  2. Increase Parallelization: More --jobs in Snakemake
  3. Use SSDs for scratch space
  4. Optimize Resource Allocation: Match CPU/memory to job requirements

Reduce Storage

  1. Delete Intermediate Files: Use --delete-temp-output
  2. Compress Outputs: Enable compression for BAM files
  3. Archive Unused Data: Move completed analyses to long-term storage

📚 Examples

Example 1: ENCODE ChIP-seq Analysis (Rapid Mode)

Scenario: Processing public ENCODE data for quick visualization

# config.yaml
genome_id: "hg38_ENCODE_H3K27ac"
genome: "/data/genomes/hg38.fa"
samplesheet: "encode_chipseq.csv"
rapid_mode: true                     # Skip QC for public data

chipseq:
  mapper: "bowtie2"
  macs3_opts: "--qval 0.01"

# encode_chipseq.csv
sample_id,type,sra_id,read1,read2,experiment_group,replicate_group,condition
ENCSR000EWQ_rep1,chipseq,SRR1536404,,,H3K27ac,ENCSR000EWQ,K562
ENCSR000EWQ_rep2,chipseq,SRR1536405,,,H3K27ac,ENCSR000EWQ,K562
ENCSR000EWQ_input,chipseq,SRR1536406,,,H3K27ac,input,input

Expected Output:

results/hg38_ENCODE_H3K27ac/
├── chipseq/
│   ├── bam/
│   ├── bigwig/
│   └── peaks/
├── ucsc/
│   └── hub.txt
└── rapid/
    └── rapid_tracks_complete.txt

Processing Time: ~2-3 hours (vs 4-6 hours in standard mode)

Example 2: Multi-assay Developmental Study (Standard Mode)

Scenario: Novel experimental data requiring comprehensive QC

# config.yaml
genome_id: "mm10_development"
genome: "/data/genomes/mm10.fa"
gtf: "/data/annotations/mm10.gtf"
samplesheet: "development_study.csv"
rapid_mode: false                    # Full QC for novel data

parameter_sweep:
  enabled: true
  qvalues: [0.1, 0.05, 0.01, 0.001]
  
adapter_trimming:
  enabled: true
  min_length: 20

# development_study.csv
sample_id,type,sra_id,read1,read2,experiment_group,replicate_group,condition
E10_ATAC_rep1,atacseq,,data/E10_ATAC_1_R1.fq.gz,data/E10_ATAC_1_R2.fq.gz,E10,ATAC_rep1,E10
E10_ATAC_rep2,atacseq,,data/E10_ATAC_2_R1.fq.gz,data/E10_ATAC_2_R2.fq.gz,E10,ATAC_rep2,E10
E10_RNA_rep1,rnaseq,,data/E10_RNA_1_R1.fq.gz,data/E10_RNA_1_R2.fq.gz,E10,RNA_rep1,E10
E12_ATAC_rep1,atacseq,,data/E12_ATAC_1_R1.fq.gz,data/E12_ATAC_1_R2.fq.gz,E12,ATAC_rep1,E12
E12_RNA_rep1,rnaseq,,data/E12_RNA_1_R1.fq.gz,data/E12_RNA_1_R2.fq.gz,E12,RNA_rep1,E12

Expected Output:

results/mm10_development/
├── qc/
│   ├── fastqc/
│   └── multiqc_report.html
├── trimmed/
├── atacseq/
├── rnaseq/
├── ucsc/
│   ├── hub.txt
│   └── composite_trackDb.txt
└── genrich_sweep/

Processing Time: ~8-12 hours with full QC

Example 3: TCGA Cancer Atlas (Rapid Mode)

Scenario: Rapid processing of TCGA RNA-seq data

# config.yaml
genome_id: "hg38_TCGA_BRCA"
genome: "/data/genomes/hg38.fa"
gtf: "/data/annotations/gencode.v38.gtf"
samplesheet: "tcga_rnaseq.csv"
rapid_mode: true                     # Fast track generation

rnaseq:
  mapper: "star"
  strand_specific: true

# tcga_rnaseq.csv
sample_id,type,sra_id,read1,read2,experiment_group,replicate_group,condition
TCGA_BRCA_01,rnaseq,SRR8494716,,,BRCA,tumor_01,tumor
TCGA_BRCA_02,rnaseq,SRR8494717,,,BRCA,tumor_02,tumor
TCGA_BRCA_normal,rnaseq,SRR8494718,,,BRCA,normal_01,normal

Benefits:

  • ⚡ 40% faster processing
  • 💾 50% less storage (no QC intermediates)
  • 🎯 Clean output for browser visualization

Example 4: Ancient DNA Analysis (Standard Mode)

Scenario: Archaeological samples requiring quality validation

# config.yaml
genome_id: "ancientDNA_sample"
genome: "/data/reference/ancient_ref.fa"
samplesheet: "ancient_samples.csv"
rapid_mode: false                    # Need full QC for damage assessment

ancientdna:
  mapper: "bwa_aln"
  markdup: true
  damage_analysis: true
  min_mapq: 30

Why Standard Mode:

  • Ancient DNA has unique quality issues (damage patterns)
  • Need comprehensive QC to assess sample preservation
  • Publication requires full quality documentation

Example 5: Mixed-Mode Project

Scenario: Combining public and experimental data

# config_public.yaml (rapid mode)
genome_id: "mm10_public_controls"
samplesheet: "public_controls.csv"
rapid_mode: true

# config_experimental.yaml (standard mode)
genome_id: "mm10_experimental"
samplesheet: "experimental_samples.csv"
rapid_mode: false

Workflow:

  1. Process public controls rapidly for quick validation
  2. Process experimental data with full QC
  3. Combine tracks in single UCSC hub
  4. Maintain appropriate quality standards for each dataset type

๐Ÿค Contributing

Contributions are welcome! To get started:

git clone https://github.com/pgrady1322/uSeq2Tracks.git
cd uSeq2Tracks
conda env create -f envs/useq2tracks.yml
conda activate useq2tracks
pip install -e ".[dev]"
make test        # run tests
make check       # lint + format check

📄 License

This project is licensed under the MIT License - see the LICENSE file for details.

📞 Support

๐Ÿ† Citation

If you use uSeq2Tracks in your research, please cite:

Patrick Grady (2026). uSeq2Tracks: A universal pipeline for sequencing
data to genome browser tracks. https://github.com/pgrady1322/uSeq2Tracks

๐Ÿ™ Acknowledgments

  • Snakemake Community: For the excellent workflow management system
  • Bioconda: For streamlined software distribution
  • UCSC Genome Browser: For track hub specifications
  • Tool Developers: FastQC, MultiQC, STAR, Bowtie2, BWA, MACS3, and all other integrated tools

uSeq2Tracks: From raw sequencing data to publication-ready browser tracks in one streamlined workflow.
