Feature extraction tools for circulating tumor DNA from GRCh37 aligned BAM files

Krewlyzer: Comprehensive cfDNA Feature Extraction Toolkit

Krewlyzer is a robust, user-friendly command-line toolkit for extracting a wide range of biological features from cell-free DNA (cfDNA) sequencing data. It is designed for cancer genomics, liquid biopsy research, and clinical bioinformatics, providing high-performance, reproducible feature extraction from BAM files. Krewlyzer draws inspiration from cfDNAFE and implements state-of-the-art methods for fragmentation, motif, and methylation analysis, all in a modern Pythonic interface with rich parallelization and logging.

Tip: For detailed usage, feature descriptions, and pipeline integration, see the full documentation on the Documentation Site.


System Requirements

  • Linux or macOS (tested on Ubuntu 20.04, macOS 12+)
  • Python 3.8+
  • ≥16GB RAM recommended for large BAM files
  • Docker (optional, for easiest setup)

Installation

With Docker (Recommended)

docker pull ghcr.io/msk-access/krewlyzer:latest
# Example usage:
docker run --rm -v $PWD:/data ghcr.io/msk-access/krewlyzer:latest motif /data/sample.bam -g /data/hg19.fa -o /data/motif_out

With uv Virtual Environment

uv venv .venv
source .venv/bin/activate
uv pip install .

Or install from PyPI:

uv pip install krewlyzer

Reference Data

  • Reference Genome (FASTA):
    • Download GRCh37/hg19 from UCSC
    • BAMs must be sorted, indexed, and aligned to the same build
  • Bin/Region/Marker Files:
    • Provided in krewlyzer/data/ (see options for each feature)

Command Summary

Command Description
extract Fragment extraction (BAM to .bed.gz)
motif Motif-based feature extraction
fsc Fragment size coverage
fsr Fragment size ratio
fsd Fragment size distribution
wps Windowed protection score
ocf Orientation-aware fragmentation
uxm Fragment-level methylation (SE/PE)
mfsd Mutant fragment size distribution
run-all Run all features for a BAM

Typical Workflow

The recommended way to run krewlyzer is using the Unified Pipeline via run-all, which processes the BAM file in a single pass for maximum efficiency.

# Optimized Unified Pipeline
krewlyzer run-all sample.bam --reference hg19.fa --output output_dir \
    --variants variants.maf --bin-input targets.bed --threads 4

Alternatively, you can run tools individually. Note that most tools require a fragment BED file (.bed.gz) produced by the extract command.

# 1. Extract fragments (BAM -> BED.gz)
krewlyzer extract sample.bam -g hg19.fa -o output_dir

# 2. Run feature tools using the BED file
krewlyzer fsc output_dir/sample.bed.gz --output fsc_out.txt
# ... (wps, fsd, ocf, etc.)

# 3. Motif analysis (Independent of BED, uses BAM directly)
krewlyzer motif sample.bam -g hg19.fa -o output_dir 

Feature Details & Usage

Motif-based Feature Extraction

Purpose: Extracts end motif, breakpoint motif, and Motif Diversity Score (MDS) from sequencing fragments.

Biological context: Motif analysis of cfDNA fragment ends can reveal tissue-of-origin, nucleosome positioning, and mutational processes. MDS quantifies motif diversity, which may be altered in cancer.

Usage:

krewlyzer motif path/to/input.bam -g path/to/reference.fa -o path/to/output_dir \
    --minlen 65 --maxlen 400 -k 3 --verbose
  • Output: EDM, BPM, and MDS subfolders in output directory.
  • Rich logging and progress bars for user-friendly feedback.
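
A common definition of a motif diversity score is the Shannon entropy of the end-motif frequencies, normalized by the maximum possible entropy for k-mers. The Python sketch below illustrates that definition only; the function name is hypothetical and Krewlyzer's own MDS calculation may differ in detail.

```python
import math
from collections import Counter
from itertools import product


def motif_diversity_score(end_motifs, k=4):
    """Normalized Shannon entropy of k-mer end-motif frequencies (0 to 1).

    Hypothetical helper for illustration; not part of the krewlyzer API.
    """
    valid = {"".join(p) for p in product("ACGT", repeat=k)}
    # Ignore motifs with ambiguous bases (e.g. N) or the wrong length.
    counts = Counter(m.upper() for m in end_motifs if m.upper() in valid)
    total = sum(counts.values())
    if total == 0:
        raise ValueError("no valid end motifs supplied")
    entropy = -sum((c / total) * math.log(c / total) for c in counts.values())
    # Maximum entropy is reached when all 4**k motifs are equally frequent.
    return entropy / math.log(4 ** k)
```

A sample in which every fragment ends with the same motif scores 0, and a sample with all motifs equally represented scores 1.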

Fragment Size Coverage (FSC)

Purpose: Computes z-scored coverage of cfDNA fragments in different size ranges, per genomic bin (default: 100kb), with GC correction.

Biological context: cfDNA fragment size profiles are informative for cancer detection and tissue-of-origin. FSC quantifies the coverage of short (65-150bp), intermediate (151-260bp), long (261-400bp), and total (65-400bp) fragments, normalized to genome-wide means.

Usage:

krewlyzer fsc motif_out --output fsc_out [options]
  • Input: fragment .bed.gz files (produced by the extract command; see Typical Workflow)
  • Output: One .FSC file per sample
  • Options:
    • --bin-input, -b: Bin file (default: data/ChormosomeBins/hg19_window_100kb.bed)
    • --windows, -w: Window size (default: 100000)
    • --continue-n, -c: Super-bin size (default: 50)
    • --threads, -t: Number of processes
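
As a rough illustration of the size classes described above, the Python sketch below counts fragments per class and z-scores a list of per-bin counts against their mean. The function names are hypothetical, GC correction is omitted, and Krewlyzer's actual normalization may differ.

```python
from statistics import mean, pstdev

# Size classes as stated in this section (inclusive bounds, in bp).
SIZE_CLASSES = {
    "short": (65, 150),
    "intermediate": (151, 260),
    "long": (261, 400),
    "total": (65, 400),
}


def count_by_class(lengths):
    """Count the fragments of one genomic bin that fall in each size class."""
    counts = {name: 0 for name in SIZE_CLASSES}
    for n in lengths:
        for name, (lo, hi) in SIZE_CLASSES.items():
            if lo <= n <= hi:
                counts[name] += 1
    return counts


def zscores(values):
    """Z-score per-bin values against the mean and SD across all bins."""
    mu, sd = mean(values), pstdev(values)
    if sd == 0:
        return [0.0 for _ in values]
    return [(v - mu) / sd for v in values]
```

In use, `count_by_class` would be called once per bin, and `zscores` once per size class over the resulting per-bin counts.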

Fragment Size Ratio (FSR)

Purpose: Calculates the ratio of ultra-short/short/intermediate/long fragments per bin.

Biological context: Mouliere et al. (2018) showed that cfDNA fragment size ratios are informative for cancer detection. Krewlyzer uses ultra-short (65-100bp), short (65-150bp), intermediate (151-260bp), and long (261-400bp) bins; note that the ultra-short range is a subset of the short range. Ultra-short fragments are reported to be enriched for ctDNA.

Usage:

krewlyzer fsr motif_out --output fsr_out [options]
  • Input: fragment .bed.gz files (produced by the extract command; see Typical Workflow)
  • Output: One .FSR file per sample
  • Options: Same as FSC
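
The ratio calculation for a single bin can be sketched in Python as follows. This is a minimal illustration that assumes each ratio is the class count divided by the count of all 65-400bp fragments in the bin; the denominator and the function name are assumptions, not confirmed details of Krewlyzer's implementation.

```python
# Size ranges as stated in this section (inclusive bounds, in bp).
RATIO_CLASSES = {
    "ultra_short": (65, 100),
    "short": (65, 150),
    "intermediate": (151, 260),
    "long": (261, 400),
}


def size_ratios(lengths, lo=65, hi=400):
    """Fraction of in-range fragments in each size class for one bin."""
    in_range = [n for n in lengths if lo <= n <= hi]
    if not in_range:
        # No usable fragments in this bin: the ratios are undefined.
        return {name: float("nan") for name in RATIO_CLASSES}
    total = len(in_range)
    return {
        name: sum(1 for n in in_range if c_lo <= n <= c_hi) / total
        for name, (c_lo, c_hi) in RATIO_CLASSES.items()
    }
```

Because the ultra-short range overlaps the short range, the four ratios do not sum to 1.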

Fragment Size Distribution (FSD)

Purpose: Computes high-resolution (5bp bins) fragment length distributions per chromosome arm.

Biological context: cfDNA fragmentation patterns at chromosome arms can reflect nucleosome positioning, chromatin accessibility, and cancer-specific fragmentation signatures.

Usage:

krewlyzer fsd motif_out --arms-file krewlyzer/data/ChormosomeArms/hg19_arms.bed --output fsd_out [options]
  • Input: fragment .bed.gz files (produced by the extract command; see Typical Workflow)
  • Output: One .FSD file per sample
  • Options:
    • --arms-file, -a: Chromosome arms BED (required)
    • --threads, -t: Number of processes
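
A 5bp-resolution length distribution per chromosome arm might be computed along the lines of the Python sketch below. The 65-400bp range is borrowed from the other size-based features, and the half-open binning and function names are assumptions made for illustration.

```python
from collections import defaultdict


def size_distribution(lengths, lo=65, hi=400, step=5):
    """Proportion of fragments in each [lo + i*step, lo + (i+1)*step) bin."""
    n_bins = (hi - lo) // step
    counts = [0] * n_bins
    for n in lengths:
        if lo <= n < hi:
            counts[(n - lo) // step] += 1
    total = sum(counts)
    return [c / total if total else 0.0 for c in counts]


def per_arm_distributions(fragments):
    """fragments: iterable of (arm_name, fragment_length) pairs."""
    by_arm = defaultdict(list)
    for arm, length in fragments:
        by_arm[arm].append(length)
    return {arm: size_distribution(v) for arm, v in by_arm.items()}
```

With these defaults each arm yields 67 proportions, one per 5bp bin.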

Windowed Protection Score (WPS)

Purpose: Computes nucleosome protection scores (WPS) for each region in a transcript/region file.

Biological context: The WPS (Snyder et al., 2016) quantifies nucleosome occupancy and chromatin accessibility by comparing fragments spanning a window to those ending within it. High WPS indicates nucleosome protection; low WPS, open chromatin.

Usage:

krewlyzer wps motif_out --output wps_out [options]
  • Input: fragment .bed.gz files (produced by the extract command; see Typical Workflow)
  • Output: .WPS.tsv.gz per region/sample
  • Options:
    • --tsv-input: Transcript region file (default: data/TranscriptAnno/transcriptAnno-hg19-1kb.tsv)
    • --wpstype: WPS type (L for long [default], S for short)
    • --threads, -t: Number of processes
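
The spanning-versus-ending comparison described above can be sketched in Python for a single genomic position. This is an illustrative reading of the WPS definition, not Krewlyzer's code: the function name is hypothetical, the 120bp default window is an assumption for the long-fragment score, and no fragment-length filtering is applied.

```python
def wps_at(fragments, position, window=120):
    """Windowed protection score at one position.

    fragments: iterable of (start, end) inclusive coordinates on one chromosome.
    Score = fragments fully spanning the window
            minus fragments with an endpoint inside the window.
    """
    half = window // 2
    w_start, w_end = position - half, position + half
    score = 0
    for start, end in fragments:
        if start < w_start and end > w_end:
            score += 1  # fragment covers the whole window (protected)
        elif w_start <= start <= w_end or w_start <= end <= w_end:
            score -= 1  # fragment ends inside the window
    return score
```

Evaluating `wps_at` at every position of a region gives the profile whose peaks suggest nucleosome protection.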

Orientation-aware Fragmentation (OCF)

Purpose: Computes orientation-aware cfDNA fragmentation (OCF) values in tissue-specific open chromatin regions.

Biological context: OCF (Sun et al., Genome Res 2019) measures the phasing of upstream (U) and downstream (D) fragment ends in open chromatin, informing tissue-of-origin of cfDNA.

Usage:

krewlyzer ocf motif_out --output ocf_out [options]
  • Input: fragment .bed.gz files (produced by the extract command; see Typical Workflow)
  • Output: .sync.end files per tissue and summary all.ocf.csv per sample
  • Options:
    • --ocr-input, -r: Open chromatin region BED (default: data/OpenChromatinRegion/7specificTissue.all.OC.bed)
    • --threads, -t: Number of processes
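
One way to turn U and D end counts into a single OCF value is to sum their differences in two small windows flanking the open chromatin center. The Python sketch below assumes windows centered 60bp on either side with a 10bp half-width, and assumes D ends are expected to dominate on the left and U ends on the right; these constants, the sign convention, and the function name are assumptions for illustration and should be checked against Sun et al. (2019) and the Krewlyzer source.

```python
def ocf_value(u_ends, d_ends, peak=60, half_bin=10):
    """Orientation-aware fragmentation value for one set of regions.

    u_ends, d_ends: dicts mapping position relative to the region center
    (negative = left of center) to the (normalized) count of upstream and
    downstream fragment ends at that position.
    """
    left = range(-peak - half_bin, -peak + half_bin + 1)
    right = range(peak - half_bin, peak + half_bin + 1)
    score = sum(d_ends.get(p, 0) - u_ends.get(p, 0) for p in left)
    score += sum(u_ends.get(p, 0) - d_ends.get(p, 0) for p in right)
    return score
```

Under these assumptions a positive value indicates the phased end pattern expected when the tissue contributes cfDNA, and a value near zero indicates no orientation preference.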

Fragment-level Methylation (UXM)

Purpose: Computes the proportions of Unmethylated (U), Mixed (X), and Methylated (M) fragments per region, supporting both single-end (SE) and paired-end (PE) BAMs.

Biological context: Fragment-level methylation (UXM, Sun et al., Nature 2023) reveals cell-of-origin and cancer-specific methylation patterns in cfDNA. Krewlyzer supports both SE and PE mode, pairing reads as in cfDNAFE.

Usage:

# Single-end (default)
krewlyzer uxm /path/to/bam_folder --output uxm_out [options]
# Paired-end mode
krewlyzer uxm /path/to/bam_folder --output uxm_out --type PE [options]
  • Input: Folder of sorted, indexed BAMs
  • Output: .UXM.tsv file per sample
  • Options:
    • --mark-input, -m: Marker BED file (default: data/MethMark/Atlas.U25.l4.hg19.bed)
    • --map-quality, -q: Minimum mapping quality (default: 30)
    • --min-cpg, -c: Minimum CpG per fragment (default: 4)
    • --methy-threshold, -tM: Methylation threshold (default: 0.75)
    • --unmethy-threshold, -tU: Unmethylation threshold (default: 0.25)
    • --type: Fragment type: SE or PE (default: SE)
    • --threads, -t: Number of processes
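
The thresholds listed above suggest a simple per-fragment rule, sketched here in Python. The function names are hypothetical, and treating the thresholds as inclusive is an assumption.

```python
def classify_fragment(cpg_states, min_cpg=4,
                      methy_threshold=0.75, unmethy_threshold=0.25):
    """Label a fragment U, X, or M from its CpG methylation calls.

    cpg_states: iterable of truthy (methylated) / falsy (unmethylated) calls.
    Returns None when the fragment covers fewer than min_cpg CpGs.
    """
    states = [bool(s) for s in cpg_states]
    if len(states) < min_cpg:
        return None
    fraction = sum(states) / len(states)
    if fraction >= methy_threshold:
        return "M"
    if fraction <= unmethy_threshold:
        return "U"
    return "X"


def uxm_proportions(fragments, **thresholds):
    """Proportions of U, X, and M fragments among the classifiable ones."""
    labels = [classify_fragment(f, **thresholds) for f in fragments]
    kept = [label for label in labels if label is not None]
    if not kept:
        return {"U": 0.0, "X": 0.0, "M": 0.0}
    return {key: kept.count(key) / len(kept) for key in ("U", "X", "M")}
```

Fragments with too few CpGs are dropped before the proportions are computed, so the three values sum to 1 whenever at least one fragment qualifies.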

Mutant Fragment Size Distribution (mFSD)

Purpose: Compares the size distribution of mutant vs. wild-type reads at variant sites.

Biological context: Mutant ctDNA fragments are typically shorter than wild-type cfDNA. This module quantifies this difference using high-depth targeted sequencing data, providing a sensitive marker for ctDNA presence.

Usage:

krewlyzer mfsd sample.bam --input variants.vcf --output output.tsv [options]
  • Input: BAM file and VCF/MAF file containing variants.
  • Output: TSV file with mutant/WT counts, mean sizes, size difference, and KS test statistics.
  • Options:
    • --input, -i: VCF or MAF file (required)
    • --format, -f: Input format ('auto', 'vcf', 'maf')
    • --map-quality, -q: Minimum mapping quality (default: 20)
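
The comparison reported in the output can be illustrated with a small pure-Python sketch that computes mean sizes, their difference, and the two-sample Kolmogorov-Smirnov statistic. The function names and the sign of the size difference (mutant minus wild-type) are assumptions, and no p-value is computed here.

```python
from bisect import bisect_right
from statistics import mean


def ks_statistic(a, b):
    """Two-sample KS statistic: maximum gap between the empirical CDFs."""
    if not a or not b:
        raise ValueError("both samples must be non-empty")
    a, b = sorted(a), sorted(b)
    points = sorted(set(a) | set(b))
    return max(
        abs(bisect_right(a, x) / len(a) - bisect_right(b, x) / len(b))
        for x in points
    )


def compare_sizes(mutant, wild_type):
    """Summarize mutant vs. wild-type fragment sizes at one variant site."""
    mut_mean, wt_mean = mean(mutant), mean(wild_type)
    return {
        "mutant_count": len(mutant),
        "wt_count": len(wild_type),
        "mutant_mean": mut_mean,
        "wt_mean": wt_mean,
        "size_diff": mut_mean - wt_mean,  # negative if mutant is shorter
        "ks_stat": ks_statistic(mutant, wild_type),
    }
```

A negative `size_diff` together with a large `ks_stat` is the pattern expected when mutant fragments are shorter than wild-type fragments.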

Run All Features

Runs all feature extraction commands (motif, fsc, fsr, fsd, wps, ocf, uxm, mfsd) for a single BAM file in one call.

Usage:

krewlyzer run-all sample.bam --reference hg19.fa --output all_features_out [--variant-input variants.vcf] [--threads N] [--type SE|PE]

Output Structure Examples

After krewlyzer run-all:

output_dir/
├── sample.bed.gz           # Fragment file (Tabix indexed)
├── sample.bed.gz.tbi
├── sample.EndMotif.txt     # End motif frequencies
├── sample.BreakPointMotif.txt
├── sample.MDS.txt          # Motif Diversity Score
├── sample.FSC.txt          # Fragment Size Coverage
├── sample.FSR.txt          # Fragment Size Ratio
├── sample.FSD.txt          # Fragment Size Distribution
├── sample.WPS.tsv.gz       # Windowed Protection Score
├── sample.OCF.csv          # Orientation-aware Fragmentation summary
├── sample.OCF.sync.tsv     # OCF details
└── sample.mfsd.tsv         # Mutant Fragment Size Distribution (if variants provided)

Troubleshooting

  • FileNotFoundError: Ensure all input files/paths exist and are readable. Use absolute paths if possible.
  • PermissionError: Check output directory permissions.
  • Missing dependencies: Use Docker or follow Installation for all requirements.
  • Reference mismatch: BAM and reference FASTA must be from the same genome build.
  • Memory errors: Use ≥16GB RAM for large BAMs or process in batches.
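
Several of these problems can be caught before launching a run. The Python sketch below is a hypothetical pre-flight check, not part of Krewlyzer; it assumes the BAM index sits next to the BAM as either sample.bam.bai or sample.bai.

```python
import os
from pathlib import Path


def preflight(bam, reference, output_dir):
    """Return a list of human-readable problems; an empty list means OK."""
    bam, reference, output_dir = Path(bam), Path(reference), Path(output_dir)
    problems = []
    for label, path in (("BAM", bam), ("reference FASTA", reference)):
        if not path.is_file():
            problems.append(f"{label} not found: {path}")
        elif not os.access(path, os.R_OK):
            problems.append(f"{label} is not readable: {path}")
    # Assumed index locations: sample.bam.bai or sample.bai.
    if bam.is_file():
        candidates = (Path(str(bam) + ".bai"), bam.with_suffix(".bai"))
        if not any(c.is_file() for c in candidates):
            problems.append(f"BAM index not found for: {bam}")
    # The output directory, or its parent if not yet created, must be writable.
    target = output_dir if output_dir.exists() else output_dir.parent
    if not os.access(target, os.W_OK):
        problems.append(f"cannot write to: {target}")
    return problems
```

Calling `preflight("sample.bam", "hg19.fa", "output_dir")` and printing any returned messages gives a quick check of paths and permissions; it does not verify that the BAM and FASTA share a genome build.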

Citation & Acknowledgements

If you use Krewlyzer in your work, please cite this repository and cfDNAFE. Krewlyzer implements or adapts methods from the following primary literature:

  • DELFI (FSR): Mouliere F, Chandrananda D, Piskorz AM, et al. Enhanced detection of circulating tumor DNA by fragment size analysis. Sci Transl Med. 2018;10(466):eaat4921. https://doi.org/10.1126/scitranslmed.aat4921

  • WPS: Snyder MW, Kircher M, Hill AJ, Daza RM, Shendure J. Cell-free DNA Comprises an In Vivo Nucleosome Footprint that Informs Its Tissues-Of-Origin. Cell. 2016;164(1-2):57-68. https://doi.org/10.1016/j.cell.2015.11.050

  • OCF: Sun K, Jiang P, Chan KC, et al. Orientation-aware plasma cell-free DNA fragmentation analysis in open chromatin regions informs tissue of origin. Genome Res. 2019;29(3):418-427. https://doi.org/10.1101/gr.242719.118

  • UXM: Sun K, et al. Fragment-level methylation measures cell-of-origin and cancer-specific signals in cell-free DNA. Nature. 2023;616(7956):563-571. https://doi.org/10.1038/s41586-022-05580-6

  • cfDNAFE:

@misc{cfDNAFE,
  author = {Wanxin Cui et al.},
  title = {cfDNAFE: A toolkit for comprehensive cell-free DNA fragmentation feature extraction},
  year = {2022},
  howpublished = {\url{https://github.com/Cuiwanxin1998/cfDNAFE}}
}
  • Developed by the MSK-ACCESS team at Memorial Sloan Kettering Cancer Center.

License

This project is licensed under the GNU Affero General Public License v3.0 (AGPL-3.0). See the LICENSE file for full terms.
