CNV From BAM

cnv_from_bam is a Rust library developed to efficiently calculate dynamic Copy Number Variation (CNV) profiles from sequence alignments contained in BAM files. It seamlessly integrates with Python using PyO3, making it an excellent choice for bioinformatics workflows involving genomic data analysis.

Features

  • Efficient Processing: Optimized for handling large genomic datasets in BAM format.
  • Python Integration: Built with PyO3 for easy integration into Python-based genomic analysis workflows.
  • Multithreading Support: Utilizes Rust's powerful concurrency model for improved performance.
  • Dynamic Binning: Bins the genome dynamically based on total read counts and genome length.
  • CNV Calculation: Accurately calculates CNV values for each bin across different contigs.
  • Directory Support: Supports processing of multiple BAM files in a directory. (Requires alignment to the same reference in all BAM files)

Installation

To use cnv_from_bam in your Rust project, add the following to your Cargo.toml file:

[dependencies]
cnv_from_bam = "0.1.0"  # Replace with the latest version

Usage

Here's a quick example of how to use the iterate_bam_file function:

use cnv_from_bam::iterate_bam_file;
use std::path::PathBuf;

let bam_path = PathBuf::from("path/to/bam/file.bam");
// Iterate over the BAM file and calculate CNV values for each bin. Number of threads is set to 4 and mapping quality filter is set to 60.
// If number of threads is not specified, it defaults to the number of logical cores on the machine.
let result = iterate_bam_file(bam_path, Some(4), Some(60), None, None);
// Process the result...

The results in this case are returned as a CnvResult, which has the following structure:

/// Results struct for python

#[pyclass]
#[derive(Debug)]
pub struct CnvResult {
    /// The CNV per contig
    #[pyo3(get)]
    pub cnv: PyObject,
    /// Bin width
    #[pyo3(get)]
    pub bin_width: usize,
    /// Genome length
    #[pyo3(get)]
    pub genome_length: usize,
    /// Variance of the whole genome
    #[pyo3(get)]
    pub variance: f64,
}

Here result.cnv is a Python dict (held as a PyObject) that maps each contig in the reference genome to its copy number values, one per bin of bin_width bases. result.bin_width is the width of the bins in bases, result.genome_length is the total length of the genome, and result.variance is a measure of the variance across the whole genome.

Variance is calculated as the average of the squared differences from the mean.
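As a minimal sketch of that definition (the mean of the squared deviations from the mean, i.e. the population variance), the snippet below computes it in plain Python. The example values and the choice to pool every bin from every contig into one list are assumptions made for illustration; the library's own implementation may differ in detail.

```python
def population_variance(values):
    """Average of the squared differences from the mean."""
    if not values:
        raise ValueError("need at least one value")
    mean = sum(values) / len(values)
    return sum((v - mean) ** 2 for v in values) / len(values)


# Hypothetical copy number values, shaped like result.cnv (dict of lists).
cnv = {"chr1": [2, 2, 3], "chr2": [1, 2]}

# Assumption: all bins of all contigs are pooled before taking the variance.
pooled = [value for bins in cnv.values() for value in bins]
variance = population_variance(pooled)
```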

[!NOTE] Only the start of the primary alignment is binned; supplementary and secondary alignments are ignored. Supplementary alignments can be included by setting exclude_supplementary to False.
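The filtering described in the note can be sketched with the standard SAM flag bits (0x100 for secondary, 0x800 for supplementary). This is an illustration only, not the library's code: the record tuples, the helper names is_binned and bin_starts, the treatment of unmapped reads, and the use of `>=` for the mapping quality comparison are all assumptions.

```python
UNMAPPED = 0x4          # SAM flag bit: read is unmapped
SECONDARY = 0x100       # SAM flag bit: secondary alignment
SUPPLEMENTARY = 0x800   # SAM flag bit: supplementary alignment


def is_binned(flag, mapq, mapq_filter=60, exclude_supplementary=True):
    """Decide whether an alignment's start position should be counted."""
    if flag & UNMAPPED or flag & SECONDARY:
        return False
    if exclude_supplementary and flag & SUPPLEMENTARY:
        return False
    # Assumption: reads at or above the threshold pass the filter.
    return mapq >= mapq_filter


def bin_starts(records, bin_width, **filter_kwargs):
    """Count alignment starts per (contig, bin index).

    records is a hypothetical iterable of (contig, start, flag, mapq) tuples.
    """
    counts = {}
    for contig, start, flag, mapq in records:
        if is_binned(flag, mapq, **filter_kwargs):
            bins = counts.setdefault(contig, {})
            index = start // bin_width
            bins[index] = bins.get(index, 0) + 1
    return counts


records = [
    ("chr1", 150, 0, 60),      # primary, passes
    ("chr1", 180, 0x800, 60),  # supplementary
    ("chr1", 20, 0x100, 60),   # secondary, always ignored
    ("chr1", 900, 16, 10),     # primary, but low mapping quality
]
default_counts = bin_starts(records, bin_width=100)
with_supplementary = bin_starts(records, bin_width=100, exclude_supplementary=False)
```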

Directory analysis

To analyse a directory of BAM files, use the iterate_bam_dir function:

use cnv_from_bam::iterate_bam_dir;
use std::path::PathBuf;

let bam_path = PathBuf::from("path/to/bam_directory/");
// Iterate over the BAM files in the directory and calculate CNV values for the whole. Number of threads is set to 4 and mapping quality filter is set to 60.
// If number of threads is not specified, it defaults to the number of logical cores on the machine.
let result = iterate_bam_dir(bam_path, Some(4), Some(60));

This again returns a CnvResult, but this time the CNV values are summed across all BAM files in the directory. The bin width and genome length are calculated based on the first BAM file in the directory.
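To make the "summed across all BAM files" step concrete, here is a small sketch that adds per-bin counts from several files, contig by contig. The dict-of-lists layout mirrors result.cnv, but the function name sum_counts and the idea that raw counts are what gets summed are assumptions for illustration, not a description of the library internals.

```python
from itertools import zip_longest


def sum_counts(per_file_counts):
    """Sum per-bin counts across files, contig by contig.

    per_file_counts is a hypothetical list of {contig: [count per bin]} dicts,
    all produced with the same bin width against the same reference.
    """
    total = {}
    for counts in per_file_counts:
        for contig, bins in counts.items():
            existing = total.get(contig, [])
            # zip_longest pads the shorter list with zeros.
            total[contig] = [a + b for a, b in zip_longest(existing, bins, fillvalue=0)]
    return total


combined = sum_counts([
    {"chr1": [1, 2]},
    {"chr1": [3, 4], "chr2": [5]},
])
```

Summing bin by bin only makes sense when every file shares the same reference and bin width, which is why the bin width is fixed from the first file.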

[!NOTE] All BAM files in the directory must be aligned to the same reference genome.

Python Integration

cnv_from_bam can be used in Python using the PyO3 bindings. To install the Python bindings, run:

pip install cnv_from_bam

The same iterate_bam_file is available in Python, accepting a path to a BAM file or a directory of BAM files, the number of threads (set to None to use the optimal number of threads for the machine), and the mapping quality filter.

Example of a simple plot in Python. You will need matplotlib and numpy installed (pip install matplotlib numpy).

from pathlib import Path

import numpy as np
from matplotlib import pyplot as plt

from cnv_from_bam import iterate_bam_file

fig, ax = plt.subplots(nrows=1, ncols=1, figsize=(8, 3))
total = 0  # running x offset, so contigs are plotted end to end
bam_path = Path("path/to/bam/file.bam")
# Iterate over the BAM file and calculate CNV values for each bin. Number of threads is set to 4 and mapping quality filter is set to 60.
# If number of threads is not specified, it defaults to the optimal number of threads for the machine.
result = iterate_bam_file(bam_path, _threads=4, mapq_filter=60)
for contig, cnv in result.cnv.items():
    ax.scatter(x=np.arange(len(cnv)) + total, y=cnv, s=0.1)
    total += len(cnv)

ax.set_ylim((0, 8))
ax.set_xlim((0, total))
fig.savefig("Example_cnv_plot.png")

The result should look something like this. The CNV data is just a dictionary of lists, so you can plot it however you like with matplotlib, seaborn, etc.

[Figure: example CNV plot]
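Because the data is a plain dictionary of lists, the x offset of each contig in a genome-wide plot is just a running total of bin counts. The helper below is a sketch of that bookkeeping; the name contig_offsets and the example values are made up for illustration. The offsets could be used, for instance, to place tick labels or vertical separators between contigs.

```python
def contig_offsets(cnv):
    """Return ({contig: first bin index on the x axis}, total number of bins)."""
    offsets = {}
    total = 0
    for contig, bins in cnv.items():
        offsets[contig] = total
        total += len(bins)
    return offsets, total


# Hypothetical data shaped like result.cnv.
cnv = {"chr1": [2, 2, 3], "chr2": [1, 2], "chrX": [1]}
offsets, total_bins = contig_offsets(cnv)

# A bin index within a contig converts to a genomic coordinate via the bin width.
bin_width = 1000  # assumed value; in practice read it from result.bin_width
chr2_second_bin_start = 1 * bin_width
```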

Iterative use

It is possible to iteratively add bam files to a continuing count. By passing a dictionary to iterate_bam_file, the intermediate mapping start counts are kept in this dictionary.

This is limited to parsing files one at a time, rather than by directory.

from pathlib import Path

from cnv_from_bam import iterate_bam_file

# Intermediate mapping start counts accumulate in this dictionary between calls
update = {}
bam_path = Path("path/to/bam/file.bam")
result = iterate_bam_file(bam_path, _threads=4, mapq_filter=60, copy_numbers=update)

bam_path_2 = Path("path/to/bam/file2.bam")
result = iterate_bam_file(bam_path_2, _threads=4, mapq_filter=60, copy_numbers=update)
# Result now contains the copy number as inferred by both BAMs

Output

This is new in version >= 0.3. If you just want raw stdout from rust and no faffing with loggers, use v0.2.

Progress Bar

By default, a progress bar is displayed, showing the progress of the iteration of each BAM file. To disable the progress bar, set the CI environment variable to 1 in your python script:

import os
os.environ["CI"] = "1"
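If the progress bar should only be silenced for part of a script, one option is to set and then restore the variable around the call. The context manager below is a sketch using only the standard library; the name progress_bar_disabled is made up, and it assumes the library reads CI at call time rather than once at import, which is not confirmed here.

```python
import os
from contextlib import contextmanager


@contextmanager
def progress_bar_disabled():
    """Temporarily set CI=1, restoring the previous value afterwards."""
    previous = os.environ.get("CI")
    os.environ["CI"] = "1"
    try:
        yield
    finally:
        if previous is None:
            os.environ.pop("CI", None)
        else:
            os.environ["CI"] = previous


# Usage: calls made inside the block see CI=1.
with progress_bar_disabled():
    value_inside = os.environ["CI"]
```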

Logging

We use the log crate for logging. By default, the log level is set to INFO, which means that the program will output the progress of the iteration of each BAM file. To disable all but warning and error logging, set the log level to WARN on the iterate_bam_file function:

import logging
from pathlib import Path
from cnv_from_bam import iterate_bam_file
bam_path = Path("path/to/bam/file.bam")
iterate_bam_file(bam_path, _threads=4, mapq_filter=60, log_level=int(logging.WARN))

logging.WARN is a constant from the logging module that holds the integer value of the level. The values are:

Level Value
CRITICAL 50
ERROR 40
WARNING 30
INFO 20
DEBUG 10
NOTSET 0
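The values in the table can be read straight from Python's standard logging module, which avoids hard-coding integers. The snippet below only uses the standard library and simply builds the same name-to-value mapping.

```python
import logging

# Name -> integer value, taken from the logging module's own constants.
levels = {
    name: getattr(logging, name)
    for name in ("CRITICAL", "ERROR", "WARNING", "INFO", "DEBUG", "NOTSET")
}

# logging.WARN is an alias for logging.WARNING.
warn_value = int(logging.WARN)

# getLevelName maps an integer level back to its name.
warn_name = logging.getLevelName(warn_value)
```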

[!NOTE] In v0.3 a regression was introduced, whereby keeping the GIL for logging meant that BAM reading was suddenly single threaded again. Whilst it was possible to fix this and keep PyO3-log, I decided to go for truly maximum speed instead. The only drawback to the removal of PyO3-log in (v0.4+) is that log messages will not be handled by python loggers, so they won't be written out by a file handler, for example.

Documentation

To generate the documentation, run:

cargo doc --open

Contributing

Contributions to cnv_from_bam are welcome!

We use pre-commit hooks (particularly cargo-fmt and ruff) to ensure that code is formatted correctly and passes all tests before being committed. To install the pre-commit hooks, run:

git clone https://github.com/Adoni5/cnv_from_bam.git
cd cnv_from_bam
pip install -e .[dev]
pre-commit install -t pre-commit -t post-checkout -t post-merge
pre-commit run --all-files

Changelog

v0.4.3

Iterative use

  • It is possible to iteratively add bam files to a continuing count. By passing a dictionary to iterate_bam_file, the intermediate mapping start counts are kept in this dictionary. This is limited to parsing files one at a time, rather than by directory. See example above under iterative use.

  • Catches bug where metadata (mapped and unmapped count) for a reference sequence in a BAI or CSI file would return None, and crash the calculation. As this was used to calculate the Progress bar total, just skips the offending reference sequence, returning - for both counts. May mean the progress bar can have a lower total than number of reads, but won't matter to final numbers.

  • Adds Cargo tests to CI

v0.4.2

  • Returns the contig names naturally sorted, rather than in random order!! For example chr1, chr2, chr3...chr22,chrM,chrX,chrY! Huge, will prevent some people getting repeatedly confused about expected CNV vs. Visualised and wasting an hour debugging a non existing issue.
  • Returns variance across the whole genome in the CNV result struct.
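For readers who want to reproduce the natural ordering on their own contig lists, the following is a common way to do it in Python. It is a sketch of the general technique (split names into text and number runs, compare numbers numerically), not the library's Rust implementation; natural_key is a made-up helper name.

```python
import re


def natural_key(name):
    """Split a name into text and number runs so numbers compare numerically."""
    parts = re.split(r"(\d+)", name)
    # re.split with a capturing group puts the digit runs at odd indices.
    return [int(part) if index % 2 else part for index, part in enumerate(parts)]


contigs = ["chr10", "chrX", "chr2", "chrM", "chr1", "chrY", "chr22"]
ordered = sorted(contigs, key=natural_key)
```

A plain sorted(contigs) would place chr10 before chr2, which is the kind of ordering surprise the changelog entry describes.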

v0.4.1

  • Add exclude_supplementary parameter to iterate_bam_file, to exclude supplementary alignments (default True)

v0.4.0

  • Remove PyO3-log for maximum speed. This means that log messages will not be handled by python loggers. Can set log level on call to iterate_bam_file

v0.3.0

  • Introduce PyO3-log for logging. This means that log messages can be handled by python loggers, so they can be written out by a file handler, for example.
  • HAS A LARGE PERFORMANCE ISSUE
  • Can disable progress bar display by setting CI environment variable to 1 in python script.

v0.2.0

  • Purely rust based BAM parsing, using noodles.
  • Uses a much more sensible number for threading if not provided.
  • Allows iteration of BAMS in a directory

v0.1.0

  • Initial release
  • Uses rust-bio/rust-htslib for BAM parsing. Has to bind C code, is a faff.

License

This project is licensed under the Mozilla Public License 2.0.
