
MCCNado: Rust-based tools for processing Micro-Capture-C data with SeqNado


MCCNado

A high-performance Rust library with Python bindings for processing Micro-Capture-C (MCC) sequencing data.

Overview

MCCNado is a bioinformatics tool designed for analyzing chromatin conformation capture sequencing data. It provides efficient implementations for common preprocessing tasks including FASTQ deduplication, viewpoint read splitting, BAM annotation, and ligation junction analysis.

Features

  • FASTQ Deduplication: Remove duplicate reads from single-end and paired-end FASTQ files
  • Viewpoint Read Splitting: Split reads containing viewpoint sequences into constituent segments
  • BAM Annotation: Add metadata tags to BAM files for downstream analysis
  • Ligation Junction Identification: Extract and analyze chromatin interaction data
  • Ligation Statistics: Generate comprehensive statistics on cis/trans interactions
  • High Performance: Implemented in Rust with optional async processing for large datasets

Installation

From PyPI (recommended)

pip install mccnado

From Source

git clone https://github.com/alsmith151/MCCNado.git
cd MCCNado
pip install .

Development Installation

git clone https://github.com/alsmith151/MCCNado.git
cd MCCNado
pip install -e .

Requirements

  • Python 3.10+
  • Rust (for building from source)
  • samtools (for BAM file processing)

Quick Start

After installation, you can immediately use the mccnado command:

# Deduplicate a BAM file
mccnado deduplicate-bam input.bam output.bam

# View all available commands
mccnado --help

# Get help for a specific command
mccnado deduplicate-bam --help

Usage

Tool Overview

MCCNado provides several specialized tools for MCC data processing:

1. FASTQ Deduplication

Removes duplicate reads from FASTQ files by comparing sequence content and quality scores. Useful for removing PCR duplicates before alignment.
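
As a rough illustration of hash-based FASTQ deduplication, the sketch below keeps the first record seen for each distinct sequence. It is not MCCNado's Rust implementation: it keys on the sequence only, uses hashlib.blake2b from the standard library as a stand-in hash, and the names read_fastq and dedup_fastq are hypothetical.

```python
import hashlib
import io
from typing import Iterator, List, TextIO, Tuple

FastqRecord = Tuple[str, str, str, str]  # header, sequence, separator, quality


def read_fastq(handle: TextIO) -> Iterator[FastqRecord]:
    """Yield four-line FASTQ records from an open text handle."""
    while True:
        header = handle.readline().rstrip("\n")
        if not header:
            return  # end of input
        seq = handle.readline().rstrip("\n")
        sep = handle.readline().rstrip("\n")
        qual = handle.readline().rstrip("\n")
        yield header, seq, sep, qual


def dedup_fastq(handle: TextIO) -> Tuple[List[FastqRecord], int]:
    """Return the unique records (first occurrence wins) and the total read count."""
    seen = set()
    unique: List[FastqRecord] = []
    total = 0
    for record in read_fastq(handle):
        total += 1
        # Assumption: duplicates are defined by identical sequence only.
        digest = hashlib.blake2b(record[1].encode(), digest_size=8).digest()
        if digest not in seen:
            seen.add(digest)
            unique.append(record)
    return unique, total


example = io.StringIO(
    "@r1\nACGT\n+\nIIII\n"
    "@r2\nACGT\n+\nIIII\n"
    "@r3\nTTTT\n+\nIIII\n"
)
unique_records, total_reads = dedup_fastq(example)
print(f"Total reads: {total_reads}, unique reads: {len(unique_records)}")
```

Storing short digests instead of full sequences keeps the set of seen reads small, which is the usual reason for hashing in this kind of filter.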

2. BAM Deduplication

Removes duplicate alignments from BAM files based on genomic coordinates and alignment information. Identifies and filters PCR duplicates that have the same mapping location.
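
The idea of coordinate-based deduplication can be sketched in plain Python. This example assumes that a molecule is identified by the set of coordinates of all its aligned segments, which may differ from the rule MCCNado applies; the Segment tuple and dedup_by_coordinates are hypothetical names used only for illustration.

```python
from collections import defaultdict
from typing import Dict, List, Tuple

Segment = Tuple[str, str, int, int, str]  # read name, chrom, start, end, strand


def dedup_by_coordinates(segments: List[Segment]) -> Tuple[List[str], int]:
    """Return the read names kept and the number of duplicate molecules."""
    by_read: Dict[str, List[Tuple[str, int, int, str]]] = defaultdict(list)
    for name, chrom, start, end, strand in segments:
        by_read[name].append((chrom, start, end, strand))

    seen = set()
    kept: List[str] = []
    duplicates = 0
    for name, coords in by_read.items():
        # Sort so the key does not depend on the order of the segments.
        key = tuple(sorted(coords))
        if key in seen:
            duplicates += 1
        else:
            seen.add(key)
            kept.append(name)
    return kept, duplicates


segments = [
    ("read1", "chr1", 100, 150, "+"),
    ("read1", "chr1", 5000, 5040, "-"),
    ("read2", "chr1", 100, 150, "+"),   # same two segments as read1
    ("read2", "chr1", 5000, 5040, "-"),
    ("read3", "chr1", 100, 150, "+"),   # shares only one segment with read1
]
kept_reads, n_duplicates = dedup_by_coordinates(segments)
print(kept_reads, n_duplicates)
```

Here read2 is dropped as a duplicate of read1, while read3 is kept because its full set of segments differs.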

3. Viewpoint Read Splitting

Splits composite reads containing viewpoint sequences into separate segments for independent analysis. Useful when reads contain both viewpoint and flanking sequence information.
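
A conceptual sketch of splitting a read around a viewpoint sequence is shown below. It works on plain strings and assumes an exact match to the viewpoint, whereas the MCCNado tool operates on read files; split_on_viewpoint and the segment labels are hypothetical.

```python
from typing import List, Tuple


def split_on_viewpoint(read: str, viewpoint: str) -> List[Tuple[str, str]]:
    """Split a read into labelled segments around the first viewpoint match."""
    idx = read.find(viewpoint)
    if idx == -1:
        # No viewpoint found: return the read unchanged.
        return [("unassigned", read)]
    parts = [
        ("left_flank", read[:idx]),
        ("viewpoint", viewpoint),
        ("right_flank", read[idx + len(viewpoint):]),
    ]
    # Drop empty flanks, e.g. when the viewpoint sits at the read start.
    return [(label, seq) for label, seq in parts if seq]


segments = split_on_viewpoint("AAAACCGGTTTT", "CCGG")
print(segments)
```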

4. BAM Annotation

Adds MCC-specific metadata tags to BAM files, including viewpoint information, oligo coordinates, and reporter tags for classification.

5. Ligation Statistics

Analyzes chromatin ligation events and generates statistics on cis/trans interactions, helping characterize the quality and type of chromatin interactions in your data.
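
The cis/trans distinction can be illustrated with a small counter: an interaction is cis when the reporter lies on the same chromosome as the viewpoint and trans otherwise. The input format (pairs of chromosome names) and the function name count_cis_trans are assumptions made for this sketch, not MCCNado's output schema.

```python
from collections import Counter
from typing import Iterable, Tuple


def count_cis_trans(pairs: Iterable[Tuple[str, str]]) -> Counter:
    """Count cis and trans interactions from (viewpoint_chrom, reporter_chrom) pairs."""
    counts: Counter = Counter()
    for viewpoint_chrom, reporter_chrom in pairs:
        kind = "cis" if viewpoint_chrom == reporter_chrom else "trans"
        counts[kind] += 1
    return counts


pairs = [("chr1", "chr1"), ("chr1", "chr1"), ("chr1", "chr7"), ("chr1", "chr1")]
counts = count_cis_trans(pairs)
cis_fraction = counts["cis"] / sum(counts.values())
print(dict(counts), cis_fraction)
```

A high cis fraction is commonly used as a quality indicator in capture-based experiments, since true ligation events are enriched near the viewpoint.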

6. Ligation Junction Identification

Identifies and extracts ligation junction sequences from BAM files, useful for validating chromatin interactions and analyzing junction characteristics.

Python API

import mccnado

# 1. Deduplicate FASTQ files
# Removes duplicate reads by comparing sequences
stats = mccnado.deduplicate_fastq(
    fastq1="input_R1.fastq.gz",
    output1="output_R1.fastq.gz",
    fastq2="input_R2.fastq.gz",      # Optional for paired-end
    output2="output_R2.fastq.gz"     # Optional for paired-end
)
print(f"Total reads: {stats.total_reads}")
print(f"Unique reads: {stats.unique_reads}")
print(f"Duplicate reads: {stats.duplicate_reads}")

# 2. Deduplicate BAM files
# Removes PCR duplicates based on genomic coordinates
bam_stats = mccnado.deduplicate_bam(
    bam="aligned_reads.bam",
    output="deduplicated.bam"
)
print(f"Unique molecules: {bam_stats.unique_molecules}")
print(f"Duplicate molecules: {bam_stats.duplicate_molecules}")

# 3. Split viewpoint reads
# Separates composite reads into individual segments
mccnado.split_viewpoint_reads(
    bam="aligned_reads.bam",
    output="split_reads.bam"
)

# 4. Annotate BAM file with MCC metadata
# Adds VP (viewpoint), OC (oligo coordinates), and RT (reporter tag) tags
mccnado.annotate_bam(
    bam="input.bam",
    output="annotated.bam"
)

# 5. Extract ligation statistics
# Generates JSON report of cis/trans interactions and other statistics
mccnado.extract_ligation_stats(
    bam="annotated.bam",
    stats="ligation_stats.json"
)

# 6. Identify ligation junctions
# Extracts junction sequences and writes to output directory
mccnado.identify_ligation_junctions(
    bam="annotated.bam",
    output_directory="junctions/"
)
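
The statistics returned by deduplicate_fastq can be summarised as a duplication rate. The sketch below uses a stand-in dataclass (FakeDedupStats, hypothetical) with the attribute names printed in the example above, so it runs without MCCNado installed; with the real library, the returned stats object would be passed instead.

```python
from dataclasses import dataclass


@dataclass
class FakeDedupStats:
    """Stand-in for the stats object; attribute names follow the example above."""
    total_reads: int
    unique_reads: int
    duplicate_reads: int


def duplication_rate(stats) -> float:
    """Fraction of reads flagged as duplicates (0.0 for empty input)."""
    if stats.total_reads == 0:
        return 0.0
    return stats.duplicate_reads / stats.total_reads


stats = FakeDedupStats(total_reads=1000, unique_reads=800, duplicate_reads=200)
rate = duplication_rate(stats)
print(f"Duplication rate: {rate:.1%}")
```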

Command Line Interface

After installation, MCCNado is available on the command line as the mccnado command. The CLI validates its arguments and reports problems with informative error messages.

Available Commands

# View all available commands and options
mccnado --help

# Deduplicate FASTQ files (single-end)
mccnado deduplicate-fastq input.fastq.gz output.fastq.gz

# Deduplicate FASTQ files (paired-end)
mccnado deduplicate-fastq input_R1.fastq.gz output_R1.fastq.gz \
  --fastq2 input_R2.fastq.gz --output2 output_R2.fastq.gz

# Remove PCR duplicates from BAM files
mccnado deduplicate-bam aligned_reads.bam deduplicated.bam

# Split reads containing viewpoint sequences
mccnado split-viewpoint-reads aligned_reads.bam split_reads.bam

# Annotate BAM files with MCC-specific metadata
mccnado annotate-bam input.bam annotated.bam

# Extract ligation statistics
mccnado extract-ligation-stats annotated.bam stats.json

# Identify ligation junctions
mccnado identify-ligation-junctions annotated.bam junctions/

# Get detailed help for any command
mccnado deduplicate-bam --help
mccnado deduplicate-fastq --help
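
When the CLI is driven from a Python workflow, the argument list can be built programmatically. The helper below is a hypothetical sketch that mirrors the deduplicate-fastq invocations shown above; it only constructs and prints the command and does not run it.

```python
import shlex
from typing import List, Optional


def dedup_fastq_command(
    fastq1: str,
    output1: str,
    fastq2: Optional[str] = None,
    output2: Optional[str] = None,
) -> List[str]:
    """Build the argument list for `mccnado deduplicate-fastq`."""
    if (fastq2 is None) != (output2 is None):
        raise ValueError("fastq2 and output2 must be given together")
    cmd = ["mccnado", "deduplicate-fastq", fastq1, output1]
    if fastq2 is not None and output2 is not None:
        cmd += ["--fastq2", fastq2, "--output2", output2]
    return cmd


cmd = dedup_fastq_command(
    "input_R1.fastq.gz", "output_R1.fastq.gz",
    fastq2="input_R2.fastq.gz", output2="output_R2.fastq.gz",
)
print(shlex.join(cmd))
# To execute: subprocess.run(cmd, check=True)
```

Passing the command as a list rather than a shell string avoids quoting problems with unusual file names.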

CLI Features

  • Input Validation: Automatically checks for file existence and correct file formats
  • Clear Error Messages: Informative error reporting when issues are encountered
  • Summary Output: Commands that deduplicate data display summary statistics
  • Help System: Use --help with any command for detailed usage information

Command Name Aliases: Commands support both hyphenated and underscored formats (e.g., deduplicate-bam or deduplicate_bam)
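
Scripts that accept either spelling can map both onto the hyphenated form before calling the CLI. The function below is a hypothetical convenience for wrapper scripts, not part of MCCNado.

```python
def normalize_command(name: str) -> str:
    """Convert an underscored command name to its hyphenated equivalent."""
    return name.strip().replace("_", "-")


for alias in ("deduplicate_bam", "deduplicate-bam", "extract_ligation_stats"):
    print(alias, "->", normalize_command(alias))
```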

File Formats

Input Files

  • FASTQ: Raw sequencing reads (single-end or paired-end, gzipped or uncompressed)
  • BAM: Aligned reads with proper headers and indexing

Output Files

  • FASTQ: Deduplicated reads
  • BAM: Annotated alignment files with MCC-specific tags
  • JSON: Ligation statistics and metadata

BAM Tags Added by MCCNado

  • VP: Viewpoint name
  • OC: Oligo coordinates
  • RT: Reporter tag (0 for capture reads, 1 for reporter reads)
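
These tags can be inspected in SAM text output (for example from samtools view), where optional fields follow the 11 mandatory columns in TAG:TYPE:VALUE form. The sketch below parses such a line with the standard library only. The tag types (Z for VP and OC, i for RT) and the example values are assumptions made for illustration; check real output before relying on them.

```python
from typing import Dict, Union

TagValue = Union[str, int, float]


def parse_optional_fields(sam_line: str) -> Dict[str, TagValue]:
    """Parse the optional TAG:TYPE:VALUE fields of one SAM alignment line."""
    fields = sam_line.rstrip("\n").split("\t")[11:]  # skip the 11 mandatory columns
    tags: Dict[str, TagValue] = {}
    for field in fields:
        tag, type_code, value = field.split(":", 2)
        if type_code == "i":
            tags[tag] = int(value)
        elif type_code == "f":
            tags[tag] = float(value)
        else:
            tags[tag] = value  # keep strings and other types as text
    return tags


# Hypothetical alignment line; tag values are invented for the example.
sam_line = "\t".join([
    "read1", "0", "chr1", "1000", "60", "8M", "*", "0", "0",
    "ACGTACGT", "IIIIIIII",
    "VP:Z:example_viewpoint", "OC:Z:chr1-1000-1120", "RT:i:1",
])
tags = parse_optional_fields(sam_line)
is_reporter = tags.get("RT") == 1
print(tags, is_reporter)
```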

Performance

MCCNado is optimized for large-scale data processing:

  • Memory Efficient: Streaming processing for large files
  • Parallel Processing: Multi-threaded operations where applicable
  • Fast Hashing: Uses xxHash for rapid duplicate detection
  • Batch Processing: Configurable batch sizes for optimal performance
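
Streaming with a configurable batch size can be pictured as follows: records are pulled from an iterator in fixed-size chunks, so only one batch is held in memory at a time. This is a generic Python sketch of the pattern, not MCCNado's internal code, and the function name batched is an illustrative choice.

```python
from itertools import islice
from typing import Iterable, Iterator, List, TypeVar

T = TypeVar("T")


def batched(items: Iterable[T], batch_size: int) -> Iterator[List[T]]:
    """Yield successive lists of at most batch_size items from any iterable."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    iterator = iter(items)
    while True:
        batch = list(islice(iterator, batch_size))
        if not batch:
            return
        yield batch


batches = list(batched(range(7), 3))
print(batches)
```

Larger batches reduce per-batch overhead at the cost of more memory, which is why the size is worth tuning per dataset.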

Architecture

The package consists of several core modules:

Development

Building from Source

# Install Rust
curl --proto '=https' --tlsv1.2 -sSf https://sh.rustup.rs | sh

# Clone and build
git clone https://github.com/alsmith151/MCCNado.git
cd MCCNado
cargo build --release

# Install Python package
pip install -e .

Running Tests

# Rust tests
cargo test

# Python tests
python -m pytest tests/

Contributing

  1. Fork the repository
  2. Create a feature branch (git checkout -b feature/amazing-feature)
  3. Commit your changes (git commit -m 'Add amazing feature')
  4. Push to the branch (git push origin feature/amazing-feature)
  5. Open a Pull Request

License

This project is licensed under the MIT License - see the LICENSE file for details.

Citation

If you use MCCNado in your research, please cite:

[Your Citation Here]

Support

For questions, issues, or feature requests, please:

  1. Check the documentation
  2. Search existing issues
  3. Open a new issue if needed

Acknowledgments

  • Built with PyO3 for Python-Rust interoperability
  • Uses noodles for bioinformatics file format handling
  • Powered by tokio for async operations
