MCCNado: Rust-based tools for processing Micro-Capture-C data using SeqNado
MCCNado
A high-performance Rust library with Python bindings for processing Micro-Capture-C (MCC) sequencing data.
Overview
MCCNado is a bioinformatics tool designed for analyzing chromatin conformation capture sequencing data. It provides efficient implementations for common preprocessing tasks including FASTQ deduplication, viewpoint read splitting, BAM annotation, and ligation junction analysis.
Features
- FASTQ Deduplication: Remove duplicate reads from single-end and paired-end FASTQ files
- Viewpoint Read Splitting: Split reads containing viewpoint sequences into constituent segments
- BAM Annotation: Add metadata tags to BAM files for downstream analysis
- Ligation Junction Identification: Extract and analyze chromatin interaction data
- Ligation Statistics: Generate comprehensive statistics on cis/trans interactions
- High Performance: Implemented in Rust with optional async processing for large datasets
Installation
From PyPI (recommended)
pip install mccnado
From Source
git clone https://github.com/yourusername/MCCNado.git
cd MCCNado
pip install .
Development Installation
git clone https://github.com/yourusername/MCCNado.git
cd MCCNado
pip install -e .
Requirements
- Python 3.10+
- Rust (for building from source)
- samtools (for BAM file processing)
Quick Start
After installation, you can immediately use the mccnado command:
# Deduplicate a BAM file
mccnado deduplicate-bam input.bam output.bam
# View all available commands
mccnado --help
# Get help for a specific command
mccnado deduplicate-bam --help
Usage
Tool Overview
MCCNado provides several specialized tools for MCC data processing:
1. FASTQ Deduplication
Removes duplicate reads from FASTQ files by comparing sequence content and quality scores. Useful for removing PCR duplicates before alignment.
2. BAM Deduplication
Removes duplicate alignments from BAM files based on genomic coordinates and alignment information. Identifies and filters PCR duplicates that have the same mapping location.
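The idea of coordinate-based deduplication can be illustrated with a minimal sketch. The `Alignment` record and the `dedup_by_coordinates` helper are hypothetical names used only for illustration, and the key chosen here (chromosome, start, end, strand) is an assumption; MCCNado's actual Rust implementation may use different or additional alignment information.

```python
from typing import Iterable, Iterator, NamedTuple


class Alignment(NamedTuple):
    """Hypothetical, simplified alignment record (not MCCNado's own type)."""
    name: str
    chrom: str
    start: int
    end: int
    strand: str


def dedup_by_coordinates(alignments: Iterable[Alignment]) -> Iterator[Alignment]:
    """Yield the first alignment seen for each (chrom, start, end, strand) key."""
    seen = set()
    for aln in alignments:
        key = (aln.chrom, aln.start, aln.end, aln.strand)
        if key in seen:
            continue  # same mapping location: treated as a PCR duplicate
        seen.add(key)
        yield aln


reads = [
    Alignment("r1", "chr1", 100, 150, "+"),
    Alignment("r2", "chr1", 100, 150, "+"),  # duplicate of r1
    Alignment("r3", "chr1", 100, 150, "-"),  # same span, opposite strand
    Alignment("r4", "chr2", 100, 150, "+"),
]
kept = [aln.name for aln in dedup_by_coordinates(reads)]
print(kept)
```

Only the first alignment at each location is kept, so `r2` is dropped while `r3` and `r4` survive.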
3. Viewpoint Read Splitting
Splits composite reads containing viewpoint sequences into separate segments for independent analysis. Useful when reads contain both viewpoint and flanking sequence information.
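As a toy illustration of what splitting a composite read means, the sketch below cuts a read string around the first exact occurrence of a viewpoint sequence. The function name `split_on_viewpoint` and the exact-match strategy are assumptions made for this example; MCCNado itself operates on BAM files and its matching logic is not described here.

```python
def split_on_viewpoint(read: str, viewpoint: str) -> list[str]:
    """Split a read into [left flank, viewpoint, right flank].

    Empty flanks are dropped. If the viewpoint is not found,
    the read is returned unchanged as a single segment.
    """
    idx = read.find(viewpoint)
    if idx == -1:
        return [read]
    parts = [read[:idx], viewpoint, read[idx + len(viewpoint):]]
    return [part for part in parts if part]


segments = split_on_viewpoint("AAACCGGTTT", "CCGG")
print(segments)
```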
4. BAM Annotation
Adds MCC-specific metadata tags to BAM files, including viewpoint information, oligo coordinates, and reporter tags for classification.
5. Ligation Statistics
Analyzes chromatin ligation events and generates statistics on cis/trans interactions, helping characterize the quality and type of chromatin interactions in your data.
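The cis/trans distinction can be sketched in a few lines, assuming the usual definition: an interaction is cis when the reporter maps to the same chromosome as the viewpoint and trans otherwise. The helper names and the summary layout below are hypothetical and do not reflect the structure of MCCNado's JSON report.

```python
from collections import Counter
from typing import Iterable, Tuple


def classify_ligation(viewpoint_chrom: str, reporter_chrom: str) -> str:
    """Return 'cis' for same-chromosome interactions, 'trans' otherwise."""
    return "cis" if viewpoint_chrom == reporter_chrom else "trans"


def summarise(pairs: Iterable[Tuple[str, str]]) -> dict:
    """Count cis and trans interactions and report the cis fraction."""
    counts = Counter(classify_ligation(vp, rep) for vp, rep in pairs)
    total = sum(counts.values())
    return {
        "cis": counts["cis"],
        "trans": counts["trans"],
        "cis_fraction": counts["cis"] / total if total else 0.0,
    }


# (viewpoint chromosome, reporter chromosome) pairs, made up for the example
pairs = [("chr1", "chr1"), ("chr1", "chr5"), ("chr1", "chr1"), ("chr2", "chrX")]
summary = summarise(pairs)
print(summary)
```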
6. Ligation Junction Identification
Identifies and extracts ligation junction sequences from BAM files, useful for validating chromatin interactions and analyzing junction characteristics.
Python API
import mccnado

# 1. Deduplicate FASTQ files
# Removes duplicate reads by comparing sequences
stats = mccnado.deduplicate_fastq(
    fastq1="input_R1.fastq.gz",
    output1="output_R1.fastq.gz",
    fastq2="input_R2.fastq.gz",    # Optional for paired-end
    output2="output_R2.fastq.gz"   # Optional for paired-end
)
print(f"Total reads: {stats.total_reads}")
print(f"Unique reads: {stats.unique_reads}")
print(f"Duplicate reads: {stats.duplicate_reads}")

# 2. Deduplicate BAM files
# Removes PCR duplicates based on genomic coordinates
bam_stats = mccnado.deduplicate_bam(
    bam="aligned_reads.bam",
    output="deduplicated.bam"
)
print(f"Unique molecules: {bam_stats.unique_molecules}")
print(f"Duplicate molecules: {bam_stats.duplicate_molecules}")

# 3. Split viewpoint reads
# Separates composite reads into individual segments
mccnado.split_viewpoint_reads(
    bam="aligned_reads.bam",
    output="split_reads.bam"
)

# 4. Annotate BAM file with MCC metadata
# Adds VP (viewpoint), OC (oligo coordinates), and RT (reporter tag) tags
mccnado.annotate_bam(
    bam="input.bam",
    output="annotated.bam"
)

# 5. Extract ligation statistics
# Generates JSON report of cis/trans interactions and other statistics
mccnado.extract_ligation_stats(
    bam="annotated.bam",
    stats="ligation_stats.json"
)

# 6. Identify ligation junctions
# Extracts junction sequences and writes to output directory
mccnado.identify_ligation_junctions(
    bam="annotated.bam",
    output_directory="junctions/"
)
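The statistics returned by the deduplication step can be turned into a single duplication rate. The sketch below uses a stand-in dataclass, `FakeDedupStats`, that only mirrors the three attribute names printed above; it is not MCCNado's real return type, and `duplication_rate` is a hypothetical helper.

```python
from dataclasses import dataclass


@dataclass
class FakeDedupStats:
    """Stand-in for the stats object; mirrors only the attribute names used above."""
    total_reads: int
    unique_reads: int
    duplicate_reads: int


def duplication_rate(stats) -> float:
    """Fraction of reads flagged as duplicates; 0.0 for an empty input."""
    if stats.total_reads == 0:
        return 0.0
    return stats.duplicate_reads / stats.total_reads


example = FakeDedupStats(total_reads=1000, unique_reads=800, duplicate_reads=200)
rate = duplication_rate(example)
print(f"Duplication rate: {rate:.1%}")
```

Because the helper only reads attributes, any object exposing `total_reads` and `duplicate_reads` can be passed to it.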
Command Line Interface
MCCNado provides a command-line interface through the mccnado command, which is available after installation. The CLI validates its arguments and reports informative error messages.
Available Commands
# View all available commands and options
mccnado --help
# Deduplicate FASTQ files (single-end)
mccnado deduplicate-fastq input.fastq.gz output.fastq.gz
# Deduplicate FASTQ files (paired-end)
mccnado deduplicate-fastq input_R1.fastq.gz output_R1.fastq.gz \
--fastq2 input_R2.fastq.gz --output2 output_R2.fastq.gz
# Remove PCR duplicates from BAM files
mccnado deduplicate-bam aligned_reads.bam deduplicated.bam
# Split reads containing viewpoint sequences
mccnado split-viewpoint-reads aligned_reads.bam split_reads.bam
# Annotate BAM files with MCC-specific metadata
mccnado annotate-bam input.bam annotated.bam
# Extract ligation statistics
mccnado extract-ligation-stats annotated.bam stats.json
# Identify ligation junctions
mccnado identify-ligation-junctions annotated.bam junctions/
# Get detailed help for any command
mccnado deduplicate-bam --help
mccnado deduplicate-fastq --help
CLI Features
- Input Validation: Automatically checks for file existence and correct file formats
- Clear Error Messages: Informative error reporting when issues are encountered
- Summary Output: Commands that deduplicate data display summary statistics
- Help System: Use --help with any command for detailed usage information
- Command Name Aliases: Commands support both hyphenated and underscored formats (e.g., deduplicate-bam or deduplicate_bam)
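When driving the CLI from a Python script, it can help to build the argument list explicitly before handing it to `subprocess`. The sketch below assembles the `deduplicate-fastq` invocation using only the positional arguments and the `--fastq2` / `--output2` flags shown above; the function name `build_dedup_fastq_command` is hypothetical, and the pairing check is a convenience added for this example rather than documented CLI behavior.

```python
import shlex
from typing import List, Optional


def build_dedup_fastq_command(
    fastq1: str,
    output1: str,
    fastq2: Optional[str] = None,
    output2: Optional[str] = None,
) -> List[str]:
    """Build the argv list for `mccnado deduplicate-fastq`."""
    if (fastq2 is None) != (output2 is None):
        raise ValueError("fastq2 and output2 must be given together")
    cmd = ["mccnado", "deduplicate-fastq", fastq1, output1]
    if fastq2 is not None:
        cmd += ["--fastq2", fastq2, "--output2", output2]
    return cmd


cmd = build_dedup_fastq_command(
    "input_R1.fastq.gz",
    "output_R1.fastq.gz",
    fastq2="input_R2.fastq.gz",
    output2="output_R2.fastq.gz",
)
print(shlex.join(cmd))
# With mccnado installed, the command could be run via:
#   subprocess.run(cmd, check=True)
```

Passing a list rather than a shell string avoids quoting problems with unusual file names.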
File Formats
Input Files
- FASTQ: Raw sequencing reads (single-end or paired-end, gzipped or uncompressed)
- BAM: Aligned reads with proper headers and indexing
Output Files
- FASTQ: Deduplicated reads
- BAM: Annotated alignment files with MCC-specific tags
- JSON: Ligation statistics and metadata
BAM Tags Added by MCCNado
- VP: Viewpoint name
- OC: Oligo coordinates
- RT: Reporter tag (0 for capture reads, 1 for reporter reads)
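For downstream scripts, these tags can be read from the optional fields of a SAM record, which follow the standard TAG:TYPE:VALUE layout after the eleven mandatory columns. The sketch below is a plain-text parser under stated assumptions: the type codes (Z for VP and OC, i for RT) and the example OC value format are guesses made for illustration, and `parse_mcc_tags` is a hypothetical helper.

```python
def parse_mcc_tags(sam_line: str) -> dict:
    """Extract the VP, OC and RT tags from one tab-separated SAM record."""
    fields = sam_line.rstrip("\n").split("\t")
    tags = {}
    for field in fields[11:]:  # optional fields start at column 12
        tag, type_code, value = field.split(":", 2)
        if tag in {"VP", "OC", "RT"}:
            # integer-typed tags are converted, everything else stays a string
            tags[tag] = int(value) if type_code == "i" else value
    return tags


# Made-up record; tag types and the OC value format are assumptions
record = "\t".join([
    "read1", "0", "chr1", "1000", "60", "10M", "*", "0", "0",
    "ACGTACGTAC", "IIIIIIIIII",
    "VP:Z:geneA", "OC:Z:chr1:900-1020", "RT:i:1",
])
tags = parse_mcc_tags(record)
print(tags)
```

In practice a BAM file would first be converted to text (for example with samtools view) or read with a dedicated library.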
Performance
MCCNado is optimized for large-scale data processing:
- Memory Efficient: Streaming processing for large files
- Parallel Processing: Multi-threaded operations where applicable
- Fast Hashing: Uses xxHash for rapid duplicate detection
- Batch Processing: Configurable batch sizes for optimal performance
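The combination of streaming and hash-based duplicate detection can be sketched as a generator that stores one short digest per unique sequence instead of the sequence itself. This example uses `hashlib.blake2b` from the standard library as a stand-in for xxHash, and the `Record` layout and function names are hypothetical; it illustrates the general technique, not MCCNado's Rust code.

```python
import hashlib
from typing import Iterable, Iterator, Tuple

Record = Tuple[str, str, str]  # (read name, sequence, quality string)


def sequence_digest(sequence: str) -> bytes:
    """8-byte digest of a sequence (blake2b used here in place of xxHash)."""
    return hashlib.blake2b(sequence.encode("ascii"), digest_size=8).digest()


def dedup_stream(records: Iterable[Record]) -> Iterator[Record]:
    """Yield records one at a time, skipping sequences already seen."""
    seen = set()
    for record in records:
        digest = sequence_digest(record[1])
        if digest in seen:
            continue  # duplicate sequence
        seen.add(digest)
        yield record


records = [
    ("r1", "ACGT", "IIII"),
    ("r2", "ACGT", "IIII"),  # same sequence as r1
    ("r3", "TTGA", "IIII"),
]
unique_names = [name for name, _, _ in dedup_stream(records)]
print(unique_names)
```

Memory use grows with the number of unique sequences rather than with total file size. A 64-bit digest keeps the set small at the cost of a very small chance of collisions.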
Architecture
The package consists of several core modules:
- fastq_deduplicate: FASTQ deduplication logic
- viewpoint_read_splitter: Read segmentation functionality
- mcc_data_handler: BAM annotation and processing
- ligation_stats: Statistical analysis of ligation events
- utils: Common utilities and data structures
Development
Building from Source
# Install Rust
curl --proto '=https' --tlsv1.2 -sSf https://sh.rustup.rs | sh
# Clone and build
git clone https://github.com/yourusername/MCCNado.git
cd MCCNado
cargo build --release
# Install Python package
pip install -e .
Running Tests
# Rust tests
cargo test
# Python tests
python -m pytest tests/
Contributing
- Fork the repository
- Create a feature branch (git checkout -b feature/amazing-feature)
- Commit your changes (git commit -m 'Add amazing feature')
- Push to the branch (git push origin feature/amazing-feature)
- Open a Pull Request
License
This project is licensed under the MIT License - see the LICENSE file for details.
Citation
If you use MCCNado in your research, please cite:
[Your Citation Here]
Support
For questions, issues, or feature requests, please:
- Check the documentation
- Search existing issues
- Open a new issue if needed
Acknowledgments