A high-performance k-mer counting tool for MAF alignments.
MAF Counter
MAF Counter is a multithreaded tool designed to efficiently extract and count k-mers from multiple genome alignments in MAF format. It utilizes producer-consumer threading with concurrent queues to parallelize k-mer extraction and aggregation. The tool supports canonical/reverse-complement k-mer handling and provides flexible output options, including per-genome k-mer files and a consolidated single-file format.
Algorithm Overview
- Chunk Division: The input MAF file is divided into approximately equal-sized chunks, each containing whole alignment blocks, ensuring no block spans multiple chunks.
- Parallel Processing: Producer threads process each chunk independently, using a sliding window to extract k-mers and their counts.
- Consumer Merging: Consumer threads merge k-mer groups into a final aggregated data structure, avoiding conflicts by partitioning based on hashed k-mer keys.
- File Writing: Once processing is complete, output is written either as per-genome files in Jellyfish format or as a single compressed file if the -s flag is used.
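The first three steps can be illustrated with a short Python sketch. It is not the tool's C++ implementation: threads and queues are left out, and the function names (`split_into_chunks`, `produce`, `consume`) are hypothetical. It also assumes that the genome ID is the part of the MAF source name before the first dot, that gap characters are dropped before the window slides, that windows containing anything other than A, C, G, T are skipped, and that CRC32 is the partitioning hash. None of these details is confirmed by the description above.

```python
import re
import zlib
from collections import Counter, defaultdict


def split_into_chunks(maf_text, n_chunks):
    """Group whole alignment blocks into at most n_chunks lists of similar size."""
    # Drop header/comment lines, then split on blank lines: one paragraph per block.
    body = "\n".join(l for l in maf_text.splitlines() if not l.startswith("#"))
    blocks = [b.strip() for b in re.split(r"\n\s*\n", body) if b.strip().startswith("a")]
    target = sum(len(b) for b in blocks) / max(n_chunks, 1)
    chunks, current, size = [], [], 0
    for block in blocks:
        current.append(block)  # a block is never split across chunks
        size += len(block)
        if size >= target and len(chunks) < n_chunks - 1:
            chunks.append(current)
            current, size = [], 0
    if current:
        chunks.append(current)
    return chunks


def produce(chunk, k, n_partitions):
    """Producer step: slide a window of length k over every sequence line."""
    partitions = [defaultdict(Counter) for _ in range(n_partitions)]
    for block in chunk:
        for line in block.splitlines():
            fields = line.split()
            if len(fields) != 7 or fields[0] != "s":
                continue  # only 's' (sequence) lines carry alignment text
            genome = fields[1].split(".")[0]          # assumption: ID precedes the dot
            seq = fields[6].replace("-", "").upper()  # assumption: gaps are dropped
            for i in range(len(seq) - k + 1):
                kmer = seq[i:i + k]
                if set(kmer) <= set("ACGT"):
                    # The hash decides which consumer owns this k-mer.
                    p = zlib.crc32(kmer.encode()) % n_partitions
                    partitions[p][kmer][genome] += 1
    return partitions


def consume(partition_index, producer_outputs):
    """Consumer step: merge one partition across all producers' outputs."""
    merged = defaultdict(Counter)
    for output in producer_outputs:
        for kmer, counts in output[partition_index].items():
            merged[kmer].update(counts)
    return merged
```

Because every occurrence of a given k-mer hashes to the same partition, each consumer can merge its partition without touching data owned by another consumer, which is why no locking is needed in the merge step.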
Compilation
To compile MAF Counter, use the following command:
g++ -std=c++11 -O3 -o maf_counter maf_counter.cpp -I /path/to/google_sparse_hash -I /path/to/concurrent_queue -pthread -lrt
Usage
./maf_counter [options] <k-mer length> <MAF file> <number of threads>
Options
-c, --complement: Aggregate k-mers with their reverse complements.
-s, --single_file_output: Write all k-mers to a single compressed file.
Examples
- Extract 15-mers from input.maf using 8 producers and 8 consumers (ideally suited to 16 processor cores)
./maf_counter 15 input.maf 16
- The same options, but aggregate each k-mer with its reverse complement (writing the lexicographically first of the two) and write the results in single-file mode
./maf_counter -c -s 15 input.maf 16
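To make the effect of -c concrete, here is a minimal Python sketch of canonical k-mer selection: a k-mer and its reverse complement are treated as one key, and the lexicographically first of the two is the one written. The helper names `reverse_complement` and `canonical` are illustrative and are not taken from the tool's source.

```python
# Translation table mapping each base to its complement.
_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(kmer):
    """Complement every base, then reverse the string."""
    return kmer.translate(_COMPLEMENT)[::-1]


def canonical(kmer):
    """Return the lexicographically first of a k-mer and its reverse complement."""
    return min(kmer, reverse_complement(kmer))
```

With this rule, TTGGC and its reverse complement GCCAA are both counted under GCCAA.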
Output Format
By default, the tool generates per-genome k-mer files in the results_counter directory. Each file is in Jellyfish format, containing k-mers and their counts for the corresponding genome ID.
Example file (genome1_kmer_counts.txt):
ATCGG 1401
TTGGC 1233
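A per-genome file in this layout can be loaded with a few lines of Python. This is a sketch that assumes one whitespace-separated `k-mer count` pair per line, as in the example; the function name `read_kmer_counts` is hypothetical.

```python
def read_kmer_counts(lines):
    """Parse 'KMER COUNT' lines into a dict mapping k-mer -> int count."""
    counts = {}
    for line in lines:
        line = line.strip()
        if not line:
            continue  # tolerate blank lines
        kmer, count = line.split()
        counts[kmer] = int(count)
    return counts
```

Any iterable of lines works, so an open file handle can be passed directly, for example `read_kmer_counts(open("results_counter/genome1_kmer_counts.txt"))`.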
If the -s option is used, a single output file is created in the format:
ATCGG genomeid1:1401, genomeid2:1200
TTGGC genomeid1:1233, genomeid3:4123
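Lines in the single-file layout can be parsed in a similar way. The sketch below assumes the file has already been decompressed to text (the compression format is not stated here) and that each line follows the `k-mer genome:count, genome:count` pattern shown above; `parse_single_file_line` is a hypothetical name.

```python
def parse_single_file_line(line):
    """Parse 'KMER id1:n1, id2:n2' into (kmer, {genome_id: count})."""
    kmer, rest = line.strip().split(maxsplit=1)
    counts = {}
    for entry in rest.split(","):
        # rsplit guards against genome IDs that themselves contain a colon.
        genome_id, count = entry.strip().rsplit(":", 1)
        counts[genome_id] = int(count)
    return kmer, counts
```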
License
This project is licensed under the GNU GPL v3.
Contact
For any questions or support, please contact
File details
Details for the file maf_counter-0.2.tar.gz.
File metadata
- Download URL: maf_counter-0.2.tar.gz
- Upload date:
- Size: 80.6 kB
- Tags: Source
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/5.1.1 CPython/3.8.20
File hashes
Algorithm | Hash digest
---|---
SHA256 | 36098889d70ade7d47db8275e77146335374e739c669f14dca0c624256f1471e
MD5 | d0e15dc4a7288dac22a13b0034649ba6
BLAKE2b-256 | 6a4fb2d22eab45ef1f8c5aff02b63df23462df2f1b30b350fd77fae6894e45d4
File details
Details for the file maf_counter-0.2-py3-none-any.whl.
File metadata
- Download URL: maf_counter-0.2-py3-none-any.whl
- Upload date:
- Size: 79.6 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/5.1.1 CPython/3.8.20
File hashes
Algorithm | Hash digest
---|---
SHA256 | 2473c8f84a93a93644a4f15696d84fc03baea7ace7899cf42fd032e337891b14
MD5 | 18e2d91ec508408550d87305229c246d
BLAKE2b-256 | 0baa1259ae3f9ec2346100ea431a964c79804f317fb46418467c1168e690a793
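A downloaded file can be checked against the SHA256 digests listed above with Python's standard `hashlib` module. The sketch below is one way to do it; the helper name `sha256_of` and the chunk size are illustrative choices.

```python
import hashlib


def sha256_of(path, chunk_size=1 << 20):
    """Return the hex SHA256 digest of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as fh:
        # iter() with a sentinel stops when read() returns an empty bytes object.
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()
```

The download is intact if, for example, `sha256_of("maf_counter-0.2.tar.gz")` equals the SHA256 value listed for that file.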