A high-performance k-mer counting tool for MAF alignments.

MAF Counter

MAF Counter is a multithreaded tool designed to efficiently extract and count k-mers from multiple genome alignments in MAF format. It utilizes producer-consumer threading with concurrent queues to parallelize k-mer extraction and aggregation. The tool supports canonical/reverse-complement k-mer handling and provides flexible output options, including per-genome k-mer files and a consolidated single-file format.
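The producer-consumer arrangement described above can be illustrated with a small Python sketch. This is not the tool's C++ implementation: the function names, the single consumer, and the use of Python's standard queue.Queue in place of a lock-free concurrent queue are all assumptions made for illustration.

```python
import queue
import threading
from collections import Counter

SENTINEL = None  # hypothetical end-of-stream marker for the consumer


def producer(sequences, k, out_queue):
    """Extract k-mers from each sequence and enqueue a partial count table."""
    for seq in sequences:
        partial = Counter(seq[i:i + k] for i in range(len(seq) - k + 1))
        out_queue.put(partial)


def consumer(in_queue, totals):
    """Merge partial tables into totals until the sentinel arrives."""
    while True:
        partial = in_queue.get()
        if partial is SENTINEL:
            break
        totals.update(partial)


def count_with_threads(sequences, k, n_producers=2):
    work_queue = queue.Queue()
    totals = Counter()
    merger = threading.Thread(target=consumer, args=(work_queue, totals))
    merger.start()
    # Each producer receives an interleaved slice of the input sequences.
    producers = [
        threading.Thread(target=producer,
                         args=(sequences[i::n_producers], k, work_queue))
        for i in range(n_producers)
    ]
    for thread in producers:
        thread.start()
    for thread in producers:
        thread.join()
    work_queue.put(SENTINEL)  # all producers are done; let the consumer stop
    merger.join()
    return totals


counts = count_with_threads(["ACGTAC", "ACG"], 3)
```

Only the consumer thread writes to the shared table, so the table itself needs no lock; the queue is the only structure shared between producers and the consumer.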

Algorithm Overview

  1. Chunk Division: The input MAF file is divided into approximately equal-sized chunks, each containing whole alignment blocks, ensuring no block spans multiple chunks.
  2. Parallel Processing: Producer threads process each chunk independently, using a sliding window to extract k-mers and their counts.
  3. Consumer Merging: Consumer threads merge k-mer groups into a final aggregated data structure, avoiding conflicts by partitioning based on hashed k-mer keys.
  4. File Writing: Once processing is complete, output is written either as per-genome files in Jellyfish format or a single compressed file if the -s flag is used.
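
The first three steps can be sketched in Python as follows. This is a simplified, single-threaded illustration under several assumptions that the description above does not confirm: blocks are grouped by line length rather than by byte offsets in the file, the whole source field of each "s" line is treated as the genome ID, gap characters are removed before the sliding window is applied, and CRC32 stands in for whatever hash the tool uses. All function names are hypothetical.

```python
import zlib
from collections import Counter, defaultdict


def split_blocks(maf_text):
    """Return alignment blocks as lists of lines ('a' line plus 's' lines)."""
    blocks, current = [], []
    for line in maf_text.splitlines():
        fields = line.split()
        if not fields:
            continue
        if fields[0] == "a":          # an 'a' line opens a new block
            if current:
                blocks.append(current)
            current = [line]
        elif fields[0] == "s" and current:
            current.append(line)
    if current:
        blocks.append(current)
    return blocks


def chunk_blocks(blocks, n_chunks):
    """Group consecutive whole blocks into chunks of roughly equal size."""
    sizes = [sum(len(line) for line in block) for block in blocks]
    target = sum(sizes) / n_chunks
    chunks, current, acc = [], [], 0
    for block, size in zip(blocks, sizes):
        current.append(block)
        acc += size
        if acc >= target and len(chunks) < n_chunks - 1:
            chunks.append(current)
            current, acc = [], 0
    if current:
        chunks.append(current)
    return chunks


def count_chunk(chunk, k):
    """Slide a window of length k over every sequence line in the chunk."""
    counts = defaultdict(Counter)     # genome ID -> k-mer -> count
    for block in chunk:
        for line in block[1:]:
            fields = line.split()
            genome, text = fields[1], fields[6]
            seq = text.replace("-", "").upper()   # assumption: gaps dropped
            for i in range(len(seq) - k + 1):
                counts[genome][seq[i:i + k]] += 1
    return counts


def partition_of(kmer, n_consumers):
    """Pick the consumer that owns a k-mer, so no two consumers share a key."""
    return zlib.crc32(kmer.encode("ascii")) % n_consumers


def merge(partials):
    """Combine per-chunk tables into one aggregated table."""
    totals = defaultdict(Counter)
    for partial in partials:
        for genome, kmers in partial.items():
            totals[genome].update(kmers)
    return totals


maf = """##maf version=1
a score=1
s g1.chr1 0 5 + 100 ACG-TA
s g2.chr1 0 5 + 100 ACGT-A

a score=2
s g1.chr1 10 3 + 100 ACG
"""
chunks = chunk_blocks(split_blocks(maf), 2)
totals = merge(count_chunk(chunk, 3) for chunk in chunks)
```

Because every occurrence of a given k-mer maps to the same partition, each consumer can update its own table without coordinating with the others.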

Compilation

To compile MAF Counter, use the following command:

g++ -std=c++11 -O3 -o maf_counter maf_counter.cpp -I /path/to/google_sparse_hash -I /path/to/concurrent_queue -pthread -lrt

Usage

./maf_counter [options] <k-mer length> <MAF file> <number of threads>

Options

-c, --complement: Aggregate k-mers with their reverse complements.
-s, --single_file_output: Write all k-mers to a single compressed file.
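
For scripted runs, the command line can be assembled from the usage pattern and the two options above. The helper below is a hypothetical convenience wrapper, not part of MAF Counter; it assumes the binary was compiled to ./maf_counter and uses only the -c and -s flags listed here.

```python
def build_command(k, maf_path, threads, complement=False, single_file=False,
                  binary="./maf_counter"):
    """Return the argument list: [options] <k-mer length> <MAF file> <threads>."""
    cmd = [binary]
    if complement:
        cmd.append("-c")   # aggregate k-mers with their reverse complements
    if single_file:
        cmd.append("-s")   # write all k-mers to a single output file
    cmd += [str(k), maf_path, str(threads)]
    return cmd


cmd = build_command(15, "input.maf", 16, complement=True, single_file=True)
```

The resulting list can be passed to subprocess.run(cmd, check=True) from the Python standard library.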

Examples

  • Extract 15-mers from input.maf using 8 producers and 8 consumers (ideally suited to a machine with 16 processor cores):
./maf_counter 15 input.maf 16
  • The same run, but aggregating each k-mer with its reverse complement (writing the lexicographically first of the two) and writing the results in single-file mode:
./maf_counter -c -s 15 input.maf 16
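
The reverse-complement aggregation selected by -c can be illustrated with a short Python sketch. It assumes the canonical form is the lexicographically smaller of a k-mer and its reverse complement, as described above, and that sequences contain only A, C, G and T; the function names are hypothetical.

```python
from collections import Counter

_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(kmer):
    """Complement each base, then reverse the string."""
    return kmer.translate(_COMPLEMENT)[::-1]


def canonical(kmer):
    """Return whichever of the k-mer and its reverse complement sorts first."""
    return min(kmer, reverse_complement(kmer))


# Both strands of the same site collapse onto one key.
counts = Counter(canonical(kmer) for kmer in ["ATCGG", "CCGAT", "TTGGC"])
```

Here ATCGG and CCGAT are reverse complements of each other, so both are counted under ATCGG, while TTGGC is counted under its reverse complement GCCAA.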

Output Format

By default, the tool generates per-genome k-mer files in the results_counter directory. Each file is in Jellyfish format, containing k-mers and their counts for the corresponding genome ID.

Example file (genome1_kmer_counts.txt):

ATCGG  1401
TTGGC  1233
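
A per-genome file in this layout can be read back with a few lines of Python. This is a minimal sketch that assumes each line holds a k-mer and an integer count separated by whitespace, as in the example above; parse_counts is a hypothetical helper, not part of the tool.

```python
def parse_counts(lines):
    """Map each k-mer to its count; lines without two fields are skipped."""
    counts = {}
    for line in lines:
        fields = line.split()
        if len(fields) != 2:
            continue
        kmer, count = fields
        counts[kmer] = int(count)
    return counts


counts = parse_counts(["ATCGG  1401", "TTGGC  1233", ""])
```

In practice, the lines argument could be an open file handle for one of the files in the results_counter directory.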

If the -s option is used, a single output file is created in the format:

ATCGG genomeid1:1401, genomeid2:1200
TTGGC genomeid1:1233, genomeid3:4123
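
The single-file layout can be parsed along the same lines. The sketch below assumes the text form shown above (a k-mer, a space, then comma-separated genome:count pairs) and does not deal with any compression applied to the file; parse_single_file_line is a hypothetical helper.

```python
def parse_single_file_line(line):
    """Split 'KMER id1:n1, id2:n2' into the k-mer and a genome -> count dict."""
    kmer, _, rest = line.strip().partition(" ")
    per_genome = {}
    for entry in rest.split(","):
        entry = entry.strip()
        if not entry:
            continue
        # rpartition keeps genome IDs intact even if they contain a colon.
        genome, _, count = entry.rpartition(":")
        per_genome[genome] = int(count)
    return kmer, per_genome


kmer, per_genome = parse_single_file_line("ATCGG genomeid1:1401, genomeid2:1200")
```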

License

This project is licensed under the GNU GPL v3.

Contact

For any questions or support, please contact
