hash-based counting of k-mers with gaps using a fast jit-compiled k-mer counter

DESCRIPTION

hackgap (hash-based counting of k-mers with gaps) provides a fast JIT-compiled k-mer counter that supports gapped k-mers.

INSTALLATION

Installation via Conda or PyPI

The hackgap package is available via conda or PyPI. To install it, run one of the following commands:

PyPI:

pip install hackgap

Conda:

conda install -c bioconda hackgap

Manual installation

Install the conda package manager (miniforge)

Go to https://conda-forge.org/miniforge/ and download the Miniforge installer. Follow the installer's instructions and add the mamba executable to your PATH (even if the installer does not recommend it). You can let the installer do this, or do it manually by editing your .bashrc or a similar file on Linux or macOS, or by editing the environment variables on Windows. To verify that the installation works, open a new terminal and execute:

mamba --version
python --version

Obtain or update hackgap

You can obtain hackgap by cloning this public git repository:

git clone https://gitlab.com/rahmannlab/hackgap.git

If you need to update hackgap later, you can do so by just executing

git pull

within the cloned directory tree.

Create and activate a conda environment

To run our software, a conda environment with the required libraries needs to be created. A list of needed libraries is provided in the environment.yml file in the cloned repository; it can be used to create a new environment:

cd hackgap  # the directory of the cloned repository
mamba env create

This creates an environment named hackgap with the required dependencies, using the environment.yml file in the same directory.

To activate the newly created environment run

mamba activate hackgap

Install hackgap

To install hackgap, we use pip, the package installer for Python. Run the following command:

pip install -e .

To check that the installation succeeded, execute

hackgap -v # should be 1.0.0 or higher

Usage guide

hackgap is a command-line tool with multiple adjustable parameters. You can get a list of all parameters by running hackgap count --help.

The required parameters are:

  • -o or --output: This parameter specifies the index name and the path at which it is stored.
  • -n or --nobjects: hackgap uses an in-memory hash table to store the k-mers and their counts. For this, you have to provide an estimated number of distinct k-mers. If the table is too small, you have to rerun the counting with a bigger table.
  • -k or --mask: hackgap is the only k-mer counter which supports gapped k-mers (or spaced seeds). The corresponding masks can be provided via the --mask parameter. A significant position is denoted by # and an insignificant position by _. An example mask with k=25 (significant positions) and w=31 (window size) would be --mask "####_####_###_###_###_####_####". We only support masks with $k \leq 31$. If you want to count contiguous k-mers, you can specify k with the -k parameter instead.
  • --files: This parameter specifies the input files in which the k-mers are counted. We support reading FASTA and FASTQ files, either uncompressed or compressed with gzip, xz or bzip2 (the tools required for decompressing the files are dependencies in the environment).
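
To make the mask notation concrete, the following is a minimal Python sketch of how a mask selects the significant positions of each window. It is not hackgap's implementation: the function names parse_mask and count_gapped_kmers are hypothetical, and the sketch ignores reverse complements and non-ACGT characters, which hackgap may treat differently.

```python
from collections import Counter

def parse_mask(mask):
    """Return (k, w, positions) for a mask of '#' (significant) and '_' (ignored)."""
    if set(mask) - {"#", "_"}:
        raise ValueError("mask may only contain '#' and '_'")
    positions = [i for i, c in enumerate(mask) if c == "#"]
    return len(positions), len(mask), positions

def count_gapped_kmers(sequence, mask):
    """Count gapped k-mers of `sequence` with a plain Counter (illustration only)."""
    k, w, positions = parse_mask(mask)
    counts = Counter()
    # Slide a window of width w over the sequence and keep only '#' positions.
    for start in range(len(sequence) - w + 1):
        window = sequence[start:start + w]
        counts["".join(window[i] for i in positions)] += 1
    return counts

k, w, _ = parse_mask("####_####_###_###_###_####_####")
print(k, w)  # 25 31
print(count_gapped_kmers("ACGTACGTAC", "##_#"))
```

For the README's example mask, the sketch confirms k=25 significant positions in a window of w=31.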

Parameters for parallelization:

To improve the speed of hackgap, we implemented a producer-consumer method in addition to our parallelized hash table. You can modify multiple parameters depending on your hardware and the number of files you want to count.

  • --subtables: Defines how many threads are used to insert k-mers into the table. If you have enough cores, it scales well for up to 15 subtables.
  • --threads-split: The number of threads that translate the sequence data into 2-bit encoding and split it into k-mers. 2-3 threads are recommended.
  • --threads-read: The number of threads used to read the input files. If you count more than one file, you can increase the number of readers. Normally, 2-3 readers are enough to supply data to the threads that split the sequences into k-mers and insert the k-mers into the table.
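
As an illustration of what 2-bit encoding means, here is a small Python sketch that packs a k-mer into an integer with two bits per base. The mapping A=0, C=1, G=2, T=3 and the function names are assumptions made for this example; the encoding hackgap uses internally is not documented here.

```python
# Assumed base-to-code mapping (hypothetical; hackgap's internal mapping may differ).
CODE = {"A": 0, "C": 1, "G": 2, "T": 3}
BASES = "ACGT"

def encode_kmer(kmer):
    """Pack a k-mer into an integer, two bits per base, first base most significant."""
    value = 0
    for base in kmer:
        value = (value << 2) | CODE[base]
    return value

def decode_kmer(value, k):
    """Inverse of encode_kmer for a k-mer of length k."""
    return "".join(BASES[(value >> (2 * (k - 1 - i))) & 3] for i in range(k))

code = encode_kmer("ACGT")
print(code)                  # 27
print(decode_kmer(code, 4))  # ACGT
print(encode_kmer("T" * 31).bit_length())  # 62
```

With two bits per base, a k-mer with k=31 occupies 62 bits and therefore fits into a single 64-bit integer.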

Parameters for filtering:

With version 1.0, we introduced a hierarchical 3-level Bloom filter that can be used to exclude k-mers that occur fewer than 3 times. For this, the input files are processed twice: the first pass builds the filter, and the second pass counts only the k-mers that passed the filter. To use it, you have to provide two parameters:

  • --filtersize: This parameter takes up to 3 integer values. Each describes the size of one level of the filter in GB. The first one should be larger than the second one, and the third one should be the smallest.
  • --filterfiles: These are the sequence files used to fill the filter. They are usually the same files in which the k-mers are counted, but can also be different data.
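
The two-pass idea can be sketched in Python as follows. This is one plausible reading of a hierarchical 3-level filter, not hackgap's actual design: the BloomFilter class, build_filter and count_filtered are hypothetical names, sizes are given in bits rather than GB, and the hash functions are chosen only for convenience.

```python
import hashlib
from collections import Counter

class BloomFilter:
    """Tiny Bloom filter over strings (illustration only)."""
    def __init__(self, nbits, nhashes=3):
        self.nbits = nbits
        self.nhashes = nhashes
        self.bits = bytearray((nbits + 7) // 8)

    def _positions(self, item):
        # Derive nhashes bit positions by salting BLAKE2b with a small seed.
        for seed in range(self.nhashes):
            digest = hashlib.blake2b(item.encode(), digest_size=8,
                                     salt=bytes([seed])).digest()
            yield int.from_bytes(digest, "little") % self.nbits

    def add(self, item):
        for p in self._positions(item):
            self.bits[p >> 3] |= 1 << (p & 7)

    def __contains__(self, item):
        return all(self.bits[p >> 3] & (1 << (p & 7)) for p in self._positions(item))

def build_filter(kmers, sizes):
    """Pass 1: each occurrence promotes a k-mer to the first level not yet containing it."""
    levels = [BloomFilter(n) for n in sizes]
    for kmer in kmers:
        for level in levels:
            if kmer not in level:
                level.add(kmer)
                break
    return levels

def count_filtered(kmers, levels):
    """Pass 2: count only k-mers that reached the last level."""
    last = levels[-1]
    return Counter(kmer for kmer in kmers if kmer in last)

kmers = ["AAA"] * 3 + ["CCC"] * 2 + ["GGG"]
levels = build_filter(kmers, sizes=(1 << 16, 1 << 14, 1 << 12))
print(count_filtered(kmers, levels))
```

In this sketch, a k-mer reaches the third level only on its third occurrence, so k-mers seen once or twice are skipped in the second pass. Because Bloom filters can report false positives, a few rare k-mers may still pass. The decreasing level sizes mirror the recommendation for --filtersize, since fewer k-mers reach each later level.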

Additional parameters:

  • --maxcount: Defines the maximal counter value.
  • --markweak: This marks all k-mers that have a Hamming distance of one to another k-mer. This is done after counting the k-mers and needs additional time and memory.
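
As a rough illustration of the Hamming-distance-one criterion, the following Python sketch marks every counted k-mer for which another counted k-mer differs in exactly one position. Reading --markweak this way is an assumption, and hamming_neighbors and mark_weak are hypothetical names, not part of hackgap.

```python
def hamming_neighbors(kmer, alphabet="ACGT"):
    """Yield all strings that differ from `kmer` in exactly one position."""
    for i, base in enumerate(kmer):
        for other in alphabet:
            if other != base:
                yield kmer[:i] + other + kmer[i + 1:]

def mark_weak(counts):
    """Return the k-mers that have at least one counted neighbor at Hamming distance one."""
    present = set(counts)
    return {kmer for kmer in present
            if any(n in present for n in hamming_neighbors(kmer))}

counts = {"ACG": 5, "ACT": 1, "TTT": 2}
print(sorted(mark_weak(counts)))  # ['ACG', 'ACT']
```

Each k-mer has 3k neighbors at Hamming distance one, so such a pass needs 3k table lookups per k-mer, which is consistent with the note that this step costs additional time.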

Example

Here we provide a small example of how to run hackgap on the T2T reference.

Download reference genome

First, we need to download the T2T reference (https://s3-us-west-2.amazonaws.com/human-pangenomics/T2T/CHM13/assemblies/analysis_set/chm13v2.0.fa.gz):

mkdir data # create data folder
cd data
wget https://s3-us-west-2.amazonaws.com/human-pangenomics/T2T/CHM13/assemblies/analysis_set/chm13v2.0.fa.gz
cd ..

Run hackgap

To execute hackgap we need to provide:

  • -n: the expected number of distinct k-mers
  • -k for contiguous k-mers or --mask for gapped k-mers: the k-mer shape
  • --files: the input file (here the gzip-compressed FASTA file; it does not need to be decompressed first)
  • -o: the output file in zarr format

hackgap count -n 2391456540 -k 25 --files data/chm13v2.0.fa.gz -o t2t-k25
hackgap count -n 2416328905 --mask "####_####_###_###_###_####_####" --files data/chm13v2.0.fa.gz -o t2t-m2

Citation

If you use hackgap, please cite the article in the WABI 2022 proceedings:

@inproceedings{DBLP:conf/wabi/ZentgrafR22,
  author       = {Jens Zentgraf and
                  Sven Rahmann},
  editor       = {Christina Boucher and
                  Sven Rahmann},
  title        = {Fast Gapped k-mer Counting with Subdivided Multi-Way Bucketed Cuckoo
                  Hash Tables},
  booktitle    = {22nd International Workshop on Algorithms in Bioinformatics, {WABI}
                  2022, September 5-7, 2022, Potsdam, Germany},
  series       = {LIPIcs},
  volume       = {242},
  pages        = {12:1--12:20},
  publisher    = {Schloss Dagstuhl - Leibniz-Zentrum f{\"{u}}r Informatik},
  year         = {2022},
  url          = {https://doi.org/10.4230/LIPIcs.WABI.2022.12},
  doi          = {10.4230/LIPICS.WABI.2022.12},
  timestamp    = {Wed, 21 Aug 2024 22:46:00 +0200},
  biburl       = {https://dblp.org/rec/conf/wabi/ZentgrafR22.bib},
  bibsource    = {dblp computer science bibliography, https://dblp.org}
}

