Yet another correction to the 'yet another correction to just a k-mer counter...'
README - kmerdb
A Python CLI and module for k-mer profiles, similarities, and graph databases
NOTE: This project is in beta stage. Development is ongoing. But feel free to clone the repository and play with the code for yourself.
Development Status
Summary
KDB is a Python library designed for bioinformatics applications. It addresses the 'k-mer' problem (substrings of length k) in a simple and performant manner. It stores the k-mer counts/abundances in a columnar format, with input file checksums, total counts, nullomers, and mononucleotide counts in a YAML-formatted header in the first block of the bgzf-formatted .kdb file. One restriction is that k-mers with unspecified sequence residues 'N' create gaps in the k-mer to sequence relationship space, and are excluded. That said, non-standard IUPAC residues are supported.
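As a rough illustration of the counting rule described above, here is a minimal sketch in plain Python. It is not kmerdb's implementation: the function names `count_kmers` and `summarize` are hypothetical, windows containing 'N' are simply skipped, and the nullomer tally assumes the plain ACGT alphabet.

```python
from collections import Counter
from itertools import product


def count_kmers(seq, k):
    """Count k-mers in seq, skipping any window that contains 'N'."""
    seq = seq.upper()
    counts = Counter()
    for i in range(len(seq) - k + 1):
        kmer = seq[i:i + k]
        if "N" in kmer:
            continue  # unspecified residue: this window is excluded
        counts[kmer] += 1
    return counts


def summarize(counts, k, alphabet="ACGT"):
    """Return (total count, number of nullomers) over the possible ACGT k-mers."""
    total = sum(counts.values())
    nullomers = sum(
        1 for p in product(alphabet, repeat=k) if "".join(p) not in counts
    )
    return total, nullomers


counts = count_kmers("ACGTNACGT", 2)
total, nullomers = summarize(counts, 2)
print(dict(counts))
print(total, nullomers)
```

The single 'N' removes two of the eight 2-mer windows, which is the kind of gap the restriction refers to. The total and nullomer figures are the sort of summary values the header is described as carrying.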
Please see the Quickstart guide for more information about the format, the library, and the project.
The k-mer spectrum of the fasta or fastq sequencing data is stored in the .kdb format spec, a bgzf file similar to .bam. For those familiar with .bam, view and header functions are provided to decompress a .kdb file into a standard output stream. The output file is compatible with zlib.
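Since the file is described as zlib compatible, Python's standard gzip module should presumably be able to decompress it too. The sketch below shows the idea on a stand-in file built from two concatenated gzip members, which mimics how bgzf stores independent compressed blocks. The file contents and the helper name `read_gzip_members` are placeholders, not the real .kdb header or column layout; the view and header commands remain the supported route.

```python
import gzip
import os
import tempfile


def read_gzip_members(path):
    """Decompress every gzip member in the file and return the text."""
    with gzip.open(path, "rt") as handle:
        return handle.read()


# Build a stand-in file from two concatenated gzip members.
# The contents are placeholders, not an actual .kdb header or data block.
demo_path = os.path.join(tempfile.mkdtemp(), "demo.kdb")
with open(demo_path, "wb") as out:
    out.write(gzip.compress(b"header: placeholder\n"))
    out.write(gzip.compress(b"0\t12\n1\t7\n"))

text = read_gzip_members(demo_path)
print(text)
```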
Installation
Dependencies
DESeq2 is required as an R dependency for rpy2-mediated normalization.
if (!requireNamespace("BiocManager", quietly = TRUE))
    install.packages("BiocManager")
BiocManager::install("DESeq2")
All other dependencies are managed directly by pip.
OSX and Linux release:
pip install --python-version 3.7.4 --pre kmerdb
Development installation:
git clone https://github.com/MatthewRalston/kmerdb.git
cd kmerdb
pip install .
Usage Example
NOTE: Detailed usage can be found on the quickstart page.
CLI Usage
kmerdb --help
# Build a [composite] profile to a new .kdb file
kmerdb profile -k 8 example1.fq.gz example2.fq.gz profile.8.kdb
# Note: zlib compatibility
zcat profile.8.kdb
# Build a weighted edge list
kmerdb graph -k 12 example1.fq.gz example2.fq.gz edges.kdbg
# View the raw data
kmerdb view profile.8.kdb # -H for full header
# View the header
kmerdb header profile.8.kdb
# Collate the files. See 'kmerdb matrix -h' for more information.
# Note: the 'pass' subcommand passes the int counts through collation, without normalization.
# In this case the shell interprets '*.8.kdb' as all 8-mer profiles in the current working directory.
# The k-mer profiles are read in parallel (-p $cores), and collated into one Pandas dataframe, which is printed to STDOUT.
# Other options include DESeq2 normalization, frequency matrix, or PCA|tSNE based dimensionality reduction techniques.
kmerdb matrix -p $cores pass *.8.kdb > kmer_count_dataframe.tsv
# Calculate similarity between two (or more) profiles
# The correlation distance from Numpy is used on one or more profiles, or piped output from 'kmerdb matrix'.
kmerdb distance correlation profile1.kdb profile2.kdb (...) > distance.tsv
# A condensed, one-line invocation of the matrix and distance command using the bash shell's pipe mechanism is as follows.
kmerdb matrix pass *.8.kdb | kmerdb distance correlation STDIN > distance.tsv
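To make the distance step concrete, here is a minimal sketch of a correlation distance between count profiles, assuming it means 1 minus the Pearson correlation coefficient. It uses only the standard library so it runs anywhere, whereas the text above says kmerdb relies on Numpy. The function names and the toy profiles are hypothetical.

```python
import math


def correlation_distance(x, y):
    """Return 1 - Pearson r for two equal-length count vectors."""
    if len(x) != len(y):
        raise ValueError("profiles must have the same length")
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant profile")
    return 1.0 - cov / math.sqrt(var_x * var_y)


def distance_matrix(profiles):
    """Pairwise distances for a dict of {name: count vector}."""
    names = list(profiles)
    return {
        (a, b): correlation_distance(profiles[a], profiles[b])
        for a in names
        for b in names
    }


# Toy count vectors standing in for collated k-mer profiles.
profiles = {"s1": [1, 2, 3, 4], "s2": [2, 4, 6, 8], "s3": [4, 3, 2, 1]}
matrix = distance_matrix(profiles)
print(matrix[("s1", "s2")], matrix[("s1", "s3")])
```

Under this definition, profiles that differ only by a scaling factor sit at distance 0, and perfectly anti-correlated profiles sit at distance 2.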
IUPAC support:
kmerdb profile -k $k input.fa output.kdb # This may discard non-IUPAC characters, this feature lacks documentation!
IUPAC residues (ATCG+RYSWKM+BDHV) are kept throughout the k-mer counting, but the unspecified residue N and any non-IUPAC characters are trimmed from the sequences prior to k-mer counting.
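Taking the sentence above literally, the trimming step could look something like the sketch below. This is an assumption about the behaviour, not kmerdb's actual code: the name `trim_sequence` is hypothetical, input is upper-cased first, and how the retained ambiguous residues are then counted is not covered here.

```python
import re

# Residues the text says are kept: ATCG plus RYSWKM and BDHV.
IUPAC_KEPT = "ACGTRYSWKMBDHV"
_NOT_KEPT = re.compile(f"[^{IUPAC_KEPT}]")


def trim_sequence(seq):
    """Drop 'N' and any character outside the kept IUPAC set."""
    return _NOT_KEPT.sub("", seq.upper())


cleaned = trim_sequence("acgtNNryX-acgt")
print(cleaned)
```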
Documentation
Check out the main webpage and the Readthedocs documentation, with examples and descriptions of the module usage.
Some features that are important for usage may not be fully documented.
For example, the IUPAC treatment is largely custom, and does the sensible thing when ambiguous bases are found in fasta files, but it could use some polishing.
In addition, the 'N' residue rejection creates gaps in the k-mer profile from the real dataset by admittedly omitting certain k-mer counts.
This is one method for counting k-mers and handling ambiguity. Fork it and play with it a bit.
Also, the parallel handling may not always be smooth if you're trying to load dozens of 12+ mer profiles into memory. This matters most in the matrix command, before the matrix is generated. You can run single-core if your machine can't collate that much into main memory at once. Depending on how deep the fastq dataset is, the --block-size parameter in kmerdb profile is likely to help with memory overhead by reading chunks of --block-size reads into memory at once while accumulating the k-mer counts in a uint64 array. Even when handling small-ish k-mer profiles, you may bump into memory limits rather quickly.
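The block-wise accumulation described above can be sketched roughly as follows. This is an illustration under stated assumptions, not kmerdb's code: the names `kmer_index` and `count_in_blocks` are hypothetical, k-mers are indexed with a simple 2-bit ACGT encoding, any k-mer with another residue is skipped, and a standard-library array of 64-bit unsigned integers stands in for the uint64 array.

```python
from array import array
from itertools import islice

ENCODE = {"A": 0, "C": 1, "G": 2, "T": 3}


def kmer_index(kmer):
    """Map an ACGT k-mer to its 2-bit integer index, or None otherwise."""
    idx = 0
    for base in kmer:
        code = ENCODE.get(base)
        if code is None:
            return None
        idx = (idx << 2) | code
    return idx


def count_in_blocks(reads, k, block_size):
    """Accumulate k-mer counts holding at most block_size reads at once."""
    counts = array("Q", [0]) * (4 ** k)  # one 64-bit unsigned slot per k-mer
    reads = iter(reads)
    while True:
        block = list(islice(reads, block_size))  # next chunk of reads
        if not block:
            break
        for read in block:
            for i in range(len(read) - k + 1):
                idx = kmer_index(read[i:i + k])
                if idx is not None:
                    counts[idx] += 1
    return counts


counts = count_in_blocks(["ACGT", "ACNT", "TTTT"], k=2, block_size=2)
print(list(counts))
```

Only the current block of reads and the fixed-size count array are resident at any time. The array itself still grows as 4^k, which is one reason large k values run into memory limits even with small blocks.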
Besides that, I'd suggest reading the source, the different elements of the main page, or the RTD documentation.
Development
https://matthewralston.github.io/kmerdb/developing
python setup.py test
License
Created by Matthew Ralston - Scientist, Programmer, Musician - Email
Distributed under the Apache license. See LICENSE.txt for the copy distributed with this project. Open source software is not for everyone, and I'm the author and maintainer. Cheers, on me. You may use and distribute this software, gratis, so long as the original LICENSE.txt is distributed along with the software. This software is distributed AS IS and provides no warranties of any kind.
Copyright 2020 Matthew Ralston
Licensed under the Apache License, Version 2.0 (the "License");
you may not use this file except in compliance with the License.
You may obtain a copy of the License at
http://www.apache.org/licenses/LICENSE-2.0
Unless required by applicable law or agreed to in writing, software
distributed under the License is distributed on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
See the License for the specific language governing permissions and
limitations under the License.
Contributing
- Fork it (https://github.com/MatthewRalston/kmerdb/fork)
- Create your feature branch (git checkout -b feature/fooBar)
- Commit your changes (git commit -am 'Add some fooBar')
- Push to the branch (git push origin feature/fooBar)
- Create a new Pull Request
Acknowledgements
Thank you to the authors of kPAL and Jellyfish for the early inspiration. And thank you to others for the encouragement along the way, who shall remain nameless. I wanted this library to be a good strategy for assessing these k-mer profiles, in a way that is aware of the cost of the analytical tasks at play, capable of storing the exact profiles in sync with the current assemblies, and able to update the k-mer databases only when needed to generate enough spectral signature information.
The intention is that more developers will want to add functionality to the codebase, or even just use it downstream, building out directly with numpy and scipy/scikit as needed to provide the basic infrastructure for the ML problems and modeling approaches that could be applied to such datasets. This project began under GPL v3.0 and was relicensed under Apache v2. Hopefully this project will gain some interest. I have so much fun working on just this one project. There's more to it than meets the eye. I'm working on a preprint, and the draft is included in some of the latest versions of the codebase, specifically as .Rmd files.
More on the flip-side of this file. Literally. And figuratively. It's so complex with technology these days.
Source Distribution
File details
Details for the file kmerdb-0.8.0.tar.gz.
File metadata
- Download URL: kmerdb-0.8.0.tar.gz
- Upload date:
- Size: 250.2 kB
- Tags: Source
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/5.0.0 CPython/3.12.2
File hashes
Algorithm | Hash digest
---|---
SHA256 | fa1b0bd1bfc9b4f986910de38e05672d3e6d793eb4992686ad0436fd788c0574
MD5 | 5a8704e082fd1e835294c794c1e0feec
BLAKE2b-256 | 95e5118c37ee0fe862b800e4f685de88ba7642f04fc6b9ca5cdb0202811cbba1