Yet another kmer library for Python

Reason this release was yanked:

Switching to pyproject.toml breaks the bin script. See issue 97.

Project description

README - kmerdb

A Python CLI and module for k-mer profiles, similarities, and graph databases

NOTE: This project is in alpha stage. Development is ongoing. But feel free to clone the repository and play with the code for yourself.

Development Status

Summary

KDB is a Python library designed for bioinformatics applications. It addresses the 'k-mer' problem (counting substrings of length k) in a simple and performant manner. It stores the k-mer counts/abundances and total counts. An experimental per-k-mer metadata feature is included, which records the coordinates of each k-mer relative to its generating sequence. You can think of the current form as a "pre-index", as it includes all the essential information for indexing on any field in the landscape of k-mer to sequence relationships. One restriction is that k-mers containing the unspecified residue 'N' create gaps in the k-mer to sequence relationship space and are excluded. That said, non-standard IUPAC residues are supported.
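
As a rough illustration of the counting described above, here is a minimal Python sketch. It is not kmerdb's implementation: the function name `count_kmers` is hypothetical, and it simply skips any window containing 'N' while recording counts, a total, and 0-based start coordinates.

```python
from collections import Counter, defaultdict


def count_kmers(sequence: str, k: int):
    """Count k-mers and record their 0-based start coordinates.

    Hypothetical helper for illustration only; windows that contain
    the unspecified residue 'N' are skipped.
    """
    sequence = sequence.upper()
    counts = Counter()
    positions = defaultdict(list)
    for i in range(len(sequence) - k + 1):
        kmer = sequence[i:i + k]
        if "N" in kmer:
            continue  # this window spans an unspecified residue
        counts[kmer] += 1
        positions[kmer].append(i)
    return counts, positions


counts, positions = count_kmers("ACGTNACGT", 3)
total = sum(counts.values())  # total number of counted k-mers
```

The three windows that overlap the 'N' contribute nothing, which is the "gap" in the k-mer to sequence relationship mentioned above.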

Please see the Quickstart guide for more information about the format, the library, and the project.

The k-mer spectrum of the fasta or fastq sequencing data is stored in the .kdb format, a bgzf file similar to .bam. For those familiar with .bam, view and header commands are provided to decompress a .kdb file into a standard output stream.
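
BGZF is a series of concatenated gzip blocks, so a generic gzip reader can decompress it. The sketch below shows that property with Python's standard `gzip` module. The two tab-separated rows are invented placeholders, not the real .kdb layout, and real BGZF blocks also carry an extra header field and an end-of-file marker that this sketch omits.

```python
import gzip

# Two independently compressed gzip members stand in for BGZF blocks.
# The row contents are hypothetical and do not reflect the .kdb spec.
blocks = [b"AAAAAAAA\t12\n", b"AAAAAAAC\t7\n"]
payload = b"".join(gzip.compress(block) for block in blocks)


def view(data: bytes) -> str:
    """Decompress concatenated gzip members into one text stream."""
    return gzip.decompress(data).decode("ascii")


text = view(payload)
```

For actual .kdb files, use `kmerdb view` and `kmerdb header` rather than reading the blocks by hand.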

Installation

Dependencies

DESeq2 is required as an R dependency for rpy2-mediated normalization.

if (!requireNamespace("BiocManager", quietly = TRUE))
    install.packages("BiocManager")

BiocManager::install("DESeq2")

All other dependencies are managed directly by pip.

OSX and Linux release:

pip install kmerdb

Development installation:

git clone https://github.com/MatthewRalston/kmerdb.git
cd kmerdb
pip install -e .

Usage Example

Detailed usage can be found on the quickstart page.

CLI Usage

kmerdb --help
# Build a [composite] profile to a new or existing .kdb file
kmerdb profile -k 8 example1.fq.gz example2.fq.gz profile.8.kdb

# View the raw data
kmerdb view profile.8.kdb # -H for full header

# View the header
kmerdb header profile.8.kdb

# Collate the files
kmerdb matrix -p $cores pass *.8.kdb

# Calculate similarity between two (or more) profiles
kmerdb distance correlation profile1.kdb profile2.kdb (...)
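
To give a feel for what a correlation between two profiles measures, here is a minimal sketch that assumes "correlation" means the Pearson correlation coefficient over two equal-length count vectors. The function `pearson` and the toy vectors are illustrative and are not taken from kmerdb.

```python
import math


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length count vectors."""
    if len(x) != len(y):
        raise ValueError("profiles must have the same number of k-mers")
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)


# Toy profiles: the second is the first at twice the sequencing depth.
profile1 = [4, 0, 2, 6]
profile2 = [8, 0, 4, 12]
r = pearson(profile1, profile2)
```

Because the coefficient is invariant to scaling, two profiles that differ only in depth correlate perfectly, which is why it is a common first choice for comparing k-mer spectra.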

Usage note:

kmerdb profile -k $k input.fa output.kdb # This may discard non-IUPAC characters; this feature lacks documentation!

IUPAC residues (ATCG+RYSWKM+BDHV) are kept throughout k-mer counting, but the unspecified residue N and any non-IUPAC characters are trimmed from the sequences prior to k-mer counting.
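
The trimming step can be pictured with the sketch below. It is an assumption-laden illustration, not kmerdb's code: the helper `trim` is hypothetical, it uppercases input first, and it joins the flanking residues after removing a character, which the README does not confirm.

```python
import re

# Residues the README lists as kept during counting.
IUPAC_KEPT = "ATCGRYSWKMBDHV"
_NOT_KEPT = re.compile(f"[^{IUPAC_KEPT}]")


def trim(sequence: str) -> str:
    """Remove 'N' and any character outside the kept alphabet.

    Hypothetical helper: flanking residues are joined after removal,
    which is an assumption about the behaviour.
    """
    return _NOT_KEPT.sub("", sequence.upper())


cleaned = trim("ACGTNNRY-acgt*")
```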

Documentation

Check out the main webpage and the Readthedocs documentation, with examples and descriptions of the module usage.

Some features that matter for usage may not be fully documented.

For example, the IUPAC treatment is largely custom, and does the sensible thing when ambiguous bases are found in fasta files, but it could use some polishing.

In addition, the 'N' residue rejection creates gaps in the k-mer profile relative to the real dataset by admittedly omitting certain k-mer counts. This is one method for counting k-mers and handling ambiguity. Fork it and play with it a bit.

Also, parallel handling may not always be smooth if you are trying to load dozens of 12+ mer profiles into memory. This matters most in the matrix command, before the matrix is generated. If your machine can't collate that much into main memory at once, use a single core. How much memory you need depends on how deep the fastq dataset is. The --block-size parameter in kmerdb profile is likely to help limit memory overhead: it reads chunks of --block-size reads into memory at once while accumulating the k-mer counts in a uint64 array. Even when handling small-ish k-mer profiles, you may bump into memory limits rather quickly.
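
The block-wise accumulation described above might look roughly like this sketch. All names (`kmer_index`, `count_in_blocks`) are hypothetical. It counts only over A/C/G/T for brevity and uses the standard library's `array("Q")` as a stand-in for a uint64 array.

```python
from array import array
from itertools import islice

INDEX = {base: i for i, base in enumerate("ACGT")}


def kmer_index(kmer):
    """Encode a k-mer as a base-4 integer; None if it has a non-ACGT residue."""
    idx = 0
    for base in kmer:
        if base not in INDEX:
            return None
        idx = idx * 4 + INDEX[base]
    return idx


def count_in_blocks(reads, k, block_size):
    """Accumulate k-mer counts while holding at most block_size reads at once."""
    counts = array("Q", [0]) * (4 ** k)  # one unsigned 64-bit slot per k-mer
    reads = iter(reads)
    while True:
        block = list(islice(reads, block_size))  # next chunk of reads
        if not block:
            break
        for read in block:
            for i in range(len(read) - k + 1):
                idx = kmer_index(read[i:i + k])
                if idx is not None:
                    counts[idx] += 1
    return counts


counts = count_in_blocks(["ACGT", "CGTA", "GTNA"], k=2, block_size=2)
```

Note that the count array itself has 4^k slots regardless of block size, so at k = 12 it already holds about 16.8 million 8-byte entries (roughly 134 MB) per profile. This is why loading dozens of such profiles at once can exhaust memory.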

Besides that, I'd suggest reading the source, the different elements of the main page, or the RTD documentation.

Development

https://matthewralston.github.io/kmerdb/developing

python setup.py test

License

Created by Matthew Ralston - Scientist, Programmer, Musician - Email

Distributed under the Apache license. See LICENSE.txt for the copy distributed with this project. Open source software is not for everyone, but for those of us starting out and trying to put the ecosystem ahead of ego, we march into the information age with this ethos.

   Copyright 2020 Matthew Ralston

   Licensed under the Apache License, Version 2.0 (the "License");
   you may not use this file except in compliance with the License.
   You may obtain a copy of the License at

       http://www.apache.org/licenses/LICENSE-2.0

   Unless required by applicable law or agreed to in writing, software
   distributed under the License is distributed on an "AS IS" BASIS,
   WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
   See the License for the specific language governing permissions and
   limitations under the License.

Contributing

  1. Fork it (https://github.com/MatthewRalston/kdb/fork)
  2. Create your feature branch (git checkout -b feature/fooBar)
  3. Commit your changes (git commit -am 'Add some fooBar')
  4. Push to the branch (git push origin feature/fooBar)
  5. Create a new Pull Request

Acknowledgements

Thank you to the authors of kPAL and Jellyfish for the early inspiration. And thank you to others for the encouragement along the way, who shall remain nameless. I wanted this library to be a good strategy for assessing these k-mer profiles: one that is aware of the cost of the analytical tasks at play, capable of storing the exact profiles in sync with the current assemblies, and able to update the k-mer databases only when needed to generate enough spectral signature information.

The intention is that more developers will want to add functionality to the codebase, or even just use it downstream, building directly on numpy and scipy/scikit as needed to lay out the basic infrastructure for the ML problems and modeling approaches that could be applied to such datasets. This project began under GPL v3.0 and will hopefully gain some interest.

More on the flip-side of this file. Literally. And figuratively. It's so complex with technology these days.

Download files

Source Distribution

kmerdb-0.7.1.tar.gz (138.9 kB view details)

Uploaded Source

Built Distribution

kmerdb-0.7.1-py3-none-any.whl (13.5 kB view details)

Uploaded Python 3

File details

Details for the file kmerdb-0.7.1.tar.gz.

File metadata

  • Download URL: kmerdb-0.7.1.tar.gz
  • Upload date:
  • Size: 138.9 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/4.0.1 CPython/3.10.1

File hashes

Hashes for kmerdb-0.7.1.tar.gz
Algorithm Hash digest
SHA256 914a9f21121fb65b41baf28ab18166867cdfa901392c1f110eea422a057b6adc
MD5 22c6664f21ba26a19e1d56b85d3bdce2
BLAKE2b-256 fff3e54883179f00549da5d0859141ffd5604e6f656b28d4cd5879e5bec287eb

File details

Details for the file kmerdb-0.7.1-py3-none-any.whl.

File metadata

  • Download URL: kmerdb-0.7.1-py3-none-any.whl
  • Upload date:
  • Size: 13.5 kB
  • Tags: Python 3
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/4.0.1 CPython/3.10.1

File hashes

Hashes for kmerdb-0.7.1-py3-none-any.whl
Algorithm Hash digest
SHA256 5d108b98b2f141993bcad742f756f2483c5e0b2867d8fef3852965233b4c36fc
MD5 f766281b763b567584e743182782355b
BLAKE2b-256 b7b02ae484f1468a60ee7ddb4d359b80a26733f13f9a2c7e583d1efd90e63b36
