FASTX parsing and k-mer methods
Needletail
Needletail is an MIT-licensed, minimal-copying FASTA/FASTQ parser and k-mer processing library for Rust.
The goal is to provide a fast and well-tested set of functions that more specialized bioinformatics programs can use. Needletail aims to be as fast as the readfq C library at parsing FASTX files and much (i.e. 25 times) faster than equivalent Python implementations at k-mer counting.
Example
extern crate needletail;
use needletail::{parse_fastx_file, Sequence, FastxReader};
fn main() {
    let filename = "tests/data/28S.fasta";
    let mut n_bases = 0;
    let mut n_valid_kmers = 0;
    let mut reader = parse_fastx_file(&filename).expect("valid path/file");
    while let Some(record) = reader.next() {
        let seqrec = record.expect("invalid record");
        // keep track of the total number of bases
        n_bases += seqrec.num_bases();
        // normalize to make sure all the bases are consistently capitalized and
        // that we remove the newlines since this is FASTA
        let norm_seq = seqrec.normalize(false);
        // we make a reverse complemented copy of the sequence first for
        // `canonical_kmers` to draw the complemented sequences from.
        let rc = norm_seq.reverse_complement();
        // now we keep track of the number of AAAAs (or TTTTs via
        // canonicalization) in the file; note we also get the position (i.0;
        // in the event there were `N`-containing kmers that were skipped)
        // and whether the sequence was complemented (i.2) in addition to
        // the canonical kmer (i.1)
        for (_, kmer, _) in norm_seq.canonical_kmers(4, &rc) {
            if kmer == b"AAAA" {
                n_valid_kmers += 1;
            }
        }
    }
    println!("There are {} bases in your file.", n_bases);
    println!("There are {} AAAAs in your file.", n_valid_kmers);
}
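To make the canonicalization step in the example above more concrete, here is a minimal standalone sketch that uses only the standard library. It is not Needletail's implementation: the helpers `complement`, `reverse_complement`, `normalize` and `count_canonical` are hypothetical names written for illustration. It assumes the common definition of a canonical k-mer (the lexicographically smaller of a k-mer and its reverse complement) and a simplified normalization that maps anything other than A/C/G/T to `N`.

```rust
/// Complement of a single uppercase nucleotide; anything else maps to `N`.
fn complement(base: u8) -> u8 {
    match base {
        b'A' => b'T',
        b'C' => b'G',
        b'G' => b'C',
        b'T' => b'A',
        _ => b'N',
    }
}

/// Reverse complement of a sequence (hypothetical helper, not the library's).
fn reverse_complement(seq: &[u8]) -> Vec<u8> {
    seq.iter().rev().map(|&b| complement(b)).collect()
}

/// Simplified normalization (an assumption for this sketch): drop line
/// breaks, uppercase everything, and replace non-ACGT bytes with `N`.
fn normalize(seq: &[u8]) -> Vec<u8> {
    seq.iter()
        .filter(|&&b| b != b'\n' && b != b'\r')
        .map(|b| match b.to_ascii_uppercase() {
            b'A' => b'A',
            b'C' => b'C',
            b'G' => b'G',
            b'T' => b'T',
            _ => b'N',
        })
        .collect()
}

/// Count the k-mers whose canonical form equals `target`.
/// Windows containing `N` are skipped.
fn count_canonical(seq: &[u8], k: usize, target: &[u8]) -> usize {
    if k == 0 {
        // `windows(0)` would panic, so treat k = 0 as "no k-mers".
        return 0;
    }
    let mut count = 0;
    for window in seq.windows(k) {
        if window.contains(&b'N') {
            continue;
        }
        let rc = reverse_complement(window);
        // Canonical form: the lexicographically smaller of the two strands.
        let canonical: &[u8] = if rc.as_slice() < window {
            rc.as_slice()
        } else {
            window
        };
        if canonical == target {
            count += 1;
        }
    }
    count
}

fn main() {
    let seq = normalize(b"aaaa\nTTTT");
    let n = count_canonical(&seq, 4, b"AAAA");
    println!("There are {} canonical AAAAs.", n);
}
```

In this sketch both `AAAA` and `TTTT` are counted under `AAAA`, which is why the example above only needs to compare against one of the two strands.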
Installation
Needletail requires rust and cargo to be installed.
Please use either your local package manager (homebrew, apt-get, pacman, etc.) or install these via rustup.
Once you have Rust set up, you can include needletail in your Cargo.toml file like:
[dependencies]
needletail = "0.4"
To install needletail itself for development:
git clone https://github.com/onecodex/needletail
cd needletail
cargo test # to run tests
Python
To work on the Python library on a Mac OS X/Unix system (requires Python 3):
pip install maturin
# finally, install the library in the local virtualenv
maturin develop --cargo-extra-args="--features=python"
Building binary wheels and pushing to PyPI
# The Mac build requires switching through a few different python versions
maturin build --cargo-extra-args="--features=python" --release --strip
# The linux build is automated through cross-compiling in a docker image
docker run --rm -v $(pwd):/io konstin2/maturin:master build --cargo-extra-args="--features=python" --release --strip
twine upload target/wheels/*
Getting Help
Questions are best directed as GitHub issues. We plan to add more documentation soon, but in the meantime "doc" comments are included in the source.
Contributing
Please do! We're happy to discuss possible additions and/or accept pull requests.
Acknowledgements
Starting from 0.4, the parsing algorithms are taken from seq_io. While they have been slightly modified, they mainly come from that library. Links to the original files are available in src/parser/fast{a,q}.rs.