FASTX parsing and k-mer methods

Needletail

Needletail is an MIT-licensed, minimal-copying FASTA/FASTQ parser and k-mer processing library for Rust.

The goal is to write a fast and well-tested set of functions that more specialized bioinformatics programs can use. Needletail aims to parse FASTX files as fast as the readfq C library and to count k-mers much faster (i.e. 25 times faster) than equivalent Python implementations.

Example

extern crate needletail;
use needletail::{parse_fastx_file, Sequence, FastxReader};

fn main() {
    let filename = "tests/data/28S.fasta";

    let mut n_bases = 0;
    let mut n_valid_kmers = 0;
    let mut reader = parse_fastx_file(&filename).expect("valid path/file");
    while let Some(record) = reader.next() {
        let seqrec = record.expect("invalid record");
        // keep track of the total number of bases
        n_bases += seqrec.num_bases();
        // normalize to make sure all the bases are consistently capitalized and
        // that we remove the newlines since this is FASTA
        let norm_seq = seqrec.normalize(false);
        // we make a reverse complemented copy of the sequence first for
        // `canonical_kmers` to draw the complemented sequences from.
        let rc = norm_seq.reverse_complement();
        // now we keep track of the number of AAAAs (or TTTTs via
        // canonicalization) in the file; note we also get the position (i.0;
        // in the event there were `N`-containing kmers that were skipped)
        // and whether the sequence was complemented (i.2) in addition to
        // the canonical kmer (i.1)
        for (_, kmer, _) in norm_seq.canonical_kmers(4, &rc) {
            if kmer == b"AAAA" {
                n_valid_kmers += 1;
            }
        }
    }
    println!("There are {} bases in your file.", n_bases);
    println!("There are {} AAAAs in your file.", n_valid_kmers);
}
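
To make the canonicalization step in the example above concrete, here is a minimal pure-Python sketch of the same idea. It is not needletail's implementation: the name `canonical_kmers_sketch` is hypothetical, the input is assumed to be already normalized, and the canonical form is assumed to be the lexicographically smaller of a k-mer and its reverse complement.

```python
from typing import Iterator, Tuple

# Complement table for the four unambiguous DNA bases.
_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def canonical_kmers_sketch(seq: str, k: int) -> Iterator[Tuple[int, str, bool]]:
    """Yield (position, canonical k-mer, was_reverse_complemented)."""
    for pos in range(len(seq) - k + 1):
        kmer = seq[pos:pos + k]
        # Skip k-mers containing anything other than A, C, G, T (e.g. N),
        # which is why the position is reported alongside each k-mer.
        if any(base not in "ACGT" for base in kmer):
            continue
        rc = kmer.translate(_COMPLEMENT)[::-1]
        # Assumption: the canonical form is the lexicographically smaller one.
        if rc < kmer:
            yield pos, rc, True
        else:
            yield pos, kmer, False


# Count AAAA (or TTTT via canonicalization), as in the Rust example.
n_aaaa = sum(
    1 for _, kmer, _ in canonical_kmers_sketch("TTTTACGNAAAA", 4) if kmer == "AAAA"
)
print(n_aaaa)
```

Because k-mers that contain `N` are skipped, the yielded positions are not necessarily consecutive, which is the reason the Rust iterator reports the position as well.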

Installation

Needletail requires Rust and Cargo to be installed. Install them either with your local package manager (Homebrew, apt-get, pacman, etc.) or via rustup.

Once you have Rust set up, you can include needletail in your Cargo.toml file like:

[dependencies]
needletail = "0.6.0"

To install needletail itself for development:

git clone https://github.com/onecodex/needletail
cd needletail
cargo test  # to run tests

Python

Documentation

For a real example, you can refer to test_python.py.

The Python library raises only one type of exception: NeedletailError.

There are two ways to parse FASTA/FASTQ data: from a string with parse_fastx_string(content: str), or from a file path with parse_fastx_file(path: str). Both functions return an iterator, and both raise if the file is not found or if the content is invalid.

from needletail import parse_fastx_file, NeedletailError, reverse_complement, normalize_seq

try:
    for record in parse_fastx_file("myfile.fastq"):
        print(record.id)
        print(record.seq)
        print(record.qual)
except NeedletailError:
    print("Invalid Fastq file")

A record has the following shape:

class Record:
    id: str
    seq: str
    qual: Optional[str]

    def is_fasta(self) -> bool
    def is_fastq(self) -> bool
    def normalize(self, iupac: bool)

Note that normalize (see https://docs.rs/needletail/0.4.1/needletail/sequence/fn.normalize.html for what it does) mutates self.seq. The same operation is also available as the function normalize_seq(seq: str, iupac: bool), which returns the normalized sequence instead.
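
As a rough illustration of what normalization means, the following pure-Python sketch approximates the rules described in the linked documentation: strip whitespace, uppercase bases, convert U to T, map some punctuation to gaps, and turn everything else into N. It is only an approximation, not the library's code. The name `normalize_sketch` is hypothetical, and the exact character sets used here are assumptions.

```python
# Assumption: the ambiguous IUPAC codes that are kept when iupac=True.
IUPAC_CODES = set("BDHVRYSWKM")


def normalize_sketch(seq: str, iupac: bool = False) -> str:
    """Return an approximately normalized copy of a nucleotide sequence."""
    out = []
    for ch in seq:
        # Drop whitespace and line endings (FASTA sequences may be wrapped).
        if ch in " \t\r\n":
            continue
        up = ch.upper()
        if up in "ACGTN-":
            out.append(up)
        elif up == "U":
            # Treat RNA as DNA.
            out.append("T")
        elif ch in ".~":
            # Assumption: these punctuation marks denote gaps.
            out.append("-")
        elif iupac and up in IUPAC_CODES:
            out.append(up)
        else:
            # Anything unrecognized becomes N.
            out.append("N")
    return "".join(out)


print(normalize_sketch("acgu\nRX"))
print(normalize_sketch("acgu\nRX", iupac=True))
```

Unlike `Record.normalize`, this sketch returns a new string rather than mutating anything, which mirrors the behaviour described for `normalize_seq`.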

Lastly, there is also a reverse_complement(seq: str) function that does exactly what it says. It will not raise an error if you pass invalid characters.
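
For readers unfamiliar with the operation, reverse complementing can be sketched in a few lines of pure Python. This is an illustration rather than needletail's implementation. The name `reverse_complement_sketch` is hypothetical, and leaving unrecognized characters unchanged is an assumption about how invalid input might be tolerated without raising.

```python
# Complement table covering upper- and lower-case unambiguous bases.
_COMPLEMENT = str.maketrans("ACGTacgt", "TGCAtgca")


def reverse_complement_sketch(seq: str) -> str:
    """Complement each base, then reverse the whole sequence."""
    # Characters missing from the table are passed through unchanged
    # instead of raising (an assumption made for this sketch).
    return seq.translate(_COMPLEMENT)[::-1]


print(reverse_complement_sketch("AACG"))
print(reverse_complement_sketch("AXG"))
```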

Building

To work on the Python library on a Mac OS X/Unix system (requires Python 3):

# install maturin, the build tool for the Python bindings
pip install maturin

# then install the library in the local virtualenv
maturin develop --features python

To build the binary wheels and push to PyPI:

# The Mac build requires switching through a few different python versions
maturin build --features python --release --strip

# The linux build is automated through cross-compiling in a docker image
docker run --rm -v $(pwd):/io ghcr.io/pyo3/maturin:main build --features=python --release --strip -f
twine upload target/wheels/*

Releasing A New Version

There is a GitHub workflow that builds Python wheels for macOS (x86 and ARM) and Ubuntu (x86). To run it, create a new release.

Getting Help

Questions are best directed as GitHub issues. We plan to add more documentation soon, but in the meantime "doc" comments are included in the source.

Contributing

Please do! We're happy to discuss possible additions and/or accept pull requests.

Acknowledgements

Starting from 0.4, the parsing algorithms are taken from seq_io. While they have been slightly modified, they mainly come from that library. Links to the original files are available in src/parser/fast{a,q}.rs.
