Naive Bayes Classifier with Rust-accelerated taxonomy functions

Project description

phylotypy

Naive Bayesian Classifier for 16S rRNA gene sequence data

A Python port of phylotypr, the R package from Riffomonas's Code Club: https://github.com/riffomonas/phylotypr

It's been a great challenge learning how to translate the R code into Python with minimal use of extra libraries.

It's best to clone the repository. Run vignette.py to check that everything works.

Training the model with the full reference database from RDP takes about 30 seconds on my MacBook Pro.
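
For readers new to the method, here is a toy sketch of what training and using a k-mer naive Bayes classifier involves. It follows the published RDP classifier formulation (word-specific priors and genus-conditional probabilities) and is not phylotypy's code: the function names, the tiny reference set, and the word size of 3 are made up for illustration (the RDP classifier uses 8-mers), and the bootstrap step that produces confidence values is left out.

```python
from collections import Counter
from math import log


def kmers(seq: str, k: int) -> set[str]:
    """Return the distinct overlapping k-mers in a sequence."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def train(references: dict[str, list[str]], k: int) -> dict:
    """Count how many reference sequences contain each k-mer, per genus and overall."""
    genus_counts = {}
    word_totals = Counter()
    for genus, seqs in references.items():
        counts = Counter()
        for seq in seqs:
            counts.update(kmers(seq, k))
        genus_counts[genus] = counts
        word_totals.update(counts)
    return {
        "k": k,
        "genus_counts": genus_counts,
        "genus_sizes": {genus: len(seqs) for genus, seqs in references.items()},
        "word_totals": word_totals,
        "n_total": sum(len(seqs) for seqs in references.values()),
    }


def classify(seq: str, model: dict) -> str:
    """Return the genus with the highest summed log probability over the query's k-mers."""
    scores = {}
    for genus, counts in model["genus_counts"].items():
        genus_size = model["genus_sizes"][genus]
        score = 0.0
        for word in kmers(seq, model["k"]):
            # Word-specific prior: (n(w) + 0.5) / (N + 1)
            prior = (model["word_totals"][word] + 0.5) / (model["n_total"] + 1)
            # Genus-conditional probability: (m(w) + prior) / (M + 1)
            score += log((counts[word] + prior) / (genus_size + 1))
        scores[genus] = score
    return max(scores, key=scores.get)


# Hypothetical two-genus reference set, for illustration only.
references = {
    "GenusA": ["AAAAAAAAAA", "AAAAAAAAAC"],
    "GenusB": ["GGGGGGGGGG", "GGGGGGGGGT"],
}
model = train(references, k=3)
best = classify("AAAAAAA", model)
```

Training time in a scheme like this is dominated by counting k-mers across the reference sequences.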

You can modify the vignette at the end to classify your own sequences. I've done this using DADA2's output files.

There's also read_fasta.py, which reads a FASTA file of DNA sequences into a dataframe for running this classifier.

I made a separate vignette, vignette.py, on how to do this and classify 16S sequence data from QIIME, DADA2, or text files.
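
As a rough picture of what turning a FASTA file into a table involves, the sketch below parses FASTA text into one record per sequence using only the standard library. It is an illustration, not the code in read_fasta.py: the function name `parse_fasta` is hypothetical, and the `id` and `sequence` keys are assumed from the column names shown in the classifier output.

```python
from io import StringIO
from typing import TextIO


def parse_fasta(handle: TextIO) -> list[dict[str, str]]:
    """Parse FASTA text into a list of {"id": ..., "sequence": ...} records."""
    records = []
    seq_id = None
    chunks = []
    for line in handle:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # A new header closes the previous record, if there was one.
            if seq_id is not None:
                records.append({"id": seq_id, "sequence": "".join(chunks)})
            seq_id = line[1:]
            chunks = []
        elif seq_id is not None:
            # Sequence lines may wrap; join them and normalize case.
            chunks.append(line.upper())
    if seq_id is not None:
        records.append({"id": seq_id, "sequence": "".join(chunks)})
    return records


# In-memory example; with a real file, pass open("my_sequences.fasta") instead.
example = StringIO(">seq1\nACGT\nacgt\n>seq2\nTTGA\n")
records = parse_fasta(example)
```

If pandas is installed, `pandas.DataFrame(records)` turns the records into a dataframe with one row per sequence.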

Thanks to Riffomonas for the inspiration. Check out the videos on his YouTube channel: https://youtube.com/playlist?list=PLmNrK_nkqBpIZlWa3yGEc2-wX7An2kpCL&si=LmHDV02K5_wb6C0j

How to install

pip install git+https://github.com/csaltikov/phylotypy.git

or, if using uv (recommended; see the uv documentation for how to install it):

uv pip install git+https://github.com/csaltikov/phylotypy.git

How to get started:

First, download the training data, RDP's trainset19072023, from https://mothur.org/wiki/rdp_reference_files/. Alternatively, use the copy in the data directory, rdp_16S_v19.dada2.fasta.

I processed the latest RDP reference data into a format that works both here and with DADA2.

The taxonomy string is semicolon-separated and looks like this:

Bacteria;Phylum;Class;Order;Family;Genus

for example:
>Bacteria;Pseudomonadota;Gammaproteobacteria;Enterobacterales;Enterobacteriaceae;Citrobacter
TAGAGTTTGATCCATGGCTCAGATTGAACGCTGGCGGCAGGCCTAACAC.....
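
To make the header format concrete, the sketch below splits a taxonomy string into named ranks. It is only an illustration: `split_lineage` is a hypothetical helper, not part of phylotypy, and the rank names are taken from the column names in the output shown further down.

```python
RANKS = ("Kingdom", "Phylum", "Class", "Order", "Family", "Genus")


def split_lineage(header: str) -> dict[str, str]:
    """Split a semicolon-separated taxonomy header into a rank -> taxon mapping."""
    # Drop the FASTA ">" marker and any trailing semicolon before splitting.
    taxa = header.lstrip(">").strip().rstrip(";").split(";")
    if len(taxa) != len(RANKS):
        raise ValueError(f"expected {len(RANKS)} ranks, got {len(taxa)}: {header!r}")
    return dict(zip(RANKS, taxa))


lineage = split_lineage(
    ">Bacteria;Pseudomonadota;Gammaproteobacteria;Enterobacterales;"
    "Enterobacteriaceae;Citrobacter"
)
print(lineage["Genus"])
```
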
  1. Load the training data and sequences to be classified
from pathlib import Path
from phylotypy import classifier, results, read_fasta

rdp = read_fasta.read_taxa_fasta("data/rdp_16S_v19.dada2.fasta")
moving_pics = read_fasta.read_taxa_fasta("data/dna_moving_pictures.fasta")
  2. Create the classifier. We'll call it database
database = classifier.make_classifier(rdp)
  3. Classify the sequences
classified = classifier.classify_sequences(moving_pics, database)
  4. Format the output
classified = results.summarize_predictions(classified)
print(classified.columns)

Output:

>>> Index(['id', 'sequence', 'classification', 'Kingdom', 'Phylum', 'Class',
       'Order', 'Family', 'Genus', 'observed', 'lineage'],
      dtype='object')
print(classified["classification"].head())

Output:

   0    Bacteria(100);Bacteroidota(100);Bacteroidia(10...
   1    Bacteria(100);Pseudomonadota(100);Betaproteoba...
   2    Bacteria(100);Bacillota(100);Bacilli(100);Lact...
   3    Bacteria(100);Bacteroidota(100);Bacteroidia(10...
   4    Bacteria(100);Bacteroidota(100);Bacteroidia(10...
   Name: classification, dtype: object

Format the results using the results.summarize_predictions() function. The output is a pandas DataFrame and can be saved to CSV.

from phylotypy import results
classified = results.summarize_predictions(classified)
print(classified.head())

classified.to_csv("classified_results.csv")

Example classification output:

The taxonomic levels "Domain", "Phylum", "Class", "Order", "Family", "Genus" are separated by ";". The number in parentheses after each taxon is the confidence in the classification at that level, as a percentage. The default confidence threshold is 80%.

>>> Bacteria(100);Pseudomonadota(99);Alphaproteobacteria(99);Rhodospirillales(99);Acetobacteraceae(99);Roseomonas(83)

>>> Bacteria(99);Bacteroidota(97);Bacteroidia(93);Bacteroidales(93);Bacteroidales_unclassified(93);Bacteroidales_unclassified(93)
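
If you need the taxa and confidence values separately, a classification string in this format can be taken apart with a few lines of Python. The sketch below is an illustration only: `parse_classification` is a hypothetical helper, not a phylotypy function.

```python
import re

# One field looks like "Roseomonas(83)": a taxon name followed by an integer in parentheses.
FIELD_RE = re.compile(r"^(?P<taxon>.+)\((?P<conf>\d+)\)$")


def parse_classification(text: str) -> list[tuple[str, int]]:
    """Split a classification string into (taxon, confidence) pairs."""
    pairs = []
    for field in text.split(";"):
        match = FIELD_RE.match(field.strip())
        if match is None:
            raise ValueError(f"unexpected field: {field!r}")
        pairs.append((match["taxon"], int(match["conf"])))
    return pairs


example = (
    "Bacteria(100);Pseudomonadota(99);Alphaproteobacteria(99);"
    "Rhodospirillales(99);Acetobacteraceae(99);Roseomonas(83)"
)
pairs = parse_classification(example)
genus, genus_confidence = pairs[-1]
print(genus, genus_confidence)
```
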

Complete code block:

from pathlib import Path
from phylotypy import classifier, results, read_fasta

rdp = read_fasta.read_taxa_fasta("data/rdp_16S_v19.dada2.fasta")
moving_pics = read_fasta.read_taxa_fasta("data/dna_moving_pictures.fasta")

database = classifier.make_classifier(rdp)

classified = classifier.classify_sequences(moving_pics, database)
classified = results.summarize_predictions(classified)
print(classified.head())

Requirements

setuptools, numpy, numba, pandas, requests, pandarallel, jax, cython


Download files

Source Distribution

phylotypy-0.2.0.tar.gz (269.9 kB)

  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: uv/0.9.17 (macOS)

Hashes for phylotypy-0.2.0.tar.gz

Algorithm    Hash digest
SHA256       a3ba58e0de94413a3da7b397fe83c9ae1ed391f94aecb6a971f0df6081260b8f
MD5          b04d642b170dc2e35a8636f4398aab59
BLAKE2b-256  548c22a804d7f6e72434601e1eb214f2335ca5c39d7a65cde300f18621a98b94

Built Distribution

phylotypy-0.2.0-cp313-cp313-macosx_10_9_x86_64.whl (370.2 kB)

  • Tags: CPython 3.13, macOS 10.9+ x86-64
  • Uploaded using Trusted Publishing? No
  • Uploaded via: uv/0.9.17 (macOS)

Hashes for phylotypy-0.2.0-cp313-cp313-macosx_10_9_x86_64.whl

Algorithm    Hash digest
SHA256       c5fa2cfec6919f0b9e47b5c2ccbfaf0a7a320bf7c6ccf2e3d1f192ffe0777582
MD5          cfcdf546e5d29ba26c2bc141fddc5bd1
BLAKE2b-256  14fe49abb0d7dd4ddebe4f364fc3fd68307edaf620e1073ffce75f4ef3927858
