Naive Bayes Classifier with Rust-accelerated taxonomy functions
phylotypy
Naive Bayesian Classifier for 16S rRNA gene sequence data
A port of Riffomonas's Code Club R package, phylotypr, to Python: https://github.com/riffomonas/phylotypr
It's been a great challenge learning how to translate the R code into Python with minimal use of extra libraries.
It's best to clone the repository. Run vignette.py to see if everything works.
Training the model with the full reference database from RDP takes about 30 seconds on my MacBook Pro.
You can modify the vignette at the end to classify your own sequences. I've done this using DADA2's output files.
There's also read_fasta.py, which lets you take a FASTA file of DNA sequences and process them into a dataframe for running this classifier.
I made a separate vignette, vignette.py, that shows how to do this and classify 16S sequence data from QIIME, DADA2, or text files.
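To make the FASTA-to-dataframe step concrete, here is a minimal sketch of the idea using only the standard library. It is not the code in read_fasta.py: the function name parse_fasta and the example records are hypothetical, and the "id" and "sequence" keys are only assumed to match the column names shown in the classifier output.

```python
from io import StringIO


def parse_fasta(handle):
    """Yield (header, sequence) pairs from an open FASTA text handle."""
    header, chunks = None, []
    for line in handle:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            # A new header closes the previous record, if there was one.
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        elif header is not None:
            # Sequence lines may be wrapped, so collect and join them later.
            chunks.append(line.upper())
    if header is not None:
        yield header, "".join(chunks)


# Hypothetical two-record FASTA; the first sequence is wrapped over two lines.
example = StringIO(">seq1\nACGT\nacgt\n>seq2\nTTGA\n")
records = [{"id": h, "sequence": s} for h, s in parse_fasta(example)]
print(records)
```

A list of records like this can be passed to pandas.DataFrame(records) to get a table with one row per sequence.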
Thanks to Riffomonas for the inspiration. Check out the videos on his YouTube channel: https://youtube.com/playlist?list=PLmNrK_nkqBpIZlWa3yGEc2-wX7An2kpCL&si=LmHDV02K5_wb6C0j
How to install
pip install git+https://github.com/csaltikov/phylotypy.git
or, if using uv (recommended):
uv pip install git+https://github.com/csaltikov/phylotypy.git
How to get started:
First get the training data, RDP's trainset19072023. Either download it from https://mothur.org/wiki/rdp_reference_files/ or use the copy in the data directory called rdp_16S_v19.dada2.fasta
I processed the latest rdp reference data into a format that will work here and for DADA2.
The taxonomy string is semicolon-separated and looks like this:
Bacteria;Phylum;Class;Order;Family;Genus
for example:
>Bacteria;Pseudomonadota;Gammaproteobacteria;Enterobacterales;Enterobacteriaceae;Citrobacter
TAGAGTTTGATCCATGGCTCAGATTGAACGCTGGCGGCAGGCCTAACAC.....
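As an illustration of the header format above, this short sketch splits a taxonomy header into named ranks. The helper split_taxonomy is hypothetical and is not part of phylotypy; the rank names are assumed from the column names in the classifier output shown later.

```python
# Rank names assumed from the output columns; the library may label them differently.
RANKS = ["Kingdom", "Phylum", "Class", "Order", "Family", "Genus"]


def split_taxonomy(header):
    """Map a semicolon-separated FASTA header onto rank names."""
    # Drop the leading '>' and any trailing ';' before splitting.
    names = header.lstrip(">").strip().rstrip(";").split(";")
    return dict(zip(RANKS, names))


header = (
    ">Bacteria;Pseudomonadota;Gammaproteobacteria;"
    "Enterobacterales;Enterobacteriaceae;Citrobacter"
)
lineage = split_taxonomy(header)
print(lineage["Genus"])  # Citrobacter
```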
- Load the training data and sequences to be classified
from pathlib import Path
from phylotypy import classifier, results, read_fasta
rdp = read_fasta.read_taxa_fasta("data/rdp_16S_v19.dada2.fasta")
moving_pics = read_fasta.read_taxa_fasta("data/dna_moving_pictures.fasta")
- Create the classifier. We'll call it database
database = classifier.make_classifier(rdp)
- Classify the sequences
classified = classifier.classify_sequences(moving_pics, database)
- Format the output
classified = results.summarize_predictions(classified)
print(classified.columns)
Output:
>>> Index(['id', 'sequence', 'classification', 'Kingdom', 'Phylum', 'Class',
'Order', 'Family', 'Genus', 'observed', 'lineage'],
dtype='object')
print(classified["classification"].head())
Output:
0 Bacteria(100);Bacteroidota(100);Bacteroidia(10...
1 Bacteria(100);Pseudomonadota(100);Betaproteoba...
2 Bacteria(100);Bacillota(100);Bacilli(100);Lact...
3 Bacteria(100);Bacteroidota(100);Bacteroidia(10...
4 Bacteria(100);Bacteroidota(100);Bacteroidia(10...
Name: classification, dtype: object
Format the results using the results.summarize_predictions() function. The output is a pandas DataFrame and can be saved to CSV.
from phylotypy import results
classified = results.summarize_predictions(classified)
print(classified.head())
classified.to_csv("classified_results.csv")
Example classification output:
The taxonomic levels "Domain", "Phylum", "Class", "Order", "Family", "Genus" are separated by ";". The numbers in parentheses are the confidence in the classification at each level, as a percentage. The default confidence threshold is 80%.
>>> Bacteria(100);Pseudomonadota(99);Alphaproteobacteria(99);Rhodospirillales(99);Acetobacteraceae(99);Roseomonas(83)
>>> Bacteria(99);Bacteroidota(97);Bacteroidia(93);Bacteroidales(93);Bacteroidales_unclassified(93);Bacteroidales_unclassified(93)
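If you want to post-process these strings yourself, the sketch below parses one classification string into (taxon, confidence) pairs and reports the deepest rank that meets a chosen cutoff. Both helpers, parse_classification and deepest_confident, are hypothetical names written for this example; they are not phylotypy functions, and the package may already expose this information through the columns of the summarized dataframe.

```python
import re

# One field looks like "Roseomonas(83)": a name followed by an integer in ().
TAXON_RE = re.compile(r"^(?P<name>.+)\((?P<conf>\d+)\)$")


def parse_classification(text):
    """Split 'Name(conf);Name(conf);...' into a list of (name, conf) pairs."""
    pairs = []
    for field in text.split(";"):
        match = TAXON_RE.match(field.strip())
        if match is None:
            raise ValueError(f"unexpected field: {field!r}")
        pairs.append((match["name"], int(match["conf"])))
    return pairs


def deepest_confident(pairs, threshold=80):
    """Return the last taxon, walking down the ranks, that meets the threshold."""
    best = None
    for name, conf in pairs:
        if conf < threshold:
            break  # stop at the first rank that falls below the cutoff
        best = name
    return best


example = (
    "Bacteria(100);Pseudomonadota(99);Alphaproteobacteria(99);"
    "Rhodospirillales(99);Acetobacteraceae(99);Roseomonas(83)"
)
pairs = parse_classification(example)
print(deepest_confident(pairs))      # Roseomonas
print(deepest_confident(pairs, 90))  # Acetobacteraceae
```

Raising the cutoff from 80 to 90 moves the reported taxon from genus up to family for this sequence, which is a quick way to see how sensitive an assignment is to the threshold.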
Complete code block:
from pathlib import Path
from phylotypy import classifier, results, read_fasta
rdp = read_fasta.read_taxa_fasta("data/rdp_16S_v19.dada2.fasta")
moving_pics = read_fasta.read_taxa_fasta("data/dna_moving_pictures.fasta")
database = classifier.make_classifier(rdp)
classified = classifier.classify_sequences(moving_pics, database)
classified = results.summarize_predictions(classified)
print(classified.head())
Requirements
setuptools numpy numba pandas requests pandarallel jax cython