identifying plant infection with machine intelligence.
iimi: identifying infection with machine intelligence
iimi is a Python package for plant virus diagnostics using high-throughput genome sequencing data. It provides tools for converting BAM files into coverage profiles, processing and visualizing genomic data while handling unreliable regions, and training machine learning models to predict viral infections.
Installation
pip install iimi
Usage
import iimi
import pandas as pd  # needed for the pd.DataFrame example in the next section
Data Processing and Coverage Profile Generation
Convert the indexed and sorted BAM file(s) into coverage profiles and a feature-extracted data frame. The coverage profiles are used to visualize the mapping information. The feature-extracted data frame is used for model training and prediction.
# convert BAM files to coverage profiles
bam_files = ["path/to/sample1.sorted.bam", "path/to/sample2.sorted.bam"]
iimi.convert_bam_to_rle(bam_files)
# convert coverage profiles to a feature-extracted DataFrame
rle_data = {
"sample1": {"seg1": [1, 2, 3, 0, 0, 4], "seg2": [0, 0, 0, 1, 1, 2]},
"sample2": {"seg3": [2, 3, 4, 5, 0, 1]},
}
additional_info = pd.DataFrame({
"virus_name": ["Virus4"],
"iso_id": ["Iso4"],
"seg_id": ["seg4"],
"A_percent": [40],
"C_percent": [20],
"T_percent": [20],
"GC_percent": [20],
"seg_len": [800],
})
iimi.convert_rle_to_df(rle_data, additional_nucleotide_info=additional_info)
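For intuition about what these two steps produce, the sketch below run-length encodes a per-base coverage vector and derives a few summary features from it. It is a standalone illustration, not iimi's implementation: the helper names `run_length_encode` and `summarize_coverage` and the chosen features are assumptions made for this example.

```python
from itertools import groupby


def run_length_encode(coverage):
    """Collapse a per-base coverage list into (value, run_length) pairs."""
    return [(value, len(list(run))) for value, run in groupby(coverage)]


def summarize_coverage(coverage):
    """Derive simple per-segment features (hypothetical feature set)."""
    n = len(coverage)
    return {
        "seg_len": n,
        "mean_cov": sum(coverage) / n,
        "max_cov": max(coverage),
        # Fraction of positions with at least one mapped read.
        "frac_covered": sum(1 for c in coverage if c > 0) / n,
    }


coverage = [1, 2, 3, 0, 0, 4]  # same toy values as "seg1" above
rle = run_length_encode(coverage)
features = summarize_coverage(coverage)
```

Run-length encoding pays off on real profiles, where long stretches of identical coverage (often zero) collapse into a single pair.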
Handling Unreliable Regions
Unreliable regions are described by two profiles: high nucleotide content regions and a mappability profile. Identifying these regions helps eliminate false peaks.
High Nucleotide Content Regions
The high nucleotide content profile marks areas of a virus genome that have high GC content and/or a high percentage of A nucleotides.
virus_info = {
"seg1": "ATGCGATCGATCGATCGTACGATCGATCGATCGATCGTACGATCG",
"seg2": "AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA"
}
# identify regions with high GC content
iimi.create_high_nucleotide_content(
    gc=0.4, a=0.0, window=10, virus_info=virus_info
)
# identify regions with high A content
iimi.create_high_nucleotide_content(
    gc=0.0, a=0.8, window=10, virus_info=virus_info
)
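The idea behind a windowed nucleotide-content scan can be sketched in plain Python. This is an illustration under stated assumptions, not iimi's algorithm: the helper `high_content_windows` is hypothetical, windows are reported as half-open `(start, end)` pairs, and a threshold of `None` switches a check off (how iimi treats a threshold of 0.0 is not stated here).

```python
def high_content_windows(seq, window, gc=None, a=None):
    """Return (start, end) windows whose GC or A fraction exceeds a threshold.

    Windows slide one base at a time. A threshold of None disables that check.
    """
    seq = seq.upper()
    flagged = []
    for start in range(0, len(seq) - window + 1):
        chunk = seq[start:start + window]
        gc_frac = (chunk.count("G") + chunk.count("C")) / window
        a_frac = chunk.count("A") / window
        if (gc is not None and gc_frac > gc) or (a is not None and a_frac > a):
            flagged.append((start, start + window))
    return flagged


# An all-A sequence is flagged in every window; a balanced one is not.
a_rich = high_content_windows("A" * 46, window=10, a=0.8)
mixed = high_content_windows("ACGT" * 6, window=10, a=0.8)
gc_rich = high_content_windows("GC" * 10, window=10, gc=0.4)
```

Overlapping flagged windows could be merged into longer regions afterwards. The sketch keeps them separate so each window's decision stays visible.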
Mappability Profile
The mappability profile marks areas of a virus genome that can be mapped to other viruses or to the host genome. This tool uses Arabidopsis thaliana as the host genome.
# generate mappability profile from host or virus BAM files
iimi.create_mappability_profile(
    path_to_bam_files="path/to/bam/files",
    virus_info=virus_info,
    window=10,
)
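To show how such profiles can help remove false peaks, the sketch below zeroes out coverage inside a set of unreliable intervals before any downstream summary is computed. The interval format (half-open `(start, end)` pairs) and the helper `mask_unreliable` are assumptions for this example. iimi may store and apply unreliable regions differently.

```python
def mask_unreliable(coverage, regions):
    """Return a copy of coverage with every position inside a region set to 0."""
    masked = list(coverage)
    for start, end in regions:
        # Clamp each half-open interval to the bounds of the segment.
        for pos in range(max(start, 0), min(end, len(masked))):
            masked[pos] = 0
    return masked


coverage = [0, 0, 50, 60, 55, 0, 3, 4, 2, 0]
unreliable = [(2, 5)]  # hypothetical region flagged as unreliable
masked = mask_unreliable(coverage, unreliable)
```

In this toy profile the tall peak at positions 2 to 4 falls entirely inside the flagged region, so only the low coverage at positions 6 to 8 remains after masking.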
Machine Learning Models to Predict Viral Infections
Using Pre-trained Models
To use a provided model, pass your data as newdata and choose a method: xgb, en, or rf, which stand for the pre-trained XGBoost, elastic net, and random forest models. The prediction is TRUE if the virus infected the sample and FALSE if it did not.
# predict using pre-trained random forest model
# (df is the feature-extracted data frame described under Data Processing)
iimi.predict_iimi(newdata=df, method="rf")
Training a Custom Model
The train_iimi() function trains a machine learning model on the provided feature-extracted data frame of plant samples (train_x) and the known target labels (train_y). It also supports three models: xgb, en, and rf.
# train random forest model
rf_model = iimi.train_iimi(train_x, train_y, method="rf", ntree=100, mtry=2)
# train XGBoost model
xgb_model = iimi.train_iimi(train_x, train_y, method="xgb", nrounds=100)
# train elastic net model
en_model = iimi.train_iimi(train_x, train_y, method="en", k=5)
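For supervised training, train_x and train_y are expected to describe the same rows in the same order. The sketch below shows one way to build matching labels. It assumes that each feature row carries a sample ID and a virus name and that the known diagnostics are available as a nested mapping. Both layouts, and the helper `build_labels`, are assumptions for this example rather than a documented iimi format.

```python
def build_labels(feature_rows, diagnostics):
    """Return one boolean label per feature row, in row order.

    feature_rows: list of dicts with "sample_id" and "virus_name" keys.
    diagnostics:  {sample_id: {virus_name: bool}}; missing entries count as False.
    """
    return [
        diagnostics.get(row["sample_id"], {}).get(row["virus_name"], False)
        for row in feature_rows
    ]


feature_rows = [
    {"sample_id": "sample1", "virus_name": "Virus1", "mean_cov": 12.5},
    {"sample_id": "sample1", "virus_name": "Virus2", "mean_cov": 0.3},
    {"sample_id": "sample2", "virus_name": "Virus1", "mean_cov": 0.0},
]
diagnostics = {"sample1": {"Virus1": True, "Virus2": False}}
train_y = build_labels(feature_rows, diagnostics)
```

Deriving the labels from the feature rows, rather than building them separately, keeps the two inputs aligned even if rows are later filtered or reordered.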
Visualizing Coverage Profiles
plot_cov() plots the coverage profile of the mapped plant sample together with the percentage of A nucleotides and the GC content, computed over a sliding k-mer window with a default step of 75 bases.
covs = {
"sample1": {
"seg1": [20, 30, 50, 60, 80],
"seg2": [15, 25, 45, 55, 75],
}
}
virus_info = {
"seg1": "ACGT" * 250,
"seg2": "TGCA" * 250,
}
# plot coverage of segments without unreliable regions
iimi.plot_cov(
    covs,
    legend_status=True,
    nucleotide_status=True,
    virus_info=virus_info,
    unreliable_regions=None,
)
Sample Data and Models Provided
- iimi/data/example_cov.pkl: Coverage profiles of three plant samples, stored as a list.
- iimi/data/example_diag.pdl: Known diagnostics results for virus segments. A matrix containing the known truth about the diagnostics result (using virus database version 1.4.0) for each plant sample in the example data.
- iimi/data/nucleotide_info.pkl: Nucleotide information for virus segments. A data set containing the GC content and other information about the virus segments from the official Virtool virus database (version 1.4.0).
- iimi/data/unreliable_regions.pkl: Unreliable regions of the virus segments. A data frame of unmappable regions and of regions where GC% exceeds 60% or A% exceeds 45%.
- iimi/data/trained_rf.pkl: A trained model using the default Random Forest settings.
- iimi/data/trained_xgb.model: A trained model using the default XGBoost settings.
- iimi/data/trained_en.pkl: A trained model using the default Elastic Net settings.
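The .pkl files are Python pickles, so they can be read with the standard library. The sketch below is a minimal loader and makes some assumptions: `load_pickle` is a hypothetical helper, the location of the data files depends on where the package is installed, and objects pickled from pandas need pandas installed to load. The round trip uses a temporary file as a stand-in for a bundled data file.

```python
import pickle
import tempfile
from pathlib import Path


def load_pickle(path):
    """Load one pickled object from disk.

    Only unpickle files from a source you trust: pickle can execute code.
    """
    with Path(path).open("rb") as handle:
        return pickle.load(handle)


# Round-trip demonstration with a temporary file standing in for a bundled file.
with tempfile.TemporaryDirectory() as tmp:
    demo_path = Path(tmp) / "demo.pkl"
    demo_path.write_bytes(pickle.dumps({"seg1": [20, 30, 50]}))
    loaded = load_pickle(demo_path)
```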
References
- H. Ning, I. Boyes, I. Numanagić, M. Rott, L. Xing, and X. Zhang, “Diagnostics of viral infections using high-throughput genome sequencing data,” Briefings in Bioinformatics, vol. 25, no. 6, Sep. 2024, doi: https://doi.org/10.1093/bib/bbae501.
- G. Sukhorukov, M. Khalili, O. Gascuel, T. Candresse, A. Marais-Colombel, and M. Nikolski, “VirHunter: A Deep Learning-Based Method for Detection of Novel RNA Viruses in Plant Sequencing Data,” Frontiers in Bioinformatics, vol. 2, May 2022, doi: https://doi.org/10.3389/fbinf.2022.867111.