identifying plant infection with machine intelligence.

iimi: identifying infection with machine intelligence

iimi is a Python package for plant virus diagnostics using high-throughput genome sequencing data. It provides tools for converting BAM files into coverage profiles, processing and visualizing genomic data while accounting for unreliable regions, and training machine learning models to predict viral infections.

Installation

pip install iimi

Usage

import iimi
import pandas as pd  # needed below to build the additional_info DataFrame

Data Processing and Coverage Profile Generation

Convert indexed and sorted BAM file(s) into coverage profiles and a feature-extracted data frame. The coverage profiles are used to visualize the mapping information. The feature-extracted data frame is used for model training and prediction.

# convert BAM files to coverage profiles
bam_files = ["path/to/sample1.sorted.bam", "path/to/sample2.sorted.bam"]
iimi.convert_bam_to_rle(bam_files)

# convert coverage profiles to a feature-extracted DataFrame
rle_data = {
    "sample1": {"seg1": [1, 2, 3, 0, 0, 4], "seg2": [0, 0, 0, 1, 1, 2]},
    "sample2": {"seg3": [2, 3, 4, 5, 0, 1]},
}

additional_info = pd.DataFrame({
    "virus_name": ["Virus4"],
    "iso_id": ["Iso4"],
    "seg_id": ["seg4"],
    "A_percent": [40],
    "C_percent": [20],
    "T_percent": [20],
    "GC_percent": [20],
    "seg_len": [800],
})

iimi.convert_rle_to_df(rle_data, additional_nucleotide_info=additional_info)
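
The "rle" in the function names suggests run-length encoding, a compact way to store coverage vectors that contain long runs of equal depth. The sketch below illustrates that idea and two simple per-segment summaries. It is not iimi's internal implementation: the helper names rle_encode, rle_decode and summarize are hypothetical, and treating each list as per-base depth is an assumption.

```python
from itertools import groupby


def rle_encode(coverage):
    """Collapse a per-base coverage vector into (value, run_length) pairs."""
    return [(value, sum(1 for _ in run)) for value, run in groupby(coverage)]


def rle_decode(runs):
    """Expand (value, run_length) pairs back into a per-base coverage vector."""
    return [value for value, length in runs for _ in range(length)]


def summarize(coverage):
    """Hypothetical features: mean depth and fraction of bases with depth > 0."""
    n = len(coverage)
    if n == 0:
        return {"mean_cov": 0.0, "frac_covered": 0.0}
    return {
        "mean_cov": sum(coverage) / n,
        "frac_covered": sum(1 for depth in coverage if depth > 0) / n,
    }


# "seg1" of "sample1" from the example above, read as per-base depth (assumption)
coverage = [1, 2, 3, 0, 0, 4]
runs = rle_encode(coverage)
features = summarize(coverage)
```

Decoding the runs returns the original vector, so the encoding is lossless; the summaries are only examples of the kind of column a feature-extracted data frame might hold.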

Handling Unreliable Regions

Unreliable regions combine two profiles: high nucleotide content regions and the mappability profile. Identifying these regions helps eliminate false peaks in the coverage profiles.

High Nucleotide Content Regions

The high nucleotide content profile marks areas of a virus genome that have high GC content and/or a high percentage of A nucleotides.

virus_info = {
    "seg1": "ATGCGATCGATCGATCGTACGATCGATCGATCGATCGTACGATCG",
    "seg2": "AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA"
}

# identify regions with high GC content
iimi.create_high_nucleotide_content(
    gc=0.4, a=0.0, window=10, virus_info=virus_info
)
# identify regions with high A content
iimi.create_high_nucleotide_content(
    gc=0.0, a=0.8, window=10, virus_info=virus_info
)
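
To make the sliding-window idea concrete, here is a minimal sketch that scans a sequence and reports the windows whose fraction of chosen bases exceeds a threshold. The function high_content_windows is hypothetical and written for illustration only; how iimi treats window edges, steps, or a threshold of 0.0 is not documented here, so the sketch takes one base set and one threshold at a time.

```python
def high_content_windows(seq, bases, threshold, window=10):
    """Return 0-based start positions of windows whose fraction of `bases`
    is strictly greater than `threshold`."""
    seq = seq.upper()
    starts = []
    for start in range(len(seq) - window + 1):
        chunk = seq[start:start + window]
        fraction = sum(chunk.count(base) for base in bases) / window
        if fraction > threshold:
            starts.append(start)
    return starts


virus_info = {
    "seg1": "ATGCGATCGATCGATCGTACGATCGATCGATCGATCGTACGATCG",
    "seg2": "AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA",
}

# windows with more than 80% A, mirroring the a=0.8 call above
high_a = {seg: high_content_windows(s, "A", 0.8) for seg, s in virus_info.items()}
# windows with more than 40% G or C, mirroring the gc=0.4 call above
high_gc = {seg: high_content_windows(s, "GC", 0.4) for seg, s in virus_info.items()}
```

With these inputs every window of seg2 is flagged for A content and none for GC content, since the segment consists only of A nucleotides.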

Mappability Profile

The mappability profile marks areas of a virus genome that can be mapped to other viruses or to the host genome. This tool uses Arabidopsis thaliana as the host genome.

# generate mappability profile from host or virus BAM files
iimi.create_mappability_profile(
    path_to_bam_files="path/to/bam/files",
    virus_info=virus_info,
    window=10
)
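
One way to picture how unreliable regions help eliminate false peaks is to collapse flagged positions into intervals and ignore coverage inside them. The sketch below does this by zeroing the depth. Both helpers (to_intervals, mask_coverage) are hypothetical, and zeroing is only one possible treatment; iimi may handle these regions differently.

```python
def to_intervals(positions):
    """Merge 0-based positions into sorted, inclusive (start, end) intervals."""
    intervals = []
    for pos in sorted(set(positions)):
        if intervals and pos == intervals[-1][1] + 1:
            # position extends the current interval by one base
            intervals[-1] = (intervals[-1][0], pos)
        else:
            intervals.append((pos, pos))
    return intervals


def mask_coverage(coverage, intervals):
    """Return a copy of `coverage` with depth set to 0 inside each interval."""
    masked = list(coverage)
    for start, end in intervals:
        for i in range(start, min(end, len(masked) - 1) + 1):
            masked[i] = 0
    return masked


# made-up coverage with two peaks that fall inside flagged positions
coverage = [5, 5, 5, 90, 95, 92, 5, 5, 5, 40, 42, 5]
unreliable = to_intervals([3, 4, 5, 9, 10])
masked = mask_coverage(coverage, unreliable)
```

After masking, only the background depth remains, so a peak caused by cross-mapping or low-complexity sequence no longer looks like evidence of infection.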

Machine Learning Models to Predict Viral Infections

Using Pre-trained Models

To use a provided model, pass your data as newdata and choose a method: xgb, en, or rf, which stand for the pre-trained XGBoost, elastic net, and random forest models. The prediction is TRUE if the virus infected the sample and FALSE if it did not.

# predict using pre-trained random forest model
iimi.predict_iimi(newdata=df, method="rf")

Training a Custom Model

The train_iimi() function trains a machine learning model on the provided feature-extracted data frame of plant samples (train_x) and the known target labels (train_y). It supports the same three methods: xgb, en, and rf.

# train random forest model
rf_model = iimi.train_iimi(train_x, train_y, method="rf", ntree=100, mtry=2)
# train XGBoost model
xgb_model = iimi.train_iimi(train_x, train_y, method="xgb", nrounds=100)
# train elastic net model
en_model = iimi.train_iimi(train_x, train_y, method="en", k=5)
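
Because known labels are available, predictions on held-out samples can be scored against them. The sketch below counts true and false positives and negatives and derives precision, recall, and accuracy. It assumes predictions and labels are plain lists of booleans, which may not match the structure iimi returns; the function names and the example values are made up for illustration.

```python
def confusion_counts(predicted, actual):
    """Count TP, FP, FN and TN for two equally long lists of booleans."""
    if len(predicted) != len(actual):
        raise ValueError("predicted and actual must have the same length")
    pairs = list(zip(predicted, actual))
    return {
        "tp": sum(1 for p, a in pairs if p and a),
        "fp": sum(1 for p, a in pairs if p and not a),
        "fn": sum(1 for p, a in pairs if not p and a),
        "tn": sum(1 for p, a in pairs if not p and not a),
    }


def scores(counts):
    """Derive precision, recall and accuracy; ratios with a zero denominator are None."""
    def ratio(num, den):
        return num / den if den else None

    total = sum(counts.values())
    return {
        "precision": ratio(counts["tp"], counts["tp"] + counts["fp"]),
        "recall": ratio(counts["tp"], counts["tp"] + counts["fn"]),
        "accuracy": ratio(counts["tp"] + counts["tn"], total),
    }


# made-up predictions and labels for five sample/segment pairs
predicted = [True, False, True, True, False]
actual = [True, False, False, True, True]
counts = confusion_counts(predicted, actual)
result = scores(counts)
```

In diagnostics, recall measures how many true infections were detected and precision how many positive calls were correct, so reporting both is usually more informative than accuracy alone.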

Visualizing Coverage Profiles

plot_cov() plots the coverage profile of a mapped plant sample together with the percentage of A nucleotides and the GC content, computed over a sliding k-mer window with a default step of 75 bases.

covs = {
    "sample1": {
        "seg1": [20, 30, 50, 60, 80],
        "seg2": [15, 25, 45, 55, 75],
    }
}
virus_info = {
    "seg1": "ACGT" * 250,
    "seg2": "TGCA" * 250,
}

# plot coverage of segments without unreliable regions
iimi.plot_cov(
    covs,
    legend_status=True,
    nucleotide_status=True,
    virus_info=virus_info,
    unreliable_regions=None,
)

Sample Data and Models Provided

  • iimi/data/example_cov.pkl Coverage profiles: A list of coverage profiles for three plant samples
  • iimi/data/example_diag.pkl Known diagnostics result of virus segments: A matrix containing the known truth about the diagnostics result (using virus database version 1.4.0) for each plant sample in the example data
  • iimi/data/nucleotide_info.pkl Nucleotide information of virus segments: A data set containing the GC content and other information about the virus segments from the official Virtool virus database (version 1.4.0)
  • iimi/data/unreliable_regions.pkl The unreliable regions of the virus segments: A data frame of unmappable regions and of regions where GC% exceeds 60% or A% exceeds 45% for the virus segments
  • iimi/data/trained_rf.pkl A trained model using the default Random Forest settings
  • iimi/data/trained_xgb.model A trained model using the default XGBoost settings
  • iimi/data/trained_en.pkl A trained model using the default Elastic Net settings
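
The .pkl extension suggests these files can be read with Python's standard pickle module. The sketch below shows a small loader under that assumption. The helper load_pickle is hypothetical, the location of the data files inside an installed package may differ from the paths listed above, and the trained_xgb.model file is not a pickle. The demonstration uses a temporary file so that it runs without the package data.

```python
import pickle
import tempfile
from pathlib import Path


def load_pickle(path):
    """Load one pickled object. Only use this on files from a trusted source,
    because unpickling can execute arbitrary code."""
    with Path(path).open("rb") as handle:
        return pickle.load(handle)


# Assumed usage with the bundled data (path is an assumption):
# example_cov = load_pickle("iimi/data/example_cov.pkl")

# Self-contained round trip with a made-up coverage dictionary
demo = {"sample1": {"seg1": [20, 30, 50, 60, 80]}}
with tempfile.TemporaryDirectory() as tmp_dir:
    demo_path = Path(tmp_dir) / "demo_cov.pkl"
    with demo_path.open("wb") as handle:
        pickle.dump(demo, handle)
    loaded = load_pickle(demo_path)
```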

References

  • H. Ning, I. Boyes, I. Numanagić, M. Rott, L. Xing, and X. Zhang, “Diagnostics of viral infections using high-throughput genome sequencing data,” Briefings in Bioinformatics, vol. 25, no. 6, Sep. 2024, doi: https://doi.org/10.1093/bib/bbae501.
  • G. Sukhorukov, M. Khalili, O. Gascuel, T. Candresse, A. Marais-Colombel, and M. Nikolski, “VirHunter: A Deep Learning-Based Method for Detection of Novel RNA Viruses in Plant Sequencing Data,” Frontiers in Bioinformatics, vol. 2, May 2022, doi: https://doi.org/10.3389/fbinf.2022.867111.

