Tool for predicting the origin of replication on circular bacterial chromosomes

ORCA

Python scripts that predict and plot the location of the origin of replication (oriC) of circular bacterial genomes based on Z-curve, GC-skew, dnaA-box, and gene location analyses. This README does not explain all of ORCA's methods. All functions, main or helper, are documented with docstrings and type hints for ease of use. Most functions in ORCA can also be used on their own. For example, calculate_disparity_curves can be used outside of ORCA, as can the functions for fetching files from NCBI, the plotter functions, and others. Please see their docstrings for more information.
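For readers unfamiliar with the curves mentioned above, the following is a minimal standalone sketch of how the cumulative Z-curve components and the cumulative GC-skew are commonly defined. It is not ORCA's calculate_disparity_curves: the name disparity_curves_sketch and its return order are assumptions made purely for illustration.

```python
from typing import List, Tuple


def disparity_curves_sketch(seq: str) -> Tuple[List[int], List[int], List[int], List[int]]:
    """Return cumulative (x, y, z, gc_skew) curves for a DNA string.

    x: purine (A, G) minus pyrimidine (C, T)
    y: amino (A, C) minus keto (G, T)
    z: weak H-bond (A, T) minus strong H-bond (G, C)
    gc_skew: G minus C
    Hypothetical helper for illustration; not part of ORCA's API.
    """
    # Per-base step for each curve, in the order (x, y, z, gc_skew).
    steps = {
        "A": (1, 1, 1, 0),
        "C": (-1, 1, -1, -1),
        "G": (1, -1, -1, 1),
        "T": (-1, -1, 1, 0),
    }
    x, y, z, gc = [], [], [], []
    cx = cy = cz = cgc = 0
    for base in seq.upper():
        # Ambiguous bases (e.g. N) leave all curves unchanged.
        dx, dy, dz, dgc = steps.get(base, (0, 0, 0, 0))
        cx, cy, cz, cgc = cx + dx, cy + dy, cz + dz, cgc + dgc
        x.append(cx)
        y.append(cy)
        z.append(cz)
        gc.append(cgc)
    return x, y, z, gc
```

In GC-skew analyses, the position of the global minimum of the cumulative skew (here gc.index(min(gc))) is often taken as a rough first indication of the origin region.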

Installing ORCApy

To install ORCApy, simply use:

pip install orcapy

Or download it directly from the PyPI website. The installation and download only include the files present in the src folder.

To download the provided Random Forest Classifier, simply use:

wget https://github.com/ZoyavanMeel/ORCA/raw/main/data/output/machine_learning/ORCA_RFC_model.pkl.gz

Or use a similar tool, such as curl. You can also download it manually from the model's page in the GitHub repository by clicking the three horizontal dots in the top right and selecting 'Download'.

ORCA class

The ORCA class predicts the origin of replication for circular bacterial DNA. It combines Z-curve, GC-skew, dnaA-box, and gene location analyses. You can load the required files yourself, or simply provide an accession and the email address of your NCBI account and ORCA will fetch them. The docstring of each constructor gives more information on what is needed to use ORCA. See the Example use of ORCA section for more information. NOTE: the model provided in this repository has been compressed due to file size limitations. The example uses the joblib package to unpickle the model. This is not required; Python's standard pickle and gzip modules can be used as well.

Input

Please make sure to load the proper file formats into the functions; otherwise ORCA will not work. Many invalid parameters will raise warnings or errors, but a few may have been missed. More of ORCA's functionality and input methods are discussed in the paper.

There are four valid input formats for ORCA. Provided one has an internet connection, ORCA can function using only an accession number (from_accession). In that case, ORCA fetches the corresponding GenBank file from NCBI using Biopython's Entrez module, which is a Python interface to NCBI's Entrez Programming Utilities (E-utilities). Please adhere to NCBI's rules on the use of their API. The version number of the accession can be specified. If it is omitted, the most recent version is chosen automatically. One can also provide the GenBank file themselves (from_gbk). This file is processed in the same way, but does not require an internet connection.
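To make the accession format concrete, here is a small sketch that separates an accession such as "NC_000913.3" into its base identifier and optional version number. The helper split_accession is hypothetical and only illustrates the naming convention; it is not how ORCA itself handles accessions.

```python
from typing import Optional, Tuple


def split_accession(accession: str) -> Tuple[str, Optional[int]]:
    """Split 'NC_000913.3' into ('NC_000913', 3).

    The version is None when the accession carries no '.<version>' suffix.
    Hypothetical helper for illustration; not part of ORCA's API.
    """
    base, dot, version = accession.strip().partition(".")
    if not dot:
        return base, None
    return base, int(version)
```

An accession without a version, such as "NC_000913", yields None for the version, which corresponds to the case where the most recent version is used.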

Another option is to provide a pickled Biopython SeqRecord object (from_pkl). ORCA uses Biopython's SeqIO module to parse GenBank files, so GenBank input is converted to a SeqRecord anyway. Providing a SeqRecord that was made by parsing a GenBank file with this same module is therefore also possible. Pickled SeqRecord objects take up more memory, but they can be loaded more quickly. This trade-off is worth considering when processing large genomic datasets.

Lastly, one can simply provide a DNA sequence as a string (from_string). This input method also accepts the locations of the indicator genes. These gene locations are not inferred from the provided sequence and have to be supplied by the user. This is the fastest input format, but it requires the most work from the user. All input parameters are assumed to be in the specified format and are not processed using Biopython. Further explanation of each of these formats can be found in the code documentation.

Example use of ORCA

With the RFC, we provide a general-purpose oriC prediction tool as outlined in the application note. However, as shown in the code documentation, there are many parameters that can be fine-tuned and possibly improved. One of the reasons ORCA is open source is to provide not just a transparent oriC prediction tool, but also a building block for further research.

It is possible to tune all parameters and train models for highly accurate prediction of the oriCs of bacterial species of interest. This could be done by incorporating more indicator genes for that species, or by changing any number of other parameters. It may also be possible to adapt ORCA for predicting oriCs in linear prokaryotic chromosomes or in archaea. It is our hope that ORCA's adaptability can help research into any or all of these avenues, or that it can serve as a lightweight, easy-to-use tool with good out-of-the-box performance.

>>> import joblib
>>> from orcapy import ORCA
>>> email = "example@email.com"
>>> model = joblib.load("path/to/model.pkl.gz")
>>> orca = ORCA.from_accession("NC_000913.3", email=email, model=model)
>>> orca.find_oriCs(show_info=True, show_plot=False)
Accession    : NC_000913
predictions  :  0.99412
Z_scores     :      1.0
G_scores     :      0.5
D_scores     :      1.0
oriC_middles :  3927479
The best-scoring potential oriC was found at: 3927479 bp.

If you do not wish to use joblib, you can also open the model using:

import gzip, pickle
with gzip.open("path/to/model.pkl.gz", "rb") as fh:
    model = pickle.load(fh)

In the case of Escherichia coli (E. coli) K-12, only one potential origin was found, and the model classifies it as a true origin. A prediction value >= 0.5 means that the model assigns a probability of 50 % or more that the corresponding origin is a true origin. It is possible for multiple candidate origins to have a probability above 50 %. In that case, use the origin with the highest probability, as this is the one the model deems most likely to be correct.

This repository also includes a pickled SeqRecord of the E. coli K-12 chromosome. To use it, only a different constructor is needed.
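The selection rule described above can be sketched in a few lines. The example works on plain (position, probability) pairs rather than on ORCA's own objects, and the function name best_candidate is an assumption made for illustration.

```python
from typing import List, Optional, Tuple


def best_candidate(
    candidates: List[Tuple[int, float]], threshold: float = 0.5
) -> Optional[Tuple[int, float]]:
    """Return the (position, probability) pair with the highest probability
    at or above `threshold`, or None if no candidate qualifies.

    Hypothetical helper for illustration; not part of ORCA's API.
    """
    accepted = [c for c in candidates if c[1] >= threshold]
    if not accepted:
        return None
    return max(accepted, key=lambda c: c[1])
```

Returning None when no candidate reaches the threshold keeps the case "no origin was classified as true" distinct from "the best origin is at position 0".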

>>> from orcapy import ORCA
>>> orca = ORCA.from_pkl("data/input/NC_000913_3.pkl", model=model)
>>> orca.find_oriCs(show_info=False, show_plot=False)

There are four constructors in total: from_accession, from_gbk, from_pkl, from_string. Each comes with extensive documentation of how to use them. Once the ORCA object has been instantiated, call find_oriCs to analyse the sequence.

ORCA's parameters

The parameters listed below can be used as they are or fine-tuned for specific use cases. The default parameters reflect ORCA's performance as shown in the application note. Retraining the Random Forest Classifier is recommended if any parameters are changed. Otherwise, the same performance as in the paper cannot be guaranteed.

dnaa_boxes: If None, the consensus DnaA-box TTAT(A|C|G|T)CACA is used (see Section DnaA). Otherwise, provide a list of 9-base strings. Example input: ['AAAAAAAAA', 'TTTTTTTTT'].
max_mismatches: Maximum number of mismatches allowed in a dnaa_box for it to still be counted as one. Recommended maximum: 2. ORCA uses 0 with the consensus DnaA-box.
genes_of_interest: List of gene names to consider 'oriC-proximal' and to use in estimating the location of the oriC. This parameter is case-insensitive.
max_point_spread: Maximum distance that points in a group can have between them when looking for connected groups across the disparity curves. Default is 5 % of the total chromosome length.
windows: The windows around peaks of the skew curves to consider. Defaults are 1, 3, and 5 % of the total chromosome length. ORCA checks each of the given windows.
model: A fitted scikit-learn classifier. It is recommended to use the one provided in this repository.
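As an illustration of what window sizes expressed as a percentage of the chromosome length mean on a circular genome, the sketch below computes window boundaries around a peak. It assumes that each percentage is the total window width, centred on the peak; how ORCA defines its windows internally may differ, and windows_around_peak is a hypothetical name.

```python
from typing import List, Tuple


def windows_around_peak(
    peak: int, seq_len: int, percents: Tuple[int, ...] = (1, 3, 5)
) -> List[Tuple[int, int]]:
    """Return (start, end) indices of windows centred on `peak`.

    Each window spans `percent` % of `seq_len` in total (an assumption).
    Indices wrap around because the chromosome is circular, so `start`
    can be larger than `end` for peaks near position 0.
    """
    result = []
    for percent in percents:
        half = (seq_len * percent // 100) // 2
        start = (peak - half) % seq_len
        end = (peak + half) % seq_len
        result.append((start, end))
    return result
```

For a peak at position 10 on a 1000 bp sequence, the 3 % and 5 % windows start before position 0 and therefore wrap to the end of the sequence.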

Libraries and Versions used

Peak class

The Peak class is used for handling potential oriCs. The oriCs attribute of an ORCA object is a list of Peak objects. Each Peak represents a potential oriC and has attributes for its Z-, G-, and D-score, as well as the confidence assigned by a machine learning model, if one is used.

BioFile

Script with useful functions for I/O. These include functions for downloading, parsing, and saving files. To use these functions, call:

from orcapy import BioFile

Plotter

There are three general functions in this file, which can be used to plot any generic 1D np.array. To use these functions, make sure matplotlib is installed.

plot_Z_curve_3D: Makes a 3D-plot of the Z-curve.
plot_curves: Plots up to four curves in a single 2D-plot. This is useful for plotting one or more Z-curve or GC-skew components against each other.
plot_skew: Does the same as plot_curves, but takes only one array.

To use these functions, call:

from orcapy import Plotter

Machine_learning

This folder contains much of the data used in analysing, training, and testing ORCA and its Random Forest Classifier. The DoriC data was downloaded from http://tubic.tju.edu.cn/doric/public/index.php on May 16th, 2023. Each script comes with docstrings explaining its functionality.

data

input

This folder contains all the input data. This includes the data from DoriC 12.0 and the experimental_dataset CSV, a collection of oriCs that have been experimentally verified, with sources for each chromosome. This dataset was made in order to check the quality of DoriC.

This folder also includes a pickled SeqRecord object of the E. coli K-12 chromosome. This file was used for quick testing and demonstration.

output

This folder contains output from the various performance tests we ran on ORCA. The machine_learning sub-folder contains the provided Random Forest Classifiers. Use the gzip and pickle modules from the Python standard library to load the models or, alternatively, use joblib.

ORCA_RFC_model.pkl.gz: This model has been trained on the full DoriC 12.0 dataset. This model is included with installation and can be called with the ORCA_RFC_model function.
ORCA_RFC_model_70_percent.pkl.gz: This model has been trained on roughly 70 % of the DoriC 12.0 dataset. First, all the experimentally verified origins were removed, and the remaining dataset was then split 70:30 in a stratified manner. This model has been included for posterity. Use the other model for any analyses.
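For readers unfamiliar with stratified splitting, the sketch below shows the general idea using only the standard library: each class label keeps roughly the same 70:30 ratio in both parts. This is an illustration of the technique, not the script used to train the model above; in practice a library routine such as scikit-learn's train_test_split with its stratify argument does the same job. The function name stratified_split and the seed are assumptions.

```python
import random
from collections import defaultdict
from typing import Dict, Hashable, List, Sequence, Tuple


def stratified_split(
    labels: Sequence[Hashable], train_fraction: float = 0.7, seed: int = 0
) -> Tuple[List[int], List[int]]:
    """Split sample indices so that every label keeps roughly the same
    train/test ratio. Returns (train_indices, test_indices).

    Hypothetical helper for illustration; not part of ORCA's API.
    """
    # Group sample indices by their class label.
    by_label: Dict[Hashable, List[int]] = defaultdict(list)
    for index, label in enumerate(labels):
        by_label[label].append(index)

    rng = random.Random(seed)  # fixed seed keeps the split reproducible
    train, test = [], []
    for indices in by_label.values():
        rng.shuffle(indices)
        cut = round(len(indices) * train_fraction)
        train.extend(indices[:cut])
        test.extend(indices[cut:])
    return sorted(train), sorted(test)
```

Splitting per label rather than over the whole dataset matters when one class is much rarer than the other, since a plain random split could leave too few examples of the rare class in the test part.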

DnaA

The table below shows a small overview of common dnaA-boxes and relevant papers associated with them. This table is by no means complete, but can serve as a starting point for further research. We also include some more papers that were useful in researching the effects of the dnaA protein and its binding sites: [1], [2], [3], [4].

We use the first entry in the table below and allow for 0 mismatches in this sequence.

Sequence	Paper	Year	Notes
TTAT(A|C|G|T)CACA	[5], [6]	2007/8	In (roughly) all eubacteria, not just B. subtilis or E. coli
TTATCCACA, TTTTCCACA, TTATCAACA, TCATTCACA, TTATACACA, TTATCCAAA	[7]	1997	Affinity study
(T|C)(T|C)(A|T|C)T(A|C)C(A|G)(A|C|T)(A|C)	[8]	1991	Only in E. coli K12. Do not use.
TGTG(G|T)ATAAC	[9]	1985	Matsui-box
TTAT(A|C)CA(A|C)A	[10]	1984	The first consensus
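The consensus notation used in the table can be expanded into concrete sequences and matched against a genome with a mismatch tolerance. The sketch below does this with a simple Hamming-distance scan on the forward strand of a circular sequence. It is an illustration under stated assumptions, not ORCA's own search: expand_consensus and find_boxes are hypothetical names, and the scan is not optimised for whole genomes.

```python
from itertools import product
from typing import List


def expand_consensus(pattern: str) -> List[str]:
    """Expand 'TTAT(A|C|G|T)CACA' into every concrete sequence it encodes."""
    options, i = [], 0
    while i < len(pattern):
        if pattern[i] == "(":
            close = pattern.index(")", i)
            options.append(pattern[i + 1:close].split("|"))
            i = close + 1
        else:
            options.append([pattern[i]])
            i += 1
    return ["".join(choice) for choice in product(*options)]


def find_boxes(seq: str, boxes: List[str], max_mismatches: int = 0) -> List[int]:
    """Return start positions where any box matches the circular sequence
    with at most `max_mismatches` substitutions (forward strand only)."""
    n, hits = len(seq), []
    for start in range(n):
        for box in boxes:
            # Wrap around the end of the sequence: the chromosome is circular.
            window = "".join(seq[(start + k) % n] for k in range(len(box)))
            mismatches = sum(a != b for a, b in zip(window, box))
            if mismatches <= max_mismatches:
                hits.append(start)
                break  # one hit per start position is enough
    return hits
```

With max_mismatches=0 only exact occurrences of the four expanded 9-mers are reported; raising the limit quickly increases the number of hits, which is consistent with the recommended maximum of 2 mentioned in the parameter list.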

Release history

1.0.1: Mar 5, 2024
1.0.0: Feb 27, 2024 (this version)

