A collection of different evaluators for DNA nucleotide-level labeling
DNA segmentation benchmark
This benchmark provides easy metrics for segmentation tasks beyond the common scores. It is highly flexible and easily adaptable to all kinds of annotations.
Insertion / Deletion / Excision / Incision metric
Looking at the kinds of errors models make when segmenting can reveal systematic biases and issues. Furthermore, this package also lets you look at the lengths of the different errors.
Error counts
Error lengths
Precision / Recall across different levels
Similar to the tool gffcompare, this package offers precision / recall evaluation at different levels.
Because this benchmark is flexible and not limited to evaluating exons, any class can be chosen as the positive label. For the coming examples,
I will stick to referring to exons as positives.
Nucleotide level
- TP : A ground truth exon nucleotide being predicted as exon
- FP : A ground truth non-exon nucleotide being predicted as exon
- TN : A ground truth non-exon nucleotide not being predicted as exon
- FN : A ground truth exon nucleotide not being predicted as exon
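The nucleotide-level definitions above can be illustrated with a small, package-independent sketch. The helper names `nucleotide_confusion` and `precision_recall` are hypothetical and are not part of this package's API; the two sequences are the ones used in the Usage section, with `0` as the exon label.

```python
EXON = 0  # label value for exons, as in the Usage section


def nucleotide_confusion(gt, pred, positive):
    """Count per-nucleotide TP/FP/TN/FN for one positive label value."""
    if len(gt) != len(pred):
        raise ValueError("gt and pred must have the same length")
    counts = {"TP": 0, "FP": 0, "TN": 0, "FN": 0}
    for g, p in zip(gt, pred):
        if p == positive:
            counts["TP" if g == positive else "FP"] += 1
        else:
            counts["FN" if g == positive else "TN"] += 1
    return counts


def precision_recall(counts):
    """Return (precision, recall); 0.0 when a denominator is empty."""
    predicted = counts["TP"] + counts["FP"]
    actual = counts["TP"] + counts["FN"]
    precision = counts["TP"] / predicted if predicted else 0.0
    recall = counts["TP"] / actual if actual else 0.0
    return precision, recall


gt = [8, 8, 8, 0, 0, 0, 0, 0, 2, 2, 2, 2, 0, 0, 0, 0, 0, 2, 2, 0, 0, 8, 8, 8, 8]
pred = [0, 0, 0, 0, 0, 2, 2, 2, 0, 0, 0, 0, 2, 0, 0, 0, 0, 0, 0, 8, 8, 8, 8, 8, 8]

counts = nucleotide_confusion(gt, pred, EXON)
precision, recall = precision_recall(counts)
print(counts)             # {'TP': 6, 'FP': 9, 'TN': 4, 'FN': 6}
print(precision, recall)  # 0.4 0.5
```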
Encompassing sections
- TP : A continuous sequence of ground truth exon nucleotides being contained in a continuous sequence of predicted exon nucleotides
- FP : A continuous sequence of ground truth exon nucleotides not being contained on both sides in a continuous sequence of predicted exon nucleotides
- TN :
- FN : A continuous sequence of predicted exon nucleotides not overlapping with any ground truth exon section
Strict sections
- TP : A continuous sequence of ground truth exon nucleotides exactly matching a continuous sequence of predicted exon nucleotides
- FP : A continuous sequence of ground truth exon nucleotides not exactly matching a continuous sequence of predicted exon nucleotides
- TN :
- FN : A continuous sequence of predicted exon nucleotides not overlapping with any ground truth exon section
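To make the two section-level notions concrete, here is a minimal sketch of how contiguous sections could be extracted and compared. It only illustrates the "contained in" and "exactly matching" relations described above; the function names are hypothetical, the toy sequences are made up, and this is not necessarily how the package implements these levels.

```python
EXON = 0  # assumed label value for exons


def find_sections(labels, positive):
    """Return (start, end) pairs, end exclusive, of contiguous runs of `positive`."""
    sections = []
    start = None
    for i, label in enumerate(labels):
        if label == positive and start is None:
            start = i
        elif label != positive and start is not None:
            sections.append((start, i))
            start = None
    if start is not None:
        sections.append((start, len(labels)))
    return sections


def strict_matches(gt_sections, pred_sections):
    """Ground-truth sections that exactly match a predicted section."""
    pred_set = set(pred_sections)
    return [s for s in gt_sections if s in pred_set]


def encompassed(gt_sections, pred_sections):
    """Ground-truth sections fully contained in some predicted section."""
    return [
        (gs, ge)
        for gs, ge in gt_sections
        if any(ps <= gs and ge <= pe for ps, pe in pred_sections)
    ]


# Toy data: the first predicted exon starts one nucleotide too early.
gt = [8, 0, 0, 0, 2, 2, 0, 0, 8]
pred = [0, 0, 0, 0, 2, 2, 0, 0, 8]

gt_sections = find_sections(gt, EXON)      # [(1, 4), (6, 8)]
pred_sections = find_sections(pred, EXON)  # [(0, 4), (6, 8)]

print(strict_matches(gt_sections, pred_sections))  # [(6, 8)]
print(encompassed(gt_sections, pred_sections))     # [(1, 4), (6, 8)]
```

Every strict match is also an encompassed section, so the strict level is always at least as demanding as the encompassing one.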
All inner section boundaries are correct (only for multi-exon transcripts)
- TP : A set of predicted exon sections where all the inner boundaries are correct
- FP : A set of predicted exon sections where not all the inner boundaries are correct
- TN :
- FN : No prediction for exons being made despite ground truth exon annotations
Total section boundary correctness
- TP : A set of predicted exon sections where all the boundaries are correct
- FP : A set of predicted exon sections where not all the boundaries are correct
- TN :
- FN : No prediction for exons being made despite ground truth exon annotations
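The difference between the two boundary levels can be sketched as follows. This assumes that "inner boundaries" means every section boundary except the start of the first exon and the end of the last exon (comparable to gffcompare's intron chain level); that reading, the function names, and the toy sections are assumptions for illustration only.

```python
def boundaries(sections, inner_only=False):
    """Flatten (start, end) sections into an ordered list of boundary positions.

    With inner_only=True the first start and the last end are dropped.
    """
    flat = [pos for section in sections for pos in section]
    return flat[1:-1] if inner_only else flat


def boundaries_match(gt_sections, pred_sections, inner_only=False):
    """Return True/False, or None when the inner check does not apply."""
    if inner_only and len(gt_sections) < 2:
        return None  # inner boundaries only exist for multi-exon transcripts
    return boundaries(gt_sections, inner_only) == boundaries(pred_sections, inner_only)


# Sections as (start, end) with end exclusive; the first predicted exon
# starts one nucleotide too early, all other boundaries agree.
gt_sections = [(1, 4), (6, 8)]
pred_sections = [(0, 4), (6, 8)]

inner_ok = boundaries_match(gt_sections, pred_sections, inner_only=True)
all_ok = boundaries_match(gt_sections, pred_sections)
print(inner_ok)  # True
print(all_ok)    # False
```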
Frameshift metrics
When looking at segmented DNA, we're often interested in how well the chained exon transcript (assuming no alternative splicing) matches the protein sequence of a gene. However, to properly evaluate this, each gene needs to be mapped to a protein sequence, which is not the case for arbitrary inputs. Instead, this metric evaluates the frameshift that a predicted segmentation introduces across a sequence.
Again, this metric can be incredibly insightful, but you have to be careful how you use it. Unless you
are sure that all exons are part of the final transcript for all the benchmarked sequences, DON'T USE IT.
Your results will be skewed and hold no value.
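As a rough illustration of the idea, the sketch below tracks how far the predicted reading frame drifts from the ground truth one. It assumes one possible reading of "frameshift": at every nucleotide labeled exon in both sequences, compare the number of exon nucleotides seen so far in the prediction and in the ground truth, modulo 3. The function name `frame_offsets` and the toy data are hypothetical, and the package may compute its metric differently.

```python
def frame_offsets(gt, pred, exon):
    """Tally the frame offset (0, 1 or 2) at positions that are exon in both."""
    if len(gt) != len(pred):
        raise ValueError("gt and pred must have the same length")
    gt_count = 0    # exon nucleotides seen so far in the ground truth
    pred_count = 0  # exon nucleotides seen so far in the prediction
    tally = {0: 0, 1: 0, 2: 0}
    for g, p in zip(gt, pred):
        if g == exon and p == exon:
            tally[(pred_count - gt_count) % 3] += 1
        if g == exon:
            gt_count += 1
        if p == exon:
            pred_count += 1
    return tally


EXON, INTRON = 0, 2  # assumed label values

# The prediction extends the first exon by one nucleotide, which shifts
# the frame of the second exon by one.
gt = [EXON, EXON, EXON, INTRON, INTRON, EXON, EXON, EXON]
pred = [EXON, EXON, EXON, EXON, INTRON, EXON, EXON, EXON]

offsets = frame_offsets(gt, pred, EXON)
print(offsets)  # {0: 3, 1: 3, 2: 0}
```

An offset of 0 means the shared exon nucleotide is read in the same frame in both segmentations; 1 or 2 means the downstream codons would be shifted.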
Extra Visualizations
For debugging single sequences and analyzing the predictions in detail, this package also contains a module to render interactive
webpages; see example_data/genome_annotation_comparison_enhanced.html for an example.
Usage
pip install dna-segmentation-benchmark
# load the module
from enum import Enum
# define the labels of the data
class CustomLabelDef(Enum):
    NONCODING = 8
    EXON = 0
    INTRON = 2
As previously mentioned, one of the strengths of this package is its ability to run the evaluations on any specified label. In the following example, the evaluation is run for introns and exons alike (although most of the metrics are tailored to exons).
from dna_segmentation_benchmark import evaluate_predictors as ep
chosen_eval_metrics = [ep.EvalMetrics.INDEL, ep.EvalMetrics.FRAMESHIFT]
classes_to_eval = [CustomLabelDef.EXON, CustomLabelDef.INTRON]
example_gt_seq = [8, 8, 8, 0, 0, 0, 0, 0, 2, 2, 2, 2, 0, 0, 0, 0, 0, 2, 2, 0, 0, 8, 8, 8, 8]
example_pred_seq = [0, 0, 0, 0, 0, 2, 2, 2, 0, 0, 0, 0, 2, 0, 0, 0, 0, 0, 0, 8, 8, 8, 8, 8, 8]
evaluation = ep.benchmark_gt_vs_pred_single(
    gt_labels=example_gt_seq,
    pred_labels=example_pred_seq,
    labels=CustomLabelDef,
    classes=classes_to_eval,
    metrics=chosen_eval_metrics,
)
There are more extensive examples in the examples folder.