
A collection of evaluators for DNA nucleotide-level labeling


DNA segmentation benchmark

This benchmark provides easy metrics for segmentation tasks beyond the common scores. It is highly flexible and easily adaptable to all kinds of annotations.

Insertion / Deletion / Excision / Incision metric

Looking at the kinds of errors a model makes when segmenting can reveal systematic biases and issues. This package also reports the lengths of the different errors.

Error counts

[Figure: error counts]

Error lengths

[Figure: error lengths]
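As a rough illustration of the idea (not the package's implementation), the sketch below finds contiguous runs where prediction and ground truth disagree about the positive class and records their lengths. The function name `error_runs` and the two run kinds `spurious` and `missing` are hypothetical; the package's own four categories (insertion, deletion, excision, incision) are not reproduced here because their exact definitions are not given in this README.

```python
from itertools import groupby

def error_runs(gt, pred, positive):
    """Return (kind, start, length) for each contiguous run of disagreement
    about the positive class. Hypothetical helper, not part of the package."""
    def kind(g, p):
        if g != positive and p == positive:
            return "spurious"  # predicted positive where ground truth is not
        if g == positive and p != positive:
            return "missing"   # ground truth positive that was not predicted
        return None            # both agree about the positive class

    runs, pos = [], 0
    for k, group in groupby(zip(gt, pred), key=lambda gp: kind(*gp)):
        length = sum(1 for _ in group)
        if k is not None:
            runs.append((k, pos, length))
        pos += length
    return runs

# Label values mirror the usage example in this README: 0 = exon, 2 = intron, 8 = noncoding
gt   = [8, 8, 0, 0, 0, 0, 2, 2, 0, 0, 8]
pred = [8, 0, 0, 0, 2, 2, 2, 2, 0, 0, 0]
runs = error_runs(gt, pred, positive=0)
print(runs)
```

Counting the runs per kind gives an error-count view, and collecting the third tuple element gives an error-length view.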

Precision / Recall across different levels

Similar to the tool gffcompare, this package offers precision / recall evaluation at different levels.
Because this benchmark is flexible and not limited to evaluating exons, any class can be chosen as the positive label. For the following examples, I will refer to exons as the positives.

Nucleotide level

  • TP : A ground truth exon nucleotide being predicted as exon
  • FP : A ground truth non exon nucleotide being predicted as exon
  • TN : A ground truth non exon nucleotide not being predicted as exon
  • FN : A ground truth exon nucleotide not being predicted as exon
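A minimal sketch of nucleotide-level counting, using the usual confusion-matrix convention in which an FN is a ground truth exon nucleotide that is not predicted as exon. The function name `nucleotide_confusion` is hypothetical and only illustrates the definitions; it is not the package's API.

```python
def nucleotide_confusion(gt, pred, positive):
    """Count TP, FP, TN, FN per nucleotide for one positive label."""
    tp = fp = tn = fn = 0
    for g, p in zip(gt, pred):
        if p == positive:
            if g == positive:
                tp += 1  # exon predicted as exon
            else:
                fp += 1  # non-exon predicted as exon
        else:
            if g == positive:
                fn += 1  # exon not predicted as exon
            else:
                tn += 1  # non-exon not predicted as exon
    return tp, fp, tn, fn

# 0 = exon, 2 = intron, 8 = noncoding
gt   = [8, 0, 0, 0, 2, 2, 0, 0]
pred = [0, 0, 0, 2, 2, 2, 0, 8]
tp, fp, tn, fn = nucleotide_confusion(gt, pred, positive=0)
precision = tp / (tp + fp)
recall = tp / (tp + fn)
print(tp, fp, tn, fn, precision, recall)
```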


Encompassing sections

  • TP : A continuous sequence of ground truth exon nucleotides being contained in a continuous sequence of predicted exon nucleotides
  • FP : A continuous sequence of ground truth exon nucleotides not being contained on both sides in a continuous sequence of predicted exon nucleotides
  • TN :
  • FN : A continuous sequence of predicted exon nucleotides not overlapping with any ground truth exon section
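The containment check behind the TP case can be sketched as follows, assuming sections are represented as half-open `(start, end)` intervals. Both helper names are hypothetical, and the sketch only covers the TP test, not the full bookkeeping of the package.

```python
def sections(labels, positive):
    """Return half-open (start, end) intervals of contiguous positive labels."""
    out, start = [], None
    for i, lab in enumerate(labels):
        if lab == positive and start is None:
            start = i                  # a new section begins
        elif lab != positive and start is not None:
            out.append((start, i))     # the current section ends before i
            start = None
    if start is not None:
        out.append((start, len(labels)))
    return out

def encompassed(gt_sections, pred_sections):
    """Ground truth sections fully contained in some predicted section."""
    return [g for g in gt_sections
            if any(p[0] <= g[0] and g[1] <= p[1] for p in pred_sections)]

# 0 = exon, 2 = intron, 8 = noncoding
gt   = [8, 0, 0, 0, 2, 2, 0, 0, 8, 8]
pred = [0, 0, 0, 0, 0, 2, 2, 0, 8, 8]
gt_secs = sections(gt, positive=0)
pred_secs = sections(pred, positive=0)
covered = encompassed(gt_secs, pred_secs)
print(gt_secs, pred_secs, covered)
```

Here the first ground truth exon is covered on both sides by a longer predicted exon, while the second one is only partially predicted and therefore does not count as encompassed.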

Strict sections

  • TP : A continuous sequence of ground truth exon nucleotides exactly matching a continuous sequence of predicted exon nucleotides
  • FP : A continuous sequence of ground truth exon nucleotides not exactly matching a continuous sequence of predicted exon nucleotides
  • TN :
  • FN : A continuous sequence of predicted exon nucleotides not overlapping with any ground truth exon section
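Under the same interval representation, strict matching reduces to a set intersection. This is a sketch with a hypothetical helper `section_set`, not the package's code.

```python
from itertools import groupby

def section_set(labels, positive):
    """Return the set of half-open (start, end) runs carrying the positive label."""
    out, pos = set(), 0
    for lab, group in groupby(labels):
        n = len(list(group))
        if lab == positive:
            out.add((pos, pos + n))
        pos += n
    return out

# 0 = exon, 2 = intron, 8 = noncoding
gt   = [8, 0, 0, 0, 2, 2, 0, 0, 8, 8]
pred = [8, 0, 0, 0, 2, 0, 0, 0, 8, 8]
# A section is an exact match only if both start and end agree
exact = section_set(gt, positive=0) & section_set(pred, positive=0)
print(exact)
```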

All inner section boundaries are correct (only for multi-exon transcripts)

  • TP : A set of predicted exon sections where all the inner boundaries are correct
  • FP : A set of predicted exon sections where not all the inner boundaries are correct
  • TN :
  • FN : No prediction for exons being made despite ground truth exon annotations

Total section boundary correctness

  • TP : A set of predicted exon sections where all the boundaries are correct
  • FP : A set of predicted exon sections where not all the boundaries are correct
  • TN :
  • FN : No prediction for exons being made despite ground truth exon annotations
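The difference between the inner-boundary and total-boundary checks can be sketched as below. This assumes that "inner boundaries" means every boundary except the start of the first exon and the end of the last exon; that reading, and all function names, are assumptions rather than the package's documented behavior.

```python
def boundaries(sections):
    """Flatten (start, end) sections into an ordered list of boundary positions."""
    return [pos for sec in sorted(sections) for pos in sec]

def inner_boundaries_match(gt_sections, pred_sections):
    # Drop the first start and the last end, then compare what remains.
    # Only meaningful for transcripts with more than one exon.
    return boundaries(gt_sections)[1:-1] == boundaries(pred_sections)[1:-1]

def all_boundaries_match(gt_sections, pred_sections):
    return boundaries(gt_sections) == boundaries(pred_sections)

gt_secs   = [(3, 8), (12, 17), (19, 21)]
pred_secs = [(0, 8), (12, 17), (19, 23)]
inner_ok = inner_boundaries_match(gt_secs, pred_secs)
total_ok = all_boundaries_match(gt_secs, pred_secs)
print(inner_ok, total_ok)
```

In this example the prediction gets every splice-site-like boundary right but misplaces the outer start and end, so it passes the inner check and fails the total check.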

Frameshift metrics

When looking at segmented DNA, we are often interested in how well the chained exon transcript (assuming no alternative splicing) fits the protein sequence of a gene. However, evaluating this properly requires each gene to be mapped to a protein sequence, which is not the case for arbitrary inputs. This metric evaluates the frameshift that is introduced across a sequence segmentation.

This metric can be incredibly insightful, but you have to be careful how you use it. Unless you are sure that all exons are part of the final transcript for all the benchmarked sequences, DON'T USE IT. Otherwise your results will be skewed and hold no value.
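One way to picture a frameshift without a protein reference is to compare, position by position, how many exon nucleotides the prediction and the ground truth have accumulated, modulo 3. The sketch below is an assumption about how such a measure could be computed, with a hypothetical function name; it is not taken from the package.

```python
from itertools import accumulate

def frame_offsets(gt, pred, exon):
    """Per-position reading-frame offset (0, 1, or 2) of pred relative to gt."""
    gt_cum = accumulate(int(g == exon) for g in gt)      # exon nucleotides so far in gt
    pred_cum = accumulate(int(p == exon) for p in pred)  # exon nucleotides so far in pred
    return [(p - g) % 3 for g, p in zip(gt_cum, pred_cum)]

# 0 = exon, 2 = intron; the prediction extends the first exon by one nucleotide
gt   = [0, 0, 0, 2, 2, 0, 0, 0]
pred = [0, 0, 0, 0, 2, 0, 0, 0]
offsets = frame_offsets(gt, pred, exon=0)
print(offsets)
```

An offset of 0 means the chained exon sequence is still in frame at that position; the single extra exon nucleotide here shifts everything downstream by one.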

Extra Visualizations

For debugging single sequences and analyzing the predictions in detail, this package also contains a module to render interactive webpages (see example_data/genome_annotation_comparison_enhanced.html).

Usage

pip install dna-segmentation-benchmark

# import Enum to define the label mapping
from enum import Enum
# define the labels of the data
class CustomLabelDef(Enum):
    NONCODING = 8
    EXON = 0
    INTRON = 2

As previously mentioned, one of the strengths of this package is its ability to run the evaluations on any specified label. In the following example, the evaluation is run for both introns and exons, although most of the metrics are tailored to exons.

from dna_segmentation_benchmark import evaluate_predictors as ep
chosen_eval_metrics = [ep.EvalMetrics.INDEL, ep.EvalMetrics.FRAMESHIFT]
classes_to_eval = [CustomLabelDef.EXON, CustomLabelDef.INTRON]

example_gt_seq = [8, 8, 8, 0, 0, 0, 0, 0, 2, 2, 2, 2, 0, 0, 0, 0, 0, 2, 2, 0, 0, 8, 8, 8, 8]
example_pred_seq = [0, 0, 0, 0, 0, 2, 2, 2, 0, 0, 0, 0, 2, 0, 0, 0, 0, 0, 0, 8, 8, 8, 8, 8, 8]

evaluation = ep.benchmark_gt_vs_pred_single(gt_labels=example_gt_seq, pred_labels=example_pred_seq, labels=CustomLabelDef,
                                            classes=classes_to_eval,
                                            metrics=chosen_eval_metrics)

There are more extensive examples in the examples folder.
