A collection of evaluators for DNA nucleotide-level labeling

DNA segmentation benchmark

This benchmark provides easy-to-use metrics for segmentation tasks that go beyond the common scores. It is flexible and can be adapted to all kinds of annotations.

Insertion / Deletion / Excision / Incision metric

Looking at the kinds of errors a model makes when segmenting can reveal systematic biases and issues. This package also allows you to look at the lengths of the different errors.

Error counts

image

Error lengths

image
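To give a rough idea of what counting errors and measuring their lengths involves, here is a minimal sketch. It does not reproduce the package's insertion / deletion / excision / incision categories. The function name `error_run_lengths` and the two simplified categories `added` and `missed` are assumptions made only for illustration.

```python
from itertools import groupby

def error_run_lengths(gt_labels, pred_labels, positive):
    """Collect lengths of contiguous runs of wrongly added or missed positives."""
    def classify(gt, pred):
        if gt != positive and pred == positive:
            return "added"   # predicted positive where the ground truth has none
        if gt == positive and pred != positive:
            return "missed"  # ground truth positive that was not predicted
        return None          # prediction agrees with ground truth for this class

    states = [classify(g, p) for g, p in zip(gt_labels, pred_labels)]
    runs = {"added": [], "missed": []}
    # groupby merges neighbouring positions that share the same error state
    for state, group in groupby(states):
        if state is not None:
            runs[state].append(len(list(group)))
    return runs

gt = [8, 0, 0, 0, 0, 8, 8, 0, 0]
pred = [0, 0, 0, 8, 8, 8, 0, 0, 0]
runs = error_run_lengths(gt, pred, positive=0)
print(runs)  # {'added': [1, 1], 'missed': [2]}
```

The number of entries per category corresponds to an error count, and the entries themselves to error lengths.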

Precision / Recall across different levels

Similar to the tool gffcompare, this package offers precision / recall evaluation at different levels.
Because this benchmark is flexible and not limited to evaluating exons, any class can be chosen as the positive label. For the coming examples, I will stick to referring to exons as positives.

Nucleotide level

  • TP : A ground truth exon nucleotide being predicted as exon
  • FP : A ground truth non-exon nucleotide being predicted as exon
  • TN : A ground truth non-exon nucleotide not being predicted as exon
  • FN : A ground truth exon nucleotide not being predicted as exon

image
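The nucleotide-level counts can be sketched in a few lines of plain Python. This is an illustration of the standard definitions, not the package's implementation, and the function name `nucleotide_confusion` is an assumption. The sequences are the ones used in the usage example of this README, with 0 standing for exon.

```python
def nucleotide_confusion(gt_labels, pred_labels, positive):
    """Count TP, FP, TN and FN per nucleotide for one positive label."""
    tp = fp = tn = fn = 0
    for gt, pred in zip(gt_labels, pred_labels):
        if pred == positive:
            if gt == positive:
                tp += 1  # exon predicted as exon
            else:
                fp += 1  # non-exon predicted as exon
        else:
            if gt == positive:
                fn += 1  # exon not predicted as exon
            else:
                tn += 1  # non-exon not predicted as exon
    return tp, fp, tn, fn

gt_seq = [8, 8, 8, 0, 0, 0, 0, 0, 2, 2, 2, 2, 0, 0, 0, 0, 0, 2, 2, 0, 0, 8, 8, 8, 8]
pred_seq = [0, 0, 0, 0, 0, 2, 2, 2, 0, 0, 0, 0, 2, 0, 0, 0, 0, 0, 0, 8, 8, 8, 8, 8, 8]

tp, fp, tn, fn = nucleotide_confusion(gt_seq, pred_seq, positive=0)
precision = tp / (tp + fp)
recall = tp / (tp + fn)
print(tp, fp, tn, fn)     # 6 9 4 6
print(precision, recall)  # 0.4 0.5
```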

Encompassing sections

  • TP : A continuous sequence of ground truth exon nucleotides being contained in a continuous sequence of predicted exon nucleotides
  • FP : A continuous sequence of ground truth exon nucleotides not being contained on both sides in a continuous sequence of predicted exon nucleotides
  • TN :
  • FN : A continuous sequence of predicted exon nucleotides not overlapping with any ground truth exon section

image

Strict sections

  • TP : A continuous sequence of ground truth exon nucleotides exactly matching a continuous sequence of predicted exon nucleotides
  • FP : A continuous sequence of ground truth exon nucleotides not exactly matching a continuous sequence of predicted exon nucleotides
  • TN :
  • FN : A continuous sequence of predicted exon nucleotides not overlapping with any ground truth exon section

image
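The true-positive definitions of the two section levels can be sketched as follows. This is a minimal illustration that assumes sections are represented as half-open `(start, end)` intervals. The helper `sections` and the variable names are hypothetical and not part of the package's API.

```python
from itertools import groupby

def sections(labels, positive):
    """Return half-open (start, end) intervals of contiguous runs of `positive`."""
    result, pos = [], 0
    for label, group in groupby(labels):
        length = len(list(group))
        if label == positive:
            result.append((pos, pos + length))
        pos += length
    return result

gt = [8, 0, 0, 0, 2, 2, 0, 0, 8]
pred = [0, 0, 0, 0, 2, 2, 0, 0, 8]
gt_sections = sections(gt, positive=0)      # [(1, 4), (6, 8)]
pred_sections = sections(pred, positive=0)  # [(0, 4), (6, 8)]

# Strict: the ground truth section must match a predicted section exactly.
strict_tp = [s for s in gt_sections if s in pred_sections]

# Encompassing: the ground truth section must lie inside a predicted section.
encompassing_tp = [
    (gs, ge) for gs, ge in gt_sections
    if any(ps <= gs and ge <= pe for ps, pe in pred_sections)
]

print(strict_tp)        # [(6, 8)]
print(encompassing_tp)  # [(1, 4), (6, 8)]
```

The first ground truth exon is over-predicted by one nucleotide on the left, so it counts at the encompassing level but not at the strict level.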

All inner section boundaries are correct (only for multi exon transcript)

  • TP : A set of predicted exon sections where all the inner boundaries are correct
  • FP : A set of predicted exon sections where not all the inner boundaries are correct
  • TN :
  • FN : No prediction for exons being made despite ground truth exon annotations

image

Total section boundary correctness

  • TP : A set of predicted exon sections where all the boundaries are correct
  • FP : A set of predicted exon sections where not all the boundaries are correct
  • TN :
  • FN : No prediction for exons being made despite ground truth exon annotations

image
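A minimal sketch of the two boundary checks, assuming that "inner boundaries" means every section boundary except the start of the first exon and the end of the last exon. That reading, the interval representation and all function names are assumptions for illustration only. The package may define the check differently.

```python
def boundaries(section_list):
    """Flatten [(start, end), ...] into an ordered list of boundary positions."""
    return [edge for section in section_list for edge in section]

def inner_boundaries_match(gt_sections, pred_sections):
    """Compare all boundaries except the outermost two; None for single-exon input."""
    if len(gt_sections) < 2:
        return None  # only meaningful for multi-exon transcripts
    return boundaries(gt_sections)[1:-1] == boundaries(pred_sections)[1:-1]

def all_boundaries_match(gt_sections, pred_sections):
    """Compare every boundary, including the outermost two."""
    return boundaries(gt_sections) == boundaries(pred_sections)

gt_sections = [(1, 4), (6, 8)]    # half-open exon intervals in the ground truth
pred_sections = [(0, 4), (6, 8)]  # first exon starts one nucleotide too early

inner_ok = inner_boundaries_match(gt_sections, pred_sections)
total_ok = all_boundaries_match(gt_sections, pred_sections)
print(inner_ok, total_ok)  # True False
```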

Frameshift metrics

When looking at segmented DNA, we are often interested in how well the chained exon transcript (assuming no alternative splicing) fits the protein sequence of a gene. However, evaluating this properly requires each gene to be mapped to a protein sequence, which is not the case for arbitrary inputs. This metric evaluates the frameshift that is introduced across a sequence segmentation.

This metric can be incredibly insightful, but you have to be careful how you use it. Unless you are sure that all exons are part of the final transcript for all the benchmarked sequences, DON'T USE IT. Otherwise your results will be skewed and hold no value.

image
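One way to picture a frameshift is to track, at every position, how many exon nucleotides the prediction has accumulated compared to the ground truth, modulo 3. The sketch below follows that idea. It is an assumption about how such a metric could be computed, not a description of the package's implementation, and `frame_offsets` is a hypothetical name.

```python
def frame_offsets(gt_labels, pred_labels, exon):
    """For each position labelled exon in both sequences, return the frame offset
    (0, 1 or 2) of the predicted transcript relative to the ground truth."""
    gt_count = pred_count = 0
    offsets = []
    for gt, pred in zip(gt_labels, pred_labels):
        if gt == exon and pred == exon:
            # difference in exon nucleotides seen so far, reduced to a codon frame
            offsets.append((pred_count - gt_count) % 3)
        gt_count += gt == exon
        pred_count += pred == exon
    return offsets

gt = [0, 0, 0, 2, 2, 0, 0, 0]
pred = [0, 0, 0, 0, 2, 0, 0, 0]  # first exon is predicted one nucleotide too long
offsets = frame_offsets(gt, pred, exon=0)
print(offsets)  # [0, 0, 0, 1, 1, 1]
```

The extra predicted exon nucleotide shifts every following exon position out of frame by one. This only makes sense when all exons are chained into a single transcript.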

Extra Visualizations

For debugging single sequences and analyzing the predictions in detail, this package also contains a module that renders interactive web pages (see example_data/genome_annotation_comparison_enhanced.html).

image

Usage

pip install dna-segmentation-benchmark

# Enum is used to declare the label values
from enum import Enum

# define the labels of the data
class CustomLabelDef(Enum):
    NONCODING = 8
    EXON = 0
    INTRON = 2

As previously mentioned, one of the strengths of this package is its ability to run the evaluations on any specified label. The following example therefore runs the evaluation for introns and exons alike, although most of the metrics are tailored to exons.

from dna_segmentation_benchmark import evaluate_predictors as ep

# metrics to compute and classes to treat as the positive label
chosen_eval_metrics = [ep.EvalMetrics.INDEL, ep.EvalMetrics.FRAMESHIFT]
classes_to_eval = [CustomLabelDef.EXON, CustomLabelDef.INTRON]

# per-nucleotide labels, encoded with the values of CustomLabelDef
example_gt_seq = [8, 8, 8, 0, 0, 0, 0, 0, 2, 2, 2, 2, 0, 0, 0, 0, 0, 2, 2, 0, 0, 8, 8, 8, 8]
example_pred_seq = [0, 0, 0, 0, 0, 2, 2, 2, 0, 0, 0, 0, 2, 0, 0, 0, 0, 0, 0, 8, 8, 8, 8, 8, 8]

evaluation = ep.benchmark_gt_vs_pred_single(gt_labels=example_gt_seq,
                                            pred_labels=example_pred_seq,
                                            labels=CustomLabelDef,
                                            classes=classes_to_eval,
                                            metrics=chosen_eval_metrics)
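Before running the benchmark it can help to check that both sequences have the same length and only contain values defined in the label enum. The helper below is a hypothetical pre-check written for this README. It is not part of the package, and whether the package performs similar validation itself is not stated here.

```python
from enum import Enum

class CustomLabelDef(Enum):
    NONCODING = 8
    EXON = 0
    INTRON = 2

def check_inputs(gt_labels, pred_labels, labels):
    """Raise ValueError if the sequences differ in length or use undefined labels."""
    if len(gt_labels) != len(pred_labels):
        raise ValueError(
            f"length mismatch: {len(gt_labels)} ground truth vs {len(pred_labels)} predicted"
        )
    known = {member.value for member in labels}
    unknown = (set(gt_labels) | set(pred_labels)) - known
    if unknown:
        raise ValueError(f"labels not defined in {labels.__name__}: {sorted(unknown)}")

example_gt_seq = [8, 8, 8, 0, 0, 0, 0, 0, 2, 2, 2, 2, 0, 0, 0, 0, 0, 2, 2, 0, 0, 8, 8, 8, 8]
example_pred_seq = [0, 0, 0, 0, 0, 2, 2, 2, 0, 0, 0, 0, 2, 0, 0, 0, 0, 0, 0, 8, 8, 8, 8, 8, 8]

# passes silently for the example sequences used above
check_inputs(example_gt_seq, example_pred_seq, CustomLabelDef)
```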

More extensive examples are available in the examples folder.
