This package measures the text structure recognition capabilities of pdftextsplitter.

DeltaTextsplitter package

This package is meant to provide an objective evaluation of the performance of the pdftextsplitter package in terms of KPIs.

The package includes a set of test documents with references in the form of Excel-files. These reference-files are human-produced Excel-files containing the structure of the test documents that the pdftextsplitter package should have recognised. This reference structure can then be compared to the actual output of the pdftextsplitter package, so that the performance of the package can be evaluated.

The performance is evaluated in terms of the following two KPIs:
structure KPI = 1 - (fp + tn)/(fp + tt)
where:
tt = true total, the total number of structure elements in the reference-file;
fp = false positive, the number of structure elements that are present in the output, but not in the reference-file;
tp = true positive, the number of matching structure elements between the reference-file and the actual outcome;
tn = true negative, defined here as tn = tt - tp, i.e. the number of structure elements in the reference-file without a match in the output.
Two structure elements are said to match if the fuzzy match ratio of their titles is >= 80.0 (determined with the package thefuzz) and their main structure types are equal.
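
The matching rule and the structure KPI can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the code used inside deltatextsplitter: the names `Element`, `match_ratio`, `count_true_positives` and `structure_kpi` are hypothetical, the standard-library `difflib.SequenceMatcher` is used as a stand-in for thefuzz (so ratios may differ slightly), elements are paired greedily one-to-one, and the structure KPI is read as the quotient 1 - (fp + tn)/(fp + tt).

```python
from dataclasses import dataclass
from difflib import SequenceMatcher


@dataclass(frozen=True)
class Element:
    """Hypothetical stand-in for one structure element."""
    title: str
    kind: str  # main structure type, e.g. "chapter" or "section"


def match_ratio(a: str, b: str) -> float:
    """Title similarity on a 0-100 scale (difflib stand-in for thefuzz)."""
    return 100.0 * SequenceMatcher(None, a, b).ratio()


def count_true_positives(reference, output, threshold=80.0):
    """Pair each reference element with at most one output element (greedy)."""
    unused = list(output)
    tp = 0
    for ref in reference:
        for out in unused:
            same_type = ref.kind == out.kind
            if same_type and match_ratio(ref.title, out.title) >= threshold:
                unused.remove(out)  # safe: we leave the inner loop right away
                tp += 1
                break
    return tp


def structure_kpi(reference, output, threshold=80.0):
    """Assumed reading: structure KPI = 1 - (fp + tn) / (fp + tt)."""
    tt = len(reference)  # true total
    if tt == 0:
        raise ValueError("the reference must contain at least one element")
    tp = count_true_positives(reference, output, threshold)
    fp = len(output) - tp  # output elements without a reference match
    tn = tt - tp           # reference elements without an output match
    return 1.0 - (fp + tn) / (fp + tt)
```

With these definitions a perfect output (fp = 0, tp = tt) scores 1, and every missed or spurious element lowers the score.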

cascade KPI = 1 - uc/tp
where uc = unequal cascades, the number of matched structure elements (a subset of the tp elements above) for which the cascade level in the reference-file differs from the cascade level in the outcome of the package.
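
As a small worked sketch of this formula, the function below takes the cascade levels of the matched elements as pairs and counts how many disagree. The function name `cascade_kpi` and the pair representation are assumptions made for illustration; they are not taken from the package.

```python
def cascade_kpi(matched_levels):
    """cascade KPI = 1 - uc / tp.

    matched_levels: list of (reference_level, output_level) pairs,
    one pair per matched (true positive) structure element.
    """
    tp = len(matched_levels)
    if tp == 0:
        raise ValueError("cascade KPI is undefined without matched elements")
    # uc: matched elements whose cascade levels differ
    uc = sum(1 for ref_level, out_level in matched_levels if ref_level != out_level)
    return 1.0 - uc / tp
```

For example, four matched elements of which one has a different cascade level give 1 - 1/4 = 0.75.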

With these two KPIs, it is possible to quantify improvements made to the pdftextsplitter package by calculating the KPIs for each released version of pdftextsplitter.

Getting started

The KPI calculation can be performed by entering the following commands:
from deltatextsplitter import deltatextsplitter
mydelta = deltatextsplitter()
mydelta.FullRun()

The KPIs will then be printed, but can also be retrieved from:
mydelta.structure_kpi
mydelta.cascade_kpi
The KPIs per test document can also be retrieved from:
mydelta.documentarray[index].splitter.documentname
mydelta.documentarray[index].structure_kpi
mydelta.documentarray[index].cascade_kpi
There are 12 test documents in total.
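
To collect the per-document values in one pass, a small helper along these lines may be convenient. It is a sketch that only relies on the attribute names listed above; the helper name `per_document_kpis` is hypothetical, and it assumes that `documentarray` can be iterated like a list.

```python
def per_document_kpis(delta):
    """Return (document name, structure KPI, cascade KPI) per test document.

    `delta` is expected to expose `documentarray`, whose items carry
    `splitter.documentname`, `structure_kpi` and `cascade_kpi`.
    """
    rows = []
    for doc in delta.documentarray:  # assumption: iterable like a list
        rows.append((doc.splitter.documentname, doc.structure_kpi, doc.cascade_kpi))
    return rows
```

After a completed run, `for name, s_kpi, c_kpi in per_document_kpis(mydelta): print(name, s_kpi, c_kpi)` would print one line per test document.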

The FullRun-command is very CPU-intensive, as it needs to process all the test documents with the pdftextsplitter-package (in dummy-mode). Once this has been done, one could speed up the process of subsequent calculations by entering
mydelta.FullRun(False,False)
to skip the pdftextsplitter-execution and only redo the KPI-calculation.
