This package can efficiently measure the text structure recognition capabilities of pdftextsplitter.

Project description

DeltaTextsplitter package

This package provides an objective evaluation of the performance of the pdftextsplitter package in terms of key performance indicators (KPIs).

The package includes a set of test documents with references in the form of Excel files. These reference files are human-produced and contain the structure of the test documents that the pdftextsplitter package should have recognised. The reference can then be compared to the actual output of the pdftextsplitter package, so that the performance of the package can be evaluated.

The performance is evaluated in terms of the following two KPIs:
structure KPI = 1 - (fp + tn) / (fp + tt)
where tt = true total, the total number of structure elements in the reference file; fp = false positive, the number of structure elements that are present in the output but not in the reference file; and tn = true negative, defined as tn = tt - tp (the reference elements that the output misses), where tp = true positive, the number of matching structure elements between the reference file and the actual output. Two structure elements are said to match if the fuzzy match ratio of their titles is >= 80.0 (determined with the package thefuzz) and their main structure types are equal.
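
As a rough illustration of these definitions, the sketch below counts matches between a reference list and an output list and evaluates the structure KPI. It is not the package's implementation: the Element tuple, the function names and the greedy one-to-one matching are assumptions made for this example, and the standard-library difflib ratio (scaled to 0-100) stands in for the thefuzz ratio so that the snippet runs without extra dependencies.

```python
from difflib import SequenceMatcher
from typing import NamedTuple


class Element(NamedTuple):
    """Hypothetical structure element: a title plus its main structure type."""
    title: str
    kind: str


def title_ratio(a: str, b: str) -> float:
    # Stand-in for the thefuzz ratio: difflib similarity scaled to 0-100.
    return 100.0 * SequenceMatcher(None, a, b).ratio()


def count_matches(reference, output, threshold=80.0):
    """Greedily pair each reference element with at most one output element."""
    unused = list(output)
    tp = 0
    for ref in reference:
        for out in unused:
            if ref.kind == out.kind and title_ratio(ref.title, out.title) >= threshold:
                unused.remove(out)  # each output element may match only once
                tp += 1
                break
    return tp


def structure_kpi(reference, output, threshold=80.0):
    tt = len(reference)        # true total
    tp = count_matches(reference, output, threshold)
    fp = len(output) - tp      # output elements without a reference match
    tn = tt - tp               # reference elements that the output misses
    if fp + tt == 0:
        return 1.0             # assumption: empty reference and empty output is perfect
    return 1.0 - (fp + tn) / (fp + tt)


reference = [Element("Introduction", "chapter"), Element("Methods", "chapter"),
             Element("Results", "chapter"), Element("Conclusion", "chapter")]
output = [Element("Introduction", "chapter"), Element("Methods", "chapter"),
          Element("Results", "chapter"), Element("Page 7", "chapter")]
print(structure_kpi(reference, output))
```

In this made-up example three of the four reference elements are found and one spurious element is reported, so tp = 3, fp = 1 and tn = 1, giving 1 - 2/5 = 0.6.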

cascade KPI = 1 - uc/tp
where uc = unequal cascades, the number of matched structure elements (a subset of the tp defined above) whose cascade levels differ between the reference file and the output of the package.
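
The cascade KPI can be sketched in the same spirit. The function below is only an illustration: its name, the representation of matches as (reference level, output level) pairs, and the choice to return None when there are no matches are all assumptions, not the package's API.

```python
def cascade_kpi(matched_levels):
    """Evaluate 1 - uc/tp from one (reference_level, output_level) pair per match."""
    tp = len(matched_levels)
    if tp == 0:
        return None  # assumption: the ratio is undefined without any matches
    # uc: matched elements whose cascade levels disagree
    uc = sum(1 for ref_level, out_level in matched_levels if ref_level != out_level)
    return 1.0 - uc / tp


# Made-up cascade levels for four matched elements; one pair disagrees.
pairs = [(1, 1), (2, 2), (2, 3), (3, 3)]
print(cascade_kpi(pairs))
```

With one unequal cascade among four matches, the result is 1 - 1/4 = 0.75.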

With these two KPIs, improvements to the pdftextsplitter package can be quantified by calculating the KPIs for each released version of pdftextsplitter.

Getting started

The KPI calculation can be performed efficiently by entering the following commands:
from deltattextsplitter import documentclass
mydelta = deltattextsplitter()
mydelta.FullRun()

The KPIs will then be printed, but can also be retrieved from:
mydelta.structure_kpi
mydelta.cascade_kpi
The KPIs per test document can also be retrieved from:
mydelta.documentarray[index].splitter.documentname
mydelta.documentarray[index].structure_kpi
mydelta.documentarray[index].cascade_kpi
There are 12 test documents in total.
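
To list the per-document values in one pass, a small helper along the following lines could be used. It relies only on the attribute names quoted above; the helper name kpi_rows is hypothetical, and the stand-in object with invented values exists purely so that the sketch runs without the package installed. In real use, the object returned after FullRun() would be passed instead.

```python
from types import SimpleNamespace


def kpi_rows(delta):
    """Collect (document name, structure KPI, cascade KPI) for each test document."""
    return [
        (doc.splitter.documentname, doc.structure_kpi, doc.cascade_kpi)
        for doc in delta.documentarray
    ]


# Stand-in object with invented names and values, NOT real package output.
fake_delta = SimpleNamespace(documentarray=[
    SimpleNamespace(splitter=SimpleNamespace(documentname="doc_a"),
                    structure_kpi=0.5, cascade_kpi=1.0),
    SimpleNamespace(splitter=SimpleNamespace(documentname="doc_b"),
                    structure_kpi=0.75, cascade_kpi=0.25),
])

for name, structure, cascade in kpi_rows(fake_delta):
    print(f"{name:<10} structure={structure:.3f} cascade={cascade:.3f}")
```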

The FullRun command is very CPU-intensive, as it needs to process all the test documents with the pdftextsplitter package (in dummy mode). Once this has been done, subsequent calculations can be sped up by entering
mydelta.FullRun(False,False)
which skips the pdftextsplitter execution and only redoes the KPI calculation.

Download files

Source Distribution

deltatextsplitter-1.1.0.tar.gz (28.9 MB)

Built Distribution

deltatextsplitter-1.1.0-py3-none-any.whl (28.9 MB)
