This package measures the text structure recognition capabilities of pdftextsplitter

Development Status
- 3 - Alpha
Intended Audience
- Developers
License
- OSI Approved :: MIT License
Programming Language
- Python :: 3.10
Topic
- Software Development :: Build Tools

Project description

DeltaTextsplitter package

This package provides an objective evaluation of the performance of the pdftextsplitter package in terms of KPIs (key performance indicators).

The package includes a set of test documents, each with a reference in the form of an Excel file. These reference-files are human-produced and contain the structure of the test document that the pdftextsplitter package should have recognised. The reference can then be compared to the actual output of the pdftextsplitter package, so that the performance of the package can be evaluated.

The performance is evaluated in terms of the following two KPIs:
structure KPI = 1 - (fp + tn) / (fp + tt)
where:
- tt = true total, the total number of structure elements in the reference-file;
- fp = false positive, the number of structure elements that are present in the output but not in the reference-file;
- tp = true positive, the number of matching structure elements between the reference-file and the actual output;
- tn = true negative, defined as tn = tt - tp, i.e. the number of reference elements without a match in the output.
Two structure elements are said to match if the fuzzy match ratio of their titles is >= 80.0 (determined with the package thefuzz) and their main structure types are equal.
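The matching rule can be sketched as a small predicate. This is only an illustration under stated assumptions: it uses difflib from the standard library as a stand-in for thefuzz, so the scores may differ somewhat from what the package computes, and the names StructureElement, title_ratio and elements_match are hypothetical and not part of deltatextsplitter.

```python
from dataclasses import dataclass
from difflib import SequenceMatcher


@dataclass
class StructureElement:
    """Hypothetical container for one structure element (reference or output)."""
    title: str
    structure_type: str  # main structure type; the label values are assumed


def title_ratio(a: str, b: str) -> float:
    """Title similarity on a 0-100 scale (difflib stand-in, not thefuzz itself)."""
    return 100.0 * SequenceMatcher(None, a, b).ratio()


def elements_match(ref: StructureElement, out: StructureElement,
                   threshold: float = 80.0) -> bool:
    """Match if the main structure types are equal and the title ratio >= threshold."""
    return (ref.structure_type == out.structure_type
            and title_ratio(ref.title, out.title) >= threshold)
```

With a predicate like this, tp would be the number of reference elements that find a match in the output, and fp the number of output elements that match nothing in the reference-file.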

cascade KPI = 1 - uc/tp
where uc = unequal cascades, the number of matched structure elements (a subset of the tp elements above) for which the cascade level in the reference-file differs from the cascade level in the output of the package.
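As a worked illustration, both KPIs can be computed from the four counts. The sketch below reads the structure KPI as the ratio 1 - (fp + tn) / (fp + tt); the function names and the choice to raise an error on a zero denominator are assumptions made here, not behaviour taken from the package.

```python
def structure_kpi(tt: int, tp: int, fp: int) -> float:
    """Structure KPI = 1 - (fp + tn) / (fp + tt), with tn = tt - tp."""
    tn = tt - tp  # reference elements without a match in the output
    if fp + tt == 0:
        raise ValueError("structure KPI is undefined when fp + tt == 0")
    return 1.0 - (fp + tn) / (fp + tt)


def cascade_kpi(tp: int, uc: int) -> float:
    """Cascade KPI = 1 - uc / tp."""
    if tp == 0:
        raise ValueError("cascade KPI is undefined when tp == 0")
    return 1.0 - uc / tp


# Made-up counts, purely for illustration.
example_structure = structure_kpi(tt=20, tp=18, fp=2)
example_cascade = cascade_kpi(tp=18, uc=3)
```

For these made-up counts, tn = 2, so the structure KPI is 1 - 4/22 (about 0.818) and the cascade KPI is 1 - 3/18 (about 0.833). A perfect result (fp = 0, tp = tt, uc = 0) gives 1.0 for both.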

With these two KPIs, improvements to the pdftextsplitter package can be quantified by calculating the KPIs for each released version of pdftextsplitter.

Getting started

The KPI calculation can be performed by entering the following commands (the import name is assumed to match the package name deltatextsplitter):
from deltatextsplitter import deltatextsplitter
mydelta = deltatextsplitter()
mydelta.FullRun()

The KPIs will then be printed, but can also be retrieved from:
mydelta.structure_kpi
mydelta.cascade_kpi
The KPIs per test document can also be retrieved from:
mydelta.documentarray[index].splitter.documentname
mydelta.documentarray[index].structure_kpi
mydelta.documentarray[index].cascade_kpi
There are 12 test documents in total.
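The per-document attributes can be collected in a loop. The sketch below is a hypothetical helper, not part of the package: it assumes only that documentarray can be iterated like a list and that each entry exposes the three attributes named above. A stand-in object with made-up values is used so that the example runs without processing any documents.

```python
from types import SimpleNamespace


def document_kpis(delta):
    """Return (documentname, structure_kpi, cascade_kpi) for each test document."""
    return [
        (doc.splitter.documentname, doc.structure_kpi, doc.cascade_kpi)
        for doc in delta.documentarray
    ]


# Stand-in for a finished run; names and values are invented for illustration.
fake_delta = SimpleNamespace(documentarray=[
    SimpleNamespace(splitter=SimpleNamespace(documentname="doc_a"),
                    structure_kpi=0.9, cascade_kpi=0.8),
    SimpleNamespace(splitter=SimpleNamespace(documentname="doc_b"),
                    structure_kpi=0.75, cascade_kpi=1.0),
])

for name, s_kpi, c_kpi in document_kpis(fake_delta):
    print(f"{name}: structure KPI = {s_kpi:.3f}, cascade KPI = {c_kpi:.3f}")
```

After a real run, the same helper would presumably be called as document_kpis(mydelta).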

The FullRun command is very CPU-intensive, because it has to process all the test documents with the pdftextsplitter package (in dummy mode). Once this has been done, subsequent calculations can be sped up by entering
mydelta.FullRun(False,False)
to skip the pdftextsplitter execution and only redo the KPI calculation.

Release history

- 1.1.0: Dec 19, 2023 (this version)
- 1.0.1: Dec 14, 2023
- 1.0.0: Dec 14, 2023
- 0.0.2: Nov 9, 2023
- 0.0.1: Oct 26, 2023

Download files

Source Distribution

deltatextsplitter-1.1.0.tar.gz (28.9 MB)

Uploaded Dec 19, 2023 Source

Built Distribution

deltatextsplitter-1.1.0-py3-none-any.whl (28.9 MB)

Uploaded Dec 19, 2023 Python 3

Hashes for deltatextsplitter-1.1.0.tar.gz
Algorithm	Hash digest
SHA256	`472387d79205c5317075ed2b176300224486d2daf4dd82fb9e8b4d66aa6c3d3c`
MD5	`a28fda43ffca5ff41ceaa1bbe5857908`
BLAKE2b-256	`251285bb662a52095f8376e17e3bc40d7f24c723311f72ebfe39391130585694`

Hashes for deltatextsplitter-1.1.0-py3-none-any.whl
Algorithm	Hash digest
SHA256	`68bc441ebe22851984438b9b5a871c3f939a48aa9d1a3e0ef43c855d40154f59`
MD5	`984498abe14f8b7b5f4dd1e946145d9c`
BLAKE2b-256	`a6077d06616303a317d7d65624b61a6f1c8371b5ea020de1ad54f71bf48614a0`
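
A downloaded file can be checked against the SHA256 digests listed above with the Python standard library. This is a minimal sketch: the helper names are hypothetical, and the file path is whatever location the archive was saved to.

```python
import hashlib


def sha256_of_file(path, chunk_size=1 << 20):
    """Return the hex SHA256 digest of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as fh:
        # Read fixed-size chunks so large archives are not loaded into memory at once.
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_published_hash(path, expected_hex):
    """True if the file's SHA256 digest equals the published hex digest."""
    return sha256_of_file(path) == expected_hex.strip().lower()
```

For example, matches_published_hash("deltatextsplitter-1.1.0.tar.gz", "<SHA256 digest from the table>") should return True for an intact download, assuming the file sits in the current working directory.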