Refinedoc

Refinedoc is a Python library, developed by the Learning Planet Institute, for refining text after it has been extracted, typically from PDF files.

Why use Refinedoc?

This library post-processes unstructured text content after extraction, the best-known source being PDF files. Its main goal is to separate the body text from its headers and footers robustly and reliably.

The library is written in pure Python and has no dependencies outside the standard library.

Quickstart

Requirements

  • Python >= 3.10

Installation

Install with pip:

pip install refinedoc

Example (vanilla)

from refinedoc.refined_document import RefinedDocument

document = [
    [
        "header 1",
        "subheader 1",
        "lorem ipsum dolor sit amet",
        "consectetur adipiscing elit",
        "footer 1",
    ],
    [
        "header 2",
        "subheader 2",
        "sed do eiusmod tempor incididunt ut labore et dolore magna aliqua",
        "footer 2",
    ],
    [
        "header 3",
        "subheader 3",
        "ut enim ad minim veniam quis nostrud exercitation ullamco laboris nisi ut aliquip ex ea commodo consequat",
        "footer 3",
    ],
    [
        "header 4",
        "subheader 4",
        "duis aute irure dolor in reprehenderit in voluptate velit esse cillum dolore eu fugiat nulla pariatur",
        "footer 4",
    ],
]

rd = RefinedDocument(content=document)
headers = rd.headers
# [["header 1", "subheader 1"], ["header 2", "subheader 2"], ["header 3", "subheader 3"], ["header 4", "subheader 4"]]

footers = rd.footers
# [["footer 1"], ["footer 2"], ["footer 3"], ["footer 4"]]

body = rd.body
# [["lorem ipsum dolor sit amet", "consectetur adipiscing elit"], ["sed do eiusmod tempor incididunt ut labore et dolore magna aliqua"], ["ut enim ad minim veniam quis nostrud exercitation ullamco laboris nisi ut aliquip ex ea commodo consequat"], ["duis aute irure dolor in reprehenderit in voluptate velit esse cillum dolore eu fugiat nulla pariatur"]]
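Once the body is separated, a common next step is to rebuild plain text from it. The sketch below assumes only that the body is a list of pages, each page being a list of line strings, as in the output shown above. The helper name body_to_text is hypothetical and not part of Refinedoc.

```python
# Sample value copied from the `rd.body` output above (first two pages only).
body = [
    ["lorem ipsum dolor sit amet", "consectetur adipiscing elit"],
    ["sed do eiusmod tempor incididunt ut labore et dolore magna aliqua"],
]


def body_to_text(pages: list[list[str]], page_sep: str = "\n\n") -> str:
    """Join the lines of each page with newlines, and pages with `page_sep`."""
    return page_sep.join("\n".join(lines) for lines in pages)


text = body_to_text(body)
print(text)
```

Keeping a blank line between pages is only one possible choice; pass a different page_sep if page boundaries should not appear in the rebuilt text.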

Example (with pypdf)

from refinedoc.refined_document import RefinedDocument
from pypdf import PdfReader

# Build the document from a PDF file: one list of lines per page
reader = PdfReader("path/to/your/pdf/file.pdf")
document = []
for page in reader.pages:
    document.append(page.extract_text().split("\n"))

rd = RefinedDocument(content=document)

# Each attribute is a list with one entry (a list of lines) per page,
# shaped like the output of the vanilla example.
# The actual values depend on the content of the PDF.
headers = rd.headers
footers = rd.footers
body = rd.body
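Extracted page text often contains blank lines and stray whitespace. An optional preprocessing step could look like the sketch below. The helper page_to_lines is hypothetical and belongs to neither Refinedoc nor pypdf, and whether blank lines should be dropped before refinement is an assumption to check against your own documents.

```python
def page_to_lines(page_text: str) -> list[str]:
    """Split one page of extracted text into stripped, non-empty lines."""
    return [line.strip() for line in page_text.splitlines() if line.strip()]


# Stand-in for the string returned by page.extract_text() on one page.
sample_page = "header 1\n\n  lorem ipsum dolor sit amet  \nfooter 1\n"
lines = page_to_lines(sample_page)
print(lines)
```

With such a helper, the loop in the example above would append page_to_lines(page.extract_text()) instead of the raw split.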

How it works

This work is based on the following paper: Lin, Xiaofan (2003). Header and Footer Extraction by Page-Association. Vol. 5010, pp. 164-171. doi:10.1117/12.472833.

It also draws on a Medium article by Hussain Shahbaz Khawaja.
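To give an intuition for the page-association idea, here is a simplified illustration. It is not Refinedoc's actual implementation, nor a faithful reproduction of the algorithm in the paper. It assumes that a header or footer line recurs at the same position on an adjacent page once digits (such as page numbers) are masked. Every function name, the window size, and the similarity threshold are hypothetical choices made for this sketch.

```python
import re
from difflib import SequenceMatcher


def normalize(line: str) -> str:
    """Mask digit runs so that 'Page 3' and 'Page 12' compare as equal."""
    return re.sub(r"\d+", "@", line.strip().lower())


def similarity(a: str, b: str) -> float:
    """Similarity in [0, 1] between two lines after normalization."""
    return SequenceMatcher(None, normalize(a), normalize(b)).ratio()


def count_repeated_lines(pages, from_end=False, window=5, threshold=0.8):
    """For each page, count the leading (or trailing, if from_end) lines
    that closely match the line at the same position on an adjacent page."""
    counts = []
    for p, page in enumerate(pages):
        lines = page[::-1] if from_end else page
        neighbours = [pages[q] for q in (p - 1, p + 1) if 0 <= q < len(pages)]
        n = 0
        for i, line in enumerate(lines[:window]):
            matched = False
            for other in neighbours:
                other_lines = other[::-1] if from_end else other
                if i < len(other_lines) and similarity(line, other_lines[i]) >= threshold:
                    matched = True
                    break
            if not matched:
                break  # stop at the first line with no match on a neighbour
            n += 1
        counts.append(n)
    return counts


def split_pages(pages, **kwargs):
    """Split every page into (headers, bodies, footers) lists of lines."""
    top = count_repeated_lines(pages, from_end=False, **kwargs)
    bottom = count_repeated_lines(pages, from_end=True, **kwargs)
    headers, bodies, footers = [], [], []
    for page, h, f in zip(pages, top, bottom):
        f = min(f, len(page) - h)  # never let header and footer overlap
        headers.append(page[:h])
        bodies.append(page[h:len(page) - f])
        footers.append(page[len(page) - f:])
    return headers, bodies, footers


pages = [
    ["Annual Report 2024", "intro text about the year", "Page 1"],
    ["Annual Report 2024", "a different paragraph entirely", "more words here", "Page 2"],
    ["Annual Report 2024", "closing remarks and thanks", "Page 3"],
]
headers, bodies, footers = split_pages(pages)
print(headers)
print(bodies)
print(footers)
```

Comparing only adjacent pages keeps the sketch short. A single-page document has no neighbours, so nothing is classified as a header or footer in that case.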

License

This project is licensed under the Apache 2.0 License; see the LICENSE file for details.
