Python library for post-extraction refinement of text that may be derived from PDF extraction.

Refinedoc

A Python library, developed by the Learning Planet Institute, for refining text after it has been extracted from a document such as a PDF.

Why use Refinedoc?

This library post-processes unstructured text after extraction, the best-known source being PDF files. Its main goal is to separate the text body from page headers and footers robustly and reliably.

The library is written in pure Python and has no dependencies beyond the standard library.

Quickstart

Requirements

  • Python >= 3.10

Installation

Install with pip:

pip install refinedoc

Example

from refinedoc.refined_document import RefinedDocument

# Each inner list is one page; each string is one line of that page.
document = [
    [
        "header 1",
        "subheader 1",
        "lorem ipsum dolor sit amet",
        "consectetur adipiscing elit",
        "footer 1",
    ],
    [
        "header 2",
        "subheader 2",
        "sed do eiusmod tempor incididunt ut labore et dolore magna aliqua",
        "footer 2",
    ],
    [
        "header 3",
        "subheader 3",
        "ut enim ad minim veniam quis nostrud exercitation ullamco laboris nisi ut aliquip ex ea commodo consequat",
        "footer 3",
    ],
    [
        "header 4",
        "subheader 4",
        "duis aute irure dolor in reprehenderit in voluptate velit esse cillum dolore eu fugiat nulla pariatur",
        "footer 4",
    ],
]

rd = RefinedDocument(content=document)
headers = rd.headers
# [["header 1", "subheader 1"], ["header 2", "subheader 2"], ["header 3", "subheader 3"], ["header 4", "subheader 4"]]

footers = rd.footers
# [["footer 1"], ["footer 2"], ["footer 3"], ["footer 4"]]

body = rd.body
# [["lorem ipsum dolor sit amet", "consectetur adipiscing elit"], ["sed do eiusmod tempor incididunt ut labore et dolore magna aliqua"], ["ut enim ad minim veniam quis nostrud exercitation ullamco laboris nisi ut aliquip ex ea commodo consequat"], ["duis aute irure dolor in reprehenderit in voluptate velit esse cillum dolore eu fugiat nulla pariatur"]]
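Once the body is separated, a common next step is to turn it back into plain text. The sketch below assumes the body has the list-of-pages, list-of-lines shape shown in the comment above. The helper name `pages_to_text` and its `page_separator` parameter are hypothetical and not part of Refinedoc; the example uses a literal list so it runs without the library installed.

```python
def pages_to_text(body, page_separator="\n\n"):
    """Join each page's lines with newlines, then join pages with a separator."""
    return page_separator.join("\n".join(lines) for lines in body)


# Literal stand-in with the same shape as the body shown above;
# in practice this value would come from rd.body.
body = [
    ["lorem ipsum dolor sit amet", "consectetur adipiscing elit"],
    ["sed do eiusmod tempor incididunt ut labore et dolore magna aliqua"],
]

text = pages_to_text(body)
print(text)
```

Keeping a blank line between pages preserves page boundaries for later processing; pass a different `page_separator` if that is not wanted.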

How it works

This work is based on the following paper: Lin, Xiaofan (2003). Header and Footer Extraction by Page-Association. Proc. SPIE 5010, 164-171. DOI: 10.1117/12.472833.

It also draws on a Medium article by Hussain Shahbaz Khawaja.
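To give a feel for the page-association idea, here is a minimal sketch. It is not Refinedoc's implementation, and it simplifies the approach in the paper. It assumes that header and footer lines recur at the same position on neighboring pages once digits (such as page numbers) are masked. All function names, the `max_lines` window and the `threshold` value are hypothetical choices for illustration, and only the standard library is used.

```python
import re
from difflib import SequenceMatcher


def normalize(line):
    """Mask digit runs so that 'Page 1' and 'Page 2' compare as equal."""
    return re.sub(r"\d+", "@", line.strip().lower())


def similarity(a, b):
    """Similarity in [0, 1] between two lines after digit masking."""
    return SequenceMatcher(None, normalize(a), normalize(b)).ratio()


def count_repeated_lines(pages, from_top=True, max_lines=3, threshold=0.8):
    """For each page, count leading (or trailing) lines that recur on a neighbor page."""
    counts = []
    for i, page in enumerate(pages):
        neighbors = [pages[j] for j in (i - 1, i + 1) if 0 <= j < len(pages)]
        n = 0
        for k in range(min(max_lines, len(page))):
            idx = k if from_top else -1 - k
            matched = any(
                len(other) > k and similarity(page[idx], other[idx]) >= threshold
                for other in neighbors
            )
            if not matched:
                break  # stop at the first line that does not recur
            n += 1
        counts.append(n)
    return counts


def split_pages(pages, max_lines=3, threshold=0.8):
    """Split every page into (header, body, footer) line lists."""
    heads = count_repeated_lines(pages, True, max_lines, threshold)
    feet = count_repeated_lines(pages, False, max_lines, threshold)
    headers, bodies, footers = [], [], []
    for page, h, f in zip(pages, heads, feet):
        f = min(f, len(page) - h)  # header and footer must not overlap
        headers.append(page[:h])
        bodies.append(page[h:len(page) - f])
        footers.append(page[len(page) - f:])
    return headers, bodies, footers


pages = [
    ["Annual report", "Page 1", "alpha beta gamma", "Confidential"],
    ["Annual report", "Page 2", "a completely different line of body text", "Confidential"],
]
headers, bodies, footers = split_pages(pages)
```

Comparing only against immediate neighbors keeps the sketch short. A more careful version would look at a wider window of pages and weight lines by their distance from the page edge.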

License

This project is licensed under the Apache 2.0 License; see the LICENSE file for details.
