pdf-page-annotator

A lightweight library that extracts a PDF's table of contents (TOC) and tags each entry with the pages that contain its content.

To understand the structure of a PDF and retrieve from it effectively, it is important to understand its contents and know exactly which page contains what.

When the need to extract a specific subsection of a PDF comes up, that subsection can be found in one of two places:

  1. In a section of a semi-structured (one with a structure and TOC) document.
  2. In an unknown section or in a fragmented form inside an unstructured document.

For the more extreme case of an unstructured document, we have to analyze the whole document exhaustively each time we want to find some information (because naive vector retrieval can't do that).

For semi-structured documents (conventionally, all important PDF documents worth indexing have a TOC), we can perform an initial TOC sweep and extract the relevant page numbers for each TOC item. Then, when we have to search for something exhaustively, instead of searching through the whole document, we search only the TOC to find the relevant pages and extract information from only those pages, saving time and tokens.
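The TOC-first lookup described above can be sketched roughly as follows. This is only an illustration, not the library's implementation: `TocEntry` and `pages_for_query` are hypothetical names, the fields mirror the attributes shown in the Usage section, the title match is a plain keyword check, and `end_page` is assumed to be inclusive.

```python
from dataclasses import dataclass


@dataclass
class TocEntry:
    """Hypothetical stand-in for one TOC item (not the library's own class)."""
    unique_title: str
    start_page: int
    end_page: int  # assumed inclusive


def pages_for_query(entries, query):
    """Return sorted page numbers of every entry whose title contains a query word."""
    words = [w.lower() for w in query.split()]
    pages = set()
    for entry in entries:
        title = entry.unique_title.lower()
        if any(word in title for word in words):
            # Collect the whole page range covered by the matching entry.
            pages.update(range(entry.start_page, entry.end_page + 1))
    return sorted(pages)


# Made-up TOC used only to demonstrate the lookup.
toc = [
    TocEntry("1 Introduction", 1, 3),
    TocEntry("2 Risk Factors", 4, 9),
    TocEntry("3 Financial Statements", 10, 25),
]

relevant = pages_for_query(toc, "risk")
print(relevant)  # [4, 5, 6, 7, 8, 9]
```

Only the six pages under the matching entry would then need to be read, rather than all 25 pages of this made-up document.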

Installation

pip install pdf-page-annotator

Usage

  1. Import and initialize the PDFAnnotator class
from pdf_page_annotator import PDFAnnotator
annotator = PDFAnnotator(pdf_path="path_to_your_pdf_file", verbose=True) # `verbose=True` logs progress on the console, default is `False`
  2. Extract the contents
annotator.run_extraction_pipeline()
  3. Access the content list
print(annotator.content[0].unique_title, annotator.content[0].start_page, annotator.content[0].end_page)
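To go beyond the first entry, the whole content list can be turned into a title-to-page-range lookup. The sketch below assumes only that each item exposes `unique_title`, `start_page`, and `end_page` as in step 3. `summarize_content` is a hypothetical helper, not part of the library, and the sample data is a stand-in so the snippet runs without a PDF.

```python
from types import SimpleNamespace


def summarize_content(content):
    """Map each unique title to its (start_page, end_page) pair."""
    return {item.unique_title: (item.start_page, item.end_page) for item in content}


# Stand-in items; with the library you would presumably pass `annotator.content`
# after `run_extraction_pipeline()` has finished.
sample_content = [
    SimpleNamespace(unique_title="Introduction", start_page=1, end_page=3),
    SimpleNamespace(unique_title="Methods", start_page=4, end_page=9),
]

page_ranges = summarize_content(sample_content)
for title, (start, end) in page_ranges.items():
    print(f"{title}: pages {start}-{end}")
```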

Enjoy!
