A lightweight library that extracts a PDF's table of contents and tags each entry with the pages that contain its content.

Development Status
- 3 - Alpha
Environment
Intended Audience
- Developers
License
- OSI Approved :: GNU General Public License v3 (GPLv3)
Programming Language
- Python :: 3

Project description

pdf-page-annotator

To understand the structure of a PDF and retrieve information from it effectively, you need to know what the document contains and exactly which page holds what.

When you need to extract a specific piece of information from a PDF, it can be in one of two places:

- In a section of a semi-structured document (one with a clear structure and a table of contents, or TOC).
- In an unknown section, or fragmented across an unstructured document.

In the more extreme case of an unstructured document, the whole document has to be analyzed every time we want to find some information exhaustively, because naive vector retrieval cannot do that.

Semi-structured documents are the easier case: by convention, nearly every important PDF document worth indexing has a TOC. We can perform an initial TOC sweep and extract the relevant page numbers for each TOC item. Then, when we have to search for something exhaustively, we search only the TOC to find the relevant pages instead of searching the whole document, and extract information from only those pages, saving time and tokens.

Installation

pip install pdf-page-annotator

Usage

Import and initialize the PDFAnnotator class

from pdf_page_annotator import PDFAnnotator
annotator = PDFAnnotator(pdf_path="path_to_your_pdf_file", verbose=True) # `verbose=True` logs progress on the console, default is `False`

Extract the contents

annotator.run_extraction_pipeline()

Access the content list

print(annotator.content[0].unique_title, annotator.content[0].start_page, annotator.content[0].end_page)
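As a possible next step, the content list could be turned into a page-to-section index. The sketch below does not call the library; it uses stand-in objects with made-up values that carry the same three attributes as above. It assumes that `start_page` and `end_page` are integers and that the range is inclusive, which the project description does not state. `build_page_index` is a hypothetical helper name.

```python
from collections import defaultdict
from types import SimpleNamespace


def build_page_index(entries):
    """Map each page number to the titles of the sections covering it.

    Assumption: start_page and end_page are integers and the range is inclusive.
    """
    index = defaultdict(list)
    for entry in entries:
        for page in range(entry.start_page, entry.end_page + 1):
            index[page].append(entry.unique_title)
    return dict(index)


# Stand-in objects with made-up values; with the real library,
# annotator.content would take their place.
content = [
    SimpleNamespace(unique_title="Overview", start_page=1, end_page=2),
    SimpleNamespace(unique_title="Methods", start_page=2, end_page=4),
]

page_index = build_page_index(content)
print(page_index[2])
```

A page shared by two sections, such as page 2 in the sample data, lists both titles.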

Enjoy!

Release history

- 0.3.0: Mar 23, 2024
- 0.2.0: Mar 23, 2024
- 0.1.0: Mar 21, 2024 (this version)

Download files

Source Distribution

pdf_page_annotator-0.1.0.tar.gz (16.3 kB)

Uploaded Mar 21, 2024 Source

Built Distribution

pdf_page_annotator-0.1.0-py3-none-any.whl (17.1 kB)

Uploaded Mar 21, 2024 Python 3

File details

Details for the file pdf_page_annotator-0.1.0.tar.gz.

File metadata

Download URL: pdf_page_annotator-0.1.0.tar.gz
Upload date: Mar 21, 2024
Size: 16.3 kB
Tags: Source
Uploaded using Trusted Publishing? No
Uploaded via: twine/5.0.0 CPython/3.11.5

File hashes

Hashes for pdf_page_annotator-0.1.0.tar.gz
Algorithm	Hash digest
SHA256	`bbe66a5bfe8a95dbe9b63f8797e7c40f6c79f623f45331da390e723512895a94`
MD5	`2f8e5063efd0f405cca00c9c3684d94f`
BLAKE2b-256	`1920c07027d3d6b279343662f243ad7d2c4cb45e42fc7c6aee1d1396cc1762fa`
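A downloaded file can be checked against the SHA256 digest listed above with Python's standard `hashlib` module. This is a minimal sketch; `sha256_of_file` and `matches_published_hash` are illustrative helper names, not part of pdf-page-annotator.

```python
import hashlib


def sha256_of_file(path, chunk_size=65536):
    """Compute the SHA256 hex digest of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_published_hash(path, expected_hex):
    """Return True if the file's SHA256 digest equals the published one."""
    return sha256_of_file(path) == expected_hex.strip().lower()
```

For example, after downloading the source distribution into the current directory, `matches_published_hash("pdf_page_annotator-0.1.0.tar.gz", "bbe66a5bfe8a95dbe9b63f8797e7c40f6c79f623f45331da390e723512895a94")` should return `True` if the file is intact.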

File details

Details for the file pdf_page_annotator-0.1.0-py3-none-any.whl.

File metadata

Download URL: pdf_page_annotator-0.1.0-py3-none-any.whl
Upload date: Mar 21, 2024
Size: 17.1 kB
Tags: Python 3
Uploaded using Trusted Publishing? No
Uploaded via: twine/5.0.0 CPython/3.11.5

File hashes

Hashes for pdf_page_annotator-0.1.0-py3-none-any.whl
Algorithm	Hash digest
SHA256	`0aaea39bed638cdd42fe195b7efe0bc85971bfc39f4b42e3d301cdbfa31f3f9a`
MD5	`5231bdcb5b3b02e522072cc758a3d7e9`
BLAKE2b-256	`b187f233e6d394fad772c595397f62bcb35fd499a5f1c18ca305eb35cf2e7571`
