pdf-page-annotator
A lightweight library that extracts a PDF's table of contents (TOC) and tags each TOC item with the pages that contain its content.
To understand the structure of a PDF and retrieve from it effectively, it is important to understand the contents and know exactly which page contains what.
When the need to extract a specific subsection of the PDF comes up, that subsection can be found in one of two places:
- In a section of a semi-structured (one with a structure and TOC) document.
- In an unknown section or in a fragmented form inside an unstructured document.
In the more extreme case of an unstructured document, we have to analyze the whole document exhaustively each time we want to find some information, because naive vector retrieval can't do that.
Semi-structured documents are the easier case: conventionally, all important PDF documents worth indexing have a TOC. We can perform an initial TOC sweep and extract the relevant page numbers for each TOC item. Then, when we have to search for something exhaustively, we only search the TOC to find the relevant pages instead of searching the whole document, and extract information from only those pages, saving time and tokens.
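The TOC-first lookup described above can be sketched roughly as follows. This is only an illustration of the idea, not the library's implementation: `TocEntry` and `pages_for_query` are hypothetical names, the matching is a plain case-insensitive substring test, and `end_page` is assumed to be inclusive.

```python
from dataclasses import dataclass


@dataclass
class TocEntry:
    """Hypothetical stand-in for one TOC item and its page span."""
    title: str
    start_page: int
    end_page: int  # assumed inclusive for this sketch


def pages_for_query(toc, query):
    """Return the sorted pages of every TOC entry whose title mentions the query."""
    needle = query.lower()
    pages = set()
    for entry in toc:
        if needle in entry.title.lower():
            pages.update(range(entry.start_page, entry.end_page + 1))
    return sorted(pages)


# Made-up TOC used only to exercise the function.
toc = [
    TocEntry("1 Introduction", 1, 3),
    TocEntry("2 Risk Factors", 4, 9),
    TocEntry("2.1 Market Risk", 5, 6),
    TocEntry("3 Financial Statements", 10, 20),
]

risk_pages = pages_for_query(toc, "risk")
print(risk_pages)  # [4, 5, 6, 7, 8, 9]
```

Only the six matching pages would then need to be read in full, rather than all twenty.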
Installation
pip install pdf-page-annotator
Usage
- Import and initialize the PDFAnnotator class
from pdf_page_annotator import PDFAnnotator
annotator = PDFAnnotator(pdf_path="path_to_your_pdf_file", verbose=True) # `verbose=True` logs progress on the console, default is `False`
- Extract the contents
annotator.run_extraction_pipeline()
- Access the content list
print(annotator.content[0].unique_title, annotator.content[0].start_page, annotator.content[0].end_page)
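To list every entry rather than only the first, a small helper along these lines might be enough. It relies only on the three attributes shown above (`unique_title`, `start_page`, `end_page`); `format_outline` is a hypothetical name, and the sample entries are stand-ins so the sketch runs without a PDF. Whether page numbers are 0- or 1-based and whether `end_page` is inclusive is not stated here, so the output just echoes the stored values.

```python
from types import SimpleNamespace


def format_outline(content):
    """Build one line per entry: '<title>: pages <start>-<end>'."""
    lines = []
    for item in content:
        lines.append(f"{item.unique_title}: pages {item.start_page}-{item.end_page}")
    return "\n".join(lines)


# Stand-in entries; with the library, pass `annotator.content`
# after calling `run_extraction_pipeline()` instead.
sample_content = [
    SimpleNamespace(unique_title="Introduction", start_page=1, end_page=3),
    SimpleNamespace(unique_title="Methods", start_page=4, end_page=9),
]

outline = format_outline(sample_content)
print(outline)
```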
Enjoy!
File details
Details for the file pdf_page_annotator-0.1.0.tar.gz.
File metadata
- Download URL: pdf_page_annotator-0.1.0.tar.gz
- Upload date:
- Size: 16.3 kB
- Tags: Source
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/5.0.0 CPython/3.11.5
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | bbe66a5bfe8a95dbe9b63f8797e7c40f6c79f623f45331da390e723512895a94 |
| MD5 | 2f8e5063efd0f405cca00c9c3684d94f |
| BLAKE2b-256 | 1920c07027d3d6b279343662f243ad7d2c4cb45e42fc7c6aee1d1396cc1762fa |
File details
Details for the file pdf_page_annotator-0.1.0-py3-none-any.whl.
File metadata
- Download URL: pdf_page_annotator-0.1.0-py3-none-any.whl
- Upload date:
- Size: 17.1 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/5.0.0 CPython/3.11.5
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | 0aaea39bed638cdd42fe195b7efe0bc85971bfc39f4b42e3d301cdbfa31f3f9a |
| MD5 | 5231bdcb5b3b02e522072cc758a3d7e9 |
| BLAKE2b-256 | b187f233e6d394fad772c595397f62bcb35fd499a5f1c18ca305eb35cf2e7571 |