PDF Annotation Utils

Development Status
- 1 - Planning
License
- OSI Approved
Natural Language
- French
Operating System
- OS Independent
Programming Language
- Python
- Python :: 3.6
Topic
- Communications

Project description

pdfannot

This package aims to create a two-way link between an annotated PDF and an Excel-style data frame (a pandas DataFrame).

It allows you to:

- create a DataFrame containing each annotated string of the PDF in a column 'annot_text', along with its annotation in a column 'label' and information such as coordinates, page, etc.
- annotate a PDF given a DataFrame of the form described above.
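
As a rough illustration of the table layout described above, the sketch below uses plain dictionaries, one per annotation, so it runs without pandas or pdfannot installed. Only 'annot_text' and 'label' are named in the description; the 'page', 'x', 'y', 'w', 'h' column names, the sample values, and the helper `texts_by_label` are assumptions made for this example, not part of the package.

```python
# Hypothetical rows in the layout described above. Only 'annot_text' and
# 'label' are named in the description; 'page', 'x', 'y', 'w', 'h' are assumed.
rows = [
    {"annot_text": "APPEAL relating to Cancellation Proceedings No 399",
     "label": "label 1", "page": 0, "x": 40, "y": 60, "w": 300, "h": 50},
    {"annot_text": "ication for a declaration of i",
     "label": "label 1", "page": 0, "x": 40, "y": 120, "w": 200, "h": 20},
    {"annot_text": "example annotated string",
     "label": "label 2", "page": 1, "x": 100, "y": 600, "w": 300, "h": 50},
]


def texts_by_label(rows):
    """Group the annotated strings by their label, keeping row order."""
    grouped = {}
    for row in rows:
        grouped.setdefault(row["label"], []).append(row["annot_text"])
    return grouped


for label, texts in texts_by_label(rows).items():
    print(label, texts)
```

Passing the same list of dictionaries to pd.DataFrame(rows) would give one row per annotation with these columns.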

It can be really useful for automatically generating annotated PDF documents with NLP models capable of inferring annotations from raw text in a data frame.

Prerequisites

pandas
fitz (provided by PyMuPDF: pip install pymupdf)

Installing

pip install pdfannot

Examples

Your DataFrame must contain information about what to annotate in the PDF:

import pdfannot
import pandas as pd

# adf stands for "annotation DataFrame"
adf = pd.DataFrame([
    {'x': 40, 'y': 60, 'w': 300, 'h': 50},
    {'text': 'APPEAL relating to Cancellation Proceedings No 399', 'type': 'Highlight'},
    {'text': 'ication for a declaration of i', 'type': 'Highlight', 'label': 'label 1'},
    {'x': 100, 'y': 600, 'w': 300, 'h': 50, 'page': 1, 'label': 'label 2'},
])

# pdfannot.exple_pdf is a test PDF shipped with the pdfannot package; debug=True enables verbose output
pdfannot.annotate_pdf(adf, pdfannot.exple_pdf, '/tmp/test.pdf', debug=True)

Your annotation DataFrame must already have either the columns 'x', 'y', 'w', 'h' (coordinates of the square to draw) or the column 'text' (the text to annotate).

annotate_pdf(DataFrame, orig_pdfpath, dest_pdfpath)

uses your DataFrame to annotate the PDF located at orig_pdfpath and stores the result at dest_pdfpath.

The function also considers optional columns 'label' to label your annotations and 'type' to specify whether you want a 'Square' or a 'Highlight'.

The defaults are label: '' and type: 'Square'.

Finally, specifying each annotation's page in a 'page' column speeds up the algorithm. The 'page' column is optional for one-page PDFs.
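
The column rules above can be checked before calling annotate_pdf. The following is a minimal sketch, assuming the rows are still plain dictionaries (as in the list passed to pd.DataFrame in the example); `normalize_rows`, `COORD_KEYS` and `DEFAULTS` are hypothetical names introduced here, not part of pdfannot, and the package may handle missing values differently.

```python
# Hypothetical pre-check, not part of pdfannot: every row needs either a
# full set of coordinates or a 'text' entry, and the optional columns get
# the defaults stated above (label '' and type 'Square').
COORD_KEYS = ("x", "y", "w", "h")
DEFAULTS = {"label": "", "type": "Square"}


def normalize_rows(rows):
    """Return copies of the rows with defaults filled in, or raise ValueError."""
    normalized = []
    for index, row in enumerate(rows):
        has_box = all(key in row for key in COORD_KEYS)
        has_text = "text" in row
        if not (has_box or has_text):
            raise ValueError(
                f"row {index} needs either {COORD_KEYS} or a 'text' entry"
            )
        # Values given in the row take precedence over the defaults.
        normalized.append({**DEFAULTS, **row})
    return normalized


rows = normalize_rows([
    {"x": 40, "y": 60, "w": 300, "h": 50},
    {"text": "ication for a declaration of i", "type": "Highlight", "label": "label 1"},
])
print(rows)
```

The normalized list can then be handed to pd.DataFrame(rows) exactly as in the example, which avoids missing values in the 'label' and 'type' columns.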

Internals

If your DataFrame has one text column per annotation label (WARNING: each of them must be named annot_{label_name}), you can transform it with:

annot_utils.dlf2adf(DataFrame)

to make it acceptable to annotate_pdf. Then run:

annotate_pdf(DataFrame, orig_pdfpath, dest_pdfpath)

to annotate your PDF (this method supports only highlights).
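
To make the expected input shape concrete, here is a rough sketch of the kind of wide-to-long reshaping such a conversion implies. It is not the implementation of dlf2adf, whose exact behavior is not documented here; `wide_to_annotation_rows` and `PREFIX` are hypothetical names, the rows are plain dictionaries rather than a DataFrame, and skipping empty cells is an assumption.

```python
# Hypothetical illustration only; this is NOT pdfannot's dlf2adf.
PREFIX = "annot_"


def wide_to_annotation_rows(wide_rows):
    """Turn one column per label into one row per (text, label) pair."""
    long_rows = []
    for row in wide_rows:
        for column, text in row.items():
            # Ignore columns that do not follow the annot_{label_name}
            # convention, and cells with no text (assumed behavior).
            if not column.startswith(PREFIX) or not text:
                continue
            long_rows.append({
                "text": text,
                "label": column[len(PREFIX):],
                "type": "Highlight",  # the text above says only highlights are supported
            })
    return long_rows


wide = [{
    "annot_title": "APPEAL relating to Cancellation Proceedings No 399",
    "annot_claim": "ication for a declaration of i",
    "annot_date": None,
}]
for row in wide_to_annotation_rows(wide):
    print(row)
```

Each resulting row has the 'text', 'label' and 'type' entries that the annotation DataFrame described earlier relies on.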

Authors

Arthur Renaud, Antoine Marullaz (Stackadoc)

Recommendations or questions? Contact contact@stackadoc.com

Release history

- 2019.6.5.1 (Jun 5, 2019), this version
- 2019.5.27.1 (May 27, 2019)
- 0.0.13 (May 27, 2019)
- 0.0.12 (May 27, 2019)
- 0.0.11 (May 27, 2019)
- 0.0.10 (May 14, 2019)
- 0.0.9 (May 10, 2019)
- 0.0.8 (May 9, 2019)
- 0.0.7 (May 9, 2019)
- 0.0.6 (Apr 24, 2019)
- 0.0.5 (Apr 23, 2019)
- 0.0.4 (Apr 23, 2019)
- 0.0.3 (Apr 23, 2019)
- 0.0.2 (Apr 23, 2019)
- 0.0.1 (Apr 23, 2019)
- 0.0.0 (Apr 19, 2019)

Download files

Source Distribution

pdfannot-2019.6.5.1.tar.gz (737.4 kB)

Uploaded Jun 5, 2019 Source

Hashes for pdfannot-2019.6.5.1.tar.gz
Algorithm	Hash digest
SHA256	`46e92e35fabd52c82cedb043d361845955daf6eac75c00759ad6ac3e90689fb2`
MD5	`cbf44923bf44024dd99f4168b8acac32`
BLAKE2b-256	`1ab20721f54aeae00a10526a9847cb7693a0a02dda2d55f0da5cc79925616422`