Tool to extract and pretty-print PDF annotations for reviewing

pdfannots

This program extracts annotations (highlights, comments, etc.) from a PDF file, and formats them in a variety of ways. It is primarily intended for use in reviewing submissions to scientific conferences/journals.

For the default markdown format, the output is as follows:

  • Highlights without an attached comment are output first, as "highlights" with just the highlighted text included. Note that these are not typically suitable for use in a review, since they're unlikely to have any meaning to the recipient; they are just meant to serve as a reminder to the reviewer.

  • Highlights with an attached comment, and text annotations (those not attached to any particular text or highlight), are output next, as "detailed comments". Typically, most comments on a reviewed paper take this form.

  • Underline, strikeout, and squiggly underline annotations are output last, as "Nits", with or without an attached comment. The intention of this is to easily separate formatting or grammatical corrections from more substantial comments about the content of the document.

For each annotation, the page number is given, along with the associated (highlighted/underlined) text, if any. Additionally, if the document embeds outlines (aka bookmarks), such as those generated by the LaTeX hyperref package, they are printed to help identify to which section in the document the annotation refers.
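
The grouping described above can be sketched in a few lines of Python. This is only an illustration of the rules as stated here, not the code pdfannots uses: the `Annot` class, the `group_annotations` function, and the group names are hypothetical, and the subtype strings are assumed to follow the standard PDF annotation subtype names.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional

# Assumed PDF annotation subtype names for the "nits" group.
NIT_SUBTYPES = {"Underline", "StrikeOut", "Squiggly"}


@dataclass
class Annot:
    """Hypothetical stand-in for an extracted annotation."""
    subtype: str                   # e.g. "Highlight", "Text", "StrikeOut"
    page: int                      # page number reported with the annotation
    text: Optional[str] = None     # highlighted/underlined text, if any
    comment: Optional[str] = None  # attached comment, if any


def group_annotations(annots: List[Annot]) -> Dict[str, List[Annot]]:
    """Sort annotations into the three output groups, in output order."""
    groups: Dict[str, List[Annot]] = {
        "highlights": [],
        "detailed comments": [],
        "nits": [],
    }
    for a in annots:
        if a.subtype in NIT_SUBTYPES:
            # Nits are grouped together with or without a comment.
            groups["nits"].append(a)
        elif a.subtype == "Highlight" and not a.comment:
            # A bare highlight is only a reminder to the reviewer.
            groups["highlights"].append(a)
        elif a.subtype in ("Highlight", "Text"):
            # Commented highlights and free-standing text annotations.
            groups["detailed comments"].append(a)
    return groups


annots = [
    Annot("StrikeOut", page=1, text="teh"),
    Annot("Highlight", page=2, text="our key result"),
    Annot("Highlight", page=2, text="Theorem 1", comment="Proof is missing a case."),
    Annot("Text", page=3, comment="Related work section is thin."),
]
for name, items in group_annotations(annots).items():
    print(name, [a.page for a in items])
```

Any other annotation subtype is simply ignored in this sketch; how the real tool treats such annotations is not covered here.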

Usage

See pdfannots --help (in a source tree: pdfannots.py --help) for options and invocation.

Dependencies

Known issues and limitations

  • Although it is generally reliable, pdfminer (the underlying PDF parser) is less accurate than some other tools (e.g., Poppler's pdftotext) at extracting text from a PDF. It has been known to fail in several different ways:

    • Sometimes it misses or misplaces individual characters, resulting in annotations with some or all of the text missing (in the latter case, you'll see a warning).

    • Sometimes the characters are captured, but not spaces between the words. Tweaking the advanced layout analysis parameters (e.g., --word-margin) may help with this.

    • Sometimes it extracts all the text but renders it out of order, for example, reporting that text at the top of a second column comes before text at the end of the first column. This causes pdfannots to return the annotations out of order, or to report the wrong outlines (section headings) for annotations. You can mostly work around this issue by using the --cols parameter to force a fixed page layout for the document (e.g. --cols=2 for a typical 2-column document).

  • If an annotation (such as a StrikeOut) covers solely whitespace, no text is extracted for the annotation, and it will be skipped (with a warning). This is an artifact of the way pdfminer reports whitespace with only an implicit position defined by surrounding characters.

  • When extracting text, we remove all hyphens that immediately precede a line break and join the adjacent words. This usually produces the best results with LaTeX multi-column documents (e.g. "soft-\nware" becomes "software"), but sometimes the hyphen needs to stay (e.g. "memory-\nmapped", which will be extracted as "memorymapped"), and we can't tell the difference. To disable this behaviour, pass --keep-hyphens.
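
Two of the behaviours above are easier to see in code. The following Python sketch is an illustration under stated assumptions, not the pdfannots implementation: `dehyphenate` and `reading_order_key` are hypothetical names, the `keep_hyphens` parameter is modeled here as leaving the text untouched, and the fixed layout is assumed to split the page into equal-width columns, with PDF coordinates measured from the bottom-left corner of the page.

```python
import re
from typing import Tuple


def dehyphenate(text: str, keep_hyphens: bool = False) -> str:
    """Remove a hyphen that immediately precedes a line break and join the words."""
    if keep_hyphens:
        return text
    return re.sub(r"-\n", "", text)


def reading_order_key(x: float, y: float, page_width: float, cols: int) -> Tuple[int, float]:
    """Sort key that orders positions column by column, top to bottom.

    Assumes `cols` equal-width columns. y grows upwards in PDF
    coordinates, so it is negated to put the top of a column first.
    """
    column = min(int(x * cols / page_width), cols - 1)
    return (column, -y)


# The hyphen is dropped in both cases; the code cannot tell them apart.
print(dehyphenate("soft-\nware"))       # software
print(dehyphenate("memory-\nmapped"))   # memorymapped

# Two positions on a 612pt-wide page: the end of the first column
# and the top of the second column.
end_of_col1 = (100.0, 50.0)
top_of_col2 = (400.0, 700.0)
points = [top_of_col2, end_of_col1]

two_col = sorted(points, key=lambda p: reading_order_key(p[0], p[1], 612.0, cols=2))
one_col = sorted(points, key=lambda p: reading_order_key(p[0], p[1], 612.0, cols=1))
print(two_col[0] == end_of_col1)  # True: the first column is read in full first
print(one_col[0] == top_of_col2)  # True: a single-column ordering goes purely top to bottom
```

The last two lines show why forcing a two-column layout helps: ordering purely by vertical position places the top of the second column ahead of the end of the first.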

FAQ

  1. I'd like to change how the output is formatted.

    Some minor tweaks (e.g., word wrap or skipping sections) can be made via command-line arguments.

    All of the output comes from the relevant Printer subclass; more elaborate changes can be accomplished there. Pull requests to introduce new output formats or variants as printers are welcomed.

  2. I think I got a review generated by this tool...

    I hope that it was a constructive review, and that the annotations helped the reviewer give you more detailed feedback so you can improve your paper. This is, after all, just a tool, and it should not be an excuse for reviewer sloppiness. Note that I am not the only user of this script.
