Tools for reading and processing PDF content

## Tools for processing PDF files

This is a lightweight library for processing PDF files in Python. One use case is extracting PDF annotations for ML / NLP.

Support for

  • obtaining textual and visual content of PDF files

  • locating positions of words

  • fetching PDF annotations

  • adding a digital layer to image PDFs

  • re-creating a clean PDF file with annotations removed
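
As a rough illustration of the annotation-fetching use case, here is a minimal sketch that uses PyPDF2 directly. It is not this library's own API: the function names `collect_annotations` and `annotations_from_file` are hypothetical, and the file-based helper assumes PyPDF2 2.0 or newer, where `PdfReader` is available. The keys `/Annots`, `/Subtype`, `/Contents`, and `/Rect` are the standard PDF annotation entries.

```python
def collect_annotations(pages):
    """Collect basic annotation data from an iterable of page objects.

    Each page is expected to behave like a PyPDF2 page, i.e. a dict-like
    object that may carry an "/Annots" array. (Hypothetical helper, not
    part of pdf-utils.)
    """
    found = []
    for page_number, page in enumerate(pages):
        if "/Annots" not in page:
            continue  # this page has no annotations
        for ref in page["/Annots"]:
            # PyPDF2 stores annotations as indirect references; resolve them.
            annot = ref.get_object() if hasattr(ref, "get_object") else ref
            found.append(
                {
                    "page": page_number,
                    "subtype": str(annot["/Subtype"]),
                    "contents": str(annot["/Contents"]) if "/Contents" in annot else None,
                    "rect": [float(v) for v in annot["/Rect"]],
                }
            )
    return found


def annotations_from_file(path):
    """Read a PDF from disk and return its annotations (assumes PyPDF2 >= 2.0)."""
    from PyPDF2 import PdfReader  # imported lazily so the helper above has no hard dependency

    return collect_annotations(PdfReader(path).pages)
```

Calling `annotations_from_file("some.pdf")` would return one dictionary per annotation, with the zero-based page index, the annotation subtype (for example `/Highlight`), the comment text if any, and the bounding rectangle in PDF points.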

## Dependencies

The main tool for reading PDF files is the PyPDF2 library. Non-Python dependencies include Poppler.

To install Poppler, see the guide in the [pdf2image readme](https://pypi.org/project/pdf2image/).
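
To confirm that Poppler is reachable after installation, a small check like the following may help. This is a sketch under the assumption that the command-line tools `pdftoppm` and `pdftocairo` (the Poppler binaries that pdf2image calls) should be on the `PATH`; the function name `missing_poppler_tools` is hypothetical.

```python
import shutil

# Poppler command-line tools that pdf2image relies on.
POPPLER_TOOLS = ("pdftoppm", "pdftocairo")


def missing_poppler_tools(tools=POPPLER_TOOLS):
    """Return the names of the given tools that cannot be found on PATH."""
    return [tool for tool in tools if shutil.which(tool) is None]


if __name__ == "__main__":
    missing = missing_poppler_tools()
    if missing:
        print("Poppler tools not found on PATH:", ", ".join(missing))
    else:
        print("Poppler looks available.")
```

An empty list means every listed tool was found. On Windows, Poppler is often unpacked to a custom folder, so that folder's `bin` directory needs to be added to `PATH` for this check to pass.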

## How to

Some examples of usage are shown in the [notebook](./notebook/Demo.ipynb).

## Todo
