Tools for reading and processing PDF content
## Tools for processing PDF files
This is a lightweight library for processing PDF files in Python. One possible use case is extracting PDF annotations for ML / NLP.
Support for:
- obtaining the textual and visual content of PDF files
- locating the positions of words
- fetching PDF annotations
- adding a digital (text) layer to image-only PDFs
- re-creating a clean PDF file with annotations removed
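
For the annotation-extraction use case, word positions usually have to be matched against annotation rectangles. The sketch below is not this library's API: `Box`, `words_under_annotation`, the `min_overlap` threshold, and the `(x0, y0, x1, y1)` coordinate convention are all assumptions made for illustration. It selects the words whose bounding box lies mostly inside an annotation rectangle.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass(frozen=True)
class Box:
    """Axis-aligned rectangle (hypothetical helper, not part of the library)."""
    x0: float
    y0: float
    x1: float
    y1: float

    def area(self) -> float:
        # Degenerate or inverted rectangles count as empty.
        return max(0.0, self.x1 - self.x0) * max(0.0, self.y1 - self.y0)

    def intersection(self, other: "Box") -> "Box":
        # May produce an inverted box when there is no overlap; area() is then 0.
        return Box(max(self.x0, other.x0), max(self.y0, other.y0),
                   min(self.x1, other.x1), min(self.y1, other.y1))


def words_under_annotation(words: List[Tuple[str, Box]],
                           annotation: Box,
                           min_overlap: float = 0.5) -> List[str]:
    """Return words whose box is covered by `annotation` for at least `min_overlap`."""
    selected = []
    for text, box in words:
        if box.area() == 0:
            continue  # skip empty boxes to avoid division by zero
        covered = box.intersection(annotation).area() / box.area()
        if covered >= min_overlap:
            selected.append(text)
    return selected


# Made-up word boxes and one made-up highlight rectangle.
words = [("Hello", Box(10, 10, 40, 20)),
         ("world", Box(45, 10, 80, 20)),
         ("again", Box(85, 10, 120, 20))]
highlight = Box(8, 8, 70, 22)
print(words_under_annotation(words, highlight))  # ['Hello', 'world']
```

A coverage ratio rather than strict containment is used here because highlight rectangles drawn by hand rarely align exactly with word boundaries.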
## Dependencies
The main tool for reading PDF files is the PyPDF2 library. Non-Python dependencies are:
- [Poppler](https://poppler.freedesktop.org/),
- [Tesseract](https://tesseract-ocr.github.io/tessdoc/Home.html), and
- [OpenCV](https://opencv.org/).
To install Poppler, see the guide in the [pdf2image readme](https://pypi.org/project/pdf2image/).
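
Before running anything, it can help to check that the external tools are visible from Python. The following is a minimal sketch using only the standard library. It assumes that Poppler is detectable through its `pdftoppm` executable, that Tesseract is on the `PATH` as `tesseract`, and that OpenCV is used through its Python bindings (`cv2`). The function name `check_dependencies` is hypothetical and not part of this library.

```python
import importlib.util
import shutil
from typing import Dict


def check_dependencies() -> Dict[str, bool]:
    """Report which dependencies appear to be available (hypothetical helper)."""
    return {
        # Command-line tools are looked up on the PATH.
        "poppler (pdftoppm)": shutil.which("pdftoppm") is not None,
        "tesseract": shutil.which("tesseract") is not None,
        # Python packages are looked up without importing them.
        "opencv (cv2)": importlib.util.find_spec("cv2") is not None,
        "PyPDF2": importlib.util.find_spec("PyPDF2") is not None,
    }


status = check_dependencies()
for name, found in status.items():
    print(f"{name}: {'found' if found else 'MISSING'}")
```

A missing entry only means the tool was not found in the current environment; the package may still locate it in another way.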
## How to
Some examples of usage are shown in the [notebook](./notebook/Demo.ipynb).
## Todo
- Add detection of page orientation (upside-down, rotated,…) based on images.
- Add some of our experiments with “naive” table detection.
- Get rid of PyPDF2, as [it is not maintained](https://stackoverflow.com/questions/63199763/maintained-alternatives-to-pypdf2); replace it with PyMuPDF or pdfminer.six.
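
One possible route to the page-orientation item is Tesseract's orientation and script detection (OSD). The sketch below is only an illustration of that idea, not a planned implementation. It assumes the third-party `pytesseract` wrapper (not listed among the dependencies above) and reads the `Rotate:` field from the text report returned by `pytesseract.image_to_osd`. The helper names are hypothetical, and the sample report is made up.

```python
import re
from typing import Optional


def parse_osd_rotation(osd_text: str) -> Optional[int]:
    """Extract the value of the 'Rotate:' line from an OSD text report."""
    match = re.search(r"^Rotate:\s*(\d+)\s*$", osd_text, flags=re.MULTILINE)
    return int(match.group(1)) if match else None


def detect_rotation(image) -> Optional[int]:
    """Run Tesseract OSD on a page image (assumes pytesseract is installed)."""
    import pytesseract  # imported lazily so the parser works without it
    return parse_osd_rotation(pytesseract.image_to_osd(image))


# Made-up report in the shape Tesseract prints for OSD.
sample_report = (
    "Page number: 0\n"
    "Orientation in degrees: 180\n"
    "Rotate: 180\n"
    "Orientation confidence: 2.35\n"
    "Script: Latin\n"
    "Script confidence: 1.11\n"
)
print(parse_osd_rotation(sample_report))  # 180
```

OSD needs a reasonable amount of text on the page to work, so a fallback would be required for pages that are mostly figures or blank.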
Source Distribution
File details
Details for the file pdf-utils-0.1.1.tar.gz.
File metadata
- Download URL: pdf-utils-0.1.1.tar.gz
- Upload date:
- Size: 16.7 kB
- Tags: Source
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/3.2.0 pkginfo/1.5.0.1 requests/2.24.0 setuptools/49.3.1 requests-toolbelt/0.9.1 tqdm/4.48.2 CPython/3.8.5
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | 208bf612970ae01ab81df0637539c0d52a1f9a9f15759ee6deef3402e3924eb5 |
| MD5 | 311374da825eaf4fe7867cf7b84605b8 |
| BLAKE2b-256 | f7e618173ef2985b5ae6d707ad050876d35436a7e1340077fe8e6bfd680c96e3 |