pdf-utils

Tools for reading and processing PDF content

These details have not been verified by PyPI

Project links

Homepage

Development Status
- 3 - Alpha
Intended Audience
- Developers
Programming Language
Topic
- Software Development :: Build Tools

Project description

## Tools for processing pdf files

This is a lightweight library for processing PDF files in Python. One use case is extracting PDF annotations for ML / NLP.

Support for:

- obtaining textual and visual content of PDF files
- locating positions of words
- fetching PDF annotations
- adding a digital layer to image PDFs
- re-creating a clean PDF file with annotations removed
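
To illustrate what "locating positions of words" can look like, here is a minimal sketch that does not use this library's API (which is not documented on this page). It assumes Poppler's `pdftotext -bbox` command, whose XHTML output contains one `<word>` element per word with `xMin`, `yMin`, `xMax`, and `yMax` attributes. The names `Word`, `parse_bbox_xhtml`, and `words_from_pdf` are hypothetical helpers introduced for this example only.

```python
import subprocess
from dataclasses import dataclass
from html.parser import HTMLParser


@dataclass
class Word:
    """One word and its bounding box in page coordinates (hypothetical type)."""
    page: int      # 1-based page number
    text: str
    x_min: float
    y_min: float
    x_max: float
    y_max: float


class _BBoxParser(HTMLParser):
    """Collects <word> elements; HTMLParser lowercases attribute names."""

    def __init__(self):
        super().__init__()
        self.words = []
        self._page = 0
        self._box = None
        self._chunks = []

    def handle_starttag(self, tag, attrs):
        if tag == "page":
            self._page += 1
        elif tag == "word":
            a = dict(attrs)
            self._box = tuple(float(a[k]) for k in ("xmin", "ymin", "xmax", "ymax"))
            self._chunks = []

    def handle_data(self, data):
        # Only keep text that appears inside a <word> element.
        if self._box is not None:
            self._chunks.append(data)

    def handle_endtag(self, tag):
        if tag == "word" and self._box is not None:
            self.words.append(Word(self._page, "".join(self._chunks), *self._box))
            self._box = None


def parse_bbox_xhtml(xhtml):
    """Parse the XHTML produced by `pdftotext -bbox` into Word records."""
    parser = _BBoxParser()
    parser.feed(xhtml)
    parser.close()
    return parser.words


def words_from_pdf(pdf_path):
    """Assumption: Poppler's `pdftotext` is on PATH and writes to stdout for '-'."""
    result = subprocess.run(
        ["pdftotext", "-bbox", str(pdf_path), "-"],
        capture_output=True, text=True, check=True,
    )
    return parse_bbox_xhtml(result.stdout)
```

Calling `words_from_pdf("some.pdf")` would return a flat list of words with their boxes. This only works for PDFs that already carry a text layer; scanned pages would need OCR first.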

## Dependencies

The main tool for reading PDF files is the PyPDF2 library. Non-Python dependencies are

[Poppler](https://poppler.freedesktop.org/),
[Tesseract](https://tesseract-ocr.github.io/tessdoc/Home.html), and
[OpenCV](https://opencv.org/).

To install Poppler, see the guide in the [pdf2image readme](https://pypi.org/project/pdf2image/).
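
Before running anything, it can help to check that the dependencies listed above are reachable. The following is a minimal sketch using only the standard library. It assumes that Poppler is detected through its `pdftoppm` executable (the one pdf2image calls), that Tesseract is detected through the `tesseract` executable, and that OpenCV is used through its Python module `cv2`. How this package actually invokes them is not stated here, and `check_dependencies` is a hypothetical helper name.

```python
import importlib.util
import shutil


def check_dependencies():
    """Return a mapping of dependency name -> whether it appears to be installed."""
    return {
        # Executables are looked up on PATH.
        "poppler (pdftoppm)": shutil.which("pdftoppm") is not None,
        "tesseract": shutil.which("tesseract") is not None,
        # Python modules are looked up without importing them.
        "opencv (cv2)": importlib.util.find_spec("cv2") is not None,
        "PyPDF2": importlib.util.find_spec("PyPDF2") is not None,
    }


if __name__ == "__main__":
    for name, found in check_dependencies().items():
        print(f"{name}: {'found' if found else 'MISSING'}")
```

A missing entry only means the tool was not found on `PATH` or in the current Python environment; it may still be installed elsewhere.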

## How to

Some examples of usage are shown in the [notebook](./notebook/Demo.ipynb).

## Todo

- Add detection of page orientation (upside-down, rotated,…) based on images.
- Add some of our experiments with “naive” table detection.
- Get rid of PyPDF2 as [it is not maintained](https://stackoverflow.com/questions/63199763/maintained-alternatives-to-pypdf2); replace it with PyMuPDF or pdfminer.six.

Release history

- 0.1.1 (this version): Aug 19, 2020
- 0.0.0: Aug 19, 2020

Source Distribution

pdf-utils-0.1.1.tar.gz (16.7 kB)

Uploaded Aug 19, 2020 Source

File details

Details for the file pdf-utils-0.1.1.tar.gz.

File metadata

Download URL: pdf-utils-0.1.1.tar.gz
Upload date: Aug 19, 2020
Size: 16.7 kB
Tags: Source
Uploaded using Trusted Publishing? No
Uploaded via: twine/3.2.0 pkginfo/1.5.0.1 requests/2.24.0 setuptools/49.3.1 requests-toolbelt/0.9.1 tqdm/4.48.2 CPython/3.8.5

File hashes

Hashes for pdf-utils-0.1.1.tar.gz:

| Algorithm | Hash digest |
| --- | --- |
| SHA256 | `208bf612970ae01ab81df0637539c0d52a1f9a9f15759ee6deef3402e3924eb5` |
| MD5 | `311374da825eaf4fe7867cf7b84605b8` |
| BLAKE2b-256 | `f7e618173ef2985b5ae6d707ad050876d35436a7e1340077fe8e6bfd680c96e3` |
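
A downloaded archive can be checked against the SHA256 digest listed above. This is a minimal sketch using Python's standard `hashlib`. The local file name `pdf-utils-0.1.1.tar.gz` in the current directory is an assumption, and `file_digest` and `matches_expected` are hypothetical helper names.

```python
import hashlib
from pathlib import Path

# SHA256 digest copied from the hash listing for pdf-utils-0.1.1.tar.gz.
EXPECTED_SHA256 = "208bf612970ae01ab81df0637539c0d52a1f9a9f15759ee6deef3402e3924eb5"

# Assumed local path of the downloaded archive.
ARCHIVE = Path("pdf-utils-0.1.1.tar.gz")


def file_digest(path, algorithm="sha256", chunk_size=65536):
    """Hash a file in chunks so large files are not read into memory at once."""
    h = hashlib.new(algorithm)
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()


def matches_expected(path, expected_hex, algorithm="sha256"):
    """Compare the file's digest with an expected hex string, ignoring case."""
    return file_digest(path, algorithm) == expected_hex.strip().lower()


if __name__ == "__main__" and ARCHIVE.exists():
    ok = matches_expected(ARCHIVE, EXPECTED_SHA256)
    print("digest OK" if ok else "digest MISMATCH")
```

Passing `algorithm="md5"` checks the MD5 value in the same way; SHA256 is the stronger of the two checks.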
