PDF text and table search
Project description
PDFScraper
PDFScraper is a CLI program and library for extracting PDF elements. It implements search functionality that outputs a summary in HTML format, and it combines pdfminer.six, Camelot and Tesseract OCR in a single, simple-to-use program.
How to use
Install using pip
Use pip to install PDFScraper:
$ pip install PDFScraper
Arguments
optional arguments:
  -h, --help            show this help message and exit
  --path PATH           path to pdf folder or file
  --out OUT             path to output file location
  --log_level {critical,error,warning,info,debug}
                        logger level to use (default: info)
  --search SEARCH       word to search for
  --tessdata TESSDATA   location of tesseract data files
  --tables TABLES       should tables be extracted and searched
  --search_mode SEARCH_MODE
                        And or Or search, when multiple search words are provided
  --multiprocessing MULTIPROCESSING
                        should multiprocessing be enabled
path, by default ".", specifies the location of the PDF file or folder.
out, by default ".", specifies the output directory in which the summary.html file is created.
search specifies the word or sentence to search for in the PDF documents.
tessdata can be used to specify a custom tessdata location for OCR analysis.
tables, by default True, specifies whether tables are also extracted and searched for the search term. Disabling table search improves speed significantly.
search_mode, by default 'and', controls matching when multiple search terms are provided. In 'and' mode, a paragraph is returned only if it contains all the terms. In 'or' mode, a paragraph is returned if it contains any of the terms.
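The difference between the two modes can be illustrated with a minimal Python sketch. This is not PDFScraper's own code: the function names `paragraph_matches` and `search_paragraphs` are hypothetical, and case-insensitive substring matching is an assumption made for the example.

```python
def paragraph_matches(paragraph, terms, search_mode="and"):
    """Return True if the paragraph satisfies the search in the given mode."""
    text = paragraph.lower()  # assumption: matching ignores case
    hits = [term.lower() in text for term in terms]
    if search_mode == "and":
        return all(hits)  # every term must occur in the paragraph
    if search_mode == "or":
        return any(hits)  # one occurring term is enough
    raise ValueError(f"unknown search_mode: {search_mode!r}")


def search_paragraphs(paragraphs, terms, search_mode="and"):
    """Keep only the paragraphs that match, preserving their order."""
    return [p for p in paragraphs if paragraph_matches(p, terms, search_mode)]
```

With the terms "invoice" and "total", a paragraph that mentions only "total" is dropped in 'and' mode and kept in 'or' mode.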
multiprocessing, by default True, runs the processing in parallel to speed it up. It should not be used with OCR, as it significantly decreases performance.
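To show how the documented options and their defaults fit together, here is a sketch of an equivalent `argparse` parser. It is reconstructed from the help text above and is not the project's actual source. The `str_to_bool` helper, the program name passed as `prog`, and the lower-casing of `search_mode` are assumptions.

```python
import argparse


def str_to_bool(value):
    """Hypothetical helper: turn 'true'/'false'-style strings into booleans."""
    lowered = value.strip().lower()
    if lowered in {"true", "t", "yes", "y", "1"}:
        return True
    if lowered in {"false", "f", "no", "n", "0"}:
        return False
    raise argparse.ArgumentTypeError(f"expected a boolean, got {value!r}")


def build_parser():
    """Build a parser that mirrors the documented options and defaults."""
    parser = argparse.ArgumentParser(prog="PDFScraper")  # assumed program name
    parser.add_argument("--path", default=".", help="path to pdf folder or file")
    parser.add_argument("--out", default=".", help="path to output file location")
    parser.add_argument(
        "--log_level",
        default="info",
        choices=["critical", "error", "warning", "info", "debug"],
        help="logger level to use (default: info)",
    )
    parser.add_argument("--search", help="word to search for")
    parser.add_argument("--tessdata", help="location of tesseract data files")
    parser.add_argument("--tables", type=str_to_bool, default=True,
                        help="should tables be extracted and searched")
    parser.add_argument("--search_mode", type=str.lower, default="and",
                        help="And or Or search, when multiple search words are provided")
    parser.add_argument("--multiprocessing", type=str_to_bool, default=True,
                        help="should multiprocessing be enabled")
    return parser
```

Parsing `["--search", "invoice", "--tables", "false"]` with this sketch leaves `path` and `out` at ".", keeps `multiprocessing` enabled, and turns table search off.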
OCR
Pretrained tessdata language files need to be added manually to the tessdata directory.
OCR analysis of PDF documents currently supports English and Slovenian. The language of the document is detected automatically using the langdetect library.
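The following sketch shows one way to check that the required language files are present and to map a detected language to a Tesseract language code. It is an illustration, not PDFScraper's implementation: the function names are hypothetical and the fallback to English is an assumption. It relies on two known conventions: `langdetect.detect` returns ISO 639-1 codes such as "en" and "sl", and Tesseract language files are named `eng.traineddata` and `slv.traineddata`.

```python
from pathlib import Path

# langdetect code -> Tesseract language code for the two supported languages
TESSERACT_LANG = {"en": "eng", "sl": "slv"}


def tesseract_lang(iso_code, default="eng"):
    """Map a langdetect code to a Tesseract code (assumed fallback: English)."""
    return TESSERACT_LANG.get(iso_code, default)


def missing_traineddata(tessdata_dir, langs=("eng", "slv")):
    """Return the language codes whose .traineddata file is not in tessdata_dir."""
    directory = Path(tessdata_dir)
    return [lang for lang in langs
            if not (directory / f"{lang}.traineddata").is_file()]


def detect_tesseract_lang(text):
    """Detect the language of text with langdetect and map it for Tesseract."""
    from langdetect import detect  # third-party: pip install langdetect
    return tesseract_lang(detect(text))
```

Calling `missing_traineddata` on the directory passed via tessdata before starting an OCR run gives an early, readable error instead of a failure partway through processing.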
File details
Details for the file PDFScraper-1.1.9.tar.gz.
File metadata
- Download URL: PDFScraper-1.1.9.tar.gz
- Upload date:
- Size: 12.7 kB
- Tags: Source
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/4.0.2 CPython/3.11.5
File hashes
Algorithm | Hash digest
---|---
SHA256 | f1c7fa36bca1a035be5dad017ccd0dd751d43f7e21924da9dc04a1b3d4a42e0a
MD5 | 883677c8b8f85adf4ac63296f734d5e4
BLAKE2b-256 | 0b5342498fa0a4807bd00bfb5a806d7ea4d3cfd54b6c82957565a583cfe50dab
File details
Details for the file PDFScraper-1.1.9-py3-none-any.whl.
File metadata
- Download URL: PDFScraper-1.1.9-py3-none-any.whl
- Upload date:
- Size: 14.1 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/4.0.2 CPython/3.11.5
File hashes
Algorithm | Hash digest
---|---
SHA256 | bc2df703fcb13654b70979e111e1435d5f0eb00dd6a9804c34b6f13bdba8005d
MD5 | 5dd2056facae0c91dd7d1b0e0c3452c9
BLAKE2b-256 | 1369670dfc4453d60412a752e99d03a15d85bd05117cbc7e7cfcb40f9c50b234