PDF text and table search

PDFScraper

CLI program and library for extracting PDF elements, with a search function that outputs a summary in HTML format. It combines Pdfminer.six, Camelot, and Tesseract OCR in a single, simple-to-use program.

How to use

Install using pip

Use pip to install PDFScraper:

$ pip install PDFScraper

Arguments

optional arguments:
  -h, --help            show this help message and exit
  --path PATH           path to pdf folder or file
  --out OUT             path to output file location
  --log_level {critical,error,warning,info,debug}
                        logger level to use (default: info)
  --search SEARCH       word to search for
  --tessdata TESSDATA   location of tesseract data files
  --tables TABLES       should tables be extracted and searched
  --search_mode SEARCH_MODE
                        And or Or search, when multiple search words are
                        provided
  --multiprocessing MULTIPROCESSING
                        should multiprocessing be enabled

path, by default ".", specifies the location of the PDF folder or file.

out, by default ".", specifies the output directory in which the summary.html file is created.

search specifies the word or sentence to search for in the PDF documents.

tessdata specifies a custom tessdata location for OCR analysis.

tables, by default True, specifies whether tables are extracted and searched for the search word. Disabling table search improves speed significantly.

search_mode, by default 'and', specifies how multiple search terms are combined. In 'and' mode, a paragraph is returned only if it contains all of the terms. In 'or' mode, a paragraph is returned if it contains any of the terms.
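The two search modes can be illustrated with a minimal sketch. This is not PDFScraper's own code: the function name paragraph_matches is hypothetical, and case-insensitive substring matching is an assumption made for the example.

```python
from typing import Iterable


def paragraph_matches(paragraph: str, terms: Iterable[str], mode: str = "and") -> bool:
    """Return True if the paragraph satisfies the search under the given mode.

    Assumption: matching is a case-insensitive substring test.
    """
    text = paragraph.lower()
    hits = [term.lower() in text for term in terms]
    if mode == "and":
        # Every term must occur in the paragraph.
        return all(hits)
    if mode == "or":
        # At least one term must occur in the paragraph.
        return any(hits)
    raise ValueError(f"unknown search mode: {mode!r}")


paragraph = "Total invoice amount"
print(paragraph_matches(paragraph, ["invoice", "tax"], mode="and"))
print(paragraph_matches(paragraph, ["invoice", "tax"], mode="or"))
```

With the terms "invoice" and "tax", the sample paragraph is rejected in 'and' mode and returned in 'or' mode, because only one of the two terms occurs in it.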

multiprocessing, by default True, runs the processing in parallel to speed it up. It should not be used with OCR, as it significantly decreases performance.
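The help output above has the shape produced by Python's argparse, so the options and their documented defaults can be sketched as a parser. This is a hypothetical reconstruction, not the project's source: the flag names, help strings, and defaults come from the text above, while the str_to_bool helper and the argument types are assumptions.

```python
import argparse


def str_to_bool(value: str) -> bool:
    """Interpret common true/false spellings given on the command line."""
    lowered = value.strip().lower()
    if lowered in {"true", "t", "yes", "y", "1"}:
        return True
    if lowered in {"false", "f", "no", "n", "0"}:
        return False
    raise argparse.ArgumentTypeError(f"expected a boolean, got {value!r}")


def build_parser() -> argparse.ArgumentParser:
    # Flag names and defaults follow the descriptions above; types are assumed.
    parser = argparse.ArgumentParser()
    parser.add_argument("--path", default=".", help="path to pdf folder or file")
    parser.add_argument("--out", default=".", help="path to output file location")
    parser.add_argument(
        "--log_level",
        default="info",
        choices=["critical", "error", "warning", "info", "debug"],
        help="logger level to use (default: info)",
    )
    parser.add_argument("--search", help="word to search for")
    parser.add_argument("--tessdata", help="location of tesseract data files")
    parser.add_argument("--tables", type=str_to_bool, default=True,
                        help="should tables be extracted and searched")
    parser.add_argument("--search_mode", type=str.lower, default="and",
                        help="And or Or search, when multiple search words are provided")
    parser.add_argument("--multiprocessing", type=str_to_bool, default=True,
                        help="should multiprocessing be enabled")
    return parser


# Example: search for a word with table search switched off.
args = build_parser().parse_args(["--search", "invoice", "--tables", "false"])
print(args.path, args.tables, args.search_mode, args.multiprocessing)
```

Because tables and multiprocessing take a value rather than acting as bare switches, the sketch converts that value to a boolean explicitly. How PDFScraper itself parses these values is not stated here.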

OCR

Pretrained tessdata language files must be added manually to the tessdata directory.
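Since the language files are added by hand, a quick check before running OCR can catch a missing file. The sketch below is illustrative only: the helper missing_traineddata is hypothetical, and it assumes Tesseract's usual naming of eng.traineddata for English and slv.traineddata for Slovenian.

```python
from pathlib import Path
from typing import List, Sequence


def missing_traineddata(tessdata_dir: str,
                        languages: Sequence[str] = ("eng", "slv")) -> List[str]:
    """Return the language codes whose .traineddata file is absent."""
    directory = Path(tessdata_dir)
    return [
        lang for lang in languages
        if not (directory / f"{lang}.traineddata").is_file()
    ]


# Example: report which language files still need to be added.
missing = missing_traineddata(".")
if missing:
    print("Missing traineddata for:", ", ".join(missing))
```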

OCR analysis of PDF documents currently supports English and Slovenian. The document language is detected automatically using the langdetect library.
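One way language detection could feed into OCR is sketched below. This is not PDFScraper's implementation: the mapping table, both function names, and the fallback to English are assumptions. langdetect reports two-letter codes such as "en" and "sl", while Tesseract language files use three-letter names such as "eng" and "slv".

```python
# Assumed mapping from langdetect codes to Tesseract language names.
TESSERACT_LANGS = {"en": "eng", "sl": "slv"}


def tesseract_language(iso_code: str, default: str = "eng") -> str:
    """Map a two-letter language code to a Tesseract language name.

    Assumption: unsupported languages fall back to English.
    """
    return TESSERACT_LANGS.get(iso_code, default)


def detect_tesseract_language(text: str) -> str:
    """Detect the language of the text and return the Tesseract name for it."""
    # Imported lazily so the mapping above works without langdetect installed.
    from langdetect import detect
    return tesseract_language(detect(text))


print(tesseract_language("sl"))
```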
