leaf-focus
Extract structured text from pdf files.
Install
Install from PyPI using pip:
pip install leaf-focus
Download the Xpdf command line tools and extract the executable files.
Provide the directory containing the executable files as --exe-dir.
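Before running the tool, it can help to confirm that the directory passed as --exe-dir actually holds the extracted Xpdf executables. The following is a minimal sketch, not part of leaf-focus. The tool names in ASSUMED_TOOLS are an assumption (the text above only says "the executable files"), and missing_tools is a hypothetical helper, so adjust the list to match what your Xpdf download contains.

```python
from pathlib import Path

# Assumption: these are the Xpdf tools of interest. The README does not
# list them, so edit this tuple to match your extracted download.
ASSUMED_TOOLS = ("pdfinfo", "pdftotext", "pdftopng")


def missing_tools(exe_dir, tools=ASSUMED_TOOLS):
    """Return the names in `tools` that have no matching file in `exe_dir`.

    A file matches when its name without extension equals the tool name,
    so both "pdfinfo" and "pdfinfo.exe" count as present.
    """
    exe_dir = Path(exe_dir)
    if not exe_dir.is_dir():
        # A missing directory means every tool is missing.
        return list(tools)
    present = {p.stem.lower() for p in exe_dir.iterdir() if p.is_file()}
    return [tool for tool in tools if tool not in present]
```

An empty list from missing_tools("path-to-xpdf-exe-dir") suggests the directory is a reasonable value for --exe-dir.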
Usage
usage: leaf-focus [-h] [--version] --exe-dir EXE_DIR [--page-images] [--ocr]
                  [--first FIRST] [--last LAST]
                  [--log-level {debug,info,warning,error,critical}]
                  input_pdf output_dir

Extract structured text from a pdf file.

positional arguments:
  input_pdf             path to the pdf file to read
  output_dir            path to the directory to save the extracted text files

optional arguments:
  -h, --help            show this help message and exit
  --version             show program's version number and exit
  --exe-dir EXE_DIR     path to the directory containing xpdf executable files
  --page-images         save each page of the pdf as a separate image
  --ocr                 run optical character recognition on each page of the
                        pdf
  --first FIRST         the first pdf page to process
  --last LAST           the last pdf page to process
  --log-level {debug,info,warning,error,critical}
                        the log level: debug, info, warning, error, critical
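When calling leaf-focus from a script, the options above can be assembled into an argument list programmatically. This is a rough sketch that uses only the flags shown in the usage text. build_command is a hypothetical helper, not something leaf-focus provides, and it only builds the list without running anything.

```python
# The choices accepted by --log-level, copied from the usage text.
LOG_LEVELS = ("debug", "info", "warning", "error", "critical")


def build_command(exe_dir, input_pdf, output_dir, *, page_images=False,
                  ocr=False, first=None, last=None, log_level=None):
    """Build a leaf-focus argument list from the documented options."""
    cmd = ["leaf-focus", "--exe-dir", str(exe_dir)]
    if page_images:
        cmd.append("--page-images")
    if ocr:
        cmd.append("--ocr")
    if first is not None:
        cmd += ["--first", str(first)]
    if last is not None:
        cmd += ["--last", str(last)]
    if log_level is not None:
        if log_level not in LOG_LEVELS:
            raise ValueError(f"unknown log level: {log_level!r}")
        cmd += ["--log-level", log_level]
    # The two positional arguments come last.
    cmd += [str(input_pdf), str(output_dir)]
    return cmd
```

The resulting list can be passed to subprocess.run(cmd, check=True) from the standard library, which avoids shell quoting problems with paths that contain spaces.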
Examples
# Extract the pdf information and embedded text.
leaf-focus --exe-dir [path-to-xpdf-exe-dir] file.pdf file-pages
# Extract the pdf information, embedded text, an image of each page, and Optical Character Recognition results of each page.
leaf-focus --exe-dir [path-to-xpdf-exe-dir] file.pdf file-pages --ocr
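The examples above process one file at a time. For a folder of pdf files, one possible approach is to prepare one command per file, following the same "file.pdf file-pages" naming pattern. This is only a sketch: plan_jobs is a hypothetical helper, and the "-pages" output directory suffix is an assumption borrowed from the example, not a requirement of leaf-focus.

```python
from pathlib import Path


def plan_jobs(pdf_dir, exe_dir, ocr=False):
    """Return one leaf-focus argument list per *.pdf file in `pdf_dir`.

    Each output directory sits next to its pdf and is named
    "<pdf name without extension>-pages" (an assumed convention).
    """
    jobs = []
    for pdf in sorted(Path(pdf_dir).glob("*.pdf")):
        out_dir = pdf.with_name(pdf.stem + "-pages")
        cmd = ["leaf-focus", "--exe-dir", str(exe_dir), str(pdf), str(out_dir)]
        if ocr:
            # Same flag placement as the second example above.
            cmd.append("--ocr")
        jobs.append(cmd)
    return jobs
```

Each returned list can then be run in turn, for example with subprocess.run(cmd, check=True).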
Download files
Source Distribution
leaf-focus-0.4.0.tar.gz (23.5 kB)
Built Distribution
leaf_focus-0.4.0-py3-none-any.whl (24.4 kB)
Hashes for leaf_focus-0.4.0-py3-none-any.whl
Algorithm | Hash digest
---|---
SHA256 | f865bf0c74f55ec8e8b5e7ad316d09422c1f714356528ba3f66a3e65d4f3b3fa
MD5 | 35975e83bfb747e9aafc477b07ba48c1
BLAKE2b-256 | 39b8a7bcf8e980f5672042fb95ea1d1d792d1c530b2d4a901faf3beb2ad33abd