leaf-focus
Extract structured text from pdf files.
Install
Install from PyPI using pip:
pip install leaf-focus
Download the Xpdf command line tools and extract the executable files.
Provide the directory containing the executable files as --exe-dir.
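Before running leaf-focus, it can help to confirm that the directory you plan to pass as --exe-dir actually contains the extracted Xpdf executables. The following is a minimal sketch, not part of leaf-focus: the helper name find_missing_tools is hypothetical, and the list of tool names (pdfinfo, pdftotext, pdftopng, all of which ship with the Xpdf command line tools) is an assumption about which executables are needed.

```python
from pathlib import Path

# Assumption: these Xpdf tools are the ones to look for.
# Adjust the tuple if your setup needs a different set.
ASSUMED_TOOLS = ("pdfinfo", "pdftotext", "pdftopng")


def find_missing_tools(exe_dir, tool_names=ASSUMED_TOOLS):
    """Return the tool names that have no matching file in exe_dir.

    Hypothetical helper for illustration; it is not provided by leaf-focus.
    """
    directory = Path(exe_dir)
    missing = []
    for name in tool_names:
        # Accept the bare name (Linux, macOS) and the .exe name (Windows).
        candidates = (directory / name, directory / f"{name}.exe")
        if not any(path.is_file() for path in candidates):
            missing.append(name)
    return missing
```

An empty list means every assumed tool was found. Any names that are returned point to executables that still need to be extracted into that directory.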
Usage
usage: leaf-focus [-h] [--version] --exe-dir EXE_DIR [--page-images] [--ocr]
                  [--first FIRST] [--last LAST]
                  [--log-level {debug,info,warning,error,critical}]
                  input_pdf output_dir

Extract structured text from a pdf file.

positional arguments:
  input_pdf             path to the pdf file to read
  output_dir            path to the directory to save the extracted text files

optional arguments:
  -h, --help            show this help message and exit
  --version             show program's version number and exit
  --exe-dir EXE_DIR     path to the directory containing xpdf executable files
  --page-images         save each page of the pdf as a separate image
  --ocr                 run optical character recognition on each page of the
                        pdf
  --first FIRST         the first pdf page to process
  --last LAST           the last pdf page to process
  --log-level {debug,info,warning,error,critical}
                        the log level: debug, info, warning, error, critical
Examples
# Extract the pdf information and embedded text.
leaf-focus --exe-dir [path-to-xpdf-exe-dir] file.pdf file-pages
# Extract the pdf information, embedded text, an image of each page, and Optical Character Recognition results of each page.
leaf-focus --exe-dir [path-to-xpdf-exe-dir] file.pdf file-pages --ocr
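The same commands can be assembled from a script, for example to process a page range or several files. The sketch below builds the argument list using only the options shown in the usage text and runs it with subprocess. The helper names build_command and run_leaf_focus are hypothetical, and the sketch assumes the leaf-focus command is available on PATH after installation.

```python
import subprocess

# Log levels accepted by --log-level, taken from the usage text.
LOG_LEVELS = ("debug", "info", "warning", "error", "critical")


def build_command(exe_dir, input_pdf, output_dir, *, page_images=False,
                  ocr=False, first=None, last=None, log_level=None):
    """Build the leaf-focus argument list (hypothetical helper)."""
    command = ["leaf-focus", "--exe-dir", str(exe_dir)]
    if page_images:
        command.append("--page-images")
    if ocr:
        command.append("--ocr")
    if first is not None:
        command += ["--first", str(first)]
    if last is not None:
        command += ["--last", str(last)]
    if log_level is not None:
        if log_level not in LOG_LEVELS:
            raise ValueError(f"unknown log level: {log_level!r}")
        command += ["--log-level", log_level]
    # Positional arguments: the pdf to read, then the output directory.
    command += [str(input_pdf), str(output_dir)]
    return command


def run_leaf_focus(exe_dir, input_pdf, output_dir, **options):
    """Run leaf-focus and raise CalledProcessError on a non-zero exit."""
    # Assumption: the leaf-focus executable is on PATH.
    subprocess.run(build_command(exe_dir, input_pdf, output_dir, **options),
                   check=True)
```

For instance, run_leaf_focus("xpdf-bin", "file.pdf", "file-pages", ocr=True, first=1, last=5) would pass --ocr --first 1 --last 5 along with the required arguments; the directory name xpdf-bin is a placeholder.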