leaf-focus
Extract structured text from pdf files.
Install
Install from PyPI using pip:
pip install leaf-focus
Download the Xpdf command line tools and extract the executable files.
Provide the directory containing the executable files as --exe-dir.
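Before running leaf-focus, it can help to confirm that the directory you plan to pass as --exe-dir contains the Xpdf programs. The sketch below is a hypothetical helper, not part of leaf-focus. The names it checks (pdfinfo, pdftotext, pdftopng) are real Xpdf command line tools, but which of them leaf-focus actually needs is an assumption.

```python
from pathlib import Path

# Assumption: these Xpdf tools are the ones of interest; adjust as needed.
ASSUMED_TOOLS = ("pdfinfo", "pdftotext", "pdftopng")


def missing_xpdf_tools(exe_dir, tools=ASSUMED_TOOLS):
    """Return the names in `tools` that have no matching file in `exe_dir`.

    A file matches when its name without extension equals the tool name,
    so both `pdfinfo` and `pdfinfo.exe` are accepted.
    """
    exe_dir = Path(exe_dir)
    if not exe_dir.is_dir():
        # Nothing can be found in a directory that does not exist.
        return list(tools)
    found = {p.stem.lower() for p in exe_dir.iterdir() if p.is_file()}
    return [name for name in tools if name not in found]
```

An empty list means every assumed tool was found; any other result names the files to look for in the Xpdf download.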
Usage
usage: leaf-focus [-h] [--version] --exe-dir EXE_DIR [--page-images] [--ocr]
                  [--first FIRST] [--last LAST]
                  [--log-level {debug,info,warning,error,critical}]
                  input_pdf output_dir

Extract structured text from a pdf file.

positional arguments:
  input_pdf             path to the pdf file to read
  output_dir            path to the directory to save the extracted text files

optional arguments:
  -h, --help            show this help message and exit
  --version             show program's version number and exit
  --exe-dir EXE_DIR     path to the directory containing xpdf executable files
  --page-images         save each page of the pdf as a separate image
  --ocr                 run optical character recognition on each page of the
                        pdf
  --first FIRST         the first pdf page to process
  --last LAST           the last pdf page to process
  --log-level {debug,info,warning,error,critical}
                        the log level: debug, info, warning, error, critical
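To call the command line tool from a Python script, the options above can be assembled into an argument list. This is a minimal sketch that only wraps the documented command line; the function names are hypothetical and it does not use any leaf-focus Python API.

```python
import subprocess


def build_leaf_focus_command(input_pdf, output_dir, exe_dir,
                             page_images=False, ocr=False,
                             first=None, last=None, log_level=None):
    """Build the argument list for one leaf-focus run."""
    cmd = ["leaf-focus", "--exe-dir", str(exe_dir)]
    if page_images:
        cmd.append("--page-images")
    if ocr:
        cmd.append("--ocr")
    if first is not None:
        cmd += ["--first", str(first)]
    if last is not None:
        cmd += ["--last", str(last)]
    if log_level is not None:
        cmd += ["--log-level", log_level]
    # Positional arguments go last, in the order shown in the usage text.
    cmd += [str(input_pdf), str(output_dir)]
    return cmd


def run_leaf_focus(*args, **kwargs):
    """Run leaf-focus and raise CalledProcessError on a non-zero exit."""
    return subprocess.run(build_leaf_focus_command(*args, **kwargs), check=True)
```

For example, `run_leaf_focus("file.pdf", "file-pages", "/opt/xpdf", first=2, last=5)` would process pages 2 to 5 only, where `/opt/xpdf` is a placeholder for your own Xpdf directory.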
Examples
# Extract the pdf information and embedded text.
leaf-focus --exe-dir [path-to-xpdf-exe-dir] file.pdf file-pages
# Extract the pdf information, embedded text, an image of each page, and Optical Character Recognition results of each page.
leaf-focus --exe-dir [path-to-xpdf-exe-dir] file.pdf file-pages --ocr
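The same commands can be generated for a whole directory of pdf files. The sketch below is a hypothetical helper that only prepares the argument lists; it follows the naming used in the examples above (file.pdf is written to file-pages), which is a convention chosen here rather than something leaf-focus requires.

```python
from pathlib import Path


def plan_batch(pdf_dir, exe_dir, ocr=False):
    """Return one leaf-focus argument list per pdf file in `pdf_dir`."""
    commands = []
    for pdf in sorted(Path(pdf_dir).glob("*.pdf")):
        # Mirror the examples: file.pdf is written to file-pages.
        output_dir = pdf.with_name(pdf.stem + "-pages")
        cmd = ["leaf-focus", "--exe-dir", str(exe_dir), str(pdf), str(output_dir)]
        if ocr:
            cmd.append("--ocr")
        commands.append(cmd)
    return commands
```

Each returned list can be passed to `subprocess.run(cmd, check=True)` to run the files one after another.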
Dependencies
- Xpdf
- keras-ocr
- TensorFlow (can optionally be run more efficiently using one or more GPUs)
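To see whether TensorFlow can find a GPU on your machine before running with --ocr, a quick check along these lines may help. It is a sketch, not part of leaf-focus: the function name is hypothetical, and it relies only on TensorFlow's `tf.config.list_physical_devices`.

```python
def visible_gpu_count():
    """Return the number of GPUs TensorFlow can see, or 0 if it is not installed."""
    try:
        import tensorflow as tf
    except ImportError:
        return 0
    return len(tf.config.list_physical_devices("GPU"))


if __name__ == "__main__":
    print(f"GPUs visible to TensorFlow: {visible_gpu_count()}")
```

A result of 0 does not prevent OCR from running; it only suggests the work will be done on the CPU.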