Easily convert a big folder with PDFs into a dataset, with extracted text using OCR
pdf2dataset
Extracts text from PDFs and saves it to a dataset
Install
Install Dependencies
Ubuntu (or other Debian-based distributions)
$ sudo apt update
$ sudo apt install -y poppler-utils tesseract-ocr-por
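Before running the tool, it can help to confirm that the system dependencies are reachable on `PATH`. The following is a minimal sketch, assuming the packages above provide the `pdftoppm` (poppler-utils) and `tesseract` executables. Which binaries pdf2dataset actually invokes is an assumption here, and `missing_tools` is a hypothetical helper, not part of pdf2dataset.

```python
import shutil

# Assumed executables: poppler-utils ships `pdftoppm`, and tesseract-ocr-por
# pulls in the `tesseract` binary. Adjust the names if your setup differs.
REQUIRED_TOOLS = ("pdftoppm", "tesseract")


def missing_tools(tools=REQUIRED_TOOLS, which=shutil.which):
    """Return the names in `tools` that cannot be found on PATH."""
    return [tool for tool in tools if which(tool) is None]


if __name__ == "__main__":
    missing = missing_tools()
    if missing:
        print("Missing system dependencies:", ", ".join(missing))
    else:
        print("All system dependencies found.")
```

The `which` parameter exists only so the lookup can be swapped out when testing; in normal use the default `shutil.which` is enough.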
Install pdf2dataset
For usage
# Installs from PyPI, so no clone is needed
$ pip3 install pdf2dataset --user # Please isolate the environment
For development
# First, clone the repository
$ poetry install
Usage examples
Simple
# Reads all PDFs from my_pdfs_folder and saves the resulting dataframe to my_df.parquet.gzip
$ pdf2dataset my_pdfs_folder my_df.parquet.gzip
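To preview which files such a run might pick up, the input folder can be walked recursively beforehand. This is a rough sketch only: `find_pdfs` is a hypothetical helper, and matching on a case-insensitive `.pdf` suffix is an assumption, since pdf2dataset's own matching rule may differ.

```python
from pathlib import Path


def find_pdfs(input_dir):
    """Return all files under `input_dir` (recursively) with a .pdf suffix."""
    root = Path(input_dir)
    if not root.is_dir():
        # Nothing to scan if the folder does not exist.
        return []
    # Case-insensitive suffix matching is an assumption of this sketch.
    return sorted(
        path for path in root.rglob("*")
        if path.is_file() and path.suffix.lower() == ".pdf"
    )
```

For example, `len(find_pdfs("my_pdfs_folder"))` gives a quick estimate of how many documents the command above would have to process.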
Keeping progress
# Keep progress in the tmp folder, so processing can resume after an error or interruption
# To resume, run the command again with the same --tmp-dir folder
$ pdf2dataset my_pdfs_folder my_df.parquet.gzip --tmp-dir tmp
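Because resuming only requires reusing the same `--tmp-dir`, the retry can be automated from a script. The sketch below is one possible wrapper, not part of pdf2dataset: `build_command` and `run_with_resume` are hypothetical helpers, `max_attempts` is a choice made for this example, and it assumes the CLI exits with a non-zero status when a run fails.

```python
import subprocess


def build_command(input_dir, results_file, tmp_dir):
    """Build the pdf2dataset command line shown above as an argument list."""
    return [
        "pdf2dataset", str(input_dir), str(results_file),
        "--tmp-dir", str(tmp_dir),
    ]


def run_with_resume(input_dir, results_file, tmp_dir,
                    max_attempts=3, runner=subprocess.run):
    """Re-run pdf2dataset with the same --tmp-dir until it exits with status 0.

    Returns the number of attempts used, or raises RuntimeError if every
    attempt fails. Assumes a failed run reports a non-zero exit status.
    """
    command = build_command(input_dir, results_file, tmp_dir)
    for attempt in range(1, max_attempts + 1):
        result = runner(command)
        if result.returncode == 0:
            return attempt
    raise RuntimeError(f"pdf2dataset failed after {max_attempts} attempts")
```

Calling `run_with_resume("my_pdfs_folder", "my_df.parquet.gzip", "tmp")` would then repeat the command above up to three times, each time pointing at the same temporary folder.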
Help
$ pdf2dataset -h
usage: pdf2dataset [-h] [--tmp-dir TMP_DIR] [--lang LANG]
                   [--num-cpus NUM_CPUS] [--address ADDRESS]
                   [--webui-host WEBUI_HOST] [--redis-password REDIS_PASSWORD]
                   input_dir results_file

Extract text from all PDF files in a directory

positional arguments:
  input_dir             The folder to lookup for PDF files recursively
  results_file          File to save the resultant dataframe

optional arguments:
  -h, --help            show this help message and exit
  --tmp-dir TMP_DIR     The folder to keep all the results, including log
                        files and intermediate files
  --lang LANG           Tesseract language
  --num-cpus NUM_CPUS   Number of cpus to use
  --address ADDRESS     Ray address to connect
  --webui-host WEBUI_HOST
                        Which port ray webui to listen
  --redis-password REDIS_PASSWORD
                        Redis password to use to connect with redis