Skip to main content

Easily convert a big folder with PDFs into a dataset, with extracted text using OCR

Project description

pdf2dataset

pdf2dataset

For extracting text from PDFs and save to a dataset

Install

Install Dependencies

Ubuntu (or debians)

$ sudo apt update
$ sudo apt install -y poppler-utils tesseract-ocr-por

Install pdf2dataset

For usage

# first, clone repository

$ pip3 install pdf2dataset --user # Please, isolate the environment

For development

# first, clone repository

$ poetry install

Usage examples

Simple

# Reads all PDFs from my_pdfs_folder and saves the resultant dataframe to my_df.parquet.gzip
$ pdf2dataset my_pdfs_folder my_df.parquet.gzip

Keeping progress

# Keep progress in tmp folder, so can resume processing in case of any error or interruption
# To resume, just use the same --tmp-dir folder
$ pdf2dataset my_pdfs_folder my_df.parquet.gzip --tmp-dir tmp

Help

$ pdf2dataset -h
usage: pdf2dataset [-h] [--tmp-dir TMP_DIR] [--lang LANG]
                   [--num-cpus NUM_CPUS] [--address ADDRESS]
                   [--webui-host WEBUI_HOST] [--redis-password REDIS_PASSWORD]
                   input_dir results_file

Extract text from all PDF files in a directory

positional arguments:
  input_dir             The folder to lookup for PDF files recursively
  results_file          File to save the resultant dataframe

optional arguments:
  -h, --help            show this help message and exit
  --tmp-dir TMP_DIR     The folder to keep all the results, including log
                        files and intermediate files
  --lang LANG           Tesseract language
  --num-cpus NUM_CPUS   Number of cpus to use
  --address ADDRESS     Ray address to connect
  --webui-host WEBUI_HOST
                        Which port ray webui to listen
  --redis-password REDIS_PASSWORD
                        Redis password to use to connect with redis

Sample output

output_sample

Project details


Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

pdf2dataset-0.1.0.tar.gz (6.2 kB view hashes)

Uploaded Source

Built Distribution

pdf2dataset-0.1.0-py3-none-any.whl (6.8 kB view hashes)

Uploaded Python 3

Supported by

AWS AWS Cloud computing and Security Sponsor Datadog Datadog Monitoring Fastly Fastly CDN Google Google Download Analytics Microsoft Microsoft PSF Sponsor Pingdom Pingdom Monitoring Sentry Sentry Error logging StatusPage StatusPage Status page