
Easily convert a big folder of PDFs into a dataset of text extracted with OCR

pdf2dataset

Converts a whole directory tree containing a large volume of PDF documents into a dataset (pandas DataFrame) with the columns: path x page x text x error

Highlights

  • Conversion of a whole subdirectory of PDF documents into a pandas DataFrame
  • Support for parallel and distributed computing through ray
  • Incremental writing of the resulting DataFrame, to save memory
  • Ability to save processing progress and resume from it
  • Error tracking of faulty documents
  • OCR text extraction through pytesseract and pdf2image
  • Custom behaviour through parameters (number of CPUs, text language, etc.)

Install

Install Dependencies

Ubuntu (or other Debian-based distributions)

$ sudo apt update
$ sudo apt install -y poppler-utils tesseract-ocr-por  # "-por" for portuguese, use your language

Install pdf2dataset

For usage

$ pip3 install pdf2dataset --user  # Please isolate the environment, e.g. with a virtualenv

For development

# First, clone repository and cd into it
$ poetry install

Usage

Simple

# Read all PDFs from my_pdfs_folder and save the resulting DataFrame to my_df.parquet.gzip
$ pdf2dataset my_pdfs_folder my_df.parquet.gzip

Save Processing Progress

It's possible to save the progress to a temporary folder and, in case of any error or interruption, resume from the saved state. To enable this, just pass the --tmp-dir [directory] flag:

$ pdf2dataset my_pdfs_folder my_df.parquet.gzip --tmp-dir my_progress

The indicated temporary directory can also be used for debugging purposes and is not deleted automatically, so remove it yourself when it is no longer needed.

Results File

The resulting "file" is a Parquet hive written with fastparquet; it can be easily read with pandas or dask:

>>> import pandas as pd
>>> df = pd.read_parquet('my_df.parquet.gzip')
>>> df
                             path  page                  text                                              error
index                                                                                                           
0                single_page1.pdf     1  My beautiful sample!                                                   
1       sub1/copy_multi_page1.pdf     2           Second page                                                   
2      sub2/copy_single_page1.pdf     1  My beautiful sample!                                                   
3       sub1/copy_multi_page1.pdf     3            Third page                                                   
4                 multi_page1.pdf     1            First page                                                   
5                 multi_page1.pdf     3            Third page                                                   
6       sub1/copy_multi_page1.pdf     1            First page                                                   
7                 multi_page1.pdf     2           Second page                                                   
0                    invalid1.pdf    -1                        Traceback (most recent call last):\n  File "/h...

There is no guarantee that the index is unique or sequential; you may need to rebuild it with the whole dataset in memory.
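Rebuilding the index is a one-liner with pandas. A minimal sketch, using a toy DataFrame in place of the real output (the column layout matches the example above, the values are made up):

```python
import pandas as pd

# Toy frame mimicking pdf2dataset output: note the repeated index value 0
df = pd.DataFrame(
    {'path': ['single_page1.pdf', 'multi_page1.pdf', 'invalid1.pdf'],
     'page': [1, 1, -1],
     'text': ['My beautiful sample!', 'First page', ''],
     'error': ['', '', 'Traceback (most recent call last): ...']},
    index=[0, 1, 0],
)

# Rebuild a unique, sequential index in memory
df = df.reset_index(drop=True)
print(df.index.is_unique)
```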

A page number of -1 means it was not possible to even open the document.
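That sentinel makes it easy to isolate faulty documents with a plain pandas filter. A sketch on a toy frame (values are illustrative):

```python
import pandas as pd

df = pd.DataFrame({
    'path': ['single_page1.pdf', 'invalid1.pdf'],
    'page': [1, -1],
    'text': ['My beautiful sample!', ''],
    'error': ['', 'Traceback (most recent call last): ...'],
})

# Rows with page == -1 are documents that could not even be opened;
# the error column holds the traceback for inspection
failed = df[df['page'] == -1]
print(failed['path'].tolist())
```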

Help

$ pdf2dataset -h
usage: pdf2dataset [-h] [--tmp-dir TMP_DIR] [--lang LANG]
                   [--num-cpus NUM_CPUS] [--address ADDRESS]
                   [--webui-host WEBUI_HOST] [--redis-password REDIS_PASSWORD]
                   input_dir results_file

Extract text from all PDF files in a directory

positional arguments:
  input_dir             The folder to lookup for PDF files recursively
  results_file          File to save the resultant dataframe

optional arguments:
  -h, --help            show this help message and exit
  --tmp-dir TMP_DIR     The folder to keep all the results, including log
                        files and intermediate files
  --lang LANG           Tesseract language
  --num-cpus NUM_CPUS   Number of cpus to use
  --address ADDRESS     Ray address to connect
  --webui-host WEBUI_HOST
                        Which port ray webui to listen
  --redis-password REDIS_PASSWORD
                        Redis password to use to connect with redis

Troubleshooting

  1. Troubles with high memory usage

You can try decreasing the number of CPUs in use to reduce the level of parallelism: test with the --num-cpus 1 flag and then increase it according to your hardware.
