Easily convert a big folder with PDFs into a dataset, with extracted text using OCR

pdf2dataset

Converts a whole subdirectory containing any volume (small or huge) of PDF documents into a dataset (pandas DataFrame) with the columns path x page x text x error. No need to set up any external service (no databases, brokers, etc). Just install and run!

Highlights

  • Conversion of a whole subdirectory of PDF documents into a pandas DataFrame
  • Support for parallel and distributed computing through ray
  • Incremental writing of resulting DataFrame, to save memory
  • Ability to save processing progress and resume from it
  • Error tracking of faulty documents
  • Ability to extract text through pdftotext
  • Ability to use OCR for extracting text through pytesseract and pdf2image
  • Custom behavior through parameters (number of CPUs, text language, etc)

Installation

Install Dependencies

Fedora

# "-por" for portuguese, use the documents language
$ sudo dnf install -y poppler-utils pkgconfig poppler-cpp-devel python3-devel tesseract-langpack-por

Ubuntu (or other Debian-based distros)

$ sudo apt update

# "-por" for portuguese, use the documents language
$ sudo apt install -y poppler-utils build-essential libpoppler-cpp-dev pkg-config python3-dev tesseract-ocr-por

Install pdf2dataset

For usage

$ pip3 install pdf2dataset --user  # Please, isolate the environment
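
If you prefer a virtual environment over a --user install, a minimal sketch using the standard venv module (the environment name is just an example):

$ python3 -m venv .venv && source .venv/bin/activate
$ pip3 install pdf2dataset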

For development

# First, clone repository and cd into it
$ poetry install

Usage

Simple - CLI

# Reads all PDFs from my_pdfs_dir and saves the resultant dataframe to my_df.parquet.gzip
$ pdf2dataset my_pdfs_dir my_df.parquet.gzip  # Most basic
$ pdf2dataset my_pdfs_dir my_df.parquet.gzip --num-cpus 1  # Reduce parallelism to the minimum
$ pdf2dataset my_pdfs_dir my_df.parquet.gzip --ocr true  # For scanned PDFs
$ pdf2dataset my_pdfs_dir my_df.parquet.gzip --ocr true --lang eng  # For scanned documents with english text

Save Processing Progress - CLI

It's possible to save the progress to a temporary folder and resume from the saved state in case of any error or interruption. To resume the processing, just use the --tmp-dir [directory] flag:

$ pdf2dataset my_pdfs_dir my_df.parquet.gzip --tmp-dir my_progress

The indicated temporary directory is not deleted automatically, so it can also be used for debugging purposes; delete it when no longer needed.

Using as a library

The extract_text function can be used analogously to the CLI:

from pdf2dataset import extract_text

extract_text('my_pdfs_dir', 'my_df.parquet.gzip', tmp_dir='my_progress')
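
Other CLI flags are expected to map to keyword arguments in the same fashion; a sketch assuming ocr and lang mirror the --ocr and --lang flags:

from pdf2dataset import extract_text

extract_text('my_pdfs_dir', 'my_df.parquet.gzip', ocr=True, lang='eng')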

Small

One feature not available in the CLI is the custom behavior for handling small volumes of data ("small" meaning the extraction runs locally and won't take hours or days).

The complete list of differences is:

  • Faster initialization (uses multiprocessing instead of ray)
  • Doesn't save processing progress
  • Distributed processing not supported
  • Doesn't write the dataframe to disk
  • Returns the dataframe

Example:
from pdf2dataset import extract_text

df = extract_text('my_pdfs_dir', small=True)
# ...

Passing specific tasks

If you don't want to specify a directory for the documents, you can pass the specific tasks to be processed.

The tasks can be of the form (document_name, document_bytes, page_number) or just (document_name, document_bytes); document_name must end with .pdf but doesn't need to be a real file, document_bytes are the raw bytes of the PDF document, and page_number is the number of the page to process (all pages if not specified).

Example:
from pdf2dataset import extract_text

tasks = [
    ('a.pdf', a_bytes),  # Processing all pages of this document
    ('b.pdf', b_bytes, 1),
    ('b.pdf', b_bytes, 2),
]

# 'df' will contain all pages from 'a.pdf' and pages 1 and 2 from 'b.pdf'
df = extract_text(tasks, 'my_df.parquet.gzip', small=True)

# ...

Returning a list with the contents, instead of DataFrame

If you are just interested in the texts, it's possible to return a list that contains only the pages' content. Each document will be a list in which each element is a page.

Example:
>>> from pdf2dataset import extract_text
>>> extract_text('tests/samples', return_list=True)
[[''],
 ['First page', 'Second page', 'Third page'],
 ['My beautiful sample!'],
 ['First page', 'Second page', 'Third page'],
 ['My beautiful sample!']]

Note: Pages/documents with a parsing error will have an empty string as their text result

Results File

The resulting "file" is a parquet hive written with fastparquet, it can be easily read with pandas or dask:

>>> import pandas as pd
>>> df = pd.read_parquet('my_df.parquet.gzip')
>>> df
                             path  page                  text                                              error
index                                                                                                           
0                single_page1.pdf     1  My beautiful sample!                                                   
1       sub1/copy_multi_page1.pdf     2           Second page                                                   
2      sub2/copy_single_page1.pdf     1  My beautiful sample!                                                   
3       sub1/copy_multi_page1.pdf     3            Third page                                                   
4                 multi_page1.pdf     1            First page                                                   
5                 multi_page1.pdf     3            Third page                                                   
6       sub1/copy_multi_page1.pdf     1            First page                                                   
7                 multi_page1.pdf     2           Second page                                                   
0                    invalid1.pdf    -1                        Traceback (most recent call last):\n  File "/h...

There is no guarantee about the uniqueness or ordering of the index; you might need to create a new index with the whole data in memory.

A page number of -1 means it was not possible to even open the document.
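
For example, a fresh sequential index can be built with pandas once everything fits in memory, and the same hive can be read lazily with dask (a minimal sketch):

>>> import pandas as pd
>>> df = pd.read_parquet('my_df.parquet.gzip').reset_index(drop=True)  # unique, sequential index

>>> import dask.dataframe as dd
>>> ddf = dd.read_parquet('my_df.parquet.gzip')  # lazy, out-of-core reading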

Run on a Cluster

Setup the Cluster

Follow ray documentation for manual or automatic setup.

Run it

To go distributed, you can run it just like the local version, but use the --address and --redis-password flags to point to your cluster (see the ray documentation for more information).
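
A hypothetical invocation against a running cluster (the address and password are placeholders for your own cluster's values):

$ pdf2dataset my_pdfs_dir my_df.parquet.gzip --address <head_node_ip>:<port> --redis-password <password>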

With version >= 0.2.0, only the head node needs to have access to the documents in disk.

Help

$ pdf2dataset -h
usage: pdf2dataset [-h] [--tmp-dir TMP_DIR] [--lang LANG] [--ocr OCR] [--chunksize CHUNKSIZE] [--num-cpus NUM_CPUS] [--address ADDRESS] [--webui-host WEBUI_HOST]
                   [--redis-password REDIS_PASSWORD]
                   input_dir results_file

Extract text from all PDF files in a directory

positional arguments:
  input_dir             The folder to lookup for PDF files recursively
  results_file          File to save the resultant dataframe

optional arguments:
  -h, --help            show this help message and exit
  --tmp-dir TMP_DIR     The folder to keep all the results, including log files and intermediate files
  --lang LANG           Tesseract language
  --ocr OCR             'pytesseract' if true, else 'pdftotext'. default: false
  --chunksize CHUNKSIZE
                        Chunksize to use while processing pages, otherwise is calculated
  --num-cpus NUM_CPUS   Number of cpus to use
  --address ADDRESS     Ray address to connect
  --webui-host WEBUI_HOST
                        Which port ray webui to listen
  --redis-password REDIS_PASSWORD
                        Redis password to use to connect with redis

Troubleshooting

  1. Trouble with high memory usage
  • Decrease the number of CPUs in use to reduce the level of parallelism; test with the --num-cpus 1 flag and then increase it according to your hardware.

  • Use a smaller chunksize, so fewer documents are kept in memory at once. Use --chunksize 1 to hold 1 * num_cpus documents in memory at once (see the example below).
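
For example, the most memory-conservative invocation combines both flags (the values are just a starting point):

$ pdf2dataset my_pdfs_dir my_df.parquet.gzip --num-cpus 1 --chunksize 1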

How to Contribute

Just open your issues and/or pull requests; all are welcome :smiley:!
