Easily convert a directory with a large volume of PDF documents into a dataset; supports extracting text and images.
pdf2dataset
Converts a whole directory with any volume (small or huge) of PDF documents to a dataset (pandas DataFrame). No need to set up any external service (no database, message broker, etc.). Just install it and run!
Main features
- Conversion of a whole directory of PDF documents into a pandas DataFrame
- Support for parallel and distributed processing through ray
- Extractions are performed per page, making task distribution more uniform, even for documents with very different page counts
- Incremental writing of the resulting DataFrame, making it possible to process data bigger than memory
- Error tracking of faulty documents
- Ability to save processing progress and resume from it
- Ability to extract text through pdftotext
- Ability to use OCR for extracting text through pytesseract
- Ability to extract images through pdf2image
- Support for implementing custom feature extraction
- Highly customizable behavior through params
Installation
Install Dependencies
Fedora
# "-por" for portuguese, use the documents language
$ sudo dnf install -y gcc-c++ poppler-utils pkgconfig poppler-cpp-devel python3-devel tesseract-langpack-por
Ubuntu (or other Debian-based distributions)
$ sudo apt update
# "-por" for portuguese, use the documents language
$ sudo apt install -y build-essential poppler-utils libpoppler-cpp-dev pkg-config python3-dev tesseract-ocr-por
Install pdf2dataset
For usage
$ pip3 install pdf2dataset --user # Please isolate the environment
For development
# First, install poetry, then clone the repository and cd into it
$ poetry install
Usage
Simple - CLI
# Note: path, page and error will always be present in the resulting DataFrame
# Reads all PDFs from my_pdfs_dir and saves the resulting DataFrame to my_df.parquet.gzip
$ pdf2dataset my_pdfs_dir my_df.parquet.gzip # Most basic, extract all possible features
$ pdf2dataset my_pdfs_dir my_df.parquet.gzip --features=text # Extract just text
$ pdf2dataset my_pdfs_dir my_df.parquet.gzip --features=image # Extract just image
$ pdf2dataset my_pdfs_dir my_df.parquet.gzip --num-cpus 1 # Reduce parallelism to the minimum
$ pdf2dataset my_pdfs_dir my_df.parquet.gzip --ocr true # For scanned PDFs
$ pdf2dataset my_pdfs_dir my_df.parquet.gzip --ocr true --lang eng # For scanned documents with English text
Save Processing Progress - CLI
It's possible to save the progress to a temporary folder and resume from the saved state in case of any error or interruption. To do so, just use the --tmp-dir [directory] flag:
$ pdf2dataset my_pdfs_dir my_df.parquet.gzip --tmp-dir my_progress
The indicated temporary directory can also be used for debugging purposes. It is not deleted automatically, so remove it when no longer needed.
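To resume an interrupted run, just re-run the same command pointing to the same temporary directory:
$ pdf2dataset my_pdfs_dir my_df.parquet.gzip --tmp-dir my_progress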
Using as a library
Main functions
There are some helper functions to facilitate pdf2dataset usage:
- extract: can be used analogously to the CLI
- extract_text: extract wrapper with features='text'
- extract_image: extract wrapper with features='image'
- image_from_bytes: (pdf2image.utils) get a Pillow Image object given the image bytes
- image_to_bytes: (pdf2image.utils) get the image bytes given a Pillow Image object
Basic example
from pdf2dataset import extract
extract('my_pdfs_dir', 'all_features.parquet.gzip', tmp_dir='my_progress')
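The image helpers can be combined in the same way. A minimal sketch, assuming small=True returns the DataFrame directly (see "Small data" below) and that the extracted image bytes land in an image column (the column name is an assumption):

from pdf2dataset import extract_image, image_from_bytes

# Extract only the image feature; with small=True the DataFrame is returned
df = extract_image('my_pdfs_dir', small=True)

# Convert the stored bytes (assumed 'image' column) into a Pillow Image
first_page_image = image_from_bytes(df.iloc[0].image)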
Small data
One feature, not available through the CLI, is the custom behavior for handling small volumes of data ("small" meaning the extraction won't run for hours or days and doesn't need to be distributed).
The complete list of differences is:
- Faster initialization (uses multiprocessing instead of ray)
- Doesn't save processing progress
- Doesn't support distributed processing
- Doesn't write the DataFrame to disk
- Returns the DataFrame
Example:
from pdf2dataset import extract_text
df = extract_text('my_pdfs_dir', small=True)
# ...
Pass files from memory
If you don't want to specify a directory for the documents, you can instead specify the tasks to be processed.
Each task is of the form (document_name, document_bytes, page_number) or just (document_name, document_bytes). document_name must end with .pdf but doesn't need to be a real file, document_bytes are the bytes of the PDF document, and page_number is the number of the page to process (all pages, if not specified).
Example:
from pdf2dataset import extract_text
tasks = [
('a.pdf', a_bytes), # Processing all pages of this document
('b.pdf', b_bytes, 1),
('b.pdf', b_bytes, 2),
]
# 'df' will contain results from all pages of 'a.pdf' and pages 1 and 2 of 'b.pdf'
df = extract_text(tasks, small=True)
# ...
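A minimal sketch of building such tasks, reading bytes from files on disk just to illustrate where document_bytes could come from (the directory name is hypothetical):

from pathlib import Path
from pdf2dataset import extract_text

# Build in-memory tasks from the PDF files found under a directory
tasks = [
    (path.name, path.read_bytes())  # no page number: process all pages
    for path in Path('my_pdfs_dir').rglob('*.pdf')
]
df = extract_text(tasks, small=True)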
Returning a list
If you don't want to handle the DataFrame, it is possible to return a nested list with the feature values. The structure of the resulting list is:
result = List[documents]
documents = List[pages]
pages = List[features]
features = List[feature]
feature = any
- any is any type supported by pyarrow
- Features are ordered by feature name (text, image, etc.)
Example:
>>> from pdf2dataset import extract_text
>>> extract_text('tests/samples', return_list=True)
[[[None]],
[['First page'], ['Second page'], ['Third page']],
[['My beautiful sample!']],
[['First page'], ['Second page'], ['Third page']],
[['My beautiful sample!']]]
- Features with error will have None as their result
- Here, extract_text was used, so the only feature is text
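A short sketch of walking this nested structure, using the same call as above:

from pdf2dataset import extract_text

result = extract_text('tests/samples', return_list=True)

# documents -> pages -> features (only 'text' here, since extract_text was used)
for document in result:
    for page in document:
        for feature_value in page:
            print(feature_value)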
Custom Features
With version >= 0.4.0, it is also possible to easily implement the extraction of custom features.
Example:
This is the structure:
from pdf2dataset import extract, feature, PdfExtractTask


class MyCustomTask(PdfExtractTask):

    @feature('bool_')
    def get_is_page_even(self):
        return self.page % 2 == 0

    @feature('binary')
    def get_doc_first_bytes(self):
        return self.file_bin[:10]

    @feature('string', exceptions=[ValueError])
    def get_wrong(self):
        raise ValueError("There was a problem!")


if __name__ == '__main__':
    df = extract('tests/samples', small=True, task_class=MyCustomTask)
    print(df)

    df.dropna(subset=['text'], inplace=True)  # Discard invalid documents
    print(df.iloc[0].error)
- First print:
path page doc_first_bytes ... text wrong error
0 invalid1.pdf -1 b"I'm invali" ... None None image_original:\nTraceback (most recent call l...
1 multi_page1.pdf 2 b'%PDF-1.5\n%' ... Second page None wrong:\nTraceback (most recent call last):\n ...
2 multi_page1.pdf 3 b'%PDF-1.5\n%' ... Third page None wrong:\nTraceback (most recent call last):\n ...
3 sub1/copy_multi_page1.pdf 1 b'%PDF-1.5\n%' ... First page None wrong:\nTraceback (most recent call last):\n ...
4 sub1/copy_multi_page1.pdf 3 b'%PDF-1.5\n%' ... Third page None wrong:\nTraceback (most recent call last):\n ...
5 multi_page1.pdf 1 b'%PDF-1.5\n%' ... First page None wrong:\nTraceback (most recent call last):\n ...
6 sub2/copy_single_page1.pdf 1 b'%PDF-1.5\n%' ... My beautiful sample! None wrong:\nTraceback (most recent call last):\n ...
7 sub1/copy_multi_page1.pdf 2 b'%PDF-1.5\n%' ... Second page None wrong:\nTraceback (most recent call last):\n ...
8 single_page1.pdf 1 b'%PDF-1.5\n%' ... My beautiful sample! None wrong:\nTraceback (most recent call last):\n ...
[9 rows x 8 columns]
- Second print:
wrong:
Traceback (most recent call last):
File "/home/icaro/Desktop/pdf2dataset/pdf2dataset/extract_task.py", line 32, in inner
result = feature_method(*args, **kwargs)
File "example.py", line 16, in get_wrong
raise ValueError("There was a problem!")
ValueError: There was a problem!
Notes:
- @feature is the decorator used to define new features
- The first argument to @feature must be a valid PyArrow type (complete list here)
- The exceptions param specifies a list of exceptions to be recorded in the DataFrame instead of being raised
- For this example, all available features plus the custom ones are extracted
Results File
The resulting "file" is actually a directory with the structure used by dask with the pyarrow engine; it can be easily read with pandas or dask:
Example with pandas
>>> import pandas as pd
>>> df = pd.read_parquet('my_df.parquet.gzip', engine='pyarrow')
>>> df
path page text error
index
0 single_page1.pdf 1 My beautiful sample!
1 sub1/copy_multi_page1.pdf 2 Second page
2 sub2/copy_single_page1.pdf 1 My beautiful sample!
3 sub1/copy_multi_page1.pdf 3 Third page
4 multi_page1.pdf 1 First page
5 multi_page1.pdf 3 Third page
6 sub1/copy_multi_page1.pdf 1 First page
7 multi_page1.pdf 2 Second page
0 invalid1.pdf -1 Traceback (most recent call last):\n File "/h...
There is no guarantee about the uniqueness or order of index; you might need to create a new index with the whole data in memory.
The -1 page number means it was not possible to even parse the document.
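A sketch of reading the same results directory with dask instead, and of building a fresh index afterwards (addressing the caveat above):

import dask.dataframe as dd

# Lazily read the results directory with the pyarrow engine
ddf = dd.read_parquet('my_df.parquet.gzip', engine='pyarrow')

df = ddf.compute()              # materialize as a pandas DataFrame
df = df.reset_index(drop=True)  # create a new, unique index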
Run on a Cluster
Setup the Cluster
Follow the ray documentation for manual or automatic setup.
Run it
To go distributed, you can run it just like locally, but use the --address and --redis-password flags to point to your cluster (see the ray documentation for more information).
With version >= 0.2.0, only the head node needs to have access to the documents on disk.
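For example, with placeholder values (in angle brackets) for your cluster:
$ pdf2dataset my_pdfs_dir my_df.parquet.gzip --address <head_node_address>:<port> --redis-password <password>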
CLI Help
usage: pdf2dataset [-h] [--features FEATURES] [--tmp-dir TMP_DIR] [--ocr-lang OCR_LANG] [--ocr OCR] [--chunksize CHUNKSIZE] [--image-size IMAGE_SIZE] [--ocr-image-size OCR_IMAGE_SIZE]
[--image-format IMAGE_FORMAT] [--num-cpus NUM_CPUS] [--address ADDRESS] [--webui-host WEBUI_HOST] [--redis-password REDIS_PASSWORD]
input_dir results_file
Extract text from all PDF files in a directory
positional arguments:
input_dir The folder to lookup for PDF files recursively
results_file File to save the resultant dataframe
optional arguments:
-h, --help show this help message and exit
--features FEATURES Specify a comma separated list with the features you want to extract. 'path' and 'page' will always be added. Available features to add: image, text. Examples:
'--features=text,image' or '--features=all'
--tmp-dir TMP_DIR The folder to keep all the results, including log files and intermediate files
--ocr-lang OCR_LANG Tesseract language
--ocr OCR 'pytesseract' if true, else 'pdftotext'. default: false
--chunksize CHUNKSIZE
Chunksize to use while processing pages, otherwise is calculated
--image-size IMAGE_SIZE
If adding image feature, image will be resized to this size. Provide two integers separated by 'x'. Example: --image-size 1000x1414
--ocr-image-size OCR_IMAGE_SIZE
The height of the image OCR will be applied. Width will be adjusted to keep the ratio.
--image-format IMAGE_FORMAT
Format of the image generated from the PDF pages
--num-cpus NUM_CPUS Number of cpus to use
--address ADDRESS Ray address to connect
--webui-host WEBUI_HOST
Which IP ray webui will try to listen on
--redis-password REDIS_PASSWORD
Redis password to use to connect with ray
Troubleshooting
- Troubles with high memory usage
-
Decrease the number of CPUs in use, reducing the level of parallelism, test it with
--num-cpus 1
flag and then increase according to your hardware. -
Use smaller chunksize, so less documents will be put in memory at once. Use
--chunksize 1
for having1 * num_cpus
documents in memory at once.
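For example, the most memory-conservative invocation combines both flags:
$ pdf2dataset my_pdfs_dir my_df.parquet.gzip --num-cpus 1 --chunksize 1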
How to Contribute
Just open your issues and/or pull requests, all are welcome :smiley:!