Easily convert a directory with a large volume of PDF documents into a dataset; supports extracting text and images.
pdf2dataset
Converts a whole directory with any volume (small or huge) of PDF documents to a dataset (pandas DataFrame). No need to set up any external service (no database, message broker, etc.). Just install it and run!
Main features
- Conversion of a whole directory of PDF documents into a pandas DataFrame
- Support for parallel and distributed processing through ray
- Extractions are performed per page, making task distribution more uniform, even for documents with very different page counts
- Incremental writing of the resulting DataFrame, making it possible to process data bigger than memory
- Error tracking of faulty documents
- Ability to save processing progress and resume from it
- Ability to extract text through pdftotext
- Ability to use OCR for extracting text through pytesseract
- Ability to extract images through pdf2image
- Support for implementing custom feature extraction
- Highly customizable behavior through params
Installation
Install Dependencies
Fedora
# "-por" for portuguese, use the documents language
$ sudo dnf install -y gcc-c++ poppler-utils pkgconfig poppler-cpp-devel python3-devel tesseract-langpack-por
Ubuntu (or other Debian-based distributions)
$ sudo apt update
# "-por" for portuguese, use the documents language
$ sudo apt install -y build-essential poppler-utils libpoppler-cpp-dev pkg-config python3-dev tesseract-ocr-por
Install pdf2dataset
For usage
$ pip3 install pdf2dataset --user # Please isolate the environment
For development
# First, install poetry, then clone the repository and cd into it
$ poetry install
Usage
Simple - CLI
# Note: path, page and error will always be present in the resulting DataFrame
# Reads all PDFs from my_pdfs_dir and saves the resulting DataFrame to my_df.parquet.gzip
$ pdf2dataset my_pdfs_dir my_df.parquet.gzip # Most basic, extract all possible features
$ pdf2dataset my_pdfs_dir my_df.parquet.gzip --features=text # Extract just text
$ pdf2dataset my_pdfs_dir my_df.parquet.gzip --features=image # Extract just image
$ pdf2dataset my_pdfs_dir my_df.parquet.gzip --num-cpus 1 # Reduce parallelism to the minimum
$ pdf2dataset my_pdfs_dir my_df.parquet.gzip --ocr true # For scanned PDFs
$ pdf2dataset my_pdfs_dir my_df.parquet.gzip --ocr true --lang eng # For scanned documents with English text
Save Processing Progress - CLI
It's possible to save the progress to a temporary folder and resume from the saved state in case of any error or interruption. To do so, just use the --tmp-dir [directory] flag:
$ pdf2dataset my_pdfs_dir my_df.parquet.gzip --tmp-dir my_progress
The indicated temporary directory can also be used for debugging purposes. It is not deleted automatically, so remove it when no longer needed.
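To resume an interrupted run, just re-run the same command pointing to the same temporary directory:
$ pdf2dataset my_pdfs_dir my_df.parquet.gzip --tmp-dir my_progress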
Using as a library
Main functions
There are some helper functions to facilitate pdf2dataset usage:
- extract: can be used analogously to the CLI
- extract_text: extract wrapper with features='text'
- extract_image: extract wrapper with features='image'
- image_from_bytes: (pdf2image.utils) get a Pillow Image object given the image bytes
- image_to_bytes: (pdf2image.utils) get the image bytes given a Pillow Image object
Basic example
from pdf2dataset import extract
extract('my_pdfs_dir', 'all_features.parquet.gzip', tmp_dir='my_progress')
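The image helpers can be combined in the same way. A minimal sketch, assuming small=True returns the DataFrame directly (see "Small data" below) and that the extracted image bytes land in an image column (the column name is an assumption):

from pdf2dataset import extract_image, image_from_bytes

# Extract only the image feature; with small=True the DataFrame is returned
df = extract_image('my_pdfs_dir', small=True)

# Convert the stored bytes (assumed 'image' column) into a Pillow Image
first_page_image = image_from_bytes(df.iloc[0].image)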
Small data
One feature, not available through the CLI, is the custom behavior for handling small volumes of data ("small" meaning the extraction won't run for hours or days and doesn't need to be distributed).
The complete list of differences is:
- Faster initialization (uses multiprocessing instead of ray)
- Doesn't save processing progress
- Doesn't support distributed processing
- Doesn't write the DataFrame to disk
- Returns the DataFrame
Example:
from pdf2dataset import extract_text
df = extract_text('my_pdfs_dir', small=True)
# ...
Pass files from memory
If you don't want to specify a directory for the documents, you can instead specify the tasks to be processed.
Each task is of the form (document_name, document_bytes, page_number) or just (document_name, document_bytes). document_name must end with .pdf but doesn't need to be a real file, document_bytes are the bytes of the PDF document, and page_number is the number of the page to process (all pages, if not specified).
Example:
from pdf2dataset import extract_text
tasks = [
('a.pdf', a_bytes), # Processing all pages of this document
('b.pdf', b_bytes, 1),
('b.pdf', b_bytes, 2),
]
# 'df' will contain results from all pages of 'a.pdf' and pages 1 and 2 of 'b.pdf'
df = extract_text(tasks, small=True)
# ...
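A minimal sketch of building such tasks, reading bytes from files on disk just to illustrate where document_bytes could come from (the directory name is hypothetical):

from pathlib import Path
from pdf2dataset import extract_text

# Build in-memory tasks from the PDF files found under a directory
tasks = [
    (path.name, path.read_bytes())  # no page number: process all pages
    for path in Path('my_pdfs_dir').rglob('*.pdf')
]
df = extract_text(tasks, small=True)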
Returning a list
If you don't want to handle the DataFrame, it is possible to return a nested list with the feature values. The structure of the resulting list is:
result = List[documents]
documents = List[pages]
pages = List[features]
features = List[feature]
feature = any
- any is any type supported by pyarrow
- Features are ordered by feature name (text, image, etc.)
Example:
>>> from pdf2dataset import extract_text
>>> extract_text('tests/samples', return_list=True)
[[[None]],
[['First page'], ['Second page'], ['Third page']],
[['My beautiful sample!']],
[['First page'], ['Second page'], ['Third page']],
[['My beautiful sample!']]]
- Features with error will have None as their result
- Here, extract_text was used, so the only feature is text
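A short sketch of walking this nested structure, using the same call as above:

from pdf2dataset import extract_text

result = extract_text('tests/samples', return_list=True)

# documents -> pages -> features (only 'text' here, since extract_text was used)
for document in result:
    for page in document:
        for feature_value in page:
            print(feature_value)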
Custom Features
With version >= 0.4.0, it is also possible to easily implement the extraction of custom features.
Example:
This is the structure:
from pdf2dataset import extract, feature, PdfExtractTask


class MyCustomTask(PdfExtractTask):

    @feature('bool_')
    def get_is_page_even(self):
        return self.page % 2 == 0

    @feature('binary')
    def get_doc_first_bytes(self):
        return self.file_bin[:10]

    @feature('string', exceptions=[ValueError])
    def get_wrong(self):
        raise ValueError("There was a problem!")


if __name__ == '__main__':
    df = extract('tests/samples', small=True, task_class=MyCustomTask)
    print(df)

    df.dropna(subset=['text'], inplace=True)  # Discard invalid documents
    print(df.iloc[0].error)
- First print:
path page doc_first_bytes ... text wrong error
0 invalid1.pdf -1 b"I'm invali" ... None None image_original:\nTraceback (most recent call l...
1 multi_page1.pdf 2 b'%PDF-1.5\n%' ... Second page None wrong:\nTraceback (most recent call last):\n ...
2 multi_page1.pdf 3 b'%PDF-1.5\n%' ... Third page None wrong:\nTraceback (most recent call last):\n ...
3 sub1/copy_multi_page1.pdf 1 b'%PDF-1.5\n%' ... First page None wrong:\nTraceback (most recent call last):\n ...
4 sub1/copy_multi_page1.pdf 3 b'%PDF-1.5\n%' ... Third page None wrong:\nTraceback (most recent call last):\n ...
5 multi_page1.pdf 1 b'%PDF-1.5\n%' ... First page None wrong:\nTraceback (most recent call last):\n ...
6 sub2/copy_single_page1.pdf 1 b'%PDF-1.5\n%' ... My beautiful sample! None wrong:\nTraceback (most recent call last):\n ...
7 sub1/copy_multi_page1.pdf 2 b'%PDF-1.5\n%' ... Second page None wrong:\nTraceback (most recent call last):\n ...
8 single_page1.pdf 1 b'%PDF-1.5\n%' ... My beautiful sample! None wrong:\nTraceback (most recent call last):\n ...
[9 rows x 8 columns]
- Second print:
wrong:
Traceback (most recent call last):
File "/home/icaro/Desktop/pdf2dataset/pdf2dataset/extract_task.py", line 32, in inner
result = feature_method(*args, **kwargs)
File "example.py", line 16, in get_wrong
raise ValueError("There was a problem!")
ValueError: There was a problem!
Notes:
- @feature is the decorator used to define new features
- The first argument to @feature must be a valid PyArrow type (complete list here)
- The exceptions param specifies a list of exceptions to be recorded in the DataFrame instead of being raised
- For this example, all available features plus the custom ones are extracted
Results File
The resulting "file" is actually a directory with the structure used by dask with the pyarrow engine; it can be easily read with pandas or dask:
Example with pandas
>>> import pandas as pd
>>> df = pd.read_parquet('my_df.parquet.gzip', engine='pyarrow')
>>> df
path page text error
index
0 single_page1.pdf 1 My beautiful sample!
1 sub1/copy_multi_page1.pdf 2 Second page
2 sub2/copy_single_page1.pdf 1 My beautiful sample!
3 sub1/copy_multi_page1.pdf 3 Third page
4 multi_page1.pdf 1 First page
5 multi_page1.pdf 3 Third page
6 sub1/copy_multi_page1.pdf 1 First page
7 multi_page1.pdf 2 Second page
0 invalid1.pdf -1 Traceback (most recent call last):\n File "/h...
There is no guarantee about the uniqueness or order of index; you might need to create a new index with the whole data in memory.
The -1 page number means it was not possible to even parse the document.
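A sketch of reading the same results directory with dask instead, and of building a fresh index afterwards (addressing the caveat above):

import dask.dataframe as dd

# Lazily read the results directory with the pyarrow engine
ddf = dd.read_parquet('my_df.parquet.gzip', engine='pyarrow')

df = ddf.compute()              # materialize as a pandas DataFrame
df = df.reset_index(drop=True)  # create a new, unique index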
Run on a Cluster
Setup the Cluster
Follow the ray documentation for manual or automatic setup.
Run it
To go distributed, you can run it just like locally, but use the --address and --redis-password flags to point to your cluster (see the ray documentation for more information).
With version >= 0.2.0, only the head node needs to have access to the documents on disk.
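For example, with placeholder values (in angle brackets) for your cluster:
$ pdf2dataset my_pdfs_dir my_df.parquet.gzip --address <head_node_address>:<port> --redis-password <password>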
CLI Help
usage: pdf2dataset [-h] [--features FEATURES] [--tmp-dir TMP_DIR] [--ocr-lang OCR_LANG] [--ocr OCR] [--chunksize CHUNKSIZE] [--image-size IMAGE_SIZE] [--ocr-image-size OCR_IMAGE_SIZE]
[--image-format IMAGE_FORMAT] [--num-cpus NUM_CPUS] [--address ADDRESS] [--webui-host WEBUI_HOST] [--redis-password REDIS_PASSWORD]
input_dir results_file
Extract text from all PDF files in a directory
positional arguments:
input_dir The folder to lookup for PDF files recursively
results_file File to save the resultant dataframe
optional arguments:
-h, --help show this help message and exit
--features FEATURES Specify a comma separated list with the features you want to extract. 'path' and 'page' will always be added. Available features to add: image, text. Examples:
'--features=text,image' or '--features=all'
--tmp-dir TMP_DIR The folder to keep all the results, including log files and intermediate files
--ocr-lang OCR_LANG Tesseract language
--ocr OCR 'pytesseract' if true, else 'pdftotext'. default: false
--chunksize CHUNKSIZE
Chunksize to use while processing pages, otherwise is calculated
--image-size IMAGE_SIZE
If adding image feature, image will be resized to this size. Provide two integers separated by 'x'. Example: --image-size 1000x1414
--ocr-image-size OCR_IMAGE_SIZE
The height of the image OCR will be applied. Width will be adjusted to keep the ratio.
--image-format IMAGE_FORMAT
Format of the image generated from the PDF pages
--num-cpus NUM_CPUS Number of cpus to use
--address ADDRESS Ray address to connect
--webui-host WEBUI_HOST
Which IP ray webui will try to listen on
--redis-password REDIS_PASSWORD
Redis password to use to connect with ray
Troubleshooting
- Troubles with high memory usage
-
Decrease the number of CPUs in use, reducing the level of parallelism, test it with
--num-cpus 1
flag and then increase according to your hardware. -
Use smaller chunksize, so less documents will be put in memory at once. Use
--chunksize 1
for having1 * num_cpus
documents in memory at once.
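For example, the most memory-conservative invocation combines both flags:
$ pdf2dataset my_pdfs_dir my_df.parquet.gzip --num-cpus 1 --chunksize 1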
How to Contribute
Just open your issues and/or pull requests, all are welcome :smiley:!