Burdoc: Advanced PDF Parsing for Python
A python library for extracting structured text, images, and tables from PDFs with context and reading order.
Table Of Contents
- Table Of Contents
- About the Project
- Quickstart
- Usage
- Roadmap
- Built With
- Contributing
- License
- Authors
- Acknowledgements
About the Project
Why Another PDF Parsing Library?
Excellent question! Between pdfminer, PyMuPDF, Tika, and many others there is a plethora of tools for parsing PDFs, but nearly all focus on the initial step of pulling out raw content, not on representing the document's actual meaning. Burdoc's goal is to generate a rich semantic representation of a PDF, including headings, reading order, tables, and images, that can be used for downstream processing.
Key Features
- Rich Document Representation: Burdoc is able to identify most common types of text, including:
  - Paragraphs
  - Headings
  - Lists (ordered and unordered)
  - Headers, footers, and sidebars
  - Visual asides such as read-out boxes
- Structured Output: Burdoc generates a comprehensive JSON representation of the text. Unlike many other tools, it preserves information such as metadata, fonts, and original bounding boxes to give downstream users as much information as they need.
- Complex Reading Order Inference: Burdoc uses a multi-stage algorithm to infer reading order even in complex pages with changing numbers of columns, split sections, and asides.
- ML-Powered Table Extraction: Burdoc uses transformer-based machine learning models to identify tables, alongside a rules-based approach to identify inline tables.
- Large Documents: By relying on PyMuPDF rather than pdfminer, the core PDF reading task is substantially faster than in other libraries, and Burdoc can handle large files (thousands of pages or hundreds of MB in size) with ease. Running a single page through Burdoc can be quite slow, taking around 2s, because of expensive initialisation requirements, but with GPU acceleration and multithreading support it can process documents at 0.2-0.5s/page.
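To make the reading order problem concrete, here is a deliberately simplified sketch. It is not Burdoc's multi-stage algorithm; it only shows the basic idea of grouping text blocks into columns by horizontal overlap and then reading each column top to bottom. The `Block` class and `reading_order` function are hypothetical names invented for this illustration, and the coordinates assume an origin at the top-left of the page with y increasing downwards.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Block:
    """Hypothetical text block with a bounding box (not a Burdoc type)."""
    text: str
    x0: float
    y0: float
    x1: float
    y1: float


def reading_order(blocks):
    """Toy ordering: columns left to right, blocks top to bottom within each."""
    columns = []
    # Visit blocks from left to right so columns are created in reading order.
    for block in sorted(blocks, key=lambda b: b.x0):
        for column in columns:
            col_x0 = min(b.x0 for b in column)
            col_x1 = max(b.x1 for b in column)
            # A block joins the first column it overlaps horizontally.
            if block.x0 < col_x1 and block.x1 > col_x0:
                column.append(block)
                break
        else:
            columns.append([block])

    ordered = []
    for column in columns:
        ordered.extend(sorted(column, key=lambda b: b.y0))
    return ordered


page = [
    Block("right-top", 310, 50, 560, 200),
    Block("left-bottom", 40, 220, 290, 400),
    Block("left-top", 40, 50, 290, 200),
    Block("right-bottom", 310, 220, 560, 400),
]
print([b.text for b in reading_order(page)])
```

A page-width heading would overlap every column and merge them all in this toy version, which is one reason a real implementation needs several stages to cope with changing column counts, split sections, and asides.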
Limitations
- OCR: As Burdoc relies on high-precision font and location information for its processing, it is likely to perform badly when parsing OCR'd files.
- Right-to-Left Text: All parsing is for left-to-right languages only.
- Complex Figures: Areas with large amounts of text arranged around figures in an arbitrary fashion will not be extracted correctly.
- Forms: Currently Burdoc has no way to recognise complex forms.
Quickstart
More detailed information on running Burdoc can be found in the Docs.
Prerequisites
ML Prerequisites
The transformer-based table detection used by Burdoc by default can be quite slow on CPU, often taking several seconds per page. You'll see a large performance increase by running it on a GPU. To avoid messing around with package versions after the fact, it's generally better to install GPU drivers and a GPU-accelerated version of PyTorch first, if available.
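Before installing Burdoc, it can help to confirm that a GPU-enabled PyTorch is already in place. The check below is a minimal sketch and not part of Burdoc: `gpu_available` is a hypothetical helper name, and the only PyTorch call it relies on is `torch.cuda.is_available()`. It returns False if PyTorch is not installed at all.

```python
import importlib.util


def gpu_available() -> bool:
    """Return True only if PyTorch is installed and can see a CUDA device."""
    if importlib.util.find_spec("torch") is None:
        # PyTorch is not installed, so no GPU acceleration is possible yet.
        return False
    import torch

    return bool(torch.cuda.is_available())


if __name__ == "__main__":
    print("GPU acceleration available:", gpu_available())
```

If this prints False on a machine with a GPU, installing the drivers and a GPU build of PyTorch before installing Burdoc is likely the simpler route.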
Installation
To install burdoc from pip
pip install burdoc
To build it directly from source
git clone https://github.com/jennis0/burdoc
cd burdoc
pip install .
Developer Install
To reproduce the development environment for running builds, tests, etc. use
pip install "burdoc[dev]"
or
git clone https://github.com/jennis0/burdoc
cd burdoc
pip install -e ".[dev]"
Usage
Burdoc can be used as a library or directly from the command line, depending on your use case.
Command Line
usage: burdoc [-h] [--pages PAGES] [--html] [--detailed] [--no-ml-tables] [--images] [--single-threaded] [--profile] [--debug] in_file [out_file]
positional arguments:
in_file Path to the PDF file you want to parse
out_file Path to file to write output to. Defaults to [in-file-stem].json/[in-file-stem].html
optional arguments:
-h, --help show this help message and exit
--pages PAGES List of pages to process. Accepts comma separated list and ranges specified with '-'
--html Output a simple HTML representation of the document, rather than the JSON content.
--detailed Include BoundingBoxes and font statistics in the output to aid onward processing
--no-ml-tables Turn off ML table finding. Defaults to False.
--images Extract images from PDF and store in output. This can lead to very large output JSON files. Default is False
--single-threaded Force Burdoc to run in single-threaded mode. Default to off
--profile Dump timing information at end of processing
--debug Dump debug messages to log
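The `--pages` option accepts a comma-separated list with ranges written using '-'. The sketch below shows one way such a specification could be parsed. It is not Burdoc's own parser: `parse_pages` is a hypothetical function, and it assumes ranges are inclusive at both ends and that page numbers are passed through unchanged, since the help text does not say whether numbering starts at 0 or 1.

```python
def parse_pages(spec: str) -> list[int]:
    """Expand a page spec such as '1,3-5,8' into a sorted list of page numbers."""
    pages: set[int] = set()
    for part in spec.split(","):
        part = part.strip()
        if not part:
            continue  # tolerate stray commas
        if "-" in part:
            start_text, end_text = part.split("-", 1)
            start, end = int(start_text), int(end_text)
            if start > end:
                raise ValueError(f"invalid page range: {part!r}")
            # Assumption: both ends of the range are included.
            pages.update(range(start, end + 1))
        else:
            pages.add(int(part))
    return sorted(pages)


print(parse_pages("1,3-5,8"))  # [1, 3, 4, 5, 8]
```

Collecting pages in a set means overlapping entries such as `1-3,2` are processed only once.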
Library
from burdoc import BurdocParser

# All arguments are optional; the values shown are the defaults.
parser = BurdocParser(
    detailed=False,  # bool: include detailed information such as font statistics and bounding boxes in the output
    skip_ml_table_finding=False,  # bool: set to True to skip the ML table finding algorithms
    ignore_images=False,  # bool: don't extract any images from the document. Much faster but prone to errors if images are used as layout elements
    max_threads=None,  # int | None: maximum number of threads to run. None uses default system limits, 1 forces single-threaded mode
    log_level=20,  # int: 20 is logging.INFO
    show_pages=False,  # bool: draw each page as it's extracted with extraction information laid on top. Primarily for debugging
)

# Parse the PDF and return the extracted content
content = parser.read('file.pdf')
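When using the library you need to save the result yourself. The sketch below mirrors the command line's default output naming ([in-file-stem].json or [in-file-stem].html). Both helper names are hypothetical, and it assumes that the value returned by `parser.read` is JSON-serialisable and that the output should sit next to the input file, neither of which is confirmed here.

```python
import json
from pathlib import Path


def default_output_path(in_file: str, html: bool = False) -> Path:
    """Swap the input file's extension for .json (or .html)."""
    return Path(in_file).with_suffix(".html" if html else ".json")


def save_content(content, in_file: str) -> Path:
    """Write parsed content as JSON next to the input file and return the path."""
    out_path = default_output_path(in_file)
    # Assumption: `content` only contains JSON-serialisable values.
    out_path.write_text(json.dumps(content, indent=2), encoding="utf-8")
    return out_path
```

With the parser above, this would be called as `save_content(content, 'file.pdf')`, which writes `file.json`.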
Roadmap
Current issues I'd like to address are:
- Improved Table Extraction - Table extraction is currently quite poor. I'd like to adopt some of the line-based methods used by Camelot and similar tools.
- Improved Headers/Footers/Sidebars - The current approach is quite conservative and will often miss obvious headers/footers. We also currently don't include this information in the final content.
- ToC Alignment - The extracted Page Hierarchy is functionally a table of contents. Work is needed to align this with ToCs that exist within the document.
- Image Usage Classification - The current image classifier is quite poor and doesn't distinguish between 'content' and
- Out-of-line Ordering - Ordering of out-of-line elements, such as page-width tables and images, is somewhat random.
- Captions - We should be able to identify when a piece of text is tied to an image or figure.
Built With
Contributing
Contributions are what make the open source community such an amazing place to learn, inspire, and create. Any contributions you make are greatly appreciated.
- If you have suggestions for adding or removing projects, feel free to open an issue to discuss it, or directly create a pull request after you edit the README.md file with necessary changes.
- Please make sure you check your spelling and grammar.
- Create an individual PR for each suggestion.
Creating A Pull Request
- Fork the Project
- Create your Feature Branch (git checkout -b feature/AmazingFeature)
- Commit your Changes (git commit -m 'Add some AmazingFeature')
- Push to the Branch (git push origin feature/AmazingFeature)
- Open a Pull Request
License
Distributed under the MIT License. See LICENSE for more information.
Authors
- jennis0 - Github Profile
Acknowledgements
- ShaanCoding - ReadME-Generator
- ImgShields