Skip to main content

PDF table & paragraph extractor

Project description

DePDF

An ultimate pdf file disintegration tool. DePDF is designed to extract tables and paragraphs into structured markup language [eg. html] from embedding pdf pages. You can also use it to convert page/pdf to html.

Built on top of pdfplumber

Table of Contents

[toc]

Installation

pip install depdf

Example

from depdf import DePDF
from depdf import DePage

# general
with DePDF.load('test/test_general.pdf') as pdf
    pdf_html = pdf.to_html
    print(pdf_html)

# with dedicated configurations
c = Config(
    debug_flag=True,
    verbose_flag=True,
    add_line_flag=True
)
pdf = DePDF.load('test/test_general.pdf', config=c)
page_index = 23  # start from zero
page = pdf_file.pages[page_index]
page_soup = page.soup
print(page_soup.text)

APIs

functions usage
extract_page_paragraphs extract paragraphs from specific page
extract_page_tables extract tables from specific page
convert_pdf_to_html convert the entire pdf to html
convert_page_to_html convert specific page to html

In-Depth

In-page elements

  • Paragraph
    • Text
    • Span
  • Table
    • Cell
  • Image

Common properties

property & method explanation
html converted html string
soup converted beautiful soup
bbox bounding box region
save_html write html tag to local file

DePDf HTML structure

<div class="{pdf_class}">
    %for <!--page-{pid}-->
        <div id="page-{}" class="{}">
            %for {html_elements} endfor%
        </div>
    endfor%
</div>

DePage HTML element structure

Paragraph

<p>
    {paragraph-content}
    <span> {span-content} </span>
    ... 
</p>

Table

<table>
    <tr>
        <td> {cell_0_0} </td>
        <td> {cell_0_1} </td>
        ...
    </tr>
    <tr colspan=2>
        <td> {cell_1_0} </td>
        ...
    </tr>
    ...
</table>

Image

<img src="temp_depdf/$prefix.png"></img>

Appendix

DePage element denotations

Useful element properties within page

page element

todo

  • add support for multiple-column pdf page
  • better table structure recognition
  • recognize embedded objects inside page elements

Project details


Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

depdf-0.1.1.tar.gz (34.0 kB view hashes)

Uploaded Source

Built Distribution

depdf-0.1.1-py3-none-any.whl (38.0 kB view hashes)

Uploaded Python 3

Supported by

AWS AWS Cloud computing and Security Sponsor Datadog Datadog Monitoring Fastly Fastly CDN Google Google Download Analytics Microsoft Microsoft PSF Sponsor Pingdom Pingdom Monitoring Sentry Sentry Error logging StatusPage StatusPage Status page