
PDF table & paragraph extractor

Project description

DePDF

A PDF disintegration tool. DePDF extracts tables and paragraphs from PDF pages into structured markup (e.g. HTML). You can also use it to convert a single page or an entire PDF to HTML.

Built on top of pdfplumber


Installation

pip install depdf

Example

from depdf import DePDF
from depdf import DePage
from depdf import Config  # assumed to be exported at the package top level

# general
with DePDF.load('test/test_general.pdf') as pdf:
    pdf_html = pdf.to_html
    print(pdf_html)

# with dedicated configurations
c = Config(
    debug_flag=True,
    verbose_flag=True,
    add_line_flag=True
)
pdf = DePDF.load('test/test_general.pdf', config=c)
page_index = 23  # page indices start from zero
page = pdf.pages[page_index]
page_soup = page.soup
print(page_soup.text)

APIs

| functions | usage |
| --- | --- |
| extract_page_paragraphs | extract paragraphs from specific page |
| extract_page_tables | extract tables from specific page |
| convert_pdf_to_html | convert the entire pdf to html |
| convert_page_to_html | convert specific page to html |

In-Depth

In-page elements

  • Paragraph
    • Text
    • Span
  • Table
    • Cell
  • Image

Common properties

| property & method | explanation |
| --- | --- |
| html | converted HTML string |
| soup | converted BeautifulSoup object |
| bbox | bounding box region |
| save_html | write the HTML tag to a local file |
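
The `bbox` property listed above can be used to reason about where elements sit on a page, for example to check whether a paragraph falls inside a table region. The sketch below assumes the `(x0, top, x1, bottom)` tuple layout that pdfplumber uses; this README does not state DePDF's layout, and `bbox_contains` is a hypothetical helper, not part of DePDF.

```python
def bbox_contains(outer, inner):
    """Return True if `inner` lies entirely within `outer`.

    Both boxes are (x0, top, x1, bottom) tuples, the layout pdfplumber uses.
    That DePDF elements share this layout is an assumption.
    """
    ox0, otop, ox1, obottom = outer
    ix0, itop, ix1, ibottom = inner
    return ox0 <= ix0 and otop <= itop and ix1 <= ox1 and ibottom <= obottom


if __name__ == "__main__":
    table_box = (0, 0, 100, 100)      # illustrative values only
    paragraph_box = (10, 10, 20, 20)
    print(bbox_contains(table_box, paragraph_box))
```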

DePDF HTML structure

<div class="{pdf_class}">
    %for <!--page-{pid}-->
        <div id="page-{}" class="{}">
            %for {html_elements} endfor%
        </div>
    endfor%
</div>
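
The template above amounts to two nested loops: one over pages, one over the elements of each page. A minimal standard-library sketch of that nesting follows. The function name `render_pdf_html`, the default class names, and numbering pages from zero are assumptions made for illustration; this is not DePDF's actual implementation.

```python
from html import escape


def render_pdf_html(pages, pdf_class="pdf-main", page_class="pdf-page"):
    """Wrap per-page HTML fragments in the nested <div> layout shown above.

    `pages` is a list of pages; each page is a list of already-rendered
    HTML element strings (paragraphs, tables, images).
    """
    parts = ['<div class="{}">'.format(escape(pdf_class, quote=True))]
    for pid, elements in enumerate(pages):
        # A comment marker precedes each page container.
        parts.append("<!--page-{}-->".format(pid))
        parts.append(
            '<div id="page-{}" class="{}">'.format(pid, escape(page_class, quote=True))
        )
        parts.extend(elements)
        parts.append("</div>")
    parts.append("</div>")
    return "".join(parts)


if __name__ == "__main__":
    print(render_pdf_html([["<p>first page</p>"], ["<p>second page</p>"]]))
```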

DePage HTML element structure

Paragraph

<p>
    {paragraph-content}
    <span> {span-content} </span>
    ... 
</p>

Table

<table>
    <tr>
        <td> {cell_0_0} </td>
        <td> {cell_0_1} </td>
        ...
    </tr>
    <tr colspan=2>
        <td> {cell_1_0} </td>
        ...
    </tr>
    ...
</table>
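
Because tables are emitted as plain `<table>`/`<tr>`/`<td>` markup, the output can be read back into rows with the standard library alone. The sketch below is one way to do that; `TableRows` and `table_to_rows` are hypothetical names, and the parser assumes a single, non-nested table.

```python
from html.parser import HTMLParser


class TableRows(HTMLParser):
    """Collect the text of every <td> cell, grouped by <tr> row."""

    def __init__(self):
        super().__init__()
        self.rows = []          # one list of cell strings per <tr>
        self._in_cell = False   # True while inside a <td>

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self.rows.append([])
        elif tag == "td" and self.rows:
            self.rows[-1].append("")
            self._in_cell = True

    def handle_endtag(self, tag):
        if tag == "td":
            self._in_cell = False

    def handle_data(self, data):
        if self._in_cell:
            self.rows[-1][-1] += data


def table_to_rows(table_html):
    """Return the table as a list of rows, each a list of stripped cell texts."""
    parser = TableRows()
    parser.feed(table_html)
    parser.close()
    return [[cell.strip() for cell in row] for row in parser.rows]


if __name__ == "__main__":
    sample = "<table><tr><td> a </td><td> b </td></tr><tr><td> c </td></tr></table>"
    print(table_to_rows(sample))
```

Rows keep their own length, so a row with a merged cell simply yields fewer entries than its neighbours.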

Image

<img src="temp_depdf/$prefix.png"></img>

Appendix

DePage element denotations

Useful element properties within page

page element

todo

  • add support for multiple-column pdf page
  • better table structure recognition
  • recognize embedded objects inside page elements
