PDF Table to JSON Converter

Project description

pdf-table-extract

Extract table data from PDF files to JSON

  • Locates each table with OpenCV and reads its contents with a text reader. (Your table must be enclosed by a border.)

  • (If your table has no border, add one through adjustment.)

  • Currently, only basic tables are supported (tables with horizontal headers only).

  • The number of headers and the number of cells in each row must be the same.

    Header 1 Header 2 Header 3
    cel1 cel2 cel3
    cel1 cel2 cel3
    cel1 cel2 cel3
  • The PDF must be readable by a text reader. Drag over the text in the PDF to check that it can be selected.
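
The header and cell count requirement above can be illustrated with a minimal sketch. It assumes a table has already been read as a list of rows with the header first. The function name `rows_to_records` and the list-of-objects JSON layout are hypothetical and chosen for illustration; they are not necessarily what pdf-table2json produces.

```python
import json


def rows_to_records(rows):
    """Map a header row plus data rows to a list of dicts (hypothetical helper)."""
    header, *body = rows
    records = []
    for index, cells in enumerate(body, start=1):
        # A row with a different number of cells than the header cannot be mapped.
        if len(cells) != len(header):
            raise ValueError(
                f"row {index} has {len(cells)} cells, expected {len(header)}"
            )
        records.append(dict(zip(header, cells)))
    return records


table = [
    ["Header 1", "Header 2", "Header 3"],
    ["cel1", "cel2", "cel3"],
    ["cel1", "cel2", "cel3"],
]
print(json.dumps(rows_to_records(table), indent=2))
```

A row with a missing or extra cell raises `ValueError` here, which is one way to surface tables that break the rule.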

Current Status (change the values if adjustments are needed)

  • Finds tables with a width greater than 1000 and a height greater than 100.
  • Excludes a cell if its width or height equals the width or height of the table, or if its width or height is less than 10.
  • Adds border lines to areas with a color of (230, 230, 230) and a width of 1000 or more so that they are recognized as table regions.
  • Removes watermark images with a color of (213, 213, 213), i.e. #D5D5D5.
  • Removes a specific list of strings from the PDF text in order to strip text watermarks (the list is currently empty).
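
The size rules and the text-watermark rule above can be sketched in plain Python. This is an illustration only: the constant and function names are hypothetical, the units are assumed to be pixels of the rendered page, and the package's own implementation may differ.

```python
# Thresholds taken from the list above; the names are hypothetical.
MIN_TABLE_WIDTH = 1000
MIN_TABLE_HEIGHT = 100
MIN_CELL_SIZE = 10


def is_table(width, height):
    """A region counts as a table only if it is wide and tall enough."""
    return width > MIN_TABLE_WIDTH and height > MIN_TABLE_HEIGHT


def is_cell(cell_w, cell_h, table_w, table_h):
    """Reject boxes that span the whole table or are too small to be a cell."""
    if cell_w == table_w or cell_h == table_h:
        return False
    if cell_w < MIN_CELL_SIZE or cell_h < MIN_CELL_SIZE:
        return False
    return True


def strip_watermarks(text, watermarks=()):
    """Remove each listed watermark string; the default list is empty."""
    for mark in watermarks:
        text = text.replace(mark, "")
    return text


print(is_table(1200, 300))          # wide and tall enough
print(is_cell(1200, 40, 1200, 300)) # as wide as the table, so excluded
```

Raising or lowering the three constants is the kind of adjustment this section refers to when a document's tables are smaller or its cell borders are thicker than expected.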

Installation

  • Requires Python >= 3.8
  • Install with pip:
pip install pdf-table2json

Example

import

import pdf_table2json.converter as converter

path = "PATH/PDF_NAME.pdf"
result = converter.main(path, json_file_out=True, image_file_out=True)
print(result)

CLI

python converter.py -i "pdf_path/pdf_name.pdf" [-j] [-o]
  • "-i", "--input", required=True, help="[Required] Input PDF file path"
  • "-j", "--json_file", action="store_true", help="[Optional] Create JSON Data file"
  • "-o", "--image_file", action="store_true", help="[Optional] Save Image Data file"
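
The three options listed above correspond to standard `argparse` arguments. The sketch below rebuilds a parser from that list to show how the flags behave. It is a reconstruction, not the contents of the package's `converter.py`, and the `build_parser` name and description string are hypothetical.

```python
import argparse


def build_parser():
    """Build a parser with the three options listed above (hypothetical helper)."""
    parser = argparse.ArgumentParser(
        description="Extract table data from a PDF file to JSON"
    )
    parser.add_argument("-i", "--input", required=True,
                        help="[Required] Input PDF file path")
    parser.add_argument("-j", "--json_file", action="store_true",
                        help="[Optional] Create JSON Data file")
    parser.add_argument("-o", "--image_file", action="store_true",
                        help="[Optional] Save Image Data file")
    return parser


# Parse a sample command line: -j is given, -o is not.
args = build_parser().parse_args(["-i", "pdf_path/pdf_name.pdf", "-j"])
print(args.input, args.json_file, args.image_file)
```

Because `-j` and `-o` use `action="store_true"`, they take no value and default to `False` when omitted.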

Colab

[Open In Colab]

License

  • GPL-3.0 license

Contact

Read Text From PDF library

