
PDF Table to JSON Converter

Project description

pdf-table2json

Extract table data from PDF files to JSON

  • Locates each table with OpenCV and reads its contents with a text reader, so the table must be enclosed by a border.
  • (If your table has no border, add one through adjustment.)
  • The PDF must be readable by a text reader. Drag across the PDF to check whether the text can be selected.
  • Check the version before installing:
    • 1.0.1: Only basic tables are supported
    • 2.0.1: Handles tables with split headers or split cells (example)

Current Status (change the values if adjustments are needed)

  • Detects tables with a width greater than 1000 and a height greater than 100.
  • Excludes a cell if its width or height equals the width or height of the table, or if its width or height is less than 10.
  • Adds border lines to areas with a color of (230, 230, 230) and a width of 1000 or more so that they are recognized as table regions.
  • Removes watermark images with a color of (213, 213, 213), i.e. #D5D5D5.
  • Removes every string in a specific list from the PDF text in order to strip text watermarks (the list is currently empty).
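The size rules above can be sketched in plain Python. This is only an illustration of the thresholds, not the library's code: the names `MIN_TABLE_WIDTH`, `MIN_TABLE_HEIGHT`, `MIN_CELL_SIZE`, `is_table`, and `filter_cells` are hypothetical, and boxes are assumed to be `(x, y, width, height)` tuples in whatever unit the detector reports.

```python
from typing import List, Tuple

# Hypothetical box layout: (x, y, width, height).
Box = Tuple[int, int, int, int]

# Thresholds taken from the list above; the constant names are assumptions.
MIN_TABLE_WIDTH = 1000
MIN_TABLE_HEIGHT = 100
MIN_CELL_SIZE = 10


def is_table(box: Box) -> bool:
    """A region counts as a table only if it is wider than 1000 and taller than 100."""
    _, _, width, height = box
    return width > MIN_TABLE_WIDTH and height > MIN_TABLE_HEIGHT


def keep_cell(cell: Box, table: Box) -> bool:
    """Apply the two exclusion rules for cells."""
    _, _, cell_w, cell_h = cell
    _, _, table_w, table_h = table
    # A cell as wide or as tall as the table is the outer frame or a full-span line.
    if cell_w == table_w or cell_h == table_h:
        return False
    # Very small boxes are treated as noise.
    if cell_w < MIN_CELL_SIZE or cell_h < MIN_CELL_SIZE:
        return False
    return True


def filter_cells(cells: List[Box], table: Box) -> List[Box]:
    """Return only the cells that survive both exclusion rules."""
    return [cell for cell in cells if keep_cell(cell, table)]
```

If your PDFs use smaller tables or thinner cells, these are the values you would most likely need to adjust.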

Installation

  • Requires Python >= 3.8
  • Install with pip:
pip install pdf-table2json 

Example

Import

import pdf_table2json.converter as converter

path = "PATH/PDF_NAME.pdf"
result = converter.main(path, json_file_out=True, image_file_out=True)
print(result)

CLI

python converter.py -i "pdf_path/pdf_name.pdf" [-j] [-o]
  • -i, --input: [Required] Input PDF file path
  • -j, --json_file: [Optional] Flag; create a JSON data file
  • -o, --image_file: [Optional] Flag; save an image data file
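For reference, the three options can be reproduced with the standard `argparse` module. This is a minimal sketch built from the option list above; the `build_parser` function name and the parser description are assumptions, and the real converter.py may wire the arguments differently.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Build a parser with the three documented options (sketch only)."""
    parser = argparse.ArgumentParser(description="PDF table to JSON converter")
    parser.add_argument("-i", "--input", required=True,
                        help="[Required] Input PDF file path")
    parser.add_argument("-j", "--json_file", action="store_true",
                        help="[Optional] Create JSON Data file")
    parser.add_argument("-o", "--image_file", action="store_true",
                        help="[Optional] Save Image Data file")
    return parser


# Parse an explicit argument list so the example runs without a real command line.
args = build_parser().parse_args(["-i", "pdf_path/pdf_name.pdf", "-j"])
print(args.input, args.json_file, args.image_file)
# prints: pdf_path/pdf_name.pdf True False
```

Because `-j` and `-o` are `store_true` flags, they take no value and default to `False` when omitted.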

version-1.0.1

  • Only basic tables are supported (tables with horizontal headers only).

  • The number of headers must equal the number of cells in each row.

  • Example Table

    Header 1 Header 2 Header 3
    cel1 cel2 cel3
    cel1 cel2 cel3
    cel1 cel2 cel3
  • Example

    • converter.py
    import pdf_table2json.converter as converter
    
    path = "PATH/PDF_NAME.pdf"
    result = converter.main(path, json_file_out=True, image_file_out=True)
    print(result)
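To make the "same number of headers and cells" requirement concrete, here is a minimal sketch of how a basic table could map to JSON. It assumes each data row becomes one object keyed by the header names; that output shape and the `rows_to_records` helper are illustrative assumptions, not the library's documented behavior.

```python
import json
from typing import Dict, List


def rows_to_records(headers: List[str], rows: List[List[str]]) -> List[Dict[str, str]]:
    """Pair each cell with its header; reject rows whose length differs from the header row."""
    records = []
    for row in rows:
        if len(row) != len(headers):
            raise ValueError(
                f"expected {len(headers)} cells, got {len(row)}: {row!r}"
            )
        records.append(dict(zip(headers, row)))
    return records


headers = ["Header 1", "Header 2", "Header 3"]
rows = [["cel1", "cel2", "cel3"], ["cel1", "cel2", "cel3"]]
print(json.dumps(rows_to_records(headers, rows), indent=2))
```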
    

version-2.0.1 or Higher

  1. General-format tables that version 1.0.1 can process are still supported.

    • Example Table
      Header 1 Header 2 Header 3
      cel1 cel2 cel3
      cel1 cel2 cel3
      cel1 cel2 cel3
  2. Tables with a header split into subheaders

    • Example Table

      Header 1 Header 2
      Sub Header 1 Sub Header 2
      cel1 cel2 cel3
      cel1 cel2 cel3
      cel1 cel2 cel3
    • Output

      • The split parent header is dropped and its child headers are used instead

        Header 1 : cel1
        Sub Header 1 : cel2
        Sub Header 2 : cel3
        
  3. Tables whose rows are split into sub-rows in every column except the first

    • Example Table

      Header 1 Header 2 Header 3
      cel1 cel2 cel3
      cel1 cel2-1 cel3-1
      cel2-2 cel3-2
    • Output

      • The split values are appended to the data in the top row, joined with "@"

        Header 1 : cel1
        Header 2 : cel2
        Header 3 : cel3
        Header 1 : cel1
        Header 2 : cel2-1@cel2-2
        Header 3 : cel3-1@cel3-2
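The two version 2 behaviors shown in the outputs above (using child headers in place of a split parent, and joining split cells with "@") can be sketched in plain Python. This is not the library's implementation: `flatten_headers`, `merge_split_rows`, and the `(name, span)` representation of parent headers are hypothetical, and the sketch assumes a continuation row is recognizable because it is exactly one cell shorter than the header row.

```python
from typing import List, Tuple


def flatten_headers(parents: List[Tuple[str, int]], children: List[str]) -> List[str]:
    """Keep unsplit parent headers; replace each split parent with its child headers.

    `parents` holds (name, span) pairs, where span is the number of columns covered.
    """
    flat: List[str] = []
    pos = 0  # index of the next unused child header
    for name, span in parents:
        if span == 1:
            flat.append(name)
        else:
            flat.extend(children[pos:pos + span])
            pos += span
    return flat


def merge_split_rows(rows: List[List[str]], n_cols: int, sep: str = "@") -> List[List[str]]:
    """Fold continuation rows (missing the first cell) into the row above them."""
    merged: List[List[str]] = []
    for row in rows:
        if len(row) == n_cols:
            merged.append(list(row))
        elif len(row) == n_cols - 1 and merged:
            previous = merged[-1]
            # Continuation cells belong to columns 1..n_cols-1 of the previous row.
            for col, value in enumerate(row, start=1):
                previous[col] = f"{previous[col]}{sep}{value}"
        else:
            raise ValueError(f"unexpected row length {len(row)}: {row!r}")
    return merged


print(flatten_headers([("Header 1", 1), ("Header 2", 2)],
                      ["Sub Header 1", "Sub Header 2"]))
# prints: ['Header 1', 'Sub Header 1', 'Sub Header 2']

print(merge_split_rows([["cel1", "cel2", "cel3"],
                        ["cel1", "cel2-1", "cel3-1"],
                        ["cel2-2", "cel3-2"]], n_cols=3))
# prints: [['cel1', 'cel2', 'cel3'], ['cel1', 'cel2-1@cel2-2', 'cel3-1@cel3-2']]
```

When consuming the JSON, splitting a value on "@" recovers the original sub-cells, provided the cell text itself contains no "@".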
        
  • Usage example

    • converter_2.py
    import pdf_table2json.converter_2 as converter_2
    
    path = "PATH/PDF_NAME.pdf"
    result = converter_2.main(path, json_file_out=True, image_file_out=True)
    print(result)
    

License

  • GPL-3.0 license

Contact

Read Text From PDF library
