PDF Table to JSON Converter
pdf-table-extract
Extract table data from PDF files to JSON.
- Locates tables with OpenCV and reads their contents with a text reader. (Each table must be enclosed by a border.)
- If a table has no border, add one through adjustment.
- Currently, only basic tables are supported (tables with a horizontal header row).
- The number of headers must equal the number of cells in each row:

Header 1 | Header 2 | Header 3
---|---|---
cel1 | cel2 | cel3
cel1 | cel2 | cel3
cel1 | cel2 | cel3

- The PDF must be readable by a text reader. Drag across the PDF to check that the text can be selected.
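The header/cell rule above can be illustrated with a small sketch. This is not the package's own code: `rows_to_records` is a hypothetical helper that pairs each row's cells with the header names and rejects rows whose cell count differs from the header count. The output shape (a list of objects keyed by header) is an assumption about what such JSON might look like.

```python
import json


def rows_to_records(headers, rows):
    """Pair each row's cells with the header names (hypothetical helper)."""
    records = []
    for index, cells in enumerate(rows):
        # Every row needs exactly as many cells as there are headers.
        if len(cells) != len(headers):
            raise ValueError(
                f"row {index} has {len(cells)} cells, expected {len(headers)}"
            )
        records.append(dict(zip(headers, cells)))
    return records


headers = ["Header 1", "Header 2", "Header 3"]
rows = [["cel1", "cel2", "cel3"], ["cel1", "cel2", "cel3"]]
print(json.dumps(rows_to_records(headers, rows), indent=2))
```

A row with a missing or extra cell raises `ValueError` instead of silently shifting values under the wrong header.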
Current Status (change the values if adjustments are needed)
- Finds tables with a width greater than 1000 and a height greater than 100.
- Excludes cells whose width or height equals the table's width or height, or whose width or height is less than 10.
- Adds border lines to areas with a color of (230, 230, 230) and a width of 1000 or more so that they are recognized as table regions.
- Removes watermark images with a color of (213, 213, 213), i.e. #D5D5D5.
- Removes a specific list of strings from the PDF text to strip text watermarks (the list is currently empty).
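The size thresholds and the text-watermark rule in this list can be sketched as plain predicates. The names below (`Box`, `is_table`, `is_cell`, `strip_watermarks`, and the constants) are hypothetical and not taken from the package, and treating the numbers as pixel sizes is an assumption. Only the threshold values come from the list.

```python
from typing import NamedTuple, Sequence


class Box(NamedTuple):
    """Bounding box; units are assumed to be pixels."""
    x: int
    y: int
    w: int
    h: int


# Threshold values from the list above; the constant names are made up.
MIN_TABLE_WIDTH = 1000
MIN_TABLE_HEIGHT = 100
MIN_CELL_SIZE = 10
TEXT_WATERMARKS: Sequence[str] = ()  # currently empty, as stated above


def is_table(box: Box) -> bool:
    """Keep only regions wider than 1000 and taller than 100."""
    return box.w > MIN_TABLE_WIDTH and box.h > MIN_TABLE_HEIGHT


def is_cell(cell: Box, table: Box) -> bool:
    """Reject boxes that span the whole table or are smaller than 10."""
    spans_table = cell.w == table.w or cell.h == table.h
    too_small = cell.w < MIN_CELL_SIZE or cell.h < MIN_CELL_SIZE
    return not (spans_table or too_small)


def strip_watermarks(text: str, watermarks: Sequence[str] = TEXT_WATERMARKS) -> str:
    """Remove each listed watermark string from the extracted text."""
    for mark in watermarks:
        text = text.replace(mark, "")
    return text


table = Box(0, 0, 1200, 300)
candidates = [Box(0, 0, 1200, 40), Box(10, 10, 400, 40), Box(5, 5, 6, 6)]
cells = [c for c in candidates if is_cell(c, table)]
print(is_table(table), cells)
```

Keeping the thresholds as named constants makes them easy to change for PDFs rendered at a different size.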
Installation
- Requires Python >= 3.8
- Install with pip:
pip install pdf-table2json
Example
Import
import pdf_table2json.converter as converter
path = "PATH/PDF_NAME.pdf"
result = converter.main(path, json_file_out=True, image_file_out=True)
print(result)
CLI
python converter.py -i "pdf_path/pdf_name.pdf" [-j] [-o]
- "-i", "--input", required=True, help="[Required] Input PDF file path"
- "-j", "--json_file", action="store_true", help="[Optional] Create JSON Data file"
- "-o", "--image_file", action="store_true", help="[Optional] Save Image Data file"
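The option lines above read like arguments to `argparse`'s `add_argument`. Assuming that is how the script defines them, a minimal parser accepting the same three flags might look like the sketch below. `build_parser` and the description string are illustrative and not taken from the package.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Build a parser accepting the three documented flags (sketch)."""
    parser = argparse.ArgumentParser(description="Extract PDF tables to JSON")
    parser.add_argument("-i", "--input", required=True,
                        help="[Required] Input PDF file path")
    parser.add_argument("-j", "--json_file", action="store_true",
                        help="[Optional] Create JSON Data file")
    parser.add_argument("-o", "--image_file", action="store_true",
                        help="[Optional] Save Image Data file")
    return parser


# Parse an explicit argument list instead of sys.argv for demonstration.
args = build_parser().parse_args(["-i", "pdf_path/pdf_name.pdf", "-j"])
print(args.input, args.json_file, args.image_file)
```

With `action="store_true"`, `-j` and `-o` are plain switches: they take no value and default to `False` when omitted.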
License
- GPL-3.0 license
Contact
Library used to read text from PDFs
- PyMuPDF (GitHub)