Use GTP4-Vision as a better than OCR data extractor
Project description
PDF-GPT4-JSON
This project is designed to convert PDF files into JSON format using GPT-4. For each page in the PDF, a JSON file will be generated. The hierarchy of the JSON structure will be inferred from the layout of the data in the PDF. Can be used as python module or in the cli.
Theory of Generating Structured JSON using GPT-4 Vision
GPT-4 Vision is a state-of-the-art language model that has been fine-tuned for image understanding and analysis. It leverages the power of deep learning to extract meaningful information from PDF files and convert them into structured JSON format.
The process of generating structured JSON using GPT-4 Vision involves the following steps:
-
PDF Parsing: The PDF file is parsed to extract the textual content and layout information of each page.
-
Text Extraction: The extracted text is processed to remove any noise or irrelevant information, such as headers, footers, and page numbers.
-
Layout Analysis: GPT-4 Vision analyzes the layout of the text on each page to identify the hierarchical structure of the data. It looks for patterns, indentation, and formatting cues to infer the relationships between different elements.
-
JSON Generation: Based on the layout analysis, GPT-4 Vision generates a structured JSON representation of the PDF content. Each page is represented as a separate JSON file, with nested objects and arrays to capture the hierarchical relationships.
By leveraging the power of GPT-4 Vision, the PDF-GPT4-JSON project simplifies the process of converting PDF files into structured JSON format. This enables developers to easily extract and analyze data from PDFs, opening up a wide range of possibilities for data processing and automation.
Installation
-
Install via pip:
pip install pdf_gpt4_json
-
Set your OpenAI api key:
export OPENAI_API_KEY=sk-xxxxxxxxxxx
You can also pass in as a command line arugment to the tool
--openai-key
Usage
-
Run the conversion script:
pdf-gpt4-json ./sample.pdf
Will generate tmp working folder and an output folder with json for each page. If you already have a folder 'output' it will get renamed, if the working folders exist they will be deleted and recreated to insure that they are empyt
-
Final output folder will be
samplepdf_final_folders
in this case. It will use the pdfs filename as the prefix to the output folder. If there are errors it will be in{filename}_errors
folder. In both cases the cli will return a message.
Parameter
--prompt-file (str, optional): Path to a file containing a prompt for the model.
--openai-key (str, optional): OpenAI API key. If not provided, it will be read from the environment.
--model (str, optional): Model to use. Default is "gpt-4-vision-preview".
--verbose (bool, optional): If True, print additional debug information. Default is False.
--cleanup (bool, optional): If True, cleanup temporary files after processing. Default is False.
By adjusting these parameters, users can tailor the PDF-to-JSON conversion to their specific needs and preferences.
Contributing
Contributions are welcome! Please follow the guidelines in CONTRIBUTING.md.
License
This project is licensed under the GLP-3 LICENSE.
Project details
Release history Release notifications | RSS feed
Download files
Download the file for your platform. If you're not sure which to choose, learn more about installing packages.
Source Distribution
Built Distribution
File details
Details for the file pdf_gpt4_json-0.1.2.tar.gz
.
File metadata
- Download URL: pdf_gpt4_json-0.1.2.tar.gz
- Upload date:
- Size: 21.4 kB
- Tags: Source
- Uploaded using Trusted Publishing? No
- Uploaded via: poetry/1.8.2 CPython/3.12.2 Linux/5.15.146.1-microsoft-standard-WSL2
File hashes
Algorithm | Hash digest | |
---|---|---|
SHA256 | df46066bd18550dabef3f13ae6aee3391ccefb98f6a86383e562a160b830953d |
|
MD5 | 52362791484aac6140e7ee95fa6dba28 |
|
BLAKE2b-256 | 86212cff9f31acfa5999be1405094da127ba2b72b28998a7e20140670aa080d4 |
File details
Details for the file pdf_gpt4_json-0.1.2-py3-none-any.whl
.
File metadata
- Download URL: pdf_gpt4_json-0.1.2-py3-none-any.whl
- Upload date:
- Size: 22.5 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? No
- Uploaded via: poetry/1.8.2 CPython/3.12.2 Linux/5.15.146.1-microsoft-standard-WSL2
File hashes
Algorithm | Hash digest | |
---|---|---|
SHA256 | 92039da8e9bc94a133384bb435266916a99d7c9167271c26c2b74767bed1abc7 |
|
MD5 | 3f25d79d28e5215104bd230237d3ea71 |
|
BLAKE2b-256 | e1d92cff9f0b4ac90a78ede4305207490aa9a8916dec82a9ae9f3c9b669015aa |