Skip to main content

Use GTP4-Vision as a better than OCR data extractor

Project description

PDF-GPT4-JSON

This project is designed to convert PDF files into JSON format using GPT-4. For each page in the PDF, a JSON file will be generated. The hierarchy of the JSON structure will be inferred from the layout of the data in the PDF. Can be used as python module or in the cli.

Theory of Generating Structured JSON using GPT-4 Vision

GPT-4 Vision is a state-of-the-art language model that has been fine-tuned for image understanding and analysis. It leverages the power of deep learning to extract meaningful information from PDF files and convert them into structured JSON format.

The process of generating structured JSON using GPT-4 Vision involves the following steps:

  1. PDF Parsing: The PDF file is parsed to extract the textual content and layout information of each page.

  2. Text Extraction: The extracted text is processed to remove any noise or irrelevant information, such as headers, footers, and page numbers.

  3. Layout Analysis: GPT-4 Vision analyzes the layout of the text on each page to identify the hierarchical structure of the data. It looks for patterns, indentation, and formatting cues to infer the relationships between different elements.

  4. JSON Generation: Based on the layout analysis, GPT-4 Vision generates a structured JSON representation of the PDF content. Each page is represented as a separate JSON file, with nested objects and arrays to capture the hierarchical relationships.

By leveraging the power of GPT-4 Vision, the PDF-GPT4-JSON project simplifies the process of converting PDF files into structured JSON format. This enables developers to easily extract and analyze data from PDFs, opening up a wide range of possibilities for data processing and automation.

Installation

  1. Clone the repository:

    git clone https://github.com/your-username/PDF-GPT4-JSON.git
    
  2. Navigate to the project directory:

    cd PDF-GPT4-JSON
    
  3. Install the required dependencies:

    pip install -r requirements.txt
    

Usage

  1. OPENAI Key

    Supply your Open AI key as an Enviroment variable OPENAI_API_KEY or as a command line argument --openai-key

  2. Run the conversion script:

    python main.py ../samples/sample.pdf 
    

    Will generate tmp working folder and an output folder with json for each page.

Parameter

--prompt-file (str, optional): Path to a file containing a prompt for the model.
--openai-key (str, optional): OpenAI API key. If not provided, it will be read from the environment.
--model (str, optional): Model to use. Default is "gpt-4-vision-preview".
--verbose (bool, optional): If True, print additional debug information. Default is False.
--cleanup (bool, optional): If True, cleanup temporary files after processing. Default is False.

By adjusting these parameters, users can tailor the PDF-to-JSON conversion to their specific needs and preferences.

Contributing

Contributions are welcome! Please follow the guidelines in CONTRIBUTING.md.

License

This project is licensed under the GLP-3 LICENSE.

Project details


Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

pdf_gpt4_json-0.1.0.tar.gz (20.7 kB view details)

Uploaded Source

Built Distribution

pdf_gpt4_json-0.1.0-py3-none-any.whl (21.9 kB view details)

Uploaded Python 3

File details

Details for the file pdf_gpt4_json-0.1.0.tar.gz.

File metadata

  • Download URL: pdf_gpt4_json-0.1.0.tar.gz
  • Upload date:
  • Size: 20.7 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: poetry/1.8.2 CPython/3.12.2 Linux/5.15.146.1-microsoft-standard-WSL2

File hashes

Hashes for pdf_gpt4_json-0.1.0.tar.gz
Algorithm Hash digest
SHA256 4241a83a9cf97b3c85f67966256e3ec3d071858e659e9de57d184cb093055d24
MD5 0b6d1dfee75bdcdcd0f6cf81a5e64f57
BLAKE2b-256 6b6f79bb089330a9c4f701563d785ae2593aa8f56e2b1df2cd1e84ff2e10f0a7

See more details on using hashes here.

File details

Details for the file pdf_gpt4_json-0.1.0-py3-none-any.whl.

File metadata

  • Download URL: pdf_gpt4_json-0.1.0-py3-none-any.whl
  • Upload date:
  • Size: 21.9 kB
  • Tags: Python 3
  • Uploaded using Trusted Publishing? No
  • Uploaded via: poetry/1.8.2 CPython/3.12.2 Linux/5.15.146.1-microsoft-standard-WSL2

File hashes

Hashes for pdf_gpt4_json-0.1.0-py3-none-any.whl
Algorithm Hash digest
SHA256 7e42680ccf455887611dc8cb968187e61747ec440503534bf90ef3dc63b2c1d4
MD5 2d2dc40de32bfb3a4818deb8e84b9dfe
BLAKE2b-256 b864483792c409b7389f563ec0c06dc6919395a099e077bd8df5f5f70c2d29c6

See more details on using hashes here.

Supported by

AWS AWS Cloud computing and Security Sponsor Datadog Datadog Monitoring Fastly Fastly CDN Google Google Download Analytics Microsoft Microsoft PSF Sponsor Pingdom Pingdom Monitoring Sentry Sentry Error logging StatusPage StatusPage Status page