This package helps users parse PDF files into text and JSON files. It can also parse question-answer pairs into a JSONL document in the prompt-completion format supported by OpenAI.

Project description

Multi-purpose PDF parser

This parser was designed to meet the need for parsing PDF files in order to streamline the fine-tuning process of Large Language Models such as OpenAI's GPT models.

Functionalities:

  • PDF file to Text File conversion
  • PDF file to JSON file conversion
  • PDF file to JSONL file conversion

How to use?

  1. PDF file to Text File conversion.

    Use the command pdfparser pdftotext INPUT_PDF_FILE_PATH -o OUTPUT_TEXT_FILE_PATH to copy the contents of your PDF file into a text file of your choosing.

  2. PDF file to JSON File conversion.

    Use the command pdfparser pdftojson INPUT_PDF_FILE_PATH -o OUTPUT_JSON_FILE_PATH to copy the contents of your PDF file into a JSON file of your choosing. The JSON file will be in the format {'text':PDF_CONTENTS}.

  3. PDF file to JSONL File conversion.

    This utility is helpful if you want to convert a question-answer data file into a JSONL file for use as source data in Large Language Model operations such as fine-tuning. Use the command pdfparser pdftojsonl INPUT_PDF_FILE_PATH -o OUTPUT_JSONL_FILE_PATH to extract question-answer pairs from your PDF file and save them in a separate JSONL file.

    Input file format: The question-answer pairs in your PDF should be in the form Question: What is a cat? Answer: Cat is an animal. Line breaks between or within the pairs do not affect the parser.

    Output format: A JSONL file in a structure similar to [{'prompt':Question, 'completion':Answer}].
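To make the input and output formats described above more concrete, here is a minimal Python sketch of how text in the Question: ... Answer: ... form could be turned into prompt-completion records. It is an illustration only and not the package's actual implementation: the names QA_PATTERN, extract_qa_pairs and to_jsonl are hypothetical, and the sketch assumes the common JSONL convention of one JSON object per line, which may differ from what pdfparser pdftojsonl writes.

```python
import json
import re

# Hypothetical pattern: a "Question:" label, the question text, an "Answer:"
# label, then the answer text up to the next "Question:" or the end of input.
QA_PATTERN = re.compile(
    r"Question:\s*(.*?)\s*Answer:\s*(.*?)\s*(?=Question:|\Z)",
    re.DOTALL,
)


def extract_qa_pairs(text):
    """Return a list of {'prompt', 'completion'} dicts found in text."""
    pairs = []
    for question, answer in QA_PATTERN.findall(text):
        # Collapse line breaks and repeated spaces so line wrapping is ignored.
        pairs.append({
            "prompt": " ".join(question.split()),
            "completion": " ".join(answer.split()),
        })
    return pairs


def to_jsonl(pairs):
    """Serialize each pair as one JSON object per line (assumed convention)."""
    return "\n".join(json.dumps(pair, ensure_ascii=False) for pair in pairs)


# Sample text standing in for the contents extracted from a PDF.
sample = (
    "Question: What is a cat?\nAnswer: Cat is an\nanimal.\n"
    "Question: What is a dog? Answer: Dog is an animal."
)
print(to_jsonl(extract_qa_pairs(sample)))
```

The sample deliberately mixes pairs separated by line breaks with a pair written on a single line, mirroring the statement that line breaks do not affect parsing.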

Download files

Source Distribution

AGIpdf2json-1.0.1.tar.gz (3.2 kB)

Uploaded Source

Built Distribution

AGIpdf2json-1.0.1-py3-none-any.whl (3.9 kB)

Uploaded Python 3
