This package parses PDF files into text and JSON files. It can also parse question-answer pairs into a JSONL document in the prompt-completion format supported by OpenAI.

Project description

Multi-purpose PDF parser

This parser was designed to parse PDF files in order to streamline the fine-tuning process for Large Language Models such as OpenAI's GPT models.

Functionalities:

  • PDF file to Text File conversion
  • PDF file to JSON file conversion
  • PDF file to JSONL file conversion

How to use?

  1. PDF file to Text File conversion.

    Use the command pdfparser pdftotext INPUT_PDF_FILE_PATH -o OUTPUT_TEXT_FILE_PATH to copy the contents of your PDF file into a text file of your choosing.

  2. PDF file to JSON File conversion.

    Use the command pdfparser pdftojson INPUT_PDF_FILE_PATH -o OUTPUT_JSON_FILE_PATH to copy the contents of your PDF file into a JSON file of your choosing. The JSON file will be in the format {'text':PDF_CONTENTS}.

  3. PDF file to JSONL File conversion.

    This utility is helpful if you want to convert a question-answer data file into a JSONL file for use as source data in Large Language Model workflows. Use the command pdfparser pdftojsonl INPUT_PDF_FILE_PATH -o OUTPUT_JSONL_FILE_PATH to extract question-answer pairs from your PDF file and save them in a separate JSONL file.

    Input file format: The question-answer pairs in your PDF should follow the pattern Question: What is a cat? Answer: Cat is an animal. Newline separators do not affect the parser.

    Output format: A JSONL file with a structure similar to [{'prompt':Question, 'completion':Answer}].
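To make the output shapes described above concrete, here is a minimal Python sketch. It is not the package's own implementation: the regular expression, the function names (extract_pairs, to_json_document, to_jsonl), and the choice to write one JSON object per line are all assumptions made for illustration, based only on the input and output formats stated above.

```python
import json
import re

# Assumed pattern (not taken from the package): a "Question:" label, then an
# "Answer:" label, with the answer running up to the next "Question:" or the
# end of the text. DOTALL lets a pair span several lines.
QA_PATTERN = re.compile(
    r"Question:\s*(?P<question>.*?)\s*Answer:\s*(?P<answer>.*?)\s*(?=Question:|\Z)",
    re.DOTALL,
)


def normalize(text):
    """Collapse runs of whitespace, including newlines, into single spaces."""
    return " ".join(text.split())


def extract_pairs(text):
    """Return a list of {'prompt': ..., 'completion': ...} dicts found in text."""
    return [
        {
            "prompt": normalize(match.group("question")),
            "completion": normalize(match.group("answer")),
        }
        for match in QA_PATTERN.finditer(text)
    ]


def to_json_document(text):
    """Wrap the full text in a single JSON object under the 'text' key."""
    return json.dumps({"text": text})


def to_jsonl(pairs):
    """Serialize each pair as one JSON object per line (assumed JSONL layout)."""
    return "\n".join(json.dumps(pair) for pair in pairs)


# Hypothetical text, standing in for what a PDF text extractor might return.
sample = (
    "Question: What is a cat?\nAnswer: Cat is an\nanimal. "
    "Question: What is a dog? Answer: Dog is an animal."
)

pairs = extract_pairs(sample)
print(to_jsonl(pairs))
```

In this sketch the line break inside the first answer is folded into a space, which mirrors the statement that newline separators do not affect parsing. An answer that itself contains the literal text "Question:" would be split incorrectly, so a real parser would likely need stricter rules.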

Download files

Source Distribution

AGIpdf2json-1.0.1.tar.gz (3.2 kB)

Uploaded Source

Built Distribution

AGIpdf2json-1.0.1-py3-none-any.whl (3.9 kB)

Uploaded Python 3

File details

Details for the file AGIpdf2json-1.0.1.tar.gz.

File metadata

  • Download URL: AGIpdf2json-1.0.1.tar.gz
  • Upload date:
  • Size: 3.2 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/4.0.2 CPython/3.11.3

File hashes

Hashes for AGIpdf2json-1.0.1.tar.gz
Algorithm    Hash digest
SHA256       96753329383a180b37548e6b212b0dadf99fa85a33e7a99ac9785b112b6252a1
MD5          37e1c81d998d3890ac3a865b0ffed40b
BLAKE2b-256  dd8c656a7ae837848a543b9f7e66f0d3207f8437d25520b8411242acaca301ff

File details

Details for the file AGIpdf2json-1.0.1-py3-none-any.whl.

File metadata

  • Download URL: AGIpdf2json-1.0.1-py3-none-any.whl
  • Upload date:
  • Size: 3.9 kB
  • Tags: Python 3
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/4.0.2 CPython/3.11.3

File hashes

Hashes for AGIpdf2json-1.0.1-py3-none-any.whl
Algorithm    Hash digest
SHA256       e2b334d5ef7bc44e14c6647a5c501281c894294c2b79e3b501311afb6550b264
MD5          5c8011f20fbdbb5f1dd186ab0cf0cbc1
BLAKE2b-256  7b77584b3b151d686564c00806d8225be2a990d19b4458218dac175f768b9522
