This package helps users parse PDF files into text and JSON files. It can also parse question-answer pairs into a JSONL document in the prompt-completion format supported by OpenAI.
Project description
Multi-purpose PDF parser
This parser was designed to meet the need for parsing PDF files in order to streamline the fine-tuning process for Large Language Models such as OpenAI's GPT models.
Functionalities:
- PDF file to Text File conversion
- PDF file to JSON file conversion
- PDF file to JSONL file conversion
How to use?
- PDF file to Text File conversion.
  You can use the command
    pdfparser pdftotext INPUT_PDF_FILE_PATH -o OUTPUT_TEXT_FILE_PATH
  to copy the contents of your PDF file into a text file of your choosing.
- PDF file to JSON File conversion.
  You can use the command
    pdfparser pdftojson INPUT_PDF_FILE_PATH -o OUTPUT_TEXT_FILE_PATH
  to copy the contents of your PDF file into a JSON file of your choosing. The JSON file will be in the format {'text':PDF_CONTENTS}.
- PDF file to JSONL File conversion.
  This utility is helpful if you want to convert a question-answer data file into a JSONL file to use as source data for Large Language Model operations. You can use the command
    pdfparser pdftojsonl INPUT_PDF_FILE_PATH -o OUTPUT_TEXT_FILE_PATH
  to extract question-answer pairs from your PDF file and save them in a separate JSONL file.
  Input file format: the question-answer pairs in your PDF should be in the form
    Question: What is a cat? Answer: Cat is an animal
  Newline separators do not affect the parser.
  Output format: a JSONL file with a structure similar to [{'prompt':Question, 'completion':Answer}].
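The question-answer extraction described above can be sketched in Python. This is only an illustration of the input and output formats, not the package's actual implementation: the helper names `extract_qa_pairs` and `to_jsonl` and the regular expression are assumptions, and the sketch works on text that has already been extracted from a PDF.

```python
import json
import re

# Hypothetical pattern (not taken from the package): pairs each "Question:"
# with the "Answer:" that follows it, up to the next "Question:" or the end.
QA_PATTERN = re.compile(
    r"Question:\s*(?P<q>.*?)\s*Answer:\s*(?P<a>.*?)\s*(?=Question:|\Z)",
    re.DOTALL,
)


def extract_qa_pairs(text):
    """Return a list of {'prompt': ..., 'completion': ...} dicts."""
    pairs = []
    for match in QA_PATTERN.finditer(text):
        # Collapse line breaks and repeated spaces so that newlines in the
        # source text do not change the result.
        question = " ".join(match.group("q").split())
        answer = " ".join(match.group("a").split())
        pairs.append({"prompt": question, "completion": answer})
    return pairs


def to_jsonl(pairs):
    """Serialize the pairs as JSONL: one JSON object per line."""
    return "\n".join(json.dumps(pair) for pair in pairs)


sample = (
    "Question: What is a cat?\nAnswer: Cat is\nan animal\n"
    "Question: What is a dog? Answer: Dog is an animal"
)
pairs = extract_qa_pairs(sample)
print(to_jsonl(pairs))
```

Each printed line is a standalone JSON object with a `prompt` key and a `completion` key, which is the usual shape of a JSONL prompt-completion file.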
File details
Details for the file AGIpdf2json-1.0.1.tar.gz.
File metadata
- Download URL: AGIpdf2json-1.0.1.tar.gz
- Upload date:
- Size: 3.2 kB
- Tags: Source
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/4.0.2 CPython/3.11.3
File hashes
Algorithm | Hash digest
---|---
SHA256 | 96753329383a180b37548e6b212b0dadf99fa85a33e7a99ac9785b112b6252a1
MD5 | 37e1c81d998d3890ac3a865b0ffed40b
BLAKE2b-256 | dd8c656a7ae837848a543b9f7e66f0d3207f8437d25520b8411242acaca301ff
File details
Details for the file AGIpdf2json-1.0.1-py3-none-any.whl.
File metadata
- Download URL: AGIpdf2json-1.0.1-py3-none-any.whl
- Upload date:
- Size: 3.9 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/4.0.2 CPython/3.11.3
File hashes
Algorithm | Hash digest
---|---
SHA256 | e2b334d5ef7bc44e14c6647a5c501281c894294c2b79e3b501311afb6550b264
MD5 | 5c8011f20fbdbb5f1dd186ab0cf0cbc1
BLAKE2b-256 | 7b77584b3b151d686564c00806d8225be2a990d19b4458218dac175f768b9522
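A downloaded file can be checked against the SHA256 digests listed above with Python's standard `hashlib` module. The sketch below is a minimal illustration: the helper names `sha256_of_file` and `matches_published_hash` are hypothetical, and the file path in the commented usage line assumes the archive was saved to the current directory.

```python
import hashlib


def sha256_of_file(path, chunk_size=65536):
    """Compute the SHA256 hex digest of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        # Read fixed-size chunks until read() returns an empty bytes object.
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_published_hash(path, expected_hex):
    """Return True if the file's SHA256 digest equals the published one."""
    return sha256_of_file(path) == expected_hex.strip().lower()


# Example usage (assumes the source distribution was downloaded locally):
# matches_published_hash(
#     "AGIpdf2json-1.0.1.tar.gz",
#     "96753329383a180b37548e6b212b0dadf99fa85a33e7a99ac9785b112b6252a1",
# )
```

A result of `False` means the local file differs from the one the digest was computed for, so it should be downloaded again before installing.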