PDF to TXT
Project description
EXPDF2TXT
🤔 Overview
The expdf2txt is a versatile and efficient package designed for seamless conversion of PDF (Portable Document Format) files into TXT (Plain Text) format. This package empowers users with a straightforward and reliable method to extract textual content from PDF documents, making it easily accessible and editable."
❓ Features
1. InvoiceExtractor (Convert Invoice Pdf to Text)
Code:
from expdf2txt.invoice2text import InvoiceExtractor
FILEPATH = "invoice_2001321.pdf"
invocie_extractor = InvoiceExtractor(FILEPATH)
data = invocie_extractor.openai_extract_data(format_data=True)
print(data)
2. ImageExtractor (extract image from pdf)
Code:
from expdf2txt.pdf2data import PDFExtractor
FILEPATH = "invoice_2001321.pdf"
pdf_obj = PDFExtractor(FILEPATH)
pdf_obj.extract_image()
🚀 Methods
1. InvoiceExtractor methods :
openai_extract_data() 'This method extracts text from a Invoice Pdf'
- Parameters:
- temperature (float, optional): The temperature parameter for the OpenAI LLM.
- api_key (str, optional): The API key for accessing OpenAI services. Note: "If the default API key is not functioning, please provide an alternative API key for use."
- template (str, optional): Custom template string for document extraction.
- format_data (bool, optional): If True, format the extracted data into a dictionary or list; if False, return raw output. Note: "If the data is successfully converted into a dictionary, it will be returned as a dictionary. Otherwise, it will be returned as a list."
2. PDFExtractor methods:
- countpages() 'Count the number of pages in the document.'
- extract_string() 'Extract the data from the source.'
- extract_image() 'Extract images from the PDF document and saves them as separate files.'
Dependencies
-
openai
-
pytesseract
-
PyPDF2
-
PyMuPDF
License
This project is licensed under the MIT License - see the LICENSE file for details.
Issues
If you encounter any issues or have suggestions, please create an issue on the GitHub repository.
Acknowledgments
Mention any libraries or tools you used and give credit to their respective authors.
Project details
Release history Release notifications | RSS feed
Download files
Download the file for your platform. If you're not sure which to choose, learn more about installing packages.
Source Distribution
Built Distribution
Hashes for expdf2txt-1.1.0-py3-none-any.whl
Algorithm | Hash digest | |
---|---|---|
SHA256 | c6d532789b892b22019175921c6c6c8210b9a75b2e5c8ed3cf1aa914a626634a |
|
MD5 | 6bc0c77196cd3e0d1bc0975d1cd23857 |
|
BLAKE2b-256 | f39873a9bb2add629fb0f463783e1077ac105fec0cbfec7613f90e6ce4d6aff9 |