pdfdeal makes PDFs easier to work with: it extracts readable text, uses OCR to recognise text in images, and cleans up the formatting so the result is better suited to building a knowledge base. It performs best with Doc2X.

pdfdeal

For better RAG!

What's new

V0.1.1

✨ New Features

  • All functions now support a new return format, selected with the optional parameter version. With v2, a function returns three values: a list of successfully processed files, a list of files that failed, and a bool. The default, v1, returns only the list of successfully processed files.
  • pdf2file and file2pdf now accept the optional parameter output_names to specify the output file names.
  • Added a request retry mechanism: failed network requests are now retried automatically.
  • Added an error handling mechanism: errors raised while processing a file are now handled automatically, so a single bad file no longer interrupts the whole program.
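
The v2 return shape, the retry mechanism, and the per-file error handling listed above can be pictured together in a minimal sketch. This is not pdfdeal's implementation: `process_all` and the `worker` callable are hypothetical names, and the meaning given to the final bool (True when at least one file failed) is an assumption, since the release notes only say that a bool is returned.

```python
import time
from typing import Callable, List, Tuple


def process_all(
    files: List[str],
    worker: Callable[[str], str],
    retries: int = 2,
    delay: float = 0.0,
) -> Tuple[List[str], List[str], bool]:
    """Run worker on every file, retrying failures, and never abort the batch.

    Returns (succeeded outputs, failed inputs, flag). The flag being True when
    any file failed is an assumption made for this sketch.
    """
    succeeded: List[str] = []
    failed: List[str] = []
    for path in files:
        for attempt in range(retries + 1):
            try:
                succeeded.append(worker(path))
                break  # this file is done, move on to the next one
            except Exception:
                if attempt == retries:
                    failed.append(path)  # give up on this file only
                else:
                    time.sleep(delay)  # wait before the next attempt
    return succeeded, failed, bool(failed)


def demo_worker(path: str) -> str:
    """Hypothetical converter: fails for any path containing 'bad'."""
    if "bad" in path:
        raise ValueError(f"cannot convert {path}")
    return path[:-4] + ".md"


succeeded, failed, had_errors = process_all(["a.pdf", "bad.pdf"], demo_worker, retries=1)
```

Returning the failed inputs alongside the successes lets the caller decide whether to retry them later, which a v1-style single list cannot express.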

🐛 Bug Fixes

  • Fixed a font exception in the pdfdeal function.
  • Fixed some abnormal use of keys.
  • Fixed a problem where the rpm (requests per minute) limit might not take effect.

V0.1.0

[!IMPORTANT] The Doc2x methods from version 0.0.X are deprecated and will be removed in a future release. Please migrate to the new implementation as soon as possible; the old methods emit a warning when used.

Most of the interface is unchanged, so in many cases it is enough to replace from pdfdeal.doc2x import Doc2x with from pdfdeal.doc2x import Doc2X.

Doc2X support has been refactored to use concurrency, which speeds up processing. Quick start:

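from pdfdeal.doc2x import Doc2X
# Assumption: gen_folder_list is importable from pdfdeal.file_tools; adjust the import if your version differs
from pdfdeal.file_tools import gen_folder_list

Client = Doc2X()
# gen_folder_list is a built-in helper that lists every PDF under the given folder.
# Any list of PDF paths can be passed instead.
filelist = gen_folder_list("./test", "pdf")
Client.pdfdeal(filelist)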

See Doc2x Support.
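
Since any list of PDF paths is accepted, the file list can also be built by hand. The following is a minimal sketch using only the standard library; `list_pdfs` is a hypothetical helper written for illustration and is not part of pdfdeal.

```python
from pathlib import Path
from typing import List


def list_pdfs(folder: str, recursive: bool = False) -> List[str]:
    """Return the paths of all PDF files in folder, sorted for stable ordering."""
    root = Path(folder)
    pattern = "**/*" if recursive else "*"
    return sorted(
        str(p)
        for p in root.glob(pattern)
        # Match the extension case-insensitively so "SCAN.PDF" is included too
        if p.is_file() and p.suffix.lower() == ".pdf"
    )
```

The resulting list could then be passed in place of filelist, for example Client.pdfdeal(list_pdfs("./test")).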

Summary

pdfdeal makes PDFs easier to work with: it extracts readable text, uses OCR to recognise text in images, and cleans up the formatting so the result is better suited to building a knowledge base.

It uses easyocr to recognise text in images and adds that text to the original text. If the output format is pdf, each page's text stays on the same page number in the new PDF as in the original. After processing, you can feed the PDF to knowledge base applications (such as Dify or FastGPT), which in theory should achieve a better recognition rate.
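
The idea of adding OCR text to the extracted text while keeping the page count unchanged can be sketched as follows. This is only an illustration of the principle, not pdfdeal's code: `merge_ocr_text` and the shape of its inputs (one string per page, plus OCR strings keyed by zero-based page index) are assumptions.

```python
from typing import Dict, List


def merge_ocr_text(page_texts: List[str], ocr_by_page: Dict[int, List[str]]) -> List[str]:
    """Append OCR results to the text of the page they came from.

    The output has exactly one entry per input page, so page numbers are preserved.
    """
    merged: List[str] = []
    for index, text in enumerate(page_texts):
        # Keep the extracted text first, unless the page had none (e.g. a scan)
        parts = [text.rstrip()] if text.strip() else []
        # Add every non-empty OCR string found on this page
        parts += [t.strip() for t in ocr_by_page.get(index, []) if t.strip()]
        merged.append("\n".join(parts))
    return merged
```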

Support for Doc2x

pdfdeal supports Doc2x, which currently offers a free quota of 500 pages per day and recognises tables and formulas very well.

You can also use the Doc2x support module on its own to convert PDF directly to markdown/latex/docx, as shown below. See Doc2x Support for more.

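from pdfdeal.doc2x import Doc2X
# Assumption: gen_folder_list is importable from pdfdeal.file_tools; adjust the import if your version differs
from pdfdeal.file_tools import gen_folder_list

Client = Doc2X()
# gen_folder_list is a built-in helper that lists every PDF under the given folder.
# Any list of PDF paths can be passed instead.
filelist = gen_folder_list("./test", "pdf")
Client.pdfdeal(filelist)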

Usage

See the example code.

Install

Install from PyPI:

pip install 'pdfdeal[easyocr]'

To use pytesseract, make sure you have installed tesseract first:

pip install 'pdfdeal[pytesseract]'

To use your own custom OCR function, use Doc2x, or skip OCR:

pip install pdfdeal

Install from source:

pip install 'pdfdeal[all] @ git+https://github.com/Menghuan1918/pdfdeal.git'

Parameters

Import the function with from pdfdeal import deal_pdf. The function accepts the following parameters:

  • input: str

    • Description: The URL or local path to the PDF file that you want to process.
    • Example: "https://example.com/sample.pdf" or "/path/to/local/sample.pdf"
  • output: str, optional, default: "text"

    • Description: Specifies the type of output you want. The options are:
      • "text": Extracted text from the PDF as a single string.
      • "texts": Extracted text from the PDF as a list of strings, one per page.
      • "md": Markdown formatted text.
      • "pdf": A new PDF file with the extracted text.
    • Example: "md"
  • ocr: function or str, optional, default: None

    • Description: A custom OCR (Optical Character Recognition) function. If not provided, the default OCR function is used. Pass the string "pytesseract" to use pytesseract, or the string "pass" to skip OCR.
    • Example: custom_ocr_function, a function that takes (path, language=["ch_sim", "en"], GPU=False) and returns a string.
  • language: list, optional, default: ["ch_sim", "en"]

    • Description: A list of languages to be used in OCR. The default languages are Simplified Chinese ("ch_sim") and English ("en"). With pytesseract, use its own language codes instead, for example ["eng"].
    • Example: ["en", "fr"]
  • GPU: bool, optional, default: False

    • Description: A boolean flag indicating whether to use GPU for OCR processing. If set to True, GPU will be used.
    • Example: True
  • path: str, optional, default: None

    • Description: The directory path where the output file will be saved. This parameter is only used when the output type is "md" or "pdf".
    • Example: "/path/to/save/output"
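
To illustrate the ocr parameter, here is a minimal sketch of a custom OCR function with the signature described above: (path, language, GPU) in, a string out. The function `sidecar_ocr` is hypothetical and does no real recognition; it assumes that path points to an image and simply returns the contents of a text file stored next to it, which can be handy for testing a pipeline without installing an OCR engine.

```python
from pathlib import Path
from typing import List, Optional


def sidecar_ocr(path: str, language: Optional[List[str]] = None, GPU: bool = False) -> str:
    """Stand-in OCR: return the text stored in '<path>.txt', or '' if it is missing.

    language and GPU are accepted only to match the expected signature;
    this sketch does not use them.
    """
    if language is None:
        language = ["ch_sim", "en"]  # same defaults as described for deal_pdf
    sidecar = Path(str(path) + ".txt")
    if sidecar.is_file():
        return sidecar.read_text(encoding="utf-8").strip()
    return ""
```

It would then be passed to the function as deal_pdf(input="test.pdf", ocr=sidecar_ocr).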

Process all the files in a folder and save the results in the Output folder

import os
from pdfdeal import deal_pdf
for root, dirs, files in os.walk("./PPT"):
    for file in files:
        file_path = os.path.join(root, file)
        deal_pdf(
            input=file_path, output="pdf", language=["en"], path="./Output", GPU=True
        )
        print(f"Deal with {file_path} successfully!")

Get the list of page texts from the PDF

from pdfdeal import deal_pdf
Text = deal_pdf(input="test.pdf", output="texts", language=["en"], GPU=True)
for text in Text:
  print(text)

Using pytesseract to do OCR

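from pdfdeal import deal_pdf

# Convert to Markdown with pytesseract; note the Tesseract-style language code "eng"
output_path = deal_pdf(
    input="test.pdf",
    output="md",
    ocr="pytesseract",
    language=["eng"],
    path="markdown",
)
print(f"Save processed file to {output_path}")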
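
Because easyocr and Tesseract name languages differently ("en" versus "eng"), a small translation table can help when switching OCR backends. The sketch below is not part of pdfdeal; `to_tesseract_codes` is a hypothetical helper, and the table covers only a few common languages.

```python
from typing import Dict, List

# easyocr language code -> Tesseract language code (partial table)
EASYOCR_TO_TESSERACT: Dict[str, str] = {
    "en": "eng",
    "ch_sim": "chi_sim",
    "ch_tra": "chi_tra",
    "fr": "fra",
    "de": "deu",
    "ja": "jpn",
}


def to_tesseract_codes(languages: List[str]) -> List[str]:
    """Translate easyocr codes to Tesseract codes; unknown codes pass through unchanged."""
    return [EASYOCR_TO_TESSERACT.get(code, code) for code in languages]
```

For example, the default list ["ch_sim", "en"] would become ["chi_sim", "eng"] before being passed as language together with ocr="pytesseract".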

Skip OCR

from pdfdeal import deal_pdf

print(deal_pdf(input="test.pdf", ocr="pass"))

Doc2x support

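from pdfdeal.doc2x import Doc2X
# Assumption: gen_folder_list is importable from pdfdeal.file_tools; adjust the import if your version differs
from pdfdeal.file_tools import gen_folder_list

Client = Doc2X()
# gen_folder_list is a built-in helper that lists every PDF under the given folder.
# Any list of PDF paths can be passed instead.
filelist = gen_folder_list("./test", "pdf")
Client.pdfdeal(filelist)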

See Doc2x Support.
