A Python wrapper for the Doc2X API with built-in text processing to improve text recall in RAG.

pdfdeal

Tested on Python 3.8-3.13 on Windows, Linux, and macOS

Handle PDFs more easily with Doc2X's document conversion capabilities, for format-preserving file conversion and RAG enhancement.

Introduction

Doc2X Support

Doc2X is a new universal document OCR tool that converts images or PDF files into Markdown/LaTeX text while preserving formulas and text formatting. It performs better than similar tools in most scenarios. pdfdeal provides wrapper classes for sending requests to Doc2X.

Processing PDFs

Use various OCR or PDF recognition tools to recognize the images in a PDF and add the recognized text to the original text. You can set the output format to PDF, which ensures that the recognized text keeps the same page numbers as the original in the new PDF. pdfdeal also offers various practical file processing tools.

After converting and pre-processing PDFs with Doc2X, you can achieve better recall when using them with knowledge base applications such as graphrag, Dify, and FastGPT.

Cases

graphrag

See how to use it with graphrag. graphrag does not support PDF input, but you can use the CLI tool doc2x to convert a PDF into a txt document that graphrag can use.

FastGPT/Dify or other RAG systems

For knowledge base applications, you can use pdfdeal's built-in document enhancements, such as uploading images to remote storage services or adding breaks by paragraph. See Integration with RAG applications.
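
To illustrate the "adding breaks by paragraph" idea, here is a minimal sketch in plain Python. It is not pdfdeal's implementation: the function name `add_paragraph_breaks` and the `<!-- split -->` marker are hypothetical, chosen only to show how an explicit marker between paragraphs can give a RAG text splitter a reliable boundary.

```python
import re


def add_paragraph_breaks(text: str, separator: str = "<!-- split -->") -> str:
    """Insert `separator` between paragraphs (hypothetical helper, not part of pdfdeal)."""
    # Treat one or more blank lines as a paragraph boundary.
    paragraphs = [p.strip() for p in re.split(r"\n\s*\n", text) if p.strip()]
    return f"\n\n{separator}\n\n".join(paragraphs)


sample = "First paragraph.\n\nSecond paragraph,\nstill second.\n\n\nThird."
print(add_paragraph_breaks(sample))
```

A knowledge base application can then be configured to split on the marker instead of guessing paragraph boundaries from whitespace.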

Documentation

For details, please refer to the documentation

Or check out the documentation repository pdfdeal-docs.

Quick Start

Installation

Install using pip:

pip install --upgrade pdfdeal

If you also need the document processing tools, install the rag extra:

pip install --upgrade "pdfdeal[rag]"
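
To confirm which version ended up in the active environment, a small check with the standard library's `importlib.metadata` can help. This is only a sketch; the helper name `installed_version` is made up for illustration and is not part of pdfdeal.

```python
from importlib.metadata import PackageNotFoundError, version
from typing import Optional


def installed_version(package: str) -> Optional[str]:
    """Return the installed version of `package`, or None if it is not installed."""
    try:
        return version(package)
    except PackageNotFoundError:
        return None


print("pdfdeal:", installed_version("pdfdeal") or "not installed")
```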

Use the Doc2X PDF API to process all PDF files in a specified folder

from pdfdeal import Doc2X

# Create the client with your Doc2X API key
client = Doc2X(apikey="Your API key", debug=True)
# Passing a folder as pdf_file processes every PDF file inside it
success, failed, flag = client.pdf2file(
    pdf_file="tests/pdf",
    output_path="./Output",
    output_format="docx",
)
print(success)
print(failed)
print(flag)
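
The three returned values can be turned into a short report. The sketch below is an illustration under stated assumptions, not documented behavior: it assumes `success` is a list of output paths in which an empty string marks a file that was not converted, and that `failed` is a list of dicts with "error" and "path" keys. The helper `summarize_results` is hypothetical, and the sample data is made up; check the online documentation for the actual return shapes.

```python
from typing import Any, Dict, List


def summarize_results(success: List[str], failed: List[Dict[str, Any]], flag: bool) -> str:
    """Build a short report from pdf2file-style results (shapes are assumptions)."""
    # Assumption: an empty string in `success` marks a file that was not converted.
    converted = [path for path in success if path]
    # Assumption: each entry in `failed` is a dict with "error" and "path" keys,
    # and entries with an empty "error" belong to files that succeeded.
    errors = [item for item in failed if item.get("error")]
    lines = [f"converted: {len(converted)}, failed: {len(errors)}, flag: {flag}"]
    lines += [f"  ok   {path}" for path in converted]
    lines += [f"  fail {item.get('path', '?')}: {item['error']}" for item in errors]
    return "\n".join(lines)


# Made-up sample data standing in for a real API response.
print(summarize_results(
    success=["./Output/a.docx", ""],
    failed=[{"error": "", "path": ""}, {"error": "timeout", "path": "tests/pdf/b.pdf"}],
    flag=True,
))
```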

Use the Doc2X PDF API to process the specified PDF file and specify the name of the exported file

from pdfdeal import Doc2X

# Create the client with your Doc2X API key
client = Doc2X(apikey="Your API key", debug=True)
# Convert a single PDF and set the exported file name with output_names
success, failed, flag = client.pdf2file(
    pdf_file="tests/pdf/sample.pdf",
    output_path="./Output/test/single/pdf2file",
    output_names=["sample1.zip"],
    output_format="md_dollar",
)
print(success)
print(failed)
print(flag)
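
Because the export in this example is named `sample1.zip`, a natural next step is to unpack it. The sketch below uses only the standard library; the helper `extract_markdown`, the `./Output/unzipped` target folder, and the assumptions that the archive is written under `output_path` and contains one or more `.md` files are illustrative, not taken from the pdfdeal documentation.

```python
import zipfile
from pathlib import Path
from typing import List


def extract_markdown(archive: str, target_dir: str) -> List[Path]:
    """Unpack `archive` into `target_dir` and return the Markdown files found."""
    target = Path(target_dir)
    target.mkdir(parents=True, exist_ok=True)
    with zipfile.ZipFile(archive) as zf:
        zf.extractall(target)
    # Assumption: the archive holds one or more .md files, possibly next to images.
    return sorted(target.rglob("*.md"))


# Assumed location: output_path joined with the name given in output_names.
archive = Path("./Output/test/single/pdf2file/sample1.zip")
if archive.exists():
    for md_file in extract_markdown(str(archive), "./Output/unzipped"):
        print(md_file)
```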

See the online documentation for details.
