Handle PDFs more easily: extract readable text, recognize text in images with OCR, and clean up formatting so the output is better suited to building knowledge bases. Works best with Doc2X.

Project description

pdfdeal

Better RAG results!

Easily handle PDFs, extract readable text, recognize image text with OCR and clean up formatting to make it more suitable for building knowledge bases.

Introduction

What's NEW

Added a documentation tutorial on integrating with graphrag.

Doc2X Support

Doc2X is a new universal document OCR tool that converts images or PDF files into Markdown/LaTeX text, preserving formulas and text formatting. It performs better than similar tools in most scenarios. pdfdeal provides wrapper classes for sending requests to Doc2X.

Processing PDFs

Use various OCR or PDF recognition tools to recognize the text in images and add it to the original text. You can set the output format to PDF, which ensures that the recognized text keeps the same page numbers as the original in the new PDF. pdfdeal also offers a range of practical file processing tools.

Once processed, PDFs yield better recognition rates in knowledge base applications such as graphrag, Dify, and FastGPT.

It is recommended to use Doc2X for the best results.

Cases

For example, where graphrag cannot read PDFs directly, you can use Doc2X to convert them into txt documents that it can ingest.
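As a minimal sketch of that hand-off, assume the PDFs have already been converted to Markdown files in an output folder (for instance with `output_format="md"`). The snippet below copies them into a folder of `.txt` files. The helper name `collect_as_txt` and the folder names are hypothetical and are not part of pdfdeal or graphrag; point them at wherever your graphrag project expects its input.

```python
from pathlib import Path
from typing import List


def collect_as_txt(md_dir: str, txt_dir: str) -> List[str]:
    """Copy every .md file under md_dir into txt_dir with a .txt extension.

    Hypothetical helper for illustration only. Files that share a name in
    different subfolders would overwrite each other in txt_dir.
    """
    src = Path(md_dir)
    dst = Path(txt_dir)
    dst.mkdir(parents=True, exist_ok=True)

    written = []
    for md_file in sorted(src.rglob("*.md")):
        target = dst / (md_file.stem + ".txt")
        # Markdown is already plain text, so the content is copied unchanged.
        target.write_text(md_file.read_text(encoding="utf-8"), encoding="utf-8")
        written.append(str(target))
    return written
```

Calling `collect_as_txt("Output", "ragtest/input")`, for example, would write one `.txt` file per converted document; both paths are placeholders.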

Or, for knowledge base applications, you can use pdfdeal to enhance documents. The figure below compares the original PDF, OCR enhancement, and Doc2X processing in Dify:

[Figure: original PDF vs. OCR enhancement vs. Doc2X processing in Dify]

Documentation

You can view new features under development here!

For details, please refer to the documentation

Or check out the documentation repository pdfdeal-docs.

Quick Start

Installation

Install from PyPI:

pip install --upgrade pdfdeal

Using pytesseract as an OCR engine

When using "pytesseract", make sure the Tesseract OCR engine itself is installed on your system first, then install pdfdeal with the pytesseract extra:

pip install 'pdfdeal[pytesseract]'
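Because pytesseract is a Python wrapper that calls the `tesseract` executable, a quick pre-flight check can save a confusing failure later. The sketch below uses only the standard library; the function name `tesseract_available` is hypothetical, and it assumes the executable is named `tesseract` and is expected on `PATH`.

```python
import shutil


def tesseract_available(binary: str = "tesseract") -> bool:
    """Return True if an executable named `binary` is found on PATH."""
    return shutil.which(binary) is not None


if not tesseract_available():
    print("tesseract was not found on PATH; install it before using ocr='pytesseract'.")
```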

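from pdfdeal import deal_pdf, get_files

# Collect the PDFs under tests/pdf and build matching output names ending in .md
files, rename = get_files("tests/pdf", "pdf", "md")

# Run OCR with pytesseract (English) and write Markdown results to ./Output
output_path, failed, flag = deal_pdf(
    pdf_file=files,
    output_format="md",
    ocr="pytesseract",
    language=["eng"],
    output_path="Output",
    output_names=rename,
)

# Report where each processed file was saved
for f in output_path:
    print(f"Saved processed file to {f}")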

Using Doc2X as the PDF processing tool

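from pdfdeal import Doc2X, get_files

client = Doc2X()

# Collect the PDFs under tests/pdf and build matching output names ending in .pdf
file_list, rename = get_files(path="tests/pdf", mode="pdf", out="pdf")

# Process every file with Doc2X and write the results to the output folder
success, failed, flag = client.pdfdeal(
    pdf_file=file_list,
    output_path="./Output/test/multiple/pdfdeal",
    output_names=rename,
)

# Inspect the three returned values
print(success)
print(failed)
print(flag)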
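Both examples above unpack three return values. The sketch below shows one way to turn them into a short report. The shapes it expects are assumptions and are not confirmed on this page: `success` is taken to hold output paths, with an empty string for any file that failed; `failed` is taken to hold one entry per input, either a dict with `"error"` and `"path"` keys or a plain string; and `flag` is taken to be truthy when at least one file failed. The helper name `summarize` is hypothetical, so check the documentation for the actual return format before relying on it.

```python
from typing import Any, List, Tuple


def summarize(success: List[str], failed: List[Any], flag: bool) -> Tuple[List[str], List[str]]:
    """Split a (success, failed, flag) result into written paths and error messages.

    The input shapes are assumptions (see the text above), not a documented API.
    """
    # Keep only non-empty output paths.
    written = [p for p in success if p]

    errors: List[str] = []
    if flag:  # assumed to be truthy when at least one file failed
        for item in failed:
            if isinstance(item, dict):
                if item.get("error"):
                    errors.append(f"{item.get('path', '?')}: {item['error']}")
            elif item:
                errors.append(str(item))
    return written, errors
```

With this in place, `written, errors = summarize(success, failed, flag)` gives two plain lists that are easy to log or to check in a script.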

Release history

Version	Release date
0.2.2	Jul 16, 2024 (this version)
0.2.1	Jul 15, 2024
0.2.0	Jul 14, 2024
0.1.6	Jul 7, 2024
0.1.5	Jul 5, 2024
0.1.4	Jul 4, 2024
0.1.3	Jul 3, 2024
0.1.2	Jun 27, 2024
0.1.1	Jun 22, 2024
0.1.0	Jun 20, 2024
0.0.7	Jun 13, 2024
0.0.6	Jun 5, 2024
0.0.5	Jun 1, 2024
0.0.4	May 31, 2024
0.0.3	May 29, 2024
0.0.2	May 29, 2024
0.0.1	May 29, 2024

Download files

Source Distribution

pdfdeal-0.2.1.tar.gz (109.4 kB)

Uploaded Jul 15, 2024 Source

Built Distribution

pdfdeal-0.2.1-py3-none-any.whl (36.6 kB)

Uploaded Jul 15, 2024 Python 3

Hashes for pdfdeal-0.2.1.tar.gz
Algorithm	Hash digest
SHA256	`cb74d5effe0d5509c7b344ae234abf2605ad8a7f9cb9d30f7f893a6ddb5101e5`
MD5	`db79d4c343895c65bc1cd79beea3d1e3`
BLAKE2b-256	`3150a56ea4e42561286becabf25d11c143efdb8060024709355c6263c750d049`

Hashes for pdfdeal-0.2.1-py3-none-any.whl
Algorithm	Hash digest
SHA256	`52b258fffa53c8694e8ebc8bbf9d437ec0c673f104a6d242978a50587acfb9b7`
MD5	`e70720423d18451945e0e9b8fa5b063c`
BLAKE2b-256	`f583583ac7af39662e521690935d661a408b9aefd58141bc061ac79436cec066`