PDF craft can convert PDF files into various other formats. This project will focus on processing PDF files of scanned books.
Introduction
pdf-craft converts PDF files into various other formats, with a focus on handling scanned book PDFs.
This project is based on DeepSeek OCR for document recognition. It supports the recognition of complex content such as tables and formulas. With GPU acceleration, pdf-craft can complete the entire conversion process from PDF to Markdown or EPUB locally. During the conversion, pdf-craft automatically identifies document structure, accurately extracts body text, and filters out interfering elements like headers and footers. For academic or technical documents containing footnotes, formulas, and tables, pdf-craft handles them properly, preserving these important elements. The final Markdown or EPUB files maintain the content integrity and readability of the original book.
Lightweight and Fast
Starting with the official v1.0.0 release, pdf-craft relies entirely on DeepSeek OCR and no longer uses an LLM for text correction. This change brings significant performance improvements: the entire conversion runs locally without network requests, eliminating the long waits and occasional network failures of the old version.
However, the new version has also removed the LLM text correction feature. If your use case still requires this functionality, you can continue using the old version v0.2.8.
Quick Start
Installation
pip install torch torchvision --index-url https://download.pytorch.org/whl/cpu
pip install pdf-craft
This project uses DeepSeek OCR, which depends on a CUDA environment. The commands above install a CPU-only PyTorch build, which lets Python import the package and resolve its types without errors but cannot actually run OCR. For CUDA environment installation instructions, please refer to the Installation Guide.
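Because OCR needs CUDA, it can help to check the environment before starting a long conversion. The following is a minimal sketch, not part of pdf-craft: `cuda_ready` is a hypothetical helper name, and it only relies on PyTorch's `torch.cuda.is_available()`.

```python
import importlib.util


def cuda_ready() -> bool:
    """Return True only if PyTorch is importable and reports a usable CUDA device."""
    # find_spec avoids an ImportError when torch is not installed at all.
    if importlib.util.find_spec("torch") is None:
        return False
    import torch

    return bool(torch.cuda.is_available())


if __name__ == "__main__":
    if cuda_ready():
        print("CUDA device detected.")
    else:
        print("No CUDA device detected; install a CUDA build of PyTorch first.")
```

With the CPU-only wheels installed by the commands above, this check is expected to report that no CUDA device is available.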
Basic Usage
Convert to Markdown
from pdf_craft import transform_markdown
transform_markdown(
    pdf_path="input.pdf",
    markdown_path="output.md",
    markdown_assets_path="images",
)
Convert to EPUB
from pdf_craft import transform_epub, BookMeta
transform_epub(
    pdf_path="input.pdf",
    epub_path="output.epub",
    book_meta=BookMeta(
        title="Book Title",
        authors=["Author"],
    ),
)
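The single-file calls above extend naturally to a folder of scanned books. The sketch below is one possible way to do that, under stated assumptions: `plan_outputs` and `convert_all` are hypothetical helper names, the one-folder-per-book layout is an illustrative choice, and only the three `transform_markdown` parameters shown above are used.

```python
from pathlib import Path


def plan_outputs(input_dir: str, output_dir: str) -> list[tuple[Path, Path]]:
    """Pair every PDF in input_dir with a Markdown path in its own subfolder."""
    pairs = []
    for pdf in sorted(Path(input_dir).glob("*.pdf")):
        # One folder per book keeps each book's output separate.
        pairs.append((pdf, Path(output_dir) / pdf.stem / f"{pdf.stem}.md"))
    return pairs


def convert_all(input_dir: str, output_dir: str) -> None:
    """Convert each planned PDF with transform_markdown (needs pdf-craft and CUDA)."""
    # Imported lazily so plan_outputs can be used without pdf-craft installed.
    from pdf_craft import transform_markdown

    for pdf, markdown in plan_outputs(input_dir, output_dir):
        markdown.parent.mkdir(parents=True, exist_ok=True)
        transform_markdown(
            pdf_path=str(pdf),
            markdown_path=str(markdown),
            # Same value as the example above. How this path is resolved is
            # not stated there, so check where the assets actually land.
            markdown_assets_path="images",
        )
```

Calling `convert_all("scans", "out")` would then process every `*.pdf` file in `scans` in sorted order. Separating the planning step from the conversion makes it possible to review the output paths before any OCR work starts.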
Detailed Usage
Convert to Markdown
from pdf_craft import transform_markdown
transform_markdown(
    pdf_path="input.pdf",
    markdown_path="output.md",
    markdown_assets_path="images",
    analysing_path="temp",  # Optional: specify temporary folder
    model="gundam",  # Optional: tiny, small, base, large, gundam
    models_cache_path="models",  # Optional: model cache path
    includes_footnotes=True,  # Optional: include footnotes
    generate_plot=False,  # Optional: generate visualization charts
)
Convert to EPUB
from pdf_craft import transform_epub, BookMeta, TableRender, LaTeXRender
transform_epub(
    pdf_path="input.pdf",
    epub_path="output.epub",
    analysing_path="temp",  # Optional: specify temporary folder
    model="gundam",  # Optional: OCR model size
    models_cache_path="models",  # Optional: model cache path
    includes_cover=True,  # Optional: include cover
    includes_footnotes=True,  # Optional: include footnotes
    generate_plot=False,  # Optional: generate visualization charts
    book_meta=BookMeta(
        title="Book Title",
        authors=["Author 1", "Author 2"],
        publisher="Publisher",
        language="en",
    ),
    lan="en",  # Optional: language (zh/en)
    table_render=TableRender.HTML,  # Optional: table rendering method
    latex_render=LaTeXRender.MATHML,  # Optional: formula rendering method
)
Model Management
pdf-craft depends on DeepSeek OCR models, which are automatically downloaded from Hugging Face on first run. You can control model storage and loading behavior through the models_cache_path and local_only parameters.
Pre-download Models
In production environments, it is recommended to download models in advance to avoid downloading on first run:
from pdf_craft import predownload_models
predownload_models(
    models_cache_path="models",  # Specify model cache directory
    revision=None,  # Optional: specify model version
)
Specify Model Cache Path
By default, models are downloaded to the system's Hugging Face cache directory. You can customize the cache location through the models_cache_path parameter:
from pdf_craft import transform_markdown
transform_markdown(
    pdf_path="input.pdf",
    markdown_path="output.md",
    models_cache_path="./my_models",  # Custom model cache directory
)
Offline Mode
If you have pre-downloaded the models, you can use local_only=True to disable network downloads and ensure only local models are used:
from pdf_craft import transform_markdown
transform_markdown(
    pdf_path="input.pdf",
    markdown_path="output.md",
    models_cache_path="./my_models",
    local_only=True,  # Use local models only, do not download from network
)
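When the same script runs both on machines with pre-downloaded models and on fresh ones, the offline switch can be derived from the state of the cache directory. The sketch below is an assumption-laden illustration: `model_kwargs` is a hypothetical helper, and "the directory is non-empty" is only a heuristic that does not verify the download is complete.

```python
from pathlib import Path


def model_kwargs(cache_dir: str) -> dict:
    """Build the model-related keyword arguments for transform_markdown."""
    cache = Path(cache_dir)
    kwargs = {"models_cache_path": str(cache)}
    # Heuristic only: a non-empty directory suggests the models were
    # pre-downloaded. It does not prove that the files are complete.
    if cache.is_dir() and any(cache.iterdir()):
        kwargs["local_only"] = True
    return kwargs
```

The result can be unpacked into a call such as `transform_markdown(pdf_path="input.pdf", markdown_path="output.md", **model_kwargs("./my_models"))`. The `local_only` key is added only when it is `True`, so the sketch uses no parameter value beyond those shown above.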
API Reference
OCR Models
Supports the following DeepSeek OCR models:
- tiny - Smallest model, fastest speed
- small - Small model
- base - Base model
- large - Large model
- gundam - Largest model, highest quality (default)
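If the model name comes from user input such as a command-line flag, checking it against the list above before any model is loaded can give an earlier and clearer error. This is a small sketch rather than part of pdf-craft: `MODEL_SIZES` and `check_model` are hypothetical names, and whether the library performs its own validation is not stated here.

```python
MODEL_SIZES = ("tiny", "small", "base", "large", "gundam")


def check_model(name: str) -> str:
    """Return the normalized model name, or raise ValueError if it is unknown."""
    normalized = name.strip().lower()
    if normalized not in MODEL_SIZES:
        expected = ", ".join(MODEL_SIZES)
        raise ValueError(f"unknown model {name!r}; expected one of: {expected}")
    return normalized
```

The returned string can then be passed as the `model` argument of `transform_markdown` or `transform_epub`.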
Table Rendering Methods
- TableRender.HTML - HTML format (default)
- TableRender.MARKDOWN - Markdown format
- TableRender.TEXT - Plain text format
Formula Rendering Methods
- LaTeXRender.MATHML - MathML format (default)
- LaTeXRender.IMAGE - Image format
- LaTeXRender.TEXT - Plain text format
Related Open Source Libraries
epub-translator uses AI large language models to automatically translate EPUB e-books while 100% preserving the original book's format, illustrations, table of contents, and layout. It also generates bilingual versions for convenient language learning or international sharing. When combined with this library, you can convert and translate scanned PDF books. For a demonstration, see this video: Convert PDF scanned books to EPUB format and translate to bilingual books.
License
This project is licensed under the MIT License. See the LICENSE file for details.
Starting from v1.0.0, pdf-craft has fully migrated to DeepSeek OCR (MIT license), removing the previous AGPL-3.0 dependency, allowing the entire project to be released under the more permissive MIT license. Thanks to the community for their support and contributions!