A practical tool for converting PDF to Markdown
Magic-PDF
Introduction
Magic-PDF converts PDF documents to Markdown. It can process files stored locally or on object storage that supports the S3 protocol.
Key features include:
- Support for multiple front-end model inputs
- Removal of headers, footers, footnotes, and page numbers
- Human-readable layout formatting
- Retention of the original document's structure and formatting, including headings, paragraphs, and lists
- Extraction and display of images and tables within Markdown
- Conversion of equations into LaTeX format
- Automatic detection and conversion of garbled PDFs
- Compatibility with CPU and GPU environments
- Available for Windows, Linux, and macOS platforms
Getting Started
Requirements
- Python 3.9 or newer
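The version requirement can be checked before installation. The snippet below is a minimal sketch, not part of Magic-PDF; the names `MIN_PYTHON` and `meets_requirement` are hypothetical and only illustrate the check.

```python
import sys

MIN_PYTHON = (3, 9)  # minimum version stated in the requirements above


def meets_requirement(version_info=None):
    """Return True if the (major, minor, ...) tuple is at least MIN_PYTHON.

    Hypothetical helper; defaults to the running interpreter's version.
    """
    if version_info is None:
        version_info = sys.version_info
    return tuple(version_info[:2]) >= MIN_PYTHON


if __name__ == "__main__":
    if not meets_requirement():
        raise SystemExit("Magic-PDF requires Python 3.9 or newer")
```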
Usage Instructions
1. Install Magic-PDF
pip install magic-pdf
2. Usage via Command Line
Simple
cp magic-pdf.template.json ~/magic-pdf.json
magic-pdf pdf-command --pdf "pdf_path" --model "model_json_path"
More options
magic-pdf --help
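The two command-line steps above can also be driven from a script. The following is a sketch under stated assumptions: it uses only the `pdf-command`, `--pdf`, and `--model` arguments shown above, and the helper names `install_config`, `build_command`, and `convert` are hypothetical, not part of Magic-PDF.

```python
import shutil
import subprocess
from pathlib import Path


def install_config(template_path, home=None):
    """Copy the config template to <home>/magic-pdf.json and return the new path."""
    home = Path(home) if home is not None else Path.home()
    target = home / "magic-pdf.json"
    shutil.copyfile(template_path, target)
    return target


def build_command(pdf_path, model_json_path):
    """Build the argument list for the conversion command shown above."""
    return [
        "magic-pdf", "pdf-command",
        "--pdf", str(pdf_path),
        "--model", str(model_json_path),
    ]


def convert(pdf_path, model_json_path):
    """Run the conversion; check=True raises CalledProcessError on a non-zero exit."""
    return subprocess.run(build_command(pdf_path, model_json_path), check=True)
```

Passing the arguments as a list avoids shell quoting problems when a path contains spaces.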
3. Usage via API
Local
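import os
# Import paths are assumed from the package layout; check demo.py if they differ.
from magic_pdf.pipe.UNIPipe import UNIPipe
from magic_pdf.rw.DiskReaderWriter import DiskReaderWriter

# pdf_bytes, model_json, and local_image_dir must be defined by the caller.
image_writer = DiskReaderWriter(local_image_dir)
image_dir = str(os.path.basename(local_image_dir))
jso_useful_key = {"_pdf_type": "", "model_list": model_json}
pipe = UNIPipe(pdf_bytes, jso_useful_key, image_writer)
pipe.pipe_classify()
pipe.pipe_parse()
md_content = pipe.pipe_mk_markdown(image_dir, drop_mode="none")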
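The local example leaves `pdf_bytes` and `model_json` undefined. One way to prepare them, and to save the result, is sketched below using only the standard library. It assumes the model output is stored as a JSON file and that `md_content` is a string; `load_inputs` and `save_markdown` are hypothetical helper names, not part of Magic-PDF.

```python
import json
from pathlib import Path


def load_inputs(pdf_path, model_json_path):
    """Read the PDF as raw bytes and parse the model output from a JSON file."""
    pdf_bytes = Path(pdf_path).read_bytes()
    model_json = json.loads(Path(model_json_path).read_text(encoding="utf-8"))
    return pdf_bytes, model_json


def save_markdown(md_content, output_path):
    """Write the Markdown text to output_path, creating parent directories."""
    output_path = Path(output_path)
    output_path.parent.mkdir(parents=True, exist_ok=True)
    output_path.write_text(md_content, encoding="utf-8")
    return output_path
```

The returned `pdf_bytes` and `model_json` would then be passed to the pipeline as shown above, and its `md_content` handed to `save_markdown`.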
Object Storage
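# Import paths are assumed from the package layout; check demo.py if they differ.
from magic_pdf.pipe.UNIPipe import UNIPipe
from magic_pdf.rw.S3ReaderWriter import S3ReaderWriter

# Access keys (ak/sk), endpoints, s3_pdf_path, and model_json must be defined by the caller.
s3pdf_cli = S3ReaderWriter(pdf_ak, pdf_sk, pdf_endpoint)
image_dir = "s3://img_bucket/"
s3image_cli = S3ReaderWriter(img_ak, img_sk, img_endpoint, parent_path=image_dir)
pdf_bytes = s3pdf_cli.read(s3_pdf_path, mode=s3pdf_cli.MODE_BIN)
jso_useful_key = {"_pdf_type": "", "model_list": model_json}
pipe = UNIPipe(pdf_bytes, jso_useful_key, s3image_cli)
pipe.pipe_classify()
pipe.pipe_parse()
md_content = pipe.pipe_mk_markdown(image_dir, drop_mode="none")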
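The object storage example identifies locations with `s3://` URIs such as `s3://img_bucket/`. When validating such a value before passing it on, a small standard-library check may help. This is an illustrative sketch only; `split_s3_uri` is a hypothetical helper and does not reflect how Magic-PDF parses paths internally.

```python
from urllib.parse import urlparse


def split_s3_uri(uri):
    """Split an s3:// URI into (bucket, key); the key may be empty.

    Raises ValueError if the URI does not use the s3 scheme or has no bucket.
    """
    parsed = urlparse(uri)
    if parsed.scheme != "s3" or not parsed.netloc:
        raise ValueError(f"not an S3 URI: {uri!r}")
    return parsed.netloc, parsed.path.lstrip("/")


if __name__ == "__main__":
    print(split_s3_uri("s3://img_bucket/"))
    print(split_s3_uri("s3://img_bucket/images/page_1.png"))
```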
For a complete example, see demo.py.
All Thanks To Our Contributors
License Information
See LICENSE.md for details.
Acknowledgments