
A lightweight toolbox to manipulate documents

Install

Prerequisites: Python 3.10

Install Dependencies

Linux/macOS

apt-get/yum/brew install libreoffice  # use the package manager for your system

Windows

Install LibreOffice.
Append "install_dir\LibreOffice\program" to the PATH environment variable.
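After installing LibreOffice, it can help to confirm that its launcher is reachable from PATH. The following is a minimal sketch, assuming the launcher is exposed as `soffice` or `libreoffice` (the usual executable names). The helper `find_libreoffice` is hypothetical and not part of Magic-Doc.

```python
import shutil


def find_libreoffice(candidates=("soffice", "libreoffice")):
    """Return the path of the first LibreOffice launcher found on PATH, or None.

    The candidate names are an assumption about how LibreOffice is installed;
    adjust them if your distribution uses a different executable name.
    """
    for name in candidates:
        path = shutil.which(name)  # None when the executable is not on PATH
        if path:
            return path
    return None


if __name__ == "__main__":
    location = find_libreoffice()
    if location is None:
        print("LibreOffice was not found on PATH; check the install step above.")
    else:
        print(f"LibreOffice launcher found at: {location}")
```

If the check reports nothing on Windows, the "install_dir\LibreOffice\program" directory is probably missing from PATH.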

Install Magic-Doc

pip install qctc-doc[cpu] --extra-index-url https://wheels.myhloli.com # cpu version
or
pip install qctc-doc[gpu] --extra-index-url https://wheels.myhloli.com # gpu version

Introduction

Magic-Doc is a lightweight open-source tool that converts multiple file types (PPT/PPTX/DOC/DOCX/PDF) to Markdown. It supports both local files and files stored in S3.

Example

# Local file: no S3 configuration is needed
from magic_doc.docconv import DocConverter
converter = DocConverter(s3_config=None)
markdown_content, time_cost = converter.convert("some_doc.pptx", conv_timeout=300)

# Remote file stored in AWS S3: replace the placeholders with your credentials
from magic_doc.docconv import DocConverter, S3Config

s3_config = S3Config(ak='${ak}', sk='${sk}', endpoint='${endpoint}')
converter = DocConverter(s3_config=s3_config)
markdown_content, time_cost = converter.convert("s3://some_bucket/some_doc.pptx", conv_timeout=300)
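To convert several documents in one go, the `convert` call shown above can be wrapped in a small loop. This is a sketch under stated assumptions: `convert_many` is a hypothetical helper, not part of Magic-Doc, and it assumes `time_cost` is a number that can be summed. How `convert` reports failures or timeouts is not covered here.

```python
from typing import Dict, Iterable, Tuple


def convert_many(converter, paths: Iterable[str],
                 conv_timeout: int = 300) -> Tuple[Dict[str, str], float]:
    """Convert each path and collect the Markdown output per document.

    `converter` is any object exposing `convert(path, conv_timeout=...)`
    that returns `(markdown_content, time_cost)`, as in the example above.
    Returns a mapping of path -> Markdown plus the summed time cost.
    """
    results: Dict[str, str] = {}
    total_cost = 0.0
    for path in paths:
        markdown_content, time_cost = converter.convert(path, conv_timeout=conv_timeout)
        results[path] = markdown_content
        total_cost += time_cost  # assumes time_cost is numeric
    return results, total_cost


# Usage (requires Magic-Doc to be installed):
#   from magic_doc.docconv import DocConverter
#   converter = DocConverter(s3_config=None)
#   docs, cost = convert_many(converter, ["a.pptx", "b.docx"])
```

Passing the converter in as an argument keeps a single `DocConverter` instance in use for every file, and the same helper works for local paths and `s3://` paths alike.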

Performance

ENV: AMD EPYC 7742 64-Core Processor, NVIDIA A100, CentOS 7

File Type        Speed (pages/s)
PDF (digital)    347
PDF (OCR)        2.7
PPT              20
PPTX             149
DOC              600
DOCX             1482
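The throughput figures above give a rough way to estimate conversion time for a document of known length. The sketch below simply divides the page count by the listed speed. The helper name and dictionary keys are hypothetical, and real timings depend on hardware and document content, so treat the result as an order-of-magnitude guide for the benchmark environment only.

```python
# Speeds in pages per second, copied from the table above.
SPEED_PAGES_PER_SEC = {
    "pdf_digital": 347,
    "pdf_ocr": 2.7,
    "ppt": 20,
    "pptx": 149,
    "doc": 600,
    "docx": 1482,
}


def estimate_seconds(file_type: str, pages: int) -> float:
    """Estimate conversion time as pages divided by the benchmark speed."""
    if pages < 0:
        raise ValueError("pages must be non-negative")
    return pages / SPEED_PAGES_PER_SEC[file_type]


if __name__ == "__main__":
    for kind in ("pdf_digital", "pdf_ocr"):
        print(f"{kind}: about {estimate_seconds(kind, 270):.1f} s for 270 pages")
```

For example, a 270-page scanned PDF at 2.7 pages/s works out to roughly 100 seconds, while the same page count as a digital PDF takes under a second.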

🖊️ Citation

@misc{2024magic-doc,
    title={Magic-Doc: A Toolkit that Converts Multiple File Types to Markdown},
    author={Magic-Doc Contributors},
    howpublished = {\url{https://github.com/InternLM/magic-doc}},
    year={2024}
}

License

This project is released under the Apache 2.0 license.
