
A lightweight toolbox to manipulate documents

Project description


👋 join us on Discord and WeChat

English | 简体中文

Install

Prerequisites: Python 3.10

Install Dependencies

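Linux/macOS

Install LibreOffice with your package manager:

apt-get install libreoffice   # Debian/Ubuntu
yum install libreoffice       # CentOS/RHEL
brew install libreoffice      # macOS (Homebrew)

Windows

Install LibreOffice, then append "install_dir\LibreOffice\program" to the PATH environment variable.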

Install Magic-Doc

pip install qctc-doc[cpu] --extra-index-url https://wheels.myhloli.com # cpu version
or
pip install qctc-doc[gpu] --extra-index-url https://wheels.myhloli.com # gpu version
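
Before converting anything, it can help to confirm that the prerequisites above are in place. The following is a minimal sketch and not part of Magic-Doc: `check_environment` is a hypothetical helper, and it assumes that the LibreOffice executable is reachable on PATH under the name `soffice` and that the package imports as `magic_doc`. If your LibreOffice installation exposes a different executable name, adjust accordingly.

```python
import importlib.util
import shutil
import sys


def check_environment() -> dict:
    """Report whether each prerequisite appears to be available."""
    return {
        # The README lists Python 3.10 as the prerequisite.
        "python_3_10": sys.version_info[:2] == (3, 10),
        # Assumption: LibreOffice is reachable as `soffice` on PATH.
        "libreoffice_on_path": shutil.which("soffice") is not None,
        # True if the `magic_doc` package can be found by this interpreter.
        "magic_doc_installed": importlib.util.find_spec("magic_doc") is not None,
    }


if __name__ == "__main__":
    for name, ok in check_environment().items():
        print(f"{name}: {'ok' if ok else 'missing'}")
```

A "missing" entry only indicates that the check failed in the current interpreter and shell; it does not prove the component is absent from the machine.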

Introduction

Magic-Doc is a lightweight open-source tool that converts multiple file types (PPT/PPTX/DOC/DOCX/PDF) to Markdown. It supports both local files and files stored in S3.

Example

# Local file
from magic_doc.docconv import DocConverter

converter = DocConverter(s3_config=None)
# convert() returns the Markdown text and the time cost of the conversion
markdown_content, time_cost = converter.convert("some_doc.pptx", conv_timeout=300)

# Remote file stored in AWS S3
from magic_doc.docconv import DocConverter, S3Config

# Replace ${ak}, ${sk}, and ${endpoint} with your own S3 credentials and endpoint
s3_config = S3Config(ak='${ak}', sk='${sk}', endpoint='${endpoint}')
converter = DocConverter(s3_config=s3_config)
markdown_content, time_cost = converter.convert("s3://some_bucket/some_doc.pptx", conv_timeout=300)
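
To convert a whole folder rather than a single file, the call above can be wrapped in a loop. The sketch below is an illustration under stated assumptions, not a Magic-Doc feature: `convert_directory` and `SUPPORTED_SUFFIXES` are hypothetical names, and the conversion step is passed in as a plain callable so that the helper relies only on the `(markdown_content, time_cost)` return shape shown in the example.

```python
from pathlib import Path
from typing import Callable, Tuple

# Extensions taken from the file types listed in the introduction.
SUPPORTED_SUFFIXES = {".ppt", ".pptx", ".doc", ".docx", ".pdf"}


def convert_directory(
    src_dir: str,
    out_dir: str,
    convert: Callable[[str], Tuple[str, float]],
) -> dict:
    """Convert every supported file in src_dir and write one .md file per input.

    `convert` is any callable returning (markdown_content, time_cost).
    Returns a mapping of input file name to the reported time cost.
    """
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    timings = {}
    for path in sorted(Path(src_dir).iterdir()):
        if not path.is_file() or path.suffix.lower() not in SUPPORTED_SUFFIXES:
            continue  # skip directories and unsupported file types
        markdown_content, time_cost = convert(str(path))
        # Keep the original extension in the output name so that
        # report.doc and report.docx do not overwrite each other.
        (out / f"{path.name}.md").write_text(markdown_content, encoding="utf-8")
        timings[path.name] = time_cost
    return timings
```

With a `converter` built as in the local-file example, this could be called as `convert_directory("docs", "markdown", lambda p: converter.convert(p, conv_timeout=300))`, where the directory names are placeholders.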

Performance

ENV: AMD EPYC 7742 64-Core Processor, NVIDIA A100, CentOS 7

File Type        Speed (pages/s)
PDF (digital)    347
PDF (ocr)        2.7
PPT              20
PPTX             149
DOC              600
DOCX             1482
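
These figures can be used for a rough capacity estimate: time is approximately pages divided by throughput. The sketch below is a back-of-the-envelope calculation only; `estimate_seconds` is a hypothetical helper, and the numbers apply to the hardware listed above, so results on other machines will differ. For example, 200 PPT pages plus 6000 DOC pages come to roughly 200/20 + 6000/600 = 20 seconds.

```python
# Throughput in pages per second, copied from the table above.
PAGES_PER_SECOND = {
    "PDF (digital)": 347,
    "PDF (ocr)": 2.7,
    "PPT": 20,
    "PPTX": 149,
    "DOC": 600,
    "DOCX": 1482,
}


def estimate_seconds(page_counts: dict) -> float:
    """Estimate total conversion time for a {file_type: number_of_pages} mapping."""
    return sum(
        pages / PAGES_PER_SECOND[file_type]
        for file_type, pages in page_counts.items()
    )


if __name__ == "__main__":
    print(estimate_seconds({"PPT": 200, "DOC": 6000}))
```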

All Thanks To Our Contributors:


Acknowledgments

🖊️ Citation

@misc{2024magic-doc,
    title={Magic-Doc: A Toolkit that Converts Multiple File Types to Markdown},
    author={Magic-Doc Contributors},
    howpublished = {\url{https://github.com/InternLM/magic-doc}},
    year={2024}
}

License

This project is released under the Apache 2.0 license.
