
A lightweight toolbox to manipulate documents


👋 join us on Discord and WeChat


Install

Prerequisites: Python 3.10

Install Dependencies

Linux/macOS

apt-get/yum/brew install libreoffice

Windows

Install LibreOffice.
Append "install_dir\LibreOffice\program" to the PATH environment variable.
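
Before installing Magic-Doc, it can help to confirm that LibreOffice is reachable from Python. The following is a minimal sketch, assuming the LibreOffice launcher is named `soffice` or `libreoffice` (the usual names in LibreOffice installs). `find_libreoffice` is a hypothetical helper, not part of Magic-Doc, and how Magic-Doc itself locates LibreOffice is not covered here.

```python
import shutil
from typing import Optional, Sequence

# Assumed launcher names; adjust if your install uses a different one.
CANDIDATES = ("soffice", "libreoffice")


def find_libreoffice(names: Sequence[str] = CANDIDATES) -> Optional[str]:
    """Return the full path of the first launcher found on PATH, else None."""
    for name in names:
        path = shutil.which(name)
        if path:
            return path
    return None


location = find_libreoffice()
if location:
    print(f"LibreOffice found at {location}")
else:
    print("LibreOffice not found on PATH; recheck the install steps")
```

If the check fails on Windows, the most likely cause is that the `program` directory has not been added to PATH, or that the terminal was opened before PATH was changed.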

Install Magic-Doc

pip install qctc-doc[cpu] --extra-index-url https://wheels.myhloli.com # cpu version
or
pip install qctc-doc[gpu] --extra-index-url https://wheels.myhloli.com # gpu version

Introduction

Magic-Doc is a lightweight open-source tool that converts multiple file types (PPT/PPTX/DOC/DOCX/PDF) to Markdown. It supports both local files and files stored in S3.
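
As a rough illustration of the inputs described above, the sketch below checks whether a path has one of the listed extensions and whether it points at S3. `SUPPORTED_EXTENSIONS`, `is_supported`, and `is_s3_path` are hypothetical helpers for pre-filtering input, not part of Magic-Doc. The `s3://` prefix is assumed from the remote-file example in this README.

```python
from pathlib import PurePosixPath

# Extensions taken from the file types listed in the introduction.
SUPPORTED_EXTENSIONS = {".ppt", ".pptx", ".doc", ".docx", ".pdf"}


def is_s3_path(path: str) -> bool:
    """True if the path uses the s3:// scheme (assumed convention)."""
    return path.lower().startswith("s3://")


def is_supported(path: str) -> bool:
    """True if the file extension is one of the listed types (case-insensitive)."""
    return PurePosixPath(path).suffix.lower() in SUPPORTED_EXTENSIONS


for candidate in ["some_doc.pptx", "s3://some_bucket/some_doc.pdf", "notes.txt"]:
    print(candidate, is_supported(candidate), is_s3_path(candidate))
```

This only inspects the file name; it does not verify that the file content matches its extension.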

Example

# For a local file
from magic_doc.docconv import DocConverter

converter = DocConverter(s3_config=None)  # no S3 access needed
markdown_content, time_cost = converter.convert("some_doc.pptx", conv_timeout=300)

# For a remote file stored in AWS S3
from magic_doc.docconv import DocConverter, S3Config

# Replace the ${...} placeholders with your access key, secret key, and endpoint
s3_config = S3Config(ak='${ak}', sk='${sk}', endpoint='${endpoint}')
converter = DocConverter(s3_config=s3_config)
markdown_content, time_cost = converter.convert("s3://some_bucket/some_doc.pptx", conv_timeout=300)
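
To convert several files in one go, the call shown above can be wrapped in a small loop. This is a minimal sketch, not part of Magic-Doc: `convert_all` is a hypothetical helper that accepts any callable with the same shape as `converter.convert` (a path plus `conv_timeout`, returning the Markdown text and the time cost) and writes one `.md` file per input. How `convert` reports failures or timeouts is not documented here, so the sketch does no error handling.

```python
from pathlib import Path
from typing import Any, Callable, Dict, Iterable, Tuple

# Assumed shape, mirroring the example: convert(path, conv_timeout=...) -> (markdown, time_cost)
ConvertFn = Callable[..., Tuple[str, Any]]


def convert_all(
    convert: ConvertFn,
    paths: Iterable[str],
    out_dir: str,
    conv_timeout: int = 300,
) -> Dict[str, Any]:
    """Convert each path, write <stem>.md into out_dir, and return the time cost per path."""
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    costs: Dict[str, Any] = {}
    for p in paths:
        markdown_content, time_cost = convert(p, conv_timeout=conv_timeout)
        # Note: inputs sharing the same stem (a.pptx, a.docx) overwrite each other.
        target = out / (Path(p).stem + ".md")
        target.write_text(markdown_content, encoding="utf-8")
        costs[p] = time_cost
    return costs


# Usage with Magic-Doc (requires the package to be installed):
# from magic_doc.docconv import DocConverter
# converter = DocConverter(s3_config=None)
# convert_all(converter.convert, ["a.pptx", "b.docx"], "out")
```

Passing the conversion function in as an argument keeps the helper independent of how the converter was configured, so the same loop works for local and S3-backed converters.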

Performance

ENV: AMD EPYC 7742 64-Core Processor, NVIDIA A100, CentOS 7

File Type        Speed (pages/s)
PDF (digital)    347
PDF (OCR)        2.7
PPT              20
PPTX             149
DOC              600
DOCX             1482
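
These throughput figures can be turned into a rough conversion-time estimate, which may help when choosing a `conv_timeout` value. The sketch below is an illustration only: `estimate_seconds` is a hypothetical helper, it assumes time scales linearly with page count, it ignores startup overhead, and it assumes `conv_timeout` is expressed in seconds. The numbers apply to the benchmark hardware listed above and will differ on other machines.

```python
# Throughput figures copied from the benchmark table (pages per second).
PAGES_PER_SECOND = {
    "pdf_digital": 347,
    "pdf_ocr": 2.7,
    "ppt": 20,
    "pptx": 149,
    "doc": 600,
    "docx": 1482,
}


def estimate_seconds(file_type: str, pages: int, safety_factor: float = 1.0) -> float:
    """Estimate conversion time as pages / throughput, scaled by a safety factor."""
    return pages / PAGES_PER_SECOND[file_type] * safety_factor


print(round(estimate_seconds("pdf_ocr", 500), 1))   # → 185.2
print(round(estimate_seconds("pdf_ocr", 1000), 1))  # → 370.4
```

Under these assumptions, a 500-page scanned PDF fits within a 300-second timeout, while a 1000-page one would not.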


Acknowledgments

🖊️ Citation

@misc{2024magic-doc,
    title={Magic-Doc: A Toolkit that Converts Multiple File Types to Markdown},
    author={Magic-Doc Contributors},
    howpublished = {\url{https://github.com/InternLM/magic-doc}},
    year={2024}
}

License

This project is released under the Apache 2.0 license.

