MagicConvert is a Python library that converts various document formats (PDF, DOCX, XLSX, PPTX, HTML, images) to Markdown text. Features include OCR support, automatic format detection, and URL/file stream handling.
MagicConvert: The Ultimate File-to-Markdown Conversion Library
MagicConvert is a powerful and user-friendly Python library designed to convert various file formats into Markdown. Whether you're dealing with documents, images, web content, or spreadsheets, MagicConvert makes the process effortless. Equipped with built-in OCR (Optical Character Recognition), it can even extract text from images, making it an essential tool for developers, researchers, and anyone working with Markdown workflows. It’s especially helpful for LLM (Large Language Model) integrations!
✨ Why Choose MagicConvert?
MagicConvert is your go-to tool for file-to-Markdown conversion. Here’s what makes it special:
- Supports Multiple File Formats: Convert documents, images, spreadsheets, web pages, and more into Markdown.
- OCR Integration: Extract text from scanned images and documents using Tesseract OCR.
- Convert Web Content: Quickly transform URLs or HTML files into clean, readable Markdown.
- Markdown for AI & LLMs: Simplify content preparation for AI models using structured Markdown.
- Simple & Efficient: An intuitive API that makes file conversion a breeze.
🚀 Installation
Getting started is easy! Install MagicConvert using pip:
pip install MagicConvert
Note: For OCR functionality, make sure you have Tesseract OCR installed on your system.
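A quick way to confirm the prerequisite above is to look for the Tesseract executable before converting any images. This is a minimal sketch using only the standard library; the helper name `tesseract_available` is hypothetical, and it assumes the OCR step locates Tesseract through the system `PATH`, which the text above does not state explicitly.

```python
import shutil


def tesseract_available(binary: str = "tesseract") -> bool:
    """Return True if the named executable can be found on PATH.

    Hypothetical helper: it only checks PATH and does not verify
    the Tesseract version or installed language data.
    """
    return shutil.which(binary) is not None


if not tesseract_available():
    print("Tesseract was not found on PATH; image OCR will probably fail.")
```

If Tesseract is installed in a non-standard location, this check can report a false negative, so treat it as a hint rather than a guarantee.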
PyPI link: MagicConvert on PyPI
📚 Getting Started
1. Import and Initialize
Begin by importing MagicConvert and initializing the converter:
from MagicConvert import MagicConvert
converter = MagicConvert()
2. Convert Files to Markdown
MagicConvert supports various file types. Here are some examples:
Convert Word Documents
result = converter.magic("document.docx")
print(result.get_text)
Convert PowerPoint Presentations
result = converter.magic("presentation.pptx")
print(result.get_text)
Convert PDFs
result = converter.magic("document.pdf")
print(result.get_text)
Convert Images (OCR)
result = converter.magic("image.png")
print(result.get_text)
Convert Web Content (URLs)
result = converter.magic("https://example.com")
print(result.get_text)
Convert Plain Text Files
result = converter.magic("example.txt")
print(result.get_text)
Convert HTML Files
result = converter.convert_local("webpage.html")
print(result.get_text)
Convert Excel Files
result = converter.convert_local("spreadsheet.xlsx")
print(result.get_text)
Convert CSV Files
result = converter.convert_local("data.csv")
print(result.get_text)
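The examples above only print the converted text. A common next step is writing it to a `.md` file, which might be sketched as follows. The helpers `result_text` and `save_markdown` are hypothetical and not part of MagicConvert. The examples above read `result.get_text` as an attribute; the sketch also accepts the case where it turns out to be a method, which is an assumption on our part.

```python
from pathlib import Path
from types import SimpleNamespace


def result_text(result) -> str:
    """Read the Markdown text from a conversion result.

    Assumption: the result exposes `get_text`, as in the examples above.
    If `get_text` is callable, it is called; otherwise it is used directly.
    """
    value = result.get_text
    return value() if callable(value) else value


def save_markdown(result, out_path) -> Path:
    """Write the converted Markdown to out_path as UTF-8 and return the path."""
    out = Path(out_path)
    out.write_text(result_text(result), encoding="utf-8")
    return out


# Stand-in object so the sketch runs without MagicConvert installed.
fake_result = SimpleNamespace(get_text="# Title\n\nBody text.")
print(result_text(fake_result))

# Intended use with the library (not executed here):
#   result = converter.magic("document.docx")
#   save_markdown(result, "document.md")
```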
📂 Supported File Formats
MagicConvert supports a wide range of file formats, making it a versatile tool for various needs:
Document Formats
- Word Documents: .docx
- PDF Files: .pdf
- PowerPoint Presentations: .pptx
- Excel Spreadsheets: .xlsx
- CSV Files: .csv
Web Formats
- HTML Files: .html, .htm
- URLs: http://, https://
Image Formats
- JPEG: .jpg, .jpeg
- PNG: .png
- TIFF: .tiff
- BMP: .bmp
Text Formats
- Plain Text: .txt
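The format list above can be turned into a small pre-check that runs before a conversion is attempted. This is a sketch only: `SUPPORTED_EXTENSIONS`, `URL_PREFIXES`, and `is_supported` are hypothetical names, and the check mirrors the documented list rather than the library's own format detection, which may accept more inputs.

```python
from pathlib import Path

# Extensions copied from the "Supported File Formats" list above.
SUPPORTED_EXTENSIONS = {
    ".docx", ".pdf", ".pptx", ".xlsx", ".csv",   # documents
    ".html", ".htm",                             # web
    ".jpg", ".jpeg", ".png", ".tiff", ".bmp",    # images
    ".txt",                                      # plain text
}
URL_PREFIXES = ("http://", "https://")


def is_supported(source: str) -> bool:
    """Return True if source is a URL or has a listed file extension."""
    if source.lower().startswith(URL_PREFIXES):
        return True
    return Path(source).suffix.lower() in SUPPORTED_EXTENSIONS


for candidate in ["report.DOCX", "https://example.com", "song.mp3"]:
    print(candidate, is_supported(candidate))
```

The comparison is case-insensitive, so `report.DOCX` is treated the same as `report.docx`.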
📅 Future Work
MagicConvert is constantly evolving. Here are some features planned for the future:
- Audio-to-Text Markdown: Convert audio files (e.g., .mp3, .wav) into Markdown by transcribing them with speech recognition.
- Video Subtitles to Markdown: Extract captions or subtitles from video files and convert them into Markdown.
- Advanced Formatting Options: Customizable Markdown output with styles like tables, headers, and inline code.
- Multi-language OCR Support: Enhanced text recognition for multiple languages.
- Cloud Integration: Save converted Markdown directly to cloud platforms like Google Drive, Dropbox, etc.
- Batch Conversion: Process multiple files simultaneously for large-scale projects.
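Until batch conversion is available in the library, several files can be processed with a plain loop. The sketch below is not a MagicConvert feature: `convert_many` is a hypothetical helper that takes any conversion callable (for example `converter.magic`), runs the sources one after another rather than simultaneously, and collects failures instead of stopping at the first one.

```python
from typing import Callable, Dict, Iterable, Tuple


def convert_many(
    sources: Iterable[str],
    convert: Callable[[str], object],
) -> Tuple[Dict[str, object], Dict[str, Exception]]:
    """Apply convert to each source in order.

    Returns (results, errors): results maps each source to its
    conversion result, errors maps each failed source to the
    exception that was raised.
    """
    results: Dict[str, object] = {}
    errors: Dict[str, Exception] = {}
    for source in sources:
        try:
            results[source] = convert(source)
        except Exception as exc:  # keep going so one bad file does not stop the batch
            errors[source] = exc
    return results, errors


# Intended use with the library (not executed here):
#   results, errors = convert_many(["a.docx", "b.pdf"], converter.magic)
```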
Want to contribute ideas? Let us know!
👨‍💻 Contributing
MagicConvert is developed by Muhammad Noman, a student at Iqra University. Contributions, feedback, and bug reports are always welcome!
Here’s how you can get in touch or contribute:
- Email: muhammadnomanshafiq76@gmail.com
- LinkedIn: Muhammad Noman
- GitHub Repository: MagicConvert on GitHub
If you enjoy using MagicConvert, feel free to ⭐️ the repository on GitHub and share it with others!
📃 License
MagicConvert is open-source and licensed under the MIT License. You are free to use, modify, and distribute the library as per the license terms.
💡 Summary
MagicConvert is the ultimate tool for converting files into Markdown, whether you’re preparing content for AI models, creating documentation, or simply working with Markdown-based workflows. Its ease of use, wide format support, and robust features make it an indispensable tool for developers, researchers, and content creators.
Try MagicConvert today and unlock the power of seamless file-to-Markdown conversion! 🚀