Parse complex files (PDF, Docx, PPTX) for LLM consumption

MegaParse - Your Mega Parser for every type of document

MegaParse is a powerful and versatile parser that can handle various types of documents with ease. Whether you're dealing with text, PDFs, PowerPoint presentations, or Word documents, MegaParse has got you covered. Its focus is on parsing with no information loss.

Key Features 🎯

  • Versatile Parser: Handles various types of documents with ease.
  • No Information Loss: Focuses on losing no information during parsing.
  • Fast and Efficient: Designed with speed and efficiency at its core.
  • Wide File Compatibility: Supports text, PDF, PowerPoint presentations, Excel, CSV, and Word documents.
  • Open Source: Freedom is beautiful, and so is MegaParse. Open source and free to use.

Support

  • Files: ✅ PDF ✅ PowerPoint ✅ Word
  • Content: ✅ Tables ✅ TOC ✅ Headers ✅ Footers ✅ Images
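
As a rough illustration of the support list above, a caller might check a file's extension before handing it to the parser. This is only a minimal sketch: the helper name `is_supported` and the extension set (`.pdf`, `.pptx`, `.docx`, taken from the formats named in this description) are assumptions for illustration and are not part of MegaParse's API.

```python
from pathlib import Path

# Assumed extensions for the formats listed under Support (PDF, PowerPoint, Word).
SUPPORTED_EXTENSIONS = {".pdf", ".pptx", ".docx"}


def is_supported(file_path: str) -> bool:
    """Return True if the file's extension is in the assumed supported set."""
    # Compare case-insensitively so "Slides.PPTX" is treated like "slides.pptx".
    return Path(file_path).suffix.lower() in SUPPORTED_EXTENSIONS
```

A check like this only looks at the file name, so it cannot tell whether the file's content is actually valid.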

Installation

pip install megaparse

Usage

  1. Add your OpenAI API key to the .env file

  2. Install poppler on your computer (needed for images and PDFs)

  3. Install tesseract on your computer (needed for images and PDFs)

from megaparse import MegaParse

# Parse the document at file_path
megaparse = MegaParse(file_path="./test.pdf")
content = megaparse.convert()
print(content)

# Save the parsed content as Markdown
megaparse.save_md(content, "./test.md")
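
Before running the example, it can help to confirm that the three prerequisites from the steps above are in place. The following is a minimal sketch using only the standard library. It assumes the key from the .env file ends up in the process environment as `OPENAI_API_KEY` (the name is an assumption, since this description does not state it), and it detects poppler through its `pdftoppm` utility. The helper `missing_prerequisites` is hypothetical and not part of MegaParse.

```python
import os
import shutil


def missing_prerequisites(env=None, which=shutil.which):
    """Return the names of prerequisites that appear to be missing."""
    env = os.environ if env is None else env
    missing = []
    # Assumption: the key is exposed as OPENAI_API_KEY once the .env file is loaded.
    if not env.get("OPENAI_API_KEY"):
        missing.append("OPENAI_API_KEY")
    # pdftoppm ships with poppler; tesseract is the OCR engine's command-line binary.
    for name, binary in (("poppler", "pdftoppm"), ("tesseract", "tesseract")):
        if which(binary) is None:
            missing.append(name)
    return missing
```

Calling `missing_prerequisites()` with no arguments inspects the current environment and `PATH`; an empty list means nothing obvious is missing. The `env` and `which` parameters exist only so the check can be exercised without touching the real system.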

Use LlamaParse

  1. Create an account on Llama Cloud and get your API key.

  2. Call MegaParse with the llama_parse_api_key parameter

from megaparse import MegaParse

megaparse = MegaParse(file_path="./test.pdf", llama_parse_api_key="llx-your_api_key")
content = megaparse.convert()
print(content)
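
To avoid hard-coding the key in source files, the Llama Cloud key can be read from the environment and passed on only when it is present. This is a sketch under stated assumptions: the environment variable name `LLAMA_CLOUD_API_KEY` and the helpers `megaparse_kwargs` and `convert_file` are chosen for illustration and are not part of MegaParse. The only MegaParse calls used are the ones shown above.

```python
import os


def megaparse_kwargs(file_path, env=None):
    """Build keyword arguments for MegaParse, adding the LlamaParse key if set."""
    env = os.environ if env is None else env
    kwargs = {"file_path": file_path}
    # Assumption: the Llama Cloud key is stored in LLAMA_CLOUD_API_KEY.
    key = env.get("LLAMA_CLOUD_API_KEY", "").strip()
    if key:
        kwargs["llama_parse_api_key"] = key
    return kwargs


def convert_file(file_path):
    """Parse a file, using LlamaParse only when a key is available."""
    # Imported lazily so the helper above can be used without megaparse installed.
    from megaparse import MegaParse

    return MegaParse(**megaparse_kwargs(file_path)).convert()
```

With no key in the environment, `convert_file("./test.pdf")` behaves like the first usage example; with a key set, it matches the LlamaParse example.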

Benchmark

  • Diff megaparse unstructured: 120
  • Diff llama parse: 31
  • Diff megaparse llama: 26

Lower is better
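
This description does not say how the Diff figures are computed. One plausible way to get a comparable "lower is better" number is to count the lines that differ between a hand-written reference Markdown file and a parser's output. The sketch below does that with Python's standard `difflib`. It is an assumption for illustration, not the project's actual benchmark script, and `diff_count` is a hypothetical helper.

```python
import difflib


def diff_count(reference: str, candidate: str) -> int:
    """Count lines removed from or added to the reference, line by line."""
    ref_lines = reference.splitlines()
    cand_lines = candidate.splitlines()
    changed = 0
    for line in difflib.ndiff(ref_lines, cand_lines):
        # ndiff tags removed lines with "- " and added lines with "+ ";
        # unchanged lines ("  ") and hint lines ("? ") are ignored.
        if line.startswith(("- ", "+ ")):
            changed += 1
    return changed
```

Under this measure, identical texts score 0, and a single altered line scores 2 (one removal plus one addition).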

Next Steps

  • Improve Table Parsing
  • Improve Image Parsing and description
  • Add TOC for Docx
  • Add Hyperlinks for Docx
  • Order Headers for Docx to Markdown
