Parse complex files (PDF, DOCX, PPTX) for LLM consumption

Project description

MegaParse - Your Mega Parser for every type of documents

MegaParse is a powerful and versatile parser that handles various types of documents with ease. Whether you're dealing with text, PDFs, PowerPoint presentations, or Word documents, MegaParse has got you covered. Its focus is on parsing with no information loss.

Key Features 🎯

  • Versatile Parser: MegaParse is a powerful and versatile parser that handles various types of documents with ease.
  • No Information Loss: Focuses on parsing without losing information.
  • Fast and Efficient: Designed with speed and efficiency at its core.
  • Wide File Compatibility: Supports text, PDF, PowerPoint presentations, Excel, CSV, and Word documents.
  • Open Source: Freedom is beautiful, and so is MegaParse. Open source and free to use.

Support

  • Files: ✅ PDF ✅ PowerPoint ✅ Word
  • Content: ✅ Tables ✅ TOC ✅ Headers ✅ Footers ✅ Images

Example

https://github.com/QuivrHQ/MegaParse/assets/19614572/1b4cdb73-8dc2-44ef-b8b4-a7509bc8d4f3

Installation

pip install megaparse 

Usage

  1. Add your OpenAI or Anthropic API key to the .env file

  2. Install poppler on your computer (images and PDFs)

  3. Install tesseract on your computer (images and PDFs)

  4. If you are on a Mac, you also need to install libmagic: brew install libmagic
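
The prerequisites above can be checked before the first run. This is a minimal sketch, not part of MegaParse: the function name is made up for illustration, and the binary names are assumptions (poppler normally ships pdftoppm, tesseract ships tesseract). It only looks at environment variables, so a key stored in .env must already have been loaded into the environment.

```python
import os
import shutil

# Executables the steps above rely on. The binary names are assumptions:
# poppler normally provides `pdftoppm`, tesseract provides `tesseract`.
REQUIRED_BINARIES = {"poppler": "pdftoppm", "tesseract": "tesseract"}
# Step 1 asks for an OpenAI or Anthropic key; either one is accepted here.
API_KEY_VARS = ("OPENAI_API_KEY", "ANTHROPIC_API_KEY")


def missing_prerequisites(env=None, which=shutil.which):
    """Return a list of human-readable problems; an empty list means all good."""
    env = os.environ if env is None else env
    problems = []
    if not any(env.get(name) for name in API_KEY_VARS):
        problems.append("no API key set (" + " or ".join(API_KEY_VARS) + ")")
    for package, binary in REQUIRED_BINARIES.items():
        if which(binary) is None:
            problems.append(f"{package} not found on PATH (looked for {binary})")
    return problems


# Print one line per missing prerequisite; prints nothing when everything is set.
for problem in missing_prerequisites():
    print("missing:", problem)
```

The env and which parameters exist only so the check can be exercised without touching the real machine.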

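import os
from langchain_openai import ChatOpenAI
# MegaParse and UnstructuredParser come from the megaparse package;
# the exact module path depends on the installed release.

model = ChatOpenAI(model="gpt-4o", api_key=os.getenv("OPENAI_API_KEY"))  # or any LangChain-compatible chat model
parser = UnstructuredParser(model=model)
megaparse = MegaParse(parser)
response = megaparse.load("./test.pdf")
print(response)
megaparse.save("./test.md")  # saves the last processed doc in md format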

Use MegaParse Vision

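  • Change the parser to MegaParseVision

import os
from langchain_openai import ChatOpenAI
# MegaParse and MegaParseVision come from the megaparse package;
# the exact module path depends on the installed release.

model = ChatOpenAI(model="gpt-4o", api_key=os.getenv("OPENAI_API_KEY"))  # type: ignore
parser = MegaParseVision(model=model)
megaparse = MegaParse(parser)
response = megaparse.load("./test.pdf")
print(response)
megaparse.save("./test.md")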

Note: The models supported by MegaParse Vision are the multimodal ones, such as Claude 3.5, Claude 4, GPT-4o and GPT-4.

(Optional) Use LlamaParse for Improved Results

  1. Create an account on Llama Cloud and get your API key.

  2. Change the parser to LlamaParser

import os
from parser.llama import LlamaParser

parser = LlamaParser(api_key=os.getenv("LLAMA_CLOUD_API_KEY"))
megaparse = MegaParse(parser)
response = megaparse.load("./test.pdf")
print(response)
megaparse.save("./test.md")  # saves the last processed doc in md format

Use as an API

A Makefile is provided: run make dev at the root of the project and you are good to go.

See localhost:8000/docs for more info on the different endpoints!
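
To list the endpoints from a script instead of the browser, something like the sketch below may help. It assumes the server also publishes an OpenAPI schema at /openapi.json, which is common for services with a /docs page but is not confirmed by this README. The function names are illustrative and no MegaParse endpoint names are assumed.

```python
import json
from urllib.request import urlopen

HTTP_METHODS = {"get", "post", "put", "patch", "delete"}


def list_endpoints(spec):
    """Return sorted (METHOD, path) pairs from an OpenAPI document (a dict)."""
    endpoints = []
    for path, operations in spec.get("paths", {}).items():
        for method in operations:
            # Path items can also hold non-method keys such as "parameters".
            if method.lower() in HTTP_METHODS:
                endpoints.append((method.upper(), path))
    return sorted(endpoints)


def fetch_spec(base_url="http://localhost:8000"):
    """Download the schema; assumes the server exposes /openapi.json."""
    with urlopen(f"{base_url}/openapi.json") as response:
        return json.load(response)


# With the server from `make dev` running:
#   for method, path in list_endpoints(fetch_spec()):
#       print(method, path)
```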

Benchmark

Parser                          similarity_ratio
megaparse_vision                0.87
unstructured_with_check_table   0.77
unstructured                    0.59
llama_parser                    0.33

Higher is better.

Note: Want to evaluate and compare your MegaParse module with ours? Add your config in evaluations/script.py and then run python evaluations/script.py. If it is better, open a PR. Let's go higher together 🚀.
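
For a feel of what a similarity_ratio score means, here is a minimal sketch using Python's difflib. That the benchmark is computed this way is an assumption based on the metric's name. evaluations/script.py is the authoritative definition.

```python
from difflib import SequenceMatcher


def similarity_ratio(reference: str, candidate: str) -> float:
    """Score in [0, 1]; 1.0 means the two strings are identical."""
    return SequenceMatcher(None, reference, candidate).ratio()


# Toy strings standing in for a reference document and a parser's output.
reference = "| a | b |\n| 1 | 2 |"
print(similarity_ratio(reference, reference))  # 1.0
```

A parser that drops or mangles part of the reference text gets a score below 1.0.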

In Construction 🚧

  • Improve table checker
  • Create Checkers to add modular postprocessing ⚙️
  • Add structured output, let's get computers talking 🤖


Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

megaparse-0.0.32.tar.gz (3.0 MB)

Uploaded Source

Built Distribution

megaparse-0.0.32-py3-none-any.whl (21.5 kB)

Uploaded Python 3

File details

Details for the file megaparse-0.0.32.tar.gz.

File metadata

  • Download URL: megaparse-0.0.32.tar.gz
  • Upload date:
  • Size: 3.0 MB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/5.1.1 CPython/3.12.7

File hashes

Hashes for megaparse-0.0.32.tar.gz
Algorithm Hash digest
SHA256 4ff41e6d629c1a683b1242103a59c1129f1dd91d38c93acd400d623f60a1f127
MD5 b37d6d315c6c503fdfd482a816d6911d
BLAKE2b-256 eb72652044340da9259ac8aea09f48821e9dd70721e4997efa8b5f27d9667618
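
A downloaded archive can be checked against the SHA256 digest listed above. This is a minimal sketch using only the standard library. The helper names and the local file path in the usage comment are hypothetical.

```python
import hashlib

# SHA256 digest listed above for megaparse-0.0.32.tar.gz.
EXPECTED_SHA256 = "4ff41e6d629c1a683b1242103a59c1129f1dd91d38c93acd400d623f60a1f127"


def sha256_of(path, chunk_size=1 << 20):
    """Hash the file in chunks so large archives do not need to fit in memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches(path, expected=EXPECTED_SHA256):
    """True when the file's SHA256 digest equals the expected hex string."""
    return sha256_of(path) == expected.lower()


# Usage, once the archive is downloaded (the local path is hypothetical):
#   matches("./megaparse-0.0.32.tar.gz")
```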


File details

Details for the file megaparse-0.0.32-py3-none-any.whl.

File metadata

  • Download URL: megaparse-0.0.32-py3-none-any.whl
  • Upload date:
  • Size: 21.5 kB
  • Tags: Python 3
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/5.1.1 CPython/3.12.7

File hashes

Hashes for megaparse-0.0.32-py3-none-any.whl
Algorithm Hash digest
SHA256 de012089a84ea4a8982a9a150dcefdcb343c09cf6692c4c58914386bb167a16f
MD5 c0c67d301feabd6abd84ad48074058e7
BLAKE2b-256 ed7ef77012585b6661d607dff1bf86c3dd69bbd964a6e5f7ce4e03874a90e355

