Skip to main content

Parse complex files (PDF,Docx,PPTX) for LLM consumption

Project description

MegaParse - Your Parser for every type of documents

Quivr-logo

MegaParse is a powerful and versatile parser that can handle various types of documents with ease. Whether you're dealing with text, PDFs, Powerpoint presentations, Word documents MegaParse has got you covered. Focus on having no information loss during parsing.

Key Features 🎯

  • Versatile Parser: MegaParse is a powerful and versatile parser that can handle various types of documents with ease.
  • No Information Loss: Focus on having no information loss during parsing.
  • Fast and Efficient: Designed with speed and efficiency at its core.
  • Wide File Compatibility: Supports Text, PDF, Powerpoint presentations, Excel, CSV, Word documents.
  • Open Source: Freedom is beautiful, and so is MegaParse. Open source and free to use.

Support

  • Files: ✅ PDF ✅ Powerpoint ✅ Word
  • Content: ✅ Tables ✅ TOC ✅ Headers ✅ Footers ✅ Images

Example

https://github.com/QuivrHQ/MegaParse/assets/19614572/1b4cdb73-8dc2-44ef-b8b4-a7509bc8d4f3

Installation

pip install megaparse 

Usage

  1. Add your OpenAI or Anthropic API key to the .env file

  2. Install poppler on your computer (images and PDFs)

  3. Install tesseract on your computer (images and PDFs)

  4. If you have a mac, you also need to install libmagic brew install libmagic

from megaparse.core.megaparse import MegaParse
from langchain_openai import ChatOpenAI
from megaparse.core.parser.unstructured_parser import UnstructuredParser

model = ChatOpenAI(model="gpt-4o", api_key=os.getenv("OPENAI_API_KEY"))  # or any langchain compatible Chat Models
parser = UnstructuredParser(model=model)
megaparse = MegaParse(parser)
response = megaparse.load("./test.pdf")
print(response)
megaparse.save("./test.md") #saves the last processed doc in md format

Use MegaParse Vision

  • Change the parser to MegaParseVision
from megaparse.core.megaparse import MegaParse
from langchain_openai import ChatOpenAI
from megaparse.core.parser.megaparse_vision import MegaParseVision

model = ChatOpenAI(model="gpt-4o", api_key=os.getenv("OPENAI_API_KEY"))  # type: ignore
parser = MegaParseVision(model=model)
megaparse = MegaParse(parser)
response = megaparse.load("./test.pdf")
print(response)
megaparse.save("./test.md")

Note: The model supported by MegaParse Vision are the multimodal ones such as claude 3.5, claude 4, gpt-4o and gpt-4.

(Optional) Use LlamaParse for Improved Results

  1. Create an account on Llama Cloud and get your API key.

  2. Change the parser to LlamaParser

from megaparse.core.megaparse import MegaParse
from langchain_openai import ChatOpenAI
from megaparse.core.parser.llama import LlamaParser

parser = LlamaParser(api_key = os.getenv("LLAMA_CLOUD_API_KEY"))
megaparse = MegaParse(parser)
response = megaparse.load("./test.pdf")
print(response)
megaparse.save("./test.md") #saves the last processed doc in md format

Use as an API

There is a MakeFile for you, simply use : make dev at the root of the project and you are good to go.

See localhost:8000/docs for more info on the different endpoints !

BenchMark

Parser similarity_ratio
megaparse_vision 0.87
unstructured_with_check_table 0.77
unstructured 0.59
llama_parser 0.33

Higher the better

Note: Want to evaluate and compare your Megaparse module with ours ? Please add your config in evaluations/script.py and then run python evaluations/script.py. If it is better, do a PR, I mean, let's go higher together 🚀.

In Construction 🚧

  • Improve table checker
  • Create Checkers to add modular postprocessing ⚙️
  • Add Structured output, let's get computer talking 🤖

Star History

Star History Chart

Project details


Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

megaparse-0.0.42.tar.gz (3.3 MB view details)

Uploaded Source

Built Distribution

megaparse-0.0.42-py3-none-any.whl (366.9 kB view details)

Uploaded Python 3

File details

Details for the file megaparse-0.0.42.tar.gz.

File metadata

  • Download URL: megaparse-0.0.42.tar.gz
  • Upload date:
  • Size: 3.3 MB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/5.1.1 CPython/3.12.7

File hashes

Hashes for megaparse-0.0.42.tar.gz
Algorithm Hash digest
SHA256 6e23067b632c19cb72a563c527b247e106371eca113a1ac63d42be61c0785064
MD5 029cf6d64d020ec3a70ed9805dba9a7a
BLAKE2b-256 5bd6f1377f70d7c0fb58a8d26ae824367e8c4a174f928adf97ef6a8d02f3824f

See more details on using hashes here.

File details

Details for the file megaparse-0.0.42-py3-none-any.whl.

File metadata

  • Download URL: megaparse-0.0.42-py3-none-any.whl
  • Upload date:
  • Size: 366.9 kB
  • Tags: Python 3
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/5.1.1 CPython/3.12.7

File hashes

Hashes for megaparse-0.0.42-py3-none-any.whl
Algorithm Hash digest
SHA256 3808f172d6ad58e2a13792d4cee863201479f88af560d4de25511244f8ae6f1d
MD5 fbab3dbcc42599256052782084b97ca4
BLAKE2b-256 2c0a0863a4d7c5d9b4dffc59ad3e2c3b2a0343c8f6af13c58fbc49f24e47443e

See more details on using hashes here.

Supported by

AWS AWS Cloud computing and Security Sponsor Datadog Datadog Monitoring Fastly Fastly CDN Google Google Download Analytics Microsoft Microsoft PSF Sponsor Pingdom Pingdom Monitoring Sentry Sentry Error logging StatusPage StatusPage Status page