
Parse complex files (PDF, DOCX, PPTX) for LLM consumption


MegaParse - Your Parser for Every Type of Document


MegaParse is a powerful and versatile parser that can handle various types of documents with ease. Whether you're dealing with text, PDFs, PowerPoint presentations, or Word documents, MegaParse has you covered. Its focus is on avoiding information loss during parsing.

Key Features 🎯

  • Versatile Parser: MegaParse is a powerful and versatile parser that can handle various types of documents with ease.
  • No Information Loss: Parsing is designed to preserve the content of the original document.
  • Fast and Efficient: Designed with speed and efficiency at its core.
  • Wide File Compatibility: Supports text, PDF, PowerPoint presentations, Excel, CSV, and Word documents.
  • Open Source: Freedom is beautiful, and so is MegaParse. Open source and free to use.

Support

  • Files: ✅ PDF ✅ PowerPoint ✅ Word
  • Content: ✅ Tables ✅ TOC ✅ Headers ✅ Footers ✅ Images

Example

https://github.com/QuivrHQ/MegaParse/assets/19614572/1b4cdb73-8dc2-44ef-b8b4-a7509bc8d4f3

Installation

pip install megaparse 

Usage

  1. Add your OpenAI or Anthropic API key to the .env file

  2. Install poppler on your computer (images and PDFs)

  3. Install tesseract on your computer (images and PDFs)

  4. If you are on a Mac, you also need to install libmagic: brew install libmagic

import os

from megaparse.core.megaparse import MegaParse
from langchain_openai import ChatOpenAI
from megaparse.core.parser.unstructured_parser import UnstructuredParser

# Any LangChain-compatible chat model can be used here
model = ChatOpenAI(model="gpt-4o", api_key=os.getenv("OPENAI_API_KEY"))
parser = UnstructuredParser(model=model)
megaparse = MegaParse(parser)
response = megaparse.load("./test.pdf")
print(response)
megaparse.save("./test.md")  # saves the last processed doc in md format
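
Before running the example above, it can help to confirm that the prerequisites from steps 1 to 4 are in place. The following is a minimal sketch using only the standard library. The helper name check_prerequisites is hypothetical and not part of MegaParse. It assumes that poppler provides the pdftoppm executable, that tesseract provides the tesseract executable, and that the key from your .env file has already been loaded into the environment under OPENAI_API_KEY or ANTHROPIC_API_KEY (the latter name is an assumption).

```python
import os
import shutil


def check_prerequisites(env=None, which=shutil.which):
    """Return a list of problems; an empty list means every check passed.

    Hypothetical helper for illustration, not part of MegaParse.
    """
    env = os.environ if env is None else env
    problems = []

    # Step 1: at least one LLM provider key should be available.
    # ANTHROPIC_API_KEY is an assumed variable name.
    if not (env.get("OPENAI_API_KEY") or env.get("ANTHROPIC_API_KEY")):
        problems.append("no OPENAI_API_KEY or ANTHROPIC_API_KEY in the environment")

    # Steps 2 and 3: look for one executable from each system package.
    for tool, package in (("pdftoppm", "poppler"), ("tesseract", "tesseract")):
        if which(tool) is None:
            problems.append(f"{tool} not found on PATH (install {package})")

    return problems


if __name__ == "__main__":
    for problem in check_prerequisites():
        print("Missing:", problem)
```

The env and which parameters exist only so the checks can be exercised without touching the real environment. Note that a .env file is not read automatically by os.environ; this sketch only inspects variables that are already set.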

Use MegaParse Vision

  • Change the parser to MegaParseVision

import os

from megaparse.core.megaparse import MegaParse
from langchain_openai import ChatOpenAI
from megaparse.core.parser.megaparse_vision import MegaParseVision

# The model must be multimodal (see the note below)
model = ChatOpenAI(model="gpt-4o", api_key=os.getenv("OPENAI_API_KEY"))  # type: ignore
parser = MegaParseVision(model=model)
megaparse = MegaParse(parser)
response = megaparse.load("./test.pdf")
print(response)
megaparse.save("./test.md")  # saves the last processed doc in md format

Note: The models supported by MegaParse Vision are multimodal ones such as Claude 3.5, Claude 4, GPT-4o and GPT-4.

(Optional) Use LlamaParse for Improved Results

  1. Create an account on Llama Cloud and get your API key.

  2. Change the parser to LlamaParser

import os

from megaparse.core.megaparse import MegaParse
from megaparse.core.parser.llama import LlamaParser

# LlamaParser needs only the Llama Cloud API key; no chat model is passed
parser = LlamaParser(api_key=os.getenv("LLAMA_CLOUD_API_KEY"))
megaparse = MegaParse(parser)
response = megaparse.load("./test.pdf")
print(response)
megaparse.save("./test.md")  # saves the last processed doc in md format

Use as an API

A Makefile is provided: run make dev at the root of the project to start the API.

See localhost:8000/docs for more information on the available endpoints.
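
If you would rather list the endpoints from a script than from the browser, the sketch below may help. It rests on an assumption the text above does not confirm: that the server publishes an OpenAPI schema at /openapi.json, which is the default for FastAPI applications that serve their documentation at /docs. The function names list_endpoints and fetch_schema are hypothetical.

```python
import json
import urllib.request


def list_endpoints(schema):
    """Return sorted (METHOD, path) pairs from an OpenAPI schema dict."""
    http_methods = {"get", "post", "put", "patch", "delete", "head", "options"}
    pairs = []
    for path, operations in schema.get("paths", {}).items():
        for method in operations:
            # Path items may also hold non-method keys such as "parameters".
            if method.lower() in http_methods:
                pairs.append((method.upper(), path))
    return sorted(pairs)


def fetch_schema(base_url="http://localhost:8000"):
    """Download the schema. Assumes it is served at /openapi.json."""
    with urllib.request.urlopen(f"{base_url}/openapi.json", timeout=5) as resp:
        return json.load(resp)


# Usage, with the API running after `make dev`:
#     for method, path in list_endpoints(fetch_schema()):
#         print(method, path)
```

If the schema lives at a different URL, only fetch_schema needs to change; list_endpoints works on any OpenAPI document already loaded as a dict.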

Benchmark

Parser                          similarity_ratio
megaparse_vision                0.87
unstructured_with_check_table   0.77
unstructured                    0.59
llama_parser                    0.33

Higher is better.
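
The README does not say how similarity_ratio is computed. As an illustration only, the sketch below assumes a character-level similarity between a reference Markdown document and a parser's output, using difflib.SequenceMatcher from the standard library. The actual metric in evaluations/script.py may differ.

```python
from difflib import SequenceMatcher


def similarity_ratio(reference: str, parsed: str) -> float:
    """Return a similarity score between 0.0 and 1.0 (1.0 means identical).

    Assumed metric for illustration; not confirmed as MegaParse's own.
    """
    return SequenceMatcher(None, reference, parsed).ratio()


if __name__ == "__main__":
    reference = "# Title\n\n| a | b |\n|---|---|\n| 1 | 2 |\n"
    parsed = "# Title\n\na b\n1 2\n"
    print(round(similarity_ratio(reference, parsed), 2))
```

A ratio of this kind rewards output that keeps both the text and the Markdown structure of the reference, so a parser that drops table markup would be expected to score lower than one that preserves it.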

Note: Want to evaluate and compare your MegaParse module with ours? Add your config in evaluations/script.py and then run python evaluations/script.py. If it scores higher, open a PR and let's go higher together 🚀.

Under Construction 🚧

  • Improve the table checker
  • Create checkers to add modular postprocessing ⚙️
  • Add structured output, let's get computers talking 🤖

Download files


Source Distribution

megaparse-0.0.36.tar.gz (3.3 MB)

Uploaded Source

Built Distribution

megaparse-0.0.36-py3-none-any.whl (366.8 kB)

Uploaded Python 3
