Parse complex files (PDF, DOCX, PPTX) for LLM consumption

MegaParse - Your Parser for Every Type of Document

MegaParse is a parser for many types of documents, including plain text, PDFs, PowerPoint presentations, and Word documents. Its focus is parsing with no information loss.

Key Features 🎯

  • Versatile Parser: Handles a wide range of document types.
  • No Information Loss: Aims to lose no information during parsing.
  • Fast and Efficient: Designed with speed and efficiency at its core.
  • Wide File Compatibility: Supports text, PDF, PowerPoint presentations, Excel, CSV, and Word documents.
  • Open Source: Freedom is beautiful, and so is MegaParse. Open source and free to use.

Support

  • Files: ✅ PDF ✅ Powerpoint ✅ Word
  • Content: ✅ Tables ✅ TOC ✅ Headers ✅ Footers ✅ Images

Example

https://github.com/QuivrHQ/MegaParse/assets/19614572/1b4cdb73-8dc2-44ef-b8b4-a7509bc8d4f3

Installation

pip install megaparse 

Usage

  1. Add your OpenAI or Anthropic API key to the .env file

  2. Install poppler on your computer (images and PDFs)

  3. Install tesseract on your computer (images and PDFs)

  4. On macOS, also install libmagic: brew install libmagic
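
The prerequisites in the steps above can be checked before the first run. The following is a minimal sketch and not part of MegaParse: the function name `missing_prerequisites` is hypothetical, and the executable names are assumptions (`pdftoppm` is one of the poppler command-line tools, `tesseract` is the Tesseract CLI). It only inspects the process environment, so a key stored in a .env file must already have been loaded into it.

```python
import os
import shutil

# Assumed executable names: pdftoppm ships with poppler's utilities,
# tesseract is the Tesseract OCR command-line binary.
REQUIRED_BINARIES = {"poppler": "pdftoppm", "tesseract": "tesseract"}
API_KEY_VARS = ("OPENAI_API_KEY", "ANTHROPIC_API_KEY")


def missing_prerequisites(which=shutil.which, env=None):
    """Return a list of problems; an empty list means everything was found."""
    env = os.environ if env is None else env
    problems = []
    for name, binary in REQUIRED_BINARIES.items():
        # shutil.which returns None when the executable is not on PATH.
        if which(binary) is None:
            problems.append(f"{name} not found on PATH (looked for '{binary}')")
    # One of the two keys is enough.
    if not any(env.get(var) for var in API_KEY_VARS):
        problems.append("no API key set: " + " or ".join(API_KEY_VARS))
    return problems


# Usage:
#     for problem in missing_prerequisites():
#         print("missing:", problem)
```

The `which` and `env` parameters exist only so the check can be exercised without touching the real system.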

import os

from langchain_openai import ChatOpenAI
from megaparse.core.megaparse import MegaParse
from megaparse.core.parser.unstructured_parser import UnstructuredParser

# Any LangChain-compatible chat model can be used here.
model = ChatOpenAI(model="gpt-4o", api_key=os.getenv("OPENAI_API_KEY"))
parser = UnstructuredParser(model=model)
megaparse = MegaParse(parser)
response = megaparse.load("./test.pdf")
print(response)
megaparse.save("./test.md")  # saves the last processed document in Markdown format
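
Because save() writes the last processed document, converting several files means calling load() and save() in pairs. The sketch below illustrates that pattern. `convert_folder` is a hypothetical helper, not a MegaParse function, and it relies only on the load(path) and save(path) calls shown in the example above.

```python
from pathlib import Path


def convert_folder(megaparse, source_dir, output_dir):
    """Parse every PDF in source_dir and write one .md file per input.

    `megaparse` is any object exposing load(path) and save(path),
    such as the MegaParse instance built in the example above.
    """
    out = Path(output_dir)
    out.mkdir(parents=True, exist_ok=True)
    written = []
    for pdf in sorted(Path(source_dir).glob("*.pdf")):
        megaparse.load(str(pdf))         # parse the document
        target = out / (pdf.stem + ".md")
        megaparse.save(str(target))      # save() writes the last loaded document
        written.append(target)
    return written


# Usage (assumes `megaparse` was created as shown above):
#     convert_folder(megaparse, "./pdfs", "./markdown")
```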

Use MegaParse Vision

  • Change the parser to MegaParseVision:

import os

from langchain_openai import ChatOpenAI
from megaparse.core.megaparse import MegaParse
from megaparse.core.parser.megaparse_vision import MegaParseVision

model = ChatOpenAI(model="gpt-4o", api_key=os.getenv("OPENAI_API_KEY"))  # type: ignore
parser = MegaParseVision(model=model)
megaparse = MegaParse(parser)
response = megaparse.load("./test.pdf")
print(response)
megaparse.save("./test.md")

Note: MegaParse Vision supports multimodal models such as claude 3.5, claude 4, gpt-4o and gpt-4.

(Optional) Use LlamaParse for Improved Results

  1. Create an account on Llama Cloud and get your API key.

  2. Change the parser to LlamaParser

import os

from megaparse.core.megaparse import MegaParse
from megaparse.core.parser.llama import LlamaParser

parser = LlamaParser(api_key=os.getenv("LLAMA_CLOUD_API_KEY"))
megaparse = MegaParse(parser)
response = megaparse.load("./test.pdf")
print(response)
megaparse.save("./test.md")  # saves the last processed document in Markdown format

Use as an API

A Makefile is provided: run make dev at the root of the project and you are good to go.

See localhost:8000/docs for more information on the available endpoints.
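
The endpoints can also be listed from a script. This is a sketch under an assumption the README does not confirm: that the server publishes an OpenAPI schema at /openapi.json, as FastAPI applications do by default alongside /docs. The function names are hypothetical, and no endpoint names are assumed.

```python
import json
from urllib.request import urlopen

HTTP_METHODS = {"get", "post", "put", "patch", "delete", "head", "options"}


def list_endpoints(spec):
    """Return sorted (METHOD, path) pairs from an OpenAPI document."""
    endpoints = []
    for path, operations in spec.get("paths", {}).items():
        for method in operations:
            # Path items may also hold non-method keys such as "parameters".
            if method.lower() in HTTP_METHODS:
                endpoints.append((method.upper(), path))
    return sorted(endpoints)


def fetch_spec(base_url="http://localhost:8000"):
    """Download the schema; the /openapi.json location is an assumption."""
    with urlopen(base_url + "/openapi.json") as response:
        return json.load(response)


# Usage, with the server started by `make dev`:
#     for method, path in list_endpoints(fetch_spec()):
#         print(method, path)
```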

Benchmark

Parser                          similarity_ratio
megaparse_vision                0.87
unstructured_with_check_table   0.77
unstructured                    0.59
llama_parser                    0.33

Higher is better.

Note: Want to evaluate and compare your MegaParse module with ours? Add your config in evaluations/script.py, then run python evaluations/script.py. If it scores better, open a PR. Let's go higher together 🚀.
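
The README does not say how similarity_ratio is computed. As an illustration of what a score between 0 and 1 can mean, the sketch below compares a reference text with parser output using Python's standard difflib. This is an assumption for illustration and may differ from what evaluations/script.py does.

```python
from difflib import SequenceMatcher


def similarity_ratio(expected, produced):
    """Score in [0, 1]; 1.0 means the two strings are identical."""
    # ratio() is 2 * M / T, where M is the number of matching characters
    # and T is the combined length of both strings.
    return SequenceMatcher(None, expected, produced).ratio()


reference = "| a | b |\n| 1 | 2 |"
print(similarity_ratio(reference, reference))        # identical text scores 1.0
print(similarity_ratio(reference, "a b 1 2") < 1.0)  # lost table markup lowers the score
```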

In Construction 🚧

  • Improve table checker
  • Create Checkers to add modular postprocessing ⚙️
  • Add structured output, let's get computers talking 🤖
