Parse complex files (PDF, DOCX, PPTX) for LLM consumption
MegaParse - Your Mega Parser for every type of document
MegaParse is a powerful and versatile parser that can handle various types of documents with ease. Whether you're dealing with text, PDFs, PowerPoint presentations, or Word documents, MegaParse has got you covered. Its focus is on losing no information during parsing.
Key Features 🎯
- Versatile Parser: Handles various types of documents with ease.
- No Information Loss: The focus is on losing no information during parsing.
- Fast and Efficient: Designed with speed and efficiency at its core.
- Wide File Compatibility: Supports text, PDF, PowerPoint presentations, Excel, CSV, and Word documents.
- Open Source: Freedom is beautiful, and so is MegaParse. Open source and free to use.
Support
- Files: ✅ PDF ✅ PowerPoint ✅ Word
- Content: ✅ Tables ✅ TOC ✅ Headers ✅ Footers ✅ Images
Example
https://github.com/QuivrHQ/MegaParse/assets/19614572/1b4cdb73-8dc2-44ef-b8b4-a7509bc8d4f3
Installation
pip install megaparse
Usage
- Add your OpenAI or Anthropic API key to the .env file
- Install poppler on your computer (images and PDFs)
- Install tesseract on your computer (images and PDFs)
- If you have a Mac, you also need to install libmagic
brew install libmagic
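The prerequisites above can be checked before the first run. The following is a minimal sketch and not part of MegaParse: the helper name `missing_prerequisites` is hypothetical, and it assumes that poppler can be detected through its `pdftoppm` binary and tesseract through its `tesseract` binary on the PATH. It reads keys from the process environment, so a .env file has to be loaded first; `ANTHROPIC_API_KEY` is an assumed variable name.

```python
import os
import shutil


def missing_prerequisites(env=None, which=shutil.which):
    """Return the names of prerequisites that appear to be missing."""
    env = os.environ if env is None else env
    missing = []
    # Assumption: poppler ships `pdftoppm` and tesseract ships `tesseract`.
    for label, binary in (("poppler", "pdftoppm"), ("tesseract", "tesseract")):
        if which(binary) is None:
            missing.append(label)
    # Either key is enough; ANTHROPIC_API_KEY is an assumed variable name.
    if not (env.get("OPENAI_API_KEY") or env.get("ANTHROPIC_API_KEY")):
        missing.append("API key")
    return missing


if __name__ == "__main__":
    problems = missing_prerequisites()
    print("All set" if not problems else "Missing: " + ", ".join(problems))
```

The `env` and `which` parameters exist only so the check can be exercised without touching the real system.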
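import os

from langchain_openai import ChatOpenAI
# MegaParse and UnstructuredParser come from the megaparse package; check your installed version for the exact import paths.

# Any LangChain-compatible chat model can be used here.
model = ChatOpenAI(model="gpt-4o", api_key=os.getenv("OPENAI_API_KEY"))
parser = UnstructuredParser(model=model)
megaparse = MegaParse(parser)
response = megaparse.load("./test.pdf")  # parse the document
print(response)
megaparse.save("./test.md")  # saves the last processed doc in Markdown format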
Use MegaParse Vision
- Change the parser to MegaParseVision
model = ChatOpenAI(model="gpt-4o", api_key=os.getenv("OPENAI_API_KEY")) # type: ignore
parser = MegaParseVision(model=model)
megaparse = MegaParse(parser)
response = megaparse.load("./test.pdf")
print(response)
megaparse.save("./test.md")
Note: The models supported by MegaParse Vision are the multimodal ones, such as Claude 3.5, Claude 4, GPT-4o and GPT-4.
(Optional) Use LlamaParse for Improved Results
- Create an account on Llama Cloud and get your API key.
- Change the parser to LlamaParser
from parser.llama import LlamaParser
parser = LlamaParser(api_key = os.getenv("LLAMA_CLOUD_API_KEY"))
megaparse = MegaParse(parser)
response = megaparse.load("./test.pdf")
print(response)
megaparse.save("./test.md") #saves the last processed doc in md format
Use as an API
A Makefile is provided. Simply run:
make dev
at the root of the project and you are good to go.
See localhost:8000/docs for more information on the available endpoints.
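To get the list of endpoints programmatically rather than through the browser, something like the sketch below may help. It assumes the server exposes a machine-readable OpenAPI schema at /openapi.json alongside /docs, which is common for FastAPI-style servers but is not confirmed here. The helper names `list_endpoints` and `fetch_spec` are hypothetical.

```python
import json
import urllib.request

HTTP_METHODS = {"get", "post", "put", "patch", "delete", "head", "options"}


def list_endpoints(spec):
    """Return sorted (METHOD, path) pairs from an OpenAPI schema dict."""
    pairs = []
    for path, operations in spec.get("paths", {}).items():
        for method in operations:
            # Skip non-method keys such as "parameters" or "summary".
            if method.lower() in HTTP_METHODS:
                pairs.append((method.upper(), path))
    return sorted(pairs)


def fetch_spec(base_url="http://localhost:8000"):
    """Download the schema; assumes it is served at /openapi.json."""
    with urllib.request.urlopen(base_url + "/openapi.json", timeout=5) as resp:
        return json.load(resp)


# Usage, with the server started through `make dev`:
#     for method, path in list_endpoints(fetch_spec()):
#         print(method, path)
```

Keeping the parsing separate from the download means `list_endpoints` can be tried on any schema dict without a running server.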
Benchmark
Parser | similarity_ratio |
---|---|
megaparse_vision | 0.87 |
unstructured_with_check_table | 0.77 |
unstructured | 0.59 |
llama_parser | 0.33 |
Higher is better.
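For intuition about the score, here is a minimal sketch of a text similarity ratio between a parser's output and a reference document. It assumes a metric in the style of Python's `difflib.SequenceMatcher.ratio()`, which returns a value between 0 and 1. Whether evaluations/script.py computes similarity_ratio this way is not stated here, and the function below is hypothetical.

```python
from difflib import SequenceMatcher


def similarity_ratio(parsed, reference):
    """Score in [0, 1]: 1.0 for identical texts, lower as they diverge."""
    return SequenceMatcher(None, parsed, reference).ratio()


if __name__ == "__main__":
    reference = "| Name | Value |\n| a | 1 |"
    # An output identical to the reference gets the maximum score.
    print(round(similarity_ratio(reference, reference), 2))
    # An output that lost the table structure scores lower.
    print(round(similarity_ratio("Name Value a 1", reference), 2))
```

Under a metric like this, a parser that drops table delimiters or headers is penalised even when the raw words survive, which is consistent with the focus on avoiding information loss.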
Note: Want to evaluate and compare your MegaParse module with ours? Add your config in evaluations/script.py and then run python evaluations/script.py. If it scores better, open a PR. Let's go higher together 🚀.
In Construction 🚧
- Improve table checker
- Create Checkers to add modular postprocessing ⚙️
- Add structured output, let's get computers talking 🤖