A universal GenAI-based local parser for complex documents of all types.
🖍️ LlaMarker
Your go-to tool for converting and parsing documents into clean, well-structured Markdown!
Fast, intuitive, and entirely local 💻🚀.
✨ Key Features
- ✨ All-in-One Parsing: Supports TXT, DOCX, PDF, PPT, XLSX, and more—even processes images inside documents.
- 🖼️ Visual Content Extraction: Utilizes Llama 3.2 Vision to detect images, tables, charts, and diagrams, converting them into rich Markdown.
- 🏗️ Built with Marker: Extends the open-source Marker parser to handle complex content types locally.
- 🛡️ Local-First Privacy: No cloud, no external servers—all processing happens on your machine.
🚀 How It Works
- Parsing & Conversion
  - Parses and converts multiple file types (`.txt`, `.docx`, `.pdf`, `.ppt`, `.xlsx`, etc.) into Markdown.
  - Leverages Marker for accurate and efficient parsing of both text and visual elements.
  - Extracts images, charts, and tables, embedding them in Markdown.
  - (Optional) Converts documents into PDFs using LibreOffice for easy viewing.
- Visual Analysis
  - Distinguishes logos from content-rich images.
  - Extracts and preserves the original language from images.
  - Uses multiple agents to extract useful information from the images.
- Fast & Efficient
  - Supports parallel processing for faster handling of large folders.
- Streamlit GUI
  - A user-friendly interface to upload and parse files (including multiple files at once) or entire directories.
  - Download results directly from the GUI.
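The parallel handling of large folders can be sketched with Python's standard library. This is an illustrative outline, not LlaMarker's actual code: `parse_file` is a hypothetical stand-in for the per-file pipeline, and the worker pool simply fans it out across the folder's supported files.

```python
from concurrent.futures import ThreadPoolExecutor
from pathlib import Path

# File types the parser accepts (matches the list in Key Features).
SUPPORTED = {".txt", ".docx", ".pdf", ".ppt", ".xlsx"}

def parse_file(path: Path) -> str:
    """Hypothetical stand-in for LlaMarker's per-file parsing step."""
    return f"{path.name}: parsed"

def parse_folder(root: Path, max_workers: int = 4) -> list[str]:
    """Fan per-file parsing out across a worker pool."""
    files = sorted(p for p in root.rglob("*") if p.suffix.lower() in SUPPORTED)
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(parse_file, files))
```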
✨ Features
- 📄 Document Conversion: Converts `.txt`, `.docx`, and other supported file types into `.pdf` using LibreOffice.
- 📊 Page Counting: Automatically counts pages in PDFs using PyPDF2.
- 🖼️ Image Processing: Analyzes images to differentiate logos from content-rich images. Extracts relevant data and updates the corresponding Markdown file.
- ✍️ Markdown Parsing: Uses Marker to generate clean, structured Markdown files from parsed PDFs.
- 🌐 Multilingual Support: Maintains the original language of the content during extraction.
- 📈 Data Visualization: Generates analysis plots based on the page counts of processed documents.
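The conversion step above implies a simple routing rule: PDFs are parsed directly, while every other supported type goes through LibreOffice first. A minimal sketch of that rule (the helper name `needs_libreoffice` is illustrative, not LlaMarker's API):

```python
from pathlib import Path

# File types LlaMarker accepts; everything except PDFs is first
# converted to PDF via LibreOffice before Marker parses it.
SUPPORTED = {".txt", ".docx", ".pdf", ".ppt", ".xlsx"}

def needs_libreoffice(path: str) -> bool:
    """Return True if the file must be converted to PDF before parsing."""
    suffix = Path(path).suffix.lower()
    if suffix not in SUPPORTED:
        raise ValueError(f"Unsupported file type: {suffix}")
    return suffix != ".pdf"
```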
🛠️ Installation
🔧 Requirements
- Python 3.10+ – Core language for running LlaMarker.
- Marker – Open-source parser tool. Ensure it's installed locally or available in your `PATH`.
- LibreOffice – Required for document conversion (optional if you only need to parse PDFs).
- (Recommended) Poetry – Dependency manager for Python.
⚙️ Pre-Requisites
Below are the essential steps to get your environment ready for LlaMarker. Follow the instructions based on your OS.
🖥️ LibreOffice Installation
- Linux
  - Update your package list and install LibreOffice:

    ```shell
    sudo apt update
    sudo apt install libreoffice
    ```

  - Ensure Marker is installed and available in your `PATH`. You can also specify its location using the `--marker_path` argument.
- Windows
  - Download and install LibreOffice.
  - During installation, enable the option to add LibreOffice to your system `PATH` (optional but recommended).
- macOS
  - Option 1: Download LibreOffice from LibreOffice’s website and drag it into the `Applications` folder.
  - Option 2 (Homebrew):

    ```shell
    brew install --cask libreoffice
    ```
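The optional document-to-PDF conversion can be driven through LibreOffice's headless mode. Below is a minimal sketch: the `soffice` binary name and flags are standard LibreOffice CLI, but the wrapper functions are illustrative, not LlaMarker's actual code.

```python
import subprocess
from pathlib import Path

def build_convert_cmd(src: Path, out_dir: Path) -> list[str]:
    """Build the LibreOffice headless command that converts `src` to PDF."""
    return [
        "soffice", "--headless",
        "--convert-to", "pdf",
        "--outdir", str(out_dir),
        str(src),
    ]

def convert_to_pdf(src: Path, out_dir: Path) -> Path:
    """Run the conversion and return the expected output path."""
    subprocess.run(build_convert_cmd(src, out_dir), check=True)
    return out_dir / (src.stem + ".pdf")
```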
🛠️ Poetry Installation
- Linux / macOS
  - Install Poetry using the official installation script:

    ```shell
    curl -sSL https://install.python-poetry.org | python3 -
    ```

  - If Poetry is not added to your `PATH` automatically, add it manually:

    ```shell
    export PATH="$HOME/.local/bin:$PATH"
    ```

    (You can add this line to your shell configuration file, e.g., `.bashrc` or `.zshrc`, for permanent access.)
- macOS (Homebrew)
  - Alternatively, you can use Homebrew:

    ```shell
    brew install poetry
    ```
- Windows
  - Download the installer from Poetry’s official site and run it.
  - After installation, open a new terminal and verify Poetry is installed:

    ```shell
    poetry --version
    ```
- Windows Subsystem for Linux (WSL)
  - Follow the Linux installation steps.
🧠 Installing Ollama & Vision Models
- Install Ollama: Follow the instructions provided on the Ollama GitHub repo for your OS.
- Download Vision Models: Once Ollama is installed, pull the required model:

  ```shell
  ollama pull llama3.2-vision
  ```

- Verify Model Setup: Run a sample inference to ensure everything is working correctly.
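One way to confirm the model is in place is to check the output of `ollama list` before running a sample inference. The helper below is a hypothetical convenience, not part of LlaMarker; only the `ollama list` command itself is standard Ollama CLI.

```python
import subprocess

def model_available(listing: str, name: str) -> bool:
    """Check whether `name` appears in `ollama list` output (header line skipped)."""
    return any(
        line.split()[0].startswith(name)
        for line in listing.strip().splitlines()[1:]
        if line.split()
    )

def check_ollama_model(name: str = "llama3.2-vision") -> bool:
    """Ask the local Ollama daemon whether the model has been pulled."""
    out = subprocess.run(
        ["ollama", "list"], capture_output=True, text=True, check=True
    ).stdout
    return model_available(out, name)
```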
🚀 Installing LlaMarker
- Clone the repository:

  ```shell
  git clone https://github.com/RevanKumarD/LlaMarker.git
  cd LlaMarker
  ```

- Install dependencies using Poetry:

  ```shell
  poetry install
  ```

  Note: A `post_install` script for installing LibreOffice is included for Linux systems only. On Windows or macOS, install LibreOffice manually as described above.
💡 Quick Tips
- Make sure Python 3.10+ is installed before proceeding.
- If you encounter issues during the installation, refer to the official Poetry documentation.
- Ensure that Marker and LibreOffice are correctly added to your `PATH` for seamless execution of LlaMarker.
🔍 Usage
CLI Usage
```shell
poetry run python llamarker/llamarker.py --directory <directory_path> [options]
```
Arguments:
| Argument | Description |
|---|---|
| `--directory` | Root directory containing documents to process. |
| `--file` | Path to a single file to process (optional). |
| `--temp_dir` | Temporary directory for intermediate files (optional). |
| `--save_pdfs` | Flag to save PDFs in a separate `PDFs` directory under the root directory. |
| `--output` | Directory to save output files (optional). By default, parsed Markdown files are stored in the `ParsedFiles` folder under the root directory, and images go under `pics` in `ParsedFiles`. |
| `--marker_path` | Path to the Marker executable (optional). The program should auto-detect the Marker path if it’s in your `PATH`. |
| `--force_ocr` | Force OCR on all pages, even if text is extractable. Helpful for poorly formatted PDFs or PPTs. |
| `--languages` | Comma-separated list of languages for OCR (default: `"en"`). |
| `--qa_evaluator` | Enable the QA Evaluator for selecting the best response during image processing. |
| `--verbose` | Set verbosity level: 0 = WARNING, 1 = INFO, 2 = DEBUG (default: 0). |
| `--model` | Ollama model for image analysis (default: `llama3.2-vision`). A local vision model is required for this to work. |
Example Commands
- Processing a directory:

  ```shell
  poetry run python llamarker/llamarker.py --directory /path/to/documents
  ```

- Processing a single file with verbose output:

  ```shell
  poetry run python llamarker/llamarker.py --file /path/to/document.docx --verbose 2
  ```

- Parsing with OCR in multiple languages:

  ```shell
  poetry run python llamarker/llamarker.py --directory /path/to/documents --force_ocr --languages "en,de,fr"
  ```

- Saving parsed PDFs separately:

  ```shell
  poetry run python llamarker/llamarker.py --directory /path/to/documents --save_pdfs --output /path/to/output
  ```
Running the Streamlit GUI
LlaMarker also comes with a Streamlit-based graphical user interface, making it simpler to:
- Upload files (including multiple files at once) or entire directories
- Parse documents
- Download the resulting Markdown files
To launch the Streamlit app:
```shell
poetry run streamlit run llamarker/llamarker_gui.py
```
Once running, open the provided local URL in your browser to interact with LlaMarker.
Output Structure
- `OutDir` – Contains processed PDF files (used by the GUI).
- `ParsedFiles` – Contains the generated Markdown files.
  - `pics` subfolder: Holds extracted images from the processed files.
- `PDFs` – Stores converted PDF files (if `--save_pdfs` is used).
- `logs` – Stores log files for each run, helping you track processing status and errors.
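Given the default layout above, the parsed Markdown and extracted images can be collected with `pathlib`. The directory names follow the structure described; the helper itself is illustrative, not part of LlaMarker's API.

```python
from pathlib import Path

def collect_outputs(root: str) -> tuple[list[Path], list[Path]]:
    """Return (Markdown files, extracted images) from a LlaMarker run rooted at `root`."""
    parsed = Path(root) / "ParsedFiles"
    markdown = sorted(parsed.glob("*.md"))
    pics = parsed / "pics"
    images = sorted(pics.glob("*")) if pics.exists() else []
    return markdown, images
```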
Code Example
Here’s a quick example showing how to use LlaMarker programmatically from Python:

```python
from llamarker import LlaMarker

llamarker = LlaMarker(
    input_dir="/path/to/documents",
    save_pdfs=True,
    output_dir="/path/to/output",
    verbose=1,
)

# Process all documents in the specified directory
llamarker.process_documents()

# Generate summary information
results = llamarker.generate_summary()
for file, pages in results:
    print(f"{file}: {pages} pages")

# Generate analysis plots
llamarker.plot_analysis(llamarker.parent_dir)
```
Contributing
Contributions are welcome! Feel free to open an issue or submit a pull request. We appreciate all the help we can get in making LlaMarker even better. 🤝
License
This project references the Marker repository, which comes with its own license. Please review the Marker repo for licensing restrictions and guidelines.
© 2025 Revan Kumar Dhanasekaran. Released under the GPLv3 License.
Acknowledgments
- Huge thanks to the Marker project for providing an excellent foundation for parsing PDFs.
- Special thanks to the open-source community for continuous support and contributions.
Happy Parsing! 🌟