A universal GenAI-based local parser for complex documents of all types.
🖍️ LlaMarker
Your go-to tool for converting and parsing documents into clean, well-structured Markdown!
Fast, intuitive, and entirely local 💻🚀.
✨ Key Features
- ✨ All-in-One Parsing: Supports TXT, DOCX, PDF, PPT, XLSX, and more, and even processes images inside documents.
- 🖼️ Visual Content Extraction: Utilizes Llama 3.2 Vision to detect images, tables, charts, and diagrams, converting them into rich Markdown.
- 🏗️ Built with Marker: Extends the open-source Marker parser to handle complex content types locally.
- 🛡️ Local-First Privacy: No cloud, no external servers; all processing happens on your machine.
🚀 How It Works
- Parsing & Conversion
  - Parses and converts multiple file types (.txt, .docx, .pdf, .ppt, .xlsx, etc.) into Markdown.
  - Leverages Marker for accurate and efficient parsing of both text and visual elements.
  - Extracts images, charts, and tables, embedding them in Markdown.
  - (Optional) Converts documents into PDFs using LibreOffice for easy viewing.
- Visual Analysis
  - Distinguishes logos from content-rich images.
  - Extracts text from images while preserving its original language.
  - Uses multiple agents to extract useful information from the images.
- Fast & Efficient
  - Supports parallel processing for faster handling of large folders.
- Streamlit GUI
  - A user-friendly interface to upload and parse a single file, multiple files at once, or entire directories.
  - Download results directly from the GUI.
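The parallel-processing idea above can be sketched with the standard library. This is a minimal illustration under stated assumptions, not LlaMarker's actual implementation: `parse_one` is a hypothetical placeholder for the per-file parsing step, and the worker count is an arbitrary choice.

```python
from concurrent.futures import ThreadPoolExecutor
from pathlib import Path
from typing import Callable, Dict, Iterable


def parse_one(path: Path) -> str:
    """Hypothetical stand-in for per-file parsing; returns a fake Markdown title."""
    return f"# {path.stem}"


def parse_many(
    paths: Iterable[Path],
    worker: Callable[[Path], str] = parse_one,
    max_workers: int = 4,
) -> Dict[str, str]:
    """Run `worker` over all paths concurrently and map file name -> result."""
    paths = list(paths)
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # pool.map preserves input order, so results line up with paths.
        results = list(pool.map(worker, paths))
    return {p.name: r for p, r in zip(paths, results)}


if __name__ == "__main__":
    docs = [Path("report.pdf"), Path("slides.ppt")]
    for name, markdown in parse_many(docs).items():
        print(name, "->", markdown)
```

A thread pool suits work that mostly waits on external tools; CPU-bound parsing would more likely call for a process pool instead.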
📑 Table of Contents
- Features
- Prerequisites
- Installation Options
- Basic Usage
- Advanced Usage
- Output Structure
- Code Example
- Contributing
- License
- Acknowledgments
✨ Features
- 📄 Document Conversion: Converts .txt, .docx, and other supported file types into .pdf using LibreOffice (optional if you only need to parse PDFs).
- 📊 Page Counting: Automatically counts pages in PDFs using PyPDF2.
- 🖼️ Image Processing: Analyzes images to differentiate logos from content-rich images. Extracts relevant data and updates the corresponding Markdown file.
- ✍️ Markdown Parsing: Uses Marker to generate clean, structured Markdown files from parsed PDFs.
- 🌐 Multilingual Support: Maintains the original language of the content during extraction.
- 📈 Data Visualization: Generates analysis plots based on the page counts of processed documents.
⚙️ Prerequisites
Before installing or running LlaMarker, please ensure you meet the following requirements:
- Python 3.10+
  - Core language for running LlaMarker.
  - Verify your Python version:
    python --version
- Marker
  - Marker is an open-source parser that LlaMarker extends.
  - To install Marker, follow these steps:
    - Clone the repository:
      git clone https://github.com/VikParuchuri/marker.git
      cd marker
    - Install Marker in editable mode:
      pip install -e .
    - Verify the installation:
      marker --help
  - GPU Support: If you plan to leverage GPUs, ensure PyTorch is installed with CUDA support (e.g., via pytorch-cuda or the official PyTorch distribution).
  - Path Configuration: If Marker is not in your PATH, specify its location with the --marker_path argument.
- LibreOffice
  - Required for converting .docx, .ppt, .xlsx, etc., into .pdf before parsing.
  - Linux (Ubuntu/Debian example):
    sudo apt update
    sudo apt install libreoffice
  - Windows: Download the installer and consider adding LibreOffice to your system PATH.
  - macOS:
    - Download from LibreOffice’s website or
    - Use Homebrew:
      brew install --cask libreoffice
- Ollama & Vision Models
  - Install Ollama for your OS.
  - Pull the required model:
    ollama pull llama3.2-vision
  - Run a quick test to confirm the model is set up correctly.
- Poetry (for local development only)
  - Recommended dependency manager if you’re cloning the repository to develop or modify LlaMarker.
  - Linux/Mac:
    curl -sSL https://install.python-poetry.org | python3 -
    # (If not added to PATH automatically)
    export PATH="$HOME/.local/bin:$PATH"
  - macOS (Homebrew):
    brew install poetry
  - Windows: Follow instructions on Poetry’s official site.
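The prerequisites above can be checked in one pass with a short script. This is a sketch under stated assumptions rather than part of LlaMarker: the executable names are guesses based on the commands shown above (`marker`, `ollama`, `poetry`), and LibreOffice may be exposed as either `soffice` or `libreoffice` depending on the platform.

```python
import shutil
import sys
from typing import Dict, Optional

# Executable names are assumptions; adjust them for your platform.
TOOLS = {
    "marker": ("marker",),
    "libreoffice": ("soffice", "libreoffice"),
    "ollama": ("ollama",),
    "poetry": ("poetry",),
}


def python_ok(version=sys.version_info, minimum=(3, 10)) -> bool:
    """Return True if the interpreter version meets the minimum."""
    return tuple(version[:2]) >= tuple(minimum)


def find_tools(which=shutil.which) -> Dict[str, Optional[str]]:
    """Map each tool to the first executable found on PATH, or None."""
    found = {}
    for tool, candidates in TOOLS.items():
        found[tool] = next((p for p in map(which, candidates) if p), None)
    return found


if __name__ == "__main__":
    print("Python 3.10+:", python_ok())
    for tool, location in find_tools().items():
        print(f"{tool}: {location or 'not found on PATH'}")
```

A missing entry does not necessarily mean the tool is absent; it may simply not be on `PATH`, in which case options such as `--marker_path` still apply.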
🚀 Installation Options
1. Install via PyPI
The simplest approach—ideal if you just want to use LlaMarker rather than develop it:
pip install llamarker
- Requires: Python 3.10+
- After installing, you have access to two main commands:
  - llamarker: CLI tool.
  - llamarker_gui: Streamlit-based GUI for interactive use.
Note: LibreOffice, Marker, and any optional OCR components must be installed separately if you plan to use their respective features.
2. Local Development Setup
If you plan to contribute or dive into the source code:
- Clone the repository:
  git clone https://github.com/RevanKumarD/LlaMarker.git
  cd LlaMarker
- Install dependencies using Poetry:
  poetry install
- Run LlaMarker locally:
  - CLI:
    poetry run python llamarker/llamarker.py --directory <directory_path>
  - GUI:
    poetry run streamlit run llamarker/llamarker_gui.py
No requirements.txt is provided; Poetry is the recommended (and supported) method for local development.
📌 Basic Usage
CLI Usage
Installed via PyPI
- Process a folder:
llamarker --directory <directory_path>
- Process a single file:
llamarker --file <file_path>
Local Development
- CLI:
poetry run python llamarker/llamarker.py --directory <directory_path>
Streamlit GUI
A user-friendly interface to upload files/directories, parse them, and download results.
- Installed via PyPI:
llamarker_gui
- Local Development:
poetry run streamlit run llamarker/llamarker_gui.py
Open the link (e.g., http://localhost:8501) in your browser to start using LlaMarker via GUI.
🔧 Advanced Usage
Command-Line Arguments
| Argument | Description |
|---|---|
| --directory | Root directory containing documents to process. |
| --file | Path to a single file to process (optional). |
| --temp_dir | Temporary directory for intermediate files (optional). |
| --save_pdfs | Flag to save PDFs in a separate directory (PDFs) under the root directory. |
| --output | Directory to save output files (optional). By default, parsed Markdown files are stored in ParsedFiles and images go under ParsedFiles/pics. |
| --marker_path | Path to the Marker executable (optional). Auto-detects if Marker is in your PATH. |
| --force_ocr | Force OCR on all pages, even if text is extractable. Useful for poorly formatted PDFs or PPTs. |
| --languages | Comma-separated list of languages for OCR (default: "en"). |
| --qa_evaluator | Enable QA Evaluator for selecting the best response during image processing. |
| --verbose | Set verbosity level: 0 = WARNING, 1 = INFO, 2 = DEBUG (default: 0). |
| --model | Ollama model for image analysis (default: llama3.2-vision). A local vision model is required for this to work. |
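To make the argument table concrete, here is an illustrative `argparse` parser that mirrors the documented flags. It is a sketch, not LlaMarker's source: treating `--save_pdfs`, `--force_ocr`, and `--qa_evaluator` as boolean switches and `--languages` as a single comma-separated string is inferred from the table, and `build_parser`, `split_languages`, and `LOG_LEVELS` are hypothetical names.

```python
import argparse
import logging

# Verbosity mapping as documented in the table: 0 = WARNING, 1 = INFO, 2 = DEBUG.
LOG_LEVELS = {0: logging.WARNING, 1: logging.INFO, 2: logging.DEBUG}


def build_parser() -> argparse.ArgumentParser:
    """Illustrative parser mirroring the documented flags (not LlaMarker's source)."""
    parser = argparse.ArgumentParser(prog="llamarker")
    parser.add_argument("--directory", help="Root directory containing documents.")
    parser.add_argument("--file", help="Path to a single file to process.")
    parser.add_argument("--temp_dir", help="Temporary directory for intermediate files.")
    parser.add_argument("--save_pdfs", action="store_true", help="Save converted PDFs.")
    parser.add_argument("--output", help="Directory to save output files.")
    parser.add_argument("--marker_path", help="Path to the Marker executable.")
    parser.add_argument("--force_ocr", action="store_true", help="Force OCR on all pages.")
    parser.add_argument("--languages", default="en", help="Comma-separated OCR languages.")
    parser.add_argument("--qa_evaluator", action="store_true", help="Enable QA Evaluator.")
    parser.add_argument("--verbose", type=int, choices=[0, 1, 2], default=0)
    parser.add_argument("--model", default="llama3.2-vision", help="Ollama vision model.")
    return parser


def split_languages(value: str) -> list:
    """Turn "en,de,fr" into ["en", "de", "fr"], ignoring blanks."""
    return [lang.strip() for lang in value.split(",") if lang.strip()]


if __name__ == "__main__":
    args = build_parser().parse_args(
        ["--directory", "/path/to/docs", "--force_ocr", "--languages", "en,de,fr", "--verbose", "2"]
    )
    logging.basicConfig(level=LOG_LEVELS[args.verbose])
    print(split_languages(args.languages), args.force_ocr, args.model)
```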
Example Commands
- Directory processing:
llamarker --directory /path/to/documents
- Single file with verbose output:
llamarker --file /path/to/document.docx --verbose 2
- Parsing with OCR in multiple languages:
llamarker --directory /path/to/docs --force_ocr --languages "en,de,fr"
- Save parsed PDFs to a custom folder:
llamarker --directory /path/to/docs --save_pdfs --output /path/to/output
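The example commands above can also be assembled from a script, for instance when batch jobs are driven from Python. The helper below is a hypothetical sketch that only uses flags listed in the argument table; requiring exactly one of `directory` or `file` is a simplification of this sketch, not a documented rule of the CLI.

```python
import subprocess
from typing import List, Optional, Sequence


def build_command(
    directory: Optional[str] = None,
    file: Optional[str] = None,
    output: Optional[str] = None,
    save_pdfs: bool = False,
    force_ocr: bool = False,
    languages: Sequence[str] = (),
    verbose: int = 0,
) -> List[str]:
    """Assemble a `llamarker` argv list from the documented flags."""
    if bool(directory) == bool(file):
        raise ValueError("Pass exactly one of directory or file.")
    cmd = ["llamarker"]
    cmd += ["--directory", directory] if directory else ["--file", file]
    if output:
        cmd += ["--output", output]
    if save_pdfs:
        cmd.append("--save_pdfs")
    if force_ocr:
        cmd.append("--force_ocr")
    if languages:
        cmd += ["--languages", ",".join(languages)]
    if verbose:
        cmd += ["--verbose", str(verbose)]
    return cmd


def run(cmd: List[str]) -> subprocess.CompletedProcess:
    """Run the command; raises CalledProcessError on a non-zero exit."""
    return subprocess.run(cmd, check=True)


if __name__ == "__main__":
    cmd = build_command(directory="/path/to/docs", force_ocr=True, languages=["en", "de", "fr"])
    print(" ".join(cmd))
```

Passing the command as a list rather than a shell string avoids quoting problems with paths that contain spaces.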
Output Structure
After processing, LlaMarker organizes files as follows:
- ParsedFiles: Contains the generated Markdown files.
  - pics: subfolder for extracted images.
- PDFs: Stores converted PDF files (if --save_pdfs is used).
- OutDir: Contains processed PDF files (used by the GUI).
- logs: Holds log files for each run (processing status, errors, etc.).
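Given the layout above, collecting the results of a run takes only a few lines. This sketch assumes that Markdown files sit directly inside `ParsedFiles` with a `.md` extension and that `root` is the directory containing `ParsedFiles`; `collect_outputs` is a hypothetical helper, not part of LlaMarker.

```python
from pathlib import Path
from typing import Dict, List


def collect_outputs(root: Path) -> Dict[str, List[str]]:
    """List Markdown files and extracted images under <root>/ParsedFiles."""
    parsed = root / "ParsedFiles"
    pics = parsed / "pics"
    markdown = sorted(p.name for p in parsed.glob("*.md")) if parsed.is_dir() else []
    images = sorted(p.name for p in pics.iterdir() if p.is_file()) if pics.is_dir() else []
    return {"markdown": markdown, "images": images}


if __name__ == "__main__":
    outputs = collect_outputs(Path("/path/to/documents"))
    print(f"{len(outputs['markdown'])} Markdown files, {len(outputs['images'])} images")
```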
Code Example
For local development, you can programmatically use LlaMarker:
from llamarker import LlaMarker

llamarker = LlaMarker(
    input_dir="/path/to/documents",
    save_pdfs=True,
    output_dir="/path/to/output",
    verbose=1
)

# Process all documents in the specified directory
llamarker.process_documents()

# Generate summary info
results = llamarker.generate_summary()
for file, page_count in results:
    print(f"{file}: {page_count} pages")

# Generate analysis plots
llamarker.plot_analysis(llamarker.parent_dir)
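If the summary is needed as numbers rather than printed lines, the pairs can be aggregated further. The sketch below assumes `generate_summary()` yields `(file, page_count)` pairs, as the loop above suggests; `summarize` and `text_histogram` are hypothetical helpers, and the text histogram is only a stand-in for inspection, not a reimplementation of `plot_analysis`.

```python
from collections import Counter
from typing import Iterable, List, Tuple


def summarize(results: Iterable[Tuple[str, int]]) -> dict:
    """Aggregate (file, page_count) pairs into simple totals."""
    pairs = list(results)
    return {
        "documents": len(pairs),
        "total_pages": sum(pages for _, pages in pairs),
        "longest": max(pairs, key=lambda fp: fp[1])[0] if pairs else None,
    }


def text_histogram(results: Iterable[Tuple[str, int]], bucket: int = 10) -> List[str]:
    """Bucket page counts (0-9, 10-19, ...) and render one bar per bucket."""
    buckets = Counter((pages // bucket) * bucket for _, pages in results)
    return [f"{start}-{start + bucket - 1}: {'#' * n}" for start, n in sorted(buckets.items())]


if __name__ == "__main__":
    sample = [("a.pdf", 3), ("b.pdf", 12), ("c.pdf", 7)]  # made-up sample data
    print(summarize(sample))
    print("\n".join(text_histogram(sample)))
```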
🚧 Shortcomings & Future Updates
Current Shortcomings:
- Limited OCR Accuracy for Complex Documents
  - While OCR works well for most cases, it may struggle with highly complex layouts or poorly scanned documents.
- No Direct Cloud Integration
  - Currently, LlaMarker only supports local processing. There’s no option to process files directly from cloud storage services like Google Drive or Dropbox.
- Basic Support for PPT and XLSX Parsing
  - Parsing of PPT and XLSX files is available but lacks advanced formatting support (e.g., slide layouts, complex charts).
- Poor XLSX to PDF Conversion
  - The current conversion of XLSX files to PDF results in poorly formatted output. Improvements are needed to handle large spreadsheets and complex tables.
- Manual Setup for Marker and LibreOffice
  - Users must manually install Marker and LibreOffice, which can be cumbersome for those unfamiliar with the setup process.
Planned Future Updates:
- Enhanced OCR Capabilities
  - Improve OCR performance by integrating additional vision models for better handling of complex document layouts and multi-column formats.
- Cloud Storage Integration
  - Add support for uploading documents directly from cloud services (Google Drive, Dropbox, OneDrive).
- Improved PPT & XLSX Handling
  - Enhance parsing accuracy for PPT and XLSX files by adding better support for slides, tables, and embedded charts.
- Better XLSX to PDF Conversion
  - Improve the XLSX to PDF conversion process to handle large sheets and complex tables while maintaining proper formatting.
- Cross-Platform Installation Script
  - Provide an easy-to-use installation script for all platforms (Linux, Windows, macOS) to automate the setup of dependencies like Marker and LibreOffice.
Contributing
Contributions are welcome! Feel free to open an issue or submit a pull request. Let’s make LlaMarker even more powerful—together. 🤝
License
This project references the Marker repository, which comes with its own license. Please review the Marker repo for licensing restrictions and guidelines.
© 2025 Revan Kumar Dhanasekaran. Released under the GPLv3 License.
Acknowledgments
- Huge thanks to the Marker project for providing an excellent foundation for parsing.
- Special thanks to the open-source community for continuous support and contributions.
Happy Parsing! 🌟