A universal GenAI-based local parser for complex documents of all types.

🖍️ LlaMarker

Your go-to tool for converting and parsing documents into clean, well-structured Markdown!
Fast, intuitive, and entirely local 💻🚀.


✨ Key Features

  • All-in-One Parsing: Supports TXT, DOCX, PDF, PPT, XLSX, and more, and also processes images embedded in documents.
  • 🖼️ Visual Content Extraction: Utilizes Llama 3.2 Vision to detect images, tables, charts, and diagrams, converting them into rich Markdown.
  • 🏗️ Built with Marker: Extends the open-source Marker parser to handle complex content types locally.
  • 🛡️ Local-First Privacy: No cloud, no external servers—all processing happens on your machine.

🚀 How It Works

  1. Parsing & Conversion

    • Parses and converts multiple file types (.txt, .docx, .pdf, .ppt, .xlsx, etc.) into Markdown.
    • Leverages Marker for accurate and efficient parsing of both text and visual elements.
    • Extracts images, charts, and tables, embedding them in Markdown.
    • (Optional) Converts documents into PDFs using LibreOffice for easy viewing.
  2. Visual Analysis

    • Distinguishes logos from content-rich images.
    • Extracts and preserves the original language from images.
    • Uses multiple agents to extract useful information from the images.
  3. Fast & Efficient

    • Supports parallel processing for faster handling of large folders.
  4. Streamlit GUI

    • A user-friendly interface for uploading and parsing single files, multiple files at once, or entire directories.
    • Download results directly from the GUI.
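The parallel handling of large folders mentioned under "Fast & Efficient" can be pictured with a small sketch. This is not LlaMarker's actual implementation: `find_documents`, `parse_one`, and `process_folder` are hypothetical names, and `parse_one` is a placeholder that only reports the file size where a real pipeline would run conversion and parsing.

```python
from concurrent.futures import ThreadPoolExecutor
from pathlib import Path

# Extensions named in this README; the real tool may accept more.
SUPPORTED = {".txt", ".docx", ".pdf", ".ppt", ".xlsx"}


def find_documents(root):
    """Return supported files under root, sorted for a stable order."""
    return sorted(
        p for p in Path(root).rglob("*")
        if p.is_file() and p.suffix.lower() in SUPPORTED
    )


def parse_one(path):
    """Hypothetical per-file worker: stands in for the real parsing step."""
    return path.name, path.stat().st_size


def process_folder(root, worker=parse_one, max_workers=4):
    """Run the worker over every supported file using a thread pool."""
    docs = find_documents(root)
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # map() yields results in input order, even if tasks finish out of order.
        return list(pool.map(worker, docs))
```

A thread pool is a reasonable choice when each worker mostly waits on an external process such as LibreOffice or Marker; CPU-bound workers would be better served by a process pool.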

📑 Table of Contents

  1. Features
  2. Installation
  3. Usage
  4. Output Structure
  5. Code Example
  6. Contributing
  7. License
  8. Acknowledgments

✨ Features

  • 📄 Document Conversion
    Converts .txt, .docx, and other supported file types into .pdf using LibreOffice.

  • 📊 Page Counting
    Automatically counts pages in PDFs using PyPDF2.

  • 🖼️ Image Processing
    Analyzes images to differentiate logos from content-rich images. Extracts relevant data and updates the corresponding Markdown file.

  • ✍️ Markdown Parsing
    Uses Marker to generate clean, structured Markdown files from parsed PDFs.

  • 🌐 Multilingual Support
    Maintains the original language of the content during extraction.

  • 📈 Data Visualization
    Generates analysis plots based on the page counts of processed documents.
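As an illustration of the Document Conversion feature, the sketch below shows one common way to drive LibreOffice headlessly from Python. It assumes the LibreOffice binary is reachable as `soffice` (on some systems it is `libreoffice`). The helper names `build_convert_command` and `convert_to_pdf` are hypothetical and are not part of LlaMarker's API; LlaMarker may invoke LibreOffice differently.

```python
import subprocess
from pathlib import Path


def build_convert_command(source, out_dir, binary="soffice"):
    """Build a headless LibreOffice command that converts source to PDF."""
    return [
        binary,
        "--headless",
        "--convert-to", "pdf",
        "--outdir", str(out_dir),
        str(source),
    ]


def convert_to_pdf(source, out_dir, binary="soffice"):
    """Run the conversion and return the path where the PDF is expected."""
    Path(out_dir).mkdir(parents=True, exist_ok=True)
    subprocess.run(build_convert_command(source, out_dir, binary), check=True)
    # LibreOffice names the output after the input file's stem.
    return Path(out_dir) / (Path(source).stem + ".pdf")
```

Keeping the command construction separate from the `subprocess.run` call makes it easy to inspect or log the exact command before running it.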


🛠️ Installation

🔧 Requirements

  1. Python 3.10+ – Core language for running LlaMarker.
  2. Marker – Open-source parser tool. Ensure it's installed locally or available in your PATH.
  3. LibreOffice – Required for document conversion (Optional if you only need to parse PDFs).
  4. (Recommended) Poetry – Dependency manager for Python.

⚙️ Pre-Requisites

Below are the essential steps to get your environment ready for LlaMarker. Follow the instructions based on your OS.


🖥️ LibreOffice Installation

  1. Linux

    • Update your package list and install LibreOffice:
      sudo apt update
      sudo apt install libreoffice
      
    • Ensure Marker is installed and available in your PATH. You can also specify its location using the --marker_path argument.
  2. Windows

  3. macOS

    • Option 1: Download LibreOffice from LibreOffice’s website and drag it into the Applications folder.
    • Option 2 (Homebrew):
      brew install --cask libreoffice
      

🛠️ Poetry Installation

  1. Linux / macOS

    • Install Poetry using the official installation script:
      curl -sSL https://install.python-poetry.org | python3 -
      
    • (If Poetry is not added to your PATH automatically) Add it manually:
      export PATH="$HOME/.local/bin:$PATH"
      
      (You can add this line to your shell configuration file, e.g., .bashrc or .zshrc, for permanent access.)
  2. macOS (Homebrew)

    • Alternatively, you can use Homebrew:
      brew install poetry
      
  3. Windows

    • Download the installer from Poetry’s official site and run it.
    • After installation, open a new terminal and verify Poetry is installed:
      poetry --version
      
  4. Windows Subsystem for Linux (WSL)

    • Follow the Linux installation steps.

🧠 Installing Ollama & Vision Models

  1. Install Ollama
    Follow the instructions provided on the Ollama GitHub repo for your OS.

  2. Download Vision Models
    Once Ollama is installed, pull the required model:

    ollama pull llama3.2-vision
    
  3. Verify Model Setup
    Run a sample inference to ensure everything is working correctly.
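Before running a sample inference, it can help to confirm that the model was actually pulled. The sketch below queries the local Ollama server's model list and looks for `llama3.2-vision`. It assumes Ollama is listening on its default address, `http://localhost:11434`; the function names are hypothetical helpers, not part of LlaMarker.

```python
import json
import urllib.request

OLLAMA_URL = "http://localhost:11434"  # assumption: Ollama's default local address


def model_is_available(tags_payload, model="llama3.2-vision"):
    """Return True if the model appears in a /api/tags response payload."""
    names = [m.get("name", "") for m in tags_payload.get("models", [])]
    # Installed names usually carry a tag suffix, e.g. "llama3.2-vision:latest".
    return any(n == model or n.split(":")[0] == model for n in names)


def fetch_tags(base_url=OLLAMA_URL):
    """Ask the local Ollama server which models are installed."""
    with urllib.request.urlopen(f"{base_url}/api/tags", timeout=5) as resp:
        return json.load(resp)


def check_local_model(model="llama3.2-vision", base_url=OLLAMA_URL):
    """Convenience wrapper: fetch the model list and check for the model."""
    return model_is_available(fetch_tags(base_url), model)
```

Calling `check_local_model()` with Ollama running should return `True` once the pull has completed; a connection error usually means the Ollama server is not started.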


🚀 Installing LlaMarker

  1. Clone the repository:

    git clone https://github.com/RevanKumarD/LlaMarker.git
    cd LlaMarker
    
  2. Install dependencies using Poetry:

    poetry install
    

    Note: A post_install script for installing LibreOffice is included for Linux systems only. On Windows or macOS, install LibreOffice manually as described above.


💡 Quick Tips

  • Make sure Python 3.10+ is installed before proceeding.
  • If you encounter issues during the installation, refer to the official Poetry documentation.
  • Ensure that Marker and LibreOffice are correctly added to your PATH for seamless execution of LlaMarker.
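A quick way to check the PATH tip above is to ask Python which executables it can resolve. This is a minimal sketch: the executable names passed in (for example `soffice` for LibreOffice or `marker` for Marker) are assumptions that may differ on your system, and `locate_tools` and `missing_tools` are hypothetical helper names.

```python
import shutil


def locate_tools(names):
    """Map each executable name to its resolved path, or None if not on PATH."""
    return {name: shutil.which(name) for name in names}


def missing_tools(names):
    """Return the names that could not be found on PATH."""
    return [name for name, path in locate_tools(names).items() if path is None]
```

For example, `missing_tools(["soffice", "marker", "poetry"])` returns the subset that still needs to be installed or added to your PATH. If Marker is missing, its location can instead be passed with `--marker_path`.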

🔍 Usage

CLI Usage

poetry run python llamarker/llamarker.py --directory <directory_path> [options]

Arguments:

Argument        Description
--directory     Root directory containing documents to process.
--file          Path to a single file to process (optional).
--temp_dir      Temporary directory for intermediate files (optional).
--save_pdfs     Flag to save PDFs in a separate directory (PDFs) under the root directory.
--output        Directory to save output files (optional). By default, parsed Markdown files are stored in the ParsedFiles folder under the root directory, and images go under pics in ParsedFiles.
--marker_path   Path to the Marker executable (optional). Program should auto-recognize the Marker path if it’s in your PATH.
--force_ocr     Force OCR on all pages, even if text is extractable. Helpful for poorly formatted PDFs or PPTs.
--languages     Comma-separated list of languages for OCR (default: "en").
--qa_evaluator  Enable QA Evaluator for selecting the best response during image processing.
--verbose       Set verbosity level: 0 = WARNING, 1 = INFO, 2 = DEBUG (default: 0).
--model         Ollama model for image analysis (default: llama3.2-vision). A local vision model is required for this to work.

Example Commands

  1. Processing a directory

    poetry run python llamarker/llamarker.py --directory /path/to/documents
    
  2. Processing a single file with verbose output

    poetry run python llamarker/llamarker.py --file /path/to/document.docx --verbose 2
    
  3. Parsing with OCR in multiple languages

    poetry run python llamarker/llamarker.py --directory /path/to/documents --force_ocr --languages "en,de,fr"
    
  4. Saving parsed PDFs separately

    poetry run python llamarker/llamarker.py --directory /path/to/documents --save_pdfs --output /path/to/output
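If you want to launch these commands from a Python script, one option is to assemble the argument list programmatically. The sketch below uses only flags listed in the arguments table; `build_command` and `run_llamarker` are hypothetical wrappers, not functions shipped with LlaMarker, and the script path assumes you run from the repository root.

```python
import subprocess

BASE_COMMAND = ["poetry", "run", "python", "llamarker/llamarker.py"]


def build_command(directory=None, file=None, output=None, save_pdfs=False,
                  force_ocr=False, languages=None, verbose=0, model=None):
    """Assemble a LlaMarker CLI invocation as an argument list."""
    if directory is None and file is None:
        raise ValueError("Provide a directory or a file to process.")
    cmd = list(BASE_COMMAND)
    if directory is not None:
        cmd += ["--directory", str(directory)]
    if file is not None:
        cmd += ["--file", str(file)]
    if output is not None:
        cmd += ["--output", str(output)]
    if save_pdfs:
        cmd.append("--save_pdfs")
    if force_ocr:
        cmd.append("--force_ocr")
    if languages:
        # The CLI expects a single comma-separated value, e.g. "en,de,fr".
        cmd += ["--languages", ",".join(languages)]
    if verbose:
        cmd += ["--verbose", str(verbose)]
    if model is not None:
        cmd += ["--model", model]
    return cmd


def run_llamarker(**options):
    """Run the CLI and raise CalledProcessError on a non-zero exit code."""
    return subprocess.run(build_command(**options), check=True)
```

Passing a list to `subprocess.run` avoids shell quoting problems with paths that contain spaces.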
    

Running the Streamlit GUI

LlaMarker also comes with a Streamlit-based graphical user interface, making it simpler to:

  • Upload files (including multiple files at once) or entire directories
  • Parse documents
  • Download the resulting Markdown files

To launch the Streamlit app:

poetry run streamlit run llamarker/llamarker_gui.py

Once running, open the provided local URL in your browser to interact with LlaMarker.


Output Structure

  • OutDir
    Contains processed PDF files (used by the GUI).

  • ParsedFiles
    Contains the generated Markdown files.

    • pics subfolder: Holds extracted images from the processed files.
  • PDFs
    Stores converted PDF files (if --save_pdfs is used).

  • logs
    Stores log files for each run, helping you track processing status and errors.
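To inspect the results of a run, the layout above can be walked with a few lines of Python. This sketch assumes the default folder names described in this section (ParsedFiles, pics, PDFs, logs) directly under the root directory; `summarize_output` is a hypothetical helper, not part of LlaMarker.

```python
from pathlib import Path


def _names(folder, pattern):
    """List matching file names in folder, or nothing if the folder is absent."""
    if not folder.is_dir():
        return []
    return sorted(p.name for p in folder.glob(pattern) if p.is_file())


def summarize_output(root):
    """Collect the file names LlaMarker is expected to leave under root."""
    root = Path(root)
    parsed = root / "ParsedFiles"
    return {
        "markdown": _names(parsed, "*.md"),
        "images": _names(parsed / "pics", "*"),
        "pdfs": _names(root / "PDFs", "*.pdf"),
        "logs": _names(root / "logs", "*"),
    }
```

Folders that do not exist, such as PDFs when `--save_pdfs` was not used, simply yield empty lists.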


Code Example

Here’s a quick example showing how to use LlaMarker from Python to process a directory, summarize page counts, and generate analysis plots:

from llamarker import LlaMarker

llamarker = LlaMarker(
    input_dir="/path/to/documents",
    save_pdfs=True,
    output_dir="/path/to/output",
    verbose=1
)

# Process all documents in the specified directory
llamarker.process_documents()

# Generate summary information
results = llamarker.generate_summary()
for file, pages in results:
    print(f"{file}: {pages} pages")

# Generate analysis plots
llamarker.plot_analysis(llamarker.parent_dir)

Contributing

Contributions are welcome! Feel free to open an issue or submit a pull request. We appreciate all the help we can get in making LlaMarker even better. 🤝


License

This project references the Marker repository, which comes with its own license. Please review the Marker repo for licensing restrictions and guidelines.

© 2025 Revan Kumar Dhanasekaran. Released under the GPLv3 License.


Acknowledgments

  • Huge thanks to the Marker project for providing an excellent foundation for parsing PDFs.
  • Special thanks to the open-source community for continuous support and contributions.

Happy Parsing! 🌟
