A universal GenAI-based local parser for complex documents of all types.

🖍️ LlaMarker

Your go-to tool for converting and parsing documents into clean, well-structured Markdown!
Fast, intuitive, and entirely local 💻🚀.

✨ Key Features

✨ All-in-One Parsing: Supports TXT, DOCX, PDF, PPT, XLSX, and more, including images embedded in documents.
🖼️ Visual Content Extraction: Utilizes Llama 3.2 Vision to detect images, tables, charts, and diagrams, converting them into rich Markdown.
🏗️ Built with Marker: Extends the open-source Marker parser to handle complex content types locally.
🛡️ Local-First Privacy: No cloud, no external servers—all processing happens on your machine.

🚀 How It Works

Parsing & Conversion
- Parses and converts multiple file types (.txt, .docx, .pdf, .ppt, .xlsx, etc.) into Markdown.
- Leverages Marker for accurate and efficient parsing of both text and visual elements.
- Extracts images, charts, and tables, embedding them in Markdown.
- (Optional) Converts documents into PDFs using LibreOffice for easy viewing.
Visual Analysis
- Distinguishes logos from content-rich images.
- Extracts and preserves the original language from images.
- Uses multiple agents to extract useful information from the images.
Fast & Efficient
- Supports parallel processing for faster handling of large folders.
Streamlit GUI
- A user-friendly interface to upload and parse files (or multiple files at once!) or entire directories.
- Download results directly from the GUI.
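
The parallel handling of large folders mentioned above can be pictured with a small sketch. It is not LlaMarker's internal code: `find_documents`, `process_all`, and the `SUPPORTED` set are hypothetical names, and the extension list is only the subset named in this README.

```python
from concurrent.futures import ThreadPoolExecutor
from pathlib import Path

# Assumption: only the extensions named in this README; the tool may accept more.
SUPPORTED = {".txt", ".docx", ".pdf", ".ppt", ".xlsx"}


def find_documents(root):
    """Return supported files under root, sorted for a stable order."""
    return sorted(
        p for p in Path(root).rglob("*")
        if p.is_file() and p.suffix.lower() in SUPPORTED
    )


def process_all(paths, worker, max_workers=4):
    """Apply worker to every path in parallel; results keep the input order."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(worker, paths))
```

Here `worker` stands in for whatever per-file step you want to run, for example a function that converts or parses one document. Threads suit this kind of work when each step mostly waits on an external program.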

✨ Features

📄 Document Conversion
Converts .txt, .docx, and other supported file types into .pdf using LibreOffice.
📊 Page Counting
Automatically counts pages in PDFs using PyPDF2.
🖼️ Image Processing
Analyzes images to differentiate logos from content-rich images. Extracts relevant data and updates the corresponding Markdown file.
✍️ Markdown Parsing
Uses Marker to generate clean, structured Markdown files from parsed PDFs.
🌐 Multilingual Support
Maintains the original language of the content during extraction.
📈 Data Visualization
Generates analysis plots based on the page counts of processed documents.
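
As a rough illustration of the Document Conversion feature above, the sketch below drives LibreOffice in headless mode from Python. It assumes the `soffice` executable is on your PATH; `build_convert_command` and `convert_to_pdf` are hypothetical helper names, not part of LlaMarker's API.

```python
import subprocess
from pathlib import Path


def build_convert_command(src, out_dir, soffice="soffice"):
    """Build the headless LibreOffice command that converts src to PDF."""
    return [
        soffice, "--headless",
        "--convert-to", "pdf",
        "--outdir", str(out_dir),
        str(src),
    ]


def convert_to_pdf(src, out_dir, soffice="soffice"):
    """Run the conversion and return the path where the PDF is expected."""
    Path(out_dir).mkdir(parents=True, exist_ok=True)
    subprocess.run(build_convert_command(src, out_dir, soffice), check=True)
    return Path(out_dir) / (Path(src).stem + ".pdf")
```

If `soffice` is not on your PATH, pass its full location through the `soffice` parameter.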

🛠️ Installation

🔧 Requirements

- Python 3.10+ – Core language for running LlaMarker.
- Marker – Open-source parser tool. Ensure it's installed locally or available in your PATH.
- LibreOffice – Required for document conversion (optional if you only need to parse PDFs).
- Poetry (recommended) – Dependency manager for Python.

⚙️ Pre-Requisites

Below are the essential steps to get your environment ready for LlaMarker. Follow the instructions based on your OS.

🖥️ LibreOffice Installation

Linux
- Update your package list and install LibreOffice:
```
sudo apt update
sudo apt install libreoffice
```
- Ensure Marker is installed and available in your PATH. You can also specify its location using the --marker_path argument.
Windows
- Download and install LibreOffice.
- During installation, enable the option to add LibreOffice to your system PATH (optional but recommended).
macOS
- Option 1: Download LibreOffice from LibreOffice’s website and drag it into the Applications folder.
- Option 2 (Homebrew):
```
brew install --cask libreoffice
```

🛠️ Poetry Installation

Linux / macOS
- Install Poetry using the official installation script:
```
curl -sSL https://install.python-poetry.org | python3 -
```
- (If Poetry is not added to your PATH automatically) Add it manually:
```
export PATH="$HOME/.local/bin:$PATH"
```
  (You can add this line to your shell configuration file, e.g., .bashrc or .zshrc, for permanent access.)
macOS (Homebrew)
- Alternatively, you can use Homebrew:
```
brew install poetry
```
Windows
- Download the installer from Poetry’s official site and run it.
- After installation, open a new terminal and verify Poetry is installed:
```
poetry --version
```
Windows Subsystem for Linux (WSL)
- Follow the Linux installation steps.

🧠 Installing Ollama & Vision Models

Install Ollama
Follow the instructions provided on the Ollama GitHub repo for your OS.
Download Vision Models
Once Ollama is installed, pull the required model:
```
ollama pull llama3.2-vision
```
Verify Model Setup
Run a sample inference to ensure everything is working correctly.
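
One possible way to run such a sample inference from Python is to call Ollama's local REST endpoint. This sketch assumes Ollama is running on its default port (11434) and that `llama3.2-vision` has been pulled; the helper names and the prompt text are illustrative only.

```python
import json
import urllib.request

# Assumption: Ollama is serving on its default local address.
OLLAMA_URL = "http://localhost:11434/api/generate"


def build_payload(prompt, model="llama3.2-vision"):
    """Request body for a single, non-streamed completion."""
    return {"model": model, "prompt": prompt, "stream": False}


def run_sample_inference(prompt="Reply with the single word: ready",
                         model="llama3.2-vision", url=OLLAMA_URL, timeout=120):
    """Send one prompt to Ollama and return the generated text."""
    data = json.dumps(build_payload(prompt, model)).encode("utf-8")
    request = urllib.request.Request(
        url, data=data, headers={"Content-Type": "application/json"}
    )
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.loads(response.read())["response"]
```

Calling `print(run_sample_inference())` should print a short reply if the model is available. A connection error usually means the Ollama service is not running.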

🚀 Installing LlaMarker

Clone the repository:

```
git clone https://github.com/RevanKumarD/LlaMarker.git
cd LlaMarker
```

Install dependencies using Poetry:
```
poetry install
```
Note: A post_install script for installing LibreOffice is included for Linux systems only. On Windows or macOS, install LibreOffice manually as described above.

💡 Quick Tips

Make sure Python 3.10+ is installed before proceeding.
If you encounter issues during the installation, refer to the official Poetry documentation.
Ensure that Marker and LibreOffice are correctly added to your PATH for seamless execution of LlaMarker.

🔍 Usage

CLI Usage

```
poetry run python llamarker/llamarker.py --directory <directory_path> [options]
```

Arguments:

| Argument | Description |
| --- | --- |
| `--directory` | Root directory containing documents to process. |
| `--file` | Path to a single file to process (optional). |
| `--temp_dir` | Temporary directory for intermediate files (optional). |
| `--save_pdfs` | Flag to save PDFs in a separate directory (`PDFs`) under the root directory. |
| `--output` | Directory to save output files (optional). By default, parsed Markdown files are stored in the `ParsedFiles` folder under the root directory, and images go under `pics` in `ParsedFiles`. |
| `--marker_path` | Path to the Marker executable (optional). Program should auto-recognize the `Marker` path if it’s in your `PATH`. |
| `--force_ocr` | Force OCR on all pages, even if text is extractable. Helpful for poorly formatted PDFs or PPTs. |
| `--languages` | Comma-separated list of languages for OCR (default: `"en"`). |
| `--qa_evaluator` | Enable QA Evaluator for selecting the best response during image processing. |
| `--verbose` | Set verbosity level: 0 = WARNING, 1 = INFO, 2 = DEBUG (default: 0). |
| `--model` | Ollama model for image analysis (default: `llama3.2-vision`). A local vision model is required for this to work. |
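
To make the flags and defaults in the table concrete, here is a hypothetical `argparse` parser that mirrors them. It is a reading aid built only from the table, not the project's actual parser, so the real CLI may differ in types and validation.

```python
import argparse


def build_parser():
    """Hypothetical parser mirroring the documented flags and defaults."""
    parser = argparse.ArgumentParser(prog="llamarker")
    parser.add_argument("--directory", help="Root directory with documents.")
    parser.add_argument("--file", help="Single file to process.")
    parser.add_argument("--temp_dir", help="Directory for intermediate files.")
    parser.add_argument("--save_pdfs", action="store_true")
    parser.add_argument("--output", help="Directory for output files.")
    parser.add_argument("--marker_path", help="Path to the Marker executable.")
    parser.add_argument("--force_ocr", action="store_true")
    parser.add_argument("--languages", default="en",
                        help="Comma-separated OCR languages.")
    parser.add_argument("--qa_evaluator", action="store_true")
    parser.add_argument("--verbose", type=int, choices=[0, 1, 2], default=0)
    parser.add_argument("--model", default="llama3.2-vision")
    return parser
```

Parsing `["--directory", "/docs", "--languages", "en,de,fr"]` with this sketch leaves `--verbose` at 0 and `--model` at `llama3.2-vision`, and the language string can be split on commas to get the individual codes.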

Example Commands

Processing a directory

```
poetry run python llamarker/llamarker.py --directory /path/to/documents
```

Processing a single file with verbose output

```
poetry run python llamarker/llamarker.py --file /path/to/document.docx --verbose 2
```

Parsing with OCR in multiple languages

```
poetry run python llamarker/llamarker.py --directory /path/to/documents --force_ocr --languages "en,de,fr"
```

Saving parsed PDFs separately

```
poetry run python llamarker/llamarker.py --directory /path/to/documents --save_pdfs --output /path/to/output
```

Running the Streamlit GUI

LlaMarker also comes with a Streamlit-based graphical user interface, making it simpler to:

- Upload files (including multiple files at once) or entire directories
- Parse documents
- Download the resulting Markdown files

To launch the Streamlit app:

```
poetry run streamlit run llamarker/llamarker_gui.py
```

Once running, open the provided local URL in your browser to interact with LlaMarker.

Output Structure

- OutDir: Contains processed PDF files (used by the GUI).
- ParsedFiles: Contains the generated Markdown files.
  - pics subfolder: Holds extracted images from the processed files.
- PDFs: Stores converted PDF files (if --save_pdfs is used).
- logs: Stores log files for each run, helping you track processing status and errors.
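
After a run, the layout above can be inspected with a few lines of Python. This is a sketch under the assumption that the default output location is used (`ParsedFiles` directly under the root directory); `collect_outputs` is a hypothetical helper, not a LlaMarker function.

```python
from pathlib import Path


def collect_outputs(root):
    """List generated Markdown files and extracted images under root."""
    parsed = Path(root) / "ParsedFiles"
    pics = parsed / "pics"
    return {
        # Markdown paths are reported relative to ParsedFiles.
        "markdown": sorted(
            p.relative_to(parsed).as_posix() for p in parsed.rglob("*.md")
        ),
        "images": sorted(p.name for p in pics.glob("*") if p.is_file()),
    }
```

The returned dictionary gives a quick check that every input document produced a Markdown file and shows which images were extracted alongside it.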

Code Example

Here’s a quick example showing how to use the PDF conversion, summary, and analysis utilities from Python:

```
from llamarker import LlaMarker

llamarker = LlaMarker(
    input_dir="/path/to/documents",
    save_pdfs=True,
    output_dir="/path/to/output",
    verbose=1
)

# Process all documents in the specified directory
llamarker.process_documents()

# Generate summary information
results = llamarker.generate_summary()
for file, pages in results:
    print(f"{file}: {pages} pages")

# Generate analysis plots
llamarker.plot_analysis(llamarker.parent_dir)
```
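
If you want a quick text overview without plots, the `(file, pages)` pairs from the summary can be grouped into page-count bins. The sketch below assumes the summary is an iterable of such pairs, as the loop above suggests; both function names and the bin size are hypothetical and unrelated to how `plot_analysis` works internally.

```python
from collections import Counter


def bucket_page_counts(results, bin_size=10):
    """Count documents per page-count bin, e.g. 1-10, 11-20, and so on."""
    bins = Counter()
    for _file, pages in results:
        # Treat empty documents as one page so they land in the first bin.
        start = ((max(pages, 1) - 1) // bin_size) * bin_size + 1
        bins[(start, start + bin_size - 1)] += 1
    return dict(sorted(bins.items()))


def render_text_histogram(bins):
    """Render the bins as one line of '#' marks per page range."""
    return "\n".join(
        f"{low}-{high} pages: {'#' * count} ({count})"
        for (low, high), count in bins.items()
    )
```

For example, `render_text_histogram(bucket_page_counts(results))` prints one line per page range, which is handy when checking a large batch from a terminal.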

Contributing

Contributions are welcome! Feel free to open an issue or submit a pull request. We appreciate all the help we can get in making LlaMarker even better. 🤝

License

This project references the Marker repository, which comes with its own license. Please review the Marker repo for licensing restrictions and guidelines.

Acknowledgments

Huge thanks to the Marker project for providing an excellent foundation for parsing PDFs.
Special thanks to the open-source community for continuous support and contributions.

Happy Parsing! 🌟

Release history

- 1.0.2: Jan 19, 2025
- 1.0.1: Jan 13, 2025
- 1.0.0 (this version): Jan 11, 2025
