Lexoid is an efficient document parsing library that supports both LLM-based and non-LLM-based (static) PDF document parsing.

Documentation

Motivation:

  • Take advantage of the multimodal capabilities of modern LLMs
  • Make document parsing convenient for users
  • Encourage collaboration through a permissive license

Installation

Installing with pip

pip install lexoid

To use LLM-based parsing, define the following environment variables, or create a .env file that contains them:

OPENAI_API_KEY=""
GOOGLE_API_KEY=""
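As a quick sanity check before running LLM-based parsing, the sketch below reports which of the two variables above are unset or blank. The helper name `missing_api_keys` is hypothetical and not part of Lexoid; it only inspects a mapping such as `os.environ`. Presumably only the key for the provider you actually call is needed, so treat a missing entry as a hint rather than an error.

```python
import os

# The two variable names come from the installation notes above.
REQUIRED_KEYS = ("OPENAI_API_KEY", "GOOGLE_API_KEY")


def missing_api_keys(env=None, required=REQUIRED_KEYS):
    """Return the required variable names that are unset or blank in env."""
    env = os.environ if env is None else env
    return [name for name in required if not env.get(name, "").strip()]


if __name__ == "__main__":
    missing = missing_api_keys()
    if missing:
        print("Missing API keys:", ", ".join(missing))
    else:
        print("All API keys are set.")
```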

For local inference with Ollama, no API key is required. Install Ollama, pull the target model, and keep the local server running:

ollama pull gemma4
export OLLAMA_BASE_URL=127.0.0.1:11434
ollama list
ollama serve

To run Ollama in Docker instead, see https://docs.ollama.com/docker#run-model-locally

CPU example (will most likely be slower; adjust `OLLAMA_TIMEOUT` as needed):

docker run -d -v ollama:/root/.ollama -p 11434:11434 -e OLLAMA_BASE_URL=0.0.0.0 -e OLLAMA_TIMEOUT=240 --name ollama ollama/ollama
docker exec -it ollama ollama pull gemma4:latest
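To confirm that the Ollama server is reachable before parsing, a small check like the following can help. It is a sketch and not part of Lexoid: the helper names are hypothetical, and prefixing a missing `http://` scheme is an assumption about how the address should be interpreted. `/api/tags` is Ollama's REST endpoint for listing locally available models.

```python
import json
import os
import urllib.error
import urllib.request

DEFAULT_BASE_URL = "http://localhost:11434"


def normalize_base_url(value):
    """Add an http:// scheme when missing and drop a trailing slash (assumption)."""
    value = (value or "").strip() or DEFAULT_BASE_URL
    if "://" not in value:
        value = "http://" + value
    return value.rstrip("/")


def list_ollama_models(base_url=None, timeout=5.0):
    """Return local model names, or None when the server cannot be reached."""
    base = normalize_base_url(base_url or os.environ.get("OLLAMA_BASE_URL"))
    try:
        with urllib.request.urlopen(base + "/api/tags", timeout=timeout) as response:
            payload = json.load(response)
    except (urllib.error.URLError, OSError, ValueError):
        # Connection refused, timeout, or a non-JSON reply all count as "not reachable".
        return None
    return [model["name"] for model in payload.get("models", [])]
```

A `None` result suggests the server is not running or the address is wrong; an empty list suggests the server is up but no model has been pulled yet.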

Optionally, to use Playwright for retrieving web content (instead of the requests library):

playwright install --with-deps --only-shell chromium

Building .whl from source

Note: Installing the package from within a virtual environment can cause unexpected behavior, because Lexoid creates and activates its own environment to build the wheel.

make build

Creating a local installation

To install dependencies:

make install

or, to install with dev-dependencies:

make dev

To activate the virtual environment:

source .venv/bin/activate

Usage

Example Notebook

Example Colab Notebook

Here's a quick example to parse documents using Lexoid:

from lexoid.api import parse
from lexoid.api import ParserType

# Parse a URL; "AUTO" lets Lexoid choose the parser
parsed_md = parse("https://www.justice.gov/eoir/immigration-law-advisor", parser_type="AUTO")["raw"]
# or parse a local PDF with an LLM
pdf_path = "path/to/immigration-law-advisor.pdf"
parsed_md = parse(pdf_path, parser_type="LLM_PARSE")["raw"]
# or parse the same PDF with the static (non-LLM) parser
pdf_path = "path/to/immigration-law-advisor.pdf"
parsed_md = parse(pdf_path, parser_type="STATIC_PARSE")["raw"]

print(parsed_md)
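The parsed markdown can also be written straight to disk. The sketch below wraps that step in a hypothetical helper, `save_markdown`, which takes the parse function as an argument so it can be tried with a stand-in function. The only thing it assumes about the result is the "raw" key used above.

```python
from pathlib import Path


def save_markdown(parse_fn, source, out_path, **parse_kwargs):
    """Parse source with parse_fn and write the result's "raw" markdown to out_path."""
    result = parse_fn(source, **parse_kwargs)
    out_file = Path(out_path)
    # Create the target directory if it does not exist yet.
    out_file.parent.mkdir(parents=True, exist_ok=True)
    out_file.write_text(result["raw"], encoding="utf-8")
    return out_file


# Typical use (requires lexoid to be installed):
#   from lexoid.api import parse
#   save_markdown(parse, "path/to/document.pdf", "out/document.md", parser_type="AUTO")
```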

Parameters

  • path (str): The file path or URL.
  • parser_type (str, optional): The type of parser to use ("AUTO", "LLM_PARSE", or "STATIC_PARSE"). Defaults to "AUTO".
  • pages_per_split (int, optional): Number of pages per split for chunking. Defaults to 4.
  • max_processes (int, optional): Maximum number of processes for parallel processing. Defaults to 4.
  • **kwargs: Additional arguments for the parser.
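To make pages_per_split concrete, the sketch below shows how a page count breaks into chunks of a given size. It only illustrates the arithmetic: `page_chunks` is a hypothetical helper, and Lexoid may split documents differently internally.

```python
def page_chunks(total_pages, pages_per_split=4):
    """Return (start, end) page ranges, zero-based with an exclusive end."""
    if total_pages < 0:
        raise ValueError("total_pages must be non-negative")
    if pages_per_split < 1:
        raise ValueError("pages_per_split must be at least 1")
    return [
        (start, min(start + pages_per_split, total_pages))
        for start in range(0, total_pages, pages_per_split)
    ]


print(page_chunks(10))  # [(0, 4), (4, 8), (8, 10)]
```

With the default of 4, a 10-page document yields three chunks, the last of which holds the remaining two pages.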

Command Line Usage

Lexoid provides a command-line interface for document parsing without writing Python code.

Installation

The CLI is automatically available after installing Lexoid:

pip install lexoid
lexoid --help

Alternatively, invoke it with Python module syntax:

python -m lexoid --help

Parse Documents

Convert documents to markdown or JSON:

# Parse to stdout (default markdown)
lexoid parse --input document.pdf

# Save to file
lexoid parse --input document.pdf --output output.md

# Output as JSON (includes metadata, segments, token usage)
lexoid parse --input document.pdf --format json --output result.json

# Use specific parser (STATIC_PARSE, LLM_PARSE, or AUTO)
lexoid parse --input document.pdf --parser-type STATIC_PARSE

# Use specific LLM model
lexoid parse --input document.pdf --model gpt-4o

# Enable verbose logging
lexoid parse --input document.pdf --verbose
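When the CLI needs to be driven from a script, the argument list can be assembled in Python and passed to `subprocess.run`. This is a sketch: `build_parse_command` and `run_parse` are hypothetical helpers, and they use only the flags shown above.

```python
import subprocess


def build_parse_command(input_path, output=None, fmt=None, parser_type=None,
                        model=None, verbose=False):
    """Build the argument list for `lexoid parse` from the flags shown above."""
    command = ["lexoid", "parse", "--input", str(input_path)]
    if output:
        command += ["--output", str(output)]
    if fmt:
        command += ["--format", fmt]
    if parser_type:
        command += ["--parser-type", parser_type]
    if model:
        command += ["--model", model]
    if verbose:
        command.append("--verbose")
    return command


def run_parse(input_path, **options):
    """Run the CLI and return its stdout; raises CalledProcessError on failure."""
    completed = subprocess.run(
        build_parse_command(input_path, **options),
        check=True, capture_output=True, text=True,
    )
    return completed.stdout
```

Passing the command as a list avoids shell quoting problems with file names that contain spaces.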

Extract Structured Data with JSON Schema

Extract data conforming to a JSON schema:

# Inline schema
lexoid schema \
  --input document.pdf \
  --schema '{"type": "object", "properties": {"title": {"type": "string"}}}' \
  --output result.json

# Schema from file
lexoid schema \
  --input document.pdf \
  --schema schema.json \
  --output result.json

# Specify LLM provider
lexoid schema \
  --input document.pdf \
  --schema schema.json \
  --api openai \
  --model gpt-4o
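For the schema-from-file form, the schema file can be generated with Python's standard json module. The sketch below writes the same schema as the inline example; `write_schema` is a hypothetical helper, and the file name simply matches the command above.

```python
import json
from pathlib import Path

# Same schema as the inline example above.
SCHEMA = {
    "type": "object",
    "properties": {"title": {"type": "string"}},
}


def write_schema(path="schema.json", schema=SCHEMA):
    """Write the schema as indented JSON and return the file path."""
    schema_path = Path(path)
    schema_path.write_text(json.dumps(schema, indent=2), encoding="utf-8")
    return schema_path
```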

Convert to LaTeX

Convert documents to LaTeX format:

# Convert to stdout
lexoid latex --input document.pdf

# Save to file
lexoid latex --input document.pdf --output output.tex

# Use specific model
lexoid latex --input document.pdf --model gpt-4o

Command-line Options

Common Options

  • --input, -i: Input file path (required) - Supports PDF, images, HTML, DOCX, XLSX, PPTX, or URLs
  • --output, -o: Output file path (optional) - If not specified, output is printed to stdout
  • --verbose, -v: Enable detailed logging

Parse Command

lexoid parse --help
  • --parser-type, -p: Parser type - AUTO (default), LLM_PARSE, or STATIC_PARSE
  • --model, -m: LLM model name (default: gemini-2.5-flash)
  • --pages-per-split: Pages per chunk (default: 4)
  • --max-processes: Parallel processes (default: 4)
  • --framework: Static parser framework - pdfplumber or paddleocr
  • --format: Output format - markdown (default, plain markdown text) or json (full result with metadata, segments, token usage)

Schema Command

lexoid schema --help
  • --schema, -s: JSON schema (file path or inline JSON, required)
  • --model, -m: LLM model (default: gpt-4o-mini)
  • --api: API provider - openai, gemini, anthropic, ollama, etc. (auto-detected if not specified)
  • --example-schema: Provide example data for the schema
  • --fill-single-schema: Auto-fill single schemas

LaTeX Command

lexoid latex --help
  • --model, -m: LLM model (default: gpt-4o-mini)
  • --api: API provider - openai, gemini, anthropic, ollama, etc. (auto-detected if not specified)

Supported API Providers

  • Google
  • OpenAI
  • Hugging Face
  • Together AI
  • OpenRouter
  • Fireworks
  • Ollama

Ollama Local Parsing

Lexoid supports local LLM_PARSE inference through Ollama. The initial recommended model is gemma4:latest.

from lexoid.api import parse

result = parse(
    "path/to/document.pdf",
    parser_type="LLM_PARSE",
    api_provider="ollama",
    model="gemma4:latest",
    max_processes=1,
)

print(result["raw"])

Notes:

  • Ollama uses the default local endpoint http://localhost:11434 unless OLLAMA_BASE_URL is set.
  • Lexoid forces max_processes=1 for Ollama-backed parsing to avoid local multiprocess contention.
  • AUTO routing does not select Ollama in this first version; choose it explicitly with api_provider="ollama".
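The first two notes can be summarized in a few lines. The sketch below mirrors the documented behavior only; `resolve_ollama_settings` is a hypothetical helper, not Lexoid's implementation.

```python
import os

DEFAULT_OLLAMA_URL = "http://localhost:11434"


def resolve_ollama_settings(api_provider, max_processes=4, env=None):
    """Return (base_url, effective_max_processes) following the notes above."""
    env = os.environ if env is None else env
    if api_provider != "ollama":
        # Other providers keep the requested parallelism and need no local endpoint.
        return None, max_processes
    base_url = env.get("OLLAMA_BASE_URL") or DEFAULT_OLLAMA_URL
    # Ollama-backed parsing is limited to one process to avoid local contention.
    return base_url, 1
```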

Benchmark

Results aggregated across 14 documents.

Note: Benchmarks are currently done in the zero-shot setting.

| Rank | Model | SequenceMatcher Similarity | TFIDF Similarity | Time (s) | Cost ($) |
| --- | --- | --- | --- | --- | --- |
| 1 | gemini-3-pro-preview | 0.917 (±0.127) | 0.943 (±0.159) | 46.92 | 0.06288 |
| 2 | AUTO (with auto-selected model) | 0.899 (±0.131) | 0.960 (±0.066) | 21.17 | 0.00066 |
| 3 | AUTO | 0.895 (±0.112) | 0.973 (±0.046) | 9.29 | 0.00063 |
| 4 | gpt-5.2 | 0.890 (±0.193) | 0.975 (±0.036) | 33.32 | 0.03959 |
| 5 | gemini-2.5-flash | 0.886 (±0.164) | 0.986 (±0.027) | 52.55 | 0.01226 |
| 6 | mistral-ocr-latest | 0.882 (±0.106) | 0.932 (±0.091) | 5.75 | 0.00121 |
| 7 | gemini-2.5-pro | 0.876 (±0.195) | 0.976 (±0.049) | 22.65 | 0.02408 |
| 8 | gemini-2.0-flash | 0.875 (±0.148) | 0.977 (±0.037) | 11.96 | 0.00079 |
| 9 | claude-3-5-sonnet-20241022 | 0.858 (±0.184) | 0.930 (±0.098) | 17.32 | 0.01804 |
| 10 | gemini-1.5-flash | 0.842 (±0.214) | 0.969 (±0.037) | 15.58 | 0.00043 |
| 11 | gpt-5-mini | 0.819 (±0.201) | 0.917 (±0.104) | 52.84 | 0.00811 |
| 12 | gpt-5 | 0.807 (±0.215) | 0.919 (±0.088) | 98.12 | 0.05505 |
| 13 | claude-sonnet-4-20250514 | 0.801 (±0.188) | 0.905 (±0.136) | 22.02 | 0.02056 |
| 14 | claude-opus-4-20250514 | 0.789 (±0.220) | 0.886 (±0.148) | 29.55 | 0.09513 |
| 15 | accounts/fireworks/models/llama4-maverick-instruct-basic | 0.772 (±0.203) | 0.930 (±0.117) | 16.02 | 0.00147 |
| 16 | gemini-1.5-pro | 0.767 (±0.309) | 0.865 (±0.230) | 24.77 | 0.01139 |
| 17 | gemini-3-flash-preview | 0.766 (±0.293) | 0.858 (±0.210) | 39.38 | 0.00969 |
| 18 | gpt-4.1-mini | 0.754 (±0.249) | 0.803 (±0.193) | 23.28 | 0.00347 |
| 19 | accounts/fireworks/models/llama4-scout-instruct-basic | 0.754 (±0.243) | 0.942 (±0.063) | 13.36 | 0.00087 |
| 20 | gpt-4o | 0.752 (±0.269) | 0.896 (±0.123) | 28.87 | 0.01469 |
| 21 | gpt-4o-mini | 0.728 (±0.241) | 0.850 (±0.128) | 18.96 | 0.00609 |
| 22 | claude-3-7-sonnet-20250219 | 0.646 (±0.397) | 0.758 (±0.297) | 57.96 | 0.01730 |
| 23 | gpt-4.1 | 0.637 (±0.301) | 0.787 (±0.185) | 35.37 | 0.01498 |
| 24 | google/gemma-3-27b-it | 0.604 (±0.342) | 0.788 (±0.297) | 23.16 | 0.00020 |
| 25 | ds4sd/SmolDocling-256M-preview | 0.603 (±0.292) | 0.705 (±0.262) | 507.74 | 0.00000 |
| 26 | microsoft/phi-4-multimodal-instruct | 0.589 (±0.273) | 0.820 (±0.197) | 14.00 | 0.00045 |
| 27 | qwen/qwen-2.5-vl-7b-instruct | 0.498 (±0.378) | 0.630 (±0.445) | 14.73 | 0.00056 |
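For readers unfamiliar with the first metric, Python's `difflib.SequenceMatcher` produces a ratio between 0 and 1 for two strings. The sketch below shows how such a score and its mean and spread could be computed. Whether the benchmark uses exactly this call, any text normalization, or population versus sample standard deviation is not stated here, so those details are assumptions, and the sample strings are made up for illustration.

```python
from difflib import SequenceMatcher
from statistics import mean, pstdev


def sequence_similarity(expected, actual):
    """Similarity ratio in [0, 1]; 1.0 means the two strings are identical."""
    return SequenceMatcher(None, expected, actual).ratio()


def summarize(scores):
    """Return (mean, population standard deviation) for a list of scores."""
    return mean(scores), pstdev(scores)


# Illustrative (ground truth, parsed output) pairs; not benchmark data.
pairs = [
    ("# Title\n\nHello world", "# Title\n\nHello world"),
    ("| a | b |", "| a | c |"),
]
scores = [sequence_similarity(expected, actual) for expected, actual in pairs]
print(summarize(scores))
```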

Citation

If you use Lexoid in production or publications, please cite accordingly and acknowledge usage. We appreciate the support 🙏
