Project description
Lexoid is an efficient document parsing library that supports both LLM-based and non-LLM-based (static) PDF document parsing.
Motivation:
- Take advantage of the multimodal capabilities of modern LLMs
- Make document parsing convenient for users
- Encourage collaboration through a permissive license
Installation
Installing with pip
pip install lexoid
To use LLM-based parsing, define the following environment variables or add them to a .env file:
OPENAI_API_KEY=""
GOOGLE_API_KEY=""
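Before choosing a parser, it can help to check which of these keys are actually set. The following is a minimal sketch using only the standard library. `configured_keys` and `default_parser_type` are hypothetical helper names, not part of Lexoid, and the fallback policy (static parsing when no key is present) is an assumption rather than documented behavior.

```python
import os

# Environment variable names taken from the definitions above.
LLM_KEY_NAMES = ("OPENAI_API_KEY", "GOOGLE_API_KEY")


def configured_keys(env=None):
    """Return the names of the LLM API keys that are set and non-empty."""
    env = os.environ if env is None else env
    return [name for name in LLM_KEY_NAMES if env.get(name, "").strip()]


def default_parser_type(env=None):
    """Hypothetical policy: suggest LLM parsing only if some key is configured."""
    return "LLM_PARSE" if configured_keys(env) else "STATIC_PARSE"


if __name__ == "__main__":
    print("Configured keys:", configured_keys())
    print("Suggested parser_type:", default_parser_type())
```

The returned string can be passed as the `parser_type` argument shown in the Usage section.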
For local inference with Ollama, no API key is required. Install Ollama, pull the target model, and keep the local server running:
ollama pull gemma4
export OLLAMA_BASE_URL=http://127.0.0.1:11434
ollama list
ollama serve
# docker
Reference: https://docs.ollama.com/docker#run-model-locally
CPU example (will most likely be slower; remember to adjust `OLLAMA_TIMEOUT` as needed)
- docker run -d -v ollama:/root/.ollama -p 11434:11434 -e OLLAMA_BASE_URL=0.0.0.0 -e OLLAMA_TIMEOUT=240 --name ollama ollama/ollama
- docker exec -it ollama ollama pull gemma4:latest
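To confirm that the local server is reachable and the model has been pulled before parsing, a small check along these lines may help. This is a sketch, not part of Lexoid: it assumes the standard Ollama HTTP API, whose `/api/tags` endpoint lists local models, and the helper names are hypothetical. Whether Lexoid itself accepts an `OLLAMA_BASE_URL` without a scheme is not stated here, so the sketch normalizes the value for its own request only.

```python
import json
import os
import urllib.error
import urllib.request


def normalize_base_url(value, default="http://localhost:11434"):
    """Return a base URL with a scheme and no trailing slash."""
    value = (value or "").strip() or default
    if "://" not in value:
        value = "http://" + value
    return value.rstrip("/")


def model_names(tags_payload):
    """Extract model names from a decoded /api/tags response."""
    return [m.get("name", "") for m in tags_payload.get("models", [])]


def list_local_models(base_url=None, timeout=5.0):
    """Return pulled model names, or None if the server cannot be reached."""
    url = normalize_base_url(base_url) + "/api/tags"
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return model_names(json.load(resp))
    except (urllib.error.URLError, OSError, ValueError):
        return None


if __name__ == "__main__":
    models = list_local_models(os.environ.get("OLLAMA_BASE_URL"))
    if models is None:
        print("Ollama server not reachable; is `ollama serve` running?")
    else:
        print("Local models:", ", ".join(models) or "(none pulled yet)")
```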
Optionally, to use Playwright for retrieving web content (instead of the requests library):
playwright install --with-deps --only-shell chromium
Building .whl from source
Note: Installing the package from within a virtual environment could cause unexpected behavior, as Lexoid creates and activates its own environment in order to build the wheel.
make build
Creating a local installation
To install dependencies:
make install
or, to install with dev-dependencies:
make dev
To activate virtual environment:
source .venv/bin/activate
Usage
Here's a quick example to parse documents using Lexoid:
from lexoid.api import ParserType, parse
# Parse a web page and let Lexoid choose the parser
parsed_md = parse("https://www.justice.gov/eoir/immigration-law-advisor", parser_type="AUTO")["raw"]
# or parse a local PDF with the LLM-based parser
pdf_path = "path/to/immigration-law-advisor.pdf"
parsed_md = parse(pdf_path, parser_type="LLM_PARSE")["raw"]
# or parse the same PDF with the static (non-LLM) parser
parsed_md = parse(pdf_path, parser_type="STATIC_PARSE")["raw"]
print(parsed_md)
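If LLM-based parsing may be unavailable (for example, no API key is configured), one option is to try parser types in order and keep the first that succeeds. The sketch below is an illustration, not a Lexoid feature: `parse_with_fallback` is a hypothetical helper, and it takes the parse callable as an argument so it can be exercised without Lexoid installed. It relies only on the call shape shown above (`parse(path, parser_type=...)` returning a dict with a `"raw"` key).

```python
def parse_with_fallback(parse_fn, path, parser_types=("LLM_PARSE", "STATIC_PARSE")):
    """Try each parser type in order and return (parser_type, markdown).

    parse_fn is expected to behave like lexoid.api.parse as used above:
    called as parse_fn(path, parser_type=...) and returning a dict with a
    "raw" key.
    """
    errors = []
    for parser_type in parser_types:
        try:
            result = parse_fn(path, parser_type=parser_type)
            return parser_type, result["raw"]
        except Exception as exc:  # broad on purpose: failure modes are not documented
            errors.append(f"{parser_type}: {exc}")
    raise RuntimeError("All parser types failed: " + "; ".join(errors))


# Intended usage with Lexoid installed:
#   from lexoid.api import parse
#   used, markdown = parse_with_fallback(parse, "path/to/document.pdf")
```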
Parameters
- path (str): The file path or URL.
- parser_type (str, optional): The type of parser to use ("AUTO", "LLM_PARSE", or "STATIC_PARSE"). Defaults to "AUTO".
- pages_per_split (int, optional): Number of pages per split for chunking. Defaults to 4.
- max_threads (int, optional): Maximum number of threads for parallel processing. Defaults to 4.
- **kwargs: Additional arguments for the parser.
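To make the effect of `pages_per_split` concrete, the arithmetic can be sketched as follows. This only illustrates how a page count divides into chunks of a given size; how Lexoid performs the split internally is not shown in this document, and `page_splits` is a hypothetical helper.

```python
def page_splits(total_pages, pages_per_split=4):
    """Return 1-based, inclusive (start, end) page ranges for each chunk."""
    if total_pages < 0 or pages_per_split < 1:
        raise ValueError("total_pages must be >= 0 and pages_per_split >= 1")
    return [
        (start, min(start + pages_per_split - 1, total_pages))
        for start in range(1, total_pages + 1, pages_per_split)
    ]


if __name__ == "__main__":
    # A 10-page document with the default of 4 pages per split
    print(page_splits(10))  # [(1, 4), (5, 8), (9, 10)]
```

With the defaults, a 10-page document yields three chunks, so raising the parallelism setting beyond three would presumably bring no further benefit for that document.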
Command Line Usage
Lexoid provides a command-line interface for document parsing without writing Python code.
Installation
The CLI is automatically available after installing Lexoid:
pip install lexoid
lexoid --help
Alternatively, invoke it as a Python module:
python -m lexoid --help
Parse Documents
Convert documents to markdown or JSON:
# Parse to stdout (default markdown)
lexoid parse --input document.pdf
# Save to file
lexoid parse --input document.pdf --output output.md
# Output as JSON (includes metadata, segments, token usage)
lexoid parse --input document.pdf --format json --output result.json
# Use specific parser (STATIC_PARSE, LLM_PARSE, or AUTO)
lexoid parse --input document.pdf --parser-type STATIC_PARSE
# Use specific LLM model
lexoid parse --input document.pdf --model gpt-4o
# Enable verbose logging
lexoid parse --input document.pdf --verbose
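When driving the CLI from a script, the argument list can be assembled programmatically. The sketch below uses only the flags shown above; `build_parse_command` is a hypothetical helper, and the command is built as a list so that paths containing spaces need no manual quoting.

```python
import shlex


def build_parse_command(input_path, output_path=None, output_format=None,
                        parser_type=None, model=None, verbose=False):
    """Build an argv list for `lexoid parse` from the flags documented above."""
    cmd = ["lexoid", "parse", "--input", input_path]
    if output_path:
        cmd += ["--output", output_path]
    if output_format:
        cmd += ["--format", output_format]
    if parser_type:
        cmd += ["--parser-type", parser_type]
    if model:
        cmd += ["--model", model]
    if verbose:
        cmd.append("--verbose")
    return cmd


if __name__ == "__main__":
    cmd = build_parse_command("document.pdf", output_path="result.json",
                              output_format="json")
    print(shlex.join(cmd))
    # To execute it: import subprocess; subprocess.run(cmd, check=True)
```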
Extract Structured Data with JSON Schema
Extract data conforming to a JSON schema:
# Inline schema
lexoid schema \
--input document.pdf \
--schema '{"type": "object", "properties": {"title": {"type": "string"}}}' \
--output result.json
# Schema from file
lexoid schema \
--input document.pdf \
--schema schema.json \
--output result.json
# Specify LLM provider
lexoid schema \
--input document.pdf \
--schema schema.json \
--api openai \
--model gpt-4o
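For anything beyond a one-field schema, generating the schema file from Python avoids shell-quoting mistakes in inline JSON. The following is a sketch under stated assumptions: `write_schema` is a hypothetical helper, and the `required` keyword is standard JSON Schema whose handling by Lexoid is not confirmed in this document.

```python
import json
import tempfile
from pathlib import Path


def write_schema(path, properties, required=()):
    """Write a flat JSON object schema to `path` and return it as a dict.

    `properties` maps field names to JSON Schema type names, e.g.
    {"title": "string"}.
    """
    schema = {
        "type": "object",
        "properties": {name: {"type": type_name} for name, type_name in properties.items()},
    }
    if required:
        missing = [name for name in required if name not in properties]
        if missing:
            raise ValueError(f"required fields not in properties: {missing}")
        schema["required"] = list(required)
    Path(path).write_text(json.dumps(schema, indent=2), encoding="utf-8")
    return schema


if __name__ == "__main__":
    target = Path(tempfile.mkdtemp()) / "schema.json"
    write_schema(target, {"title": "string", "page_count": "integer"})
    print(f"lexoid schema --input document.pdf --schema {target} --output result.json")
```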
Convert to LaTeX
Convert documents to LaTeX format:
# Convert to stdout
lexoid latex --input document.pdf
# Save to file
lexoid latex --input document.pdf --output output.tex
# Use specific model
lexoid latex --input document.pdf --model gpt-4o
Command-line Options
Common Options
- `--input`, `-i`: Input file path (required). Supports PDF, images, HTML, DOCX, XLSX, PPTX, or URLs.
- `--output`, `-o`: Output file path (optional). If not specified, output is printed to stdout.
- `--verbose`, `-v`: Enable detailed logging.
Parse Command
lexoid parse --help
- `--parser-type`, `-p`: Parser type: `AUTO` (default), `LLM_PARSE`, or `STATIC_PARSE`.
- `--model`, `-m`: LLM model name (default: gemini-2.5-flash).
- `--pages-per-split`: Pages per chunk (default: 4).
- `--max-processes`: Parallel processes (default: 4).
- `--framework`: Static parser framework: `pdfplumber` or `paddleocr`.
- `--format`: Output format: `markdown` (default, plain markdown text) or `json` (full result with metadata, segments, token usage).
Schema Command
lexoid schema --help
- `--schema`, `-s`: JSON schema (file path or inline JSON, required).
- `--model`, `-m`: LLM model (default: gpt-4o-mini).
- `--api`: API provider: openai, gemini, anthropic, ollama, etc. (auto-detected if not specified).
- `--example-schema`: Provide example data for the schema.
- `--fill-single-schema`: Auto-fill single schemas.
LaTeX Command
lexoid latex --help
- `--model`, `-m`: LLM model (default: gpt-4o-mini).
- `--api`: API provider: openai, gemini, anthropic, ollama, etc. (auto-detected if not specified).
Supported API Providers
- OpenAI
- Hugging Face
- Together AI
- OpenRouter
- Fireworks
- Ollama
Ollama Local Parsing
Lexoid supports local LLM_PARSE inference through Ollama. The initial recommended model is gemma4:latest.
from lexoid.api import parse
result = parse(
    "path/to/document.pdf",
    parser_type="LLM_PARSE",
    api_provider="ollama",
    model="gemma4:latest",
    max_processes=1,
)
print(result["raw"])
Notes:
- Ollama uses the default local endpoint `http://localhost:11434` unless `OLLAMA_BASE_URL` is set.
- Lexoid forces `max_processes=1` for Ollama-backed parsing to avoid local multiprocess contention.
- `AUTO` routing does not select Ollama in this first version; choose it explicitly with `api_provider="ollama"`.
Benchmark
Results aggregated across 14 documents.
Note: Benchmarks are currently done in the zero-shot setting.
| Rank | Model | SequenceMatcher Similarity | TFIDF Similarity | Time (s) | Cost ($) |
|---|---|---|---|---|---|
| 1 | gemini-3-pro-preview | 0.917 (±0.127) | 0.943 (±0.159) | 46.92 | 0.06288 |
| 2 | AUTO (with auto-selected model) | 0.899 (±0.131) | 0.960 (±0.066) | 21.17 | 0.00066 |
| 3 | AUTO | 0.895 (±0.112) | 0.973 (±0.046) | 9.29 | 0.00063 |
| 4 | gpt-5.2 | 0.890 (±0.193) | 0.975 (±0.036) | 33.32 | 0.03959 |
| 5 | gemini-2.5-flash | 0.886 (±0.164) | 0.986 (±0.027) | 52.55 | 0.01226 |
| 6 | mistral-ocr-latest | 0.882 (±0.106) | 0.932 (±0.091) | 5.75 | 0.00121 |
| 7 | gemini-2.5-pro | 0.876 (±0.195) | 0.976 (±0.049) | 22.65 | 0.02408 |
| 8 | gemini-2.0-flash | 0.875 (±0.148) | 0.977 (±0.037) | 11.96 | 0.00079 |
| 9 | claude-3-5-sonnet-20241022 | 0.858 (±0.184) | 0.930 (±0.098) | 17.32 | 0.01804 |
| 10 | gemini-1.5-flash | 0.842 (±0.214) | 0.969 (±0.037) | 15.58 | 0.00043 |
| 11 | gpt-5-mini | 0.819 (±0.201) | 0.917 (±0.104) | 52.84 | 0.00811 |
| 12 | gpt-5 | 0.807 (±0.215) | 0.919 (±0.088) | 98.12 | 0.05505 |
| 13 | claude-sonnet-4-20250514 | 0.801 (±0.188) | 0.905 (±0.136) | 22.02 | 0.02056 |
| 14 | claude-opus-4-20250514 | 0.789 (±0.220) | 0.886 (±0.148) | 29.55 | 0.09513 |
| 15 | accounts/fireworks/models/llama4-maverick-instruct-basic | 0.772 (±0.203) | 0.930 (±0.117) | 16.02 | 0.00147 |
| 16 | gemini-1.5-pro | 0.767 (±0.309) | 0.865 (±0.230) | 24.77 | 0.01139 |
| 17 | gemini-3-flash-preview | 0.766 (±0.293) | 0.858 (±0.210) | 39.38 | 0.00969 |
| 18 | gpt-4.1-mini | 0.754 (±0.249) | 0.803 (±0.193) | 23.28 | 0.00347 |
| 19 | accounts/fireworks/models/llama4-scout-instruct-basic | 0.754 (±0.243) | 0.942 (±0.063) | 13.36 | 0.00087 |
| 20 | gpt-4o | 0.752 (±0.269) | 0.896 (±0.123) | 28.87 | 0.01469 |
| 21 | gpt-4o-mini | 0.728 (±0.241) | 0.850 (±0.128) | 18.96 | 0.00609 |
| 22 | claude-3-7-sonnet-20250219 | 0.646 (±0.397) | 0.758 (±0.297) | 57.96 | 0.01730 |
| 23 | gpt-4.1 | 0.637 (±0.301) | 0.787 (±0.185) | 35.37 | 0.01498 |
| 24 | google/gemma-3-27b-it | 0.604 (±0.342) | 0.788 (±0.297) | 23.16 | 0.00020 |
| 25 | ds4sd/SmolDocling-256M-preview | 0.603 (±0.292) | 0.705 (±0.262) | 507.74 | 0.00000 |
| 26 | microsoft/phi-4-multimodal-instruct | 0.589 (±0.273) | 0.820 (±0.197) | 14.00 | 0.00045 |
| 27 | qwen/qwen-2.5-vl-7b-instruct | 0.498 (±0.378) | 0.630 (±0.445) | 14.73 | 0.00056 |
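For readers who want to score their own documents in a comparable way, the first similarity column can be approximated with the standard library. This is a sketch under assumptions: it takes the column to be the ratio from Python's `difflib.SequenceMatcher` computed between ground-truth and parsed markdown, and it uses the population standard deviation for the ± value. The benchmark's actual preprocessing and aggregation are not described here and may differ.

```python
from difflib import SequenceMatcher
from statistics import mean, pstdev


def sequence_similarity(expected, actual):
    """Return the SequenceMatcher ratio in [0, 1] between two strings."""
    return SequenceMatcher(None, expected, actual).ratio()


def summarize(scores):
    """Return (mean, population standard deviation) across documents."""
    return mean(scores), pstdev(scores)


if __name__ == "__main__":
    # Hypothetical (ground truth, parsed output) pairs for illustration only
    pairs = [
        ("# Title\n\nSome text.", "# Title\n\nSome text."),
        ("| a | b |\n|---|---|", "| a | b |\n|--|--|"),
    ]
    scores = [sequence_similarity(truth, parsed) for truth, parsed in pairs]
    avg, spread = summarize(scores)
    print(f"{avg:.3f} (±{spread:.3f})")
```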
Citation
If you use Lexoid in production or publications, please cite accordingly and acknowledge usage. We appreciate the support 🙏