Lexoid is an efficient document parsing library that supports both LLM-based and non-LLM-based (static) PDF document parsing.

Documentation

Motivation:

  • Take advantage of the multi-modal capabilities of recent LLMs
  • Make document parsing convenient for users
  • Encourage collaboration through a permissive license

Installation

Installing with pip

pip install lexoid

To use LLM-based parsing, define the following environment variables, or create a .env file containing the same definitions:

OPENAI_API_KEY=""
GOOGLE_API_KEY=""
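As a quick sanity check before running LLM-based parsing, the sketch below reports which of the two variables named above are unset or empty. The helper `missing_api_keys` and the `KNOWN_KEYS` tuple are hypothetical names used for illustration; they are not part of Lexoid.

```python
import os

# Variable names taken from the installation notes above.
# KNOWN_KEYS and missing_api_keys are illustrative, not part of Lexoid.
KNOWN_KEYS = ("OPENAI_API_KEY", "GOOGLE_API_KEY")


def missing_api_keys(env=None):
    """Return the names of the known API keys that are unset or empty."""
    env = os.environ if env is None else env
    return [name for name in KNOWN_KEYS if not env.get(name)]
```

Calling `missing_api_keys()` with no argument inspects the current process environment; passing a dict makes the check easy to test.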

Optionally, to use Playwright for retrieving web content (instead of the requests library):

playwright install --with-deps --only-shell chromium

Building .whl from source

make build

Creating a local installation

To install dependencies:

make install

or, to install with dev-dependencies:

make dev

To activate the virtual environment:

source .venv/bin/activate

Usage

Example Notebook

Example Colab Notebook

Here's a quick example to parse documents using Lexoid:

from lexoid.api import parse

# Parse a web page with the default "AUTO" parser type
parsed_md = parse("https://www.justice.gov/eoir/immigration-law-advisor", parser_type="AUTO")["raw"]

# or parse a local PDF with LLM-based parsing
pdf_path = "path/to/immigration-law-advisor.pdf"
parsed_md = parse(pdf_path, parser_type="LLM_PARSE")["raw"]

# or parse the same PDF with static (non-LLM) parsing
parsed_md = parse(pdf_path, parser_type="STATIC_PARSE")["raw"]

print(parsed_md)

Parameters

  • path (str): The file path or URL.
  • parser_type (str, optional): The type of parser to use ("AUTO", "LLM_PARSE", or "STATIC_PARSE"). Defaults to "AUTO".
  • pages_per_split (int, optional): Number of pages per split for chunking. Defaults to 4.
  • max_threads (int, optional): Maximum number of threads for parallel processing. Defaults to 4.
  • **kwargs: Additional arguments for the parser.
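To illustrate what `pages_per_split` and `max_threads` control, here is a rough sketch of page chunking combined with a thread pool. It is not Lexoid's implementation: `split_pages`, `parse_chunk`, and `parse_in_parallel` are hypothetical names, and the per-chunk work is replaced by a stand-in that only returns a label.

```python
from concurrent.futures import ThreadPoolExecutor


def split_pages(num_pages, pages_per_split=4):
    """Return (start, end) page ranges, end exclusive, of at most pages_per_split pages."""
    if pages_per_split < 1:
        raise ValueError("pages_per_split must be at least 1")
    return [
        (start, min(start + pages_per_split, num_pages))
        for start in range(0, num_pages, pages_per_split)
    ]


def parse_chunk(page_range):
    """Hypothetical stand-in for parsing one chunk; returns a label only."""
    start, end = page_range
    return f"pages {start + 1}-{end}"


def parse_in_parallel(num_pages, pages_per_split=4, max_threads=4):
    """Process each chunk on a pool of at most max_threads worker threads."""
    chunks = split_pages(num_pages, pages_per_split)
    with ThreadPoolExecutor(max_workers=max_threads) as pool:
        # Executor.map yields results in input order, so chunks stay in page order.
        return list(pool.map(parse_chunk, chunks))
```

With the defaults, a 10-page document would be split into three chunks (pages 1-4, 5-8, and 9-10), each handled by its own worker.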

Supported API Providers

  • Google
  • OpenAI
  • Hugging Face
  • Together AI
  • OpenRouter
  • Fireworks

Benchmark

Results aggregated across 14 documents.

Note: Benchmarks are currently done in the zero-shot setting.

| Rank | Model | SequenceMatcher Similarity | TFIDF Similarity | Time (s) | Cost ($) |
|---|---|---|---|---|---|
| 1 | AUTO (with auto-selected model) | 0.899 (±0.131) | 0.960 (±0.066) | 21.17 | 0.00066 |
| 2 | AUTO | 0.895 (±0.112) | 0.973 (±0.046) | 9.29 | 0.00063 |
| 3 | gemini-2.5-flash | 0.886 (±0.164) | 0.986 (±0.027) | 52.55 | 0.01226 |
| 4 | mistral-ocr-latest | 0.882 (±0.106) | 0.932 (±0.091) | 5.75 | 0.00121 |
| 5 | gemini-2.5-pro | 0.876 (±0.195) | 0.976 (±0.049) | 22.65 | 0.02408 |
| 6 | gemini-2.0-flash | 0.875 (±0.148) | 0.977 (±0.037) | 11.96 | 0.00079 |
| 7 | claude-3-5-sonnet-20241022 | 0.858 (±0.184) | 0.930 (±0.098) | 17.32 | 0.01804 |
| 8 | gemini-1.5-flash | 0.842 (±0.214) | 0.969 (±0.037) | 15.58 | 0.00043 |
| 9 | gpt-5-mini | 0.819 (±0.201) | 0.917 (±0.104) | 52.84 | 0.00811 |
| 10 | gpt-5 | 0.807 (±0.215) | 0.919 (±0.088) | 98.12 | 0.05505 |
| 11 | claude-sonnet-4-20250514 | 0.801 (±0.188) | 0.905 (±0.136) | 22.02 | 0.02056 |
| 12 | claude-opus-4-20250514 | 0.789 (±0.220) | 0.886 (±0.148) | 29.55 | 0.09513 |
| 13 | accounts/fireworks/models/llama4-maverick-instruct-basic | 0.772 (±0.203) | 0.930 (±0.117) | 16.02 | 0.00147 |
| 14 | gemini-1.5-pro | 0.767 (±0.309) | 0.865 (±0.230) | 24.77 | 0.01139 |
| 15 | gpt-4.1-mini | 0.754 (±0.249) | 0.803 (±0.193) | 23.28 | 0.00347 |
| 16 | accounts/fireworks/models/llama4-scout-instruct-basic | 0.754 (±0.243) | 0.942 (±0.063) | 13.36 | 0.00087 |
| 17 | gpt-4o | 0.752 (±0.269) | 0.896 (±0.123) | 28.87 | 0.01469 |
| 18 | gpt-4o-mini | 0.728 (±0.241) | 0.850 (±0.128) | 18.96 | 0.00609 |
| 19 | claude-3-7-sonnet-20250219 | 0.646 (±0.397) | 0.758 (±0.297) | 57.96 | 0.01730 |
| 20 | gpt-4.1 | 0.637 (±0.301) | 0.787 (±0.185) | 35.37 | 0.01498 |
| 21 | google/gemma-3-27b-it | 0.604 (±0.342) | 0.788 (±0.297) | 23.16 | 0.00020 |
| 22 | microsoft/phi-4-multimodal-instruct | 0.589 (±0.273) | 0.820 (±0.197) | 14.00 | 0.00045 |
| 23 | qwen/qwen-2.5-vl-7b-instruct | 0.498 (±0.378) | 0.630 (±0.445) | 14.73 | 0.00056 |
| 24 | ds4sd/SmolDocling-256M-preview | 0.482 (±0.365) | 0.572 (±0.351) | 106.19 | 0.00000 |
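The "SequenceMatcher Similarity" column name suggests Python's `difflib.SequenceMatcher`. The sketch below shows how such a score, and a mean (±spread) summary over several documents, could be computed. It assumes the default `SequenceMatcher` settings and a population standard deviation; the benchmark's actual preprocessing and aggregation are not described here, and the function names are hypothetical.

```python
from difflib import SequenceMatcher
from statistics import mean, pstdev


def sequence_similarity(expected, actual):
    """Similarity in [0, 1] between ground-truth text and parsed text."""
    return SequenceMatcher(None, expected, actual).ratio()


def summarize(scores):
    """Return (mean, population standard deviation) of per-document scores.

    Assumption: the benchmark may aggregate differently (for example, with
    a sample standard deviation).
    """
    return mean(scores), pstdev(scores)
```

For example, `sequence_similarity("abcd", "abcf")` gives 0.75: three matching characters, doubled, divided by the combined length of eight.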

Citation

If you use Lexoid in production or publications, please cite accordingly and acknowledge usage. We appreciate the support 🙏
