Lexoid
Lexoid is an efficient document parsing library that supports both LLM-based and non-LLM-based (static) PDF document parsing.
Motivation:
- Take advantage of the multimodal capabilities of recent LLMs
- Make document parsing convenient for users
- Encourage collaboration under a permissive license
Installation
Installing with pip
pip install lexoid
To use LLM-based parsing, define the following environment variables or create a .env file containing them:
OPENAI_API_KEY=""
GOOGLE_API_KEY=""
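Before calling an LLM-based parser, it can help to confirm that these keys are actually visible to the process. The following is a minimal sketch using only the standard library. The helper name `missing_api_keys` is hypothetical and not part of Lexoid, and it assumes both keys are wanted; in practice you may only need the key for the provider you use.

```python
import os

# Variable names taken from the definitions above.
REQUIRED_KEYS = ("OPENAI_API_KEY", "GOOGLE_API_KEY")


def missing_api_keys(env=None):
    """Return the names of keys that are unset or empty (hypothetical helper)."""
    env = os.environ if env is None else env
    return [name for name in REQUIRED_KEYS if not env.get(name, "").strip()]


if __name__ == "__main__":
    missing = missing_api_keys()
    if missing:
        print("Missing or empty:", ", ".join(missing))
    else:
        print("All API keys are set.")
```

Note that this only inspects the process environment; a .env file has to be loaded by whatever tool reads it before the values appear there.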
Optionally, to use Playwright for retrieving web content (instead of the requests library):
playwright install --with-deps --only-shell chromium
Building .whl from source
make build
Creating a local installation
To install dependencies:
make install
or, to install with dev-dependencies:
make dev
To activate virtual environment:
source .venv/bin/activate
Usage
Here's a quick example to parse documents using Lexoid:
from lexoid.api import parse, ParserType

# Parse a web page with the LLM-based parser
parsed_md = parse("https://www.justice.gov/eoir/immigration-law-advisor", parser_type="LLM_PARSE", raw=True)
# or parse a local PDF file
pdf_path = "path/to/immigration-law-advisor.pdf"
parsed_md = parse(pdf_path, parser_type="LLM_PARSE", raw=True)
print(parsed_md)
Parameters
- path (str): The file path or URL.
- parser_type (str, optional): The type of parser to use ("LLM_PARSE", "STATIC_PARSE", or "AUTO"). Defaults to "AUTO".
- raw (bool, optional): If True, return raw text; otherwise return structured data. Defaults to False.
- pages_per_split (int, optional): Number of pages per split for chunking. Defaults to 4.
- max_threads (int, optional): Maximum number of threads for parallel processing. Defaults to 4.
- **kwargs: Additional arguments for the parser.
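To make `pages_per_split` and `max_threads` more concrete, here is a rough sketch of how a document could be cut into page ranges and processed by a bounded thread pool. This is an illustration under assumptions, not Lexoid's actual implementation: the function names `split_page_ranges` and `process_in_parallel` and the `handle_chunk` callback are hypothetical.

```python
from concurrent.futures import ThreadPoolExecutor


def split_page_ranges(total_pages, pages_per_split=4):
    """Return (start, end) page ranges, 0-based and end-exclusive."""
    if pages_per_split < 1:
        raise ValueError("pages_per_split must be at least 1")
    return [
        (start, min(start + pages_per_split, total_pages))
        for start in range(0, total_pages, pages_per_split)
    ]


def process_in_parallel(total_pages, handle_chunk, pages_per_split=4, max_threads=4):
    """Apply handle_chunk(start, end) to each range with at most max_threads workers."""
    ranges = split_page_ranges(total_pages, pages_per_split)
    with ThreadPoolExecutor(max_workers=max_threads) as pool:
        # map() preserves input order, so results line up with the page ranges.
        return list(pool.map(lambda r: handle_chunk(*r), ranges))


if __name__ == "__main__":
    print(split_page_ranges(10))  # [(0, 4), (4, 8), (8, 10)]
```

With the defaults, a 10-page document yields three chunks, the last one shorter than the others. Smaller splits give more parallelism but more separate parser calls.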
Benchmark
Initial results (more updates soon). Note: benchmarks are currently run in a zero-shot scenario.
| Rank | Model/Framework | Similarity | Time (s) |
|---|---|---|---|
| 1 | gpt-4o | 0.799 | 21.77 |
| 2 | gemini-2.0-flash-exp | 0.797 | 13.47 |
| 3 | gemini-exp-1121 | 0.779 | 30.88 |
| 4 | gemini-1.5-pro | 0.742 | 15.77 |
| 5 | gpt-4o-mini | 0.721 | 14.86 |
| 6 | gemini-1.5-flash | 0.702 | 4.56 |
| 7 | Llama-3.2-11B-Vision-Instruct (via HF) | 0.582 | 21.74 |
| 8 | Llama-3.2-11B-Vision-Instruct-Turbo (via Together AI) | 0.556 | 4.58 |
| 9 | Llama-3.2-90B-Vision-Instruct-Turbo (via Together AI) | 0.527 | 10.57 |
| 10 | Llama-Vision-Free (via Together AI) | 0.435 | 8.42 |
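The table does not state how the Similarity column is computed. As an illustration only, a score between 0 and 1 comparing parsed output with a ground-truth document can be obtained from the standard library's `difflib.SequenceMatcher`. Whether the benchmark uses this exact metric is an assumption, and `similarity` is a hypothetical helper.

```python
from difflib import SequenceMatcher


def similarity(parsed, reference):
    """Return a ratio in [0, 1]; 1.0 means the two strings are identical."""
    return SequenceMatcher(None, parsed, reference).ratio()


if __name__ == "__main__":
    reference = "# Title\nSome ground-truth text."
    parsed = "# Title\nSome ground truth text"
    print(round(similarity(parsed, reference), 3))
```

A character-level ratio like this is sensitive to whitespace and markup differences, so scores from different metrics are not directly comparable.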
Download files
File details
Details for the file lexoid-0.1.8.tar.gz.
File metadata
- Download URL: lexoid-0.1.8.tar.gz
- Upload date:
- Size: 22.9 kB
- Tags: Source
- Uploaded using Trusted Publishing? No
- Uploaded via: poetry/1.8.5 CPython/3.12.3 Linux/6.8.0-51-generic
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | 2c05a32a8404296644e9776b7396b0f76f2c84fb54259ee9825ac4893bbf2853 |
| MD5 | 64e2019a90816f0547c4bc3a5471c1bb |
| BLAKE2b-256 | 4a30c3a742172fc3b686b4b821f36425de0544511e990236e38ee28567275a3e |
File details
Details for the file lexoid-0.1.8-py3-none-any.whl.
File metadata
- Download URL: lexoid-0.1.8-py3-none-any.whl
- Upload date:
- Size: 24.2 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? No
- Uploaded via: poetry/1.8.5 CPython/3.12.3 Linux/6.8.0-51-generic
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | 8a0842d1611f69911288318fdae0f46198d9f465c12b8648aa3fc4fbe2ced3a7 |
| MD5 | aabea01674400cce9614efddc6643bb4 |
| BLAKE2b-256 | 6390c5e0be5a61a426ce559fd3459c086ee667f85919761700ecc4bdd749b473 |
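A downloaded file can be checked against the SHA256 digests listed above. The following is a small sketch using Python's `hashlib`; the helper names `sha256_of_file` and `matches_digest` are illustrative and not part of Lexoid or pip.

```python
import hashlib


def sha256_of_file(path, chunk_size=65536):
    """Compute the SHA256 hex digest of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        # iter() with a sentinel stops when read() returns empty bytes at EOF.
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_digest(path, expected_hex):
    """Compare the file's digest with a published one, ignoring case and padding."""
    return sha256_of_file(path) == expected_hex.strip().lower()
```

For example, `matches_digest("lexoid-0.1.8.tar.gz", "<SHA256 from the table>")` should return True for an intact download of the source distribution, assuming the file is in the current directory.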