Lexoid
Lexoid is an efficient document parsing library that supports both LLM-based and non-LLM-based (static) PDF document parsing.
Motivation:
- Take advantage of the multi-modal advances in LLMs
- Make document parsing convenient for users while driving innovation
- Collaborate under a permissive license
Installation
To install dependencies:
make install
or, to install with dev-dependencies:
make dev
To activate virtual environment:
source .venv/bin/activate
To use LLM-based parsing, define the following environment variables, or create a .env file containing them:
OPENAI_API_KEY=""
GOOGLE_API_KEY=""
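Before running an LLM-based parse, it can help to confirm that the keys are visible to the Python process. The following is a minimal sketch using only the standard library. The helper name `missing_api_keys` is hypothetical and not part of Lexoid, the check reads the process environment only (it does not load a .env file), and whether both keys are needed at once is an assumption that probably depends on which model you call.

```python
import os

# Key names taken from the installation notes above.
KNOWN_KEYS = ("OPENAI_API_KEY", "GOOGLE_API_KEY")


def missing_api_keys(env=None):
    """Return the names of known API keys that are unset or blank.

    Hypothetical helper for illustration; not part of Lexoid.
    """
    env = os.environ if env is None else env
    return [name for name in KNOWN_KEYS if not env.get(name, "").strip()]


if __name__ == "__main__":
    missing = missing_api_keys()
    if missing:
        print("Missing or empty:", ", ".join(missing))
    else:
        print("All known API keys are set.")
```

Passing a plain dictionary instead of relying on `os.environ` makes the check easy to exercise in tests.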
To build a .whl file for testing:
poetry build
Optionally, to use Playwright for retrieving web content with the .whl package (otherwise, plain requests is used by default):
playwright install --with-deps --only-shell chromium
Usage
Here's a quick example to parse documents using Lexoid:
from lexoid.api import parse
# Parse a web page by URL with the LLM-based parser
parsed_md = parse("https://www.justice.gov/eoir/immigration-law-advisor", parser_type="LLM_PARSE", raw=True)
# or parse a local PDF file
pdf_path = "path/to/immigration-law-advisor.pdf"
parsed_md = parse(pdf_path, parser_type="LLM_PARSE", raw=True)
print(parsed_md)
Parameters
- path (str): The file path or URL.
- parser_type (str, optional): The type of parser to use ("AUTO", "LLM_PARSE", or "STATIC_PARSE"). Defaults to "AUTO".
- raw (bool, optional): Whether to return raw text or structured data. Defaults to False.
- pages_per_split (int, optional): Number of pages per split for chunking. Defaults to 4.
- max_threads (int, optional): Maximum number of threads for parallel processing. Defaults to 4.
- **kwargs: Additional arguments for the parser.
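To make `pages_per_split` and `max_threads` concrete, here is a rough sketch of how a document could be chunked into page ranges and processed on a thread pool. It is an illustration under stated assumptions, not Lexoid's actual implementation: `split_pages`, `parse_chunk`, and `parse_in_chunks` are hypothetical names, and `parse_chunk` only returns a label instead of parsing anything.

```python
from concurrent.futures import ThreadPoolExecutor


def split_pages(total_pages, pages_per_split=4):
    """Return (start, end) page ranges, end exclusive, of at most pages_per_split pages."""
    return [
        (start, min(start + pages_per_split, total_pages))
        for start in range(0, total_pages, pages_per_split)
    ]


def parse_chunk(page_range):
    """Hypothetical stand-in for parsing one chunk; returns a label only."""
    start, end = page_range
    return f"pages {start + 1}-{end}"


def parse_in_chunks(total_pages, pages_per_split=4, max_threads=4):
    """Process each chunk on a worker thread, keeping results in chunk order."""
    chunks = split_pages(total_pages, pages_per_split)
    with ThreadPoolExecutor(max_workers=max_threads) as pool:
        # pool.map yields results in input order, regardless of completion order.
        return list(pool.map(parse_chunk, chunks))


print(parse_in_chunks(10))
# prints ['pages 1-4', 'pages 5-8', 'pages 9-10']
```

With the defaults listed above, a 10-page document would yield three chunks. In this sketch, smaller splits mean more chunks that can run concurrently, up to `max_threads` at a time.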
Benchmark
Initial results (more updates soon)
| Rank | Model/Framework | Similarity | Time (s) |
|---|---|---|---|
| 1 | gpt-4o | 0.799 | 21.77 |
| 2 | gemini-1.5-pro | 0.742 | 15.77 |
| 3 | gpt-4o-mini | 0.721 | 14.86 |
| 4 | gemini-1.5-flash | 0.702 | 4.56 |
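The table does not say how Similarity or Time were measured. As a hedged sketch of one way to produce comparable numbers, the snippet below scores parsed text against a reference with `difflib.SequenceMatcher` and times a call with `time.perf_counter`. Both choices are assumptions for illustration and may differ from the metric behind the results above. The helper names `similarity` and `timed` are hypothetical.

```python
import time
from difflib import SequenceMatcher


def similarity(parsed, reference):
    """Character-level similarity in [0, 1]; an assumed metric, not necessarily Lexoid's."""
    return SequenceMatcher(None, parsed, reference).ratio()


def timed(fn, *args, **kwargs):
    """Call fn and return (result, elapsed_seconds)."""
    start = time.perf_counter()
    result = fn(*args, **kwargs)
    return result, time.perf_counter() - start


reference_md = "# Title\n\nSome paragraph text."
parsed_md = "# Title\n\nSome paragraph txt."

score, seconds = timed(similarity, parsed_md, reference_md)
print(f"similarity={score:.3f} time={seconds:.4f}s")
```

In practice, `fn` would be the parsing call itself and the similarity would be computed afterwards against a hand-checked ground-truth file.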
File details
Details for the file lexoid-0.1.6.post1-py3-none-any.whl.
File metadata
- Download URL: lexoid-0.1.6.post1-py3-none-any.whl
- Upload date:
- Size: 21.9 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.0.1 CPython/3.12.3
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | c4322c5f547a46c4de10c4572fa3f119b3a3e8f213d409d2f91d349a4ed9964b |
| MD5 | f65bf894dc724c08b9e41352bde7b242 |
| BLAKE2b-256 | 0bc030ef21bc24793525318a3269469276a7a093fde99840d207f5a32fc12da5 |