 ___      _______  __   __  _______  ___   ______  
|   |    |       ||  |_|  ||       ||   | |      | 
|   |    |    ___||       ||   _   ||   | |  _    |
|   |    |   |___ |       ||  | |  ||   | | | |   |
|   |___ |    ___| |     | |  |_|  ||   | | |_|   |
|       ||   |___ |   _   ||       ||   | |       |
|_______||_______||__| |__||_______||___| |______| 
                                                                                                    

Lexoid is an efficient document parsing library that supports both LLM-based and non-LLM-based (static) PDF document parsing.

Motivation:

  • Take advantage of the multimodal capabilities of recent LLMs
  • Make document parsing convenient for users
  • Encourage collaboration through a permissive license

Installation

Installing with pip

pip install lexoid

To use LLM-based parsing, define the following environment variables or add them to a .env file:

OPENAI_API_KEY=""
GOOGLE_API_KEY=""
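
Before running an LLM-based parse, it can help to confirm that the keys are actually visible to Python. The following is a minimal sketch using only the standard library; the helper name `missing_api_keys` is hypothetical and not part of Lexoid, and treating both keys as required is an assumption (you may only need the key for the provider you call). It reads the process environment only and does not load a .env file itself.

```python
import os

# Key names taken from the list above. Treating both as required is an
# assumption made for this sketch.
KEYS_FROM_README = ("OPENAI_API_KEY", "GOOGLE_API_KEY")


def missing_api_keys(env=None):
    """Return the names of keys that are unset or empty (hypothetical helper)."""
    env = os.environ if env is None else env
    return [key for key in KEYS_FROM_README if not env.get(key, "").strip()]


if __name__ == "__main__":
    missing = missing_api_keys()
    if missing:
        print("Missing or empty:", ", ".join(missing))
    else:
        print("All API keys are set.")
```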

Optionally, to use Playwright for retrieving web content (instead of the requests library):

playwright install --with-deps --only-shell chromium

Building .whl from source

make build

Creating a local installation

To install dependencies:

make install

or, to install with dev-dependencies:

make dev

To activate the virtual environment:

source .venv/bin/activate

Usage

Example Notebook

Example Colab Notebook

Here's a quick example to parse documents using Lexoid:

from lexoid.api import parse, ParserType

# Parse a web page with the LLM-based parser.
# The "raw" entry of the result holds the parsed text.
parsed_md = parse("https://www.justice.gov/eoir/immigration-law-advisor", parser_type="LLM_PARSE")["raw"]

# Or parse a local PDF the same way.
pdf_path = "path/to/immigration-law-advisor.pdf"
parsed_md = parse(pdf_path, parser_type="LLM_PARSE")["raw"]

print(parsed_md)

Parameters

  • path (str): The file path or URL.
  • parser_type (str, optional): The type of parser to use ("LLM_PARSE", "STATIC_PARSE", or "AUTO"). Defaults to "AUTO".
  • pages_per_split (int, optional): Number of pages per split for chunking. Defaults to 4.
  • max_threads (int, optional): Maximum number of threads for parallel processing. Defaults to 4.
  • **kwargs: Additional arguments for the parser.
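
To make the interaction between pages_per_split and max_threads concrete, here is a rough sketch of how a document could be cut into page chunks and processed in parallel. It is not Lexoid's implementation: `split_pages`, `parse_chunk`, and `parse_in_parallel` are hypothetical names, and `parse_chunk` is a stand-in that only returns a label instead of calling a parser.

```python
from concurrent.futures import ThreadPoolExecutor


def split_pages(num_pages, pages_per_split=4):
    """Group 0-based page indices into consecutive chunks."""
    if pages_per_split < 1:
        raise ValueError("pages_per_split must be at least 1")
    return [
        list(range(start, min(start + pages_per_split, num_pages)))
        for start in range(0, num_pages, pages_per_split)
    ]


def parse_chunk(pages):
    """Hypothetical stand-in for parsing one chunk of pages."""
    return f"<parsed pages {pages[0]}-{pages[-1]}>"


def parse_in_parallel(num_pages, pages_per_split=4, max_threads=4):
    """Process each chunk on a thread pool and return results in page order."""
    chunks = split_pages(num_pages, pages_per_split)
    with ThreadPoolExecutor(max_workers=max_threads) as pool:
        # Executor.map yields results in the order of its inputs.
        return list(pool.map(parse_chunk, chunks))


if __name__ == "__main__":
    for result in parse_in_parallel(10):
        print(result)
```

Under these assumptions, a 10-page document with the default settings yields three chunks (pages 0-3, 4-7, and 8-9), so at most three of the four threads have work to do.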

Benchmark

Results aggregated across 5 iterations each for 5 documents.

Note: Benchmarks are currently done in the zero-shot setting.

| Rank | Model | Mean Similarity | Std. Dev. | Time (s) |
|------|-------|-----------------|-----------|----------|
| 1 | gemini-2.0-flash | 0.829 | 0.102 | 7.41 |
| 2 | gemini-2.0-flash-001 | 0.814 | 0.176 | 6.85 |
| 3 | gemini-1.5-flash | 0.797 | 0.143 | 9.54 |
| 4 | gemini-2.0-pro-exp | 0.764 | 0.227 | 11.95 |
| 5 | gemini-2.0-flash-thinking-exp | 0.746 | 0.266 | 10.46 |
| 6 | gemini-1.5-pro | 0.732 | 0.265 | 11.44 |
| 7 | gpt-4o | 0.687 | 0.247 | 10.16 |
| 8 | gpt-4o-mini | 0.642 | 0.213 | 9.71 |
| 9 | gemini-1.5-flash-8b | 0.551 | 0.223 | 3.91 |
| 10 | Llama-Vision-Free (via Together AI) | 0.531 | 0.198 | 6.93 |
| 11 | Llama-3.2-11B-Vision-Instruct-Turbo (via Together AI) | 0.524 | 0.192 | 3.68 |
| 12 | Llama-3.2-90B-Vision-Instruct-Turbo (via Together AI) | 0.461 | 0.306 | 19.26 |
| 13 | Llama-3.2-11B-Vision-Instruct (via Hugging Face) | 0.451 | 0.257 | 4.54 |
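
For readers who want to produce comparable numbers on their own documents, the sketch below shows one way to score a parse against a reference text and aggregate scores across runs. The similarity metric behind the table is not stated here, so `difflib.SequenceMatcher` is used purely as a stand-in, and the choice of sample standard deviation is likewise an assumption. The function names and the toy strings are hypothetical.

```python
from difflib import SequenceMatcher
from statistics import mean, stdev


def similarity(parsed, ground_truth):
    """Character-level similarity in [0, 1] (stand-in metric, an assumption)."""
    return SequenceMatcher(None, parsed, ground_truth).ratio()


def aggregate(scores):
    """Return (mean, sample standard deviation) of a list of scores."""
    return mean(scores), stdev(scores)


if __name__ == "__main__":
    # Toy data for illustration only, not benchmark data.
    reference = "# Title\n\nSome paragraph text."
    runs = [
        "# Title\n\nSome paragraph text.",
        "# Title\n\nSome paragraph txt.",
        "Title\n\nSome paragraph text",
    ]
    scores = [similarity(run, reference) for run in runs]
    avg, spread = aggregate(scores)
    print(f"mean similarity={avg:.3f} std dev={spread:.3f}")
```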
