Python toolkit for document information extraction using LMDX

LMDX-flow

LMDX-flow is a Python toolkit for document information extraction using LMDX. It simplifies two steps: creating prompts that contain document layout information, and decoding LLM responses to extract information from documents.

What is LMDX (Language Model-based Document Information Extraction and Localization)?

LMDX is a methodology for leveraging off-the-shelf LLMs for information extraction on semi-structured documents.

Paper: https://arxiv.org/pdf/2309.10952.pdf

  • Proposes a prompt that enables LLMs to perform the document IE task on leaf and hierarchical entities with precise localization, including without any training data.
  • Proposes a layout encoding scheme that communicates spatial information to the LLM without any change to its architecture.
  • Introduces a decoding algorithm that transforms the LLM responses into extracted entities and their bounding boxes on the document, while discarding all hallucinations.
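The layout encoding idea in the list above can be illustrated with a short sketch. It appends quantized center coordinates to each OCR line so that position travels with the text in the prompt. This is only an illustration of the idea and not LMDX-flow's implementation: the `Segment` type, the function names, the `text XX|YY` format, and the choice of 100 buckets are all assumptions made for this example.

```python
from dataclasses import dataclass


@dataclass
class Segment:
    """One OCR line with its bounding box in page pixels (hypothetical type)."""
    text: str
    x0: float
    y0: float
    x1: float
    y1: float


def quantize(value: float, size: float, buckets: int = 100) -> int:
    """Map a coordinate in [0, size] to an integer bucket in [0, buckets - 1]."""
    bucket = int(value * buckets / size)
    return min(max(bucket, 0), buckets - 1)


def encode_segments(segments, page_width, page_height, buckets=100):
    """Render each segment as '<text> XX|YY' using its quantized center point."""
    lines = []
    for seg in segments:
        cx = quantize((seg.x0 + seg.x1) / 2, page_width, buckets)
        cy = quantize((seg.y0 + seg.y1) / 2, page_height, buckets)
        lines.append(f"{seg.text} {cx:02d}|{cy:02d}")
    return "\n".join(lines)


segments = [
    Segment("Invoice No: 12345", 50, 40, 350, 60),
    Segment("Total: $99.00", 600, 900, 800, 920),
]
print(encode_segments(segments, page_width=1000, page_height=1000))
# Invoice No: 12345 20|05
# Total: $99.00 70|91
```

Because the coordinates are plain text tokens, the LLM needs no architectural change to read them, and it can echo them back to indicate where a value was found.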

Key Features

  • Prompt Generation: Easily create effective prompts based on the LMDX methodology.
  • Response Decoding: Extract entity values and bounding boxes by decoding and grounding the LLM responses.
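Response decoding and grounding can be sketched as follows. The sketch assumes the LLM returns a flat JSON object whose values end with a coordinate marker such as `70|91`, and it keeps a value only when that marker points to a known segment whose text contains the value. The response format, the `ground_entities` name, and the `segments_by_coords` lookup are assumptions for illustration, not LMDX-flow's API.

```python
import json
import re

# Matches a value followed by a coordinate marker, e.g. "$99.00 70|91".
VALUE_WITH_COORDS = re.compile(r"^(?P<text>.*?)\s+(?P<x>\d{2})\|(?P<y>\d{2})$")


def ground_entities(llm_response, segments_by_coords):
    """Keep only entities whose text is found in the segment they point to.

    segments_by_coords maps (x, y) markers to (segment_text, bounding_box).
    """
    try:
        raw = json.loads(llm_response)
    except json.JSONDecodeError:
        return {}  # Unparseable output is dropped entirely.
    if not isinstance(raw, dict):
        return {}

    grounded = {}
    for name, value in raw.items():
        match = VALUE_WITH_COORDS.match(value) if isinstance(value, str) else None
        if match is None or not match["text"]:
            continue
        key = (int(match["x"]), int(match["y"]))
        segment = segments_by_coords.get(key)
        # Discard values that cite no known segment or whose text is not on it.
        if segment is None or match["text"] not in segment[0]:
            continue
        grounded[name] = {"value": match["text"], "bbox": segment[1]}
    return grounded


segments_by_coords = {
    (20, 5): ("Invoice No: 12345", (50, 40, 350, 60)),
    (70, 91): ("Total: $99.00", (600, 900, 800, 920)),
}
response = (
    '{"invoice_number": "12345 20|05", '
    '"total": "$99.00 70|91", '
    '"due_date": "2024-01-01 10|10"}'
)
result = ground_entities(response, segments_by_coords)
print(sorted(result))  # ['invoice_number', 'total']
```

The `due_date` value is dropped because no segment exists at `10|10`, which is how a grounding check of this kind filters out hallucinated values. The bounding box returned here is that of the whole segment; a finer-grained box would need word-level OCR data.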

Getting Started

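pip install lmdx-flow

# Load the tokenizer
from transformers import AutoTokenizer
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")

# Placeholders you supply: file_path (path to the input document),
# schema (the entities to extract), llm_responses (the LLM outputs for the prompts).
from lmdx_flow import Pipeline
pipeline = Pipeline(file_path, tokenizer)
prompts = pipeline.generate_prompt(schema)                # build layout-aware prompts
answers = pipeline.postprocess_all_chunks(llm_responses)  # decode and ground the responses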
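The snippet above does not show how `llm_responses` is produced. A minimal sketch of that missing step follows, assuming `generate_prompt` yields one prompt string per document chunk and `postprocess_all_chunks` expects one response per prompt in the same order; neither assumption is confirmed by the package description. `collect_responses` and `fake_llm` are hypothetical names introduced for this example.

```python
from typing import Callable, Iterable, List


def collect_responses(prompts: Iterable[str], call_llm: Callable[[str], str]) -> List[str]:
    """Send each chunk prompt to the model and keep responses in prompt order."""
    return [call_llm(prompt) for prompt in prompts]


def fake_llm(prompt: str) -> str:
    """Stand-in model for illustration; replace with a real LLM client call."""
    return "{}"


llm_responses = collect_responses(["prompt for chunk 1", "prompt for chunk 2"], fake_llm)
print(len(llm_responses))  # 2
```

Passing the model call in as a function keeps the loop independent of any particular LLM provider.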

To-do

  • Add support for hierarchical entities
  • Add option to use OCR-words as segment (currently uses OCR-lines as segment)

Explore the potential of LMDX-flow to enhance document information extraction using LLMs with ease.
