llama-index readers docugami integration

Docugami Loader

pip install llama-index-readers-docugami

This loader takes in IDs of PDF, DOCX or DOC files processed by Docugami and returns nodes in a Document XML Knowledge Graph for each document. This is a rich representation that includes the semantic and structural characteristics of various chunks in the document as an XML tree. Entire sets of documents are processed, resulting in forests of XML semantic trees.

Pre-requisites

  1. Create a Docugami workspace: http://www.docugami.com (free trials available)
  2. Add your documents (PDF, DOCX or DOC) and allow Docugami to ingest and cluster them into sets of similar documents, e.g. NDAs, Lease Agreements, and Service Agreements. The system does not support a fixed set of document types; the clusters it creates depend on your particular documents, and you can change the docset assignments later.
  3. Create an access token via the Developer Playground for your workspace. Detailed instructions: https://help.docugami.com/home/docugami-api
  4. Explore the Docugami API at https://api-docs.docugami.com to get a list of your processed docset IDs, or just the document IDs for a particular docset.
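
As a rough illustration of step 4, the sketch below builds an authenticated request for the list of docsets and extracts IDs from a decoded response. The base URL, the `/docsets` path, the bearer-token header, and the `docsets` / `id` response keys are all assumptions made for illustration; check them against the API documentation linked above. The helper names `build_docsets_request` and `extract_ids` are hypothetical.

```python
import urllib.request
from typing import List

# Assumption: base URL and path are illustrative; confirm them in the API docs.
API_BASE = "https://api.docugami.com/v1preview1"


def build_docsets_request(access_token: str, api_base: str = API_BASE) -> urllib.request.Request:
    """Build (but do not send) a GET request for the workspace's docsets."""
    return urllib.request.Request(
        url=f"{api_base}/docsets",
        # Assumption: the access token is sent as a bearer token.
        headers={"Authorization": f"Bearer {access_token}"},
        method="GET",
    )


def extract_ids(payload: dict, key: str = "docsets") -> List[str]:
    """Collect the `id` field of each item listed under `key` (assumed response shape)."""
    return [item["id"] for item in payload.get(key, [])]
```

To send the request, pass it to `urllib.request.urlopen` and decode the body with `json.load`; the result can then be handed to `extract_ids`.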

Usage

To use this loader, pass in a Docugami docset ID and, optionally, a list of document IDs. By default, all documents in the docset are loaded.

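from llama_index.readers.docugami import DocugamiReader

# ID of the docset to load, plus (optionally) specific document IDs within it
docset_id = "tjwrr2ekqkc3"
document_ids = ["ui7pkriyckwi", "1be3o7ch10iy"]

# Omit document_ids to load every document in the docset
loader = DocugamiReader()
documents = loader.load_data(docset_id=docset_id, document_ids=document_ids)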

This loader is designed to load data into LlamaIndex.
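
Once loaded, the chunks can be inspected before indexing. The sketch below groups chunk text by semantic tag. It assumes each returned document exposes `.text` and a `.metadata` dict, and that the tag is stored under a `"tag"` key; that key name and the helper `group_text_by_tag` are assumptions for illustration, so inspect the metadata of your own documents first. Stand-in objects are used so the example runs without a Docugami workspace.

```python
from collections import defaultdict
from types import SimpleNamespace


def group_text_by_tag(documents, tag_key="tag"):
    """Map each metadata tag to the list of chunk texts carrying it."""
    grouped = defaultdict(list)
    for doc in documents:
        # Chunks without the assumed key are collected under "untagged".
        tag = doc.metadata.get(tag_key, "untagged")
        grouped[tag].append(doc.text)
    return dict(grouped)


# Stand-ins for loaded documents (hypothetical values, not real Docugami output).
sample = [
    SimpleNamespace(text="Acme Realty LLC", metadata={"tag": "Landlord"}),
    SimpleNamespace(text="Jane Doe", metadata={"tag": "Tenant"}),
    SimpleNamespace(text="John Roe", metadata={"tag": "Tenant"}),
]
grouped = group_text_by_tag(sample)
```

With real data, replace `sample` with the `documents` list returned by `load_data`.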

See more information about how to use Docugami with LangChain in the LangChain docs.

Advantages vs Other Chunking Techniques

Appropriate chunking of your documents is critical for retrieval. Many chunking techniques exist, including simple ones that split on whitespace and recursive splitters based on character length. Docugami offers a different approach:

  1. Intelligent Chunking: Docugami breaks down every document into a hierarchical semantic XML tree of chunks of varying sizes, from single words or numerical values to entire sections. These chunks follow the semantic contours of the document, providing a more meaningful representation than arbitrary length or simple whitespace-based chunking.
  2. Structured Representation: In addition, the XML tree indicates the structural contours of every document, using attributes denoting headings, paragraphs, lists, tables, and other common elements, and does that consistently across all supported document formats, such as scanned PDFs or DOCX files. It appropriately handles long-form document characteristics like page headers/footers or multi-column flows for clean text extraction.
  3. Semantic Annotations: Chunks are annotated with semantic tags that are coherent across the document set, facilitating consistent hierarchical queries across multiple documents, even if they are written and formatted differently. For example, in a set of lease agreements, you can easily identify key provisions like the Landlord, Tenant, or Renewal Date, as well as more complex information such as the wording of any sub-lease provision or whether a specific jurisdiction has an exception section within a Termination Clause.
  4. Additional Metadata: Chunks are also annotated with additional metadata, if a user has been using Docugami. This additional metadata can be used for high-accuracy Document QA without context window restrictions. See detailed code walk-through in this notebook.
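
To make the contrast with length-based splitting concrete, here is a minimal sketch using Python's standard `xml.etree.ElementTree`. The XML snippet and its tag names are hypothetical and only loosely imitate a semantic tree; real Docugami output is richer and will look different. The helpers `leaf_chunks` and `fixed_length_chunks` are likewise illustrative.

```python
import xml.etree.ElementTree as ET

# Hypothetical XML for illustration only; not actual Docugami output.
LEASE_XML = """
<Lease>
  <Parties>
    <Landlord>Acme Realty LLC</Landlord>
    <Tenant>Jane Doe</Tenant>
  </Parties>
  <TerminationClause>
    <Notice>Either party may terminate with 30 days notice.</Notice>
  </TerminationClause>
</Lease>
"""


def leaf_chunks(root):
    """Yield (path, text) for every leaf element, keeping its ancestry as context."""
    def walk(node, path):
        here = f"{path}/{node.tag}"
        children = list(node)
        if not children:
            yield here, (node.text or "").strip()
        for child in children:
            yield from walk(child, here)
    yield from walk(root, "")


def fixed_length_chunks(text, size):
    """Baseline: split text every `size` characters, ignoring structure."""
    return [text[i:i + size] for i in range(0, len(text), size)]


root = ET.fromstring(LEASE_XML)
semantic = list(leaf_chunks(root))
flat_text = " ".join(text for _, text in semantic)
naive = fixed_length_chunks(flat_text, 20)
```

Here each entry in `semantic` pairs a value with the path that explains what it is (for example `/Lease/Parties/Tenant`), while the first 20-character chunk in `naive` fuses the landlord's name with half of the tenant's and carries no label at all.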


