A retrieval pipeline for single documents.

Single Document Retrieval Pipeline

A Python library for retrieval over a single, pre-parsed document. Given a query (a label description), it identifies the relevant text sections of the document and, if necessary, iteratively refines the search query until it finds sufficient context.
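
The retrieve-then-refine idea described above can be sketched roughly as follows. This is a minimal illustration and not the library's actual implementation: the word-overlap score stands in for embedding similarity, the `refine_query` callback stands in for a chat-model rewrite, and every name here (`overlap_score`, `retrieve_with_refinement`, `threshold`, `max_rounds`) is hypothetical.

```python
from typing import Callable, List, Tuple


def overlap_score(query: str, section: str) -> float:
    """Toy relevance score: fraction of query words that appear in the section."""
    query_words = set(query.lower().split())
    if not query_words:
        return 0.0
    section_words = set(section.lower().split())
    return len(query_words & section_words) / len(query_words)


def retrieve_with_refinement(
    sections: List[str],
    query: str,
    refine_query: Callable[[str, int], str],
    threshold: float = 0.5,
    max_rounds: int = 3,
) -> Tuple[str, str, float]:
    """Return (best_section, query_used, score).

    The query is rewritten between rounds until the best section scores at
    least `threshold` or `max_rounds` searches have been made.
    """
    if not sections:
        return "", query, 0.0
    best_section, best_score = "", 0.0
    for round_index in range(max_rounds):
        # Score every section against the current query and keep the best one.
        scored = [(overlap_score(query, s), s) for s in sections]
        best_score, best_section = max(scored, key=lambda pair: pair[0])
        if best_score >= threshold or round_index == max_rounds - 1:
            break
        # Hypothetical refinement step (a chat model in a real pipeline).
        query = refine_query(query, round_index)
    return best_section, query, best_score
```

In this sketch the loop returns the query that produced the final score, so a caller can see whether refinement was needed.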

Testing

The main way to use the library is by calling the find_relevant_context function.

import os

# Assumed import path; adjust it if the package exposes the function elsewhere.
from single_doc_retrieval import find_relevant_context

# Directory that holds the pre-parsed document's files.
DOC_FILES_DIRECTORY = ""

OPENAI_API_KEY_INPUT = ""

LABELS_FILE_PATH = "labels.json"
LABEL_NAME_TO_USE = "governing_law_clause"

EMBEDDING_MODEL_NAME = "text-embedding-3-large"
CHAT_MODEL_NAME = "gpt-4o"

# Derive doc_id and extraction_dir from DOC_FILES_DIRECTORY.
# Strip any trailing separator first; otherwise basename returns "".
doc_dir = DOC_FILES_DIRECTORY.rstrip("/\\")
doc_id_from_path = os.path.basename(doc_dir)
extraction_dir_from_path = os.path.dirname(doc_dir)

# Pipeline options that select the embedding and chat models.
pipeline_opts = {
    "embedding_model": EMBEDDING_MODEL_NAME,
    "chat_model": CHAT_MODEL_NAME,
}

relevant_context = find_relevant_context(
    doc_id=doc_id_from_path,
    label_name=LABEL_NAME_TO_USE,
    extraction_dir=extraction_dir_from_path,
    labels_file_path=LABELS_FILE_PATH,
    openai_api_key=OPENAI_API_KEY_INPUT,
    pipeline_options=pipeline_opts,
)

The labels.json file should be formatted as follows:

{
    "label_1": {
        "description": "",
        "examples": []
    },
    "label_2": {
        "description": "",
        "examples": []
    }
}
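
Before running the pipeline, it can help to check that a labels file matches the structure shown above. The sketch below is not part of the library: `validate_labels` and `load_labels` are hypothetical helpers, and the assumption that every label needs a string `description` and a list `examples` is inferred only from the sample.

```python
import json
from typing import Any, Dict, List


def validate_labels(labels: Any) -> List[str]:
    """Return a list of problems; an empty list means the structure matches."""
    if not isinstance(labels, dict):
        return ["top level must be an object mapping label names to entries"]
    problems: List[str] = []
    for name, entry in labels.items():
        if not isinstance(entry, dict):
            problems.append(f"{name}: entry must be an object")
            continue
        if not isinstance(entry.get("description"), str):
            problems.append(f"{name}: 'description' must be a string")
        if not isinstance(entry.get("examples"), list):
            problems.append(f"{name}: 'examples' must be a list")
    return problems


def load_labels(path: str) -> Dict[str, Dict[str, Any]]:
    """Read a labels file and raise ValueError if its structure is off."""
    with open(path, encoding="utf-8") as handle:
        labels = json.load(handle)
    problems = validate_labels(labels)
    if problems:
        raise ValueError("invalid labels file: " + "; ".join(problems))
    return labels
```

Calling `load_labels("labels.json")` first surfaces a malformed entry as a readable error, rather than as a failure somewhere inside the retrieval call.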
