A retrieval pipeline for single documents.
Single Document Retrieval Pipeline
A Python library for retrieval over a single, pre-parsed document. Given a query (a label description), it identifies the relevant text sections of the document and, if necessary, iteratively refines the search query until it finds sufficient context.
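To make the idea of iterative query refinement concrete, here is a minimal conceptual sketch. It is not the library's implementation: the word-overlap score, the threshold, and the `refine` callback are all assumptions chosen for illustration, and a real pipeline would more likely use embeddings for scoring and a chat model for rewriting the query.

```python
from typing import Callable, List, Tuple


def overlap_score(section: str, query: str) -> float:
    """Toy relevance score: fraction of query terms that appear in the section."""
    terms = set(query.lower().split())
    if not terms:
        return 0.0
    words = set(section.lower().split())
    return len(terms & words) / len(terms)


def retrieve_with_refinement(
    sections: List[str],
    query: str,
    refine: Callable[[str], str],
    threshold: float = 0.5,
    max_rounds: int = 3,
) -> Tuple[str, str, float]:
    """Return (best_section, final_query, best_score).

    Each round scores every section against the current query. If the best
    score is below the threshold, the query is rewritten and the search repeats.
    """
    best_section, best_score = "", 0.0
    for round_no in range(max_rounds):
        best_section = max(
            sections, key=lambda s: overlap_score(s, query), default=""
        )
        best_score = overlap_score(best_section, query)
        if best_score >= threshold or round_no == max_rounds - 1:
            break
        query = refine(query)  # hypothetical rewrite step
    return best_section, query, best_score


# Hypothetical document sections and a fixed rewrite standing in for a model call.
sections = [
    "This Agreement is governed by the laws of the State of New York.",
    "Payment is due within thirty days of invoice.",
]
best_section, final_query, best_score = retrieve_with_refinement(
    sections,
    query="jurisdiction clause",
    refine=lambda q: "governed by the laws of",
)
print(best_section)
```

The first query shares no words with either section, so the sketch rewrites it once and the second round selects the governing-law sentence.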
Testing
The main way to use the library is by calling the find_relevant_context function.
import os

# Assumed import path, inferred from the package name; check the installed package.
from single_doc_retrieval import find_relevant_context

DOC_FILES_DIRECTORY = ""  # path to the directory with the document's files
OPENAI_API_KEY_INPUT = ""  # your OpenAI API key
LABELS_FILE_PATH = "labels.json"
LABEL_NAME_TO_USE = "governing_law_clause"
EMBEDDING_MODEL_NAME = "text-embedding-3-large"
CHAT_MODEL_NAME = "gpt-4o"

# Derive doc_id and extraction_dir from DOC_FILES_DIRECTORY.
# Avoid a trailing slash: os.path.basename would then return an empty string.
doc_id_from_path = os.path.basename(DOC_FILES_DIRECTORY)
extraction_dir_from_path = os.path.dirname(DOC_FILES_DIRECTORY)

# Pipeline options selecting the embedding and chat models
pipeline_opts = {
    "embedding_model": EMBEDDING_MODEL_NAME,
    "chat_model": CHAT_MODEL_NAME,
}

relevant_context = find_relevant_context(
    doc_id=doc_id_from_path,
    label_name=LABEL_NAME_TO_USE,
    extraction_dir=extraction_dir_from_path,
    labels_file_path=LABELS_FILE_PATH,
    openai_api_key=OPENAI_API_KEY_INPUT,
    pipeline_options=pipeline_opts,
)
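The way doc_id and extraction_dir are derived from the directory path can be illustrated with a short sketch. The helper name `split_doc_path` and the example path are hypothetical; only the `os.path` calls come from the standard library. Normalizing the path first is an added precaution, since a trailing separator makes `os.path.basename` return an empty string.

```python
import os
from typing import Tuple


def split_doc_path(doc_files_directory: str) -> Tuple[str, str]:
    """Split a document directory path into (doc_id, extraction_dir)."""
    # normpath drops a trailing separator before the path is split.
    normalized = os.path.normpath(doc_files_directory)
    return os.path.basename(normalized), os.path.dirname(normalized)


# Hypothetical layout: <extraction_dir>/<doc_id>/
doc_id, extraction_dir = split_doc_path("extractions/contract_001/")
print(doc_id)
print(extraction_dir)
```

With this layout, the last path component serves as the document ID and its parent directory as the extraction directory.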
The labels.json file should be formatted as follows:
{
  "label_1": {
    "description": "",
    "examples": []
  },
  "label_2": {
    "description": "",
    "examples": []
  }
}
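As a rough sketch, a labels file in this shape can be built and checked in Python before it is passed to the pipeline. The `validate_labels` helper is hypothetical, and the label text is invented for illustration. Treating both keys as required and `examples` as a list of strings are assumptions, since the format above only shows empty values.

```python
import json


def validate_labels(labels: dict) -> None:
    """Raise ValueError if an entry lacks a string description or a list of examples."""
    for name, entry in labels.items():
        if not isinstance(entry, dict):
            raise ValueError(f"{name}: entry must be an object")
        if not isinstance(entry.get("description"), str):
            raise ValueError(f"{name}: 'description' must be a string")
        if not isinstance(entry.get("examples"), list):
            raise ValueError(f"{name}: 'examples' must be a list")


# Illustrative content only; the description and example are made up.
labels = {
    "governing_law_clause": {
        "description": "The clause stating which jurisdiction's law governs the agreement.",
        "examples": [
            "This Agreement shall be governed by the laws of the State of New York."
        ],
    }
}

validate_labels(labels)
labels_json = json.dumps(labels, indent=2)
print(labels_json)  # write this string to labels.json
```

Each top-level key is a label name, which is what the label_name argument refers to in the call to find_relevant_context.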
File details
Details for the file single_doc_retrieval-0.2.0.tar.gz.
File metadata
- Download URL: single_doc_retrieval-0.2.0.tar.gz
- Upload date:
- Size: 12.4 kB
- Tags: Source
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.1.0 CPython/3.10.14
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | 3f8aa6a6274d769a040ad9b74d1102c86db0b6e127413a460f8a37111a5f8906 |
| MD5 | 5498d32567fedaa941733b85e8b4a3d3 |
| BLAKE2b-256 | f6955e07d681b7c73c181d5fcca4527e216833b93e0b0087a963b26be4952307 |
File details
Details for the file single_doc_retrieval-0.2.0-py3-none-any.whl.
File metadata
- Download URL: single_doc_retrieval-0.2.0-py3-none-any.whl
- Upload date:
- Size: 13.3 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.1.0 CPython/3.10.14
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | 2daa291f5e0c1171e3fc7758872da080091ab3e7f7ed1e34c22e6b735ab7d11d |
| MD5 | 189971574eeae289347049c933161598 |
| BLAKE2b-256 | 23b88f301092a2b8c353adf67280a374ea541ece38024adfbcf919b298df4282 |