A component to manage referenced documents with an LLM.
Project description
Langchain-Reference
"Ask a question, get a comprehensive answer and directly access the sources used to develop that answer."
This is a difficult goal to achieve. Now, it's no longer a problem!
Introduction
When publishing the LLM's response, it can be helpful to include links to the documents used to produce the answer. This way, the user can learn more or verify that the response is accurate.
The list of documents is retrieved from the vector database. Each fragment carries metadata that allows for precise identification of its origin (the URL, the page, the title, the position of the first character, etc.).
To be clearer, let's position ourselves in a typical scenario. A question is posed in a prompt, which is then processed with five documents:
- `a.html#chap1`: The fragment relates to Chapter 1 of the `a.html` file.
- `a.html#chap2`: The fragment relates to Chapter 2 of the `a.html` file.
- `b.pdf`: The fragment comes from the `b.pdf` file.
- `b.pdf`: Another fragment also comes from the `b.pdf` file.
- `c.csv`: The fragment is a line from the `c.csv` file.
Given the scenario where the LLM uses multiple fragments from different documents to generate a response, the references should be formatted as footnotes, reflecting the different sources. For example, the LLM answers several questions using the referenced fragments:
Yes[1](id=3), certainly[2](id=2), no[3](id=4), yes[4](id=1)
In this situation, the first four fragments are used, but not the last one. The first two have different URLs, even though they come from the same document. The next two share the same URL but refer to different fragments.
The naive approach is to list all the injected documents after the response and, if possible, extract a specific title for each fragment.
Yes, certainly, no, yes
- [a chap1](a.html#chap1)
- [a chap2](a.html#chap2)
- [b frag1](b.pdf)
- [b frag2](b.pdf)
- [c](c.csv)
Optionally, duplicates can be filtered out.
We observe that the result is not satisfactory. First, the user will be disappointed when opening the file `c.csv` only to find that it doesn't contain any information supporting the response. This file should be excluded from the reference list, since it provides no useful information and was not used by the LLM to generate the answer. There are also two different links leading to the same document, which could leave the user wondering why.
What should be produced is closer to this:
Yes[1], certainly[2], no[3], yes[4]
- [1],[3] [b](b.pdf)
- [2] [a chap2](a.html#chap2)
- [4] [a chap1](a.html#chap1)
We identify fragments sharing the same URL to combine reference numbers and avoid unreferenced documents.
The best solution is to adjust the reference numbers when they share the same URL. This adjustment should be made during the LLM’s response generation to achieve the following:
yes[1], certainly[2], no[1], yes[3]
- [1] [b](b.pdf)
- [2] [a chap2](a.html#chap2)
- [3] [a chap1](a.html#chap1)
Note that the reference numbers have been adjusted. As a result, we have a reference list that resembles what a human would have created.
This complexity cannot reasonably be delegated to the LLM. It would require providing the URLs for each fragment and crafting a prompt in the hope that it would always calculate correctly. Since handling URLs is not its strength, it’s better to relieve it of this responsibility and implement deterministic code capable of consistently performing the necessary calculations and adjustments. In the process, links can be directly embedded in the references.
yes<sup>[[1](b.pdf)]</sup>, certainly<sup>[[2](a.html#chap2)]</sup>,
no<sup>[[1](b.pdf)]</sup>, yes<sup>[[3](a.html#chap1)]</sup>
- [1] [b](b.pdf)
- [2] [a chap2](a.html#chap2)
- [3] [a chap1](a.html#chap1)
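The deterministic renumbering described above can be sketched in plain Python, independent of the library (the function and variable names below are hypothetical):

```python
# A minimal sketch of the renumbering: citations arrive as fragment ids,
# and fragments sharing a URL are merged under one reference number.
def renumber_references(citations, fragment_urls):
    """Map raw citations (fragment ids, in order of appearance) to
    final reference numbers, merging fragments that share a URL."""
    url_to_ref = {}  # URL -> final reference number
    numbered = []    # final number for each citation, in order
    for frag_id in citations:
        url = fragment_urls[frag_id]
        if url not in url_to_ref:
            url_to_ref[url] = len(url_to_ref) + 1
        numbered.append(url_to_ref[url])
    footnotes = [(ref, url) for url, ref in url_to_ref.items()]
    return numbered, footnotes

# The scenario from the text: fragments 3 and 4 both come from b.pdf.
urls = {1: "a.html#chap1", 2: "a.html#chap2", 3: "b.pdf", 4: "b.pdf"}
nums, notes = renumber_references([3, 2, 4, 1], urls)
print(nums)   # [1, 2, 1, 3]
print(notes)  # [(1, 'b.pdf'), (2, 'a.html#chap2'), (3, 'a.html#chap1')]
```

The result matches the adjusted footnotes shown above: the two `b.pdf` fragments collapse into reference [1], and the remaining sources are numbered in order of first use.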
Usage
You can't ask too much of an LLM: imperative code is often the best solution. To manage document references correctly, we separate responsibilities into two parts. The first is not too complex for the LLM: emit a reference number, followed by the identifier of the fragment from which the answer is extracted. The second part is the responsibility of deterministic Python code: adjusting reference numbers when there are duplicates, injecting URLs to the original documents if necessary, and then concluding the response with the list of all references.
Having the list of documents that were injected into the prompt, it is possible to add an identifier (the position of each document in the list), so that the LLM can respond with the unique number of the injected document. In this way, each original document can be retrieved and its metadata used to build a URL, for example.
The following prompt asks the LLM to handle references simply, in the form `[<reference_number>](id=<fragment_position>)`.
from langchain_references import FORMAT_REFERENCES
print(f"{FORMAT_REFERENCES=}")
FORMAT_REFERENCES='When referencing the documents, add a citation right after. Use "[NUMBER](id=ID_NUMBER)" for the citation (e.g. "The Space Needle is in Seattle [1](id=55)[2](id=12).").'
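This citation format is easy to locate mechanically. A sketch of how the `[NUMBER](id=ID_NUMBER)` citations can be found in a completed response (the real library does this incrementally, token by token):

```python
import re

# Matches citations of the form "[NUMBER](id=ID_NUMBER)".
CITATION = re.compile(r"\[(\d+)\]\(id=(\d+)\)")

text = "The Space Needle is in Seattle [1](id=55)[2](id=12)."
print(CITATION.findall(text))  # [('1', '55'), ('2', '12')]
```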
And the prompt:
rag_prompt = ChatPromptTemplate.from_template(
    """
    Here, the context:
    {documents}

    {format_references}

    Question: {question}
    """)
The context must be built up by adding a reference to each document.
def format_docs(docs):
    return "\n".join(
        # Add a document id so that the LLM can reference it
        [f"<document id={i + 1}>\n{doc.page_content}\n</document>\n"
         for i, doc in enumerate(docs)]
    )
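The helper above can be sanity-checked with minimal stand-ins for LangChain documents (the `Doc` class below is a hypothetical stand-in; real code passes `Document` objects from the vector store):

```python
# Same helper as above, repeated here so the example is self-contained.
def format_docs(docs):
    return "\n".join(
        [f"<document id={i + 1}>\n{doc.page_content}\n</document>\n"
         for i, doc in enumerate(docs)]
    )

# Hypothetical stand-in for a LangChain Document.
class Doc:
    def __init__(self, page_content):
        self.page_content = page_content

print(format_docs([Doc("Chapter 1 ..."), Doc("Chapter 2 ...")]))
# <document id=1>
# Chapter 1 ...
# </document>
#
# <document id=2>
# Chapter 2 ...
# </document>
```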
Then, langchain-references modifies the tokens produced by the LLM. Encapsulate the invocation of the model with manage_references() to adjust the reference numbers and inject the URLs of the original documents.
from langchain_references import manage_references

chain = manage_references(
    context
    | rag_prompt
    | model,
) | StrOutputParser()
Now, invoke the chain with the documents and the question.
question = "What are the approaches to Task Decomposition?"
docs = vectorstore.similarity_search(question)
# Run
print(chain.invoke({"documents": docs, "question": question}))
The raw response from the LLM will be:
The difference subject of mathematics can refer to various areas within the field,
such as number theory, algebra, geometry, analysis, and set theory. Each area
focuses on different concepts, methods, and theorems relevant to both mathematics
and empirical sciences [1](id=1). Additionally, mathematical games and puzzles
highlight the distinction in engagement and required expertise within the mathematical
domain [3](id=3).
After processing the references, the response will be:
Pure mathematics focuses on abstract concepts and theoretical frameworks,
independent of practical applications, while applied mathematics is concerned
with mathematical methods that can be used in real-world situations. Pure
mathematics often explores fundamental truths and properties, whereas applied
mathematics is developed in correlation with specific applications in fields
like science and engineering <sup>[[1](https://en.wikipedia.org/wiki/Mathematics)]</sup>
<sup>[[2](https://en.wikipedia.org/wiki/Mathematical_game)]</sup>.
- **1** [Mathematics](https://en.wikipedia.org/wiki/Mathematics)
- **2** [Mathematical game](https://en.wikipedia.org/wiki/Mathematical_game)
Style
Different styles can be used to display the references. The default style is Markdown, but you can use:
- `EmptyReferenceStyle`: no references are produced
- `TextReferenceStyle`: for console output
- `MarkdownReferenceStyle`: for Markdown output
- `HTMLReferenceStyle`: for HTML output
You can adjust the style to suit the specific requirements of your documents.
from typing import List, Tuple

from langchain_core.documents.base import BaseMedia
from langchain_references import ReferenceStyle

class MyReferenceStyle(ReferenceStyle):
    # Build a unique source id from the fragment metadata
    source_id_key = lambda media: \
        f'{media.metadata["source"]}#{media.metadata["row"]}'

    def format_reference(self, ref: int, media: BaseMedia) -> str:
        return f" (See {media.metadata['title']})"

    def format_all_references(self, refs: List[Tuple[int, BaseMedia]]) -> str:
        if not refs:
            return ""
        result = []
        for ref, media in refs:
            # Unbind the class-level lambda to call it with the media alone
            source = self.source_id_key.__func__(media)
            result.append(f"- [{ref}] {source}\n")
        return "\n\n" + "".join(result)
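To see what such a style produces, the three formatting hooks can be exercised in isolation. The sketch below is self-contained (a plain `Media` class stands in for `BaseMedia`, and the free functions mirror the methods of the custom style above; all names here are hypothetical):

```python
# Plain stand-in for langchain_core's BaseMedia: an object with metadata.
class Media:
    def __init__(self, metadata):
        self.metadata = metadata

def source_id(media):
    # Unique source id built from the fragment metadata
    return f'{media.metadata["source"]}#{media.metadata["row"]}'

def format_reference(ref, media):
    # Inline citation, rendered right after the referenced text
    return f" (See {media.metadata['title']})"

def format_all_references(refs):
    # Trailing list of all references used in the response
    if not refs:
        return ""
    lines = [f"- [{ref}] {source_id(media)}\n" for ref, media in refs]
    return "\n\n" + "".join(lines)

m = Media({"source": "c.csv", "row": 3, "title": "C data"})
print(format_reference(1, m))            # " (See C data)"
print(format_all_references([(1, m)]))   # "\n\n- [1] c.csv#3\n"
```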
How does it work?
On the fly, each token is captured to identify the reference pattern. As soon as the beginning of the text seems to match, tokens are accumulated until a reference is identified or the capture is abandoned as a false positive. The accumulated tokens are then emitted, and the analysis resumes.
As soon as a reference is identified, it is assigned a number based on the documents present. Then format_reference() is invoked.
When there are no more tokens, the list of documents used for the response is constructed and appended as the final fragment, via format_all_references().
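The capture loop can be sketched in plain Python, independent of the library. This is a simplified version (a real implementation abandons dead-end prefixes more eagerly and rewrites references on the fly; the function names are hypothetical):

```python
import re

# A complete "[N](id=M)" citation at the start of the buffer.
CITATION = re.compile(r"\[\d+\]\(id=\d+\)")
# Characters a citation may contain (crude viability check for a sketch).
ALLOWED = set("[]()id=0123456789")

def capture_references(tokens):
    """Split a token stream into ('text', ...) and ('ref', ...) parts.
    Buffering starts at '[' and ends when a citation completes or the
    buffer can no longer become one (a false positive is released)."""
    out, buf = [], ""
    for tok in tokens:
        for ch in tok:
            if buf:
                if CITATION.match(buf + ch) or ch in ALLOWED:
                    buf += ch
                    m = CITATION.match(buf)
                    if m:  # citation completed: emit it as a reference
                        out.append(("ref", m.group(0)))
                        buf = buf[m.end():]
                else:  # false positive: release the buffered text
                    out.append(("text", buf + ch))
                    buf = ""
            elif ch == "[":
                buf = ch  # possible start of a citation
            else:
                out.append(("text", ch))
    if buf:  # flush whatever remains at end of stream
        out.append(("text", buf))
    return out

parts = capture_references(["Yes ", "[1](", "id=3)", " and [2", "](id=1)."])
print([p for kind, p in parts if kind == "ref"])
# ['[1](id=3)', '[2](id=1)']
```

Note how the citation split across tokens (`"[1]("` then `"id=3)"`) is still recognized: this is exactly why the library buffers tokens instead of matching each one independently.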