An LLM application to safeguard the consistency of documents in a knowledge base
Welcome to the KnowledgeBase Guardian, an LLM-powered solution to keep your knowledge base consistent and free of contradictions! How, you ask? Well, every time you want to add new information to your knowledge base, the Guardian will check that it does not conflict with information that is already contained in there. To grasp the general idea, feel free to have a look at the accompanying notebook.
Regardless of the purpose of your knowledge base, maintaining consistency is highly desirable. However, in a large and constantly evolving knowledge base, this can prove to be a challenging task.
At Dataroots, we're developing an LLM-powered Q&A system, with our internal documents serving as the knowledge base. To us, maintaining consistency brings a range of benefits:
- Enhanced user trust: With conflicting information eliminated, users can rely on the knowledge base with confidence, leading to a positive user experience.
- Improved answer quality: By eliminating conflicting information, we can enhance the accuracy and reliability of the generated answers.
- Simplified maintenance: By automating conflict detection as much as possible, we reduce the manual effort required to maintain the knowledge base.
Of course, these benefits are not restricted to LLM-powered Q&A systems. If you're interested in keeping your knowledge base consistent as well, make sure to keep reading.
Keep in mind that this repo currently acts as a Proof of Concept and not as a full-fledged knowledge base management system.
💡 How it works
Our use case can be visualized as follows:
- We have an initial vector store, which contains embeddings of all documents in our knowledge base. To keep the example simple, we assume it is initially free of contradictions.
- We want to add new documents to this vector store, but are unsure if these documents are consistent with the information in the vector store.
- Before adding a document, we first retrieve the most semantically similar documents in the vector store. We then use an LLM to compare the documents and search for contradictions:
- If no contradiction is detected, the document is added to the vector store.
- If a contradiction is detected, the document is not added and we keep a log of the failed attempt.
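The steps above can be sketched in a few lines of Python. This is a minimal, self-contained illustration and not the code in this repository: the bag-of-words `embed` function stands in for a real embedding model, `toy_contradicts` is a hypothetical stand-in for the LLM call, and all function names and sample sentences are made up for the example.

```python
import math
import re
from collections import Counter
from typing import Callable, List


def embed(text: str) -> Counter:
    """Toy bag-of-words 'embedding' (assumption: stands in for a real embedding model)."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse word-count vectors."""
    dot = sum(a[token] * b[token] for token in a)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def retrieve(store: List[str], query: str, k: int = 2) -> List[str]:
    """Return the k documents in the store that are most similar to the query."""
    query_vec = embed(query)
    ranked = sorted(store, key=lambda doc: cosine(embed(doc), query_vec), reverse=True)
    return ranked[:k]


def toy_contradicts(new_doc: str, similar_docs: List[str]) -> bool:
    """Hypothetical stand-in for the LLM check: same first word, different numbers."""
    def subject_and_numbers(text: str):
        words = text.split()
        subject = words[0].lower() if words else ""
        return subject, set(re.findall(r"\d+", text))

    subject, numbers = subject_and_numbers(new_doc)
    for doc in similar_docs:
        other_subject, other_numbers = subject_and_numbers(doc)
        if subject == other_subject and numbers and other_numbers and numbers != other_numbers:
            return True
    return False


def guarded_add(
    store: List[str],
    new_doc: str,
    contradicts: Callable[[str, List[str]], bool],
    rejected: List[str],
    k: int = 2,
) -> bool:
    """Add new_doc to the store unless it contradicts its nearest neighbours."""
    similar = retrieve(store, new_doc, k)
    if contradicts(new_doc, similar):
        rejected.append(new_doc)  # keep a log of the failed attempt
        return False
    store.append(new_doc)
    return True


# Made-up toy data, only for illustration.
store = ["Alphaville has 1000 inhabitants.", "Betatown lies on a river."]
rejected: List[str] = []
print(guarded_add(store, "Alphaville has 5000 inhabitants.", toy_contradicts, rejected))
print(guarded_add(store, "Alphaville hosts a large library.", toy_contradicts, rejected))
```

In a real setup, the `contradicts` callable would send the new document and the retrieved documents to an LLM; keeping it as a parameter makes the flow easy to test without any API calls.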
Prerequisites
- OS: Linux or macOS
- Python: 3.9 or higher
- An OpenAI account, or Azure OpenAI with a deployed embedding model and LLM
⚙️ Setup
- Open a terminal and clone this repository by running
git clone https://github.com/datarootsio/knowledgebase_guardian.git
- Go to the cloned folder and create a virtual environment. Choose your favorite one or use venv:
python -m venv contradiction_detection
source contradiction_detection/bin/activate
- Install the dependencies:
python -m pip install -e .
⚡️ Quickstart
We provide a small demo example to get started right away. If you prefer to play around with your own data, you can jump ahead to the next section.
In data/vectorstore you'll find an index and vector store file called belgium. It was created from three articles about Belgian cities, which we scraped from Wikipedia and which you can find in data/raw. In our example, we'll try to add three new documents to our vector store. You can find these in data/extension.
To follow along with the example, follow the setup section above and execute the following steps:
- Make sure you have an OpenAI account. Look up your OpenAI API key and add it to the .env file.
- In your terminal, run
python scripts/extend_vectorstore.py
This will result in the following three outputs:
- A new index and vector store file called belgium_extended, located in the data/vectorstore folder.
- A contradictions.log file, indicating for which new files a contradiction was detected. If all went well, you should see that the document Leuven_contradictions was not added to the vector store.
- An execution.log file providing information about the run. Here you'll also find logs for new documents that were added successfully. Assuming all went well, you'll see that the documents Leuven_aligned and Leuven_new were successfully added to the vector store. The first document contains only information that is already present in the vector store, while the second introduces new information that does not conflict with any of the information contained in the vector store.
⚒️ Setting up KnowledgeBase Guardian with your own data
Make sure to first execute the steps of the setup section above.
Choose OpenAI or Azure OpenAI
- For Azure OpenAI, complete the .env.cloud file. For OpenAI, complete the .env file.
- Set the azure_openai variable in config.yml to true if you use Azure OpenAI; otherwise set it to false.
Initializing your vector store
A) You already have a FAISS vector store
Place the index file and the actual vector store file in the data/vectorstore folder. Make sure that:
- Both files have the same name
- The index file has extension .index
- The vector store file has extension .pkl
Now head over to config.yml and change the vectorstore_name parameter to the name of your vector store.
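Before running any script, it can help to verify that both files are in place. The following is a small sketch based only on the naming rules above; the helper name `missing_vectorstore_files` is hypothetical and not part of this repository.

```python
from pathlib import Path
from typing import List


def missing_vectorstore_files(name: str, folder: str = "data/vectorstore") -> List[Path]:
    """Return the expected files for vector store `name` that are absent from `folder`."""
    base = Path(folder)
    # Both files share the same name and differ only in their extension.
    expected = [base / f"{name}.index", base / f"{name}.pkl"]
    return [path for path in expected if not path.is_file()]


missing = missing_vectorstore_files("belgium")
if missing:
    print("Missing files:", ", ".join(str(path) for path in missing))
else:
    print("Vector store files found.")
```

The value passed as `name` should match the vectorstore_name parameter in config.yml.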
B) You don't have a FAISS vector store
- Place all your .txt data in the data/raw folder.
- Head over to config.yml and change the vectorstore_name parameter to the desired name for your vector store.
- Optional: change the chunk_size and chunk_overlap parameters.
- Create a vector store and index file with the chosen name in the data/vectorstore folder by running
python scripts/create_vectorstore.py
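To give an intuition for the chunk_size and chunk_overlap parameters, here is a simplified character-based splitter. It is only a sketch under the assumption that chunk_size is the maximum chunk length and chunk_overlap is the number of characters shared by consecutive chunks; the splitter actually used by the scripts may behave differently (for example, by splitting on separators), and the function name `split_text` is chosen for illustration.

```python
from typing import List


def split_text(text: str, chunk_size: int, chunk_overlap: int) -> List[str]:
    """Split text into chunks of at most chunk_size characters with a fixed overlap."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be in [0, chunk_size)")

    step = chunk_size - chunk_overlap  # how far the window moves each time
    chunks: List[str] = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the last window already reached the end of the text
    return chunks


print(split_text("abcdefghij", chunk_size=4, chunk_overlap=2))
# ['abcd', 'cdef', 'efgh', 'ghij']
```

A larger overlap reduces the chance that a statement is cut in half at a chunk boundary, at the cost of more chunks to embed and compare.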
Extending your vector store and detecting contradictions
Now we want to add new documents to the vector store, but only if they do not contradict the information that is already contained in the vector store.
- Place the .txt files to be added in the data/extension folder.
- Optional: change the chunk_size, chunk_overlap, nb_retrieval_docs, system_message and user_message parameters in the config.yml file.
- Start the contradiction detection and vector store extension with the following command. To bypass the contradiction detection mechanism, add --disable-contradiction-detection.
python scripts/extend_vectorstore.py
The output is threefold:
- A new index and vector store file in the data/vectorstore folder, recognizable by the presence of _extended in their name.
- A contradictions.log file, indicating for which new files a contradiction was detected. For debugging purposes, it also displays the output of the LLM and the content of the most similar documents that were retrieved.
- An execution.log file providing information about the run. Here you'll also find logs for new documents that were added successfully.
🧐 Limitations
- The performance of this technique is highly dependent on the prompt. You will likely need to tune the prompt (i.e., the system_message and user_message in config.yml) to your use case.
- There is no consistent handling of all chunks in a document. This means that if your document is split into multiple chunks and some of them contain contradictions while others don't, some chunks will be added to the vector store and others will not. Depending on your use case, you might want to change this behaviour.
- To keep the example as small as possible, we chose to support only
  - one vector store type (FAISS)
  - one file extension (.txt)
  Extending this code to other vector stores and file extensions is possible by leveraging LangChain or LlamaIndex.
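Regarding the second limitation, one possible way to handle all chunks of a document consistently is an all-or-nothing rule: check every chunk first and only add the document if none of them is flagged. The sketch below illustrates that idea; it is not how this repository currently behaves, and the function name and the `chunk_contradicts` callable (a placeholder for the per-chunk LLM check) are assumptions.

```python
from typing import Callable, List


def add_document_atomically(
    store: List[str],
    chunks: List[str],
    chunk_contradicts: Callable[[str], bool],
) -> bool:
    """Add all chunks of a document to the store, or none of them."""
    # Check every chunk before touching the store.
    if any(chunk_contradicts(chunk) for chunk in chunks):
        return False
    store.extend(chunks)
    return True


# Toy check for illustration: a chunk is "contradicting" if it contains the word CONFLICT.
store: List[str] = []
flagged = lambda chunk: "CONFLICT" in chunk
print(add_document_atomically(store, ["part one", "part two CONFLICT"], flagged))
print(add_document_atomically(store, ["part one", "part two"], flagged))
```

The trade-off is that a single flagged chunk blocks the whole document, which may or may not be what you want for long documents.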
License
This project is licensed under the terms of the MIT license.