A tool to search and retrieve relevant documents from a knowledge base using BERT/MiniLM embedding or a custom search implementation, then generate human-readable answers using OpenAI API.
Project description
Knowledge Base Search
This project provides an efficient and scalable solution to search and query a large knowledge base of documents. It allows users to search for information easily by leveraging advanced NLP techniques like BERT embeddings.
Features
- Organized code structure following SOLID principles
- BERT search for semantic similarity between queries and documents
- Preprocessing using SpaCy for efficient text processing
- Caching system to store preprocessed data and search algorithm instances for faster subsequent searches
- Logging to track search-related information and potential issues
Methodology
The Knowledge Base Search tool employs a two-step process to find relevant documents and generate human-readable answers:
-
Semantic Search: The tool preprocesses and indexes the input documents using advanced NLP techniques like BERT/MiniLM embeddings or a custom search implementation. These embeddings capture the semantic meaning of the text, allowing the search algorithm to find documents that are not just textually similar, but also semantically related to the input query. This approach ensures a more accurate and context-aware selection of relevant documents.
-
Answer Generation: After retrieving the most relevant documents, the tool integrates with OpenAI's Chat GPT API to generate human-readable answers based on the provided context. By only sending the relevant context, we can reduce the cost and improve the performance of the API calls, while ensuring that the generated answers are accurate and contextually appropriate.
This methodology is designed to be easily extensible and customizable, allowing users to implement their own search algorithms or NLP models to tailor the solution to their specific use case.
Installation
To set up the project, follow these steps:
- Clone the repository:
git clone https://github.com/your_username/knowledge_base_search.git
- Change the directory:
cd knowledge_base_search
- Create a virtual environment:
- For Windows:
python -m venv venv
- For Linux/Mac:
python3 -m venv venv
- Activate the virtual environment:
- For Windows:
venv\Scripts\activate
- For Linux/Mac:
source venv/bin/activate
- Install the required packages:
pip install -r requirements.txt
- Create a
.env
file in the root of your project and add the openai_api_key variable. Replace<your_api_key>
with your actual API key:
openai_api_key=<your_api_key>
Usage
-
Add your documents in JSON format to the
data/raw_data/documents.json
file. -
Update the
main.py
file with your query and other necessary modifications. -
Run the
main.py
script:
python main.py
This will load the documents, preprocess them, and index them using the specified search algorithm (e.g., BERT). Then, it will search for relevant documents based on your query and return the top matching results.
Contributing
Contributions are welcome! Please feel free to open issues or submit pull requests to improve the project.
Project details
Release history Release notifications | RSS feed
Download files
Download the file for your platform. If you're not sure which to choose, learn more about installing packages.
Source Distribution
Built Distribution
Hashes for knowledge-base-search-0.1.0.tar.gz
Algorithm | Hash digest | |
---|---|---|
SHA256 | c99981de73ddd15596f17bfcb6d537a3f778470366c610358d9f011f4ac8ea5f |
|
MD5 | 544733a52e4d9d29232e806bbf7dc918 |
|
BLAKE2b-256 | 3e89c3ed744622b608d9528aeae580fa51cce9684b72b5868ee732c00037d00b |
Hashes for knowledge_base_search-0.1.0-py3-none-any.whl
Algorithm | Hash digest | |
---|---|---|
SHA256 | cc63fd372c2e4cb6192f787457ee2452f1d9dfc03a5fe4f7ddb6091cf7fb3fdd |
|
MD5 | 3193e9b0c03f3a8b6569a94ff6bd33cb |
|
BLAKE2b-256 | 47c3643f690bc463c3e3092d25bd893507c5d5f13a0f8e178e5c4aeb4ad70cf2 |