Semantic Synth
Synthetic dataset generator for testing semantic search quality
So you've decided to build a RAG (Retrieval-Augmented Generation) tool. Great! Now you want to build a vector search index for a critical production task.
Before you push to production, though, you'd like to be able to test how well the search index itself is working.
We have also released a 520k-sample dataset for running these tests, available on the Hugging Face Hub.
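As a minimal sketch, the dataset can be pulled with the datasets library; the identifier below is a placeholder, so substitute the actual id from the Hugging Face Hub page.

from datasets import load_dataset

# Hypothetical dataset id -- replace with the real one from the Hub page
dataset = load_dataset("<org>/<dataset-name>")
print(dataset)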
Current methods for testing
Having reviewed multiple platforms that assess semantic search correctness, we find the prevailing approach is to take a passage from the dataset, have an LLM generate questions about it, have the LLM answer those questions, and then compute metrics such as faithfulness and correctness.
This is an expensive operation owing to the multiple LLM calls, and it is not well suited to very large-scale data testing or continuous monitoring.
How does this package work?
Given any text, we generate keywords using the YAKE library. This makes the test self-supervised, with no need for expensive LLM calls to produce synthetic data.
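As an illustration of the underlying idea (using YAKE directly, not this package's API), keyword extraction looks roughly like this:

import yake

text = "Semantic search retrieves passages by comparing query and document embeddings."

# Extract up to 5 keyphrases of at most 3 tokens each;
# lower YAKE scores indicate more relevant phrases.
extractor = yake.KeywordExtractor(lan="en", n=3, top=5)
for phrase, score in extractor.extract_keywords(text):
    print(phrase, score)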
Philosophy
Semantic search works by matching latent meaning between queries and the vectors in a search index. It should therefore be highly effective at finding the passages that contain a document's key phrases.
However, there are multiple points where information can be lost. First is the chunking strategy. Next comes your vector embedding model. Finally, there is the capability of the vector index itself (these indices are often built on approximate nearest-neighbour algorithms and can be quite lossy).
For academic purposes, premade benchmark datasets are good enough, but how do you run these estimates on your own data? That is where Semantic Synth comes in.
If a vector search index works well, it will be able to find phrases from your passages effectively. If not, there may be more to look at, such as your chunking strategy or the embedding model itself.
Who is this package for?
This package is specifically for testing the retrieval quality of your vector search index by computing statistical metrics such as precision, recall, and F1 score.
Currently we support generating synthetic search terms for your content so that you can run searches and calculate accuracy metrics yourself. We are working on adding a full-fledged testing suite; a sketch of what such an evaluation might look like follows.
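In this sketch, each synthetic search term is issued as a query, and a "hit" means the index returns the passage the term was extracted from. The index.search call is a hypothetical stand-in for whatever vector store API you use.

def hit_rate_at_k(index, pairs, k=5):
    """pairs: list of (search_term, source_passage_id) tuples."""
    hits = 0
    for term, passage_id in pairs:
        # Assumed vector-store API returning the ids of the top-k passages
        results = index.search(term, top_k=k)
        if passage_id in results:
            hits += 1
    return hits / len(pairs)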
Why did we build this?
This package was built to perform research on how different chunking strategies affect vector search accuracy.
Usage
WARNING: Please note that this is an alpha release and is suitable only for testing, not for production.
Installation
pip install semantic-synth
Code
from semantic_synth.datagen import KeywordDatasetGenerator

text = """
<Insert text here>
"""

gen = KeywordDatasetGenerator()

# To get a single text response
print(gen.generate(content=text))

# For a dataset as a DataFrame
content = [
    '<Insert text1 here>',
    '<Insert text2 here>',
    '<Insert text3 here>'
]

print(gen.generate_as_df(content=content))
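Each generated search term can then be issued as a query against your index, with the passage it was extracted from serving as the expected result, as in the evaluation sketch above.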