GPT Cache: make your ChatGPT services cheaper and faster
GPT Cache
🤠 What is GPT Cache?
Large Language Models (LLMs) are a promising and transformative technology that has rapidly advanced in recent years. These models are capable of generating natural language text and have numerous applications, including chatbots, language translation, and creative writing. However, as the size of these models increases, so do the costs and performance requirements needed to utilize them effectively. This has led to significant challenges in developing on top of large models such as ChatGPT.
To address this issue, we have developed GPT Cache, a project that focuses on caching responses from language models, also known as a semantic cache. The system offers two major benefits:
- Quick response to user requests: returning a cached answer is faster than running large model inference, which lowers latency.
- Reduced service costs: most LLM services currently charge by the number of tokens. A request that hits the cache is not sent to the LLM service, so fewer requests are billed and service costs drop.
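As a rough illustration of the cost benefit, the sketch below estimates token spend for a given cache hit rate. The function name, the workload size, and the price are all hypothetical numbers chosen for the example, not measurements or real pricing.

```python
def estimated_cost(requests: int, tokens_per_request: int,
                   price_per_1k_tokens: float, hit_rate: float) -> float:
    """Token cost when a fraction `hit_rate` of requests is served from cache."""
    if not 0.0 <= hit_rate <= 1.0:
        raise ValueError("hit_rate must be between 0 and 1")
    # Only cache misses reach the LLM service and get billed.
    billed_requests = requests * (1.0 - hit_rate)
    return billed_requests * tokens_per_request / 1000 * price_per_1k_tokens


# Hypothetical workload: every number here is made up for illustration.
baseline = estimated_cost(10_000, 500, 0.002, hit_rate=0.0)
with_cache = estimated_cost(10_000, 500, 0.002, hit_rate=0.3)
print(f"no cache: {baseline:.2f}, 30% hit rate: {with_cache:.2f}")
```

Under these assumptions the saving is proportional to the hit rate. The sketch ignores the cost of running the cache itself.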
If you find this idea helpful, please consider giving me a star 🌟, as it helps me as well.
🤔 Why would GPT Cache be helpful?
I believe a cache like this is needed for the following reasons:
- Locality is present everywhere. Like traditional application systems, AIGC applications see hot topics that many users ask about. For instance, ChatGPT itself may be a popular topic among programmers.
- For purpose-built SaaS services, users tend to ask questions within a specific domain, with both temporal and spatial locality.
- By utilizing vector similarity search, it is possible to find a similarity relationship between questions and answers at a relatively low cost.
We also provide benchmarks to illustrate the concept. In semantic caching, there are three key measurement dimensions: false positives, false negatives, and hit latency. With the plugin-style implementation, users can easily trade off these three measurements according to their needs.
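To make the trade-off concrete, here is a minimal sketch of a similarity lookup with a threshold, plus a counter for false positives and false negatives on a tiny labeled set. It is not GPT Cache's implementation: the function names, the toy 2-D vectors, and the thresholds are assumptions for illustration, and hit latency is not measured.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity of two equal-length, non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def lookup(query_vec, cached, threshold):
    """Return the cached answer most similar to query_vec, or None on a miss."""
    best_answer, best_score = None, threshold
    for vec, answer in cached:
        score = cosine_similarity(query_vec, vec)
        if score >= best_score:
            best_answer, best_score = answer, score
    return best_answer


def count_errors(queries, cached, threshold):
    """Count (false positives, false negatives) on labeled queries.

    Each query is (vector, expected_answer); expected_answer is None when
    the cache holds no correct answer for that query.
    """
    false_pos = false_neg = 0
    for vec, expected in queries:
        got = lookup(vec, cached, threshold)
        if got is not None and got != expected:
            false_pos += 1  # served a wrong cached answer
        elif got is None and expected is not None:
            false_neg += 1  # missed a valid cached answer
    return false_pos, false_neg


# Toy 2-D "embeddings"; real embeddings have hundreds of dimensions.
cached = [((1.0, 0.0), "answer A"), ((0.0, 1.0), "answer B")]
queries = [
    ((0.9, 0.1), "answer A"),  # close paraphrase of A
    ((0.6, 0.4), "answer A"),  # looser paraphrase of A
    ((0.7, 0.7), None),        # unrelated question
]
strict = count_errors(queries, cached, threshold=0.95)
loose = count_errors(queries, cached, threshold=0.70)
print("strict:", strict, "loose:", loose)
```

On this toy data the strict threshold misses the looser paraphrase (a false negative), while the loose threshold answers the unrelated question from the cache (a false positive).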
😊 Quick Start
Note:
- You can try GPT Cache quickly, but remember that the repo is under heavy development.
- By default, only a few libraries are installed. When you need to use additional features, related libraries will be automatically installed.
- If you have trouble installing a library due to a low pip version, run:
python -m pip install --upgrade pip
pip install
pip install gptcache
dev install
# clone gpt cache repo
git clone https://github.com/zilliztech/gpt-cache
cd gpt-cache
# install the repo
pip install -r requirements.txt
python setup.py install
quick usage
If you only need exact-match caching, that is, a cache hit only when two requests are identical, you need ONLY TWO steps to use the cache:
- Cache init
from gptcache.core import cache
cache.init()
# If you set the API key with `openai.api_key = xxx`, replace that line with `cache.set_openai_key()`.
# It reads the `OPENAI_API_KEY` environment variable, so the key stays out of your source code.
cache.set_openai_key()
- Replace the original openai package
from gptcache.view import openai
# openai requests DON'T need ANY changes
answer = openai.ChatCompletion.create(
model="gpt-3.5-turbo",
messages=[
{"role": "system", "content": "You are a helpful assistant."},
{"role": "user", "content": "foo"}
],
)
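To show what exact-match caching means in practice, here is a minimal offline sketch. It does not use GPT Cache or the OpenAI API: `ExactMatchCache` and `fake_llm` are hypothetical names invented for this example, and the real library may build its keys differently.

```python
import json


class ExactMatchCache:
    """Serve identical requests from memory instead of calling the model."""

    def __init__(self, llm_call):
        self._llm_call = llm_call  # any callable taking (model, messages)
        self._store = {}
        self.hits = 0
        self.misses = 0

    @staticmethod
    def _key(model, messages):
        # Canonical JSON, so equal requests always produce the same key.
        return json.dumps({"model": model, "messages": messages}, sort_keys=True)

    def create(self, model, messages):
        key = self._key(model, messages)
        if key in self._store:
            self.hits += 1
            return self._store[key]
        self.misses += 1
        answer = self._llm_call(model, messages)
        self._store[key] = answer
        return answer


def fake_llm(model, messages):
    """Stand-in for a real model call, so the sketch runs offline."""
    return f"echo: {messages[-1]['content']}"


cached_llm = ExactMatchCache(fake_llm)
request = [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "foo"},
]
first = cached_llm.create("gpt-3.5-turbo", request)   # miss: calls fake_llm
second = cached_llm.create("gpt-3.5-turbo", request)  # hit: served from the dict
print(first, second, cached_llm.hits, cached_llm.misses)
```

Any difference in the request, even a single character, produces a different key and therefore a miss. That limitation is what the similarity-based cache is meant to address.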
If you want to try the vector similarity search cache locally, use the SQLite + FAISS + Towhee example.
More Docs:
- System Design, how it was constructed
- Features, all features currently supported by the cache
- Examples, learn how to customize caching
🤗 Modules Overview
- Pre-processing, extract the key information from the request:
- Obtain the last message from the request using pre_embedding.py#last_content.
- Obtain the session context (TODO).
- Embed the text into a vector for similarity search:
- Use towhee with the paraphrase-albert-small-v2 model for English and uer/albert-base-chinese-cluecorpussmall for Chinese.
- Use the OpenAI embedding API.
- Keep the text as a string without any changes.
- Use the cohere embedding API.
- Use the Hugging Face embedding API.
- Cache data manager, which includes searching, saving, or evicting data. Additional storage support will be added in the future, and contributions are welcome:
- Scalar store:
- Use SQLite.
- Use PostgreSQL.
- Use MySQL.
- Vector store:
- Use Milvus.
- Use Zilliz Cloud.
- Use other vector databases
- Vector index:
- Use FAISS.
- Evaluate similarity by judging the quality of cached answers:
- Use the search distance, as described in simple.py#pair_evaluation.
- Use towhee with the albert_duplicate model for precise question-to-question comparison. It supports at most 512 tokens.
- For string comparison, judge the cached request and the original request by exact character match.
- For numpy arrays, use linalg.norm.
- Post-processing: determine how to return multiple cached answers to the user:
- Choose the most similar answer.
- Choose randomly.
- Other ranking policies
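The modules above can be sketched as one small pipeline. This is an illustrative toy, not the project's code: every name here (`TinyCache`, `to_embedding`, `exact_match_score`, `most_similar`, `random_pick`) is hypothetical, and `last_content` only mirrors the idea of the helper mentioned above. The sketch uses the "keep the text as a string" embedding and exact string comparison, with an in-memory list standing in for the data manager.

```python
import random


def last_content(request):
    """Pre-processing: take the text of the last message in the request."""
    return request["messages"][-1]["content"]


def to_embedding(text):
    """Embedding: the 'keep the text as a string' option, unchanged."""
    return text


def exact_match_score(cached_key, query_key):
    """Similarity evaluation: 1.0 for identical strings, else 0.0."""
    return 1.0 if cached_key == query_key else 0.0


def most_similar(scored_answers):
    """Post-processing: return the answer with the highest score."""
    return max(scored_answers, key=lambda pair: pair[0])[1]


def random_pick(scored_answers):
    """Post-processing: return one accepted answer at random."""
    return random.choice(scored_answers)[1]


class TinyCache:
    """In-memory stand-in for the cache data manager."""

    def __init__(self, evaluate=exact_match_score,
                 post_process=most_similar, threshold=1.0):
        self._rows = []  # list of (key, answer) pairs
        self._evaluate = evaluate
        self._post_process = post_process
        self._threshold = threshold

    def put(self, request, answer):
        self._rows.append((to_embedding(last_content(request)), answer))

    def get(self, request):
        key = to_embedding(last_content(request))
        scored = [(self._evaluate(k, key), a) for k, a in self._rows]
        accepted = [(s, a) for s, a in scored if s >= self._threshold]
        return self._post_process(accepted) if accepted else None


cache = TinyCache()
cache.put({"messages": [{"role": "user", "content": "foo"}]}, "bar")
hit = cache.get({"messages": [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "foo"},
]})
miss = cache.get({"messages": [{"role": "user", "content": "baz"}]})
print(hit, miss)
```

Because each stage is a plain function passed to the cache, swapping in a vector embedding, a distance-based evaluation, or `random_pick` changes one argument rather than the whole pipeline.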
😆 Contributing
Would you like to contribute to the development of GPT Cache? Take a look at our contribution guidelines.
🙏 Thanks
Thanks to my colleagues at Zilliz for their inspiration and technical support.