
GPT Cache

English | 中文

🤠 What is GPT Cache?

Large Language Models (LLMs) are a promising and transformative technology that has rapidly advanced in recent years. These models are capable of generating natural language text and have numerous applications, including chatbots, language translation, and creative writing. However, as the size of these models increases, so do the costs and performance requirements needed to utilize them effectively. This has led to significant challenges in developing on top of large models such as ChatGPT.

To address this issue, we have developed GPT Cache, a project that focuses on caching responses from language models, also known as a semantic cache. The system offers two major benefits:

  1. Quick response to user requests: the caching system provides faster response times compared to large model inference, resulting in lower latency and faster response to user requests.
  2. Reduced service costs: most LLM services are currently charged based on the number of tokens. If user requests hit the cache, it can reduce the number of requests and lower service costs.

If you find this idea helpful, please consider giving me a star 🌟, as it helps me as well.

🤔 Why would GPT Cache be helpful?

I believe it is helpful for the following reasons:

  • Locality is present everywhere. Like traditional application systems, AIGC applications also face similar hot topics. For instance, ChatGPT itself may be a popular topic among programmers.
  • For purpose-built SaaS services, users tend to ask questions within a specific domain, with both temporal and spatial locality.
  • By utilizing vector similarity search, it is possible to find a similarity relationship between questions and answers at a relatively low cost.

We also provide benchmarks to illustrate the concept. In semantic caching, there are three key measurement dimensions: false positives, false negatives, and hit latency. With the plugin-style implementation, users can easily trade off these three measurements according to their needs.

😊 Quick Start

Note:

  • You can try GPT Cache quickly, but keep in mind that the repo is under heavy development.
  • By default, only a few libraries are installed. When you need to use additional features, related libraries will be automatically installed.
  • If you have trouble installing a library due to a low pip version, run: python -m pip install --upgrade pip

pip install

pip install gptcache

dev install

# clone gpt cache repo
git clone https://github.com/zilliztech/gpt-cache
cd gpt-cache

# install the repo
pip install -r requirements.txt
python setup.py install

quick usage

If you just want precise-match caching of requests, that is, two identical requests return the same cached answer, you ONLY need TWO steps to enable this cache:

  1. Cache init
from gptcache.core import cache

cache.init()
# If you use `openai.api_key = xxx` to set the API key, use `cache.set_openai_key()` instead.
# It reads the `OPENAI_API_KEY` environment variable and sets the key, keeping the key out of your code.
cache.set_openai_key()
  2. Replace the original openai package
from gptcache.view import openai

# openai requests DON'T need ANY changes
answer = openai.ChatCompletion.create(
    model="gpt-3.5-turbo",
    messages=[
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "foo"}
    ],
)
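
The returned object has the same shape as a normal OpenAI response, so reading the reply works as usual; for example, the following prints the answer (field names follow the OpenAI ChatCompletion response format):

print(answer["choices"][0]["message"]["content"])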

If you want to try the vector similarity search cache locally, you can use the Sqlite + Faiss + Towhee example.
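
For reference, the cache initialization for that setup looks roughly like the sketch below. The module paths and helper names used here (get_si_data_manager, Towhee, pair_evaluation) are illustrative and may not match your installed version exactly; check the Sqlite + Faiss + Towhee example in the repo for the authoritative form.

from gptcache.core import cache
from gptcache.cache.factory import get_si_data_manager
from gptcache.similarity_evaluation.simple import pair_evaluation
from gptcache.embedding.towhee import Towhee

# NOTE: a sketch based on the repo's Sqlite + Faiss + Towhee example;
# names may differ in your installed version.
towhee = Towhee()
data_manager = get_si_data_manager("sqlite", "faiss", dimension=towhee.dimension())

cache.init(
    embedding_func=towhee.to_embeddings,  # turn each question into a vector
    data_manager=data_manager,            # Sqlite for scalar data, Faiss for vectors
    evaluation_func=pair_evaluation,      # score cached answers by vector distance
)
cache.set_openai_key()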

More Docs:

🤗 Modules Overview

(GPTCache structure diagram)

  • Pre-processing: extract the key information from the request:
    • Obtain the last message from the request using pre_embedding.py#last_content
    • Obtain the session context (TODO)
  • Embed the text into a vector for similarity search:
    • Use towhee with the paraphrase-albert-small-v2 model for English and uer/albert-base-chinese-cluecorpussmall for Chinese.
    • Use the OpenAI embedding API.
    • Keep the text as a string without any changes.
    • Use the cohere embedding API.
    • Support Hugging Face embedding API.
  • Cache data manager, which handles searching, saving, and evicting data. Additional storage support will be added in the future, and contributions are welcome.
  • Evaluate similarity by judging the quality of cached answers:
    • Use the search distance, as described in simple.py#pair_evaluation.
    • Towhee uses the albert_duplicate model for precise question-to-question comparison; it supports only 512 tokens.
    • For string comparison, judge the cached request against the new request by exact character match.
    • For numpy arrays, use linalg.norm.
  • Post-processing: determine how to return multiple cached answers to the user (a rough sketch of custom plug-in hooks follows this list):
    • Choose the most similar answer.
    • Choose randomly.
    • Other ranking policies.
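
Because every stage above is a plug-in function, you can supply your own. Below is a minimal sketch of what custom evaluation and post-processing hooks could look like; the function names and dictionary keys are illustrative assumptions, not the library's exact API.

import numpy as np

# Hypothetical evaluation hook: score a cached answer by the L2 distance
# between the embedding of the new request and the cached request
# (smaller distance means a better match).
def my_evaluation(src_dict, cache_dict, **kwargs):
    src = np.asarray(src_dict["embedding"])
    cached = np.asarray(cache_dict["embedding"])
    return float(np.linalg.norm(src - cached))

# Hypothetical post-processing hook: return the first (most similar)
# answer from the list of cached candidates.
def first_answer(answers):
    return answers[0]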

😆 Contributing

Would you like to contribute to the development of GPT Cache? Take a look at our contribution guidelines.

🙏 Thanks

Thanks to my colleagues at Zilliz for their inspiration and technical support.
