OpenLLM: REST/gRPC API server for running any open Large-Language Model - StableLM, Llama, Alpaca, Dolly, Flan-T5, Custom
OpenLLM
Build, fine-tune, serve, and deploy Large-Language Models including popular ones like StableLM, Llama, Dolly, Flan-T5, Vicuna, or even your custom LLMs.
Powered by BentoML 🍱
📖 Introduction
With OpenLLM, you can easily run inference with any open-source large-language model (LLM) and build production-ready LLM apps, powered by BentoML. Here are some key features:
- 🚂 SOTA LLMs: With a single click, access support for state-of-the-art LLMs, including StableLM, Llama, Alpaca, Dolly, Flan-T5, ChatGLM, Falcon, and more.
- 🔥 Easy-to-use APIs: We provide intuitive interfaces by integrating with popular tools like BentoML, HuggingFace, LangChain, and more.
- Fine-tuning your own LLM: Customize any LLM to suit your needs with LLM.tuning(). (Work in progress.)
- ⛓️ Interoperability: First-class support for LangChain and BentoML's runner architecture allows easy chaining of LLMs across multiple GPUs and nodes (see the sketch after this list).
- 🎯 Streamline Production Deployment: Seamlessly package into a Bento with openllm build, containerize into OCI images, and deploy with a single click using ☁️ BentoCloud.
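As a taste of the runner interoperability mentioned above, here is a minimal sketch of embedding an LLM in a custom BentoML service. It assumes openllm.Runner wraps a model as a standard BentoML runner exposing a generate method; treat the exact names as illustrative rather than definitive:

import bentoml
import openllm
from bentoml.io import Text

# Assumed API: openllm.Runner wraps a model as a standard BentoML runner
# that can be scheduled across GPUs/nodes like any other runner.
llm_runner = openllm.Runner("dolly-v2")

svc = bentoml.Service(name="llm-dolly-service", runners=[llm_runner])

@svc.api(input=Text(), output=Text())
async def prompt(input_text: str) -> str:
    # async_run is BentoML's standard asynchronous runner invocation.
    answer = await llm_runner.generate.async_run(input_text)
    return str(answer)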
🏃 Getting Started
To use OpenLLM, you need to have Python 3.8 (or newer) and pip installed on your system. We highly recommend using a virtual environment to prevent package conflicts.
You can install OpenLLM using pip as follows:
pip install openllm
To verify that it is installed correctly, run:
openllm -h
If the installation succeeded, you should see output like:
Usage: openllm [OPTIONS] COMMAND [ARGS]...
 ██████╗ ██████╗ ███████╗███╗   ██╗██╗     ██╗     ███╗   ███╗
██╔═══██╗██╔══██╗██╔════╝████╗  ██║██║     ██║     ████╗ ████║
██║   ██║██████╔╝█████╗  ██╔██╗ ██║██║     ██║     ██╔████╔██║
██║   ██║██╔═══╝ ██╔══╝  ██║╚██╗██║██║     ██║     ██║╚██╔╝██║
╚██████╔╝██║     ███████╗██║ ╚████║███████╗███████╗██║ ╚═╝ ██║
 ╚═════╝ ╚═╝     ╚══════╝╚═╝  ╚═══╝╚══════╝╚══════╝╚═╝     ╚═╝
OpenLLM: Your one stop-and-go-solution for serving any Open Large-Language Model
- StableLM, Falcon, ChatGLM, Dolly, Flan-T5, and more
- Powered by BentoML 🍱
Starting an LLM Server
To start an LLM server, use openllm start. For example, to start a dolly-v2 server:
openllm start dolly-v2
Following this, a Swagger UI will be accessible at http://0.0.0.0:3000, where you can experiment with the endpoints and sample prompts.
OpenLLM provides a built-in Python client, allowing you to interact with the model. In a different terminal window or a Jupyter notebook, create a client to start interacting with the model:
>>> import openllm
>>> client = openllm.client.HTTPClient('http://localhost:3000')
>>> client.query('Explain to me the difference between "further" and "farther"')
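You can also point LangChain at the same running server. Below is a minimal sketch, assuming the langchain.llms.OpenLLM integration and its server_url/server_type parameters; consult the LangChain docs for the exact interface:

from langchain.llms import OpenLLM

# Assumed parameters: connect LangChain to the server started above.
llm = OpenLLM(server_url="http://localhost:3000", server_type="http")
print(llm("What is the difference between a llama and an alpaca?"))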
🚀 Deploying to Production
To deploy your LLMs into production:
- Build your BentoML service: With OpenLLM, you can easily build a BentoML service for a specific model, like dolly-v2, using the build command:
openllm build dolly-v2
NOTE: If you wish to build OpenLLM from the git source, set OPENLLM_DEV_BUILD=True to include the generated wheels in the bundle.
- Containerize your Bento:
bentoml containerize <name:version>
BentoML offers a comprehensive set of options for deploying and hosting online ML services in production. To learn more, check out the Deploying a Bento guide.
🧩 Models and Dependencies
OpenLLM currently supports the following models: StableLM, Llama, Alpaca, Dolly, Flan-T5, ChatGLM, Falcon, and more.
Model-specific Dependencies
We respect your system's space and efficiency. That's why we don't force users to install dependencies for all models. By default, you can run dolly-v2 and flan-t5 without installing any additional packages.
To enable support for a specific model, you'll need to install its corresponding dependencies. You can do this by using pip install openllm[model_name]. For example, to use chatglm:
pip install openllm[chatglm]
This will additionally install cpm_kernels and sentencepiece.
Runtime Implementations
Different LLMs may have multiple runtime implementations. For instance, they might use PyTorch (pt), TensorFlow (tf), or Flax (flax).
If you wish to specify a particular runtime for a model, you can do so by setting the OPENLLM_{MODEL_NAME}_FRAMEWORK={runtime} environment variable before running openllm start.
For example, if you want to use the TensorFlow (tf) implementation for the flan-t5 model, you can use the following command:
OPENLLM_FLAN_T5_FRAMEWORK=tf openllm start flan-t5
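The variable only needs to be present in the server process's environment, so the same selection also works from a Python script. Here is a minimal sketch that launches the CLI shown above via subprocess:

import os
import subprocess

# Select the TensorFlow implementation for flan-t5 before the server starts.
os.environ["OPENLLM_FLAN_T5_FRAMEWORK"] = "tf"

# Launch the CLI; the child process inherits os.environ.
subprocess.run(["openllm", "start", "flan-t5"], check=True)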
Integrating a New Model
OpenLLM encourages contributions by welcoming users to incorporate their custom LLMs into the ecosystem. Check out the Adding a New Model guide to see how you can do it yourself.
Telemetry
OpenLLM collects usage data to enhance user experience and improve the product. We only report OpenLLM's internal API calls and ensure maximum privacy by excluding sensitive information. We will never collect user code, model data, or stack traces. To see exactly what is tracked, check out the usage-tracking code.
You can opt out of usage tracking with the --do-not-track CLI option:
openllm [command] --do-not-track
Or by setting the OPENLLM_DO_NOT_TRACK=True environment variable:
export OPENLLM_DO_NOT_TRACK=True
Community
Engage with like-minded individuals passionate about LLMs, AI, and more on our Discord!
OpenLLM is actively maintained by the BentoML team. Feel free to reach out and join us in our pursuit to make LLMs more accessible and easy to use. Join our Slack community!
Contributing
We welcome contributions! If you're interested in enhancing OpenLLM's capabilities or have any questions, don't hesitate to reach out in our Discord channel.
Check out our Developer Guide if you wish to contribute to OpenLLM's codebase.