
A tool for running on-premise large language models on non-public data


OnPrem

OnPrem is a simple Python package that makes it easier to run large language models (LLMs) on non-public or sensitive data and on machines with no internet connectivity (e.g., behind corporate firewalls). Inspired by the privateGPT and localGPT GitHub repos, OnPrem is intended to make it easier to integrate local LLMs into practical applications.

Install

pip install onprem

For GPU support, see additional instructions below.

How to use

Setup

import os.path
from onprem import LLM

url = 'https://huggingface.co/TheBloke/Wizard-Vicuna-7B-Uncensored-GGML/resolve/main/Wizard-Vicuna-7B-Uncensored.ggmlv3.q4_0.bin'

llm = LLM(model_name=os.path.basename(url))
llm.download_model(url, ssl_verify=True)  # set to False if corporate firewall gives you problems
There is already a file Wizard-Vicuna-7B-Uncensored.ggmlv3.q4_0.bin in /home/amaiya/onprem_data. Do you want to still download it? (Y/n) Y
[██████████████████████████████████████████████████]

Send Prompts to the LLM

prompt = """Extract the names of people in the supplied sentences. Here is an example:
Sentence: James Gandolfini and Paul Newman were great actors.
People:
James Gandolfini, Paul Newman
Sentence:
I like Cillian Murphy's acting. Florence Pugh is great, too.
People:"""

saved_output = llm.prompt(prompt)
Cillian Murphy, Florence Pugh
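
Since prompt returns the generated text, you can post-process the result in ordinary Python. A minimal sketch, assuming saved_output holds just the comma-separated answer shown above:

# split the comma-separated answer into a list of names
names = [name.strip() for name in saved_output.split(',') if name.strip()]
print(names)  # e.g., ['Cillian Murphy', 'Florence Pugh']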

How to Speed Up Inference Using a GPU

The example above ran on a CPU.
If you have a GPU (even an older one with less VRAM), you can speed up responses.

Step 1: Install llama-cpp-python with cuBLAS support

CMAKE_ARGS="-DLLAMA_CUBLAS=on" FORCE_CMAKE=1 pip install --upgrade --force-reinstall llama-cpp-python==0.1.69 --no-cache-dir

It is important to use the specific version shown above due to library incompatibilities.

Step 2: Use the n_gpu_layers argument with LLM

llm = LLM(model_name=os.path.basename(url), n_gpu_layers=128)

With the steps above, calls to methods like llm.prompt will offload computation to your GPU and speed up responses from the LLM.
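
Putting the two steps together, here is a minimal end-to-end sketch. It reuses the model file downloaded in the Setup section and the prompt defined earlier; n_gpu_layers=128 matches the example above, and you may need a lower value if your GPU runs out of memory.

import os.path
from onprem import LLM

url = 'https://huggingface.co/TheBloke/Wizard-Vicuna-7B-Uncensored-GGML/resolve/main/Wizard-Vicuna-7B-Uncensored.ggmlv3.q4_0.bin'

# offload layers to the GPU; lower n_gpu_layers if you hit out-of-memory errors
llm = LLM(model_name=os.path.basename(url), n_gpu_layers=128)
saved_output = llm.prompt(prompt)  # same prompt as in the example above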

