OnPrem
A tool for running large language models on-premises using non-public data
OnPrem is a simple Python package that makes it easier to run large language models (LLMs) on non-public or sensitive data and on machines with no internet connectivity (e.g., behind corporate firewalls). Inspired by the privateGPT GitHub repo and Simon Willison’s LLM command-line utility, OnPrem is designed to help integrate local LLMs into practical applications.
Install
pip install onprem
For GPU support, see additional instructions below.
How to use
Setup
import os.path
from onprem import LLM
url = 'https://huggingface.co/TheBloke/Wizard-Vicuna-7B-Uncensored-GGML/resolve/main/Wizard-Vicuna-7B-Uncensored.ggmlv3.q4_0.bin'
llm = LLM(model_name=os.path.basename(url))
llm.download_model(url, ssl_verify=True)  # set to False if a corporate firewall causes problems
There is already a file Wizard-Vicuna-7B-Uncensored.ggmlv3.q4_0.bin in /home/amaiya/onprem_data.
Do you want to still download it? (Y/n) Y
[██████████████████████████████████████████████████]
Send Prompts to the LLM to Solve Problems
This is an example of few-shot prompting, where we provide an example of what we want the LLM to do.
prompt = """Extract the names of people in the supplied sentences. Here is an example:
Sentence: James Gandolfini and Paul Newman were great actors.
People:
James Gandolfini, Paul Newman
Sentence:
I like Cillian Murphy's acting. Florence Pugh is great, too.
People:"""
saved_output = llm.prompt(prompt)
llama.cpp: loading model from /home/amaiya/onprem_data/Wizard-Vicuna-7B-Uncensored.ggmlv3.q4_0.bin
llama_model_load_internal: format = ggjt v3 (latest)
llama_model_load_internal: n_vocab = 32000
llama_model_load_internal: n_ctx = 2048
llama_model_load_internal: n_embd = 4096
llama_model_load_internal: n_mult = 256
llama_model_load_internal: n_head = 32
llama_model_load_internal: n_layer = 32
llama_model_load_internal: n_rot = 128
llama_model_load_internal: ftype = 2 (mostly Q4_0)
llama_model_load_internal: n_ff = 11008
llama_model_load_internal: model size = 7B
llama_model_load_internal: ggml ctx size = 0.08 MB
llama_model_load_internal: mem required = 5407.72 MB (+ 1026.00 MB per state)
llama_new_context_with_model: kv self size = 1024.00 MB
Cillian Murphy, Florence Pugh
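Since prompt returns the generated text as a Python string, you can post-process it directly. A minimal sketch (assuming the LLM followed the comma-separated format requested in the prompt):

# saved_output holds the raw string generated by the LLM
names = [name.strip() for name in saved_output.split(',')]
print(names)  # ['Cillian Murphy', 'Florence Pugh']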
Talk to Your Documents
Answers are generated from the content of your documents.
Step 1: Download Some Documents to a Folder
import os
if not os.path.exists('/tmp/sample_data'): os.mkdir('/tmp/sample_data')
!wget --user-agent="Mozilla" https://arxiv.org/pdf/2004.10703.pdf -O /tmp/sample_data/ktrain_paper.pdf -q
Step 2: Ingest the Documents into a Vector Database
llm.ingest('/tmp/sample_data')
Creating new vectorstore
Loading documents from /tmp/sample_data
Loaded 18 new documents from /tmp/sample_data
Split into 114 chunks of text (max. 500 tokens each)
Creating embeddings. May take some minutes...
Ingestion complete! You can now query your documents using the prompt method
Loading new documents: 100%|██████████████████████| 2/2 [00:00<00:00, 9.39it/s]
Step 3: Answer Questions About the Documents
question = """Answer the following question in one sentence based only on the provided context: What is ktrain?"""
answer, docs = llm.ask(question)
print('\n\nReferences:\n\n')
for i, document in enumerate(docs):
    print(f"\n{i+1}.> " + document.metadata["source"] + ":")
    print(document.page_content)
Ktrain is a machine learning framework that automates certain aspects of the workflow while allowing for human input and choice to complement the strengths of both humans and machines.
References:
1.> /tmp/sample_data/downloaded_paper.pdf:
lection (He et al., 2019). By contrast, ktrain places less emphasis on this aspect of au-
tomation and instead focuses on either partially or fully automating other aspects of the
machine learning (ML) workflow. For these reasons, ktrain is less of a traditional Au-
2
2.> /tmp/sample_data/ktrain_paper.pdf:
lection (He et al., 2019). By contrast, ktrain places less emphasis on this aspect of au-
tomation and instead focuses on either partially or fully automating other aspects of the
machine learning (ML) workflow. For these reasons, ktrain is less of a traditional Au-
2
3.> /tmp/sample_data/downloaded_paper.pdf:
possible, ktrain automates (either algorithmically or through setting well-performing de-
faults), but also allows users to make choices that best fit their unique application require-
ments. In this way, ktrain uses automation to augment and complement human engineers
rather than attempting to entirely replace them. In doing so, the strengths of both are
better exploited. Following inspiration from a blog post1 by Rachel Thomas of fast.ai
4.> /tmp/sample_data/ktrain_paper.pdf:
possible, ktrain automates (either algorithmically or through setting well-performing de-
faults), but also allows users to make choices that best fit their unique application require-
ments. In this way, ktrain uses automation to augment and complement human engineers
rather than attempting to entirely replace them. In doing so, the strengths of both are
better exploited. Following inspiration from a blog post1 by Rachel Thomas of fast.ai
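The references above repeat because the sample folder evidently held two copies of the same paper. Each document returned by ask carries its source file in metadata['source'], so a small sketch like the following lists each unique source once:

# Collapse duplicate source documents down to their unique file paths
unique_sources = sorted({doc.metadata['source'] for doc in docs})
for source in unique_sources:
    print(source)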
Speeding Up Inference Using a GPU
The example above ran on a CPU. If you have a GPU (even an older one with limited VRAM), you can speed up responses.
Step 1: Install llama-cpp-python with cuBLAS support
CMAKE_ARGS="-DLLAMA_CUBLAS=on" FORCE_CMAKE=1 pip install --upgrade --force-reinstall llama-cpp-python==0.1.69 --no-cache-dir
It is important to use the specific version shown above due to library incompatibilities.
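To verify that the pinned version is the one installed, a quick sanity check (importlib.metadata is in the standard library as of Python 3.8):

# Confirm the installed llama-cpp-python version matches the pin above
from importlib.metadata import version
print(version('llama-cpp-python'))  # should print 0.1.69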
Step 2: Use the n_gpu_layers argument with LLM
llm = LLM(model_name=os.path.basename(url), n_gpu_layers=128)
With the steps above, calls to methods like llm.prompt will offload computation to your GPU and speed up responses from the LLM.
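If offloading 128 layers exhausts your VRAM, pass a smaller value: the 7B model above has 32 layers (see n_layer in the load log), so n_gpu_layers=32 already offloads the whole model, and a sketch like the following offloads only half of it:

# Offload only 16 of the model's 32 layers to the GPU to reduce VRAM use
llm = LLM(model_name=os.path.basename(url), n_gpu_layers=16)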