OnPrem.LLM
A tool for running large language models on-premises using non-public data
OnPrem.LLM is a simple Python package that makes it easier to run large language models (LLMs) on your own machines using non-public data (possibly behind corporate firewalls). Inspired by the privateGPT GitHub repo and Simon Willison’s LLM command-line utility, OnPrem.LLM is intended to help integrate local LLMs into practical applications.
The full documentation is here.
A Google Colab demo of installing and using OnPrem.LLM is here.
Install
Once you have installed PyTorch, you can install OnPrem.LLM with:
pip install onprem
For fast GPU-accelerated inference, see the additional instructions below. See the FAQ if you experience issues with llama-cpp-python installation.
How to Use
Setup
from onprem import LLM
llm = LLM()
By default, a 7B-parameter model is downloaded and used. If use_larger=True, a 13B-parameter model is used instead. You can also supply the URL to an LLM of your choosing to LLM (see the code generation section below for an example). As of v0.0.20, OnPrem.LLM supports the newer GGUF format.
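For instance, here is a minimal sketch of both options (the Hugging Face URL below is a placeholder for illustration only; a real example appears in the code generation section):
```
from onprem import LLM

# Use the larger 13B default model instead of the 7B default
llm = LLM(use_larger=True)

# Or supply the URL to a GGUF model of your choosing
# (placeholder URL shown; substitute a real model file)
llm = LLM('https://huggingface.co/<repo>/resolve/main/<model>.Q4_K_M.gguf')
```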
Send Prompts to the LLM to Solve Problems
This is an example of few-shot prompting, where we provide an example of what we want the LLM to do.
prompt = """Extract the names of people in the supplied sentences. Here is an example:
Sentence: James Gandolfini and Paul Newman were great actors.
People:
James Gandolfini, Paul Newman
Sentence:
I like Cillian Murphy's acting. Florence Pugh is great, too.
People:"""
saved_output = llm.prompt(prompt)
Cillian Murphy, Florence Pugh
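The generated text is also returned by llm.prompt, so it can be reused programmatically; a quick check (a sketch, assuming the call above has already been run):
```
# saved_output holds the text generated for the prompt above
print(saved_output)  # Cillian Murphy, Florence Pugh
```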
Additional prompt examples are shown here.
Talk to Your Documents
Answers are generated from the content of your documents (i.e., retrieval-augmented generation or RAG). Here, we will supply use_larger=True to use the larger default model better suited to this use case, in addition to using GPU offloading to speed up answer generation.
from onprem import LLM
llm = LLM(use_larger=True, n_gpu_layers=35)
Step 1: Ingest the Documents into a Vector Database
llm.ingest('./sample_data')
Creating new vectorstore at /home/amaiya/onprem_data/vectordb
Loading documents from ./sample_data
Loaded 12 new documents from ./sample_data
Split into 153 chunks of text (max. 500 chars each)
Creating embeddings. May take some minutes...
Ingestion complete! You can now query your documents using the LLM.ask method
Loading new documents: 100%|██████████████████████| 3/3 [00:00<00:00, 25.52it/s]
Step 2: Answer Questions About the Documents
question = """What is ktrain?"""
answer, docs = llm.ask(question)
ktrain is a low-code platform designed to facilitate the full machine learning workflow, from preprocessing inputs to training, tuning, troubleshooting, and applying models. It focuses on automating other aspects of the ML workflow in order to augment and complement human engineers rather than replacing them. Inspired by fastai and ludwig, ktrain is intended to democratize machine learning for beginners and domain experts with minimal programming or data science experience.
The sources used by the model to generate the answer are stored in docs:
print('\nSources:\n')
for i, document in enumerate(docs):
print(f"\n{i+1}.> " + document.metadata["source"] + ":")
print(document.page_content)
Sources:
1.> ./sample_data/ktrain_paper.pdf:
lection (He et al., 2019). By contrast, ktrain places less emphasis on this aspect of au-
tomation and instead focuses on either partially or fully automating other aspects of the
machine learning (ML) workflow. For these reasons, ktrain is less of a traditional Au-
2
2.> ./sample_data/ktrain_paper.pdf:
possible, ktrain automates (either algorithmically or through setting well-performing de-
faults), but also allows users to make choices that best fit their unique application require-
ments. In this way, ktrain uses automation to augment and complement human engineers
rather than attempting to entirely replace them. In doing so, the strengths of both are
better exploited. Following inspiration from a blog post1 by Rachel Thomas of fast.ai
3.> ./sample_data/ktrain_paper.pdf:
with custom models and data formats, as well.
Inspired by other low-code (and no-
code) open-source ML libraries such as fastai (Howard and Gugger, 2020) and ludwig
(Molino et al., 2019), ktrain is intended to help further democratize machine learning by
enabling beginners and domain experts with minimal programming or data science experi-
4. http://archive.ics.uci.edu/ml/datasets/Twenty+Newsgroups
6
4.> ./sample_data/ktrain_paper.pdf:
ktrain: A Low-Code Library for Augmented Machine Learning
toML platform and more of what might be called a “low-code” ML platform. Through
automation or semi-automation, ktrain facilitates the full machine learning workflow from
curating and preprocessing inputs (i.e., ground-truth-labeled training data) to training,
tuning, troubleshooting, and applying models. In this way, ktrain is well-suited for domain
experts who may have less experience with machine learning and software coding. Where
Text to Code Generation
We’ll use the CodeUp LLM by supplying the URL and employing the particular prompt format this model expects.
from onprem import LLM
url = 'https://huggingface.co/TheBloke/CodeUp-Llama-2-13B-Chat-HF-GGUF/resolve/main/codeup-llama-2-13b-chat-hf.Q4_K_M.gguf'
llm = LLM(url, n_gpu_layers=43) # see below for GPU information
Set up the prompt based on what this model expects (this is important):
template = """
Below is an instruction that describes a task. Write a response that appropriately completes the request.
### Instruction:
{prompt}
### Response:"""
answer = llm.prompt('Write Python code to validate an email address.', prompt_template=template)
Here is an example of Python code that can be used to validate an email address:
```
import re
def validate_email(email):
    # Use a regular expression to check if the email address is in the correct format
    pattern = r'^[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}$'
    if re.match(pattern, email):
        return True
    else:
        return False
# Test the validate_email function with different inputs
print("Email address is valid:", validate_email("example@example.com")) # Should print "True"
print("Email address is invalid:", validate_email("example@")) # Should print "False"
print("Email address is invalid:", validate_email("example.com")) # Should print "False"
```
The code defines a function `validate_email` that takes an email address as input and uses a regular expression to check if the email address is in the correct format. The regular expression checks for an email address that consists of one or more letters, numbers, periods, hyphens, or underscores followed by the `@` symbol, followed by one or more letters, periods, hyphens, or underscores followed by a `.` and two to three letters.
The function returns `True` if the email address is valid, and `False` otherwise. The code also includes some test examples to demonstrate how to use the function.
Let’s try out the code generated above.
import re
def validate_email(email):
    # Use a regular expression to check if the email address is in the correct format
    pattern = r'^[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}$'
    if re.match(pattern, email):
        return True
    else:
        return False
print(validate_email('sam@@openai.com')) # bad email address
print(validate_email('sam@openai')) # bad email address
print(validate_email('sam@openai.com')) # good email address
False
False
True
The generated code may sometimes need editing, but this one worked out-of-the-box.
Built-In Web App
OnPrem.LLM includes a built-in Web app to access the LLM. To start it, run the following command after installation:
onprem --port 8000
Then, enter localhost:8000 (or <domain_name>:8000 if running on a remote server) in a Web browser to access the application.
For more information, see the corresponding documentation.
Speeding Up Inference Using a GPU
The above examples used a CPU. If you have a GPU (even an older one with less VRAM), you can speed up responses.
Step 1: Install llama-cpp-python with CUBLAS support
CMAKE_ARGS="-DLLAMA_CUBLAS=on" FORCE_CMAKE=1 pip install --upgrade --force-reinstall llama-cpp-python==0.1.69 --no-cache-dir
It is important to use the specific version shown above due to library incompatibilities.
Step 2: Use the n_gpu_layers argument with LLM
llm = LLM(n_gpu_layers=35)
The value for n_gpu_layers depends on your GPU memory and the model you’re using (e.g., a max of 35 for the default 7B model). You can reduce the value if you get an error (e.g., CUDA OOM) or you observe the model stalling in the middle of a response.
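For example, if the default 7B model produces out-of-memory errors with all 35 layers offloaded, a smaller value can be tried (a sketch; 20 is an illustrative value, and the right number depends on your GPU):
```
from onprem import LLM

# Offload fewer layers to the GPU to avoid CUDA OOM errors
llm = LLM(n_gpu_layers=20)
```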
With the steps above, calls to methods like llm.prompt will offload computation to your GPU and speed up responses from the LLM.
FAQ
- How do I use other models with OnPrem.LLM?
You can supply the URL to other models to the LLM constructor, as we did above in the code generation example. As of v0.0.20, we support models in GGUF format, which supersedes the older GGML format. You can find llama.cpp-supported models with GGUF in the file name on huggingface.co.
- I’m behind a corporate firewall and am receiving an SSL error when trying to download the model?
Try this:
from onprem import LLM
LLM.download_model(url, ssl_verify=False)
- How do I use this on a machine with no internet access?
Use the LLM.download_model method to download the model files to <your_home_directory>/onprem_data and transfer them to the same location on the air-gapped machine. For the ingest and ask methods, you will also need to download and transfer the embedding model files:
from sentence_transformers import SentenceTransformer
model = SentenceTransformer('sentence-transformers/all-MiniLM-L6-v2')
model.save('/some/folder')
Copy the /some/folder folder to the air-gapped machine and supply the path to LLM via the embedding_model_name parameter (see the sketch after this list).
- When installing onprem, I’m getting errors related to llama-cpp-python on Windows/Mac/Linux?
See this LangChain documentation on Llama.cpp for help on installing the llama-cpp-python package for your system. Additional tips for different operating systems are shown below.
For Linux systems like Ubuntu, try this: sudo apt-get install build-essential g++ clang. Other tips are here.
For Windows systems, either use Windows Subsystem for Linux (WSL) or install Microsoft Visual Studio build tools and ensure the selections shown in this post are installed. WSL is recommended.
For Macs, try following these tips.
If you still have problems, there are various other tips for each of the above OSes in this privateGPT repo thread. Of course, you can also easily use OnPrem.LLM on Google Colab.
- llama-cpp-python is failing to load my model from the model path on Google Colab.
For reasons that are unclear, newer versions of llama-cpp-python fail to load models on Google Colab unless you supply verbose=True to the LLM constructor (which is passed directly to llama-cpp-python). If you experience this problem locally, try supplying verbose=True to LLM.
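As referenced in the air-gapped installation answer above, here is a minimal sketch of pointing LLM at locally copied embedding model files (the '/some/folder' path mirrors the example above; adjust it to wherever you transferred the files):
```
from onprem import LLM

# Use the embedding model files copied to the air-gapped machine
# instead of downloading them from the internet
llm = LLM(embedding_model_name='/some/folder')
```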