
InstantLLM

InstantLLM is the backend server for the free Instant LLM app, enabling users to effortlessly connect and interact with any self-hosted large language model through a user-friendly mobile interface anywhere in the world.

Simply download the InstantLLM app on your phone, install the InstantLLM library, and with a few lines of code, you'll be able to leverage your self-hosted model seamlessly.

Remember to first visit our official website and pay for the number of characters you want to use with our interface. No account is required, and just $1 covers more than 65,000 characters!

Next, join our Discord server, where you will get your API key and your model token.

Workflow with InstantLLM

  • Integrate our library with the model you want to host (Llama3, Gemma2, Mistral...)
  • Join our Discord server and send the !getapikey command to get your API key (keep your API key secure)
  • Then send the !gettoken command with your API key to get your model token: !gettoken <api_key>
  • Run the implementation on your machine (examples below)
  • Download our free InstantLLM app on your phone
  • In the app, swipe left and tap Add model
  • Give your model any name you want, paste your model token into the "token from your model provider" field, and press Add model
  • Select your model in the app and enjoy using your own hosted model anywhere in the world!

Join our Discord server to get your model token and API key

Project Structure

  • InstantLLM Server: Hosted by us
  • 3rd Party Server: Hosted by our users with their self-hosted models using the examples below
  • InstantLLM App: Mobile interface to use your self-hosted models.

API usage

  • You can increase the usage allowance of your API key by paying for more tokens on our website (pay as you use)
  • When you send the !gettoken <api_key> command in our Discord server, the newly created model token is linked to that API key
  • You can have as many API keys as you want
  • When you pay for tokens on our website, the total combined usage allowance of your API keys is increased by that amount

Features

  • Interface for any self-hosted large language model.
  • Easy integration with a few lines of code.
  • User-friendly mobile interface.
  • Supports adding, removing, and selecting models.
  • Allows chatting with models and managing chats.

Requirements

  • Python 3.11 (or greater)
  • ollama (recommended)

Installation

Install our library using pip:

pip install instantllm

Don't want to read all the documentation? Just copy the example below.

Remember to first pay for the number of characters you want to use with our interface on our official website (just $1 covers more than 65,000 characters), then join our Discord server to get your API key and model token.

Explanations of how to use your model token and API key are in the Discord server.

from instantllm import InstantLLMClient
from typing import Callable, Dict, Any
import asyncio
import ollama

async def message_handler(message: Dict[str, Any]):
    token = message['token']
    context_window = message['message']['message']

    #stream responses from ollama
    stream = ollama.chat(
        model='llama3:8b',
        messages=context_window,
        stream=True,
    )
    model_response = ''
    for chunk in stream:
        model_response += chunk['message']['content']

        response = {
            "role": "assistant",
            "content": model_response
        }

        if not await client.send_message(token, response):
            print("Message sending stopped")
            break

async def main():
    global client
    SERVER_URL = "ws://instantllm.ddns.net"
    API_KEY = "YOUR_API_KEY"  # Replace with your actual API key

    client = InstantLLMClient(api_key=API_KEY,server_url=SERVER_URL)
    client.set_message_handler(message_handler)

    await client.run()

if __name__ == "__main__":
    asyncio.run(main())

After running this example, you just have to paste your model token into the InstantLLM app and select your model to use it in our mobile app. More information is available in this documentation and in our Discord server.

The Message Handler

To use the InstantLLM app with your favorite language model, you only have to create a custom message handler function. That function takes the incoming message from our server, generates the model response from it, and sends the generated response back to our server, where it is then shown in the InstantLLM app.

The message handler has three parts:

  • The function declaration, with extraction of the token and context window
  • Your custom process to generate a response from an incoming message
  • A call to send the response back to our server

Before we go any further in explaining how to create the message handler, you need to know the structure of the incoming message you will receive

When a message is sent through the InstantLLM app to our server and then redirected to you, the payload will be a JSON object that looks like this

{'token': '795ca495-9512-4d9a-8b3a-817405cae78d', 'message': {'action': 'n/a', 'message': [{'content': 'hi', 'role': 'user'}]}}

token: A unique token that identifies the client and chat sending the message

message: A JSON object containing metadata about the message and the message itself

  • action: Either n/a or STOP. The server sends STOP only when the user presses the square stop button in the InstantLLM app to stop receiving messages and halt the model inference in your implementation; otherwise it always sends n/a (how to use this action is explained in the examples)
  • message: The entire context window of the chat the message was sent from

The InstantLLM server always sends the entire context window of the chat to make integration with Ollama or any other language model inference API easier: you won't need to save the context window on your backend, just handle the model inference
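As a quick illustration, the fields above can be pulled out with plain dictionary access. This is a minimal sketch using the sample payload from this section; no network connection is involved:

```python
# Sample payload copied from this section; plain dictionary access is enough.
incoming = {
    'token': '795ca495-9512-4d9a-8b3a-817405cae78d',
    'message': {
        'action': 'n/a',  # 'n/a' normally, 'STOP' when the user taps stop
        'message': [{'content': 'hi', 'role': 'user'}],
    },
}

token = incoming['token']                        # identifies the client and chat
action = incoming['message']['action']           # 'n/a' or 'STOP'
context_window = incoming['message']['message']  # full chat history

print(token, action, context_window[-1]['content'])
```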

1 - Token and context window

In your message handler, first extract the token and message (the context window) from the server's JSON payload, like this

async def message_handler(message: Dict[str, Any]):
    token = message['token']
    context_window = message['message']['message']

The context window is a list of dictionaries, each with two keys: role and content

[{'content': 'hi', 'role': 'user'}]

When an LLM sends a message in the chat, its response is saved with the role assistant

[{'content': 'hi', 'role': 'user'}, {'content': 'Hi how can i help you today!', 'role': 'assistant'}]

2 - Your custom response function

Now that you have the entire context window of the chat, you can send those messages to your favorite inference API. In this example we will use Ollama for self-hosted models, but you can use any other API (Gemini, OpenAI, Groq).

With streaming enabled in Ollama, the response generation looks like this

    #stream responses from ollama
    stream = ollama.chat(
        model='llama3:8b',
        messages=context_window,
        stream=True,
    )

We use llama3:8b as our self-hosted model, set messages to the context window we got from the InstantLLM server's payload, and set stream=True.

So far, the message handler looks like this

async def message_handler(message: Dict[str, Any]):
    token = message['token']
    context_window = message['message']['message']

    #stream responses from ollama
    stream = ollama.chat(
        model='llama3:8b',
        messages=context_window,
        stream=True,
    )

3 - Send response back to the InstantLLM server

Because we are using streaming, we have to iterate through the chunks in the stream returned by Ollama. To do that, we create a model_response variable to accumulate the response and a for loop to consume each chunk. Every time we get a chunk, model_response is updated, the JSON payload is built, and the payload is sent to the InstantLLM server.

model_response = ''
for chunk in stream:
    model_response += chunk['message']['content']

    response = {
        "role": "assistant",
        "content": model_response
    }

    if not await client.send_message(token, response):
        print("Message sending stopped")
        break

When streaming, the send_message call HAS TO BE inside the for loop, in this exact format. If the user presses the square stop button in the InstantLLM app, a request to stop streaming is sent to your implementation; when that happens, the condition breaks out of the loop and no further messages are sent (the print statement is optional, but recommended for debugging).

Format:

if not await client.send_message(token, response):
    print("Message sending stopped")
    break
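The action field from the incoming payload can also be checked directly. The helper below is a hypothetical sketch, not part of the library, based on the payload structure shown earlier in this section:

```python
def is_stop_request(message):
    # Hypothetical helper: the server sets action to 'STOP' when the user
    # taps the square stop button; otherwise it sends 'n/a'.
    return message.get('message', {}).get('action') == 'STOP'

print(is_stop_request({'token': 't', 'message': {'action': 'STOP', 'message': []}}))
print(is_stop_request({'token': 't', 'message': {'action': 'n/a', 'message': []}}))
```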

Your entire message_handler function using Ollama with streaming should look like this

#1 Required - async function, token and context extraction
async def message_handler(message: Dict[str, Any]):
    token = message['token']
    context_window = message['message']['message']

    #-------------- Start of custom response generation --------------
    #stream responses from ollama
    stream = ollama.chat(
        model='llama3:8b',
        messages=context_window,
        stream=True,
    )
    model_response = ''
    for chunk in stream:
        model_response += chunk['message']['content']

        #2 Required - when streaming responses, send_message has to be inside the for loop with the format shown,
        #and the response JSON has to be in this format as well
        response = {
            "role": "assistant",
            "content": model_response
        }

        if not await client.send_message(token, response):
            print("Message sending stopped")
            break
    #-------------- End of custom response generation --------------

Usage

Basic Toy Example

This is a toy example showing how an "echo" implementation works: it sends back the last message sent by the user. If you want to see a real implementation with Ollama, scroll down to the Real Use Case with Ollama section of this documentation.

  • A message is sent from the InstantLLM app to our server
  • That message is redirected to your implementation
  • After the message is processed by your implementation the response is sent to our server
  • And finally the response is sent from our server to the InstantLLM app

Steps:

  • 1 Create your own message handler
  • 2 Create the main function and assign your message handler in the InstantLLMClient instance

1 Create a message handler:

from instantllm import InstantLLMClient
from typing import Callable, Dict, Any
import asyncio

#create your message handler
async def message_handler(message: Dict[str, Any]):
    token = message['token']
    message_payload = message['message']['message'][-1]['content']

    full_response = f"Processed from pc: {message_payload}"
    for i in range(1, len(full_response) + 1):
        partial_response = full_response[:i]

        response = {
            "role": "assistant",
            "content": partial_response
        }

        if not await client.send_message(token, response):
            print("Message sending stopped")
            break
        await asyncio.sleep(0.1)

async def main():
    global client
    API_KEY = "YOUR_API_KEY"

    client = InstantLLMClient(API_KEY)
    client.set_message_handler(message_handler)
    client.show_logs = False

    await client.run()

if __name__ == "__main__":
    asyncio.run(main())

This example echoes the message back to the InstantLLM app.

2 Run the main function to start the client.

When you run this example, your implementation connects to the InstantLLM server and waits for messages; when a user sends a message, that same message is echoed back to the InstantLLM app.

If this example works as intended, you can now try a real use case. Below is an example using Ollama to get responses from self-hosted models.

Real Use Case with Ollama

Streaming response from ollama to the InstantLLM app

Here, streaming means sending each token generated by the LLM, as it is generated, directly to the InstantLLM app. We will build an implementation of the InstantLLM library using Ollama with streaming enabled, with llama3:8b as the self-hosted model.

1 Helper functions and global variables:

To start your implementation with the InstantLLM library, first import the required libraries

from instantllm import InstantLLMClient
from typing import Callable, Dict, Any
import asyncio
import ollama

2 Create the message handler:

Because we are streaming responses from Ollama, we will use the same message handler explained in The Message Handler section of this documentation

#1 Required - async function, token and context extraction
async def message_handler(message: Dict[str, Any]):
    token = message['token']
    context_window = message['message']['message']

    #-------------- Start of custom response generation --------------
    #stream responses from ollama
    stream = ollama.chat(
        model='llama3:8b',
        messages=context_window,
        stream=True,
    )
    model_response = ''
    for chunk in stream:
        model_response += chunk['message']['content']

        #2 Required - when streaming responses, send_message has to be inside the for loop with the format shown,
        #and the response JSON has to be in this format as well
        response = {
            "role": "assistant",
            "content": model_response
        }

        if not await client.send_message(token, response):
            print("Message sending stopped")
            break
    #-------------- End of custom response generation --------------

3 Create the main function:

Finally, create the main function to host and use your self-hosted model with the InstantLLM app. Remember to get your API key and model token from our Discord server.

async def main():
    global client
    SERVER_URL = "ws://instantllm.ddns.net"
    API_KEY = "YOUR_API_KEY"  # Replace with your actual API key

    client = InstantLLMClient(api_key=API_KEY,server_url=SERVER_URL)
    client.set_message_handler(message_handler)

    await client.run()

if __name__ == "__main__":
    asyncio.run(main())

Your entire server should look like this

from instantllm import InstantLLMClient
from typing import Callable, Dict, Any
import asyncio
import ollama

async def message_handler(message: Dict[str, Any]):
    token = message['token']
    context_window = message['message']['message']

    #stream responses from ollama
    stream = ollama.chat(
        model='llama3:8b',
        messages=context_window,
        stream=True,
    )
    model_response = ''
    for chunk in stream:
        model_response += chunk['message']['content']

        response = {
            "role": "assistant",
            "content": model_response
        }

        if not await client.send_message(token, response):
            print("Message sending stopped")
            break

async def main():
    global client
    SERVER_URL = "ws://instantllm.ddns.net"
    API_KEY = "YOUR_API_KEY"  # Replace with your actual API key

    client = InstantLLMClient(api_key=API_KEY,server_url=SERVER_URL)
    client.set_message_handler(message_handler)

    await client.run()

if __name__ == "__main__":
    asyncio.run(main())

After running the main function, you will be able to use your self-hosted model in the InstantLLM app anywhere in the world with just an internet connection. Now add your model token in the InstantLLM app, give it any name you want, and select your model. To get your model token, join our Discord server and run the !gettoken command; you will receive a model token ready to use. You can also share your model token with anyone you want to give access to your self-hosted model.

Sending response from ollama to the InstantLLM app (Without streaming)

This is the same message handler but without streaming responses from Ollama. Use this implementation if you don't want to consume your API calls too often: by sending a single response once the entire text is generated, you save on API usage (pay as you use).

1 Helper functions and global variables:

We import the same libraries as in the streaming-enabled example

from instantllm import InstantLLMClient
from typing import Callable, Dict, Any
import asyncio
import ollama

2 Send-to-model function and message handler function

Because we won't use streaming in this example, we have to change how messages are sent to Ollama in the message handler. We use the ollama.chat function to send the context window to Ollama and get a response from the selected model.

Create the sendtomodel function

def sendtomodel(context, model_name):
    response = ollama.chat(model=model_name, messages=context)
    return response

Now simply add that function to the message handler, like this

async def message_handler(message: Dict[str, Any]):
    token = message['token']
    context_window = message['message']['message']
    print(f"Received message: {message}")

    model_response = sendtomodel(context=context_window,model_name='llama3:8b')
    model_response = model_response['message']['content']

    response = {
        "role": "assistant",
        "content": model_response
    }

    await client.send_message(token, response)

In this example, the square stop button won't stop your implementation from sending messages to the InstantLLM app; to support the stop button, use the streaming version of this code
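One more caveat worth noting: ollama.chat without streaming is a blocking call, so invoking it directly inside the async handler stalls the event loop until generation finishes. A minimal sketch of one workaround uses asyncio.to_thread; blocking_chat below is a hypothetical stand-in for the real inference call, not part of the library:

```python
import asyncio

def blocking_chat(context):
    # Stand-in for a blocking inference call such as ollama.chat.
    return {'message': {'content': f"echo: {context[-1]['content']}"}}

async def generate(context_window):
    # Run the blocking call in a worker thread so the event loop stays
    # free to handle other incoming messages.
    result = await asyncio.to_thread(blocking_chat, context_window)
    return result['message']['content']

print(asyncio.run(generate([{'role': 'user', 'content': 'hi'}])))  # echo: hi
```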

After adding the sendtomodel function to the message handler, your entire implementation without streaming should look like this

from instantllm import InstantLLMClient
from typing import Callable, Dict, Any
import asyncio
import ollama

def sendtomodel(context, model_name):
    response = ollama.chat(model=model_name, messages=context)
    return response

async def message_handler(message: Dict[str, Any]):
    token = message['token']
    context_window = message['message']['message']
    print(f"Received message: {message}")

    model_response = sendtomodel(context=context_window,model_name='llama3:8b')
    model_response = model_response['message']['content']

    response = {
        "role": "assistant",
        "content": model_response
    }

    await client.send_message(token, response)


async def main():
    global client
    SERVER_URL = "ws://instantllm.ddns.net"
    API_KEY = "YOUR_API_KEY"  # Replace with your actual API key

    client = InstantLLMClient(api_key=API_KEY,server_url=SERVER_URL)
    client.set_message_handler(message_handler)

    await client.run()

if __name__ == "__main__":
    asyncio.run(main())

After running this example and sending a message through the InstantLLM app, your self-hosted model will generate the entire response before sending it back to the app. The stop button in the app won't stop the generation, so you will have to wait for your model to finish before you can send a new message.

The streaming example is recommended if you need the stop button functionality.

Info messages

By default, the InstantLLM library prints information about connections, reconnections, incoming messages, and outgoing messages to the console.

You can disable these messages completely, or disable only the information you don't want to see, by changing boolean flags on the InstantLLMClient instance.

Example of info messages:

  • Info about the connection with the InstantLLM server
  • Info about the received message
  • Info about the outgoing message

Console output:

INFO:instantllm.main:Connecting to server
INFO:instantllm.main:Connected to server
INFO:instantllm.main:Received message: {'token': '4e33c59a-2c9a-44cc-ac16-1fe3b9588e01', 'message': {'action': 'n/a', 'message': [{'content': 'hi', 'role': 'user'}]}}
INFO:instantllm.main:Sent message: {'role': 'assistant', 'content': 'P'}
INFO:instantllm.main:Sent message: {'role': 'assistant', 'content': 'Pr'}
INFO:instantllm.main:Sent message: {'role': 'assistant', 'content': 'Pro'}
INFO:instantllm.main:Sent message: {'role': 'assistant', 'content': 'Proc'}
INFO:instantllm.main:Sent message: {'role': 'assistant', 'content': 'Proce'}
INFO:instantllm.main:Sent message: {'role': 'assistant', 'content': 'Proces'}
INFO:instantllm.main:Sent message: {'role': 'assistant', 'content': 'Process'}
INFO:instantllm.main:Sent message: {'role': 'assistant', 'content': 'Processe'}
INFO:instantllm.main:Sent message: {'role': 'assistant', 'content': 'Processed'}
INFO:instantllm.main:Sent message: {'role': 'assistant', 'content': 'Processed '}
INFO:instantllm.main:Sent message: {'role': 'assistant', 'content': 'Processed f'}
INFO:instantllm.main:Sent message: {'role': 'assistant', 'content': 'Processed fr'}
INFO:instantllm.main:Sent message: {'role': 'assistant', 'content': 'Processed fro'}
INFO:instantllm.main:Sent message: {'role': 'assistant', 'content': 'Processed from'}
INFO:instantllm.main:Sent message: {'role': 'assistant', 'content': 'Processed from '}
INFO:instantllm.main:Sent message: {'role': 'assistant', 'content': 'Processed from p'}
INFO:instantllm.main:Sent message: {'role': 'assistant', 'content': 'Processed from pc'}
INFO:instantllm.main:Sent message: {'role': 'assistant', 'content': 'Processed from pc:'}
INFO:instantllm.main:Sent message: {'role': 'assistant', 'content': 'Processed from pc: '}
INFO:instantllm.main:Sent message: {'role': 'assistant', 'content': 'Processed from pc: h'}
INFO:instantllm.main:Sent message: {'role': 'assistant', 'content': 'Processed from pc: hi'}

show_received_message flag

If this flag is set to False, the message received from the InstantLLM server will not be shown in the console; default True.

async def main():
    global client
    API_KEY = "YOUR_API_KEY"

    client = InstantLLMClient(API_KEY)
    client.set_message_handler(message_handler)
    client.show_received_message = False #Info about the received message disabled

    await client.run()

if __name__ == "__main__":
    asyncio.run(main())

Console output:

INFO:instantllm.main:Connecting to InstantLLM
INFO:instantllm.main:Connecting to server
INFO:instantllm.main:Connected to server
INFO:instantllm.main:Sent message: {'role': 'assistant', 'content': 'P'}
INFO:instantllm.main:Sent message: {'role': 'assistant', 'content': 'Pr'}
INFO:instantllm.main:Sent message: {'role': 'assistant', 'content': 'Pro'}
INFO:instantllm.main:Sent message: {'role': 'assistant', 'content': 'Proc'}
INFO:instantllm.main:Sent message: {'role': 'assistant', 'content': 'Proce'}
INFO:instantllm.main:Sent message: {'role': 'assistant', 'content': 'Proces'}
INFO:instantllm.main:Sent message: {'role': 'assistant', 'content': 'Process'}
INFO:instantllm.main:Sent message: {'role': 'assistant', 'content': 'Processe'}
INFO:instantllm.main:Sent message: {'role': 'assistant', 'content': 'Processed'}
INFO:instantllm.main:Sent message: {'role': 'assistant', 'content': 'Processed '}
INFO:instantllm.main:Sent message: {'role': 'assistant', 'content': 'Processed f'}
INFO:instantllm.main:Sent message: {'role': 'assistant', 'content': 'Processed fr'}
INFO:instantllm.main:Sent message: {'role': 'assistant', 'content': 'Processed fro'}
INFO:instantllm.main:Sent message: {'role': 'assistant', 'content': 'Processed from'}
INFO:instantllm.main:Sent message: {'role': 'assistant', 'content': 'Processed from '}
INFO:instantllm.main:Sent message: {'role': 'assistant', 'content': 'Processed from p'}
INFO:instantllm.main:Sent message: {'role': 'assistant', 'content': 'Processed from pc'}
INFO:instantllm.main:Sent message: {'role': 'assistant', 'content': 'Processed from pc:'}
INFO:instantllm.main:Sent message: {'role': 'assistant', 'content': 'Processed from pc: '}
INFO:instantllm.main:Sent message: {'role': 'assistant', 'content': 'Processed from pc: h'}
INFO:instantllm.main:Sent message: {'role': 'assistant', 'content': 'Processed from pc: hi'}

show_sent_message flag

If this flag is set to False, the message sent from your implementation to the InstantLLM server will not be shown in the console; default True.

async def main():
    global client
    API_KEY = "YOUR_API_KEY"

    client = InstantLLMClient(API_KEY)
    client.set_message_handler(message_handler)
    client.show_sent_message = False #Info about the sent message disabled

    await client.run()

if __name__ == "__main__":
    asyncio.run(main())

Console output:

INFO:instantllm.main:Connecting to InstantLLM
INFO:instantllm.main:Connecting to server
INFO:instantllm.main:Connected to server
INFO:instantllm.main:Received message: {'token': 'f30490ee-ce75-4841-85c6-09f8e08ed3a1', 'message': {'action': 'n/a', 'message': [{'content': 'hi', 'role': 'user'}]}}

show_logs flag

If this flag is set to False, console info messages are disabled completely; default True. It overrides the show_received_message and show_sent_message flags.

async def main():
    global client
    API_KEY = "YOUR_API_KEY"

    client = InstantLLMClient(API_KEY)
    client.set_message_handler(message_handler)
    client.show_logs = False #Info messages disabled completely

    await client.run()

if __name__ == "__main__":
    asyncio.run(main())

Contributing

Contributions are welcome! Please open an issue or submit a pull request on GitHub.
