Skip to main content

Ollama integration for Serapeum

Project description

Serapeum Ollama Provider

Ollama integration for the Serapeum LLM framework

The serapeum-ollama package provides complete Ollama backend support for Serapeum, including:

  • Chat & Completion: Full-featured LLM interface with sync/async support
  • Streaming: Real-time token streaming for both chat and structured outputs
  • Tool Calling: Function calling with automatic schema generation
  • Structured Outputs: Type-safe extraction using Pydantic models
  • Embeddings: Local embedding generation for RAG and semantic search

This adapter implements the serapeum.core.llms.FunctionCallingLLM interface, making it compatible with all Serapeum orchestrators and tools.

Table of Contents

Installation

From Source

# Install from the repository
cd libs/providers/ollama
uv sync

From PyPI (when published)

pip install serapeum-ollama

Prerequisites

You need a running Ollama server with at least one model available:

  1. Install Ollama: Visit ollama.com and follow installation instructions
  2. Start the server: Run ollama serve (usually runs on http://localhost:11434)
  3. Pull a model:
    # For chat/completion
    ollama pull llama3.1
    
    # For embeddings
    ollama pull nomic-embed-text
    

Verify installation:

ollama list

Quick Start

Basic Usage

from serapeum.ollama import Ollama
from serapeum.core.llms import Message, MessageRole, TextChunk

# Initialize the model
llm = Ollama(model="llama3.1", timeout=120)

# Simple chat
messages = [Message(role=MessageRole.USER, chunks=[TextChunk(content="Explain quantum computing in one sentence.")])]

response = llm.chat(messages)
print(response.message.content)

LLM Features

Basic Chat

The Ollama class provides a complete chat interface:

from serapeum.ollama import Ollama
from serapeum.core.llms import Message, MessageRole, MessageList, TextChunk

llm = Ollama(
    model="llama3.1",
    temperature=0.7,
    timeout=120
)

# Single message
response = llm.chat([
    Message(role=MessageRole.USER, chunks=[TextChunk(content="What is the capital of France?")])
])
print(response.message.content)  # "The capital of France is Paris."

# Multi-turn conversation
conversation = [
    Message(role=MessageRole.SYSTEM, chunks=[TextChunk(content="You are a helpful assistant.")]),
    Message(role=MessageRole.USER, chunks=[TextChunk(content="What's 2+2?")]),
    Message(role=MessageRole.ASSISTANT, chunks=[TextChunk(content="4")]),
    Message(role=MessageRole.USER, chunks=[TextChunk(content="And if I add 3?")]),
]

response = llm.chat(MessageList(messages=conversation))
print(response.message.content)  # "7"

# Access token usage
if hasattr(response.raw, 'usage'):
    print(f"Tokens used: {response.raw['usage']['total_tokens']}")

Streaming

Stream responses token-by-token for real-time feedback:

from serapeum.ollama import Ollama
from serapeum.core.llms import Message, MessageRole, TextChunk

llm = Ollama(model="llama3.1")

messages = [Message(role=MessageRole.USER, chunks=[TextChunk(content="Write a haiku about coding.")])]

# Synchronous streaming
print("Streaming response: ", end="")
for chunk in llm.chat(messages, stream=True):
    print(chunk.delta, end="", flush=True)
print()  # newline

# Get the complete message from the last chunk
full_response = chunk.message.content

Async Operations

Full async support for concurrent operations:

import asyncio
from serapeum.ollama import Ollama
from serapeum.core.llms import Message, MessageRole, TextChunk

async def main():
    llm = Ollama(model="llama3.1")

    # Async chat
    response = await llm.achat([
        Message(role=MessageRole.USER, chunks=[TextChunk(content="Hello!")])
    ])
    print(response.message.content)

    # Async streaming
    messages = [Message(role=MessageRole.USER, chunks=[TextChunk(content="Count to 5.")])]
    stream = await llm.achat(messages, stream=True)

    async for chunk in stream:
        print(chunk.delta, end="", flush=True)
    print()

asyncio.run(main())

Structured Outputs

Extract structured data using Pydantic models:

import asyncio
from pydantic import BaseModel, Field
from serapeum.ollama import Ollama
from serapeum.core.prompts import PromptTemplate


class Person(BaseModel):
  name: str = Field(description="Person's full name")
  age: int = Field(description="Person's age in years")
  occupation: str = Field(description="Person's job title")

llm = Ollama(model="llama3.1", json_mode=True)

prompt = PromptTemplate(
  "Extract person information from: {text}\n"
  "Return a JSON object with name, age, and occupation."
)

# Synchronous structured prediction
person = llm.parse(
  schema=Person,
  prompt=prompt,
  text="John Doe is a 32-year-old software engineer at Tech Corp."
)

print(f"{person.name}, {person.age}, works as {person.occupation}")
# Output: John Doe, 32, works as a software engineer

# Streaming structured outputs
for partial in llm.parse(
        schema=Person,
        prompt=prompt,
        text="Jane Smith, age 28, data scientist",
        stream=True
):
  if isinstance(partial, list):
    partial = partial[0]
  print(f"Partial: {partial}")

# Async structured prediction
async def get_structured():
  extracted_person = await llm.aparse(
    schema=Person,
    prompt=prompt,
    text="Alice Johnson is 45 and works as a CEO."
  )
  return extracted_person


result = asyncio.run(get_structured())
print(result)

Tool Calling

Create tools from functions or Pydantic models and let the LLM use them:

from pydantic import BaseModel, Field
from serapeum.ollama import Ollama
from serapeum.core.tools import CallableTool
from serapeum.core.llms import TextChunk
from serapeum.core.llms.orchestrators import ToolOrchestratingLLM
from serapeum.core.prompts import PromptTemplate


# Define tools

def get_weather(location: str, unit: str = "celsius") -> str:
  """Get current weather for a location."""
  # Simulated weather data
  return f"The weather in {location} is 72°{unit[0].upper()} and sunny."

def calculate(operation: str, a: float, b: float) -> float:
  """Perform basic math operations."""
  ops = {
    "add": a + b,
    "subtract": a - b,
    "multiply": a * b,
    "divide": a / b if b != 0 else float('inf')
  }
  return ops.get(operation, 0)

# Create orchestrator with tools
llm = Ollama(model="llama3.1", timeout=120, json_mode=True)

orchestrator = ToolOrchestratingLLM(
  llm=llm,
  prompt=PromptTemplate("Answer the user's question: {query}"),
  schema=calculate,
)

# Use tools via natural language
result = orchestrator(query="What's 15 multiplied by 8?")
print(result)  # Uses calculator_tool automatically

orchestrator = ToolOrchestratingLLM(
  llm=llm,
  prompt=PromptTemplate("Answer the user's question: {query}"),
  schema=get_weather,
)
result = orchestrator(query="What's the weather in Paris?")
print(result)  # Uses weather_tool automatically

# You can also use tools directly with the base LLM
from serapeum.core.llms import Message, MessageRole
from serapeum.core.tools import CallableTool


calculator_tool = CallableTool.from_function(calculate)
messages = [Message(role=MessageRole.USER, chunks=[TextChunk(content="What's 25 + 17?")])]
response = llm.generate_tool_calls(
  tools=[calculator_tool],
  chat_history=messages,
)

# Check if model wants to call a tool
tool_calls = llm.get_tool_calls_from_response(response, error_on_no_tool_call=False)
if tool_calls:
  for call in tool_calls:
    print(f"Tool: {call.tool_name}")
    print(f"Arguments: {call.tool_kwargs}")

    # Execute the tool
    if call.tool_name == "calculate":
      result = calculate(**call.tool_kwargs)
      print(f"Result: {result}")

Completion Style Usage

Use prompt templates for completion-style interactions:

from pydantic import BaseModel
from serapeum.ollama import Ollama
from serapeum.core.prompts import PromptTemplate

llm = Ollama(model="llama3.1", temperature=0.8)

# Simple template
prompt = PromptTemplate("Write a tagline for a company that makes {product}.")
response = llm.predict(prompt, product="eco-friendly water bottles")
print(response)

# Multi-variable template
prompt = PromptTemplate(
    "Write a {style} poem about {topic} in {lines} lines."
)
response = llm.predict(
    prompt,
    style="haiku",
    topic="artificial intelligence",
    lines="3"
)
print(response)

# With output parsing
from serapeum.core.output_parsers import PydanticParser

class Summary(BaseModel):
    title: str
    main_points: list[str]
    conclusion: str

parser = PydanticParser(output_cls=Summary)
prompt = PromptTemplate(
    "Summarize this text as JSON: {text}\n"
    "Include title, main_points (array), and conclusion.",
    output_parser=parser
)

llm_json = Ollama(model="llama3.1") #, json_mode=True
summary = llm_json.predict(
    prompt,
    text="Artificial intelligence is transforming industries. It automates tasks, "
         "provides insights, and enables new capabilities. However, it also raises "
         "ethical concerns about privacy and job displacement."
)

print(summary)

Embeddings

The OllamaEmbedding class provides local embedding generation:

Basic Embedding Generation

from serapeum.ollama import OllamaEmbedding

# Initialize embedding model
embed_model = OllamaEmbedding(
    model_name="nomic-embed-text",
    base_url="http://localhost:11434",
)

# Generate single embedding
text_embedding = embed_model.get_text_embedding("Machine learning is fascinating.")
print(f"Embedding dimension: {len(text_embedding)}")
print(f"First 5 values: {text_embedding[:5]}")

# Query embedding (optimized for retrieval)
query_embedding = embed_model.get_query_embedding("What is machine learning?")

Batch Embeddings

Generate embeddings for multiple texts efficiently:

from serapeum.ollama import OllamaEmbedding

embed_model = OllamaEmbedding(
    model_name="nomic-embed-text",
    batch_size=32,  # Process 32 texts at a time
)

documents = [
    "Python is a high-level programming language.",
    "Machine learning enables computers to learn from data.",
    "Neural networks are inspired by biological neurons.",
    "Deep learning uses multi-layer neural networks.",
    "Natural language processing deals with text and speech.",
]

# Batch embedding generation
embeddings = embed_model.get_text_embedding_batch(documents)
print(f"Generated {len(embeddings)} embeddings")
print(f"Each embedding has {len(embeddings[0])} dimensions")

# Use with similarity search
import numpy as np

query = "What is deep learning?"
query_emb = embed_model.get_query_embedding(query)

# Calculate cosine similarity
def cosine_similarity(a, b):
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

similarities = [
    (doc, cosine_similarity(query_emb, emb))
    for doc, emb in zip(documents, embeddings)
]

# Sort by similarity
similarities.sort(key=lambda x: x[1], reverse=True)
print("\nMost similar documents:")
for doc, score in similarities[:3]:
    print(f"  {score:.3f}: {doc}")

Async Embeddings

Use async operations for better performance:

import asyncio
from serapeum.ollama import OllamaEmbedding

embed_model = OllamaEmbedding(model_name="nomic-embed-text")

async def embed_documents():

    # Async single embedding
    embedding = await embed_model.aget_text_embedding("Hello, world!")
    print(f"Embedding generated: {len(embedding)} dimensions")

    # Async batch embeddings
    documents = [
        "Document 1 about AI",
        "Document 2 about ML",
        "Document 3 about NLP",
    ]
    text_embed = await embed_model.aget_text_embedding_batch(documents)
    print(f"Generated {len(text_embed)} embeddings asynchronously")

    # Async query embedding
    query_embed = await embed_model.aget_query_embedding("What is AI?")

    return text_embed, query_embed

asyncio.run(embed_documents())

Advanced Embedding Configuration

from serapeum.ollama import OllamaEmbedding

# Configure with instructions for better retrieval
embed_model = OllamaEmbedding(
    model_name="nomic-embed-text",
    base_url="http://localhost:11434",
    batch_size=16,
    keep_alive="10m",  # Keep the model loaded for 10 minutes
    query_instruction="Represent this query for retrieving relevant documents: ",
    text_instruction="Represent this document for retrieval: ",
    ollama_additional_kwargs={
        # Add any Ollama-specific parameters
    },
)

# Instructions are automatically prepended
documents = "AI is transforming healthcare."
doc_embeddings = embed_model.get_text_embedding(documents)

query = "How is AI used in medicine?"
query_embedding = embed_model.get_query_embedding(query)

# The model internally processes:
# - Document: "Represent this document for retrieval: AI is transforming healthcare."
# - Query: "Represent this query for retrieving relevant documents: How is AI used in medicine?"

Integration with Serapeum

Combine embeddings with LLMs for RAG (Retrieval-Augmented Generation):

from serapeum.ollama import Ollama, OllamaEmbedding
from serapeum.core.llms import Message, MessageRole, TextChunk

# Initialize both LLM and embeddings
llm = Ollama(model="llama3.1")
embed_model = OllamaEmbedding(model_name="nomic-embed-text")

# Document store (simplified)
knowledge_base = [
    "The Eiffel Tower is in Paris, France.",
    "The Great Wall of China is in China.",
    "The Statue of Liberty is in New York, USA.",
    "The Colosseum is in Rome, Italy.",
]

# Generate embeddings for knowledge base
kb_embeddings = embed_model.get_text_embedding_batch(knowledge_base)

# User query
query = "Where is the Eiffel Tower?"
query_emb = embed_model.get_query_embedding(query)

# Simple similarity search
import numpy as np
def cosine_similarity(a, b):
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

similarities = [
    (doc, cosine_similarity(query_emb, emb))
    for doc, emb in zip(knowledge_base, kb_embeddings)
]
similarities.sort(key=lambda x: x[1], reverse=True)
context = similarities[0][0]

# Use LLM with retrieved context
messages = [
    Message(
        role=MessageRole.SYSTEM,
        chunks=[TextChunk(content=f"Answer based on this context: {context}")]
    ),
    Message(role=MessageRole.USER, chunks=[TextChunk(content=query)])
]

response = llm.chat(messages)
print(response.message.content)
# Output: "The Eiffel Tower is in Paris, France."

Configuration

LLM Configuration

The Ollama class accepts these parameters:

from serapeum.ollama import Ollama

llm = Ollama(
    model="llama3.1",                    # Required: Ollama model name
    base_url="http://localhost:11434",   # Ollama server URL
    temperature=0.75,                    # Sampling temperature (0.0-1.0)
    context_window=3900,                 # Max context tokens
    timeout=60.0,                # Request timeout in seconds
    json_mode=False,                     # Enable JSON formatting
    is_function_calling_model=True,      # Whether model supports tools
    keep_alive="5m",                     # How long to keep the model loaded
    additional_kwargs={                  # Provider-specific options
        "num_predict": 100,              # Max tokens to generate
        "top_k": 40,                     # Top-k sampling
        "top_p": 0.9,                    # Top-p (nucleus) sampling
        "repeat_penalty": 1.1,           # Repetition penalty
    }
)

Key Parameters:

  • model: Model identifier (e.g., "llama3.1", "mistral:latest")
  • base_url: Ollama server endpoint (default: http://localhost:11434)
  • temperature: Controls randomness (0.0 = deterministic, 1.0 = very random)
  • json_mode: Request JSON-formatted responses when True
  • timeout: Timeout for API calls (increase for slower models)
  • keep_alive: Duration to keep model in memory (e.g., "5m", "1h")
  • additional_kwargs: Pass any Ollama-specific options

Embedding Configuration

The OllamaEmbedding class parameters:

from serapeum.ollama import OllamaEmbedding

embed_model = OllamaEmbedding(
    model_name="nomic-embed-text",         # Required: embedding model name
    base_url="http://localhost:11434",     # Ollama server URL
    batch_size=10,                   # Batch size (1-2048)
    keep_alive="5m",                       # Model keep-alive duration
    query_instruction=None,                # Prefix for queries
    text_instruction=None,                 # Prefix for documents
    ollama_additional_kwargs={},           # Ollama API options
    client_kwargs={},                      # Client configuration
)

Available Models

Chat/Completion Models:

  • llama3.1 - Meta's Llama 3.1 (8B, 70B, 405B)
  • llama3.2 - Latest Llama model
  • mistral - Mistral 7B
  • mixtral - Mixtral 8x7B MoE
  • codellama - Code-specialized Llama
  • gemma2 - Google's Gemma 2

Embedding Models:

  • nomic-embed-text - General-purpose embeddings (768d)
  • mxbai-embed-large - High-quality embeddings (1024d)
  • snowflake-arctic-embed - Snowflake's embedding model

Pull models with:

ollama pull llama3.1
ollama pull nomic-embed-text

Examples

Complete examples are available in the examples/ directory:

  • basic_chat.py - Simple chat interactions
  • streaming_example.py - Streaming responses
  • tool_calling_example.py - Using tools with LLMs
  • structured_outputs.py - Extracting structured data
  • embeddings_rag.py - RAG with embeddings
  • async_operations.py - Async patterns

Testing

Run tests for the Ollama provider:

# All tests (from repo root)
python -m pytest libs/providers/ollama/tests

# Skip end-to-end tests (don't require Ollama server)
python -m pytest libs/providers/ollama/tests -m "not e2e"

# Only unit tests
python -m pytest libs/providers/ollama/tests -m unit

# With coverage
python -m pytest libs/providers/ollama/tests --cov=serapeum.ollama

Note: End-to-end tests require a running Ollama server with models available.

Notes

  • Server Required: Ollama must be running (ollama serve) before using this adapter
  • Tool Calling: Depends on model capabilities and Ollama version (some models don't support tools)
  • JSON Mode: Improves structured output reliability when the model supports it
  • Timeouts: Increase timeout for larger models or complex tasks
  • Async: All async methods use a per-event-loop client for thread safety

Links


Questions or issues? Open an issue on GitHub.

Project details


Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

serapeum_ollama-0.5.0.tar.gz (41.4 kB view details)

Uploaded Source

Built Distribution

If you're not sure about the file name format, learn more about wheel file names.

serapeum_ollama-0.5.0-py3-none-any.whl (42.2 kB view details)

Uploaded Python 3

File details

Details for the file serapeum_ollama-0.5.0.tar.gz.

File metadata

  • Download URL: serapeum_ollama-0.5.0.tar.gz
  • Upload date:
  • Size: 41.4 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/6.2.0 CPython/3.12.12

File hashes

Hashes for serapeum_ollama-0.5.0.tar.gz
Algorithm Hash digest
SHA256 5b74e45cacfa717d9e50547e7960b01c4084d0033c840d6478d307977fc4ac47
MD5 d8e848f6a13b8932f83d1afedd82b4fc
BLAKE2b-256 48eac0f83a985f371e54aaadc207f54f11168f3fd166987d9d42491931c0c94d

See more details on using hashes here.

File details

Details for the file serapeum_ollama-0.5.0-py3-none-any.whl.

File metadata

File hashes

Hashes for serapeum_ollama-0.5.0-py3-none-any.whl
Algorithm Hash digest
SHA256 acf4512dc26c2444d5ff579bcc424d771c7bfcab5e47a8f3e0194e672208431b
MD5 5d44dedc052cda8746c5e9d6673eea4a
BLAKE2b-256 91dbd2d7c0a9cc97df34b46547503b34aec3dda563bad1f4bee292f7c327ce4b

See more details on using hashes here.

Supported by

AWS Cloud computing and Security Sponsor Datadog Monitoring Depot Continuous Integration Fastly CDN Google Download Analytics Pingdom Monitoring Sentry Error logging StatusPage Status page