Ollama integration for Serapeum
Project description
Serapeum Ollama Provider
Ollama integration for the Serapeum LLM framework
The serapeum-ollama package provides complete Ollama backend support for Serapeum, including:
- Chat & Completion: Full-featured LLM interface with sync/async support
- Streaming: Real-time token streaming for both chat and structured outputs
- Tool Calling: Function calling with automatic schema generation
- Structured Outputs: Type-safe extraction using Pydantic models
- Embeddings: Local embedding generation for RAG and semantic search
This adapter implements the serapeum.core.llms.FunctionCallingLLM interface, making it compatible with all Serapeum orchestrators and tools.
Table of Contents
Installation
From Source
# Install from the repository
cd libs/providers/ollama
uv sync
From PyPI (when published)
pip install serapeum-ollama
Prerequisites
You need a running Ollama server with at least one model available:
- Install Ollama: Visit ollama.com and follow installation instructions
- Start the server: Run
ollama serve(usually runs onhttp://localhost:11434) - Pull a model:
# For chat/completion ollama pull llama3.1 # For embeddings ollama pull nomic-embed-text
Verify installation:
ollama list
Quick Start
Basic Usage
from serapeum.ollama import Ollama
from serapeum.core.llms import Message, MessageRole, TextChunk
# Initialize the model
llm = Ollama(model="llama3.1", timeout=120)
# Simple chat
messages = [Message(role=MessageRole.USER, chunks=[TextChunk(content="Explain quantum computing in one sentence.")])]
response = llm.chat(messages)
print(response.message.content)
LLM Features
Basic Chat
The Ollama class provides a complete chat interface:
from serapeum.ollama import Ollama
from serapeum.core.llms import Message, MessageRole, MessageList, TextChunk
llm = Ollama(
model="llama3.1",
temperature=0.7,
timeout=120
)
# Single message
response = llm.chat([
Message(role=MessageRole.USER, chunks=[TextChunk(content="What is the capital of France?")])
])
print(response.message.content) # "The capital of France is Paris."
# Multi-turn conversation
conversation = [
Message(role=MessageRole.SYSTEM, chunks=[TextChunk(content="You are a helpful assistant.")]),
Message(role=MessageRole.USER, chunks=[TextChunk(content="What's 2+2?")]),
Message(role=MessageRole.ASSISTANT, chunks=[TextChunk(content="4")]),
Message(role=MessageRole.USER, chunks=[TextChunk(content="And if I add 3?")]),
]
response = llm.chat(MessageList(messages=conversation))
print(response.message.content) # "7"
# Access token usage
if hasattr(response.raw, 'usage'):
print(f"Tokens used: {response.raw['usage']['total_tokens']}")
Streaming
Stream responses token-by-token for real-time feedback:
from serapeum.ollama import Ollama
from serapeum.core.llms import Message, MessageRole, TextChunk
llm = Ollama(model="llama3.1")
messages = [Message(role=MessageRole.USER, chunks=[TextChunk(content="Write a haiku about coding.")])]
# Synchronous streaming
print("Streaming response: ", end="")
for chunk in llm.chat(messages, stream=True):
print(chunk.delta, end="", flush=True)
print() # newline
# Get the complete message from the last chunk
full_response = chunk.message.content
Async Operations
Full async support for concurrent operations:
import asyncio
from serapeum.ollama import Ollama
from serapeum.core.llms import Message, MessageRole, TextChunk
async def main():
llm = Ollama(model="llama3.1")
# Async chat
response = await llm.achat([
Message(role=MessageRole.USER, chunks=[TextChunk(content="Hello!")])
])
print(response.message.content)
# Async streaming
messages = [Message(role=MessageRole.USER, chunks=[TextChunk(content="Count to 5.")])]
stream = await llm.achat(messages, stream=True)
async for chunk in stream:
print(chunk.delta, end="", flush=True)
print()
asyncio.run(main())
Structured Outputs
Extract structured data using Pydantic models:
import asyncio
from pydantic import BaseModel, Field
from serapeum.ollama import Ollama
from serapeum.core.prompts import PromptTemplate
class Person(BaseModel):
name: str = Field(description="Person's full name")
age: int = Field(description="Person's age in years")
occupation: str = Field(description="Person's job title")
llm = Ollama(model="llama3.1", json_mode=True)
prompt = PromptTemplate(
"Extract person information from: {text}\n"
"Return a JSON object with name, age, and occupation."
)
# Synchronous structured prediction
person = llm.parse(
schema=Person,
prompt=prompt,
text="John Doe is a 32-year-old software engineer at Tech Corp."
)
print(f"{person.name}, {person.age}, works as {person.occupation}")
# Output: John Doe, 32, works as a software engineer
# Streaming structured outputs
for partial in llm.parse(
schema=Person,
prompt=prompt,
text="Jane Smith, age 28, data scientist",
stream=True
):
if isinstance(partial, list):
partial = partial[0]
print(f"Partial: {partial}")
# Async structured prediction
async def get_structured():
extracted_person = await llm.aparse(
schema=Person,
prompt=prompt,
text="Alice Johnson is 45 and works as a CEO."
)
return extracted_person
result = asyncio.run(get_structured())
print(result)
Tool Calling
Create tools from functions or Pydantic models and let the LLM use them:
from pydantic import BaseModel, Field
from serapeum.ollama import Ollama
from serapeum.core.tools import CallableTool
from serapeum.core.llms import TextChunk
from serapeum.core.llms.orchestrators import ToolOrchestratingLLM
from serapeum.core.prompts import PromptTemplate
# Define tools
def get_weather(location: str, unit: str = "celsius") -> str:
"""Get current weather for a location."""
# Simulated weather data
return f"The weather in {location} is 72°{unit[0].upper()} and sunny."
def calculate(operation: str, a: float, b: float) -> float:
"""Perform basic math operations."""
ops = {
"add": a + b,
"subtract": a - b,
"multiply": a * b,
"divide": a / b if b != 0 else float('inf')
}
return ops.get(operation, 0)
# Create orchestrator with tools
llm = Ollama(model="llama3.1", timeout=120, json_mode=True)
orchestrator = ToolOrchestratingLLM(
llm=llm,
prompt=PromptTemplate("Answer the user's question: {query}"),
schema=calculate,
)
# Use tools via natural language
result = orchestrator(query="What's 15 multiplied by 8?")
print(result) # Uses calculator_tool automatically
orchestrator = ToolOrchestratingLLM(
llm=llm,
prompt=PromptTemplate("Answer the user's question: {query}"),
schema=get_weather,
)
result = orchestrator(query="What's the weather in Paris?")
print(result) # Uses weather_tool automatically
# You can also use tools directly with the base LLM
from serapeum.core.llms import Message, MessageRole
from serapeum.core.tools import CallableTool
calculator_tool = CallableTool.from_function(calculate)
messages = [Message(role=MessageRole.USER, chunks=[TextChunk(content="What's 25 + 17?")])]
response = llm.generate_tool_calls(
tools=[calculator_tool],
chat_history=messages,
)
# Check if model wants to call a tool
tool_calls = llm.get_tool_calls_from_response(response, error_on_no_tool_call=False)
if tool_calls:
for call in tool_calls:
print(f"Tool: {call.tool_name}")
print(f"Arguments: {call.tool_kwargs}")
# Execute the tool
if call.tool_name == "calculate":
result = calculate(**call.tool_kwargs)
print(f"Result: {result}")
Completion Style Usage
Use prompt templates for completion-style interactions:
from pydantic import BaseModel
from serapeum.ollama import Ollama
from serapeum.core.prompts import PromptTemplate
llm = Ollama(model="llama3.1", temperature=0.8)
# Simple template
prompt = PromptTemplate("Write a tagline for a company that makes {product}.")
response = llm.predict(prompt, product="eco-friendly water bottles")
print(response)
# Multi-variable template
prompt = PromptTemplate(
"Write a {style} poem about {topic} in {lines} lines."
)
response = llm.predict(
prompt,
style="haiku",
topic="artificial intelligence",
lines="3"
)
print(response)
# With output parsing
from serapeum.core.output_parsers import PydanticParser
class Summary(BaseModel):
title: str
main_points: list[str]
conclusion: str
parser = PydanticParser(output_cls=Summary)
prompt = PromptTemplate(
"Summarize this text as JSON: {text}\n"
"Include title, main_points (array), and conclusion.",
output_parser=parser
)
llm_json = Ollama(model="llama3.1") #, json_mode=True
summary = llm_json.predict(
prompt,
text="Artificial intelligence is transforming industries. It automates tasks, "
"provides insights, and enables new capabilities. However, it also raises "
"ethical concerns about privacy and job displacement."
)
print(summary)
Embeddings
The OllamaEmbedding class provides local embedding generation:
Basic Embedding Generation
from serapeum.ollama import OllamaEmbedding
# Initialize embedding model
embed_model = OllamaEmbedding(
model_name="nomic-embed-text",
base_url="http://localhost:11434",
)
# Generate single embedding
text_embedding = embed_model.get_text_embedding("Machine learning is fascinating.")
print(f"Embedding dimension: {len(text_embedding)}")
print(f"First 5 values: {text_embedding[:5]}")
# Query embedding (optimized for retrieval)
query_embedding = embed_model.get_query_embedding("What is machine learning?")
Batch Embeddings
Generate embeddings for multiple texts efficiently:
from serapeum.ollama import OllamaEmbedding
embed_model = OllamaEmbedding(
model_name="nomic-embed-text",
batch_size=32, # Process 32 texts at a time
)
documents = [
"Python is a high-level programming language.",
"Machine learning enables computers to learn from data.",
"Neural networks are inspired by biological neurons.",
"Deep learning uses multi-layer neural networks.",
"Natural language processing deals with text and speech.",
]
# Batch embedding generation
embeddings = embed_model.get_text_embedding_batch(documents)
print(f"Generated {len(embeddings)} embeddings")
print(f"Each embedding has {len(embeddings[0])} dimensions")
# Use with similarity search
import numpy as np
query = "What is deep learning?"
query_emb = embed_model.get_query_embedding(query)
# Calculate cosine similarity
def cosine_similarity(a, b):
return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))
similarities = [
(doc, cosine_similarity(query_emb, emb))
for doc, emb in zip(documents, embeddings)
]
# Sort by similarity
similarities.sort(key=lambda x: x[1], reverse=True)
print("\nMost similar documents:")
for doc, score in similarities[:3]:
print(f" {score:.3f}: {doc}")
Async Embeddings
Use async operations for better performance:
import asyncio
from serapeum.ollama import OllamaEmbedding
embed_model = OllamaEmbedding(model_name="nomic-embed-text")
async def embed_documents():
# Async single embedding
embedding = await embed_model.aget_text_embedding("Hello, world!")
print(f"Embedding generated: {len(embedding)} dimensions")
# Async batch embeddings
documents = [
"Document 1 about AI",
"Document 2 about ML",
"Document 3 about NLP",
]
text_embed = await embed_model.aget_text_embedding_batch(documents)
print(f"Generated {len(text_embed)} embeddings asynchronously")
# Async query embedding
query_embed = await embed_model.aget_query_embedding("What is AI?")
return text_embed, query_embed
asyncio.run(embed_documents())
Advanced Embedding Configuration
from serapeum.ollama import OllamaEmbedding
# Configure with instructions for better retrieval
embed_model = OllamaEmbedding(
model_name="nomic-embed-text",
base_url="http://localhost:11434",
batch_size=16,
keep_alive="10m", # Keep the model loaded for 10 minutes
query_instruction="Represent this query for retrieving relevant documents: ",
text_instruction="Represent this document for retrieval: ",
ollama_additional_kwargs={
# Add any Ollama-specific parameters
},
)
# Instructions are automatically prepended
documents = "AI is transforming healthcare."
doc_embeddings = embed_model.get_text_embedding(documents)
query = "How is AI used in medicine?"
query_embedding = embed_model.get_query_embedding(query)
# The model internally processes:
# - Document: "Represent this document for retrieval: AI is transforming healthcare."
# - Query: "Represent this query for retrieving relevant documents: How is AI used in medicine?"
Integration with Serapeum
Combine embeddings with LLMs for RAG (Retrieval-Augmented Generation):
from serapeum.ollama import Ollama, OllamaEmbedding
from serapeum.core.llms import Message, MessageRole, TextChunk
# Initialize both LLM and embeddings
llm = Ollama(model="llama3.1")
embed_model = OllamaEmbedding(model_name="nomic-embed-text")
# Document store (simplified)
knowledge_base = [
"The Eiffel Tower is in Paris, France.",
"The Great Wall of China is in China.",
"The Statue of Liberty is in New York, USA.",
"The Colosseum is in Rome, Italy.",
]
# Generate embeddings for knowledge base
kb_embeddings = embed_model.get_text_embedding_batch(knowledge_base)
# User query
query = "Where is the Eiffel Tower?"
query_emb = embed_model.get_query_embedding(query)
# Simple similarity search
import numpy as np
def cosine_similarity(a, b):
return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))
similarities = [
(doc, cosine_similarity(query_emb, emb))
for doc, emb in zip(knowledge_base, kb_embeddings)
]
similarities.sort(key=lambda x: x[1], reverse=True)
context = similarities[0][0]
# Use LLM with retrieved context
messages = [
Message(
role=MessageRole.SYSTEM,
chunks=[TextChunk(content=f"Answer based on this context: {context}")]
),
Message(role=MessageRole.USER, chunks=[TextChunk(content=query)])
]
response = llm.chat(messages)
print(response.message.content)
# Output: "The Eiffel Tower is in Paris, France."
Configuration
LLM Configuration
The Ollama class accepts these parameters:
from serapeum.ollama import Ollama
llm = Ollama(
model="llama3.1", # Required: Ollama model name
base_url="http://localhost:11434", # Ollama server URL
temperature=0.75, # Sampling temperature (0.0-1.0)
context_window=3900, # Max context tokens
timeout=60.0, # Request timeout in seconds
json_mode=False, # Enable JSON formatting
is_function_calling_model=True, # Whether model supports tools
keep_alive="5m", # How long to keep the model loaded
additional_kwargs={ # Provider-specific options
"num_predict": 100, # Max tokens to generate
"top_k": 40, # Top-k sampling
"top_p": 0.9, # Top-p (nucleus) sampling
"repeat_penalty": 1.1, # Repetition penalty
}
)
Key Parameters:
model: Model identifier (e.g.,"llama3.1","mistral:latest")base_url: Ollama server endpoint (default:http://localhost:11434)temperature: Controls randomness (0.0 = deterministic, 1.0 = very random)json_mode: Request JSON-formatted responses whenTruetimeout: Timeout for API calls (increase for slower models)keep_alive: Duration to keep model in memory (e.g.,"5m","1h")additional_kwargs: Pass any Ollama-specific options
Embedding Configuration
The OllamaEmbedding class parameters:
from serapeum.ollama import OllamaEmbedding
embed_model = OllamaEmbedding(
model_name="nomic-embed-text", # Required: embedding model name
base_url="http://localhost:11434", # Ollama server URL
batch_size=10, # Batch size (1-2048)
keep_alive="5m", # Model keep-alive duration
query_instruction=None, # Prefix for queries
text_instruction=None, # Prefix for documents
ollama_additional_kwargs={}, # Ollama API options
client_kwargs={}, # Client configuration
)
Available Models
Chat/Completion Models:
llama3.1- Meta's Llama 3.1 (8B, 70B, 405B)llama3.2- Latest Llama modelmistral- Mistral 7Bmixtral- Mixtral 8x7B MoEcodellama- Code-specialized Llamagemma2- Google's Gemma 2
Embedding Models:
nomic-embed-text- General-purpose embeddings (768d)mxbai-embed-large- High-quality embeddings (1024d)snowflake-arctic-embed- Snowflake's embedding model
Pull models with:
ollama pull llama3.1
ollama pull nomic-embed-text
Examples
Complete examples are available in the examples/ directory:
basic_chat.py- Simple chat interactionsstreaming_example.py- Streaming responsestool_calling_example.py- Using tools with LLMsstructured_outputs.py- Extracting structured dataembeddings_rag.py- RAG with embeddingsasync_operations.py- Async patterns
Testing
Run tests for the Ollama provider:
# All tests (from repo root)
python -m pytest libs/providers/ollama/tests
# Skip end-to-end tests (don't require Ollama server)
python -m pytest libs/providers/ollama/tests -m "not e2e"
# Only unit tests
python -m pytest libs/providers/ollama/tests -m unit
# With coverage
python -m pytest libs/providers/ollama/tests --cov=serapeum.ollama
Note: End-to-end tests require a running Ollama server with models available.
Notes
- Server Required: Ollama must be running (
ollama serve) before using this adapter - Tool Calling: Depends on model capabilities and Ollama version (some models don't support tools)
- JSON Mode: Improves structured output reliability when the model supports it
- Timeouts: Increase
timeoutfor larger models or complex tasks - Async: All async methods use a per-event-loop client for thread safety
Links
- Homepage: https://github.com/Serapieum-of-alex/serapeum
- Documentation: https://serapeum.readthedocs.io/
- Ollama: https://ollama.com/
- Changelog: HISTORY.rst
Questions or issues? Open an issue on GitHub.
Project details
Download files
Download the file for your platform. If you're not sure which to choose, learn more about installing packages.
Source Distribution
Built Distribution
Filter files by name, interpreter, ABI, and platform.
If you're not sure about the file name format, learn more about wheel file names.
Copy a direct link to the current filters
File details
Details for the file serapeum_ollama-0.5.0.tar.gz.
File metadata
- Download URL: serapeum_ollama-0.5.0.tar.gz
- Upload date:
- Size: 41.4 kB
- Tags: Source
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.2.0 CPython/3.12.12
File hashes
| Algorithm | Hash digest | |
|---|---|---|
| SHA256 |
5b74e45cacfa717d9e50547e7960b01c4084d0033c840d6478d307977fc4ac47
|
|
| MD5 |
d8e848f6a13b8932f83d1afedd82b4fc
|
|
| BLAKE2b-256 |
48eac0f83a985f371e54aaadc207f54f11168f3fd166987d9d42491931c0c94d
|
File details
Details for the file serapeum_ollama-0.5.0-py3-none-any.whl.
File metadata
- Download URL: serapeum_ollama-0.5.0-py3-none-any.whl
- Upload date:
- Size: 42.2 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.2.0 CPython/3.12.12
File hashes
| Algorithm | Hash digest | |
|---|---|---|
| SHA256 |
acf4512dc26c2444d5ff579bcc424d771c7bfcab5e47a8f3e0194e672208431b
|
|
| MD5 |
5d44dedc052cda8746c5e9d6673eea4a
|
|
| BLAKE2b-256 |
91dbd2d7c0a9cc97df34b46547503b34aec3dda563bad1f4bee292f7c327ce4b
|