Serapeum Ollama Provider

Ollama integration for the Serapeum LLM framework

The serapeum-ollama package provides complete Ollama backend support for Serapeum, including:

Chat & Completion: Full-featured LLM interface with sync/async support
Streaming: Real-time token streaming for both chat and structured outputs
Tool Calling: Function calling with automatic schema generation
Structured Outputs: Type-safe extraction using Pydantic models
Embeddings: Local embedding generation for RAG and semantic search

This adapter implements the serapeum.core.llms.FunctionCallingLLM interface, making it compatible with all Serapeum orchestrators and tools.

Installation
Prerequisites
Quick Start
LLM Features
Embeddings
Configuration
Examples
Testing
Links

Installation

From Source

# Install from the repository
cd libs/providers/ollama
uv sync

From PyPI

pip install serapeum-ollama

Prerequisites

You need a running Ollama server with at least one model available:

Install Ollama: Visit ollama.com and follow the installation instructions for your platform
Start the server: Run ollama serve (it listens on http://localhost:11434 by default)

Pull a model:

# For chat/completion
ollama pull llama3.1

# For embeddings
ollama pull nomic-embed-text

Verify installation:

ollama list
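
The same check can be scripted from Python. The sketch below assumes the server exposes Ollama's standard GET /api/tags endpoint, which lists locally pulled models. The helper names parse_model_names and list_local_models are illustrative and are not part of serapeum-ollama.

```python
import json
import urllib.error
import urllib.request


def parse_model_names(payload: dict) -> list[str]:
    """Extract model names from an /api/tags response body."""
    return [m["name"] for m in payload.get("models", []) if "name" in m]


def list_local_models(base_url: str = "http://localhost:11434", timeout: float = 5.0):
    """Return the pulled model names, or None if the server is unreachable."""
    url = base_url.rstrip("/") + "/api/tags"
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return parse_model_names(json.load(resp))
    except (urllib.error.URLError, OSError, ValueError):
        # Connection refused, timeout, or a body that is not valid JSON
        return None
```

Calling list_local_models() before constructing a client gives an early, readable failure (None) instead of a timeout on the first chat request.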

Quick Start

Basic Usage

from serapeum.ollama import Ollama
from serapeum.core.llms import Message, MessageRole, TextChunk

# Initialize the model
llm = Ollama(model="llama3.1", timeout=120)

# Simple chat
messages = [Message(role=MessageRole.USER, chunks=[TextChunk(content="Explain quantum computing in one sentence.")])]

response = llm.chat(messages)
print(response.message.content)

LLM Features

Basic Chat

The Ollama class provides a complete chat interface:

from serapeum.ollama import Ollama
from serapeum.core.llms import Message, MessageRole, MessageList, TextChunk

llm = Ollama(
    model="llama3.1",
    temperature=0.7,
    timeout=120
)

# Single message
response = llm.chat([
    Message(role=MessageRole.USER, chunks=[TextChunk(content="What is the capital of France?")])
])
print(response.message.content)  # "The capital of France is Paris."

# Multi-turn conversation
conversation = [
    Message(role=MessageRole.SYSTEM, chunks=[TextChunk(content="You are a helpful assistant.")]),
    Message(role=MessageRole.USER, chunks=[TextChunk(content="What's 2+2?")]),
    Message(role=MessageRole.ASSISTANT, chunks=[TextChunk(content="4")]),
    Message(role=MessageRole.USER, chunks=[TextChunk(content="And if I add 3?")]),
]

response = llm.chat(MessageList(messages=conversation))
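print(response.message.content)  # e.g. "7" (exact wording depends on the model)

# Access token usage, if the raw response is a dict that includes it
if isinstance(response.raw, dict) and "usage" in response.raw:
    print(f"Tokens used: {response.raw['usage']['total_tokens']}")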

Streaming

Stream responses token-by-token for real-time feedback:

from serapeum.ollama import Ollama
from serapeum.core.llms import Message, MessageRole, TextChunk

llm = Ollama(model="llama3.1")

messages = [Message(role=MessageRole.USER, chunks=[TextChunk(content="Write a haiku about coding.")])]

# Synchronous streaming
print("Streaming response: ", end="")
for chunk in llm.chat(messages, stream=True):
    print(chunk.delta, end="", flush=True)
print()  # newline

# Get the complete message from the last chunk
full_response = chunk.message.content
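
Reading chunk after the loop raises NameError if the stream yields nothing. A minimal alternative, assuming each chunk exposes a delta attribute that is a string or None (as the loop above suggests), is to accumulate the deltas yourself. collect_stream is a hypothetical helper, and the stand-in chunks below replace a live llm.chat(messages, stream=True) call.

```python
from types import SimpleNamespace
from typing import Iterable


def collect_stream(chunks: Iterable) -> str:
    """Join the incremental text of each chunk into one string."""
    parts = []
    for chunk in chunks:
        delta = getattr(chunk, "delta", None)
        if delta:  # skip None or empty deltas
            parts.append(delta)
    return "".join(parts)


# Stand-in chunks; with a live server, pass llm.chat(messages, stream=True)
fake_stream = [
    SimpleNamespace(delta="Hello"),
    SimpleNamespace(delta=None),
    SimpleNamespace(delta=", world"),
]
text = collect_stream(fake_stream)
print(text)
```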

Async Operations

Full async support for concurrent operations:

import asyncio
from serapeum.ollama import Ollama
from serapeum.core.llms import Message, MessageRole, TextChunk

async def main():
    llm = Ollama(model="llama3.1")

    # Async chat
    response = await llm.achat([
        Message(role=MessageRole.USER, chunks=[TextChunk(content="Hello!")])
    ])
    print(response.message.content)

    # Async streaming
    messages = [Message(role=MessageRole.USER, chunks=[TextChunk(content="Count to 5.")])]
    stream = await llm.achat(messages, stream=True)

    async for chunk in stream:
        print(chunk.delta, end="", flush=True)
    print()

asyncio.run(main())
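
To run several requests concurrently without overloading a local server, one option is asyncio.gather combined with a semaphore. This is a sketch under assumptions: run_limited is a hypothetical helper, and fake_achat stands in for a coroutine such as llm.achat so the example runs without a server.

```python
import asyncio


async def run_limited(func, inputs, limit=4):
    """Call an async function on each input, at most `limit` at a time."""
    semaphore = asyncio.Semaphore(limit)

    async def guarded(item):
        async with semaphore:
            return await func(item)

    # gather returns results in the same order as the inputs
    return await asyncio.gather(*(guarded(item) for item in inputs))


async def fake_achat(prompt: str) -> str:
    """Stand-in for a call such as llm.achat(...)."""
    await asyncio.sleep(0.01)
    return prompt.upper()


results = asyncio.run(run_limited(fake_achat, ["one", "two", "three"], limit=2))
print(results)
```

A small limit is usually sensible for a single local Ollama instance, since requests beyond what the server processes in parallel mostly wait in its queue.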

Structured Outputs

Extract structured data using Pydantic models:

import asyncio
from pydantic import BaseModel, Field
from serapeum.ollama import Ollama
from serapeum.core.prompts import PromptTemplate


class Person(BaseModel):
  name: str = Field(description="Person's full name")
  age: int = Field(description="Person's age in years")
  occupation: str = Field(description="Person's job title")

llm = Ollama(model="llama3.1", json_mode=True)

prompt = PromptTemplate(
  "Extract person information from: {text}\n"
  "Return a JSON object with name, age, and occupation."
)

# Synchronous structured prediction
person = llm.parse(
  schema=Person,
  prompt=prompt,
  text="John Doe is a 32-year-old software engineer at Tech Corp."
)

print(f"{person.name}, {person.age}, works as {person.occupation}")
# Example output (model-dependent): John Doe, 32, works as software engineer

# Streaming structured outputs
for partial in llm.parse(
        schema=Person,
        prompt=prompt,
        text="Jane Smith, age 28, data scientist",
        stream=True
):
  if isinstance(partial, list):
    partial = partial[0]
  print(f"Partial: {partial}")

# Async structured prediction
async def get_structured():
  extracted_person = await llm.aparse(
    schema=Person,
    prompt=prompt,
    text="Alice Johnson is 45 and works as a CEO."
  )
  return extracted_person


result = asyncio.run(get_structured())
print(result)

Tool Calling

Create tools from functions or Pydantic models and let the LLM use them:

from pydantic import BaseModel, Field
from serapeum.ollama import Ollama
from serapeum.core.tools import CallableTool
from serapeum.core.llms import TextChunk
from serapeum.core.llms.orchestrators import ToolOrchestratingLLM
from serapeum.core.prompts import PromptTemplate


# Define tools

def get_weather(location: str, unit: str = "celsius") -> str:
  """Get current weather for a location."""
  # Simulated weather data
  return f"The weather in {location} is 72°{unit[0].upper()} and sunny."

def calculate(operation: str, a: float, b: float) -> float:
  """Perform basic math operations."""
  ops = {
    "add": a + b,
    "subtract": a - b,
    "multiply": a * b,
    "divide": a / b if b != 0 else float('inf')
  }
  return ops.get(operation, 0)

# Create orchestrator with tools
llm = Ollama(model="llama3.1", timeout=120, json_mode=True)

orchestrator = ToolOrchestratingLLM(
  llm=llm,
  prompt=PromptTemplate("Answer the user's question: {query}"),
  schema=calculate,
)

# Use tools via natural language
result = orchestrator(query="What's 15 multiplied by 8?")
print(result)  # The orchestrator calls calculate() as a tool

orchestrator = ToolOrchestratingLLM(
  llm=llm,
  prompt=PromptTemplate("Answer the user's question: {query}"),
  schema=get_weather,
)
result = orchestrator(query="What's the weather in Paris?")
print(result)  # The orchestrator calls get_weather() as a tool

# You can also use tools directly with the base LLM
from serapeum.core.llms import Message, MessageRole


calculator_tool = CallableTool.from_function(calculate)
messages = [Message(role=MessageRole.USER, chunks=[TextChunk(content="What's 25 + 17?")])]
response = llm.generate_tool_calls(
  tools=[calculator_tool],
  chat_history=messages,
)

# Check if model wants to call a tool
tool_calls = llm.get_tool_calls_from_response(response, error_on_no_tool_call=False)
if tool_calls:
  for call in tool_calls:
    print(f"Tool: {call.tool_name}")
    print(f"Arguments: {call.tool_kwargs}")

    # Execute the tool
    if call.tool_name == "calculate":
      result = calculate(**call.tool_kwargs)
      print(f"Result: {result}")
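
With more than one tool, the if-chain on call.tool_name can be replaced by a name-to-function registry. The sketch below assumes only what the loop above shows, namely that each call carries tool_name and tool_kwargs. REGISTRY and dispatch are hypothetical names, and SimpleNamespace objects stand in for real tool calls.

```python
from types import SimpleNamespace


def add(a: float, b: float) -> float:
    return a + b


def multiply(a: float, b: float) -> float:
    return a * b


# Map tool names to the Python callables that implement them
REGISTRY = {"add": add, "multiply": multiply}


def dispatch(call, registry=REGISTRY):
    """Run one tool call; return (ok, value_or_error_message)."""
    func = registry.get(call.tool_name)
    if func is None:
        return False, f"unknown tool: {call.tool_name}"
    try:
        return True, func(**call.tool_kwargs)
    except TypeError as exc:  # missing or unexpected arguments from the model
        return False, f"bad arguments for {call.tool_name}: {exc}"


# Stand-ins shaped like the objects in the loop above
calls = [
    SimpleNamespace(tool_name="add", tool_kwargs={"a": 25, "b": 17}),
    SimpleNamespace(tool_name="divide", tool_kwargs={"a": 1, "b": 2}),
]
outcomes = [dispatch(c) for c in calls]
print(outcomes)
```

Returning an error message instead of raising lets you feed the failure back to the model as a tool result, which is often more useful than aborting the conversation.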

Completion Style Usage

Use prompt templates for completion-style interactions:

from pydantic import BaseModel
from serapeum.ollama import Ollama
from serapeum.core.prompts import PromptTemplate

llm = Ollama(model="llama3.1", temperature=0.8)

# Simple template
prompt = PromptTemplate("Write a tagline for a company that makes {product}.")
response = llm.predict(prompt, product="eco-friendly water bottles")
print(response)

# Multi-variable template
prompt = PromptTemplate(
    "Write a {style} poem about {topic} in {lines} lines."
)
response = llm.predict(
    prompt,
    style="haiku",
    topic="artificial intelligence",
    lines="3"
)
print(response)

# With output parsing
from serapeum.core.output_parsers import PydanticParser

class Summary(BaseModel):
    title: str
    main_points: list[str]
    conclusion: str

parser = PydanticParser(output_cls=Summary)
prompt = PromptTemplate(
    "Summarize this text as JSON: {text}\n"
    "Include title, main_points (array), and conclusion.",
    output_parser=parser
)

llm_json = Ollama(model="llama3.1")  # optionally pass json_mode=True to request JSON output
summary = llm_json.predict(
    prompt,
    text="Artificial intelligence is transforming industries. It automates tasks, "
         "provides insights, and enables new capabilities. However, it also raises "
         "ethical concerns about privacy and job displacement."
)

print(summary)

Embeddings

The OllamaEmbedding class provides local embedding generation:

Basic Embedding Generation

from serapeum.ollama import OllamaEmbedding

# Initialize embedding model
embed_model = OllamaEmbedding(
    model_name="nomic-embed-text",
    base_url="http://localhost:11434",
)

# Generate single embedding
text_embedding = embed_model.get_text_embedding("Machine learning is fascinating.")
print(f"Embedding dimension: {len(text_embedding)}")
print(f"First 5 values: {text_embedding[:5]}")

# Query embedding (optimized for retrieval)
query_embedding = embed_model.get_query_embedding("What is machine learning?")

Batch Embeddings

Generate embeddings for multiple texts efficiently:

from serapeum.ollama import OllamaEmbedding

embed_model = OllamaEmbedding(
    model_name="nomic-embed-text",
    batch_size=32,  # Process 32 texts at a time
)

documents = [
    "Python is a high-level programming language.",
    "Machine learning enables computers to learn from data.",
    "Neural networks are inspired by biological neurons.",
    "Deep learning uses multi-layer neural networks.",
    "Natural language processing deals with text and speech.",
]

# Batch embedding generation
embeddings = embed_model.get_text_embedding_batch(documents)
print(f"Generated {len(embeddings)} embeddings")
print(f"Each embedding has {len(embeddings[0])} dimensions")

# Use with similarity search
import numpy as np

query = "What is deep learning?"
query_emb = embed_model.get_query_embedding(query)

# Calculate cosine similarity
def cosine_similarity(a, b):
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

similarities = [
    (doc, cosine_similarity(query_emb, emb))
    for doc, emb in zip(documents, embeddings)
]

# Sort by similarity
similarities.sort(key=lambda x: x[1], reverse=True)
print("\nMost similar documents:")
for doc, score in similarities[:3]:
    print(f"  {score:.3f}: {doc}")
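
The NumPy version above divides by zero if an embedding is all zeros. A dependency-free sketch of the same ranking, with that case guarded, might look like this. The cosine and top_k helpers are illustrative, and the 2-d vectors stand in for real embeddings.

```python
import math


def cosine(a, b):
    """Cosine similarity; returns 0.0 if either vector has zero length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def top_k(query_vec, doc_vecs, k=3):
    """Return (index, score) pairs for the k most similar vectors."""
    scored = [(i, cosine(query_vec, v)) for i, v in enumerate(doc_vecs)]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]


# Toy 2-d vectors standing in for real embeddings
docs = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [0.0, 0.0]]
ranked = top_k([1.0, 0.0], docs, k=2)
print(ranked)
```

Returning indices rather than the texts themselves keeps the helper usable with any parallel list of documents or metadata.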

Async Embeddings

Use the async methods to generate embeddings without blocking the event loop:

import asyncio
from serapeum.ollama import OllamaEmbedding

embed_model = OllamaEmbedding(model_name="nomic-embed-text")

async def embed_documents():

    # Async single embedding
    embedding = await embed_model.aget_text_embedding("Hello, world!")
    print(f"Embedding generated: {len(embedding)} dimensions")

    # Async batch embeddings
    documents = [
        "Document 1 about AI",
        "Document 2 about ML",
        "Document 3 about NLP",
    ]
    text_embed = await embed_model.aget_text_embedding_batch(documents)
    print(f"Generated {len(text_embed)} embeddings asynchronously")

    # Async query embedding
    query_embed = await embed_model.aget_query_embedding("What is AI?")

    return text_embed, query_embed

asyncio.run(embed_documents())

Advanced Embedding Configuration

from serapeum.ollama import OllamaEmbedding

# Configure with instructions for better retrieval
embed_model = OllamaEmbedding(
    model_name="nomic-embed-text",
    base_url="http://localhost:11434",
    batch_size=16,
    keep_alive="10m",  # Keep the model loaded for 10 minutes
    query_instruction="Represent this query for retrieving relevant documents: ",
    text_instruction="Represent this document for retrieval: ",
    ollama_additional_kwargs={
        # Add any Ollama-specific parameters
    },
)

# Instructions are automatically prepended
document = "AI is transforming healthcare."
doc_embedding = embed_model.get_text_embedding(document)

query = "How is AI used in medicine?"
query_embedding = embed_model.get_query_embedding(query)

# The model internally processes:
# - Document: "Represent this document for retrieval: AI is transforming healthcare."
# - Query: "Represent this query for retrieving relevant documents: How is AI used in medicine?"
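
The effect of the two instruction parameters can be illustrated with plain string handling. This shows the behaviour described above, not the library's actual implementation. with_instruction is a hypothetical helper, and whether the library normalizes whitespace is not stated here.

```python
def with_instruction(text: str, instruction=None) -> str:
    """Prepend an instruction prefix when one is configured."""
    return f"{instruction}{text}" if instruction else text


doc_input = with_instruction(
    "AI is transforming healthcare.",
    "Represent this document for retrieval: ",
)
query_input = with_instruction(
    "How is AI used in medicine?",
    "Represent this query for retrieving relevant documents: ",
)
print(doc_input)
print(query_input)
```

Note the trailing space in each instruction: without it, the prefix and the text would run together.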

Integration with Serapeum

Combine embeddings with LLMs for RAG (Retrieval-Augmented Generation):

from serapeum.ollama import Ollama, OllamaEmbedding
from serapeum.core.llms import Message, MessageRole, TextChunk

# Initialize both LLM and embeddings
llm = Ollama(model="llama3.1")
embed_model = OllamaEmbedding(model_name="nomic-embed-text")

# Document store (simplified)
knowledge_base = [
    "The Eiffel Tower is in Paris, France.",
    "The Great Wall of China is in China.",
    "The Statue of Liberty is in New York, USA.",
    "The Colosseum is in Rome, Italy.",
]

# Generate embeddings for knowledge base
kb_embeddings = embed_model.get_text_embedding_batch(knowledge_base)

# User query
query = "Where is the Eiffel Tower?"
query_emb = embed_model.get_query_embedding(query)

# Simple similarity search
import numpy as np
def cosine_similarity(a, b):
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

similarities = [
    (doc, cosine_similarity(query_emb, emb))
    for doc, emb in zip(knowledge_base, kb_embeddings)
]
similarities.sort(key=lambda x: x[1], reverse=True)
context = similarities[0][0]

# Use LLM with retrieved context
messages = [
    Message(
        role=MessageRole.SYSTEM,
        chunks=[TextChunk(content=f"Answer based on this context: {context}")]
    ),
    Message(role=MessageRole.USER, chunks=[TextChunk(content=query)])
]

response = llm.chat(messages)
print(response.message.content)
# Output: "The Eiffel Tower is in Paris, France."
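
The example above passes only the single best match as context. A small extension, sketched here with a hypothetical build_system_prompt helper and made-up scores, keeps the top few passages and drops weak matches before building the system message.

```python
def build_system_prompt(ranked, k=2, min_score=0.0):
    """Format the top-k (text, score) pairs into a system prompt."""
    kept = [text for text, score in ranked[:k] if score >= min_score]
    if not kept:
        return "No relevant context was found. Say so if you cannot answer."
    numbered = "\n".join(f"[{i}] {text}" for i, text in enumerate(kept, start=1))
    return f"Answer based on this context:\n{numbered}"


# `ranked` mirrors the sorted `similarities` list above; scores are made up
ranked = [
    ("The Eiffel Tower is in Paris, France.", 0.91),
    ("The Colosseum is in Rome, Italy.", 0.42),
    ("The Great Wall of China is in China.", 0.17),
]
prompt = build_system_prompt(ranked, k=2, min_score=0.3)
print(prompt)
```

The resulting string can be used as the content of the SYSTEM message in place of the single context sentence.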

Configuration

LLM Configuration

The Ollama class accepts these parameters:

from serapeum.ollama import Ollama

llm = Ollama(
    model="llama3.1",                    # Required: Ollama model name
    base_url="http://localhost:11434",   # Ollama server URL
    temperature=0.75,                    # Sampling temperature (0.0-1.0)
    context_window=3900,                 # Max context tokens
    timeout=60.0,                        # Request timeout in seconds
    json_mode=False,                     # Enable JSON formatting
    is_function_calling_model=True,      # Whether model supports tools
    keep_alive="5m",                     # How long to keep the model loaded
    additional_kwargs={                  # Provider-specific options
        "num_predict": 100,              # Max tokens to generate
        "top_k": 40,                     # Top-k sampling
        "top_p": 0.9,                    # Top-p (nucleus) sampling
        "repeat_penalty": 1.1,           # Repetition penalty
    }
)

Key Parameters:

model: Model identifier (e.g., "llama3.1", "mistral:latest")
base_url: Ollama server endpoint (default: http://localhost:11434)
temperature: Controls randomness (0.0 = deterministic, 1.0 = very random)
json_mode: Request JSON-formatted responses when True
timeout: Timeout for API calls (increase for slower models)
keep_alive: Duration to keep model in memory (e.g., "5m", "1h")
additional_kwargs: Pass any Ollama-specific options
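
These parameters are often easier to manage through the environment than in code. The sketch below builds a keyword-argument dict for Ollama(...). OLLAMA_HOST is the variable the Ollama CLI itself reads, while APP_OLLAMA_MODEL and APP_OLLAMA_TIMEOUT are hypothetical names invented for this example, as is the ollama_settings helper.

```python
import os


def ollama_settings(env=None) -> dict:
    """Build keyword arguments for Ollama(...) from environment variables."""
    env = os.environ if env is None else env
    settings = {
        # OLLAMA_HOST is the variable the Ollama CLI itself reads
        "base_url": env.get("OLLAMA_HOST", "http://localhost:11434"),
        # The two names below are hypothetical, chosen for this example
        "model": env.get("APP_OLLAMA_MODEL", "llama3.1"),
        "timeout": float(env.get("APP_OLLAMA_TIMEOUT", "60")),
    }
    # OLLAMA_HOST is commonly set as host:port without a scheme
    if not settings["base_url"].startswith(("http://", "https://")):
        settings["base_url"] = "http://" + settings["base_url"]
    return settings


settings = ollama_settings({"OLLAMA_HOST": "127.0.0.1:11434", "APP_OLLAMA_TIMEOUT": "120"})
print(settings)
# With serapeum-ollama installed: llm = Ollama(**settings)
```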

Embedding Configuration

The OllamaEmbedding class parameters:

from serapeum.ollama import OllamaEmbedding

embed_model = OllamaEmbedding(
    model_name="nomic-embed-text",         # Required: embedding model name
    base_url="http://localhost:11434",     # Ollama server URL
    batch_size=10,                         # Batch size (1-2048)
    keep_alive="5m",                       # Model keep-alive duration
    query_instruction=None,                # Prefix for queries
    text_instruction=None,                 # Prefix for documents
    ollama_additional_kwargs={},           # Ollama API options
    client_kwargs={},                      # Client configuration
)
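
To make the meaning of batch_size concrete, the following sketch splits a list of texts into consecutive groups of at most that size. It illustrates the concept only. How OllamaEmbedding batches requests internally is not documented here, and iter_batches is a hypothetical helper.

```python
def iter_batches(items, batch_size=10):
    """Yield consecutive slices of `items` with at most `batch_size` elements."""
    if not 1 <= batch_size <= 2048:
        raise ValueError("batch_size must be between 1 and 2048")
    for start in range(0, len(items), batch_size):
        yield items[start:start + batch_size]


texts = [f"document {i}" for i in range(25)]
sizes = [len(batch) for batch in iter_batches(texts, batch_size=10)]
print(sizes)
```

With 25 texts and a batch size of 10, the texts are processed as two full batches and one partial batch.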

Available Models

Chat/Completion Models:

llama3.1 - Meta's Llama 3.1 (8B, 70B, 405B)
llama3.2 - Meta's Llama 3.2 (1B, 3B)
mistral - Mistral 7B
mixtral - Mixtral 8x7B MoE
codellama - Code-specialized Llama
gemma2 - Google's Gemma 2

Embedding Models:

nomic-embed-text - General-purpose embeddings (768d)
mxbai-embed-large - High-quality embeddings (1024d)
snowflake-arctic-embed - Snowflake's embedding model

Pull models with:

ollama pull llama3.1
ollama pull nomic-embed-text

Examples

Complete examples are available in the examples/ directory:

basic_chat.py - Simple chat interactions
streaming_example.py - Streaming responses
tool_calling_example.py - Using tools with LLMs
structured_outputs.py - Extracting structured data
embeddings_rag.py - RAG with embeddings
async_operations.py - Async patterns

Testing

Run tests for the Ollama provider:

# All tests (from repo root)
python -m pytest libs/providers/ollama/tests

# Skip end-to-end tests (the remaining tests don't require an Ollama server)
python -m pytest libs/providers/ollama/tests -m "not e2e"

# Only unit tests
python -m pytest libs/providers/ollama/tests -m unit

# With coverage
python -m pytest libs/providers/ollama/tests --cov=serapeum.ollama

Note: End-to-end tests require a running Ollama server with models available.

Notes

Server Required: Ollama must be running (ollama serve) before using this adapter
Tool Calling: Depends on model capabilities and Ollama version (some models don't support tools)
JSON Mode: Improves structured output reliability when the model supports it
Timeouts: Increase timeout for larger models or complex tasks
Async: All async methods use a per-event-loop client for thread safety
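
Besides raising the timeout, transient failures (for example while a large model is still loading) can be handled with a retry wrapper. This is a generic sketch: call_with_retry is a hypothetical helper, and because this document does not state which exception types Serapeum raises on timeout, the types to retry on are passed in explicitly.

```python
import time


def call_with_retry(func, attempts=3, base_delay=1.0,
                    retry_on=(TimeoutError, ConnectionError), sleep=time.sleep):
    """Call func(); on a listed exception, wait and retry with a doubling delay."""
    for attempt in range(attempts):
        try:
            return func()
        except retry_on:
            if attempt == attempts - 1:
                raise  # out of attempts: surface the last error
            sleep(base_delay * 2 ** attempt)


# Stand-in for a call such as: lambda: llm.chat(messages)
state = {"calls": 0}


def flaky():
    state["calls"] += 1
    if state["calls"] < 3:
        raise TimeoutError("model still loading")
    return "ok"


delays = []  # record the waits instead of actually sleeping
result = call_with_retry(flaky, attempts=3, base_delay=0.5, sleep=delays.append)
print(result, delays)
```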

Release history

0.5.0 (this version): Mar 12, 2026
0.4.0: Feb 26, 2026
0.3.0: Feb 23, 2026
0.2.0: Feb 22, 2026
0.1.0: Feb 16, 2026

