
OpenSearch AI Observability SDK — OTEL-native tracing and scoring for LLM applications


OpenSearch GenAI SDK

OTel-native tracing and scoring for LLM applications. Instrument your AI workflows with standard OpenTelemetry spans and submit evaluation scores — all routed to OpenSearch through a single OTLP pipeline.

Features

  • One-line setup: register() configures the full OTel pipeline (TracerProvider, exporter, auto-instrumentation)
  • observe(): single decorator / context manager that creates OTel spans with GenAI semantic convention attributes
  • enrich(): add model, token usage, and other GenAI attributes to the active span from anywhere in your code
  • Auto-instrumentation: automatically discovers and activates installed instrumentor packages (OpenAI, Anthropic, Bedrock, LangChain, etc.)
  • Scoring: score() emits evaluation metrics as OTel spans at span, trace, or session level
  • Benchmarks: evaluate() runs your agent against a dataset with scorers; Benchmark uploads results from any eval framework (RAGAS, DeepEval, pytest)
  • AWS SigV4: built-in SigV4 signing for AWS-hosted OpenSearch and Data Prepper endpoints
  • Zero lock-in: remove a decorator and your code still works; everything is standard OTel

Requirements

  • Python: 3.10, 3.11, 3.12, or 3.13
  • OpenTelemetry SDK: >=1.20.0, <2

Installation

pip install opensearch-genai-observability-sdk-py

The core package includes the OTel SDK and exporters. Auto-instrumentation of LLM libraries is opt-in — install only the providers you use:

# Single provider (quotes keep shells such as zsh from expanding the brackets)
pip install "opensearch-genai-observability-sdk-py[openai]"
pip install "opensearch-genai-observability-sdk-py[anthropic]"
pip install "opensearch-genai-observability-sdk-py[bedrock]"
pip install "opensearch-genai-observability-sdk-py[langchain]"

# Multiple providers
pip install "opensearch-genai-observability-sdk-py[openai,anthropic]"

# All instrumentors at once
pip install "opensearch-genai-observability-sdk-py[otel-instrumentors]"

# Everything
pip install "opensearch-genai-observability-sdk-py[all]"

Available extras: openai, anthropic, bedrock, google, langchain, llamaindex, opensearch (required by OpenSearchTraceRetriever), otel-instrumentors (all instrumentors), all

Quick Start

from opensearch_genai_observability_sdk_py import register, observe, Op, enrich, score

# 1. Initialize tracing (one line)
register(endpoint="http://localhost:21890/opentelemetry/v1/traces")

# 2. Trace your functions
@observe(name="web_search", op=Op.EXECUTE_TOOL)
def search(query: str) -> list[dict]:
    return [{"title": f"Result for: {query}"}]

@observe(name="research_agent", op=Op.INVOKE_AGENT)
def research(query: str) -> str:
    results = search(query)
    enrich(model="gpt-4.1", provider="openai", input_tokens=150, output_tokens=50)
    return f"Summary of: {results}"

# 3. Use context managers for inline blocks
@observe(name="qa_pipeline", op=Op.INVOKE_AGENT)
def run(question: str) -> str:
    answer = research(question)
    with observe("safety_check", op="guardrail"):
        enrich(safe=True)
    return answer

result = run("What is OpenSearch?")

# 4. Submit scores (after workflow completes)
score(name="relevance", value=0.95, trace_id="...")

This produces the following span tree:

invoke_agent qa_pipeline
├── invoke_agent research_agent
│   └── execute_tool web_search
└── guardrail safety_check

Architecture

┌──────────────────────────────────────────────────────┐
│                   Your Application                    │
│                                                       │
│  @observe(op=Op.INVOKE_AGENT)   enrich()   score()   │
│  with observe("step", op=...)                         │
│                     │                                 │
│         opensearch-genai-observability-sdk-py         │
├──────────────────────────────────────────────────────┤
│  register()                                           │
│  ┌──────────────────────────────────────────────┐    │
│  │  TracerProvider                               │    │
│  │  ├── Resource (service.name)                  │    │
│  │  ├── BatchSpanProcessor                       │    │
│  │  │   └── OTLPSpanExporter (HTTP or gRPC)      │    │
│  │  │       └── SigV4 signing (AWS endpoints)    │    │
│  │  └── Auto-instrumentation                     │    │
│  │      ├── openai, anthropic, bedrock, ...      │    │
│  │      ├── langchain, llamaindex, haystack      │    │
│  │      └── chromadb, pinecone, qdrant, ...      │    │
│  └──────────────────────────────────────────────┘    │
└───────────────────────┬──────────────────────────────┘
                        │ OTLP (HTTP/gRPC)
                        ▼
               ┌─────────────────┐
               │  Data Prepper /  │
               │  OTel Collector  │
               └────────┬────────┘
                        │
                        ▼
               ┌─────────────────┐
               │   OpenSearch     │
               │  ├── traces      │
               │  └── scores      │
               └─────────────────┘

API Reference

register()

Configures the OTel tracing pipeline. Call once at startup.

register(
    endpoint="http://my-collector:4318/v1/traces",  # or use env vars
    service_name="my-app",
    batch=True,            # BatchSpanProcessor (True) or Simple (False)
    auto_instrument=True,  # discover installed instrumentor packages
)

Endpoint resolution (priority order):

  1. endpoint= parameter — full URL, used as-is
  2. OTEL_EXPORTER_OTLP_TRACES_ENDPOINT env var — full URL, used as-is
  3. OTEL_EXPORTER_OTLP_ENDPOINT env var — base URL, /v1/traces appended automatically
  4. http://localhost:21890/opentelemetry/v1/traces — Data Prepper default
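
The priority order above can be sketched in plain Python. This is only an illustration of the documented rules, not the SDK's own code: the helper name resolve_endpoint, the env argument, and the trailing-slash handling are assumptions made for the example.

```python
import os

DATA_PREPPER_DEFAULT = "http://localhost:21890/opentelemetry/v1/traces"


def resolve_endpoint(endpoint=None, env=None):
    """Return the traces endpoint using the documented priority order."""
    env = os.environ if env is None else env
    if endpoint:                                  # 1. explicit parameter, used as-is
        return endpoint
    traces = env.get("OTEL_EXPORTER_OTLP_TRACES_ENDPOINT")
    if traces:                                    # 2. traces-specific env var, used as-is
        return traces
    base = env.get("OTEL_EXPORTER_OTLP_ENDPOINT")
    if base:                                      # 3. base URL, /v1/traces appended
        # Assumption: a trailing slash on the base URL is dropped first.
        return base.rstrip("/") + "/v1/traces"
    return DATA_PREPPER_DEFAULT                   # 4. Data Prepper default
```

Passing a dict as env makes the rules easy to check without touching the real process environment.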

Protocol resolution (priority order):

  1. protocol= parameter — "http" or "grpc"
  2. OTEL_EXPORTER_OTLP_TRACES_PROTOCOL env var
  3. OTEL_EXPORTER_OTLP_PROTOCOL env var
  4. Inferred from URL scheme

URL schemes:

URL scheme          Transport
http:// / https://  HTTP OTLP (protobuf)
grpc://             gRPC (insecure)
grpcs://            gRPC (TLS)

http/json is not supported. A ValueError is raised if the protocol contradicts a grpc:// or grpcs:// URL scheme.
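
A rough sketch of how the protocol rules and the URL-scheme table could combine. The function name resolve_protocol and the exact error messages are hypothetical; the mapping of the env-var value http/protobuf onto the parameter value "http" is an assumption based on the Configuration table.

```python
import os
from urllib.parse import urlparse


def resolve_protocol(url, protocol=None, env=None):
    """Return "http" or "grpc" following the documented priority order."""
    env = os.environ if env is None else env
    chosen = (
        protocol                                          # 1. explicit parameter
        or env.get("OTEL_EXPORTER_OTLP_TRACES_PROTOCOL")  # 2. traces-specific env var
        or env.get("OTEL_EXPORTER_OTLP_PROTOCOL")         # 3. general env var
    )
    scheme = urlparse(url).scheme
    inferred = "grpc" if scheme in ("grpc", "grpcs") else "http"
    if not chosen:
        return inferred                                   # 4. inferred from URL scheme
    # Assumption: env vars spell HTTP as "http/protobuf", the parameter as "http".
    normalized = "http" if chosen == "http/protobuf" else chosen
    if normalized not in ("http", "grpc"):
        raise ValueError(f"unsupported protocol: {chosen!r}")  # e.g. http/json
    if inferred == "grpc" and normalized != "grpc":
        raise ValueError(f"protocol {chosen!r} contradicts URL scheme {scheme}://")
    return normalized
```

Note that the check is one-directional, as in the text above: an explicit "grpc" with an http:// URL is accepted, while "http" with a grpc:// URL is rejected.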

Authenticated endpoints (e.g. AWS OSIS): pass a custom exporter via exporter=:

from opensearch_genai_observability_sdk_py.exporters import AWSSigV4OTLPExporter

register(
    exporter=AWSSigV4OTLPExporter(
        endpoint="https://pipeline.us-east-1.osis.amazonaws.com/v1/traces",
        service="osis",
    )
)

AWSSigV4OTLPExporter is HTTP-only. AWS OSIS does not expose a gRPC endpoint.

observe()

Single tracing primitive — works as both a decorator and a context manager. Creates an OTel span with GenAI semantic convention attributes.

As a decorator:

@observe(name="planner", op=Op.INVOKE_AGENT)
def plan(query: str) -> str:
    enrich(model="gpt-4.1")
    return call_llm(query)

# Without parentheses (uses function name, no op)
@observe
def my_function():
    ...

As a context manager:

with observe("thinking", op=Op.CHAT) as span:
    enrich(model="gpt-4.1", input_tokens=1500)
    result = call_llm(prompt)

Parameters:

Parameter  Type      Default                                                           Description
name       str       Function __qualname__ (decorator) or "unnamed" (context manager)  Span name
op         str       None                                                              gen_ai.operation.name value. Use Op constants or any custom string
kind       SpanKind  INTERNAL                                                          OTel span kind. Use SpanKind.CLIENT for external service calls

Span naming: when op is provided, the span name is "{op} {name}" (e.g. "invoke_agent planner"). This applies to Op constants and custom strings alike.

Attributes set automatically:

Attribute                                             When set
gen_ai.operation.name                                 When op is provided
gen_ai.agent.name                                     All ops except execute_tool
gen_ai.tool.name                                      When op=Op.EXECUTE_TOOL
gen_ai.input.messages / gen_ai.output.messages        All ops except execute_tool (decorator only)
gen_ai.tool.call.arguments / gen_ai.tool.call.result  When op=Op.EXECUTE_TOOL (decorator only)
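
The naming rule and the first three rows of the table can be expressed as two small functions. This is a sketch with hypothetical helper names. What happens when op is omitted is not stated above; the sketch uses the bare name and sets no attributes in that case, which is an assumption.

```python
def span_name(name, op=None):
    """Build the span name: "{op} {name}" when op is given, else just name."""
    return f"{op} {name}" if op else name


def identity_attributes(name, op=None):
    """Return the operation and agent/tool name attributes for a span."""
    if not op:
        return {}  # assumption: nothing is set without an op
    attrs = {"gen_ai.operation.name": op}
    # execute_tool spans carry a tool name; every other op carries an agent name.
    key = "gen_ai.tool.name" if op == "execute_tool" else "gen_ai.agent.name"
    attrs[key] = name
    return attrs
```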

Supported function types: sync, async, generators, async generators. Errors are captured as span status + exception events.
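
Supporting all four function types means the wrapper has to match the shape of the wrapped function; otherwise a span around a generator would close as soon as the generator object is created, before any item is produced. The sketch below shows one common way to do this with the standard library. The traced decorator and the EVENTS list are stand-ins invented for the example (a list of tuples instead of real spans), not the SDK's implementation.

```python
import contextlib
import functools
import inspect

EVENTS = []  # stand-in for exported spans: (kind, name) tuples


@contextlib.contextmanager
def record(name):
    """Record start, an optional error, and end, roughly like a span lifecycle."""
    EVENTS.append(("start", name))
    try:
        yield
    except Exception:
        EVENTS.append(("error", name))  # a real span would set status and add an exception event
        raise
    finally:
        EVENTS.append(("end", name))


def traced(fn):
    """Hypothetical decorator that picks a wrapper matching the function type."""
    name = fn.__qualname__

    if inspect.isasyncgenfunction(fn):
        @functools.wraps(fn)
        async def wrapper(*args, **kwargs):
            with record(name):  # stays open while items are consumed
                async for item in fn(*args, **kwargs):
                    yield item
    elif inspect.iscoroutinefunction(fn):
        @functools.wraps(fn)
        async def wrapper(*args, **kwargs):
            with record(name):
                return await fn(*args, **kwargs)
    elif inspect.isgeneratorfunction(fn):
        @functools.wraps(fn)
        def wrapper(*args, **kwargs):
            with record(name):  # stays open until the generator finishes
                yield from fn(*args, **kwargs)
    else:
        @functools.wraps(fn)
        def wrapper(*args, **kwargs):
            with record(name):
                return fn(*args, **kwargs)
    return wrapper
```

Decorating a generator with traced records nothing until iteration begins, and the end event is appended only once the generator is exhausted, fails, or is closed.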

Op

Constants for well-known gen_ai.operation.name values. Any custom string is also accepted.

Constant             Value               Use for
Op.CHAT              "chat"              LLM chat completions
Op.INVOKE_AGENT      "invoke_agent"      Agent invocations
Op.CREATE_AGENT      "create_agent"      Agent creation/setup
Op.EXECUTE_TOOL      "execute_tool"      Tool/function calls
Op.RETRIEVAL         "retrieval"         RAG retrieval steps
Op.EMBEDDINGS        "embeddings"        Embedding generation
Op.GENERATE_CONTENT  "generate_content"  Content generation
Op.TEXT_COMPLETION   "text_completion"   Text completions

Custom strings work too: @observe(name="check", op="guardrail").

enrich()

Add GenAI semantic convention attributes to the currently active span. Call from inside an @observe-decorated function or a with observe(...) block.

@observe(name="chat", op=Op.CHAT)
def chat(prompt: str) -> str:
    result = call_llm(prompt)
    enrich(
        model="gpt-4.1",
        provider="openai",
        input_tokens=150,
        output_tokens=50,
        temperature=0.7,
    )
    return result

Parameters:

Parameter      Attribute                       Description
model          gen_ai.request.model            Model name
provider       gen_ai.provider.name            Provider name (openai, anthropic, etc.)
input_tokens   gen_ai.usage.input_tokens       Input token count
output_tokens  gen_ai.usage.output_tokens      Output token count
total_tokens   gen_ai.usage.total_tokens       Total token count
response_id    gen_ai.response.id              Response/completion ID
finish_reason  gen_ai.response.finish_reasons  Finish reason(s)
temperature    gen_ai.request.temperature      Temperature setting
max_tokens     gen_ai.request.max_tokens       Max tokens setting
session_id     gen_ai.session.id               Session/conversation ID
**extra        As provided                     Any additional key-value attributes
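
The parameter table amounts to a keyword-to-attribute mapping, with unknown keys passed through unchanged. A minimal sketch of that translation follows. The helper name to_span_attributes is hypothetical, and skipping None values is an assumption rather than documented behaviour.

```python
ATTRIBUTE_MAP = {
    "model": "gen_ai.request.model",
    "provider": "gen_ai.provider.name",
    "input_tokens": "gen_ai.usage.input_tokens",
    "output_tokens": "gen_ai.usage.output_tokens",
    "total_tokens": "gen_ai.usage.total_tokens",
    "response_id": "gen_ai.response.id",
    "finish_reason": "gen_ai.response.finish_reasons",
    "temperature": "gen_ai.request.temperature",
    "max_tokens": "gen_ai.request.max_tokens",
    "session_id": "gen_ai.session.id",
}


def to_span_attributes(**kwargs):
    """Translate enrich()-style keyword arguments into span attribute keys."""
    attrs = {}
    for key, value in kwargs.items():
        if value is None:
            continue  # assumption: unset values are not written to the span
        # Known parameters map to gen_ai.* keys; extras keep their own name.
        attrs[ATTRIBUTE_MAP.get(key, key)] = value
    return attrs
```

With this mapping, a call such as enrich(safe=True) from the Quick Start would land on the span as a plain safe attribute.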

score()

Submits evaluation scores as OTel spans. Use any evaluation framework you prefer (autoevals, RAGAS, custom) and submit the results through score().

The score span is attached to the evaluated trace so it appears in the same trace waterfall as the spans it evaluates.

Two scoring levels:

# Span-level: score a specific span (score becomes a child of that span)
score(
    name="accuracy",
    value=0.95,
    trace_id="6ebb9835f43af1552f2cebb9f5165e39",
    span_id="89829115c2128845",
    explanation="Weather data matches ground truth",
)

# Trace-level: score the entire trace (score attaches to the root span)
score(
    name="relevance",
    value=0.92,
    trace_id="6ebb9835f43af1552f2cebb9f5165e39",
    explanation="Response addresses the user's query",
    attributes={
        "test.suite.name": "nightly_eval",
        "test.case.result.status": "pass",
    },
)

Parameters:

Parameter    Type   Description
name         str    Metric name (e.g., "relevance", "factuality")
value        float  Numeric score
trace_id     str    Hex trace ID of the trace being scored
span_id      str    Hex span ID for span-level scoring. When omitted, attaches to root span
label        str    Human-readable label ("pass", "relevant", "correct")
explanation  str    Evaluator justification (truncated to 500 chars)
response_id  str    LLM completion ID for correlation
attributes   dict   Additional span attributes (keys used as-is, e.g. test.* from semantic-conventions#3398)

Scores follow the OTel GenAI semantic conventions with gen_ai.evaluation.* attributes. Each score span also emits a gen_ai.evaluation.result event per the OTel GenAI event spec.
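
score() takes hex strings, while OpenTelemetry's Python SpanContext exposes trace_id and span_id as integers. The helpers below sketch the conversion and a basic sanity check before submitting a score. They are illustrative only: the function names are made up, and requiring lowercase hex and rejecting all-zero IDs follows the W3C Trace Context format, which the SDK may or may not enforce.

```python
import re


def format_ids(trace_id: int, span_id: int) -> tuple[str, str]:
    """Render integer IDs as zero-padded lowercase hex (32 and 16 characters)."""
    return format(trace_id, "032x"), format(span_id, "016x")


def is_hex_id(value: str, length: int) -> bool:
    """True for a lowercase hex string of the given length that is not all zeros."""
    return bool(re.fullmatch(rf"[0-9a-f]{{{length}}}", value)) and set(value) != {"0"}


def clip_explanation(text: str, limit: int = 500) -> str:
    """Mirror the documented 500-character limit on explanations."""
    return text[:limit]
```

A check such as is_hex_id(trace_id, 32) catches placeholder values like "..." before a score is attached to a trace that does not exist.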

evaluate()

Run a task against a dataset, score the outputs, and record the results as OTel spans. Agent execution spans are children of each case span, giving a full trace waterfall per case.

from opensearch_genai_observability_sdk_py import evaluate, EvalScore, observe, Op

@observe(op=Op.INVOKE_AGENT)
def my_agent(question: str) -> str:
    return call_llm(question)

def accuracy(input, output, expected) -> EvalScore:
    return EvalScore(name="accuracy", value=1.0 if expected in output else 0.0)

result = evaluate(
    name="rag-agent",
    task=my_agent,
    data=[
        {"input": "What is Python?", "expected": "programming language"},
        {"input": "What causes rain?", "expected": "water vapor"},
    ],
    scores=[accuracy],
    metadata={"agent_version": "v2"},
    record_io=True,
)

Produces:

test_suite_run rag-agent
 └── test_case [case: What is Python?]
      └── invoke_agent my_agent

Parameters:

Parameter  Type            Description
name       str             Benchmark name (test.suite.name). Stable across runs
task       Callable        Function that takes input and returns output. Decorate with @observe() for tracing
data       list[dict]      Dicts with "input" and optionally "expected", "case_id"
scores     list[Callable]  Scorer functions. Each receives (input, output, expected) and returns EvalScore, list[EvalScore], or float
metadata   dict            Attached to root span. Reserved keys (test.*, gen_ai.*) are filtered
record_io  bool            Record input/output/expected as span attributes (default False)

Returns BenchmarkResult with .summary (aggregate stats) and .cases (per-case results).
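
Because a scorer may return an EvalScore, a list of them, or a bare float, the results have to be normalized before they can be aggregated. The sketch below shows one way to do that. The EvalScore dataclass here is a stand-in with only the two fields used in the example above, and naming a bare float after the scorer function is an assumption.

```python
from dataclasses import dataclass


@dataclass
class EvalScore:
    """Stand-in for the SDK's EvalScore; only the fields used in this sketch."""
    name: str
    value: float


def normalize(result, scorer_name):
    """Coerce any of the three documented return shapes into list[EvalScore]."""
    if isinstance(result, EvalScore):
        return [result]
    if isinstance(result, list):
        return list(result)
    if isinstance(result, (int, float)):
        # Assumption: a bare number takes the scorer function's name.
        return [EvalScore(name=scorer_name, value=float(result))]
    raise TypeError(f"unsupported scorer result: {type(result).__name__}")


def run_scorers(scorers, input, output, expected=None):
    """Call each scorer with (input, output, expected) and flatten the results."""
    scores = []
    for scorer in scorers:
        scores.extend(normalize(scorer(input, output, expected), scorer.__name__))
    return scores
```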

Benchmark

Upload pre-computed evaluation results from any framework (RAGAS, DeepEval, pytest, custom) as OTel spans. Use when you already have results and want to visualize them in agent-health or OpenSearch Dashboards.

from opensearch_genai_observability_sdk_py import Benchmark

# Upload results from your eval pipeline
with Benchmark(name="nightly-eval", metadata={"model": "gpt-4o"}, record_io=True) as b:
    b.log(input="What is Python?", output="A language", scores={"accuracy": 1.0})
    b.log(input="Capital of France?", output="Paris", scores={"accuracy": 1.0})

# Link to existing agent traces (click through from failed case → agent trace)
with Benchmark(name="ci-eval") as b:
    b.log(
        input="query",
        output="answer",
        scores={"accuracy": 0.9},
        trace_id="6ebb9835f43af1552f2cebb9f5165e39",
        span_id="89829115c2128845",
    )

OpenSearchTraceRetriever

Retrieves GenAI trace spans from OpenSearch. Works with any agent library that emits OTel GenAI semantic convention spans indexed by Data Prepper into otel-v1-apm-span-*.

from opensearch_genai_observability_sdk_py import OpenSearchTraceRetriever

# Option 1: Basic auth (local / docker-compose)
retriever = OpenSearchTraceRetriever(
    host="https://localhost:9200",
    auth=("admin", "admin"),
    verify_certs=False,
)

# Option 2: AWS OpenSearch Service (SigV4) — use this OR Option 1, not both
import boto3
from opensearchpy import RequestsAWSV4SignerAuth

credentials = boto3.Session().get_credentials()
auth = RequestsAWSV4SignerAuth(credentials, "us-west-2", "es")
retriever = OpenSearchTraceRetriever(
    host="https://search-my-domain.us-west-2.es.amazonaws.com",
    auth=auth,
)

# Retrieve all spans for a session or trace
session = retriever.get_traces("my-conversation-id")
for trace in session.traces:
    for span in trace.spans:
        print(f"{span.operation_name}: {span.name} ({span.model})")

# List recent root spans (for discovering traces to evaluate)
roots = retriever.list_root_spans(services=["my-agent"], max_results=10)

# Filter by time
from datetime import datetime
roots = retriever.list_root_spans(services=["my-agent"], since=datetime(2026, 3, 16))

# Check which traces already have evaluation spans
evaluated = retriever.find_evaluated_trace_ids(["trace-id-1", "trace-id-2"])

Constructor:

Parameter     Type                             Default                   Description
host          str                              "https://localhost:9200"  OpenSearch endpoint
index         str                              "otel-v1-apm-span-*"      Index pattern for span data
auth          tuple | RequestsAWSV4SignerAuth  None                      Basic auth tuple or SigV4 auth
verify_certs  bool                             True                      Verify TLS certificates

Methods:

Method                                                      Returns           Description
get_traces(identifier, max_spans=10000)                     SessionRecord     Fetch spans by conversation ID or trace ID
list_root_spans(services=None, since=None, max_results=50)  list[SpanRecord]  List recent root spans, optionally filtered by service
find_evaluated_trace_ids(trace_ids)                         set[str]          Return subset of trace IDs that already have evaluation spans

Requires the [opensearch] extra: pip install "opensearch-genai-observability-sdk-py[opensearch]"

Auto-Instrumented Libraries

register() automatically discovers and activates any installed instrumentor packages via OTel entry points. No code changes needed — install the extras for the providers you use and their calls are traced automatically.

Category                 Extras / packages
LLM providers            [openai], [anthropic]
Cloud AI                 [bedrock], [google] (Vertex AI + Generative AI)
Frameworks               [langchain], [llamaindex]
All of the above + more  [otel-instrumentors]

[otel-instrumentors] includes all of the above plus Cohere, Mistral, Groq, Ollama, Together, Replicate, Writer, Voyage AI, Aleph Alpha, SageMaker, watsonx, Haystack, CrewAI, Agno, MCP, Transformers, ChromaDB, Pinecone, Qdrant, Weaviate, Milvus, LanceDB, and Marqo.

Configuration

Environment Variable                Description                                     Default
OTEL_EXPORTER_OTLP_TRACES_ENDPOINT  Full OTLP traces endpoint URL
OTEL_EXPORTER_OTLP_ENDPOINT         Base OTLP endpoint URL (/v1/traces appended)
OTEL_EXPORTER_OTLP_TRACES_PROTOCOL  Protocol for traces (http/protobuf, grpc)
OTEL_EXPORTER_OTLP_PROTOCOL         Protocol for all signals (http/protobuf, grpc)
OTEL_SERVICE_NAME                   Service name for spans                          "default"
OPENSEARCH_PROJECT                  Project/service name (fallback)                 "default"
AWS_DEFAULT_REGION                  AWS region for SigV4 signing                    auto-detected

When no endpoint env var is set, register() defaults to the Data Prepper endpoint: http://localhost:21890/opentelemetry/v1/traces.

Examples

See the examples/ directory:

Example                         Description
01_tracing_basics.py            @observe decorator, context manager, enrich()
02_scoring.py                   Span-level, trace-level, and session-level scoring
03_aws_sigv4.py                 AWS SigV4 authentication with AWSSigV4OTLPExporter
04_async_tracing.py             Async function tracing with @observe
05_openai_auto_instrument.py    OpenAI auto-instrumentation via register()
06_retrieval_and_eval.py        Retrieve traces from OpenSearch, evaluate, write scores back
07_benchmarks.py                evaluate() with scorers, compare agent versions
08_upload_benchmark_results.py  Benchmark.log(): upload results from RAGAS, DeepEval, custom, with trace links

License

Apache-2.0
