otela
Format OpenTelemetry agent traces for analytics, ML, and context graphs.
otela (OTel + analytics) reads raw OTel traces (OTLP/JSON) and emits
analytics-ready data in whatever shape you need: Arrow tables, Parquet,
Pandas, nested JSON records, or PyTorch tensors. It auto-detects between
OpenInference and OTel GenAI semantic conventions and gives you a single
uniform schema you can run cross-trace analytics, dashboards, ML data
prep, and context-graph builds against.
Try it (no install)
Grab a sample OTel agent trace and run otela's CLI against it via uvx —
nothing gets installed permanently:
curl -sL https://raw.githubusercontent.com/zach-blumenfeld/otela/main/examples/sample.json \
-o trace.json
uvx otela totables trace.json otela-out
This writes one Parquet file per table:
otela-out/
├── sessions.parquet # one row per session (multi-trace conversation rollup)
├── traces.parquet # one row per trace (rollup), with session_id + session_turn
├── spans.parquet # one row per span
├── messages.parquet # LLM messages, joinable on (trace_id, span_id)
├── documents.parquet # retrieved docs from RETRIEVER spans
└── links.parquet # OTel span links
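If you have pandas (with pyarrow) installed, you can inspect the output right away. A minimal sketch; the column names used here are the ones documented in the Schema Reference below:

```python
import pandas as pd

# Per-trace rollup and the span fact table written by `otela totables`.
traces = pd.read_parquet("otela-out/traces.parquet")
spans = pd.read_parquet("otela-out/spans.parquet")

# Quick sanity check: size, errors, and token spend per trace.
print(traces[["trace_id", "span_count", "error_count", "total_tokens"]])

# Which span kinds the sample trace contains (AGENT, LLM, TOOL, ...).
print(spans["kind"].value_counts())
```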
Want nested-record output instead?
uvx otela torecords trace.json otela-records --format jsonl
Install
pip install otela # core
pip install "otela[pandas]" # adds the to_dfs() pandas adapter
pip install "otela[ml]" # adds torch for to_tensors()
pip install "otela[pandas,ml]" # everything
Python 3.11+.
Basic Usage
CLI
Create tabular data for analytics:
otela totables path/to/otel/trace.json path/to/output/
Create records with otela formatting — each record is a trace/workflow with nested spans, tool calls, etc.:
otela torecords path/to/otel/trace.json path/to/output/
The input can be a single OTLP/JSON file or a directory of files (walked recursively).
Python
import otela
# load one or multiple json files (file or directory)
traces = otela.load('path/to/otel/trace.json')
# dicts / json maps in otela format
trace_dicts = otela.to_dicts(traces)
# tabular (Pandas)
dfs = otela.to_dfs(traces)
# tensors (PyTorch)
tensors = otela.to_tensors(traces)
# stream straight to parquet without holding everything in memory
otela.to_parquet('path/to/traces/', 'out/', batch_size=10_000)
# on-demand dim tables: tools, agents, models, services
dims = otela.dims(traces)
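A typical next step is joining the side tables back onto spans. A sketch assuming to_dfs() returns a dict of DataFrames keyed by the table names shown in the Schema Reference (spans, messages, and so on):

```python
import otela

traces = otela.load("path/to/otel/trace.json")
dfs = otela.to_dfs(traces)

# Attach each LLM message to its span, preserving conversation order.
llm_io = (
    dfs["messages"]
    .merge(
        dfs["spans"][["trace_id", "span_id", "name", "model_name"]],
        on=["trace_id", "span_id"],
        how="left",
    )
    .sort_values(["trace_id", "span_id", "position"])
)
print(llm_io[["role", "direction", "content", "model_name"]].head())
```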
Specs
To make OTel traces useful for analytics & DS/ML, they need to be formatted a bit differently. otela has two specs for this:
- agent-trace (at, the default): A minimal normalization between OpenInference and OTel GenAI Semantic Conventions that gives you uniform records for analytics. Stays close to OTel naming conventions while surfacing input & output attributes and using a schema suited to tabular/batch processing.
- workflow-graph (wg): A more opinionated spec for representing the structural decisions and actions in a workflow — agent, human, or hybrid. It's prescriptive about node and relationship types in a graph schema, optimized for context graphs, reinforcement learning, and other research.
Both take either OpenInference or OTel GenAI semantic conventions as
input and also tolerate Vercel AI SDK (ai.*), MLflow (mlflow.*), and
Traceloop (traceloop.*) attributes as fallbacks.
To specify the spec:
otela totables path/to/otel/trace.json path/to/output/ \
--spec wg/v1
trace_dicts = otela.to_dicts(otela.load('path/...', spec='at/v2'))
Forward-slash notation (at/v2) calls the spec at a specific version.
This is recommended as specs may change in non-backward-compatible ways.
Omitting the version (at) calls the latest. Specification type and
version are always embedded in the output records (spec,
spec_version columns or fields). Migration utilities will be added as
needed.
Status:
- at/v2 is implemented today (adds session_id, session_turn, and the sessions rollup; supersedes at/v1).
- wg/v1 is on the roadmap — see Status below.
Output Formats
totables output formats: parquet (default), csv, arrow, json, jsonl
torecords output formats: json (default), jsonl
Specify the output format:
otela totables path/to/otel/trace.json path/to/output/ \
--format parquet
parquet streams to disk with bounded memory (controlled by
--batch-size, default 10,000 spans per row group). The other tabular
formats materialize the full tableset in memory before writing — fine for
development, but for billions of spans use parquet.
Status
| Area | Status |
|---|---|
| agent-trace spec, version at/v2 | implemented |
| Session detection + multi-turn sessions rollup | implemented |
| OTLP/JSON file + directory ingestion | implemented |
| OpenInference convention | implemented |
| OTel GenAI semconv (events + attrs) | implemented |
| Vercel AI SDK / MLflow / Traceloop | implemented |
| Generic input.value / output.value fallback | implemented |
| Streaming Parquet writer | implemented |
| load / to_dfs / to_dicts / to_tensors / to_parquet / dims | implemented |
| otela totables / otela torecords CLI | implemented |
| Real-trace fixtures: LangGraph (OpenInference) | implemented |
| Real-trace fixtures: Google ADK (OTel GenAI) | implemented |
| Real-trace fixtures: LlamaIndex (RETRIEVER / EMBEDDING) | planned — next |
| HuggingFace dataset adapters | planned — after LlamaIndex |
| Phoenix / Langfuse native export readers | planned — opportunistic |
| workflow-graph spec, wg/v1 | not yet started |
| Tokenized-text tensors for LLM fine-tuning | not yet started |
| Streaming nested-record (torecords) writer | not yet started |
| Parquet directory partitioning (Hive style) | not yet started |
Schema Reference (agent-trace, at/v2)
Every otela.load() call returns a dict of six Arrow tables. Schemas
are stable and versioned — every row carries spec and spec_version,
and spec_version only changes on a non-backward-compatible schema
change. Schemas are importable: otela.SPANS_SCHEMA,
otela.TRACES_SCHEMA, otela.SESSIONS_SCHEMA, etc.
output/
├── sessions.parquet # one row per session (multi-trace rollup)
├── traces.parquet # one row per trace (rollup) — carries session_id, session_turn
├── spans.parquet # one row per span — the fact table
├── messages.parquet # one row per LLM message; joins on (trace_id, span_id)
├── documents.parquet # one row per retrieved document
└── links.parquet # OTel span links
sessions
Session-level rollup, one row per distinct session_id. A session
groups multiple traces (turns of one conversation). Traces with no
session_id are excluded.
| Column | Type | Notes |
|---|---|---|
| session_id | string | primary key |
| trace_count | int64 | |
| span_count | int64 | sum across traces |
| error_count | int64 | sum across traces |
| start_time_unix_nano | int64 | min over traces |
| end_time_unix_nano | int64 | max over traces |
| duration_ns | int64 | |
| total_input_tokens | int64 | sum across traces; NULL if no trace had it |
| total_output_tokens | int64 | " |
| total_tokens | int64 | " |
traces
Trace-level rollup, one row per trace.
| Column | Type | Notes |
|---|---|---|
| trace_id | string | primary key |
| session_id | string | nullable; promoted from any recognized session attribute (see below) |
| session_turn | int32 | 0-indexed position within session, ordered by start_time_unix_nano ASC, trace_id lex tiebreak. Null when session_id is null. otela-derived; batch-scoped. |
| root_span_id | string | earliest parentless span |
| root_span_name | string | |
| service_name | string | from the root span's resource |
| start_time_unix_nano | int64 | min over spans |
| end_time_unix_nano | int64 | max over spans |
| duration_ns | int64 | |
| span_count | int64 | |
| error_count | int64 | spans with status = ERROR |
| status | string | worst-of: ERROR > OK > UNSET |
| total_input_tokens | int64 | sum across spans; NULL if no span had it |
| total_output_tokens | int64 | " |
| total_tokens | int64 | " |
spans
Canonical row-per-span fact table. Joins to traces on trace_id and
to the side tables on (trace_id, span_id).
- Identification: trace_id, span_id, parent_span_id, name
- Classification: kind (AGENT | LLM | TOOL | CHAIN | RETRIEVER | EMBEDDING | RERANKER | GUARDRAIL | EVALUATOR | UNKNOWN), convention (which semconv this span came from), status_code, status_message
- Timing: start_time_unix_nano, end_time_unix_nano, duration_ns
- Resource: service_name, scope_name, scope_version
- Agent-trace canonical: model_name, tool_name, agent_name, input_tokens, output_tokens, total_tokens, io_format (text | tool_call | retrieval | unknown), input_text, output_text
- Session: session_id — promoted from any recognized session / conversation attribute (see "Source Conventions Accepted" below). Carried on every span the source instrumentation tagged, so WHERE session_id = X queries hit the spans table without a join (see the query sketch below).
- Fidelity: raw_attributes_json — JSON-encoded leftover attrs the normalizer didn't promote into a typed column. No information is silently dropped.
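Because session_id lives on the span rows themselves, session-scoped analytics can run straight off spans.parquet. A sketch using DuckDB from Python; the file path and session id are placeholders:

```python
import duckdb

# Per-kind latency and token usage for one conversation, no joins required.
per_kind = duckdb.query("""
    SELECT kind,
           count(*)               AS spans,
           sum(duration_ns) / 1e6 AS total_ms,
           sum(total_tokens)      AS tokens
    FROM 'otela-out/spans.parquet'
    WHERE session_id = 'session-123'
    GROUP BY kind
    ORDER BY total_ms DESC
""").df()
print(per_kind)
```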
messages
One row per LLM message (system / user / assistant / tool). Sourced from
either OpenInference indexed attributes (llm.input_messages.N.message.*)
or OTel GenAI span events (gen_ai.user.message,
gen_ai.assistant.message, etc.).
Columns: trace_id, span_id, position (order within the span),
direction (input | output), role, content, tool_call_id.
documents
Retrieved documents from RETRIEVER spans.
Columns: trace_id, span_id, position, document_id, content,
score.
links
OTel span links (one span pointing at another span outside its parent chain).
Columns: trace_id, span_id, linked_trace_id, linked_span_id.
Mapping to a graph schema
The 4-FK + 1-rollup tabular layout corresponds 1:1 to the property-graph
schema in zach-blumenfeld/otel-to-neo4j.
The (:Tool), (:Agent), (:Model), (:Service) nodes are
denormalized into name columns on spans; recover them as dim tables on
demand via otela.dims(traces).
Source Conventions Accepted
Spans are auto-classified per-span. A single trace can mix conventions — e.g. an OpenInference LangChain instrumentation alongside an OTel GenAI model call.
| Convention | Detection signal | Session attribute |
|---|---|---|
| OpenInference | openinference.span.kind, llm.*, tool.*, retrieval.*, embedding.* |
session.id |
| OTel GenAI semconv | any gen_ai.* attribute or span event |
gen_ai.conversation.id |
| Vercel AI SDK | ai.* attributes |
ai.telemetry.metadata.sessionId |
| MLflow | mlflow.* attributes |
session.id or mlflow.trace.session |
| Traceloop / OpenLLMetry | traceloop.* attributes |
traceloop.association.properties.session_id |
| Google ADK (OTel GenAI extension) | gcp.vertex.agent.* attributes |
gcp.vertex.agent.session_id |
| Generic | input.value / output.value only |
— |
When multiple session attributes are present on the same span, otela
picks one with this precedence (most-standard first):
gen_ai.conversation.id → session.id → gcp.vertex.agent.session_id
→ ai.telemetry.metadata.sessionId → mlflow.trace.session →
traceloop.association.properties.session_id. The matched attribute is
promoted into the session_id column and removed from
raw_attributes_json to avoid duplication.
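Conceptually this is a first-match scan over the precedence list. A simplified sketch of the idea (not otela's actual internals):

```python
# Recognized session attributes, most-standard first.
SESSION_ATTR_PRECEDENCE = [
    "gen_ai.conversation.id",
    "session.id",
    "gcp.vertex.agent.session_id",
    "ai.telemetry.metadata.sessionId",
    "mlflow.trace.session",
    "traceloop.association.properties.session_id",
]

def pick_session_id(attributes: dict) -> str | None:
    """Return the highest-precedence session attribute present, if any."""
    for key in SESSION_ATTR_PRECEDENCE:
        if key in attributes:
            return str(attributes[key])
    return None
```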
Design Principles
- Built for scale. PyArrow + Parquet is the canonical internal representation. The reader is generator-based; to_parquet() streams with bounded memory (a TB of input works the same as a 10 MB file). Pandas is a thin convenience adapter for the in-memory case.
- ML-focused output. The eventual goal is a clean data-prep layer for training agentic models. That motivates the multi-table normalized layout, explicit nullability on numeric columns, and the to_tensors() adapter.
- Schema-stable. Every row carries spec and spec_version. Schemas are tested for drift on every fixture run.
- No information loss. Anything the convention extractor doesn't promote into a typed column lands in raw_attributes_json.
- Zero-cost optional deps. import otela works without pandas or torch installed; calling the adapter raises a clear ImportError pointing at the right extra (a minimal sketch of the pattern follows this list).
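The last point is the usual lazy-import pattern for optional extras. A minimal sketch of the idea, not otela's actual code:

```python
def to_dfs(traces):
    """Convert otela's Arrow tables to pandas DataFrames (needs the 'pandas' extra)."""
    try:
        import pandas  # noqa: F401  # imported lazily so `import otela` never requires it
    except ImportError as exc:
        raise ImportError(
            "to_dfs() requires pandas; install it with: pip install 'otela[pandas]'"
        ) from exc
    return {name: table.to_pandas() for name, table in traces.items()}
```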
Roadmap
Next up
- LlamaIndex real-trace fixture (OpenInference). Closes the last major SDK coverage gap: RETRIEVER and EMBEDDING span kinds have only been validated against synthetic fixtures so far; LangGraph + ADK don't exercise them. Same scripts/generate_fixtures.py harness; expected to surface deeper indexed attributes (retrieval.documents.N.document.metadata.*) that may warrant promotion from raw_attributes_json.
- HuggingFace dataset adapters, after LlamaIndex. Most agent-trace datasets on HF aren't OTLP-shaped — typically conversation logs in parquet/jsonl, sometimes OTel exports in vendor-specific JSON. Each dataset usually needs a small adapter that reshapes its rows into our RawSpan iterator; the rest of the pipeline (normalize → builder → schemas) doesn't change. Sequenced after LlamaIndex so the spec is rock-solid before discovering data-shape issues at scale.
Later
- Phoenix / Langfuse native export readers. Both speak OTLP on ingest, but their export formats are vendor-shaped. If you have traces in those backends today, the fastest path is configuring an OTLP file dump on the backend; native readers are a convenience layer worth adding once we see real demand.
- wg/v1 workflow-graph spec
- Tokenizer-aware to_tensors() mode for LLM fine-tuning (input_ids / attention_mask per message)
- Streaming torecords writer (per-trace flush as soon as a trace is observably complete)
- Hive-partitioned Parquet output (/service=foo/date=2026-04-22/...) for direct DuckDB / Spark consumption
- Migration utilities once a second at version exists
If you have production traces that would make a good test fixture, please open an issue.
Development
git clone https://github.com/zach-blumenfeld/otela
cd otela
uv sync # installs runtime + dev deps (pytest, ruff, torch, pandas)
Project layout:
src/otela/
├── schemas.py # Arrow schemas — single source of truth
├── otlp.py # OTLP/JSON parsing helpers
├── reader.py # Streaming OTLP/JSON file/directory iterator
├── normalize.py # Convention detection + at/v1 extraction
├── builder.py # Column buffers + per-trace accumulators -> Arrow tables
├── api.py # load(), to_dfs(), to_dicts(), to_parquet(), dims()
├── tensors.py # to_tensors() — optional torch dependency
└── cli.py # otela totables / otela torecords
Running tests
uv run pytest -q # full suite (synthetic fixtures only)
uv run pytest -v # verbose, shows each test name
uv run pytest tests/test_to_dicts.py # one file
uv run pytest -k traces # by name pattern
Tests against real-trace fixtures (tests/test_real_traces.py) are
skipped automatically when their fixture file doesn't exist. To run
them, generate the fixture first — see "Generating real-trace fixtures"
below. Without the fixture you'll see something like:
92 passed, 11 skipped
That's expected — the suite is green; the skips are real-trace tests waiting on a regenerated fixture.
Linting
uv run ruff check src/ tests/ scripts/
uv run ruff check --fix src/ tests/ scripts/ # auto-fix
CI runs both pytest and ruff check — both must be green.
Generating real-trace fixtures
Synthetic fixtures (examples/sample.json, examples/sample_genai.json)
cover the spec, but production SDKs surface shape edge cases that
hand-written fixtures don't. The scripts/generate_fixtures.py harness
runs minimal example agents under real instrumentation and commits the
resulting OTLP/JSON to tests/fixtures/real/ so the test suite can
assert against them.
Currently supported sources:
- langgraph — LangGraph React agent + OpenInference instrumentation
- adk — Google ADK agent via LiteLLM (OTel GenAI semconv)
Generate fixtures:
uv sync --group fixtures
export OPENAI_API_KEY=sk-...
uv run python scripts/generate_fixtures.py langgraph
uv run python scripts/generate_fixtures.py adk
Outputs:
- tests/fixtures/real/langgraph_research_agent.json
- tests/fixtures/real/adk_research_agent.json
The ADK generator routes through LiteLLM to OpenAI under the hood — ADK's
OTel emission is independent of the model backend, so a gpt-4o-mini-served
trace exercises the same gen_ai.* events code path as a Vertex Gemini
one. No GCP/Vertex setup required; your OPENAI_API_KEY is enough.
The corresponding tests in tests/test_real_traces.py skip
automatically if the fixture is missing — contributors who don't
regenerate fixtures still get a green suite.
Quick-summary any otela-readable trace file:
uv run python scripts/inspect_fixture.py tests/fixtures/real/langgraph_research_agent.json
License