OpenTelemetry for ML Systems: Practical Observability That Explains What Happened

Imagine debugging a modern ML product without observability. It is like managing an airport where planes keep arriving late, bags go missing, and passengers complain, but the control tower only tells you, “something is wrong.” You do not know whether the delay came from ground operations, weather, security, or the gate assignment.

A single prediction or generation request in ML can travel through an API gateway, a feature store, a cache, a model server, a vector database, and an external LLM provider before the user sees a response. When latency spikes, costs rise, or output quality slips, the root cause is rarely one broken function. It is usually spread across data, infrastructure, model logic, and downstream dependencies.

OpenTelemetry is an observability framework and toolkit that enables the generation, export, and collection of telemetry data, including traces, metrics, and logs.

OpenTelemetry (OTel) gives you the control-tower view. It is an open standard and ecosystem for collecting telemetry. In practice, it helps teams answer questions like these:

Where did this request actually spend time?
Which dependency failed first?
Did the regression affect one model version, one prompt shape, or one traffic segment?
Did the rollout increase latency, token usage, or error rate?

One of OpenTelemetry’s primary goals is to simplify the instrumentation of applications and systems, regardless of the programming language, infrastructure, or runtime environment involved. The storage (backend) and visualization (frontend) of telemetry data are intentionally delegated to other tools.

For ML systems, being able to answer questions like the ones above is not cosmetic. It is the difference between guessing and knowing.

1. Why observability feels harder in ML systems

Traditional software systems are already complex. ML systems add a second layer of uncertainty: even when the software path is stable, the behavior of the system can still shift.

There are four recurring reasons for this:

Data changes even when code does not:
Schemas evolve, fields disappear, null rates climb, and value distributions drift. A serving stack can remain technically healthy while the model receives data it was not prepared for.
Model behavior is part of the production surface:
A new model version may be more accurate but slower. A quantized model may reduce cost but behave differently on certain hardware. An LLM prompt change may improve one class of queries while sharply increasing tokens per request.
Infrastructure shapes output quality and latency:
Autoscaling lag, GPU contention, cold starts, network jitter, and cache eviction policies all change the user experience. In ML systems, infrastructure is not a background detail. It directly changes what users observe.
Dependencies multiply quickly:
Modern ML systems often depend on feature stores, vector databases, artifact registries, workflow schedulers, model gateways, safety filters, and third-party inference APIs. Each dependency adds both latency and new failure modes.

This is why debugging is so uncomfortable: symptoms and causes are often separated in both time and place. A slow response may be caused by an upstream cache miss. A bad output may be caused by stale features produced hours earlier. A rollout may look healthy in aggregate while failing only for a specific region, tenant tier, or context length bucket.

OpenTelemetry does not magically fix these problems. It makes them visible enough to debug.

Visualizing the Logging Architecture Evolution

Figure: OpenTelemetry log architecture evolution (image source: Alibaba Cloud Blog)

The diagram shows a clear transition from chaos to order. On the left, the “old way” is a tangled web in which logs, metrics, and traces are completely siloed. Each telemetry type has its own agent, its own collection pipeline, and its own backend.

On the right, the OpenTelemetry architecture introduces a unified pipeline. The central piece is the OpenTelemetry Collector, which acts as a universal translator and router. Instead of flowing through separate channels, logs, traces, and metrics pass through the Collector and are forwarded to a unified backend, where they can be correlated with one another.

What OpenTelemetry is, and what it is not

It helps to think of OpenTelemetry as three layers working together:

A standard for representing telemetry data.
APIs and SDKs for instrumenting applications and services.
A Collector for receiving, processing, and exporting telemetry.

OpenTelemetry is not a dashboarding or storage product. You still need a backend such as Grafana, Prometheus, Datadog, Honeycomb, New Relic, Elastic, or OpenSearch to store, query, and visualize the data.

That separation is one of OTel’s strongest advantages: instrument once, keep backend options open.

2. The Three Pillars of Observability

The easiest way to remember the three signals is this:

Traces tell the story of one request.
Metrics summarize the health of many requests.
Logs capture the detailed evidence at important moments.

Used together, these signals let you move from “something is wrong” to “this specific dependency caused this specific class of failures.”

2.1 Traces: the path of a single request

A trace represents the lifecycle of a single request or workflow. It is composed of spans, and each span represents one unit of work, such as:

receiving an HTTP request
fetching features from an online store
querying a vector database
calling an external LLM provider
running model inference
writing output to storage

Traces are the best tool for answering questions about:

latency breakdowns
dependency bottlenecks
timeout propagation
retry behavior
where a failure first appeared

If p99 latency jumps, traces help you determine whether the time is being spent in retrieval, preprocessing, model inference, prompt construction, or a downstream provider call.

2.2 Metrics: the shape of system health over time

Metrics are numeric time series. In practice, the most useful metric types are counters, gauges, and histograms.

For ML systems, common metrics include:

request rate
error rate
inference latency histograms
queue depth
cache hit rate
GPU memory utilization
tokens per request and tokens per second
retrieval hit rate or reranker latency

Metrics are what you use for dashboards, alerting, service-level indicators, and capacity planning. They answer questions like, “Is this system healthy right now?” and “Is it getting worse?”
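To make the histogram idea concrete, here is a minimal sketch in plain Python of how bucketed latency counts can answer percentile questions without storing every request. It is not the OpenTelemetry metrics API; the class name, method names, and bucket bounds are assumptions chosen for illustration.

```python
from bisect import bisect_left


class LatencyHistogram:
    """Explicit-bucket histogram; each bound is an inclusive upper limit."""

    def __init__(self, bounds_ms: list[float]):
        self.bounds_ms = sorted(bounds_ms)
        # One extra bucket at the end collects values above the last bound.
        self.counts = [0] * (len(self.bounds_ms) + 1)
        self.total = 0

    def record(self, value_ms: float) -> None:
        # bisect_left places a value equal to a bound into that bound's bucket.
        self.counts[bisect_left(self.bounds_ms, value_ms)] += 1
        self.total += 1

    def quantile_upper_bound(self, q: float) -> float:
        """Return the upper bound of the bucket containing the q-th quantile."""
        if self.total == 0:
            raise ValueError("no observations recorded")
        target = q * self.total
        running = 0
        for index, count in enumerate(self.counts):
            running += count
            if running >= target:
                if index < len(self.bounds_ms):
                    return self.bounds_ms[index]
                return float("inf")  # quantile fell into the overflow bucket
        return float("inf")


if __name__ == "__main__":
    hist = LatencyHistogram([50, 100, 250, 500])  # hypothetical bounds in ms
    for latency in [40] * 90 + [120] * 9 + [900]:
        hist.record(latency)
    print("p50 is at most", hist.quantile_upper_bound(0.50), "ms")
    print("p95 is at most", hist.quantile_upper_bound(0.95), "ms")
```

The answer is only as precise as the bucket layout, which is why bucket bounds are worth choosing around the latency targets you actually alert on.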

2.3 Logs: the detailed local evidence

Logs record discrete events with richer local context than metrics usually carry. Good ML-system logs often capture:

validation failures
exception traces
model loading warnings
feature schema mismatches
provider response codes
business decisions with structured fields

On their own, logs can become noisy very quickly. Their real value appears when they are correlated with traces using shared trace and span IDs. Then you can move from a latency spike to the exact request and the exact warning or exception that explains it.
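The correlation described above can be sketched with the standard library alone. The example below attaches the active trace and span IDs to every log record and emits structured JSON. The `current_trace` variable is a hypothetical stand-in: in a real service the IDs would come from the tracing SDK rather than being set by hand, and the sample IDs are placeholders.

```python
import contextvars
import json
import logging

# Hypothetical holder for the active trace context (an assumption for this
# sketch; a tracing SDK would normally supply these values).
current_trace = contextvars.ContextVar("current_trace", default=None)


class TraceContextFilter(logging.Filter):
    """Copy the active trace and span IDs onto every log record."""

    def filter(self, record: logging.LogRecord) -> bool:
        ctx = current_trace.get() or {}
        record.trace_id = ctx.get("trace_id", "")
        record.span_id = ctx.get("span_id", "")
        return True  # never drop the record, only enrich it


class JsonFormatter(logging.Formatter):
    """Emit one JSON object per line so a backend can index the fields."""

    def format(self, record: logging.LogRecord) -> str:
        return json.dumps(
            {
                "level": record.levelname,
                "message": record.getMessage(),
                "trace_id": record.trace_id,
                "span_id": record.span_id,
            }
        )


def build_logger(name: str) -> logging.Logger:
    logger = logging.getLogger(name)
    handler = logging.StreamHandler()
    handler.addFilter(TraceContextFilter())
    handler.setFormatter(JsonFormatter())
    logger.addHandler(handler)
    logger.setLevel(logging.INFO)
    return logger


if __name__ == "__main__":
    log = build_logger("feature_validation")
    current_trace.set(
        {"trace_id": "a25279d6bd4a89a1864acf3093591a8c", "span_id": "dd3b3d3c745b5192"}
    )
    log.warning("feature schema mismatch: missing field 'country'")
```

With the IDs present as structured fields, a backend can jump from a slow trace to exactly the warnings emitted while that trace was active.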

2.4 A practical debugging sequence

In mature systems, the three signals usually work in this order:

Metrics tell you that a problem exists.
Traces show where the time or failure accumulated.
Logs explain the local condition that caused it.

That sequence is worth remembering because it mirrors how most production investigations unfold.

Figure: opentelemetry-illustration-2 (image adapted from a SigNoz page)

3. How OpenTelemetry connects everything under the hood

The most important technical idea in OpenTelemetry is context propagation. Returning to our airport analogy, imagine context propagation as the unique barcode assigned to a passenger’s checked luggage. No matter which conveyor belt, security scanner, or baggage handler touches the suitcase, they all scan the barcode. This ensures the bag can always be traced back to the specific passenger and flight.

When a request enters your system, it carries tracing context, usually through the W3C traceparent header. Each service extracts that context, creates its own spans, and injects the updated context into downstream requests. That is how one logical request remains connected even as it moves across many physical services.

Without context propagation, telemetry fragments into isolated pieces. You still have spans, but you no longer have a coherent story.
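The extract, create, and inject cycle can be sketched by hand to show what actually travels in the header. The W3C `traceparent` value has four lowercase-hex fields: version, trace ID, parent span ID, and flags. The code below is a simplified illustration using only the standard library, not the SDK's propagator; the function names are assumptions, and it handles only the fixed four-field layout.

```python
import re
import secrets
from typing import Optional

# W3C Trace Context layout: version-traceid-parentid-flags, lowercase hex.
TRACEPARENT_RE = re.compile(
    r"^(?P<version>[0-9a-f]{2})-(?P<trace_id>[0-9a-f]{32})-"
    r"(?P<parent_id>[0-9a-f]{16})-(?P<flags>[0-9a-f]{2})$"
)


def extract(headers: dict) -> Optional[dict]:
    """Return the incoming trace context, or None if absent or malformed."""
    match = TRACEPARENT_RE.match(headers.get("traceparent", ""))
    if match is None:
        return None
    ctx = match.groupdict()
    # All-zero IDs are invalid according to the specification.
    if set(ctx["trace_id"]) == {"0"} or set(ctx["parent_id"]) == {"0"}:
        return None
    return ctx


def start_child(parent: Optional[dict]) -> dict:
    """Keep the caller's trace_id, but mint a new span_id for this hop."""
    return {
        "trace_id": parent["trace_id"] if parent else secrets.token_hex(16),
        "span_id": secrets.token_hex(8),
        "flags": parent["flags"] if parent else "01",
    }


def inject(span: dict, headers: dict) -> dict:
    """Write the current span into outgoing headers for the next service."""
    headers["traceparent"] = (
        f"00-{span['trace_id']}-{span['span_id']}-{span['flags']}"
    )
    return headers


if __name__ == "__main__":
    incoming = {
        "traceparent": "00-4bf92f3577b34da6a3ce929d0e0e4736-00f067aa0ba902b7-01"
    }
    span = start_child(extract(incoming))
    print(inject(span, {}))
```

The trace ID stays constant across hops while the span ID changes at each one, which is exactly what lets a backend stitch the spans back into one trace.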

3.1 A useful mathematical model: a trace as a graph

Formally, we can represent a trace $T$ as a directed acyclic graph (DAG):

$$
T = (S, E)
$$

where:

$S = \{s_1, s_2, \dots, s_n\}$ is the set of spans representing discrete units of work.
$E$ is the set of parent-child edges between spans, denoting causality. If $(s_i, s_j) \in E$, then span $s_i$ triggered span $s_j$.

Each span $s_i$ has at least:

a trace_id that is globally unique to the trace and shared by all of its spans
a span_id that is unique within that trace
a start time $t_i^{start}$
an end time $t_i^{end}$
optional attributes, events, and status

Its duration is calculated simply as:

$$
d_i = t_i^{end} - t_i^{start}
$$

But end-to-end user latency is not simply the sum of all span durations, because some branches run in parallel. What matters operationally is the critical path: the longest dependency chain that determines the final response time.

That distinction matters a great deal in ML systems. If feature fetch and prompt templating run in parallel, optimizing the shorter branch may produce no user-visible improvement at all.
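A small worked example makes the difference between summed durations and the critical path visible. The sketch below assumes a simplified model: one parent span, a flat list of child spans with millisecond timestamps, and a backwards walk that repeatedly picks the child that finished last before the current cursor. The span names and timings are invented for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Span:
    name: str
    start_ms: float
    end_ms: float

    @property
    def duration_ms(self) -> float:
        return self.end_ms - self.start_ms


def critical_path(children: list[Span], parent_end_ms: float) -> list[Span]:
    """Walk backwards from the parent's end time.

    At each step, take the child that ended last at or before the cursor,
    then move the cursor to that child's start time.
    """
    path: list[Span] = []
    cursor = parent_end_ms
    for span in sorted(children, key=lambda s: s.end_ms, reverse=True):
        if span.end_ms <= cursor:
            path.append(span)
            cursor = span.start_ms
    return list(reversed(path))


if __name__ == "__main__":
    root = Span("inference.request", 0, 100)
    children = [
        Span("features.fetch", 0, 40),
        Span("prompt.template", 0, 10),  # runs in parallel with features.fetch
        Span("model.infer", 40, 95),
    ]
    path = critical_path(children, root.end_ms)
    print("sum of child durations:", sum(s.duration_ms for s in children))
    print("root duration:", root.duration_ms)
    print("critical path:", [s.name for s in path])
```

In this example the prompt templating span never appears on the critical path, so making it faster would not change the 100 ms the user observes.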

3.2 Spans, attributes, events, resources, and status

These terms are easy to blur together, so it is worth separating them clearly.

Span attributes

Attributes are key-value metadata attached to a span. They make traces filterable and analyzable. Common examples include:

http.method
http.route
db.system
ml.model.name
ml.model.version
gen_ai.request.model

For ML systems, attributes should answer questions that operators repeatedly ask, but they should remain low-cardinality.

Span events

Events are timestamped annotations within a span. They are ideal for important moments that do not deserve their own span, such as:

cache_miss
retry_started
schema_validation_failed
fallback_model_invoked
guardrail_blocked_output

Events add narrative detail without turning traces into forests of tiny spans.

Resource attributes

Resource attributes describe the process, service, or runtime producing telemetry. These are critical for grouping and filtering:

service.name
service.version
deployment.environment
cloud, host, container, or Kubernetes metadata

If service.name is inconsistent across deployments, dashboards and traces become difficult to interpret almost immediately.

Status

Span status indicates whether an operation succeeded or failed. It takes one of three values: Unset (the default), Ok, or Error. Status is especially useful for debugging, alerting logic, and selective sampling strategies that retain a higher fraction of failures.
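As a rough illustration of status-aware sampling, the function below keeps every failed trace, every slow trace, and a small deterministic fraction of the rest. It is a sketch of the decision logic only, not a Collector or SDK sampler; the function name, threshold, and baseline rate are assumptions.

```python
import hashlib


def keep_trace(
    trace_id: str,
    has_error: bool,
    duration_ms: float,
    slow_threshold_ms: float = 500.0,  # hypothetical latency threshold
    baseline_rate: float = 0.05,       # hypothetical fraction of healthy traces kept
) -> bool:
    """Decide whether a completed trace should be retained."""
    if has_error:
        return True  # failures are rare and valuable, so keep all of them
    if duration_ms >= slow_threshold_ms:
        return True  # slow requests usually deserve a closer look
    # Hash the trace ID into [0, 1) so the same trace always gets the
    # same decision, regardless of which process evaluates it.
    digest = hashlib.sha256(trace_id.encode("ascii")).digest()
    bucket = int.from_bytes(digest[:8], "big") / 2**64
    return bucket < baseline_rate


if __name__ == "__main__":
    trace_id = "a25279d6bd4a89a1864acf3093591a8c"
    print(keep_trace(trace_id, has_error=True, duration_ms=20))
    print(keep_trace(trace_id, has_error=False, duration_ms=900))
    print(keep_trace(trace_id, has_error=False, duration_ms=20))
```

Because the decision depends on the whole trace (its final status and duration), this kind of logic has to run after the trace completes, which is what distinguishes tail sampling from head sampling.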

3.3 Where context propagation breaks in practice

Context propagation sounds simple in diagrams, but many real-world trace gaps come from a few recurring breakpoints:

asynchronous task queues where trace context is not forwarded in message headers
background threads or worker pools that start work without the current context
service boundaries implemented with custom HTTP or RPC wrappers that forget to inject headers
batch jobs that fan out work across partitions without preserving parent-child relationships

This is important in ML platforms because pipelines often combine synchronous APIs, asynchronous workers, and scheduled jobs. If trace context is dropped at any handoff, you still collect telemetry, but you lose the end-to-end narrative. In practice, this often feels like opening a detective novel and discovering that every third chapter is missing.
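The worker-pool breakpoint is easy to reproduce with the standard library. The sketch below assumes the tracing context lives in a `contextvars` variable, which is how Python's OpenTelemetry implementation stores it by default; `active_trace_id` and `run_with_context` are hypothetical names. A plain `submit` runs the task in the worker thread's own context, while copying the caller's context carries the trace along.

```python
import contextvars
from concurrent.futures import Future, ThreadPoolExecutor

# Hypothetical stand-in for the active trace context.
active_trace_id = contextvars.ContextVar("active_trace_id", default=None)


def score_partition(partition: int) -> tuple:
    # Reads whatever context is active in the thread running this function.
    return partition, active_trace_id.get()


def run_with_context(pool: ThreadPoolExecutor, fn, *args) -> Future:
    """Snapshot the caller's context and run fn inside it on a worker."""
    ctx = contextvars.copy_context()
    return pool.submit(ctx.run, fn, *args)


if __name__ == "__main__":
    active_trace_id.set("a25279d6bd4a89a1864acf3093591a8c")
    with ThreadPoolExecutor(max_workers=2) as pool:
        # On standard CPython builds the worker typically sees no trace ID here.
        plain = pool.submit(score_partition, 0).result()
        # Here the worker sees the caller's trace ID.
        propagated = run_with_context(pool, score_partition, 1).result()
    print("plain submit:", plain)
    print("with copied context:", propagated)
```

The same principle applies to queues and batch fan-out: whoever hands work to another executor is responsible for handing the context over with it.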

The Solution: Span Links

Some of these gaps are not dropped context at all: fan-in and batch workloads simply do not fit a single parent-child chain. For those cases, OpenTelemetry provides Span Links. While parent-child edges describe direct causality within one trace, a link associates a span with one or more other spans, in the same trace or in entirely different traces. For example, a batch inference job cannot have thousands of HTTP requests as direct “parents.” Instead, the batch span carries a link to the span context (trace ID and span ID) of each message it pulled off the queue, preserving the narrative without forcing every request into one parent-child hierarchy.
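The data relationship behind links can be modeled in a few lines of plain Python. This is a sketch of the idea, not SDK code: `LinkTarget`, `BatchSpan`, and the message layout are assumptions, and each queued message is assumed to carry a `traceparent` header from the request that produced it. In the Python SDK, the equivalent targets would be passed as `opentelemetry.trace.Link` objects when the batch span is started.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass(frozen=True)
class LinkTarget:
    """Identifies a span in another trace: the pair a link points at."""
    trace_id: str
    span_id: str


@dataclass
class BatchSpan:
    name: str
    trace_id: str  # the batch job's own trace
    links: list = field(default_factory=list)


def link_target_from_message(message: dict) -> Optional[LinkTarget]:
    """Read the producer's trace context from a message's headers."""
    header = message.get("headers", {}).get("traceparent", "")
    parts = header.split("-")
    if len(parts) != 4:
        return None  # the producer did not forward any context
    _version, trace_id, span_id, _flags = parts
    return LinkTarget(trace_id, span_id)


def start_batch_span(name: str, batch_trace_id: str, messages: list) -> BatchSpan:
    """Create one batch span that links back to every originating request."""
    links = []
    for message in messages:
        target = link_target_from_message(message)
        if target is not None and target not in links:
            links.append(target)
    return BatchSpan(name, batch_trace_id, links)


if __name__ == "__main__":
    messages = [
        {"headers": {"traceparent": "00-" + "a" * 32 + "-" + "1" * 16 + "-01"}},
        {"headers": {"traceparent": "00-" + "b" * 32 + "-" + "2" * 16 + "-01"}},
        {"headers": {}},  # context was dropped upstream
    ]
    batch = start_batch_span("batch.score", "c" * 32, messages)
    print(len(batch.links), "linked requests")
```

The batch span keeps its own trace ID, and the links record which request traces fed into it, so an investigation can move in either direction.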

4. The OpenTelemetry architecture in practice

At a high level, the OTel data path is straightforward:

Your application, job, or service emits telemetry through an SDK or auto-instrumentation.
Telemetry is exported over OTLP (the OpenTelemetry Protocol).
An OpenTelemetry Collector receives, processes, and forwards that data to one or more backends.

4.1 Why the Collector matters so much

The Collector is often the most underestimated part of the stack. In practice, it acts as the operational control plane for observability.

It centralizes concerns such as:

batching
retries
enrichment
redaction
routing to multiple backends
head or tail sampling
rate limiting and buffering

This separation matters even more in ML environments, where prompts, feature values, and user-generated content can easily leak into telemetry unless they are cleaned or dropped before export.
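The redaction step can be illustrated with a short function that cleans span attributes before they leave the process. In a real deployment this logic would more likely be configured in the Collector than written in application code; the sketch below only shows the shape of the transformation. The attribute keys in `SENSITIVE_KEYS` and the e-mail pattern are hypothetical examples.

```python
import re

# Hypothetical attribute keys that should never be exported.
SENSITIVE_KEYS = {"llm.prompt", "llm.completion", "user.email"}

# Deliberately simple pattern; real PII detection needs more care.
EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")


def redact_attributes(attributes: dict) -> dict:
    """Drop sensitive keys and mask e-mail addresses in remaining strings."""
    cleaned = {}
    for key, value in attributes.items():
        if key in SENSITIVE_KEYS:
            continue  # drop the attribute entirely
        if isinstance(value, str):
            value = EMAIL_RE.sub("[REDACTED_EMAIL]", value)
        cleaned[key] = value
    return cleaned


if __name__ == "__main__":
    span_attributes = {
        "ml.model.name": "fraud_detection_xgboost",
        "llm.prompt": "Summarize the dispute filed by jane@example.com",
        "error.message": "lookup failed for jane@example.com",
        "feature.count": 3,
    }
    print(redact_attributes(span_attributes))
```

Dropping whole attributes is the safer default for free-text fields such as prompts, while masking works for fields that are mostly structured but may occasionally contain user data.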

4.2 Common deployment patterns

There is no universal best deployment pattern. The right choice depends on scale, isolation requirements, and operational maturity.

Sidecar Collector: strong per-workload isolation, but higher resource overhead.
DaemonSet or node agent: common in Kubernetes and often a good balance between control and cost.
Central gateway Collector: simpler to manage centrally, but it becomes shared infrastructure and therefore a potential bottleneck.

4.3 Manual instrumentation and auto-instrumentation solve different problems

This distinction is easy to miss when teams first adopt OpenTelemetry.

Auto-instrumentation is excellent for standard boundaries such as HTTP requests, outgoing client calls, SQL queries, and message consumers. It gives you broad coverage quickly.

Manual instrumentation is what adds business meaning. It is how you mark steps like feature_retrieval, rerank_candidates, llm_guardrail_check, or fraud_rule_evaluation.

For ML systems, the strongest pattern is usually to combine both:

let auto-instrumentation capture commodity infrastructure interactions
add manual spans around ML-specific stages that matter for debugging and cost analysis

That combination keeps setup practical while still preserving the domain context that makes traces useful.

5. Where to instrument ML systems

The most common instrumentation mistake is to trace code structure instead of system structure. The most valuable spans usually align with dependency boundaries, not helper functions.

5.1 Online inference APIs

For real-time inference, operators usually need fast answers to questions such as:

What are p50, p95, and p99 latencies?
Which model version is serving traffic?
How much time is spent in feature retrieval, preprocessing, inference, and postprocessing?
Are failures isolated to one region, one tenant tier, or one dependency?

Useful span boundaries often include:

request parsing
authentication and rate limiting
cache lookup
feature store fetch
preprocessing
model inference
postprocessing or policy rules
response serialization

5.2 Batch inference and ETL pipelines

Batch systems need observability just as much as online systems, but the useful granularity is different. Here, you usually care about:

job duration
stage-level duration
retry counts
rows processed, dropped, or quarantined
read and write throughput
data quality failures

A batch job that “succeeds” while quietly processing malformed data is still a production incident. That is why data quality failures deserve first-class treatment as metrics, events, or structured logs.
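One way to make that concrete is to count row outcomes explicitly and let the job report itself as unhealthy when too many rows are quarantined. The sketch below uses only the standard library; the required fields, the outcome names, and the 5% threshold are assumptions for illustration. In an instrumented pipeline, the resulting counts would be emitted as metrics or attached to the job's span.

```python
from collections import Counter
from typing import Optional

REQUIRED_FIELDS = ("user_id", "amount")  # hypothetical schema


def classify_row(row: Optional[dict]) -> str:
    """Label a row as processed, quarantined, or dropped."""
    if row is None:
        return "dropped"  # unreadable input
    if any(row.get(name) is None for name in REQUIRED_FIELDS):
        return "quarantined"  # readable, but fails validation
    return "processed"


def run_batch(rows: list, max_quarantine_ratio: float = 0.05) -> dict:
    """Summarize a batch and flag it when data quality falls below target."""
    outcomes = Counter(classify_row(row) for row in rows)
    total = sum(outcomes.values())
    ratio = outcomes["quarantined"] / total if total else 0.0
    return {
        "rows": dict(outcomes),
        "quarantine_ratio": ratio,
        "healthy": ratio <= max_quarantine_ratio,
    }


if __name__ == "__main__":
    rows = [
        {"user_id": 1, "amount": 5.0},
        {"user_id": 2, "amount": None},
        None,
        {"user_id": 3, "amount": 1.0},
    ]
    print(run_batch(rows))
```

The job above would exit without an exception, yet its summary makes it obvious that a quarter of the input failed validation.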

5.3 Training and fine-tuning pipelines

Training already has specialized tooling such as MLflow and Weights & Biases. OpenTelemetry does not replace those tools; it complements them.

OpenTelemetry is especially useful for:

orchestration step timings
dataset load bottlenecks
storage and registry dependencies
cluster and GPU utilization
artifact movement across services

Loss curves, evaluation plots, and experiment comparisons belong in experiment-tracking tools. Pipeline observability and infrastructure observability fit naturally in OTel.

5.4 LLM and RAG applications

LLM systems are especially good candidates for OpenTelemetry because the user-visible response often depends on several chained components:

prompt construction
retrieval
reranking
tool calls
provider latency
output parsing
safety checks and guardrails

Good instrumentation helps answer subtle but important questions:

Did latency come from retrieval or generation?
Did a prompt-template change increase token usage?
Did one provider or one deployment fail more often?
Are failures correlated with long contexts, large tool outputs, or specific model families?

6. A concrete example: from symptom to root cause

Suppose a fraud-scoring endpoint becomes slower after a rollout. From the outside, it looks like one API call. Inside, it is several smaller steps stitched together:

fetch features
preprocess the feature vector
run model inference
apply business rules

If you wrap the entire flow in one giant span, you only learn that the request was slow. If you instrument each meaningful boundary, you learn why it was slow.

6.1 Instrumented Python example

Below is a simplified example of how to instrument a fraud inference API. The code is intentionally simple to focus on the instrumentation patterns rather than the business logic.

Python

import random
import time

# `trace` is the high-level OpenTelemetry API used to create spans.
from opentelemetry import trace
# `Resource` describes the service that emits telemetry.
from opentelemetry.sdk.resources import Resource
# `TracerProvider` holds tracing configuration for this process.
from opentelemetry.sdk.trace import TracerProvider
# A span processor decides how spans are buffered and exported.
from opentelemetry.sdk.trace.export import BatchSpanProcessor, ConsoleSpanExporter


# Resource attributes identify the service independently of individual requests.
resource = Resource.create(
    {
        "service.name": "fraud-inference-api",
        "service.version": "2.1.0",
        "deployment.environment": "dev",
    }
)

# The provider owns the tracer configuration for the current process.
provider = TracerProvider(resource=resource)
# Batch exporting is closer to production behavior than exporting every span immediately.
# The console exporter is useful for learning because you can inspect spans locally.
provider.add_span_processor(BatchSpanProcessor(ConsoleSpanExporter()))
# Register this provider globally so future calls to `trace.get_tracer(...)` use it.
trace.set_tracer_provider(provider)

# A tracer is the object that creates spans for a specific module or subsystem.
tracer = trace.get_tracer("fraud_inference")


def fetch_features(user_id: str) -> dict:
    # Simulate an I/O-bound dependency such as Redis or an online feature store.
    time.sleep(random.uniform(0.01, 0.05))
    return {"age": 30, "transaction_count": 5, "country": "US"}


def preprocess(features: dict) -> list[float]:
    # Simulate lightweight feature transformation before inference.
    time.sleep(0.01)
    return [features["age"], features["transaction_count"]]


def predict_proba(x: list[float]) -> float:
    # Simulate model execution latency and return a fake probability.
    time.sleep(random.uniform(0.08, 0.20))
    return random.uniform(0.0, 1.0)


def handle_inference_request(user_id: str) -> dict:
    # The root span represents the full user-visible request.
    with tracer.start_as_current_span("inference.request") as request_span:
        # These attributes make the trace searchable by model identity and request type.
        request_span.set_attribute("ml.model.name", "fraud_detection_xgboost")
        request_span.set_attribute("ml.model.version", "v2.1.0")
        request_span.set_attribute("ml.inference.framework", "xgboost")
        request_span.set_attribute("ml.request.type", "single")

        # Child spans break the request into meaningful operational stages.
        with tracer.start_as_current_span("features.fetch") as feature_span:
            features = fetch_features(user_id)
            # Prefer stable, low-cardinality attributes that help debugging.
            feature_span.set_attribute("feature.store", "redis_online_store")
            feature_span.set_attribute("feature.count", len(features))

        # Preprocessing is its own stage because it can regress independently.
        with tracer.start_as_current_span("features.preprocess"):
            x = preprocess(features)

        # Model inference is often the dominant latency contributor in ML APIs.
        with tracer.start_as_current_span("model.infer") as infer_span:
            score = predict_proba(x)
            infer_span.set_attribute("ml.inference.batch_size", 1)
            # Bucketed outputs are often safer than logging raw values at large scale.
            infer_span.set_attribute(
                "ml.output.score_bucket",
                "high" if score > 0.8 else "normal",
            )

        # Postprocessing captures business logic that turns scores into decisions.
        with tracer.start_as_current_span("postprocess") as post_span:
            decision = score > 0.8
            post_span.set_attribute("ml.decision", "block" if decision else "allow")
            if decision:
                # Events annotate an important moment inside a span without creating a new span.
                post_span.add_event(
                    "rule_triggered",
                    {"rule_name": "high_risk_threshold"},
                )

        return {"score": score, "decision": decision}


if __name__ == "__main__":
    result = handle_inference_request("user_12345")
    print(result)
    # Force a flush so buffered spans are exported before the script exits.
    provider.force_flush()

Running the script prints the result dictionary, followed by one JSON object per exported span:

JSON

{'score': 0.9170080536451304, 'decision': True}
{
    "name": "features.fetch",
    "context": {
        "trace_id": "0xa25279d6bd4a89a1864acf3093591a8c",
        "span_id": "0xdd3b3d3c745b5192",
        "trace_state": ""
    },
    "kind": "SpanKind.INTERNAL",
    "parent_id": "0x77931b30d77dc765",
    "start_time": "2026-03-10T15:06:49.911350Z",
    "end_time": "2026-03-10T15:06:49.927409Z",
    "status": {
        "status_code": "UNSET"
    },
    "attributes": {
        "feature.store": "redis_online_store",
        "feature.count": 3
    },
    "events": [],
    "links": [],
    "resource": {
        "attributes": {
            "telemetry.sdk.language": "python",
            "telemetry.sdk.name": "opentelemetry",
            "telemetry.sdk.version": "1.38.0",
            "service.name": "fraud-inference-api",
            "service.version": "2.1.0",
            "deployment.environment": "dev"
        },
        "schema_url": ""
    }
}
{
    "name": "features.preprocess",
    "context": {
        "trace_id": "0xa25279d6bd4a89a1864acf3093591a8c",
        "span_id": "0xb587cbed0d7653fa",
        "trace_state": ""
    },
    "kind": "SpanKind.INTERNAL",
    "parent_id": "0x77931b30d77dc765",
    "start_time": "2026-03-10T15:06:49.927624Z",
    "end_time": "2026-03-10T15:06:49.937842Z",
    "status": {
        "status_code": "UNSET"
    },
    "attributes": {},
    "events": [],
    "links": [],
    "resource": {
        "attributes": {
            "telemetry.sdk.language": "python",
            "telemetry.sdk.name": "opentelemetry",
            "telemetry.sdk.version": "1.38.0",
            "service.name": "fraud-inference-api",
            "service.version": "2.1.0",
            "deployment.environment": "dev"
        },
        "schema_url": ""
    }
}
{
    "name": "model.infer",
    "context": {
        "trace_id": "0xa25279d6bd4a89a1864acf3093591a8c",
        "span_id": "0x9a1bd4fa51abae66",
        "trace_state": ""
    },
    "kind": "SpanKind.INTERNAL",
    "parent_id": "0x77931b30d77dc765",
    "start_time": "2026-03-10T15:06:49.938102Z",
    "end_time": "2026-03-10T15:06:50.039371Z",
    "status": {
        "status_code": "UNSET"
    },
    "attributes": {
        "ml.inference.batch_size": 1,
        "ml.output.score_bucket": "high"
    },
    "events": [],
    "links": [],
    "resource": {
        "attributes": {
            "telemetry.sdk.language": "python",
            "telemetry.sdk.name": "opentelemetry",
            "telemetry.sdk.version": "1.38.0",
            "service.name": "fraud-inference-api",
            "service.version": "2.1.0",
            "deployment.environment": "dev"
        },
        "schema_url": ""
    }
}
{
    "name": "postprocess",
    "context": {
        "trace_id": "0xa25279d6bd4a89a1864acf3093591a8c",
        "span_id": "0x2ccf1fc03fd57a49",
        "trace_state": ""
    },
    "kind": "SpanKind.INTERNAL",
    "parent_id": "0x77931b30d77dc765",
    "start_time": "2026-03-10T15:06:50.039566Z",
    "end_time": "2026-03-10T15:06:50.039639Z",
    "status": {
        "status_code": "UNSET"
    },
    "attributes": {
        "ml.decision": "block"
    },
    "events": [
        {
            "name": "rule_triggered",
            "timestamp": "2026-03-10T15:06:50.039614Z",
            "attributes": {
                "rule_name": "high_risk_threshold"
            }
        }
    ],
    "links": [],
    "resource": {
        "attributes": {
            "telemetry.sdk.language": "python",
            "telemetry.sdk.name": "opentelemetry",
            "telemetry.sdk.version": "1.38.0",
            "service.name": "fraud-inference-api",
            "service.version": "2.1.0",
            "deployment.environment": "dev"
        },
        "schema_url": ""
    }
}
{
    "name": "inference.request",
    "context": {
        "trace_id": "0xa25279d6bd4a89a1864acf3093591a8c",
        "span_id": "0x77931b30d77dc765",
        "trace_state": ""
    },
    "kind": "SpanKind.INTERNAL",
    "parent_id": null,
    "start_time": "2026-03-10T15:06:49.911217Z",
    "end_time": "2026-03-10T15:06:50.039657Z",
    "status": {
        "status_code": "UNSET"
    },
    "attributes": {
        "ml.model.name": "fraud_detection_xgboost",
        "ml.model.version": "v2.1.0",
        "ml.inference.framework": "xgboost",
        "ml.request.type": "single"
    },
    "events": [],
    "links": [],
    "resource": {
        "attributes": {
            "telemetry.sdk.language": "python",
            "telemetry.sdk.name": "opentelemetry",
            "telemetry.sdk.version": "1.38.0",
            "service.name": "fraud-inference-api",
            "service.version": "2.1.0",
            "deployment.environment": "dev"
        },
        "schema_url": ""
    }
}

{
    "name": "model.infer",
    "context": {
        "trace_id": "0xa25279d6bd4a89a1864acf3093591a8c",
        "span_id": "0x9a1bd4fa51abae66",
        "trace_state": ""
    },
    "kind": "SpanKind.INTERNAL",
    "parent_id": "0x77931b30d77dc765",
    "start_time": "2026-03-10T15:06:49.938102Z",
    "end_time": "2026-03-10T15:06:50.039371Z",
    "status": {
        "status_code": "UNSET"
    },
    "attributes": {
        "ml.inference.batch_size": 1,
        "ml.output.score_bucket": "high"
    },
    "events": [],
    "links": [],
    "resource": {
        "attributes": {
            "telemetry.sdk.language": "python",
            "telemetry.sdk.name": "opentelemetry",
            "telemetry.sdk.version": "1.38.0",
            "service.name": "fraud-inference-api",
            "service.version": "2.1.0",
            "deployment.environment": "dev"
        },
        "schema_url": ""
    }
}
{
    "name": "postprocess",
    "context": {
        "trace_id": "0xa25279d6bd4a89a1864acf3093591a8c",
        "span_id": "0x2ccf1fc03fd57a49",
        "trace_state": ""
    },
    "kind": "SpanKind.INTERNAL",
    "parent_id": "0x77931b30d77dc765",
    "start_time": "2026-03-10T15:06:50.039566Z",
    "end_time": "2026-03-10T15:06:50.039639Z",
    "status": {
        "status_code": "UNSET"
    },
    "attributes": {
        "ml.decision": "block"
    },
    "events": [
        {
            "name": "rule_triggered",
            "timestamp": "2026-03-10T15:06:50.039614Z",
            "attributes": {
                "rule_name": "high_risk_threshold"
            }
        }
    ],
    "links": [],
    "resource": {
        "attributes": {
            "telemetry.sdk.language": "python",
            "telemetry.sdk.name": "opentelemetry",
            "telemetry.sdk.version": "1.38.0",
            "service.name": "fraud-inference-api",
            "service.version": "2.1.0",
            "deployment.environment": "dev"
        },
        "schema_url": ""
    }
}
{
    "name": "inference.request",
    "context": {
        "trace_id": "0xa25279d6bd4a89a1864acf3093591a8c",
        "span_id": "0x77931b30d77dc765",
        "trace_state": ""
    },
    "kind": "SpanKind.INTERNAL",
    "parent_id": null,
    "start_time": "2026-03-10T15:06:49.911217Z",
    "end_time": "2026-03-10T15:06:50.039657Z",
    "status": {
        "status_code": "UNSET"
    },
    "attributes": {
        "ml.model.name": "fraud_detection_xgboost",
        "ml.model.version": "v2.1.0",
        "ml.inference.framework": "xgboost",
        "ml.request.type": "single"
    },
    "events": [],
    "links": [],
    "resource": {
        "attributes": {
            "telemetry.sdk.language": "python",
            "telemetry.sdk.name": "opentelemetry",
            "telemetry.sdk.version": "1.38.0",
            "service.name": "fraud-inference-api",
            "service.version": "2.1.0",
            "deployment.environment": "dev"
        },
        "schema_url": ""
    }
}

Reading the trace as a tree

One useful way to read this example is to picture the emitted trace as a small tree rather than as a flat list of function calls:

inference.request
├── features.fetch
├── features.preprocess
├── model.infer
└── postprocess

That picture matters because trace viewers usually present latency as a hierarchy. The root span tells you the end-to-end latency seen by the caller. The child spans explain where that time accumulated. In a real system, one of those child spans could itself contain deeper spans such as a Redis call, an HTTP request, or a GPU execution step.
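
To make the tree concrete, the sketch below rebuilds the hierarchy from the parent_id field and prints how long each span took. It uses only the span names, IDs, and timestamps from the console output above. The helper names (parse_ts, duration_ms, render_tree) are illustrative and are not part of OpenTelemetry.

```python
from datetime import datetime

# Names, IDs, and timestamps copied from the console output above,
# in export order (child spans finish before the root span).
ROOT_ID = "0x77931b30d77dc765"
SPANS = [
    {"name": "features.fetch", "span_id": "0xdd3b3d3c745b5192", "parent_id": ROOT_ID,
     "start_time": "2026-03-10T15:06:49.911350Z", "end_time": "2026-03-10T15:06:49.927409Z"},
    {"name": "features.preprocess", "span_id": "0xb587cbed0d7653fa", "parent_id": ROOT_ID,
     "start_time": "2026-03-10T15:06:49.927624Z", "end_time": "2026-03-10T15:06:49.937842Z"},
    {"name": "model.infer", "span_id": "0x9a1bd4fa51abae66", "parent_id": ROOT_ID,
     "start_time": "2026-03-10T15:06:49.938102Z", "end_time": "2026-03-10T15:06:50.039371Z"},
    {"name": "postprocess", "span_id": "0x2ccf1fc03fd57a49", "parent_id": ROOT_ID,
     "start_time": "2026-03-10T15:06:50.039566Z", "end_time": "2026-03-10T15:06:50.039639Z"},
    {"name": "inference.request", "span_id": ROOT_ID, "parent_id": None,
     "start_time": "2026-03-10T15:06:49.911217Z", "end_time": "2026-03-10T15:06:50.039657Z"},
]


def parse_ts(value: str) -> datetime:
    # Swap the trailing "Z" for an explicit offset so older Python versions can parse it.
    return datetime.fromisoformat(value.replace("Z", "+00:00"))


def duration_ms(span: dict) -> float:
    delta = parse_ts(span["end_time"]) - parse_ts(span["start_time"])
    return delta.total_seconds() * 1000.0


def render_tree(spans: list) -> list:
    # Group spans by parent, then walk from the root (parent_id is None).
    children = {}
    for span in spans:
        children.setdefault(span["parent_id"], []).append(span)
    for siblings in children.values():
        siblings.sort(key=lambda s: s["start_time"])

    lines = []

    def walk(parent_id, depth):
        for span in children.get(parent_id, []):
            lines.append(f"{'  ' * depth}{span['name']}  {duration_ms(span):.3f} ms")
            walk(span["span_id"], depth + 1)

    walk(None, 0)
    return lines


if __name__ == "__main__":
    print("\n".join(render_tree(SPANS)))
```

Run against the spans above, this shows model.infer taking about 101 ms of the roughly 128 ms root span, which is the same conclusion a trace viewer's waterfall would give you at a glance.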

6.2 What this example gets right

This code is intentionally simple, but it demonstrates several habits that scale well:

the root span models one user-visible request
each meaningful dependency boundary gets its own child span
resource attributes identify the service and environment
ML-specific attributes add model and decision context
the code records an event for a business-rule trigger
it avoids attaching raw user identifiers or raw feature values to telemetry

That last point is crucial. Telemetry should help you debug production systems, not quietly create privacy, security, or cardinality problems.

6.3 What a real investigation would reveal

Imagine that p95 latency rises after a deployment. A healthy debugging path might look like this:

a latency histogram shows that only the new model version regressed
traces show that model.infer dominates the critical path
correlated logs show repeated warm-up messages on newly scaled pods
the team concludes that autoscaling plus model cold starts caused the regression

Notice what changed: the team did not stop at “the endpoint is slow.” It located the exact stage, the exact traffic segment, and the likely reason.

6.4 What you would change in production

In production, you would usually make four changes:

replace the console exporter with an OTLP exporter that sends data to a Collector
combine manual spans with automatic instrumentation for HTTP frameworks, database clients, and RPC libraries
add metrics such as latency histograms and error counters alongside traces
define a small, stable vocabulary for ML-specific attributes

6.5 Failure handling and status capture you would likely add next

There is one more improvement worth calling out explicitly: the example shows the happy path, but most production value comes from how traces behave on unhappy paths.

In a real inference service, you would usually also:

record exceptions on spans when feature fetch, inference, or postprocessing fails
set span status to error so failed requests are easy to filter
add timeout or retry events when a dependency becomes slow
attach low-cardinality failure metadata such as error.type, dependency.name, or retry.count

This matters because a trace without success or failure semantics is like a flight recorder without altitude data. You can still inspect it, but the fastest route to the root cause is missing.

7. Metrics and logs that pair well with traces

Tracing is often the first place teams start, but the strongest observability setups use all three signals together.

7.1 Metrics worth tracking for model-serving systems

For an inference API, a compact but effective metrics set often includes:

request count
error count
latency histogram
request count by model version
cache hit rate
queue depth
dependency-specific error count
token usage for LLM systems

The key point is that averages are not enough. Average latency can look fine while p99 becomes unacceptable. Histograms are more useful because they let you estimate percentiles from bucket counts and reason about tail behavior.
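
A small worked example makes the gap visible. The latencies and bucket bounds below are made up for illustration, and percentile_upper_bound is a deliberately coarse estimator: it returns the upper bound of the bucket that contains the requested quantile, which is roughly what a backend can infer from bucket counts alone.

```python
import bisect
import statistics

# Hypothetical latencies in ms: 98 fast requests and 2 very slow ones.
latencies = [40.0] * 98 + [2000.0] * 2

# Upper bucket bounds in ms (illustrative); one extra overflow bucket follows them.
BOUNDS = [50, 100, 250, 500, 1000, 2500]


def to_histogram(values, bounds):
    counts = [0] * (len(bounds) + 1)
    for v in values:
        # bisect_left puts a value equal to a bound into that bound's bucket.
        counts[bisect.bisect_left(bounds, v)] += 1
    return counts


def percentile_upper_bound(counts, bounds, q):
    """Upper bound of the bucket containing the q-th quantile (0 < q <= 1)."""
    target = q * sum(counts)
    running = 0
    for i, count in enumerate(counts):
        running += count
        if running >= target:
            return bounds[i] if i < len(bounds) else float("inf")
    return float("inf")


counts = to_histogram(latencies, BOUNDS)
mean_ms = statistics.fmean(latencies)
p50_ms = percentile_upper_bound(counts, BOUNDS, 0.50)
p99_ms = percentile_upper_bound(counts, BOUNDS, 0.99)
print(f"mean={mean_ms:.1f} ms  p50<={p50_ms} ms  p99<={p99_ms} ms")
```

With this data the mean is 79.2 ms and the median sits in the 50 ms bucket, yet the p99 estimate lands in the 2500 ms bucket. An alert on the average would stay quiet while one request in fifty takes two seconds.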

7.2 Logs worth correlating

Structured logs are most helpful when they capture state transitions and unusual conditions, such as:

validation errors
retry attempts
upstream rate-limit responses
model load and warm-up messages
schema or feature availability warnings
fallback-model activation

The rule of thumb is simple: log enough to explain failures, but not so much that the signal disappears into noise.

7.3 The missing bridge: exemplars and correlation fields

One advanced but highly practical idea is to make it easy to jump from an aggregate metric to an individual trace.

There are two common ways to do that:

Exemplars: a metric data point, such as a histogram bucket, can carry a reference to a representative trace, so a backend that supports exemplars can link a spike on a chart directly to an example request.
Correlation fields in logs: including trace and span identifiers in structured logs lets operators pivot from a suspicious request trace to the exact log lines emitted during that request.

This bridge is powerful because investigations rarely begin with a trace. They usually begin with a chart, an alert, or a log search. The easier it is to pivot across signals, the faster teams move from symptom to root cause.

8. Production practices that matter

8.1 Instrument dependency boundaries, not every function

The best spans usually mark boundaries like these:

external network calls
storage lookups
model invocations
expensive CPU or GPU stages
orchestration transitions

If every helper function becomes a span, traces become dense but not informative.

8.2 Control cardinality aggressively

One of the fastest ways to make observability expensive and hard to query is to attach highly unique values to attributes. Common mistakes include:

raw user IDs
full prompts
full completion text
session IDs
unbounded URLs with query strings

High-cardinality attributes explode storage costs, degrade query performance, and make metrics backends harder to operate.

Prefer bounded alternatives such as:

prompt length buckets
token counts
tenant tier instead of tenant ID
route templates instead of full URLs
model version instead of run-specific artifact paths
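
A few of these substitutions can be sketched with small helpers. Everything below is illustrative: the bucket thresholds, the attribute keys, and the route_template heuristic are assumptions for this example, and in practice an HTTP framework's instrumentation usually supplies the route template for you.

```python
import re


def length_bucket(n_chars: int) -> str:
    # Illustrative thresholds; pick bounds that match your own traffic.
    if n_chars < 256:
        return "short"
    if n_chars < 2048:
        return "medium"
    if n_chars < 16384:
        return "long"
    return "very_long"


def route_template(url_path: str) -> str:
    """Drop the query string and replace purely numeric path segments with {id}."""
    path = url_path.split("?", 1)[0]
    return re.sub(r"/\d+(?=/|$)", "/{id}", path)


def bounded_attributes(prompt: str, url_path: str, tenant_tier: str) -> dict:
    # Only bounded values are attached; the raw prompt and URL never leave this function.
    return {
        "llm.prompt.length_bucket": length_bucket(len(prompt)),
        "http.route": route_template(url_path),
        "tenant.tier": tenant_tier,
    }


print(bounded_attributes("Summarize this invoice", "/v1/users/12345/orders?limit=10", "enterprise"))
```

Each attribute now has a handful of possible values, so it stays cheap to index and useful to group by.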

8.3 Keep naming consistent

OpenTelemetry already defines semantic conventions for HTTP, databases, messaging, and RPC, and its generative-AI conventions (the gen_ai.* attributes) are still evolving. Use them where they exist.

For ML- and LLM-specific metadata, establish a small internal vocabulary and keep it consistent. Teams quickly get into trouble when one service emits model.version, another emits ml_model_version, and a third emits modelVersion.
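
One lightweight way to enforce such a vocabulary is a shared module of constants plus a normalizer for legacy spellings. The sketch below is an assumption about how a team might organize this, not an OpenTelemetry feature; the canonical keys follow the ml.* names used in this article's example.

```python
# Canonical attribute names (an internal convention assumed for this sketch).
ML_MODEL_NAME = "ml.model.name"
ML_MODEL_VERSION = "ml.model.version"

# Spellings that have drifted across services, mapped back to the canonical key.
ALIASES = {
    "model.version": ML_MODEL_VERSION,
    "ml_model_version": ML_MODEL_VERSION,
    "modelVersion": ML_MODEL_VERSION,
    "model.name": ML_MODEL_NAME,
}


def normalize_attributes(attrs: dict) -> dict:
    """Rewrite known aliases to canonical keys and leave other keys untouched."""
    normalized = {}
    for key, value in attrs.items():
        canonical = ALIASES.get(key, key)
        if canonical in normalized and normalized[canonical] != value:
            raise ValueError(f"conflicting values for {canonical!r}")
        normalized[canonical] = value
    return normalized


print(normalize_attributes({"modelVersion": "v2.1.0", "feature.count": 3}))
```

Importing the constants in new code and running the normalizer at the edge of older services keeps queries such as "group by ml.model.version" working across the whole fleet.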

8.4 Sample traces deliberately

At scale, keeping every trace may be too expensive. Sampling is the practical answer, but it should reflect how incidents are investigated.

A common pattern is:

sample a small percentage of healthy requests
retain a much larger fraction of error traces
keep full traces for important tenants, rollout canaries, or debugging windows

In other words, do not optimize only for volume. Optimize for investigative value.
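
The pattern above can be written down as a single decision function. This is a plain-Python sketch of the policy, not the OpenTelemetry sampler API: keep_trace, its parameters, and the 5% default are assumptions for illustration. Because a request's error status is only known once it finishes, a rule like this belongs in a tail-sampling stage, for example in a Collector, rather than in a head sampler inside the SDK.

```python
TRACE_ID_MASK = (1 << 64) - 1  # use the low 64 bits of the 128-bit trace ID


def keep_trace(
    trace_id: int,
    has_error: bool,
    is_canary: bool,
    healthy_ratio: float = 0.05,
    error_ratio: float = 1.0,
) -> bool:
    """Decide whether to retain a finished trace."""
    # Always keep traces from rollout canaries or debugging windows.
    if is_canary:
        return True
    ratio = error_ratio if has_error else healthy_ratio
    # Deterministic in the trace ID, so every component reaches the same decision.
    bound = int(ratio * (1 << 64))
    return (trace_id & TRACE_ID_MASK) < bound


# Trace ID taken from the console output earlier in this article.
example_id = 0xA25279D6BD4A89A1864ACF3093591A8C
print(keep_trace(example_id, has_error=False, is_canary=False))
print(keep_trace(example_id, has_error=True, is_canary=False))
```

Deriving the decision from the trace ID instead of a random draw means a trace is either kept whole or dropped whole, never half-retained.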

8.5 Redact sensitive information early

ML systems often handle prompts, documents, features, and outputs that may contain private or proprietary information. The safest pattern is to redact, hash, or drop sensitive fields as early as possible: ideally before telemetry leaves the process, and otherwise in the Collector before it reaches a backend.
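
As a minimal in-process sketch, a scrubbing step can run over attributes before they are attached to a span or log record. The key lists, the scrub function, and the ".hash" suffix below are hypothetical choices for this example; a keyed hash (HMAC) is used so that identifiers cannot be recovered by hashing guesses without the secret.

```python
import hashlib
import hmac

# Hypothetical policy: which keys to drop entirely and which to pseudonymize.
DROP_KEYS = {"llm.prompt", "llm.completion", "document.text"}
HASH_KEYS = {"user.id", "session.id"}


def scrub(attributes: dict, secret: bytes) -> dict:
    """Return a copy of attributes that is safer to export."""
    clean = {}
    for key, value in attributes.items():
        if key in DROP_KEYS:
            continue  # never export raw content
        if key in HASH_KEYS:
            digest = hmac.new(secret, str(value).encode("utf-8"), hashlib.sha256).hexdigest()
            clean[key + ".hash"] = digest[:16]  # truncated for readability
        else:
            clean[key] = value
    return clean


raw = {
    "llm.prompt": "My card number is ...",
    "user.id": "u-829",
    "ml.model.version": "v2.1.0",
}
print(scrub(raw, secret=b"rotate-me"))
```

Hashing addresses privacy, not cardinality: a hashed user ID is still unique per user, so it is better kept in logs or a small set of sampled spans than in metric attributes.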

8.6 Separate observability from evaluation

This distinction is easy to miss. Observability tells you what happened in production. Evaluation tells you whether the model behavior is good.

You need both, but they answer different questions. A request can be fast and fully traced while still being semantically wrong. Conversely, an accurate model can still create incidents if retrieval is slow or a provider is unstable.

8.7 Common anti-patterns that quietly reduce observability value

There are several failure modes that look like instrumentation progress but produce weak operational outcomes:

creating many tiny spans for helper functions instead of a few spans for dependency boundaries
attaching raw prompts, documents, or user identifiers as searchable attributes
changing span names across services for the same logical stage
capturing traces but not retaining enough failed traces to debug incidents
instrumenting latency but ignoring queueing, retries, and fallback behavior

These anti-patterns are costly because they increase telemetry volume faster than they increase understanding. Good observability is not just about collecting more data. It is about collecting the right structure.

9. A pragmatic rollout plan

The best OTel adoption plans are incremental.

9.1 Start with one critical path

Instrument one workflow that matters to users or revenue, such as:

one online inference endpoint
one batch scoring pipeline
one RAG request path

This keeps scope small while still generating useful feedback.

9.2 Add only high-value metadata first

Begin with a minimal set of attributes that you know will help investigations:

service.name
service.version
deployment.environment (named deployment.environment.name in newer semantic conventions)
ml.model.name
ml.model.version
low-cardinality dependency metadata

Then add more only when the extra fields clearly improve debugging.

9.3 Connect traces, metrics, and logs early

Many teams instrument traces first and stop there. That is useful, but incomplete. The real payoff appears when the three signals reinforce one another.

An ideal investigation flow looks like this:

an alert fires because p99 latency increased
metrics narrow the affected service and time window
traces reveal the slow dependency or critical-path stage
correlated logs explain the exact failure mode

That is what observability maturity looks like in practice.

9.4 A minimal starter checklist

If a team wanted to operationalize this article with minimal overhead, a strong first version would be:

instrument one user-facing path with a root span and three to five child spans
add service.name, service.version, deployment.environment, and one or two stable ML attributes
emit one latency histogram and one error counter per critical endpoint
ensure logs include trace identifiers for correlation
verify that failed requests are sampled at a higher rate than healthy ones
review exported fields for privacy and cardinality before broad rollout

That checklist is small on purpose. In observability, a narrow system that teams actually use is better than an ambitious design that never becomes part of incident response.

10. Summary

OpenTelemetry gives ML teams a shared language for understanding complex systems in production. It helps you see one request end to end, understand aggregate health over time, and connect incidents to the events that caused them.

The core ideas are simple:

traces explain one request
metrics summarize many requests
logs preserve detailed evidence
context propagation keeps distributed work connected
the Collector keeps instrumentation operationally manageable and vendor-neutral

For ML systems, the most important habit is to instrument the boundaries that actually matter: feature retrieval, model inference, vector search, batch stages, and external providers. Start with one critical path, keep metadata low-cardinality and privacy-aware, and grow the system only where it improves investigations.

That is usually enough to turn observability from an afterthought into an engineering advantage.

Silpa

Silpa brings 5 years of experience in working on diverse ML projects, specializing in designing end-to-end ML systems tailored for real-time applications. Her background in statistics (Bachelor of Technology) provides a strong foundation for her work in the field. Silpa is also the driving force behind the development of the content you find on this site.

S L Happy

Machine Learning Engineer at HP

Happy is a seasoned ML professional with over 15 years of experience. His expertise spans various domains, including Computer Vision, Natural Language Processing (NLP), and Time Series analysis. He holds a PhD in Machine Learning from IIT Kharagpur and has furthered his research with postdoctoral experience at INRIA-Sophia Antipolis, France. Happy has a proven track record of delivering impactful ML solutions to clients.
