APM — Application Performance Monitoring

Learning Objectives

Navigate and interpret distributed traces in xTraces
Use the service map to understand inter-service dependencies
Identify slow spans and trace errors using TraceQL
Correlate a slow trace to related logs and metrics

Distributed Tracing Concepts

A trace represents the complete journey of a single request through your distributed system. Each unit of work within a trace is a span. Every span except the root records the ID of its parent span, so the spans of a trace form a tree.

Span attributes (key metadata on each span):

service.name — which service emitted this span
http.method, http.target, http.status_code — HTTP details
db.system, db.statement — database queries
span.kind — SERVER, CLIENT, PRODUCER, CONSUMER, INTERNAL
otel.status_code — OK or ERROR
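The relationship between spans and their attributes can be sketched in a few lines of Python. This is an illustrative model only, not the xTraces or OpenTelemetry data model: the `Span` class, its field names, and the example trace are assumptions made up for this sketch.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class Span:
    """Hypothetical, simplified span record (not a real SDK type)."""
    span_id: str
    parent_id: Optional[str]  # None marks the root span
    name: str
    start_ms: float
    duration_ms: float
    attributes: dict = field(default_factory=dict)


def render(spans):
    """Return one indented line per span, walking the trace depth-first."""
    children = {}
    for s in spans:
        children.setdefault(s.parent_id, []).append(s)

    lines = []

    def walk(parent_id, depth):
        for s in sorted(children.get(parent_id, []), key=lambda x: x.start_ms):
            service = s.attributes.get("service.name", "?")
            lines.append(f"{'  ' * depth}{service}: {s.name} ({s.duration_ms:g}ms)")
            walk(s.span_id, depth + 1)

    walk(None, 0)
    return lines


# Made-up trace: checkout-api calls payment-api, which queries PostgreSQL.
example_trace = [
    Span("a", None, "POST /checkout", 0, 420,
         {"service.name": "checkout-api", "span.kind": "SERVER"}),
    Span("b", "a", "POST /charge", 30, 350,
         {"service.name": "payment-api", "span.kind": "SERVER"}),
    Span("c", "b", "SELECT orders", 60, 280,
         {"service.name": "payment-api", "span.kind": "CLIENT",
          "db.system": "postgresql"}),
]

for line in render(example_trace):
    print(line)
```

The indentation in the printed output mirrors the nesting you see in a trace waterfall.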

Finding Slow Requests with TraceQL

TraceQL is xTraces's query language for searching traces:

# All traces taking more than 500ms
{duration > 500ms}

# Errors in the payment-api service
{resource.service.name = "payment-api" && status = error}

# Slow database spans
{span.db.system = "postgresql" && duration > 200ms}

# Find a specific trace by ID (trace:id is the TraceQL intrinsic for the trace ID)
{trace:id = "abc123def456"}

# Traces that contain spans from both checkout-api and payment-api
# (a single span has only one service name, so combine two span filters)
{resource.service.name = "checkout-api"} && {resource.service.name = "payment-api"}
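If you generate queries from scripts or dashboards, a small helper keeps the filter syntax consistent. The sketch below is a hypothetical convenience function, not part of xTraces; it only assembles single-span filters from the `resource.service.name`, `duration`, and `status` fields used in the queries above.

```python
def traceql_filter(service=None, min_duration_ms=None, errors_only=False):
    """Build a single-span TraceQL filter string (hypothetical helper)."""
    conditions = []
    if service is not None:
        # Escape backslashes and quotes so the value stays inside the string literal.
        safe = service.replace("\\", "\\\\").replace('"', '\\"')
        conditions.append(f'resource.service.name = "{safe}"')
    if min_duration_ms is not None:
        conditions.append(f"duration > {min_duration_ms}ms")
    if errors_only:
        conditions.append("status = error")
    # With no conditions this yields "{}", which matches every span.
    return "{" + " && ".join(conditions) + "}"


print(traceql_filter(service="payment-api", errors_only=True))
print(traceql_filter(min_duration_ms=500))
```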

xTraces Views

Trace Search

Open Explore → xTraces
Select Search tab
Filter by service name, span name, duration

[Screenshot: xTraces search view showing a list of traces with duration, service name, and timestamp columns]

Trace Detail View

Click any trace to see the full waterfall:

[Screenshot: xTraces trace waterfall view showing nested spans with durations, service names highlighted in different colours]

Reading the waterfall:

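Wide spans = slow operations (investigate these first; a wide parent may only be waiting on a slow child, so follow the chain down to the span that spends the time itself)
Red spans = errors
Gaps between spans = network latency, serialisation overhead, or work that is not instrumented
Deep nesting = long call chains across many service hops
Many similar sibling spans = N+1 query pattern (when sequential) or excessive fan-out (when parallel)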
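One way to decide which wide span to look at first is to rank spans by self time: the part of a span's duration that is not covered by its direct children. The sketch below assumes spans are plain dictionaries with made-up keys (`span_id`, `parent_id`, `name`, `start_ms`, `duration_ms`); it is not how the xTraces UI computes anything.

```python
def self_time_ms(span, children):
    """Duration of `span` not covered by any of its direct children."""
    end = span["start_ms"] + span["duration_ms"]
    cursor = span["start_ms"]  # everything before the cursor is already counted
    covered = 0.0
    for child in sorted(children, key=lambda c: c["start_ms"]):
        c_start = max(child["start_ms"], cursor)
        c_end = min(child["start_ms"] + child["duration_ms"], end)
        if c_end > c_start:  # skip children fully inside already-covered time
            covered += c_end - c_start
            cursor = c_end
    return span["duration_ms"] - covered


def rank_by_self_time(spans):
    """Return (self_time_ms, span name) pairs, largest self time first."""
    children = {}
    for s in spans:
        children.setdefault(s["parent_id"], []).append(s)
    ranked = [(self_time_ms(s, children.get(s["span_id"], [])), s["name"])
              for s in spans]
    return sorted(ranked, reverse=True)


# Made-up trace: each parent is wide only because its child is slow.
example_spans = [
    {"span_id": "a", "parent_id": None, "name": "POST /checkout",
     "start_ms": 0, "duration_ms": 420},
    {"span_id": "b", "parent_id": "a", "name": "POST /charge",
     "start_ms": 30, "duration_ms": 350},
    {"span_id": "c", "parent_id": "b", "name": "SELECT orders",
     "start_ms": 60, "duration_ms": 280},
]

print(rank_by_self_time(example_spans)[0])
```

In this example the root span is the widest bar in the waterfall, but the database span accounts for most of the time.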

Service Map

The service map visualises inter-service call relationships, built from trace data:

[Screenshot: xTraces service map showing nodes for each service with edges representing call paths, request rates, and error rates labelled on edges]

Enable via xTraces datasource configuration:

jsonData:
  serviceMap:
    datasourceUid: xscaler-metrics
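To see how call relationships can be derived from trace data, consider this minimal sketch. It counts an edge whenever a span's parent belongs to a different service. The dictionary keys and the `inventory-api` service are assumptions for illustration, and the real service map is built from metrics rather than by this code.

```python
from collections import Counter


def service_edges(spans):
    """Count caller -> callee edges where parent and child services differ."""
    by_id = {s["span_id"]: s for s in spans}
    edges = Counter()
    for s in spans:
        parent = by_id.get(s["parent_id"])
        if parent is None:
            continue  # root span, or parent missing from this batch
        caller, callee = parent["service"], s["service"]
        if caller != callee:
            edges[(caller, callee)] += 1
    return edges


# Made-up spans from a single trace.
example_spans = [
    {"span_id": "a", "parent_id": None, "service": "checkout-api"},
    {"span_id": "b", "parent_id": "a", "service": "payment-api"},
    {"span_id": "c", "parent_id": "b", "service": "payment-api"},
    {"span_id": "d", "parent_id": "a", "service": "inventory-api"},
]

for (caller, callee), count in service_edges(example_spans).items():
    print(f"{caller} -> {callee}: {count}")
```

Spans whose parent is in the same service (span `c` here) do not add an edge, which is why the map shows services rather than individual operations.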

RED Metrics from Traces

RED = Rate, Errors, Duration — the three metrics derived from traces:

# R: request rate (spans per second recorded for the service, not whole traces)
sum(rate(traces_spanmetrics_calls_total{service="payment-api"}[5m]))

# E: error ratio (errored spans divided by all spans, between 0 and 1)
sum(rate(traces_spanmetrics_calls_total{service="payment-api", status_code="STATUS_CODE_ERROR"}[5m]))
/ sum(rate(traces_spanmetrics_calls_total{service="payment-api"}[5m]))

# D: duration (p99, in milliseconds)
histogram_quantile(0.99,
  sum by (le) (
    rate(traces_spanmetrics_duration_milliseconds_bucket{service="payment-api"}[5m])
  )
)

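These span metrics are available if you enable the spanmetrics connector in your OTel Collector. The exact metric and label names depend on how the connector is configured, so adjust the queries above to match your setup.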
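To make the three RED definitions concrete, here is a small sketch that computes them directly from a list of span records. The record layout (`status`, `duration_ms`) is an assumption for illustration. It uses a nearest-rank percentile on raw durations, so results can differ slightly from `histogram_quantile`, which interpolates within histogram buckets.

```python
def red_summary(spans, window_seconds):
    """Compute rate, error ratio and p99 duration for spans seen in a window."""
    calls = len(spans)
    if calls == 0:
        return {"rate_per_s": 0.0, "error_ratio": 0.0, "p99_ms": None}
    errors = sum(1 for s in spans if s["status"] == "ERROR")
    durations = sorted(s["duration_ms"] for s in spans)
    # Nearest-rank p99: smallest value with at least 99% of samples at or below it.
    rank = (99 * calls + 99) // 100  # integer ceiling of 0.99 * calls
    return {
        "rate_per_s": calls / window_seconds,
        "error_ratio": errors / calls,
        "p99_ms": durations[rank - 1],
    }


# Made-up data: 100 spans over 50 seconds, durations 1..100 ms, 5 errors.
example_spans = [
    {"status": "ERROR" if i <= 5 else "OK", "duration_ms": i}
    for i in range(1, 101)
]

print(red_summary(example_spans, window_seconds=50))
```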

Hands-On Exercise

Exercise 6.3 — Explore a Distributed Trace

Open Grafana → Explore → Select xTraces datasource
Click Search tab → Run Query (no filters)

[Screenshot: xTraces search results showing 10-20 recent traces from the loadgen service]

Sort by Duration (descending) — find the slowest trace
Click the trace to open the waterfall view
Find the slowest span — note:
- Which service emitted it?
- What operation was it?
- How long did it take?
Click the span → click Logs for this span

[Screenshot: Trace detail with a selected span showing the "Logs" side panel with xLogs log lines filtered by trace_id]
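Conceptually, the Logs for this span link filters log lines by the trace ID of the selected span. The sketch below mimics that on a handful of made-up log lines. It assumes JSON-formatted logs with a `trace_id` field, which may not match how your services log, and it does not call xLogs.

```python
import json


def logs_for_trace(log_lines, trace_id):
    """Return parsed JSON log records whose trace_id matches."""
    matches = []
    for line in log_lines:
        try:
            record = json.loads(line)
        except json.JSONDecodeError:
            continue  # ignore lines that are not JSON
        if isinstance(record, dict) and record.get("trace_id") == trace_id:
            matches.append(record)
    return matches


# Made-up log lines; only the first belongs to the trace we care about.
example_logs = [
    '{"trace_id": "abc123def456", "level": "error", "msg": "charge failed"}',
    "plain text line without a trace id",
    '{"trace_id": "other", "level": "info", "msg": "ok"}',
]

for record in logs_for_trace(example_logs, "abc123def456"):
    print(record["level"], record["msg"])
```

This only works when your services write the trace ID into every log line, which is what makes the trace-to-logs jump possible in the first place.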

Exercise 6.4 — Find Errors with TraceQL

In xTraces Explore, select the TraceQL tab
Enter:

{status = error}

How many error traces were found in the last hour?
Click one — identify which service and span produced the error

Validation

xTraces search returns traces
You can click a trace and read the waterfall
Clicking a span shows "Logs for this span" from xLogs
TraceQL {status = error} returns error traces
You can identify the slowest span in a trace

Key Takeaways

Session 6.2 Summary

A trace = a tree of spans representing one request's journey through all services
Wide spans = slow; red spans = errors — these are investigation starting points
TraceQL queries xTraces: {resource.service.name = "x" && duration > 500ms}
The service map visualises call graphs built from trace data
Trace → Logs correlation: clicking a span reveals xLogs logs filtered by trace_id
RED metrics (Rate, Errors, Duration) from traces provide service-level SLO visibility
