APM — Application Performance Monitoring
Learning Objectives
- Navigate and interpret distributed traces in xTraces
- Use the service map to understand inter-service dependencies
- Identify slow spans and trace errors using TraceQL
- Correlate a slow trace to related logs and metrics
Distributed Tracing Concepts
A trace represents the complete journey of a single request through your distributed system. Each unit of work within a trace is a span.
Span attributes (key metadata on each span):
- service.name — which service emitted this span
- http.method, http.target, http.status_code — HTTP details
- db.system, db.statement — database queries
- span.kind — SERVER, CLIENT, PRODUCER, CONSUMER, INTERNAL
- otel.status_code — OK or ERROR
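To make the trace/span relationship concrete, here is a minimal Python sketch that rebuilds a span tree from parent references. It is not xTraces or OpenTelemetry code: the Span class, its field names, and the sample IDs are hypothetical, and only the attribute keys come from the list above.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class Span:
    """Hypothetical, simplified span record (not an xTraces or OTel type)."""
    span_id: str
    parent_id: Optional[str]  # None marks the root span of the trace
    name: str
    attributes: dict = field(default_factory=dict)
    children: list = field(default_factory=list)


def build_tree(spans):
    """Attach each span to its parent and return the root span."""
    by_id = {s.span_id: s for s in spans}
    root = None
    for s in spans:
        if s.parent_id is None:
            root = s
        else:
            by_id[s.parent_id].children.append(s)
    return root


def depth(span):
    """Number of levels in the subtree rooted at this span."""
    return 1 + max((depth(c) for c in span.children), default=0)


# Illustrative sample data: one request crossing two services.
spans = [
    Span("a1", None, "POST /checkout",
         {"service.name": "checkout-api", "span.kind": "SERVER"}),
    Span("b2", "a1", "POST /pay",
         {"service.name": "payment-api", "span.kind": "SERVER"}),
    Span("c3", "b2", "INSERT payments",
         {"service.name": "payment-api", "db.system": "postgresql"}),
]
root = build_tree(spans)
print(root.name, depth(root))
```

The root span has no parent, and every other span hangs off the span that caused it, which is the nesting the waterfall view displays.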
Finding Slow Requests with TraceQL
TraceQL is xTraces's query language for searching traces:
# All traces taking more than 500ms
{duration > 500ms}
# Errors in the payment-api service
{resource.service.name = "payment-api" && status = error}
# Slow database spans
{span.db.system = "postgresql" && duration > 200ms}
# Find a specific trace by ID
{traceID = "abc123def456"}
# All traces involving the checkout-api and payment-api
{resource.service.name = "checkout-api"} && {resource.service.name = "payment-api"}
xTraces Views
Trace Search
- Open Explore → xTraces
- Select Search tab
- Filter by service name, span name, duration
[Screenshot: xTraces search view showing a list of traces with duration, service name, and timestamp columns]
Trace Detail View
Click any trace to see the full waterfall:
[Screenshot: xTraces trace waterfall view showing nested spans with durations, service names highlighted in different colours]
Reading the waterfall:
- Wide spans = slow operations (investigate these first)
- Red spans = errors
- Gaps between spans = network latency or serialisation overhead
- Deep nesting = many service hops; long runs of near-identical sibling spans suggest an N+1 query pattern or excessive fan-out
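The first and third reading rules above can be sketched in a few lines of Python. This is an illustrative sketch only: the dictionary keys (name, start_ms, duration_ms) and the sample values are assumptions, not an xTraces export format.

```python
def slowest_span(spans):
    """Return the span with the largest duration (the widest bar)."""
    return max(spans, key=lambda s: s["duration_ms"])


def find_gaps(spans):
    """Return (previous, next, gap_ms) for idle time between spans.

    A gap is counted when a span starts after every earlier span has ended.
    """
    if not spans:
        return []
    ordered = sorted(spans, key=lambda s: s["start_ms"])
    found = []
    last = ordered[0]
    covered_until = last["start_ms"] + last["duration_ms"]
    for s in ordered[1:]:
        if s["start_ms"] > covered_until:
            found.append((last["name"], s["name"], s["start_ms"] - covered_until))
        end = s["start_ms"] + s["duration_ms"]
        if end >= covered_until:
            covered_until = end
            last = s
    return found


# Hypothetical child spans of one request, times relative to the trace start.
children = [
    {"name": "auth check", "start_ms": 0, "duration_ms": 20},
    {"name": "SELECT cart", "start_ms": 25, "duration_ms": 30},
    {"name": "POST /pay", "start_ms": 120, "duration_ms": 340},
]
print(slowest_span(children)["name"])
print(find_gaps(children))
```

In this sample the widest span is the payment call, and the 65 ms pause before it is the kind of gap worth checking for network or serialisation overhead.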
Service Map
The service map visualises inter-service call relationships, built from trace data:
[Screenshot: xTraces service map showing nodes for each service with edges representing call paths, request rates, and error rates labelled on edges]
Enable via xTraces datasource configuration:
jsonData:
  serviceMap:
    datasourceUid: xscaler-metrics
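As a conceptual illustration of how call relationships can be derived from trace data, the Python sketch below counts an edge whenever a span's parent belongs to a different service. It is a simplification under assumed field names (span_id, parent_id, service) and does not describe how xTraces computes its service map internally.

```python
from collections import Counter


def service_edges(spans):
    """Count caller -> callee edges where a child span runs in another service."""
    service_of = {s["span_id"]: s["service"] for s in spans}
    edges = Counter()
    for s in spans:
        parent_service = service_of.get(s["parent_id"])  # None for the root span
        if parent_service is not None and parent_service != s["service"]:
            edges[(parent_service, s["service"])] += 1
    return edges


# Hypothetical spans from a single trace.
spans = [
    {"span_id": "a1", "parent_id": None, "service": "checkout-api"},
    {"span_id": "b2", "parent_id": "a1", "service": "payment-api"},
    {"span_id": "c3", "parent_id": "b2", "service": "payment-api"},
    {"span_id": "d4", "parent_id": "a1", "service": "inventory-api"},
    {"span_id": "e5", "parent_id": "a1", "service": "payment-api"},
]
for (caller, callee), calls in service_edges(spans).items():
    print(f"{caller} -> {callee}: {calls}")
```

Spans that stay within one service (c3 here) add no edge, so the result contains only cross-service calls, which is what the nodes and edges on the map represent.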
RED Metrics from Traces
RED = Rate, Errors, Duration — the three metrics derived from traces:
# R — Request Rate (spans per second; calls_total counts spans, not whole traces)
sum(rate(traces_spanmetrics_calls_total{service="payment-api"}[5m]))
# E — Error Rate
sum(rate(traces_spanmetrics_calls_total{service="payment-api", status_code="STATUS_CODE_ERROR"}[5m]))
/ sum(rate(traces_spanmetrics_calls_total{service="payment-api"}[5m]))
# D — Duration (p99)
histogram_quantile(0.99,
  sum by (le) (
    rate(traces_spanmetrics_duration_milliseconds_bucket{service="payment-api"}[5m])
  )
)
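These span metrics are available only if you enable the spanmetrics connector in your OTel Collector. Metric and label names can vary with the connector's namespace setting and version, so adjust the queries to match what your collector exports.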
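To show what the three queries compute, here is a small Python sketch with made-up numbers. The counter increases and bucket counts are hypothetical, and histogram_quantile below is a simplified re-implementation of the linear interpolation PromQL applies to cumulative buckets, not the Prometheus code itself.

```python
import math


def histogram_quantile(q, buckets):
    """Estimate a quantile from cumulative (upper_bound, count) buckets.

    Buckets must be sorted by upper bound and end with a +Inf bucket.
    """
    if not 0 < q <= 1:
        raise ValueError("q must be in (0, 1]")
    total = buckets[-1][1]
    if total == 0:
        return math.nan
    rank = q * total
    prev_bound, prev_count = 0.0, 0.0  # lowest bucket is assumed to start at 0
    for bound, count in buckets:
        if count >= rank:
            if math.isinf(bound):
                # Quantile falls in the +Inf bucket: report the highest finite bound.
                return prev_bound
            fraction = (rank - prev_count) / (count - prev_count)
            return prev_bound + (bound - prev_bound) * fraction
        prev_bound, prev_count = bound, count
    return prev_bound


window_s = 300          # matches the [5m] range used in the queries above
calls_increase = 1500   # hypothetical growth of the calls counter in the window
errors_increase = 30    # hypothetical growth of the error-status series

rate_per_s = calls_increase / window_s          # R
error_ratio = errors_increase / calls_increase  # E

# Hypothetical cumulative duration buckets in milliseconds.
buckets = [(100.0, 20), (250.0, 60), (500.0, 95), (math.inf, 100)]
p50 = histogram_quantile(0.5, buckets)          # D (median)
p99 = histogram_quantile(0.99, buckets)         # D (p99)
print(rate_per_s, error_ratio, p50, p99)
```

With these sample buckets the p99 lands in the +Inf bucket, so the estimate is capped at the highest finite bound. If that happens on real data, the bucket boundaries are probably too low for the latencies being observed.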
Hands-On Exercise
Exercise 6.3 — Explore a Distributed Trace
- Open Grafana → Explore → select the xTraces datasource
- Click Search tab → Run Query (no filters)
[Screenshot: xTraces search results showing 10-20 recent traces from the loadgen service]
- Sort by Duration (descending) — find the slowest trace
- Click the trace to open the waterfall view
- Find the slowest span — note:
  - Which service emitted it?
  - What operation was it?
  - How long did it take?
- Click the span → click Logs for this span
[Screenshot: Trace detail with a selected span showing the "Logs" side panel with xLogs log lines filtered by trace_id]
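The last step works because log lines carry the same trace ID as the span. A minimal Python sketch of that filter follows, assuming JSON-formatted log lines with a trace_id field; the sample lines and the logs_for_trace helper are hypothetical and do not reflect the xLogs API.

```python
import json


def logs_for_trace(log_lines, trace_id):
    """Return parsed JSON log records whose trace_id matches the given trace."""
    matches = []
    for line in log_lines:
        try:
            record = json.loads(line)
        except json.JSONDecodeError:
            continue  # skip lines that are not valid JSON
        if isinstance(record, dict) and record.get("trace_id") == trace_id:
            matches.append(record)
    return matches


# Hypothetical log lines; only the first two belong to the trace of interest.
log_lines = [
    '{"level": "info", "msg": "payment started", "trace_id": "abc123def456"}',
    '{"level": "error", "msg": "card declined", "trace_id": "abc123def456"}',
    '{"level": "info", "msg": "health check", "trace_id": "ffff0000aaaa"}',
    "plain text line without a trace id",
]
for record in logs_for_trace(log_lines, "abc123def456"):
    print(record["level"], record["msg"])
```

If the Logs panel comes back empty for a span, a likely cause is that the service does not write the trace ID into its log lines.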
Exercise 6.4 — Find Errors with TraceQL
- In xTraces Explore, select the TraceQL tab
- Enter:
{status = error}
- How many error traces were found in the last hour?
- Click one — identify which service and span produced the error
Validation
- xTraces search returns traces
- You can click a trace and read the waterfall
- Clicking a span shows "Logs for this span" from xLogs
- TraceQL {status = error} returns error traces
- You can identify the slowest span in a trace
Key Takeaways
- A trace = a tree of spans representing one request's journey through all services
- Wide spans = slow; red spans = errors — these are investigation starting points
- TraceQL queries xTraces: {resource.service.name = "x" && duration > 500ms}
- The service map visualises call graphs built from trace data
- Trace → Logs correlation: clicking a span reveals xLogs logs filtered by trace_id
- RED metrics (Rate, Errors, Duration) from traces provide service-level SLO visibility