Customer Architecture Workshop

Learning Objectives

Apply platform architecture knowledge to design a real customer scenario
Choose appropriate OTel deployment topology for given constraints
Identify potential cardinality risks in a customer's label set
Present and defend architectural decisions

Workshop Overview

This 30-minute group exercise asks you to design an observability architecture for a realistic customer scenario. Work in groups of 3-4 people.

Scenario: FinTech Startup — PayFast

PayFast is a payment processing startup with the following infrastructure:

Current State:

12 microservices (Java and Node.js)
3 Kubernetes clusters: prod-uk, prod-us, staging
Each cluster has 10–20 nodes
200–500 pods per cluster at peak
Daily traffic: 2M transactions/day
Currently: no observability tooling

Requirements:

Collect metrics, logs, and traces from all services
Strict tenant isolation between prod-uk and prod-us data
Log retention: 30 days
Metric retention: 90 days
Budget: Start with the $19/month Scale plan

Constraints:

No ability to modify application code in the first phase (Java services have -javaagent hooks only)
Kubernetes teams manage their own clusters
Security team requires all secrets managed via Kubernetes Secrets
No direct internet access from pods (outbound only via egress proxy)

Workshop Tasks

Task 1 — Tenant Design (10 minutes)

Design the tenant structure for PayFast. Answer:

How many tenants should be created? Why?
What should the display_name and environment be for each?
How many API keys are needed?
Which xScaler plan fits the initial rollout?

Discussion template:

Tenants:
1. display_name: _____, environment: _____, cluster: _____
2. display_name: _____, environment: _____, cluster: _____
3. display_name: _____, environment: _____, cluster: _____

API Keys per tenant: _____
Reason for number of keys: _____

Initial plan: _____ because _____

Task 2 — Collector Deployment (10 minutes)

Design the OTel Collector deployment for prod-uk Kubernetes cluster. Answer:

Agent Mode or Gateway Mode? Or both?
How many collector replicas/pods?
Which receivers are needed?
How are credentials managed?
How does traffic reach xScaler (via egress proxy)?

Discussion template:

Deployment type: DaemonSet / Deployment / Both
Replicas: _____
Receivers: _____
Credential management: _____
Egress proxy configuration: _____

Task 3 — Cardinality Risk Review (10 minutes)

PayFast's Java services emit these Prometheus metrics:

# HELP payment_request_total Total payment requests
# TYPE payment_request_total counter
payment_request_total{
  merchant_id="merchant_12345",    # 50,000 merchants
  currency="GBP",                  # 30 currencies
  payment_method="card",           # 5 methods
  user_id="user_abc",              # 2M users
  request_id="req_xyz",            # unique per request
  status="success"                 # 3 statuses
} 1

Calculate:

Current cardinality (worst case): ______ series
Which labels should be removed from this metric?
Which labels should be kept?
After cleanup, what is the estimated series count?

Reference Architecture

Sample Solution

Click to reveal suggested answer

Task 1 — Tenant Design:

3 tenants: PayFast UK Production, PayFast US Production, PayFast Staging
2 API keys per tenant (active + backup)
Start with Scale plan ($19/mo) — 20k active series included
Upgrade to Enterprise if series count exceeds 20k

Task 2 — Collector Deployment:

Agent Mode (DaemonSet) — 1 collector per node
Receivers: otlp (receives from Java javaagent), hostmetrics (node metrics)
Credentials: Kubernetes Secret → mounted as environment variables
Egress proxy: set HTTPS_PROXY env var in DaemonSet pod spec

Task 3 — Cardinality:

Current worst case: 50,000 × 30 × 5 × 2,000,000 × unique × 3 = astronomically high
Remove: merchant_id (50k), user_id (2M), request_id (unique)
Keep: currency, payment_method, status
After cleanup: 30 × 5 × 3 = 450 series per metric ✅

Key Takeaways

Session 3.3 Summary

Tenant design: one tenant per environment per region for isolation
For Kubernetes: DaemonSet (Agent Mode) is almost always the right choice
Cardinality review is essential before onboarding — one bad label can blow up storage
Always remove: user_id, request_id, trace_id, session_id from metrics
Keep: environment, region, service, http_method, status_code (bounded values)

← Previous: Architecture Review
Next: Session 4 Overview →

Learning Objectives​

Workshop Overview​

Scenario: FinTech Startup — PayFast​

Workshop Tasks​

Task 1 — Tenant Design (10 minutes)​

Task 2 — Collector Deployment (10 minutes)​

Task 3 — Cardinality Risk Review (10 minutes)​

Reference Architecture​

Sample Solution​

Key Takeaways​