Performance Engineering - Evaluative Metrics

Today it's LLMs. Yesterday it was CDNs. Yesteryears long gone, Big Iron lore.

don't go gettin' your panties in a bunch for CXL until after you know your performance baselines

LLM Inference

| Sector | Common metrics | Less-known / specialist metrics |
| --- | --- | --- |
| GP-GPU Service Infrastructure | TTFT, ITL/TPOT, E2E request latency, TPS, RPS/QPS, Queue-Fill, KV/C metrics | request_queue_time, request_prefill_time, request_decode_time, inflight time, SM occupancy, HBM/DRAM memory throughput |
| API Service Infrastructure | p50/p95/p99 latency, streaming TTFB/TTFT, RPS/QPS, error rate, 429 rate, active downstream connections, backend latency | x-envoy-upstream-service-time, upstream_rq_time, TR/Tw/Tc/Tr/TA timer split in HAProxy, retry rate, queue-slot saturation, circuit-breaker usage |
| Network Hardware + Protocol Infra | RTT (round-trip time), one-way latency, jitter/PDV (packet delay variation), PPS/BPS, packet loss, retransmits | ECN mark rate, PFC (priority flow control) pause events, microburst depth, CQE (completion queue entry) compression state, SymbolErrors, LinkRecoveries, PTP offset/path delay |
| Prompt Caching, Compute + Re-Compute | cache hit rate, cached_tokens, prompt_tokens, generation_tokens, TTFT reduction, cache read/write counts | partial-hit ratio, cache-miss ratio, prefix-cache block hit rate, predicted KV hit rate, per-request routing overhead |
| Prompt Caching, Storage Infra | cache hit ratio, eviction rate, cache latency, TTL, GPU/CPU cache usage | CPU-vs-GPU cache-hit split, priority-based eviction effect, TTL refresh behavior, KV-block fragmentation / block reuse |
| Prompt Caching, API + Load-Balancers | per-route/model hit rate, active and pending requests, backend latency, request rate, retries, error rate | cache-aware routing predicted KV hit rate, routing overhead, queue time before backend selection, circuit-breaker saturation, connection-slot pressure |
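
The prompt-caching ratios in the table can be derived from per-request token counts. The sketch below is one plausible way to do it, not a definition taken from any particular serving stack: the `RequestUsage` record and the `cache_stats` helper are hypothetical, and only the field names `prompt_tokens` and `cached_tokens` come from the table. It assumes a "miss" means zero cached tokens and a "partial hit" means some, but not all, prompt tokens were served from cache.

```python
from dataclasses import dataclass


@dataclass
class RequestUsage:
    # Hypothetical per-request record; field names follow the table above.
    prompt_tokens: int   # total prompt tokens in the request
    cached_tokens: int   # prompt tokens served from the prefix/KV cache


def cache_stats(requests):
    """Aggregate token-level and request-level cache ratios."""
    n = len(requests)
    total_prompt = sum(r.prompt_tokens for r in requests)
    total_cached = sum(r.cached_tokens for r in requests)
    # Assumption: a miss is a request with no cached tokens at all.
    misses = sum(1 for r in requests if r.cached_tokens == 0)
    # Assumption: a partial hit reuses some, but not all, prompt tokens.
    partial = sum(1 for r in requests if 0 < r.cached_tokens < r.prompt_tokens)
    return {
        "token_hit_rate": total_cached / total_prompt if total_prompt else 0.0,
        "partial_hit_ratio": partial / n if n else 0.0,
        "cache_miss_ratio": misses / n if n else 0.0,
    }


if __name__ == "__main__":
    sample = [RequestUsage(100, 0), RequestUsage(100, 50), RequestUsage(200, 200)]
    print(cache_stats(sample))
```

The token-level rate weights long prompts more heavily than the request-level ratios do, which is usually what matters when estimating avoided recompute.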

Acronyms

  • TTFT == Time to First Token
  • ITL == Inter-Token Latency
  • TPOT == Time per Output Token
  • E2E == end-to-end
  • TPS == tokens/sec
  • RPS == requests/sec
  • QPS == queries/sec
  • KV/C == Key-Value Cache
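
To make the latency acronyms concrete, here is a minimal sketch that derives them from client-side timestamps for a single streamed response. The function name `latency_metrics` and its inputs are hypothetical. It assumes one timestamp per output token; benchmarking tools differ on the exact definitions (for example, whether TPOT excludes the first token), so treat these formulas as one common convention rather than the standard.

```python
def latency_metrics(request_sent, token_times):
    """Derive TTFT, ITL, TPOT, E2E latency, and TPS for one streamed response.

    request_sent: timestamp (seconds) when the request was sent.
    token_times:  arrival timestamp (seconds) of each output token, in order.
    """
    if not token_times:
        raise ValueError("need at least one output token")
    n = len(token_times)
    ttft = token_times[0] - request_sent           # Time to First Token
    e2e = token_times[-1] - request_sent           # end-to-end request latency
    gaps = [b - a for a, b in zip(token_times, token_times[1:])]
    itl_mean = sum(gaps) / len(gaps) if gaps else 0.0  # mean Inter-Token Latency
    itl_max = max(gaps) if gaps else 0.0               # worst single stall
    # Assumption: TPOT is decode time averaged over tokens after the first.
    tpot = (e2e - ttft) / (n - 1) if n > 1 else 0.0
    tps = n / e2e if e2e > 0 else 0.0              # output tokens per second
    return {"ttft": ttft, "itl_mean": itl_mean, "itl_max": itl_max,
            "tpot": tpot, "e2e": e2e, "tps": tps}


if __name__ == "__main__":
    print(latency_metrics(0.0, [0.5, 0.6, 0.7, 0.8, 0.9]))
```

With one timestamp per token, mean ITL and TPOT coincide under these definitions; they diverge when tokens arrive in multi-token chunks, which is why `itl_max` is reported separately.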

LLM Dataset Training

| Operations | Common metrics | Less-known / specialist metrics |
| --- | --- | --- |
| Pre-loading datasets | ingest throughput (docs/s, rows/s, bytes/s, tokens/s), tokenization throughput, DataLoader throughput, dedup ratio | fuzzy-duplicate rate, semantic-duplicate rate, contamination rate/AUC, consumer lag, shuffle spill bytes |
| Pre-training + MoE | tokens/s, step time, MFU (model FLOPs utilization), training loss/perplexity, all-to-all/dispatch time, GPU memory | auxiliary load-balancing loss, router z-loss, expert capacity factor, token-drop rate, per-layer expert imbalance |
| Post-training + MoE | step time, samples/s or tokens/s, train/val loss, reward/preference accuracy, KL divergence, MFU | chosen vs rejected reward margin, chosen/rejected rewards, router aux/z loss during alignment, expert saturation under small-batch tuning |
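
Three of the training metrics above reduce to short formulas. The sketch below is illustrative only, with hypothetical function names. It assumes the common approximation of about 6 FLOPs per active parameter per token for MFU (attention FLOPs ignored), a Switch-Transformer-style auxiliary loss (number of experts times the sum over experts of dispatch fraction times mean router probability), and an ST-MoE-style z-loss (mean squared log-sum-exp of router logits). The scaling coefficients that training frameworks apply to both losses are left out.

```python
import math


def mfu(tokens_per_s, active_params, peak_flops_per_s):
    """Model FLOPs utilization under the ~6 * params FLOPs-per-token approximation."""
    return 6.0 * active_params * tokens_per_s / peak_flops_per_s


def load_balancing_loss(router_probs, assignments, num_experts):
    """Unscaled auxiliary loss: num_experts * sum_i f_i * P_i.

    router_probs: per-token list of router probabilities over experts.
    assignments:  per-token index of the expert the token was dispatched to.
    """
    n_tokens = len(router_probs)
    f = [0.0] * num_experts                      # fraction of tokens per expert
    for expert in assignments:
        f[expert] += 1.0 / n_tokens
    p = [sum(row[i] for row in router_probs) / n_tokens   # mean router probability
         for i in range(num_experts)]
    return num_experts * sum(fi * pi for fi, pi in zip(f, p))


def router_z_loss(router_logits):
    """Unscaled z-loss: mean over tokens of (log-sum-exp of the logits) squared."""
    total = 0.0
    for row in router_logits:
        m = max(row)                             # subtract max for numerical stability
        lse = m + math.log(sum(math.exp(x - m) for x in row))
        total += lse * lse
    return total / len(router_logits)


if __name__ == "__main__":
    print(mfu(1000, 1e9, 1e14))
    print(load_balancing_loss([[0.9, 0.1], [0.9, 0.1]], [0, 0], 2))
    print(router_z_loss([[0.0, 0.0]]))
```

A perfectly balanced router gives an auxiliary loss of 1.0 under this formulation, so values drifting well above 1.0 are the imbalance signal to watch.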

HPC + HFT Analytics Infrastructure

| Sector | Common metrics | Less-known / specialist metrics |
| --- | --- | --- |
| Virtual Machine Clusters | p50/p99 service latency, jitter, CPU ready time, throughput, packet loss | CPU co-stop, NUMA locality (numa_hit/numa_miss/local_node/other_node), vCPU scheduling contention, latency-sensitivity effectiveness |
| Baremetal Systems | wire-to-wire latency, p99/p999 jitter, PPS/Mpps, cycles, instructions, LLC-load-misses, branch-misses, RX/TX drops | NUMA locality, interrupt coalescence, CQE compression state, packet pacing, driver extended stats / ring stress |
| Content Delivery Networking | cache hit ratio, TTFB, origin offload, request rate, error rate, egress bandwidth | shield-layer hit ratio, origin_ttfb, child/parent cache status, regional cache-performance variance |
| Low-Latency Trade-Execution Networks | one-way latency, RTT, jitter, packet loss, order latency, feed latency | microburst depth, queue/buffer pressure, ECN mark rate, PFC pause events, path asymmetry, PTP offset |
| Dark Fiber Regional Network Infra | one-way latency, RTT, availability, BER, pre-FEC BER, post-FEC BER | OSNR/ESNR, Q-factor, CD (chromatic dispersion), PMD (polarization mode dispersion), FEC degrade indicators |
| Quantitative Research + Machine Learning | backtest wall-clock runtime, feed latency, order latency, feature-serving latency, feature-ingestion throughput | training-serving skew, feature inflight vs write-to-store success metrics, feature health/correctness monitoring |
| Data Analytics + Multivariate Analysis | job duration, stage/task duration, throughput, end-to-end delay, records-consumed-rate, consumer lag | shuffle read/write spill bytes, skew via task-duration discrepancy or shuffle-read imbalance, straggler share, input-pipeline prefetch effectiveness |
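
Tail percentiles and NUMA locality recur throughout this table, so a small worked sketch may help. The helper names are hypothetical. The percentile uses the nearest-rank method, which is only one of several conventions (monitoring systems often interpolate or estimate from histogram buckets), and the locality ratio assumes `numa_hit` and `numa_miss` are cumulative counters sampled from the operating system.

```python
import math


def percentile(samples, q):
    """Nearest-rank percentile: the smallest sample with at least q% of samples at or below it."""
    if not samples:
        raise ValueError("no samples")
    ordered = sorted(samples)
    rank = max(1, math.ceil(q * len(ordered) / 100))   # 1-based rank
    return ordered[rank - 1]


def numa_locality(numa_hit, numa_miss):
    """Fraction of allocations satisfied on the intended NUMA node."""
    total = numa_hit + numa_miss
    return numa_hit / total if total else 1.0


if __name__ == "__main__":
    latencies_us = [40, 10, 90, 20, 100, 30, 70, 50, 80, 60]
    print(percentile(latencies_us, 50), percentile(latencies_us, 99))
    print(numa_locality(900, 100))
```

The gap between p50 and p99 on the same sample set is a quick, if crude, jitter indicator; a rising NUMA miss share is one of the first things to rule out when that gap widens.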

SLA + SLO Reliability, Telemetry, Alerting

  • Predefined latency, error, and throughput SLIs, together with their error budgets, are the inputs that burn-rate alerting depends on
  • Prometheus and Alertmanager define scrape and notification timing controls
  • OpenTelemetry defines histograms and exemplars
  • Apdex is a standard user-satisfaction score
  • Elastic APM measures application performance traces

| Sector | Common metrics | Less-known / specialist metrics |
| --- | --- | --- |
| SLA + SLO Monitoring, Telemetry, Alerting Infra | availability SLI, latency SLI, error rate, throughput, Apdex, error-budget burn rate | multi-window multi-burn-rate alerts, scrape_duration / scrape_timeout ratio, group_wait / group_interval / repeat_interval, histogram bucket design, exemplars |
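
Burn rate and Apdex are both simple ratios, sketched below with hypothetical helper names. Burn rate is the observed error ratio divided by the error ratio the SLO allows. The multi-window check fires only when both a long and a short window exceed the threshold. The default of 14.4 is the value commonly quoted for a 1-hour/5-minute window pair against a 30-day SLO and is used here as an assumption, not a recommendation. Apdex follows the usual definition: satisfied at or below T, tolerating at or below 4T.

```python
def burn_rate(errors, total, slo):
    """Error-budget burn rate: observed error ratio / allowed error ratio (1 - slo)."""
    if total == 0:
        return 0.0
    return (errors / total) / (1.0 - slo)


def should_page(long_window, short_window, slo, threshold=14.4):
    """Multi-window check: page only if both windows burn faster than the threshold.

    long_window, short_window: (errors, total) request counts for each window.
    """
    return (burn_rate(*long_window, slo) >= threshold
            and burn_rate(*short_window, slo) >= threshold)


def apdex(latencies, t):
    """Apdex score: (satisfied + tolerating / 2) / total samples."""
    if not latencies:
        raise ValueError("no samples")
    satisfied = sum(1 for x in latencies if x <= t)
    tolerating = sum(1 for x in latencies if t < x <= 4 * t)
    return (satisfied + tolerating / 2) / len(latencies)


if __name__ == "__main__":
    print(burn_rate(20, 1000, 0.999))
    print(should_page((20, 1000), (3, 100), 0.999))
    print(apdex([0.1, 0.4, 0.6, 1.9, 2.5], 0.5))
```

Requiring the short window as well keeps the alert from lingering after an incident has already recovered, which is the main reason for pairing windows rather than alerting on one.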

Shared Analytics of Interest

| Metric | Signal | Typical domains |
| --- | --- | --- |
| Queue time | Separates saturation from raw compute/network slowness | LLM serving, API gateways, load balancers, HFT |
| Prefill vs decode split | Distinguishes prompt-processing bottlenecks from token-generation bottlenecks | LLM GPU serving |
| Prefix/KV cache hit rate | Direct proxy for avoidable recompute and TTFT improvement | LLM serving, agentic systems |
| Auxiliary MoE loss and router z-loss | Early warning for expert imbalance and routing instability | MoE training |
| CPU ready / co-stop / NUMA miss | Often the real cause of inconsistent latency in virtualized clusters | VM-based HFT / HPC |
| Microburst depth and ECN/PFC behavior | Reveals congestion that average bandwidth hides | Low-latency Ethernet / RoCE fabrics |
| OSNR / CD / PMD / pre-FEC BER | Core optical-health indicators long before full link failure | Dark fiber / coherent optics |
| Spill bytes and consumer lag | Early warning for data-path backpressure and skew | Big-data pipelines |
| Burn rate and exemplars | Better operational signal than raw alert count or average latency | SLO / observability stacks |