
By Eva Winterschön in Engineering — 13 Jun 2026

Performance Engineering - Evaluative Metrics

Today it's LLMs. Yesterday it was CDNs. Yesteryears long gone, Big Iron lore.

don't go gettin' your panties in a bunch for CXL until after you know your performance baselines

LLM Inference

Sector	Common metrics	Less-known / specialist metrics
GP-GPU Service Infrastructure	TTFT, ITL/TPOT, E2E request latency, TPS, RPS/QPS, Queue-Fill, KV/C metrics	request_queue_time, request_prefill_time, request_decode_time, inflight time, SM occupancy, HBM/DRAM memory throughput
API Service Infrastructure	p50/p95/p99 latency, streaming TTFB/TTFT, RPS/QPS, error rate, 429 rate, active downstream connections, backend latency	x-envoy-upstream-service-time, upstream_rq_time, TR/Tw/Tc/Tr/Ta timer split in HAProxy, retry rate, queue-slot saturation, circuit-breaker usage
Network Hardware + Protocol Infra	RTT (round-trip time), one-way latency, jitter/PDV (packet delay variation), PPS/BPS, packet loss, retransmits	ECN mark rate, PFC (priority flow control) pause events, microburst depth, CQE (completion queue entry) compression state, SymbolErrors, LinkRecoveries, PTP offset/path delay
Prompt Caching, Compute + Re-Compute	cache hit rate, cached_tokens, prompt_tokens, generation_tokens, TTFT reduction, cache read/write counts	partial-hit ratio, cache-miss ratio, prefix-cache block hit rate, predicted KV hit rate, per-request routing overhead
Prompt Caching, Storage Infra	cache hit ratio, eviction rate, cache latency, TTL, GPU/CPU cache usage	CPU-vs-GPU cache-hit split, priority-based eviction effect, TTL refresh behavior, KV-block fragmentation / block reuse
Prompt Caching, API + Load-Balancers	per-route/model hit rate, active and pending requests, backend latency, request rate, retries, error rate	cache-aware routing predicted KV hit rate, routing overhead, queue time before backend selection, circuit-breaker saturation, connection-slot pressure
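The cache hit metrics in the table can be counted at more than one granularity. The sketch below is one hedged way to derive them from per-request token counts. The field names `prompt_tokens` and `cached_tokens` come from the table; the function name and the exact definition of a "partial hit" are assumptions made for illustration.

```python
def cache_hit_summary(requests):
    """Summarize prompt-cache effectiveness.

    requests: iterable of (prompt_tokens, cached_tokens) pairs, one per request.
    """
    requests = list(requests)
    if not requests:
        raise ValueError("need at least one request")

    total_prompt = sum(p for p, _ in requests)
    total_cached = sum(c for _, c in requests)
    # A request counts as a hit if any of its prompt tokens came from cache.
    any_hit = sum(1 for _, c in requests if c > 0)
    # Assumed definition: a partial hit reuses some, but not all, prompt tokens.
    partial = sum(1 for p, c in requests if 0 < c < p)

    return {
        "token_hit_rate": total_cached / total_prompt if total_prompt else 0.0,
        "request_hit_rate": any_hit / len(requests),
        "partial_hit_ratio": partial / len(requests),
    }


if __name__ == "__main__":
    print(cache_hit_summary([(100, 0), (100, 50), (100, 100)]))
```

The token-level rate tracks avoided recompute more closely than the request-level rate, because a request that reuses only a short prefix still counts as a full request-level hit.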

Acronyms

TTFT == Time to First Token
ITL == Inter-Token Latency
TPOT == Time per Output Token
E2E == end-to-end
TPS == tokens/sec
RPS == requests/sec
QPS == queries/sec
KV/C == Key-Value Cache
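As a rough illustration of how these latency terms relate, the sketch below derives them from client-side timestamps for a single streamed request. The function name and the TPOT convention (decode time divided by the number of tokens after the first) are assumptions; serving stacks differ on whether the first token is included.

```python
def latency_metrics(request_sent: float, token_times: list[float]) -> dict[str, float]:
    """Compute per-request latency metrics, all in seconds.

    request_sent: timestamp at which the request was issued.
    token_times: arrival timestamp of each output token, in order.
    """
    if not token_times:
        raise ValueError("no tokens received")

    n = len(token_times)
    ttft = token_times[0] - request_sent   # Time to First Token
    e2e = token_times[-1] - request_sent   # end-to-end request latency
    # Inter-token latencies: gaps between consecutive output tokens.
    gaps = [b - a for a, b in zip(token_times, token_times[1:])]
    # Assumed convention: TPOT excludes the first token (prefill).
    tpot = (e2e - ttft) / (n - 1) if n > 1 else 0.0

    return {
        "ttft": ttft,
        "e2e": e2e,
        "itl_max": max(gaps) if gaps else 0.0,
        "tpot": tpot,
        "tps": n / e2e if e2e > 0 else 0.0,  # output tokens per second
    }


if __name__ == "__main__":
    print(latency_metrics(0.0, [0.5, 0.6, 0.7, 0.9]))
```

Under this convention the mean ITL of a single request equals its TPOT, so the sketch reports the worst gap (`itl_max`) instead; the two diverge once values are aggregated across requests.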

LLM Dataset Training

Operations	Common metrics	Less-known / specialist metrics
Pre-loading datasets	ingest throughput (docs/s, rows/s, bytes/s, tokens/s), tokenization throughput, DataLoader throughput, dedup ratio	fuzzy-duplicate rate, semantic-duplicate rate, contamination rate/AUC, consumer lag, shuffle spill bytes
Pre-training + MoE	tokens/s, step time, MFU (model FLOPs utilization), training loss/perplexity, all-to-all/dispatch time, GPU memory	auxiliary load-balancing loss, router z-loss, expert capacity factor, token-drop rate, per-layer expert imbalance
Post-training + MoE	step time, samples/s or tokens/s, train/val loss, reward/preference accuracy, KL divergence, MFU	chosen vs rejected reward margin, chosen/rejected rewards, router aux/z loss during alignment, expert saturation under small-batch tuning
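For the router metrics above, here is a minimal pure-Python sketch of an auxiliary load-balancing loss and a router z-loss. It assumes top-1 routing with the Switch Transformer-style form N * sum(f_i * P_i) and the ST-MoE-style z-loss (mean squared log-sum-exp of the router logits); a given training stack may use different variants, and the scaling coefficients are omitted. The function names are hypothetical.

```python
import math


def _softmax(row):
    m = max(row)
    exps = [math.exp(x - m) for x in row]
    s = sum(exps)
    return [e / s for e in exps]


def _logsumexp(row):
    m = max(row)
    return m + math.log(sum(math.exp(x - m) for x in row))


def router_losses(router_logits):
    """router_logits: list of per-token lists, one logit per expert.

    Returns (aux_loss, z_loss, dispatch_fractions), unscaled.
    """
    n_tokens = len(router_logits)
    n_experts = len(router_logits[0])
    probs = [_softmax(row) for row in router_logits]

    # f[i]: fraction of tokens whose top-1 expert is i.
    counts = [0] * n_experts
    for p in probs:
        counts[p.index(max(p))] += 1
    f = [c / n_tokens for c in counts]

    # P[i]: mean router probability assigned to expert i.
    P = [sum(p[i] for p in probs) / n_tokens for i in range(n_experts)]

    # Equals 1.0 under perfectly uniform routing; grows with imbalance.
    aux = n_experts * sum(fi * pi for fi, pi in zip(f, P))
    # Penalizes large router logits, which destabilize training.
    z = sum(_logsumexp(row) ** 2 for row in router_logits) / n_tokens
    return aux, z, f


if __name__ == "__main__":
    print(router_losses([[2.0, 0.0], [0.0, 2.0]]))  # balanced routing
    print(router_losses([[2.0, 0.0], [2.0, 0.0]]))  # all tokens to expert 0
```

The dispatch fractions `f` are also the raw input for a per-layer expert imbalance metric, for example the ratio of the largest fraction to the uniform share 1/N.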

HPC + HFT Analytics Infrastructure

Sector	Common metrics	Less-known / specialist metrics
Virtual Machine Clusters	p50/p99 service latency, jitter, CPU ready time, throughput, packet loss	CPU co-stop, NUMA locality (numa_hit/numa_miss/local_node/other_node), vCPU scheduling contention, latency-sensitivity effectiveness
Baremetal Systems	wire-to-wire latency, p99/p999 jitter, PPS/Mpps, cycles, instructions, LLC-load-misses, branch-misses, RX/TX drops	NUMA locality, interrupt coalescence, CQE compression state, packet pacing, driver extended stats / ring stress
Content Delivery Networking	cache hit ratio, TTFB, origin offload, request rate, error rate, egress bandwidth	shield-layer hit ratio, origin_ttfb, child/parent cache status, regional cache-performance variance
Low-Latency Trade-Execution Networks	one-way latency, RTT, jitter, packet loss, order latency, feed latency	microburst depth, queue/buffer pressure, ECN mark rate, PFC pause events, path asymmetry, PTP offset
Dark Fiber Regional Network Infra	one-way latency, RTT, availability, BER, pre-FEC BER, post-FEC BER	OSNR/ESNR, Q-factor, CD (chromatic dispersion), PMD (polarization mode dispersion), FEC degrade indicators
Quantitative Research + Machine Learning	backtest wall-clock runtime, feed latency, order latency, feature-serving latency, feature-ingestion throughput	training-serving skew, feature inflight vs write-to-store success metrics, feature health/correctness monitoring
Data Analytics + Multivariate Analysis	job duration, stage/task duration, throughput, end-to-end delay, records-consumed-rate, consumer lag, shuffle read/write	spill bytes, skew via task-duration discrepancy or shuffle-read imbalance, straggler share, input-pipeline prefetch effectiveness
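Skew via task-duration discrepancy and straggler share can both be estimated from the per-task durations of one stage. The sketch below is one hedged way to do that; the function name and the 1.5x-median straggler threshold are hypothetical choices, not values taken from any particular engine.

```python
from statistics import median


def skew_report(task_durations, straggler_factor=1.5):
    """Summarize skew for one stage from its per-task durations (seconds)."""
    if not task_durations:
        raise ValueError("need at least one task duration")
    med = median(task_durations)
    if med <= 0:
        raise ValueError("median duration must be positive")

    # Assumed definition: a straggler runs longer than factor * median.
    stragglers = [d for d in task_durations if d > straggler_factor * med]

    return {
        "median_s": med,
        # How much longer the slowest task ran than a typical one.
        "skew_ratio": max(task_durations) / med,
        # Fraction of tasks that are stragglers.
        "straggler_share": len(stragglers) / len(task_durations),
        # Fraction of total task time spent in stragglers.
        "straggler_time_share": sum(stragglers) / sum(task_durations),
    }


if __name__ == "__main__":
    print(skew_report([10, 10, 10, 10, 40]))
```

The same calculation applied to per-task shuffle-read bytes instead of durations gives the shuffle-read imbalance view of skew.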

SLA + SLO Reliability, Telemetry, Alerting

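Predefined latency, error, and throughput SLIs and their error budgets call for burn-rate alerting
Prometheus defines scrape timing controls (scrape_interval, scrape_timeout); Alertmanager defines notification timing controls (group_wait, group_interval, repeat_interval)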
OpenTelemetry defines histograms and exemplars
Apdex is a standard user-satisfaction score
Elastic APM measures application performance traces
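As a short worked example of the Apdex score mentioned above: samples at or below a target threshold T count as satisfied, samples between T and 4T count as tolerating at half weight, and anything slower counts as frustrated. The function name and the sample values are illustrative assumptions.

```python
def apdex(latencies_s, threshold_s):
    """Apdex score in [0, 1] for response times given in seconds."""
    if not latencies_s:
        raise ValueError("need at least one sample")
    satisfied = sum(1 for x in latencies_s if x <= threshold_s)
    tolerating = sum(1 for x in latencies_s if threshold_s < x <= 4 * threshold_s)
    # Frustrated samples (> 4T) contribute zero.
    return (satisfied + tolerating / 2) / len(latencies_s)


if __name__ == "__main__":
    # With T = 0.5 s: two satisfied, two tolerating, one frustrated.
    print(apdex([0.1, 0.4, 0.6, 1.9, 2.5], 0.5))
```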

Sector	Common metrics	Less-known / specialist metrics
SLA + SLO Monitoring, Telemetry, Alerting Infra	availability SLI, latency SLI, error rate, throughput, Apdex, error-budget burn rate	multi-window multi-burn-rate alerts, scrape_duration / scrape_timeout ratio, group_wait / group_interval / repeat_interval, histogram bucket design, exemplars
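Burn rate and the multi-window check can be sketched in a few lines. Burn rate is the observed error rate divided by the error budget (1 minus the SLO target), so a value of 1 spends the budget exactly over the SLO period. The function names and window pairing here are assumptions; the 14.4 threshold is the commonly cited value for a one-hour window against a 30-day SLO and should be tuned per service.

```python
def burn_rate(errors: int, total: int, slo_target: float) -> float:
    """Error-budget burn rate for one window; 1.0 means on-budget."""
    budget = 1.0 - slo_target
    if budget <= 0:
        raise ValueError("slo_target must be below 1.0")
    if total == 0:
        return 0.0
    return (errors / total) / budget


def should_page(long_window, short_window, slo_target, threshold=14.4):
    """Fire only if both windows exceed the threshold.

    long_window / short_window: (errors, total) request counts.
    The long window shows the burn is significant; the short window
    shows it is still happening, which lets the alert reset quickly.
    """
    return (
        burn_rate(*long_window, slo_target) >= threshold
        and burn_rate(*short_window, slo_target) >= threshold
    )


if __name__ == "__main__":
    slo = 0.999
    print(burn_rate(20, 1000, slo))
    print(should_page((20, 1000), (3, 100), slo))   # still burning
    print(should_page((20, 1000), (0, 100), slo))   # already recovered
```

A multi-burn-rate setup repeats the same check with several (window pair, threshold) combinations, for example a fast pair that pages and a slower pair that opens a ticket.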

Shared Analytics of Interest

Metric	Signal	Typical domains
Queue time	Separates saturation from raw compute/network slowness	LLM serving, API gateways, load balancers, HFT
Prefill vs decode split	Distinguishes prompt-processing bottlenecks from token-generation bottlenecks	LLM GPU serving
Prefix/KV cache hit rate	Direct proxy for avoidable recompute and TTFT improvement	LLM serving, agentic systems
Auxiliary MoE loss and router z-loss	Early warning for expert imbalance and routing instability	MoE training
CPU ready / co-stop / NUMA miss	Often the real cause of inconsistent latency in virtualized clusters	VM-based HFT / HPC
Microburst depth and ECN/PFC behavior	Reveals congestion that average bandwidth hides	Low-latency Ethernet / RoCE fabrics
OSNR / CD / PMD / pre-FEC BER	Core optical-health indicators long before full link failure	Dark fiber / coherent optics
Spill bytes and consumer lag	Early warning for data-path backpressure and skew	Big-data pipelines
Burn rate and exemplars	Better operational signal than raw alert count or average latency	SLO / observability stacks