Performance Engineering - Evaluative Metrics
Today it's LLMs. Yesterday it was CDNs. In years long gone, it was Big Iron lore.
LLM Inference
| Sector | Common metrics | Less-known / specialist metrics |
|---|---|---|
| GP-GPU Service Infrastructure | TTFT, ITL/TPOT, E2E request latency, TPS, RPS/QPS, Queue-Fill, KV/C metrics | request_queue_time, request_prefill_time, request_decode_time, inflight time, SM occupancy, HBM/DRAM memory throughput |
| API Service Infrastructure | p50/p95/p99 latency, streaming TTFB/TTFT, RPS/QPS, error rate, 429 rate, active downstream connections, backend latency | x-envoy-upstream-service-time, upstream_rq_time, TR/Tw/Tc/Tr/Ta timer split in HAProxy, retry rate, queue-slot saturation, circuit-breaker usage |
| Network Hardware + Protocol Infra | RTT (round-trip time), one-way latency, jitter/PDV (packet delay variation), PPS/BPS, packet loss, retransmits | ECN mark rate, PFC (priority flow control) pause events, microburst depth, CQE (completion queue entry) compression state, SymbolErrors, LinkRecoveries, PTP offset/path delay |
| Prompt Caching, Compute + Re-Compute | cache hit rate, cached_tokens, prompt_tokens, generation_tokens, TTFT reduction, cache read/write counts | partial-hit ratio, cache-miss ratio, prefix-cache block hit rate, predicted KV hit rate, per-request routing overhead |
| Prompt Caching, Storage Infra | cache hit ratio, eviction rate, cache latency, TTL, GPU/CPU cache usage | CPU-vs-GPU cache-hit split, priority-based eviction effect, TTL refresh behavior, KV-block fragmentation / block reuse |
| Prompt Caching, API + Load-Balancers | per-route/model hit rate, active and pending requests, backend latency, request rate, retries, error rate | cache-aware routing predicted KV hit rate, routing overhead, queue time before backend selection, circuit-breaker saturation, connection-slot pressure |
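
The prompt-caching rows above rely on a few ratios that are easy to conflate. The following is a minimal sketch, assuming each request is logged as a dict carrying the `prompt_tokens` and `cached_tokens` counters named in the table; the record shape and the function names are hypothetical, not any particular server's API.

```python
from typing import Dict, List


def token_hit_rate(requests: List[Dict[str, int]]) -> float:
    """Fraction of all prompt tokens that were served from cache."""
    total = sum(r["prompt_tokens"] for r in requests)
    cached = sum(r["cached_tokens"] for r in requests)
    return cached / total if total else 0.0


def request_split(requests: List[Dict[str, int]]) -> Dict[str, float]:
    """Share of requests that were full hits, partial hits, or misses."""
    n = len(requests)
    if n == 0:
        return {"full": 0.0, "partial": 0.0, "miss": 0.0}
    full = sum(
        1 for r in requests
        if r["cached_tokens"] > 0 and r["cached_tokens"] == r["prompt_tokens"]
    )
    miss = sum(1 for r in requests if r["cached_tokens"] == 0)
    partial = n - full - miss
    return {"full": full / n, "partial": partial / n, "miss": miss / n}


# Illustrative sample data only.
sample = [
    {"prompt_tokens": 1000, "cached_tokens": 1000},
    {"prompt_tokens": 1000, "cached_tokens": 500},
    {"prompt_tokens": 2000, "cached_tokens": 0},
    {"prompt_tokens": 4000, "cached_tokens": 3000},
]
overall = token_hit_rate(sample)
split = request_split(sample)
```

The token-weighted rate and the per-request split can diverge sharply when a few long prompts dominate traffic, which is one reason to track the partial-hit ratio separately.
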
Acronyms
- TTFT == Time to First Token
- ITL == Inter-Token Latency
- TPOT == Time per Output Token
- E2E == end-to-end
- TPS == tokens/sec
- RPS == requests/sec
- QPS == queries/sec
- KV/C == Key-Value Cache
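
As a rough illustration of how the latency acronyms relate, the sketch below derives them from per-token arrival timestamps for a single streamed request. The `RequestTrace` record is hypothetical, and exact definitions vary between benchmarking tools (for example, whether TPOT includes the first token), so treat the formulas as one common convention rather than a standard.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class RequestTrace:
    """Hypothetical per-request record; all timestamps are in seconds."""
    sent_at: float            # when the client sent the request
    token_times: List[float]  # arrival time of each output token, in order


def ttft(trace: RequestTrace) -> float:
    """Time to First Token: first token arrival minus send time."""
    return trace.token_times[0] - trace.sent_at


def itl(trace: RequestTrace) -> List[float]:
    """Inter-Token Latency: gaps between consecutive output tokens."""
    t = trace.token_times
    return [later - earlier for earlier, later in zip(t, t[1:])]


def tpot(trace: RequestTrace) -> float:
    """Time per Output Token: mean time per token after the first one."""
    n = len(trace.token_times)
    if n < 2:
        return 0.0
    return (trace.token_times[-1] - trace.token_times[0]) / (n - 1)


def e2e(trace: RequestTrace) -> float:
    """End-to-end latency: last token arrival minus send time."""
    return trace.token_times[-1] - trace.sent_at


# Illustrative sample data only.
trace = RequestTrace(sent_at=0.0, token_times=[0.5, 0.75, 1.0, 1.5])
```

Under this convention TPOT is the mean of the ITL samples, so the two differ only in whether you keep the distribution or collapse it to one number.
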
LLM Dataset Training
| Operations | Common metrics | Less-known / specialist metrics |
|---|---|---|
| Pre-loading datasets | ingest throughput (docs/s, rows/s, bytes/s, tokens/s), tokenization throughput, DataLoader throughput, dedup ratio | fuzzy-duplicate rate, semantic-duplicate rate, contamination rate/AUC, consumer lag, shuffle spill bytes |
| Pre-training + MoE | tokens/s, step time, MFU (model FLOPs utilization), training loss/perplexity, all-to-all/dispatch time, GPU memory | auxiliary load-balancing loss, router z-loss, expert capacity factor, token-drop rate, per-layer expert imbalance |
| Post-training + MoE | step time, samples/s or tokens/s, train/val loss, reward/preference accuracy, KL divergence, MFU | chosen vs rejected reward margin, chosen/rejected rewards, router aux/z loss during alignment, expert saturation under small-batch tuning |
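
MFU in the rows above is the ratio of achieved training FLOPs/s to the hardware's theoretical peak. The sketch below uses the common approximation of about 6 FLOPs per parameter per trained token (forward plus backward pass, attention FLOPs ignored); for MoE models it assumes you pass the parameters active per token rather than the total count. The function names and all numbers in the example are illustrative assumptions.

```python
def achieved_flops_per_s(active_params: float, tokens_per_s: float) -> float:
    """Approximate training FLOPs/s.

    Assumption: ~6 FLOPs per active parameter per token (forward + backward),
    ignoring attention FLOPs. This is an estimate, not a measurement.
    """
    return 6.0 * active_params * tokens_per_s


def mfu(
    active_params: float,
    tokens_per_s: float,
    num_devices: int,
    peak_flops_per_device: float,
) -> float:
    """Model FLOPs utilization: achieved FLOPs/s over aggregate peak FLOPs/s."""
    peak = num_devices * peak_flops_per_device
    return achieved_flops_per_s(active_params, tokens_per_s) / peak


# Hypothetical run: 7e9 active parameters, 80,000 tokens/s across 8 devices,
# each with an assumed peak of 1e15 FLOPs/s.
example_mfu = mfu(7e9, 8e4, 8, 1e15)
```

Because tokens/s is just tokens per step divided by step time, a regression in step time shows up directly as a drop in MFU, which makes the two metrics useful to read together.
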
HPC + HFT Analytics Infrastructure
| Sector | Common metrics | Less-known / specialist metrics |
|---|---|---|
| Virtual Machine Clusters | p50/p99 service latency, jitter, CPU ready time, throughput, packet loss | CPU co-stop, NUMA locality (numa_hit/numa_miss/local_node/other_node), vCPU scheduling contention, latency-sensitivity effectiveness |
| Baremetal Systems | wire-to-wire latency, p99/p999 jitter, PPS/Mpps, cycles, instructions, LLC-load-misses, branch-misses, RX/TX drops | NUMA locality, interrupt coalescing, CQE compression state, packet pacing, driver extended stats / ring stress |
| Content Delivery Networking | cache hit ratio, TTFB, origin offload, request rate, error rate, egress bandwidth | shield-layer hit ratio, origin_ttfb, child/parent cache status, regional cache-performance variance |
| Low-Latency Trade-Execution Networks | one-way latency, RTT, jitter, packet loss, order latency, feed latency | microburst depth, queue/buffer pressure, ECN mark rate, PFC pause events, path asymmetry, PTP offset |
| Dark Fiber Regional Network Infra | one-way latency, RTT, availability, BER, pre-FEC BER, post-FEC BER | OSNR/ESNR, Q-factor, CD (chromatic dispersion), PMD (polarization mode dispersion), FEC degrade indicators |
| Quantitative Research + Machine Learning | backtest wall-clock runtime, feed latency, order latency, feature-serving latency, feature-ingestion throughput | training-serving skew, feature inflight vs write-to-store success metrics, feature health/correctness monitoring |
| Data Analytics + Multivariate Analysis | job duration, stage/task duration, throughput, end-to-end delay, records-consumed-rate, consumer lag, shuffle read/write | spill bytes, skew via task-duration discrepancy or shuffle-read imbalance, straggler share, input-pipeline prefetch effectiveness |
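
The "skew via task-duration discrepancy" and "straggler share" entries in the last row can be approximated from per-task durations within one stage. This is a minimal sketch: the max-over-median ratio and the 1.5x-median straggler cutoff are assumptions chosen for illustration, not thresholds defined by any particular engine.

```python
from statistics import median
from typing import Sequence


def skew_ratio(durations: Sequence[float]) -> float:
    """Slowest task duration divided by the median task duration."""
    mid = median(durations)
    return max(durations) / mid if mid else 0.0


def straggler_share(durations: Sequence[float], factor: float = 1.5) -> float:
    """Fraction of tasks slower than `factor` times the median.

    The default factor of 1.5 is an assumed cutoff; tune it per workload.
    """
    if not durations:
        return 0.0
    cutoff = factor * median(durations)
    return sum(1 for d in durations if d > cutoff) / len(durations)


# Illustrative task durations (seconds) for one stage.
stage_durations = [10, 11, 9, 10, 30]
stage_skew = skew_ratio(stage_durations)
stage_stragglers = straggler_share(stage_durations)
```

A high skew ratio with a low straggler share usually points at one hot partition, while a high straggler share suggests a broader resource or placement problem.
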
SLA + SLO Reliability, Telemetry, Alerting
- Predefined latency, error, and throughput SLIs with error budgets are the basis for burn-rate alerting
- Prometheus and Alertmanager define scrape and notification timing controls
- OpenTelemetry defines histograms and exemplars
- Apdex is a standard user-satisfaction score
- Elastic APM measures application performance traces
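
For reference, the Apdex score mentioned above is computed from a target threshold T: samples at or below T count as satisfied, samples above T and up to 4T count as tolerating (half weight), and the rest count as frustrated. The sketch below assumes response times in seconds; the function name and sample values are illustrative.

```python
from typing import Sequence


def apdex(latencies: Sequence[float], target: float) -> float:
    """Apdex = (satisfied + tolerating / 2) / total samples."""
    if not latencies:
        return 0.0
    satisfied = sum(1 for x in latencies if x <= target)
    tolerating = sum(1 for x in latencies if target < x <= 4 * target)
    return (satisfied + tolerating / 2) / len(latencies)


# Illustrative samples with a 0.5 s target:
# three satisfied, one tolerating, one frustrated.
score = apdex([0.1, 0.3, 0.4, 0.9, 2.5], target=0.5)
```
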
| Sector | Common metrics | Less-known / specialist metrics |
|---|---|---|
| SLA + SLO Monitoring, Telemetry, Alerting Infra | availability SLI, latency SLI, error rate, throughput, Apdex, error-budget burn rate | multi-window multi-burn-rate alerts, scrape_duration / scrape_timeout ratio, group_wait / group_interval / repeat_interval, histogram bucket design, exemplars |
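
Burn rate is the observed error ratio divided by the error budget (1 minus the SLO target), so a burn rate of 1 spends the budget exactly over the SLO period. The following is a minimal sketch of a multi-window check, assuming error and request counts per window are already available; the 14.4 threshold is the commonly cited paging value for a one-hour long window against a 30-day, 99.9% SLO, and the function names are hypothetical.

```python
from typing import Tuple

Window = Tuple[int, int]  # (error_count, total_count) observed in the window


def burn_rate(errors: int, total: int, slo: float) -> float:
    """Observed error ratio divided by the error budget (1 - slo)."""
    if total == 0:
        return 0.0
    return (errors / total) / (1.0 - slo)


def should_page(
    long_window: Window,
    short_window: Window,
    slo: float,
    threshold: float = 14.4,
) -> bool:
    """Fire only if both windows exceed the threshold.

    The long window shows the burn is significant; the short window shows
    it is still happening, which helps the alert reset quickly.
    """
    return (
        burn_rate(*long_window, slo) >= threshold
        and burn_rate(*short_window, slo) >= threshold
    )


# Illustrative counts: 2% errors over the long window, 3% over the short one.
paging = should_page((200, 10_000), (30, 1_000), slo=0.999)
recovered = should_page((200, 10_000), (1, 1_000), slo=0.999)
```

In the second call the long window still shows a heavy burn, but the short window has recovered, so the alert stays quiet.
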
Shared Analytics of Interest
| Metric | Signal | Typical domains |
|---|---|---|
| Queue time | Separates saturation from raw compute/network slowness | LLM serving, API gateways, load balancers, HFT |
| Prefill vs decode split | Distinguishes prompt-processing bottlenecks from token-generation bottlenecks | LLM GPU serving |
| Prefix/KV cache hit rate | Direct proxy for avoidable recompute and TTFT improvement | LLM serving, agentic systems |
| Auxiliary MoE loss and router z-loss | Early warning for expert imbalance and routing instability | MoE training |
| CPU ready / co-stop / NUMA miss | Often the real cause of inconsistent latency in virtualized clusters | VM-based HFT / HPC |
| Microburst depth and ECN/PFC behavior | Reveals congestion that average bandwidth hides | Low-latency Ethernet / RoCE fabrics |
| OSNR / CD / PMD / pre-FEC BER | Core optical-health indicators long before full link failure | Dark fiber / coherent optics |
| Spill bytes and consumer lag | Early warning for data-path backpressure and skew | Big-data pipelines |
| Burn rate and exemplars | Better operational signal than raw alert count or average latency | SLO / observability stacks |