Inference Sector KPI SLO

Why should anyone care about KPIs and SLOs?

Generally, nobody does, unless they get paid to deal with OBS/Tel and SLA conformance. Except that the modern world operates within the bounds of a diverse, and often not-so-fascinating, array of performance indicators and service operating metrics.

So, here are two terms for today:

  • KPI == Key Performance Indicator
  • SLO == Service Level Objective

Minor Backstory - Today

While aggregating directory content from several workstations into a unified NFS remote mount (can't simply have one workstation, right? Right), another encounter with SLO definitions appeared on the terminal.

This time it's for LLM Inference customers and their typical requirements. Perhaps our standards repository would be helpful as an adjunct resource:

  • https://github.com/yukon-systems/YukonSYS-Standard-Definitions

Inference Service KPIs by Sector

These KPIs extend a distributed cloud services architecture with a heavy focus on baremetal, plus VMs and containers throughout. There are many ways to solve the problems, scale the infra, and so on.

SLO Additions to the Classifier Repo

  • This file is a compact companion to the observability + telemetry glossary.
  • Each threshold is a starting point and should be tightened to real-world workload SLOs after baselining standard operations.

| Sector | High-signal KPIs | Default alert style |
|---|---|---|
| LLM Inference (GP-GPU Service Infra) | TTFT, ITL/TPOT, E2E, QTS | SLO-derived or baseline-relative; baseline-relative |
| LLM Inference (API Service Infra) | P95 LAT, ERR, 429R, UST | SLO-derived |
| LLM Inference (Network Hardware + Protocol Infra) | OWD, PDV, LOSS, ECN | baseline-relative; budget-relative |
| LLM Inference (Prompt Caching, Compute + Re-Compute) | PCHR, CTR, TTFTR, RCR | absolute for cache-eligible traffic |
| LLM Inference (Prompt Caching, Storage Infra) | CHR, CLAT, COCC, EVR | absolute; absolute for cache-eligible traffic |
| LLM Inference (Prompt Caching API + Load-Balancers) | RCHR, LQ, URT, RTR | absolute |
| LLM Training (bulk initial datasets) | ITPS, DHR, UDR, CONT | absolute; capacity-envelope |
| LLM Training (pre-training MoE) | MFU, STEP, A2AS, EIR | absolute starting point; baseline-relative |
| LLM Training (post-training MoE) | TPS, MFU, RACC, RMAR | absolute starting point; baseline-relative |
| HPC - High Frequency Trading - VM Clusters | RDY, CSTP, NUMA MISS, T2D | absolute starting point |
| HPC - High Frequency Trading - Baremetal | W2W, JITR, CPM, MPKI | baseline-relative; budget-relative |
| HPC - High Frequency Trading - CDN | CHR, OOR, TTFB, OTTFB | absolute |
| HPC - High Frequency Trading - Low-Latency Exec | OWD, JIT, FLAT, OLAT | baseline-relative; budget-relative |
| HPC - Dark Fiber Regional Network + Infra | AVAIL, OWD, Pre-FEC BER, OSNRM | SLO-derived; budget-relative |
| HPC - Quantitative Research + Machine Learning | BTT, FLAT, TSS, FFL | absolute starting point; baseline-relative |
| HPC - Big-Data & Multivariate Pattern Analysis | JDUR, SKR, SPILL, CLAG | SLA-derived; absolute |
| SLA + SLO Monitoring, Telemetry, Alerting Infra | SLI-AV, SLI-LAT, EBR, UP | SLO-derived |

Global Policy Adjustments

  • baseline_window:
    • 7d median or p50/p95 baseline unless otherwise stated
  • slo_policy:
    • page: error budget burn rate > 14.4 over 1h and 5m, or > 6 over 6h and 30m
    • ticket: error budget burn rate > 1 over 3d and 6h
  • capacity_policy:
    • warn: sustained > 80% of validated steady-state capacity
    • critical: sustained > 90% of validated steady-state capacity or latency inflects above SLO
  • regression_policy:
    • warn: > 1.10x to 1.25x baseline depending on sector sensitivity
    • critical: > 1.20x to 1.50x baseline depending on sector sensitivity
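
The slo_policy entries above can be read as a multiwindow burn-rate check. The following is a minimal sketch, not the repository's implementation. It assumes that burn rate means the observed error ratio divided by the error budget (1 minus the SLO target), and that "over 1h and 5m" means both windows must exceed the threshold before the rule fires. The names `BurnRule`, `burn_rate`, and `evaluate` are hypothetical, as is the idea that per-window error ratios arrive as a dict keyed by window label.

```python
from dataclasses import dataclass
from typing import Dict, Optional


@dataclass(frozen=True)
class BurnRule:
    severity: str
    threshold: float
    long_window: str
    short_window: str


# Thresholds and window pairs copied from slo_policy; ordered most severe first.
RULES = (
    BurnRule("page", 14.4, "1h", "5m"),
    BurnRule("page", 6.0, "6h", "30m"),
    BurnRule("ticket", 1.0, "3d", "6h"),
)


def burn_rate(error_ratio: float, slo_target: float) -> float:
    """Observed error ratio divided by the error budget (1 - SLO target)."""
    budget = 1.0 - slo_target
    if budget <= 0:
        raise ValueError("slo_target must be below 1.0")
    return error_ratio / budget


def evaluate(error_ratios: Dict[str, float], slo_target: float) -> Optional[str]:
    """Return "page", "ticket", or None for the given per-window error ratios.

    Assumption: a rule fires only when BOTH its long and short windows
    exceed the threshold, so a brief spike alone does not page.
    """
    for rule in RULES:
        long_br = burn_rate(error_ratios[rule.long_window], slo_target)
        short_br = burn_rate(error_ratios[rule.short_window], slo_target)
        if long_br > rule.threshold and short_br > rule.threshold:
            return rule.severity
    return None


if __name__ == "__main__":
    # Hypothetical measurements for a 99.9% SLO (error budget of 0.1%).
    sustained = {"5m": 0.02, "30m": 0.02, "1h": 0.02, "6h": 0.002, "3d": 0.0005}
    print(evaluate(sustained, slo_target=0.999))
```

Requiring the short window as well as the long one is what lets the alert clear quickly once the error rate recovers, instead of staying raised until the long window drains.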

Reference Considerations

  • Thresholds are starting operational thresholds, not universal laws.
  • Where a standards or vendor reference provides an acceptable range, that range is used.
  • Where no universal value exists, thresholds are either SLO-derived or baseline-relative (typically against a 7d median or a validated capacity envelope).
  • For latency-sensitive sectors, alert on percentile regressions and budget exhaustion, not averages.
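
To illustrate the last two points, here is a small sketch of a baseline-relative check on a latency percentile. It is an illustration under stated assumptions: the baseline is the median of seven daily p95 values, the warn and critical ratios of 1.10x and 1.20x are the lower ends of the regression_policy ranges (a sector may pick other values within those ranges), and the helper names and sample latencies are hypothetical.

```python
import math
import statistics


def percentile(samples, pct):
    """Nearest-rank percentile of a non-empty list; pct is an integer in 1..100."""
    ordered = sorted(samples)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


def classify_regression(current, baseline, warn_ratio=1.10, critical_ratio=1.20):
    """Compare a current value to its baseline and return ok, warn, or critical.

    The default ratios are assumptions: the lower ends of the
    regression_policy ranges above.
    """
    ratio = current / baseline
    if ratio > critical_ratio:
        return "critical"
    if ratio > warn_ratio:
        return "warn"
    return "ok"


if __name__ == "__main__":
    # Hypothetical daily p95 latencies (ms) for the last 7 days; one bad day.
    daily_p95 = [100, 102, 98, 101, 250, 99, 100]
    baseline_p95 = statistics.median(daily_p95)  # the median ignores the outlier day

    # Hypothetical current window: most requests got faster, the tail got slower.
    current = [90] * 90 + [190] * 10
    current_mean = statistics.mean(current)
    current_p95 = percentile(current, 95)

    print("mean:", current_mean, classify_regression(current_mean, baseline_p95))
    print("p95:", current_p95, classify_regression(current_p95, baseline_p95))
```

In this made-up window the mean stays at the baseline while the p95 nearly doubles, which is the case an average-based alert would miss.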