Inference Sector KPI SLO
Observability & Telemetry :: SLO Definition Additions
Why should anyone care about KPIs and SLOs?
Generally, nobody does, unless they get paid to deal with observability/telemetry and SLA conformance. Except that the modern world operates within the bounds of a diverse, and often not-so-fascinating, array of performance indicators and service operating metrics.
So, here are two for today.
- KPI == Key Performance Indicator
- SLO == Service Level Objective
Minor Backstory - Today
While aggregating directory content from several workstations into a unified NFS remote mount (can't simply have one workstation, right? right), another encounter with SLO definitions appeared on the terminal.
This time it's for LLM Inference customers and their typical requirements. Perhaps our standards repository would be helpful as an adjunct resource:
- https://github.com/yukon-systems/YukonSYS-Standard-Definitions
Inference Service KPIs by Sector
The sectors below extend a distributed cloud services architecture with a heavy focus on baremetal, with VMs and containers throughout. There are many ways to solve these problems and scale the infra.
SLO Additions to the Classifier Repo
- This file is a compact companion to the observability + telemetry glossary.
- Each threshold is a starting point and should be tightened to real-world workload SLOs after baselining standard operations.
| Sector | High-signal KPIs | Default alert style |
|---|---|---|
| LLM Inference (GP-GPU Service Infra) | TTFT, ITL/TPOT, E2E, QTS | SLO-derived or baseline-relative; baseline-relative |
| LLM Inference (API Service Infra) | P95 LAT, ERR, 429R, UST | SLO-derived |
| LLM Inference (Network Hardware + Protocol Infra) | OWD, PDV, LOSS, ECN | baseline-relative; budget-relative |
| LLM Inference (Prompt Caching, Compute + Re-Compute) | PCHR, CTR, TTFTR, RCR | absolute for cache-eligible traffic |
| LLM Inference (Prompt Caching, Storage Infra) | CHR, CLAT, COCC, EVR | absolute; absolute for cache-eligible traffic |
| LLM Inference (Prompt Caching API + Load-Balancers) | RCHR, LQ, URT, RTR | absolute |
| LLM Training (bulk initial datasets) | ITPS, DHR, UDR, CONT | absolute; capacity-envelope |
| LLM Training (pre-training MoE) | MFU, STEP, A2AS, EIR | absolute starting point; baseline-relative |
| LLM Training (post-training MoE) | TPS, MFU, RACC, RMAR | absolute starting point; baseline-relative |
| HPC - High Frequency Trading - VM Clusters | RDY, CSTP, NUMA MISS, T2D | absolute starting point |
| HPC - High Frequency Trading - Baremetal | W2W, JITR, CPM, MPKI | baseline-relative; budget-relative |
| HPC - High Frequency Trading - CDN | CHR, OOR, TTFB, OTTFB | absolute |
| HPC - High Frequency Trading - Low-Latency Exec | OWD, JIT, FLAT, OLAT | baseline-relative; budget-relative |
| HPC - Dark Fiber Regional Network + Infra | AVAIL, OWD, Pre-FEC BER, OSNRM | SLO-derived; budget-relative |
| HPC - Quantitative Research + Machine Learning | BTT, FLAT, TSS, FFL | absolute starting point; baseline-relative |
| HPC - Big-Data & Multivariate Pattern Analysis | JDUR, SKR, SPILL, CLAG | SLA-derived; absolute |
| SLA + SLO Monitoring, Telemetry, Alerting Infra | SLI-AV, SLI-LAT, EBR, UP | SLO-derived |
Global Policy Adjustments
- baseline_window: 7d median or p50/p95 baseline unless otherwise stated
- slo_policy:
  - page: error budget burn rate > 14.4 over 1h and 5m, or > 6 over 6h and 30m
  - ticket: error budget burn rate > 1 over 3d and 6h
- capacity_policy:
  - warn: sustained > 80% of validated steady-state capacity
  - critical: sustained > 90% of validated steady-state capacity or latency inflects above SLO
- regression_policy:
  - warn: > 1.10x to 1.25x baseline depending on sector sensitivity
  - critical: > 1.20x to 1.50x baseline depending on sector sensitivity
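The slo_policy thresholds above pair a long window with a short window, so an alert fires only when the burn is both sustained and still happening. A minimal sketch of that evaluation follows, assuming burn rate is defined as the observed error ratio divided by the error budget (1 minus the SLO target). The names `WindowPair`, `POLICY`, `burn_rate`, and `evaluate` are hypothetical and not part of the standards repository, and the per-window error ratios are assumed to come from whatever metrics backend is in use.

```python
from dataclasses import dataclass
from typing import Dict, Optional


@dataclass(frozen=True)
class WindowPair:
    """One long/short window pair from slo_policy (hypothetical structure)."""
    long_window: str
    short_window: str
    threshold: float
    severity: str


# Thresholds copied from slo_policy; ordered from most to least urgent.
POLICY = (
    WindowPair("1h", "5m", 14.4, "page"),
    WindowPair("6h", "30m", 6.0, "page"),
    WindowPair("3d", "6h", 1.0, "ticket"),
)


def burn_rate(error_ratio: float, slo_target: float) -> float:
    """Error ratio divided by the error budget (1 - SLO target)."""
    budget = 1.0 - slo_target
    if budget <= 0:
        raise ValueError("slo_target must be below 1.0")
    return error_ratio / budget


def evaluate(error_ratios: Dict[str, float], slo_target: float) -> Optional[str]:
    """Return 'page', 'ticket', or None for the given per-window error ratios.

    A pair fires only when BOTH its long and short windows exceed the threshold.
    """
    for pair in POLICY:
        long_burn = burn_rate(error_ratios[pair.long_window], slo_target)
        short_burn = burn_rate(error_ratios[pair.short_window], slo_target)
        if long_burn > pair.threshold and short_burn > pair.threshold:
            return pair.severity
    return None


# Example: a 99.9% SLO with a 2% error ratio in every window (burn rate near 20).
windows = {"5m": 0.02, "30m": 0.02, "1h": 0.02, "6h": 0.02, "3d": 0.02}
print(evaluate(windows, slo_target=0.999))
```

For intuition, and assuming a 30-day SLO period (not stated in the policy), a burn rate of 14.4 held for one hour consumes 14.4 / 720 = 2% of the period's error budget.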
Reference Considerations
- Thresholds are starting operational thresholds, not universal laws.
- Where a standards or vendor reference provides an acceptable range, that range is used.
- Where no universal value exists, thresholds are either SLO-derived or baseline-relative (typically against a 7d median or a validated capacity envelope).
- For latency-sensitive sectors, alert on percentile regressions and budget exhaustion, not averages.
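To illustrate the last point, here is a small sketch comparing a mean-based check with a p95-based check against the regression_policy factors. It assumes the lower-bound factors (warn above 1.10x, critical above 1.20x) and a nearest-rank percentile; the function names and the sample latencies are made up for illustration.

```python
import math
from statistics import mean
from typing import Sequence


def percentile(samples: Sequence[float], pct: int) -> float:
    """Nearest-rank percentile: smallest value with at least pct% of samples at or below it."""
    if not samples:
        raise ValueError("samples must not be empty")
    ordered = sorted(samples)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


def classify_regression(current: float, baseline: float,
                        warn_factor: float = 1.10,
                        critical_factor: float = 1.20) -> str:
    """Classify current/baseline against regression_policy-style factors (assumed defaults)."""
    ratio = current / baseline
    if ratio > critical_factor:
        return "critical"
    if ratio > warn_factor:
        return "warn"
    return "ok"


# Illustrative latencies in ms: the baseline is flat, the current window has a slow tail.
baseline_ms = [100] * 100
current_ms = [100] * 94 + [150] * 6

by_mean = classify_regression(mean(current_ms), mean(baseline_ms))
by_p95 = classify_regression(percentile(current_ms, 95), percentile(baseline_ms, 95))
print(by_mean, by_p95)
```

In this made-up sample the mean moves by only 3%, while the p95 moves by 50%, which is why the tail-focused check is the one that alerts.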