Validating E2ET via LTR

Outage Testing in Production? No. Make it Stable, make it Sane.


A lot of chatter on the interwebs lately about "state surveillance", where "both sides" are relentlessly attempting to remove all End-to-End Encryption (aka E2EE), but that is not what this post is about, and this post is not about politics.

Also a lot of talk about GitHub being compromised and having 3,000+ internal repos exfiltrated. Surely GitHub has 100% Code Coverage Testing and End-to-End Test (Work)Flows, right? No, they do not. Anyway..

This post is focused on E2ET (End-to-End-Testing) of application service roles and machine provisioning automation. So, let's start by seeing just exactly wtf I'm referring to with testing in the first place.


CI Systems - Failure Mode Example

In the reporting system here we use a host type definition of K10. The name is mostly arbitrary, and the story behind it is an irrelevant tangent. In this infrastructure it serves as an example of a standalone system which requires specific attention to failure state definitions and stateful analysis of gated operations, and which is used for iterative, loop-based resolution testing.

The K10 host is assigned to a role which features CPU-native, architecture-optimized compiler flags (cpuid2cpuflags). We never use generic kernels, or generic machines, because efficiency and optimization in the industry is basically a broken afterthought and it gets old after a while: dealing with non-optimized, non-attentive systems engineering approaches featuring inefficient broken theories based on the shrug-method, "just make it auto-scaling and put more kubes or something?"


Network Booting the Baseline OS

The K10 baseline system boots over the network using iPXE/HTTPv4, while other devices require chainload PXE/TFTP -> iPXE/HTTPv4 boot images. This facilitates a system of host-based discovery and domain authentication via initrd embedded kernel modules + connection parameters for FreeIPA/SSSD for RBAC + AAA using MAC and PKE enrollment. We also do this for FreeBSD hosts, with some minor differences.

The K10 is also connected to local-rack PDUs with remote power cycling via either ACPI signal for "soft shutdown" or via its switched-outlet port. The PDUs are controlled by the Command-&-Control Server using SNMPv3 user + group authentication via its infrastructure management LAN connection with an auth-hook to FreeIPA via RADIUS.

Let's See Some Reports!

Having covered some blathering about provisioning systems, we can move on to agent-based querying for SITREP, which is executed on our in-house LLM structured query execution tooling. That whole chain of events required writing a new "Agentic Forge Control Plane", which was enjoyable most of the time... and it has become a baseline distributed comms system for "Project Coherent Storage" which is currently managing a bit over 1PB of data on OpenZFS.

Earlier this year I paused for some PTO and rewrote all of my formerly Ansible driven and automated Infrastructure as Code workflows to support an upcoming revision to our dual-continent presence at six datacenter locations. It's occasionally and very tentatively being referred to as "minimalist-maxxing" which is mostly pure nonsense where grammatical excellence is concerned, but ultimately accomplishes the following:

  • Gentoo w/ Optimized Stage4, LLVM/Clang, OpenRC
  • Minimalist service containers using our Stage4 baseline, with service roles layered on top. Try it out:

No, these are not Kubernetes, not Helm, not Cloudslop. This sometimes translates into Infrastructure as a Service, enabling event-driven architecture with hybrid-local services for LLM predictive action controls.

That's a lot of words to say, "I make the code do the things with the signals processing and patterns and stuff." Wait wait.. no, this is not going to cover anything about "N~less Computing".

Stop calling anything "Serverless Computing" or "Whatever-less Barf".

[host K10]: currently has transient AAA validation, not E2ET pass. The AP7901 reboot proved the current netboot rootfs does not persist SSSD/IPA state.

Under this definition, K10 must rebuild rootfs or disk-install with aaa-domain-client, reboot, then pass the full post-boot E2ET suite before it can unblock xen-sun99-x12spl-099108.rfc1918.host reimage confidence.

Standard Example 'K10 Automation Report'

# E2ET Definition
End to End Testing / E2ET should mean: a reproducible host acceptance pipeline that proves a machine can move from inventory intent to installed, rebooted, validated, scored, documented, and release-eligible state.

For RFC99, a host is not E2ET-passed just because we fixed it live. It passes only after the fix is represented in source-of-truth and survives the full lifecycle.

## Policy Definition
For validation hosts like `K10`:

1. Fix live only to recover evidence or confirm root cause.
2. Backport the fix into repo-owned profile, role, manifest, package list, NetBox/DNS/IPAM metadata, and docs.
3. Reinstall or rebuild/reboot through the intended provisioning path.
4. Run post-boot validation.
5. Generate a conformance report.
6. Only then mark the host RC/GA-capable.

> A one-off live fix can be marked transient validated, but never E2ET passed.
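The distinction between transient validated and E2ET passed can be sketched as a small classifier. This is a minimal illustration, not the actual RFC99 tooling: the `HostRecord` fields, the `Status` names, and the `classify` helper are all hypothetical and simply mirror the six policy steps above.

```python
from dataclasses import dataclass
from enum import Enum


class Status(Enum):
    UNVALIDATED = "unvalidated"
    TRANSIENT_VALIDATED = "transient-validated"
    E2ET_PASSED = "e2et-passed"


@dataclass
class HostRecord:
    """Hypothetical per-host evidence flags, one per policy step."""
    name: str
    fixed_live: bool = False            # step 1: live fix to recover evidence
    backported: bool = False            # step 2: fix lives in source-of-truth
    reprovisioned: bool = False         # step 3: reinstall or rebuild/reboot
    post_boot_validated: bool = False   # step 4: post-boot validation ran clean
    report_generated: bool = False      # step 5: conformance report exists


def classify(host: HostRecord) -> Status:
    """A live fix alone never yields E2ET_PASSED; the full lifecycle must hold."""
    lifecycle = (
        host.backported,
        host.reprovisioned,
        host.post_boot_validated,
        host.report_generated,
    )
    if all(lifecycle):
        return Status.E2ET_PASSED
    if host.fixed_live:
        return Status.TRANSIENT_VALIDATED
    return Status.UNVALIDATED


def release_eligible(host: HostRecord) -> bool:
    """Step 6: only an E2ET-passed host may be marked RC/GA-capable."""
    return classify(host) is Status.E2ET_PASSED
```

With only `fixed_live=True`, a host such as `K10` classifies as transient validated and `release_eligible` stays false until every lifecycle flag is set.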

## E2ET Stages

### Host Pipeline

1. **Inventory Gate**: NetBox/IPAM/DNS/Ansible host metadata matches expected MACs, interfaces, service IPs, PDU mapping, console path, boot protocol.
2. **Provisioning Gate**: iPXE or PXE-chainload path works, installer runs, ZFSBootMenu/bootfs/rootfs pools validate, OS install completes.
3. **First Boot Gate**: kernel cmdline, serial console, hostname/FQDN, SSH, time sync, logging, package profile, OpenRC services.
4. **Platform Gate**: CPU model, RAM, disks, NVDIMM/Optane if present, sysfs, kernel modules, firmware, CVE mitigation status.
5. **Network Gate**: management interface, service interfaces, VLANs, routes, DNS forward/reverse, NetBox consistency, nmap readiness.
6. **Storage Gate**: ZFS pools, NFSv3/v4, NFS-RDMA, Ceph, iSER, iSCSI, sshfs as applicable.
7. **Identity Gate**: FreeIPA/SSSD/PAM/SSH keys, UID/GID consistency, sudo policy, offline cache, break-glass account.
8. **Service Gate**: required services running, no unexpected failed services, role-specific smoke/functional checks.
9. **Performance Gate**: CPU, memory, disk, network, build/distributed compile benchmarks against baseline.
10. **Conformance Report**: JSON + Markdown + optional JUnit output with hard-gate pass/fail and weighted score.
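One way to picture the host pipeline is as an ordered list of gate checks feeding a report. The sketch below is an assumption-heavy illustration: the gate identifiers, the `run_pipeline` and `to_markdown` helpers, and the choice to skip later gates after the first failure are hypothetical, and each real check would be far more involved than a boolean callable.

```python
import json
from typing import Callable, Dict, List, Optional

# Hypothetical identifiers for the nine check gates, in pipeline order.
# The tenth stage (the conformance report) is the output of this code.
GATE_ORDER: List[str] = [
    "inventory", "provisioning", "first_boot", "platform", "network",
    "storage", "identity", "service", "performance",
]


def run_pipeline(host: str, checks: Dict[str, Callable[[], bool]]) -> dict:
    """Run gates in order; after the first failure, later gates are skipped."""
    results = []
    blocked_by: Optional[str] = None
    for gate in GATE_ORDER:
        if blocked_by is not None:
            results.append({"gate": gate, "status": "skipped"})
            continue
        check = checks.get(gate)
        try:
            # A gate with no registered check counts as a failure.
            ok = check is not None and bool(check())
        except Exception:
            ok = False
        results.append({"gate": gate, "status": "pass" if ok else "fail"})
        if not ok:
            blocked_by = gate
    return {
        "host": host,
        "passed": blocked_by is None,
        "blocked_by": blocked_by,
        "gates": results,
    }


def to_markdown(report: dict) -> str:
    """Render the same report as a small Markdown table."""
    lines = [
        f"# Conformance Report: {report['host']}",
        "",
        "| Gate | Status |",
        "| --- | --- |",
    ]
    lines += [f"| {g['gate']} | {g['status']} |" for g in report["gates"]]
    return "\n".join(lines)


# Example: every gate passes except identity (e.g. SSSD state lost on reboot).
checks = {gate: (lambda: True) for gate in GATE_ORDER}
checks["identity"] = lambda: False
report = run_pipeline("K10", checks)
print(json.dumps(report, indent=2))
print(to_markdown(report))
```

Stopping at the first failed gate keeps the report honest about what was never exercised; a variant that runs every gate regardless would trade that for more diagnostic data per run.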

### Scoring Model
Use hard gates plus weighted scoring.

### Hard Fails
  - Cannot boot.
  - Cannot SSH via management path.
  - IPAM/DNS mismatch for primary identity.
  - Root pool invalid.
  - SSSD missing on an AAA-required profile.
  - Required service failed.

### Score Zones
  - p60: minimally usable, not release candidate.
  - p80: acceptable lab host.
  - p90: RC candidate.
  - p95: GA for normal infra.
  - p99: production-critical or rebuild-template quality.
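A rough sketch of how hard gates and weighted scoring might combine follows. Several things here are assumptions rather than the report's actual model: the finding identifiers, the per-gate weights, the reading of `p60` through `p99` as minimum scores on a 0 to 100 scale, and the `score_host` function itself.

```python
from typing import Dict, Iterable, Optional

# Hypothetical identifiers for the hard-fail conditions listed above.
HARD_FAILS = frozenset({
    "cannot_boot",
    "no_ssh_mgmt",
    "ipam_dns_mismatch",
    "root_pool_invalid",
    "sssd_missing",
    "required_service_failed",
})

# Assumed interpretation: each zone label is a minimum score out of 100.
ZONES = [(99.0, "p99"), (95.0, "p95"), (90.0, "p90"), (80.0, "p80"), (60.0, "p60")]


def score_host(
    gate_scores: Dict[str, float],
    weights: Dict[str, float],
    findings: Iterable[str] = (),
) -> dict:
    """Weighted score in [0, 100]; any hard fail removes the zone entirely."""
    total_weight = sum(weights.values())
    if total_weight <= 0:
        raise ValueError("weights must sum to a positive number")
    hard = sorted(set(findings) & HARD_FAILS)
    # Gates without a recorded score contribute 0.0 to the weighted sum.
    weighted = sum(w * gate_scores.get(g, 0.0) for g, w in weights.items())
    score = 100.0 * weighted / total_weight
    zone: Optional[str] = None
    if not hard:
        zone = next((label for floor, label in ZONES if score >= floor), None)
    return {"score": score, "zone": zone, "hard_fails": hard}


# Example with made-up weights and per-gate scores in [0, 1].
weights = {"network": 2.0, "storage": 1.0, "performance": 1.0}
scores = {"network": 1.0, "storage": 1.0, "performance": 0.5}
print(score_host(scores, weights))
print(score_host(scores, weights, findings=["sssd_missing"]))
```

The second call shows the point of the hard gate: the numeric score is unchanged, but the host gets no zone at all while `sssd_missing` is present.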

## Policy Distinctions
While there is overlap with "conformance tiers", the score zones are functionally separate from "statistical confidence" until we have sufficient historical run data to compute real-number percentiles and repeatable, data-driven, statistically significant probability assessments.
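To make the separation concrete, here is a minimal sketch of a percentile helper that refuses to answer until enough history exists. The `MIN_RUNS` threshold of 30 and the `percentile_rank` function are hypothetical; the report does not say how many runs count as sufficient.

```python
from typing import Optional, Sequence

MIN_RUNS = 30  # hypothetical threshold for "sufficient historical run data"


def percentile_rank(
    history: Sequence[float],
    score: float,
    min_runs: int = MIN_RUNS,
) -> Optional[float]:
    """Empirical percentile of `score` within past run scores.

    Returns None while history is too thin, so callers fall back to the
    fixed conformance zones instead of reporting a fake percentile.
    """
    if len(history) < min_runs:
        return None
    at_or_below = sum(1 for past in history if past <= score)
    return 100.0 * at_or_below / len(history)
```

Returning `None` rather than a number keeps the two concepts from being confused in reports: a zone label is a policy tier, while a percentile only appears once the data can support it.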

### K10 Applied Meaning
The host definition for `K10` currently has transient AAA (not AAA DNS: `"ACK & AGREE & APPROVE"`) validation, not E2ET pass.

#### **Actioned Event**
Its connected PDU port *(SKU: APC AP7901)* reboot proved the current `netboot rootfs` does not persist `SSSD/IPA` state. 

#### **Action Rephase**
Host definition `K10` requires a rebuild of rootfs or disk-install with `aaa-domain-client`, another reboot, then pass the full `Post-Boot E2ET` test suite before it can unblock dependency-trees. 

Otherwise K10 risks being labeled a permanent `reimage-processing confidence blocker`, which leads to hardware decommission.

## K10 Dead-Reckoning
Previously we defined the `RFC99` & `SUN99` workflow as `K10 Host E2ET Acceptance Pipeline v0.1`. For this we'll re-implement the process using block-notation in Ansible + Python with the new report tooling, and enable `cicd-rfc99-jenkins-099199` to orchestrate it once the checks are stable.