Performance & Regressions -- How Orgs Ship Code to Production

See that image?[1] Right around the mid-section we find a common problem with the manner in which code is shipped to production: one small box, displayed as a subsection of “build & packaging,” happily labeled “regression + performance testing.” Whether the product is a user-facing site, a middleware component, or the bare-metal systems providing the cloud resources we all know and (sometimes) love, its success is gated by the org’s ability to prioritize testing and automation for “Performance XOR Regression” (some orgs treat the two as one discipline, others as separate).

So we see Performance and Regression rendered as practically a footnote, which is representative of the present landscape: many orgs don’t understand regression at scale vs. regression in labs vs. regression modeling. Most concerning, too many orgs fail to issue management directives that mandate full Perf+Reg test coverage. Instead, owing to that lack of understanding, many orgs expect their engineering teams to miraculously “make time” within fully allocated and often over-subscribed schedules. That is not possible, and when production suffers, we all suffer. (Insert the meme of a person riding a bike, then falling from their own mistake, wincing in pain on the ground and asking, “Why would the code do this to us?”)

A second common misconception within engineering (eng teams and their management structures alike) is that Perf/Reg testing exists solely in that one stage. This single misconception often creates an array of new and unnecessary tech debt while concurrently perpetuating the tech debt that already exists.
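To make that concrete, one lightweight way to run the same regression check at multiple pipeline stages (pre-merge, post-build, pre-deploy) is a shared threshold gate. The benchmark names, baseline numbers, and threshold below are all hypothetical; this is a minimal sketch, not a prescription:

```python
# Hypothetical regression gate, callable from any pipeline stage.
# Baselines and the 10% threshold are illustrative assumptions.

BASELINE_MS = {"login": 120.0, "search": 340.0, "checkout": 510.0}
THRESHOLD = 1.10  # fail if a benchmark slows by more than 10% vs. baseline


def regression_gate(results_ms: dict) -> list:
    """Return the names of benchmarks that regressed past the threshold
    (a missing result also counts as a failure)."""
    failures = []
    for name, baseline in BASELINE_MS.items():
        measured = results_ms.get(name)
        if measured is None or measured > baseline * THRESHOLD:
            failures.append(name)
    return failures


# Example: 'search' slowed from 340 ms to 400 ms, past the 374 ms limit.
print(regression_gate({"login": 118.0, "search": 400.0, "checkout": 505.0}))
# -> ['search']
```

Because the gate is just a function over measured numbers, the same logic can run against a laptop micro-benchmark pre-merge and against a staging-cluster run pre-deploy, which is exactly the multi-stage coverage the single-stage mindset misses.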

Ok. So who is not failing at Performance and Regression testing? We can look to the ground-breaking engineering efforts behind projects that have defined the human race: Voyager 1 and 2, the James Webb Space Telescope, CERN’s particle accelerators, and many others. What’s the difference? Time and planning, adherence to standards, routine audits, change control, and more.

How about simpler projects? Do they attend to P&R as well as JWST or CERN or Los Alamos? Unfortunately not. Over nearly a quarter century of focus on this and related topics, the strong majority of orgs I’ve had the pleasure of working with have engaged in “Spectrum Testing” rather than “Binary Testing” when it comes to performance + regression analysis: treating coverage as a sliding scale and expecting half-measures to be sufficient. They are not, and because this area of engineering requires investment in time, people, and hardware, it often becomes the budgetary line item crossed out at exactly the moments it is most critical. Such is the reality of simple and sometimes willful ignorance. Humans are fallible, expectedly so, but the beautiful thing is that we can change, learn, and improve. How? Through awareness, analysis, and iterative adjustment (just like in engineering).

Investing partially, whether fiscally or in human hours, is not sufficient. Planning these elements for success always requires time, hardware, and people: investment and awareness. No one sends a telescope into a Lagrange-point orbit nearly a million miles from Earth without proper test coverage (this isn’t the Hubble (sorry, sorry, I know it’s not funny)). The reasons are obvious. Yet back on terra, faulty and insufficiently tested code can deploy to a hyperscaler’s cloud with enough impact to knock national and global infrastructure offline, and that code can ship without a second set of eyes? Yep. It happens, but it does not need to happen. We can expect better, and we should. Engineers, end-users, and everyone else deserve better.

Easier said than done, sure, but there are no improvements without intention and action. Orgs must be dedicated to P&R coverage, and they must be receptive to potential improvements and course corrections. That requires management to understand that “performance” does not imply “efficiency,” nor vice versa. Quite simply, gaining any benefit from sufficient test coverage requires that management embrace improvements to the “testing culture” within their org, and ensure that each engineering team’s schedule includes sufficient time/cycles/bandwidth for engagement with org-wide performance engineering directives.
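As one illustration of the performance-vs-efficiency distinction (my example, not drawn from the article): two functions can return the same answer at comparable speed while doing very different amounts of hidden work. Here memory is the efficiency axis, measured with Python’s standard `tracemalloc`:

```python
# Sketch: similar latency ("performance"), different resource use ("efficiency").
import tracemalloc


def sum_squares_list(n: int) -> int:
    # Allocates an n-element list before summing.
    return sum([i * i for i in range(n)])


def sum_squares_gen(n: int) -> int:
    # Streams values into sum(); O(1) extra memory.
    return sum(i * i for i in range(n))


def peak_memory_bytes(fn, n: int) -> int:
    """Peak heap allocation (bytes) observed while running fn(n)."""
    tracemalloc.start()
    fn(n)
    _, peak = tracemalloc.get_traced_memory()
    tracemalloc.stop()
    return peak


n = 100_000
# Same answer, comparable wall-clock time, yet the list version's peak
# memory is far higher: performant, but not efficient.
print(peak_memory_bytes(sum_squares_list, n) > peak_memory_bytes(sum_squares_gen, n))
```

A latency-only test suite would pass both versions equally; only a coverage plan that also tracks efficiency metrics (memory, CPU cycles, I/O) catches the difference, which is why the two terms must stay distinct in management directives.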

[1] LinkedIn image reference: www.linkedin.com/feed/upda…