Investigating Issues: The Field Of Truth & The Fog Of War
Adelaide, South Australia

In the last post I talked about the role of quality assurance staff in software projects. They examine the documented requirements and report discrepancies between the digital product as specified and the digital product as built. A similar process plays out on projects without formal QA staff, although the next steps can unfold quite differently. Here I would like to focus on what happens once the decision is made to investigate an issue.

I've previously mentioned the concept of a "field of truth" that software developers navigate. It is a field of truth because on many projects several instances (or parallel tracks) of the digital factory are live at one time. These instances may represent different versions, and the deployments may have functional differences for practical reasons (e.g. it is not cost-effective to support every feature on every instance once cloud computing costs are accounted for). In such conditions, understanding the operating context is not as simple as finding 'the' truth - there are multiple truths, they are all correct, and they may all conflict.

The "field of truth" concept is directly relevant to the investigation of issues for a number of reasons.

One is that the sensible trade-offs we make for cost optimisation on deployments designed for testing (which are inherently lower volume) are at times directly responsible for gaps between expectations and reality when inspecting the digital product.
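
To make that concrete, here is a minimal sketch (in Python, with entirely hypothetical names and settings) of a test deployment that has an expensive background feature switched off to save on cloud costs. Someone comparing that deployment against the production specification will see a gap that is not a defect in the code at all.

```python
# A minimal sketch with hypothetical names: a cost-optimised test deployment
# diverges from production, and that divergence shows up as a "defect" to
# anyone comparing the test instance against the production specification.
from dataclasses import dataclass

@dataclass(frozen=True)
class DeploymentConfig:
    name: str
    search_indexing_enabled: bool   # expensive background service
    thumbnail_rendering_enabled: bool
    replica_count: int

PRODUCTION = DeploymentConfig("production", True, True, replica_count=6)
TEST = DeploymentConfig("test", False, True, replica_count=1)  # indexing off to save cost

def explain_gap(expected: DeploymentConfig, observed: DeploymentConfig) -> list[str]:
    """List configuration differences that could explain a reported discrepancy."""
    fields = ("search_indexing_enabled", "thumbnail_rendering_enabled", "replica_count")
    return [
        f"{field}: expected {getattr(expected, field)}, observed {getattr(observed, field)}"
        for field in fields
        if getattr(expected, field) != getattr(observed, field)
    ]

if __name__ == "__main__":
    # "Search results never appear on the test site" may not be a bug at all.
    for gap in explain_gap(PRODUCTION, TEST):
        print(gap)
```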

The other is tied to a new (for this series) concept, the "fog of war". The fog of war is a military concept that relates to uncertainty during operations. If you are a soldier behind enemy lines, you are in an unpredictable environment, armed with the information you brought with you and whatever your senses can provide in real time. There is a whole field of possible events that could be happening at any given moment that would impact your short-term future, and many forks in the road that emerge from how you might respond to these hypothetical events.

A similar effect occurs when observing complex systems in the digital world (e.g. most contemporary software). Software developers use the field of truth, which includes the current state of the code as well as prior states and sometimes intermediary "future" states that represent works in progress, to build a mental model of what is happening in digital systems. It is often not practical to understand, and sometimes not even practical to observe in sufficient detail, exactly what our devices are doing as they execute the instructions embedded in our computer programs at the most granular level (relevant: the concept of primitives).

Beyond the sheer volume of variation in behaviour across devices, deployments, and versions, there is also the baked-in instability of the digital supply chain. Our vendors are capable of altering the deal at any time. So software developers take the things they know to be true, validate the things that should be true for a working system, and in doing so narrow the coverage of the fog of war. When we are lucky, we narrow it to the point of identifying a single point of failure, and if we are luckier still, we are able to alter the deal ourselves to correct it in a way that does not create new problems.
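
In practice, narrowing that coverage often starts with cheap, mechanical confirmation of the things that must hold for a working system. A minimal sketch of that habit, assuming a couple of hypothetical checks (a pinned dependency line and a reachable test host), might look like this:

```python
# A minimal sketch of narrowing the fog of war: confirm the things that must
# be true for a working system before chasing the behaviour that looks wrong.
# The package name and host below are hypothetical placeholders.
import importlib.metadata
import socket

def dependency_on_expected_line(package: str, expected_prefix: str) -> bool:
    """Confirm a vendor dependency has not silently altered the deal."""
    try:
        return importlib.metadata.version(package).startswith(expected_prefix)
    except importlib.metadata.PackageNotFoundError:
        return False

def service_reachable(host: str, port: int, timeout: float = 2.0) -> bool:
    """Confirm the deployment we think we are inspecting is actually reachable."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False

if __name__ == "__main__":
    known_truths = {
        "requests is still on the 2.x line": dependency_on_expected_line("requests", "2."),
        "the test API answers on port 443": service_reachable("test-api.example.com", 443),
    }
    for claim, holds in known_truths.items():
        # Every claim that holds shrinks the space the fault can hide in.
        print(f"{'OK  ' if holds else 'FAIL'} {claim}")
```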

These corrections may look like adjustments to the supply chain, adjustments to the machines, or perhaps adjustments to entire production processes. These changes can (and often do) have unintended consequences. Changes at a greater scale tend to carry greater risk. One of the things software developers do over time is form instincts about what level of risk is acceptable in the moment, and part of that determination is how they relate cause to effect.

I previously wrote about an instance where I identified a serious memory pressure risk on a number of devices running software I was developing. I was armed with the knowledge that many of our users had less capable devices than the ones on which we observed extreme memory pressure. I was also armed with the knowledge that we were probably going to keep adding new features and increasing that pressure. My resolution made some systems harder to read in code, but it was also, in a very practical sense, the only way to keep those digital factories operating. That was a good-case scenario, where the only stakeholder who bears the cost is the development team. Other kinds of issues (or levels of severity) can carry serious costs for other stakeholders, and often that looks like delays in the project timeline.
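
To give a flavour of that kind of trade-off, here is a generic, hypothetical sketch (not the actual change from that project): the first version is easier to follow but materialises every intermediate stage in memory, while the second chains generators so only one record is held at a time.

```python
# A generic, hypothetical sketch of trading readability for lower peak memory.
# The first version materialises every intermediate list; the second chains
# generators so only one record is held in memory at a time.
from typing import Iterable, Iterator

def total_large_amounts_readable(records: Iterable[str]) -> int:
    """Easy to follow, but every intermediate stage is a full list in memory."""
    parsed = [line.split(",") for line in records]
    amounts = [int(fields[2]) for fields in parsed if len(fields) > 2]
    large = [amount for amount in amounts if amount > 1000]
    return sum(large)

def total_large_amounts_low_memory(records: Iterable[str]) -> int:
    """Same result with flat memory use; the pipeline is lazier and harder to trace."""
    parsed: Iterator[list[str]] = (line.split(",") for line in records)
    amounts = (int(fields[2]) for fields in parsed if len(fields) > 2)
    return sum(amount for amount in amounts if amount > 1000)
```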

Sometimes there are no quick wins. No one necessarily did anything wrong. Once you needed a bike, and now you need a plane. The evidence is not in a neon sign floating in the air; it's hidden in the fog of war.

Until next time.