I’m a big fan of performance reliability, and I don’t mean the expensive, ‘cheap thinking’ type that is failover and resilience. I’m talking about ‘functional performance resilience’: the kind where your system doesn’t blow up because someone somewhere has entered something silly, or because a series of system defects has caused it to slow or crash under load. When I look for ways to make a system more performant (see my definition of performant here), one of the ways I increase reliability is to identify items that are not entirely obvious and that each look low-risk when viewed in isolation.
Removing Performance Reliability Risks – Don’t Ignore the FLAGS!!!
The risks I’m particularly seeking to address are unhandled exceptions, errors, warnings and messages telling you something you didn’t expect. I’m going to call these Flags. I’ve been on a number of sites where system logs (including DB logs) are not monitored in live. How a system behaves in live will ALWAYS be different from test. If these Flags are not monitored, reliability risk increases over time, and each Flag becomes a potential single point of failure (SPOF) for performance. Flags are very different from functional issues, because functional issues get noticed and prioritised. Flags tend not to be: they are invisible and don’t affect ‘the business’ in an obvious way, so they get brushed under the carpet.
Woodlice and Your House
A single woodlouse is not a problem in your house, but if you see one you step on it and remove it. Left unchecked, they multiply until you are infested, and the many small problems become a much larger issue. I think of Flags in a similar way.
An Avalanche of Flags
Not monitoring Flags encourages an undisciplined approach: new releases add to the number of Flags, and before anyone realises it there are a high number of Flags on the system … and then it’s too late. Simply looking at something and accepting it as a ‘valid error’ is common. In my book, the whole point of Flags is that they are not there to be ignored: the system is behaving in a way that wasn’t expected.
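One way to stop releases from quietly adding Flags is to compare the Flags seen after a release against a baseline taken before it. The following is only a minimal sketch under stated assumptions: the log format (a level keyword such as WARN or ERROR followed by the message) and the function names `flag_signature`, `count_flags` and `new_flags` are hypothetical and would need adapting to your own logs.

```python
import re
from collections import Counter

# Assumption: flag lines contain one of these level keywords. Adjust to your log format.
FLAG_LEVELS = re.compile(r"\b(WARN|WARNING|ERROR|FATAL)\b")
DIGITS = re.compile(r"\d+")


def flag_signature(line):
    """Return a normalised signature for a flag line, or None if the line is not a flag."""
    match = FLAG_LEVELS.search(line)
    if match is None:
        return None
    # Keep the level and the message that follows it; mask digits so that
    # "timeout after 3000 ms" and "timeout after 4500 ms" count as the same Flag.
    message = line[match.start():]
    return DIGITS.sub("N", message).strip()


def count_flags(lines):
    """Count how often each flag signature occurs in an iterable of log lines."""
    return Counter(sig for sig in map(flag_signature, lines) if sig is not None)


def new_flags(baseline_lines, release_lines):
    """Return the flags (with counts) that appear after the release but not in the baseline."""
    baseline = count_flags(baseline_lines)
    release = count_flags(release_lines)
    return {sig: n for sig, n in release.items() if sig not in baseline}


baseline_log = [
    "2024-01-01 10:00:01 WARN cache miss for key 17",
    "2024-01-01 10:00:02 INFO started",
]
release_log = [
    "2024-02-01 09:00:00 WARN cache miss for key 99",
    "2024-02-01 09:00:01 ERROR timeout after 3000 ms",
    "2024-02-01 09:00:05 ERROR timeout after 4500 ms",
]
print(new_flags(baseline_log, release_log))
```

Anything this reports is a Flag the release introduced, which makes it much harder to wave through as a ‘valid error’.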
Wood From Trees
This is best illustrated when something bad happens in the live environment. People take a look at the logs and realise there are so many false Flags that they can’t easily piece together the crime scene. It’s littered with other debris. A hard job has suddenly become a lot harder. I’ve seen companies just ‘hope’ that the thing doesn’t happen again (it failed over) because it is too time-consuming to get to the bottom of it. This is a bad experience for end users and it’s also the result of poor Flag management. It can lead to expensive and avoidable solutions such as Splunk.
Causes of poor Flag Management:
Here are what I think are the underlying causes:
Distorted Low Risk: We have a tendency to trivialise the small risks we feel we can control (e.g. Flags or functional issues) and exaggerate the large ones we feel we cannot (a large influx of traffic, hardware failure). But over time, these low risks can easily add up to a much larger risk than the infrequent large ones. This is perhaps the underlying cause people are most guilty of. Over time, a Flag or set of Flags becoming a single point of failure is inevitable.
Usability: Visibility & usability. If all the Flags are consolidated into a single place that is easily accessible and easy to drill into, people become much more proactive about them. If they are hidden, dispersed and take time to access, then who on earth is going to poke about in them unless there is an issue? … and by the time there is an issue it’s usually too late. This is particularly relevant if a system has a high number of servers.
Technical Management: Or lack of. Failing to communicate this risk and to ensure adequate processes and tools are in place is down to a lack of technical oversight (or of a Technical PM). I view this as essential infrastructure and plumbing that should be in place. If I commission a house, I expect the guttering, drainage and basic infrastructure to be built in, not specified after the fact.
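To make the visibility point concrete, here is a rough sketch of consolidating Flags from several servers into one ranked summary. It is an illustration rather than a recommended tool: the server names, the log format and the `consolidate` function are all assumptions, and in practice the lines would be read from each server’s log files instead of being held in memory.

```python
import re
from collections import defaultdict

# Assumption: flag lines contain one of these level keywords. Adjust to your log format.
FLAG_LEVELS = re.compile(r"\b(WARN|WARNING|ERROR|FATAL)\b")


def consolidate(logs_by_server):
    """Merge flags from many servers into one list.

    logs_by_server maps a (hypothetical) server name to an iterable of log lines.
    Returns (signature, total_count, servers) tuples, most frequent first.
    """
    counts = defaultdict(int)
    servers = defaultdict(set)
    for server, lines in logs_by_server.items():
        for line in lines:
            match = FLAG_LEVELS.search(line)
            if match is None:
                continue  # not a flag, ignore it
            # Mask digits so the same flag with different values groups together.
            sig = re.sub(r"\d+", "N", line[match.start():]).strip()
            counts[sig] += 1
            servers[sig].add(server)
    # Sort by descending count, then alphabetically for a stable order.
    return sorted(
        ((sig, counts[sig], sorted(servers[sig])) for sig in counts),
        key=lambda row: (-row[1], row[0]),
    )


logs = {
    "web-1": ["10:00 ERROR db timeout after 30 s", "10:01 INFO ok"],
    "web-2": ["10:02 ERROR db timeout after 45 s", "10:03 WARN disk at 91 percent"],
}
for sig, total, seen_on in consolidate(logs):
    print(total, sig, seen_on)
```

Even a summary this crude shows which Flags are spread across every server and which are confined to one, which is exactly the kind of drill-down that gets people looking before there is an issue.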
So an effective “Performance Reliability” strategy is a disciplined, steady-slog approach: monitoring Flags on a daily basis, investigating them and introducing fixes. It’s also a virtuous approach: system behaviour is better understood, and reliability and the overall health of the system increase, leading to less downtime, more efficient use of dev time and less time spent on unnecessary remedial solutions. Reliability is an intrinsic part of a Performant system. Keep the woodlice at bay. Consolidate your logs, monitor and keep your house clean.