Post-Performance Testing: What to do when application performance issues occur.
So performance testing is complete, the tests have been agreed and signed off, and the system goes live… only to experience some sort of performance issue. All eyes are on the performance team – so what went wrong? In my experience the main candidates can be attributed to one of the following:
- Incorrect Performance Model: The performance model has been built against an incorrect set of non-functional requirements i.e. live system behavior is very different from what has been specified and load tested against.
- Functional Permutation: A low-risk functional change, which has not been performance tested, has conspired indirectly with other components to lock up the entire system.
- Late Live ‘Hot Fixes’: A new build or system configuration change has been released without the necessary performance testing.
- Scaled-Down Environment: Performance testing has been executed against a scaled environment – this is a complex subject, which I’ve attempted to cover here. If this can be avoided then do so.
- Mismatched Deployment: There are significant differences in configuration between the live environment and performance environment.
It is inevitable that there is going to be something, somewhere, that trips the system up. Be prepared, and deal with it when it happens. Diagnosis can be a long, drawn-out task – quite often you are asked to help diagnose a crime scene after it has been cleaned up. Here is a set of generic questions to ask; the details obviously depend on the client and circumstances:
1) What build is the system live with?
2) Were any hotfixes applied?
3) What were the symptoms leading up to the issue?
4) How long after release did the issue occur?
5) Can heap dumps, application log files, database log files, user numbers, and details of any timed jobs be made available?
6) What time of day did the issue occur, and what was done to correct it? Has the issue recurred?
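Working through the log files from question 5 can itself be partly scripted. A minimal sketch in Python that pulls only the log lines around the incident window – the timestamp format and the `lines_around_incident` name are illustrative assumptions on my part, not from any particular tool:

```python
import re
from datetime import datetime, timedelta

def lines_around_incident(log_lines, incident_time, window_minutes=30):
    """Return log lines whose timestamp falls within +/- window of the incident.

    Assumes each relevant line starts with a timestamp such as
    '2024-01-01 14:03:22 ...'; lines without one are skipped.
    """
    window = timedelta(minutes=window_minutes)
    hits = []
    for line in log_lines:
        m = re.match(r"(\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2})", line)
        if not m:
            continue
        ts = datetime.strptime(m.group(1), "%Y-%m-%d %H:%M:%S")
        if abs(ts - incident_time) <= window:
            hits.append(line)
    return hits
```

Narrowing the evidence to the incident window first makes the later step of re-creating the issue in the test environment far more focused.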
The main aim of these questions is to build a picture of what actually caused the issue – re-creating the issue in the test environment is imperative. Help is usually needed; DBAs and lead developers are the key people here. Lead developers often have a strong gut feel for the innards of the system – they can home in quickly on suspect code or interactions. If the organization you are dealing with is well structured, it will have someone dedicated full time to monitoring and tuning live system behavior.
What I would strongly advise against is fixes based on conjecture. Educated guesses are fine – but they must be evidenced and replicated where possible. I’ve seen many a hot fix applied, only for it to have no effect. Conjecture is no substitute for hard evidence. If the evidence cannot be gathered, then alter the monitoring/log levels on the system so that it can be.
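Raising log levels to gather evidence need not mean leaving verbose logging on permanently. A hypothetical helper (the name and logger are my own illustration), built on Python's standard `logging` module, that raises verbosity only while the suspect path is being exercised:

```python
import logging
from contextlib import contextmanager

@contextmanager
def temporary_log_level(logger_name, level):
    """Raise (or lower) a named logger's level, restoring the old level on exit."""
    logger = logging.getLogger(logger_name)
    previous = logger.level
    logger.setLevel(level)
    try:
        yield logger
    finally:
        logger.setLevel(previous)

# Usage sketch: gather DEBUG-level evidence only around the suspect code path.
# with temporary_log_level("order-service", logging.DEBUG) as log:
#     log.debug("connection pool state: ...")
```

The same idea applies in other stacks – most logging frameworks allow levels to be changed at runtime, so the extra detail can be captured without a redeploy.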
It is hard for some clients to digest, but it is impossible to give 100% confidence that no performance issues will be experienced after the system has gone live. There simply isn’t enough time to performance test everything – this expectation should be set and made clear to stakeholders. The performance role should give a level of confidence, with clear boundaries, in the ability of the system to handle load prior to go-live [Setting Performance Expectations link here].
- Live monitoring tools directly aid the ability to quickly find the culprits of performance issues. Without such tools (e.g. SiteScope, Splunk) the organization is blind to live behavior.
- Always aim to replicate the issue in the performance environment.
- Educated guesses should be evidenced; if there isn’t enough evidence, then change the logging on the system so that there is.
- Performance of the live system can never be 100% guaranteed prior to release; take a risk-based approach and set expectations.
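If no monitoring tooling exists at all, even a crude probe gives some visibility into live behavior. A minimal sketch (the threshold and helper name are placeholders, and this is no substitute for a proper tool such as Splunk):

```python
import logging
import time

logging.basicConfig(level=logging.INFO)
log = logging.getLogger("probe")

def timed_call(fn, *args, slow_threshold_s=1.0, **kwargs):
    """Run fn, record elapsed wall-clock time, and warn if it exceeds the threshold."""
    start = time.monotonic()
    result = fn(*args, **kwargs)
    elapsed = time.monotonic() - start
    if elapsed > slow_threshold_s:
        log.warning("slow call: %s took %.3fs", fn.__name__, elapsed)
    else:
        log.info("%s took %.3fs", fn.__name__, elapsed)
    return result, elapsed
```

Wrapping a handful of key transactions like this, with the results fed into the log files, at least gives the diagnosis team a timeline to work from when an issue does occur.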