As a consultant I’ve had the privilege of working on many sites and experiencing different technologies, implementations and customer approaches. Although my experience is not definitive, I’m going to identify some of the recurring themes that appear to accumulate in enterprise systems displaying poor performance characteristics.
Note: I’ve deliberately steered away from the delivery mechanisms that deal with the effects of latency (such as CDNs and best practices for browser rendering); they have been covered elsewhere and are well understood. This piece deals with the ‘inside the firewall’ anti-performant patterns.
Each of the items below should be considered a summary. Each could easily be expanded into a longer article, but for readability they have been kept succinct:
Organic Inheritance and loss of control: Applications are built over years; people and knowledge move on. Code isn’t documented in-line, understood or owned. Temporary fixes become permanent. No concerted effort is made to understand, refactor or reengineer old code because it ‘just works’. The truth of the matter is that no-one has the time to understand what ‘that bit does’, and they have become scared of changing it because it does something no-one quite understands and is used by a lot of other code.
Prevention: Constant refactoring of existing code, peer reviews, and strict in-line documentation and coding standards. If there is code that isn’t understood, understanding it should be made a priority.
‘Over the fence’ to Production: The system resides on hardware that is managed by an external vendor. In some extreme cases I’ve seen clients think that once it’s live it’ll just work. When it’s ‘thrown over the fence’, they fail to see the need for continuous monitoring of hardware, software, error messages and end-user response times.
Prevention: Consolidated logging consoles, e.g. Splunk, LogLogic or Graylog2 (Splunk is overpriced), particularly if there is a high number of web and application servers. Active monitoring of log messages and removal of spurious error logs. Monitoring and reporting of hardware and runtime performance counters such as CPU, I/O and heap allocation. Instrumentation of key business flows, written to CSVs and graphed; this will vastly reduce the need for expensive APMs or tools such as Gomez. (See Creating Performance Reliability)
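As a rough illustration of instrumenting a business flow into CSV, here is a minimal Python sketch. The helper name timed_flow, the column layout (timestamp, flow, elapsed milliseconds, status) and the flow names are all assumptions made for the example; a real system would write to a file or log shipper rather than an in-memory buffer.

```python
import csv
import io
import time
from contextlib import contextmanager
from datetime import datetime, timezone


@contextmanager
def timed_flow(writer, flow_name, clock=time.perf_counter):
    """Time one business flow and write a CSV row: timestamp, flow, ms, status.

    Hypothetical helper: the name and the column layout are assumptions.
    """
    start = clock()
    status = "ok"
    try:
        yield
    except Exception:
        status = "error"
        raise  # the failure is recorded, then passed on to the caller
    finally:
        elapsed_ms = (clock() - start) * 1000.0
        writer.writerow([
            datetime.now(timezone.utc).isoformat(),
            flow_name,
            f"{elapsed_ms:.1f}",
            status,
        ])


def run_demo():
    """Record one successful and one failing flow; return the CSV text."""
    buffer = io.StringIO()
    writer = csv.writer(buffer)
    with timed_flow(writer, "checkout"):
        time.sleep(0.01)  # stands in for real work
    try:
        with timed_flow(writer, "search"):
            raise ValueError("simulated failure")
    except ValueError:
        pass
    return buffer.getvalue()


if __name__ == "__main__":
    print(run_demo())
```

Because every row carries a timestamp and a flow name, the resulting file can be graphed per flow with any spreadsheet or plotting tool, which is all that is needed to spot a response time drifting upwards.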
No dedicated DB expertise: DBs are complex; I think of them as operating systems in their own right. I’ve been surprised by how many DBs have been constructed by developers, sometimes with many developers contributing to a single DB. Without a holistic source of control and expertise, the DB will become unmanageable and a source of application contention.
Prevention: Employ a DB developer (not a DBA). All table changes, PL/SQL and SQL should at least be QA’ed by this role before being allowed live onto the system. A good DB developer will constantly measure the live system performance and tune, tune, tune.
No Live Measures for Continuous Improvement (CI): The performance characteristics of a system are constantly subject to change (such as load) and to variables that are in a constant state of flux (such as user profiles). How the system behaves in test will be extremely different to how it behaves live. If you don’t constantly measure, monitor and feed back from the live environment, you are going the way of the Lemming.
Prevention: Measure! If you don’t measure, how can you improve? The quality and depth of your measurements directly determine the improvements you can feed back.
The Encapsulation Jigsaw: Many different services are pieced together with little oversight or communication. A loss of respect for CPU cycles, and of understanding of the internal behaviour of these services, becomes the norm. More detail can be found here.
Unmeasured Cache Behaviour: The system utilizes a number of caches for commonly requested objects, only no-one has any data on how these caches are actually used in the real world. The cache contains 1,000 objects, but no-one knows the size of each of these objects, how long they live for, how many times they are requested, or how many times they are thrashed. Is the cache too big or too small? Is the TTL too high or too low? This all creates memory overhead and inefficient use of resources such as DBs. (See Measuring Cache Performance)
Prevention: Measure and report on the caches every day. Understand their behaviour, refine and keep in check. Caches are a good idea that becomes a drag if left unchecked.
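To make the point concrete, here is a minimal Python sketch of a cache that counts its own hits, misses, evictions and expiries. MeasuredCache is a hypothetical class written for this example (an LRU cache with a TTL), not the API of any particular caching product; most real cache libraries expose similar counters that can be reported instead.

```python
import time
from collections import OrderedDict


class MeasuredCache:
    """LRU cache with a TTL that counts hits, misses, evictions and expiries.

    Hypothetical example class. Storing None as a value is not supported,
    because get() returns None to signal a miss.
    """

    def __init__(self, max_entries, ttl_seconds, clock=time.monotonic):
        self.max_entries = max_entries
        self.ttl_seconds = ttl_seconds
        self.clock = clock
        self._data = OrderedDict()  # key -> (value, stored_at)
        self.hits = self.misses = self.evictions = self.expiries = 0

    def get(self, key):
        entry = self._data.get(key)
        if entry is None:
            self.misses += 1
            return None
        value, stored_at = entry
        if self.clock() - stored_at >= self.ttl_seconds:
            # Entry outlived its TTL: drop it and count it as a miss too.
            del self._data[key]
            self.expiries += 1
            self.misses += 1
            return None
        self._data.move_to_end(key)  # mark as most recently used
        self.hits += 1
        return value

    def put(self, key, value):
        if key in self._data:
            self._data.move_to_end(key)
        self._data[key] = (value, self.clock())
        if len(self._data) > self.max_entries:
            self._data.popitem(last=False)  # evict the least recently used
            self.evictions += 1

    def report(self):
        """Return the counters that answer 'too big, too small, TTL right?'."""
        lookups = self.hits + self.misses
        return {
            "entries": len(self._data),
            "hits": self.hits,
            "misses": self.misses,
            "hit_ratio": self.hits / lookups if lookups else 0.0,
            "evictions": self.evictions,
            "expiries": self.expiries,
        }
```

Read together, the counters start to answer the sizing questions: a high eviction count with a low hit ratio suggests the cache is too small or is being thrashed, while many expiries on frequently requested keys suggest the TTL is too low.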
Unrealistic Performance Expectations: Not strictly an anti-pattern, but a pattern I see that becomes a significant drag on projects. A new system (or part of one) is due to go live, and the business predicts usage that turns out to be wildly above what is actually seen, resulting in a system that is over-engineered, over-resourced and over budget. By default the business is going to be over-optimistic and think this is the best thing since sliced bread.
Prevention: Use current usage statistics to derive a usage profile from a mathematical formula that can be scrutinized and subjected to rigor by all parties. If variables aren’t known, then guess and state the guess. Make sure the derived figure is reviewed through walk-throughs and classroom inspections. I’ve been in many places where figures are stated and not inspected until too late. Months of effort can be wiped off a project by a simple review.
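One possible shape for such a formula is sketched below in Python. The model (daily users × sessions per user × requests per session, with a fixed share of traffic landing in the busiest hour) and every input value are assumptions chosen for illustration; the point is that each variable is named and can be challenged in a review.

```python
def peak_requests_per_second(daily_users, sessions_per_user,
                             requests_per_session, peak_hour_share):
    """Estimate the average request rate during the busiest hour.

    Assumed model: total daily requests, of which peak_hour_share
    (a fraction between 0 and 1) arrive within a single hour.
    """
    if not 0 < peak_hour_share <= 1:
        raise ValueError("peak_hour_share must be in the range (0, 1]")
    daily_requests = daily_users * sessions_per_user * requests_per_session
    return daily_requests * peak_hour_share / 3600.0


# Illustrative inputs only. Guessed values are stated as guesses.
profile = {
    "daily_users": 48_000,        # from current usage statistics
    "sessions_per_user": 1.5,     # from current usage statistics
    "requests_per_session": 12,   # guess, to be confirmed from logs
    "peak_hour_share": 0.25,      # guess: a quarter of traffic in one hour
}

if __name__ == "__main__":
    rate = peak_requests_per_second(**profile)
    print(f"Estimated peak load: {rate:.1f} requests/second")
```

With these inputs the estimate works out to 60 requests per second. Halving a guessed variable halves the result, which is exactly the kind of sensitivity a walk-through should expose before hardware is ordered.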
Non-technical PMs: Project managers who concentrate on managing up, not down, and have no interest in or appreciation of development practices. Developers are then not given enough steer and are allowed to work in isolation without an awareness of the overall system. This is the one that irks me the most: would you employ a football manager who has little understanding of how the pieces fit together and flow? Or one who just fills in the gaps and reports on progress (or lack of it)?
Prevention: Employ a PM that has good development experience.
I’ve found that the above are common patterns that accumulate until a system becomes non-performant. Moreover, these patterns make quickly identifying the root cause of poor performance almost impossible. Putting simple measures and practices (such as performance-by-design approaches) in place during development and throughout the product lifecycle will mitigate performance risk. I’ve seen many of the above preventions implemented and they work extremely well.