Caches are all about performance. Performance engineers need an appreciation of caches to be effective – they are used everywhere and have a direct impact on the smooth running of an application. This is one area where a live system can quickly fall over because nobody took the time to understand the behaviour of its caches during load testing.
I’ve been using Memcache with WordPress, and it’s so simple it’s beautiful. I quickly started caching frequently requested pages that relied on expensive SQL. It’s so simple it’s scary, and I was tempted to find other ways to use it as ‘an easy way out’ … (and therein lies the path to abuse). WordPress is a little different in its approach – it caches to disk, not to memory. That actually isn’t a bad thing – not everything needs to be cached in memory, expensive ops are OK to cache to disk, and disk is cheaper than memory. Anyway, I digress ….
Here’s a list of things to consider with real world caching:
Load Testing vs Real World Testing: If the test data you use for load testing fits inside the cache but the real-world data does not, your load test will always report faster results than live – the test system never sees thrashing or cache expiries, and it sees dramatically lower DB activity. Sometimes you just can’t model the breadth of data a live system has in a load model – that’s OK, just assess the risk and document it.
Cache Lemming Visibility: Caches are live but not monitored!? You are blind and going the way of the lemming. How big is the cache? How many elements does it hold? How many cache hits and misses are you getting? What are the sizes of the elements – are any too big? What are the eviction policy and the eviction rate? If you aren’t monitoring, trending and keeping tabs, it’s all going to go horribly wrong.
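To make the load-test trap concrete, here is a rough simulation. It is only a sketch under my own assumptions: a plain LRU cache, uniformly random key requests, and made-up sizes (the `hit_ratio` function and all the numbers are illustrative, not taken from any real system).

```python
import random
from collections import OrderedDict


def hit_ratio(cache_size, key_space, requests=50_000, seed=1):
    """Replay uniformly random key requests against an LRU cache."""
    rng = random.Random(seed)
    cache = OrderedDict()  # keys ordered from least to most recently used
    hits = 0
    for _ in range(requests):
        key = rng.randrange(key_space)
        if key in cache:
            hits += 1
            cache.move_to_end(key)  # mark as most recently used
        else:
            cache[key] = True
            if len(cache) > cache_size:
                cache.popitem(last=False)  # evict the least recently used key
    return hits / requests


# Load test: the whole data set (500 keys) fits in a 1000-entry cache.
load_test = hit_ratio(cache_size=1000, key_space=500)
# Live-like: 10,000 distinct keys compete for the same 1000 entries.
live_like = hit_ratio(cache_size=1000, key_space=10_000)
print(f"load test hit ratio: {load_test:.2f}, live-like hit ratio: {live_like:.2f}")
```

With these assumed numbers the load test hits the cache almost every time, while the live-like run misses most of the time and would push those misses onto the DB. Real traffic is rarely uniform, so treat the exact figures as illustrative only.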
Here are some essential stats:
[table id=3 /]
Always work out the total size of the cache – not all elements belonging to a cache are necessarily the same size.
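As a sketch of what working that out might look like, the snippet below sums per-element sizes and derives a hit ratio. It assumes you can get at the cached entries and the hit/miss counters as plain Python values, and it uses the pickled length as a stand-in for the real stored size; `cache_summary` and the sample data are hypothetical.

```python
import pickle


def cache_summary(entries, hits, misses):
    """Summarise a cache snapshot: element count, total size, biggest key, hit ratio."""
    # Assumption: serialized (pickled) length approximates the stored size.
    sizes = {key: len(pickle.dumps(value)) for key, value in entries.items()}
    lookups = hits + misses
    return {
        "elements": len(sizes),
        "total_bytes": sum(sizes.values()),
        "largest_key": max(sizes, key=sizes.get) if sizes else None,
        "hit_ratio": hits / lookups if lookups else 0.0,
    }


summary = cache_summary(
    {"user:1": {"name": "Ann"}, "report:7": list(range(10_000))},
    hits=900,
    misses=100,
)
print(summary)
```

Here one element dwarfs the other, which is exactly the sort of thing an average size per element would hide.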
Transient Visibility: The stats above will change between busy and quiet periods, so I sample them every X minutes and analyse them in Excel. This shows me the behaviour over time and lets me work out whether the caches are going to need resizing (essential if you are up against your memory limit). Busy periods are generally the most useful (but not necessarily).
Session Caching: Are the caches heavy on user sessions? This makes a difference if you are load testing and attempting to cheat by upping the transaction rate (See Cheating – User or high transaction rates), since fewer simulated users usually means fewer sessions in cache than live would hold. Everyone cheats (or should) – but when you do, know the risks.
Large Cache Objects: If the cache holds large objects, be suspicious – is only a small attribute of the object required, or is it needed in its entirety? Consider the processing/serialization overhead. If the big objects rarely get a cache hit, why cache them? They will increase memory consumption and create excessive GC. Don’t be afraid to ask questions. Not everything is best in cache, e.g. large SQL result sets.
Evidence-Based Cache Profiling: A good software engineer will gather evidence of where the application spends the majority of its time, particularly in the DB. Expensive operations that are also frequent are good candidates for caching.
Sharing: Does the cache scale and share memory across servers? (Memcache can.) Do you have a dedicated cache server? A lot of implementations do not – which means there is a risk of the same data being cached on every server.
Cache Stampede: Under very high concurrency, cache entries can conspire to expire at the same time – every request then misses at once, causing a stampede on the DB and killing the system.
Phase Synchronization: Highly concurrent, heavily used systems have a tendency to stop being unordered and ‘phase shift’ into a pattern of behaviour – order emerges from relative chaos. This can cause cache stampedes and is difficult to identify. A good analogy is the Millennium Bridge – people started walking in synchronisation, causing it to wobble. Adding noise (aka jitter) will help eliminate this. This type of behaviour is typically only seen in large-scale systems. (PS I made this title up – as I couldn’t find a definition.)
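If you would rather script this than use Excel, something like the following might do. It assumes the cache reports cumulative hit and miss counters (memcached's `get_hits` and `get_misses` work this way), so the interesting number is the difference between consecutive samples. The function name and sample values are invented for illustration.

```python
def interval_hit_ratios(snapshots):
    """Turn cumulative (label, hits, misses) samples into per-interval hit ratios.

    snapshots must be ordered oldest first.
    """
    rows = []
    for (_, h0, m0), (label, h1, m1) in zip(snapshots, snapshots[1:]):
        hits, misses = h1 - h0, m1 - m0
        lookups = hits + misses
        # None marks an interval with no lookups at all.
        rows.append((label, hits / lookups if lookups else None))
    return rows


# Hypothetical samples taken five minutes apart.
samples = [("09:00", 1000, 100), ("09:05", 1900, 200), ("09:10", 2300, 800)]
ratios = interval_hit_ratios(samples)
print(ratios)
```

In this made-up data the cumulative ratio still looks healthy at 09:10, but the interval ratio shows the cache has started missing heavily, which is the kind of change you only see by trending.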
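One common way to soften a stampede is to let only one caller rebuild a missing entry while the others wait for it. The sketch below is a minimal in-process version under my own assumptions (a per-key lock, no expiry, a fake `slow_query` standing in for the DB); it is not how Memcache or any particular library does it.

```python
import threading
import time


class SingleFlightCache:
    """In-memory cache where only one thread loads a missing key at a time."""

    def __init__(self):
        self._values = {}
        self._locks = {}
        self._guard = threading.Lock()  # protects the lock table itself

    def get(self, key, loader):
        if key in self._values:
            return self._values[key]
        with self._guard:
            lock = self._locks.setdefault(key, threading.Lock())
        with lock:
            # Re-check: another thread may have loaded it while we waited.
            if key not in self._values:
                self._values[key] = loader()
            return self._values[key]


db_calls = []


def slow_query():
    """Hypothetical expensive DB call."""
    db_calls.append(1)
    time.sleep(0.05)
    return "result"


cache = SingleFlightCache()
threads = [
    threading.Thread(target=cache.get, args=("page:home", slow_query))
    for _ in range(20)
]
for t in threads:
    t.start()
for t in threads:
    t.join()
print(f"20 concurrent requests, {len(db_calls)} DB call(s)")
```

Without the per-key lock, all 20 threads could miss together and each run the query. A cache shared across servers would need a distributed equivalent, which is beyond this sketch.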
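As an illustration of adding jitter, the snippet below spreads cache expiry times randomly around a base TTL so that entries written together do not all expire together. The `jittered_ttl` helper, the 300 second base and the 10% spread are my own assumptions, not recommended values.

```python
import random


def jittered_ttl(base_ttl, spread=0.1, rng=random):
    """Return base_ttl shifted randomly by up to +/- spread (a fraction)."""
    return base_ttl * (1 + rng.uniform(-spread, spread))


rng = random.Random(42)  # seeded only to make the example repeatable
ttls = [jittered_ttl(300, spread=0.1, rng=rng) for _ in range(1000)]
print(f"shortest TTL: {min(ttls):.1f}s, longest TTL: {max(ttls):.1f}s")
```

Instead of 1000 entries expiring in the same instant, the expiries are smeared across roughly a minute, which breaks up the synchronised pattern.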
Papering the Cracks
I have mixed feelings about caches – they are an inevitable and required strategy, however I’ve often seen them misused by software engineers. It’s too easy to take a good idea a little too far – and then begin to lose sight of the fundamental architectural problem. “Let’s put everything in cache…. It’ll make everything faster” …. It’s almost as if the DB and its tuning are then forgotten (See Costs of Encapsulation). What’s more, I’ve seen SEs then get their knickers in a twist attempting to solve locking and contention issues (you know, everything a DB does really, really well) through excessive use of a cache. So my general observation about caches is that non-time-critical reads (i.e. data that isn’t likely to change soon) are generally good candidates. Highly volatile, time-critical data isn’t.
Throwing the golden goose over the fence and then forgetting
The golden rule for effective cache performance is to monitor, analyse and actively measure, measure, measure. No visibility means you are blind and likely to find trouble.