Guardian.co.uk
Talk by Graham, with various other people from the development and operations teams pitching in.
3 years ago they published static files, with Apache SSI filling in the gaps. Then they moved to a fully dynamic system. Now they're somewhere in between.
Stack -
- Apache
- Resin
- Spring / Hibernate / Velocity
- Oracle DB backend (not recommended!)
Measured the application - 1300 requests to the DB just to render the homepage.
Added Ehcache to Hibernate as a second-level cache, plus a warm-up script run before putting a server into the load balancer.
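Wiring Ehcache in as Hibernate's second-level cache came down to a few configuration properties - a minimal sketch, assuming a Hibernate 3.x-era `hibernate.cfg.xml` (property names from that generation of Hibernate, not the Guardian's actual config):

```xml
<!-- hibernate.cfg.xml fragment (sketch, Hibernate 3.x-era property names) -->
<property name="hibernate.cache.use_second_level_cache">true</property>
<property name="hibernate.cache.use_query_cache">true</property>
<property name="hibernate.cache.provider_class">org.hibernate.cache.EhCacheProvider</property>
```

The warm-up script then just requests the hot pages so the cache is populated before the node starts taking real traffic.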
30m unique users per month
270m pages per month
250 requests/second at lunchtime
1500 requests/second peak.
GC tools
Google weak-reference cache (part of Google Collections)
Eclipse Memory Analyzer - what's using all my memory?
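The weak-reference cache idea can be sketched with the JDK alone (Google Collections packaged it as a ready-made map with weak values); the class and method names here are illustrative, not the Guardian's code. The point is that the GC can reclaim cached values under memory pressure instead of the cache pinning the heap:

```java
import java.lang.ref.WeakReference;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Sketch of a weak-value cache: entries can be dropped by the GC once
// nothing else holds the value, instead of pinning the heap.
public class WeakValueCache<K, V> {
    private final Map<K, WeakReference<V>> map = new ConcurrentHashMap<>();

    public void put(K key, V value) {
        map.put(key, new WeakReference<>(value));
    }

    public V get(K key) {
        WeakReference<V> ref = map.get(key);
        if (ref == null) return null;
        V value = ref.get();
        if (value == null) map.remove(key); // value was collected; drop the stale entry
        return value;
    }
}
```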
Cacti for monitoring - DB usage was killing it.
8 app servers in each co-lo (London and Manchester).
400MB used by cache - churn meant it was pretty ineffective.
Tried or considered Ehcache and JBoss Cache distribution.
Rejected, since cache eviction via replication would have thrashed it.
memcached
Massive improvement in response times, but DB load was still high.
Went to caching every query for 5 minutes; DB load vanished and stays flat even as more app servers come online.
A servlet filter writing rendered pages to memcached made it stinking fast.
Took a day's worth of logs and used Hadoop to work out how long the cache TTL should be. 1 minute was the sweet spot.
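The "cache every query for a fixed TTL" idea can be sketched in-process (the real system pushed results to memcached via a servlet filter; the class and names below are illustrative stand-ins). Anything older than the TTL is simply treated as a miss and refetched:

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Sketch of a fixed-TTL cache, standing in for the memcached layer:
// every entry is stamped with an expiry time and treated as a miss after it.
public class TtlCache<K, V> {
    private static final class Entry<T> {
        final T value;
        final long expiresAtNanos;
        Entry(T value, long expiresAtNanos) {
            this.value = value;
            this.expiresAtNanos = expiresAtNanos;
        }
    }

    private final Map<K, Entry<V>> map = new ConcurrentHashMap<>();
    private final long ttlNanos;

    public TtlCache(long ttlMillis) {
        this.ttlNanos = ttlMillis * 1_000_000L;
    }

    public void put(K key, V value) {
        map.put(key, new Entry<>(value, System.nanoTime() + ttlNanos));
    }

    public V get(K key) {
        Entry<V> e = map.get(key);
        if (e == null) return null;
        if (System.nanoTime() > e.expiresAtNanos) {
            map.remove(key); // expired: treat as a miss so the caller refetches
            return null;
        }
        return e.value;
    }
}
```

With every query answered from entries at most a minute or two old, the database only ever sees one fetch per distinct query per TTL window, which is why the DB load stayed flat as app servers were added.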
Emergency switch to serve a static copy of the site, minus personalization features.
Daemon or script scrapes the site; they can handle 700 req/s/node when the site's operating in this mode.
New content published in this mode has a copy pressed to disk so it can be served statically - publishing is slower than with the dynamic system, but updates are still possible.
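The emergency switch amounts to checking a flag and serving the pressed-to-disk copy of a page instead of rendering it; a minimal sketch, where the class name, flag, and snapshot layout (one `.html` file per path) are all assumptions for illustration, not the Guardian's actual implementation:

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;

// Sketch of the emergency read-only mode: when the switch is thrown,
// serve the pressed copy of each page from disk instead of rendering it.
public class EmergencyMode {
    private volatile boolean readOnly = false;
    private final Path snapshotDir;

    public EmergencyMode(Path snapshotDir) {
        this.snapshotDir = snapshotDir;
    }

    public void setReadOnly(boolean on) {
        readOnly = on;
    }

    /**
     * Returns the pressed page when in read-only mode and a copy exists;
     * null means "fall through to dynamic rendering".
     */
    public String serve(String pagePath) throws IOException {
        if (!readOnly) return null;
        Path pressed = snapshotDir.resolve(pagePath.replaceFirst("^/", "") + ".html");
        return Files.exists(pressed) ? Files.readString(pressed) : null;
    }
}
```

Serving plain files this way is what makes the 700 req/s/node figure plausible: no DB, no templating, just disk reads.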
Highly recommends that this sort of emergency degraded read-only mode be built in from the off - they've used the same approach with the MPs' Expenses app built to crowd-source investigations.