Common Sense Performance Indicators in the Cloud by Nick Gerner (SEOmoz)
SEOmoz has been EC2/S3 based since 2008. They scaled from 50 to 500 nodes. Nick is a developer who wanted him some operational statistics!
Their architecture has many tiers – S3, memcache, app, lighttpd, ELB. They needed to visualize it.
This will not be about waterfalls and DNS and stuff. He’s going to talk specifically about system (Linux system) and app metrics.
/proc is the place to get all the stats. Go “man proc” and understand it.
What 5 things does he watch?
- Load average – like from top. It combines a lot of things into one number and is a good place to start, but it explains nothing by itself.
- CPU – useful when broken out by process, user vs system time. It tells you who’s doing work, if the CPU is maxed, and if it’s blocked on IO.
- Memory – useful when broken out by process. Free, cached, and used. Cached + free = available, and if you have spare memory, let the app or memcache or db cache use it.
- Disk – read and write bytes/sec, utilization. Basically is the disk busy, and who is using it and when? Oh, and look at it per process too!
- Network – read and write bytes/sec, and also the number of established connections. 1024 connections is a common magic limit. Bandwidth costs money – keep it flat! And watch connections to your internal (SOA) services.
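All five of those metrics ultimately come out of /proc. As a minimal sketch (not from the talk), here's how two of them – load average and available memory – can be read and parsed in Python, with field names per the proc man page:

```python
# Minimal sketch: pulling load average and memory stats straight from /proc.
# Field names follow proc(5); values in /proc/meminfo are in kB.

def parse_loadavg(text):
    """Parse /proc/loadavg, e.g. '0.20 0.18 0.12 1/80 11206',
    into (1-min, 5-min, 15-min) load averages."""
    one, five, fifteen = text.split()[:3]
    return float(one), float(five), float(fifteen)

def parse_meminfo(text):
    """Parse /proc/meminfo lines like 'MemFree:  512000 kB' into a dict of kB values."""
    info = {}
    for line in text.splitlines():
        key, _, rest = line.partition(":")
        if rest:
            info[key.strip()] = int(rest.split()[0])
    return info

def available_kb(info):
    # Per the talk: cached + free = memory actually available to your apps.
    return info.get("MemFree", 0) + info.get("Cached", 0)
```

On a Linux box you'd feed these the contents of `/proc/loadavg` and `/proc/meminfo`; an agent like collectd does essentially this on a timer.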
Perf Monitoring For Free
- data collection – collectd
- data storage- rrdtool
- dashboard management – drraw
They put those together into a dashboard. They didn't want to pay anyone or spend time managing it. The dynamic nature of the cloud (nodes coming and going) gives tools like Nagios problems.
They’d install collectd agents all over the cluster. New nodes get a generic config, and node names follow a convention according to role.
Then there's a dedicated perf server with the collectd server, a Web server, and drraw.cgi. It sits in a security group everyone can connect in to.
Back up your performance data- it’s critical to have history.
CloudWatch gives you some of this – but not the insight you get from breaking things out by process. And Keynote/Gomez-style monitoring is fine, but it doesn't give you the (server-side) nitty gritty.
More about the dashboard. Key requirements:
- Summarize nodes and systems
- Visualize data over time
- Stack measurements per process and per node
- Handle new nodes dynamically w/o config change
He showed their batch mode dashboard. Just a row per node, a metric graph per column. CPU broken out by process with load average superimposed on top. You see things like “high load average but there’s CPU to spare.” Then you realize that disk is your bottleneck in real workloads. Switch instance types.
Memory broken out by process too. Yay for kernel caching.
Disk chart in bytes and ops. The steady state, spikes, and sustained spikes are all important.
Network – overlay the 95th percentile, because that's how you get billed.
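For reference, 95th-percentile (burstable) billing is commonly computed by sorting the interval samples and discarding the top 5% – so short spikes are effectively free. A sketch of that convention (conventions vary slightly between providers):

```python
def percentile_95(samples):
    """95th percentile of bandwidth samples (e.g. 5-minute bytes/sec averages):
    sort ascending and take the value 95% of the way up. The top 5% of
    samples are effectively 'free' under this billing convention."""
    if not samples:
        raise ValueError("no samples")
    ordered = sorted(samples)
    idx = max(0, int(len(ordered) * 0.95) - 1)
    return ordered[idx]
```

Note how a handful of spikes don't move the number – it's sustained elevation that costs you, which is why the talk says to keep the line flat.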
Web Server dashboard from an API server is a little different.
Add Web requests by app/request type. app1, app2, 302, 500, 503… You want to see requests per second by type.
mod_status gives you connection counts and child idleness.
System wide dashboard. Each graph is a request type, then broken out by node. And aggregate totals.
And you want median latency per request. And any app specific stuff you want to know about.
So get the basic stats, over time, per node, per process.
Understand your baseline so you know what’s ‘really’ a spike.
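One deliberately simple way to make "is this really a spike?" concrete – my sketch, not a method from the talk – is to flag values that sit well above the baseline's mean:

```python
from statistics import mean, stdev

def is_spike(history, value, k=3.0):
    """Flag `value` as a spike if it's more than k standard deviations
    above the mean of the historical baseline. Deliberately simple --
    real baselines might be per hour-of-day or per day-of-week."""
    if len(history) < 2:
        return False  # not enough data to call anything a spike
    mu, sigma = mean(history), stdev(history)
    if sigma == 0:
        return value > mu
    return value > mu + k * sigma
```

The point stands either way: without history you can't tell a spike from normal variation, which is why backing up the performance data matters.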
Ad hoc tools – try 'em!
- dstat -cdnml for system characteristics
- iotop for per process disk IO
- iostat -x 3 for detailed disk stats
- netstat -tnp for per process TCP connection stats
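As one example of turning an ad hoc tool into a per-process number, here's a sketch that counts ESTABLISHED connections per process from `netstat -tnp` output (column layout assumed from typical Linux net-tools output, where the last column is `pid/program`):

```python
from collections import Counter

def established_per_process(netstat_output):
    """Count ESTABLISHED TCP connections per process from `netstat -tnp`
    output. Assumes the typical net-tools layout:
    proto recv-q send-q local-addr foreign-addr state pid/program"""
    counts = Counter()
    for line in netstat_output.splitlines():
        fields = line.split()
        if len(fields) >= 7 and fields[0].startswith("tcp") and fields[5] == "ESTABLISHED":
            counts[fields[6]] += 1  # key like '1234/nginx'
    return counts
```

Watching this number per process is how you spot a service creeping toward that ~1024-connection magic limit.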
His slides and other informative blog posts are at nickgerner.com.
A good bootstrap method… You may want more/better tools, but it's a good point that you can do this much for free with very basic tooling, so anything you pay for had best be better! I think the "per process" intuition is the best takeaway; a lot of otherwise fancy crap doesn't do that.
But in the end I want more – baselines, alerting, etc.