Monitoring Sucks but Alerting is Beautiful

I (@wickett) work as the Cloud Ops Team Lead at National Instruments, where we have built several Software as a Service products on different cloud providers (AWS, Azure, Google) and run them with a host of supporting SaaS tools (Cloudkick, ZenDesk, AlertSite and PagerDuty, plus several others).  When building out our products we decided to eat our own dog food and use SaaS solutions as much as possible.  Great, but where am I going with all this?

No matter what tools you use to monitor or log on your systems, you need a reliable way to get actionable events to your Ops Team.  If you have implemented several types of monitors, you have probably set up an email address for them to send alerts to.  Then you write scripts to forward those emails to on-call devices, or forwarding rules to turn them into SMS, which is not exactly state of the art.  Try mixing in a global on-call rotation and configuring every one of your monitoring tools to account for it, and it becomes a big problem.

Enter PagerDuty.  It is by far the best SaaS product our Operations team uses day to day.  PagerDuty is an alerting tool that is simple, easy to use and integrates with your other systems.  Why is PagerDuty so awesome?  Well, I am glad you asked.

  1. User-defined escalation.  Once an alert gets sent into PagerDuty, it is processed through our escalation pathway.  PagerDuty determines who is the first level of support and begins to alert that person.  Here is the cool part: I can choose to be alerted however I want, and other ops team members can choose however they want.  For me, I get an email after 1 minute, SMS alerts at 3 minute intervals for the next 9 minutes, then phone calls every 5 minutes for the next 20 minutes.  Let’s say you are a hard sleeper; then you might want to skip the SMS and move straight to phone calls.  If I don’t acknowledge the alert in 30 minutes, it gets escalated to the next ops team member.
  2. Alert Acknowledgement.  From any one of the alert mechanisms above, there is an in-kind way to acknowledge and resolve the alert.  I can reply to the SMS message with an ACK code or when I get a call from the PagerDuty version of Siri I can select a response right on my keypad at that time.  No time is lost logging into PagerDuty to acknowledge the alert and the ops team can just get busy responding.
  3. Equality of alerts.  This is a subtle one.  We have a policy on our team that all alerts are equal and need to be handled with the same care and diligence.  Anything that makes it to PagerDuty is treated with the same level of importance and is escalated through the same channel–no “you can just ignore those alerts from that system over there” syndrome on my team.  All alerts are escalated and all must be handled.
  4. API integration.  PagerDuty lets you integrate with tools you already use (e.g. Nagios, Zenoss, Cloudkick, Splunk), and those tools can open alerts as problems are detected and close them as they are resolved.
  5. Email integration.  Even if you have written some code or a monitor that doesn’t integrate with the API, all it needs to be able to do is send an email.  Once that email is received, PagerDuty will treat it as an alert.
  6. Global 24/7 tools.  PagerDuty works in lots of countries and has scheduling that allows follow-the-sun ops teams to thrive.  I am on call every day for 8 hours, and after my shift ends one of my other global ops team members (@einsamsoldat and @hafizramly) is on call for the next 8 hours, at which point I am bumped to the second tier of the escalation path.  Most tools miss this, and for our team it is a huge benefit.
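
To make items 1 and 6 a bit more concrete, here is a minimal Python sketch of the notification schedule and the follow-the-sun rotation described above.  It is a toy model only: it does not call PagerDuty or reflect how PagerDuty implements any of this.  The names (Phase, notification_timeline, who_is_paged, on_call_order) and the engineer labels are hypothetical.  It also assumes one reading of my rules (each phase starts when the previous one ends, so the phases add up to the 30 minute escalation mark) and that shifts are equal in length and aligned to UTC midnight.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Phase:
    """One stage of a personal notification rule (hypothetical model)."""
    channel: str        # "email", "sms" or "phone"
    interval_min: int   # minutes between notifications in this phase
    duration_min: int   # how long this phase lasts


def notification_timeline(phases):
    """Return (minute, channel) pairs, counted from when the alert opened.

    Assumption: each phase starts when the previous one ends, and the first
    notification of a phase fires one interval after the phase starts.
    """
    timeline = []
    start = 0
    for phase in phases:
        end = start + phase.duration_min
        minute = start + phase.interval_min
        while minute <= end:
            timeline.append((minute, phase.channel))
            minute += phase.interval_min
        start = end
    return timeline


# The rules from item 1: email after 1 minute, SMS every 3 minutes for
# 9 minutes, then phone calls every 5 minutes for 20 minutes.
MY_RULES = [Phase("email", 1, 1), Phase("sms", 3, 9), Phase("phone", 5, 20)]
ESCALATE_AFTER_MIN = 30  # unacknowledged alerts move up one tier at this point


def who_is_paged(minutes_unacknowledged, escalation_order):
    """Pick the tier being alerted; the last tier keeps the alert."""
    level = minutes_unacknowledged // ESCALATE_AFTER_MIN
    return escalation_order[min(level, len(escalation_order) - 1)]


def on_call_order(hour_utc, rotation):
    """Follow-the-sun escalation order for a given UTC hour (0-23).

    Assumption: the day is split into equal shifts starting at 00:00 UTC.
    The engineer whose shift just ended becomes the second tier.
    """
    shift_len = 24 // len(rotation)
    current = (hour_utc // shift_len) % len(rotation)
    return [rotation[(current - k) % len(rotation)] for k in range(len(rotation))]


if __name__ == "__main__":
    for minute, channel in notification_timeline(MY_RULES):
        print(f"t+{minute:>2} min: {channel}")
    order = on_call_order(9, ["engineer_a", "engineer_b", "engineer_c"])
    print("escalation order at 09:00 UTC:", order)
    print("paged after 31 unacknowledged minutes:", who_is_paged(31, order))
```

The point of keeping the personal rules (the phases) separate from the team rule (escalation order) is the same one made in item 1: each person decides how they get woken up, while the team decides who is next in line.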

I had initially thought I would just write a quick paragraph or two about why using PagerDuty for alerting is great for your devops team.  But I am such a big fan that I couldn’t resist coming up with more reasons why we love PagerDuty on our team.  The biggest reason of all is that, as an Ops team manager, I can sleep at night knowing that all alerts will get handled and that I won’t be woken up by a phone call from a VP or marketing person telling me that our cloud products are down.  Well, at least not because of a failure to handle alerts.

I would encourage you to stop relying on the subpar alerting mechanisms in monitoring and logging tools that don’t treat alerting as a first class citizen.  Those tools weren’t created to be awesome at alerting; they were meant to be detective in nature.  PagerDuty is made for alerting and definitely deserves a spot in your devops toolchain.

 


Filed under DevOps
