Yearly Archives: 2012

Breaking the Silence – Agile Admin Updates!

Well, I know it’s been quiet around here lately.  Two of the agile admins, Peco and I, have switched jobs and have been crazy busy.

Peco has moved to Opnet as an SE. He always loved their APM tool Panorama, and we got a lot of mileage out of it at NI. Notably, this included getting some of our suppliers, like Vignette, to buy it and use it on their products to find the performance problems before we did… We were having Vignette V7 performance issues and used Panorama to identify exactly what they were, and when we fed it back to Vignette engineering they were like “how the hell did you do that…” Opnet’s never had great marketing; they’ve long been doing what New Relic and AppDynamics do but aren’t as much of a recognized name. It’s an interesting move for Peco: he went from long-time Ops guy to a Java dev working on system automation like PIE, and now to an SE because he wants trigger time on customer interaction.  He’s a renaissance man!  And he’s working on his own stealth startup on the side, of course.

I have moved to Bazaarvoice, a SaaS startup in Austin (well, are you still a startup when you’re up to 400 people?) to be their Manager of Release Engineering.  They have been increasing their dev and ops forces by a very large margin (P.S. We’re hiring!) and wanted someone to run the “middle third” of their DevOps teams.  There’s one team of DevOps engineers embedded into all the product teams, kind of a matrixed organization; then there’s my team to do the build and deploy automation and cloud and data center stuff; then there’s a support team to wrangle the pages and tickets.

I was there for a week doing new hire training when they said “So we have moved to agile doing two week sprints, and that’s going OK except we realized we can’t release every two weeks safely. Get that going.  We start biweekly releases in three weeks. Zero customer impact each time!” Therefore I’ve been in a little bit of a frenzy doing that. It was definitely on the fine line between “I like a challenge” and “Oh shit.” I lucked out and got a week of slack because we decided to IPO on the date that the first release was supposed to go, and we decided that doing both on the same day was just a wee bit too ambitious. The first biweekly release slipped from a Thursday till the next Tuesday and only had two minor issues that needed fixing after, and the next one is coming up!  We should have it tamped down to regular in a sprint or two and I hope to return to more regular blogging.

On the side, of course, I continue to help run the Austin Cloud User Group and the Agile Austin DevOps SIG, put together DevOpsDays Austin, and do other random stuff, so sadly when the overcommittedness hits, the blog has to give.

James is doing great, and just had a new baby!  So he’s been out of pocket some too. He’s still at NI, and now large and in charge of all operations for their SaaS products. They are fighting the brave fight to get PIE open sourced and out the door.

We are all going to SXSW Interactive this weekend and will regroup and decide how we’re going to bring you all the latest in hot cloud/DevOps/tech stuff going forward!

Filed under General

Monitoring Sucks but Alerting is Beautiful

I (@wickett) work as the Cloud Ops Team Lead at National Instruments where we have several Software as a Service products that we have built on different cloud providers (AWS, Azure, Google) and have implemented with a host of other supporting SaaS tools (cloudkick, ZenDesk, AlertSite and PagerDuty plus several others).  When building out our products we decided to eat our own dog food and use SaaS solutions as much as possible.  Great, but where am I going with all this?

No matter what tools you use to monitor or log on your systems, you need a reliable way to get actionable events to your Ops Team.  If you have implemented several types of monitors, you are generally setting up an email address for them to send alerts to.  Then you write scripts to forward those to on-call devices, or forwarding rules to turn them into SMS: not exactly state of the art.  Try mixing in a global on-call rotation and configuring all of your monitoring tools to account for it, and it becomes a big problem.

Enter PagerDuty–by far this is the best SaaS product we use in our day-to-day Operations team.  PagerDuty is an alerting tool that is simple, easy-to-use and integrates into your other systems.   Why is PagerDuty so awesome?  Well, I am glad you asked.

  1. User defined escalation.  Once an alert gets sent into PagerDuty, it is processed through our escalation pathway.  It determines who is the first level of support and begins to alert that person.  Here is the cool part: I can choose to be alerted however I want, and other ops team members can choose however they want.  For me, I get an email after 1 minute, SMS alerts at 3-minute intervals for the next 9 minutes, then phone calls every 5 minutes for the next 20 minutes.  Let’s say you are a hard sleeper; then you might want to skip the SMS and move straight to phone calls.  If I don’t acknowledge the alert in 30 minutes, it will get escalated to the next ops team member.
  2. Alert Acknowledgement.  From any one of the alert mechanisms above, there is an in-kind way to acknowledge and resolve the alert.  I can reply to the SMS message with an ACK code or when I get a call from the PagerDuty version of Siri I can select a response right on my keypad at that time.  No time is lost logging into PagerDuty to acknowledge the alert and the ops team can just get busy responding.
  3. Equality of alerts.  This is a subtle one.  We have a policy on our team that all alerts are equal and need to be handled with the same care and diligence.  Anything that makes it to PagerDuty is treated with the same level of importance and is escalated through the same channel–no “you can just ignore those alerts from that system over there” syndrome on my team.  All alerts are escalated and all must be handled.
  4. API integration.  PagerDuty lets you integrate with tools you already use (e.g. nagios, zenoss, cloudkick, splunk) and those tools can open and close alerts as they are detected and/or resolved.
  5. Email integration. Even if you have written some code or a monitor that doesn’t integrate with the API, all it needs to be able to do is send an email.  Once that email is received, PagerDuty will treat it as an alert.
  6. Global 24/7 tools.  PagerDuty works in lots of countries and has scheduling that allows follow-the-sun ops teams to thrive.  I am on call every day for 8 hours, and after my shift ends one of my other global ops team members (@einsamsoldat and @hafizramly) is on call for the next 8 hours, at which point I am bumped to the second tier of the escalation path.  Most tools miss this, and for our team this is a huge benefit.
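
To make the escalation timing in item 1 concrete, here is a minimal sketch that turns a set of personal notification rules into a timeline. It is only an illustration, not PagerDuty's API or implementation: the `Rule` class, `build_timeline`, and `notifications_before_ack` are hypothetical names, and it assumes each phase starts when the previous one ends (1 + 9 + 20 = 30 minutes), which is one reading of the schedule described above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Rule:
    """One phase of a personal notification policy (hypothetical model)."""
    channel: str
    interval_min: int  # minutes between notifications in this phase
    window_min: int    # how long this phase lasts, in minutes


def build_timeline(rules, escalate_after_min):
    """Return (minute, action) pairs, ending with the escalation step.

    Assumes phases run back to back and that nothing is sent to the
    first responder at or after the escalation timeout.
    """
    timeline = []
    phase_start = 0
    for rule in rules:
        phase_end = phase_start + rule.window_min
        t = phase_start + rule.interval_min
        while t <= phase_end and t < escalate_after_min:
            timeline.append((t, rule.channel))
            t += rule.interval_min
        phase_start = phase_end
    timeline.append((escalate_after_min, "escalate"))
    return timeline


def notifications_before_ack(timeline, ack_minute):
    """Notifications that would already have fired before an acknowledgement."""
    return [(t, action) for t, action in timeline if t < ack_minute]


# The schedule from item 1: email at 1 minute, SMS every 3 minutes for
# 9 minutes, phone calls every 5 minutes for 20 minutes, escalate at 30.
RULES = [Rule("email", 1, 1), Rule("sms", 3, 9), Rule("phone", 5, 20)]
TIMELINE = build_timeline(RULES, escalate_after_min=30)

if __name__ == "__main__":
    for minute, action in TIMELINE:
        print(f"t+{minute:>2} min: {action}")
```

A hard sleeper would drop the `sms` rule and give the `phone` rule a longer window; the escalation step stays the same either way.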

I had initially thought I would just write a quick paragraph or two about why using PagerDuty for alerting is great for your devops team.  But I am such a big fan that I couldn’t resist coming up with more reasons why we love PagerDuty on our team.  The biggest reason of all is that as an Ops team manager, I can sleep at night knowing that all alerts will get handled and I won’t be woken up by a phone call from a VP or marketing person telling me that our cloud products are down. Well, at least not because of a failure to handle alerts.

I would encourage you to stop using subpar alerting mechanisms in monitoring and logging tools that don’t treat alerting as a first class citizen.  Those tools weren’t created to be awesome at alerting; they were meant to be detective in nature.  PagerDuty is made for alerting and definitely deserves a spot in your devops toolchain.

 

Filed under DevOps

Awesome Austin Tech Meetups

Austin is such a great place to be a techie.

  • The Austin Cloud User Group (I help run it) meets every third Tuesday evening, and we’ve been having 50+ people come in to check out some awesome stuff.  Next meeting Feb 21 on Puppet, hosted by Pervasive.
  • The Agile Austin DevOps SIG meets fourth Wednesdays. We had our meeting today with about 20 attendees, hosted by CA/Hyperformix. I also help run that one.
  • The Austin Big Data User Group is back meeting – next one is tomorrow night! Hosted by Bazaarvoice.
  • The Austin OWASP chapter is one of the biggest and most active in the country, and also meets monthly, hosted by National Instruments. Fellow Agile Admin James Wickett helps run that group.
  • The Cloud Security Alliance, Austin chapter is just getting started but has a lot of momentum and we’re coordinating with them from the ACUG and OWASP sides. Their first meeting is tonight, come out!

There are others but those are my favorites and therefore the coolest by definition.

There are also cool events coming up you should keep an eye out for.

  • DevOpsDays Austin, Apr 2-3, hosted by National Instruments, and this’ll be big! Patrick Debois and the whole crew of DevOps illuminati will be here. Now taking sponsors and speakers! Register now!
  • AppSec USA 2012, Oct 23-26 – Austin OWASP kicks so much ass with LASCON that the annual OWASP convention is coming here to Austin this year!
  • South by Southwest Interactive, March 9-13 – quickly becoming the Web conference in the flyover states :-). Lots of stuff happens during it, like:
    • Austin Cloud/DevOps party courtesy GeekAustin (ACUG is a community sponsor). March 10.
    • CloudCamp – Dave Nielsen will be bringing a CloudCamp to Austin again this year during SXSWi. Details TBD, sounding like Mar 11 maybe.
  • The Cloud Security Alliance and ACUG are hoping to put together an Austin cloud conference, too. Maybe early 2013.

Filed under Cloud, Conferences, DevOps

Why Does Cloud Load Balancing Suck?

Back in the old world of real infrastructure, we used Netscalers or F5’s and we were happy.  Now in the cloud, you have several options all of which seem to have problems.

1. Open source.  But once you want SSL, and redundancy, and HTTP compression, you get people saying with a straight face “nginx (for HTTP compression) –> Varnish cache (for caching) –> HTTP level load balancer (HAProxy, or nginx, or the Varnish built-in) –> webservers.” (Quoted from Server Fault).  Like four levels, often with the same software twice in it. And don’t forget some kind of heartbeat between the two front-ends. Oh look I’ve spent $150/mo on just machines to run my load balancing. And I really want to load balance/failover between all my tiers not just the front end.  It’s a lot of software parts to go wrong.

2. Zeus.  For some reason none of the other LB vendors have gotten off their happy asses and delivered a good software load balancer you can use in Amazon.  I got tired of talking to our Netscaler reps about it after the first couple of years.  They’re more interested in selling their hardware to the cloud data centers than helping real people load balance their apps. Zeus is the only one, and it’s really quite expensive.

3. Amazon ELBs.  These just have a lot of problems under the hood.  We’ve been engaged with Amazon ELB product management on them – large files serve out super slow; users get hits refused due to throttling/changes during ELB scaling – basically if you want 100% of your hits to come through you can’t use them.

4. Geo-IP load balancing, through Dyn or whoever. They claim to have the failover problem fixed, but it still only works for the front end tier of what is a multitier architecture. I certainly don’t want to have to advertise every internal IP in external DNS to make load balancing work.

And really the frustrating part is there seems to have been no headway on any of this stuff in a decade. Same old open source options, same old techniques.  Can someone come up with a way to load balance on the cloud that a) doesn’t lose any hits, b) is one thing not 4 things, and c) is useful for front and back end balancing?  Seems like a necessary part of oh say every system ever, why is it still so hard?
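
To pin down what requirement (a) means in practice, here is a toy sketch of a round-robin balancer that fails over to the next backend instead of dropping the hit. It is not any vendor's implementation: `Backend` and `RoundRobinBalancer` are made-up names, the backends are in-process stand-ins for real servers, and retrying like this is only safe when requests are idempotent.

```python
from itertools import cycle


class Backend:
    """In-process stand-in for a server in any tier (hypothetical)."""

    def __init__(self, name, healthy=True):
        self.name = name
        self.healthy = healthy

    def handle(self, request):
        if not self.healthy:
            raise ConnectionError(f"{self.name} is down")
        return f"{self.name} served {request}"


class RoundRobinBalancer:
    """Rotate through backends, retrying on the next one when a call fails."""

    def __init__(self, backends):
        self.backends = list(backends)
        self._ring = cycle(self.backends)

    def dispatch(self, request):
        # Try each backend at most once per request so a single dead
        # node costs a retry rather than a lost hit.
        for _ in range(len(self.backends)):
            backend = next(self._ring)
            try:
                return backend.handle(request)
            except ConnectionError:
                continue
        raise RuntimeError("no healthy backends available")


if __name__ == "__main__":
    pool = RoundRobinBalancer(
        [Backend("web1"), Backend("web2", healthy=False), Backend("web3")]
    )
    for i in range(3):
        print(pool.dispatch(f"request-{i}"))
```

Nothing in the class cares whether the backends are web servers or an internal service tier, which is the point of requirement (c); the hard parts a real product has to solve (SSL, shared state between redundant balancers, health checks over the network) are deliberately left out.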

Filed under Cloud, DevOps