Monthly Archives: March 2012

Monitorin’ Ain’t Easy

The DevOps space has been aglow with discussion about monitoring.  Monitoring, much like pimping, is not easy. Everyone does it, but few do it well.

Luckily, here on the agile admin we are known for keeping our pimp hands strong, especially when it comes to monitoring. So let’s talk about monitoring, and how to use it to keep your systems in line and giving you your money!

In November, I posted about why your monitoring is lying to you. It turns out that this is part of a DevOps-wide frenzy of attention to monitoring.

John Vincent (@lusis) started it with his Why Monitoring Sucks blog post, which has morphed into a monitoringsucks github project to catalog monitoring tools and needs. Mainstream monitoring hasn’t changed in oh, like 10 years, but the world of computing certainly has, so there’s a gap appearing. This refrain was quickly picked up by others (Monitoring Sucks. Do Something About It.) There was a Monitoring Sucks panel at SCALE last week and there’s even a #monitoringsucks hashtag.

Patrick Debois (@patrickdebois) has helped step into the gap with his series of “Monitoring Wonderland” articles where he’s rounding up all kinds of tools. Check them out…

However, it just shows how fragmented and confusing the space is. It also focuses almost completely on the open source side – I love open source and all but sometimes you have to pay for something. Though the “big ol’ suite” approach from the HP/IBM/CA lot makes me scream and flee, there are definitely options worth paying for.

Today we had a local DevOps meetup here in Austin where we discussed monitoring. It showed how fragmented the current state is. We also met some other folks from a “real” engineering company, like NI, and it brought to mind how crap IT type monitoring is when compared to engineering monitoring in terms of sophistication. IT monitoring usually has better interfaces and alerting, but IT monitoring products are very proud when they have “line graphs!” or the holy grail, “histograms!” Engineering monitoring systems have algorithms that can figure out the difference between a real problem and a monitoring problem. They apply advanced algorithms when looking at incoming metrics (hint: signal processing). When is anyone in the IT world who’s all delirious about how cool “metrics” are going to figure out some math above the community college level?
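To make the “signal processing” hint a little more concrete, here is a minimal sketch of one of the simplest techniques in that family: an exponentially weighted moving average (EWMA) with a deviation threshold. This is not what any particular engineering or IT product does. The function name, the smoothing factor, the threshold, and the convention that a missing sample shows up as `None` are all assumptions made for illustration. The idea is just that a missing sample is treated as a monitoring problem (a gap), while a sample that sits far outside the recent baseline is treated as a possible real problem.

```python
from math import sqrt


def classify_samples(samples, alpha=0.3, threshold=3.0, warmup=5):
    """Split a metric series into anomalies and collection gaps.

    samples   -- list of numbers, with None where no data arrived (assumed convention)
    alpha     -- EWMA smoothing factor (illustrative default)
    threshold -- how many standard deviations count as anomalous
    warmup    -- number of good samples to absorb before flagging anything
    Returns (anomaly_indexes, gap_indexes).
    """
    mean = None      # running EWMA of the metric
    var = 0.0        # running exponentially weighted variance
    seen = 0         # count of samples folded into the baseline
    anomalies, gaps = [], []

    for i, x in enumerate(samples):
        if x is None:
            # No data at all: the monitoring itself is the suspect.
            gaps.append(i)
            continue
        if mean is None:
            # First real sample seeds the baseline.
            mean, seen = float(x), 1
            continue

        deviation = x - mean
        std = sqrt(var)
        if seen >= warmup and std > 0 and abs(deviation) > threshold * std:
            # Far outside the recent baseline: flag it and keep it
            # out of the baseline so one spike does not skew later checks.
            anomalies.append(i)
            continue

        # Standard incremental EWMA mean and variance update.
        mean += alpha * deviation
        var = (1 - alpha) * (var + alpha * deviation ** 2)
        seen += 1

    return anomalies, gaps


if __name__ == "__main__":
    series = [100, 101, 99, 100, 102, 100, 99, 101, 500, None, 100, 101]
    print(classify_samples(series))
```

Real engineering systems go a lot further than this (filtering, seasonality, correlation across signals), but even this much separates “the metric went weird” from “the metric went missing.”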

To me, the biggest gap in the space, especially in cloud land, is agent-based real user monitoring, which is partially being addressed by New Relic and Boundary. I want to know each user and incoming/outgoing transaction, not at the “tcpdump” level but at the meaningful level. And I don’t want to have to count on the app to log it – besides the fact that devs are notoriously shitful loggers, there are so many cases where something goes wrong – if tomcat’s down, it’s not logging, but requests are still coming in… Synthetic monitoring and app metrics are good but they tend to not answer most of the really hard questions we get with cloud apps.

We did a big APM (application performance management) tool eval at NI, and got a good idea of the strengths and weaknesses of the many approaches. You end up wanting many/all of them really. Pulling box metrics via SNMP or agents, hitting URLs via synthetic monitors locally or across the Internet, passive network based real user monitoring, deep dive metric gathering (Opnet/AppDynamics/New Relic/etc.)…  We’ll post more about our thoughts on all these (especially Peco, who led that eval and is now working for an APM company!).

Your thoughts on monitoring?  Hit me!

Filed under DevOps

Service Delivery Best Practices?

So… DevOps.  DevOps vs NoOps has been making the rounds lately. At Bazaarvoice we are spawning a bunch of decentralized teams not using that nasty centralized ops team, but wanting to do it all themselves.  This led me to contemplate how to express the things that Operations does in a way that turns them into developer requirements.

Because to be honest a lot of our communication in the Ops world is incestuous.  If one were to point a developer (or worse, a product manager) at the venerable infrastructures.org site, they’d immediately poop themselves and flee.  It’s a laundry list of crap for neckbeards, but spends no time on “why do you want this?” “What value will this bring to your customer?”

So for the “NoOps” crowd who want to do ops themselves – what’s an effective way to state these needs?  I took previous work – ideas from the pretty comprehensive but waterfally “Systems Development Framework” Peco and I did for NI, Chris from Bazaarvoice’s existing DevOps work – and here’s a cut.  I’d love feedback. The goal is to have a small discrete set of areas (5-7), with a small discrete set (5-7) of “most important” items – most importantly, stated in a way so that people understand that these aren’t “annoying things some ops guy wants me to do instead of writing 3l33t code” but instead stated as they should be, customer facing requirements like any other.  They could then be each broken into subsidiary more specific user stories (probably a whole lot of them, to be honest) but these could stand in as “epics” or “pillars” for making your Web thing work right.

Service Delivery Best Practices

Availability

  • I will make my service highly available, so that customers can use it constantly without issues
  • I will know about any problems with my service’s delivery to customers
  • I can restore my service quickly from any disaster or loss
  • I can make my data resistant to corruption from runtime issues
  • I will test my service under routine disruption to make it rugged and understand its failure modes

Performance

  • I know how all parts of the product perform and can measure performance against a customer SLA
  • I can run my application and its data globally, to serve our global customer base
  • I will test my application’s performance and I know my app’s limitations before it reaches them
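As a rough illustration of “measure performance against a customer SLA,” here is a minimal sketch that checks a batch of response times against a percentile target. The 95th percentile and the 500 ms limit are made-up example numbers, not anything from a real SLA, and the function names are hypothetical.

```python
def percentile(values, pct):
    """Nearest-rank percentile; pct is an integer from 1 to 100."""
    if not values:
        raise ValueError("need at least one value")
    ordered = sorted(values)
    # Integer ceiling of pct * n / 100, which avoids float rounding surprises.
    rank = (pct * len(ordered) + 99) // 100
    return ordered[max(rank, 1) - 1]


def meets_sla(latencies_ms, pct=95, limit_ms=500):
    """True if the pct-th percentile latency is within the limit."""
    return percentile(latencies_ms, pct) <= limit_ms


if __name__ == "__main__":
    latencies = [120, 180, 200, 250, 300, 320, 400, 450, 480, 900]
    print(percentile(latencies, 95), meets_sla(latencies))
```

The point of using a percentile rather than an average is that one slow request in ten is invisible in the mean but very visible to the customer who got it.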

Support

  • I will make my application supportable by the entire organization now and in the future
  • I know what every user hitting my service did
  • I know who has access to my code and data and release mechanism and what they did
  • I can account for all my servers, apps, users, and data
  • I can determine the root cause of issues with my service quickly
  • I can predict the costs and requirements of my service ahead of time enough that it is never a limiter
  • I will understand the communication needs of customers and the organization and will make them aware of all upgrades and downtime
  • I can handle incidents large and small with my services effectively

Security

  • I will make every service secure
  • I understand Web application vulnerabilities and will design and code my services to not be subject to them
  • I will test my services to be able to prove to customers that they are secure
  • I will make my data appropriately private and secure against theft and tampering
  • I will understand the requirements of security auditors and expectations of customers regarding the security of our services

Deployment

  • I can get code to production quickly and safely
  • I can deploy code in the middle of the day with no downtime or customer impact
  • I can pilot functionality to limited sets of users
  • I will make it easy for other teams to develop apps integrated with mine
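For “pilot functionality to limited sets of users,” one common approach is a percentage rollout keyed on a stable hash of the user. This is a minimal sketch of that idea, not a description of how Bazaarvoice or NI does it; the function name and the feature and user identifiers are hypothetical.

```python
import hashlib


def in_pilot(user_id, feature, percent):
    """Return True if this user falls inside the pilot for this feature.

    Hashing feature and user together gives each user a stable bucket
    from 0 to 99 per feature, so the same user always gets the same
    answer and raising `percent` only ever adds users to the pilot.
    """
    key = f"{feature}:{user_id}".encode("utf-8")
    bucket = int(hashlib.sha256(key).hexdigest()[:8], 16) % 100
    return bucket < percent


if __name__ == "__main__":
    users = [f"user-{n}" for n in range(1000)]
    enabled = sum(in_pilot(u, "new-checkout", 10) for u in users)
    print(f"{enabled} of {len(users)} users are in the 10% pilot")
```

Because the bucket is derived from the user rather than stored, there is no extra state to deploy, which fits nicely with the “no downtime or customer impact” item above.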

We also need to capture that not everyone starting out can or should do all of this… We struggled over whether these were “Goals” or “Best Practices” or what… On the one hand you should only put in e.g. “as much security as you need” but on the other hand there’s value in understanding the optimal condition.

Filed under DevOps

Breaking the Silence – Agile Admin Updates!

Well I know it’s been quiet around here lately.  Two of the agile admins, myself and Peco, have switched jobs and have been crazy busy.

Peco has moved to Opnet as an SE – he always loved their APM tool Panorama and we got a lot of mileage out of it at NI. Notably, this included getting some of our suppliers, like Vignette, to buy it and use it on their products to find the performance problems before we did… We were having Vignette V7 performance issues and used Panorama to identify exactly what they were, and when we fed it back to Vignette engineering they were like “how the hell did you do that…” Opnet’s never had great marketing; they’ve been doing what New Relic and AppDynamics have been doing for a long time but aren’t as much of a recognized name. It’s an interesting move for Peco: he went from long time Ops guy to a Java dev, working on system automation like PIE, and now an SE because he wants trigger time on customer interaction.  He’s a renaissance man!  And working on his own stealth startup on the side of course.

I have moved to Bazaarvoice, a SaaS startup in Austin (well, are you still a startup when you’re up to 400 people?) to be their Manager of Release Engineering.  They have been increasing their dev and ops forces by a very large margin (P.S. We’re hiring!) and wanted someone to run the “middle third” of their DevOps teams.  There’s one team of DevOps embedded into all the product teams, kind of a matrixed organization; then there’s my team to do the build and deploy automation and cloud and data center stuff; then there’s a support team to wrangle the pages and tickets.

I was there for a week doing new hire training when they said “So we have moved to agile doing two week sprints, and that’s going OK except we realized we can’t release every two weeks safely. Get that going.  We start biweekly releases in three weeks. Zero customer impact each time!” Therefore I’ve been in a little bit of a frenzy doing that. It was definitely on the fine line between “I like a challenge” and “Oh shit.” I lucked out and got a week of slack because we decided to IPO on the date that the first release was supposed to go, and we decided that doing both on the same day was just a wee bit too ambitious. The first biweekly release slipped from a Thursday till the next Tuesday and only had two minor issues that needed fixing after, and the next one is coming up!  We should have it tamped down to regular in a sprint or two and I hope to return to more regular blogging.

On the side, of course, I continue to help run the Austin Cloud User Group and the Agile Austin DevOps SIG and put together DevOpsDays Austin and do other random stuff, so sadly, when the overcommittedness hits, the blog has to give.

James is doing great, and just had a new baby!  So he’s been out of pocket some too. He’s still at NI, and now large and in charge of all operations for their SaaS products. They are fighting the brave fight to get PIE open sourced and out the door.

We are all going to SXSW Interactive this weekend and will regroup and decide how we’re going to bring you all the latest in hot cloud/DevOps/tech stuff going forward!

Filed under General