Archive for the ‘DevOps’ Category

Monitoring Sucks but Alerting is Beautiful

February 19, 2012

I (@wickett) work as the Cloud Ops Team Lead at National Instruments where we have several Software as a Service products that we have built on different cloud providers (AWS, Azure, Google) and have implemented with a host of other supporting SaaS tools (cloudkick, ZenDesk, AlertSite and PagerDuty plus several others).  When building out our products we decided to eat our own dog food and use SaaS solutions as much as possible.  Great, but where am I going with all this?

No matter what tools you are using to monitor or log on your systems, you need a reliable way to get actionable events to your Ops Team.  If you have implemented several types of monitors, you generally are setting up an email address for them to send alerts to.  Then you write scripts to forward those to on call devices or forwarding rules to turn them into SMS–not exactly state of the art.  Try mixing in a global on call rotation and trying to configure all of your monitoring tools to account for that and it becomes a big problem.

Enter PagerDuty–by far this is the best SaaS product we use in our day-to-day Operations team.  PagerDuty is an alerting tool that is simple, easy-to-use and integrates into your other systems.   Why is PagerDuty so awesome?  Well, I am glad you asked.

  1. User defined escalation.  Once an alert gets sent into PagerDuty, it is processed through our escalation pathway.  It determines who is the first level of support and begins to alert that person.  Here is the cool part: I can choose to be alerted however I want, and other ops team members can choose however they want.  For me, I get an email after 1 minute, SMS alerts at 3 minute intervals for the next 9 minutes, then phone calls every 5 minutes for the next 20 minutes.  Let’s say you are a heavy sleeper; then you might want to skip the SMS and move straight to phone calls.  If I don’t acknowledge the alert in 30 minutes, it will get escalated to the next ops team member.
  2. Alert Acknowledgement.  From any one of the alert mechanisms above, there is an in-kind way to acknowledge and resolve the alert.  I can reply to the SMS message with an ACK code or when I get a call from the PagerDuty version of Siri I can select a response right on my keypad at that time.  No time is lost logging into PagerDuty to acknowledge the alert and the ops team can just get busy responding.
  3. Equality of alerts.  This is a subtle one.  We have a policy on our team that all alerts are equal and need to be handled with the same care and diligence.  Anything that makes it to PagerDuty is treated with the same level of importance and is escalated through the same channel–no “you can just ignore those alerts from that system over there” syndrome on my team.  All alerts are escalated and all must be handled.
  4. API integration.  PagerDuty lets you integrate with tools you already use (e.g. nagios, zenoss, cloudkick, splunk) and those tools can open and close alerts as they are detected and/or resolved.
  5. Email integration. Even if you have created some code or monitor that you want to alert from that doesn’t integrate with the API then all you need to be able to do is send an email.  Once that email is received, PagerDuty will treat it as an alert.
  6. Global 24/7 tools.  PagerDuty works in lots of countries and has scheduling that allows for follow-the-sun ops teams to thrive.  I am on call every day for 8 hours and after my shift ends one of my other global ops team members is on call (@einsamsoldat and @hafizramly) for the next 8 hours–at which point I am bumped to the second tier escalation path.  Most tools miss this and for our team this is a huge benefit.
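The personal notification rules from item 1 can be written down as a small timeline. This is only a sketch in Python, not PagerDuty's API: the function names are made up for illustration, and the exact minute offsets are one assumed reading of "SMS at 3 minute intervals for the next 9 minutes, then phone calls every 5 minutes for the next 20 minutes."

```python
from typing import List, Tuple

# Hypothetical model of one person's notification rules. This is NOT
# PagerDuty's API, just the schedule from item 1 written down as data.
ESCALATE_AFTER_MIN = 30  # unacknowledged alerts move to the next person


def notification_timeline() -> List[Tuple[int, str]]:
    """Return (minute, channel) pairs for an alert that is never acknowledged."""
    events = [(1, "email")]
    # SMS every 3 minutes for the 9 minutes after the email (assumed offsets).
    events += [(1 + 3 * i, "sms") for i in range(1, 4)]       # minutes 4, 7, 10
    # Phone calls every 5 minutes for the 20 minutes after the last SMS.
    events += [(10 + 5 * i, "phone") for i in range(1, 5)]    # minutes 15, 20, 25, 30
    events.append((ESCALATE_AFTER_MIN, "escalate"))
    return sorted(events, key=lambda e: e[0])


def notifications_before_ack(ack_minute: int) -> List[Tuple[int, str]]:
    """Notifications that go out if the alert is acknowledged at ack_minute."""
    return [e for e in notification_timeline() if e[0] < ack_minute]
```

Acknowledging at minute 8, for example, means one email and two SMS messages went out and no phone call was ever placed.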

I had initially thought I would just write a quick paragraph or two about why using PagerDuty for alerting is great for your devops team.  But I am such a big fan that I couldn’t resist coming up with more reasons why we love PagerDuty on our team.  The biggest reason of all is that as an Ops team manager, I can sleep at night knowing that all alerts will get handled and I won’t be woken up with a phone call from a VP or marketing person telling me that our cloud products are down–well, at least not because of a failure to handle alerts.

I would encourage you to stop using subpar alerting mechanisms in monitoring and logging tools that don’t treat alerting as a first class citizen.  Those tools weren’t created to be awesome at alerting; they were meant to be detective in nature.  PagerDuty is made for alerting and definitely deserves a spot in your devops toolchain.

 

Categories: DevOps

Awesome Austin Tech Meetups

January 25, 2012

Austin is such a great place to be a techie.

  • The Austin Cloud User Group (I help run it) meets every third Tuesday evening, and we’ve been having 50+ people come in to check out some awesome stuff.  Next meeting Feb 21 on Puppet, hosted by Pervasive.
  • The Agile Austin DevOps SIG meets fourth Wednesdays, we had our meeting today and had about 20 attendees, hosted by CA/Hyperformix. I also help run that one.
  • The Austin Big Data User Group is back meeting – next one is tomorrow night! Hosted by Bazaarvoice.
  • The Austin OWASP chapter is one of the biggest and most active in the country, and also meets monthly, hosted by National Instruments. Fellow Agile Admin James Wickett helps run that group.
  • The Cloud Security Alliance, Austin chapter is just getting started but has a lot of momentum and we’re coordinating with them from the ACUG and OWASP sides. Their first meeting is tonight, come out!

There are others but those are my favorites and therefore the coolest by definition.

There are also cool events coming up that you should keep an eye out for.

  • DevOpsDays Austin, Apr 2-3, hosted by National Instruments, and this’ll be big! Patrick Debois and the whole crew of DevOps illuminati will be here. Now taking sponsors and speakers! Register now!
  • AppSec USA 2012, Oct 23-26 – Austin OWASP kicks so much ass with LASCON that the annual OWASP convention is coming here to Austin this year!
  • South by Southwest Interactive, March 9-13 – quickly becoming the Web conference in the flyover states :-) . Lots of stuff happens during it, like:
    • Austin Cloud/DevOps party courtesy GeekAustin (ACUG is a community sponsor). March 10.
    • CloudCamp – Dave Nielsen will be bringing a CloudCamp to Austin again this year during SXSWi. Details TBD, sounding like Mar 11 maybe.
  • The Cloud Security Alliance and ACUG are hoping to put together an Austin cloud conference, too. Maybe early 2013.

Why Does Cloud Load Balancing Suck?

January 19, 2012 3 comments

Back in the old world of real infrastructure, we used Netscalers or F5′s and we were happy.  Now in the cloud, you have several options all of which seem to have problems.

1. Open source.  But once you want SSL, and redundancy, and HTTP compression, you get people saying with a straight face “nginx (for HTTP compression) –> Varnish cache (for caching) –> HTTP level load balancer (HAProxy, or nginx, or the Varnish built-in) –> webservers.” (Quoted from Server Fault).  Like four levels, often with the same software twice in it. And don’t forget some kind of heartbeat between the two front-ends. Oh look I’ve spent $150/mo on just machines to run my load balancing. And I really want to load balance/failover between all my tiers not just the front end.  It’s a lot of software parts to go wrong.

2. Zeus.  For some reason none of the other LB vendors have gotten off their happy asses and delivered a good software load balancer you can use in Amazon.  I got tired of talking to our Netscaler reps about it after the first couple years.  They’re more interested in selling their hardware to the cloud data centers than helping real people load balance their apps. Zeus is the only one – and it’s really quite expensive.

3. Amazon ELBs.  These just have a lot of problems under the hood.  We’ve been engaged with Amazon ELB product management on them – large files serve out super slow; users get hits refused due to throttling/changes during ELB scaling – basically if you want 100% of your hits to come through you can’t use them.

4. Geo-IP load balancing, through Dyn or whoever. They claim to have the failover problem fixed, but it still only works for the front end tier of what is a multitier architecture. I certainly don’t want to have to advertise every internal IP in external DNS to make load balancing work.

And really the frustrating part is there seems to have been no headway on any of this stuff in a decade. Same old open source options, same old techniques.  Can someone come up with a way to load balance on the cloud that a) doesn’t lose any hits, b) is one thing not 4 things, and c) is useful for front and back end balancing?  Seems like a necessary part of oh say every system ever, why is it still so hard?

Categories: Cloud, DevOps

Why Your Testing Is Lying To You

November 17, 2011 2 comments

As a follow-on to Why Your Monitoring Is Lying To You.  How is it that you can have an application go through a whole test phase, with two-day-long load tests, and have surprising errors when you go to production?  Well, here’s how…  The same application I describe in the case study part of the monitoring article slipped through testing as well and almost went live with some issues. How, oh how could this happen…

I Didn’t See Any Errors!

Our developers quite reasonably said “But we’ve been developing and using this app in dev and test for months and haven’t seen this problem!” But consider the effects at work in But, You See, Your Other Customers Are Dumber Than We Are. There are a variety of levels of effect that prevent you from seeing intermittent problems, and confirmation bias ends up taking care of the rest.

The only fix here is rigor. If you hit your application and test and it errors, you can’t just ignore it. “I hit reload, it worked.  Maybe they were redeploying. On with life!” Maybe it’s your layer, maybe it’s another layer, it doesn’t matter, you have to log that as a bug and follow up and not just cancel the bug as “not reproducible” if you don’t see it yourself in 5 minutes of trying.  Devs sometimes get frustrated with us when we won’t let up on occurrences of transient errors, but if you don’t know why they happened and haven’t done anything to fix it, then it’s just a matter of time before it happens again, right?

We have a strict policy that every error is a bug, and if the error wasn’t detected it is multiple bugs – a bug with the monitoring, a bug with the testing, etc. If there was an error but “you don’t know why” – you aren’t logging enough or don’t have appropriate tools in place, and THAT’s a bug.

Our Load Test/Automated Tests Didn’t See Any Errors!

I’ll be honest, we don’t have much in the way of automated testing in place. So there’s that.  But we have long load tests we run.  “If there are intermittent failures they would have turned up over a two day load test right?” Well, not so fast. How confident are you this error is visible to and detected by your load test?  I have seen MANY load test results in my lifetime where someone was happily measuring the response time of what turned out to be 500 errors.  “Man, my app is a lot faster this time!  The numbers look great! Wait… It’s not even deployed. I hit it manually and I get a Tomcat page.”

Often we build deliberate “lies” into our software. We throw “pretty” error pages that aren’t basic errors. We are trying not to leak information to customers so we bowdlerize failures on the front end. We retry maniacally in the face of failed connections, but don’t log it. We have to use constrained sets of return codes because the client consuming our services (like, say, Silverlight) is lobotomized and doesn’t savvy HTTP 401 or other such fancy schmancy codes.

Be careful that load tests and automated tests are correctly interpreting responses.  Look at your responses in Fiddler – we had what looked to the eye to be a 401 page that was actually passing back a 200 HTTP return code.
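One way a load test harness could guard against the "200 that is really a 401" case is to judge each response on both its status code and its body. The sketch below is a minimal illustration in Python; the marker strings are assumptions standing in for whatever your own error pages render, and body matching like this is inherently fragile, so treat it as a backstop rather than a fix for the bad status code.

```python
# Hypothetical markers: text that this app's "pretty" error pages are assumed
# to contain. Replace them with strings from your own error templates.
ERROR_MARKERS = ("401 unauthorized", "an error occurred", "apache tomcat")


def is_real_success(status_code: int, body: str) -> bool:
    """Count a response as a success only if the code AND the body agree."""
    if not 200 <= status_code < 300:
        return False
    lowered = body.lower()
    # A 2xx whose body looks like an error page is a disguised failure.
    return not any(marker in lowered for marker in ERROR_MARKERS)
```

A load test that reports response times only for hits where `is_real_success` is true will not end up happily timing error pages.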

The best fix to this is test driven development.  Run the tests first so you see them fail, then write the code so you see them work!  Tests are code, and if you just write them on your working code then you’re not really sure if they’ll fail if something’s bad!

Fault Testing

Also, you need to perform positive and negative fault testing. Test failures end to end, including monitoring and logging and scaling and other management stuff. At the far end of this you have the cool if a little crazy Chaos Monkey.  Most of us aren’t ready or willing to jack up our production systems regularly, but you should at least do it in test and verify both that things work when they should and that they fail and you get proper notification and information if they do.

Try this.  Have someone Chaos Monkey you by turning off something random – a database, making a file system read only, a back end Web service call.  If you have redundancy built in to counter this, great, try it with one and see the failover, but then have them break “all of it” to provoke a failure.  Do you see the failure and get alerted? More importantly, do you have enough information to tell what they broke?  “One of the four databases we connect to” is NOT an adequate answer. Have someone break it, send you all the available logs and info, and if you can’t immediately pinpoint the problem, fix that.

How Complex Systems Fail, Invisibly

In the end, a lot of this boils down to How Complex Systems Fail. You can have failures at multiple levels – and not really failures, just assumptions – that stack on top of each other to both generate failures and prevent you from easily detecting those failures.

Also consider that you should be able to see those “short of failure” errors.  If you’re failing over, or retrying, or whatnot – well it’s great that you’re not down, but don’t you think you should know if you’re having to fail over 100x more this week?  Log it and turn it into a metric. On our corporate Web site, there’s hundreds of thousands of Web pages, so a certain level of 404s is expected.  We don’t alert anyone on a 404.  But we do metricize it and trend it and take notice if the number spikes up (or down – where’d all that bad content go?).
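The "trend it and notice if it spikes up or down" idea can be sketched with a simple comparison against a recent baseline. This is an illustrative Python snippet, not what we actually run; the function name and the threshold `factor` of 3 are arbitrary assumptions.

```python
from statistics import mean
from typing import Sequence


def is_anomalous(history: Sequence[float], current: float, factor: float = 3.0) -> bool:
    """Flag a count that is far above OR far below the recent average."""
    if not history:
        return False  # nothing to compare against yet
    baseline = mean(history)
    if baseline == 0:
        return current > 0
    return current > baseline * factor or current < baseline / factor
```

With hourly 404 counts of around 100, a reading of 105 is ignored, while 400 (a spike) and 10 (where did the bad content go?) are both flagged.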

Wholesale failures are easy to detect and mitigate.  The hard ones are the mini-failures, or things that someone would argue are not failures, on a given level that line up with the same kinds of things on all the other layers; those lined-up holes start letting problems slip through.

http://www.codinghorror.com/blog/2011/04/working-with-the-chaos-monkey.html

Categories: DevOps

Why Your Monitoring Is Lying To You

November 15, 2011 10 comments

In my Design for Failure article, I mentioned how many of the common techniques we use to allegedly detect failure really don’t.  This time, we’ll discuss your monitoring and why it is lying to you.

Well, you have some monitoring, don’t you, couldn’t it tell you if an application is down? Obviously not if you are just doing old SNMP/box level monitoring, but you’re all DevOps and you know you have to monitor the applications because that’s what counts. But even then, there are common antipatterns to be aware of.

Synthetic Monitoring

Dirty secret time, most application monitoring is “synthetic,” which means it hits a specific URL or set of URLs once in a while, often 5-10 minutes apart. Also, since there are a lot of transient failures out there on the Internet, most ops groups have their monitors set where they have to see 2-5 consecutive failures – because ops teams don’t like being woken up at 3 AM because an application hiccuped once (or the Internet hiccuped on the way to the application). If the problem happens on only 1 of every 20 hits, and you have to see three errors in a row to alert, then I’ll leave it to your primary school math skills to determine how likely it is you’ll catch the problem.

You can improve on this a little bit, but in the end synthetic monitoring is mainly useful for coarse uptime checking and performance trending.

Metric Monitoring

OK, so synthetic monitoring is only good for rough up/down stuff, but what about my metric monitoring? Maybe I have a spiffier tool that is continuously pulling metrics from Web servers or apps that should give me more of a continuous look.  Hits per second over the last five minutes; current database space, etc.

Well, I have noticed that metrics monitors, with startling regularity, don’t really tell you if something is up or down, especially historically. If you pull current database space and the database is down, you’d think there would be a big nasty gap in your chart but many tools don’t do that – either they report the last value seen, or if it’s a timing report it happily reports the timing of errors. Unless you go to the trouble to say “if the thing is down, set a value of 0 or +infinity or something” then you can sometimes have a failure, then go back and look at your historical graphs and see no sign there’s anything wrong.
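The "if the thing is down, say so explicitly" rule might look like the following in a home-grown collector. This is a Python sketch under assumptions: `read_value` is a hypothetical callable that queries the thing being measured, and the tuple it returns is an invented convention, not any particular tool's format.

```python
from typing import Callable, Optional, Tuple


def sample_metric(read_value: Callable[[], float]) -> Tuple[Optional[float], bool]:
    """Return (value, up). On failure, report an explicit gap, never stale data."""
    try:
        return read_value(), True
    except Exception:
        # Broad catch is deliberate in this sketch: any failure to read counts
        # as "down". Do not reuse the last good value; None leaves a visible
        # hole in the chart, and the False flag can be graphed as its own series.
        return None, False
```

Storing the `up` flag as a separate 0/1 series means a historical graph shows the outage even when the value series is simply missing points.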

Log Monitoring

Well surely your app developers are logging if there’s a failure, right? Unfortunately logging is a bit of an art, and the simple statement “You should log the overall success or failure of each hit to your app, and you should log failures on any external dependency” can be… reinterpreted in many ways. Developers sometimes don’t log all the right things, or even decide to suppress certain logs.

You should always log everything.  Log it at a lower log level, like INFO, if it’s routine, but then at least it can be reviewed if needed and can be turned into a metric for trending via tools like Splunk. My rules are simple:

  • Log the start and end of each hit – are you telling the client success or failure? Don’t rely on the Web server log.
  • Log every single hit to an external dependency at INFO
  • Log every transient failure at WARN
  • Log every error at ERROR
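As a rough illustration of the first and last rules, here is a minimal Python wrapper using the standard `logging` module. The logger name, the `handle_hit` function, and the message format are all assumptions made for this example.

```python
import logging

log = logging.getLogger("hits")  # hypothetical logger name


def handle_hit(request_id, handler):
    """Wrap one hit so its start and its final outcome are always logged."""
    log.info("hit=%s start", request_id)
    try:
        response = handler()
    except Exception:
        # The client is about to be told "failure": log at ERROR with traceback.
        log.exception("hit=%s end outcome=failure", request_id)
        raise
    log.info("hit=%s end outcome=success", request_id)
    return response
```

Because the start and end lines share a `hit=` key, a tool like Splunk can pair them up and turn the success/failure ratio into a metric.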

Real User Monitoring

Ah, this is more like it.  The alleged Holy Grail of monitoring is real user monitoring, where you passively look at the transactions coming in and out and log them.  Well, on the one hand, you don’t have to rely on the developers to log, you can log despite them.  But you don’t get as much insight as you’d think. If the output from the app isn’t detectable as an error, then the monitoring doesn’t help.  A surprising amount of the time, failures are not thrown as a 500 or other expected error code. And checking for content within a payload is often fragile.

Also, RUM tools tend to be network sniffer based, which don’t work well in the cloud or in many network topologies.  And you get so much data, that it can be hard to find the real problems without spending a lot of time on it.

No, Really – One Real World Example

We had a problem just this week that managed to successfully slip through all our layers of monitoring – luckily, our keen eyes caught it in preproduction. We had been planning a big app release and had been setting up monitoring for it. It seemed like everything was going fine. But then the back end databases (SQL Azure in this case) had a pretty long string of failures for about 10 minutes, which brought our attention to the issue. As I looked into it, I realized that it was very likely we would have seen smaller spates of SQL Azure connection issues and thus application outage before – why hadn’t we?  I investigated.

We don’t have any good cloud-compliant real user monitoring in place yet.  And the app was throwing a 200 http code on an error (the error page displayed said 401, but the actual http code was 200) so many of our synthetic monitors were fooled. Plus, the problem was usually occasional enough that hitting once every 10 minutes from Cloudkick didn’t detect it. We fixed that bad status code, and looked at our database monitors. “I know we have monitors directly on the databases, why aren’t those firing?”

Our database metric monitors through Cloudkick, I was surprised to see, had lovely normal looking graphs after the outage. I provoked another outage in test to see, and sure enough, though the monitors “went red,” for some reason they were still providing what seemed to Cloudkick like legitimate data points, and once the monitors “went green,” nothing about any of the metric graphs indicated anything unusual! In other words, the historical graphs had legitimate looking data and did not reveal the outage. That’s a big problem. So we worked on those monitors.

I still wanted to know if this had been happening.  “We use Splunk to aggregate our logs, I’ll go look there!” Well, there were no error lines in the log that would indicate a back end database problem. Upon inquiring, I heard that since SQL Azure connection issues are a known and semi-frequent problem, logging of them is suppressed, since we have retry logic in place.  I recommended that we log all failures, with ones that are going to be retried simply logged at a lower severity level like WARN, but ERROR on failures after the whole spread of retries. I declared this a showstopper bug that had to be fixed before release – not everyone was happy with that, but sometimes DevOps requires tough love.
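The recommendation in the previous paragraph (WARN on failures that will be retried, ERROR once the retries run out) could be sketched like this in Python. It is an illustration only, not our actual retry code; `call_dependency`, its parameters, and the log format are assumptions.

```python
import logging
import time

log = logging.getLogger("deps")  # hypothetical logger name


def call_dependency(name, func, retries=3, delay_s=0.0):
    """Call an external dependency, logging every attempt and every failure."""
    for attempt in range(1, retries + 1):
        log.info("dependency=%s attempt=%d start", name, attempt)
        try:
            result = func()
        except Exception as exc:
            if attempt < retries:
                # Transient so far: we will retry, but the failure is recorded.
                log.warning("dependency=%s attempt=%d failed, retrying: %s",
                            name, attempt, exc)
                time.sleep(delay_s)
                continue
            # Out of retries: this is now a real error, and the caller sees it.
            log.error("dependency=%s failed after %d attempts: %s",
                      name, retries, exc)
            raise
        log.info("dependency=%s attempt=%d ok", name, attempt)
        return result
```

With this in place, a search for WARN lines on a given dependency shows how often the retry logic is quietly saving you.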

I was disturbed that we could have periods of outage that were going unnoticed despite our investment in synthetic monitoring, pulling metrics, and searching logs. When I looked back at all our metrics over periods of known outage and they all looked good, I admit I became somewhat irate. We fixed it and I’m happy with our detection now, but I hope this is instructive in showing you how bad assumptions and not fully understanding the pros and cons of each instrumentation approach can end up leaving “stacked holes” that end up profoundly compromising your view of your service!

DevOps Tip: Design for Failure

November 14, 2011

We have had some interesting  internal discussions lately about application reliability.  It’s probably not a surprise to many of you that the cloud is unreliable, on a small scale that is.  Sure, on the large scale you use the cloud to make highly resilient environments. But a certain percentage of calls to the cloud fail – whether it’s Amazon’s or Azure’s management APIs, or hitting Amazon or Azure storage, or going through an Amazon ELB, or hitting SQL Azure. Heck, on Azure they plainly state that they will pull your instances out from under you, restart them, and move them to other hardware without notice. If you’re running 2 or more, they won’t do them all at the same time – so again, you get large scale resilience but at the cost of some small scale unreliability.

The problem is that people sometimes start from the assumption that their application is always working fine, unless you can prove otherwise. This is fundamentally the wrong assumption. You have to assume your application has problems, unless you can prove it doesn’t.  This changes your approach to testing, logging, and monitoring profoundly.

Take the all too common example of an app with intermittent failures. Let’s say it’s as bad as 1 in 20 times.  1 in 20 times a customer hits your application, it fails somehow. It is very likely you don’t know this. Because by default, you don’t know it. How would you? Well, obviously, by monitoring, logging, and testing. I’ll follow this up with a series of posts describing how and why those often fail to detect problems. The short form is that “ha ha, no they don’t.”

Here’s a bad story I’ll tell on myself.  Here at NI, we rolled out a PDF instant quote generation widget.  We have over 250 apps on ni.com, so we don’t put synthetic monitors on all of them (remind me to tell you about the time early at NI that I discovered synthetic monitoring was producing 30% of our site load). Apparently the logging wasn’t all that good either, it didn’t trigger any of our log monitoring heuristics. Anyway, come to find out later on that the app was failing in production about 75% of the time. This is an application on a “monitored” site, where a developer and a tester signed off on the app. Whoops.  If you do a cursory test and assume it’ll work – well you know what they say about assumptions – they make an ass out of “you” and “mption.” :-)

Anyway, to me part of the good part about the cloud is that they come out and say “we’re going to fail 2-5% of the time, code for it.” Because before the cloud, there were failures all the time too, but people managed to delude themselves into thinking there weren’t; that an application (even a complex Internet-based application) should just work, hit after hit, day after day, on into the future. So by having handling failure built in – like a lesser version of the Chaos Monkey – you’re not really just making your app cloud friendly, you’re making it better.

Real engineers who make cars and whatnot know better. That’s why there was a big ol’ maintenance hatch on the side of the Hubble Space Telescope; if any of you have watched the Hubble 3D IMAX film you get to see them performing maintenance on it.  If a billion dollar telescope in fricking space has problems and needs to be maintainable, so does your little Web app.

But I see so many apps that don’t really take failure into account.  Oh, maybe they retry some connections if they fail. But what if you get to the end of your retries? What if the response you get back is an unexpected HTTP code or unexpected payload? You’d think in the age of try/catch and easily integrated logging frameworks you wouldn’t see this any more, but I see it all the time. It’s a combination of not realizing that failure is ubiquitous, and not thinking about the impact (especially the customer facing impact) of that failure.

This is one of the (many) great DevOps learning experiences – Ops helping teach Devs all the things that can go wrong that don’t really go wrong much in a “frictionless” lab environment.  “So, what do you do if your hard drive is suddenly not there?” (Common with Amazon EBS failures.)  “What do you do if you took data off that queue and then your instance restarts before you put it into the database?” (Hopefully a transaction.) “What do you do if you can’t make that network connection, are you retrying every 5 ms and then filling up the system’s TCP connections?” (True story.) “Hey, I’m sure your app is pure as the driven snow right now, but is it always going to work the same when the PaaS vendor changes the OS version under you?”
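On the "retrying every 5 ms" story: one common alternative to a fixed, tiny retry interval is exponential backoff with a cap. The Python sketch below only computes the wait times; the function name and the default numbers are arbitrary assumptions, and a real client would usually add some random jitter on top.

```python
from typing import List


def backoff_delays_ms(base_ms: int = 100, factor: int = 2,
                      cap_ms: int = 5000, attempts: int = 6) -> List[int]:
    """Wait times before each retry: exponential growth with an upper bound."""
    return [min(cap_ms, base_ms * factor ** i) for i in range(attempts)]
```

With the defaults the waits grow from 100 ms to 3,200 ms over six retries, instead of hammering the network stack 200 times a second.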

In all circumstances, you should

  • Plan for failure (understand failure modes, retry, design for it)
  • Detect failure (monitor, log, etc.)
  • Plan for and detect failure of your schemes to plan for and detect failure!

We do some security threat modeling here. I wonder if there’s not a lightweight methodology like that which could be readily adapted for reliability modeling of apps.  Seems like something someone would have done… But a simple one, not like lame complicated risk matrices. I’ll have to research that.

How We Do Cloud and DevOps: The Motion Picture

September 15, 2011

Our good friend Damon Edwards from dev2ops came by our Austin office and recorded a video of Peco and me explaining how we do what we do! Peco never blogs, so this is a rare opportunity to hear him talk about these topics, and he’s full of great sound bites. :-)

I apologize in advance for how much I say “right.”

Won’t somebody please think of the systems?

September 12, 2011 3 comments

What is the goal of DevOps?  If you ask a lot of people, they say “Continuous integration!  Pushing functionality out faster!”  The first cut at a DevOps Wikipedia article pretty much only considered this goal. Unfortunately, this is a naive view largely popular among developers who don’t fully understand the problems of service management. More rapid delivery and the practice of continuous integration/deployment is cool and it’s part of the DevOps umbrella of concerns, but it is not the largest part.

Let us review the concepts behind IT Service Management. I don’t like ITIL in terms of a prescriptive thing to implement, but as a cognitive framework to understand IT work, it’s great. Anyway, depending on what version you are looking at, there are a lot of parts of delivering a service to end users/customers.

1. Service Strategy (tie-guy stuff)
2. Service Design (including capacity, availability, risk management)
3. Service Transition (release and change management)
4. Service Operation (operations)
5. Continual Service Improvement (metrics)

Let’s concentrate on the middle three.  Service transition (release) is where CI fits in.  And that’s great.  But most of the point of DevOps is the need for ops to be involved in Service Design and for the developers to be involved in Service Operation!

Service Transition

Sure, in the old waterfall mindset, service transition is where “the work moves from dev to ops.” Dev guys do everything before that, ops guys do everything after that, we just need a more graceful handoff.  DevOps is not about trying to file some of the rough bits off the old way of doing things. It’s about improving service quality by more profoundly integrating the teams through the whole pipeline.

Here at NI, continuous integration was our lowest ranked DevOps priority. It’s a nice to have, while improving service design and operation was way, way more important to us. We’re starting work on it now, but consider our DevOps implementation to be successful without it. If you don’t have service design and operation nailed, then focusing on service transition risks “delivering garbage more quickly to users, and having it be unsupportable.”

Service Design

Services will not be designed correctly without embedded operational experience in their design and creation. You can call this “systems engineering” and say it’s not ops… But it’s ops. Look at the career paths if that confuses you. Our #1 priority in our DevOps implementation was to avoid the pain and problems of developing services with only input from functional developers. A working service is, more than ever, a synthesis of systems and applications and needs to be designed as such. We required our systems architect and applications architect to work hand in hand, mutually design the team and tools and products, review changes from both devs and ops…

Service Operation

Services can not be operated correctly if thrown over the wall to an ops team, and they will not be improved at the appropriate rate. Developers need to be on hand to help handle problems and need to be following extremely closely what their application does in the users’ hands to make a better product. This was our #2 priority when implementing DevOps, and self service is a high implementation priority.  Developers should be able to see their logs and production metrics in realtime so we put things like Splunk and Cloudkick in place and made their goal to not be operations facing, but developer facing, tools.

The Bottom Line

DevOps is not about just making the wall you throw stuff over shorter. With apologies to Dev2Ops,

Improvement? I think not!

To me the point of DevOps is to not have a wall – to have both development and operations involved in the service design, the transition, and the operation. Just implementing CI without doing that isn’t DevOps – it’s automating waterfall. Which is fine and all, but you’re missing a lot of the point and are not going to get all the benefits you could.

Categories: DevOps

Analysts on DevOps

September 8, 2011

DevOps is getting enough traction that there are papers coming out on it from the various analyst groups. I thought I’d spur a roundup of this research – it can be very valuable in converting your upper management types into understanding and seeing the value of DevOps.

Chime in below with more to add to the list!  Analyst stuff, not random blogs, please – something that you would put in front of upper management.

Categories: DevOps

Addressing the IT Skeptic’s View on DevOps

September 2, 2011 1 comment

A recent blog post on DevOps by the IT Skeptic entitled DevOps and traditional ITSM – why DevOps won’t change the world anytime soon got the community a’frothing. And sure, the article is a little simmered in anti-agile hate speech (apparently the Agilistas and cloud hypesters and cowboys are behind the whole DevOps thing and are leering at his wife and daughter and dropping his property values to boot) but I believe his critiques are in general very perceptive and that they are areas we, the DevOps movement, should work on.

Go read the article – it’s really long so I won’t sum the whole thing up here.

Here are the most germane critiques and what we need to do about them. He also has some poor and irrelevant or misguided critiques, but why would I waste time on those?  Let’s take action on the good stuff that can make DevOps better!

Lack of a coherent definition

This is a very good point. I went to the first meeting of an Austin DevOps SIG lately and was treated to the usual debate about “the definition of DevOps” and all the varied viewpoints going into that.  We need to arrive at a more structured definition that either includes and organizes or excludes the various memetic threads. It’s been done with Agile, and we can do it too. My imperfect definition of DevOps on this site tries to clarify this by showing there are different levels (principles, methods, and practices) that different thoughts about DevOps slot into.

Worry about cowboys

This is a valid concern, and one I share. Here at NI, back in the day programmers had production passwords, and they got taken away for really good reasons.  “Oh, let’s just give the programmers pagers and the root password” is not a responsible interpretation of DevOps but it’s one I’ve heard bandied about; it’s based on a false belief that as long as you have “really smart” developers they’ll never jack everything up.

Real DevOps shops that are taking up practices that could be risky, like continuous deployment, are doing it with extreme levels of safeguards put into place (automated testing, etc.).  This is similar to the overall problem in agile – some people say “agile? Great!  I’ll code at random,” whereas really you need to have a very high percentage of unit test coverage. And sure, when you confront people with this they say “Oh, sure, you need that” but there is very little constructive discussion or tooling around it. How exactly do I build a good systems + app code integration/smoke test rig? “Uh you could write a bunch of code hooked to Hudson…” To do DevOps responsibly, this should be one of the most discussed and best understood parts of the chain, not one of the least.

We’re writing our own framework for this right now – James is doing it in Ruby, it’s called Sparta, and devs (and system folks) provide test chunks that the framework runs and times in an automated fashion. It’s not a well solved problem (and the big-dollar products that claim to do test automation are nightmares and not really automated in the “devs easily contribute tests to integrate into a continuous deploy” sense).
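To make the “test chunks that the framework runs and times” idea concrete, here is a minimal sketch of that general pattern. To be clear, this is not Sparta (which is written in Ruby and isn’t shown here); it’s Python, and every name in it (smoke_check, run_all, the two sample checks and their stand-in data) is made up for illustration.

```python
import time
from dataclasses import dataclass
from typing import Callable, List, Optional


@dataclass
class Result:
    name: str
    passed: bool
    seconds: float
    error: Optional[str] = None


# Registry of contributed test chunks (hypothetical; not Sparta's design).
_CHECKS: List[Callable[[], None]] = []


def smoke_check(func: Callable[[], None]) -> Callable[[], None]:
    """Register a function as a test chunk. It passes unless it raises."""
    _CHECKS.append(func)
    return func


def run_all(checks: Optional[List[Callable[[], None]]] = None) -> List[Result]:
    """Run each registered chunk, timing it and recording pass/fail."""
    results = []
    for check in (_CHECKS if checks is None else checks):
        start = time.perf_counter()
        try:
            check()
            passed, error = True, None
        except Exception as exc:  # a failing chunk must not stop the run
            passed, error = False, f"{type(exc).__name__}: {exc}"
        elapsed = time.perf_counter() - start
        results.append(Result(check.__name__, passed, elapsed, error))
    return results


@smoke_check
def config_has_required_keys():
    config = {"db_host": "localhost", "port": 8080}  # stand-in for real config
    assert {"db_host", "port"} <= set(config)


@smoke_check
def homepage_mentions_login():
    page = "<html><body>Welcome</body></html>"  # stand-in for a fetched page
    assert "Login" in page, "login link missing"


if __name__ == "__main__":
    for r in run_all():
        status = "PASS" if r.passed else "FAIL"
        detail = f" ({r.error})" if r.error else ""
        print(f"{status} {r.name} {r.seconds * 1000:.2f} ms{detail}")
```

The point of the decorator is that contributing a test is just writing a function; the runner owns the timing and reporting, so a deploy pipeline could gate on the list of results.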

Team size

Working at a large corporation, I also share his concern about people’s cunning DevOps schemes that don’t scale past a 12 person company.  “We’ll just hire 7 of the best and brightest and they’ll do everything, and be all crossfunctional, and write code and test and do systems and ops and write UIs and everything!” is only a legit plan for about 10 little hot VC funded Web 2.0 companies out there.  The rest of us have to scale, and doing things right means some specialization and risks siloization.

For example, performance testing.  When we had all our developers do their own performance testing, the limit of the sophistication of those tests was “I’ll run 1000 hits against it and time how long it takes to finish.  There, 4 seconds.  Done, that’s my performance testing!”  The only people who think Ops, QA, etc. are such minor skill sets that someone can just do them all are people who are frankly ignorant of those fields. Oh, P.S. The NoOps guys fall into this category, please don’t link them to DevOps.
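To show what the “1000 hits, 4 seconds” number hides, here is a small sketch that summarizes per-request latencies instead of only the total. The latencies are synthetic and chosen purely for illustration (they are not measurements from our systems), and the summarize helper is a made-up name; it uses only Python’s standard statistics module.

```python
import statistics


def summarize(latencies_ms):
    """Summarize per-request latencies: total time plus median, p95, p99, max."""
    # quantiles(n=100) returns the 99 percentile cut points; index 94 is p95.
    cuts = statistics.quantiles(latencies_ms, n=100, method="inclusive")
    return {
        "total_s": sum(latencies_ms) / 1000.0,
        "p50_ms": statistics.median(latencies_ms),
        "p95_ms": cuts[94],
        "p99_ms": cuts[98],
        "max_ms": max(latencies_ms),
    }


# Synthetic run of 1000 requests: 980 fast ones and 20 slow ones.
latencies = [2.0] * 980 + [102.0] * 20

if __name__ == "__main__":
    for key, value in summarize(latencies).items():
        print(f"{key}: {value}")
```

With these made-up numbers the whole run still takes 4 seconds and the median is 2 ms, but the 99th percentile is 102 ms, which is the part the slowest 2% of users actually feel. A real test would also need concurrency, warm-up, and realistic request mixes, which is exactly where the specialist skills come in.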

We have struggled with this.  We’ve had to work out what testing our devs do versus how closely we align with external test teams.  Same with security, performance, etc.  The answer is not to completely generalize or completely silo – Yahoo! had a great model with their performance team, where there is a central team of super-experts but there are also embedded folks on each product team.

Hiring people

Very related to the previous point – again unless you’re one of the 10 hottest Web 2.0 plays and you can really get the best of the best, you need to staff your organization with random folks who graduated from UT with a B average. You have to have and manage tiers as well as silos – some folks are only ready to be “level 1 support” and aren’t going to be reading some dev’s Java code.

Traditional organizations and those following ITIL very closely can definitely create structures that promote bad silos and bad tiering. But just assuming everyone will be of the same (high) skill level and be able to know everything is a fallacy that is easy to fall into, since it’s those sorts of elite individuals who are the leading uptakers of DevOps.  Maybe Gene Kim’s book he’s working on (“Visible DevOps” or similar) will help with that.

Tools fixation

Definitely an issue.  An enhanced focus on automation is valuable.  Too many ops shops still just do the same crap by hand day after day, and should be challenged to automate and use tools.  But a lot of the DevOps discussions do become “cool tool litanies” and that’s cart before the horse.  In my terminology, you don’t want to drive the principles out of the practices and methods – tooling is great but it should serve the goals.

We had that problem on our team. I had to talk to our Ops team and say “Hey, why are we doing all these tool implementations?  What overall goal are they serving?”  Tools for the sake of tools are worse than pointless.

Process

It is true that with agile and with DevOps some folks are using it as an excuse to toss out process.  It should simply be a different kind of process! And you need to take into account all the stuff that should be in there.

A great example is Michael Howard et al. at Microsoft with their Security Development Lifecycle.  The first version of it was waterfall.  But now they’ve revamped it to have an agile security development lifecycle, so you know when to do your threat modeling etc.

Build instead of buy

Well, there are definitely some open source zealots involved with most movements that have any sysadmins involved. We would like to buy instead of build, but the existing tools tend to either not solve today’s problems or have poor ROI.

In IT, we implemented some “ITIL compliant” HP tools for problem tracking, service desk, and software deployment. They suck, and are very rigid, and cost a lot of money, and required as much if not more implementation time than writing something from scratch that actually addressed our specific requirements. And in general that’s been everyone’s experience. The Ops world has learned to fear the HP/IBM/CA/etc systems management suites because it’s just one of those niches that is expensive and bad (like medical or legal software).

But having said that, we buy when we can! Splunk gave us a lot more than cobbling together our own open source thing.  Cloudkick did too. Sure, we tend to buy SaaS a lot more than on prem software now because of the velocity that gives us, but I agree that you need to analyze the hidden costs of building as part of a build/buy – you just need to also see the hidden costs and compromised benefits of a buy.

Risk Control

This simply goes back to the cowboy concern. It’s clearly shown that if you structure your process correctly, with the right testing and signoff gates, then agile/devops/rapid deploys are less risky.

We came to this conclusion independently as well.  In IT, we ran (still do) these Web go lives once a month.  Our Web site consists of 200+ applications  and we have 70 or so programmers, 7 Web ops, a whole Infrastructure department, a host of third party stuff (Oracle and many more)… Every release plan was 100 lines long and the process of planning them and executing on them was horrific. The system gets complex enough, both technically and organizationally, that rollbacks + dependencies + whatnot simply turn into paralysis, and you have to roll stuff out to make money.  When the IT apps director suggested “This is too painful – we should just do these quarterly instead, and tell the business they get to wait 2 more months to make their money,” the light went on in my mind. Slower and more rigorous is actually worse.  It’s not more efficient to put all the product you’re shipping for the month onto a huge ass warehouse on the back of a giant truck and drive it around doing deliveries, either; this should be obvious in retrospect. Distribution is a form of risk management. “All the eggs in one big basket that we’ll do all at one time” is the antithesis of that.
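Here’s a back-of-the-envelope sketch of the “all the eggs in one big basket” point. It assumes, purely for illustration, that each change independently has the same small chance of being bad; the 2% figure and the batch size of 30 are made-up numbers, not NI data, and the function name is hypothetical.

```python
def chance_release_has_bad_change(changes: int, p_bad: float) -> float:
    """Probability that at least one of `changes` independent changes is bad."""
    return 1.0 - (1.0 - p_bad) ** changes


if __name__ == "__main__":
    p_bad = 0.02  # assumed per-change failure chance (illustrative only)
    for batch in (1, 5, 30):
        risk = chance_release_has_bad_change(batch, p_bad)
        # When a release breaks, every change in it is a suspect.
        print(f"{batch:>2} changes per release: "
              f"{risk:.0%} chance the release is broken, "
              f"{batch} suspects to dig through if it is")
```

Under those assumptions the expected number of bad changes per month is the same either way. What the big batch changes is that almost half of the big releases contain a problem, and when one does you have the whole batch to search through and roll back, dependencies and all.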

The Future

We started DevOps here at NI from the operations guys.  We’d been struggling for years to get the programmers to take production responsibility for their apps. We had struggled to get them access to their own logs, do their own deploys (to dev and test), let business users input Apache redirects into a Web UI rather than have us do it… We developed a whole process, the Systems Development Framework, that we used to engage with dev teams and make sure all the performance, reliability, security, manageability, provisioning, etc. stuff was getting taken care of… But it just wasn’t as successful as we felt like it could be.  Once we saw that a more integrated model was possible, we realized success was actually an option. Ask most sysadmin shops if they think success is actually a possible outcome of their work, and you’ll get a lot of hedging “well success is not getting ruined today” kinds of responses.

By combining ops and devs onto one team, by embedding ops expertise onto other dev teams, by moving to using the same tools and tracking systems between devs and ops, and striving for profound automation and self service, we’ve achieved a super high level of throughput within a large organization. We have challenges (mostly when management decides to totally change track on a product, sigh) but from having done it both ways – OMG it’s a lot better. Everything has challenges and risks and there definitely needs to be some “big boy” compatible thinking on DevOps – but it’s like anything else, those who adopt early will reap the rewards and get competitive advantage on the others. And that’s why we’re all in. We can wait till it’s all worked out and drool-proof, but that’s a better fit for companies that don’t actually have to produce/achieve any more (government orgs, people with more money than God like oil and insurance…).

Categories: DevOps