Category Archives: DevOps

Pertaining to agile system administration concepts and techniques.

Austin Cloud Camp Wrap-up

Austin recently had a CloudCamp and my guess is that it drew in close to 100 attendees.

Before I get into the actual event, let me start this post with a brief story.

During the networking time, I committed one of the worst networking faux pas that one can make when networking: I tried a lame joke upon meeting someone new. One of the other attendees asked me why my company was interested in CloudCamp. I sarcastically replied to his inquisition by explaining that we were really excited about CloudCamp because we do a lot of work with weather instrumentation. Anything to do with clouds, we are so there… Silence.

Blink.

Another blink…. Fail.

At this point I explain that I am an idiot and making sarcastic jokes that fail all the time and I duck out to a different conversation. So, forgetting about my awkward sense of humor, lets move on. Learn from me, don’t make weather jokes at a CloudCamp.

Notes from CloudCamp Austin

At any event, one of the best things that can happen is meeting people in your field. I was able to meet some cool guys in Austin with ServiceMesh and Pervasive. There are also beginning plans to start an AWS User Group in Austin which will be really awesome. Ping me if you want the scoop and I will let you know as I find anything out about it.

The talk I attended was led by the agile admin’s very own: Ernest Mueller. The notes from it are below.

Systems Management in the Cloud

One of the discussion points was how people were implementing dynamic scaling and what infrastructure they are wrapping around that.

Tools people are using in the cloud to achieve dynamic scaling in Amazon Web Services (AWS):
OSSEC for change control and security
Ganglia for reporting
Collectd for monitoring
– Cron tasks for other reporting and metric gathering
Pentaho and Jasper for metrics
– RESTful interface for the managed services layer. Reporting also gets done via RESTful service.
Quartz scheduler to do scaling with metrics around what collectd is monitoring.

When monitoring, we have to start by understanding the perspective of the customers and then try to wrap monitors around that. Are we focused on user or provider? Infrastructure monitoring or application monitoring? The creator of the application that is deployed to the cloud and the environment can provide hooks for the monitoring platform. Which means that developers need to be looking on the horizon of ops early in the development phase.

This is a summary of what I saw at CloudCamp Austin, but I would love to hear what other sessions people went to and what the big takeaways were for them.

Leave a comment

Filed under Cloud, DevOps

Why A HTTP Sniffer Is Awesome

While looking at Petit for my post on log management tools, I was thrilled to see it link to a sniffer that generates Web type logs called Justniffer.  Why, you might ask, isn’t that a pretty fringe thing?  Well settle in while I tell you why it’s bad ass.

We used to run a Web analytics product here called NetGenesis.  Like all very old Web analytics products, it relied on you to gather together all your log files for it to parse, resulting in error prone nightly cronjob kinds of nonsense.  So they came out with a network sniffer that logged into Apache format, like this does apparently.  It worked great and got the info in realtime (as long as the network admins didn’t mess up our network taps, which did happen from time to time).

I quickly realized this sniffer was way better than log aggregation, especially because my environment had all kinds of weird crap like Domino Web servers and IIS5 that don’t log in a civilized manner.  And since it sat between the Web servers and the client, it could log “client time,” “server time”, and had a special “900” error code for client aborts/timeouts.  I self-implemented what would be a predecessor to todays’ RUM tools like Tealeaf and Coradiant on it.  We used it to do realtime traffic analysis, cross-site reporting, and even used it for load testing as we’d transform and replay the captured logs against test servers. Using it also helped us understand the value of the Steve Souders front end performance stuff when he came around.

Eventually our BI folks moved to a Javascript page tag based system, which are the modern preference in Web analytics systems.  Besides the fact that these schemes only get pages that can execute JS and not all the images and other assets, we discovered that they were reasonably flawed and were losing about 10% of the traffic that we were seeing in the network sniffer log.  After a long and painful couple months, we determined that the lost traffic was from no known source and happened with other page tag based systems (Google Analytics, etc.), not just this supplier’s tool, and the BI folks finally just said “Well…  It gives us pretty clickstreams and stuff, let’s go ahead with it.”  Sadly that sunset our use of the Netgenesis network sniffer and there wasn’t another like it in the open source realm (I looked).  Eventually we bought a Coradiant to do RUM (the sales rep kept trying to explain this “new network RUM concept” to us and kept being taken aback and how advanced the questions were we asked) but I missed the accessibility of my sniffer log…  Big log aggregators like Splunk help fill that gap somewhat but sometimes you really want to grep|cut|sort|uniq the raw stuff.

On the related topic of log replayers, we have really wanted one for a long time.  No one has anything decent.  We’ve bugged every supplier that we deal with on any related product, from RUM to load testing to whatever.  Recording a specific transaction and using that is fine, but nothing compares to the demented diversity of real Internet traffic.  We wrote a custom replayer for our sniffer log, although it didn’t do POST (didn’t capture payloads – looks like justniffer can though!) and got a lot of mileage out of it.  Found al ot of app bugs before going to production with that baby.  Anyway, none of the suppliers can figure it out (Oracle just put together a DB traffic version of this in their new version 12 though).  Now that there’s a sniffer we can use, we already have a decent replayer, we’re back in business!  So I’m excited, it’s a blast from the past but also one of those core little things that you can’t believe there isn’t one of, and that empowers someone to do a whole lot of cool stuff.

Leave a comment

Filed under DevOps

Good DevOps Discussions

An interesting point and great discussion on “what is DevOps”, including a critique about it not including other traditional Infrastructure roles well, on Rational Survivability (heh, we’re using the same blog theme.  I feel like a girl in the same dress as another at a party.).  It seems to me that some of the complaints about DevOps – only a little here, but a lot more from Andi Mann, Ubergeek – seem to think DevOps is some kind of developer power play to take over operations.  At least from my point of view (an ops guy driving a devops implementation in a large organization) that is absolutely not the case.  Seems to me to be a case of over-touchiness based on the explicit and implicit critique of existing Infrastructure processes that DevOps represents.  Which is natural; agile development had/has the exact same challenge.

Note that DevOps is starting to get more press; here’s a cnet article talking about DevOps and the cloud (two great tastes that taste great together…).

And here’s a bonus slideshare presentation on “From Agile Development to Agile Operations” that is really good.

2 Comments

Filed under DevOps

Log Management Tools

We’re researching all kinds of tools as we set up our new cloud environment, I figure I may as well share for the benefit of the public…

Most recently, we’re looking at log management.  That is, a tool to aggregate and analyze your log files from across your infrastructure.  We love Splunk and it’s been our tool of choice in the past, but it has two major drawbacks.  One, it’s  quite expensive. In our new environment where we’re using a lot of open source and other new-format vendors, Splunk is a comparatively big line item for a comparatively small part of an overall systems management portfolio.

Two, which is somewhat related, it’s licensed by amount of logs it processes per day.  Which is a problem because when something goes wrong in our systems, it tends to cause logging levels to spike up.  In our old environment, we keep having to play this game where an app will get rolled to production with debug on (accidentally or deliberately) or just be logging too much or be having a problem causing it to log too much, and then we have to blacklist it in Splunk so it doesn’t run us over our license and cause the whole damn installation to shut off.  It took an annoying amount of micromanagement for this reason.

Other than that, Splunk is the gold standard; it pulls anything in, graphs it, has Google-like search, dashboards, reports, alerts, and even crazier capabilities.

Now on the “low end” there are really simple log watchers like swatch or logwatch.  But we’d really like something that will aggregate ALL our logs (not just syslog stuff using syslog-ng – app server logs, application logs, etc.), ideally from both UNIX and Windows systems, and make them usefully searchable.  Trying to make everything and everyone log using syslog is an ever receding goal.  It’s a fool’s paradise.

There’s the big appliance vendors on the “high end” like LogLogic and LogRhythm, but we looked at them when we looked at Splunk and they are not only expensive but also seem to be “write only solutions” – they aggregate your logs to meet compliance requirements, and do some limited pattern matching, but they don’t put your logs to work to help you in your actual work of application administration the dozen ways Splunk does.  At best they are “SIEM”s – security information and event managers – that alert on naughty intruders.  But with Splunk I can do everything from generate a report of 404s to send to our designers to fix their bad links/missing images to graph site traffic to make dashboards for specific applications for their developers to review.  Plus, as we’re doing this in the cloud, appliances need not apply.  (Ooo, that’s a catchy phrase, I’ll have to use that for a separate post!)

I came across three other tools that seem promising:

  • Logscape from Liquidlabs – does graphing and dashboards like Splunk does.  And “live tail” – Splunk mysteriously took this out when they revved from version 3 to 4!  Internet rumor is that it’s a lot cheaper.  Seems like a smaller, less expensive Splunk, which is a nice thing to be, all considered.
  • Octopussy – open source and Perl based (might work on Windows but I wouldn’t put money on it).  Does alerting and reporting.  Much more basic, but you can’t beat the price.  Don’t think it’ll meet our needs though.
  • Xpolog – seems nice and kinda like Splunk.  Most of the info I can find on it, though, are “What about xpolog, is good!” comments appended to every forum thread/blog post about Splunk I can find, which is usually a warning sign – that kind of guerrilla marketing gets old quick IMO.  One article mentions looking into it and finding it more expensive but with some nice features like autodiscovery, but not as open as Splunk.

Anyone have anything to add?  Used any of these?  We’ve gotten kind of addicted to having our logs be immediately accessible, converted into metrics, etc.  I probably wouldn’t even begrudge Splunk the money if it weren’t for all the micromanagement you have to put into running it.  It’s like telling the fire department “you’re licensed for a maximum of three fires at a time” – it verges on irresponsible.

21 Comments

Filed under DevOps

Automated Testing Basics

Everyone thinks they know how to test… Most people don’t.  And most people think they know what automated testing is, it’s just that they never have time for it.

Into this gap steps an interesting white paper, The Automated Testing Handbook, from Software Test Professionals, an online SW test community.  You have to register for a free community account to download it, but I did and it’s A great 100-page conceptual introduction to automated testing. We’re trying to figure out testing right now – it’s for our first SaaS product so that’s a challenge,  and also because of the devops angle we’re trying to figure out “what does ‘unit test’ mean regarding an OS build or a Tomcat server before it gets apps deployed?'”  So I was interested in reading a basic theory paper to give me some grounding; I like doing that when engaging in a new field rather than just hopping from specific to specific.

Some takeaways from the paper:

  • Automated testing isn’t just capture/replay, that’s unmaintainable.
  • Automated testing isn’t writing a program to test a program, that’s unscalable.
  • What you want is a framework to automate the process around the testing.  Some specific tests can/should be automated and others can’t/shouldn’t.  But a framework helps you with both, and lets you make more reusable tests.

Your tests should be version controlled, change controlled, etc. just like software (testware?).

Great quote from p.47:

A thorough test exercises more than just the application itself: it ultimately tests the entire environment, including all of the supporting hardware and surrounding software.

We are having issues with that right now and our larval SaaS implementation – our traditional software R&D folks historically can just do their own narrow scope tests, and say “we’re done.”  Now that we’re running this as a big system in the cloud, we need to run integration tests including the real target environment, and we’re trying to convince them that’s something they need to spend effort in.

They go on to mention major test automation approaches, including capture/replay, data-driven, and table-driven – their pros and cons and even implementation tips.

  1. Capture/replay is basically recording your manual test so you can do it again a lot.  Which makes it helpful for mature and stable apps and also for load testing.
  2. Data driven just means capture/replay with varied inputs.  It’s a little more complicated but much better than capture/replay, where you either don’t adequately test various input scenarios or you have a billion recorded scripts.
  3. Table driven is where the actions themselves are defined as data as well – one step short of hardcoding a test app.  But (assuming you’re using a tool that can work on multiple apps) portable across apps…

I guess writing pure code to test is an unspoken #4, but unspoken because she (the author) doesn’t think that’s a real good option.

There’s a bunch of other stuff in there too, it’s a great introduction to the hows and whys and gotchas of test automation.  Give it a read!

Leave a comment

Filed under DevOps

Velocity and DevOpsDays!

A double threat is coming your way.  Velocity 2010, the Web performance and operations conference, is June 22-24 in Santa Clara, CA.  As one of the very few conventions targeted at our discipline, we’ve been attending since the first one in 2008.  And this time, there’s dessert – the day after it ends, a new DevOps unconference, DevOpsDays 2010, will be held nearby in Mountain View!

OpsCamp Austin kicked ass, and I’m sure this will be even better.  So come double up on Ops knowledge and meet other right-thinking individuals.

If you want to read all my musings from the previous Velocity conferences, you can do that too!

Leave a comment

Filed under DevOps

Busting the Myths of Agile Development: What People are Really Doing

I just watched a good Webcast from an IBM agile expert about the state of Agile in the industry, and it had some interesting bits that touch upon agile operations.

Webcast – (registration required, sadly)

It’s by Scott Ambler, IBM’s practice leader for agile development.  They surveyed programmers using Dr. Dobbs’ “IT State of the Union” survey, which has wide reach across the world and types of programmers; the data in this presentation comes from that and other surveys.  All their surveys and results and even detailed response data (they’ve done a lot over time) are online for your perusal.

In the webcast, he talks some about “core” agile development extending to “disciplined” agile development that extends to address the full system lifecycle, is both risk and value driven, and has governance and standards.  “Core” begs the question of “where do requirements and architecture come from?” and “how do I get into production?”   Some agile folks who consider themselves purists say these aren’t needed and you should just start coding, he calls this view “phenomenally naive.”

He does mention some times where things like enterprise architecture were slow and introduced huge delays into the process because it took months to do reviews and signoffs.  He calls this “dysfunctional” but really isn’t it the way things are unless there’s a pattern to change it? I think project management and enterprise architecture are suffering from the same problem we operations folks are, which is that we’re just now figuring out “what does it mean to incorporate agile concepts into what we do?”

The meat of the preso is over agile team metrics and busting myths about “it’s only for small colocated teams…”  Here’s my summary notes.

  • Agile teams have a success rate higher than traditional teams.  Agile and iterative about tie and come in with much higher (2-4x) result on quality, functionality, cost efficiency, and timeliness than traditional or ad hoc processes.
  • Agile is for not just coding, but project selection/initiation, transition, ops, and maintenance.  More and more folks are doing that, but some get stuck with doing agile only in the coding and not the surrounding part of the lifecycle.
  • Agile is being used by co-located teams in only 45% of the time; all the rest are distributed in some manner.  But that does affect success some  – “far located” teams have 20% lower success than fully co-located.  Don’t distribute if you don’t have to.
  • Most orgs are using agile with small teams, the vast majority are size 10 or less.  But they see success with even the very large teams.
  • We’re seeing people successful with agile under complaince frameworks liek Sarbox, and governance frameworks like ISO – even ITIL is mentioned.
  • Agile isn’t just for simple projects – the real mix in the wild is actually weighted at medium to very complex projects.
  • Though agile is great for greenfield projects, there’s very large percentages of teams using it on legacy environments.  COTS development is the rarest.
  • 32% of successful agile teams are working with enterprise architecture and operations teams.  Should be more, but that’s a significant inroad.  He says those teams are also most successful when behaving agile (or at least lean).
  • Biggest problems with agile adoption are a waterfall culture (especially one where the overall governance everyone has to plug into is tuned to waterfall) and stakeholder involvement.  Testers say “We need a detailed spec before we can start testing…”  DBAs say “Developers can’t code until we have a complete data model…”  Management resistance is actually the lowest obstacle (14% of respondents)!

A lot of nice stats.  The two biggest takeaways are “agile isn’t just for certain kinds of projects, it’s being used for more than that and is successful in many different areas” and “agile is for the entire lifecycle not just coding.”  As advocates of agile systems, I think that’s a good sign that the larger agile community is wandering our way as we’re building up our conception of DevOps and wandering their way ourselves!

Leave a comment

Filed under DevOps

Our First DevOps Implementation

Although we’re currently engaged in a more radical agile infrastructure implementation, I thought I’d share our previous evolutionary DevOps implementation here (way before the term was coined, but in retrospect I think it hits a lot of the same notes) and what we learned along the way.

Here at NI we did what I’ll call a larval DevOps implementation starting about seven years ago when I came and took over our Web Systems team, essentially an applications administration/operations team for our Web site and other Web-related technologies.  There was zero automation and the model was very much “some developers show up with code and we had to put it in production and somehow deal with the many crashes per day resulting from it.”  We would get 100-200 on-call pages a week from things going wrong in production.  We had especially entertaining weeks where Belgian hackers would replace pages on our site with French translations of the Hacker’s Manifesto.  You know, standard Wild West stuff.  You’ve been there.

Step One: Partner With The Business

First thing I did (remember this is 2002), I partnered with the business leaders to get a “seat at the table” along with the development managers.  It turned out that our director of Web marketing was very open to the message of performance, availability, and security and gave us a lot of support.

This is an area where I think we’re still ahead of even a lot of the DevOps message.  Agile development carries a huge tenet about developers partnering side-by-side with “the business” (end users, domain experts, and whatnot).  DevOps is now talking about Ops partnering with developers, but in reality that’s a stab at the overall more successful model of “biz, dev, and ops all working together at once.” Continue reading

Leave a comment

Filed under DevOps

dev2ops Interview

Want to hear me spout off more about DevOps?  Well, here’s your chance; I did an interview with Damon Edwards of DTO and they’ve posted it on the dev2ops blog!

Killer quote:

“I say this as somebody who about 15 years ago chose system administration over development.  But system administration and system administrators have allowed themselves to lag in maturity behind what the state of the art is. These new technologies are finally causing us to be held to account to modernize the way we do things.  And I think that’s a welcome and healthy challenge.”

Leave a comment

Filed under DevOps

Before DevOps, Don’t You Need OpsOps?

From the “sad but true” files comes an extremely insightful point apparently discussed over beer by the UK devops crew recently – that we are talking about dev and ops collaboration but the current state of collaboration among ops teams is pretty crappy.

This resonates deeply with me.  I’ve seen that problem in spades.  I think in general that a lot of the discussion about the agile ops space is too simplistic in that it seems tuned to organizations of “five guys, three of whom are coders and two of whom are operations” and there’s no differentiation.  In real life, there’s often larger orgs and a lot of differentiation that causes various collaboration challenges.  Some people refer to this as Web vs Enterprise, but I don’t think that’s strictly true; once your Web shop grows from 5 guys to 200 it runs afoul of this too – it’s a simple scalability and organizational engineering problem.

As an aside, I don’t even like the “Ops” term – a sysadmin team can split into subgroups that do systems engineering, release management, and operational support…  Just saying “Ops” seems to me to create implications of not being a partner in the initial design and development of the overall system/app/service/site/whatever you want to call it.

Ops Verticals

Here, we have a large Infrastructure department.  Originally, it was completely siloed by technology verticals, and there’s a lot of subgroups.  Network, UNIX, Windows, DBA, Lotus Notes, Telecom, Storage, Data Center…  Some ten plus years ago when the company launched their Web site in earnest, they quickly realized that wasn’t going to work out.  You had the buck-passing behavior described in the blog posts above that made issues impossible to solve in a timely fashion, plus it made collaboration with devs/business nearly impossible.  Not only did you need like 8 admins to come involve themselves in your project, but they did not speak similar enough languages – you’d have some crusty UNIX admin yelling “WHAT ABOUT THE INODES” until the business analyst started to cry.

Dev Silos

But are our developers here better off?  They are siloed by business unit.  Just among the Web developers there’s the eCommerce developers, eCRM, Product Advisors, Community, Support, Content Management…  On the one hand, they are able to be very agile in creating solutions inside their specific niche.  On the other hand, they are all working within the same system environment, and they don’t always stay on the same page in terms of what technologies they are using. “Well, I’m sure THAT team bought a lovely million dollar CMS, but we’re going to buy our own different million dollar CMS.   No, you don’t get more admin resource.”  Over time, they tried to produce architecture groups and other cross-team initiatives to try to rein in the craziness, with mixed but overall positive results.

Plugging the Dike

What we did was create a Web Administration group (Web Ops, whatever you want to call it) that was holistically responsible for Web site uptime, performance, and security.  Running that team was my previous gig, did it for five years.  That group was more horizontally focused and would serve as an interface to the various technology verticals; it worked closely with developers in system design during development, coordinated the release process, and involved devs in troubleshooting during the production phase.

BizOps?

In fact, we didn’t just partner with the developers – we partnered with the business owners of our Web site too, instead of tolerating the old model of “Business collaborates with the developers, who then come and tell ops what to do.”  This was a remarkably easy sell really.  The company lost money every minute the Web site was down, and it was clear that the dev silos weren’t going to be able to fix that any more than the ops silos were.  So we quickly got a seat at the same table.

Results

This was a huge success.  To this day, our director of Web Marketing is one of the biggest advocates of the Web operations team.  Since then, other application administration (our word for this cross-disciplinary ops) teams have formed along the same model.  The DevOps collaboration has been good overall – with certain stresses coming from the Web Ops team’s role as gatekeeper and process enforcement.  Ironically, the biggest issues and worst relationships were within Infrastructure between the ops teams!

OpsOps – The Fly In The Ointment

The ops team silos haven’t gone down quietly.  To this day the head DBA still says “I don’t see a good reason for you guys [WebOps] to exist.”  I think there’s a common “a thing is just the sum of its parts” mindset among admins for whatever reason.  There are also turf wars arising from the technology silo division and the blurring of technology lines by modern tech.  I tried again and again to pitch “collaborative system administration.”  But the default sysadmin behavior is to say “these systems are mine and I have root on them.  Those are your systems and you have root on them.  Stay on your side of the line and I’ll stay on mine.”

Fun specific Catch-22 situations we found ourselves in:

  • Buying a monitoring tool that correlates events across all the different tiers to help root-cause production problems – but the DBAs refusing to allow it on “their” databases.
  • Buying a hardware load balancer – we were going to manage it, not the network team, and it wasn’t a UNIX or Windows server, so we couldn’t get anyone to rack and jack it (and of course we weren’t allowed to because “Why would a webops person need server room access, that’s what the other teams are for”).

Some of the problem is just attitude, pure and simple.  We had problems even with collaboration inside the various ops teams!  We’d work with one DBA to design a system and then later need to get support from another DBA, who would gripe that “no one told/consulted them!”  Part of the value of the agile principles that “DevOps” tries to distill is just a generic “get it into your damn head you need to be communicating and working together and that needs to be your default mode of operation.” I think it’s great to harp on that message because it’s little understood among ops.  For every dev group that deliberately ostracizes their ops team, there’s two ops teams who don’t think they need to talk to the devs – in the end, it’s mostly our fault.

Part of the problem is organizational.  I also believe (and ITIL, I think, agrees with me) that the technology-silo model has outlived its usefulness.  I’d like to see admin teams organized by service area with integral DBAs, OS admins, etc.  But people are scared of this for a couple reasons.  One is that those admins might do things differently from area to area (the same problem we have with our devs) – this could be mitigated by “same tech” cross-org standards/discussions.  The other is that this model is not the cheapest.  You can squeeze every last penny out if you only have 4 Windows admins and they’re shared by 8 functional areas.  Of course, you are cutting off your nose to spite your face because you lose lots more in abandoned agility, but frankly corporate finance rules (minimize G&A spending) are a powerful driver here.

If nothing else, there’s not “one right organization” – I’d be tempted to reorg everyone from verticals into horizontals, let that run for 5 years, and then reorg back the other way, just to keep the stratification from setting in.

Specialist vs Generalist

One other issue.  The Web Ops team we created required us to hire generalists – but generalists that knew their stuff in a lot of different areas.  It became very hard to hire for that position and training took months before someone was at all effective.  Being a generalist doesn’t scale well.  Specialization is inevitable and, indeed, desirable (as I think pretty much anything in the history of anything demonstrates).  You can mitigate that with some cross-training and having people be generalists in some areas, but in the end, once you get past that “three devs, two ops, that’s the company” model, specialization is needed.

That’s why I think one of the common definitions of DevOps – all ops folks learning to be developers or vice versa – is fundamentally flawed.  It’s not sustainable.  You either need to hire all expensive superstars that can be good at both, or you hire people that suck at both.

What you do is have people with varying mixes.  In my current team we have a continuum of pure ops people, ops folks doing light dev, devs doing light ops, and pure devs.  It’s good to have some folks who are generalizing and some who are specializing.  It’s not specializing that is bad, it’s specialists who don’t collaborate that are bad.

Conclusion

So I’ve shared a lot of experiences and opinions above but I’m not sure I have a brilliant solution to the problem.  I do think we need to recognize that Ops/Ops collaboration is an issue that arises with scale and one potentially even harder to overcome than Dev/Ops collaboration.  I do think stressing collaboration as a value and trying to break down organizational silos may help.  I’d be happy to hear other folks’ experiences and thoughts!

6 Comments

Filed under DevOps