Category Archives: DevOps

Pertaining to agile system administration concepts and techniques.

by karthequian | November 19, 2013 · 11:54 am

The AWS ReInvent Conference Recap

Last week I attended AWS ReInvent in Las Vegas. It was the largest conference I’ve been to with 9000 people, and a crazy number of sessions. When I was trying to decide what sessions to go to, I realized I had multiple conflicts at every slot (a good problem to have). It was also one of the funnest conferences I’ve been to, and I’ll be back next year (a bit more prepared next time around).

I’ll post about the sessions I went to, but the following are my favorite highlights from the conference:

Day 2 keynote with Werner Vogels: After a more marketing and “C” centric keynote on day 1, the day 2 keynote was tuned more to the large developer crowd in the audience, and I left inspired. Check out all my notes here.
Expo Hall: Holy cow! I got tired after walking around just 1/2 this hall. According to the booklet, there were maybe over 170 sponsors, and it took a while to walk through and check out what everyone was doing. The expo hall was also packed the first couple of days, so I went on the 3rd day when things were a lot quieter (pro tip: If you want the best swag, go the first day), but it also gave me a chance to talk to more of the folks in a leisurely manner! My 2 favorite highlights about the expo hall were Datadog (Most enthusiastic even on day 3), and Cloudability (who knew that I was a customer of theirs even though I didn’t realize another team at Mentor used the product; I thought that was pretty awesome!).
Crazy number of sessions: I’m glad these are all on YouTube now. I hear the slides are also going to be online in a bit. This will give me a way to catch up on the sessions that I missed out on.
AWS Hands on labs: This was pretty cool! You could skip a session or two and do a hands on lab on an AWS technology. I spent some time doing a hands on learning AWS beanstalk, and it was totally worthwhile.
Day 1 (Tuesday): I only got in on Tuesday, but next time I’ll need to register in time to be a part of the hackathon or gameday. I talked to a bunch of folks who attended these, and they ended up having a great time at both of these. The Gameday was pretty cool as well and was targeted at DevOps folks. You ended up forming a team with a bunch of other folks and had to build an application infrastructure that was resilient to any kind of breakage. Then, you’d swap your credentials with another team, and they would try to break your infrastructure; you can imagine how this would end up being entertaining!
Meeting up with folks, and catching up with people I hadn’t seen in a while.
VEGAS! It was good to not lose at the roulette tables this time around 🙂

A lot of developer friends commented that the talks were light on technical side of things, which I thought was true; the way I got more out of these was actually talking to the product managers and the customer at the end of the talk to ask and understand some of the more technical concepts. This is true for most conferences, but was especially true for this one.

Stay tuned for a bunch of post conference session updates!

Leave a comment

Filed under DevOps

Tagged as 2013, aws, Conferences, reinvent

by Ernest Mueller | October 28, 2013 · 10:54 am

LASCON Interview: Nick Galbreath

Nick Galbreath (@ngalbreath) is VP of Engineering with client9, LLC.

What are you doing nowadays since leaving Etsy?

I am managing a small DevOps team for a company whose engineering team is based in Moscow, from Tokyo, Japan. Some other executives and our biggest customer is from there. And, I love Japan!

I know you from Velocity and the other DevOps conferences. Why are you here at a security conference?

I’ve been active at Black Hat, DEFCON, etc. as well as DevOps conferences. I’ve found that if your company is in operational chaos you don’t need security. Once you have a good operational component and it’s not in chaos – standardized infrastructure, automation – you get up to the level where you can be effective at security. I used the same approach at Etsy – I started there working on security, stopped, worked in infrastructure until that was basically squared away, and only then started working on security again. You have to work your way up Maslow’s hierarchy.

It’s the same with development. My background is originally development and when you’re programming in C/C++ your main effort is stability, but all those NPEs and other bugs are also security issues. I don’t know any company doing well at security and not well at development, I’m not sure you can do it. Nail the basics and then the advanced topics are achievable.

What’s your opinion on how much the security space has left developers behind?

Look at the real core issues behind security. Dev teams have trouble with writing secure code, ops folks have problems with patching – at security conferences you don’t see anything for solving those problems. Working on offense/breaking and blocking tools is lucrative but inhibits us from going after the root causes.

For many security pros, working in a team instead of solo is a different skill set. “We don’t want to bother the developers with this” – siloed approaches are killing us.

What do you see as the most interesting thing going on in the security landscape right now?

What has happened in the last 3-4 months, as much as I hate to say it, with all the leaking of documents – we’ve been lazy about encryption and privacy and other foundational elements and we assumed it worked, now we’re doing some healthy review to do a next generation of those. It brought that discussion to the forefront. The certificate authority problems, and the NSA stuff – we need to spend some time and think about this. The next generation of SSL and certificate transparency are very interesting.

In terms of pure language work… Improvement of cryptography. Also, we’re making more business level APIs for common problems like PHP5’d password hashing APIs. If your’e building a Web app and need auth you’re starting from zero most of the time and now you’re starting to see things put into the languages that solve these problems.

Out in the larger DevOpsey world, what are the things to watch, what is your team excited about?

Stuff that we’re excited about is traditional devops stuff like really treating our infrastructure like code. No button clicking, infrastructure completely specified in config files in source control, code reviews, and then the file pushed to production to allocate/deallocate hardware and deploy software. That’s a big change.

How do we disseminate best practices/prevent worst practices through those who aren’t the technical “1%?”

Well, best practices are harder

People went into server programming because they don’t like doing user interface stuff. But the joke’s on us, there is still a user interface, it’s configuration files, installers, etc. which are nontrivial. We should either be bundling audit software or server-side config healthchecks to provide warnings. “Why do you have SSL v2 enabled?” “Why are your .htaccess files visible by default?” [Ed: Where the hell did apache chkconfig go?]

People in ops can write these but retroactively folks won’t use them… But the future can have them. If you at least get warned that your Apache config is using suboptimal security configs it’s your deliberate negligence to not do it right.

Maybe take the module approach (Apache wouldn’t want it in their core I’m sure) – if you want to work on it give me a call!

What message do you want to send to other security folks?

For security people, the message is, “It’s really important you start bringing your non-security friends to these security conferences.” Devs and ops and business and QA. They’ll find it interesting and get involved. It’s really important.

Last year, we had a dozen people from my company come out to AppSec. But except for me and our security team, they’re not back this year. There just wasn’t enough content to hold the interest of the devs. What can we do about that?

Really! Interesting. Maybe we need more of a proper dev track, with more things like Karthik’s talk.

A project I’ve wanted to do for a very long time – most people in business and development don’t have real idea of how much damage can be done, it’s why we have Red Teams. If someone’s really good at SQLi, etc. do a talk showing how much damage can be done.

Also – if you work at any company, you depend on an immense set of open source software and they don’t have a security person or anything. Get involved in their process, try to help them and make it better and it’ll improve quality of everyone’s systems. We could do a hackathon during the convention to improve some existing projects.

Leave a comment

Filed under Conferences, DevOps, Security

Tagged as convention, interview, lascon, Security

by wickett | August 20, 2013 · 11:50 pm

The Agile Admin at SXSW, we need your help

Right in the Agile Admin’s hometown (Austin, TX) is one of the coolest conferences out there–it is a special place where hipsters and venture capitalists and programmers and designers and gamers unite. The Agile Admin team is always at SXSW usually in search of new tech and ideas and more often in search of free drinks. This year is gonna be different. This year, James submitted a talk on Rugged Driven Dev and if the talk gets enough votes, the Agile Admin will be represented in the SXSW Interactive lineup.

We need your vote to make it happen. We would love to help the aforementioned hipsters find out about all the cool stuff going on in the Rugged and DevOps communities and bring them into the fold.

Would you vote for the talk? It only takes a few seconds to create an account and vote. You can cast your vote here > http://panelpicker.sxsw.com/vote/19539

2 Comments

Filed under DevOps

by Ernest Mueller | July 22, 2013 · 3:04 pm

Sustaining vs Strangulation

The other day I came across two interesting articles that showcase two facets of one problem (and more notably, a problem that I have been working on myself). Read the two articles, they are:

Can Sustaining Engineering Be Agile and Engaging For Team Members by Matt Roberts
Legacy Application Strangulation by Paul Hammant

I manage a large mostly-sustaining team here at Bazaarvoice that I’ve moved to Agile and DevOps. As Matt points out, sustaining teams are problematic in theory. The strangulation approach, especially the airline booking app “single trunk” approach, is better from a number of perspectives. But, our org made the decision to put all the legacy work with a sustaining team so that the many teams of new-product devs would be able to get maximum speed. It did allow for greater speed of new development to not have to do support at the same time. However, it also provided significant challenges – initially underestimating the effort needed to sustain, new teams not having the benefit of the lessons old team already learned from running at scale, sustaining teams feeling like second class citizens, other teams being tempted to shed even newer work to the sustaining team (even though it’s technically just for the one product). I can’t prove that taking the strangulation vs sustaining approach would have been better, but in retrospect, I would want to try that instead. We are strangling old vs new product from a customer-facing point of view in terms of dialing up new products/dialing down old ones instead of doing “big bang” upgrades, but we’re not doing it inside a single team/single trunk model like Matt mentions and it seems like that could mitigate many of these issues.

We are making the best of the sustaining gig on our team, however. It’s not light work or lacking in innovation. We run support requests through a kanban and then have two scrum-type sprint teams for CI and occasional feature work, plus lots of infrastructure work. We dole out up to a billion hits a day and have a reach of 400M+ users, with traffic and data volume doubling year over year, so we are in the interesting position of being largely frozen in terms of features, how most product managers understand them (pretty buttons!), but having to innovate and rearchitect quite aggressively on all our “nonfunctional” areas (performance, availability, security, etc.). When people tell us we’re “feature frozen” I tell them they have a poor understanding of the word “feature,” or maybe they should think about “changes” rather than “features.” This is one of the key DevOps culture change points many orgs have to face, and educating PMs and upper management on a more holistic definition of “feature” that includes managing nonfunctional requirements is a key success factor.

We’re also doing a number of the things Matt’s article encourages to make sustaining work engaging. We push hard on customer satisfaction (we are riding at 99% of customer tickets fulfilled within SLA; we have a big dashboard with leaderboards that promote that), empower the team to perform continuous improvement to make the system better, and consult with the “next gen” teams on their work. As a result we have really good results, really good relationships with the Implementation and Support groups outside Engineering, and pretty good team morale. Of course, general recognition and stuff like that so that everyone sees and appreciates the team’s work helps.

Though in the end, we are also trying to outsource the sustaining work so that our engineers aren’t all sad from having to do it. (Our current team soldiers on because they know the company depends on them, but other engineers in the company don’t want to do move over to do sustaining work,.) So… There’s that. Our job is to juggle the desire of employees to move off sustaining plus the desire of other teams to get those employees and developing the outsourcers’ expertise with the needs of maintaining the legacy app with the excellence required.

From what I’ve learned from this, I believe a solid product renewal plan would involve:

a) Teams that own services, not apps or projects (ITSM/ITIL 101).

b) Those teams own the design, development, sustaining, operations, deployment, and whatever other task you want to apply to the product – from conception to delivery.

c) Every app and library and service and tool has to be owned by an appropriate service team, regardless of what engineer moved to what team or corporate reprioritization happened or whatever completely-legitimate corporate sob story you have.

d) Then if you need to make a major sea change, you employ the strangulation method to transition effort on a team, not using a separate sustaining team.

The risk with this approach is that a team gets filled up with sustaining work. But that is a chance for them to eat their own dog food. Go fix whatever’s causing that sustaining work! Retire the stuff that doesn’t make sense any more! Passing completed items off into a black hole for “sustaining” chews up just as much resources and time, it just provides the convenient fiction that since you can’t see it, it must not be affecting your velocity.

What do you think? How have you approached this problem? Am I on crack? Let me know.

Leave a comment

Filed under DevOps

Tagged as development, itsm, service, strangling, sustaining

by Ernest Mueller | July 19, 2013 · 10:58 am

Scrum for Operations: How We Got Started

Welcome to the newest article in Scrum for Operations. I started this series when I was working for NI. But now I’m going through the same process at BV so time to pick it back up again! Like my previous post on Speeding Up Releases, I’m going to go light on theory and heavy on the details, good and bad, of how exactly we implemented Agile and DevOps and where we are with it.

Here at BV (Bazaarvoice), the org had adopted Agile wholesale just a couple months before I started. We also adopted DevOps shortly after I joined by embedding ops folks in the product teams. Before the Agile/DevOps implementation there was a traditional organization consisting of many dev teams and one ops team, with all the bottlenecking and siloing and stuff that’s traditional in that kind of setup. Newer teams (often made up of newly hired engineers, since we were growing quickly) that started out on the new DevOps model picked it up fine, but in at least one case we had a lot of culture change to do with an existing team.

Our primary large legacy team is called the PRR team (Product Ratings and Reviews) after the name of their product, which now does lots more than just ratings and reviews, but naturally marketing rebranding does little to change what all the engineers know an app is called. Many of the teams working on emerging greenfield products that were still in development had just one embedded ops engineer, but on our primary production software stack, we had a bunch. PRR serves content into many Internet retailer’s pages; 450 million people see our reviews and such. So for us scalability, performance, monitoring, etc. aren’t a sideline, they’re at least half of the work of an engineering team!

This had previously been cast as “a separate PRR operations team.” The devs were used to tossing things over the wall to ops and punting on the responsibility even if it was their product, and the ops were used to a mix of firefighting and doing whatever they wanted when not doing manual work the devs probably should have automated.

I started at BV as Release Manager, but after we got our releases in hand, I was asked to move over to lead the PRR team and take all these guys and achieve a couple major goals, so I dug in.

Moving Ops to Agile

I actually started implementing Agile with the PRR Ops team because I managed just them for a couple months before being given ownership of the whole department. I had worked closely with many of them before in my release manager role so I knew how they did things already. The Ops team consisted of 15 engineers, 2/3 of which were in Ukraine, outsourced contractors from our partner Softserve.

At start, there was no real team process. There were tickets in JIRA and some bigger things that were lightly project managed. There was frustration with Austin management about the outsourcers’ performance, but I saw that there was not a lot of communication between the two parts of the team. “A lot of what’s going bad is our fault and we can fix it,” I told my boss.

Standups

A the first process improvement step, I introduced daily standups (in Sep 2012). These were made more complicated by the fact that we have half of our large team in Ukraine; as a result we used Webex to conduct them. “Let’s do one Austin standup and one Ukraine standup” was suggested – but I vetoed that since many of the key problems we were facing were specifically because of poor communication between Austin and Ukraine. After the initial adjustment period, everyone saw the value of the visibility it was giving us and it became self-sustaining. (With any of these changes, you just have to explain the value and then make them do it a little while “as a pilot” to get it rolling. Once they see how it works in practice and realize the value it’s bringing, the team internalizes it and you’re ready for the next step.) Also because of the large size and international distribution I did the “no-no” of writing up the standup and sending the notes out in email. It wasn’t really that hard, here’s an example standup email from early on:

Subject: PRR Infrastructure Daily Standup Notes 11/05/2012

Individual Standups
(what you did since last standup, what you will do by the next standup, blockers if any)

Alexander C – did: AVI-93 dev deploy testing of c2, release activity training; will do: finish dev c2, start other clusters
Anton P – did: review AVI-271 sharded solr in AWS proxy, AVI-282 migrating AWS to solr sharding; will do: finish and test
Bryan D – did: Hosted SEO 2.0 discussion may require Akamai SSL, Tim’s puppet/vserver training, DOS-2149 BA upgrade problems, document surgical deploy safety, HOST-71 lab2 ssh timeout, AVI-790, 793 lab monitoring, nexus errors; will do: finish prep Magpie retro, PRR sprint planning, Akamai tickets for hosted SEO, backlog task creation.
Larry B – did: MONO-107,109 7.3 release branch cut, release training; will do: AVI-311 dereg in DNS (maybe monitoring too?)
Lev P – did: deploy script change testing AVI-771; will do: more of that
Oleg K – did: review AVI-676 changes, investigate deployment runbooks/scripts for solr sharding AVI-773; to do: testing that, AVI-774 new solr slaves
Oleksandr M – did: out Friday after taking solr sharding live; will do: prod cleanup AVI-768, search_engine.xml AVI-594
Oleksandr Y – did: AVI-789 BF monitoring, had to fix PDX2 zabbix; will do: finish it and move to AVI-585 visualization
Robby M – did: testing AVI-676 and communicating about AWS sharding; will do: work with Alex and do AVI-698 c7 db patches for solr sharding
Sergii V – did: AVI-703 histograms, AVI-763 combining graphs; will do: continue that and close AVI-781 metrics deck
Serhiy S – did: tested aws solr puppet config AVI-271, CMOD stuff AVI-798, AVI-234
Taras U – did: tested BVC-126599 data deletion. Will do: pick up more tickets for testing
Taras Y – did: AVI-776 black Friday scale up plan, AVI-762 testing BF scale up; will do: more scale up testing
Vasyl B – did: MONO-94 GTM automation to test; will do: AVI-770 ftp/zabbix thing
Artur P – did: AVI-234 remove altstg environment, AVI-86 zabbix monitoring of db performance “mikumi”; will do: more on those

For context, while this was going on we were planning for Black Friday (BF) and executing on a large project to shard our Solr indexes for scaling purposes. The standup itself brought loads of visibility to both sides of the team and having the emails brought a lot of visibility to managers and stakeholders too. It also helped us manage what all the outsourcers were doing (I’ll be honest, before that sometimes we didn’t know what a given guy in Ukraine was doing that week – we’d get reports in code later on, but…).

I took the notes in the standup straight into an email and it didn’t really slow us down (I cheated by having the JIRA project up so I could copy over the ticket numbers). Because of the number of people, the Webex, and the language barrier the standups took 30 minutes. Not the fastest, but not bad.

Backlog

After everyone got used to the standups, I introduced a backlog (maybe 2 weeks after we started the standups). We had JIRA tickets with priorities before, but I added a Greenhopper Scrum style backlog. Everyone got the value of that immediately, since “we have 200 P2 tickets!” is obviously Orwellian at best. When stakeholders (my boss, other folks) had opinions on priorities we were able to bring up the stack-ranked backlog and have a very clear discussion about what it was more or less urgent/important than. (Yes, there were a couple yelling matches about “it’s meaningless to have five ‘top priorities!'” before we had this.) Interrupt tickets would just come in at the top.

Here’s a clip of our backlog just to get the gist of it…

All the usual work… just in a list. “Work it from the top!” We still had people cherry-picking things from farther down because “I usually work on builds” or “I usually work on metrics” but I evangelized not doing that.

Swimlanes

Using this format also gave me insight into who was doing what via the swimlanes view in JIRA. When we’d do the standup we started going down in swimlane order and I could ask “why I don’t see that ticket” or see other warning signs like lots of work in progress. An example swimlane:

This helped engineers focus on what they were supposed to be doing, and encouraged them to put interrupts into the queue instead of thrashing around.

Sprints

Once we had the backlog, it was time to start sprinting! We had our first sprint planning meeting in October and I explained the process. They actually wanted to start with one week sprints, which was interesting – in the dev world often times you start in with really long (4-6 week) sprints and try to get them shorter as you get more mature. In the ops world, since things are faster paced, we actually started at a week and then lengthened it out later once we got used to it.

The main issue that troubled people was the conjunction of “interrupt” tickets with proactive implementation tickets. This kind of work is why lots of people like to say “Ops should use kanban.”

However, I learned two things doing all this. The first is that for our team at least, the lion’s share of the work was proactive, not reactive, especially if you use a 1-2 week lookahead. “Do they really need that NOW, or just by next sprint?” Work that used to look interrupt driven under a “chaos plus big projects” process started to look plannable. That helped us control the thrash of “here’s a new urgent request” and resist it breaking the current sprint where possible.

Also, the amount of interrupt work varies from day to day but not significantly for a large team over a 1-2 week period. This means that after a couple sprints, people could reliably predict how many points of stories they could pull because they knew how much time got pulled to interrupt work on average. This was the biggest fear of the team in doing sprint planning – that interrupt work would make it impossible to plan – and there was no way to bust through it except for me to insist that we do a couple sprints and reevaluate. Once we’d done some, and people learned to estimate, they got comfortable with it and we’ve been scrumming away since.

And the third thing – kanban is harder to do correctly than Scrum. Scrum enforces some structure. I’ve seen a lot of teams that “use kanban” and by that they mean “they do whatever comes to mind, in a completely uncontrolled manner,” indistinguishable from how Ops used to do things. Real kanban is subtle and powerful, and requires a good bit of high level understanding to do correctly. Having a structure helped teach my team how to be agile – they may be ready for kanban in another 6 months or so, perhaps, but right now some guard rails while they learn a lot of other best practices are serving us well.

Poker Planning

After the traditional explanation (several times) about what story points are, people started to get it. We used planningpoker.com for the actual voting – it’s a bit buggy but free, and since sprint planning was also 15 people on both (or more) sides of a Webex, it was invaluable.

Velocity

It’s hard to argue with success. We watched the team velocity, and it basically doubled sprint to sprint for the first 4 sprints; by the end of November we were hitting 150 story points per sprint. I wish I had a screen cap of the velocity from those original sprints; Greenhopper is a little cussed and refuses to show you more than 7 sprints back, but it was impressive and everyone got to see how much more work they were completing (as opposed to ‘doing’). I do have one interesting one though:

This is our 6th and following sprints; you see how our average velocity was still increasing (a bit spikily) but in that last sprint we finally got to where we weren’t overpromising and underdelivering, which was an important milestone I congratulated the team on. Nowadays their committed/completed numbers are always very close, which is a sign of maturity.

Just Add Devs – False Start!

After the holiday rush, they asked me and another manager, Kelly, to take over the dev side of PRR as well, so we had the whole ball of wax (doubling the number of people I was managing). We tried to move them straight to full Scrum and also DevOps the team up using the embedded ops engineer model we were using on the other 2.0 teams. PRR is big enough there were enough people for four subteams, so we divided up the group into four sprint teams, assigned a couple ops engineers to each one, and said “Go!”.

This went over pretty much like a lead balloon. It was too much change too fast. Most of the developers were not used to Agile, and trying to mentor four teams at once was difficult. Combined with that was the fact that most of the ops staff was remote in Ukraine, what happened was each Austin-BV-employee-led team didn’t really consider “those ops guys” part of their team (I look around from my desk and see four other devs but don’t see the ops people… Therefore they’re not on my team.) And that ops team was used to working as one team and they didn’t really segment themselves along those lines meaningfully either. Since they were mostly remote, it was hard to break that habit. We tried to manage that for a little while, but finally we had to step back and try again.

Check back soon for Scrum for Operations: Just Add DevOps, where I reveal how we got Agile and DevOps to work for us after all!

17 Comments

Filed under Agile, DevOps

Tagged as agile, DevOps, Operations, scrum, scrum4ops, system administration, systems, web, webops

by Ernest Mueller | July 17, 2013 · 1:27 pm

Scrum for Operations: Fitting In As An Ops Engineer

So far in this series, I’ve introduced the basics of Scrum as it generally is used and explained the practices that make it extremely successful. But that’s for developers, right? If you are in operations, what does this mean to you? How do you fit in? For an ops person, the major challenges are mental – you have to reorient your way of thinking, and then things drop into place very well.

I’m writing from the perspective of a Web operations guy, though I’ve done more traditional sysadmin work and managed infrastructure (and dev) teams over time (and started off as a dev, many years ago). Some of my terminology is oriented towards creating a product and keeping a Web site up, but you should be able to conceptually substitute your own kind of system, just as all different kinds of developers, not just Web developers, use and benefit from agile.

The Team

First, “DevOps.” Get an Ops person assigned to the dev team. This is fundamental – if it’s an externalized relationship, where the dev team is making requests of your “Infrastructure org”, you will not be seen as part of the team and your effectiveness will be extremely diminished. You need to be more or less dedicated to this project, not handling it from some shared work queue. This reinforces the fundamental values of Agile. You join the team, and you dedicate yourself to the overall success of the product you are working on. It is this integration, and the trust that arises from shared goals, that will remove a lot of the traditional roadblocks you are used to facing when dealing with a dev team. A real agile team should have similarly embedded product, QA, and UX folks, it’s not a new idea.

You are not “a UNIX guy” or “A DBA” any more. You are “a member of the Ratings and Reviews team,” and you happen to have a technical specialty. This may seem like sophistry but it’s actually one of the most critical parts of this cultural transformation.

The Backlog

Start thinking of tasks in a customer-feature-facing kind of way for the backlog. For example, no one but you wants to hear about “configuring the SAN,” they want to know that at the end of the sprint “customers will be able to save files to persistent storage.” If what you’re doing doesn’t have any benefit to the end customer – why are you doing it again? You shouldn’t be.

Figure out how to state operational concerns like performance, maintainability, and availability as benefits in the backlog. Some infrastructure stuff belongs in the backlog, other parts of it belong more in standards (e.g. the team Definition of Done now states you have to have monitoring on a new service…). The product manager and dev team aren’t dumb, they will understand that performance, availability, security, ability to release their software, etc. are important goals that have merit in the backlog. The typical story-lingo is “As an X, I want Y so I can do Z.” “As a client, I want my data backed up so that in the case of a disaster, I am minimally affected.” “As an engineer, I want the uptime state of my services monitored so I can ensure customers are being served.”

You will be challenged (and this is good) on items that are “monkey work.” “I need to go delete log files off that server, so it doesn’t crash.” Hey, why are we doing that? Why is it manual? Should we have a story for proper log rotation? Need a developer to help? You will see a virtuous cycle develop to “fix things right.” Most of the devs haven’t seen a lot of the demeaning stuff you’re asked to do, and they’ll try to help fix it.

The Sprint

I’ll be honest, the first time I was confronted with the prospect of breaking up systems work into sprints I thought it was very unlikely it could be done. “Things are either short interrupts or long projects, right, that doesn’t make any sense.” And then I did it, and the scales dropped from my eyes. Remember refactoring. Developers doing agile are used to refactoring, while we are used to only having “one bite at the apple” – if we don’t get the systems all 100% right before we unleash the developers on them, then we won’t be able to change them later right? Wrong!

In a certain sense, sprint planning is a big load off from traditional planning. Infrastructure folks are used to being asked to provide a granular task breakdown and timeline of 6 months worth of work for some big-bang implementation. Then when reality causes the plan to deviate from that, everyone freaks. Agile takes horizon planning and institutionalizes it – you only need to be able to specifically plan your next 2 (or so) weeks, and if you can’t do that you need to try harder. What can you implement in 2 weeks that has some kind of value? Get a Tomcat running sprint 1, then tune it sprint 2, then monitor it sprint 3 – don’t bundle everything up into one huge mass.

Testing

Figure out what unit tests mean to you for things you are implementing. “Nothing” is the wrong answer. If you’re making a network change, for example, there is something you can do to test that short of “waiting for people to complain.” If you are installing tomcat on a server – if you’re using a framework like chef or puppet they’ll have testing options built in, but even if not there’s certain things you can do to ensure its functionality instead of passing it on and causing lost time and rework when someone else finds out it’s not working right.

More to come, meditate upon those truths for a bit – ask questions in the comments!

2 Comments

Filed under DevOps

Tagged as agile, DevOps, Operations, scrum, scrum4ops, system administration, systems, web, webops

by Ernest Mueller | July 17, 2013 · 9:51 am

Cloud Austin Logging Tool Roundup Presentations

James, Karthik, and I run Cloud Austin, a technical user group for cloud computing types in Austin. Last night we broke new ground by videoing the presentations using Hangouts On Air, and the result is a cool bunch of 15 minute presentations on Splunk, Sumo Logic, Logstash, Greylog2 (including one from Lennat Koopmann, the maintainer) and the first public presentation of Project Meniscus, Rackspace’s new logging system.

You can go get slides and watch the 2+ hour long video on the Cloud Austin blog.

Leave a comment

Filed under Cloud, DevOps

Tagged as greylog2, logging, logstash, meniscus, splunk, sumo logic

by wickett | June 25, 2013 · 10:11 am

Notes and Tweets from DevOps Days Silicon Valley

Over the last few months I have been using TweetScriber (an iPad app) to take notes at conferences. The really nice part about it is that it is a note taking application that allows you to live-tweet and record other people’s tweets all in one place. At DevOps Days Silicon Valley 2013, I tried to use TweetScriber to record what happened and capture what others were saying on twitter as well.

Here are my raw notes from DevOps Days Silicon Valley Day 1 and DevOps Days Silicon Valley Day 2. I also ran an open space on doing security testing with gauntlt and recorded those notes as well.

The Agile Admin team is working on putting together a summary of DevOps Days and Velocity Conference, but until that is released the raw notes will have to suffice.

Leave a comment

Filed under DevOps

by Ernest Mueller | June 24, 2013 · 2:51 pm

Crosspost: How Bazaarvoice Weathered The AWS Storm

For regular agile admin readers, I wanted to point out the post I did on the Bazaarvoice engineering blog, How Bazaarvoice Weathered The AWS Storm, on how we have designed for resiliency to the point where we had zero end user facing downtime during last year’s AWS meltdown and Leapocalypse. It’s a bit late, I wrote it like in July and then the BV engineering blog kinda fell dormant (guy who ran it left, etc.) and we’re just getting it reinvigorated. Anyway, go read the article and also watch that blog for more good stuff to come!

Leave a comment

Filed under Cloud, DevOps

Tagged as availability zone, aws, bazaarvoice, Cloud, DevOps, ec2, leapocalypse, outage, region, resiliency

by Ernest Mueller | June 24, 2013 · 1:51 pm

Velocity 2013 Wrapup

Whew, we’re all finally back home from the conferencing. Fun was had by all.

@iteration1, @ernestmueller, @wickett

Over the next week I’ll go back to the liveblog articles and put in links to slides/videos where I can find them (feel free and post ones you know in comments on the appropriate post!). We’ll also try to sum up the best takeaways into a Velocity 2013 and DevOpsDays Silicon Valley 2013 quick guide, for those without the patience to read the extended dance remix.

Leave a comment

Filed under Conferences, DevOps

Tagged as velocity, velocityconf, velocityconf13

Category Archives: DevOps

The AWS ReInvent Conference Recap

LASCON Interview: Nick Galbreath

The Agile Admin at SXSW, we need your help

Sustaining vs Strangulation

Scrum for Operations: How We Got Started

Moving Ops to Agile

Standups

Backlog

Swimlanes

Sprints

Poker Planning

Velocity

Just Add Devs – False Start!

Scrum for Operations: Fitting In As An Ops Engineer

Cloud Austin Logging Tool Roundup Presentations

Notes and Tweets from DevOps Days Silicon Valley

Velocity 2013 Wrapup

Subscribe

Recent Comments

Recent Posts

Austinites

Cloud

DevOps

Archives