Tag Archives: web

Classy up your curl with curl-trace

 

Let's say you are debugging some simple web requests and trying to discern where things are slowing down.  Curl is perfect for that.  Well, sort of perfect. I don't know about you, but I forget all the switches for curl to make it work like I want, especially in a situation where you need to do something quickly.

Let me introduce you to curl-trace.

It's not a new thing to install; it's just an opinionated way to run curl.  To give you a feel for what it does, let's start with the output from curl-trace.

[Screenshot: sample curl-trace output showing the Request Details and Timing Analysis sections]

As you can see, this breaks out the request details like response code, redirects, and IP in the Request Details section, and then breaks down the timing of the request in the Timing Analysis section.  It uses curl's --write-out option and was inspired by a couple of other posts on the subject and by my co-worker Marcus Barczak.

The goal of curl-trace is to quickly expose details for troubleshooting web performance.

How to setup curl-trace

Step 1

Download .curl-format from github (or copy from below)

\n
 Request Details:\n
 url: %{url_effective}\n
 num_redirects: %{num_redirects}\n
 content_type: %{content_type}\n
 response_code: %{response_code}\n
 remote_ip: %{remote_ip}\n
 \n
 Timing Analysis:\n
 time_namelookup: %{time_namelookup}\n
 time_connect: %{time_connect}\n
 time_appconnect: %{time_appconnect}\n
 time_pretransfer: %{time_pretransfer}\n
 time_redirect: %{time_redirect}\n
 time_starttransfer: %{time_starttransfer}\n
 ----------\n
 time_total: %{time_total}\n
 \n

And put that in your home directory as .curl-format or wherever you find convenient.

Step 2

Add an alias to your .bash_profile (and source .bash_profile) for curl-trace like this:


alias curl-trace='curl -w "@/path/to/.curl-format" -o /dev/null -s'

Be sure to change the /path/to/.curl-format to the location you saved .curl-format. Once you do that and source your .bash_profile you are ready to go.

Usage

Now you can run this:

$ curl-trace https://google.com

Or follow redirects with -L

$ curl-trace -L https://google.com
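Either way, the output is shaped by the .curl-format file above, so you'll see something roughly like this (the numbers here are purely illustrative):

 Request Details:
 url: https://www.google.com/
 num_redirects: 1
 content_type: text/html; charset=ISO-8859-1
 response_code: 200
 remote_ip: 172.217.6.78

 Timing Analysis:
 time_namelookup: 0.012
 time_connect: 0.028
 time_appconnect: 0.076
 time_pretransfer: 0.076
 time_redirect: 0.102
 time_starttransfer: 0.212
 ----------
 time_total: 0.254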

That's it…

Now you are ready to use curl-trace. If you have anything to add to it, open an issue on GitHub, send a PR, or ping me on Twitter: https://twitter.com/wickett.

Enjoy!

UPDATE: 3/17/2016

There was a lot of good feedback on curl-trace so it has now been moved to its own repo: https://github.com/wickett/curl-trace

 

2 Comments

Filed under DevOps

Scrum for Operations: How We Got Started

Welcome to the newest article in Scrum for Operations. I started this series when I was working for NI, but now I'm going through the same process at BV, so it's time to pick it back up again! Like my previous post on Speeding Up Releases, I'm going to go light on theory and heavy on the details, good and bad, of how exactly we implemented Agile and DevOps and where we are with it.

Here at BV (Bazaarvoice), the org had adopted Agile wholesale just a couple months before I started. We also adopted DevOps shortly after I joined by embedding ops folks in the product teams.  Before the Agile/DevOps implementation there was a traditional organization consisting of many dev teams and one ops team, with all the bottlenecking and siloing and stuff that’s traditional in that kind of setup.  Newer teams (often made up of newly hired engineers, since we were growing quickly) that started out on the new DevOps model picked it up fine, but in at least one case we had a lot of culture change to do with an existing team.

Our primary large legacy team is called the PRR team (Product Ratings and Reviews) after the name of their product, which now does lots more than just ratings and reviews, but naturally marketing rebranding does little to change what all the engineers know an app is called. Many of the teams working on emerging greenfield products that were still in development had just one embedded ops engineer, but on our primary production software stack, we had a bunch. PRR serves content into many Internet retailers' pages; 450 million people see our reviews and such. So for us scalability, performance, monitoring, etc. aren't a sideline, they're at least half of the work of an engineering team!

This had previously been cast as “a separate PRR operations team.” The devs were used to tossing things over the wall to ops and punting on the responsibility even if it was their product, and the ops were used to a mix of firefighting and doing whatever they wanted when not doing manual work the devs probably should have automated.

I started at BV as Release Manager, but after we got our releases in hand, I was asked to move over to lead the PRR team and take all these guys and achieve a couple major goals, so I dug in.

Moving Ops to Agile

I actually started implementing Agile with the PRR Ops team because I managed just them for a couple months before being given ownership of the whole department. I had worked closely with many of them before in my release manager role so I knew how they did things already. The Ops team consisted of 15 engineers, 2/3 of whom were in Ukraine, outsourced contractors from our partner Softserve.

At the start, there was no real team process.  There were tickets in JIRA and some bigger things that were lightly project managed. Austin management was frustrated with the outsourcers' performance, but I saw that there was not a lot of communication between the two parts of the team. "A lot of what's going bad is our fault and we can fix it," I told my boss.

Standups

As the first process improvement step, I introduced daily standups (in Sep 2012). These were made more complicated by the fact that we have half of our large team in Ukraine; as a result we used Webex to conduct them. "Let's do one Austin standup and one Ukraine standup" was suggested – but I vetoed that since many of the key problems we were facing were specifically because of poor communication between Austin and Ukraine. After the initial adjustment period, everyone saw the value of the visibility it was giving us and it became self-sustaining.  (With any of these changes, you just have to explain the value and then make them do it a little while "as a pilot" to get it rolling. Once they see how it works in practice and realize the value it's bringing, the team internalizes it and you're ready for the next step.) Also, because of the large size and international distribution, I did the "no-no" of writing up the standup and sending the notes out in email.  It wasn't really that hard; here's an example standup email from early on:

Subject: PRR Infrastructure Daily Standup Notes 11/05/2012

Individual Standups
(what you did since last standup, what you will do by the next standup, blockers if any)

Alexander C – did: AVI-93 dev deploy testing of c2, release activity training; will do: finish dev c2, start other clusters
Anton P – did: review AVI-271 sharded solr in AWS proxy, AVI-282 migrating AWS to solr sharding; will do: finish and test
Bryan D – did: Hosted SEO 2.0 discussion may require Akamai SSL, Tim’s puppet/vserver training, DOS-2149 BA upgrade problems, document surgical deploy safety, HOST-71 lab2 ssh timeout, AVI-790, 793 lab monitoring, nexus errors; will do: finish prep Magpie retro, PRR sprint planning, Akamai tickets for hosted SEO, backlog task creation.
Larry B – did: MONO-107,109 7.3 release branch cut, release training; will do: AVI-311 dereg in DNS (maybe monitoring too?)
Lev P – did: deploy script change testing AVI-771; will do: more of that
Oleg K – did: review AVI-676 changes, investigate deployment runbooks/scripts for solr sharding AVI-773; to do: testing that, AVI-774 new solr slaves
Oleksandr M – did: out Friday after taking solr sharding live; will do: prod cleanup AVI-768, search_engine.xml AVI-594
Oleksandr Y – did: AVI-789 BF monitoring, had to fix PDX2 zabbix; will do: finish it and move to AVI-585 visualization
Robby M – did: testing AVI-676 and communicating about AWS sharding; will do: work with Alex and do AVI-698 c7 db patches for solr sharding
Sergii V – did: AVI-703 histograms, AVI-763 combining graphs; will do: continue that and close AVI-781 metrics deck
Serhiy S – did: tested aws solr puppet config AVI-271, CMOD stuff AVI-798, AVI-234
Taras U – did: tested BVC-126599 data deletion. Will do: pick up more tickets for testing
Taras Y – did: AVI-776 black Friday scale up plan, AVI-762 testing BF scale up; will do: more scale up testing
Vasyl B – did: MONO-94 GTM automation to test; will do: AVI-770 ftp/zabbix thing
Artur P – did: AVI-234 remove altstg environment, AVI-86 zabbix monitoring of db performance “mikumi”; will do: more on those

For context, while this was going on we were planning for Black Friday (BF) and executing on a large project to shard our Solr indexes for scaling purposes. The standup itself brought loads of visibility to both sides of the team and having the emails brought a lot of visibility to managers and stakeholders too. It also helped us manage what all the outsourcers were doing (I’ll be honest, before that sometimes we didn’t know what a given guy in Ukraine was doing that week – we’d get reports in code later on, but…).

I took the notes in the standup straight into an email and it didn’t really slow us down (I cheated by having the JIRA project up so I could copy over the ticket numbers). Because of the number of people, the Webex, and the language barrier the standups took 30 minutes. Not the fastest, but not bad.

Backlog

After everyone got used to the standups, I introduced a backlog (maybe 2 weeks after we started the standups). We had JIRA tickets with priorities before, but I added a Greenhopper Scrum style backlog. Everyone got the value of that immediately, since "we have 200 P2 tickets!" is obviously Orwellian at best. When stakeholders (my boss, other folks) had opinions on priorities we were able to bring up the stack-ranked backlog and have a very clear discussion about what they were more or less urgent/important than. (Yes, there were a couple yelling matches about "it's meaningless to have five 'top priorities!'" before we had this.) Interrupt tickets would just come in at the top.

Here’s a clip of our backlog just to get the gist of it…

[Screenshot: a clip of the PRR backlog in JIRA]

All the usual work… just in a list.  "Work it from the top!" We still had people cherry-picking things from farther down because "I usually work on builds" or "I usually work on metrics" but I evangelized not doing that.

Swimlanes

Using this format also gave me insight into who was doing what via the swimlanes view in JIRA.  When we'd do the standup we started going down in swimlane order, and I could ask "why don't I see that ticket?" or spot other warning signs like lots of work in progress.  An example swimlane:

[Screenshot: an example swimlane view in JIRA]

 

This helped engineers focus on what they were supposed to be doing, and encouraged them to put interrupts into the queue instead of thrashing around.

Sprints

Once we had the backlog, it was time to start sprinting! We had our first sprint planning meeting in October and I explained the process. They actually wanted to start with one week sprints, which was interesting – in the dev world you often start with really long (4-6 week) sprints and try to get them shorter as you get more mature.  In the ops world, since things are faster paced, we actually started at a week and then lengthened it out later once we got used to it.

The main issue that troubled people was the conjunction of “interrupt” tickets with proactive implementation tickets.  This kind of work is why lots of people like to say “Ops should use kanban.”

However, I learned three things doing all this.  The first is that for our team at least, the lion's share of the work was proactive, not reactive, especially if you use a 1-2 week lookahead. "Do they really need that NOW, or just by next sprint?" Work that used to look interrupt driven under a "chaos plus big projects" process started to look plannable. That helped us control the thrash of "here's a new urgent request" and resist it breaking the current sprint where possible.

Also, the amount of interrupt work varies from day to day but not significantly for a large team over a 1-2 week period.  This means that after a couple sprints, people could reliably predict how many points of stories they could pull because they knew how much time got pulled to interrupt work on average. This was the biggest fear of the team in doing sprint planning – that interrupt work would make it impossible to plan – and there was no way to bust through it except for me to insist that we do a couple sprints and reevaluate.  Once we’d done some, and people learned to estimate, they got comfortable with it and we’ve been scrumming away since.

And the third thing – kanban is harder to do correctly than Scrum.  Scrum enforces some structure. I’ve seen a lot of teams that “use kanban” and by that they mean “they do whatever comes to mind, in a completely uncontrolled manner,” indistinguishable from how Ops used to do things. Real kanban is subtle and powerful, and requires a good bit of high level understanding to do correctly. Having a structure helped teach my team how to be agile – they may be ready for kanban in another 6 months or so, perhaps, but right now some guard rails while they learn a lot of other best practices are serving us well.

Poker Planning

After the traditional explanation (several times) about what story points are, people started to get it. We used planningpoker.com for the actual voting – it’s a bit buggy but free, and since sprint planning was also 15 people on both (or more) sides of a Webex, it was invaluable.

Velocity

It’s hard to argue with success.  We watched the team velocity, and it basically doubled sprint to sprint for the first 4 sprints; by the end of November we were hitting 150 story points per sprint. I wish I had a screen cap of the velocity from those original sprints; Greenhopper is a little cussed and refuses to show you more than 7 sprints back, but it was impressive and everyone got to see how much more work they were completing (as opposed to ‘doing’).  I do have one interesting one though:

[Screenshot: velocity chart from Greenhopper]

This is our 6th and following sprints; you see how our average velocity was still increasing (a bit spikily) but in that last sprint we finally got to where we weren't overpromising and underdelivering, which was an important milestone I congratulated the team on. Nowadays their committed/completed numbers are always very close, which is a sign of maturity.

Just Add Devs – False Start!

After the holiday rush, they asked me and another manager, Kelly, to take over the dev side of PRR as well, so we had the whole ball of wax (doubling the number of people I was managing). We tried to move them straight to full Scrum and also DevOps the team up using the embedded ops engineer model we were using on the other 2.0 teams.  PRR is big enough there were enough people for four subteams, so we divided up the group into four sprint teams, assigned a couple ops engineers to each one, and said “Go!”.

This went over pretty much like a lead balloon. It was too much change too fast.  Most of the developers were not used to Agile, and trying to mentor four teams at once was difficult. Combined with the fact that most of the ops staff was remote in Ukraine, what happened was that each Austin-BV-employee-led team didn't really consider "those ops guys" part of their team (I look around from my desk and see four other devs but don't see the ops people… therefore they're not on my team).  And that ops team was used to working as one team; they didn't really segment themselves along those lines meaningfully either.  Since they were mostly remote, it was hard to break that habit. We tried to manage that for a little while, but finally we had to step back and try again.

Check back soon for Scrum for Operations: Just Add DevOps, where I reveal how we got Agile and DevOps to work for us after all!

17 Comments

Filed under Agile, DevOps

Scrum for Operations: Fitting In As An Ops Engineer

So far in this series, I’ve introduced the basics of Scrum as it generally is used and explained the practices that make it extremely successful. But that’s for developers, right? If you are in operations, what does this mean to you? How do you fit in? For an ops person, the major challenges are mental – you have to reorient your way of thinking, and then things drop into place very well.

I’m writing from the perspective of a Web operations guy, though I’ve done more traditional sysadmin work and managed infrastructure (and dev) teams over time (and started off as a dev, many years ago). Some of my terminology is oriented towards creating a product and keeping a Web site up, but you should be able to conceptually substitute your own kind of system, just as all different kinds of developers, not just Web developers, use and benefit from agile.

The Team

First, "DevOps." Get an Ops person assigned to the dev team. This is fundamental – if it's an externalized relationship, where the dev team is making requests of your "Infrastructure org", you will not be seen as part of the team and your effectiveness will be extremely diminished. You need to be more or less dedicated to this project, not handling it from some shared work queue. This reinforces the fundamental values of Agile. You join the team, and you dedicate yourself to the overall success of the product you are working on. It is this integration, and the trust that arises from shared goals, that will remove a lot of the traditional roadblocks you are used to facing when dealing with a dev team. A real agile team should have similarly embedded product, QA, and UX folks; it's not a new idea.

You are not “a UNIX guy” or “A DBA” any more.  You are “a member of the Ratings and Reviews team,” and you happen to have a technical specialty. This may seem like sophistry but it’s actually one of the most critical parts of this cultural transformation.

The Backlog

Start thinking of tasks in a customer-feature-facing kind of way for the backlog. For example, no one but you wants to hear about “configuring the SAN,” they want to know that at the end of the sprint “customers will be able to save files to persistent storage.” If what you’re doing doesn’t have any benefit to the end customer – why are you doing it again? You shouldn’t be.

Figure out how to state operational concerns like performance, maintainability, and availability as benefits in the backlog. Some infrastructure stuff belongs in the backlog, other parts of it belong more in standards (e.g. the team Definition of Done now states you have to have monitoring on a new service…). The product manager and dev team aren’t dumb, they will understand that performance, availability, security, ability to release their software, etc. are important goals that have merit in the backlog. The typical story-lingo is “As an X, I want Y so I can do Z.” “As a client, I want my data backed up so that in the case of a disaster, I am minimally affected.” “As an engineer, I want the uptime state of my services monitored so I can ensure customers are being served.”

You will be challenged (and this is good) on items that are “monkey work.”  “I need to go delete log files off that server, so it doesn’t crash.” Hey, why are we doing that?  Why is it manual? Should we have a story for proper log rotation? Need a developer to help? You will see a virtuous cycle develop to “fix things right.” Most of the devs haven’t seen a lot of the demeaning stuff you’re asked to do, and they’ll try to help fix it.

The Sprint

I’ll be honest, the first time I was confronted with the prospect of breaking up systems work into sprints I thought it was very unlikely it could be done. “Things are either short interrupts or long projects, right, that doesn’t make any sense.” And then I did it, and the scales dropped from my eyes. Remember refactoring. Developers doing agile are used to refactoring, while we are used to only having “one bite at the apple” – if we don’t get the systems all 100% right before we unleash the developers on them, then we won’t be able to change them later right?  Wrong!

In a certain sense, sprint planning is a big load off from traditional planning. Infrastructure folks are used to being asked to provide a granular task breakdown and timeline of 6 months worth of work for some big-bang implementation. Then when reality causes the plan to deviate from that, everyone freaks.  Agile takes horizon planning and institutionalizes it – you only need to be able to specifically plan your next 2 (or so) weeks, and if you can’t do that you need to try harder. What can you implement in 2 weeks that has some kind of value? Get a Tomcat running sprint 1, then tune it sprint 2, then monitor it sprint 3 – don’t bundle everything up into one huge mass.

Testing

Figure out what unit tests mean to you for things you are implementing.  "Nothing" is the wrong answer.  If you're making a network change, for example, there is something you can do to test that short of "waiting for people to complain." If you are installing Tomcat on a server – if you're using a framework like Chef or Puppet they'll have testing options built in, but even if not, there are certain things you can do to ensure its functionality instead of passing it on and causing lost time and rework when someone else finds out it's not working right.
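As a sketch of what I mean (this isn't any particular framework, just the idea): after an automated Tomcat install, a "unit test" can be as simple as proving the thing actually serves something before you hand it off:

# hypothetical post-install check for a freshly configured Tomcat
if curl -sf -o /dev/null --max-time 5 http://localhost:8080/; then
  echo "tomcat is up and serving"
else
  echo "tomcat install FAILED its smoke check" >&2
  exit 1
fi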

More to come, meditate upon those truths for a bit – ask questions in the comments!

2 Comments

Filed under DevOps

Velocity 2013 Day 1 Liveblog – Hands-on Web Performance Optimization Workshop

OK we’re wrapping up the programming on Day 1 of Velocity 2013 with a Hands-on Web Performance Optimization Workshop.

Velocity started as equal parts Web front end performance stuff and operations; I was into both but my path led me more to the operations side, so now I'm trying to catch up a bit – the whole CSS/JS/etc world has grown so big it's hard to sideline in it.  But here I am!  And naturally performance guru Steve Souders is here.  He kindly asked about Peco, who isn't here yet but will be tomorrow.

One of the speakers is wearing a Google Glass, how cute.  It’s the only other one I’ve seen besides @victortrac’s. Oh, the guy’s from Google, that explains it.

@sergeyche (TruTV), @andydavies (Asteno), and @rick_viscomi (Google/YouTube) are our speakers.

We get to submit URLs in realtime for evaluation at man.gl/wpoworkshop!

Tool Roundup

Up comes webpagetest.org, the great Web site to test some URLs. They have a special test farm set up for us, but the abhorrent conference wireless largely prevents us from using it. “It vill disappear like pumpkin vunce it is over” – sounds great in a Russian accent.

YSlow, the ever-popular browser extension, is at yslow.org.

Google PageSpeed Insights is a newer option.

showslow.com trends those webpagetest metrics over time for your site.

Real Page Tests

Hmm, since at Bazaarvoice we don’t really have pages per se, we’re just embedded in our clients’ sites, not sure what to submit!  Maybe I’ll put in ni.com for old times’ sake, or a BV client. Ah, Nordstrom’s already submitted, I’ll add Yankee Candle for devious reasons of my own.

redrobin.com – 3 A’s, 3 F’s. No excuse for not turning on gzip. Shows the performance golden rule – 10% of the time is back end and 90% is front end.
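(Aside: a quick way to check whether a site is serving gzip at all, without waiting on webpagetest – example.com here is just a stand-in:)

$ curl -s -o /dev/null -D - -H 'Accept-Encoding: gzip' http://www.example.com/ | grep -i '^content-encoding'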

“Why is my time to first byte slow?”  That’s back end, not front end, you need another tool for that.

nsa.gov – comes back all zeroes.  General laughter.

Gus Mayer – image carousel, but the first image it displays is the very last it loads.  See the filmstrip view to see how it looks over time. Takes like 6 seconds.

Always have a favicon – don’t have it 404. And especially don’t send them 40k custom 404 error pages. [Ed. I’ll be honest, we discovered we were doing that at NI many years ago.] It saves infrastructure cost to not have all those errors in there.

Use 85% lossy compression on images.  You can’t tell even on this nice Mac and it saves so much bandwidth.
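(For instance, with ImageMagick installed, re-encoding at 85% quality is a one-liner – the filenames here are made up:)

$ convert hero-original.png -quality 85 hero.jpg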

sitespeed.io will crawl your whole site.

speedcurve is a paid service using webpagetest.

Remember webpagetest is open source, you can load it up yourself (“How can we trust your dirty public servers!?!” says a spectator).

Mobile

webpagetest has some mobile agents

httpwatch for iOS

1 Comment

Filed under Conferences, DevOps

Speeding Up Releases

Hi all!  My new job’s been affording me few opportunities for blogging, but I’m getting into the groove, so you should see more of me now.

Releasing All The Time!

Continuous integration is the bomb.  We can all generally agree on that.  But my life has become one of halfway steps that I think will be familiar to many of you, and I don’t believe in hiding the real world that’s not all case study perfect out there.  So rather than give you the standard theory-list of “what you should do for nice futuristic DevOps releases,” let me tell you of our march from a 10 week to 2 week to 1 week release tempo at Bazaarvoice.

Biweekly Releases!

I started with BV at the start of February of this year. They said, “Our new release manager!  We’ve been waiting for you!  We adopted agile and then tried to move from our big-bang 10 week release cycle to 2 weeks and it blew up like you wouldn’t believe.  Get us to two week releases.  You’ve got a month. Go!”  The product management team really needed us to be able to roll out features more quickly, do piloting and A/B testing, and generally be way more agile in delivery to the customer and not just in dev-land.

Background – our primary application is for the collection and display of user generated content – for example, ratings and reviews – and a lot of the biggest Internet retailers use our solution for that purpose. The codebase started seven years ago and grew monolithically for much of that time. ("The monolith" was the semi-affectionate code name for the stack when I started, as in "is your app's code on the monolith?") The app is running across multiple physical and cloud based datacenters and pushing out billions of hits a day, so there's a low tolerance window for errors – our end user facing display apps have to have zero downtime for releases, though we can do up to two hours of downtime in a 3-5 AM window for customer administrative systems. Stack is Java, Linux, mySQL, Solr, et al. Extremely complex, just like any app added on to for years.

There had been a SWAT team formed after the semi-disastrous 2 week release that identified the main problems.  Just like everywhere else, the main impediments were:

  • Lack of automation in testing
  • Poor SCM code discipline

Our CTO was very invested in solving the problem, so he supported the solution to #1 – the QA team hired up and got some automation folks in, and the product teams were told they had to stop feature development and do several sprints of writing and automating tests until they could sustain the biweekly cadence.

The solution to #2 had two parts.  One was a feature flagging system so we could launch code “dark.” We had a crack team of devs crank this one out. I won’t belabor it because Facebook etc. love to do DevOps presentations on the approach and its benefits, but it’s true.  Now we release all code dark first, and can enable it for certain clients or other segments.

Two was process – a new branching process where a single release branch comes off trunk every two weeks several days before release, and changes aren’t allowed to it except to fix issues found in QA, and those are approved and labeled into discrete release candidates. The dev environment gets trunk twice a day, the QA environment gets branch every time a new release candidate is labeled. Full product CIT must be passing to get a release candidate. As always, process steps like this sound like common sense but when you need 100 developers in 10 teams to uptake them immediately, the little issues come out and play.
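To make that concrete, the mechanics of cutting a branch and labeling a release candidate look roughly like this in Subversion (the branch and tag names here are invented for illustration, not our real naming scheme):

# cut the release branch from trunk several days before the release
svn copy ^/trunk ^/branches/release-2013.05.20 -m "Cut release branch for the 5/20 release"

# QA-approved fixes get merged to the branch only, then labeled as a release candidate
svn copy ^/branches/release-2013.05.20 ^/tags/release-2013.05.20-RC2 -m "Label RC2"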

There were a couple issues we couldn’t fix in the time allotted.  One was that our Solr indexes are Godawful huge.  Like 20 GB huge.  JVM GC tuning is a popular hobby with us. To make changes, reindex, and distribute the indexes in time to perform a zero-downtime deployment, with replication lag nipping at our heels, was a bigger deal.  The other was that our build and deploy pipeline was pretty bad.  All the keywords you want to hear are there – Puppet, TeamCity, Rundeck, svn, noah, maven/Nexus, yum…  But they are inconsistently implemented, embedded in a huge crufty bash script framework and parts have gone largely untended.

The timeframe was extremely aggressive.  I project managed the hell out of it and all the teams were very on board and helpful, and management was very supportive.  I actually got a slight delay, and was grateful for it, because our IPO date came up on the same date when we were supposed to start biweekly releases, and even the extremely ambitious were taken aback by the risk of cocking up the service on that day. We did our first biweekly release on March 6th and then every two weeks thereafter.  We had a couple rough patches, but they were good learning experiences.

For example, as our first biweekly release day approached, tests just weren't passing. I brought all the dev managers to the go/no-go meeting (another new institution) and asked them, "are we go?" (The release manager role had been set up by upper management as more prescriptive, with the thought I'd be sitting there yelling at them "It's no-go," but that's really not an effective long term strategy).  They all kinda shuffled, and hemmed, and hawed (there was a lot of pressure from internal stakeholders to get this release out NOW), but then one manager said "No, we're no go.  It's just not safe." Once she said that everyone else got over that initial taboo of saying "no go" and concurred that some of their areas were no go.  The release went out 5 calendar days late but a lot more smoothly than the last release did (44 major issues then, 5 this time).

The next release, though, was the real make-or-break.  On the one hand everyone had a first real pass through the process and so some of the “but I didn’t know I needed to have testing signoff by that day and time” breaking-in static was gone, but on the other hand they’d had 2 months between the previous two releases to test and plan, and this one allowed only two weeks.  It went off with no delay and only 1 issue.

Of course, we had deliberately sandbagged that a little because it coincided with a "test development only" sprint.  But anyone who thinks a complex release in a large scale environment will go smoothly just because you're deploying code with no functional changes has clearly never been closer than a 10-foot pole to real world Web operations. As we ramped back up on feature development, the process was also becoming more ingrained and testing better, so it went well.

We had one release go bad in May, and when we looked at it we realized a lot of changes weren't being sufficiently QA'ed.  So what we did was simply add a set of fields to all JIRA tickets for the team to specify who tested the change, and we wrote a script to parse our Subversion commit comments and label JIRA tickets with the appropriate release (trying to get people to actually fill out tickets correctly is a pain and usually doomed to failure, so we made an end run with automation).  So then as a release came up, a wiki page listed all the tickets in the release and who tested them and how (automatic, manual, did not test). We actually did this for two releases with paper printouts and physical signoffs to develop the process before we automated it.  This corrected the issue and we ran from then on with very low problem rates. As advertised, releasing fewer changes more frequently allows us to get both a higher throughput of changes and, paradoxically, higher quality with them.
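The script itself isn't anything fancy. I'm not going to paste our real one, but the shape of the idea is roughly this (the revision range, JIRA URL, and credentials are all placeholders):

# scrape JIRA keys out of the commit messages on the release branch...
RELEASE=release-2013.05.20
svn log -r 41200:41900 ^/branches/$RELEASE \
  | grep -oE '[A-Z]+-[0-9]+' | sort -u \
  | while read ticket; do
      # ...and label each ticket with the release via the JIRA REST API
      curl -s -u "$JIRA_USER:$JIRA_PASS" -X PUT \
        -H 'Content-Type: application/json' \
        -d "{\"update\":{\"labels\":[{\"add\":\"$RELEASE\"}]}}" \
        "https://jira.example.com/rest/api/2/issue/$ticket"
    done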

Weekly Releases!

The process worked great through the summer. In the biweekly release communication and presentations, I had explained we’d be moving to weekly and then to continuous deployment as soon as we could make it happen. Well, the solr index distribution problem took a while – two reorgs kicked it around and it was an ambitious “use bittorrent to distribute the index to all the servers in our various DCs” pretty propellerhead kind of thing that had to happen. It took the summer to get that squared away. In the meantime I also conducted a project internally called “Neverland” to fix some of the most egregious technical debt in our TeamCity and Nexus setup and deployment scripts.

The real testament to the culture change that happened as part of the biweekly release project is that while that project was a “big deal” – I had stakeholders from all over the business, big all hands presentations, project plans out the yin-yang, the entire technical leadership team sweating the details – moving from biweekly to weekly releases was largely a non-event.

The QA team worked in the background leading up to it to push test automation levels up higher. Then we basically just said “Hey, you guys want to release faster don’t you?” “Well sure!” “OK. we’re going weekly in two weeks. Check out the updated process docs.” “All right.” And we did, starting the first release in September.  The Solr index got reindexed and redistributed (and man, it had been a while – it compacted down nicely) and deployment ran great. No change in error rate at all. We’ve been weekly since then, the only change  is when we don’t release during critical change freeze windows around Black Friday/Cyber Monday and other holiday prime times. We think our setup is robust enough that it’s safe to release even then, but, heck, no one’s perfect so it’s probably prudent to pause, and many of our clients are really adamant about holiday change freezes to us and to all their suppliers.

The one concern voiced by engineers about the overhead of the release process was addressed by automating it more and by educating.  For example, the go/no-go meeting was, at times, a little messy. Some of the other teams (especially ones not located in Austin) wouldn’t show up, or test signoffs wouldn’t be ready, and it would turn into delays and running around. The opportunity to do it more quickly actually helped a lot! Whereas the meeting had been 30 minutes if we were lucky when we started, now the meeting is taking 5 minutes, and only longer when someone screws around and doesn’t dial into the Webex on time.

“If it’s painful, do it more often” is a message that some folks still balk at when confronted with, but it is absolutely true.

Now, the path wasn’t easy and I was blessed with a very high caliber of people at Bazaarvoice – Dev, Ops, and QA. Everyone was always very focused on “how do we make this work” or “how do we improve this” with very little of the turf warring, blocking, and politics that I sadly have come to expect in a corporate environment. The mindset is very much “if we come up with a new way that’s better and we all agree on that, we will change to do that thing TOMORROW and not spend months dithering about it,” which is awesome and helped drive these changes through much faster than, honestly, I initially estimated it would take.

Releasing All The Time!

Continuous integration on “the monolith” was a distant myth initially, but now we’re seeing how we can get there and the benefits we’ll reap from doing so. Our main impediments remaining are:

1. CIT not passing.  We don’t have a rule where if CIT is failing checkins are blocked, mainly because there’s a bunch of old legacy tests that are flaky. This often results in release milestones being delayed because CIT isn’t passing and there’s 6 devs’ checkins in the last failing build. Step 1 is fix the flaky tests and step 2 is declare work stoppage when CIT is failing. The senior developers see the wisdom in this so I expect it to go down without much friction. Again, the culture is very much about ruthlessly adopting an innovation if the key players agree it will be beneficial.

2. Builds, CIT, and deployment are slow as molasses in January. Build 1 hour, CIT 40 minutes, deploy 3 hours. Why? Various legacy reasons that give me a headache when I have to listen to them. Basically “that’s how it is now, and complete rewrite is potentially beyond any one person’s ability and definitely would take multiple man-months.” We’re analyzing what to do here. We also have a “staging” environment customers use for integration, and so currently we have to deploy to dev, test, deploy to QA, test, deploy to staging (hitting the downtime window), test, deploy to production (hitting the downtime window), test. So basically 2 days minimum. However, staging is really production and step one is release them at the same time.  There’s a couple “but I can only test this kind of change in staging” items left that basically just require telling someone “Figure out how to test it in QA now.” Going to “always release trunk” will remove the whole branch deployment and separate dev and QA environments. So that’s 2 of 4 deployments removed, but then it’s a matter of figuring out cost vs benefit of smashing down parts of that 4:40. I have one proposal in front of me for chucking all the current deploy infrastructure for a Jenkins-driven one, I need to figure out if it is complete enough…

Am I Doing It Wrong?

Chime in in the comments below with questions or if there’s some way I could have cut the Gordian knot better.  I think we’ve moved about as fast as you can given a lot of legacy code and technical debt (and having a lot of other stuff people need to be working on to keep a service up and get out new functionality).   The three step process I used that works, as it does so often, was:

  1. Communicate a clear vision
  2. Drive execution relentlessly
  3. Keep metrics and continually improve

Thanks for reading, and happy releasing!

1 Comment

Filed under DevOps

Scrum for Operations: Order from Chaos

Welcome to the second installment in Scrum for Operations, a series where I talk about (and go through) the process of doing systems work as part of a DevOps team according to the Scrum methodology. Last time, I introduced the basics of Scrum as it generally is used, and its key benefit of frequently delivering useful functionality. But I already hear the objections – “How can that turn out all right?” It is so light on process that one’s initial inclination is to dismiss it as “cowboy coding”, and we already know not to be “cowboy sysadmins,” right? One’s intuition might be (and mine was in the beginning, I’ll be honest) that this would lead to a metastable process that could not be sustainable without fundamental fatal flaws overtaking it.

Well, as I learned after trying to learn more and kicking the tires with our dev team here, there are several core disciplines that are agile’s saving graces.

Testing

We ops guys are used to testing being a neglected afterthought in the development process, often tossed over the wall to a QA team that isn’t well integrated into the product. Therefore we have a hard time trusting code that’s being handed over to us given our experience – we get it handed to us and it doesn’t work!

Well, agile pretty much understands that without pervasive testing, this kind of fast cycle process is doomed. At its extreme, some practitioners use Test Driven, aka Test First, development where failing tests must be written first and then code filled in behind it till the test passes. This creates a large inherent test framework.

Even agile groups that don’t do this almost always have metrics on unit test coverage and a required bar devs must hit.  Here, our desktop software group that’s newly using Scrum has the mandate that “there must be XX% unit test code coverage or you’re not ready to ship.”

Similarly, acceptance testing (automated continuous testing of user stories vs the code) is a common part of agile. Continuous ongoing testing ensures quality through the dev cycle and reduces the need for time-intensive, and mysteriously always insufficient, big-cycle regression testing.

This is a great culture. And there’s all kinds of different tests – unit test, integration/functional/regression testing, performance testing, fault testing… Starting to get interesting to you?  How about monitoring? In reality application monitoring is a special case of testing – it’s a “lightweight integration regression test.” Our initial approach to DevOps includes making test coverage goals for things like monitors and performance testing, because that plugs into the existing agile mindset well.
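To make the "monitoring is a lightweight test" idea concrete, here's the kind of thing I mean – a tiny illustrative check (the /healthcheck URL and the thresholds are made up) that is really just a regression test you happen to run every minute:

start=$(date +%s%N)
code=$(curl -s -o /dev/null -w '%{http_code}' --max-time 10 http://www.example.com/healthcheck)
elapsed_ms=$(( ( $(date +%s%N) - start ) / 1000000 ))
if [ "$code" != "200" ] || [ "$elapsed_ms" -gt 2000 ]; then
  echo "CRITICAL: healthcheck returned $code in ${elapsed_ms}ms"
  exit 2   # Nagios-style critical
fi
echo "OK: $code in ${elapsed_ms}ms"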

Bonus new terminology thing – the quick acceptance test you do upon release, which we always called “critical path testing,” is now being called “smoke testing” by the hip. Update your dictionaries!

A side note on formal QA groups. Just as we are working on DevOps, there has been previous work on how QA teams interact with agile dev teams, and there are a variety of different doctrines on how to split the work – often, it’s devs that are responsible for a lot of the testing. It’s a hard balance – you want the devs to be responsible for some of the testing because the best testing is “close to the code,” but just like with Ops, a real QA team has expertise beyond what a developer can just bolt in with 10% of their attention. Here, we have a dev team and also a remote QA team; devs test their own code on the daily build and then there’s a weekly push to a more stable environment where the QA team does acceptance testing and is moving into performance testing and the like.

Anyway, this endemic focus on testing and automation of testing and testing metrics is the pin that makes this agile flywheel actually turn without just flying off. (You are correct, some agile teams don’t do this – we call those “the unsuccessful ones.”)

And this is for you to do as well! There’s a whole post or series of posts in the topic “What does a unit test mean for something infrastructurey” – it is incumbent on you to figure it out and also have high test coverage with your work.

Refactoring

In general, agile dev is the epitome of horizon planning. You know you can’t get all the requirements ahead of time (or if you do get them all ahead of time, what you come out with won’t serve any real human’s need) and similarly preplanned architecture and design often doesn’t survive contact with the scrum. So it’s not “don’t plan or design,” but it’s “plan and design in an ongoing manner.”

This is one of the scariest parts for an ops person – we assume that we get one "bite at the apple", and once we've set up the systems and let in the developers, we'll never be allowed to change anything without a fight.  But developers have this problem internally all the time – one dev is working on a core library or API that other developers are using, and they don't wait for the core guy to get done before they start. Instead, they have adopted a concept they call refactoring. Refactoring just means that each sprint, you are open to redoing fundamental stuff that needs to change (or that you realize you did kinda quick-and-dirty in the first place).

Because this is an accepted part of the iterative approach, you get to leverage this as well. In the first iteration they get the basic Tomcat and mySQL install out of the repo, and they can get started – and then in the second iteration you front it with Apache, or tune the DB for security, or whatnot, and they have to make some changes to fit. I'm not promising no one will ever cry about this, but it's a part of what makes the culture successful so it's there for you to leverage.

And for you to adhere to! Be open to refactoring your infrastructure based on the emerging project needs.

Source Control

A developer might not even mention this, and most books on agile don’t, because to them it’s so fundamental a discipline that it’s like breathing air. Sadly the same can not be said of Ops folks, so I’m mentioning it. When code is changed, it is in a shared source control repository – which gives other people on the group visibility into it (a collaboration touchpoint), is a common place to source it from (a deployment touchpoint), and can be used to easily manage multiple versions, even experimental ones, and merge or roll back changes.

This is the most fundamental empowering technology of modern software development (not just agile) and you must uptake it immediately or you have lost.  Fair warning. It is the stepping stone that will allow subsequent Cool DevOps Automation to happen.
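If you're starting from zero, even the most basic version is transformative – something like this for your Apache configs, for example (just a sketch; substitute svn or whatever your devs already use):

cd /etc/httpd
git init
git add conf conf.d
git commit -m 'Baseline Apache config before any changes'
# later, after editing something:
git diff                                  # see exactly what changed
git commit -am 'Enable mod_deflate for text responses'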

Conclusion

These three disciplines convert agile/Scrum from dangerous free-for-all to a new technique that gets your product done both more quickly and with higher quality than a waterfall method. I’ll talk further next time about how Ops slots into all this, and how you can fit your systems admin work into a Scrum mindset.

Also note, there's some other agile disciplines surrounding agile design and encapsulation and patterns and whatnot, which I don't understand well enough yet to speak authoritatively on. Feel free to chime in with other core disciplines if you do!

Leave a comment

Filed under DevOps

Scrum for Operations: What Is Scrum

Agile

It’s not a mandatory part of DevOps, but I believe that DevOps works a lot better if operations teams adopt Agile.  But all that most systems teams know about Agile is that “it’s that thing that makes the development teams not plan worth a damn any more.”  Well, though there may be some truth to that, a well run agile process is very effective and not uncontrolled at all. Constructing and maintaining infrastructure in an agile manner is very possible – we’ve done it. In the beginning it seems daunting, just as it did initially to the legions of software developers who were steeped in waterfall based thinking, but once you adopt it I think you’ll see a lot of traditional pain points get a lot better – that was our experience. This is the first in a series of posts about using Scrum for Web operations, and I thought I’d start with explaining what Scrum is from an ops guy’s point of view.

Scrum

Scrum is one of the more popular agile methodologies.  Agile was initially defined by the Agile Manifesto and from there got turned into more specific implementations of its principles, methods, and practices, and Scrum is one of those more specific prescriptions on how to “do” agile. Other methodologies have been used for DevOps – like check out this great presentation on how Stephen Nelson-Smith (@LordCope) did a XP DevOps implementation at a British government agency.

While developing our first two products that we delivered on the cloud using DevOps, we used an “agile-ish” methodology, which is to say not a formal agile approach.  This time, we’ve decided to run Scrum by the book, not just for the feature developers but for our systems engineers, operations staff, security engineers, and system automation developers. Some folks are talking about hybrid models like Scrumban (Scrum + Kanban/Lean) to better incorporate ops work, but we’re going to start with Scrum first and see where we get to.

As a system administrator, that’s scary, because we don’t know Scrum. But Scrum is pretty darn simple.

Here are two short videos that give a pretty good intro to Scrum.

Scrum in Under 10 Minutes, courtesy Axosoft:

Scrum Basics:

(P. S. Scrum master lady from this video… Call me!)

But if like me you think videos are not an efficient method of conveying information, here’s the short form.

The Team

In Scrum, you form a small crossfunctional team to all work together on a product instead of having to cross organizational boundaries and fill out forms to get any work done.  The roles consist of:

  • product owner – the “business guy” who has say over what features get greenlit and when
  • scrum master – the project manager who keeps everyone on track
  • team – ideally 5 to 7 developers writing code (this is where Ops will plug in, in DevOps)
  • testers, security, other folks – probably should be included too, but adoption of that varies

Of course in a mid to large sized organization there are other stakeholders, like customers and management and legal and whatnot. But you form a team that has everything it needs to complete its work.

The Backlog

All needed features are brainstormed and put into a master list of features called a “product backlog.” This backlog contains everything – including your systems tasks. The backlog is then broken up into smaller chunks, like release backlogs (features targeted for a specific release) and sprint backlogs (specific tasks for a sprint). The sprint backlog is basically your work breakdown structure, if you’re more comfortable with that terminology, except that the tasks are worded more in terms of what feature they provide instead of in terms of what specific things you need to do; this fosters communication with the product owner.

The developers (and, ideally, operations staff)  on the team help generate the backlog and provide time estimates in hours for each task. The product owner owns the prioritization and ordering (as constrained by things that are actual dependencies).

The Sprint

Work is performed in month-long iterations called “sprints.” Requirements are frozen for the sprint and the development is time-boxed – it must end at the termination of the sprint. At the end of the sprint, whatever features were put in that sprint should be complete and “ready to ship.”

The Standup

As the scrum, or sprint, progresses, there are daily “standups” – 15 minute meetings where everyone stands up, reports what they have done since the last standup, what they plan to accomplish by the next standup, and any roadblocks they are encountering. By keeping this meeting short it doesn’t waste time, but by having an actual face to face meeting you get very rapid and effective collaboration that cannot be achieved via managing project plans, sending emails, generating status reports, or the like.  It cuts out all the busywork and keeps the kernel of coordination that lets a team keep up velocity.

The Burndown

The progress of the team against the sprint backlog is tracked by a “burndown chart.” As each team member completes their tasks for the sprint you can easily see whether you are on track for successful completion or not.

And that’s Scrum in a nutshell.  Five things. Small integrated team, backlog, sprints, standups, burndown.

Next Time

I’ll talk about each of these areas more in depth later in the series, as we go through them ourselves as we develop a real product, and explain how a system administrator (aka systems engineer, infrastructure admin, operations ninja) can fit their work into this structure. But first, I will explain why Agile/Scrum is not just “crazy talk.” To a hardened system admin, or really to anyone used to working in a waterfall environment, it is very counterintuitive that this approach doesn’t just degenerate into the IT equivalent of orcs pillaging a city. But agile has several interesting practices that make it work, and should be very interesting to an operations person.

4 Comments

Filed under DevOps

Why SSL Sucks

How do you make your Web site secure?  “SSL!” is the immediate response.  It’s a shame that there are so many pain points surrounding using SSL.

  • It slows things down.  On ni.com we bought a hardware SSL accelerator, but in the cloud we can’t do that.
  • Cert management.  Everyone uses loads of VIPs, so you end up needing to get wildcard certs.  But then you have to share them – we’re a big company, and don’t want to give the main wildcard cert out to multiple teams.  And you can’t get extended validation (EV) on wildcard certs.
  • UI issues.  We just tried turning SSL on for our internal wiki.  But anytime there’s any non-https included element – warnings pop up.  If there’s a form field to go to search, which isn’t https, warnings pop up.  Hell, sometimes you follow a link to a non-https page, and warnings pop up.
  • CAs.  We have an internal NI CA and try to have some NI-signed certs, but of course you have to figure out how to get them into every browser in your enterprise.
  • It's just absurdly complicated.  For putting a cert on Apache, it's pretty well-worn, but recently I was trying to set up encrypted replication for mySQL and OpenDS and Jesus, doing anything other than the default self-signed cert is hell on earth.  "Oh is that in the right format wallet?"

The result is that SSL's suckiness ends up driving behavior that degrades security.  People know to just "accept the exception" any time they hit a site that complains about an invalid cert.  We have decided to remove SSL from most of our internal wiki and just leave it on the login page to avoid all the UI issues.  We couldn't secure our replication because of a combination of bugs (OpenDS secure replication works, until you restart any server that is – then it's broken permanently) and the hassle.

In general, there has been little to no usability work put into the cert/encryption area, and that is why so few people still use it.  PGPing your email is only for gearheads. Hell, you have to transform key formats just to use PuTTY to SSH into Amazon servers.  Stop the madness.
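(Case in point – on a Linux box the PuTTY conversion dance for an EC2 key looks something like this, assuming puttygen from the putty-tools package; the filenames are made up:)

$ puttygen my-ec2-key.pem -O private -o my-ec2-key.ppk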

If the world really gave a crap about encryption, then e.g. your public key could be attached to your Facebook profile and people’s mail readers could pull that in automatically to validate signatures, for instance. “Key exchange” isn’t harder than the other kinds of more-convenient information exchange that happen all the time on the Net.   And you could take a standard cert in whatever format your CA gives it to you and feed it into any software in an easy and standard way and have it start encrypting its communication.

Me to world – make it happen!

7 Comments

Filed under Security

Upcoming Free Velocity WebOps Web Conference

O’Reilly’s Velocity conference is the only generalized Web ops and performance conference out there.  We really like it; you can go to various other conferences and have 10-20% of the content useful to you as a Web Admin, or you can go here and have most of it be relevant!

They’ve been doing some interim freebie Web conferences and there’s one coming up.  Check it out.  They’ll be talking about performance functionality in Google Webmaster Tools, mySQL, Show Slow, provisioning tools, and dynaTrace’s new AJAX performance analysis tool.

O’Reilly Velocity Online Conference: “Speed and Stability”
Thursday, March 17; 9:00am PST
Cost: Free

Leave a comment

Filed under Conferences, DevOps

Come To OpsCamp!

Next weekend, Jan 30 2009, there’s a Web Ops get-together here in Austin called OpsCamp!  It’ll be a Web ops “unconference” with a cloud focus.  Right up our alley!  We hope to see you there.

Leave a comment

Filed under DevOps, Uncategorized