Tag Archives: service

Sustaining vs Strangulation

The other day I came across two interesting articles that showcase two facets of one problem (and, more notably, a problem that I have been working on myself). Go read both; they are:

I manage a large, mostly-sustaining team here at Bazaarvoice that I’ve moved to Agile and DevOps. As Matt points out, sustaining teams are problematic in theory. The strangulation approach, especially the airline booking app “single trunk” approach, is better from a number of perspectives. But our org made the decision to put all the legacy work with a sustaining team so that the many teams of new-product devs would be able to get maximum speed. Not having to do support at the same time did allow for greater speed of new development. However, it also created significant challenges: we initially underestimated the effort needed to sustain, new teams didn’t get the benefit of the lessons the old team had already learned from running at scale, the sustaining team felt like second-class citizens, and other teams were tempted to shed even newer work onto the sustaining team (even though it’s technically just for the one product). I can’t prove that taking the strangulation approach instead of the sustaining approach would have been better, but in retrospect, I would want to try that instead. We are strangling the old product with the new from a customer-facing point of view, dialing up new products and dialing down old ones instead of doing “big bang” upgrades, but we’re not doing it inside a single team/single trunk model like Matt mentions, and it seems like that could mitigate many of these issues.

We are making the best of the sustaining gig on our team, however. It’s not light work or lacking in innovation. We run support requests through a kanban and then have two scrum-type sprint teams for CI and occasional feature work, plus lots of infrastructure work. We dole out up to a billion hits a day and have a reach of 400M+ users, with traffic and data volume doubling year over year. That puts us in the interesting position of being largely frozen in terms of features as most product managers understand them (pretty buttons!), while having to innovate and rearchitect quite aggressively on all our “nonfunctional” areas (performance, availability, security, etc.). When people tell us we’re “feature frozen” I tell them they have a poor understanding of the word “feature,” or maybe they should think about “changes” rather than “features.” This is one of the key DevOps culture change points many orgs have to face, and educating PMs and upper management on a more holistic definition of “feature” that includes managing nonfunctional requirements is a key success factor.
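For a rough sense of scale, the traffic figures above can be turned into back-of-envelope arithmetic. This is only a sketch: the function names are made up for illustration, the “up to a billion hits a day” figure is treated as one full day spread evenly, and the growth model simply assumes the yearly doubling continues unchanged.

```python
SECONDS_PER_DAY = 24 * 60 * 60  # 86,400


def average_rps(hits_per_day):
    """Average requests per second if the day's hits were spread evenly."""
    return hits_per_day / SECONDS_PER_DAY


def projected_hits_per_day(current_hits_per_day, years, growth_factor=2.0):
    """Daily hits after `years` of compounding growth (doubling by default)."""
    return current_hits_per_day * growth_factor ** years


if __name__ == "__main__":
    today = 1_000_000_000  # "up to a billion hits a day" from the text
    print(f"average today: {average_rps(today):,.0f} requests/second")
    for year in range(1, 4):
        future = projected_hits_per_day(today, year)
        print(f"year {year}: {future:,.0f} hits/day, "
              f"{average_rps(future):,.0f} requests/second on average")
```

An even spread over the day understates peak load, so real capacity planning would start from measured peaks rather than this average.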

We’re also doing a number of the things Matt’s article encourages to make sustaining work engaging. We push hard on customer satisfaction (we are riding at 99% of customer tickets fulfilled within SLA; we have a big dashboard with leaderboards that promote that), empower the team to perform continuous improvement to make the system better, and consult with the “next gen” teams on their work. As a result we have really good outcomes, really good relationships with the Implementation and Support groups outside Engineering, and pretty good team morale. Of course, general recognition, so that everyone sees and appreciates the team’s work, helps too.

Though in the end, we are also trying to outsource the sustaining work so that our engineers aren’t all sad from having to do it. (Our current team soldiers on because they know the company depends on them, but other engineers in the company don’t want to move over to do sustaining work.) So… There’s that. Our job is to juggle three things: the desire of employees to move off sustaining (plus the desire of other teams to get those employees), the development of the outsourcers’ expertise, and the need to maintain the legacy app with the excellence required.

From what I’ve learned from this, I believe a solid product renewal plan would involve:

a) Teams that own services, not apps or projects (ITSM/ITIL 101).

b) Those teams own the design, development, sustaining, operations, deployment, and whatever other task you want to apply to the product – from conception to delivery.

c) Every app and library and service and tool has to be owned by an appropriate service team, regardless of what engineer moved to what team or corporate reprioritization happened or whatever completely-legitimate corporate sob story you have.

d) Then if you need to make a major sea change, you employ the strangulation method to transition effort on a team, not using a separate sustaining team.

The risk with this approach is that a team gets filled up with sustaining work. But that is a chance for them to eat their own dog food. Go fix whatever’s causing that sustaining work! Retire the stuff that doesn’t make sense any more! Passing completed items off into a black hole for “sustaining” chews up just as much time and just as many resources; it just provides the convenient fiction that since you can’t see it, it must not be affecting your velocity.

What do you think?  How have you approached this problem?  Am I on crack? Let me know.


Filed under DevOps

Service Delivery Best Practices?

So… DevOps. DevOps vs NoOps has been making the rounds lately. At Bazaarvoice we are spawning a bunch of decentralized teams that don’t use that nasty centralized ops team but want to do it all themselves. This led me to contemplate how to express the things that Operations does in a way that turns them into developer requirements.

Because to be honest, a lot of our communication in the Ops world is incestuous. If one were to point a developer (or worse, a product manager) at the venerable infrastructures.org site, they’d immediately poop themselves and flee. It’s a laundry list of crap for neckbeards, but it spends no time on “Why do you want this?” or “What value will this bring to your customer?”

So for the “NoOps” crowd who want to do ops themselves – what’s an effective way to state these needs? I took previous work (ideas from the pretty comprehensive but waterfally “Systems Development Framework” Peco and I did for NI, and Chris from Bazaarvoice’s existing DevOps work) and here’s a cut. I’d love feedback. The goal is to have a small discrete set of areas (5-7), each with a small discrete set (5-7) of “most important” items. Most importantly, they are stated so that people understand that these aren’t “annoying things some ops guy wants me to do instead of writing 3l33t code” but are what they should be: customer-facing requirements like any other. Each could then be broken into subsidiary, more specific user stories (probably a whole lot of them, to be honest), but these could stand in as “epics” or “pillars” for making your Web thing work right.

Service Delivery Best Practices

Availability

  • I will make my service highly available, so that customers can use it constantly without issues
  • I will know about any problems with my service’s delivery to customers
  • I can restore my service quickly from any disaster or loss
  • I can make my data resistant to corruption from runtime issues
  • I will test my service under routine disruption to make it rugged and understand its failure modes

Performance

  • I know how all parts of the product perform and can measure performance against a customer SLA
  • I can run my application and its data globally, to serve our global customer base
  • I will test my application’s performance and I know my app’s limitations before it reaches them
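One way to make “measure performance against a customer SLA” concrete is to check a latency percentile against an agreed threshold. The sketch below is illustrative only: the function names, the sample latencies, and the 250 ms threshold are assumptions, not anything from an actual SLA.

```python
import math


def percentile(samples, pct):
    """Nearest-rank percentile: the smallest sample such that at least
    pct percent of all samples are less than or equal to it."""
    if not samples:
        raise ValueError("need at least one sample")
    ordered = sorted(samples)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


def meets_sla(latencies_ms, pct, threshold_ms):
    """True if the pct-th percentile latency is within the threshold."""
    return percentile(latencies_ms, pct) <= threshold_ms


if __name__ == "__main__":
    # Hypothetical response times in milliseconds.
    latencies = [120, 80, 95, 300, 110]
    print("median ms:", percentile(latencies, 50))
    print("p95 within 250 ms:", meets_sla(latencies, 95, 250))
```

Percentiles are used here rather than an average because one slow outlier can hide inside a healthy-looking mean.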

Support

  • I will make my application supportable by the entire organization now and in the future
  • I know what every user hitting my service did
  • I know who has access to my code and data and release mechanism and what they did
  • I can account for all my servers, apps, users, and data
  • I can determine the root cause of issues with my service quickly
  • I can predict the costs and requirements of my service ahead of time enough that it is never a limiter
  • I will understand the communication needs of customers and the organization and will make them aware of all upgrades and downtime
  • I can handle incidents large and small with my services effectively

Security

  • I will make every service secure
  • I understand Web application vulnerabilities and will design and code my services to not be subject to them
  • I will test my services to be able to prove to customers that they are secure
  • I will make my data appropriately private and secure against theft and tampering
  • I will understand the requirements of security auditors and expectations of customers regarding the security of our services

Deployment

  • I can get code to production quickly and safely
  • I can deploy code in the middle of the day with no downtime or customer impact
  • I can pilot functionality to limited sets of users
  • I will make it easy for other teams to develop apps integrated with mine
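Piloting functionality to a limited set of users is often done with a percentage rollout. Here is a minimal sketch, assuming users have stable string IDs; the function names and the feature name are hypothetical, and a real system would usually put this behind a feature-flag service.

```python
import hashlib


def rollout_bucket(user_id, feature):
    """Map a user to a stable bucket in [0, 100) for a given feature.

    Hashing the feature name together with the user ID means each
    feature gets a different slice of users.
    """
    digest = hashlib.sha256(f"{feature}:{user_id}".encode("utf-8")).hexdigest()
    return int(digest, 16) % 100


def is_enabled(user_id, feature, percent):
    """True if the user falls inside the first `percent` buckets."""
    return rollout_bucket(user_id, feature) < percent


if __name__ == "__main__":
    users = [f"user-{n}" for n in range(10_000)]
    pilot = [u for u in users if is_enabled(u, "new-review-widget", 5)]
    print(f"{len(pilot)} of {len(users)} users are in the 5% pilot")
```

Because the bucket depends only on the user and the feature, raising the percentage keeps everyone who was already in the pilot and adds new users on top.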

We also wanted to capture that teams just starting out can’t and shouldn’t do all of this… We struggled over whether these were “Goals” or “Best Practices” or what… On the one hand you should only put in, e.g., “as much security as you need,” but on the other hand there’s value in understanding the optimal condition.


Filed under DevOps

Beware the Deceptive SLA, My Friend

We’re trying to come to an agreement with a SaaS vendor about performance and availability service level agreements (SLAs).  I discussed this topic some in my previous “SaaS Headaches” post.  I thought it would be instructive to show people the standard kind of “defense in depth” that suppliers can have to protect against being held responsible for what they host for you.

We’ve been working on a deal with one specific supplier. As part of it, they’ll be hosting some images for our site. There’s a business team primarily responsible for evaluating their functionality etc.; we’re just in the mix as the faithful watchdogs of performance and availability for our site.

Round 1 – “What are these SLAs you speak of?” The vendor offers no SLA. “Unacceptable,” we tell the project team. They fret about having to worry about that along with the 100 other details of coming to an agreement with the supplier, but duly go back and squeeze them. It takes a couple of squeezes because the supplier likes to forget about this topic: send a list of five questions with one of them being “SLA,” and you get four answers back, ignoring the SLA question.

Round 2 – “Oh, you said ‘SLA’! Oh, sure, we have one of those.” We read the SLA and it only commits to their main host being pingable. Our service could be completely down, and it doesn’t speak to that. Back to our project team, who now, between the business users, procurement agent, and legal guy, need more urging to lean on the supplier. The supplier plays dumb for a while, and then…


Filed under Cloud, General