
DevOpsDays Silicon Valley 2013 Day 2 Liveblog

Woooo!  Last day of a week of conferencing.  DevOpsDays Day 1 was good and I have even more openspace topics I plan to propose next time.  As usual this is being livestreamed and will be viewable later as well at bmc.com/devops.

Sponsor Watch… Got to talk to our friends at PagerDuty (alert management) and Datadog (monitoring/dashboarding); we use them and love them. And I got to see Stormpath again; they first showed up at last DevOpsDays with a SaaS hosted auth solution (not like PingIdentity and Okta, they actually store the usernames/passwords for you; Les Hazlewood, the Apache Shiro guy, started it) and they’re growing quickly. Also talked to SaltStack, which makes Salt, a remote command execution framework. 10gen was here with a MongoDB SaaS backup solution (nice!) and a monitoring solution.

Leading the Horses to Drink

By Damon Edwards (@damonedwards) from DTO and now #SimplifyOps.

How to spread DevOps in enterprises.  There are silos, you know.  The term DevOps may work against you – it’s evangelical and already being overused/washed.

There is no ‘why’ other than the why of the business. Read your Deming/Collins/Four Steps to the Epiphany/etc.

Go ask people… Something.

Develop a common DevOps vision. Not a process because they’ll get blinders on. [Ed: I believe this is a false dichotomy – you should teach both. Vision without process lacks focus and process without vision lacks direction.  It’s like accuracy and precision.]

  1. See the system
  2. Focus on flow
  3. Recognize feedback loops

Do a value stream mapping – read Learning to See.  OK, this is the meat of the preso – very hard to read though.

Take your information flow and turn it into an artifact flow

Do a timeline analysis, find waste

Metrics.  Establish the metric chain: what matters to the business, driven down to a capability that influences it, driven down to an activity over which an individual can cause/influence outcomes.

Doesn’t require saying “devops.”

  1. Teach concepts
  2. Analysis
  3. Metrics chains
  4. Do something
  5. Iterate

Only takes like 3 days to bootcamp it. Then put in continuous improvement loops.

You can break silos by brute force if you’re the boss, but the misalignment will reassert itself. You have to change the alignment.

Q&A: Do it with everyone in the same room with whiteboards/post-its; it works better than getting fancy.

Beyond the Pretty Charts

Toufic Boubez from Metafor.  Cofounded Layer 7 and escaped when CA acquired them.

Came from a popular DevOpsDays Austin openspace – see the blog post!

  1. We’ve moved beyond static thresholds – or, at least, everyone thinks they suck. Need more dynamic analytics.
  2. Context is important – planned and known (or should be known) events cause deviation. Correlate events with metric gathering.
  3. Don’t just look at timelines. Check the thinking around Etsy’s Kale and Skyline; many evaluation methods assume a normal metric distribution, and that’s uncommon. Look at a histogram of any given data – latency, for example, is usually gamma, not gaussian (see the sketch after this list).
  4. Is all data important to collect? There’s argument over that.  Get it all and analyze vs figure out what’s important to not waste time.
  5. We all want to automate. Need detection before it’s critical. Can’t always have a human in the loop. Whipping out the control theory – open-loop and closed-loop control systems – to get self-healing systems we need our monitoring to diff current state against desired state and take action. [Ed: We experimented with this back at NI; we had SiteScope going to a homegrown system called “monolith” that would take actions. It was hard to account for all factors, though, and it was eventually discontinued.] Also supervised vs. unsupervised loops. [Ed: We might have kept monolith around if it had SMSed us and said “memory is high on this server, I believe I should restart the java process, is that OK?” and we could PagerDuty-like say yea or nay.]
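
To make point 3 concrete, here’s a quick numpy sketch (my own illustration, not from the talk): on gamma-shaped latency data, a “three sigma” threshold derived from a gaussian assumption fires about ten times more often than that assumption predicts.

    import numpy as np

    rng = np.random.default_rng(0)
    # Hypothetical latency samples with a gamma shape (long right tail)
    latency_ms = rng.gamma(shape=2.0, scale=50.0, size=100_000)

    mu, sigma = latency_ms.mean(), latency_ms.std()
    three_sigma = mu + 3 * sigma                  # a threshold that assumes a gaussian
    p_exceed = (latency_ms > three_sigma).mean()

    # A true gaussian puts ~0.13% of points above mu + 3*sigma; the gamma tail is
    # fatter, so the "3-sigma" alert fires far more often than the model expects.
    print(f"mean={mu:.0f}ms sigma={sigma:.0f}ms threshold={three_sigma:.0f}ms")
    print(f"above threshold: {p_exceed:.2%} (a gaussian would predict ~0.13%)")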

How much data do you need?  No more resolution than twice your highest frequency of interest (Nyquist-Shannon). Most algorithms will smooth/average/etc. anyway.
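
A tiny illustration of why (my own example, not from the talk): a metric that oscillates with a 2-minute period needs polling at least every 60 seconds; a typical 5-minute poll aliases the oscillation away entirely.

    import numpy as np

    # "True" signal: a metric oscillating with a 2-minute period, at 1s resolution.
    t = np.arange(0, 3600, 1.0)
    signal = np.sin(2 * np.pi * t / 120)

    fast = signal[::30]     # 30s polling (above the Nyquist rate) sees the swings
    slow = signal[::300]    # 300s polling happens to hit only zero crossings: aliased

    print("30s poll peak-to-peak: ", fast.max() - fast.min())    # ~2.0
    print("300s poll peak-to-peak:", slow.max() - slow.min())    # ~0.0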

Q&A: Are control systems more appropriate for small not large systems?  No – just like in industry, as long as you design for that then it’s not just for toys.

And now I step in for the vendor pitch for Riverbed.  Agile Admin Peco left yesterday and the other Riverbed booth guys made themselves scarce, so I did their shout-out for them. They have Zeus EC2 LBs, the Aptimize web front end optimizer, and the OPNET AppInternals Xpert APM tool!  Very cool.

Identifying Waste in your Build Pipeline

Scott Turnquest from Thoughtworks

Tools: Value stream mappings, fishbone analysis, “5 Whys”

So how do we do that value stream mapping? Here we go!  [Ed: Oh, this is nice, I was sad that in the DTO presentation they mentioned them and threw some up but didn’t really dive down into one.]

A day of analysis of one small feature – a day of wait, 4 days of dev, 2 minutes of wait, 1 hour of acceptance tests, 4 hours of deploy, 1 day in staging, 4 hours to deploy to prod. Note the waste areas – “4 days in dev?  Really?” – and the long-ass deploy windows. [Ed: Our value stream looks depressingly like this.] Process cycle efficiency of 75% (value creation time/total time).
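
For the curious, here’s the arithmetic as I understood it (a sketch; which steps count as value-creating is my assumption, and the talk’s categorization evidently differed a bit to land on 75%):

    # Timeline from the talk, converted to hours (2 minutes rounded to 0.03h).
    HOURS = {
        "wait for dev": 24, "dev": 96, "wait for build": 0.03,
        "acceptance tests": 1, "deploy to staging": 4,
        "soak in staging": 24, "deploy to prod": 4,
    }
    VALUE_CREATING = {"dev", "acceptance tests", "deploy to staging", "deploy to prod"}

    total = sum(HOURS.values())
    value = sum(h for step, h in HOURS.items() if step in VALUE_CREATING)
    print(f"process cycle efficiency = {value / total:.0%}")   # ~69% with these assumptions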

So to determine the source of those waste areas, use the fishbone diagram. They had long feedback cycles from the structure of the code and the build/deploy pipelines. Couldn’t test without AWS, couldn’t test individual components, provisioning was serial, and repos were flaky.

Fix the underlying cause with the most impact first – the deploy pipelines. Reduce the failure rate of deployments: half were failing, and failing slowly. They moved to AMI baking for reliability. [Ed: They said I was crazy a couple years ago when I said this, “no it’s a foil ball…” Bake when you can!] This got them from 4 hours to 2 hours, and then they parallelized and got down to 25 minutes (sketched below). This cut down the staging and prod deploys but also the dev time. Process cycle efficiency up to 83%.
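
A minimal sketch of the parallelization idea (hypothetical host names, not ThoughtWorks’ actual tooling): once each host’s deploy is a fast, identical launch from a pre-baked image, running them concurrently collapses wall-clock time to roughly one host’s worth.

    from concurrent.futures import ThreadPoolExecutor

    HOSTS = ["web1", "web2", "web3", "web4"]    # hypothetical fleet

    def deploy(host: str) -> str:
        # Stand-in for "launch an instance from the pre-baked AMI, swap into the LB".
        return f"{host}: deployed"

    # Serial provisioning pays the per-host cost N times; concurrent deploys pay it once.
    with ThreadPoolExecutor(max_workers=len(HOSTS)) as pool:
        for result in pool.map(deploy, HOSTS):
            print(result)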

The 5 Whys root cause analysis method: they figured out that manual, hard-to-automate deployments were at the root, and automated them – don’t be afraid to restructure/redesign when complexity gets in the way.

Analysis techniques are not just for analysts!

Read Jez Humble’s “Continuous Delivery”, Poppendieck’s “Lean Software Development”/“Implementing Lean Software Development”, and Derby/Larsen’s “Agile Retrospectives”.

Clusters, developers, and the complexity in Infrastructure Automation

Antoni Batchelli of PalletOps. Complexity, essential and accidental. Building a system is simple, but systems are complex at runtime, and “complexity of a system is the degree of difficulty in predicting the properties of the system given the properties of the system’s parts.”

In DevOps we see infrastructure-aware software and concepts moving up into dev processes.

Devs want to run “their own” cluster with all the setups they need – production-like, but with specific versions/timings/data/code/etc. They don’t care about infra details but want consistent envs/code.

Software has to be infrastructure aware now to autoscale, self-heal, etc. The app is the best informed actor to make/orchestrate infra decisions.

[Ed: This late into a conference week, I get a little irritated about presentations that are not really clear *why* they are telling you what they’re telling you.]

He hates incidental complexity. Me too.

OK, maybe we’re getting to a thesis. Let people solve problems where they are less complex: at the right level of abstraction. Build layers of abstraction – infrastructure, OS, services, actions. Make them into modules, make them functional and polymorphic.
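
Here’s how I’d render that layering in Python (my interpretation, not PalletOps’ actual API): each layer talks only to the one below it, and polymorphism means a dev cluster and a prod cluster differ only in the node type you pass in.

    from typing import Protocol

    class Node(Protocol):                     # infrastructure layer: one small contract
        def run(self, cmd: str) -> str: ...

    class AwsNode:
        def run(self, cmd: str) -> str:
            return f"[aws] {cmd}"             # would really SSH to an EC2 instance

    class LocalNode:
        def run(self, cmd: str) -> str:
            return f"[local] {cmd}"           # same contract, different substrate

    def install_package(node: Node, pkg: str) -> str:     # OS layer
        return node.run(f"apt-get install -y {pkg}")

    def start_web_service(node: Node) -> str:             # service layer
        install_package(node, "nginx")
        return node.run("systemctl start nginx")

    # Action layer: the same "deploy" works against any substrate.
    for node in (AwsNode(), LocalNode()):
        print(start_web_service(node))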

Ignites!

James Wickett (@wickett) on Rugged DevOps and gauntlt for security + DevOps. gauntlt is a gem for continuous security testing as part of your build cycle. BDD your app’s security! Knock Out!!! go to gauntlt.org to get started.

Karthik Gaekwad (@iteration1) on DevOps Culture in the CIA. DevOps is culture/automation/measurement/sharing. Seen Zero Dark Thirty? Well, the true story behind it details the CIA’s transformation from a split between analysts and operatives, especially via the Sisterhood, a group of female analysts tracking Bin Laden since 1980. Post-9/11 there was a mass reorg to become more tactical – analysts became Targeters and worked hand in hand with Operatives. Same kind of silo busting. The Phoenix Project is Zero Dark Thirty for DevOps!

Dave Mangot (@davemangot) for DevOps Do’s and Don’ts from Salesforce.  Do give everyone the tools they need to do their jobs. Don’t make ops the constraint. Do lots of communicating. Don’t forget to include everyone. Do get ops involved early. Don’t create a front door (loaded) process. Do have integration environments. Don’t forget config management. Do have blameless post-mortems. Don’t use the Phoenix Project as a bludgeon. Do use Agile as a cultural tool. Don’t rely on tools to change culture. Do get executive sponsorship. Don’t do shadow IT. Do use Damon Edwards’ levers. Don’t just lecture, it’s a participation sport. Do structure the org around delivery. Don’t make separate DevOps teams or jackets. Do get the whole company involved; DevOps is for everyone.

Jonathan Thorpe – Preventing DevOps success. Not planning for scale. Not having unit tests. Not designing automated tests to scale. Not managing your capacity. Not using your resources effectively. Not using same deployment process for all environments. Not knowing what/where/when/who (activity tracking). Getting covered in ants.

DevOps is the future – John Esser from ancestry.com. What keeps CIOs up at night? Besides ants? IT. Need time to value. Transform mindset/processes/tools/etc. Strangler pattern.

DevOps productivity survey by Oliver White from ZeroTurnaround. DevOps-oriented teams spend more time on infrastructure improvements and less on firefighting and support. Problem recoveries are shorter. They release software faster. They use more custom tools. They make love for a longer time. @rebel_labs

Nathen Harvey on leveling up your skills. Quit!  Go to a conference. Try new things. Do a project somewhere. Always be interviewing.


Velocity 2013 Day 3: DevOps and metrics

We’re talking about the DevOps survey with Gene Kim, James Turnbull, and Jez Humble.

4039 survey responses!

Lessons learned
– Don’t change questions midway in a survey
– Get a data analyst for survey analysis

Key findings
– devops teams are more agile: 30x more deployments, 8000x shorter lead times
– devops teams are more reliable
– most teams use version control – 89%
– most teams use automated code deployments – 82%
– the longer you do devops, the better you get!

Hilarious John Vincent quote on devops: [image]

Measuring culture
Trust and verify

26% of folks who responded to the survey were from the enterprise, and 16% were from companies of 10k+ employees.
The biggest barrier to devops was culture, because people didn’t get it – whether it was your manager, your team, or folks outside the group. Tell people more, and wear more devops shirts!!!

DevOps continues to be a culture issue more than a tools-and-processes issue. James Turnbull wants us to go out there, talk to people, and work on our people skills!

And join the devops google+ community!


Why Your Monitoring Is Lying To You

In my Design for Failure article, I mentioned how many of the common techniques we use to allegedly detect failure really don’t.  This time, we’ll discuss your monitoring and why it is lying to you.

Well, you have some monitoring, don’t you? Couldn’t it tell you if an application is down? Obviously not if you are just doing old SNMP/box-level monitoring, but you’re all DevOps and you know you have to monitor the applications, because that’s what counts. But even then, there are common antipatterns to be aware of.

Synthetic Monitoring

Dirty secret time: most application monitoring is “synthetic,” meaning it hits a specific URL or set of URLs once in a while, often 5-10 minutes apart. Also, since there are a lot of transient failures out there on the Internet, most ops groups set their monitors so they have to see 2-5 consecutive failures before alerting – because ops teams don’t like being woken up at 3 AM because an application hiccuped once (or the Internet hiccuped on the way to the application). If the problem happens on only 1 of every 20 hits, and you have to see three errors in a row to alert, then I’ll leave it to your primary school math skills to determine how likely it is you’ll catch the problem.
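
To spare your primary school math skills, here’s the back-of-envelope version (approximate, since consecutive three-poll runs overlap):

    p_fail = 1 / 20                   # the fault affects 1 in 20 hits
    p_three_in_a_row = p_fail ** 3    # what it takes to trip the alert
    polls_per_day = 24 * 60 // 5      # one synthetic poll every 5 minutes

    p_alert_today = 1 - (1 - p_three_in_a_row) ** polls_per_day
    print(f"P(tripping the alert at any given poll) = {p_three_in_a_row:.6f}")   # 1 in 8000
    print(f"P(alerting at all in a whole day)       = {p_alert_today:.1%}")      # ~3.5%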

You can improve on this a little bit, but in the end synthetic monitoring is mainly useful for coarse uptime checking and performance trending.

Metric Monitoring

OK, so synthetic monitoring is only good for rough up/down stuff, but what about my metric monitoring? Maybe I have a spiffier tool that is continuously pulling metrics from Web servers or apps that should give me more of a continuous look.  Hits per second over the last five minutes; current database space, etc.

Well, I have noticed that metric monitors, with startling regularity, don’t really tell you whether something is up or down, especially historically. If you pull current database space and the database is down, you’d think there would be a big nasty gap in your chart, but many tools don’t do that – either they report the last value seen, or, if it’s a timing report, they happily report the timing of the errors. Unless you go to the trouble to say “if the thing is down, set a value of 0 or +infinity or something,” you can have a failure, then go back and look at your historical graphs and see no sign anything was wrong.
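
One way around this, sketched below with a made-up database client (every tool wires this differently): on failure, emit a value that is visibly wrong – NaN, zero, or +infinity – instead of letting the tool repeat the last good reading.

    import math

    class FlakyDb:
        """Stand-in for a real database client (hypothetical, for illustration)."""
        def __init__(self, up: bool):
            self.up = up

        def free_space_mb(self) -> float:
            if not self.up:
                raise ConnectionError("database unreachable")
            return 1234.0

    def poll_db_space(db: FlakyDb) -> float:
        # Emit NaN on failure rather than silently re-reporting the last good value,
        # so an outage leaves a visible hole (or spike) in the historical chart.
        try:
            return db.free_space_mb()
        except ConnectionError:
            return math.nan   # some tools prefer 0 or +inf; "visibly wrong" is the point

    print(poll_db_space(FlakyDb(up=True)))     # 1234.0
    print(poll_db_space(FlakyDb(up=False)))    # nan -> a gap in the graph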

Log Monitoring

Well, surely your app developers are logging if there’s a failure, right? Unfortunately logging is a bit of an art, and the simple statement “You should log the overall success or failure of each hit to your app, and you should log failures on any external dependency” can be… reinterpreted in many ways. Developers sometimes don’t log all the right things, or even decide to suppress certain logs.

You should always log everything.  Log it at a lower log level, like INFO, if it’s routine; then at least it can be reviewed if needed and turned into a metric for trending via tools like Splunk. My rules are simple (sketched in code after the list):

  • Log the start and end of each hit – are you telling the client success or failure? Don’t rely on the Web server log.
  • Log every single hit to an external dependency at INFO
  • Log every transient failure at WARN
  • Log every error at ERROR
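
Here’s what those rules might look like with Python’s stdlib logging (a sketch against a made-up payment API, not anyone’s production code):

    import logging

    logging.basicConfig(level=logging.INFO)
    log = logging.getLogger("myapp")

    def handle_request(request_id: str, call_payment_api) -> bool:
        log.info("request %s: start", request_id)              # start of the hit
        for attempt in range(1, 4):
            log.info("request %s: payment API call, attempt %d", request_id, attempt)
            try:
                call_payment_api()
                log.info("request %s: success", request_id)    # end of the hit
                return True
            except ConnectionError as e:
                log.warning("request %s: transient failure, retrying: %s", request_id, e)
        log.error("request %s: failed after all retries", request_id)
        return False

    handle_request("req-42", lambda: None)    # logs start, one API call, success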

Real User Monitoring

Ah, this is more like it.  The alleged Holy Grail of monitoring is real user monitoring, where you passively watch the transactions coming in and out and log them.  On the one hand, you don’t have to rely on the developers to log; you can log despite them.  But you don’t get as much insight as you’d think. If the output from the app isn’t detectable as an error, the monitoring doesn’t help.  A surprising amount of the time, failures are not thrown as a 500 or other expected error code, and checking for content within a payload is often fragile.

Also, RUM tools tend to be network-sniffer based, which doesn’t work well in the cloud or in many network topologies.  And you get so much data that it can be hard to find the real problems without spending a lot of time on it.
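
That said, even a simple synthetic check can hedge against some of this by testing the payload as well as the status code. A minimal sketch (the success marker is hypothetical and, per the above, inherently fragile):

    import urllib.request

    SUCCESS_MARKER = "</html>"   # hypothetical: a string only a healthy page contains

    def synthetic_check(url: str) -> bool:
        # Check both status and content, since apps sometimes serve an error page
        # with a 200 status code; exactly what happens in the war story below.
        try:
            with urllib.request.urlopen(url, timeout=10) as resp:
                body = resp.read().decode("utf-8", errors="replace")
                return resp.status == 200 and SUCCESS_MARKER in body
        except OSError:          # DNS failures, timeouts, HTTP-level errors, etc.
            return False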

No, Really – One Real World Example

We had a problem just this week that managed to slip through all our layers of monitoring – luckily, our keen eyes caught it in preproduction. We had been planning a big app release and had been setting up monitoring for it. It seemed like everything was going fine. But then the back end databases (SQL Azure in this case) had a pretty long string of failures for about 10 minutes, which brought our attention to the issue. As I looked into it, I realized it was very likely we would have seen smaller spates of SQL Azure connection issues, and thus application outage, before – so why hadn’t we?  I investigated.

We don’t have any good cloud-compliant real user monitoring in place yet.  And the app was throwing a 200 HTTP code on an error (the error page displayed said 401, but the actual HTTP code was 200), so many of our synthetic monitors were fooled. Plus, the problem was usually occasional enough that hitting the app once every 10 minutes from Cloudkick didn’t detect it. We fixed that bad status code, and looked at our database monitors. “I know we have monitors directly on the databases, why aren’t those firing?”

Our database metric monitors through Cloudkick, I was surprised to see, had lovely normal-looking graphs after the outage. I provoked another outage in test to see, and sure enough, though the monitors “went red,” for some reason they were still providing what seemed to Cloudkick like legitimate data points, and once the monitors “went green,” nothing about any of the metric graphs indicated anything unusual! In other words, the historical graphs had legitimate-looking data and did not reveal the outage. That’s a big problem. So we worked on those monitors.

I still wanted to know if this had been happening before.  “We use Splunk to aggregate our logs, I’ll go look there!” Well, there were no error lines in the log that would indicate a back end database problem. Upon inquiring, I heard that since SQL Azure connection issues are a known and semi-frequent problem, logging of them is suppressed, since we have retry logic in place.  I recommended that we log all failures – ones that are going to be retried logged at a lower severity level like WARN, with ERROR reserved for failures after the whole spread of retries. I declared this a showstopper bug that had to be fixed before release; not everyone was happy with that, but sometimes DevOps requires tough love.

I was disturbed that we could have periods of outage going unnoticed despite our investment in synthetic monitoring, metrics, and log searching.  When I looked back at all our metrics over periods of known outage and they all looked good, I admit I became somewhat irate. We fixed it, and I’m happy with our detection now, but I hope this is instructive in showing how bad assumptions, and not fully understanding the pros and cons of each instrumentation approach, can leave “stacked holes” that profoundly compromise your view of your service!
