Tag Archives: ops

Velocity 2013 Day 3 Liveblog: The Keynotes

Day Three of a week of convention. Convention Humpday. The day it stops being a mini-vacation and you start earning your salary again.

As usual the keynotes are being livestreamed, so this liveblog is perhaps more for those who want the Cliff’s Notes summary later. Yesterday’s keynotes were certainly compressible, so here we go! Also follow along on the Twitter hashtag #velocityconf to see what people are saying about the show.

Clarification on Keynote’s RUM – they announced it yesterday but I was like “haven’t they been trying to sell this to me for two years?” Apparently they had been, but it was in beta. And congrats to fast.ly, who just got a $10M round of funding!

Winners of the survey follies…  Souders’ book, Release It!/Web Operations, and so on. Favorite co-host: Souders!

“Send us your comments! Except about the wi-fi!” Actually it’s working OK so far this morning. The show is good, though they added yet another ‘vendor track,’ which is unfortunate. They have a front end dev track, an ops track, and a mobile track. Last year they added a fourth track – “vendor talks.” This year there’s a fifth track – “more vendor talks.” Boo, let’s make space for real content.

Gamedays On The Obama Campaign

@dylanr on revamping the Obama site 18 mos. before election day. 40 engineers in 7 teams, ~300 repos, 200 deployed products, 3000 servers (AWS), millions of hits a day, a million volunteers, 8000 staff. He had redone the Threadless site and was on to the next big thing!

Plan, build, execute, get out the vote.

Planning is the not-fast dreamtime. But for the tech folks, it means starting to build the blocks.

Build is when everyone starts building teams and soliciting $. Then tech builds the apps.

Execute is when everyone starts using it all, more and more of everything. Tech starts getting feedback (been building blind till now).

Get out the vote – final 4-day sprint. For tech, this means scale. A couple orders of magnitude in that span.

Funny picture of a “Don’t Fuck This Up” cake.  [Ed: That was my second standing order for my old WebOps team.  1. Make It Happen, 2. Don’t Fuck Up, 3. There’s the right way, the wrong way, and the standard way.]

They got one shot at this. So how do you do it?

Talk to your stakeholders – they want every feature ever, but working is better. No feature matters more than having a working app. So frame the conversation as “if things fail, what still needs to work?” Graceful degradation.

Failure – you can try to prevent it, and you can learn to deal with it. You should do some of the former, but don’t delude yourself into skipping the latter. And it’s not just the tech – make the team resilient to failure via practice.

“Game day” 6 weeks pre-election.  Prod-size staging, simulate, break and react. Two week hardening sprint and then on game day had a long agenda of “things to break.”  He lied about the order and timing though.

DevOps (aggressors) vs. engineers (defenders), organized in Campfire, maintaining an updated Google Doc.

They learned that there were broken things they thought were fixed, what failure really looks like, how things fail, and how to fix it.

Made runbooks

They had a stressful day, went home, came back in – and databases started failing! AWS failure.  Utilized the runbooks and were good.

Build resilient apps by planning for failure, defining what matters, making your plan clear to stakeholders, and degrading gracefully down to the things that matter. And build resilient teams – practice failing, learn from it, and use your instruction manual.

Ops School

Check it!  Etsy and O’Reilly video classes on  how to be an ops engineer! Oh. They’ll be for sale, I got excited for a minute thinking it would be a free way to get a whole generation of ops engineers trained from schools etc. that know nothing about ops.  Guess not.  Damn.

W3C Web Performance Working Group Status

Arvind from Google gives us an update on the W3C. The web perf working group is working on navigation timing, user timing, page visibility, resource priorities, client side error logging, and other fundamental standards that will help with web performance measurement and improvement.  Good stuff if very boringly presented, but that’s standards groups for you.

Eliminating Web Site Performance Theft

Neustar tells us the world is online and brand reputation and revenue are at stake.  Quite.

Performance can affect your reputation and revenue!  Quite.

This talk is a great one for vaguely befuddled director level and above types, not the experts at this conference.  Twitter agrees.

Mobitest, the Latest

Akamai has cell nodes and devices for Mobitest, you can become a webpagetest node on your phone!  If you have an unlimited data plan 😛 The app is waiting on release in the App Store. See mobitest.akamai.com for more.

If You Don’t Understand People, You Don’t Understand Ops

Go to techleadershipnews.com and get Kate’s newsletter!

Influence – Gangster style!

How do you earn respect and influence without authority? Even if you’re not a “manager” you need to be able to do this to get things done.

You want people to hear what you have to say – need 3 things.

  • Accountability
  • Everyone is your ally
  • Reciprocity

Accountability – lead by example. Be the person who can get “it” done. Always follow through on commitments – that generates the graph of trust. Treat everyone with respect. Be a reliable person. Be a superstar – always be hustling.

Everyone is an ally – make them your friend. It’s a small world. Who was nice to you? Who made you feel bad?  How about in return? Make every interaction positive.

Reciprocity – all about giving. You get what you give. What is your currency?  What do you have of value with others and how can you share it? How can you improve the lives of other people?

Success is about people. Influence is success.  Yay Kate!

Lightning Demos

httparchive and BigQuery

@igrigorik from Google on httparchive.org, which crawls the Web twice a month and keeps stats. It’s all freely available for download – a subset is shown online at the site.

Google built Dremel, which is an interactive ad hoc query system for analysis of read-only nested data. So they put the HTTP Archive data into BigQuery! Go to bigquery.cloud.google.com and it’s in there. Most common JS framework (jQuery, btw)? Median page load times? You can query it all.

In Google Docs you can script and make them interoperate (like sending an email when a spreadsheet gets filled in). He created a dynamic query straight to BigQuery from a spreadsheet. Oh look, a dynamic graph! See bit.ly/ha-query for more!
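If you’d rather hit it from code than from the web console, something along these lines works with Google’s Python BigQuery client – note that the table and column names here are my guesses at the HTTP Archive dataset layout, so check the console for the real ones:

```python
# Sketch of querying the HTTP Archive dataset from Python. Requires the
# google-cloud-bigquery package plus a GCP project with auth set up; the
# table and column names below are assumptions, not the verified layout.
from google.cloud import bigquery

client = bigquery.Client()
sql = """
    SELECT APPROX_QUANTILES(onLoad, 2)[OFFSET(1)] AS median_onload_ms
    FROM `httparchive.summary_pages.2013_06_01`
"""
for row in client.query(sql).result():
    print("Median page load time:", row.median_onload_ms, "ms")
```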

Patrick Meenan on Webpagetest

You can now do a tcpdump with your test! (Advanced tab). He shows an analysis with cloudshark – wireshark in the cloud! Nice.

Patrick Lightbody from New Relic

Real user monitoring is cool. newrelic.com/platform

Steve Reilly from Riverbed/Aptimize

An application-aware infrastructure? We have abstractions for some layers – middleware, compute, storage – but not really for transport. Software-defined networking will be the next “washing” trend. It’s just a transport abstraction. Then we can make the infrastructure a function of the application. “Middle boxes” are now app features – GSLB, WAFs, etc.

Slightly confusing at this point – a lot of abstract words and not enough concrete.  Which is better than a thinly disguised product pitch, so still better than yesterday!

Decentralized decisionmaking… Location is no longer a constraint but a feature.  This makes me think of Facebook’s talk yesterday with Sonar and rewriting DNS/GTM/LB.

Jonathan LeBlanc from Paypal on API Design

Started with SOAP/XML SOA. But then the enlightenment happened and REST made your life less sucky and devs more efficient.

“Sure we support REST! You can GET and POST!” Boo. And also religious adherence to REST principles instead of innovation.

Our lessons learned: Lower perceived latency, use HTTP properly, build in automation, offload complexity.

With no details this was very low value.

 


Operations Level Up Storify Notes from @wickett

These are some notes from the Operations Level Up talk at the Velocity 2013 Conference. The Agile Admin crew is out at Velocity Conference this year and live-blogging as we go.


Why Your Monitoring Is Lying To You

In my Design for Failure article, I mentioned how many of the common techniques we use to allegedly detect failure really don’t.  This time, we’ll discuss your monitoring and why it is lying to you.

Well, you have some monitoring, don’t you – couldn’t it tell you if an application is down? Obviously not if you are just doing old SNMP/box-level monitoring, but you’re all DevOps and you know you have to monitor the applications, because that’s what counts. But even then, there are common antipatterns to be aware of.

Synthetic Monitoring

Dirty secret time: most application monitoring is “synthetic,” which means it hits a specific URL or set of URLs once in a while, often 5-10 minutes apart. Also, since there are a lot of transient failures out there on the Internet, most ops groups set their monitors so they have to see 2-5 consecutive failures before alerting – because ops teams don’t like being woken up at 3 AM because an application hiccuped once (or the Internet hiccuped on the way to the application). If the problem happens on only 1 of every 20 hits, and you have to see three errors in a row to alert, then I’ll leave it to your primary school math skills to determine how likely it is you’ll catch the problem.
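To spell out that math, here’s a quick back-of-the-envelope sketch using the example numbers above; it treats each run of three checks as independent, so it’s a rough estimate, not an exact one:

```python
# Rough odds of a synthetic monitor catching an intermittent problem,
# using the example numbers above (not real monitoring data).
failure_rate = 1 / 20         # problem shows on 1 of every 20 hits
consecutive_needed = 3        # alert only after 3 failures in a row
check_interval_min = 10       # one synthetic check every 10 minutes

# Chance that any given run of 3 consecutive checks all fail:
p_alert = failure_rate ** consecutive_needed   # 0.000125, i.e. ~1 in 8,000

expected_checks = 1 / p_alert
expected_days = expected_checks * check_interval_min / (60 * 24)
print(f"Alert chance per 3-check run: {p_alert:.6f}")
print(f"Expected wait before the monitor ever fires: ~{expected_days:.0f} days")
```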

You can improve on this a little bit, but in the end synthetic monitoring is mainly useful for coarse uptime checking and performance trending.
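One cheap improvement, for what it’s worth, is to have the check inspect the response body as well as the status code, since apps sometimes serve their error page with a 200 (we hit exactly that in the real-world example later in this post). A rough sketch, with a placeholder URL and placeholder error-page markers:

```python
import urllib.request

URL = "https://app.example.com/login"                     # placeholder URL
ERROR_MARKERS = ("401", "an error has occurred")          # placeholder error-page text

def check_app():
    try:
        with urllib.request.urlopen(URL, timeout=10) as resp:
            body = resp.read().decode("utf-8", errors="replace")
    except Exception as exc:                               # non-2xx, timeouts, DNS, etc.
        return False, f"request failed: {exc}"
    # A 200 isn't proof of health if the app renders its error page with a 200.
    if any(marker in body.lower() for marker in ERROR_MARKERS):
        return False, "got a 200 but the body looks like the error page"
    return True, "ok"

print(check_app())
```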

Metric Monitoring

OK, so synthetic monitoring is only good for rough up/down stuff, but what about my metric monitoring? Maybe I have a spiffier tool that is continuously pulling metrics from Web servers or apps that should give me more of a continuous look.  Hits per second over the last five minutes; current database space, etc.

Well, I have noticed that metric monitors, with startling regularity, don’t really tell you if something is up or down, especially historically. If you pull current database space and the database is down, you’d think there would be a big nasty gap in your chart, but many tools don’t do that – either they report the last value seen, or if it’s a timing report they happily report the timing of the errors. Unless you go to the trouble to say “if the thing is down, set a value of 0 or +infinity or something,” you can sometimes have a failure, then go back and look at your historical graphs and see no sign there’s anything wrong.
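One way around that is to have the collector emit an explicit “collection failed” value rather than silently skipping the data point. Here’s a minimal sketch assuming a Graphite-style plaintext metrics endpoint – the host, metric names, and collection function are placeholders, not any particular product’s API:

```python
import socket
import time

CARBON_HOST, CARBON_PORT = "graphite.example.com", 2003   # placeholder endpoint

def collect_db_space_gb():
    """Placeholder for your real collection logic; raises if the DB is unreachable."""
    raise ConnectionError("database is down")

def send_metric(path, value):
    # Graphite plaintext protocol: "<metric.path> <value> <unix timestamp>\n"
    line = f"{path} {value} {int(time.time())}\n"
    with socket.create_connection((CARBON_HOST, CARBON_PORT), timeout=5) as sock:
        sock.sendall(line.encode("ascii"))

try:
    send_metric("db.space_free_gb", collect_db_space_gb())
    send_metric("db.collection_ok", 1)
except Exception:
    # Record the failure explicitly so the outage shows up on historical graphs,
    # instead of leaving the last good value (or nothing) on the chart.
    send_metric("db.collection_ok", 0)
```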

Log Monitoring

Well surely your app developers are logging if there’s a failure, right? Unfortunately logging is a bit of an art, and the simple statement “You should log the overall success or failure of each hit to your app, and you should log failures on any external dependency” can be… reinterpreted in many ways. Developers sometimes don’t log all the right things, or even decide to suppress certain logs.

You should always log everything. Log it at a lower log level, like INFO, if it’s routine, but then at least it can be reviewed if needed and can be turned into a metric for trending via tools like Splunk. My rules are simple (a quick sketch follows the list):

  • Log the start and end of each hit – are you telling the client success or failure? Don’t rely on the Web server log.
  • Log every single hit to an external dependency at INFO
  • Log every transient failure at WARN
  • Log every error at ERROR
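Here’s what those rules look like in a minimal Python sketch – the payment-service call and the TransientError class are hypothetical stand-ins for whatever your app actually depends on:

```python
import logging
import random

logging.basicConfig(level=logging.INFO)
log = logging.getLogger("myapp")

class TransientError(Exception):
    """Stand-in for a retryable dependency error (e.g. a dropped connection)."""

def call_payment_service(request_id):
    """Stand-in for a call to an external dependency."""
    if random.random() < 0.1:
        raise TransientError("connection reset")
    return {"status": "approved"}

def handle_request(request_id):
    log.info("request %s: start", request_id)                          # start of each hit
    try:
        log.info("request %s: calling payment service", request_id)    # every dependency hit at INFO
        result = call_payment_service(request_id)
        log.info("request %s: success", request_id)                    # end of each hit, with outcome
        return result
    except TransientError as exc:
        log.warning("request %s: transient failure: %s", request_id, exc)  # transient failures at WARN
        raise
    except Exception:
        log.error("request %s: failed", request_id, exc_info=True)     # everything else at ERROR
        raise
```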

Real User Monitoring

Ah, this is more like it. The alleged Holy Grail of monitoring is real user monitoring, where you passively look at the transactions coming in and out and log them. On the one hand, you don’t have to rely on the developers to log – you can log despite them. But you don’t get as much insight as you’d think. If the output from the app isn’t detectable as an error, then the monitoring doesn’t help. A surprising amount of the time, failures are not thrown as a 500 or other expected error code. And checking for content within a payload is often fragile.

Also, RUM tools tend to be network-sniffer based, which doesn’t work well in the cloud or in many network topologies. And you get so much data that it can be hard to find the real problems without spending a lot of time on it.

No, Really – One Real World Example

We had a problem just this week that managed to successfully slip through all our layers of monitoring – luckily, our keen eyes caught it in preproduction. We had been planning a big app release and had been setting up monitoring for it. It seemed like everything was going fine. But then the back end databases (SQL Azure in this case) had a pretty long string of failures for about 10 minutes, which brought our attention to the issue. As I looked into it, I realized that it was very likely we would have seen smaller spates of SQL Azure connection issues, and thus application outages, before – so why hadn’t we? I investigated.

We don’t have any good cloud-compliant real user monitoring in place yet.  And the app was throwing a 200 http code on an error (the error page displayed said 401, but the actual http code was 200) so many of our synthetic monitors were fooled. Plus, the problem was usually occasional enough that hitting once every 10 minutes from Cloudkick didn’t detect it. We fixed that bad status code, and looked at our database monitors. “I know we have monitors directly on the databases, why aren’t those firing?”

Our database metric monitors through Cloudkick, I was surprised to see, had lovely normal-looking graphs after the outage. I provoked another outage in test to see, and sure enough, though the monitors ‘went red,’ for some reason they were still providing what seemed to Cloudkick like legitimate data points, and once the monitors ‘went green,’ nothing about any of the metric graphs indicated anything unusual! In other words, the historical graphs had legitimate-looking data and did not reveal the outage. That’s a big problem. So we worked on those monitors.

I still wanted to know if this had been happening.  “We use Splunk to aggregate our logs, I’ll go look there!” Well, there were no error lines in the log that would indicate a back end database problem. Upon inquiring, I heard that since SQL Azure connection issues are a known and semi-frequent problem, logging of them is suppressed, since we have retry logic in place.  I recommended that we log all failures, with ones that are going to be retried simply logged at a lower severity level like WARN, but ERROR on failures after the whole spread of retries. I declared this a showstopper bug that had to be fixed before release – not everyone was happy with that, but sometimes DevOps requires tough love.
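In code terms, the behavior I was asking for looks roughly like this – a sketch only, with a placeholder retry count rather than our actual SQL Azure data layer:

```python
import logging
import time

log = logging.getLogger("myapp.db")

def with_retries(operation, attempts=3, delay_sec=1):
    """Run operation(), retrying on failure.

    Each failed attempt that will be retried is logged at WARN; only
    exhausting every retry is logged at ERROR. Nothing is suppressed.
    """
    for attempt in range(1, attempts + 1):
        try:
            return operation()
        except Exception as exc:
            if attempt < attempts:
                log.warning("attempt %d/%d failed, retrying in %ss: %s",
                            attempt, attempts, delay_sec, exc)
                time.sleep(delay_sec)
            else:
                log.error("all %d attempts failed: %s", attempts, exc, exc_info=True)
                raise
```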

I was disturbed that we could have periods of outage that were going unnoticed despite our investment in synthetic monitoring, pulling metrics, and searching logs. When I looked back at all our metrics over periods of known outage and they all looked good, I admit I became somewhat irate. We fixed it and I’m happy with our detection now, but I hope this is instructive in showing you how bad assumptions, and not fully understanding the pros and cons of each instrumentation approach, can leave “stacked holes” that profoundly compromise your view of your service!


Upcoming Free Velocity WebOps Web Conference

O’Reilly’s Velocity conference is the only generalized Web ops and performance conference out there.  We really like it; you can go to various other conferences and have 10-20% of the content useful to you as a Web Admin, or you can go here and have most of it be relevant!

They’ve been doing some interim freebie Web conferences and there’s one coming up. Check it out. They’ll be talking about performance functionality in Google Webmaster Tools, MySQL, Show Slow, provisioning tools, and dynaTrace’s new AJAX performance analysis tool.

O’Reilly Velocity Online Conference: “Speed and Stability”
Thursday, March 17; 9:00am PST
Cost: Free


Come To OpsCamp!

Next weekend, Jan 30 2009, there’s a Web Ops get-together here in Austin called OpsCamp!  It’ll be a Web ops “unconference” with a cloud focus.  Right up our alley!  We hope to see you there.
