Tag Archives: velocityconf13

Velocity 2013 Wrapup

Whew, we’re all finally back home from the conferencing. Fun was had by all.

@iteration1, @ernestmueller, @wickett


Over the next week I’ll go back to the liveblog articles and put in links to slides/videos where I can find them (feel free and post ones you know in comments on the appropriate post!). We’ll also try to sum up the best takeaways into a Velocity 2013 and DevOpsDays Silicon Valley 2013 quick guide, for those without the patience to read the extended dance remix.

Leave a comment

Filed under Conferences, DevOps

Velocity 2013 Day 3 Liveblog: Retooling Adobe: A DevOps Journey from Packaged Software to Service Provider

Retooling Adobe: A DevOps Journey from Packaged Software to Service Provider

Srinivas Peri, Adobe and Alex Honor, SimplifyOPS/DTO

Adobe needed to move from desktop, packaged software to a cloud services model and needed a DevOps transformation as well.

Srini’s CoreTech Tools/Infrastructure group tries to transform wasted time to value time (enabling tools).

So they started talking SaaS and Srini went around talking to them about tooling.

Dan Neff came to Adobe from Facebook as an operations guru. He said “let’s stop talking about tools” and showed Srini the famous “10+ Deploys a Day” Flickr preso. Time to go to Velocity! And there he met Alex and Damon of DTO and learned about loosely coupled toolchains.

They built CDOT, a service delivery platform. Some teams started using it; then Adobe bought Typekit, and Paul Hammond thought it was just lovely.

And now all Adobe software is coming through the cloud.  They are now the CoreTech Solution Engineering team – the folks who make enabling services.

Do something next week! And don’t reinvent the wheel.

How To Do It

First problem to solve. There are islands of tools – CM, package, build, orchestration, package repos, source repos. Different teams, different philosophies.

And actually, probably in each business unit, you have another instantiation of all of the above.

CDOT – their service delivery platform, the 30k foot view

Many different app architectures and many data center providers (cloud and trad). CDOT bridges the gap.

CDOT has a UI and API service atop an integration layer. It uses jenkins, rundeck, chef, zabbix, and splunk under the covers.

On the code side – what is that? App code, app config, and verification code. But also operations code! It is part of YOUR product. It’s an input to CDOT.

So build (CI): takes from perforce/github to pk/jenkins, into moddav/nexus; for cloud stuff, bake to an AMI, promote packages to S3 and AMIs to an AMI repo.

For deploy (CD), jenkins calls rundeck and chef server. Rundeck instantiates the cloudformation or whatever and does high-level orchestration, the AMIs pull chef recipes and packages from S3, and chef does the local orchestration.  Is it pull or push?  Both/either. You can bake and you can fry.

So feature branches – some people don’t need to CD to prod, but they sure do to somewhere.  So devs can mess with feature branches on dev boxes, but then all master checkins CD to a CD environment.  You can choose how often to go to prod.

Have a cool “devops workbench” UI with the deployment pipeline and state. So everyone has one-click self service deployment with no manual steps, with high confidence.

Now, CDOT video! It’s not really for us, it’s their internal marketing video to get teams to uptake CDOT.  Getting people on board is most of the effort!

What’s the value prop?

  • Save people time
  • Alleviate their headaches
  • Understand their motivations (for when they play politics)
  • Listen to and address their fears

Bring testimonials, data, presentations, do events, videos!  Sell it!

“Get out of your cube and go talk to people”

Think like a salesperson. Get users (devs/PMs) on board, then the buyers (managers/budget folks), partners and suppliers (other ops guys).

Leave a comment

Filed under Conferences, DevOps

Velocity 2013 Liveblog Day 3: Managing Incidents In The Wild

Managing Incidents In The Wild

Got here late! By Jonathan Reichhold (@jreichhold) from Twitter.

“Facebook is for useless posts, Twitter is for making fun of celebrities, and Instagram is for young people.” – my 11-year-old

Step 2: Set Expectations

Set expectations for times of failure – set communication methods, test your escalation tree.

Be realistic & ambitious. Prioritize what can be fixed and fix it in its due time

Postmortems – improvement has to be part of the process.

Teamwork – management has to support site reliability as a feature, or you’ll burn out your ops guys.

Distributed systems fail – you have to be robust against things that don’t happen “a lot” at small scale.  A 1-in-1,000,000 issue happens EVERY DAMN MINUTE at scale. Design more robustly.
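To make the “every damn minute” arithmetic concrete (the request rate below is my own hypothetical number, not from the talk):

```python
# At small scale a 1-in-1,000,000 failure feels like "never";
# at large scale it becomes routine.
failure_probability = 1e-6        # one bad request per million
requests_per_second = 20_000      # hypothetical large-site traffic

failures_per_minute = failure_probability * requests_per_second * 60
print(failures_per_minute)        # roughly 1.2 failures every single minute
```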

Large systems take time to design, stabilize in prod.

Don’t assume.  Be rigorous and vigilant.

Degrade gracefully, shed load

Don’t “learn bad lessons” from retrospectives like “never touch the X!”

Capacity planning – do it just in time but be realistic.  Figure out real buffers. “Facebook with their huge custom datacenters is all nice but that’s not us.”

Hardware has lead time. [Ed: That’s why it’s for punks]

This is a marathon not a sprint.  You have to keep yourself healthy or you’ll crash.  Maintain your systems and yourself.

Leave a comment

Filed under Conferences, DevOps

Velocity 2013 Day 3 Liveblog: Getting Started With Configuration Management

Getting Started With Configuration Management

By @sascha_d.

Configuration management! Define and idempotently enforce system state across a bunch of machines.
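The “define and idempotently enforce” idea can be sketched tool-agnostically (this is my own minimal illustration, not any particular CM tool’s API): a resource declares desired state, and applying it again once the system matches is a no-op.

```python
import os
import tempfile

def ensure_file(path, content):
    """Idempotent resource: write the file only if it differs from desired state."""
    current = None
    if os.path.exists(path):
        with open(path) as f:
            current = f.read()
    if current == content:
        return "unchanged"        # already converged: applying again is a no-op
    with open(path, "w") as f:
        f.write(content)
    return "changed"

path = os.path.join(tempfile.mkdtemp(), "motd")
print(ensure_file(path, "hello"))   # changed   (first run converges the system)
print(ensure_file(path, "hello"))   # unchanged (second run is idempotent)
```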

But it’s not about the tool. But you need one.

You should care about package repositories!

Anyway, she was an infra queen who loved to rewrite things by hand, but finally realized this was a blocker. Needed repeatable process. Started to look at all the CM stuff but it was really overwhelming, there’s so much out there.

Lose the baggage

Need to remove your fear, inflexibility, and arrogance.  You have some, you’re an engineer. You often need to change how you think about things.  CM and automation is not a “threat to your job.”

It’s OK not to know and to admit it.  There’s a lot of crap out there.  What’s a ruby gem?  I don’t know either. It’s OK. You can go learn what you need. Everyone’s “faking it,” that’s how it works.

“I can’t code/I don’t program/I’m not a developer.” It’s OK, you can – you don’t have to be a pro dev. You get most of the concepts from doing good CM.

It’s not “just scripts” – this is all hard work and code.

Also, just because you understand systems doesn’t mean you understand CM – it’s about moving that understanding from the gut to the mind.

Ask why things are the way they are; don’t accept constraints just because they’re there now.

Learn your tool

Resist the urge to “automate the world.” Pick something small but impactful, light on data.

Understand the primitives of your tool.  Don’t just port your bash scripts or break out to a bash block.

Read the source code.  You don’t have to write it to read it. You will see what really happens.

Test!  Learn to test! vagrant, test-kitchen, bats, jenkins (vagrant book free this week only!)

Own or get pwned

Infrastructure is an ecosystem, and you need to have curators for the tools.

There’s acceptable and unacceptable technical debt.

Curate what you’re installing – if you just rampantly download newest from the Internet you get jacked.

Own your package repos. Have one.  Have base/custom. Don’t just have stuff sitting around.

Own your build tools – artifactory/nexus/jenkins/travis.

Own your version control. Well, more “use” version control.

Own your integrity. Don’t disable CM, don’t do different in dev and prod, don’t deploy differently in different envs… You have control over this. [Ed: We are bad about this. Whether a change is puppet or manual or rundeck is a big ass mystery in our environment.  Angry cat.]

The end!

Leave a comment

Filed under Conferences, DevOps

Velocity 2013 Day 3 Liveblog: How to Run a Post-Mortem With Humans (Not Robots)

How to Run a Post-Mortem With Humans (Not Robots)

Got here a little late – not enough time in these breaks!!!

Dan Milstein (@danmil) of Hut 8 talking on how to build a learn-from-failure friendly culture.

1. and 2. – missed ’em!

3. Relish the absurdities of your system.  Don’t be embarrassed when you get a new hire and you show them your sucky deployment.  Own it, enjoy it.

Axioms to follow to have a good postmortem:

  • Everyone involved acted in good faith
  • Everyone involved is competent
  • We’re doing this to find improvements

Human error is the question, not the answer. Restate the problem to include time to recovery. “Why” is fine but look at time to detection, time to resolution. Why so long?

“Which of these is the root cause?” That’s a stupid and irrelevant question. Usually there’s not one; it’s a conjunction of factors. Look for the “broadest fix.” [Ed: Need to get a “Root cause is a myth” shirt to go with my “Private cloud is a myth” one.]

Corrective actions/remediations/fixes

Incrementalism or you’re fired! You can’t boil the ocean and “replace it wholesale.” Engineers love to say “it’s so terrible we just can’t touch it we have to replace it.” No. You have 4 hours to do the simplest thing to make it better, go.
“Well… OK I guess we could put a wrapper script around it…” OK, great! [Ed: We need to do that with all our database-affecting command line tools… Wrapper script that checks replication lag and also logs who ran it… Done and done!]

Don’t think about automation, think about tools. People think computers are perfectly reliable and we should remove the humans; evidence shows this doesn’t work well. Skynet syndrome – lots of power, often written by those who don’t do the job. Tools -> humans solve the problem; iterate on giving them better tools. Not everyone brings this baggage to automation, but many do. “Do the routine task” – automate.  “Should I do this task, and how should it be done?” – human.

Things are in partial failure mode all the time.  [Ed: Peco calls this “near miss” syndrome from the way they make flying safer – learn from near misses, not just crashes.]

To get started:

  • Elect a postmortem boss
  • Look for a goldilocks incident
  • Expect awkwardness (use some humor to defuse)
  • THERE MUST BE FIXES
  • incrementally improve the incremental improvements (and this process)

Reading list!  Dang get the slides.

Leave a comment

Filed under Conferences, DevOps

Velocity 2013 Day 3 Liveblog: The Keynotes

Day Three of a week of convention. Convention Humpday.  The day it stops being a mini-vacation and you start earning your salary again.

As usual the keynotes are being livestreamed, so this liveblog is perhaps more for those who want the Cliff’s Notes summary later.  Yesterday’s keynotes were certainly compressible, so here we go! Also follow along on the Twitter hashtag #velocityconf to see what people are saying about the show.

Clarification on Keynote’s RUM – they announced it yesterday but I was like “haven’t they been trying to sell this to me for two years?” Apparently they were but it was in beta. And congrats to fast.ly who just got a $10M round of funding!

Winners of the survey follies…  Souders’ book, Release It!/Web Operations, and so on. Favorite co-host: Souders!

“Send us your comments!  Except about the wi-fi!” Actually it’s working OK so far this morning. The show is good, though they added yet another “vendor track,” which is unfortunate.  They have a front end dev track, an ops track, and a mobile track.  Last year they added a fourth track – “vendor talks.”  This year there’s a fifth track – “more vendor talks.” Boo, let’s make space for real content.

Gamedays On The Obama Campaign

@dylanr on revamping the Obama site 18 mos. before election day.  40 engineers in 7 teams, ~300 repos, 200 deployed products, 3000 servers (AWS), millions of hits a day, million volunteers, 8000 staff. He had redone threadless’ site and was on to the next big thing!

Plan, build, execute, get out the vote.

Planning is the not-fast dreamtime. But for the tech folks, it means start building the blocks.

Build is when everyone starts building teams and soliciting $. Then tech builds the apps.

Execute is when everyone starts using it all, more and more of everything. Tech starts getting feedback (been building blind till now).

Get out the vote – final 4-day sprint. For tech, this means scale. A couple orders of magnitude in that span.

Funny picture of a “Don’t Fuck This Up” cake.  [Ed: That was my second standing order for my old WebOps team.  1. Make It Happen, 2. Don’t Fuck Up, 3. There’s the right way, the wrong way, and the standard way.]

They got one shot at this. So how do you do it?

Talk to your stakeholders – but they want every feature ever.  Working is better: shipping without a feature beats shipping a broken app.  So frame the conversation as “if things fail, what still needs to work?” Graceful degradation.

Failure – you can try to make it not fail, and learn to deal with failure.  You should do some of the former but not delude yourself into not doing the latter.  But not just the tech – make the team resilient to failure via practice.

“Game day” 6 weeks pre-election.  Prod-size staging, simulate, break and react. Two week hardening sprint and then on game day had a long agenda of “things to break.”  He lied about the order and timing though.

Devops (aggressors) vs engineers (defenders), organized in campfire, maintaining updated google doc

Learned that there were broken things we thought were fixed, learned what failure really looks like, how things fail, how to fix it

Made runbooks

They had a stressful day, went home, came back in – and databases started failing! AWS failure.  Utilized the runbooks and were good.

Build resilient apps by planning for failure, defining what matters, making your plan clear to stakeholders, and fail to the things that matter. And resilient teams – practice failing and learn from it, use your instruction manual.

Ops School

Check it!  Etsy and O’Reilly video classes on  how to be an ops engineer! Oh. They’ll be for sale, I got excited for a minute thinking it would be a free way to get a whole generation of ops engineers trained from schools etc. that know nothing about ops.  Guess not.  Damn.

W3C Web Performance Working Group Status

Arvind from Google gives us an update on the W3C. The web perf working group is working on navigation timing, user timing, page visibility, resource priorities, client side error logging, and other fundamental standards that will help with web performance measurement and improvement.  Good stuff if very boringly presented, but that’s standards groups for you.

Eliminating Web Site Performance Theft

Neustar tells us the world is online and brand reputation and revenue are at stake.  Quite.

Performance can affect your reputation and revenue!  Quite.

This talk is a great one for vaguely befuddled director level and above types, not the experts at this conference.  Twitter agrees.

Mobitest, the Latest

Akamai has cell nodes and devices for Mobitest, you can become a webpagetest node on your phone!  If you have an unlimited data plan 😛 The app is waiting on release in the App Store. See mobitest.akamai.com for more.

If You Don’t Understand People, You Don’t Understand Ops

Go to techleadershipnews.com and get Kate’s newsletter!

Influence – Gangster style!

How do you earn respect and influence without authority? Even if you’re not a “manager” you need to be able to do this to get things done.

You want people to hear what you have to say – need 3 things.

  • Accountability
  • Everyone is your ally
  • Reciprocity

Accountability – lead by example. Be the person who can get “it” done.  Always followed through on commitments. Generates the graph of trust. Treat everyone with respect. Be a reliable person. Be a superstar – always be hustling.

Everyone is an ally – make them your friend. It’s a small world. Who was nice to you? Who made you feel bad?  How about in return? Make every interaction positive.

Reciprocity – all about giving. You get what you give. What is your currency?  What do you have of value with others and how can you share it? How can you improve the lives of other people?

Success is about people. Influence is success.  Yay Kate!

Lightning Demos

httparchive and BigQuery

@igrigorik from Google about httparchive.org which crawls the Web twice a month and keeps stats. It’s all freely available for download – a subset is shown online at the site.

Google built Dremel, an interactive ad hoc query system for analysis of read-only nested data.  So they put BigQuery + HTTP Archive together! Go to bigquery.cloud.google.com and it’s in there! Most common JS framework (jQuery, btw)? Median page load times? You can query it all.
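The sort of ad hoc question you could ask might look like this – note the table and column names below are my own illustrative guesses, not the verified HTTP Archive schema, so check the actual dataset in the console before running anything:

```python
# Sketch of an ad hoc HTTP Archive question expressed as BigQuery SQL.
# Table and column names are illustrative guesses, not verified schema.
table = "httparchive.pages"

query = f"""
SELECT APPROX_QUANTILES(load_time_ms, 100)[OFFSET(49)] AS median_load_ms
FROM `{table}`
"""

print(query.strip())   # paste into bigquery.cloud.google.com and run
```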

In Google Docs you can script and make them interoperate (like send an email when this spreadsheet gets filled in).  Created dynamic query straight to bigquery. Oh look, dynamic graph! bit.ly/ha-query for more!

Patrick Meenan on Webpagetest

You can now do a tcpdump with your test! (Advanced tab). He shows an analysis with cloudshark – wireshark in the cloud! Nice.

Patrick Lightbody from New Relic

Real user monitoring is cool. newrelic.com/platform

Steve Reilly from Riverbed/Aptimize

An application-aware infrastructure? We have abstractions for some layers – middleware, compute, storage – but not really for transport.  Software-defined networking will be the next “washing” trend. It’s just a transport abstraction. Then we can make the infrastructure a function of the application. “Middle boxes” become app features – GSLB, WAFs, etc.

Slightly confusing at this point – a lot of abstract words and not enough concrete.  Which is better than a thinly disguised product pitch, so still better than yesterday!

Decentralized decisionmaking… Location is no longer a constraint but a feature.  This makes me think of Facebook’s talk yesterday with Sonar and rewriting DNS/GTM/LB.

Jonathan LeBlanc from Paypal on API Design

Started with SOAP/XML SOA. But then the enlightenment happened and REST made your life less sucky and devs more efficient.

“Sure we support REST!  You can GET and POST!” Boo. And also religious REST principle following, instead of innovation.

Our lessons learned: Lower perceived latency, use HTTP properly, build in automation, offload complexity.

With no details this was very low value.

 

Leave a comment

Filed under Conferences

Velocity 2013 Day 2 Liveblog: mobile performance and engagement

Guilin Chen from Facebook is the presenter…

UX is important, and mobile users are less tolerant than desktop users.

What does performance mean for the Facebook study?
– page load times
– scroll performance (how smooth)
– prefetch delay (infinite scrolling)

– page load times showed a strong correlation between slowness and user drop off.
– consistent scrolling experience matters more; slower scrolling is better than jittery scrolling.
– prefetch delay studies weren’t as conclusive and thus didn’t matter as much.

Leave a comment

Filed under Conferences, DevOps

Velocity 2013 Day 2 Liveblog – Application Resilience Engineering and Operations at Netflix

Application Resilience Engineering and Operations at Netflix  by Ben Christensen (@benjchristensen)

Netflix and resilience.  We have all this infrastructure failover stuff, but once you get to the application each one has dozens of dependencies that can take them down.

They needed speed of iteration, client libraries (just saying “here’s my REST service” isn’t good enough), and support for a mixed technical environment.

They like the Bulkheading pattern (read Michael Nygard’s Release it! to find out what that is). Want the app to degrade gracefully if one of its dozen dependencies fails. So they wrote Hystrix.

1. Use a tryable semaphore in front of every library they talk to. Use it to shed load. (Circuit breaker.)

2. Replace that with a thread pool, which adds the benefit of thread isolation and timeouts.

A request gets created and goes through the circuit breaker, runs, then health gets fed back into the front. Errors all go back into the same channel.

The “HystrixCommand” class provides fail fast, fail silent (intercept, especially for optional functionality, and replace with an appropriate null), stubbed fallback (try with the limited data you have – e.g. can’t fetch the video bookmark, send you to the start instead of breaking playback), fallback via network (like to a stale cache or whatever).
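The fallback modes can be sketched roughly like this – a Python illustration of the pattern, not Hystrix’s actual Java API:

```python
class Command:
    """Rough sketch of a Hystrix-style command: run a dependency call,
    and on failure degrade gracefully instead of breaking the whole request."""
    def run(self):
        raise NotImplementedError
    def fallback(self):
        raise NotImplementedError   # no fallback defined => fail fast
    def execute(self):
        try:
            return self.run()
        except Exception:
            return self.fallback()

class GetBookmark(Command):
    """Stubbed fallback: if we can't fetch the saved video position,
    start playback from the beginning rather than failing playback."""
    def run(self):
        raise ConnectionError("bookmark service down")
    def fallback(self):
        return 0                    # limited-but-sane data

print(GetBookmark().execute())      # 0 -- playback degrades instead of breaking
```

A “fail silent” command would return None from its fallback; a “fallback via network” would try a stale cache inside the fallback instead.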

Moved into operational mode.  How do you know that failures are generating fallbacks?  Line graph syncup in this case (weird). But they use instrumentation of this as part of a purty dashboard. Get lots of low latency granular metrics about bulkhead/circuit breaker activity pushed into a stream.

Here’s where I got confused.  I guess we moved on from Hystrix and are just on to “random good practices.”

Make low latency config changes across a cluster too. Push across a cluster in seconds.

Auditing via simulation (you know, the monkeys).

When deploying, deploy to canary fleet first, and “Zuul” routing layer manages it. You know canarying.  But then…

“Squeeze” testing – every deploy is burned as an AMI; they push it to the perf degradation point to learn the rps/load it can take. Most recently, they added…

“Coalmine” testing – finding the “unknown unknowns” – an environment on current code capturing all network traffic, especially traffic crossing bulkheads. So like a new feature that’s feature-flagged (and thus not caught in canary) and suddenly it starts getting traffic.

So when a problem happens – the failure is isolated by bulkheads and the cluster adapts by flipping its circuit breaker.

Distributed systems are complex.  Isolate relationships between them.

Auditing and operations are essential.

Leave a comment

Filed under Conferences, DevOps

Velocity 2013 Day 2 Liveblog: a baseline for web performance with phantomjs

The talk I’m most excited about for today is next! I made sure to get here early…

@wesleyhales from apigee and @ryanbridges from CNN

Quick overview on load testing tools- firebug, charles, har viewers and whatnot; but its super manual.
Better- selenium, but it’s old yo and not hip anymore.
There are services out there: harstorage.com, harviewer etc that you can use too.
Webpagetest.org is pimped again, but apparently it caused an internal argument at CNN.

Performance basics
– caching
– gzip: don’t gzip content that’s already compressed (like jpegs)
– know when to pull from cdn’s
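The “don’t gzip what’s already compressed” point is easy to demonstrate: compressed formats like JPEG look like random bytes, and gzipping random bytes burns CPU and can even grow the payload. A quick check (my own illustration, not from the talk):

```python
import gzip
import os

text = b"hello web performance " * 1000   # compressible, like HTML/CSS/JS
noise = os.urandom(22_000)                # incompressible, like JPEG payload data

print(len(gzip.compress(text)))    # far smaller than the 22,000-byte input
print(len(gzip.compress(noise)))   # about the same size, or slightly larger
```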

Ah! New term bingo! “Front end ops” – aka it sucks to code something and then realize there’s a ton of extra work to make it actually perform. Continued definition:
– keep an eye on perf
– manager of builds and dependencies (grunt etc)
– expert on delivering content from server to browser
– critical of new http requests/file sizes and load times

I’m realizing that building front ends is a lot more like building server side code….

Wes recommends having a front end performance ops position and better analytics.

A chart of CNN’s web page load times is shown.

So basically, every time CNN.com is built by Bamboo, the page load time is measured, saved, and analyzed. They use phantomjs for this, which became Loadreport.js.

Loadreport.wesleyhales.com is the URL for it.
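A CI gate like the one described can be sketched very simply – the threshold and the numbers below are hypothetical; loadreport.js itself is what drives PhantomJS and emits the timings:

```python
def perf_gate(load_time_ms, baseline_ms, tolerance=0.20):
    """Fail the build when page load regresses more than `tolerance` over baseline."""
    limit = baseline_ms * (1 + tolerance)
    return "pass" if load_time_ms <= limit else "fail"

# Hypothetical numbers: baseline 1800 ms, measured 2300 ms => >20% regression.
print(perf_gate(2300, 1800))   # fail
print(perf_gate(1900, 1800))   # pass
```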

Filmstrip is a cool idea that stores filmstrips of all pages loaded.
Speed reports is another visualization that was written.

Hard parts
– performance issues need more thought; figure out your baseline early
– advertisers use document.write
– server location
– browser types: DIY options are harder
– CPU activity: use a consistent environment

All in all, this has many of the same concerns when you’re doing server side performance

CI setup
– bamboo
– Jenkins
– barebones Linux without x11
– vagrant

Demo was shown that used Travis ci as the ci system.

All in all, everyone uses phantomjs for testing; check it out; look at fluxui on github for more!

Leave a comment

Filed under Conferences, DevOps

Velocity 2013 Day 2 Liveblog: Performance Troubleshooting Methodology

Stop the Guessing: Performance Methodologies for Production Systems

Slides are on Slideshare!

Brendan Gregg, Joyent

Note to the reader – this session ruled.

He’s from dtrace but he’s talking about performance for the rest of us. Coming soon, Systems Performance: Enterprises and the Cloud book.

Performance analysis – where do I start and what do I do?  It’s like troubleshooting, it’s easy to fumble around without a playbook. “Tools” are not the answer any more than they’re the answer to “how do I fix my car?”

Guessing Methodologies and Not Guessing Methodologies (Former are bad)

Anti-Methods

Traffic light anti-method

Monitors green?  You’re fine. But of course thresholds are a coarse-grained tool, and performance is complex.  “Is X bad?  Well, sometimes, except when X, but then when Y, but…” False positives and false negatives abound.

You can improve it by more subjective metrics (like weather icons) – onjective is errors, alerts, SLAs – facts.

see dtrace.org status dashboard blog post

So traffic light is intuitive and fast to set up but it’s misleading and causes thrash.

Average anti-method

Measure the average/mean, assume a normal-like unimodal distribution and then focus your investigation on explaining the average.

This misses multiple peaks, outliers.

Fix this by adding histograms, density plots, frequency trails, scatter plots, heat maps
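The problem with the average anti-method is easy to see with a bimodal distribution (synthetic numbers, my own illustration): the mean lands where almost no request actually is.

```python
from statistics import mean, quantiles

# Synthetic bimodal latencies: most requests are fast cache hits,
# a second mode of cache misses is 50x slower.
latencies = [10] * 900 + [500] * 100    # milliseconds

print(mean(latencies))                  # mean is 59 ms -- a latency almost nobody saw
print(quantiles(latencies, n=100)[94])  # p95 is 500 ms -- what real users experience
```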

Concentration game anti-method

Pick a metric, find another that looks like it, investigate.

Simple and can discover correlations, but it’s time consuming and mostly you get more symptoms and not the cause.

Workload characterization method

Who is causing the load, why, what, how. Target is the workload not the performance.

Lets you eliminate unnecessary work. It only solves load issues, though, and most things you examine won’t be a problem.

[Ed: When we did our Black Friday performance visualizer I told them “If I can’t see incoming traffic on the same screen as the latency then it’s bullshit.”]

USE method

For every resource, check utilization, saturation, errors.

util: time resource busy

sat: degree of queued extra work

Finds your bottlenecks quickly

Metrics that are hard to get become feature requests.

You can apply this methodology without knowledge of the system (he did the Apollo 11 command module as an example).

See the use method blog post for detailed commands
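Applied to one resource with made-up numbers, the checklist might look like this (my own sketch of the methodology, not Gregg’s tooling; the 70% threshold is an illustrative choice):

```python
def use_check(name, utilization, saturation, errors):
    """USE method: for each resource, check Utilization, Saturation, Errors."""
    findings = []
    if utilization > 0.70:
        findings.append("high utilization")
    if saturation > 0:
        findings.append("work is queueing")   # any sustained saturation is a problem
    if errors > 0:
        findings.append("errors present")
    return f"{name}: " + (", ".join(findings) if findings else "ok")

# Hypothetical snapshot: CPU 85% busy with a run-queue backlog, disks clean.
print(use_check("cpu",  utilization=0.85, saturation=3, errors=0))
print(use_check("disk", utilization=0.30, saturation=0, errors=0))
```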

For cloud computing you also need the “virtual” resource limits – instance network caps. App stuff like mutex locks and thread pools.  Decompose the app environment into queueing systems.

[Ed: Everything is pools and queues…]

So go home and for your system and app environment, create a USE checklist and fill out metrics you have. You know what you have, know what you don’t have, and a checklist for troubleshooting.

So this is bad ass and efficient, but limited to resource bottlenecks.

Thread State Analysis Method

Six states – executing, runnable, anon paging, sleeping, lock, idle

Getting this isn’t super easy, but dtrace, schedstats, delay accounting, I/O accounting, /proc

Where the time goes leads directly to actionables.
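The “where the time went tells you what to do” idea, with hypothetical per-state timings (the state-to-action mapping is my paraphrase of the method, not the talk’s exact table):

```python
# Six thread states from the talk; seconds spent in each (hypothetical numbers).
states = {"executing": 2.1, "runnable": 0.3, "anon_paging": 0.0,
          "sleeping": 6.4, "lock": 1.2, "idle": 0.0}

actions = {"executing": "profile on-CPU time",
           "runnable": "check CPU saturation / add CPUs",
           "anon_paging": "check memory pressure",
           "sleeping": "trace what the thread is blocked on (often I/O)",
           "lock": "find the contended lock",
           "idle": "nothing to do"}

worst = max(states, key=states.get)   # the biggest time sink
print(worst, "->", actions[worst])    # sleeping -> trace what the thread is blocked on (often I/O)
```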

Compare to e.g. database query time – it’s not self contained. “Time spent in X” – is it really? Or is it contention?

So this identifies, quantifies, and directs but it’s hard to measure all the states atm.

There’s many more if perf is your day job!

Stop the guessing and go with methodologies that pose questions and seek metrics to answer them.  P.S. Use dtrace!

dtrace.org/blogs/brendan

Leave a comment

Filed under Conferences, DevOps