Tag Archives: velocity

Velocity 2014 After Action Report

An Average Velocity Session

Well, it was my first Velocity as a vendor (I’ve been to every one, 2008 to present; you can read the previous reports here on the blog)!  So that was different, and I split time between working the Copperegg booth and going to sessions.  As a result I’m not going to do the extensive session-by-session notes I’ve done in the past.  Two other Agile Admins, James and Karthik, were there; I’m hoping they do some writeups of the sessions they attended too!

Being a vendor was interesting; though standing at the booth made my dogs bark after the day was over, it was great to be able to talk to so many people. There were a lot of monitoring providers at the show (Copperegg (us), Compuware, New Relic, Datadog, many more).  Pingdom was right across from us, with a slate of guys shipped in from Sweden, but they were generally grumpy – jet lag or their recent acquisition, perhaps. A new log management SaaS provider was there, logentries.com, and that was interesting – Sumo is the only real one in the space since Loggly and SplunkStorm borked it up and they’ve been getting a little… “Enterprise-y?” By that I mean having sales reps call you 5x/day and wanting near-Splunk prices.  So yay to the newcomers, competition is always good. Other than that, it was mostly the same slate of Velocity-vendors as usual.

What’s New

Well, let’s get it out of the way – there wasn’t all that much new this year. Karthik complained to me that “last year, Velocity was my favorite conference ever, and this year I didn’t get much out of it.” Not every year hosts a bunch of new techniques, sadly, but I thought there were some gems in there.  Here are the four major new trends taking up speech-space:

Docker docker docker containers containers containers. Learn it now because in a year everything will be in containers – no, seriously. Largest splash in computing since Amazon AWS. The hype is a little overexcited at times but there’s a lot of new development going on here.  On the one hand, not everyone needs new-box spinup in 5s instead of 5m and the efficiency gains are a tradeoff for security – but to be blunt, people stopped well short of exercising the elasticity and ephemerality of cloud/virtualization solutions, instead going for the more comfortable “let’s deploy a three tier app manually like we did back in the day, but in the cloud” and so containers will be a disruption to push forward the concept of dynamic service orchestration etc., which is good.

There is starting to be buzz around Internet of Things.  Mark Burgess (CFEngine, author of “In Search Of Certainty”) did a presentation on IoT and a more distributed model of monitoring and computation. Worth looking at, and it’s becoming more a part of mainstream computing (“engineering” tech and “IT” tech split off from each other 15 years ago for whatever reason and are just now joining forces again). Since we Agile Admins all had worked at National Instruments and had tried to get them onto the IoT bandwagon like 5 years ago, we grumped among each other about this.

There’s also strong interest in software defined networking (OpenDaylight, Cumulus). John Willis (@botchagalupe) waxed poetic on the topic and it fit into the general push towards making everything programmable.

There was strong and sustained interest (presentations, etc.) on STEM education and specifically on women in tech/getting more women into tech.

Keynotes

My Room At The Avatar

Video of these should be publicly available so you can watch them.

Jeff Dean of Google did a very interesting talk on making large scale services low latency that I recommend everyone view (video is at the link). Shared environments increase utilization but also congestion, exacerbated by large fanout systems – if each service is slow (1 sec) on only 1% of requests and you have to touch 100 services to finish your call, 63% of calls take more than a second. Traditional latency reduction uses techniques like differentiated service classes, breaking up large requests, and managing background activity (rate limit, wait till low load). Tolerating faults is a lot like tolerating variability – extra resources make your system reliable against faults, and you can do the same with variability, just on a much shorter timeframe. There are two ways to do that…

  1. Cross Request Adaptation – examine recent behavior and make changes (load balance, scale) – low timescale, this tends to make the “next call” faster. Fine grained dynamic partitioning relies on equal sizes and constant load, but if you break the work up into 10-100 pieces per machine you can shed load more effectively. Selective replication: in a query system, for example, they make more copies of important docs. Use latency-induced probation via your load balancer: offload to other boxes, shadow stream to the original, and return it to service when it’s better.
  2. Within Request Adaptation – make the call faster within the single call! Basically this is a series of refinements on “send the request two places.”  First he modeled sending the request again to another server if it didn’t return in an expected amount of time. You can get cuter, like by always sending to two destinations and having the one that starts working on it give a sideways “I’ve got it” to the other. His mathematical analysis says that you can cut latency dramatically for a very small increase in load, and not only that, but the response of a loaded cluster and an idle cluster become very similar (less dramatic spiking under load).
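
To make the fanout arithmetic and the “send the request two places” idea concrete, here’s a rough Python sketch. It is not from Jeff’s talk: the latency model (99% of calls at 10 ms, 1% at 1 s), the 20 ms hedge delay, and all the function names are my own assumptions for illustration.

```python
import random


def fraction_slow_fanout(p_slow: float, fanout: int) -> float:
    """Probability that at least one of `fanout` independent calls is slow."""
    return 1.0 - (1.0 - p_slow) ** fanout


def sample_latency(rng: random.Random) -> float:
    """Toy latency model (an assumption): 99% of calls take 10 ms, 1% take 1 s."""
    return 1.0 if rng.random() < 0.01 else 0.010


def hedged_latency(rng: random.Random, hedge_after: float) -> float:
    """Send a backup request if the first one has not answered within `hedge_after`."""
    first = sample_latency(rng)
    if first <= hedge_after:
        return first  # the first reply arrived in time, no backup is sent
    backup = hedge_after + sample_latency(rng)  # backup starts after the hedge delay
    return min(first, backup)  # take whichever reply comes back first


def simulate(trials: int = 100_000, hedge_after: float = 0.020, seed: int = 1):
    """Return (fraction of slow plain calls, fraction of slow hedged calls)."""
    rng = random.Random(seed)
    plain = sum(sample_latency(rng) >= 1.0 for _ in range(trials)) / trials
    hedged = sum(hedged_latency(rng, hedge_after) >= 1.0 for _ in range(trials)) / trials
    return plain, hedged


if __name__ == "__main__":
    print(f"slow calls at fanout 100: {fraction_slow_fanout(0.01, 100):.1%}")
    plain, hedged = simulate()
    print(f"slow single calls: plain {plain:.3%}, hedged {hedged:.3%}")
```

With these made-up numbers, a hedged call is only slow when both the original and the backup are slow, and the backup only goes out for the roughly 1% of calls that miss the hedge delay, which is the “small increase in load” part.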

And I did one!  Just a 5 minute spot since Copperegg was a platinum sponsor; I talked about applying a Lean approach to implementing monitoring. It was called A 5 Minute Checklist For Application Monitoring and slides/video are at the link.  I also wrote a white paper to expand on it that’s available for download here.

Sessions

California Sushi

I went to a number of sessions that I enjoyed; here’s a quick breakdown of the ones I thought were winners.  I’ll try to find slides and link them where they exist. O’Reilly charges for the videos though.

Vladimir Vuksan’s workshop on Ganglia. People like the gathering of mass metrics. They did rake him over the coals a bit on the 15s time resolution and the relatively primitive RRDTool graphs.  He had some interesting bits like a “check that a value is the same everywhere” alert for consistency. He also summed up “why we monitor” well – MTTD, MTTR, trending, learning.

Theo Schlossnagle’s presentation on Understanding Slowness. He recommended a system map as step 1 – high level box and line but low level with all versions, locations, and service connections. He also talked about going to histograms but less sophisticated users find those hard to understand, so displaying quantiles can be a happy medium. He sees three different tool spaces: observational, synthetic, and manipulation.
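
As a quick illustration of why quantiles are a handy middle ground, here’s a small Python sketch using the standard library’s `statistics.quantiles`. The sample latencies and the `latency_quantiles` helper are made up by me, not taken from Theo’s talk.

```python
import statistics


def latency_quantiles(samples_ms, points=(50, 90, 99)):
    """Return selected percentiles (p50, p90, ...) of a list of latency samples."""
    # n=100 yields 99 cut points; cuts[p - 1] is the p-th percentile.
    cuts = statistics.quantiles(samples_ms, n=100, method="inclusive")
    return {f"p{p}": cuts[p - 1] for p in points}


if __name__ == "__main__":
    # Hypothetical data: most requests are fast, a few are very slow.
    samples = [10.0] * 95 + [1000.0] * 5
    print("mean:", statistics.mean(samples))
    print(latency_quantiles(samples))
```

The mean blends the fast and slow requests into one number that describes neither, while a few quantiles show the typical case and the tail separately without asking anyone to read a full histogram.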

There was a good presentation on the math around false alarms, using the “sensitivity” and “specificity” terms from medicine. Here’s a quick reference on those and how you calculate a positive predictive value. Undetected outages are embarrassing so the response is to narrow the monitoring thresholds but this just generates more false alerts, aka “pagerrhea.” This segued into the discussion of using better means to detect deviation – hysteresis, moving thresholds like Holt-Winters, cross-correlation of metrics, Fourier transforms. You should alert on whether work is getting done, not on CPU or swap but on HTTP response time and requests per second. He wants “something like nagios but that separates detection from diagnosis.”
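
Here’s a minimal Python sketch of the positive predictive value calculation, i.e. “given that the alert fired, how likely is it that something is really wrong?” The formula is the standard one from medical testing; the example numbers (a check that is 99% sensitive and 99% specific, with a real problem in 0.1% of check intervals) are assumptions of mine, not figures from the talk.

```python
def positive_predictive_value(sensitivity: float, specificity: float,
                              prevalence: float) -> float:
    """Probability that a fired alert corresponds to a real problem.

    sensitivity: P(alert fires | real problem)
    specificity: P(alert stays quiet | no problem)
    prevalence:  P(real problem) in any given check interval
    """
    true_positives = sensitivity * prevalence
    false_positives = (1.0 - specificity) * (1.0 - prevalence)
    total_alerts = true_positives + false_positives
    if total_alerts == 0:
        return 0.0  # the alert never fires at all
    return true_positives / total_alerts


if __name__ == "__main__":
    ppv = positive_predictive_value(sensitivity=0.99, specificity=0.99,
                                    prevalence=0.001)
    print(f"chance a page is a real outage: {ppv:.1%}")
```

Because real outages are rare, even a check that sounds very accurate ends up paging mostly on false alarms, which is the math behind “pagerrhea.”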

I also really appreciated the LinkedIn talk on technical debt. They admitted that several years ago, they were trying to keep up in the social world and just ground to a halt because of accumulated technical debt. They had to stop and take a bunch of time to fix it before they could move forward. Important takeaways included:

  • Technical debt comes small decision by small decision
  • Don’t wait for version n+1, fix it now
  • “One in a million” problems happen a lot at web scale
  • Cancerous workarounds are no good
  • Broken window syndrome – if things are broken, people will tend to leave things broken
  • Zombie tech will eat you
  • Use our cool rest.li REST framework!
  • Employee engagement drains KPIs
  • Strategies – recognize debt choices and decisions
  • Use new eyes – consultants, interns – to identify the “bad parts”
  • Measure the right things
  • Technical debt you can see is only the tip of the iceberg
  • Make active decisions otherwise in Soviet Russia, Decision Makes You!  (well, I added that last part)

The last really good one was about confirmation bias and monitoring. When dealing with metrics there are a lot of cognitive illusions – the anchoring effect (whatever it was recently before it deviated must have been right), the validity effect (a couple people told me that so it must be true), illusory correlation (looks like those happened around the same time), attitude polarization (round up the usual suspects). The way to combat this is with analysis. Rethink your data flow, validate your stats.  Use anomaly detection like the open sourced skyline and oculus to really detect correlations and deviations.

Though there weren’t as many breakthroughs this year, I appreciated the incremental uptick in wisdom about how to use what we have!

Social

Much of the benefit of conferences isn’t the sessions, it’s the great people you meet and share experiences with. Once you’ve been a couple years, you get to see old friends. Sadly none of our compatriots from Agile Admin alumni companies (National Instruments, Bazaarvoice, PowerReviews) were there, but we did get to see most of the “usual suspects” from these shows – we had the usual “hang out at the Hyatt bar fiesta” with Andrew Shafer, John Willis, Ben Rockwood, Cameron Haight and Jonah Kowall from Gartner, Gene Kim, and many more.  Notable in his absence was Patrick Debois, who remained in Belgium; we all missed him.

If you went to Velocity this year, chime in below (especially if we met you there!).

Filed under Conferences, DevOps

Opening Velocity 2014

All right, I’m here in sunny San Jose for Velocity, the three-day Web operations and performance conference.  It’s my first time attending as a sponsor type which is interesting. We have a whole cadre going; I flew in with Jenny and Lauren from Copperegg as part of the advance squad.  Because I just got in on this gig recently, I am out at the Avatar while they’re at the Hilton nearby.  On the cab ride, they got a bit agitated over a tweet claiming we’re being exclusionary over our “The Dude” promos; I guess I can see the misunderstanding but it’s a Big Lebowski theme specifically cooked up by the women in our Marketing department.

Some IHOP breakfast, a long walk from the Avatar to the convention center, and then speaker check-in, where I got to chat with Mandy Walls, Vladimir Vuksan, and Andrew “Clay” Shafer.  Apparently there’s a two person limit on booth setup so I don’t have to help with that. I’ll go report on Andrew’s talk, though I will have to duck out early for speaker orientation for my talk.

Remember, if you can’t make it they’ll be streaming the morning keynotes on Wed/Thurs.  If you are here, grab me and say “Hi!”

Filed under Conferences

Meet The Agile Admins At Velocity/DevOpsDays Silicon Valley!

Three of the four agile admins (James, Karthik, and myself) will be out at Velocity and DevOpsDays this week. Say hi if you see us!

James will be doing a workshop with Gareth Rushgrove on Tuesday 9-10:30 AM, “Battle-tested Code without the Battle – Security Testing and Continuous Integration.” Get hands on with gauntlt and other tools! [Conference site] [Lanyrd]

Ernest is doing a 5 minute sponsor keynote on Thursday, “A 5 Minute Checklist for Application Monitoring.” OK, so it’s during the USA vs Germany game – come see me anyway!  I hate keynote sales pitches so I’m not doing one, I’ll be talking about a Lean approach to monitoring and stuff to cover in your MVP. There’s a free white paper too since what can you really say in 5 minutes? And so you know what to expect, the hashtag you’ll want to use is #getprobed! [Conference site] [Lanyrd]

Filed under Conferences

Velocity 2013 Wrapup

Whew, we’re all finally back home from the conferencing. Fun was had by all.

@iteration1, @ernestmueller, @wickett

Over the next week I’ll go back to the liveblog articles and put in links to slides/videos where I can find them (feel free and post ones you know in comments on the appropriate post!). We’ll also try to sum up the best takeaways into a Velocity 2013 and DevOpsDays Silicon Valley 2013 quick guide, for those without the patience to read the extended dance remix.

Filed under Conferences, DevOps

Velocity 2013 Day 3 Liveblog: Retooling Adobe: A DevOps Journey from Packaged Software to Service Provider

Retooling Adobe: A DevOps Journey from Packaged Software to Service Provider

Srinivas Peri, Adobe and Alex Honor, SimplifyOPS/DTO

Adobe needed to move from desktop, packaged software to a cloud services model and needed a DevOps transformation as well.

Srini’s CoreTech Tools/Infrastructure group tries to transform wasted time to value time (enabling tools).

So they started talking SaaS and Srini went around talking to them about tooling.

Dan Neff came to Adobe from Facebook as an operations guru.  He said “let’s stop talking about tools.” He showed him the “10+ deploys a day at Flickr” preso. Time to go to Velocity!  And there he met Alex and Damon of DTO and learned about loosely coupled toolchains.

They generated CDOT, a service delivery platform. Some teams started using it, then they bought Typekit and Paul Hammond thought it was just lovely.

And now all Adobe software is coming through the cloud.  They are now the CoreTech Solution Engineering team, which makes enabling services.

Do something next week! And don’t reinvent the wheel.

How To Do It

First problem to solve. There are islands of tools – CM, package, build, orchestration, package repos, source repos. Different teams, different philosophies.

And actually, probably in each business unit, you have another instantiation of all of the above.

CDOT – their service delivery platform, the 30k foot view

Many different app architectures and many data center providers (cloud and trad). CDOT bridges the gap.

CDOT has a UI and API service atop an integration layer.  It uses jenkins, rundeck, chef, zabbix, and splunk under the covers.

On the code side – what is that? App code, app config, and verification code. But also operations code! It is part of YOUR product. It’s an input to CDOT.

So build (CI).  Takes from perforce/github to pk/jenkins, into moddav/nexus, for cloud stuff bake to an AMI, promote packages to S3 and AMIs to an AMI repo.

For deploy (CD), jenkins calls rundeck and chef server. Rundeck instantiates the cloudformation or whatever and does high level orchestration, the AMIs pull chef recipes and packages from S3, and chef does the local orchestration.  Is it pull or push?  Both/either. You can bake and you can fry.

So feature branches – some people don’t need to CD to prod, but they sure do to somewhere.  So devs can mess with feature branches on dev boxes, but then all master checkins CD to a CD environment.  You can choose how often to go to prod.

Have a cool “devops workbench” UI with the deployment pipeline and state. So everyone has one-click self service deployment with no manual steps, with high confidence.

Now, CDOT video! It’s not really for us, it’s their internal marketing video to get teams to uptake CDOT.  Getting people on board is most of the effort!

What’s the value prop?

  • Save people time
  • Alleviate their headaches
  • Understand their motivations (for when they play politics)
  • Listen to and address their fears

Bring testimonials, data, presentations, do events, videos!  Sell it!

“Get out of your cube and go talk to people”

Think like a salesperson. Get users (devs/PMs) on board, then the buyers (managers/budget folks), partners and suppliers (other ops guys).

Filed under Conferences, DevOps

Velocity 2013 Liveblog Day 3: Managing Incidents In The Wild

Managing Incidents In The Wild

Got here late! By Jonathan Reichhold (@jreichhold) from Twitter.

“Facebook is for useless posts, Twitter is for making fun of celebrities, and Instagram is for young people.” -My 11 year old

Step 2: Set Expectations

set expectations for times of failure–set communication methods, test your escalation tree

Be realistic & ambitious. Prioritize what can be fixed and fix it in its due time

Postmortems – improvement has to be part of the process.

Teamwork – management has to support site reliability as a feature, or you burn out your ops guys

Distributed systems fail – have to be robust against things that don’t happen “a lot” at small scale.  A 1 in 1,000,000 issue is EVERY DAMN MINUTE at scale. Design to be more robust.
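
A back-of-the-envelope Python sketch of that “every minute” claim. The 20,000 requests/second figure is an assumption I picked for illustration, not a number from the talk.

```python
def seconds_between_rare_events(requests_per_second: float,
                                odds: int = 1_000_000) -> float:
    """Average seconds between occurrences of a 1-in-`odds` per-request issue."""
    return odds / requests_per_second


if __name__ == "__main__":
    # Hypothetical traffic level; plug in your own request rate.
    gap = seconds_between_rare_events(requests_per_second=20_000)
    print(f"a one-in-a-million issue shows up about every {gap:.0f} seconds")
```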

Large systems take time to design, stabilize in prod.

Don’t assume.  Be rigorous and vigilant.

Degrade gracefully, shed load

Don’t “learn bad lessons” from retrospectives like “never touch the X!”

Capacity planning – do it just in time but be realistic.  Figure out real buffers. “Facebook with their huge custom datacenters is all nice but that’s not us.”

Hardware has lead time. [Ed: That's why it's for punks]

This is a marathon not a sprint.  You have to keep yourself healthy or you’ll crash.  Maintain your systems and yourself.

Filed under Conferences, DevOps

Velocity 2013 Day 3: benchmarking the new front end

By Emily Nakashima and Rachel Myers

bitly.com/ostrichandyak

Talking about their experiences at ModCloth…

Better performance is more user engagement, page views etc…

Basically, we’re trying to improve performance because it improves user experience.

A quick timeline on standards and js mvc frameworks from 2008 till present.

NewRelic was instrumented to get an overview of performance and performance metrics; the execs asked for a dashboard!! Execs love dashboards :)

Step 1: add a cdn; it’s an easy win!
Step 2: The initial idea was to render the easy part of the site first- 90% render.
Step 3: changed this to a single page app

BackboneJS was used to redesign the app to a single page app from the way the app was structured before.

There aren’t great tools for Ajax enabled sites to figure out perf issues. Some of the ones that they used were:
– LogNormal: rebranded as SOASTA mPulse
– newrelic
– yslow
– webpagetest
– google analytics (use for front end monitoring, check out user timings in ga)- good 1st step!
– circonus (which is the favorite tool of the presenters)

Asynchronous world yo! Track:
– featurename
– pagename
– unresponsiveness

Velocity buzzwords bingo! “Front end ops”

Filed under Cloud, Conferences

Velocity 2013 Day 3: DevOps and metrics

We’re talking about the devops survey with Gene Kim, James Turnbull and Jez Humble

4039 survey responses!

Lessons learned
– Don’t change questions midway in a survey
– Get a data analyst for survey analysis

Key findings
– devops teams are more agile; 30x more deployments, 8000x shorter lead times
– devops teams are more reliable;
– most teams use version control- 89%
– most teams use automated code deployments- 82%
– the longer you do devops, the better you get!

Hilarious John Vincent quote on devops:

[Image: John Vincent quote, 20130620-133505.jpg]

Measuring culture
Trust and verify

26% of folks who responded to the survey were from the enterprise, and 16% were from 10k and plus.
The biggest barrier to devops was culture because people didn’t get it – whether it was your manager, your team or people outside the group. Tell people more, and wear more devops shirts!!!

DevOps continues to be a culture issue versus an issue in terms of tools and processes. James Turnbull wants us to go out there and talk to people and figure out people skills!

And join the devops google+ community!

Filed under DevOps

Velocity 2013 Day 3 Liveblog: Getting Started With Configuration Management

Getting Started With Configuration Management

By @sascha_d.

Configuration management! Define and idempotently enforce system state across a bunch of machines.
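
If “idempotently enforce” is fuzzy, here’s a tiny Python sketch of the idea: describe the state you want, change things only when they differ, and make a second run a no-op. The `ensure_line` helper and the example setting are hypothetical and mine, not from the talk; real CM tools give you primitives like this for files, packages, services and so on.

```python
import tempfile
from pathlib import Path


def ensure_line(path: Path, line: str) -> bool:
    """Make sure `line` is present in the file; return True only if we changed it."""
    existing = path.read_text().splitlines() if path.exists() else []
    if line in existing:
        return False  # already in the desired state, so do nothing
    existing.append(line)
    path.write_text("\n".join(existing) + "\n")
    return True


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        conf = Path(tmp) / "example.conf"
        print(ensure_line(conf, "max_connections = 100"))  # first run makes the change
        print(ensure_line(conf, "max_connections = 100"))  # second run is a no-op
```

The point is that you can run it as many times as you like and end up in the same state, which is what separates this from a bash script that blindly appends.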

But it’s not about the tool. But you need one.

You should care about package repositories!

Anyway, she was an infra queen who loved to rewrite things by hand, but finally realized this was a blocker. Needed repeatable process. Started to look at all the CM stuff but it was really overwhelming, there’s so much out there.

Lose the baggage

Need to remove your fear, inflexibility, and arrogance.  You have some, you’re an engineer. You often need to change how you think about things.  CM and automation is not a “threat to your job.”

It’s OK not to know and to admit it.  There’s a lot of crap out there.  What’s a ruby gem?  I don’t know either. It’s OK. You can go learn what you need. Everyone’s “faking it,” that’s how it works.

“I can’t code/I don’t program/I’m not a developer.” It’s OK, you can – you don’t have to be a pro dev. You get most of the concepts from doing good CM.

It’s not “just scripts” – this is all hard work and code.

Also, just because you understand systems doesn’t mean you understand CM – and moving that understanding from the gut to the mind.

Ask why things are the way they are; don’t accept constraints just because they’re there now.

Learn your tool

Resist the urge to “automate the world.” Pick something small but impactful, light on data.

Understand the primitives of your tool.  Don’t just port your bash scripts or break out to a bash block.

Read the source code.  You don’t have to write it to read it. You will see what really happens.

Test!  Learn to test! vagrant, test-kitchen, bats, jenkins (vagrant book free this week only!)

Own or get pwned

Infrastructure is an ecosystem, and you need to have curators for the tools.

There’s acceptable and unacceptable technical debt.

Curate what you’re installing – if you just rampantly download newest from the Internet you get jacked.

Own your package repos. Have one.  Have base/custom. Don’t just have stuff sitting around.

Own your build tools – artifactory/nexus/jenkins/travis.

Own your version control. Well, more “use” version control.

Own your integrity. Don’t disable CM, don’t do different in dev and prod, don’t deploy differently in different envs… You have control over this. [Ed: We are bad about this. Whether a change is puppet or manual or rundeck is a big ass mystery in our environment.  Angry cat.]

The end!

Filed under Conferences, DevOps

Velocity 2013 Day 3 Liveblog: How to Run a Post-Mortem With Humans (Not Robots)

How to Run a Post-Mortem With Humans (Not Robots)

Got here a little late – not enough time in these breaks!!!

Dan Milstein (@danmil) of Hut 8 talking on how to build a learn-from-failure friendly culture.

1. and 2. – missed ‘em!

3. Relish the absurdities of your system.  Don’t be embarrassed when you get a new hire and you show them your sucky deployment.  Own it, enjoy it.

Axioms to follow to have a good postmortem:

  • Everyone involved acted in good faith
  • Everyone involved is competent
  • We’re doing this to find improvements

Human error is the question, not the answer. Restate the problem to include time to recovery. “Why” is fine but look at time to detection, time to resolution. Why so long?

“Which of these is the root cause?” That’s a stupid and irrelevant question. Usually there’s not one, it’s a conjunction of factors blee blee. Look for the “broadest fix.” [Ed: Need to get a "Root cause is a myth" shirt to go with my "Private cloud is a myth" one.]

Corrective actions/remediations/fixes

Incrementalism or you’re fired! You can’t boil the ocean and “replace it wholesale.” Engineers love to say “it’s so terrible we just can’t touch it we have to replace it.” No. You have 4 hours to do the simplest thing to make it better, go.
“Well… OK I guess we could put a wrapper script around it…” OK, great! [Ed: We need to do that with all our database-affecting command line tools... Wrapper script that checks replication lag and also logs who ran it... Done and done!]
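
For what it’s worth, here’s roughly what that kind of wrapper could look like as a Python sketch. Everything in it is an assumption: the 30 second threshold, the `guarded_run` name, and especially the lag lookup, which you would have to replace with a real query against your own database.

```python
import getpass
import logging
import subprocess
import sys
from typing import Callable, Sequence

MAX_LAG_SECONDS = 30.0  # hypothetical threshold, tune for your environment


def current_user() -> str:
    """Best-effort lookup of who is running the tool."""
    try:
        return getpass.getuser()
    except Exception:
        return "unknown"


def guarded_run(command: Sequence[str],
                get_lag_seconds: Callable[[], float],
                max_lag: float = MAX_LAG_SECONDS,
                runner=subprocess.run) -> int:
    """Log who ran `command`, and refuse to run it if replication lag is too high."""
    user = current_user()
    lag = get_lag_seconds()
    if lag > max_lag:
        logging.warning("refused %s for %s: replication lag %.1fs", command, user, lag)
        return 1
    logging.info("running %s for %s (replication lag %.1fs)", command, user, lag)
    return runner(list(command)).returncode


if __name__ == "__main__":
    logging.basicConfig(level=logging.INFO)
    if len(sys.argv) < 2:
        sys.exit("usage: wrapper.py COMMAND [ARGS...]")
    # Placeholder: replace this lambda with a real replication lag query.
    sys.exit(guarded_run(sys.argv[1:], get_lag_seconds=lambda: 0.0))
```

That’s about the size of a 4-hour fix: it doesn’t replace the scary tool, it just puts a guard and an audit trail in front of it.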

Don’t think about automation, think about tools. People think that computers are perfectly reliable and we should remove the humans.  Evidence shows this doesn’t work well. Skynet syndrome – lots of power, often written by those who don’t do the job.  Tools -> humans solve the problem, iterate on giving them better tools. Not everyone brings this baggage with automation but many do. “Do the routine task” – automate.  “Should I do this task and how should it be done?” – human.

Things are in partial failure mode all the time.  [Ed: Peco calls this "near miss" syndrome from the way they make flying safer - learn from near misses, not just crashes.]

To get started:

  • Elect a postmortem boss
  • Look for a goldilocks incident
  • Expect awkwardness (use some humor to defuse)
  • THERE MUST BE FIXES
  • incrementally improve the incremental improvements (and this process)

Reading list!  Dang get the slides.

Filed under Conferences, DevOps