Tag Archives: performance

Velocity 2013 Day 3: benchmarking the new front end

By Emily Nakashima and Rachel Myers

bitly.com/ostrichandyak

Talking about their experiences at mod cloth…..

Better performance is more user engagement, page views etc…

Basically, we’re trying to improve performance because it improves user experience.

A quick timeline on standards and js mvc frameworks from 2008 till present.

NewRelic was instrumented to get an overview of performance and performance metrics; the execs asked for a dashboard!! Execs love dashboard :)

Step 1: add a cdn; it’s an easy win!
Step 2: The initial idea was to render the easy part of the site first- 90% render.
Step 3: changed this to a single page app

BackboneJS was used to redesign the app to a single page app from the way the app was structured before.

There aren’t great tools for Ajax enabled sites to figure out perf issues. Some of the ones that they used were:
- LogNormal: rebranded as Soasta mpulse
- newrelic
- yslow
- webpagetest
- google analytics (use for front end monitoring, check out user timings in ga)- good 1st step!
- circonus (which is the favorite tool of the presenters)

Asynchronous world yo! Track:
- featurename
- pagename
- unresponsiveness

Velocity buzzwords bingo! “Front end ops”

Leave a comment

Filed under Cloud, Conferences

Velocity 2013 Day 2 Liveblog: Performance Troubleshooting Methodology

Stop the Guessing: Performance Methodologies for Production Systems

Slides are on Slideshare!

Brendan Gregg, Joyent

Note to the reader – this session ruled.

He’s from dtrace but he’s talking about performance for the rest of us. Coming soon, Systems Performance: Enterprises and the Cloud book.

Performance analysis – where do I start and what do I do?  It’s like troubleshooting, it’s easy to fumble around without a playbook. “Tools” are not the answer any more than they’re the answer to “how do I fix my car?”

Guessing Methodologies and Not Guessing Methodologies (Former are bad)

Anti-Methods

Traffic light anti-method

Monitors green?  You’re fine. But of course thresholds are a coarse grained tool, and performance is complex.  Is X bad?  Well sometimes, except when X, but then when Y, but…” Flase positives and false negatives abound.

You can improve it by more subjective metrics (like weather icons) – onjective is errors, alerts, SLAs – facts.

see dtrace.org status dashboard blog post

So traffic light is intuitive and fast to set up but it’s misleading and causes thrash.

Average anti-method

Measure the average/mean, assume a normal-like unimodal distribution and then focus your investigation on explaining the average.

This misses multiple peaks, outliers.

Fix this by adding histograms, density plots, frequency trails, scatter plots, heat maps

Concentration game anti-method

Pick a metric, find another that looks like it, investigate.

Simple and can discover correlations, but it’s time consuming and mostly you get more symptoms and not the cause.

Workload characterization method

Who is causing the load, why, what, how. Target is the workload not the performance.

lets you eliminate unnecessary work. Only solves load issues though, and most things you examine won’t be a problem.

[Ed: When we did our Black Friday performance visualizer I told them "If I can't see incoming traffic on the same screen as the latency then it's bullshit."]

USE method

For every resource, check utilization, saturation, errors.

util: time resource busy

sat: degree of queued extra work

Finds your bottlenecks quickly

Metrics that are hard to get become feature requests.

You can apply this methodology without knowledge of the system (he did the Apollo 11 command module as an example).

See the use method blog post for detailed commands

For cloud computing you also need the “virtual” resource limits – instance network caps. App stuff like mutex locks and thread pools.  Decompose the app environment into queueing systems.

[Ed: Everything is pools and queues...]

So go home and for your system and app environment, create a USE checklist and fill out metrics you have. You know what you have, know what you don’t have, and a checklist for troubleshooting.

So this is bad ass and efficient, but limited to resource bottlenecks.

Thread State Analysis Method

Six states – executing, runnable, anon paging, sleeping, lock, idle

Getting this isn’t super easy, but dtrace, schedstats, delay accounting, I/O accounting, /proc

Based on where the time is leads to direct actionables.

Compare to e.g. database query time – it’s not self contained. “Time spent in X” – is it really? Or is it contention?

So this identifies, quantifies, and directs but it’s hard to measure all the states atm.

There’s many more if perf is your day job!

Stop the guessing and go with ones that pose questions and seek metrics to answer them.  P.S. use dtrace!

dtrace.org/blogs/brendan

Leave a comment

Filed under Conferences, DevOps

Velocity 2013 Day 1 Liveblog – Hands-on Web Performance Optimization Workshop

OK we’re wrapping up the programming on Day 1 of Velocity 2013 with a Hands-on Web Performance Optimization Workshop.

Velocity started as equal parts Web front end performance stuff and operations; I was into both but my path lead me more to the operations side, but now I’m trying to catch up a bit – the whole CSS/JS/etc world has grown so big it’s hard to sideline in it.  But here I am!  And naturally performance guru Steve Souders is here.  He kindly asked about Peco, who isn’t here yet but will be tomorrow.

One of the speakers is wearing a Google Glass, how cute.  It’s the only other one I’ve seen besides @victortrac’s. Oh, the guy’s from Google, that explains it.

@sergeyche (TruTV), @andydavies (Asteno), and @rick_viscomi (Google/YouTube) are our speakers.

We get to submit URLs in realtime for evaluation at man.gl/wpoworkshop!

Tool Roundup

Up comes webpagetest.org, the great Web site to test some URLs. They have a special test farm set up for us, but the abhorrent conference wireless largely prevents us from using it. “It vill disappear like pumpkin vunce it is over” – sounds great in a Russian accent.

YSlow the ever-popular browser extension is at yslow.org.

Google Pagespeed Insights is a newer option.

showslow.com trends those webpagetest metrics over time for your site.

Real Page Tests

Hmm, since at Bazaarvoice we don’t really have pages per se, we’re just embedded in our clients’ sites, not sure what to submit!  Maybe I’ll put in ni.com for old times’ sake, or a BV client. Ah, Nordstrom’s already submitted, I’ll add Yankee Candle for devious reasons of my own.

redrobin.com – 3 A’s, 3 F’s. No excuse for not turning on gzip. Shows the performance golden rule – 10% of the time is back end and 90% is front end.

“Why is my time to first byte slow?”  That’s back end, not front end, you need another tool for that.

nsa.gov – comes back all zeroes.  General laughter.

Gus Mayer – image carousel, but the first image it displays is the very last it loads.  See the filmstrip view to see how it looks over time. Takes like 6 seconds.

Always have a favicon – don’t have it 404. And especially don’t send them 40k custom 404 error pages. [Ed. I'll be honest, we discovered we were doing that at NI many years ago.] It saves infrastructure cost to not have all those errors in there.

Use 85% lossy compression on images.  You can’t tell even on this nice Mac and it saves so much bandwidth.

sitespeed.io will crawl your whole site

speedcurve is a paid service using webpagetest.

Remember webpagetest is open source, you can load it up yourself (“How can we trust your dirty public servers!?!” says a spectator).

Mobile

webpagetest has some mobile agents

httpwatch for iOS

1 Comment

Filed under Conferences, DevOps

Service Delivery Best Practices?

So… DevOps.  DevOps vs NoOps has been making the rounds lately. At Bazaarvoice we are spawning a bunch of decentralized teams not using that nasty centralized ops team, but wanting to do it all themselves.  This led me to contemplate how to express the things that Operations does in a way that turns them into developer requirements?

Because to be honest a lot of our communication in the Ops world is incestuous.  If one were to point a developer (or worse, a product manager) at the venerable infrastructures.org site, they’d immediately poop themselves and flee.  It’s a laundry list of crap for neckbeards, but spends no time on “why do you want this?” “What value will this bring to your customer?”

So for the “NoOps” crowd who want to do ops themselves – what’s an effective way to state these needs?  I took previous work – ideas from the pretty comprehensive but waterfally “Systems Development Framework” Peco and I did for NI, Chris from Bazaarvoice’s existing DevOps work – and here’s a cut.  I’d love feedback. The goal is to have a small discrete set of areas (5-7), with a small discrete set (5-7) of “most important” items – most importantly, stated in a way so that people understand that these aren’t “annoying things some ops guy wants me to do instead of writing 3l33t code” but instead stated as they should be, customer facing requirements like any other.  They could then be each broken into subsidiary more specific user stories (probably a whole lot of them, to be honest) but these could stand in as “epics” or “pillars” for making your Web thing work right.

Service Delivery Best Practices

Availability

  • I will make my service highly available, so that customers can use it constantly without issues
  • I will know about any problems with my service’s delivery to customers
  • I can restore my service quickly from any disaster or loss
  • I can make my data resistant to corruption from runtime issues
  • I will test my service under routine disruption to make it rugged and understand its failure modes

Performance

  • I know how all parts of the product perform and can measure performance against a customer SLA
  • I can run my application and its data globally, to serve our global customer base
  • I will test my application’s performance and I know my app’s limitations before it reaches them

Support

  • I will make my application supportable by the entire organization now and in the future
  • I know what every user hitting my service did
  • I know who has access to my code and data and release mechanism and what they did
  • I can account for all my servers, apps, users, and data
  • I can determine the root cause of issues with my service quickly
  • I can predict the costs and requirements of my service ahead of time enough that it is never a limiter
  • I will understand the communication needs of customers and the organization and will make them aware of all upgrades and downtime
  • I can handle incidents large and small with my services effectively

Security

  • I will make every service secure
  • I understand Web application vulnerabilities and will design and code my services to not be subject to them
  • I will test my services to be able to prove to customers that they are secure
  • I will make my data appropriately private and secure against theft and tampering
  • I will understand the requirements of security auditors and expectations of customers regarding the security of our services

Deployment

  • I can get code to production quickly and safely
  • I can deploy code in the middle of the day with no downtime or customer impact
  • I can pilot functionality to limited sets of users
  • I will make it easy for other teams to develop apps integrated with mine

Also to capture that everyone starting out can’t and shouldn’t do all of this… We struggled over whether these were “Goals” or “Best Practices” or what… On the one hand you should only put in e.g. “as much security as you need” but on the other hand there’s value in understanding the optimal condition.

2 Comments

Filed under DevOps

Application Performance Management in the Cloud

Cloud computing has been the buzz and hype lately and everybody is trying to understand what it is and how to use it. In this post, I wanted to explore some of the properties of “the cloud” as they pertain to Application Performance Management. If you are new to cloud offerings, here are a few good materials to get started, but I will assume the reader of this post is somewhat familiar with cloud technology.

As Spiderman would put it, “With great power comes great responsibility” … Cloud abstracts some of the infrastructure parts of your system and gives you the ability to scale up resource on demand. This doesn’t mean that your applications will magically work better or understand when they need to scale, you still need to worry about measuring and managing the performance of your applications and providing quality service to your customers. In fact, I would argue APM is several times more important to nail down in a cloud environment for several reasons:

  1. The infrastructure your apps live in is more dynamic and volatile. For example, cloud servers are VMs and resources given to them by the hypervisor are not constant, which will impact your application. Also, VMs may crash/lock up and not leave much trace of what was the cause of issues with your application.
  2. Your applications can grow bigger and more complex as they have more resources available in the cloud. This is great but it will expose bottlenecks that you didn’t anticipate. If you can’t narrow down the root cause, you can’t fix it. For this, you need scalable APM instrumentation and precise analytics.
  3. On-demand scaling is a feature that is touted by all cloud providers but it all hinges on one really hard problem that they won’t solve for you: Understanding your application performance and workload behave so that you can properly instruct the cloud scaling APIs what to do. If my performance drops do I need more machines? How many? APM instrumentation and analytics will play a key part of how you can solve the on-demand scaling problem.

So where is APM for the cloud, you ask? It is in its very humble beginnings as APM solution providers have to solve the same gnarly problems application developers and operations teams are struggling with:

  1. Cloud is dynamic and there are no fixed resources. IP addresses, hostnames, and MAC addresses change, among other things. Properly addressing, naming and connecting the machines is a challenge – both for deep instrumentation agent technology and for the apps themselves.
  2. Most vendors’ licensing agreements are based on fixed count of resources billing rather than a utility model, and are therefore difficult to use in a cloud environment. Charging a markup on the utility bill seems to be a common approach among the cloud-literate but introduces a kink in the regular budgeting process and no one wants to write a variable but sizable blank check.
  3. Well the cloud scales… a lot. The instrumentation / monitoring technology has to be low enough overhead and be able to support hundreds of machines.

On the synthetic monitoring end there are plenty of options. We selected AlertSite, a distributed synthetic monitoring SaaS provider similar to Keynote and Gomez, for simplicity but in only provides high level performance and availability numbers for SLA management and “keep the lights are on” alerting. We also rely on CloudKick, another SaaS provider, for the system monitoring part. Here is a more detailed use case on the Cloudkick implementation.

For deeper instrumentation we have worked with OPNET to deploy their AppInternals Xpert (former Opnet Panorama) solution in the Amazon EC2 cloud environment. We have already successfully deployed AppInternals Xpert on our own server infrastructure to provide deep instrumentation and analysis of our applications. The cloud version looks very promising for tackling the technical challenges introduced by the cloud, and once fully deployed, we will have a lot of capability to harness the cloud scale and performance. More on this as it unfolds…

In summary, as vendors sell you the cloud but be prepared to tackle the traditional infrastructure concerns and stick to your guns. No, the cloud is not fixing your performance problems and is not going to magically scale for you. You will need some form of APM to help you there. Be on the lookout for providers and do challenge your existing partners to think of how they can help you. The APM space in the cloud is where traditional infrastructure was in 2005 but things are getting better.

To the cloud!!! (And do keep your performance engineers on staff, they will still be saving your bacon.)

Peco

Leave a comment

Filed under Cloud

Velocity 2010 – Performance Indicators In The Cloud

Common Sense Performance Indicators in the Cloud by Nick Gerner (SEOmoz)

SEOmoz has been  EC2/S3 based since 2008.  They scaled from 50 to 500 nodes.  Nick is a developer who wanted him some operational statistics!

Their architecture has many tiers – S3, memcache, appl, lighttpd, ELB.  They needed to visualize it.

This will not be about waterfalls and DNS and stuff.  He’s going to talk specifically about system (Linux system) and app metrics.

/proc is the place to get all the stats.  Go “man proc” and understand it.

What 5 things does he watch?

  • Load average – like from top.  It combines a lot of things and is a good place to start but explains nothing.
  • CPU – useful when broken out by process, user vs system time.  It tells you who’s doing work, if the CPU is maxed, and if it’s blocked on IO.
  • Memory – useful when broken out by process.  Free, cached, and used.  Cached + free = available, and if you have spare memory, let the app or memcache or db cache use it.
  • Disk – read and write bytes/sec, utilization.  Basically is the disk busy, and who is using it and when?  Oh, and look at it per process too!
  • Network – read and write bytes/sec, and also the number of established connections.  1024 is a magic limit often.  Bandwidth costs money – keep it flat!  And watch SOA connections.

Perf Monitoring For Free

  1. data collection – collectd
  2. data storage- rrdtool
  3. dashboard management – drraw

They put those together into a dashboard.  They didn’t want to pay anyone or spend time managing it.  The dynamic nature of the cloud means stuff like nagios have problems.

They’d install collectd agents all over the cluster.  New nodes get a generic config, and node names follow a convention according to role.

Then there’s a dedicated perf server with the collectd server, a Web server, and drraw.cgi.  In a security group everyone can connect in to.

Back up your performance data- it’s critical to have history.

Cloudwatch gives you stuff – but not the insight you have when breaking out by process.  And Keynote/Gomez stuff is fine but doesn’t give you the (server side) nitty gritty.

More about the dashboard. Key requirements:

  • Summarize nodes and systems
  • Visualize data over time
  • Stack measurements per process and per node
  • Handle new nodes dynamically w/o config chage

He showed their batch mode dashboard.  Just a row per node, a metric graph per column.  CPU broken out by process with load average superimposed on top.  You see things like “high load average but there’s CPU to spare.”  Then you realize that disk is your bottleneck in real workloads.  Switch instance types.

Memory broken out by process too.  Yay for kernel caching.

Disk chart in bytes and ops.  The steady state, spikes, and sustained spikes are all important.

Network – overlay the 95th percentile cause that’s how you get billed.

Web Server dashboard from an API server is a little different.

Add Web requests by app/request type.  app1, app2, 302, 500, 503…  You want to see requests per second by type.

mod_status gives connections and children idleness.

System wide dashboard.  Each graph is a request type, then broken out by node.  And aggregate totals.

And you want median latency per request.  And any app specific stuff you want to know about.

So get the basic stats, over time, per node, per process.

Understand your baseline so you know what’s ‘really’ a spike.

Ad hoc tools -try ‘em!

  • dstat -cdnml for system characteristics
  • iotop for per process disk IO
  • iostat -x 3 for detailed disk stats
  • netstat -tnp for per process TCP connection stats

His slides and other informative blog posts are at nickgerner.com.

A good bootstrap method… You may want to use more/better tools but it’s a good point that you can certainly do this amount for free with very basic tooling, so something you pay for best be better! I think the “per process” intuition is the best takeaway; a lot of otherwise fancy crap doesn’t do that.

But in the end I want more – baselines, alerting, etc.

Leave a comment

Filed under Cloud, Conferences, DevOps

Velocity 2010 – Day 3 Demos and More

Check out the Velocity 2010 flickr set!  And YouTube channel!

Time for lightning demos.

HTTPWatch

A HTTP browser proxy that does the usual waterfalls from your Web pages.  Version 7 is out!  You can change fonts.  They work in IE and Firefox both, unlike the other stuff.

Rather than focus on ranking like YSlow/PageSpeed, they focus on showing specific requests that need attention.  Same kind of data but from a different perspective.  And other warnings, not all strictly perf related.  Security, etc.  Exports and consumes HAR files (the new standard for http waterfall interchange).

webpagetest.org

Based on AOL Pagetest, an IE module, but hosted.  Can be installed as a private site too.  It provides obejct timing and waterfalls.  Allows testing from multiple locations and network speeds, saved them historically. Like a free single-snapshot version of Keynote/Gomez kinds of things.

Shows stuff like browser CPU and bandwidth utilization and does visual comparisons, showing percieved performance in filmstrip view and video.

And does HAR  import/export!  Ah, collaboration.

The CPU/net metrics show some “why” behind gaps in the waterfalls.

The filmstrip side by side view shows different user experiences very well.  And you can do video, as that is sexy.

They have a new UI built by a real designer (thanks Neustar!).

Speed Tracer

But what about Chrome, you ask?  We have an extension for that now.  Similar to PageSpeed.  The waterfall timeline is beautiful, using real “Google Finance” style visualization.  The other guys aren’t like RRDTool ugly but this is super purty.

It will deobfuscate JavaScript and integates it with Eclipse!

They’re less worried about network waterfall and more about client side render.  A lot of the functionality is around that.

You can send someone a Speed Trace dump file for debug.

Fiddler2

Are you tired of being browser dependent?  Fiddler has your back.

New features…  Hey, platform preview 3 for IE9 is out.  It has some tools for capture and export.; it captures traffic in a XML serialised HAR.  Fiddler imports JSON HAR from norms and the IE9 HAR!  And there’s HAR 1.1!  Eeek.  And wcat.  It imports lots of different stuff in other words.

I want one of these to take in Wireshark captures and rip out all the crap and give me the HTTP view!

FiddlerCap Web recorder (fiddlercap.com) lets people record transactions and send it to you.

Side by side viewingwith 2 fiddlers  if you launch with -viewer.

There’s a comparison extension called differ.  Nice!

You can replay captures, including binarieis now, with the AutoResponder tab.  And it’ll play latency soon.

We still await the perfect HTTP full capture and replay toolchain… We have our own HTTP log replayer we use for load tests and regression testing, if we could do this but in volume it would rock…

Caching analysis.  FiddlerCore library you can put in your app.

Now, Bobby Johnson of Facebook speaks on Moving Fast.

Building something isn’t hard, but you don’t know how people will use it, so you have to adapt quickly and make faster changes.

How do you get to a fast release cycle?  Their biggest requirement is “the site can’t go down.”  So they go to frequent small changes.

Most of the challenge when something goes wrong isn’t fixing it, it’s finding out what went wrong so you can fix it.  Smaller changes make that easier.

They take a new thing, push some fake traffic to it, then push a % of user traffic, and then dial back if there’s problems.

If you aren’t watching something, it’ll slip.  They had their performance at 5s; did a big improvement and got it to 2.5.  But then it slips back up.  How do you keep it fast (besides “don’t change anything?”)  They changed their organization to make quick changes but still maintain performance.

What makes a site get slow?  It’s not a deep black art.  A lot of it is jus tnot paying attention to your performance – you can’t foresee new code to not have bugs or not have performance problems.

  • New code is slow
  • More code is slow

“Age the code” by allocating time to shaking out performance problems – not just before, but after deploy.

  • Big pipe – break the page into small pieces and pipelines it.
  • Primer – a JavaScritp library that bootstraps by dl’ing the minimum first.

Both separate code into a fast path and a slow path and defaults to the slow path.

I have a new “poke” feature I want to try… I add it in as a lazy loaded thing and see if anyone cares, before I spend huge optimization time.

It gets popular!  OK, time to figure out performance on the fast path.

So they engineer performance per feature, which allows prioritization.

You can have a big metric collection and reporting tool.  But that’s different from alerting tools.  Granularity of alerting – no one wants to get paged on “this one server is slow.”  But no one cares about “the whole page is 1% slower” either.  But YOUR FEATURE is 50% slower than it was – that someone cares about.  Your alert granularity needs to be at the level of “what a single person works on.”

No one is going to fix things if they don’t care about it.  And also not unless they have control over it (like deploying code someone else wrote and being responsible for it breaking the site!). And they need to be responsible for it.

They have tried various structures.  A centralized team focused on performance but doesn’t have control over it (except “say no” kinds of control).

Saying “every dev is responsible for their perf” distributes the responsibility well, but doesn’t make people care.

So they have adopted a middle road.  There’s a central team that builds tools and works with product teams.  On each product team there is a performance point person.  This has been successful.

Lessons learned:

  • New code is slow
  • Give developers room to try things
  • Nobody’s job is to say no

Joshua from Strangeloop on The Mobile Web

Here we all know that performance links directly to money.

But others (random corporate execs) doubt it.  And what about the time you have to spend?  And what about mobile?

We need to collect more data

Case study – with 65% performance increase, 6% order size increase and 9% conversion increase.

For mobile, 40% perf led to 3% order size and 5% conversion.

They have a conversion rate fall-off by landing page speed graph, so you can say what a 2 second improvement is worth.  And the have preliminary data on mobile too.

I think  he’s choosing a very confusing way to say you need mttrics to establish the ROI of performance changes.  And MOBILE IS PAYING MONEY RIGHT NOW!

Cheryl Ainoa from Yahoo! on Innovation at Scale

The challenges of scale – technical complexity and outgrowing many tools and techniques, there are no off hours, and you’re a target for abuse.

Case Study: Fighting Spam with Hadoop

Google Groups was sending 20M emails/day to Taiwan and there’s only 18M Internet users in Taiwan.  What can help?  Nothing existing could do that volume (spamcop etc.)  And running their rules takes a couple days.  So they used hadoop to drive indications of “spammy” groups in parallel. Cut mail delivered by 5x.

Edge pods – small compute footprints to optimize cost and performance.  You can’t replicate your whole setup globally.  But adding on to a CDN is adding some compute capability to the edge in “pods.”  They have a proxy called YCPI to do this with.

And we’re out of time!

Leave a comment

Filed under Conferences, DevOps