Tag Archives: Operations

Velocity 2010 – Drizzle

Monty Taylor from Rackspace talked about Drizzle, a MySQL variant “built for operations“. My thoughts will be in italics so you can be enraged at the right party.

Drizzle is “a database for the cloud”.  What does that even mean?  It’s “the next Web 2.0”, which is another way of saying “it’s the new hotness, beeyotch” (my translation).

mySQL scaling to multiple machines brings you sadness.  And mySQL deploy is crufty as hell.  So step 1 to Drizzle recovery is that they realized “Hey, we’re not the end all be all of the infrastructure – we’re just one piece people will be putting into their own structure.”  If only other software folks would figure that out…

Oracle-style vertical scaling is lovely, using a different and lesser definition of scaling.  Cloud scaling is extreme!  <Play early 1990s music>  It requires multiple machines.

They shard.  People complain about sharding, but that’s how the Internet works – the Internet is a bunch of sites sharded by functionality.  QED.

“Those who don’t know UNIX are doomed to repeat it.”  The goal (read about the previous session on toolchains) is to compose stuff easily, string them together like pipes in UNIX.  But most of the databases still think of themselves as a big black box in the corner, whose jealous priests guard it from the unwashed heathen.

So what makes Drizzle different? In summary:

  • Less features
  • Ops driven
  • Sane config
  • Plugins

Less features means less ways for developers to kill you.  Oracle’s “run Java within the database” is an example of totally retarded functionality whose main job is to ruin your life. No stored procedures, no triggers, no prepared statements.  This avoids developer sloppiness.  “Insert a bunch of stuff, then do a select and the database will sort it!” is not appropriate thinking for scale.

Ops driven means not marketing driven, which means driven by lies.  For example, there are no marketdroids who want them to add a mySQL event scheduler when cron exists.  Or “we could sell more if we had ANSI-compliant stored procedures!”  They don’t have a company letting the nasty money affect their priorities.

They don’t do competitive benchmarks, as they are all lies.  That’s for impartial third parties to do.  They do publish their regression tests vs themselves for transparency.

You get Drizzle via distros.  There are no magic “gold” binaries, and people that do that are evil.  But distros sometimes get behind.  (Their build tooling is pandora-build.)

They have sane defaults.  If most people are going to set something (like FRICKING INNODB), it’s the default.  To install Drizzle, the only mandatory thing to specify is the data directory.

Install from apt/yum works.  Or configure/make/make install and run drizzled.  No bootstrap, no system tables, whatever.

They use plugins.  mySQL plugins are a pain, more of a patch really.  You can just add them at startup time, no SQL from a sysadmin.  And no loading during runtime – see “less features” above.  This is still in progress, especially config file snippets.  But plugins are the new black.

They have pluggable protocols.  It ships with mySQL and Drizzle protocols, but you can plug in console, HTTP/REST, or whatever.  Maybe dbus…  Their in-progress Drizzle protocol removes the potential for SQL injection by only delivering one query per packet, has a sharding key in the packet header, and supports HTTP-like redirects…
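
To make the “sharding key in the packet header” idea concrete, here’s a minimal sketch of what a single-query packet with a routing key could look like – purely illustrative Python, not the actual Drizzle wire format; the field layout is my own invention.

```python
import struct

# Hypothetical layout: 2-byte version, 8-byte sharding key, 4-byte query
# length, then exactly one UTF-8 query (no multi-statement batches, which
# is part of how a one-query-per-packet protocol sidesteps SQL injection).
HEADER = struct.Struct("!HQI")

def pack_query(shard_key: int, query: str, version: int = 1) -> bytes:
    body = query.encode("utf-8")
    return HEADER.pack(version, shard_key, len(body)) + body

def unpack_query(packet: bytes) -> tuple[int, int, str]:
    version, shard_key, length = HEADER.unpack_from(packet)
    body = packet[HEADER.size:HEADER.size + length]
    return version, shard_key, body.decode("utf-8")

# A router could read just the fixed-size header, look at shard_key, and
# either forward the packet or answer with an HTTP-like redirect pointing
# at the server that owns that shard.
pkt = pack_query(shard_key=42, query="SELECT name FROM users WHERE id = 42")
print(unpack_query(pkt))
```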

libdrizzle has both client and server ends, and talks mySQL and Drizzle.

So what about my app that always auths to the database with its one embedded common username/password?  Well, you can do none, PAM, LDAP (done well), or HTTP.  You just say authenticate(user, pass) and it does it.  It has pluggable authorization too – none, LDAP, or hard-coded.
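
Here’s a rough sketch of what a pluggable authenticate(user, pass) contract can look like – illustrative Python on my part, not Drizzle’s actual plugin API – with a “none” and a hard-coded backend as examples.

```python
from abc import ABC, abstractmethod

class AuthPlugin(ABC):
    """One method: yes or no.  The server doesn't care how you decide."""
    @abstractmethod
    def authenticate(self, user: str, password: str) -> bool: ...

class AllowAll(AuthPlugin):          # the "none" option
    def authenticate(self, user, password):
        return True

class HardCoded(AuthPlugin):         # the embedded-password option
    def __init__(self, creds):       # creds: dict of user -> password
        self.creds = creds
    def authenticate(self, user, password):
        return self.creds.get(user) == password

# A PAM or LDAP plugin would implement the same one-method interface,
# e.g. by binding to the directory with the supplied credentials.
auth: AuthPlugin = HardCoded({"app": "s3cret"})
print(auth.authenticate("app", "s3cret"))   # True
```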

There is a pluggable query filter that can detect and stop dumb queries – without requiring a proxy.
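
A query filter plugin is the same idea – a hook that sees each statement before execution and can veto it.  A toy sketch (the heuristics are mine, not Drizzle’s):

```python
import re

# Naive "dumb query" heuristics: unbounded SELECT * and DELETEs with no WHERE.
DUMB_PATTERNS = [
    re.compile(r"^\s*select\s+\*\s+from\s+\w+\s*;?\s*$", re.IGNORECASE),
    re.compile(r"^\s*delete\s+from\s+\w+\s*;?\s*$", re.IGNORECASE),
]

def allow_query(sql: str) -> bool:
    return not any(p.match(sql) for p in DUMB_PATTERNS)

print(allow_query("SELECT * FROM users"))                 # False - rejected
print(allow_query("SELECT id FROM users WHERE id = 7"))   # True
```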

It has pluggable logging – none, syslog, gearman, etc. – and errors too.

Pluggable replication!  A new scheme based on Google protocol buffers, readable in Java, Python, and C++.  It’s logical change (not quite row) based.  Combined with protocol redirects, it’s db migration made easy!
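
The replication stream is a series of logical change messages serialized with Google protocol buffers.  I don’t have the actual .proto in front of me, so here’s a hand-waved Python stand-in for what a logical (not quite row) change record might carry – the field names are my guess, and JSON stands in for protobuf just to keep the sketch dependency-free.

```python
from dataclasses import dataclass, field, asdict
import json
import time

@dataclass
class LogicalChange:
    """Stand-in for a protobuf replication message (fields are my guess)."""
    schema: str
    table: str
    operation: str                       # INSERT / UPDATE / DELETE
    values: dict = field(default_factory=dict)
    timestamp: float = field(default_factory=time.time)

def serialize(change: LogicalChange) -> bytes:
    # The real thing is protobuf (readable from Java, Python, and C++).
    return json.dumps(asdict(change)).encode("utf-8")

change = LogicalChange("shop", "orders", "INSERT", {"id": 1, "total": 9.99})
print(serialize(change))
```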

Boots!  A new command line client, on launchpad.net/boots.  It’s pluggable, scriptable, pipes SQL queries, etc.

P.S. mySQL can lick me! (I’m paraphrasing, but only a little.)

Filed under Conferences, DevOps

Velocity 2010 – Facebook Operations

How The Pros Do It

Facebook Operations – A Day In The Life by Tom Cook

Facebook has been very open about their operations and it’s great for everyone.  This session is packed way past capacity.  Should be interesting.  My comments are  in italics.

Every day, 16 billion minutes are spent on Facebook worldwide.  It started in Zuckerberg’s dorm room and now is super huge, with tens of thousands of servers and its own full scale Oregon data center in progress.

So what serves the site?  It’s reasonably straightforward.  Load balancer, web servers, services servers, memory cache, database.  They wrote and 100% use HipHop for PHP, once they outgrew Apache+mod_php – it bakes PHP down to compiled C++.  They use loads of memcached, and use sharded mySQL for the database.  OS-wise it’s all Linux – CentOS 5 actually.
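
The memcached layer is the classic cache-aside pattern: check the cache, fall back to the (sharded) database on a miss, then populate the cache for next time.  A minimal sketch, assuming the python-memcached client, with a placeholder query_mysql_shard() standing in for the real sharded lookup:

```python
import memcache  # pip install python-memcached

mc = memcache.Client(["127.0.0.1:11211"])

def query_mysql_shard(user_id):
    # Placeholder for a real query against whichever shard owns this user.
    return {"id": user_id, "name": "example"}

def get_user(user_id, ttl=300):
    key = "user:%d" % user_id
    user = mc.get(key)                       # 1. try the cache
    if user is None:
        user = query_mysql_shard(user_id)    # 2. miss: hit the database
        mc.set(key, user, time=ttl)          # 3. populate for next time
    return user

print(get_user(42))
```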

All the site functionality is broken up into separate discrete services – news, search, chat, ads, media – and composed from there.

They do a lot with systems management.  They’re going to focus on deployment and monitoring today.

They see two sides to systems management – config management and on demand tools.  And CM is priority 1 for them (and should be for you).  No shell scripting/error checking to push stuff.  There are a lot of great options out there to use – cfengine, puppet, chef.  They use cfengine 2!  Old school alert!  They run updates every 15 minutes (each run only takes like 30s).

It means it’s easy to make a change, get it peer reviewed, and push it to production.  Their engineers have fantastic tools and they use those too (repo management, etc.)

On demand tools do deliberate fix or data gathering.  They used to use dsh but don’t think stuff like capistrano will help them.  They wrote their own!  He ran a uname -a across 10k distributed hosts in 18s with it.
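
Their tool is internal, but the shape of it – fan a command out over SSH with bounded concurrency and collect the results – is easy to sketch.  A rough Python version (definitely not their code, and nowhere near 10k-hosts-in-18-seconds fast):

```python
import subprocess
from concurrent.futures import ThreadPoolExecutor, as_completed

def run_on(host, command, timeout=30):
    proc = subprocess.run(
        ["ssh", "-o", "BatchMode=yes", host, command],
        capture_output=True, text=True, timeout=timeout,
    )
    return host, proc.returncode, proc.stdout.strip()

def run_everywhere(hosts, command, workers=200):
    # Bounded fan-out; the real thing goes far wider than a thread pool.
    with ThreadPoolExecutor(max_workers=workers) as pool:
        futures = [pool.submit(run_on, h, command) for h in hosts]
        for fut in as_completed(futures):
            yield fut.result()

if __name__ == "__main__":
    for host, rc, out in run_everywhere(["web001", "web002"], "uname -a"):
        print(host, rc, out)
```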

Up a layer to deployments.  Code is deployed two ways – there’s front end code and back end deployments.  The Web site, they push at least once a day and sometimes more.  Once a week is new features, the rest are fixes etc.  It’s a pretty coordinated process.

Their push tool is built on top of the mystery on demand tool.  They distribute the actual files using an internal BitTorrent swarm, and scaling issues are nevermore!  Takes 1 minute to push 100M of new code to all those 10k distributed servers.  (This doesn’t include the restarts.)

On the back end, they do it differently.  Usually you have engineering, QA, and ops groups and that causes slowdown.  They got rid of the formal QA process and instead built that into the engineers.  Engineers write, debug, test, and deploy their own code.  This lets devs quickly see the response from subsets of real traffic and make performance decisions – it relies on the culture being very intense.  No “commit and quit.”  Engineers are deeply involved in the move to production.  And they embed ops folks into engineering teams so it’s not one huge dev group interfacing with one huge ops group.  Ops participates in architectural decisions and better understands the apps and their needs.  They can also interface with other ops groups more easily.  Of course, those ops people have to do monitoring/logging/documentation in common.

Change logging is a big deal.  They want the engineers to have freedom to make changes, and just log what is going on.  All changes, plus start and end time.  So when something degrades, ops goes to that guy ASAP – or can revert it themselves.  They have a nice internal change log interface that’s all social.  It includes deploys and “switch flips”.

Monitoring!  They like ganglia even though it’s real old.  But it’s fast and allows rapid drilldown.  They update every minute; it’s just RRD and some daemons.  You can nest grids and pools.  They’re so big they have to shard ganglia horizontally across servers and store RRDs in RAM, but you won’t need to do that.

They also have something called ODS (operational data store) which is more application focused and has history, reporting, better graphs.  They have soooo much data in it.

They also use nagios, even though “that’s crazy”.  Ping testing, SSH testing, Web server on a port.  They distribute it and feed alerting into other internal tools to aggregate, with nagios as an execution back end.  Aggregating into alarm clumps is critical, and decisions are made based on a tiered data structure – feeding into self-healing, etc.  They have a custom interface for it.

At their size, there are some kind of failures going on constantly.  They have to be able to push fixes fast.

They have a lot of rack/cluster/datacenter etc levels of scale, and they are careful to understand dependencies and failure states among them.

They have constant communication – IRC with bots, internal news updates, “top of page” headers on internal tools, change log/feeds.  And using small teams.

How many users per engineer?  At Facebook, 1.1 million – but 2.3 million per ops person!  This means a 2:1 dev to ops ratio, I was going to ask…

To recap:

  • Version control everything
  • Optimize early
  • Automate, automate, automate
  • Use configuration management.  Don’t be a fool with your life.
  • Plan for failure
  • Instrument everything.  Hardware, network, OS, software, application, etc.
  • Don’t spend time on dumb things – you can slow people down if you’re “that guy.”
  • Priorities – Stability, support your engineers

Check facebook.com/engineering for their blog!  And facebook.com/opensource for their tools.

Filed under Conferences, DevOps

Velocity 2010: Day 2 Keynotes Continued

Back from break and it’s time for more!  The dealer room was fricking MOBBED.  And this is with a semi-competing cloud computing convention, Structure,  going on down the road.

Lightning Demos

Time for a load of demos!

dynaTrace

Up first is dynaTrace, a new hotshot in the APM (application performance management) space.  We did a lot of research in this space (well, the sub-niche in this space for deep dive agent-driven analytics), eventually going with Opnet Panorama over CA/Wily Introscope, Compuware, HP, etc.  dynaTrace broke onto the scene since and it seems pretty pimp.  It does traditional metric-based continuous APM and also has a “PurePath” technology where they inject stuff so you can trace a transaction through the tiers it travels along, which is nice.

He analyzed the FIFA page for performance.  dynaTrace proper is more of a big agenty thing, but they have a free AJAX Edition analyzer that’s more lightweight/client based.  A lot of the agent-based perf vendors just care about that part, and it’s great that they are also looking at front end performance because it all ties together into the end user experience.  Anyway, he shows off the AJAX Edition, which does a lot of nice performance analysis on your site.  Version 2.0 of the AJAX Edition is out tomorrow!

And, they’re looking at how to run in the cloud, which is important to us.

Firebug

A Firefox plugin for page inspection, but if you didn’t know about Firebug already you’re fired.  Version 1.6 is out!  And there’s like 40 addons for it now, not just YSlow etc, but you don’t know about them – so they’re putting together “swarms” so you can get more of ’em.

In the new version you can see paint events.  And export your findings in HAR format for portability across tools of this sort.  Like httpwatch, pagespeed, showslow.  Nice!

They’ve added breakpoints on network and HTML events.  “FireCookie” lets you mess with cookies (and breakpoints on this too).

YSlow

A Firebug plugin; it was first on the scene in terms of awesome Web page performance analysis.  showslow.com and gtmetrix.com complement it.  You can make custom rules now.  WTF (Web Testing Framework) is a new YSlow plugin that tests for more shady dev practices.

PageSpeed

PageSpeed is like YSlow, but from Google!  They’ve been working on turning the core engine into an SDK so it can be used in other contexts.  Helps identify optimizations to improve time to first paint, making JS/CSS recommendations.

The Big Man

Now it’s time for Tim O’Reilly to lay down the law on us.  He wrote a blog post on operations being the new secret sauce that kinda kicked off the whole Velocity conference train originally.

Tim’s very first book was the Masscomp System Administrator’s Guide, back in 1983.  System administration was the core that drove O’Reilly’s book growth for a long time.

Applications’ competitive advantage is being driven by large amounts of data.  Data is the “intel inside” of the new generation of computing.  The “internet operating system” being built is a data operating system.  And mobile is the window into it.

He mentioned OpenCV (computer vision) and Collective Intelligence, which ended up using the same kinds of algorithms.  So that got him thinking about sensors and things like Arduino.  And the way technology evolves is hackers/hobbyists to innovators/entrepreneurs.  RedLaser, Google Goggles, etc. are all moves towards sensors and data becoming pervasive.  Stuff like the NHIN (nationwide health information network).  CabSense.  AMEE, “the world’s energy meter” (or at least the UK’s), determined you can tell the make and model of someone’s appliances based on the power spike!  Passur has been gathering radar data from sensors, feeding it through algorithms, and now doing great prediction.

Apps aren’t just consumed by humans, but by other machines.  In the new world every device generates useful data, in which every action creates “information shadows” on the net.

He talks about this in Web Squared.  Robotics, augmented reality, personal electronics, sensor webs.

More and more data acquisition is being fed back into real time response – Walmart has a new item on order 20 seconds after you check out with it.  Immediately as someone shows up at the polls, their name is taken off the call-reminder list with Obama’s Project Houdini.

Ushahidi, a crowdsourced crisis mapper.  In Haiti relief, it leveraged tweets and Skype and Mechanical Turk – all these new protocols were used to find victims.

And he namedrops Opscode, the Chef guys – the system is part of this application.  And the new Web Operations essay book.  Essentially, operations are who figures out how to actually do this Brave New World of data apps.

And a side note – the closing of the Web is evil.  In The State of the Internet Operating System, Tim urges you to collaborate and cooperate and stay open.

Back to ops – operations is making some pretty big – potentially world-affecting – decisions without a lot to guide us.  Zen and the Art of Motorcycle Maintenance will guide you.  And do the right thing.

Current hot thing he’s into – Gov2.0!  Teaching government to think like a platform provider.  He wants our help to engage with the government and make sure they’re being wise and not making short-sighted technology decisions.

Code for America is trying to get devs for city governments, which of course are the ass end of the government sandwich – closest to us and affecting our lives the most, but with the least funds and skills.

Third Party Pain

And one more, a technical one, on the effects of third party trash on your Web site – “Don’t Let Third Parties Slow You Down”, by Google’s Arvind Jain and Michael Kleber.  This is a good one – if you run YSlow on our Web site, www.ni.com, you’ll see a nice fast page with some crappy page tags and inserted JS junk making it half as fast as it should be.  Eloqua, Unica, and one of our own demented designs layer an extra ~6s on top of a nice sub-2s page load.

Adsense adds 12% to your page load.  Google Analytics adds 5% (and you should use the new asynchronous snippet!).  Doubleclick adds 11.5%.  Digg, FacebookConnect, etc. all add a lot.

So you want to minimize blocking on the external publisher’s content – you can’t get rid of them and can’t make them be fast (we’ve tried with Eloqua and Unica, Lord knows).

They have a script called ASWIFT that makes show_ads.js a tiny loader script.  They make a little iframe and write into it.  Normally if you document.write, you block the hell out of everything.  Their old show_ads.js had a median of 47 ms and a 90th percentile of 288 ms latency – the new ASWIFT one has a median of 11 ms and a 90th %ile of 32 ms!!!

And as usual there’s a lot of browser specific details.  See the presentation for details.  They’re working out bugs, and hope to use this on AdSense soon!

Filed under Conferences, DevOps

Before DevOps, Don’t You Need OpsOps?

From the “sad but true” files comes an extremely insightful point apparently discussed over beer by the UK devops crew recently – that we are talking about dev and ops collaboration but the current state of collaboration among ops teams is pretty crappy.

This resonates deeply with me.  I’ve seen that problem in spades.  I think in general that a lot of the discussion about the agile ops space is too simplistic in that it seems tuned to organizations of “five guys, three of whom are coders and two of whom are operations” and there’s no differentiation.  In real life, there’s often larger orgs and a lot of differentiation that causes various collaboration challenges.  Some people refer to this as Web vs Enterprise, but I don’t think that’s strictly true; once your Web shop grows from 5 guys to 200 it runs afoul of this too – it’s a simple scalability and organizational engineering problem.

As an aside, I don’t even like the “Ops” term – a sysadmin team can split into subgroups that do systems engineering, release management, and operational support…  Just saying “Ops” seems to me to create implications of not being a partner in the initial design and development of the overall system/app/service/site/whatever you want to call it.

Ops Verticals

Here, we have a large Infrastructure department.  Originally, it was completely siloed by technology verticals, and there’s a lot of subgroups.  Network, UNIX, Windows, DBA, Lotus Notes, Telecom, Storage, Data Center…  Some ten plus years ago when the company launched their Web site in earnest, they quickly realized that wasn’t going to work out.  You had the buck-passing behavior described in the blog posts above that made issues impossible to solve in a timely fashion, plus it made collaboration with devs/business nearly impossible.  Not only did you need like 8 admins to come involve themselves in your project, but they did not speak similar enough languages – you’d have some crusty UNIX admin yelling “WHAT ABOUT THE INODES” until the business analyst started to cry.

Dev Silos

But are our developers here better off?  They are siloed by business unit.  Just among the Web developers there’s the eCommerce developers, eCRM, Product Advisors, Community, Support, Content Management…  On the one hand, they are able to be very agile in creating solutions inside their specific niche.  On the other hand, they are all working within the same system environment, and they don’t always stay on the same page in terms of what technologies they are using. “Well, I’m sure THAT team bought a lovely million dollar CMS, but we’re going to buy our own different million dollar CMS.   No, you don’t get more admin resource.”  Over time, they tried to produce architecture groups and other cross-team initiatives to try to rein in the craziness, with mixed but overall positive results.

Plugging the Dike

What we did was create a Web Administration group (Web Ops, whatever you want to call it) that was holistically responsible for Web site uptime, performance, and security.  Running that team was my previous gig, did it for five years.  That group was more horizontally focused and would serve as an interface to the various technology verticals; it worked closely with developers in system design during development, coordinated the release process, and involved devs in troubleshooting during the production phase.

BizOps?

In fact, we didn’t just partner with the developers – we partnered with the business owners of our Web site too, instead of tolerating the old model of “Business collaborates with the developers, who then come and tell ops what to do.”  This was a remarkably easy sell really.  The company lost money every minute the Web site was down, and it was clear that the dev silos weren’t going to be able to fix that any more than the ops silos were.  So we quickly got a seat at the same table.

Results

This was a huge success.  To this day, our director of Web Marketing is one of the biggest advocates of the Web operations team.  Since then, other application administration (our word for this cross-disciplinary ops) teams have formed along the same model.  The DevOps collaboration has been good overall – with certain stresses coming from the Web Ops team’s role as gatekeeper and process enforcement.  Ironically, the biggest issues and worst relationships were within Infrastructure between the ops teams!

OpsOps – The Fly In The Ointment

The ops team silos haven’t gone down quietly.  To this day the head DBA still says “I don’t see a good reason for you guys [WebOps] to exist.”  I think there’s a common “a thing is just the sum of its parts” mindset among admins for whatever reason.  There are also turf wars arising from the technology silo division and the blurring of technology lines by modern tech.  I tried again and again to pitch “collaborative system administration.”  But the default sysadmin behavior is to say “these systems are mine and I have root on them.  Those are your systems and you have root on them.  Stay on your side of the line and I’ll stay on mine.”

Fun specific Catch-22 situations we found ourselves in:

  • Buying a monitoring tool that correlates events across all the different tiers to help root-cause production problems – but the DBAs refusing to allow it on “their” databases.
  • Buying a hardware load balancer – we were going to manage it, not the network team, and it wasn’t a UNIX or Windows server, so we couldn’t get anyone to rack and jack it (and of course we weren’t allowed to because “Why would a webops person need server room access, that’s what the other teams are for”).

Some of the problem is just attitude, pure and simple.  We had problems even with collaboration inside the various ops teams!  We’d work with one DBA to design a system and then later need to get support from another DBA, who would gripe that “no one told/consulted them!”  Part of the value of the agile principles that “DevOps” tries to distill is just a generic “get it into your damn head you need to be communicating and working together and that needs to be your default mode of operation.” I think it’s great to harp on that message because it’s little understood among ops.  For every dev group that deliberately ostracizes their ops team, there’s two ops teams who don’t think they need to talk to the devs – in the end, it’s mostly our fault.

Part of the problem is organizational.  I also believe (and ITIL, I think, agrees with me) that the technology-silo model has outlived its usefulness.  I’d like to see admin teams organized by service area with integral DBAs, OS admins, etc.  But people are scared of this for a couple reasons.  One is that those admins might do things differently from area to area (the same problem we have with our devs) – this could be mitigated by “same tech” cross-org standards/discussions.  The other is that this model is not the cheapest.  You can squeeze every last penny out if you only have 4 Windows admins and they’re shared by 8 functional areas.  Of course, you are cutting off your nose to spite your face because you lose lots more in abandoned agility, but frankly corporate finance rules (minimize G&A spending) are a powerful driver here.

If nothing else, there’s not “one right organization” – I’d be tempted to reorg everyone from verticals into horizontals, let that run for 5 years, and then reorg back the other way, just to keep the stratification from setting in.

Specialist vs Generalist

One other issue.  The Web Ops team we created required us to hire generalists – but generalists that knew their stuff in a lot of different areas.  It became very hard to hire for that position and training took months before someone was at all effective.  Being a generalist doesn’t scale well.  Specialization is inevitable and, indeed, desirable (as I think pretty much anything in the history of anything demonstrates).  You can mitigate that with some cross-training and having people be generalists in some areas, but in the end, once you get past that “three devs, two ops, that’s the company” model, specialization is needed.

That’s why I think one of the common definitions of DevOps – all ops folks learning to be developers or vice versa – is fundamentally flawed.  It’s not sustainable.  You either need to hire all expensive superstars that can be good at both, or you hire people that suck at both.

What you do is have people with varying mixes.  In my current team we have a continuum of pure ops people, ops folks doing light dev, devs doing light ops, and pure devs.  It’s good to have some folks who are generalizing and some who are specializing.  It’s not specializing that is bad, it’s specialists who don’t collaborate that are bad.

Conclusion

So I’ve shared a lot of experiences and opinions above but I’m not sure I have a brilliant solution to the problem.  I do think we need to recognize that Ops/Ops collaboration is an issue that arises with scale and one potentially even harder to overcome than Dev/Ops collaboration.  I do think stressing collaboration as a value and trying to break down organizational silos may help.  I’d be happy to hear other folks’ experiences and thoughts!

Filed under DevOps

Defining Agile Operations and DevOps

I recently read a great blog post by Scott Wilson that was talking about the definitions of Agile Operations, DevOps, and related terms.  (Read the comments too, there’s some good discussion.)  From what I’ve heard so far, there are a bunch of semi-related terms people are using around this whole “new thing of ours.”

The first is DevOps, which has two totally different frequently used definitions.

1.  Developers and Ops working closely together – the “hugs and collaboration” definition

2.  Operations folks uptaking development best practices and writing code for system automation

The second is Agile Operations, which also has different meanings.

1.  Same as DevOps, whichever definition of that I’m using

2.  Using agile principles to run operations – process techniques, like iterative development or even kanban/TPS kinds of process stuff.  Often with a goal of “faster!”

3.  Using automation – version control, automatic provisioning/control/monitoring.  Sometimes called “Infrastructure Automation” or similar.

This leads to some confusion, as most of these specific elements can be implemented in isolation.  For example, I think the discussion at OpsCamp about “Is DevOps an antipattern” was predicated on an assumption that DevOps meant only DevOps definition #2, “ops guys trying to be developers,” and made the discussion somewhat odd to people with other assumed definitions.

I have a proposed set of definitions.  To explain it, let’s look at Agile Development and see how it’s defined.

Agile development, according to wikipedia and the agile manifesto, consists of a couple different “levels” of thing.  To sum up the wikipedia breakdown,

  • Agile Principles – like “business/users and developers working together.”  These are the core values that inform agile, like collaboration, people over process, software over documentation, and responding to change over planning.
  • Agile Methods – specific process types.  Iterations, Lean, XP, Scrum.  “As opposed to waterfall.”
  • Agile Practices – techniques often found in conjunction with agile development, not linked to a given method flavor, like test driven development, continuous integration, etc.

I believe the different parts of Agile Operations that people are talking about map directly to these three levels.

  • Agile Operations Principles includes things like dev/ops collaboration (DevOps definition 1 above); things like James Turnbull’s 4-part model seem to be spot on examples of trying to define this arena.
  • Agile Operations Methods includes process you use to conduct operations – iterations, kanban, stuff you’d read in Visible Ops; Agile Operations definition #2 above.
  • Agile Operations Practices includes specific techniques like automated build/provisioning, monitoring, anything you’d have a “toolchain” for.  This contains DevOps definition #2 and Agile Operations definition #3 above.

I think it’s helpful to break them up along the same lines as agile development, however, because in the end some of those levels should merge once developers understand ops is part of system development too…  There shouldn’t be a separate “user/dev collaboration” and “dev/ops collaboration,” in a properly mature model it should become a “user/dev/ops collaboration,” for example.

I think the dev2ops guys’ “People over Process over Tools” diagram mirrors this about exactly – the people being one of the important agile principles, process being a large part of the methods, and tools being used to empower the practices.

What I like about that diagram, and why I want to bring this all back to the Agile Manifesto discussion, is that having various sub-definitions increases the risk that people will implement the processes or tools without the principles in mind, which is definitely an antipattern.  The Agile guys would tell you that iterations without collaboration is likely to not work out real well.

And it happens in agile development too – there are some teams here at my company that have adopted the methods and/or tools of agile but not its principles, and the results are suboptimal.

Therefore I propose that “Agile Operations” is an umbrella term for all these things, and we keep in mind the principles/methods/practices differentiation.

If we want to call the principles “devops” for short and some of the practices “infrastructure automation” for short I think that would be fine…   Although dev/ops collaboration is ONE of the important principles – but probably not the entirety; and infrastructure automation is one of the important practices, but there are probably others.

Filed under DevOps, Uncategorized

Upcoming Free Velocity WebOps Web Conference

O’Reilly’s Velocity conference is the only generalized Web ops and performance conference out there.  We really like it; you can go to various other conferences and have 10-20% of the content useful to you as a Web Admin, or you can go here and have most of it be relevant!

They’ve been doing some interim freebie Web conferences and there’s one coming up.  Check it out.  They’ll be talking about performance functionality in Google Webmaster Tools, mySQL, Show Slow, provisioning tools, and dynaTrace’s new AJAX performance analysis tool.

O’Reilly Velocity Online Conference: “Speed and Stability”
Thursday, March 17; 9:00am PST
Cost: Free

Filed under Conferences, DevOps

Agile Operations

It’s funny.  When we recently started working on an upgrade of our Intranet social media platform, and we were trying to figure out how to meld the infrastructure-change-heavy operation with the need for devs, designers, and testers to be able to start working on the system before “three months from now,” we broached the idea of “maybe we should do that in iterations!”  First, get the new wiki up and working.  Then, worry about tuning, switching the back end database, etc.  Very basic, but it got me thinking about the problem in terms of “hey, Infrastructure still operates in terms of waterfall, don’t we.”

Then when Peco and I moved over to NI R&D and started working on cloud-based systems, we quickly realized the need for our infrastructure to be completely programmable – that is, not manually tweaked and controlled, but run in a completely automated fashion.  Also, since we were two systems guys embedded in a large development org that’s using agile, we were heavily pressured to work in iterations along with them.  This was initially a shock – my default project plan has, in traditional fashion, months worth of evaluating, installing, and configuring various technology components before anything’s up and running.   But as we began to execute in that way, I started to see that no, really, agile is possible for infrastructure work – at least “mostly.”  Technologies like cloud computing help, but there’s still a little more up front work required than with programming – but you can get mostly towards an agile methodology (and mindset!).

Then at OpsCamp last month, we discovered that there’s been this whole Agile Operations/Automated Infrastructure/devops movement thing already in progress we hadn’t heard about.  I don’t keep in touch with The Blogosphere ™ enough I guess.  Anyway, turns out a bunch of other folks have suddenly come to the exact same conclusion and there’s exciting work going on re: how to make operations agile, automate infrastructure, and meld development and ops work.

So if  you also hadn’t been up on this, here’s a roundup of some good related core thoughts on these topics for your reading pleasure!

Filed under DevOps

Velocity 2009 – Best Tidbits

Besides all the sessions, which were pretty good, a lot of the good info you get from conferences is by networking with other folks there and talking to vendors.  Here are some of my top-value takeaways.

Aptimize is a New Zealand-based company that has developed software to automatically do the most high value front end optimizations (image spriting, CSS/JS combination and minification, etc.).  We predict it’ll be big.  On a site like ours, going back and doing all this across hundreds of apps will never happen – we can engineer new ones and important ones better, but something like this which can benefit apps by the handful is great.

I got some good info from the MySpace people.  We’ve been talking about whether to run our back end as Linux/Apache/Java or Windows/IIS/.NET for some of our newer stuff.  In the first workshop, I was impressed when the guy asked who all runs .NET and only one guy raised his hand.   MySpace is one of the big .NET sites, but when I talked with them about what they felt the advantage was, they looked at each other and said “Well…  It was the most expeditious choice at the time…”  That’s damning with faint praise, so I asked about what they saw the main disadvantage being, and they cited remote administration – even with the new PowerShell stuff it’s just still not as easy as remote admin/CM of Linux.  That’s top of my list too, but often Microsoft apologists will say “You just don’t understand because you don’t run it…”  But apparently running it doesn’t necessarily sell you either.

Our friends from Opnet were there.  It was probably a tough show for them, as many of these shops are of the “I never pay for software” camp.  However, you end up wasting far more in skilled personnel time if you don’t have the right tools for the job.  We use the heck out of their Panorama tool – it pulls metrics from all tiers of your system, including deep in the JVM, and does dynamic baselining, correlation and deviation.  If all your programmers are 3l33t maybe you don’t need it, but if you’re unsurprised when one of them says “Uhhh… What’s a thread leak?” then it’s money.

ControlTier is nice, they’re a commercial open source CM tool for app deploys – it works at a higher level than chef/puppet, more like capistrano.

EngineYard was a really nice cloud provisioning solution (sits on top of Amazon or whatever).  The reality of cloud computing as provided by the base IaaS vendors isn’t really the “machines dynamically spinning up and down and automatically scaling your app” they say it is without something like this (or lots of custom work).  Their solution is, sadly, Rails only right now.  But it is slick, very close to the blue-sky vision of what cloud computing can enable.

And also, I joined the EFF!  Cyber rights now!

You can see most of the official proceedings from the conference (for free!):

Filed under Conferences, DevOps

Velocity 2009 – Monday Night

After a hearty trip to Gordon Biersch, Peco went to the Ignite battery of five minute presentations, which he said was very good.  I went to two Birds of a Feather sessions, which were not.  The first was a general cloud computing discussion which covered well-trod ground.  The second was by a hapless Sun guy on Olio and Faban.  No, you don’t need to know about them.  It was kinda painful, but I want to commend that Asian guy from Google for diplomatically continuing to try to guide the discussion into something coherent without just rolling over the Sun guy.  Props!

And then – we were lame and just turned in.  I’m getting old, can’t party every night like I used to.  (I don’t know what Peco’s excuse is!)

Filed under Conferences, DevOps

Velocity 2009 – Scalable Internet Architectures

OK, I’ll be honest.  I started out attending “Metrics that Matter – Approaches to Managing High Performance Web Sites” (presentation available!) by Ben Rushlo, Keynote proserv.  I bailed after a half hour to the other one, not because the info in that one was bad but because I knew what he was covering and wanted to get the less familiar information from the other workshop.  Here’s my brief notes from his session:

  • Online apps are complex systems
  • A siloed approach of deciding to improve midtier vs CDN vs front end engineering results in suboptimal experience to the end user – have to take holistic view.  I totally agree with this, in our own caching project we took special care to do an analysis project first where we evaluated impact and benefit of each of these items not only in isolation but together so we’d know where we should expend effort.
  • Use top level/end user metrics, not system metrics, to measure performance.
  • There are other metrics that correlate to your performance – “key indicators.”
  • It’s hard to take low level metrics and take them “up” into a meaningful picture of user experience.

He’s covering good stuff but it’s nothing I don’t know.  We see the differences and benefits in point in time tools, Passive RUM, tagging RUM, synthetic monitoring, end user/last mile synthetic monitoring…  If you don’t, read the presentation, it’s good.  As for me, it’s off to the scaling session.

I hopped into this session a half hour late.  It’s Scalable Internet Architectures (again, go get the presentation) by Theo Schlossnagle, CEO of OmniTI and author of the similarly named book.

I like his talk, it starts by getting to the heart of what Web Operations – what we call “Web Admin” hereabouts – is.  It kinda confuses architecture and operations initially but maybe that’s because I came in late.

He talks about knowledge, tools, experience, and discipline, and mentions that discipline is the most lacking element in the field. Like him, I’m a “real engineer” who went into IT so I agree vigorously.

What specifically should you do?

  • Use version control
  • Monitor
  • Serve static content using a CDN, and behind that a reverse proxy and behind that peer based HA.  Distribute DNS for global distribution.
  • Dynamic content – now it’s time for optimization.

Optimizing Dynamic Content

Don’t pay to generate the same content twice – use caching.  Generate content only when things change and break the system into components so you can cache appropriately.

Example: a php news site – articles are in Oracle, personalization on each page, top new forum posts in a sidebar.

  • Why abuse Oracle by hitting it on every page view?  Updates are controlled.  The page should pull user prefs from a cookie.  (P.S. rewrite your query strings.)
  • But it’s still slow to pull from the db vs. hardcoding it.  All blog software does this, for example.
  • Check for a hardcoded php page – if it’s not there, run something that puts it there.  Still dynamically puts in user personalization from the cookie.  In the preso he provides details on how to do this.
  • Do cache invalidation on content change; use a message queuing system like OpenAMQ for async writes.
  • Apache is now the bottleneck – use APC (Alternative PHP Cache) or memcached – he says no timeouts!  Or… be careful about them!  Or something.
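
The “check for a hardcoded page, regenerate it only when content changes” trick translates to any stack.  A little Python sketch of the idea (the PHP original works the same way; paths and helpers here are made up):

```python
import os

CACHE_DIR = "/var/cache/pages"

def render_article(article_id):
    # Placeholder for the expensive bit: hit Oracle, build the HTML.
    return "<html><body>article %d</body></html>" % article_id

def cached_path(article_id):
    return os.path.join(CACHE_DIR, "article-%d.html" % article_id)

def get_article_page(article_id):
    path = cached_path(article_id)
    if not os.path.exists(path):                # miss: generate and store it
        os.makedirs(CACHE_DIR, exist_ok=True)
        with open(path, "w") as f:
            f.write(render_article(article_id))
    with open(path) as f:
        return f.read()   # personalization still comes from the cookie at request time

def invalidate(article_id):
    # Called on content change - ideally async, via a message queue.
    try:
        os.remove(cached_path(article_id))
    except FileNotFoundError:
        pass
```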

Scaling Databases

1. shard them
2. shoot yourself

Sharding, or breaking your data up by range across many databases, means you throw away relational constraints and that’s sad.  Get over it.

You may not need relations – use files fool!  Or other options like couchdb, etc.  Or hadoop, from the previous workshop!

Vertically scale first by:

  • not hitting the damn db!
  • run a good db.  postgres!  not mySQL boo-yah!

When you have to go horizontal, partition right – more than one shard shouldn’t answer an oltp question.   If that’s not possible, consider duplication.

IM example.  Store messages sharded by recipient.  But then the sender wants to see them too and that’s an expensive operation – so just store them twice!!!
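
In code, “shard by recipient, and store it twice so the sender’s view is cheap too” looks roughly like this – an illustrative Python sketch where the shard math and the writes are placeholders:

```python
N_SHARDS = 16

def shard_for(user_id: int) -> int:
    return user_id % N_SHARDS            # or a hash, or a directory service

def insert_message(shard: int, owner: int, msg: dict):
    # Placeholder for an INSERT against the database holding this shard.
    print("shard %d <- owner %d: %r" % (shard, owner, msg))

def send_message(sender: int, recipient: int, body: str):
    msg = {"from": sender, "to": recipient, "body": body}
    # Write once into the recipient's shard (inbox view)...
    insert_message(shard_for(recipient), recipient, msg)
    # ...and once into the sender's shard (sent view).  Duplication beats
    # a cross-shard query on every "sent messages" page load.
    insert_message(shard_for(sender), sender, msg)

send_message(sender=7, recipient=12345, body="hi")
```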

But if it’s not that simple, partitioning can hose you.

Do math and simulate it before you do it fool!   Be an engineer!

Multi-master replication doesn’t work right.  But it’s getting closer.

Networking

The network’s part of it, can’t forget it.

Of course if you’re using Ruby on Rails the network will never make your app suck more.  Heh, the random drive-by disses rile the crowd up.

A single machine can push a gig.  More isn’t hard with aggregated ports.  Apache too, serving static files.  Load balancers too.  How to get to 10 or 20 Gbps though?  All the drivers and firmware suck.  Buy an expensive LB?

Use routing.  It supports naive LB’ing.  Or routing protocol on front end cache/LBs talking to your edge router.  Use hashed routes upstream.  User caches use same IP.  Fault tolerant, distributed load, free.

Use isolation for floods.  Set up a surge net.  Route out based on MAC.  Used vs DDoSes.

Service Decoupling

One of the most overlooked techniques for scalable systems.  Why do now what you can postpone till later?

Break the transaction into parts.  Queue info.  Process queues behind the scenes.  Messaging!  There are different options – AMQP, Spread, JMS.  Specifically good message queuing options are:

Most common – STOMP, sucks but universal.

Combine a queue and a job dispatcher to make this happen.  Side note – Gearman, while cool, doesn’t do this – it dispatches work but it doesn’t decouple action from outcome – should be used to scale work that can’t be decoupled.  (Yes it does, says dude in crowd.)
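
The decoupling pattern itself is simple: the request path enqueues a small message and returns; workers drain the queue at their own pace.  A stripped-down in-process Python sketch, with the standard library standing in for a real broker (AMQP, Spread, JMS, whatever):

```python
import queue
import threading

work = queue.Queue()   # in real life: an AMQP broker, Spread, JMS, etc.

def handle_request(user_id, event):
    # The part the user waits on: record the fact, return immediately.
    work.put({"user": user_id, "event": event})
    return "202 Accepted"

def worker():
    while True:
        job = work.get()
        if job is None:
            break
        # The slow stuff (emails, rollups, indexing) happens back here,
        # decoupled from the user-facing transaction.
        print("processing", job)
        work.task_done()

threading.Thread(target=worker, daemon=True).start()
print(handle_request(42, "signup"))
work.join()        # let the background worker finish before the demo exits
```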

Scalability Problems

It often boils down to “don’t be an idiot.”  His words not mine.  I like this guy. Performance is easier than scaling.  Extremely high perf systems tend to be easier to scale because they don’t have to scale as much.

e.g., an email marketing campaign with a URL not ending in a trailing slash.  Guess what, you just doubled your hits.  Use the damn trailing slash to avoid 302s.
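
If you want to check whether a URL you’re about to blast out in a campaign is going to eat a redirect, it’s a five-line test.  A sketch using only the Python standard library (the URL is hypothetical – point it at your own):

```python
import urllib.request
import urllib.error

class NoRedirect(urllib.request.HTTPRedirectHandler):
    def redirect_request(self, req, fp, code, msg, headers, newurl):
        return None   # don't follow - we want to see the redirect itself

def redirect_target(url):
    """Return (status, Location) - a 301/302 here means doubled hits."""
    opener = urllib.request.build_opener(NoRedirect)
    try:
        resp = opener.open(url, timeout=10)
        return resp.status, None
    except urllib.error.HTTPError as e:
        return e.code, e.headers.get("Location")

print(redirect_target("http://example.com/somedir"))   # hypothetical URL
```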

How do you stop everyone from being an idiot though?  Every person who sends a mass email from your company?  That’s our problem  – with more than fifty programmers and business people generating apps and content for our Web site, there is always a weakest link.

Caching should be controlled, not prevented, in nearly any circumstance.

Understand the problem.  going from 100k to 10MM users – don’t just bucketize in small chunks and assume it will scale.  Allow for margin for error.  Designing for 100x or 1000x requires a profound understanding of the problem.

Example – I plan for a traffic spike of 3000 new visitors/sec.  My page is about 300k.  CPU bound.  8ms service time.  Calculate servers needed.  If I varnish the static assets, the calculation says I need 3-4 machines.  But do the math and it’s 8 GB/sec of throughput.  No way.  At 1.5MM packets/sec – the firewall dies.  You have to keep the whole system in mind.
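
The arithmetic is worth doing yourself.  Here’s a back-of-envelope version in Python with my own round numbers (3,000 visitors/sec, ~300 KB per page, ~1,400 usable bytes per packet) – the point is the order of magnitude, not the speaker’s exact slide:

```python
visitors_per_sec = 3_000
page_bytes = 300 * 1024        # ~300 KB of page + assets per new visitor
packet_payload = 1_400         # rough usable bytes per Ethernet packet

bytes_per_sec = visitors_per_sec * page_bytes
gbps = bytes_per_sec * 8 / 1e9
packets_per_sec = bytes_per_sec / packet_payload

print(f"throughput: {gbps:.1f} Gbps")             # ~7.4 Gbps - no single box
print(f"packets:    {packets_per_sec:,.0f}/sec")  # ~660k data packets/sec,
# before ACKs, handshakes, and the ~50% packet overage a stray 302 adds -
# which is what melts the firewall.
```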

So spread out static resources across multiple datacenters, agg’d pipes.
The rest is only 350 Mbps, 75k packets per second, doable – except the 302 adds 50% overage in packets per sec.

Last bonus thought – use zfs/dtrace for dbs, so run them on solaris!

Filed under Conferences, DevOps