Author Archives: Ernest Mueller


About Ernest Mueller

Ernest is the VP of Engineering at the cloud and DevOps consulting firm Nextira in Austin, TX. More...

DevOps “From the Trenches” Report – HomeAway

We were out at HomeAway for a technical discussion, and DevOps reared its head as it does so frequently nowadays.  In the context of talking about their preparation to scale up for their big Chevy Chase Super Bowl commercial, they were doing all kinds of stuff.  One of the things they noted was that the traditional dev and ops headbutting changed due to the long hours of work they had to put in together.  They tried going off and doing “their parts” separately – ops doing network, servers, load balancers, and hosting, and developers doing coding, caching, tuning, and testing – but the time pressure, importance, and complexity of the project forced them together into a room, and once they started to collaborate they just stayed there, working in close proximity, for the duration.  When asked about the big takeaways from the entire project, the developers noted that “Learning how everything interacts has changed how we build things” – for example, doing “pull the plug” fault testing has made for more resilient architectures and higher confidence and quality of life for both the dev and ops teams!  They didn’t describe it as “DevOps,” but that’s what it boils down to.

The more I talk to other successful Austin tech companies – HomeAway, BazaarVoice, Pervasive – the more I hear DevOps concepts mentioned as keys to their success.  They didn’t do them because they “wanted to do this cool DevOps thing”; they did what was needed to succeed, and it turns out that part of that is bringing development and operational concerns together into a whole.  It reminds me of the story behind the Visible Ops book, where the authors researched what high-performing IT shops had in common and then realized those successful behaviors all mapped to certain ITIL areas (mainly change management).  That kind of after-the-fact convergence is a compelling validation of the approach.

Anyway, I urged them to consider doing that presentation in public venues; it really was a great story and hit on many of the best practices that have been emerging from the ops and performance world over the last few years.  They must be doing something right because they’re growing like gangbusters – if you want to take a vacation and rent someone else’s house/condo instead of going to a hotel, go try out homeaway.com!

Leave a comment

Filed under DevOps

DevOps Cafe Podcast

Damon Edwards and John Willis run the DevOps Cafe Podcast.  It’s a great listen, and they have a lot of people on talking about exciting advances in the ops world (including Allspaw, John Kim, kaChing, Shopzilla).  And for this last one, they interviewed me! Apparently we’re on the cutting edge of doing DevOps in a traditional type organization as opposed to a lil’ Web startup.

So if you want to hear me natter on about DevOps and the lessons I’ve learned over my career that have brought me to it for 40 minutes or so, here you go.

Leave a comment

Filed under DevOps

DevOps and Security

I remember some complaints about DevOps from a couple folks (most notably Rational Survivability) saying “what about security!  And networking!  They’re excluded from DevOps!”  Well, I think that in the agile collaboration world, people are only excluded to the extent that they refuse to work with the agile paradigm.  Ops used to be “excluded” from agile, not because the devs hated them, but because the ops folks themselves didn’t willingly go collaborate with the devs and understand their process and work in that way.  As an ops person, it was hard to go through the process of letting go of my niche of expertise and my comfortable waterfall process, but once I got closer to the devs, understood what they did, and refactored my work to happen in an agile manner, I was as welcome as anyone to the collaborative party, and voila – DevOps.

Frankly, the security and network arenas are less incorporated into the agile team because they don’t understand how to be (or in many cases, don’t want to be).  I’ve done security work and work with a lot of InfoSec folks – we host the Austin OWASP chapter here at NI – and the average security person’s approach embodies most of what agile was created to remove from the development process.  As with any technical niche there’s a lot of elitism and authoritarianism that doesn’t mesh well with agile.

But this week, I saw a great presentation at the Austin OWASP chapter by Andre Gironda (aka “dre”) called Application Assessments Reloaded that covered a lot of ground, but part of it was the first coherent statement I’ve seen about what agile security would look like.  I especially like his term for the security person on the agile team – the “Security Buddy!”  Who can not like their security buddy?  They can hate the hell out of their “InfoSec Compliance Officer,” though.

Anyway, he has a bunch of controversial thoughts (he’s known for that) but the real breakthroughs are acknowledging the agile process, embedding a security “buddy” on the team, and leveraging existing unit test frameworks and QA behavior to perform security testing as well.  I think it’s a great presentation, go check it out!

1 Comment

Filed under DevOps, Security

Austin Cloud Computing Users Group Meeting Tomorrow

The second meeting of the Austin Cloud Computing Users Group is tomorrow, Tuesday August 24, from 6 to 8 PM, hosted by Pervasive Software in north Austin.

Michael Coté of Redmonk will be talking about recent cloud trends.  OpSource is sponsoring food and drinks.  We’ll have some lightning talks or unconference sessions too, depending.

It’s a great group of folks, so if you are into cloud computing come by and share.  If you plan to attend, please RSVP on Eventbrite here.

2 Comments

Filed under Cloud

Logging for Success

I’ve been working on a logging standards document for our team to use.  We have a lot of desktop-software developers contributing software to the Web now, and it is making me take a step back and re-explain some things I consider basics.  I did some Googling for inspiration and I have to say, there aren’t a lot of coherent bodies of information on what makes logging “good,” especially from an operations point of view.  So I’m going to share some chunks of my thoughts here, and would love to hear feedback.

You get a lot of opinions around logging, including very negative ones that some developers believe.  “Never log!  Just attach a debugger!  It has a performance hit!  It will fill up disks!”  But to an operations person, logs are the lifeblood of figuring out what is going on with a complex system.  So without further ado, for your review…

Why Log?

Logging is often an afterthought in code.  But what you log and when and how you log it is critical to later support of the product.  You will find that good logging not only helps operations and support staff resolve issues quickly, but helps you root-cause problems when they are found in development (or when you are pulled in to figure out a production problem!).  “Attach a debugger” is often not possible if it’s a customer site or production server, and even in an internal development environment as systems grow larger and more complex, logs can help diagnose intermittent problems and issues with external dependencies very effectively.  Here are some logging best practices devised over years of supporting production applications.

Logging Frameworks

Consider using a logging framework to help you implement these.  Log4j is a full-featured and popular logging package that has been ported to .NET (log4net) and about a billion other languages, and it gives you a lot of this functionality for free.  If you use a framework, then logging correctly is quick and easy.  You don’t have to use a framework, but if you try to implement a nontrivial set of the best practices below, you’ll probably be sorry you didn’t.
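As a quick illustration, here’s a minimal sketch of what per-class logging with log4j 1.x looks like – the class and method names are made up for the example, not from any real product:

    import org.apache.log4j.Logger;

    public class FarmDaemon {
        // One logger per class, named after the class, is the usual convention.
        private static final Logger log = Logger.getLogger(FarmDaemon.class);

        public void start() {
            log.info("Starting farm daemon");
            try {
                connectToDatabase();
            } catch (Exception e) {
                // Pass the exception object so the stack trace ends up in the log.
                log.error("Could not connect to database", e);
            }
        }

        private void connectToDatabase() throws Exception { /* ... */ }
    }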

The Log File

  • Give the log a meaningful name, ideally containing the name of the product and/or component that’s logging to it and its intent.  “nifarm_error.log” for example is obviously the error log for NIFarm.  “my.log” is… Who knows.
  • For the filename, use all lower case, no spaces, etc. to ensure compatibility across Windows and UNIX.
  • Logs should use a .log suffix to distinguish themselves from everything else on the system (not .txt, .xml, etc.).  They can then be found easily and mapped to something appropriate for their often-large size.  (Note that the .log needs to come after other stuff, like the datetime stamp recommended below)
  • Logs targeted at a systems environment should never delete or overwrite themselves.  They should always append and never lose information.  Let operations worry about log file deletion and disk space – do tell them about the log files so they know to handle it though.  All systems-centric software, from Apache on up, logs append-only by default.
  • Logs targeted at a desktop environment should log by default, but use size-restricted logging so that the log size does not grow without bound.
  • Logs should roll so they don’t grow without bound.  A good best practice is to roll daily and add a .YYYYMMDD(.log) suffix to the log name so that a directory full of logs is easily navigable.  The DailyRollingFileAppender in the log4j/log4net packages does this automatically (see the configuration sketch after this list).
  • Logs should always have a configurable location.  Applications that write into their own program directory are a security risk.  Systems people prefer to make logs (and temp files and other stuff like that) write to a specific disk/disk location away from the installed product to the point where they could even set the program’s directory/disk to be read only.
  • Put all your logs together.  Don’t scatter them throughout an installation where they’re hard to find and manage (if you make their locations configurable per above, you get this for free, but the default locations shouldn’t be scattered).
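To make several of these concrete, here’s a sketch of configuring log4j 1.x programmatically – daily rolling, append-only, a lowercase .log name, and a location taken from a configurable system property.  The paths and property names here are invented for illustration:

    import java.io.IOException;
    import org.apache.log4j.DailyRollingFileAppender;
    import org.apache.log4j.Level;
    import org.apache.log4j.Logger;
    import org.apache.log4j.PatternLayout;

    public class LogSetup {
        public static void configure() throws IOException {
            // Log directory is configurable, not the program's install directory.
            String logDir = System.getProperty("nifarm.log.dir", "/var/log/nifarm");

            PatternLayout layout = new PatternLayout("%d{ISO8601} %-5p [%t] %c - %m%n");

            // Rolls daily; rolled-over files get the date pattern appended to the
            // file name, and the appender always appends, never overwrites.
            DailyRollingFileAppender appender = new DailyRollingFileAppender(
                    layout, logDir + "/nifarm_error.log", "'.'yyyy-MM-dd");

            Logger root = Logger.getRootLogger();
            root.addAppender(appender);
            root.setLevel(Level.INFO);

            // For desktop-targeted builds, cap the size instead (sketch):
            // RollingFileAppender capped =
            //         new RollingFileAppender(layout, logDir + "/nifarm.log", true);
            // capped.setMaxFileSize("10MB");
            // capped.setMaxBackupIndex(5);
        }
    }

In practice you’d more likely put this in a log4j.properties or XML file shipped alongside the app, but the knobs are the same.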

Much more after the jump!

11 Comments

Filed under DevOps

Oracle Declares War On Open Source

Some days, it doesn’t pay to be a Java shop.  Oracle has discontinued OpenSolaris and is suing Google over making a Java fork for their mobile phones.  Are OpenJDK, mySQL, and OpenOffice next on the destructive rampage?

We like using open source.  We also like coding our Web apps in Java; it’s heavier duty than PHP/Ruby and more open than .NET – or at least it was.  We’re actually a big Oracle shop – we use mySQL for our cloud offerings but use loads of Oracle databases (and Oracle ERP) internally.  But it’s hard to interpret this as anything other than a series of crappy moves that will result in diminishing the ecosystem we’re trying to use to create products and Web apps.

It seems like Larry Ellison is really fond of the old Microsoft “I am your monopolistic corporate overlord” kind of business relationship.  Warning – if we want one of those, we’d use Microsoft; they’re cheaper and better at it.

It’s partly the fault of the open source “companies” – they willingly sell out to the corporate-overlord set (Microsoft, Oracle, IBM, HP, CA) and then, no matter what they say, they’re not really open any more.  Java was allegedly open sourced by Sun but now Oracle is suing Google for exercising that open source license on the basis of patent infringement.  Heck, a lot of the open source companies have instead become “open core” which is often mostly a lie.

So is Open Source already dead, and this is just part of the feeding frenzy of the big boys scooping it up?

Leave a comment

Filed under General

Austin Cloud Computing Users Group!

The first meeting of the Austin Cloud Computing Users Group just happened this Tuesday, and it was a good time!  This new effort is kindly hosted by Pervasive Software.  We had folks from all over attend – Pervasive (of course), NI (4 of us went), Dell, ServiceMesh, BazaarVoice, Redmonk, and Zenoss just to name a few.  There are a lot of heavy hitters here in Austin because our town is so lovely!

We basically just introduced ourselves (there were like 50 people there so that took a while) and talked about organization and what we wanted to do.

The next meeting is planned already; it will be at Pervasive from 6:00 to 8:00 PM on Tuesday, August 24.  Michael Coté of Redmonk will be speaking on cloud computing trends.  Meeting format will be a presentation followed by lightning talks and self-forming unconference sessions.  Companies will be buying food and drink for the group in return for a 5 minute “pimp yourself” slot.  Mmm, free dinner.

There is a Google group/mailing list you can join – austin-cug@googlegroups.com.  There’s already some good discussion underway, so join in, and come to the next meeting!

Leave a comment

Filed under Cloud, DevOps

Give Me An API Or Give Me Death

Catchy phrase courtesy #meatcloud…   But it’s very true.  I am continuously surprised by the chasm between the “old generation” of software that jealously demands its priests stay inside the temple, and the “new generation” that lets you do things via API easily.  As we’ve been building up a new highly dynamic cloud-based system, we’ve been forced to strongly evaluate our toolset and toss out products with strong “functionality” that can’t be managed well in an automated infrastructure.

Let me say this.  If your product requires either a) manual GUI operations or b) a config file alteration and restart, it is not suitable for the new millennium.  That’s just a fact.

We needed an LDAP server to hold our auth information.  It’s been a while since I’ve done that, so of course OpenLDAP immediately came to mind.  So we tried it.  But what happens when you want to dynamically add a new replication slave?  Oh, you edit a bunch of config files and restart.  Well, sure, I’d like my auth system to be offline all the time, but…  So we tried OpenDS.  The most polished thing in the world?  No.  Does it have all of OpenLDAP’s weird functionality that I probably won’t use anyway?  No.  But it does have an administration interface that you can issue directives to and have them take hold in realtime.  “Hey dude start replicating with that new box over there OK?”  “Sir, yes sir.”  “Outstanding.”  And since it’s Java, I can deploy it easily to targets in an automated fashion.  And even though the docs aren’t all up to date and sometimes you have to go through their interactive command line interface to do something – once you do it, the interface can be told to spit out the command-line version of that so you can automate it.  Sold!

The monitoring world is like this too.  Oh, we need an open source monitoring system?  Like everyone else, Nagios comes first to mind.  But then you try to manage a dynamic environment with it.  Again, their “solution” is to edit config files and restart parts of the system.  I don’t know about you, but my monitoring systems tend to be running a LOT of tests at any given time and hiccups in that make Baby Jesus (and frequently whoever is on call) cry.  So we start looking at other options.  “Well, you just come here in the UI and click to add!” the sales rep says proudly.  “Click,” goes the phone.  We end up looking at stuff like Zabbix, Zenoss, etc.  In fact, at least for the short term, we are using Cloudkick.  In terms of the depth of monitoring, it supports 1/100 of what most monitoring solutions do.  System stats mostly; there’s plugins for LDAP and mySQL but that’s about it, the rest is “here’s where you can plug in your own custom agent plugin…”  But, as my systems come up they get added to their interface automatically, tagged with my custom namespace.  And I’d rather have my systems IN a monitoring system that will give me 10 metrics than OUTSIDE a monitoring system that would give me 1000.

It’s also about agility.  We are trying to get these products to market way fast.  We don’t have time to become high priests of the “OpenLDAP way of doing things” or the “Nagios way of doing things.”  We want something that works upon install, that you can make a call to (ideally REST-based, though command line is acceptable in a pinch, and if there’s an iPhone app for it you get extra credit) in order to tell it what to do.  Each of these items is about 1/100 of everything that needs to go into a full working system, and so if I have to spend more than a week to get you working and integrate with you – it’s a dealbreaker.  You got away with that back when there weren’t other choices, but now in just about every sector there’s someone who’s figured out that ease of access and REST API for integration plus basic functionality is as valuable as loads of “function points” plus being hellishly crufty.
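As a toy illustration of what passes this test, here’s roughly how much work it should be to register a freshly provisioned node with a management or monitoring service over REST.  The endpoint, payload, and token handling are invented for the example – this is not any particular vendor’s API:

    import java.io.OutputStream;
    import java.net.HttpURLConnection;
    import java.net.URL;
    import java.nio.charset.StandardCharsets;

    public class RegisterNode {
        public static void main(String[] args) throws Exception {
            // Hypothetical management endpoint -- illustrative only.
            URL url = new URL("https://monitoring.example.com/api/nodes");
            String payload =
                "{\"hostname\":\"web-042.example.com\",\"tags\":[\"cloud\",\"web\"]}";

            HttpURLConnection conn = (HttpURLConnection) url.openConnection();
            conn.setRequestMethod("POST");
            conn.setRequestProperty("Content-Type", "application/json");
            conn.setRequestProperty("Authorization", "Bearer " + System.getenv("API_TOKEN"));
            conn.setDoOutput(true);

            try (OutputStream out = conn.getOutputStream()) {
                out.write(payload.getBytes(StandardCharsets.UTF_8));
            }

            // A 2xx response means the node is under management -- no GUI, no restart.
            System.out.println("Response: " + conn.getResponseCode());
        }
    }

If bringing a new box under management takes much more than that, the product fails the test above.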

Heck, we ended up developing our own cloud management stuff because when we looked at the RightScales and whatnot of the world, they did a great job of managing the cloud providers’ direct APIs for you but didn’t then offer an API in return…  And that was a dealbreaker.  You can’t automate end to end if you come smacking up against a GUI.  (Since then, RightScale has put out their own API in beta.  Good work guys!)

More and more, people are seeing that they need and want the “API way.”  If you don’t provide that, then you are effectively obsolete.  If I can’t roll up a new system – either with your software or something your software needs to be looking at/managing – and have it join in with the overall system with a couple simple API commands, you’re doing it wrong.

Leave a comment

Filed under Cloud, DevOps

DevOps Time!

All right!  After the last three days of Velocity 2010, we’ve talked a lot about ops and even hinted at devops, although often in a “recycled from previous Velocity” fashion.  But today it’s time to mainline it with DevOpsDays!

I’m going to be too busy actually participating to do full writeups like I did from Velocity, but I’ll distill down the best takeaways and bring them here as soon as I can.  If you just can’t wait and aren’t here, follow along on twitter at #devopsdays!

Leave a comment

Filed under DevOps

Velocity 2010 – Facebook Performance Shenanigans

Pipelining, Progressive Enhancement, and More: Making Facebook Twice as Fast by Jason Sobel (Facebook), Changhao Jiang (Facebook)

It’s the last session of Velocity already! The companies are tearing down their booths, people are escaping to the airport. Today went really, really fast. The room is still mostly full though!

As we’ve heard before, they have loads of users.  They have a central performance team, but also performance people distributed and embedded throughout the company.

The core site speed team started working on PHP speed.  Then they read Steve Souders’ book and realized “Oh, crap…”

They are working on a “perflab” to measure performance impacts of all changes.  And detect regressions.

What are the three things they measure at Facebook?

  1. Server time
  2. Network time
  3. Client/render time

What are we optimizing for?  Shouldn’t be any of those three.  Optimize for people.  That doesn’t even mean end user response time – that means impression of performance.

How fast is Facebook?  Well, determine what the core of the experience is.  What do people look at first, what defines the experience?  Lazy load the rest of that crap.

Metric: Time To Interact (TTI).  It’s a very custom metric.  When is the user getting value out of the site?  This is subjective and requires you to really know your users.

For this, the critical pieces have to be there and have to WORK – you can’t just display something and have the functionality not there yet.  Visible but not yet functional doesn’t count.

Techniques used to speed things up:

  1. Early flush.  Get them a list of the crucial elements.
  2. Components.  Pages used the same components with different names, the color blue was defined a thousand times.  Make a reusable set of visual components that can appear on any page and share the same CSS rules.  Besides enforcing visual standards, you can optimize them and then reuse them.  Theirs are a grid, an image block, some buttons, page headers…
  3. JavaScript!  We love it, but it is hard.  They wrote a lot before they knew what they were doing.  They have something called “primer.”  There’s a simple JS library that lives in the head and can bootstrap the rest of the JavaScript and respond to simple stuff devs were writing over and over again.  An event handler that can do a popup, get and insert content, or do a form submit.  And go get other JavaScript.  Then you tag something with a rel=”dialog” and it pops a dialog.  And once the page is done you can go get the stuff instead of making it on demand.  “async” gets content; in the feedback interface, Like and View and Delete use it.
  4. BigPipe is an attempt to rethink how we present pages.  The problem is the page generation, network latency, and page rendering being serial.  They render personalized pages and have to query several back end services to make the page.  The page is waiting on the slowest back end query.  So pipeline it out!  Decompose pages into “pagelets” and pipeline them through different execution stages in the server and browser.  They give priorities to different pagelets.
    How does it work?  First you get a nearly empty doc.  In the head, script src bigpipe.js.  Then there are divs on the page with IDs; a template with the logical structure of the page.  For each pagelet, it’s flushed separately in a script tag, JSON encoded.  BigPipe on the client downloads CSS for the pagelet, displays it, downloads JS, and executes onLoad()s.
    This gave them a 2x improvement in perceived latency (defined by TTI) across all browsers.
    What about search engines?  Well, first of all, for devs to use the pipe, they have to write pagelets, and they have a pagelet abstraction for them to use.  Only has three functions: initialize, prepare, and render.  To pipeline you create a BigPipe instance, specify your page layout and placeholders, add pagelets to the pipe (source file and wrapper id) and then call render.  So you can do pipeline, singleflush, parallel, or prepare models.  One parameter in Bigpipe::GetInstance controls it.  Use singleflush for search and non-JS stuff.  Prepare lets you batch multiple pages.  Parallel lets you use multiple threads for different pagelets (at the cost of server resources!).  (A toy sketch of the pipelining idea follows this list.)
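No real code was shown (BigPipe is Facebook’s internal PHP framework), but here’s a toy Java servlet sketch of the core pipelining idea – flush a skeleton page with placeholder divs early, then stream each pagelet’s content in a script tag as its back-end call finishes.  All the names and markup are made up for illustration, and it assumes the javax.servlet API on the classpath:

    import java.io.IOException;
    import java.io.PrintWriter;
    import javax.servlet.http.HttpServlet;
    import javax.servlet.http.HttpServletRequest;
    import javax.servlet.http.HttpServletResponse;

    public class PipelinedPageServlet extends HttpServlet {
        @Override
        protected void doGet(HttpServletRequest req, HttpServletResponse resp)
                throws IOException {
            resp.setContentType("text/html");
            PrintWriter out = resp.getWriter();

            // 1. Early flush: skeleton with empty placeholder divs so the browser
            //    can start fetching CSS/JS while we query the back ends.
            out.println("<html><head><script src='/js/pipe.js'></script></head><body>");
            out.println("<div id='newsfeed'></div><div id='chat'></div>");
            out.flush();

            // 2. As each (slow) back-end call completes, stream that pagelet's
            //    content; client-side JS injects it into its placeholder div.
            out.println(pagelet("chat", fetchChat()));
            out.flush();
            out.println(pagelet("newsfeed", fetchNewsfeed()));
            out.flush();

            out.println("</body></html>");
        }

        private String pagelet(String id, String html) {
            // Real pagelets would be JSON-encoded; kept trivial here.
            return "<script>Pipe.inject('" + id + "', '" + html + "');</script>";
        }

        private String fetchNewsfeed() { return "news items..."; }    // stand-in for a slow service
        private String fetchChat()     { return "online friends..."; }
    }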

Whew!  All this was a success – on Dec 22 they got their goal of making Facebook twice as fast.

Combine with ESIs for even more fun!

Thoughts from the Site Speed team:

To build a culture of performance…  Make tshirts!  They gave shirts to those who made improvements.

  1. Get the right metrics, ones that people buy into, and then they’ll work on optimizing them.
  2. Build the right abstractions to make the site fast by default; if devs use them, you’re fast without their having to do loads of work.
  3. Partnership.  If other teams are committed you’ll have success.  Find the people that get it and work with them.  Ignore the ignorant.

Final thought – spriting!  He likes spriting.  It’s crazy, but the platform is a little broken so you have to do stuff like that.  But let’s fix the platform so you don’t have to do crazy stuff.  Fast by default!!!

And that’s a wrap for Velocity 2010!  Next, stuff from DevOpsDays, and my thoughts and reflections on what we’ve learned!

Leave a comment

Filed under Conferences, DevOps