DevOps “From the Trenches” Report – HomeAway

We were out at HomeAway for a technical discussion, and DevOps reared its head as it does so frequently nowadays.  They were talking about how they prepared to scale up for their big Chevy Chase Super Bowl commercial, and one of the things they noted was that the traditional dev and ops headbutting changed due to the long hours of work they had to put in together.  They tried going off and doing “their parts” separately – ops doing network, servers, load balancers, and hosting, and developers doing coding, caching, tuning, and testing – but the time pressure, importance, and complexity of the project forced them together into a room, and once they started to collaborate they just stayed there, working in close proximity, for the duration.  When asked about the big takeaways from the entire project, the developers noted that “Learning how everything interacts has changed how we build things” – for example, doing “pull the plug” fault testing has made for more resilient architectures, higher confidence, and better quality of life for both the dev and ops teams!  They didn’t describe it as “DevOps,” but that’s what it boils down to.

The more I talk to other successful Austin tech companies – HomeAway, BazaarVoice, Pervasive – the more I hear DevOps concepts mentioned as keys to their success – and they didn’t do them because they “wanted to do this cool DevOps thing,” but they did what was needed to succeed and it turns out that a part of that is bringing development and operational concerns together into a whole.  It reminds me of the story behind the Visible Ops book, where the authors researched what high performing IT shops had in common and then realized those successful behaviors all mapped to certain ITIL areas (mainly change management).  When successful shops converge on the same practices independently, that is a compelling validation of those practices.

Anyway, I urged them to consider doing that presentation in public venues; it really was a great story and hit on many of the best practices that have been emerging from the ops and performance world over the last few years.  They must be doing something right because they’re growing like gangbusters – if you want to take a vacation and rent someone else’s house/condo instead of going to a hotel, go try out homeaway.com!

Filed under DevOps

DevOps Cafe Podcast

Damon Edwards and John Willis run the DevOps Cafe Podcast.  It’s a great listen, and they have a lot of people on talking about exciting advances in the ops world (including Allspaw, John Kim, kaChing, Shopzilla).  And for this last one, they interviewed me! Apparently we’re on the cutting edge of doing DevOps in a traditional type organization as opposed to a lil’ Web startup.

So if you want to hear me natter on about DevOps and the lessons I’ve learned over my career that have brought me to it for 40 minutes or so, here you go.

Filed under DevOps

Application Security Conference in Austin, TX

I thought I would take this opportunity to invite the agile admin readers to LASCON.   LASCON (Lonestar Application Security Conference) is happening in Austin, TX on October 29th, 2010. The conference is sponsored by OWASP (the Open Web App Security Project) and is an entire day of quality content on web app security.  We’ll be there!

The speaker list is still in the works, but so far we have two presentations from this year’s BlackHat conference, several published authors, and the Director for Software Assurance in the National Cyber Security Division of the Department of Homeland Security, just to name a few, and that’s only the preliminary round of acceptances.

Do you remember a few years ago when there was a worm going around MySpace that infected user profile pages at the rate of over one million in 20 hours?  Yeah, the author of that worm is speaking at the conference.  How can you beat that?

I have been planning this conference for a few months and am pretty excited about it.  If you can make it to Austin on October 29th, we would love to meet you at LASCON.

Filed under Conferences, Security

DevOps and Security

I remember some complaints about DevOps from a couple folks (most notably Rational Survivability) saying “what about security!  And networking!  They’re excluded from DevOps!”  Well, I think that in the agile collaboration world, people are only excluded to the extent that they refuse to work with the agile paradigm.  Ops used to be “excluded” from agile, not because the devs hated them, but because the ops folks themselves didn’t willingly go collaborate with the devs and understand their process and work in that way.  As an ops person, it was hard to go through the process of letting go of my niche of expertise and my comfortable waterfall process, but once I got closer to the devs, understood what they did, and refactored my work to happen in an agile manner, I was as welcome as anyone to the collaborative party, and voila – DevOps.

Frankly, the security and network arenas are less incorporated into the agile team because they don’t understand how to be (or in many cases, don’t want to be).  I’ve done security work and work with a lot of InfoSec folks – we host the Austin OWASP chapter here at NI – and the average security person’s approach embodies most of what agile was created to remove from the development process.  As with any technical niche there’s a lot of elitism and authoritarianism that doesn’t mesh well with agile.

But this week, I saw a great presentation at the Austin OWASP chapter by Andre Gironda (aka “dre”) called Application Assessments Reloaded that covered a lot of ground, but part of it was the first coherent statement I’ve seen about what agile security would look like.  I especially like his term for the security person on the agile team – the “Security Buddy!”  Who can not like their security buddy?  They can hate the hell out of their “InfoSec Compliance Officer,” though.

Anyway, he has a bunch of controversial thoughts (he’s known for that) but the real breakthroughs are acknowledging the agile process, embedding a security “buddy” on the team, and leveraging existing unit test frameworks and QA behavior to perform security testing as well.  I think it’s a great presentation, go check it out!
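
To make the “reuse your unit test framework for security testing” idea concrete, here is a minimal sketch in Python’s built-in unittest.  It is my own illustration, not something from dre’s slides: render_comment is a hypothetical view helper, and the payload list is just a tiny sample.  The idea is that a security check can live in the same test suite as everything else and run on every build.

```python
import html
import unittest


def render_comment(author, body):
    """Hypothetical view helper: returns an HTML fragment for a user comment."""
    # html.escape also escapes quotes by default, which matters inside attributes.
    return f"<p><b>{html.escape(author)}</b>: {html.escape(body)}</p>"


class CommentSecurityTest(unittest.TestCase):
    # A tiny, illustrative sample of cross-site scripting payloads.
    PAYLOADS = [
        "<script>alert(1)</script>",
        '"><img src=x onerror=alert(1)>',
    ]

    def test_body_is_escaped(self):
        for payload in self.PAYLOADS:
            with self.subTest(payload=payload):
                rendered = render_comment("dre", payload)
                # The raw payload must never reach the page unescaped.
                self.assertNotIn(payload, rendered)
                self.assertNotIn("<script", rendered)
                self.assertNotIn("<img", rendered)
```

Run it the same way as the rest of the suite (for example with python -m unittest), so a security regression fails the build just like a functional one.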

Filed under DevOps, Security

Austin Cloud Computing Users Group Meeting Tomorrow

The second meeting of the Austin Cloud Computing Users Group is tomorrow, Tuesday August 24, from 6 to 8 PM, hosted by Pervasive Software in north Austin.

Michael Coté of Redmonk will be talking about recent cloud trends.  Opsource is sponsoring food and drinks.  We’ll have some lightning talks or unconference sessions too, depending.

It’s a great group of folks, so if you are into cloud computing come by and share.  If you plan to attend, please RSVP on Eventbrite here.

Filed under Cloud

Logging for Success

I’ve been working on a logging standards document for our team to use.  We have a lot of desktop-software developers contributing software to the Web now, and it is making me take a step back and re-explain some things I consider basics.  I did some Googling for inspiration and I have to say, there aren’t a lot of coherent bodies of information on what makes logging “good,” especially from an operations point of view.  So I’m going to share some chunks of my thoughts here, and would love to hear feedback.

You get a lot of opinions around logging, including very negative ones that some developers believe.  “Never log!  Just attach a debugger!  It has a performance hit!  It will fill up disks!”  But to an operations person, logs are the lifeblood of figuring out what is going on with a complex system.  So without further ado, for your review…

Why Log?

Logging is often an afterthought in code.  But what you log and when and how you log it is critical to later support of the product.  You will find that good logging not only helps operations and support staff resolve issues quickly, but helps you root-cause problems when they are found in development (or when you are pulled in to figure out a production problem!).  “Attach a debugger” is often not possible if it’s a customer site or production server, and even in an internal development environment as systems grow larger and more complex, logs can help diagnose intermittent problems and issues with external dependencies very effectively.  Here are some logging best practices devised over years of supporting production applications.

Logging Frameworks

Consider using a logging framework to help you with implementing these.  Log4j is a full-featured and popular logging package that has been ported to .NET (Log4net) and about a billion other languages and it gives you a lot of this functionality for free.  If you use a framework, then logging correctly is quick and easy.  You don’t have to use a framework, but if you try to implement a nontrivial set of the below best practices, you’ll probably be sorry you didn’t.

The Log File

  • Give the log a meaningful name, ideally containing the name of the product and/or component that’s logging to it and its intent.  “nifarm_error.log” for example is obviously the error log for NIFarm.  “my.log” is… Who knows.
  • For the filename, to ensure compatibility across Windows and UNIX, use all lower case, no spaces, etc. in the log filenames.
  • Logs should use a .log suffix to distinguish themselves from everything else on the system (not .txt, .xml, etc.).  They can then be found easily and mapped to something appropriate for their often-large size.  (Note that the .log needs to come after other stuff, like the datetime stamp recommended below)
  • Logs targeted at a systems environment should never delete or overwrite themselves.  They should always append and never lose information.  Let operations worry about log file deletion and disk space – do tell them about the log files so they know to handle it though.  All systems-centric software, from Apache on up, logs append-only by default.
  • Logs targeted at a desktop environment should log by default, but use size-restricted logging so that the log size does not grow without bound.
  • Logs should roll so they don’t grow without bound.  A good best practice is to roll daily and add a .YYYYMMDD(.log) suffix to the log name so that a directory full of logs is easily navigable.  The DailyRollingFileAppender in log4j does this automatically.
  • Logs should always have a configurable location.  Applications that write into their own program directory are a security risk.  Systems people prefer to make logs (and temp files and other stuff like that) write to a specific disk/disk location away from the installed product to the point where they could even set the program’s directory/disk to be read only.
  • Put all your logs together.  Don’t scatter them throughout an installation where they’re hard to find and manage (if you make their locations configurable per above, you get this for free, but the default locations shouldn’t be scattered).
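
As a rough illustration of several of the points above (meaningful lower-case name, .log as the final suffix, append-only, daily roll with a YYYYMMDD stamp, configurable location), here is a minimal sketch using Python’s standard logging module rather than Log4j.  The APP_LOG_DIR environment variable and the build_logger and _dated_name helpers are names I made up for the example; treat it as one way to do it, not the way.

```python
import logging
import os
import tempfile
from logging.handlers import TimedRotatingFileHandler


def _dated_name(default_name):
    """Turn 'x.log.20100824' into 'x.20100824.log' so .log stays the final suffix."""
    directory, base = os.path.split(default_name)
    stem, date = base.rsplit(".log.", 1)
    return os.path.join(directory, f"{stem}.{date}.log")


def build_logger(product, intent, log_dir=None):
    """Return a logger that appends to <log_dir>/<product>_<intent>.log and rolls daily."""
    # Configurable location: explicit argument, then the (hypothetical)
    # APP_LOG_DIR environment variable, then a single default directory.
    log_dir = (
        log_dir
        or os.environ.get("APP_LOG_DIR")
        or os.path.join(tempfile.gettempdir(), "app_logs")
    )
    os.makedirs(log_dir, exist_ok=True)

    # Lower case and no spaces, so the name behaves on Windows and UNIX alike.
    filename = f"{product}_{intent}.log".lower().replace(" ", "_")

    # The handler appends by default.  backupCount=0 means rolled files are
    # never deleted by the application; cleanup is left to operations.
    handler = TimedRotatingFileHandler(
        os.path.join(log_dir, filename),
        when="midnight",
        backupCount=0,
        encoding="utf-8",
    )
    handler.suffix = "%Y%m%d"      # date stamp used for rolled files
    handler.namer = _dated_name    # keep .log at the very end of the name
    handler.setFormatter(
        logging.Formatter("%(asctime)s %(levelname)s %(name)s %(message)s")
    )

    logger = logging.getLogger(f"{product}.{intent}".lower())
    logger.setLevel(logging.INFO)
    logger.addHandler(handler)
    return logger
```

Calling build_logger("NIFarm", "error", "/var/log/nifarm") would write to /var/log/nifarm/nifarm_error.log, and at midnight the previous day’s file is renamed with its date stamp (for example nifarm_error.20100824.log).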

Much more after the jump!

Filed under DevOps

Oracle Declares War On Open Source

Some days, it doesn’t pay to be a Java shop.  Oracle has discontinued OpenSolaris and is suing Google over making a Java fork for their mobile phones.  Are OpenJDK, MySQL, and OpenOffice next on the destructive rampage?

We like using open source.  We also like coding our Web apps in Java; it’s heavier duty than PHP/Ruby and more open than .NET – or at least it was.  We’re actually a big Oracle shop – we use MySQL for our cloud offerings but use loads of Oracle databases (and Oracle ERP) internally.  But it’s hard to interpret this as anything other than a series of crappy moves that will result in diminishing the ecosystem we’re trying to use to create products and Web apps.

It seems like Larry Ellison is really fond of the old Microsoft “I am your monopolistic corporate overlord” kind of business relationship.  Warning – if we wanted one of those, we’d use Microsoft; they’re cheaper and better at it.

It’s partly the fault of the open source “companies” – they willingly sell out to the corporate-overlord set (Microsoft, Oracle, IBM, HP, CA) and then, no matter what they say, they’re not really open any more.  Java was allegedly open sourced by Sun but now Oracle is suing Google for exercising that open source license on the basis of patent infringement.  Heck, a lot of the open source companies have instead become “open core” which is often mostly a lie.

So is Open Source already dead, and this is just part of the feeding frenzy of the big boys scooping it up?

Filed under General

Austin Cloud Computing Users Group!

The first meeting of the Austin Cloud Computing Users Group just happened this Tuesday, and it was a good time!  This new effort is kindly hosted by Pervasive Software.  We had folks from all over attend – Pervasive (of course), NI (4 of us went), Dell, ServiceMesh, BazaarVoice, Redmonk, and Zenoss just to name a few.  There are a lot of heavy hitters here in Austin because our town is so lovely!

We basically just introduced ourselves (there were like 50 people there so that took a while) and talked about organization and what we wanted to do.

The next meeting is planned already; it will be at Pervasive from 6:00 to 8:00 PM on Tuesday, August 24.  Michael Coté of Redmonk will be speaking on cloud computing trends.  Meeting format will be a presentation followed by lightning talks and self-forming unconference sessions.  Companies will be buying food and drink for the group in return for a 5 minute “pimp yourself” slot.  Mmm, free dinner.

There is a Google group/mailing list you can join – austin-cug@googlegroups.com.  There’s already some good discussion underway, so join in, and come to the next meeting!

Filed under Cloud, DevOps

Give Me An API Or Give Me Death

Catchy phrase courtesy #meatcloud…   But it’s very true.  I am continuously surprised by the chasm between the “old generation” of software that jealously demands its priests stay inside the temple, and the “new generation” that lets you do things via API easily.  As we’ve been building up a new highly dynamic cloud-based system, we’ve been forced to strongly evaluate our toolset and toss out products with strong “functionality” that can’t be managed well in an automated infrastructure.

Let me say this.  If your product requires either a) manual GUI operations or b) a config file alteration and restart, it is not suitable for the new millennium.  That’s just a fact.

We needed an LDAP server to hold our auth information.  It’s been a while since I’ve done that, so of course OpenLDAP immediately came to mind.  So we tried it.  But what happens when you want to dynamically add a new replication slave?  Oh, you edit a bunch of config files and restart.  Well, sure, I’d like my auth system to be offline all the time, but…  So we tried OpenDS.  The most polished thing in the world?  No.  Does it have all of OpenLDAP’s huge amount of weird functionality that I probably won’t use anyway?  No.  But it does have an administration interface that you can issue directives to and have them take hold in realtime.  “Hey dude start replicating with that new box over there OK?”  “Sir, yes sir.”  “Outstanding.”  And since it’s Java, I can deploy it easily to targets in an automated fashion.  And even though the docs aren’t all up to date and sometimes you have to go through their interactive command line interface to do something – once you do it, the interface can be told to spit out the command-line version of that so you can automate it.  Sold!

The monitoring world is like this too.  Oh, we need an open source monitoring system?  As for everyone else, Nagios comes first to mind.  But then you try to manage a dynamic environment with it.  Again, their “solution” is to edit config files and restart parts of the system.  I don’t know about you, but my monitoring systems tend to be running a LOT of tests at any given time and hiccups in that make Baby Jesus (and frequently whoever is on call) cry.  So we start looking at other options.  “Well, you just come here in the UI and click to add!” the sales rep says proudly.  “Click,” goes the phone.  We end up looking at stuff like Zabbix, Zenoss, etc.  In fact, at least for the short term, we are using Cloudkick.  In terms of the depth of monitoring, it supports 1/100 of what most monitoring solutions do.  System stats mostly; there are plugins for LDAP and MySQL but that’s about it, the rest is “here’s where you can plug in your own custom agent plugin…”  But, as my systems come up they get added to their interface automatically, tagged with my custom namespace.  And I’d rather have my systems IN a monitoring system that will give me 10 metrics than OUTSIDE a monitoring system that would give me 1000.

It’s also about agility.  We are trying to get these products to market way fast.  We don’t have time to become high priests of the “OpenLDAP way of doing things” or the “Nagios way of doing things.”  We want something that works upon install, that you can make a call to (ideally REST-based, though command line is acceptable in a pinch, and if there’s an iPhone app for it you get extra credit) in order to tell it what to do.  Each of these items is about 1/100 of everything that needs to go into a full working system, and so if I have to spend more than a week to get you working and integrate with you – it’s a dealbreaker.  You got away with that back when there weren’t other choices, but now in just about every sector there’s someone who’s figured out that ease of access and REST API for integration plus basic functionality is as valuable as loads of “function points” plus being hellishly crufty.

Heck, we ended up developing our own cloud management stuff because when we looked at the RightScales and whatnot of the world, they did a great job of managing the cloud providers’ direct APIs for you but didn’t then offer an API in return…  And that was a dealbreaker.  You can’t automate end to end if you come smacking up against a GUI.  (Since then, RightScale has put out their own API in beta.  Good work guys!)

More and more, people are seeing that they need and want the “API way.”  If you don’t provide that, then you are effectively obsolete.  If I can’t roll up a new system – either with your software or something your software needs to be looking at/managing – and have it join in with the overall system with a couple simple API commands, you’re doing it wrong.
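
To show what “a couple simple API commands” might look like from the automation side, here is a small Python sketch.  Everything product-specific in it is invented for illustration: the /nodes path, the JSON field names, and the example URL do not belong to Cloudkick, OpenDS, RightScale, or any other real product.  It only builds the HTTP request a provisioning script would send when a new box comes up.

```python
import json
import urllib.request


def build_register_request(base_url, hostname, tags):
    """Build (but do not send) a POST that registers a new node.

    The endpoint path and payload shape are hypothetical.
    """
    payload = json.dumps({"hostname": hostname, "tags": sorted(tags)}).encode("utf-8")
    return urllib.request.Request(
        url=f"{base_url.rstrip('/')}/nodes",
        data=payload,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


# Example: what a boot-time script might construct for a fresh web server.
request = build_register_request(
    "https://monitoring.example.com/api", "web-17", {"prod", "web"}
)
```

Sending it is one more line, urllib.request.urlopen(request), which is exactly the point: no GUI clicks, no config file edit, no restart.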

Filed under Cloud, DevOps

Advanced Persistent Threat, what is it and what can you do about it – TRISC 2010

Talk by James Ryan.

An Advanced Persistent Threat is basically a massively coordinated long term hack attack, often accomplished by nation states or other “very large” organizations, like a business looking for intellectual property and information.  They try to avoid getting caught because they have invested capital in the break-in and want to avoid having to break in again.  APTs are often characterized by slow access to data: they avoid doing things rapidly to avoid detection.

Targets.  There is a question about targets and who is being targeted.  Anything that is crucial infrastructure is targeted.  James Ryan says that we are losing the battle.  We are now fighting (as a nation) nation states with an organized crime type of feel.  We haven’t really found religion to make security happen.  We still treat security as a way to stop rogue 17 year old hackers.

The most prevalent way to carry out an APT is spear phishing with malware.  At this point the attacker is looking for credentials (key loggers, fake websites, …).  Then comes the damage: data exfiltration, data tampering, and shutdown capabilities.  Another way for attackers to avoid getting caught is to have one of their own people get hired by the target company.

APTs use zero-day threats and sit on them.  They use them to stay on the network.

We should assume that the APT is always going to be on our network and that attackers are going to get in regularly.  We can reduce the risk from APTs by doing the following.

  • Implement PKI on smartcards, enterprise wide (according to the speaker, PKI is mathematically proven to be secure for the next 20 years)
  • Hardware based PKI, not software
  • Implement network authentication and enterprise single sign on eSSO with PKI
  • Remote access tied to PKI keycard/smartcard
  • Implement Security Information and Event Management (SIEM), correlate accounts, and trigger alerts on multiple simultaneous sessions for the same account.  Also tie this in with physical access control.
  • Implement PKI with privileged users as well (admins, power users)
  • Decrease access per person, then evaluate and change it as needed
  • Create email tagging from external (avoid spear phishing)
  • Training and testing using spear phishing in the organization
  • Implement USB control to stop external USB
  • Background checks and procedures
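
The “multiple simultaneous sessions” trigger from the list above is simple to sketch.  The Python below is my own illustration, not something from the talk: it assumes session records have already been pulled out of whatever log source you have as (account, source IP, start, end) tuples, and it flags accounts that are logged in from two different sources at the same time.

```python
from collections import defaultdict
from datetime import datetime


def find_simultaneous_sessions(sessions):
    """Map each suspicious account to the sorted source IPs involved.

    `sessions` is an iterable of (account, source_ip, start, end) tuples.
    An account is flagged when two of its sessions overlap in time and
    come from different source IPs.
    """
    by_account = defaultdict(list)
    for account, source, start, end in sessions:
        by_account[account].append((start, end, source))

    flagged = {}
    for account, items in by_account.items():
        items.sort()  # order by start time
        sources = set()
        for i, (start_a, end_a, src_a) in enumerate(items):
            for start_b, end_b, src_b in items[i + 1:]:
                if start_b >= end_a:
                    break  # sorted by start, so nothing later overlaps session a
                if src_a != src_b:
                    sources.update((src_a, src_b))
        if sources:
            flagged[account] = sorted(sources)
    return flagged


def at(hour, minute):
    """Helper for the sample data: a time on an arbitrary example day."""
    return datetime(2010, 4, 12, hour, minute)


sample = [
    ("alice", "10.0.0.5", at(9, 0), at(17, 0)),        # at her desk all day
    ("alice", "203.0.113.9", at(11, 30), at(11, 45)),  # and also logged in remotely?
    ("bob", "10.0.0.7", at(9, 0), at(10, 0)),
    ("bob", "10.0.0.8", at(10, 0), at(11, 0)),         # back to back, no overlap
]
alerts = find_simultaneous_sessions(sample)
```

With the sample data only alice is flagged.  A real SIEM rule would also want the physical access control feed mentioned above, so a remote login while the badge says the user is in the building stands out too.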

James Ryan spent time talking about PKI and the necessity of using it.  I agree that we need better user management, and that need is even greater if you assume that Advanced Persistent Threat operators try to go undetected for a long time and try to get valid user credentials.  What we need to do is control users and access.  This is our biggest vector.

Takeaways:

  • APT is real and dangerous
  • Assume network is owned already
  • Communicate in terms of business continuity
  • PKI should be part of the plan
  • Use proven methods for executing your strategy

Filed under Conferences, Security