Author Archives: Ernest Mueller

Ernest Mueller's avatar

About Ernest Mueller

Ernest is the VP of Engineering at the cloud and DevOps consulting firm Nextira in Austin, TX. More...

Velocity 2012 Day One

Hello all! The Velocity cadre grows as the agile admins spread out.  I’m here with Chris, Larry, and Victor from Bazaarvoice and our new friends Kevin, Bob, and Morgan from Powerreviews which is now Bazaarvoice’s West Coast office; Peco is here with Charlie from Opnet, and James is here… with himself, from Mentor Graphics.  Our old friends from National Instruments Robert, Eric, and Matt are here too. We have quite a Groupme going!

Chris, Peco, James and I were on the same flight, all went well and we ended up at Kabul for a meaty dinner to fortify us for the many iffy breakfasts and lunches to come.  Sadly none of us got into the conference hotel so we were spread across the area.  I’m in the Quality Inn Santa Clara, which is just fine so far (alas, the breakfast is skippable, unlike that place Peco and I always used to stay).

I’m sharing my notes in mildly cleaned up fashion – sorry if it gets incoherent, but this is partially for me and partially for you.

Now it’s time for the first session!  Spoiler alert – it was really, really good and I strongly agree with large swaths of what he has to say.  In retrospect I think this was the best session of Velocity.  It combined high level guidance and tech tips with actionable guidelines. As a result I took an incredible number of notes.  Strap in!

Scaling Typekit: Infrastructure for Startups

by Paul Hammond (@ph) of Typekit, Slides are here: paulhammond.org/2012/startup-infrastructure

Typekit does Web fonts as a service; they were acquired by Adobe early this year. The characteristics of a modern startup are extreme uncertainty and limited money. So this is basically an exercise in effective debt management.

Rule #1 – Don’t run out of money.

Your burn rate is likely # of people on the team * $10k because the people cost is the hugely predominant factor.

Rule #2 – Your time is valuable, Don’t waste it.

He notes the three kinds of startups  – venture funded, bootstrapped, and big company internal.  Sadly he’s not going to talk about big company internal startups, but heck, we did that already at National Instruments so fair enough!  He does say in that case, leverage existing infrastructure unless it’s very bad, then spend effort on making it better instead of focusing on new product ideas.  “Instead of you building a tiny beautiful cloud castle in the corner that gets ignored.” Ouch! The ex-NI’ers look ruefully at each other. Then he discussed startup end states, including acquisition.  Most possible outcomes mean your startup infrastructure will go away at some point. So technical debt is OK, just like normal debt; it’s incurred for agility but like financial must be dealt with promptly.

Look for “excuses” to build the infrastructure you need (business and technical). He cites Small Batch Inc., which did a “How to start a company” conference first thing, forcing incorporation and bank accounts and liability insurance and all that, and then Wikirank, which was not “the product” but an excuse to get everyone working together and learn new tech and run a site as a throwaway before diving into a product. Typekit, in standard Lean Startup fashion, announced in a press release before there was anything to gauge interest, then a funding round, then 6 months later (of 4 people working full time) to get 1.0 out.  Launching a startup is very hard.  Do whatever you can to make it easier.

When they launched their stack was merb/datamapper/resque/mysql/redis/munin/pingdom/chef-solo/ubuntu/slicehost/dynect/edgecast/github/google apps/dropbox/campfire/skype/join.me/every project tracking tool ever.

Now about the tech stack and what worked/didn’t work.

  • Merb is a Web framework like Rails. It got effectively end of lifed and merged into Ruby 3, and to this day they’re still struggling with the transition. Lesson: You will be stuck with your technology choices for a long time.  Choose wisely.
  • Datamapper – a Ruby ORM. Not as popular as ActiveRecord but still going.  Launched on v0.9.11!  Over the long term. many bugs. A 1.0 version came out but it has unknown changes, so they haven’t ported.  The code that stores your data, you need 100% confidence in.  Upgrading to Activerecord was easier because you could do both in parallel.   Lesson: Keep up with upgrades.  Once you’re a couple behind it’s over.
  • Resque – queueing system for Ruby. They love it. Gearman is also a great choice. Lesson: You need a queue – start with one. Retrofitting makes things much harder.
  • Data: MySQL/Redis (and Elasticsearch)
    • MySQL: You have to trust your database like nothing else. You want battle tested, well understood infrastructure here. And scaling mySQL is a solved problem, just read Cal Henderson’s book.
    • Redis: Redis doesn’t do much, which is why it’s awesome.
    • Elasticsearch: Our search needs are small, and elastic search is easy to use.
    • Lessons from their data tier: Choose your technology on what it does today, not promises of the future. They take a couple half hour downtimes a year for schema upgrades. You don’t need 99.999% availability yet as a startup.  Sure, the Facebook/Yahoo/Google presentations about that are so tempting but you/re 4 guys, not them.
  • Monitoring
    • Munin – monitoring, graphing, alerting.  Now collected, nagios and custom code and they hate it.
    • Pingdom is awesome. It’s the service of last resort.
    • Pagerduty is also awesome. Makes sure you get woken up and you know who does.
    • Papertrail is hosted syslog. “It’s not splunk but it’s good enough for our needs.” “But a syslog server is easy to run.  Why use papertrail?” The tools around it are better than what they have time to build themselves.  Hosted services are usually better and cheaper than what you can do yourself.  If there’s one that does what you need, use it.  If it costs less than $70/month buy without thinking about it, because the AWS instance to run whatever open source thingy you were going to use instead costs that much.
    • #monitoringsucks shout-out!  “I don’t know anyone who’s happy with their monitoring that doesn’t have 3-4 full time engineers working on it.”  However, #monitoringsucks isn’t delivering. Every single little open source doohickey you use is something else to go wrong and something they all need to understand.  Nothing is meeting small startups’ needs.  A lot of the hosting ones too, they charge per metric or per host (or both) and that’s discouraging to a startup.  You want to be capturing and graphing as much as you can.
  • Chef – started with chef-solo and rsync; moved to Chef Hosted in 2011 and have been very happy with it.
  • Ubuntu TLS 10.04.  “I don’t thing any startup has ever failed because they picked the wrong Linux distribution.”
  • Slicehost – loved it but then Rackspace shut it down, and the migration sucked – new IPs, hours of downtime. Migrated to Rackspace and EC2. Lots of people are going to bash cloud hosting later at the conference as a waste of money. Counterpoint – “Employees are the biggest cost to a startup.”
  • Start with EC2, period, unless you’re an infra company or totally need super bare metal performance.
  • But – credentials… use IAM to manage them. We use it at BV but it ends up causing a lot of problems too (“So you want your stuff in different IAM accounts to talk to each other like with VPC?  Oh, well, not really supported…”)  Never use the root credentials.
  • Databases in the cloud.  Ephemeral or EBS? Backups? They get a high memory instance, run everything in memory, and then stop worrying about disk IO.  Sha za!  Figure it out later.
  • DynECT – Invisible and fine.
  • Edgecast – cool. CDNs are not created equal, and they have different strengths in regions etc. If you don’t want to hassle with talking to someone on the phone, screw Akamai/Limelight/etc. If you’re not haggling you’re paying too much.  But as a startup, you want click to start, credit card signup. Amazon Cloudfront, Fastly. For Typekit they needed high uptime and high performance as a critical part of the service.  Story time, they had a massive issue with Edgecast as about.me was going live. See Designing for Disaster by Jeff Veen from Velocity Europe. Systems perform in unexpected ways as they grow.  Things have unexpected scaling behavior. Know your escape plan for every infrastructure provider.  That doesn’t have to be “immediate hot backup available,” just a plan.
  • Github – using organizations.
  • Google Apps – yay.  Using Google App Engine for their status page to put it on different infrastructure. They use Stashboard, which we used at NI!

“Buy or build?”

Buy, unless nothing meets your needs.  Then build.  Or if it’s your core business and you’re eating your own dog food.
If it costs more than your annual salary, build it.

A third party provider having an outage is still YOUR problem. Still need a “sorry!” Write your update without naming your service provider.  [You should take responsibility but that seems close to not being transparent to me. -Ed.]  Anyway, buy or build option is “neither” if it’s not needed for the minimum viable product.

You’re not Facebook or Etsy with 100 engineers yet. You don’t need a highly scalable data store.  A half hour outage is OK. You don’t need multi-vendor redundancy, you need a product someone cares about.

Rule #3 – Set up the infrastructure you need.

Rule #4 – Don’t set up infrastructure you don’t need.

Almost every performance problem has been on something they didn’t yet measure.  All their scaling pain points were unexpected.  You can’t plan for everything and the stuff you do plan for may be wasted.

Brain twister: He spent a week to write code to automatically bring up a front end Tomcat server in AWS if one of theirs crashes.  That has never happened in years.  Was that work worth while, does it really meet ROI?

Rule #5 – Don’t make future work for yourself.

There’s a difference between not doing something yet and deliberately setting yourself up for redo.  People talk about “technical debt” but just as in finance, there’s judicious debt and then there’s payday loans. Optimize for change. Every time you grow 10x you’ll need to rewrite. Just make it easy to change.

“You ain’t gonna need it”

Everyone’s startup story:

  1. Find biggest problem
  2. Fix biggest problem
  3. Repeat

The story never reads like:

  1. Up front, plan and build infrastructure based on other companies
  2. Total success!

Minimum Viable Infrastructure for a Startup:

  1. Source control
  2. Configuration management
  3. Servers
  4. Backups
  5. External availability monitoring

So you really could get started with github orgs, rsync/bash, EC2, s3cmd, pingdom, then start improving from there. Well, he’s not really serious you should start that way, he wouldn’t start with rsync again.  But he’s somewhat serious, in that you should really consider the minimum (but good) solution and not get too fancy before you ship.

Watch out for

  • Black swans
  • Vendor lockin
  • Unsupported products
  • Time wasting

Woot! This was a great session, everything from straight dope on specific techs, mistakes made and lessons learned, high level guidance with tangible rules of thumb.

Question and Answer Takeaways:
If you’re going to build, build and open source it to make the ecosystem better
Monitoring – none of them have a decent dashboard. Ganglia, nagios, munin UI sucks.

Intermission

Discussion with Mike Rembetsy and other Etsyans about why JIRA and Confluence are ubiquitously used but people don’t like talking about it.  His theory is that everyone has to hack them so bad that they don’t want to answer 100 questions about “how you made JIRA do that.”

Turning Operational Data Into Gold At Expedia

By Eddie Satterly, previously of Expedia and now with Splunk. This is starting off bad.  I was hoping with Expedia having top billing it was going to be more of a real use case but we’re getting stock splunk vendor pitch.

Eddie Satterly was sr. director of arch at Expedia, now with splunk.  They put 6 TB/day in splunk. Highlights:

  • They built a sdk for cassandra data stores  and archive specific splunks for long term retention to hadoop for batch analysis
  • The big data integration really ramped up the TB/day
  • They do external lookups – geo, ldap, etc.
  • Puppet deploy of the agents/SCCM and gold images
  • A lot of the tealeaf RUM/Omniture Web analytics stuff is being done in splunk now
  • Zenoss integration but moving more to splunk there too
  • Using the file integrity monitoring stuff
  • Custom jobs for unusual volumes and “new errors”

Session was high on generalities; sadly I didn’t really come away with any new insights on splunk from it. Without the sales pitch it could have been a lightning talk.

11 Ways To Hack Puppet For Fun and Productivity

by Luke Kanies. I got here late but all I missed was a puppet overview. Slides on Slideshare.

Examples:
github.com/lak/velocity_2012-Hacking_Puppet
github.com/puppetlabs/puppetlabs-stdlib

  1. Puppet as you.  It doesn’t have to run as root.
  2. Curl speaks.  You can pull catalogs etc. easily, decouple see facts/pull catalog/run catalog/run report.
  3. Data, and lots of it. Catalogs, facts, reports.
  4. Static compiler. Refer to files with checksum instead of URL. And it reduces requests for additional files.
  5. config_version. Find out who made changes in this version.
  6. report processor.
  7. Function
  8. Fact
  9. Types
  10. Providers
  11. Face

Someone’s working on a puppet IDE called geppetto (eclipse based).

I don’t know much puppet yet, so most of this went right by me.

Develop and Test Configuration Management Scripts With Vagrant

By Mitchell Hashimoto from Kiip (@mitchellh). Slides on Speakerdeck.

Sure, you can bring up an ec2 instance and run chef and whatnot, but that gets repetitive. This tempts you to not do incremental systems development, because it takes time and work. So you just “set things up once” and start gathering cruft.

Maybe you have a magic setup script that gets your Macbook all up and running your new killer app. But it’s unlikely, and then it’s not like production.  Requires maintenance, what about small changes… Bah. Or perhaps an uber-readme (read: Confluence wiki page). Naturally prone to intense user error. So, use Vagrant!

We’ll walk through the CLI, VM creation, provisioning, scripted config of vm, network, fs, and setup

Install Virtualbox and Vagrant – All that’s needed are vagrantfile and vagrant CLI
vagrantfile: Per project configuration, ruby DSL
CLI: vagrant <something> e.g “vagrant up”

vagrant box – set up base boxes.  It’s just a single file. “vagrant box add name url”.
Go to vagrantbox.es for more base boxes. They’re big (It’s a vm…)

Project context. “vagrant init <boxtype>” will dump you a file.

“vagrant up” makes a private copy, doesn’t corrupt base box

vagrant up, status, reload, suspend (freeze), halt (shutdown), destroy (delete)

Provides shared folders, NFS to share files host to guest
Shared folder performance degrades with # of files, go to NFS

Provisioning – scripted instal packages, etc.  It supports shell/puppet/chef and soon cfengine.
Use the same scripts as production. vagrant up does utp, but vagrant reload or provision does it in isolation

Networking – port forwarding, host-onlu

port forwarding exposes hosts on the guest via ports on the host, even to the outside.
Simple, over 1024 and open
host only makes a private net of VMs and your host. set IPs or even DHCP it. Beware of IP collisions.
bridge – get IPs from a real router. makes them real boxes, though bad networks won’t do it.

multi vm.  Configure multiple VMs in one file and hook ’em up.  In multi mode you can specify a target on each command to not have it do on all

vagrant package “burns a new AMI” off the current system.
package up installed software, use provisioners for config and managing services

Great for developing and testing chef/puppet/etc scripts. Use prod-quality ops scripts to set up dev env’s, QA. It brings you a nice standard workflow.

Roadmap:

  • other virtualization, vmware, ec2, kvm
  • vagrant builder: ami creator
  • any guest OS

End, Day One!

And we’re done with “Tutorial” day!  The distinction between tutorials and other conference sessions is very weak and O’Reilly would do better to just do a three day conference and right-size people’s presentations – some, like the Typekit one, deserve to be this long.  Others should be a normal conference session and some should be a lightning talk.

Then we went to the Ignites and James and I did Ignite slide karaoke where you have to talk to random slides.  Check out the deck, I got slides 43-47 which were a bit of a tough row to hoe. I got to use my signature phrase “keep your pimp hand strong” however.

1 Comment

Filed under Conferences, DevOps

Puppet and Chef do only half the job

Our first guest post on theagileadmin is by Schlomo Schapiro, Systems Architect and Open Source Evangelist at ImmobilienScout24. I met Schlomo and his colleagues at DevOpsDays and they piqued my interest with their YADT deployment tool they’ve open sourced.  Welcome, Schlomo!

“How do you update your system with OS patches” was one of my favourite questions last week at the Velocity Web Performance and Operations Conference in Santa Clara and at the devopsdays in Mountain View. I tried to ask as many people as would be ready to talk to me about their deployment solutions, most of whom where using one of puppet or chef.

The answers I got ranged from “we build a new image” through “somebody else is doing that” to “I don’t know”. So apparently many people see the OS stack different from their software stack. Personally, I find this very worrying because I strongly believe that one must see the entire stack (OS, software & configuration) as one and also validate it as one.

We see again and again that there is a strong influence from the OS level into the application level. Probably everybody already had to suffer from NFS and autofs and had seen their share of blocked servers. Or maybe how a changed behaviour in threading suddenly makes a server behave completely different under load. While some things like the recent leap second issue are really hard to test, most OS/application interactions can be tested quite well.

For such tests to be really trustworthy the version must be the same between test and production. Unfortunately even a very small difference in versions can be devastating. A specific example we recently suffered from is autofs in RHEL5 and RHEL6 which would die on restart up to the most recent patch. It took us awhile to find out that the autofs in testing was indeed just this little bit newer than in production to actually matter.

If you are using images and not adding any OS patches between image updates, then you are probably on the safe side. If you add patches on top of the image, then you also run a chance that your versions will deviate.

So back to the initial question: If you are using chef, puppet or any other similar tool: How do you manage OS patches? How do you make sure that OS patches and upgrades are tested exactly the same as you test changes in your application and configuration? How do you make sure that you stage them the same? Or use the same rollout process?

For us at ImmobilienScout24 the answer is simple: We treat all changes to our servers exactly the same way without discrimination. The basis for that is that we package all software and configuration into RPM packages and roll them out via YUM channels. In that context it is of course easy to do the same with OS patches and upgrades, they just come from different YUM channels. But the actual change process is exactly the same: Put systems with outstanding RPM updates into maintenance mode, do yum upgrade, start services, run some tests, put systems back into production.

I am not saying that everybody should work that way. For us it works well and we think that it is a very simple way how to deal with the configuration management and deployment issue. What I would ask everybody is to ask yourself how you plan to treat all changes in a server the same way so that the same tests and validation can be applied. I believe that using the same tool to manage all changes makes it much simpler to treat all changes equal than using different tools for different layers of the stack. And if only because a single tool makes it much easier to correlate such changes. Maybe it should be explored more how to use puppet and chef to do the entire job and manage the “lower” layers of the system stack as well as the upper layers.

Are you “doing DevOps”? Then maybe you can look at it like this: If you manage all the stuff on a server the same way it will help you to get everybody onto the same page with regard to OS patches. No more “surprise updates” that catch the developers cold because they are part of all the updates.

Hopefully at the next Velocity somebody will give a talk about how to simplify operations by treating all changes equal.

4 Comments

Filed under Conferences, DevOps

Want To Write For The Agile Admin?

At Velocity and DevOpsDays this week I had a couple people ask about contributing articles to the blog – yes, we’d love to have anyone write on cloud, DevOps, etc. Many folks don’t want to maintain a whole blog themselves (or do and want to crosspromote it, I guess!) so we’re happy to have you here, just email me!

2 Comments

Filed under General

DevOps Illustrated

Naturally, Ops is the Predator and Dev is Ahnuld. We Ops folks are individually fearsome and have cool technology. But the Devs have the numbers on us, and despite speaking somewhat incoherently, win in the end.

1 Comment

Filed under DevOps

DevOps: It’s Not Chef And Puppet

There’s a discussion on the devops Google group about how people are increasingly defining DevOps as “chef and/or puppet users” and that all DevOps is, is using one of these tools.  This is both incorrect and ignorant.

Chef and puppet are individual tools that you can use to implement specific parts of an overall DevOps strategy if you want to use them – but that’s it.  They are fine tools, but do not “solve DevOps” for you, nor are they even the only correct thing to do within their provisioning niche. (One recent poster got raked over the coals for insisting that he wanted to do things a different way than use these tools…)

This confusion isn’t unexpected, people often don’t want to think too deeply about  a new idea and just want the silver bullet. Let’s take a relevant analogy.

Agile.  I see a lot of people implement Scrum blindly with no real understanding of anything in the Agile Manifesto, and then religiously defend every one of Scrum’s default implementation details regardless of its suitability to their environment.  Although that’s better than those that even half ass it more and just say “we’ve gone to our devs coding in sprints now, we must be agile, woot!” Of course they didn’t set up any of the guard rails, and use it as an exclude to eliminate architecture, design, and project planning, and are confused when colossal failure results.

DevOps.  “You must use chef or puppet, that is DevOps, now let’s spend the rest of our time fighting over which is better!” That’s really equivalent to the lowest level of sophistication there in Agile.  It’s human nature, there are people that can’t/don’t want to engage in higher order thought about problems, they want to grab and go.  I kinda wish we could come up with a little bit more of a playbook, like Scrum is for Agile, where at least someone who doesn’t like to think will have a little more guidance about what a bundle of best practices *might* look like, at least it gives hints about what to do outside the world of yum repos.  Maybe Gene Kim/John Willis/etc’s new DevOps Cookbook coming out soon(?) will help with that.

My own personal stab at “What is DevOps” tries to divide up principles, methods, and practices and uses agile as the analogy to show how you have to treat it.  Understand the principles, establish a method, choose practices.  If you start by grabbing a single practice, it might make something better – but it also might make something worse.

Back at NI, the UNIX group established cfengine management of our Web boxes. But they didn’t want us, the Web Ops group, to use it, on the grounds that it would be work and a hassle. But then if we installed software that needed, say, an init script (or really anything outside of /opt) they would freak out because their lovely configurations were out of sync and yell at us. Our response was of course “these servers are here to, you know, run software, not just happily hum along in silence.” Automation tools can make things much worse, not just better.

At this week’s Agile Austin DevOps SIG, we had ~30 folks doing openspaces, and I saw some of this.  “I have this problem.”  “You can do that in chef or puppet!” “Really?”  “Well… I mean, you could implement it yourself… Kinda using data bags or something…” “So when you say I can implement it in chef, you’re saying that in the same sense as ‘I could implement it in Java?'”  “Uh… yeah.” “Thanks kid, you’ve been a big help.”

If someone has a systems problem, and you say that the answer to that problem is “chef” or “puppet,” you understand neither the problem nor the tools. It’s “when you have a hammer, everything looks like a nail – and beyond that, every part of a construction job should be solved by nailing shit together.”

We also do need to circle back up and do a better job of defining DevOps.  We backed off that early on, and as a result we have people as notable as Adrian Cockroft saying “What’s DevOps?  I see a bunch of conflicting blog posts, whatever, I’ll coin my own *Ops term.” That’s on us for not getting our act together. I have yet to see a good concise DevOps definition that is unique if you remove the word DevOps and insert something else (“DevOps helps you bring value to software!  DevOps is about delivering a service to a customer!” s/DevOps/100 other things/).

At DevOpsDays, some folks contended that some folks “get” DevOps and others don’t and we should leave them to their shame and just do our work. But I feel like we have some responsibility to the industry in general, so it’s not just “the elite people doing the elite things.” But everyone agreed the tool focus is getting too much – John Willis even proposed a moratorium on chef/puppet/tooling talks at DevOpsDays Mountain View because people are deviating from the real point of DevOps in favor of the knickknacks too much.

7 Comments

Filed under DevOps

Monitorin’ Ain’t Easy

The DevOps space has been aglow with discussion about monitoring.  Monitoring, much like pimping, is not easy. Everyone does it, but few do it well.

Luckily, here on the agile admin we are known for keeping our pimp hands strong, especially when it comes to monitoring. So let’s talk about monitoring, and how to use it to keep your systems in line and giving you your money!

In November, I posted about why your monitoring is lying to you. It turns out that this is part of a DevOps-wide frenzy of attention to monitoring.

John Vincent (@lusis) started it with his Why Monitoring Sucks blog post, which has morphed into a monitoringsucks github project to catalog monitoring tools and needs. Mainstream monitoring hasn’t changed in oh like 10 years but the world of computing certainly has, so there’s a gap appearing. This refrain was quickly picked up by others (Monitoring  Sucks. Do Something About It.) There was a Monitoring Sucks panel at SCALE last week and there’s even a #monitoringsucks hashtag.

Patrick Debois (@patrickdebois) has helped step into the gap with his series of “Monitoring Wonderland” articles where he’s rounding up all kinds of tools. Check them out…

However it just shows how fragmented and confusing the space is. It also focuses almost completely on the open source side – I love open source and all but sometimes you have to pay for something. Though the “big ol’ suite” approach from the HP/IBM/CA lot makes me scream and flee, there’s definitely options worth paying for.

Today we had a local DevOps meetup here in Austin where we discussed monitoring. It showed how fragmented the current state is.  And we met some other folks from a “real” engineering company, like NI, and it brought to mind how crap IT type monitoring is when compared to engineering monitoring in terms of sophistication.  IT monitoring usually has better interfaces and alerting, but IT monitoring products are very proud when they have “line graphs!” or the holy grail, “histograms!” Engineering monitoring systems have algorithms where they can figure out the difference between a real problem and a monitoring problem.  They apply advanced algorithms when looking at incoming metrics (hint: signal processing).  When is anyone in IT world who’s all delirious about how cool “metrics” going to figure out some math above the community college level?

To me, the biggest gap especially in cloud land – partially being addressed by New Relic and Boundary – in the space is agent based real user monitoring.  I want to know each user and incoming/outgoing transaction, not at the “tcpdump” level but at the meaningful level.  And I don’t want to have to count on the app to log it – besides the fact that devs are notoriously shitful loggers, there are so many cases where something goes wrong – if tomcat’s down, it’s not logging, but requests are still coming in…  Synthetic monitoring and app metrics are good but they tend to not answer most of the really hard questions we get with cloud apps.

We did a big APM (application performance management) tool eval at NI, and got a good idea of the strengths and weaknesses of the many approaches. You end up wanting many/all of them really. Pulling box metrics via SNMP or agents, hitting URLs via synthetic monitors locally or across the Internet, passive network based real user monitoring, deep dive metric gathering (Opnet/AppDynamics/New Relic/etc.)…  We’ll post more about our thoughts on all these (especially Peco, who led that eval and is now working for an APM company!).

Your thoughts on monitoring?  Hit me!

Leave a comment

Filed under DevOps

Service Delivery Best Practices?

So… DevOps.  DevOps vs NoOps has been making the rounds lately. At Bazaarvoice we are spawning a bunch of decentralized teams not using that nasty centralized ops team, but wanting to do it all themselves.  This led me to contemplate how to express the things that Operations does in a way that turns them into developer requirements?

Because to be honest a lot of our communication in the Ops world is incestuous.  If one were to point a developer (or worse, a product manager) at the venerable infrastructures.org site, they’d immediately poop themselves and flee.  It’s a laundry list of crap for neckbeards, but spends no time on “why do you want this?” “What value will this bring to your customer?”

So for the “NoOps” crowd who want to do ops themselves – what’s an effective way to state these needs?  I took previous work – ideas from the pretty comprehensive but waterfally “Systems Development Framework” Peco and I did for NI, Chris from Bazaarvoice’s existing DevOps work – and here’s a cut.  I’d love feedback. The goal is to have a small discrete set of areas (5-7), with a small discrete set (5-7) of “most important” items – most importantly, stated in a way so that people understand that these aren’t “annoying things some ops guy wants me to do instead of writing 3l33t code” but instead stated as they should be, customer facing requirements like any other.  They could then be each broken into subsidiary more specific user stories (probably a whole lot of them, to be honest) but these could stand in as “epics” or “pillars” for making your Web thing work right.

Service Delivery Best Practices

Availability

  • I will make my service highly available, so that customers can use it constantly without issues
  • I will know about any problems with my service’s delivery to customers
  • I can restore my service quickly from any disaster or loss
  • I can make my data resistant to corruption from runtime issues
  • I will test my service under routine disruption to make it rugged and understand its failure modes

Performance

  • I know how all parts of the product perform and can measure performance against a customer SLA
  • I can run my application and its data globally, to serve our global customer base
  • I will test my application’s performance and I know my app’s limitations before it reaches them

Support

  • I will make my application supportable by the entire organization now and in the future
  • I know what every user hitting my service did
  • I know who has access to my code and data and release mechanism and what they did
  • I can account for all my servers, apps, users, and data
  • I can determine the root cause of issues with my service quickly
  • I can predict the costs and requirements of my service ahead of time enough that it is never a limiter
  • I will understand the communication needs of customers and the organization and will make them aware of all upgrades and downtime
  • I can handle incidents large and small with my services effectively

Security

  • I will make every service secure
  • I understand Web application vulnerabilities and will design and code my services to not be subject to them
  • I will test my services to be able to prove to customers that they are secure
  • I will make my data appropriately private and secure against theft and tampering
  • I will understand the requirements of security auditors and expectations of customers regarding the security of our services

Deployment

  • I can get code to production quickly and safely
  • I can deploy code in the middle of the day with no downtime or customer impact
  • I can pilot functionality to limited sets of users
  • I will make it easy for other teams to develop apps integrated with mine

Also to capture that everyone starting out can’t and shouldn’t do all of this… We struggled over whether these were “Goals” or “Best Practices” or what… On the one hand you should only put in e.g. “as much security as you need” but on the other hand there’s value in understanding the optimal condition.

2 Comments

Filed under DevOps

Breaking the Silence – Agile Admin Updates!

Well I know it’s been quiet around here lately.  Two of the agile admins, myself and Peco, have switched jobs and have been crazy busy.

Peco has moved to Opnet as a SE – he always loved their APM tool Panorama and we got a lot of mileage out of it at NI. Notably, this included getting some of our suppliers, like Vignette, to buy it and use it on their products to find the performance problems before we did… We were having Vignette V7 performance issues and used Panorama to identify exactly what they were, and when we fed it back to Vignette engineering they were like “how the hell did you do that…” Opnet’s never had great marketing, they’ve been doing what New Relic and AppDynamics have been doing for a long time but aren’t as much of a recognized name. It’s an interesting move for Peco, he went from long time Ops guy to a Java dev, working on system automation like PIE, and now an SE because he wants trigger time on customer interaction.  He’s a renaissance man!  And working on his own stealth startup on the side of course.

I have moved to Bazaarvoice, a SaaS startup in Austin (well, are you still a startup when  you’re up to 400 people?) to be their Manager of Release Engineering.  They have been increasing their dev and ops forces by a very large margin (P.S. We’re hiring!) and wanted someone to run the “middle third” of their DevOps teams.  There’s one team of DevOps embedded into all the product teams, kind of a matrixed organization; then there’s my team to do the build and deploy automation and cloud and data center stuff; then there’s a support team to wrangle the pages and tickets.

I was there for a week doing new hire training when they said “So we have moved to agile doing two week sprints, and that’s going OK except we realized we can’t release every two weeks safely. Get that going.  We start biweekly releases in three weeks. Zero customer impact each time!” Therefore I’ve been in a little bit of a frenzy doing that. It was definitely on the fine line between “I like a challenge” and “Oh shit.” I lucked out and got a week of slack because we decided to IPO on the date that the first release was supposed to go, and we decided that doing both on the same day was just a wee bit too ambitious. The first biweekly release slipped from a Thursday till the next Tuesday and only had two minor issues that needed fixing after, and the next one is coming up!  We should have it tamped down to regular in a sprint or two and I hope to return to more regular blogging.

On the side of course I continue to help run the Austin Cloud User Group and the Agile Austin DevOps SIG and put together DevOpsDays Austin and do other random stuff, so sadly when the overcommittedness hits the blog has to give.

James is doing great, and just had a new baby!  So he’s been out of pocket some too. He’s still at NI, and now large and in charge of all operations for their SaaS products. They are fighting the brave fight to get PIE open sourced and out the door.

We are all going to SXSW Interactive this weekend and will regroup and decide how we’re going to bring you all the latest in hot cloud/DevOps/tech stuff going forward!

1 Comment

Filed under General

Awesome Austin Tech Meetups

Austin is such a great place to be a techie.

  • The Austin Cloud User Group (I help run it) meets every third Tuesday evening, and we’ve ben having 50+ people come in to check out some awesome stuff.  Next meeting Feb 21 on Puppet, hosted by Pervasive.
  • The Agile Austin DevOps SIG meets fourth Wednesdays, we had our meeting today and had about 20 attendees, hosted by CA/Hyperformix. I also help run that one.
  • The Austin Big Data User Group is back meeting – next one is tomorrow night! Hosted by Bazaarvoice.
  • The Austin OWASP chapter is one of the biggest and most active in the country, and also meets monthly, hosted by National Instruments. Fellow Agile Admin James Wickett helps run that group.
  • The Cloud Security Alliance, Austin chapter is just getting started but has a lot of momentum and we’re coordinating with them from the ACUG and OWASP sides. Their first meeting is tonight, come out!

There are others but those are my favorites and therefore the coolest by definition.

There’s also cool events coming up you should keep an eye out for.

  • DevOpsDays Austin, Apr 2-3, hosted by National Instruments, and this’ll be big! Patrick Debois and the whole crew of DevOps illuminati will be here. Now taking sponsors and speakers! Register now!
  • AppSec USA 2012, Oct 23-26 – Austin OWASP kicks so much ass with LASCON that the annual OWASP convention is coming here to Austin this year!
  • South by Southwest Interactive, March 9-13 – quickly becoming theWeb conference in the flyover states :-). Lots of stuff happens during it, like:
    • Austin Cloud/DevOps party courtesy GeekAustin (ACUG is a community sponsor). March 10.
    • CloudCamp – Dave Nielsen will be bringing a CloudCamp to Austin again this year during SXSWi. Details TBD, sounding like Mar 11 maybe.
  • The Cloud Security Alliance and ACUG are hoping to put together an Austin cloud conference, too. Maybe early 2013.

Leave a comment

Filed under Cloud, Conferences, DevOps

Why Does Cloud Load Balancing Suck?

Back in the old world of real infrastructure, we used Netscalers or F5’s and we were happy.  Now in the cloud, you have several options all of which seem to have problems.

1. Open source.  But once you want SSL, and redundancy, and HTTP compression, you get people saying with a straight face “nginx (for HTTP compression) –> Varnish cache (for caching) –> HTTP level load balancer (HAProxy, or nginx, or the Varnish built-in) –> webservers.” (Quoted from Server Fault).  Like four levels, often with the same software twice in it. And don’t forget some kind of heartbeat between the two front-ends. Oh look I’ve spent $150/mo on just machines to run my load balancing. And I really want to load balance/failover between all my tiers not just the front end.  It’s a lot of software parts to go wrong.

2. Zeus.  For some reason none of the other LB vendors have gotten off their happy asses and delivered a good software load balancer you can use in Amazon.  I got tired of talking to our Netscaler reps about it after the first couple years.  They’re more interested in selling their hardware to the cloud data centers than helping real people load balance their apps. Zeus is the only one – and it’s really quite expensive

3. Amazon ELBs.  These just have a lot of problems under the hood.  We’ve been engaged with Amazon ELB product management on them – large files serve out super slow; users get hits refused due to throttling/changes during ELB scaling – basically if you want 100% of your hits to come through you can’t use them.

4. Geo-IP load balancing, through Dyn or whoever. They claim to have the failover problem fixed, but it still only works for the front end tier of what is a multitier architecture. I certainly don’t want to have to advertise every internal IP in external DNS to make load balancing work.

And really the frustrating part is there seems to have been no headway on any of this stuff in a decade. Same old open source options, same old techniques.  Can someone come up with a way to load balance on the cloud that a) doesn’t lose any hits, b) is one thing not 4 things, and c) is useful for front and back end balancing?  Seems like a necessary part of oh say every system ever, why is it still so hard?

31 Comments

Filed under Cloud, DevOps