Category Archives: DevOps

Pertaining to agile system administration concepts and techniques.

New DevOps Courses In The Can!

Well, it was a busy week last week – James, Karthik, and I were all in lovely Carpinteria, CA at the Learning HQ to film some new DevOps Foundations courses!

We have two out already – James and I did DevOps Foundations, which lays out the basics of DevOps from culture to containers. It’s a three hour course that should suffice to orient someone in all the ways of DevOps, and it defines Continuous Delivery, Infrastructure Automation, and Reliability Engineering as its three practice areas. (There’s a course handout under the Exercise Files that has links and bibliography, as well.)  It’s DevOps 101 if you could use that!

And then we started to flesh out the major DevOps practice areas we defined in that course as 200-level courses.  These focus on concepts but illustrate with tool demos. So we filmed DevOps Foundations: Infrastructure Automation in March, which released the end of April.  It’s two hours, and covers infrastructure as code concepts and the basics of creating infrastructure from specs with e.g. CloudFormation, provisioning systems with e.g. Chef, and going immutable with Docker.

But now we have an irresistible urge to do more, so in a double shot that took about a year off my life, last week James and I recorded DevOps Foundations: Continuous Delivery, which goes over continuous integration and delivery and shows you how to build a delivery pipeline – we used  Jenkins/Nexus/Chef/go/abao/Robot Framework but again we focused on concepts and did just enough implementation to illustrate it.

[Photo: 2017-07-17 14.01.27]

James went home mid-week and Karthik came out, and we also recorded DevOps Foundations: Lean and Agile!  Lean and Agile are integrally related to DevOps and especially to being successful at DevOps.  Our content manager actually asked us to do this one; we were kinda bulling ahead on our three main practice areas, but we said sure!  We cover some Agile and Lean basics, and then we take a tip from The Goal and The Phoenix Project, and the bulk of the course is a fictional implementation stitched together from real experiences we’ve both had doing these at various companies.  It was fun!  Here’s a look at behind the scenes.

[Photo: behind the scenes, 2017-07-21 14.47.33]

Both of these should drop in about 5 weeks, so keep an eye out.




Filed under DevOps

AWS CLI Queries and jq

Anyone who’s worked with the AWS CLI/API knows what a joy it is. Who hasn’t gotten API-throttled?  Woot!  Well, anyway, at work we’re using CloudHealth to enforce AWS tagging to keep costs under control; all servers must be tagged with an owner: and an expires: date or else they get stopped or, after some time, terminated.   Unfortunately CloudHealth doesn’t understand CloudFormation stacks, so it leaves stacks around after having ripped the instances out of them.  We also have dozens of developers starting instances and CloudFormation stacks every day.  Sometimes they clean up after themselves, but often they don’t (and they like to set that expires: tag way in the future, which is a management problem not a technology problem).  Sometimes we hit AWS quotas in our dev environment and an irritated engineer goes and “cleans up” a bunch of stuff – often not in the way that they should (like by terminating instances and not stacks).

So, I was throwing together a quick bash script to find some of the resulting exceptions and orphans in our environment.  Unattached EBS volumes.  CloudFormation stacks where someone terminated the EC2 instance and figured they were done, instead of actually deleting the stack. Stuff like that. This got into some advanced AWS CLI-fu and use of jq, both of which are snazzy enough I thought I’d share.

Here’s my script, which I’ll explain. You will need the AWS CLI installed (on OSX El Capitan, “brew install awscli” or “pip install --ignore-installed six --upgrade --user awscli” – the --ignore-installed six works around an El Capitan problem).  You need to run “aws configure” to set up your creds and region and such, and set output to json.  And you need to install jq, “brew install jq.”

#!/usr/bin/env bash
# This script finds problematic CloudFormation stacks and EC2 instances in the AWS account/region your credentials point at.
# It finds CF stacks with missing/terminated and stopped EC2 hosts.  It finds EC2 hosts with missing owner and expires tags.
# It finds unattached volumes. Should you delete them all?  Probably. Kill the EC2 instances first because it'll probably
# make more orphan CF stacks.


echo "Finding misconfigured AWS assets, stand by..."
for STACK in $(aws cloudformation list-stacks --stack-status-filter CREATE_COMPLETE UPDATE_COMPLETE --max-items 1000 | jq -r '.StackSummaries[].StackName')
do
        # Find the EC2 instance(s) this stack owns, if any.
        INSTANCE=$(aws cloudformation describe-stack-resources --stack-name "$STACK" | jq -r '.StackResources[] | select (.ResourceType=="AWS::EC2::Instance")|.PhysicalResourceId')
        if [[ ! -z $INSTANCE  ]]; then
                STATUS=$(aws ec2 describe-instance-status --include-all-instances --instance-ids $INSTANCE 2> /dev/null | jq -r '.InstanceStatuses[].InstanceState.Name')
                if [[ -z $STATUS  ]]; then
                        # No status came back: the instance is gone.
                        BADSTACKS="${BADSTACKS:+$BADSTACKS }$STACK"
                elif [[ ${STATUS} == "stopped" ]]; then
                        # STOPPEDSTACKS mirrors BADSTACKS (name reconstructed for this listing).
                        STOPPEDSTACKS="${STOPPEDSTACKS:+$STOPPEDSTACKS }$STACK"
                fi
        fi
done

echo "CloudFormation stacks with missing EC2 instances: (aws cloudformation delete-stack --stack-name)"
echo "$BADSTACKS"

echo "CloudFormation stacks with stopped EC2 instances: (aws cloudformation delete-stack --stack-name)"
echo "$STOPPEDSTACKS"

echo "EC2 instances without owner tag: (aws ec2 terminate-instances --instance-ids)"
aws ec2 describe-instances --query "Reservations[].Instances[].{ID: InstanceId, Tag: Tags[].Key}" --output json | jq -c '.[]' | grep -vi owner | jq -r '.ID' | awk -v ORS=' ' '{ print $1  }' | sed 's/ $//'

echo "EC2 instances without expires tag: (aws ec2 terminate-instances --instance-ids)"
aws ec2 describe-instances --query "Reservations[].Instances[].{ID: InstanceId, Tag: Tags[].Key}" --output json | jq -c '.[]' | grep -vi expires | jq -r '.ID' | awk -v ORS=' ' '{ print $1   }' | sed 's/ $//'

echo "Unattached EBS volumes: (aws ec2 delete-volume --volume-id)"
aws ec2 describe-volumes --query 'Volumes[?State==`available`].{ID: VolumeId, State: State}' --output json | jq -c '.[]' | jq -r '.ID' | awk -v ORS=' ' '{ print $1  }' | sed 's/ $//'


The AWS cli, of course, lets you manipulate your AWS account from the command line.  jq is a command line JSON parser.

Let’s look at where I rummage through my CloudFormation stacks looking for missing servers.

aws cloudformation list-stacks --stack-status-filter CREATE_COMPLETE UPDATE_COMPLETE --max-items 1000 | jq -r '.StackSummaries[].StackName'

Every separate aws subsection works a little differently.  aws cloudformation lets you filter on status, and CREATE_COMPLETE and UPDATE_COMPLETE are the “good” statuses – valid stacks not in flight right now.  The CLI likes to jack with you by limiting how many responses it gives back, which is super not useful, so we set “--max-items 1000” as an arbitrarily large number to get them all.  This gives us a big ol’ JSON output of all the cloudformation stacks.

{
    "StackSummaries": [
        {
            "StackId": "arn:aws:cloudformation:us-east-1:12345689:stack/mystack/1e8f2ba0-4247-11e7-aad1-500c28601499",
            "StackName": "mystack",
            "CreationTime": "2017-05-26T19:11:28.557Z",
            "StackStatus": "CREATE_COMPLETE",
            "TemplateDescription": "USM Elastic Search Node"
        }
    ]
}

Now we pipe it through jq.

jq -r '.StackSummaries[].StackName'

This says to just output in plain text (-r) the StackName of each stack. You use that dot notation to traverse down the JSON structure.  So now we have a big ol’ list of stacks.

For each stack, we have to go find any EC2 instances in it and check their status. So this time we use a select inside our jq call, to find only items whose resource type is “AWS::EC2::Instance”.

aws cloudformation describe-stack-resources --stack-name $STACK | jq -r '.StackResources[] | select (.ResourceType=="AWS::EC2::Instance")|.PhysicalResourceId'

And then for each of those instances, we get their status, which is in the InstanceState.Name field.

aws ec2 describe-instance-status --include-all-instances --instance-ids $INSTANCE 2> /dev/null | jq -r '.InstanceStatuses[].InstanceState.Name'
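
To see the select step in isolation, here is a minimal sketch that runs the same jq filter over a made-up response instead of a live stack. The function name stack_instance_ids, the variable SAMPLE_RESOURCES, and the resource IDs are all invented for illustration; only the field names come from the real describe-stack-resources output used above.

```shell
#!/usr/bin/env bash
# Hypothetical helper: reads `aws cloudformation describe-stack-resources`
# JSON on stdin and prints the physical ID of each EC2 instance, one per line.
stack_instance_ids() {
    jq -r '.StackResources[]
           | select(.ResourceType == "AWS::EC2::Instance")
           | .PhysicalResourceId'
}

# Made-up sample response, trimmed to the two fields the filter touches.
SAMPLE_RESOURCES='{
  "StackResources": [
    {"ResourceType": "AWS::EC2::SecurityGroup", "PhysicalResourceId": "sg-11111111"},
    {"ResourceType": "AWS::EC2::Instance",      "PhysicalResourceId": "i-22222222"}
  ]
}'

# Try it on the sample:
#   printf '%s' "$SAMPLE_RESOURCES" | stack_instance_ids
# Or against a real stack:
#   aws cloudformation describe-stack-resources --stack-name "$STACK" | stack_instance_ids
```

The security group is dropped by the select and only the instance ID comes out, which is exactly what the loop in the script feeds to describe-instance-status.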

That works.  But there’s more than one way to do it!  The AWS CLI commands support a “--query” parameter, which takes a JMESPath expression that the CLI applies to the response before printing it, so you have to do less parsing afterwards!

To find instances without the owner tag,

aws ec2 describe-instances --query "Reservations[].Instances[].{ID: InstanceId, Tag: Tags[].Key}" --output json | jq -c '.[]' | grep -vi owner | jq -r '.ID' | awk -v ORS=' ' '{ print $1  }' | sed 's/ $//'

What this does is look under Reservations.Instances and basically outputs me a new JSON with just the ID and tag keys in it.  “jq -c '.[]'” just crunches each one into a one-liner.   I use grep -vi to throw out the ones that do have an owner tag, turn the remaining IDs into one line with awk, and strip the trailing space at the end from the awk with a sed (ah, UNIX string manipulation).

With this, you can choose what to put into the --query and what to do after in jq.  The --query trims the output down to just the fields you care about, but it runs in the CLI after the response comes back, so it won’t save you from result limits; to filter on the AWS end, use a command’s own filter options (such as --filters on the ec2 describe-* commands).

You can do filters in the query – so for example, when I do the volumes, instead of doing what I did for the tags using grep, I can instead just do:

aws ec2 describe-volumes --query 'Volumes[?State==`available`].{ID: VolumeId, State: State}' --output json | jq -c '.[]'

Yes, those are backticks, don’t blame the messenger.  This is more precise when you can get it to work.  In the instances’ case, people aren’t good about using the same case (owner, Owner, OWNER) and also I just plain couldn’t figure out how to properly create the query, “Reservations[].Instances[?Tags[].Key==`owner`” and other variations didn’t work for me.  I’m no JSON query expert, so good enough!
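
If the mixed-case tag names are the sticking point, one option is to skip the query and let jq do the whole job. This is only a sketch, not what the script above does: the helper name instances_missing_tag and the sample data are made up, and it assumes jq 1.5 or later for any() and ascii_downcase.

```shell
#!/usr/bin/env bash
# Hypothetical helper: reads `aws ec2 describe-instances` JSON on stdin and
# prints, on one space-separated line, the IDs of instances that have no tag
# whose key matches $1, ignoring case (owner, Owner, OWNER all count).
instances_missing_tag() {
    jq -r --arg key "$1" '
        [ .Reservations[].Instances[]
          | select(any(.Tags[]?; (.Key | ascii_downcase) == ($key | ascii_downcase)) | not)
          | .InstanceId ]
        | join(" ")'
}

# Made-up sample: i-aaa has both tags, i-bbb has only OWNER, i-ccc has no tags at all.
SAMPLE_INSTANCES='{
  "Reservations": [
    {"Instances": [
      {"InstanceId": "i-aaa", "Tags": [{"Key": "Owner", "Value": "me"}, {"Key": "expires", "Value": "2017-12-31"}]},
      {"InstanceId": "i-bbb", "Tags": [{"Key": "OWNER", "Value": "you"}]},
      {"InstanceId": "i-ccc"}
    ]}
  ]
}'

# printf '%s' "$SAMPLE_INSTANCES" | instances_missing_tag owner     # prints: i-ccc
# printf '%s' "$SAMPLE_INSTANCES" | instances_missing_tag expires   # prints: i-bbb i-ccc
# Against AWS:  aws ec2 describe-instances --output json | instances_missing_tag owner
```

The `.Tags[]?` keeps jq from erroring on instances with no Tags field at all, and join(" ") takes the place of the awk and sed at the end of the pipeline.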

Between the CLI queries and jq, you should be able to automate any common task you want to do with AWS!


Filed under DevOps

Tcpdump and Wireshark on OSX

I’m going to start sharing little techie tidbits that require me to go scour the Internet for exactly how to do them, in hopes of making you able to do it in a lot less time than it took me!

So I’m having trouble with connection times spiking to an Amazon Web Services ELB, so it’s time to break out the tcpdump to take packet traces and the wireshark (was ethereal long ago) to analyze it.  I’m on OSX El Capitan (10.11.6).

tcpdump comes on OSX (or if it doesn’t, something installed it without me knowing!).  Step one is figure out what network interface you want to dump.  This will list all your network interfaces.

networksetup -listallhardwareports

Then, run a packet trace on that interface.  I’m using en0, the primary wireless interface, so I run:

sudo tcpdump -i en0 -s 0 -B 524288 -w ~/Desktop/DumpFile01.pcap

I go to another window and hit the URL I’m having trouble with – you can use whatever, but I used ab (Apachebench) which comes with OSX.  Other popular URL-hitters you might install are curl, wget, and siege.

ab -n 10 http:///

Then come back and control-C out of the tcpdump capture.  Now I have a network dump of me hitting that URL (plus whatever other shenanigans my computer was up to at the time, so there’s probably a lot of noise in there from chat clients etc.).
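
To cut down some of that noise before opening the file, tcpdump can re-read a capture with a filter expression. A minimal sketch: the helper name capture_filter and the hostname my-elb.example.com are placeholders I made up, and the filter assumes you only care about TCP traffic to or from one host on one port.

```shell
#!/usr/bin/env bash
# Hypothetical helper: build a pcap filter expression for one host and TCP port.
capture_filter() {
    local host=$1
    local port=${2:-80}   # default to plain HTTP
    printf 'host %s and tcp port %s\n' "$host" "$port"
}

# Usage (not run here; my-elb.example.com is a placeholder):
#   Print only the matching packets from the saved capture:
#     tcpdump -nn -r ~/Desktop/DumpFile01.pcap "$(capture_filter my-elb.example.com 80)"
#   Or write them to a smaller capture file to load instead:
#     tcpdump -r ~/Desktop/DumpFile01.pcap -w ~/Desktop/Filtered.pcap "$(capture_filter my-elb.example.com 80)"
```

The same expression can also be appended to the original capture command so the noise never lands in the file in the first place.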

Now to analyze it – wireshark.  I had to go a couple rounds with the installation.

If you want the UI you need to install it as:

brew install wireshark --with-qt

(If you just install wireshark without --with-qt you don’t get wireshark, you get a command line called tshark, and then you need to reinstall…) For this, as with most things, you need Xcode or at least the Xcode command line tools (I always just install the tools).  You install them with:

xcode-select --install

But if you have an older version (<8.2.1) the wireshark build will fail.  To update the command line tools, you… Apparently you don’t any more.  The App Store doesn’t offer Command Line Tools updates and Apple has gotten more unclear and squirrelly about whether they’re even a thing.  So I just installed full Xcode from the App Store, whatever, it’s just network and disk space and contributing to the heat death of the universe, but I’m not bitter, and then it builds.

Optionally if you want to capture from within wireshark on your local box instead of having to tcpdump separately also do

brew cask install wireshark-chmodbpf

But to analyze your tcpdump file just run

wireshark

And load in the capture file. The quickest way to then sort into what you want is to find one part of a transaction of interest – like in my case by filtering on “http” or just looking around – and then right-clicking on one packet and saying “Follow… HTTP stream” and you get a whole transaction end to end.

[Screenshot: 2017-05-26 at 2.43.56 PM]

[Screenshot: 2017-05-26 at 2.44.16 PM]

And now you test out your TCP/IP network admin knowledge by rooting through and seeing if you can find what’s going wrong!


Filed under DevOps

Crazy Convention Season’s Coming To Austin

Sorry we’ve been quiet around here… Too much real work to do!  On top of real work, there’s our vigorous schedule of conference and user group stuff.  James just spoke on Serverless at RSA, and James and I are working on our second DevOps course (Infrastructure Automation) to start drilling down from our DevOps Fundamentals course we released late last year.

And conference season is about to break over Austin like a storm.  It’s insane this year.  So many great events in a two month period.

Our Events

We Agile Admins love the tech community and we spend a lot of our free time organizing stuff to help it out!

Of course we’re working hard on preparing DevOpsDays Austin 2017, the sixth year of one of the biggest DevOpsDays events anywhere!   May 4-5.  This year we’re bringing in a real all star set of speakers as our Monsters of DevOps Reunion Tour – including DevOps OGs like Patrick Debois, John Willis, Damon Edwards, Gene Kim, Andrew Shafer, Adrian Cockcroft, Jez Humble, Nicole Forsgren… We’re pulling out all the stops this year! We are also again inviting local user groups to come out and participate – you can get a couple free tickets per group and we’re even going to offer to sponsor a meeting for groups that come out.

User group wise, we’ve retooled CloudAustin a little to be more practitioner focused, come check it out.  Also, Peco’s Austin Monitoring Meetup is putting some really good content in.

The Rest Of The Events

Well, at least the ones on my radar personally.

The giant conference that locks up Austin for two weeks – SXSW is almost upon us.  SXSW Interactive is March 10-16.

The biggest new splash in conferences, Dockercon, is coming to Austin April 17-20!  I’ll totally be there.

Serverlessconf is coming to Austin for the first time on April 26-28.

OSCON is coming back to Austin again this year, May 8-11.

Keep Austin Agile is May 25.

Security conference BSides Austin is on the same dates DevOpsDays is, May 4-5. 😦

In terms of user groups, I’ve started to use Austin Tech Events’ Twitter and email newsletter to keep up with them all.


Filed under Conferences, DevOps

But How Can IT Do DevOps?


That’s how!

While James and I were filming our course, DevOps Fundamentals, I was impressed by this LinkedIn tech vending machine in their offices.  Swipe badge, get doodad.  I had just come off two different jobs where it was more like “pester IT for the thing you need, wait a couple weeks, maybe someone will drop by with one.”

It’s not “Chef or Puppet,” but it is self service automation!  Think outside the box.  Or in this case, I guess, inside a box. Kudos to LinkedIn IT for a user friendly solution.


Filed under DevOps

Loose Lips Sink Ships: Precision in Language

We talk a lot about communication in the DevOps space.  Often the discussion is about how to “talk nice” and be collaborative, but I wanted to take a minute to talk about something even more profound – the precision of your communication in the first place.

I was reminded of this because over the holidays I’ve been doing some nonfiction reading – first, Thomas J. Cutler’s The Battle of Leyte Gulf, and after that, Cornelius Ryan’s classic A Bridge Too Far.

Both books are rife with examples of how imprecise language – often at the highest levels – caused terrible errors and loss of life.

The example that I’ll unpack is the infamous (to WWII wonks, at least) example of Admiral “Bull”  Halsey’s naval charge northward leaving the landing forces at Leyte Gulf completely unprotected. He had previously sent a message indicating his intent to split his forces from three into four task forces. When he later sent a message that he was proceeding north with “three task forces” to pursue Japanese aircraft carriers, Admiral Nimitz and other commanders assumed that the fourth was being left behind to defend the landing forces.  But this was not the case – he had meant “I plan to reform into four groups, you know, sometime in the future, but not now.” By the time the others began to suss out that something was wrong, it was too late – Halsey was too far north to return in time when a large Japanese fleet came upon the much lighter landing forces and forced them into a lengthy fighting retreat. Many American ships sacrificed themselves against an overwhelming opponent to buy time for the doomed task force, until, for reasons still not made clear to posterity, Japanese Admiral Kurita broke off the attack. (Extensive historical analysis hasn’t clearly determined whether he had bad intel, suffered from battle fatigue, or simply lost his nerve. The after action report of Admiral Sprague of the landing forces simply stated that “The failure of the enemy… to wipe out all vessels of this task unit can be attributed to our successful smoke-screen, our torpedo counterattack, continuous harassment of the enemy by bomb, torpedo, and strafing air attacks, timely maneuvers, and the definite partiality of Almighty God.”)  A happier ending than it could have been, but still, ships sank and men died due to a lack of clear and precise communication.

The same kind of miscommunication happens at work all the time.  A recent real example – I was incident commander on a production incident where a customer-facing server couldn’t be connected to by a support server used to perform interactive customer debugging sessions. It was being coordinated through HipChat. One engineer was looking at the problem from the customer-facing server.  One of the engineers who administers the support server came on to chat and a discussion ensued. The support engineer indicated that there were hundreds of interactive customer support sessions in progress on that support server. Another senior engineer, an expert on the customer servers, declared, “Well, let’s try rebooting the server.  It’s the only thing to do.” “Really?” says the support engineer.  “Yes, let’s do it,” says the senior engineer.

Of course he meant the customer-facing server, which we knew didn’t have any traffic at the moment, and not the support server.  But the support engineer assumed he meant the support server, and proceeded to begin to reboot it – at this point I intervened and had to say “STOP, HE IS NOT TALKING ABOUT YOUR SERVER.” But it was a reasonable mistake – if you talk about “the server,” then everyone is going to interpret that as “the server most important to me at this time.” Anyway, we stopped the support engineer from disrupting 100 customer support sessions and we restarted the other server instead. (Which didn’t help of course, “reboot the server” is something that went out with Windows NT, it’s a desperate attempt to not perform quality troubleshooting – but that’s material for another post.)

If at any time you find yourself talking to someone else about something there’s clearly more than one of:

  • “the” server – or even “the” support server or “the” bastion server
  • “the” incident/problem
  • “that” ticket
  • “that” email I sent you
  • “the” bug
  • “the” build

You are being a dumbass.  I can’t be blunt enough about this.  You are deliberately courting miscommunication that in the best case adds time and friction and in the worst case causes major errors to be made.  I have seen costly errors happen because of real world exchanges just like this:

“Are you working on the problem?”  “Yes.”  (And by this they mean the other problem the asker doesn’t know about.)

“It’s critical we fix that big bug from this morning!” “Yes, someone’s on it!” (And by this they mean the bug that got ranked as high criticality in the backlog, but the asker means their own special favorite bug.)

“Is the server back up?”  “Yes.” “OK I will now begin a lengthy operation… Hey what the hell?!?”  (Oh, you meant another one of the 5 servers being restarted this hour.)

Everyone in an environment should know, in a sober moment, that there are often multiple incidents going on in a day (and even at the same time), that there are multiple servers and multiple tickets (and apps and environments and builds and developers and customers…).

If you hit me in chat and say “the build broke what could be wrong” – and our build server has somewhere around 100 builds in it, all of which are breaking and being fixed continuously by a large group of developers – at best you are being super inconsiderate and forcing me to do diagnostic work that could be forestalled by you cut-and-pasting a URL or whatnot into your browser, and potentially I’m going to go look at some other build and you get help much later than you wanted it.

And if you’ve sent me 10 emails that morning, asking me “if you got that last email I sent you” is pointless verbal masturbation of an excessively time-wasting variety.  Either I’m going to assume and potentially give you the wrong answer, or I have to spend time trying to elicit identifying information from you.

Be Specific

I expect crisp and exact communication out of my engineers, and it’s as simple as hearkening back to grade school and using proper nouns.

You are working on bug BUG-123, “Subject line of the bug.”   You are logged into server i-123456 as ec2-user. You are working on incident I-23, “Customer X’s system is down.” You sent me an email around 10 AM with the subject line “Emergency audit response needed.”

Being in chat vs email vs in person/phone/video call is not an excuse.  This is not a “tone formality” issue, it is a precision and correctness issue.

Having people with different primary languages is not an excuse. Unless you’re from El-Adrel, your language has proper nouns and you know how to use them, and you know which server you’re logged into.  (Well, OK, you might not, but that’s a different problem.) Slang is a language problem, other things are a language problem, but being able to clearly state the proper noun at hand is NOT a language problem.  Proper nouns are the one thing that never need translating!

Detect Fuzzy Language And Make It Specific

But what if you’re not the one initiating the discussion?  What if someone else is already giving you unclear direction or questions?  Well, I’m going to let you in on a little secret – managers are often terrible at this.  Like with Admiral Halsey, sometimes being “hard charging” gets confused with “lack of clarity.” Sometimes it seems like the higher level of management someone’s in, the more they think that the general rules of discourse and process don’t apply to them. You must not be “in tune with the business needs” if you aren’t sure exactly which thing they are vaguely referring to. Frustrating, but you just need to manage up by starting to be precise yourself. Don’t compound the possible error by making an assumption; if there’s not a proper noun in play you need to state one yourself to ensure precision.

“Customer X is angry are you working on that bug?!?”  Obviously “yes” or “no” is probably an answer that, if you’re not on the same page already, will continue to cause confusion. Reply “I am working on bug BUG-123, ‘Subject line of the bug.’  Is that the one  you’re talking about?”

“Did you get that last email I expect a response!” says the call or text.  Well, just saying “yes” because I saw an email from that guy an hour ago I just responded to isn’t right –  it might not be the right one. Get clarification.

Whatever happens, try to verify information you’re being given.  “Reboot the server” – “To confirm, you want me to reboot server i-123436, the customer support server, right?”

Ask fast.  In the example above, Nimitz and the other commanders just let the imprecision slide, thinking “Well…  I’m sure he meant the fourth task group is staying…  I don’t want to look dumb by asking or make him look bad by having to clarify.  It’ll all be fine.”  By the time they got nervous enough to ask him directly, it was too late for him to return in time, and people died as a result. Now sometimes issues are time-sensitive and the person sending the request/demand doesn’t respond to your request for clarification in a timely manner, even if you respond immediately.  What do you do?  Well, you have to do what you think is best/safest, but try to get things back on the right path ASAP by using all the communication paths at hand to clarify. If the request was unclear, do not let your response be unclear.  “Restart the server!”  “Done.”  You’re both culpable at this point. Instead, reply “I restarted server i-123456 at 10:50 AM.”

Then afterwards, send the person this article so they can understand how important precision is.  Luckily, in IT it is usually less likely to cause deaths than in the military, but poor communication can and has lost many dollars, hours, and jobs.


Filed under DevOps

DevOps Fundamentals Going Strong!

Our LinkedIn Learning content manager, Jeff Kellum, tells James and me that our DevOps Fundamentals course is “the third most popular IT course in our library right now”!  You can start a free trial period at Lynda.  We’ll post more about the experience we had making the course; it was a lot of fun and we learned a lot about going in front of the camera!



Filed under DevOps