Author Archives: Ernest Mueller

About Ernest Mueller

Ernest is Engineering Manager at Verica, in Austin, TX.

All Day DevOps Is Coming Up

And James, Karthik, and I are hosting the Modern Infrastructure track again this year! ADDO is a free, 24 hour, multi-track online conference with a lot of great speakers. More info follows…

What: All Day DevOps, Live Online

When: November 12, 2020 (24 hours)

Where: From your desktop, laptop, or mobile device

Free: Click here to register

On November 12th, we will be supporting the 5th Anniversary Celebration for All Day DevOps. This is a 24 hour event with 6 simultaneous tracks, delivering 180+ sessions, live online. Session tracks include CI/CD Continuous Everything, Cloud Native Infrastructure and Monitoring, DevSecOps, Cultural Transformation, Site Reliability Engineering and Government.

Check out the amazing speaker lineup. Registration is free. Full details are located at AllDayDevOps.com.

Virtual Viewing Parties: Hosting a virtual viewing party is free for anyone in the community – you supply the group and the connection, while All Day DevOps provides the content. Here are some guidelines with detailed information and tips to assist you with your party planning.

Filed under Conferences, DevOps

Change Management – Share Your Thoughts!

Hey loyal admineers! (I just made that up.) I wanted to toss a question out to our readers. I'm working on a Change Management course for LinkedIn Learning right now to join my other courses, and I was hoping to hear some good new techniques people are using that a) ensure compliance but b) are not super heavy and lame.

My current plan is to cover the ITIL, COBIT, ISO 27001, etc. frameworks but then come in with a practical approach inspired by Visible Ops and leavened with DevOps innovation.

Chime in in the comments – how do you do change management? What compliance regimes do you have to fulfill? Are you using one of the ITSM frameworks? Are you using a tool (ServiceNow, aka "ServiceNo," followed by JIRA were the two most popular when I asked at the latest CloudAustin meeting)? Are you using any techniques that you think are excellent and would like others to hear about? I'd love to hear from you!

Filed under DevOps

Upcoming Agile Admin Stuff

We're not being idle – there's a lot going on here in lockdown.

  1. James is speaking at the DevOps Institute's DevSecOps SKILup Day today – go check that out!
  2. Karthik's going to be on the DroidDevCast podcast this Friday talking about testing!
  3. James, Karthik, and Ernest are all running the Modern Infrastructure track at All Day DevOps in November. Sign up now! And we have a couple talk spots free – if you think you have a super amazing talk, and especially if you are from an underrepresented group, hit one of us up (Twitter DMs are fine) and we'll look at getting you in.

Filed under Conferences, General

Chaos In The Time Of Covid

“See, here I’m now sitting by myself, uh, er, talking to myself. That’s, that’s chaos theory.”
– Jeff Goldblum as Ian Malcolm, Jurassic Park

Hello all!  I know it’s been a while between posts here at the Agile Admin. Everyone’s safe and healthy, though. We’ve been quiet but busy, and in a twist of happy fate many of us are working together again!

James, Karthik, Bill, and Ernest are all now at Verica, a chaos engineering startup founded by Casey Rosenthal and Aaron Rinehart. It’s 100% remote, with people from Portland to Portland, as we like to say (Maine to Oregon). We’re still working on the initial release of the product; my team (which includes Karthik and Bill) is working on the Kubernetes module.  Really interesting stuff!  Casey literally co-wrote the book on chaos engineering and it’s a fascinating bleeding edge part of systems engineering.  If we can lure Peco over to join us then we’ll have the entire band back together!

Community stuff has been quiet.  We had to fight to get our money back from the venue after cancelling DevOpsDays Austin 2020, but we finally prevailed.  The CloudAustin user group is meeting virtually.  And James, Karthik, and I are running the Modern Infrastructure track for All Day DevOps in November. Karthik's speaking at KubeCon in August. We're doing light work on some LinkedIn Learning courses, though I have to admit that with the additional burden of the pandemic I haven't been getting a lot of side stuff done personally.

As we get on our feet with chaos experimentation and this dang lockdown eases up, we hope to be back with our usual mix of technical insight and Texan cussedness!  Because let's be honest, Twitter is where you go to get dumber.

Filed under General

Pragmatic Pipeline Security

Check out agile admin James Wickett’s talk from DeliveryConf last month on adding security into your continuous software delivery pipeline!

Filed under Conferences, DevOps, Security

How do I start with Kubernetes in the enterprise?

Fellow Agile Admin Karthik Gaekwad did this video interview late last year on k8s in the enterprise from his perspective as an Oracle Cloud devrel, and I thought I’d share it here!

Filed under k8s

Record and Replay Browser Testing, Take 2

Recently I reported, and I quote, "Bah!" from trying a bunch of record and playback cross-browser testing options. To recap: we're a startup, our devs are writing UI tests, but we don't have cross-browser testing, so I tried to find something where I could record and replay our pretty simple Angular UI flow and get some cross-browser testing on it without needing code. And I didn't find anything that worked. But I got a bit more traction after that first run at it, so here's part 2.

EndTest

The EndTest support folks looked into it and fixed my test so it worked. Some of the fixes were alternate locator schemes.  Some were advanced options I'm not sure how I would have figured out on my own (to close those pesky multiselect dropdowns, you can't just click on the overlay backdrop that comes up, you have to offset the click by some pixels…).  I am not a front end guy, I just fake it, so this is a little daunting, but their support seems able to help with tests, so it's doable.

I now have a working test, somewhat generated from the recording and somewhat generated by programming. The trial doesn't allow other browsers by default, but they turned them on for me.  With some light fiddling I got them all running; the only issue was IE, which didn't like that pixel-offset workaround from above, and EndTest fixed it by changing my test to hit "enter" instead – fair enough.

I then also made a test for the second part of our flow, which has a PDF that needs validation; you can do that with a screenshot comparison.

So after help from their support, I have a working test suite. It's a stretch to say it's pure record and replay – it's record, edit the locators a bunch, and replay – but since my last iteration on this concluded "none of these solutions work and do crossbrowser," that's a big win.

Sauce Labs

Last time I had been trying to use the Selenium IDE for record/replay and integrate with Sauce for the crossbrowser testing. They got back to me, extended my trial minutes, and gave me some tips on how to get some of the tests working.  They don't support the Selenium IDE, however, so they don't really help with the tests per se.

I had a weird experience: there were intermittent (but frequent) timeouts from running tests against Sauce with selenium-side-runner. At some point 0-4 minutes into the test it would just hang and time out, giving me an "ETIMEDOUT connect ETIMEDOUT 66.85.49.22:443".

And then I got into the lovely rich set of things that don't work the same across browsers.  Oh, in Edge "the element isn't clickable because it's obscured."  In Firefox on Mac "that radio button isn't clickable because the inner part of the radio button obscures it." All different elements.  The problem is, fiddling with the locators to get something one browser likes breaks them in the other browsers.  But the app works in those browsers – just not the tests.

Anyway, Sauce support said "we can't support Selenium IDE questions," so I guess that's it.  (I don't know that timeouts when running tests on their service count as Selenium IDE questions, but whatever.) I had hoped when I found Selenium IDE that record and replay with Sauce was feasible, but it seems like it's a starting point to seed your own Selenium code at best.

mabl

A sales rep from mabl wouldn't let go of me till I tried their solution. So I did, and it has a lot of promise.  It tries a bunch of different methods to find a locator automatically and self-heals the tests, which is great.  On the one hand the tests are slow; a 4-minute Selenium test is a 6-8 minute mabl test. On the other hand, it works!

It worked the first time, in fact (well… second, but that was because I went into a click frenzy trying to get the mabl trainer window and our Hubspot popup out of my way). I now have a working mabl test, though it's hitting one-minute timeouts and confusion on that same "click on the backdrop to get out of the multiselect dropdown" issue that's a pain in all the other solutions as well – it does it, but only after a long timeout, and it doesn't self-heal it for the next test run, so it works but is slow. I hate multiselect dropdowns. Anyway, after a discussion with mabl support, it turns out that it tries a bunch of ways to find a locator and self-heals if it found a better one, then tries a bunch of ways to click/exercise that locator but doesn't self-heal from that – hence the long timeout in my test.  I gave them the feedback "self-heal that too, yo!"

OK, so it's working – how about cross-browser?  I add Firefox: it works.  IE and Edge are only on the "enterprise plan," but I ask them to add them to the trial, they do, I run them, and…  they work!  Safari… well, a problem there, but mabl looks at it and thinks it may be on their end; they're still working on a fix as of "press time." I'm super impressed – of all the stuff I've tried, only GhostInspector actually recorded and replayed without significant recoding, and that was only Chrome/Firefox. They also do PDF testing, which we need.

So it works great and looks good!  And they have a lot of cool dev integration and stuff to get into – it uses a branching model for the tests, you can run the tests locally…  Great looking ecosystem. You can't choose the OS platform, though; Chrome and Firefox just run on Linux. At our current level of detail that's fine.

Next step is discussing pricing (the web site just says "contact us")… and unfortunately it's way, way out of reach of a 10-person startup. The cross-browser and PDF testing are part of the "Enterprise plan," but even if I sweet-talk them into putting those features in their lowest-cost plan, it's still cost prohibitive for us while we're in seed round.  I mean, it's totally worth it because it works out of the box without fiddling around – if I were still at AT&T Cybersecurity I'd make someone spring for it for sure – but it's an extremely significant pricing step above these other solutions and not at the right value point for where we are right now.  Dang.

New Bottom Line

OK so my new bottom line is this.

  1. If you just want quickie Chrome/Firefox on Linux record and replay, GhostInspector works out of the box. Nice but I want more browsers, doesn’t quite fit my needs.
  2. If you want record and replay on various OS/browser combinations and are willing to do some re-coding and testing of it to make it work, EndTest does it. Fits my need with a little pain, and is affordable.
  3. If you want cross-browser record and replay without hand coding, without full platform choices but with a bunch of cool dev friendly extras, mabl does a great job – for a lot more money than the other options.  Best for my needs, but most expensive.
  4. The other options basically don't work on Angular (CrossBrowserTesting), or possibly work but with an intensive time investment that makes it not really record and replay IMO (Sauce + Selenium IDE).
  5. Though if you don't need a bunch of CI-driven executions per day and don't care about all the platforms, you can probably just use the Selenium IDE for free against the 3 major browsers installed locally on whatever laptop you have (Safari/Chrome/Firefox on Mac or Chrome/Firefox/Whatever Microsoft Is Pushing Today on Windows). Free but DIY.

So for us being on a seed round budget, EndTest is probably the best compromise of functionality and price at this point, YMMV of course.

Filed under DevOps

Trying To Record And Replay Browser Tests… FML

I’m working for a startup right now and we don’t have a huge excess of development staff.  Our devs have been implementing UI testing in Cypress, but we also need some wide cross-browser testing of our front end Angular apps – we’d already found a couple blocker bugs on Edge and IE largely by accident.  The devs are all busy devving, so I figured I’d take that on. I said, “Well, there’s products where you can just click to record a UI session and replay it in other browsers without writing a bunch of code, let’s try that out.”  Most everyone has a free trial nowadays so I could see which ones were best. Then the pain began.

Sauce Labs – Part 1

I had used Sauce in a previous life, when we had a bunch of Robot Framework/Selenium tests and I liked it.  So I went there first.  Unfortunately, they have no record/replay capability, verified by their support, so I moved on.

But I came back later because I had found that there's a Selenium IDE – a record-and-playback tester – that you can integrate with Sauce using selenium-side-runner.

Selenium IDE is very cool. Its killer feature is that as it records, it captures various ways to address an item on the screen – css, xpath, full xpath – and when it replays, if the first one doesn't work it tries the next, and if that one works it tells you "hey, you should update this test." That's great, because UI testing is shitty and unreliable at best, and once you have Angular generating ever-changing ids for elements it's even worse.  The only bad thing is you have to go add assertions in manually afterwards.
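To give you an idea, here's roughly the shape of one recorded step inside a saved ".side" file – abridged, and the element and locators here are invented for illustration; the real point is the "targets" fallback list:

{
  "command": "click",
  "target": "css=#submit-button",
  "targets": [
    ["css=#submit-button", "css:finder"],
    ["xpath=//button[contains(.,'Submit')]", "xpath:innerText"],
    ["xpath=//form/button[2]", "xpath:position"]
  ],
  "value": ""
}

If "target" fails at replay time, the IDE walks down the "targets" list until one matches, then nags you to update the test.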

So in fairly short order, I managed to get a reproducible Selenium IDE script that exercises our Angular app and works.  The app's just like 7 screens of form fill; it's not crazy.

Well, then I tried to save it as a “.side” project and feed it through Sauce by using selenium-side-runner, which is just:

npm install -g selenium-side-runner

selenium-side-runner --server <sauce-url> -c "browserName='chrome' version='latest' platform='macOS 10.14'" 'Paul Precision.side'

You get that Sauce URL, which has credentials embedded, under User Settings/Driver Creation in their UI.
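For reference, the assembled command ends up looking something like this – a sketch, with env var names that are just my own convention (keep the key out of your shell history if you can):

export SAUCE_USERNAME='your-username'      # from your Sauce account page
export SAUCE_ACCESS_KEY='your-access-key'  # ditto
selenium-side-runner \
  --server "https://${SAUCE_USERNAME}:${SAUCE_ACCESS_KEY}@ondemand.saucelabs.com:443/wd/hub" \
  -c "browserName='chrome' version='latest' platform='macOS 10.14'" \
  'Paul Precision.side'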

Unfortunately, once I push it to Sauce (starting on the same OS/browser, which you get the tokens for from their Platform Configurator) – problems. The player is great: it shows the video (even live while testing) synced to the step taking place (though since I'm piping the test in, it shows the steps not in the test syntax but in raw Selenium execution syntax).

I fixed most of the problems by changing selectors away from CSS to XPath, then sitting there iterating with Chrome dev tools and the IDE, trying new ways to address an item that work in Chrome, then work in the Selenium IDE, and then… don't work on Sauce. I've gotten it 90% working, but the last 10% is blocking me.

CrossBrowserTesting

Next I tried SmartBear's CrossBrowserTesting.com – an all-in-browser recorder that worked great!  And then the replays didn't work.  I messed with it a while and contacted their support, who said "Oh yeah, it doesn't do Angular, it's for static pages."  So, on to the next one. Who uses static pages? This is 2020.

The interface is nice enough – editable steps next to a running video (though not synced up).

Actually, looking at it closer, I bet I could do the same "edit all the locators" deal and try to get it to work, but… my 7-day trial is over (a week shorter than the other options), so I guess I can't try.  It didn't do the nice multi-locator guessing Selenium IDE did, but it does seem to have several options in a dropdown while I edit the tests, and the recorder is integrated into the offering, so that's nice – the UI was good overall. Unfortunately the super short trial and the presales support saying "Angular? Go away!" prevented me from really seeing if it can work for us.

GhostInspector

Demoralized, I head to Twitter, and someone recommends GhostInspector.  You record with a Chrome plugin and then replay in the browser – there's video, and it shows the editable steps next to a screenshot with the % change from the last screenshot (the steps aren't synced to the video, which would be better).  You can do assertions while you're recording.  And the replay works the first time and every time – Hallelujah!

And then I look to set up cross-browser and discover they only support Chrome and Firefox, and to even do that in an automated manner you have to duplicate the entire test suite.  I was so disappointed; it worked perfectly otherwise.

Seriously y’all if you add more browsers I’ll pay you immediately for this.

EndTest

Determined to make this happen, I find EndTest and, after verifying they support a full OS/browser matrix, try them. They also use a browser-plugin recorder, like GhostInspector.

I'll be honest, the UX is terrible.  Besides the 1990s colored icons, everything is always a click away – you have to watch the replay video separately from looking at the logs, separately from looking at the steps, separately from editing the steps. Everyone, the magic combination is: editable steps to the left, running video and logs to the right, highlight the step you're on as it plays. Anything else harms your usability.  Also, while editing steps you can't add a step in just anywhere; you have to add it at the end of your 100 steps and then drag it up page by page…  And often when you do that you just get "error saving test" messages for no reason. Argh.

But… the recording is quick, and then it's semi-working.  Tempting.  Now I start the iterative edit-replay-debug cycle.  It is slow. You get to give your steps a name, but those names don't show up in the test output, because why would they.  After an afternoon of fiddling, I'm halfway through a 7-screen flow. Their support was nicely proactive and reached out to me about an error (I was looking for text with a $ in it and you can't do that, but you can define a variable and then use that…).

It’s at this point I also find the Selenium IDE and bring Sauce back into the mix.

Keep Trying – Sauce and EndTest

Next, what I was doing was fiddling with the steps in the Selenium IDE, then pumping those changes both into Sauce via the CLI and manually editing them into EndTest's UI, desperately hoping to get one to pass (they don't act the same under the same inputs, for whatever reason).

Locator by locator, I grind through making the test work.  I have a lot of trouble where we use multiple-option mat-selects, because they "stay open" while you select items and I can't get them to close.  I try sending ESCAPE keys but can't get that to work; I try double-clicking on other things…  One of our devs figured out the magical thing to click on to close the damn multiselect box was the overlay backdrop (css=.cdk-overlay-backdrop) – see the sketch below.
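In ".side" terms the workaround is just an extra click step after your selections – a minimal sketch, where the mat-option locator is invented for illustration but the backdrop locator is the actual fix:

{ "command": "click", "target": "xpath=//mat-option[contains(.,'Visual Line of Sight')]", "value": "" },
{ "command": "click", "target": "css=.cdk-overlay-backdrop", "value": "" }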

This takes several grueling days.  I ask support folks for help but don't really get any useful traction.  Finally, I get a magic combination in the Selenium IDE that also works in Sauce!  I try the same locators in EndTest and they don't work.

It's super frustrating.  The same locator doesn't work in all 3 tools, often forcing me to choose a less portable option – instead of something resilient to change like xpath=//span[contains(.,'Visual Line of Sight')], which works in some cases, I end up having to use something like xpath=//mat-option[@id='mat-option-87']/mat-pseudo-checkbox (and sadly, in Angular Material those IDs randomize unpredictably). Like, there will literally be two identical-except-for-the-text-and-ids-in-them widgets one after another, and one kind of locator works on the first one and not on the second. No idea why.

Sauce Labs – Part 2

OK, so of all the options, the only one that actually works for me and will allegedly do crossbrowser testing is an unsupported combo of Selenium IDE and Sauce run off the command line.

Not optimal, but at this point I’m a week in and taking what I can get.  Let’s try an actual crossbrowser matrix now.  Bonus hacky Bash script:

#!/usr/bin/env bash
# Hacky matrix runner: replay each .side test against every Sauce Labs
# OS/browser combo via selenium-side-runner.

tests=("Paul Precision.side")

platforms=("browserName='chrome' version='latest' platform='macOS 10.14'"
           "browserName='chrome' version='latest' platform='Windows 10'"
           "browserName='chrome' version='latest' platform='Linux'"
           "browserName='MicrosoftEdge' version='latest' platform='Windows 10'"
           "browserName='safari' version='latest' platform='macOS 10.14'"
           "browserName='firefox' version='latest' platform='macOS 10.14'"
           "browserName='internet explorer' version='latest' platform='Windows 10'")

for test in "${tests[@]}"; do
    for platform in "${platforms[@]}"; do
        echo "Running ${test} on ${platform}"
        echo
        # <secrets> is the embedded user:key pair from User Settings/Driver Creation
        selenium-side-runner --server "https://<secrets>@ondemand.saucelabs.com:443/wd/hub" -c "${platform}" "${test}"
    done
done
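If you save that as, say, crossbrowser.sh (hypothetical name), running the whole matrix with a log you can compare across runs is just:

chmod +x crossbrowser.sh
./crossbrowser.sh 2>&1 | tee "crossbrowser-$(date +%Y%m%d-%H%M).log"

Then grep the log for which platform/test combos blew up.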

Chrome on MacOS – works.  Chrome on Windows – works.  Chrome on Linux – for some reason it can't find a selector early on.  Edge on Windows – weird proxy 400 error, won't even load the page.  Pretty sure that's not my fault.  Safari on MacOS – can't click on the first thing it needs to click on.  Firefox on MacOS – same error?  Really?  Now IE… out of minutes (despite the UI telling me 0.6 automated hours remain).

I have tried all these OS/browser combos manually and they work.

So my conclusion is all these suck, and I guess I just need to pay manual QA people to click on our app.  Great.  Or wait for Cypress to get off their butts and add cross-browser support, which they've said "is coming" for three years now.

We’re a startup and time is money, so in the end cross-browser testing is not worth the hassle in all these solutions.  But it is important and I’d love someone to make a solution that actually works for it.

P.S. Please do not suggest another solution unless it a) has UI record and replay capability and b) is cross-browser (Chrome/Firefox/Safari/IE/Edge on Windows/MacOS/Linux). I know there's a million browser automation testing tools out there; that's not what I need.

Update

I put some more time into this and got some working options – see Record and Replay Browser Testing, Take 2!

Filed under DevOps

Ask A Tech Manager

We had a really interesting joint CloudAustin/Austin DevOps meeting this week – a tech manager panel!  Ask them anything!  We had 92 people attend and we had to herd them all out of the building with sticks at the end because everyone had so many questions that even at two hours it was still going strong.

My takeaways:

  • Understanding managers’ viewpoint and goals is critical to an individual engineer when looking to get hired or promoted or whatever.
  • Managers want you to be successful – you succeeding, versus you struggling (and/or leaving or being let go), is a huge win for them in all ways.
  • Looking for a job?
    • Your resume should be tuned to a job and, either through its top section or a cover letter, tell a story. Hiring managers get 100-200 applications/resumes per job. They don't expect you to have everything on the job application – "50% is fine" was the consensus. But they're doing a first cut before they talk to you, and a lot of people spam resumes, so if your resume doesn't clearly say "why you," add something that does.
    • For a Cloud Engineer position, for example, there's a big difference between a random UNIX admin resume and a random UNIX admin resume that says "I'm excited about cloud and working towards an AWS certification on the side and really want to find a new job I can do cloud in." The first one is discarded without comment; the second one can move ahead if they're willing for people to learn – and given the "50%" thing, they generally are, if those people are willing to learn!
  • Interviewing for a job?
    • Everyone hates every different interviewing tactic (see Hiring is Broken, and we have the Ultimate Fix) – individual interviews, panel interviews, whiteboard design, online coding, takehome projects, poring over your resume, checking your github, looking at your social media/blog/whatever. But in the end hiring managers are just trying a handful of things to see if they can figure out if you know what you need to know to do the job.
    • They can't just take your word for it – their reputation is on the line when they bring people in, and you reflect on them. They want to understand if you'll be successful. Recruiting, hiring, and onboarding cost a lot of money, so they can only take so much of a risk – they're not real particular about what form the "proof you can do the job" takes, they're just fishing for something.
  • Performance issues?
    • If you’re having some outside problem, talk to your manager. People we work with are a cross-section of just plain people, and we expect every medical, psychological, marital, criminal, etc. issue to show up at some point.
    • Communicate. Again, it’s in their best interest you succeed. No one “wants to get rid of you.”
    • Unreasonable demands?  Communicate.  Lots of technical staff work long hours or do the wrong things because they don’t “manage up” well and communicate.  “Hey, with this change I have way too much to get done in 40-50 hours, this is what I think my priority list is, this is what will fall off, what can we do about it?” Bosses don’t know what you’re doing every minute of the day and can’t read your mind. I’ve personally worked with engineers burning themselves out while their same-team colleagues aren’t because they aren’t managing themselves – though they think it’s outside forces bullying them.
  • How to get ahead?
    • Understand the options.  What are the career paths there?  How do raises and promotions work? What are the cycles, pools, etc.? You can only work the system if you understand the system. Managers are happy to explain.
    • Communicate.  No one knows if you want to move into management or not, or feel like you're due a promotion or not, if you don't talk about it with them over time. You have to take charge of your own development – companies want to help you develop, but a manager has N reports and a lot of things to worry about; they aren't going to drive it for you.
    • Listen.  What is needed to get to that Lead Engineer position?  “Slingin’ code” and “being here for 3 years” are, I guarantee, not much of that list of requirements. Usually things like leadership, image, communication, and exposure have a role. Read “How To Win Friends And Influence People” and stuff if you need to. “I just grind code by myself all day” is fine but there’s a max level to which you will rise doing so.

Anyway, thanks to all the managers that participated and all the attendees that grilled them!  I hope it helped people understand better how to guide their own careers.

Filed under Management

Why Blaming “Human Error” Is Wrong

I’ve been writing a LinkedIn Learning course on postmortems lately and digging into all the fun research on the topic (Dekker, Hollnagel, and so on), and deepening my knowledge on the things I hope most of you know (root cause is a myth, blaming “human error” is wrong…).

I came across an example that really brought it home to me why the continuous blaming of human error is wrong – not “mean,” not “unenlightened,” but just plain logically ineffective.

One of the classic examples of design choices contributing to aviation accidents is the similarity and close placement of the landing gear and flap controls in airplane cockpits. Pilots lower the landing gear, then when they land they pull the flaps – but a small miss has them retract the landing gear instead, and they pancake in.

In fact, the US Air Force did a study at the close of World War II where they looked back at all kinds of "pilot error" crashes and identified a bunch of design problems that contributed, and the flap/landing gear confusion was #2 on the list, accounting for 16% of the cases studied:

"Analysis of Factors Contributing to 460 'Pilot-Error' Experiences in Operating Aircraft Controls," by P.M. Fitts and R.E. Jones, USAF Aero Medical Laboratory, Memorandum Report, July 1947.

Then, 20 years later, a major study covered the same topic:

Aircraft Design-Induced Pilot Error, National Transportation Safety Board, Department of Transportation, Washington, D.C., PB # 175 629, July 1967.

And then, another 13 years later, they suddenly realized the same thing about small craft.

Well, there were very simple fixes mandated for this problem early on – the FAA now requires, for example, that the landing gear control be shaped like a wheel and the flap control be shaped like a flap, so basic visual and tactile feedback is available to distinguish the two (especially at night, under stress, etc.).

There are other simple tricks that greatly reduce these accidents, like putting a catch on the landing gear retraction. Suddenly the "human error" goes away (leading to the reasonable question of how broadly we should define "human error"…).

So why did this "known" problem persist – damaging planes and killing people, I might add – for 33 years?  In fact, it still happens; here's a lovely writeup from 2015.

Basically, because all of the accidents where this happened continued to be declared "pilot error." When there's no significant further inquiry (which, to be fair, bothers people like airlines and the government and aircraft manufacturers and people with money), just saying it's pilot error gets the problem over with by a minor sacrifice (a pilot) instead of doing any harder work.

At my new job, we’re doing risk analysis and commercial insurance for unmanned autonomous vehicles (drones). It’s interesting to now be working in a space that’s actually closer than tech to all this safety research. And you can see the same things happening.

Sure, actual research shows that it's technical problems, not really human error, causing the problem most of the time – see:

News article: More drone crashes caused by technical glitches, not human error, study shows.

Study: Exploring Civil Drone Accidents and Incidents to Help Prevent Potential Air Disasters

The study cites technical problems 64% of the time, and frankly doesn't really distinguish human error from design-induced human error.

But of course you can still just blame human error.  I smelled something questionable in the recent news reports of how the crew was to blame for a UK Watchkeeper drone crash. Oh sure, the drone failed to land correctly so the crew intervened, so it's their fault.  I suspect if they had not intervened and it had continued to malfunction, it would also be "their fault." Here's the full Ministry of Defence writeup, which goes with the "loss of situational awareness" synonym for human error. And here's a later Register article with more details, like "The most appropriate [flight reference card] drill …stated: 'If UA [unmanned aircraft] not maintaining centreline axis: Engine cut……..Command'" and that they were under supervision of contractors from the drone manufacturer. Apparently following the actual designated drill recommendation of cutting the engine still makes it your fault.

Of course, “Five drones – almost 10% of a 54-strong fleet bought from French firm Thales – have been wrecked in mid Wales crashes.”  Apparently they have a lot of navigation problems.  So when one goes to land in a populated area and is clearly not navigating properly and goes off the runway and the crew cuts the engine…  The crash is ‘their fault.’  Riiiiiight.

It’s actually a fascinating question – when drone operations are more and more autonomous, how long can we just hold the “crew” responsible for anything that goes wrong? It’s cheaper and less embarrassing, so I’m betting a good while.

For us in tech this opens up an interesting discussion, beyond the obvious statement of “if you do an incident postmortem and simply write it off to developer or operator error you aren’t doing your job”.

We can't always fix all design issues immediately.  At what point, though, does not prioritizing a better-than-band-aid fix become negligence? "Thirty years," like with the flap/landing gear thing?

There’s a lot of legislation that tries to protect the powerful from lawsuits etc. – but as autonomy becomes more common, how long will that last for us? Technology firms have managed to get out of being held responsible for endemic security flaws (largely thanks to Microsoft) for decades.

You can see this beginning to crumble in aviation with things like the recent Boeing 737 Max crashes. "It's human error!" declares the Boeing CEO.  But people aren't that dumb, and the Internet helps information get out that was previously inaccessible.  So the next tack is to blame the software, but that's also buck-passing… the software was the band-aid fix on top of the design issues.

When will our lovely band of insulation finally be whittled away in tech? Soon, I'd bet… "Oh sure, let's crank out some self-driving cars, I'm sure it'll be fine, and when our crappy design ends up killing some soccer mom we can just give the standard 'what me worry' face we use when we mess up a software patch nowadays."

In the end, if you are motivated by actual safety, or uptime, or security, instead of the CYA game of who to blame, you have to push beyond the nearest human, or the outermost band-aid on your Rube Goldberg system, to improve the system. You're going to have to consider how the design and interfaces of your software, and the tooling you use to operate it, contribute. You're going to have to use facts and numbers, not soothing opinions, to say "you know what? That goes wrong more than our other systems – there's something wrong with it, we have to dig in and figure out what."

Filed under DevOps