Tag Archives: logging

Cloud Austin Logging Tool Roundup Presentations

James, Karthik, and I run Cloud Austin, a technical user group for cloud computing types in Austin.  Last night we broke new ground by videoing the presentations using Hangouts On Air, and the result is a cool bunch of 15-minute presentations on Splunk, Sumo Logic, Logstash, Graylog2 (including one from Lennart Koopmann, the maintainer), and the first public presentation of Project Meniscus, Rackspace’s new logging system.

You can go get the slides and watch the 2+ hour video on the Cloud Austin blog.


Filed under Cloud, DevOps

Why Your Monitoring Is Lying To You

In my Design for Failure article, I mentioned how many of the common techniques we use to allegedly detect failure really don’t.  This time, we’ll discuss your monitoring and why it is lying to you.

Well, you have some monitoring, don’t you? Couldn’t it tell you if an application is down? Obviously not if you are just doing old SNMP/box-level monitoring, but you’re all DevOps and you know you have to monitor the applications, because that’s what counts. But even then, there are common antipatterns to be aware of.

Synthetic Monitoring

Dirty secret time: most application monitoring is “synthetic,” which means it hits a specific URL or set of URLs once in a while, often 5-10 minutes apart. Also, since there are a lot of transient failures out there on the Internet, most ops groups have their monitors set so that they have to see 2-5 consecutive failures before alerting – because ops teams don’t like being woken up at 3 AM because an application hiccuped once (or the Internet hiccuped on the way to the application). If the problem happens on only 1 of every 20 hits, and you have to see three errors in a row to alert, then I’ll leave it to your primary school math skills to determine how likely it is you’ll catch the problem.
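
If you’d rather let the computer do the primary school math, here’s a quick back-of-the-envelope sketch; the 1-in-20 failure rate, three-strikes alert rule, and 10-minute check interval are just the numbers from the paragraph above.

    // Rough odds that a "3 consecutive failures" rule ever fires on an intermittent problem.
    public class AlertOdds {
        public static void main(String[] args) {
            double failureRate = 1.0 / 20.0;       // problem shows up on 1 of every 20 hits
            int consecutiveNeeded = 3;             // monitor alerts only after 3 failed checks in a row
            double checksPerDay = 24 * 60 / 10.0;  // one synthetic check every 10 minutes

            double pAlert = Math.pow(failureRate, consecutiveNeeded);   // ~0.000125
            System.out.printf("Chance any given check kicks off an alert: %.6f%n", pAlert);
            System.out.printf("Roughly %.0f days before the alert fires%n",
                    1.0 / (pAlert * checksPerDay));                     // about eight weeks
        }
    }

In other words, a problem that hits one in twenty requests could sit there for weeks before this kind of monitor says a word.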

You can improve on this a little bit, but in the end synthetic monitoring is mainly useful for coarse uptime checking and performance trending.

Metric Monitoring

OK, so synthetic monitoring is only good for rough up/down stuff, but what about my metric monitoring? Maybe I have a spiffier tool that is continuously pulling metrics from Web servers or apps that should give me more of a continuous look.  Hits per second over the last five minutes; current database space, etc.

Well, I have noticed that metrics monitors, with startling regularity, don’t really tell you if something is up or down, especially historically. If you pull current database space and the database is down, you’d think there would be a big nasty gap in your chart, but many tools don’t do that – either they report the last value seen, or if it’s a timing report they happily report the timing of the errors. Unless you go to the trouble to say “if the thing is down, set a value of 0 or +infinity or something,” you can sometimes have a failure, then go back and look at your historical graphs and see no sign there was ever anything wrong.
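
As a minimal sketch of what I mean, assuming a hypothetical push-style metric check (the metric name and sentinel value are made up, and the output line happens to use Graphite’s plaintext “metric value timestamp” format): when the check itself fails, emit something that will visibly wreck the graph instead of letting the tool quietly carry the last good value forward.

    // Hypothetical database-space check: emit an explicit sentinel when the check fails
    // so the failure shows up in the historical graph instead of disappearing.
    public class DatabaseSpaceCheck {
        public static void main(String[] args) {
            double freeSpaceMb;
            try {
                freeSpaceMb = queryFreeSpaceMb();   // the real query against the database
            } catch (Exception e) {
                freeSpaceMb = -1;                   // sentinel: "check failed", an obvious dip in the chart
                System.err.println("database space check failed: " + e);
            }
            // Graphite-style plaintext line: metric value timestamp
            System.out.println("db.free_space_mb " + freeSpaceMb + " "
                    + (System.currentTimeMillis() / 1000));
        }

        private static double queryFreeSpaceMb() throws Exception {
            // placeholder for the real JDBC call; throws if the database is unreachable
            throw new UnsupportedOperationException("wire up the real query here");
        }
    }

Either a sentinel or a genuine gap works; the point is that “the collector could not get a value” and “the value was fine” must not look identical in the history.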

Log Monitoring

Well surely your app developers are logging if there’s a failure, right? Unfortunately logging is a bit of an art, and the simple statement “You should log the overall success or failure of each hit to your app, and you should log failures on any external dependency” can be… reinterpreted in many ways. Developers sometimes don’t log all the right things, or even decide to suppress certain logs.

You should always log everything.  Log it at a lower log level, like INFO, if it’s routine, but then at least it can be reviewed if needed and can be turned into a metric for trending via tools like Splunk. My rules are simple (a quick sketch follows the list):

  • Log the start and end of each hit – are you telling the client success or failure? Don’t rely on the Web server log.
  • Log every single hit to an external dependency at INFO
  • Log every transient failure at WARN
  • Log every error at ERROR
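
Here’s a minimal sketch of those four rules in Java with plain Log4j; the handler, the request IDs, and the payment-service call are all made up for illustration.

    import org.apache.log4j.Logger;

    public class OrderHandler {
        private static final Logger log = Logger.getLogger(OrderHandler.class);

        // Hypothetical marker for failures that are worth retrying.
        static class TransientException extends RuntimeException {
            TransientException(String msg, Throwable cause) { super(msg, cause); }
        }

        public void handle(String requestId) {
            log.info("request " + requestId + " start");                        // start of each hit
            try {
                log.info("request " + requestId + " calling payment service");  // every external dependency at INFO
                callPaymentService(requestId);
                log.info("request " + requestId + " success");                  // end of the hit: what did we tell the client?
            } catch (TransientException e) {
                log.warn("request " + requestId + " transient payment failure", e);  // transient failures at WARN
            } catch (Exception e) {
                log.error("request " + requestId + " failed", e);                    // everything else at ERROR
            }
        }

        private void callPaymentService(String requestId) {
            // placeholder for the real external call
        }
    }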

Real User Monitoring

Ah, this is more like it.  The alleged Holy Grail of monitoring is real user monitoring, where you passively look at the transactions coming in and out and log them.  Well, on the one hand, you don’t have to rely on the developers to log; you can log despite them.  But you don’t get as much insight as you’d think. If the output from the app isn’t detectable as an error, then the monitoring doesn’t help.  A surprising amount of the time, failures are not thrown as a 500 or other expected error code. And checking for content within a payload is often fragile.

Also, RUM tools tend to be network-sniffer based, which doesn’t work well in the cloud or in many network topologies.  And you get so much data that it can be hard to find the real problems without spending a lot of time on it.

No, Really – One Real World Example

We had a problem just this week that managed to slip through all our layers of monitoring – luckily, our keen eyes caught it in preproduction. We had been planning a big app release and had been setting up monitoring for it. It seemed like everything was going fine. But then the back-end databases (SQL Azure in this case) had a pretty long string of failures for about 10 minutes, which brought our attention to the issue. As I looked into it, I realized that it was very likely we would have seen smaller spates of SQL Azure connection issues, and thus application outages, before – so why hadn’t we?  I investigated.

We don’t have any good cloud-compliant real user monitoring in place yet.  And the app was throwing a 200 HTTP code on an error (the error page displayed said 401, but the actual HTTP code was 200), so many of our synthetic monitors were fooled. Plus, the problem was usually occasional enough that hitting the app once every 10 minutes from Cloudkick didn’t detect it. We fixed that bad status code and looked at our database monitors. “I know we have monitors directly on the databases – why aren’t those firing?”
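
As a minimal sketch (the URL and the marker string are invented), a synthetic probe like this checks both the status code and a marker the healthy page is known to contain, so an error page served with a 200 still fails the check.

    import java.io.BufferedReader;
    import java.io.InputStreamReader;
    import java.net.HttpURLConnection;
    import java.net.URL;

    // Synthetic check that is not fooled by an error page served with a 200:
    // it verifies the status code AND looks for a marker only the healthy page has.
    public class HealthCheck {
        public static void main(String[] args) throws Exception {
            HttpURLConnection conn = (HttpURLConnection)
                    new URL("https://myapp.example.com/health").openConnection();
            conn.setConnectTimeout(5000);
            conn.setReadTimeout(5000);

            int status = conn.getResponseCode();
            StringBuilder body = new StringBuilder();
            try (BufferedReader in = new BufferedReader(new InputStreamReader(conn.getInputStream()))) {
                String line;
                while ((line = in.readLine()) != null) {
                    body.append(line);
                }
            }

            boolean healthy = (status == 200) && body.indexOf("status: ok") >= 0;
            System.out.println(healthy ? "OK" : "FAIL (status=" + status + ")");
            System.exit(healthy ? 0 : 2);   // nonzero exit so whatever runs the check treats it as a failure
        }
    }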

Our database metric monitors through Cloudkick, I was surprised to see, had lovely normal-looking graphs after the outage.  I provoked another outage in test to see, and sure enough, though the monitors “went red,” for some reason they were still providing what seemed to Cloudkick like legitimate data points, and once the monitors “went green,” nothing about any of the metric graphs indicated anything unusual! In other words, the historical graphs had legitimate-looking data and did not reveal the outage. That’s a big problem. So we worked on those monitors.

I still wanted to know if this had been happening before.  “We use Splunk to aggregate our logs; I’ll go look there!” Well, there were no error lines in the log that would indicate a back-end database problem. Upon inquiring, I heard that since SQL Azure connection issues are a known and semi-frequent problem, logging of them is suppressed, since we have retry logic in place.  I recommended that we log all failures, with failures that are going to be retried logged at a lower severity level like WARN, and an ERROR logged on any failure that survives the whole spread of retries. I declared this a showstopper bug that had to be fixed before release – not everyone was happy with that, but sometimes DevOps requires tough love.
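
Here’s a minimal sketch of that recommendation, again with Log4j; the three-attempt limit, the Callable wrapper, and the exception type are illustrative, not our actual retry code.

    import java.util.concurrent.Callable;

    import org.apache.log4j.Logger;

    // Retried failures still get logged (at WARN); only exhausting the retries is an ERROR.
    public class RetryingQuery {
        private static final Logger log = Logger.getLogger(RetryingQuery.class);
        private static final int MAX_ATTEMPTS = 3;

        public <T> T withRetry(Callable<T> query) throws Exception {
            for (int attempt = 1; attempt <= MAX_ATTEMPTS; attempt++) {
                try {
                    return query.call();
                } catch (java.sql.SQLException e) {
                    if (attempt < MAX_ATTEMPTS) {
                        log.warn("transient database failure, attempt " + attempt
                                + " of " + MAX_ATTEMPTS + ", will retry", e);
                    } else {
                        log.error("database failure after " + MAX_ATTEMPTS + " attempts, giving up", e);
                        throw e;
                    }
                }
            }
            throw new IllegalStateException("unreachable");
        }
    }

With something like this in place, a burst of SQL Azure hiccups shows up in Splunk as a run of WARNs you can trend on, instead of as silence.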

I was disturbed that we could have periods of outage going unnoticed despite our investment in synthetic monitoring, pulling metrics, and searching logs. When I looked back at all our metrics over periods of known outage and they all looked good, I admit I became somewhat irate. We fixed it and I’m happy with our detection now, but I hope this is instructive in showing you how bad assumptions, and not fully understanding the pros and cons of each instrumentation approach, can leave “stacked holes” that profoundly compromise your view of your service!


Filed under DevOps

Logging for Success

I’ve been working on a logging standards document for our team to use.  We have a lot of desktop-software developers contributing software to the Web now, and it is making me take a step back and re-explain some things I consider basics.  I did some Googling for inspiration and I have to say, there aren’t a lot of coherent bodies of information on what makes logging “good,” especially from an operations point of view.  So I’m going to share some chunks of my thoughts here, and would love to hear feedback.

You get a lot of opinions around logging, including very negative ones that some developers believe.  “Never log!  Just attach a debugger!  It has a performance hit!  It will fill up disks!”  But to an operations person, logs are the lifeblood of figuring out what is going on with a complex system.  So without further ado, for your review…

Why Log?

Logging is often an afterthought in code.  But what you log and when and how you log it is critical to later support of the product.  You will find that good logging not only helps operations and support staff resolve issues quickly, but helps you root-cause problems when they are found in development (or when you are pulled in to figure out a production problem!).  “Attach a debugger” is often not possible if it’s a customer site or production server, and even in an internal development environment as systems grow larger and more complex, logs can help diagnose intermittent problems and issues with external dependencies very effectively.  Here are some logging best practices devised over years of supporting production applications.

Logging Frameworks

Consider using a logging framework to help you implement these.  Log4j is a full-featured and popular logging package that has been ported to .NET (Log4net) and about a billion other languages, and it gives you a lot of this functionality for free.  If you use a framework, then logging correctly is quick and easy.  You don’t have to use a framework, but if you try to implement a nontrivial set of the best practices below, you’ll probably be sorry you didn’t.

The Log File

  • Give the log a meaningful name, ideally containing the name of the product and/or component that’s logging to it and its intent.  “nifarm_error.log” for example is obviously the error log for NIFarm.  “my.log” is… Who knows.
  • To ensure compatibility across Windows and UNIX, use all lower case, no spaces, etc. in log filenames.
  • Logs should use a .log suffix to distinguish themselves from everything else on the system (not .txt, .xml, etc.).  They can then be found easily and mapped to something appropriate for their often-large size.  (Note that the .log needs to come after other stuff, like the datetime stamp recommended below)
  • Logs targeted at a systems environment should never delete or overwrite themselves.  They should always append and never lose information.  Let operations worry about log file deletion and disk space – do tell them about the log files so they know to handle it though.  All systems-centric software, from Apache on up, logs append-only by default.
  • Logs targeted at a desktop environment should log by default, but use size-restricted logging so that the log size does not grow without bound.
  • Logs should roll so they don’t grow without bound.  A good practice is to roll daily and add a .YYYYMMDD(.log) suffix to the log name so that a directory full of logs is easily navigable.  The DailyRollingFileAppender in the log4j/log4net packages does this automatically (see the sketch after this list).
  • Logs should always have a configurable location.  Applications that write into their own program directory are a security risk.  Systems people prefer to make logs (and temp files and other stuff like that) write to a specific disk/disk location away from the installed product to the point where they could even set the program’s directory/disk to be read only.
  • Put all your logs together.  Don’t scatter them throughout an installation where they’re hard to find and manage (if you make their locations configurable per above, you get this for free, but the default locations shouldn’t be scattered).
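
A minimal sketch of a few of these points using Log4j 1.x: a configurable log location (the myapp.log.dir system property is made up), an append-only file with a meaningful lower-case name and a .log suffix, and daily rolling via DailyRollingFileAppender (with the caveat that this particular appender puts the date after the .log rather than before it). In real life you would put this in log4j.properties or log4j.xml rather than in code, but the moving parts are the same.

    import java.io.IOException;

    import org.apache.log4j.DailyRollingFileAppender;
    import org.apache.log4j.Level;
    import org.apache.log4j.Logger;
    import org.apache.log4j.PatternLayout;

    public class LogSetup {
        public static void configure() throws IOException {
            // Configurable location: never write into the program's own directory.
            String dir = System.getProperty("myapp.log.dir", "/var/log/myapp");

            DailyRollingFileAppender appender = new DailyRollingFileAppender(
                    new PatternLayout("%d{ISO8601} %-5p [%t] %c - %m%n"),
                    dir + "/myapp.log",      // meaningful name, lower case, .log suffix
                    "'.'yyyyMMdd");          // rolls daily; rolled files get a date suffix
            appender.setAppend(true);        // append only, never overwrite

            Logger.getRootLogger().addAppender(appender);
            Logger.getRootLogger().setLevel(Level.INFO);
        }
    }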

Much more after the jump!


Filed under DevOps