Logging for Success

I’ve been working on a logging standards document for our team to use.  We are having a lot of desktop-software developers contributing software to the Web now, and it is making me take a step back and re-explain some things I consider basics.  I did some Googling for inspiration and I have to say, there’s not a lot of coherent bodies of information on what makes logging “good” especially from an operations point of view.  So I’m going to share some chunks of my thoughts here, and would love to hear feedback.

You get a lot of opinions around logging, including very negative ones that some developers believe.  “Never log!  Just attach a debugger!  It has a performance hit!  It will fill up disks!”  But to an operations person, logs are the lifeblood of figuring out what is going on with a complex system.  So without further ado, for your review…

Why Log?

Logging is often an afterthought in code.  But what you log and when and how you log it is critical to later support of the product.  You will find that good logging not only helps operations and support staff resolve issues quickly, but helps you root-cause problems when they are found in development (or when you are pulled in to figure out a production problem!).  “Attach a debugger” is often not possible if it’s a customer site or production server, and even in an internal development environment as systems grow larger and more complex, logs can help diagnose intermittent problems and issues with external dependencies very effectively.  Here are some logging best practices devised over years of supporting production applications.

Logging Frameworks

Consider using a logging framework to help you with implementing these.  Log4j is a full-featured and popular logging package that has been ported to .NET (Log4net) and about a billion other languages and it gives you a lot of this functionality for free.  If you use a framework, then logging correctly is quick and easy.  You don’t have to use a framework, but if you try to implement a nontrivial set of the below best practices, you’ll probably be sorry you didn’t.

The Log File

  • Give the log a meaningful name, ideally containing the name of the product and/or component that’s logging to it and its intent.  “nifarm_error.log” for example is obviously the error log for NIFarm.  “my.log” is… Who knows.
  • For the filename, to ensure compatibility cross-Windows and UNIX, use all lower case, no spaces, etc. in the log filenames.
  • Logs should use a .log suffix to distinguish themselves from everything else on the system (not .txt, .xml, etc.).  They can then be found easily and mapped to something appropriate for their often-large size.  (Note that the .log needs to come after other stuff, like the datetime stamp recommended below)
  • Logs targeted at a systems environment should never delete or overwrite themselves.  They should always append and never lose information.  Let operations worry about log file deletion and disk space – do tell them about the log files so they know to handle it though.  All systems-centric software, from Apache on up, logs append-only by default.
  • Logs targeted at a desktop environment should log by default, but use size-restricted logging so that the log size does not grow without bound.
  • Logs should roll so they don’t grow without bound.  A good best practice is to roll daily and add a .YYYYMMDD(.log) suffix to the log name so that a directory full of logs is easily navigable.  The DailyRollingFileAppender in the log4 packages does this automatically.
  • Logs should always have a configurable location.  Applications that write into their own program directory are a security risk.  Systems people prefer to make logs (and temp files and other stuff like that) write to a specific disk/disk location away from the installed product to the point where they could even set the program’s directory/disk to be read only.
  • Put all your logs together.  Don’t scatter them throughout an installation where they’re hard to find and manage (if you make their locations configurable per above, you get this for free, but the default locations shouldn’t be scattered).

Much more after the jump!

How Many Logs?

One or multiple logs for a given product are both OK.  What you want to do is have a clear purpose for each separate log.  Common log file divisions are separate usage logs vs error logs, or separate logs per major component.  If you combine usage with errors, each line needs to have something indicating which it is.  Keep in mind that if you separate component logs too much it’ll be hard to trace problem timelines across all those files.

It’s best to optimize log files for their use rather than worrying about them being theoretically separate.  Apache logs are a good example – there’s an access log and an error log.  Errors still appear in the access logs as accesses with an error code on them, but the detail information is in the error log.  This is because if you have a log of accesses, you expect it to contain ALL accesses.  But you want it to be concise and completely machine parsable, which an error log isn’t necessarily.

You will always have at least one “main” log – call it the execution log, or the error log, or whatnot, it’s the main log all your messages go into and is used for troubleshooting.  You want to flush this log to disk every time you write to it so crashes don’t destroy information.  I’ll refer to this as the “error” log below – even though it can have informational messages and whatnot its main point is to highlight errors.

Other logs for specific other purposes are fine – in fact if you have another major use case, like “A report the PMMs use to determine what kinds of compiles are coming into the app,” that should probably be a separate log file made appropriate for that use.

Log Format and Content

  • Log in plaintext so logs can be viewed in Notepad etc.
  • Log a single event to a single line.  Stack trace kinds of things can be multiline, but it’s best to keep it single.  Remember a multithreaded application could intersperse a bunch of stuff between those lines you think are “adjacent”.
  • Log with a delimiter between fields so logs can be easily parsed.  Pipes, commas, or spaces are OK but make sure you escape inserted strings that include that delimiter!  Do not use fixed widths.  If you have an empty field, put something in there (like Apache uses “-“)
  • The more formatted and machine readable a log file is, the more it can be automatically consumed and acted upon.
  • Every log line should start with a date/time stamp, ideally in YYYYMMDDHHMMSS order (e.g. 2010-05-07 19:51:57).  Use 24 hour time format (no AM/PM).  This makes logs easily searchable and sortable.  UTC preferred.
  • Every error log line should have a standard severity.
  • You should always log any diagnostic ID that indicates what process, thread, session, or other instance of any multi-instance resource generated the event.
  • You should log information about the user’s identity – both network info (e.g. IP address) and any authentication info (e.g. username).
  • You should be very specific about where the event came from, at the class/method level.
  • Do NOT log sensitive data.  Passwords should never go into a log (whether the authorization event was a pass or fail).  Notify operations of any personally identifiable information you log (email addresses, etc.) due to legal requirements surrounding such data (we have to scrub it before sending it outside NI for example).
  • Log a unique ID for each error that can also be presented to the user by your code, so that an end user reporting an error can be traced back to the actual error information in the system.  “Hey, it says ‘Error 420804’ when I try to upload a file.”  Remember error reports, if and when they come, can be days later and have gone through multiple people – “some guy had an upload problem at some point in the last couple days” isn’t real helpful when there’s 250 MB of log files from that period.
  • Store data in your exceptions to make them easier to read – the more context you have, the better you’ll be able to log it.  Say what you expected and what you got.  “Tried to run a job requiring 10 service credits for user foo@bar.com but they have only 5 remaining” is much better than “Job failed.”
  • You should use your application log and not standard out/system logs/tomcat logs/Windows logs.  Those should be for unforeseen untrapped exceptions and you should expect that anything ending up in those would generate a request to you to handle that exception in the code in the future.
  • Don’t log and rethrow an exception.  Log it only once, as high in the stack as possible.
  • Keep context in mind.  If you’re logging all successful logins, but not unsuccessful ones, how obvious is that to someone six months later looking at the log because there’s problems?

Now, none of this is to say that you should go out and execute other code and do lookups to populate your error.  But if the code throwing the error already has access to context information, user information, etc. – you should include it and not just keep it from the poor person reading the log.

Logging Levels

Don’t fret over “whether” to log something or not.  The answer, if you have to ask the question, is yes.  The question should simply be what level to log it at.

You should use the semi-standard set of log levels – [FATAL|ERROR|WARN|INFO|DEBUG|TRACE].  Here’s what they should mean.

In production, logging is usually set to WARN level. More detailed levels are only turned on if there’s a known problem or the developer wants to capture more detail for a limited time.

  • FATAL: Level FATAL is for things that cause the software to not start or crash.  “Can’t load that DLL” or “Out of memory, going down” qualify.  If your app tries to start or crashes, someone should be able to go to the log and find a FATAL line that says when and why.
  • ERROR: At level ERROR, the only things that should go into the log are problems that need to be actionable by someone.  There is no such thing as a “routine error.”  Those should be put at level WARN.  ERRORs are things that aren’t end user error but that indicate something technical’s not right with the system/application.  “Can’t connect to (database, ldap, file server)”, “trying to save this file but am getting an error saying I can’t,” etc.  ERROR log lines should be rare enough that they page operations staff or trigger automated routines.
  • WARN: At level WARN, put anything that is a temporary problem.  An example is if an app couldn’t connect to the authorization service the first time, but will try three more times before giving up.  That should generate WARNs until it’s done trying and fails, which would be an ERROR.  Also at WARN is any activity that is not totally routine.  A failed login due to a bad password is a WARN level.  Bad user inputs are WARN level.  Finding a virus in an uploaded file and disposing of it successfully is a WARN level.  WARN messages would appear on operations consoles and might be monitored for volume.
  • INFO: At level INFO and above, you should log every single call to an external dependency.  If you are talking over a network port, that means you. Talking to a Web service, a database, an authentication server, a file server, or anything like that should be logged and ideally a timing taken of how long it took.  One of the most common errors in a complex system is that one of the many elements it depends on gets slow or goes down.  If logging is at INFO level, you should be able to see major activity – every user coming in, every job submitted, etc. – a high level map of “what is going on” with the application.
  • DEBUG: At level DEBUG, you should log every time you go into or out of parts of your code.  If you’re using a logging framework, it’s as easy to use this as it is to put in print statements.  Don’t put in print statements, ever, use debug level logging.
  • TRACE: TRACE level is what you’d use if you wanted to log loads of input/output or similar – like a Web app that logs every line of HTML it outputs to the user for some reason.

Changing Log Levels In Production

You should be able to configure your application’s log severity level to log only events of severity X and above in an easily changed configuration file.  Let the operations staff know where this setting is.  As a bonus, let this level be changed without having to restart your application/app server.

If your application/log framework is set to read in its configuration repeatedly as it runs, then log levels can be changed on the fly.  This can incur a performance penalty, though some log frameworks do this multithreaded or once every time increment rather than every time it logs to minimize the impact.  If your app requires a restart to set its log level, then changing levels will probably have to wait until maintenance windows.

Examples

Here’s some examples of “happy” logs.

1.  This is an application called “portal” that calls a Web service called “ills” that in turn creates or authenticates users against a LDAP repository.  Notice the time and date stamps, the pipe delimitation, and location (module and line number) this was thrown from the code.  There are WARNs for minor issues (auth failed, user already existed , INFOs for major steps, DEBUG for details, and TRACE for input/output.  There are no ERRORs because nothing happened that is bad enough to provoke intervention.  This is from Log4j; there is no “standard” format as it’s configurable but this is what they tend to look like.
portal.20100819.log:
2010-08-19 10:20:17,214|INFO|MyfacesConfig.java|185|Starting up Tomahawk on the MyFaces-JSF-Implementation
2010-08-19 10:20:31,229|INFO|TomcatAnnotationLifecycleProvider.java|47|Creating instance of com.ni.apps.lvdotcom.portal.controller.MyMainController
2010-08-19 10:20:31,230|DEBUG|MyMainController.java|23|MyMainController commences
2010-08-19 10:20:31,230|DEBUG|MyMainController.java|29|Hidden flavor: uib
2010-08-19 10:20:31,231|DEBUG|MyMainController.java|33|Setting the flavor to hidden param: uib
2010-08-19 10:20:31,231|DEBUG|MyMainController.java|69|MyMainController completes
2010-08-19 10:20:31,232|INFO|TomcatAnnotationLifecycleProvider.java|47|Creating instance of com.ni.apps.lvdotcom.portal.controller.MyUserController
2010-08-19 10:20:31,233|INFO|TomcatAnnotationLifecycleProvider.java|47|Creating instance of com.ni.apps.lvdotcom.portal.controller.MyMessagingController
2010-08-19 10:20:31,233|DEBUG|MyMessagingController.java|15|MyMessagingController- Initializing.
2010-08-19 10:20:31,250|INFO|TomcatAnnotationLifecycleProvider.java|47|Creating instance of com.ni.apps.lvdotcom.portal.controller.MyLoginController
2010-08-19 10:20:31,252|DEBUG|MyLoginController.java|62|login- The app flavor is: uib
2010-08-19 10:20:31,693|WARN|ILLSClient.java|238|ILLS could not authenticate. Returned Status: 401
2010-08-19 10:20:31,694|WARN|MyLoginController.java|96|Invalid Login: karthik.gaekwad@ni.com
2010-08-19 10:20:31,695|TRACE|ILLSClient.java|241|<html><head><title>Apache Tomcat/6.0.26 - Error report</title><style><!--H1 {font-family:Tahoma,Arial,sans-serif;color:white;background-color:#525D76;font-size:22px;} H2 {font-family:Tahoma,Arial,sans-serif;color:white;background-color:#525D76;font-size:16px;} H3
{font-family:Tahoma,Arial,sans-serif;color:white;background-color:#525D76;font-size:14px;} BODY {font-family:Tahoma,Arial,sans-serif;color:black;background-color:white;} B {font-family:Tahoma,Arial,sans-serif;color:white;background-color:#525D76;} P {font-family:Tahoma,Arial,sans-serif;background:white;color:blac
k;font-size:12px;}A {color : black;}A.name {color : black;}HR {color : #525D76;}--></style> </head><body><h1>HTTP Status 401 - java.lang.Exception: Failed to authenticate: karthik.gaekwad@ni.com
at com.ni.apps.lvdotcom.user.web.core.AccountFacade.login(AccountFacade.java:141)
<stack trace>

2010-08-19 10:21:38,660|TRACE|MyAccountController.java|160|Typed FirstName: Karthik
2010-08-19 10:21:38,660|TRACE|MyAccountController.java|162|Typed LastName: Gaek
2010-08-19 10:21:38,706|WARN|MyAccountController.java|178|Account creation Error: could not create account for karthik.gaek@ni.com
com.ni.apps.lvdotcom.portal.exception.ApplicationException: error_user_in_ldap
at com.ni.apps.lvdotcom.portal.web.core.AccountManagementFacade.createAccountStep1(AccountManagementFacade.java:75)<stack trace>
...

2.  Apache logs, for those not familiar with them.  They conform to an actual W3C standard format.  The access log is space delimited, and wraps fields with spaces in quotes.  It puts a “-“ in for fields that would otherwise be empty.  The access log is a completely regular log, more complex errors are put in the error log so that the access log is concise and parsable.  They’re very configurable, this is our default configuration.  If you are using/writing a Web server you should be logging similar information.

access.20100819.log:
2010-08-19 10:48:42 10.209.121.205 - 10.240.95.149 130.164.78.45 users.niwsc.com GET /portal/app/page/account.xhtml  HTTP/1.1 200 16222 12020 12097 2944351088 "Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.9.2.8) Gecko/20100722 Firefox/3.6.8 ( .NET CLR 3.5.30729)" "-" "JSESSIONID=0173772CBB28982130AA198BD86539B9" "http://users.niwsc.com/portal/app/page/account.xhtml"
2010-08-19 10:49:38 10.209.121.205 - 10.240.95.149 130.164.78.45 users.niwsc.com GET /portal/app/page/account.xhtml ?p=confirm&u=kgaekwad%40ni.com&token=210E5823DB09016C HTTP/1.1 200 6891 11235 12097 2912881520 "Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.9.2.8) Gecko/20100722 Firefox/3.6.8 ( .NET CLR 3.5.30729)" "-" "JSESSIONID=0173772CBB28982130AA198BD86539B9" "http://users.niwsc.com/portal/app/page/account.xhtml?p=confirm&u=kgaekwad%40ni.com&token=210E5823DB09016C"2010-08-19 10:49:46 10.209.121.205 - 10.240.95.149 130.164.78.45 users.niwsc.com GET /portal/app/page/account.xhtml  HTTP/1.1 200 16222 14889 12097 2933861232 "Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.9.2.8) Gecko/20100722 Firefox/3.6.8 ( .NET CLR 3.5.30729)" "-" "JSESSIONID=0173772CBB28982130AA198BD86539B9" "http://users.niwsc.com/portal/app/page/account.xhtml"2010-08-19 10:51:36 10.209.121.205 - 10.240.95.149 130.164.78.45 users.niwsc.com GET /portal/app/page/account.xhtml ?p=confirm&u=kgaekwad%40ni.com&token=210E5823DB09016C HTTP/1.1 200 6891 90415 12095 3049249648 "Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.9.2.8) Gecko/20100722 Firefox/3.6.8 ( .NET CLR 3.5.30729)" "-" "JSESSIONID=0173772CBB28982130AA198BD86539B9" "http://users.niwsc.com/portal/app/page/account.xhtml?p=confirm&u=kgaekwad%40ni.com&token=210E5823DB09016C"

errors.20100819.log:
[Thu Aug 19 05:07:33 2010] [error] [client 83.242.145.34] client sent HTTP/1.1 request without hostname (see RFC2616 section 14.23): /w00tw00t.at.ISC.SANS.DFind:)

Conclusion

Remember that log files get very large on busy system, and that there are lots of different logs – app logs, system logs, etc. – that have to be rooted through, often by someone who doesn’t know how your code was written and what it does when.  Operations will be more successful if they can figure out what your code was trying to do and what exactly about that failed.  Support will be more successful if when a customer sends them log files, they can figure out the same thing.  And if these folks can help customers quickly, then everyone’s happy.  Next best is that they have to come to you (assuming you still work at that company/on that product) and ask you “what this means.”  Of course, it’s possible to log so badly that you don’t know what it means either.  If you can’t look at your own application’s log after a crash and understand what was going on at the time, no one else will be able to either.  Debuggers are fine for development but are not a production problem finding tool.

References

11 Comments

Filed under DevOps

11 responses to “Logging for Success

  1. Nice article, thanks for pulling it together.
    One point that you didn’t mention amd might consider for your standards is that some logging frameworks (like log4j) provide boolean checks to determine whether you are at a given log level.
    So, I can call isDebugEnabled() or isInfoEnabled() methods before executing a log message.
    This is important for Debug and Info level messages (also Trace if you use them) because of the volume of those messages in the normal execution path.
    Because programming languages generally process nested code from the inside-out (eg, arguments are processed before being passed into a function) a debug log message can generate some processing even if debugging is not enabled.
    For example, the following log4j statement will always incur a String concatenation, regardless of the log level because the arguments are processed before the call to the logging framework:

    Logger.debug(“Error #” + errorNum + ” occurred.”);

    To avoid this, use the frameworks checks so that you incur a lightweight boolean check instead of a more expensive string concatenation.

    if (Logger.isDebugEnabled){Logger.debug(“Error #” + errorNum + ” occurred.”);}

    This will significantly reduce the impact of the many trace, debug, and info messages in the code.

  2. Great article!

    I prefer LogBack to log4j (http://logback.qos.ch/) as LogBack extends log4j in useful ways, such as allowing rotation based on time and size (so you can say “rotate this log daily or when the log is bigger than 100MB”). I like this because dealing with large log files is very painful. Also, log4j compresses logs synchronously, so that it pauses application execution during compression which is a high-impact bug when your logs are big. LogBack fixes this bug as well.

    -Adam

  3. Hey guys, sorry you got caught in the overzealous WordPress spam filter, but I let ya out. Both good points. Log4j isn’t the only game in town. On the Windows side for C#/C++ types pantheois and a variety of log4j-like stuff that isn’t the “official from the log4j site” port (log4cxx) like log4cplus and log4net.

  4. Karthik Gaekwad

    Hey! A google search of my name found this blog. Fancy that!

    That being said, I had something to add.

    I noticed you had added 6 levels of logging from TRACE to FATAL. At the time of implementation, I’ve realized that most frameworks hadn’t started with 6 levels, but 5 instead (DEBUG through FATAL). TRACE is a little newer, and while most frameworks (aka log4j) support it now, they haven’t always supported it.

    That being said, I don’t use TRACE as much in my logs but generally start with DEBUG.

    • Yeah, I think it’s OK to combine the two if you don’t have a trace level – but if you log full streams of input and/or output, then you risk not being able to turn debug logs on in production for troubleshooting because of sheer volume.

  5. Hi Ernest, hot from the comments thread on masterzen’s blog, here’s the link to a recent article of mine (in SDPJ) covering a technique for tuning logging levels in Production which avoids the need to decide on logging levels at compile time:

    Tune logging levels in Production without recompiling code


    Cheers

    Matthew

  6. Pingback: Loggen, Visualisieren und Überwachen: Applikationen richtig monitoren - JAXenter

  7. Sagar Rout

    Too late to comment but I was just working on my elastic search operational logs design and someone suggested this article. So, I gave a try and BOOM BOOM I learned something which I should know before I start implement elasticsearch. Cool Isn’t ? Nice article. 🙂

  8. Thanks a lot, great article on logging!

Leave a comment

This site uses Akismet to reduce spam. Learn how your comment data is processed.