10 Steps to Get Your Crazy Logs Under Control

Two days ago I wrote a post about how “developers tailing the logs” is a common pattern. A couple of people responded to me directly asking if I had some sort of telepathic ability because they were stuck in a war room tailing logs at that very moment. It’s a common pattern. As developers we understand that tailing log files is much like tasseomancy (reading tea leaves) – sometimes the logs roll by so quickly we have to use a sixth sense to recognize the errors. We are “log whisperers.”

The problem here is that tailing logs is ridiculous for most of the systems we work with. Ok, if you have 1-2 servers, go knock yourself out – tail away. If you have 2,000 servers (or more), tailing a log to get any sort of idea about errors or behavior isn’t just inappropriate, it’s dangerously misleading. It’s the kind of practice that gives you (and everyone around you) false reassurance: one of your servers is working well, so you assume they all are.

So, how do we get a handle on this mess?

#1. Stop Tailing Logs @ Scale – If you have more than, say, 10 servers, you need to get out of the business of tailing logs. If you have a reasonable log volume – up to a handful of GBs a day – throw it into some system based on Apache Solr and find a way to make the system as immediate as possible. That’s the key: figure out a way to get logs indexed quickly (in a couple of seconds), because if you don’t? You’ll have to go back to tailing logs.

You can also use Splunk. Splunk works, but it’s also expensive, and they charge by daily indexing volume. If you don’t have the patience to figure out Solr, use Splunk, but you’re going to end up paying crazy money for something that you could get for free.

If you have more than a few GBs a day – on the order of tens of GBs, hundreds of GBs, or even a few TBs of data a day – you are in another league, and your definition of “logging” likely encompasses system activity. There are companies that do this, and they have campuses; that isn’t the kind of “logging” I’m talking about here.

#2. If possible, keep logs “readable” – If you operate a very large web site this may not be possible (see the first recommendation), but you should be aiming for production log messages that are reasonable. If you are running something on a smaller scale, or something that needs to be interactive, don’t write out a megabyte every 10 seconds. Don’t print out meaningless gibberish. When you are trying to come up with a log message, think about the audience, which is partially yourself but mostly the sysadmin who will need to maintain your software far into the future.

#3. Control log volume – This is related to the previous idea. Especially in production, keep volume under control. Remember that log activity is I/O activity, and if you don’t log carefully you are making your system wait around for disk (disk is crazy slow). Also, if you are operating @ scale, all that extra logging is just going to slow down whatever consolidated indexing is going on, making it more difficult to get immediate answers from something like Solr.

#4. Log messages should be terse – Aim for a single line when possible and try to stay away from messages that need to wrap. You shouldn’t print a paragraph of text to convey an idea that can be validated with a single integer identifier. It should fit neatly on a single line if at all possible. For example, your log messages don’t need to say:

"In an attempt to validate the presence or absence of a record for BLAH BLAH BLAH INC, it was noted that there was an empty value for corporation type.  Throwing an exception now."

Instead:

"Empty corp type: BLAH BLAH... (id: 5). Fix database."

#5. Don’t Log, Silence is Golden – I can’t remember who it was, but someone once commented on the difference between logging in Java and logging in Ruby (I think it was Charles Nutter talking about the difference between Rake and Maven). When you run a command-line tool in Ruby, it often doesn’t print anything unless something goes horribly wrong; a tool like Rake prints very little if things go as planned. When you run Maven? It prints a torrent of output that no one ever reads. This is a key point. Normal operation of a system shouldn’t warrant much logging at all. If it “just works”, then don’t bother me with all that logging.

If you are operating a web site @ scale, this is an important concept to think about. Your Apaches (or nginx) are already going to be logging something in an access log, so do you really need to have a log that looks like this?

INFO: Customer 13 Logged In
INFO: Attempting to Access Resource 34
INFO: Resource Access for resource 34 from customer 13 approved
INFO: Sending email confirming purchase to customer 13

I don’t think you need this. First, you should have some record of these interactions elsewhere (in the httpd logs), and second, it’s just unnecessary. In fact, I think those are all DEBUG messages. Unless something fails or needs attention, you should strive to have zero effect on your log files. If you depend on your log files to convey activity, stop, for two reasons: 1. It doesn’t scale, and 2. It is inefficient. Instead of relying on a log file to convey a sense of activity, tell operations to look at database activity over time.
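To make “silence is golden” concrete, here’s a minimal sketch using java.util.logging (the logger name and messages are made up): raise the production threshold to WARNING, and the routine INFO chatter never even reaches the handler.

```java
import java.util.logging.Level;
import java.util.logging.Logger;

public class QuietLogs {
    public static void main(String[] args) {
        Logger log = Logger.getLogger("checkout"); // hypothetical subsystem name
        // Production setting: anything below WARNING is dropped before any I/O happens.
        log.setLevel(Level.WARNING);

        log.info("Customer 13 Logged In");               // suppressed
        log.info("Sending email confirming purchase");   // suppressed
        log.warning("Email gateway timed out");          // the only line emitted
    }
}
```

The same idea applies to Log4J or Logback through their level configuration; the point is that the filter belongs in configuration, not in the sysadmin’s head.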

#6. Give Someone (Else) a Command – This is something no one does, but everyone should. Your logs should tell an administrator what to do next (and it rarely involves you). The new criterion for printing a log message in production is that either something has gone wrong or something needs serious attention. If you are printing a message about something that has gone wrong, don’t assume that the person reading it has any understanding of the internals of the system. Give them a direct command.

Instead of this message:

ERROR: Failed to retrieve bean property for the customer object null.

Write this:

ERROR: Customer object was null after login. Call DBA, ask about customer #342R.

You see the difference? The second log gives the admin a fighting chance (it also shifts blame to the database).  In this case, someone sent you a corrupt customer record, so point someone at the DBA.  You’d likely redirect them there anyway.  This way the sysadmin can skip the call to engineering and go directly to the source of the problem.

If you do this right, you’ll minimize the production support burden. Trust me, you want to minimize it – otherwise you won’t have much time for development because you’ll be fielding calls from production all the time.
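One way to enforce this habit is a tiny helper (entirely hypothetical, not a standard API) that refuses to build an ERROR line without an accompanying action for the admin:

```java
public class ActionableErrors {
    // Hypothetical helper: every production ERROR must carry what happened,
    // the context, and a concrete next step for whoever reads the log.
    static String error(String what, String context, String action) {
        return "ERROR: " + what + " (" + context + "). " + action;
    }

    public static void main(String[] args) {
        System.out.println(error(
            "Customer object was null after login",
            "customer #342R",
            "Call DBA, ask about customer #342R."));
    }
}
```

Because the action is a required parameter, a developer can’t forget it at 2 a.m. when the message is written – which is exactly when the admin will need it.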

#7. Provide Context – Unless you are logging a truly global condition like “Starting system…” or “Shutting down system…”, every log message should have some contextual data.  This is very often an identifier of a record, but what you should try to avoid is the log message that provides zero data or context.   The worst kind of message is something like this:

ERROR: End of File encountered in a stream.  Here's a stack trace and a line number...

This raises two questions: what is that stream from, and what exactly were you trying to do? A better log message might be:

ERROR: (API: Customer, Op: Read, Server: api-www23, ID: 32) EOF Problem. Call API team.

In this second example, we’re using something like Log4J Nested Diagnostic Context to put some details into the log that will help diagnose the problem in production.
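If you aren’t on Log4J, the idea behind a Nested Diagnostic Context is easy to sketch yourself. Below is a toy thread-local context stack (the class and method names are invented; real Log4J provides this through its NDC/MDC classes) that prefixes every message with whatever context the code has pushed:

```java
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.Iterator;

// Toy stand-in for Log4J's NDC: each thread keeps a stack of context
// strings that get prepended to every log line it writes.
public class MiniNdc {
    private static final ThreadLocal<Deque<String>> CTX =
        ThreadLocal.withInitial(ArrayDeque::new);

    static void push(String context) { CTX.get().push(context); }
    static void pop()                { CTX.get().pop(); }

    // Prefix a message with every context frame, outermost first.
    static String format(String message) {
        StringBuilder sb = new StringBuilder();
        for (Iterator<String> it = CTX.get().descendingIterator(); it.hasNext(); ) {
            sb.append('(').append(it.next()).append(") ");
        }
        return sb.append(message).toString();
    }

    public static void main(String[] args) {
        push("API: Customer, Server: api-www23");
        push("Op: Read, ID: 32");
        System.out.println("ERROR: " + MiniNdc.format("EOF Problem. Call API team."));
        pop();
        pop();
    }
}
```

The payoff is that the low-level code that hits the EOF doesn’t need to know which API call or customer it is serving – the context was pushed higher up the stack.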

#8. Don’t Write Multiple Log Statements at a Time – Some developers see logs as an opportunity to have a running commentary on the system, and they log everything that happens in a class. I dislike seeing this both in code and in a log. Here’s an example: you have a single class somewhere and you see code like this:

log.info( "Retrieving customer record" + id );
Customer cust = retrieveCustomer( id );
if( cust != null ) {
     log.info( "Customer isn't null. Yay!" );
     log.info( "Doing fancy thing with customer." );
     doFancyThing( cust );
     log.info( "Fancy thing has been done." );
} else {
     log.error( "Customer " + id + " is null, what am I going to do?" );
}

And, in the log you have:

INFO: Retrieving customer record 3
INFO: Customer isn't null. Yay!
INFO: Doing fancy thing with customer.
INFO: Fancy thing has been done.

Consolidate all of these log messages into a single message and log the result (or don’t log at all unless something goes wrong).  Remember, logging is often a write to disk, and disks are insanely slow compared to everything else in production.   Every time you write a log to disk think of the millions of CPU cycles you are throwing into the wind.

If a developer is writing a class that prints 10 log statements one after another, those statements should be combined into one. Admins don’t really want to see every step of your algorithm described; that’s not why they’re maintaining the system.
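The consolidated version of the snippet above might look something like this – a sketch, with a made-up summarize helper: collect the outcome as the work happens, stay silent on success, and emit one line on failure:

```java
public class ConsolidatedLog {
    // Instead of four INFO lines narrating each step, build one summary
    // and write it only when something actually went wrong.
    static String summarize(int customerId, boolean found, boolean fancyDone) {
        if (found && fancyDone) {
            return null; // success: stay silent, no disk I/O at all
        }
        return "ERROR: customer " + customerId
             + " (found=" + found + ", fancyThingDone=" + fancyDone
             + "). Check customer record.";
    }

    public static void main(String[] args) {
        // Happy path produces nothing; only the failure case hits the log.
        System.out.println(summarize(3, false, false));
    }
}
```

On the happy path this writes zero bytes, which is the whole point of the previous two steps.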

#9. Don’t Use Logs for Range Checking – There’s a certain kind of logging that creeps into a system that has more to do with bad input than anything else.  If you find yourself constantly hitting NullPointerExceptions in something like Java you may end up trying to print out variables to help you evaluate how things failed in production.  After a few years of this, you’ll end up with a production system that logs the value of every variable in the system on every request.

You’ll end up with this:

Customer logged in value: { customer: { id: 3, name: "Charles", ......}
Purchasing products: { product: { id: 32, manufacturer: { id: 53, name: "Toyota"....}
Running through a list of orders: [ { order: { id: 325... }, { order: { id:2003...} ]

…and so on.  In fact, you may end up serializing the entire contents of your database to your log files using this method.

Programmers usually do this because they are trying to diagnose problems caused by bad input. For example, if you read a customer record from the database, maybe you’ll just log the value of the customer record somewhere so you have it available when you are debugging some production failure. Have a process that takes a customer and a product? Well, why not print out both in the log just in case we need them. There are issues with customer records that have null values, so… Don’t do this. Create better database constraints instead.

This is the fastest road to unintelligible log files, and it also hints at another problem.  You have awful data.  If you are dealing with user data, check it on the way in.   If you are dealing with a database, take some time to add constraints to the table so that you don’t have to test to see if everything is null.  It’s an unachievable ideal, I know, but you should strive for it.
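A sketch of the alternative: validate at the boundary and fail loudly once, instead of logging every field “just in case” deeper in the system (the Customer record and messages here are invented for illustration):

```java
public class BoundaryCheck {
    // Hypothetical record; the point is to reject bad data on the way in,
    // not to dump every field into the log on every request.
    record Customer(Integer id, String corpType) {}

    static Customer validated(Customer c) {
        if (c.id() == null || c.corpType() == null || c.corpType().isBlank()) {
            throw new IllegalArgumentException(
                "Empty corp type (id: " + c.id() + "). Fix database.");
        }
        return c;
    }

    public static void main(String[] args) {
        validated(new Customer(5, "LLC"));   // good data: nothing logged
        try {
            validated(new Customer(5, ""));  // bad data: rejected at the boundary
        } catch (IllegalArgumentException e) {
            System.out.println("ERROR: " + e.getMessage());
        }
    }
}
```

One terse ERROR at the boundary replaces a system-wide habit of serializing objects into the log.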

#10. Evaluate Logs Often – The things described in this post really add up to a logging policy, and almost no one has one. This is why we have these production logging disasters, and this is why we create systems that are tough to understand in production. To prevent this, put the evaluation of logging on some sort of periodic schedule. Once every month, or once every release, you should have some metric that tells you whether log volume has outpaced customer growth.

You should conduct some investigations into how useful or how wasteful your current approach to logging is.   You should have some policy document that defines what each level means.  What is a DEBUG message in your system?  What should be an ERROR?  What does it mean to throw a FATAL? Prior to every release you should do a quick “sanity check” to make sure that you haven’t added some ridiculous log statement that is going to make maintaining the system awful.
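The metric can be dead simple. Here’s a hypothetical monthly check – bytes of logs per active customer, with made-up numbers – that flags a release where logging grew faster than the business:

```java
public class LogGrowthCheck {
    // Hypothetical sanity metric: bytes of logs per active customer.
    // If this ratio jumps between releases, logging is outpacing growth.
    static double bytesPerCustomer(long logBytes, long activeCustomers) {
        return (double) logBytes / activeCustomers;
    }

    public static void main(String[] args) {
        double lastMonth = bytesPerCustomer(40L * 1024 * 1024 * 1024, 100_000);
        double thisMonth = bytesPerCustomer(90L * 1024 * 1024 * 1024, 110_000);
        if (thisMonth > lastMonth * 1.25) {
            System.out.println("WARN: log volume per customer jumped "
                + String.format("%.0f%%", (thisMonth / lastMonth - 1) * 100)
                + " – review new log statements before release.");
        }
    }
}
```

Feed it real numbers from your log aggregator and your billing system, and the pre-release “sanity check” becomes one line in a build script instead of a judgment call.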

But… most people don’t do these things, which is why production logs end up being a disaster.