In the War Room, “Let me take a look at the logs…”

Not long ago, I had the opportunity to help a large company upgrade a fairly critical piece of software that was running everything.  I can tell you that the job involved Tomcat, but that’s about it. As a consultant, you learn to keep the specifics of any engagement close.  What I can tell you is that this was the kind of client that has a “War Room” that looks exactly like what you would expect – three rows of Mission Control-style desks all facing a huge wall of monitors. The lights were dimmed, and (minus the uniforms) it reminded me of War Games.  I’ve seen a few of these rooms, and if I ran a company I’d probably demand that they set one up that matches this movie set almost exactly. (Except in my War Room, I’d make uniforms mandatory.)

3movies_wg_snap_room

At this particular launch event I made a joke about the movie “War Games”, but I don’t think the audience was old enough to remember. Thy must have thought to themselves, “Why is the old guy saying, ‘Would you like to play a game?’ in a strange robot voice?”

In the War Room: Everything is Measured

At a company large enough to have a War Room, almost everything is graphed in real-time: databases, apaches, caches, temperature…  There are multiple teams that can handle emergencies around-the-clock without skipping a beat.  Toward the end of the day, you’ll jump on the conference call with teams on other continents preparing to pass the baton as the work never ends.  During a production rollout someone will get up in front of the room and bellow out commands.  There is a multi-hour process, there are multiple data centers with a large number of servers. It’s serious stuff.

There are checkpoints to be reached and decisions to be made…If metrics rise or fall in response to a deployment, people start to react.  If a graph starts to dip or spike, people start to react. Everyone’s on edge because it’s the War Room.  There are conference rooms off to the side where solemn meetings can be convened, and there’s no shortage of serious faces. It’s the War Room. Everything is measured down to the millisecond, or at least that’s the idea, but there’s always one thing that isn’t measured yet in all war rooms and in all production deployments I’ve been a part of, and that’s the application, because most of the time during a deployment you have guests in the War Room – us developers.

Developers Meet Operations

Developers don’t think like admins or operators. When a developer is working on a new system they often don’t stop and think about what the best way to measure it would be.  If you are creating a new database or writing a new programming language, the last thing on your mind is the graph that some admin has to look at to tell if your system is working.  You are working on features, maybe you are also writing some tests (hopefully).  You are not thinking about when or how to page someone if something goes awry.

Developers tend to just grab what is closest to the code.  In most cases, this is a series of log statements.  So, we write these systems that spit out a lot of ad-hoc log statements.  We’re so prolific with log messages that many operators understand that developer-driven log files are usually something of a mess.  Some systems I’ve worked on in the past print out 10 MB of text in five minutes – that’s less than helpful and yhey don’t mean anything to admins because they are often full of errors and stacktraces.  Application logs contain a lot of miscategorized messages. (Debug messages as Error messages, messages that make no sense to anyone other than the developer who wrote them, etc…)

Ops: “Is it working?”  Dev: “Dude, I don’t know, let me look at the logs…”

Back to the War Room. You’ve been sitting there for a few hours.  People have brought a bajillion of dollars worth of hardware offline to push a new deployment, and finally some guy on the other side has pressed a button that runs the release.  Everyone sort of waits around for a few minutes for hard drives and networks to do whatever it is that they are doing, and then people go back to watching some charts.  They always dip or blip a bit during a release, and people always tend to jump up and blame whatever application for whatever problem starts happening at this point.

Or, maybe things are working, maybe they aren’t, but generally there’s this 5-30 minute window period where people are just waiting for some shoe to drop. If you are a developer responsible for a high risk change this is the moment when, highly stressed, you find yourself tailing a log file.  Even in the most monitored of environments, it all tends to come down to a couple of developers tailing log files on a production machine looking for signs of an error. “Yep, it works.  I don’t see any errors.” Or, “Nope, something blew up.  Not sure what happened.”

I’m usually one of the people doing post-release production analysis, and I can tell you what we’re doing when we just tail the logs.  We’re just watching screenfuls of text roll by and we’re looking for hints of an error.  It’s totally unscientific, and it doesn’t make any sense given the fact that production networks often have an insane number of machines in them now.  Maybe it made sense when there were 10 machines and we could tail a few of them, but today those numbers are much larger (by a few orders of magnitude).  Still, we cling to this idea that we’ll always need a developer to just check a few servers. Tail a few logs and make sure nothing went horribly wrong.

I wonder if this is just how the world works.  When they launch a new rocket, I’m sure they measure everything down to the nanosecond, but I’m also sure there’s some person sitting in that NASA Mission Control center running “tail -f” on the booster rocket.   Or when you watch that landing operation for the last Mars rover, half of those people are just SSH’ing into some Solaris or Linux box and tailing “rover.log”.

There’s Got to be a Better Way

Yes, there’s probably a better way.  You could write post-deployment integration tests, you could build in some instrumentation that tests the internals of a system, but it rarely works that way.  You could write Splunk jobs ahead of time to track particular error conditions.  You could run more frequent deployments (i.e multiple times a day) so that deployments don’t have to have all this ceremony. All of these are good ideas, but it’s more likely that developers will continue to just tail logs from now until we stop writing code.

More often than not, your production network is just a whole different beast.  While staging might have tens of machines, maybe your production network has tens of thousands of machines and hundreds of millions of users.   The error conditions are different, the failure mechanisms can’t be anticipated. There’s just no way to recreate it, and, as a developer, you are the guy on the stardeck sitting there like “Data” from ST:TNG watching an overwhelming amount of log data roll by. “Captain, the logs show no anomalies.”