In the War Room, “Let me take a look at the logs…”

Not long ago, I had the opportunity to help a large company upgrade a fairly critical piece of software that was running everything.  I can tell you that the job involved Tomcat, but that’s about it. As a consultant, you learn to keep the specifics of any engagement close.  What I can tell you is that this was the kind of client that has a “War Room” that looks exactly like what you would expect – three rows of Mission Control-style desks all facing a huge wall of monitors. The lights were dimmed, and (minus the uniforms) it reminded me of War Games.  I’ve seen a few of these rooms, and if I ran a company I’d probably demand that they set one up that matches this movie set almost exactly. (Except in my War Room, I’d make uniforms mandatory.)

[Image: the War Room set from the movie War Games]

At this particular launch event I made a joke about the movie “War Games”, but I don’t think the audience was old enough to remember. They must have thought to themselves, “Why is the old guy saying, ‘Would you like to play a game?’ in a strange robot voice?”

In the War Room: Everything is Measured

At a company large enough to have a War Room, almost everything is graphed in real-time: databases, apaches, caches, temperature…  There are multiple teams that can handle emergencies around the clock without skipping a beat.  Toward the end of the day, you’ll jump on a conference call with teams on other continents preparing to pass the baton, because the work never ends.  During a production rollout, someone will get up in front of the room and bellow out commands.  It’s a multi-hour process spanning multiple data centers and a large number of servers. It’s serious stuff.

There are checkpoints to be reached and decisions to be made… If a metric rises or falls in response to a deployment, if a graph starts to dip or spike, people react.  Everyone’s on edge because it’s the War Room.  There are conference rooms off to the side where solemn meetings can be convened, and there’s no shortage of serious faces. It’s the War Room. Everything is measured down to the millisecond, or at least that’s the idea. But in every war room and every production deployment I’ve been a part of, there’s always one thing that isn’t measured yet, and that’s the application. That’s because, most of the time during a deployment, you have guests in the War Room – us developers.

Developers Meet Operations

Developers don’t think like admins or operators. When developers are working on a new system, they rarely stop to think about the best way to measure it.  If you are creating a new database or writing a new programming language, the last thing on your mind is the graph some admin has to look at to tell whether your system is working.  You are working on features, and maybe (hopefully) writing some tests.  You are not thinking about when or how to page someone if something goes awry.

Developers tend to just grab what is closest to the code.  In most cases, that’s a series of log statements.  So we write these systems that spit out a lot of ad-hoc log statements.  We’re so prolific with log messages that most operators have learned that developer-driven log files are usually something of a mess.  Some systems I’ve worked on print out 10 MB of text in five minutes – that’s less than helpful, and the messages mean nothing to admins because they are often full of errors and stacktraces.  Application logs contain a lot of miscategorized messages (debug messages logged as errors, messages that make no sense to anyone other than the developer who wrote them, etc.).
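
To make that concrete, here’s roughly the difference I’m talking about. This is a made-up Python sketch (the order and payment bits are invented for illustration; it isn’t from any engagement I’ve worked on):

```python
import logging

log = logging.getLogger("checkout")

def charge(order_id, amount, gateway):
    """Charge an order. `gateway` is any object with a charge() method, a stand-in for a real payment client."""
    # What I see too often: everything at ERROR, meaningless to whoever is tailing the file at 2 a.m.
    #   log.error("got here!!! amount=%r", amount)

    # What an operator can actually act on: levels that mean something.
    log.debug("charging order %s for %s", order_id, amount)  # detail; hidden by default
    try:
        gateway.charge(order_id, amount)
    except TimeoutError:
        # transient problem: worth a warning, not a 3 a.m. page
        log.warning("payment gateway timed out for order %s; retrying", order_id)
        raise
    except Exception:
        # genuinely broken: this is the one that should stand out on the wall of monitors
        log.error("payment failed for order %s; needs attention", order_id)
        raise
    log.info("order %s charged successfully", order_id)
```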

Ops: “Is it working?”  Dev: “Dude, I don’t know, let me look at the logs…”

Back to the War Room. You’ve been sitting there for a few hours.  People have taken a bajillion dollars’ worth of hardware offline to push a new deployment, and finally some guy on the other side of the room has pressed the button that runs the release.  Everyone sort of waits around for a few minutes for hard drives and networks to do whatever it is that they are doing, and then people go back to watching the charts.  The charts always dip or blip a bit during a release, and people tend to jump up and blame the application for whatever problem happens to start at that point.

Or maybe things are working, maybe they aren’t, but generally there’s a 5-30 minute window where people are just waiting for the other shoe to drop. If you are the developer responsible for a high-risk change, this is the moment when, highly stressed, you find yourself tailing a log file.  Even in the most monitored of environments, it all tends to come down to a couple of developers tailing log files on a production machine, looking for signs of an error. “Yep, it works.  I don’t see any errors.” Or, “Nope, something blew up.  Not sure what happened.”

I’m usually one of the people doing post-release production analysis, and I can tell you what we’re doing when we just tail the logs.  We’re just watching screenfuls of text roll by and we’re looking for hints of an error.  It’s totally unscientific, and it doesn’t make any sense given the fact that production networks often have an insane number of machines in them now.  Maybe it made sense when there were 10 machines and we could tail a few of them, but today those numbers are much larger (by a few orders of magnitude).  Still, we cling to this idea that we’ll always need a developer to just check a few servers. Tail a few logs and make sure nothing went horribly wrong.

I wonder if this is just how the world works.  When they launch a new rocket, I’m sure they measure everything down to the nanosecond, but I’m also sure there’s some person sitting in that NASA Mission Control center running “tail -f” on the booster rocket.   Or when you watch that landing operation for the last Mars rover, half of those people are just SSH’ing into some Solaris or Linux box and tailing “rover.log”.

There’s Got to be a Better Way

Yes, there’s probably a better way.  You could write post-deployment integration tests. You could build in some instrumentation that exercises the internals of the system.  You could write Splunk jobs ahead of time to track particular error conditions.  You could run more frequent deployments (i.e., multiple times a day) so that a release doesn’t have to carry all this ceremony. All of these are good ideas, but it rarely works that way; it’s more likely that developers will continue to just tail logs from now until we stop writing code.
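
For what it’s worth, the first of those ideas doesn’t have to be elaborate. Here’s a hypothetical sketch of a post-deployment smoke test in Python, with invented hosts and endpoints (nothing to do with the client above). Run right after the release button is pressed, even something this small beats squinting at a scrolling log:

```python
#!/usr/bin/env python3
"""Tiny post-deployment smoke test: hit a few endpoints and fail loudly.

The hosts and paths below are invented for illustration.
"""
import sys
import urllib.request

CHECKS = [
    ("app login page",  "http://app01.example.com:8080/login"),
    ("search endpoint", "http://app01.example.com:8080/search?q=ping"),
    ("health check",    "http://app02.example.com:8080/health"),
]

def run_checks():
    failures = 0
    for name, url in CHECKS:
        try:
            with urllib.request.urlopen(url, timeout=5) as resp:
                status = resp.status
        except Exception as exc:
            print(f"FAIL  {name}: {exc}")
            failures += 1
            continue
        if status == 200:
            print(f"OK    {name}")
        else:
            print(f"FAIL  {name}: HTTP {status}")
            failures += 1
    return failures

if __name__ == "__main__":
    sys.exit(1 if run_checks() else 0)
```

Wire something like that into the deployment script and the “Is it working?” question at least gets a yes or no answer before anyone reaches for tail -f.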

More often than not, your production network is just a whole different beast.  While staging might have tens of machines, your production network might have tens of thousands of machines and hundreds of millions of users.   The error conditions are different, and the failure modes can’t be anticipated. There’s just no way to recreate it, and, as a developer, you are the guy on the bridge sitting there like Data from ST:TNG, watching an overwhelming amount of log data roll by. “Captain, the logs show no anomalies.”

The Next Generation of Health is Genomics-driven Prevention

Steven Pinker has cardiac hypertrophy (ventricular hypertrophy, also known as athlete’s heart). On its own this is a wholly uninteresting fact to most people; it isn’t something you’d bring up in normal conversation: “Oh, hey, did you know that the esteemed cognitive scientist Steven Pinker has athlete’s heart?”

What makes this revelation remarkable is that he learned it via the Personal Genome Project, and that you can watch him find out about the diagnosis (at 0:40). The initial PGP-10 took a risk when they decided to publish “anonymized” health data and gene sequences as part of this research. They must have all known that everyone in that group of 10 was guaranteed to lose any notion of health privacy, but they took this bold step to help usher in an era of scientific and medical research that will revolutionize the way we all approach health.

In the same way that individuals like Pinker and Brin have learned of genetic predispositions to serious illness, many millions of Americans will learn of similar markers over the next decade. The cost of sequencing a human genome is dropping by a factor of 10 per year, from a current cost of anywhere between $25k and $100k depending on the technology you use. At that rate, two more years means a 100x drop, putting a full genome somewhere between $250 and $1,000. That’s when the cost of sequencing your entire genome will be on par with the cost of a simple medical test, and that’s when the discussion about health care is going to get interesting.
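
(The arithmetic is trivial; if you want to play with the extrapolation yourself, here’s a throwaway Python sketch based only on the rough figures above and the 10x/year assumption.)

```python
# Naive extrapolation of genome-sequencing cost, assuming the 10x/year drop holds.
# Starting figures are the rough $25k-$100k range mentioned above.
low, high = 25_000, 100_000

for year in range(4):
    print(f"year {year}: ${low / 10**year:,.0f} to ${high / 10**year:,.0f}")

# year 0: $25,000 to $100,000
# year 1: $2,500 to $10,000
# year 2: $250 to $1,000   (roughly the price of a routine medical test)
# year 3: $25 to $100
```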

When the cost of sequencing a human’s DNA is on par with the cost of a simple blood test, it is going to be difficult to justify not performing this routine “test”. We’re going to have to have real discussions about the meaning of privacy and ownership (who owns your sequence?), and we’re going to have to answer some fairly unexpected ethical questions. The most important of these questions over the next two years:

  • What would you do with millions of individual sequences?
  • Once you realize what the benefits of aggregate sequencing are, does an individual’s “ownership” of their DNA sequence data outweigh the benefit to society as a whole?
  • How will knowledge of genetic predisposition affect the individual in a society that is currently focused on the cost of health care?

It’ll be interesting, but the one thing to keep in mind is something that Wally Gilbert told me at last year’s Scifoo. After having watched some of the discussions between Dr. Fire and Dr. Church, I wanted to get Gilbert’s take on the emerging ethical questions of sequencing. His first reaction was to mention the controversy that accompanied his initial success at sequencing, but he cautioned that the general perception of DNA as “predestiny” is not at all accurate. Just because someone has a marker for a serious disease does not at all mean that they will develop the disease.

This is a concept that society will have to grok quickly, because Science is moving faster than most realize.

Two O’Reilly Posts this Morning: PGP-10 and Justifying Twitter Trends

First, I wrote a quick piece pointing people to the Personal Genome Project. Church’s experiment is expanding beyond the initial group of 10 luminaries and is starting to invite more participants. It will be interesting to see how this plays out. From some discussions at last year’s Scifoo, I’m convinced that aggregating everyone’s genetic sequence into a massive data set would yield benefits that could transform our understanding of disease. The simple act of creating an index of the human genome and comparing that index to the incidence rate of various pathologies could yield clues as to the cause of various conditions.

Second, a response to Morozov’s Twitter missive in Foreign Policy. Morozov sounds the alarm that Twitter has the potential to misinform and bemoans the excessive number of tinfoil-hat conspiracy theorists on the platform. Listen, Twitter is a communications platform, and, as such, a bunch of crazies are always going to show up to the party. I don’t think you can discount the entire conversation because of eight example Tweets… nor do I think it is helpful or original to view Twitter as a potential terrorist threat. Maybe Morozov was auditioning for the role of Fox News columnist?