Fixing Payment Systems with Competition

This Target hack is a BFD. I’m at the mall this weekend because I’m a very last-minute shopper and it was the only time I could find to shop. My wife calls me because she gets this email from Chase which I’ll paraphrase here:

You got hacked.  Lolz!  It ain’t our fault, really.  So sorry. So so sorry. Oh, BTW we’re putting new limits on how you can use your card in the middle of Christmas week because of Target. Hey hope this doesn’t screw you up, but I hope you weren’t planning on spending more than $100 a day with us.   Happy holidays.

Think about this for longer than a few minutes, think about how this affects millions of customers, and then you’ll realize that this Target hack could potentially ding a percent or two off of this holiday season for a few retailers.

When we look back at this time, we’re going to laugh at how silly our approach to payment systems was from about 1980 – 2013.  I think that the Target hack is likely just the beginning, but it is clear that (even with strict PCI-compliance) we need a radical change in payment.

Problems with Payment

  1. Our credit cards (at least in the US) are the technology equivalent of a cassette tape. While I’m running around town with a smartphone that can read my fingerprint whenever I shop, I’m still using the equivalent of an 8-track cassette tape to pay for everything. Instead of moving toward a system that uses my location and my fingerprint, we’re just walking around with wallets that are no more secure than an envelope labeled “My Credit Card Numbers” that is totally unprotected. Steal my wallet, and you’ve got my credit card numbers… there’s a better way.
  2. We still have this irrational belief in the signature (and checkout clerks still eyeball them). This is our idea of identity verification – here’s a quill pen, why don’t you just sign this.  Now wait… there’s enough reliable location data flowing from my phone to enable every checkout clerk to say, “Welcome to the store Mr. O’Brien” without me saying anything.  The store should know I’m there already, and the technology also exists to have the store take care of payment authorization every time I pick something up. My phone could generate a piece of data that cryptographically encodes not just who I am, but where I’ve been today and what the time is down to the microsecond, authenticated by several GPS satellites.
  3. Online payment systems that offer more security are tiny in comparison to the 50,000-lb gorillas that dominate the system.  No one uses these systems. Add up the value of all the innovative payment companies in the Bay Area (Square, PayPal, + a thousand others), and you still don’t touch the $6.9 trillion total volume of Visa.  That’s $6.9 trillion flowing through billions of point-of-sale terminals (or “all the money”). Someone needs to figure out how to upgrade that instead of creating yet another payment system to trial in San Francisco and New York.

When I wrote about payment systems in 2010, the universal warning everyone was throwing at me was, “Don’t expect anything to change in the short-term.  The retail industry moves slowly, and no one wants to make the capital investment necessary to upgrade point-of-sale.”  At the time I was talking to a senior manager at a well-known payment company based in the Bay Area about NFC payment systems.  According to him, the future was now; a revolution was upon us.  It wasn’t.

The solution

1. Ensure real competition in the payment processing space. Huge payment providers like the ones that have logos in your wallet have had a history of using confidentiality agreements with vendors and transaction fees as tools to lock out the competition. For example, merchants have not been allowed to offer discounts for different kinds of payment methods.  Whether or not this continues to happen after the interchange fee settlement is up for debate, but we need to make sure that new technologies are not locked out of the physical point-of-sale space.

2. Put all the risk on payment providers.  If you provide a card or technology that people can use for payment, put all of the responsibility for a compromise on the payment provider. This will motivate payment providers to move away from the current, insecure methods of payment that we use today. Your credit card won’t just be a series of easy-to-copy numbers; it will make use of the technology we have available. Also, this would force dramatic changes to PCI.  “Storing a credit card #” at a merchant would go away, and instead your transactions would look more like PayPal’s authorization process for recurring payments.

With real competition, the payment processors that can control risk will be able to offer a significantly lower cost to the retailer, and retailers will provide the necessary motivation to consumers to adopt the more secure technology.  If Square has the best risk management and fraud prevention technology available, a retailer should be able to offer customers that use that technology a 1-2% discount if they pay with Square. Competition (not regulation) is the way out of this mess.

Whirr

Whirr + Spot Prices + Thanksgiving Weekend means that I can run large m1.xlarge instances on the cheap.

<griping>Also, Whirr is essential, but the project has a sort of “forgotten Maven site” feel about it. It’s annoying when an open source project has several releases, but no one bothers to republish the site.  It’s even more annoying when the “Whirr in 5 minutes” tutorial takes 60 minutes because it doesn’t work.</griping>

The Fall Guy (or Representing Open Source in the Business)

The problem with being the developer who can write at an open source company is that you end up being enlisted into the whole “Please explain how open source works” discussion when the company hires non-technical managers.  You end up as the representative of this strange thing called “open source.” A VP (not yours) calls you up and says, “Hey, could you explain what open source is to our sales team?”

You seize upon this as an opportunity to spread the Gospel of FOSS. You prepare elaborate slides that speak of Cathedrals and Bazaars. You turn some Lessig into an inspirational dramatic monologue that will inspire these non-developers to start thinking of OSS as the heroic effort we are mounting to take back control from proprietary vendors and create an even larger sharing economy. You think that maybe it is appropriate to introduce some of the developers who work on the project the company is currently making money from…

…and then you show up at the “Sales Kick-off” meeting and you realize that this is more of a Glengarry Glen Ross joke festival than it is an audience receptive to the idea of profiting from a sharing economy.  You quickly try to revise slides about “Free as in Beer”, because you realize that any mention of beer is going to get this crowd derailed pretty quickly. They scheduled you at the end of the day, after the VP of Sales gave a speech that involved football metaphors and after the regional sales director had a loud fight about territory with the sales team.  You realize that no one really wants to hear about OSS because they are all about to go out on some sales team-building exercise that involves a lot of drinking and more discussion of sports.

You are summoned to present with “…Ok, some hippy developer is going to tell us what this freeware @#$# is all about anyway. Go ahead show ’em how to ‘make it rain.'”

If this is your job, you’ll find yourself in a room full of people asking you questions like “Alright, so do you geeks have anything better to do with your weekend?” and “Why are my customers getting all worked up over open source? I don’t get no commission on this crap.”

Some things that you’ll notice in the reaction:

  • People with a background in business and sales have no idea why you’ve been participating in open source for years.  Not only do they not understand it, some of them discount the entire idea (even if the company was built atop an OSS foundation).
  • Even if you think you’ve explained open source, there’s a large portion of the audience that either wasn’t listening or refuses to admit that it could ever work. (Someone will make a joke about how you are a communist.  It will be unclear whether that person was really joking or not.)
  • Jokes will be made about open source being about “free love,” “hippies,” and “unicorns.”
  • Invariably, someone from the 1980s will show up and talk about how they once made a lot of money selling business software.  This will be used as an attempt to show others that your generation just has it all wrong.

If just the right kind of manager is there, everything you say about the “possibilities of open source” will be dismissed as over-idealistic nonsense.  Even though you might have just delivered a presentation on how Hadoop has created billions of dollars in value and how organizations like the Apache Software Foundation act as foundries for innovations that drive computing, someone will invariably stand up right after you and say, “Ok, enough about this open source crap, how are we going to make money?”

You realize that your “open-source” stuff is just going to be used as a scapegoat for a sales team that has no idea what OSS is.  This is the reason why you see headlines about large companies canceling support for OSS projects and products.  It isn’t because they couldn’t find a way to “monetize” – no, it was often because they refused to understand the gold mine they were sitting on.

The Shift to Local Data Centers

In my post on Friday I wrote a fictional piece from 2020 predicting that the world’s IT infrastructure would shift to in-country data centers after the recent surveillance revelations.   It looks like this is going to happen faster than I expected.

What shall we name this trend?  How about “Jurisdictional Data Compliance” or “Jurisdictional Data Security”.  Walk up to your CIO today and ask what your JDC implementation plan is given your client’s new concerns about privacy.

View from 2020: American Complacency on Surveillance Ruined the Internet

Assume we’re in the year 2020. We can all remember a time when Google was the largest internet company in the world – the #1 ranked search engine everywhere (well, everywhere except China). In 2020, this is no longer the case. Because of a continued stream of revelations about government surveillance, just about every country in the world decided to enact regulations that encouraged (if not required) services like email, advertising, social networks, and IM to be served from an in-country data center.

In 2020, if you are in Russia you use the Russian social network (already happening), if you are in Germany you use the German email provider, and if you are in China you use the Chinese version of Twitter (already established).  In seven years we went from these ubiquitous internet companies all having a global reach to a reality that encourages providers to confine themselves to a “state”.  The transition was difficult. A number of large internet companies’ stocks tanked in Q3 and Q4 of 2014 for a number of reasons, but one of the driving factors was that earnings suffered greatly when large portions of the EU and Asia lost trust in anything related to US-based internet providers.  Many of these companies were banking on international expansion as a source of growth. The free lunches and massive campuses in the Bay Area were built on a vision of linking the world’s populations together. Those went away when the promise of a global user base evaporated.

The period between 2013 and 2020 was about more than just businesses being affected by the surveillance fallout; after the surveillance scandals of 2013, people started putting up more walls to international cooperation.  This wasn’t an overnight decision. Over years and years, as new projects were implemented in both the private and public sector, the people who had to decide where to host servers and which cloud providers to use tended to opt for hosting something “in-country”. It wasn’t about which cloud provider had the easiest API any more; it was about a German company hosting a German web site in Berlin because of pressure from German customers.  All across the world, companies started to say things like, “Your data doesn’t cross any national boundaries” in marketing materials.  Jurisdictional Data Security became a selling point.

Companies made a mint on compliance with a series of laws passed in the EU, but this new “local-only” approach to services resulted in the creation of isolated islands of activity. In 2020, there’s no more “Internet” really.  The “Baidu-ification” of the internet influenced culture broadly, as there is far less cross-cultural exchange.  In 2020, K-pop is confined to Korea, Russian dash-cams of insurance fraud are confined to Russia, and Australian Reddit users no longer salute the North American users during the wee hours of the night.  Nations and regions keep activity to themselves.  Advertising networks (those great vacuums of data) had a much more difficult time operating across networks, and companies started aiming at a target an order of magnitude smaller than multiple billions of users.

Without thinking about the ramifications to US-based businesses, the government just decided to start using its leverage over US-based internet companies to compel compliance with a collection of secret laws. In an effort to protect us, they ended up sapping energy from one of the only sectors of growth in the economy.  They ended up ruining the global surveillance network they had so successfully established.

Back in 2013, even after the stories broke, most of the American public was still complacent.  Only a tiny percentage of people were paying attention in this country, and of those that were, a sizable portion just thought, “Oh, well, we have to keep track of the terrorists.”   It isn’t like the public was in the habit of demanding swift action for anything, really; as a nation we had decided to stop electing effective representatives years ago, and both of the branches of government they had any control over were locked in an endless battle over Bread and Circuses.  Poll random people on the street about FISA in 2013, and they’d probably tell you it was a European soccer league.

The public was complacent, even accepting of this new reality.  The Patriot Act was passed in a time of great fear and an anticipation of constant threat.  Our collective complacency with surveillance and our inability to stand for core values like privacy were the competitive disadvantage that ruined Silicon Valley.   In an effort to protect ourselves, we ended up doing more harm than good.  Some blamed the NSA and CIA, but these people were just implementing policy enacted into law by the public’s representatives.  The real source of the problem was the American public. We were complacent with surveillance.

In 2020 people say things like, “Remember when Google was a global company?  They had an entire campus in Mountain View.  Those were the days.” Google started having problems after 2013; they had to spend so much time at the executive level dealing with high-level negotiations with governments that they took their eye off of the local competition. Yes, America has the best capital markets in the world, the largest economy, and a strong national defense, but the loss of trust trumped all of that.

Lift Now Has Plans

Two weeks ago I blogged about Lift as a good site to help people meet personal goals.   Now, Lift has announced a new feature, “Plans”.


What I like about Lift is its simplicity.  It isn’t asking me to tweet every other second, and the mobile application hasn’t decided to ask me to write a review. (Have I mentioned I hate that?)

10 Steps to Get Your Crazy Logs Under Control

Two days ago I wrote a post about how “developers tailing the logs” is a common pattern.  A couple of people responded to me directly asking me if I had some sort of telepathic ability because they were stuck in a war room tailing logs at that very moment.  It’s a common pattern.  As developers we understand that tailing log files is much like tasseomancy (reading tea leaves) – sometimes the logs roll by so quickly we have to use a sixth sense to recognize the errors.  We are “log whisperers.”

The problem here is that tailing logs is ridiculous for most of the systems we work with.  Ok, if you have 1-2 servers, go knock yourself out – tail away.  If you have 2,000 servers (or more) tailing a log to get any sort of idea about errors or behavior isn’t just inappropriate, it’s dangerously misleading.  It’s the kind of practice that just gives you and everyone around you the false reassurance that because one of your servers is working well, they are all fine.

So, how do we get a handle on this mess?

#1. Stop Tailing Logs @ Scale – If you have more than, say, 10 servers you need to get out of the business of tailing logs.   If you have a reasonable log volume, up to a handful of GBs a day, throw it into some system based on Apache Solr and find a way to make the system as immediate as possible.  That’s the key: figure out a way to get logs indexed quickly (in a couple of seconds), because if you don’t? You’ll have to go back to tailing logs.

You can also use Splunk.  Splunk works, but it’s also expensive, and they charge by daily bandwidth. If you don’t have the patience to figure out Solr, use Splunk, but you’re going to end up paying crazy money for something that you could get for free.
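
To make that concrete, here’s a rough sketch of what “throw it into Solr” can look like using the SolrJ client. The core name and field names here are invented, the client classes have shifted a bit between SolrJ versions, and a real pipeline would batch documents and rely on soft commits instead of one big commit – treat this as a sketch, not a recipe:

import java.io.BufferedReader;
import java.io.FileReader;
import java.util.UUID;

import org.apache.solr.client.solrj.SolrClient;
import org.apache.solr.client.solrj.impl.HttpSolrClient;
import org.apache.solr.common.SolrInputDocument;

public class LogIndexer {

    public static void main(String[] args) throws Exception {
        // Hypothetical "logs" core; field names are made up for illustration.
        SolrClient solr = new HttpSolrClient.Builder("http://localhost:8983/solr/logs").build();
        String host = java.net.InetAddress.getLocalHost().getHostName();

        try (BufferedReader reader = new BufferedReader(new FileReader(args[0]))) {
            String line;
            while ((line = reader.readLine()) != null) {
                SolrInputDocument doc = new SolrInputDocument();
                doc.addField("id", UUID.randomUUID().toString());
                doc.addField("host", host);
                doc.addField("timestamp", System.currentTimeMillis());
                doc.addField("message", line);
                solr.add(doc);   // a real pipeline would batch these
            }
            // One commit for the sketch; near-real-time search would use
            // soft commits configured on the Solr side instead.
            solr.commit();
        } finally {
            solr.close();
        }
    }
}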

If you have more than a few GBs a day (on the order of tens of GBs, hundreds of GBs, or even a few TBs of data a day), you are in another league, and your definition of “logging” likely encompasses system activity.  There are companies that do this and they have campuses, and this isn’t the kind of “logging” I’m talking about here.

#2. If possible, keep logs “readable” – If you operate a very large web site this may not be possible (see the first recommendation), but you should be aiming for production log messages that are reasonable.   If you are running something on a smaller scale, or something that needs to be interactive, don’t write out a MB every 10 seconds.  Don’t print out meaningless gibberish.  When you are trying to come up with a log message, think about the audience, which is partially yourself, but mostly the sysadmin who will need to maintain your software far into the future.

#3. Control log volume – This is related to the previous idea. Especially in production, keep volume under control.  Remember that log activity is I/O activity, and if you don’t log correctly you are making your system wait around for disk (disk is crazy slow).   Also, if you are operating @ scale, all that extra logging is just going to slow down whatever consolidated indexing is going on, making it more difficult to get immediate answers from something like Solr.
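
A small sketch of what this looks like in code (the class, method, and field names are made up): keep the chatty stuff at DEBUG and guard it, so a production system running at INFO or WARN never pays for it:

import java.util.List;

import org.slf4j.Logger;
import org.slf4j.LoggerFactory;

public class OrderService {

    private static final Logger log = LoggerFactory.getLogger(OrderService.class);

    public void process(String orderId, List<String> lineItems) {
        // Chatty detail stays at DEBUG; the guard means none of this work
        // happens when production runs at INFO or WARN.
        if (log.isDebugEnabled()) {
            log.debug("Processing order {} with {} line items", orderId, lineItems.size());
        }

        // ... the actual work ...

        // Only a surprising condition earns a line in the production log.
        if (lineItems.isEmpty()) {
            log.warn("Order {} has no line items, skipping fulfillment", orderId);
        }
    }
}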

#4. Log messages should be terse – Aim for a single line when possible and try to stay away from messages that need to wrap.  You shouldn’t print a paragraph of text to convey an idea that can be validated with a single integer identifier.  It should fit neatly on a single line if at all possible.   For example, your log messages don’t need to say:

"In an attempt to validate the presence or absence of a record for BLAH BLAH BLAH INC, it was noted that there was an empty value for corporation type.  Throwing an exception now."

Instead:

"Empty corp type: BLAH BLAH... (id: 5). Fix database."

#5. Don’t Log, Silence is Golden – I can’t remember who it was, but someone once commented on the difference between logging in Java and logging in Ruby (I think it was Charles Nutter talking about the difference between Rake and Maven).  When you run a command-line tool in Ruby, like Rake, it often doesn’t print out anything unless something goes horribly wrong.  When you run Maven?  It prints a lot of output, and this is output that no one ever reads. This is a key point.  Normal operation of a system shouldn’t really warrant that much logging at all.  If it “just works”, then don’t bother me with all that logging.

If you are operating a web site @ scale, this is an important concept to think about.  Your Apaches (or nginx) are already going to be logging something in an access log so do you really need to have a log that looks like this?

INFO: Customer 13 Logged In
INFO: Attempting to Access Resource 34
INFO: Resource Access for resource 34 from customer 13 approved
INFO: Sending email confirming purchase to customer 13

I don’t think you need this. First, you should have some record of these interactions elsewhere (in the httpd logs), and second, it’s just unnecessary.  In fact, I think those are all DEBUG messages.  Unless something fails or needs attention, you should strive for having zero effect on your log files. If you depend on your log files to convey activity, you should look elsewhere, for two reasons: 1. It doesn’t scale, and 2. It is inefficient.  Instead of relying on a log file to convey a sense of activity, tell operations to look at database activity over time.

#6. Give Someone (Else) a Command – This is something no one does, but everyone should.  Your logs should tell an administrator what to do next (and it rarely involves you). The new criterion for printing a log message in production is that either something has gone wrong or something needs serious attention. If you are printing a message about something that has gone wrong, don’t assume that the person reading the message has any understanding of the internals of the system. Give them a direct command.

Instead of this message:

ERROR: Failed to retrieve bean property for the customer object null.

Write this:

ERROR: Customer object was null after login. Call DBA, ask about customer #342R.

You see the difference? The second log gives the admin a fighting chance (it also shifts blame to the database).  In this case, someone sent you a corrupt customer record, so point someone at the DBA.  You’d likely redirect them there anyway.  This way the sysadmin can skip the call to engineering and go directly to the source of the problem.

If you do this right, you’ll minimize the production support burden. Trust me, you want to minimize that burden – if you don’t, you won’t have much time for development because you will be fielding calls from production all the time.

#7. Provide Context – Unless you are logging a truly global condition like “Starting system…” or “Shutting down system…”, every log message should have some contextual data.  This is very often an identifier of a record, but what you should try to avoid is the log message that provides zero data or context.   The worst kind of message is something like this:

ERROR: End of File encountered in a stream.  Here's a stack trace and a line number...

This raises two questions: what is that stream from, and what exactly were you trying to do? A better log message might be:

ERROR: (API: Customer, Op: Read, Server: api-www23, ID: 32) EOF Problem. Call API team.

In this second example, we’re using something like Log4J Nested Diagnostic Context to put some details into the log that will help diagnose the problem in production.
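
Roughly, that looks like this sketch (the keys and values are invented, and your pattern layout would need %x and %X{server} in it to actually print them):

import org.apache.log4j.Logger;
import org.apache.log4j.MDC;
import org.apache.log4j.NDC;

public class CustomerApi {

    private static final Logger log = Logger.getLogger(CustomerApi.class);

    public void readCustomer(String customerId, String server) {
        // Context pushed here gets printed by the pattern layout on every
        // message, so the messages themselves can stay short.
        NDC.push("API: Customer, Op: Read, ID: " + customerId);
        MDC.put("server", server);
        try {
            // ... read the record; if it blows up, the context rides along ...
            log.error("EOF Problem. Call API team.");
        } finally {
            MDC.remove("server");
            NDC.pop();
        }
    }
}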

#8. Don’t Write Multiple Log Statements at a Time – Some developers see logs as an opportunity to provide a running commentary on the system, and they log everything that happens in a class. I dislike seeing this in both code and in a log.  Here’s an example: you have a single class somewhere and you see code like this:

log.info( "Retrieving customer record" + id );
Customer cust = retrieveCustomer( id );
if( cust != null ) {
     log.info( "Customer isn't null. Yay!" );
     log.info( "Doing fancy thin with customer. );
     doFancyThing( cust );
     log.info( "Fancy thing has been done." );
} else {
     log.error( "Customer " + id + " is null, what am I going to do?" );
}

And, in the log you have:

INFO: Retrieving customer record 3
INFO: Customer isn't null. Yay!
INFO: Doing fancy thing with customer.
INFO: Fancy thing has been done.

Consolidate all of these log messages into a single message and log the result (or don’t log at all unless something goes wrong).  Remember, logging is often a write to disk, and disks are insanely slow compared to everything else in production.   Every time you write a log to disk think of the millions of CPU cycles you are throwing into the wind.

If a developer is writing a class that prints 10 log statements one after another, these log statements should be combined into a single statement.  Admins don’t really care to see every step of your algorithm described; that’s not why you pay them to maintain the system.
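
A consolidated version of the example above might look something like this sketch (same made-up names as before) – silence when things work, one actionable line when they don’t:

Customer cust = retrieveCustomer( id );
if( cust != null ) {
     doFancyThing( cust );
     // normal operation: nothing to log
} else {
     // one line, with context and a next step for whoever is on call
     log.error( "Customer " + id + " was null after retrieveCustomer(). Call DBA, check the customer record." );
}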

#9. Don’t Use Logs for Range Checking – There’s a certain kind of logging that creeps into a system that has more to do with bad input than anything else.  If you find yourself constantly hitting NullPointerExceptions in something like Java you may end up trying to print out variables to help you evaluate how things failed in production.  After a few years of this, you’ll end up with a production system that logs the value of every variable in the system on every request.

You’ll end up with this:

Customer logged in value: { customer: { id: 3, name: "Charles", ......}
Purchasing products: { product: { id: 32, manufacturer: { id: 53, name: "Toyota"....}
Running through a list of orders: [ { order: { id: 325... }, { order: { id:2003...} ]

…and so on.  In fact, you may end up serializing the entire contents of your database to your log files using this method.

Programmers are usually doing this because they are trying to diagnose problems caused by bad input.  For example, if you read a customer record from the database, maybe you’ll just log the value of the customer record somewhere in the log so you can have it available when you are debugging some production failure. Have a process that takes a customer and a product? Well, why not print out both in the log just in case we need them.   There are issues with customer records having null values, so… don’t do this; just create better database constraints.

This is the fastest road to unintelligible log files, and it also hints at another problem.  You have awful data.  If you are dealing with user data, check it on the way in.   If you are dealing with a database, take some time to add constraints to the table so that you don’t have to test to see if everything is null.  It’s an unachievable ideal, I know, but you should strive for it.
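
To make “check it on the way in” concrete, here’s a tiny sketch of a boundary check (the field names are invented) – reject the bad record once, at the edge, instead of serializing whole objects into the log everywhere they’re touched:

import java.util.Objects;

public final class CustomerValidator {

    // Hypothetical boundary check: reject a bad record once, at the edge.
    public static void validate(String id, String name, String corpType) {
        Objects.requireNonNull(id, "customer id is required");
        Objects.requireNonNull(name, "customer name is required");
        if (corpType == null || corpType.isEmpty()) {
            throw new IllegalArgumentException("Empty corp type: " + name + " (id: " + id + "). Fix database.");
        }
    }
}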

#10. Evaluate Logs Often – The things described in this post are really logging policy, and almost no one has one.  This is why we have these production logging disasters, and this is why we create systems that are tough to understand in production. To prevent this, you should put the evaluation of logging on some sort of periodic schedule.  Once every month, or once every release, you should have some metric that tells you if log volume has outpaced customer growth.

You should conduct some investigations into how useful or how wasteful your current approach to logging is.   You should have some policy document that defines what each level means.  What is a DEBUG message in your system?  What should be an ERROR?  What does it mean to throw a FATAL? Prior to every release you should do a quick “sanity check” to make sure that you haven’t added some ridiculous log statement that is going to make maintaining the system awful.

But… most people don’t do these things which is why production logs end up being a disaster.

In the War Room, “Let me take a look at the logs…”

Not long ago, I had the opportunity to help a large company upgrade a fairly critical piece of software that was running everything.  I can tell you that the job involved Tomcat, but that’s about it. As a consultant, you learn to keep the specifics of any engagement close.  What I can tell you is that this was the kind of client that has a “War Room” that looks exactly like what you would expect – three rows of Mission Control-style desks all facing a huge wall of monitors. The lights were dimmed, and (minus the uniforms) it reminded me of War Games.  I’ve seen a few of these rooms, and if I ran a company I’d probably demand that they set one up that matches this movie set almost exactly. (Except in my War Room, I’d make uniforms mandatory.)


At this particular launch event I made a joke about the movie “War Games”, but I don’t think the audience was old enough to remember. They must have thought to themselves, “Why is the old guy saying, ‘Would you like to play a game?’ in a strange robot voice?”

In the War Room: Everything is Measured

At a company large enough to have a War Room, almost everything is graphed in real-time: databases, apaches, caches, temperature…  There are multiple teams that can handle emergencies around-the-clock without skipping a beat.  Toward the end of the day, you’ll jump on the conference call with teams on other continents preparing to pass the baton as the work never ends.  During a production rollout someone will get up in front of the room and bellow out commands.  There is a multi-hour process, there are multiple data centers with a large number of servers. It’s serious stuff.

There are checkpoints to be reached and decisions to be made…If metrics rise or fall in response to a deployment, people start to react.  If a graph starts to dip or spike, people start to react. Everyone’s on edge because it’s the War Room.  There are conference rooms off to the side where solemn meetings can be convened, and there’s no shortage of serious faces. It’s the War Room. Everything is measured down to the millisecond, or at least that’s the idea, but there’s always one thing that isn’t measured yet in all war rooms and in all production deployments I’ve been a part of, and that’s the application, because most of the time during a deployment you have guests in the War Room – us developers.

Developers Meet Operations

Developers don’t think like admins or operators. When a developer is working on a new system they often don’t stop and think about what the best way to measure it would be.  If you are creating a new database or writing a new programming language, the last thing on your mind is the graph that some admin has to look at to tell if your system is working.  You are working on features, maybe you are also writing some tests (hopefully).  You are not thinking about when or how to page someone if something goes awry.

Developers tend to just grab what is closest to the code.  In most cases, this is a series of log statements.  So, we write these systems that spit out a lot of ad-hoc log statements.  We’re so prolific with log messages that many operators understand that developer-driven log files are usually something of a mess.  Some systems I’ve worked on in the past print out 10 MB of text in five minutes – that’s less than helpful, and the logs don’t mean anything to admins because they are often full of errors and stacktraces.  Application logs contain a lot of miscategorized messages. (Debug messages as Error messages, messages that make no sense to anyone other than the developer who wrote them, etc…)

Ops: “Is it working?”  Dev: “Dude, I don’t know, let me look at the logs…”

Back to the War Room. You’ve been sitting there for a few hours.  People have brought a bajillion dollars’ worth of hardware offline to push a new deployment, and finally some guy on the other side has pressed a button that runs the release.  Everyone sort of waits around for a few minutes for hard drives and networks to do whatever it is that they are doing, and then people go back to watching some charts.  The charts always dip or blip a bit during a release, and people always tend to jump up and blame the application for whatever problem starts happening at this point.

Or, maybe things are working, maybe they aren’t, but generally there’s this 5-30 minute window where people are just waiting for some shoe to drop. If you are a developer responsible for a high-risk change, this is the moment when, highly stressed, you find yourself tailing a log file.  Even in the most monitored of environments, it all tends to come down to a couple of developers tailing log files on a production machine looking for signs of an error. “Yep, it works.  I don’t see any errors.” Or, “Nope, something blew up.  Not sure what happened.”

I’m usually one of the people doing post-release production analysis, and I can tell you what we’re doing when we just tail the logs.  We’re just watching screenfuls of text roll by and we’re looking for hints of an error.  It’s totally unscientific, and it doesn’t make any sense given the fact that production networks often have an insane number of machines in them now.  Maybe it made sense when there were 10 machines and we could tail a few of them, but today those numbers are much larger (by a few orders of magnitude).  Still, we cling to this idea that we’ll always need a developer to just check a few servers. Tail a few logs and make sure nothing went horribly wrong.

I wonder if this is just how the world works.  When they launch a new rocket, I’m sure they measure everything down to the nanosecond, but I’m also sure there’s some person sitting in that NASA Mission Control center running “tail -f” on the booster rocket.   Or when you watch that landing operation for the last Mars rover, half of those people are just SSH’ing into some Solaris or Linux box and tailing “rover.log”.

There’s Got to be a Better Way

Yes, there’s probably a better way.  You could write post-deployment integration tests, you could build in some instrumentation that tests the internals of a system, but it rarely works that way.  You could write Splunk jobs ahead of time to track particular error conditions.  You could run more frequent deployments (i.e., multiple times a day) so that deployments don’t have to have all this ceremony. All of these are good ideas, but it’s more likely that developers will continue to just tail logs from now until we stop writing code.

More often than not, your production network is just a whole different beast.  While staging might have tens of machines, maybe your production network has tens of thousands of machines and hundreds of millions of users.   The error conditions are different, the failure mechanisms can’t be anticipated. There’s just no way to recreate it, and, as a developer, you are the guy on the stardeck sitting there like “Data” from ST:TNG watching an overwhelming amount of log data roll by. “Captain, the logs show no anomalies.”

Break Through Server-side Bias and Surrender to Javascript

You have a server-side bias and you don’t even realize it.  I know this, and you need to know this.  It’s holding you back a bit. Step one is to admit that you have a problem and that your addiction to easy server-side frameworks is ruining your performance.  You’ve used frameworks like Rails, Django, WordPress, or one of several hundred Java web application frameworks for years and you are resisting this move toward Javascript.  Yes, you’ve “embraced” Javascript throughout your applications, but you might be missing the larger point – Javascript isn’t here to make your server-side applications more “reactive”, and it isn’t just a nice feature to add to a larger application.

That Javascript you keep insisting on serving from your server-side framework… that is the application.  It’s taking over.    Your server-side framework won’t be doing anything resembling templating in a few years because that’s the job of the browser.  Yes, yes, you might have a few “high value” pages or pages that have to be hosted on a server because of security constraints, but if your web site requires a round-trip to a server to render a web page… well that’s old people thinking.

The “geographical center” of your application is no longer on the server-side.  You shouldn’t start your application thinking about what server-side framework it is going to be based on, because it won’t be based on a server-side framework.   In fact, you may use many, but they will only serve to support what will essentially be a client-side Javascript application.

I understand how you feel; you might read this and think, “No, we’re not moving everything to client-side Javascript frameworks, no.”  The willful ignorance you are embracing here is a defense mechanism.  As a Java developer in the middle of the last decade, I saw the next generation using PHP and Ruby, and I tried to explain it away for a few years as just a passing fad.  It wasn’t, and I still see a lot of developers in my age group reflexively resisting dynamic languages.  Resisting change is a mistake – you’ll be obsolete before you can say, “Wait, I didn’t realize that the browser could do…”

With the rise of ReactJS, AngularJS, Backbone, and a number of other good client-side Javascript frameworks, I’m seeing a new kind of bias – server-side bias.  There are people out there who, for whatever reason, be it ignorance or otherwise, fail to realize that the days of having some dynamic templating engine on the server side merge your data with some HTML are coming to a swift conclusion.   It was fun while it lasted, but this “mail-merge” approach of retrieving a row from a database, packaging it up in some object, and then “merging” it into a template is on its way out.  The view layer is moving client-side, quickly.

And, it’s moving to the client because it’s an order of magnitude faster to serve what is essentially a static AngularJS application from a CDN than it is to muck around with serving the same thing from some server farm (even with the help of memcache).  That’s the thing: once you start doing this you realize that you only really need a server-side framework for API calls.  That’s really it.  All your server-side frameworks will be doing in five years’ time? Serving JSON, and maybe futzing around with a few databases.
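
If you squint, the server side collapses into something like this sketch – a bare JSON endpoint (JAX-RS here purely as an example; the path and fields are invented) that a static, CDN-hosted Javascript app calls:

import javax.ws.rs.GET;
import javax.ws.rs.Path;
import javax.ws.rs.PathParam;
import javax.ws.rs.Produces;
import javax.ws.rs.core.MediaType;

@Path("/api/customers")
public class CustomerResource {

    // The server's only job: hand JSON to the client-side app.
    // All templating happens in the browser.
    @GET
    @Path("/{id}")
    @Produces(MediaType.APPLICATION_JSON)
    public Customer get(@PathParam("id") String id) {
        return new Customer(id, "Example Customer");
    }

    public static class Customer {
        public String id;
        public String name;

        public Customer(String id, String name) {
            this.id = id;
            this.name = name;
        }
    }
}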

Don’t get me wrong, there will be people writing Rails RHTML and Java JSPs for many years to come, much in the same way that COBOL developers still run systems packed away in government data centers.    But Ruby and Java developers that fail to embrace this client-side Javascript trend will find themselves confined to internal applications – Ruby on Rails will be the Oracle Forms of 2020.

(An imperfect) Space-inspired OSS Project Analogy

At the risk of sounding like a raving lunatic, I decided to come up with a space-inspired taxonomy for characterizing OSS projects.  I came up with this after kicking around GitHub over the weekend trying to make sense of some new projects. Recently there’s been a huge influx of corporate-sponsored OSS projects that are released with a lot of fanfare.  While there’s a lot of good stuff happening in OSS land, it is also difficult to figure out which projects are truly “vibrant” open source projects and which are simply one engineer’s solo project.  While GitHub makes it easy to track things like a project’s network, number of forks, etc., these metrics are still something of a popularity contest. When a company puts 40 projects in a GitHub account, what I’d appreciate is some upfront statement: “These are our four major OSS projects, and the rest of the repositories with silly names are just small projects that plug into our own infrastructure.”

As an exercise in lunacy, I decided to throw together a very flawed OSS project size/health/community analogy using space.   You can classify OSS projects using the following classifications:

Comets: Periodic Celebrities / OSS Projects that Won’t Last

Maybe there is an open source project that is suddenly very hot, but you can tell it’s not going to last very long.  I compare these projects to Comets.  The latest little Javascript utility may streak through industry news for a few weeks and then fizzle out.  Many startups in the OSS space think of themselves as a new planet, when in reality they are just a comet with excessive mass.   The thing about the OSS industry news cycle is that comets often dominate it as if they were full-fledged planets (because you can pay for coverage).   We’re all so used to the planets we already know… so when a comet comes into town everyone flips out.

Some comets burn out; some comets show up every few months or years, make a lot of noise attracting attention and contributors, and ultimately return to the desolate reaches of the Oort cloud for a few months.  I’d name some OSS comets but then this post would attract a whole army of comment haters. If you find yourself attracting a community, losing a community, then attracting a new community, then losing it – you are in a dangerous orbit and you are a comet.

Asteroids: Where is everybody? Who’s running this project?

One-person OSS projects.  Projects that are not completely connected to a community.   Projects that look substantial on radar, but then appear to be abandoned upon closer inspection.  Projects not large enough to attract a community (or in this case an atmosphere).   The majority of GitHub is made up of a series of asteroid belts.

Think RubyGems: some RubyGems are so important they are moons of a planet (activesupport), or even planets in a system depending on your perspective (rails), but a lot of RubyGems are just one-person forks of someone else’s codebase floating around without a lot of discussion.   If you’ve ever found yourself trying to contribute to an OSS project only to find no response, there’s a good chance that you’ve stumbled upon an asteroid.

If you work for a company that just dumps OSS projects out there but doesn’t provide much in the way of support, you are effectively generating more asteroids.  Asteroids can be very useful to a consumer of OSS, but when you take on an asteroid, when you start mining that asteroid for minerals, you own the whole thing. If it breaks you have to fix it.  Also, if your healthy project (your planet) depends on an asteroid, you better keep track of it, or it’s going to impact you at some later date.

Planets: OSS Projects with an Ecosystem

Tomcat is a planet, and on the Tomcat planet live thousands of developers.  If something starts going wrong with Tomcat, the planet, a whole army of people show up to fix problems.   If someone wants to do something drastic to the planet, a whole community, consisting of that planet and any associated moons, shows up to register an opinion. Taking care of a planet is tough work because there are so many interested parties.

This is the ideal size and scope for an OSS project.  Something large enough to attract a population, something large enough to sustain an atmosphere.  Yes, your planet is going to go through seasons of activity and inactivity, but there will always be signs of life on your project (as long as you do things like monitor the climate and make the necessary adjustments).

Moons: Your planet’s plugins.

Plugins for larger projects are moons.  Maybe.  Moons can gain so much velocity that they need to be rocketed into separate planetary orbits.    Maven plugins == moons.  Gradle plugins == moons.  Can’t think of anything more interesting to say about moons, so I’m moving on…

Systems:  Substantial OSS Projects Revolving Around a Central Idea or Project

Apache httpd is a system (maybe), Rails is a system (but it dominates the Ruby Galaxy).    Node.js is a system in the Javascript galaxy.

While Hadoop itself may have been a planet at one time, you can consider the entire Hadoop ecosystem to be its own system.    Or, maybe Hadoop is a planet in the Map/Reduce system.  Maybe Hadoop started out as a planet, quickly aggregated many moons, underwent a sort of ignition, and became a star itself?

This may be where the whole analogy breaks down because if Hadoop is a star, what then is Hive?  A planet? You know what, I don’t know. It’s an analogy and it’s imperfect.  Maybe HDFS is like a singularity that tunnels between dimensions.

Now I’m just being facetious.  You get the gist.

Galaxies –  Galaxies are often more than just a project; they are an entire collection of systems.  For example, Hadoop is in the Java galaxy.    Maybe there is a PHP galaxy or a Javascript galaxy.

Listen and you’ll hear Cosmic Background Radiation.  That’s the constant bickering between proponents of BSD-style licenses and proponents of the GPL.

What is Dark Matter? Some people are convinced that OSS is dominated by corporate influence.  This influence is often very visible, but it is also something that is difficult to keep track of because it has a weak interaction with mailing lists.

What then is the Apache Software Foundation?   The Apache Software Foundation is like the Federation.  It spans many systems and dominates certain galaxies.  Except they often have a hard time deciding where to go next because none of the ships have a captain. Sulu can stand up at any time and say, “Kirk I’m going to have to -1 that order.”  (That was a joke ASF people… that was a joke.)

Here, watch a YouTube video of Carl Sagan…