Lost in Transition: From Server to Cloud

Who remembers the gravy days of 1999 and 2000? When your company’s Foosball table was in the shipping and receiving room right next to that stack of expensive E4500s and E450s from Sun Microsystems. Every once in a while a particularly vigorous shot on goal would send a ball flying onto a stack of (very) expensive boxes from Sun waiting (often for weeks at a time) to be shipped to Equinix. Back in 2000, companies like Forbes.com and TheStreet.com were paying for bare metal, and from what I remember, this wasn’t uncharacteristic. Today we live in a very different reality: only the largest Internet companies (GOOG, FCBK, APPL) are paying for real infrastructure. Everyone else just rents virtual machines from large telcos.

While it is easy to spin up a new VM in 2010, I get the feeling that we’ve lost something. Whenever I think about a production network or try to help someone diagnose a problem, there’s a lurking uncertainty about the bargain we’ve made with tools like VMware and other approaches to virtualization. Really, what sort of CPU does this machine have? If our managed hosting provider gives us a SAN, what exactly does that mean? What do you mean they won’t give us a breakdown of which VMs are on which physical machine? From what I can see, the industry is now accustomed to sacrificing control. Companies that have adopted a virtual approach to infrastructure have lost something in the transition. While I don’t argue for a return to the days of spending $3 million on physical infrastructure, I do think that it was easier to diagnose and fix performance issues in 2001. There were fewer variables, fewer secrets, and there were more people on staff who knew exactly how a hard drive works.

Back when we had machines…

Back in 2001, I was playing a role that we would now call “devops liaison”. This is the term for a senior member of the software team who acts as the interface between operations and development. My job, most of the time, was to be a bi-directional proxy. Operations might have a question or a suggestion about a particular technology we used in production. Development might need to work with a DBA to tune a particular database table. While my primary job responsibility was application development, I had some insight into operations issues. What sort of shared storage did we have access to? How were the “head-ends” configured? What sort of super-secret security countermeasures were deployed?

The web was still “new”, and companies were rapidly spending millions to gain an advantage: millions on development, millions on advertising, and millions on computing hardware. Back then you had confidence that your application had enough horsepower to serve requests because it was running on $3 million worth of Sun hardware, each box a 16-way machine sporting 32 GB of memory. If your application needed storage, you had access to Fibre Channel SAN storage; your every wish was just another $500k purchase order away. Before the bubble burst in 2001, Internet companies dined upon massive servers that many would consider unwieldy today.

The rise of the VM, the Cloud, and the Widespread Adoption of Managed Hosting

After the first crash, the money dried up. The jobs disappeared, and companies had to make do with the capacity they had purchased in the previous two years. Systems built on Xen and VMware became more mainstream in 2004 and 2005. Huge companies like Google and Amazon perfected the model of having a stable of general-purpose computers constantly available to be repurposed and deployed. While we grew more and more comfortable with abstracting computing from the underlying physical infrastructure, there was also a change in the way people started to pay for resources. Back in 2000, the payment model was old-fashioned: million-dollar servers were amortized over five years.

In 2010, if you are using virtual machines, you are likely paying a monthly fee for VM specifications. For example, you pay Amazon EC2 about $70/month for a Small instance (1.7 GB, 1 CPU Unit) and maybe you add on an extra $10/month for access to storage. As another example, you rent a VM from a national telco with 8 GB of RAM, 50% of a quad-core CPU, and access to a SAN for about $800/month. Even large companies have moved to a model that involves renting virtual infrastructure. Today we’re no longer discussing disk RPMs or even focusing on CPU clock speed. (Quick: what is the CPU clock frequency on your server? How about the front-side bus frequency? Do you have 15K drives? Do you even care about that anymore?)
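To put the two payment models side by side, here is a rough back-of-the-envelope sketch. It assumes straight-line amortization and a round $1,000,000 purchase price (my stand-in for a “million dollar server”), and it ignores power, colocation, and staffing, so treat the output as an illustration rather than a real cost comparison. The function name is just something I made up for the example.

```python
def monthly_amortized_cost(purchase_price, years):
    """Straight-line amortization: spread the price evenly across every month."""
    return purchase_price / (years * 12)


# Assumption: a round $1,000,000 server written off over five years.
owned_server = monthly_amortized_cost(1_000_000, 5)

# Figures quoted above: EC2 Small instance plus storage, and a telco VM.
ec2_small = 70 + 10
telco_vm = 800

print(f"Owned server (amortized): ${owned_server:,.2f}/month")
print(f"EC2 Small + storage:      ${ec2_small:,.2f}/month")
print(f"Telco VM with SAN access: ${telco_vm:,.2f}/month")
```

The monthly numbers are not comparable on capacity, of course. The point is only that the unit of purchase changed from a capital expense to a line item on a monthly bill.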

We’ve lost something important in the transition

While you might not care about the FSB speed or the size of your hard drive cache, all of these things have a direct impact on performance. (This makes me sound like an old fogey…) Back in the day, when there was a particularly difficult performance problem, there were people on staff who could do a deep dive into issues like average seek time and whether or not a particular disk was fragmented. Today, more and more operations people are just people who call up the hosting provider when something fails. The cloud has brought us indirection: there’s no direct responsibility, and when something awful happens you can just blame the nebulous, unknowable infrastructure. “Well, I don’t know why it isn’t fast, call the hosting service.”

On the other hand, the organization that owned the infrastructure was constantly confronted by issues like power failures and freaky motherboards. I’m not saying that there are no advantages to getting out of the hardware maintenance business. There are. What I’m trying to capture is the idea that we’re farther away from the nuts and bolts of performance. A good production network can only reach its full performance potential if it is highly tuned. Very often this means getting someone in house who is familiar with some of the most advanced tools for measuring OS performance, someone who has experience tuning a production server like a custom engine on a valuable sports car. By running toward general-purpose grid computing solutions, we’re making these highly skilled mechanics obsolete, and we’re creating a new generation of developers who will never be confronted with the realities of the physical hardware that was designed to execute code.

Call me old-fashioned, but I think we’ve lost a lot. Our highly tuned E450 from 2000 could run circles around your virtualized Linux machine in 2010. We’ve been tricked by “the man” into paying rent for fewer resources, and the operations guru as a craftsman is a disappearing breed.