All Posts By Mike Murphy


CPU Cost Comparison: Bigger Spend, Better Output?

By | Review, Storage | No Comments

In a previous blog post, “Spending Money To Make Money: An IT Strategy That Really Works?”, I compared the cost of running 6-core, 8-core, and 12-core CPUs across the x86 enterprise. My point back then was that more expensive servers could actually save you money when looking at the TCO. Now that Intel is producing 14, 16, and 18-core CPUs, I wanted to go back and see where these machines fit in terms of price and performance.

An Updated CPU Cost Comparison

While these 18-core CPUs are hot-rods featuring 5.69 billion transistors, 45MB of L3 cache, DDR4 RAM support, and 9.6GT/s QPI links, they are very expensive.

Who would argue that buying the most expensive servers is a smart business choice? Actually, I will, with some caveats.

While these CPUs make the top 5 list for VMmark’s performance specs, they actually hit #1 when we factor in power costs and cooling efficiency. So let’s do a high-level ROI that factors in hypervisor and OS costs. One caveat up front: when I say “most expensive server,” I’m actually talking about a specific CPU line. I like the most expensive Intel E5 CPUs, which are more affordable than the top E7 CPUs. The E7 is the true top-of-the-line CPU and may only be necessary for the absolute most demanding workloads. That said, the E5 tends to follow the consumer market, which arguably moves faster than true enterprise, so the E5 line gets newer technology sooner, which is another point in its favor.

Let’s take a look at a VDI requirement based on 400 concurrent Citrix users. If the requirement is 72 physical cores and 1.5TB of RAM, there are a few different ways to satisfy it, each with a different cost and server count (pricing estimated via hp.com “customize and buy” as of November 11th, 2014, using 32GB DIMMs, 2.3GHz CPUs, redundant power, fans, a rail kit, and no hard disks).
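To make the sizing math concrete, here is a minimal sketch of the kind of comparison behind the table below. The 72-core/1.5TB requirement is from the example above, and dual-socket servers are implied by the 6-server option discussed later; the per-server prices are placeholders, not the actual hp.com quotes.

```python
# Rough sizing sketch for the 400-user Citrix VDI example above.
# Core/RAM requirements come from the post; per-server prices are
# placeholders -- plug in your own "customize and buy" quotes.
import math

required_cores = 72
required_ram_tb = 1.5

# cores per CPU -> (sockets per server, hypothetical list price per server)
options = {
    6:  (2, 10_000),   # placeholder price
    12: (2, 14_000),   # placeholder price
    18: (2, 20_000),   # placeholder price
}

for cores_per_cpu, (sockets, price) in options.items():
    cores_per_server = cores_per_cpu * sockets
    servers = math.ceil(required_cores / cores_per_server)
    ram_per_server_gb = math.ceil(required_ram_tb * 1024 / servers)
    print(f"{cores_per_cpu}-core CPUs: {servers} servers, "
          f"~{ram_per_server_gb}GB RAM each, ~${servers * price:,} total")
```

With these placeholder prices the 6-core option needs six boxes while the 18-core option needs only two, which is exactly the trade-off the analysis below weighs against chassis space, power, and management overhead.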

CPU Cost Comparison Analysis

[Pricing comparison table (original screenshot not reproduced)]

While the 6-server option is still the cheapest, if we factor in the space in the chassis, power, cooling, not to mention management overhead, it probably makes sense to purchase and install the larger servers.

The biggest benefits are longevity and density. A larger server can be repurposed later on and can scale for a different purpose: a database server, a test/dev environment, software-defined storage, whatever. These new servers will generally last longer than a 6-core server. You might even get 4-5 years out of these 18-core CPUs, whereas it’s unlikely it will still make sense to be running 6-core CPUs by then.

The Bottom Line

When choosing a server, consider that spending more money on the fastest servers available today should bring many benefits: reduced management overhead, reduced software costs (per-core database licensing aside), reduced power and cooling costs, and a smaller footprint if you are paying for rack space. And you’ll likely get more longevity out of them as well.


Nearline and Enterprise SAS vs. NVMe (PCI express) Storage Connections

By | Design & Architecture, Storage | No Comments

Most enterprise storage arrays today use enterprise SATA (aka Nearline SAS) and enterprise SAS connections running at 6Gbps or 12Gbps on the backend, where the actual disks and spindles connect to the controllers. The benefits of these enterprise-class communication protocols over standard SATA include:

  • Native command queuing.
  • Dual-redundant multipath I/O connections.
  • Plenty of throughput for each individual spindle.

You would think this is plenty of bandwidth, but now that SSDs are replacing HDDs, there is a case to be made for a newer, better technology. Many individual SSDs can push 500MB/sec on their own. It’s not so much that 12Gbps is a bottleneck today, but the future of storage isn’t just NAND flash. Technologies like PCM and MRAM will easily push the boundaries of moving large amounts of data in and out of individual drives, even on the order of 1000x.

How Can We Improve Existing Flash Performance Outputs?

We might now agree that newer technologies are in order for the long term, but even with the NAND flash in use today, there could be big improvements in performance from looking at flash differently.

For example, most SSD drives today have multiple NAND chips on the circuit board. If we read and write to these chips in a more parallel fashion, we can get even faster performance. Take the existing PCI express-based NAND flash systems out there today, like Fusion-io or OCZ’s RevoDrive. How can these devices achieve higher throughput and lower latency than a 12Gbps SAS connection? For starters, they use the PCI express bus, which removes some controller latency. Taken a step further, NVMe (Non-Volatile Memory Express) is a new specification that can outperform AHCI and earlier PCIe storage stacks. See the graphic below from communities.intel.com for the latencies of the different stacks.

Intel SSD P3700 Series NVMe Efficiency

Image from communities.intel.com.

What Other Benefits Does NVMe Provide?

Some of the other major benefits of NVMe include:

  • Multiple thread usage.
  • Parallel access.
  • Increase in queue depth.
  • Efficiency improvements.

Let’s look at queue depth specifically. AHCI can handle 1 queue with 32 commands per queue; NVMe, on the other hand, can handle 64,000 queues with 64,000 commands per queue. Since many SSDs don’t perform well until there’s a big demand and a high queue depth, getting the most performance out of an SSD means hitting it with multiple requests: a 20,000 IOPS drive can often do 80,000-90,000 IOPS at the right queue depth. Newer NAND controllers also have more than double the number of channels of a SATA-based SSD (18 instead of 8), as well as more DDR3 RAM used for cache (1GB instead of 128MB or 256MB). So we are starting to see miniature storage-array performance in a single SSD “spindle.”
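To see why queue depth matters so much, here is a back-of-the-envelope sketch using Little’s Law (throughput is roughly outstanding I/Os divided by per-I/O service time). The 100µs service time and the 90,000 IOPS ceiling are placeholder figures, not specs for any particular drive.

```python
# Little's Law sketch: throughput = concurrency / service time.
# A drive servicing one 100µs I/O at a time tops out around 10,000 IOPS,
# but with many outstanding I/Os feeding its parallel NAND channels it
# can deliver several times that, up to whatever the controller and
# flash can actually sustain.

service_time_s = 100e-6           # placeholder per-I/O latency (100 µs)
controller_ceiling_iops = 90_000  # placeholder limit of the drive itself

for queue_depth in (1, 4, 16, 32, 64, 128):
    little_iops = queue_depth / service_time_s
    effective = min(little_iops, controller_ceiling_iops)
    print(f"QD {queue_depth:>3}: ~{int(effective):,} IOPS")
```

The jump from a few thousand IOPS at queue depth 1 to the drive’s ceiling at higher depths is the same pattern described above, which is why deep NVMe queues matter.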

One more thing: Intel has a way to bring a PCIe-based SSD into a standard 2.5” form factor with the SFF-8639 connector. This connector is what we will start to see in enterprise systems. Wouldn’t it be nice if this connector could carry either SATA/SAS or PCIe over the same cable?

How Does NVMe Perform in Independent Tests?

In independent tests, these NVMe-based storage drives are able to hit 200,000-750,000 IOPS using 4KB random reads with queue depths of 128-256. The 4KB random write numbers are lower, from 80,000 – 220,000 at similar queue depths. Sequential read and write performance of many of these drives can easily exceed 2GB/sec, peaking near 3GB/sec for the largest transfer sizes. Average response time peaks at 935 µs, whereas peak latency has a much larger range from 3ms up to 98ms depending on the model, brand and queue depth.

Those super-high latency numbers are proof that IOPS only matter in relation to latency, and it makes sense to choose an SSD drive that offers performance consistency if the application requires it (such as the Micron P320h – 700GB).

What Does NVMe Mean for the Future?

These are strong numbers from a single SSD drive, but the point of all this analysis is two-fold. On one hand, NVMe will lift a potential barrier, since NL-SAS and SAS connections will eventually become a bottleneck as newer flash-based technologies are released. On the other hand, much as the storage systems of the past decade are being replaced by newer flash-based systems built from the ground up, NVMe gives us a new way of reading and writing to flash that yields even greater performance through more parallelism and concurrency. With existing PCIe-based SSDs already pushing the limits of SAS, NVMe has a promising future as storage becomes faster and faster.

WAN vs. WAN Optimization

By | How To, Networking, Strategy | No Comments

Last week, I compared Sneakernet vs. WAN. And I didn’t really compare the two with any WAN optimization products—just a conservative compression ratio of around 2x, which can be had with any run-of-the-mill storage replication technology or something as simple as WinZip.

But today, I want to show the benefits of putting a nice piece of technology in between the two locations over the WAN to see how much better our data transfer becomes.

When WAN Opt Is Useful

When choosing between a person’s time or using technology, I like the tech route. But even if it’s faster, how much faster does it need to be to offset the expense, hassle, and opportunity cost of installing a WAN Opt product? The only true way to know is to buy the product, install it, and run your real-world tests; however, I’m one for asking around.

But even if it’s faster, how much faster does it need to be to offset the expense, hassle, and opportunity cost of installing a WAN Opt product?

I reached out to my friends over at Silver Peak, and they pointed me to this handy online calculator.

It turns out, WAN Optimization products aren’t useful in every situation. If you have ample bandwidth with very low latency, it might not be worth it. But even marginal latency across any distance, or data that is repetitive (or compresses and deduplicates well), can benefit from WAN Opt. And if you have business RPOs and RTOs to meet, you may very well require WAN Optimization in between.

An Example

I took the example from last week: the 100Mbit connection, figuring in 7ms of latency to simulate the equivalent of 50% utilization on the line, with 2x compression. If you recall, the 10TB of data that moved in 10 days can translate to 370TB of data in the same time frame with a Silver Peak appliance at both ends. Much of that efficiency is due to the way WAN Optimization works: data doesn’t just get compressed and streamed using multiple streams; the best WAN Opt products also avoid sending duplicate and redundant data. So a transfer that would normally take a week or a day could be completed in as little as 4.5 hours or 40 minutes, respectively.
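As a rough sanity check on those figures, here is the raw math. The 74x “effective speedup” is not a Silver Peak spec; it is simply the factor implied by moving 37 times more data over the same line in the same window, on top of the 2x compression baseline.

```python
# Sanity check on the 10TB-vs-370TB comparison above: a 100 Mbit/s line
# at ~50% utilization, first with plain 2x compression, then with a
# WAN-optimized effective speedup implied by the post's comparison.

link_bytes_per_sec = 100e6 / 8 * 0.5     # 100 Mbit/s at 50% utilization

def days(payload_tb: float, effective_speedup: float) -> float:
    return payload_tb * 1e12 / (link_bytes_per_sec * effective_speedup) / 86_400

print(f"10 TB with 2x compression:       ~{days(10, 2):.1f} days")
print(f"370 TB with ~74x effective gain: ~{days(370, 74):.1f} days")
```

Both land at roughly the same nine-to-ten-day window, which is the point: the appliance moves 37 times the payload in the time the plain link needs for 10TB.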

The effort to install, in reality, is not that significant. Silver Peak appliances come in physical and virtual form, with the virtual appliances being a lot quicker to spin up and a little cheaper to acquire. Just make sure your routers are on a relatively recent IOS release that supports WCCP, and you can quickly deploy the virtual appliance in both locations.

Additional Benefits

Aside from moving data quickly, there are other benefits, such as improved voice calls (UDP packets that arrive out of order can be reassembled in the correct order), faster response times for applications over the wire, and improvements for pretty much any traffic that runs over TCP/IP. If it were me, I would simply compare the cost of expanding the performance of the circuit versus adding a WAN Opt product in between. For most locations in the United States, circuits are expensive and bandwidth is limited, so you’re likely better off with a Silver Peak at both ends to save both time and cost.

If it were me, I would simply compare the cost of expanding the performance of the circuit versus adding a WAN Opt product in between.

Of course, don’t just take my word for it. Run a POC on any network that you’re having problems with, and you’ll find out soon enough if WAN Optimization is the way to go.

Photo credit via Flickr: Tom Raftery

Sneakernet vs. WAN: When Moving Data With Your Feet Beats Using The Network

By | Disaster Recovery, Networking, Strategy | No Comments

Andrew S. Tanenbaum was quoted in 1981 as saying “Never underestimate the bandwidth of a station wagon full of tapes hurtling down the highway.”

The underlying story is written up in the non-fiction section of Wikipedia. It comes from NASA’s Deep Space Network, between the tracking station at Goldstone, CA and the Jet Propulsion Laboratory about 180 miles away. In a scenario as common today as it was 30 years ago, a backhoe took out the 2,400bps circuit between the two locations, and the estimate to fix it was about one full day. So, they loaded a car with 9-track magnetic tapes and drove it 3-4 hours from one location to the other to get the data there six times faster than over the wire.

So, they loaded a car with 9-track magnetic tapes and drove it 3-4 hours from one location to the other to get the data there six times faster than over the wire.

That got me to thinking about IT and business projects that require pre-staging data. Normally, we IT folks get wind of a project weeks or months in advance. With such ample notice, how much data can we pre-stage in that amount of time?

With a simple 100Mbit connection between locations and a conservative compression ratio, we can move nearly 1TB of data in a day. That is plenty for moving source installation files, ISOs, and even large databases. Remembering that our most precious resource is time, anything a script or computer can do instead of us doing it manually is worth careful consideration.

Below is a chart listing out common bandwidth options and the time to complete a data transfer.

[Chart: common bandwidth options and the time to complete a data transfer (original image)]
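The original chart is an image, but the underlying arithmetic is simple enough to sketch. The 10TB payload and 2x compression match the examples in this series; the 50% utilization assumption and the list of circuit sizes are mine.

```python
# Transfer-time table for a 10 TB payload across common circuit sizes,
# assuming 2x compression and only half the line available to us.
# The circuit list and the 50% figure are illustrative assumptions.

payload_tb = 10
compression = 2.0
utilization = 0.5
links_mbps = {"T1 (1.5 Mbit)": 1.5, "10 Mbit": 10, "100 Mbit": 100,
              "1 Gbit": 1_000, "10 Gbit": 10_000}

for name, mbps in links_mbps.items():
    bytes_per_sec = mbps * 1e6 / 8 * compression * utilization
    days = payload_tb * 1e12 / bytes_per_sec / 86_400
    print(f"{name:>14}: {days:8.1f} days")
```

At 100Mbit the 10TB payload takes a little over nine days, which is the roughly 1TB-per-day pace mentioned above; a gigabit circuit brings it under a day.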

The above example is not as much about data center RPOs and RTOs, as it is about just moving data from one location to another. For DR objectives, we need to size our circuit so that we never fall below the minimums during critical times.

For example, if we have two data center locations with a circuit in between, and the daily change rate on 100TB of data is 3%, we still need to find the peak change-rate window before we can size the circuit properly.

[Chart: circuit sizing for the replication example (original image)]

If 50% of the data change rate occurs from 9am to 3pm, then we need a circuit that can sustain 250GB per hour. A dedicated gigabit circuit can handle this traffic, but only if it’s a low-latency connection (the locations are relatively close to one another). If there’s latency, we will most certainly need a WAN optimization product in between. But in the event of a full re-sync, it would take 9-10 days to move all that data over the wire, plus the daily change rate on top of it. So unless we have RPOs and RTOs measured in weeks, or unless we have weeks to ramp up to a DR project, we will have a tough time during a full re-sync, and we wouldn’t be able to rely on DR during that window.
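Here is the sizing arithmetic from that example as a small sketch, so the 250GB-per-hour and 9-10 day figures are easy to reproduce with your own numbers.

```python
# Circuit sizing sketch for the replication example above:
# 100 TB protected, 3% daily change, half of it landing in a 6-hour
# window, plus how long a full re-sync takes on a dedicated 1 Gbit circuit.

protected_tb = 100
daily_change = 0.03
peak_share, peak_hours = 0.5, 6

changed_tb_per_day = protected_tb * daily_change
peak_gb_per_hour = changed_tb_per_day * 1000 * peak_share / peak_hours
required_mbps = peak_gb_per_hour * 1e9 * 8 / 3600 / 1e6

print(f"Daily change: {changed_tb_per_day:.1f} TB")
print(f"Peak window:  {peak_gb_per_hour:.0f} GB/hour "
      f"(~{required_mbps:.0f} Mbit/s sustained)")

resync_days = protected_tb * 1e12 / (1e9 / 8) / 86_400
print(f"Full re-sync of {protected_tb} TB at 1 Gbit: ~{resync_days:.1f} days")
```

That works out to roughly 556 Mbit/s sustained during the peak window, which is why the gigabit circuit handles it, and a nine-plus-day full re-sync, which is why it might not.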

So, that might be a case where it makes sense to sneakernet the data from one location to the other.

Photo credits via Flickr: Nora Kuby

Why Thinking Like A Child Matters In Business And IT

By | Strategy | No Comments

They say that youth is wasted on the young. Children often have the luxury of acting without thinking, doing and then failing, and then just getting back up again with little more cost than a scraped knee and a bruised ego.

When I became old enough to drive, I would explore every road I could. I would venture far out onto unknown country back roads, not knowing or caring much whether my old pickup truck would break down or I would run out of gas, both of which happened so often that I became practiced at parking on hills so that I could roll-start the engine or coast to a gas station easily. Tools and tow ropes were my friends in a world where teenagers didn’t really have cell phones.

There’s something to be said for that same spirit and childish attitude in the adult world of business and IT: exploring new technology could be the road to success in a time of fierce competition and the 100 mile-per-hour pace of technology today. We can barely keep up.

The problem, though, is not just having time. It’s risk. If sticking your neck out is the road to advancement, and the key to unlock your Porsche is the execution of an important project, then preparation is the airbag that will save you when a car pops out of nowhere.

Mark Horstman from Manager-Tools once said:

Managers should not try to reduce risk in business because risk is constant and cannot be reduced. Instead, we can educate ourselves to better understand, quantify, and prepare for risk so that we can make higher quality decisions to achieve the best possible outcome, while at the same time choosing a path or solution that has the best risk/reward outcome.

In other words, take the “no pain, no gain” concept and factor in what happens if you push too far, get really hurt, and have a major setback. What’s the sweet spot of pushing hard but not too far? The answer may lie in the insights of the Marshmallow Challenge, which talks of executives and kindergartners.

In business and IT, we need to take chances on newer technology, because it’s the only way to advance forward.

But as adults in positions of power, our failures can affect hundreds or thousands of people’s lives. So preparation is the key to a successful road trip. Having a plan B isn’t enough when Murphy’s Law is at hand. Have three or four backup plans. Before departing, check the oil and the air in the tires, especially the spare. Have an emergency kit with food, water, a knife, a map, and duct tape. It costs like twenty bucks. Save the phone number for AAA towing.

Research and planning are necessary in any endeavor, but so is talking with others who have traveled the same roads or been to the same destinations. And for God’s sake, have fun. My most rewarding vacations included experts who knew how to rock-climb at Joshua Tree, landed a helicopter on top of a Hawaiian waterfall, or negotiated a class 5 rapid in West Virginia. So use an IT partner who knows the ropes and has done this before.

One last question: where do we want to go?

Photo credits via Flickr: Seema Krishnakumar

Flash Is Dead: The Next Storage Revolution Is About CPUs and RAM

By | Storage | No Comments

Alright, flash is not dead; it’s thriving. People love SSDs in their phones and laptops because they’re so much faster than traditional hard drives. They are faster because they have lower latency, which is to say that they allow the computer to “wait less.”

SSDs operate in the millisecond to tenth-of-a-millisecond range, whereas typical mechanical hard drives sit in the 6-10 millisecond range. That’s about 10x lower latency, which equates to roughly twice as fast in the real world. You don’t often find technologies that are ten times faster than the ones before them. But imagine something one hundred or a thousand times faster than even SSDs.

No problem for supercomputers costing millions of dollars: just use CPUs and RAM, because they operate in the nanosecond realm, which is 1,000 times faster than a microsecond. We’re talking about going from a 10x improvement in performance to 100x or 1000x. Imagine an entire datacenter running in RAM with no disks. Stanford University has made the case, and they are calling it RAMCloud.

“But imagine something one hundred or a thousand times faster than even SSDs.”

Cost aside, let’s try to put these types of potential speed increases into perspective and solve for cost later.

A Scenario To Consider

Let’s assume a typical 3.0GHz processor today in 2014 can perform some basic calculations and transfer data inside the chip itself in 10 nanoseconds, or 30 clock cycles. Perhaps the human equivalent in speed would be someone asking you to solve a simple math problem in your head: what’s 2+4+3+4? You quickly add up that it’s 13, and it takes you 2 seconds from start to finish.

Now suppose that the CPU has to go back to DRAM for this data, because it doesn’t have the information handy to respond immediately. Going back to DRAM can take an additional 9-13 nanoseconds, even with today’s faster DDR3-1866 RAM; DRAM still runs at 200-300MHz as a base clock speed, even if the bus speed itself is a lot higher. So going back to DRAM can double the time it takes for a CPU to execute a task. It would take a human 4 seconds instead of 2. But twice as slow is nothing compared to how much slower storage is compared to RAM.

Continuing with the math problem analogy, suppose the problem required all of its data to come from fast SSD-backed storage with a latency of around 50 microseconds. Even with this fast storage, the simple math problem would require the computer’s CPU to spend 50,000 nanoseconds completing the answer, when in fact it could do the work itself roughly five thousand times faster. In human terms, the four-second calculation of adding 2+4+3+4 would take you nearly three hours to complete.

“But twice as slow is nothing compared to how much slower storage is compared to RAM.”

[Photo: man doing math on a chalkboard]

In reality, you wouldn’t need storage to make such a simple calculation, because you could afford to keep your code small enough to cache in DRAM or in the CPU’s registers themselves. But the problem becomes much more pronounced when you have to fetch real data from storage, which happens all the time in real systems.

Perhaps a more complex math problem would best illustrate. Advanced math problems can require reading a paragraph describing the problem, looking at the textbook for a hint, going over notes from class, and finally scribbling it down on paper. This process can take minutes per problem.

If we used the same storage analogy in computing terms, a five-minute problem that could be solved completely in your head would instead take seventeen days to complete if we had to do it via the human equivalent of storage systems, which is to say, going back to the disk storage system and sending data back and forth. And that’s with SSDs. If we had mechanical drives, it would take nearly three months to do the problem. Imagine working on something all winter and completing it just as spring starts in. It had better be worthwhile, and I would say that looking for the cure to cancer, predicting tornadoes, or developing automated cars certainly are.
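Since the analogy scales linearly, it is easy to put into code. The 10ns-equals-2-seconds baseline and the latency tiers below are the illustrative round numbers used in this post, not measurements.

```python
# Scale device latencies to "human time" using the post's analogy:
# ~10 ns of on-chip work is like a person spending 2 seconds on a
# simple sum; everything else is that same ratio applied to slower
# tiers. All latencies are illustrative round numbers.

baseline_ns, baseline_human_s = 10, 2
tiers_ns = {
    "CPU cache / registers": 10,
    "DRAM round trip":       20,
    "Fast SSD storage":      50_000,       # tens of microseconds
    "Mechanical hard drive": 10_000_000,   # ~10 ms
}

for name, ns in tiers_ns.items():
    human_s = ns / baseline_ns * baseline_human_s
    if human_s < 120:
        pretty = f"{human_s:.0f} seconds"
    elif human_s < 86_400:
        pretty = f"{human_s / 3600:.1f} hours"
    else:
        pretty = f"{human_s / 86_400:.0f} days"
    print(f"{name:<22} {ns:>12,} ns  ->  {pretty}")
```

The on-chip and DRAM tiers stay in “seconds,” while the storage tiers blow out to hours and weeks, which is the whole argument of this post in four lines of output.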

Today

So how does this play out in the real world today in 2014? Well, companies like Microsoft, EMC, Nimble, PureStorage, SAP, etc. are all taking advantage of using CPUs and RAM to accelerate their storage solutions. Today’s applications and users can wait milliseconds for data, because they were built to be used with mechanical hard drives, WAN connections, and mobile phones. So the storage companies are using CPUs and RAM to take in IO, organize data, compress it, dedupe it, secure it, place it in specific locations, replicate it, and snapshot it, because CPUs have so much time on their hands and can afford the nanoseconds to do so. They are using off the shelf Intel CPUs and DRAM to do this.

But the idea of waiting milliseconds today will seem absurd in the future. This lazy approach will someday soon change as CPUs and RAM continue to get faster than SSDs. In time, SSDs are going to be much too slow in computing terms, so we are going to see further advancements on the storage front for faster storage and memory technologies.

Things like PCM (Phase Change Memory), Spin-torque tunneling MRAM, Racetrack memory or DWM (Domain-Wall Memory) are technologies in development today. CPU frequencies are not increasing, but parallelism is, so the goal will be to place more RAM and storage closer to the CPU than before, and use more threads and cores to execute on data.

“In time, SSDs are going to be much too slow in computing terms, so we are going to see further advancements on the storage front for faster storage and memory technologies.”

Tomorrow

If you have to wonder why CPUs and RAM are the keys to future storage performance, the reason is simply because CPU and RAM are hundreds of times faster than even the fastest storage systems out there today. Cost can be reduced with compression and deduplication.

And I’m betting that this speed-discrepancy gap will continue for a while longer, at least over the next 3-5 years. Take a look at Intel and Micron’s Xeon Phi idea using TSVs (through-silicon vias), which should make its way to commoditization in a few years. This will augment other advances in memory and storage technologies, driving the discussion from milliseconds of storage latency to microseconds and nanoseconds in the years to come.

Photo credits via Flickr, in order of appearance: Pier-Luc Bergeron; stuartpilbrow.

Reconsidering 2-Node Architecture: Design Your Data Center With “3s and 5s”

By | Design & Architecture | No Comments

In the enterprise IT world, most people would agree that putting all your eggs in one basket is not ideal. However, if you scale too wide, you have to manage all that overhead. That’s why a brilliant and experienced enterprise architect once told me that he prefers “3s and 5s.”

If you have anything in your data center that relies on redundancy, consider rethinking any design that is active/standby or active/active with only two nodes. In an active/active pair, you have to monitor to ensure neither node goes above 50% utilization; in an active/standby pair, one node sits idle while the other does all the work. In either case, you’re not getting full efficiency out of your systems.

Design Comparison Matrix

Consider the following matrix comparing number of nodes versus efficiency, resiliency, and performance:

[Matrix: number of nodes vs. efficiency, resiliency, and performance (original screenshot)]
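Since the original matrix is a screenshot, here is a rough sketch of the kind of math that drives it, assuming each cluster must be able to lose one node; the columns in the original may have been defined differently.

```python
# Usable capacity if an N-node cluster must survive one node failure.
# Two nodes strand half the hardware as headroom; the 3-to-8 node range
# the post argues for keeps most of the capacity usable.

for nodes in (2, 3, 4, 5, 6, 8):
    usable = (nodes - 1) / nodes
    print(f"{nodes} nodes: run each node at <= {usable:.0%} "
          f"({1 - usable:.0%} of total capacity reserved for failover)")
```

Two nodes cap you at 50% utilization, three nodes already push that to 67%, and by eight nodes you are reserving only an eighth of the hardware for failures.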

But why 3s and 5s and not 4s and 6s? “Because it forces a recode of the software and moves to a structure that is a truly cloud-like design instead of old-school, fault-tolerant or highly available designs,” says David Bolthouse, enterprise architect.

Look how inefficient and rigid 2-node systems appear in this comparison. You cannot burst all that much when the business needs to burst, whether due to an acquisition, a sudden demand spike, or the need for headroom at peak usage times.

I’m suggesting that there is a sweet spot somewhere between 3 and 8-node systems.

If you have a blank slate on new designs for servers, networking, and storage, then consider looking at multi-node architectures. This concept can also be applied to DR and Cloud, as just having two sites—one production and one failover site—is not the future of cloud-based data centers.

If you have a blank slate on new designs for servers, networking, and storage, then consider looking at multi-node architectures.

The whole point of cloud is to distribute workloads and data across systems to achieve a higher level of overall efficiency, resiliency, and performance. Notice I didn’t say cost. We’re not quite there yet, but that’s coming. Distributing your risk across nodes and locations will eventually drive down costs, not just from the additional efficiencies gained, but also because you will be able to afford to invest in more value-driven products.

Take Cleversafe, which distributes data across multiple nodes and multiple locations. The low cost of this object-based storage allows for all that redundancy while still keeping costs under control. Instead of thinking about your ability to recover from a failure, you will be thinking about how many failures you can sustain without much business impact. If the applications are written correctly, there may be very little business interruption after all.

Photo credit: Thomas Lieser via Flickr

Run Your Data Center: On iPhones?

By | Uncategorized | No Comments

In 2010, at a D8 conference, Steve Jobs made the famous analogy that “back when we were an agrarian nation, all cars were trucks, because that’s what you needed on the farm …

But as vehicles started to be used in the urban centers, cars got more popular. Innovations like automatic transmission and power steering and things that you didn’t care about in a truck as much started to become paramount in cars … PCs are going to be like trucks. They’re still going to be around, they’re still going to have a lot of value, but they’re going to be used by one out of X people.”

Consider the three following data points:

Someday, perhaps in this decade, phones may be displaced by personal wearable computers in the same way that desktops are being replaced by mobile devices. There’s a lot of computing power each individual is carrying around with them, all day, every day. So this got me to thinking about technologies in the data center and how we could leverage the power of the phone. Literally.

“What would a 42U rack full of iPhones look like?”

What would a 42U rack full of iPhones look like? The specs of the original iPhone 5 include a dual-core 1.3GHz CPU, 1GB of RAM, multiple network adapters, and plenty of fast, low-latency solid-state storage. It weighs about ¼ lb. and costs about $849 (retail).

We could easily fit 1,152 phones in a 42U rack: that’s 2,300 cores, 1.15TB of RAM, 72TB of storage, and 270 Gigabits of network and storage bandwidth. It would weigh 300-400 lbs and cost around $978,000. I’m not suggesting that people would actually drop off their phones at the data center; but since users are already bringing phones into the workplace, and the hardware has already been paid for, what’s missing to make this actually work?
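For what it’s worth, the rack math checks out; here is the arithmetic using the per-phone figures quoted above (the 64GB storage tier is my assumption, since the post doesn’t name one).

```python
# Back-of-the-napkin totals for a 42U rack of iPhone 5s, from the
# per-phone figures above. The 64 GB storage tier is an assumption.

phones = 1152
cores = phones * 2                    # dual-core A6
ram_gb = phones * 1                   # 1 GB each
storage_tb = phones * 64 / 1024       # assumes the 64 GB model
weight_lbs = phones * 0.25            # ~1/4 lb each
cost = phones * 849                   # ~$849 retail each

print(f"{phones} phones: {cores:,} cores, {ram_gb:,} GB RAM, "
      f"{storage_tb:.0f} TB storage, ~{weight_lbs:.0f} lbs, ~${cost:,}")
```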

My opinion is that desktops are declining and VDI isn’t taking off because the personal, mobile aspect of computing hasn’t made its way back to the phone where it clearly needs to be. The reason why the phone has taken off so well is because it’s personal, mobile, and we always have it on us. We can’t live without it. It’s powerful, and is the center of the personal IT universe. So we as IT need to find a better way to run our enterprise applications on it.

“The phone is the center of the personal IT universe. So we as IT need to find a better way to run our enterprise applications on it.”

Anyone who has already created an app to run their software on these phones is ahead of the game. But we still need someone to write the software that makes it possible to leverage the existing enterprise applications on the phone, and more importantly—and here’s where I think there’s a hidden gem of an opportunity—figure out a way to leverage the CPU, RAM, and storage in the phone to offload traditional data center costs.

For example, VDI processing actually runs on the data center servers, but is this ideal? Why not leverage the CPU on the phone somehow? I hate to see those 2300 cores just sitting there, mostly idle. Seems like such a waste.

Photo credit: moridin3335r via Flickr

Spending Money To Make Money: An IT Strategy That Really Works?

By | Strategy | No Comments

Tell me if you’ve heard this one before: “Look, if I spend more than $50 at this store, I get 20% off!” Or how about adding an extra item to Amazon’s shopping cart to qualify for free shipping? You have to wonder if these tactics are win/win, or if they are just getting us to think we are saving money while in reality, we are being tricked into spending more. I think the answer to that question just leads to more questions, such as, what are you buying, how often, etc.

But I admit that sometimes it really does make sense to spend more money to save in the long run—you just have to put some work into finding out how. I think there are certain strategies that work better than others. Consider the difference in cost between hardware and software.

Game Console Comparison

An Xbox or PlayStation might sell for $400 (the hardware), but what is the cost of all the games you might purchase over the lifetime of owning it (the software)? Assuming the average owner buys between 5 and 10 games at roughly $60 per game, it wouldn’t take long before the software costs exceed the hardware cost.

If your plan is to play and use as many games as possible, a good strategy would be to trade, buy used, or use a rental service. However, if you only planned to have a few favorite games, your goal would be to wait until the console and game packages became available with a one-time coupon or sale.

Having a strategy for enterprise hardware and software can make sense in a similar fashion. Let’s use the example of Microsoft and VMware software licensing, as opposed to the hardware (the server and Intel CPU) that runs that software. Which one is more expensive? What’s the TCO for each? And where do you get the most value for your enterprise?

Scenario

Here’s a scenario for you: suppose you have blade servers in your environment that are approximately 2.5 years old, and your budget includes money to potentially refresh those servers at year 3. Like any good custodian of your company’s money, you want to get the best price with the best solution.

In some cases, the budget gets taken by other projects. So while you might benefit from newer servers, it might be tough to justify to the business how to spend the money on new hardware, when the current hardware is “just fine.” You also might have a Microsoft or VMware renewal coming, which typically are also every 3 years.

So what’s the best way to maximize your current budget and get the most value for you and the organization? Let’s run the numbers and see what happens.

The below table might illustrate your current scenario. Let’s assume you bought the original blades for $10,000 each using the Intel Xeon x5690 6-core CPU for a list price of $1,663 per socket and loaded them up with RAM. If you have even older 4-core CPUs in your blades, then you stand to gain even more benefit; but, let’s look at the 6-core CPUs for now, since they are still in service in a lot of places and will likely run for a long time without issues.

Scenario A

Scenario A               | Software | Hardware (6-core) | Total $ | $-per-VM
Windows 2012 Std         | $882     | $8k + $1,663 CPU  | $10,545 | $5,272 (2 VMs)
Windows 2012 DTC         | $4,809   | $8k + $1,663 CPU  | $14,472 | $1,608 (9 VMs)
vSphere Enterprise Plus  | $3,495   | $8k + $1,663 CPU  | $17,967 | $1,633 (11 VMs)

(The vSphere row’s total builds on the Windows Datacenter row above it, since you still need the guest OS licensing.)

Microsoft will allow you to run Windows 2012 Standard and virtualize up to 2 instances per license. That’s not really a bad deal at around $440 per OS instance. The Intel CPU chip itself is $1,663 list, nearly four times the cost of an OS instance, and if the server was $8,000, the true cost to own each instance is around $5,300. Again, not bad from a value perspective, especially if you are running your critical business on those two instances (let’s leave high availability, backup, and DR out of this for now to keep it simple).

But most companies have a lot more servers than that. If you had, let’s say, 30 servers at $10k apiece, you are looking at $300,000 to run this environment.

So, in the case where you have hundreds of VMs, it may make sense to consolidate into denser workloads to save time, cooling, space, and last but not least … cost.

Using the same CPU model and running Windows Datacenter and VMware Enterprise Plus editions, the software cost is going to be significantly more expensive than a single processor. And while you are getting a lot more features using more advanced software, the added cost can be offset by increasing your density. Suppose that you can run up to 9 VMs using Hyper-V (11 on VMware taking advantage of SIOC, NIOC, SDRS, and other enterprise features) on each of those $10,000 blades. The cost per VM goes way down, from $5,300 to around $1,600.

This is really nothing new and is one big reason why virtualization has gotten traction over the years. Yes, $3,500 per CPU for virtualization is a big chunk of change, but I will gladly pay that over the old way of building servers for all the other benefits, including cost.

Now, let’s run the numbers by looking at what this would cost to buy all brand new servers and try to get even higher density. The new Intel E5-2697 v2 12-core CPUs are out in the marketplace, and they list for more money: $2,618 to be exact, as of May 2014. I could have used the 15-core CPUs, but those start at $6,400+, so I doubt the 3x cost increase for minimal performance improvement would be worth it.

Scenario B

Scenario B               | Software | Hardware (12-core) | Total $ | $-per-VM
Windows 2012 Std         | $882     | $8k + $2,618 CPU   | $11,500 | $5,750 (2 VMs)
Windows 2012 DTC         | $4,809   | $8k + $2,618 CPU   | $15,427 | $857 (18 VMs)
vSphere Enterprise Plus  | $3,495   | $8k + $2,618 CPU   | $18,922 | $860 (22 VMs)

(As in Scenario A, the vSphere row’s total includes the Windows Datacenter licensing.)

Using this scenario, we can drop the cost per VM from around $1,600 to around $860, or around 47% cheaper. Wow, that’s a big difference, and if your server budget is $300,000, you can save around $141,000 to free up for other projects. Granted, budgets don’t always work that way; sometimes if you don’t use it, you lose it. If that’s the case, you can still spend the full $300k, but use the $140k you saved to buy more servers, add more RAM, or install more IO cards and graphics cards.
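The savings math is simple enough to spell out; here is the quick version, using the per-VM figures from the two tables above.

```python
# Per-VM savings from Scenario A (6-core) to Scenario B (12-core),
# using the vSphere Enterprise Plus rows from the tables above.

old_per_vm = 1_633     # Scenario A
new_per_vm = 860       # Scenario B, same software stack

print(f"Cost per VM drops from ${old_per_vm:,} to ${new_per_vm:,} "
      f"({1 - new_per_vm / old_per_vm:.0%} cheaper)")
```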

But hold on a second. Is this really a fair comparison? What ever happened to the servers I already bought 2.5 years ago? Isn’t Scenario A one where the $10,000 servers and CPUs are already a sunk cost? Why wouldn’t I make the comparison that the hardware cost is zero in that case, compared to a new server? Well, if we did that, we’d have to include some other costs, such as hardware and software maintenance, and the results are surprising:

Scenario C

Scenario C                   | Legacy 6-core server (HW & SW TCO)  | New 12-core server (HW & SW TCO)
Hardware                     | $0                                  | $10,618
Software (Windows & VMware)  | $8,304/CPU                          | $8,304/CPU
50 VMs                       | $49,824 (3 double-CPU blades)       | $24,912 (3 single-CPU blades)
200 VMs                      | $149,472 (9 double-CPU blades)      | $66,432 (4 double-CPU blades)
600 VMs                      | $448,416 (27 double-CPU blades)     | $199,296 (12 double-CPU blades)
1200 VMs                     | $913,440 (55 double-CPU blades)     | $381,984 (23 double-CPU blades)
HW maintenance, 3 yr         | $6,000/BC + $900/server             | $6,000/BC + $900/server
SW maintenance, 3 yr         | 20%/yr of cost ($8,304*0.2*CPUs*3)  | 20%/yr of cost ($8,304*0.2*CPUs*3)
50 VMs, maintenance only     | $8,700 HW + $24,912 SW              | $8,700 HW + $14,946 SW
200 VMs, maintenance only    | $11,400 HW + $94,665 SW             | $9,600 HW + $39,859 SW
600 VMs, maintenance only    | $28,200 HW + $274,032 SW            | $16,800 HW + $119,577 SW
1200 VMs, maintenance only   | $73,500 HW + $543,081 SW            | $32,700 HW + $229,190 SW
50 VMs TCO                   | $83,846                             | $48,558
200 VMs TCO                  | $255,537                            | $115,891
600 VMs TCO                  | $751,528                            | $335,673
1200 VMs TCO                 | $1,530,021                          | $643,874

Wow! Those are considerable savings if you include maintenance and software costs.

I calculated these costs by assuming that blades require hardware and software support on the chassis and the blades themselves. I also added in the Microsoft and VMware maintenance costs, which I’ve estimated to be roughly 20% per year.

Finally, the performance difference between the 6-core CPUs and the 12-core CPU is more than 2x (it’s not just double the core count). There’s probably anywhere from 10%-60% increase in performance given all the other improvements. So, instead of 11VMs per socket and just doubling it due to core count, I assumed a conservative additional 20% increase in density for these new CPUs for a total of 26 VMs per CPU socket. Note, the 50-VM example will have 3 nodes for redundancy, but only 2 nodes would have worked for capacity purposes.
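Two of the working figures behind Scenario C come straight from the earlier numbers; here is a tiny sketch that derives them (it adds nothing beyond the arithmetic itself).

```python
# Two working assumptions behind Scenario C, spelled out:
# the per-CPU software cost and the density uplift for the new CPUs.

windows_dtc = 4_809
vsphere_ent_plus = 3_495
software_per_cpu = windows_dtc + vsphere_ent_plus      # -> $8,304/CPU

old_vms_per_socket = 11          # 6-core CPU, from Scenario A
core_count_factor = 2            # 12 cores instead of 6
per_core_uplift = 1.2            # the conservative 20% assumption above
new_vms_per_socket = int(old_vms_per_socket * core_count_factor * per_core_uplift)

print(f"Software per CPU: ${software_per_cpu:,}")
print(f"New density assumption: {new_vms_per_socket} VMs per socket")
```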

Based on Scenario C, it may seem surprising that you can toss your old servers, buy new ones, and still save money, but the numbers work out.

Of course, there is a big time investment in doing this as well; however, even if you paid someone to do it, you’d still likely save money.

To think of this another way: 80% of the cost of running your “free” (already paid-for) servers is the hardware/software maintenance and licensing that goes along with them. So when you’re looking at software licensing, and you have the option to either add capacity to your VMware farm or simply pay up the software licenses, it almost always makes sense to buy new hardware along with the software.

Here’s another scenario: let’s say that instead of the 6-core CPUs, you have 2-year-old 8-core CPUs and you’re curious when it would be a good time to refresh. If you do the math, you find that you would only have to pay about 20% more to swap these perfectly good servers for new ones. This is one reason why a lot of companies look for a 3-year refresh cycle in IT, even though servers can and do run a lot longer than that.

One final scenario: consider the case where you have a standard server, but a new more powerful CPU is released, and you need to add capacity to your existing farm due to business growth. As much as it makes great sense to standardize on one platform and keep buying the older model, consider the cost implications of doing so.

I’d argue that it might be more important to standardize on the hypervisor or management suite rather than the server itself.

Consider just going to a different CPU model and leaving everything else in the server the same. The cost savings of adding 16-core CPUs when they become available, even if you already have 12-core CPUs, can be difficult to deny. VMware’s Enhanced vMotion Compatibility (EVC) feature, for example, allows you to mix CPU generations from the same manufacturer in the same cluster while keeping vMotion working across them.

Granted, there are scenarios where this strategy doesn’t make sense. If you are running low-cost software, or you choose not to renew maintenance on software, or you have a hardware model that allows you to skip support on it, then scaling out wide might make sense. Then again, we didn’t even talk about the power and cooling costs associated with larger numbers of servers.

Conclusion

The shift in thinking about recycling more often is not about disposing of perfectly good hardware. It’s about looking at the total cost of ownership, the software and maintenance costs especially, and building a strategy around where you want to spend your money.

In today’s world of rapidly evolving hardware and software, the hardware is still plenty cheaper than most software. As long as that’s the case, I prefer to have the best hardware available and look to keep my software costs under control.

This strategy is just the tip of the iceberg. When it comes to really expensive software, such as SQL Server or Oracle, the software can be 10x the cost of the hardware. Look to SQL consolidation projects to get your overall costs down while acquiring new hardware at the same time. And remember, the proof of the pudding is in the eating, as they say, so always “do the math” to make sure the dollars add up.

Photo credit: Craig Piersma via Flickr

Top 10 Concepts That Matter In Technology

By | How To | No Comments

We IT professionals are always looking for ways to make our lives easier. It’s not because we are lazy—although some of the best solutions out there require the least amount of human intervention. The reason is that our goal is to provide the best reliability, stability, and service to the business possible. So if the business isn’t happy, chances are we are not happy either. After working in IT for the last 15 years, I have seen some things go really well, and others go horribly wrong. Below is a short list of ten concepts that really matter in IT—things that will help make you, the IT professional, successful.

1) The Cloud is what you want it to be.

Don’t get me wrong, there is the NIST definition of Cloud, which is a good definition and a great start. But remember, what you actually do with the Cloud—how you solve business problems—is way more important than how it’s defined. Expand existing virtualization technologies, including storage and network virtualization, to put you and your company in the best position possible for the future. Today, that usually means to start building out a hybrid cloud strategy so that you are ready and able to move between public and private spaces with ease.

2) Learning how to break stuff will make you (and your technology solutions) better.

Before you implement a solution, and certainly before you put production users and data on it, try to break it. Fail it over. If it’s a server, pull the network cord or the power cord and see what happens. You’d be surprised to learn that solutions designed with full redundancy and failover may not behave as expected in the real world. Taking something apart oftentimes teaches you more about how it works than what you might find in a whitepaper. At the least, you will know where the weak point is, which someone else may not, and you can share what you find with others.

3) Software-Defined Data Center is powerless without the hardware to match.

If you haven’t heard about SDDC, it’s what nearly every technology manufacturer seems to talk about these days as the key to being agile, flexible, cloud-ready, and cutting edge. And I agree completely. I’m a hardware guy through and through, so I just have to say, now more than ever, the hardware is a key component of SDDC. When you hear the term “commodity,” you might think that it doesn’t matter what the underlying hardware is. That could be a big mistake. If you look under the covers of most storage system controllers being sold today, you’ll most likely find Intel processors. Knowing the differences among the Intel CPUs in those architectures can mean the difference between half and double the performance from one vendor to another. Every release of an Intel processor can have a very large effect on the performance of the SAN. I’m not suggesting that we have to know the intricate details of Sandy Bridge vs. Ivy Bridge (Intel code names), but we need to keep the marriage of software and hardware in mind when designing today’s data center and cloud solutions.

4) It is (still) about the latency, stupid.

A famous 1996 Stanford article discusses bandwidth and latency, and illustrates that solving bandwidth problems is easy, but latency is a physics problem (the speed-of-light limitation) that cannot be overcome easily, or at all. More than ten years later, fiber has replaced modems in many locations, but WAN latency is still a major factor in network performance, affecting availability, DR, backups, and client connections. Today, latency on the storage side is more important than ever. Most application performance problems today are not due to bandwidth but to latency, and much of that latency is in storage. IOPS are still worth discussing, but they are not very meaningful without the associated IO size and latency figures to match.
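To put a number on that last point, here is a quick sketch of why an IOPS figure means little without its IO size; the 100,000 IOPS figure is an arbitrary example.

```python
# The same IOPS number means wildly different throughput depending on
# the IO size behind it -- which is why IOPS alone tell you very little.

iops = 100_000    # arbitrary example figure

for io_size_kb in (4, 8, 32, 64, 256):
    throughput_mb_s = iops * io_size_kb / 1024
    print(f"{iops:,} IOPS at {io_size_kb:>3}KB per IO = {throughput_mb_s:,.0f} MB/sec")
```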

5) Somebody is doing this better, faster, smarter than you.

It’s nearly impossible to be the smartest person in the room. But even if you are, there are at least two big downsides to being this smart. First, your competition is gaining on you faster than you are maintaining your skills. There’s only one place to go when you’re on the top—and that is down, and it will happen sooner or later. Second, intelligence is overrated. Getting things done means cooperating with others, being creative, being persistent, and above all else, putting in time.

6) Seek out the smart ones and join them. If you can’t join them, mimic them.

Michael Dell once said, “If you are the smartest pro in the room, find another room.” If they won’t let you in, be humble and be persistent. If they don’t like you, check your ego: nobody likes a know-it-all. If you can’t join them, find out what they do and start doing it. Mimicry is a form of flattery, but it can also lead to success. You might be able to learn how to do it better than they do it themselves. Microsoft learned from IBM, AOL learned from Netscape, Palm Pilot became popular after Apple made their Newton, Facebook is MySpace 2.0, and so on.

7) Plan for worst-case scenarios and peak utilization

One reason Google is looking at automated cars is that they have done the math and found that the Interstate system is 90% free space on average. But does this statistic matter when most daily commutes hit bumper-to-bumper traffic at 8am or 5pm in most American cities? No, it doesn’t. A website that sells concert tickets can be up for four nines, but if it happens to be down for those 52 minutes a year just as tickets go on sale, that’s little consolation for the lost revenue the business needs to operate. If you assume the worst and build for the peaks, your customers will be less likely to be staring at an hourglass when they need you the most.
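For reference, here is the quick math behind that 52-minute figure (four nines of uptime over a year), along with its neighbors.

```python
# The "four nines" arithmetic: availability vs. downtime per year.

minutes_per_year = 365 * 24 * 60

for nines in (2, 3, 4, 5):
    downtime_min = minutes_per_year * 10 ** -nines
    label = "99." + "9" * (nines - 2) if nines > 2 else "99"
    print(f"{label}% uptime -> ~{downtime_min:,.1f} minutes of downtime per year")
```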

8) Be passionate about solving problems; don’t be a Brand X person.

Some technology companies are great, and others are frustrating and complicated. The second we say “Brand X is poor technology, I like Brand Y,” we discredit ourselves. We may lose respect, credibility, or even customers. Every technology has its place; maybe it’s a training issue, or maybe that technology really isn’t the best, but someone else may love it. There are often many ways to implement a technology, and your way is not always the best; the other way may work perfectly fine too. Stay passionate about your favorite technologies, and use that passion to make people’s lives easier by solving business problems, even if that means using technologies you’re not in love with.

9) Lose the words “always, never, can’t, but and no” from your vocabulary.

This is easy—just get rid of these words. Look, there are certain things that successful people say and do and sometimes it is not what you say, it’s what you don’t say. Lose these words and replace them with something else. Even if someone asks for the impossible, you can easily allude to the difficulty without saying no. There might be a large cost and/or risk associated with a big challenge, and you might assume they won’t pay the bill. I’ve seen a blank check handed over in response to someone saying “we would do this, but it would be too expensive.”

10) Have five back out plans.

You have to assume that your primary goal or solution will fail. Sometimes it’s political. That’s fine if, in the end, an alternate solution works and delivers on time and on budget. There’s much more value in having multiple purposes for a product or solution. Suppose a Tier-2 storage system is targeted for an end-user computing platform, but the company wants to change direction. You might use that storage for another application, allocate it to a test/dev lab, augment DR with it, or use it as a backup target. General-purpose solutions that excel in one area or another are a lot different from niche products that are really only good at one thing. In today’s fast-changing IT world, flexibility and agility matter.

Photo credit: holeymoon via Flickr
