Evaluating All-Flash Storage Part 2: Taking a closer look

All-Flash Storage Analysis

A few weeks ago I wrote “Evaluating All-Flash Storage Part 1: Mapping out the process,” discussing the excitement around getting to test a new product, specifically an All-Flash Array.

Getting to open boxes, break out new equipment, install it, and put it through its paces is fun. Unfortunately, the business side of things rears its ugly and less entertaining head, and the temporarily excited IT staff have to put their noses back to the grindstone. Testing is being requested for a reason, and the results of the tests will play an important role in determining not only the product that will be purchased, but quite possibly future directions for the company and the testers involved.

I said I would revisit the steps of the evaluation process and go into more depth on each one. You can view the original full list in my previous post.

For today’s article, let’s take a look at the first couple of steps on this list. We will examine later steps in subsequent articles.

A Closer Look at All-Flash Storage Evaluation Steps

Step 1: Identify problematic applications and look for storage-related bottlenecks.

In this step, the tester must work with users, application owners, DBAs, etc. to identify which applications are perceived to have performance issues. This part of step 1 is often incredibly easy: the interested parties tend to be very vocal and have probably already made their complaints known (they are likely the reason new hardware is being considered in the first place). It is then the job of the tester or testers to determine whether these applications are actually bottlenecked at the storage layer. This should be fairly easy to establish with array-based tools that show performance from the array's point of view, along with host- and application-based tools that show it from the server or application side. As a general rule of thumb, most modern disk arrays should be able to respond to almost all IO requests in less than 10 ms for most workloads. Do what you can to ensure that you are looking at actual server-to-disk IO performance, and confirm that there are no bottlenecks in the connectivity between the server and the storage (usually an FC or Ethernet/iSCSI network).
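If you want a quick host-side sanity check on a Linux server before digging into array tools, something like the sketch below can report average per-device IO service time. It samples /proc/diskstats twice and flags devices whose average latency over the interval exceeds the 10 ms rule of thumb. This is a rough illustration only; the interval, the threshold, and the decision to lump reads and writes together are my assumptions, and your array- and application-level tools remain the authoritative view.

```python
#!/usr/bin/env python3
"""Rough per-device IO latency check for a Linux host (illustrative sketch)."""
import time

THRESHOLD_MS = 10.0   # rule-of-thumb ceiling from the article, not a vendor spec
INTERVAL_S = 5        # sampling window in seconds

def read_diskstats():
    """Return {device: (completed IOs, ms spent on IO)} from /proc/diskstats."""
    stats = {}
    with open("/proc/diskstats") as f:
        for line in f:
            parts = line.split()
            name = parts[2]
            ios = int(parts[3]) + int(parts[7])    # reads + writes completed
            ms = int(parts[6]) + int(parts[10])    # ms spent reading + writing
            stats[name] = (ios, ms)
    return stats

before = read_diskstats()
time.sleep(INTERVAL_S)
after = read_diskstats()

for dev, (ios_now, ms_now) in sorted(after.items()):
    ios_then, ms_then = before.get(dev, (0, 0))
    delta_ios = ios_now - ios_then
    if delta_ios == 0:
        continue  # device was idle during the interval
    avg_ms = (ms_now - ms_then) / delta_ios
    flag = "  <-- investigate" if avg_ms > THRESHOLD_MS else ""
    print(f"{dev:12s} {delta_ios:8d} IOs  avg {avg_ms:6.2f} ms{flag}")
```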

Sort your list of problematic applications by how much they are affected by storage, especially storage performance. This is a good place to start when deciding which applications to test on new, higher-performing storage.

Step 2: Examine storage requirements for each application.

It may seem obvious, especially when hearing complaints from users about how slow things are, but performance is only one aspect of data storage. Proper storage for every application should include analysis of requirements for availability, reliability/recoverability, replication, and frequency of changes/provisioning, as well as performance.

Here are my definitions for each of these factors specifically relating to data storage:

  • Availability: The percentage of time the data must be available. This is usually designated as a number of nines; for example, five nines (99.999%) availability allows a little over five minutes of downtime per year (see the short calculation after this list). Most businesses expect applications to be accessible and available at least that much. In order to accomplish this, everything underpinning the application must be designed for at least that level of availability. Often there is only a single datastore, so that datastore must be available to the servers that access it at least 99.999% of the time. Most “enterprise” class arrays offer this level of availability, but be careful to understand how the vendor defines it. Some will only count UNPLANNED outages against their availability number. Unplanned outages are certainly worse than planned ones, but scheduling downtime does not eliminate it.
  • Reliability/Recoverability: The overall reliability of the data storage system and ability to recover from the failure of components and the overall system. Analysis should also include what it takes to recover data in the event of an overall system failure.
  • Replication: Ability of the system to replicate data within the same storage system and to remote storage system(s). Be sure to investigate what your application and regulatory replication requirements entail. Local and remote copies of data may or may not be write accessible by the same or different servers. Most flash arrays are capable of pointer-based LUN/disk snapshots within a single array, while many also include the ability to copy these snapshots to remote arrays. Some also include synchronous or asynchronous block/track replication technologies.
  • Change/Provisioning Frequency: Changes to the array can include hardware and software upgrades, downgrades, etc. How often these are necessary, how difficult and time-consuming they are to implement, and how much disruption they cause to data availability must all be considered. Beyond array changes, day-to-day management of the array is also very important: mostly provisioning and deprovisioning storage for servers, as well as setting up and tearing down any replication within an array and between arrays.
  • Performance: This is usually one of the biggest reasons people look at all-flash arrays. Most currently available enterprise disk arrays can satisfy most write IOs in less than 2 ms because they accept the data into cache (usually mirrored), acknowledge the write back to the host, perform any necessary parity, ordering, and other calculations, and subsequently write the data out to disks on the back end. Unless the back-end process can't keep up with the incoming data for some reason, the only part of this write path that really affects application performance is the initial write into cache. Read IOs, unfortunately, cannot count on the data always being in cache. Reads of data already in cache are just as fast as writes, but unless the dataset is fairly small, at least some reads will have to be served from disk. Intelligent caching and read-ahead algorithms attempt to predict what should remain in cache and can significantly improve the percentage of data read from cache, but reading from disk always adds latency compared to reading from cache (a back-of-the-envelope example follows this list). Improving this latency is the primary reason for choosing all-flash or solid-state disk over spinning magnetic disk. The ease of monitoring, managing, and reporting on the performance of the array is also very important, as these tasks can consume a significant portion of a storage team's valuable time.
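As a quick illustration of the "nines" arithmetic mentioned in the availability bullet above, the snippet below converts an availability target into an annual downtime budget. The numbers are straightforward arithmetic, not claims about any particular array:

```python
# Convert an availability target ("nines") into an annual downtime budget.
MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600 minutes, ignoring leap years

for nines in (3, 4, 5, 6):
    availability = 1 - 10 ** (-nines)                  # e.g. 5 nines -> 0.99999
    downtime_min = (1 - availability) * MINUTES_PER_YEAR
    print(f"{nines} nines ({availability:.5%} available): "
          f"{downtime_min:8.2f} minutes of downtime per year")
```

Five nines works out to roughly 5.26 minutes per year, which is the "little over 5 minutes" figure above.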

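Similarly, here is a hypothetical back-of-the-envelope calculation showing why cache hit rate dominates average read latency on a disk-backed array. The latency figures (0.5 ms from cache, 8 ms from spinning disk) are round numbers assumed for illustration, not measurements from any product:

```python
# Effective read latency as a weighted average of cache hits and disk reads.
CACHE_LATENCY_MS = 0.5   # assumed latency for a read served from cache
DISK_LATENCY_MS = 8.0    # assumed latency for a read served from spinning disk

for hit_rate in (0.50, 0.80, 0.95, 0.99):
    avg_ms = hit_rate * CACHE_LATENCY_MS + (1 - hit_rate) * DISK_LATENCY_MS
    print(f"{hit_rate:.0%} cache hit rate -> {avg_ms:.2f} ms average read latency")
```

Even at a 95% hit rate, the occasional read from disk keeps the average well above what an all-flash array can deliver on every read, which is exactly the latency gap the performance bullet describes.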
Stay tuned; more details to come in subsequent articles, including benchmarking, criteria prioritization, and more.