Analytics and Big Data are at a crossroads: effectively leveraging Hadoop map-reduce on larger and larger datasets is getting expensive to store and protect.
The traditional deployment model consists of name nodes and data node services running on the same hardware as compute layer services (job scheduling and execution). Hadoop File System (HDFS) data is protected at the server level and drive level by replication to another node through the protocol. The number of copies is a tunable parameter; however, best practices recommends 3 copies.
As systems scale to petabytes and beyond, the storage requirements to sustain 3 copies becomes astronomical.
Another key feature of HDFS is that the objects that are stored are generally larger blocks. These blocks are written sequentially and are never updated or overwritten. Essentially, they are WORM (Write Once, Read Many) objects until their relevance expires and they are deleted.
One of the tenants of traditional HDFS is that “Moving computation is cheaper than moving data.” As a storage guy, I would like to rewrite this tenant as “Let’s leverage computation for computation and optimize Data infrastructure to best serve the application’s data requirements.” I know, it’s a bit wordy, but it makes the point. There are a number of technologies that added the HDFS protocols.
Isilon added HDFS support to its OneFS code base with the 7.0 release. An Isilon cluster can scale from 3 nodes up to 144 nodes. These nodes can be one of 3 tiers:
- SAS (Serial Attach SCSI) and SSD (Solid State Drive) based S Nodes for extreme performance
- SATA (Serial ATA drive interface) and SSD based X Nodes for typical medium performance workloads
- SATA based NL Nodes for Archive level performance
An Islion cluster can be a combination of these nodes, allowing for tiering of data based on access. The advantage of using Isilon for HDFS is Isilon provides the data protection, so the HDFS copy parameter can be set to a single object.
This reduces nearly in half the amount of storage required to support a Hadoop cluster, while improving the reliability and simplifying the environment.
In very large environments or in environments that require geo-dispersal, Cleversafe can be leveraged to provide storage via the HDFS protocol. Like Isilon, Cleversafe leverages erasure coding techniques to distribute the data across the nodes in its cluster architecture. Cleversafe, however, scales much larger and can be geo-dispersed as it’s cluster interconnect leverages TCP/IP over Ethernet as opposed to Infiniband.
IDS has integrated both the Isilon and Cleversafe technologies in our cloud and has the capacity to support customer analytics environments on this infrastructure.
Our customers can efficiently stand up a Hadoop Ecosystem and produce valuable insights without having to purchase and manage a significant investment in infrastructure.
SMR from Seagate
On a completely separate, but surpisingly related thread: one of the major developments in rotational hard drive technology in the last year have been focused on archival storage. Seagate announced Shingled Magnetic Recording (SMR) with densities up to 1.25TB per platter. SMR drives overlap groups of tracks, leaving valid read tracks inside the boundaries of wider write tracks. SMR drives can store much more data this way, but data re-writes are much slower with SMR than existing perpendicular magnetic recording (PMR) drive technology drives. This is because when a block is updated the entire group of tracks has to be overwritten—much like solid-state page writes. While the only known customers of SMR drives to date are Seagate’s subsidiary E-Vault, this technology would seem to line up well with HDFS workloads.
Photo credit: _cheryl via Flickr