Dealing with enormous data

I wasn’t aware of the company called ‘Greenplum’ until EMC bought it!! I became interested in it when analysts were mentioning that ‘Netezza’ would be bought by IBM to counter this movie. I was interested because I had a friend who worked in ‘Netezza’. So I wanted to find out what this whole thing was about. I checked with a friend, who knows stuff in this area. And this is what he replied. ” The key thing is Netezza, Teradata, Greenplum, Vertica are all designed from the ground up for data warehousing kind of workloads. Oracle and DB2 started as OLTP (Online Transaction Processing) systems and then they tried to do Datawarehousing also using the same server code. That does not work. Datawarehousing has a very different kind of characteristic. Loads are bulk loads. Insert / Update / Deletes are few and it is very Select heavy. All you do is analytics. The selects usually involve very complex queries often running into GBs in size, generated automatically by front end analytics tools. It touches massive amounts of data in the range of terabytes to petabytes. OLTP on the other hand has all of Select / Insert / Update / Delete. Typical example is air line reservation. The volume of data is not that big at all. ” That made sense. Later IBM bought Netezza and HP bought Vertica, another similar company.

So the whole thing was about how you searched for patterns and such in massive amounts of data. Unlike the OLTP data, where there is some data which is current and important, in the analytics scenario, all data is important. There is no irrelevant data as Jim McDonald says in his very nice blog post at XIOTech. This is a very nice post giving a good perspective the challenges faced when you have to access huge amount of data.  He talks about Big Data. I am not sure if there is a common agreement on what ‘Big Data’ means but this Wikibon article can be your starting point in understanding what Enterprise Big Data is all about.

As data grows at amazing speed, neither the processor nor the disk technology can keep up to that pace. So scaling up a product to meet the needs to data growth can only go so far. It is inevitable that data access happen in parallel if you want to deal with larger and larger data sets. The current product trends as well as acquisition trends show that all companies understand this problem and are responding to it. NetApp have come up with their clustered NAS in Data Ontap 8.0  This allows for aggregation of multiple nodes and uses a global namespace. (Looks like there is some confusion regarding the term global namespace since Isilon and SONAS have interpretations that are different from NetApp. You may want to read Martin Glasborrow’s (Storagebod) post which talks about this.)  The data sheet for Clustered Mode Data Ontap is available here. (pdf file)

While NetApp must have developed their our clustered mode Scale Out NAS based on their Bycast buy last year Spinnaker acquisition earlier (thanks to Dustin for pointing my error) , EMC went and bought Isilon, which again was a company dealing with ScaleOut NAS. Infact EMC paid $2.25b to get this company. So you can understand what EMC feels about the potential of Scale Out NAS.  HP in 2009 had acquired IBRIX, another company dealing with Scale Out NAS. IBM has its own Scale Out NAS, which is appropriately labeled, SONAS!!

All of these use a global namespace. What exactly is a global namespace and more importantly, what exactly is Scale Out NAS and how does it work? According to the SONAS datasheet:

-Access your data in a single global namespace allowing all users a single, logical view of files through a single drive letter such as a Z drive.
– Offers internal (SAS, Nearline SAS) and external (Tape) storage pools. Automated file placement and file migration based on policies. It can store and retrieve any file data in/out of any pool transparently and quickly without any administrator involvement.

Scale Out NAS technical details require an extensive writeup, which I will do in a future post. What is important is that all the main storage vendors have a Scale Out NAS solution in their portfolio.

An unexpected, for many, acquisition was that of LSI’s Engenio by NetApp. The reason for being surprised was that NetApp’s message all along has been that of Unified Storage and everyone thought that NetApp would only go with the Unified Storage way always. (Infact there have been blogs critical of NetApp, calling it an one product company. Now everyone was surprised and started asking, “Why are you getting more products. Your messaging will be lost).  LSI’s Engenio is a pure block play and people were interested in knowing why NetApp acquired Engenio and how it would affect their message. Dave Hitz, in his characteristic clear style replied to these concerns / accusations in his blog. In his blog post he says, “The observation is that, while many customers and workloads do require advanced data management, some need “big bandwidth” without the fancy features. For them, the best solution is a very fast RAID array with great price/performance. Perfect for Engenio! Two immediate opportunities are Full Motion Video (FMV) and Digital Video Surveillance (DVS), and over time we believe there will be more.” Here we see NetApp targeting a different type of workload and understanding that no fancy features like Snapshot etc are required here. All that is required here is bandwidth. In other words, all companies are now trying to get solutions which deal with different types of workloads. Hence you see pure block play, Datawarehousing solutions and Scale Out NAS.

So what is the moral all this rambling? Well, the moral is clear. You better start understanding how big data is being dealt with. That is the future if you are into Storage Infrastructure. Your concepts of RAID will not suffice as data will not be distributed across disks in one single array but may be striped across multiple arrays. The clustered storage solutions may become the de facto way of installing storage. And it may happen faster than you think. So go read up more about these technologies. It will help you in the long run.