Monday, July 25, 2011

Data Warehousing and Analytics on Unstructured Data / Big Data using Hadoop on Microsoft platform

A large part of the data warehousing community grew up in the old school of the Kimball and Inmon methods. Many professionals cite virtualization, complex MDX querying, performance tuning of OLAP engines, and managing data warehouse environments of a few hundred GBs to several TBs as the most niche and challenging work on their resumes. But there is another world of data warehousing and analytics that most have not explored, and this emerging wave is slowly reaching SMBs, where it could challenge data warehousing as we practice it today. You might wonder while reading this post what Microsoft has to do with all this; the answer comes towards the end of the post.

Data warehouses and data marts developed using Kimball, Inmon or any hybrid methodology are designed for structured data, and they have scalability challenges too. Appliance solutions such as Parallel Data Warehouse are Microsoft's candidate for dealing with such challenges. Some might think this is the answer for warehousing the largest volumes of data and building analytical capabilities on top of them. But this data volume is just a very small piece of the ecosystem. According to Gartner, enterprise data would grow by 650% in 2014 and 85% of it would be unstructured data, which is also termed Big Data.

Have you ever thought about how organizations like Yahoo, Google and Facebook organize their data? Which databases do they use? Do they have data warehousing and analytics? These organizations have some of the largest data volumes in the world. For example, Facebook is reported to add 12 TB of compressed data per day and scan 800 TB of compressed data per day. Can you imagine structuring such a volume of data using ETL, storing it in data warehouses, aggregating it using OLAP engines in data marts and extracting analytics out of it? Handling such volumes for data warehousing and analytics requires innovative technologies and infrastructure designs that support massively parallel processing. The one I am talking about is "Hadoop", an open-source distributed computing technology, together with the "Hadoop Distributed File System", its storage mechanism for handling unstructured data.

Cloud environments like Amazon already support Hadoop, organizations like Cloudera and IBM support commercial distributions of Hadoop, and many large, well-known international businesses already run Hadoop implementations. The biggest implementation is at Yahoo, with 100,000+ CPUs across 40,000+ computers running Hadoop. An exhaustive list of organizations using Hadoop can be read from here. Organizations are using Hadoop to implement data warehousing and analytics for purposes like Event Analytics, Click Stream Analytics, Text Analytics and more.
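To make the click stream use case a little more concrete, here is a minimal sketch of the map / shuffle / reduce pattern behind Hadoop's MapReduce model. It is plain Python running on a single machine and does not use Hadoop or any of its APIs; the log format, the sample records and the function names are all made up for illustration. On a real cluster the framework would run many map and reduce tasks in parallel over blocks of data stored in the distributed file system.

```python
from collections import defaultdict

# Hypothetical click-stream records in the form "user_id<TAB>url".
LOG_LINES = [
    "u1\t/home",
    "u2\t/home",
    "u1\t/products",
    "u3\t/home",
    "u2\t/products",
    "u1\t/checkout",
]


def map_phase(line):
    """Map step: emit one (url, 1) pair for each click record."""
    _user, url = line.split("\t")
    yield url, 1


def shuffle(pairs):
    """Group values by key, the job a framework does between map and reduce."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(url, counts):
    """Reduce step: add up the counts collected for one URL."""
    return url, sum(counts)


def run_job(lines):
    """Run map, shuffle and reduce in sequence over the input lines."""
    mapped = (pair for line in lines for pair in map_phase(line))
    grouped = shuffle(mapped)
    return dict(reduce_phase(url, counts) for url, counts in grouped.items())


if __name__ == "__main__":
    for url, clicks in sorted(run_job(LOG_LINES).items()):
        print(url, clicks)
```

The point of the pattern is that each map call looks at one record in isolation and each reduce call looks at one key in isolation, which is what allows the work to be spread over thousands of machines.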

Those who are completely new to this part of the world can go through some very interesting reference material mentioned below:

1) The Google File System

2) Data warehousing and Analytics Infrastructure at Facebook

3) Apache Hadoop Wiki

4) Apache Hadoop MapReduce Implementation at Yahoo

5) Setting up Hadoop on VM

Microsoft is aware of the challenges posed by unstructured data and Hadoop, and is slowly gearing up to meet them.

1) Microsoft Research is developing Project Daytona on the Azure platform as well as Project Dryad, which the industry perceives as Microsoft's alternative to Apache Hadoop.

2) Those who believe that MDX is the top query language for dealing with huge amounts of data from OLAP engines should check out LINQ to HPC to update their general knowledge.

3) With the increasing popularity and success of Hadoop, Microsoft is also supporting Hadoop on the Azure platform. You can get an idea of how to deploy a Hadoop cluster on the Azure platform from here.

Just as Microsoft professionals felt that the cloud was something new when Azure was introduced, the same will happen when Microsoft starts supporting Hadoop commercially or introduces a commercial alternative to it. But neither the cloud nor technologies like Hadoop for handling, warehousing and analyzing unstructured data are recent inventions. In my view, architects and organizations should build their readiness for these emerging winds of change and for the business opportunities that unstructured data can offer.

3 comments:

yngyani said...

Microsoft's answer to Hadoop has been Dryad.

But the real challenge is mining this exhaustive data. Yahoo and Google have a lot of manual mining processes, and machine-based mining has yet to be perfected. Some commercial providers like Cloudera are betting that their next big breakthrough will be in the mining and analytics area, which of course would be automated. All we can say is that there are interesting times ahead.

Unknown said...

You have posted very useful information about data warehousing. I like it. Keep on posting, and thanks for sharing.

Unknown said...

It was nice to see the data warehouse here,
datastage
