Thursday, February 07, 2013
Building Social Analytics with MS BI
Every form of analysis needs data, but an organization will not always have that data generated and stored in its own repository. Many forms of analysis depend on third-party data, and platforms like Windows Azure Marketplace are built on that principle.
Social analytics is widely used to forecast the impact of public sentiment on a business and to extract insights for responding to it. The interesting question here is: what data source can be used to calculate / derive the sentiments of customers toward a given business? The majority of this data comes from social / professional / collaboration forums. Examples of such sources are Facebook, YouTube, Twitter, LinkedIn, Pinterest, IMDb, blogs etc. Arguably, analytics derived from the unstructured data that the public creates on social media can be expected to come closer to the mark than even a data mining algorithm. The big problem is the amount of data: this is very, very big data. On a daily basis there are 400 million Tweets, 2.7 billion Facebook Likes, and 2 billion YouTube views, and even these figures may already be outdated.
Say an organization is inspired by the SharePoint 2013 enhancements related to social media collaboration and intends to add sentiment analysis to its client offering. Let's say that Twitter is selected as the starting source of data, and that all public tweets about a particular product will be analyzed and the results stored for future use.
The first challenge is volume: according to a study, Twitter generates approximately 1 billion tweets in less than 3 days. So how does one process such a huge amount of unstructured data? Just consider the kind of infrastructure required to handle it. To proceed with the case study, let's say that we live in the age of cloud, have just signed up on AWS, and have provisioned a large Amazon EMR cluster that uses Hadoop and the HBase NoSQL database.
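As a quick sanity check, the 400 million tweets per day quoted earlier line up with that study: at that rate, a billion tweets take about two and a half days. The sketch below just does that arithmetic. The average tweet size is purely my assumption for illustration, not a figure published by Twitter.

```python
# Back-of-the-envelope check of the tweet volume figures quoted in this post.
TWEETS_PER_DAY = 400_000_000      # daily figure quoted earlier in the post
TARGET_TWEETS = 1_000_000_000     # "1 billion tweets" from the study

days_to_one_billion = TARGET_TWEETS / TWEETS_PER_DAY

# ASSUMPTION: roughly 2.5 KB per tweet including metadata (illustrative only).
ASSUMED_BYTES_PER_TWEET = 2_500
daily_volume_tb = TWEETS_PER_DAY * ASSUMED_BYTES_PER_TWEET / 1e12

print(f"Days to reach 1 billion tweets: {days_to_one_billion}")
print(f"Assumed raw volume per day: {daily_volume_tb} TB")
```

Under that assumed size, a single day of tweets is already about a terabyte of raw data before any processing.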
The second challenge is getting access to the Twitter Firehose, an API that provides streaming access to public tweets. One needs to partner with Twitter and pay millions of dollars for licensed access to its sea of unfiltered data. You would also need the rights to publicly sell this dataset to your end clients. Faced with this complexity, most organizations would give up the idea of implementing it for their own use.
Sometimes the answer to the problem is not your own technology but a partner's technology. Only three publicly known companies have licensed rights to Twitter's Firehose: Topsy, Datasift, and Gnip. These companies have established partnerships with many social media platforms, built web-scale, Google-inspired infrastructure based on Hadoop clustering, and maintain a huge archive of historical social data. On top of that, these providers offer real-time access to live streams of social media and also provide social analytics using intelligent methods. An interesting case study of how Datasift manages infrastructure for huge processing, storage and analytics can be read here.
How is MS BI related to this?
Even if one signs up with one of these providers and sources analyzed data from them, one still has to store the results. These providers have a pay-per-use pricing model that depends on the selected source. After extracting analyzed data from different sources through these providers, one has to warehouse it to avoid paying repeatedly for the same data. Given the volume involved, even warehousing only the analyzed data from these providers would easily create a huge data warehouse.
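The "pay once, then read from the warehouse" idea can be sketched roughly as below. This is only an illustration under my own assumptions: `fetch_from_provider` is a hypothetical stand-in for a paid provider call (it is not any real provider's API), the table name `sentiment_fact` is made up, and an in-memory SQLite database stands in for the real warehouse.

```python
import sqlite3

def fetch_from_provider(product, day):
    """HYPOTHETICAL stand-in for a paid provider call; returns a fake score."""
    fetch_from_provider.calls += 1
    return {"product": product, "day": day, "sentiment": 0.5}

fetch_from_provider.calls = 0  # counts how often we would have been billed

def get_sentiment(conn, product, day):
    """Return the warehoused result if present; otherwise pay once and store it."""
    row = conn.execute(
        "SELECT sentiment FROM sentiment_fact WHERE product = ? AND day = ?",
        (product, day),
    ).fetchone()
    if row is not None:
        return row[0]
    result = fetch_from_provider(product, day)
    conn.execute(
        "INSERT INTO sentiment_fact (product, day, sentiment) VALUES (?, ?, ?)",
        (result["product"], result["day"], result["sentiment"]),
    )
    conn.commit()
    return result["sentiment"]

# ASSUMPTION: SQLite in memory is used only so the sketch is self-contained.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE sentiment_fact ("
    "product TEXT, day TEXT, sentiment REAL, PRIMARY KEY (product, day))"
)

first = get_sentiment(conn, "SharePoint 2013", "2013-02-07")
second = get_sentiment(conn, "SharePoint 2013", "2013-02-07")
print(first, second, fetch_from_provider.calls)
```

The second request for the same product and day is served from the stored row, so the provider is called (and billed) only once.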
Microsoft has two flavors of analysis models (Tabular mode SSAS and OLAP mode SSAS) under the BISM umbrella, and a very strong set of end-user collaboration platforms including SharePoint and Excel. Analyzing the warehoused data from social analytics providers with MS BI, and including it in solution offerings, can be more of a deal maker than implementing complex data mining algorithms or similar methods.
I would really like to hear what Microsoft thinks about my idea of social analytics with MS BI. If anyone reading this post is interested in sharing their thoughts about this idea, I would be more than happy to receive them.
Saturday, February 02, 2013
Amazon Web Services (AWS) pocket reference for Business Intelligence Architects / Architecture Design
Every large-scale IT organization is organized into some form of verticals / Strategic Business Units (SBU), grouped by geography / technology / industry and so on. Almost inevitably every such organization has a cloud computing capability, and most cloud-based projects / architectures are designed and developed by that capability. This may work as long as you are working as an architect for your own set of projects that deal only with your own technology.
I believe that when one intends to grow as an enterprise architect, one needs to collaborate with SMEs across environments / technologies / platforms, and for that one needs a good understanding of a variety of each.
Why Amazon Web Services (AWS) - AWS is probably the largest cloud player in providing IaaS. Azure and other such platforms have started providing IaaS recently, but their major strength is PaaS, where they provide the technology to build solutions and manage the infrastructure themselves. If one intends to develop solutions that use a very broad mix of technologies, one would have to opt for a strong IaaS cloud environment rather than a PaaS environment.
Below are some of my quick notes on the world of Amazon Web Services that one might want to keep in mind while architecting BI solutions on AWS.
1) AWS has two types of clouds: Public / Virtual Private Cloud (VPC)
In the public cloud, servers are under AWS control and can be configured by the user. In a VPC, servers are hosted within AWS but are part of the corporate network. IPs are under the control of the corporate network, and security between the corporate network and the servers hosted on AWS is the responsibility of the corporate.
2) Amazon Simple Storage Service (S3):
- It's an object store, where one can store any type of data in huge amounts, and the data can be accessed using the API that Amazon provides for S3.
- It's a highly available service, as it stores copies of data in multiple locations. It can be used as a staging location for migrating data across availability zones when using Elastic Block Store disks.
- When data is stored in S3, its data type is stored in a metadata tag. When a client accesses the data, it can check this tag to ensure that the data is read accordingly.
- A single S3 object can be up to 5 TB in size; a single PUT upload is limited to 5 GB, and larger objects are uploaded in parts. S3 objects can be accessed via REST/SOAP/HTTP. Third-party tools are available to handle storage management inside S3.
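To make the metadata tag point concrete, here is a minimal sketch of a client that decides how to read an object based on its stored content type. The `ToyObjectStore` class is an in-memory stand-in that I made up for illustration; it is not the S3 API, and the bucket and key names are hypothetical.

```python
import json

class ToyObjectStore:
    """In-memory stand-in for an object store; NOT the S3 API."""

    def __init__(self):
        self._objects = {}

    def put(self, bucket, key, body, content_type):
        # Store the payload together with a metadata tag describing its type.
        self._objects[(bucket, key)] = {
            "body": body,
            "metadata": {"Content-Type": content_type},
        }

    def get(self, bucket, key):
        return self._objects[(bucket, key)]

def read_object(store, bucket, key):
    """Decode the payload according to its Content-Type metadata tag."""
    obj = store.get(bucket, key)
    content_type = obj["metadata"]["Content-Type"]
    if content_type == "application/json":
        return json.loads(obj["body"].decode("utf-8"))
    if content_type.startswith("text/"):
        return obj["body"].decode("utf-8")
    return obj["body"]  # unknown type: hand back the raw bytes

store = ToyObjectStore()
store.put("bi-staging", "scores.json", b'{"sentiment": 0.5}', "application/json")
store.put("bi-staging", "notes.txt", b"daily load ok", "text/plain")
store.put("bi-staging", "dump.bin", b"\x00\x01", "application/octet-stream")

print(read_object(store, "bi-staging", "scores.json"))
print(read_object(store, "bi-staging", "notes.txt"))
print(read_object(store, "bi-staging", "dump.bin"))
```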
3) Amazon Elastic Compute Cloud - EC2
- Provides scalable and flexible compute capacity. EC2 provides an interface to launch and manage instances of an Amazon Machine Image (AMI, also known as a bundle). Amazon and third-party providers such as RightScale and IBM provide ready-made images.
- Any software installed on an EC2 instance is lost once the instance is "terminated". EBS-backed images can persist software changes when the instance is stopped (but not terminated). Instances launched from S3-backed instance store images cannot be stopped, only terminated.
- If you use a SQL Server 2008 R2 AMI, the license cost of SQL Server is included in the cost of running the instance. One cannot use one's own purchased licenses to offset the cost of the SQL Server license in an AWS-provided SQL Server AMI.
- One can allocate a static IP address to an instance using an AWS "Elastic IP", and after that one can RDP to the instance using the same IP / DNS every time. Without an Elastic IP, the IP address of the instance changes every time the instance is stopped and started. Elastic IPs are chargeable.
Billing types for EC2 instances
- Reserved Instance - This type requires reserving the instance for a fixed term. It includes an up-front cost along with usage charges, and its usage rate is cheaper than that of an Unreserved Instance.
- Unreserved Instance - This type (which AWS calls On-Demand) is billed on a pay-per-use basis, but is comparatively more expensive than a Reserved Instance.
- Spot Instance - This is a unique type of EC2 instance, which is basically Amazon's way of selling spare capacity. You set a maximum price and the number of instances you need. When the current spot price falls below the price you set, the instances are allocated to your account. The downside is that once the spot price rises above the price you set, those instances are terminated.
- In AWS, you are not billed for data transfer between AWS components within the same region (for example, data transfer between S3 and EC2). Data traffic that goes in and out of an instance over the Internet, however, is billable.
- Various categories of EC2 instances are available, such as Micro, Standard, Cluster Compute, High-Memory Cluster, Cluster GPU, High Memory, High CPU, High Storage, High I/O etc. Each category also comes in several sizes such as small, medium and large. A comparison can be seen here: http://www.ec2instances.info , http://aws.amazon.com/ec2/instance-types/
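The trade-offs between the billing types above can be sketched with a little arithmetic. This is a rough illustration only: every price below is a made-up number, not an AWS list price, and the function names are my own. The first function finds how many hours of use it takes for the reserved up-front cost to pay off against pay-per-use billing; the second marks the hours in which spot capacity would be held for a given maximum price.

```python
def break_even_hours(upfront, reserved_hourly, on_demand_hourly):
    """Hours of use after which a reserved instance becomes the cheaper option."""
    if on_demand_hourly <= reserved_hourly:
        raise ValueError("reserved hourly rate must be lower than pay-per-use rate")
    return upfront / (on_demand_hourly - reserved_hourly)

def spot_hours_running(spot_prices, max_price):
    """Flag, hour by hour, whether spot capacity is held at the given max price."""
    return [price <= max_price for price in spot_prices]

# ASSUMPTION: all prices are hypothetical values chosen for easy arithmetic.
hours = break_even_hours(upfront=250.0, reserved_hourly=0.0625, on_demand_hourly=0.125)
running = spot_hours_running([0.03, 0.04, 0.09, 0.05, 0.02], max_price=0.05)

print(hours)
print(running)
```

With these numbers the reservation pays off after 4000 hours, which is a little under half a year of continuous use, and the spot instances are lost in the one hour where the price spikes above the limit.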
4) Amazon Elastic Block Store (EBS)
- It's the storage system / disk where an EC2 instance stores and persists data. EBS is created, configured and managed outside the EC2 instance and not within it. Even if an EC2 instance has been terminated, data stored on EBS persists.
- EBS volumes can be 1 GB to 1 TB in size.
- An EBS volume is available only within the region and availability zone in which it was created. It's possible to make the data available in a different zone by creating a snapshot of the volume, which is stored in S3, and creating a new EBS volume from that snapshot. A volume itself cannot be attached across regions; the data has to be moved by copying a snapshot to the other region.
- One EC2 instance can have many EBS volumes, but one EBS volume cannot be shared by multiple EC2 instances.
5) Amazon Security Groups
- It provides a way to restrict access to EC2 instances by configuring the ports, IPs and servers that can connect to an EC2 instance. It acts as a firewall for an EC2 instance.
- EC2 instances to which the same security group is applied do not become part of a common group / subnet.
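The firewall behaviour can be pictured as a list of allow rules, with everything else denied. The sketch below is a simplified model under my own assumptions: the rule format and the `is_allowed` helper are hypothetical and do not mirror the AWS API, and the IP ranges are documentation addresses picked for the example.

```python
import ipaddress

# Each hypothetical rule allows one protocol/port from one CIDR range.
RULES = [
    {"protocol": "tcp", "port": 3389, "cidr": "203.0.113.0/24"},  # RDP from office
    {"protocol": "tcp", "port": 1433, "cidr": "10.0.0.0/16"},     # SQL Server, internal
]

def is_allowed(rules, protocol, port, source_ip):
    """Allow traffic only if some rule matches; everything else is denied."""
    source = ipaddress.ip_address(source_ip)
    for rule in rules:
        if (
            rule["protocol"] == protocol
            and rule["port"] == port
            and source in ipaddress.ip_network(rule["cidr"])
        ):
            return True
    return False

print(is_allowed(RULES, "tcp", 3389, "203.0.113.25"))  # office address, RDP
print(is_allowed(RULES, "tcp", 3389, "198.51.100.7"))  # outside address, RDP
print(is_allowed(RULES, "tcp", 1433, "10.0.4.12"))     # internal address, SQL Server
```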
6) Amazon CloudWatch
- CloudWatch monitoring is of two types in AWS - Basic CloudWatch and Detailed CloudWatch.
- Basic CloudWatch is available with every EC2 instance. It collects different performance metrics related to the EC2 instance.
- Detailed CloudWatch enables detailed monitoring of EC2 instances, with alerts and notifications.
7) Amazon Elastic Load Balancing (ELB)
- Elastic Load Balancing can be used for two major purposes - Load balancing and Fault tolerance.
- As a load balancer it can distribute incoming traffic to different servers in a load balanced fashion.
- As a failover balancer, it can detect a failed / unresponsive / unhealthy EC2 instance and route traffic to other instances as required.
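Both roles can be shown in one small sketch: traffic is handed out round-robin, and any instance marked unhealthy is skipped. This is a toy model under my own assumptions (the class, the instance names and the round-robin policy are all made up for illustration), not how ELB is actually implemented.

```python
from itertools import cycle

class ToyLoadBalancer:
    """Round-robin over instances, skipping any marked unhealthy."""

    def __init__(self, instances):
        self.health = {name: True for name in instances}
        self._ring = cycle(instances)

    def mark(self, instance, healthy):
        # In a real balancer this would come from periodic health checks.
        self.health[instance] = healthy

    def route(self):
        # Try each instance at most once per request.
        for _ in range(len(self.health)):
            candidate = next(self._ring)
            if self.health[candidate]:
                return candidate
        raise RuntimeError("no healthy instances available")

lb = ToyLoadBalancer(["web-a", "web-b", "web-c"])
before = [lb.route() for _ in range(3)]
lb.mark("web-b", healthy=False)
after = [lb.route() for _ in range(4)]

print(before)
print(after)
```

Once `web-b` is marked unhealthy, requests alternate between the two remaining instances.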
8) Amazon Relational Database Service (RDS)
- Amazon RDS provides full-featured database services using the MySQL, Oracle and SQL Server database engines.
- RDS provides fault tolerance / high availability through Multi-AZ deployments. With this option, one RDS instance is created in the availability zone selected by the user, and a second instance is created in a different availability zone. Both instances are kept up to date in parallel. The second instance is not visible / available until the first instance becomes unavailable, at which point the second instance takes over automatically.
- An RDS instance can be configured to create Read Replicas, which are copies of the RDS instance that can be used for reporting purposes.
- RDS instances are backed up by default in AWS, and this backup remains available for a limited time. Backups are configurable and can also be persisted indefinitely.
9) Amazon Simple Notification Service (SNS)
- Amazon SNS follows a publish and subscribe model through which systems or users can generate and/or receive alerts and/or notifications.
- There are three methods by which alerts / notifications are delivered: email / HTTP-based web service call / a message via Simple Queue Service (SQS).
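The publish and subscribe model itself is easy to sketch: a publisher sends one message to a topic, and every subscriber of that topic receives it through its own delivery channel. The broker below is a minimal in-memory illustration of that pattern, written under my own assumptions. It is not the SNS API, and the topic name and the two delivery handlers (a fake email outbox and a plain list standing in for a queue) are hypothetical.

```python
from collections import defaultdict

class ToyTopicBroker:
    """Minimal publish/subscribe broker; NOT the SNS API."""

    def __init__(self):
        self._subscribers = defaultdict(list)

    def subscribe(self, topic, handler):
        self._subscribers[topic].append(handler)

    def publish(self, topic, message):
        # Deliver the same message to every subscriber of the topic.
        for handler in self._subscribers[topic]:
            handler(message)
        return len(self._subscribers[topic])

email_outbox, queue = [], []
broker = ToyTopicBroker()
broker.subscribe("etl-failures", lambda m: email_outbox.append(f"EMAIL: {m}"))
broker.subscribe("etl-failures", queue.append)  # stand-in for a queue delivery

delivered = broker.publish("etl-failures", "nightly cube processing failed")
print(delivered, email_outbox, queue)
```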
10) Amazon CloudFront
It's the Content Delivery Network of AWS, which distributes and caches content at the servers nearest to users based on user request patterns.
11) Amazon Elastic MapReduce (EMR)
- Amazon EMR provides features to process large amounts of data using Hadoop-based processing combined with other AWS products.
- EMR also provides an option to run HBase (a column-oriented, distributed NoSQL database) on Hadoop clusters, which enables real-time data access to Hadoop in the cloud.
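For readers new to Hadoop-style processing, the map / shuffle / reduce flow can be sketched in plain Python on a handful of records. This is a single-process illustration of the programming model only, with made-up sample tweets. It does not use Hadoop or EMR, which run the same phases in parallel across a cluster.

```python
from collections import defaultdict

def map_phase(tweet):
    """Emit (word, 1) pairs for each word in one tweet."""
    return [(word.lower(), 1) for word in tweet.split()]

def shuffle(pairs):
    """Group emitted values by key, as the framework does between phases."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped

def reduce_phase(key, values):
    """Sum the counts for one word."""
    return key, sum(values)

# ASSUMPTION: sample tweets invented for this example.
tweets = ["Love the new SharePoint", "love Excel", "SharePoint social features"]
pairs = [pair for tweet in tweets for pair in map_phase(tweet)]
counts = dict(reduce_phase(k, v) for k, v in shuffle(pairs).items())

print(counts)
```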
12) Amazon Identity and Access Management (IAM) and Amazon CloudFormation provide the means to control permissions to AWS resources and to manage AWS resources as a system, respectively. Amazon Route 53 is a highly available and scalable Domain Name System (DNS) management service that can be used with AWS IAM to manage domains with faster performance.