Thursday, February 07, 2013
I'm reading: Building Social Analytics with MS BITweet this !
Every form of analysis needs data, but it's not possible that one might have that data generated and stored in organizational repository. Many forms of analysis depends upon data from third-party, and platforms like Windows Azure Marketplace are based on the same principle.
Social Analytics is widely used to forecast the impact on the business and extract insights to counter the same. The interesting question here is, what is the data source that can be used to calculate / derive sentiments of customers related to the respective business ? A majority of this data would come from social / professional / collaboration forums. Examples of such sources are Facebook, YouTube, Twitter, LinkedIn, PInterest, IMDb, Blogs etc. Anyone would agree that the analytics derived from unstructured data created by the public interaction on social media can be expected to be much more close to precision than even any data mining algorithm. But the big question here is, the amount of data - very very very big data. On a daily basis, there are 400 million Tweets, 2.7 billion Facebook Likes, and 2 billion YouTube views. Even these figures might have been outdated today.
Say an organization is influenced by Sharepoint 2013 enhancements related to social media collaboration, and intends to add an ability to derive sentiment analysis in their client offering. Let's say that as a starting source, Twitter is selected as the source of data, and all the public tweets for a particular product would be analyzed and the results would be stored for future use.
The first challenge is that according to a study, Twitter generates approximately 1 billion tweets in less than 3 days. So how to deal with processing such a huge amount of unstructured data and just consider the kind of infrastructure required to handle this processing. To proceed with the case study, let's say that we live in the age of cloud and we just signed up on AWS and have beefed up a fat Amazon EMR that uses Hadoop and HBase NoSQL database.
The second challenge in this case is how to get access to Twitter Firehose - an API that provides streaming access to Twitter public tweets. One needs to partner with Twitter and pay millions of dollars to get licensed access to it's sea of unfiltered dataset. Also you would need rights to publicly sell this dataset to your end-clients. Considering this complexity any organization would give up the idea of implementing it for own use.
Sometimes the answer to the problem is not technology but it's partner technology. Only three publicly known companies have licensed rights to Twitter's Firehose - Topsy, Datasift, and Gnip. These companies have established partnership with hundreds and thousands of social media platforms, established a web scale and google inspired flavor of infrastructure based on Hadoop clustering methodology, and also have been maintaining a huge archive of historical social data. On the top of it, these providers provide real time access to live stream of social media and also provides social analytics using intelligent methods. An interesting case study of how Datasift manages infrastructure for huge processing, storage and analytics can be read from here.
How MS BI is related to it ?
Even if one selects to sign-up with any of these providers and source analyzed data from them, one would have to keep storing the results. These providers have pay-per-use pricing model depending upon the selected source. After intelligently extracting analyzed data from different sources through these providers, one would have to warehouse the same to avoid paying repeatedly for the same data. Considering the volume of data, even if analyzed data from these social media providers is warehoused, it would easily create a huge warehouse of data.
Microsoft have two different flavors of analysis models (Tabular mode SSAS and OLAP mode SSAS) under the BISM umbrella and a very strong set of end user collaboration platforms including Sharepoint and Excel. Analyzing the warehoused data from social analytics providers with MS BI and including the same in solution offerings can be a deal breaker than implementing complex data mining algorithms or such methods.
I would really like to hear what Microsoft thinks about my idea around social analytics with ms bi. Anyone reading this post is interested in sharing their thoughts about this idea, I would be more than happy to receive the same.
Saturday, February 02, 2013
Amazon Web Services (AWS) pocket reference for Business Intelligence Architects / Architecture DesignI'm reading: Amazon Web Services (AWS) pocket reference for Business Intelligence Architects / Architecture DesignTweet this !
Every large scale IT organization is organized in some form of verticals / Strategic Business Units (SBU), or in some other form. These may be grouped by geography / technology / industry groups etc. Almost inevitably every such organization has a cloud computing capability, and most of cloud based projects / architectures are designed and developed by this capability. This may work till you are working in the capacity of an architect for your own set of projects that just deal with your technology.
I believe that when one intends to grow as an enterprise architect, one needs to collaborate with SMEs from cross environments / technologies / platforms, and for the same one needs to have a good understanding of a variety of each of it.
Why Amazon Web Services (AWS) - AWS is probably the largest cloud player in providing IaaS. Azure and other such platforms have started providing IaaS recently, but their major strength is PaaS where they provide technology to build solutions and the infra is managed by them. If one intends to develop solutions that have a very broad mix variety of technologies, then one would have to opt a very strong IaaS cloud environment, than a PaaS environment.
Below are some of my quick notes on the world of Amazon Web Services, that one might want to keep in consideration while architecting BI solutions on AWS.
1) AWS has two types of clouds : Public / Virtual private cloud (VPC)
In public cloud servers are under AWS control, which can be configured by user. In VPC, servers are hosted within AWS but part of corporate network. IPs are under the control of the corporate network and security between the corporate network and servers hosted on AWS is the obligation of the corporate.
2) Amazon Simple Storage Service (S3) :
- Its an object store, where one can store any type of data in huge amounts, and the same can be accessed using the API provided by amazon for S3.
- It's a highly available service, as it stores copies of data in multiple locations. It can be used as a staging location for migrating data across availability zones when using Elastic Block Store Disk.
- When data is stored into S3, the datatype is stored in a metadata tag. When a client accesses the data, it can check this tag to ensure that the data is read accordingly.
- S3 can store an object with max 5 GB in size. S3 objects can be accessed via REST/SOAP/HTTP. Third party tools are available to handle storage management inside S3.
3) Amazon Elastic Compute Cloud - EC2
- Provides scalable and flexible compute capacity EC2 instance provides interface to manage Amazon Machine Image (AMI, also known as bundle). Amazon, and other third party providers like RightScale, IBM and others provide ready images for use.
- Any software installation would be lost from EC2 instance, once the instance is "terminated". Persistent images are also available which can persist software changes, once the instance is stopped (but not terminated). These images are based on EBS or S3 instance store.
- If you use a SQL Server 2008 R2 AMI, then the license cost of SQL Server is included in the cost of running the instance. One cannot use their own purchased licenses to offset the cost of SQL Server license in a AWS provided SQL Server AMI.
- One can allocate static IP address to an instance using AWS "Elastic IP", and after that once can RDP to the same using the same IP / DNS every time. Without an Elastic IP, the IP address for the instance would change every time the instance is started and stopped. Elastic IPs are chargeable.
Billing types for EC2 instance
- Reserved Instance - This instance type requires reserving the instance for a fixed term. It includes an up-front cost, along with usage charges. This instance is cheaper than Unreserved instance.
- Unreserved Instance - This instance is billed on pay-per-use basis, but is comparatively expensive than Reserved Instance.
- Spot Instance - These are unique type of EC2 instances, which are basically amazon's way to handle spare capacity. You need to set a price and number of instances you need. When the average spot price falls below the price set by you, the instances would be allocated to your account. But downside is that once the average spot price rise above the price set by you, those instance would stop.
- In AWS, you are not billed for any data transfer between AWS components (for example data transfer between S3 and EC2). But for any data traffic that goes in and out of the instance using Internet, is billable.
- Various categories of EC2 instances available like Micro, Standard, Cluster Compute, High-Memory Cluster, Cluster GPU, High Memory, High CPU, High Storage, High I/O etc. Also each of them have small, medium, large scaling for each category. A comparison can be seen from here: http://www.ec2instances.info , http://aws.amazon.com/ec2/instance-types/
4) Amazon Elastic Block Storage (EBS)
- Its the storage system / disk where EC2 instance would store and persist data. EBS is created, configured and managed out of EC2 instance and not within it. Even if an EC2 instance has been terminated, data stored on EBS would persist.
- EBS volumes can be 1 GB to 1 TB in size.
- EBS volume availability is restricted to the region and availability zone in which they are created. It's possible to make it available within a different zone by creating a snapshot of EBS and storing it into S3, and again creating a new EBS from the snapshot stored in S3. But EBS cannot be made available across regions by any means.
- One EC2 instance can have many EBS volumes, but one EBS volume cannot be shared by multiple EC2 instances.
5) Amazon Security Groups
- It provides a way to restrict access on EC2 instances, by configuring ports, ip and servers that can connect to an EC2 instance. It acts as a firewall for an EC2 instance.
- All the EC2 instance on which a security group is applied, does not become part of a common group / subnet.
6) Amazon CloudWatch
- Cloudwatch are of two types in AWS - Basic CloudWatch and Detailed CloudWatch.
- Basic CloudWatch is available with EC2 instance. It collects different performance metrics related to the EC2 instance.
- Detailed CloudWatch enables a detailed monitoring of EC2 instances, with alerts and notifications.
7) Amazon Elastic Load Balancing (ELB)
- Elastic Load Balancing can be used for two major purposes - Load balancing and Fault tolerance.
- As a load balancer it can distribute incoming traffic to different servers in a load balanced fashion.
- As a fail over balancer, it can detect a failed / unresponsive / unhealthy EC2 instance and route traffic to other instances as required.
8) Amazon Relational Database Service (RDS)
- Amazon RDS provides full featured database services using MySQL, Oracle as well as SQL Server database engine.
- RDS provides fault-tolerance / high availability by creating Multi-AZ Deployments. With this option, one instance of RDS is created in the availability zone selected by user, and second instance is created in an alternative availability zone. Both instances are kept upto date in parallel. The second instance is not visible / available, until the first instance becomes unavailable, and when it does, the second instance takes over immediately.
- RDS instance can be configured to create Read Replica which are copies of the RDS instance, that can be used for reporting purposes.
- RDS instances are backed up by default in AWS and this backup remains available for a limited time. Backups are totally configurable and can be persisted indefinitely too.
9) Amazon Simple Notification Service (SNS)
- Amazon SNS is a publish and subscribe model using which systems or user can generate and/or receive alerts and/or notifications.
- There are three methods in which alerts / notifications are delivered: Email / Http based web service call / A message via Simple Queue Service (SQS).
10) Amazon CloudFront
Its the Content Delivery Network of AWS that distributes and caches content at the nearest servers based on user request patterns.
11) Amazon Elastic MapReduce (EMR)
- Amazon EMR provides features to process large amounts of data using Hadoop based processing combined with other AWS products.
- EMR also provides option to run HBase (column oriented, distributed, NoSQL database) on Hadoop clusters which enables real-time data access to Hadoop in cloud.
12) Amazon Identity and Access Management (IAM) and Amazon CloudFormation provides means to control permissions to AWS resources as well as manage AWS resources as a system respectively. Amazon Route 53 is a highly available and scalabe Domain Name System (DNS) management service that can be used with AWS IAM to manage domains with faster performance.
Wednesday, January 30, 2013
I'm reading: Data layer requirements for applications with cloud hosting and NoSQL databases - CloudantTweet this !
MVC is one of the most famous and widely adopted application development design patterns. Typically data layer of applications starts off with features that requests actions related to data manipulations.
- If an application is public facing where users can sign-up on the app and consume services, there are bright changes that with the success of the application, user data would grow linearly.
- On the top of it if the application is related to social media / networking / large scale e-commerce, then the app is a future candidate to mature in the big data processing requirements. This puts an obligation on the data layer to be more mature, if you want to keep your operational costs (incurred by dedicated DBAs, infrastructure procurement and maintenance, etc) under check.
- As data grows big, and tends towards being unstructured from structured, database choices tends to be towards specialized NoSQL databases from generalized relational RDBMS.
- To manage data load, data management methods like replication, sharding, master-slave configuration, partitioning etc are employed. As most of these techniques require flexibly scalable infrastructure, data is hosted mostly on cloud and this means that data access layer should be mature enough to work with cloud and almost manage the infrastructure administration using the cloud provider api.
- Different cloud providers have their own pros and cons. An application downtime is directly proportional to equivalent loss is business in today's world. So you might want your data layer to be flexible enough, such that it can access data from some of the leading cloud hosting environments.
These specifications are too ambitious for an application to develop just for it's own use. But at the same time to keep frequent revamping of data layer at bay or to avoid create duplicate data layers for different optimal data access and management methods, one needs a data layer that can manage the above mentioned challenges.
One interesting company provider a data layer solution that can meet the above specifications with even more features - Cloudant. It provides a data layer to
- Manage data in CouchDB NoSQL database,
- Hosted on choice of cloud providers like AWS, Azure, RackSpace etc
- RESTful CouchDB compliant APIs
- JSON based data exchange format
- Lucene based text search and much more.
NoSQL databases, Cloud providers of the scale of Azure, AWS etc, MPP systems like Hadoop etc are quite famous. But the data layer that manages data access along with data management on cloud can be a very interesting option, than starting your app from scratch with a fat sized cloud account. Checkout the technical whitepaper on Cloudant for a detailed understanding of this service and this article.
Sunday, January 27, 2013
I'm reading: Database Sharding, SQL Azure Federation, SQL Server 2012 and SequenceTweet this !
Almost everyone who is a SQL Server professional directly or indirectly knows the newly introduced Sequence object in SQL Server 2012. This post is not about explaining how to use it. Just google what most famous blogger have to say about Sequence object, and you would find that the only content in the post is how to use it. I was hoping that someone would see it from the angle that I see it, but I didn't find any. So I thought of sharing my viewpoint on Sequence object.
One can manage generating incremental keys/ids using the IDENTITY property. Is Sequence just another means to flexibly manage automated incremental ID generation ? I would say it's one of the means, but I don't see it as the ultimate purpose of it.
Applications enabling answer to complex queries often needs data of a huge size. How easily would you get to read data, depends on strategically how-and-where you write data. Network bandwidth, computing capacity and memory availability are some of the driving factors in read-write operations. When apps face performance pressure, the first step that is typically followed is scaling up. After scale up reaches a limit, scale out is the next measure. Partitioning, Replication, Federation are different approaches to scale out databases. If data is distributed and stored in a way that the query is likely to fetch all the data from a single server / partition, this strategy might work till each database size remains in gigabyte sized limits. If each partition / database is multi terabyte sized, even if entire data to be returned to database is figured out, it would face a bottleneck due to network bandwidth available to a single server due to the amount of data to be returned.
Database sharding is partitioning data horizontally across many servers on commodity hardware. Some areas where sharding is used include:
- NoSQL databases running on Hadoop and MapReduce based infrastructure with database sharding have been able to solve database scalability related challenges on some of the worlds biggest and most aggressively growing datasets.
- SQL Azure Federation on Windows Azure platform is one of the awaited features in SQL Azure, that would allow multi tenant databases with sharding.
- Cloud databases often have to be multi-tenant / federated, due to the limits on database sizes as a result of cloud based storage architecture.
The interesting part is, database sharding on relational databases is not available out-of-box, at least in SQL Server. Consider a scenario where the need is to insert a couple of records in order but in different databases. One would need a key/id generation mechanism that is independent of tables, unlike IDENTITY property.
Sequence is independent of tables, which makes it much more scalable. From the feature set it seems that Sequence can help to generate a ordered set of keys for storing data across shards in a sorted format. This can be confirmed only with a PoC, but for now I am hopeful that when SQL Azure Federation would go RTM, Sequence would definitely make a place in the T-SQL features for the same.
Thursday, January 24, 2013
I'm reading: 70-463 Exam Guide - Part 2Tweet this !
How to accommodate changing attribute values of a dimension in a dimensional model ?
How does dimensional modeling of a data warehouse enable / empower analytics ?
1) Data analysis often requires pivoting huge amounts of data. Attributes of a dimension (i.e. columns in a dimension table) ideally contains non-continuous data. Such data is often suitable to slice / pivot data and can be used as a scale axis to visualize categorized data on a graph. Even if the data in attributes is continuous (i.e. data contains lots of distinct values), OLAP technologies like SSAS provides was to categorize continuous data into discrete categories.
2) For example, if we try to plot data on a graph use values in Age attribute for the population in a city, on X or Y axis, then there would be inevitably 100 bars on a graph which is almost impossible to analyze. A better approach would be to divide age attribute in few distinct age groups and use those groups on the graph. Each group would represent a bar, and upon drilling down the bar, actual data can be brought on the graph.
Why do dimension tables often include columns that cannot be used for pivoting / filtering / analytics ?
1) Often dimensions like Geography, Employee, Account, Products etc have attributes describing the dimension member. For example Employee dimension can have attributes like Gender, Date of Birth, Marital Status, Number of children etc.. Such attributes are neither good candidates for pivoting nor for any kind of roll-ups or aggregation.
2) However broad or narrow be the analytical spectrum, inevitably almost any such system would have reporting on the top of it. If the dimensions have only pivotable or aggregatable attributes, then such data would be sufficient only to plot on a graph but not on a detailed report. Such data can facilitate drill-down, but not drill-through of the problem. For drill-through, one would require different attributes based on which a report can be sketched out.
3) For an Employee report, one might start to analyze a report based on salary attribute as it is pivotable. For example, list all employees whose gross income is less than 30k USD. But post that one would want to check which of those are women having more than two kids and are aged over forty, to balance work load on them. For such a report, if dimensions do not contain such attributes, then a cross-database query joining data warehouse and transactional OLTP database would be required. This results in poor performance and other reporting issues. Hence such attributes are included in a dimension.
Such attributes in a dimension are called member properties.
What are hierarchies and how they help in analytics ?
1) Attributes within a dimension as well as across dimensions can be related and well as unrelated. For example, date of birth and gender are unrelated attributes, but country and city are related attributes. Related attributes are often pivotable and form a hierarchy called natural hierarchies. Unrelated attributes have lesser probability of forming a hierarchy, and if they do, then it's called unnatural hierarchies.
2) Hierarchies provide a drill-down path for data analysis are the data can be pivoted. Aggregatable attributes can only be rolled-up but cannot be drilled down to another level based on another related / unrelated attribute. For example salary can be rolled up from daily to the highest possible granularity. But to drill down salary from Year-Month-Week, a hierarchy made up of Year, Month and Date attributes is required which would slice rolled-up values of Salary attribute. Hierarchies formed of two or more related / unrelated attributes can be drilled-down which are very useful for analytics.
How to accommodate changing attribute values of a dimension in a dimensional model ?
1) Data archival policy is often overlooked or considered a trivial aspect in database modeling. But in data warehouse it has a very deep impact on the modeling itself. Facts should ideally contain transactional data, so there is no question of data update. Dimension attributes often have changing values, which might be worth keeping a track. For example, an organization might not be interested in keep a track of every time when an employee had children but only total number of children as of the latest status. But say, if the employee is a sales person, an organization would want to keep track of the field area where the sales executive was posted at any point in time. Without tracking this history, reports would always show that sales executive has operated in one area which might be his/her present area of operation, as the attribute value is always updated / overwritten. Such changing values of attributes in a dimension are addressed by a design aspect known as Slowly Changing Dimension (SCD).
2) SCD are of three types, based on how attribute history has to be preserved.
SCD Type 1: Attributes for which values are updated / overwritten, and history is not maintained is know as SCD Type 1.
SCD Type 2: To maintain history, at a minimum three new columns are added to the dimension - ValidFrom, ValidTill, and IsCurrent. ValidFrom and ValidTill defines the time scope during which the value was valid, and IsCurrent helps queries to easily identify the latest value of the attribute(s). When a value changes for an attribute, the ValidTill value is updated with the current date, and IsCurrent is marked as "N". A new record with the latest value of the attribute is added to the dimension, having ValueFrom as the current date, ValidTill as NULL and IsCurrent as "Y".
SCD Type 3: If one decided to maintain history of attribute values, and if the attribute values are frequently changing, then the dimension can have rapidly changing values (often known as rapidly changing dimension). Due to the same, dimension would have huge amount of dimension members effectively translating into performance issues in dimension processing as well as querying. Type 3 is a method where one can limit the amount of history to be preserved. Only the latest value and the one before that are preserved. Due to this design, the tracking and scoping fields required in Type 2 would not be required in Type 3.
How are dimensions (especially slowly changing dimensions) associated with facts in a dimensional model ?
1) Any dimension would have a business key, which is one of the candidates for primary key, for example ProductID in Product dimension. If a dimension does not have any SCD attributes, then Business key can be used as the primary key for the dimension. But in other case, whenever a new record is inserted for a SCD attribute, the business key would get duplicated. So to manage this, SCD dimensions have another field that acts as a placeholder of primary key - popularly known as Surrogate Key. This field would typically have auto-increment identity values in most of the cases.
2) Primary keys in a dimension act as a foreign keys in fact tables, which enables pivoting data based on dimension attributes for analytics. Business keys in a dimension helps in maintaining lineage with OLTP data sources, from where data is collected and warehoused typically using ETL methods. One can't correctly slice data in a fact with a dimension whose key is not present directly / indrectly in the fact table. A combination of all the foreign keys in a fact table can be used as a composite primary key in many cases.
3) If a fact table is not associated with a dimension table, and still one wants to associate a dimension with a fact without adding primary key from dimension into the fact table, one can create bridge fact tables. These bridge tables contain only keys from dimension and fact, which creates an indirect association between dimension and fact tables.
What are measures and where are they stored ?
1) Attributes of a fact table are known as measures. Fact table should contain calculable data, which is measure of different attributes of an entity. For example, for a Sales fact table would have order value, number of units in order, tax on order value etc, profit margin percentage etc.
2) Additivity of different measures can be different - some can be fully additive like order value, some can be partially additive like tax, and some can be non-additive like profit margin percentage. Based on the nature of addivity, different aggregation functions can be applied on the attributes for roll-ups.