Thursday, January 24, 2013

70-463 Exam Guide - Part 2

I'm reading: 70-463 Exam Guide - Part 2Tweet this !

How does dimensional modeling of a data warehouse enable / empower analytics ?

1) Data analysis often requires pivoting huge amounts of data. Attributes of a dimension (i.e. columns in a dimension table) ideally contains non-continuous data. Such data is often suitable to slice / pivot data and can be used as a scale axis to visualize categorized data on a graph. Even if the data in attributes is continuous (i.e. data contains lots of distinct values), OLAP technologies like SSAS provides was to categorize continuous data into discrete categories.

2) For example, if we try to plot data on a graph use values in Age attribute for the population in a city, on X or Y axis, then there would be inevitably 100 bars on a graph which is almost impossible to analyze. A better approach would be to divide age attribute in few distinct age groups and use those groups on the graph. Each group would represent a bar, and upon drilling down the bar, actual data can be brought on the graph.

Why do dimension tables often include columns that cannot be used for pivoting / filtering / analytics ?

1) Often dimensions like Geography, Employee, Account, Products etc have attributes describing the dimension member. For example Employee dimension can have attributes like Gender, Date of Birth, Marital Status, Number of children etc.. Such attributes are neither good candidates for pivoting nor for any kind of roll-ups or aggregation.

2) However broad or narrow be the analytical spectrum, inevitably almost any such system would have reporting on the top of it. If the dimensions have only pivotable or aggregatable attributes, then such data would be sufficient only to plot on a graph but not on a detailed report. Such data can facilitate drill-down, but not drill-through of the problem. For drill-through, one would require different attributes based on which a report can be sketched out.

3) For an Employee report, one might start to analyze a report based on salary attribute as it is pivotable. For example, list all employees whose gross income is less than 30k USD. But post that one would want to check which of those are women having more than two kids and are aged over forty, to balance work load on them. For such a report, if dimensions do not contain such attributes, then a cross-database query joining data warehouse and transactional OLTP database would be required. This results in poor performance and other reporting issues. Hence such attributes are included in a dimension.

Such attributes in a dimension are called member properties.

What are hierarchies and how they help in analytics ?

1) Attributes within a dimension as well as across dimensions can be related and well as unrelated. For example, date of birth and gender are unrelated attributes, but country and city are related attributes. Related attributes are often pivotable and form a hierarchy called natural hierarchies. Unrelated attributes have lesser probability of forming a hierarchy, and if they do, then it's called unnatural hierarchies.

2) Hierarchies provide a drill-down path for data analysis are the data can be pivoted. Aggregatable attributes can only be rolled-up but cannot be drilled down to another level based on another related / unrelated attribute. For example salary can be rolled up from daily to the highest possible granularity. But to drill down salary from Year-Month-Week, a hierarchy made up of Year, Month and Date attributes is required which would slice rolled-up values of Salary attribute. Hierarchies formed of two or more related / unrelated attributes can be drilled-down which are very useful for analytics.

How to accommodate changing attribute values of a dimension in a dimensional model ?

1) Data archival policy is often overlooked or considered a trivial aspect in database modeling. But in data warehouse it has a very deep impact on the modeling itself. Facts should ideally contain transactional data, so there is no question of data update. Dimension attributes often have changing values, which might be worth keeping a track. For example, an organization might not be interested in keep a track of every time when an employee had children but only total number of children as of the latest status. But say, if the employee is a sales person, an organization would want to keep track of the field area where the sales executive was posted at any point in time. Without tracking this history, reports would always show that sales executive has operated in one area which might be his/her present area of operation, as the attribute value is always updated / overwritten. Such changing values of attributes in a dimension are addressed by a design aspect known as Slowly Changing Dimension (SCD).

2) SCD are of three types, based on how attribute history has to be preserved.

SCD Type 1: Attributes for which values are updated / overwritten, and history is not maintained is know as SCD Type 1.

SCD Type 2: To maintain history, at a minimum three new columns are added to the dimension - ValidFrom, ValidTill, and IsCurrent. ValidFrom and ValidTill defines the time scope during which the value was valid, and IsCurrent helps queries to easily identify the latest value of the attribute(s). When a value changes for an attribute, the ValidTill value is updated with the current date, and IsCurrent is marked as "N". A new record with the latest value of the attribute is added to the dimension, having ValueFrom as the current date, ValidTill as NULL and IsCurrent as "Y".

SCD Type 3: If one decided to maintain history of attribute values, and if the attribute values are frequently changing, then the dimension can have rapidly changing values (often known as rapidly changing dimension). Due to the same, dimension would have huge amount of dimension members effectively translating into performance issues in dimension processing as well as querying. Type 3 is a method where one can limit the amount of history to be preserved. Only the latest value and the one before that are preserved. Due to this design, the tracking and scoping fields required in Type 2 would not be required in Type 3.

How are dimensions (especially slowly changing dimensions) associated with facts in a dimensional model ?

1) Any dimension would have a business key, which is one of the candidates for primary key, for example ProductID in Product dimension. If a dimension does not have any SCD attributes, then Business key can be used as the primary key for the dimension. But in other case, whenever a new record is inserted for a SCD attribute, the business key would get duplicated. So to manage this, SCD dimensions have another field that acts as a placeholder of primary key - popularly known as Surrogate Key. This field would typically have auto-increment identity values in most of the cases.

2) Primary keys in a dimension act as a foreign keys in fact tables, which enables pivoting data based on dimension attributes for analytics. Business keys in a dimension helps in maintaining lineage with OLTP data sources, from where data is collected and warehoused typically using ETL methods. One can't correctly slice data in a fact with a dimension whose key is not present directly / indrectly in the fact table. A combination of all the foreign keys in a fact table can be used as a composite primary key in many cases.

3) If a fact table is not associated with a dimension table, and still one wants to associate a dimension with a fact without adding primary key from dimension into the fact table, one can create bridge fact tables. These bridge tables contain only keys from dimension and fact, which creates an indirect association between dimension and fact tables.

What are measures and where are they stored ?

1) Attributes of a fact table are known as measures. Fact table should contain calculable data, which is measure of different attributes of an entity. For example, for a Sales fact table would have order value, number of units in order, tax on order value etc, profit margin percentage etc.

2) Additivity of different measures can be different - some can be fully additive like order value, some can be partially additive like tax, and some can be non-additive like profit margin percentage. Based on the nature of addivity, different aggregation functions can be applied on the attributes for roll-ups.

1 comment:

Varun Gulati said...

Nice blog, got chance to read one today...

Related Posts with Thumbnails