Wednesday, January 09, 2013
I'm reading: 70-463 Certification Exam Guide / Notes - Part ITweet this !
This blog has been silent almost across the previous year with very little activity. I got married, got 7 projects at a time on my back, worked as a Senior Architect, reviewed a book and lots more. I struggled striking a balance between personal life and professional life, and now I am trying to come back to business.
As a part of my role, I am pursuing and mentoring a group to pass the exam 70-463: Implementing a Data Warehouse with Microsoft SQL Server 2012. Through the series of posts I would share my notes for this exam preparation. This is Part I of the notes.
Why do we need a Data Warehouse ?
1) Data stored in normalized schema in OLTP systems can have hundreds to thousands of tables for an enterprise. OLTP systems have a probability, that a portion of these tables can be often less descriptive due to lack of self relevant naming conventions. This makes designing queries harder for reporting purposes.
2) As normalised schema has a lot of tables, a single piece of related information is split and stored into various tables with referential integrity constraints. This means that reading this data requires creating joins with many tables. For reporting and analysis purposes over a very huge dataset spanning thousands to millions of records of historical data, such queries would perform very poorly.
3) OLTP systems tend to archive data time to time on a scheduled basis. It might be in the form of hard / soft delete. Lack of historical data can limit the level of analysis that can be performed on the data.
4) OLTP systems tend to store most updated version of data only. For ex employee address, product name, martial status etc. Generally history for such attributes would not be preserved, which results in loss of historical data. Lack of history for such attributes can limit the level of analysis that can be performed on the data.
5) OLTP data in an enterprise can be federated across different departments / business units / associate group companies. Each of it would typically have their own set of applications. There would be inevitably as set of common master data duplicated across the units. Due to duplicate and non-synchronized master data, consolidated reporting and analysis that combines data across the enterprise becomes almost impossible.
6) As each working unit of an enterprise can have their own OLTP system, different attributes can have data representation in different forms for the same data. For ex, an employee is permanent can be represented by Yes / No, Y / N, 1 / 0 etc. Even this makes data extremely difficult to interpret and hence raising the challenges to a centralized reporting over the enterprise data.
7) OLTP systems can contain data in free form, which can lead to degraded data quality. For example, free form address entry can lead to a very poor quality of geographic information due to typographic mistakes, effectively leading to unusable data and inviting data cleansing exercise.
8) OLTP systems tend to have a very good classification of business entities, but the some entities required for analysis are not normalized to the most detailed extent. For example, attributes related to date would would stored in datetime format. Entities like date and time has constituents like day / week / month / quater / year / fiscal year / sec / minute / hour etc. Mostly reporting and analytics works on a time scale. To report data based on any particular constituent of date / time, it would require extracting this constituent from the date value on a very huge dataset. This could lead to great performance issues.
What is a Star schema and when it should be used ?
1) Data model / schema used to create a data warehouse is also known as dimensional model. In this model, reference / lookup / master tables are called dimensions. Measureable details of data are called measures / facts and the tables that host the same are called Fact Tables.
2) Typcially, a simplified denormalized schema that covers the most granular section of a business entity can be represented by a star schema. Such a schema would have a single fact table containing measurable / aggregatable values. This fact table would have foreign keys from all dimension tables that completes identity of that record and provides different angles of analysis.
If you look at AdventureworksDW 2012 schema, FactInternetSales / FactResellerSales is a Fact Table. This table has reference from different dimensions like Date, Time, Product, Customers etc. Together these Fact and Dimensions form a schema known as Star Schema.
What is a Snowflake schema and when it should be used
1) A typical dimension is a highly denormalized business entity. If the same is normalized to 2nd or 3rd normal form, then the same star schema is termed as snowflake schema. For example, Product-ProductCategory-ProductSubCategory, Country-State-City etc are examples of snowflaked Product and Geography dimension respectively.
2) From a warehousing perspective, any data analysis done on data sourced from warehouse directly would suffer performance due to more joins as a result of more tables in a snowflake schema. From the perspective of creating data marts, snowflake schema would provide options to source data more selectively to datamarts.
3) Any typical enterprise class dimensional model would inevitably contain a snowflaked model of schema. Even if one selects not to normalize business entity into normalized dimensions, still dimensions would contain some common attributes like date keys, time keys, status keys etc. So Date dimension would be related Patent dimension by the date key for patenteddate field.
What is a Conformed Dimension ?
1) Theoretically considering if each discrete business area has a star schema, then different business areas typically tend to be connected to be each other. So if a star schema is created for each business unit in a enterprise, all these star schemas would have to be connected too.
2) Two schemas can be connected when they have something in common. If a dimension is shared across more than one schemas, then such dimensions are known as conformed dimensions. For ex, Date dimension would be shared by almost all the Fact tables in the schema.
What is a Data mart ?
1) Using a data warehouse as the source, technologies like SQL Server Analysis Services creates multidimensional data structures known as Cube. A cube contains dimensions and measures that can be used for large scale real-time analysis. These systems are known as Online Analytical Processing (OLAP) applications.
2) Typically a data warehouse can contain data from the entire enterprise along with history. A data mart would source data from a data warehouse and build specialized data structures on the top of the same for reporting and analytics.
What is Dimensionality / Granularity ?
1) If two business entities contain data at the same scale, then they are said to be at the same level of granularity. For example, if sales data is stores weekly in facts and orders data is stored daily, then order data can be said to have a lower granularity compared to sales.
2) To associate two entities in a dimensional model, they need to be calculated as the same level of granularity.
Feel free to share you queries on this posts, and I would try to answer the same in best possible way.