Data Warehouse
Data: meaningful information.
Data Warehouse:
A data warehouse is constructed by integrating data from multiple heterogeneous sources. It supports analytical reporting,
structured and/or ad hoc queries, and decision making. Data warehousing involves data cleaning, data integration, and data
consolidation.
Understanding a Data Warehouse
•A data warehouse is a database, which is kept separate from the organization's operational database.
•A data warehouse is not updated frequently.
•It possesses consolidated historical data, which helps the organization to analyze its business.
•A data warehouse helps executives to organize, understand, and use their data to make strategic decisions.
•Data warehouse systems help in the integration of a diversity of application systems.
•A data warehouse system helps in consolidated historical data analysis.
Functions of Data Warehouse Tools and Utilities
The following are the functions of data warehouse tools and utilities −
•Data Extraction − Involves gathering data from multiple heterogeneous sources.
•Data Cleaning − Involves finding and correcting the errors in data.
•Data Transformation − Involves converting the data from legacy format to warehouse format.
•Data Loading − Involves sorting, summarizing, consolidating, checking integrity, and building indices and partitions.
•Refreshing − Involves propagating updates from the data sources to the warehouse.
Note − Data cleaning and data transformation are important steps in improving the quality of data and data mining results.
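The functions above can be sketched as a tiny pipeline. This is only an illustration, not a real ETL tool: the source rows, the sales table, and the function names are all hypothetical, and an in-memory SQLite database stands in for the warehouse. Refreshing would amount to running the same steps again on newly arrived source rows.

```python
import sqlite3

# Hypothetical source records from two heterogeneous systems (assumed data).
CRM_ROWS = [{"customer": " Alice ", "amount": "120.50", "day": "2020-01-05"},
            {"customer": "Bob", "amount": "n/a", "day": "2020-01-06"}]
POS_ROWS = [("carol", 80, "06/01/2020")]  # (customer, amount, DD/MM/YYYY)

def extract():
    """Data extraction: gather rows from every source in one common shape."""
    rows = list(CRM_ROWS)
    rows += [{"customer": c, "amount": str(a), "day": d} for c, a, d in POS_ROWS]
    return rows

def clean(rows):
    """Data cleaning: trim names and drop rows whose amount is not numeric."""
    cleaned = []
    for row in rows:
        try:
            amount = float(row["amount"])
        except ValueError:
            continue  # a real pipeline would log or quarantine the bad row
        cleaned.append({**row, "customer": row["customer"].strip(), "amount": amount})
    return cleaned

def transform(rows):
    """Data transformation: convert legacy formats to the warehouse format."""
    out = []
    for row in rows:
        day = row["day"]
        if "/" in day:  # legacy DD/MM/YYYY -> ISO YYYY-MM-DD
            dd, mm, yyyy = day.split("/")
            day = f"{yyyy}-{mm}-{dd}"
        out.append((row["customer"].title(), row["amount"], day))
    return out

def load(conn, rows):
    """Data loading: sort the rows, insert them, and build an index."""
    conn.execute("CREATE TABLE IF NOT EXISTS sales (customer TEXT, amount REAL, day TEXT)")
    conn.executemany("INSERT INTO sales VALUES (?, ?, ?)", sorted(rows, key=lambda r: r[2]))
    conn.execute("CREATE INDEX IF NOT EXISTS idx_sales_day ON sales (day)")
    conn.commit()

conn = sqlite3.connect(":memory:")
load(conn, transform(clean(extract())))
loaded = conn.execute("SELECT customer, amount, day FROM sales ORDER BY day").fetchall()
print(loaded)
```

The row with the non-numeric amount is dropped during cleaning, and the legacy date is rewritten during transformation before anything reaches the warehouse table.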
Data Mart
• A data mart contains a subset of organization-wide data. This subset is valuable to specific groups within an organization.
• In other words, data marts contain data specific to a particular group. For example, the marketing data mart
may contain data related to items, customers, and sales. Data marts are confined to subjects.
• Points to remember about data marts −
• Windows-based or Unix/Linux-based servers are used to implement data marts. They are implemented on low-cost servers.
• The implementation cycle of a data mart is measured in short periods of time, i.e., in weeks rather than months or years.
• The life cycle of a data mart may become complex in the long run if its planning and design are not organization-wide.
• Data marts are small in size.
• Data marts are customized by department.
• The source of a data mart is a departmentally structured data warehouse.
• Data marts are flexible.
Enterprise Warehouse:
• An enterprise warehouse collects all the information and the subjects spanning an entire organization.
• It provides enterprise-wide data integration.
• The data is integrated from operational systems and external information providers.
• The volume of this information can vary from a few gigabytes to hundreds of gigabytes, terabytes or beyond.
Star schema
• A star schema is an architectural structure used to create and implement data warehouse systems, in which
a single fact table is connected to multiple dimension tables. Drawn as a diagram, the layout resembles a star.
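A star schema can be sketched with one fact table and two dimension tables. This is a minimal illustration using an in-memory SQLite database; the table names, column names, and rows are assumptions chosen for the example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
-- Dimension tables: descriptive attributes (all names are hypothetical).
CREATE TABLE dim_item     (item_id INTEGER PRIMARY KEY, item_name TEXT);
CREATE TABLE dim_customer (customer_id INTEGER PRIMARY KEY, customer_name TEXT);
-- The single fact table: one foreign key per dimension plus a measure.
CREATE TABLE fact_sales (
    item_id     INTEGER REFERENCES dim_item (item_id),
    customer_id INTEGER REFERENCES dim_customer (customer_id),
    amount      REAL
);
INSERT INTO dim_item VALUES (1, 'Pen'), (2, 'Notebook');
INSERT INTO dim_customer VALUES (10, 'Alice'), (20, 'Bob');
INSERT INTO fact_sales VALUES (1, 10, 3.0), (2, 10, 7.5), (1, 20, 3.0);
""")

# A typical star query: join the fact table to a dimension and aggregate.
sales_by_item = conn.execute("""
    SELECT i.item_name, SUM(f.amount)
    FROM fact_sales AS f
    INNER JOIN dim_item AS i ON i.item_id = f.item_id
    GROUP BY i.item_name
    ORDER BY i.item_name
""").fetchall()
print(sales_by_item)
```

Every dimension is exactly one join away from the fact table, which is what gives the schema its star shape.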
Snowflake schema
• The snowflake schema consists of one fact table which is linked to many dimension tables, which can be linked to other
dimension tables through a many-to-one relationship. Tables in a snowflake schema are generally normalized to the third normal
form. Each dimension table represents exactly one level in a hierarchy.
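The difference from a star schema can be sketched by normalizing one dimension into a hierarchy. In this illustrative example (all table names and rows are assumptions), the item dimension points to a separate category dimension through a many-to-one relationship, so a query by category needs two joins.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
-- Upper level of the hierarchy (hypothetical names throughout).
CREATE TABLE dim_category (category_id INTEGER PRIMARY KEY, category_name TEXT);
-- Lower level: many items belong to one category.
CREATE TABLE dim_item (
    item_id     INTEGER PRIMARY KEY,
    item_name   TEXT,
    category_id INTEGER REFERENCES dim_category (category_id)
);
CREATE TABLE fact_sales (
    item_id INTEGER REFERENCES dim_item (item_id),
    amount  REAL
);
INSERT INTO dim_category VALUES (1, 'Stationery'), (2, 'Electronics');
INSERT INTO dim_item VALUES (1, 'Pen', 1), (2, 'Notebook', 1), (3, 'Mouse', 2);
INSERT INTO fact_sales VALUES (1, 3.0), (2, 7.5), (3, 20.0);
""")

# Fact -> item -> category: one extra join per level of the hierarchy.
sales_by_category = conn.execute("""
    SELECT c.category_name, SUM(f.amount)
    FROM fact_sales AS f
    INNER JOIN dim_item AS i ON i.item_id = f.item_id
    INNER JOIN dim_category AS c ON c.category_id = i.category_id
    GROUP BY c.category_name
    ORDER BY c.category_name
""").fetchall()
print(sales_by_category)
```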
Table:
• A table consists of columns and rows. In relational databases and flat file databases, a table is a set of data elements (values) using a
model of vertical columns (identifiable by name) and horizontal rows, the cell being the unit where a row and column intersect. A
table has a specified number of columns, but can have any number of rows.
Dimension & Fact Tables:
Dimension: Dimensions store the textual descriptions of the business. With the help of dimensions you can easily identify the measures.
The different types of dimension tables are listed below.
Types of Dimension Tables
Slowly Changing Dimensions: This is a popular dimension type. Some attributes of a dimension change over
time, and whether the history of changes to a particular attribute should be preserved in the data
warehouse depends on the business requirement. Such an attribute is called a slowly changing attribute, and a dimension
containing such an attribute is called a slowly changing dimension. E.g. a home address does not change often, so it is an SCD attribute.
Types of SCD:
Type 1: In Type 1 Slowly Changing Dimension, the new information simply overwrites the original information. In other words, no
history is kept.
Advantages: SCD-1 is the easiest way to handle the Slowly Changing Dimension problem, since there is no need to keep track of
the old information.
Disadvantages: All history is lost. By applying this methodology, it is not possible to trace back in history. For example, if Charlie's
state is overwritten after a move from Illinois to California, the company would no longer know that Charlie lived in Illinois before.
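A Type 1 change is a plain overwrite. The sketch below assumes a hypothetical dim_customer table and key value, and uses the Charlie example with an assumed move from Illinois to California.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE dim_customer (customer_key INTEGER PRIMARY KEY, name TEXT, state TEXT)"
)
conn.execute("INSERT INTO dim_customer VALUES (1001, 'Charlie', 'Illinois')")

# Type 1: overwrite the attribute in place; the old value is not kept anywhere.
conn.execute(
    "UPDATE dim_customer SET state = ? WHERE customer_key = ?",
    ("California", 1001),
)

type1_rows = conn.execute(
    "SELECT customer_key, name, state FROM dim_customer"
).fetchall()
print(type1_rows)
```

After the update there is still exactly one row for Charlie, and nothing in the table records the earlier state.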
Type 2: In Type 2 Slowly Changing Dimension, a new record is added to the table to represent the new information.
Therefore, both the original and the new record will be present. The new record gets its own primary key.
Advantages:
• Type 2 is a popular SCD type in data warehousing. It preserves the entire history of changes, which Type 1 and Type 3 cannot do.
Disadvantages:
• Complex ETL is required to do change data capture and perform the SCD Type 2 process.
• A new record is inserted every time there is a change, which causes the table to grow fast.
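A Type 2 change can be sketched as "close the current row, then insert a new one". The validity dates, the is_current flag, the business key customer_id, and the helper name apply_type2_change are assumptions added for illustration (they are a common convention, not something these notes prescribe); the new row receives its own surrogate primary key.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE dim_customer (
        customer_key INTEGER PRIMARY KEY,  -- surrogate key, new for every version
        customer_id  TEXT,                 -- business key shared by all versions
        name         TEXT,
        state        TEXT,
        valid_from   TEXT,
        valid_to     TEXT,                 -- NULL while the version is current
        is_current   INTEGER
    )
""")
conn.execute(
    "INSERT INTO dim_customer VALUES (1, 'C-1', 'Charlie', 'Illinois', '2015-01-01', NULL, 1)"
)

def apply_type2_change(conn, customer_id, new_state, change_date):
    """Expire the current version and add a new row with its own key."""
    name = conn.execute(
        "SELECT name FROM dim_customer WHERE customer_id = ? AND is_current = 1",
        (customer_id,),
    ).fetchone()[0]
    conn.execute(
        "UPDATE dim_customer SET valid_to = ?, is_current = 0 "
        "WHERE customer_id = ? AND is_current = 1",
        (change_date, customer_id),
    )
    conn.execute(
        "INSERT INTO dim_customer "
        "(customer_id, name, state, valid_from, valid_to, is_current) "
        "VALUES (?, ?, ?, ?, NULL, 1)",
        (customer_id, name, new_state, change_date),
    )

apply_type2_change(conn, "C-1", "California", "2019-06-01")  # assumed date
history = conn.execute(
    "SELECT customer_key, state, is_current FROM dim_customer ORDER BY customer_key"
).fetchall()
print(history)
```

Both versions remain queryable, which is why the table grows by one row per change.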
Type 3: In Type 3 Slowly Changing Dimension, there will be two columns to indicate the particular attribute of
interest, one indicating the original value, and one indicating the current value. There will also be a column
that indicates when the current value becomes active.
Advantages:
• Does not increase the number of rows in the table, since new information is updated in the same row.
• Allows us to store some part of the history.
Disadvantages:
• Type 3 will not be able to keep all history where an attribute is changed more than once. For example, if Charlie later moves to
Texas on April 15, 2020, the California information will be lost.
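The Type 3 behaviour, including the loss of the intermediate value, can be sketched as follows. The column names, the key value, the helper apply_type3_change, and the date of the first move are assumptions; only the Texas move on April 15, 2020 comes from the example above.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE dim_customer (
        customer_key   INTEGER PRIMARY KEY,
        name           TEXT,
        original_state TEXT,  -- never overwritten
        current_state  TEXT,  -- overwritten on every change
        effective_date TEXT   -- when current_state became active
    )
""")
conn.execute(
    "INSERT INTO dim_customer VALUES (1001, 'Charlie', 'Illinois', 'Illinois', '2015-01-01')"
)

def apply_type3_change(conn, customer_key, new_state, change_date):
    """Update the current-value column in the same row; keep the original."""
    conn.execute(
        "UPDATE dim_customer SET current_state = ?, effective_date = ? "
        "WHERE customer_key = ?",
        (new_state, change_date, customer_key),
    )

apply_type3_change(conn, 1001, "California", "2019-06-01")  # assumed date
apply_type3_change(conn, 1001, "Texas", "2020-04-15")

type3_row = conn.execute(
    "SELECT original_state, current_state, effective_date FROM dim_customer"
).fetchone()
print(type3_row)
```

After the second change the row holds only Illinois and Texas; California no longer appears anywhere.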
• Rapidly Changing Dimensions: A dimension attribute that changes frequently is a rapidly changing attribute. If you
don't need to track the changes, the rapidly changing attribute is no problem, but if you do need to track the
changes, using a standard slowly changing dimension technique can result in a huge inflation of the size of the
dimension. One solution is to move the attribute to its own dimension, with a separate foreign key in the fact table.
This new dimension is called a rapidly changing dimension. E.g. body temperature is a rapidly changing attribute.
• Junk Dimensions: A junk dimension is a single table with a combination of different and unrelated attributes to avoid
having a large number of foreign keys in the fact table. Junk dimensions are often created to manage the foreign
keys created by rapidly changing dimensions. Examples of such attributes are flags, weights, and BMI (body mass index).
• Degenerate Dimensions: A degenerate dimension is a dimension attribute that is stored as part of the fact table, and
not in a separate dimension table. These are essentially dimension keys for which there are no other attributes. In a
data warehouse, these are often used as the result of a drill-through query to analyze the source of an aggregated
number in a report. You can use these values to trace back to transactions in the OLTP system. For example, a receipt
number has no dimension table associated with it. Such details are kept for informational purposes only.
• Conformed Dimensions: A dimension that is used in multiple locations is called a conformed dimension. A conformed
dimension may be used with multiple fact tables in a single database, or across multiple data marts or data
warehouses. An example is the Customer dimension: both the marketing and sales
departments can use it for their reporting.
• Static Dimensions: Static dimensions are not extracted from the original data source, but are created within the
context of the data warehouse. A static dimension can be loaded manually - for example with status codes - or it can
be generated by a procedure, such as a date or time dimension.
• Role Playing Dimensions: A role-playing dimension is one where the same dimension key - along with its associated
attributes - can be joined to more than one foreign key in the fact table. For example, a fact table may include foreign
keys for both ship date and delivery date. But the same date dimension attributes apply to each foreign key, so you
can join the same dimension table to both foreign keys. Here the date dimension plays multiple roles, mapping ship
date as well as delivery date, hence the name role-playing dimension. Likewise, a single date
dimension can serve “date of sale”, “date of delivery”, and “date of hire”.
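The role-playing case can be sketched by joining one date dimension twice under different aliases. The tables dim_date and fact_orders and their contents are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE dim_date (date_key INTEGER PRIMARY KEY, full_date TEXT);
-- Two foreign keys in the fact table point at the same dimension.
CREATE TABLE fact_orders (
    order_id          INTEGER PRIMARY KEY,
    ship_date_key     INTEGER REFERENCES dim_date (date_key),
    delivery_date_key INTEGER REFERENCES dim_date (date_key)
);
INSERT INTO dim_date VALUES (20200101, '2020-01-01'), (20200104, '2020-01-04');
INSERT INTO fact_orders VALUES (1, 20200101, 20200104);
""")

# Each alias is one "role" of the same physical dimension table.
orders = conn.execute("""
    SELECT o.order_id, ship.full_date, delivery.full_date
    FROM fact_orders AS o
    INNER JOIN dim_date AS ship     ON ship.date_key = o.ship_date_key
    INNER JOIN dim_date AS delivery ON delivery.date_key = o.delivery_date_key
""").fetchall()
print(orders)
```

Only one date table is stored and maintained; the aliases ship and delivery give it a different meaning in each join.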
Types of Facts: There are three types of facts:
• Additive: Additive facts are facts that can be summed up through all of the dimensions in the fact table.
• Semi-Additive: Semi-additive facts are facts that can be summed up for some of the dimensions in the fact table, but
not the others.
• Non-Additive: Non-additive facts are facts that cannot be summed up for any of the dimensions present in the fact
table.
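The difference between additive and semi-additive facts can be sketched with a small balance table. The example data is assumed: deposits is treated as an additive fact, while balance is a commonly cited semi-additive fact that can be summed across accounts but not across days. A non-additive fact, such as a ratio or percentage, is not shown.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE fact_balance (day TEXT, account TEXT, deposits REAL, balance REAL);
INSERT INTO fact_balance VALUES
    ('2020-01-01', 'A', 100.0, 100.0),
    ('2020-01-02', 'A',  50.0, 150.0),
    ('2020-01-01', 'B', 200.0, 200.0),
    ('2020-01-02', 'B',   0.0, 200.0);
""")

# Additive: deposits can be summed across every dimension (day and account).
total_deposits = conn.execute(
    "SELECT SUM(deposits) FROM fact_balance"
).fetchone()[0]

# Semi-additive: balance sums meaningfully across accounts for a single day...
balance_on_day2 = conn.execute(
    "SELECT SUM(balance) FROM fact_balance WHERE day = '2020-01-02'"
).fetchone()[0]

# ...but summing it across days gives a number with no business meaning.
misleading_sum = conn.execute(
    "SELECT SUM(balance) FROM fact_balance WHERE account = 'A'"
).fetchone()[0]

print(total_deposits, balance_on_day2, misleading_sum)
```

Account A never held 250.0, which is why the last sum is misleading; across time, a balance is usually reported as a latest value or an average instead.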
Indexes:
• Primary Index − A primary index is defined on the primary key of the data file, where the data file is already ordered
by the primary key. It is an ordered file whose records are of fixed length with two fields (the key and a pointer to the data block).
• Secondary Index − A secondary index may be generated from a field which is a candidate key and has a unique value
in every record, or from a non-key field with duplicate values.
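As a rough illustration in SQLite (table and index names are hypothetical), a table with an INTEGER PRIMARY KEY is stored ordered by that key, and CREATE INDEX adds a secondary index on another column. EXPLAIN QUERY PLAN shows which one a query uses; the exact wording of the plan text varies between SQLite versions, so the sketch only checks for the index name.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE customer (customer_id INTEGER PRIMARY KEY, name TEXT, city TEXT)"
)
conn.executemany(
    "INSERT INTO customer (name, city) VALUES (?, ?)",
    [("Alice", "Pune"), ("Bob", "Delhi"), ("Carol", "Pune")],
)

# Secondary index on a non-key column that contains duplicate values.
conn.execute("CREATE INDEX idx_customer_city ON customer (city)")

def plan_for(sql):
    """Return the human-readable detail column of EXPLAIN QUERY PLAN."""
    rows = conn.execute("EXPLAIN QUERY PLAN " + sql).fetchall()
    return " ".join(row[-1] for row in rows)

pk_plan = plan_for("SELECT name FROM customer WHERE customer_id = 2")
city_plan = plan_for("SELECT name FROM customer WHERE city = 'Pune'")
print(pk_plan)
print(city_plan)
```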
Joins:
Inner Join: selects records that have matching values in both tables.
SELECT column_name(s)
FROM table1
INNER JOIN table2
ON table1.column_name = table2.column_name;
Left Outer Join: returns all records from the left table (table1), and the matching records from the right table (table2). When there
is no match, the columns from the right side are NULL.
SELECT column_name(s)
FROM table1
LEFT JOIN table2
ON table1.column_name = table2.column_name;
Right Outer Join: returns all records from the right table (table2), and the matching records from the left table (table1). When there
is no match, the columns from the left side are NULL.
SELECT column_name(s)
FROM table1
RIGHT JOIN table2
ON table1.column_name = table2.column_name;
Full Outer Join: returns all records from both the left (table1) and right (table2) tables, matched where possible. Columns from the side without a match are NULL.
SELECT column_name(s)
FROM table1
FULL OUTER JOIN table2
ON table1.column_name = table2.column_name
WHERE condition;
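The join types can be tried on two small tables. This sketch uses hypothetical customers and orders tables in an in-memory SQLite database. Because RIGHT JOIN and FULL OUTER JOIN are only available in newer SQLite releases (3.39 and later), the right outer join is shown here by swapping the table order in a LEFT JOIN, which returns the same rows.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE customers (customer_id INTEGER PRIMARY KEY, name TEXT);
CREATE TABLE orders (order_id INTEGER PRIMARY KEY, customer_id INTEGER, amount REAL);
INSERT INTO customers VALUES (1, 'Alice'), (2, 'Bob');
-- Order 101 refers to a customer that does not exist in customers.
INSERT INTO orders VALUES (100, 1, 25.0), (101, 3, 40.0);
""")

# Inner join: only rows with a match on both sides.
inner_rows = conn.execute("""
    SELECT c.name, o.order_id
    FROM customers AS c
    INNER JOIN orders AS o ON c.customer_id = o.customer_id
    ORDER BY c.name
""").fetchall()

# Left join: every customer, with NULL (None in Python) where no order matches.
left_rows = conn.execute("""
    SELECT c.name, o.order_id
    FROM customers AS c
    LEFT JOIN orders AS o ON c.customer_id = o.customer_id
    ORDER BY c.name
""").fetchall()

# Equivalent of customers RIGHT JOIN orders: every order is kept.
right_rows = conn.execute("""
    SELECT c.name, o.order_id
    FROM orders AS o
    LEFT JOIN customers AS c ON c.customer_id = o.customer_id
    ORDER BY o.order_id
""").fetchall()

print(inner_rows, left_rows, right_rows)
```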
Union & Union all
• The UNION operator is used to combine the result-set of two or more SELECT statements.
• Every SELECT statement within UNION must have the same number of columns.
• The columns must also have similar data types.
• The columns in every SELECT statement must also be in the same order.
SELECT column_name(s) FROM table1
UNION
SELECT column_name(s) FROM table2;
The UNION operator selects only distinct values by default. To allow duplicate values, use UNION ALL:
SELECT column_name(s) FROM table1
UNION ALL
SELECT column_name(s) FROM table2;
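The effect of UNION versus UNION ALL is easiest to see when both tables share a value. The tables customers and suppliers and their city values below are made up for the example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE customers (city TEXT);
CREATE TABLE suppliers (city TEXT);
INSERT INTO customers VALUES ('Berlin'), ('London');
INSERT INTO suppliers VALUES ('London'), ('Paris');
""")

# UNION removes duplicates: London appears once.
union_rows = conn.execute("""
    SELECT city FROM customers
    UNION
    SELECT city FROM suppliers
    ORDER BY city
""").fetchall()

# UNION ALL keeps every row: London appears twice.
union_all_rows = conn.execute("""
    SELECT city FROM customers
    UNION ALL
    SELECT city FROM suppliers
    ORDER BY city
""").fetchall()

print(union_rows)
print(union_all_rows)
```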
Self Join: A self join is a regular join, but the table is joined with itself.
SELECT column_name(s)
FROM table1 T1, table1 T2
WHERE condition;
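A common use of a self join is pairing each employee with a manager stored in the same table. The employees table and its rows are assumed for illustration; the query uses the same two-alias form as the template above.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE employees (emp_id INTEGER PRIMARY KEY, name TEXT, manager_id INTEGER);
INSERT INTO employees VALUES (1, 'Dana', NULL), (2, 'Eli', 1), (3, 'Fay', 1);
""")

# The same table appears twice: e is the employee role, m is the manager role.
pairs = conn.execute("""
    SELECT e.name, m.name
    FROM employees e, employees m
    WHERE e.manager_id = m.emp_id
    ORDER BY e.name
""").fetchall()
print(pairs)
```

Dana has no manager, so she appears only on the manager side of the result.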
References:
SQL: https://www.w3schools.com/sql/default.asp
Practice: https://www.w3schools.com/sql/trysql.asp?filename=trysql_select_join_inner