MCS 221 Notes
{Block1}
{CH=1}
DATA WAREHOUSING: Define data warehouse. List and explain 4 characteristics of a data warehouse. (10)
Data Warehouse is used to collect and manage data from various sources, in order to provide meaningful business
insights. A data warehouse is usually used for linking and analyzing heterogeneous sources of business data. Data
warehouses gather data from multiple sources (including databases), with an emphasis on storing, filtering, retrieving
and in particular, analyzing huge quantities of organized data. Data warehouses are used extensively in the largest and
most complex businesses around the world. In demanding situations, good decision making becomes critical.
Benefits/Characteristics:
Integrated Data: One of the key characteristics of a data warehouse is that it contains integrated data. This means
that the data is collected from various sources, such as transactional systems, and then cleaned, transformed, and
consolidated into a single, unified view. This allows for easy access and analysis of the data, as well as the ability to
track data over time.
Subject-Oriented: A data warehouse is also subject-oriented, which means that the data is organized around
specific subjects, such as customers, products, or sales. This allows for easy access to the data relevant to a specific
subject, as well as the ability to track the data over time.
Non-Volatile: Another characteristic of a data warehouse is that it is non-volatile. This means that the data in the
warehouse is never updated or deleted, only added to. This is important because it allows for the preservation of
historical data, making it possible to track trends and patterns over time.
Time-Variant: A data warehouse is also time-variant, which means that the data is stored with a time dimension.
This allows for easy access to data for specific time periods, such as last quarter or last year. This makes it possible
to track trends and patterns over time.
Top-down Approach: Bill Inmon’s design methodology is based on a top-down approach. In the top-down
approach, the data warehouse is designed first and the data marts are then built on top of the data warehouse.
Below are the steps involved in the top-down approach:
• Data is extracted from the various source systems. The extracts are loaded and validated in the staging area.
Validation is required to make sure the extracted data is accurate and correct. ETL tools or a similar approach
can be used to extract the data and push it to the data warehouse.
• Data is extracted from the data warehouse on a regular basis into the staging area. At this step, various
aggregation and summarization techniques are applied to the extracted data, which is then loaded back to the data warehouse.
• Once the aggregation and summarization are completed, the various data marts extract that data and apply
some more transformations to shape the data into the structure defined by the data marts.
Bottom-up Approach: Ralph Kimball’s data warehouse design approach is called dimensional modelling or
the Kimball methodology. This methodology follows the bottom-up approach. As per this method, data marts are
first created to provide reporting and analytics capability for specific business processes; later, the enterprise data
warehouse is created from these data marts. Basically, the Kimball model reverses the Inmon model, i.e. data marts are
loaded directly with data from the source systems, and the ETL process is then used to load the data warehouse.
Below are the steps involved in the bottom-up approach:
• The data flow in the bottom-up approach starts with the extraction of data from the various source systems into the
staging area, where it is processed and loaded into the data marts that handle specific business processes.
• After the data marts are refreshed, the current data is once again extracted into the staging area and transformations are
applied to fit the data mart structure. The data is then extracted from the data marts to the staging area,
aggregated, summarized and so on, loaded into the EDW, and then made available to the end users for analysis,
enabling critical business decisions.
• Load Manager: The Load Manager component of the data warehouse is responsible for the collection of data from
operational systems and converts it into a usable form for the users. This component is responsible for
importing and exporting data from operational systems.
• Warehouse Manager: The warehouse manager is the centre of the data-warehousing system and is the data
warehouse itself. It is a large, physical database that holds a vast amount of information from a wide variety of
sources. The data within the data warehouse is organized such that it becomes easy to find, use and update
frequently from its sources.
• Query Manager: Query Manager Component provides the end-users with access to the stored warehouse
information through the use of specialized end-user tools. Data mining access tools have various categories such
as query and reporting, on-line analytical processing (OLAP), statistics, data discovery and graphical and
geographical information systems.
• End-user access tools: These are divided into the following categories:
Reporting Data
Query Tools
Data Dippers
Tools for EIS
Tools for OLAP and tools for data mining.
OLTP AND OLAP short note on OLTP (5)
• Online Transaction Processing (OLTP): An OLTP system captures and maintains transaction data in a
database. Each transaction involves individual database records made up of multiple fields or columns. Examples
include banking and credit card activity or retail checkout scanning.
In OLTP, the emphasis is on fast processing, because OLTP databases are read, written, and updated frequently. If
a transaction fails, built-in system logic ensures data integrity.
• Online Analytical Processing (OLAP): Online Analytical Processing (OLAP) applies complex queries to
large amounts of historical data, aggregated from OLTP databases and other sources, for data mining, analytics, and
business intelligence projects. In OLAP, the emphasis is on response time to these complex queries. Each query
involves one or more columns of data aggregated from many rows.
Examples include year-over-year financial performance or marketing lead generation trends.
OLTP is operational, while OLAP is informational.
Difference between OLTP and OLAP:
• Characteristics: OLTP handles a large number of small transactions; OLAP handles large volumes of data with complex queries.
• Query types: OLTP uses simple standardized queries; OLAP uses complex queries.
• Operations: OLTP is based on INSERT, UPDATE, DELETE commands; OLAP is based on SELECT commands to aggregate data for reporting.
• Response time: OLTP responds in milliseconds; OLAP takes seconds, minutes, or hours depending on the amount of data to process.
• Purpose: OLTP controls and runs essential business operations in real time; OLAP is used to plan, solve problems, support decisions and discover hidden insights.
• Productivity: OLTP increases the productivity of end users; OLAP increases the productivity of business managers, data analysts, and executives.
• Data view: OLTP lists day-to-day business transactions; OLAP gives a multi-dimensional view of enterprise data.
• User examples: OLTP users are customer-facing personnel, clerks, online shoppers; OLAP users are knowledge workers such as data analysts, business analysts, and executives.
• Database design: OLTP uses normalized databases for efficiency; OLAP uses denormalized databases for analysis.
TYPES OF DATA WAREHOUSES: There are three different types of traditional Data Warehouse models as
listed below:
Define a data warehouse. List and explain all 3 types of data warehouse.10
(i) Enterprise Data Warehouse: An enterprise data warehouse provides a central repository database for decision support
throughout the enterprise. It is a central place where all business information from different sources and
applications is made available. Its goal is to provide a complete overview of any particular object in
the data model.
(ii) Operational Data Store: It assists in obtaining data straight from the database, which also supports
transaction processing. The data present in the Operational Data Store can be scrubbed, and any duplication
present can be reviewed and fixed by examining the corresponding business rules.
(iii) Data Mart: A data mart is a subset of a data warehouse, and it supports a specific region, business unit,
or business function. A data mart focuses on storing data for a particular functional area, and it contains a subset of
the data saved in the warehouse. Data marts help in improving user response times and also reduce the volume of data
scanned for analysis.
{CH=2}
DATA WAREHOUSE ARCHITECTURE
Data warehouse architecture defines the arrangement of the data in different databases. When designing a data
warehouse, there are three different types of models to consider, based on the number of tiers the
architecture has.
(i) Single-tier data warehouse architecture (ii) Two-tier data warehouse architecture (iii) Three-tier data warehouse architecture
1. Single-tier data warehouse architecture: The single-tier architecture (Figure 1) is not a frequently
practiced approach. The main goal of having such architecture is to remove redundancy by minimizing the
amount of data stored. Its primary disadvantage is that it doesn’t have a component that separates analytical
and transactional processing.
2. Two-tier data warehouse architecture: The two-tier architecture (Figure 2) includes a staging area for all
data sources, before the data warehouse layer. By adding a staging area between the sources and the storage
repository, you ensure all data loaded into the warehouse is cleansed and in the appropriate format.
3. Three-tier data warehouse architecture: The three-tier approach (Figure 3) is the most widely used
architecture for data warehouse systems. Essentially, it consists of three tiers:
• The bottom tier is the database of the warehouse, where the cleansed and transformed data is loaded.
• The middle tier is the application layer giving an abstracted view of the database. It arranges the data to
make it more suitable for analysis. This is done with an OLAP server, implemented using the ROLAP or
MOLAP model.
• The top-tier is where the user accesses and interacts with the data. It represents the front-end client layer.
You can use reporting tools, query, analysis or data mining tools.
COMPONENTS OF DATA WAREHOUSE ARCHITECTURE
1. Data Warehouse Database: The central component of a DW architecture is the data warehouse database
that stores all enterprise data and makes it manageable for reporting. This means choosing the kind of
database that will be used to store the data in the warehouse.
2. Extraction, Transformation, and Loading Tools (ETL) (Explain detail in ch 4)
What is ETL? Why do you need ETL in data warehouses? Also, explain how ETL works in Data Warehousing? 10
ETL tools are central components of enterprise data warehouse architecture. These tools help extract data from
different sources, transform it into a suitable arrangement, and load it into a data warehouse.
The ETL tool you choose will determine:
The time expended in data extraction
Approaches to extracting data
Kind of transformations applied and the simplicity to do so
Business rule definition for data validation and cleansing to improve end-product analytics
Outlining data distribution from the central repository to your BI applications
3. Metadata:
What is Metadata? What are its contents? Justify how metadata can be an important component in Data
Warehousing? Also mention its types. 10
In the data warehouse architecture, metadata describes the data warehouse database and offers a framework for
data. It helps in constructing, preserving, handling, and making use of the data warehouse. Metadata plays an
important role for businesses and the technical teams to understand the data present in the warehouse and
convert it into information. There are two types of metadata in a data warehouse:
• Technical Metadata comprises information that can be used by developers and managers when executing
warehouse development and administration tasks.
• Business Metadata comprises information that offers an easily understandable standpoint of the data
stored in the warehouse.
4. Data Warehouse Access Tools: A data warehouse uses a database or group of databases as a foundation.
Organizations generally cannot work with these databases without the use of tools unless they have
database administrators available. The tools include:
Query and reporting tools
Application development tools
Data mining tools for data warehousing
OLAP tools
5. Data Warehouse Bus: It defines the data flow within a data warehousing bus architecture and includes a
data mart. A data mart is an access level that allows users to transfer data. It is also used for partitioning data
that is produced for a particular user group.
6. Data Warehouse Reporting Layer: The reporting layer in the data warehouse allows the end-users to
access the BI interface or BI database architecture. The purpose of the reporting layer in the data warehouse is
to act as a dashboard for data visualization, create reports, and take out any required information.
(i) Data Source Layer: The data source layer is the place where the original data, gathered from an assortment of
internal and external sources, resides in the relational database. The following are examples of the data source layer:
Operational Data: product data, inventory data, marketing data, or HR data
Social Media Data: website hits, content popularity, contact page completions
Third-party Data: demographic data, survey data, census data
(ii) Data Staging Layer: This layer sits between the data sources and the data warehouse. In this layer,
data is extracted from the various internal and external data sources. Since source data comes in various
formats, the data extraction layer will use numerous technologies and tools to extract the required
data. Once the extracted data has been loaded, it is subjected to high-level quality checks. The
final outcome is clean and organized data that is loaded into the data warehouse.
(iii) Data Storage Layer: This layer is where the data that was cleansed in the staging area is stored
as a single central repository. Depending on your business and your warehouse architecture requirements, your
data storage may be a data warehouse, a data mart (a data warehouse partially replicated for specific
departments), or an Operational Data Store (ODS).
(iv) Data Presentation Layer: This is where the users interact with the cleansed and organized data. This
layer of the data architecture gives users the ability to query the data for product or service insights, analyze the
data to run hypothetical business scenarios, and create automated or ad-hoc reports.
An OLAP or reporting tool with an easy-to-understand Graphical User Interface (GUI) may be used to help
users build their queries, perform analysis, or design their reports.
DESIGNING THE DATA MARTS: The process for designing a data mart usually comprises the
following steps:
(i) Essential Requirements Gathering: The first step is to create a robust design. Some critical processes involved in
this phase include collecting the corporate and technical requirements, identifying data sources, choosing a suitable
data subset, and designing the logical layout (database schema) and physical structure.
(ii) Build/Construct: The next step is to construct it. This includes creating the physical database and the logical
structures. In this phase, you’ll build the tables, fields, indexes, and access controls.
(iii) Populate/Data Transfer: The next step is to populate the mart, which means transferring data into it. In this
phase, you can also set the frequency of data transfer, such as daily or weekly. This usually involves extracting
source information, cleaning and transforming the data, and loading it into the departmental repository.
(iv) Data Access: In this step, the data loaded into the data mart is used in querying, generating reports, graphs, and
publishing. The main task involved in this phase is setting up a meta-layer and translating database structures and
item names into corporate expressions so that non-technical operators can easily use the data mart. If necessary,
you can also set up API and interfaces to simplify data access.
(v) Manage: The last step involves management and observation, which includes:
Controlling ongoing user access.
Optimization and refinement of the target system for improved performance.
Addition and management of new data into the repository.
Configuring recovery settings and ensuring system availability in the event of failure.
{Ch=3}
DIMENSIONAL MODELING: Dimensional Modelling (DM) is a data structure technique optimized
for data storage in a Data warehouse. The purpose of dimensional modelling is to optimize the database for faster
retrieval of data. A dimensional model in a data warehouse is designed to read, summarize and analyze numeric
information like values, balances, counts, weights, etc. The main objective of dimensional
modelling is to provide an easy architecture for the end user to write queries. It also reduces the number of
relationships between the tables and dimensions, hence providing efficient query handling.
Features of Dimensional Modelling
• Easy to Understand
• Promote Data Quality
• Optimise Performance
• It reduces the number of relationships between different data elements.
Steps Involved in Dimensional Modelling
The following are the steps involved in dimensional modelling, as shown in figure 1.
Step 1: Identify the Business Process: Identify the actual business process a data warehouse should cover. This
could be Marketing, Sales, HR, etc., as per the data analysis needs of the organization. The selection of the business
process also depends on the quality of data available for that process. It is the most important step of the data
modelling process.
Step 2: Identifying Granularity: The Grain describes the level of detail for the business problem/solution. It is the
process of identifying the lowest level of information for any table in your data warehouse. If a table contains sales
data for every day, then it should be daily granularity. If a table contains total sales data for each month, then it has
monthly granularity.
Step 3: Identifying Dimensions and Attributes: The dimensions of the data warehouse can be understood through the
entities of the database, like items, products, date, stocks, time, etc. The identification of the primary keys and the
foreign key specifications are all described here.
Step 4: Build the Schema: The database structure, or the arrangement of columns in the database tables, decides the
schema. There are various popular schemas such as the star, snowflake and fact constellation schemas. Summarizing,
the process runs from the selection of the business process to identifying the finest level of detail of the business
transactions; identifying the significant dimensions and attributes then helps to build the schema.
STAR SCHEMA With the help of an example, explain the star schema dimensional model. 10
It represents the multidimensional model. In this model the data is organized into facts and dimensions. The star
model is the underlying structure for a dimensional model. It has one broad central table (fact table) and a set of
smaller tables (dimensions) arranged in a star design around the primary table. This design is logically shown in the figure.
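A minimal sketch (not part of the original notes) of how a star schema is queried: one central fact table is joined to its dimension tables and then aggregated. Python with pandas is an assumed choice here, and the table and column names (sales_fact, date_dim, product_dim) are illustrative only.

    import pandas as pd

    # Illustrative dimension tables (assumed names and columns, not from the notes)
    date_dim = pd.DataFrame({
        "date_key": [1, 2],
        "full_date": ["2024-01-01", "2024-01-02"],
        "month": ["Jan", "Jan"],
    })
    product_dim = pd.DataFrame({
        "product_key": [10, 11],
        "product_name": ["Pen", "Notebook"],
        "category": ["Stationery", "Stationery"],
    })

    # Central fact table: foreign keys to the dimensions plus numeric measures
    sales_fact = pd.DataFrame({
        "date_key": [1, 1, 2],
        "product_key": [10, 11, 10],
        "units_sold": [5, 2, 7],
        "revenue": [50.0, 40.0, 70.0],
    })

    # A typical star-schema query: join the fact to its dimensions, then aggregate
    report = (sales_fact
              .merge(date_dim, on="date_key")
              .merge(product_dim, on="product_key")
              .groupby(["month", "product_name"])["revenue"].sum())
    print(report)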
SNOWFLAKE SCHEMA Describe the snowflake schema multidimensional modeling technique. Give an
example and illustrate it. List its pros and cons. 10
A snowflake schema is a type of data modeling technique used in data warehousing to represent data in a
structured way that is optimized for querying large amounts of data efficiently. In a snowflake schema, the
dimension tables are normalized into multiple related tables, creating a hierarchical or “snowflake” structure.
In a snowflake schema, the fact table is still located at the center of the schema, surrounded by the dimension
tables. However, each dimension table is further broken down into multiple related tables, creating a hierarchical
structure that resembles a snowflake.
Example: see Figure 4, which illustrates a snowflake schema.
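As an assumed illustration of the snowflake step (not from the notes), the product dimension from the star-schema sketch above can be normalized into separate product and category tables; queries then need one extra join per normalized level, which is the main trade-off of snowflaking.

    import pandas as pd

    # Snowflaking: the product dimension is normalized into two related tables
    product_dim = pd.DataFrame({
        "product_key": [10, 11],
        "product_name": ["Pen", "Notebook"],
        "category_key": [100, 100],          # foreign key into category_dim
    })
    category_dim = pd.DataFrame({
        "category_key": [100],
        "category_name": ["Stationery"],
    })
    sales_fact = pd.DataFrame({
        "product_key": [10, 11, 10],
        "revenue": [50.0, 40.0, 70.0],
    })

    # Queries now need one extra join per normalized level
    by_category = (sales_fact
                   .merge(product_dim, on="product_key")
                   .merge(category_dim, on="category_key")
                   .groupby("category_name")["revenue"].sum())
    print(by_category)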
{Block-2}
{Ch=4}
ETL (Extract, Transform, Load)
What is ETL? Why do you need ETL in data warehouses? Also, explain how ETL works in Data Warehousing? 10
It is a process of data integration that encompasses three steps - extraction, transformation, and loading. In a
nutshell, ETL systems take large volumes of raw data from multiple sources, convert it for analysis, and load that
data into your warehouse.
Why Do You Need ETL?: ETL saves you significant time on data extraction and preparation - time that you can
better spend on evaluating your business. Practicing ETL is also part of a healthy data management workflow,
ensuring high data quality, availability, and reliability. Each of the three major components in the ETL saves time and
development effort by running just once in a dedicated data flow:
Extract: In ETL, the first link determines the strength of the chain. The extract stage determines which data sources
to use, the refresh rate (velocity) of each source, and the priorities (extract order) between them — all of which
heavily impact your time to insight.
Transform: After extraction, the transformation process brings clarity and order to the initial data swamp. Dates
and times combine into a single format and strings parse down into their true underlying meanings. Location data
convert to coordinates, zip codes, or cities/countries. The transform step also sums up, rounds, and averages
measures, and it deletes useless data and errors or discards them for later inspection. It can also mask personally
identifiable information (PII) to comply with GDPR, CCPA, and other privacy requirements.
Load: In the last phase, much as in the first, ETL determines targets and refresh rates. The load phase also
determines whether loading will happen incrementally, or if it will require “upsert” (updating existing data and
inserting new data) for the new batches of data.
ETL PROCESS: ETL collects and processes data from various sources into a single data store (a data
warehouse or data lake), making it much easier to analyze. The three steps in ETL process are mentioned below:
1. Data Extraction: Data extraction involves the following four steps:
• Identify the data to extract: The first step of data extraction is to identify the data sources. These sources
might be from relational SQL databases like MySQL or non-relational NoSQL databases like MongoDB or
Cassandra
• Estimate how large the data extraction is: The size of the data extraction matters. A larger quantity of data
will require a different ETL strategy.
• Choose the extraction method: common approaches include full extraction, incremental extraction, and update notifications from the source system.
2. Data Transformation: What is data transformation? List and explain all the data transformation steps (10)
In ETL, data transformation that occurs in a staging area (after extraction) is called “multistage data transformation”. In ELT,
data transformation that happens after loading data into the data warehouse is called “in-warehouse data
transformation”. Some of the common data transformations, a few of which are sketched in code after this list, are:
• Deduplication (normalizing): Identifies and removes duplicate information.
• Key restructuring: Draws key connections from one table to another.
• Cleansing: Involves deleting old, incomplete, and duplicate data to maximize data accuracy - perhaps
through parsing to remove syntax errors, typos, and fragments of records.
• Format revision: Converts formats in different datasets - like date/time, male/female, and units of
measurement - into one consistent format.
• Derivation: Creates transformation rules that apply to the data. For example, maybe you need to subtract
certain costs or tax liabilities from business revenue figures before analyzing them.
• Aggregation: Gathers and searches data so you can present it in a summarized report format.
• Integration: Reconciles diverse names/values that apply to the same data elements across the data
warehouse so that each element has a standard name and definition.
• Filtering: Selects specific columns, rows, and fields within a dataset.
• Splitting: Splits one column into more than one column.
• Joining: Links data from two or more sources, such as adding spend information across multiple SaaS
platforms.
• Summarization: Creates different business metrics by calculating value totals. For example, you might add
up all the sales made by a specific salesperson to create total sales metrics for specific periods.
• Validation: Sets up automated rules to follow in different circumstances. For instance, if the first five fields
in a row are NULL, then you can flag the row for investigation or prevent it from being processed with the
rest of the information.
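A short, assumed example of a few of the transformations above (deduplication, format revision, derivation and aggregation) using pandas; the column names and values are made up for illustration.

    import pandas as pd
    from datetime import datetime

    # Hypothetical extracted rows with a duplicate and mixed date formats
    raw = pd.DataFrame({
        "order_id":   [1, 1, 2, 3],
        "order_date": ["01/02/2024", "01/02/2024", "2024-02-03", "2024-02-04"],
        "gross":      [100.0, 100.0, 250.0, 80.0],
        "tax":        [18.0, 18.0, 45.0, 14.4],
    })

    def to_iso(d):
        # Format revision: accept DD/MM/YYYY or YYYY-MM-DD and emit one ISO format
        for fmt in ("%d/%m/%Y", "%Y-%m-%d"):
            try:
                return datetime.strptime(d, fmt).date().isoformat()
            except ValueError:
                pass
        return None

    df = raw.drop_duplicates(subset="order_id").copy()            # deduplication
    df["order_date"] = df["order_date"].map(to_iso)               # format revision
    df["net"] = df["gross"] - df["tax"]                           # derivation
    monthly = df.groupby(df["order_date"].str[:7])["net"].sum()   # aggregation/summarization
    print(monthly)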
3. Data Loading: Data loading is the process of loading the extracted information into your target data
repository. Loading is an ongoing process that could happen through “full loading” (the first time you load data
into the warehouse) or “incremental loading” (as you update the data warehouse with new information).
Incremental loads are the more complex of the two; a small upsert sketch follows.
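The upsert idea can be sketched as follows (an assumed example, not from the notes), using SQLite's long-standing INSERT OR REPLACE as a simple stand-in for a full upsert: rows whose key already exists are replaced, new rows are inserted.

    import sqlite3

    con = sqlite3.connect(":memory:")
    con.execute("CREATE TABLE customers (id INTEGER PRIMARY KEY, name TEXT, city TEXT)")

    # Full load: the very first time, everything goes in
    con.executemany("INSERT INTO customers VALUES (?, ?, ?)",
                    [(1, "Asha", "Delhi"), (2, "Ravi", "Pune")])

    # Incremental load with an upsert: existing rows replaced, new rows inserted
    batch = [(2, "Ravi", "Mumbai"),   # changed record -> replaced
             (3, "Meena", "Jaipur")]  # new record -> inserted
    con.executemany("INSERT OR REPLACE INTO customers VALUES (?, ?, ?)", batch)

    print(con.execute("SELECT * FROM customers ORDER BY id").fetchall())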
WORKING OF ETL
1. Parsing/Cleansing: Data generated by applications may be in various formats like JSON, XML, or CSV. The
parsing stage maps data into a table format with headers, columns, and rows, and then extracts specified
fields.
2. Data enrichment: Preparing data for analytics usually requires certain data enrichment steps, including
injecting expert knowledge, resolving discrepancies, and correcting bugs.
3. Setting velocity: “Velocity” refers to the frequency of data loading, i.e. inserting new data and updating
existing data.
4. Data validation: In some cases, data is empty, corrupted, or missing crucial elements. During data
validation, ETL finds these occurrences and determines whether to stop the entire process.
ELT (Extract/load/transform)
Extract/load/transform (ELT) is the process of extracting data from one or multiple sources and loading it into a
target data warehouse. Instead of transforming the data before it’s written, ELT takes advantage of the target
system to do the data transformation. This approach requires fewer remote sources than other techniques because
it needs only raw and unprepared data.
• Extract - This step works similarly in both ETL and ELT data management approaches. Raw streams of data from
virtual infrastructure, software, and applications are ingested either in their entirety or according to predefined
rules.
• Load - ELT differs from ETL here. Rather than delivering this mass of raw data to an interim
processing server for transformation, ELT delivers it directly to the target storage location. This shortens the
cycle between extraction and delivery.
• Transform - The database or data warehouse sorts and normalizes the data, keeping part or all of it on hand
and accessible for customized reporting. The overhead for storing this much data is higher, but it offers more
opportunities to mine it for relevant business intelligence in near real-time.
Need: The ELT process improves data conversion and manipulation capabilities due to parallel load and data
transformation functionality.
The ELT process saves you steps and time. Data is first loaded into the target ecosystem, such as a data warehouse,
and then transformed. Authorized users can securely access the data without returning it to the source systems;
no separate downloading is necessary.
ETL Vs ELT
The primary differences between ETL and ELT are how much data is retained in data warehouses and where data is
transformed. With ETL, the transformation of data is done before it is loaded into a data warehouse. This enables
analysts and business users to get the data they need faster, without building complex transformations or
persistent tables in their business intelligence tools. Using the ELT approach, data is loaded into the warehouse as
is, with no transformation before loading.
{Ch=5}
ONLINE ANALYTICAL PROCESSING(OLAP)
Online Analytical Processing (OLAP) is a technology to analyze and process data from multiple sources at the same
time. It accesses multiple databases simultaneously. It helps data analysts collect data from different
perspectives for developing effective business strategies. OLAP is software used to perform high-speed,
multivariate analysis of large amounts of data in data warehouses, data marts, or other unified, centralized
data stores. The data is broken down for display, monitoring or analysis.
Multidimensional Operations List and describe all the four types of OLAP operations. 10
Typically, the four types of OLAP operations are the following (a small code sketch of these operations is given after the list):
i. Roll-up
ii. Drill-down
iii. Slice and Dice
iv. Pivot (rotate)
1. Roll-up: This OLAP operation moves from a lower level of the concept hierarchy to a higher level, and is also
known as aggregation. It can also be performed by reducing the number of dimensions.
2. Drill-down: This is the process of scaling down from a higher level to a lower level, in which the data is broken
into smaller parts.
3. Slice: This is another OLAP operation to fetch data. Here a query on one dimension is triggered on the
database and a new sub-cube is created.
4. Dice: The Dice operation, as shown in figure 6, is just like the projection relational query you have read about
in RDBMS. In this technique, you select two or more dimensions, which results in the creation of a sub-cube.
5. Pivot: This OLAP operation fixes one attribute as a pivot and rotates the cube to fetch the results, just like an
inverted spreadsheet, giving it a different perspective.
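A small pandas sketch of these operations on a toy sales cube (assumed data and library, not from the notes): groupby acts as roll-up or drill-down, boolean selection as slice and dice, and pivot_table as pivot.

    import pandas as pd

    # Hypothetical sales cube flattened into a table: dimensions plus one measure
    cube = pd.DataFrame({
        "year":    [2023, 2023, 2023, 2024, 2024, 2024],
        "quarter": ["Q1", "Q1", "Q2", "Q1", "Q2", "Q2"],
        "city":    ["Delhi", "Pune", "Delhi", "Pune", "Delhi", "Pune"],
        "item":    ["Pen", "Pen", "Book", "Book", "Pen", "Book"],
        "sales":   [100, 80, 150, 90, 120, 60],
    })

    # Roll-up: climb the concept hierarchy from quarter to year
    rollup = cube.groupby("year")["sales"].sum()

    # Drill-down: break the same measure into finer quarter-level detail
    drilldown = cube.groupby(["year", "quarter"])["sales"].sum()

    # Slice: fix a single dimension value (city = Delhi) to get a sub-cube
    slice_ = cube[cube["city"] == "Delhi"]

    # Dice: fix values on two or more dimensions
    dice = cube[(cube["city"] == "Delhi") & (cube["item"] == "Pen")]

    # Pivot: rotate the cube to view cities against years
    pivot = cube.pivot_table(index="city", columns="year",
                             values="sales", aggfunc="sum")
    print(rollup, drilldown, slice_, dice, pivot, sep="\n\n")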
ROLAP Architecture
ROLAP implies Relational OLAP, an application based on relational DBMSs. It performs dynamic multidimensional
analysis of data stored in a relational database. The architecture is three-tier, with three components: the front
end (user interface), the ROLAP server (metadata request processing engine) and the back end (database server).
In this three-tier architecture the user submits a request, and the ROLAP engine converts the request into SQL and
submits it to the backend database. After processing the request, the engine presents the resulting data in
multidimensional format to make it easier for the client to view. The architecture has three components:
• Database server. ● ROLAP server. ● Front-end tool
MOLAP Architecture
Explain the following OLAP architectures and draw their architectural diagram: 10
(i) Multidimensional Online Analytical Processing (MOLAP)
(ii) Hybrid Online Analytical Processing (HOLAP)
MOLAP stands for Multidimensional Online Analytical Processing. It processes data using a multidimensional
cube with various combinations of dimensions. Since the data is stored in a multidimensional structure, the MOLAP
engine uses pre-computed or pre-stored information; the engine processes pre-compiled information. It has dynamic
abilities to perform aggregation along the concept hierarchy. MOLAP is very useful in time-series data analysis and
economic evaluation. The architecture has three components:
● Database server ● MOLAP server ● Front-End tool
Characteristics of MOLAP
• It is a user-friendly architecture, easy to use.
• The OLAP slice and dice operations speed up data retrieval.
• It has small pre-computed hypercubes.
HOLAP Architecture
HOLAP stands for Hybrid Online Analytical Processing. It is a hybrid of the ROLAP and MOLAP technologies and connects
both approaches in one architecture, storing part of the data in ROLAP and part in MOLAP. Depending on the query
request, it accesses the appropriate database. Relational tables are stored in the ROLAP structure, while data that
requires multidimensional views is stored and processed using the MOLAP architecture, as shown in figure 12.
It has the following components:
● Database server. ● ROLAP and MOLAP server. ● Front-end tool
Characteristics of HOLAP are:
• Flexible handling of data.
• Faster aggregation of data.
• HOLAP can drill down the data hierarchy and can access the relational database for any relevant information
stored in it.
DOLAP Architecture
Desktop Online Analytical Processing (DOLAP) architecture is most suitable for local multidimensional analysis. It is
like a miniature of multidimensional database or it’s like a sub cube or any business data cube. The components are:
• Database Server. ● DOLAP server. ● Front-end
Characteristics of DOLAP are:
• The three-tier architecture is designed for low-end, standalone users, like a small shop owner in the locality.
• The data cube is locally stored in the system so, retrieval of results is faster.
• No load on the backend or at the server end.
• DOLAP is relatively cheaper to deploy (NO DIAGRAM OF DOLAP)
{CH=6}
DATA LAKES Define a Data Lake. Explain the step-by-step process of creating a data lake. 10
A data lake is a central location that holds a large amount of data in its native, raw format. A data lake uses a flat
architecture and object storage to store the data. It can store structured, semi-structured, or unstructured data,
which means data can be kept in a more flexible format for future use. When storing data, a data lake associates it
with identifiers and metadata tags for faster retrieval.
(Note: no clear wording was found for the step-by-step process; it is to be understood from the internet and written up in one's own words.)
2. Data Vault Model: The model also allows coping with issues such as auditing, data traceability, data
loading performance and resilience to change. Data traceability implies that every row of data in a data vault
must be associated with record-source and date-of-load attributes. This information allows an auditor to
trace values back to their respective sources. The model is built from hubs, links and satellites (a small table
sketch is given after this list):
• Hubs: Hubs comprise a list of distinct business keys with a low propensity to change, and contain a
surrogate key per hub item. The hub also contains metadata designating the source of the business
key.
• Links: Link tables are models providing relations between business keys and are fundamentally
many-to-many join tables, containing some metadata.
Links can connect to other links, to handle granularity variations. Using link references within another
link is bad modelling practice, as it creates dependencies between links making parallel loading of
links more challenging.
• Satellites: The hubs' and links' temporal and descriptive attributes are stored in separate
constructs called satellite tables, or simply satellites. The satellites also comprise metadata
connecting them to their parent link or hub, metadata designating the source of the relationship and
attributes, and a time series with the attributes' start and end dates.
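A minimal sketch of the three Data Vault table shapes as SQLite DDL run from Python; the table and column names are illustrative assumptions, not a prescribed design.

    import sqlite3

    con = sqlite3.connect(":memory:")
    con.executescript("""
    -- Hub: distinct business keys plus load metadata
    CREATE TABLE hub_customer (
        customer_hk   INTEGER PRIMARY KEY,   -- surrogate key
        customer_bk   TEXT UNIQUE,           -- business key
        load_date     TEXT,
        record_source TEXT
    );
    CREATE TABLE hub_product (
        product_hk    INTEGER PRIMARY KEY,
        product_bk    TEXT UNIQUE,
        load_date     TEXT,
        record_source TEXT
    );
    -- Link: many-to-many relation between hubs, plus load metadata
    CREATE TABLE link_purchase (
        purchase_hk   INTEGER PRIMARY KEY,
        customer_hk   INTEGER REFERENCES hub_customer(customer_hk),
        product_hk    INTEGER REFERENCES hub_product(product_hk),
        load_date     TEXT,
        record_source TEXT
    );
    -- Satellite: descriptive, time-variant attributes hanging off a hub
    CREATE TABLE sat_customer_details (
        customer_hk   INTEGER REFERENCES hub_customer(customer_hk),
        name          TEXT,
        city          TEXT,
        load_date     TEXT,
        load_end_date TEXT,
        record_source TEXT
    );
    """)
    print([r[0] for r in con.execute("SELECT name FROM sqlite_master WHERE type='table'")])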
Data warehouse automation tools: A DWA tool provides a seamless experience, free from the
hassles of coding, for integration and transition of diverse data from its source to a DW and other related
components. The tool automates deployment of ETL scripts and batch-processing of data, and provides various
functions such as:
• Processes for high performance ETL and reliable ELT based integration of data
• Modelling of data at source
• Normalized, de-normalized and multidimensional data structures
• Seamless integration with various data sources
{Block-3}
{Ch=7}
DATA MINING What is data mining? List out the key differences between data warehousing and data
mining. Mention any four applications of data mining. 10
Data mining is the process by which patterns are found by filtering and sorting data based on conditions
in order to find something important. At a micro scale, mining can be defined as an activity that involves
gathering data in one place in a structured way, for example, collating data in an Excel spreadsheet or creating a
summary of the main points of some text.
• Data mining finds valuable information hidden in large volumes of data.
• Data mining is the analysis of data and the use of software techniques for finding patterns and regularities in
sets of data.
• The computer is responsible for finding the patterns by identifying the underlying rules and features in the data.
{CH=8}
Data Preprocessing Techniques:
• Data cleaning may be used to remove noise and to correct inconsistencies in the data.
• Data integration combines data from multiple sources into a coherent data store, such as a data warehouse.
• Data transformations, such as standardization, may be applied.
• Data reduction can reduce the data size by, for example, aggregation, elimination of redundant features, or
clustering; these techniques can also be used together.
DATA CLEANING Define ‘‘data cleaning’’ which is a data preprocessing technique. In this context, explain
the concept of Noisy data cleaning along with some suitable examples. 10
Real-world data tend to be incomplete, noisy and inconsistent. Data cleaning (or data cleansing) procedures are
intended to fill in missing values, smooth out noise while identifying outliers, and correct inconsistencies in the
data.
1. Missing Values: Some tuples have no value recorded for a number of attributes, such as customer
income. These missing values have to be filled in (a small sketch of this and of smoothing follows this list).
2. Noisy Data: Noise is a random error or variance in a measured variable. Noise is reduced by techniques of
data smoothing.
3. Data cleaning as a process:
Discrepancy Detection: The first step in data cleaning is to identify discrepancies. It uses knowledge of the
metadata and examines validation rules in order to identify the discrepancies.
Data Transformation: This is the second step of the data cleaning process. After identifying discrepancies, we
have to define a (sequence of) transformations and execute them to correct the discrepancies.
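A small pandas sketch of the two cleaning tasks from the list above, on made-up income values (an assumed example, not from the notes): filling a missing value with the column median, and smoothing noisy values by replacing each value with its bin mean, one common smoothing technique.

    import pandas as pd

    # Hypothetical customer income column with a missing entry and an outlier
    income = pd.Series([30000, None, 45000, 44000, 1000000, 52000], name="income")

    # Missing values: fill the empty entry, here with the median of the column
    income = income.fillna(income.median())

    # Noisy data: smooth by equal-width binning, replacing each value by its bin mean
    bins = pd.cut(income, bins=3)
    smoothed = income.groupby(bins, observed=True).transform("mean")
    print(smoothed)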
(Note: most of the topics of this chapter have been left out here.)
{Ch=9}
ASSOCIATION RULE MINING
Short note on Association Rule Generation (5)
What is Association Rule Mining? With reference to this, explain the concepts given below: 10
(i) Support Count (ii) Frequent Item set (iii) Association Rule (iv) Rule Evaluation metrics
• Association rule mining is a popular and well researched method for discovering interesting relations
between variables in large databases.
• It is intended to identify strong rules discovered in Databases using different measures of interestingness.
• Based on the concept of strong rules, Rakesh Agrawal et al. introduced association rules.
Rule Evaluation Metrics –
Support: It is one of the measures of interestingness. This tells about the usefulness and certainty of rules.
5% Support means total 5% of transactions in the database follow the rule.
Support(A -> B) = Support_count(A ∪ B) / Total number of transactions
Support count(X): Number of transactions in which X appears. If X is A union B then it is the number of
transactions in which A and B both are present.
For example, if a dataset contains 100 transactions and the item set {milk, bread} appears in 20 of those
transactions, the support count for {milk, bread} is 20.
Confidence (c): It is the ratio of the number of transactions that include all items in {A} as well as all items in {B}
to the number of transactions that include all items in {A}:
Confidence(A -> B) = Support_count(A ∪ B) / Support_count(A)
(Both measures are computed in the short sketch at the end of this section.)
Frequent item sets: A frequent item set is a set of items that occur together frequently in a dataset. The
frequency of an item set is measured by the support count, which is the number of transactions or records in the
dataset that contain the item set. Frequent item sets and association rules can be used for a variety of tasks such
as market basket analysis, cross-selling and recommendation systems.
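The support and confidence measures defined above can be computed directly on a toy set of transactions; this is an assumed, minimal sketch, not from the notes.

    # Hypothetical market-basket transactions
    transactions = [
        {"milk", "bread"},
        {"milk", "bread", "butter"},
        {"bread", "butter"},
        {"milk", "eggs"},
        {"milk", "bread", "eggs"},
    ]

    def support_count(itemset):
        # Number of transactions containing every item in the itemset
        return sum(1 for t in transactions if itemset <= t)

    A, B = {"milk"}, {"bread"}
    support = support_count(A | B) / len(transactions)      # Support(A -> B)
    confidence = support_count(A | B) / support_count(A)    # Confidence(A -> B)
    print(f"support = {support:.2f}, confidence = {confidence:.2f}")

Here support_count({milk, bread}) is 3 out of 5 transactions, so Support(milk -> bread) = 0.6 and Confidence(milk -> bread) = 3/4 = 0.75.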
{Block 4}
{Ch=10}
CLASSIFICATION
Define classification. With the help of an example for each, explain the following classification models: 10
(i) Descriptive modelling
(ii) Predictive modelling
A classification problem requires that examples be classified into one of two or more classes. A classification
problem can have real-valued or discrete input variables. A problem with two classes is called a two-class or binary
classification problem. A problem with more than two classes is called a multi-class classification problem.
Classification is the activity of learning a target function "c" that maps each attribute set "a" to one of the
predefined class labels "b". The attribute set 'a' can contain a number of attributes and can contain binary,
categorical and continuous attributes. The class label 'b' must be a discrete attribute, i.e. binary or categorical
(nominal or ordinal).
Classification Models:
Descriptive Modelling: A classification model used as an explanatory tool to summarize the data and to distinguish
between objects of different classes, e.g. summarizing which characteristics separate loan defaulters from non-defaulters.
Predictive Modelling: A classification model used to predict the class label of unknown, previously unseen records,
e.g. predicting whether a new loan applicant will default.
Decision Tree:
What is a Decision Tree? How is it useful in classification? With the help of an example, explain the process of
construction of a decision tree and its representation. 10
Decision tree is one of the most effective and common prediction and classification methods. A decision tree is a
flowchart-like tree structure, where each internal node denotes a test on an attribute, each branch represents an
outcome of the test, and each leaf node holds a class label.
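A minimal decision-tree classification sketch with scikit-learn (an assumed library choice; the notes do not name a tool), using the bundled iris dataset: the printed tree shows attribute tests at the internal nodes and class labels at the leaves.

    from sklearn.datasets import load_iris
    from sklearn.model_selection import train_test_split
    from sklearn.tree import DecisionTreeClassifier, export_text

    # Load a small labelled dataset and hold out part of it for testing
    X, y = load_iris(return_X_y=True)
    X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

    # Fit a shallow tree; max_depth keeps the flowchart readable
    tree = DecisionTreeClassifier(max_depth=3, random_state=0).fit(X_train, y_train)

    print(export_text(tree))                      # internal nodes = attribute tests, leaves = classes
    print("accuracy:", tree.score(X_test, y_test))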
{Ch=11}
CLUSTERING Short note on Clustering and its Methods (5)
A cluster is a grouping of comparable items that belong to the same category.
In clustering, a group of comparable data objects is classed as a cluster. Cluster analysis divides data into groups
based on how closely the objects resemble one another. Clustering is an unsupervised machine-learning technique
that divides a set of data points into clusters so that objects in the same cluster are similar to one another.
Using clustering, data can be subdivided into smaller, more manageable groups; the data in each of these subgroups
forms a single cluster.
Applications of cluster analysis in Data Mining:
• Clustering analysis is widely utilized in a variety of fields, including data analysis, market research, pattern
identification, and image processing.
• It aids marketers in identifying distinct customer segments based on their customers' purchasing habits.
They are able to classify their clients.
• It aids data discovery by assigning documents on the internet.
• For example, credit card fraud detection relies on clustering.
Categorization of Major Clustering Methods:
1. Partitioning Method: There are a few requirements that must be met by this partitioning clustering
method (K-means, sketched after this list, is the classic example), and they are as follows:
Each group must contain at least one object.
Each object must belong to exactly one group.
2. Hierarchical Method: This method decomposes a set of data items into a hierarchy. Depending on how the
hierarchical breakdown is generated, we can put hierarchical approaches into different categories.
3. Density-based Method
4. Grid-Based Method
5. Model-Based Method
6. Constraint-based Method
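A minimal K-means sketch with scikit-learn (assumed library and made-up customer data, not from the notes), illustrating the partitioning method: every point is assigned to exactly one of the k clusters.

    import numpy as np
    from sklearn.cluster import KMeans

    # Hypothetical customer data: [annual spend, number of visits]
    X = np.array([[200, 4], [220, 5], [800, 20], [780, 18], [50, 1], [60, 2]])

    km = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X)
    print("cluster labels:", km.labels_)          # each customer belongs to exactly one cluster
    print("cluster centres:", km.cluster_centers_)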
{Ch=12}
Text Mining
Define text mining. Mention any three applications of it. What are the various techniques used to analyse the web-
usage patterns? 10
What is Text Mining? Where is it used? Explain any two text mining techniques with the help of a suitable example
for each. 10
Text mining is a component of data mining that deals specifically with unstructured text data. Text mining can be
used as a preprocessing step for data mining or as a standalone process for specific tasks.
By using text mining, the unstructured text data can be transformed into structured data that can be used for
data mining tasks such as classification, clustering, and association rule mining. This allows organizations to gain
insights from a wide range of data sources, such as customer feedback, social media posts, and news articles.
Text mining is widely used in various fields, such as natural language processing, information retrieval, and social
media analysis. It has become an essential tool for organizations to extract insights from unstructured text data
and make data-driven decisions.
Procedures for Analysing Text (a small text-categorization sketch is given after this list)
• Text Summarization: To automatically extract partial content that reflects the whole content of the text.
• Text Categorization: To assign a category to the text among categories predefined by users.
• Text Clustering: To segment texts into several clusters, depending on the substantial relevance.
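A minimal text-categorization sketch with scikit-learn (assumed library and toy labelled reviews, not from the notes): unstructured text is turned into structured TF-IDF features and then classified into predefined categories.

    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.naive_bayes import MultinomialNB
    from sklearn.pipeline import make_pipeline

    # Toy labelled documents (hypothetical customer feedback)
    docs = ["great phone, excellent battery", "terrible battery, poor screen",
            "loved the camera and display", "worst purchase, very disappointed"]
    labels = ["positive", "negative", "positive", "negative"]

    # Unstructured text -> structured TF-IDF features -> classifier
    model = make_pipeline(TfidfVectorizer(), MultinomialNB()).fit(docs, labels)
    print(model.predict(["the display is excellent"]))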
WEB MINING
Web mining, as the name suggests, involves the mining of web data. The extraction of information from
websites uses data mining techniques. It is an application based on data mining techniques. The parameters
generally to be mined in web pages are hyperlinks, text or content of web pages, linked user activity between web
pages of the same website or among different websites.
Web Mining performs various tasks such as:
1) Generating patterns existing in some websites, like customer buying behavior or navigation of web sites.
2) Web mining helps to retrieve faster results for the queries or the search text posted on search engines like
Google, Yahoo, etc.
3) The ability to classify web documents according to the searches performed on e-commerce websites helps to
increase business and transactions.