MCS 221 Notes
{Block1}
{CH=1}
DATA WAREHOUSING: Define data warehouse. List and explain 4 characteristics of a data warehouse. (10)
Data Warehouse is used to collect and manage data from various sources, in order to provide meaningful business
insights. A data warehouse is usually used for linking and analyzing heterogeneous sources of business data. Data
warehouses gather data from multiple sources (including databases), with an emphasis on storing, filtering, retrieving
and in particular, analyzing huge quantities of organized data. Data warehouses are used extensively in the largest and
most complex businesses around the world. In demanding situations, good decision making becomes critical.
Benefits/Characteristics:
Integrated Data: One of the key characteristics of a data warehouse is that it contains integrated data. This means
that the data is collected from various sources, such as transactional systems, and then cleaned, transformed, and
consolidated into a single, unified view. This allows for easy access and analysis of the data, as well as the ability to
track data over time.
Subject-Oriented: A data warehouse is also subject-oriented, which means that the data is organized around
specific subjects, such as customers, products, or sales. This allows for easy access to the data relevant to a specific
subject, as well as the ability to track the data over time.
Non-Volatile: Another characteristic of a data warehouse is that it is non-volatile. This means that the data in the
warehouse is never updated or deleted, only added to. This is important because it allows for the preservation of
historical data, making it possible to track trends and patterns over time.
Time-Variant: A data warehouse is also time-variant, which means that the data is stored with a time dimension.
This allows for easy access to data for specific time periods, such as last quarter or last year. This makes it possible
to track trends and patterns over time.
Top-down Approach: Bill Inmon’s design methodology is based on a top-down approach. In the top-down
approach, the data warehouse is designed first and the data marts are then built on top of the data warehouse.
Below are the steps involved in the top-down approach:
• Data is extracted from the various source systems. The extracts are loaded and validated in the staging area.
Validation is required to make sure the extracted data is accurate and correct. ETL tools or a similar approach
can be used to extract the data and push it to the data warehouse.
• Data is extracted from the data warehouse on a regular basis into the staging area. At this step, various
aggregation and summarization techniques are applied to the extracted data, which is then loaded back to the data warehouse.
• Once the aggregation and summarization are completed, the various data marts extract that data and apply
some more transformations to shape the data into the structure defined by the data marts.
Bottom-up Approach: Ralph Kimball’s data warehouse design approach is called dimensional modelling or
the Kimball methodology. This methodology follows the bottom-up approach. As per this method, data marts are
first created to provide reporting and analytics capability for specific business processes; later, the enterprise data
warehouse is created from these data marts. Basically, the Kimball model reverses the Inmon model, i.e. data marts are
loaded directly with data from the source systems, and the ETL process is then used to load the data warehouse.
Below are the steps involved in the bottom-up approach:
• The data flow in the bottom-up approach starts with the extraction of data from the various source systems into the
staging area, where it is processed and loaded into the data marts that handle specific business processes.
• After the data marts are refreshed, the current data is once again extracted into the staging area and transformations are
applied to fit the data mart structure. The data is then extracted from the data marts to the staging area,
aggregated, summarized and so on, loaded into the EDW, and then made available to the end users for analysis,
enabling critical business decisions.
• Load Manager: The Load Manager component of the data warehouse is responsible for the collection of data from
operational systems and converts it into a usable form for the users. This component is responsible for
importing and exporting data from operational systems.
• Warehouse Manager: The warehouse manager is the centre of the data-warehousing system and is the data
warehouse itself. It is a large, physical database that holds a vast amount of information from a wide variety of
sources. The data within the data warehouse is organized such that it becomes easy to find, use and update
frequently from its sources.
• Query Manager: Query Manager Component provides the end-users with access to the stored warehouse
information through the use of specialized end-user tools. Data mining access tools have various categories such
as query and reporting, on-line analytical processing (OLAP), statistics, data discovery and graphical and
geographical information systems.
• End-user access tools: These are divided into the following categories:
Reporting Data
Query Tools
Data Dippers
Tools for EIS
Tools for OLAP and tools for data mining.
OLTP AND OLAP short note on OLTP (5)
• Online Transaction Processing (OLTP): An OLTP system captures and maintains transaction data in a
database. Each transaction involves individual database records made up of multiple fields or columns. Examples
include banking and credit card activity or retail checkout scanning.
In OLTP, the emphasis is on fast processing, because OLTP databases are read, written, and updated frequently. If
a transaction fails, built-in system logic ensures data integrity.
• Online Analytical Processing (OLAP): Online Analytical Processing (OLAP) applies complex queries to
large amounts of historical data, aggregated from OLTP databases and other sources, for data mining, analytics, and
business intelligence projects. In OLAP, the emphasis is on response time to these complex queries. Each query
involves one or more columns of data aggregated from many rows.
Examples include year-over-year financial performance or marketing lead generation trends.
OLTP is operational, while OLAP is informational.
Difference between OLTP and OLAP:
• Characteristics: OLTP handles a large number of small transactions; OLAP handles large volumes of data with complex queries.
• Query types: OLTP uses simple standardized queries; OLAP uses complex queries.
• Operations: OLTP is based on INSERT, UPDATE, DELETE commands; OLAP is based on SELECT commands to aggregate data for reporting.
• Response time: OLTP responds in milliseconds; OLAP takes seconds, minutes, or hours depending on the amount of data to process.
• Purpose: OLTP controls and runs essential business operations in real time; OLAP is used to plan, solve problems, support decisions and discover hidden insights.
• Productivity: OLTP increases the productivity of end users; OLAP increases the productivity of business managers, data analysts, and executives.
• Data view: OLTP lists day-to-day business transactions; OLAP gives a multi-dimensional view of enterprise data.
• User examples: OLTP users are customer-facing personnel, clerks, online shoppers; OLAP users are knowledge workers such as data analysts, business analysts, and executives.
• Database design: OLTP uses normalized databases for efficiency; OLAP uses denormalized databases for analysis.
TYPES OF DATA WAREHOUSES: There are three different types of traditional Data Warehouse models as
listed below:
Define a data warehouse. List and explain all 3 types of data warehouse.10
(i) Enterprise Data Warehouse: An enterprise data warehouse provides a central repository database for decision support
throughout the enterprise. It is a central place where all business information from different sources and
applications is made available. Its goal is to provide a complete overview of any particular object in
the data model.
(ii) Operational Data Store: It assists in obtaining data straight from the database, which also supports
transaction processing. The data present in the Operational Data Store can be scrubbed, and any duplication
present can be reviewed and fixed by examining the corresponding business rules.
(iii) Data Mart: A data mart is a subset of a data warehouse, and it supports a specific region, business unit,
or business function. A data mart focuses on storing data for a particular functional area, and it contains a subset of
the data saved in the warehouse. Data marts help in improving user response times and also reduce the volume of data
scanned for analysis.
{CH=2}
DATA WAREHOUSE ARCHITECTURE
Data warehouse architecture defines the arrangement of the data in different databases. When designing a data
warehouse, there are three different types of models to consider, based on the number of tiers the
architecture has.
(i) Single-tier data warehouse architecture (ii) Two-tier data warehouse architecture (iii) Three-tier data warehouse architecture
1. Single-tier data warehouse architecture: The single-tier architecture (Figure 1) is not a frequently
practiced approach. The main goal of having such architecture is to remove redundancy by minimizing the
amount of data stored. Its primary disadvantage is that it doesn’t have a component that separates analytical
and transactional processing.
2. Two-tier data warehouse architecture: The two-tier architecture (Figure 2) includes a staging area for all
data sources, before the data warehouse layer. By adding a staging area between the sources and the storage
repository, you ensure all data loaded into the warehouse is cleansed and in the appropriate format.
3. Three-tier data warehouse architecture: The three-tier approach (Figure 3) is the most widely used
architecture for data warehouse systems. Essentially, it consists of three tiers:
• The bottom tier is the database of the warehouse, where the cleansed and transformed data is loaded.
• The middle tier is the application layer giving an abstracted view of the database. It arranges the data to
make it more suitable for analysis. This is done with an OLAP server, implemented using the ROLAP or
MOLAP model.
• The top-tier is where the user accesses and interacts with the data. It represents the front-end client layer.
You can use reporting tools, query, analysis or data mining tools.
COMPONENTS OF DATA WAREHOUSE ARCHITECTURE
1. Data Warehouse Database: The central component of a DW architecture is the data warehouse database
that stores all enterprise data and makes it manageable for reporting. This means choosing the kind of
database that will be used to store the data in the warehouse.
2. Extraction, Transformation, and Loading Tools (ETL) (Explain detail in ch 4)
What is ETL? Why do you need ETL in data warehouses? Also, explain how ETL works in Data Warehousing? 10
ETL tools are central components of enterprise data warehouse architecture. These tools help extract data from
different sources, transform it into a suitable arrangement, and load it into a data warehouse.
The ETL tool you choose will determine:
The time expended in data extraction
Approaches to extracting data
Kind of transformations applied and the simplicity to do so
Business rule definition for data validation and cleansing to improve end-product analytics
Outlining data distribution from the central repository to your BI applications
3. Metadata:
What is Metadata? What are its contents? Justify how metadata can be an important component in Data
Warehousing? Also mention its types. 10
In the data warehouse architecture, metadata describes the data warehouse database and offers a framework for
data. It helps in constructing, preserving, handling, and making use of the data warehouse. Metadata plays an
important role for businesses and the technical teams to understand the data present in the warehouse and
convert it into information. There are two types of metadata in a data warehouse:
• Technical Metadata comprises information that can be used by developers and managers when executing
warehouse development and administration tasks.
• Business Metadata comprises information that offers an easily understandable standpoint of the data
stored in the warehouse.
4. Data Warehouse Access Tools: A data warehouse uses a database or group of databases as a foundation.
Organizations generally cannot work with these databases without the use of tools unless they have
database administrators available. The tools include:
Query and reporting tools
Application development tools
Data mining tools for data warehousing
OLAP tools
5. Data Warehouse Bus: It defines the data flow within a data warehousing bus architecture and includes a
data mart. A data mart is an access level that allows users to transfer data. It is also used for partitioning data
that is produced for a particular user group.
6. Data Warehouse Reporting Layer: The reporting layer in the data warehouse allows the end-users to
access the BI interface or BI database architecture. The purpose of the reporting layer in the data warehouse is
to act as a dashboard for data visualization, create reports, and take out any required information.
(i) Data Source Layer: The data source layer is the place where the original data, gathered from an assortment of
internal and external sources, resides in the relational database. The following are examples of the data source layer:
Operational Data: product data, inventory data, marketing data, or HR data
Social Media Data: website hits, content popularity, contact page completions
Third-party Data: demographic data, survey data, census data
(ii) Data Staging Layer: This layer sits between the data sources and the data warehouse. In this layer,
data is extracted from the various internal and external data sources. Since source data comes in various
formats, the data extraction layer will use numerous technologies and tools to extract the required
data. Once the extracted data has been loaded, it is subjected to high-level quality checks. The
final outcome is clean and organized data that is loaded into the data warehouse.
(iii) Data Storage Layer: This layer is where the data that was cleansed in the staging area is stored
as a single central repository. Depending on your business and your warehouse architecture requirements, your
data storage may be a data warehouse, a data mart (a data warehouse partially replicated for specific
departments), or an Operational Data Store (ODS).
(iv) Data Presentation Layer: This is where the users interact with the cleansed and organized data. This
layer of the data architecture gives users the ability to query the data for product or service insights, analyze the
data to run hypothetical business scenarios, and create automated or ad-hoc reports.
An OLAP or reporting tool with an easy-to-understand Graphical User Interface (GUI) may be used to help
users build their queries, perform analysis, or design their reports.
DESIGNING THE DATA MARTS: The process for designing a data mart usually comprises the
following steps:
(i) Essential Requirements Gathering: The first step is to create a robust design. Some critical processes involved in
this phase include collecting the corporate and technical requirements, identifying data sources, choosing a suitable
data subset, and designing the logical layout (database schema) and physical structure.
(ii) Build/Construct: The next step is to construct it. This includes creating the physical database and the logical
structures. In this phase, you’ll build the tables, fields, indexes, and access controls.
(iii) Populate/Data Transfer: The next step is to populate the mart, which means transferring data into it. In this
phase, you can also set the frequency of data transfer, such as daily or weekly. This usually involves extracting
source information, cleaning and transforming the data, and loading it into the departmental repository.
(iv) Data Access: In this step, the data loaded into the data mart is used in querying, generating reports, graphs, and
publishing. The main task involved in this phase is setting up a meta-layer and translating database structures and
item names into corporate expressions so that non-technical operators can easily use the data mart. If necessary,
you can also set up API and interfaces to simplify data access.
(v) Manage: The last step involves management and observation, which includes:
Controlling ongoing user access.
Optimization and refinement of the target system for improved performance.
Addition and management of new data into the repository.
Configuring recovery settings and ensuring system availability in the event of failure.
{Ch=3}
DIMENSIONAL MODELING: Dimensional Modelling (DM) is a data structure technique optimized
for data storage in a Data warehouse. The purpose of dimensional modelling is to optimize the database for faster
retrieval of data. A dimensional model in a data warehouse is designed to read, summarize and analyze numeric
information like values, balances, counts, weights, etc. The main objective of dimensional
modelling is to provide an easy architecture for the end user to write queries. It also reduces the number of
relationships between the tables and dimensions, hence providing efficient query handling.
Features of Dimensional Modelling
• Easy to Understand
• Promote Data Quality
• Optimise Performance
• It reduces the number of relationships between different data elements.
Steps Involved in Dimensional Modelling
The following are the steps involved in dimensional modelling, as shown in figure 1.
Step 1: Identify the Business Process: Identify the actual business process a data warehouse should cover. This
could be Marketing, Sales, HR, etc., as per the data analysis needs of the organization. The selection of the business
process also depends on the quality of data available for that process. It is the most important step of the data
modelling process.
Step 2: Identifying Granularity: The Grain describes the level of detail for the business problem/solution. It is the
process of identifying the lowest level of information for any table in your data warehouse. If a table contains sales
data for every day, then it should be daily granularity. If a table contains total sales data for each month, then it has
monthly granularity.
Step 3: Identifying Dimensions and Attributes: The dimensions of the data warehouse can be understood through the
entities of the database, like items, products, date, stocks, time, etc. The identification of the primary keys and the
foreign key specifications are all described here.
Step 4: Build the Schema: The database structure, or the arrangement of columns in the database tables, decides the
schema. There are various popular schemas such as the star, snowflake and fact constellation schemas. Summarizing,
the process runs from the selection of the business process to identifying the finest level of detail of the business
transactions; identifying the significant dimensions and attributes then helps to build the schema.
STAR SCHEMA With the help of an example, explain the star schema dimensional model. 10
It represents the multidimensional model. In this model the data is organized into facts and dimensions. The star
model is the underlying structure for a dimensional model. It has one broad central table (fact table) and a set of
smaller tables (dimensions) arranged in a star design around the primary table. This design is logically shown in the figure.
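A minimal sketch (not part of the original notes) of how a star schema is queried: one central fact table is joined to its dimension tables and then aggregated. Python with pandas is an assumed choice here, and the table and column names (sales_fact, date_dim, product_dim) are illustrative only.

    import pandas as pd

    # Illustrative dimension tables (assumed names and columns, not from the notes)
    date_dim = pd.DataFrame({
        "date_key": [1, 2],
        "full_date": ["2024-01-01", "2024-01-02"],
        "month": ["Jan", "Jan"],
    })
    product_dim = pd.DataFrame({
        "product_key": [10, 11],
        "product_name": ["Pen", "Notebook"],
        "category": ["Stationery", "Stationery"],
    })

    # Central fact table: foreign keys to the dimensions plus numeric measures
    sales_fact = pd.DataFrame({
        "date_key": [1, 1, 2],
        "product_key": [10, 11, 10],
        "units_sold": [5, 2, 7],
        "revenue": [50.0, 40.0, 70.0],
    })

    # A typical star-schema query: join the fact to its dimensions, then aggregate
    report = (sales_fact
              .merge(date_dim, on="date_key")
              .merge(product_dim, on="product_key")
              .groupby(["month", "product_name"])["revenue"].sum())
    print(report)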
SNOWFLAKE SCHEMA Describe the snowflake schema multidimensional modeling technique. Give an
example and illustrate it. List its pros and cons. 10
A snowflake schema is a type of data modeling technique used in data warehousing to represent data in a
structured way that is optimized for querying large amounts of data efficiently. In a snowflake schema, the
dimension tables are normalized into multiple related tables, creating a hierarchical or “snowflake” structure.
In a snowflake schema, the fact table is still located at the center of the schema, surrounded by the dimension
tables. However, each dimension table is further broken down into multiple related tables, creating a hierarchical
structure that resembles a snowflake.
Example: see Figure 4, which illustrates a snowflake schema.
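As an assumed illustration of the snowflake step (not from the notes), the product dimension from the star-schema sketch above can be normalized into separate product and category tables; queries then need one extra join per normalized level, which is the main trade-off of snowflaking.

    import pandas as pd

    # Snowflaking: the product dimension is normalized into two related tables
    product_dim = pd.DataFrame({
        "product_key": [10, 11],
        "product_name": ["Pen", "Notebook"],
        "category_key": [100, 100],          # foreign key into category_dim
    })
    category_dim = pd.DataFrame({
        "category_key": [100],
        "category_name": ["Stationery"],
    })
    sales_fact = pd.DataFrame({
        "product_key": [10, 11, 10],
        "revenue": [50.0, 40.0, 70.0],
    })

    # Queries now need one extra join per normalized level
    by_category = (sales_fact
                   .merge(product_dim, on="product_key")
                   .merge(category_dim, on="category_key")
                   .groupby("category_name")["revenue"].sum())
    print(by_category)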
{Block-2}
{Ch=4}
ETL (Extract, Transform, Load)
What is ETL? Why do you need ETL in data warehouses? Also, explain how ETL works in Data Warehousing? 10
It is a process of data integration that encompasses three steps - extraction, transformation, and loading. In a
nutshell, ETL systems take large volumes of raw data from multiple sources, convert it for analysis, and load that
data into your warehouse.
Why Do You Need ETL?: ETL saves you significant time on data extraction and preparation - time that you can
better spend on evaluating your business. Practicing ETL is also part of a healthy data management workflow,
ensuring high data quality, availability, and reliability. Each of the three major components in the ETL saves time and
development effort by running just once in a dedicated data flow:
Extract: In ETL, the first link determines the strength of the chain. The extract stage determines which data sources
to use, the refresh rate (velocity) of each source, and the priorities (extract order) between them — all of which
heavily impact your time to insight.
Transform: After extraction, the transformation process brings clarity and order to the initial data swamp. Dates
and times combine into a single format and strings parse down into their true underlying meanings. Location data
convert to coordinates, zip codes, or cities/countries. The transform step also sums up, rounds, and averages
measures, and it deletes useless data and errors or discards them for later inspection. It can also mask personally
identifiable information (PII) to comply with GDPR, CCPA, and other privacy requirements.
Load: In the last phase, much as in the first, ETL determines targets and refresh rates. The load phase also
determines whether loading will happen incrementally, or if it will require “upsert” (updating existing data and
inserting new data) for the new batches of data.
ETL PROCESS: ETL collects and processes data from various sources into a single data store (a data
warehouse or data lake), making it much easier to analyze. The three steps in ETL process are mentioned below:
1. Data Extraction: Data extraction involves the following four steps:
• Identify the data to extract: The first step of data extraction is to identify the data sources. These sources
might be from relational SQL databases like MySQL or non-relational NoSQL databases like MongoDB or
Cassandra
• Estimate how large the data extraction is: The size of the data extraction matters. A larger quantity of data
will require a different ETL strategy.
• Choose the extraction method: common approaches include full extraction, incremental extraction, and update notifications from the source system.
2. Data Transformation: What is data transformation? List and explain all the data transformation steps (10)
In ETL, data transformation that occurs in a staging area (after extraction) is called “multistage data transformation”. In ELT,
data transformation that happens after loading data into the data warehouse is called “in-warehouse data
transformation”. Some of the common data transformations, a few of which are sketched in code after this list, are:
• Deduplication (normalizing): Identifies and removes duplicate information.
• Key restructuring: Draws key connections from one table to another.
• Cleansing: Involves deleting old, incomplete, and duplicate data to maximize data accuracy - perhaps
through parsing to remove syntax errors, typos, and fragments of records.
• Format revision: Converts formats in different datasets - like date/time, male/female, and units of
measurement - into one consistent format.
• Derivation: Creates transformation rules that apply to the data. For example, maybe you need to subtract
certain costs or tax liabilities from business revenue figures before analyzing them.
• Aggregation: Gathers and searches data so you can present it in a summarized report format.
• Integration: Reconciles diverse names/values that apply to the same data elements across the data
warehouse so that each element has a standard name and definition.
• Filtering: Selects specific columns, rows, and fields within a dataset.
• Splitting: Splits one column into more than one column.
• Joining: Links data from two or more sources, such as adding spend information across multiple SaaS
platforms.
• Summarization: Creates different business metrics by calculating value totals. For example, you might add
up all the sales made by a specific salesperson to create total sales metrics for specific periods.
• Validation: Sets up automated rules to follow in different circumstances. For instance, if the first five fields
in a row are NULL, then you can flag the row for investigation or prevent it from being processed with the
rest of the information.
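A short, assumed example of a few of the transformations above (deduplication, format revision, derivation and aggregation) using pandas; the column names and values are made up for illustration.

    import pandas as pd
    from datetime import datetime

    # Hypothetical extracted rows with a duplicate and mixed date formats
    raw = pd.DataFrame({
        "order_id":   [1, 1, 2, 3],
        "order_date": ["01/02/2024", "01/02/2024", "2024-02-03", "2024-02-04"],
        "gross":      [100.0, 100.0, 250.0, 80.0],
        "tax":        [18.0, 18.0, 45.0, 14.4],
    })

    def to_iso(d):
        # Format revision: accept DD/MM/YYYY or YYYY-MM-DD and emit one ISO format
        for fmt in ("%d/%m/%Y", "%Y-%m-%d"):
            try:
                return datetime.strptime(d, fmt).date().isoformat()
            except ValueError:
                pass
        return None

    df = raw.drop_duplicates(subset="order_id").copy()            # deduplication
    df["order_date"] = df["order_date"].map(to_iso)               # format revision
    df["net"] = df["gross"] - df["tax"]                           # derivation
    monthly = df.groupby(df["order_date"].str[:7])["net"].sum()   # aggregation/summarization
    print(monthly)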
3. Data Loading: Data loading is the process of loading the extracted information into your target data
repository. Loading is an ongoing process that could happen through “full loading” (the first time you load data
into the warehouse) or “incremental loading” (as you update the data warehouse with new information).
Incremental loads are the more complex of the two; a small upsert sketch follows.
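The upsert idea can be sketched as follows (an assumed example, not from the notes), using SQLite's long-standing INSERT OR REPLACE as a simple stand-in for a full upsert: rows whose key already exists are replaced, new rows are inserted.

    import sqlite3

    con = sqlite3.connect(":memory:")
    con.execute("CREATE TABLE customers (id INTEGER PRIMARY KEY, name TEXT, city TEXT)")

    # Full load: the very first time, everything goes in
    con.executemany("INSERT INTO customers VALUES (?, ?, ?)",
                    [(1, "Asha", "Delhi"), (2, "Ravi", "Pune")])

    # Incremental load with an upsert: existing rows replaced, new rows inserted
    batch = [(2, "Ravi", "Mumbai"),   # changed record -> replaced
             (3, "Meena", "Jaipur")]  # new record -> inserted
    con.executemany("INSERT OR REPLACE INTO customers VALUES (?, ?, ?)", batch)

    print(con.execute("SELECT * FROM customers ORDER BY id").fetchall())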
WORKING OF ETL
1. Parsing/Cleansing: Data generated by applications may be in various formats like JSON, XML, or CSV. The
parsing stage maps data into a table format with headers, columns, and rows, and then extracts specified
fields.
2. Data enrichment: Preparing data for analytics usually requires certain data enrichment steps, including
injecting expert knowledge, resolving discrepancies, and correcting bugs.
3. Setting velocity: “Velocity” refers to the frequency of data loading, i.e. inserting new data and updating
existing data.
4. Data validation: In some cases, data is empty, corrupted, or missing crucial elements. During data
validation, ETL finds these occurrences and determines whether to stop the entire process.
ELT (Extract/load/transform)
Extract/load/transform (ELT) is the process of extracting data from one or multiple sources and loading it into a
target data warehouse. Instead of transforming the data before it’s written, ELT takes advantage of the target
system to do the data transformation. This approach requires fewer remote sources than other techniques because
it needs only raw and unprepared data.
• Extract - This step works similarly in both ETL and ELT data management approaches. Raw streams of data from
virtual infrastructure, software, and applications are ingested either in their entirety or according to predefined
rules.
• Load - ELT differs from ETL here. Rather than delivering this mass of raw data to an interim
processing server for transformation, ELT delivers it directly to the target storage location. This shortens the
cycle between extraction and delivery.
• Transform - The database or data warehouse sorts and normalizes the data, keeping part or all of it on hand
and accessible for customized reporting. The overhead for storing this much data is higher, but it offers more
opportunities to mine it for relevant business intelligence in near real-time.
Need: The ELT process improves data conversion and manipulation capabilities due to parallel load and data
transformation functionality.
The ELT process saves you steps and time. Data is first loaded into the target ecosystem, such as a data warehouse,
and then transformed. Authorized users can securely access the data without returning it to the source systems;
no separate downloading is necessary.
ETL Vs ELT
The primary differences between ETL and ELT are how much data is retained in data warehouses and where data is
transformed. With ETL, the transformation of data is done before it is loaded into a data warehouse. This enables
analysts and business users to get the data they need faster, without building complex transformations or
persistent tables in their business intelligence tools. Using the ELT approach, data is loaded into the warehouse as
is, with no transformation before loading.
{Ch=5}
ONLINE ANALYTICAL PROCESSING(OLAP)
Online Analytical Processing (OLAP) is a technology to analyze and process data from multiple sources at the same
time. It accesses multiple databases simultaneously. It helps data analysts collect data from different
perspectives for developing effective business strategies. OLAP is software used to perform high-speed,
multivariate analysis of large amounts of data in data warehouses, data marts, or other unified, centralized
data stores. The data is broken down for display, monitoring or analysis.
Multidimensional Operations List and describe all the four types of OLAP operations. 10
Typically, the four types of OLAP operations are the following (a small code sketch of these operations is given after the list):
i. Roll-up
ii. Drill-down
iii. Slice and Dice
iv. Pivot (rotate)
1. Roll-up: This OLAP operation moves from a lower level of the concept hierarchy to a higher level, and is also
known as aggregation. It can also be performed by reducing the number of dimensions.
2. Drill-down: This is the process of scaling down from a higher level to a lower level, in which the data is broken
into smaller parts.
3. Slice: This is another OLAP operation to fetch data. Here a query on one dimension is triggered on the
database and a new sub-cube is created.
4. Dice: The Dice operation, as shown in figure 6, is just like the projection relational query you have read about
in RDBMS. In this technique, you select two or more dimensions, which results in the creation of a sub-cube.
5. Pivot: This OLAP operation fixes one attribute as a pivot and rotates the cube to fetch the results, just like an
inverted spreadsheet, giving it a different perspective.
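A small pandas sketch of these operations on a toy sales cube (assumed data and library, not from the notes): groupby acts as roll-up or drill-down, boolean selection as slice and dice, and pivot_table as pivot.

    import pandas as pd

    # Hypothetical sales cube flattened into a table: dimensions plus one measure
    cube = pd.DataFrame({
        "year":    [2023, 2023, 2023, 2024, 2024, 2024],
        "quarter": ["Q1", "Q1", "Q2", "Q1", "Q2", "Q2"],
        "city":    ["Delhi", "Pune", "Delhi", "Pune", "Delhi", "Pune"],
        "item":    ["Pen", "Pen", "Book", "Book", "Pen", "Book"],
        "sales":   [100, 80, 150, 90, 120, 60],
    })

    # Roll-up: climb the concept hierarchy from quarter to year
    rollup = cube.groupby("year")["sales"].sum()

    # Drill-down: break the same measure into finer quarter-level detail
    drilldown = cube.groupby(["year", "quarter"])["sales"].sum()

    # Slice: fix a single dimension value (city = Delhi) to get a sub-cube
    slice_ = cube[cube["city"] == "Delhi"]

    # Dice: fix values on two or more dimensions
    dice = cube[(cube["city"] == "Delhi") & (cube["item"] == "Pen")]

    # Pivot: rotate the cube to view cities against years
    pivot = cube.pivot_table(index="city", columns="year",
                             values="sales", aggfunc="sum")
    print(rollup, drilldown, slice_, dice, pivot, sep="\n\n")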
ROLAP Architecture
ROLAP implies Relational OLAP, an application based on relational DBMSs. It performs dynamic multidimensional
analysis of data stored in a relational database. The architecture is three-tier, with three components: the front
end (user interface), the ROLAP server (metadata request processing engine) and the back end (database server).
In this three-tier architecture the user submits a request, and the ROLAP engine converts the request into SQL and
submits it to the backend database. After processing the request, the engine presents the resulting data in
multidimensional format to make it easier for the client to view. The architecture has three components:
• Database server. ● ROLAP server. ● Front-end tool
MOLAP Architecture
Explain the following OLAP architectures and draw their architectural diagram: 10
(i) Multidimensional Online Analytical Processing (MOLAP)
(ii) Hybrid Online Analytical Processing (HOLAP)
MOLAP stands for Multidimensional Online Analytical Processing. It processes data using a multidimensional
cube with various combinations of dimensions. Since the data is stored in a multidimensional structure, the MOLAP
engine uses pre-computed or pre-stored information; the engine processes pre-compiled information. It has dynamic
abilities to perform aggregation along the concept hierarchy. MOLAP is very useful in time-series data analysis and
economic evaluation. The architecture has three components:
● Database server ● MOLAP server ● Front-End tool
Characteristics of MOLAP
• It is a user-friendly architecture, easy to use.
• The OLAP slice and dice operations speed up data retrieval.
• It has small pre-computed hypercubes.
HOLAP Architecture
HOLAP stands for Hybrid Online Analytical Processing. It is a hybrid of the ROLAP and MOLAP technologies and connects
both approaches in one architecture, storing part of the data in ROLAP and part in MOLAP. Depending on the query
request, it accesses the appropriate database. Relational tables are stored in the ROLAP structure, while data that
requires multidimensional views is stored and processed using the MOLAP architecture, as shown in figure 12.
It has the following components:
● Database server. ● ROLAP and MOLAP server. ● Front-end tool
Characteristics of HOLAP are:
• Flexible handling of data.
• Faster aggregation of data.
• HOLAP can drill down the data hierarchy and can access the relational database for any relevant information
stored in it.
DOLAP Architecture
Desktop Online Analytical Processing (DOLAP) architecture is most suitable for local multidimensional analysis. It is
like a miniature of multidimensional database or it’s like a sub cube or any business data cube. The components are:
• Database Server. ● DOLAP server. ● Front-end
Characteristics of DOLAP are:
• The three-tier architecture is designed for low-end, standalone users, like a small shop owner in the locality.
• The data cube is locally stored in the system so, retrieval of results is faster.
• No load on the backend or at the server end.
• DOLAP is relatively cheaper to deploy (NO DIAGRAM OF DOLAP)
{CH=6}
DATA LAKES Define a Data Lake. Explain the step-by-step process of creating a data lake. 10
A data lake is a central location that holds a large amount of data in its native, raw format. A data lake uses a flat
architecture and object storage to store the data. It can store structured, semi-structured, or unstructured data,
which means data can be kept in a more flexible format for future use. When storing data, a data lake associates it
with identifiers and metadata tags for faster retrieval.
(Note: no clear wording was found for the step-by-step process; it is to be understood from the internet and written up in one's own words.)
2. Data Vault Model: The model also allows coping with issues such as auditing, data traceability, data
loading performance and resilience to change. Data traceability implies that every row of data in a data vault
must be associated with record-source and date-of-load attributes. This information allows an auditor to
trace values back to their respective sources. The model is built from hubs, links and satellites (a small table
sketch is given after this list):
• Hubs: Hubs comprise a list of distinct business keys with a low propensity to change, and contain a
surrogate key per hub item. The hub also contains metadata designating the source of the business
key.
• Links: Link tables are models providing relations between business keys and are fundamentally
many-to-many join tables, containing some metadata.
Links can connect to other links, to handle granularity variations. Using link references within another
link is bad modelling practice, as it creates dependencies between links making parallel loading of
links more challenging.
• Satellites: The hubs' and links' temporal and descriptive attributes are stored in separate
constructs called satellite tables, or simply satellites. The satellites also comprise metadata
connecting them to their parent link or hub, metadata designating the source of the relationship and
attributes, and a time series with the attributes' start and end dates.
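A minimal sketch of the three Data Vault table shapes as SQLite DDL run from Python; the table and column names are illustrative assumptions, not a prescribed design.

    import sqlite3

    con = sqlite3.connect(":memory:")
    con.executescript("""
    -- Hub: distinct business keys plus load metadata
    CREATE TABLE hub_customer (
        customer_hk   INTEGER PRIMARY KEY,   -- surrogate key
        customer_bk   TEXT UNIQUE,           -- business key
        load_date     TEXT,
        record_source TEXT
    );
    CREATE TABLE hub_product (
        product_hk    INTEGER PRIMARY KEY,
        product_bk    TEXT UNIQUE,
        load_date     TEXT,
        record_source TEXT
    );
    -- Link: many-to-many relation between hubs, plus load metadata
    CREATE TABLE link_purchase (
        purchase_hk   INTEGER PRIMARY KEY,
        customer_hk   INTEGER REFERENCES hub_customer(customer_hk),
        product_hk    INTEGER REFERENCES hub_product(product_hk),
        load_date     TEXT,
        record_source TEXT
    );
    -- Satellite: descriptive, time-variant attributes hanging off a hub
    CREATE TABLE sat_customer_details (
        customer_hk   INTEGER REFERENCES hub_customer(customer_hk),
        name          TEXT,
        city          TEXT,
        load_date     TEXT,
        load_end_date TEXT,
        record_source TEXT
    );
    """)
    print([r[0] for r in con.execute("SELECT name FROM sqlite_master WHERE type='table'")])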
Data warehouse automation tools: A DWA tool provides a seamless experience, free from the
hassles of coding, for integration and transition of diverse data from its source to a DW and other related
components. The tool automates deployment of ETL scripts and batch-processing of data, and provides various
functions such as:
• Processes for high performance ETL and reliable ELT based integration of data
• Modelling of data at source
• Normalized, de-normalized and multidimensional data structures
• Seamless integration with various data sources
{Block-3}
{Ch=7}
DATA MINING What is data mining? List out the key differences between data warehousing and data
mining. Mention any four applications of data mining. 10
Data mining is the process by which patterns are found by filtering and sorting data based on conditions
in order to find something important. At a micro scale, mining can be defined as an activity that involves
gathering data in one place in a structured way, for example, collating data in an Excel spreadsheet or creating a
summary of the main points of some text.
• Data mining finds valuable information hidden in large volumes of data.
• Data mining is the analysis of data and the use of software techniques for finding patterns and regularities in
sets of data.
• The computer is responsible for finding the patterns by identifying the underlying rules and features in the data.
{CH=8}
Data Preprocessing Techniques:
• Data cleaning may be used to remove noise and to correct inconsistencies in the data.
• Data integration combines data from multiple sources into a coherent data store, such as a data warehouse.
• Data transformations, such as standardization, may be applied.
• Data reduction can reduce the data size by, for example, aggregation, elimination of redundant features, or
clustering; these techniques can also be used together.
DATA CLEANING Define ‘‘data cleaning’’ which is a data preprocessing technique. In this context, explain
the concept of Noisy data cleaning along with some suitable examples. 10
Real-world data tend to be incomplete, noisy and inconsistent. Data cleaning (or data cleansing) procedures are
intended to fill in missing values, smooth out noise while identifying outliers, and correct inconsistencies in the
data.
1. Missing Values: Some tuples have no value recorded for a number of attributes, such as customer
income. These missing values have to be filled in (a small sketch of this and of smoothing follows this list).
2. Noisy Data: Noise is a random error or variance in a measured variable. Noise is reduced by techniques of
data smoothing.
3. Data cleaning as a process:
Discrepancy Detection: The first step in data cleaning is to identify discrepancies. It uses knowledge of the
metadata and examines validation rules in order to identify the discrepancies.
Data Transformation: This is the second step of the data cleaning process. After identifying discrepancies, we
have to define a (sequence of) transformations and execute them to correct the discrepancies.
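A small pandas sketch of the two cleaning tasks from the list above, on made-up income values (an assumed example, not from the notes): filling a missing value with the column median, and smoothing noisy values by replacing each value with its bin mean, one common smoothing technique.

    import pandas as pd

    # Hypothetical customer income column with a missing entry and an outlier
    income = pd.Series([30000, None, 45000, 44000, 1000000, 52000], name="income")

    # Missing values: fill the empty entry, here with the median of the column
    income = income.fillna(income.median())

    # Noisy data: smooth by equal-width binning, replacing each value by its bin mean
    bins = pd.cut(income, bins=3)
    smoothed = income.groupby(bins, observed=True).transform("mean")
    print(smoothed)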
(Note: most of the topics of this chapter have been left out here.)
{Ch=9}
ASSOCIATION RULE MINING
Short note on Association Rule Generation (5)
What is Association Rule Mining? With reference to this, explain the concepts given below: 10
(i) Support Count (ii) Frequent Item set (iii) Association Rule (iv) Rule Evaluation metrics
• Association rule mining is a popular and well researched method for discovering interesting relations
between variables in large databases.
• It is intended to identify strong rules discovered in Databases using different measures of interestingness.
• Based on the concept of strong rules, Rakesh Agrawal et al. introduced association rules.
Rule Evaluation Metrics –
Support: It is one of the measures of interestingness. This tells about the usefulness and certainty of rules.
5% Support means total 5% of transactions in the database follow the rule.
Support(A -> B) = Support_count(A ∪ B) / Total number of transactions
Support count(X): Number of transactions in which X appears. If X is A union B then it is the number of
transactions in which A and B both are present.
For example, if a dataset contains 100 transactions and the item set {milk, bread} appears in 20 of those
transactions, the support count for {milk, bread} is 20.
Confidence (c): It is the ratio of the number of transactions that include all items in {A} as well as all items in {B}
to the number of transactions that include all items in {A}:
Confidence(A -> B) = Support_count(A ∪ B) / Support_count(A)
(Both measures are computed in the short sketch at the end of this section.)
Frequent item sets: A frequent item set is a set of items that occur together frequently in a dataset. The
frequency of an item set is measured by the support count, which is the number of transactions or records in the
dataset that contain the item set. Frequent item sets and association rules can be used for a variety of tasks such
as market basket analysis, cross-selling and recommendation systems.
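The support and confidence measures defined above can be computed directly on a toy set of transactions; this is an assumed, minimal sketch, not from the notes.

    # Hypothetical market-basket transactions
    transactions = [
        {"milk", "bread"},
        {"milk", "bread", "butter"},
        {"bread", "butter"},
        {"milk", "eggs"},
        {"milk", "bread", "eggs"},
    ]

    def support_count(itemset):
        # Number of transactions containing every item in the itemset
        return sum(1 for t in transactions if itemset <= t)

    A, B = {"milk"}, {"bread"}
    support = support_count(A | B) / len(transactions)      # Support(A -> B)
    confidence = support_count(A | B) / support_count(A)    # Confidence(A -> B)
    print(f"support = {support:.2f}, confidence = {confidence:.2f}")

Here support_count({milk, bread}) is 3 out of 5 transactions, so Support(milk -> bread) = 0.6 and Confidence(milk -> bread) = 3/4 = 0.75.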
{Block 4}
{Ch=10}
CLASSIFICATION
Define classification. With the help of an example for each, explain the following classification models: 10
(i) Descriptive modelling
(ii) Predictive modelling
A classification problem requires that examples be classified into one of two or more classes. A classification
problem can have real-valued or discrete input variables. A problem with two classes is called a two-class or binary
classification problem. A problem with more than two classes is called a multi-class classification problem.
Classification is the activity of learning a target function "c" that maps each attribute set "a" to one of the
predefined class labels "b". The attribute set 'a' can contain a number of attributes and can contain binary,
categorical and continuous attributes. The class label 'b' must be a discrete attribute, i.e. binary or categorical
(nominal or ordinal).
Classification Models:
Descriptive Modelling: A classification model used as an explanatory tool to summarize the data and to distinguish
between objects of different classes, e.g. summarizing which characteristics separate loan defaulters from non-defaulters.
Predictive Modelling: A classification model used to predict the class label of unknown, previously unseen records,
e.g. predicting whether a new loan applicant will default.
Decision Tree:
What is a Decision Tree? How is it useful in classification? With the help of an example, explain the process of
construction of a decision tree and its representation. 10
Decision tree is one of the most effective and common prediction and classification methods. A decision tree is a
flowchart-like tree structure, where each internal node denotes a test on an attribute, each branch represents an
outcome of the test, and each leaf node holds a class label.
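A minimal decision-tree classification sketch with scikit-learn (an assumed library choice; the notes do not name a tool), using the bundled iris dataset: the printed tree shows attribute tests at the internal nodes and class labels at the leaves.

    from sklearn.datasets import load_iris
    from sklearn.model_selection import train_test_split
    from sklearn.tree import DecisionTreeClassifier, export_text

    # Load a small labelled dataset and hold out part of it for testing
    X, y = load_iris(return_X_y=True)
    X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

    # Fit a shallow tree; max_depth keeps the flowchart readable
    tree = DecisionTreeClassifier(max_depth=3, random_state=0).fit(X_train, y_train)

    print(export_text(tree))                      # internal nodes = attribute tests, leaves = classes
    print("accuracy:", tree.score(X_test, y_test))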
{Ch=11}
CLUSTERING Short note on Clustering and its Methods (5)
A cluster is a grouping of comparable items that belong to the same category.
In clustering, a group of comparable data objects is classed as a cluster. Cluster analysis divides data into groups
based on how closely the objects resemble one another. Clustering is an unsupervised machine-learning technique
that divides a set of data points into clusters so that objects in the same cluster are similar to one another.
Using clustering, data can be subdivided into smaller, more manageable groups; the data in each of these subgroups
forms a single cluster.
Applications of cluster analysis in Data Mining:
• Clustering analysis is widely utilized in a variety of fields, including data analysis, market research, pattern
identification, and image processing.
• It aids marketers in identifying distinct customer segments based on their customers' purchasing habits.
They are able to classify their clients.
• It aids data discovery by assigning documents on the internet.
• For example, credit card fraud detection relies on clustering.
Categorization of Major Clustering Methods:
1. Partitioning Method: There are a few requirements that must be met by this partitioning clustering
method (K-means, sketched after this list, is the classic example), and they are as follows:
Each group must contain at least one object.
Each object must belong to exactly one group.
2. Hierarchical Method: This method decomposes a set of data items into a hierarchy. Depending on how the
hierarchical breakdown is generated, we can put hierarchical approaches into different categories.
3. Density-based Method
4. Grid-Based Method
5. Model-Based Method
6. Constraint-based Method
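A minimal K-means sketch with scikit-learn (assumed library and made-up customer data, not from the notes), illustrating the partitioning method: every point is assigned to exactly one of the k clusters.

    import numpy as np
    from sklearn.cluster import KMeans

    # Hypothetical customer data: [annual spend, number of visits]
    X = np.array([[200, 4], [220, 5], [800, 20], [780, 18], [50, 1], [60, 2]])

    km = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X)
    print("cluster labels:", km.labels_)          # each customer belongs to exactly one cluster
    print("cluster centres:", km.cluster_centers_)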
{Ch=12}
Text Mining
Define text mining. Mention any three applications of it. What are the various techniques used to analyse the web-
usage patterns? 10
What is Text Mining? Where is it used? Explain any two text mining techniques with the help of a suitable example
for each. 10
Text mining is a component of data mining that deals specifically with unstructured text data. Text mining can be
used as a preprocessing step for data mining or as a standalone process for specific tasks.
By using text mining, the unstructured text data can be transformed into structured data that can be used for
data mining tasks such as classification, clustering, and association rule mining. This allows organizations to gain
insights from a wide range of data sources, such as customer feedback, social media posts, and news articles.
Text mining is widely used in various fields, such as natural language processing, information retrieval, and social
media analysis. It has become an essential tool for organizations to extract insights from unstructured text data
and make data-driven decisions.
Procedures for Analysing Text (a small text-categorization sketch is given after this list)
• Text Summarization: To automatically extract partial content that reflects the whole content of the text.
• Text Categorization: To assign a category to the text among categories predefined by users.
• Text Clustering: To segment texts into several clusters, depending on the substantial relevance.
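A minimal text-categorization sketch with scikit-learn (assumed library and toy labelled reviews, not from the notes): unstructured text is turned into structured TF-IDF features and then classified into predefined categories.

    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.naive_bayes import MultinomialNB
    from sklearn.pipeline import make_pipeline

    # Toy labelled documents (hypothetical customer feedback)
    docs = ["great phone, excellent battery", "terrible battery, poor screen",
            "loved the camera and display", "worst purchase, very disappointed"]
    labels = ["positive", "negative", "positive", "negative"]

    # Unstructured text -> structured TF-IDF features -> classifier
    model = make_pipeline(TfidfVectorizer(), MultinomialNB()).fit(docs, labels)
    print(model.predict(["the display is excellent"]))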
WEB MINING
Web mining, as the name suggests, involves the mining of web data. The extraction of information from
websites uses data mining techniques. It is an application based on data mining techniques. The parameters
generally to be mined in web pages are hyperlinks, text or content of web pages, linked user activity between web
pages of the same website or among different websites.
Web Mining performs various tasks such as:
1) Generating patterns existing in some websites, like customer buying behavior or navigation of web sites.
2) Web mining helps to retrieve faster results for the queries or the search text posted on search engines like
Google, Yahoo, etc.
3) The ability to classify web documents according to the searches performed on e-commerce websites helps to
increase business and transactions.