Data Warehousing & Data Mining

1. A data warehouse is a central repository where data from multiple sources is stored and organized so that it can be analyzed. It allows data to be analyzed for business intelligence and insights. 2. Data in a data warehouse comes from operational databases and other sources. It is organized and aggregated for analysis rather than transactional purposes. Data warehouses provide historical, integrated views of data across the entire organization. 3. Data warehouses are used across many industries for applications like analyzing customer behavior, monitoring product performance, assessing risk, and gaining strategic insights for planning and decision making.

Uploaded by

Binay Yadav

UNIT-1 INTRODUCTION

Data:


In general, data is any set of characters that is gathered and translated for some purpose, usually analysis.
Data that is not put into context carries no meaning for a human or a computer.
When data is processed, organized, structured, or presented in a given context so as to make it useful,
it is called information.
Who creates data?
Data can be created on a computer by the user, software, or hardware connected to the computer.
How is data stored on a computer?
Data and information are stored on a computer using a hard drive or another storage device.
How does a Data Warehouse work?
A Data Warehouse works as a central repository where information arrives from one or more data sources.
Data flows into a data warehouse from the transactional system and other relational databases.
Data may be:
1. Structured
2. Semi-structured
3. Unstructured data
Structured data is data that conforms to a data model, has a well-defined structure, follows a
consistent order, and can be easily accessed and used by a person or a computer program.
Structured data is usually stored in well-defined schemas such as databases. It is generally tabular, with
columns and rows that clearly define its attributes.
Sources of Structured Data:
• SQL Databases
• Spreadsheets such as Excel
• OLTP Systems
• Online forms
• Sensors such as GPS or RFID tags
• Network and Web server logs
• Medical devices

Semi-structured data is data that does not conform to a data model but has some structure. It lacks a
fixed or rigid schema. It does not reside in a relational database, but it has some
organizational properties that make it easier to analyze. With some processing, it can be stored in a
relational database.
Sources of semi-structured Data:
• E-mails
• XML and other markup languages
• TCP/IP packets
• Zipped files
• Integration of data from different sources
• Web pages
Unstructured data is data that does not conform to a data model and has no easily identifiable
structure, so it cannot easily be used by a computer program.
Sources of Unstructured Data:
• Web pages
• Images (JPEG, GIF, PNG, etc.)
• Videos
• Memos
• Reports
• Word documents and PowerPoint presentations
• Surveys
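
The difference between the three kinds of data can be seen in how a program reads them. The following is a minimal Python sketch; the sample values and variable names are made up for illustration and do not come from any real system.

```python
import csv
import io
import json

# Structured: tabular text with a fixed set of columns (hypothetical sample).
structured = "id,name,city\n1,Asha,Kathmandu\n2,Ravi,Pokhara\n"
rows = list(csv.DictReader(io.StringIO(structured)))

# Semi-structured: self-describing keys, but records need not share one schema.
semi_structured = '[{"id": 1, "tags": ["new"]}, {"id": 2, "email": "r@example.com"}]'
records = json.loads(semi_structured)
all_keys = sorted(set().union(*records))  # keys seen across all records

# Unstructured: free text with no data model; only generic operations apply.
unstructured = "Meeting moved to Friday. Please send the report before noon."
word_count = len(unstructured.split())

print(rows[0]["city"])  # structured data can be addressed by column name
print(all_keys)         # semi-structured records expose differing keys
print(word_count)       # unstructured text needs further processing to analyze
```

The structured rows can be queried by column directly, the semi-structured records must first be inspected to learn which keys each one carries, and the unstructured text offers nothing beyond raw characters until it is processed further.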

Er.Binay Kumar Yadav Page 1


Data Quality Definition


Data quality is the measure of how well suited a data set is to serve its specific purpose. Measures of data
quality are based on data quality characteristics such as accuracy, completeness, consistency, validity,
uniqueness, and timeliness.

Data Quality Dimensions


There are six main dimensions of data quality:
• Accuracy: The data should reflect actual, real-world scenarios; accuracy can be
confirmed against a verifiable source.
• Completeness: Completeness is a measure of the data's ability to deliver all the required
values that are available.
• Consistency: Data consistency refers to the uniformity of data as it moves across networks and
applications. The same data values stored in different locations should not conflict with one
another.
• Validity: Data should be collected according to defined business rules and parameters, and should
conform to the right format and fall within the right range.
• Uniqueness: Uniqueness ensures there are no duplications or overlapping of values across all data
sets. Data cleansing and deduplication can help remedy a low uniqueness score.
• Timeliness: Timely data is data that is available when it is required. Data may be updated in real time
to ensure that it is readily available and accessible.
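
Some of these dimensions can be turned into simple scores. The sketch below is only an illustration in Python: the records, the field names, and the 0 to 120 age range are assumptions chosen for the example, and real data quality tools define these measures in more detail.

```python
# Hypothetical customer records; None marks a missing value.
records = [
    {"id": 1, "email": "a@example.com", "age": 34},
    {"id": 2, "email": None, "age": 29},
    {"id": 3, "email": "c@example.com", "age": 210},  # outside the valid range
    {"id": 1, "email": "a@example.com", "age": 34},   # duplicate of the first record
]

def completeness(rows, field):
    """Share of rows in which `field` is present (not None)."""
    return sum(r[field] is not None for r in rows) / len(rows)

def uniqueness(rows):
    """Share of rows that are distinct when all fields are compared."""
    distinct = {tuple(sorted(r.items(), key=lambda kv: kv[0])) for r in rows}
    return len(distinct) / len(rows)

def validity(rows, field, low, high):
    """Share of non-missing values of `field` that fall within [low, high]."""
    values = [r[field] for r in rows if r[field] is not None]
    return sum(low <= v <= high for v in values) / len(values)

print(completeness(records, "email"))
print(uniqueness(records))
print(validity(records, "age", 0, 120))
```

A low uniqueness score here points to the duplicated record, which deduplication would remove.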
What is Data Warehousing?
Data Warehousing (DW) is the process of collecting and managing data from varied sources to provide
meaningful business insights. A data warehouse is typically used to connect and analyze business data
from heterogeneous sources.
A data warehouse system is also known by the following names:
• Decision Support System (DSS)
• Executive Information System
• Management Information System
• Business Intelligence Solution
• Analytic Application
• Data Warehouse


Important Features of Data Warehouse


The Important features of Data Warehouse are given below:
1. Subject Oriented
A data warehouse is subject-oriented. It provides useful data about a subject instead of the company's
ongoing operations, and these subjects can be customers, suppliers, marketing, product, promotion, etc. A
data warehouse usually focuses on modeling and analysis of data that helps the business organization to
make data-driven decisions.
2. Time-Variant
The data in a data warehouse is tied to a specific time period, so it provides information for that period.
3. Integrated
A data warehouse is built by integrating data from heterogeneous sources, such as relational databases, flat
files, etc.
4. Non-Volatile
Once data has been entered into the warehouse, it cannot be changed.
Advantages of Data Warehouse:
• More accurate data access
• Improved productivity and performance
• Cost-efficient
• Consistent and quality data
Types of Data Warehouse
Three main types of Data Warehouses (DWH) are:
1. Enterprise Data Warehouse (EDW):
An Enterprise Data Warehouse (EDW) is a centralized warehouse. It provides decision support services across
the enterprise. It offers a unified approach for organizing and representing data. It also provides the ability to
classify data according to subject and to give access according to those divisions.
2. Operational Data Store:
An Operational Data Store (ODS) is a data store required when neither the data
warehouse nor the OLTP systems support an organization's reporting needs. In an ODS, the data is refreshed
in real time. Hence, it is widely preferred for routine activities like storing records of employees.
3. Data Mart:
A data mart is a subset of the data warehouse. It is specially designed for a particular line of business, such as
sales or finance. In an independent data mart, data can be collected directly from the sources.
General stages of Data Warehouse
Organizations started with relatively simple uses of data warehousing. Over time, more
sophisticated uses of data warehousing began.
The following are general stages of use of the data warehouse (DWH):
Offline Operational Database:
In this stage, data is just copied from an operational system to another server. In this way, loading,
processing, and reporting of the copied data do not impact the operational system’s performance.
Offline Data Warehouse:
Data in the Data warehouse is regularly updated from the Operational Database. The data in Data warehouse
is mapped and transformed to meet the Data warehouse objectives.
Real time Data Warehouse:
In this stage, data warehouses are updated whenever a transaction takes place in the operational database, for
example in an airline or railway booking system.
Integrated Data Warehouse:
In this stage, data warehouses are updated continuously when the operational system performs a
transaction. The data warehouse then generates transactions which are passed back to the operational
system.


Components of Data warehouse


Four components of Data Warehouses are:
Load manager: The load manager is also called the front component. It performs all the operations
associated with the extraction and loading of data into the warehouse. These operations include transformations
to prepare the data for entering into the data warehouse.
Warehouse Manager: The warehouse manager performs operations associated with the management of the data
in the warehouse. These include analysis of data to ensure consistency, creation of indexes and
views, generation of denormalizations and aggregations, transformation and merging of source data, and
archiving and backing up data.
Query Manager: The query manager is also known as the backend component. It performs all the
operations related to the management of user queries. This component directs queries to the
appropriate tables and schedules the execution of queries.
End-user access tools:
These are categorized into five groups:
1. Data reporting tools
2. Query tools
3. Application development tools
4. EIS tools
5. OLAP tools and data mining tools
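
The division of work between the load manager, the warehouse manager, and the query manager can be sketched with Python's built-in sqlite3 module. This is only a toy illustration under stated assumptions: the table names (sales, monthly_sales), the index name, and the sample rows are hypothetical, and a production warehouse would use dedicated ETL tooling rather than a script like this.

```python
import sqlite3

# Hypothetical operational (OLTP) source with one row per sale.
source = sqlite3.connect(":memory:")
source.execute("CREATE TABLE sales (sale_date TEXT, product TEXT, amount REAL)")
source.executemany(
    "INSERT INTO sales VALUES (?, ?, ?)",
    [("2024-01-05", "soap", 2.5), ("2024-01-05", "soap", 2.5),
     ("2024-01-06", "shampoo", 6.0), ("2024-02-01", "soap", 2.5)],
)

# Hypothetical warehouse table holding monthly totals per product.
warehouse = sqlite3.connect(":memory:")
warehouse.execute("CREATE TABLE monthly_sales (month TEXT, product TEXT, total REAL)")

# Load manager, extract step: read raw rows from the source system.
rows = source.execute("SELECT sale_date, product, amount FROM sales").fetchall()

# Load manager, transform step: aggregate to (month, product) totals.
totals = {}
for sale_date, product, amount in rows:
    key = (sale_date[:7], product)  # "YYYY-MM" prefix of the date
    totals[key] = totals.get(key, 0.0) + amount

# Load step, followed by a warehouse-manager task: create an index.
warehouse.executemany(
    "INSERT INTO monthly_sales VALUES (?, ?, ?)",
    [(month, product, total) for (month, product), total in sorted(totals.items())],
)
warehouse.execute("CREATE INDEX idx_month ON monthly_sales (month)")
warehouse.commit()

# Query manager: direct a user query to the appropriate table.
result = warehouse.execute(
    "SELECT month, product, total FROM monthly_sales ORDER BY month, product"
).fetchall()
print(result)
```

The operational table is only read, never modified, which mirrors the idea that warehouse loading should not interfere with the operational system.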
What Is a Data Warehouse Used For?
Here are the most common sectors where a data warehouse is used:
Airline:
In the airline system, it is used for operational purposes like crew assignment, analysis of route profitability,
frequent flyer program promotions, etc.
Banking:
It is widely used in the banking sector to manage available resources effectively. Some banks also
use it for market research and for performance analysis of products and operations.
Healthcare:
The healthcare sector also uses data warehouses to strategize and predict outcomes, generate patients' treatment
reports, and share data with tie-in insurance companies, medical aid services, etc.
Public sector:
In the public sector, a data warehouse is used for intelligence gathering. It helps government agencies to
maintain and analyze tax records and health policy records for every individual.
Investment and Insurance sector:
In this sector, warehouses are primarily used to analyze data patterns and customer trends, and to track
market movements.
Retail chain:
In retail chains, a data warehouse is widely used for distribution and marketing. It also helps to track items,
customer buying patterns, and promotions, and is used for determining pricing policy.
Telecommunication:
A data warehouse is used in this sector for product promotions, sales decisions, and distribution
decisions.
Hospitality Industry:
This industry uses warehouse services to design and estimate its advertising and promotion
campaigns, targeting clients based on their feedback and travel patterns.

What is Data Mining?


Data mining is the process of extracting information from huge sets of data to identify patterns, trends, and
useful data that allow a business to make data-driven decisions.

Data mining is also the process of investigating hidden patterns of information from various
perspectives and categorizing them into useful data. That data is collected and assembled in areas such as
data warehouses, where efficient analysis and data mining algorithms support decision making and other data
requirements, ultimately cutting costs and generating revenue.

Data mining is a process used by organizations to extract specific data from huge
databases to solve business problems. It primarily turns raw data into useful information.


Types of Data Mining


Data mining can be performed on the following types of data:
Relational Database:
A relational database is a collection of multiple data sets formally organized into tables, records, and columns,
from which data can be accessed in various ways without having to reorganize the database tables. Tables
convey and share information, which facilitates data searchability, reporting, and organization.
Data warehouses:
A Data Warehouse is the technology that collects the data from various sources within the organization to
provide meaningful business insights. The huge amount of data comes from multiple places such as
Marketing and Finance. The extracted data is utilized for analytical purposes and helps in decision- making
for a business organization. The data warehouse is designed for the analysis of data rather than transaction
processing.
Data Repositories:
A data repository generally refers to a destination for data storage. However, many IT professionals
use the term more specifically to refer to a particular kind of setup within an IT structure, for example a group
of databases in which an organization keeps various kinds of information.
Object-Relational Database:
A combination of an object-oriented database model and relational database model is called an object-
relational model. It supports Classes, Objects, Inheritance, etc.
One of the primary objectives of the Object-relational data model is to close the gap between the Relational
database and the object-oriented model practices frequently utilized in many programming languages, for
example, C++, Java, C#, and so on.
Transactional Database:
A transactional database refers to a database management system (DBMS) that can undo a
database transaction if it is not performed appropriately. Although this was once a unique capability,
today most relational database systems support transactional database activities.
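
The ability to undo a transaction can be illustrated with Python's built-in sqlite3 module. The following is a minimal sketch, not a banking implementation: the accounts table, the names, and the transfer function are hypothetical and exist only for the example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE accounts (name TEXT PRIMARY KEY, "
    "balance INTEGER CHECK (balance >= 0))"
)
conn.executemany("INSERT INTO accounts VALUES (?, ?)", [("alice", 100), ("bob", 50)])
conn.commit()

def transfer(conn, src, dst, amount):
    """Move `amount` from src to dst; undo both updates if either one fails."""
    try:
        conn.execute(
            "UPDATE accounts SET balance = balance + ? WHERE name = ?", (amount, dst)
        )
        conn.execute(
            "UPDATE accounts SET balance = balance - ? WHERE name = ?", (amount, src)
        )
        conn.commit()
        return True
    except sqlite3.IntegrityError:
        conn.rollback()  # the credit already applied to dst is undone as well
        return False

failed = transfer(conn, "alice", "bob", 500)  # would make alice's balance negative
after_failed = dict(conn.execute("SELECT name, balance FROM accounts"))
print(failed, after_failed)
```

The first update succeeds and the second violates the CHECK constraint, so the rollback restores both balances to their values before the transfer began.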
FEATURES OF DATA MINING:
• It works well with large databases and datasets
• It predicts future results
• It creates actionable insights
• It utilizes the automated discovery of patterns
ADVANTAGES OF DATA MINING:
• Fraud Detection:
It is used to find which insurance claims, phone calls, and debit or credit purchases are fraudulent.
• Trend Analysis:
Existing marketplace trends are analyzed, which provides a strategic benefit because it helps reduce costs,
as in manufacturing according to demand.
• Market Analysis:
It can predict the market and therefore help to make business decisions. For example, it can identify a target
market for a retailer, or certain types of products desired by types of customers.
Data Mining Techniques
Data mining uses algorithms and various techniques to convert large collections of data into useful output.
The most popular types of data mining techniques include:
• Association rules, also referred to as market basket analysis, search for relationships between
variables. This relationship in itself creates additional value within the data set as it strives to link
pieces of data. For example, association rules would search a company's sales history to see which
products are most commonly purchased together; with this information, stores can plan, promote,
and forecast accordingly.
• Classification assigns objects to predefined classes. These classes describe characteristics of
items or represent what the data points have in common with each other. This data mining technique
allows the underlying data to be more neatly categorized and summarized across similar features or
product lines.
• Clustering is similar to classification. However, clustering identifies similarities between objects,
then groups those items based on what makes them different from other items. While classification
may result in groups such as "shampoo", "conditioner", "soap", and "toothpaste", clustering may
identify groups such as "hair care" and "dental health".
• Decision trees are used to classify or predict an outcome based on a set list of criteria or decisions.
A decision tree asks a series of cascading questions that sort the dataset based
on the responses given. Sometimes depicted as a tree-like visual, a decision tree allows for specific
direction and user input when drilling deeper into the data.
• K-Nearest Neighbor (KNN) is an algorithm that classifies data based on its proximity to other
data. KNN is rooted in the assumption that data points that are close to each other are more
similar to each other than to other data points. This non-parametric, supervised technique is used to
predict features of a group based on individual data points.
• Neural networks process data through the use of nodes. These nodes are comprised of inputs,
weights, and an output. Data is mapped through supervised learning (similar to how the human
brain is interconnected). This model can be fit to give threshold values to determine a model's
accuracy.
• Predictive analysis strives to leverage historical information to build graphical or mathematical
models to forecast future outcomes. Overlapping with regression analysis, this data mining
technique aims to estimate an unknown future figure based on the current data on hand.
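
The association rule idea of finding products that are commonly purchased together can be sketched in a few lines of Python. The transactions below are made up for illustration, and the sketch only counts item pairs; full association rule mining algorithms such as Apriori handle larger item sets and prune the search, which is not shown here.

```python
from collections import Counter
from itertools import combinations

# Hypothetical sales history: each transaction is the set of items in one basket.
transactions = [
    {"shampoo", "conditioner"},
    {"shampoo", "conditioner", "soap"},
    {"toothpaste", "soap"},
    {"shampoo", "soap"},
]

# Count how often each item and each unordered pair of items occurs.
item_counts = Counter(item for t in transactions for item in t)
pair_counts = Counter(
    pair for t in transactions for pair in combinations(sorted(t), 2)
)

def support(item_a, item_b):
    """Fraction of all transactions that contain both items."""
    return pair_counts[tuple(sorted((item_a, item_b)))] / len(transactions)

def confidence(antecedent, consequent):
    """Fraction of baskets with `antecedent` that also contain `consequent`."""
    return pair_counts[tuple(sorted((antecedent, consequent)))] / item_counts[antecedent]

print(support("shampoo", "conditioner"))
print(confidence("conditioner", "shampoo"))
print(confidence("shampoo", "conditioner"))
```

In this sample, every basket with conditioner also contains shampoo, while only some shampoo baskets contain conditioner, so the rule is stronger in one direction than the other.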
Challenges of Implementation in Data mining
Although data mining is very powerful, it faces many challenges during its execution. Various challenges
could be related to performance, data, methods, and techniques, etc. The process of data mining becomes
effective when the challenges or problems are correctly recognized and adequately resolved.

Incomplete and noisy data:


Data mining is the process of extracting useful data from large volumes of data. Real-world data is
heterogeneous, incomplete, and noisy. Data in huge quantities will usually be inaccurate or unreliable. These
problems may occur due to faulty measuring instruments or because of human error.
Data Distribution:
Real-world data is usually stored on various platforms in a distributed computing environment. It might be
in a database, on individual systems, or even on the internet. In practice, it is quite difficult to bring all the
data into a centralized data repository, mainly due to organizational and technical concerns.
Complex Data:
Real-world data is heterogeneous, and it could be multimedia data, including audio and video, images,
complex data, spatial data, time series, and so on. Managing these various types of data and extracting useful
information is a tough task. Most of the time, new technologies, new tools, and methodologies would have
to be refined to obtain specific information.
Performance:
The data mining system's performance relies primarily on the efficiency of algorithms and techniques used.
If the designed algorithm and techniques are not up to the mark, then the efficiency of the data mining
process will be affected adversely.
Data Privacy and Security:
Data mining usually leads to serious issues in terms of data security, governance, and privacy. For example,
if a retailer analyzes the details of the purchased items, then it reveals data about buying habits and
preferences of the customers without their permission.


Data Visualization:
In data mining, data visualization is a very important process because it is the primary method that shows
the output to the user in a presentable way. The extracted data should convey the exact meaning of what it
intends to express. But many times, representing the information to the end-user in a precise and easy way is
difficult. Because both the input data and the output information are complicated, efficient and effective data
visualization processes need to be implemented.

Data Mining Applications


Data mining is highly useful in the following domains:
• Market Analysis and Management
• Corporate Analysis & Risk Management
• Fraud Detection
Apart from these, data mining can also be used in the areas of production control, customer retention,
science exploration, sports, astrology, and Internet Web Surf-Aid.
Market Analysis and Management
Listed below are the various fields of the market where data mining is used:
• Customer Profiling: Data mining helps determine what kind of people buy what kind of products.
• Identifying Customer Requirements: Data mining helps in identifying the best products for
different customers. It uses prediction to find the factors that may attract new customers.
• Cross Market Analysis: Data mining finds associations and correlations between product sales.
• Target Marketing: Data mining helps to find clusters of model customers who share the same
characteristics such as interests, spending habits, income, etc.
• Determining Customer Purchasing Patterns: Data mining helps in determining customer
purchasing patterns.
• Providing Summary Information: Data mining provides various multidimensional summary
reports.
Corporate Analysis and Risk Management
Data mining is used in the following fields of the corporate sector:
• Finance Planning and Asset Evaluation: It involves cash flow analysis and prediction, and contingent
claim analysis to evaluate assets.
• Resource Planning: It involves summarizing and comparing resources and spending.
• Competition: It involves monitoring competitors and market directions.
Fraud Detection
Data mining is also used in credit card services and telecommunication to detect fraud. For fraudulent
telephone calls, it helps to find the destination of the call, the duration of the call, the time of day or week, etc. It
also analyzes patterns that deviate from expected norms.
Data Preprocessing in Data Mining
Preprocessing in Data Mining:
Data preprocessing is a data mining technique used to transform raw data into a useful and
efficient format.


Steps Involved in Data Preprocessing:


1. Data Cleaning:
The data can have many irrelevant and missing parts. Data cleaning is done to handle them. It
involves handling missing data, noisy data, etc.
(a). Missing Data:
This situation arises when some values are missing from the data. It can be handled in various ways.
Some of them are:
i. Ignore the tuples:
This approach is suitable only when the dataset is quite large and multiple values are
missing within a tuple.
ii. Fill the Missing values:
There are various ways to do this task. You can choose to fill the missing values manually, with the
attribute mean, or with the most probable value.
(b). Noisy Data:
Noisy data is meaningless data that cannot be interpreted by machines. It can be generated by faulty
data collection, data entry errors, etc. It can be handled in the following ways:
i. Binning Method:
This method works on sorted data in order to smooth it. The whole data is divided into segments of
equal size, and then various methods are performed to complete the task. Each segment is handled
separately. One can replace all data in a segment by its mean, or boundary values can be used to
complete the task.
ii. Regression:
Here data can be made smooth by fitting it to a regression function. The regression used may be
linear (having one independent variable) or multiple (having multiple independent variables).
iii. Clustering:
This approach groups similar data into clusters. Outliers may go undetected, or they will fall
outside the clusters.
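
The binning method described above can be sketched in Python as follows. The function names and the sample values are chosen for illustration only. In the boundary variant, a value that is equally close to both boundaries is sent to the lower one, which is an assumption of this sketch rather than a fixed rule.

```python
def equal_size_bins(values, size):
    """Sort the values and split them into consecutive bins of `size` items."""
    ordered = sorted(values)
    return [ordered[i:i + size] for i in range(0, len(ordered), size)]

def smooth_by_means(bins):
    """Replace every value in a bin with the mean of that bin."""
    return [[sum(b) / len(b)] * len(b) for b in bins]

def smooth_by_boundaries(bins):
    """Replace every value with the closer of its bin's minimum and maximum."""
    return [[b[0] if v - b[0] <= b[-1] - v else b[-1] for v in b] for b in bins]

# Hypothetical noisy measurements.
data = [21, 4, 28, 15, 25, 8, 34, 21, 24]
bins = equal_size_bins(data, 3)

print(bins)
print(smooth_by_means(bins))
print(smooth_by_boundaries(bins))
```

Smoothing by means removes more variation within each bin, while smoothing by boundaries keeps every result equal to a value that actually occurred in the data.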
2. Data Transformation:
This step transforms the data into forms suitable for the mining process. It
involves the following ways:
i. Normalization:
It is done in order to scale the data values in a specified range (-1.0 to 1.0 or 0.0 to 1.0)
ii. Attribute Construction:
In this strategy, new attributes are constructed from the given set of attributes to help the mining
process.
iii. Discretization:
This is done to replace the raw values of a numeric attribute with interval levels or conceptual levels.
iv. Concept Hierarchy Generation:
Here attributes are converted from a lower level to a higher level in the hierarchy. For example, the attribute
“city” can be converted to “country”.
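
Normalization into a specified range is commonly done with min-max scaling, which is one of several possible normalization methods. The following Python sketch assumes that approach; the function name and the sample values are illustrative only.

```python
def min_max_scale(values, new_min=0.0, new_max=1.0):
    """Linearly rescale values so the smallest maps to new_min and the largest to new_max."""
    lo, hi = min(values), max(values)
    if hi == lo:
        # All values are identical, so there is no spread to scale.
        return [new_min for _ in values]
    span = new_max - new_min
    return [new_min + (v - lo) * span / (hi - lo) for v in values]

# Hypothetical attribute values.
incomes = [10, 20, 30]

print(min_max_scale(incomes))             # scaled into 0.0 to 1.0
print(min_max_scale(incomes, -1.0, 1.0))  # scaled into -1.0 to 1.0
```

Because the scaling depends on the minimum and maximum, a single extreme outlier can compress all other values into a narrow band, which is one reason noisy data is usually cleaned first.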
3. Data Reduction:
Data mining is a technique used to handle huge amounts of data, and analysis becomes harder as the
volume of data grows. To deal with this, we use data reduction
techniques. They aim to increase storage efficiency and reduce data storage and analysis costs.
The various steps to data reduction are:
i. Data Cube Aggregation:
An aggregation operation is applied to the data to construct the data cube.
ii. Attribute Subset Selection:
Only the highly relevant attributes should be used; the rest can be discarded. To perform attribute
selection, one can use the level of significance and the p-value of the attribute. An attribute with a p-value
greater than the significance level can be discarded.
iii. Numerosity Reduction:
This makes it possible to store a model of the data instead of the whole data, for example regression models.

iv. Dimensionality Reduction:
This reduces the size of the data by encoding mechanisms. It can be lossy or lossless. If the original data can
be retrieved after reconstruction from the compressed data, the reduction is called lossless; otherwise it
is called lossy. Two effective methods of dimensionality reduction are wavelet transforms
and PCA (Principal Component Analysis).

Data Mining – Knowledge Discovery in Databases (KDD)


Why do we need Data Mining?
The volume of information from business transactions, scientific data, sensor data, pictures, videos, etc.
is increasing every day beyond what we can handle. So, we need a system that is capable of extracting the
essence of the available information and that can automatically generate reports,
views, or summaries of data for better decision-making.
Why is Data Mining used in Business?
Data mining is used in business to make better managerial decisions by:
• Automatically summarizing data.
• Extracting the essence of the information stored.
• Discovering patterns in raw data.

Data mining, also known as Knowledge Discovery in Databases, refers to the nontrivial extraction of
implicit, previously unknown, and potentially useful information from data stored in databases.
Steps Involved in KDD Process:

[Figure: KDD process]

1. Data Cleaning: Data cleaning is defined as the removal of noisy and irrelevant data from the collection.
• Cleaning in case of missing values.
• Cleaning noisy data, where noise is a random error or variance.
• Cleaning with data discrepancy detection and data transformation tools.
2. Data Integration: Data integration is defined as combining heterogeneous data from multiple sources
into a common source (data warehouse).
• Data integration using data migration tools.
• Data integration using data synchronization tools.
• Data integration using the ETL (Extract-Transform-Load) process.
3. Data Selection: Data selection is defined as the process where data relevant to the analysis is decided
and retrieved from the data collection.
• Data selection using neural networks.
• Data selection using decision trees.
• Data selection using Naive Bayes.
• Data selection using clustering, regression, etc.
4. Data Transformation: Data transformation is defined as the process of transforming data into the
appropriate form required by the mining procedure.
Data transformation is a two-step process:
• Data Mapping: Assigning elements from the source base to the destination to capture transformations.
• Code Generation: Creation of the actual transformation program.


5. Data Mining: Data mining is defined as the application of clever techniques to extract potentially
useful patterns.
• Transforms task-relevant data into patterns.
• Decides the purpose of the model using classification or characterization.
6. Pattern Evaluation: Pattern evaluation is defined as identifying the truly interesting patterns
representing knowledge, based on given measures.
• Finds the interestingness score of each pattern.
• Uses summarization and visualization to make the data understandable to the user.
7. Knowledge Representation: Knowledge representation is defined as a technique which utilizes
visualization tools to represent data mining results.
• Generate reports.
• Generate tables.
• Generate discriminant rules, classification rules, characterization rules, etc.
