Operational Information vs. Strategic Information
Operational Information
• Operational computer systems provide information to run day-to-day operations and answer daily questions.
• Also called online transaction processing (OLTP) systems
• Data is read or manipulated with each transaction
• Transactions/queries are simple and easy to write
• Usually for middle management
Examples
• Sales systems
• Hotel reservation systems
• Railway reservation systems
Operational System
• OLTP systems are used to run the day-to-day core business of the company.
• Optimized to handle large numbers of simple read/write transactions
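A minimal sketch of what a "simple read/write transaction" can look like, using Python's built-in sqlite3 module. The reservation table, its columns, and the guest data are hypothetical and only illustrate the OLTP access pattern: one short write, then one lookup by primary key.

```python
import sqlite3

# In-memory database standing in for an operational (OLTP) store.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE reservation ("
    "id INTEGER PRIMARY KEY, guest TEXT, room INTEGER, status TEXT)"
)

def book_room(conn, guest, room):
    """One short write transaction: insert a single reservation row."""
    with conn:  # commits on success, rolls back if an exception is raised
        cur = conn.execute(
            "INSERT INTO reservation (guest, room, status) VALUES (?, ?, 'booked')",
            (guest, room),
        )
    return cur.lastrowid

def reservation_status(conn, reservation_id):
    """One short read: look up a single row by its primary key."""
    row = conn.execute(
        "SELECT status FROM reservation WHERE id = ?", (reservation_id,)
    ).fetchone()
    return row[0] if row else None

rid = book_room(conn, "A. Mishra", 204)
status = reservation_status(conn, rid)
```

Each call touches a single row, which is why such systems can serve thousands of concurrent users.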
Strategic Information
• Data sets are piling up everywhere, but they are not directly useful for decision support.
• Decision-making requires answers to complex questions drawn from integrated data.
• Decision makers want to know which product lines and which markets to strengthen.
They need Strategic Information
• Strategic information is required by executives and managers to formulate business strategies, establish goals, set objectives, and monitor results.
Example:
• Increase market share by 10% in the next 3 years.
• Bring 3 new products to market in 2 years.
A New Type of System Environment is Required
Desired features:
• Data is designed for analytical tasks
• Data comes from multiple applications
• Read-intensive data usage
• Direct interaction with the system by the users, without IT assistance
• Content is stable and updated periodically
• Content includes current and historical data
• Ability for users to run queries and get results online
• Ability for users to initiate reports
Data warehouse
• A decision support database that is maintained separately from the organization’s operational database
• Supports information processing by providing a solid platform of consolidated, historical data for analysis
• Also allows many reports to be created using mining tools
Data Warehouse
• A data warehouse is a subject-oriented, integrated, time-variant, and non-volatile collection of data in support of management’s decision-making process.
Data warehousing
• The process of constructing and using data warehouses: extracting and transforming operational data into informational data and loading it into a central data store (the warehouse).
Characteristics of a Data Warehouse
• Subject oriented
• Integrated
• Time variant
• Non volatile
1. Subject oriented data
• Organized around major subjects, such as customer, product, and sales.
• Focuses on the modeling and analysis of data for decision makers, not on daily operations or transaction processing.
• Provides a simple and concise view of particular subject issues by excluding data that are not useful in the decision support process.
• In operational systems, data is stored by individual application or business process, for example data about an individual order or customer.
• For example, in the banking industry the data sets for savings or checking accounts contain data about that particular application only.
• In the DW, data is stored by real-world business subjects or events, not by application.
• In the DW, the subject is the organizing principle
• Subjects vary from enterprise to enterprise
2. Integrated data
• Data in the DW comes from several operational systems.
• Different data sets have different file formats.
Example: Data for the subject Account comes from 3 different data sources.
• Before moving the data into the data warehouse, you have to go through a process of transformation, consolidation, and integration of the source data.
• Some of the items that need standardization:
  • Naming conventions
  • Codes
  • Data attributes
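The standardization items above can be sketched as follows. The three source layouts, field names, type codes, and balances are all made up for illustration; the point is only that each source gets its own mapping onto one agreed set of names, codes, and units.

```python
# Hypothetical records for the subject "Account" from three source systems,
# each with its own field names, type codes, and balance units.
savings = [{"acct_no": "S-001", "cust": "Rao", "type": "SAV", "bal_paise": 1250000}]
checking = [{"AccountNumber": "C-014", "CustomerName": "Rao", "Kind": "chk", "Balance": 830.50}]
loans = [{"id": "L-207", "customer": "Iyer", "acct_type": 3, "balance_rupees": "15000"}]

# One agreed code set for the warehouse (standardized codes).
TYPE_CODES = {"SAV": "SAVINGS", "chk": "CHECKING", 3: "LOAN"}

def from_savings(r):
    # Standardize names, codes, and units (paise to rupees).
    return {"account_id": r["acct_no"], "customer_name": r["cust"],
            "account_type": TYPE_CODES[r["type"]], "balance": r["bal_paise"] / 100}

def from_checking(r):
    return {"account_id": r["AccountNumber"], "customer_name": r["CustomerName"],
            "account_type": TYPE_CODES[r["Kind"]], "balance": float(r["Balance"])}

def from_loans(r):
    # Standardize the data attribute: the balance arrives as text.
    return {"account_id": r["id"], "customer_name": r["customer"],
            "account_type": TYPE_CODES[r["acct_type"]], "balance": float(r["balance_rupees"])}

integrated = (
    [from_savings(r) for r in savings]
    + [from_checking(r) for r in checking]
    + [from_loans(r) for r in loans]
)
```

After integration every row has the same four fields, whichever system it came from.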
3. Time variant data
• In operational systems, the stored data contains current values. For example, in a savings account system the balance is the customer's current balance.
• Data in the DW is meant for analysis and decision making.
• Comparative analysis is one of the best techniques for business performance evaluation.
• Time is a critical factor for comparative analysis.
• Every data structure in the DW contains a time element.
• So the DW has to contain historical data as well as current values.
• Data is stored as snapshots over past and current periods.
The time-variant nature of the data in a data warehouse:
• Allows for analysis of the past
• Relates information to the present
• Enables forecasts for the future
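A small sketch of snapshot storage, assuming hypothetical month-end balances for one savings account. Because every row carries a time element, the data can answer "what was the balance on a given day?" and "how did it change between periods?", which a current-value-only operational record cannot.

```python
from datetime import date

# Hypothetical month-end balance snapshots for one savings account.
snapshots = [
    (date(2023, 1, 31), 10000.0),
    (date(2023, 2, 28), 12000.0),
    (date(2023, 3, 31), 11500.0),
]

def balance_as_of(snapshots, day):
    """Return the most recent snapshot balance on or before `day`, or None."""
    earlier = [(d, b) for d, b in snapshots if d <= day]
    return max(earlier)[1] if earlier else None

def period_changes(snapshots):
    """Change in balance between consecutive snapshots (comparative analysis)."""
    ordered = sorted(snapshots)
    return [
        (later[0], later[1] - earlier[1])
        for earlier, later in zip(ordered, ordered[1:])
    ]

mid_march = balance_as_of(snapshots, date(2023, 3, 15))
changes = period_changes(snapshots)
```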
4. Non volatile data
• Data from operational systems is moved into the DW at specific intervals
• Not every business transaction updates the DW
• Data in the DW is not deleted
• Data in the DW is not changed by individual transactions
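One way to picture non-volatility is an append-only table that accepts periodic batches and exposes no update or delete operation. The class below is a hypothetical illustration, not a real warehouse API.

```python
class WarehouseTable:
    """Append-only table: rows arrive in periodic batches and are never
    updated or deleted afterwards."""

    def __init__(self):
        self._rows = []

    def load_batch(self, load_date, rows):
        # Each batch is stamped with its load date and appended.
        self._rows.extend({**row, "load_date": load_date} for row in rows)

    def rows(self):
        # Hand out copies so callers cannot change the stored history.
        return [dict(r) for r in self._rows]

table = WarehouseTable()
table.load_batch("2024-01-31", [{"account": "S-001", "balance": 12500.0}])
table.load_batch("2024-02-29", [{"account": "S-001", "balance": 11800.0}])
history = table.rows()
```

The February load does not overwrite the January row; both snapshots remain available for analysis.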
OLAP and OLTP
OLTP vs. OLAP
Feature | OLTP | OLAP
Users | Clerk, IT professional | Knowledge worker
Function | Day-to-day operations | Long-term informational requirements, decision support
DB design | Application oriented | Subject oriented
Data | Current, up-to-date, detailed, flat relational | Historical, summarized, multidimensional, integrated, consolidated
Unit of work | Short, simple transaction | Complex query
Access | Read/write, index/hash on primary key | Lots of scans
No. of records accessed | Tens | Millions
No. of users | Thousands | Hundreds
DB size | 100 MB to GB | 100 GB to TB
Metric | Transaction throughput | Query throughput, response time
ASET 16/55
Example of OLTP queries
Query 1: What is the salary of Mr. Mishra?
Query 2: What are the address and phone number of the person in charge of the supplies department?
Query 3: How many employees received an excellent rating in the latest appraisal?
Example of OLAP queries
Query 1: How is employee attrition changing over the years across the company?
Query 2: Is there a correlation between the geographical location of a company unit and excellent employee appraisals?
Query 3: Is it financially viable to continue our manufacturing unit in Noida?
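The difference between the two query styles can be sketched with sqlite3. The employee table and its rows are invented for illustration: the OLTP-style query fetches one record, while the OLAP-style query scans the table and aggregates over time.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE employee (name TEXT, unit_city TEXT, salary REAL, left_year INTEGER)"
)
conn.executemany(
    "INSERT INTO employee VALUES (?, ?, ?, ?)",
    [
        ("Mishra", "Noida", 52000, None),  # still employed
        ("Khan", "Noida", 48000, 2021),
        ("Das", "Pune", 61000, 2022),
        ("Roy", "Pune", 45000, 2022),
    ],
)

# OLTP style: touch one record, return one value.
salary = conn.execute(
    "SELECT salary FROM employee WHERE name = ?", ("Mishra",)
).fetchone()[0]

# OLAP style: scan many records and summarize them by year.
attrition_by_year = conn.execute(
    "SELECT left_year, COUNT(*) FROM employee "
    "WHERE left_year IS NOT NULL "
    "GROUP BY left_year ORDER BY left_year"
).fetchall()
```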
OLAP
• Online Analytical Processing (OLAP) is computer processing that enables a user to easily and selectively extract and view data from different points of view.
• OLAP allows users to analyse information from multiple database systems at once.
• OLAP data is stored in multidimensional databases.
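A rough sketch of "viewing data from different points of view" on a tiny, made-up sales cube with three dimensions (year, region, product). The function names roll_up and slice_ are illustrative choices, not part of any OLAP product.

```python
from collections import defaultdict

# Hypothetical fact rows: (year, region, product, sales amount).
facts = [
    (2022, "North", "Laptop", 120),
    (2022, "South", "Laptop", 80),
    (2022, "North", "Phone", 200),
    (2023, "North", "Laptop", 150),
    (2023, "South", "Phone", 90),
]
DIMENSIONS = ("year", "region", "product")

def roll_up(facts, by):
    """Sum the sales measure over every dimension not listed in `by`."""
    positions = [DIMENSIONS.index(d) for d in by]
    totals = defaultdict(int)
    for row in facts:
        key = tuple(row[i] for i in positions)
        totals[key] += row[-1]
    return dict(totals)

def slice_(facts, dimension, value):
    """Keep only the facts where one dimension has a fixed value."""
    i = DIMENSIONS.index(dimension)
    return [row for row in facts if row[i] == value]

sales_by_year = roll_up(facts, ["year"])
north_by_year = roll_up(slice_(facts, "region", "North"), ["year"])
```

The same facts answer different questions depending on which dimensions are kept and which are summed away.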
Data Warehouse: Architecture
Data Source
Production data
• Comes from the various operational systems of the enterprise.
Internal data
• Private documents, customer profiles, departmental databases, etc.
External data
• Statistical data produced by external agencies, used for comparing performance against other organizations.
Archived data
• In every operational system, old data is periodically stored in archive files or on disk storage. This data is also required, because the data warehouse keeps historical snapshots of data.
Data staging component
• After data is extracted, it has to be prepared
• Data extracted from the sources needs to be changed, converted, and made ready in a suitable format
• Three major functions make the data ready (ETL):
  • Extract
  • Transform
  • Load
• The staging area provides a place and a set of functions to:
  • Clean
  • Change
  • Combine
  • Convert
ETL (Extraction, Transformation, and Loading process)
Data extraction
• Get data from multiple, heterogeneous, and external sources
Data cleaning
• Detect errors in the data and rectify them when possible
Data transformation
• Convert data from legacy or host format to warehouse format
Load
• Sort, summarize, consolidate, compute views, and check integrity
Refresh
• Propagate the updates from the data sources to the warehouse
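The steps above can be sketched as a chain of small functions. The two sources, the field names, and the cleaning rules are assumptions made for this example; a real ETL job would apply whatever rules its own sources require.

```python
def extract(sources):
    """Extraction: gather raw rows from every source into one list."""
    return [row for source in sources for row in source]

def clean(rows):
    """Cleaning: drop rows with a missing amount, trim whitespace in names."""
    return [
        {**r, "customer": r["customer"].strip()}
        for r in rows
        if r.get("amount") not in (None, "")
    ]

def transform(rows):
    """Transformation: upper-case region codes, make amounts numeric."""
    return [
        {"customer": r["customer"], "region": r["region"].upper(),
         "amount": float(r["amount"])}
        for r in rows
    ]

def load(warehouse, rows):
    """Load: sort, append to the warehouse, and summarize totals per region."""
    warehouse.extend(sorted(rows, key=lambda r: (r["region"], r["customer"])))
    summary = {}
    for r in warehouse:
        summary[r["region"]] = summary.get(r["region"], 0.0) + r["amount"]
    return summary

# Two hypothetical source systems with inconsistent formats.
billing = [{"customer": " Rao ", "region": "north", "amount": "100.5"}]
web_shop = [
    {"customer": "Iyer", "region": "South", "amount": 40},
    {"customer": "Sen", "region": "north", "amount": None},  # rejected by clean()
]

warehouse = []
summary = load(warehouse, transform(clean(extract([billing, web_shop]))))
```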
Data Loading: Data Movement to the Data Warehouse
Data Marts
• A data mart is a subset of a data warehouse and is oriented to a specific purpose.
• A data warehouse can be seen as a collection of data marts.
• Data marts can also be seen as small warehouses for OLAP activities within a given segment.
• E.g., in the Indian railway system, one segment is express train reservation. This segment can be treated as a data mart.
Data Mart
• The data mart is a subset of the data warehouse and is usually oriented to a specific business line or team.
• Whereas a data warehouse has enterprise-wide scope, the information in a data mart pertains to a single department.
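In the simplest reading, a data mart is just the slice of warehouse rows that one department cares about. The rows and department names below are hypothetical.

```python
# Hypothetical enterprise-wide warehouse rows.
warehouse_rows = [
    {"department": "sales", "year": 2023, "amount": 500},
    {"department": "hr", "year": 2023, "amount": 120},
    {"department": "sales", "year": 2022, "amount": 430},
]

def build_data_mart(rows, department):
    """Return copies of the warehouse rows that belong to one department."""
    return [dict(r) for r in rows if r["department"] == department]

sales_mart = build_data_mart(warehouse_rows, "sales")
```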
How Is a Data Mart Different from a Data Warehouse?
Metadata component
• Metadata is the data about the data in the data warehouse.
• Metadata in a data warehouse contains the answers to
questions about the data in the data warehouse.
• Serves as a directory of the contents of the data warehouse
Metadata can hold all kinds of information about DW data like:
• Source for any extracted data.
• Use of that DW data.
• Any kind of data and its values.
• Features of data.
• Transformation logic for extracted data.
• DW tables and their attributes.
• Timestamps
The metadata repository contains
• Metadata for the data names and definitions of the given warehouse, plus metadata for timestamping extracted data, the source of extracted data, and missing fields that have been added by data cleaning or integration processes
• Description of the data warehouse structure
• Operational metadata
• Details for summarization
• Mapping from the operational environment to the data warehouse
• Data related to system performance
• Business metadata
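A minimal sketch of a metadata entry and a repository that acts as a directory of warehouse contents. The class names, fields, and the example column are assumptions chosen to mirror the items listed above (source, transformation logic, timestamp, business definition).

```python
from dataclasses import dataclass, field
from datetime import datetime

@dataclass
class ColumnMetadata:
    """Hypothetical metadata entry describing one warehouse column."""
    table: str
    column: str
    source_system: str          # source of the extracted data
    transformation: str         # transformation logic applied
    extracted_at: datetime      # timestamp of the extraction
    business_definition: str = ""

@dataclass
class MetadataRepository:
    entries: dict = field(default_factory=dict)

    def register(self, entry):
        self.entries[(entry.table, entry.column)] = entry

    def lineage(self, table, column):
        """Answer: where did this column come from and how was it changed?"""
        e = self.entries[(table, column)]
        return f"{e.source_system} -> {e.transformation} -> {table}.{column}"

repo = MetadataRepository()
repo.register(ColumnMetadata(
    table="fact_sales", column="amount", source_system="billing_db",
    transformation="paise to rupees", extracted_at=datetime(2024, 1, 31, 2, 0),
    business_definition="Invoice amount in rupees",
))
trace = repo.lineage("fact_sales", "amount")
```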
INTRODUCTION: DM AND KDD PROCESS
Cluster Analysis
How many clusters do you expect?
• Possibility 1
• Possibility 2
Outlier Detection
Classification
• A data mining technique used to predict group membership for data.
Two methods
• Crisp classification: given an input, the classifier returns its label.
• Probabilistic classification: given an input, the classifier returns the probability that it belongs to each class. This is useful when some mistakes are more costly than others.
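The two methods can be contrasted with a toy logistic model. The weight and bias are made-up numbers, not fitted to any data; the sketch only shows that a probability lets the decision threshold be moved when one kind of mistake is more costly.

```python
import math

def predict_proba(x, weight=1.5, bias=-3.0):
    """Probabilistic classifier: P(class = 1 | x) from a logistic model.
    The weight and bias are illustrative values, not learned ones."""
    return 1.0 / (1.0 + math.exp(-(weight * x + bias)))

def predict_label(x, threshold=0.5):
    """Crisp classifier: return only the label (0 or 1)."""
    return 1 if predict_proba(x) >= threshold else 0

# With the default threshold, x = 1.0 is labelled 0.
default_label = predict_label(1.0)

# If missing a positive case is costly, lower the threshold so that
# borderline inputs are flagged as class 1.
cautious_label = predict_label(1.0, threshold=0.1)
```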
Regression/Forecasting
• Regression maps the statistical correlation in the data without any prior assumption about the functional form of the data distribution.
• Curve fitting finds a well-defined, known function underlying the data; theory and expertise help in choosing it.
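As an illustration of curve fitting, the sketch below assumes the known function is a straight line y = slope * x + intercept and estimates its two parameters by ordinary least squares. The data points are made up.

```python
def fit_line(xs, ys):
    """Least-squares fit of y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept

def forecast(x, slope, intercept):
    """Use the fitted line to predict y for a new x."""
    return slope * x + intercept

# Hypothetical observations that lie on the line y = 2x + 1.
xs = [1, 2, 3, 4]
ys = [3, 5, 7, 9]
slope, intercept = fit_line(xs, ys)
next_value = forecast(5, slope, intercept)
```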
SUPERVISED AND UNSUPERVISED LEARNING
Machine Learning
• Machine learning gives a computer the ability to learn without being explicitly programmed.
Types of Learning:
Supervised learning
• Training data includes both the inputs and the desired results
• Correct target results are known for some cases, which are provided during training
• Constructing proper training, validation, and test sets is important
• Fast and accurate
• Should be able to generalize
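One common way to construct the three sets is to shuffle the labelled examples once and cut the list into parts. The 60/20/20 proportions and the fixed seed below are assumptions for the sketch, not a recommendation.

```python
import random

def split_dataset(examples, train=0.6, validation=0.2, seed=0):
    """Shuffle labelled examples and split them into
    training, validation, and test lists (the rest goes to test)."""
    shuffled = list(examples)
    random.Random(seed).shuffle(shuffled)  # fixed seed keeps the split repeatable
    n_train = round(len(shuffled) * train)
    n_val = round(len(shuffled) * validation)
    return (
        shuffled[:n_train],
        shuffled[n_train:n_train + n_val],
        shuffled[n_train + n_val:],
    )

# Hypothetical labelled data: (input, desired result).
examples = [(x, x % 2) for x in range(10)]
train_set, validation_set, test_set = split_dataset(examples)
```

Keeping the test set untouched during training is what makes it a fair check of whether the model generalizes.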
Unsupervised learning
• Is not provided with correct results during training
• Can be used to cluster input data into classes on the basis of their statistical properties only
• Cluster significance and labeling
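A minimal k-means sketch shows clustering without any labels. The 2-D points and the starting centroids are invented; the algorithm only uses distances between points, and deciding what each resulting cluster means is left to the analyst.

```python
def k_means(points, centroids, iterations=10):
    """Plain k-means on 2-D points, starting from the given centroids."""
    clusters = [[] for _ in centroids]
    for _ in range(iterations):
        # Assignment step: each point joins its nearest centroid.
        clusters = [[] for _ in centroids]
        for p in points:
            distances = [(p[0] - c[0]) ** 2 + (p[1] - c[1]) ** 2 for c in centroids]
            clusters[distances.index(min(distances))].append(p)
        # Update step: move each centroid to the mean of its cluster
        # (an empty cluster keeps its old centroid).
        centroids = [
            (sum(q[0] for q in cl) / len(cl), sum(q[1] for q in cl) / len(cl)) if cl else c
            for cl, c in zip(clusters, centroids)
        ]
    return centroids, clusters

# Hypothetical unlabelled data with two visible groups.
points = [(1, 1), (1, 2), (2, 1), (8, 8), (9, 8), (8, 9)]
centroids, clusters = k_means(points, centroids=[(0, 0), (10, 10)])
```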
Supervised Machine Learning Methods
Classification: Classification refers to taking an input value and mapping it to a discrete value. In classification problems, the output typically consists of classes or categories, for example predicting which objects are present in an image (a cat or a dog) or whether it is going to rain today.
Regression: Regression deals with continuous data (value functions). The predicted output values are real numbers. It covers problems such as predicting the price of a house or the trend in a stock price at a given time.
Some of the most common algorithms in supervised learning include Support Vector Machines (SVM), Logistic Regression, Naive Bayes, Neural Networks, K-Nearest Neighbors (KNN), and Random Forest.
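As one concrete instance of the algorithms listed, here is a small k-nearest-neighbor classifier written from scratch. The feature values and the cat/dog labels are made up; the sketch maps an input to a discrete class by majority vote among its closest training points.

```python
from collections import Counter

def knn_predict(training, query, k=3):
    """Classify `query` by majority vote among its k nearest labelled points.
    `training` is a list of (features, label) pairs."""
    by_distance = sorted(
        training,
        key=lambda item: sum((a - b) ** 2 for a, b in zip(item[0], query)),
    )
    votes = Counter(label for _, label in by_distance[:k])
    return votes.most_common(1)[0][0]

# Hypothetical labelled training data: two numeric features per example.
training = [
    ((1.0, 1.0), "cat"), ((1.2, 0.8), "cat"), ((0.9, 1.1), "cat"),
    ((5.0, 5.0), "dog"), ((5.2, 4.9), "dog"), ((4.8, 5.1), "dog"),
]
first = knn_predict(training, (1.1, 1.0))
second = knn_predict(training, (5.0, 4.8))
```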
Supervised Machine Learning Applications
• Predictive analytics (house prices, stock exchange prices, etc.)
• Text recognition
• Spam detection
• Customer sentiment analysis
• Object detection (e.g. face detection)
Unsupervised Learning Methods
Clustering
Association
Dimensionality reduction
Example: Data Mining on Web framework
Web mining includes:
• Data cleaning
• Data integration from multiple sources
• Warehousing the data
• Data cube construction
• Data selection for data mining
• Data mining
• Presentation of mining results
• Patterns/knowledge used or stored in the knowledge base