
DATA MINING TECHNIQUES

MRS NEETA GEORPHIN


INTRODUCTION
Parametric techniques
Parametric models describe the relationship between input and output through algebraic equations in which some parameters are left unspecified. These unspecified parameters are determined by fitting the model to input examples. For many real-world problems, such parametric models may not be flexible enough to be useful.
Linear regression is a widely used parametric technique for modeling the relationship between a dependent variable and one or more independent variables.
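
To make the idea concrete, here is a minimal sketch (not from the slides; the data values are invented) that determines the unspecified parameters of a linear model from input examples:

import numpy as np

# Hypothetical example data: x is the independent, y the dependent variable.
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 3.9, 6.2, 8.1, 9.8])

# Design matrix with a column of ones so the intercept is estimated too.
A = np.column_stack([x, np.ones_like(x)])

# Least-squares estimates of the unspecified parameters [slope, intercept].
(slope, intercept), *_ = np.linalg.lstsq(A, y, rcond=None)
print(f"y = {slope:.3f} * x + {intercope:.3f}" if False else f"y = {slope:.3f} * x + {intercept:.3f}")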

Nonparametric techniques
Nonparametric techniques are often more appropriate for data mining applications. A nonparametric model is data-driven: no explicit equations are used to determine the model, so the modeling process adapts to the data at hand. Unlike parametric modeling, where a specific model is assumed ahead of time, nonparametric techniques create a model based on the input. Whereas parametric methods require more knowledge about the data before the modeling process, nonparametric techniques require a large amount of data as input to the modeling process itself, which then creates the model by sifting through that data.
Nonparametric techniques include neural networks, decision trees, and genetic algorithms.
STATISTICAL PERSPECTIVE ON DATA MINING
POINT ESTIMATION
• Point estimation refers to the process of estimating a population parameter by an estimate of that parameter.
• An estimator is a statistical quantity or formula that is used to estimate an unknown population parameter based on sample data.
• The bias of an estimator is the difference between the expected value of the estimator and the actual value:

  Bias = E[\hat{\Theta}] - \Theta

• An unbiased estimator is one whose bias is 0. While point estimators for small data sets may actually be unbiased, for larger database applications we would expect most estimators to be biased.
• One measure of the effectiveness of an estimate is the mean squared error (MSE), which is defined as the expected value of the squared difference between the estimate and the actual value:

  MSE(\hat{\Theta}) = E[(\hat{\Theta} - \Theta)^2]

For example, if the true value for an attribute is 10 and the prediction is 5, the squared error is (5 - 10)^2 = 25. The squaring is performed to ensure that the measure is always positive and to give a higher weight to estimates that are grossly inaccurate.
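
As an illustration (assumed distribution and sample size, not from the slides), a small simulation can estimate the bias and MSE of the sample mean as an estimator of a known population mean:

import numpy as np

rng = np.random.default_rng(0)
true_mean = 10.0
# Each trial: draw 20 values and use their sample mean as the estimator.
estimates = np.array([rng.normal(true_mean, 2.0, size=20).mean()
                      for _ in range(10_000)])

bias = estimates.mean() - true_mean            # E[estimator] - actual value
mse = np.mean((estimates - true_mean) ** 2)    # E[(estimate - actual)^2]
print(f"bias = {bias:.4f}, MSE = {mse:.4f}")   # bias near 0: the sample mean is unbiased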
The root mean square (RMS) may also be used to estimate error or as another statistic to describe a distribution. Calculating the mean alone does not indicate the magnitude of the values; the RMS can be used for this purpose. Given a set of n values X = {x_1, ..., x_n}, the RMS is defined by

  RMS = \sqrt{\frac{1}{n} \sum_{i=1}^{n} x_i^2}

A popular estimating technique is the jackknife estimate. With this approach, the estimate of a parameter is obtained by omitting one value from the set of observed values. Suppose that there is a set of n values X = {x_1, ..., x_n}. The jackknife estimate for the mean obtained by omitting the i-th value would be

  \hat{\mu}_{(i)} = \frac{1}{n-1} \sum_{j \neq i} x_j

Here the subscript (i) indicates that this estimate is obtained by omitting the i-th value.
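
A minimal sketch of both statistics on an invented sample (the data values simply reuse those of the MLE example below):

import numpy as np

x = np.array([1.0, 5.0, 10.0, 4.0])
n = len(x)

# RMS: square root of the mean of the squared values.
rms = np.sqrt(np.mean(x ** 2))

# Jackknife estimates of the mean: each one omits the i-th value.
loo = np.array([(x.sum() - x[i]) / (n - 1) for i in range(n)])
print(f"RMS = {rms:.3f}")
print(f"leave-one-out mean estimates: {loo}")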
Another technique for point estimation is the maximum likelihood estimate (MLE). Likelihood can be defined as a value proportional to the actual probability that, under a specific distribution, the given sample exists. The sample thus gives us an estimate for a parameter of the distribution: the higher the likelihood value, the more likely the underlying distribution is to have produced the observed results.
Given a sample set of values X = {x_1, ..., x_n} from a known distribution function f(x_i | Θ), the MLE can estimate parameters for the population from which the sample is drawn.
The approach obtains parameter estimates that maximize the probability that the sample data occur for the specific model. It looks at the joint probability of observing the sample data by multiplying the individual probabilities. The likelihood function, L, is thus defined as

  L(\Theta \mid x_1, \ldots, x_n) = \prod_{i=1}^{n} f(x_i \mid \Theta)
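
As a hedged illustration, the sketch below evaluates the normal log-likelihood on a grid of candidate means and picks the maximizer; for a normal distribution this recovers the sample mean, which is the analytic MLE. The assumed standard deviation and grid are choices made for the example.

import numpy as np

x = np.array([1.0, 5.0, 10.0, 4.0])   # observed sample (reused in the example below)
sigma = 1.0                            # assumed known standard deviation

def log_likelihood(mu):
    # Log of the product of individual normal densities (constants dropped).
    return -0.5 * np.sum((x - mu) ** 2) / sigma ** 2

grid = np.linspace(0.0, 12.0, 1201)   # candidate values for the mean
mle = grid[np.argmax([log_likelihood(m) for m in grid])]
print(f"grid-search MLE: {mle:.2f}; analytic MLE (sample mean): {x.mean():.2f}")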

We wish to find the mean, μ, for data that follow the normal distribution, where the known data are {1, 5, 10, 4} with two data items missing. Here, n = 6 and k = 4. Suppose that we initially guess μ = 3 and use this value for the two missing values. Using this, we obtain the MLE estimate for the mean as

  \hat{\mu} = \frac{1 + 5 + 10 + 4 + 3 + 3}{6} = \frac{26}{6} \approx 4.33

This new estimate then replaces the guess for the missing values and the process is repeated; this is one iteration of the expectation-maximization approach described next, and here the estimates converge to μ = 5.
The expectation-maximization (EM) algorithm is an approach that solves the estimation problem with incomplete data. The EM algorithm finds an MLE for a parameter (such as a mean) using a two-step process: estimation and maximization.

1. An initial set of estimates for the parameters is obtained; any approach can be used to find these initial estimates.
2. Given these estimates and the training data as input, the algorithm calculates values for the missing data.
3. The data (with the new values added) are then used to determine an estimate for the mean that maximizes the likelihood.
Steps 2 and 3 are applied iteratively until successive parameter estimates converge.
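
A minimal sketch of this loop, applied to the missing-data mean example above (the convergence tolerance and iteration cap are assumptions):

known = [1.0, 5.0, 10.0, 4.0]   # observed values from the example above
n_total = 6                      # sample size including the 2 missing items
mu = 3.0                         # initial guess for the mean

for _ in range(100):
    # Estimation step: fill each missing value with the current mean estimate.
    completed_sum = sum(known) + (n_total - len(known)) * mu
    # Maximization step: the mean of the completed data maximizes the likelihood.
    new_mu = completed_sum / n_total
    if abs(new_mu - mu) < 1e-9:  # successive estimates have converged
        mu = new_mu
        break
    mu = new_mu

print(f"converged estimate of the mean: {mu:.4f}")  # approaches 5.0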
Models Based on Summarization
The basic well-known statistical concepts such as mean, variance, standard deviation, median, and mode are
simple models of the underlying population.
Fitting a population to a specific frequency distribution provides an even better model of the data.
There are also many well-known techniques to display the structure of the data graphically. For example, a
histogram shows the distribution of the data. A box plot is a more sophisticated technique that illustrates several
different features of the population at once.
Minimum: the minimum value in the given dataset.
First quartile (Q1): the median of the lower half of the dataset.
Median: the middle value of the dataset, which divides the given dataset into two equal parts; the median is considered the second quartile.
Third quartile (Q3): the median of the upper half of the data.
Maximum: the maximum value in the given dataset.
Apart from these five terms, the other terms used in the box plot are:
Interquartile range (IQR): the difference between the third quartile and the first quartile, i.e. IQR = Q3 - Q1.
Outliers: data values that fall on the far left or right side of the ordered data. Generally, outliers fall more than a specified distance from the first and third quartiles: greater than Q3 + (1.5 × IQR) or less than Q1 - (1.5 × IQR).
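
A small sketch (with invented data) computing the five-number summary and the IQR outlier fences:

import numpy as np

data = np.array([2, 4, 5, 7, 8, 9, 11, 12, 13, 15, 40])  # 40 is a deliberate outlier

q1, median, q3 = np.percentile(data, [25, 50, 75])
iqr = q3 - q1
low, high = q1 - 1.5 * iqr, q3 + 1.5 * iqr          # outlier fences
outliers = data[(data < low) | (data > high)]

print(f"min={data.min()} Q1={q1} median={median} Q3={q3} max={data.max()}")
print(f"IQR={iqr}, outliers: {outliers}")           # flags the value 40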

Another visual technique to display data is the scatter diagram. This is a graph, on two-dimensional axes, of points representing the relationship between x and y values. By plotting the observed (x, y) points from a sample, a visual image of a derivable functional relationship between the x and y values in the total population may be seen. Figure 3.2 shows a scatter diagram that plots some observed values. Notice that even though the points do not lie exactly on a straight line, they do suggest that a linear function may be a good predictor of the relationship between x and y.
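
A scatter diagram like the one in Figure 3.2 can be produced as follows (the data here are invented, generated as roughly linear with noise):

import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(1)
x = np.linspace(0, 10, 30)
y = 2.0 * x + 1.0 + rng.normal(0.0, 2.0, size=x.size)  # roughly linear, with noise

plt.scatter(x, y)                       # one point per observed (x, y) pair
plt.xlabel("x")
plt.ylabel("y")
plt.title("Scatter diagram of observed values")
plt.show()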

Bayes Theorem
Given a set of data X = {x_1, ..., x_n}, a data mining problem is to uncover properties of the distribution from which the set comes. Bayes rule is a technique to estimate the likelihood of a property given the set of data as evidence or input. Suppose that either hypothesis h_1 or hypothesis h_2 must occur, but not both, and that x_i is an observable event. Then

  P(h_1 \mid x_i) = \frac{P(x_i \mid h_1) \, P(h_1)}{P(x_i)}

Here P(h_1 | x_i) is called the posterior probability, while P(h_1) is the prior probability associated with hypothesis h_1. P(x_i) is the probability of the occurrence of data value x_i, and P(x_i | h_1) is the conditional probability that, given a hypothesis, the tuple satisfies it.
Where there are m different hypotheses we have

  P(x_i) = \sum_{j=1}^{m} P(x_i \mid h_j) \, P(h_j)

Thus, we have

  P(h_1 \mid x_i) = \frac{P(x_i \mid h_1) \, P(h_1)}{\sum_{j=1}^{m} P(x_i \mid h_j) \, P(h_j)}
Suppose that a credit loan authorization problem can be associated with four hypotheses: H = {h_1, h_2, h_3, h_4}, where h_1 = authorize purchase, h_2 = authorize after further identification, h_3 = do not authorize, and h_4 = do not authorize but contact police.
The training data for this example are shown in Table 3.1. From the training data, we find that P(h_1) = 60%, P(h_2) = 20%, P(h_3) = 10%, and P(h_4) = 10%.
To make our predictions, a domain expert has determined that the attributes we should be looking at are income and credit category.
Assume that income, I, has been categorized by the ranges [0, $10,000), [$10,000, $50,000), [$50,000, $100,000), and [$100,000, ∞). These ranges are encoded and shown in Table 3.1 as 1, 2, 3, and 4, respectively.
Suppose that credit is categorized as excellent, good, or bad. By combining these, we then have 12 values in the data space: D = {x_1, ..., x_12}. The relationship between these attribute combinations and the x_i values is shown in Table 3.1.
Using these values, the last column in Table 3.1 shows the x_i group into which each tuple falls. Given these, we can then calculate P(x_i | h_j) and P(x_i). We illustrate how this is done with h_1: there are six tuples from the training set that are in h_1, and we use the distribution of these across the x_i to obtain the conditional probabilities P(x_i | h_1).
Suppose we wanted to predict the class for x_4. We would need to find P(h_j | x_4) for each h_j and then classify x_4 to the class with the largest posterior value. We would thus classify x_4 to the h_1 class.
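
A sketch of this computation (the conditional probabilities below are hypothetical stand-ins, since Table 3.1 is not reproduced here; only the priors are taken from the text):

# Priors from the training data, as stated above.
priors = {"h1": 0.60, "h2": 0.20, "h3": 0.10, "h4": 0.10}

# Hypothetical conditional probabilities P(x4 | hj); Table 3.1 is not
# reproduced here, so these values are assumptions for illustration only.
likelihoods = {"h1": 2 / 6, "h2": 1 / 2, "h3": 0.0, "h4": 0.0}

# P(x4) = sum over j of P(x4 | hj) * P(hj)
evidence = sum(likelihoods[h] * priors[h] for h in priors)

# Posterior P(hj | x4) by Bayes rule; classify to the largest posterior.
posteriors = {h: likelihoods[h] * priors[h] / evidence for h in priors}
print(posteriors)
print("classify x4 as:", max(posteriors, key=posteriors.get))  # h1 for these numbers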

Limitations
1. Table 3.1 has no entries for x_1, x_3, x_5, x_6, or x_12. This makes it impossible to use this training sample to determine how to make predictions for those combinations of input data.
2. A sample of this size is too small to yield reliable probability estimates.
DECISION TREES

A decision tree is a predictive modeling technique used in classification, clustering, and prediction tasks.
Decision trees use a "divide and conquer" technique to split the problem search space into subsets.
Leaf nodes represent a successful guess as to the object being predicted.
Each question successively divides the search space much as a binary search does.
As with a binary search, questions should be posed so that the remaining space is divided into two equal parts.

DEFINITION: A decision tree (DT) is a tree where the root and each internal node are labeled
with a question. The arcs emanating from each node represent each possible answer to the
associated question. Each leaf node represents a prediction of a solution to the problem under
consideration.
A decision tree (DT) model is a computational model consisting of three parts:
1. A decision tree
2. An algorithm to create the tree.
3. An algorithm that applies the tree to data and solves the problem under consideration.

The building of the tree may be accomplished via an algorithm that examines data from a
training sample or could be created by a domain expert.

The complexity of the algorithm is straightforward to analyze. For each tuple in the database,
we search the tree from the root down to a particular leaf.
At each level, the maximum number of comparisons to make depends on the branching factor
at that level. So, the complexity depends on the product of the number of levels and the
maximum branching factor.

ALGORITHM
Input:
  T   // Decision tree
  D   // Input database
Output:
  M   // Model prediction
DTProc algorithm:
// Simplistic algorithm to illustrate the prediction technique using a DT
for each t ∈ D do
  n = root node of T;
  while n not a leaf node do
    Obtain answer to question on n applied to t;
    Identify arc from n whose label contains the correct answer;
    n = node at end of this arc;
  Make prediction for t based on labeling of n;
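A minimal runnable sketch of this traversal in Python, anticipating the student-height example that follows (the dictionary tree representation and the split thresholds are assumptions, not taken from the slides):

# Each internal node holds a question (a function of the tuple); its answer
# selects the outgoing arc. Each leaf node holds a prediction.
def dt_proc(tree, t):
    n = tree
    while "prediction" not in n:          # while n is not a leaf node
        answer = n["question"](t)         # answer to the question on n, applied to t
        n = n["arcs"][answer]             # arc whose label contains the answer
    return n["prediction"]

def band(height, low, high):
    return "short" if height < low else ("medium" if height <= high else "tall")

leaves = {b: {"prediction": b} for b in ("short", "medium", "tall")}
tree = {
    "question": lambda t: t["gender"],    # root splits on gender
    "arcs": {
        "F": {"question": lambda t: band(t["height"], 1.3, 1.8), "arcs": leaves},
        "M": {"question": lambda t: band(t["height"], 1.5, 2.0), "arcs": leaves},
    },
}

print(dt_proc(tree, {"gender": "F", "height": 1.95}))  # -> tall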
Suppose that students in a particular university are to be classified as short, tall, or medium based on their height. Assume the database schema is {name, address, gender, height, age, year, major}. To construct a decision tree, we must identify the attributes that are important to the classification problem at hand. Suppose that height, age, and gender are chosen. Certainly, a female who is 1.95 m in height is considered tall, while a male of the same height may not be. Also, a child 10 years of age may be tall if he or she is only 1.5 m. Since this is a set of university students, we would expect most of them to be over 17 years of age. We thus decide to filter out those under this age and perform their classification separately. We may consider these students to be outliers, because their ages (and, more importantly, their height classifications) are not typical of most university students. Thus, for classification we have only gender and height. Using these two attributes, a decision-tree-building algorithm will construct a tree using a sample of the database with known classification values. This training sample forms the basis of how the tree is constructed. One possible resulting tree after training is shown in Figure 3.5.
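
As a sketch of how such a tree could be learned from a training sample (the data are invented and scikit-learn's DecisionTreeClassifier stands in for the tree-building algorithm, which the slides do not specify):

from sklearn.tree import DecisionTreeClassifier, export_text

# Invented training sample: gender (0 = female, 1 = male) and height in meters.
X = [[0, 1.2], [0, 1.5], [0, 1.9], [0, 2.0],
     [1, 1.4], [1, 1.7], [1, 1.95], [1, 2.1]]
y = ["short", "medium", "tall", "tall",
     "short", "medium", "medium", "tall"]

clf = DecisionTreeClassifier(random_state=0).fit(X, y)
print(export_text(clf, feature_names=["gender", "height"]))  # the learned tree
print(clf.predict([[0, 1.95]]))   # a 1.95 m female should come out tall here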
NEURAL NETWORKS

Neural networks (NN), often referred to as artificial neural networks (ANN) to distinguish them from biological neural networks, are modeled after the workings of the human brain.
The NN is actually an information processing system that consists of a graph representing the processing system as well as various algorithms that access that graph.
As with the human brain, the NN consists of many connected processing elements. The NN, then, is structured as a directed graph with many nodes (processing elements) and arcs (interconnections) between them.
The nodes in the graph are like individual neurons, while the arcs are their interconnections.
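A minimal sketch of this graph view (the weights and activation function are assumptions chosen for illustration): a tiny feed-forward network in which each node sums the values arriving on its weighted arcs and applies an activation.

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

x = np.array([0.5, -1.0])                   # input nodes
W1 = np.array([[0.8, -0.2],                 # arcs from inputs to 3 hidden nodes
               [0.4,  0.9],
               [-0.5, 0.3]])
W2 = np.array([[1.0, -1.5, 0.7]])           # arcs from hidden nodes to 1 output node

hidden = sigmoid(W1 @ x)   # each hidden node sums its weighted inputs, then activates
output = sigmoid(W2 @ hidden)
print(output)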