Data Mining and Other Analogous Disciplines

The document discusses the relationships between data mining and disciplines such as statistics and artificial intelligence. It explains that data mining uses techniques from these disciplines such as regression, clustering analysis, and neural networks. It then introduces an information theory-based approach to measure the amount of information in the data and determine what part can be used to solve business problems. This approach allows for the selection of optimal variables and assesses how well the models capture the information.

5 Data mining and other analogous disciplines

There is some controversy over where the boundaries lie between data mining and analogous
disciplines such as statistics and artificial intelligence. Some argue that data mining is
nothing but statistics wrapped in business jargon that turns it into a sellable product.
Others find in it a set of specific problems and methods that make it distinct from other
disciplines.

The fact is that, in practice, almost all the models and algorithms used in data mining
(neural networks, regression and classification trees, logistic models, principal component
analysis, etc.) enjoy a relatively long tradition in other fields.

5.1 From Statistics

Certainly, data mining draws from statistics, from which it takes the following techniques:

Analysis of variance: evaluates whether there are significant differences between the means
of one or more continuous variables in different populations.
Regression: defines the relationship between one or more variables and a set of variables
that predict them.
Chi-square test: tests the hypothesis of dependence between variables (formally, by testing
the null hypothesis that they are independent).
Clustering analysis: classifies a population of individuals characterized by multiple
attributes (binary, qualitative or quantitative) into a given number of groups, based on the
similarities or differences between individuals.
Discriminant analysis: classifies individuals into previously established groups. It finds
the classification rule for the elements of these groups and therefore better identifies
which variables define membership in each group.
Time series: studies the evolution of a variable over time in order to make predictions based
on that knowledge, under the assumption that no structural changes will occur.
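
As an illustration of one of the techniques listed above, the following is a minimal sketch of the Pearson chi-square statistic for a contingency table. The function name `chi_square_statistic` and the counts are hypothetical, chosen only for the example, and the code assumes every row and column total is positive.

```python
def chi_square_statistic(table):
    """Pearson chi-square statistic for a contingency table given as a list of rows.

    Assumes every row total and column total is positive.
    """
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    total = sum(row_totals)
    statistic = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            # Count expected in this cell if the two variables were independent.
            expected = row_totals[i] * col_totals[j] / total
            statistic += (observed - expected) ** 2 / expected
    return statistic


# Hypothetical counts: rows are two customer groups, columns are (stayed, left).
observed = [[30, 10], [20, 40]]
stat = chi_square_statistic(observed)
dof = (len(observed) - 1) * (len(observed[0]) - 1)
print(f"chi-square = {stat:.2f}, degrees of freedom = {dof}")
```

A statistic well above the critical value for the given degrees of freedom (3.841 at the 5% level for one degree of freedom) is evidence against independence.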

5.2 From Computing

From computer science, it takes the following techniques:


Genetic algorithms: numerical optimization methods in which the variable or variables to be
optimized, together with the study variables, constitute a segment of information. The
configurations of the analysis variables that obtain the best values for the response
variable correspond to segments with higher reproductive capacity. Through reproduction, the
best segments endure and their proportion grows from generation to generation. Random
elements can also be introduced to modify the variables (mutations). After a certain number
of iterations, the population will consist of good solutions to the optimization problem,
since the bad solutions have been discarded iteration after iteration.
Artificial intelligence: the available data are analyzed by a computer system that simulates
an intelligent system. Artificial intelligence systems include expert systems and neural
networks.
Expert systems: systems built from practical rules extracted from the knowledge of experts,
based mainly on inference or cause-effect relationships.
Intelligent systems: similar to expert systems, but better able to handle new situations
unknown to the expert.
Neural networks: generically, parallel numerical processing methods in which the variables
interact through linear or nonlinear transformations until some outputs are obtained. These
outputs are compared with the ones that should have been produced, based on test data,
giving rise to a feedback process through which the network is reconfigured until a suitable
model is obtained.
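
The genetic algorithm described in this list can be sketched as follows. This is a toy illustration under stated assumptions, not a reference implementation: each "segment" is a list of bits, the response variable is simply the number of 1s, and the specific choices (tournament selection, single-point crossover, keeping the best segment unchanged) are assumptions made for the example. The names `fitness` and `evolve` are hypothetical.

```python
import random


def fitness(segment):
    # Toy response variable (an assumption): the count of 1s in the segment.
    return sum(segment)


def evolve(pop_size=30, length=20, generations=40, mutation_rate=0.02, seed=0):
    rng = random.Random(seed)  # fixed seed so the run is repeatable
    population = [[rng.randint(0, 1) for _ in range(length)]
                  for _ in range(pop_size)]
    best_per_generation = [max(map(fitness, population))]
    for _ in range(generations):
        elite = max(population, key=fitness)
        next_generation = [elite[:]]  # the best segment always survives
        while len(next_generation) < pop_size:
            # Selection: the fitter of two random segments gets to reproduce.
            mother = max(rng.sample(population, 2), key=fitness)
            father = max(rng.sample(population, 2), key=fitness)
            cut = rng.randint(1, length - 1)  # single-point crossover
            child = mother[:cut] + father[cut:]
            # Mutation: flip each bit with a small probability.
            child = [bit ^ 1 if rng.random() < mutation_rate else bit
                     for bit in child]
            next_generation.append(child)
        population = next_generation
        best_per_generation.append(max(map(fitness, population)))
    return population, best_per_generation


population, history = evolve()
print("best fitness, first vs last generation:", history[0], history[-1])
```

Because the best segment is copied unchanged into each new generation, the best fitness never decreases, which mirrors the idea that good solutions endure while bad ones are discarded.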

6 Data mining based on information theory

All traditional data mining tools assume that the data used to build the models contain the
information needed to achieve the desired purpose: to obtain enough knowledge that can be
applied to the business (or problem) to obtain a benefit (or solution).

The drawback is that this is not necessarily true. There is also an even bigger problem:
once the model is built, it is not possible to know whether it has captured all of the
information available in the data. For this reason, the common practice is to build several
models with different parameters to see whether any of them achieves better results.

A relatively new approach to data analysis solves these problems, making the practice of
data mining more like a science than an art.

In 1948, Claude Shannon published a paper called "A Mathematical Theory of Communication".
This later came to be called Information Theory and laid the foundations of communication
and the encoding of information. Shannon proposed a way to measure the amount of
information, expressed in bits. In 1999, Dorian Pyle published a book called "Data
Preparation for Data Mining", in which he proposes a way to use Information Theory to
analyze data. In this new approach, a database is a channel that transmits information. On
one side is the real world, from which the data generated by the business are captured. On
the other are all the important situations and problems of the business. Information flows
from the real world, through the data, to the problems of the business.
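
To make Shannon's measure concrete, here is a minimal sketch of entropy in bits, estimated from the observed frequencies of a discrete variable. The function name `entropy_bits` and the sample values are hypothetical and used only for illustration.

```python
from collections import Counter
from math import log2


def entropy_bits(values):
    """Shannon entropy, in bits, of the empirical distribution of `values`."""
    total = len(values)
    # Each outcome with probability p = c / total contributes p * log2(1 / p).
    return sum((c / total) * log2(total / c) for c in Counter(values).values())


# An even two-way split carries one bit of uncertainty per observation.
even_split = entropy_bits(["stay", "leave", "stay", "leave"])
# A variable that never changes carries no information.
constant = entropy_bits(["stay", "stay", "stay", "stay"])
print(even_split, constant)
```
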
With this perspective, and using Information Theory, it is possible to measure the amount of
information available in the data and what portion of it can be used to solve the business
problem. As a practical example, one might find that the data contain 65% of the information
needed to predict which customers will terminate their contracts. In that case, if the final
model makes predictions with 60% accuracy, one can be confident that the tool that generated
the model did a good job of capturing the available information. If, however, the model had
reached an accuracy of only 10%, then trying other models, or even other tools, could be
worthwhile.
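
One possible way to put a number on "the share of the needed information that the data contain" is the fraction of the target's uncertainty removed by knowing a variable, 1 - H(Y|X) / H(Y). This is an assumed formalization for illustration only; the text does not state how the percentage is actually computed. The names `entropy_bits` and `information_fraction` and the records are hypothetical, and the code assumes the target takes at least two distinct values.

```python
from collections import Counter, defaultdict
from math import log2


def entropy_bits(values):
    """Shannon entropy, in bits, of the empirical distribution of `values`."""
    total = len(values)
    return sum((c / total) * log2(total / c) for c in Counter(values).values())


def information_fraction(xs, ys):
    """Share of the uncertainty in ys removed by knowing xs: 1 - H(Y|X) / H(Y)."""
    groups = defaultdict(list)
    for x, y in zip(xs, ys):
        groups[x].append(y)
    # H(Y|X): entropy of the target within each group, weighted by group size.
    conditional = sum(len(g) / len(ys) * entropy_bits(g) for g in groups.values())
    return 1 - conditional / entropy_bits(ys)


# Hypothetical records: contract type and whether the customer left.
contract = ["monthly"] * 4 + ["annual"] * 4
churn = ["leave", "leave", "leave", "stay", "stay", "stay", "stay", "leave"]
fraction = information_fraction(contract, churn)
print(f"{fraction:.0%} of the uncertainty about churn is explained by contract type")
```

Note that this fraction and a model's accuracy are different quantities, so comparing them is only a rough guide.
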
The ability to measure the information contained in data has other important advantages:
Analyzing the data from this new perspective generates an information map that makes prior
data preparation unnecessary, a task that is absolutely essential for good results but takes
a huge amount of time.
It is possible to select an optimal set of variables that contains the information needed to
create a prediction model.
Once the variables have been processed to create the information map and those that provide
the most information have been selected, the choice of the tool used to create the model
stops being important, since most of the work was done in the previous steps.
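
The variable selection mentioned above could, for instance, rank candidate variables by their mutual information with the target. The sketch below is a simplified illustration under that assumption: it scores each variable on its own and ignores redundancy between variables, so it does not claim to find an optimal set. All names and records are hypothetical.

```python
from collections import Counter
from math import log2


def entropy_bits(values):
    """Shannon entropy, in bits, of the empirical distribution of `values`."""
    total = len(values)
    return sum((c / total) * log2(total / c) for c in Counter(values).values())


def mutual_information(xs, ys):
    # I(X;Y) = H(X) + H(Y) - H(X,Y), with the joint taken over paired values.
    return entropy_bits(xs) + entropy_bits(ys) - entropy_bits(list(zip(xs, ys)))


def rank_variables(columns, target):
    """Return (name, bits) pairs sorted from most to least informative."""
    scores = {name: mutual_information(values, target)
              for name, values in columns.items()}
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


# Hypothetical customer records and churn outcome.
target = ["leave", "leave", "stay", "stay"]
columns = {
    "region": ["north", "south", "north", "south"],
    "plan": ["basic", "basic", "basic", "premium"],
    "late_payments": ["yes", "yes", "no", "no"],
}
ranking = rank_variables(columns, target)
for name, bits in ranking:
    print(f"{name}: {bits:.3f} bits")
```
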