Data Mining and Other Analogous Disciplines
There is some controversy over where the boundaries lie between data mining and analogous disciplines such as statistics, artificial intelligence, and so on. Some argue that data mining is nothing but statistics wrapped in business jargon that makes it a sellable product. Others, by contrast, find in it a set of specific problems and methods that make it distinct from other disciplines.
The fact is that, in practice, almost all of the models and algorithms used in data mining (neural networks, regression and classification trees, logistic models, principal components analysis, etc.) enjoy a relatively long tradition in other fields.
5.1 On Statistics
Certainly, data mining draws from statistics, taking from it the following techniques (a brief code sketch after the list illustrates each one):
Analysis of variance: evaluates whether there are significant differences between the means of one or more continuous variables across different populations.
Regression: defines the relationship between one or more variables and a set of variables that predict the former.
Chi-square test: used to test the hypothesis of dependence (or independence) between variables.
Clustering analysis: allows a population of individuals characterized by multiple attributes (binary, qualitative, or quantitative) to be classified into a given number of groups, based on the similarities or differences among the individuals.
Discriminant analysis: allows individuals to be classified into previously established groups; it finds the classification rule for the elements of these groups and thereby better identifies which variables define membership in each group.
Time series: allows the evolution of a variable over time to be studied in order to make predictions, based on that knowledge and under the assumption that no structural changes will occur.
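As a rough illustration, the following sketch (my own addition, not from the source; it assumes NumPy, SciPy, and scikit-learn are installed, and the Iris dataset and thresholds are purely illustrative) exercises each of the techniques above in a few lines:

```python
import numpy as np
from scipy import stats
from sklearn.cluster import KMeans
from sklearn.datasets import load_iris
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.linear_model import LinearRegression

iris = load_iris()
X, y = iris.data, iris.target

# Analysis of variance: do mean sepal lengths differ across the three species?
f_stat, p_val = stats.f_oneway(*[X[y == k, 0] for k in np.unique(y)])
print(f"ANOVA: F = {f_stat:.2f}, p = {p_val:.3g}")

# Regression: relate petal length to the other three measurements.
reg = LinearRegression().fit(X[:, [0, 1, 3]], X[:, 2])
print(f"Regression R^2 = {reg.score(X[:, [0, 1, 3]], X[:, 2]):.3f}")

# Chi-square test: is a discretized attribute independent of the species?
wide = X[:, 1] > np.median(X[:, 1])
table = [[np.sum((y == k) & wide), np.sum((y == k) & ~wide)] for k in np.unique(y)]
chi2, p, dof, _ = stats.chi2_contingency(table)
print(f"Chi-square: chi2 = {chi2:.2f}, p = {p:.3g}")

# Clustering analysis: group the individuals without using the labels.
clusters = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(X)
print("Cluster sizes:", np.bincount(clusters))

# Discriminant analysis: classify into the previously established groups.
lda = LinearDiscriminantAnalysis().fit(X, y)
print(f"LDA training accuracy = {lda.score(X, y):.3f}")

# Time series: a naive forecast from past values, assuming no structural change.
series = np.sin(np.linspace(0, 20, 120)) + np.random.default_rng(0).normal(0, 0.1, 120)
print(f"Moving-average forecast of next value: {series[-12:].mean():.3f}")
```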
All traditional data mining tools assume that the data used to build the models contain the information necessary to achieve the desired purpose: obtaining knowledge that can be applied to the business (or problem) to achieve a benefit (or solution).
The drawback is that this is not necessarily true. Moreover, there is a still bigger problem: once the model is built, it is not possible to know whether it has captured all of the information available in the data. For this reason, the common practice is to build several models with different parameters and see whether any achieves better results.
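As a hedged sketch of that practice (assuming scikit-learn; the dataset and the parameter grids are illustrative, not from the source), the loop below cross-validates a few candidate models with different parameters and reports which setting does best:

```python
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)

# Candidate model families, each with a small grid of parameters to try.
candidates = {
    "classification tree": (DecisionTreeClassifier(random_state=0),
                            {"max_depth": [3, 5, None]}),
    "logistic model": (LogisticRegression(max_iter=5000),
                       {"C": [0.1, 1.0, 10.0]}),
}

# Cross-validate every parameter setting and report the best per model.
for name, (model, grid) in candidates.items():
    search = GridSearchCV(model, grid, cv=5).fit(X, y)
    print(f"{name}: best {search.best_params_}, CV accuracy {search.best_score_:.3f}")
```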
A relatively new approach to data analysis solves these problems, making the practice of data mining resemble a science more than an art.
In 1948, Claude Shannon published a paper called "A Mathematical Theory of Communication." The field it founded came to be called Information Theory, and it laid the foundations of communication and the encoding of information. Shannon proposed a way to measure the amount of information, expressed in bits. In 1999, Dorian Pyle published a book called "Data Preparation for Data Mining" in which he proposes a way to use Information Theory to analyze data. In this new approach, a database is a channel that transmits information. At one end is the real world, which generates the data the business captures. At the other end are all of the business's important situations and problems. Information flows from the real world, through the data, to the business problems.
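To make Shannon's measure concrete, here is a minimal sketch (my own, not from Pyle's book; the "plan" field and "churned" outcome are toy values) of entropy in bits and of the mutual information between a data field and a business outcome:

```python
import math
from collections import Counter

def entropy_bits(values):
    """Shannon entropy H(X) = -sum p(x) * log2 p(x), in bits."""
    n = len(values)
    return -sum((c / n) * math.log2(c / n) for c in Counter(values).values())

def mutual_information_bits(xs, ys):
    """I(X;Y) = H(X) + H(Y) - H(X,Y): bits that X carries about Y."""
    return entropy_bits(xs) + entropy_bits(ys) - entropy_bits(list(zip(xs, ys)))

# Toy channel: a data field ("plan") and a business outcome ("churned").
plan = ["a", "a", "b", "b", "a", "b", "a", "b"]
churned = [0, 0, 1, 1, 0, 1, 0, 0]
print(f"H(churn) = {entropy_bits(churned):.3f} bits")
print(f"I(plan; churn) = {mutual_information_bits(plan, churned):.3f} bits")
```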
With this perspective, and using Information Theory, it is possible to measure the amount of information available in the data and what portion of it can be used to solve the business problem. As a practical example, one might find that the data contain 65% of the information needed to predict which customers will terminate their contracts. In that case, if the final model is able to make predictions with 60% accuracy, it can be concluded that the tool that generated the model did a good job of capturing the available information. If, on the other hand, the model had achieved an accuracy of only 10%, then trying other models, or even other tools, could be worthwhile.
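As a hedged illustration of how a figure like that 65% might arise (a simplification, not Pyle's exact procedure), the ratio I(X;Y)/H(Y) estimates the fraction of the information needed to predict an outcome Y that a data field X actually carries:

```python
import math
from collections import Counter

def H(vs):
    """Shannon entropy in bits."""
    n = len(vs)
    return -sum((c / n) * math.log2(c / n) for c in Counter(vs).values())

plan = ["a", "a", "b", "b", "a", "b", "a", "b"]        # data field
churned = [0, 0, 1, 1, 0, 1, 0, 0]                     # outcome to predict
mi = H(plan) + H(churned) - H(list(zip(plan, churned)))  # I(plan; churn)
print(f"Data carries {mi / H(churned):.0%} of the information needed to predict churn")
```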
The ability to measure the information contained in data has other important advantages. Analyzing the data from this new perspective generates an information map that makes much of the prior preparation of the data unnecessary, a task that is absolutely essential for good results but consumes an enormous amount of time.
It also becomes possible to select an optimal set of variables containing the information needed to create a prediction model.
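A minimal sketch of that selection step, assuming scikit-learn (the dataset and the number of variables kept are illustrative): rank the variables by their estimated mutual information with the target and keep the most informative ones.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.feature_selection import SelectKBest, mutual_info_classif

X, y = load_breast_cancer(return_X_y=True)

# Score each variable by mutual information with the target; keep the top 5.
selector = SelectKBest(score_func=mutual_info_classif, k=5).fit(X, y)
print("Indices of the selected variables:", selector.get_support(indices=True))
```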
Once the variables have been processed to create the information map, and those that provide the most information have been selected, the choice of the tool used to create the model ceases to be important, since most of the work was done in the previous steps.