FDS Unit01

Data mining is the process of extracting valuable information from large datasets using various techniques such as clustering, classification, and anomaly detection. It plays a crucial role in analytics, enabling organizations to derive insights for decision-making and predictive analysis. The Knowledge Discovery in Databases (KDD) process includes multiple steps, with data mining being a key component, and addresses challenges such as noisy data, data integration, and privacy concerns.


Data Mining

It is the non-trivial extraction of implicit, previously unknown, and potentially useful
information from data. This encompasses a number of different technical approaches, such
as clustering, data summarization, learning classification rules, finding dependency
networks, analysing changes, and detecting anomalies.
Data mining is the search for relationships and global patterns that exist in large
databases but are ‘hidden’ among the vast amount of data, such as a relationship between
patient data and their medical diagnosis. These relationships represent valuable
knowledge about the database and the objects in the database.
Data mining refers to “using a variety of techniques to identify nuggets of
information or decision-making knowledge in bodies of data, and extracting these in such
a way that they can be put to use in areas such as decision support, prediction,
forecasting and estimation.”

Why is data mining important?


Data mining is a crucial component of successful analytics initiatives in
organizations. The information it generates can be used in business intelligence (BI) and
advanced analytics applications that involve analysis of historical data, as well as real-time
analytics applications that examine streaming data as it's created or collected.

KDD (Knowledge Discovery in Databases)


It is a field of computer science, which includes the tools and theories to help
humans in extracting useful and previously unknown information (i.e. knowledge) from
large collections of digitized data. KDD consists of several steps, and Data Mining is one of
them.

PRASANNA HEBBAR @GOVT FIRST GRADE COLLEGE HONNAVAR 1


• Data Cleaning: In this step the noise and inconsistent data is removed.
• Data Integration: In this step multiple data sources are combined.
• Data Selection: In this step data relevant to the analysis task are retrieved from the
database.
• Data Transformation: In this step data are transformed or consolidated into forms
appropriate for mining by performing summary or aggregation operations.
• Data Mining: In this step intelligent methods are applied in order to extract data
patterns.
• Pattern Evaluation: In this step, data patterns are evaluated to identify the truly
interesting ones.
• Knowledge Presentation: In this step, the mined knowledge is represented and presented
to the user.
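As an illustration, the steps above can be sketched on a toy dataset (the records, field names and values are purely hypothetical):

```python
# Hypothetical raw sales records; one record is noisy (missing value).
raw = [
    {"store": "A", "month": "Jan", "sales": 100},
    {"store": "A", "month": "Jan", "sales": None},   # noisy record
    {"store": "B", "month": "Jan", "sales": 250},
    {"store": "A", "month": "Feb", "sales": 120},
]

# Data cleaning: remove records with missing values.
cleaned = [r for r in raw if r["sales"] is not None]

# Data selection: retrieve only the attributes relevant to the task.
selected = [{"store": r["store"], "sales": r["sales"]} for r in cleaned]

# Data transformation: consolidate by aggregating sales per store.
totals = {}
for r in selected:
    totals[r["store"]] = totals.get(r["store"], 0) + r["sales"]

# Data mining (trivially): extract a pattern - the best-selling store.
best_store = max(totals, key=totals.get)
```

In a real KDD process each step is far richer, but the pipeline shape - clean, select, transform, mine - is the same.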

DBMS v/s Data Mining


Data mining is the non-trivial extraction of implicit, previously
unknown and potentially useful information from the data.
Data retrieval, in its usual sense in database literature, attempts to retrieve data that
is stored explicitly in the database and presents it to the user in a way that the user can
understand. It does not attempt to extract implicit information.
Data mining is the search for the relationships and global patterns that exist in large
databases but are hidden among vast amounts of data, such as the relationship between
patient data and their medical diagnosis.
Data mining refers to using a variety of techniques to identify nuggets of information
or decision-making knowledge in the database and extracting these in such a way that they
can be put to use in areas such as decision support, prediction, forecasting and estimation.
The data mining system learns from the previous history of the investigated
system, formulating and testing hypotheses about the rules which the system obeys.
Data mining is the process of discovering meaningful new correlations, patterns and
trends by sifting through large amounts of data stored in repositories, using pattern
recognition techniques as well as statistical and mathematical techniques.

KDD vs. Data Mining


• Data mining is only one of the many steps involved in knowledge discovery in databases.
• The KDD process tends to be highly iterative and interactive.
• The structures that are the outcome of the data mining process must meet certain
conditions to be considered knowledge. These conditions are: validity,
understandability, utility, novelty and interestingness.

Elements and uses of Data Mining


Data mining consists of five major elements:
1. Extract, transform and load transaction data onto the data warehouse system.
2. Store and manage the data in a multidimensional database system.
3. Provide data access to business analysts and information technology professionals.
4. Analyze the data with application software.
5. Present the data in a useful format such as a graph or a table.



DM techniques
1. Data cleaning and preparation
Data cleaning and preparation is a vital part of the data mining process. Raw data
must be cleansed and formatted to be useful in different analytic methods. Data cleaning
and preparation includes different elements of data modelling, transformation, data
migration, ETL, data integration, and aggregation. It’s a necessary step for understanding
the basic features and attributes of data to determine its best use.

2. Tracking patterns
Tracking patterns is a fundamental data mining technique. It involves identifying and
monitoring trends or patterns in data to make intelligent inferences about business
outcomes. Once an organization identifies a trend in sales data, for example, there’s a basis
for taking action to capitalize on that insight. If it’s determined that a certain product is
selling more than others for a particular demographic, an organization can use this
knowledge to create similar products or services, or simply better stock the original
product for this demographic.

3. Association
Association is a data mining technique related to statistics. It indicates that certain
data (or events found in data) are linked to other data or data-driven events.
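A minimal sketch of association on hypothetical market-basket data: count how often pairs of items occur together, and compute each pair's support (the item names and baskets are illustrative):

```python
from itertools import combinations
from collections import Counter

# Hypothetical shopping baskets.
baskets = [
    {"bread", "milk"},
    {"bread", "butter", "milk"},
    {"bread", "butter"},
    {"milk"},
]

# Count co-occurrences of every item pair across baskets.
pair_counts = Counter()
for basket in baskets:
    for pair in combinations(sorted(basket), 2):
        pair_counts[pair] += 1

# Support of a pair = fraction of baskets containing both items.
support = {p: c / len(baskets) for p, c in pair_counts.items()}
```

Pairs with high support (e.g. bread and milk here) are candidate associations worth acting on.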

4. Clustering
Clustering is an analytics technique that relies on visual approaches to
understanding data. Clustering mechanisms use graphics to show where the distribution
of data is in relation to different types of metrics. Clustering techniques also use different
colors to show the distribution of data. Graph approaches are ideal for using cluster
analytics. With graphs and clustering in particular, users can visually see how data is
distributed to identify trends that are relevant to their business objectives.
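Beneath the visual layer, a simple clustering method such as k-means alternates an assignment step and an update step. A minimal one-dimensional sketch (the points and initial centers are illustrative):

```python
# Toy 1-D data with two obvious groups and two initial center guesses.
points = [1.0, 1.5, 2.0, 10.0, 10.5, 11.0]
centers = [1.0, 11.0]

for _ in range(10):
    # Assignment step: attach each point to its nearest center.
    clusters = [[], []]
    for p in points:
        idx = 0 if abs(p - centers[0]) <= abs(p - centers[1]) else 1
        clusters[idx].append(p)
    # Update step: move each center to the mean of its cluster.
    centers = [sum(c) / len(c) for c in clusters]
```

Plotting the resulting clusters in different colors gives exactly the kind of visual distribution view described above.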

5. Prediction
Prediction is a very powerful aspect of data mining that represents one of four
branches of analytics. Predictive analytics use patterns found in current or historical data
to extend them into the future. Thus, it gives organizations insight into what trends will
happen next in their data.
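A minimal prediction sketch, fitting a least-squares line to hypothetical monthly sales and extending it one month into the future (the figures are illustrative):

```python
# Historical data: month number -> sales.
months = [1, 2, 3, 4]
sales = [100, 120, 140, 160]

# Ordinary least-squares fit of a straight line y = slope*x + intercept.
n = len(months)
mean_x = sum(months) / n
mean_y = sum(sales) / n
slope = sum((x - mean_x) * (y - mean_y) for x, y in zip(months, sales)) \
        / sum((x - mean_x) ** 2 for x in months)
intercept = mean_y - slope * mean_x

# Extend the fitted trend to month 5.
forecast = slope * 5 + intercept
```

Real predictive analytics uses far richer models, but the idea is the same: patterns found in historical data are extended into the future.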

Data Mining Problems


Data mining systems rely on databases to supply the raw data for input, and this
raises problems in that databases tend to be dynamic, incomplete, noisy, and large. Other
problems arise as a result of the adequacy and relevance of the information stored.
Limited Information
A database is often designed for purposes different from data mining and sometimes
the properties or attributes that would simplify the learning task are not present nor can
they be requested from the real world.



Noise and Missing Values
Databases are usually contaminated by errors so it cannot be assumed that the data
they contain is entirely correct. Attributes, which rely on subjective or measurement
judgments, can give rise to errors such that some examples may even be misclassified.
Uncertainty
Uncertainty refers to the severity of the error and the degree of noise in the data.
Data precision is an important consideration in a discovery system.
Size, updates, and irrelevant fields
Databases tend to be large and dynamic in that their contents are ever-changing as
information is added, modified or removed.

Issues and Challenges in DM


1. Noisy and Incomplete Data
Data mining is the process of extracting information from large volumes of data. The
real-world data is heterogeneous, incomplete and noisy. Data in large quantities normally
will be inaccurate or unreliable. These problems could be due to errors of the instruments
that measure the data or because of human errors.
2. Distributed Data
Real world data is usually stored on different platforms in distributed computing
environments. It could be in databases, individual systems, or even on the Internet. It is
practically very difficult to bring all the data to a centralized data repository mainly due to
organizational and technical reasons.
3. Complex Data
Real world data is really heterogeneous and it could be multimedia data including
images, audio and video, complex data and so on. It is really difficult to handle these
different kinds of data and extract required information.
4. Performance
The performance of the data mining system mainly depends on the efficiency of
algorithms and techniques used. If the algorithms and techniques designed are not up to
the mark, then it will affect the performance of the data mining process adversely.
5. Data Visualization
Data visualization is a very important process in data mining because it is the main
process that displays the output in a presentable manner to the user. The information
extracted should convey the exact meaning of what it actually intends to convey. But many
times, it is really difficult to represent the information in an accurate and easy-to-
understand way to the end user.
6. Data Privacy and Security
Data mining normally leads to serious issues in terms of data security, privacy and
governance.

Potential Application Areas and Applications


Retail/Marketing
• Identify buying patterns from customers.



• Find associations among customer demographic characteristics.
• Predict response to mailing campaigns.
• Market basket analysis.

Banking
• Detect patterns of fraudulent credit card use.
• Identify ‘loyal’ customers.
• Predict customers likely to change their credit card affiliation.
• Determine credit card spending by customer groups.
• Find hidden correlations between different financial indicators.
• Identify stock trading rules from historical market data.

Insurance and Health Care


• Claims analysis - i.e. which medical procedures are claimed together.
• Predict which customers will buy new policies.
• Identify behaviour patterns of risky customers.
• Identify fraudulent behaviour.

Transportation
• Determine the distribution schedules among outlets.
• Analyse loading patterns.

Medicine
• Characterise patient behaviour to predict office visits.
• Identify successful medical therapies for different illnesses.

Data Warehouse
Bill Inmon, the father of the Data Warehouse, provides the following definition: “A data
warehouse is a subject-oriented, integrated, non-volatile, and time-variant collection of
data in support of management’s decisions.”

• Subject oriented
A DW is organized around major subjects such as customer, supplier, products and
sales. The DW focuses on modeling and analysis of data for decision making. Hence a DW
provides a simple and concise view around particular subjects and excludes data that is not
useful for decision support.

• Integrated Data
In a DW, data comes from heterogeneous sources such as relational databases, flat
files, XML files, network and hierarchical databases, and transaction files. The file layouts,
character code representations and field naming conventions of these sources differ. This leads to



redundancy and inconsistency of data; hence the data needs to be standardized, i.e. it must
go through the processes of transformation, consolidation and integration.

• Time Variant
Data in a DW is meant for analysis and decision making. Hence data is stored to
provide information from a historical perspective. Data is stored as snapshots over past and
current periods. Every data structure in the DW contains a time element. The time-variant
nature of data in a DW
- allows for analysis of the past
- relates information to the present
- enables forecasts for the future

• Non-volatile Data
Data from operational systems is moved into the DW at specific intervals depending
on the requirements of the business (at the end of a day, week, month, quarter or year).
Adds, deletes and updates are done on operational data. Once the data is moved into the DW,
you do not run individual transactions to change the data there. Deletes are not done in the
DW. The data in the DW is non-volatile; it is used for query and analysis.

APPLICATIONS OF DATA WAREHOUSES


• Consumer goods
• Banking services
• Financial services
• Manufacturing
• Retail sectors

ADVANTAGES OF DATA WAREHOUSING


• Cost-efficient and provides quality of data.
• Performance and productivity are improved.
• Accurate data access and consistency

Difference between Data warehouse and Data mining


Data warehouse | Data mining
Integrates data from various sources. | Extracts useful patterns/knowledge from huge amounts of data.
Focuses on the storage medium. | Focuses on discovering useful patterns.
Must take place before data mining. | Must take place after data warehousing.
Data is stored periodically. | Data is retrieved regularly.
Used to identify the overall behaviour of the system. | Used to identify the behaviour of the particular extracted information.
Increases system performance. | Increases throughput [output].
Operations: data cleaning, refresh, extract, transform, data loading. | Operations: classification, association, clustering, regression, etc.



A multidimensional data model.
DW and OLAP tools are based on a multidimensional data model. This model views
data in the form of a data cube. A data cube allows data to be modeled and viewed in
multiple dimensions. It is defined by dimension and facts.
➢ Dimensions are perspectives or entities with respect to which an organization wants to
keep records. Ex: an electronics company may create a sales DW in order to keep records
of its stores with respect to the dimensions time, item and location. Each dimension has a
table associated with it called a dimension table.
➢ Facts are numerical measures, i.e. quantities by which we want to analyze the
relationships between dimensions. Ex: facts for the sales DW include dollars_sold (sales
amount in dollars) and units_sold (number of units sold). The fact table contains the names
of the facts or measures and keys to each of the related dimension tables.
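The fact and dimension tables described above can be sketched as plain records (all keys, names and measures here are hypothetical):

```python
# Dimension tables: one lookup table per dimension, keyed by surrogate key.
dim_time = {1: {"month": "Jan"}, 2: {"month": "Feb"}}
dim_item = {10: {"name": "TV"}, 11: {"name": "Radio"}}
dim_location = {100: {"city": "Honnavar"}}

# Fact table: measures (dollars_sold, units_sold) plus dimension keys.
fact_sales = [
    {"time_key": 1, "item_key": 10, "loc_key": 100,
     "dollars_sold": 500.0, "units_sold": 2},
    {"time_key": 2, "item_key": 10, "loc_key": 100,
     "dollars_sold": 750.0, "units_sold": 3},
]

# Measures are analyzed by joining facts back to their dimension tables,
# e.g. total dollars sold in January.
jan_dollars = sum(f["dollars_sold"] for f in fact_sales
                  if dim_time[f["time_key"]]["month"] == "Jan")
```

This key-based layout - a central fact table surrounded by dimension tables - is what makes multidimensional analysis possible.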

Definition of OLAP:
Online analytical processing (OLAP) is a category of software technology that
enables analysts, managers and executives to gain insight into data through fast,
consistent, interactive access to a wide variety of possible views of information that has
been transformed from raw data to reflect the real dimensionality of the enterprise as
understood by the user.



OLAP operations/functions in the Multidimensional data model
• Roll_up: The roll-up operation performs aggregation on a data cube, either by climbing
up a concept hierarchy for a dimension or by dimension reduction.
• Drill-down: Drill-down is the reverse of roll-up. It navigates from less detailed data to
more detailed data. Drill-down can be realized by either stepping down a concept
hierarchy for a dimension or introducing additional dimensions.
• Slice: The slice operation performs a selection on one dimension of the cube, resulting
in a sub-cube.

• Dice: The dice operation defines a sub-cube by performing a selection on two or more
dimensions.

• Pivot (rotate): Pivot is a visualization operation that rotates the data axes in view in
order to provide an alternative presentation of the data. Ex: rotating the axes of a 3-D
cube, or transforming a 3-D cube into a series of 2-D planes.

• Drill-across: Executes queries involving more than one fact table.
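Roll-up and slice can be sketched on a toy cube stored as a dictionary mapping (time, item, location) to sales (all dimension values and figures are illustrative):

```python
# Toy data cube: (time, item, location) -> sales.
cube = {
    ("Q1", "TV", "Bangalore"): 100,
    ("Q1", "TV", "Mysore"): 80,
    ("Q2", "TV", "Bangalore"): 120,
    ("Q1", "Radio", "Bangalore"): 40,
}

# Roll-up: aggregate away the location dimension (dimension reduction).
rollup = {}
for (time, item, loc), sales in cube.items():
    rollup[(time, item)] = rollup.get((time, item), 0) + sales

# Slice: fix one dimension (time = "Q1") to obtain a 2-D sub-cube.
slice_q1 = {(item, loc): s for (time, item, loc), s in cube.items()
            if time == "Q1"}
```

Dice works the same way as slice but with conditions on two or more dimensions at once.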

Data Preprocessing
Data Cleaning
Data cleaning is the process of preparing data for analysis by removing or
correcting incorrect, incomplete, irrelevant, duplicate, or irregularly formatted
records.



Data Integration
It combines data from multiple sources into a coherent data store, as in data
warehousing. These sources may include multiple databases, data cubes, or flat files. The
data integration systems are formally defined as a triple <G, S, M>,
where, G: The global schema.
S: The heterogeneous source schemas.
M: The mapping between queries over the sources and the global schema.
Issues in Data integration
1. Schema integration and object matching: How can the data analyst or the computer be
sure that customer_id in one database and customer_number in another refer to the
same attribute?
2. Redundancy: An attribute (such as annual revenue) may be redundant if it can be
derived from another attribute or set of attributes. Inconsistencies in attribute or
dimension naming can also cause redundancies in the resulting data set.
3. Detection and resolution of data value conflicts: For the same real-world entity,
attribute values from different sources may differ.
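A minimal sketch of the <G, S, M> idea: two hypothetical source schemas with mismatched attribute names (customer_id vs. cust_no) are mapped onto one global schema (all names and values are illustrative):

```python
# Two heterogeneous sources S with different naming conventions.
source_a = [{"customer_id": 1, "city": "Honnavar"}]
source_b = [{"cust_no": 1, "annual_revenue": 5000}]

# Mapping M: both key attributes refer to the same real-world entity,
# so rows from both sources merge under the global schema G.
global_rows = {}
for r in source_a:
    global_rows.setdefault(r["customer_id"], {})["city"] = r["city"]
for r in source_b:
    global_rows.setdefault(r["cust_no"], {})["revenue"] = r["annual_revenue"]
```

Real integration systems express M declaratively (e.g. as view definitions), but the effect is the same: queries against G are answered from the sources S.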

Data Transformation
In data transformation, the data are transformed or consolidated into forms
appropriate for mining. Data transformation can involve the following:
• Smoothing which works to remove noise from the data. Such techniques include
binning, regression, and clustering.
• Aggregation, where summary or aggregation operations are applied to the data. For
example, the daily sales data may be aggregated so as to compute monthly and annual
total amounts. This step is typically used in constructing a data cube for analysis of the
data at multiple granularities.
• Generalization of the data, where low-level or “primitive” (raw) data are replaced by
higher-level concepts through the use of concept hierarchies. For example, categorical
attributes, like street, can be generalized to higher-level concepts, like city or country.
• Normalization, where the attribute data are scaled so as to fall within a small specified
range, such as −1.0 to 1.0, or 0.0 to 1.0.
• Attribute construction (or feature construction), where new attributes are constructed
and added from the given set of attributes to help the mining process.
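The normalization step above can be sketched as min-max scaling into the range 0.0 to 1.0 (the attribute values are illustrative):

```python
# Hypothetical attribute values, e.g. monthly incomes.
values = [200, 300, 400, 600, 1000]

# Min-max normalization: (v - min) / (max - min) maps the data to [0.0, 1.0].
lo, hi = min(values), max(values)
normalized = [(v - lo) / (hi - lo) for v in values]
```

Scaling to [−1.0, 1.0] instead just requires an extra shift: 2 * x − 1 applied to each normalized value.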

Data Reduction
Data reduction techniques can be applied to obtain a reduced representation of the
data set that is much smaller in volume, yet closely maintains the integrity of the original
data. That is, mining on the reduced data set should be more efficient yet produce the
same (or almost the same) analytical results. Strategies for data reduction include the
following:
• Data cube aggregation, where aggregation operations are applied to the data in the
construction of a data cube.

PRASANNA HEBBAR @GOVT FIRST GRADE COLLEGE HONNAVAR 9


• Attribute subset selection, where irrelevant, weakly relevant, or redundant attributes
or dimensions may be detected and removed.
• Dimensionality reduction, where encoding mechanisms are used to reduce the dataset
size.
• Numerosity reduction, where the data are replaced or estimated by alternative,
smaller data representations such as parametric models (which need store only the
model parameters instead of the actual data) or nonparametric methods such as
clustering, sampling, and the use of histograms.
• Discretization and concept hierarchy generation, where raw data values for attributes
are replaced by ranges or higher conceptual levels. Data discretization is a form of
numerosity reduction that is very useful for the automatic generation of concept
hierarchies. Discretization and concept hierarchy generation are powerful tools for
data mining, in that they allow the mining of data at multiple levels of abstraction.
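A minimal discretization sketch: raw age values are replaced by higher-level range labels, a simple concept-hierarchy step (the bin edges and labels are illustrative):

```python
# Raw attribute values to be discretized.
ages = [13, 25, 37, 45, 62, 70]

def discretize(age):
    """Map a raw age to a higher-level concept label."""
    if age < 20:
        return "youth"
    elif age < 50:
        return "adult"
    return "senior"

labels = [discretize(a) for a in ages]
```

The six raw values collapse into three concept labels, which is exactly the numerosity reduction the text describes: mining can now operate on "youth/adult/senior" instead of individual ages.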

Questions
2 Marks
1. What is data mining?
2. What are the steps of Knowledge Discovery in Databases (KDD)?
3. How is KDD different from Data Mining?
4. What are the application areas of DM?
5. Define Data Warehouse.
6. Define the multidimensional data model.

5 Marks
1. Explain the steps of Knowledge Discovery in Databases (KDD).
2. What is the difference between DBMS and data mining?
3. Explain Data Mining techniques.
4. What are the issues and challenges in Data Mining?
5. Explain the definition of a data warehouse.
6. Explain the operations of the multidimensional data model.

10 Marks
1. Explain the process of Data Preprocessing.
2. Explain the multidimensional data model and its operations.
