Data Mining
The process of structuring, analyzing, and formulating massive
amounts of raw data in order to find patterns through
mathematical and computational algorithms is called Data
Mining.
Every data scientist who wants to advance further in their career
and obtain a powerful skill set needs to know at least the basics of
data mining.
By learning the techniques of data mining, one can use this
knowledge to generate new insights and find new trends.
The process of mining data can be divided into three main parts:
gathering, collecting, and cleaning the data; applying a data mining
technique to the data; and validating the results of the technique.
Data Mining Architecture
The main components of a data mining system are a data source, a data
mining engine, a data warehouse server, a pattern evaluation module, a
graphical user interface, and a knowledge base.
There are many techniques one can use to perform data mining.
I will focus on the top 5 data mining techniques used
right now by individuals and big companies.
The techniques we will cover are:
MapReduce.
Clustering.
Link Analysis.
Recommendation Systems.
Frequent Itemset Analysis.
MapReduce is a programming model and implementation for
collecting and processing large amounts of data in parallel.
MapReduce takes a chunk of data, divides it so that it can be
processed on different machines, and then gathers the results
from all of those machines.
A MapReduce program is composed of three steps:
1. map step: Performs filtering and sorting. The result of this step
is a collection of (key, value) pairs that represent the mapping of
the data we are attempting to mine.
2. shuffle step: The shuffle step acts as an intermediate stage
between the map and reduce steps. Its only job is to sort the
(key, value) collection so that all pairs with identical keys reach
the same reducer.
3. reduce step: Performs a summary operation (such as counting
the different values for the same key).
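To make the three steps concrete, here is a minimal, single-machine sketch of MapReduce-style word counting in Python. The function names and toy documents are illustrative assumptions; real MapReduce frameworks run the map and reduce steps across many machines rather than in one process.

```python
from collections import defaultdict

# A minimal, single-machine sketch of the MapReduce flow for word counting.
# Distributed frameworks run map and reduce on many machines in parallel.

def map_step(document):
    # Emit a (key, value) pair for every word in the document.
    return [(word.lower(), 1) for word in document.split()]

def shuffle_step(mapped_pairs):
    # Group all values that share the same key so each reducer
    # sees every occurrence of its key.
    groups = defaultdict(list)
    for key, value in mapped_pairs:
        groups[key].append(value)
    return groups

def reduce_step(key, values):
    # Summary operation: count the occurrences of the key.
    return key, sum(values)

documents = ["data mining finds patterns", "mining data finds trends"]
mapped = [pair for doc in documents for pair in map_step(doc)]
shuffled = shuffle_step(mapped)
counts = dict(reduce_step(k, v) for k, v in shuffled.items())
print(counts)  # e.g. {'data': 2, 'mining': 2, 'finds': 2, 'patterns': 1, 'trends': 1}
```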
Clustering is the task of grouping a set of items so that the items in
one group are more similar to one another than to items in other groups.
Each group is then called a cluster. Clustering is often used in data
mining and data analysis, and it appears in many applications such as
pattern recognition, computer vision, data compression, and bioinformatics.
Clustering can be done using one of two strategies:
1. Hierarchical Clustering: Here, each data point starts as its own
cluster. The algorithm then joins clusters that are close in
distance to each other until it reaches a specified limit. This limit
can either be a set number of clusters or a set of rules on the
different clusters.
2. Point Assignment: Each data point is assigned to the pre-defined
cluster it fits best. Some variations of these algorithms allow for
cluster splitting or cluster joining. Popular point-assignment
algorithms include k-means and BFR.
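As an illustration of the point-assignment strategy, here is a small k-means sketch. It assumes scikit-learn is available (the library is not named in the text) and uses invented toy points.

```python
import numpy as np
from sklearn.cluster import KMeans  # scikit-learn is an assumed dependency

# Toy 2-D data: two loose groups of points (values are invented).
points = np.array([
    [1.0, 1.1], [0.9, 1.0], [1.2, 0.8],   # first group
    [8.0, 8.2], [7.9, 8.1], [8.3, 7.7],   # second group
])

# Point assignment: each point is assigned to the nearest of k cluster
# centers, and the centers are re-estimated until assignments stabilize.
kmeans = KMeans(n_clusters=2, n_init=10, random_state=0).fit(points)

print(kmeans.labels_)           # cluster index for each point, e.g. [0 0 0 1 1 1]
print(kmeans.cluster_centers_)  # the two learned cluster centers
```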
Link analysis is a data mining technique based on a branch of
mathematics called graph theory. Graph theory represents different objects
(nodes) and the relationships between them (edges) as a graph. Link
analysis can be used with both directed and undirected graphs.
Link analysis is often performed in 4 steps:
1. Data Processing: Collecting and manipulating data using
different algorithms, such as sorting, aggregation, classification,
and validation.
2. Transforming: Converting data from one format or structure into
another format or structure in order to ease up the process of
analyzing that data.
3. Analysis: Once the data has been transformed, different analysis
strategies can be used to extract useful, desirable information.
4. Visualization: The best way to communicate the extracted information
is to use a visualization approach.
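One classic link-analysis computation is PageRank, which scores each node by the scores of the nodes linking to it. The sketch below assumes the networkx library and an invented toy graph; it shows only one possible way to analyze links.

```python
import networkx as nx  # networkx is an assumed dependency for this sketch

# Build a small directed graph: nodes are pages, edges are links between them.
graph = nx.DiGraph()
graph.add_edges_from([
    ("A", "B"), ("A", "C"),
    ("B", "C"),
    ("C", "A"),
    ("D", "C"),
])

# PageRank: a node's score depends on the scores of the nodes linking to it.
scores = nx.pagerank(graph, alpha=0.85)
for node, score in sorted(scores.items(), key=lambda item: -item[1]):
    print(node, round(score, 3))
```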
Recommendation Systems are a class of applications that use
machine learning and mathematical models to predict a user's
responses to different sets of options.
There are different approaches to implementing a recommendation
system; the 4 most used approaches are:
1. Collaborative systems: This approach combines information about
different users and items, and it is the main approach used by Amazon.
2. Content-based systems: This approach focuses mainly on the
content of your previous experiences.
3. Risk-aware systems: This approach uses content-based and
collaborative techniques but adds another layer on top. This new
layer calculates the risk of recommending specific content
based on the location or the age of the user.
4. Hybrid systems: Hybrid systems are those that make use of
different recommendation techniques to increase the accuracy of
their recommendation and ensure a higher user satisfaction rate.
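As a rough sketch of the collaborative idea, the example below scores items for one user by weighting the other users' ratings with user-to-user cosine similarity. The rating matrix and the scoring rule are illustrative assumptions, not a production recommender.

```python
import numpy as np

# A tiny user-item rating matrix (rows = users, columns = items, 0 = not rated).
# Collaborative sketch: recommend items liked by users whose rating
# patterns resemble the target user's.
ratings = np.array([
    [5, 4, 0, 1],
    [4, 5, 0, 0],
    [1, 0, 5, 4],
    [0, 1, 4, 5],
], dtype=float)

def cosine_similarity(a, b):
    return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))

target_user = 0
# Similarity of every other user to the target user (0 for the user itself).
similarities = np.array([
    cosine_similarity(ratings[target_user], ratings[u]) if u != target_user else 0.0
    for u in range(ratings.shape[0])
])

# Score each item by the similarity-weighted ratings of the other users,
# then ignore items the target user has already rated.
scores = similarities @ ratings
scores[ratings[target_user] > 0] = -np.inf
print("recommend item", int(np.argmax(scores)))  # the unrated item with the highest score
```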
Frequent Itemset Analysis is the analysis approach used with
market-basket data. The market-basket is a data model used to
describe a common form of many-to-many relationship.
This data model connects two kinds of data points, items
and baskets. Each basket contains a set of items.
Frequent itemset analysis can be used to categorize and analyze
different kinds of applications, for example:
1. Related concepts: If we look for sets of words that appear
together in many documents, the sets will be dominated by the most
common words, such as stop words or connecting
words. We can ignore these words to reveal the words that most
frequently appear together in the documents, which point to related concepts.
2. Plagiarism: Here the items are the documents and the baskets are
the sentences. An item is part of a basket if the sentence is in the
document. To detect plagiarism, we look for pairs of items that
appear together in several baskets. If we find such a pair, then we
have two documents that share several sentences in common, which
suggests plagiarism.
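A minimal sketch of frequent itemset analysis on market-basket data: count how often each pair of items occurs together across baskets and keep the pairs that meet a support threshold. The baskets and the threshold value are invented for illustration.

```python
from collections import Counter
from itertools import combinations

# Market-basket sketch: each basket is a set of items; we count how often
# each pair of items appears together and keep the "frequent" pairs.
baskets = [
    {"bread", "milk"},
    {"bread", "butter", "milk"},
    {"beer", "bread"},
    {"milk", "butter"},
]

support_threshold = 2  # a pair is "frequent" if it occurs in at least 2 baskets

pair_counts = Counter()
for basket in baskets:
    for pair in combinations(sorted(basket), 2):
        pair_counts[pair] += 1

frequent_pairs = {pair: count for pair, count in pair_counts.items()
                  if count >= support_threshold}
print(frequent_pairs)  # {('bread', 'milk'): 2, ('butter', 'milk'): 2}
```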
Data Warehouse
Introduction
A Data Warehouse is built by combining data from multiple
diverse sources to support analytical reporting, structured and
unstructured queries, and decision making for the organization.
Data Warehousing is a step-by-step approach for
constructing and using a Data Warehouse.
Many data scientists get their data in raw formats from various
sources of data and information.
For many data scientists, as well as business decision-makers, particularly
in big enterprises, the main sources of data and information are
corporate data warehouses.
A data warehouse holds data from multiple sources, including
internal databases and software platforms. After the data is loaded,
it is often cleansed, transformed, and checked for quality before
it is used for analytics, reporting, data science, machine learning, or
other purposes.
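As a rough sketch of that load-cleanse-transform flow, the example below extracts some raw records, cleans and standardizes them, and loads them into a warehouse-like table. pandas and SQLite stand in for real warehouse tooling, and the table and column names are invented for illustration.

```python
import sqlite3
import pandas as pd  # pandas is an assumed dependency for this sketch

# Extract: raw sales records as they might arrive from an operational source.
# The column names and values here are purely illustrative.
raw = pd.DataFrame({
    "order_id": [1, 2, 2, 3],
    "amount": ["10.5", "20.0", "20.0", None],
    "region": ["north", "South", "South", "north"],
})

# Transform: cleanse the data (drop duplicates and missing values),
# normalize formats, and cast types before loading.
clean = (raw.drop_duplicates()
            .dropna(subset=["amount"])
            .assign(amount=lambda df: df["amount"].astype(float),
                    region=lambda df: df["region"].str.lower()))

# Load: write the cleansed data into a warehouse-like table
# (SQLite stands in for the warehouse here).
conn = sqlite3.connect("warehouse.db")
clean.to_sql("sales_fact", conn, if_exists="replace", index=False)

# Analysts can now run reporting queries against the loaded table.
print(pd.read_sql("SELECT region, SUM(amount) AS total FROM sales_fact GROUP BY region", conn))
conn.close()
```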
What is a Data Warehouse?
A Data Warehouse is a collection of software tools that facilitates
the analysis of large sets of business data and helps an organization
make decisions.
A large amount of the data in data warehouses comes from
numerous sources, such as internal applications (marketing,
sales, and finance) and customer-facing apps.
A data warehouse is mainly a data management system that is
designed to enable and support business intelligence (BI)
activities, particularly analytics. Data warehouses are intended to
perform queries and to clean, manipulate, transform, and
analyze data.
Need for Data Warehousing
Data Warehousing is an essential tool for business intelligence. It
allows organizations to make better business decisions.
The data warehouse benefits the organization by improving data analytics.
Basic Data Warehouse Architecture
Data warehouses help organizations work out more practical business strategies.
Business User: Business users or customers need a data
warehouse to look at summarized data from the past.
Maintains consistency: Data warehouses are programmed
to apply a standard format to all data collected
from different sources.
Standardizing the data reduces the risk of errors in
interpretation and improves overall accuracy.
Store historical data: Data warehouses are also used to
store historical data, that is, time-variant
data from the past, and this input can be used for various
purposes.
Make strategic decisions: Data warehouses contribute to
making better strategic decisions. Some business
strategies may depend upon the data stored within the
data warehouses.
High response time: A data warehouse has to be
prepared for a large volume and variety of queries, which demands a
high degree of flexibility and fast response times.
Characteristics of a Data Warehouse:
Subject Oriented: A data warehouse is subject-oriented
because it delivers information organized around a
particular theme, which means the data warehousing process is
designed to handle a specific, well-defined theme.
These themes are often sales, distribution, etc.
Time-Variant: The data is maintained over different
intervals of time, such as weekly, monthly,
or annually.
Non-volatile: The data residing in the data warehouse is
permanent, which means it cannot be erased or
deleted, even when new data is inserted. In the data
warehouse, data is read-only and can only be refreshed at
particular intervals of time. Operations such as delete, update, and
insert, which are performed in an operational software application,
are not carried out in the data warehouse environment.
There are only two types of data operations that can be done in
the data warehouse:
Data Loading
Data Access
Integrated: A data warehouse is created by integrating data
from different sources, such as mainframe computers and
relational databases.
It also has consistent naming conventions, formats, and codes.
Integration of the data warehouse helps in the successful analysis
of data by providing dependability in naming conventions, column scaling,
encoding structure, etc.
Basic Statistics Concepts for Data Science
1. Descriptive Statistics
It is used to describe the basic features of data and provides a summary of the
given data set, which can represent either the entire population or a sample of
the population.
It is derived from calculations that include:
Mean: It is the central value, commonly known as the arithmetic average.
Mode: It refers to the value that appears most often in a data set.
Median: It is the middle value of the ordered set, which divides it in exactly half.
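A quick illustration of these three measures using Python's standard statistics module (the sample values are invented):

```python
import statistics

# A small, purely illustrative sample.
data = [2, 3, 3, 5, 7, 8, 8, 8, 10]

print("mean:", statistics.mean(data))      # arithmetic average -> 6
print("median:", statistics.median(data))  # middle of the ordered values -> 7
print("mode:", statistics.mode(data))      # most frequent value -> 8
```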
2. Variability
Variability includes the following parameters:
Standard Deviation: It is a statistic that measures the dispersion of a data set
relative to its mean.
Variance: It refers to a statistical measure of the spread between the numbers in a
data set. In general terms, it is the average squared difference from the mean. A large
variance indicates that the numbers are far from the average value, a small variance
indicates that the numbers are close to the average value, and zero variance indicates
that all values in the set are identical.
Range: This is defined as the difference between the largest and smallest value of
a dataset.
Percentile: It refers to a measure used in statistics that indicates the value
below which a given percentage of the observations in the dataset falls.
Quartile: It is defined as a value that divides the data points into quarters.
Interquartile Range: It measures the middle half of your data; in general terms, it
is the middle 50% of the dataset.
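The following sketch computes these variability measures with NumPy on an invented sample (NumPy is an assumed dependency):

```python
import numpy as np

data = np.array([4, 8, 6, 5, 3, 9, 7, 5])  # illustrative values

print("range:", data.max() - data.min())   # largest minus smallest value
print("variance:", np.var(data, ddof=1))   # sample variance
print("std dev:", np.std(data, ddof=1))    # sample standard deviation

q1, q3 = np.percentile(data, [25, 75])     # quartile boundaries
print("25th and 75th percentiles:", q1, q3)
print("interquartile range:", q3 - q1)     # spread of the middle 50% of the data
```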
3. Correlation
It is one of the major statistical techniques that measure the relationship between two
variables. The correlation coefficient indicates the strength of the linear relationship
between two variables.
A correlation coefficient greater than zero indicates a positive relationship.
A correlation coefficient less than zero indicates a negative relationship.
A correlation coefficient of zero indicates that there is no linear relationship between
the two variables.
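A small illustration of the correlation coefficient, assuming NumPy and invented data for two variables:

```python
import numpy as np

# Two illustrative variables: hours studied and exam score.
hours = np.array([1, 2, 3, 4, 5, 6])
score = np.array([52, 55, 61, 70, 74, 83])

r = np.corrcoef(hours, score)[0, 1]  # Pearson correlation coefficient
print(round(r, 3))  # close to +1: a strong positive linear relationship
```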
4. Probability Distribution
It specifies the likelihood of all possible events. In simple terms, an event refers to
the result of an experiment. Events are of two types: dependent and independent.
Independent event: An event is said to be independent when it is not
affected by earlier events.
Dependent event: An event is said to be dependent when its occurrence
depends on earlier events.
The probability of independent events occurring together is calculated by simply
multiplying the probability of each event, while the probability of a dependent event
is calculated using conditional probability.
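A worked example, using the common card-drawing illustration (not from the text): drawing two aces with replacement treats the draws as independent events, while drawing without replacement makes the second draw dependent on the first.

```python
# Probability of drawing two aces from a standard 52-card deck (illustrative).

# Independent events (draw with replacement): multiply the probabilities.
p_independent = (4 / 52) * (4 / 52)

# Dependent events (draw without replacement): use conditional probability,
# P(A and B) = P(A) * P(B | A).
p_dependent = (4 / 52) * (3 / 51)

print(round(p_independent, 4))  # ~0.0059
print(round(p_dependent, 4))    # ~0.0045
```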
5. Regression
It is a method that is used to determine the relationship between one or more
independent variables and a dependent variable. Regression is mainly of two types:
Linear regression: It is used to fit a regression model that explains the
relationship between a numeric response variable and one or more predictor
variables.
Logistic regression: It is used to fit a regression model that explains the
relationship between the binary response variable and one or more predictor
variables.
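A brief sketch of both kinds of regression, assuming scikit-learn is available and using invented data:

```python
import numpy as np
from sklearn.linear_model import LinearRegression, LogisticRegression  # assumed dependency

X = np.array([[1], [2], [3], [4], [5], [6]])        # a single predictor variable

# Linear regression: numeric response variable.
y_numeric = np.array([2.1, 4.2, 5.9, 8.1, 9.8, 12.2])
linear = LinearRegression().fit(X, y_numeric)
print(linear.coef_, linear.intercept_)              # slope close to 2

# Logistic regression: binary response variable.
y_binary = np.array([0, 0, 0, 1, 1, 1])
logistic = LogisticRegression().fit(X, y_binary)
print(logistic.predict([[1.5], [5.5]]))             # expected: [0 1]
```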
6. Normal Distribution
The normal distribution is used to define the probability density function for a
continuous random variable in a system. It has two parameters, the mean and the
standard deviation (the standard normal distribution fixes these at 0 and 1). When
the distribution of a random variable is unknown, the normal distribution is often
used, and the central limit theorem justifies why it is used in such cases.
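The central limit theorem can be illustrated with a small simulation, assuming NumPy and invented parameters: averages of draws from a non-normal (uniform) distribution are approximately normally distributed.

```python
import numpy as np

rng = np.random.default_rng(0)

# Draw from a clearly non-normal distribution (uniform on [0, 1]) ...
samples = rng.uniform(0, 1, size=(10_000, 30))

# ... and take the mean of each group of 30 draws. By the central limit
# theorem, these sample means are approximately normally distributed.
sample_means = samples.mean(axis=1)

print(round(sample_means.mean(), 3))  # close to the population mean 0.5
print(round(sample_means.std(), 3))   # close to (1/sqrt(12)) / sqrt(30) ≈ 0.053
```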
7. Bias
In statistical terms, bias occurs when a model or sample is not representative of the
complete population. Bias needs to be minimized to get the desired outcome.
The three most common types of bias are:
Selection bias: It is the phenomenon of selecting a group of data for statistical
analysis in such a way that the selection is not randomized, resulting in data
that is unrepresentative of the whole population.
Confirmation bias: It occurs when the person performing the statistical analysis
has some predefined assumption.
Time interval bias: It is caused intentionally by specifying a certain time range to
favor a particular outcome.