Data Mining
The process of structuring, analyzing, and formulating massive
amounts of raw data in order to find patterns through
mathematical and computational algorithms is called Data
Mining.
Every data scientist who wants to advance further in their career
and obtain a powerful skill set needs to know at least the basics of
data mining.
By learning the techniques of data mining, one can use this
knowledge to generate new insights and find new trends.
The process of mining data can be divided into three main parts:
gathering, collecting, and cleaning the data; applying a data mining
technique to the data; and validating the results of the technique.
Data Mining Architecture
The main components of a data mining system are a data source, a data
mining engine, a data warehouse server, a pattern evaluation module, a
graphical user interface, and a knowledge base.
There are many techniques one can use to perform data mining.
I will focus on the top 5 data mining techniques used
right now by individuals and big companies.
The techniques we will cover are:
MapReduce.
Clustering.
Link Analysis.
Recommendation Systems.
Frequent Itemset Analysis.
MapReduce is a programming model and implementation for
collecting and processing large amounts of data in parallel.
MapReduce takes a chunk of data, divides it so that it can be
processed on different machines, and then gathers the results
from all of those machines.
A MapReduce program is composed of three steps:
1. map step: Performs filtering and sorting. The result of this step
is a collection of (key, value) pairs that represent the mapping of
the data we are attempting to mine.
2. shuffle step: The shuffle step acts as an intermediate stage
between the map and reduce steps. Its only job is to sort the
(key, value) collection so that all pairs with identical keys reach
the same reducer.
3. reduce step: Performs a summary operation (such as counting
the different values for the same key).
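To make the three steps concrete, here is a minimal, single-machine sketch of MapReduce-style word counting in Python. The function names and toy documents are illustrative assumptions; real MapReduce frameworks run the map and reduce steps across many machines rather than in one process.

```python
from collections import defaultdict

# A minimal, single-machine sketch of the MapReduce flow for word counting.
# Distributed frameworks run map and reduce on many machines in parallel.

def map_step(document):
    # Emit a (key, value) pair for every word in the document.
    return [(word.lower(), 1) for word in document.split()]

def shuffle_step(mapped_pairs):
    # Group all values that share the same key so each reducer
    # sees every occurrence of its key.
    groups = defaultdict(list)
    for key, value in mapped_pairs:
        groups[key].append(value)
    return groups

def reduce_step(key, values):
    # Summary operation: count the occurrences of the key.
    return key, sum(values)

documents = ["data mining finds patterns", "mining data finds trends"]
mapped = [pair for doc in documents for pair in map_step(doc)]
shuffled = shuffle_step(mapped)
counts = dict(reduce_step(k, v) for k, v in shuffled.items())
print(counts)  # e.g. {'data': 2, 'mining': 2, 'finds': 2, 'patterns': 1, 'trends': 1}
```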
Clustering is the task of grouping a set of items so that the items in
one group are more similar to one another than to items in other groups.
Each group is then called a cluster. Clustering is often used in data
mining and data analysis, and it appears in many applications such as
pattern recognition, computer vision, data compression, and bioinformatics.
Clustering can be done using one of two strategies:
1. Hierarchical Clustering: Here, each data point starts as its own
cluster. The algorithm then joins clusters that are close in
distance to each other until it reaches a specified limit. This limit
can either be a set number of clusters or a set of rules on the
different clusters.
2. Point Assignment: Each data point is assigned to the pre-defined
cluster it fits best. Some variations of these algorithms allow for
cluster splitting or cluster joining. Popular point-assignment
algorithms include k-means and BFR.
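As an illustration of the point-assignment strategy, here is a small k-means sketch. It assumes scikit-learn is available (the library is not named in the text) and uses invented toy points.

```python
import numpy as np
from sklearn.cluster import KMeans  # scikit-learn is an assumed dependency

# Toy 2-D data: two loose groups of points (values are invented).
points = np.array([
    [1.0, 1.1], [0.9, 1.0], [1.2, 0.8],   # first group
    [8.0, 8.2], [7.9, 8.1], [8.3, 7.7],   # second group
])

# Point assignment: each point is assigned to the nearest of k cluster
# centers, and the centers are re-estimated until assignments stabilize.
kmeans = KMeans(n_clusters=2, n_init=10, random_state=0).fit(points)

print(kmeans.labels_)           # cluster index for each point, e.g. [0 0 0 1 1 1]
print(kmeans.cluster_centers_)  # the two learned cluster centers
```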
Link analysis is a data mining technique based on a branch of
mathematics called graph theory. Graph theory represents different objects
(nodes) and the relationships between them (edges) as a graph. Link
analysis can be used with both directed and undirected graphs.
Link analysis is often performed in 4 steps:
1. Data Processing: Collecting and manipulating data using
different algorithms, such as sorting, aggregation, classification,
and validation.
2. Transforming: Converting data from one format or structure into
another format or structure in order to ease up the process of
analyzing that data.
3. Analysis: Once the data has been transformed, different analysis
strategies can be used to extract useful, desirable information.
4. Visualization: The best way to communicate the extracted information
is to use a visualization approach.
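One classic link-analysis computation is PageRank, which scores each node by the scores of the nodes linking to it. The sketch below assumes the networkx library and an invented toy graph; it shows only one possible way to analyze links.

```python
import networkx as nx  # networkx is an assumed dependency for this sketch

# Build a small directed graph: nodes are pages, edges are links between them.
graph = nx.DiGraph()
graph.add_edges_from([
    ("A", "B"), ("A", "C"),
    ("B", "C"),
    ("C", "A"),
    ("D", "C"),
])

# PageRank: a node's score depends on the scores of the nodes linking to it.
scores = nx.pagerank(graph, alpha=0.85)
for node, score in sorted(scores.items(), key=lambda item: -item[1]):
    print(node, round(score, 3))
```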
Recommendation Systems are a class of applications that use
machine learning and mathematical models to predict a user's
responses to different sets of options.
There are different approaches to implementing a recommendation
system; the 4 most used approaches are:
1. Collaborative systems: This approach combines information about
different users and items, and it is the main approach used by Amazon.
2. Content-based systems: This approach focuses mainly on the
content of your previous experiences.
3. Risk-aware systems: This approach uses content-based and
collaborative techniques but adds another layer on top. This new
layer calculates the risk of recommending specific content
based on the location or the age of the user.
4. Hybrid systems: Hybrid systems are those that make use of
different recommendation techniques to increase the accuracy of
their recommendation and ensure a higher user satisfaction rate.
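As a rough sketch of the collaborative idea, the example below scores items for one user by weighting the other users' ratings with user-to-user cosine similarity. The rating matrix and the scoring rule are illustrative assumptions, not a production recommender.

```python
import numpy as np

# A tiny user-item rating matrix (rows = users, columns = items, 0 = not rated).
# Collaborative sketch: recommend items liked by users whose rating
# patterns resemble the target user's.
ratings = np.array([
    [5, 4, 0, 1],
    [4, 5, 0, 0],
    [1, 0, 5, 4],
    [0, 1, 4, 5],
], dtype=float)

def cosine_similarity(a, b):
    return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))

target_user = 0
# Similarity of every other user to the target user (0 for the user itself).
similarities = np.array([
    cosine_similarity(ratings[target_user], ratings[u]) if u != target_user else 0.0
    for u in range(ratings.shape[0])
])

# Score each item by the similarity-weighted ratings of the other users,
# then ignore items the target user has already rated.
scores = similarities @ ratings
scores[ratings[target_user] > 0] = -np.inf
print("recommend item", int(np.argmax(scores)))  # the unrated item with the highest score
```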
Frequent Itemset Analysis is the analysis approach used with
market-basket data. The market-basket is a data model used to
describe a common form of many-to-many relationship.
This data model connects two kinds of data points, items
and baskets. Each basket contains a set of items.
Frequent itemset analysis can be used to categorize and analyze
different kinds of applications, for example:
1. Related concepts: If we look for sets of words that appear
together in many documents, the sets will be dominated by the most
common words, such as stop words or connecting
words. We can ignore these words to reveal the words that most
frequently appear together in the documents, which point to related concepts.
2. Plagiarism: Here the items are the documents and the baskets are
the sentences. An item is part of a basket if the sentence is in the
document. To detect plagiarism, we look for pairs of items that
appear together in several baskets. If we find such a pair, then we
have two documents that share several sentences in common, which
suggests plagiarism.
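A minimal sketch of frequent itemset analysis on market-basket data: count how often each pair of items occurs together across baskets and keep the pairs that meet a support threshold. The baskets and the threshold value are invented for illustration.

```python
from collections import Counter
from itertools import combinations

# Market-basket sketch: each basket is a set of items; we count how often
# each pair of items appears together and keep the "frequent" pairs.
baskets = [
    {"bread", "milk"},
    {"bread", "butter", "milk"},
    {"beer", "bread"},
    {"milk", "butter"},
]

support_threshold = 2  # a pair is "frequent" if it occurs in at least 2 baskets

pair_counts = Counter()
for basket in baskets:
    for pair in combinations(sorted(basket), 2):
        pair_counts[pair] += 1

frequent_pairs = {pair: count for pair, count in pair_counts.items()
                  if count >= support_threshold}
print(frequent_pairs)  # {('bread', 'milk'): 2, ('butter', 'milk'): 2}
```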
Data Warehouse
Introduction
A Data Warehouse is built by combining data from multiple
diverse sources to support analytical reporting, structured and
unstructured queries, and decision making for the organization.
Data Warehousing is a step-by-step approach for
constructing and using a Data Warehouse.
Many data scientists get their data in raw formats from various
sources of data and information.
For many data scientists, as well as business decision-makers, particularly
in big enterprises, the main sources of data and information are
corporate data warehouses.
A data warehouse holds data from multiple sources, including
internal databases and software platforms. After the data is loaded,
it is often cleansed, transformed, and checked for quality before
it is used for analytics, reporting, data science, machine learning, or
other purposes.
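As a rough sketch of that load-cleanse-transform flow, the example below extracts some raw records, cleans and standardizes them, and loads them into a warehouse-like table. pandas and SQLite stand in for real warehouse tooling, and the table and column names are invented for illustration.

```python
import sqlite3
import pandas as pd  # pandas is an assumed dependency for this sketch

# Extract: raw sales records as they might arrive from an operational source.
# The column names and values here are purely illustrative.
raw = pd.DataFrame({
    "order_id": [1, 2, 2, 3],
    "amount": ["10.5", "20.0", "20.0", None],
    "region": ["north", "South", "South", "north"],
})

# Transform: cleanse the data (drop duplicates and missing values),
# normalize formats, and cast types before loading.
clean = (raw.drop_duplicates()
            .dropna(subset=["amount"])
            .assign(amount=lambda df: df["amount"].astype(float),
                    region=lambda df: df["region"].str.lower()))

# Load: write the cleansed data into a warehouse-like table
# (SQLite stands in for the warehouse here).
conn = sqlite3.connect("warehouse.db")
clean.to_sql("sales_fact", conn, if_exists="replace", index=False)

# Analysts can now run reporting queries against the loaded table.
print(pd.read_sql("SELECT region, SUM(amount) AS total FROM sales_fact GROUP BY region", conn))
conn.close()
```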
What is a Data Warehouse?
A Data Warehouse is a collection of software tools that facilitates
the analysis of large sets of business data and helps an organization
make decisions.
A large amount of the data in data warehouses comes from
numerous sources, such as internal applications (marketing,
sales, and finance) and customer-facing apps.
A data warehouse is mainly a data management system that is
designed to enable and support business intelligence (BI)
activities, particularly analytics. Data warehouses are intended to
perform queries and to clean, manipulate, transform, and
analyze data.
Need for Data Warehousing
Data Warehousing is an essential tool for business intelligence. It
allows organizations to make better business decisions.
The data warehouse benefits the organization by improving data analytics.
Basic Data Warehouse Architecture
Data warehouses help organizations work out more practical business strategies.
Business User: Business users or customers need a data
warehouse to look at summarized data from the past.
Maintains consistency: Data warehouses are programmed
to apply a standard format to all data collected
from different sources.
Standardizing the data reduces the risk of errors in
interpretation and improves overall accuracy.
Store historical data: Data warehouses are also used to
store historical data, that is, time-variant
data from the past, and this input can be used for various
purposes.
Make strategic decisions: Data warehouses contribute to
making better strategic decisions. Some business
strategies may depend upon the data stored within the
data warehouses.
High response time: A data warehouse has to be
prepared for a large volume and variety of queries, which demands a
high degree of flexibility and fast response times.
Characteristics of a Data Warehouse:
Subject Oriented: A data warehouse is subject-oriented
because it delivers information organized around a
particular theme, which means the data warehousing process is
designed to handle a specific, well-defined theme.
These themes are often sales, distribution, etc.
Time-Variant: The data is maintained over different
intervals of time, such as weekly, monthly,
or annually.
Non-volatile: The data residing in the data warehouse is
permanent, which means it cannot be erased or
deleted, even when new data is inserted. In the data
warehouse, data is read-only and can only be refreshed at
particular intervals of time. Operations such as delete, update, and
insert, which are performed in an operational software application,
are not carried out in the data warehouse environment.
There are only two types of data operations that can be done in
the data warehouse:
Data Loading
Data Access
Integrated: A data warehouse is created by integrating data
from different sources, such as mainframe computers and
relational databases.
It also has consistent naming conventions, formats, and codes.
Integration of the data warehouse helps in the successful analysis
of data by providing dependability in naming conventions, column scaling,
encoding structure, etc.
Basic Statistics Concepts for Data Science
1. Descriptive Statistics
It is used to describe the basic features of data and provides a summary of the
given data set, which can represent either the entire population or a sample of
the population.
It is derived from calculations that include:
Mean: It is the central value, commonly known as the arithmetic average.
Mode: It refers to the value that appears most often in a data set.
Median: It is the middle value of the ordered set, which divides it in exactly half.
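A quick illustration of these three measures using Python's standard statistics module (the sample values are invented):

```python
import statistics

# A small, purely illustrative sample.
data = [2, 3, 3, 5, 7, 8, 8, 8, 10]

print("mean:", statistics.mean(data))      # arithmetic average -> 6
print("median:", statistics.median(data))  # middle of the ordered values -> 7
print("mode:", statistics.mode(data))      # most frequent value -> 8
```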
2. Variability
Variability includes the following parameters:
Standard Deviation: It is a statistic that measures the dispersion of a data set
relative to its mean.
Variance: It refers to a statistical measure of the spread between the numbers in a
data set. In general terms, it is the average squared difference from the mean. A large
variance indicates that the numbers are far from the average value, a small variance
indicates that the numbers are close to the average value, and zero variance indicates
that all values in the set are identical.
Range: This is defined as the difference between the largest and smallest value of
a dataset.
Percentile: It refers to a measure used in statistics that indicates the value
below which a given percentage of the observations in the dataset falls.
Quartile: It is defined as a value that divides the data points into quarters.
Interquartile Range: It measures the middle half of your data; in general terms, it
is the middle 50% of the dataset.
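The following sketch computes these variability measures with NumPy on an invented sample (NumPy is an assumed dependency):

```python
import numpy as np

data = np.array([4, 8, 6, 5, 3, 9, 7, 5])  # illustrative values

print("range:", data.max() - data.min())   # largest minus smallest value
print("variance:", np.var(data, ddof=1))   # sample variance
print("std dev:", np.std(data, ddof=1))    # sample standard deviation

q1, q3 = np.percentile(data, [25, 75])     # quartile boundaries
print("25th and 75th percentiles:", q1, q3)
print("interquartile range:", q3 - q1)     # spread of the middle 50% of the data
```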
3. Correlation
It is one of the major statistical techniques that measure the relationship between two
variables. The correlation coefficient indicates the strength of the linear relationship
between two variables.
A correlation coefficient greater than zero indicates a positive relationship.
A correlation coefficient less than zero indicates a negative relationship.
A correlation coefficient of zero indicates that there is no linear relationship between
the two variables.
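A small illustration of the correlation coefficient, assuming NumPy and invented data for two variables:

```python
import numpy as np

# Two illustrative variables: hours studied and exam score.
hours = np.array([1, 2, 3, 4, 5, 6])
score = np.array([52, 55, 61, 70, 74, 83])

r = np.corrcoef(hours, score)[0, 1]  # Pearson correlation coefficient
print(round(r, 3))  # close to +1: a strong positive linear relationship
```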
4. Probability Distribution
It specifies the likelihood of all possible events. In simple terms, an event refers to
the result of an experiment. Events are of two types: dependent and independent.
Independent event: An event is said to be independent when it is not
affected by earlier events.
Dependent event: An event is said to be dependent when its occurrence
depends on earlier events.
The probability of independent events occurring together is calculated by simply
multiplying the probability of each event, while the probability of a dependent event
is calculated using conditional probability.
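A worked example, using the common card-drawing illustration (not from the text): drawing two aces with replacement treats the draws as independent events, while drawing without replacement makes the second draw dependent on the first.

```python
# Probability of drawing two aces from a standard 52-card deck (illustrative).

# Independent events (draw with replacement): multiply the probabilities.
p_independent = (4 / 52) * (4 / 52)

# Dependent events (draw without replacement): use conditional probability,
# P(A and B) = P(A) * P(B | A).
p_dependent = (4 / 52) * (3 / 51)

print(round(p_independent, 4))  # ~0.0059
print(round(p_dependent, 4))    # ~0.0045
```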
5. Regression
It is a method that is used to determine the relationship between one or more
independent variables and a dependent variable. Regression is mainly of two types:
Linear regression: It is used to fit a regression model that explains the
relationship between a numeric response variable and one or more predictor
variables.
Logistic regression: It is used to fit a regression model that explains the
relationship between the binary response variable and one or more predictor
variables.
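A brief sketch of both kinds of regression, assuming scikit-learn is available and using invented data:

```python
import numpy as np
from sklearn.linear_model import LinearRegression, LogisticRegression  # assumed dependency

X = np.array([[1], [2], [3], [4], [5], [6]])        # a single predictor variable

# Linear regression: numeric response variable.
y_numeric = np.array([2.1, 4.2, 5.9, 8.1, 9.8, 12.2])
linear = LinearRegression().fit(X, y_numeric)
print(linear.coef_, linear.intercept_)              # slope close to 2

# Logistic regression: binary response variable.
y_binary = np.array([0, 0, 0, 1, 1, 1])
logistic = LogisticRegression().fit(X, y_binary)
print(logistic.predict([[1.5], [5.5]]))             # expected: [0 1]
```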
6. Normal Distribution
The normal distribution is used to define the probability density function for a
continuous random variable in a system. It has two parameters, the mean and the
standard deviation (the standard normal distribution fixes these at 0 and 1). When
the distribution of a random variable is unknown, the normal distribution is often
used, and the central limit theorem justifies why it is used in such cases.
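The central limit theorem can be illustrated with a small simulation, assuming NumPy and invented parameters: averages of draws from a non-normal (uniform) distribution are approximately normally distributed.

```python
import numpy as np

rng = np.random.default_rng(0)

# Draw from a clearly non-normal distribution (uniform on [0, 1]) ...
samples = rng.uniform(0, 1, size=(10_000, 30))

# ... and take the mean of each group of 30 draws. By the central limit
# theorem, these sample means are approximately normally distributed.
sample_means = samples.mean(axis=1)

print(round(sample_means.mean(), 3))  # close to the population mean 0.5
print(round(sample_means.std(), 3))   # close to (1/sqrt(12)) / sqrt(30) ≈ 0.053
```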
7. Bias
In statistical terms, bias occurs when a model or sample is not representative of the
complete population. Bias needs to be minimized to get the desired outcome.
The three most common types of bias are:
Selection bias: It is the phenomenon of selecting a group of data for statistical
analysis in such a way that the selection is not randomized, resulting in data
that is unrepresentative of the whole population.
Confirmation bias: It occurs when the person performing the statistical analysis
has some predefined assumption.
Time interval bias: It is caused intentionally by specifying a certain time range to
favor a particular outcome.