
ML Assignment 2 PDF

This document is Anubhav Monga's submission for Assignment-2 of his Machine Learning course. It provides detailed explanations of various machine learning techniques including: decision trees, k-means clustering, support vector machines, Naive Bayes classification, k-nearest neighbors, random forests, and linear regression. For each technique, it describes the primary purpose, basic methodology, and provides an example to illustrate how it works. The document aims to elaborate Anubhav's understanding of these fundamental machine learning algorithms.


ASSIGNMENT-2

OF
Machine Learning

SUBMITTED TO: Dr Varun Malik
SUBMITTED BY: Anubhav Monga
1955991509
B.Tech IT 7A

DEPARTMENT OF COMPUTER APPLICATIONS


CHITKARA UNIVERSITY, PUNJAB
Q1 Elaborate your understanding on various machine learning
techniques in detail including their primary purpose and an
example or a case.
Ans:
Various Machine Learning Techniques:
Decision Tree
Ø A Decision Tree is a supervised learning technique that can be used for both
classification and regression problems, though it is mostly preferred for
classification. It is a tree-structured classifier in which internal nodes
represent the features of a dataset, branches represent the decision rules, and
each leaf node represents an outcome.
Ø A decision tree has two types of nodes: decision nodes and leaf nodes.
Decision nodes test a feature and have multiple branches, whereas leaf nodes
hold the output of those decisions and have no further branches.
Ø It is a graphical representation of all the possible solutions to a
problem/decision under given conditions.
Ø It is called a decision tree because, like a tree, it starts at the root
node, which expands into further branches to form a tree-like structure.
Ø A common algorithm for building the tree is CART, which stands for
Classification and Regression Tree.
Ø At each node the tree asks a question and, based on the answer (Yes/No),
splits further into subtrees.
Ø Example: Suppose a candidate has a job offer and wants to decide whether to
accept it. The decision tree starts at the root node (the Salary attribute,
chosen by an attribute selection measure, ASM). The root node splits into a
decision node (distance from the office) and one leaf node, according to the
corresponding labels. That decision node in turn splits into another decision
node (cab facility) and one leaf node. Finally, the last decision node splits
into two leaf nodes (offer accepted and offer declined).
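The job-offer tree described above can be written as a chain of yes/no
questions. This is only a sketch: the function name decide_offer, the salary
threshold of 50,000, the 10 km distance cutoff, and which branch ends in which
leaf are all assumptions, since the text does not give them.

```python
def decide_offer(salary: float, distance_km: float, cab_facility: bool) -> str:
    """Walk a hand-built decision tree for the job-offer example.

    Every threshold here is hypothetical; a real tree learns them from data.
    """
    # Root node: salary (assumed threshold)
    if salary < 50_000:
        return "Declined"   # leaf node
    # Decision node: distance from the office (assumed cutoff)
    if distance_km > 10:
        return "Declined"   # leaf node
    # Decision node: cab facility, which ends in two leaf nodes
    if cab_facility:
        return "Accepted"
    return "Declined"


print(decide_offer(60_000, 5, True))
```

Each if statement plays the role of a decision node, and each return statement
is a leaf node.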
K-Means Clustering
Ø K-Means Clustering is an unsupervised learning algorithm that groups an
unlabeled dataset into clusters. Here K is the number of clusters to create,
fixed in advance: if K=2 there will be two clusters, if K=3 there will be
three clusters, and so on.
Ø It is an iterative algorithm that divides the unlabeled dataset into K
clusters in such a way that each data point belongs to only one group of
points with similar properties.
Ø It offers a convenient way to discover the groups in an unlabeled dataset on
its own, without the need for labeled training data.
Ø It is a centroid-based algorithm, where each cluster is associated with a
centroid. The main aim of the algorithm is to minimize the sum of squared
distances between the data points and the centroids of their clusters.
Ø The algorithm takes the unlabeled dataset as input, divides it into K
clusters, and repeats the process until the cluster assignments stop
changing. The value of K must be chosen beforehand.
Ø The k-means clustering algorithm mainly performs two tasks:
Ø Determines the best positions for the K center points (centroids) by an
iterative process.
Ø Assigns each data point to its closest center. The data points nearest to a
particular center form a cluster.
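The two tasks above (assign points, then move the centroids) can be sketched
in plain Python. This is a minimal illustration and not a production
implementation: the function name k_means, the sample points, and the choice
of the first K points as starting centroids are assumptions made for the
example. Real implementations usually pick the starting centroids randomly.

```python
import math


def k_means(points, k, max_iters=100):
    """Cluster 2-D points into k groups; returns (centroids, assignments)."""
    # Assumption: start from the first k points instead of a random choice.
    centroids = [points[i] for i in range(k)]
    assignments = None
    for _ in range(max_iters):
        # Task 2: assign every point to its closest centroid.
        new_assignments = [
            min(range(k), key=lambda c: math.dist(p, centroids[c]))
            for p in points
        ]
        if new_assignments == assignments:
            break  # assignments stopped changing, so we are done
        assignments = new_assignments
        # Task 1: move each centroid to the mean of its assigned points.
        for c in range(k):
            members = [p for p, a in zip(points, assignments) if a == c]
            if members:
                centroids[c] = tuple(
                    sum(coord) / len(members) for coord in zip(*members)
                )
    return centroids, assignments


# Hypothetical data: two visibly separate groups of points.
points = [(1, 1), (8, 8), (1.5, 2), (9, 9), (1, 0.5), (8, 9.5)]
centroids, labels = k_means(points, k=2)
print(labels)
```

The loop stops as soon as no point changes cluster, which is the stopping rule
described in the list above.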
Support Vector Machine Algorithm
Ø Support Vector Machine (SVM) is one of the most popular supervised learning
algorithms. It can be used for both classification and regression problems,
but it is primarily used for classification.
Ø The goal of the SVM algorithm is to find the best decision boundary that
separates n-dimensional space into classes, so that new data points can be
placed in the correct category in the future. This best decision boundary is
called a hyperplane.
Ø SVM chooses the extreme points/vectors that help define the hyperplane.
These extreme cases are called support vectors, hence the name Support
Vector Machine.
Ø Example: SVM can be understood with the same cat-and-dog example used for
the KNN classifier. Suppose we see a strange cat that also has some features
of dogs, and we want a model that can accurately identify whether it is a cat
or a dog. Such a model can be built with the SVM algorithm. We first train
the model on many images of cats and dogs so that it learns their
distinguishing features, and then we test it on this strange creature. The
SVM draws a decision boundary between the two classes (cat and dog) using the
extreme cases (support vectors) of each class. On the basis of the support
vectors, it classifies the creature as a cat.
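The idea of support vectors is easiest to see in one dimension, where the
maximum-margin boundary can be computed directly. The sketch below assumes a
single hypothetical feature (a "snout length" in cm), assumes the two classes
are separable with every cat value below every dog value, and uses made-up
numbers. Real SVMs work in many dimensions and need an optimizer; this is only
an illustration of the concept.

```python
def fit_1d_svm(cat_values, dog_values):
    """Maximum-margin boundary for separable 1-D data (cats below dogs)."""
    # The support vectors are the extreme points facing the other class.
    left_sv = max(cat_values)
    right_sv = min(dog_values)
    if left_sv >= right_sv:
        raise ValueError("classes overlap; this simple sketch cannot handle that")
    boundary = (left_sv + right_sv) / 2   # the "hyperplane" is a single point
    margin = (right_sv - left_sv) / 2     # distance from boundary to each SV
    return boundary, margin, (left_sv, right_sv)


def classify_animal(value, boundary):
    return "dog" if value > boundary else "cat"


# Hypothetical training data.
cats = [2.0, 2.5, 3.0]
dogs = [6.0, 7.0, 8.0]
boundary, margin, support_vectors = fit_1d_svm(cats, dogs)
print(boundary, margin, support_vectors)
print(classify_animal(4.0, boundary))
```

Only the two extreme points decide where the boundary goes; the other training
points could move without changing it, as long as they stay outside the margin.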

Naïve Bayes Classifier Algorithm


Ø The Naïve Bayes algorithm is a supervised learning algorithm that is based
on Bayes' theorem and used for solving classification problems.
Ø It is mainly used in text classification, which involves high-dimensional
training datasets.
Ø The Naïve Bayes Classifier is one of the simplest and most effective
classification algorithms. It helps in building fast machine learning models
that can make quick predictions.
Ø It is a probabilistic classifier, which means it predicts on the basis of
the probability that an object belongs to each class.
Ø Some popular applications of the Naïve Bayes algorithm are spam filtering,
sentiment analysis, and classifying articles.
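Spam filtering, mentioned above, makes a compact example. The sketch below is
a small multinomial Naïve Bayes classifier with add-one (Laplace) smoothing.
The function names, the four training messages, and the choice of smoothing
are assumptions made for illustration and do not come from the assignment.

```python
import math
from collections import Counter


def train_naive_bayes(docs):
    """docs is a list of (text, label) pairs; returns the counts we need."""
    class_counts = Counter()
    word_counts = {}
    vocab = set()
    for text, label in docs:
        class_counts[label] += 1
        words = text.lower().split()
        word_counts.setdefault(label, Counter()).update(words)
        vocab.update(words)
    return class_counts, word_counts, vocab


def classify_text(text, model):
    class_counts, word_counts, vocab = model
    total_docs = sum(class_counts.values())
    best_label, best_score = None, -math.inf
    for label, n_docs in class_counts.items():
        # Log prior: P(class)
        score = math.log(n_docs / total_docs)
        total_words = sum(word_counts[label].values())
        for word in text.lower().split():
            # Log likelihood: P(word | class) with add-one smoothing
            count = word_counts[label][word]
            score += math.log((count + 1) / (total_words + len(vocab)))
        if score > best_score:
            best_label, best_score = label, score
    return best_label


# Hypothetical training messages.
training = [
    ("win free money now", "spam"),
    ("free prize money", "spam"),
    ("meeting at noon", "ham"),
    ("lunch at noon tomorrow", "ham"),
]
model = train_naive_bayes(training)
print(classify_text("free money", model))
```

The "naïve" part is the assumption that words are independent given the class,
which is why the per-word log probabilities can simply be added.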
K-Nearest Neighbor (KNN) Algorithm for Machine Learning
Ø The K-NN algorithm assumes similarity between the new case/data and the
available cases, and puts the new case into the category it is most similar
to.
Ø The K-NN algorithm stores all the available data and classifies a new data
point based on similarity. This means that when new data appears, it can be
easily classified into a well-suited category using the K-NN algorithm.
Ø The K-NN algorithm can be used for regression as well as classification, but
it is mostly used for classification problems.
Ø K-NN is a non-parametric algorithm, which means it does not make any
assumptions about the underlying data.
Ø It is also called a lazy learner algorithm because it does not learn from
the training set immediately. Instead, it stores the dataset and performs
its computation at the time of classification.
Ø In the training phase the KNN algorithm just stores the dataset. When it
gets new data, it classifies that data into the category most similar to the
new data.
Ø Example: Suppose we have an image of a creature that looks similar to both a
cat and a dog, and we want to know whether it is a cat or a dog. For this
identification we can use the KNN algorithm, as it works on a similarity
measure. Our KNN model will compare the features of the new image with those
of the cat and dog images, and based on the most similar features it will
put it in either the cat or the dog category.
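The cat-or-dog example can be sketched in a few lines. The two features
(weight in kg and ear length in cm), the sample values, the use of Euclidean
distance, and the function name knn_classify are all hypothetical choices made
for this illustration.

```python
import math
from collections import Counter


def knn_classify(train, query, k=3):
    """train is a list of (features, label); returns the majority label."""
    # "Lazy learning": nothing is fitted in advance. We only sort the stored
    # examples by distance to the query when a prediction is requested.
    neighbors = sorted(train, key=lambda item: math.dist(item[0], query))[:k]
    votes = Counter(label for _, label in neighbors)
    return votes.most_common(1)[0][0]


# Hypothetical stored examples: (weight_kg, ear_length_cm)
train = [
    ((4.0, 6.5), "cat"),
    ((5.0, 7.0), "cat"),
    ((3.5, 6.0), "cat"),
    ((20.0, 10.0), "dog"),
    ((25.0, 12.0), "dog"),
    ((30.0, 11.0), "dog"),
]
print(knn_classify(train, (6.0, 7.0)))
```

In practice the features are usually scaled first, because a feature with a
large range would otherwise dominate the distance.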

Random Forest Algorithm


Ø Random Forest is a popular machine learning algorithm that belongs to the
supervised learning technique. It can be used for both classification and
regression problems in ML. It is based on the concept of ensemble learning,
which is the process of combining multiple classifiers to solve a complex
problem and improve the performance of the model.
Ø As the name suggests, "Random Forest is a classifier that contains a number
of decision trees on various subsets of the given dataset and takes the
average to improve the predictive accuracy of that dataset." Instead of
relying on one decision tree, the random forest takes the prediction from
each tree and predicts the final output based on the majority vote of those
predictions.
Ø A larger number of trees in the forest generally leads to higher accuracy
and reduces the risk of overfitting, although the gains level off as more
trees are added.
Ø Example: Suppose there is a dataset that contains multiple fruit images, and
this dataset is given to the Random Forest classifier. The dataset is divided
into subsets, and each subset is given to one decision tree. During the
training phase each decision tree learns from its subset. When a new data
point arrives, each tree produces a prediction, and the Random Forest
classifier predicts the final decision based on the majority of the results.
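The majority-vote step can be shown on its own. In the sketch below the three
"trees" are hand-written rules standing in for trained decision trees, and the
fruit features (color, length_cm, weight_g) and thresholds are hypothetical. A
real random forest would learn each tree from a random subset of the data.

```python
from collections import Counter


# Stand-ins for trained decision trees; each looks at a different feature.
def tree_color(fruit):
    return "apple" if fruit["color"] == "red" else "banana"


def tree_length(fruit):
    return "banana" if fruit["length_cm"] > 12 else "apple"


def tree_weight(fruit):
    return "apple" if fruit["weight_g"] > 140 else "banana"


def forest_predict(trees, sample):
    """Collect one prediction per tree and return the majority vote."""
    votes = Counter(tree(sample) for tree in trees)
    return votes.most_common(1)[0][0]


trees = [tree_color, tree_length, tree_weight]

# A green apple: the color tree gets it wrong, but the other two outvote it.
green_apple = {"color": "green", "length_cm": 8, "weight_g": 170}
print(forest_predict(trees, green_apple))
```

This shows why combining trees helps: a mistake made by one tree is corrected
as long as most of the other trees are right.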
Linear Regression in Machine Learning
Ø It is a statistical method that is used for predictive analysis.
Ø Linear regression makes predictions for continuous/real or numeric variables
such as sales, salary, age, product price, etc.
Ø The linear regression algorithm models a linear relationship between a
dependent variable (y) and one or more independent variables (x), hence the
name linear regression. Because the relationship is linear, the model shows
how the value of the dependent variable changes with the value of the
independent variable.
Ø The linear regression model provides a sloped straight line representing the
relationship between the variables.

Mathematically, we can represent a linear regression as:


y = a0 + a1*x + ε
Here,
y = Dependent variable (target variable)
x = Independent variable (predictor variable)
a0 = Intercept of the line (gives an additional degree of freedom)
a1 = Linear regression coefficient (scale factor applied to each input value)
ε = Random error
The observed values of x and y make up the training dataset for the linear
regression model.
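The coefficients a0 and a1 can be estimated from the training data. The sketch
below uses ordinary least squares, which is the usual method but is an
assumption here because the text does not say how the line is fitted. The
function name fit_line and the sample data are also made up for illustration.

```python
def fit_line(xs, ys):
    """Return (a0, a1) for the least-squares line y = a0 + a1*x."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # a1 = covariance(x, y) / variance(x)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    a1 = sxy / sxx
    # The fitted line always passes through the point (mean_x, mean_y).
    a0 = mean_y - a1 * mean_x
    return a0, a1


# Hypothetical training data that lies exactly on y = 1 + 2x.
xs = [1, 2, 3, 4]
ys = [3, 5, 7, 9]
a0, a1 = fit_line(xs, ys)
print(a0, a1)
print(a0 + a1 * 5)  # prediction for a new x value
```

With real data the points do not lie exactly on a line, and the leftover
differences between the observed and predicted y values correspond to the
error term ε.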

Logistic Regression in Machine Learning


Ø It is used for predicting a categorical dependent variable from a given set
of independent variables.
Ø Because logistic regression predicts a categorical dependent variable, the
outcome must be a categorical or discrete value such as Yes or No, 0 or 1,
True or False. However, instead of giving the exact value 0 or 1, it gives a
probability that lies between 0 and 1.
Ø Logistic Regression is very similar to Linear Regression except in how it
is used. Linear Regression is used for solving regression problems, whereas
Logistic Regression is used for solving classification problems.
Ø In Logistic Regression, instead of fitting a regression line, we fit an
"S"-shaped logistic function whose output is bounded by the two extreme
values 0 and 1.
Ø The curve from the logistic function indicates the likelihood of something,
such as whether cells are cancerous or not, or whether a mouse is obese or
not based on its weight.
Ø Logistic Regression is a significant machine learning algorithm because it
can provide probabilities and classify new data using both continuous and
discrete datasets.
Ø Logistic Regression can be used to classify observations using different
types of data and can easily determine the most effective variables for the
classification.

Logistic Function (Sigmoid Function):


The sigmoid function is a mathematical function used to map the predicted values to
probabilities.
It maps any real value to a value between 0 and 1.
The output of logistic regression must lie between 0 and 1 and cannot go beyond
these limits, so it forms an "S"-shaped curve. This S-shaped curve is called the
sigmoid function or the logistic function.
In logistic regression, we use a threshold value to turn the probability into a
class of 0 or 1: values above the threshold are assigned to 1, and values below
the threshold are assigned to 0.
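The sigmoid function and the threshold rule can be sketched as follows. The
threshold of 0.5 is a common default but is an assumption here, since the text
does not fix a value. The function names are also chosen only for this
example.

```python
import math


def sigmoid(z: float) -> float:
    """Map any real number to a value between 0 and 1."""
    # Two branches avoid overflow in math.exp for large negative z.
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    ez = math.exp(z)
    return ez / (1.0 + ez)


def classify(z: float, threshold: float = 0.5) -> int:
    """Turn the probability into class 1 or class 0 using the threshold."""
    return 1 if sigmoid(z) >= threshold else 0


for z in (-4, -1, 0, 1, 4):
    print(z, round(sigmoid(z), 3), classify(z))
```

Here z stands for the linear combination of the inputs, so z = 0 corresponds
to a probability of exactly 0.5.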
Q2 Explain the concept of over-fitting problem. How can this be
avoided?
Ans:
Overfitting
Ø Overfitting occurs when our machine learning model tries to cover all the
data points, or more than the required data points, present in the given
dataset. Because of this, the model starts capturing the noise and inaccurate
values present in the dataset, and these factors reduce the efficiency and
accuracy of the model. An overfitted model has low bias and high variance.
Ø The chance of overfitting increases the longer we train the model: the more
we train it on the same data, the more likely it is to become overfitted.
Ø Overfitting is a major problem in supervised learning.
Example: The concept of overfitting can be understood from a graph of the linear
regression output:
[Figure: regression output drawn over a scatter plot of the data points]
As the graph shows, the model tries to cover all the data points present in the
scatter plot. It may look efficient, but in reality it is not. The goal of the
regression model is to find the best-fit line, and a model that chases every
point is not a best fit, so it will generate prediction errors on new data.
How to avoid Overfitting in a Model
Both overfitting and underfitting degrade the performance of a machine learning
model. Overfitting is usually the main concern, and there are several ways to
reduce its occurrence in our model:
• Cross-Validation
• Training with more data
• Removing features
• Early stopping the training
• Regularization
• Ensembling
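
Of the techniques listed above, cross-validation is the easiest to show in a few
lines. The sketch below only produces the index splits for k-fold
cross-validation. The function name k_fold_indices is hypothetical, and the
sketch assumes the data has already been shuffled. Training and scoring a model
on each split is left out.

```python
def k_fold_indices(n_samples: int, k: int):
    """Yield (train_indices, validation_indices) for each of the k folds."""
    # Spread any remainder over the first folds so sizes differ by at most 1.
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0)
                  for i in range(k)]
    start = 0
    for size in fold_sizes:
        stop = start + size
        val_idx = list(range(start, stop))
        train_idx = [i for i in range(n_samples) if i < start or i >= stop]
        yield train_idx, val_idx
        start = stop


for train_idx, val_idx in k_fold_indices(6, 3):
    print("train:", train_idx, "validate:", val_idx)
```

Every sample is used for validation exactly once. A model that scores well on
the training folds but poorly on the validation folds is likely overfitting.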
