Intro to
DATA SCIENCE
What is Data Science?
● Data Science is about drawing useful conclusions from large
and diverse data sets through
○ Exploration involves identifying patterns in information.
○ Prediction involves using information we know to make informed
guesses about values we wish we knew.
○ Inference involves quantifying our degree of certainty
● Data Science is about
○ Data gathering
○ Analysis
○ Decision-making.
Why Data Science?
● Data-driven decision making has already transformed a
tremendous breadth of industries, including finance,
advertising, manufacturing, and real estate.
● Data is created every day
○ Every minute, 300+ hours of video are uploaded to YouTube.
○ Every minute, 31.25 million messages are transferred.
● Deriving interesting insights from unstructured or semi-structured data can add significant business value.
Where is Data Science Needed?
● Email Filtering
● E-Commerce Recommendations
● Search Engines
● Stock markets
● Fraud Detection
● Generating Art
● Story Generation
● Disease Diagnosis
Data Science Pipeline
0. Understand Problem statement
1. Data Acquisition
2. Data Pre-processing
3. Exploratory Analysis – Visualisation
4. Machine Learning and Deep Learning Modelling
5. Final Insights
Data Science Career
● Data Science Engineer
● Data Analyst (Statistics, Visualisation)
● Machine Learning Engineer
● Deep Learning Engineer
Data Engineering Skills
● Programming languages
● Database Management
● Analytics/BI tools
● Solid Knowledge of Operating Systems
● Containers
● Domain and Business Expertise
● Optimization
● Data Governance and Security
● Creativity and Collaboration
● Machine learning/AI
Mathematics required for Data Science
● Linear Algebra
○ Matrices and Vectors
○ Matrix Multiplication
○ Eigenvalues and Eigenvectors
● Calculus
○ Integrals
○ Derivatives
○ Gradient Descent
● Statistics
○ Descriptive statistics
○ Probability & Conditional Probability
What is Statistics?
● The practice or science of collecting and analyzing numerical data in large
quantities, especially for the purpose of inferring proportions in a whole from those
in a representative sample.
● It’s a branch of science that deals with numeric data. It is mainly used to infer
knowledge about big sections (parts / proportions) by trying to understand a
smaller part of it.
Why do we need Statistics ?
● The major reason is that it helps us understand data better.
● When we understand the data better, we work with it better.
● When we work with it better, our deliverables (outcomes) from the data turn out to be better.
What kind of data do we deal with?
Descriptive Statistics
● Descriptive statistics are brief descriptive coefficients that summarize a given data set, which can represent either the entire population or a sample of it.
● Descriptive statistics are broken down into measures of central tendency and
measures of variability (spread)
Measures of central tendency
● When we say measures of central tendency, we are basically talking about 3 things: the mean, the median, and the mode.
● The mean, also referred to as the average or the expected value, is nothing but the average value of the population or the sample. This value gives us an approximate idea of what the entire data looks like.
Example: for the values 5, 6, 7, the Mean = (5 + 6 + 7) / 3 = 6
Mean
● Statistical formula: Mean (X̄) = ∑x / N, where
○ X̄ = mean
○ ∑x = summation of all the values
○ N = number of observations
● Simpler formula: Mean = (summation of all the values) / (total number of observations)
Examples (Mean)
● There are 3 baskets A, B, C having 12, 14, and 22 apples respectively, from 3 different locations. The farmer Mr. Orchard wants to know how many apples he has on average.
○ Answer: (12 + 14 + 22) / 3 = 16
● There are 3 baskets A, B, C having 12, 14, and 220 apples respectively, from 3 different locations. The farmer Mr. Orchard wants to know how many apples he has on average.
○ Answer: (12 + 14 + 220) / 3 = 82
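A quick check of these two answers with Python's statistics module:

import statistics

baskets = [12, 14, 22]
baskets_with_outlier = [12, 14, 220]

print(statistics.mean(baskets))               # 16
print(statistics.mean(baskets_with_outlier))  # 82 -- one very full basket pulls the mean up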
Understanding outliers
● A single extreme value (an outlier) drags the mean far away from the typical value. In the example, the mean works out to $333,334,233.33 (approx. $333,000,000).
Median
● The median is the middle observation after the observations are arranged in ascending order. Because of this, it is not pulled in either direction by outliers.
Example (Median)
● For the same data, the mean is $333,334,233.33 while the median is only $1,500; the outlier barely affects the median.
Example (Median)
● 4 families meet at a restaurant to have dinner together. The manager asks the host for the average number of people in each family so that he can arrange the tables accordingly. The host recollected:
● Family 1 : 3
● Family 2 : 4
● Family 3 : 5
● Family 4 : 5
● Help the host find the median from the above data.
○ Answer: arranging the values as 3, 4, 5, 5, the median is the average of the two middle values, (4 + 5) / 2 = 4.5
Mode
● Mode is the observation with the highest number of occurrence (frequency) in the
data. The data doesn’t need to be arranged, we can identify the mode just by
glancing over the frequency of each observation.
Example (MODE)
● The amount of money each person has in his pocket is mentioned below:
● 8, 5, 7, 10, 15, 21, 5, 7, 2, 5
○ 2, 8, 10, 15 and 21 occur 1 time
○ 7 occurs 2 times
○ 5 occurs 3 times
● Answer : The most occurring number is 5, which occurs three times.
Example (MODE)
● 4 families meet at a restaurant to have dinner together. The manager asks the host for the average number of people in each family so that he can arrange the tables accordingly. The host recollected:
● Family 1 : 3
● Family 2 : 4
● Family 3 : 5
● Family 4 : 5
● Help the host find the mode from the above data.
○ Answer : 5
Bimodal
● A bimodal distribution has 2 values whose frequencies stand out above all the others.
● Example:
Consider the sales of all the beverages from a shop. Coffee and tea are each likely to outsell every other drink in that section, giving the data two peaks.
Understanding the bell curve (normal curve)
● The curve is named for its resemblance to a bell.
● The bell shape of the curve portrays the symmetric nature of the distribution.
● Example:
If we pick a random group of people from a crowd of all age groups, we will find that most of them are in the 30-50 age range; the chance of picking a person aged 80+ or a kid of, say, 10 is really low.
Skewed Data
● Negatively skewed: Mode > Median > Mean
● Positively skewed: Mean > Median > Mode
What is Measure of Spread
● It is a measure that gives us an idea of how spread out the data is.
● Just because 2 data sets have the same mean, doesn’t make the data sets similar.
Examples
● There are 2 classes with 5 students each. The obtained marks are as follows:
○ Class 1 : [ -10, 0, 10, 20, 30 ]
○ Class 2 : [ 8, 9, 10, 11, 12 ]
● The mean for both classes is 10. Taking that into account, can we say that the data sets are similar?
● Answer: No, the data sets have different spreads.
○ Range for class 1 : 30 - (-10) = 40
○ Range for class 2 : 12 - 8 = 4
Why isn't range the best way to understand spread?
● As we saw earlier, even reasonable summary measures can fail to give us the correct interpretation of a data set.
● Range is no different: there can be cases where the means are the same and the ranges are the same, but the values differ, telling us that the data sets actually have different spreads.
● If we go with ranges all the time, we might end up with a completely wrong interpretation.
Variance (σ²)
Variance Definition in Statistics:
● It is a measure of how data points differ from the mean. In layman's terms, it is a measure of how far a set of data (numbers) is spread out from its mean (average) value.
Examples
● There are 2 classes with 5 students each. The obtained marks are as follows:
○ Class 1 : [ -10, 0, 10, 20, 30 ]
○ Class 2 : [ 8, 9, 10, 11, 12 ]
● Variance:
○ Class 1 : [(-10-10)² + (0-10)² + (10-10)² + (20-10)² + (30-10)²] / 5 = 1000 / 5 = 200
○ Class 2 : [(8-10)² + (9-10)² + (10-10)² + (11-10)² + (12-10)²] / 5 = 10 / 5 = 2
Standard Deviation (σ)
● The standard deviation is nothing but the root over of
variance. It is a quantity expressing by how much the
members of a group differ from the mean value for the group.
Examples
● There are 2 classes with 5 students each. The obtained marks are as follows:
○ Class 1 : [ -10, 0, 10, 20, 30 ]
○ Class 2 : [ 8, 9, 10, 11, 12 ]
● Standard Deviation:
○ Class 1 : √(1000 / 5) = √200 = 10√2 ≈ 14.14
○ Class 2 : √(10 / 5) = √2 ≈ 1.41
What’s the problem with Variance
● Taking the previous example, the values for the classes are marks. When we take the variance into consideration, we get the final value in terms of marks², which doesn't make a lot of sense.
That's the reason we use standard deviation over variance.
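A quick check of the values above with Python's statistics module (pvariance and pstdev divide by N, matching the formula used here):

import statistics

class_1 = [-10, 0, 10, 20, 30]
class_2 = [8, 9, 10, 11, 12]

print(statistics.pvariance(class_1))  # 200
print(statistics.pvariance(class_2))  # 2
print(statistics.pstdev(class_1))     # 14.14... = 10*sqrt(2) marks
print(statistics.pstdev(class_2))     # 1.41...  = sqrt(2) marks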
Probability
● The quality or state of being probable; the extent to which
something is likely to happen or be the case.
● Basically it tells you the chances of an event to occur or not
occur. We can quantify it in terms of a fraction, a decimal or in
terms of percentages.
● The value is always between 0 and 1.
So how to calculate it?
● A = event you want to find the probability for.
● P(A) = probability of that event.
Examples
● What is the probability of getting a head for a fair / unbiased
coin.
● Ans. A = Getting a head
● P(A) = Favorable outcomes / Total number of outcomes
● P(A) = 1/2 or 0.5 or 50%
Examples
● What is the probability of not getting a head for a fair / unbiased coin.
● Ans. A = Not getting a head or getting a tail
● P(A) = Favorable outcomes / Total number of outcomes
● P(A) = 1/2 or 0.5 or 50%
Examples
● What is the probability of getting a 4 in a roll of a die,
considering that it is unbiased.
● Ans. An unbiased die has an equal chance of landing on any of its sides, which means all six outcomes are equally likely.
● A = Getting a 4
● P(A) = 1/6
Union and Intersection
● P(A∩B) = the intersection (overlap) of events A and B.
P(A⋃B) = the union (entire area) of A and B.
● P(A∩B) = P(A) × P(B) (when A and B are independent)
P(A⋃B) = P(A) + P(B) - P(A∩B)
Examples
● There are 52 playing cards in a deck. What is the probability of getting a card that is both a jack and a heart?
● Ans. It's a known fact that there are 4 suits in a deck, and each suit has exactly 1 jack, so the answer should be 1/52. Let's test it out.
● P(A) = probability that a card is a jack = 4/52 = 1/13
P(B) = probability that a card is a heart = 13/52 = 1/4
● P(A⋂B) = (1/13) × (1/4) = 1/52
Examples
● There are 52 playing cards in a deck, what is the probability of
getting a jack or a heart.
● Ans.
● P(A) = probability that a card is a jack = 4/52 = 1/13
P(B) = probability that a card is a heart = 13/52 = 1/4
P(A⋂B) = 1/52
● P(A⋃B) = P(A) + P(B) - P(A⋂B)
= 1/13 + 1/4 - 1/52
= 16/52 = 4/13
Examples
● What is the probability of getting a 4 and a 6 in a single roll of a die, considering that it is unbiased?
● A = event where 4 occurs.
B = event where 6 occurs.
● P(A∩B) = the probability that A and B both occur at the same time, which is not possible on a single roll.
● Since the two events cannot occur together (there is no intersection),
● P(A∩B) = 0
Conditional Probability
● Conditional probability is finding the probability of an event
given the other event occurs.
● Formula:
P(B|A) = P(B ∩ A)/P(A)
● So basically we try to find the probability of an event based
on another event that has already occurred.
Examples
● Two dice are thrown simultaneously, and the sum of the numbers obtained is found to be 7. What is the probability that the number 3 has appeared at least once?
● A = {(3,1), (3,2), (3,3), (3,4), (3,5), (3,6), (1,3), (2,3), (4,3), (5,3), (6,3)}
B = {(1,6), (2,5), (3,4), (4,3), (5,2), (6,1)}
● P(A) = 11/36, P(B) = 6/36
● A∩B = {(3,4), (4,3)}, so P(A∩B) = 2/36
● Applying the conditional probability formula we get,
P(A|B) = P(A∩B)/P(B) = (2/36)/(6/36) = 1/3
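The same answer can be verified by enumerating all 36 equally likely outcomes in Python:

from itertools import product

outcomes = list(product(range(1, 7), repeat=2))   # all 36 rolls of two dice

B = [o for o in outcomes if sum(o) == 7]          # sum is 7
A_and_B = [o for o in B if 3 in o]                # sum is 7 AND a 3 appears

# P(A|B) = P(A∩B) / P(B) = |A∩B| / |B| because the outcomes are equally likely
print(len(A_and_B) / len(B))                      # 0.333... = 1/3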
Examples
● In a group of 100 computer buyers, 40 bought Brand 1, 30 bought Brand 2, and 20 bought both Brand 1 and Brand 2. If a buyer chosen at random bought Brand 1, what is the probability that they also bought Brand 2?
● Let A = bought Brand 1 and B = bought Brand 2.
P(A) = 40/100 = 0.4, P(A∩B) = 20/100 = 0.2
● P(B|A) = P(A∩B)/P(A) = 0.2/0.4 = 0.5 or 1/2
The probability that the person bought a Brand 2 computer given that they purchased Brand 1 is 50%.
Data Scientist vs Statistics
Pillars of Knowledge
● Analytics knowledge and toolsets
● Domain knowledge and collaboration
● (Big) data management and (new) IT skills
Data Pre-processing
Data Pre-processing
● Data Cleaning
● Missing Values
○ Impute missing values with median/mode
○ K-nearest neighbors
○ Bagging Tree
● Centring and Scaling
● Resolve Skewness
● Resolve Outliers
● Collinearity
● Sparse Variables
● Re-encode Dummy Variables
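A minimal scikit-learn sketch of two of these steps, imputing missing values with the median and then centring and scaling; the array below is made up for illustration:

import numpy as np
from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler

X = np.array([[1.0, 200.0],
              [2.0, np.nan],    # a missing value to impute
              [3.0, 240.0],
              [4.0, 260.0]])

prep = Pipeline([
    ("impute", SimpleImputer(strategy="median")),  # fill missing values with the column median
    ("scale", StandardScaler()),                   # centre to mean 0, scale to unit variance
])
print(prep.fit_transform(X))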
An introduction to Machine Learning
● The field of study that gives computers the ability to learn
without being explicitly programmed.
● Automating and improving the learning process of
computers based on their experiences without being
actually programmed i.e. without any human assistance.
● Traditional Programming vs Machine Learning
● How does ML work?
○ Gathering data
○ Data Processing
○ Divide the input data into training, cross-validation, and test sets
○ Building models
○ Testing our conceptualized model with data that was not fed to the
model
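A minimal sketch of this workflow with scikit-learn; the synthetic dataset, model choice, and split sizes are illustrative assumptions:

from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

# 1. Gather data (synthetic here)
X, y = make_classification(n_samples=500, n_features=10, random_state=42)

# 2./3. Process and divide into training and test sets (a validation set can be carved out the same way)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# 4. Build and train a model
model = LogisticRegression(max_iter=1000)
model.fit(X_train, y_train)

# 5. Test on data that was not fed to the model during training
print(accuracy_score(y_test, model.predict(X_test)))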
Types of machine learning problems
● Supervised learning
○ Classification
○ Regression
● Unsupervised learning
○ Clustering
○ Dimensionality reduction
● Semi-supervised learning
● Reinforcement learning
Terminologies of Machine Learning
● Model
● Feature
● Target (Label)
● Training
● Prediction
Diving into Machine Learning
● Importing the Data → EDA → Data Transformation → Model Selection → Training the Model → Testing the Model → Deployment
Importing the Data
● The data can be in various formats and distributed amongst various files. Combining them is one of the crucial tasks before processing the data any further.
● Many a time the data is unstructured; understanding its flow and pattern and converting it into a consistent format is another task that comes under this process.
● Data Cleaning: sometimes there are missing values in the data that need to be taken care of, and there are various ways of doing so.
Exploratory Data Analysis (EDA)
● Viewing the data in various ways to understand the
structure.
● Understanding how the values are distributed in the columns.
● Use visual and non-visual methods to understand the data.
● Make a note of all the inferences, insights, and assumptions that you gain or make from the visuals.
Data Transformation
● Fix any discrepancy in the data types of the column values, and modify the data into a consistent format where needed.
● Sometimes you need to derive new columns from the existing ones, or remove a few columns because they don't add any information that helps us build a better model.
Model selection
● After understanding the data and transforming it in the best way possible for further computation, choosing the model is an important task. Even though a complex model generally gives better accuracy, we don't start off with one.
● We start off with weak (simple) models and gradually move up the ladder. Each model is selected, trained, tested, and tuned until we feel that that is the best the model can do on this data set.
Training and Testing the Model
● Each selected model is trained multiple times in multiple ways, to understand why one training set is better than another or why one model performs better than another.
● We can only judge how good a model is by testing what it has learned on data that it has not been trained on.
● The test data has to be unseen by the model, so that we can understand how well the model works on data it hasn't seen before.
Deployment of Model
● After being satisfied with the results we deploy the model
in the real world to see how it works on the data from the
real world.
● The task doesn’t end here. If the model works fine we try to
make it work better. If the model fails to work, then we
take it up again and go through the entire cycle all over
again. It’s a continuous process.
Supervised learning
● Supervised learning is when the model is getting trained on
a labelled dataset. A labelled dataset is one that has both
input and output parameters.
● Classification - a Supervised Learning task where the output has defined (discrete) labels.
○ Binary Classification
○ Multiclass Classification
● Regression - a Supervised Learning task where the output is a continuous value.
○ Linear regression
○ Polynomial regression
Supervised Learning Algorithms
● Linear Regression
● Logistic Regression
● K Nearest Neighbor
● Gaussian Naive Bayes
● Decision Trees
● Support Vector Machine (SVM)
Linear Regression
● Linear regression performs the task to
predict a dependent variable value (y)
based on a given independent variable
(x). So, this regression technique finds
out a linear relationship between x
(input) and y(output). Hence the name,
Linear Regression.
Linear Regression
Formula:
● Since we are trying to build a linear relation between the 2 variables, we use the formula for a straight line.
● The formula can be written as y = mx + c, or equivalently y = θ1 + θ2·x, where
○ c = θ1 = Intercept
○ m = θ2 = Slope
How do we update the θ1 and θ2 values?
● To find the best-fit line we need the best θ1 and θ2 values. To find them, we minimize the cost function (J), which represents the difference between the predicted and actual values.
● Since the differences between predicted and actual values can be positive or negative, we square each error so they don't cancel out, and average the squared errors. This gives the Mean Squared Error (MSE):
J(θ1, θ2) = (1/n) Σ (y_predicted - y_actual)²
How do we understand if the line is the best fit?
● θ1 and θ2 are randomly selected at first and then optimized using Gradient Descent on the cost function. When the cost function reaches its minimum, we consider the corresponding line to be the best-fit line.
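A minimal NumPy sketch of this gradient-descent loop; the toy data, learning rate, and iteration count are illustrative assumptions:

import numpy as np

# Toy data with a roughly linear relationship: y ≈ 2x + 1 plus noise
rng = np.random.default_rng(0)
x = np.linspace(0, 10, 50)
y = 2 * x + 1 + rng.normal(0, 1, size=x.shape)

theta1, theta2 = 0.0, 0.0          # intercept and slope, start anywhere
lr, n = 0.01, len(x)

for _ in range(5000):
    y_pred = theta1 + theta2 * x
    error = y_pred - y
    # Gradients of the MSE cost J = (1/n) * sum(error²)
    theta1 -= lr * (2 / n) * error.sum()
    theta2 -= lr * (2 / n) * (error * x).sum()

print(theta1, theta2)  # should land close to the true intercept 1 and slope 2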
Other ways of evaluation
● MAE: Mean Absolute Error, the average of the absolute values of the errors.
● RMSE: the Mean Squared Error can be difficult to interpret at times because its units are squared; taking the root of it (the Root Mean Squared Error) gives us a better understanding.
● For R², we first measure the variation in the error terms when we fit a line at the mean of the distribution. Then we fit our regression line and see how much of that variation is explained by the new fitted line. The more variation explained, the better the line. The value is always between 0 and 1.
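These metrics can be computed with scikit-learn; a small sketch on made-up true and predicted values:

import numpy as np
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

y_true = np.array([3.0, 5.0, 7.0, 9.0])
y_pred = np.array([2.5, 5.5, 7.0, 8.0])

print(mean_absolute_error(y_true, y_pred))          # MAE  = 0.5
print(np.sqrt(mean_squared_error(y_true, y_pred)))  # RMSE ≈ 0.61
print(r2_score(y_true, y_pred))                     # R²   ≈ 0.93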
Other ways of evaluation
● Adjusted R-squared is a modified version of R-squared that has been
adjusted for the number of predictors in the model. The adjusted R-
squared increases when the new term improves the model more than it
would be expected by chance. It decreases when a predictor improves the
model by less than expected. Typically, the adjusted R-squared is positive.
It is always lower than the R-squared.
R² = [Var(mean) - Var(line)] / Var(mean)
Example
Consider Var(mean) = 32 and Var(line) = 6.
● R² = [Var(mean) - Var(line)] / Var(mean)
● R² = (32 - 6) / 32
● R² = 26/32 = 0.8125
● That means the line explains 81.25% of the variance; the remainder is considered error that cannot be explained.
Assumptions to Linear Regression
● Linear relationship: There exists a linear relationship between the
independent variable, x, and the dependent variable, y.
● Independence: The residuals are independent. In particular, there is no
correlation between consecutive residuals in time series data.
● Homoscedasticity: The residuals have constant variance at every level of x.
● Normality: The residuals of the model are normally distributed.
● If one or more of these assumptions are violated, then the results of our
linear regression may be unreliable or even misleading.
Multiple Linear Regression
● What is a line in 2D becomes a plane in 3D; as the number of dimensions increases, the dimension of the fitted surface increases with it.
● To take all of the independent variables into account along with the dependent variable, the line has to become a plane in 3D, and a hyperplane in higher dimensions.
Logistic Regression
● Logistic regression is basically a supervised classification algorithm. In this
analytics approach, the dependent variable is finite or categorical: either
A or B (binary regression) or a range of finite options A, B, C or D
(multinomial regression).
● It is used in statistical software to understand the relationship between
the dependent variable and one or more independent variables by
estimating probabilities using a logistic regression equation.
Evaluate Logistic Regressions
● Confusion matrix is a good way to have a look at the correctly identified
classes and misclassified classes.
● Using the values from there we can find the accuracy. The formula for
that is total number of correctly classified records divided by the total
number of records.
Logistic Regression
● Stratified Sampling: when there is a class imbalance it's best to use stratified sampling; this makes sure that the test data and train data have the same class proportions.
Example:
● Total number of classes : {0: 100 , 1:50}
● Considering the test size to be 20%:
○ Test records for the model {0 : 20, 1:10}
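This is what the stratify option of scikit-learn's train_test_split does; a quick sketch reproducing the counts above:

import numpy as np
from sklearn.model_selection import train_test_split

y = np.array([0] * 100 + [1] * 50)      # 100 samples of class 0, 50 of class 1
X = np.arange(150).reshape(-1, 1)       # placeholder features

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)
print(np.bincount(y_test))  # [20 10] -- class proportions preserved in the test split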
Logistic Regression
● Types of Logistic Regression:
○ Binomial: In binomial Logistic regression, there can be only two possible types of the
dependent variables, such as 0 or 1, Pass or Fail, etc.
○ Multinomial: In multinomial Logistic regression, there can be 3 or more possible
unordered types of the dependent variable, such as "cat", "dogs", or "sheep"
○ Ordinal: In ordinal Logistic regression, there can be 3 or more possible ordered types of
dependent variables, such as "low", "Medium", or "High".
● Assumptions for Logistic Regression:
○ The dependent variable must be categorical in nature.
○ The independent variable should not have multi-collinearity.
K nearest Neighbours
● KNN stands for K Nearest Neighbors. Here k is nothing but a placeholder for the number of neighbors you want to take into consideration. For example, with k = 3 we take the 3 nearest neighbors.
● How do we measure which elements are close? We use a distance measure to decide that.
Distance Measures for KNN
● Euclidean: the distance is calculated along a straight line between the two points.
● Manhattan: the distance is the sum of the vertical and horizontal distances, i.e. the sum of the absolute differences along each axis.
● Minkowski: a generalized distance with a parameter p; p = 1 gives the Manhattan distance and p = 2 gives the Euclidean distance.
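These distances can be computed with SciPy; a quick sketch on two made-up points:

import numpy as np
from scipy.spatial import distance

a, b = np.array([1.0, 2.0]), np.array([4.0, 6.0])

print(distance.euclidean(a, b))        # 5.0 (straight-line distance)
print(distance.cityblock(a, b))        # 7.0 (Manhattan: |1-4| + |2-6|)
print(distance.minkowski(a, b, p=3))   # general form; p=1 is Manhattan, p=2 is Euclidean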
KNN
● Model Summary:
● Precision: what proportion of positive identifications was actually correct?
● Recall: what proportion of actual positives was identified correctly?
● F1 Score: it is calculated from the precision and recall of the test; the F1 score is the harmonic mean of precision and recall:
F1 = 2 × (precision × recall) / (precision + recall)
KNN
● Confusion Matrix:
[TP: 1, FP: 1
FN: 8, TN: 90]
● Precision = 1 / (1 + 1) = 0.5
● Recall = 1 / (1 + 8) ≈ 0.11
● F1-Score = 2 × (0.055 / 0.61) ≈ 0.18
● Accuracy = 91 / 100 = 0.91
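The same numbers can be reproduced directly from the confusion-matrix counts:

TP, FP, FN, TN = 1, 1, 8, 90

precision = TP / (TP + FP)
recall = TP / (TP + FN)
f1 = 2 * precision * recall / (precision + recall)
accuracy = (TP + TN) / (TP + FP + FN + TN)

print(round(precision, 2), round(recall, 2), round(f1, 2), round(accuracy, 2))
# 0.5 0.11 0.18 0.91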
KNN
● Grid Search Cross Validation: a hyperparameter tuning method where you put in all the parameter values you want to train and test your model with; every combination of the passed values is tried, and you select the best one.
● Random Search Cross Validation: similar to Grid Search, but instead of trying every combination it samples a fixed number of random combinations from the parameter space. (Better suited for larger datasets and a larger number of parameters.)
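A minimal sketch of grid search with scikit-learn's GridSearchCV, using KNN on the Iris data purely as an example; RandomizedSearchCV has the same interface but takes an n_iter argument for the number of sampled combinations:

from sklearn.datasets import load_iris
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)

param_grid = {"n_neighbors": [3, 5, 7, 9], "weights": ["uniform", "distance"]}
grid = GridSearchCV(KNeighborsClassifier(), param_grid, cv=5)
grid.fit(X, y)

print(grid.best_params_, grid.best_score_)  # best combination and its cross-validated accuracy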
KNN
● K Fold and Stratified K Fold Cross Validation: in K fold we split the data set into k folds; each fold in turn is held out for testing while the remaining folds are used for training. This gives us k accuracy scores, and we can either look at the range in which the accuracies lie or just take the average accuracy.
● Stratified K Fold CV is just an extension of the same old K fold; the only difference is that the data here is stratified, i.e. each training and testing split has the same proportion of each class.
Naive Bayes
● Naive Bayes uses the Bayes’ Theorem and assumes that all predictors are
independent. In other words, this classifier assumes that the presence of
one particular feature in a class doesn’t affect the presence of another
one.
● The technique is called "naive" precisely because of this assumption that the features are conditionally independent.
Bayes Theorem
● If A and B are two events, then the formula for Bayes theorem is given
by:
● Conditional Probability:
- P(A|B) = P(A∩B)/P(B)
● Bayes' Formula:
- P(B|A) = (P(A|B) x P(B))/P(A)
- P(B|A) = ((P(A∩B)/P(B)) x P(B))/P(A)
- P(B|A) = P(A∩B)/P(A)
Applying Bayes Theorem to Understand
Conditional Independence
Q. Let event A signify the occurrence of a head on a coin toss and event B be the occurrence of a 1 in a roll of a die. Find the probability of B given that A occurs.
● P(A) = 1/2 and P(B) = 1/6
P(A∩B) = P(A) × P(B) = 1/12 [since the two events are independent]
● P(B|A) = P(A∩B)/P(A)
P(B|A) = (1/12) / (1/2)
P(B|A) = 1/6
● P(B|A) = P(B) = 1/6, as expected when the events are independent.
Naive Bayes
● P(y | x1, x2) = [P(x1|y) × P(x2|y) × P(y)] / [P(x1) × P(x2)]
● x1 = Good; x2 = Bad
● P(Y | Good, Bad) = [P(Good|Y) × P(Bad|Y) × P(Y)] / [P(Good) × P(Bad)]
Naive Bayes
● P(Y | Good, Bad) = [P(Good|Y) × P(Bad|Y) × P(Y)] / [P(Good) × P(Bad)] = (½ × ½ × ⅔) / (⅓ × ⅔)
● = 3/4, or a 75% probability.
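The arithmetic can be checked with Python's fractions module; the probabilities below are simply the ones quoted above (the underlying frequency table is not reproduced here):

from fractions import Fraction

p_good_given_y = Fraction(1, 2)
p_bad_given_y = Fraction(1, 2)
p_y = Fraction(2, 3)
p_good = Fraction(1, 3)
p_bad = Fraction(2, 3)

posterior = (p_good_given_y * p_bad_given_y * p_y) / (p_good * p_bad)
print(posterior)  # 3/4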
SVM
● SVM stands for Support Vector Machines, Where the
main agenda of the models is to split the data into n
categories, n being the number of categories present.
The basic version of the SVM uses a linear hyperplane
to distinguish between 2 or more classes.
● A hyperplane is nothing but a line (in 2D) or a plane (in 3D); when a new point is provided, the side of the hyperplane it falls on determines its class.
What are the steps to create a SVM model?
● The Process is simple:
● Step1 : Identify the support vectors.
● Step2 : See if the data is linearly separable.
● Step3 : If it is, draw a linear separator between the classes, placed exactly in the middle (as far as possible from the support vectors).
● Step4 : Your model is ready.
What are the steps for
non linearly separable data points?
● The Process is simple:
● Step1: Use the kernel function to take the data points to a higher dimension.
● Step2 : Identify the support vectors.
● Step3 : Draw the hyperplane in N dimensions.
● Step4 : your model is ready.
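A minimal scikit-learn sketch of both cases, comparing a linear hyperplane with an RBF-kernel SVM on toy, non-linearly-separable data (the dataset and kernel choices are illustrative):

from sklearn.datasets import make_moons
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

X, y = make_moons(n_samples=300, noise=0.2, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

linear_svm = SVC(kernel="linear").fit(X_train, y_train)   # linear hyperplane
rbf_svm = SVC(kernel="rbf").fit(X_train, y_train)         # kernel function maps to a higher dimension

print(linear_svm.score(X_test, y_test), rbf_svm.score(X_test, y_test))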
CART
● CART stands for Classification and Regression Trees. Classification trees are also referred to as decision trees. The model is called a decision tree because it looks like an inverted tree: a decision is made on the basis of some condition and the splits are made accordingly.
● The tree keeps growing until the last nodes (referred to as leaf nodes), where it can't be split any further.
What is the criteria for splitting?
● The root node (starting point) and each subsequent split are chosen using information gain, which can be computed in 2 ways: from entropy or from the Gini index.
● Entropy is calculated by: Entropy = -Σ pᵢ log₂(pᵢ), where pᵢ is the proportion of class i at the node.
Disadvantages of Decision Trees:
Decision trees grown to their maximum depth tend to overfit (generally).
● Overfit: when the model fits the training values almost perfectly but fails to perform well on the testing set, we say that our model has overfit. You can test this by checking the training and testing accuracy: if the training accuracy is far higher than the testing accuracy, it is safe to say that the model overfit.
● Underfit: when the model is unable to fit the data well, i.e. the accuracy is low on both the training and testing sets, we say that the model underfit. This is generally the case when the model we are using is too simple for the data set.
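A quick way to see this train/test gap, sketched with scikit-learn decision trees on an example dataset (the dataset and depth values are illustrative):

from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

deep = DecisionTreeClassifier(random_state=0).fit(X_train, y_train)                 # unrestricted depth
shallow = DecisionTreeClassifier(max_depth=3, random_state=0).fit(X_train, y_train)

# A large gap between training and testing accuracy suggests overfitting
print("deep:   ", deep.score(X_train, y_train), deep.score(X_test, y_test))
print("shallow:", shallow.score(X_train, y_train), shallow.score(X_test, y_test))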
Clustering
● Clustering is a way of grouping the data points into different clusters consisting of similar data points: objects with similarities remain in a group that has little or no similarity with other groups.
● The clustering technique is commonly used for statistical data analysis.
Types of Clustering Methods
● Partitioning Clustering
● Density-Based Clustering
● Distribution Model-Based Clustering
● Hierarchical Clustering
● Fuzzy Clustering
Applications of Clustering
● In Identification of Cancer Cells
● Search Engines
● Customer Segmentation
● Biology
● Land Use
K means Clustering
● We choose an arbitrary value of k, which is the number of centroids and, in turn, the number of clusters.
● The distance from each centroid to every data point is calculated, and each point is assigned to the centroid with the least distance.
● The cycle goes on until the centroid values no longer change significantly.
How does the K-Means Algorithm Work?
● Step-1: Select the number K to decide the number of clusters.
● Step-2: Select K random points as centroids. (They need not come from the input dataset.)
● Step-3: Assign each data point to their closest centroid, which will form the
predefined K clusters.
● Step-4: Calculate the variance and place a new centroid of each cluster.
● Step-5: Repeat the third step, which means reassign each data point to the new closest centroid of each cluster.
● Step-6: If any reassignment occurs, then go to step-4 else go to FINISH.
● Step-7: The model is ready.
How to find the optimal value for K in K means?
● We use something called the elbow graph. The graph plots the value of k
against the amount of distortion in the data points. The point after which
the graph plateaus, is called the elbow point.
● In the accompanying example, the elbow forms at k = 3, suggesting three clusters.
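A small sketch of the elbow method with scikit-learn on synthetic blob data; the data and the range of k are assumptions for illustration:

from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=300, centers=3, random_state=42)

# Inertia (within-cluster sum of squared distances) for a range of k; look for the elbow
for k in range(1, 7):
    km = KMeans(n_clusters=k, n_init=10, random_state=42).fit(X)
    print(k, round(km.inertia_, 1))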
Hierarchical Clustering
● We use something called a dendrogram for
this purpose. There are 2 approaches:
● Top down approach: Consider all the data
points to be in one single cluster, and then go
on splitting.
● Bottom up approach: Consider all the data
points to be single clusters and then club
them as you move up.
● Now in each case the distance is calculated
and the ones with the least distance are
clubbed together.
How do linkage affect the dendrogram ?
● There are 3 kind of linkage that can be used:
● Single Linkage: For two clusters C1 and C2, the single linkage returns the
minimum distance between two points i and j such that i belongs to C1
and j belongs to C2.
● Complete Linkage: For two clusters C1 and C2, the complete linkage
returns the maximum distance between two points i and j such that i
belongs to C1 and j belongs to C2.
● Average Linkage: For two clusters C1 and C2, first the distance between every data point i in C1 and every data point j in C2 is computed, and then the arithmetic mean of these distances is taken.
Working of Dendrogram in Hierarchical Clustering
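A minimal SciPy sketch of the bottom-up approach and the linkage matrix a dendrogram is drawn from; the two-group data is made up for illustration:

import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 1, (20, 2)), rng.normal(5, 1, (20, 2))])

# Agglomerative clustering; try "single", "complete", or "average" linkage
Z = linkage(X, method="average")
labels = fcluster(Z, t=2, criterion="maxclust")   # cut the dendrogram into 2 clusters
print(labels)
# scipy.cluster.hierarchy.dendrogram(Z) would draw the dendrogram itself.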
DBSCAN
● Density-Based Clustering identifies distinctive groups/clusters in
the data, based on the idea that a cluster in data space is a
contiguous region of high point density, separated from other
such clusters by contiguous regions of low point density.
● DBSCAN can discover clusters of different shapes and sizes in a large amount of data that contains noise and outliers.
DBSCAN
● The DBSCAN algorithm uses two parameters:
● minPts: The minimum number of points (a threshold) clustered together
for a region to be considered dense.
● eps (ε): A distance measure that will be used to locate the points in the
neighborhood of any point.
● Reachability in terms of density establishes a point to be reachable from
another if it lies within a particular distance (eps) from it.
● Connectivity, on the other hand, involves a transitivity-based chaining approach to determine whether points are located in a particular cluster.
DBSCAN
● Core: a point that has at least minPts points within distance eps of itself.
● Border: a point that has at least one Core point within distance eps.
● Noise: a point that is neither a Core nor a Border point; it has fewer than minPts points within distance eps of itself.
Algorithmic steps for DBSCAN
clustering
● The algorithm proceeds by arbitrarily picking a point in the dataset (until all points have been visited).
● If there are at least minPts points within a radius of ε of the point, then we consider all these points to be part of the same cluster.
● The clusters are then expanded by recursively repeating the neighborhood calculation for each neighboring point.
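A minimal scikit-learn sketch of DBSCAN on toy crescent-shaped data; the eps and min_samples values are illustrative assumptions:

from sklearn.cluster import DBSCAN
from sklearn.datasets import make_moons

X, _ = make_moons(n_samples=300, noise=0.05, random_state=0)

db = DBSCAN(eps=0.2, min_samples=5).fit(X)
print(set(db.labels_))   # cluster ids; -1 marks noise points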
Dimensionality Reduction
● It is a way of converting a higher-dimensional dataset into a lower-dimensional one while ensuring that it provides similar information.
● It is commonly used in the fields that deal with high-dimensional data.
Dimensionality Reduction
Benefits
● By reducing the dimensions of the features, the space required to store
the dataset also gets reduced.
● Less Computation training time is required for reduced dimensions of
features.
● Reduced dimensions of features of the dataset help in visualizing the data
quickly.
● It removes the redundant features (if present) by taking care of
multicollinearity.
Dimensionality Reduction
Disadvantages
● Some data may be lost due to dimensionality reduction.
● In the PCA dimensionality reduction technique, the number of principal components to retain is sometimes not known in advance.
Dimensionality Reduction
Feature Selection
● Filters Methods
● Wrappers Methods
● Embedded Methods
Feature Extraction
● Principal Component Analysis
● Linear Discriminant Analysis
● Kernel PCA
● Quadratic Discriminant Analysis
Common techniques of Dimensionality Reduction
● Principal Component Analysis
● Backward Elimination
● Forward Selection
● Missing Value Ratio
● Low Variance Filter
● High Correlation Filter
● Random Forest
● Factor Analysis
● Auto-Encoder
Principal Component Analysis
● It is a statistical process that converts the observations of correlated
features into a set of linearly uncorrelated features with the help of
orthogonal transformation.
● These new transformed features are called the Principal Components.
● PCA generally tries to find the lower-dimensional surface to project the
high-dimensional data.
● The PCA algorithm is based on some mathematical concepts such as:
○ Variance and Covariance
○ Eigenvalues and Eigenvectors
Some common terms used in PCA algorithm
● Dimensionality
● Correlation
● Orthogonal
● Eigenvectors
● Covariance Matrix
Steps for PCA algorithm
● Getting the dataset
● Representing data into a structure
● Standardizing the data
● Calculating the Covariance of Z
● Calculating the Eigen Values and Eigen Vectors
● Sorting the Eigen Vectors
● Calculating the new features Or Principal Components
● Remove less or unimportant features from the new dataset.
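A compact scikit-learn sketch of these steps (standardize, compute the components, keep the top ones), using the Iris data purely as an example:

from sklearn.datasets import load_iris
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

X, _ = load_iris(return_X_y=True)
X_std = StandardScaler().fit_transform(X)   # standardize before PCA

pca = PCA(n_components=2)
X_2d = pca.fit_transform(X_std)

print(X_2d.shape)                     # (150, 2)
print(pca.explained_variance_ratio_)  # variance captured by each principal component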
Linear Discriminant Analysis
● It is used as a pre-processing step in Machine Learning and applications of
pattern classification.
● The goal of LDA is to project the features in higher dimensional space onto a
lower-dimensional space in order to avoid the curse of dimensionality and
also reduce resources and dimensional costs.
● Limitations of Logistic Regression
○ Two-class problems
○ Unstable with Well-Separated classes
○ Unstable with few examples
Linear Discriminant Analysis
How does LDA work?
● We need to calculate the separability between classes which is the
distance between the mean of different classes. This is called the between-
class variance.
● Calculate the distance between the mean and sample of each class. It is
also called the within-class variance.
● Construct the lower-dimensional space P that maximizes the between-class variance and minimizes the within-class variance; the ratio being maximized is called Fisher's criterion.
How to prepare data for LDA?
● LDA is mainly used in classification problems where you have a categorical output
variable. It allows both binary classification and multi-class classification.
● The standard LDA model makes use of the Gaussian Distribution of the input
variables. You should check the univariate distributions of each attribute and
transform them into a more Gaussian-looking distribution. Outliers can skew the
primitive statistics used to separate classes in LDA, so it is preferable to remove
them.
● Since LDA assumes that each input variable has the same variance, it is always
better to standardize your data before using an LDA model. Keep the mean to be
0 and the standard deviation to be 1.
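A minimal scikit-learn sketch that follows this advice (standardize, then project); the Iris data and the choice of 2 components are illustrative:

from sklearn.datasets import load_iris
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_iris(return_X_y=True)

# Standardize, then project onto at most (n_classes - 1) = 2 discriminant axes
lda = make_pipeline(StandardScaler(), LinearDiscriminantAnalysis(n_components=2))
X_lda = lda.fit_transform(X, y)
print(X_lda.shape)  # (150, 2)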
Recommendation Systems
● Recommendation engines are a subclass of machine learning which
generally deal with ranking or rating products / users.
● They’re used by various large name companies like Google, Instagram,
Spotify, Amazon, Reddit, Netflix etc. often to increase engagement with
users and the platform.
● Recommender systems are often seen as a “black box”; the models created by these large companies are not easily interpretable.
What Defines a Good Recommendation?
● The quality of a recommendation can be assessed through various tactics
which measure coverage and accuracy.
● Accuracy is the fraction of correct recommendations out of total possible
recommendations while coverage measures the fraction of objects in the
search space the system is able to provide recommendations for.
● Recommender systems share several conceptual similarities with the
classification and regression modelling problem.
● K Fold Cross Validation
● MAE (Mean Absolute Error)
● RMSD (Root Mean Square Deviation)
Collaborative Filtering Systems
● Collaborative filtering is the process of predicting the interests of a user by
identifying preferences and information from many users.
● There are two common types of approaches in collaborative filtering,
memory based and model based approach.
● Examples:
○ YouTube content recommendation to users
○ Coursera course recommendation
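A toy sketch of the memory-based approach using cosine similarity between users; the ratings matrix below is entirely made up for illustration:

import numpy as np

# Rows = users, columns = items, 0 = not rated
R = np.array([
    [5, 4, 0, 1],
    [4, 5, 1, 0],
    [1, 0, 5, 4],
], dtype=float)

# Cosine similarity between users
norms = np.linalg.norm(R, axis=1, keepdims=True)
sim = (R @ R.T) / (norms @ norms.T)

# Predict user 0's rating of item 2 from the other users, weighted by similarity
others = [1, 2]
pred = sim[0, others] @ R[others, 2] / sim[0, others].sum()
print(round(pred, 2))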
Content Based Systems
● Content based systems generate recommendations based on the users
preferences and profile.
● Unlike most collaborative filtering models which leverage ratings between
target user and other users, content based models focus on the ratings
provided by the target user themselves.
● The simplest forms of content based systems require the following sources of
data
○ Item level data source
○ User level data source
● Examples
○ Amazon product feed
Hybrid Recommendation System
● Hybrid recommender systems are ones designed to use different available
data sources to generate robust inferences.
● The parallel design provides the input to multiple recommendation
systems, each of those recommendations are combined to generate one
output.
Hybrid Recommendation System
● The sequential design provides the input parameters to a single
recommendation engine, the output is passed on to the following
recommender in a sequence.
Difference Between Supervised and Unsupervised Learning
● Supervised learning algorithms are trained using labeled data; unsupervised learning algorithms are trained using unlabeled data.
● A supervised learning model takes direct feedback to check whether it is predicting the correct output; an unsupervised learning model does not take any feedback.
● A supervised learning model predicts the output; an unsupervised learning model finds the hidden patterns in data.
● In supervised learning, input data is provided to the model along with the output; in unsupervised learning, only input data is provided to the model.
● The goal of supervised learning is to train the model so that it can predict the output when given new data; the goal of unsupervised learning is to find the hidden patterns and useful insights in an unknown dataset.
● Supervised learning needs supervision to train the model; unsupervised learning does not need any supervision.
● Supervised learning can be categorized into Classification and Regression problems; unsupervised learning can be classified into Clustering and Association problems.
Difference Between Supervised and Unsupervised Learning
● Supervised learning is used where we know the inputs as well as the corresponding outputs; unsupervised learning is used where we have only input data and no corresponding output data.
● A supervised learning model produces an accurate result; an unsupervised learning model may give a less accurate result compared to supervised learning.
● Supervised learning is not close to true Artificial Intelligence, as we first train the model for each data point and only then can it predict the correct output; unsupervised learning is closer to true Artificial Intelligence, as it learns the way a child learns daily routine things from experience.
● Supervised learning includes algorithms such as Linear Regression, Logistic Regression, Support Vector Machines, Multi-class Classification, Decision Trees, Bayesian Logic, etc.; unsupervised learning includes algorithms such as Clustering, KNN, and the Apriori algorithm.
Semi-Supervised Learning
● Semi-supervised learning represents the intermediate ground between supervised and unsupervised learning algorithms. It uses a combination of labeled and unlabeled datasets during the training period.
● The concept of semi-supervised learning was introduced to overcome the drawbacks of supervised learning (which needs large amounts of labeled data) and unsupervised learning (which uses no labels at all).
● As an analogy: under semi-supervised learning, a student first studies a concept under the guidance of an instructor at college and then has to revise it on their own.
Assumptions followed by Semi-Supervised
Learning
● Continuity Assumption: As per the continuity assumption, the objects near
each other tend to share the same group or label.
● Cluster assumptions- In this assumption, data are divided into different
discrete clusters. Further, the points in the same cluster share the output
label.
● Manifold assumption: this assumption helps in using distances and densities; the data is assumed to lie on a manifold of fewer dimensions than the input space. Such data is created by a process that has fewer degrees of freedom and may be hard to model directly. (This assumption becomes practical when the input data is high-dimensional.)
Working of Semi-Supervised Learning
● First, the model is trained on the small amount of labeled training data, just as in supervised learning, until it gives accurate results.
● The algorithm then uses the unlabeled dataset with pseudo-labels generated by this model; at this point the results may not be accurate.
● Next, the labels from the labeled training data and the pseudo-labels are linked together.
● The input data in the labeled training data and the unlabeled training data are also linked.
● Finally, the model is trained again on the new combined input, as in the first step. This reduces errors and improves the accuracy of the model.
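scikit-learn wraps this pseudo-labelling loop in SelfTrainingClassifier; a minimal sketch where most Iris labels are hidden (marked -1) purely for illustration:

import numpy as np
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.semi_supervised import SelfTrainingClassifier

X, y = load_iris(return_X_y=True)

# Pretend ~70% of the labels are unknown; scikit-learn marks unlabeled samples with -1
rng = np.random.default_rng(0)
y_semi = y.copy()
y_semi[rng.random(len(y)) < 0.7] = -1

model = SelfTrainingClassifier(LogisticRegression(max_iter=1000))
model.fit(X, y_semi)          # trains on the labeled part, then adds pseudo-labels iteratively
print(model.score(X, y))      # compared against the true labels, for illustration only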
Real-world applications of Semi-
supervised Learning
● Speech Analysis
● Web content classification
● Protein sequence classification
● Text document classifier
Reinforcement Learning
● Reinforcement Learning is a feedback-based Machine learning technique
in which an agent learns to behave in an environment by performing the
actions and seeing the results of actions. For each good action, the agent
gets positive feedback, and for each bad action, the agent gets negative
feedback or penalty.
● In Reinforcement Learning, the agent learns automatically from feedback, without any labeled data, unlike supervised learning.
● Since there is no labeled data, the agent is bound to learn from its experience only.
Terms used in Reinforcement Learning
● Agent
● Environment
● Action
● State
● Reward
● Policy
● Value
● Q-value
Key Features of Reinforcement Learning
● In RL, the agent is not instructed about the environment and what
actions need to be taken.
● It is based on a trial-and-error process.
● The agent takes the next action and changes states according to the
feedback of the previous action.
● The agent may get a delayed reward.
● The environment is stochastic, and the agent needs to explore it in order to obtain the maximum positive reward.
Approaches to implement
Reinforcement Learning
● Value-based: The value-based approach aims to find the optimal value function, i.e. the maximum value attainable at a state under any policy. The agent then expects the long-term return at any state(s) under policy π.
● Policy-based: Policy-based approach is to find the optimal policy for the
maximum future rewards without using the value function. In this
approach, the agent tries to apply such a policy that the action
performed in each step helps to maximize the future reward
Approaches to implement
Reinforcement Learning
● Policy-based: The policy-based approach has mainly two types of policy:
○ Deterministic: The same action is produced by the policy (π) at any state.
○ Stochastic: In this policy, probability determines the produced action.
● Model-based: In the model-based approach, a virtual model is created
for the environment, and the agent explores that environment to learn
it. There is no particular solution or algorithm for this approach because
the model representation is different for each environment.
Types of Reinforcement learning
● Positive Reinforcement:
○ Positive reinforcement learning means adding something to increase the tendency that the expected behavior will occur again. It impacts the behavior of the agent positively and increases the strength of the behavior. This type of reinforcement can sustain changes for a long time, but too much positive reinforcement may lead to an overload of states, which can diminish the results.
● Negative Reinforcement:
○ Negative reinforcement learning is the opposite of positive reinforcement: it increases the tendency that the specific behavior will occur again by avoiding a negative condition. It can be more effective than positive reinforcement depending on the situation and behavior, but it provides only enough reinforcement to meet the minimum required behavior.
Reinforcement Learning Algorithms
● Q-Learning:
○ Q-learning is an Off policy RL algorithm, which is used for the temporal difference
Learning. The temporal difference learning methods are the way of comparing
temporally successive predictions.
● State Action Reward State action (SARSA):
○ SARSA stands for State Action Reward State action, which is an on-policy temporal
difference learning method. The on-policy control method selects the action for each
state while learning using a specific policy.
● Deep Q Neural Network (DQN):
○ As the name suggests, DQN is a Q-learning using Neural networks. For a big state space
environment, it will be a challenging and complex task to define and update a Q-table.
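A tiny illustrative Q-learning sketch on a made-up 4-state chain environment; the environment, rewards, and hyperparameters are assumptions, not taken from the slides:

import numpy as np

# States 0..3, actions 0 = left, 1 = right; reaching state 3 gives reward 1
n_states, n_actions = 4, 2
Q = np.zeros((n_states, n_actions))
alpha, gamma, epsilon = 0.1, 0.9, 0.2
rng = np.random.default_rng(0)

def step(s, a):
    s_next = max(0, s - 1) if a == 0 else min(n_states - 1, s + 1)
    return s_next, (1.0 if s_next == n_states - 1 else 0.0)

for _ in range(500):                      # episodes
    s = 0
    while s != n_states - 1:
        # epsilon-greedy action selection
        a = rng.integers(n_actions) if rng.random() < epsilon else int(np.argmax(Q[s]))
        s_next, r = step(s, a)
        # Off-policy temporal-difference update
        Q[s, a] += alpha * (r + gamma * Q[s_next].max() - Q[s, a])
        s = s_next

print(np.argmax(Q, axis=1))  # states 0-2 should prefer action 1 (move right); state 3 is terminal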
Reinforcement Learning Applications
● Robotics
● Control
● Game Playing
● Chemistry
● Business
● Manufacturing
● Finance Sector
Data Science vs Machine Learning
● Data Science is a field about the processes and systems used to extract knowledge from structured and semi-structured data; Machine Learning is a field of study that gives computers the capability to learn without being explicitly programmed.
● Data Science needs the entire analytics universe; Machine Learning is a combination of machines and data science.
● Data Science is the branch that deals with data; in Machine Learning, machines utilize data science techniques to learn about the data.
● Data in Data Science may or may not have evolved from a machine or mechanical process; Machine Learning uses various techniques such as regression and supervised clustering.
● Data Science, as a broader term, focuses not only on algorithms and statistics but also takes care of the data processing; Machine Learning is focused only on algorithms and statistics.
● Data Science is a broad term covering multiple disciplines; Machine Learning fits within data science.
● Data Science covers many operations such as data gathering, data cleaning, and data manipulation; Machine Learning is of three types: supervised, unsupervised, and reinforcement learning.
● Example: Netflix uses Data Science technology; Facebook uses Machine Learning technology.
Deep Learning Algorithms
● Deep learning is a very important element of data science that bases its modeling on data-driven techniques drawn from predictive modeling and statistics.
● Deep learning algorithms are dynamically made to run through several
layers of neural networks, which are nothing but a set of decision-making
networks that are pre-trained to serve a task.
● Deep learning algorithms play a crucial role in determining the features and
can handle the large number of processes for the data that might be
structured or unstructured.
● Deep learning algorithms are highly progressive algorithms that learn about their input (for example, an image) by passing it through each neural network layer.
Deep Learning Algorithms
● Convolutional Neural Networks (CNNs)
● Long Short Term Memory Networks (LSTMs)
● Recurrent Neural Networks (RNNs)
● Generative Adversarial Networks (GANs)
● Radial Basis Function Networks (RBFNs)
● Multilayer Perceptrons (MLPs)
● Self Organizing Maps (SOMs)
● Deep Belief Networks (DBNs)
● Restricted Boltzmann Machines (RBMs)
● Autoencoders
Introduction to Keras and TensorFlow
● TensorFlow is a Python-based, free, open-source machine
learning platform, developed primarily by Google.
● Keras is a deep-learning framework for Python that
provides a convenient way to define and train almost any
kind of deep-learning model.
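A minimal Keras sketch of the define/compile/fit pattern on synthetic data; the layer sizes and training settings are arbitrary:

import numpy as np
from tensorflow import keras

X = np.random.rand(200, 4)
y = (X.sum(axis=1) > 2).astype(int)   # synthetic binary target

model = keras.Sequential([
    keras.layers.Dense(8, activation="relu"),
    keras.layers.Dense(1, activation="sigmoid"),
])
model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])
model.fit(X, y, epochs=5, batch_size=32, verbose=0)
print(model.evaluate(X, y, verbose=0))   # [loss, accuracy]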
Data Mining
● In data mining, large data sets are sorted through in order to find patterns and relationships that can be used in data analysis to help solve business challenges.
● Successful analytics initiatives in corporations depend heavily on data mining.
● Data mining has become increasingly important with the explosive growth of data (from terabytes to petabytes).
● Effective data mining aids in various aspects of planning
business strategies and managing operations.
Data Mining
● Data scientists and other knowledgeable BI and analytics
professionals frequently perform data mining.
● Its core elements include machine learning and statistical
analysis, along with data management tasks done to
prepare data for analysis.
● The data mining process is typically broken down into four key phases, from gathering and preparing the data through mining it and interpreting the results.
Benefits of Data Mining
● Improved sales and marketing
● Better customer service
● Effective supply chain management
● Increased uptime for manufacturing
● Stronger risk management
● Lower costs
Applications
● Retail
● Financial services
● Insurance
● Manufacturing
● Entertainment
● Healthcare
● Fraud Detection
● Human Resources
Data Pre-processing in Data Mining
● Data Cleaning
○ Missing Data
○ Noisy Data
● Data Transformation
○ Normalization
○ Attribute Selection
○ Discretization
○ Concept Hierarchy Generation
● Data Reduction
Natural Language Processing
● The field of study that focuses on the interactions between
human language and computers is called natural language
processing.
● Natural Language Processing is a field that covers computer understanding and manipulation of human language, and it's ripe with possibilities for newsgathering.
● NLP is a field of artificial intelligence in which computers
analyze, understand, and derive meaning from human
language in a smart and useful way.
Example NLP algorithms
● Speech recognition
● Part of speech tagging
● Word sense disambiguation
● Named entity recognition
● Co-reference resolution
● Sentiment analysis
● Natural language generation
NLP tools
● Python and the Natural Language Toolkit (NLTK)
● Statistical NLP, machine learning, and deep learning
● Apache OpenNLP
● Stanford NLP
● MALLET
NLP Use Cases
● Spam detection
● Machine translation
● Virtual agents and chatbots
● Social media sentiment analysis
● Text summarization
NLP Pre-Processing
● Data Cleaning
● Preprocessing of data
○ Lowercase
○ Tokenization
○ Stop words removal
○ Stemming
○ Lemmatization
○ Text Data Vectorization: Bag of Words, TF-IDF (Term Frequency - Inverse Document Frequency)
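A small scikit-learn sketch of the last step (Bag of Words and TF-IDF) on two made-up sentences; lowercasing, tokenization, and stop-word removal are handled by the vectorizers here, while NLTK would typically be used for stemming and lemmatization:

from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

docs = [
    "Data science extracts insights from data",
    "Machine learning learns patterns from data",
]

# Bag of Words: raw term counts per document
bow = CountVectorizer(lowercase=True, stop_words="english")
print(bow.fit_transform(docs).toarray())
print(bow.get_feature_names_out())

# TF-IDF: down-weights terms that appear in many documents
tfidf = TfidfVectorizer(lowercase=True, stop_words="english")
print(tfidf.fit_transform(docs).toarray().round(2))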