Machine Learning Essentials: Notes by Aniket Sahoo - Part I
A global community of programmers develops and maintains CPython, an open source reference
implementation. A non-profit organization, the Python Software Foundation, manages and directs
resources for Python and CPython development.
Anaconda is a widely used free and open-source distribution of the Python and R programming
languages for scientific computing (data science, machine learning applications, large-scale data
processing, predictive analytics, etc.), that aims to simplify package management and deployment.
Package versions are managed by the package management system conda. The Anaconda
distribution includes data-science packages suitable for Windows, Linux, and MacOS.
Data Types
>> sentence = ' This is my first line of code in python '
>> print(sentence) # This is my first line of code in python
>> sentence.upper() # ' THIS IS MY FIRST LINE OF CODE IN PYTHON '
>> sentence.lower() # ' this is my first line of code in python '
>> sentence.strip() # 'This is my first line of code in python'
Slicing
>> sentence = sentence.strip() # reassign the stripped string (the slices below assume this)
>> sentence[0] # 'T'
>> sentence[33:39] # 'python'
>> sentence[:29] # 'This is my first line of code'
>> sentence[:-10] # 'This is my first line of code'
>> sentence[33:] # 'python'
>> sentence[-6:] # 'python'
>> sentence[0:39:2] # 'Ti sm is ieo oei yhn'
List
>> empty_list = []
>> DA_languages = ['R','Python', 'SAS', 'Scala', 42]
>> DA_languages[0] # R
>> DA_languages[-1] # 42
>> DA_languages[1:3] # ['Python', 'SAS']
>> DA_languages.append('Java') # ['R', 'Python', 'SAS', 'Scala', 42, 'Java']
>> DA_languages.pop() # 'Java'
>> DA_languages.pop(2) # 'SAS'
>> DA_languages.append('SAS') # ['R', 'Python', 'Scala', 42, 'SAS']
>> DA_languages.remove('SAS') # ['R', 'Python', 'Scala', 42]
>> new_list = DA_languages
>> another_list = DA_languages.copy()
>> print(id(DA_languages)) # 1419340537672
>> print(id(new_list)) # 1419340537672
>> print(id(another_list)) # 1419340535112
>> sentence.split()
# ['This', 'is', 'my', 'first', 'line', 'of', 'code', 'in', 'python']
>> sentence.split('i')
# [' Th', 's ', 's my f', 'rst l', 'ne of code ', 'n python ']
>> '_'.join(sentence.split()) # This_is_my_first_line_of_code_in_python
>> nums_1, nums_2 = [1,2], [3,4]
>> nums_1*3 # [1, 2, 1, 2, 1, 2]
>> nums_1.extend(nums_2) # nums_1 becomes [1, 2, 3, 4]
>> nums_1, nums_2 = [1,2], [3,4] # reset the lists for the examples below
>> nums = nums_2 + nums_1 # [3, 4, 1, 2]
>> len(nums) # 4
>> sorted(nums) # [1, 2, 3, 4]
>> list(reversed(nums)) # [2, 1, 4, 3]
>> max(nums) # 4
>> min(nums) # 1
>> nums_1.append(nums_2) # [1, 2, [3, 4]]
>> nums_1[0] # 1
>> nums_1[2] # [3, 4]
List Comprehension
>> squares_list = [x**2 for x in range(1,10)]
# [1, 4, 9, 16, 25, 36, 49, 64, 81]
>> single_word_list = [word for word in sentence.split()]
# ['This', 'is', 'my', 'first', 'line', 'of', 'code', 'in', 'python']
Dictionary Comprehension
>> squared_dict = {num : num**2 for num in range(0, 10)}
# {0: 0, 1: 1, 2: 4, 3: 9, 4: 16, 5: 25, 6: 36, 7: 49, 8: 64, 9: 81}
>> even_sq_dict = {num:square for num, square in squared_dict.items() if num%2==0}
# {0: 0, 2: 4, 4: 16, 6: 36, 8: 64}
Set
>> students_set_1 = set(['A','A','B','C','C','B']) # {'A', 'B', 'C'}
>> students_set_2 = set(['A','E','D','E','A']) # {'A', 'D', 'E'}
>> students_set_1.intersection(students_set_2) # {'A'}
>> students_set_1.union(students_set_2) # {'A', 'B', 'C', 'D', 'E'}
>> students_set_1.difference(students_set_2) # {'B', 'C'}
Tuple
>> city_tuple = ('Mumbai', 18.9949521, 72.8141853)
# ('Mumbai', 18.9949521, 72.8141853)
Functions
>> def square(num):
>>     out = num**2
>>     return out
1.1.2. NUMPY
NumPy (pronounced NUM-pee) is a library for the Python programming language, adding support
for large, multi-dimensional arrays and matrices, along with a large collection of high-level
mathematical functions to operate on these arrays. It stands for numerical python. The ancestor of
NumPy, Numeric, was originally created by Jim Hugunin with contributions from several other
developers. In 2005, Travis Oliphant created NumPy by incorporating features of the competing
Numarray into Numeric, with extensive modifications. The most basic object in NumPy is the
ndarray, or simply an array which is an n-dimensional, homogeneous array. By homogenous, one
means that all the elements in a NumPy array have to be of the same data type, which is commonly
numeric (float or integer).
Installing Numpy
One can install numpy using the following command on the conda command prompt.
(base)$ source virtualenv_name/bin/activate
(virtualenv_name)$ pip install numpy
Array Creation
# axis = 0 refers to the rows
# axis = 1 refers to the columns
>> np.array([2, 4, 5, 6, 7, 9]) # 1-D array
>> np.array([[2, 3, 4], [5, 8, 7]]) # 2-D array
>> np.ones((5, 3)) # Array of ones
>> np.ones((5, 3), dtype = int) # Change dtype (default is float64)
>> np.zeros(4, dtype = int) # Array of zeros
>> np.random.random([3, 4]) # Array of random numbers
>> np.arange(10, 100, 5) # Array of numbers from 10 up to (but not including) 100 with a step of 5
>> np.linspace(15, 18, 25) # Array of length 25 between 15 and 18
>> np.full((4,3), 7) # Array of 7's
>> np.tile(some_array, 3) # Array of repeating sequence
>> np.eye(3, dtype = int) # Array of identity matrix
>> np.random.randint(0, 10, (4,4)) # Array of random integers ranging from 0 to 9
Array Inspection
>> rand_array.shape # shape
>> rand_array.dtype # dtype
>> rand_array.ndim # dimensions
>> rand_array.itemsize # itemsize
Array Manipulation
>> some_array.reshape(2, 3, 4) # Array reshape
>> some_array.reshape(4, -1) # Array reshape with automatic dimensions
>> some_array.T # Array Transpose
>> np.vstack((array_1, array_2)) # Array vertical stacking
>> np.hstack((array_1, array_2)) # Array horizontal stacking
Array Operations
>> array_1 * array_2 # multiplication
>> some_array ** 2 # squared
>> np.sin(some_array) # sin
>> np.exp(some_array) # exponential
>> np.log(some_array) # logarithm
>> f = np.vectorize(lambda x: func(x)) # custom function
>> f(some_array) # apply custom function
>> some_array.T # transpose
>> np.linalg.matrix_rank(some_array) # rank of array
>> np.linalg.inv(some_array) # inverse
>> np.linalg.det(some_array) # determinant
>> np.add(array_1, array_2) # addition
>> np.subtract(array_1, array_2) # subtraction
>> np.dot(array_1, array_2) # multiplication
>> np.divide(array_1, array_2) # division
>> eigen_val, eigen_vec = np.linalg.eig(some_array)
# eigen operation
1.1.3. PANDAS
Pandas is a software library written for the Python programming language for data manipulation and
analysis. In particular, it offers data structures and operations for manipulating numerical tables and
time series. The name is derived from the term “panel data”, an econometrics term for data sets that
include observations over multiple time periods for the same individuals. One shall use Pandas
heavily for data manipulation, visualisation, building machine learning models, etc. There are two
main data structures in Pandas - Series and Dataframes. The default way to store data is dataframes,
and thus manipulating dataframes quickly is probably the most important skill set for data analysis.
Installing Pandas
One can install pandas using the following command on the conda command prompt.
(base)$ source virtualenv_name/bin/activate
(virtualenv_name)$ pip install pandas
Creation
# A pandas series
>> pd.Series([2, 4, 5, 6, 9])
# A pandas dataframe
>> pd.DataFrame({'name': ['Vinay', 'Kushal', 'Aman', 'Saif'], 'age': [22, 25, 24,
28], 'occupation': ['engineer', 'doctor', 'data analyst', 'teacher']})
Structure Changing
>> data.set_index('col_1', inplace = True) # set col_1 as index
>> data.sort_index(axis = 0, ascending = False) # sort by index
>> data.sort_values(by = 'col_1', ascending = False) # sort by col_1
Inspection
>> data.shape # shape
>> data.head() # top 5 rows
>> data.tail() # bottom 5 rows
>> data.info() # metadata summary
>> data.describe() # statistical summary
>> data.memory_usage() # memory consumed by columns
>> data.columns # columns
>> data.values # values as an array
>> pd.options.display.max_info_columns = 100 # show per-column info in df.info() for up to 100 columns
Slicing
>> data['col_1'] # series of col_1 data
>> data.col_1 # series of col_1 data
>> data[['col_1', 'col_2', 'col_3']] # dataframe of col_1, col_2, col_3
>> data[2:7] # rows from indices 2 to 6
>> data[5::2] # alternate rows from index 5
Operations
>> pd.merge(data_1, data_2, how='inner', on=['col_1', 'col_2'])
# inner join of dataframes on specific columns
>> pd.concat([data_1, data_2], axis = 0)
# concatenate dataframes one on top of the other
>> pd.concat([data_1, data_2], axis = 1)
# concatenate dataframes side by side
>> data_1.append(data_2)
# alternative for concatenating along the rows (deprecated in newer pandas, prefer pd.concat)
>> data_1 + data_2
# add two dataframes (gives NaN when values not present in both)
>> data_1.add(data_2, fill_value = 0)
# add two dataframes (does not give NaN when values not present in both)
>> data['new_col'] = any_value
# add a new column
>> data['new_col'] = data['col_1']/data['col_2']
# add a new column
>> data.groupby(['col_1', 'col_2'])
# group the data by specific columns
>> data.groupby(['col_1', 'col_2'])['col_3'].sum()
# sum of col_3 of the grouped data
>> data['col_1'].apply(lambda x: function(x))
# applying an operation to a column
>> data.pivot_table(values = 'aggregation_col', index = 'group_by_row', columns =
'group_by_col', aggfunc = 'mean')
# pivot data
>> df.isnull()
# find all nulls
>> df.isnull().sum()
# find count of all nulls
>> df.isnull().any(axis=0)
# columns with at least one missing value
>> df.isnull().all(axis=0)
# columns with all missing values
>> df.isnull().sum(axis=0)
# sum of missing values in each column
>> df.isnull().all(axis=0).sum()
# how many columns have all missing values
>> df.isnull().any(axis=1)
# rows with at least one missing value
>> round(100*(df.isnull().sum()/len(df.index)), 2)
# percentage of missing values in columns
>> df.drop('col_1', axis=1)
# removing specific column
>> df.dropna(how='all', axis='columns')
# remove columns where all the values are missing
>> df.fillna('value')
# replace missing values with 'value'
One can easily improve data density and increase the amount of information being conveyed. For example, imagine the difficulty in interpreting a spreadsheet where the score of each innings of each batsman is recorded along with his strike rate. The following data visualisation helps one figure out many insights just by looking at the plot.
Visualisation can help in visual exploratory analytics. For example, the following visualization helps
in understanding the connections between different software and clustering them together based
on their features.
A picture is worth a thousand words. That’s the power of visual imagery. A message which cannot
be conveyed through a large set of texts and tabular data can easily be presented through visuals.
One often uses graphics to make sense of large and complex sets of information. This makes data
visualisation a very important step for data understanding. For example, the following visualization
of a data from an incomprehensible tabular format in a heat map helped a football coach formulate
the defence strategy for his team.
Stacked Bar chart : To compare the share and contribution of categories across different sectors.
Scatter plot : To summarize the variation of data points across two parameters.
Box plot : To represent the quartile, percentile and outlier values.
Grouped Bar chart : To represent different sub-groups among the main categories.
Plots
>> plt.xlabel('label_x') # change x label
>> plt.ylabel('label_y') # change y label
>> plt.title('title') # change title
>> plt.xlim([initial_value, final_value]) # x-axis limits
>> plt.ylim([initial_value, final_value]) # y-axis limits
Subplots
>> plt.figure(1, (width, height)) # initiating new figure explicitly
>> plt.subplots_adjust(hspace=1, wspace=1) # space between subplots
# subplot for 1 row 3 columns
>> plt.subplot(1, 3, 1) # set subplot 1
>> plt.title('subplot 1') # title for 1st position
>> plt.plot(data_x, data_y) # plot for 1st position
>> plt.subplot(1, 3, 2) # set subplot 2
>> plt.title('subplot 2') # title for 2nd position
>> plt.plot(data_x, data_y) # plot for 2nd position
>> plt.subplot(1, 3, 3) # set subplot 3
>> plt.title('subplot 3') # title for 3rd position
>> plt.plot(data_x, data_y) # plot for 3rd position
Plot Types
>> plt.plot(data_x, data_y) # line plot
>> plt.boxplot(data_x) # box plot
>> plt.hist(data_x) # histogram
>> plt.scatter(data_x, data_y) # scatter plot
>> image = plt.imread('image_path') # read image
>> plt.imshow(image) # plot image
Univariate Distributions
A univariate distribution shows how data points of one variable are distributed. Univariate plots can
be visualized using histograms, density plots, rug plots and box plots.
Bivariate Distributions
A bivariate distribution shows how two variables interact with each other. Bivariate plots can be
visualized using scatter plots. One can also visualise pairwise relationships between multiple
variables using heatmaps.
Categorical Distributions
It shows how the data is distributed across multiple segments or categories. This data can be
visualized using various categorical plots like box plot, bar plot, count plot etc.
1. Plotting time on the x-axis and the value (usually aggregated using mean, median etc.) of a
variable on the y-axis.
2. Plotting a heat map with year/ month/day along the axes and the values denoted by colour.
Plots
>> sns.set_style('whitegrid') # set style
>> sns.set_context('paper', font_scale=1) # set context
Bivariate Distribution
>> sns.jointplot(x='col_1', y='col_2', data=df) # joint plot
>> sns.jointplot(x='col_1', y='col_2', data=df, kind='hex', color='k') # joint plot (hexbin)
>> corr_matrix = df[['col_1', 'col_2', 'col_3']].corr() # correlation matrix
>> sns.heatmap(corr_matrix, cmap="YlGnBu", annot=True) # heat map
>> sns.pairplot(df[['col_1', 'col_2', 'col_3']]) # pairwise plot
Vectors
1. Vectors are usually represented in two ways - as ordered lists, such as x = [x1, x2, ..., xn], or using the 'hat' notation, such as x = x1·î + x2·ĵ + x3·k̂, where î, ĵ and k̂ represent the three perpendicular directions (or axes).
2. The number of elements in a vector is the dimensionality of the vector. For example, x = [x1, x2] is a two dimensional (2-D) vector, x = [x1, x2, x3] is a 3-D vector and so on.
3. The magnitude of a vector is the distance of its tip from the origin. For an n-dimensional vector x = [x1, x2, ..., xn], the magnitude is given by ||x|| = √(x1² + x2² + ... + xn²).
4. A unit vector is one whose distance from the origin is exactly 1 unit. For example, the vectors î, ĵ and î/√2 + ĵ/√2 are unit vectors.
Vector Operations
1. Vector Addition/Subtraction : It is the element-wise sum/difference of two vectors. Mathematically, a ± b = [a1 ± b1, a2 ± b2, ..., an ± bn]. Geometrically, the sum corresponds to placing the two vectors tip to tail (the parallelogram law).
The dot product of two perpendicular vectors (also called orthogonal vectors) is 0. The dot product can be used to compute the angle θ between two vectors a and b using the formula cos θ = (a · b) / (||a|| ||b||).
This simple property of the dot product is extensively used in data science applications. One such example is spam detection, a popular application of machine learning in which spam mails are separated from genuine mails. Spam detection algorithms make a decision based on the words in an email, i.e. if the email contains phrases such as “easy money”, “free!!”, “hurry up” etc., it is more likely to be a spam mail. On the other hand, if it contains words such as “meeting”, “powerpoint”, “client” etc., it is probably a genuine mail.
Each mail is represented as a vector based on the words it contains. Each mail is then
classified accordingly by checking for its similarity with known spam mails by finding the
angle between the vector representation of these mails (the smaller the angle the more
similar are the mails). This cosine similarity technique is extremely useful and can be extended to any set of text documents. In fact, it is a very general technique used in a
variety of machine learning techniques such as recommender systems, web and document
search etc.
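The following is a minimal sketch of this cosine-similarity idea, assuming a toy bag-of-words representation (the vocabulary, counts and variable names are purely illustrative).

import numpy as np

# Illustrative word-count vectors over an assumed vocabulary:
# ['easy', 'money', 'free', 'meeting', 'client']
known_spam = np.array([2, 3, 4, 0, 0])
new_mail_1 = np.array([1, 2, 3, 0, 0])   # spam-like wording
new_mail_2 = np.array([0, 0, 0, 3, 2])   # genuine-looking wording

def cosine_similarity(a, b):
    # cos(theta) = (a . b) / (||a|| ||b||); close to 1 means a small angle
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

print(cosine_similarity(known_spam, new_mail_1))   # ~0.99 -> very similar to known spam
print(cosine_similarity(known_spam, new_mail_2))   # 0.0  -> orthogonal, likely genuine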
Vector Spaces
1. Basis Vector : A basis vector of a vector space V is defined as a subset (v1, v2, ..., vn) of vectors in V that are linearly independent and span V. Consequently, if (v1, v2, ..., vn) is a list of vectors in V, then these vectors form a vector basis if and only if every v in V can be uniquely written as,
v = a1·v1 + a2·v2 + ... + an·vn
The vectors î and ĵ are chosen as the default basis vectors, though one can make a completely different, valid choice of basis vectors.
2. Span : The span of two or more vectors is the set of all possible vectors that one can get by
changing the scalars and adding them.
3. Linear Combination : The linear combination of two vectors is the sum of the scaled vectors.
Matrices
1. Rows : Rows are horizontal. The matrix A has m rows. Each row itself is a vector, so they are
also called row vectors.
2. Columns : Columns are vertical. The matrix A has n columns. Each column itself is a vector,
so they are also called column vectors.
3. Entities : Entities are the individual values in a matrix. For a given matrix A, the value in row i and column j is represented as Aij.
4. Dimensions : The number of rows and columns. For m rows and n columns, the dimensions
are (m × n) .
5. Square Matrices : These are matrices where the number of rows is equal to the number of
columns, i.e m = n .
6. Diagonal Matrices : These are square matrices where all the off-diagonal elements are zero,
i.e,
7. Identity Matrices : These are diagonal matrices where all the diagonal elements are 1 , i.e,
3. Matrix Multiplication or Dot Product : It is the product of two matrices in which the (i, j) element of the output matrix is the dot product of the i-th row of the first matrix and the j-th column of the second matrix. Mathematically,
(AB)ij = Σk Aik·Bkj
Not all matrices can be multiplied with each other. For the matrix multiplication AB to be valid, the number of columns in A should be equal to the number of rows in B, i.e. for two matrices A and B with dimensions (m × n) and (o × p), AB exists if and only if n = o, and BA exists if and only if p = m. Matrix multiplication is not commutative, i.e. in general AB ≠ BA.
4. Matrix Inverse : The inverse of a matrix A is the matrix A⁻¹ such that AA⁻¹ = I (Identity Matrix).
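A short NumPy sketch of the rules above (dimension compatibility, non-commutativity and the inverse); the matrices used are arbitrary illustrative values.

import numpy as np

A = np.array([[1, 2, 3],
              [4, 5, 6]])              # dimensions (2 x 3)
B = np.array([[1, 0],
              [0, 1],
              [1, 1]])                 # dimensions (3 x 2)

print(A @ B)                           # valid: columns of A (3) == rows of B (3), result is (2 x 2)
print(B @ A)                           # also valid, but the result is (3 x 3), so AB != BA

C = np.array([[2, 1],
              [1, 1]])
print(C @ np.linalg.inv(C))            # identity matrix (up to floating point error)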
Linear Transformations
Any transformation can be geometrically visualised as the distortion of the n-dimensional space (it can be squishing, stretching, rotating etc.). The distortion of space can be visualised as a distortion of the grid lines that make up the coordinate system. Space can be distorted in several different ways. A linear transformation, however, is a special distortion with two distinct properties,
1. All grid lines remain straight, parallel and evenly spaced after the transformation.
2. The origin remains fixed in place.
Consider a linear transformation where the original basis vectors î and ĵ move to the new points, î = [1, −2] and ĵ = [3, 0]. This means that î moves to (1, −2) from (1, 0) and ĵ moves to (3, 0) from (0, 1) in the linear transformation. This transformation simply stretches the space in the y-direction by three units while stretching the space in the x-direction by two units and rotating it by sixty degrees in the clockwise direction. One can combine the two vectors where î and ĵ land and write them as a single matrix, i.e,
As can be seen, each of these vectors forms one column of the matrix (and hence they are often called column vectors). This matrix fully represents the linear transformation. Now, if one wants to find where any given vector v would land after this transformation, one simply needs to multiply the vector v with the matrix L, i.e. v_new = L·v. It is convenient to think of this matrix as a function which describes the transformation, i.e. it takes the original vector v as the input and returns the new vector v_new. The following figures represent the linear transformation.
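A small NumPy sketch of the transformation described above, where î lands on (1, −2) and ĵ lands on (3, 0); the test vector v is an arbitrary illustrative choice.

import numpy as np

# Columns of L are the images of the basis vectors i-hat and j-hat.
L = np.array([[1, 3],
              [-2, 0]])

v = np.array([2, 1])                   # an arbitrary vector
v_new = L @ v                          # where v lands after the transformation
print(v_new)                           # [ 5 -4 ], i.e. 2*(1, -2) + 1*(3, 0)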
Composite Transformation
One can also apply multiple linear transformations one after the other. For example, one can rotate
the space 90 degrees counterclockwise, then apply positive shear, and then rotate it back again 90
degrees clockwise. Let's say these matrices are called A, B and C respectively. Mathematically, if one imagines these transformations being applied to a vector v, then the final vector would be v_final = C·B·A·v. That is, one first applies A to the vector v to get A·v, then applies B to get B·A·v, and finally applies C to get v_final = C·B·A·v. One can represent the matrix product C·B·A as another matrix L = C·B·A. The matrix L represents the three transformations done one after the other, or in other words a composite transformation matrix (doing the three consecutive transformations is equivalent to the single transformation L).
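A sketch of a composite transformation, assuming the standard matrices for a 90° counterclockwise rotation, a positive shear along x and a 90° clockwise rotation (these concrete matrices are illustrative choices, not taken from the text).

import numpy as np

A = np.array([[0, -1], [1, 0]])        # rotate 90 degrees counterclockwise
B = np.array([[1, 1], [0, 1]])         # positive shear along the x-direction
C = np.array([[0, 1], [-1, 0]])        # rotate 90 degrees clockwise

L = C @ B @ A                          # composite transformation matrix
v = np.array([1, 2])

step_by_step = C @ (B @ (A @ v))       # applying the three transformations one after the other
print(step_by_step, L @ v)             # both give the same result: [1 1]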
Determinants
The determinant of a matrix A, usually denoted as |A|, is a numerical value associated with the matrix. Mathematically,
The determinant represents the magnitude by which the area (for 2D matrix), volume (for 3D matrix)
and so on is scaled upon linear transformation. For example if the determinant of a 2D matrix is zero,
then it represents a transformation that squishes the 2D space into a straight line or a single point.
Solving this system means finding a combination of x1 , x2 , . . . xn that satisfies all the equations.
One can solve this system of equations algebraically, but in most practical applications one will have
to solve really large sets of equations and variables. Thus, one needs to automate the process of
solving such systems. Matrices give a very nifty way to express and solve these equations. The
equations above can be rewritten in the matrix form as,
Now, solving the system of linear equations boils down to just solving the matrix equation Ax = b ,
i.e. finding a vector x which satisfies the condition Ax = b . Thus, solving a system of equations (no
matter how many of them) gets reduced to computing the inverse of a matrix and multiplying it by a
vector. More importantly, since matrix operations can be parallelised, large systems can be solved in
a matter of seconds.
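A minimal NumPy sketch of solving such a system; the coefficient matrix and right-hand side are illustrative values.

import numpy as np

# Illustrative system:  2*x1 + x2 = 5  and  x1 + 3*x2 = 10
A = np.array([[2, 1],
              [1, 3]])
b = np.array([5, 10])

x = np.linalg.solve(A, b)              # solves Ax = b directly
print(x)                               # [1. 3.]
print(np.linalg.inv(A) @ b)            # same answer via x = A^-1 b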
Inverse Matrix
Solving a system of linear equations Ax = b is equivalent to finding the unique vector x that lands on the vector b after the transformation A, i.e. x = A⁻¹b. A⁻¹ is known as the inverse of matrix A. In most cases, A represents a transformation wherein the span is maintained (i.e. 2D stays 2D, 3D stays 3D, etc.). But in the rare cases when A happens to squish the space into a lower dimensional one, such as squishing a 3D space onto a plane or a straight line, it becomes quite unlikely that a vector x exists which lands on b. The problem is that the vector Ax lies in that lower dimensional space, but the vector b does not. Hence, this problem becomes unsolvable in an exact sense. In such cases one can say that the system of equations does not have a solution. Such situations are reflected by the fact that the determinant of A is zero, and equivalently the inverse A⁻¹ does not exist. Non-invertible matrices are also called singular matrices.
Rank of a Matrix
The rank of a matrix is the dimensionality of the output of the transformation. For example consider
the matrix,
Column Space
The column space of a matrix is the span of the columns of the matrix. For example consider the
matrix,
The column space of the matrix is a straight line. Thus, rank is equal to the number of dimensions in
the column space of the matrix.
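A short sketch of computing the rank with NumPy; the rank-1 matrix below (whose columns are all multiples of a single vector, so its column space is a straight line) is an assumed illustration.

import numpy as np

M = np.array([[1, 2],
              [2, 4]])                 # second column = 2 * first column
print(np.linalg.matrix_rank(M))        # 1 -> the column space is a straight line

identity = np.eye(2)
print(np.linalg.matrix_rank(identity)) # 2 -> full rank, the column space is the whole 2D plane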
Null Space
The null space is the space of all vectors that land on the zero vector under the transformation.
This is an important and very frequently used technique as in most of the real-world phenomena, the
matrix A is not invertible. This is because many real-world systems are usually not represented by
square matrices.
Now, the quantity (A − λI) itself is a matrix, and one knows that a matrix-vector multiplication is zero (for a non-zero vector) only when the matrix (transformation) squishes the space into a lower dimensional one. This is represented by the fact that the determinant of the matrix is zero. Thus, solving the equation det(A − λI) = 0 will give the eigenvalues.
Thus, the two eigenvalues λ1 and λ2 have been determined. A square matrix of size n × n always has exactly n eigenvalues (counting multiplicity, some of which may be complex), each with a corresponding eigenvector. The eigenvalue specifies the factor by which the corresponding eigenvector is scaled by the transformation. Now using the equation Av = λv for the two eigenvalues λ1 and λ2, one can find the corresponding eigenvectors.
Since an eigenvector simply represents an orientation (the corresponding eigenvalue represents the
magnitude), all scalar multiples of the eigenvector are vectors that are parallel to this eigenvector,
and are therefore equivalent (i.e. upon normalizing the vectors, they would all be equal). Thus,
instead of further solving the above system of equations, one can freely choose a real value for
either x11 or x12, and determine the other one by using the final equations. For example, choosing x12 = 1 for λ = −1 gives x11 = −1, and choosing x12 = 1 for λ = 4 gives x11 = 3/2, giving the corresponding eigenvectors as v1 = [−1, 1] and v2 = [3/2, 1].
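The same eigenvalues and eigenvectors can be obtained numerically. The matrix below is assumed for illustration; it has the eigenvalues −1 and 4 discussed above (NumPy returns normalised eigenvectors, i.e. scalar multiples of [−1, 1] and [3/2, 1]).

import numpy as np

A = np.array([[2, 3],
              [2, 1]])                 # assumed example matrix with eigenvalues -1 and 4

eigen_val, eigen_vec = np.linalg.eig(A)
print(eigen_val)                       # [ 4. -1.] (order may vary)
print(eigen_vec)                       # columns are the normalised eigenvectors

v = eigen_vec[:, 0]
print(A @ v, eigen_val[0] * v)         # A v = lambda v holds for each pair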
Eigendecomposition of a Matrix
Similar to the prime factorisation of numbers (i.e. breaking down an integer as a product of its prime
factors, such as 12 = 2 × 2 × 3 ), one can write a matrix as a product of other matrices. These other
matrices are matrices formed by the eigenvectors and eigenvalues of the original matrix (though
there's no analogy between eigenvalues and prime numbers). One can decompose any diagonalisable matrix A as,
A = Q·Σ·Q⁻¹
Here, Q is a matrix whose columns are the eigenvectors of A and Σ is a diagonal matrix whose diagonal entries are the eigenvalues of A. This is called the eigendecomposition of the matrix A. Most square matrices are diagonalisable, with some exceptions. A matrix A is diagonalisable if there exists some invertible matrix P such that P⁻¹AP is a diagonal matrix.
Using matrix notation, the eigenvectors v1 and v2 can be arranged as the columns of a matrix Q = [v1 v2]. Also, another diagonal matrix Σ can be created with the eigenvalues as the diagonal entries. Then the above equations can be written compactly in the single matrix equation A·Q = Q·Σ, or equivalently A = Q·Σ·Q⁻¹.
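A short numerical check of the decomposition, reusing the assumed example matrix from above.

import numpy as np

A = np.array([[2, 3],
              [2, 1]])                 # assumed example matrix

eigen_val, Q = np.linalg.eig(A)        # Q holds the eigenvectors as its columns
Sigma = np.diag(eigen_val)             # diagonal matrix of eigenvalues

print(np.allclose(Q @ Sigma @ np.linalg.inv(Q), A))   # True: A = Q . Sigma . Q^-1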
Functions
A function is a relationship, or a mapping, between inputs and outputs. To quickly reiterate, if a
function f takes the input x and returns the output y , then this relation is denoted as y = f (x) . The
element x is called the argument of the function and y is the value of the function. Some of the
common functions have been given in the following table.
y = mx + b    y = |x|    y = x²    y = x³
y = 1/x    y = eˣ    y = ln(x)    y = floor(x)
Differentiation
The slope of a function at a point is basically the derivative of the function computed at that point.
Differentiation is the action of computing a derivative. The derivative of a function y = f (x) is a
measure of the rate at which the value y of the function changes with respect to the change in the
variable x . It is called the derivative of f with respect to x .
Although in practice one usually computes derivatives using the rules, it is important to understand
the derivatives by first principles. This method is the foundation for the rest of differential calculus.
Every differentiation rule and identity in calculus is based on the concept of derivatives by first
principles. The rule is as follows,
f′(x) = lim (h → 0) [f(x + h) − f(x)] / h
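A tiny numerical illustration of the first-principles rule, assuming f(x) = x² as the example function and a small but finite h.

def derivative(f, x, h=1e-6):
    # difference quotient (f(x + h) - f(x)) / h with a small h
    return (f(x + h) - f(x)) / h

f = lambda x: x**2
print(derivative(f, 3.0))              # ~6.0, since d(x^2)/dx = 2x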
Some of the common differentiation rules are given in the following table.
Critical Points : Critical points are points where the function reaches a local or global maximum or
minimum value (or sometimes an inflexion point). A function f (x) has a critical point x if f ′(x) = 0 at
that point, or if the function is non-differentiable. It is not necessary that if f ′(x) = 0 then the point
needs to be either a maximum or a minimum, such points are called inflexion points. The following
figure explains the same.
Maxima and Minima : A differentiable function's maxima and minima points satisfy the condition
f ′(x) = 0 . To find whether a point is maxima or minima, one can compute the derivatives in the
vicinity of the point. If the derivative is positive to the right and negative to the left of the critical
point, then it is a minima and vice versa. But if the derivative does not change the sign from right to
the left of the critical point, then it is an inflexion point. There also is another way to deduce whether
a critical point is a maximum or a minimum by computing the double derivative. If f″(x) > 0, it is a minimum; if f″(x) < 0, it is a maximum; and if f″(x) = 0, the test is inconclusive (the point may be an inflexion point).
Partial Derivatives
There is often the need to compute the rate of change of multivariate functions f (x, y ) with respect
to the variables x and y . This can be done by using partial derivatives, i.e. derivatives of a function
computed with respect to only one variable. For example, the partial derivatives of f(x, y) = x² + y² with respect to x and y respectively are,
∂f/∂x = 2x and ∂f/∂y = 2y
This represents the rate at which the value of the function changes with the change in value of x or
y , while the value of y or x is kept constant (while computing a partial derivative, all other variables
of the function are kept constant).
Total Derivatives
In a function f(x, y), the variables x and y are usually assumed to be independent. However, in some situations, they may be dependent on some other common variables. For example, both x and y may themselves be varying with time t, i.e. x = x(t) and y = y(t). In such cases, one cannot assume that x and y are independent and thus, one cannot compute the partial derivatives assuming so. This is where total derivatives come into play. The total derivative is somewhat analogous to the rate of change of a function with respect to all its variables. For example, the total derivative of f(x, y) = x² + y² with respect to t is,
df/dt = (∂f/∂x)·(dx/dt) + (∂f/∂y)·(dy/dt) = 2x·(dx/dt) + 2y·(dy/dt)
Although distance and speed have different units, this is valid since the elements of a vector-valued
function are pretty much independent of each other. One can also think about derivatives of vector
functions. The derivative F ′(t) will also be a vector of size two. Its first component will represent the
rate of change of the distance S (t) with time (i.e. the speed), while its second component will
represent the rate of change of the speed v (t) with time (i.e. its acceleration). Thus,
So, the vector functions are almost a trivial extension of the usual univariate and multivariable
functions, with the main advantage being that they make the notation compact, i.e. they can store
multiple output variables (such as distance and speed) in one single vector, rather than maintaining
many variables.
Jacobian
The derivatives of a multivariate function can be stored in a vector called the Jacobian. For example, for the function f(x, y) = x² + y², the Jacobian is J_f = [∂f/∂x, ∂f/∂y] = [2x, 2y].
Geometrically, the Jacobian tells the slope of the function in all directions. The first component tells
the rate of change of f with respect to x (or the slope in the x direction) while the second
component tells the rate of change of f with respect to y (or the slope in the y direction). For
example, the Jacobian of the above function at the point (1, 0) is J_f(1, 0) = [2x, 2y] evaluated at (1, 0), i.e. [2, 0]. Thus, at the point (1, 0), ∂f/∂x = 2 and ∂f/∂y = 0, which means that the function's slope is 2 in the x direction and 0 in the y direction. In other words, if one is standing at the point (1, 0) and takes a small step in the positive x-direction, one will move a little uphill (since the slope is positive), while a small step in the y-direction leaves one at roughly the same height (since the slope in that direction is 0).
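A small numerical check of the values quoted above for f(x, y) = x² + y² at the point (1, 0), using a finite-difference approximation of the partial derivatives.

import numpy as np

def f(v):
    x, y = v
    return x**2 + y**2

def jacobian(f, v, h=1e-6):
    # vector of partial derivatives of a scalar function f at the point v
    v = np.asarray(v, dtype=float)
    J = np.zeros(v.size)
    for i in range(v.size):
        step = np.zeros(v.size)
        step[i] = h
        J[i] = (f(v + step) - f(v)) / h
    return J

print(jacobian(f, [1.0, 0.0]))         # ~[2, 0], i.e. slope 2 along x and 0 along y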
Jacobian Matrix
In the most general case, the Jacobian can be extended to vector-valued functions. In such cases,
the Jacobian is a matrix. For example, consider the following vector-valued function,
This is the most general case of a function: it has multiple outputs (f1, f2) (i.e. it is vector-valued) and
multiple inputs (x, y ) (i.e. multivariable). The derivative of this function means to compute the rate of
change of all combinations, i.e. f 1 with respect to x , f 1 with respect to y , f 2 with respect to x and
f 2 with respect to y . This can be stored in a Jacobian matrix as follows,
Hessian Matrix
The Hessian matrix can be thought of as a simple extension of the Jacobian. Just like the Jacobian
contains the first order partial derivatives of a function, the Hessian matrix contains the second order
partial derivatives. The Hessian matrix is heavily used in optimisation algorithms of multivariate
functions, i.e. to find the maxima or minima of functions that depend on multiple variables.
Geometrically, the Hessian is a measure of the curvature of a function. The Hessian will be large at
the curvy regions and smaller at the flat ones. The curvature (what the Hessian measures) is not the
same as slope (what the Jacobian measures). The slope tells of the straight line that best represents
the shape of the function at a point, whereas the curvature tells the inverse of the radius of the circle
which best hugs the shape of the function at that point. The Hessian matrix H of a multivariable
function f is a square n × n matrix, usually defined and arranged as follows,
H_ij = ∂²f / (∂x_i ∂x_j), i.e. the (i, j) entry is the second order partial derivative of f with respect to x_i and x_j.
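A hedged numerical sketch of the Hessian for the same example function f(x, y) = x² + y², using central differences; for this function the Hessian is the constant matrix [[2, 0], [0, 2]].

import numpy as np

def f(v):
    x, y = v
    return x**2 + y**2

def hessian(f, v, h=1e-4):
    # second order partial derivatives H[i, j] = d^2 f / (dx_i dx_j) via central differences
    v = np.asarray(v, dtype=float)
    n = v.size
    H = np.zeros((n, n))
    for i in range(n):
        for j in range(n):
            e_i = np.zeros(n); e_i[i] = h
            e_j = np.zeros(n); e_j[j] = h
            H[i, j] = (f(v + e_i + e_j) - f(v + e_i - e_j)
                       - f(v - e_i + e_j) + f(v - e_i - e_j)) / (4 * h * h)
    return H

print(hessian(f, [1.0, 0.0]))          # ~[[2, 0], [0, 2]]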
INFERENTIAL STATISTICS
Note that even after using inferential statistics, one would only be able to estimate the population
data from the sample data, but not find the exact values. This is because when one doesn't have the
exact data, one can only make reasonable estimates about it with a limited level of certainty.
Therefore, when certainty is limited, one talks in terms of probability.
Let’s consider the following game. There is a bag with three red and two blue balls. Everyone
playing the game gets four chances to pick a random ball from the bag. Every time a ball is picked, its color is noted and then the ball is placed back in the bag for the next pick, till all the four chances are complete. Anyone who gets a combination of four red balls across the four picks will receive a reward of 150 points, and any other combination will get a penalty of 10 points. The question here is whether, in the long run (i.e. if the game is played a lot of times), a player is going to end up with positive or negative points. To find the answer to this question, one has to work through the following details. Each of these steps explains a concept which is very useful for finding the answer.
Random Variables
The following figure shows all the possible outcomes/combinations that can occur for the four picks.
In all there are 16 different possible outcomes. These outcomes can be quantified by using some
variable X representing the number of red balls picked or number of blue balls picked or the
difference in the number of red and blue balls picked. This random variable X basically converts the
outcomes of experiments to something measurable, converting the data entirely into quantitative
terms, making it possible to perform a number of statistical analyses on the data.
Possible Outcomes (combinations of the four picks) :
0 Red Balls, 4 Blue Balls | 1 Red Ball, 3 Blue Balls | 2 Red Balls, 2 Blue Balls | 3 Red Balls, 1 Blue Ball | 4 Red Balls, 0 Blue Balls
Random Variables :
No. of red balls : X = 0, 1, 2, 3, 4
No. of blue balls : X = 4, 3, 2, 1, 0
Difference in red and blue balls : X = −4, −2, 0, 2, 4
X (no. of red balls)    P(X) = (observed frequency)/75
0    2/75 = 0.027
1    12/75 = 0.160
2    26/75 = 0.347
3    25/75 = 0.333
4    10/75 = 0.133
So basically, a probability distribution is ANY form of representation that tells about the probability
for all possible values of X . It could be an equation or table or chart. In a valid, complete probability
distribution, there are no negative values, and the total of all probability values adds up to 1 .
Expected Value
The expected value for a variable X is the value of X one would expect to get after performing the
experiment once. It should be interpreted as the average value one gets after the experiment has
been conducted an infinite number of times. It is also called the expectation or average or mean
value. Mathematically, for a random variable X that can take values x1, x2, x3, ..., xn, the expected value (EV) is given by,
EV(X) = x1·P(X = x1) + x2·P(X = x2) + ... + xn·P(X = xn)
Now using the probabilities calculated and above equation one can find the expected value for the
number of red balls.
So, on average each player is going to earn 11.28 points per game. Thus, a player is very likely to end up with positive points in the long run.
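A short sketch of this expected-value calculation, using the rounded observed probabilities from the table above (150 points for four red balls, a 10 point penalty otherwise).

# Observed probabilities for X = 0, 1, 2, 3, 4 red balls (from the table above).
prob = {0: 0.027, 1: 0.160, 2: 0.347, 3: 0.333, 4: 0.133}

# Points earned: +150 only when all four picks are red, -10 otherwise.
points = {x: (150 if x == 4 else -10) for x in prob}

expected_points = sum(points[x] * prob[x] for x in prob)
print(round(expected_points, 2))       # 11.28 points per game on average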
This same principle can be used in finding whether a game in a casino is going to be profitable for
the house or the player in the long run. If this same game is played in a casino where the points are
replaced with currency then in the long run the house is going to lose money. In order for the house
to always win, the expected value needs to be negative. To achieve this the house can either bring
down the value of rewards on winning or increase the value of penalty on losing the game or may
change the rules of the game. Consider another example, the game of roulette. The European
roulette wheel contains the numbers 0 to 36 written in an irregular sequence. The players can bet
on any number starting from 0 to 36 . For example, let’s say one bets ₹ 1000 on the number 5 .
Upon spinning the wheel if the ball lands on the pocket marked 5 , one would win
₹ 1000 × 36 = ₹ 36000, resulting in net winnings of ₹ 36000 − ₹ 1000 = ₹ 35000. However, if the ball lands on any other pocket, one would not win anything, resulting in net winnings of ₹ 0 − ₹ 1000 = − ₹ 1000. The probability of winning the game if one bets on the number 5 is 1/37 = 0.027. The expected value of X, where X is the random variable for net winnings, is (35000 × P(X = 35000)) + (−1000 × P(X = −1000)) = (35000 × 1/37) + (−1000 × 36/37) ≈ − ₹ 27. Thus, the game of roulette is designed to ensure a negative expected value, helping the house to always win in the long run.
As can be seen in the following figure the theoretical (calculated) values of probability are actually
quite close to the experimental value. The small differences that can be noticed exist because of the
lower number of experiments done.
Binomial Distribution
The binomial distribution gives the discrete probability distribution P (X = r) of obtaining exactly r
successes out of n Bernoulli trials (where the result of each Bernoulli trial is true with probability p
and false with probability 1 − p). The binomial distribution is given by,
P(X = r) = nCr · p^r · (1 − p)^(n − r)
In order to be able to apply the binomial formula, the following conditions need to be satisfied,
1. The total number of trials should be fixed at n .
2. The n trials are independent.
3. Each trial is binary, i.e., has only two possible outcomes, success or failure.
4. Probability of success is same in all trials, denoted by p .
5. The random variable X is the number of successes in the n trials.
Geometric Distribution
A geometric distribution is a special case of a negative binomial distribution with r = 1 . Let X
denote the number of trials until the first success, then the probability distribution is given by,
P(X = n) = (1 − p)^(n − 1) · p
Cumulative Probability
Till now only the probability of getting an exact value (for example, P (X = 4) ) has been discussed.
But, the casino may need to know the probability of getting three or less red balls (i.e P (X ≤ 3) ), as
this is where the players will lose and the house will make money. Sometimes, talking in terms of 'less than or equal to' is more useful, and cumulative probability is used for such cases. The cumulative probability of X is defined as the probability of the variable being less than or equal to x. It is denoted by,
F(x) = P(X ≤ x)
The following figure gives the cumulative probability of the ball picking game.
x F (x) = P (X ≤ x)
0 0.0000 + 0.0256 = 0.0256
1 0.0256 + 0.1536 = 0.1792
2 0.1792 + 0.3456 = 0.5248
3 0.5248 + 0.3456 = 0.8704
4 0.8704 + 0.1296 = 1.0000
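The probabilities in this table come from the binomial distribution with n = 4 picks and success probability p = 3/5 (three red balls out of five). A small sketch using scipy (assumed to be available) reproduces them.

from scipy.stats import binom

n, p = 4, 3/5                          # 4 picks, P(red ball) = 3/5

for r in range(n + 1):
    print(r, round(binom.pmf(r, n, p), 4), round(binom.cdf(r, n, p), 4))
# r  P(X = r)  F(r) = P(X <= r)
# 0  0.0256    0.0256
# 1  0.1536    0.1792
# 2  0.3456    0.5248
# 3  0.3456    0.8704
# 4  0.1296    1.0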
The main difference between the cumulative probability distribution of a continuous random
variable and a discrete one, is the way they are plotted. While the continuous variables’ cumulative
distribution is a curve, the distribution for discrete variables looks more like a bar chart. The reason
for plotting both of these so differently is that, for discrete variables, the cumulative probability does
not change very frequently. In the discrete example, one only cares about what the probability is for
0, 1, 2, 3, . . . (this is because the cumulative probability does not change between, say, 3 and
3.999999 and is equal to 0.8704 ). For continuous variables, PDFs are more commonly used in real
life as it is much easier to see the patterns in PDFs as compared to CDFs. The following figure gives
the CDF and PDF for a few random variables.
The normal (Gaussian) distribution has the probability density function f(x) = (1/(σ√(2π))) · e^(−(x − μ)²/(2σ²)). The parameter μ is the mean or expectation of the distribution (and also its median and mode) and
σ is its standard deviation. The variance of the distribution is σ 2 . A random variable with a Gaussian
distribution is said to be normally distributed. The following figures give the PDF and CDF of some
normally distributed functions.
As can be seen, the value of σ is an indicator of how wide the graph is. This is true for any graph,
not just the normal distribution. A low value of σ means that the graph is narrow, while a high value
implies that the graph is wider. This happens because the wider graph will clearly have more values
away from the mean, resulting in a high standard deviation.
Any data that is normally distributed follows the 1-2-3 rule. This rule states that,
1. There is a 68.27 % probability of the variable lying within 1 standard deviation of the mean.
2. There is a 95.45 % probability of the variable lying within 2 standard deviations of the mean.
3. There is a 99.73 % probability of the variable lying within 3 standard deviations of the mean.
1. P(25 < X < 45) = P(μ − 2σ < X < μ + 2σ) ≈ 47.5% + 47.5% ≈ 95%
2. P (25 < X < 50) = P (μ − 2σ < X < μ + 3σ) ≈ 47.5% + 49.85% ≈ 97.35%
3. P (X < 40) = P (0 < X < μ + σ ) ≈ 50% + 34% ≈ 84%
In the above examples one could see that for finding the probability of a random variable X, one is basically finding out how far the random variable X is from the mean μ. For example, the random variable X = 43.5 is 8.5 units away from the mean. But in standard terms it can be said to be 1.7 σ, or (8.5/5) standard deviations, away from the mean. This value of 1.7 is called the standardised random variable or z-score, which is given by,
Z = (X − μ) / σ
Basically, it tells how many standard deviations away from the mean μ , the random variable X is.
The standardised random variable Z is a much more informative variable than the unstandardized
random variable X while dealing with cumulative probabilities. A positive value of Z means that the
value is to the right of the center implying high cumulative probability and vice versa. The
cumulative probability corresponding to a given value of Z (say 0.68 ) can be found out using the
Z-table as shown in the following table.
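In place of a printed Z-table, the cumulative probability for a z-score can be obtained with scipy (assumed available); the numbers below use μ = 35 and σ = 5 from the examples above.

from scipy.stats import norm

mu, sigma = 35, 5
z = (43.5 - mu) / sigma
print(round(z, 2))                     # 1.7 standard deviations above the mean

print(round(norm.cdf(z), 4))           # P(Z <= 1.7) ~ 0.9554
print(round(norm.cdf(0.68), 4))        # Z-table lookup for z = 0.68 ~ 0.7517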
Student’s T-Distribution
In real life scenarios most of the time one does not have all the information regarding a population
(such as the population standard deviation). In such scenarios the t-distribution is used. It is a
continuous probability distribution that arises while estimating the mean of a normally distributed
population in situations where the sample size is small and the population standard deviation is
unknown. It is similar to the normal distribution in many cases (for example, it is symmetrical about
its central tendency). However, it has a lower peak than the normal distribution and heavier tails, implying that it has a larger standard deviation. The general form of its probability density function is,
The parameter v = n − 1 is the degrees of freedom where n is the number of observations. The
following figures give the PDF and CDF of some t-distributed functions and comparison with the
normal distribution.
t-distribution vs z-distribution
Gamma Distribution
The gamma distribution is a two-parameter family of continuous probability distributions. The
exponential distribution and chi-squared distribution are special cases of the gamma distribution. It
is frequently used to model waiting times. The continuous random variable X follows a gamma
distribution for x (the waiting time) until the k-th event occurs, if its probability density function is,
The following figures give the PDF and CDF of some gamma distributed functions.
Exponential Distribution
The exponential distribution is the probability distribution of the time between events in a Poisson
point process, i.e., a process in which events occur continuously and independently at a constant
average rate. The continuous random variable X follows an exponential distribution if its probability density function is,
f(x) = λ·e^(−λx) for x ≥ 0 (and 0 otherwise), where λ is the rate parameter.
The following figures give the PDF and CDF of some exponentially distributed functions.
Chi-Squared Distribution
The chi-square distribution (also chi-squared or χ2 -distribution) with k degrees of freedom is the
distribution of a sum of the squares of k independent standard normal random variables. It is one of
the most widely used probability distributions in inferential statistics, notably in hypothesis testing
and in the construction of confidence intervals. The continuous random variable X follows a chi-squared distribution if its probability density function is,
The following figures give the PDF and CDF of some chi-squared distributed functions.
F Distribution
The F-distribution, also known as Snedecor's F distribution or the Fisher–Snedecor distribution is a
continuous probability distribution that arises frequently as the null distribution of a test statistic,
most notably in the analysis of variance. The general form of its probability density function is,
Samples
A population is an aggregate of creatures, things, cases and so on. A population commonly contains
too many individuals to study conveniently, so an investigation is often restricted to one or more
samples drawn from it. A well chosen sample contains most of the information about a particular
population parameter but the relation between the sample and the population must be such as to
allow true inferences to be made about a population from that sample. So, the first important
attribute of a sample is that every individual in the population from which it is drawn must have a
known non-zero chance of being included in it. Statistics (such as averages, standard deviations
etc.) when taken from populations are referred to as population parameters. The following table
gives the notations and formulae related to populations and their samples.
Second Quartile / Median : Q2, the 50th percentile, cutting the data set in half
Interquartile Range : IQR = Q3 − Q1
Population Mean : μ
Population Standard Deviation : σ
Sample Mean : X
Sample : (X1, X2, ..., Xn)
Sample Standard Deviation : S
Sampling Distributions
The sampling distribution, specifically the sampling distribution of the sample means, is a probability
density function for the sample means of a population. This distribution has some very interesting
properties, which will later help one in estimating the sampling error. The following table gives the
sampling distributions for the ball picking game.
The following table gives the notations and formulae related to sampling distributions.
Thus, using the CLT, instead of collecting data from the whole population, a good enough sample can be used to draw inferences about the whole population. However, it would not be fair to infer that the population mean is going to be exactly equal to the sample mean. This is because the flaws of the sampling process must have led to some
error. Hence, the sample mean’s value needs to be reported with some margin of error. This margin
of error is known as the confidence interval.
If there is a sample with sample size n , mean X and standard deviation S , then the confidence
interval corresponding to y percentage of confidence level for μ is given by the range,
X ± Z* · S/√n
where Z* is the Z-score corresponding to the chosen confidence level.
Confidence Level    Z*
90 % ± 1.65
95 % ± 1.96
99 % ± 2.58
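The Z* values in the table can be recovered from the inverse CDF of the standard normal distribution (scipy assumed); for a two-sided interval at confidence level y, Z* is the (1 + y)/2 quantile.

from scipy.stats import norm

for level in (0.90, 0.95, 0.99):
    z_star = norm.ppf((1 + level) / 2)
    print(level, round(z_star, 2))     # 1.64 (often rounded to 1.65), 1.96, 2.58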
At this point, it is important to address a very common misconception. Sampling distributions are just
a theoretical exercise and one is not actually expected to make one in real life. If one wants to
estimate the population mean, one needs to just take a sample rather than create an entire
sampling distribution. The theoretical study of sampling distributions helped in learning more about
CLT so that one can make all the assumptions as stated above. Consider the following examples.
1. The maximum permissible amount of lead in any food product is 2.5 ppm. The aim is to find the
average content of lead in one of the food products. A sample population of n = 100 was
taken having X = 2.3 ppm and S = 0.3 ppm . So before reporting the population mean μ one
needs to find the confidence interval. This being a very sensitive task upon which the entire business depends, a high confidence level of 99 % is chosen, giving the
confidence interval of X ± ( Z * S /√n ) , i.e 2.3 ± 0.077 .
So, one can conclude that the mean lead content in the product lies in the range of
2.223 ppm to 2.377 ppm with a confidence level of 99 %.
2. The amount of paracetamol specified by the drug regulatory authorities is 500 mg with an
allowed error of 10 %. Anything below 450 mg is a quality issue as the drug becomes
ineffective, while anything above 550 mg is a serious regulatory issue. In a pharma company
there are a number of identical manufacturing plants, each of which produces approximately
10, 000 tablets per hour. The aim is to ensure that the manufacturing process is running
successfully by measuring the average amount of paracetamol in each tablet. A sample
population of n = 100 tablets was taken, having X = 530 mg and S = 100 mg. Upon
calculating the average amount of paracetamol in each tablet it was found that the content
of paracetamol was in the range of 513.5 − 546.5 mg , 510.4 − 549.6 mg and 504.2 − 555.8 mg
for 90 %, 95 % and 99 % of confidence level respectively.
Thus, it can be claimed that the tablet is fit to consume and effective only at a confidence
level of 90 %.
3. A certain website is surveying which of the two features A and B is better. The survey was
conducted on a sample population of n = 10, 000 . Now it was found that 50.5 % of people
preferred the feature B over the feature A . If the random variable X is taken to be the
proportion of people that prefer feature B over the feature A , then, for this sample
If the margin of error is taken to be 1 %, then that would mean that μ , which is the proportion
of people that prefer feature B over the feature A , lies between the range 50.5 ± 1 %, i.e.
49.5 % to 51.5 %. It would then be difficult for anyone to say with certainty whether μ would
be more than 50 % or not. So, even though the proportion of people that prefer feature B
over the feature A is more than 50 % in the sample, one would not be able to say with
certainty that this proportion would be more than 50 % for the entire population.
On the other hand, if the margin of error is taken to be 0.3 %, then one would be able to say
that the population mean lies within the range 50.5 ± 0.3 %, i.e. 50.2 % to 50.8 %. So, one
would be able to say with certainty that the proportion of people that prefer feature B over
the feature A is more than 50 % in the sample and for the entire population too. The margin
of error corresponding to 90 % confidence level is given by Z * S /√n = 0.0033 ( 0.33 %), and
the population mean lies between 50.5 ± 0.33 %, i.e. 50.17 % to 50.83 %.
Hence, one can say that feature B should replace feature A with a confidence of 90 %.
HYPOTHESIS TESTING
Hypothesis Testing
One should not confuse inferential statistics with hypothesis testing owing to their similar terminologies. Inferential statistics is used to find some population parameter (mostly the population mean) when there is no initial number to start with. So, one starts with the sampling activity and finds out the sample mean. Then, the population mean is estimated from the sample mean using the confidence interval. Hypothesis testing, on the other hand, is used to confirm a conclusion (or hypothesis) about the population parameter (which is known from EDA or intuition). Through hypothesis testing,
one can determine whether there is enough evidence to conclude if the hypothesis about the
population parameter is true or not. The various steps in Hypothesis Testing are,
1. Null hypothesis (H 0 ) : The status quo or prevailing belief about the population.
2. Alternate hypothesis (H 1 ) : The challenge to the status quo or complement of the null
hypothesis.
Some examples of null and alternate hypotheses are given in the following table.
Criminal Trial : H0 : Defendant is innocent. H1 : Defendant is not innocent.
Average lead content in food products should be less than 2.5 ppm : H0 : Average lead content is less than or equal to 2.5 ppm. H1 : Average lead content is more than 2.5 ppm.
A company claims its total valuation is at least $10 billion : H0 : Total valuation is more than or equal to $10 billion. H1 : Total valuation is less than $10 billion.
Average demand of AC units is 350 units per month in summers : H0 : μ = 350. H1 : μ ≠ 350.
As can be seen, for instances where the claim statement has words like "at least", "at most", "less than" or "greater than", one cannot formulate the null hypothesis just from the claim statement, as it is not necessary that the claim is always about the status quo. In such scenarios one can use the following rules to formulate the null and alternate hypotheses,
1. The null hypothesis always contains the sign of equality (=, ≤ or ≥).
2. The alternate hypothesis never contains the sign of equality (it uses <, > or ≠).
The formulation of the null and alternate hypotheses determines the type of the test and the
position of the critical regions in the normal distribution. One can tell the type of the test and the
position of the critical region on the basis of the sign in the alternate hypothesis. The following table
gives the same.
Types of Errors
There are two types of errors that can result during the hypothesis testing process.
1. Type-I Error : It is represented by α and occurs when a true null hypothesis is rejected.
2. Type-II Error : It is represented by β and occurs when one fails to reject a false null hypothesis.
Fail to reject null hypothesis : Correct Decision (when the null hypothesis is true) ; Type-II Error (β) (when it is false)
Reject null hypothesis : Type-I Error (α) (when the null hypothesis is true) ; Correct Decision (when it is false)
The power of any hypothesis test is defined by 1 − β . If one goes back to the analogy of the criminal
trial example, one would find that the probability of making a Type-I error would be more if the jury
convicts the accused even on less substantial evidence. The Type-I error can be reduced if the jury
adopts more stringent criteria to convict an accused party. However, reducing the probability of a
Type-I error may increase the probability of making a Type-II error, i.e. if the jury becomes very
liberal in acquitting the people on trial, there would be a higher probability that an actual criminal
walks free. The following figure represents the same.
2. A new quantity α (also known as the significance level) is defined for the test. It refers to the probability of the sample mean lying in the critical region. This value corresponds to the probability of observing such an extreme value by chance, or the probability of rejecting the
null hypothesis when it is true. One can take any value for α such as 0.01 (1 %) , 0.05 (5 %) ,
0.1 (10 %) and so on as per the sensitivity of the test. A significance level of 0.05 indicates
that there is 5 % risk of concluding that a difference exists when actually there is no
difference.
3. The cumulative probability of the UCV (upper critical value) is calculated from the value of α. Then the z-score (Zc) is calculated using this cumulative probability and the Z-table as shown in the following table.
4. The critical values, UCV and LCV, are calculated from the value of Zc using the formula μ ± (Zc · S/√n).
5. Finally the decision is made on the basis of the value of the sample mean x with respect to
the critical values UCV and LCV.
1. A manufacturer claims that the average life of its product is 36 months. An auditor selects a
sample of size n = 49 units of the product having average life of X = 34.5 months and
standard deviation of S = 4 months. The hypotheses being considered are, H 0 : μ = 36
months and H 1 : μ =/ 36 months. This being a two-tailed test, for a significance level of 3 %
(i.e. α = 0.03), the cumulative probability up to the UCV is 1 − (0.03/2) = 0.9850. The z-score for 0.9850 can be read from the Z-table as 2.17 (2.1 on the horizontal axis and 0.07 on the vertical axis). The LCV and UCV can then be calculated as LCV = 36 − (2.17 × 4/√49) = 34.76 and UCV = 36 + (2.17 × 4/√49) = 37.24.
It can be seen that the sample mean in this case is 34.5 months, which is less than the LCV.
This implies that the sample mean lies in the critical region and thus the null hypothesis can
be rejected.
2. A retail chain claims that the average demand for AC units is at most 350 units per month in summers. A sample of n = 36 stores is taken, having average sales of X = 370.16 units and a standard deviation of S = 90 units. The hypotheses being considered are H0 : μ ≤ 350 units and H1 : μ > 350 units. This being a one-tailed test (more specifically an upper-tailed test), for a significance level of 5% (i.e. α = 0.05), the cumulative probability of the UCV is 1 − 0.05 = 0.9500. The z-score for 0.9500 can be read from the Z-table as 1.645 (as the value 0.9500 is not present in the Z-table, the nearest values 0.9495 and 0.9505 are taken and the average of their z-values considered, i.e. (1.64 + 1.65)/2 = 1.645). The UCV can then be calculated as UCV = 350 + (1.645 × 90/√36) = 374.67.
It can be seen that the sample mean in this case is 370.16 units, which is less than the UCV. This implies that the sample mean lies in the acceptance region and thus one fails to reject the null hypothesis.
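The critical values in the two examples above can also be obtained programmatically. The following is a minimal sketch using scipy (assumed to be available), with the numbers taken from the examples themselves:
>> from scipy.stats import norm
>> mu, S, n, alpha = 36, 4, 49, 0.03                   # values from the product-life example
>> z_c = norm.ppf(1 - alpha/2)                         # ~2.17, two-tailed critical z-score
>> LCV, UCV = mu - z_c*S/n**0.5, mu + z_c*S/n**0.5     # ~34.76 and ~37.24
>> 350 + norm.ppf(1 - 0.05) * 90/36**0.5               # ~374.67, UCV for the upper-tailed AC example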
p-Value Method : Here too, the first step is to formulate the null and alternate hypotheses. Thereafter,
2. The z-score is calculated for the sample mean point X on the distribution using the formula z-score = (X − μ)/(σ/√n), with the sample standard deviation S used in place of σ when σ is not known.
3. The p-value is then calculated from the cumulative probability for the calculated z-score using the Z-table. To find the correct p-value from the z-score, the cumulative probability is first found by simply checking the Z-table (which gives the area under the curve up to that point), as shown in the following table.
4. A decision is made on the basis of the p-value with respect to the given significance level α. It should be noted that the p-value is multiplied by two for a two-tailed test.
Consider the following examples using the p-value method to make a decision about a hypothesis.
1. A manufacturer claims that the average life of its product is 36 months. An auditor selects a sample of size n = 49 units of the product, having an average life of X = 34.5 months and a standard deviation of S = 4 months. The hypotheses being considered are H0 : μ = 36 months and H1 : μ ≠ 36 months. The z-score for the sample mean point (X = 34.5) is (34.5 − 36)/(4/√49) = −2.62 (since the sample mean lies on the left side of the hypothesised mean of 36 months, the z-score comes out to be negative). The cumulative probability for −2.62 can be read from the Z-table as 0.0044 (the value corresponding to −2.6 on the horizontal axis and 0.02 on the vertical axis). Since this is a two-tailed test, the p-value is 2 × 0.0044 = 0.0088.
For a significance level of 3%, it can be seen that the p-value is less than the significance level (0.0088 < 0.03). Thus, the null hypothesis can be rejected.
2. A retail chain claims that the average demand for AC units is at most 350 units per month in summers. A sample of n = 36 stores is taken, having average sales of X = 370.16 units and a standard deviation of S = 90 units. The hypotheses being considered are H0 : μ ≤ 350 units and H1 : μ > 350 units. The z-score for the sample mean point (X = 370.16) is (370.16 − 350)/(90/√36) = 1.344 (since the sample mean lies on the right side of the hypothesised mean of 350 units, the z-score comes out to be positive). The p-value for 1.344 can be calculated from the Z-table as 1 − 0.9099 = 0.0901 (0.9099 being the value corresponding to 1.3 on the horizontal axis and 0.04 on the vertical axis). Since the sample mean lies on the right side of the distribution and this is an upper-tailed test, the p-value remains 0.0901 (no doubling is needed).
For a significance level of 5%, it can be seen that the p-value is more than the significance level (0.0901 > 0.05). Thus, one fails to reject the null hypothesis.
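The p-values above can also be computed directly instead of using the Z-table. A minimal sketch using scipy (assumed available), with the numbers taken from the two examples:
>> from scipy.stats import norm
>> z = (34.5 - 36) / (4 / 49**0.5)        # -2.625 (rounded to -2.62 in the example above)
>> 2 * norm.cdf(z)                        # ~0.0087, two-tailed p-value (~0.0088 with the rounded z)
>> z = (370.16 - 350) / (90 / 36**0.5)    # ~1.344
>> norm.sf(z)                             # ~0.0895, upper-tailed p-value (~0.0901 with the rounded table lookup)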
T-Test
A t-test is a statistical test used to compare the mean of a sample against a hypothesised value, or to compare the means of two given samples. Similar to the z-test, the t-test also assumes a normal distribution of the sample. It is used when the population parameters (such as the mean μ, standard deviation σ, etc.) are not known. The test statistic used is given by t = (X − μ)/(S/√n).
The method for hypothesis testing is similar to that of the z-test critical value method, except that the T-table is used instead of the Z-table while calculating the critical value (tc in place of Zc). The T-table contains the values of tc for a given number of degrees of freedom and significance level, as shown in the following table.
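The critical value from the T-table can also be looked up programmatically; a minimal sketch with scipy, where the significance level and degrees of freedom are purely illustrative:
>> from scipy.stats import t
>> t_c = t.ppf(1 - 0.05/2, df=48)   # ~2.01, two-tailed critical value for alpha=0.05 and 48 degrees of freedom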
Some situations where a one-sample t-test (a single sample mean compared against a hypothesised value) can be used are,
1. The average birth weight for babies born in cities in India is 2.9 kg. One wants to compare the average birth weight of a sample of babies born in villages against this value.
2. The recommended total cholesterol level by doctors is below 200 mg/dl. One wants to test whether the samples gathered from all the metro cities are, on average, statistically different from this recommended level.
Some situations where a paired two-sample t-test can be used are,
1. There is a new drug being tested. One would need to compare the same sample of patients before and after the drug is taken to check whether the results are different or not.
2. There is a hypothesis that Virat Kohli performs as well as, or better, in the second innings of a Test match than in the first innings. This would be a two-sample mean test, where sample 1 would contain his scores from the first innings and sample 2 would contain his scores from the second innings. It would be a paired test, since each row in the data would correspond to the same match.
Some situations where an unpaired two-sample t-test can be used are,
1. There is a new drug being tested. One needs to compare its effectiveness with that of the standard available drug. So, one needs to take a sample of patients who have consumed the new drug and compare it with another sample of patients who have consumed the standard drug.
2. There is a hypothesis that Virat Kohli performs better than Sachin Tendulkar in the second innings of an ODI match. This would be a two-sample mean test, where sample 1 would contain Kohli's scores from the second innings and sample 2 would contain Sachin's scores from the second innings. It would be an unpaired test, since the two sets of innings come from different matches.
A situation where a two-sample proportion test can be used is,
1. Two drugs are being compared for their effectiveness. The desired outcome of a drug is defined as a success. So, one needs to take a sample of patients who have consumed the new drug, record the number of successes, and compare it with the number of successes in another sample of patients who have consumed the standard drug.
A/B Testing
1. During the development of an e-commerce website, there could be different opinions about the choices of various elements, such as the shape of buttons, the text on the call-to-action buttons, the colour of various UI elements, the copy on the website, or numerous other such things. Often, the choice of these elements is very subjective, and it is very difficult to predict which option would perform better. To resolve such conflicts, one can use A/B testing. It provides a way to test two different versions of the same element and check which one performs better.
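A two-sample proportion test (which is also the statistical basis of such A/B tests) can be carried out with statsmodels. A minimal sketch, where the success counts and sample sizes are purely hypothetical:
>> from statsmodels.stats.proportion import proportions_ztest
>> successes, samples = [45, 60], [500, 520]              # hypothetical conversions for versions A and B
>> stat, p = proportions_ztest(count=successes, nobs=samples)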
Chi-Square Test
A Chi-Square test is a statistical test used to determine whether there is a statistically significant difference (i.e. a magnitude of difference that is unlikely to be due to chance alone) between the expected frequencies and the observed frequencies in one or more categories of a so-called contingency table. The test statistic used is given by χ² = Σ [(Observed − Expected)² / Expected].
The method for hypothesis testing is similar to the z-test p-value method, except that the Chi-Square table is used instead of the Z-table while calculating the p-value. The Chi-Square table contains the chi-square values for a given number of degrees of freedom and significance level, as shown in the following table.
1. A public opinion poll surveyed a sample of n = 1000 voters. Respondents were classified by a first categorical variable, gender (male or female), and by a second categorical variable, voting preference (Republican, Democrat, or Independent). The results are shown in the following contingency table.
The hypotheses being considered are H0 : there is no relationship between gender and voting preference, and H1 : there is a relationship between gender and voting preference. The expected values are calculated assuming the null hypothesis is correct. Thus, the expected values are E(Male, Republican) = (400 × 450)/1000 = 180, E(Female, Republican) = (600 × 450)/1000 = 270, and so on. The test statistic is χ² = [(200 − 180)²/180] + [(250 − 270)²/270] + ... = 16.2. For a significance level of 5% and df = (levels of gender − 1) × (levels of voting preference − 1) = (2 − 1) × (3 − 1) = 2, the p-value = 0.0003 (i.e. the probability that the test statistic is more extreme than 16.2).
Since the p-value is less than the significance level (0.0003 < 0.05), the null hypothesis can be rejected.
1. A toy company prints cricket cards. The company claims that 30% of the cards are rookies, 60% are veterans but not All-Stars, and 10% are veteran All-Stars. The hypotheses being considered are H0 : the proportion of rookies, veterans and All-Stars is 30%, 60% and 10% respectively, and H1 : at least one of the proportions is not true. For a sample of n = 100 cards, the counts of the various types were 50 rookies, 45 veterans and 5 All-Stars. The expected values are calculated assuming the null hypothesis is correct. Thus, the expected values are E(rookie) = 0.3 × 100 = 30, E(veteran) = 0.6 × 100 = 60, and so on. The test statistic is χ² = [(50 − 30)²/30] + [(45 − 60)²/60] + ... = 19.58. For a significance level of 5% and df = (levels of player − 1) = 3 − 1 = 2, the p-value = 0.0001 (i.e. the probability that the test statistic is more extreme than 19.58).
Since the p-value is less than the significance level (0.0001 < 0.05), the null hypothesis can be rejected.
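This goodness-of-fit calculation can be reproduced with scipy; a minimal sketch using the observed and expected counts from the toy-card example above:
>> from scipy.stats import chisquare
>> stat, p = chisquare(f_obs=[50, 45, 5], f_exp=[30, 60, 10])   # statistic ~19.58, p-value well below 0.05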
ANOVA
The two-sample t-test can only validate hypotheses involving two groups at a time. For samples involving three or more groups, the t-test becomes tedious, as one has to perform a test for every pair of groups; the overall Type-I error also increases in this process. So, analysis of variance (ANOVA) is used to statistically assess the equality of means. ANOVA can determine whether the means of three or more groups are different. It uses the F-test to statistically test the equality of means. Consider the following example.
1. A test was conducted in a workplace, and the feedback on three e-commerce platforms was recorded in a dataset. The following table shows the data.
The hypotheses being considered are H0 : all the platforms are equally popular (μ1 = μ2 = ... = μp) and H1 : at least one of the platforms has a popularity different from the rest (i.e. at least one μi differs). The following table gives the method of calculation of the F-ratio.
Since the calculated F-value is greater than the F-critical value (4.7800 > 3.3690), the null hypothesis can be rejected.
Chi-Square Test
>> from scipy.stats import chi2_contingency
>> stat, p, dof, expected = chi2_contingency([data['col_1'], data['col_2']])
F-Test ANOVA
>> from scipy.stats import f_oneway
>> stat, p = f_oneway(data['col_1'], data['col_2'], data['col_3'])
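T-Test
The t-tests discussed earlier are not shown above; the following is a minimal sketch using scipy, reusing the same hypothetical data['col_…'] columns as the snippets above.
>> from scipy.stats import ttest_1samp, ttest_ind, ttest_rel
>> stat, p = ttest_1samp(data['col_1'], popmean=36)     # one-sample test against a hypothesised mean
>> stat, p = ttest_ind(data['col_1'], data['col_2'])    # two-sample (unpaired) test
>> stat, p = ttest_rel(data['col_1'], data['col_2'])    # two-sample (paired) test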
Exploratory Data Analysis
The broad steps involved in exploratory data analysis are,
1. Data Sourcing
2. Data Cleaning
3. Univariate Analysis
4. Bivariate Analysis
5. Derived Metrics
Data Sourcing
There are two major kinds of data which can be classified according to the source,
1. Public data : A large amount of data collected by the government or other public agencies is
made public for the purposes of research. Such data sets do not require special permission
for access and are therefore called public data.
2. Private data : It is that data which is sensitive to organisations and is thus not available in the
public domain. Banking, telecom, retail, and media are some of the key private sectors that
rely heavily on this data to make decisions.
It should be noted that public data isn’t always relevant and private data isn’t always easily available.
The following data sources are handy when looking for data sets.
1. https://github.com/awesomedata/awesome-public-datasets
2. https://data.gov.in/
3. https://github.com/datameet
Public Data
Public data is available on the internet on various platforms. A lot of data sets are available for direct
analysis, whereas some of the data have to be manually extracted and converted into a format that
is fit for analysis.
Private Data
A large number of organisations seek to leverage data analytics to make crucial decisions. As
organisations become customer-centric, they utilise insights from data to enhance customer
experience, while also optimising their daily processes. While banks use data to make credit related
decisions, telecoms use it to optimise plans for customers and predict customer churn. While retail
data analytics helps drive decisions such as pricing and stocking, the HR data analytics helps
identify and predict employee behaviour. The media industry uses data extensively to target viewers better, while advertisers use it to identify the best avenues for reaching customers, and journalists use the same data to create visualisations that convey information more effectively.
Formatting Errors
Formatting errors such as ill-formatted and unclearly named rows and columns need to be
addressed first. The following steps are used to correct some of these issues at the level of rows
and columns.
Fixing Rows :
Fixing Columns :
Missing Values
Another common data quality issue is missing values. If there are reliable external sources, one can fill in the missing values with that information. But often, it is better to let the missing values be and continue with the analysis rather than extrapolate from the available information, since good methods add information while bad methods exaggerate it. The following steps are used to treat missing values.
1. Identify values that indicate missing data but are not recognised by the software as such (disguised missing values such as blank strings, “NA”, “XX”, “999”, etc.), and treat them as missing.
2. Either fill in reliable data from external sources or, better, keep the values as missing rather than exaggerating the data.
3. Delete rows in which the missing values are significant in number, provided this does not impact the analysis. Similarly, delete columns in which the missing values are quite significant in number.
4. Fill partial missing values using business judgement (such as missing time zone, century, etc.)
as such values are easily identifiable.
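A minimal pandas sketch of some of these steps; the file name, the disguised missing-value codes and the column names are all hypothetical:
>> import pandas as pd
>> data = pd.read_csv('data.csv', na_values=['', 'NA', 'XX', '999'])   # recognise disguised missing values
>> data.isnull().sum()                                                 # count missing values per column
>> data = data.drop(columns=['col_with_many_missing'])                 # drop a column with too many missing values
>> data = data.dropna(subset=['col_1'])                                # drop rows missing a critical value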
Standardising Values
Scaling of data ensures that all the values have a common scale, which makes analysis easier. Along with this, removing outliers is another important step in data cleaning, as outliers may disproportionately affect the results of the analysis and lead to faulty interpretations. There is no fixed definition of an outlier; what counts as one has to be judged on the basis of the data and the problem at hand. The following steps are used to standardise values.
1. Standardise all the values so as to ensure that all observations under a variable have a
common and consistent unit (such as converting lbs to kgs, miles/hr to km/hr, etc.)
2. Scale the values if required to ensure that the observations under a variable have a common
scale.
3. Standardise the precision for better presentation of data (such as 4.5312341 kgs to 4.53 kgs
etc.).
6. Standardise the case for textual data (such as using uppercase, lowercase, title case,
sentence case etc.).
7. Standardise the format wherever required (such as 23/10/16 to 2016/10/23, “Modi, Narendra” to “Narendra Modi”, etc.).
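A minimal pandas sketch of some of these standardisation steps; the column names and the conversion factor are illustrative:
>> import pandas as pd
>> data['weight_kg'] = data['weight_lbs'] * 0.4536              # standardise units (lbs to kg)
>> data['weight_kg'] = data['weight_kg'].round(2)               # standardise precision
>> data['name'] = data['name'].str.title()                      # standardise case of textual data
>> data['date'] = pd.to_datetime(data['date'], dayfirst=True)   # standardise date format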
Invalid Values
A data set can contain invalid values in various forms. Some of the values could be truly invalid (such as a string “tr8ml” in a variable containing mobile numbers, a height of 11 ft in a set containing the heights of children, etc.). On the other hand, some other invalid values can be corrected (such as a numeric value stored with a string data type, junk characters due to wrong encoding, etc.). The following steps are used to clean invalid values.
1. Encode the data properly while reading values (such as changing the encoding to CP1252
instead of UTF-8 if the data is being read as junk characters etc.).
2. Convert the incorrect data types (such as converting numeric values stored as strings into
number, strings into date etc.).
3. Correct the values that go beyond the logical range (such as a temperature below −273°C, a human height above 20 ft, etc.). A close look at the data helps in checking whether there is scope for correcting the value or whether the value needs to be removed.
4. Delete the invalid values and treat them as missing values (such as wrong data, values not
belonging to the categorical list etc.).
5. Correct the wrong structure, or remove values that do not follow a defined structure (such as a pin code of 12 digits, a mobile contact number of 20 digits, etc.).
6. Validate the internal rules (such as a date of delivery must definitely be after the date of the
order etc.).
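A minimal pandas sketch of a few of these fixes; the encoding, the column names and the logical limit are assumptions made for illustration:
>> import pandas as pd
>> import numpy as np
>> data = pd.read_csv('data.csv', encoding='cp1252')                 # read with the correct encoding
>> data['height'] = pd.to_numeric(data['height'], errors='coerce')   # convert strings to numbers; invalid entries become NaN
>> data.loc[data['height'] > 20, 'height'] = np.nan                  # treat values beyond the logical range as missing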
After the above fixes, the data can be filtered down to what is relevant for the analysis using the following steps.
1. Remove duplicate data (such as identical rows, rows where some columns are identical, etc.).
2. Filter the rows to get only the rows relevant to the analysis (such as filtering by segment, filtering by date period, etc.).
3. Filter the columns by picking only the columns relevant to the analysis.
4. Aggregate the data by grouping it by the required keys and aggregating the rest.
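A minimal pandas sketch of these filtering and aggregation steps; the segment value and column names are hypothetical:
>> data = data.drop_duplicates()                      # remove identical rows
>> data = data[data['segment'] == 'retail']           # keep only the rows relevant to the analysis
>> data = data[['col_1', 'col_2', 'col_3']]           # keep only the relevant columns
>> summary = data.groupby('col_1')['col_2'].sum()     # group by a key and aggregate the rest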
Based on their type, variables can be broadly classified into two types,
1. Categorical variables.
2. Quantitative/Numeric variables.
Similarly, there is another method of classifying variables, known as Stevens' typology. Stevens' typology classifies variables into four types.
1. Nominal variables.
2. Ordinal variables.
3. Interval variables.
4. Ratio variables.
Categorical Variables
These are qualitative data which are used to categorize data into various categories (such as
male/female, yes/no, etc.). These are again divided into,
Nominal Variables
These are categorical variables where the categories differ only by their names; there is no order among the categories (such as red/blue/green, male/female, etc.). These are the most basic form of categorical variables.
Ordinal Variables
These are categorical variables where the categories follow a certain order, but the mathematical difference between categories is not meaningful (such as primary school/high school/college, high/medium/low, bad/good/excellent, etc.). Ordinal variables possess the properties of nominal variables as well.
Interval Variables
These are variables which follow a certain order and where the mathematical difference between values is meaningful, but division or multiplication is not (such as temperature in degrees Celsius, dates, etc.). Interval variables possess the properties of nominal and ordinal variables as well.
Ratio Variables
These are variables which follow a certain order and where, apart from the mathematical difference, the ratio (division/multiplication) is also meaningful (such as sales in dollars, marks of students, etc.). Ratio variables possess the properties of nominal, ordinal and interval variables as well.
From the histogram, one can see that there are huge spikes in the readings aligned with the tariff slab boundaries; these are the readings where power theft is taking place and the readings have been adjusted to stay within a lower slab and pay lower bills. The other spikes are readings that have been cooked up by agents sitting in offices so that they would not have to go out and take the actual readings.
Standard deviation and the interquartile range (IQR) are both used to represent the spread of the data. The IQR is a much better metric than the standard deviation if there are outliers in the data, because the standard deviation is influenced by outliers while the IQR simply ignores them. Simple box plots can be used to check the spread of the data, as shown in the following figure.
The following figure gives one such example of segmented univariate analysis, for the total bill across the days of a week.
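A minimal sketch of how such plots can be produced with pandas and seaborn; the total_bill and day columns are the hypothetical ones from the example above:
>> import seaborn as sns
>> data['total_bill'].describe()                      # quartiles summarise the spread
>> sns.boxplot(y=data['total_bill'])                  # box plot showing the spread and outliers
>> sns.boxplot(x='day', y='total_bill', data=data)    # segmented univariate analysis across the days of the week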
As can be seen, the correlated variables have been grouped by similarity, and the correlation has also been calculated for groups of variables. This is called clustering, where the idea is to form a hierarchy of similar groups of variables. In the figure, the top-right half shows the correlation coefficients and the bottom-left half shows the scatter plots between the pairs of variables.
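A minimal sketch of how such a correlation view can be built, assuming data is a dataframe of numeric variables:
>> import seaborn as sns
>> data.corr()                    # pairwise correlation coefficients
>> sns.pairplot(data)             # scatter plots between every pair of variables
>> sns.clustermap(data.corr())    # groups similar (highly correlated) variables together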
A segmented univariate analysis may deceive one into thinking that a certain phenomenon is true, without asking whether it is true for all sub-populations or only when aggregated across the entire population. So, bivariate analysis is performed to check the relationships between variables. In general, there are two fundamental aspects of analysing categorical variables to draw conclusions,
1. Checking the distribution of two categorical variables (such as checking the distribution of
the occupations of parents across their educational qualifications using cross tables etc.).
2. Checking the distribution of two categorical variables with one continuous variable (such as
checking the distribution of incomes of parents across their educational qualifications and
occupation etc.).
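A minimal pandas sketch of these two checks, using hypothetical education, occupation and income columns:
>> import pandas as pd
>> pd.crosstab(data['education'], data['occupation'])   # cross table of two categorical variables
>> data.pivot_table(values='income', index='education', columns='occupation', aggfunc='mean')   # a continuous variable across two categorical variables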
By plotting the marks against the month of birth (derived variable), one could observe that the
children who were born after June would have missed the cutoff by a few days and gotten
admission at the age of 5 instead of 4 . The ones being born after June (such as July, August, etc.)
were intellectually and emotionally more mature than their peers because of their higher age,
resulting in better performance. This unexpected insight could not have been discovered without
the derived variable. Broadly, there are three different types of derived metrics,
1. Type-driven metrics.
2. Business-driven metrics.
3. Data-driven metrics.
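As a minimal illustration of deriving such a variable, the month of birth mentioned above can be extracted with pandas; the date_of_birth column name is hypothetical:
>> import pandas as pd
>> data['birth_month'] = pd.to_datetime(data['date_of_birth']).dt.month   # month extracted from the date of birth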
Type-Driven Metrics
These metrics can be derived by understanding a variable's typology. Understanding the types of variables enables one to derive new metrics of different types from the same column. For example, age in years is a ratio attribute, but one can convert it into an ordinal type by binning it into categories such as children (< 13 years), teenagers (13 − 19 years), young adults (20 − 25 years), etc. This helps in getting insights about children, teenagers, young adults, etc. as separate groups.
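A minimal pandas sketch of such a type-driven derivation, binning a hypothetical age column into the ordinal categories above:
>> import pandas as pd
>> bins, labels = [0, 13, 20, 26, 100], ['child', 'teenager', 'young adult', 'adult']
>> data['age_group'] = pd.cut(data['age'], bins=bins, labels=labels, right=False)   # ratio attribute converted to an ordinal one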
Business-Driven Metrics
Every business has certain rules, based on which metrics can be derived. These are metrics derived from the business perspective and are completely domain-specific. Extracting meaningful information from existing variables (such as the month from a date, etc.) can be easy, but extracting information that requires business expertise is not an easy task; it requires decent domain experience. Without understanding the domain correctly, deriving insights can be difficult and prone to errors. Some more examples of business-driven derived metrics are,
Data-Driven Metrics
These metrics can be created based on the variables present in the existing data set or on the basis of certain analysis. For example, if there are two variables in the data set, such as weight and height, which are highly correlated, one can derive a new metric, Body Mass Index (BMI), for analysis instead of analysing the weight and height variables separately. Once the BMI is derived, one can easily categorise people based on their fitness (such as BMI < 18.5 being considered underweight, while BMI > 30.0 can be considered obese). This is how data-driven metrics can help one discover hidden patterns in the data.
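A minimal pandas sketch of such a data-driven derivation, assuming hypothetical weight (kg) and height (m) columns:
>> import pandas as pd
>> data['bmi'] = data['weight_kg'] / data['height_m'] ** 2                  # Body Mass Index
>> data['bmi_category'] = pd.cut(data['bmi'], bins=[0, 18.5, 25, 30, 100], labels=['underweight', 'normal', 'overweight', 'obese'])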