
Statistics, Statistical Modelling & Data Analytics Lab
(DA-304P)

Faculty Name: Dr. Sunil Maggu
Name: Vishal Singh Thapa
Roll No.: 05214803121
Semester: 6th
Group: 6 AIML IV A

Maharaja Agrasen Institute of Technology, PSP Area, Sector - 22, New Delhi – 110085
Statistics, Statistical Modelling & Data Analytics Lab
(DA-304P)

Faculty Name: Ms. Namita Goyal
Name: Abhijeet Prakash
Roll No.: 06714803121
Semester: 6th
Group: 6 AIML IV B

Maharaja Agrasen Institute of Technology, PSP Area, Sector - 22, New Delhi – 110085
MAHARAJA AGRASEN INSTITUTE OF TECHNOLOGY

VISION OF THE INSTITUTE

To attain global excellence through education, innovation, research, and work ethics with the commitment to
serve humanity.

MISSION OF THE INSTITUTE

M1. To promote diversification by adopting advancements in science, technology, management, and allied disciplines through continuous learning.
M2. To foster moral values in students and equip them for developing sustainable solutions to serve both national and global needs in society and industry.
M3. To digitize educational resources and processes for enhanced teaching and effective learning.
M4. To cultivate an environment supporting incubation, product development, technology transfer, capacity building and entrepreneurship.
M5. To encourage faculty-student networking with alumni, industry, institutions, and other stakeholders for collective engagement.
MAHARAJA AGRASEN INSTITUTE OF TECHNOLOGY

DEPARTMENT OF INFORMATION TECHNOLOGY

VISION OF THE DEPARTMENT

To establish a centre of excellence promoting Information Technology related education and research for
preparing technocrats and entrepreneurs with ethical values.

MISSION OF THE DEPARTMENT

M1. To excel in the field by imparting quality education and skills for software
development and applications.
M2. To establish a conducive environment that promotes intellectual growth and
research.
M3. To facilitate students to acquire entrepreneurial skills for innovation and product
development.
M4. To encourage students to participate in competitive events and industry interaction
with a focus on continuous learning.
Rubrics Evaluation
PRACTICAL RECORD

PAPER CODE: DA-304P

Name of the student: Vishal Singh Thapa
University Roll No.: 05214813121
Branch: IT
Section/Group: 6 AIML IV A

PRACTICAL DETAILS

a) Experiments according to the list provided by GGSIPU

(The Date of Checking, R1-R5 (3 marks each), Total Marks (15) and Signature columns are left blank for evaluation.)

Exp. No. | Experiment Name | Date of Performance
1 | Exercises to implement the basic matrix operations in Scilab. | 1/02/24
2 | Exercises to find the eigenvalues and eigenvectors in Scilab. | 14/02/24
3 | Exercises to solve equations by Gauss elimination, Gauss-Jordan method and Gauss-Seidel in Scilab. | 21/02/24
4 | Exercises to implement the associative, commutative and distributive property in a matrix in Scilab. | 29/02/24
5 | Exercises to find the reduced row echelon form of a matrix in Scilab. | 7/03/24
6 | Exercises to plot the functions and to find its first and second derivatives in Scilab. | 14/03/24
7 | Exercises to present the data as a frequency table in SPSS. | 14/03/24
8 | Exercises to find the outliers in a dataset in SPSS. | 14/03/24
PRACTICAL RECORD

PAPER CODE: DA-304P

Name of the student: Abhijeet Prakash
University Roll No.: 06714813121
Branch: IT
Section/Group: 6 AIML IV B

PRACTICAL DETAILS

b) Experiments according to the list provided by GGSIPU

(The Date of Checking, R1-R5 (3 marks each), Total Marks (15) and Signature columns are left blank for evaluation.)

Exp. No. | Experiment Name | Date of Performance
1 | Exercises to implement the basic matrix operations in Scilab. | 1/02/24
2 | Exercises to find the eigenvalues and eigenvectors in Scilab. | 14/02/24
3 | Exercises to solve equations by Gauss elimination, Gauss-Jordan method and Gauss-Seidel in Scilab. | 21/02/24
4 | Exercises to implement the associative, commutative and distributive property in a matrix in Scilab. | 29/02/24
5 | Exercises to find the reduced row echelon form of a matrix in Scilab. | 7/03/24
6 | Exercises to plot the functions and to find its first and second derivatives in Scilab. | 14/03/24
7 | Exercises to present the data as a frequency table in SPSS. | 14/03/24
8 | Exercises to find the outliers in a dataset in SPSS. | 14/03/24
9 | Exercises to find the most risky project out of two mutually exclusive projects in SPSS. | 21/03/24
10 | Exercises to draw a scatter diagram, residual plots, outliers, leverage and influential data points in R. | 28/03/24
11 | Exercises to calculate correlation using R. | 4/03/24
12 | Exercises to implement time series analysis using R. | 11/03/24
13 | Exercises to implement linear regression using R. | 18/03/24
14 | Exercises to implement concepts of probability and distributions in R. | 25/03/24

c) Experiments beyond the list provided by GGSIPU

Exp. No. | Experiment Name | Date of Performance
1 | To provide an introduction to descriptive statistics by calculating and interpreting measures such as mean, median, mode, variance, and standard deviation for a given dataset in Scilab. | 28/03/24
2 | To assess whether a given dataset follows an exponential distribution using hypothesis testing in Scilab/R. | 4/03/24
Program-1

Aim: Exercises to implement the basic matrix operations in Scilab.

Theory:

Addition of Matrices
If A = [aij]m×n and B = [bij]m×n are two matrices of the same order, then their sum A + B is a matrix, and each element of that matrix is the sum of the corresponding elements, i.e. A + B = [aij + bij]m×n.
Consider two matrices, A and B, of order 2 × 2. Then the sum is given by:
A + B = [a11 + b11, a12 + b12; a21 + b21, a22 + b22]

Subtraction of Matrices
If A and B are two matrices of the same order, then we define A − B = [aij − bij]m×n.
Consider two matrices, A and B, of order 2 × 2. Then the difference is given by:
A − B = [a11 − b11, a12 − b12; a21 − b21, a22 − b22]

Scalar Multiplication of Matrices
If A = [aij]m×n is a matrix and k is any number, then the matrix obtained by multiplying each element of A by k is called the scalar multiple of A by k, and it is denoted by kA. Thus if A = [aij]m×n, then kA = [k·aij]m×n.

Multiplication of Matrices
If A and B are any two matrices, then their product AB is defined only when the number of columns in A is equal to the number of rows in B.
If A = [aij]m×n and B = [bjk]n×p, then AB will be a matrix of order m×p where (AB)ik = ai1·b1k + ai2·b2k + … + ain·bnk.


Code:

1. Defining Matrices:

// 2x2 matrix
A = [1, 2; 3, 4];

// 3x1 column vector
B = [5; 6; 7];

// 2x2 matrix
C = [8, 9; 10, 11];

2. Accessing Elements:

// Second row, first column of A
A(2, 1) // Output: 3

// First element of B
B(1) // Output: 5

Output:

3. Arithmetic Operations:

// Addition
D = A + C
disp(D)
// Subtraction
E = A - C
disp(E)
// Element-wise multiplication (.* is element-wise; use * for matrix multiplication)
F = A .* C
disp(F)
// Matrix multiplication
G = A * C
disp(G)
// Scalar multiplication
H = 2 * A
disp(H)

Output:

4. Other Useful Commands:

// Transpose
A_transposed = A'
disp(A_transposed)
// Inverse (if it exists)
A_inverse = inv(A)
disp(A_inverse)
// Determinant
det(A)
disp(det(A))
// Size
size(A)
disp(size(A))
Output:
Program-2

Aim: Exercises to find the Eigenvalues and eigenvectors in Scilab.

Theory:

Eigen Values and Eigen Vector:

Consider a square matrix n × n. If X is the non-trivial column vector solution of the matrix
equation AX = λX, where λ is a scalar, then X is the eigenvector of matrix A, and the
corresponding value of λ is the eigenvalue of matrix A.

Suppose the matrix equation is written as AX – λX = 0. Let I be the n × n identity matrix.

Substituting IX for X in the second term, we obtain AX – λIX = 0.

The equation is rewritten as (A – λI)X = 0.

The equation above has non-trivial solutions if and only if the determinant of the matrix (A – λI) is 0. The characteristic equation of A is det(A – λI) = 0. A being an n × n matrix, expanding det(A – λI) gives the characteristic polynomial of A, whose degree is n.

Code:

// Define your matrix
A = [1 2; 3 4];

// Find eigenvectors and eigenvalues: with two output arguments, spec returns
// the eigenvector matrix first and a diagonal matrix of eigenvalues second
[eigenvectors, eigenvalues] = spec(A);

// Print eigenvalues
disp('Eigenvalues:');
disp(eigenvalues);

// Print eigenvectors
disp('Eigenvectors:');
disp(eigenvectors);
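
For cross-checking, R (used later in this file) computes the same quantities with eigen(); a minimal sketch, assuming the same 2x2 matrix:

A <- matrix(c(1, 2, 3, 4), nrow = 2, byrow = TRUE)
e <- eigen(A)
print(e$values) # eigenvalues, matching Scilab's spec()
# Verify the defining relation A v = lambda v for the first pair (result ~ 0)
print(A %*% e$vectors[, 1] - e$values[1] * e$vectors[, 1])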
Output:
Program-3

Aim: Exercises to solve equations by Gauss elimination, Gauss-Jordan method and Gauss-Seidel in Scilab.

Theory:

Gauss Elimination Method

The Gaussian elimination method is known as the row reduction algorithm for solving linear
equations systems. It consists of a sequence of operations performed on the corresponding
matrix of coefficients. We can also use this method to estimate either of the following:

● The rank of the given matrix


● The determinant of a square matrix
● The inverse of an invertible matrix

To perform row reduction on a matrix, we have to complete a sequence of elementary row


operations to transform the matrix till we get 0s (i.e., zeros) on the lower left-hand corner of
the matrix as much as possible. That means the obtained matrix should be an upper triangular
matrix. There are three types of elementary row operations; they are:
● Swapping two rows and this can be expressed using the notation ↔, for
example, R2 ↔ R3
● Multiplying a row by a nonzero number, for example, R1 → kR2 where k is some
nonzero number
● Adding a multiple of one row to another row, for example, R2 → R2 + 3R1

Gauss Jordan Elimination Method

The Gauss-Jordan method is a method for solving systems of linear equations. It is similar to
the Gaussian elimination process, but the entries above and below each pivot are zeroed out.
The result of the Gauss-Jordan method is in reduced row echelon form.

The Gauss-Jordan method uses three elementary row operations on a matrix:


● Swap the positions of two of the rows
● Multiply one of the rows by a nonzero scalar
● Add or subtract the scalar multiple of one row to another row

The Gauss-Jordan method can also be used to find the inverse of any invertible matrix.
The steps for the Gauss-Jordan method are:
● Write the augmented matrix
● Interchange rows if necessary to obtain a non-zero number in the first row, first
column
● Use a row operation to get a 1 as the entry in the first row and first column
● Use row operations to make all other entries as zeros in column one

Gauss-Seidel method

The Gauss-Seidel method is an iterative method for solving a system of linear equations. It's
named after the German mathematicians Carl Friedrich Gauss and Philipp Ludwig von Seidel.
The method is also known as the Liebmann method or the method of successive displacement.

The Gauss-Seidel method works by:


● Decomposing the matrix A into a lower triangular component L and a strictly upper
triangular component U
● Solving the left hand side of the equation
● Using previous values of x

The Gauss-Seidel method is an improvement on the Jacobi method. In the Jacobi method, the
value of the variables is not modified until the next iteration. In the Gauss-Seidel method, the
value of the variables is modified as soon as a new value is evaluated.

The Gauss-Seidel method has several advantages, including:


● Simple calculations
● Less storage needed in computer memory
● Applicable for smaller systems
● Advantageous for large systems of equations because it is less prone to round-off
errors

Code:

1. Gauss Elimination:

// Function to perform forward elimination (Gauss Elimination)
function [A, b] = forward_elimination(A, b)
    n = size(A, 1);
    for i = 1:n-1
        max_pivot_row = i; // Initialize max_pivot_row
        for j = i+1:n
            if abs(A(j, i)) > abs(A(max_pivot_row, i)) // Find row with largest pivot in column
                max_pivot_row = j;
            end
        end
        // Swap rows if necessary for pivoting
        if max_pivot_row ~= i
            temp = A(i, :);
            A(i, :) = A(max_pivot_row, :);
            A(max_pivot_row, :) = temp;
            temp = b(i);
            b(i) = b(max_pivot_row);
            b(max_pivot_row) = temp;
        end

        pivot = A(i, i);
        if abs(pivot) < %eps * 100 // Check for pivot element close to zero
            error('Pivot element close to zero. Consider reordering equations or using a different method.');
        end
        // Eliminate the entries below the pivot
        for j = i+1:n
            factor = A(j, i) / pivot;
            A(j, :) = A(j, :) - factor * A(i, :);
            b(j) = b(j) - factor * b(i);
        end
    end
endfunction

// Function to perform back substitution
function x = back_substitution(U, b)
    n = size(U, 1);
    x = zeros(n, 1);
    for i = n:-1:1
        sum = 0;
        for j = i+1:n
            sum = sum + U(i, j) * x(j);
        end
        x(i) = (b(i) - sum) / U(i, i);
    end
endfunction

// Function to solve using Gauss Elimination
function x = gauss_elimination(A, b)
    [U, b] = forward_elimination(A, b);
    x = back_substitution(U, b);
endfunction

// Example usage (a nonsingular system; a singular matrix such as [1 2 3; 4 5 6; 7 8 9]
// would trigger the near-zero-pivot error above)
A = [2 1 -1; -3 -1 2; -2 1 2];
b = [8; -11; -3];

x = gauss_elimination(A, b);

disp('Solution using Gauss Elimination:');
disp(x);

Output:

2. Gauss Jordan Method:

// Function to perform forward elimination (Gauss Elimination)
function [A, b] = forward_elimination(A, b)
    n = size(A, 1);
    for i = 1:n-1
        max_pivot_row = i; // Initialize max_pivot_row
        for j = i+1:n
            if abs(A(j, i)) > abs(A(max_pivot_row, i)) // Find row with largest pivot in column
                max_pivot_row = j;
            end
        end
        // Swap rows if necessary for pivoting
        if max_pivot_row ~= i
            temp = A(i, :);
            A(i, :) = A(max_pivot_row, :);
            A(max_pivot_row, :) = temp;
            temp = b(i);
            b(i) = b(max_pivot_row);
            b(max_pivot_row) = temp;
        end

        pivot = A(i, i);
        if abs(pivot) < %eps * 100 // Check for pivot element close to zero
            error('Pivot element close to zero. Consider reordering equations or using a different method.');
        end
        for j = i+1:n
            factor = A(j, i) / pivot;
            A(j, :) = A(j, :) - factor * A(i, :);
            b(j) = b(j) - factor * b(i);
        end
    end
endfunction

// Function to perform back substitution
function x = back_substitution(U, b)
    n = size(U, 1);
    x = zeros(n, 1);
    for i = n:-1:1
        sum = 0;
        for j = i+1:n
            sum = sum + U(i, j) * x(j);
        end
        x(i) = (b(i) - sum) / U(i, i);
    end
endfunction

// Function to solve using Gauss-Jordan elimination
function x = gauss_jordan(A, b)
    [U, b] = forward_elimination(A, b);
    n = size(U, 1);
    // Eliminate the entries above each pivot, updating b alongside U
    for i = n:-1:1
        for j = i-1:-1:1
            factor = U(j, i) / U(i, i);
            U(j, :) = U(j, :) - factor * U(i, :);
            b(j) = b(j) - factor * b(i);
        end
    end
    // Scale by the pivots so the diagonal becomes 1; b then holds the solution
    x = b ./ diag(U);
endfunction

// Example usage (same nonsingular system as above)
A = [2 1 -1; -3 -1 2; -2 1 2];
b = [8; -11; -3];

x = gauss_jordan(A, b);

disp('Solution using Gauss Jordan:');
disp(x);

Output:

3. Gauss-Seidel Method:

function x = gauss_seidel(A, b, x0, tol, max_iter)
    // A: coefficient matrix
    // b: right-hand side vector
    // x0: initial guess vector
    // tol: tolerance for convergence
    // max_iter: maximum number of iterations
    n = length(b);
    x = x0;
    iter = 0;
    while iter < max_iter
        x_old = x;
        for i = 1:n
            sum1 = 0;
            sum2 = 0;
            for j = 1:i-1
                sum1 = sum1 + A(i, j) * x(j); // uses already-updated values
            end
            for j = i+1:n
                sum2 = sum2 + A(i, j) * x_old(j); // uses previous-iteration values
            end
            x(i) = (b(i) - sum1 - sum2) / A(i, i);
        end
        // Check for convergence
        if norm(x - x_old) < tol
            break;
        end
        iter = iter + 1;
    end
    if iter >= max_iter
        disp('Maximum iterations reached without convergence');
    else
        disp('Converged in ' + string(iter) + ' iterations');
    end
endfunction

A = [4, -1, 0; -1, 4, -1; 0, -1, 3];
b = [5; -7; 6];
x0 = [0; 0; 0];
tol = 1e-6;
max_iter = 1000;

x = gauss_seidel(A, b, x0, tol, max_iter);

disp('Solution:');
disp(x);
Output:
Program-4

Aim: Exercises to implement the associative, commutative and distributive properties in a matrix in Scilab.

Theory:

● Commutative property of addition: A + B = B + A. Two matrices of the same order can be added in either order with the same result.

● Associative property of addition: (A + B) + C = A + (B + C). The grouping in matrix addition can be changed without changing the result.

● Associative property of multiplication: A(BC) = (AB)C.

● Distributive property of multiplication over addition: A(B + C) = AB + AC.

● Note that matrix multiplication, unlike scalar multiplication, is not commutative in general: AB and BA need not be equal.

● For scalars, the familiar properties hold: p × q = q × p; p × (q × r) = (p × q) × r; p × (a ± b) = p × a ± p × b; and p × (1/p) = 1, provided p ≠ 0.

Code:

// Define matrices
A = [1 2; 3 4];
B = [5 6; 7 8];
C = [9 10; 11 12];

// Associativity of matrix addition (A + (B + C) = (A + B) + C)
sum1 = A + (B + C);
sum2 = (A + B) + C;

if (sum1 == sum2)
    disp('Associativity of matrix addition holds.');
else
    disp('Associativity of matrix addition may not hold.');
end

// Commutativity of matrix addition (A + B = B + A)
sum3 = A + B;
sum4 = B + A;

if (sum3 == sum4)
    disp('Commutativity of matrix addition holds.');
else
    disp('Commutativity of matrix addition may not hold.');
end

// Distributivity (A * (B + C) = A * B + A * C)
product1 = A * (B + C);
product2 = A * B + A * C;

if (product1 == product2)
    disp('Distributivity of matrix multiplication holds.');
else
    disp('Distributivity may not hold.');
end

Output:
Program-5

Aim: Exercises to find the reduced row echelon form of a matrix in Scilab.

Theory:

A matrix is in reduced row echelon form (RREF) if it meets the following criteria:
● The matrix is a zero matrix, or
● All of its pivots are 1, and all entries above its pivots are 0
● In each row, the left-most nonzero entry is 1, and the column that contains this 1 has
all other entries equal to 0
● The 1 is called a leading 1

The reduced row echelon form of a matrix is unique and does not depend on the sequence of
elementary row operations used to obtain it.

The reduced row echelon form of a matrix is used to solve the system of linear equations.

Code:
// Function to reduce the augmented system [A | b] to reduced row echelon form
function [R, b] = rref_elimination(A, b)
    n = size(A, 1);
    // Forward pass: zero out the entries below each pivot
    for i = 1:n-1
        for j = i+1:n
            pivot = A(i, i);
            if abs(pivot) < %eps // Check for pivot element close to zero (avoid division by zero)
                error('Pivot element close to zero. Consider reordering equations or using a different method.');
            end
            factor = A(j, i) / pivot;
            A(j, :) = A(j, :) - factor * A(i, :);
            b(j) = b(j) - factor * b(i);
        end
    end
    // Backward pass: zero out the entries above each pivot and scale the pivots to 1,
    // so the result is in reduced row echelon form rather than merely upper triangular
    for i = n:-1:1
        for j = i-1:-1:1
            factor = A(j, i) / A(i, i);
            A(j, :) = A(j, :) - factor * A(i, :);
            b(j) = b(j) - factor * b(i);
        end
        b(i) = b(i) / A(i, i);
        A(i, :) = A(i, :) / A(i, i);
    end
    R = A;
endfunction

// Example usage (a nonsingular matrix; Scilab's built-in rref() gives the same result)
A = [2 1 -1; -3 -1 2; -2 1 2];
b = [8; -11; -3];
[R, b] = rref_elimination(A, b);

disp('Reduced row echelon form:');
disp(R);

Output:
Program-6

Aim: Exercises to plot the functions and to find its first and second derivatives in Scilab.

Theory:

Function
A function is a relationship between a set of inputs and their outputs. A function can be
represented as an equation, a set of ordered pairs, as a table, or as a graph in the coordinate
plane.

Derivative
In mathematics, a derivative is the rate of change of a function with respect to an independent
variable. It is used to measure the sensitivity of one variable with respect to another.

First-order derivatives
These derivatives tell about the direction of the function and can be interpreted as an
instantaneous rate of change. For example, the first derivative of a distance versus time graph
gives you velocity.

Second-order derivatives
These derivatives are used to get an idea of the shape of the graph for the given function. For
example, the second derivative gives you the acceleration.

Graphically
The first derivative represents the slope of the function at a point, and the second derivative
describes how the slope changes over the independent variable in the graph.

Image processing
When you apply a Sobel convolution matrix to a given image, you get the first derivative of the input image. When you apply the Laplacian matrix to the initial image, you get the second derivative.

Code:
// Replace 'f(x)' with your actual function definition
function y = f(x)
    y = x.^2 + sin(x); // Example function (element-wise for vector x)
endfunction

// Define x-axis range
x = linspace(1, 250, 15);

// Calculate function values at each x point
y = f(x);

// Plot the function
plot(x, y);
xlabel('x');
ylabel('f(x)');
title('Plot of f(x)');

// Define the first derivative function
function dy = df(x)
    dy = 2*x + cos(x); // Example derivative of f(x)
endfunction

// Calculate first derivative values at each x point
dy = df(x);

// Plot the first derivative (optional)
// plot(x, dy); // Un-comment to plot the first derivative

// Print the first derivative at a specific point (optional)
x_point = 2; // Example point
first_derivative = df(x_point);
disp('First derivative at x=2:');
disp(first_derivative);

// Define the second derivative function
function d2y = d2f(x)
    d2y = 2 - sin(x); // Example second derivative of f(x)
endfunction

// Calculate second derivative values at each x point
d2y = d2f(x);

// Plot the second derivative (optional)
// plot(x, d2y); // Un-comment to plot the second derivative

// Print the second derivative at a specific point (optional)
second_derivative = d2f(x_point);
disp('Second derivative at x=2:');
disp(second_derivative);
Output:
Program – 7

Aim: Exercises to present the data as a frequency table in SPSS.

Theory:
Frequency tables in SPSS provide a structured way to summarize categorical data, aiding in
data interpretation and analysis. The algorithm to create a frequency table involves several
steps within the SPSS software:

Algorithm:
Step 1: Open Dataset
Open the dataset containing the variable of interest.
Step 2: Access Frequency Analysis
Click on the "Analyze" tab in SPSS.
Step 3: Navigate to Descriptive Statistics
From the Analyze menu, select "Descriptive Statistics."
Step 4: Select Frequencies
In the submenu, choose "Frequencies."
Step 5: Choose Variable
In the Frequencies dialog box, locate the variable for which you want to create a frequency table.
Step 6: Add Variable
Drag the variable from the left panel (Variables) to the right panel (Variable(s)) in the Frequencies dialog box.
Step 7: Optional: Additional Statistics
If desired, click on the "Statistics" button to include additional descriptive statistics such as mean, median, etc.
Step 8: Generate Frequency Table
Click on the "OK" button to generate the frequency table.

Output interpretation:
Validity Information:
The generated output provides information about the number of valid and missing
values for the selected variable.
Frequency Table:
The frequency table displays each unique value of the variable, along with its
frequency count and percentage.
The "Frequency" column indicates how many times each unique value occurs in the
dataset.
The "Percent" column shows the percentage of each value relative to the total
number of valid responses.
The sum of the percentages equals 100%.

Data used:
The frequency table output provides a clear summary of the distribution of values within
the variable, facilitating easy interpretation and analysis.
a) The team name Mavs occurs 4 times, which represents 4/11 = 36.4% of all values in the Team column.
b) The team name Rockets occurs 3 times, which represents 3/11 = 27.3% of all values in the Team column.
c) The team name Spurs occurs 2 times, which represents 2/11 = 18.2% of all values in the Team column.
d) The team name Warriors occurs 2 times, which represents 2/11 = 18.2% of all values in the Team column.
e) Note that the values in the Percent column add up to 100%.
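
The same table can be reproduced outside SPSS for cross-checking. A minimal R sketch, assuming a hypothetical Team vector built to match the counts described above:

# Hypothetical Team column matching the counts above
team <- c(rep("Mavs", 4), rep("Rockets", 3), rep("Spurs", 2), rep("Warriors", 2))
freq <- table(team) # frequency of each team name
pct <- round(100 * prop.table(freq), 1) # percent of valid responses
print(freq)
print(pct) # the percentages add up to 100%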
Program – 8
Aim: Exercises to find the outliers in a dataset in SPSS.

Theory:
Identifying outliers in a dataset is crucial for data analysis in SPSS. The algorithm
involves the following steps:

Algorithm:
Step 1: Open Dataset
Open the dataset containing the variable of interest, such as annual income for individuals.
Step 2: Access Descriptive Statistics
Click on the "Analyze" tab in SPSS.
Step 3: Navigate to Descriptive Statistics
From the Analyze menu, select "Descriptive Statistics."
Step 4: Select Explore
In the submenu, choose "Explore."
Step 5: Choose Variable
Drag the variable (e.g., income) into the box labeled "Dependent List."
Step 6: Configure Statistics
Click on the "Statistics" button and ensure that the box next to "Percentiles" is checked. Click "Continue."
Step 7: Generate Box Plot
Click "OK" to generate the box plot.
Step 8: Interpret Box Plot
Examine the box plot to identify any circles or asterisks on either end of the box plot. Circles indicate potential outliers, while asterisks indicate extreme outliers.
Step 9: Calculate Interquartile Range (IQR)
Locate the interquartile range (IQR) from the output, typically labeled as "Tukey's Hinges."
Step 10: Define Outlier Ranges
Calculate outlier ranges using the formulas:
Upper bound: 3rd quartile + 1.5 * IQR
Lower bound: 1st quartile - 1.5 * IQR
Step 11: Determine Outliers
Identify any values outside the defined outlier ranges. Any data points beyond these ranges are considered outliers.

Output interpretation:
1. Box Plot: Examining the box plot visually identifies outliers.
2. Tukey's Hinges: Locate the interquartile range (IQR) from the output.
3. Outlier Ranges: Calculate upper and lower bounds based on the IQR.
4. Identification: Values outside the defined outlier ranges are considered outliers or
extreme outliers.

Handling outliers:
1. Verify Data Entry: Ensure outliers are not the result of data entry errors.
2. Remove Outliers: Consider removing outliers if they significantly impact
analysis.
3. Assign New Values: Replace outlier values with appropriate replacements (e.g., mean,
median) if they are data entry errors.

Data used:
The output includes box plots indicating potential outliers and extreme outliers.
Interquartile range and outlier ranges are calculated to identify and handle outliers
appropriately.

INCOME 18 24 36 34 38 45 48 54 60 73 79 85 94 98 108
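
For cross-checking the SPSS procedure, the same fences can be computed in R; a minimal sketch on the income data above (note that quantile()'s default method differs slightly from SPSS's Tukey's hinges):

income <- c(18, 24, 36, 34, 38, 45, 48, 54, 60, 73, 79, 85, 94, 98, 108)
q <- quantile(income, c(0.25, 0.75)) # 1st and 3rd quartiles
iqr <- q[2] - q[1] # interquartile range
lower <- q[1] - 1.5 * iqr # lower outlier fence
upper <- q[2] + 1.5 * iqr # upper outlier fence
print(income[income < lower | income > upper]) # empty if no value lies outside the fences
boxplot(income) # visual check, mirroring the SPSS box plot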
Program-9
Aim: Exercises to find the most risky project out of two mutually exclusive projects in SPSS.

Theory:

Steps:-

1. Open the data in SPSS

● Launch SPSS and open the data file containing the "Project," "Cost Estimate," and
"Time Estimate" variables.

2. Analyse Cost Estimates

● Go to Analyze > Descriptive Statistics > Frequencies.


● Select the "Cost Estimate" variable.
● Click "Statistics" and check "Mean," "Median," and "Standard Deviation."
● Click "OK" to run the analysis.

Examine the output:

● Compare the mean, median, and standard deviation of cost estimates for Project
A and Project B.
● A higher standard deviation indicates greater cost variability and potential risk.

3. Analyse Time Estimates

● Repeat the same steps as in Step 2, but select the "Time Estimate" variable instead.

Examine the output:

● Compare the mean, median, and standard deviation of time estimates for both projects.
● A higher standard deviation suggests greater uncertainty in project completion time,
potentially increasing risk.

4. Calculate Cost-Risk

● Create a new variable named "Cost Risk" by multiplying the "Cost Estimate" by a
risk factor.
● Choose a risk factor based on your project context (e.g., 1.1 for moderate risk, 1.2 for
high risk).
● For example, if using a risk factor of 1.1, create a new computed variable named "Cost Risk" with the following formula: Cost Risk = Cost Estimate * 1.1

5. Calculate Schedule Risk

● Create a new variable named "Schedule Risk" by multiplying the "Time Estimate" by the chosen risk factor.
● For example, using the same risk factor of 1.1, create a new computed variable named "Schedule Risk" with the following formula: Schedule Risk = Time Estimate * 1.1

6. Compare Project Risks

● Calculate the total risk score for each project by summing the "Cost Risk" and
"Schedule Risk" values.
● Compare the total risk scores of Project A and Project B.
● The project with the higher total risk score is considered riskier.

Data:

Project | Cost_Estimate | Time_Estimate | Cost_Risk | Schedule_Risk | Total_Risk_Score
Project A | 100000 | 12 | 110000 | 13.2 | 110013.2
Project A | 120000 | 14 | 132000 | 15.4 | 132015.4
Project A | 110000 | 13 | 121000 | 14.3 | 121014.3
Project B | 80000 | 10 | 88000 | 11 | 88011
Project B | 90000 | 11 | 99000 | 12.1 | 99012.1
Project B | 100000 | 12 | 110000 | 13.2 | 110013.2
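
The same risk scores can be recomputed outside SPSS; a minimal R sketch using the data above and the risk factor of 1.1 from the steps:

proj <- data.frame(
  Project = rep(c("Project A", "Project B"), each = 3),
  Cost_Estimate = c(100000, 120000, 110000, 80000, 90000, 100000),
  Time_Estimate = c(12, 14, 13, 10, 11, 12)
)
proj$Cost_Risk <- proj$Cost_Estimate * 1.1 # risk factor of 1.1
proj$Schedule_Risk <- proj$Time_Estimate * 1.1
proj$Total_Risk_Score <- proj$Cost_Risk + proj$Schedule_Risk
# Sum the risk scores per project; the higher total is the riskier project
print(aggregate(Total_Risk_Score ~ Project, data = proj, FUN = sum))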
Output:

Conclusion:
Project A is the riskier of the two projects because it has the higher total risk score: its larger cost and time estimates produce larger cost-risk and schedule-risk values than Project B's.
Program-10

Aim: Write a program to draw a scatter diagram, residual plots, outliers, leverage and influential data points in R.

Theory:
Data Import: Import the dataset containing variables of interest into R. Ensure that the dataset is properly formatted and contains the necessary variables for analysis.

Scatter Diagram: Create a scatter plot to visualize the relationship between two variables of interest. Use the plot() function in R to generate the scatter plot.
# Example scatter plot between variables X and Y:
plot(X, Y, main = "Scatter Plot of X vs. Y", xlab = "X", ylab = "Y")

Linear Regression Model: Fit a linear regression model to the data using the lm() function in R. This will be used to generate the predicted values and residuals.
# Fit linear regression model:
model <- lm(Y ~ X, data = dataset)

Residual Plots: Create residual plots to assess the goodness-of-fit of the linear regression model and identify patterns in the residuals. Use the plot() function with the argument which = c(1, 2, 3) to generate multiple plots.
# Residual plots:
par(mfrow = c(2, 2))
plot(model, which = c(1, 2, 3))

Identifying Outliers: Identify outliers by examining the residual plot for extreme values that do not follow the general pattern. Outliers are points with large residuals compared to the majority of the data points.

Leverage Points: Calculate leverage statistics to identify points with high leverage. Leverage points have extreme predictor values that can significantly influence the regression model.
# Leverage statistics:
leverage <- hatvalues(model)

Influential Data Points: Identify influential data points using measures such as Cook's distance or DFFITS. Influential points have a large impact on the regression coefficients or predictions.
# Cook's distance:
cooks_distance <- cooks.distance(model)

Visualizing Outliers, Leverage, and Influential Points: Overlay the identified outliers, leverage points, and influential points on the scatter plot for visualization (see the sketch after the script below).

R Script:
# Load required libraries
library(ggplot2)
# Generate sample data
set.seed(123)
x <- rnorm(100)
y <- 2*x + rnorm(100)
data <- data.frame(x, y)
# Scatter diagram
ggplot(data, aes(x = x, y = y)) +
  geom_point() +
  labs(title = "Scatter Diagram", x = "X", y = "Y")
# Fit linear model
model <- lm(y ~ x, data = data)
# Residual plot
plot(model, which = 1, main = "Residual Plot")
# Identify outliers
residuals_sd <- sd(model$residuals)
outliers <- which(abs(model$residuals) > 2 * residuals_sd)

# Leverage points
leverage <- hatvalues(model)
leverage_points <- which(leverage > (2 * (ncol(data) + 1) / nrow(data)))

# Influence index (Cook's distance)
cooksd <- cooks.distance(model)
influential_points <- which(cooksd > 4 / nrow(data)) # Adjust threshold as needed
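
The flagged points can then be overlaid on the scatter plot, as the theory above suggests; a minimal base-R sketch reusing the indices computed in the script:

plot(data$x, data$y, main = "Flagged Points", xlab = "X", ylab = "Y")
points(data$x[outliers], data$y[outliers], col = "red", pch = 19) # large residuals
points(data$x[leverage_points], data$y[leverage_points], col = "blue", pch = 2) # high leverage
points(data$x[influential_points], data$y[influential_points], col = "darkgreen", pch = 4) # high Cook's distance
legend("topleft", c("Outlier", "High leverage", "Influential"),
       col = c("red", "blue", "darkgreen"), pch = c(19, 2, 4))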
Output:
Conclusion: The scatter diagram, residual plots, outliers, leverage and influential data points have been plotted and verified in R.
Program-11

Aim: Write a program to calculate Correlation using R.

Theory: In statistics, correlation or dependence is any statistical relationship, whether causal or not,
between two random variables or bivariate data. It usually refers to the degree to which a pair of
variables are linearly related.

R Script:

# The program evaluates Pearson's correlation coefficient, rho, using covariance and standard
# deviation, and directly using the correlation function.
xdata <- c(2, 4.4, 3, 3, 2, 2.2, 2, 4)
ydata <- c(1, 4.4, 1, 3, 2, 2.2, 2, 7)
print('The covariance is:')
print(cov(xdata, ydata))
print('The correlation coefficient, rho, using the cov function is:')
rho = cov(xdata, ydata) / (sd(xdata) * sd(ydata)) # Computes the correlation using the cov
                                                  # (covariance) and sd (standard deviation) functions
print(rho)
print('The correlation using the cor function is:')
print(cor(xdata, ydata)) # Computes the correlation coefficient using the cor function

# We can plot these bivariate observations as a coordinate-based plot (a scatterplot).
# Executing the following gives a figure of the x and y data points
plot(xdata, ydata, pch = 13, cex = 1.5) # pch = 13 means a circle-cross symbol; cex is the size
                                        # of the pch symbols

## Creating a correlation matrix with predefined packages and datasets.
# Below depicts a sample-values-to-sample-values correlation
data("mtcars") # loading the predefined dataset
library(corrplot) # We are using the corrplot library
# To make the correlation matrix plot:
corrplot(cor(mtcars)) # it creates the correlation matrix plot
# In the above correlation plot, the circles and colour key represent the "correlation coefficient",
# or the value of "rho", i.e. the degree of the linear relationship.
# The value of rho ranges from -1 to +1. +1 means 100% positive correlation, i.e., if one variable
# increases, the other will also increase, and if it decreases, the other will also decrease.
# -1 means 100% negative correlation, i.e., the other will decrease if one increases.
# And 0.0 means no linear relationship.
# The size of the circles is relative to the percentage of correlation.

## Generating a correlation matrix ##
library(rstatix)
cor_test <- cor_mat(mtcars) # to create the correlation matrix
cor_test

## Another informative package is "PerformanceAnalytics", which gives the p-values,
## distributions (histograms), and correlation coefficients
library(PerformanceAnalytics)
chart.Correlation(mtcars) # pass the raw data, not cor(mtcars)
# In the above chart, the red stars define different levels of significance,
# i.e. * = 0.05, ** = 0.01, *** = 0.001

## Using the package called "lares", we can rank the correlations and produce a gradually
## decreasing order of column pairs. This is useful to analyse the most correlated variables. ##
library(lares)
corr_cross(mtcars, rm.na = T, max_pvalue = 0.05, top = 15, grid = T)
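
A single correlation can also be tested for significance with base R's cor.test(), which returns rho together with a p-value; a short sketch on the same vectors:

ct <- cor.test(xdata, ydata, method = "pearson")
print(ct$estimate) # sample correlation coefficient
print(ct$p.value)  # p-value for H0: rho = 0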

Output:
Program-12

Aim: Write a program to implement Time series Analysis using R.

Theory: Time series analysis can be used to forecast future events based on known past data. Examples of time series data include the number of users signed up for a service every day, the daily closing price of a company stock, or the number of page views each minute.

Implementing Time Series Analysis in R: Let's now look at how we can implement time series analysis in R. For this tutorial, we will use the AirPassengers dataset, which contains the monthly totals of international airline passengers from 1949 to 1960.

Step 1: Loading the Dataset
# Load the dataset
data("AirPassengers")
# Print the first few rows
print(head(AirPassengers))

Step 2: Checking the Structure of the Dataset
# Check the structure
str(AirPassengers)
The str() function will reveal that AirPassengers is a 'ts' (time series) object. The start attribute shows the first year and month of the data, and the end attribute shows the last year and month.

Step 3: Plotting the Dataset
# Plot the dataset
plot(AirPassengers)
This will create a time series plot. From this plot, we can observe a trend in the data that suggests that the number of air passengers was increasing over time.

Step 4: Decomposing the Time Series Data
# Decompose the time series data
decomposed <- decompose(AirPassengers)
# Plot the decomposed data
plot(decomposed)
The decompose() function breaks down the time series data into three components: trend, seasonal, and random.

Step 5: Forecasting Future Data
To forecast future data, we can use the forecast() function from the 'forecast' package in R. If this package is not yet installed, you can install it using the install.packages() function.
# Install the 'forecast' package
install.packages('forecast')
# Load the 'forecast' package
library(forecast)
# Forecast future data
forecast_data <- forecast(AirPassengers, h = 24)
# Plot the forecasted data
plot(forecast_data)
The h parameter determines the number of periods to forecast.
R Script:
## AirPassengers dataset ##
# Load the dataset
data("AirPassengers")
# Print the first few rows
print(head(AirPassengers))
# Check the structure
str(AirPassengers)
# Plot the dataset
plot(AirPassengers)
# Decompose the time series data
decomposed <- decompose(AirPassengers)
# Plot the decomposed data
plot(decomposed)
# Load the 'forecast' package
library(forecast)
# Forecast future data
forecast_data <- forecast(AirPassengers, h = 24)
# Plot the forecasted data
plot(forecast_data)

#################################################################
# The Nile dataset
#################################################################
print(Nile)
length(Nile)
# Display the first 10 elements of the Nile dataset
head(Nile, n = 10)
# Display the last 12 elements of the Nile dataset
tail(Nile, n = 12)
# plot(Nile, col = "blue")
plot(Nile, col = "blue", xlab = "Year", ylab = "River Volume (1e9 m^3)")
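
One optional check beyond the listed steps is to hold out the last two years and score the forecasts against them; a minimal sketch using the forecast package's accuracy():

library(forecast)
train <- window(AirPassengers, end = c(1958, 12)) # hold out 1959-1960
test <- window(AirPassengers, start = c(1959, 1))
fc <- forecast(train, h = 24) # forecast the held-out period
print(accuracy(fc, test)) # RMSE, MAE, etc. on the training and test data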
Output:
Conclusion: Time series analysis has been implemented and verified using R.
Program-13

Aim: Write a program to implement linear regression using R.


Theory: Linear regression is a statistical method used to model the relationship between a dependent
variable and one or more independent variables. The main idea is to find the best-fitting straight line
(or hyperplane in higher dimensions) that describes the relationship between the variables. Here's a
brief overview of linear regression:

• Dependent Variable (Response Variable): This is the variable we want to predict or explain.
• Independent Variables (Predictors): These are the variables used to predict or explain the
dependent variable.
• Linear Relationship: Linear regression assumes a linear relationship between the independent
variables and the dependent variable. It means that the change in the dependent variable is
proportional to the change in the independent variable(s).
• Simple Linear Regression: In simple linear regression, there's only one independent variable.
The relationship between the independent and dependent variables is described by a straight line.
• Multiple Linear Regression: In multiple linear regression, there are multiple independent
variables. The relationship is described by a hyperplane in higher dimensions.

R Script:
data(mtcars)
X <- mtcars$wt
Y <- mtcars$mpg

# Fit a linear regression model
model <- lm(Y ~ X, data = mtcars)

# Plot the scatter plot
plot(X, Y,
     xlab = "Weight",
     ylab = "Mileage",
     main = "Linear Regression: Weight vs Mileage"
)

# Add the line of best fit
abline(model, col = "red")
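
The theory above also mentions multiple linear regression; a minimal sketch on the same dataset, adding horsepower as a second, purely illustrative predictor:

multi_model <- lm(mpg ~ wt + hp, data = mtcars) # two predictors
summary(multi_model) # coefficients, R-squared and p-values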
Output:

Conclusion: Linear Regression using R has been implemented and verified.


Program-14

Aim: Write a program to implement concepts of probability and distributions in R.

Theory: Classical probability, often referred to as "a priori" probability, is a branch of probability theory that deals with situations where all possible outcomes are equally likely. It provides a foundational understanding of how probability works and forms the basis for more advanced probability concepts.
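
As a concrete instance, the "a priori" probability of any face of a fair die is 1/6; a quick simulation sketch comparing this with an empirical estimate:

rolls <- sample(1:6, 10000, replace = TRUE) # simulate 10,000 fair die rolls
print(mean(rolls == 5)) # empirical Pr(X = 5), close to 1/6
print(1/6) # classical probability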

R Script:

###### Binomial Distribution #########
dbinom(x = 5, size = 8, prob = 1/6) # Binomial distribution for Pr(X = 5), no. of trials n = 8,
                                    # probability of success = 1/6
x_prob_binomial <- dbinom(x = 0:8, size = 8, prob = 1/6)
print(x_prob_binomial)
print(sum(x_prob_binomial))
print(round(x_prob_binomial, 3))
barplot(x_prob_binomial, names.arg = 0:8, space = 0, xlab = "x", ylab = "Pr(X = x)", col = "green")
pbinom(q = 3, size = 8, prob = 1/6) # Evaluating the cumulative distribution function for the
                                    # binomial distribution, i.e. Pr(X <= 3)

###### Poisson Distribution #########
dpois(x = 3, lambda = 3.22) # Poisson distribution. The dpois function provides the individual
                            # Poisson mass function probabilities Pr(X = x). The ppois function
                            # provides the left cumulative probabilities, i.e. Pr(X <= x)
dpois(x = 0, lambda = 3.22)
round(dpois(0:10, 3.22), 3)
barplot(dpois(x = 0:10, lambda = 3.22), ylim = c(0, 0.25), space = 0,
        names.arg = 0:10, ylab = "Pr(X = x)", xlab = "x", col = "blue")
ppois(q = 2, lambda = 3.22)

###### Uniform Distribution #########
min <- 0
max <- 100
# Specify x-values for the function
xpos <- seq(min, max, by = 0.5)
# Supplying the corresponding y coordinates
ypos <- dunif(xpos, min = 10, max = 80)
# Plotting the graph
plot(ypos, type = "o")
## CDF of the uniform distribution
min <- 0
max <- 60
# Calculating the punif value
punif(15, min = min, max = max)

####### Normal Distribution #########
# Generating a sequence of numbers from -15 to +10
x <- seq(-15, 10)
# Calculate the normal density function based on the mean and sd of the data
pdf <- dnorm(x, mean(x), sd(x))
# Plotting the PDF (normal density function)
plot(x, pdf)
# CDF of the normal distribution
# Generating a sequence of numbers from -15 to +10
x <- seq(-15, 10)
# Calculate the empirical cumulative distribution function of the data
cdf <- ecdf(x)
# Plotting the CDF
plot(cdf, xlab = "x", ylab = "y", main = "CDF Graph")
# Plotting a CDF using the gbutils package
# Install and load the gbutils package
library(gbutils)
# Defining the CDF of a normal distribution as a function
cdf1 <- function(q) pnorm(q, mean = -2.5, sd = 7.64)
# Plotting the CDF of the normal distribution
plotpdf(cdf1, cdf = cdf1, main = 'CDF Plot')

#### QQ-plot for a sample of 100 values randomly generated from a normal distribution;
# as expected, the points closely follow the line. If the points roughly fall on the
# diagonal line, then the sample distribution can be considered close to normal. ####
norm_samp <- rnorm(100)
qqnorm(norm_samp)
abline(a = 0, b = 1, col = 'grey')

###### Student's t-distribution #########
x <- seq(-6, 6, length = 100) # seq can be used to generate 100 points between, say, -6 and 6
df = c(1, 4, 10, 30)
colour = c("red", "orange", "green", "yellow", "black")
# Plot a normal distribution
plot(x, dnorm(x), type = "l", lty = 2, xlab = "t-value", ylab = "Density",
     main = "Comparison of t-distributions", col = "black") # type = "l" is for line,
# lty = 2 or lty = "dashed" is for a dashed line
# Add the t-distributions to the plot
for (i in 1:4) {
  lines(x, dt(x, df[i]), col = colour[i])
}
# Add a legend
legend("topright", c("df = 1", "df = 4", "df = 10", "df = 30", "normal"), col = colour,
       title = "t-distributions", lty = c(1, 1, 1, 1, 2))

###### Chi-square distribution #########
df = 5 # Defining degrees of freedom
vec <- 0:4
# Calculating the density function values (pdf) in the interval [0, 4]
print("Calculating for the values [0, 4]")
dchisq(vec, df = df) # pdf of the distribution
pchisq(4, df = df, lower.tail = TRUE) # CDF of the distribution; lower.tail = TRUE means
                                      # P(X <= x) and lower.tail = FALSE means P(X > x)
# Computing a histogram of 50,000 random values with 4 degrees of freedom
x <- rchisq(50000, df = 4)
hist(x, freq = FALSE, xlim = c(0, 16), ylim = c(0, 0.2), col = 'gray')
curve(dchisq(x, df = 4), from = 0, to = 15, n = 5000, col = 'red', lwd = 2, add = T)
Output:
Program – 15
Aim: To provide an introduction to descriptive statistics by calculating and interpreting
measures such as mean, median, mode, variance, and standard deviation for a given dataset in
Scilab.

Theory:
Descriptive statistics provide essential insights into the characteristics of a dataset, offering
measures that summarize its central tendency and variability. In Experiment 15, the aim is to
introduce descriptive statistics by calculating and interpreting measures such as mean, median,
mode, variance, and standard deviation for a given dataset in Scilab.

Central Tendency:
- Mean: The mean represents the average value of the dataset and is calculated by summing
all values and dividing by the number of observations.
- Median: The median is the middle value of the dataset when arranged in ascending order. It
divides the dataset into two equal halves.
- Mode: The mode is the most frequently occurring value(s) in the dataset.

Variability:
- Variance: Variance measures the dispersion of data points around the mean. It is
calculated by averaging the squared differences between each data point and the mean.
- Standard Deviation: Standard deviation is the square root of the variance and
provides a measure of the spread of data points around the mean.

In the provided code, custom functions are defined to calculate the mean, median, mode,
variance, and standard deviation. These functions are applied to the given dataset to obtain the
respective measures of central tendency and variability.

By computing and interpreting these descriptive statistics, users gain valuable insights into the
distribution and characteristics of the dataset, enabling better understanding and informed
decision-making in data analysis.

Code
// Custom mode function
function [modes, counts] = custom_mode(data)
    unique_values = unique(data);
    counts = histc(data, unique_values);
    [max_count, idx] = max(counts);
    modes = unique_values(idx);
    counts = max_count; // Update counts to represent the frequency of the mode
endfunction

// Custom sorting function (bubble sort)
function sorted_data = custom_sort(data)
    n = length(data);
    sorted_data = data;
    for i = 1:n-1
        for j = 1:n-i
            if sorted_data(j) > sorted_data(j+1)
                temp = sorted_data(j);
                sorted_data(j) = sorted_data(j+1);
                sorted_data(j+1) = temp;
            end
        end
    end
endfunction

// Function to calculate mean
function mean_value = calculate_mean(data)
    mean_value = sum(data) / length(data);
endfunction

// Function to calculate median
function median_value = calculate_median(data)
    sorted_data = custom_sort(data);
    n = length(sorted_data);
    if mod(n, 2) == 0
        median_value = (sorted_data(n/2) + sorted_data(n/2 + 1)) / 2;
    else
        median_value = sorted_data(ceil(n/2));
    end
endfunction

// Function to calculate variance
function variance_value = calculate_variance(data)
    mean_value = calculate_mean(data);
    mean_diff = data - mean_value;
    squared_diff = mean_diff .^ 2;
    variance_value = sum(squared_diff) / length(data);
endfunction

// Function to calculate standard deviation
function stddev_value = calculate_stddev(data)
    variance_value = calculate_variance(data);
    stddev_value = sqrt(variance_value);
endfunction

// Given dataset
data = [12, 15, 18, 20, 22, 25, 30, 35, 40, 45];

// Calculate mean
mean_value = calculate_mean(data);

// Calculate median
median_value = calculate_median(data);

// Calculate mode
[mode_value, mode_counts] = custom_mode(data);

// Calculate variance
variance_value = calculate_variance(data);

// Calculate standard deviation
stddev_value = calculate_stddev(data);

// Display results
disp('Measures of Central Tendency:');
disp(['Mean: ', string(mean_value)]);
disp(['Median: ', string(median_value)]);
disp(['Mode: ', string(mode_value), ' (', string(mode_counts), ' times)']);
disp('Measures of Variability:');
disp(['Variance: ', string(variance_value)]);
disp(['Standard Deviation: ', string(stddev_value)]);
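
The Scilab results can be cross-checked in R; note that R's var() and sd() use the sample (n - 1) denominator, while the custom functions above divide by n, so a correction factor is needed for an exact match:

data <- c(12, 15, 18, 20, 22, 25, 30, 35, 40, 45)
n <- length(data)
print(mean(data))
print(median(data))
print(var(data) * (n - 1) / n) # population variance, as computed in the Scilab code
print(sd(data) * sqrt((n - 1) / n)) # population standard deviation
tab <- table(data)
print(names(tab)[tab == max(tab)]) # mode(s); here every value occurs once, so all values tie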
Experiment – 16

Aim: To assess whether a given dataset follows an exponential distribution using hypothesis testing in Scilab/R.

Theory:
Hypothesis Testing:
- Hypothesis testing is a statistical method used to make inferences about
population parameters based on sample data.
- It involves formulating a null hypothesis (H0) and an alternative hypothesis (H1),
then using sample data to determine whether there is enough evidence to reject the null
hypothesis.
- In this experiment, the null hypothesis (H0) states that the given dataset follows
an exponential distribution, while the alternative hypothesis (H1) suggests
otherwise.

Kolmogorov-Smirnov Test:
- The Kolmogorov-Smirnov (KS) test is a non-parametric test used to compare the
empirical cumulative distribution function (CDF) of a sample with a specified
theoretical distribution.
- It assesses the goodness-of-fit between the observed data and the
hypothesized distribution.
- The test statistic measures the maximum absolute difference between the
empirical CDF and the theoretical CDF.
- If the calculated p-value is less than a predetermined significance level (e.g., α
= 0.05), the null hypothesis is rejected, indicating that the dataset does not follow the
specified distribution.

Code
# Generate or import your dataset
data <- rexp(100, rate = 0.5) # Example: generate a random dataset from an exponential distribution

# Perform Kolmogorov-Smirnov test
print("Kolmogorov-Smirnov test:")
ks_result <- ks.test(data, "pexp", rate = 0.5)
print(ks_result)
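
One caveat worth noting: fixing rate = 0.5 is only valid when that rate is hypothesized independently of the data. When the rate must be estimated from the same sample, a common approximate approach is to plug in the maximum-likelihood estimate 1/mean(data), bearing in mind that the resulting p-value is then somewhat optimistic. A minimal sketch of that variant:

rate_hat <- 1 / mean(data) # maximum-likelihood estimate of the exponential rate
ks_est <- ks.test(data, "pexp", rate = rate_hat) # approximate: rate was estimated from the data
print(rate_hat)
print(ks_est)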

You might also like