HITAM
HYDERABAD INSTITUTE OF TECHNOLOGY AND
MANAGEMENT
(UGC Autonomous)
Affiliated to JNTUH, Approved by AICTE, Accredited by
NAAC (A+), NBA
DESCRIPTIVE AND INFERENTIAL STATISTICS
Submitted by
AELE VISHNUTEJA
(23E51A6702)
COMPUTER SCIENCE AND ENGINEERING
IN
DATA SCIENCE
(2025-2026)
Internal Examiner
External Examiner
TABLE OF CONTENTS
1. Statistics
   a) Measures of data
2. Data Visualization
   a) Pie chart
   b) Bar graph
   c) Histogram
   d) Heat map
   e) Scatter plot
   f) Box & Whisker graph
3. Descriptive Statistics
   Measures of Central Tendency
   a) Mean
   b) Median
   c) Mode
   Measures of Dispersion
   a) Range
   b) Variance
   c) Standard Deviation
   d) Skewness
   e) Quartile & Percentile
4. Bi-Variate Analysis
   a) Cross-Tabulation
   b) Covariance, Correlation
   c) Least square method
   d) Fitting 2nd degree polynomial
5. Hypothesis Testing
   a) Z-test
   b) T-test
   c) Python Codes
INTRODUCTION TO STATISTICS
Statistics is a branch of mathematics that involves collection,
organisation, analysis, interpretation, presentation and visualization of
data to derive meaningful insights (patterns, trends and unknown facts)
and conclusions to make informed decisions.
Key components of Statistics:
Data collection.
Organizing the Data.
Analyzing the data.
Presentation and Visualization.
Interpretation.
Importance of Statistics:
It plays a crucial role in providing the foundation for making data-driven
decisions, building predictive models and understanding patterns. It also
helps in
1) Understanding data distributions.
2) Hypothesis testing.
3) Regression analysis.
4) Data reduction and dimensionality.
5) Handling large and complex data sets.
6) Forecasting and predictive analysis.
* Descriptive Statistics:
It summarises and describes the main features and characteristics of a
dataset. It includes
* Measures of central tendency (mean, median, mode, weighted mean).
* Measures of dispersion (Range, Variance, Standard deviation,
Quartiles etc.)
Ex: If the marks of 100 students are taken from a population of size 1000:
Mean = 76, Median = 78, Standard deviation = ….62 (sample statistics).
* Inferential Statistics:
It makes estimations or conclusions about the population based on the
sample data.
* Hypothesis testing (t-test, z-test)
* Confidence intervals [drawing conclusion that the average marks of
1000 students is 76 based on the sample data (by using hypothesis
testing)].
* Data: It is the collection of facts and figures which are collected,
analysed and summarised.
* Types of Data:
1. Primary data: Data which is collected for the first time by the
researcher himself or herself is called primary data.
2. Secondary data: Data which has already been collected, organized and
summarized by another source (a third party) is called secondary data.
* Population: A population refers to the entire group of observations
which we are interested in studying.
Ex: All tweets posted on Twitter form a population.
* Sample: A sample is a subset of the population. It is a group
chosen from the population, used to make inferences about the entire
population.
Ex: A random selection of 10,000 tweets used to train a sentiment
analysis model.
* Variable: A characteristic that varies across the units is called a
variable.
Ex: In a dataset of all the students who passed their 12th grade, each
column represents a variable.
Roll no. | Name | Age | Board | Passed year | Marks (%) | Contact No. | Parent's Contact No. | Address | Email ID
301 | A | 18 | State | 2025 | 89 | XXX12 | XXX.9 | W | ABC
302 | B | 17 | CBSE | 2025 | 72 | XXX48 | XXX48 | X | DEF
303 | C | 17 | State | 2025 | 78 | XXX63 | XXX63 | Y | GHI
304 | D | 18 | State | 2024 | … | XXX.9 | XXX.8 | Z | JKL
For the above data, if a person wants to know which group the data belongs
to, the data is classified into 2 categories.
Measures of Data
1. Qualitative (or) Categorical data. Ex: Gender, Colour.
2. Quantitative (or) Numerical data. Ex: Age, Weight.
* Qualitative Data (Categorical data): It describes the qualities or
characteristics that cannot be measured numerically. It represents
categories or groups.
Ex: Colours (Red, Green, Blue, Pink, Lavender), Gender (Male, Female),
Movie rating (Excellent, Good, Average, Worst).
* Quantitative Data (Numerical data): It consists of numerical values
that can be measured and quantified. This type of data can be subjected
to mathematical operations and statistical analysis.
Ex: Age, Salary, University, Size, Temperature
Categorical Data
Nominal Data | Ordinal Data
* Measures of Categorical Data:
* Nominal data: It consists of categories that do not have a meaningful
order. It consists of labels or names which are used to identify the
characteristics of an observation.
Ex: Country names, subject names.
* Ordinal data: It has categories with a meaningful order, but the
distances between the categories are not uniform.
Ex: Movie ratings, Education level (SSC, Intermediate, B.Tech, M.Tech),
Food rating (Poor, Average, Good, Excellent).
Numerical Data
Discrete | Continuous | Interval Scale | Ratio Scale
* Discrete Data: Discrete data is a type of numerical data that consists of
distinct, separate values. These values are countable and cannot be
broken down into smaller parts meaningfully. Discrete data is often
obtained through counting rather than measuring.
Ex: Number of students in a class (e.g., 25, 30, 32; a class cannot have 25.5
students), Number of cars in a parking lot (e.g., 10, 15, 20; a lot cannot
have 10.3 cars)
* Continuous Data: Continuous data is a type of numerical data that can
take any value within a given range. It is measurable rather than
countable and can include decimals and fractions. Continuous data is
obtained through measurement and can be infinitely divided into smaller
parts.
Ex: Height of a person (e.g., 5.6 feet, 170.2 cm), Weight of an object
(e.g., 55.3 kg)
* Interval Scale: The interval scale measures variables where the intervals
between values are meaningful and consistent. However, there is no true
zero point, meaning zero does not indicate the complete absence of the
variable.
Ex: Temperature in Celsius: The difference between 10°C and 20°C
is the same as between 30°C and 40°C. However, 0°C does not indicate no
temperature.
* Ratio Scale: The ratio scale measures variables with meaningful and
consistent intervals and a true zero (0) point, which signifies the absence
of the measured attribute.
Ex: 1. A person whose weight is 60 kg is twice as heavy as someone whose
weight is 30 kg.
2. A distance of 0 km means no distance; 10 km is twice as far as
5 km.
MODULE 1 ENDS
MODULE 2
DATA VISUALIZATION
CHARTS FOR CATEGORICAL DATA
BAR GRAPH
* A bar graph is a chart that represents categorical data with
rectangular bars. The length or height of each bar represents
the frequency (or) proportion of the category.
* It is used for comparing frequencies (or) counts of different
categories.
* It is ideal for categorical (or) discrete data.
* The x-axis represents categories; the y-axis represents the
frequency count (or) percentage.
Advantages of Bar graph
1. It is easy to understand and interpret.
2. It is useful for comparing categories side-by-side.
3. It can be vertical or horizontal.
Steps to Create a Bar Graph in Excel:
1. Enter Data in Excel.
2. Select Data.
3. Go to the Insert tab.
4. Click on Bar Chart (or Column Chart) and choose Clustered
Column Chart (for vertical bars).
5. Customize Labels: add X-axis and Y-axis labels (X-axis =
Grades, Y-axis = Frequency).
6. Click on the chart title and rename it (e.g., "Grade
Distribution").
Ex: The grades of 20 students are given below.
[Table: Internal 1 Grades and Frequency; the values are not legible in the source]
[Figure: Bar chart for Internal 1 marks, with grades O, A, B, C, D on the x-axis]
Steps to do in Python
import pandas as pd
from matplotlib import pyplot as plt
import seaborn as sns

# Load the Superstore data and preview the first rows
data = pd.read_excel("Superstore_data.xlsx")
data.head()

# Count plot: one bar per category, labelled with its frequency
Category_plot = sns.countplot(x='Category', data=data)
Category_plot.bar_label(Category_plot.containers[0])
plt.title("Distribution of Category")
plt.show()
[Figure: bar chart "Distribution of Category" with the categories Furniture, Office Supplies, Technology]
PIE CHART
1. A pie chart is a circular chart divided into sectors, each
representing the proportion of a category. Each sector's angle
is proportional to the quantity it represents.
2. It is used to show the proportion or frequency of each
category in relation to the whole.
3. It is best when there are 3 to 5 categories.
4. It is eye-catching and simple for a quick understanding.
Ex: The no. of students in the CSD branch (section wise)
NO. OF STUDENTS per section: 66, 70, 62
[Figure: pie chart of NO. OF STUDENTS]
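Since each sector's angle is proportional to the quantity it represents, the angles for the section-wise example can be worked out directly. The following is a minimal sketch; the section names are placeholders, because the labels are not given here.

```python
# Student counts from the section-wise example; the section names are placeholders.
students = {"Section 1": 66, "Section 2": 70, "Section 3": 62}

total = sum(students.values())

# Each sector's angle is its share of the total, scaled to 360 degrees.
angles = {name: count / total * 360 for name, count in students.items()}
shares = {name: count / total * 100 for name, count in students.items()}

for name in students:
    print(f"{name}: {shares[name]:.1f}% of students, {angles[name]:.1f} degrees")
```

The angles always add up to 360 degrees, which is a quick check on the arithmetic.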
Steps to create Pie-Chart in Excel
1. Enter Data
2. Select Data
3. Insert Pie-Chart and choose the style 2D or 3D
4. Customize Chart: Modify colours, styles, and add labels using
the Chart Tools.
Internal 2 Grades | Frequency
Frequencies listed: 18, 14, 9, 14, 12
[Figure: Pie chart for internal 2 grades; legend: O, A, B, C, D]
HISTOGRAM
A histogram is a graphical representation of the distribution of a
dataset. It is used for continuous or grouped numerical data. It
shows visually how data is spread across different intervals or bins.
Each bar in a histogram represents the frequency of the data points
within a certain range, known as a bin. The height of the bar indicates
the no. of observations within each bin. Like a bar graph, it consists of
rectangular bars.
Key concepts of histogram
1. Shape: The shape of a histogram can provide insights into the distribution
of the data.
* Normal distribution: A bell-shaped curve
* Skewed right: A long tail on the right side
* Skewed left: A long tail on the left side
2. Interpretation: Histograms are often used to understand the central
tendency, spread and shape of the data.
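To make the idea of bins concrete, here is a small sketch that counts how many values fall in each bin and prints a text histogram. The ages, the bin width and the starting edge are all hypothetical choices for illustration.

```python
from collections import Counter

# Hypothetical ages of 20 students (not the report's data).
ages = [18, 19, 19, 20, 20, 20, 21, 21, 21, 21,
        22, 22, 22, 23, 23, 24, 24, 25, 26, 27]

bin_width = 2   # assumed width of each bin
start = 18      # assumed lower edge of the first bin

# Map each value to the lower edge of its bin, then count the values per bin.
counts = Counter(start + ((age - start) // bin_width) * bin_width for age in ages)

for lower in sorted(counts):
    upper = lower + bin_width
    print(f"[{lower}, {upper}): {'#' * counts[lower]} ({counts[lower]})")
```

The bar heights here are exactly the frequencies a plotted histogram would show for the same bins.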
Ex: The age groups of 20 students are given below.
Steps to do in Excel
1. Enter Data
2. Select Data
3. Insert a Histogram
4. Customize the chart: Use Chart Tools to change colours,
styles, and axis labels.
[Figure: Histogram of distribution of internal 1; the axis values are not legible in the source]
Steps to do in Python
import numpy as np
import matplotlib.pyplot as plt

# Normally distributed sample: bell-shaped histogram
data_normal = np.random.normal(loc=0, scale=1, size=1000)
plt.hist(data_normal, bins=30, edgecolor='black', color='skyblue')
plt.title('Histogram of Data')
plt.xlabel('Value')
plt.ylabel('Frequency')
plt.show()
[Figure: Histogram of Data]

# Exponentially distributed sample: long tail on the right
data_skewed_right = np.random.exponential(scale=1, size=1000)
plt.hist(data_skewed_right, bins=30, edgecolor='black', color='orange')
plt.title('Skewed Right Distribution (Long tail on the Right)')
plt.xlabel('Value')
plt.ylabel('Frequency')
plt.show()
[Figure: Skewed Right Distribution (Long tail on the Right)]
Line Graph
A line graph is a graphical representation of data points
connected by straight lines. It is used to visualize the trend of a
variable over time or across a continuous range. The line graph is
particularly useful when showing changes over a period of time or when
the data points are sequential (daily, monthly, yearly). It displays
continuous data such as stock prices, temperature and sales.
Steps to do in Excel
1. Enter Your Data: Open Excel and enter data in two columns.
Column 1: Labels (e.g., Months, Years, Categories).
Column 2: Values (e.g., Sales, Profits, Growth).
2. Select the Data
3. Insert a Line Graph and choose a line chart style (e.g., simple,
stacked, or with markers).
4. Click on the chart and use Chart Tools to change colours,
styles, or add gridlines, and to modify the chart title and axis labels.
[Figure: Excel line graph; the labels are not legible in the source]
Scatter Plot
A scatter plot is a graphical representation of two variables
where each point represents the relation between them. It helps
in visualizing the correlation or association between two
continuous variables.
Key components of a Scatterplot
1. X-axis and Y-axis: Represent two continuous variables.
2. Data Points: Each point shows a data observation based on its x
and y values.
3. Grid lines: They help to estimate the values of data points.
Ex: Experience and salary of 6 employees are given below.
Steps to do in Excel
1. Enter Data: column 1: X-axis values (e.g., Time, Age, etc.); column 2:
Y-axis values (e.g., Sales, Growth, etc.)
2. Select Data
3. Insert Scatter Plot: Go to the "Insert" tab. In the Charts group, click
"Scatter Chart" (dot graph icon). Choose a scatter plot type (e.g.,
simple, smooth, or with lines).
4. Customize the Chart: Use Chart Tools to edit colours, axis labels,
gridlines, and titles.
Box-Plot
A box plot is a great visualization to understand the distribution of
data, including the median, quartiles and potential outliers. It is useful
when you want to compare a numerical variable across different
categories.
Key components of Box-plot
1. Box: It represents the middle 50% of the data [from Q1 to Q3].
2. Median line: It represents the median (Q2) inside the box.
3. Whiskers: They show the spread of the data excluding outliers.
4. Outliers: Individual points outside the whiskers, indicating
extreme values.
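The components listed above can be computed by hand before drawing anything. The sketch below uses hypothetical marks and assumes the common 1.5 × IQR rule for deciding which points count as outliers; that rule is a convention, not something stated in this report.

```python
import statistics

# Hypothetical marks; 95 is deliberately far from the rest.
marks = [42, 45, 47, 50, 52, 55, 58, 60, 63, 95]

# Box: Q1 to Q3, with the median (Q2) inside it.
q1, q2, q3 = statistics.quantiles(marks, n=4, method="inclusive")
iqr = q3 - q1

# Assumed convention: points more than 1.5 * IQR beyond the box are outliers.
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr
outliers = [m for m in marks if m < lower_fence or m > upper_fence]

# Whiskers reach the most extreme points that are not outliers.
inside = [m for m in marks if lower_fence <= m <= upper_fence]
whisker_low, whisker_high = min(inside), max(inside)

print("Q1, median, Q3:", q1, q2, q3)
print("Whiskers:", whisker_low, "to", whisker_high)
print("Outliers:", outliers)
```

Different tools use slightly different quartile methods, so the box edges drawn by Excel or seaborn may differ a little from these values.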
Steps to do in Excel:
1. Enter Data: Open Excel and enter your dataset in a single column
or multiple columns
2. Select Data
3.Insert a Box Plot: Go to the "Insert" tab. Click on "Insert Statistic
Chart". Select "Box and Whisker" chart.
4. Customize the Chart: Use the Chart Tools to modify colours, titles,
and axis labels.
Steps to do in Python
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

# mt holds the mtcars dataset; the file name here is an assumption
mt = pd.read_csv("mtcars.csv")

# create the boxplot
plt.figure(figsize=(8, 6))
sns.boxplot(y='mpg', data=mt, color='yellow')
plt.title('boxplot of mpg in mtcars dataset')
plt.ylabel('mpg')
plt.show()
Heat Map
1. A heat map is an advanced data visualization that uses colours to
represent values in a dataset. Darker or more intense colours usually
indicate higher values, while lighter colours show lower values.
2. It is used to show patterns, trends and relationships in a larger
dataset.
3. It is used with geographical data for representing sales, population,
temperature etc.
Steps to do in Excel:
1. Enter and Select Data
2. Apply Conditional Formatting: Go to the "Home" tab. Click
"Conditional Formatting" > "Colour Scales". Choose a preset
colour scale (e.g., red-to-green or blue-to-white)
3. Customize the Colour Scale
4. Remove Gridlines for a Clean Look
[Figure: Excel heat map of values for Indian states and union territories (e.g. Andhra Pradesh, Uttar Pradesh, West Bengal, Odisha, Rajasthan, Gujarat, Chhattisgarh, Manipur, Tripura, Meghalaya, Arunachal Pradesh, Sikkim, Nagaland, Ladakh, Lakshadweep, Andaman and Nicobar Islands); the numeric values are not legible in the source]
DESCRIPTIVE STATISTICS
Descriptive statistics is a branch of statistics focused on summarizing and
organizing data in a meaningful way. It involves using various measures like
mean, median, mode, range, standard deviation, and frequency distribution to
describe key features of a dataset. The goal is to provide a concise and
informative representation of the data without making inferences about a
larger population.
Descriptive Statistics
1. Measure of central tendency: Mean, Median, Mode
2. Measure of dispersion: Range, Variance, Standard Deviation, Skewness, Percentile & Quartile
Module-3
Measure of Central Tendency: It helps us to summarize the data set
by identifying the central value. The 3 main measures are mean,
median and mode.
Mean (Arithmetic Mean): Mean is the sum of all values in a data
set divided by the total no. of values. It represents the
average value.
Formula: Mean = Sum of all observations ÷ no. of observations [or]
$\text{Mean} = \frac{\sum x_i}{N}$
where $x_i$ represents each value in the data set and $N$ is the no. of
observations.
Example: no. of vehicles passing through a toll gate in 7 days
Day | No. of vehicles
Legible values: 1000, 1500, 1650, 2250, 3000, …
Therefore, it is ungrouped data.
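The three measures can be computed for a small ungrouped data set with Python's standard library. This is only an illustrative sketch: the seven daily counts below are hypothetical, since the full table for the toll gate example is not available here.

```python
import statistics

# Hypothetical vehicle counts for 7 days (illustrative values only).
vehicles = [1000, 1500, 1650, 2250, 3000, 1500, 2100]

mean = sum(vehicles) / len(vehicles)    # sum of observations / no. of observations
median = statistics.median(vehicles)    # middle value after sorting
mode = statistics.mode(vehicles)        # most frequent value

print("Mean:", round(mean, 2))
print("Median:", median)
print("Mode:", mode)
```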
Excel:
Date | Product | Region | Sales_Amount | Profit_Margin | Units_Sold | Customer_Rating | Month
######## | Smartwatch | North | 34299 | 9.86 | 467 | 2.9 | 2024-04
######## | Tablet | East | 17874 | 6.77 | 99 | 4.4 | 2024-12
######## | Tablet | South | 37711 | 14.92 | 427 | 4 | 2024-09
######## | Laptop | South | 10539 | 6.27 | 124 | 2.8 | 2024-04
######## | Tablet | North | 47405 | 27.17 | 114 | 2.7 | 2024-03
######## | Tablet | West | 7557 | 5.69 | 400 | 4.3 | 2024-07
######## | Laptop | South | 43360 | 19.47 | 205 | 2.7 | 2024-01
Measure | Excel formula | Result
mean | AVERAGE(select the column) | 27023.13
median | MEDIAN(select the column) | 26683
mode | MODE.SNGL(select the column) | 34299
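Python Code:
import pandas as pd

# telco data set, as used later in this report
df = pd.read_excel("telco.xlsx")

mean = df['age'].mean()
print("Mean:", mean)
median = df['age'].median()
print("Median:", median)
# mode() returns a Series; take the first (most frequent) value
mode = df['age'].mode()[0]
print("Mode:", mode)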
UNIT-2
Measure of Dispersion: A measure of central tendency gives
us an idea about the concentration of the distribution of the
data set by using a single value. If we know the average value
alone, we cannot form a complete idea about the
distribution.
Dispersion means scatteredness. Measures of dispersion study
the homogeneity and heterogeneity of the data.
Homogeneous Distribution: The data set is considered homogeneous if
the values of the data set are closely clustered around the mean
or central value. It indicates low dispersion [low range,
variance, standard deviation]. It suggests the consistency
and uniformity of the distribution.
Heterogeneous Distribution: The data set is considered
heterogeneous if the values of the distribution are spread
out from the mean or central value. It indicates high
dispersion [high range, variance, standard deviation]. It
suggests the diversity and variability of the distribution.
Range: The range is the difference between the 2
extreme observations of the distribution, i.e.,
R = maximum value - minimum value.
Quartile:
Quartile for ungrouped data.
Min value | Q1 | Q2 | Q3 | Max value
Q1 = 25% (Lower quartile), Q2 = 50% (Middle quartile), Q3 = 75% (Upper quartile)
For ungrouped data, the 1st quartile is obtained as the $\frac{n+1}{4}$-th
observation, the 3rd quartile is obtained as the $\frac{3(n+1)}{4}$-th
observation, and the 2nd quartile is the median.
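The position rule above can be sketched in a few lines of Python. The data are hypothetical, and the linear interpolation used when a position is not a whole number is an assumption added here for completeness (the rule above only gives the positions).

```python
def value_at_position(sorted_data, position):
    """Value at a 1-based, possibly fractional, position (linear interpolation)."""
    lower = int(position)          # whole part of the position
    fraction = position - lower
    if lower < 1:
        return sorted_data[0]
    if lower >= len(sorted_data):
        return sorted_data[-1]
    below = sorted_data[lower - 1]
    above = sorted_data[lower]
    return below + fraction * (above - below)


def quartiles(data):
    """Q1, Q2, Q3 using the (n+1)/4, (n+1)/2 and 3(n+1)/4 positions."""
    ordered = sorted(data)
    n = len(ordered)
    q1 = value_at_position(ordered, (n + 1) / 4)
    q2 = value_at_position(ordered, (n + 1) / 2)
    q3 = value_at_position(ordered, 3 * (n + 1) / 4)
    return q1, q2, q3


data = [12, 7, 3, 9, 15, 21, 18]   # hypothetical observations
q1, q2, q3 = quartiles(data)
print("Q1:", q1, "Q2:", q2, "Q3:", q3)
```

With 7 observations the positions are 2, 4 and 6, so the quartiles are simply the 2nd, 4th and 6th sorted values.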
Quartile for grouped data: For grouped data the 1st and 3rd
quartiles are calculated by using the following formulas.
First Quartile: $Q_1 = l_1 + \frac{N/4 - cf_1}{f_1} \times h_1$
where $l_1$ = lower limit of the 1st quartile class,
$cf_1$ = cumulative frequency of the class preceding the 1st
quartile class,
$f_1$ = frequency of the 1st quartile class,
$h_1$ = width (length) of the 1st quartile class.
Third Quartile: $Q_3 = l_3 + \frac{3N/4 - cf_3}{f_3} \times h_3$
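As a hedged illustration of the grouped-data formulas, the sketch below locates the quartile class from the cumulative frequencies and then applies the formula. The class intervals and frequencies are hypothetical, and the function name is made up for this example.

```python
def grouped_quartile(classes, k):
    """k-th quartile (k = 1, 2, 3) for classes given as (lower, upper, frequency)."""
    total = sum(freq for _, _, freq in classes)
    target = k * total / 4            # N/4 for Q1, 3N/4 for Q3
    cumulative = 0                    # c.f. of the classes before the current one
    for lower, upper, freq in classes:
        if freq > 0 and cumulative + freq >= target:
            width = upper - lower
            return lower + (target - cumulative) / freq * width
        cumulative += freq
    raise ValueError("no quartile class found")


# Hypothetical frequency table: (lower limit, upper limit, frequency)
classes = [(0, 10, 5), (10, 20, 15), (20, 30, 20), (30, 40, 10)]

q1 = grouped_quartile(classes, 1)
q3 = grouped_quartile(classes, 3)
print("Q1:", q1, "Q3:", q3)
```

Here N = 50, so Q1 falls in the class 10-20 (c.f. reaches 20 ≥ 12.5) and Q3 in the class 20-30 (c.f. reaches 40 ≥ 37.5).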
Variance: It measures the average squared deviation of each
point from the mean value. It explains how the data is spread
out from the mean: a higher variance means the data points are
spread out widely from the mean, while a lower variance means
they are closer to the mean. It is represented by $\sigma^2$ for a
population variance.
Sample variance: $S^2 = \frac{\sum (x_i - \bar{x})^2}{n - 1}$
Standard Deviation: It is the square root of the variance. It gives
the dispersion in the same units as the original data, unlike variance.
It is denoted by $\sigma$.
$\sigma = \sqrt{\text{Variance}}$
Note:
1. If $\sigma^2 \ll \mu$, low variance
2. If $\sigma^2 \approx \mu$, moderate variance
3. If $\sigma^2 \gg \mu$, high variance
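A short sketch of these definitions, using hypothetical data, shows the difference between dividing by n - 1 (sample variance) and by n (population variance).

```python
import math
import statistics

data = [2, 4, 4, 4, 5, 5, 7, 9]   # hypothetical observations

mean = sum(data) / len(data)
squared_deviations = [(x - mean) ** 2 for x in data]

sample_variance = sum(squared_deviations) / (len(data) - 1)   # divide by n - 1
population_variance = sum(squared_deviations) / len(data)     # divide by n

sample_sd = math.sqrt(sample_variance)
population_sd = math.sqrt(population_variance)

print("Sample variance:", sample_variance, "Sample SD:", sample_sd)
print("Population variance:", population_variance, "Population SD:", population_sd)
```

The standard library functions `statistics.variance` and `statistics.pvariance` give the same two results.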
Percentile: Percentiles divide the data into 100 equal parts, and there are 99
percentiles. A percentile is a measure used in statistics indicating
the value below which a given percentage of observations fall.
Percentiles are commonly used to know the relative position of a
particular value in a given data set. For example, if a candidate scores
90 marks in an exam and this score is greater than or equal to the
scores of 86 percent of the students taking that test, then the
percentile rank would be 86; the candidate is in the 86th
percentile.
Percentile for ungrouped data:
1) Arrange the data in ascending order.
2) Calculate the location $i$ by using $i = \frac{P}{100} \times N$, where
$P$ = percentile required, $N$ = number of observations, $i$ =
percentile location.
3) If $i$ is a whole number, then the percentile value is the average
of the values at the $i$-th and $(i+1)$-th locations.
4) If $i$ is not a whole number, then the percentile value is the value
at the next whole location greater than $i$.
Percentile for grouped data is calculated by:
$P_i = l_i + \frac{\frac{iN}{100} - cf_i}{f_i} \times h_i$
The class interval with C.F. just greater than $\frac{iN}{100}$ is the $i$-th
percentile class, where
$l_i$ = lower limit of the $i$-th percentile class, $cf_i$ = C.F. of the class
preceding the $i$-th percentile class,
$h_i$ = width of the $i$-th percentile class, $f_i$ = frequency of the $i$-th
percentile class.
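The four steps for ungrouped data translate almost line by line into code. This is a sketch with hypothetical data; note that library functions such as `np.percentile` interpolate differently by default, so their results can differ from this textbook rule.

```python
import math


def percentile(data, p):
    """p-th percentile (0 < p < 100) using the location rule i = P/100 * N."""
    if not 0 < p < 100:
        raise ValueError("p must be between 0 and 100 (exclusive)")
    ordered = sorted(data)              # step 1: ascending order
    n = len(ordered)
    i = p * n / 100                     # step 2: location
    if i == int(i):
        i = int(i)
        # step 3: whole number, so average the i-th and (i+1)-th values
        return (ordered[i - 1] + ordered[i]) / 2
    # step 4: not whole, so take the value at the next whole location
    return ordered[math.ceil(i) - 1]


data = [10, 20, 30, 40, 50, 60, 70, 80, 90, 100]   # hypothetical observations
print("P50:", percentile(data, 50))
print("P22:", percentile(data, 22))
```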
Skewness: Skewness means lack of symmetry. It gives an idea about the
shape of the curve of the given data. A distribution is said to be skewed if
i) Mean, median and mode fall at different points, i.e.,
mean ≠ median ≠ mode.
ii) The quartiles are not equidistant from the median, i.e., $Q_3 - Q_2 \neq Q_2 - Q_1$.
iii) The curve drawn with the help of the data is not symmetrical
but stretched more to one side than to the other.
Python code
import pandas as pd
import numpy as np
from scipy import stats

df = pd.read_excel("telco.xlsx")
df.head()
df.tail()

mean = df['age'].mean()
print("Mean:", mean)
# Mean: 41.684
median = df['age'].median()
print("Median:", median)
# Median: 40.0
mode = df['age'].mode()[0]
print("Mode:", mode)
# Mode: 33

mean_income = df['income'].mean()
median_income = df['income'].median()
print("mean_income", mean_income)
print("median_income", median_income)

# find the range of the age variable in the telco data set
age_range = max(df['age']) - min(df['age'])
print("Range:", age_range)
# Range: 59

# inter quartile range for the age variable
Q1 = np.percentile(df['age'], 25)
Q3 = np.percentile(df['age'], 75)
IQR = Q3 - Q1
print("Inter Quartile Range:", IQR)
# Inter Quartile Range: 19.0

min(df['income'])
max(df['income'])
# output recorded in the source: 1668

# finding percentile values for the age variable in telco
P22 = np.percentile(df['age'], 22)
print("22nd Percentile value", P22)
# 22nd Percentile value 31.0
P89 = np.percentile(df['age'], 89)
print("89th Percentile value", P89)
# 89th Percentile value 58.110000000000014

# sample variance and standard deviation (ddof=1 divides by n - 1)
variance = np.var(df['age'], ddof=1)
print("Variance:", variance)
# Variance: 157.72386786786805
Std_deviation = np.std(df['age'], ddof=1)
print("Standard_deviation:", Std_deviation)
# Standard_deviation: 12.558816340239556

# calculating the skewness value for the given data
import matplotlib.pyplot as plt
import seaborn as sns
data1 = [...]  # the list of values is not legible in the source
s1 = pd.Series(data1)
skewness_value = s1.skew()
print("Skewness:", skewness_value)
# Skewness: 0.4133494554358116
plt.figure(figsize=(8, 5))
sns.histplot(s1, kde=True, color='blue', bins=8)
plt.title(f"Data Distribution\nSkewness = {skewness_value:.2f}")
plt.xlabel("Data Values")
plt.ylabel("Frequency")
plt.grid(True)
plt.show()
Excel
Date | Product | Region | Sales_Amount | Profit_Margin | Units_Sold | Customer_Rating | Month
######## | Smartwatch | North | 34299 | 9.86 | 467 | 2.9 | 2024-04
######## | Tablet | East | 17874 | 6.77 | 99 | 4.4 | 2024-12
######## | Tablet | South | 37711 | 14.92 | 427 | 4 | 2024-09
######## | Laptop | South | 10539 | 6.27 | 124 | 2.8 | 2024-04
######## | Tablet | North | 47405 | 27.47 | 114 | 2.7 | 2024-03
######## | Tablet | West | 7557 | 5.69 | 400 | 4.3 | 2024-07
######## | Laptop | South | 43360 | 19.47 | 205 | 2.7 | 2024-01
Measure | FOR SALES | FOR PROFIT
max | 49080 | 29.55
min | 5190 | 5.38
range | 43290 | 24.17
q1 | 12744.5 | 9.86
q3 | 39603.75 | 22.9575
iqr | 25859.25 | 13.0975
variance | 192588919.2 | 56.48612
s.d | 13877.64084 | 7.515725
STEPS
max | MAX(select the column)
min | MIN(select the column)
range | max - min
q1 | QUARTILE.INC(select the column, 1)
q3 | QUARTILE.INC(select the column, 3)
variance | VAR.P(select the column)
s.d | STDEV.P(select the column)
[Excel output, "Measure of dispersion": Maximum 25.0, Minimum 0.0, Range 25.0, Q1 12.75, Q3 23, IQR 10.25, Skewness -0.737794 (Left Skew Distribution); the remaining percentile values are not legible in the source]
Module-4
Bi-Variate Analysis
Cross-tabulation: It is a type of table that displays the
multivariate frequency distribution, and it is commonly used
to compare the results of one or more variables.
Note: 1) A cross-table summarizes the relation between 2
categories of data.
Ex: In a class of 70 students, 40 are male and 30 are female.
Out of the 40 male students, 18 are A band students and 22 are
B band students, whereas among the 30 female students 16 are
A band and 14 are B band. Create a cross tabulation
for the given data and create a side-by-side chart.
Sol: Given 70 students in a class, 40 are male and 30 are female.
Gender \ Band | A | B
Male | 18 | 22
Female | 16 | 14
[Figure: side-by-side bar chart of bands A and B by Gender]
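The same cross-table can be built in code by counting (gender, band) pairs. The sketch below rebuilds one record per student from the counts in the example and uses only the standard library; with pandas, `pd.crosstab` would be the usual tool.

```python
from collections import Counter

# One (gender, band) record per student, rebuilt from the counts in the example.
students = ([("Male", "A")] * 18 + [("Male", "B")] * 22 +
            [("Female", "A")] * 16 + [("Female", "B")] * 14)

table = Counter(students)            # frequency of each (gender, band) pair
genders = ["Male", "Female"]
bands = ["A", "B"]

# Print the cross-table with row and column totals.
print(f"{'Gender':<8}" + "".join(f"{band:>6}" for band in bands) + f"{'Total':>7}")
for gender in genders:
    row = [table[(gender, band)] for band in bands]
    print(f"{gender:<8}" + "".join(f"{count:>6}" for count in row) + f"{sum(row):>7}")

column_totals = [sum(table[(g, b)] for g in genders) for b in bands]
print(f"{'Total':<8}" + "".join(f"{c:>6}" for c in column_totals) + f"{len(students):>7}")
```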
Covariance: $\text{cov}(x, y) = \frac{\sum (x_i - \bar{x})(y_i - \bar{y})}{n}$
Correlation: Karl Pearson's correlation coefficient is
$r = \frac{\sum XY}{\sqrt{\sum X^2 \cdot \sum Y^2}}$, where $X = x - \bar{x}$ and $Y = y - \bar{y}$.
Spearman's rank correlation coefficient is used for ranked data.
Least Square Method: It is a statistical approach used to find
the best fitting line or curve through a set of data points by
minimizing the sum of squares of the vertical differences
(errors) between the observed values and the predicted values on the line.
Fitting a straight line by the least square method: Let $(x_1, y_1)$,
$(x_2, y_2)$, …, $(x_n, y_n)$ be the set of given n points, and let $y = a + bx$ (1) be
the required equation of the straight line. It is required to find the
values of a and b from the normal equations:
$\sum y = an + b \sum x$
$\sum xy = a \sum x + b \sum x^2$
By solving the above equations we get the values of a and b; by
substituting those values of a and b in eq. (1) we get the required
straight line.
Fitting a second degree polynomial using the least square
method: Let $(x_1, y_1)$, $(x_2, y_2)$, …, $(x_n, y_n)$ be the given set of n
points, and let $y = a + bx + cx^2$ (1) be the required equation of the
parabola. It is required to find the values of a, b, c from the
normal equations:
$\sum y = an + b \sum x + c \sum x^2$ (2)
$\sum xy = a \sum x + b \sum x^2 + c \sum x^3$ (3)
$\sum x^2 y = a \sum x^2 + b \sum x^3 + c \sum x^4$ (4)
By solving the above equations we get the values of a, b and c; by substituting
those values in eq. (1) we get the required equation of the parabola.
Using Excel:
[Excel screenshot: rows of the data used for the fit; the values are not legible in the source]
Fitting a straight line
x | y
2 | 4
4 | 5
8 | 2
10 | 15
12 | 16
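The covariance, correlation and straight-line formulas from this module can be checked numerically. The sketch below uses hypothetical points (not the table above) that lie close to y = 1 + 2x, and solves the two normal equations for a and b in closed form.

```python
import math

# Hypothetical points lying close to y = 1 + 2x
x = [1, 2, 3, 4, 5]
y = [3.1, 4.9, 7.2, 8.8, 11.0]
n = len(x)

mean_x, mean_y = sum(x) / n, sum(y) / n
dx = [xi - mean_x for xi in x]      # deviations X = x - mean of x
dy = [yi - mean_y for yi in y]      # deviations Y = y - mean of y

# Covariance (dividing by n) and Karl Pearson's correlation coefficient
covariance = sum(a * b for a, b in zip(dx, dy)) / n
r = sum(a * b for a, b in zip(dx, dy)) / math.sqrt(
    sum(a * a for a in dx) * sum(b * b for b in dy))

# Normal equations for y = a + b*x:
#   sum(y)  = a*n      + b*sum(x)
#   sum(xy) = a*sum(x) + b*sum(x^2)
sum_x, sum_y = sum(x), sum(y)
sum_xy = sum(xi * yi for xi, yi in zip(x, y))
sum_x2 = sum(xi * xi for xi in x)

b = (n * sum_xy - sum_x * sum_y) / (n * sum_x2 - sum_x ** 2)
a = (sum_y - b * sum_x) / n

print("covariance:", covariance, "r:", r)
print(f"fitted line: y = {a:.2f} + {b:.2f}x")
```

The closed-form expressions for a and b are just the solution of the two normal equations, so substituting them back must satisfy both equations.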
Second degree polynomial
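For a second degree polynomial y = a + bx + cx², the three normal equations form a 3 × 3 linear system in a, b and c. The sketch below builds that system from hypothetical data and solves it with a small Gaussian elimination helper written for this example (a library routine such as `numpy.linalg.solve` would do the same job).

```python
def solve_linear_system(matrix, rhs):
    """Solve a small square linear system by Gaussian elimination with pivoting."""
    size = len(rhs)
    aug = [row[:] + [value] for row, value in zip(matrix, rhs)]
    for col in range(size):
        # Move the row with the largest entry in this column into pivot position.
        pivot = max(range(col, size), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for row in range(col + 1, size):
            factor = aug[row][col] / aug[col][col]
            for k in range(col, size + 1):
                aug[row][k] -= factor * aug[col][k]
    # Back substitution
    solution = [0.0] * size
    for row in range(size - 1, -1, -1):
        known = sum(aug[row][k] * solution[k] for k in range(row + 1, size))
        solution[row] = (aug[row][size] - known) / aug[row][row]
    return solution


# Hypothetical data generated exactly from y = 1 + 2x + 3x^2
x = [0, 1, 2, 3, 4]
y = [1, 6, 17, 34, 57]
n = len(x)

sx = sum(x)
sx2 = sum(v ** 2 for v in x)
sx3 = sum(v ** 3 for v in x)
sx4 = sum(v ** 4 for v in x)
sy = sum(y)
sxy = sum(xi * yi for xi, yi in zip(x, y))
sx2y = sum(xi ** 2 * yi for xi, yi in zip(x, y))

# Normal equations written as a matrix system in (a, b, c)
a, b, c = solve_linear_system(
    [[n, sx, sx2],
     [sx, sx2, sx3],
     [sx2, sx3, sx4]],
    [sy, sxy, sx2y],
)
print(f"y = {a:.2f} + {b:.2f}x + {c:.2f}x^2")
```

Because the data were generated from a known parabola, the fit should recover a = 1, b = 2, c = 3 up to rounding error.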
[Python code screenshot for the second degree polynomial fit; the text is not legible in the source]
[Figure: Crop Yield (tons/acre) versus Fertilizer (kg/acre), showing Actual Data, 2nd Degree Fit and Predictions]
Module-5
Testing of hypothesis for single mean: $Z = \frac{\bar{x} - \mu}{\sigma / \sqrt{n}}$
Test of significance for difference of means of 2 large samples:
If two samples are taken from two different populations
having standard deviations $\sigma_1$ and $\sigma_2$ with sizes $n_1$ and $n_2$, then the test
statistic is given by:
$Z = \frac{\bar{x}_1 - \bar{x}_2}{\sqrt{\sigma_1^2/n_1 + \sigma_2^2/n_2}}$
If two samples are taken from the same population having
standard deviation $\sigma$, then the test statistic is given by:
$Z = \frac{\bar{x}_1 - \bar{x}_2}{\sigma \sqrt{1/n_1 + 1/n_2}}$
If two samples are taken from populations whose standard
deviation is unknown but the standard deviations of the samples are
given as $S_1$ and $S_2$, then the test statistic is given by:
$Z = \frac{\bar{x}_1 - \bar{x}_2}{\sqrt{S_1^2/n_1 + S_2^2/n_2}}$
Test of significance for single proportion (large samples):
Consider a large random sample of size n with sample
proportion p. To test the hypothesis that the population proportion is P, the
test statistic is given by:
$Z = \frac{p - P}{\sqrt{PQ/n}}$, where $Q = 1 - P$
Testing of hypothesis for difference of proportions: When we
want to compare 2 population proportions based on the
sample data, we use the hypothesis test for the
difference of proportions.
Null Hypothesis: $H_0: P_1 = P_2$, there is no significant
difference.
Alternative Hypothesis: $H_1: P_1 \neq P_2$ [two-tailed test]
$H_1: P_1 > P_2$ [right-tailed] [or]
$H_1: P_1 < P_2$ [left-tailed]
Z-Test
Testing of hypothesis for single mean (large sample)
When the sample size is large (n > 30) and the population standard deviation is
known, we use the Z-test for comparing a single mean.
$Z = \frac{\bar{x} - \mu}{\sigma / \sqrt{n}}$
where $\bar{x}$ = mean of sample
$\mu$ = mean of population
$\sigma$ = standard deviation of population
$n$ = no. of observations.
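A minimal sketch of the single-mean Z-test, with hypothetical numbers and an assumed significance level of 0.05, might look like this. It uses only the standard library; `NormalDist` supplies the critical value and p-value.

```python
import math
from statistics import NormalDist

# Hypothetical inputs (not from the report)
sample_mean = 52.0        # x-bar
population_mean = 50.0    # mu under the null hypothesis
sigma = 8.0               # known population standard deviation
n = 64                    # sample size (large, n > 30)
alpha = 0.05              # assumed significance level

# Z = (x-bar - mu) / (sigma / sqrt(n))
z = (sample_mean - population_mean) / (sigma / math.sqrt(n))

# Two-tailed test: compare |Z| with the critical value, or use the p-value.
z_critical = NormalDist().inv_cdf(1 - alpha / 2)
p_value = 2 * (1 - NormalDist().cdf(abs(z)))

print("Z =", z, "critical value =", round(z_critical, 3), "p-value =", round(p_value, 4))
if abs(z) > z_critical:
    print("Reject the null hypothesis")
else:
    print("Fail to reject the null hypothesis")
```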
Testing of hypothesis for difference of means (large samples)
* If two samples are taken from two different populations having
standard deviations $\sigma_1$, $\sigma_2$ with sizes $n_1$ and $n_2$, then the test statistic
is given by
$Z = \frac{\bar{x}_1 - \bar{x}_2}{\sqrt{\sigma_1^2/n_1 + \sigma_2^2/n_2}}$
* If two samples are taken from the same population having standard
deviation $\sigma$, then the test statistic is given by
$Z = \frac{\bar{x}_1 - \bar{x}_2}{\sigma \sqrt{1/n_1 + 1/n_2}}$
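Both cases can be handled by one small function, since the same-population formula is the first formula with $\sigma_1 = \sigma_2 = \sigma$. The inputs below are hypothetical and the function name is made up for this sketch.

```python
import math
from statistics import NormalDist


def two_sample_z(mean1, mean2, sigma1, sigma2, n1, n2):
    """Z statistic for the difference of two means (large samples)."""
    standard_error = math.sqrt(sigma1 ** 2 / n1 + sigma2 ** 2 / n2)
    return (mean1 - mean2) / standard_error


# Two different populations (hypothetical values)
z = two_sample_z(75.0, 72.0, sigma1=6.0, sigma2=8.0, n1=36, n2=64)
p_value = 2 * (1 - NormalDist().cdf(abs(z)))   # two-tailed p-value

# Same population: pass the common sigma for both samples
z_same = two_sample_z(75.0, 72.0, sigma1=6.0, sigma2=6.0, n1=36, n2=64)

print("Z (different populations):", round(z, 4), "p-value:", round(p_value, 4))
print("Z (same population):", round(z_same, 4))
```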
Testing of hypothesis for single proportion
Test statistic is given by
$Z = \frac{p - P}{\sqrt{PQ/n}}$
where $p$ = sample proportion
$P$ = population proportion and $Q = 1 - P$
Testing of hypothesis for difference of proportions
When we want to compare two population proportions based on the sample
data, we use the hypothesis test for difference of proportions.
$Z = \frac{\hat{p}_1 - \hat{p}_2}{\sqrt{\hat{p}(1 - \hat{p})(1/n_1 + 1/n_2)}}$
$\hat{p}_1 = x_1/n_1$
$\hat{p}_2 = x_2/n_2$
$\hat{p} = (x_1 + x_2)/(n_1 + n_2)$
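The two proportion tests can be sketched as follows. All counts and proportions are hypothetical, and the function names are made up for this example.

```python
import math


def single_proportion_z(p_sample, p_population, n):
    """Z = (p - P) / sqrt(P*Q/n), with Q = 1 - P."""
    q_population = 1 - p_population
    return (p_sample - p_population) / math.sqrt(p_population * q_population / n)


def two_proportion_z(x1, n1, x2, n2):
    """Z for the difference of two proportions using the pooled proportion."""
    p1, p2 = x1 / n1, x2 / n2
    pooled = (x1 + x2) / (n1 + n2)
    standard_error = math.sqrt(pooled * (1 - pooled) * (1 / n1 + 1 / n2))
    return (p1 - p2) / standard_error


# Hypothetical examples
z_single = single_proportion_z(p_sample=0.56, p_population=0.50, n=100)
z_two = two_proportion_z(x1=60, n1=200, x2=45, n2=200)

print("Single proportion Z:", round(z_single, 4))
print("Difference of proportions Z:", round(z_two, 4))
```

For a two-tailed test at the 5% level, each statistic would be compared with the critical value 1.96.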
T - Test
The T-test is used to test hypotheses for small samples, i.e. n < 30.
Testing of hypothesis for single mean (small sample)
When the sample size is small and the population standard deviation is unknown,
we use the t-test for comparing a single mean.
$t = \frac{\bar{x} - \mu}{s / \sqrt{n}}$
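A small sketch of the single-mean t-test with hypothetical data is shown below. The critical value is taken from a standard t-table for 7 degrees of freedom at the 5% level (two-tailed); in code it could also be obtained from `scipy.stats.t.ppf(1 - alpha/2, df)`.

```python
import math
import statistics

# Hypothetical small sample (n < 30) and hypothesised population mean
sample = [12, 15, 11, 14, 13, 16, 10, 13]
mu = 12.0

n = len(sample)
x_bar = statistics.mean(sample)
s = statistics.stdev(sample)          # sample standard deviation (n - 1)

# t = (x-bar - mu) / (s / sqrt(n)), with n - 1 degrees of freedom
t_stat = (x_bar - mu) / (s / math.sqrt(n))
df = n - 1

t_critical = 2.365   # t-table value for df = 7, alpha = 0.05, two-tailed

print("t =", round(t_stat, 4), "df =", df)
if abs(t_stat) > t_critical:
    print("Reject the null hypothesis")
else:
    print("Fail to reject the null hypothesis")
```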
Testing of hypothesis for difference of means (small samples)
This test is used when comparing two independent samples from different populations
that are assumed to be normally distributed with equal but unknown variances, and the
sample sizes are small (n < 30).
$t = \frac{\bar{x}_1 - \bar{x}_2}{\sqrt{s_p^2 (1/n_1 + 1/n_2)}}$
$s_p^2$ = pooled variance = $\frac{(n_1 - 1) s_1^2 + (n_2 - 1) s_2^2}{n_1 + n_2 - 2}$
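The pooled-variance t statistic can be computed step by step as in the following sketch; the two samples are hypothetical.

```python
import math
import statistics

# Hypothetical independent small samples
sample1 = [20, 22, 19, 24, 25]
sample2 = [18, 17, 21, 16, 18]

n1, n2 = len(sample1), len(sample2)
mean1, mean2 = statistics.mean(sample1), statistics.mean(sample2)
var1, var2 = statistics.variance(sample1), statistics.variance(sample2)   # s1^2, s2^2

# Pooled variance: ((n1-1)*s1^2 + (n2-1)*s2^2) / (n1 + n2 - 2)
pooled_variance = ((n1 - 1) * var1 + (n2 - 1) * var2) / (n1 + n2 - 2)

# t = (x1-bar - x2-bar) / sqrt(sp^2 * (1/n1 + 1/n2))
t_stat = (mean1 - mean2) / math.sqrt(pooled_variance * (1 / n1 + 1 / n2))
df = n1 + n2 - 2

print("pooled variance:", pooled_variance)
print("t =", round(t_stat, 4), "df =", df)
```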
Testing of hypothesis for paired t-test
When we want to compare two related samples which are taken from the same
population before and after a treatment, we use the paired t-test.
$t = \frac{\bar{d}}{s_d / \sqrt{n}}$
$d = x_1 - x_2$
$s_d = \sqrt{\frac{\sum (d_i - \bar{d})^2}{n - 1}}$
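The paired t-test works on the differences between the two related measurements. A minimal sketch with hypothetical before/after scores:

```python
import math

# Hypothetical scores of the same 5 subjects before and after a treatment
before = [70, 68, 75, 80, 72]
after = [74, 70, 78, 85, 73]

# d = x1 - x2 for each pair
diff = [b - a for b, a in zip(before, after)]
n = len(diff)
mean_diff = sum(diff) / n

# s_d = sqrt( sum((d_i - d-bar)^2) / (n - 1) )
sd_diff = math.sqrt(sum((d - mean_diff) ** 2 for d in diff) / (n - 1))

# t = d-bar / (s_d / sqrt(n)), with n - 1 degrees of freedom
t_stat = mean_diff / (sd_diff / math.sqrt(n))
df = n - 1

print("mean difference:", mean_diff, "s_d:", round(sd_diff, 4))
print("t =", round(t_stat, 4), "df =", df)
```

A negative t here simply reflects that the "after" scores are higher than the "before" scores under the d = x1 - x2 convention.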
Steps to find Z-test and t-test using Excel
Use the Analysis ToolPak to run the t-test/z-test.
First, enter each test's data in columns with column headers.
To begin, highlight all of the information, including the column headers.
Then, select the Data tab from the top ribbon, followed
by Data Analysis.
[Excel screenshot: Data tab with the Data Analysis button]
Click the required test (Ex: t-Test: Paired Two Sample for Means) and then OK in
the window that displays.
[Excel screenshot: Data Analysis dialog]
Fill in the required fields, then click OK.
It will display the comprehensive report.
This will provide the mean of each data set, its variance, the number of
observations included, the correlation, and the P-value.
We need to look at the P-value, which is 0.02335799. This is lower
than the significance level of 0.05, so the null hypothesis is rejected.
PYTHON CODES:
[Python code screenshots for the Z-test, t-test and paired t-test; only fragments are legible in the source, e.g. (x_bar - mu) / (s / math.sqrt(n)), norm.ppf(...), t.ppf(1 - alpha/2, df), math.sqrt(sum((d - mean_diff) ...)) and the message "Fail to reject the null hypothesis"]