

Dis Vishnu

The document is a comprehensive guide on descriptive and inferential statistics, detailing key concepts such as data collection, organization, analysis, and visualization. It covers various statistical measures, including central tendency and dispersion, as well as data visualization techniques like bar graphs, pie charts, and histograms. Additionally, it discusses hypothesis testing and the importance of statistics in making informed decisions based on data.

Uploaded by

vishnuteja1317
HYDERABAD INSTITUTE OF TECHNOLOGY AND MANAGEMENT (HITAM)
(UGC Autonomous) Affiliated to JNTUH, Approved by AICTE, Accredited by NAAC (A+), NBA

DESCRIPTIVE AND INFERENTIAL STATISTICS

Submitted by AELE VISHNUTEJA (23E51A6702)
COMPUTER SCIENCE AND ENGINEERING IN DATA SCIENCE (2025-2026)

Internal Examiner        External Examiner

TABLE OF CONTENTS
1) Statistics
   a) Measures of data
2) Data Visualization
   Pie chart, Bar graph, Histogram, Heat map, Scatter plot, Box & Whisker graph
3) Descriptive Statistics
   Measures of central tendency: Mean, Median, Mode
   Measures of dispersion: a) Range b) Variance c) Standard Deviation d) Skewness e) Quartile & Percentile
4) Bi-Variate Analysis
   Cross-Tabulation; Covariance, Correlation; Least square method; Fitting 2nd degree polynomial
5) Hypothesis Testing
   a) Z-test b) T-test c) Python Codes

* INTRODUCTION TO STATISTICS
Statistics is a branch of mathematics that involves the collection, organisation, analysis, interpretation, presentation and visualization of data to derive meaningful insights (patterns, trends and unknown facts) and conclusions for making informed decisions.

Key components of Statistics:
1) Data collection.
2) Organizing the data.
3) Analyzing the data.
4) Presentation and visualization.
5) Interpretation.

Importance of Statistics: It provides the foundation for making data-driven decisions, building predictive models and understanding patterns. It also helps in:
1) Understanding data distributions.
2) Hypothesis testing.
3) Regression analysis.
4) Data reduction and dimensionality.
5) Handling large and complex data sets.
6) Forecasting and predictive analysis.

* Descriptive Statistics: It summarises and describes the main features and characteristics of a dataset. It includes:
* Measures of central tendency (mean, median, mode, weighted mean).
* Measures of dispersion (range, variance, standard deviation, quartiles, etc.)
Ex: The marks of 100 students are taken from a population of size 1000. Mean = 76, Median = 78, Standard deviation = [illegible].62 (sample statistics).

* Inferential Statistics: It makes estimations or draws conclusions about the population based on the sample data. It includes:
* Hypothesis testing (t-test, z-test)
* Confidence intervals
Ex: Concluding, by hypothesis testing on the sample data, that the average marks of the 1000 students is 76.

* Data: It is the collection of facts and figures which are collected, analysed and summarised.

* Types of Data:
1. Primary data: Data collected for the first time by the researcher or person himself is called primary data.
2. Secondary data: Data collected by another source and organized and summarized by a third party is called secondary data.

* Population: A population refers to the entire group of observations which we are interested in studying.
Ex: All tweets posted on Twitter form the population.

* Sample: A sample is a subset of the population. It is a group chosen from the population and used to make inferences about the entire population.
Ex: A random selection of 10,000 tweets used to train a sentiment analysis model.

* Variable: A characteristic that varies across the units is called a variable.
Ex: In a dataset of all the students who passed their 12th grade, each column represents a variable. (Cells marked ? are not legible in the source.)

Roll no. | Name | Age | Board | Passed year | Marks (%) | Contact No. | Parents Contact No. | Address | Email ID
301 | A | 18 | State | 2025 | 89 | XXX12 | XXX?9 | W | ABC
302 | B | 17 | CBSE | 2025 | 72 | XXX48 | XXX48 | X | DEF
303 | C | 17 | State | 2025 | 78 | XXX63 | XXX63 | Y | GHI
304 | D | 18 | State | 2024 | ? | XXX?9 | XXX?8 | Z | JKL

For the above data, if a person wants to know which group the data belongs to, the data is classified into 2 categories.

Measures of Data
* Qualitative (or) Categorical data. Ex: Gender, Colour.
* Quantitative (or) Numerical data. Ex: Age, Weight.

* Qualitative Data (Categorical data): It describes qualities or characteristics that cannot be measured numerically. It represents categories or groups.
Ex: Colours (Red, Green, Blue, Pink, Lavender), Gender (Male, Female), Movie rating (Excellent, Good, Average, Worst).

* Quantitative Data (Numerical data): It consists of numerical values that can be measured and quantified. This type of data can be subjected to mathematical operations and statistical analysis.
Ex: Age, Salary, University, Size, Temperature

Categorical Data: Nominal Data, Ordinal Data

* Measures of Categorical Data:
* Nominal data: It consists of categories that do not have a meaningful order. It consists of labels or names which are used to identify the characteristics of an observation.
Ex: Country names, subject names.
* Ordinal data: It has categories with a meaningful order, but the distances between the categories are not uniform.
Ex: Movie ratings, Education level (SSC, Intermediate, B.Tech, M.Tech), Food rating (Poor, Average, Good, Excellent).

Numerical Data: Discrete, Continuous, Interval Scale, Ratio Scale

* Discrete Data: Discrete data is a type of numerical data that consists of distinct, separate values. These values are countable and cannot be meaningfully broken down into smaller parts. Discrete data is often obtained through counting rather than measuring.
Ex: Number of students in a class (e.g., 25, 30, 32; there cannot be 25.5 students), Number of cars in a parking lot (e.g., 10, 15, 20; there cannot be 10.3 cars).

* Continuous Data: Continuous data is a type of numerical data that can take any value within a given range. It is measurable rather than countable and can include decimals and fractions. Continuous data is obtained through measurement and can be infinitely divided into smaller parts.
Ex: Height of a person (e.g., 5.6 feet, 170.2 cm), Weight of an object (e.g., 55.3 kg).

* Interval Scale: The interval scale measures variables where the intervals between values are meaningful and consistent. However, there is no true zero point, meaning zero does not indicate the complete absence of the variable.
Ex: 1. Temperature in Celsius: The difference between 10°C and 20°C is the same as between 30°C and 40°C. However, 0°C does not indicate no temperature.

* Ratio Scale: The ratio scale measures variables with meaningful and consistent intervals and a true zero (0) point, which signifies the absence of the measured attribute.
Ex: 1. A person who weighs 60 kg is twice as heavy as someone who weighs 30 kg. 2. A distance of 0 km means no distance; 10 km is twice as far as 5 km.

MODULE 1 ENDS

MODULE 2
DATA VISUALIZATION
CHARTS FOR CATEGORICAL DATA

BAR GRAPH
A bar graph is a chart that represents categorical data with rectangular bars. The length or height of each bar represents the frequency (or) proportion of the category.
* It is used for comparing frequencies (or) counts of different categories.
* It is ideal for categorical (or) discrete data.
* The x-axis represents the categories; the y-axis represents the frequency count (or) percentage.

Advantages of Bar graph
* It is easy to understand and interpret.
* It is useful for comparing categories side by side.
* It can be vertical or horizontal.

Steps to Create a Bar Graph in Excel:
1. Enter data in Excel.
2. Select the data.
3. Go to the Insert tab.
4. Click on Bar Chart (or Column Chart) and choose Clustered Column Chart (for vertical bars).
5. Customize labels: add X-axis and Y-axis labels (X-axis = Grades, Y-axis = Frequency).
6. Click on the chart title and rename it (e.g., "Grade Distribution").
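The heights of the bars in a bar graph are simply the counts of each category. As a minimal sketch, assuming a hypothetical list of grades (these are illustrative values, not the Internal 1 data of this report), the frequency table behind such a chart could be computed with the Python standard library:

```python
from collections import Counter

# Hypothetical grades, used only for illustration (not data from this report).
grades = ["A", "B", "A", "C", "O", "B", "A", "D", "B", "A"]

def frequency_table(values):
    """Return (category, count) pairs sorted by category name."""
    counts = Counter(values)
    return sorted(counts.items())

def text_bar_chart(table):
    """Render each category as a row of '#' marks, one mark per observation."""
    return [f"{category}: {'#' * count} ({count})" for category, count in table]

table = frequency_table(grades)
for row in text_bar_chart(table):
    print(row)
```

The same (category, count) pairs are what Excel or a plotting library would draw as bars, with categories on the x-axis and counts on the y-axis.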
Ex: The grades of 20 students are given below.
[Table: Internal 1 grades and their frequencies; values not legible in the source]
[Figure: Bar chart for Internal 1 marks, grades on the x-axis]

Steps to do in Python

    import pandas as pd
    from matplotlib import pyplot as plt
    import seaborn as sns

    # Load the data and preview the first rows
    data = pd.read_excel("Superstore_data.xlsx")
    data.head()

    # Count plot of the Category column with the count shown on each bar
    Category_plot = sns.countplot(x='Category', data=data)
    Category_plot.bar_label(Category_plot.containers[0])
    plt.title("Distribution of Category")
    plt.show()

[Figure: "Distribution of Category", bars for Furniture, Office Supplies and Technology]

PIE CHART
1. A pie chart is a circular chart divided into sectors, each representing the proportion of a category. Each sector's angle is proportional to the quantity it represents.
2. It is used to show the proportion or frequency of each category relative to the whole.
3. It works best when there are 3 to 5 categories.
4. It is eye-catching and simple, allowing quick understanding.

EX: The no. of students in the CSD branch (section wise)
[Table: SECTION vs NO. OF STUDENTS; the counts are 66, 70 and 62, section labels not legible in the source]
[Figure: Pie chart of NO. OF STUDENTS]

Steps to create a Pie Chart in Excel
1. Enter data.
2. Select data.
3. Insert a pie chart and choose the style, 2D or 3D.
4. Customize the chart: modify colours and styles, and add labels using the Chart Tools.

[Table: Internal 2 grades and frequencies; the frequencies are 18, 14, 9, 14 and 12 for the five grades O, A, B, C, D, but the row order is not legible in the source]
[Figure: Pie chart for Internal 2 grades, legend O, A, B, C, D]

HISTOGRAM
A histogram is a graphical representation of the distribution of a dataset. It is used for continuous or grouped numerical data. It shows how the data is spread across different intervals or bins. Each rectangular bar in a histogram represents the frequency of the data points within a certain range, known as a bin. The height of the bar indicates the no. of observations within that bin.

Key concepts of histogram
1. The shape of a histogram can provide insights into the distribution of the data.
* Normal distribution: A bell-shaped curve
* Skewed right: A long tail on the right side
* Skewed left: A long tail on the left side
2. Interpretation: Histograms are often used to understand the central tendency, spread and shape of the data.

EX: The age groups of 20 students are given below.
[Table: age groups and frequencies; values not legible in the source]

Steps to do in Excel
1. Enter data.
2. Select data.
3. Insert a histogram.
4. Customize the chart: use Chart Tools to change colours, styles and axis labels.

[Figure: Histogram of distribution of Internal 1]

Steps to do in Python

    import numpy as np
    import matplotlib.pyplot as plt

    # Normally distributed sample: bell-shaped histogram
    data_normal = np.random.normal(loc=0, scale=1, size=1000)
    plt.hist(data_normal, bins=30, edgecolor='black', color='skyblue')
    plt.title('Histogram of Data')
    plt.xlabel('Value')
    plt.ylabel('Frequency')
    plt.show()

[Figure: Histogram of Data]

    # Exponentially distributed sample: long tail on the right
    data_skewed_right = np.random.exponential(scale=1, size=1000)
    plt.hist(data_skewed_right, bins=30, edgecolor='black', color='orange')
    plt.title('Skewed Right Distribution (Long tail on the Right)')
    plt.xlabel('Value')
    plt.ylabel('Frequency')
    plt.show()

[Figure: Skewed Right Distribution (Long tail on the Right)]

LINE GRAPH
A line graph is a graphical representation of data points connected by straight lines. It is used to visualize the trend of a variable over time or across a continuous range. The line graph is particularly useful for showing changes over a period of time or when the data points are sequential.
* It is used to show trends over time (daily, monthly, yearly).
* It displays continuous data such as stock prices, temperature and sales.

Steps to do in Excel
1. Enter your data: open Excel and enter data in two columns. Column 1: labels (e.g., Months, Years, Categories). Column 2: values (e.g., Sales, Profits, Growth).
2. Select the data.
3. Insert a line graph and choose a line chart style (e.g., simple, stacked, or with markers).
4. Click on the chart and use Chart Tools to change colours and styles, add gridlines, and modify the chart title and axis labels.

[Figure: line graph example; labels not legible in the source]

SCATTER PLOT
A scatter plot is a graphical representation of two variables in which each point represents the relation between them. It helps in visualizing the correlation or association between two continuous variables.

Key components of a Scatter plot
1. X-axis and Y-axis: represent the two continuous variables.
2. Data points: each point shows a data observation based on its x and y values.
3. Grid lines: help to estimate the values of the data points.

EX: Experience and salary of 6 employees are given below.
[Table: experience and salary of 6 employees; not present in the extracted text]

Steps to do in Excel
1. Enter data: column 1 holds the X-axis values (e.g., Time, Age, etc.) and column 2 the Y-axis values (e.g., Sales, Growth, etc.).
2. Select data.
3. Insert a scatter plot: go to the "Insert" tab. In the Charts group, click "Scatter Chart" (dot graph icon). Choose a scatter plot type (e.g., simple, smooth, or with lines).
4. Customize the chart: use Chart Tools to edit colours, axis labels, gridlines and titles.

BOX PLOT
A box plot is a visualization for understanding the distribution of data, including the median, the quartiles and potential outliers. It is useful when you want to compare a numerical variable across different categories.

Key components of a Box plot
1. Box: represents the middle 50% of the data (from Q1 to Q3).
2. Median line: represents the median (Q2) inside the box.
3. Whiskers: show the spread of the data excluding outliers.
4. Outliers: individual points outside the whiskers, indicating extreme values.

Steps to do in Excel
1. Enter data: open Excel and enter your dataset in a single column or in multiple columns.
2. Select data.
3. Insert a box plot: go to the "Insert" tab. Click on "Insert Statistic Chart". Select the "Box and Whisker" chart.
4. Customize the chart: use the Chart Tools to modify colours, titles and axis labels.

Steps to do in Python

    import pandas as pd
    import seaborn as sns
    import matplotlib.pyplot as plt

    # mt is assumed to be a DataFrame holding the mtcars dataset, loaded beforehand
    # create the boxplot
    plt.figure(figsize=(8, 6))
    sns.boxplot(y='mpg', data=mt, color='yellow')
    plt.title('boxplot of mpg in mtcars dataset')
    plt.ylabel('mpg')
    plt.show()

HEAT MAP
1. A heat map is an advanced data visualization that uses colours to represent values in a dataset. Darker or more intense colours usually indicate higher values, while lighter colours show lower values.
2. It is used to show patterns, trends and relationships in a larger dataset.
3. It is used with geographical data to represent sales, population, temperature, etc.

Steps to do in Excel
1. Enter and select data.
2. Apply conditional formatting: go to the "Home" tab. Click "Conditional Formatting" > "Colour Scales". Choose a preset colour scale (e.g., red-to-green or blue-to-white).
3. Customize the colour scale.
4. Remove gridlines for a clean look.

[Figure: heat map of values by Indian state and union territory (including Andhra Pradesh, Uttar Pradesh, West Bengal, Odisha, Rajasthan, Gujarat, Chhattisgarh, Manipur, Tripura, Meghalaya, Arunachal Pradesh, Sikkim, Nagaland, Ladakh, Lakshadweep, Andaman and Nicobar Islands); the values are not legible in the source]

DESCRIPTIVE STATISTICS
Descriptive statistics is a branch of statistics focused on summarizing and organizing data in a meaningful way. It involves using various measures like mean, median, mode, range, standard deviation, and frequency distribution to describe key features of a dataset.
The goal is to provide a concise and informative representation of the data without making inferences about a larger population.

Descriptive Statistics
* Measure of central tendency: Mean, Median, Mode
* Measure of dispersion: Range, Variance, Standard Deviation, Skewness, Percentile & Quartile

Module-3
Measure of Central Tendency: It helps us to summarize the data set by identifying the central value. The 3 main measures are mean, median and mode.

Mean (Arithmetic Mean): The mean is the sum of all values in a data set divided by the total no. of values. It represents the average value.
Formula: Mean = Sum of all observations ÷ no. of observations, i.e.
$\text{Mean} = \frac{\sum x_i}{N}$
where $x_i$ represents each value in the data set and $N$ is the no. of observations.

Example: no. of vehicles passing through a toll gate in 7 days.
[Table: day vs no. of vehicles; only partly legible in the source (..., 1500, 1650, 2250, 3000, ...)]
Each observation is listed individually; therefore, it is ungrouped data.

Excel:

Date | Product | Region | Sales_Amount | Profit_Margin | Units_Sold | Customer_Rating | Month
######## | Smartwatch | North | 34299 | 9.86 | 467 | 2.9 | 2024-04
######## | Tablet | East | 17874 | 6.77 | 99 | 4.4 | 2024-12
######## | Tablet | South | 37711 | 14.92 | 427 | 4 | 2024-09
######## | Laptop | South | 10539 | 6.27 | 124 | 2.8 | 2024-04
######## | Tablet | North | 47405 | 27.17 | 114 | 2.7 | 2024-03
######## | Tablet | West | 7557 | 5.69 | 400 | 4.3 | 2024-07
######## | Laptop | South | 43360 | 19.47 | 205 | 2.7 | 2024-01

Measure | Excel formula | Result
mean | AVERAGE(select the column) | 27023.13
median | MEDIAN(select the column) | 26683
mode | MODE.SNGL(select the column) | 34299

Python Code:

    mean = df['age'].mean()
    print("Mean:", mean)
    median = df['age'].median()
    print("Median:", median)
    mode = df['age'].mode()[0]
    print("Mode:", mode)

UNIT-2
Measure of Dispersion: A measure of central tendency gives us an idea about the concentration of the distribution of the data set by using a single value. If we know the average value alone, we cannot form a complete idea about the distribution. Dispersion means scatteredness. Measures of dispersion study the homogeneity and heterogeneity of a distribution.
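To see why the average alone is not enough, consider two data sets with the same mean but very different spread. The following is a minimal sketch using only the Python standard library; the two lists of marks are hypothetical values chosen for illustration.

```python
import statistics

# Hypothetical marks for two classes, chosen so that both have the same mean.
class_a = [48, 49, 50, 51, 52]   # values clustered around the mean
class_b = [10, 30, 50, 70, 90]   # values spread far from the mean

def summarize(values):
    """Return the mean, range and sample standard deviation of a list."""
    return {
        "mean": statistics.mean(values),
        "range": max(values) - min(values),
        "stdev": statistics.stdev(values),  # sample formula, divides by n - 1
    }

summary_a = summarize(class_a)
summary_b = summarize(class_b)
print(summary_a)
print(summary_b)
```

Both classes have a mean of 50, yet the range and standard deviation of the second class are far larger, which is exactly the information a measure of dispersion adds.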
Homogeneous Distribution: The data set is considered homogeneous if its values are closely clustered around the mean or central value. It indicates low dispersion (low range, variance, standard deviation). It suggests consistency and uniformity of the distribution.

Heterogeneous Distribution: The data set is considered heterogeneous if the values of the distribution are spread out from the mean or central value. It indicates high dispersion (high range, variance, standard deviation). It suggests diversity and variability of the distribution.

Range: The range is the difference between the 2 extreme observations of the distribution, i.e.
$R = \text{maximum value} - \text{minimum value}$

Quartile:
Quartiles for ungrouped data.

Min value | Q1 (25%) | Q2 (50%) | Q3 (75%) | Max value
          | Lower quartile | Middle quartile | Upper quartile |

For ungrouped data, the 1st quartile is the $\frac{n+1}{4}$-th observation, the 3rd quartile is the $\frac{3(n+1)}{4}$-th observation, and the 2nd quartile is the median.

Quartiles for grouped data: For grouped data, the 1st and 3rd quartiles are calculated by using the following formulas.
First Quartile: $Q_1 = l_1 + \frac{N/4 - cf_1}{f_1} \times h_1$
where
$l_1$ = lower limit of the 1st quartile class,
$cf_1$ = cumulative frequency of the class preceding the 1st quartile class,
$f_1$ = frequency of the 1st quartile class,
$h_1$ = width (length) of the 1st quartile class,
$N$ = total frequency.
Third Quartile: $Q_3 = l_3 + \frac{3N/4 - cf_3}{f_3} \times h_3$
with $l_3$, $cf_3$, $f_3$ and $h_3$ defined in the same way for the 3rd quartile class.

Variance: It measures the average squared deviation of each point from the mean value. It explains how the data is spread out from the mean: a higher variance means the data points are spread out widely from the mean, while a lower variance means they are closer to the mean. The population variance is represented by $\sigma^2$. The sample variance is
$S^2 = \frac{\sum (x_i - \bar{x})^2}{n - 1}$

Standard Deviation: It is the square root of the variance. Unlike the variance, it gives the dispersion in the same units as the original data. It is denoted by $\sigma$.
$\sigma = \sqrt{\text{Variance}}$

Note:
1. If $\sigma^2 \ll \mu$: low variance.
2. If $\sigma^2 \approx \mu$: moderate variance.
3. If $\sigma^2 \gg \mu$: high variance.

Percentile: Percentiles divide the data into 100 equal parts, and there are 99 percentiles. A percentile is a measure used in statistics indicating the value below which a given percentage of observations fall. Percentiles are commonly used to know the relative position of a particular value in a given data set.
Ex: If a candidate scores 90 marks in an exam and this score is greater than or equal to the scores of 86 percent of the students taking that test, then the percentile rank is 86. The candidate is in the 86th percentile.

Percentile for ungrouped data:
1) Arrange the data in ascending order.
2) Calculate the location $i = \frac{P}{100} \times N$, where $P$ = percentile required, $N$ = number of observations and $i$ = percentile location.
3) If $i$ is a whole number, the percentile value is the average of the values at the $i$-th and $(i+1)$-th locations.
4) If $i$ is not a whole number, the percentile value is the value at the next whole location greater than $i$.

Percentile for grouped data is calculated by:
$P_i = l_i + \frac{\frac{i \times N}{100} - cf_i}{f_i} \times h_i$
The class interval with cumulative frequency just greater than $\frac{i \times N}{100}$ is the $i$-th percentile class, where
$l_i$ = lower limit of the $i$-th percentile class,
$cf_i$ = cumulative frequency of the class preceding the $i$-th percentile class,
$h_i$ = width of the $i$-th percentile class,
$f_i$ = frequency of the $i$-th percentile class.

Skewness: Skewness means lack of symmetry. It gives an idea about the shape of the curve of the given data. A distribution is said to be skewed if
i) the mean, median and mode fall at different points, i.e., mean ≠ median ≠ mode;
ii) the quartiles are not equidistant from the median, i.e., $Q_3 - Q_2 \neq Q_2 - Q_1$;
iii) the curve drawn with the help of the data is not symmetrical but stretched more to one side than to the other.
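The location rule for ungrouped percentiles and the quartile condition for skewness can be sketched in a few lines of Python. This is an illustrative implementation of the rule as stated in this section (library functions such as `numpy.percentile` use interpolation and can give slightly different values); the function names and both data sets are hypothetical.

```python
def percentile_ungrouped(values, p):
    """Percentile by the location rule i = (p / 100) * N, with 1-based locations.

    If i is a whole number, average the i-th and (i+1)-th ordered values;
    otherwise take the value at the next whole location above i.
    """
    if not 0 < p < 100:
        raise ValueError("p must lie strictly between 0 and 100")
    ordered = sorted(values)
    n = len(ordered)
    whole, remainder = divmod(p * n, 100)
    i = int(whole)
    if remainder == 0:
        return (ordered[i - 1] + ordered[i]) / 2
    return ordered[i]  # location i + 1 in 1-based terms

def quartiles_equidistant(values):
    """True when Q3 - Q2 equals Q2 - Q1, i.e. no skewness by the quartile condition."""
    q1, q2, q3 = (percentile_ungrouped(values, p) for p in (25, 50, 75))
    return (q3 - q2) == (q2 - q1)

# Hypothetical data sets, for illustration only.
symmetric_scores = [35, 40, 45, 50, 55, 60, 65, 70]
right_tailed = [1, 2, 2, 3, 3, 3, 4, 10]

print(percentile_ungrouped(symmetric_scores, 25))
print(percentile_ungrouped(symmetric_scores, 40))
print(quartiles_equidistant(symmetric_scores), quartiles_equidistant(right_tailed))
```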
Python code

    import pandas as pd
    import numpy as np
    from scipy import stats

    df = pd.read_excel("telco.xlsx")
    df.head()
    df.tail()

    # Central tendency of the age variable
    mean = df['age'].mean()
    print("Mean:", mean)          # Mean: 41.684
    median = df['age'].median()
    print("Median:", median)      # Median: 40.0
    mode = df['age'].mode()[0]
    print("Mode:", mode)          # Mode: 33

    # Central tendency of the income variable
    mean_income = df['income'].mean()
    median_income = df['income'].median()
    print("mean_income", mean_income)
    print("median_income", median_income)
    min(df['income'])
    max(df['income'])

    # find range of the age variable in telco data set
    age_range = max(df['age']) - min(df['age'])
    print("Range:", age_range)    # Range: 59

    # inter quartile range for age
    Q1 = np.percentile(df['age'], 25)
    Q3 = np.percentile(df['age'], 75)
    IQR = Q3 - Q1
    print("Inter Quartile Range", IQR)

    # finding percentile values for age variable in telco
    P22 = np.percentile(df['age'], 22)
    print("22nd Percentile value", P22)   # 22nd Percentile value 31.0
    P89 = np.percentile(df['age'], 89)
    print("89th Percentile value", P89)   # 89th Percentile value 58.110000000000014

    # sample variance and standard deviation (ddof=1 divides by n - 1)
    variance = np.var(df['age'], ddof=1)
    print("Variance:", variance)          # Variance: 157.72386786786805
    Std_deviation = np.std(df['age'], ddof=1)
    print("Standard_deviation:", Std_deviation)   # Standard_deviation: 12.55...

    # calculating skewness value for the given data
    import matplotlib.pyplot as plt
    import seaborn as sns

    data1 = [...]   # list of values not legible in the source
    s1 = pd.Series(data1)
    skewness_value = s1.skew()
    print("Skewness:", skewness_value)    # Skewness: 0.4133494554358116

    plt.figure(figsize=(8, 5))
    sns.histplot(s1, kde=True, color='blue', bins=8)
    plt.title(f"Data Distribution\nSkewness = {skewness_value:.2f}")
    plt.xlabel("Data Values")
    plt.ylabel("Frequency")
    plt.grid(True)
    plt.show()

Excel

Date | Product | Region | Sales_Amount | Profit_Margin | Units_Sold | Customer_Rating | Month
######## | Smartwatch | North | 34299 | 9.86 | 467 | 2.9 | 2024-04
######## | Tablet | East | 17874 | 6.77 | 99 | 4.4 | 2024-12
######## | Tablet | South | 37711 | 14.92 | 427 | 4 | 2024-09
######## | Laptop | South | 10539 | 6.27 | 124 | 2.8 | 2024-04
######## | Tablet | North | 47405 | 27.17 | 114 | 2.7 | 2024-03
######## | Tablet | West | 7557 | 5.69 | 400 | 4.3 | 2024-07
######## | Laptop | South | 43360 | 19.47 | 205 | 2.7 | 2024-01

Measure | FOR SALES | FOR PROFIT
max | 49080 | 29.55
min | 5190 | 5.38
range | 43890 | 24.17
q1 | 12744.5 | 9.86
q3 | 39603.75 | 22.9575
iqr | 26859.25 | 13.0975
variance | 192588919.2 | 56.48612
s.d | 13877.64084 | 7.515725

STEPS
Measure | Excel formula
max | MAX(select the column)
min | MIN(select the column)
range | max - min
q1 | QUARTILE.INC(select the column, 1st quartile)
q3 | QUARTILE.INC(select the column, 3rd quartile)
variance | VAR.P(select the column)
sd | STDEV.P(select the column)

Measure of dispersion (second worksheet)
Maximum | 25.0
Minimum | 0.0
Range | 25.0
Q1 | 12.75
Q3 | 23
IQR | 10.25
Skewness | -0.737794 (left-skewed distribution)
[The variance and percentile rows of this table are not legible in the source.]

Module-4
Bi-Variate Analysis

Cross-tabulation: It is a type of table that displays the multivariate frequency distribution, and it is commonly used to compare the results of one or more variables.
Note: 1) A cross-table summarizes the relation between 2 categories of data.
Ex: In a class of 70 students, 40 are male and 30 are female. Out of the 40 male students, 18 are A band students and 22 are B band students, whereas among the 30 female students, 16 are A band and 14 are B band. Create a cross-tabulation for the given data and a side-by-side chart.
Sol: Given 70 students in a class, of whom 40 are male and 30 are female.

Gender \ Branch | A | B
Male | 18 | 22
Female | 16 | 14

[Figure: side-by-side bar chart of A and B counts by gender]

Covariance:
$\operatorname{cov}(x, y) = \frac{\sum (x_i - \bar{x})(y_i - \bar{y})}{n}$

Correlation: The commonly used measures are Karl Pearson's correlation coefficient and Spearman's rank correlation coefficient. Karl Pearson's coefficient is
$r = \frac{\sum XY}{\sqrt{\sum X^2 \cdot \sum Y^2}}$
where $X = x - \bar{x}$ and $Y = y - \bar{y}$.

Least Square Method: It is a statistical approach used to find the best fitting line or curve through a set of data points by minimizing the sum of squares of the vertical differences (errors) between the observed values and the predicted values on the line.

Fitting a straight line by least square method:
Let $(x_1, y_1), (x_2, y_2), \dots, (x_n, y_n)$ be the set of given $n$ points and let
$y = a + bx$  (1)
be the required equation of the straight line. It is required to find the values of $a$ and $b$ from the normal equations
$\sum y = na + b \sum x$
$\sum xy = a \sum x + b \sum x^2$
By solving the above equations we get the values of $a$ and $b$; by substituting those values in eq. (1) we get the required straight line.

Fitting a second degree polynomial using least squares:
Let $(x_1, y_1), (x_2, y_2), \dots, (x_n, y_n)$ be the given set of $n$ points and let
$y = a + bx + cx^2$  (1)
be the required equation of the parabola. It is required to find the values of $a$, $b$, $c$ from the normal equations
$\sum y = na + b \sum x + c \sum x^2$  (2)
$\sum xy = a \sum x + b \sum x^2 + c \sum x^3$  (3)
$\sum x^2 y = a \sum x^2 + b \sum x^3 + c \sum x^4$  (4)
By solving the above equations we get the values of $a$, $b$ and $c$; by substituting those values in eq. (1) we get the required equation of the parabola.

Using Excel:
[Table: Excel worksheet of numeric columns ending in "gear" (apparently the mtcars dataset used earlier); cell values are not reliably legible in the source]

Fitting a straight line
x | y
2 | 4
4 | 5
8 | 2
10 | 15
12 | 16

Second degree polynomial
[Figure and Python code screenshot for the second degree polynomial fit; the text is not legible in the source]
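As a sketch of how the normal equations of the least square method can be solved, the following pure-Python example fits both a straight line $y = a + bx$ and a parabola $y = a + bx + cx^2$. The function names and the data are hypothetical: the points are generated from known coefficients so the fit can be checked, and the 3×3 system is solved with Cramer's rule, which is one convenient choice rather than the method used in the report.

```python
def fit_line(xs, ys):
    """Solve sum(y) = n*a + b*sum(x) and sum(xy) = a*sum(x) + b*sum(x^2)."""
    n = len(xs)
    sx, sy = sum(xs), sum(ys)
    sxx = sum(x * x for x in xs)
    sxy = sum(x * y for x, y in zip(xs, ys))
    b = (n * sxy - sx * sy) / (n * sxx - sx * sx)
    a = (sy - b * sx) / n
    return a, b

def det3(m):
    """Determinant of a 3x3 matrix given as a list of rows."""
    return (m[0][0] * (m[1][1] * m[2][2] - m[1][2] * m[2][1])
            - m[0][1] * (m[1][0] * m[2][2] - m[1][2] * m[2][0])
            + m[0][2] * (m[1][0] * m[2][1] - m[1][1] * m[2][0]))

def fit_parabola(xs, ys):
    """Solve the three normal equations for y = a + b*x + c*x^2 (Cramer's rule)."""
    n = len(xs)
    s1 = sum(xs)
    s2 = sum(x ** 2 for x in xs)
    s3 = sum(x ** 3 for x in xs)
    s4 = sum(x ** 4 for x in xs)
    t0 = sum(ys)
    t1 = sum(x * y for x, y in zip(xs, ys))
    t2 = sum(x * x * y for x, y in zip(xs, ys))
    matrix = [[n, s1, s2], [s1, s2, s3], [s2, s3, s4]]
    rhs = [t0, t1, t2]
    d = det3(matrix)
    coefficients = []
    for col in range(3):
        replaced = [row[:] for row in matrix]
        for r in range(3):
            replaced[r][col] = rhs[r]   # swap one column for the right-hand side
        coefficients.append(det3(replaced) / d)
    return tuple(coefficients)

# Hypothetical points lying exactly on y = 1 + 2x and on y = 2 + x^2.
line_a, line_b = fit_line([1, 2, 3, 4, 5], [3, 5, 7, 9, 11])
par_a, par_b, par_c = fit_parabola([0, 1, 2, 3, 4], [2, 3, 6, 11, 18])
print(line_a, line_b)
print(par_a, par_b, par_c)
```

Because the sample points lie exactly on the chosen curves, the fitted coefficients reproduce them; with real, noisy data the same functions return the coefficients that minimize the sum of squared vertical errors.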
[Figure: Crop Yield (tons/acre) versus Fertilizer (kg/acre), showing the actual data, the 2nd degree fit and the predictions]

Module-5

Testing of hypothesis for single mean:
$Z = \frac{\bar{x} - \mu}{\sigma / \sqrt{n}}$

Test of significance for difference of means of 2 large samples:
If two samples of sizes $n_1$ and $n_2$ are taken from two different populations having standard deviations $\sigma_1$ and $\sigma_2$, then the test statistic is given by:
$Z = \frac{\bar{x}_1 - \bar{x}_2}{\sqrt{\frac{\sigma_1^2}{n_1} + \frac{\sigma_2^2}{n_2}}}$
If two samples are taken from the same population having standard deviation $\sigma$, then the test statistic is given by:
$Z = \frac{\bar{x}_1 - \bar{x}_2}{\sigma \sqrt{\frac{1}{n_1} + \frac{1}{n_2}}}$
If two samples are taken from populations whose standard deviations are unknown but the standard deviations of the samples, $S_1$ and $S_2$, are given, then the test statistic is given by:
$Z = \frac{\bar{x}_1 - \bar{x}_2}{\sqrt{\frac{S_1^2}{n_1} + \frac{S_2^2}{n_2}}}$

Test of significance for single proportion (large samples):
Consider a large random sample of size $n$ with sample proportion $p$. To test the hypothesis that the population proportion is $P$, the test statistic is given by:
$Z = \frac{p - P}{\sqrt{PQ / n}}$, where $Q = 1 - P$.

Testing of hypothesis for difference of proportions:
When we want to compare 2 population proportions based on the sample data, we use hypothesis testing for the difference of proportions.
Null Hypothesis: $H_0: P_1 = P_2$, i.e., there is no significant difference.
Alternative Hypothesis: $H_1: P_1 \neq P_2$ [two-tailed test], $H_1: P_1 > P_2$ [right-tailed] (or) $H_1: P_1 < P_2$ [left-tailed].

Z - Test
Testing of hypothesis for single mean (large sample)
When the sample size is large (n > 30) and the population standard deviation is known, we use the Z-test for comparing a single mean.
$Z = \frac{\bar{x} - \mu}{\sigma / \sqrt{n}}$
where
$\bar{x}$ = mean of the sample,
$\mu$ = mean of the population,
$\sigma$ = standard deviation of the population,
$n$ = no. of observations.
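The single-mean Z-test above can be sketched with the Python standard library, where `statistics.NormalDist` supplies the standard normal critical value. The function name and all numbers below are hypothetical and serve only to illustrate the calculation for a two-tailed test.

```python
import math
from statistics import NormalDist

def z_test_single_mean(sample_mean, pop_mean, pop_sd, n, alpha=0.05):
    """Two-tailed Z-test for a single mean with known population sd."""
    z = (sample_mean - pop_mean) / (pop_sd / math.sqrt(n))
    z_critical = NormalDist().inv_cdf(1 - alpha / 2)
    p_value = 2 * (1 - NormalDist().cdf(abs(z)))
    reject_null = abs(z) > z_critical
    return z, z_critical, p_value, reject_null

# Hypothetical example: sample of 36 observations with mean 52,
# tested against a population mean of 50 with known sd 6.
z, z_critical, p_value, reject_null = z_test_single_mean(52, 50, 6, 36)
print(f"Z = {z:.3f}, critical value = {z_critical:.3f}, p-value = {p_value:.4f}")
print("Reject the null hypothesis" if reject_null else "Fail to reject the null hypothesis")
```

The null hypothesis is rejected when |Z| exceeds the critical value, which is equivalent to the p-value being smaller than the significance level alpha.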
Testing of hypothesis for difference of mean (large samples)
* If two samples of sizes $n_1$ and $n_2$ are taken from two different populations having standard deviations $\sigma_1$ and $\sigma_2$, then the test statistic is given by
$Z = \frac{\bar{x}_1 - \bar{x}_2}{\sqrt{\frac{\sigma_1^2}{n_1} + \frac{\sigma_2^2}{n_2}}}$
* If two samples are taken from the same population having standard deviation $\sigma$, then the test statistic is given by
$Z = \frac{\bar{x}_1 - \bar{x}_2}{\sigma \sqrt{\frac{1}{n_1} + \frac{1}{n_2}}}$

Testing of hypothesis for single proportion
The test statistic is given by
$Z = \frac{p - P}{\sqrt{PQ / n}}$
where $p$ = sample proportion, $P$ = population proportion and $Q = 1 - P$.

Testing of hypothesis for difference of proportions
When we want to compare two population proportions based on the sample data, we use the hypothesis test for difference of proportions.
$Z = \frac{\hat{p}_1 - \hat{p}_2}{\sqrt{\hat{p}(1 - \hat{p})\left(\frac{1}{n_1} + \frac{1}{n_2}\right)}}$
where $\hat{p}_1 = x_1 / n_1$, $\hat{p}_2 = x_2 / n_2$ and $\hat{p} = \frac{x_1 + x_2}{n_1 + n_2}$.

T - Test
The T-test is used to test the hypothesis for small samples, i.e. n < 30.

Testing of hypothesis for single mean (small sample)
When the sample size is small and the population standard deviation is unknown, we use the t-test for comparing a single mean.
$t = \frac{\bar{x} - \mu}{s / \sqrt{n}}$

Testing of hypothesis for difference of mean (small samples)
When comparing two independent samples from different populations assumed to be normally distributed with equal but unknown variance, and the sample sizes are small (n < 30), the test statistic is
$t = \frac{\bar{x}_1 - \bar{x}_2}{\sqrt{s_p^2 \left(\frac{1}{n_1} + \frac{1}{n_2}\right)}}$
where the pooled variance is
$s_p^2 = \frac{(n_1 - 1) s_1^2 + (n_2 - 1) s_2^2}{n_1 + n_2 - 2}$

Testing of hypothesis for paired t-test
When we want to compare two related samples which are taken from the same population before and after a treatment, we use the paired t-test.
$t = \frac{\bar{d}}{s_d / \sqrt{n}}$
where $d = x_1 - x_2$ and $s_d = \sqrt{\frac{\sum (d_i - \bar{d})^2}{n - 1}}$.

Steps to find Z-test and t-test using Excel
Use the Analysis ToolPak to run the t-test/z-test. First, we'll enter each test's data in the way shown below. To begin, highlight all of the information including the column headers. Then, select the Data tab from the top ribbon, followed by Data Analysis.
[Screenshot: Excel Data tab with the Data Analysis button]

Click the required test (Ex: "t-Test: Paired Two Sample for Means") and then OK in the window that displays.

[Screenshot: Data Analysis dialog]

Fill in the following fields, then click OK. It will display the comprehensive report. This will provide the mean of each data set, its variance, the number of observations included, the correlation, and the P-value. We need to look at the P-value, which is 0.02335799. This is lower than the significance level of 0.05, so the null hypothesis is rejected.

PYTHON CODES:
[Python code screenshots for the Z-test, the T-test and the paired T-test; the text is not legible in the source. Recognisable fragments include alpha, a critical value computed with ppf(1 - alpha/2, df), the statistic (x_bar - mu) / (s / math.sqrt(n)), the standard deviation of the differences computed with math.sqrt(sum((d - mean_diff) ...)), and the message "Fail to reject the null hypothesis".]
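Since the original code screenshots are not legible, the following is only a hedged sketch of the paired t-test described above, written with the Python standard library. The function name and the before/after values are hypothetical, and because the standard library has no t-distribution, the critical value is taken from a t-table (2.776 for 4 degrees of freedom at the 5% level, two-tailed).

```python
import math

def paired_t_statistic(before, after):
    """Return (t, degrees_of_freedom) for the paired t-test on two related samples."""
    diffs = [b - a for b, a in zip(before, after)]
    n = len(diffs)
    mean_diff = sum(diffs) / n
    # Standard deviation of the differences, dividing by n - 1
    sd_diff = math.sqrt(sum((d - mean_diff) ** 2 for d in diffs) / (n - 1))
    t = mean_diff / (sd_diff / math.sqrt(n))
    return t, n - 1

# Hypothetical measurements before and after a treatment.
before = [72, 75, 70, 78, 74]
after = [70, 72, 69, 74, 73]

t_stat, dof = paired_t_statistic(before, after)
t_critical = 2.776  # t-table value for 4 degrees of freedom, alpha = 0.05, two-tailed

print(f"t = {t_stat:.3f}, degrees of freedom = {dof}, critical value = {t_critical}")
if abs(t_stat) > t_critical:
    print("Reject the null hypothesis")
else:
    print("Fail to reject the null hypothesis")
```

With SciPy available, `scipy.stats.t.ppf(1 - alpha / 2, df)` could replace the table lookup for the critical value.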
