P.T.Lee Chengalvaraya Naicker
College Of Engineering & Technology
P.T.Lee Chengalvaraya Naicker Nagar, Oovery, Kancheepuram – 631 502.
Department of Computer Science And Engineering

UNIT II : Describing Data

Syllabus

Types of Data - Types of Variables - Describing Data with Tables and Graphs - Describing Data
with Averages - Describing Variability - Normal Distributions and Standard (z) Scores.

2.1 Types of Data

• Data is a collection of facts and figures that relay something specific but are not
organized in any way. It can be numbers, words, measurements, observations or even just
descriptions of things. In other words, data is the raw material in the production of information.

• A data set is a collection of related records or information. The information may be on some entity
or some subject area.

• A data set is a collection of data objects and their attributes. Attributes capture the basic
characteristics of an object.

• Each row of a data set is called a record. Each data set also has multiple attributes, each of
which gives information on a specific characteristic.

Qualitative and Quantitative Data

• Data can broadly be divided into two types: qualitative data and quantitative data.

Qualitative data:

• Qualitative data provides information about the quality of an object or information which
cannot be measured. Qualitative data cannot be expressed as a number. Data that represent
nominal scales such as gender, economic status, religious preference are usually considered to be
qualitative data.

• Qualitative data is data concerned with descriptions, which can be observed but cannot be
computed. Qualitative data is also called categorical data. Qualitative data can be further
subdivided into two types as follows:

1. Nominal data

2. Ordinal data

Quantitative data:

• Quantitative data focuses on numbers and mathematical calculations and can be
calculated and computed.

• Quantitative data are anything that can be expressed as a number or quantified. Examples of
quantitative data are scores on achievement tests, number of hours of study or weight of a
subject. These data may be represented by ordinal, interval or ratio scales and lend themselves to
most statistical manipulation.

• There are two types of quantitative data: Interval data and ratio data.

Difference between Qualitative and Quantitative Data



Advantages and Disadvantages of Qualitative Data

1. Advantages:

• It supports in-depth analysis

• Qualitative data helps market researchers to understand the mindset of their customers.

• It avoids pre-judgments

2. Disadvantages:

• Time consuming

• Not easy to generalize

• Difficult to make systematic comparisons

Advantages and Disadvantages of Quantitative Data

1. Advantages:

• Easier to summarize and make comparisons.

• It is often easier to obtain large sample sizes

• It is less time consuming since it is based on statistical analysis.

2. Disadvantages:

• The cost is relatively high.

• The data the researcher receives may not generalize accurately

Ranked Data

• Ranked data is a variable whose values are taken from an ordered set and recorded in
order of magnitude. Ranked data is also called ordinal data.

• Ordinal represents the "order." Ordinal data is known as qualitative data or categorical data. It
can be grouped, named and also ranked.

• Characteristics of the Ranked data:

a) The ordinal data shows the relative ranking of the variables

b) It identifies and describes the magnitude of a variable

c) Along with the information provided by the nominal scale, ordinal scales give the rankings of
those variables

d) The interval properties are not known

e) The surveyors can quickly analyze the degree of agreement concerning the identified order of
variables

• Examples:

a) University ranking : 1st, 9th, 87th...

b) Socioeconomic status: poor, middle class, rich.

c) Level of agreement: yes, maybe, no.

d) Time of day: dawn, morning, noon, afternoon, evening, night

Scale of Measurement

• Scales of measurement are also called levels of measurement. Each level of measurement has
specific properties that determine which statistical analyses can be used.

• There are four different scales of measurement. The data can be defined as being one of the
four scales. The four types of scales are: Nominal, ordinal, interval and ratio.

Nominal

• Nominal data is the first level of measurement scale, in which the numbers serve as "tags" or
"labels" to classify or identify the objects.

• Nominal data usually deals with non-numeric variables or with numbers that do not have
any value. While developing statistical models, nominal data are usually transformed before
building the model.

• Nominal variables are also known as categorical variables.

Characteristics of nominal data:

1. A nominal variable is classified into two or more categories. In this measurement
mechanism, each answer should fall into exactly one of the classes.

2. It is qualitative. The numbers are used here to identify the objects.

3. The numbers don't define the object characteristics. The only permissible aspect of numbers in
the nominal scale is "counting".

• Example:

1. Gender: Male, female, other.

2. Hair Color: Brown, black, blonde, red, other.

Interval

• Interval data corresponds to a variable in which the value is chosen from an interval set.

• It is defined as a quantitative measurement scale in which the difference between two
values is meaningful. In other words, the variables are measured in an exact manner, not in
a relative way, and the zero point is arbitrary.

• Characteristics of interval data:

a) The interval data is quantitative as it can quantify the difference between the values.

b) It allows calculating the mean and median of the variables.

c) To understand the difference between the variables, you can subtract the values between the
variables.

d) The interval scale is the preferred scale in statistics as it helps to assign numerical values
to arbitrary assessments such as feelings, calendar types, etc.

• Examples:

1. Celsius temperature

2. Fahrenheit temperature

3. Time on a clock with hands.



Ratio

• Any variable for which the ratios can be computed and are meaningful is called ratio data.

• It is a type of variable measurement scale. It allows researchers to compare the differences or
intervals. The ratio scale has a unique feature: it possesses a true origin, or zero point.

• Characteristics of ratio data:

a) Ratio scale has a feature of absolute zero.

b) It doesn't have negative numbers, because of its zero-point feature.

c) It affords unique opportunities for statistical analysis. The variables can be orderly added,
subtracted, multiplied, divided. Mean, median and mode can be calculated using the ratio scale.

d) Ratio data has unique and useful properties. One such feature is that it allows unit conversions
like kilogram - calories, gram - calories, etc.

• Examples: Age, weight, height, ruler measurements, number of children.
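The practical difference between the interval and ratio scales can be seen with temperature. The following is a minimal sketch, not part of the original notes: it assumes two illustrative readings (10 and 20 degrees Celsius) and uses the Kelvin scale, which has an absolute zero, as the ratio-scale counterpart. The function name `celsius_to_kelvin` is introduced here for illustration only.

```python
# Interval scale: differences are meaningful, ratios are not,
# because 0 degrees Celsius is an arbitrary zero point.

def celsius_to_kelvin(c):
    """Convert a Celsius reading (interval scale) to Kelvin (ratio scale)."""
    return c + 273.15

# Two illustrative readings (assumed values, not from the notes).
t1_c, t2_c = 10.0, 20.0

difference_c = t2_c - t1_c   # meaningful: the second reading is 10 degrees warmer
naive_ratio = t2_c / t1_c    # 2.0, but this does NOT mean "twice as hot"

# On a scale with an absolute zero, the ratio becomes meaningful.
true_ratio = celsius_to_kelvin(t2_c) / celsius_to_kelvin(t1_c)

print("difference in Celsius:", difference_c)
print("naive Celsius ratio:", naive_ratio)
print("ratio on the Kelvin scale:", round(true_ratio, 3))
```

The subtraction gives the same answer on either scale, while the division only carries meaning once the zero point is absolute.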

Example 2.1.1: Indicate whether each of the following terms is qualitative, ranked or
quantitative:

(a) ethnic group

(b) academic major

(c) age

(d) family size

(e) net worth (in Rupees)

(f) temperature

(g) sexual preference

(h) second-place finish

(i) IQ score

(j) gender

Solution :

(a) ethnic group → Qualitative

(b) academic major → Qualitative

(c) age → Quantitative

(d) family size → Quantitative

(e) net worth (in Rupees) → Quantitative

(f) temperature → Quantitative

(g) sexual preference → Qualitative

(h) second-place finish → Ranked

(i) IQ score → Quantitative

(j) gender → Qualitative

2.2 Types of Variables

• A variable is a characteristic or property that can take on different values.

Discrete and Continuous Variables

Discrete variables:

• Quantitative variables can be further distinguished in terms of whether they are discrete or
continuous.

• The word discrete means countable. For example, the number of students in a class is countable
or discrete. The value could be 2, 24, 34 or 135 students, but it cannot be 23/32 or 12.23
students.

• The number of pages in a book is a discrete variable. Discrete data can only take on certain
individual values.

Continuous variables:

• A continuous variable is a variable which can take all values within a given interval or range.
A continuous variable consists of numbers whose values, at least in theory, have no restrictions.

• Examples of continuous variables are blood pressure, weight, height and income.

• Continuous data can take on any value in a certain range. Length of a file is a continuous
variable.

Difference between Discrete variables and Continuous variables

Approximate Numbers

• Approximate number is defined as a number approximated to the exact number and there is
always a difference between the exact and approximate numbers.

• For example, 2, 4, 9 are exact numbers as they do not need any approximation.

• But √2, π, √3 are approximate numbers as they cannot be expressed exactly by a finite number of digits.
They can be written as 1.414, 3.1416, 1.7320 etc., which are only approximations to the true
values.

• Whenever values are rounded off, as is always the case with actual values for continuous
variables, the resulting numbers are approximate, never exact.

• An approximate number is one that does have uncertainty. A number can be approximate for
one of two reasons:

a) The number can be the result of a measurement.

b) Certain numbers simply cannot be written exactly in decimal form. Many fractions and all
irrational numbers fall into this category
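The idea that some numbers cannot be written exactly in decimal form can be illustrated in code. The sketch below is not from the original notes; it assumes ordinary floating-point arithmetic and uses Python's `fractions.Fraction` only to show what an exact representation looks like by contrast.

```python
import math
from fractions import Fraction

# sqrt(2) is irrational, so its floating-point value is only an approximation.
root2 = math.sqrt(2)
squared_back = root2 * root2

print(squared_back == 2)              # False: the stored value is not exactly sqrt(2)
print(math.isclose(squared_back, 2))  # True: but it is very close

# Rounding to three decimals gives the familiar 1.414, with a small error.
rounded = round(root2, 3)
error = abs(root2 - rounded)
print(rounded, error)

# A fraction such as 1/3 has no finite decimal form, but it can be held exactly
# as a ratio of integers.
third = Fraction(1, 3)
print(third * 3 == 1)                 # True: exact arithmetic, no rounding
```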

Independent and Dependent Variables

• The two main variables in an experiment are the independent and dependent variable. An
experiment is a study in which the investigator decides who receives the special treatment.

1. Independent variables

• An independent variable is the variable that is changed or controlled in a scientific experiment
to test the effects on the dependent variable.

• An independent variable is a variable that represents a quantity that is being manipulated in an
experiment.

• The independent variable is the one that the researcher intentionally changes or controls.

• In an experiment, an independent variable is the treatment manipulated by the investigator.
Mostly in mathematical equations, independent variables are denoted by 'x'.

• Independent variables are also termed as "explanatory variables," "manipulated variables," or
"controlled variables." In a graph, the independent variable is usually plotted on the X-axis.

2. Dependent variables

• A dependent variable is the variable being tested and measured in a scientific experiment.

• The dependent variable is 'dependent' on the independent variable. As the experimenter
changes the independent variable, the effect on the dependent variable is observed and recorded.

• The dependent variable is the factor that the research measures. It changes in response to the
independent variable or depends upon it.

• A dependent variable represents a quantity whose value depends on how the independent
variable is manipulated.

• Mostly in mathematical equations, dependent variables are denoted by 'y'.

• Dependent variables are also termed as "measured variable," the "responding variable," or the
"explained variable". In a graph, dependent variables are usually plotted on the Y-axis.

• When a variable is believed to have been influenced by the independent variable, it is called a
dependent variable. In an experimental setting, the dependent variable is measured, counted or
recorded by the investigator.

• Example: Suppose we want to know whether or not eating breakfast affects student test
scores. The factor under the experimenter's control is the presence or absence of breakfast, so we
know it is the independent variable. The experiment measures test scores of students who ate
breakfast versus those who did not. Theoretically, the test results depend on breakfast, so the test
results are the dependent variable. Note that test scores are the dependent variable, even if it
turns out there is no relationship between scores and breakfast.

Observational Study

• An observational study focuses on detecting relationships between variables not manipulated
by the investigator. An observational study is used to answer a research question based purely on
what the researcher observes. There is no interference or manipulation of the research subjects
and no control and treatment groups.

• These studies are often qualitative in nature and can be used for both exploratory and
explanatory research purposes. While quantitative observational studies exist, they are less
common.

• Observational studies are generally used in hard science, medical and social science fields. This
is often due to ethical or practical concerns that prevent the researcher from conducting a
traditional experiment. However, the lack of control and treatment groups means that forming
inferences is difficult and there is a risk of confounding variables impacting the analysis.

Confounding Variable

• Confounding variables are those that affect other variables in a way that produces spurious or
distorted associations between two variables. They confound the "true" relationship between two
variables. Confounding refers to differences in outcomes that occur because of differences in the
baseline risks of the comparison groups.

• For example, if we have an association between two variables (X and Y) and that association is
due entirely to the fact that both X and Y are affected by a third variable (Z), then we would say
that the association between X and Y is spurious and that it is a result of the effect of a
confounding variable (Z).

• A difference between groups might be due not to the independent variable but to a confounding
variable.

• For a variable to be confounding:

a) It must be connected with the independent variable of interest and

b) It must be connected to the outcome or dependent variable directly.

• Consider, for example, a study investigating whether alcohol drinkers have more heart disease
than non-drinkers. The result can be influenced by another factor. For instance, alcohol drinkers
might smoke more cigarettes than non-drinkers, so cigarette smoking acts as a confounding
variable when studying the association between drinking alcohol and heart disease.

• For example, suppose a researcher collects data on ice cream sales and shark attacks and finds
that the two variables are highly correlated. Does this mean that increased ice cream sales cause
more shark attacks? That's unlikely. The more likely cause is the confounding variable
temperature. When it is warmer outside, more people buy ice cream and more people go in the
ocean.
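The ice cream and shark attack example can be imitated with simulated data. This is a minimal sketch under stated assumptions, not real data: the coefficients, noise levels, sample size and temperature range are all invented for illustration, and `pearson` is a small helper written here rather than a library call. Both outcomes are generated from temperature only, so any correlation between them is spurious by construction.

```python
import random

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sd_x = sum((x - mean_x) ** 2 for x in xs) ** 0.5
    sd_y = sum((y - mean_y) ** 2 for y in ys) ** 0.5
    return cov / (sd_x * sd_y)

rng = random.Random(0)  # fixed seed so the sketch is repeatable

# Z: the confounding variable (daily temperature, assumed range 10 to 35).
temperature = [rng.uniform(10, 35) for _ in range(500)]

# X and Y each depend on Z plus independent noise; X never influences Y.
ice_cream_sales = [5.0 * t + rng.gauss(0, 10) for t in temperature]
shark_attacks = [0.2 * t + rng.gauss(0, 1) for t in temperature]

r_all = pearson(ice_cream_sales, shark_attacks)

# Holding Z roughly constant (days between 20 and 22 degrees) weakens the link.
band = [i for i, t in enumerate(temperature) if 20 <= t < 22]
r_band = pearson([ice_cream_sales[i] for i in band],
                 [shark_attacks[i] for i in band])

print("correlation over all days:", round(r_all, 2))
print("correlation at similar temperatures:", round(r_band, 2))
```

Comparing the two printed values shows how an association can shrink once the third variable is held approximately fixed.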

2.3 Describing Data with Tables

2.3.1 Frequency Distributions for Quantitative Data

• Frequency distribution is a representation, either in a graphical or tabular format, that displays
the number of observations within a given interval. The interval size depends on the data being
analyzed and the goals of the analyst.

• In order to find the frequency distribution of quantitative data, we can use the following table
that gives information about "the number of smartphones owned per family."

• For such quantitative data, it is quite straightforward to make a frequency distribution table.
Families own either 1, 2, 3, 4 or 5 smartphones. All we need to do is find the frequency of 1, 2,
3, 4 and 5 and arrange this information in table format. The result is called a frequency table for
quantitative data.

• When observations are sorted into classes of single values, the result is referred to as a
frequency distribution for ungrouped data. It is the representation of ungrouped data and is
typically used when we have a smaller data set.
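A frequency distribution for ungrouped data can be tallied in a few lines. The sketch below assumes a small, hypothetical list of smartphone counts per family (the values are invented for illustration and are not the table from the notes) and uses `collections.Counter` to count how often each single value occurs.

```python
from collections import Counter

# Hypothetical survey answers: number of smartphones owned per family.
phones_per_family = [1, 2, 2, 3, 1, 4, 2, 5, 3, 2, 1, 3]

# Each distinct value becomes its own class (ungrouped data).
frequency = Counter(phones_per_family)

print("Phones  Frequency")
for value in sorted(frequency):
    print(f"{value:>6}  {frequency[value]:>9}")

print("Total observations:", sum(frequency.values()))
```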

• A frequency distribution is a means to organize a large amount of data. It takes data from a
population based on certain characteristics and organizes the data in a way that is
comprehensible to an individual that wants to make assumptions about a given population.

• Types of frequency distribution are grouped frequency distribution, ungrouped frequency
distribution, cumulative frequency distribution, relative frequency distribution and relative
cumulative frequency distribution.

1. Grouped data:

• Grouped data refers to the data which is bundled together in different classes or categories.

• Data are grouped when the variable stretches over a wide range and there are a large number of
observations and it is not possible to arrange the data in any order, as it consumes a lot of time.
Hence, it is pertinent to convert frequency into a class group called a class interval.

• Suppose we conduct a survey in which we ask 15 families how many pets they have in their
home. The results are as follows:

1, 1, 1, 1, 2, 2, 2, 3, 3, 4, 5, 5, 6, 7, 8

• Often we use grouped frequency distributions, in which we create groups of values and then
summarize how many observations from a dataset fall into those groups. Here's an example of a
grouped frequency distribution for our survey data :
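One way to compute such a grouping for the pet survey is sketched below. The class width of 2 (classes 1-2, 3-4, 5-6, 7-8) is an assumption made for illustration; the notes do not state which classes the original table used.

```python
# Survey answers from the text: number of pets in each of 15 families.
pets = [1, 1, 1, 1, 2, 2, 2, 3, 3, 4, 5, 5, 6, 7, 8]

# Assumed classes of equal width 2; both limits are inclusive.
classes = [(1, 2), (3, 4), (5, 6), (7, 8)]

grouped = {}
for lower, upper in classes:
    # Count the observations falling inside this class.
    grouped[f"{lower}-{upper}"] = sum(lower <= x <= upper for x in pets)

print("Class  Frequency")
for label, count in grouped.items():
    print(f"{label:>5}  {count:>9}")
```

Because the classes do not overlap and cover the whole range, the frequencies add up to the 15 families surveyed.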

Guidelines for Constructing a Frequency Distribution

1. All classes should be of the same width.

2. Classes should be set up so that they do not overlap and so that each piece of data belongs to
exactly one class.

3. List all classes, even those with zero frequencies.

4. There should be between 5 and 20 classes.

5. The classes are continuous.

• The real limits are located at the midpoint of the gap between adjacent tabled boundaries; that
is, one-half of one unit of measurement below the lower tabled boundary and one-half of one
unit of measurement above the upper tabled boundary.

• Table 2.3.4 gives a frequency distribution of the IQ test scores for 75 adults.

• IQ score is a quantitative variable and according to Table, eight of the individuals have an IQ
score between 80 and 94, fourteen have scores between 95 and 109, twenty-four have scores
between 110 and 124, sixteen have scores between 125 and 139 and thirteen have scores between
140 and 154.

• The frequency distribution given in Table is composed of five classes. The classes are: 80-94,
95-109, 110- 124, 125-139 and 140- 154. Each class has a lower class limit and an upper class
limit. The lower class limits for this distribution are 80, 95, 110, 125 and 140. The upper class
limits are 94,109, 124, 139 and 154.

• If the lower class limit for the second class, 95, is added to the upper class limit for the first
class, 94, and the sum divided by 2, the upper boundary for the first class and the lower
boundary for the second class is determined: (95 + 94)/2 = 94.5. Table 2.3.5 gives all the
boundaries for Table 2.3.4.

• If the lower class limit is added to the upper class limit for any class and the sum divided by 2,
the class mark for that class is obtained. The class mark for a class is the midpoint of the class
and is sometimes called the class midpoint rather than the class mark.
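The boundary and class mark rules above can be checked numerically. The sketch below uses the five IQ classes named in the text and assumes the unit of measurement is 1 (whole IQ points), so each boundary lies half a unit beyond the tabled limit; the variable names are illustrative.

```python
# Class limits from the IQ frequency distribution described in the text.
class_limits = [(80, 94), (95, 109), (110, 124), (125, 139), (140, 154)]

UNIT = 1  # assumed unit of measurement: whole IQ points

rows = []
for lower, upper in class_limits:
    lower_boundary = lower - UNIT / 2   # half a unit below the lower limit
    upper_boundary = upper + UNIT / 2   # half a unit above the upper limit
    class_mark = (lower + upper) / 2    # midpoint of the class
    rows.append((lower, upper, lower_boundary, upper_boundary, class_mark))

print("Limits    Boundaries      Class mark")
for lower, upper, lb, ub, mark in rows:
    print(f"{lower}-{upper:<5} {lb}-{ub:<8} {mark}")
```

The upper boundary of one class equals the lower boundary of the next, which is the same value obtained by averaging the adjacent limits.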

Example 2.3.1: Following table gives the frequency distribution for the cholesterol values of
45 patients in a cardiac rehabilitation study. Give the lower and upper class limits and
boundaries as well as the class marks for each class.

• Solution: Below table gives the limits, boundaries and marks for the classes.

Example 2.3.2: The IQ scores for a group of 35 school dropouts are as follows:

a) Construct a frequency distribution for grouped data.

b) Specify the real limits for the lowest class interval in this frequency distribution.

• Solution: Calculating the class width

(123 - 69) / 10 = 54 / 10 = 5.4 ≈ 5

a) Frequency distribution for grouped data



b) Real limits for the lowest class interval in this frequency distribution = 64.5-69.5.

Example 2.3.3: Given below are the weekly pocket expenses (in Rupees) of a group of 25
students selected at random.

37, 41, 39, 34, 41, 26, 46, 31, 48, 32, 44, 39, 35, 39, 37, 49, 27, 37, 33, 38, 49, 45, 44, 37, 36

Construct a grouped frequency distribution table with class intervals of equal widths,
starting from 25-30, 30-35 and so on. Also, find the range of weekly pocket expenses.

Solution:

• In the given data, the smallest value is 26 and the largest value is 49. So, the range of the
weekly pocket expenses = 49-26=23.
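The grouping asked for in Example 2.3.3 can be reproduced as a short sketch. It uses the 25 values given in the example and assumes each class includes its lower limit and excludes its upper limit (so 30 would fall in 30-35, not 25-30), which is the usual convention when adjacent classes share an endpoint.

```python
# Weekly pocket expenses (in Rupees) of the 25 students, from the example.
expenses = [37, 41, 39, 34, 41, 26, 46, 31, 48, 32, 44, 39, 35,
            39, 37, 49, 27, 37, 33, 38, 49, 45, 44, 37, 36]

# Class edges 25, 30, ..., 50 give the classes 25-30, 30-35, ..., 45-50.
edges = list(range(25, 55, 5))

table = {}
for lower, upper in zip(edges, edges[1:]):
    # Assumed convention: lower limit included, upper limit excluded.
    table[f"{lower}-{upper}"] = sum(lower <= x < upper for x in expenses)

data_range = max(expenses) - min(expenses)

print("Class  Frequency")
for label, count in table.items():
    print(f"{label}  {count:>9}")
print("Range:", data_range)
```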

Outliers

• 'In statistics, an Outlier is an observation point that is distant from other observations.'

• An outlier is a value that escapes normality and can cause anomalies in the results obtained
through algorithms and analytical systems. Therefore, outliers always need some degree of attention.

• Understanding outliers is critical in analyzing data for at least two reasons:

a) The outliers may negatively bias the entire result of an analysis;

b) The behavior of outliers may be precisely what is being sought.

• The simplest way to find outliers in data is to look directly at the data table, or the dataset, as data
scientists call it. The following table clearly exemplifies a typing error, that is, an error in the
input of the data.

• The age recorded for Antony Smith, 470 years, certainly does not represent his real age.
Looking at the table it is possible to identify the outlier, but it is difficult to say which
would be the correct age. There are several possibilities for the right age, such as
47, 70 or even 40 years.
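Looking through a table by eye can be mimicked with a simple plausibility check. The sketch below is only an illustration: apart from the Antony Smith entry mentioned in the text, the records are hypothetical, and the upper bound of 120 years is an assumed threshold rather than a rule from the notes.

```python
# (name, age) records; only the first entry comes from the text,
# the others are hypothetical placeholders.
records = [
    ("Antony Smith", 470),
    ("Person B", 34),
    ("Person C", 52),
    ("Person D", 29),
]

MAX_PLAUSIBLE_AGE = 120  # assumed cut-off for a human age

# Flag every record whose age lies outside the plausible range.
suspect = [(name, age) for name, age in records
           if not 0 <= age <= MAX_PLAUSIBLE_AGE]

for name, age in suspect:
    print(f"Check entry for {name}: recorded age {age}")
```

The check only flags the entry; as the text notes, deciding what the correct value should be still requires going back to the source of the data.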

Relative and Cumulative Frequency Distribution

• Relative frequency distributions show the frequency of each class as a part or fraction of the
total frequency for the entire distribution. Frequency distributions can show either the actual
number of observations falling in each range or the percentage of observations. In the latter
instance, the distribution is called a relative frequency distribution.

• To convert a frequency distribution into a relative frequency distribution, divide the frequency
for each class by the total frequency for the entire distribution.

• A relative frequency distribution lists the data values along with the percent of all
observations belonging to each group. These relative frequencies are calculated by dividing the
frequencies for each group by the total number of observations.
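The conversion to relative frequencies can be sketched with the IQ distribution described earlier in these notes (frequencies 8, 14, 24, 16 and 13 for 75 adults). The dictionary layout and variable names are illustrative choices.

```python
# Class frequencies from the IQ distribution described in the text.
frequencies = {
    "80-94": 8,
    "95-109": 14,
    "110-124": 24,
    "125-139": 16,
    "140-154": 13,
}

total = sum(frequencies.values())

# Relative frequency = class frequency / total frequency.
relative = {label: count / total for label, count in frequencies.items()}

print("Class    Frequency  Relative  Percent")
for label, count in frequencies.items():
    print(f"{label:<8} {count:>9}  {relative[label]:>8.3f}  {relative[label]:>6.1%}")
print("Total:", total)
```

The relative frequencies sum to 1 (100%), which makes distributions based on different sample sizes comparable.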

• Example: Suppose we take a sample of 200 Indian families and record the number of people
living in each household. We obtain the following:

Cumulative frequency:

• A cumulative frequency distribution can be useful for ordered data (e.g. data arranged in
intervals, measurement data, etc.). Instead of reporting frequencies, the recorded values are the
sum of all frequencies for values less than and including the current value.

• Example: Suppose we take a sample of 200 Indian families and record the number of people
living in each household. We obtain the following:

• To convert a frequency distribution into a cumulative frequency distribution, add to the
frequency of each class the sum of the frequencies of all classes ranked below it.
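The running-total rule can be sketched with `itertools.accumulate`, again using the IQ class frequencies given earlier in these notes (8, 14, 24, 16, 13). The classes must be listed from lowest to highest for the result to be meaningful.

```python
from itertools import accumulate

# Classes in ascending order with their frequencies (from the IQ distribution).
labels = ["80-94", "95-109", "110-124", "125-139", "140-154"]
freqs = [8, 14, 24, 16, 13]

# Each cumulative frequency is the class frequency plus all frequencies below it.
cumulative = list(accumulate(freqs))

print("Class    Frequency  Cumulative")
for label, f, cf in zip(labels, freqs, cumulative):
    print(f"{label:<8} {f:>9}  {cf:>10}")
```

The last cumulative frequency always equals the total number of observations, here 75.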

Frequency Distributions for Qualitative (Nominal) Data

• If any single observation in a set of observations is a word, numerical code or letter, then the data
are qualitative data. Frequency distributions for qualitative data are easy to construct.

• It is possible to convert frequency distributions for qualitative variables into relative frequency
distributions.

• If measurement is ordinal, because observations can be ordered from least to most, cumulative
frequencies can be used.

2.4 Graphs for Quantitative Data

1. Histogram

• A histogram is a special kind of bar graph that applies to quantitative data (discrete or
continuous). The horizontal axis represents the range of data values. The bar height represents
the frequency of data values falling within the interval formed by the width of the bar. The bars
are also pushed together with no spaces between them.

• A histogram is a diagram consisting of rectangles whose area is proportional to the frequency
of a variable and whose width is equal to the class interval.

• Here the data values only take on integer values, but we still split the range of values into
intervals. In this case, the intervals are [1,2), [2,3), [3,4), etc. Notice that this graph is also close
to being bell-shaped. A symmetric, bell-shaped distribution is called a normal distribution.

• Fig. 2.4.1 shows histogram.

• Notice that all the rectangles are adjacent and they have no gaps between them unlike a bar
graph.

• The histogram above is called a frequency histogram. If we had used the relative frequency to
make the histogram, we would call the graph a relative frequency histogram.

• If we had used the percentage to make the histogram, we would call the graph a percentage
histogram.

• A relative frequency histogram is the same as a regular histogram, except instead of the bar
height representing frequency, it now represents the relative frequency (so the y-axis runs from 0
to 1, which is 0% to 100%).

2. Frequency polygon

• Frequency polygons are a graphical device for understanding the shapes of distributions. They
serve the same purpose as histograms, but are especially helpful for comparing sets of data.
Frequency polygons are also a good choice for displaying cumulative frequency distributions.

• We can say that frequency polygon depicts the shapes and trends of data. It can be drawn with
or without a histogram.

• Suppose we are given frequency and bins of the ages from another survey as shown in Table
2.4.1.

• The midpoints will be used for the position on the horizontal axis and the frequency for the
vertical axis. From Table 2.4.1 we can then create the frequency polygon as shown in Fig. 2.4.2.

• A line indicates that there is a continuous movement. A frequency polygon should therefore be
used for scale variables that are binned, but sometimes a frequency polygon is also used for
ordinal variables.

• Frequency polygons are useful for comparing distributions. This is achieved by overlaying the
frequency polygons drawn for different data sets.

Example 2.4.1: The frequency polygon of a frequency distribution is shown below.

Answer the following about the distribution from the frequency polygon.

(i) What is the frequency of the class interval whose class mark is 15?

(ii) What is the class interval whose class mark is 45?

(iii) Construct a frequency table for the distribution.

• Solution:

(i) Frequency of the class interval whose class mark is 15 → 8

(ii) Class interval whose class mark is 45→40-50



(iii) As the class marks of consecutive class intervals are 5, 15, 25, 35, 45, 55, the class width
is 10 and each interval extends 5 units on either side of its class mark. The class intervals are
therefore 0 - 10, 10 - 20, 20 - 30, 30 - 40, 40 - 50, 50 - 60 and the frequency table is
constructed as below.
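
The step from class marks to class intervals can be sketched in Python. This is only an illustration, assuming equally spaced class marks; the function name `intervals_from_class_marks` is hypothetical and not part of the notes.

```python
def intervals_from_class_marks(class_marks):
    """Recover class intervals from equally spaced class marks (midpoints)."""
    # The class width is the gap between two consecutive class marks.
    width = class_marks[1] - class_marks[0]
    half = width / 2
    # Each interval extends half a class width on either side of its mark.
    return [(mark - half, mark + half) for mark in class_marks]


class_marks = [5, 15, 25, 35, 45, 55]
intervals = intervals_from_class_marks(class_marks)
for mark, (lower, upper) in zip(class_marks, intervals):
    print(f"class mark {mark}: {lower:g} - {upper:g}")
```

The class mark 45 maps to the interval 40 - 50, which agrees with part (ii) of the example.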

3. Stem and Leaf diagram:

• Stem and leaf diagrams display raw data visually. Each raw score is divided into a stem and a
leaf. The leaf is typically the last digit of the raw value (usually the ones digit). The stem is
the remaining digits of the raw value.

• To generate a stem and leaf diagram, first create a vertical column that contains all of the stems.
Then list each leaf next to the corresponding stem. In these diagrams, all of the scores are
represented in the diagram without the loss of any information.

• A stem-and-leaf plot retains the original data. The leaves are usually the last digit in each data
value and the stems are the remaining digits.
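
The construction described above can be sketched in Python. The scores below are hypothetical values chosen only for illustration (they are not the test scores of the exercise that follows), and the function name `stem_and_leaf` is an assumption.

```python
from collections import defaultdict


def stem_and_leaf(values):
    """Group non-negative integers by stem (leading digits) and leaf (last digit)."""
    groups = defaultdict(list)
    for value in sorted(values):
        stem, leaf = divmod(value, 10)  # e.g. 85 -> stem 8, leaf 5
        groups[stem].append(leaf)
    return dict(groups)


scores = [67, 72, 85, 75, 89, 89, 88, 90, 99, 100]  # hypothetical data
plot = stem_and_leaf(scores)

# Print one row per stem; a stem with no scores keeps an empty row.
for stem in range(min(plot), max(plot) + 1):
    leaves = " ".join(str(leaf) for leaf in plot.get(stem, []))
    print(f"{stem:>2} | {leaves}")
```

Every score can be read back from its row (stem 8 with leaves 5, 8, 9, 9 stands for 85, 88, 89, 89), which is why no information is lost.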

• Create a stem-and-leaf plot of the following test scores from a group of college freshmen.

• Stem and Leaf Diagram :



2.5 Graph for Qualitative (Nominal) Data

• There are a couple of graphs that are appropriate for qualitative data that has no natural
ordering.

1. Bar graphs

• Bar Graphs are like histograms, but the horizontal axis has the name of each category and there
are spaces between the bars.

• Usually, the bars are ordered with the categories in alphabetical order. One variant of a bar
graph is called a Pareto Chart. These are bar graphs with the categories ordered by frequency,
from largest to smallest.

• Fig. 2.5.1 shows bar graph.

• Bars of a bar graph can be represented both vertically and horizontally.



• In bar graph, bars are used to represent the amount of data in each category; one axis displays
the categories of qualitative data and the other axis displays the frequencies.
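
The difference between the usual alphabetical ordering and the Pareto ordering of the bars can be sketched as follows. The survey responses are hypothetical, and the text-based bars are only a stand-in for a real chart.

```python
from collections import Counter

# Hypothetical nominal data: mode of transport reported by nine students.
responses = ["bus", "car", "walk", "car", "bus", "car", "cycle", "walk", "car"]
counts = Counter(responses)

# Ordinary bar graph: categories in alphabetical order.
alphabetical = sorted(counts.items())

# Pareto chart: categories ordered by frequency, largest first.
pareto = sorted(counts.items(), key=lambda item: item[1], reverse=True)

for name, n in pareto:
    print(f"{name:<6} {'#' * n}")
```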

2.6 Misleading Graph

• It is a well known fact that statistics can be misleading. They are often used to prove a point
and can easily be twisted in favour of that point.

• Good graphs are extremely powerful tools for displaying large quantities of complex data; they
help turn the reams of information available today into knowledge. But, unfortunately, some
graphs deceive or mislead.

• This may happen because the designer chooses to give readers the impression of better
performance or results than is actually the situation. In other cases, the person who prepares the
graph may want to be accurate and honest, but may mislead the reader by a poor choice of a
graph form or poor graph construction.

• The following things are important to consider when looking at a graph:

1. Title

2. Labels on both axes of a line or bar chart and on all sections of a pie chart

3. Source of the data

4. Key to a pictograph

5. Uniform size of a symbol in a pictograph

6. Scale: Does it start with zero? If not, is there a break shown?

7. Scale: Are the numbers equally spaced?

• A graph can be altered by changing the scale of the graph. For example, data in the two graphs
of Fig. 2.6.1 are identical, but scaling of the Y-axis changes the impression of the magnitude of
differences.

Example 2.6.1: Construct a frequency distribution for the number of different residences
occupied by graduating seniors during their college career, namely: 1, 4, 2, 3, 3, 1, 6, 7, 4, 3,
3, 9, 2, 4, 2, 2, 3, 2, 3, 4, 4, 2, 3, 3, 5. What is the shape of this distribution?

Solution:

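Shape of the distribution: The frequencies peak at 3 residences (f = 8), drop quickly on the left
and trail off slowly toward the larger values (up to 9). The distribution is therefore positively
skewed rather than normal. For comparison, the normal distribution, one of the most commonly
encountered types of data distribution, especially in social sciences, is symmetric; due to its
bell-like shape it is also referred to as the bell curve.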

Histogram of given data:
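
The frequency distribution for this example can be tallied with a short Python sketch. The variable names are assumptions, and the rows of asterisks are only a rough text stand-in for the histogram.

```python
from collections import Counter

residences = [1, 4, 2, 3, 3, 1, 6, 7, 4, 3, 3, 9, 2,
              4, 2, 2, 3, 2, 3, 4, 4, 2, 3, 3, 5]

freq = Counter(residences)

# One row per value from the minimum to the maximum; a Counter returns 0
# for values that never occur, so gaps in the data stay visible.
print("Residences  f")
for value in range(min(residences), max(residences) + 1):
    print(f"{value:>10}  {freq[value]:>2}  {'*' * freq[value]}")
```

The tallies peak at 3 residences and thin out toward 9, with no student reporting 8.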



2.7 Describing Data with Averages

• Averages consist of numbers (or words) about which the data are, in some sense, centered.

1. Mean :

• The mean of a data set is the average of all the data values. The sample mean $\bar{X}$ is the
point estimator of the population mean $\mu$.

Sample mean: $\bar{X} = \dfrac{\sum X}{n} = \dfrac{\text{Sum of the values of the } n \text{ observations}}{\text{Number of observations in the sample}}$

Population mean: $\mu = \dfrac{\sum X}{N} = \dfrac{\text{Sum of the values of the } N \text{ observations}}{\text{Number of observations in the population}}$

2. Median :

• The median of a data set is the value in the middle when the data items are arranged in
ascending order. Whenever a data set has extreme values, the median is the preferred measure of
central location.

• The median is the measure of location most often reported for annual income and property
value data. A few extremely large incomes or property values can inflate the mean.

• For an odd number of observations:

7 observations = 26, 18, 27, 12, 14, 29, 19

Numbers in ascending order = 12, 14, 18, 19, 26, 27, 29

• The median is the middle value.

Median = 19

• For an even number of observations :

8 observations = 26, 18, 29, 12, 14, 27, 30, 19

Numbers in ascending order = 12, 14, 18, 19, 26, 27, 29, 30

• The median is the average of the middle two values.

Median = (19 + 26) / 2 = 22.5
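
The two cases can be sketched in Python. This is a minimal illustration using the observations above; the function name `median` is chosen here for clarity (Python's standard `statistics.median` gives the same results).

```python
def median(values):
    """Middle value of the sorted data, or the mean of the two middle values."""
    ordered = sorted(values)
    n = len(ordered)
    mid = n // 2
    if n % 2 == 1:
        # Odd number of observations: a single middle value exists.
        return ordered[mid]
    # Even number of observations: average the two middle values.
    return (ordered[mid - 1] + ordered[mid]) / 2


odd_case = [26, 18, 27, 12, 14, 29, 19]
even_case = [26, 18, 29, 12, 14, 27, 30, 19]

print(median(odd_case))   # 19
print(median(even_case))  # 22.5
```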

3. Mode:

• The mode of a data set is the value that occurs with greatest frequency. The greatest frequency
can occur at two or more different values. If the data have exactly two modes, the data are
bimodal. If the data have more than two modes, the data are multimodal.

• Weighted mean : Sometimes each value in a set is associated with a weight. The weights
reflect the significance, importance or occurrence frequency attached to their respective values.

• Trimmed mean: A major problem with the mean is its sensitivity to extreme (e.g., outlier)
values. Even a small number of extreme values can corrupt the mean. The trimmed mean is the
mean obtained after cutting off values at the high and low extremes.

• For example, we can sort the values and remove the top and bottom 2 % before computing the
mean. We should avoid trimming too large a portion (such as 20 %) at both ends as this can
result in the loss of valuable information.

• Holistic measure is a measure that must be computed on the entire data set as a whole. It
cannot be computed by partitioning the given data into subsets and merging the values obtained
for the measure in each subset.
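
The mode, the weighted mean and the trimmed mean described above can be sketched in Python. The data and the helper names `weighted_mean` and `trimmed_mean` are assumptions made for illustration; `mean` and `multimode` come from Python's standard `statistics` module (Python 3.8 or later).

```python
from statistics import mean, multimode


def weighted_mean(values, weights):
    """Sum of value * weight divided by the sum of the weights."""
    return sum(v * w for v, w in zip(values, weights)) / sum(weights)


def trimmed_mean(values, proportion):
    """Mean after dropping the given proportion of values from each end."""
    ordered = sorted(values)
    k = int(len(ordered) * proportion)  # number of values cut from each end
    return mean(ordered[k:len(ordered) - k])


scores = [2, 3, 3, 4, 4, 5, 6, 7, 8, 95]    # hypothetical data with one outlier

print(multimode(scores))                    # two modes, so the data are bimodal
print(mean(scores))                         # pulled upward by the outlier 95
print(trimmed_mean(scores, 0.10))           # drops 2 and 95 before averaging
print(weighted_mean([80, 90], [3, 1]))      # 80 counts three times as much as 90
```

With these values the ordinary mean is 13.7, while trimming 10 % from each end gives 5, much closer to the bulk of the data.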

2.8 Describing Variability

• Variability, almost by definition, is the extent to which data points in a statistical distribution or
data set diverge, vary from the average value, as well as the extent to which these data points
differ from each other. Variability refers to the divergence of data from its mean value and is
commonly used in the statistical and financial sectors.

• The goal for variability is to obtain a measure of how spread out the scores are in a distribution.
A measure of variability usually accompanies a measure of central tendency as basic descriptive
statistics for a set of scores.

• Central tendency describes the central point of the distribution and variability describes how
the scores are scattered around that central point. Together, central tendency and variability are
the two primary values that are used to describe a distribution of scores.

• Variability serves both as a descriptive measure and as an important component of most
inferential statistics. As a descriptive statistic, variability measures the degree to which the scores
are spread out or clustered together in a distribution.

• Variability can be measured with the range, the interquartile range and the standard
deviation/variance. In each case, variability is determined by measuring distance.

Range

• The range is the total distance covered by the distribution, from the highest score to the lowest
score (using the upper and lower real limits of the range).

Range=Maximum value - Minimum value

Merits :

a) It is easier to compute.

b) It can be used as a measure of variability where precision is not required.

Demerits :

a) Its value depends on only two scores.

b) It is not sensitive to the overall shape of the distribution.

Variance

• Variance is the expected value of the squared deviation of a random variable from its mean. In
short, it is the measurement of the distance of a set of random numbers from their collective
average value. Variance is used in statistics as a way of better understanding a data set's
distribution.

• Variance is calculated as the mean of the squared deviations from the mean; equivalently, it is
the square of the standard deviation. For a population:

$\sigma^2 = \dfrac{\sum (X - \mu)^2}{N}$

• In the formula above, $\mu$ represents the mean of the data points, $X$ is the value of an
individual data point and $N$ is the total number of data points.

• Data scientists often use variance to better understand the distribution of a data set. Machine
learning uses variance calculations to make generalizations about a data set, aiding in a neural
network's understanding of data distribution. Variance is often used in conjunction with
probability distributions.

Standard Deviation

• Standard deviation is simply the square root of the variance. Standard deviation measures the
standard distance between a score and the mean.

Standard deviation=√Variance

• The standard deviation is a measure of how the values in data differ from one another or how
spread out data is. There are two types of variance and standard deviation in terms of sample and
population.

• The standard deviation measures how far the data points in a set of observations lie from the
mean. We can calculate it by subtracting the mean from each data point, squaring the differences
and then taking the mean of the squared differences; this is called the variance. The square root
of the variance gives us the standard deviation.

• Properties of the Standard Deviation :

a) If a constant is added to every score in a distribution, the standard deviation will not be
changed.

b) The center of the distribution (the mean) changes, but the standard deviation remains the
same.

c) If each score is multiplied by a constant, the standard deviation will be multiplied by the same
constant.

d) Multiplying by a constant will multiply the distance between scores and because the standard
deviation is a measure of distance, it will also be multiplied.
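
These properties can be checked numerically. The sketch below uses a small set of illustrative scores and `pstdev` (population standard deviation) from Python's standard `statistics` module; the constants 10 and 3 are arbitrary choices.

```python
from statistics import mean, pstdev

scores = [1, 3, 7, 2, 0, 4, 7, 3]  # illustrative scores

base_sd = pstdev(scores)

# Adding a constant moves the mean but leaves the standard deviation alone.
shifted = [x + 10 for x in scores]
shifted_sd = pstdev(shifted)

# Multiplying by a constant stretches every distance by that constant.
scaled = [x * 3 for x in scores]
scaled_sd = pstdev(scaled)

print(mean(scores), mean(shifted))  # the mean shifts by 10
print(base_sd, shifted_sd)          # the standard deviations agree
print(scaled_sd / base_sd)          # ratio is approximately 3
```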

• If we are given numerical values for the mean and the standard deviation, we should be able to
construct a visual image (or a sketch) of the distribution of scores. As a general rule, about 70%
of the scores will be within one standard deviation of the mean and about 95% of the scores will
be within a distance of two standard deviations of the mean.

• The mean is a measure of position, but the standard deviation is a measure of distance (on
either side of the mean of the distribution).

• Standard deviation distances always originate from the mean and are expressed as positive
deviations above the mean or negative deviations below the mean.

• The definition formula for the population Sum of Squares (SS) is given below:

$SS = \sum (X - \mu)^2$

• The computation formula for the population Sum of Squares (SS) is given below:

$SS = \sum X^2 - \dfrac{(\sum X)^2}{N}$

• The definition formula for the sample Sum of Squares:

$SS = \sum (X - \bar{X})^2$

• The computation formula for the sample Sum of Squares:

$SS = \sum X^2 - \dfrac{(\sum X)^2}{n}$

Example 2.8.1: The heights of animals are: 600 mm, 470 mm, 170 mm, 430 mm and 300
mm. Find out the mean, the variance and the standard deviation.

Solution:

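$\text{Mean } (\mu) = \dfrac{600 + 470 + 170 + 430 + 300}{5} = \dfrac{1970}{5} = 394$

$\sigma^2 = \dfrac{\sum (X - \mu)^2}{N}$

$\text{Variance} = \dfrac{(600-394)^2 + (470-394)^2 + (170-394)^2 + (430-394)^2 + (300-394)^2}{5}$

$\text{Variance} = \dfrac{42436 + 5776 + 50176 + 1296 + 8836}{5} = \dfrac{108520}{5}$

Variance = 21704

$\text{Standard deviation} = \sqrt{\text{Variance}} = \sqrt{21704}$

= 147.32 ≈ 147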
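
The arithmetic of this example can be cross-checked with Python's standard `statistics` module. This is a quick verification sketch; `pvariance` and `pstdev` are the population versions, which divide by N as in the formula above.

```python
from statistics import mean, pstdev, pvariance

heights_mm = [600, 470, 170, 430, 300]

mu = mean(heights_mm)             # population mean
variance = pvariance(heights_mm)  # mean of squared deviations (divides by N)
sd = pstdev(heights_mm)           # square root of the population variance

print(mu)            # 394
print(variance)      # 21704
print(round(sd, 2))  # 147.32
```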

Example 2.8.2: Using the computation formula for the sum of squares, calculate the
population standard deviation for the scores: 1, 3, 7, 2, 0, 4, 7, 3.

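Solution: Calculate the sums needed for the computation formula.

$N = 8$, $\sum X = 1+3+7+2+0+4+7+3 = 27$, so Mean $(\mu) = 27/8 = 3.375$

$\sum X^2 = 1+9+49+4+0+16+49+9 = 137$

$SS = \sum X^2 - \dfrac{(\sum X)^2}{N} = 137 - \dfrac{729}{8} = 137 - 91.125 = 45.875$

As a check, the definition formula gives the same sum of squares:

$SS = (1-3.375)^2 + (3-3.375)^2 + (7-3.375)^2 + (2-3.375)^2 + (0-3.375)^2 + (4-3.375)^2 + (7-3.375)^2 + (3-3.375)^2$

$= (-2.375)^2 + (-0.375)^2 + (3.625)^2 + (-1.375)^2 + (-3.375)^2 + (0.625)^2 + (3.625)^2 + (-0.375)^2$

$= 5.64 + 0.14 + 13.14 + 1.89 + 11.39 + 0.39 + 13.14 + 0.14 \approx 45.87$

$\text{Variance} = \dfrac{SS}{N} = \dfrac{45.875}{8} \approx 5.73$

The population standard deviation is the square root of the variance = $\sqrt{5.73} \approx 2.39$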
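
The definition and computation formulas for the sum of squares can be compared in a short Python sketch. The variable names are assumptions; both formulas should give the same SS for these scores.

```python
from math import sqrt

scores = [1, 3, 7, 2, 0, 4, 7, 3]
n = len(scores)
mu = sum(scores) / n

# Definition formula: sum of squared deviations from the mean.
ss_definition = sum((x - mu) ** 2 for x in scores)

# Computation formula: sum of squares minus (square of the sum) / N.
ss_computation = sum(x * x for x in scores) - sum(scores) ** 2 / n

variance = ss_computation / n  # population variance
sd = sqrt(variance)            # population standard deviation

print(ss_definition, ss_computation)
print(round(variance, 2), round(sd, 2))
```

The computation formula avoids working with the decimal deviations, which is why it is convenient for hand calculation.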

The Interquartile Range

• The interquartile range (IQR) is the distance covered by the middle 50% of the distribution,
that is, the difference between the third and first quartiles: IQR = Q3 - Q1.

• Fig. 2.8.1 shows IQR.

• The first quartile, denoted Q1, is the value in the data set that holds 25% of the values below it.
The third quartile, denoted Q3, is the value in the data set that holds 25% of the values above it.

Example 2.8.3: Determine the values of the range and the IQR for the following sets of
data.

(a) Retirement ages: 60, 63, 45, 63, 65, 70, 55, 63, 60, 65, 63

(b) Residence changes: 1, 3, 4, 1, 0, 2, 5, 8, 0, 2, 3, 4, 7, 11, 0, 2, 3, 4

Solution:

a) Retirement ages: 60, 63, 45, 63, 65, 70, 55, 63, 60, 65, 63

Range = Max number - Min number = 70-45

Range = 25

IQR:

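Step 1: Arrange the given numbers from lowest to highest.

45, 55, 60, 60, 63, 63, 63, 63, 65, 65, 70

Step 2: Find the median, then the medians of the lower and upper halves.

Median = 63 (the 6th of the 11 values)

Q1 = 60 (median of 45, 55, 60, 60, 63), Q3 = 65 (median of 63, 63, 65, 65, 70)

IQR = Q3 - Q1 = 65 - 60 = 5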
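
The same steps can be sketched in Python and applied to both data sets of the example. The sketch assumes the convention used above, in which Q1 and Q3 are the medians of the lower and upper halves of the sorted data (leaving out the middle value when the number of observations is odd); other quartile conventions can give slightly different values. The function names are hypothetical.

```python
from statistics import median


def value_range(values):
    """Range = maximum value - minimum value."""
    return max(values) - min(values)


def quartiles_and_iqr(values):
    """Return (Q1, Q3, IQR) using medians of the lower and upper halves.

    Assumes at least two observations.
    """
    ordered = sorted(values)
    half = len(ordered) // 2
    lower = ordered[:half]   # values below the median position
    upper = ordered[-half:]  # values above the median position
    q1, q3 = median(lower), median(upper)
    return q1, q3, q3 - q1


retirement_ages = [60, 63, 45, 63, 65, 70, 55, 63, 60, 65, 63]
residence_changes = [1, 3, 4, 1, 0, 2, 5, 8, 0, 2, 3, 4, 7, 11, 0, 2, 3, 4]

print(value_range(retirement_ages), quartiles_and_iqr(retirement_ages))
print(value_range(residence_changes), quartiles_and_iqr(residence_changes))
```

Under this convention the retirement ages give a range of 25 and an IQR of 5, and the residence changes give a range of 11 and an IQR of 3 (Q1 = 1, Q3 = 4).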

2.9 Normal Distributions and Standard (z) Scores

• The normal distribution is a continuous probability distribution that is symmetrical on both
sides of the mean, so the right side of the center is a mirror image of the left side. The area under
the normal distribution curve represents probability and the total area under the curve equals
one.

• The normal distribution is often called the bell curve because the graph of its probability
density looks like a bell. It is also known as the Gaussian distribution, after the German
mathematician Carl Friedrich Gauss.

• Fig. 2.9.1 shows normal curve.

• A normal distribution is determined by two parameters: the mean and the variance. A normal
distribution with a mean of 0 and a standard deviation of 1 is called a standard normal
distribution.

Two Marks Questions with Answers

Q.1 Define qualitative data.

Ans. Qualitative data provides information about the quality of an object or information which
cannot be measured. Qualitative data cannot be expressed as a number. Data that represent
nominal scales such as gender, economic status and religious preference are usually considered
to be qualitative data. It is also called categorical data.

Q.2 What is quantitative data ?

Ans.
Quantitative data focuses on numbers and mathematical calculations; it is anything that can be
expressed as a number or quantified. Examples of quantitative data are scores on achievement
tests, number of hours of study or weight of a subject.

Q.3 What is nominal data ?

Ans. : Nominal data is the first level of the measurement scale, in which the numbers serve only
as "tags" or "labels" to classify or identify the objects. Nominal data is a type of qualitative
data. It usually deals with non-numeric variables or with numbers that carry no quantitative
value. While developing statistical models, nominal data are usually transformed before building
the model.

Q.4 Describe ordinal data.

Ans. : Ordinal data takes its values from an ordered set and is recorded in order of magnitude.
"Ordinal" refers to "order". Ordinal data is a type of qualitative (categorical) data. It can be
grouped, named and also ranked.

Q.5 What is an interval data ?

Ans. Interval data corresponds to a variable whose values are measured on an interval scale.

It is defined as a quantitative measurement scale in which the difference between two values
is meaningful. In other words, the values are measured in exact rather than relative terms, but
the zero point of the scale is arbitrary.

Q.6 What do you mean by an observational study?

Ans. An observational study focuses on detecting relationships between variables not
manipulated by the investigator. An observational study is used to answer a research question
based purely on what the researcher observes. There is no interference or manipulation of the
research subjects and no control and treatment groups.

Q.7 What is frequency distribution?

Ans. Frequency distribution is a representation, either in a graphical or tabular format, that
displays the number of observations within a given interval. The interval size depends on the
data being analyzed and the goals of the analyst.

Q.8 What is cumulative frequency?

Ans. A cumulative frequency distribution can be useful for ordered data (e.g. data arranged in
intervals, measurement data, etc.). Instead of reporting frequencies, the recorded values are the
sum of all frequencies for values less than and including the current value.

Q.9 Explain histogram.

Ans. A histogram is a special kind of bar graph that applies to quantitative data (discrete or
continuous). The horizontal axis represents the range of data values. The bar height represents
the frequency of data values falling within the interval formed by the width of the bar. The bars
are also pushed together with no spaces between them.

Q.10 What is the goal of measuring variability?

Ans. The goal for variability is to obtain a measure of how spread out the scores are in a
distribution. A measure of variability usually accompanies a measure of central tendency as basic
descriptive statistics for a set of scores.

Q.11 How to calculate range?

Ans. The range is the total distance covered by the distribution, from the highest score to the
lowest score (using the upper and lower real limits of the range).

Range = Maximum value - Minimum value

Q.12 What is an independent variable?

Ans. An independent variable is the variable that is changed or controlled in a scientific
experiment to test the effects on the dependent variable.

Q.13 What is an observational study?

Ans. An observational study focuses on detecting relationships between variables not
manipulated by the investigator. An observational study is used to answer a research question
based purely on what the researcher observes. There is no interference or manipulation of the
research subjects and no control and treatment groups.

Q.14 Explain frequency polygon.

Ans. : Frequency polygons are a graphical device for understanding the shapes of distributions.
They serve the same purpose as histograms, but are especially helpful for comparing sets of data.
Frequency polygons are also a good choice for displaying cumulative frequency distributions.

Q.15 What is a Stem and Leaf diagram?

Ans. Stem and leaf diagrams display raw data visually. Each raw score is divided into a stem
and a leaf. The leaf is typically the last digit of the raw value (usually the ones digit). The
stem is the remaining digits of the raw value.

z Scores

• The z-score, or standard score, expresses how many standard deviations a value lies from the
mean. Accordingly, a complete set of z-scores has a mean of 0 and a standard deviation of 1.
Formally, the z-score is defined as :

$z = \dfrac{X - \mu}{\sigma}$

where $\mu$ is the mean, $X$ is the score and $\sigma$ is the standard deviation.

• The z-score is obtained by subtracting the mean from the score and then dividing the result by
the standard deviation of the population. The z-score is positive if the value lies above the
mean and negative if it lies below the mean.

• A z score consists of two parts:

a) Positive or negative sign indicating whether it's above or below the mean; and

b) Number indicating the size of its deviation from the mean in standard deviation units

• Why are z-scores important?

• It is useful to standardize the values (raw scores) of a normal distribution by converting them
into z-scores because:

(a) It allows researchers to calculate the probability of a score occurring within a standard
normal distribution;

(b) It enables us to compare two scores that are from different samples (which may have
different means and standard deviations).

• Using the z-score technique, one can now compare two different test results based on relative
performance, not individual grading scale.

Example 2.9.1: A class of 50 students wrote a science test last week. A student, Rakshita,
scored 93 in the test while the average score of the class was 68. Determine the z-score
for Rakshita's test mark if the standard deviation is 13.

Solution: Given,

Rakshita's test score, X = 93, Mean (μ) = 68, Standard deviation (σ) = 13

The z-score for Rakshita's test score can be calculated using the formula as,

$z = \dfrac{X - \mu}{\sigma} = \dfrac{93 - 68}{13} = 1.923$

Example 2.9.2: Express each of the following scores as a z score:

(a) Margaret's IQ of 135, given a mean of 100 and a standard deviation of 15

(b) A score of 470 on the SAT math test, given a mean of 500 and a standard deviation of
100.

Solution :

a) Margaret's IQ of 135, given a mean of 100 and a standard deviation of 15

Given, Margaret's IQ (X) = 135, Mean (μ) = 100, Standard deviation (σ) = 15

The z-score for Margaret's IQ is calculated using the formula as,

$z = \dfrac{X - \mu}{\sigma} = \dfrac{135 - 100}{15} = 2.33$

b) A score of 470 on the SAT math test, given a mean of 500 and a standard deviation of
100

Given,

Score (X) = 470, Mean (μ) = 500, Standard deviation (σ) = 100

The z-score for the SAT math score is calculated using the formula as,


$z = \dfrac{X - \mu}{\sigma} = \dfrac{470 - 500}{100} = -0.30$
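
The z-score calculations in Examples 2.9.1 and 2.9.2 can be reproduced with a short Python sketch. The function name `z_score` is an assumption made for illustration.

```python
def z_score(x, mean, sd):
    """Number of standard deviations that x lies above (+) or below (-) the mean."""
    return (x - mean) / sd


z_rakshita = z_score(93, 68, 13)   # Example 2.9.1
z_iq = z_score(135, 100, 15)       # Example 2.9.2 (a)
z_sat = z_score(470, 500, 100)     # Example 2.9.2 (b)

print(round(z_rakshita, 3))  # 1.923
print(round(z_iq, 2))        # 2.33
print(round(z_sat, 2))       # -0.3
```

The negative sign for the SAT score shows that 470 lies below the mean of 500.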

Standard Normal Curve

• If the original distribution approximates a normal curve, then the shift to standard or z-scores
will always produce a new distribution that approximates the standard normal curve.

• Although there is an infinite number of different normal curves, each with its own mean and
standard deviation, there is only one standard normal curve, with a mean of 0 and a standard
deviation of 1.
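
The effect of converting a whole set of scores to z-scores can be checked numerically. The sketch below uses five illustrative values and the population standard deviation; it shows only that the resulting z-scores have a mean of 0 and a standard deviation of 1. Whether their shape is normal still depends on the shape of the original distribution.

```python
from statistics import mean, pstdev

scores = [600, 470, 170, 430, 300]  # illustrative raw scores

mu = mean(scores)
sigma = pstdev(scores)

# Standardize every score: subtract the mean, divide by the standard deviation.
z_scores = [(x - mu) / sigma for x in scores]

print([round(z, 2) for z in z_scores])
print(round(mean(z_scores), 6), round(pstdev(z_scores), 6))
```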

Example 2.9.3: Suppose a random variable is normally distributed with a mean of 400 and
a standard deviation of 100. Draw a normal curve and label its parameters.

Solution:
