Tutorial 1

Uploaded by

lijiayi
GEA1000 QUANTITATIVE REASONING WITH DATA

TUTORIAL 1
Please work on the problems before coming to class. In class, you will engage in group work.

In this tutorial, we will be looking at two case studies:


• Exploring box office records for movies and
• A study on the usage of new learning software in an enrichment centre.

Case Study 1: Exploring box office records for movies

Refer to the CSV file “Movies.csv”. Below is a brief description of some of the variables which
may not be considered common knowledge. The other variables in the data set, not listed
below, are self-explanatory.

Variable            Description

Production_Budget   The cost involved in making the movie.

Worldwide_Gross     The revenue generated from screening the movie in public
                    theatres around the world.

CPI                 The Consumer Price Index (CPI) is an important economic
                    metric used to measure the change in general prices of goods
                    and services (otherwise known as inflation) in a country. Very
                    often, this index is derived by considering a weighted average
                    of the prices of goods and services consumed in the country.
                    The weight assigned to each good or service is determined by
                    the monetary authority of the country but is very often linked
                    to the consumption patterns of the average individual in the
                    country.

MPAA_Rating         The Motion Picture Association of America film rating system
                    is used in the United States and its territories to rate a motion
                    picture’s suitability for certain audiences based on its content.
                    If a film has not been submitted for a rating or is an uncut
                    version of a film that was submitted, the label Not Rated (NR)
                    is used.

IMDb_Rating         IMDb is an abbreviation for Internet Movie Database, which is
                    an online database of information related to films. IMDb
                    offers a rating scale that allows users to rate films from
                    one to ten based on how they feel about the film, with a low
                    score being regarded as an unfavourable view and a high score
                    being regarded as a favourable one.
                    The submitted ratings are filtered and weighted in various
                    ways (partly depending on the stature of the person rating the
                    movie) to produce a “weighted mean” that is displayed for
                    each film. IMDb keeps the formula for the weighted mean
                    confidential.

Voter_Numbers       The number of voters who rated the movie on IMDb, whose
                    ratings eventually resulted in the IMDb rating.

Sampling
The data was obtained on 14th August 2023 from the following website

https://www.the-numbers.com/movie/budgets/all

consisting of 6420 movies, via the following procedure. All 6420 movies’ information was
downloaded. A random number between 1 and 6420 was generated and assigned to each
movie without allowing for repetition of numbers. Then the movies labelled “1” to “1091”
were selected and the information provided from the website was merged with other publicly
available information pertaining to them, to construct the data.
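
The labelling procedure described above can be sketched in Python as follows. This is only an illustration under stated assumptions: the function name `draw_sample` and the seed are hypothetical, and the movies are represented simply by their positions 0 to 6419 in the downloaded list rather than by the actual records.

```python
import random


def draw_sample(frame_size, sample_size, seed=None):
    """Assign each item a distinct random label from 1 to frame_size,
    then return the positions of the items labelled 1 to sample_size."""
    rng = random.Random(seed)
    labels = list(range(1, frame_size + 1))
    rng.shuffle(labels)  # labels[i] is the random number given to item i
    return [i for i, label in enumerate(labels) if label <= sample_size]


# Hypothetical seed, used only so that the run is repeatable.
selected = draw_sample(frame_size=6420, sample_size=1091, seed=2023)
print(len(selected))  # 1091
```

Because the labels are a shuffle of 1 to 6420, no number is repeated, which matches the "without allowing for repetition" condition in the description.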

a) State the sampling frame and determine what type of sampling was done here.

Types of Variables

b) For all the variables, except IMDb_Rating, determine which are numerical and which
are categorical. Were there instances where it was unclear whether to classify the
variable as numerical or categorical?

Data Cleaning
When working with data, we are not usually fortunate enough to be handling data that is
pristine and requires no form of cleaning/tidying. In this part we will highlight two
manifestations of potentially dirty data and how we can go about dealing with them.

Manifestation 1: Missing values


c) (i) Identify the variables for which there are missing values.

(ii) Name the variables for which the number of missing values is relatively small
(i.e., greater than 0 and less than 30). Given that the number of missing values
for these variables is small, suggest what can be done for this form of dirty
data.

(iii) Name the variables for which the number of missing values is very large (i.e.,
more than 100). If you were to try and implement your suggestion(s) in (ii), to
what extent is it feasible here?
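
As a rough aid for part c), missing values can be counted per variable with a few lines of Python. The sketch below assumes that a missing value appears as an empty cell in the CSV file; the three-row data set and the column name `Movie` are made up for illustration and are not taken from “Movies.csv”.

```python
import csv
import io
from collections import Counter


def count_missing(rows, columns):
    """Count, for each column, the rows whose cell is empty or absent."""
    counts = Counter()
    for row in rows:
        for col in columns:
            value = row.get(col)
            if value is None or value.strip() == "":
                counts[col] += 1
    return {col: counts[col] for col in columns}


# Hypothetical data, standing in for the real file.
sample_csv = """Movie,Release_Year,CPI,Worldwide_Gross
A,2015,237,1000
B,,,500
C,2019,,0
"""

reader = csv.DictReader(io.StringIO(sample_csv))
columns = reader.fieldnames
missing = count_missing(list(reader), columns)
print(missing)
```

To run this on the real data, the `io.StringIO(sample_csv)` object would be replaced by an opened file handle. If the file codes missing values differently (for example as “NA”), the emptiness check would need to be adapted.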

Manifestation 2: Unusual values


For numerical variables in data sets, in addition to calculating their mean and standard
deviation, it is common practice for the description to include
what is known as the “Five-Number Summary”, which consists of
- Minimum
- Q1
- Median
- Q3
- Maximum

Whilst the Five-Number summary, together with the mean and standard deviation
(SD), are used to provide quantitative information about groups of data, sometimes
they can also be used to identify sources of unusual data values.
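
The summary statistics named above can be computed with Python's standard library, as in the sketch below. The data are made up, and the function name `describe` is hypothetical. Note that several conventions exist for computing quartiles; this sketch uses the "inclusive" method of `statistics.quantiles`, which may give slightly different Q1 and Q3 values from other software.

```python
import statistics


def describe(values):
    """Return the five-number summary together with the mean and sample SD."""
    data = sorted(values)
    q1, median, q3 = statistics.quantiles(data, n=4, method="inclusive")
    return {
        "min": data[0],
        "q1": q1,
        "median": median,
        "q3": q3,
        "max": data[-1],
        "mean": statistics.mean(data),
        "sd": statistics.stdev(data),  # sample standard deviation
    }


# Hypothetical values, not taken from the data set.
summary = describe([1, 2, 3, 4, 5, 6, 7, 8, 9])
print(summary)
```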

d) Give the Five-Number Summary together with the mean and SD for the variable
Worldwide_Gross. State whether there are any anomalies in the summary statistics.
Explain what could result in such an anomaly and hence, explain the
circumstances under which it would be justifiable to remove these anomalous data
points. (Hint: Refer to the description of Worldwide_Gross given in the table.)

Data Visualisation
While we now have some idea that the data cleaning process can be an extensive one, for the
rest of this question, you may ignore the movies where Release_Year or CPI is blank.

Inflation is the rate of increase in prices over a given period. Let us adjust the Production
Budget for inflation with reference to the 2022 CPI, which is 294.4. For example, Star Wars
Ep. VII: The Force Awakens (the first movie in the data set) has a production budget of
$306,000,000 and was released in 2015. The CPI for 2015 is 237. This means that the
“equivalent” production budget for 2022 is estimated to be

(294.4 / 237) × 306,000,000 = $380,111,392 (to the nearest whole number)
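
The same adjustment can be written as a small Python function, which may help when checking values by hand. The function name `adjust_to_reference_cpi` is hypothetical; the figures are those of the worked example above.

```python
def adjust_to_reference_cpi(amount, cpi_then, cpi_reference=294.4):
    """Scale an amount by the ratio of the reference CPI to the CPI
    of the year in which the amount was recorded."""
    return amount * cpi_reference / cpi_then


# Star Wars Ep. VII: budget of $306,000,000, released in 2015 (CPI 237).
adjusted = adjust_to_reference_cpi(306_000_000, cpi_then=237)
print(round(adjusted))  # 380111392
```

The default reference value of 294.4 is the 2022 CPI quoted in the text; passing a different `cpi_reference` would express the budget in another year's prices.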

e) Create a new variable called Adjusted_Production_Budget where all
production budgets are valued with reference to the 2022 CPI.

f) Use a suitable visualisation to depict the trend of the averages for
Production_Budget and the averages for Adjusted_Production_Budget from
2012 to 2022. Compare the trend obtained after adjusting for inflation with the
trend obtained without adjusting for inflation.

Generalisability

g) Suppose that we wished to investigate trends and patterns in the movie industry, and
we were to do an analysis using this sample data to help us understand those trends
and patterns. To what extent can any findings obtained using this data be
generalised to the movie industry? Base your answer on the generalisability criteria
that you have learnt as well as the information given on the website.

Case Study 2: New learning software and Mathematics Enrichment

Consider the following hypothetical scenario: A tuition centre has just developed online
learning software called ‘Mathemagic’ that all students can download onto their mobile
devices. Mathemagic is designed for Primary 1 to Primary 6 students studying Mathematics.
Imagine you are an intern working in this tuition centre, and before officially rolling out this
software, your Chief Executive Officer (CEO) wants to know whether teaching via
‘Mathemagic’ actually improves Mathematics performance when compared to the current
practice (which involves no usage of any software). If it does improve performance, then the
software will be rolled out to all the teachers who are teaching Primary School Mathematics
at the tuition centre. Your employer gives you authority and resources to design a study to
come to a conclusion. You have also been granted permission to use the tuition centre’s
students as study subjects. You are also allowed to make the following assumptions for
simplicity:

- There are 6000 Primary School students in the tuition centre – 1000 students in each
level (Primary 1 – 1000 students, Primary 2 – 1000 students, ...).

- The teachers working at the tuition centre have no preference in terms of teaching
using the software versus not using it.

- All teachers are proficient in the usage of the software if they need to use it to teach.

a) Due to the large student base, you feel that it might be a more prudent approach to
select a sample, instead of doing the study on all 6000 students. Knowing that you
are well-versed in Quantitative Reasoning, the CEO agrees with you, and suggests
that you should roll a fair die of 6 sides, and the number that appears (for example,
5), will be the level of students (Primary 5) you would conduct the study on. Do you
agree with her? If so, provide supporting reasons why the approach is appropriate. If
not, provide an alternative approach to the sampling process.

b) What design of study would be suitable to determine the effectiveness of
Mathemagic in improving students’ proficiency in Mathematics? Based on your
proposed study design, do you think that you should measure Mathematics
proficiency twice (before and after the intervention), or will it be sufficient to just
measure once (after the intervention)?

c) Based on your choice in (b), give some details on how you can proceed to set up the
study. Your answer should clearly state the following:
• The rationale for choosing the study design,
• The measurement of the variable that will help us to determine if the usage
of Mathemagic is better than not using it,
• How the assignment can be done,
• Whether blinding would be possible, and
• Some limitations and difficulties that may be encountered in the process of
conducting the study.
• Whether, in light of the above, the study is feasible.

Remarks: For part c), we do not expect you to give an answer to the level of an expert
working in the company as such people would have significantly more technical knowledge
and experience. However, you can work towards giving an answer that has some basic ideas
in place and is reasonable from an implementation point of view.

Appendix: How to Plot a Line Graph using Radiant (Case Study 1, Question f)

Before we begin the data visualisation process, it is helpful to know that Radiant treats
categorical variables that are represented by numbers as numerical variables, which is
incorrect. Therefore, before doing any visualisation of such variables, we first convert them to
the correct variable type.

Changing Numerical to Categorical Variables

- Click on the “Transform” Tab.
- Under “Select variables”, select the relevant categorical variables that have been labelled as “numeric”.
- Under “Transformation type”, select “Change type”.
- Under “Change variable type”, select “As factor”.
- Store the data and rename it (for example, as “Movies1”).

Plotting a Line Graph

1. Under “Plot-type”, select “Line”.
2. Under “Y-variable”, select “Production_Budget (numeric)”.
3. Under “X-variable”, select “Release_Year (factor)”.
4. Under “Function”, select “mean”.
5. Click “Create plot”. You should obtain the line graph as shown in the diagram above.
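
For readers who want to check the plotted values outside Radiant, the per-year averages behind such a line graph can be computed as in the Python sketch below. This is not part of the Radiant workflow: the function name `mean_by_year` and the three rows of data are hypothetical, and only the column names come from the data set.

```python
from collections import defaultdict
from statistics import mean


def mean_by_year(rows, value_key, year_key="Release_Year"):
    """Group rows by year and return the mean of value_key for each year."""
    groups = defaultdict(list)
    for row in rows:
        groups[row[year_key]].append(row[value_key])
    return {year: mean(values) for year, values in sorted(groups.items())}


# Hypothetical rows, standing in for the real data.
rows = [
    {"Release_Year": 2012, "Production_Budget": 100},
    {"Release_Year": 2012, "Production_Budget": 300},
    {"Release_Year": 2013, "Production_Budget": 50},
]

averages = mean_by_year(rows, "Production_Budget")
print(averages)  # {2012: 200, 2013: 50}
```

Each key-value pair corresponds to one point on the line graph: the year on the horizontal axis and the mean budget on the vertical axis.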
