Tutorial 1

Uploaded by

lijiayi

GEA1000 QUANTITATIVE REASONING WITH DATA

TUTORIAL 1
Please work on the problems before coming to class. In class, you will engage in group work.

In this tutorial, we will be looking at two case studies:

• Exploring box office records for movies, and
• A study on the use of new learning software in an enrichment centre.

Case Study 1: Exploring Box office records for movies.

Refer to the CSV file “Movies.csv”. Below is a brief description of the variables that
may not be common knowledge. The other variables in the data set, which are not listed
below, are self-explanatory.

| Variable | Description |
| --- | --- |
| Production_Budget | The cost involved in making the movie. |
| Worldwide_Gross | The revenue generated from screening the movie in public theatres around the world. |
| CPI | The Consumer Price Index (CPI) is an important economic metric used to measure the change in general prices of goods and services (otherwise known as inflation) in a country. Very often, this index is derived by considering a weighted average of the prices of goods and services consumed in the country. The weight assigned to each good or service is determined by the monetary authority of the country but is very often linked to the consumption patterns of the average individual in the country. |
| MPAA_Rating | The Motion Picture Association of America film rating system is used in the United States and its territories to rate a motion picture’s suitability for certain audiences based on its content. If a film has not been submitted for a rating, or is an uncut version of a film that was submitted, the label Not Rated (NR) is used. |
| IMDb_Rating | IMDb is an abbreviation for Internet Movie Database, which is an online database of information related to films. IMDb offers a rating scale that allows users to rate films from one to ten based on how they feel about the film, with a low score regarded as an unfavourable view and a high score regarded as a favourable one. The submitted ratings are filtered and weighted in various ways (partly depending on the stature of the person rating the movie) to produce a “weighted mean” that is displayed for each film. IMDb keeps the formula for the weighted mean confidential. |
| Voter_Numbers | The number of voters who rated the movie on IMDb, whose votes eventually resulted in the IMDb rating. |

Sampling
The data was obtained on 14th August 2023 from the following website, which listed 6420
movies:

https://www.the-numbers.com/movie/budgets/all

The sample was constructed via the following procedure. Information on all 6420 movies was
downloaded. Each movie was assigned a distinct random number between 1 and 6420 (no
number was repeated). The movies labelled “1” to “1091” were then selected, and the
information provided on the website was merged with other publicly available information
pertaining to them to construct the data set.
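
The labelling-and-selection procedure described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function name `select_sample`, the use of list positions to stand for movies, and the `seed` parameter are hypothetical and are not part of how the original data set was built.

```python
import random

def select_sample(n_population=6420, n_sample=1091, seed=None):
    """Give each movie a distinct random label, then keep labels 1..n_sample."""
    rng = random.Random(seed)
    labels = list(range(1, n_population + 1))
    rng.shuffle(labels)  # labels[i] is the random number assigned to movie i
    # Keep the positions of the movies whose label is between 1 and n_sample.
    return [i for i, label in enumerate(labels) if label <= n_sample]

chosen = select_sample(seed=0)
print(len(chosen))  # number of movies selected
```

Because the labels are a random rearrangement of 1 to 6420 with no repeats, every movie on the website's list has the same chance of receiving a label between 1 and 1091.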

a) State the sampling frame and determine what type of sampling was done here.

Types of Variables

b) For all the variables, except IMDb Rating, determine which are numerical and which
are categorical. Were there instances where it was unclear whether to classify the
variable as numerical or categorical?

Data Cleaning
When working with data, we are not usually fortunate enough to be handling data that is
pristine and requires no form of cleaning/tidying. In this part we will highlight two
manifestations of potentially dirty data and how we can go about dealing with them.

Manifestation 1: Missing values

c) (i) Identify the variables for which there are missing values.

(ii) Name the variables for which the number of missing values is relatively small
(i.e., greater than 0 and less than 30). Given that the number of missing values
for these variables is small, suggest what can be done for this form of dirty
data.

(iii) Name the variables for which the number of missing values is very large (i.e.,
more than 100). If you were to try and implement your suggestion(s) in (ii), to
what extent is it feasible here?
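
For readers who prefer a script to a spreadsheet, one way to count the blank cells per variable is sketched below. It assumes that a missing value appears as an empty cell; the three-row extract and its values are made up for illustration and are not taken from Movies.csv.

```python
import csv
import io

# Hypothetical extract; the real Movies.csv has more columns and rows.
SAMPLE = """Movie,Release_Year,CPI,IMDb_Rating
A,2015,237,7.8
B,,,6.1
C,2019,250,
"""

def count_missing(csv_file):
    """Return {column name: number of blank cells} for an open CSV file."""
    reader = csv.DictReader(csv_file)
    missing = {name: 0 for name in reader.fieldnames}
    for row in reader:
        for column, value in row.items():
            # Treat an empty or whitespace-only cell as missing (an assumption).
            if value is None or value.strip() == "":
                missing[column] += 1
    return missing

missing_counts = count_missing(io.StringIO(SAMPLE))
print(missing_counts)
```

To run the same count on the actual file, `open("Movies.csv", newline="")` could be passed in place of the in-memory sample. If the file marks missing values in another way (for example with "NA"), the blank-cell check would need to be changed accordingly.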

Manifestation 2: Unusual values

For numerical variables in a data set, it is common practice to report, in addition to
the mean and standard deviation, what is known as the “Five-Number Summary”,
which consists of
- Minimum
- Q1
- Median
- Q3
- Maximum

Whilst the Five-Number summary, together with the mean and standard deviation
(SD), are used to provide quantitative information about groups of data, sometimes
they can also be used to identify sources of unusual data values.
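
As a rough illustration, the summary statistics named above can be computed with Python's standard `statistics` module. The function name `summarise` and the five gross figures are made up for this sketch. Quartile conventions differ between software packages, so the Q1 and Q3 values produced here (using the "inclusive" method) may not match those reported by Radiant or a spreadsheet.

```python
import statistics

def summarise(values):
    """Return the five-number summary plus the mean and sample SD."""
    q1, median, q3 = statistics.quantiles(values, n=4, method="inclusive")
    return {
        "min": min(values),
        "q1": q1,
        "median": median,
        "q3": q3,
        "max": max(values),
        "mean": statistics.mean(values),
        "sd": statistics.stdev(values),  # sample SD (divides by n - 1)
    }

# Made-up gross figures in $ millions; not values from Movies.csv.
summary = summarise([0, 12, 45, 80, 2064])
print(summary)
```

In this made-up example the mean is far larger than the median because of one very large value, and the minimum is zero. Patterns of this kind in a summary are what prompt a closer look at individual data points.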

d) Give the Five-Number Summary, together with the mean and SD, for the variable
Worldwide_Gross. State whether there are any anomalies in the summary statistics.
Explain what could give rise to such an anomaly and, hence, the
circumstances under which it would be justifiable to remove the anomalous data
points. (Hint: Refer to the description of Worldwide_Gross given in the table.)

Data Visualisation
While we now have some idea that the data cleaning process can be an extensive one, for the
rest of this question, you may ignore the movies where Release_Year or CPI are blank.

Inflation is the rate of increase in prices over a given period. Let us adjust the Production
Budget with reference to 2022’s CPI, which is 294.4. For example, Star Wars Ep. VII: The
Force Awakens (the first movie in the data set) has a production budget of $306,000,000 and
was released in 2015. The CPI for 2015 is 237. This means that the “equivalent” production
budget for 2022 is estimated to be

\[ \frac{294.4}{237} \times \$306{,}000{,}000 \approx \$380{,}111{,}392 \]

(to the nearest whole number).
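
The same calculation can be written as a small Python function, which makes it easy to repeat for every movie. The function and parameter names are hypothetical; only the reference CPI of 294.4 and the Star Wars figures come from the tutorial.

```python
CPI_2022 = 294.4  # reference CPI stated in the tutorial

def adjust_to_2022(amount, cpi_release_year, cpi_reference=CPI_2022):
    """Scale a dollar amount from its release-year price level to the reference year."""
    return amount * cpi_reference / cpi_release_year

# Worked example from the text: budget of $306,000,000, released in 2015 (CPI 237).
adjusted = adjust_to_2022(306_000_000, 237)
print(round(adjusted))
```

A movie released in 2022 has a ratio of 294.4 / 294.4 = 1, so its adjusted budget equals its original budget.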

e) Create a new variable called Adjusted_Production_Budget where all
production budgets are valued with reference to the 2022 CPI.

f) Use a suitable visualisation to depict the trend of the averages for
Production_Budget and the averages for Adjusted_Production_Budget between the
years 2012 – 2022. Compare the trends obtained between adjusting for inflation and
not adjusting for inflation.
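
The yearly averages that such a visualisation would plot can be computed as in the sketch below. It is a minimal illustration, assuming each record is a (year, budget, CPI) tuple; the three rows and their CPI values are made up and are not taken from Movies.csv.

```python
from collections import defaultdict
from statistics import mean

CPI_2022 = 294.4  # reference CPI stated in the tutorial

# Made-up rows: (release year, production budget, CPI of that year).
rows = [
    (2012, 100.0, 230.0),
    (2012, 50.0, 230.0),
    (2022, 120.0, 294.4),
]

def yearly_means(rows, first_year=2012, last_year=2022):
    """Return {year: (mean budget, mean CPI-adjusted budget)} for the chosen years."""
    raw, adjusted = defaultdict(list), defaultdict(list)
    for year, budget, cpi in rows:
        if first_year <= year <= last_year:
            raw[year].append(budget)
            adjusted[year].append(budget * CPI_2022 / cpi)
    return {year: (mean(raw[year]), mean(adjusted[year])) for year in sorted(raw)}

trend = yearly_means(rows)
for year, (avg_raw, avg_adjusted) in trend.items():
    print(year, round(avg_raw, 1), round(avg_adjusted, 1))
```

Plotting the two averages against the year on the same axes gives the two lines to be compared.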

Generalisability

g) Suppose that we wished to investigate trends and patterns in the movie industry, and
we were to do an analysis using this sample data to help us understand those trends
and patterns. Then to what extent can any findings obtained using this data be
generalisable to the movie industry? Base your answer on the generalisability criteria
that you have learnt as well as the information given on the website.

Case Study 2: New learning software and Mathematics Enrichment

Consider the following hypothetical scenario: A tuition centre has just developed an online
learning software called ‘Mathemagic’, that all students can download onto their mobile
devices. Mathemagic is designed for Primary 1 to Primary 6 students studying Mathematics.
Imagine you are an intern working in this tuition centre, and before officially rolling out this
software, your Chief Executive Officer (CEO) wants to know whether teaching via
‘Mathemagic’ actually improves Mathematics performance when compared to the current
practice (which involves no usage of any software). If it does improve performance, then the
software will be rolled out to all the teachers who are teaching Primary School Mathematics
at the tuition centre. Your employer gives you authority and resources to design a study to
come to a conclusion. You have also been granted permission to use the tuition centre’s
students as study subjects. You are also allowed to make the following assumptions for
simplicity:

- There are 6000 Primary School students in the tuition centre – 1000 students in each
level (Primary 1 – 1000 students, Primary 2 – 1000 students, ...).

- The teachers working at the tuition centre have no preference in terms of teaching
using the software versus not using it.

- All teachers are proficient in the usage of the software if they need to use it to teach.

a) Due to the large student base, you feel that it might be a more prudent approach to
select a sample, instead of doing the study on all 6000 students. Knowing that you
are well-versed in Quantitative Reasoning, the CEO agrees with you, and suggests
that you should roll a fair die of 6 sides, and the number that appears (for example,
5), will be the level of students (Primary 5) you would conduct the study on. Do you
agree with her? If so, provide supporting reasons why the approach is appropriate. If
not, provide an alternative approach to the sampling process.
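
If you decide to propose an alternative, one possibility (among several) is to draw students at random from every level instead of studying a single level. The sketch below illustrates that idea only. The function name, the use of student numbers 1 to 1000 within each level, and the choice of 50 students per level are all assumptions made for illustration.

```python
import random

def sample_per_level(levels=6, students_per_level=1000, n_per_level=50, seed=None):
    """Draw n_per_level students at random from each level; returns {level: [ids]}."""
    rng = random.Random(seed)
    return {
        level: sorted(rng.sample(range(1, students_per_level + 1), n_per_level))
        for level in range(1, levels + 1)
    }

sample = sample_per_level(seed=1)
print({level: len(ids) for level, ids in sample.items()})
```

With this approach every level is represented in the sample, whereas the die roll leaves five of the six levels out entirely.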

b) What design of study would be suitable to determine the effectiveness of
Mathemagic in improving students’ proficiency in Mathematics? Based on your
proposed study design, do you think that you should measure Mathematics
proficiency twice (before and after the intervention), or will it be sufficient to just
measure once (after the intervention)?

c) Based on your choice in (b), give some details on how you can proceed to set up the
study. Your answer should clearly state the following:
• The rationale for choosing the study design,
• The measurement of the variable that will help us to determine if the usage
of Mathemagic is better than not using it,
• How the assignment can be done,
• Whether blinding would be possible, and
• Some limitations and difficulties that may be encountered in the process of
conducting the study.
• Consequently, is the study feasible?

Remarks: For part c), we do not expect you to give an answer to the level of an expert
working in the company as such people would have significantly more technical knowledge
and experience. However, you can work towards giving an answer that has some basic ideas
in place and is reasonable from an implementation point of view.
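
If your design involves assigning students to groups by chance, the assignment step could be carried out along the lines of the sketch below. It is one possible approach, not a model answer. The group names, the 300 student numbers, and the equal split are assumptions made for illustration.

```python
import random

def assign_groups(student_ids, seed=None):
    """Randomly split students into two groups of (nearly) equal size."""
    rng = random.Random(seed)
    shuffled = list(student_ids)
    rng.shuffle(shuffled)
    half = len(shuffled) // 2
    # If the number of students is odd, the extra student goes to the control group.
    return {"mathemagic": shuffled[:half], "control": shuffled[half:]}

groups = assign_groups(range(1, 301), seed=42)
print(len(groups["mathemagic"]), len(groups["control"]))
```

Because the shuffle decides who goes where, neither the students nor the teachers choose the group, which is the point of assigning by chance.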

Appendix: How to Plot a Line Graph using Radiant (Case Study 1, Question f)

Before we begin the data visualisation process, it would be helpful to know that for categorical
variables that are represented by numbers, Radiant treats them as numerical variables, which should
not be the case. Therefore, before doing any visualisation of such variables, we first convert them to
the correct variable type.

Changing Numerical to Categorical Variables

- Click on the “Transform” Tab.

- Under “Select variables”, select the relevant categorical variables that have been labelled as “numeric”.
- Under “Transformation type”, select “Change type”.
- Under “Change variable type”, select “As factor”.
- Store the data and rename it (for example, as “Movies1”).

Plotting a Line Graph

1. Under “Plot-type”, select “Line”.

2. Under “Y-variable”, select “Production_Budget (numeric)”.
3. Under “X-variable”, select “Release_Year (factor)”.
4. Under “Function”, select “mean”.
5. Click “Create plot”. You should obtain the line graph as shown in the diagram above.
