Lecture 1:
Introducing CS5702 - Modern Data
Prof. Martin Shepperd
[email protected]
CS5702 > Lecture 1: Introduction 1
Lecture Poll
Please use your mobile or computer to go to:
pollev.com/mshepperd
We will need it in a few minutes. Thanks!
Welcome
— Continuing challenges of COVID-19, BUT
— Great time to study data science and machine learning
Agenda
1. Module overview
2. Teaching approach and resources
3. What is R and why do data scientists use it?
4. R basics
5. Getting help
6. Week 1 goals
7. Extension questions and study
1. Module overview
Meet the team
— Professor Martin Shepperd (module leader)
— Professor Xiaohui (Hui) Liu (support lecturer)
— Tasin Islam (GTA)
— Ziyan Fu (GTA)
— Yu Cao (GTA)
Virtual/physical lecture protocol
1. I will start at 11:05 prompt; please be ready
2. Be aware, lectures (and seminars) will be recorded
3. Please mute your microphone
4. If you have a question can you use the chat facility or ...
5. ... ask during a question gap (2-3 per lecture)
6. Thumbs-up questions that are important to you
7. Please consider wearing a mask
What is data?
Wisdom of crowds
What's Modern Data about?
To provide an introduction to data management and
exploration. ... appreciation of the richness and
availability of different data sources ... techniques,
methods and processes for modern data analysis.
— Study Guide
Data science is ...
Wisdom of crowds
Data science is ...
Data science is an exciting discipline that allows you to
turn raw data into understanding, insight, and
knowledge.
— Hadley Wickham and Garrett Grolemund
Opportunities
... there is more data and richer data available than ever
before, coupled with more and more powerful analysis
tools.
... incredible opportunities to collect, clean, merge,
analyse and visualise data both for good, and for less
good purposes.
Course structure
Module structure
Week structure
2. Teaching approach and resources
— Practical
— Lecture, seminar and lab
— Use R and RStudio
— Being organised
— Decide on your personal to-do / checklist system now!
— Keep up to date
— Weekly checklists (as text files) can be grabbed from
here
Learning Resources:
— Blackboard VLE
— The "Modern Data" interactive book
— Worksheets and linked files from GitHub
— Quizzes
— Reading list and references (also your own research)
3. What is R and why do data scientists use it?
R is an open, purpose-designed, highly-extensible,
statistical and data analysis programming language.
R Advantages
— designed by statisticians
— powerful data handling, wrangling and storage capabilities
— flexible graphical facilities
— integrates with machine learning e.g., TensorFlow etc
— interactive dashboards
— large, open community
— easy integration with e.g., C, C++, FORTRAN
— widely used by researchers
Source: www.alfredogmarquez.com/2019/12/29/r-s-demise-higly-overblown/
4. R basics
As a prerequisite you should have completed the
Getting Ready Chapter, in particular to have installed R
and RStudio and run some simple R test examples.
R and variables
A variable is a named container for information and this
information can be set, modified or referenced.
# This R code creates three different variables
numericVariable <- 10
stringVariable <- "Hello world!"
logicVariable <- TRUE
R infers the data type from what you assign, so you do not declare it (dynamic typing).
Converting a value from one type to another is called coercion.
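As a minimal sketch, class() reports the type R has inferred for a variable, and the as.*() functions coerce a value from one type to another. The first three variables are the ones from the slide; the remaining names are illustrative only.

```r
# Variables from the slide
numericVariable <- 10
stringVariable <- "Hello world!"
logicVariable <- TRUE

# class() reports the type R inferred from each assignment
class(numericVariable)   # "numeric"
class(stringVariable)    # "character"
class(logicVariable)     # "logical"

# Explicit coercion converts a value from one type to another
numberFromText <- as.numeric("42")      # character -> numeric
textFromNumber <- as.character(10)      # numeric -> character
numberFromLogical <- as.numeric(TRUE)   # logical -> numeric (TRUE becomes 1)
```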
Data types
The data type is an attribute of a variable which tells the
R interpreter how we intend to use the data.
— defines the operations that can be done
— the meaning of the data
— limits to values that can be stored, e.g., if the type is
logical, only TRUE and FALSE
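The points above can be illustrated with a short sketch: the same operation succeeds or fails depending on the type of the value. The variable names are illustrative, and tryCatch() is used here only so that the error is captured rather than stopping the script.

```r
x <- 10      # numeric
s <- "10"    # character, even though it looks like a number

x + 1        # arithmetic is defined for numeric values

# Arithmetic on a character value is an error; tryCatch() captures the
# error message as a string instead of halting
result <- tryCatch(s + 1, error = function(e) conditionMessage(e))
print(result)

# After converting to numeric the operation is allowed
converted <- as.numeric(s) + 1

# A logical can only hold TRUE or FALSE (or NA for missing), so text that
# is not a recognised truth value becomes NA
notLogical <- as.logical("maybe")
```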
Simple data types in R
— numeric or floating point
— character or character string (if 2+ in length)
— logical (TRUE or FALSE)
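A small sketch of the three simple types, using made-up values. Note that R uses the same character type for a single letter and for a longer string: each is one value, and nchar() counts the characters inside it.

```r
height <- 1.75      # numeric, stored as a double (floating point)
initial <- "M"      # character: a single letter
name <- "Martin"    # character: a longer string
enrolled <- TRUE    # logical

typeof(height)      # "double"
class(initial)      # "character"
class(name)         # "character"
class(enrolled)     # "logical"

# A string is a single value, however many characters it contains
length(name)        # 1
nchar(name)         # 6
```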
Manipulating variables
> # Initialise (or overwrite if it already exists) y to 5.3
> y <- 5.3
> # Multiply y by 13
> y <- y * 13
> # Display y
> y
[1] 68.9
> # Alternatively, you can use the print() function
> print(y)
[1] 68.9
Useful complex data types
— vector: multiple instances of the same type (see the Modern Data book)
— data frame: multiple instances of different types (see the Modern Data book)
Creating and using vectors
— So far we have mainly focused on atomic variables.
— Often useful to store/analyse multiple instances e.g.,
the height of all the people in a sample.
— Can have a vector of the same type of atomic
variables
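To illustrate the "same type" rule, here is a sketch with invented values: when the elements passed to c() have different types, R coerces them all to one common type rather than storing a mixture.

```r
# All elements numeric: the vector is numeric
heights <- c(168, 176, 170)
class(heights)    # "numeric"

# Mixed elements: R coerces every element to one common type,
# here character, because text cannot be represented as a number
mixed <- c(168, "tall", TRUE)
class(mixed)      # "character"
mixed             # "168"  "tall" "TRUE"
```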
A simple vector in R
# The c() function combines elements into a vector
> sample.heights <- c(168, 176, 170)
> is.vector(sample.heights)
[1] TRUE
> sample.heights[2]
[1] 176
> sample.heights
[1] 168 176 170
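Building on the vector above, this sketch shows a few common operations that act on the whole vector at once. It assumes the heights are in centimetres, which the slide does not state; heights.m and tall are illustrative names.

```r
sample.heights <- c(168, 176, 170)

length(sample.heights)    # number of elements: 3
mean(sample.heights)      # average of all elements

# Arithmetic is applied to every element (assumed cm converted to metres)
heights.m <- sample.heights / 100    # 1.68 1.76 1.70

# A logical condition inside [] selects the matching elements
tall <- sample.heights[sample.heights > 169]    # 176 170

# An individual element can be overwritten by position
sample.heights[2] <- 177
sample.heights            # 168 177 170
```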
A data frame in R
— Is a 2-dimensional structure
— A workhorse for the data analyst
— Multiple data types e.g., numeric and character
— Sometimes referred to as 'rectangular' data because every row has the same length (a data frame is a special case of a list)
— Similar(ish) to a spreadsheet
A simple data frame in R
> name <- c('Suzie Smith','Ahmed Abbas','Atilla The Hun')
> salary <- c(21000, 23400, 26800)
> employ.data <- data.frame(name, salary)
> # Show the first rows of the data frame (head() shows up to 6)
> head(employ.data)
            name salary
1    Suzie Smith  21000
2    Ahmed Abbas  23400
3 Atilla The Hun  26800
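Once the data frame exists, individual columns and rows can be extracted and new columns added. The following is a sketch only: the salary threshold and the 5% bonus column are hypothetical, chosen purely for illustration.

```r
name <- c('Suzie Smith', 'Ahmed Abbas', 'Atilla The Hun')
salary <- c(21000, 23400, 26800)
employ.data <- data.frame(name, salary)

# $ extracts one column as a vector
employ.data$salary
mean(employ.data$salary)

# [row, column] indexing; an empty column position means "all columns"
employ.data[2, ]

# Keep only the rows that satisfy a condition (hypothetical threshold)
high.paid <- employ.data[employ.data$salary > 22000, ]

# Add a new column computed from an existing one (hypothetical 5% bonus)
employ.data$bonus <- employ.data$salary * 0.05
```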
The View() function
# Note View has an upper case 'V'
> View(employ.data)
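View() opens a spreadsheet-style viewer, so it needs an interactive session such as RStudio. As a sketch of console-based alternatives, the standard functions below summarise the same data frame as text.

```r
name <- c('Suzie Smith', 'Ahmed Abbas', 'Atilla The Hun')
salary <- c(21000, 23400, 26800)
employ.data <- data.frame(name, salary)

str(employ.data)        # compact structure: each column with its type
summary(employ.data)    # per-column summary (min, median, mean, ...)

frame.dim <- dim(employ.data)      # rows then columns: 3 2
frame.names <- names(employ.data)  # "name" "salary"
```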
For more details on R basics
— Modern Data book
— also Kabacoff [1]
— the lab worksheets
[1] Kabacoff, R. (2015). R in Action: Data Analysis and Graphics With R (2nd ed.). Manning Publications.
5. Getting help ...
1. find/read the relevant cheatsheet
2. perspiration e.g., see this five step approach
3. talk it over with a fellow student
4. module FAQs on Blackboard
5. Stack Overflow
6. ask a member of the course team
For more suggestions visit the subsection 0.2 "vi) Learn how to
get help" in the Modern Data book.
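In addition to the steps above, R has built-in help that can be reached from the console. This is a brief sketch of standard functions; mean is used only as an example topic. The calls that open help pages are wrapped in interactive() so the snippet also runs safely as a script.

```r
if (interactive()) {
  help("mean")       # open the help page for mean(); ?mean is shorthand
  example("mean")    # run the examples from that help page
}

# apropos() lists the names of available objects that match a pattern,
# which is useful when you only remember part of a function name
matches <- apropos("mean")
head(matches)
```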
6. Week 1 goals
By the end of this week you should:
[ ] have completed the Week 0 Getting Ready chapter
[ ] have completed the Week 1 Introduction chapter
[ ] have an appreciation of the background of R
[ ] understand the main features of the R ecosystem
[ ] be able to write, execute and save simple R programs
[ ] be able to manipulate basic variables including vectors and data frames
[ ] be able to write, execute, save and load code using RStudio
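In RStudio you would normally save and open scripts from the editor. As a rough sketch of the same save, load and execute cycle done from code, the example below uses temporary files; the file paths, the week1.total variable and the scores data are all hypothetical.

```r
# Hypothetical file locations; tempfile() keeps the example self-contained
script.path <- tempfile(fileext = ".R")
data.path <- tempfile(fileext = ".csv")

# Save a one-line R script to disk, then execute it with source()
writeLines("week1.total <- 2 + 3", script.path)
source(script.path)
week1.total    # 5

# Save a data frame as a CSV file and load it back
scores <- data.frame(student = c("A", "B"), mark = c(62, 71))
write.csv(scores, data.path, row.names = FALSE)
loaded <- read.csv(data.path)
loaded
```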
7. Extension activity
Read Provost and Fawcett [2] (it's only 8 pages) and determine what the fundamental concepts of data science are. Then rank them in order of importance.
[2] Provost, F., & Fawcett, T. (2013). Data science and its relationship to big data and data-driven decision making. Big Data, 1(1), 51-59. Access via Google Scholar
References
— Kabacoff, R. (2015). R in Action: Data Analysis and Graphics With R (2nd ed.). Manning Publications.
— Provost, F., & Fawcett, T. (2013). Data science and its relationship to big data and data-driven decision making. Big Data, 1(1), 51-59.
— Wickham, H., & Grolemund, G. (2018). R for data science. O'Reilly Media, Inc.