Part X
Code Basics &
Data Manipulation with R
Literature:
Wickham & Grolemund: R for Data Science, ch. 3, 16
Code basics
Very basics:
• The simple calculator: 5 / 200 * 30
• Type pi and press enter
Functions:
• Functions are written as function(<value>)
• Call a built-in function: sin(pi/2) or log(2.718)
Objects:
• Create objects with x <- 1 and watch your global environment window
• Inspect an object by typing its name, e.g. x
• Create a vector with v <- c(1, 5)
• Create a vector with v2 <- c('r', 'f')
• Inspect the vectors
Lectures in Data Mining Winter 2018 44
Code basics
• Create a vector with v2 <- c('r', 'f')
• Create a vector with v <- c(1, 5)
• Create a vector v <- seq(1, 5)
• Create a vector v2 <- seq(1, 10, length.out = 5)
• Inspect the vector
• Try (v <- seq(1, 10, length.out = 5)) … What happens?
• Inspect the third element of vector v: v[3]
Code basics
- useful functions -
Compute a lagged version of a time series, shifting the time base back by
a given number of observations:
lag() and lead()
(x <- seq(1, 10))
lag(x)
Drawbacks:
• Requires exactly one sorting variable or an already sorted data set.
• Does not support partitioning.
Code basics
- remember tibbles -
tibble(seq(1, 10), seq(10, 1))
tibble(a = seq(1, 10), b = a * a, c = b - lag(b))
Cumulative and rolling aggregates: cumsum(), cumprod(), cummin(),
cummax(), and cummean(); see the RcppRoll package for rolling windows.
tibble(a = sample(1:10, 20, replace = TRUE), b = cumsum(a))
Rankings: see min_rank(), dense_rank(), percent_rank()
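To contrast cumulative and rolling aggregates, here is a minimal sketch assuming the RcppRoll package is installed (roll_mean() is its rolling-mean function):

```r
library(dplyr)     # cummean(), tibble()
library(RcppRoll)  # roll_mean()

# The cumulative mean uses all values seen so far;
# the rolling mean only looks at the last n values.
tibble(a = 1:10,
       cum_mean  = cummean(a),
       roll_mean = roll_mean(a, n = 3, fill = NA, align = "right"))
```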
Code basics
- tibbles and cross table -
Columns in tables can be addressed by using <TABLENAME>$<COLUMN>
Try:
tibble(mpg$manufacturer, mpg$drv)
tibble(a = sample(1:10, 20, replace = TRUE), b = cumsum(a))$b
What is a cross table?
A cross table is a two-way table consisting of columns and rows. It is also
known as a pivot table or a multi-dimensional table. Its greatest strength
is its ability to structure, summarize and display large amounts of data.
Cross tables can also be used to determine whether there is a relation
between the row variable and the column variable or not.
Code basics
- cross table -
table() uses the cross-classifying factors to build a contingency table of
the counts at each combination of factors.
Try:
table(mpg$manufacturer)
table(mpg$manufacturer, mpg$drv)
table(mpg$manufacturer, mpg$drv, mpg$cyl)
Exercise:
table(mpg$manufacturer, front = mpg$drv == 'f')
Try out and explain!
Data manipulation in R
with sqldf (SELECT)
• Install the package and load the library with:
install.packages("sqldf")
library(sqldf)
• Remember: Install a package once but load the library
each time you restart R.
• Test:
sqldf("select manufacturer, cyl, hwy from mpg limit 30")
• The RODBC library allows connecting an Oracle DB to R.
The dplyr package
• Great for data exploration and transformation
• Intuitive to write and easy to read, especially when using the
“pipelining” syntax (covered below)
• Fast on data frames
dplyr functionality
• Five basic verbs: filter, select, arrange, mutate, summarize (plus
group_by)
• Can work with data stored in databases and data tables
• Joins: inner join, left join, semi-join, anti-join
• Window functions for calculating ranking, offsets, and more
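As a sketch, the five verbs chain together naturally; the column names below come from ggplot2's mpg data set, and the km-per-litre conversion factor is only illustrative:

```r
library(dplyr)
library(ggplot2)  # provides the mpg data set

mpg %>%
  filter(year == 2008) %>%              # keep rows matching a condition
  select(manufacturer, class, hwy) %>%  # keep only some columns
  mutate(hwy_kml = hwy * 0.425) %>%     # derive a new column (mpg -> km/l)
  group_by(class) %>%                   # plus group_by ...
  summarize(mean_hwy = mean(hwy)) %>%   # ... collapse each group to one row
  arrange(desc(mean_hwy))               # sort the result
```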
The dplyr package
- filter -
All dplyr functions share a similar syntax:
• The first argument is the data set
• Subsequent arguments describe what to do with the data set
• Functions return modified data sets
General syntax for filter():
filter( data = #DATA, <CONDITION>, <MORE_CONDITIONS>)
filter(mpg, cyl == 6)
filter(mpg, cyl == 6, drv == 'r')
The dplyr package
- filter -
• filter() uses conditions:
– Equals (==)
– Larger (>) and larger or equal (>=)
– Smaller (<) and smaller or equal (<=)
• Comparisons with NA always return NA (neither FALSE nor TRUE)
– Try NA == NA
– Try is.na(NA)
• Conditions may use logical operators and brackets
– And: & , Or: |, Not: !
– In : %in% <VECTOR>
• Try using condition & instead of ,
• Replace & by | (or) and look what happens.
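For instance, %in% is a compact replacement for a chain of Or-conditions; a small sketch on mpg:

```r
library(dplyr)
library(ggplot2)  # mpg

# Equivalent filters: Or-chain vs. %in%
filter(mpg, manufacturer == "audi" | manufacturer == "honda")
filter(mpg, manufacturer %in% c("audi", "honda"))

# Brackets and Not: 6-cylinder cars that are not front-wheel drive
filter(mpg, cyl == 6 & !(drv == "f"))
```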
The dplyr package
- arrange -
• arrange() sorts the data set by one or more attributes.
• General syntax for arrange():
arrange( data = #DATA, <ATTRIBUTE>, <MORE_ATTRIBUTES>)
• Use the function desc(<ATTRIBUTE>) to sort in descending order;
ascending order is the default.
arrange(mpg, hwy)
arrange(mpg, desc(hwy))
The dplyr package
- select -
• select() takes a subset of a data set by specifying the attributes
(columns) to keep.
• General syntax of select():
select( data = #DATA, <ATTRIBUTE1>, <ATTRIBUTES2>, …)
• To find columns in tables with many attributes, use the helper functions
starts_with(), contains(), or ends_with() to define the subset.
select(mpg, 1, 2, 9)
select(mpg, manufacturer, model, hwy)
select(mpg, "manufacturer", "model", "hwy")
select(mpg, starts_with("m"), hwy)
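contains() and ends_with() work analogously to starts_with() above; a short sketch, with the matched mpg columns noted in the comments:

```r
library(dplyr)
library(ggplot2)  # mpg

select(mpg, contains("an"))  # manufacturer, trans
select(mpg, ends_with("l"))  # model, displ, cyl, fl
```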
The dplyr package
- select -
• select() takes a subset of a data set by specifying the attributes
(columns) to keep.
• General syntax of select():
select( data = #DATA, <ATTRIBUTE_SET> …)
• To select several consecutive columns at once, use the colon in the
select() call to define the subset.
select(mpg, 1 : 4)
select(mpg, manufacturer : year)
The dplyr package
- select -
• select() also allows aliasing attributes by specifying new attribute names.
• General syntax of select():
select( data = #DATA, <NEW_NAME> = <ATTRIBUTE1>, …)
• To keep the remaining columns, use the function everything().
• To remove a column, use minus (-<ATTRIBUTE>)
Try the following statements and look what happens:
select(mpg, maker = manufacturer, model, hwy)
select(mpg, maker = manufacturer, everything())
select(mpg, maker = manufacturer, everything(), category = class)
select(mpg, maker = manufacturer, everything(), - cyl)
The dplyr package
- select and rename-
• rename() also allows aliasing attributes by specifying new attribute
names, and it keeps the remaining columns.
• General syntax of rename():
rename( data = #DATA, <NEW_NAME> = <ATTRIBUTE1>, …)
Try and look what happens:
rename(mpg, maker = manufacturer)
The dplyr package
- mutate -
• Before we start let’s switch the example and prepare the data:
install.packages("nycflights13")
library(nycflights13)
flights
(flights_sml <- select(flights,
year:day,
ends_with("delay"),
distance, air_time
))
The dplyr package
- mutate -
• mutate() adds attributes (columns) that are functions of other
attributes to the end of the data set.
• General syntax of mutate():
mutate( data = #DATA, <NEW_NAME> = <FUNCTION>, …)
Try:
mutate( flights_sml,
gain = arr_delay - dep_delay,
speed = distance / air_time * 60)
Arithmetic operations: +, -, *, /, ^
Try transmute() with the same parameters and see what happens.
The dplyr package
- mutate -
• mutate() allows reusing newly created variables:
mutate( flights_sml,
gain = arr_delay - dep_delay,
hours = air_time / 60,
gain_per_hour = gain/hours)
The dplyr package
- mutate functions -
Modular arithmetic: %/% (integer division) and %% (remainder)
mutate( flights_sml, air_time,
air_hours = air_time %/% 60,
air_mins = air_time %% 60)
The dplyr package
- Grouping and summarize -
• summarize() collapses a data set to a single row.
• summarize() can be paired with group_by(); the aggregation functions
then operate within each group of the grouped data set
(here: #DATA)
#NEW_DATA <- group_by( data = #DATA, <ATTRIBUTES>)
summarize( #NEW_DATA,
<NEW_NAME> = <AGGR_FUNCTION>, …)
by_dest <- group_by(flights, dest)
delay <- summarize( by_dest, count = n(),
dist = mean(distance, na.rm = TRUE),
delay = mean(arr_delay, na.rm = T))
The dplyr package
- Grouping and summarize -
• Let’s visualize:
by_dest <- group_by(flights, dest)
delay <- summarize( by_dest, count = n(),
dist = mean(distance, na.rm = TRUE),
delay = mean(arr_delay, na.rm = T))
delay <- filter( delay, count > 20, dest != "HNL")
ggplot(data = delay, mapping = aes(x = dist, y = delay)) +
geom_point(aes(size = count), alpha = 1/2) +
geom_smooth(se = FALSE)
The dplyr package
- Piping -
• Instead of creating intermediate data sets, it is possible to send data
from one command to the next by using %>%.
#NEW_DATA <-
<FUNCTION> (#DATA, <ATTRIBUTES>) %>%
<NEXT_FUNCTION> (<ATTRIBUTES>)
Applied to our example:
delay <- group_by( flights, dest) %>%
summarize( count = n(),
dist = mean(distance, na.rm = T),
delay = mean(arr_delay, na.rm = T)) %>%
filter( count > 20, dest != "HNL")
The dplyr package
- Piping -
• Alternatively, it is possible to move the filter from the summary to the
input data:
delay <- filter(flights, !is.na(arr_delay)) %>%
group_by(dest) %>%
summarize( count = n(),
dist = mean(distance, na.rm = T),
delay = mean(arr_delay, na.rm = T)) %>%
filter( count > 20, dest != "HNL")
• To count only non-empty values use
count = sum(!is.na(arr_delay))
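Putting the non-empty count into the full pipeline might look like this (a sketch combining the pieces above):

```r
library(dplyr)
library(nycflights13)  # flights

delay <- flights %>%
  group_by(dest) %>%
  summarize(count = sum(!is.na(arr_delay)),  # count only non-missing delays
            dist  = mean(distance, na.rm = TRUE),
            delay = mean(arr_delay, na.rm = TRUE)) %>%
  filter(count > 20, dest != "HNL")
```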
The dplyr package
- Piping -
• To make adding commands easier, start with the data set and pipe it
to the first command by using %>%.
Applied to our example:
delay <- flights %>%
filter(!is.na(arr_delay)) %>%
group_by(dest) %>%
summarize( count = n(),
dist = mean(distance),
delay = mean(arr_delay)) %>%
filter(count > 20, dest != "HNL")
The dplyr package
- More functions -
• Some other functions:
– lag() and lead() return the previous or the next values within the data set
flights %>%
group_by(month) %>%
summarise(flight_count = n()) %>%
arrange(month) %>%
mutate(change = flight_count - lag(flight_count))
– Function sample_n(#x) returns #x randomly sampled rows from
the data set.
– Function sample_frac(#frac) returns #frac * number of rows from
the data set. The default is sampling without replacement.
flights %>% sample_n(10)
flights %>% sample_frac(0.01, replace = TRUE)
Power Supplier
- Switch Process -
[Diagram: the supplier switch process involves four parties — the Customer (you), the New Supplier, the Current Supplier, and the Energy Grid. The customer requests delivery and declares consumption to the new supplier; the new supplier announces delivery to the grid and requests deregistration of the current supplier; after a deregistration check, the current supplier answers the deregistration request; the grid confirms registration and deregistration; the consumption declaration is corrected or confirmed; finally, delivery is confirmed to the customer.]
Exercise or Homework
What data is available
• SWITCHING_REASON: The reason of the new contract: Move into a new home (MOVE_IN)
or change of supplier (SUPPLIER_SWITCH)
• DECLARATION_CONSUMP_CUST: The energy consumption the customer expects to use
and thus declares to the new supplier.
• DECLARATION_CONSUMP_GRID: The grid has its own estimation of the energy
consumption. If it (significantly) differs from the above value, the supplier receives it in the
delivery confirmation.
• CORRECTIVE_VALUE: Pre-calculated by your lecturer as a “hypothetical” correction of the
consumption, assuming the grid value is correct (to make it easier for you).
• MONTH_BILLING: The month when the customer received its first bill from the new
supplier.
• DAYS_INVOICED: Number of days between delivery start and billing.
• CONSUMPTION_INVOICED: Consumption as invoiced in the first bill of the new supplier.
• READ_OUT_CAT: The first year consumption can have different sources. Depending on the
situation it is a value from meter reading or an estimation (see categories).
• BILLING_TYPE: “Final” if the customer quits, otherwise rotational.
Exercise or homework
• Import the data set
HH_SAMPLE_POWER_CONSUMPTION_GRID_A.csv into RStudio
• Download the zip file from Moodle
(topic “Visualisation”, “Examples”)
and extract it to a local folder
• library(readr)
consumption <-
read_delim("C:/PATH_TO_FILE/HH_SAMPLE_POWER_CONSUMPTION_GRID_A.csv",
";", na = "empty", trim_ws = TRUE)
Exercise or Homework
Rename the dataset and the columns
• Create a new table or tibble with the name hh_spc:
– month_billing = MONTH_BILLING
– swt_reason = SWITCHING_REASON
– cust_decl_kwh = DECLARATION_CONSUMP_CUST
– grid_decl_kwh = DECLARATION_CONSUMP_CUST + CORRECTIVE_VALUE
– bill = BILLING_TYPE
– inv_kwh_365 = CONSUMPTION_INVOICED * 365 / DAYS_INVOICED
• Use the round() function to cut decimals
• Limit the selection to:
– inv_kwh_365 < 10000
– cust_decl_kwh < 7500
– cust_decl_kwh > 0
• Order by month_billing

consumption <-
read_delim("C:/PATH_TO_FILE/HH_SAMPLE_POWER_CONSUMPTION_GRID_A.csv",
";", na = "empty", trim_ws = TRUE)
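A minimal sketch of the requested preparation, assuming the CSV was imported as consumption with the columns listed earlier (transmute() is one way to keep only the newly created columns):

```r
library(dplyr)

hh_spc <- consumption %>%
  transmute(month_billing = MONTH_BILLING,
            swt_reason    = SWITCHING_REASON,
            cust_decl_kwh = DECLARATION_CONSUMP_CUST,
            grid_decl_kwh = DECLARATION_CONSUMP_CUST + CORRECTIVE_VALUE,
            bill          = BILLING_TYPE,
            inv_kwh_365   = round(CONSUMPTION_INVOICED * 365 / DAYS_INVOICED)) %>%
  filter(inv_kwh_365 < 10000,
         cust_decl_kwh < 7500,
         cust_decl_kwh > 0) %>%
  arrange(month_billing)
```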
Exercise or Homework
• Plot a sample of 400 of these points, including a smoothing function
– Compare the customers’ declared consumption and the (invoiced) 365-day consumption
– Separate switching reasons by color
– Use se = FALSE for the smoothing
• Create a boxplot of switching reason and 365-day consumption by
using a 10% sample.
(Hint: If it does not work change x and y attributes)
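One possible sketch, assuming the prepared hh_spc tibble from the previous slide:

```r
library(dplyr)
library(ggplot2)

# Scatter plot of 400 sampled points, colored by switching reason,
# with a smoothing function per reason
hh_spc %>%
  sample_n(400) %>%
  ggplot(aes(x = cust_decl_kwh, y = inv_kwh_365, color = swt_reason)) +
  geom_point() +
  geom_smooth(se = FALSE)

# Boxplot of 365-day consumption per switching reason on a 10% sample
hh_spc %>%
  sample_frac(0.1) %>%
  ggplot(aes(x = swt_reason, y = inv_kwh_365)) +
  geom_boxplot()
```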