Using R in Azure Machine Learning
Take Azure ML to the next level with R
Agenda
R – Ecosystem Fundamentals
R – Selected Language Elements
Data Science Principles (including some lingo)
Azure ML Quick Overview
Azure ML + R
Assumptions
• We can’t do any one topic proper justice.
• So this talk will introduce the core ecosystem for your own follow up.
• No mathematical proofs.
Hopefully.
• You will know what you don’t know about Data Science
• Set expectations and realities about Data Science
These slides are meant to be used!
References
Coursera – R Programming (Johns Hopkins)
• Quality of explanations variable.
• But the practical assignments are good, and deadline driven.
Safari Books Online
• Video. Introduction to Data Science with R. Garrett Grolemund (R Studio).
• Most published books on R
Hadley Wickham (@hadleywickham) – Modern Godfather of R
• Chief Data Scientist at R Studio
• Author of influential R packages
Azure Machine Learning
• Intro video at Microsoft Virtual Academy
References continued
edX – Azure ML with R/ Python
• A number of demos driven by this content.
R – Ecosystem Fundamentals
R vs Python
R Pros
• More mature data science support (20 years +), purpose built
• More established ML support
• T-SQL integration in 2016 (out of scope)
Python Pros
• Best all round script language. Data science support improving.
• Better 64 bit support and scalability?
R performance and scalability – don’t forget Revolution Analytics
R Ecosystem – Essential Bits
Get R. https://cran.r-project.org/ (Revolution Analytics R not covered)
Get R Studio. https://www.rstudio.com/
• Essential IDE. But much more, packages, RPubs etc.
• Download R, then R Studio.
• RStudio a better environment for test and debug than Azure ML!
Get a Github account. https://github.com/ and Github shell.
• Distributed source code control system.
• Essential part of R social network.
Github Lifecycle Cheat Sheet
From github.com
• Create repository (or fork someone else’s).
From local Github shell.
git clone <URL_of_repository>
cd <repository>
git add <files>
git commit -a -m "some_message"
git push
R Package Management
R Package Install/ Reference
You’ll be doing this a lot!
To install a package at the command line.
install.packages("ggplot2") (multiple dependency options)
Or use R Studio.
To use an installed package.
At the command line.
library("ggplot2")
R Studio code hint.
Can install libraries from Github (user/repository)
library("devtools")
install_github("ramnathv/rCharts")
Older versions of install_github have user and repository as separate arguments
R Visualisation Packages
plot
• Standard package. Easy to use but presentation ordinary.
lattice
• Enhanced package. Not very widely adopted.
ggplot2 (by Hadley Wickham) – Grammar of Graphics
• Best quality presentations yet easy to use
• Layers approach: ggplot
• Quickie version: qplot
qplot simple example
ggplot2 example inc Linear Model
ggplot2 … if you really want to get funky…
ggplot2 and the Boxplot
Concise way to show median, 1st/ 3rd quartiles, 1.5 * IQR and outliers.
Scatter plot matrix and R pairs function
Concise way to show relationships between all features.
R Data Wrangling Packages
dplyr
• Extensive function set for select/ sort/ filter/ derived columns/ group by/ top n.
• Note %>% operator (from magrittr) to chain dplyr functions – pipeline like
tidyr (Hadley Wickham)
• Statisticians call cleansed data tidy data.
• Normalise/ denormalise.
sqldf
• Surprisingly good SQL syntax fidelity
R Dynamic Report Packages
knitr
• R Markdown + embedded R code => reports. HTML/ PDF/ LaTeX.
• Ideal platform for Reproducible Research.
• Demo. Properly cool.
shiny
• Interactive publishing of R driven web pages. Client and server bits.
slidify
• Generation of slide decks from R Markdown/ YAML/ R.
R – Selected Language Basics
R Fundamental Data Structures
Script language (Perl/ Python/ Ruby) data structures.
• Scalar
• Array
• Hash (key/value)
Contrast with R data structures
• Vector (a “scalar” is really a 1 element vector)
• Matrix (caveat – data of same type)
R is case sensitive everywhere! (Variables, functions etc.)
The data frame is an operational tabular structure, integral to data manipulation.
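A minimal sketch of these structures in base R. The suburbs, prices and capitals are made up for illustration, and the named list is shown only as the nearest base R stand-in for a script language hash.

```r
# Vector: every element has the same type; a "scalar" is a length 1 vector
prices <- c(450, 620, 380)          # hypothetical prices
single <- 42
length(single)                      # 1

# Matrix: two dimensions, still a single data type
m <- matrix(1:6, nrow = 2, ncol = 3)
dim(m)

# Named list: closest base R equivalent of a key/value hash
capital <- list(NSW = "Sydney", VIC = "Melbourne")
capital[["VIC"]]

# Data frame: tabular, and columns may differ in type
apartments <- data.frame(
  suburb = c("Glebe", "Newtown", "Manly"),
  price  = prices,
  stringsAsFactors = FALSE
)
str(apartments)

# Case sensitivity: Prices is a different name from prices
exists("Prices")
```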
R Data Types
Atomic data types.
• character
• numeric (real numbers)
• integer
• complex
• logical (TRUE/FALSE)
typeof function handy
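A quick sketch of typeof against each atomic type. Note that typeof reports numeric values as "double", and that mixing types in one vector coerces everything to the most general type.

```r
typeof("hello")    # "character"
typeof(3.14)       # "double", R's storage type for numeric
typeof(2L)         # "integer" (the L suffix asks for an integer)
typeof(1 + 2i)     # "complex"
typeof(TRUE)       # "logical"

# A bare number is a double, not an integer
typeof(2)          # "double"

# Mixed types in one vector are coerced, here to character
mixed <- c(1, "a", TRUE)
typeof(mixed)      # "character"
```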
R Assignment and c function
Two assignment forms, <- and =, generally equivalent.
• The <- form most popular.
c for combine to build free form vectors.
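A short sketch of both assignment forms and of c. The variable names and values are illustrative only.

```r
# Both forms assign at the top level
x <- 5
y = 5
identical(x, y)              # TRUE

# Inside a function call, = names an argument rather than assigning,
# which is one reason the <- form is preferred for assignment.

# c combines values into a vector
ages <- c(34, 27, 51)

# c also flattens: combining a vector with more values gives one vector
more <- c(ages, 19, 42)
length(more)                 # 5
```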
Reading Data and Missing Values
A number of functions to read data files (usually read.table).
• Generally into data frames.
How are values not entered handled?
• R default is NA
• This can be overridden (read.table argument na.strings)
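A minimal sketch of missing value handling with read.table, reading from an inline string rather than a file so it runs anywhere. The weather data and the -999 sentinel are assumptions made up for this example.

```r
# Hypothetical CSV content; -999 is an assumed "not recorded" sentinel
raw <- "city,temp,rain
Sydney,22.5,NA
Perth,-999,0.0
Hobart,14.1,3.2"

# Default behaviour: only the string NA is read as missing
weather_default <- read.table(text = raw, header = TRUE, sep = ",",
                              stringsAsFactors = FALSE)

# Override: also treat -999 as missing via na.strings
weather <- read.table(text = raw, header = TRUE, sep = ",",
                      na.strings = c("NA", "-999"),
                      stringsAsFactors = FALSE)

is.na(weather$temp)                 # which temperatures are missing
mean(weather$temp, na.rm = TRUE)    # ignore NA when averaging
```

Without na.rm = TRUE the mean of a column containing NA is itself NA, which is a common first surprise.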
Looking at the data
A number of handy functions. (Factor – discrete values)
R as a Functional Programming Language
In R, functions are 1st class objects. This is widely used.
Eg apply family of functions. apply, sapply, lapply
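A small sketch of the apply family, passing functions as arguments. The data and the square helper are illustrative only.

```r
m <- matrix(1:6, nrow = 2)      # 2 rows, 3 columns

apply(m, 1, sum)                # apply sum over rows (margin 1)
apply(m, 2, max)                # apply max over columns (margin 2)

words <- list("r", "azure", "ml")
lapply(words, nchar)            # always returns a list
sapply(words, nchar)            # simplifies the result to a vector

# Functions are ordinary objects: store one in a variable and pass it along
square <- function(v) v^2
sapply(1:3, square)

# Or pass an anonymous function directly
sapply(1:3, function(v) v + 1)
```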
View command – R Studio Console
Needs no further introduction!
Data Science Principles
Some General Notes
Algorithms vs Data
• Lots of data tends to be more influential than choice of algorithm
• Data collection methodology is critical
Correlation implies Causation?
• No!
Outliers
• Extreme values well outside the norm. Eg Australia’s billionaires
• How are they handled? Depends.
Variable Types (affects Algorithm choice)
• Continuous, eg apartment price
• Discrete, eg species of Iris. Don’t forget R argument stringsAsFactors (eg in read.table and data.frame)
Data Analysis Flowchart
Codebook and Interpretation
Codebook is what Statisticians call the document that is
• Field spec of the data
• Details about the data collection
Reference to data set
• US NOAA storm database
http://www.ncdc.noaa.gov/stormevents/details.jsp?type=eventtype
Read and interpret the Codebook carefully
• Eg Time based issues, all weather events only recorded since 1/1/96
• Careful combining features, eg # fatalities + # injuries does
not make sense
Machine Learning – Predictive Types
Supervised Learning
• Train model based on past results, validate with test data
• Independent variables or features as predictors
• Label or dependent variable to predict.
• Eg predict house price based on size, # rooms etc
Unsupervised Learning
• No past results to train on, thus more difficult to evaluate
• Find patterns, often using clustering
• Eg Google News
Supervised Learning Experiments
Split available data into training and test samples
• Often training 70% as a rule of thumb
• Fit a model against the training set, aiming for accuracy that is close to just right
• Validate model against test set
Beware of.
• Underfitting. Not a convincing predictor.
• Overfitting. Too much fitting of errors/ outliers. Great fit of training
data, rubbish for other data sets.
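A minimal sketch of a 70/30 split in base R. It uses the built-in mtcars data set in place of house prices, and the choice of wt and hp as predictors is arbitrary. Comparing the error on training and test data is one simple way to spot overfitting: a training error far below the test error is the warning sign.

```r
set.seed(42)                                 # reproducible sampling

n <- nrow(mtcars)
train_idx <- sample(n, size = round(0.7 * n))
train <- mtcars[train_idx, ]                 # 70% for training
test  <- mtcars[-train_idx, ]                # remaining 30% for testing

# Fit on training data only
fit <- lm(mpg ~ wt + hp, data = train)

# Root mean squared error helper
rmse <- function(actual, predicted) sqrt(mean((actual - predicted)^2))

rmse_train <- rmse(train$mpg, predict(fit, newdata = train))
rmse_test  <- rmse(test$mpg,  predict(fit, newdata = test))
c(train = rmse_train, test = rmse_test)
```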
Experiment Types
At a very high level.
• Regression. Fit a mathematical function (often linear) to predict continuous
values.
• Classification. Predict discrete values.
• Clustering. Group data items based on similarity.
• Recommender.
• Anomaly Detection. Detect exception cases.
Feature Selection
Your training data has a lot of features. Should we use them all?
• No! Too many dimensions, too much noise.
• Punt collinear features, those with marginal value
• Combine features where it makes sense
• randomForest model to assess importance
• Stepwise elimination of features, R has step() function
• Be ruthless!
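A small sketch of stepwise elimination with step(), again on the built-in mtcars data set as a stand-in. It starts from a model using every feature and drops features while that improves AIC, which is the criterion step() uses by default.

```r
# Collinear features carry overlapping information, eg cylinders vs displacement
cor(mtcars$cyl, mtcars$disp)

# Start with every feature as a predictor of mpg
full <- lm(mpg ~ ., data = mtcars)

# Backward elimination; trace = 0 suppresses the step by step log
reduced <- step(full, direction = "backward", trace = 0)

formula(reduced)                       # the features that survived
length(coef(full)) - length(coef(reduced))   # how many were dropped
```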
Averages and Standard Deviation
How to do an average.
• Mean. Sum of observations / # of observations – outlier sensitive
• Median. Middle value
• Mode. Most common value, best for factors (categorical)
Spread of data.
• Variance is sum of (Value – Mean) squared / # observations. Square to (a)
make every difference positive (b) better vibe of the data.
• Take square root of variance to get Standard Deviation which brings
value in same scale as observations, thus commonly used.
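A worked sketch of these measures in base R. The incomes are made up, with one deliberate outlier. Two things to watch: R has no built in statistical mode (the mode function reports storage type), so stat_mode below is a hypothetical helper, and R's var and sd divide by n - 1 (sample variance) rather than n.

```r
incomes <- c(40, 45, 50, 55, 60, 5000)   # hypothetical, last value is an outlier

mean(incomes)      # 875, dragged up by the outlier
median(incomes)    # 52.5, barely affected

# Hypothetical helper: most common value, suited to categorical data
stat_mode <- function(v) {
  counts <- table(v)
  names(counts)[which.max(counts)]
}
species <- c("setosa", "virginica", "setosa", "versicolor")
stat_mode(species)  # "setosa"

# Variance by the formula on the slide (divide by n)
n <- length(incomes)
pop_var <- sum((incomes - mean(incomes))^2) / n

# R's var divides by n - 1 instead
sample_var <- var(incomes)

# Standard deviation is the square root of the variance
sd(incomes)
sqrt(sample_var)
```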
Normalize Data/ R scale function
Features you want to compare naturally have different scales.
• Eg
• The bigger numbers will swamp small numbers in importance.
Solution? Scaling.
• Common solution is to normalize data to a scale where mean = 0 and
standard deviation = 1.
Note Azure ML has a Normalize Data module. R has a scale function.
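A minimal sketch of the R scale function. The floor areas and room counts are invented to show two features on very different scales. By default scale centres each column on its mean and divides by its standard deviation.

```r
# Hypothetical features on very different scales
homes <- data.frame(area_m2 = c(50, 80, 120, 200),
                    rooms   = c(1, 2, 3, 5))

scaled <- scale(homes)        # returns a matrix, one scaled column per feature

colMeans(scaled)              # effectively 0 for each column
apply(scaled, 2, sd)          # 1 for each column

# Same calculation by hand for one column
manual <- (homes$area_m2 - mean(homes$area_m2)) / sd(homes$area_m2)
```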
Hypothesis Testing and Confidence Intervals
The protocol for hypothesis testing.
• Hypothesis 0 is the status quo.
• Hypothesis 1 is the alternative (eg new drug).
• Aim is to reject H0 in favour of H1 (or not)
The result is generally framed against a significance level, judged by the p value.
• Commonly use 95% confidence (p < 0.05), a throwback to pre computer days.
• Controversy. The Earth is Round (p < 0.05)
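A sketch of the protocol using t.test on simulated data. The two groups, their means and the sample sizes are all invented. H0 is that both groups have the same mean, H1 that they differ.

```r
set.seed(1)                                   # reproducible simulation

# Hypothetical measurements for a placebo group and a new drug group
placebo <- rnorm(30, mean = 120, sd = 10)
drug    <- rnorm(30, mean = 112, sd = 10)

# Two sample t test; H0: the group means are equal
result <- t.test(drug, placebo)

result$p.value      # evidence against H0; smaller means stronger evidence
result$conf.int     # 95% confidence interval for the difference in means

# Reject H0 at the conventional 0.05 level?
reject_h0 <- result$p.value < 0.05
reject_h0
```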
Tidy Data
Described by Hadley Wickham in
• Paper - http://vita.had.co.nz/papers/tidy-data.pdf
• Video - https://vimeo.com/33727555
Principles
1. Each variable forms a column.
2. Each observation forms a row.
3. Each type of observational unit forms a table.
Azure ML – Quick Overview
Azure ML – Get Started
What you need. Site is https://studio.azureml.net
• Azure account – does not have to be a trial, Machine Learning has a
free tier.
• Storage account.
* Trap. Storage account must be in same location as ML. Australia
might not be available.
Azure ML – Flowchart
Azure ML Example – Compare Two Models
Azure ML Example – Prefab Data Wrangling
Azure ML – Re-use and Monetisation
Re-use via web services.
• REST APIs
• Code snippets in C#, R and Python.
Publish said web services to Azure Marketplace.
• Fairly involved diligence process including approvals.
Sadly, both topics out of scope.
Apply SQL Transformation Module
Use SQL syntax for data wrangling, based on SQLite.
I/O
• 3 input ports, internally use “tables” t1, t2 and t3
• 1 output port with results
Within Azure ML, an easier alternative to the R package sqldf.
Execute R Script and I/O
Extend ML with R
Execute R Script
• Its own environment (avoid namespace collisions)
• Need to load packages
• Install new packages via zip
3 input ports
• Dataset1, Dataset2; Azure table -> R data frame
• Script bundle; Zip -> code, objects, packages
2 output ports
• Results; R data frame -> Azure table
• R Device; stdout, stderr, graphics
Template code for Execute R Script
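A sketch of the usual shape of an Execute R Script body, not the exact template from the slide. Inside the module, maml.mapInputPort and maml.mapOutputPort are supplied by Azure ML. The stubs at the top are hypothetical stand-ins, added only so the same script can be tried in RStudio first, and the Row column is an arbitrary example transformation.

```r
# Hypothetical local stand-ins; skipped inside Azure ML where the real ones exist
if (!exists("maml.mapInputPort")) {
  maml.mapInputPort  <- function(port) head(iris)     # fake input table
  maml.mapOutputPort <- function(name) invisible(name)
}

# Input port 1 -> R data frame
dataset1 <- maml.mapInputPort(1)
# dataset2 <- maml.mapInputPort(2)    # optional second input port

# Your own wrangling goes here; this adds a row number as an example
data.set <- dataset1
data.set$Row <- seq_len(nrow(data.set))

# Anything printed ends up on the R Device port
print(summary(data.set))

# Named data frame -> Results output port
maml.mapOutputPort("data.set")
```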
Execute R Script – a “real” example
Debugging R Code
What if code runs ok in RStudio but not in ML?
There is no debugger as such in ML, so
• Induce an error in R code, eg refer to an uninitialised object
• Right click R script module, select View Error Log
• Right click R script module, select View Output Log
Latter has more detail
Sample Output Log
Create Your Own R Library
Fairly mechanical.
• Create your own source function(s) in a .R file
• Zip up that file, with the name you want displayed in ML
• In ML, call Add Dataset to import file.
• Visible in My Datasets in ML.
Own R Library Example
Create R Model Module
A module which includes model and scoring scripts
• Own R environment
• Only pre loaded R packages
• Only one output, no graphics
I/O
• Input. Training data frame
• Output. Model object.
Scripts
• Trainer script
• Scorer: uses R predict function
Sample R Model Module Code
Note most set and get functions local to R Model Module.
Sample training script.
Sample scoring script.
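A rough sketch of a trainer and scorer pair, assuming the module hands the scripts a data frame called dataset and expects objects called model and scores back. Check those names against the module's own template before relying on them. The mtcars stand-in and the choice of lm are illustrative only.

```r
# Local stand-in for the data frame the module would supply (assumption)
if (!exists("dataset")) dataset <- mtcars

# Trainer script: fit a model on the training data frame
model <- lm(mpg ~ wt + hp, data = dataset)

# Scorer script: use predict, and return the result as a data frame
scores <- data.frame(ScoredLabels = predict(model, newdata = dataset))
head(scores)
```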
Loading R Packages into Azure ML
There are “only” 350 R Packages in Azure ML – you’ll eventually want to
use other packages.
To load an R Package into Azure ML.
• Find the package and download as zip locally
• In ML Studio, select the big “+ NEW” option bottom LHS
• Select DATASET -> FROM LOCAL FILE
• Follow the bouncing ball
Using Loaded R Packages in Azure ML
Effectively need to install the package on each use in Execute R Script.
Demos – CA Dairy Data
Really simple example of R, plus custom library in action.
Energy Efficiency Visualisation
Steps we take.
• Make Overall Height and Orientation categorical (what R calls Factors).
• Make all column headers CamelCase (remove spaces) to play nicer with R.
• Add R code to use dplyr to create derived columns for squares and cubes.
• Normalize Data for all numeric columns, transformation method MinMax (rescales each
column to the range 0 to 1). Note mean 0 and standard deviation 1 is what the ZScore method gives.
• Add R code to visualise data.
Energy Efficiency Visualisation continued
Now let’s do some data science!
• Project Columns module to punt a few columns.
• Use the Linear Regression module, solution method Ordinary Least Squares.
• Split Data module – 60% training, 40% test
• Train Model module – Linear Regression plus Training data
• Permutation Feature Importance to score model against Test data
Energy Efficiency Visualisation – the score
The relative feature importance.
Summary
Please take this presentation as a call to action.
Alnis Bajars. Email: [email protected] Twitter: @alnisb