
ST790-003: Advanced Statistical Computing

Mon/Wed 10:15am-11:30am, SAS Hall 1216


Instructor: Dr Hua Zhou, [email protected]

1 Lecture 1: Jan 7
Today
• Introduction and course logistics

• Linux fundamentals

What is this course about?

Statisticians used to ... Now we spend all time ...

• Statistics, the science of data analysis, is the applied mathematics in the 21st century.

• Data is increasing in volume, velocity, and variety. Classification of data sets by Huber
(1994, 1996).

Data Size   Bytes                Storage Mode
Tiny        10^2                 Piece of paper
Small       10^4                 A few pieces of paper
Medium      10^6 (megabyte)      A floppy disk
Large       10^8                 Hard disk
Huge        10^9 (gigabytes)     Hard disk(s)
Massive     10^12 (terabytes)    RAID storage

• Themes of statistics (borrowed from Kenneth Lange’s talk)

– Three pillars: estimation, hypothesis testing, model selection.


– Two philosophies: frequentist, Bayesian.
– Mathematical underpinnings: optimization, penalization, asymptotics, integra-
tion, Monte Carlo sampling.
– Statistics is partly empirical and partly mathematical. It is now almost entirely
computational.

• This course covers some topics on computing that I have found useful for working statisticians
but that are not covered in ST758 or a typical statistics curriculum. Advanced does not mean
more difficult here.

• General topics.

– Operating systems: Linux and scripting basics


– Programming languages: R (package development, Rcpp, ...), Matlab, Julia
– Tools for collaborative and reproducible research: Git, R Markdown, sweave
– Parallel computing: multi-core, cluster, GPU
– Convex optimization
– Integer and mixed integer programming
– Dynamic programming
– Advanced topics on EM/MM algorithms
– Algorithms for sparse regression
– More advanced optimization methods motivated by modern statistical and ma-
chine learning problems, e.g., ALM, ADMM, svm, online algorithms, ...

• Last version (2013 Spring) of this course may give you a rough idea.
http://www.stat.ncsu.edu/people/zhou/courses/st810/LectureNotes
Of course topics on computing change fast.

Course logistics
• Check course website frequently for updates and announcements.
http://hua-zhou.github.io/teaching/st790-2015spr/schedule.html
Pre-lecture notes will be posted before each lecture. Cumulative lecture notes will be
updated and posted after each lecture.

• My office hours: Mon @ 4P-5P, Wed @ 4P-5P, or by appointment.

• TA office hours: Tue @ 2P-3P, Fri @ 2P-3P, at 1101 SAS Hall.

• 5 to 8 homework assignments. Group (20) or solo work (14)?

• A course final project. Survey results: 31 (course project) vs 2 (final exam). Group or
solo?

• Final grade: roughly 70% HW + 30% final project.

Linux: brief introduction


• Which operating system (OS) are you using? Survey results:

• Linux is the most common platform for scientific computing.

– E.g., both department HPC (Beowulf cluster) and campus HPC run on CentOS
Linux. It’s a lot of computing power sitting there.
– Open source and community support.
– Things break; when they break using Linux, it's easy to fix them!
– Scalability: portable devices (Android, iOS), laptops, servers, and supercomput-
ers.
– Cost: it’s free!

• Distributions of Linux.
http://upload.wikimedia.org/wikipedia/commons/1/1b/Linux_Distribution_Timeline.svg

– CentOS is well supported in the department and on campus.


– Ubuntu is another popular choice for personal computers.
– cat /etc/issue displays the distribution on Linux command line

– Mac OS was originally derived from Unix/Linux (Darwin kernel). It is POSIX
compliant. Most shell commands we review here apply to the Mac OS terminal as
well. Windows/DOS, unfortunately, is a totally different breed.

• Linux directory structure.

By default, upon login, a user is in his/her home directory.

• Linux shells.

– A shell translates commands to OS instructions.

– Most commonly used shells: bash, csh, tcsh, ...
– Sometimes a script or a command does not run simply because it’s written
for another shell.
– Determine the current shell you are working on: echo $0 or echo $SHELL.
– List available shells: cat /etc/shells.

– Change your login shell permanently: chsh -s /bin/bash userid. Then log out
and log in.

• Move around the file system.

– Knowing where you are.


pwd prints the current working directory.
– ls lists contents of a directory.
ls -l lists detailed contents of a directory.
ls -a lists all contents of a directory, including those starting with “.” (hidden folders).
Options for many Linux commands can be combined. E.g., ls -al.

– File permissions.

chmod g+x file makes a file executable to group members.


chmod 751 file sets permission rwxr-x--x to a file.
groups userid shows which group(s) a user belongs to.
– .. denotes the parent of the current working directory.
. denotes the current working directory.


~ denotes user’s home directory.
cd .. changes to parent directory.
cd or cd ~ changes to home directory.
cd / changes to root directory.
pushd changes the working directory but pushes the current directory into a stack.
popd changes the working directory to the last directory added to the stack.
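For example, a short interactive session might look like this (the directory names are hypothetical):

    pwd                   # where am I? e.g., /home/userid
    ls -al                # detailed listing, including hidden files
    cd projects/st790     # descend into a (made-up) project folder
    cd ..                 # back up one level
    pushd /tmp            # jump to /tmp, remembering where we were
    popd                  # return to the directory we pushed from
    cd                    # go back to the home directory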

• Manipulate files and directories.

– cp copies file to a new location.


– mv moves file to a new location.
– touch creates a file, if file already exists it is left unchanged.
– rm deletes a file.
– mkdir creates a new directory.
– rmdir deletes an empty directory.
– rm -rf deletes a directory and all contents in that directory (be cautious using
the -f option ...)
– locate locates a file by name. E.g., to find files with names containing “libcublas.so”

– find is similar to locate but has more functionality, e.g., selecting files by age,
size, permissions, ..., and is ubiquitous.
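A short sketch of these commands in action (file and directory names are made up):

    mkdir -p ~/st790/hw1              # create a directory (and any missing parents)
    touch ~/st790/hw1/notes.txt       # create an empty file
    cp ~/st790/hw1/notes.txt ~/st790/hw1/notes_backup.txt
    mv ~/st790/hw1/notes_backup.txt /tmp/
    rm /tmp/notes_backup.txt
    rmdir ~/st790/hw1                 # fails unless the directory is empty
    find ~ -name "*.txt" -mtime -7    # *.txt files modified in the last 7 days
    locate libcublas.so               # search the locate database by name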

• View/peek text files.

– cat prints the contents of a file.


– head -l prints the first l lines of a file
– tail -l prints the last l lines of a file
– more browses a text file screen by screen (only downwards). Scroll down one page
(paging) by pressing the spacebar; exit by pressing the q key.
– less is also a pager, but has more functionalities, e.g., scrolling upwards and downwards
through the input.
“less is more, and more is less”.
– grep prints lines that match an expression.
– Wildcard characters:

Wildcard Matches
? or . Any single character
* Any string of characters
+ One or more of preceding pattern
^ Beginning of the line
[set] Any character in set
[!set] Any character not in the set
[a-z] Any lowercase letter
[0-9] Any number (same as [0123456789])

E.g.
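(A few illustrative commands; the file names are made up.)

    ls *.csv                    # all files ending in .csv
    ls data_200?.csv            # data_2001.csv, data_2009.csv, ...
    ls [a-c]*.txt               # .txt files whose names start with a, b, or c
    grep "^Error" run.log       # lines of run.log beginning with "Error"
    grep -E "[0-9]+" run.log    # lines containing one or more digits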

– Above wildcards are examples of regular expressions. Regular expressions are


a powerful tool to efficiently sift through large amounts of text: record linking,
data cleaning, scraping data from website or other data-feed. Google ‘regular
expressions’ to learn.
– Piping and redirection.
| sends output from one command as input of another command.
> directs output from one command to a file.
>> appends output from one command to a file.
< reads input from a file.

– Other useful text editing utilities include
sed, stream editor
awk, filter and report writer
and so on.
– Combinations of shell commands (grep, sed, awk, ...), piping and redirection, and
regular expressions allow us to pre-process and reformat huge text files efficiently.
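As an illustration, a hedged example of combining these tools (the file survey.csv and its column layout are hypothetical):

    # count the rows of survey.csv that mention NC
    grep "NC" survey.csv | wc -l
    # keep rows whose 3rd comma-separated field is NC, spell out the state, save to a new file
    awk -F, '$3 == "NC"' survey.csv | sed 's/NC/North Carolina/' > nc_subjects.csv
    # append the first two columns to an existing extract
    cut -d, -f1,2 survey.csv >> extract.csv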

2 Lecture 2, Jan 12
Announcements
• TA office hours changed to Tue @ 1P-2P and Fri @ 2P-3P.

• HW1 posted. Due Mon Jan 19.

Last Time
• Course introduction and logistics.

• Linux introduction: why Linux, moving around the file system, viewing/peeking text files,
and simple manipulation of text files.

Today
• Linux introduction (continued).

• Key authentication.

• Version control using Git.

Linux introduction (continued)


• Text editors. “Editor war” http://en.wikipedia.org/wiki/Editor_war.

– Emacs is a powerful text editor with extensive support for many languages including
R, LaTeX, Python, and C/C++; however, it's not installed by default on
many Linux distributions. Basic survival commands:
∗ emacs filename to open a file with emacs.
∗ CTRL-x CTRL-f to open an existing or new file.
∗ CTRL-x CTRL-s to save.
∗ CTRL-x CTRL-w to save as.
∗ CTRL-x CTRL-c to quit.
Google “emacs cheatsheet” to find something like

[GNU Emacs Reference Card (for version 24), © 2012 Free Software Foundation]

C-<key> means hold the control key, and press <key>


M-<key> means press the Esc key once, and press <key>
– vi is ubiquitous (POSIX standard). Learn at least its basics; otherwise you can
edit nothing on some clusters. Basic survival commands:
∗ vi filename to start editing a file.

∗ vi is a modal editor: insert mode and normal mode. Pressing i switches
from the normal mode to insert mode. Pressing ESC switches from the insert
mode to normal mode.
∗ :x<Return> quit vi and save changes.
∗ :wq<Return> quit vi and save changes.
∗ :q!<Return> quit vi without saving latest changes.
∗ :w<Return> saves changes.
Google “vi cheatsheet” to find something like

[Vi Command Cheat Sheet, based on http://www.lagmonster.org/docs/vi.html]

– Statisticians write a lot of code. Critical to adopt a good IDE (integrated develop-
ment environment) that goes beyond code editing: syntax highlighting, executing
code within editor, debugging, profiling, version control, ...
R Studio, Matlab, Visual Studio, Eclipse, Emacs, ...

• Bash completion. Bash provides the following standard completions for Linux users
by default. Far fewer typing errors and much less typing!

1. Pathname completion
2. Filename completion
3. Variablename completion
E.g., echo $[TAB][TAB]
4. Username completion
E.g., cd ~[TAB][TAB]
5. Hostname completion
E.g., ssh hzhou3@[TAB][TAB]

It can also be customized to auto-complete other stuff such as options and command’s
arguments. Google “bash completion” for more information.

• OS runs processes on behalf of user.

– Each process has a Process ID (PID), Username (UID), Parent process ID (PPID),
Time and date the process started (STIME), time running (TIME), ...
– ps command provides info on processes.
ps -eaf lists all currently running processes
ps -fp 1001 lists process with PID=1001
ps -eaf | grep python lists all python processes
ps -fu userid lists all processes owned by a user.
– kill kills a process. E.g., kill 1001 kills process with PID=1001.
killall kills a bunch of processes. E.g., killall -r R kills all R processes.
– top prints realtime process info (very useful).

(Seamless) remote access to Linux machines
• SSH (secure shell) is the dominant cryptographic network protocol for secure network
connection via an insecure network.

– On Linux or Mac, access the teaching server by


ssh [email protected]
– Windows machines need the PuTTY program (free).

• Forget about passwords. Use keys! Why?

– Much more secure. Most passwords are weak.


– Script or a program may need to systematically SSH into other machines.
– Log into multiple machines using the same key.
– Seamless use of many other services: Git, svn, Amazon EC2 cloud service, parallel
computing on multiple hosts in Julia, ...
– Many servers only allow key authentication and do not accept password authen-
tication. E.g., NCSU arc cluster.

• Key authentication.

– Public key. Put on the machine(s) you want to log in.
– Private key. Put on your own computer. Consider this as the actual key in your
pocket; never give to others.
– Messages from the server to your computer are encrypted with your public key. They can
only be decrypted using your private key.
– Messages from your computer to the server are signed with your private key (digital
signatures) and can be verified by anyone who has your public key (authentication).

• Generate keys.

1. On Linux or Mac, ssh-keygen generates key pairs. E.g., on the teaching server

Use an (optional) passphrase different from your password.


2. Set the right permissions on the .ssh folder and key files

3. Append the public key to the ~/.ssh/authorized_keys file of any Linux machine
we want to SSH to, e.g., the Beowulf cluster (hpc.stat.ncsu.edu).

4. Now you don't need a password each time you connect from the teaching server to
the Beowulf cluster.
5. If you set a passphrase when generating keys, you'll be prompted for the passphrase
each time the private key is used. Avoid repeatedly entering the passphrase by
using ssh-agent on Linux/Mac or Pageant on Windows.
The same key pair can be used between any two machines. We don't need to
regenerate keys for each new connection.
For Windows users, the private key generated by ssh-keygen cannot be
directly used by PuTTY; use PuTTYgen for conversion. Then let PuTTY
use the converted private key. Read Sections A and B of the tutorial http://
tipsandtricks.nogoodatcoding.com/2010/02/svnssh-with-tortoisesvn.html
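Putting the steps together, a minimal command-line sketch (assuming an RSA key and the class servers; adjust userid and host names as needed):

    # 1. generate a key pair (accept the default location; passphrase optional)
    ssh-keygen -t rsa
    # 2. lock down permissions on the .ssh folder and keys
    chmod 700 ~/.ssh
    chmod 600 ~/.ssh/id_rsa
    chmod 644 ~/.ssh/id_rsa.pub
    # 3. append the public key to the remote machine's authorized_keys
    cat ~/.ssh/id_rsa.pub | ssh userid@hpc.stat.ncsu.edu "cat >> ~/.ssh/authorized_keys"
    # (or, where available: ssh-copy-id userid@hpc.stat.ncsu.edu)
    # 4. subsequent logins no longer ask for a password
    ssh userid@hpc.stat.ncsu.edu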

• Transfer files between machines.

– scp copies files via SSH.


scp filehere [email protected]:~/remotefolder
scp [email protected]:~/remotefile folderhere
– sftp is FTP via SSH.

– GUIs for Windows (WinSCP) or Mac (Cyberduck).
– (My preferred way) Use a version control system to sync project files between
different machines and systems.

• Line breaks in text files. Windows uses a pair of CR and LF for line breaks.
Linux/Unix uses an LF character only. Mac OS X also uses a single LF character. But old
Mac OS used a single CR character for line breaks. If transferred in binary mode (bit by
bit) between OSs, a text file could look a mess. Most transfer programs automatically
switch to text mode when transferring text files and perform conversion of line breaks
between different OSs; but I used to run into problems using WinSCP. Sometimes you
have to tell WinSCP explicitly a text file is being transferred.

Summary of Linux
• Practice Linux machine for this class:
teaching.stat.ncsu.edu
Start using it right now.

• Ask for help (order matters): Google (paste the error message to Google often helps),
man command if no internet access, friends, Terry, ...

• Homework (ungraded): set up keys for connecting your own computer to the teaching
server.

Version control by Git


If it’s not in source control, it doesn’t exist.

• Collaborative research. Statisticians, as opposed to “closet mathematicians”, rarely do
things in a vacuum.

– We talk to scientists/clients about their data and questions.


– We write code (a lot!) together with team members or coauthors.
– We run code/program on different platforms.
– We write manuscripts/reports with co-authors.
– ...

• 4 things distinguish professional programmers from amateurs:

– Use a version control system.


– Automate repetitive tasks.
– Systematic testing.
– Use debugging aids rather than print statements.

• Why version control?

– A centralized repository helps coordinate multi-person projects.


– Synchronize files across multiple computers and platforms.
– Time machine. Keep track of all the changes and revert back easily (reproducible).
– Storage efficiency. This is what I often see ...

• Available version control tools.

– Open source: cvs, subversion (aka svn), Git, ...
– Proprietary: Visual SourceSafe (VSS), ...
– Dropbox? Mostly for file backup and sharing; limited version control (1 month?), ...

We use Git in this course.

• Why Git?

– The Eclipse Community Survey in 2014 shows Git is the most widely used source
code management tool now. Git (33.3%) vs svn (30.7%).
– History: Initially designed and developed by Linus Torvalds in 2005 for Linux
kernel development. “git” is the British English slang for “unpleasant person”.
I’m an egotistical bastard, and I name all my projects after myself. First
’Linux’, now ’git’.
Linus Torvalds

– A fundamental difference between svn (centralized version control system, left


plot) and Git (distributed version control system, right plot):

– Advantages of Git.
∗ Speed and simple (?) design.
∗ Strong support for non-linear development (1000s of parallel branches).
∗ Fully distributed. Fast, no internet required, disaster recovery,
∗ Scalable to large projects like the Linux kernel project.
∗ Free and open source.
– Be aware that svn is still widely used in IT industry (Apache, GCC, SourceForge,
Google Code, ...) and R development. E.g., type
svn log -v -l 5 https://svn.r-project.org/R
on command line to get a glimpse of what R development core team is doing.

– Good to master some basic svn commands.

• What do I need to use Git?

– A Git server enabling multi-person collaboration through a centralized repository.


∗ github.com: unlimited public repositories; private repositories cost $; academic
users can get 5 private repositories for free.
∗ github.ncsu.edu: unlimited public or private repositories, but a space limitation
(300M?); not accessible by non-NCSU collaborators.
∗ bitbucket.org: unlimited private repositories for academic accounts (register
for free using your NCSU email).
For this course, use github.ncsu.edu please.
– Git client.
∗ Linux: installed on many servers, including teaching.stat.ncsu.edu and
hpc.stat.ncsu.edu. If not, install on CentOS by yum install git.
∗ Mac: install by port install git.

∗ Windows: GitHub for Windows (GUI), TortoiseGit (is this good?)
Don't rely on a GUI. Learn to use Git on the command line.

3 Lecture 3, Jan 21
Announcements
• Today’s office hours change to 5P-6P.

• Install Linux on your personal computer?

• Want to use R Studio on teaching server?

Access via http://teaching.stat.ncsu.edu:8787. However, you first need to change
your password on the command line (passwd).

Last Time
• Key authentication.

• Version control using Git.

Today
• Version control using Git (cont’d).

• Reproducible research.

• Next week: languages (R, Matlab, Julia)

Version control using Git (cont’d)
• Life cycle of a project.
Stage 1:

– A project (idea) is born on github.ncsu.edu, with directories say codebase,
datasets, manuscripts, talks, ...
– Advantage of github.ncsu.edu: privacy of research ideas (free private reposito-
ries).
– Downside of github.ncsu.edu: not accessible by off-campus collaborators; 300M
storage limit.
– bitbucket.org is a good alternative. Unlimited private repositories for academic
accounts (register with .edu email).

Stage 2:

– Hopefully, the research idea pans out and we want to put up a standalone software
development repository at github.com.
– This usually inherits from the codebase folder and happens when we submit a
paper.
– Challenges: keep all version history. Read Cai Li’s slides (http://hua-zhou.
github.io/teaching/st790-2015spr/gitslides-CaiLi.pdf) for how to migrate
part of a project to a new repository while keeping all history.

Stage 3:

– Active maintenance of the public software repository.


– At least three branches: develop, master, gh-pages.
develop: main development area.
master: software release.
gh-pages: software webpage.

– Maintaining and distributing software on github.com.
Josh Day will cover how to distribute an R package from GitHub next week.

• Basic workflow of Git.

– Synchronize local Git directory with remote repository (git pull).
– Modify files in local working directory.
– Add snapshots of them to staging area (git add).
– Commit: store snapshots permanently to (local) Git repository (git commit).
– Push commits to remote repository (git push).

• Basic Git usage.

– Register for an account on a Git server, e.g., github.ncsu.edu. Fill out your
profile, upload your public key to the server, ...
– Identify yourself at local machine:
git config --global user.name "Hua Zhou"
git config --global user.email "hua [email protected]"
Name and email appear in each commit you make.
– Initialize a project:
∗ Create a repository, e.g., st790-2015spr, on the server github.ncsu.edu.
Then clone to local machine
git clone [email protected]:unityID/st790-2015spr.git
∗ Alternatively use following commands to initialize a Git directory from a lo-
cal folder and then push to the Git server
git init

git remote add origin [email protected]:unityID/st790-2015spr.git
git push -u origin master
– Edit working directory.
git pull update local Git repository with remote repository (fetch + merge).
git status displays the current status of working directory.
git log filename displays commit logs of a file.
git diff shows differences (by default difference from the most recent commit).
git add ... adds file(s) to the staging area.
git commit commits changes in staging area to Git directory.
git push publishes commits in local Git directory to remote repository.
The following demo session is on my local Mac machine; a sketch of a typical session is given after this list.

git reset --soft HEAD~1 undoes the last commit.
git checkout filename reverts a file to the last committed version.
git rm is different from rm.
Although git rm deletes files from the working directory, they are still in the Git
history and can be retrieved whenever needed. So always be cautious about putting large
data files or binary files into version control.
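A minimal sketch of one round of the edit-commit-push cycle (the file name and commit message are hypothetical):

    git pull                          # sync with the remote repository first
    # ... edit hw1/hw1.Rmd ...
    git status                        # see what changed
    git diff hw1/hw1.Rmd              # inspect the changes
    git add hw1/hw1.Rmd               # stage the file
    git commit -m "add solution to problem 2"   # commit with an informative message
    git push                          # publish to the remote repository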

• Branching in Git.

– Branches in a project:

– For this course, you need to have two branches: develop for your own develop-
ment and master for releases (homework submission). Note master is the default
branch when you initialize the project; create and switch to develop branch im-
mediately after project initialization.

– Commonly used commands:


git branch branchname creates a branch.
git branch shows all project branches.
git checkout branchname switches to a branch.
git tag shows tags (major landmarks).
git tag tagname creates a tag.
– Let’s look at a typical branching and merging workflow.
∗ Now there is a bug in v0.0.3 ...

How to organize the version numbers of your software? Read the blog post “R Package
Versioning” by Yihui Xie:
http://yihui.name/en/2013/06/r-package-versioning/

Now the ‘debug’ commit in the develop branch is ahead of the master branch.
∗ Merge the bug fix into the master branch.
∗ Tag a new release v0.0.4.
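Putting the above workflow together, a minimal command-line sketch might look like this (the file name is hypothetical; branch and tag names follow the convention above):

    git checkout develop               # work on the development branch
    # ... fix the bug ...
    git add buggy_file.R
    git commit -m "fix bug found in v0.0.3"
    git checkout master                # switch to the release branch
    git merge develop                  # bring the bug fix into master
    git tag v0.0.4                     # tag the new release
    git push origin master --tags      # publish the release and its tag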

• Further resources:

– Book Pro Git, http://git-scm.com/book/en/v2


– Google
– Cai Li's slides http://hua-zhou.github.io/teaching/st790-2015spr/gitslides-CaiLi.pdf
(migrate repositories or folders of a repository, how branching and merging work)

• Some etiquettes of using Git and version control systems in general.

– Be judicious what to put in repository


∗ Not too little: make sure collaborators or yourself can reproduce everything
on other machines.
∗ Not too much: no need to put all intermediate files in the repository.
Strictly speaking, a version control system is for source files only. E.g., only xxx.tex, xxx.bib,
and figure files are necessary to produce a pdf file. The pdf file doesn't need to be
version controlled or frequently committed.
– “Commit early, commit often and don’t spare the horses”
– Adding an informative message when you commit is not optional. Spending
one minute now saves hours later for your collaborators and yourself. Read the
following sentence to yourself 3 times:
“Write every commit message like the next person who reads it is an axe-
wielding maniac who knows where you live.”

• Acknowledgement: some material in this lecture is taken from Cai Li's group meeting
presentation.

Reproducible research (in computational science)
An article about computational result is advertising, not scholarship. The
actual scholarship is the full software environment, code and data, that pro-
duced the result.

Buckheit and Donoho (1995)


also see Claerbout and Karrenbach (1992)

• 3 stories of not being reproducible.

– Duke Potti Scandal.

Potti et al. (2006) Genomic signatures to guide the use of chemotherapeutics,


Nature Medicine, 12(11):1294–1300.
Baggerly and Coombes (2009) Deriving chemosensitivity from cell lines: Forensic
bioinformatics and reproducible research in high-throughput biology, Ann. Appl.
Stat., 3(4):1309–1334. http://projecteuclid.org/euclid.aoas/1267453942

More information is available at
http://en.wikipedia.org/wiki/Anil_Potti
http://simplystatistics.org/2012/02/27/the-duke-saga-starter-set/
– Nature Genetics (2013 Impact Factor: 29.648). 20 articles about microarray
profiling published in Nature Genetics between Jan 2005 and Dec 2006.

– Bible code.

Witztum et al. (1994) Equidistant letter sequences in the book of Genesis. Statist.
Sci., 9(3):429–438. http://projecteuclid.org/euclid.ss/1177010393
McKay et al. (1999) Solving the Bible code puzzle, Statist. Sci., 14(2):150–173.
http://cs.anu.edu.au/~bdm/dilugim/StatSci/

• Why reproducible research?

– Replicability has been a foundation of science. It helps accumulate scientific


knowledge.
– Better work habit boosts quality of research.
– Greater research impact.
– Better teamwork. For you, it probably means better communication with your
advisor (Buckheit and Donoho, 1995).
– ...

• Readings.

– Buckheit and Donoho (1995) Wavelab and reproducible research, in Wavelets and
Statistics, volume 103 of Lecture Notes in Statistics, pages 55–81. Springer New
York. http://statweb.stanford.edu/~donoho/Reports/1995/wavelab.pdf
Donoho (2010) An invitation to reproducible computational research, Biostatis-
tics, 11(3):385-388.
– Peng (2009) Reproducible research and biostatistics, Biostatistics, 10(3):405–408.
Peng (2011) Reproducible research in computational science, Science, 334(6060):1226–
1227.

Roger Peng’s blogs Treading a New Path for Reproducible Research.
http://simplystatistics.org/2013/08/21/treading-a-new-path-for-reproducible-res
http://simplystatistics.org/2013/08/28/evidence-based-data-analysis-treading-a-
http://simplystatistics.org/2013/09/05/implementing-evidence-based-data-analysi
– Reproducible Research with R and RStudio by Christopher Gandrud. It covers
many useful tools: R, RStudio, LaTeX, Markdown, knitr, GitHub, Linux shell, ...
This book is nicely reproducible: git clone the source from
https://github.com/christophergandrud/Rep-Res-Book and you should be able to compile
it into a pdf.
– Reproducibility in Science at
http://ropensci.github.io/reproducibility-guide/

• How to be reproducible in statistics?

When we publish articles containing figures which were generated by computer,


we also publish the complete software environment which generates the figures.

Buckheit and Donoho (1995)

– For theoretical results, include all detailed proofs.


– For data analysis or simulation study
∗ Describe your computational results with painstaking details.
∗ Put your code on your website or in an online supplement (required by many
journals, e.g., Biostatistics, JCGS, ...) that allow replication of entire analysis
or simulation study. A good example:
http://stanford.edu/~boyd/papers/admm_distr_stats.html
∗ Create a dynamic version of your simulation study/data analysis.

• What can we do now? At least make your homework reproducible!

– Document everything!
– Everything is a text file (.csv, .tex, .bib, .Rmd, .R, ...). Text files are future-proof
and are easy to put under version control.
Word/Excel files are not text files.
– All files should be human readable. Abundant comments and adopt a good style.
– Tie your files together.

– Use a dynamic document generation tool (weaving/knitting text, code, and output
together) for documentation. For example
http://hua-zhou.github.io/teaching/st758-2014fall/hw01sol.html
http://hua-zhou.github.io/teaching/st758-2014fall/hw02sol.html
...
http://hua-zhou.github.io/teaching/st758-2014fall/hw07sol.html
http://hua-zhou.github.io/teaching/st758-2014fall/hw08sol.html
– Use a version control system proactively.

– Print sessionInfo() in R.

For your homework, submit (put in the master branch) a final pdf report and all
files and instructions necessary to reproduce all results.

• Tools for dynamic document/report generation.

– R: RMarkdown, knitr, Sweave.


– Matlab: automatic report generator.
– Python: IPython, Pweave.
– Julia: IJulia.

We will briefly talk about these features when discussing specific languages.
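As a flavor of what a dynamic document looks like, here is a minimal R Markdown sketch (the title and content are hypothetical); knitting it weaves text, code, and output into an html or pdf report:

    ---
    title: "HW1 solution"
    author: "Your Name"
    output: html_document
    ---

    ## Problem 1

    ```{r}
    set.seed(790)
    x <- rnorm(100)
    summary(x)
    ```

    ```{r}
    sessionInfo()   # record the computing environment for reproducibility
    ```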

4 Lecture 4, Jan 26
Announcements
• Helpful tutorial about Git branching
http://pcottle.github.io/learnGitBranching/
shared by Bo Ning.

• Want to use R Studio on teaching server?

Access via http://teaching.stat.ncsu.edu:8787. However, you first need to change
your password on the command line (passwd).

Last Time
• Version control using Git (cont’d).

• Reproducible research.

Today
• This week: languages (R, Matlab, Julia)

Computer Languages

To do a good job, an artisan needs the best tools.

The Analects by Confucius (about 500 BC)

• What features are we looking for in a language?

– Efficiency (in both run time and memory) for handling big data.
– IDE support (debugging, profiling).
– Open source.
– Legacy code.
– Tools for generating dynamic report.
– Adaptivity to hardware evolution (parallel and distributed computing).

• Types of languages

1. Compiled languages: C/C++, Fortran, ...


– Directly compiled to machine code that is executed by CPU
– Pros: fast, memory efficient
– Cons: longer development time, hard to debug
2. Interpreted language: R, Matlab, SAS IML, ...
– Interpreted by interpreter
– Pros: fast prototyping
– Cons: excruciatingly slow for loops
3. Mixed languages: Julia, Python, JAVA, Matlab (JIT), R (JIT), ...
– Compiled to bytecode and then interpreted by virtual machine
– Pros: relatively short development time, cross-platform, good at data prepro-
cessing and manipulation, rich libraries and modules
– Cons: not as fast as compiled language
4. Script languages: shell scripts, Perl, ...
– Extremely useful for data preprocessing and manipulation

• Messages

– To be versatile in the big data era, be proficient in at least one language in each
category.
– To improve efficiency of interpreted languages such as R or Matlab, avoid loops
as much as possible, i.e., vectorize code (see the short R sketch after this list).
“The only loop you are allowed to have is that for an iterative algorithm.”
– For some tasks where looping is necessary, consider coding in C or Fortran. It
is convenient to incorporate compiled code into R or Matlab. But do this only
after profiling!
Success stories: glmnet and lars packages in R are based on Fortran.
– When coding in C, C++, or Fortran, make use of libraries for numerical linear
algebra: BLAS, LAPACK, ATLAS, ...

Julia seems to combine the strengths of all these languages, that is, to achieve
efficiency without vectorizing code.
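A toy R illustration of the vectorization advice above (not from the original notes):

    n <- 1e6
    x <- rnorm(n)
    # looped version: slow in an interpreted language
    s_loop <- function(x) {
      total <- 0
      for (xi in x) total <- total + xi^2
      total
    }
    # vectorized version: one line, and much faster
    s_vec <- function(x) sum(x^2)
    system.time(s_loop(x))
    system.time(s_vec(x))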

5 Lecture 5, Jan 28
Announcements
• HW2 (NNMF, GPU computing) posted. Due Feb 11.

Last Time
• Computer languages.

• Productivity tools (Rcpp, Boost, Armadillo, R markdown) of R (Josh Day).

Today
• Matlab and Julia.

Computer languages (cont’d)


“As some of you may know, I have had a (rather late) mid-life crisis and run
off with another language called Julia. http://julialang.org”

Doug Bates (on the knitr Google Group)

• Language features of R, Matlab, and Julia

Features                R                      Matlab            Julia
Open source             +                      −                 +
IDE                     R Studio ++            +++               −
Dynamic document        +++ (RMarkdown)        +++               +++ (IJulia)
Multi-threading         parallel pkg           +                 +
JIT compiler            pkg                    +                 +
Call C/Fortran          wrapper                wrapper           no glue code
Call shared library     wrapper                wrapper           no glue code
Typing                  −                      ++                +++
Pass by reference       −                      −                 +++
Linear algebra          −                      MKL, Arpack       OpenBLAS, Eigpack
Distributed computing   −                      +                 +++
Sparse linear algebra   − (Matrix package)     +++               +++
Documentation           −                      +++               ++
(+ marks a strength and − a weakness; more symbols means stronger.)

• Benchmark code R-benchmark-25.R from http://r.research.att.com/benchmarks/
R-benchmark-25.R covers many commonly used numerical operations used in statis-
tics. We ported to Matlab and Julia and report the run times (averaged over 5 runs)
here.

Machine specs: Intel i7 @ 2.6GHz (4 physical cores, 8 threads), 16G RAM, Mac OS 10.9.5.
Test R 3.1.1 Matlab R2014a julia 0.3.5

Matrix creation, trans., deformation (2500 × 2500)   0.80   0.17    0.16
Power of matrix (2500 × 2500, A^1000)                0.22   0.11    0.23
Quick sort (n = 7 × 10^6)                            0.65   0.24    0.64
Cross product (2800 × 2800, A^T A)                   9.95   0.35    0.38
LS solution (n = p = 2000)                           1.21   0.07    0.10
FFT (n = 2,400,000)                                  0.34   0.04    0.14
Eigen-decomposition (600 × 600)                      0.78   0.31    0.56
Determinant (2500 × 2500)                            3.52   0.18    0.23
Cholesky (3000 × 3000)                               4.03   0.15    0.23
Matrix inverse (1600 × 1600)                         3.05   0.16    0.22
Fibonacci (vector)                                   0.28   0.17    0.66
Hilbert (matrix)                                     0.22   0.07    0.18
GCD (recursion)                                      0.47   0.14    0.20
Toeplitz matrix (loops)                              0.34   0.0014  0.03
Escoufiers (mixed)                                   0.38   0.40    0.17

• A slightly more complicated (or realistic) example taken from Doug Bates’s slides
http://www.stat.wisc.edu/~bates/JuliaForRProgrammers.pdf. The task is to
use Gibbs sampler to sample from bivariate density

f(x, y) = k x^2 exp(−x y^2 − y^2 + 2y − 4x),   x > 0,

using the conditional distributions

X | Y ∼ Γ(3, 1/(y^2 + 4))              (shape 3, scale 1/(y^2 + 4))
Y | X ∼ N(1/(1 + x), 1/(2(1 + x))).    (mean 1/(1 + x), variance 1/(2(1 + x)))
Let’s sample 10,000 points from this distribution with a thinning of 500.

– How long does R take?


http://hua-zhou.github.io/teaching/st790-2015spr/gibbs_r.html
– How long does Julia take?
http://hua-zhou.github.io/teaching/st790-2015spr/gibbs_julia.html

– With similar coding effort, Julia offers a ∼100-fold speed-up! Somehow the JIT in R
didn't kick in. (Neither did Matlab's, which took about 20 seconds.)
– Julia offers the capability of strong typing of variables. This facilitates the opti-
mization by compiler.

– With little effort, we can do parallel and distributed computing using Julia.

Benchmarks of the same example in other languages, including Rcpp, are available
in the blog posts by Darren Wilkinson (http://bit.ly/IWhJ52) and Dirk Eddelbuettel
(http://dirk.eddelbuettel.com/blog/2011/07/14/).
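For concreteness, here is a minimal R sketch of the Gibbs sampler just described (a simplified stand-in, not the exact code behind the timings linked above):

    gibbs <- function(N = 10000, thin = 500) {
      mat <- matrix(0, nrow = N, ncol = 2)
      x <- 0
      y <- 0
      for (i in 1:N) {
        for (j in 1:thin) {
          x <- rgamma(1, shape = 3, rate = y^2 + 4)                      # X | Y
          y <- rnorm(1, mean = 1 / (x + 1), sd = 1 / sqrt(2 * (x + 1)))  # Y | X
        }
        mat[i, ] <- c(x, y)
      }
      mat
    }
    system.time(out <- gibbs())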

Julia
• IDE in Julia.

– Juno (http://junolab.org) is the currently recommended IDE for Julia. It


has limited capabilities (syntax highlighting, tab completion, executing lines from
editor, ... ) compared to R Studio or Matlab IDE.
– No easy-to-use debugging tool yet (set breakpoints, inspect variables at a breakpoint,
break on error, ...).
– Profiling. The language itself provides many very useful profiling tools.
http://julia.readthedocs.org/en/latest/stdlib/profile
@profile macro shows line-by-line analysis how many times each line is sampled
by the profiler.

@time macro displays memory footprint and significant gc (garbage collection)
along with run time.

Finer analysis of line-by-line memory allocation is also available.


http://docs.julialang.org/en/release-0.3/manual/profile/#memory-allocation-anal

• Work flow in Julia.

– Tim Holy:
“quickly write the simple version first (which is really pleasant thanks to Julia's
design and the nice library that everyone has been contributing to)” → “run it”
→ “ugh, too slow” → “profile it” → “fix a few problems in places where it actually
matters” → “ah, that’s much nicer!”
– Stefan Karpinski:
1. You can write the easy version that works, and
2. You can usually make it fast with a bit more work.

• Types (data structures) in Julia.

– Julia provides a very rich collection of static data types, abstract types, and user-
defined types.
http://julia.readthedocs.org/en/latest/manual/integers-and-floating-point-numbe
http://julia.readthedocs.org/en/latest/manual/types/.

• Functions (methods, algorithms) in Julia.

– Functions in Julia are really methods. All functions in Julia are generic so any
function definition is actually a method definition.
– Same function (method) names can be applied to different argument signatures.
– Templated methods and data types. Sometimes you want to define algorithms on
the abstract type with minor variations for, say, the element type.

– In Julia, all arguments to functions are passed by reference. A Julia function can
modify its arguments. Such mutating functions should have names ending in “!”.

• Call compiled code.

– In Julia, usually it’s unnecessary to write C/C++ or Fortran code for perfor-
mance. Just write loops in Julia and leave the work to its compiler.
– Still in many situations, we’d like to call functions in some compiled libraries
(developed in C or Fortran). Use the ccall() function in Julia; no glue code is
needed. For example, Mac OS has the math library libm.dylib, from which we
can call the sin function

mysin(x) = ccall((:sin,"libm"), Cdouble, (Cdouble,), x)

We can vectorize the single-argument function mysin by

@vectorize_1arg Real mysin

– They must be shared libraries available in the load path, and if necessary a direct
path may be specified.

• Call Julia function from other languages like C.


http://docs.julialang.org/en/release-0.3/manual/embedding/

• Documentation.

– Online help: ? funname.


– Online documentation is mostly clear, but seems to lack enough examples.
– I haven't found a good in-source documentation system like roxygen for R yet.
– Julia provides tab completion, like bash completion.

• Package management in Julia all centers around GitHub. No manual censorship as on
CRAN anymore.

• Julia summary.

“In my opinion Julia provides the best of both worlds and is the technical
programming language of the future.”

Doug Bates

6 Lecture 6, Feb 2
Announcements
• HW1 graded. Feedback

– Solution sketch: http://hua-zhou.github.io/teaching/st790-2015spr/hw01sol.


– grade unityID.md committed to your master branch.
– Don’t forget git tag: Tagging time will be used as your homework submission
time.
– “Commit early, commit often and don’t spare the horses”
– Reproducibility (source code for reproducing results and instructions). Dynamic
document (Rmd, IPython, ...) is worth learning.

• HW2:

– Think more carefully about algorithmic updates.


– Use V0.txt and W0.txt as starting points for timing.

• HW3 (Convex or Not?) posted. Due Mon, Feb 23.

Last Time
• Julia: a promising language to know about.

Today
• Matlab.

• Parallel computing.

Matlab
• Matlab IDE. A powerful IDE comes with Matlab. Familiarity with it prevents tons
of pain.

– Essentials: syntax highlighting, code indenting/wrapping/folding, text width (de-
fault = 72 characters), ...
– Code cells delimited by %%. Cells break script into logical segments and facilitate
automatically generating documentation.
– Code analyzer. Are you greened? Check upper-right corner.

• Matlab functions.

– Matlab development revolves around functions.

– Each function is a separate file: fun1.m, fun2.m, ...
R and Julia can have multiple functions in one file.
– Add help/documentation immediately below the function definition. It facilitates
the help command and automatically generating documentation.
– If there is more than one function in a file, only the first one is callable. The others
are local functions, the equivalent of subroutines/subfunctions in other languages.
– Nested function. It has access to the variables in its parent function. Memory
saver!
– Function help follows a fixed format: declaration, calling convention, see also,
example, copyright, ...

– Help command

– Support variable number of input/output arguments


% linear regression
b = glmfit(x,y);
% logistic regression
b = glmfit(x,y,'binomial');
% probit regression
b = glmfit(x,y,'binomial','link','probit');
% probit regression with observation weights
b = glmfit(x,y,'binomial','link','probit','weights',wts);

inputParser for parsing name/value pairs

• Debugging in Matlab.

– Execute code cell-by-cell, line-by-line, ...


– Breakpoints.
– Examine intermediate values:
data tips in editor, command window, workspace browser
– Error breakpoints.

• Profiling in Matlab.

– Timing: tic/toc (wall time).


– Profiling: profile on/viewer.

– Let’s profile the lsq_sparsepath() function.

– profile viewer produces a summary in html that includes line by line analysis.

• Call compiled code in Matlab.

– Step 0: Are you sure you want to do this? Profile first!


– Step 1: Check compiler compatibility.
∗ What compilers are supported by Matlab 2014a (Linux)? Check
http://www.mathworks.com/support/compilers/R2014a/index.html
∗ Compilers not supported? Tweak the mexopts.sh file
– Step 2: Write C or Fortran code.
∗ Develop C or Fortran code as usual
∗ If you obtain source code from open source projects, internet, book (e.g.
Numerical Recipes), ..., follow license and give credit
– Step 3: Write mex function wrapper.

∗ The name of the mex function file is the name of your Matlab function
∗ Purpose: match data types between Matlab and C/Fortran, transfer
input/output (pass by value!), ...
∗ Format for mex function: Google for “matlab mex function”
– Step 4: Use mex command to compile source.

∗ This produces binary code: funname.mexmaci64 (Mac), funname.mexw64
(Windows), or funname.mexa64 (Linux)
∗ These binaries are what you need to run program. Just use as native Matlab
functions

• Toolbox development in Matlab.

– Toolbox in Matlab is equivalent to the packages in R. You can submit to Mat-


lab Central (equivalent of CRAN) or simply publish on your github or website.
– Basic steps to create a toolbox.
1. Write functions and demo scripts
2. Debug, test, profile, document
debug, test, profile, document
...
3. “Publish” your demo scripts as html by clicking the Publish button. It works
just like knit.
http://www.mathworks.com/help/matlab/matlab_prog/marking-up-matlab-comments-
html
4. Edit the info.xml and helptoc.xml files. They help automatically generate
the help documentation and put the toolbox to the Matlab start menu
5. Write the COPYRIGHT.txt, INSTALL.txt and RELEASE_NOTES.txt documents

6. Zip the toolbox folder and publish on your website or to Matlab Central

– Contents of a toolbox.
∗ Function files. The private folder “hides” functions and compiled binaries
not directly accessible by user
∗ Demo scripts
∗ The html (or any other name) folder holds the documentation generated by
“publishing” demo scripts
∗ The info.xml file contains essential information about the toolbox. It puts
the toolbox to the start menu of Matlab and links to the help documenta-
tion. See screenshots in next two slides for an example

• More features of Matlab.

– Object-oriented programming (OOP)


– GUI development.
– More productivity tools: help report, TODO/FIXME report, code analyzer re-
port, dependency report, ...

• Matlab summary.

– Good points.
∗ Highly efficient, esp. for numerical linear algebra.
∗ Good IDE. Debugging and profiling is a breeze.
∗ The language of choice for some technical computing areas. E.g., my research
requires a lot of ODE (ordinary differential equation) solving and tensor (multi-
dimensional array) computing, which are not available (or not good enough)
in R.
∗ Existence of Matlab sets a high standard for other competing technical com-
puting languages. Examples are R Studio and Julia.

∗ Reasonably up to date with hardware technology. For example, > 200 native
functions in Matlab support GPUs, and the distributed computing toolbox enables
cluster computing for large-scale problems.
– Pitfalls.
∗ Not open source! $$$
∗ Limited statistical functionalities compared to R packages.

Summary of languages
• Choosing language(s) for your project mostly depends on your specific tasks, legacy code,
and which “church you happen to frequent”.

• Trade-off between development time and run time.

• Never believe others’ benchmark results. Do your own profiling and benchmark.

• Don’t be afraid to learn new languages. Having more tools in your toolbox is always
a plus.

Parallel computing – what and why?


• Parallel computing, in contrast to serial computing, means carrying out computation
simultaneously.

• Recent change in the landscape of parallel computing due to end of frequency scaling
game in 2004.

• Run time = # instructions × avg. time per instruction.

• Cranking up clock frequency (frequency scaling) obviously reduces avg. time per in-
struction, but unfortunately ... increases power consumption and worsens cooling prob-
lem too.

Power = Capacitance × Voltage^2 × Frequency.

• This is what I see when running Matlab benchmark code on a MacBook Pro with a
2.6 GHz Intel Core i7 CPU.

You can cook eggs on that CPU ...

• Intel canceled its Tejas and Jayhawk lines in 2004 due to power consumption constraint,
which declared the end of frequency scaling and start of parallel scaling.

• This paradigm shift changes the way we do computation. Running the serial code
written for single-core CPU on a multi-core CPU will not make it faster.

• There are many modes of parallel computing: multi-core, cluster, GPU, ...

Multi-core parallel computing


• A typical CPU on a server.

– Issue cat /proc/cpuinfo on teaching.stat.ncsu.edu.

– Intel® Xeon® E5-2640 chip, with 6 physical cores

– Intel’s hyperthreading “interleaves” two threads on one core

– In total it appears as 12 “processors” (logical processors, virtual cores, logical
cores, siblings) to the OS on the teaching server.
– Theoretical throughput of the machine is
120 DP GFLOPS ≈ 4 DP FLOPs/cycle × 2.5 GHz × 12 logical processors.
It's almost impossible to achieve this theoretical throughput.

• A typical CPU on current PCs and laptops.

– For example, a MacBook Pro has an Intel® Core™ i7-3720QM CPU @ 2.60GHz.
– 4 physical cores and 8 threads. It appears as 8 virtual cores to the OS.

• Some terminology.

– A thread is a (serial) sequence of instructions.


– A process is a collection of threads, which share resources such as memory. Dif-
ferent processes run in separate memory spaces.
– An application may have multiple processes

– Example: Web browser (Google Chrome) is an application, each tab is a process,


threads for each tab control text, music and so on.

• Multi-core or multi-thread computation relies on communication between processes/threads

– Message passing libraries (MPI, PVM)


∗ very powerful

∗ designed for C/C++, Fortran
∗ not easy to use from R (rpvm, Rmpi packages), Matlab, ...
– Forking.
– Sockets.

Ideally we need a transparent R interface that hides these communication details.

7 Lecture 7, Feb 4

Last Time
– Matlab.
– Parallel computing: what and why.

Today
– A debug-profile-optimize session on NNMF (HW2)?
– Parallel computing: multi-core in R, Matlab, Julia.

A debug-profile-optimize exercise on NNMF (HW2)


I was preparing the solution to HW2 and thought it might be a good example, simple
enough that we can go through a debug-profile-optimize exercise in class.

Multi-core computing in R
• Fact: base R is single-threaded.

• Running a benchmark script (random number generation, numerical linear algebra) on


the teaching server occupies only 1 out of the 12 logical processors.

• To perform multi-core computation in R

– Develop multi-threaded code or libraries in C/C++, Fortran, ... and call from R.
– For embarrassingly parallel single-threaded tasks.
∗ Option 1: Manually run multiple R sessions
∗ Option 2: Make multiple system ("Rscript") calls. Typically automated
by a scripting language (Python, Perl, shell script) or within R.
∗ Option 3: Use package parallel

• parallel package in R.

– Included in R since 2.14.0 (2011).


– Based on the snow (Luke Tierney) and multicore (Urbanek) packages.
– Authors: Brian Ripley, Luke Tierney, Simon Urbanek.
– To find number of cores in the teaching server.

> library(parallel)
> detectCores()
[1] 12

– How to utilize these cores to speed up computation?
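As a toy warm-up before the case study below (the task here is made up, not part of the course materials):

    library(parallel)
    # one embarrassingly parallel task: a small bootstrap of the sample mean
    one_task <- function(i, x) mean(sample(x, replace = TRUE))
    x <- rnorm(1e5)
    # serial
    system.time(res1 <- lapply(1:200, one_task, x = x))
    # parallel over all detected cores (forking; Linux/Mac only)
    system.time(res2 <- mclapply(1:200, one_task, x = x, mc.cores = detectCores()))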

• Case study: One common embarrassingly parallel task in statistics is Monte carlo
simulation study.

– E.g., in ST758 (2012, 2013), students are asked to carry out a simulation study
to compare three procedures (LRT, eLRT, eRLRT) for testing H0: σ_a^2 = 0 vs
Ha: σ_a^2 > 0 in a linear mixed model (variance component model)

y ∼ N(µ1, σ_a^2 V_1 + σ_e^2 I).

We want to compare the size and power of the three methods across 16 V_1 patterns (stored
in n.pattern.list) and 7 σ_a^2/σ_e^2 ratios (stored in sigma2.ratio.list).
– The Monte Carlo estimate of size/power and its standard deviation for each method/pattern/σ_a^2
combination can be summarized in a table

– Suppose we have a function compare.tests that compares the methods at a fixed


pattern and signal-to-noise ratio on a large number of Monte carlo replicates

compare.tests <- function( n.pattern, sigma2.ratio,
mc.size = 10000, ... )

– Need to loop over n.pattern.list and sigma2.ratio.list.


– 112 embarrassingly parallel tasks. Each might take long with many Monte carlo
replicates.
– Let’s try to parallelize the serial code (HW submission by Yichi Zhang) using the
parallel package.
– Run the double loop (encapsulated in the compare.tests.all function) on the
teaching server. Monte carlo sample size (mc.size) is set at (ridiculously small)
10
> # perform simulations -- serial code
> set.seed (123, "L’Ecuyer")
> system.time (result.serial <- compare.tests.all (
+ n.pattern.list, sigma2.ratio.list, mc.size = 10))
user system elapsed
238.410 0.148 239.409

– Only 1 out of the 12 logical processors being used

– Run the same task using mcmapply() function (parallel analogue of mapply) in
the package parallel
> # parallel simulations using mcmapply w/o load balancing
> set.seed (123, "L’Ecuyer")
> system.time (result.mcmapply <- mcmapply ( compare.tests,
+ rep (n.pattern.list, each = length (sigma2.ratio.list), times = 1),
+ rep (sigma2.ratio.list, each = 1, times = length (n.pattern.list)),
+ MoreArgs = list (mc.size = 10), mc.cores = 12))
user system elapsed
218.226 0.840 22.378

– mc.cores = 12 instructs mcmapply to use 12 cores.

8 Lecture 8, Feb 9
Announcements
• No class and office hours this Wednesday. Instructor out of town.

• HW2 due this Wed @ 11:59PM. Tagging time will be your submission time. No tagging
time = no hw submission.

• HW2 progress.

– Matlab (CPU+GPU). Easy.


– Julia (CPU+GPU). Use CUDArt and CUBLAS packages.
– Python (CPU+GPU). Ask Xiang Zhang.
– R (GPU?).

• A list of potential course projects.


http://hua-zhou.github.io/teaching/st790-2015spr/project.html
More topics will be added. Talk to me about your course project.

Last Time
• A debug-profile-optimize session on NNMF (HW2).

• Parallel computing: multi-core in R.

Today
• Multi-core computing in R (cont’d), Matlab, Julia.

• Cluster computing.

Multi-core computing in R (cont’d)


• Last time we demonstrated how to use parallel package to do multi-core computing
in R on a simulation study. Steps are

1. Write a function to carry out the Monte Carlo simulation and method comparison for one combination of levels (one cell in the table).
2. Use mcmapply for multi-core parallel computing.

3. Results are automatically collected in the master session. No need for extra scripting to collect results from parallel runs.

Demo code is available on the course webpage: http://hua-zhou.github.io/teaching/st790-2015spr/vcsim.r

• Load balancing: Good for small number of parallel tasks with wildly different compu-
tation times
No load balancing: Good for numerous parallel tasks with similar computation times

• Turn on load balancing by setting mc.preschedule=FALSE

> # parallel simulations using mcmapply with load balancing


> set.seed (123, "L’Ecuyer")
> system.time (result.mcmapplylb <- mcmapply ( compare.tests,
+ rep (n.pattern.list, each = length(sigma2.ratio.list), times = 1),
+ rep (sigma2.ratio.list, each = 1, times = length(n.pattern.list)),
+ MoreArgs = list (mc.size = 10), mc.cores = 12, mc.preschedule = FALSE))
user system elapsed
263.397 5.486 21.792

• Forking creates a new R process by taking a complete copy of the master process, including the workspace and random number stream. The copy shares memory with the master until modified, so forking is very fast.
  Note: mcmapply, mclapply, and related functions rely on the forking capability of POSIX operating systems (e.g., Linux, MacOS) and are not available on Windows.

• parLapply, parApply, parCapply, parRapply, clusterApply, clusterMap, and related functions create a cluster of workers based on either sockets (default) or forking:

      cl <- makeCluster(<size of pool>)
      # one or more parLapply calls
      stopCluster(cl)

  Note: sockets are available on all platforms: Linux, MacOS, Windows.

• Same simulation example using clusterMap with load balancing

> # parallel simulations using clusterMap with load balancing


> cl <- makeCluster (getOption ("cl.cores", 12))
> clusterSetRNGStream(cl, 123)
> clusterExport (cl, c ("generate.design", "generate.response",
+ "simulate.null.samples") )
> clusterEvalQ (cl, library(nlme) )
> clusterEvalQ (cl, library(RLRsim) )
> system.time (result.clusterMaplb <- clusterMap (cl, compare.tests,
+ rep (n.pattern.list, each = length (sigma2.ratio.list), times = 1),
+ rep (sigma2.ratio.list, each = 1, times = length (n.pattern.list)),
+ MoreArgs = list (mc.size = 10), .scheduling = "dynamic") )
user system elapsed
0.115 0.011 22.310
> stopCluster (cl)

• clusterSetRNGStream controls the random number streams.

• clusterExport and clusterEvalQ copy the environment of the master to the slaves.

• Many embarrassingly parallel tasks in statistics can be organized in a similar way using
parallel.

– simulation across multiple factors (methods, generative models, signal/noise ra-


tios, sparsity levels, ...)
– bootstrap
– solution path/surface in regularization methods
– independent MCMC chains
– cross validation
– spatial prediction (kriging)
– ...

• 5 ∼ 15 fold speed-up, depending on the number of cores on your machine.

• Need to make sure each task is thread-safe.
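• As a concrete illustration of the bootstrap item in the list above, here is a minimal sketch (not part of the course demo code) that distributes bootstrap replicates of a sample median over cores with mclapply; the data vector x is a made-up example.

      library(parallel)
      RNGkind("L'Ecuyer-CMRG")   # reproducible parallel random number streams
      set.seed(123)
      x <- rnorm(1000)           # made-up data
      # each replicate resamples x with replacement and records the median
      boot.median <- mclapply(1:1000,
                              function(b) median(sample(x, replace = TRUE)),
                              mc.cores = detectCores())
      sd(unlist(boot.median))    # bootstrap standard error of the median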

Multi-core and multi-thread computing in Matlab
• Many Matlab functions, esp. numerical linear algebra (MKL libraries), are multi-
threaded since 2007.

• For example, running a benchmark script on the teaching server occupies up to all 7
(virtual) cores.

• parfor (parallel for loop) mechanism for embarrassingly parallel tasks.

• Parallel Computing Toolbox has more to offer (distributed array and SPMD, GPU
computing, parallel MapReduce, cluster computing, ...)
http://www.mathworks.com/help/distcomp/index.html

Multi-core and multi-thread computing in Julia


• Numerical linear algebra (OpenBLAS library) is already multi-threaded.

• Distributed computing capabilities are built in core language.


http://docs.julialang.org/en/release-0.3/manual/parallel-computing/

• Perhaps I can say more later this semester ...

Cluster computing
• Architecture of a computer cluster - computing parts.

– Cluster: a network of workstations (nodes).

– Compute nodes, login nodes, gateway (I/O) nodes, management nodes, file servers.
  Note: when logging into a cluster, always keep in mind you're interacting with the login nodes, not the compute nodes.
– A chassis (or rack) houses one or more nodes together with power, cooling, connectivity, management, ... This is a rack of our Beowulf cluster.

– A node (or blade) contains one or more sockets, memory, a modest size disk drive
holding OS, swap space, and a small local scratch space.

– Each socket holds one processor, e.g., Intel Xeon or AMD Opteron.
– A processor contains one or more cores (logical processors).
– The cores perform FLOPS.

• Architecture of a computer cluster - network.

– Infiniband: 2.5 Gbits/sec.


– 4 x Infiniband: 10 Gbits/sec.
– Hardware: Adapter, switches.
– Nodes within a single chassis usually communicate faster.

Beowulf HPC cluster in department


• Access via ssh [email protected]. Use git or svn to synchronize project files.

• Read the instructions for submitting jobs: http://www.stat.ncsu.edu/computing/beowulf_instructions.php
  bwsubmit submits single-threaded jobs. bwsubmit mult submits multi-threaded jobs.
  Note: each user can use 20 threads at a time.

• Writing one script using the parallel package and submitting it with bwsubmit mult seems the easiest way to organize embarrassingly parallel R jobs.

henry2 HPC cluster at NCSU
• 1053 nodes (dual Xeon blade servers)

• For more info about henry2 configuration: http://www.ncsu.edu/itd/hpc/main.php

• Ask your advisor for an account.

• Log in: ssh [email protected].

• Users do not interact with compute nodes directly. Users submit jobs which are sched-
uled to be run on compute nodes.

• Some commonly used job schedulers

– Platform LSF (Load Sharing Facility)


– Altair PBS Pro
– Sun Grid Engine
– Microsoft HPC Server 2008
– TORQUE

• henry2 uses LSF.

• Some commonly used LSF commands

– bsub: submit a batch job to LSF system


– bkill: kill a running job
– bjobs: status of jobs in the queue
– bpeek: display output and error files
– bhist: history of one or more LSF jobs

– bqueues: info about LSF batch queues

• Example: Let’s try to run an R script simulate.fork.r on henry2.

• Have following files ready in the working directory

– simulate.fork.r: the R script file
– R.csh: the R configuration file
– henry2_submit_fork: the shell script file for LSF job submission
– RLRsim_2.0-11.tar.gz: necessary files for installing R libraries

• simulate.fork.r is the R script to be run on cluster

...
# RLRsim package required for LR and RLR test
install.packages ("./RLRsim_2.0-11.tar.gz", repos=NULL, lib="./libs")
library (RLRsim,lib.loc="./libs")
# load libraries
library (compiler)
library (nlme) # required for lme()
library (parallel)
...
# parallel simulations using mcmapply with load balance
set.seed (123, "L’Ecuyer")
mc = detectCores ()
mc
system.time (result.mcmapplylb <- mcmapply (compare.tests,
rep (n.pattern.list, each = length(sigma2.ratio.list), times = 1),
rep (sigma2.ratio.list, each = 1, times = length(n.pattern.list)),
MoreArgs = list (mc.size = 10), mc.cores = mc, mc.preschedule = FALSE))
...

• henry2_submit_fork is the shell script for LSF job submission

– #BSUB -n 12 requests 12 processors (logical cores, threads).
– #BSUB -W 10 requests maximum of 10 minutes.
– #BSUB -R em64t requests 64-bit machines.
– #BSUB -R span[hosts=1] requests all 12 processors to be on the same machine.
Note mcmapply relies on forking, which is a shared memory model.
– #BSUB -o out.%J and #BSUB -e err.%J specify the standard output and error files

• R.csh configures the path for R program

• Submit job to LSF scheduler by bsub and check status by bjobs

• Wait for the job to finish. Several files are generated in working directory

– out.111391 and err.111391: standard and error LSF output files


– simulate.fork.r.Rout: screen display of R session
– result.fork.RData: output data saved by R script

• Portion of out.111391

• Portion of simulate.fork.r.Rout

...
> # parallel simulations using mcmapply with load balance
> set.seed (123, "L’Ecuyer")
> mc = detectCores ()
> mc
[1] 12
> system.time (result.mcmapplylb <- mcmapply (
+ compare.tests,
+ rep (n.pattern.list, each = length(sigma2.ratio.list), times = 1),
+ rep (sigma2.ratio.list, each = 1, times = length(n.pattern.list)),
+ MoreArgs = list (mc.size = 10), mc.cores = mc, mc.preschedule = FALSE))
user system elapsed
284.954 8.231 26.172
>

> # save results
> save(n.pattern.list, sigma2.ratio.list,
+ result.mcmapplylb, file = "result.fork.RData")
>
>
> proc.time()
user system elapsed
287.951 8.374 29.419

• Shell script for submitting simulate.socket.r which uses clusterMap

– Note clusterMap relies on socket and in principle works with any number of
processors
– setenv MPICH_NO_LOCAL 1 specifies that all MPI messages will be passed through
sockets, not using shared memory available on a node

Other HPC resources on campus


• BRC cluster. R/Matlab and GPUs available. Ask Tao Hu.
http://scarlatti.statgen.ncsu.edu/cluster_workshop/doku.php

• ARC cluster. Ask your advisor for an account. R/Matlab not available. Only compiled
code. GPUs available.
http://moss.csc.ncsu.edu/~mueller/cluster/arc/

9 Lecture 9, Feb 16
Announcements
• HW2 graded. grade unityID.md committed to your master branch.

Last Time
• Cluster computing.

Today
• HW2 feedback.

• GPU computing.

HW2 feedback
• Solution sketch in Matlab and Julia:
http://hua-zhou.github.io/teaching/st790-2015spr/hw02sol.html

• Languages (Matlab, Julia, R, Python).

– For CPU code, Julia offers more low-level memory management capabilities, lead-
ing to more efficient computation.
– For GPU programming, Matlab wins hands down in ease of use. Julia GPU com-
puting relies on the CUDArt.jl and CUBLAS.jl packages. Currently CUBLAS.jl
implements approximately half of BLAS functions, including gemm. For non-BLAS
computations such as elementwise multiplication and division, users need to write
their own CUDA kernel functions.
For using GPU in Python, ask Xiang Zhang and Zhen Han. For using gputools
package in R, ask Brian Naughton.

• Effects of starting points. Non-convexity implies possible existence of multiple local minima. Identifiability issue: VW = (VO)(O^{-1}W) for any non-singular r × r matrix O. What happens when starting from v_ij^(0) = w_jk^(0) ≡ 1?

• Interpretability of basis images from NNMF. The following figure (Hastie et al., 2009,
p55) contrasts the different basis images obtained by NNMF, VQ (vector quantization),
and PCA. For a mathematical explanation of what NNMF does, see Donoho and
Stodden (2004).

• Different kinds of GPUs. I ran the same Matlab and Julia code on the teaching server, a desktop, and a laptop. They represent common GPUs we see every day. Note these models are a couple of years old and represent technology from around 2011.

• CPU vs GPU.

– Gain of GPU over CPU depends on specific cards and precision. Baby GPUs on
laptops show no gain on DP computations.

• GPU SP (single precision) vs GPU DP (double precision).

– Do they get same objective values? Do we have to use double precision? For ex-
ample, in MCMC, Monte Carlo errors often far exceed numerical roundoff errors.
– How’s the timing using SP vs DP? Tesla card has similar SP and DP performance.
GTX card has higher SP performance than DP. Baby GPUs on laptops show no
gain on DP computations.

Introduction to GPU computing


• GPUs are ubiquitous: servers, desktops, and laptops.

NVIDIA GPUs           Tesla M2090            GTX 580       GT 650M
Computers             servers, cluster       desktop       laptop
Main usage            scientific computing   gaming        gaming
Current version       K40                    GTX 980       GTX 900M
Memory                6GB                    1.5GB         1GB
Memory bandwidth      177GB/sec              192GB/sec     80GB/sec
Number of cores       512                    512           384
Processor clock       1.3GHz                 1.5GHz        0.9GHz
Peak DP performance   666Gflops
Peak SP performance   1332Gflops             1581Gflops    691Gflops
Release price         $2500                  $500          OEM

• Cost effective for high performance computing.

• GPU architecture vs CPU architecture.

– GPUs contain 100s of processing cores on a single card; several cards can fit in a
desktop PC
– Each core carries out the same operations in parallel on different input data –
single program, multiple data (SPMD) paradigm.
– Extremely high arithmetic intensity *if* one can transfer the data onto and results
off of the processors quickly.

(Figure: CPU architecture vs GPU architecture.)

An analogy taken from Andrew Beam’s presentation in ST790. Also see https://www.youtube.com/watch?v=-P28LKWTzrI.

• Which cards to use?

– Three major manufacturers of GPUs: AMD, NVIDIA, and Intel.


– So far NVIDIA cards are more widely adopted for GPGPU.
E.g., GPU servers in our department and NCSU henry2 cluster all have NVIDIA.
– NVIDIA has a much richer set of GPU math libraries

                     AMD                  NVIDIA                               Intel
Cards                ATI Radeon           GTX, Tesla                           Xeon Phi coprocessor
Language             OpenCL               CUDA C/C++, PGI CUDA Fortran         C/C++, Fortran, OpenCL
GPU math libraries   clMath (BLAS, FFT)   cuBLAS, cuFFT, cuSPARSE, cuSolver,   MKL
                                          cuRAND, CUDA MATH, Thrust, ...
Platforms            Linux, Windows       Linux, Windows, MacOS                Linux, Windows

Note: on the other hand, the cross-platform feature of OpenCL, adopted by Intel and AMD, is attractive.

• My experience with GPGPU (general purpose GPU computing).

– Almost always involves (new) algorithm development and/or revamping CPU code.
– Do research before going for GPGPU.
– Easier to develop in C/C++ (free compiler), Fortran (compiler $), and Matlab.
– Do not reinvent the wheel – use libraries.

• Before using GPUs, do following.

0. Frustrated by slow code ...


1. Am I using the right algorithm(s)?
Go to your ST758 notes or a numerical analysis book.
2. Repeat: Profile and optimize original code
3. Can a compiled language or optimized library (MKL, ATLAS) help?
4. Identify the bottleneck routine and research the potential gain on GPU. Do your
own benchmark specific to your own problem and data size
5. Can my data fit into GPU memory?
6. Can other steps besides the bottleneck be easily implemented on GPU? Will any
of them become the new bottleneck?
7. Decide the toolchain: Matlab, Julia, CUDA, PGI toolchain, ...

• GPGPU development toolchains.

– Use a higher level language such as Matlab, Julia or Python, if they happen to
provide all functions we need.
– CUDA toolchain provided by NVIDIA: https://developer.nvidia.com/cuda-zone
  ∗ C/C++
  ∗ free
  ∗ only for NVIDIA cards
– PGI toolchain (CUDA Fortran): https://www.pgroup.com/resources/cudafortran.htm
  ∗ C/C++, Fortran
  ∗ $$$
  ∗ only for NVIDIA cards
– OpenCL (Open Computing Language)
  ∗ open source
  ∗ specs for cross-platform, parallel programming of modern processors (PCs, servers, handheld/embedded devices)
  ∗ adopted by Intel, AMD, NVIDIA, Qualcomm, ...

Mathematical libraries on GPUs


• Many statistical computing subroutines are covered by the BLAS, LAPACK, sparse linear algebra, random number generation, and other standard libraries.

• Availability of mathematical libraries on GPUs.

– NVIDIA CUDA math libraries.
  ∗ Optimized for NVIDIA GPUs
  ∗ cuBLAS, cuSPARSE, cuRAND, cuFFT, CUDA Math Library, Thrust (data structures and algorithms), cuSolver (CUDA v7.0)
  ∗ Platforms: Linux (free), MacOS (free), and Windows (free)
– Intel MKL library.
  ∗ Supports both Intel CPUs and Xeon Phi coprocessors since v11.0 (2013)
  ∗ BLAS, LAPACK, FFT, sparse linear algebra, random number generation, ...
  ∗ Platforms: Linux (free) and Windows ($); no MacOS support
– AMD clMath library.
  ∗ For AMD GPUs
  ∗ BLAS, FFT
  ∗ Platforms: Linux (free) and Windows (free)
– Third-party libraries
  ∗ CULA ($): CUDA LAPACK
  ∗ MAGMA (free): OpenCL LAPACK
Note: NVIDIA's rich collection of math libraries is very attractive.

• Some dense linear algebra benchmark results.

– cuBLAS on NVIDIA K40m.

– zgemm cuBLAS on NVIDIA K40m vs MKL on Xeon E5-2697 v2 @ 2.7GHz.

– dgemm MKL on Xeon Phi c 7120P vs MKL on Xeon 12-core E5-2697 v2 @ 2.7GHz.

• Sparse linear algebra. cuSPARSE on K40m vs MKL on Xeon E5-2697 v2 @ 2.7GHz.

• Random number generation. cuRAND on K40m vs MKL on Xeon E5-2697 v2 @ 2.7GHz.

10 Lecture 10, Feb 18
Announcements
• TA’s Friday office hour changes to Thu Feb 19 @ 2P-3P.

• HW3 deadline extended to Feb 25 @ 11:59PM.

Last Time
• GPU computing: introduction.

Today
• GPU computing: Matlab, Julia, R.

• GPU computing: case studies.

• Convex programming.

GPU computing in Matlab


• Getting started.

– Query available GPU devices: gpuDevice().

– List built-in functions that support GPU: methods(’gpuArray’). Nearly 300


built-in functions in Matlab 2014a support GPU.

• Scheme for GPU algorithm development on Matlab.

% transfer data to GPU and initialize variables


gX = gpuArray (X);
gY = gpuArray (Y);
gBetahat = gpuArray.randn (5, 1);
...

% computation on GPU
...

% transfer result off GPU


betahat = gather (gBetahat);

Note: the key is to minimize memory transfer between host memory and GPU memory.

• Always benchmark the specific bottleneck routine in CPU. If the bottleneck routine
does not enjoy GPU acceleration, there is no point embarking on GPU computing. E.g.,
to benchmark A\b (solve linear equations) on my desktop: paralleldemo_gpu_backslash()
in Matlab 2014a

Intel i7 960 CPU vs NVIDIA GTX 580 GPU

GPU computing in R
• Not supported in base R (opportunity? HiPLARM package).

• A few contributed packages in specific application areas: gputools (some data-mining


algorithms), cudaBayesreg (fMRI analysis), ...

• Develop in C/C++ or Fortran and call compiled code from R.

GPU computing in Julia


Various packages are being developed at https://github.com/JuliaGPU. See HW2 solution
for the NNMF example using CUDArt.jl and CUBLAS.jl packages.

GPU case study 1: NNMF


If your language (Matlab or Julia) happens to provide interface to all GPU libraries you need,
then the job can be easily done. In the NNMF example, we only need matrix multiplication
and elementwise matrix multiplication and division. See HW2 solution for sample code.

GPU case study 2: PET imaging

• Data: tube readings y = (y1 , . . . , yd ).

• Estimate: photon emission intensities (pixels) λ = (λ1 , . . . , λp ).

• Poisson model:

      Y_i ∼ Poisson(Σ_{j=1}^p c_ij λ_j),

  where c_ij is the (pre-calculated) conditional probability that a photon emitted by the j-th pixel is detected by the i-th tube.

• Log-likelihood:

      L(λ|y) = Σ_i [ y_i ln(Σ_j c_ij λ_j) − Σ_j c_ij λ_j ] + const.

  Essentially a Poisson regression with the constraint λ_j ≥ 0.

• Issues: grainy image and slow convergence

• Regularized log-likelihood for a smoother image:

      L(λ|y) − (µ/2) Σ_{{j,k}∈N} (λ_j − λ_k)^2
      = Σ_i [ y_i ln(Σ_j c_ij λ_j) − Σ_j c_ij λ_j ] − (µ/2) Σ_{{j,k}∈N} (λ_j − λ_k)^2,

  where µ ≥ 0 is a tuning constant.

• Which algorithm?

– The Newton algorithm needs to solve a large linear system at each iteration.
– In ST758 (2014, notes p145-p149), we derived an MM algorithm for minimizing the regularized log-likelihood.

• MM algorithm for PET:

      Initialize: λ_j^(0) = 1
      repeat
          z_ij^(t) = (y_i c_ij λ_j^(t)) / (Σ_k c_ik λ_k^(t))
          for j = 1 to p do
              a = −2µ|N_j|,  b = µ(|N_j| λ_j^(t) + Σ_{k∈N_j} λ_k^(t)) − 1,  c = Σ_i z_ij^(t)
              λ_j^(t+1) = (−b − √(b^2 − 4ac)) / (2a)
          end for
      until convergence occurs

• The parameter constraints λ_j ≥ 0 are satisfied when starting from positive initial values.

• The update of z_ij^(t) succumbs to BLAS (matrix-vector multiplication) and elementwise multiplication and division.

• The loop for updating pixels can be carried out independently – massive parallelism.
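• A minimal R sketch of one MM iteration, just to make the updates above concrete (the benchmarked implementations below are in C++/cuBLAS). It assumes a d × p matrix C, count vector y, current iterate lambda, a list nbr whose j-th element holds the neighbor indices N_j, and mu > 0; all of these names are placeholders.

      # one MM iteration for the regularized PET problem (assumes mu > 0)
      pet.mm.update <- function(C, y, lambda, nbr, mu) {
        denom <- as.vector(C %*% lambda)                      # sum_k c_ik lambda_k
        colz  <- lambda * as.vector(t(C) %*% (y / denom))     # c_j = sum_i z_ij
        nsize <- sapply(nbr, length)                          # |N_j|
        nsum  <- sapply(nbr, function(idx) sum(lambda[idx]))  # sum_{k in N_j} lambda_k
        a <- -2 * mu * nsize
        b <- mu * (nsize * lambda + nsum) - 1
        (-b - sqrt(b^2 - 4 * a * colz)) / (2 * a)             # updated lambda
      }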

• A simulation example with n = 2016 and p = 4096 (provided by Ravi Varadhan).


CPU code implemented using BLAS in the GNU Scientific Library (GSL). GPU code
implemented using cuBLAS.

– Runtime on a typical computer in 2009:


CPU: Xeon E5450 @ 3GHZ (1 thread)
GPU: NVIDIA GeForce GTX 280

              CPU                                       GPU
  Penalty µ   Iters    Time    Function                 Iters    Time   Function        Speedup
  0           100000   14790   -7337.152765             100000   282    -7337.153387    52
  10^-7       24457    3682    -8500.083033             24457    70     -8508.112249    53
  10^-6       6294     919     -15432.45496             6294     18     -15432.45586    51
  10^-5       589      86      -55767.32966             589      2      -55767.32970    43

– Runtime on a typical computer in 2011:
  CPU: i7 @ 3.20GHz (1 thread)
  GPU: NVIDIA GeForce GTX 580

              CPU                                       GPU
  Penalty µ   Iters    Time    Function                 Iters    Time   Function        Speedup
  0           100000   11250   -7337.152765             100000   140    -7337.153387    80
  10^-7       24506    2573    -8500.082605             24506    35     -8508.112249    74
  10^-6       6294     710     -15432.45496             6294     9      -15432.45586    79
  10^-5       589      67      -55767.32966             589      0.8    -55767.32970    84

  Note: the performance of the CPU increases by about 30%, while that of the GPU increases by 100%.

• Lessons learnt.

– Algorithm development. EM/MM


∗ separate variables. Break a complex optimization into numerous independent
simple optimizations (massive parallelism)
∗ avoid solving large linear systems; only BLAS routines involved
∗ exploit high throughput of BLAS routines on GPU
– cuBLAS library eases the GPU implementation

• C++ source code is available at http://hua-zhou.github.io/teaching/st790-2015spr/pet.tar.gz.

GPU case study 3: MDR for GWAS


• SNP and GWAS.

– Human genome consists of 3 billion pairs of letters (A,C,G,T)

– Two people’s genome sequences are 99.9% identical
– SNP (single nucleotide polymorphism) is a single-letter change in DNA
– About 1 in 1000 DNA letters vary in the form of a SNP
– Genome-wide association study (GWAS) tries to find association of the trait of
interest (disease or not, blood pressure, height, ...) and each SNP

• MDR for detecting SNP interactions.

– Multifactor dimensionality reduction (MDR) is a method for detecting association


of a binary trait (0/1,control/disease) and SNP pairs
– For each SNP pair
∗ count number of 0s and 1s for each genotype combination
∗ declare that genotype combination as causal (n1 > n0 ) or protective (n1 < n0 )
∗ predict disease status using the declared causal/protective status of genotype
combinations
– Rank SNP pairs according to their predictive power
– Alternatively we can do Pearson’s χ2 test for contingency table

• Computation challenge and parallelism.

  – For either MDR or Pearson's test, we need to construct tables for all (p choose 2) SNP pairs.
  – For p = 10^6, (p choose 2) ≈ 5 × 10^11.
  – Massive parallelism: tables for SNP pairs (1, 2), . . . , (p − 1, p) obviously can be constructed in parallel.
  – How to organize? Merry-go-round. Golub and Van Loan (1996, Section 8.4)

• Try it.

– Download the source code


wget http://hua-zhou.github.io/teaching/st790-2015spr/mds.tar.gz
– Extract files tar -zxvf mds.tar.gz
– Browse the contents of the mds folder
∗ source: main.cpp, mds.cpp, mds.h, mds_kernel.cu
∗ make file: Makefile
∗ test data: gaw17.txt (500 individuals, 10000 SNPs)
– Compiling on the teaching server

g++ -c -O2 -I/usr/local/cuda-6.5/include *.cpp
nvcc -O2 -c *.cu
g++ -o mdsmain -L/usr/local/cuda-6.5/lib64 -lcudart *.o

or use the make file. It yields the executable mdsmain.


– Run it on teaching server.
CPU: Xeon E5-2640 @ 2.5GHZ (1 thread)
GPU: NVIDIA Tesla M2090.
We see > 20 fold speed up.

• CPU host code.

• GPU host code.

• GPU device code.

• Lessons learnt.

– Recognize massive parallelism. Common in genomics and statistics


– Algorithm development. Merry-go-round for organizing parallel pairs

• C++ source code is available at http://hua-zhou.github.io/teaching/st790-2015spr/mds.tar.gz.

Convex optimization problems


• A mathematical optimization problem, or just optimization problem, has the form

minimize f0 (x)
subject to fi (x) ≤ bi , i = 1, . . . , m.

Here f_0 : R^n → R is the objective function and f_i : R^n → R, i = 1, . . . , m, are the constraint functions.
Note: an equality constraint f_i(x) = b_i can be absorbed into the inequality constraints f_i(x) ≤ b_i and −f_i(x) ≤ −b_i.

• If the objective and constraint functions are convex, then it is called a convex opti-
mization problem.

Note: in a convex optimization problem, only linear equality constraints of the form Ax = b are allowed.

• Convex optimization is becoming a technology. Therefore it is important to recognize,


formulate, and solve convex optimization problems.

• A definite resource is the book Convex Optimization by Boyd and Vandenberghe, which
is freely available at http://stanford.edu/~boyd/cvxbook/. Same website has links
to slides, code, and lecture videos.

• In this course, we learn basic terminology and how to recognize and solve some standard
convex programming problems.

11 Lecture 11, Feb 23
Announcements
• HW4 posted (Linear Programming). Due next Friday Mar 6 @ 11:59PM.

Last Time
• GPU computing: Matlab, Julia, R.

• GPU computing: case studies.

• Convex optimization: introduction

Today
• Convex sets and convex functions.

Convex sets
• The line segment (interval) connecting points x and y is the set

      {αx + (1 − α)y : α ∈ [0, 1]}.

• A set C is convex if for every pair of points x and y lying in C the entire line segment
connecting them also lies in C.

• Examples of convex sets.

1. Any singleton.

2. Rn .
3. Any normed ball Br (c) = {x : kx − ck ≤ r}, open or closed, of radius r centered
at c.

Note: ℓ_p(x) = (Σ_{i=1}^n |x_i|^p)^{1/p} is not a proper norm for 0 < p < 1.
4. Any hyperplane {x : xT v = c}.
5. Any closed half space {x : xT v ≤ c} or open half space {x : xT v < c}.
6. Any polyhedron

      P = {x : a_j^T x ≤ b_j, j = 1, . . . , m; c_j^T x = d_j, j = 1, . . . , p}
        = {x : Ax ⪯ b, Cx = d}.

7. The set Sn++ of n × n pd matrices and the set Sn+ of n × n psd matrices.

8. The translate C + w of a convex set C.


9. The image A(C) of a convex set C under a linear map A.
10. The inverse image A−1 (C) of a convex set C under a linear map A.
11. The Cartesian product of two convex sets.

• A set C is a cone if for each x ∈ C the set {θx : θ ≥ 0} is also contained in C (closed under multiplication by nonnegative scalars). A cone that is convex is called a convex cone.

• Examples of cone:

1. The set Sn+ of psd matrices is a convex cone.

2. Is the set Sn++ of pd matrices a cone?
3. The set {(x, t) : ‖x‖_2 ≤ t} is called an ice cream (or Lorentz, or second order, or quadratic) cone.

4. Any norm cone {(x, t) : ‖x‖ ≤ t} is a convex cone.


5. Can you give a non-convex cone?

• A set C is said to be affine if

      θx + (1 − θ)y ∈ C for all θ ∈ R

for all x, y ∈ C. Note θ is not restricted to the unit interval. An affine set is convex but not conversely. Every affine set A can be represented as a translate v + S of a vector subspace S.

• Example: The solution set of linear equations C = {x : Ax = b} is affine. The
converse is also true. Every affine set can be expressed as the solution set of a system
of linear equations.

• The intersection of an arbitrary collection of convex, affine, or conical sets is convex,


affine, conical, respectively.

• Any convex combination Σ_{i=1}^m α_i x_i of points from a convex set C belongs to C. By convex combination we mean each α_i ≥ 0 and Σ_{i=1}^m α_i = 1.
  Similar closure properties apply to convex cones and affine sets if either the restriction Σ_{i=1}^m α_i = 1 or the constraints α_i ≥ 0, respectively, are lifted.

• The convex hull conv C of a nonempty set C is the smallest convex set containing C. Equivalently, conv C is the set generated by taking all convex combinations Σ_{i=1}^m α_i x_i of elements of C.
  The convex conical hull and affine hull of C are generated in a similar manner.

  What is the affine hull of the circle C = {x ∈ R^2 : ‖x‖_2^2 = 1}?

• (Carathéodory) For a nonempty set S ⊂ Rn , every point in conv S can be written as a


convex combination of at most n + 1 points from S. Furthermore, when S is compact,
conv S is also compact.

Convex functions
• A function f(x) on R^n is convex if

      f(αx + (1 − α)y) ≤ αf(x) + (1 − α)f(y)

for all x, y and all α ∈ [0, 1].
Note: to define a convex function f(x) on R^n, it is convenient to allow the value ∞ and disallow the value −∞. The set {x : f(x) < ∞} is a convex set called the essential domain of f, written dom f. A convex function is proper if dom f ≠ ∅ and f(x) > −∞ for all x.

• If the inequality in the definition is strict on dom f when α ∈ (0, 1) and x ≠ y, then the function is said to be strictly convex.

• A function f(x) is concave if its negative −f(x) is convex.
  Note: for concave functions we allow the value −∞ and disallow the value ∞.

• Examples of convex functions on Rn .

1. Affine function. Any affine function f (x) = aT x + b is both convex and concave.
2. Norm. Any norm (scalar homogeneity, triangle inequality and separates points)
on Rn is convex.
3. Indicator function. The indicator function

      δ_C(x) = 0 if x ∈ C, and ∞ if x ∉ C,

   of a nonempty set C is convex if and only if the set itself is convex.


4. Quadratic-over-linear function. The function f (x, y) = x2 /y, with dom f =
R × R++ is convex.

5. log-sum-exp. The function f (x) = ln(ex1 + · · · + exn ) is convex.


6. Geometric mean. The geometric mean f(x) = (∏_{i=1}^n x_i)^{1/n} is concave.

7. Log-det. The function f (X) = ln det X is concave on Sn++ . (Two proofs below.)

• Sublevel sets {x : f (x) ≤ c} of a convex function f (x) are convex. If f (x) is continuous
as well, then all sublevel sets are also closed.
The converse is not true. For example, the sublevel set {x ∈ R2+ : 1 − x1 x2 ≤ 0} is
closed and convex, but the function 1 − x1 x2 is not convex on the domain R2+ = {x :
x1 ≥ 0, x2 ≥ 0}.

• (Jensen's inequality) A function f(x) is convex if and only if

      f(Σ_{i=1}^m α_i x_i) ≤ Σ_{i=1}^m α_i f(x_i)

for all α_i ≥ 0 with Σ_{i=1}^m α_i = 1.
The probabilistic version states f[E(X)] ≤ E[f(X)].

• (First order condition, support hyperplane inequality) If f (x) is differentiable on the


open convex set C, then a necessary and sufficient condition for f (x) to be convex is

f (y) ≥ f (x) + df (x)(y − x)

for all x, y ∈ C. Furthermore, f (x) is strictly convex if and only if strict inequality
holds for all y 6= x.

• (Second order condition) Let f (x) be a twice differentiable function on the open convex
set C ⊂ Rn . If its Hessian matrix d2 f (x) is psd for all x, then f (x) is convex. When
d2 f (x) is pd for all x, f (x) is strictly convex.

12 Lecture 12, Feb 25
Announcements
• HW3 due today @ 11:59PM. Commit to your master branch and tag.

• HW4 posted (Linear Programming). Due next Friday Mar 6 @ 11:59PM (?).

Last Time
• Convex sets and convex functions.

Today
• Convex functions (cont’d).

• Overview of optimization softwares.

Convex function (cont’d)


• Closure properties of convex functions often offer the easiest way to check convexity.

1. (Nonnegative weighted sums) If f (x) and g(x) are convex and α and β are non-
negative constants, then αf (x) + βg(x) is convex.
2. (Composition) h(x) is convex and increasing, and g(x) is convex and finite, then
the functional composition f (x) = h ◦ g(x) is convex.
3. (Composition with affine mapping) If f (x) is convex, then the functional compo-
sition f (Ax + b) of f (x) with an affine function Ax + b is convex.
4. (Pointwise maximum and supremum) If fi (x) is convex for each fixed i ∈ I, then
g(x) = supi∈I fi (x) is convex provided it is proper. Note the index set I may be
infinite.
5. (Pointwise limit) If fm (x) is a sequence of convex functions, then limm→∞ fm (x)
is convex provided it exists and is proper.
6. (Integration) If f(x, y) is convex in x for each fixed y and µ is a measure, then the integral g(x) = ∫ f(x, y) dµ(y) is convex provided it is proper.
   Note: this is a generalization of the nonnegative weighted sum rule.
7. (Minimum) If f(x, y) is jointly convex in (x, y), then g(x) = inf_{y∈C} f(x, y) is convex provided it is proper and C is convex.

Note: the product of two convex functions is not necessarily convex. Counterexample: x^3 = x · x^2. However, if both functions are convex, nondecreasing (or nonincreasing), and positive on an interval, then the product is convex.

• Example: The function f(x) = x_[1] + ··· + x_[k], the sum of the k largest components of x ∈ R^n, is convex.
  Note: this is a hint for HW3 Q3.

  Proof. Write the function f as

      f(x) = max{x_{i_1} + ··· + x_{i_k} : 1 ≤ i_1 < i_2 < ··· < i_k ≤ n},

  i.e., the maximum of all possible sums of k different components of x. Since it is the pointwise maximum of (n choose k) linear functions, it is convex.


• Example: The dominant eigenvalue of a symmetric matrix,

      λ_max(M) = max_{‖x‖=1} x^T M x,

  is convex in M since it is a pointwise maximum of linear functions. Similarly the minimum eigenvalue λ_min(M) is concave in M.
  Note: the sum of the k largest eigenvalues is convex on S^n.

• More on the composition rule. Scalar composition f = h ◦ g, where h : R → R and g : R → R:

  – f is convex if h is convex and nondecreasing, and g is convex.
  – f is convex if h is convex and nonincreasing, and g is concave.
  – f is concave if h is concave and nondecreasing, and g is concave.
  – f is concave if h is concave and nonincreasing, and g is convex.

  Note: remember these by f''(x) = h''(g(x)) g'(x)^2 + h'(g(x)) g''(x). The same results apply to non-differentiable functions as well.

  Vector composition f(x) = h ◦ g(x) = h(g_1(x), . . . , g_k(x)), where g_i : R^n → R and h : R^k → R:

  – f is convex if h is convex, h is nondecreasing in each argument, and g_i are convex.
  – f is convex if h is convex, h is nonincreasing in each argument, and g_i are concave.
  – f is concave if h is concave, h is nondecreasing in each argument, and g_i are concave.

  Note: remember these by d^2 f(x) = Dg(x)^T d^2 h(g(x)) Dg(x) + (Dh(g(x)) ⊗ I_n) d^2 g(x). The same results apply to non-differentiable functions as well.

• The epigraph of a function f (x) is the set

epif = {(x, r) : f (x) ≤ r}.

• A function f (x) is convex if and only if its epigraph is a convex set.

• Example: The matrix fractional function

      f(x, Y) = x^T Y^{-1} x

  is convex on the domain R^n × S^n_{++}. This generalizes the convexity of the quadratic-over-linear function f(x, y) = x^2/y on R × R_{++}.
  Note: this is a hint for HW3 Q5 and Q10.

  Proof (by epigraph). The epigraph of the matrix fractional function is

      epi f = {(x, Y, t) : Y ≻ 0, x^T Y^{-1} x ≤ t}
            = {(x, Y, t) : [Y, x; x^T, t] ⪰ 0, Y ≻ 0},

  which is convex. The second equality follows from the linear algebra fact that a block matrix [A, B; B^T, C] is psd if and only if A is psd, the Schur complement C − B^T A^- B is psd, and (I − AA^-)B = 0 (i.e., B ∈ C(A)).

Note: the same argument yields joint convexity of the matrix function f(X, Y) = X^T Y^{-1} X on R^{m×n} × S^n_{++}.
Note (singular case): the result can be further extended to show that the function

      f(X, Y) = (1/2) u^T X^T Y^+ X u  if Xu ∈ C(Y),  and ∞ otherwise,

on R^{m×n} × S^n_+ is jointly convex in X and Y for any choice of u.

• (Line theorem) A function is convex if and only if it is convex when restricted to any line that intersects its domain. That is, f(x) is convex if and only if for any x ∈ dom f and v ∈ R^n, the function

      g(t) = f(x + tv)

  is convex on dom g = {t : x + tv ∈ dom f}.
  Note: not sure whether a function is convex? Generate a bunch of lines through the domain and plot the restrictions. If any of them is not convex, the function is not convex.
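  A small R sketch of that numerical check, assuming f takes a vector argument and x0 lies in its domain; the names f, x0, and the step grid are placeholders.

      # plot f restricted to a few random lines through x0; a non-convex profile
      # proves non-convexity (convex-looking profiles prove nothing by themselves)
      check.lines <- function(f, x0, nlines = 6, tgrid = seq(-1, 1, length = 101)) {
        par(mfrow = c(2, 3))
        for (l in 1:nlines) {
          v <- rnorm(length(x0))
          g <- sapply(tgrid, function(t) f(x0 + t * v))
          plot(tgrid, g, type = "l", xlab = "t", ylab = "f(x0 + t v)")
        }
      }
      # example: log-sum-exp is convex, so every profile should look convex
      check.lines(function(x) log(sum(exp(x))), x0 = rep(0, 3))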

• Example: Concavity of ln det Ω on S^n_{++}. This generalizes the concavity of ln x for x > 0.
  Note: this is a hint for HW3 Q10.

  Proof. Let X ∈ S^n_{++} and V ∈ S^n. Then

      g(t) = ln det(X + tV)
           = ln det[X^{1/2} (I + t X^{-1/2} V X^{-1/2}) X^{1/2}]
           = ln det X + ln det(I + t X^{-1/2} V X^{-1/2})
           = ln det X + Σ_{i=1}^n ln(1 + λ_i t),

  where λ_i are the eigenvalues of X^{-1/2} V X^{-1/2}. g(t) is concave in t, thus the ln det function is concave too.

Log-convexity
• A positive function f (x) is said to be log-convex if ln f (x) is convex.

• A log-convex function is convex. Why?

• Log-convex functions enjoy the same closure properties 1 through 7. In part 2 (composition rule), g is convex and h is log-convex. In addition, the collection of log-convex functions is closed under the formation of products and powers.
  Note: not all rules apply to log-concave functions! For instance, a nonnegative sum of log-concave functions is not necessarily log-concave.

• Examples:

  1. The beta function
         B(x, y) = ∫_0^1 u^{x-1} (1 − u)^{y-1} du
     is log-convex. Why?
  2. The gamma function
         Γ(t) = ∫_0^∞ x^{t-1} e^{-x} dx
     is log-convex. Why? (A numerical check is sketched after this list.)
  3. The moment function
         M(x) = ∫_0^∞ u^x f(u) du,
     where f is the density of a nonnegative random variable, is log-convex. Why?
  4. The Riemann zeta function
         ζ(s) = (1/Γ(s)) ∫_0^∞ x^{s-1} / (e^x − 1) dx
     is log-convex. Why?
  5. The Normal cdf
         Φ(x) = (1/√(2π)) ∫_{-∞}^x e^{-u^2/2} du
     is log-concave. See (Boyd and Vandenberghe, 2004, Exercise 3.54).
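  The promised numerical check for item 2: a quick R sketch that verifies midpoint convexity of ln Γ on a random grid. It is an illustration only, not a proof, and the grid range is an arbitrary choice.

      # check ln Gamma((x+y)/2) <= (ln Gamma(x) + ln Gamma(y))/2 numerically
      x <- runif(10000, 0.1, 10)
      y <- runif(10000, 0.1, 10)
      all(lgamma((x + y) / 2) <= (lgamma(x) + lgamma(y)) / 2 + 1e-12)  # TRUE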

• Example: Concavity of ln det Ω on S^n_{++}. This generalizes the concavity of ln x for x > 0.
  Note: this is a hint for HW3 Q10.

  Proof by log-concavity. The multivariate Gaussian density with pd covariance Σ,

      f(x) = (2π)^{-n/2} |det Σ|^{-1/2} e^{-x^T Σ^{-1} x / 2},

  integrates to one, which produces

      |det Σ|^{1/2} = (2π)^{-n/2} ∫ e^{-x^T Σ^{-1} x / 2} dx.

  This identity can be restated in terms of the precision matrix Ω = Σ^{-1} as

      ln det Ω = n ln(2π) − 2 ln ∫ e^{-x^T Ω x / 2} dx.

  The integral on the right is log-convex in Ω. Why? Is the integral log-concave? Thus ln det Ω is concave.

Hierarchy of convex optimization problems


In ST758, we spent a fair amount of time on the LS (least squares) problem. In this course,
we study LP (linear programming), QP (quadratic programming), SOCP (second-order cone
programming), SDP (semidefinite programming), and GP (geometric programming), with
an emphasis on statistical applications and software implementation.

Optimization softwares
Like computer languages, getting familiar with good optimization softwares broadens the
scope and scale of problems we are able to solve in statistics.

• The following table lists some of the best convex optimization softwares. Use of Gurobi and/or Mosek is highly recommended.
  Note: Gurobi is named after its founders: Zonghao Gu, Edward Rothberg, and Robert Bixby. Bixby founded CPLEX (now an IBM product), while Rothberg and Gu led the CPLEX development team for nearly a decade.
• Difference between modeling tool and solvers.

– Solvers (Gurobi, Mosek, ...) are concrete software implementation of optimization


algorithms.
– Modeling tools such as cvx and Convex.jl (Julia analog of cvx) implement the
disciplined convex programming (DCP) paradigm proposed by Grant and Boyd
(2008). http://stanford.edu/~boyd/papers/disc_cvx_prog.html. DCP pre-
scribes a set of simple rules from which users can construct convex optimization
problems easily.
Modeling tools usually have the capability to use a variety of solvers. But mod-
eling tools are solver agnostic so users do not have to worry about specific solver
interface.

LP MILP SOCP MISOCP SDP GP NLP MINLP R Matlab Julia Python Cost

JuMP.jl D D D D D D D O
Convex.jl D D D D D D O
cvx D D D D D D D D A

Gurobi D D D D D D D D A
Mosek D D D D D D D D D D D A
CPLEX D D D ? D D D A
SCS D D D D D D O
SeDuMi D D D ? D O
SDPT3 D D D ? D O
KNITRO D D D D D D D $

LP = Linear Programming, MILP = Mixed Integer LP, SOCP = Second-order cone pro-
gramming (includes QP, QCQP), MISOCP = Mixed Integer SOCP, SDP = Semidefinite
Programming, GP = Geometric Programming, NLP = (constrained) Nonlinear Program-
ming (includes general QP, QCQP), MINLP = Mixed Integer NLP, O = Open source, A =
Free academic license

Set up Gurobi on the teaching server
1. Gurobi 6.0 has been installed on the teaching server at /usr/local/gurobi600. But you have to obtain a (free) license first in order to use it.

2. Register for an account on http://www.gurobi.com/account. Be sure to use your


edu email and check Academic as your account type.

3. After confirmation of your academic account, log into your account and request a free
academic license at http://www.gurobi.com/download/licenses/free-academic.

4. Run grbgetkey command on the teaching server and enter the key you obtained in
step 3. Place the file at /home/USERID/.gurobi/

5. Now you should be able to use Gurobi in Matlab, R, and Julia.

Set up Mosek on the teaching server


1. Mosek 7 has been installed on the teaching server at
/usr/local/mosek/7/
License file is already put into your home directory.
/home/unityID/mosek/mosek.lic

2. You should be able to use Mosek in Matlab or R already.

Set up CVX on the teaching server


1. CVX v2.1 has been installed on the teaching server at /usr/local/cvx. But you have to obtain a (free) license first in order to use it.

2. Request a free academic (professional) license at http://cvxr.com/cvx/academic/


using your edu email. Your will receive the license file license.dat by email. Place
the license file at /home/USERID/.cvx/

3. Within Matlab, type

      cvx_setup /home/hzhou3/.cvx/cvx_license.dat

4. Now you should be able to use CVX in Matlab.
   Note: the standard license comes with the free solvers SeDuMi and SDPT3. The academic license also bundles Gurobi and Mosek.

13 Lecture 13, Mar 2
Announcements
• HW4 (LP) deadline extended to Mon, Mar 16 @ 11:59PM.

• HW5 (QP, SOCP) posted. Due Fri, Mar 20 @ 11:59PM. http://hua-zhou.github.io/teaching/st790-2015spr/ST790-2015-HW5.pdf

Last Time
• Convex and log-convex functions.

• Overview of optimization softwares.

Today
• LP (linear programming).

Linear programming (LP)


• A general linear program takes the form

      minimize    c^T x
      subject to  Ax = b
                  Gx ⪯ h.

  A linear program is a convex optimization problem. Why?

• The standard form of an LP is

      minimize    c^T x
      subject to  Ax = b
                  x ⪰ 0.

  To transform a general linear program into the standard form, we introduce slack variables s ⪰ 0 such that Gx + s = h. Then we write x = x^+ − x^−, where x^+ ⪰ 0 and x^− ⪰ 0. This yields the problem

      minimize    c^T (x^+ − x^−)
      subject to  A(x^+ − x^−) = b
                  G(x^+ − x^−) + s = h
                  x^+ ⪰ 0, x^− ⪰ 0, s ⪰ 0

  in x^+, x^−, and s.
  Note: slack variables are often used to transform a complicated inequality constraint into simple non-negativity constraints.

• The inequality form of an LP is

      minimize    c^T x
      subject to  Gx ⪯ h.

  Note: some software, e.g., solveLP in R, requires an LP to be written in either standard or inequality form. However, a good software package should do this for you!

• A piecewise-linear minimization problem

      minimize    max_{i=1,...,m} (a_i^T x + b_i)

  can be transformed to an LP

      minimize    t
      subject to  a_i^T x + b_i ≤ t, i = 1, . . . , m,

  in x and t. Apparently

      minimize    max_{i=1,...,m} |a_i^T x + b_i|

  and

      minimize    max_{i=1,...,m} (a_i^T x + b_i)_+

  are also LPs.
  Note: any convex optimization problem

      minimize    f_0(x)
      subject to  f_i(x) ≤ 0, i = 1, . . . , m
                  a_i^T x = b_i, i = 1, . . . , p,

  where f_0, . . . , f_m are convex functions, can be transformed to the epigraph form

      minimize    t
      subject to  f_0(x) − t ≤ 0
                  f_i(x) ≤ 0, i = 1, . . . , m
                  a_i^T x = b_i, i = 1, . . . , p

  in variables x and t. That is why people often say linear programming is universal.

• The linear fractional program

      minimize    (c^T x + d) / (e^T x + f)
      subject to  Ax = b
                  Gx ⪯ h
                  e^T x + f > 0

  can be transformed to an LP

      minimize    c^T y + dz
      subject to  Gy − zh ⪯ 0
                  Ay − zb = 0
                  e^T y + fz = 1
                  z ≥ 0

  in y and z, via the transformation of variables

      y = x / (e^T x + f),   z = 1 / (e^T x + f).

  See Boyd and Vandenberghe (2004, Section 4.3.2) for a proof.

• Example. Compressed sensing (Candès and Tao, 2006; Donoho, 2006) tries to address a fundamental question: how to compress and transmit a complex signal (e.g., musical clips, mega-pixel images), which can be decoded to recover the original signal?
  Suppose a signal x ∈ R^n is sparse with s non-zeros. We under-sample the signal by multiplying by a measurement matrix, y = Ax, where A ∈ R^{m×n} has iid normal entries. Candès et al. (2006) show that the solution to

      minimize    ‖x‖_1
      subject to  Ax = y

  exactly recovers the true signal under certain conditions on A when n ≫ s and m ≈ s ln(n/s). Why is sparsity a reasonable assumption? Virtually all real-world images have low information content.
  The ℓ_1 minimization problem apparently is an LP: writing x = x^+ − x^−,

      minimize    1^T (x^+ + x^−)
      subject to  A(x^+ − x^−) = y
                  x^+ ⪰ 0, x^− ⪰ 0.

  Let's work on a numerical example: http://hua-zhou.github.io/teaching/st790-2015spr/demo_cs.html
• Example. Quantile regression (HW4). In linear regression, we model the mean of
response variable as a function of covariates. In many situations, the error variance is
not constant, the distribution of y may be asymmetric, or we simply care about the
quantile(s) of response variable. Quantile regression offers a better modeling tool in
these applications.

  In τ-quantile regression, we minimize the loss function

      f(β) = Σ_{i=1}^n ρ_τ(y_i − x_i^T β),

  where ρ_τ(z) = z(τ − 1_{z<0}). Writing y − Xβ = r^+ − r^−, this is equivalent to the LP

      minimize    τ 1^T r^+ + (1 − τ) 1^T r^−
      subject to  r^+ − r^− = y − Xβ
                  r^+ ⪰ 0, r^− ⪰ 0

  in r^+, r^−, and β.
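  As a sanity check for the HW4 LP solution, the quantreg package fits the same model with its own interior-point code; X and y are placeholders for your design matrix and response.

      # tau-quantile regression via the quantreg package
      library(quantreg)
      fit <- rq(y ~ X, tau = 0.25)   # compare coef(fit) with the LP solution
      coef(fit)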

• Example: ℓ_1 regression (HW4). A popular method in robust statistics is the median absolute deviation (MAD) regression that minimizes the ℓ_1 norm of the residual vector, ‖y − Xβ‖_1. This apparently is equivalent to the LP

      minimize    1^T (r^+ + r^−)
      subject to  r^+ − r^− = y − Xβ
                  r^+ ⪰ 0, r^− ⪰ 0

  in r^+, r^−, and β.
  Note: ℓ_1 regression = MAD = 1/2-quantile regression.
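  A minimal sketch of this LP with the lpSolve package, splitting β into positive and negative parts since lpSolve assumes nonnegative variables; X (n × p) and y are placeholders.

      # l1 (MAD) regression as an LP: variables (beta+, beta-, r+, r-)
      library(lpSolve)
      l1.reg <- function(X, y) {
        n <- nrow(X); p <- ncol(X)
        obj <- c(rep(0, 2 * p), rep(1, 2 * n))             # minimize 1'(r+ + r-)
        con <- cbind(X, -X, diag(n), -diag(n))             # X(b+ - b-) + r+ - r- = y
        sol <- lp("min", obj, con, rep("=", n), y)
        sol$solution[1:p] - sol$solution[(p + 1):(2 * p)]  # beta-hat
      }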

• Example: ℓ_∞ regression (Chebychev approximation). Minimizing the worst possible residual ‖y − Xβ‖_∞ is equivalent to the LP

      minimize    t
      subject to  −t ≤ y_i − x_i^T β ≤ t, i = 1, . . . , n

  in variables β and t.

• Example: Dantzig selector (HW4). Candès and Tao (2007) propose a variable selection method called the Dantzig selector that solves

      minimize    ‖X^T (y − Xβ)‖_∞
      subject to  Σ_{j=2}^p |β_j| ≤ t,

  which can be transformed to an LP. Indeed they name the method after George Dantzig, who invented the simplex method for efficiently solving LPs in the 50s.

Note: apparently any loss/penalty or loss/constraint combination of the form

      {ℓ_1, ℓ_∞, quantile} × {ℓ_1, ℓ_∞, quantile},

possibly with affine (equality and/or inequality) constraints, can be formulated as an LP.

• Example: 1-norm SVM (HW4). In two-class classification problems, we are given training data (x_i, y_i), i = 1, . . . , n, where x_i ∈ R^p are feature vectors and y_i ∈ {−1, 1} are class labels. Zhu et al. (2004) propose the 1-norm support vector machine (svm) that achieves the dual purpose of classification and feature selection. Denote the solution of the optimization problem

      minimize    Σ_{i=1}^n [1 − y_i (β_0 + Σ_{j=1}^p x_ij β_j)]_+
      subject to  ‖β‖_1 = Σ_{j=1}^p |β_j| ≤ t

  by β̂_0(t) and β̂(t). The 1-norm svm classifies a future feature vector x by the sign of the fitted model

      f̂(x) = β̂_0 + x^T β̂.

• Many more applications: airport scheduling (Copenhagen airport uses Gurobi), airline flight scheduling, NFL scheduling, match.com, LaTeX, ...

14 Lecture 14, Mar 4
Announcements
• HW4 (LP) deadline extended to Mon, Mar 16 @ 11:59PM.

• HW5 (QP, SOCP) posted. Due Fri, Mar 20 @ 11:59PM. http://hua-zhou.github.io/teaching/st790-2015spr/ST790-2015-HW5.pdf

Last Time
• LP (linear programming).

Today
• QP (quadratic programming).

• SOCP (second order cone programming).

More LP
• In the worst-k error regression (HW3), we minimize Σ_{i=1}^k |r|_(i), where |r|_(1) ≥ |r|_(2) ≥ ··· ≥ |r|_(n) are the order statistics of the absolute values of the residuals |r_i| = |y_i − x_i^T β|. This can be solved by the LP

      minimize    kt + 1^T z
      subject to  −t1 − z ⪯ y − Xβ ⪯ t1 + z
                  z ⪰ 0

  in variables β ∈ R^p, t ∈ R, and z ∈ R^n.

• Our catalogue of linear parts: compositions of ℓ_1 (absolute values), ℓ_∞ (max), check loss (quantile), hinge loss (svm), sum of the k largest components, ... with affine functions.

Quadratic programming (QP)


• A quadratic program (QP) has a quadratic objective function and affine constraint functions:

      minimize    (1/2) x^T P x + q^T x + r
      subject to  Gx ⪯ h
                  Ax = b,

  where we require P ∈ S^n_+ (why?).

• Example. The least squares problem minimizes ky − Xβk22 , which obviously is a QP.

• Example. Least squares with linear constraints. For example, nonnegative least squares (NNLS):

      minimize    (1/2) ‖y − Xβ‖_2^2
      subject to  β ⪰ 0.

  Note: in NNMF (nonnegative matrix factorization), the objective ‖X − VW‖_F^2 can be minimized by alternating NNLS.
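  A minimal sketch of NNLS with the quadprog package; it assumes X has full column rank so that X'X is positive definite (solve.QP requires this), and X, y are placeholders.

      # NNLS: minimize (1/2)||y - X b||^2 subject to b >= 0
      library(quadprog)
      nnls.qp <- function(X, y) {
        p <- ncol(X)
        solve.QP(Dmat = crossprod(X),            # X'X
                 dvec = drop(crossprod(X, y)),   # X'y
                 Amat = diag(p),                 # constraint: I' b >= 0
                 bvec = rep(0, p))$solution
      }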

• Example. Lasso regression (Tibshirani, 1996; Donoho and Johnstone, 1994) minimizes the least squares loss with the ℓ_1 (lasso) penalty

      minimize    (1/2) ‖y − β_0 1 − Xβ‖_2^2 + λ ‖β‖_1,

  where λ ≥ 0 is a tuning parameter. Writing β = β^+ − β^−, the equivalent QP is

      minimize    (1/2) (β^+ − β^−)^T X^T (I − 11^T/n) X (β^+ − β^−)
                  − y^T (I − 11^T/n) X (β^+ − β^−) + λ 1^T (β^+ + β^−)
      subject to  β^+ ⪰ 0, β^− ⪰ 0

  in β^+ and β^−.

• Example: Elastic net (Zou and Hastie, 2005)

      minimize    (1/2) ‖y − β_0 1 − Xβ‖_2^2 + λ(α ‖β‖_1 + (1 − α) ‖β‖_2^2),

  where λ ≥ 0 and α ∈ [0, 1] are tuning parameters.

• Example: Generalized lasso

      minimize    (1/2) ‖y − Xβ‖_2^2 + λ ‖Dβ‖_1,

  where λ ≥ 0 is a tuning parameter and D is a fixed regularization matrix. This generates numerous applications (Tibshirani and Taylor, 2011).

• Example: Image denoising by anisotropic penalty. See HW5.

• Example: (Linearly) constrained lasso

      minimize    (1/2) ‖y − β_0 1 − Xβ‖_2^2 + λ ‖β‖_1
      subject to  Gβ ⪯ h
                  Aβ = b,

  where λ ≥ 0 is a tuning parameter.

• Example: The Huber loss function

      φ(r) = r^2 if |r| ≤ M, and M(2|r| − M) if |r| > M,

  is commonly used in robust statistics. The robust regression problem

      minimize    Σ_{i=1}^n φ(y_i − β_0 − x_i^T β)

  can be transformed to a QP

      minimize    u^T u + 2M 1^T v
      subject to  −u − v ⪯ y − Xβ ⪯ u + v
                  0 ⪯ u ⪯ M1, v ⪰ 0

  in u, v ∈ R^n and β ∈ R^p. Hint: write |r_i| = (|r_i| ∧ M) + (|r_i| − M)_+ = u_i + v_i.
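  Not the QP above, but a quick cross-check: MASS::rlm fits Huber-type robust regression by iteratively reweighted least squares, so its coefficients should be roughly comparable to the QP solution (rlm also estimates a residual scale, so the tuning constants are not identical); X and y are placeholders.

      # Huber robust regression via IRLS (MASS), for comparison with the QP
      library(MASS)
      fit <- rlm(y ~ X, psi = psi.huber)
      coef(fit)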

• Example: Support vector machines (SVM, HW5). In two-class classification problems, we are given training data (x_i, y_i), i = 1, . . . , n, where x_i ∈ R^p are feature vectors and y_i ∈ {−1, 1} are class labels. The support vector machine solves the optimization problem

      minimize    Σ_{i=1}^n [1 − y_i (β_0 + Σ_{j=1}^p x_ij β_j)]_+ + λ ‖β‖_2^2,

  where λ ≥ 0 is a tuning parameter. This is a QP.

Second-order cone programming (SOCP)
• A second-order cone program (SOCP)

      minimize    f^T x
      subject to  ‖A_i x + b_i‖_2 ≤ c_i^T x + d_i, i = 1, . . . , m
                  Fx = g

  over x ∈ R^n. This says the points (A_i x + b_i, c_i^T x + d_i) live in the second order cone (ice cream cone, Lorentz cone, quadratic cone)

      Q^{n+1} = {(x, t) : ‖x‖_2 ≤ t}

  in R^{n+1}.
  Note: QP is a special case of SOCP. Why?

• When ci = 0 for i = 1, . . . , m, SOCP is equivalent to a quadratically constrained


quadratic program (QCQP)

minimize (1/2)xT P0 x + q0T x


subject to (1/2)xT Pi x + qiT x + ri ≤ 0, i = 1, . . . , m
Ax = b,

where Pi ∈ Sn+ , i = 0, 1, . . . , m. Why?

• Example: Group lasso (HW5). In many applications, we need to perform variable selection at the group level. For instance, in factorial analysis, we want to select or de-select the group of regression coefficients for a factor simultaneously. Yuan and Lin (2006) propose the group lasso

      minimize    (1/2) ‖y − β_0 1 − Xβ‖_2^2 + λ Σ_{g=1}^G w_g ‖β_g‖_2,

  where β_g is the subvector of regression coefficients for group g, and w_g are fixed group weights. This is equivalent to the SOCP

      minimize    (1/2) β^T X^T (I − 11^T/n) X β − y^T (I − 11^T/n) X β + λ Σ_{g=1}^G w_g t_g
      subject to  ‖β_g‖_2 ≤ t_g, g = 1, . . . , G,

  in variables β and t_1, . . . , t_G.
  Note: overlapping groups are allowed here.

• Example. Sparse group lasso

      minimize    (1/2) ‖y − β_0 1 − Xβ‖_2^2 + λ_1 ‖β‖_1 + λ_2 Σ_{g=1}^G w_g ‖β_g‖_2

  achieves sparsity at both the group and individual coefficient level and can be solved by SOCP as well.
  Note: apparently we can solve any of the previous loss functions (quantile, ℓ_1, composite quantile, Huber, multi-response model) plus a group or sparse group penalty by SOCP.

15 Lecture 15, Mar 16
Announcements
• HW4 (LP) due today 11:59PM.

• HW5 (QP, SOCP) due this Fri, Mar 20 @ 11:59PM.

Last Time
• QP (quadratic programming).

• SOCP (second order cone programming).

Today
• SOCP (cont’d).

SOCP (cont’d)
• Example. Square-root lasso (Belloni et al., 2011) minimizes

      ‖y − β_0 1 − Xβ‖_2 + λ ‖β‖_1

  by SOCP. This variant generates the same solution path as lasso (why?) but simplifies the choice of λ.
  A demo example: http://hua-zhou.github.io/teaching/st790-2015spr/demo_lasso.html

• Example: Image denoising by ROF model. See HW5 Q4.

• A rotated quadratic cone in R^{n+2} is

      Q^{n+2}_r = {(x, t_1, t_2) : ‖x‖_2^2 ≤ 2 t_1 t_2, t_1 ≥ 0, t_2 ≥ 0}.

  A point x ∈ R^{n+1} belongs to the second order cone Q^{n+1} if and only if

      [ I_{n-1}    0        0
        0         −1/√2    1/√2
        0          1/√2    1/√2 ] x

  belongs to the rotated quadratic cone Q^{n+1}_r.

Note: Gurobi allows users to input second order cone constraints and quadratic constraints directly.
Note: Mosek allows users to input second order cone constraints, quadratic constraints, and rotated quadratic cone constraints directly.

• The following sets are (rotated) quadratic cone representable:

  – (Absolute values) |x| ≤ t ⇔ (x, t) ∈ Q^2.
  – (Euclidean norms) ‖x‖_2 ≤ t ⇔ (x, t) ∈ Q^{n+1}.
  – (Squared Euclidean norms) ‖x‖_2^2 ≤ t ⇔ (x, t, 1/2) ∈ Q^{n+2}_r.
  – (Ellipsoid) For P ∈ S^n_+ with P = F^T F, where F ∈ R^{n×k},

        (1/2) x^T P x + c^T x + r ≤ 0
        ⇔ x^T P x ≤ 2t, t + c^T x + r = 0
        ⇔ (Fx, t, 1) ∈ Q^{k+2}_r, t + c^T x + r = 0.

    Similarly,

        ‖F(x − c)‖_2 ≤ t ⇔ (y, t) ∈ Q^{n+1}, y = F(x − c).

    Note: this fact shows that QP and QCQP are instances of SOCP.
  – (Second order cones) ‖Ax + b‖_2 ≤ c^T x + d ⇔ (Ax + b, c^T x + d) ∈ Q^{m+1}.
  – (Simple polynomial sets)

        {(t, x) : |t| ≤ √x, x ≥ 0} = {(t, x) : (t, x, 1/2) ∈ Q^3_r}
        {(t, x) : t ≥ x^{-1}, x ≥ 0} = {(t, x) : (√2, x, t) ∈ Q^3_r}
        {(t, x) : t ≥ x^{3/2}, x ≥ 0} = {(t, x) : (x, s, t), (s, x, 1/8) ∈ Q^3_r}
        {(t, x) : t ≥ x^{5/3}, x ≥ 0} = {(t, x) : (x, s, t), (s, 1/8, z), (z, s, x) ∈ Q^3_r}
        {(t, x) : t ≥ x^{(2k−1)/k}, x ≥ 0}, k ≥ 2, can be represented similarly
        {(t, x) : t ≥ x^{-2}, x ≥ 0} = {(t, x) : (s, t, 1/2), (√2, x, s) ∈ Q^3_r}
        {(t, x, y) : t ≥ |x|^3/y^2, y ≥ 0} = {(t, x, y) : (x, z) ∈ Q^2, (z, y/2, s), (s, t/2, z) ∈ Q^3_r}

  – (Geometric mean) The hypograph of the (concave) geometric mean function

        K^n_gm = {(x, t) ∈ R^{n+1} : (x_1 x_2 ··· x_n)^{1/n} ≥ t, x ⪰ 0}

    can be represented by rotated quadratic cones. See (Lobo et al., 1998) for the derivation. For example,

        K^2_gm = {(x_1, x_2, t) : √(x_1 x_2) ≥ t, x_1, x_2 ≥ 0}
               = {(x_1, x_2, t) : (√2 t, x_1, x_2) ∈ Q^3_r}.

  – (Harmonic mean) The hypograph of the harmonic mean function (n^{-1} Σ_{i=1}^n x_i^{-1})^{-1} can be represented by rotated quadratic cones:

        (n^{-1} Σ_{i=1}^n x_i^{-1})^{-1} ≥ t, x ⪰ 0
        ⇔ n^{-1} Σ_{i=1}^n x_i^{-1} ≤ y, x ⪰ 0   (taking y = 1/t)
        ⇔ x_i z_i ≥ 1, i = 1, . . . , n, 1^T z = ny, x ⪰ 0
        ⇔ 2 x_i z_i ≥ 2, i = 1, . . . , n, 1^T z = ny, x ⪰ 0, z ⪰ 0
        ⇔ (√2, x_i, z_i) ∈ Q^3_r, i = 1, . . . , n, 1^T z = ny, x ⪰ 0, z ⪰ 0.

  – (Convex increasing rational powers) For p, q ∈ Z_+ and p/q ≥ 1,

        K_{p/q} = {(x, t) : x^{p/q} ≤ t, x ≥ 0} = {(x, t) : (t 1_q, 1_{p−q}, x) ∈ K^p_gm}.

  – (Convex decreasing rational powers) For any p, q ∈ Z_+,

        K_{−p/q} = {(x, t) : x^{−p/q} ≤ t, x ≥ 0} = {(x, t) : (x 1_p, t 1_q, 1) ∈ K^{p+q}_gm}.

  – (Power cones) The power cone with rational powers is

        K^{n+1}_α = {(x, y) ∈ R^n_+ × R : |y| ≤ ∏_{j=1}^n x_j^{p_j/q_j}},

    where p_j, q_j are integers satisfying 0 < p_j ≤ q_j and Σ_{j=1}^n p_j/q_j = 1. Let β = lcm(q_1, . . . , q_n) and

        s_j = β Σ_{k=1}^j p_k/q_k, j = 1, . . . , n − 1.

    Then it can be represented as

        |y| ≤ (z_1 z_2 ··· z_β)^{1/β},
        z_1 = ··· = z_{s_1} = x_1, z_{s_1+1} = ··· = z_{s_2} = x_2, . . . , z_{s_{n−1}+1} = ··· = z_β = x_n.

References for the above examples: the papers (Lobo et al., 1998; Alizadeh and Goldfarb, 2003) and the book (Ben-Tal and Nemirovski, 2001, Lecture 3). Now our catalogue of SOCP terms includes all of the above.
Note: most of these functions are implemented as built-in functions in the convex optimization modeling language cvx.

• Example. ℓ_p regression with p ≥ 1 a rational number,

      minimize    ‖y − Xβ‖_p,

  can be formulated as an SOCP. Why? For instance, ℓ_{3/2} regression combines advantages of both robust ℓ_1 regression and least squares.
  Note: norm(x, p) is a built-in function in the convex optimization modeling language cvx.
16 Lecture 16, Mar 18
Announcements
• HW5 (QP, SOCP) due this Fri, Mar 20 @ 11:59PM.

• HW6 (SDP, GP, MIP) posted: http://hua-zhou.github.io/teaching/st790-2015spr/ST790-2015-HW6.pdf. Due Mon, Mar 30 @ 11:59PM.

• HW4 (LP) solution sketch posted: http://hua-zhou.github.io/teaching/st790-2015spr/hw04sol.html

Last Time
• SOCP (cont’d).

Today
• SDP (semidefinite programming).

• GP (geometric programming).

Semidefinite programming (SDP)

• A semidefinite program (SDP) has the form

minimize cT x
subject to x1 F1 + · · · + xn Fn + G  0 (LMI, linear matrix inequality)
Ax = b,

where G, F₁, …, Fₙ ∈ S^k, A ∈ R^{p×n}, and b ∈ R^p.

R When G, F₁, …, Fₙ are all diagonal, SDP reduces to LP.

• The standard form SDP has form

minimize tr(CX)
subject to tr(Ai X) = bi , i = 1, . . . , p
X  0,

where C, A1 , . . . , Ap ∈ Sn .

• An inequality form SDP has form

minimize cT x
subject to x1 A1 + · · · + xn An  B,

with variable x ∈ Rn , and parameters B, A1 , . . . , An ∈ Sn , c ∈ Rn .

• Exercise. Write LP, QP, QCQP, and SOCP in form of SDP.

• Example. Nearest correlation matrix. Let C_n be the convex set of n × n correlation matrices

    C_n = {X ∈ S^n_+ : x_ii = 1, i = 1, …, n}.

Given A ∈ S^n, often we need to find the closest correlation matrix to A:

    minimize    ‖A − X‖_F
    subject to  X ∈ C_n.

This projection problem can be solved via an SDP

minimize t
subject to kA − XkF ≤ t
X = X T , diag(X) = 1
X0

in variables X ∈ Rn×n and t ∈ R. The SOC constraint can be written as an LMI

    [ tI             vec(A − X) ]
    [ vec(A − X)ᵀ    t          ]  ⪰ 0
by the Schur complement lemma.
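A minimal Julia sketch of this projection, assuming the Convex.jl atoms (Semidefinite, diag, sumsquares, vec) and the SCS solver behave as in recent package versions; exact names may differ across versions:

    using Convex, SCS, LinearAlgebra

    n = 5
    A = Symmetric(randn(n, n))                 # a symmetric test matrix
    X = Semidefinite(n)                        # X ⪰ 0
    # minimizing ‖A − X‖_F² gives the same minimizer as ‖A − X‖_F
    problem = minimize(sumsquares(vec(A - X)), diag(X) == 1)
    solve!(problem, SCS.Optimizer)
    Xhat = evaluate(X)                         # nearest correlation matrix (up to solver tolerance)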

• Eigenvalue problems. Suppose

A(x) = A0 + x1 A1 + · · · xn An ,

where Ai ∈ Sm , i = 0, . . . , n. Let λ1 (x) ≥ λ2 (x) ≥ · · · ≥ λm (x) be the ordered


eigenvalues of A(x).

– Minimizing the maximal eigenvalue is equivalent to the SDP

minimize t
subject to A(x)  tI

in variables x ∈ Rⁿ and t ∈ R.

R Minimizing the sum of the k largest eigenvalues is an SDP too. How about minimizing the sum of all eigenvalues?

R Maximizing the minimum eigenvalue is an SDP as well.
– Minimizing the spread of the eigenvalues λ₁(x) − λₘ(x) is equivalent to the SDP

minimize t1 − tm
subject to tm I  A(x)  t1 I

in variables x ∈ Rn and t1 , tm ∈ R.
– Minimizing the spectral radius (or spectral norm) ρ(x) = max_{i=1,…,m} |λᵢ(x)| is equivalent to the SDP

minimize t
subject to −tI  A(x)  tI

in variables x ∈ Rn and t ∈ R.
– To minimize the condition number κ(x) = λ1 (x)/λm (x), note λ1 (x)/λm (x) ≤ t
if and only if there exists a µ > 0 such that µI  A(x)  µtI, or equivalently,
I  µ−1 A(x)  tI. With change of variables yi = xi /µ and s = 1/µ, we can
solve the SDP

minimize t
subject to I  sA0 + y1 A1 + · · · yn An  tI
s ≥ 0,

in variables y ∈ Rn and s, t ≥ 0. In other words, we normalize the spectrum by


the smallest eigenvalue and then minimize the largest eigenvalue of the normalized
LMI.

– Minimizing the ℓ₁ norm of the eigenvalues, |λ₁(x)| + ⋯ + |λₘ(x)|, is equivalent to the SDP

minimize tr(A+ ) + tr(A− )


subject to A(x) = A+ − A−
A+  0, A−  0,

in variables x ∈ Rn and A+ , A− ∈ Sn+ .


– Roots of the determinant. The determinant of a semidefinite matrix, det(A(x)) = ∏ᵢ₌₁ᵐ λᵢ(x), is neither convex nor concave, but rational powers of the determinant can be modeled using linear matrix inequalities. For a rational power 0 ≤ q ≤ 1/m, the function det(A(x))^q is concave and we have

    t ≤ det(A(x))^q
    ⇔ [ A(x)  Z       ]
      [ Zᵀ    diag(Z) ]  ⪰ 0, (z₁₁z₂₂⋯zₘₘ)^q ≥ t,

where Z ∈ R^{m×m} is a lower-triangular matrix. Similarly, for any rational q > 0, we have

    t ≥ det(A(x))^{−q}
    ⇔ [ A(x)  Z       ]
      [ Zᵀ    diag(Z) ]  ⪰ 0, (z₁₁z₂₂⋯zₘₘ)^{−q} ≤ t

for a lower-triangular Z.
– Trace of the inverse. tr A(x)⁻¹ = Σᵢ₌₁ᵐ λᵢ⁻¹(x) is a convex function and can be minimized using the SDP

    minimize    tr B
    subject to  [ B  I    ]
                [ I  A(x) ]  ⪰ 0.

  Note tr A(x)⁻¹ = Σᵢ₌₁ᵐ eᵢᵀA(x)⁻¹eᵢ. Therefore another equivalent formulation is

    minimize    Σᵢ₌₁ᵐ tᵢ
    subject to  eᵢᵀA(x)⁻¹eᵢ ≤ tᵢ, i = 1, …, m.

  Now the constraints can be expressed by the LMI

    eᵢᵀA(x)⁻¹eᵢ ≤ tᵢ ⇔ [ A(x)  eᵢ ]
                        [ eᵢᵀ   tᵢ ]  ⪰ 0.

R See (Ben-Tal and Nemirovski, 2001, Lecture 4, p146-p151) for proofs of the above facts.

R lambda_max, lambda_min, lambda_sum_largest, lambda_sum_smallest, det_rootn, and trace_inv are implemented in cvx for Matlab.

R lambda max and lambda min are implemented in the Convex.jl package for Julia.

• Example. Experiment design. See HW6 Q1 http://hua-zhou.github.io/teaching/


st790-2015spr/ST790-2015-HW6.pdf

17 Lecture 17, Mar 23
Announcements
• HW6 (SDP, GP, MIP) due next Mon, Mar 30 @ 11:59PM. http://hua-zhou.github.
io/teaching/st790-2015spr/ST790-2015-HW6.pdf

• Lecture pace too fast? For this course I put priority on diversity of topics over thoroughness. The goal is to introduce a variety of tools that I consider useful but not covered in the standard statistics curriculum. That means, given the time limitation, many details have to be omitted. On the other hand, I have tried hard to point you to the best resources I know of (textbooks, lecture videos, best software, ...) on these topics. It is your responsibility to follow up, understand and do the homework problems, and internalize the material so it becomes part of your own toolset.

For the convex optimization part, the most important thing is to keep a catalog of
problems that can be solved by each problem class (LP, QP, SOCP, SDP, GP) and get
familiar with the good convex optimization tools for solving them.

• On course project:

– Ideally I hope you can come up with a project that benefits yourself. You've learned a lot of tools from this course. Do something with them that can turn into a manuscript, a software package, or a blog post, and most importantly, something that interests you.
∗ Re-examine the computational issues in your research projects. Is that slow?
What’s the bottleneck? Would Rcpp or changing to another language like
Julia help? Is there an optimization problem there? Is that a convex prob-
lem? Can I do convex relaxation? Can I formulate the problem as a standard
problem class (LP, QP, ...)?

∗ Create new applications by try different combinations of the terms in each
category. Say XXX loss + XXX penalty? Can they solve some problems
better (or faster) than current methods?
∗ Reverse engineering. Go over the examples and exercises in the textbook
(Boyd and Vandenberghe, 2004) and ask yourself “this is cool, can I apply
this to solve some statistical problems?”
∗ Do not worry about how to satisfy the instructor. Think about doing something that benefits yourself in the long run. Be creative and do not be afraid that your idea does not work. Even negative results are valuable; I appreciate negative results as long as I see strong motivation and effort in them and you provide some insights into why the method does not work as you thought. And seriously, you should write a blog post for whatever negative results you get. I think they have as much intellectual merit as published positive results.
“If your mentor handed you a sure-fire project, then it probably is dull.” (Kenneth
Lange)

Give a man a fish, he eats for a day. Teach him to fish, he will never go
hungry.

– If you really lack ideas, work on an active competition on kaggle.com. Provide


your best position in the leaderboard in your final project report.
– The final project report should look like a paper: introduction, motivation, method,
algorithm, simulation studies if necessary, real data analysis, conclusion.

• Keeping up with technology: NVIDIA CUDA v7.0 was released last week. A new library, cuSOLVER, provides a collection of dense and sparse direct solvers. https://developer.nvidia.com/cusolver This potentially opens up a lot of GPU computing opportunities for statistics.

Last Time
• SDP.

Today
• SDP (cont’d).

SDP (cont’d)
• Singular value problems. Let A(x) = A₀ + x₁A₁ + ⋯ + xₙAₙ, where Aᵢ ∈ R^{p×q}, and let σ₁(x) ≥ ⋯ ≥ σ_{min{p,q}}(x) ≥ 0 be the ordered singular values.

  – Spectral norm (or operator norm, or matrix-2 norm) minimization. Consider minimizing the spectral norm ‖A(x)‖₂ = σ₁(x). Note ‖A‖₂ ≤ t if and only if AᵀA ⪯ t²I (and t ≥ 0), if and only if

      [ tI  A  ]
      [ Aᵀ  tI ]  ⪰ 0.

    This results in the SDP

      minimize    t
      subject to  [ tI      A(x) ]
                  [ A(x)ᵀ   tI   ]  ⪰ 0

    in variables x ∈ Rⁿ and t ∈ R.

    R Minimizing the sum of the k largest singular values is an SDP as well.
  – Nuclear norm minimization. Minimization of the nuclear norm (or trace norm) ‖A(x)‖* = Σᵢ σᵢ(x) can be formulated as an SDP.

    Argument 1: The singular values of A coincide with the eigenvalues of the symmetric matrix

      [ 0   A ]
      [ Aᵀ  0 ],

    which has eigenvalues (σ₁, …, σ_p, −σ_p, …, −σ₁). Therefore minimizing the nuclear norm of A is the same as minimizing the ℓ₁ norm of the eigenvalues of the augmented matrix, which we know is an SDP.

    Argument 2: An alternative characterization of the nuclear norm is ‖A‖* = sup_{‖Z‖₂≤1} tr(AᵀZ). That is,

      maximize    tr(AᵀZ)
      subject to  [ I  Zᵀ ]
                  [ Z  I  ]  ⪰ 0,

    with the dual problem

      minimize    tr(U + V)/2
      subject to  [ U     A(x)ᵀ ]
                  [ A(x)  V     ]  ⪰ 0.

    Therefore the epigraph of the nuclear norm can be represented by the LMI

      ‖A(x)‖* ≤ t ⇔ [ U     A(x)ᵀ ]
                     [ A(x)  V     ]  ⪰ 0, tr(U + V)/2 ≤ t.

R Argument 3: See (Ben-Tal and Nemirovski, 2001, Proposition 4.2.2, p154).

R See (Ben-Tal and Nemirovski, 2001, Lecture 4, p151-p154) for proofs of the above facts.

R sigma_max and norm_nuc are implemented in cvx for Matlab.

R operator norm and nuclear norm are implemented in the Convex.jl package for Julia.

• Example. Matrix completion. See HW6 Q2 http://hua-zhou.github.io/teaching/


st790-2015spr/ST790-2015-HW6.pdf

• Quadratic or quadratic-over-linear matrix inequalities. Suppose

A(x) = A0 + x1 A1 + · · · + xn An
B(y) = B0 + y1 B1 + · · · + yr Br .

Then

    A(x)ᵀB(y)⁻¹A(x) ⪯ C
    ⇔ [ B(y)   A(x)ᵀ ]
      [ A(x)   C     ]  ⪰ 0

by the Schur complement lemma.

R matrix_frac() is implemented in both cvx for Matlab and the Convex.jl package for Julia.

• General quadratic matrix inequality. Let X ∈ Rm×n be a rectangular matrix and

F (X) = (AXB)(AXB)T + CXD + (CXD)T + E

be a quadratic matrix-valued function. Then

    F(X) ⪯ Y
    ⇔ [ I      (AXB)ᵀ                 ]
      [ AXB    Y − E − CXD − (CXD)ᵀ   ]  ⪰ 0

by the Schur complement lemma.

• Another matrix inequality:

    X ≻ 0, Y ⪯ (CᵀX⁻¹C)⁻¹
    ⇔ Y ⪯ Z, Z ⪰ 0, X ⪰ CZCᵀ.

See (Ben-Tal and Nemirovski, 2001, 20.c, p155).

• Cone of nonnegative polynomials. Consider nonnegative polynomial of degree 2n

f (t) = xT v(t) = x0 + x1 t + · · · x2n t2n ≥ 0, for all t.

The cone

Kn = {x ∈ R2n+1 : f (t) = xT v(t) ≥ 0, for all t ∈ R}

can be characterized by the LMI

    f(t) ≥ 0 for all t ⇔ xᵢ = ⟨X, Hᵢ⟩, i = 0, …, 2n, X ∈ S^{n+1}_+,

where the Hᵢ ∈ R^{(n+1)×(n+1)} are Hankel matrices with entries (Hᵢ)_{kl} = 1 if k + l = i and 0 otherwise. Here k, l ∈ {0, 1, …, n}.
Similarly the cone of nonnegative polynomials on a finite interval

Kna,b = {x ∈ Rn+1 : f (t) = xT v(t) ≥ 0, for all t ∈ [a, b]}

can be characterized by LMI as well.

– (Even degree) Let n = 2m. Then

    K^{a,b}_n = {x ∈ R^{n+1} : xᵢ = ⟨X₁, Hᵢᵐ⟩ + ⟨X₂, (a + b)Hᵢ₋₁^{m−1} − abHᵢ^{m−1} − Hᵢ₋₂^{m−1}⟩,
                 i = 0, …, n, X₁ ∈ Sᵐ₊, X₂ ∈ S^{m−1}₊}.

– (Odd degree) Let n = 2m + 1. Then

    K^{a,b}_n = {x ∈ R^{n+1} : xᵢ = ⟨X₁, Hᵢ₋₁ᵐ − aHᵢᵐ⟩ + ⟨X₂, bHᵢᵐ − Hᵢ₋₁ᵐ⟩,
                 i = 0, …, n, X₁, X₂ ∈ Sᵐ₊}.

R References: the paper (Nesterov, 2000) and the book (Ben-Tal and Nemirovski, 2001, Lecture 4, p157-p159).

• Example. Polynomial curve fitting. We want to fit a univariate polynomial of degree n,

    f(t) = x₀ + x₁t + x₂t² + ⋯ + xₙtⁿ,

to a set of measurements (tᵢ, yᵢ), i = 1, …, m, such that f(tᵢ) ≈ yᵢ. Define the Vandermonde matrix

    A = [ 1  t₁  t₁²  ⋯  t₁ⁿ ]
        [ 1  t₂  t₂²  ⋯  t₂ⁿ ]
        [ ⋮   ⋮    ⋮        ⋮  ]
        [ 1  tₘ  tₘ²  ⋯  tₘⁿ ],

then we wish Ax ≈ y. Using least squares criterion, we obtain the optimal solution
xLS = (AT A)−1 AT y. With various constraints, it is possible to find optimal x by
SDP.

1. Nonnegativity. Then we require x ∈ K^{a,b}_n.

2. Monotonicity. We can ensure monotonicity of f(t) by requiring that f′(t) ≥ 0 or f′(t) ≤ 0. That is, (x₁, 2x₂, …, nxₙ) ∈ K^{a,b}_{n−1} or −(x₁, 2x₂, …, nxₙ) ∈ K^{a,b}_{n−1}.

3. Convexity or concavity. Convexity or concavity of f(t) corresponds to f″(t) ≥ 0 or f″(t) ≤ 0. That is, (2x₂, 6x₃, …, (n − 1)nxₙ) ∈ K^{a,b}_{n−2} or −(2x₂, 6x₃, …, (n − 1)nxₙ) ∈ K^{a,b}_{n−2}.

R nonneg_poly_coeffs() and convex_poly_coeffs() are implemented in cvx. Not in Convex.jl yet.

18 Lecture 18, Mar 25
Announcements
• HW6 (SDP, GP, MIP) due next Mon, Mar 30 @ 11:59PM. http://hua-zhou.github.
io/teaching/st790-2015spr/ST790-2015-HW6.pdf

• HW5 (QP, SOCP) solution sketch posted http://hua-zhou.github.io/teaching/


st790-2015spr/hw05sol.html

• The teaching server is reserved for teaching purpose. Please do not run and store
your research stuff on it. Each ST790-003 homework problem should take no longer
than a few minutes. Most of them take only a couple seconds.

Last Time
• SDP (cont’d).

Today
• SDP (cont’d).

• GP (geometric programming).

SDP (cont’d)
• Example. Nonparametric density estimation by polynomials. See notes.

• SDP relaxation of combinatorial problems. An effective strategy to solve difficult


combinatorial optimization problem (NP hard) is to bound the unknown optimal value.
Upper bound is provided by any feasible point, while lower bound is often provided by
a convex relaxation of the original problem.

– SDP relaxation of binary optimization. Consider a binary linear optimization


problem

minimize cT x
subject to Ax = b, x ∈ {0, 1}n .

Note

xi ∈ {0, 1} ⇔ x2i = xi ⇔ X = xxT , diag(X) = x.

By relaxing the rank 1 constraint on X, we obtain an SDP relaxation

minimize cT x
subject to Ax = b, diag(X) = x, X  xxT ,

which can be efficiently solved and provides a lower bound to the original problem.
If the optimal X has rank 1, then it is a solution to the original binary problem
also. Note X ⪰ xxᵀ is equivalent to the LMI

    [ 1  xᵀ ]
    [ x  X  ]  ⪰ 0.
We can tighten the relaxation by adding other constraints that cut away part of
the feasible set, without excluding rank 1 solutions. For instance, 0 ≤ xi ≤ 1 and
0 ≤ Xij ≤ 1.
– SDP relaxation of Boolean optimization. For Boolean constraints x ∈ {−1, 1}ⁿ, we note

    xᵢ ∈ {−1, 1}, i = 1, …, n ⇔ X = xxᵀ, diag(X) = 1.

R References: the paper (Laurent and Rendl, 2005) and the book (Ben-Tal and Nemirovski, 2001, Lecture 4.3).

Geometric programming (GP)


• A function f : Rⁿ → R with dom f = Rⁿ₊₊ defined as

    f(x) = c x₁^{a₁} x₂^{a₂} ⋯ xₙ^{aₙ},

where c > 0 and aᵢ ∈ R, is called a monomial.

• A sum of monomials,

    f(x) = Σₖ₌₁ᴷ cₖ x₁^{a₁ₖ} x₂^{a₂ₖ} ⋯ xₙ^{aₙₖ},

where cₖ > 0, is called a posynomial.

• Posynomials are closed under addition, multiplication, and nonnegative scaling.

• A geometric program is of form

minimize f0 (x)
subject to fi (x) ≤ 1, i = 1, . . . , m
hi (x) = 1, i = 1, . . . , p

where f₀, …, fₘ are posynomials and h₁, …, hₚ are monomials. The constraint x ≻ 0 is implicit.

R Is GP a convex optimization problem?

• With the change of variables yᵢ = ln xᵢ, a monomial

    f(x) = c x₁^{a₁} x₂^{a₂} ⋯ xₙ^{aₙ}

can be written as

    f(x) = f(e^{y₁}, …, e^{yₙ}) = c (e^{y₁})^{a₁} ⋯ (e^{yₙ})^{aₙ} = e^{aᵀy + b},

where b = ln c. Similarly, we can write a posynomial as

    f(x) = Σₖ₌₁ᴷ cₖ x₁^{a₁ₖ} x₂^{a₂ₖ} ⋯ xₙ^{aₙₖ} = Σₖ₌₁ᴷ e^{aₖᵀy + bₖ},

where aₖ = (a₁ₖ, …, aₙₖ) and bₖ = ln cₖ.

• The original GP can be expressed in terms of the new variable y:

    minimize    Σₖ₌₁^{K₀} e^{a₀ₖᵀy + b₀ₖ}
    subject to  Σₖ₌₁^{Kᵢ} e^{aᵢₖᵀy + bᵢₖ} ≤ 1, i = 1, …, m
                e^{gᵢᵀy + hᵢ} = 1, i = 1, …, p,

where aᵢₖ, gᵢ ∈ Rⁿ. Taking the log of both the objective and the constraint functions, we obtain the geometric program in convex form:

    minimize    ln(Σₖ₌₁^{K₀} e^{a₀ₖᵀy + b₀ₖ})
    subject to  ln(Σₖ₌₁^{Kᵢ} e^{aᵢₖᵀy + bᵢₖ}) ≤ 0, i = 1, …, m
                gᵢᵀy + hᵢ = 0, i = 1, …, p.

R Mosek is capable of solving GP. cvx has a GP mode that recognizes and transforms GP problems.

• Example. Logistic regression as GP. Given data (xᵢ, yᵢ), i = 1, …, n, where yᵢ ∈ {0, 1} and xᵢ ∈ Rᵖ, the likelihood of the logistic regression model is

    ∏ᵢ₌₁ⁿ pᵢ^{yᵢ}(1 − pᵢ)^{1−yᵢ}
    = ∏ᵢ₌₁ⁿ (e^{xᵢᵀβ}/(1 + e^{xᵢᵀβ}))^{yᵢ} (1/(1 + e^{xᵢᵀβ}))^{1−yᵢ}
    = ∏_{i: yᵢ=1} e^{xᵢᵀβ} ∏ᵢ₌₁ⁿ 1/(1 + e^{xᵢᵀβ}).

The MLE solves

    minimize  ∏_{i: yᵢ=1} e^{−xᵢᵀβ} ∏ᵢ₌₁ⁿ (1 + e^{xᵢᵀβ}).

Let zⱼ = e^{βⱼ}, j = 1, …, p. The objective becomes

    ∏_{i: yᵢ=1} ∏ⱼ₌₁ᵖ zⱼ^{−xᵢⱼ} ∏ᵢ₌₁ⁿ (1 + ∏ⱼ₌₁ᵖ zⱼ^{xᵢⱼ}).

This leads to the GP

    minimize    ∏_{i: yᵢ=1} sᵢ ∏ᵢ₌₁ⁿ tᵢ
    subject to  ∏ⱼ₌₁ᵖ zⱼ^{−xᵢⱼ} ≤ sᵢ, i = 1, …, m
                1 + ∏ⱼ₌₁ᵖ zⱼ^{xᵢⱼ} ≤ tᵢ, i = 1, …, n,

in variables s ∈ Rᵐ, t ∈ Rⁿ, and z ∈ Rᵖ. Here m is the number of observations with yᵢ = 1.

R How to incorporate the lasso penalty? Let zⱼ⁺ = e^{βⱼ⁺}, zⱼ⁻ = e^{βⱼ⁻}. The lasso penalty takes the form e^{λ|βⱼ|} = (zⱼ⁺zⱼ⁻)^λ.

• Example. Bradley-Terry model for sports ranking. See ST758 HW8 http://hua-zhou.github.io/teaching/st758-2014fall/ST758-2014-HW8.pdf. The likelihood is

    ∏ᵢ,ⱼ (γᵢ/(γᵢ + γⱼ))^{yᵢⱼ}.

The MLE is solved by the GP

    minimize    ∏ᵢ,ⱼ tᵢⱼ^{yᵢⱼ}
    subject to  1 + γᵢ⁻¹γⱼ ≤ tᵢⱼ

in γ ∈ Rⁿ and t ∈ R^{n²}.

19 Lecture 19, Mar 30
Announcements
• HW6 (SDP, GP, MIP) deadline extended to this Wed, Apr 1 @ 11:59PM. Some hints
if you use Convex.jl package in Julia for HW6:

– Q1(a): Convex.jl does not implement root determinant function but it imple-
ments the logdet function that you can use
– Q1(d): Convex.jl does not implement trace inv function but you can easily
formulate it as an SDP
– Q4(a): Convex.jl does not model GP (geometric program), but you can use
change of variable yi = ln xi and utilize the logsumexp function in Convex.jl
– Q4(b): Convex.jl does not have a log normcdf function but you can learn the
quadratic approximation trick from cvx https://github.com/cvxr/CVX/blob/
master/functions/%40cvx/log_normcdf.m

Last Time
• SDP (cont’d).

• GP (geometric programming).

Today
• Cone programming.

• Separable convex optimization in Mosek.

• Mixed integer programming (MIP).

• Planned topics for remaining of the course: algorithms for sparse and regularized
regressions, dynamic programming, EM/MM advanced topics: s.e., convergence and
acceleration, and online estimation.

Generalized inequalities and cone programming


• A cone K ⊆ Rⁿ is proper if it is closed, convex, has non-empty interior, and is pointed, i.e., x ∈ K and −x ∈ K implies x = 0.

A proper cone defines a partial ordering on Rⁿ via generalized inequalities: x ⪯_K y if and only if y − x ∈ K, and x ≺_K y if and only if y − x ∈ int(K).

E.g., X ⪯ Y means Y − X ∈ Sⁿ₊ and X ≺ Y means Y − X ∈ Sⁿ₊₊.

• A conic form problem or cone program has the form

    minimize    cᵀx
    subject to  Fx + g ⪯_K 0
                Ax = b.

• The conic form problem in standard form is

    minimize    cᵀx
    subject to  x ⪰_K 0
                Ax = b.

• The conic form problem in inequality form is

    minimize    cᵀx
    subject to  Fx + g ⪯_K 0.

• Special cases of cone programming.

– Nonnegative orthant {x|x  0}: LP


– Second order cone {(x, t)|kxk2 ≤ t}: SOCP
– Rotated quadratic cone {(x, t1 , t2 )|kxk22 ≤ 2t1 t2 }: SOCP
– Geometric mean cone {(x, t) | (∏ᵢ xᵢ)^{1/n} ≥ t, x ⪰ 0}: SOCP

– Semidefinite cone Sn+ : SDP


– Nonnegative polynomial cone: SDP
– Monotone polynomial cone: SDP
– Convex/concave polynomial cone: SDP
– Exponential cone {(x, y, z)|yex/y ≤ z, y > 0}. Terms logsumexp, exp, log,
entropy, lndet, ... are exponential cone representable.

• Where is today’s technology up to?

– Gurobi implements up to SOCP.

– Mosek implements up to SDP.
– SCS (free solver accessible from Convex.jl) can deal with exponential cone pro-
gram.
– cvx uses a successive approximation strategy to deal with exponential-cone-representable terms, which relies only on SOCP. See http://web.cvxr.com/cvx/doc/advanced.html#successive

  R cvx implements log_det and log_sum_exp.
– Convex.jl accepts exponential-cone-representable terms, which it can solve using SCS.

  R Convex.jl implements logsumexp, exp, log, entropy, and the logistic loss.

• Example. Logistic regression as an exponential cone problem:

    minimize  − Σ_{i: yᵢ=1} xᵢᵀβ + Σᵢ₌₁ⁿ ln(1 + e^{xᵢᵀβ}).

See cvx example library for an example for logistic regression. http://cvxr.com/cvx/
examples/
See the link for an example using Julia. http://nbviewer.ipython.org/github/
JuliaOpt/Convex.jl/blob/master/examples/logistic_regression.ipynb

• Example. Gaussian covariance estimation and graphical lasso

ln det(Σ) + tr(SΣ) − λkvecΣk1

involves exponential cones because of the ln det term.

Separable convex optimization in Mosek


• Mosek is posed to solve general convex nonlinear programs (NLP) of form

    minimize    f(x) + cᵀx
    subject to  lᵢ ≤ gᵢ(x) + aᵢᵀx ≤ uᵢ, i = 1, …, m
                lₓ ⪯ x ⪯ uₓ.

Here the functions f : Rⁿ → R and gᵢ : Rⁿ → R, i = 1, …, m, must be separable in the variables, i.e., sums of univariate functions of the coordinates of x.

• The example

    minimize    x₁ − ln(x₁ + 2x₂)
    subject to  x₁² + x₂² ≤ 1

is not separable. But the equivalent formulation

    minimize    x₁ − ln(x₃)
    subject to  x₁² + x₂² ≤ 1, x₁ + 2x₂ − x₃ = 0, x₃ ≥ 0

is.

• It should cover a lot of statistical applications. But I have no experience with its performance yet.

• Which modeling tool to use?

– cvx and Convex.jl can not model general NLP.


– JuMP.jl in Julia can model NLP or even MINLP. See http://jump.readthedocs.
org/en/latest/nlp.html

Other topics in convex optimization


• Duality theory. (Boyd and Vandenberghe, 2004, Chapter 5).

• Algorithms. Interior point method. (Boyd and Vandenberghe, 2004) Part III (Chapters
9-11).

• History:

1. 1948: Dantzig’s simplex algorithm for solving LP.


2. 1984: first practical polynomial-time algorithm (interior point method) by Kar-
markar.
3. 1984-1990: efficient implementations for large-scale LP.
4. around 1990: polynomial-time interior-point methods for nonlinear convex pro-
gramming by Nesterov and Nemirovski.
5. since 1990: extensions and high-quality software packages.

Mixed integer programming


• Mixed integer program allows certain optimization variables to be integer.

• Current technology can solve small to moderately sized MILP and MIQP.

R cvx allows binary and integer variables. Convex.jl for Julia does not allow integer variables. JuMP.jl for Julia allows binary and integer variables.

• Modeling using integer variables. References (Nemhauser and Wolsey, 1999; Williams,
2013).

– (Positivity) If 0 ≤ x < M for a known upper bound M , then we can model the
implication (x > 0) → (z = 1) by linear inequality x ≤ M z, where z ∈ {0, 1}.
Similarly if 0 < m ≤ x for a known lower bound m. Then we can model the
implication (z = 1) → (x > 0) by the linear inequality x ≥ mz, where z ∈ {0, 1}.
– (Semi-continuity) We can model semi-continuity of a variable x ∈ R, x ∈ 0 ∪ [a, b]
where 0 < a ≤ b using a double inequality az ≤ x ≤ bz where z ∈ {0, 1}.
– (Constraint satisfaction) Suppose we know an upper bound M on aᵀx − b. Then the implication (z = 1) → (aᵀx ≤ b) can be modeled as

    aᵀx ≤ b + M(1 − z),

  where z ∈ {0, 1}. Equivalently, the reverse implication (aᵀx ≤ b) → (z = 1) is modeled as

    aᵀx ≥ b + (m − ε)z + ε,

  where m < aᵀx − b is a lower bound and ε > 0 is a small tolerance. Collectively, we model (aᵀx ≤ b) ↔ (z = 1) as

    aᵀx ≤ b + M(1 − z), aᵀx ≥ b + (m − ε)z + ε.

  In a similar fashion, (z = 1) ↔ (aᵀx ≥ b) is modeled as

    aᵀx ≥ b + m(1 − z), aᵀx ≤ b + (M + ε)z − ε,

  using the lower bound m < aᵀx − b and upper bound M > aᵀx − b.

  Combining both, we can model the equality aᵀx = b by modeling (z = 1) → (aᵀx = b) by

    aᵀx ≤ b + M(1 − z), aᵀx ≥ b + m(1 − z)

  and (z = 0) → (aᵀx ≠ b) by

    aᵀx ≥ b + (m − ε)z₁ + ε, aᵀx ≤ b + (M + ε)z₂ − ε, z₁ + z₂ − z ≤ 1,

  where z₁ + z₂ − z ≤ 1 is equivalent to (z = 0) → (z₁ = 0) ∨ (z₂ = 0).

– (Disjunctive constraints) The requirement that at least one of a set of constraints is satisfied, (z = 1) → (a₁ᵀx ≤ b₁) ∨ (a₂ᵀx ≤ b₂) ∨ ⋯ ∨ (aₖᵀx ≤ bₖ), can be modeled by

    z = z₁ + ⋯ + zₖ ≥ 1, aⱼᵀx ≤ bⱼ + M(1 − zⱼ) for all j,

  where the zⱼ ∈ {0, 1} are binary variables and M > aⱼᵀx − bⱼ for all j is a collective upper bound.

  The reverse implication (a₁ᵀx ≤ b₁) ∨ (a₂ᵀx ≤ b₂) ∨ ⋯ ∨ (aₖᵀx ≤ bₖ) → (z = 1) is modeled as

    aⱼᵀx ≥ bⱼ + (m − ε)z + ε, j = 1, …, k,

  with a lower bound m < aⱼᵀx − bⱼ for all j and z ∈ {0, 1}.
– (Pack constraints) The requirement at most one of the constraints are satisfied is
modeled as

z1 + · · · + zk ≤ 1, aTj x ≤ bj + M (1 − zj ), for all j.

– (Partition constraints) The requirement exactly one of the constraints are satisfied
is modeled as

z1 + · · · + zk = 1, aTj x ≤ bj + M (1 − zj ), for all j.

– Boolean primitives.
∗ Complement
x ¬x
0 1
1 0
is modeled as ¬x = 1 − x.
∗ Conjunction
x y x∧y
0 0 0
1 0 0
0 1 0
1 1 1
z = (x ∧ y) is modeled as z + 1 ≥ x + y, x ≥ z, y ≥ z.
∗ Disjunction

x y x∨y
0 0 0
1 0 1
0 1 1
1 1 1
is modeled as x + y ≥ 1.
∗ Implication
x y x→y
0 0 1
1 0 0
0 1 1
1 1 1
is modeled as x ≤ y.
– Special ordered set constraint: SOS1 and SOS2. See (Williams, 2013, Section 9.3)
or (Bertsimas and Weismantel, 2005).
An SOS1 constraint is a set of variables for which at most one variable in the
set may take a value other than zero. An SOS2 constraint is an ordered set of
variables where at most two variables in the set may take non-zero values. If two take non-zero values, they must be contiguous in the ordered set.

R The Gurobi solver allows SOS1 and SOS2 constraints. The JuMP.jl modeling tool for Julia accepts SOS1 and SOS2 constraints and passes them to solvers that support them. cvx and Convex.jl do not take SOS constraints.

• Example. Best subset regression. HW6 Q3. Consider

minimize ky − β0 1 − Xβk22
subject to kβk0 ≤ k.

Introducing binary variables zj ∈ {0, 1} such that (βj 6= 0) → (zj = 1), then it can be
formulated as a MIQP

minimize ky − β0 1 − Xβk22
subject to −M zj ≤ βj ≤ M zj
p
X
zj ≤ k,
j=1

where M ≥ kβk∞ . Alternatively, we may model cardinality constraint by SOS1 con-


straints {βj , zj } ∈ SOS1, which does not need a pre-defined M .

R We should be able to do best subset XXX for all problems in HW4/5 by formulating
a corresponding MILP, MIQP or MISOCP.
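A minimal JuMP sketch of the big-M formulation above, assuming an MIQP-capable solver such as Gurobi is available and that M = 10 is a valid bound on ‖β‖∞ (both assumptions; any MIQP solver and any valid M would do):

    using JuMP, Gurobi, LinearAlgebra

    n, p, k, M = 100, 10, 3, 10.0
    X, y = randn(n, p), randn(n)

    model = Model(Gurobi.Optimizer)
    @variable(model, β[1:p])
    @variable(model, β0)
    @variable(model, z[1:p], Bin)
    @constraint(model, [j = 1:p], -M * z[j] <= β[j])     # βⱼ ≠ 0 forces zⱼ = 1
    @constraint(model, [j = 1:p],  β[j] <= M * z[j])
    @constraint(model, sum(z) <= k)                      # at most k active predictors
    @objective(model, Min, sum((y[i] - β0 - dot(X[i, :], β))^2 for i in 1:n))
    optimize!(model)
    βhat = value.(β)                                     # best subset of size ≤ k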

• Example. Variable selection in presence of interaction. Consider variable selection for


linear regression with p predictors and their pairwise interactions. For better inter-
pretability, we may want to retain interaction terms only when their main effects are
in the model as well. We may achieve this by solving

    minimize  (1/2) Σᵢ₌₁ⁿ (yᵢ − μ − Σⱼ₌₁ᵖ xᵢⱼβⱼ − Σ_{j,j′} xᵢⱼxᵢⱼ′βⱼⱼ′)² + λ (Σⱼ₌₁ᵖ |βⱼ| + Σ_{j,j′} |βⱼⱼ′|)

subject to the logical constraints (βⱼⱼ′ ≠ 0) → (βⱼ ≠ 0) ∧ (βⱼ′ ≠ 0).

• Example. Sudoku. How to solve Sudoku using integer programming?


Define solution as a binary array X ∈ {0, 1}9×9×9 with entries xijk = 1 if and only if
(i, j)-th entry is integer k. We need constraints
1. Each square in the 2D grid has exactly one value: Σₖ₌₁⁹ xᵢⱼₖ = 1.
2. Each row i of the 2D grid contains each of the digits 1 to 9 exactly once: Σⱼ₌₁⁹ xᵢⱼₖ = 1.
3. Each column j of the 2D grid contains each of the digits 1 to 9 exactly once: Σᵢ₌₁⁹ xᵢⱼₖ = 1.
4. The major 3-by-3 grids have a similar property: Σᵢ₌₁³ Σⱼ₌₁³ x_{i+U, j+V, k} = 1, where U, V ∈ {0, 3, 6}.
5. Observed entries prescribe xᵢⱼₘ = 1 if the (i, j)-th entry is the integer m.

Julia: http://nbviewer.ipython.org/github/JuliaOpt/juliaopt-notebooks/blob/
master/notebooks/JuMP-Sudoku.ipynb
Matlab: http://www.mathworks.com/help/optim/ug/solve-sudoku-puzzles-via-integer-prog
html

• Optimization involving piecewise-linear functions can be formulated as MIP. See (Vielma


et al., 2010).

20 Lecture 20, Apr 1
Announcements
• HW6 (SDP, GP, MIP) due today @ 11:59PM. Don’t forget git tag your submission.

• A few more course project ideas added on http://hua-zhou.github.io/teaching/


st790-2015spr/project.html

Last Time
• Cone programming.

• Separable convex optimization in Mosek.

• Mixed integer programming (MIP).

Today
• Sparse regression.

Sparse regression: what and why?


• Famous lasso (Donoho and Johnstone, 1994, 1995; Tibshirani, 1996)
    minimize  (1/2)‖y − β₀1 − Xβ‖₂² + λ Σⱼ₌₁ᵖ |βⱼ|.

Why everyone does this?

– Shrinkage
– Model selection
– Computational efficiency (convex optimization)

• Why shrinkage? Idea of shrinkage dates back to one of the most surprising results in
mathematical statistics in the 20th century. Let’s consider the simple task of estimating
population mean(s).

– How to estimate hog weight in Montana?
– How to estimate hog weight in Montana and tea consumption in China?
– How to estimate hog weight in Montana, tea consumption in China, and speed of
light?

• Stein's paradox. Which estimator should we use,

    μ̂_LS = y   or   μ̂_JS = (1 − (m − 2)/‖y‖₂²) y ?

The James-Stein shrinkage estimator μ̂_JS dominates the least squares estimate μ̂_LS when the number of populations m ≥ 3!

• Observe independent yi ∼ N (µi , 1), i = 1, . . . , m.

Theorem 1. For m ≥ 3, the James-Stein estimator µ̂JS everywhere dominates the


MLE µ̂LS in terms of the expected total squared error; that is

Eµ kµ̂JS − µk22 < Eµ kµ̂LS − µk22

for every choice of µ.

– Stein (1956) showed the inadmissibility of µ̂LS ; his student James and himself
later proposed the specific form of µ̂JS in (James and Stein, 1961).
– Message: when estimating many parameters, shrinkage helps improve risk prop-
erty, even when the parameters are totally unrelated to each other.

• Efron's famous baseball example (Efron and Morris, 1977). The MLE has empirical risk 0.076; the James-Stein estimator (shrinkage towards the average) has empirical risk 0.021.

  [Figure: batting averages for 18 players, comparing the true values, the MLE (risk 0.076), and the James-Stein estimate (risk 0.021).]
• Stein’s effect is universal and underlies many modern statistical learning methods

• Empirical Bayes connection (Efron and Morris, 1973)


“Learning from the experience of the others” (John Tukey)

• Why m ≥ 3? Connection with transience/recurrence of Markov chains (Brown, 1971;
Eaton, 1992)
“A drunk man will eventually find his way home but a drunk bird may get lost forever.”
(Kakutani at a UCLA colloquium talk)

• Now we see the benefits of shrinkage. Lasso has the added benefit of model selection.

The left plot shows the solution path of ridge regression


    minimize  ‖y − β₀1 − Xβ‖₂² + λ Σⱼ₌₁ᵖ βⱼ²

for the prostate cancer data in HW4/5/6. The right plot shows the lasso solution path
on the same data set. We see both ridge and lasso shrink β̂. But lasso has the extra
benefit of performing variable selection.

• A general sparse regression minimizes the criterion


    f(β) + Σⱼ₌₁ᵖ P_η(|βⱼ|, λ)

– f a differentiable loss function


∗ f (β) = ky − Xβk22 /2: linear regression
∗ f (β) = −`(β): negative log-likelihood (GLM, Cox model, ...)
∗ ...

– P : the penalty function
– λ: penalty tuning parameter
– η: index a penalty family

• Power family penalty (bridge regression) (Frank and Friedman, 1993)

Pη (|w|, λ) = λ|w|η , η ∈ [0, 2].

– η ∈ (0, 1): concave, η ∈ [1, 2]: convex


– η = 2: ridge, η = 1: lasso, η = 0: `0 (best subset) regression

• Elastic net penalty (Zou and Hastie, 2005)

Pη (|w|, λ) = λ {(η − 1)w2 /2 + (2 − η)|w|} , η ∈ [1, 2]

– Enet tries to combine both lasso and ridge penalty.


– η = 1: lasso, η = 2: ridge.
– Friedman (2008) calls the (concave) log penalty generalized enet.

• SCAD (Fan and Li, 2001):

    P_η(|w|, λ) = λ|w|                                                  if |w| < λ
                = λ² + ηλ(|w| − λ)/(η − 1) − (w² − λ²)/(2(η − 1))        if |w| ∈ [λ, ηλ]
                = λ²(η + 1)/2                                            if |w| > ηλ.

  – For small signals |w| < λ it acts as the lasso; for large signals |w| > ηλ the penalty flattens, which leads to unbiasedness of the regularized estimate.

• Log penalty (Candès et al., 2008; Armagan et al., 2013)

Pη (|w|, λ) = λ ln(η + |w|), η>0

• MC+ penalty (Zhang, 2010):

    P_η(|w|, λ) = (λ|w| − w²/(2η)) 1_{|w|<λη} + (λ²η/2) 1_{|w|≥λη}, η > 0,

is quadratic on [0, λη] and flattens beyond λη. Varying η from 0 to ∞ bridges hard thresholding (ℓ₀ regression) to lasso (ℓ₁) shrinkage.
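To make the penalty families concrete, a minimal Julia sketch that evaluates the SCAD and MC+ penalties at a scalar w, simply transcribing the formulas above:

    # SCAD penalty of Fan and Li (2001)
    function scad(w::Real, λ::Real, η::Real)
        a = abs(w)
        if a < λ
            return λ * a
        elseif a <= η * λ
            return λ^2 + (η * λ * (a - λ) - (a^2 - λ^2) / 2) / (η - 1)
        else
            return λ^2 * (η + 1) / 2
        end
    end

    # MC+ penalty of Zhang (2010): quadratic on [0, λη], flat beyond
    function mcplus(w::Real, λ::Real, η::Real)
        a = abs(w)
        return a < λ * η ? λ * a - a^2 / (2η) : λ^2 * η / 2
    end

    scad(1.5, 1.0, 3.7), mcplus(1.5, 1.0, 2.0)   # evaluate at a point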

Sparse regression: overview of algorithms


• Difficulties in minimizing
    f(β) + Σⱼ₌₁ᵖ P_η(|βⱼ|, λ).

– Non-smooth. Not differentiable at βj = 0.


– Possibility non-convex.
– Extremely high dimensions in modern applications. E.g., p ∼ 106 in genetics.

• We discuss following algorithms.

– Convex optimization softwares if applicable.


– Coordinate descent.
– Nesterov method (accelerated proximal gradient method).
– Path following algorithm.

• We have seen many examples where convex optimization softwares apply. For a convex
loss f and convex penalty P , write βj = βj+ − βj− , where βj+ = max{βj , 0} and
βj− = − min{βj , 0}. Then we minimize the objective
    f(β⁺ − β⁻) + Σⱼ₌₁ᵖ P_η(βⱼ⁺ + βⱼ⁻, λ)

subject to nonnegativity constraints βⱼ⁺, βⱼ⁻ ≥ 0 using a convex optimization solver.

R To guarantee that βⱼ⁺ and βⱼ⁻ are the positive and negative parts of βⱼ, we also need the (non-convex) constraint βⱼ⁺βⱼ⁻ = 0. This condition can be dispensed with in sparse regression because the penalty function is increasing in (βⱼ⁺ + βⱼ⁻), so the solution will always set βⱼ⁺ or βⱼ⁻ to 0.

• May not be efficient for extremely high dimensional, unstructured problems.

21 Lecture 21, Apr 6
Announcements
• HW6 solution sketch posted: http://hua-zhou.github.io/teaching/st790-2015spr/
hw06sol.html

Last Time
• Sparse regression: introduction.

Today
• Coordinate descent for sparse regression.

• Proximal gradient method.

Coordinate descent (CD)


• Idea: coordinate-wise minimization over βⱼ:

    βⱼ^{(t+1)} ← argmin_{βⱼ} f(β₁^{(t+1)}, …, βⱼ₋₁^{(t+1)}, βⱼ, βⱼ₊₁^{(t)}, …, βₚ^{(t)}) + P_η(|βⱼ|, λ), j = 1, …, p,

until the objective value converges. Similar to the Gauss-Seidel method for solving linear equations. Why does the objective value converge?
• Success stories

– Linear regression (Fu, 1998; Daubechies et al., 2004; Friedman et al., 2007; Wu
and Lange, 2008): GlmNet in R.
– GLM (Friedman et al., 2010): GlmNet in R.
– Non-convex penalties (Mazumder et al., 2011): SparseNet in R.

• Why CD works for sparse regressions?

– Q1: Given a non-convex function f , if we are at a point x such that f is minimized


along each coordinate axis, is x a global minimum?
∗ Exercise: consider f(x, y) = (y − x²)(y − 2x²). Show that all directional derivatives at (0, 0) are nonnegative, but (0, 0) is not a local minimum.

  [Figure: contour plot of f(x, y) = (y − x²)(y − 2x²) together with the curve y = 1.4x², along which f is negative arbitrarily close to the origin.]

Answer: No.
– Q2: Same question, but for a convex, differentiable f .

Answer: Yes. Why?


– Q3: Same question, but for a convex, non-differentiable f .

Answer: No.

– Q4: Same question, but for h(x) = f(x) + Σⱼ gⱼ(xⱼ), where f is convex and differentiable and the gⱼ are convex but not necessarily differentiable.

  Answer: Yes. Proof: for any y,

    h(y) − h(x) = f(y) − f(x) + Σⱼ [gⱼ(yⱼ) − gⱼ(xⱼ)]
                ≥ ∇f(x)ᵀ(y − x) + Σⱼ [gⱼ(yⱼ) − gⱼ(xⱼ)]
                = Σⱼ [∇ⱼf(x)(yⱼ − xⱼ) + gⱼ(yⱼ) − gⱼ(xⱼ)]
                ≥ 0.

  The first inequality is the supporting hyperplane inequality for f. The second inequality holds because h is minimized along each coordinate xⱼ, thus by the first order optimality condition

    0 ∈ ∇ⱼf(x) + ∂gⱼ(xⱼ), or equivalently, −∇ⱼf(x) ∈ ∂gⱼ(xⱼ).

  Then, by the definition of the subgradient,

    gⱼ(yⱼ) − gⱼ(xⱼ) ≥ −∇ⱼf(x)(yⱼ − xⱼ).

– This justifies the CD algorithm for sparse regression of the form f(β) + Σⱼ₌₁ᵖ P_η(|βⱼ|, λ), when both the loss and the penalty are convex.


– Tseng (2001) rigorously shows the convergence of CD. For f continuous on com-
pact set {x : f (x) ≤ f (x(0) )} and attaining its minimum, any limit point of CD
is a minimizer of f .

• Example. Lasso penalized linear regression:

    min_{β₀,β}  (1/2) Σᵢ₌₁ⁿ (yᵢ − β₀ − xᵢᵀβ)² + λ Σⱼ₌₁ᵖ |βⱼ|.

  – Update of the intercept β₀:

      β₀^{(t+1)} = (1/n) Σᵢ₌₁ⁿ (yᵢ − xᵢᵀβ^{(t)})
                 = (1/n) Σᵢ₌₁ⁿ (yᵢ − β₀^{(t)} − xᵢᵀβ^{(t)} + β₀^{(t)})
                 = β₀^{(t)} + (1/n) Σᵢ₌₁ⁿ rᵢ^{(t)}.

  – Update of βⱼ:

      βⱼ^{(t+1)} = argmin_{βⱼ} (1/2) Σᵢ₌₁ⁿ [yᵢ − β₀^{(t)} − xᵢᵀβ^{(t)} − (βⱼ − βⱼ^{(t)})xᵢⱼ]² + λ|βⱼ|
                 = argmin_{βⱼ} (1/2) Σᵢ₌₁ⁿ [rᵢ^{(t)} − (βⱼ − βⱼ^{(t)})xᵢⱼ]² + λ|βⱼ|
                 = argmin_{βⱼ} (x_{·j}ᵀx_{·j}/2) (βⱼ − βⱼ^{(t)} − x_{·j}ᵀr^{(t)}/(x_{·j}ᵀx_{·j}))² + λ|βⱼ|
                 = ST(βⱼ^{(t)} + x_{·j}ᵀr^{(t)}/(x_{·j}ᵀx_{·j}), λ/(x_{·j}ᵀx_{·j})),

    where

      ST(z, γ) = argmin_x (1/2)(x − z)² + γ|x| = sgn(z)(|z| − γ)₊

    is the soft-thresholding operator.

  – Organize the computation around the residuals r. Each coordinate update requires computing x_{·j}ᵀr^{(t)} and updating r^{(t+1)} ← r^{(t)} + (βⱼ^{(t)} − βⱼ^{(t+1)})x_{·j}, a total of O(n) flops or less with structure. A Julia sketch of these updates follows.
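A minimal Julia sketch of the coordinate descent updates above, assuming no active-set or warm-start tricks and precomputed column norms:

    using LinearAlgebra

    softthresh(z, γ) = sign(z) * max(abs(z) - γ, zero(z))

    # coordinate descent for (1/2)Σᵢ(yᵢ - β₀ - xᵢᵀβ)² + λΣⱼ|βⱼ|
    function cd_lasso(X, y, λ; maxiter = 100)
        n, p = size(X)
        β0, β = 0.0, zeros(p)
        r = copy(y)                                      # residuals y - β₀1 - Xβ (β = 0 initially)
        xsq = [dot(X[:, j], X[:, j]) for j in 1:p]       # x·ⱼᵀx·ⱼ
        for t in 1:maxiter
            δ0 = sum(r) / n                              # intercept update
            β0 += δ0
            r .-= δ0
            for j in 1:p                                 # coordinate updates, residuals kept in sync
                βold = β[j]
                β[j] = softthresh(βold + dot(X[:, j], r) / xsq[j], λ / xsq[j])
                if β[j] != βold
                    r .-= (β[j] - βold) .* X[:, j]
                end
            end
        end
        return β0, β
    end

    # toy usage
    X, y = randn(100, 10), randn(100)
    β0, β = cd_lasso(X, y, 5.0)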

• Example. Lasso penalized generalized linear model (GLM).


    minimize  f(β) + λ Σⱼ₌₁ᵖ |βⱼ|,

where f is the negative log-likelihood of a GLM.

– Method 1: Use Newton's method to update coordinate βⱼ (Wu et al., 2009).
– Method 2 (IWLS): Each coordinate descent sweep is performed on the quadratic approximation

    (1/2) Σᵢ₌₁ⁿ wᵢ^{(t)}(zᵢ^{(t)} − xᵢᵀβ)² + λ Σⱼ₌₁ᵖ |βⱼ|,

  where the wᵢ^{(t)} are the working weights and the zᵢ^{(t)} are the working responses (Friedman et al., 2010).

  R IWLS is more popular because it needs far fewer exponentiations.

• Remarks on CD.

– Scalable to extremely large p with careful implementation, because most variables


keep parked at 0 at large λ. Can be slow at smaller λ, where many βj are non-zero.
– Active set strategy. Keep updating active predictors until convergence and then
check other predictors. See (Tibshirani et al., 2012).
– Warm start from large λ: move from sparser solutions to dense ones, using solu-
tion at previous λ as initial value for next λ.
– Coding in lower level language (C/C++, Fortran, Julia?) is almost necessary for

R
efficiency due to extensive looping.
What Trevor Hastie calls the FFT trick: Friedman + Fortran + some nu-
merical Tricks = no waste flops.
– Wide applicability of CD: `1 regression (Wu and Lange, 2008), svm (Platt, 1999),
group lasso (block CD), graphical lasso (Friedman et al., 2008), ...

22 Lecture 22, Apr 8
Last Time
• Coordinate descent for sparse regression.

Today
• Proximal gradient and accelerated proximal gradient method.

Proximal gradient and accelerated proximal gradient method: why?


• Because of applications in machine learning and statistics, there is a resurgence of
interests in first order optimization methods that use only gradient information since
90s.

• The classical gradient descent (steepest descent) method minimizes a differentiable


function f by iterating

x(t+1) = x(t) − s∇f (x(t) )

– Step size s can be fixed or determined by line search (backtracking or exact)


– Advantages
∗ Each iteration is inexpensive.
∗ No need to derive, compute, store and invert Hessians; attractive in large
scale problems.
– Disadvantages
∗ Slow convergence (zigzagging).

∗ Do not work for non-smooth problems.

– Remedies
∗ Slow convergence:
· conjugate gradient method
· quasi-Newton
· accelerated gradient method
∗ Non-differentiable or constrained problems:
· subgradient method
· proximal gradient method
· smoothing method
· cutting-plane methods

Proximal gradient method

R A definitive resource for learning about proximal algorithms is (Parikh and Boyd, 2013) https://web.stanford.edu/~boyd/papers/prox_algs.html

• “Much like Newton’s method is a standard tool for solving unconstrained smooth min-
imization problems of modest size, proximal algorithms can be viewed as an analogous
tool for nonsmooth, constrained, large-scale, or distributed versions of these problems.”

• The proximal mapping (or prox-operator) of a convex function g is

    prox_g(x) = argmin_u { g(u) + (1/2)‖u − x‖₂² }.

R Intuitively, prox_g(x) moves towards the minimum of g but not far away (proximal) from the point x.

• Fact: For a closed convex g, prox_g(x) exists and is unique for all x.

R A function f(x) with domain Rⁿ and range (−∞, ∞] is said to be closed (or lower semicontinuous) if every sublevel set {x : f(x) ≤ c} is closed. An alternative definition is f(x) ≤ lim inf_m f(x_m) whenever lim_m x_m = x. Another definition is that the epigraph {(x, y) ∈ Rⁿ × R : f(x) ≤ y} is closed. Examples of closed functions are all continuous functions, the matrix rank, and set indicators.

• Examples of proximal mapping.

1. (Constant function) g(x) ≡ c: proxg (x) = x.


2. (Indicator) g(x) = χ_C(x): the projection operator

    prox_g(x) = argmin_u { χ_C(u) + (1/2)‖u − x‖₂² } = P_C(x).

  R In this sense, the proximal operator generalizes the projection operator onto a closed convex set.

3. (Lasso) g(x) = λ‖x‖₁: the soft-thresholding (shrinkage) operator

    prox_g(x)_i = argmin_{uᵢ} { λ|uᵢ| + (1/2)(uᵢ − xᵢ)² } = sgn(xᵢ)(|xᵢ| − λ)₊.

  Proof. If uᵢ ≥ 0, the stationarity condition dictates uᵢ = (xᵢ − λ)₊. If uᵢ < 0, the stationarity condition dictates uᵢ = xᵢ + λ = −(−xᵢ − λ)₊.

4. (Group lasso) g(x) = λ‖x‖₂: the group soft-thresholding operator

    prox_g(x) = argmin_u { λ‖u‖₂ + (1/2)‖u − x‖₂² }
              = (1 − λ/‖x‖₂)x   if ‖x‖₂ ≥ λ,
              = 0               otherwise.

  Proof. Assuming ‖u‖₂ > 0, the stationarity condition says

    u − x + (λ/‖u‖₂)u = 0, or equivalently, (1 + λ/‖u‖₂)u = x.

  Taking the ℓ₂ norm on both sides shows ‖u‖₂ = ‖x‖₂ − λ. Therefore

    u* = (1 − λ/‖x‖₂)x

  is the global minimum when ‖x‖₂ ≥ λ. For ‖x‖₂ < λ, we have

    (1/2)‖u − x‖₂² + λ‖u‖₂
    = (1/2)‖x‖₂² + (1/2)‖u‖₂² − ⟨u, x⟩ + λ‖u‖₂
    ≥ (1/2)‖x‖₂² + (1/2)‖u‖₂² − ‖u‖₂‖x‖₂ + λ‖u‖₂
    = (1/2)‖x‖₂² + (1/2)‖u‖₂² + ‖u‖₂(λ − ‖x‖₂)
    ≥ (1/2)‖x‖₂².

  Therefore u* = 0 is the global minimum.

  R When x is a scalar, this reduces to the soft-thresholding operator sgn(x)(|x| − λ)₊.
5. (Nuclear norm) g(X) = λ‖X‖*: the matrix soft-thresholding operator

    prox_g(X) = argmin_Y { λ‖Y‖* + (1/2)‖Y − X‖_F² } = U diag((σ − λ)₊)Vᵀ = S_λ(X),

  where X = U diag(σ)Vᵀ is the SVD of X. See the ST758 (2014 fall) lecture notes, p159, for the proof.

6. ...

R It is worthwhile to maintain a library of projection and proximal operators in Julia, because they form the building blocks of many machine learning algorithms. A small sketch is given below.
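A minimal Julia sketch of such a library, implementing the prox operators listed above:

    using LinearAlgebra

    # prox of λ|·| applied elementwise: soft-thresholding
    prox_l1(x::AbstractVector, λ) = sign.(x) .* max.(abs.(x) .- λ, 0)

    # prox of λ‖·‖₂: group soft-thresholding
    function prox_l2(x::AbstractVector, λ)
        nx = norm(x)
        return nx >= λ ? (1 - λ / nx) .* x : zero(x)
    end

    # prox of λ‖·‖_*: matrix (singular value) soft-thresholding
    function prox_nuclear(X::AbstractMatrix, λ)
        F = svd(X)
        return F.U * Diagonal(max.(F.S .- λ, 0)) * F.Vt
    end

    # projection onto the nonnegative orthant (prox of its indicator)
    proj_nonneg(x::AbstractVector) = max.(x, 0)

    prox_l1(randn(5), 0.5)      # example call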

• The proximal gradient algorithm minimizes the composite function

    h(x) = f(x) + g(x),

where f is convex and differentiable and g is a closed convex function with an inexpensive prox-operator, by iterating

    x^{(t+1)} = prox_{sg}(x^{(t)} − s∇f(x^{(t)}))
              = argmin_u { g(u) + (1/(2s))‖u − x^{(t)} + s∇f(x^{(t)})‖₂² }
              = argmin_u { g(u) + f(x^{(t)}) + ∇f(x^{(t)})ᵀ(u − x^{(t)}) + (1/(2s))‖u − x^{(t)}‖₂² }.

Here s is a constant step size or determined by line search.

R Interpretation: from the third line, we see that x^{(t+1)} minimizes g(x) plus a simple quadratic local model of f(x) around x^{(t)}.

R Interpretation: the function on the third line,

    h(x | x^{(t)}) := g(x) + f(x^{(t)}) + ∇f(x^{(t)})ᵀ(x − x^{(t)}) + (1/(2s))‖x − x^{(t)}‖₂²,

majorizes f(x) + g(x) at the current iterate x^{(t)} when s ≤ 1/L (why?). Therefore proximal gradient is an MM algorithm as well.

R The function to be minimized in each iteration is separable in the parameters.

R When g is constant, the proximal gradient method reduces to the classical gradient descent (or steepest descent) method. When g is the indicator function χ_C(x), the proximal gradient method reduces to the projected gradient method.

• Example. Lasso regression:

    minimize  (1/2)‖y − β₀1 − Xβ‖₂² + λ‖β‖₁,

where we identify f(β) = (1/2)‖y − β₀1 − Xβ‖₂² and g(β) = λ‖β‖₁. Then the proximal gradient method iterates according to

    β^{(t+1)} = prox_{sg}(β^{(t)} + sXᵀ(y − Xβ^{(t)}))
              = ST(β^{(t)} + sXᵀ(y − Xβ^{(t)}), sλ).

That is, we do iterative soft-thresholding. Note the intercept is not penalized, so we do not apply soft-thresholding to it. A Julia sketch is given below.
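A minimal Julia sketch of this iterative soft-thresholding scheme, assuming centered data (so the intercept is omitted) and the fixed step size s = 1/λ_max(XᵀX):

    using LinearAlgebra

    st(z, γ) = sign(z) * max(abs(z) - γ, 0)      # scalar soft-thresholding

    # proximal gradient (ISTA) for (1/2)‖y - Xβ‖² + λ‖β‖₁
    function ista_lasso(X, y, λ; maxiter = 500)
        s = 1 / opnorm(X)^2                      # 1/L with L = λ_max(XᵀX)
        β = zeros(size(X, 2))
        for t in 1:maxiter
            β = st.(β + s * (X' * (y - X * β)), s * λ)
        end
        return β
    end

    β = ista_lasso(randn(100, 10), randn(100), 2.0)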

• Convergence of proximal gradient method.

– Assumptions
∗ f is convex and ∇f (x) is Lipschitz continuous with parameter L > 0
∗ g is a closed convex function (so that proxsg is well-defined)
∗ optimal value h∗ = inf x h(x) is finite and attained at x∗
– Theorem: With fixed step size s = 1/L,

    h(x^{(t)}) − h* ≤ L‖x^{(0)} − x*‖₂² / (2t).
Similar result for backtracking line search without knowing L.
– Same convergence rate as the classical gradient method for smooth functions: O(1/ε) steps to reach h(x^{(t)}) − h* ≤ ε.
– Q: Can the O(1/t) rate be improved?

23 Lecture 23, Apr 15
Announcements
• HW7 posted http://hua-zhou.github.io/teaching/st790-2015spr/ST790-2015-HW7.
pdf

• Typo in lecture notes p167 (CD for lasso penalized least squares).

Last Time
• Proximal gradient algorithm.

Today
• Accelerated proximal gradient method.

Accelerated proximal gradient method


• Now we have a powerful tool, the proximal gradient method, for dealing with the non-smooth term in sparse regression. But it converges slowly, at the O(1/t) rate. Nesterov comes to the rescue.

• History:

– Nesterov:
∗ Nesterov (1983): original acceleration method for smooth functions
∗ Nesterov (1988): second acceleration method for smooth functions
∗ Nesterov (2005): smoothing techniques for nonsmooth functions, coupled
with original acceleration method
∗ Nesterov (2007): acceleration for composite functions
– Beck and Teboulle (2009b): extension of Nesterov (1983) to composite functions
(FISTA).
– Tseng (2008): unified analysis of acceleration techniques (all of these, and more).

• FISTA: Fast Iterative Shrinkage-Thresholding Algorithm (Beck and Teboulle, 2009b).

– Minimize

h(x) = f (x) + g(x),

175
where f is convex and differentiable and g is convex with inexpensive prox-
operator.
– FISTA algorithm: choose any x^{(0)} = x^{(−1)}; for t ≥ 1, repeat

    y ← x^{(t−1)} + ((t − 2)/(t + 1))(x^{(t−1)} − x^{(t−2)})   (extrapolation)
    x^{(t)} ← prox_{sg}(y − s∇f(y))                            (prox. grad. desc.)

  R Step size s is fixed or determined by line search.

  R Interpretation: the proximal gradient step is performed on the extrapolated point y based on the previous two iterates.

  R Physical interpretation of Nesterov acceleration? (Pointed out to me by Xiang Zhang) http://cs231n.github.io/neural-networks-3/#sgd
Zhang) http://cs231n.github.io/neural-networks-3/#sgd

• Convergence of FISTA.

– Assumptions
∗ f is convex and ∇f (x) is Lipschitz continuous with parameter L > 0
∗ g is closed convex (so that proxsg is well-defined)
∗ optimal value h∗ = inf x h(x) is finite and attained at x∗
– Theorem: With fixed step size s = 1/L,

    h(x^{(t)}) − h* ≤ L‖x^{(0)} − x*‖₂² / (2(t + 1)²).

  R Similar result for backtracking line search.

  R Need O(1/√ε) iterations to get h(x^{(t)}) − h* ≤ ε. To appreciate this acceleration: to get within ε = 10⁻⁴ of the optimal value, the proximal gradient method requires up to 10⁴ iterations, while the accelerated proximal gradient method requires only about 100 iterations.

• Improvement of convergence rate from O(1/t) to O(1/t2 ) is remarkable. Can we do


better? Nesterov says no. Formally

– Assumptions (smooth case)


∗ f is convex and differentiable
∗ ∇f (x) is Lipschitz continuous with parameter L > 0
∗ optimal value f ∗ = inf x f (x) is finite and attained at x∗

– First order method : any iterative algorithm that selects x(k) in

x(0) + span{∇f (x(0) ), . . . , ∇f (x(k−1) )}

is called a first order method.


– Problem class: any function that satisfies the above assumptions.
– Theorem (Nesterov, 1983): for every integer t ≤ (n − 1)/2 and every x^{(0)}, there exist functions in the problem class such that for any first-order method

    f(x^{(t)}) − f* ≥ (3/32) L‖x^{(0)} − x*‖₂² / (t + 1)².

  R This says O(1/t²) is the best rate first-order methods can achieve.
– Nesterov’s accelerated gradient method achieves the optimal O(1/t2 ) rate among
all first-order methods!
– Similarly FISTA achieves the optimal O(1/t2 ) rate among all first-order methods
for minimizing composite function h(x) = f (x) + g(x). See (Beck and Teboulle,
2009b) for proof.

• Example. Lasso (n = 100, p = 500): 100 instances.

• Example. Lasso logistic regression (n = 100, p = 500): 100 instances.

• Numerous applications of FISTA.

– Constrained optimization. When the projection to the constraint set C is inex-


pensive, accelerated projected gradient method applies.
– Lasso: f(β) + λ Σⱼ₌₁ᵖ |βⱼ|.
– Group lasso: f(β) + λ Σ_g ‖β_g‖₂.
– Matrix completion: (1/2)‖P_Ω(A) − P_Ω(B)‖_F² + λ‖B‖*.
  It yields an algorithm different from the MM algorithm we learned in ST758.
– Regularized matrix regression: f (B) + λkBk∗ (Zhou and Li, 2014).
– ...

• Remarks.

– Whenever we do (proximal) gradient method, use Nesterov’s acceleration. It is


“free” but makes a big difference in convergence rate.
– For regularization problems, warm start strategy may diminish the need for ac-
celeration.
– FISTA is not a monotone algorithm. See (Beck and Teboulle, 2009a) for a mono-
tone version.
– In practice the Lipschitz constant L is unknown.
∗ Obtain an initial estimate of L using the fact a twice differentiable f has
Lipschitz continuous gradient with parameter L iff LI − d2 f (x) is psd for

all x iff the largest eigenvalue of Hessian is bounded above by L. For least
squares, we have L = λmax (X T X). For logistic regression, we have L =
0.25λmax (X T X).
∗ See (Beck and Teboulle, 2009b) for the line search strategy. Same 1/t² convergence rate.

    y ← x^{(t−1)} + ((t − 2)/(t + 1))(x^{(t−1)} − x^{(t−2)})   (extrapolation)
    Repeat (line search)
        x_temp ← prox_{sg}(y − s∇f(y))
        s ← s/2
    until h(x_temp) ≤ h(x_temp | y)
    x^{(t)} ← x_temp

– For non-convex f , convergence to stationarity point. See (Beck and Teboulle,


2009a, Theorem 1.3).
– Alternative Nesterov acceleration sequence. The original Nesterov acceleration sequence takes the form (starting from α^{(−2)} = 0, α^{(−1)} = 1)

    y ← x^{(t−1)} + ((α^{(t−2)} − 1)/α^{(t−1)})(x^{(t−1)} − x^{(t−2)})   (extrapolation)
    x^{(t)} ← prox_{sg}(y − s∇f(y))                                       (prox. grad. desc.)
    α^{(t)} ← (1 + √(1 + (2α^{(t−1)})²)) / 2.

  See (Beck and Teboulle, 2009b). Same O(1/t²) convergence rate.

24 Lecture 24, Apr 20
Announcements
• HW7 due Tue, 4/21 @ 11:59PM.

Last Time
• Accelerated proximal gradient algorithm.

Today
• Path algorithm.

• ALM.

Path algorithm for regularization problems


• In statistics and machine learning, regularization problems solve

minimizeβ f (β) + λJ(β)

for all λ ≥ 0.

– λ controls the balance between model fit and model complexity.


– Most time we seek whole solution path, instead of solution at individual λs.
– Path algorithms trace the solution β(λ) as a function of λ.
– Need a principled way to choose λ (model selection).

• Example: Lasso solution path (n = 500, p = 100)

Observation: the solution path (in terms of λ) is piece-wise linear.

• Example: Solution paths with various penalties (n = 500, p = 100)

Observation: (1) The solution paths are piece-wise smooth for convex penalties, (2)
but may be discontinuous for non-convex penalties.

• How to derive a path algorithm? Consider the sparse regression f(β) + Σⱼ₌₁ᵖ P_η(|βⱼ|, λ) with a convex penalty P_η.

1. Write down the Karush-Kuhn-Tucker (KKT) condition for solution β(λ)

0 = ∇j f (β) + ∇βj Pη (|βj |, λ), for all βj 6= 0


0 ∈ ∇j f (β) + ∂βj Pη (|βj |, λ), for all βj = 0.

2. Apply the implicit function theorem to the first set of equations to derive the path
direction for active βj and determine when each of them hits zero.
3. Use the second set of equations to determine when a zero coefficient βj becomes

R
non-zero.

Recall that the subdifferential ∂f (x) of a convex function f (x) is the set of all
vectors g satisfying the supporting hyperplane inequality

f (y) ≥ f (x) + g T (y − x)

for all y. For instance, subdifferential of f (x) = |x| is [-1,1] at x = 0. If f (x) is


differentiable at x, then the set ∂f (x) reduces to the single vector ∇f (x).

• Example: Lasso (Osborne et al., 2000; Efron et al., 2004)


p
1 2
X
β̂(λ) = arg min ky − Xβk2 + λ |βj |.
β 2 j=1

For simplicity, we assume predictors and responses are centered so omit the intercept.
Stationarity condition (necessary and sufficient for global minimum in this case) says

0p ∈ −X T (y − Xβ) + λ∂kβk1 .

Let A = {j : βj 6= 0} index the non-zero coefficients. Then we have

    0_{|A|} = −X_Aᵀ(y − X_A β_A) + λ sgn(β_A)
    −λ1_{|Aᶜ|} ⪯ −X_{Aᶜ}ᵀ(y − X_A β_A) ⪯ λ1_{|Aᶜ|}.

Applying the implicit function theorem to the first set of equations yields the path
following direction
    (d/dλ) β̂_A(λ) = −(X_Aᵀ X_A)⁻¹ sgn(β_A),
which effectively shows that non-zero coefficients β̂ A (λ) and thus the subgradient vector
−XAT c (y − XA β̂ A (λ)) moves linearly within a segment. The second set of equations
monitor the events a zero coefficient becomes non-zero. Therefore for each βj , j ∈ A, we
calculate when it (ever) hits 0. And for each βj , j ∈ Ac , we calculate when it becomes

zero. Then the end of current segment (or start of next segment) is determined by the
event that happens soonest, where we update A and then continues.
The computational cost per segment is O(|A|2 ). The number of segments is harder
to characterize though (Donoho and Tanner, 2010). Under certain conditions whole
(piece-wise linear) solution path is obtained at the cost of a regular least squares fit
(Efron et al., 2004).

• Example: Generalized lasso (Tibshirani and Taylor, 2011; Zhou and Lange, 2013)
    (1/2)‖y − Xβ‖₂² + λ‖Vβ − d‖₁ + λ‖Wβ − e‖₊.
Piece-wise linear path. Applications include lasso, fused lasso, polynomial trend filter-
ing, image denoising, ...

• Example: Support vector machine (Hastie et al., 2004)


    min_{β₀,β}  Σᵢ₌₁ⁿ [1 − yᵢ(β₀ + xᵢᵀβ)]₊ + (λ/2)‖β‖₂².

Piece-wise linear path.

• Example: Quantile regression and many more piece-wise linear solution paths (Rosset
and Zhu, 2007).

• Example: GLM lasso


    f(β) + λ Σⱼ₌₁ᵖ |βⱼ|.

Approximate path algorithm (Park and Hastie, 2007) and exact path algorithm (Wu,
2011; Zhou and Wu, 2014) using ODE.

• Example: Convex generalized lasso (Zhou and Wu, 2014)

f (β) + λkV β − dk1 + λkW β − ek+ .

Applications include GLM (generalized) lasso, non-parametric density estimation, Gaus-


sian graphical lasso, ...

• A very general path algorithm presented by Friedman (2008) works for a large class of convex/concave penalties, but remains somewhat mysterious.

• Tuning parameter selection.

– λ balances the model fit and model complexity.

– Choosing λ is critical in statistical applications.
– Commonly used methods
∗ Cross validation
∗ Information criteria:
    AIC(λ) = ‖y − ŷ(λ)‖²/σ² + 2 df(λ)
    BIC(λ) = ‖y − ŷ(λ)‖²/σ² + ln(n) df(λ),

where ŷ(λ) = X β̂(λ) and df(λ) is the effective degrees of freedom of the
selected model at λ
– Using Stein (1981)’s theory of unbiased risk estimation (SURE), Efron (2004)
shows

    df(λ) = (1/σ²) Σᵢ₌₁ⁿ cov(ŷᵢ(λ), yᵢ) = E[tr(∂ŷ(λ)/∂y)]

under a differentiability condition on the mapping ŷ(λ).


∗ least squares estimate: df = tr(X(XᵀX)⁻¹Xᵀ) = p
∗ ridge: df(λ) = tr(X(XᵀX + λI)⁻¹Xᵀ) = Σⱼ₌₁ᵖ dⱼ²/(dⱼ² + λ), where the dⱼ are the singular values of X
∗ lasso (Zou et al., 2007): number of non-zero coefficients
∗ generalized lasso (Tibshirani and Taylor, 2011)
∗ group lasso (Yuan and Lin, 2006)
∗ nuclear norm regularization (Zhou and Li, 2014)
∗ ...

Augmented Lagrangian method (ALM)


• ALM is also called the method of multipliers.

• Consider optimization problem

minimize f (x)
subject to gi (x) = 0, i = 1, . . . , q.

– Inequality constraints are ignored for simplicity.


– Assume f and gi are smooth for simplicity.

– At a constrained minimum, the Lagrange multiplier condition
    0 = ∇f(x) + Σᵢ₌₁^q λᵢ∇gᵢ(x)

holds provided ∇gi (x) are linearly independent.

• Augmented Lagrangian:

    L_ρ(x, λ) = f(x) + Σᵢ₌₁^q λᵢgᵢ(x) + (ρ/2) Σᵢ₌₁^q gᵢ(x)².

– The penalty term (ρ/2) Σᵢ₌₁^q gᵢ(x)² punishes violations of the equality constraints gᵢ(x) = 0.
– Idea: optimize the Augmented Lagrangian and adjust λ in the hope of matching
the true Lagrange multipliers.
– For ρ large enough (but finite), the unconstrained minimizer of the augmented
Lagrangian coincides with the constrained solution of the original problem.
– At convergence, the gradient ρgi (x)∇gi (x) vanishes and we recover the standard
multiplier rule.

• Algorithm: take ρ initially large or gradually increase it; iterate:

  – find the unconstrained minimum

      x^{(t+1)} ← argmin_x L_ρ(x, λ^{(t)})

  – update the multiplier vector λ

      λᵢ^{(t+1)} ← λᵢ^{(t)} + ρgᵢ(x^{(t+1)}), i = 1, …, q.

R Intuition for updating λ: if x^{(t)} is the unconstrained minimum of L_ρ(x, λ^{(t)}), then the stationarity condition says

    0 = ∇f(x^{(t)}) + Σᵢ₌₁^q λᵢ^{(t)}∇gᵢ(x^{(t)}) + ρ Σᵢ₌₁^q gᵢ(x^{(t)})∇gᵢ(x^{(t)})
      = ∇f(x^{(t)}) + Σᵢ₌₁^q [λᵢ^{(t)} + ρgᵢ(x^{(t)})]∇gᵢ(x^{(t)}).

R For non-smooth f, replace the gradient ∇f by the subdifferential ∂f.

• Example: Compressed sensing (or basis pursuit) problem seeks the sparsest solution
subject to linear constraints

minimize kxk1
subject to Ax = b.

Take ρ initially large or gradually increase it; iterate according to


    x^{(t+1)} ← argmin_x ‖x‖₁ + ⟨λ^{(t)}, Ax − b⟩ + (ρ/2)‖Ax − b‖₂²   (a lasso problem)
    λ^{(t+1)} ← λ^{(t)} + ρ(Ax^{(t+1)} − b).

Converges in a finite (small) number of steps (Yin et al., 2008)

• The matrix completion problem (HW6 Q2)

minimize kXk∗
subject to xij = yij , (i, j) ∈ Ω

can be solved by ALM as well. It leads to an iterative singular value thresholding


procedure (Cai et al., 2010), which scales to very large problems.

• Remarks on ALM:

– History: The augmented Lagrangian method dates back to 50s (Hestenes, 1969;
Powell, 1969).
Without the quadratic penalty term (ρ/2)kAx−bk22 , it is the classical dual ascent
algorithm. Dual ascent algorithm works under a set of restrictive assumptions and
can be slow. ALM converges under much more relaxed assumptions (f can be
non differentiable, takes value ∞, ...)
– Monograph by Bertsekas (1982) provides a general treatment.
– Same as the Bregman iteration (Yin et al., 2008) for basis pursuit (compressive
sensing).
– Equivalent to proximal point algorithm applied to the dual; can be accelerated
(Nesterov).

25 Lecture 25, Apr 22
Announcements
• Course project due Wed, 4/29 @ 11:00AM.

Last Time
• Path algorithm.

• ALM (augmented Lagrangian method) or method of multipliers.

Today
• ADMM (alternating direction method of multipliers). A generic method for solving
many regularization problems.

• Dynamic programming: hidden Markov model, some fused lasso problems.

• HW7 solution sketch in Julia. http://hua-zhou.github.io/teaching/st790-2015spr/


hw07sol.html

ADMM

R A definitive resource for learning ADMM is (Boyd et al., 2011) http://stanford.edu/~boyd/admm.html

 • Alternating direction method of multipliers (ADMM).

   – Consider the optimization problem

         minimize    f(x) + g(y)
         subject to  Ax + By = c.

   – The augmented Lagrangian is

         L_ρ(x, y, λ) = f(x) + g(y) + ⟨λ, Ax + By − c⟩ + (ρ/2) ||Ax + By − c||_2^2.

   – Idea: perform block descent on x and y and then update the multiplier vector λ:

         x^(t+1) ← argmin_x f(x) + ⟨λ^(t), Ax + By^(t) − c⟩ + (ρ/2) ||Ax + By^(t) − c||_2^2
         y^(t+1) ← argmin_y g(y) + ⟨λ^(t), Ax^(t+1) + By − c⟩ + (ρ/2) ||Ax^(t+1) + By − c||_2^2
         λ^(t+1) ← λ^(t) + ρ(Ax^(t+1) + By^(t+1) − c)

     Remark: if we minimized over x and y jointly, this would be the same as ALM. The gain is
     that blockwise updates split the problem.
   – ADMM converges under mild conditions: f, g convex, closed, and proper, and L_0 has
     a saddle point.

 • Example: the generalized lasso problem minimizes

       (1/2) ||y − Xβ||_2^2 + µ ||Dβ||_1.

   – The special case D = I_p corresponds to the lasso. The special case where D is the
     (p − 1) × p first-difference matrix

         D = [ 1  −1
                   ...  ...
                        1  −1 ]

     corresponds to the fused lasso. Numerous applications.
   – Define γ = Dβ. Then we solve

         minimize    (1/2) ||y − Xβ||_2^2 + µ ||γ||_1
         subject to  Dβ = γ.

   – The augmented Lagrangian is

         L_ρ(β, γ, λ) = (1/2) ||y − Xβ||_2^2 + µ ||γ||_1 + λ^T (Dβ − γ) + (ρ/2) ||Dβ − γ||_2^2.

   – ADMM algorithm:

         β^(t+1) ← argmin_β (1/2) ||y − Xβ||_2^2 + λ^(t)T (Dβ − γ^(t)) + (ρ/2) ||Dβ − γ^(t)||_2^2
         γ^(t+1) ← argmin_γ µ ||γ||_1 + λ^(t)T (Dβ^(t+1) − γ) + (ρ/2) ||Dβ^(t+1) − γ||_2^2
         λ^(t+1) ← λ^(t) + ρ(Dβ^(t+1) − γ^(t+1))

     Remark: the β update is a smooth quadratic problem. Note that the Hessian stays constant
     across iterations, so its inverse (or a decomposition) can be computed just once, cached in
     memory, and re-used in each iteration.
     Remark: the γ update is a separable lasso problem (elementwise soft-thresholding), as in
     the R sketch below.

• Remarks on ADMM:

– Related algorithms

     ∗ split Bregman iteration (Goldstein and Osher, 2009)
     ∗ Dykstra (1983)'s alternating projection algorithm
     ∗ proximal point algorithm applied to the dual
     ∗ ...
– Numerous applications in statistics and machine learning: lasso, generalized lasso,
graphical lasso, (overlapping) group lasso, ...
– Embraces distributed computing for big data (Boyd et al., 2011).

 • Distributed computing with ADMM. Consider, for example, solving the lasso with a huge
   training data set (X, y) that is distributed over B machines. Denote the distributed
   data sets by (X_1, y_1), . . . , (X_B, y_B). Then the lasso criterion is

       (1/2) ||y − Xβ||_2^2 + µ ||β||_1 = (1/2) Σ_{b=1}^B ||y_b − X_b β||_2^2 + µ ||β||_1.

   The ADMM form is

       minimize    (1/2) Σ_{b=1}^B ||y_b − X_b β_b||_2^2 + µ ||β||_1
       subject to  β_b = β, b = 1, . . . , B.

   Here the β_b are local variables and β is the global (or consensus) variable. The augmented
   Lagrangian function is

       L_ρ(β, {β_b}, {λ_b}) = (1/2) Σ_{b=1}^B ||y_b − X_b β_b||_2^2 + µ ||β||_1
                              + Σ_{b=1}^B λ_b^T (β_b − β) + (ρ/2) Σ_{b=1}^B ||β_b − β||_2^2.

   The ADMM algorithm runs as follows.

   – Update the local variables β_b

         β_b^(t+1) ← argmin_{β_b} (1/2) ||y_b − X_b β_b||_2^2 + λ_b^(t)T (β_b − β^(t)) + (ρ/2) ||β_b − β^(t)||_2^2,
                     b = 1, . . . , B,

     in parallel on the B machines.
   – Collect the local variables β_b^(t+1), b = 1, . . . , B, and update the consensus variable β

         β^(t+1) ← argmin_β µ ||β||_1 + Σ_{b=1}^B λ_b^(t)T (β_b^(t+1) − β) + (ρ/2) Σ_{b=1}^B ||β_b^(t+1) − β||_2^2

     by elementwise soft-thresholding.
   – Update the multipliers

         λ_b^(t+1) ← λ_b^(t) + ρ(β_b^(t+1) − β^(t+1)), b = 1, . . . , B.

   The whole procedure is carried out without ever transferring the distributed data sets
   (y_b, X_b) to a central location! A serial sketch is given below.
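
   A serial R sketch of the consensus ADMM iteration for the lasso; the B blocks are looped over
   here purely for illustration (in a genuinely distributed setting each local update would run on
   its own machine), and the data sizes, µ, and ρ are arbitrary choices.

       ## Consensus ADMM for lasso: min 0.5*||y - X beta||^2 + mu*||beta||_1, data in B blocks.
       soft <- function(z, t) sign(z) * pmax(abs(z) - t, 0)

       set.seed(1)
       n <- 600; p <- 20; B <- 3; mu <- 5; rho <- 1
       X <- matrix(rnorm(n * p), n, p)
       beta_true <- c(rep(2, 5), rep(0, p - 5))
       y <- X %*% beta_true + rnorm(n)
       blocks <- split(1:n, rep(1:B, length.out = n))        # row indices held by each "machine"

       beta  <- rep(0, p)                                    # consensus variable
       betaB <- matrix(0, p, B)                              # local variables
       lamB  <- matrix(0, p, B)                              # local multipliers
       for (t in 1:200) {
         for (b in 1:B) {                                    # local updates (parallelizable)
           Xb <- X[blocks[[b]], , drop = FALSE]; yb <- y[blocks[[b]]]
           betaB[, b] <- solve(crossprod(Xb) + rho * diag(p),
                               crossprod(Xb, yb) + rho * beta - lamB[, b])
         }
         beta <- soft(rowMeans(betaB + lamB / rho), mu / (rho * B))  # consensus soft-threshold
         lamB <- lamB + rho * (betaB - beta)                 # multiplier updates
       }
       cbind(beta_true, round(beta, 2))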

Dynamic programming: introduction
• Divide-and-conquer : break the problem into smaller independent subproblems

– fast sorting,
– FFT,
– ...

• Dynamic programming (DP): subproblems are not independent, that is, subproblems
share common subproblems.

• DP solves these subproblems once and store them in a table.

• Use these optimal solutions to construct an optimal solution for the original problem.

• Richard Bellman began the systematic study of DP in 50s.

• Some classical (non-statistical) DP problems:

– Matrix-chain multiplication,
– Longest common subsequence,
– Optimal binary search trees,
– ...

See (Cormen et al., 2009) for a general introduction

• Some classical DP problems in statistics

– Hidden Markov model (HMM),


– Some fused-lasso problems,

– Graphical models (Wainwright and Jordan, 2008),
– Sequence alignment, e.g., discovery of the cystic fibrosis gene in 1989,
– ...

 • Let's work on a DP algorithm for the Manhattan tourist problem (MTP), taken
from Jones and Pevzner (2004, Section 6.3).

• MTP: weighted graph

Find a longest path in a weighted grid (only eastward and southward)

– Input: a weighted grid G with two distinguished vertices: a source (0, 0) and a
sink (n, m).
   – Output: a longest path in G from source to sink and its score MT(n, m).

   Brute force enumeration is out of the question even for a moderate-sized graph.

 • Simple recursive program.
   MT(n, m):

   – If n = 0 and m = 0, return 0
   – x ← MT(n − 1, m) + weight of the edge from (n − 1, m) to (n, m)   (if n > 0; else x ← −∞)
     y ← MT(n, m − 1) + weight of the edge from (n, m − 1) to (n, m)   (if m > 0; else y ← −∞)
   – Return max{x, y}

 • Something is wrong:

   – MT(n, m − 1) needs MT(n − 1, m − 1), and so does MT(n − 1, m).
   – So MT(n − 1, m − 1) will be computed at least twice.
   – Dynamic programming: the same idea as this recursive algorithm, but keep all
     intermediate results in a table and reuse them.

• MTP: dynamic programming

– Calculate optimal path score for each vertex in the graph


   – Each vertex's score is the maximum, over its predecessor vertices, of the predecessor's
     score plus the weight of the respective edge in between

 • MTP dynamic programming: path.
   [Figures omitted: the table of vertex scores and the resulting optimal path, showing all
   back-traces.]

 • MTP: recurrence

   – Compute the score for a vertex (i, j) by the recurrence relation

         s(i, j) = max{ s(i − 1, j) + weight of the edge between (i − 1, j) and (i, j),
                        s(i, j − 1) + weight of the edge between (i, j − 1) and (i, j) }.

   – The run time is O(mn) for an n-by-m grid (n = number of rows, m = number of columns).
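
   An R sketch of the tabular computation. The interface is an illustrative convention: down is
   an n x (m+1) matrix of southward edge weights (down[i, j] is the edge ending at vertex
   (i, j−1)) and right is an (n+1) x m matrix of eastward edge weights (right[i, j] is the edge
   ending at vertex (i−1, j)); the toy random weights are made up.

       ## Manhattan tourist DP: longest path score in an n x m grid.
       manhattan_tourist <- function(down, right) {
         n <- nrow(down); m <- ncol(right)
         s <- matrix(0, n + 1, m + 1)                 # s[i+1, j+1] = best score to vertex (i, j)
         for (i in 1:n) s[i + 1, 1] <- s[i, 1] + down[i, 1]      # first column
         for (j in 1:m) s[1, j + 1] <- s[1, j] + right[1, j]     # first row
         for (i in 1:n) for (j in 1:m) {
           s[i + 1, j + 1] <- max(s[i, j + 1] + down[i, j + 1],  # come from the north
                                  s[i + 1, j] + right[i + 1, j]) # come from the west
         }
         s[n + 1, m + 1]
       }

       ## toy example with random weights (n = 4, m = 4)
       set.seed(1)
       manhattan_tourist(down  = matrix(runif(4 * 5), 4, 5),
                         right = matrix(runif(5 * 4), 5, 4))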

• Remarks on DP:

– Steps for developing a DP algorithm


1. Characterize the structure of an optimal solution
2. Recursively define the value of an optimal solution
       3. Compute the value of an optimal solution in a bottom-up fashion
4. Construct an optimal solution from computed information
– “Programming” both here and in linear programming refers to the use of a tabular
solution method.
– Many problems involve large tables and entries along certain directions may be
filled out in parallel – fine scale parallel computing.

Application of dynamic programming: HMM


• Hidden Markov model (HMM) (Baum et al., 1970).

   – HMM is a Markov chain that emits symbols:
     Markov chain (µ, A = {a_kl}) + emission probabilities e_k(b).
   – The state sequence π = π_1 · · · π_L is governed by the Markov chain

         P(π_1 = k) = µ(k),   P(π_i = l | π_{i−1} = k) = a_kl.

   – The symbol sequence x = x_1 · · · x_L is determined by the underlying state sequence π:

         P(x, π) = ∏_{i=1}^L e_{π_i}(x_i) a_{π_{i−1} π_i}.

   – It is called hidden because in applications the state sequence is unobserved.

• Wide applications of HMM.

– Wireless communication: IEEE 802.11 WLAN.


– Mobile communication: CDMA and GSM.
– Speech recognition (Rabiner, 1989)
Hidden states: text, symbols: acoustic signals.
– Haplotyping and genotype imputation
Hidden states: haplotypes, symbols: genotypes.
– Gene prediction (Burge, 1997)

 • A general reference book on HMM is Durbin et al. (2006), Biological Sequence Analysis.

 • Let's work on a simple HMM example: the Occasionally Dishonest Casino (Durbin
   et al., 2006).
• Fundamental questions of HMM:

   – How to compute the probability of the observed sequence of symbols given known
     parameters a_kl and e_k(b)?
     Answer: forward algorithm.
   – How to compute the posterior probability of the state at a given position (posterior
     decoding) given a_kl and e_k(b)?
     Answer: backward algorithm.
   – How to estimate the parameters a_kl and e_k(b)?
     Answer: Baum-Welch algorithm.
   – How to find the most likely sequence of hidden states?
     Answer: Viterbi algorithm (Viterbi, 1967).

 • Forward algorithm:

   – Calculate the probability of an observed sequence

         P(x) = Σ_π P(x, π).

   – Brute force evaluation by enumerating all state paths is impractical.
   – Define the forward variable

         f_k(i) = P(x_1 . . . x_i, π_i = k).

   – Recursion formula for the forward variables

         f_l(i + 1) = P(x_1 . . . x_i x_{i+1}, π_{i+1} = l) = e_l(x_{i+1}) Σ_k f_k(i) a_kl.

   – Algorithm:
     ∗ Initialization (i = 1): f_k(1) = a_0k e_k(x_1).
     ∗ Recursion (i = 2, . . . , L): f_l(i) = e_l(x_i) Σ_k f_k(i − 1) a_kl.
     ∗ Termination: P(x) = Σ_k f_k(L).
     Time complexity = (# states)^2 × length of sequence.
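
   A minimal R sketch of the forward recursion. Probabilities are kept on the raw scale for
   readability (in practice one scales or works in log space to avoid underflow), and the
   transition, emission, and initial probabilities below are illustrative values in the spirit of
   the occasionally dishonest casino.

       ## Forward algorithm: P(x) for an HMM with initial probs a0, transition A, emission E.
       ## x is a vector of observed symbol indices in 1, ..., ncol(E).
       hmm_forward <- function(x, a0, A, E) {
         L <- length(x); K <- length(a0)
         f <- matrix(0, K, L)
         f[, 1] <- a0 * E[, x[1]]                                  # initialization
         for (i in 2:L)
           f[, i] <- E[, x[i]] * as.vector(t(A) %*% f[, i - 1])    # recursion
         sum(f[, L])                                               # termination: P(x)
       }

       ## toy casino: states = (fair, loaded), symbols = die faces 1-6 (illustrative parameters)
       A  <- matrix(c(0.95, 0.05, 0.10, 0.90), 2, 2, byrow = TRUE)
       E  <- rbind(rep(1/6, 6), c(rep(0.1, 5), 0.5))
       a0 <- c(0.5, 0.5)
       hmm_forward(x = c(1, 6, 6, 3, 6, 6, 2), a0 = a0, A = A, E = E)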

 • Backward algorithm.

   – Calculate the posterior state probabilities at each position

         P(π_i = k | x) = P(x, π_i = k) / P(x).

   – It is enough to calculate the numerator

         P(x, π_i = k) = P(x_1 . . . x_i, π_i = k) P(x_{i+1} . . . x_L | x_1 . . . x_i, π_i = k)
                       = P(x_1 . . . x_i, π_i = k) P(x_{i+1} . . . x_L | π_i = k)
                       = f_k(i) b_k(i).

   – Recursion formula for the backward variables

         b_k(i) = P(x_{i+1} . . . x_L | π_i = k) = Σ_l a_kl e_l(x_{i+1}) b_l(i + 1).

   – Algorithm:
     ∗ Initialization (i = L): b_k(L) = 1 for all k.
     ∗ Recursion (i = L − 1, . . . , 1): b_k(i) = Σ_l a_kl e_l(x_{i+1}) b_l(i + 1).
     ∗ Termination: P(x) = Σ_l a_0l e_l(x_1) b_l(1).
     Time complexity = (# states)^2 × length of sequence.
   – The Occasionally Dishonest Casino.
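
   A sketch of the backward recursion and the resulting posterior decoding f_k(i) b_k(i) / P(x),
   again on the raw probability scale; the casino parameters are the same illustrative values as
   in the forward-algorithm sketch, redefined here so the snippet stands alone.

       ## Backward recursion and posterior decoding for the toy casino HMM.
       A  <- matrix(c(0.95, 0.05, 0.10, 0.90), 2, 2, byrow = TRUE)   # illustrative transitions
       E  <- rbind(rep(1/6, 6), c(rep(0.1, 5), 0.5))                 # illustrative emissions
       a0 <- c(0.5, 0.5)                                             # illustrative initial probs

       hmm_backward <- function(x, A, E) {
         L <- length(x); K <- nrow(A)
         b <- matrix(0, K, L)
         b[, L] <- 1                                                 # initialization
         for (i in (L - 1):1)
           b[, i] <- as.vector(A %*% (E[, x[i + 1]] * b[, i + 1]))   # recursion
         b
       }

       hmm_posterior <- function(x, a0, A, E) {
         L <- length(x); K <- length(a0)
         f <- matrix(0, K, L)
         f[, 1] <- a0 * E[, x[1]]
         for (i in 2:L) f[, i] <- E[, x[i]] * as.vector(t(A) %*% f[, i - 1])
         b <- hmm_backward(x, A, E)
         f * b / sum(f[, L])                                         # K x L matrix P(pi_i = k | x)
       }

       round(hmm_posterior(x = c(1, 6, 6, 3, 6, 6, 2), a0 = a0, A = A, E = E), 3)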

• Parameter estimation for HMM – Baum-Welch algorithm.

 • Question: given n independent training symbol sequences x^1, . . . , x^n, how to find the
   parameter value that maximizes the log-likelihood log P(x^1, . . . , x^n | θ) = Σ_{j=1}^n log P(x^j | θ)?

   – When the underlying state sequences are known: simple.
   – When the underlying state sequences are unknown: Baum-Welch algorithm.

• MLE when state sequences are known.

   – Let A_kl = # transitions from state k to l and
     E_k(b) = # times state k emits symbol b.
     The MLEs are

         a_kl = A_kl / Σ_{l'} A_{kl'}   and   e_k(b) = E_k(b) / Σ_{b'} E_k(b').      (1)

   – To avoid overfitting with insufficient data, add pseudocounts

         A_kl   = # transitions k to l in training data + r_kl;
         E_k(b) = # emissions of b from k in training data + r_k(b).

• MLE when state sequences are unknown: Baum-Welch algorithm.

   – Idea: replace the counts A_kl and E_k(b) by their expectations conditional on the
     current parameter iterate (EM algorithm!).
   – The probability that a_kl is used at position i of sequence x:

         P(π_i = k, π_{i+1} = l | x, θ)
           = P(x, π_i = k, π_{i+1} = l) / P(x)
           = P(x_1 . . . x_i, π_i = k) a_kl e_l(x_{i+1}) P(x_{i+2} . . . x_L | π_{i+1} = l) / P(x)
           = f_k(i) a_kl e_l(x_{i+1}) b_l(i + 1) / P(x).

   – So the expected number of times that a_kl is used in all training sequences is

         A_kl = Σ_{j=1}^n (1/P(x^j)) Σ_i f_k^j(i) a_kl e_l(x^j_{i+1}) b_l^j(i + 1).      (2)

 • Baum-Welch algorithm.

   – Initialization: pick arbitrary model parameters.
   – Recursion:
     ∗ Set all the A and E variables to their pseudocounts r (or to zero).
     ∗ For each sequence j = 1, . . . , n:
       · calculate f_k(i) for sequence j using the forward algorithm
       · calculate b_k(i) for sequence j using the backward algorithm
       · add the contribution of sequence j to A via (2) and to E via the analogous formula
     ∗ Calculate the new model parameters using (1).
     ∗ Calculate the new log-likelihood of the model.
   – Termination: stop if the change in log-likelihood is less than a predefined threshold
     or the maximum number of iterations is exceeded.

• Baum-Welch – The Occasionally Dishonest Casino.

 • Viterbi algorithm:

   – Calculate the most probable state path

         π* = argmax_π P(x, π).

   – Define the Viterbi variable

         v_l(i) = probability of the most probable path ending in state l with observation x_i.

   – Recursion for the Viterbi variables

         v_l(i + 1) = e_l(x_{i+1}) max_k (v_k(i) a_kl).

   – Algorithm:
     ∗ Initialization (i = 0): v_0(0) = 1, v_k(0) = 0 for all k > 0.
     ∗ Recursion (i = 1, . . . , L):

           v_l(i)   = e_l(x_i) max_k (v_k(i − 1) a_kl)
           ptr_i(l) = argmax_k (v_k(i − 1) a_kl)

     ∗ Termination:

           P(x, π*) = max_k (v_k(L) a_k0)
           π*_L     = argmax_k (v_k(L) a_k0)

     ∗ Traceback (i = L, . . . , 1): π*_{i−1} = ptr_i(π*_i)
     Time complexity = (# states)^2 × length of sequence.
   – Viterbi decoding - The Occasionally Dishonest Casino.
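
   A compact R sketch of Viterbi decoding, computed in log space to avoid underflow. It uses
   the same illustrative casino parameters as the earlier sketches and omits explicit begin/end
   states (a0 plays the role of a_0k and the a_k0 factor at termination is dropped).

       ## Viterbi decoding (log space) for the toy casino HMM.
       A  <- matrix(c(0.95, 0.05, 0.10, 0.90), 2, 2, byrow = TRUE)   # illustrative transitions
       E  <- rbind(rep(1/6, 6), c(rep(0.1, 5), 0.5))                 # illustrative emissions
       a0 <- c(0.5, 0.5)                                             # illustrative initial probs

       hmm_viterbi <- function(x, a0, A, E) {
         L <- length(x); K <- length(a0)
         v   <- matrix(-Inf, K, L)                     # v[k, i] = log prob of best path ending in k
         ptr <- matrix(0L, K, L)                       # back-pointers
         v[, 1] <- log(a0) + log(E[, x[1]])
         for (i in 2:L) for (l in 1:K) {
           cand <- v[, i - 1] + log(A[, l])
           ptr[l, i] <- which.max(cand)
           v[l, i]   <- log(E[l, x[i]]) + max(cand)
         }
         path <- integer(L)
         path[L] <- which.max(v[, L])
         for (i in L:2) path[i - 1] <- ptr[path[i], i] # traceback
         path                                          # 1 = fair, 2 = loaded in this toy coding
       }

       hmm_viterbi(x = c(1, 6, 6, 3, 6, 6, 2), a0 = a0, A = A, E = E)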

Application of dynamic programming: fused-lasso

 • Fused lasso (Tibshirani et al., 2005) minimizes

       −ℓ(β) + λ_1 Σ_{k=1}^{p−1} |β_{k+1} − β_k| + λ_2 Σ_{k=1}^{p} |β_k|

   over R^p for better recovery of signals that are both sparse and smooth.

 • In many applications, one needs to minimize

       O_n(u) = −Σ_{k=1}^n ℓ_k(u_k) + λ Σ_{k=1}^{n−1} p(u_k, u_{k+1}),

   where u_k takes values in a finite space S and p is a penalty function. This is a discrete
   (combinatorial) optimization problem.

• A genetic example:

– Model organism study designs: inbred mice


– Goal : impute the strain origin of inbred mice (Zhou et al., 2012)

 • Combinatorial optimization of penalized likelihood.

   – Minimize the objective function

         O(u) = −Σ_{k=1}^n L_k(u_k) + Σ_{k=1}^{n−1} P_k(u_k, u_{k+1})

     by choosing the proper ordered strain origin assignment along the genome.
   – u_k = a_k|b_k: the ordered strain origin pair.
   – L_k: log-likelihood function at marker k, matching imputed genotypes with the
     observed ones.
   – P_k: penalty function for adjacent markers k and k + 1, encouraging smoothness
     of the solution.

 • Log-likelihood at each marker. At marker k, u_k = a_k|b_k is the ordered strain origin pair
   and r_k/s_k is the observed genotype for animal i. The log-penetrance (conditional
   log-likelihood) is

       L_k(u_k) = ln [Pr(r_k/s_k | a_k|b_k)].

• Penalty for adjacent markers.

   – The penalty P_k(u_k, u_{k+1}) for each pair of adjacent markers is

         P_k(u_k, u_{k+1}) =  0,                                    a_k = a_{k+1}, b_k = b_{k+1}
                              −ln γ_i^p(b_{k+1}) + λ,               a_k = a_{k+1}, b_k ≠ b_{k+1}
                              −ln γ_i^m(a_{k+1}) + λ,               a_k ≠ a_{k+1}, b_k = b_{k+1}
                              −ln ψ_i^{mp}(a_{k+1}, b_{k+1}) + 2λ,  a_k ≠ a_{k+1}, b_k ≠ b_{k+1}.

   – Penalties suppress jumps between strains and guide jumps, when they occur,
     toward more likely states.

 • For each m = 1, . . . , n, define

       O_m(u_m) = min_{u_1, . . . , u_{m−1}} [ −Σ_{t=1}^m ℓ_t(u_t) + λ Σ_{t=1}^{m−1} p(u_t, u_{t+1}) ],

   beginning with O_1(u_1) = −ℓ_1(u_1). To proceed,

       O_{m+1}(u_{m+1}) = min_{u_m} [ O_m(u_m) − ℓ_{m+1}(u_{m+1}) + λ p(u_m, u_{m+1}) ]

   (see the R sketch below).

 • The computational time is O(s^4 n), where n = # markers and s = # founders.
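
   A generic R sketch of this min-sum recursion over a finite state space; loglik is an n x s
   matrix of the ℓ_k(u) values, pen an s x s matrix of penalty values p(u, u'), and both the
   interface and the toy example are illustrative assumptions.

       ## DP for minimizing -sum_k loglik[k, u_k] + lambda * sum_k pen[u_k, u_{k+1}]
       ## over sequences u_1, ..., u_n with u_k in {1, ..., s}.
       dp_segment <- function(loglik, pen, lambda) {
         n <- nrow(loglik); s <- ncol(loglik)
         O   <- matrix(0, n, s)                         # O[m, u] = O_m(u)
         ptr <- matrix(0L, n, s)
         O[1, ] <- -loglik[1, ]
         for (m in 1:(n - 1)) for (u in 1:s) {
           cand <- O[m, ] + lambda * pen[, u]           # minimize over the previous state
           ptr[m + 1, u] <- which.min(cand)
           O[m + 1, u] <- min(cand) - loglik[m + 1, u]
         }
         u <- integer(n)
         u[n] <- which.min(O[n, ])
         for (m in n:2) u[m - 1] <- ptr[m, u[m]]        # traceback of a minimizing sequence
         u
       }

       ## toy example: 3 states, penalty lambda * 1{u != u'}
       set.seed(1)
       ll  <- matrix(rnorm(20 * 3), 20, 3)
       pen <- 1 - diag(3)
       dp_segment(ll, pen, lambda = 0.5)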

• More fused-lasso examples.

   – Johnson (2013) proposes a dynamic programming algorithm for maximizing the
     general objective function

         Σ_{k=1}^n e_k(β_k) − λ Σ_{k=2}^n d(β_k, β_{k−1}),

     where e_k is an exponential family log-likelihood and d is a penalty function, e.g.,
     d(β_k, β_{k−1}) = 1{β_k ≠ β_{k−1}}.
   – Applications: L_0 least squares segmentation, fused lasso signal approximator
     (FLSA), ...

Take home message from this course


• Statistics, the science of data analysis, is the applied mathematics in the 21st century.

• Being good at computing (both algorithms and programming) is a must for today’s
working statisticians.

 • In this course, we studied and practiced many (overwhelming?) tools that help us
   deliver results faster and more accurately.

– Operating systems: Linux and scripting basics


– Programming languages: R (package development, Rcpp, ...), Matlab, Julia
– Tools for collaborative and reproducible research: Git, R Markdown, sweave
– Parallel computing: multi-core, cluster, GPU
– Convex optimization (LP, QP, SOCP, SDP, GP, cone programming)
– Integer and mixed integer programming
– Algorithms for sparse regression
– More advanced optimization methods motivated by modern statistical and ma-
chine learning problems, e.g., ALM, ADMM, online algorithms, ...
– Dynamic programming
– Advanced topics on EM/MM algorithms (not really ...)

Of course there are many tools not covered in this course, notably the Bayesian MCMC
machinery. Take a Bayesian course!

 • Updated benchmark results. R has been upgraded to v3.2.0 and Julia to v0.3.7 since the
   beginning of this course. I re-did the benchmark and did not see notable changes.
   The benchmark code R-benchmark-25.R, from http://r.research.att.com/benchmarks/R-benchmark-25.R,
   covers many commonly used numerical operations in statistics. We ported it to Matlab and
   Julia and report the run times (averaged over 5 runs) here.

 Machine specs: Intel i7 @ 2.6GHz (4 physical cores, 8 threads), 16G RAM, Mac OS 10.9.5.

 Test                                                 R 3.2.0   Matlab R2014a   Julia 0.3.7
 Matrix creation, trans, deformation (2500 × 2500)    0.80      0.17            0.16
 Power of matrix (2500 × 2500, A.^1000)               0.22      0.11            0.22
 Quick sort (n = 7 × 10^6)                            0.64      0.24            0.62
 Cross product (2800 × 2800, A^T A)                   9.89      0.35            0.37
 LS solution (n = p = 2000)                           1.21      0.07            0.09
 FFT (n = 2,400,000)                                  0.36      0.04            0.14
 Eigen-decomposition (600 × 600)                      0.77      0.31            0.53
 Determinant (2500 × 2500)                            3.52      0.18            0.22
 Cholesky (3000 × 3000)                               4.08      0.15            0.21
 Matrix inverse (1600 × 1600)                         2.93      0.16            0.19
 Fibonacci (vector)                                   0.29      0.17            0.65
 Hilbert (matrix)                                     0.18      0.07            0.17
 GCD (recursion)                                      0.28      0.14            0.20
 Toeplitz matrix (loops)                              0.32      0.0014          0.03
 Escoufiers (mixed)                                   0.39      0.40            0.15

For the simple Gibbs sampler test, R v3.2.0 takes 38.32s elapsed time. Julia v0.3.7
takes 0.35s.

• Do not forget course evaluation: https://classeval.ncsu.edu/secure/prod/cesurvey/

References
Alizadeh, F. and Goldfarb, D. (2003). Second-order cone programming. Math. Program.,
95(1, Ser. B):3–51. ISMP 2000, Part 3 (Atlanta, GA).

Armagan, A., Dunson, D., and Lee, J. (2013). Generalized double Pareto shrinkage. Statistica
Sinica, 23:119–143.

Baggerly, K. A. and Coombes, K. R. (2009). Deriving chemosensitivity from cell lines:


Forensic bioinformatics and reproducible research in high-throughput biology. Ann. Appl.
Stat., 3(4):1309–1334.

Baum, L. E., Petrie, T., Soules, G., and Weiss, N. (1970). A maximization technique
occurring in the statistical analysis of probabilistic functions of Markov chains. Ann.
Math. Statist., 41:164–171.

Beck, A. and Teboulle, M. (2009a). Fast gradient-based algorithms for constrained total
variation image denoising and deblurring problems. Trans. Img. Proc., 18(11):2419–2434.

Beck, A. and Teboulle, M. (2009b). A fast iterative shrinkage-thresholding algorithm for


linear inverse problems. SIAM J. Imaging Sci., 2(1):183–202.

Belloni, A., Chernozhukov, V., and Wang, L. (2011). Square-root lasso: pivotal recovery of
sparse signals via conic programming. Biometrika, 98(4):791–806.

Ben-Tal, A. and Nemirovski, A. (2001). Lectures on Modern Convex Optimization.


MPS/SIAM Series on Optimization. Society for Industrial and Applied Mathematics
(SIAM), Philadelphia, PA; Mathematical Programming Society (MPS), Philadelphia, PA.
Analysis, algorithms, and engineering applications.

Bertsekas, D. P. (1982). Constrained Optimization and Lagrange Multiplier Methods. Com-


puter Science and Applied Mathematics. Academic Press Inc. [Harcourt Brace Jovanovich
Publishers], New York.

Bertsimas, D. and Weismantel, R. (2005). Optimization Over Integers. Athena Scientific.

Boyd, S., Parikh, N., Chu, E., Peleato, B., and Eckstein, J. (2011). Distributed optimization
and statistical learning via the alternating direction method of multipliers. Found. Trends
Mach. Learn., 3(1):1–122.

Boyd, S. and Vandenberghe, L. (2004). Convex Optimization. Cambridge University Press,


Cambridge.

Brown, L. D. (1971). Admissible estimators, recurrent diffusions, and insoluble boundary
value problems. Ann. Math. Statist., 42:855–903.

Buckheit, J. and Donoho, D. (1995). Wavelab and reproducible research. In Antoniadis,


A. and Oppenheim, G., editors, Wavelets and Statistics, volume 103 of Lecture Notes in
Statistics, pages 55–81. Springer New York.

Burge, C. (1997). Prediction of complete gene structures in human genomic DNA. Journal
of Molecular Biology, 268(1):78–94.

Cai, J.-F., Candès, E. J., and Shen, Z. (2010). A singular value thresholding algorithm for
matrix completion. SIAM J. Optim., 20(4):1956–1982.

Candès, E. and Tao, T. (2007). The Dantzig selector: statistical estimation when p is much
larger than n. Ann. Statist., 35(6):2313–2351.

Candès, E. J., Romberg, J. K., and Tao, T. (2006). Stable signal recovery from incom-
plete and inaccurate measurements. Communications on Pure and Applied Mathematics,
59(8):1207–1223.

Candès, E. J. and Tao, T. (2006). Near-optimal signal recovery from random projections:
universal encoding strategies? IEEE Trans. Inform. Theory, 52(12):5406–5425.

Candès, E. J., Wakin, M. B., and Boyd, S. P. (2008). Enhancing sparsity by reweighted l1
minimization. J. Fourier Anal. Appl., 14(5-6):877–905.

Cormen, T. H., Leiserson, C. E., Rivest, R. L., and Stein, C. (2009). Introduction to Algo-
rithms. MIT Press, Cambridge, MA, third edition.

Daubechies, I., Defrise, M., and De Mol, C. (2004). An iterative thresholding algorithm for
linear inverse problems with a sparsity constraint. Comm. Pure Appl. Math., 57(11):1413–
1457.

Donoho, D. and Stodden, V. (2004). When does non-negative matrix factorization give a
correct decomposition into parts? In Thrun, S., Saul, L., and Schölkopf, B., editors,
Advances in Neural Information Processing Systems 16, pages 1141–1148. MIT Press.

Donoho, D. L. (2006). Compressed sensing. IEEE Trans. Inform. Theory, 52(4):1289–1306.

Donoho, D. L. (2010). An invitation to reproducible computational research. Biostatistics,


11(3):385–388.

Donoho, D. L. and Johnstone, I. M. (1994). Ideal spatial adaptation by wavelet shrinkage.


Biometrika, 81(3):425–455.

Donoho, D. L. and Johnstone, I. M. (1995). Adapting to unknown smoothness via wavelet
shrinkage. J. Amer. Statist. Assoc., 90(432):1200–1224.

Donoho, D. L. and Tanner, J. (2010). Counting the faces of randomly-projected hypercubes


and orthants, with applications. Discrete Comput. Geom., 43(3):522–541.

Durbin, R., Eddy, S., Krogh, A., and Mitchison, G. (2006). Biological Sequence Analysis.
  Cambridge University Press, eleventh edition.

Dykstra, R. L. (1983). An algorithm for restricted least squares regression. J. Amer. Statist.
Assoc., 78(384):837–842.

Eaton, M. L. (1992). A statistical diptych: admissible inferences—recurrence of symmetric


Markov chains. Ann. Statist., 20(3):1147–1179.

Efron, B. (2004). The estimation of prediction error: covariance penalties and cross-
validation. J. Amer. Statist. Assoc., 99(467):619–642. With comments and a rejoinder by
the author.

Efron, B., Hastie, T., Johnstone, I., and Tibshirani, R. (2004). Least angle regression. Ann.
Statist., 32(2):407–499. With discussion, and a rejoinder by the authors.

Efron, B. and Morris, C. (1973). Stein’s estimation rule and its competitors—an empirical
Bayes approach. J. Amer. Statist. Assoc., 68:117–130.

Efron, B. and Morris, C. (1977). Stein’s paradox in statistics. Scientific American,


236(5):119–127.

Fan, J. and Li, R. (2001). Variable selection via nonconcave penalized likelihood and its
oracle properties. J. Amer. Statist. Assoc., 96(456):1348–1360.

Frank, I. E. and Friedman, J. H. (1993). A statistical view of some chemometrics regression


tools. Technometrics, 35(2):109–135.

Friedman, J. (2008). Fast sparse regression and classification.
  http://www-stat.stanford.edu/~jhf/ftp/GPSpaper.pdf.

Friedman, J., Hastie, T., Höfling, H., and Tibshirani, R. (2007). Pathwise coordinate opti-
mization. Ann. Appl. Stat., 1(2):302–332.

Friedman, J., Hastie, T., and Tibshirani, R. (2008). Sparse inverse covariance estimation
with the graphical lasso. Biostatistics, 9(3):432–441.

Friedman, J. H., Hastie, T., and Tibshirani, R. (2010). Regularization paths for generalized
linear models via coordinate descent. Journal of Statistical Software, 33(1):1–22.

Fu, W. J. (1998). Penalized regressions: the bridge versus the lasso. J. Comput. Graph.
Statist., 7(3):397–416.

Goldstein, T. and Osher, S. (2009). The split Bregman method for l1 -regularized problems.
SIAM J. Img. Sci., 2:323–343.

Golub, G. H. and Van Loan, C. F. (1996). Matrix Computations. Johns Hopkins Studies in
the Mathematical Sciences. Johns Hopkins University Press, Baltimore, MD, third edition.

Grant, M. and Boyd, S. (2008). Graph implementations for nonsmooth convex programs. In
Blondel, V., Boyd, S., and Kimura, H., editors, Recent Advances in Learning and Control,
Lecture Notes in Control and Information Sciences, pages 95–110. Springer-Verlag Limited.
http://stanford.edu/~boyd/graph_dcp.html.

Hastie, T., Rosset, S., Tibshirani, R., and Zhu, J. (2004). The entire regularization path for
the support vector machine. J. Mach. Learn. Res., 5:1391–1415.

Hastie, T., Tibshirani, R., and Friedman, J. (2009). The Elements of Statistical Learning:
Data Mining, Inference, and Prediction. Springer Series in Statistics. Springer, New York,
second edition.

Hestenes, M. R. (1969). Multiplier and gradient methods. J. Optimization Theory Appl.,


4:303–320.

Huber, P. J. (1994). Huge data sets. In COMPSTAT 1994 (Vienna), pages 3–13. Physica,
Heidelberg.

Huber, P. J. (1996). Massive data sets workshop: The morning after. In Massive Data Sets:
Proceedings of a Workshop, pages 169–184. National Academy Press, Washington.

James, W. and Stein, C. (1961). Estimation with quadratic loss. In Proc. 4th Berkeley
Sympos. Math. Statist. and Prob., Vol. I, pages 361–379. Univ. California Press, Berkeley,
Calif.

Johnson, N. A. (2013). A dynamic programming algorithm for the fused lasso and L0 -
segmentation. Journal of Computational and Graphical Statistics, to appear.

Jones, N. C. and Pevzner, P. A. (2004). An Introduction to Bioinformatics Algorithms


(Computational Molecular Biology). The MIT Press.

Laurent, M. and Rendl, F. (2005). Semidefinite programming and integer programming.
In K. Aardal, G. N. and Weismantel, R., editors, Discrete Optimization, volume 12 of
Handbooks in Operations Research and Management Science, pages 393 – 514. Elsevier.

Lobo, M. S., Vandenberghe, L., Boyd, S., and Lebret, H. (1998). Applications of second-
order cone programming. Linear Algebra Appl., 284(1-3):193–228. ILAS Symposium on
Fast Algorithms for Control, Signals and Image Processing (Winnipeg, MB, 1997).

Mazumder, R., Friedman, J. H., and Hastie, T. (2011). SparseNet: Coordinate descent with
nonconvex penalties. Journal of the American Statistical Association, 106(495):1125–1138.

McKay, B., Bar-Natan, D., Bar-Hillel, M., and Kalai, G. (1999). Solving the bible code
puzzle. Statist. Sci., 14(2):150–173.

Nemhauser, G. and Wolsey, L. (1999). Integer and Combinatorial Optimization. Wiley-


Interscience Series in Discrete Mathematics and Optimization. John Wiley & Sons, Inc.,
New York. Reprint of the 1988 original, A Wiley-Interscience Publication.

Nesterov, Y. (1983). A method of solving a convex programming problem with convergence
  rate O(1/k^2). Soviet Mathematics Doklady, 27(2):372–376.

Nesterov, Y. (1988). On an approach to the construction of optimal methods of minimization


of smooth convex functions. Ekonomika i Matematicheskie Metody, 24:509–517.

Nesterov, Y. (2000). Squared functional systems and optimization problems. In High per-
formance optimization, volume 33 of Appl. Optim., pages 405–440. Kluwer Acad. Publ.,
Dordrecht.

Nesterov, Y. (2005). Smooth minimization of non-smooth functions. Math. Program., 103(1,


Ser. A):127–152.

Nesterov, Y. (2007). Gradient methods for minimizing composite objective function.

Osborne, M. R., Presnell, B., and Turlach, B. A. (2000). A new approach to variable selection
in least squares problems. IMA J. Numer. Anal., 20(3):389–403.

Parikh, N. and Boyd, S. (2013). Proximal algorithms. Found. Trends Mach. Learn., 1(3):123–
231.

Park, M. Y. and Hastie, T. (2007). L1 -regularization path algorithm for generalized linear
models. J. R. Stat. Soc. Ser. B Stat. Methodol., 69(4):659–677.

Peng, R. D. (2009). Reproducible research and biostatistics. Biostatistics, 10(3):405–408.

Peng, R. D. (2011). Reproducible research in computational science. Science,
334(6060):1226–1227.

Platt, J. C. (1999). Fast training of support vector machines using sequential minimal
optimization. In Schölkopf, B., Burges, C. J. C., and Smola, A. J., editors, Advances in
Kernel Methods, pages 185–208. MIT Press, Cambridge, MA, USA.

Potti, A., Dressman, H. K., Bild, A., and Riedel, R. F. (2006). Genomic signatures to guide
the use of chemotherapeutics. Nature medicine, 12(11):1294–1300.

Powell, M. J. D. (1969). A method for nonlinear constraints in minimization problems.


In Optimization (Sympos., Univ. Keele, Keele, 1968), pages 283–298. Academic Press,
London.

Rabiner, L. (1989). A tutorial on hidden Markov models and selected applications in speech
recognition. Proceedings of the IEEE, 77(2):257–286.

Rosset, S. and Zhu, J. (2007). Piecewise linear regularized solution paths. Ann. Statist.,
35(3):1012–1030.

Stein, C. (1956). Inadmissibility of the usual estimator for the mean of a multivariate normal
distribution. In Proceedings of the Third Berkeley Symposium on Mathematical Statistics
and Probability, 1954–1955, vol. I, pages 197–206, Berkeley and Los Angeles. University
of California Press.

Stein, C. M. (1981). Estimation of the mean of a multivariate normal distribution. Ann.


Statist., 9(6):1135–1151.

Tibshirani, R. (1996). Regression shrinkage and selection via the lasso. J. Roy. Statist. Soc.
Ser. B, 58(1):267–288.

Tibshirani, R., Bien, J., Friedman, J., Hastie, T., Simon, N., Taylor, J., and Tibshirani, R. J.
(2012). Strong rules for discarding predictors in lasso-type problems. J. R. Stat. Soc. Ser.
B. Stat. Methodol., 74(2):245–266.

Tibshirani, R., Saunders, M., Rosset, S., Zhu, J., and Knight, K. (2005). Sparsity and
smoothness via the fused lasso. J. R. Stat. Soc. Ser. B Stat. Methodol., 67(1):91–108.

Tibshirani, R. J. and Taylor, J. (2011). The solution path of the generalized lasso. Ann.
Statist., 39(3):1335–1371.

Tseng, P. (2001). Convergence of a block coordinate descent method for nondifferentiable


minimization. J. Optim. Theory Appl., 109(3):475–494.

Tseng, P. (2008). On accelerated proximal gradient methods for convex-concave optimiza-
tion. submitted to SIAM Journal on Optimization.

Vielma, J. P., Ahmed, S., and Nemhauser, G. (2010). Mixed-integer models for nonseparable
piecewise-linear optimization: unifying framework and extensions. Oper. Res., 58(2):303–
315.

Viterbi, A. (1967). Error bounds for convolutional codes and an asymptotically optimum
decoding algorithm. Information Theory, IEEE Transactions on, 13(2):260–269.

Wainwright, M. J. and Jordan, M. I. (2008). Graphical models, exponential families, and


variational inference. Found. Trends Mach. Learn., 1(1-2):1–305.

Williams, H. P. (2013). Model Building in Mathematical Programming. John Wiley & Sons,
Ltd., Chichester, fifth edition.

Witztum, D., Rips, E., and Rosenberg, Y. (1994). Equidistant letter sequences in the book
of genesis. Statist. Sci., 9(3):429–438.

Wu, T. T., Chen, Y., Hastie, T., Sobel, E., and Lange, K. (2009). Genome-wide association
analysis by lasso penalized logistic regression. Bioinformatics, 25(6):714–721.

Wu, T. T. and Lange, K. (2008). Coordinate descent algorithms for lasso penalized regression.
Ann. Appl. Stat., 2(1):224–244.

Wu, Y. (2011). An ordinary differential equation-based solution path algorithm. Journal of


Nonparametric Statistics, 23:185–199.

Yin, W., Osher, S., Goldfarb, D., and Darbon, J. (2008). Bregman iterative algorithms for
l1 -minimization with applications to compressed sensing. SIAM J. Imaging Sci., 1(1):143–
168.

Yuan, M. and Lin, Y. (2006). Model selection and estimation in regression with grouped
variables. J. R. Stat. Soc. Ser. B Stat. Methodol., 68(1):49–67.

Zhang, C.-H. (2010). Nearly unbiased variable selection under minimax concave penalty.
Ann. Statist., 38(2):894–942.

Zhou, H. and Lange, K. (2013). A path algorithm for constrained estimation. Journal of
Computational and Graphical Statistics, 22:261–283.

Zhou, H. and Li, L. (2014). Regularized matrix regressions. Journal of Royal Statistical
Society, Series B, 76(2):463–483.

211
Zhou, H. and Wu, Y. (2014). A generic path algorithm for regularized statistical estimation.
J. Amer. Statist. Assoc., 109(506):686–699.

Zhou, J. J., Ghazalpour, A., Sobel, E. M., Sinsheimer, J. S., and Lange, K. (2012). Quantita-
tive trait loci association mapping by imputation of strain origins in multifounder crosses.
Genetics, 190(2):459–473.

Zhu, J., Rosset, S., Tibshirani, R., and Hastie, T. J. (2004). 1-norm support vector ma-
chines. In Thrun, S., Saul, L., and Schölkopf, B., editors, Advances in Neural Information
Processing Systems 16, pages 49–56. MIT Press.

Zou, H. and Hastie, T. (2005). Regularization and variable selection via the elastic net. J.
R. Stat. Soc. Ser. B Stat. Methodol., 67(2):301–320.

Zou, H., Hastie, T., and Tibshirani, R. (2007). On the “degrees of freedom” of the lasso.
Ann. Statist., 35(5):2173–2192.

