ST790 2015 Spring LecNotes
1 Lecture 1: Jan 7
Today
• Introduction and course logistics
• Linux fundamentals
• Statistics, the science of data analysis, is the applied mathematics of the 21st century.
• Data is increasing in volume, velocity, and variety. Classification of data sets by Huber
(1994, 1996).
Data Size   Bytes                Storage Mode
Tiny        10^2                 Piece of paper
Small       10^4                 A few pieces of paper
Medium      10^6 (megabyte)      A floppy disk
Large       10^8                 Hard disk
Huge        10^9 (gigabytes)     Hard disk(s)
Massive     10^12 (terabytes)    RAID storage
• This course covers some topics on computing that I found useful for working statisticians but that are not covered in ST758 or a typical statistics curriculum. "Advanced" here does not mean more difficult.
• General topics.
• The last offering (2013 Spring) of this course may give you a rough idea:
http://www.stat.ncsu.edu/people/zhou/courses/st810/LectureNotes
Of course, topics on computing change fast.
Course logistics
• Check course website frequently for updates and announcements.
http://hua-zhou.github.io/teaching/st790-2015spr/schedule.html
Pre-lecture notes will be posted before each lecture. Cumulative lecture notes will be
updated and posted after each lecture.
• A course final project. Survey results: 31 (course project) vs 2 (final exam). Group or
solo?
• Linux is the most common platform for scientific computing.
– E.g., both department HPC (Beowulf cluster) and campus HPC run on CentOS
Linux. It’s a lot of computing power sitting there.
– Open source and community support.
– Things break; when they break on Linux, it's easy to fix them!
– Scalability: portable devices (Android, iOS), laptops, servers, and supercomput-
ers.
– Cost: it’s free!
By default, upon log-in, a user is at his/her home directory.
• Linux shells.
– Most commonly used shells: bash, csh, tcsh, ...
– Sometimes a script or a command does not run simply because it’s written
for another shell.
– Determine the current shell you are working on: echo $0 or echo $SHELL.
– List available shells: cat /etc/shells.
– Change your login shell permanently: chsh -s /bin/bash userid. Then log out
and log in.
Options for many Linux commands can be combined. E.g., ls -al.
– File permissions.
groups userid shows which group(s) a user belongs to.
– .. denotes the parent of current working directory.
– find is similar to locate but has more functionalities, e.g., select files by age,
size, permissions, .... , and is ubiquitous.
– less is a pager like more, but it also allows scrolling backwards through the input:
“less is more, and more is less”.
– grep prints lines that match an expression.
– Wildcard characters:
Wildcard Matches
? or . Any single character
* Any string of characters
+ One or more of preceding pattern
ˆ beginning of the line
[set] Any character in set
[!set] Any character not in the set
[a-z] Any lowercase letter
[0-9] Any number (same as [0123456789])
– Other useful text editing utilities include
sed, stream editor
awk, filter and report writer
and so on.
– Combinations of shell commands (grep, sed, awk, ...), piping and redirection, and regular expressions allow us to pre-process and reformat huge text files efficiently.
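The same regular-expression ideas carry over into R. A minimal illustration (the file names are made up) using grepl and grep:
files <- c("setup.log", "data01.csv", "data02.csv", "notes.txt")
grepl("^data[0-9]+\\.csv$", files)    # TRUE only for the two csv data files
grep("\\.txt$", files, value = TRUE)  # returns "notes.txt"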
2 Lecture 2, Jan 12
Announcements
• TA office hours changed to Tue @ 1P-2P and Fri @ 2P-3P.
Last Time
• Course introduction and logistics.
• Linux introduction: why linux, move around file system, viewing/peeking text files,
and simple manipulation of text file.
Today
• Linux introduction (continued).
• Key authentication.
– Emacs is a powerful text editor with extensive support for many languages including R, LaTeX, Python, and C/C++; however, it's not installed by default on many Linux distributions. Basic survival commands:
∗ emacs filename to open a file with emacs.
∗ CTRL-x CTRL-f to open an existing or new file.
∗ CTRL-x CTRX-s to save.
∗ CTRL-x CTRL-w to save as.
∗ CTRL-x CTRL-c to quit.
Google “emacs cheatsheet” to find something like
(Excerpt of the GNU Emacs reference card, © 2012 Free Software Foundation, Inc., listing key bindings for incremental search, query-replace, marking, case change, and the minibuffer.)
∗ vi is a modal editor: insert mode and normal mode. Pressing i switches
from the normal mode to insert mode. Pressing ESC switches from the insert
mode to normal mode.
∗ :x<Return> quit vi and save changes.
∗ :wq<Return> quit vi and save changes.
∗ :q!<Return> quit vi without saving latest changes.
∗ :w<Return> saves changes.
Google “vi cheatsheet” to find something like
cw          Change word
cc          Change the whole line
C           Change to the end of the line
J           Join lines
.           Repeat last text-changing command
u           Undo last change
U           Undo all changes to line
p           Put after the position or after the line
P           Put before the position or before the line
:w file     Write to file
:r file     Read file in after line
:e file     Edit file
:n          Go to next file
:p          Go to previous file
!!program   Replace line with output from program
Based on http://www.lagmonster.org/docs/vi.html
– Statisticians write a lot of code. Critical to adopt a good IDE (integrated develop-
ment environment) that goes beyond code editing: syntax highlighting, executing
code within editor, debugging, profiling, version control, ...
R Studio, Matlab, Visual Studio, Eclipse, Emacs, ...
• Bash completion. Bash provides the following standard completions for Linux users by default. Much less typing and many fewer typing errors!
1. Pathname completion
2. Filename completion
3. Variablename completion
E.g., echo $[TAB][TAB]
4. Username completion
E.g., cd ~[TAB][TAB]
5. Hostname completion
E.g., ssh hzhou3@[TAB][TAB]
It can also be customized to auto-complete other stuff such as options and command’s
arguments. Google “bash completion” for more information.
– Each process has a Process ID (PID), Username (UID), Parent process ID (PPID), time and date the process started (STIME), time running (TIME), ...
– ps command provides info on processes.
ps -eaf lists all currently running processes
ps -fp 1001 lists process with PID=1001
ps -eaf | grep python lists all python processes
ps -fu userid lists all processes owned by a user.
– kill kills a process. E.g., kill 1001 kills process with PID=1001.
killall kills a bunch of processes. E.g., killall -r R kills all R processes.
– top prints realtime process info (very useful).
(Seamless) remote access to Linux machines
• SSH (secure shell) is the dominant cryptographic network protocol for secure network
connection via an insecure network.
• Key authentication.
– Public key. Put on the machine(s) you want to log in.
– Private key. Put on your own computer. Consider this as the actual key in your
pocket; never give to others.
– Messages from the server to your computer are encrypted with your public key; they can only be decrypted using your private key.
– Messages from your computer to the server are signed with your private key (digital signatures) and can be verified by anyone who has your public key (authentication).
• Generate keys.
1. On Linux or Mac, ssh-keygen generates key pairs. E.g., on the teaching server
3. Append the public key to the ~/.ssh/authorized keys file of any Linux machine
we want to SSH to, e.g., the Beowulf cluster (hpc.stat.ncsu.edu).
4. Now you don’t need password each time you connect from the teaching server to
the Beowulf cluster.
5. If you set a passphrase when generating keys, you'll be prompted for the passphrase each time the private key is used. Avoid repeatedly entering the passphrase by using ssh-agent on Linux/Mac or Pageant on Windows.
The same key pair can be used between any two machines; we don't need to regenerate keys for each new connection.
For Windows users, the private key generated by ssh-keygen cannot be directly used by PuTTY; use PuTTYgen for conversion, then let PuTTY use the converted private key. Read Sections A and B of the tutorial http://
tipsandtricks.nogoodatcoding.com/2010/02/svnssh-with-tortoisesvn.html
– GUIs for Windows (WinSCP) or Mac (Cyberduck).
– (My preferred way) Use a version control system to sync project files between different machines and systems.
• Line breaks in text files. Windows uses a pair of CR and LF characters for line breaks. Linux/Unix uses an LF character only. Mac OS X also uses a single LF character, but old Mac OS used a single CR character for line breaks. If transferred in binary mode (bit by bit) between OSs, a text file can look like a mess. Most transfer programs automatically switch to text mode when transferring text files and convert line breaks between OSs, but I used to run into problems using WinSCP. Sometimes you have to tell WinSCP explicitly that a text file is being transferred.
Summary of Linux
• Practice Linux machine for this class:
teaching.stat.ncsu.edu
Start using it right now.
• Ask for help (order matters): Google (paste the error message to Google often helps),
man command if no internet access, friends, Terry, ...
• Homework (ungraded): set up keys for connecting your own computer to the teaching
server.
• Collaborative research. Statisticians, as opposed to “closet mathematicians”, rarely do things in a vacuum.
– Open source: cvs, subversion (aka svn), Git, ...
– Proprietary: Visual SourceSafe (VSS), ...
– Dropbox? Mostly for file backup and sharing; limited version control (1 month?), ...
• Why Git?
– The Eclipse Community Survey in 2014 shows Git is the most widely used source
code management tool now. Git (33.3%) vs svn (30.7%).
– History: Initially designed and developed by Linus Torvalds in 2005 for Linux
kernel development. “git” is the British English slang for “unpleasant person”.
I’m an egotistical bastard, and I name all my projects after myself. First
’Linux’, now ’git’.
Linus Torvalds
– Advantages of Git.
∗ Speed and simple (?) design.
∗ Strong support for non-linear development (1000s of parallel branches).
∗ Fully distributed. Fast, no internet required, disaster recovery,
∗ Scalable to large projects like the Linux kernel project.
∗ Free and open source.
– Be aware that svn is still widely used in IT industry (Apache, GCC, SourceForge,
Google Code, ...) and R development. E.g., type
svn log -v -l 5 https://svn.r-project.org/R
on command line to get a glimpse of what R development core team is doing.
– Good to master some basic svn commands.
Register an account on a Git server such as github.com or github.ncsu.edu (educational accounts are free using your NCSU email).
For this course, use github.ncsu.edu please.
– Git client.
∗ Linux: installed on many servers, including teaching.stat.ncsu.edu and
hpc.stat.ncsu.edu. If not, install on CentOS by yum install git.
∗ Mac: install by port install git.
∗ Windows: GitHub for Windows (GUI), TortoiseGIT (is this good?)
Don’t rely on GUI. Learn to use Git on command line.
3 Lecture 3, Jan 21
Announcements
• Today’s office hours change to 5P-6P.
Last Time
• Key authentication.
Today
• Version control using Git (cont’d).
• Reproducible research.
Version control using Git (cont’d)
• Life cycle of a project.
Stage 1:
Stage 2:
– Hopefully, research idea pans out and we want to put up a standalone software
development repository at github.com.
– This usually inherits from the codebase folder and happens when we submit a
paper.
– Challenges: keep all version history. Read Cai Li’s slides (http://hua-zhou.
github.io/teaching/st790-2015spr/gitslides-CaiLi.pdf) for how to migrate
part of a project to a new repository while keeping all history.
Stage 3:
– Maintaining and distributing software on github.com.
Josh Day will cover how to distribute an R package from GitHub next week.
– Synchronize local Git directory with remote repository (git pull).
– Modify files in local working directory.
– Add snapshots of them to staging area (git add).
– Commit: store snapshots permanently to (local) Git repository (git commit).
– Push commits to remote repository (git push).
– Register for an account on a Git server, e.g., github.ncsu.edu. Fill out your
profile, upload your public key to the server, ...
– Identify yourself at local machine:
git config --global user.name "Hua Zhou"
git config --global user.email "hua [email protected]"
Name and email appear in each commit you make.
– Initialize a project:
∗ Create a repository, e.g., st790-2015spr, on the server github.ncsu.edu.
Then clone to local machine
git clone [email protected]:unityID/st790-2015spr.git
∗ Alternatively use following commands to initialize a Git directory from a lo-
cal folder and then push to the Git server
git init
git remote add origin [email protected]:unityID/st790-2015spr.git
git push -u origin master
– Edit working directory.
git pull update local Git repository with remote repository (fetch + merge).
git status displays the current status of working directory.
git log filename displays commit logs of a file.
git diff shows differences (by default difference from the most recent commit).
git add ... adds file(s) to the staging area.
git commit commits changes in staging area to Git directory.
git push publishes commits in local Git directory to remote repository.
Following demo session is on my local Mac machine.
git reset --soft HEAD~1 undoes the last commit.
git checkout filename reverts a file in the working directory to the last commit.
git rm is different from rm: although git rm deletes files from the working directory, they are still in the Git history and can be retrieved whenever needed. So always be cautious about putting large data files or binary files into version control.
• Branching in Git.
– Branches in a project:
– For this course, you need to have two branches: develop for your own develop-
ment and master for releases (homework submission). Note master is the default
branch when you initialize the project; create and switch to develop branch im-
mediately after project initialization.
Read the blog post “R Package Versioning” by Yihui Xie:
http://yihui.name/en/2013/06/r-package-versioning/
∗ Tag a new release v0.0.4.
• Further resources:
Remark: large data files and binary files should not be version controlled or frequently committed.
– “Commit early, commit often and don’t spare the horses”
– Adding an informative message when you commit is not optional. Spending one minute now saves hours later for your collaborators and yourself. Read the following sentence to yourself 3 times:
“Write every commit message like the next person who reads it is an axe-
wielding maniac who knows where you live.”
• Acknowledgement: some material in this lecture is taken from Cai Li's group meeting presentation.
Reproducible research (in computational science)
An article about computational result is advertising, not scholarship. The
actual scholarship is the full software environment, code and data, that pro-
duced the result.
More information is available at
http://en.wikipedia.org/wiki/Anil_Potti
http://simplystatistics.org/2012/02/27/the-duke-saga-starter-set/
– Nature Genetics (2013 Impact Factor: 29.648). 20 articles about microarray
profiling published in Nature Genetics between Jan 2005 and Dec 2006.
– Bible code.
Witztum et al. (1994) Equidistant letter sequences in the book of Genesis. Statist. Sci., 9(3):429–438. http://projecteuclid.org/euclid.ss/1177010393
McKay et al. (1999) Solving the Bible code puzzle, Statist. Sci., 14(2):150–173.
http://cs.anu.edu.au/~bdm/dilugim/StatSci/
• Readings.
– Buckheit and Donoho (1995) Wavelab and reproducible research, in Wavelets and
Statistics, volume 103 of Lecture Notes in Statistics, pages 55–81. Springer New York. http://statweb.stanford.edu/~donoho/Reports/1995/wavelab.pdf
Donoho (2010) An invitation to reproducible computational research, Biostatis-
tics, 11(3):385-388.
– Peng (2009) Reproducible research and biostatistics, Biostatistics, 10(3):405–408.
Peng (2011) Reproducible research in computational science, Science, 334(6060):1226–
1227.
Roger Peng’s blogs Treading a New Path for Reproducible Research.
http://simplystatistics.org/2013/08/21/treading-a-new-path-for-reproducible-res
http://simplystatistics.org/2013/08/28/evidence-based-data-analysis-treading-a-
http://simplystatistics.org/2013/09/05/implementing-evidence-based-data-analysi
– Reproducible research with R and RStudio by Christopher Gandrud. It covers many useful tools: R, RStudio, LaTeX, Markdown, knitr, GitHub, Linux shell, ...
This book is nicely reproducible. Git clone the source from https://github.
com/christophergandrud/Rep-Res-Book and you should be able to compile into
a pdf.
– Reproducibility in Science at
http://ropensci.github.io/reproducibility-guide/
– Document everything!
– Everything is a text file (.csv, .tex, .bib, .Rmd, .R, ...). Text files aid future-proofing and are subject to version control.
Word/Excel files are not text files.
– All files should be human readable. Abundant comments and adopt a good style.
– Tie your files together.
– Use a dynamic document generation tool (weaving/knitting text, code, and output
together) for documentation. For example
http://hua-zhou.github.io/teaching/st758-2014fall/hw01sol.html
http://hua-zhou.github.io/teaching/st758-2014fall/hw02sol.html
...
http://hua-zhou.github.io/teaching/st758-2014fall/hw07sol.html
http://hua-zhou.github.io/teaching/st758-2014fall/hw08sol.html
– Use a version control system proactively.
– Print sessionInfo() in R.
For your homework, submit (put in the master branch) a final pdf report and all
files and instructions necessary to reproduce all results.
We will briefly talk about these features when discussing specific languages.
4 Lecture 4, Jan 26
Announcements
• Helpful tutorial about Git branching
http://pcottle.github.io/learnGitBranching/
shared by Bo Ning.
Last Time
• Version control using Git (cont’d).
• Reproducible research.
Today
• This week: languages (R, Matlab, Julia)
Computer Languages
– Efficiency (in both run time and memory) for handling big data.
– IDE support (debugging, profiling).
– Open source.
– Legacy code.
– Tools for generating dynamic report.
– Adaptivity to hardware evolution (parallel and distributed computing).
• Types of languages
• Messages
– To be versatile in the big data era, be proficient in at least one language in each
category.
– To improve the efficiency of interpreted languages such as R or Matlab, avoid loops as much as possible, i.e., vectorize code.
“The only loop you are allowed to have is that for an iterative algorithm.”
– For some tasks where looping is necessary, consider coding in C or Fortran. It
is convenient to incorporate compiled code into R or Matlab. But do this only
after profiling!
Success stories: glmnet and lars packages in R are based on Fortran.
– When coding in C, C++, or Fortran, make use of libraries for numerical linear algebra: BLAS, LAPACK, ATLAS, ...
Julia seems to combine the strengths of all these languages; that is, it achieves efficiency without vectorizing code.
5 Lecture 5, Jan 28
Announcements
• HW2 (NNMF, GPU computing) posted. Due Feb 11.
Last Time
• Computer languages.
Today
• Matlab and Julia.
Feature                 R                    Matlab             Julia
Open source             +                    -                  +
IDE                     RStudio (++)         +++                -
Dynamic documents       +++ (RMarkdown)      +++                +++ (IJulia)
Multi-threading         parallel pkg         +                  +
JIT compiler            pkg                  +                  +
Call C/Fortran          wrapper              wrapper            no glue code
Call shared library     wrapper              wrapper            no glue code
Typing                  -                    ++                 +++
Pass by reference       -                    -                  +++
Linear algebra          -                    MKL, Arpack        OpenBLAS, Eigpack
Distributed computing   -                    +                  +++
Sparse linear algebra   Matrix package       +++                +++
Documentation           -                    +++                ++
(+ marks a strength, - a weakness; more + signs indicate stronger support.)
• Benchmark code R-benchmark-25.R from http://r.research.att.com/benchmarks/
R-benchmark-25.R covers many commonly used numerical operations used in statis-
tics. We ported to Matlab and Julia and report the run times (averaged over 5 runs)
here.
Machine specs: Intel i7 @ 2.6GHz (4 physical cores, 8 threads), 16G RAM, Mac OS 10.9.5.
Test R 3.1.1 Matlab R2014a julia 0.3.5
• A slightly more complicated (and more realistic) example is taken from Doug Bates's slides http://www.stat.wisc.edu/~bates/JuliaForRProgrammers.pdf. The task is to use a Gibbs sampler to draw from a bivariate density.
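The density itself is not reproduced in these notes. Bates's example (following Darren Wilkinson's blog post) uses the full conditionals x | y ~ Gamma(3, rate = y² + 4) and y | x ~ N(1/(1 + x), 1/(2(1 + x))); under that assumption, a minimal R version of the sampler is:
Rgibbs <- function(N, thin) {
  mat <- matrix(0, nrow = N, ncol = 2)
  x <- 0
  y <- 0
  for (i in 1:N) {
    for (j in 1:thin) {
      x <- rgamma(1, shape = 3, rate = y * y + 4)        # draw x | y
      y <- rnorm(1, 1 / (x + 1), 1 / sqrt(2 * (x + 1)))  # draw y | x
    }
    mat[i, ] <- c(x, y)
  }
  mat
}
system.time(Rgibbs(20000, 200))  # the interpreted double loop is the bottleneck in R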
– With similar coding effort, Julia offers a ∼100-fold speed-up! Somehow the JIT in R didn't kick in. (Neither did Matlab's; it took about 20 seconds.)
– Julia offers the capability of strong typing of variables. This facilitates optimization by the compiler.
– With little effort, we can do parallel and distributed computing using Julia.
Julia
• IDE in Julia.
@time macro displays memory footprint and significant gc (garbage collection)
along with run time.
– Tim Holy:
“quickly write the simple version first (which is really pleasant thanks to Julia's design and the nice library that everyone has been contributing to)” → “run it” → “ugh, too slow” → “profile it” → “fix a few problems in places where it actually matters” → “ah, that's much nicer!”
– Stefan Karpinski:
1. You can write the easy version that works, and
2. You can usually make it fast with a bit more work.
– Julia provides a very rich collection of static data types, abstract types, and user-
defined types.
http://julia.readthedocs.org/en/latest/manual/integers-and-floating-point-numbe
http://julia.readthedocs.org/en/latest/manual/types/.
– Functions in Julia are really methods. All functions in Julia are generic so any
function definition is actually a method definition.
– Same function (method) names can be applied to different argument signatures.
– Templated methods and data types. Sometimes you want to define algorithms on
the abstract type with minor variations for, say, the element type.
– In Julia, all arguments to functions are passed by reference. A Julia function can
modify its arguments. Such mutating functions should have names ending in “!”.
• Call compiled code.
– In Julia, usually it’s unnecessary to write C/C++ or Fortran code for perfor-
mance. Just write loops in Julia and leave the work to its compiler.
– Still in many situations, we’d like to call functions in some compiled libraries
(developed in C or Fortran). Use the ccall() function in Julia; no glue code is
needed. For example, Mac OS has the math library libm.dylib, from which we
can call the sin function
– They must be shared libraries available in the load path, and if necessary a direct
path may be specified.
• Documentation.
• Julia summary.
“In my opinion Julia provides the best of both worlds and is the technical
programming language of the future.”
Doug Bates
6 Lecture 6, Feb 2
Announcements
• HW1 graded. Feedback
• HW2:
Last Time
• Julia: a promising language to know about.
Today
• Matlab.
• Parallel computing.
Matlab
• Matlab IDE. A powerful IDE comes with Matlab. Familiarity with it prevents tons
of pain.
– Essentials: syntax highlighting, code indenting/wrapping/folding, text width (de-
fault = 72 characters), ...
– Code cells delimited by %%. Cells break script into logical segments and facilitate
automatically generating documentation.
– Code analyzer. Are you greened? Check upper-right corner.
• Matlab functions.
– Each function is a separate file: fun1.m, fun2.m, ...
(In contrast, R and Julia can have multiple functions in one file.)
– Add help/documentation immediately below the function definition. It facilitates
the help command and automatically generating documentation.
– If there is more than one function in a file, only the first one is callable from outside. The others are local functions, the equivalent of subroutines/subfunctions in other languages.
– Nested function. It has access to the variables in its parent function. Memory
saver!
– Function help follows a fixed format: declaration, calling convention, see also,
example, copyright, ...
– Help command
% linear regression (normal responses, the default)
b = glmfit(x,y);
% logistic regression
b = glmfit(x,y,'binomial');
% probit regression
b = glmfit(x,y,'binomial','link','probit');
% probit regression with observation weights
b = glmfit(x,y,'binomial','link','probit','weights',wts);
• Debugging in Matlab.
• Profiling in Matlab.
– Let’s profile the lsq_sparsepath() function.
– profile viewer produces a summary in html that includes line by line analysis.
• Call compiled code in Matlab.
∗ The name of the mex function file is the name of your Matlab function
∗ Purpose: match data types between Matlab and C/Fortran, transfer
input/output (pass by value!), ...
∗ Format for mex function: Google for “matlab mex function”
– Step 4: Use mex command to compile source.
∗ This produces binary code: funname.mexmaci64 (Mac), funname.mexw64
(Windows), or funname.mexa64 (Linux)
∗ These binaries are what you need to run program. Just use as native Matlab
functions
6. Zip the toolbox folder and publish on your website or to Matlab Central
– Contents of a toolbox.
∗ Function files. The private folder “hides” functions and compiled binaries
not directly accessible by user
∗ Demo scripts
∗ The html (or any other name) folder holds the documentation generated by
“publishing” demo scripts
∗ The info.xml file contains essential information about the toolbox. It puts
the toolbox to the start menu of Matlab and links to the help documenta-
tion. See screenshots in next two slides for an example
• More features of Matlab.
• Matlab summary.
– Good points.
∗ Highly efficient, esp. for numerical linear algebra.
∗ Good IDE. Debugging and profiling is a breeze.
∗ The language of choice for some technical computing areas. E.g., my research requires a lot of ODE (ordinary differential equation) solving and tensor (multi-dimensional array) computing, which are not available (or not good enough) in R.
∗ Existence of Matlab sets a high standard for other competing technical com-
puting languages. Examples are R Studio and Julia.
∗ Reasonably up to date with hardware technology. For example, > 200 native functions in Matlab support GPU computing, and the distributed computing toolbox enables cluster computing for large-scale problems.
– Pitfalls.
∗ Not open source! $$$
∗ Limited statistical functionalities compared to R packages.
Summary of languages
• Choosing language(s) for your project mostly depends on your specific tasks, legacy code, and which “church you happen to frequent”.
• Never believe others’ benchmark results. Do your own profiling and benchmark.
• Don’t be afraid to learn new languages. Having more tools in your toolbox is always
a plus.
• Recent change in the landscape of parallel computing due to end of frequency scaling
game in 2004.
• Cranking up the clock frequency (frequency scaling) obviously reduces the average time per instruction, but unfortunately it also increases power consumption and worsens the cooling problem:
Power = Capacitance × Voltage² × Frequency.
• This is what I see when running Matlab benchmark code on a MacBook Pro with a
2.6 GHz Intel Core i7 CPU.
You can cook eggs on that CPU ...
• Intel canceled its Tejas and Jayhawk lines in 2004 due to power consumption constraint,
which declared the end of frequency scaling and start of parallel scaling.
• This paradigm shift changes the way we do computation. Running the serial code
written for single-core CPU on a multi-core CPU will not make it faster.
• There are many modes of parallel computing: multi-core, cluster, GPU, ...
– Intel Xeon E5-2640 chip, with 6 physical cores
– In total it appears as 12 “processors” (logical processors, virtual cores, logical
cores, siblings) to the OS on the teaching server.
– Theoretical throughput of the machine is
120 DP GFLOPS ≈ 4 DP FLOPs/cycle x 2.5 GHz x 12 logical processors
It’s almost impossible to achieve this theoretical throughput.
– For example, a MacBook Pro with an Intel Core i7-3720QM CPU @ 2.60GHz has 4 physical cores and 8 threads; it appears as 8 virtual cores to the OS.
• Some terminology.
∗ designed for C/C++, Fortran
∗ not easy to use from R (rpvm, Rmpi packages), Matlab, ...
– Forking.
– Sockets.
7 Lecture 7, Feb 4
Last Time
– Matlab.
– Parallel computing: what and why.
Today
– A debug-profile-optimize session on NNMF (HW2)?
– Parallel computing: multi-core in R, Matlab, Julia.
Multi-core computing in R
• Fact: base R is single-threaded.
– Develop multi-threaded code or libraries in C/C++, Fortran, ... and call from R.
– For embarrassingly parallel single-threaded tasks.
∗ Option 1: Manually run multiple R sessions
∗ Option 2: Make multiple system ("Rscript") calls. Typically automated
by a scripting language (Python, Perl, shell script) or within R.
∗ Option 3: Use package parallel
• parallel package in R.
> library(parallel)
> detectCores()
[1] 12
• Case study: one common embarrassingly parallel task in statistics is a Monte Carlo simulation study.
– E.g., in ST758 (2012, 2013), students are asked to carry out a simulation study to compare three procedures (LRT, eLRT, eRLRT) for testing H0 : σ_a² = 0 vs Ha : σ_a² > 0 in a linear mixed model (variance component model). We want to compare the size and power of the three methods across 16 V1 patterns (stored in n.pattern.list) and 7 σ_a²/σ_e² ratios (stored in sigma2.ratio.list).
– The Monte Carlo estimate of size/power and its standard deviation for each method/pattern/σ_a² combination can be summarized in a table.
compare.tests <- function( n.pattern, sigma2.ratio,
mc.size = 10000, ... )
– Run the same task using mcmapply() function (parallel analogue of mapply) in
the package parallel
> # parallel simulations using mcmapply w/o load balancing
> set.seed (123, "L'Ecuyer")
> system.time (result.mcmapply <- mcmapply ( compare.tests,
+ rep (n.pattern.list, each = length (sigma2.ratio.list), times = 1),
+ rep (sigma2.ratio.list, each = 1, times = length (n.pattern.list)),
+ MoreArgs = list (mc.size = 10), mc.cores = 12))
user system elapsed
218.226 0.840 22.378
– mc.cores=12 instructs using 12 cores
8 Lecture 8, Feb 9
Announcements
• No class and office hours this Wednesday. Instructor out of town.
• HW2 due this Wed @ 11:59PM. Tagging time will be your submission time. No tagging
time = no hw submission.
• HW2 progress.
Last Time
• A debug-profile-optimize session on NNMF (HW2).
Today
• Multi-core computing in R (cont’d), Matlab, Julia.
• Cluster computing.
1. Write a function to carry out Monte carlo simulation and method comparison for
one combination of levels (one cell in the table).
2. Use mcapply for multi-core parallel computing.
3. Results are automatically collected in the master session. No need for extra scripting to collect results from parallel runs.
• Load balancing: Good for small number of parallel tasks with wildly different compu-
tation times
No load balancing: Good for numerous parallel tasks with similar computation times
• Forking creates a new R process by taking a complete copy of the master process, including the workspace and random number stream. The copy shares memory with the master until modified, so forking is very fast.
mcmapply, mclapply, and related functions rely on the forking capability of POSIX operating systems (e.g., Linux, MacOS) and are not available on Windows. On Windows, use a socket cluster instead: create it with makeCluster(), compute via clusterMap() or parLapply(), and release it with stopCluster(cl).
• The same simulation example can be run using clusterMap with load balancing; a minimal sketch is given below.
• Many embarrassingly parallel tasks in statistics can be organized in a similar way using
parallel.
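A minimal sketch of that clusterMap version, assuming compare.tests, n.pattern.list, and sigma2.ratio.list are defined as above; the .scheduling = "dynamic" option requests load balancing:
library(parallel)
cl <- makeCluster(12)                 # socket cluster; also works on Windows
clusterSetRNGStream(cl, 123)          # reproducible parallel RNG streams
clusterExport(cl, "compare.tests")    # ship the simulation function to the workers
# (packages used inside compare.tests must also be loaded on the workers, e.g. via clusterEvalQ)
result.clusterMap <- clusterMap(cl, compare.tests,
    rep(n.pattern.list, each = length(sigma2.ratio.list)),
    rep(sigma2.ratio.list, times = length(n.pattern.list)),
    MoreArgs = list(mc.size = 10),
    .scheduling = "dynamic")          # dynamic scheduling = load balancing
stopCluster(cl)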
Multi-core and multi-thread computing in Matlab
• Many Matlab functions, esp. numerical linear algebra (MKL libraries), are multi-
threaded since 2007.
• For example, running a benchmark script on the teaching server occupies up to all 7
(virtual) cores.
• Parallel Computing Toolbox has more to offer (distributed array and SPMD, GPU
computing, parallel MapReduce, cluster computing, ...)
http://www.mathworks.com/help/distcomp/index.html
Cluster computing
• Architecture of a computer cluster - computing parts.
– Compute nodes, login nodes, gateway (I/O) nodes, management nodes, file servers
Remark: when logged into a cluster, always keep in mind that you're interacting with the login nodes, not the compute nodes.
– A chassis (or rack ) houses one or more nodes together with power, cooling, con-
nectivity, management, ... This is a rack of our beowulf cluster.
– A node (or blade) contains one or more sockets, memory, a modest size disk drive
holding OS, swap space, and a small local scratch space.
– Each socket holds one processor, e.g., Intel Xeon or AMD Opteron.
– A processor contains one or more cores (logical processors).
– The cores perform FLOPS.
ssh [email protected]
Use git or svn to synchronize project files.
bwsubmit mult submits multi-threaded jobs. Each user can use 20 threads at a time.
• Writing one script using the parallel package and submitting it by bwsubmit mult seems the easiest way to organize embarrassingly parallel R jobs.
henry2 HPC cluster at NCSU
• 1053 nodes (dual Xeon blade servers)
• Users do not interact with compute nodes directly. Users submit jobs which are sched-
uled to be run on compute nodes.
– bqueues: info about LSF batch queues
– simulate.fork.r
R script file
– R.csh
R configuration file
– henry2_submit_fork
Shell script file for LSF job submission
– RLRsim_2.0-11.tar.gz
necessary files for installing R libraries
...
# RLRsim package required for LR and RLR test
install.packages ("./RLRsim_2.0-11.tar.gz", repos=NULL, lib="./libs")
library (RLRsim,lib.loc="./libs")
# load libraries
library (compiler)
library (nlme) # required for lme()
library (parallel)
...
# parallel simulations using mcmapply with load balance
set.seed (123, "L’Ecuyer")
mc = detectCores ()
mc
system.time (result.mcmapplylb <- mcmapply (compare.tests,
rep (n.pattern.list, each = length(sigma2.ratio.list), times = 1),
rep (sigma2.ratio.list, each = 1, times = length(n.pattern.list)),
MoreArgs = list (mc.size = 10), mc.cores = mc, mc.preschedule = FALSE))
...
– #BSUB -n 12 requests 12 processors (logical cores, threads).
– #BSUB -W 10 requests maximum of 10 minutes.
– #BSUB -R em64t requests 64-bit machines.
– #BSUB -R span[hosts=1] requests all 12 processors to be on the same machine.
Note mcmapply relies on forking, which is a shared memory model.
– #BSUB -o out.%J and #BSUB -e err.%J specify the output and error files.
• Wait for the job to finish. Several files are generated in working directory
• Portion of out.111391
• Portion of simulate.fork.r.Rout
...
> # parallel simulations using mcmapply with load balance
> set.seed (123, "L’Ecuyer")
> mc = detectCores ()
> mc
[1] 12
> system.time (result.mcmapplylb <- mcmapply (
+ compare.tests,
+ rep (n.pattern.list, each = length(sigma2.ratio.list), times = 1),
+ rep (sigma2.ratio.list, each = 1, times = length(n.pattern.list)),
+ MoreArgs = list (mc.size = 10), mc.cores = mc, mc.preschedule = FALSE))
user system elapsed
284.954 8.231 26.172
>
> # save results
> save(n.pattern.list, sigma2.ratio.list,
+ result.mcmapplylb, file = "result.fork.RData")
>
>
> proc.time()
user system elapsed
287.951 8.374 29.419
– Note clusterMap relies on socket and in principle works with any number of
processors
– setenv MPICH_NO_LOCAL 1 specifies that all MPI messages will be passed through
sockets, not using shared memory available on a node
• ARC cluster. Ask your advisor for an account. R/Matlab not available. Only compiled
code. GPUs available.
http://moss.csc.ncsu.edu/~mueller/cluster/arc/
9 Lecture 9, Feb 16
Announcements
• HW2 graded. grade unityID.md committed to your master branch.
Last Time
• Cluster computing.
Today
• HW2 feedback.
• GPU computing.
HW2 feedback
• Solution sketch in Matlab and Julia:
http://hua-zhou.github.io/teaching/st790-2015spr/hw02sol.html
– For CPU code, Julia offers more low-level memory management capabilities, lead-
ing to more efficient computation.
– For GPU programming, Matlab wins hands down in ease of use. Julia GPU com-
puting relies on the CUDArt.jl and CUBLAS.jl packages. Currently CUBLAS.jl
implements approximately half of BLAS functions, including gemm. For non-BLAS
computations such as elementwise multiplication and division, users need to write
their own CUDA kernel functions.
For using GPU in Python, ask Xiang Zhang and Zhen Han. For using gputools
package in R, ask Brian Naughton.
• Interpretability of basis images from NNMF. The following figure (Hastie et al., 2009,
p55) contrasts the different basis images obtained by NNMF, VQ (vector quantization),
and PCA. For a mathematical explanation of what NNMF does, see Donoho and
Stodden (2004).
• Different kinds of GPUs. I ran the same Matlab and Julia code on the teaching server,
a desktop, and a laptop. They represent common GPUs we see everyday. Note these
models are a couple years old and stand for technology around 2011.
• CPU vs GPU.
– Gain of GPU over CPU depends on specific cards and precision. Baby GPUs on
laptops show no gain on DP computations.
• GPU SP (single precision) vs GPU DP (double precision).
– Do they get same objective values? Do we have to use double precision? For ex-
ample, in MCMC, Monte Carlo errors often far exceed numerical roundoff errors.
– How’s the timing using SP vs DP? Tesla card has similar SP and DP performance.
GTX card has higher SP performance than DP. Baby GPUs on laptops show no
gain on DP computations.
• GPU architecture vs CPU architecture.
– GPUs contain 100s of processing cores on a single card; several cards can fit in a
desktop PC
– Each core carries out the same operations in parallel on different input data –
single program, multiple data (SPMD) paradigm.
– Extremely high arithmetic intensity *if* one can transfer the data onto and results
off of the processors quickly.
An analogy taken from Andrew Beam’s presentation in ST790. Also see https:
//www.youtube.com/watch?v=-P28LKWTzrI.
On the other hand, the cross-platform nature of OpenCL, adopted by Intel and AMD, is attractive.
– GPU computing almost always involves (new) algorithm development and/or revamping CPU code.
– Research before going for GPGPU.
– Easier to develop in C/C++ (free compiler), Fortran (compiler $), and Matlab.
– Do not reinvent the wheel – use libraries.
– Use a higher level language such as Matlab, Julia or Python, if they happen to
provide all functions we need.
– CUDA toolchain provided by NVIDIA
https://developer.nvidia.com/cuda-zone
∗ C/C++
∗ free
∗ only for NVIDIA cards
– PGI toolchain (CUDA Fortran)
https://www.pgroup.com/resources/cudafortran.htm
∗ C/C++, Fortran
∗ $$$
∗ only for NVIDIA cards
– OpenCL (Open Computing Language)
∗ open source
∗ Specs for cross-platform, parallel programming of modern processors (PCs,
servers, handheld/embedded devices)
∗ Adopted by Intel, AMD, NVIDIA, Qualcomm, ...
∗ MAGMA (free): OpenCL LAPACK
– zgemm: cuBLAS on NVIDIA K40m vs MKL on Xeon E5-2697 v2 @ 2.7GHz.
– dgemm: MKL on Xeon Phi 7120P vs MKL on 12-core Xeon E5-2697 v2 @ 2.7GHz.
• Sparse linear algebra. cuSPARSE on K40m vs MKL on Xeon E5-2697 v2 @ 2.7GHz.
10 Lecture 10, Feb 18
Announcements
• TA’s Friday office hour changes to Thu Feb 19 @ 2P-3P.
Last Time
• GPU computing: introduction.
Today
• GPU computing: Matlab, Julia, R.
• Convex programming.
• Scheme for GPU algorithm development on Matlab.
% transfer data from host to GPU memory, e.g., xg = gpuArray(x)
% computation on GPU
...
% gather results back to host memory, e.g., y = gather(yg)
Key: minimize memory transfer between host memory and GPU memory.
• Always benchmark the specific bottleneck routine in CPU. If the bottleneck routine
does not enjoy GPU acceleration, there is no point embarking on GPU computing. E.g.,
to benchmark A\b (solve linear equations) on my desktop: paralleldemo_gpu_backslash()
in Matlab 2014a
Intel i7 960 CPU vs NVIDIA GTX 580 GPU
GPU computing in R
• Not supported in base R (opportunity? HiPLARM package).
GPU case study 2: PET imaging
• Poisson Model:
Y_i ∼ Poisson( Σ_{j=1}^p c_{ij} λ_j ),
where cij is the (pre-calculated) cond. prob. that a photon emitted by j-th pixel is
detected by i-th tube.
• Log-likelihood:
L(λ|y) = Σ_i [ y_i ln( Σ_j c_{ij} λ_j ) − Σ_j c_{ij} λ_j ] + const.
• Regularized log-likelihood for smoother image:
L(λ|y) − (µ/2) Σ_{{j,k}∈N} (λ_j − λ_k)²
= Σ_i [ y_i ln( Σ_j c_{ij} λ_j ) − Σ_j c_{ij} λ_j ] − (µ/2) Σ_{{j,k}∈N} (λ_j − λ_k)²,
• Which algorithm?
• Parameter constraints λj ≥ 0 are satisfied when start from positive initial values.
• The update of z_{ij}^(t) succumbs to BLAS (matrix-vector multiplication) and elementwise multiplication and division.
• The loop for updating pixels can be carried out independently – massive parallelism.
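For concreteness, here is a small R sketch of one unpenalized (µ = 0) EM update, assuming C is the n × p matrix of the detection probabilities c_ij; the whole update reduces to matrix-vector multiplications and elementwise operations, exactly the pattern that maps well onto BLAS and GPUs:
em_update <- function(C, y, lambda) {
  fitted <- drop(C %*% lambda)                           # mu_i = sum_j c_ij * lambda_j
  lambda * drop(crossprod(C, y / fitted)) / colSums(C)   # updated lambda_j
}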
CPU: 589 iterations, 67 s, objective −55767.32966. GPU: 589 iterations, 0.8 s, objective −55767.32970, roughly an 84-fold speed-up.
• Lessons learnt.
– Two people’s genome sequences are 99.9% identical
– SNP (single nucleotide polymorphism) is a single-letter change in DNA
– About 1 in 1000 DNA letters vary in the form of a SNP
– Genome-wide association study (GWAS) tries to find association of the trait of
interest (disease or not, blood pressure, height, ...) and each SNP
• Computation challenge and parallelism.
– For either MDR or Pearson, we need to construct tables for all C(p, 2) SNP pairs.
– For p = 10^6, C(p, 2) ≈ 5 × 10^11.
– Massive parallelism: tables for SNP pairs (1, 2), . . . , (p − 1, p) obviously can be
constructed in parallel
– How to organize? Merry-go-round.
Golub and Van Loan (1996, Section 8.4)
• Try it.
g++ -c -O2 -I/usr/local/cuda-6.5/include *.cpp
nvcc -O2 -c *.cu
g++ -o mdsmain -L/usr/local/cuda-6.5/lib64 -lcudart *.o
• GPU host code.
• Lessons learnt.
• A general optimization problem takes the form
minimize f_0(x)
subject to f_i(x) ≤ b_i, i = 1, . . . , m.
Here x is the optimization variable, f_0 is the objective function, and f_1, . . . , f_m are the constraint functions.
An equality constraint f_i(x) = b_i can be absorbed into the two inequality constraints f_i(x) ≤ b_i and −f_i(x) ≤ −b_i.
• If the objective and constraint functions are convex, then it is called a convex opti-
mization problem.
Remark: in a convex optimization problem, only linear equality constraints of the form Ax = b are allowed.
• The definitive resource is the book Convex Optimization by Boyd and Vandenberghe, which is freely available at http://stanford.edu/~boyd/cvxbook/. The same website has links to slides, code, and lecture videos.
• In this course, we learn basic terminology and how to recognize and solve some standard
convex programming problems.
11 Lecture 11, Feb 23
Announcements
• HW4 posted (Linear Programming). Due next Friday Mar 6 @ 11:59PM.
Last Time
• GPU computing: Matlab, Julia, R.
Today
• Convex sets and convex functions.
Convex sets
• The line segment (interval) connecting points x and y is the set {αx + (1 − α)y : α ∈ [0, 1]}.
• A set C is convex if for every pair of points x and y lying in C the entire line segment connecting them also lies in C. Examples of convex sets:
1. Any singleton.
2. Rn .
3. Any normed ball B_r(c) = {x : ‖x − c‖ ≤ r}, open or closed, of radius r centered at c.
Remark: ℓ_p(x) = (Σ_{i=1}^n |x_i|^p)^{1/p} is not a proper norm for 0 < p < 1.
4. Any hyperplane {x : xT v = c}.
5. Any closed half space {x : xT v ≤ c} or open half space {x : xT v < c}.
6. Any polyhedron
P = {x : a_j^T x ≤ b_j, j = 1, . . . , m, c_j^T x = d_j, j = 1, . . . , p} = {x : Ax ⪯ b, Cx = d}.
7. The set Sn++ of n × n pd matrices and the set Sn+ of n × n psd matrices.
• Examples of cones:
2. Is the set Sn++ of pd matrices a cone?
3. The set {(x, t) : ‖x‖_2 ≤ t} is called an ice cream (or Lorentz, or second order, or quadratic) cone.
• A set C is affine if θx + (1 − θ)y ∈ C for all x, y ∈ C and all θ ∈ R. Note θ is not restricted to the unit interval. An affine set is convex but not conversely. Every affine set A can be represented as a translate v + S of a vector subspace S.
• Example: The solution set of linear equations C = {x : Ax = b} is affine. The
converse is also true. Every affine set can be expressed as the solution set of a system
of linear equations.
Similar closure properties apply to convex cones and affine sets if either the restriction Σ_{i=1}^m α_i = 1 or the constraints α_i ≥ 0, respectively, are lifted.
• The convex hull conv C of a nonempty set C is the smallest convex set containing C. Equivalently, conv C is the set generated by taking all convex combinations Σ_{i=1}^m α_i x_i of elements of C.
The convex conical hull and affine hull of C are generated in a similar manner.
What is the affine hull of the circle C = {x ∈ R² : ‖x‖²_2 = 1}?
Convex functions
• A function f(x) on R^n is convex if
f(αx + (1 − α)y) ≤ αf(x) + (1 − α)f(y)
for all x, y and all α ∈ [0, 1].
To define a convex function f(x) on R^n, it is convenient to allow the value ∞ and disallow the value −∞.
The set {x : f(x) < ∞} is a convex set called the essential domain of f and written dom f. A convex function is proper if dom f ≠ ∅ and f(x) > −∞ for all x.
• If the inequality in the definition is strict on dom f when α > 0, β > 0, and x 6= y,
then the function is said to be strictly convex.
• A function f(x) is concave if its negative −f(x) is convex. For concave functions we allow the value −∞ and disallow the value ∞.
1. Affine function. Any affine function f (x) = aT x + b is both convex and concave.
2. Norm. Any norm (scalar homogeneity, triangle inequality and separates points)
on Rn is convex.
3. Indicator function. The indicator function δ_C(x), equal to 0 for x ∈ C and ∞ for x ∉ C, is convex if and only if the set C is convex.
7. Log-det. The function f (X) = ln det X is concave on Sn++ . (Two proofs below.)
• Sublevel sets {x : f (x) ≤ c} of a convex function f (x) are convex. If f (x) is continuous
as well, then all sublevel sets are also closed.
The converse is not true. For example, the sublevel set {x ∈ R2+ : 1 − x1 x2 ≤ 0} is
closed and convex, but the function 1 − x1 x2 is not convex on the domain R2+ = {x :
x1 ≥ 0, x2 ≥ 0}.
• (Jensen's inequality) A function f(x) is convex if and only if
f( Σ_{i=1}^m α_i x_i ) ≤ Σ_{i=1}^m α_i f(x_i)
for all α_i ≥ 0 with Σ_{i=1}^m α_i = 1.
The probabilistic version states f[E(X)] ≤ E[f(X)].
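A quick numerical illustration of the probabilistic version in R, using the convex function exp (the toy example is mine, not from the notes):
set.seed(1)
x <- rnorm(1e6)
exp(mean(x))   # f(E X): approximately exp(0) = 1
mean(exp(x))   # E f(X): approximately exp(1/2) ≈ 1.65, larger, as Jensen's inequality predicts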
• (First order condition) Let f(x) be differentiable on an open convex set C ⊂ R^n. Then f(x) is convex if and only if f(y) ≥ f(x) + df(x)(y − x) for all x, y ∈ C. Furthermore, f(x) is strictly convex if and only if strict inequality holds for all y ≠ x.
• (Second order condition) Let f (x) be a twice differentiable function on the open convex
set C ⊂ Rn . If its Hessian matrix d2 f (x) is psd for all x, then f (x) is convex. When
d2 f (x) is pd for all x, f (x) is strictly convex.
12 Lecture 12, Feb 25
Announcements
• HW3 due today @ 11:59PM. Commit to your master branch and tag.
• HW4 posted (Linear Programming). Due next Friday Mar 6 @ 11:59PM (?).
Last Time
• Convex sets and convex functions.
Today
• Convex functions (cont’d).
1. (Nonnegative weighted sums) If f (x) and g(x) are convex and α and β are non-
negative constants, then αf (x) + βg(x) is convex.
2. (Composition) h(x) is convex and increasing, and g(x) is convex and finite, then
the functional composition f (x) = h ◦ g(x) is convex.
3. (Composition with affine mapping) If f (x) is convex, then the functional compo-
sition f (Ax + b) of f (x) with an affine function Ax + b is convex.
4. (Pointwise maximum and supremum) If fi (x) is convex for each fixed i ∈ I, then
g(x) = supi∈I fi (x) is convex provided it is proper. Note the index set I may be
infinite.
5. (Pointwise limit) If fm (x) is a sequence of convex functions, then limm→∞ fm (x)
is convex provided it exists and is proper.
6. (Integration) If f(x, y) is convex in x for each fixed y and µ is a measure, then the integral g(x) = ∫ f(x, y) dµ(y) is convex provided it is proper. This is a generalization of the nonnegative weighted sum rule.
7. (Minimum) If f (x, y) is jointly convex in (x, y), then g(x) = inf y∈C f (x, y) is
convex provided it is proper and C is convex.
Remark: the product of two convex functions is not necessarily convex. Counterexample: x³ = x · x². However, if both functions are convex, nondecreasing (or nonincreasing), and positive on an interval, then the product is convex.
• Example: The function f(x) = x_[1] + · · · + x_[k], the sum of the k largest components of x ∈ R^n, is convex. (This is a hint for HW3 Q3.) Indeed, f(x) is the maximum of all possible sums of k different components of x; since it is the pointwise maximum of C(n, k) linear functions, it is convex.
• Example: The maximum eigenvalue λ_max(M) = max_{‖x‖=1} x^T M x is convex in M, since it is the pointwise supremum of the linear functions M ↦ x^T M x. Similarly, the minimum eigenvalue λ_min(M) is concave in M. The sum of the k largest eigenvalues is convex on S^n.
• Scalar composition f(x) = h ◦ g(x), where g : R^n → R and h : R → R:
– f is concave if h is concave and nonincreasing, and g is convex.
Remember by f″(x) = h″(g(x)) g′(x)² + h′(g(x)) g″(x); the same results apply to non-differentiable functions as well.
• Vector composition f(x) = h ◦ g(x) = h(g_1(x), . . . , g_k(x)), where g_i : R^n → R and h : R^k → R.
Remember by d²f(x) = Dg(x)^T d²h(g(x)) Dg(x) + (Dh(g(x)) ⊗ I_n) d²g(x); the same results apply to non-differentiable functions as well.
• Example: The matrix fractional function f(x, Y) = x^T Y^{-1} x is jointly convex in (x, Y) ∈ R^n × S^n_{++}. (It generalizes the quadratic-over-linear function f(x, y) = x²/y on R × R_{++}; this is a hint for HW3 Q5 and Q10.) Its epigraph is
epi f = {(x, Y, t) : Y ≻ 0, x^T Y^{-1} x ≤ t}
      = {(x, Y, t) : [ Y, x ; x^T, t ] ⪰ 0, Y ≻ 0},
which is convex. The second equality follows from the linear algebra fact that a block matrix
[ A, B ; B^T, C ]
is psd if and only if A is psd, the generalized Schur complement C − B^T A^− B is psd, and (I − AA^−)B = 0 (i.e., B ∈ C(A)).
Remark: the same argument yields joint convexity of the matrix function f(X, Y) = X^T Y^{-1} X on R^{m×n} × S^n_{++}.
(Singular case) The result can be further extended to the function
f(X, Y) = (1/2) u^T X^T Y^+ X u if Xu ∈ C(Y), and ∞ if Xu ∉ C(Y).
• (Line theorem) A function is convex if and only if it is convex when restricted to a line
that intersects its domain. That is f (x) is convex if and only if for any x ∈ domf and
v ∈ Rn , then function
g(t) = f (x + tv)
is convex on dom g = {t : x + tv ∈ dom f}.
Not sure if a function is convex? Generate a bunch of lines through the domain
and plot. If any of them are not convex, the function is not convex.
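A minimal R sketch of this numerical check, using log-sum-exp (a function known to be convex) purely as an illustration:
f <- function(x) log(sum(exp(x)))                # log-sum-exp, known to be convex
tgrid <- seq(-2, 2, length.out = 101)
set.seed(1)
gvals <- sapply(1:5, function(k) {
  x0 <- rnorm(3)                                 # random point
  v <- rnorm(3)                                  # random direction
  sapply(tgrid, function(t) f(x0 + t * v))       # g(t) = f(x0 + t v)
})
matplot(tgrid, gvals, type = "l", xlab = "t", ylab = "g(t)")  # each curve should look convex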
• Example: ln det X is concave on S^n_{++} (for n = 1 this is the concavity of ln x for x > 0). This is a hint for HW3 Q10. Restricting to a line X + tV with X ≻ 0,
g(t) = ln det(X + tV)
     = ln det X^{1/2}(I + tX^{-1/2} V X^{-1/2}) X^{1/2}
     = ln det X + ln det(I + tX^{-1/2} V X^{-1/2})
     = ln det X + Σ_{i=1}^n ln(1 + λ_i t),
where the λ_i are eigenvalues of X^{-1/2} V X^{-1/2}. g(t) is concave in t, thus the ln det function is concave too.
Log-convexity
• A positive function f (x) is said to be log-convex if ln f (x) is convex.
• Log-convex functions enjoy the same closure properties 1 through 7. In part 2 (the composition rule), g is convex and h is log-convex. In addition, the collection of log-convex functions is closed under the formation of products and powers.
Remark: not all rules apply to log-concave functions! For instance, a nonnegative sum of log-concave functions is not necessarily log-concave.
• Examples:
1. … is log-convex. Why?
2. The gamma function
Γ(t) = ∫_0^∞ x^{t−1} e^{−x} dx
is log-convex. Why?
3. The moment function
M(x) = ∫_0^∞ u^x f(u) du
is log-convex. Why?
5. The normal cdf
Φ(x) = (1/√(2π)) ∫_{−∞}^x e^{−u²/2} du
is log-concave. This is a hint for HW3 Q10.
Proof by log-concavity. Integrating the multivariate Gaussian density with pd covariance Σ,
f(x) = (2π)^{−n/2} |det Σ|^{−1/2} e^{−x^T Σ^{−1} x / 2},
produces
|det Σ|^{1/2} = (2π)^{−n/2} ∫ e^{−x^T Σ^{−1} x / 2} dx.
This identity can be restated in terms of the precision matrix Ω = Σ^{−1} as
ln det Ω = n ln(2π) − 2 ln ∫ e^{−x^T Ω x / 2} dx.
The integral on the right is log-convex. Why? Is the integral log-concave? Thus ln det Ω is concave.
Optimization softwares
Like computer languages, getting familiar with good optimization softwares broadens the
scope and scale of problems we are able to solve in statistics.
• The following table lists some of the best convex optimization softwares. Use of Gurobi and/or Mosek is highly recommended.
Gurobi is named after its founders: Zonghao Gu, Edward Rothberg, and Robert Bixby. Bixby founded CPLEX (now owned by IBM), while Rothberg and Gu led the CPLEX development team for nearly a decade.
LP MILP SOCP MISOCP SDP GP NLP MINLP R Matlab Julia Python Cost
JuMP.jl D D D D D D D O
Convex.jl D D D D D D O
cvx D D D D D D D D A
Gurobi D D D D D D D D A
Mosek D D D D D D D D D D D A
CPLEX D D D ? D D D A
SCS D D D D D D O
SeDuMi D D D ? D O
SDPT3 D D D ? D O
KNITRO D D D D D D D $
LP = Linear Programming, MILP = Mixed Integer LP, SOCP = Second-order cone pro-
gramming (includes QP, QCQP), MISOCP = Mixed Integer SOCP, SDP = Semidefinite
Programming, GP = Geometric Programming, NLP = (constrained) Nonlinear Program-
ming (includes general QP, QCQP), MINLP = Mixed Integer NLP, O = Open source, A =
Free academic license
Set up Gurobi on the teaching server
1. Gurobi 6.0 has been installed on the teaching server at
/usr/local/gurobi600
But you have to obtain a license (free) first in order to use it.
3. After confirmation of your academic account, log into your account and request a free
academic license at http://www.gurobi.com/download/licenses/free-academic.
4. Run grbgetkey command on the teaching server and enter the key you obtained in
step 3. Place the file at /home/USERID/.gurobi/
Now you should be able to use CVX in Matlab.
The standard CVX license comes with the free solvers SeDuMi and SDPT3; the academic license also bundles Gurobi and Mosek.
13 Lecture 13, Mar 2
Announcements
• HW4 (LP) deadline extended to Mon, Mar 16 @ 11:59PM.
Last Time
• Convex and log-convex functions.
Today
• LP (linear programming).
• A general linear program (LP) has the form
minimize c^T x
subject to Ax = b, Gx ⪯ h.
• The standard form of an LP is
minimize c^T x
subject to Ax = b
x ⪰ 0.
To transform a general linear program into the standard form, we introduce slack variables s ⪰ 0 such that Gx + s = h, and write x = x+ − x−, where x+ ⪰ 0 and x− ⪰ 0. This yields the problem
minimize c^T (x+ − x−)
subject to A(x+ − x−) = b
G(x+ − x−) + s = h
x+ ⪰ 0, x− ⪰ 0, s ⪰ 0
in x+, x−, and s.
Remark: slack variables are often used to transform a complicated inequality constraint into simple non-negativity constraints.
• An LP in inequality form is
minimize c^T x
subject to Gx ⪯ h.
• Minimizing a piecewise-linear function, minimize max_{i=1,...,m} (a_i^T x + b_i), can be transformed to an LP
minimize t
subject to a_i^T x + b_i ≤ t, i = 1, . . . , m,
in x and t. Apparently
minimize max_{i=1,...,m} |a_i^T x + b_i|
and
minimize max_{i=1,...,m} (a_i^T x + b_i)_+
are also LPs.
Remark: any convex optimization problem
minimize f_0(x)
subject to f_i(x) ≤ 0, i = 1, . . . , m
a_i^T x = b_i, i = 1, . . . , p,
is equivalent to its epigraph form
minimize t
subject to f_0(x) − t ≤ 0
f_i(x) ≤ 0, i = 1, . . . , m
a_i^T x = b_i, i = 1, . . . , p
in variables x and t. That is why people often say that a linear objective is universal for convex optimization.
• A linear-fractional program, minimize (c^T x + d)/(e^T x + f) subject to Gx ⪯ h, Ax = b (with e^T x + f > 0 on the feasible set), can be transformed to an LP
minimize c^T y + d z
subject to Gy − zh ⪯ 0
Ay − zb = 0
e^T y + f z = 1
z ≥ 0
via the change of variables y = x/(e^T x + f), z = 1/(e^T x + f).
• Example. Compressed sensing (Candès and Tao, 2006; Donoho, 2006) tries to address
a fundamental question: how to compress and transmit a complex signal (e.g., musical
clips, mega-pixel images), which can be decoded to recover the original signal?
Suppose a signal x ∈ R^n is sparse with s non-zeros. We under-sample the signal by multiplying by a measurement matrix: y = Ax, where A ∈ R^{m×n} has iid normal entries. Candès et al. (2006) show that the solution to
minimize ‖x‖_1
subject to Ax = y
exactly recovers the true signal under certain conditions on A when n ≫ s and m ≈ s ln(n/s). Why is sparsity a reasonable assumption? Virtually all real-world images have low information content.
The ℓ1 minimization problem apparently is an LP: writing x = x+ − x−,
minimize 1^T (x+ + x−)
subject to A(x+ − x−) = y
x+ ⪰ 0, x− ⪰ 0.
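A small sketch of this LP using the lpSolve package (the package choice and toy problem sizes are mine, not from the notes); the decision vector is (x+, x−), which lpSolve constrains to be nonnegative by default:
library(lpSolve)
set.seed(1)
n <- 200; m <- 60; s <- 5
x0 <- numeric(n); x0[sample(n, s)] <- rnorm(s, sd = 3)   # sparse true signal
A <- matrix(rnorm(m * n), m, n)                          # Gaussian measurement matrix
y <- drop(A %*% x0)
sol <- lp("min", rep(1, 2 * n), cbind(A, -A), rep("=", m), y)
xhat <- sol$solution[1:n] - sol$solution[(n + 1):(2 * n)]
max(abs(xhat - x0))                                      # should be essentially zero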
• Example. Quantile regression (HW4). In linear regression, we model the mean of
response variable as a function of covariates. In many situations, the error variance is
not constant, the distribution of y may be asymmetric, or we simply care about the
quantile(s) of response variable. Quantile regression offers a better modeling tool in
these applications.
In τ-quantile regression, we minimize the loss function
f(β) = Σ_{i=1}^n ρ_τ(y_i − x_i^T β),
where ρ_τ(r) = r(τ − 1_{r<0}) is the check loss. This is equivalent to the LP
minimize τ 1^T r+ + (1 − τ) 1^T r−
subject to r+ − r− = y − Xβ
r+ ⪰ 0, r− ⪰ 0
in r+, r−, and β (see the sketch below, after the ℓ∞ formulation).
In particular, ℓ1 (least absolute deviation) regression is the LP
minimize 1^T (r+ + r−)
subject to r+ − r− = y − Xβ
r+ ⪰ 0, r− ⪰ 0
in r+, r−, and β.
Remark: ℓ1 regression = MAD = 1/2-quantile regression.
Similarly, ℓ∞ (Chebyshev) regression, minimize max_i |y_i − x_i^T β|, is equivalent to the LP
minimize t
subject to −t ≤ y_i − x_i^T β ≤ t, i = 1, . . . , n
in variables β and t.
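A minimal lpSolve sketch of the τ-quantile LP above, splitting β = β+ − β− so that all decision variables are nonnegative (the package and the data set are my own illustrative choices):
library(lpSolve)
rq_lp <- function(X, y, tau) {
  n <- nrow(X); p <- ncol(X)
  obj <- c(rep(0, 2 * p), rep(tau, n), rep(1 - tau, n))  # objective on (b+, b-, r+, r-)
  con <- cbind(X, -X, diag(n), -diag(n))                 # X(b+ - b-) + r+ - r- = y
  sol <- lp("min", obj, con, rep("=", n), y)
  sol$solution[1:p] - sol$solution[(p + 1):(2 * p)]      # beta = b+ - b-
}
X <- cbind(1, as.matrix(stackloss[, 1:3]))
rq_lp(X, stackloss$stack.loss, 0.5)   # median regression; compare with quantreg::rq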
• Example: Dantzig selector (HW4). Candès and Tao (2007) propose a variable selection
method called the Dantzig selector that solves
minimize ‖X^T(y − Xβ)‖_∞
subject to Σ_{j=2}^p |β_j| ≤ t,
which can be transformed to an LP. Indeed, they name the method after George Dantzig, who invented the simplex method for efficiently solving LPs in the 1950s.
121
Remark: apparently any combination of the above piecewise-linear losses with an ℓ_1-type penalty or constraint leads to an LP as well.
• Example: 1-norm SVM (HW4). In two-class classification problems, we are given train-
ing data (xi , yi ), i = 1, . . . , n, where xi ∈ Rp are feature vectors and yi ∈ {−1, 1} are
class labels. Zhu et al. (2004) propose the 1-norm support vector machine (svm) that
achieves the dual purpose of classification and feature selection. Denote the solution
of the optimization problem
    minimize    Σ_{i=1}^n [ 1 − y_i (β_0 + Σ_{j=1}^p x_{ij} β_j) ]_+
    subject to  ‖β‖_1 = Σ_{j=1}^p |β_j| ≤ t
by β̂_0(t) and β̂(t). The 1-norm svm classifies a future feature vector x by the sign of the fitted
model f̂(x) = β̂_0(t) + x^T β̂(t).
• Many more applications: airport scheduling (Copenhagen airport uses Gurobi), airline
flight scheduling, NFL scheduling, match.com, LaTeX, ...
122
14 Lecture 14, Mar 4
Announcements
• HW4 (LP) deadline extended to Mon, Mar 16 @ 11:59PM.
Last Time
• LP (linear programming).
Today
• QP (quadratic programming).
More LP
• In the worst-k error regression (HW3), we minimize Σ_{i=1}^k |r|_(i), where |r|_(1) ≥ |r|_(2) ≥
· · · ≥ |r|_(n) are the order statistics of the absolute residuals |r_i| = |y_i − x_i^T β|.
This can be solved by the LP
    minimize    kt + 1^T z
    subject to  −t1 − z ⪯ y − Xβ ⪯ t1 + z
                z ⪰ 0
in variables β ∈ R^p, t ∈ R, and z ∈ R^n.
QP (quadratic programming)
• A quadratic program (QP) has the form
    minimize    (1/2) x^T P x + q^T x + r
    subject to  Gx ⪯ h
                Ax = b,
123
where we require P ∈ S^n_+ (why?).
• Example. The least squares problem minimizes ky − Xβk22 , which obviously is a QP.
• Example. Least squares with linear constraints. For example, nonnegative least squares
(NNLS) minimizes ‖y − Xβ‖_2^2 subject to β ⪰ 0, which is again a QP.
• Example. Lasso regression (Tibshirani, 1996; Donoho and Johnstone, 1994) minimizes
the least squares loss with the ℓ_1 (lasso) penalty,
    minimize    (1/2) ‖y − β_0 1 − Xβ‖_2^2 + λ ‖β‖_1,
where λ ≥ 0 is a tuning parameter. Writing β = β^+ − β^− (and profiling out the intercept β_0),
the equivalent QP is
    minimize    (1/2) (β^+ − β^−)^T X^T (I − 11^T/n) X (β^+ − β^−)
                − y^T (I − 11^T/n) X (β^+ − β^−) + λ 1^T (β^+ + β^−)
    subject to  β^+ ⪰ 0, β^− ⪰ 0
in β^+ and β^−.
124
• Example: Elastic net (Zou and Hastie, 2005)
    minimize    (1/2) ‖y − β_0 1 − Xβ‖_2^2 + λ ( α‖β‖_1 + (1 − α)‖β‖_2^2 ),
where λ ≥ 0 and α ∈ [0, 1] are tuning parameters.
125
• Example: The Huber loss function
    φ(r) = r^2 for |r| ≤ M,   and   φ(r) = M(2|r| − M) for |r| > M,
can be transformed to a QP
    minimize    u^T u + 2M 1^T v
    subject to  −u − v ⪯ y − Xβ ⪯ u + v
                0 ⪯ u ⪯ M1, v ⪰ 0
in variables β, u, and v.
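A quick sketch (not from the notes) of the Huber loss in Julia, checking the quadratic and
linear regimes; the function name huber is mine and M = 1.345 is just a common choice.
    huber(r, M) = abs(r) <= M ? r^2 : M * (2abs(r) - M)

    M = 1.345
    @assert huber(0.5, M) == 0.25             # quadratic region: r^2
    @assert huber(3.0, M) == M * (2*3.0 - M)  # linear region, grows like 2M|r|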
126
Second-order cone programming (SOCP)
• A second-order cone program (SOCP)
    minimize    f^T x
    subject to  ‖A_i x + b_i‖_2 ≤ c_i^T x + d_i, i = 1, . . . , m
                Fx = g
over x ∈ R^n. This says the points (A_i x + b_i, c_i^T x + d_i) live in the second order cone
(ice cream cone, Lorentz cone, quadratic cone)
    Q^{n+1} = {(x, t) : ‖x‖_2 ≤ t}
in R^{n+1}.
Remark: QP is a special case of SOCP. Why?
• Example. Group lasso (Yuan and Lin, 2006) minimizes
    (1/2) ‖y − β_0 1 − Xβ‖_2^2 + λ Σ_{g=1}^G w_g ‖β_g‖_2,
where β_g is the subvector of regression coefficients for group g, and w_g are fixed group
weights. This is equivalent to the SOCP
    minimize    (1/2) β^T X^T (I − 11^T/n) X β − y^T (I − 11^T/n) X β + λ Σ_{g=1}^G w_g t_g
    subject to  ‖β_g‖_2 ≤ t_g, g = 1, . . . , G,
in variables β and t_1, . . . , t_G.
Remark: overlapping groups are allowed here.
127
• Example. Sparse group lasso
    minimize    (1/2) ‖y − β_0 1 − Xβ‖_2^2 + λ_1 ‖β‖_1 + λ_2 Σ_{g=1}^G w_g ‖β_g‖_2
achieves sparsity at both the group and individual coefficient level and can be solved by
SOCP as well.
Remark: apparently we can combine any of the previous loss functions (quantile, ℓ_1, composite
quantile, Huber, multi-response models) with a group or sparse group penalty and still solve
the problem by SOCP.
128
15 Lecture 15, Mar 16
Announcements
• HW4 (LP) due today 11:59PM.
Last Time
• QP (quadratic programming).
Today
• SOCP (cont’d).
SOCP (cont’d)
• Example. Square-root lasso (Belloni et al., 2011) minimizes
    ‖y − β_0 1 − Xβ‖_2 + λ‖β‖_1
by SOCP. This variant generates the same solution path as lasso (why?) but simplifies
the choice of λ.
A demo example: http://hua-zhou.github.io/teaching/st790-2015spr/demo_lasso.
html
• The rotated second order cone (rotated quadratic cone) in R^{n+2} is
    Q_r^{n+2} = {(x, t_1, t_2) : ‖x‖_2^2 ≤ 2 t_1 t_2, t_1 ≥ 0, t_2 ≥ 0}.
A point (x, t_1, t_2) belongs to Q_r^{n+2} if and only if
    [ I_n   0       0
      0    −1/√2   1/√2
      0     1/√2   1/√2 ] (x, t_1, t_2)
belongs to the second order cone Q^{n+2}, i.e., ‖(x, (t_2 − t_1)/√2)‖_2 ≤ (t_1 + t_2)/√2.
129
Remark: Gurobi allows users to input second order cone constraints and quadratic constraints
directly.
Remark: Mosek allows users to input second order cone constraints, quadratic constraints,
and rotated quadratic cone constraints directly.
• A convex quadratic constraint, with P = F^T F for some F ∈ R^{k×n}, can be written using a
rotated quadratic cone:
    (1/2) x^T P x + c^T x + r ≤ 0
    ⇔  x^T P x ≤ 2t,  t + c^T x + r = 0
    ⇔  (Fx, t, 1) ∈ Q_r^{k+2},  t + c^T x + r = 0.
Similarly,
    ‖F(x − c)‖_2 ≤ t  ⇔  (y, t) ∈ Q^{n+1},  y = F(x − c).
– (Geometric mean) The hypograph of the geometric mean can be represented by rotated
quadratic cones. See (Lobo et al., 1998) for the derivation. For example,
    K_2^{gm} = {(x_1, x_2, t) : √(x_1 x_2) ≥ t, x_1, x_2 ≥ 0}
             = {(x_1, x_2, t) : (√2 t, x_1, x_2) ∈ Q_r^3}.
130
– (Harmonic mean) The hypograph of the harmonic mean function (n^{−1} Σ_{i=1}^n x_i^{−1})^{−1}
can be represented by rotated quadratic cones:
    (n^{−1} Σ_{i=1}^n x_i^{−1})^{−1} ≥ t, x ⪰ 0
    ⇔  n^{−1} Σ_{i=1}^n x_i^{−1} ≤ y, x ⪰ 0   (with y = 1/t)
    ⇔  x_i z_i ≥ 1, 1^T z = ny, x ⪰ 0
    ⇔  2 x_i z_i ≥ 2, 1^T z = ny, x ⪰ 0, z ⪰ 0
    ⇔  (√2, x_i, z_i) ∈ Q_r^3, 1^T z = ny, x ⪰ 0, z ⪰ 0.
Remark: z_1 = · · · = z_{s_1} = x_1, z_{s_1+1} = · · · = z_{s_2} = x_2, . . . , z_{s_{n−1}+1} = · · · = z_{s_n} = x_n.
References for the above examples: the papers (Lobo et al., 1998; Alizadeh and Goldfarb,
2003) and the book (Ben-Tal and Nemirovski, 2001, Lecture 3). Now our catalogue of
SOCP-representable terms includes all of the above.
Remark: most of these functions are implemented as built-in functions in the convex
optimization modeling language cvx.
131
• Example. ℓ_p regression with p ≥ 1 a rational number,
    minimize    ‖y − Xβ‖_p,
can be formulated as an SOCP. Why? For instance, ℓ_{3/2} regression combines advantages
of both robust ℓ_1 regression and least squares.
Remark: norm(x, p) is a built-in function in the convex optimization modeling language
cvx.
132
16 Lecture 16, Mar 18
Announcements
• HW5 (QP, SOCP) due this Fri, Mar 20 @ 11:59PM.
Last Time
• SOCP (cont’d).
Today
• SDP (semidefinite programming).
• GP (geometric programming).
• A semidefinite program (SDP) has the form
    minimize    c^T x
    subject to  x_1 F_1 + · · · + x_n F_n + G ⪯ 0   (LMI, linear matrix inequality)
                Ax = b,
133
where G, F_1, . . . , F_n ∈ S^k, A ∈ R^{p×n}, and b ∈ R^p.
Remark: when G, F_1, . . . , F_n are all diagonal, the SDP reduces to an LP.
• The standard form SDP is
    minimize    tr(CX)
    subject to  tr(A_i X) = b_i, i = 1, . . . , p
                X ⪰ 0,
where C, A_1, . . . , A_p ∈ S^n.
• An inequality form SDP is
    minimize    c^T x
    subject to  x_1 A_1 + · · · + x_n A_n ⪯ B,
• Example. Matrix nearness: find the matrix in a constraint set C closest to a given A,
    minimize    ‖A − X‖_F
    subject to  X ∈ C.
E.g., the nearest correlation matrix problem is
    minimize    t
    subject to  ‖A − X‖_F ≤ t
                X = X^T, diag(X) = 1
                X ⪰ 0.
134
• Eigenvalue problems. Suppose
    A(x) = A_0 + x_1 A_1 + · · · + x_n A_n,
where A_i ∈ S^m, and let λ_1(x) ≥ · · · ≥ λ_m(x) denote the eigenvalues of A(x).
– Minimizing the largest eigenvalue λ_1(x) is equivalent to the SDP
    minimize    t
    subject to  A(x) ⪯ tI
  in variables x ∈ R^n and t ∈ R.
  Remark: minimizing the sum of the k largest eigenvalues is an SDP too. How about
  minimizing the sum of all eigenvalues?
  Remark: maximizing the minimum eigenvalue is an SDP as well.
– Minimizing the spread of the eigenvalues, λ_1(x) − λ_m(x), is equivalent to the SDP
    minimize    t_1 − t_m
    subject to  t_m I ⪯ A(x) ⪯ t_1 I
  in variables x ∈ R^n and t_1, t_m ∈ R.
– Minimizing the spectral radius (or spectral norm) ρ(x) = max_{i=1,...,m} |λ_i(x)| is equivalent
  to the SDP
    minimize    t
    subject to  −tI ⪯ A(x) ⪯ tI
  in variables x ∈ R^n and t ∈ R.
– To minimize the condition number κ(x) = λ_1(x)/λ_m(x), note λ_1(x)/λ_m(x) ≤ t
  if and only if there exists a µ > 0 such that µI ⪯ A(x) ⪯ µtI, or equivalently,
  I ⪯ µ^{−1}A(x) ⪯ tI. With the change of variables y_i = x_i/µ and s = 1/µ, we can
  solve the SDP
    minimize    t
    subject to  I ⪯ sA_0 + y_1 A_1 + · · · + y_n A_n ⪯ tI
                s ≥ 0.
135
– Minimizing the ℓ_1 norm of the eigenvalues, |λ_1(x)| + · · · + |λ_m(x)|, is equivalent to
  an SDP as well.
– (Roots and powers of the determinant) For A(x) ⪰ 0 and a rational power q,
    t ≤ det(A(x))^q    ⇔  [ A(x)   Z
                            Z^T    diag(Z) ] ⪰ 0,  (z_{11} z_{22} · · · z_{mm})^q ≥ t,
    t ≥ det(A(x))^{−q}  ⇔  [ A(x)   Z
                            Z^T    diag(Z) ] ⪰ 0,  (z_{11} z_{22} · · · z_{mm})^{−q} ≤ t,
  for a lower triangular Z.
– Trace of inverse. tr A(x)^{−1} = Σ_{i=1}^m λ_i^{−1}(x) is a convex function and can be
  minimized using the SDP
    minimize    tr(B)
    subject to  [ B   I
                  I   A(x) ] ⪰ 0.
  Note tr A(x)^{−1} = Σ_{i=1}^m e_i^T A(x)^{−1} e_i. Therefore another equivalent formulation is
    minimize    Σ_{i=1}^m t_i
    subject to  e_i^T A(x)^{−1} e_i ≤ t_i.
136
136
Remark: see (Ben-Tal and Nemirovski, 2001, Lecture 4, p146-p151) for the proofs of the
above facts.
Remark: lambda_max, lambda_min, lambda_sum_largest, lambda_sum_smallest, det_rootn,
and trace_inv are implemented in cvx for Matlab.
Remark: lambda_max and lambda_min are implemented in the Convex.jl package for Julia.
137
17 Lecture 17, Mar 23
Announcements
• HW6 (SDP, GP, MIP) due next Mon, Mar 30 @ 11:59PM. http://hua-zhou.github.
io/teaching/st790-2015spr/ST790-2015-HW6.pdf
• Lecture pace too fast? For this course I put priority on diversity over thoroughness of
topics. The goal is to introduce a variety of tools that I consider useful but not covered in
the standard statistics curriculum. That means, given the time limitation, many details have
to be omitted. On the other hand, I have tried hard to point you to the best resources
I know of (textbooks, lecture videos, best software, ...) regarding these topics. It is your
responsibility to follow up, understand and do the homework problems, and internalize the
material so it becomes part of your own toolkit.
For the convex optimization part, the most important thing is to keep a catalog of
problems that can be solved by each problem class (LP, QP, SOCP, SDP, GP) and get
familiar with the good convex optimization tools for solving them.
• On course project:
– Ideally I hope you can come up with a project that benefits yourself. You've learnt a lot
of tools from this course. Do something with them that can turn into a manuscript,
a software package, or a blog post, and, most importantly, something that interests
you.
∗ Re-examine the computational issues in your research projects. Is that slow?
What’s the bottleneck? Would Rcpp or changing to another language like
Julia help? Is there an optimization problem there? Is that a convex prob-
lem? Can I do convex relaxation? Can I formulate the problem as a standard
problem class (LP, QP, ...)?
138
∗ Create new applications by trying different combinations of the terms in each
category. Say XXX loss + XXX penalty? Can they solve some problems
better (or faster) than current methods?
∗ Reverse engineering. Go over the examples and exercises in the textbook
(Boyd and Vandenberghe, 2004) and ask yourself “this is cool, can I apply
this to solve some statistical problems?”
∗ Do not worry about how to satisfy the instructor. Think about doing something
that benefits you in the long run. Be creative and do not be afraid that
your idea does not work. Even negative results are valuable; I appreciate
negative results as long as I see strong motivation and effort in them and
you provide some insight into why the method does not work as you thought.
And seriously, you should write a blog post for whatever negative results you get.
I think they have as much intellectual merit as published positive results.
“If your mentor handed you a sure-fire project, then it probably is dull.” (Kenneth
Lange)
Give a man a fish, he eats for a day. Teach him to fish, he will never go
hungry.
• Keeping up with technology: NVIDIA CUDA v7.0 was released last week. A new library,
cuSOLVER, provides a collection of dense and sparse direct solvers. https://developer.
nvidia.com/cusolver This potentially opens up a lot of GPU computing opportunities
for statistics.
Last Time
• SDP.
Today
• SDP (cont’d).
139
SDP (cont’d)
• Singular value problems. Let A(x) = A_0 + x_1 A_1 + · · · + x_n A_n, where A_i ∈ R^{p×q},
and let σ_1(x) ≥ · · · ≥ σ_{min{p,q}}(x) ≥ 0 be the ordered singular values.
– Spectral norm (or operator norm or matrix-2 norm) minimization. Consider minimizing
  the spectral norm ‖A(x)‖_2 = σ_1(x). Note ‖A‖_2 ≤ t if and only if A^T A ⪯ t^2 I
  (and t ≥ 0), if and only if
    [ tI    A
      A^T   tI ] ⪰ 0.
  This results in the SDP
    minimize    t
    subject to  [ tI        A(x)
                  A(x)^T    tI   ] ⪰ 0
  in variables x ∈ R^n and t ∈ R.
  Remark: minimizing the sum of the k largest singular values is an SDP as well.
– Nuclear norm minimization. Minimization of the nuclear norm (or trace norm)
  ‖A(x)‖_* = Σ_i σ_i(x) can be formulated as an SDP.
  Argument 1: the singular values of A coincide with the eigenvalues of the symmetric
  matrix
    [ 0     A
      A^T   0 ],
  which has eigenvalues (σ_1, . . . , σ_p, −σ_p, . . . , −σ_1). Therefore minimizing the nuclear
  norm of A is the same as minimizing the ℓ_1 norm of the eigenvalues of the augmented
  matrix, which we know is an SDP.
  Argument 2: an alternative characterization of the nuclear norm is ‖A‖_* = sup_{‖Z‖_2 ≤ 1} tr(A^T Z).
  That is, ‖A‖_* is the optimal value of
    maximize    tr(A^T Z)
    subject to  [ I   Z^T
                  Z   I   ] ⪰ 0.
140
  Therefore the epigraph of the nuclear norm can be represented by an LMI:
    ‖A(x)‖_* ≤ t  ⇔  [ U        A(x)^T
                        A(x)     V      ] ⪰ 0,  tr(U + V)/2 ≤ t.
  Argument 3: see (Ben-Tal and Nemirovski, 2001, Proposition 4.2.2, p154).
Remark: see (Ben-Tal and Nemirovski, 2001, Lecture 4, p151-p154) for the proofs of the
above facts.
Remark: sigma_max and norm_nuc are implemented in cvx for Matlab.
Remark: operator norm and nuclear norm atoms are implemented in the Convex.jl package
for Julia.
A(x) = A0 + x1 A1 + · · · + xn An
B(y) = B0 + y1 B1 + · · · + yr Br .
Then, by the Schur complement lemma, the quadratic matrix inequality
    F(X) ⪯ Y, where F(X) = E + CXD + (CXD)^T + (AXB)^T (AXB),
    ⇔  [ I      (AXB)^T
         AXB    Y − E − CXD − (CXD)^T ] ⪰ 0.
Remark: matrix_frac() is implemented in both cvx for Matlab and the Convex.jl package
for Julia.
141
• Another matrix inequality:
    X ≻ 0, Y ⪯ (C^T X^{−1} C)^{−1}  ⇔  Y ⪯ Z, Z ⪰ 0, X ⪰ CZC^T.
The cone
R
+ }.
References: paper (Nesterov, 2000) and the book (Ben-Tal and Nemirovski, 2001,
Lecture 4, p157-p159).
• Example. Fitting a polynomial
    f(t) = x_0 + x_1 t + x_2 t^2 + · · · + x_n t^n
142
to a set of measurements (t_i, y_i), i = 1, . . . , m, such that f(t_i) ≈ y_i. Define the
Vandermonde matrix
    A = [ 1   t_1   t_1^2   · · ·   t_1^n
          1   t_2   t_2^2   · · ·   t_2^n
          ⋮    ⋮     ⋮              ⋮
          1   t_m   t_m^2   · · ·   t_m^n ],
then we wish Ax ≈ y. Using the least squares criterion, we obtain the optimal solution
x_LS = (A^T A)^{−1} A^T y. With various constraints, it is possible to find the optimal x by
SDP.
Remark: e.g., convexity of f on [a, b] requires f''(t) = 2x_2 + 6x_3 t + · · · + (n − 1)n x_n t^{n−2} ≥ 0
on [a, b], i.e., (2x_2, 6x_3, . . . , (n − 1)n x_n) ∈ K_{a,b}, the cone of coefficient vectors of polynomials
nonnegative on [a, b].
Remark: nonneg_poly_coeffs() and convex_poly_coeffs() are implemented in cvx. Not
in Convex.jl yet.
143
18 Lecture 18, Mar 25
Announcements
• HW6 (SDP, GP, MIP) due next Mon, Mar 30 @ 11:59PM. http://hua-zhou.github.
io/teaching/st790-2015spr/ST790-2015-HW6.pdf
• The teaching server is reserved for teaching purpose. Please do not run and store
your research stuff on it. Each ST790-003 homework problem should take no longer
than a few minutes. Most of them take only a couple seconds.
Last Time
• SDP (cont’d).
Today
• SDP (cont’d).
• GP (geometric programming).
SDP (cont’d)
• Example. Nonparametric density estimation by polynomials. See notes.
• SDP relaxation of binary linear programming. Consider a binary LP
    minimize    c^T x
    subject to  Ax = b,  x ∈ {0, 1}^n.
Note
    x ∈ {0, 1}^n  ⇔  X = xx^T, diag(X) = x.
144
By relaxing the rank 1 constraint on X, we obtain an SDP relaxation
    minimize    c^T x
    subject to  Ax = b,  diag(X) = x,  X ⪰ xx^T,
which can be efficiently solved and provides a lower bound to the original problem.
If the optimal X has rank 1, then it is a solution to the original binary problem
also. Note X ⪰ xx^T is equivalent to the LMI
    [ 1   x^T
      x   X   ] ⪰ 0.
We can tighten the relaxation by adding other constraints that cut away part of
the feasible set, without excluding rank 1 solutions. For instance, 0 ≤ x_i ≤ 1 and
0 ≤ X_{ij} ≤ 1.
– SDP relaxation of Boolean optimization. For Boolean constraints x ∈ {−1, 1}^n,
we note
    x ∈ {−1, 1}^n  ⇔  X = xx^T, diag(X) = 1.
References: Paper (Laurent and Rendl, 2005) and book (Ben-Tal and Nemirovski,
2001, Lecture 4.3).
• A monomial is a function f(x) = c x_1^{a_1} x_2^{a_2} · · · x_n^{a_n} with c > 0, defined for x ≻ 0.
A sum of monomials,
    f(x) = Σ_{k=1}^K c_k x_1^{a_{1k}} x_2^{a_{2k}} · · · x_n^{a_{nk}},  c_k > 0,
is called a posynomial.
• A geometric program (GP) has the form
    minimize    f_0(x)
    subject to  f_i(x) ≤ 1, i = 1, . . . , m
                h_i(x) = 1, i = 1, . . . , p
145
where f_0, . . . , f_m are posynomials and h_1, . . . , h_p are monomials. The constraint x ≻ 0
is implicit.
Remark: is GP a convex optimization problem?
• With the change of variables y_i = ln x_i and b = ln c, a monomial f(x) = c x_1^{a_1} · · · x_n^{a_n}
can be written as
    f(x) = f(e^{y_1}, . . . , e^{y_n}) = c (e^{y_1})^{a_1} · · · (e^{y_n})^{a_n} = e^{a^T y + b},
where a_{ik}, g_i ∈ R^n below. Taking logs of both the objective and constraint functions, we
obtain the geometric program in convex form
    minimize    ln ( Σ_{k=1}^{K_0} e^{a_{0k}^T y + b_{0k}} )
    subject to  ln ( Σ_{k=1}^{K_i} e^{a_{ik}^T y + b_{ik}} ) ≤ 0, i = 1, . . . , m
                g_i^T y + h_i = 0, i = 1, . . . , p.
R Mosek is capable of solving GP. cvx has a GP mode that recognizes and transforms
GP problems.
146
• Example. Logistic regression as GP. Given data (x_i, y_i), i = 1, . . . , n, where y_i ∈ {0, 1}
and x_i ∈ R^p, the likelihood of the logistic regression model is
    Π_{i=1}^n p_i^{y_i} (1 − p_i)^{1−y_i}
  = Π_{i=1}^n ( e^{x_i^T β} / (1 + e^{x_i^T β}) )^{y_i} ( 1 / (1 + e^{x_i^T β}) )^{1−y_i}
  = Π_{i: y_i = 1} e^{x_i^T β}  Π_{i=1}^n 1 / (1 + e^{x_i^T β}).
This leads to a GP
    minimize    Π_{i: y_i = 1} s_i  Π_{i=1}^n t_i
    subject to  Π_{j=1}^p z_j^{−x_{ij}} ≤ s_i, i = 1, . . . , m
                1 + Π_{j=1}^p z_j^{x_{ij}} ≤ t_i, i = 1, . . . , n,
where z_j = e^{β_j} and the first constraint is imposed for the m observations with
y_i = 1.
Remark: how to incorporate the lasso penalty? Let z_j^+ = e^{β_j^+}, z_j^− = e^{β_j^−}. The lasso penalty
then takes the form e^{λ|β_j|} = (z_j^+ z_j^−)^λ.
• Example. Bradley-Terry model for sports ranking. See ST758 HW8 http://hua-zhou.
github.io/teaching/st758-2014fall/ST758-2014-HW8.pdf. The likelihood is
    Π_{i,j} ( γ_i / (γ_i + γ_j) )^{y_ij}.
147
The MLE is obtained by the GP
    minimize    Π_{i,j} t_{ij}^{y_ij}
    subject to  1 + γ_i^{−1} γ_j ≤ t_{ij} for all i, j,
in variables γ and t.
148
19 Lecture 19, Mar 30
Announcements
• HW6 (SDP, GP, MIP) deadline extended to this Wed, Apr 1 @ 11:59PM. Some hints
if you use Convex.jl package in Julia for HW6:
– Q1(a): Convex.jl does not implement root determinant function but it imple-
ments the logdet function that you can use
– Q1(d): Convex.jl does not implement trace inv function but you can easily
formulate it as an SDP
– Q4(a): Convex.jl does not model GP (geometric program), but you can use
change of variable yi = ln xi and utilize the logsumexp function in Convex.jl
– Q4(b): Convex.jl does not have a log normcdf function but you can learn the
quadratic approximation trick from cvx https://github.com/cvxr/CVX/blob/
master/functions/%40cvx/log_normcdf.m
Last Time
• SDP (cont’d).
• GP (geometric programming).
Today
• Cone programming.
• Planned topics for the remainder of the course: algorithms for sparse and regularized
regressions, dynamic programming, EM/MM advanced topics: s.e., convergence and
acceleration, and online estimation.
149
A proper cone K defines a partial ordering on R^n via generalized inequalities: x ⪯_K y if
and only if y − x ∈ K, and x ≺_K y if and only if y − x ∈ int(K).
E.g., X ⪯ Y means Y − X ∈ S^n_+ and X ≺ Y means Y − X ∈ S^n_{++}.
• A cone program has the form
    minimize    c^T x
    subject to  Fx + g ⪯_K 0
                Ax = b.
The standard form is
    minimize    c^T x
    subject to  x ⪰_K 0
                Ax = b,
and the inequality form is
    minimize    c^T x
    subject to  Fx + g ⪯_K 0.
150
– Mosek implements up to SDP.
– SCS (a free solver accessible from Convex.jl) can deal with exponential cone programs.
– cvx uses a successive approximation strategy to deal with exponential-cone-representable
  terms, which relies only on SOCP.
  Remark: http://web.cvxr.com/cvx/doc/advanced.html#successive
  Remark: cvx implements log_det and log_sum_exp.
– Convex.jl accepts exponential-cone-representable terms, which it can solve using
  SCS.
  Remark: Convex.jl implements logsumexp, exp, log, entropy, and logistic loss.
See cvx example library for an example for logistic regression. http://cvxr.com/cvx/
examples/
See the link for an example using Julia. http://nbviewer.ipython.org/github/
JuliaOpt/Convex.jl/blob/master/examples/logistic_regression.ipynb
• Some solvers also accept separable convex problems of the form
    minimize    f(x) + c^T x
    subject to  l_i ≤ g_i(x) + a_i^T x ≤ u_i, i = 1, . . . , m
                l_x ⪯ x ⪯ u_x,
where f and the g_i must be separable (sums of univariate convex functions).
• The example
    minimize    x_1 − ln(x_1 + 2x_2)
    subject to  x_1^2 + x_2^2 ≤ 1
151
is not separable. But the equivalent formulation
    minimize    x_1 − ln(x_3)
    subject to  x_1^2 + x_2^2 ≤ 1,  x_1 + 2x_2 − x_3 = 0,  x_3 ≥ 0
is.
• This should cover a lot of statistical applications, but I have no experience with its
performance yet.
• Algorithms. Interior point method. (Boyd and Vandenberghe, 2004) Part III (Chapters
9-11).
• History:
R
• Current technology can solve small to moderately sized MILP and MIQP.
cvx allows binary and integer variables.
Convex.jl for Julia does not allow integer variables.
JuMP.jl for Julia allows binary and integer variables.
152
• Modeling using integer variables. References (Nemhauser and Wolsey, 1999; Williams,
2013).
– (Positivity) If 0 ≤ x < M for a known upper bound M , then we can model the
implication (x > 0) → (z = 1) by linear inequality x ≤ M z, where z ∈ {0, 1}.
Similarly if 0 < m ≤ x for a known lower bound m. Then we can model the
implication (z = 1) → (x > 0) by the linear inequality x ≥ mz, where z ∈ {0, 1}.
– (Semi-continuity) We can model semi-continuity of a variable x ∈ R, x ∈ 0 ∪ [a, b]
where 0 < a ≤ b using a double inequality az ≤ x ≤ bz where z ∈ {0, 1}.
– (Constraint satisfaction) Suppose we know an upper bound M on a^T x − b. Then
  the implication (z = 1) → (a^T x ≤ b) can be modeled as
    a^T x ≤ b + M(1 − z).
  With a lower bound m on a^T x − b and a small tolerance ε > 0, the reverse implication
  (a^T x ≤ b) → (z = 1) is modeled as
    a^T x ≥ b + (m − ε)z + ε.
  Combining the two models the equivalence (z = 1) ↔ (a^T x ≤ b):
    a^T x ≤ b + M(1 − z),  a^T x ≥ b + (m − ε)z + ε.
  Similarly one can model (z = 1) ↔ (a^T x ≥ b) by
    a^T x ≥ b + m(1 − z),  a^T x ≤ b + (M + ε)z − ε,
  (z = 1) → (a^T x = b) by
    a^T x ≤ b + M(1 − z),  a^T x ≥ b + m(1 − z),
  and (z = 0) → (a^T x ≠ b) by
    a^T x ≥ b + (m − ε)z_1 + ε,  a^T x ≤ b + (M + ε)z_2 − ε,  z_1 + z_2 − z ≤ 1,
  where z_1, z_2 ∈ {0, 1}.
153
– (Disjunctive constraints) The requirement that at least one out of a set of constraints
  is satisfied, (z = 1) → (a_1^T x ≤ b_1) ∨ (a_2^T x ≤ b_2) ∨ · · · ∨ (a_k^T x ≤ b_k), can be
  modeled by
    a_j^T x ≤ b_j + M(1 − z_j), j = 1, . . . , k,   Σ_{j=1}^k z_j ≥ z,
  where z_j ∈ {0, 1} are binary variables and M > a_j^T x − b_j for all j is a collective
  upper bound.
  The reverse implication (a_1^T x ≤ b_1) ∨ (a_2^T x ≤ b_2) ∨ · · · ∨ (a_k^T x ≤ b_k) → (z = 1) is
  modeled as
    a_j^T x ≥ b_j + (m − ε)z + ε, j = 1, . . . , k,
  with a lower bound m < a_j^T x − b_j for all j and z ∈ {0, 1}.
– (Pack constraints) The requirement that at most one of the constraints is satisfied is
  modeled by coupling each constraint to an indicator z_j as above and requiring Σ_{j=1}^k z_j ≤ 1.
– (Partition constraints) The requirement that exactly one of the constraints is satisfied
  is modeled analogously with Σ_{j=1}^k z_j = 1.
– Boolean primitives.
∗ Complement
x ¬x
0 1
1 0
is modeled as ¬x = 1 − x.
∗ Conjunction
x y x∧y
0 0 0
1 0 0
0 1 0
1 1 1
z = (x ∧ y) is modeled as z + 1 ≥ x + y, x ≥ z, y ≥ z.
∗ Disjunction
154
x y x∨y
0 0 0
1 0 1
0 1 1
1 1 1
is modeled as x + y ≥ 1.
∗ Implication
x y x→y
0 0 1
1 0 0
0 1 1
1 1 1
is modeled as x ≤ y.
– Special ordered set constraint: SOS1 and SOS2. See (Williams, 2013, Section 9.3)
or (Bertsimas and Weismantel, 2005).
An SOS1 constraint is a set of variables for which at most one variable in the
set may take a value other than zero. An SOS2 constraint is an ordered set of
variables where at most two variables in the set may take non-zero values. If two
take non-zero values, they must be contiguous in the ordered set.
Remark: the Gurobi solver allows SOS1 and SOS2 constraints. The JuMP.jl modeling tool for
Julia accepts SOS1 and SOS2 constraints and passes them to solvers that support
them. cvx and Convex.jl do not take SOS constraints.
• Example. Best subset regression solves
    minimize    ‖y − β_0 1 − Xβ‖_2^2
    subject to  ‖β‖_0 ≤ k.
Introducing binary variables z_j ∈ {0, 1} such that (β_j ≠ 0) → (z_j = 1), it can be
formulated as a MIQP
    minimize    ‖y − β_0 1 − Xβ‖_2^2
    subject to  −M z_j ≤ β_j ≤ M z_j, j = 1, . . . , p
                Σ_{j=1}^p z_j ≤ k,
155
where M is a known bound on |β_j|.
Remark: we should be able to do best subset XXX for all problems in HW4/5 by formulating
a corresponding MILP, MIQP or MISOCP.
3. Each column of the 2D grid contains each digit from 1 to 9 exactly once, so
   Σ_{i=1}^9 x_{ijk} = 1 for each j and k.
4. The major 3-by-3 grids have a similar property, so Σ_{i=1}^3 Σ_{j=1}^3 x_{i+U, j+V, k} = 1,
   where U, V ∈ {0, 3, 6}.
5. Observed entries prescribe x_{ijm} = 1 if the (i, j)-th entry is the integer m.
Julia: http://nbviewer.ipython.org/github/JuliaOpt/juliaopt-notebooks/blob/
master/notebooks/JuMP-Sudoku.ipynb
Matlab: http://www.mathworks.com/help/optim/ug/solve-sudoku-puzzles-via-integer-prog
html
156
20 Lecture 20, Apr 1
Announcements
• HW6 (SDP, GP, MIP) due today @ 11:59PM. Don’t forget git tag your submission.
Last Time
• Cone programming.
Today
• Sparse regression.
– Shrinkage
– Model selection
– Computational efficiency (convex optimization)
• Why shrinkage? Idea of shrinkage dates back to one of the most surprising results in
mathematical statistics in the 20th century. Let’s consider the simple task of estimating
population mean(s).
157
– How to estimate hog weight in Montana?
– How to estimate hog weight in Montana and tea consumption in China?
– How to estimate hog weight in Montana, tea consumption in China, and speed of
light?
• Stein's paradox. Given a single observation y ∼ N(µ, I_m), should we estimate µ by
    µ̂_LS = y   or   µ̂_JS = ( 1 − (m − 2)/‖y‖_2^2 ) y ?
The James-Stein shrinkage estimator µ̂_JS dominates the least squares estimate µ̂_LS
when the number of populations m ≥ 3!
– Stein (1956) showed the inadmissibility of µ̂_LS; his student James and he
later proposed the specific form of µ̂_JS in (James and Stein, 1961).
– Message: when estimating many parameters, shrinkage helps improve the risk property,
even when the parameters are totally unrelated to each other.
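A small simulation sketch (not from the notes) in Julia comparing the empirical risks of µ̂_LS
and µ̂_JS; the function name js_vs_ls, the replication count, and the toy mean vector are mine.
    # compare empirical risks of muLS = y and muJS = (1 - (m-2)/||y||^2) y for y ~ N(mu, I_m)
    function js_vs_ls(mu::Vector{Float64}; nrep = 10_000)
        m = length(mu)
        risk_ls = risk_js = 0.0
        for _ in 1:nrep
            y = mu .+ randn(m)
            muJS = (1 - (m - 2) / sum(abs2, y)) .* y
            risk_ls += sum(abs2, y .- mu)
            risk_js += sum(abs2, muJS .- mu)
        end
        return risk_ls / nrep, risk_js / nrep    # LS risk is m; JS risk should be smaller
    end

    js_vs_ls([5.0, -3.0, 0.0, 2.0, 1.0])         # m = 5 unrelated means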
158
• Efron’s famous baseball example (Efron and Morris, 1977).
MLE empirical risk: 0.076. James-Stein (shrinkage towards average) empirical risk:
0.021
• Stein’s effect is universal and underlies many modern statistical learning methods
159
• Why m ≥ 3? Connection with transience/recurrence of Markov chains (Brown, 1971;
Eaton, 1992)
“A drunk man will eventually find his way home but a drunk bird may get lost forever.”
(Kakutani at a UCLA colloquium talk)
• Now we see the benefits of shrinkage. Lasso has the added benefit of model selection.
The left plot shows the ridge solution path for the prostate cancer data in HW4/5/6;
the right plot shows the lasso solution path on the same data set. We see both ridge
and lasso shrink β̂, but lasso has the extra benefit of performing variable selection.
160
– P : the penalty function
– λ: penalty tuning parameter
– η: index a penalty family
161
• SCAD (Fan and Li, 2001),
    P_η(|w|, λ) = λ|w|,                                                  |w| < λ
                = λ^2 + ηλ(|w| − λ)/(η − 1) − (w^2 − λ^2)/(2(η − 1)),     |w| ∈ [λ, ηλ]
                = λ^2(η + 1)/2,                                           |w| > ηλ.
– For small signals |w| < λ, it acts as lasso; for large signals |w| > ηλ, the penalty
flattens and leads to unbiasedness of the regularized estimate.
162
• MC+ penalty (Zhang, 2010),
    P_η(|w|, λ) = ( λ|w| − w^2/(2η) ) 1_{|w| < λη} + ( λ^2 η / 2 ) 1_{|w| ≥ λη},   η > 0,
is quadratic on [0, λη] and flattens beyond λη. Varying η from 0 to ∞ bridges hard
thresholding (ℓ_0 regression) to lasso (ℓ_1) shrinkage.
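A sketch (not from the notes) of the two penalty functions in Julia; the names scad and
mcplus are mine, and the SCAD middle piece is written in an equivalent algebraic form.
    function scad(w, lambda, eta)
        a = abs(w)
        if a < lambda
            return lambda * a
        elseif a <= eta * lambda
            # equivalent to lambda^2 + eta*lambda*(a-lambda)/(eta-1) - (a^2-lambda^2)/(2(eta-1))
            return -(a^2 - 2eta * lambda * a + lambda^2) / (2(eta - 1))
        else
            return lambda^2 * (eta + 1) / 2
        end
    end

    mcplus(w, lambda, eta) = abs(w) < lambda * eta ?
        lambda * abs(w) - w^2 / (2eta) : lambda^2 * eta / 2

    # both act like the lasso penalty lambda*|w| near zero and flatten for large |w|
    scad(0.1, 1.0, 3.7), mcplus(0.1, 1.0, 3.0)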
• We have seen many examples where convex optimization software applies. For a convex
loss f and convex penalty P, write β_j = β_j^+ − β_j^−, where β_j^+ = max{β_j, 0} and
β_j^− = −min{β_j, 0}. Then we minimize the objective
    f(β^+ − β^−) + Σ_{j=1}^p P_η(β_j^+ + β_j^−, λ)
subject to the nonnegativity constraints β_j^+, β_j^− ≥ 0 using a convex optimization solver.
Remark: to guarantee that β_j^+ and β_j^− are the positive and negative parts of β_j, we would
also need the (non-convex) constraint β_j^+ β_j^− = 0. This condition can be dispensed with in
sparse regression because the penalty function is increasing in (β_j^+ + β_j^−), so the
solution always sets β_j^+ or β_j^− to 0.
163
21 Lecture 21, Apr 6
Announcements
• HW6 solution sketch posted: http://hua-zhou.github.io/teaching/st790-2015spr/
hw06sol.html
Last Time
• Sparse regression: introduction.
Today
• Coordinate descent for sparse regression.
until the objective value converges. This is similar to the Gauss-Seidel method for solving linear
equations. Why does the objective value converge?
• Success stories
– Linear regression (Fu, 1998; Daubechies et al., 2004; Friedman et al., 2007; Wu
and Lange, 2008): GlmNet in R.
– GLM (Friedman et al., 2010): GlmNet in R.
– Non-convex penalties (Mazumder et al., 2011): SparseNet in R.
164
[Figure: contours of f(x, y) = (y − x^2)(y − 2x^2) together with the curve y = 1.4x^2.]
Answer: No.
– Q2: Same question, but for a convex, differentiable f .
Answer: No.
165
– Q4: Same question, but for h(x) = f(x) + Σ_j g_j(x_j), where f is convex and
differentiable and the g_j are convex but not necessarily differentiable.
Answer: Yes (the separable nonsmooth case).
– This justifies the CD algorithm for sparse regression of the form f(β) + Σ_{j=1}^p P_η(|β_j|, λ),
since the penalty part is separable in the coordinates β_j.
166
• Example. Lasso penalized linear regression:
    min_{β_0, β}  (1/2) Σ_{i=1}^n (y_i − β_0 − x_i^T β)^2 + λ Σ_{j=1}^p |β_j|.
– Update of the intercept β_0:
    β_0^{(t+1)} = (1/n) Σ_{i=1}^n (y_i − x_i^T β^{(t)})
               = (1/n) Σ_{i=1}^n (y_i − β_0^{(t)} − x_i^T β^{(t)} + β_0^{(t)})
               = β_0^{(t)} + (1/n) Σ_{i=1}^n r_i^{(t)}.
– Update of β_j:
    β_j^{(t+1)} = argmin_{β_j} (1/2) Σ_{i=1}^n [ y_i − β_0^{(t)} − x_i^T β^{(t)} − (β_j − β_j^{(t)}) x_{ij} ]^2 + λ|β_j|
               = argmin_{β_j} (1/2) Σ_{i=1}^n [ r_i^{(t)} − (β_j − β_j^{(t)}) x_{ij} ]^2 + λ|β_j|
               = argmin_{β_j} (x_{·j}^T x_{·j} / 2) ( β_j − β_j^{(t)} − x_{·j}^T r^{(t)} / x_{·j}^T x_{·j} )^2 + λ|β_j|
               = ST( β_j^{(t)} + x_{·j}^T r^{(t)} / x_{·j}^T x_{·j},  λ / x_{·j}^T x_{·j} ),
where
    ST(z, γ) = argmin_x (1/2)(x − z)^2 + γ|x| = sgn(z)(|z| − γ)_+
is the soft-thresholding operator.
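A compact sketch (not from the notes) of cyclic coordinate descent for this lasso objective in
Julia, using the intercept and soft-thresholding updates derived above; the function names
and the iteration cap are my own choices.
    using LinearAlgebra

    soft_threshold(z, g) = sign(z) * max(abs(z) - g, 0)

    function cd_lasso(X, y, lambda; maxiter = 200)
        n, p = size(X)
        b0, beta = 0.0, zeros(p)
        r = y .- b0 .- X * beta                 # current residuals
        xsq = vec(sum(abs2, X; dims = 1))       # x_j'x_j for each column
        for _ in 1:maxiter
            delta0 = sum(r) / n                 # intercept update
            b0 += delta0; r .-= delta0
            for j in 1:p                        # coordinate updates
                xj = view(X, :, j)
                bj_new = soft_threshold(beta[j] + dot(xj, r) / xsq[j], lambda / xsq[j])
                r .-= (bj_new - beta[j]) .* xj
                beta[j] = bj_new
            end
        end
        return b0, beta
    end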
167
• For lasso penalized GLMs (e.g., logistic regression) the coordinate update is no longer in
closed form. Two common strategies:
– Method 1: Use Newton's method to update coordinate β_j (Wu et al., 2009).
– Method 2 (IWLS): Each coordinate descent sweep is performed on the quadratic
  approximation
    (1/2) Σ_{i=1}^n w_i^{(t)} (z_i^{(t)} − x_i^T β)^2 + λ Σ_{j=1}^p |β_j|,
  where the w_i^{(t)} are the working weights and the z_i^{(t)} are the working responses
  (Friedman et al., 2010).
Remark: IWLS is more popular because it needs far fewer exponentiations.
• Remarks on CD.
– A naive implementation in an interpreted language loses efficiency due to extensive
  looping; a compiled core matters.
  Remark: what Trevor Hastie calls the FFT trick: Friedman + Fortran + some numerical
  Tricks = no wasted flops.
– Wide applicability of CD: `1 regression (Wu and Lange, 2008), svm (Platt, 1999),
group lasso (block CD), graphical lasso (Friedman et al., 2008), ...
168
22 Lecture 22, Apr 8
Last Time
• Coordinate descent for sparse regression.
Today
• Proximal gradient and accelerated proximal gradient method.
169
∗ Do not work for non-smooth problems.
– Remedies
∗ Slow convergence:
· conjugate gradient method
· quasi-Newton
· accelerated gradient method
∗ Non-differentiable or constrained problems:
· subgradient method
· proximal gradient method
· smoothing method
· cutting-plane methods
R
Proximal gradient method
A definite resource for learning about proximal algorithms is (Parikh and Boyd, 2013)
https://web.stanford.edu/~boyd/papers/prox_algs.html
• “Much like Newton’s method is a standard tool for solving unconstrained smooth min-
imization problems of modest size, proximal algorithms can be viewed as an analogous
tool for nonsmooth, constrained, large-scale, or distributed versions of these problems.”
• Definition: the proximal operator (prox-operator) of a closed convex function g is
    prox_g(x) = argmin_u { g(u) + (1/2) ‖u − x‖_2^2 }.
Intuitively prox_g(x) moves towards the minimum of g but not far away (proximal)
from the point x.
170
R
• Fact: For a closed convex g, proxg (x) exists and is unique for all x.
A function f (x) with domain Rn and range (−∞, ∞] is said to be closed (or lower
semicontinuous) if every sublevel set {x : f (x) ≤ c} is closed. Alternative definition
is f (x) ≤ lim inf m f (xm ) whenever limm xm = x. Another definition is the epigraph
{(x, y) ∈ Rn ×R : f (x) ≤ y} is closed. Examples of closed functions are all continuous
functions, matrix rank, and set indicators.
R
2
171
4. (Group lasso) g(x) = λ‖x‖_2: group soft-thresholding
    prox_g(x) = argmin_u { λ‖u‖_2 + (1/2)‖u − x‖_2^2 }
              = (1 − λ/‖x‖_2) x    if ‖x‖_2 ≥ λ,
              = 0                  otherwise.
Remark: when ‖x‖_2 < λ, the subgradient condition 0 ∈ λ∂‖u‖_2 + u − x holds at u = 0
(since ∂‖0‖_2 is the unit ball); therefore u* = 0 is the global minimum.
5. (Nuclear norm) g(X) = λ‖X‖_*: singular value soft-thresholding
    prox_g(X) = U diag((σ_i − λ)_+) V^T,
where X = U diag(σ) V^T is the SVD of X. See ST758 (2014 fall) lecture notes
p159 for the proof.
172
R
6. ...
• The proximal gradient method minimizes h(x) = f(x) + g(x),
where f is convex and differentiable and g is a closed convex function with an inexpensive
prox-operator, by iterating
    x^{(t+1)} = prox_{sg}( x^{(t)} − s ∇f(x^{(t)}) )
              = argmin_x { g(x) + (1/(2s)) ‖x − x^{(t)} + s∇f(x^{(t)})‖_2^2 }
              = argmin_x { g(x) + f(x^{(t)}) + ∇f(x^{(t)})^T (x − x^{(t)}) + (1/(2s)) ‖x − x^{(t)}‖_2^2 }.
Remark: here s is a constant step size or determined by line search.
Remark (interpretation): from the third line, we see x^{(t+1)} minimizes g(x) plus a simple
quadratic local model of f(x) around x^{(t)}.
Remark (interpretation): the function on the third line,
    h(x | x^{(t)}) := g(x) + f(x^{(t)}) + ∇f(x^{(t)})^T (x − x^{(t)}) + (1/(2s)) ‖x − x^{(t)}‖_2^2,
majorizes f(x) + g(x) at the current iterate x^{(t)} when s ≤ 1/L (why?). Therefore proximal
gradient is an MM algorithm as well.
Remark: when g is separable, the function to be minimized in each iteration separates in the
parameters.
Remark: when g is constant, the proximal gradient method reduces to the classical gradient
descent (or steepest descent) method. When g is the indicator function χ_C(x), the proximal
gradient method reduces to the projected gradient method.
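A sketch (not from the notes) of this iteration for the lasso in Julia, with fixed step
s = 1/L and L = λ_max(X^T X); the function name prox_grad_lasso and the iteration cap are mine.
    using LinearAlgebra

    soft_threshold(z, g) = sign.(z) .* max.(abs.(z) .- g, 0)

    function prox_grad_lasso(X, y, lambda; maxiter = 500)
        s = 1 / opnorm(X)^2                  # 1/L with L = lambda_max(X'X)
        beta = zeros(size(X, 2))
        for _ in 1:maxiter
            grad = X' * (X * beta - y)                           # gradient of smooth part
            beta = soft_threshold(beta - s * grad, s * lambda)   # prox of s*g
        end
        return beta
    end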
173
• Convergence of proximal gradient method.
– Assumptions
∗ f is convex and ∇f (x) is Lipschitz continuous with parameter L > 0
∗ g is a closed convex function (so that proxsg is well-defined)
∗ optimal value h∗ = inf x h(x) is finite and attained at x∗
– Theorem: with fixed step size s = 1/L,
    h(x^{(t)}) − h* ≤ L ‖x^{(0)} − x*‖_2^2 / (2t).
  Similar result for backtracking line search without knowing L.
– Same convergence rate as the classical gradient method for smooth functions:
  O(1/ε) steps to reach h(x^{(t)}) − h* ≤ ε.
– Q: Can the O(1/t) rate be improved?
174
23 Lecture 23, Apr 15
Announcements
• HW7 posted http://hua-zhou.github.io/teaching/st790-2015spr/ST790-2015-HW7.
pdf
• Typo in lecture notes p167 (CD for lasso penalized least squares).
Last Time
• Proximal gradient algorithm.
Today
• Accelerated proximal gradient method.
• History:
– Nesterov:
∗ Nesterov (1983): original acceleration method for smooth functions
∗ Nesterov (1988): second acceleration method for smooth functions
∗ Nesterov (2005): smoothing techniques for nonsmooth functions, coupled
with original acceleration method
∗ Nesterov (2007): acceleration for composite functions
– Beck and Teboulle (2009b): extension of Nesterov (1983) to composite functions
(FISTA).
– Tseng (2008): unified analysis of acceleration techniques (all of these, and more).
– Minimize
    h(x) = f(x) + g(x),
175
where f is convex and differentiable and g is convex with an inexpensive prox-operator.
– FISTA algorithm: choose any x^{(0)} = x^{(−1)}; for t ≥ 1, repeat
    y ← x^{(t−1)} + ((t − 2)/(t + 1)) (x^{(t−1)} − x^{(t−2)})   (extrapolation)
    x^{(t)} ← prox_{sg}( y − s∇f(y) )                           (prox. grad. step)
Remark: the step size s is fixed or determined by line search.
Remark (interpretation): the proximal gradient step is performed on the extrapolated point
y based on the previous two iterates.
Remark: physical interpretation of Nesterov acceleration? (Pointed out to me by Xiang
Zhang) http://cs231n.github.io/neural-networks-3/#sgd
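A sketch (not from the notes) of this FISTA recipe for the lasso in Julia; the function name
fista_lasso and the fixed step size 1/L are my choices.
    using LinearAlgebra

    soft_threshold(z, g) = sign.(z) .* max.(abs.(z) .- g, 0)

    function fista_lasso(X, y, lambda; maxiter = 500)
        p = size(X, 2)
        s = 1 / opnorm(X)^2                 # fixed step size 1/L
        beta_prev = zeros(p); beta = zeros(p)
        for t in 1:maxiter
            yk = beta + ((t - 2) / (t + 1)) * (beta - beta_prev)   # extrapolation
            grad = X' * (X * yk - y)
            beta_prev, beta = beta, soft_threshold(yk - s * grad, s * lambda)
        end
        return beta
    end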
• Convergence of FISTA.
– Assumptions
∗ f is convex and ∇f (x) is Lipschitz continuous with parameter L > 0
∗ g is closed convex (so that proxsg is well-defined)
∗ optimal value h∗ = inf x h(x) is finite and attained at x∗
– Theorem: with fixed step size s = 1/L,
    h(x^{(t)}) − h* ≤ L ‖x^{(0)} − x*‖_2^2 / (2(t + 1)^2).
Remark: similar result for backtracking line search.
Remark: need O(1/√ε) iterations to get h(x^{(t)}) − h* ≤ ε. To appreciate this acceleration:
to get close to the optimal value within ε = 10^{−4}, the proximal gradient method
requires up to 10^4 iterations, while the accelerated proximal gradient method requires
only about 100 iterations.
176
– First order method: any iterative algorithm that selects x^{(k)} in
  x^{(0)} + span{∇f(x^{(0)}), . . . , ∇f(x^{(k−1)})}.
– Theorem (Nesterov): for any first-order method there is a problem instance with
    f(x^{(t)}) − f* ≥ 3L ‖x^{(0)} − x*‖_2^2 / (32 (t + 1)^2).
Remark: this says O(1/t^2) is the best rate first-order methods can achieve.
– Nesterov’s accelerated gradient method achieves the optimal O(1/t2 ) rate among
all first-order methods!
– Similarly FISTA achieves the optimal O(1/t2 ) rate among all first-order methods
for minimizing composite function h(x) = f (x) + g(x). See (Beck and Teboulle,
2009b) for proof.
177
• Numerous applications of FISTA.
• Remarks.
178
all x iff the largest eigenvalue of Hessian is bounded above by L. For least
squares, we have L = λmax (X T X). For logistic regression, we have L =
0.25λmax (X T X).
∗ See (Beck and Teboulle, 2009b) for the line search strategy. Same 1/t2 con-
vergence rate.
t − 2 (t−1)
y ← x(t−1) + (x − x(t−2) ) (extrapolation)
t+1
Repeat (line search)
xtemp ← proxsg (y − s∇f (y))
s ← s/2
until h(xtemp ) ≤ h(xtemp |y)
x(t) ← xtemp
α(t−2) − 1 (t−1)
y ← x(t−1) + (x − x(t−2) ) (extrapolation)
α(t−1)
x(t) ← proxsg (y − s∇f (y)) (prox. grad. desc.)
p
1 + 1 + (2α(t−1) )2
α(t) ← .
2
See (Beck and Teboulle, 2009b). Same O(1/t2 ) convergence rate.
179
24 Lecture 24, Apr 20
Announcements
• HW7 due Tue, 4/21 @ 11:59PM.
Last Time
• Accelerated proximal gradient algorithm.
Today
• Path algorithm.
• ALM.
for all λ ≥ 0.
180
Observation: the solution path (in terms of λ) is piece-wise linear.
Observation: (1) The solution paths are piece-wise smooth for convex penalties, (2)
but may be discontinuous for non-convex penalties.
• How to derive a path algorithm? Consider sparse regression f(β) + Σ_{j=1}^p P_η(|β_j|, λ) with
a convex penalty P_η.
181
1. Write down the Karush-Kuhn-Tucker (KKT) condition for solution β(λ)
2. Apply the implicit function theorem to the first set of equations to derive the path
direction for active βj and determine when each of them hits zero.
3. Use the second set of equations to determine when a zero coefficient β_j becomes
non-zero.
Remark: recall that the subdifferential ∂f(x) of a convex function f(x) is the set of all
vectors g satisfying the supporting hyperplane inequality
    f(y) ≥ f(x) + g^T (y − x)   for all y.
For simplicity, we assume predictors and responses are centered, so we omit the intercept.
The stationarity condition (necessary and sufficient for a global minimum in this case) says
    0_p ∈ −X^T (y − Xβ) + λ ∂‖β‖_1.
Applying the implicit function theorem to the first set of equations yields the path-following
direction
    (d/dλ) β̂_A(λ) = −(X_A^T X_A)^{−1} sgn(β_A),
which effectively shows that the non-zero coefficients β̂_A(λ), and thus the subgradient vector
−X_{A^c}^T (y − X_A β̂_A(λ)), move linearly within a segment. The second set of equations
monitors the event that a zero coefficient becomes non-zero. Therefore for each β_j, j ∈ A, we
calculate when it (ever) hits 0, and for each β_j, j ∈ A^c, we calculate when it becomes
182
non-zero. Then the end of the current segment (or start of the next segment) is determined by
the event that happens soonest, at which point we update A and continue.
The computational cost per segment is O(|A|2 ). The number of segments is harder
to characterize though (Donoho and Tanner, 2010). Under certain conditions whole
(piece-wise linear) solution path is obtained at the cost of a regular least squares fit
(Efron et al., 2004).
• Example: Generalized lasso (Tibshirani and Taylor, 2011; Zhou and Lange, 2013)
    (1/2) ‖y − Xβ‖_2^2 + λ ‖Vβ − d‖_1 + λ ‖Wβ − e‖_+.
Piece-wise linear path. Applications include lasso, fused lasso, polynomial trend filter-
ing, image denoising, ...
• Example: Quantile regression and many more piece-wise linear solution paths (Rosset
and Zhu, 2007).
Approximate path algorithm (Park and Hastie, 2007) and exact path algorithm (Wu,
2011; Zhou and Wu, 2014) using ODE.
• A very general path algorithm presented by Friedman (2008) works for a large class of
convex/concave penalties, but is mysterious.
183
– Choosing λ is critical in statistical applications.
– Commonly used methods
∗ Cross validation
∗ Information criteria:
    AIC(λ) = ‖y − ŷ(λ)‖^2 / σ^2 + 2 df(λ)
    BIC(λ) = ‖y − ŷ(λ)‖^2 / σ^2 + ln(n) df(λ),
  where ŷ(λ) = X β̂(λ) and df(λ) is the effective degrees of freedom of the
  selected model at λ.
– Using Stein (1981)'s theory of unbiased risk estimation (SURE), Efron (2004)
  shows
    df(λ) = (1/σ^2) Σ_{i=1}^n cov(ŷ_i(λ), y_i) = E tr( ∂ŷ(λ)/∂y ).
  Some known cases:
  ∗ ridge regression: df(λ) = Σ_i d_i^2/(d_i^2 + λ), where the d_i are the singular values of X
∗ lasso (Zou et al., 2007): number of non-zero coefficients
∗ generalized lasso (Tibshirani and Taylor, 2011)
∗ group lasso (Yuan and Lin, 2006)
∗ nuclear norm regularization (Zhou and Li, 2014)
∗ ...
• Consider the equality constrained problem
    minimize    f(x)
    subject to  g_i(x) = 0, i = 1, . . . , q.
184
– At a constrained minimum, the Lagrange multiplier condition
    0 = ∇f(x) + Σ_{i=1}^q λ_i ∇g_i(x)
holds for some multipliers λ_i.
• Augmented Lagrangian:
    L_ρ(x, λ) = f(x) + Σ_{i=1}^q λ_i g_i(x) + (ρ/2) Σ_{i=1}^q g_i(x)^2.
– The penalty term (ρ/2) Σ_{i=1}^q g_i(x)^2 punishes violations of the equality constraints
  g_i(x) = 0.
– Idea: optimize the Augmented Lagrangian and adjust λ in the hope of matching
the true Lagrange multipliers.
– For ρ large enough (but finite), the unconstrained minimizer of the augmented
Lagrangian coincides with the constrained solution of the original problem.
– At convergence, the gradient ρgi (x)∇gi (x) vanishes and we recover the standard
multiplier rule.
Remark (intuition for updating λ): if x^{(t)} is the unconstrained minimizer of L_ρ(x, λ^{(t)}), then
the stationarity condition says
    0 = ∇f(x^{(t)}) + Σ_{i=1}^q λ_i^{(t)} ∇g_i(x^{(t)}) + ρ Σ_{i=1}^q g_i(x^{(t)}) ∇g_i(x^{(t)})
      = ∇f(x^{(t)}) + Σ_{i=1}^q [ λ_i^{(t)} + ρ g_i(x^{(t)}) ] ∇g_i(x^{(t)}).
Thus λ_i^{(t)} + ρ g_i(x^{(t)}) plays the role of the Lagrange multiplier at x^{(t)}, which motivates
the update λ_i^{(t+1)} = λ_i^{(t)} + ρ g_i(x^{(t)}).
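A toy sketch (not from the notes) of this multiplier update in Julia for the simple problem
minimize (1/2)‖x − a‖^2 subject to Cx = d, where the inner minimization is an explicit linear
solve; the function name alm_quadratic, ρ = 10, and the example data are all my own choices.
    using LinearAlgebra

    function alm_quadratic(a, C, d; rho = 10.0, iters = 50)
        lambda = zeros(length(d))
        x = copy(a)
        for _ in 1:iters
            # unconstrained minimizer of L_rho(x, lambda)
            x = (I + rho * (C' * C)) \ (a - C' * lambda + rho * (C' * d))
            lambda += rho * (C * x - d)      # multiplier update
        end
        return x, lambda
    end

    # example: enforce sum(x) = 1
    a = randn(5); C = ones(1, 5); d = [1.0]
    x, lam = alm_quadratic(a, C, d)          # sum(x) should be close to 1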
185
• Example: the compressed sensing (or basis pursuit) problem seeks the sparsest solution
subject to linear constraints,
    minimize    ‖x‖_1
    subject to  Ax = b.
• Example: matrix completion,
    minimize    ‖X‖_*
    subject to  x_{ij} = y_{ij}, (i, j) ∈ Ω.
• Remarks on ALM:
– History: The augmented Lagrangian method dates back to 50s (Hestenes, 1969;
Powell, 1969).
Without the quadratic penalty term (ρ/2)‖Ax − b‖_2^2, it is the classical dual ascent
algorithm. The dual ascent algorithm works under a set of restrictive assumptions and
can be slow. ALM converges under much more relaxed assumptions (f can be
nondifferentiable, take the value ∞, ...).
– Monograph by Bertsekas (1982) provides a general treatment.
– Same as the Bregman iteration (Yin et al., 2008) for basis pursuit (compressive
sensing).
– Equivalent to proximal point algorithm applied to the dual; can be accelerated
(Nesterov).
186
25 Lecture 25, Apr 22
Announcements
• Course project due Wed, 4/29 @ 11:00AM.
Last Time
• Path algorithm.
Today
• ADMM (alternating direction method of multipliers). A generic method for solving
many regularization problems.
R
ADMM
A definite resource for learning ADMM is (Boyd et al., 2011)
http://stanford.edu/~boyd/admm.html
187
Remark: if we minimize over x and y jointly, it is the same as ALM. We gain by splitting into
blockwise updates.
– ADMM converges under mild conditions: f, g convex, closed, and proper; L_0 has
  a saddle point.
• Example: generalized lasso, minimize (1/2)‖y − Xβ‖_2^2 + µ‖Dβ‖_1. In ADMM form,
  minimize (1/2)‖y − Xβ‖_2^2 + µ‖γ‖_1 subject to Dβ − γ = 0.
– The augmented Lagrangian is
    L_ρ(β, γ, λ) = (1/2)‖y − Xβ‖_2^2 + µ‖γ‖_1 + λ^T(Dβ − γ) + (ρ/2)‖Dβ − γ‖_2^2.
– ADMM algorithm:
    β^{(t+1)} ← argmin_β (1/2)‖y − Xβ‖_2^2 + λ^{(t)T}(Dβ − γ^{(t)}) + (ρ/2)‖Dβ − γ^{(t)}‖_2^2
    γ^{(t+1)} ← argmin_γ µ‖γ‖_1 + λ^{(t)T}(Dβ^{(t+1)} − γ) + (ρ/2)‖Dβ^{(t+1)} − γ‖_2^2
    λ^{(t+1)} ← λ^{(t)} + ρ(Dβ^{(t+1)} − γ^{(t+1)})
Remark: the β update is a linear system whose coefficient matrix can be factorized
once, cached in memory, and re-used in each iteration.
Remark: the γ update is a separable lasso problem (elementwise soft-thresholding).
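A sketch (not from the notes) of these three updates in Julia for the special case D = I (the
plain lasso); the function name admm_lasso, ρ = 1, and the iteration cap are mine.
    using LinearAlgebra

    soft_threshold(z, g) = sign.(z) .* max.(abs.(z) .- g, 0)

    function admm_lasso(X, y, mu; rho = 1.0, iters = 200)
        p = size(X, 2)
        F = cholesky(Symmetric(X' * X + rho * I))   # factorize once, reuse every iteration
        Xty = X' * y
        beta = zeros(p); gamma = zeros(p); lambda = zeros(p)
        for _ in 1:iters
            beta   = F \ (Xty + rho * gamma - lambda)              # quadratic subproblem
            gamma  = soft_threshold(beta + lambda / rho, mu / rho) # separable lasso
            lambda += rho * (beta - gamma)                         # multiplier update
        end
        return gamma
    end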
• Remarks on ADMM:
– Related algorithms
188
∗ split Bregman iteration (Goldstein and Osher, 2009)
∗ Dykstra (1983)’s alternating projection algorithm
∗ ...
Proximal point algorithm applied to the dual.
– Numerous applications in statistics and machine learning: lasso, generalized lasso,
graphical lasso, (overlapping) group lasso, ...
– Embraces distributed computing for big data (Boyd et al., 2011).
• Distributed computing with ADMM. Consider, for example, solving lasso with a huge
training data set (X, y), which is distributed on B machines. Denote the distributed
data sets by (X1 , y1 ), . . . , (XB , yB ). Then the lasso criterion is
    (1/2) ‖y − Xβ‖_2^2 + µ‖β‖_1 = (1/2) Σ_{b=1}^B ‖y_b − X_b β‖_2^2 + µ‖β‖_1.
The ADMM form is
    minimize    (1/2) Σ_{b=1}^B ‖y_b − X_b β_b‖_2^2 + µ‖β‖_1
    subject to  β_b = β, b = 1, . . . , B.
Here the β_b are local variables and β is the global (or consensus) variable. The augmented
Lagrangian function is
    L_ρ(β, {β_b}, {λ_b}) = (1/2) Σ_{b=1}^B ‖y_b − X_b β_b‖_2^2 + µ‖β‖_1
                           + Σ_{b=1}^B λ_b^T (β_b − β) + (ρ/2) Σ_{b=1}^B ‖β_b − β‖_2^2.
The block updates of the β_b (local least squares solves, run in parallel on each machine) and
of β (a soft-thresholded average) follow as before, together with the multiplier updates
    λ_b^{(t+1)} ← λ_b^{(t)} + ρ(β_b^{(t+1)} − β^{(t+1)}), b = 1, . . . , B.
The whole procedure is carried out without ever transferring distributed data sets
(yb , Xb ) to a central location!
189
Dynamic programming: introduction
• Divide-and-conquer : break the problem into smaller independent subproblems
– fast sorting,
– FFT,
– ...
• Dynamic programming (DP): subproblems are not independent, that is, subproblems
share common subproblems.
• Use these optimal solutions to construct an optimal solution for the original problem.
– Matrix-chain multiplication,
– Longest common subsequence,
– Optimal binary search trees,
– ...
190
– Graphical models (Wainwright and Jordan, 2008),
– Sequence alignment, e.g., discovery of the cystic fibrosis gene in 1989,
– ...
• Let's work on a DP algorithm for the Manhattan tourist problem (MTP), taken
from Jones and Pevzner (2004, Section 6.3).
– Input: a weighted grid G with two distinguished vertices: a source (0, 0) and a
sink (n, m).
– Output: a longest path M T (n, m) in G from source to sink.
Brute force enumeration is out of the question even for a moderate sized graph.
191
• Simple recursive program.
M T (n, m):
– If n = 0 or m = 0, return M T (0, 0)
– x ← M T (n − 1, m)+ weight of the edge from (n − 1, m) to (n, m)
y ← M T (n, m − 1)+ weight of the edge from (n, m − 1) to (n, m)
– Return max{x, y}
• Something is wrong: the naive recursion solves the same subproblems over and over, so its running time is exponential.
192
• MTP dynamic programming: path!
193
Showing all back-traces!
• MTP: recurrence
    MT(i, j) = max{ MT(i − 1, j) + weight of edge (i − 1, j) → (i, j),
                    MT(i, j − 1) + weight of edge (i, j − 1) → (i, j) }.
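A sketch (not from the notes) of filling this recurrence bottom-up in Julia; the function name
manhattan_tourist and the 1-based indexing convention for the edge-weight matrices are mine.
    # wdown[i, j]:  weight of the edge entering vertex (i, j) from above
    # wright[i, j]: weight of the edge entering vertex (i, j) from the left
    function manhattan_tourist(wdown::Matrix{Float64}, wright::Matrix{Float64})
        n, m = size(wdown)              # grid has n x m vertices, source at (1, 1)
        s = zeros(n, m)
        for i in 2:n
            s[i, 1] = s[i-1, 1] + wdown[i, 1]
        end
        for j in 2:m
            s[1, j] = s[1, j-1] + wright[1, j]
        end
        for i in 2:n, j in 2:m
            s[i, j] = max(s[i-1, j] + wdown[i, j], s[i, j-1] + wright[i, j])
        end
        return s[n, m]                  # length of the longest source-to-sink path
    end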
• Remarks on DP:
194
• Wide applications of HMM.
• Let’s work on a simple HMM example. The Occasionally Dishonest Casino (Durbin
et al., 2006)
195
• Fundamental questions of HMM:
– How to compute the probability of the observed sequence of symbols given known
parameters akl and ek (b)?
Answer: Forward algorithm.
– How to compute the posterior probability of the state at a given position (posterior
decoding) given akl and ek (b)?
Answer: Backward algorithm.
– How to estimate the parameters akl and ek (b)?
Answer: Baum-Welch algorithm.
– How to find the most likely sequence of hidden states?
Answer: Viterbi algorithm (Viterbi, 1967).
• Forward algorithm:
– Algorithm: let f_k(i) = P(x_1 . . . x_i, π_i = k).
    ∗ Initialization (i = 1): f_k(1) = a_{0k} e_k(x_1).
    ∗ Recursion (i = 2, . . . , L): f_l(i) = e_l(x_i) Σ_k f_k(i − 1) a_{kl}.
    ∗ Termination: P(x) = Σ_k f_k(L).
Time complexity = (# states)^2 × length of sequence.
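A sketch (not from the notes) of this forward recursion in Julia; the function name
hmm_forward and the argument layout (a0, a, e, integer-coded symbols) are my own conventions.
    # a0[k]  : initial probability of state k
    # a[k, l]: transition probability from state k to state l
    # e[k, b]: emission probability of symbol b in state k
    # x      : observed sequence, coded as integers
    function hmm_forward(a0::Vector{Float64}, a::Matrix{Float64},
                         e::Matrix{Float64}, x::Vector{Int})
        K, L = length(a0), length(x)
        f = zeros(K, L)
        f[:, 1] = a0 .* e[:, x[1]]                                       # initialization
        for i in 2:L, l in 1:K
            f[l, i] = e[l, x[i]] * sum(f[k, i-1] * a[k, l] for k in 1:K) # recursion
        end
        return sum(f[:, L])                                              # termination: P(x)
    end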
• Backward algorithm.
196
– Calculate the posterior state probabilities at each position:
    P(π_i = k | x) = P(x, π_i = k) / P(x).
– Algorithm: let b_k(i) = P(x_{i+1} . . . x_L | π_i = k).
    ∗ Initialization (i = L): b_k(L) = 1 for all k.
    ∗ Recursion (i = L − 1, . . . , 1): b_k(i) = Σ_l a_{kl} e_l(x_{i+1}) b_l(i + 1).
    ∗ Termination: P(x) = Σ_l a_{0l} e_l(x_1) b_l(1).
Time complexity = (# states)^2 × length of sequence.
– The Occasionally Dishonest Casino.
197
• MLE when state sequences are known.
– Idea: Replace the counts Akl and Ek (b) by their expectations conditional on
current parameter iterate (EM algorithm!)
– The probability that a_{kl} is used at position i of sequence x:
    P(π_i = k, π_{i+1} = l | x, θ)
      = P(x, π_i = k, π_{i+1} = l) / P(x)
      = P(x_1 . . . x_i, π_i = k) a_{kl} e_l(x_{i+1}) P(x_{i+2} . . . x_L | π_{i+1} = l) / P(x)
      = f_k(i) a_{kl} e_l(x_{i+1}) b_l(i + 1) / P(x).
– So the expected number of times that a_{kl} is used in all training sequences is
    A_{kl} = Σ_{j=1}^n (1/P(x^j)) Σ_i f_k^j(i) a_{kl} e_l(x^j_{i+1}) b_l^j(i + 1).   (2)
• Baum-Welch Algorithm.
198
– Termination: Stop if change in log-likelihood is less than a predefined threshold
or the maximum number of iteration is exceeded
• Viterbi algorithm:
– Algorithm:
    ∗ Initialization (i = 0): v_0(0) = 1, v_k(0) = 0 for all k > 0.
199
    ∗ Recursion (i = 1, . . . , L): v_l(i) = e_l(x_i) max_k ( v_k(i − 1) a_{kl} ), with pointer
      ptr_i(l) = argmax_k ( v_k(i − 1) a_{kl} ).
    ∗ Termination: P(x, π*) = max_k v_k(L), π*_L = argmax_k v_k(L).
    ∗ Traceback (i = L, . . . , 1): π*_{i−1} = ptr_i(π*_i).
Time complexity = (# states)^2 × length of sequence.
– Viterbi decoding - The Occasionally Dishonest Casino.
over Rp for better recovery of signals that are both sparse and smooth
200
• A genetic example:
by choosing the proper ordered strain origin assignment along the genome
– uk = ak |bk : the ordered strain origin pair
– Lk : log-likelihood function at marker k - matching imputed genotypes with the
observed ones
– Pk : penalty function for adjacent marker k and k + 1 - encouraging smoothness
of the solution
• Loglikelihood at each marker. At marker k, uk = ak |bk : the ordered strain origin pair;
rk /sk : observed genotype for animal i. Log-penetrance (conditional log-likelihood) is
201
– The penalty P_k(u_k, u_{k+1}) for each pair of adjacent markers is
    P_k(u_k, u_{k+1}) = 0,                                      a_k = a_{k+1}, b_k = b_{k+1}
                      = − ln γ_i^p(b_{k+1}) + λ,                a_k = a_{k+1}, b_k ≠ b_{k+1}
                      = − ln γ_i^m(a_{k+1}) + λ,                a_k ≠ a_{k+1}, b_k = b_{k+1}
                      = − ln ψ_i^{mp}(a_{k+1}, b_{k+1}) + 2λ,   a_k ≠ a_{k+1}, b_k ≠ b_{k+1}.
– Penalties suppress jumps between strains and guide jumps, when they occur,
toward more likely states.
• For each m = 1, . . . , n,
    O_m(u_m) = min_{u_1, . . . , u_{m−1}} [ − Σ_{t=1}^m ℓ_t(u_t) + λ Σ_{t=1}^{m−1} p(u_t, u_{t+1}) ].
202
– Johnson (2013) proposes a dynamic programming algorithm for maximizing the
  general objective function
    Σ_{k=1}^n e_k(β_k) − λ Σ_{k=2}^n d(β_k, β_{k−1}).
• Being good at computing (both algorithms and programming) is a must for today’s
working statisticians.
• In this course, we studied and practiced many (overwhelming?) tools that help us
deliver results faster and more accurately.
Of course there are many tools not covered in this course, notably the Bayesian MCMC
machinery. Take a Bayesian course!
203
• Updated benchmark results. R is upgraded to v3.2.0 and Julia to 0.3.7 since beginning
of this course. I re-did the benchmark and did not see notable changes.
Benchmark code R-benchmark-25.R from http://r.research.att.com/benchmarks/
R-benchmark-25.R covers many commonly used numerical operations used in statis-
tics. We ported to Matlab and Julia and report the run times (averaged over 5 runs)
here.
Machine specs: Intel i7 @ 2.6GHz (4 physical cores, 8 threads), 16G RAM, Mac OS 10.9.5.
Test R 3.2.0 Matlab R2014a julia 0.3.7
For the simple Gibbs sampler test, R v3.2.0 takes 38.32s elapsed time. Julia v0.3.7
takes 0.35s.
204
References
Alizadeh, F. and Goldfarb, D. (2003). Second-order cone programming. Math. Program.,
95(1, Ser. B):3–51. ISMP 2000, Part 3 (Atlanta, GA).
Armagan, A., Dunson, D., and Lee, J. (2013). Generalized double Pareto shrinkage. Statistica
Sinica, 23:119–143.
Baum, L. E., Petrie, T., Soules, G., and Weiss, N. (1970). A maximization technique
occurring in the statistical analysis of probabilistic functions of Markov chains. Ann.
Math. Statist., 41:164–171.
Beck, A. and Teboulle, M. (2009a). Fast gradient-based algorithms for constrained total
variation image denoising and deblurring problems. Trans. Img. Proc., 18(11):2419–2434.
Belloni, A., Chernozhukov, V., and Wang, L. (2011). Square-root lasso: pivotal recovery of
sparse signals via conic programming. Biometrika, 98(4):791–806.
Boyd, S., Parikh, N., Chu, E., Peleato, B., and Eckstein, J. (2011). Distributed optimization
and statistical learning via the alternating direction method of multipliers. Found. Trends
Mach. Learn., 3(1):1–122.
205
Brown, L. D. (1971). Admissible estimators, recurrent diffusions, and insoluble boundary
value problems. Ann. Math. Statist., 42:855–903.
Burge, C. (1997). Prediction of complete gene structures in human genomic DNA. Journal
of Molecular Biology, 268(1):78–94.
Cai, J.-F., Candès, E. J., and Shen, Z. (2010). A singular value thresholding algorithm for
matrix completion. SIAM J. Optim., 20(4):1956–1982.
Candès, E. and Tao, T. (2007). The Dantzig selector: statistical estimation when p is much
larger than n. Ann. Statist., 35(6):2313–2351.
Candès, E. J., Romberg, J. K., and Tao, T. (2006). Stable signal recovery from incom-
plete and inaccurate measurements. Communications on Pure and Applied Mathematics,
59(8):1207–1223.
Candès, E. J. and Tao, T. (2006). Near-optimal signal recovery from random projections:
universal encoding strategies? IEEE Trans. Inform. Theory, 52(12):5406–5425.
Candès, E. J., Wakin, M. B., and Boyd, S. P. (2008). Enhancing sparsity by reweighted l1
minimization. J. Fourier Anal. Appl., 14(5-6):877–905.
Cormen, T. H., Leiserson, C. E., Rivest, R. L., and Stein, C. (2009). Introduction to Algo-
rithms. MIT Press, Cambridge, MA, third edition.
Daubechies, I., Defrise, M., and De Mol, C. (2004). An iterative thresholding algorithm for
linear inverse problems with a sparsity constraint. Comm. Pure Appl. Math., 57(11):1413–
1457.
Donoho, D. and Stodden, V. (2004). When does non-negative matrix factorization give a
correct decomposition into parts? In Thrun, S., Saul, L., and Schölkopf, B., editors,
Advances in Neural Information Processing Systems 16, pages 1141–1148. MIT Press.
206
Donoho, D. L. and Johnstone, I. M. (1995). Adapting to unknown smoothness via wavelet
shrinkage. J. Amer. Statist. Assoc., 90(432):1200–1224.
Durbin, R., Eddy, S., Krogh, A., and Mitchison, G. (2006). Biological Sequence Analysis.
eleventh edition.
Dykstra, R. L. (1983). An algorithm for restricted least squares regression. J. Amer. Statist.
Assoc., 78(384):837–842.
Efron, B. (2004). The estimation of prediction error: covariance penalties and cross-
validation. J. Amer. Statist. Assoc., 99(467):619–642. With comments and a rejoinder by
the author.
Efron, B., Hastie, T., Johnstone, I., and Tibshirani, R. (2004). Least angle regression. Ann.
Statist., 32(2):407–499. With discussion, and a rejoinder by the authors.
Efron, B. and Morris, C. (1973). Stein’s estimation rule and its competitors—an empirical
Bayes approach. J. Amer. Statist. Assoc., 68:117–130.
Fan, J. and Li, R. (2001). Variable selection via nonconcave penalized likelihood and its
oracle properties. J. Amer. Statist. Assoc., 96(456):1348–1360.
Friedman, J., Hastie, T., Höfling, H., and Tibshirani, R. (2007). Pathwise coordinate opti-
mization. Ann. Appl. Stat., 1(2):302–332.
Friedman, J., Hastie, T., and Tibshirani, R. (2008). Sparse inverse covariance estimation
with the graphical lasso. Biostatistics, 9(3):432–441.
207
Friedman, J. H., Hastie, T., and Tibshirani, R. (2010). Regularization paths for generalized
linear models via coordinate descent. Journal of Statistical Software, 33(1):1–22.
Fu, W. J. (1998). Penalized regressions: the bridge versus the lasso. J. Comput. Graph.
Statist., 7(3):397–416.
Goldstein, T. and Osher, S. (2009). The split Bregman method for l1 -regularized problems.
SIAM J. Img. Sci., 2:323–343.
Golub, G. H. and Van Loan, C. F. (1996). Matrix Computations. Johns Hopkins Studies in
the Mathematical Sciences. Johns Hopkins University Press, Baltimore, MD, third edition.
Grant, M. and Boyd, S. (2008). Graph implementations for nonsmooth convex programs. In
Blondel, V., Boyd, S., and Kimura, H., editors, Recent Advances in Learning and Control,
Lecture Notes in Control and Information Sciences, pages 95–110. Springer-Verlag Limited.
http://stanford.edu/~boyd/graph_dcp.html.
Hastie, T., Rosset, S., Tibshirani, R., and Zhu, J. (2004). The entire regularization path for
the support vector machine. J. Mach. Learn. Res., 5:1391–1415.
Hastie, T., Tibshirani, R., and Friedman, J. (2009). The Elements of Statistical Learning:
Data Mining, Inference, and Prediction. Springer Series in Statistics. Springer, New York,
second edition.
Huber, P. J. (1994). Huge data sets. In COMPSTAT 1994 (Vienna), pages 3–13. Physica,
Heidelberg.
Huber, P. J. (1996). Massive data sets workshop: The morning after. In Massive Data Sets:
Proceedings of a Workshop, pages 169–184. National Academy Press, Washington.
James, W. and Stein, C. (1961). Estimation with quadratic loss. In Proc. 4th Berkeley
Sympos. Math. Statist. and Prob., Vol. I, pages 361–379. Univ. California Press, Berkeley,
Calif.
Johnson, N. A. (2013). A dynamic programming algorithm for the fused lasso and L0 -
segmentation. Journal of Computational and Graphical Statistics, to appear.
208
Laurent, M. and Rendl, F. (2005). Semidefinite programming and integer programming.
In K. Aardal, G. N. and Weismantel, R., editors, Discrete Optimization, volume 12 of
Handbooks in Operations Research and Management Science, pages 393 – 514. Elsevier.
Lobo, M. S., Vandenberghe, L., Boyd, S., and Lebret, H. (1998). Applications of second-
order cone programming. Linear Algebra Appl., 284(1-3):193–228. ILAS Symposium on
Fast Algorithms for Control, Signals and Image Processing (Winnipeg, MB, 1997).
Mazumder, R., Friedman, J. H., and Hastie, T. (2011). SparseNet: Coordinate descent with
nonconvex penalties. Journal of the American Statistical Association, 106(495):1125–1138.
McKay, B., Bar-Natan, D., Bar-Hillel, M., and Kalai, G. (1999). Solving the bible code
puzzle. Statist. Sci., 14(2):150–173.
Nesterov, Y. (2000). Squared functional systems and optimization problems. In High per-
formance optimization, volume 33 of Appl. Optim., pages 405–440. Kluwer Acad. Publ.,
Dordrecht.
Osborne, M. R., Presnell, B., and Turlach, B. A. (2000). A new approach to variable selection
in least squares problems. IMA J. Numer. Anal., 20(3):389–403.
Parikh, N. and Boyd, S. (2013). Proximal algorithms. Found. Trends Mach. Learn., 1(3):123–
231.
Park, M. Y. and Hastie, T. (2007). L1 -regularization path algorithm for generalized linear
models. J. R. Stat. Soc. Ser. B Stat. Methodol., 69(4):659–677.
209
Peng, R. D. (2011). Reproducible research in computational science. Science,
334(6060):1226–1227.
Platt, J. C. (1999). Fast training of support vector machines using sequential minimal
optimization. In Schölkopf, B., Burges, C. J. C., and Smola, A. J., editors, Advances in
Kernel Methods, pages 185–208. MIT Press, Cambridge, MA, USA.
Potti, A., Dressman, H. K., Bild, A., and Riedel, R. F. (2006). Genomic signatures to guide
the use of chemotherapeutics. Nature medicine, 12(11):1294–1300.
Rabiner, L. (1989). A tutorial on hidden Markov models and selected applications in speech
recognition. Proceedings of the IEEE, 77(2):257–286.
Rosset, S. and Zhu, J. (2007). Piecewise linear regularized solution paths. Ann. Statist.,
35(3):1012–1030.
Stein, C. (1956). Inadmissibility of the usual estimator for the mean of a multivariate normal
distribution. In Proceedings of the Third Berkeley Symposium on Mathematical Statistics
and Probability, 1954–1955, vol. I, pages 197–206, Berkeley and Los Angeles. University
of California Press.
Tibshirani, R. (1996). Regression shrinkage and selection via the lasso. J. Roy. Statist. Soc.
Ser. B, 58(1):267–288.
Tibshirani, R., Bien, J., Friedman, J., Hastie, T., Simon, N., Taylor, J., and Tibshirani, R. J.
(2012). Strong rules for discarding predictors in lasso-type problems. J. R. Stat. Soc. Ser.
B. Stat. Methodol., 74(2):245–266.
Tibshirani, R., Saunders, M., Rosset, S., Zhu, J., and Knight, K. (2005). Sparsity and
smoothness via the fused lasso. J. R. Stat. Soc. Ser. B Stat. Methodol., 67(1):91–108.
Tibshirani, R. J. and Taylor, J. (2011). The solution path of the generalized lasso. Ann.
Statist., 39(3):1335–1371.
210
Tseng, P. (2008). On accelerated proximal gradient methods for convex-concave optimiza-
tion. submitted to SIAM Journal on Optimization.
Vielma, J. P., Ahmed, S., and Nemhauser, G. (2010). Mixed-integer models for nonseparable
piecewise-linear optimization: unifying framework and extensions. Oper. Res., 58(2):303–
315.
Viterbi, A. (1967). Error bounds for convolutional codes and an asymptotically optimum
decoding algorithm. Information Theory, IEEE Transactions on, 13(2):260–269.
Williams, H. P. (2013). Model Building in Mathematical Programming. John Wiley & Sons,
Ltd., Chichester, fifth edition.
Witztum, D., Rips, E., and Rosenberg, Y. (1994). Equidistant letter sequences in the book
of genesis. Statist. Sci., 9(3):429–438.
Wu, T. T., Chen, Y., Hastie, T., Sobel, E., and Lange, K. (2009). Genome-wide association
analysis by lasso penalized logistic regression. Bioinformatics, 25(6):714–721.
Wu, T. T. and Lange, K. (2008). Coordinate descent algorithms for lasso penalized regression.
Ann. Appl. Stat., 2(1):224–244.
Yin, W., Osher, S., Goldfarb, D., and Darbon, J. (2008). Bregman iterative algorithms for
l1 -minimization with applications to compressed sensing. SIAM J. Imaging Sci., 1(1):143–
168.
Yuan, M. and Lin, Y. (2006). Model selection and estimation in regression with grouped
variables. J. R. Stat. Soc. Ser. B Stat. Methodol., 68(1):49–67.
Zhang, C.-H. (2010). Nearly unbiased variable selection under minimax concave penalty.
Ann. Statist., 38(2):894–942.
Zhou, H. and Lange, K. (2013). A path algorithm for constrained estimation. Journal of
Computational and Graphical Statistics, 22:261–283.
Zhou, H. and Li, L. (2014). Regularized matrix regressions. Journal of Royal Statistical
Society, Series B, 76(2):463–483.
211
Zhou, H. and Wu, Y. (2014). A generic path algorithm for regularized statistical estimation.
J. Amer. Statist. Assoc., 109(506):686–699.
Zhou, J. J., Ghazalpour, A., Sobel, E. M., Sinsheimer, J. S., and Lange, K. (2012). Quantita-
tive trait loci association mapping by imputation of strain origins in multifounder crosses.
Genetics, 190(2):459–473.
Zhu, J., Rosset, S., Tibshirani, R., and Hastie, T. J. (2004). 1-norm support vector ma-
chines. In Thrun, S., Saul, L., and Schölkopf, B., editors, Advances in Neural Information
Processing Systems 16, pages 49–56. MIT Press.
Zou, H. and Hastie, T. (2005). Regularization and variable selection via the elastic net. J.
R. Stat. Soc. Ser. B Stat. Methodol., 67(2):301–320.
Zou, H., Hastie, T., and Tibshirani, R. (2007). On the “degrees of freedom” of the lasso.
Ann. Statist., 35(5):2173–2192.
212