Carnegie Mellon University
Data and Application
Tutorial of Parameter Server
Mu Li
CSD@CMU & IDL@Baidu
[email protected]
About me
Ph.D. student working with Alex Smola and Dave Andersen
large-scale machine learning theory, algorithms, applications, and distributed systems
Senior architect at Baidu
the largest search engine in China, >60% market share
working on distributed machine learning systems for computational ads
About this tutorial
Focus on the design and implementation of large-scale machine learning systems
Parallel and distributed algorithms
Several coding exercises
Provide real datasets and machines
Application & Data
There are lots of data
Text
Images
Voices
Videos
All about user activities: personalization
Data are sparse
Most categories have only a few active examples
Most users are not so active
not true, indeed more than Alex :)
Simple statistical tools model the head well
Machine learning models the tail: personalization
Online Advertising
The major revenue source of internet search companies
Query "flower delivery": results from Baidu, Google, Bing
Computational Advertising
Search companies charge advertisers when their ads are clicked by users
Display position is the scarce resource
Ads are ranked by
p(click | Ad, user, scene) x bid_price(Ad)
bid prices are given (studied by electronic mechanism design)
our goal: predict the click-through rate
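The ranking rule above can be sketched in a few lines. This is only an illustration: the `Ad` class, the ad names, and all numbers are made up, and the click probabilities are assumed to come from some already trained model.

```python
from dataclasses import dataclass


@dataclass
class Ad:
    name: str
    p_click: float    # assumed model output for p(click | Ad, user, scene)
    bid_price: float  # assumed to be given by the advertiser


def rank_ads(ads):
    """Sort ads by p(click) * bid_price, highest score first."""
    return sorted(ads, key=lambda ad: ad.p_click * ad.bid_price, reverse=True)


# Hypothetical candidates for a single query.
ads = [
    Ad("A", p_click=0.05, bid_price=1.0),   # score 0.05
    Ad("B", p_click=0.25, bid_price=0.5),   # score 0.125
    Ad("C", p_click=0.125, bid_price=2.0),  # score 0.25
]
ranked = [ad.name for ad in rank_ads(ads)]
print(ranked)
```

Because the bid is fixed by the advertiser, the only quantity the system can improve is the estimate of p(click), which is why the rest of the pipeline concentrates on click-through rate prediction.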
System architecture
from Google Sibyl
Machine Learning Approach
Represent {Ad, user, scene} as a feature vector x, let y (clicked or not clicked) be the label, then model p(y|x)
A common way, with labels y in {-1, +1}:
$p(y \mid x, w) = \frac{1}{1 + \exp(-y \langle x, w \rangle)}$
then learn w by logistic regression
Also increasing interest in deep learning
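A minimal sketch of the logistic model, assuming labels y in {-1, +1} and sparse feature vectors stored as dictionaries. The feature names, learning rate, and toy data are illustrative assumptions, and plain SGD stands in for whatever solver a production system would use.

```python
import math


def predict_proba(x, w, y=1):
    """p(y | x, w) = 1 / (1 + exp(-y * <x, w>)) for y in {-1, +1}."""
    score = sum(value * w.get(key, 0.0) for key, value in x.items())
    return 1.0 / (1.0 + math.exp(-y * score))


def sgd_step(x, y, w, lr=0.1):
    """One SGD step on the logistic loss log(1 + exp(-y * <x, w>))."""
    # d(loss)/d(w_k) = -y * x_k * (1 - p(y | x, w))
    g = -y * (1.0 - predict_proba(x, w, y))
    for key, value in x.items():
        w[key] = w.get(key, 0.0) - lr * g * value


# Toy data: one clicked example and one not-clicked example.
data = [({"flower": 1.0, "delivery": 1.0}, +1), ({"car": 1.0}, -1)]
w = {}
for _ in range(100):
    for x, y in data:
        sgd_step(x, y, w)

print(predict_proba({"flower": 1.0, "delivery": 1.0}, w))
print(predict_proba({"car": 1.0}, w))
```

Only the weights of features present in an example are touched, which is what keeps each update cheap when x is very sparse.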
Feature Engineering
Feature engineering is the most effective way to improve model performance
still true even for deep learning
Easy way to add domain knowledge into the model
Often contains multiple feature groups
three major sources: ads, users, advertisers
N-grams
uni-gram: international, flower, delivery, ...
bi-gram: international flower, flower delivery, ...
tri-gram: international flower delivery, ...
for short text, it is even desirable to generate all possible n-grams, then filter out the unimportant ones
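The n-gram generation described above can be sketched as follows. Whitespace tokenization and the frequency threshold used as the "importance" filter are assumptions for illustration; the slides do not say which filter is used.

```python
from collections import Counter


def ngrams(text, n):
    """Return all contiguous word n-grams of a whitespace-tokenized text."""
    tokens = text.split()
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def all_ngrams(text, max_n=3):
    """Generate every n-gram for n = 1 .. max_n."""
    return [g for n in range(1, max_n + 1) for g in ngrams(text, n)]


def frequent_ngrams(texts, max_n=3, min_count=2):
    """Keep n-grams seen at least min_count times (hypothetical filter)."""
    counts = Counter(g for t in texts for g in all_ngrams(t, max_n))
    return {g for g, c in counts.items() if c >= min_count}


query = "international flower delivery"
print(ngrams(query, 1))
print(ngrams(query, 2))
print(ngrams(query, 3))
print(frequent_ngrams([query, "flower delivery"]))
```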
Style
Bold text
Layout
Images
Personalization
User profile
gender, age, location, ...
Advertiser profile
category, reputation, ...
Session
a sequence of activities
Feature combination
Given two feature groups
{(a, 1), (b, 0)}
{(A, 0), (B, 1)}
Produce a combination group
and: {(aA, 0), (aB, 1), (bA, 0), (bB, 0)}
or: {(aA, 1), (aB, 1), (bA, 0), (bB, 1)}
Approximates the polynomial kernel, but much more efficient
Guided by domain knowledge or heuristic search
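The and/or combination above can be reproduced with a small helper. Representing a feature group as a dictionary from feature name to a 0/1 value is an assumption made for this sketch, and `combine` is a hypothetical name.

```python
from itertools import product


def combine(group1, group2, op):
    """Cross two binary feature groups, applying op to each pair of values."""
    return {
        name1 + name2: op(value1, value2)
        for (name1, value1), (name2, value2)
        in product(group1.items(), group2.items())
    }


g1 = {"a": 1, "b": 0}
g2 = {"A": 0, "B": 1}

and_group = combine(g1, g2, lambda u, v: u & v)
or_group = combine(g1, g2, lambda u, v: u | v)
print(and_group)
print(or_group)
```

Crossing two groups of sizes m and n yields m x n features, so in practice only selected pairs of groups are combined, which is where the domain knowledge or heuristic search comes in.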
Data Scale of Ad-ctr
2 trillion ads in one year
Just one year of search logs produces 2 trillion examples
sub-sampling? does not always work because of personalization
Feature size = #ngram + #users + #sessions + #combination
often on the same scale as #samples
A training task some years ago
Industry Dataset Size
100 billion samples
10 billion features
1 TB to 1 PB of training data
>5 years ago
Where to store the data
Lots of disks
Fail at any time
Access patterns
Files are large: 100 MB to 10 GB
Sequential read and append
Google File System
Data are replicated
a write succeeds only if all replicas are written
Request data:
ask the master for the location
ask the chunk server for the data
New generation: Colossus
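The read and write paths above can be mimicked with a toy in-memory model. This is not the GFS API: every class and function name here is hypothetical, the master simply picks the first few servers, and a failed write is not cleaned up.

```python
class ChunkServer:
    """Toy chunk server that stores chunk data in memory."""

    def __init__(self, name):
        self.name = name
        self.alive = True
        self.chunks = {}

    def write(self, chunk_id, data):
        if not self.alive:
            return False  # no acknowledgement from a failed server
        self.chunks[chunk_id] = data
        return True

    def read(self, chunk_id):
        return self.chunks[chunk_id]


class Master:
    """Toy master: keeps only metadata, never the data itself."""

    def __init__(self, servers, replicas=3):
        self.servers = servers
        self.replicas = replicas
        self.locations = {}  # chunk_id -> list of servers holding it

    def pick_servers(self):
        return self.servers[:self.replicas]

    def locate(self, chunk_id):
        return self.locations[chunk_id]


def write_chunk(master, chunk_id, data):
    """The write succeeds only if every replica acknowledges it."""
    targets = master.pick_servers()
    acks = [server.write(chunk_id, data) for server in targets]
    if all(acks):
        master.locations[chunk_id] = targets
        return True
    return False


def read_chunk(master, chunk_id):
    """Ask the master for the location, then a chunk server for the data."""
    for server in master.locate(chunk_id):
        if server.alive:
            return server.read(chunk_id)
    raise IOError("no live replica for " + chunk_id)


servers = [ChunkServer("s0"), ChunkServer("s1"), ChunkServer("s2")]
master = Master(servers)
first_write = write_chunk(master, "log/0", b"clicks")
servers[0].alive = False                       # one disk fails
data = read_chunk(master, "log/0")             # served by another replica
second_write = write_chunk(master, "log/1", b"more")  # a replica is down
print(first_write, data, second_write)
```

Replication is what lets the read succeed after a failure, while the all-replicas rule is why the second write is rejected.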
HDFS
Open source implementation of GFS
operations:
hadoop fs -ls, -get, -put, -head, -cat, ...
libhdfs: C API
mount to local filesystem
A little bit slower than GFS (personal experience)
Large delay
hadoop fs -ls /xxx (8000 files)
Sometimes reading the training data takes more time than training