
Large Scale Machine Learning Systems Tutorial

This document provides an overview of machine learning for computational advertising. It discusses how advertising companies use machine learning to predict click-through rates for ads based on features of the ads, users, and context. It describes the large scale of advertising data, with billions of samples and features, and how distributed file systems like Google File System and HDFS are used to store and access this data for training machine learning models.


Carnegie Mellon University

Tutorial of Parameter Server
Data and Application

Mu Li
CSD@CMU & IDL@Baidu
[email protected]

About me
Ph.D. student working with Alex Smola and Dave Andersen
  large-scale machine learning theory, algorithms, applications, and distributed systems
Senior architect at Baidu
  the largest search engine in China, >60% market share
  working on distributed machine learning systems for computational ads


About this tutorial

Focus on the design and implementation of large-scale machine learning systems
Parallel and distributed algorithms
Several coding exercises
Real datasets and machines are provided


Application & Data

There is a lot of data

Text
Images
Voice
Videos
All about user activities: personalization

Data are sparse

Most categories have only a few active examples
Most users are not so active
  (annotation on the slide: "not true, indeed more than Alex :)")
Simple statistical tools model the head well
Machine learning models the tail: personalization

Online Advertising

The major revenue source of internet search companies
Query "flower delivery": results from Baidu, Google, Bing


Computational Advertising

Search companies charge advertisers when their ads are clicked by users
Display position is the scarce resource
Ads are ranked by
  p(click | Ad, user, scene) x bid_price(Ad)
Bid prices are given (studied by electronic mechanism design)
Our goal: predict the click-through rate
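The ranking rule above can be sketched in a few lines. This is only an illustration: the `Candidate` class, the `rank_ads` function, and all the numbers are made up here and are not taken from any production system.

```python
from dataclasses import dataclass


@dataclass
class Candidate:
    ad_id: str
    p_click: float    # predicted p(click | Ad, user, scene)
    bid_price: float  # given by the advertiser


def rank_ads(candidates, num_slots):
    """Return the top candidates sorted by p_click * bid_price, highest first."""
    scored = sorted(candidates, key=lambda c: c.p_click * c.bid_price, reverse=True)
    return scored[:num_slots]


# Made-up numbers for illustration only.
ads = [
    Candidate("ad_a", p_click=0.10, bid_price=1.0),
    Candidate("ad_b", p_click=0.02, bid_price=8.0),
    Candidate("ad_c", p_click=0.05, bid_price=2.5),
]
top = rank_ads(ads, num_slots=2)
```

Note that the ad with the highest click probability (`ad_a`) does not win a slot here, because its expected revenue per impression is the lowest of the three.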

System architecture

from Google Sibyl



Machine Learning Approach

Represent {Ad, user, scene} as a feature vector x, let y (clicked or not clicked) be the label, then model p(y|x)
A common way, with labels y in {+1, -1}:

$$p(y \mid x, w) = \frac{1}{1 + \exp(-y \langle x, w \rangle)}$$

then learn w by logistic regression
There is also increasing interest in deep learning
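A minimal single-machine sketch of the logistic model, assuming labels y in {+1, -1} and sparse feature vectors stored as dictionaries. The function names, the learning rate, and the toy data are assumptions made for illustration; the large-scale systems this tutorial is about distribute this computation.

```python
import math


def dot(x, w):
    """Inner product <x, w> for a sparse x given as {feature_index: value}."""
    return sum(v * w.get(i, 0.0) for i, v in x.items())


def p_label(y, x, w):
    """p(y | x, w) = 1 / (1 + exp(-y <x, w>)) with y in {+1, -1}."""
    return 1.0 / (1.0 + math.exp(-y * dot(x, w)))


def sgd_step(x, y, w, lr=0.1):
    """One stochastic gradient step on the negative log-likelihood."""
    # The derivative of log(1 + exp(-y <x, w>)) w.r.t. w is -y * (1 - p(y|x,w)) * x.
    g = -y * (1.0 - p_label(y, x, w))
    for i, v in x.items():
        w[i] = w.get(i, 0.0) - lr * g * v
    return w


# Toy data (made up): feature 1 co-occurs with clicks, feature 2 with non-clicks.
data = [({0: 1.0, 1: 1.0}, +1), ({0: 1.0, 2: 1.0}, -1)]
w = {}
for _ in range(200):
    for x, y in data:
        sgd_step(x, y, w)
```

Only the weights of features that are active in an example get updated, which is what makes sparse data cheap to train on.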


Feature Engineering
Feature engineering is the most effective way to improve model performance
  this still holds even for deep learning
An easy way to add domain knowledge to the model
Features often come in multiple groups
  three major sources: ads, users, advertisers


N-grams

uni-gram: international, flower, delivery, ...
bi-gram: international flower, flower delivery, ...
tri-gram: international flower delivery, ...
For short text it is even desirable to generate all possible n-grams, then filter out the unimportant ones
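Generating all n-grams of a short text, then filtering, might look like the sketch below. The slides do not say how "unimportant" is decided, so the frequency threshold `min_count` is an assumption made for illustration, as are the function names.

```python
from collections import Counter


def ngrams(text, max_n=3):
    """Return all word n-grams of `text` for n = 1..max_n, grouped by n."""
    tokens = text.split()
    return {
        n: [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]
        for n in range(1, max_n + 1)
    }


def frequent_ngrams(texts, max_n=3, min_count=2):
    """Keep n-grams seen at least `min_count` times (an assumed filter rule)."""
    counts = Counter(
        gram
        for text in texts
        for grams_of_size_n in ngrams(text, max_n).values()
        for gram in grams_of_size_n
    )
    return {gram for gram, count in counts.items() if count >= min_count}


grams = ngrams("international flower delivery")
kept = frequent_ngrams(["international flower delivery", "flower delivery"])
```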

Style
Bold text

Layout
Images


Personalization
User profile
  gender, age, location, ...
Advertiser profile
  category, reputation, ...
Session
  a sequence of activities


Feature combination
Given two feature groups
  {(a, 1), (b, 0)}
  {(A, 0), (B, 1)}
Produce a combination group
  and: {(aA, 0), (aB, 1), (bA, 0), (bB, 0)}
  or: {(aA, 1), (aB, 1), (bA, 0), (bB, 1)}
Approximates the polynomial kernel, but is much more efficient
Guided by domain knowledge or heuristic search
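The "and" and "or" combinations above can be reproduced with a small cross-product helper. The `combine` function and the dictionary representation are assumptions made for illustration.

```python
def combine(group1, group2, op):
    """Cross two feature groups given as {name: 0/1} dicts using binary `op`."""
    return {
        name1 + name2: op(value1, value2)
        for name1, value1 in group1.items()
        for name2, value2 in group2.items()
    }


g1 = {"a": 1, "b": 0}
g2 = {"A": 0, "B": 1}

and_group = combine(g1, g2, lambda u, v: u & v)
or_group = combine(g1, g2, lambda u, v: u | v)
```

The combined group has |group1| x |group2| features, which is why the choice of which groups to cross is guided by domain knowledge or search rather than done exhaustively.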


2 trillion ads
in one year

Data Scale of Ad-ctr


A single year of search logs produces 2 trillion examples
Sub-sampling? It does not always work because of personalization
Feature size = #ngram + #users + #sessions + #combination
  often on the same scale as #samples
A training task from some years ago


Industry Dataset Size

100 billion samples
10 billion features
1T-1P training data
>5 years ago


Where to store the data

Lots of disks
They can fail at any time


Access patterns

Files are large: 100 MB to 10 GB
Sequential reads and appends


Google File System

Data are replicated
  a write succeeds only if all replicas are written
Requesting data:
  ask the master for the location
  ask the chunk server for the data
New generation: Colossus
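The two rules above (a write needs every replica, a read goes master first and chunk server second) can be mimicked with a toy in-memory model. Everything here is hypothetical: the class and function names are invented, and the real GFS protocol (leases, a primary replica, chunk placement by the master) is far more involved.

```python
class ChunkServer:
    """In-memory stand-in for a chunk server (hypothetical, for illustration)."""

    def __init__(self, name, healthy=True):
        self.name = name
        self.healthy = healthy
        self.chunks = {}

    def write(self, chunk_id, data):
        """Store the chunk and acknowledge, or refuse if the server is down."""
        if not self.healthy:
            return False
        self.chunks[chunk_id] = data
        return True

    def read(self, chunk_id):
        return self.chunks[chunk_id]


class Master:
    """Keeps only metadata: which servers hold each chunk."""

    def __init__(self):
        self.locations = {}

    def locate(self, chunk_id):
        return self.locations[chunk_id]


def write_chunk(master, servers, chunk_id, data):
    """The write succeeds only if every replica acknowledges it."""
    acks = [server.write(chunk_id, data) for server in servers]
    if all(acks):
        master.locations[chunk_id] = list(servers)
        return True
    return False


def read_chunk(master, chunk_id):
    """Ask the master for the location, then a chunk server for the data."""
    replicas = master.locate(chunk_id)
    return replicas[0].read(chunk_id)
```

The data itself never passes through the master in this sketch, which mirrors the point of the two-step read: the master only answers small metadata queries.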

HDFS
Open source implementation of GFS
Operations:
  hadoop fs -ls, -get, -put, -head, -cat, ...
  libhdfs: C API
  mount to the local filesystem
A little bit slower than GFS (personal experience)
  Large delay: hadoop fs -ls /xxx (8000 files)
Sometimes reading the training data takes more time than training
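One way to check the listing delay on a given cluster is to time the command directly. The sketch below assumes the `hadoop` command-line tool is on the PATH; the helper names are invented for illustration, and `/xxx` is the placeholder path from the slide, not a real directory.

```python
import shutil
import subprocess
import time


def time_command(argv):
    """Run a command, discard its output, and return (exit_code, seconds)."""
    start = time.perf_counter()
    done = subprocess.run(argv, stdout=subprocess.DEVNULL, stderr=subprocess.DEVNULL)
    return done.returncode, time.perf_counter() - start


def time_hdfs_listing(path):
    """Time `hadoop fs -ls <path>`; returns None when no hadoop CLI is on PATH."""
    if shutil.which("hadoop") is None:
        return None
    return time_command(["hadoop", "fs", "-ls", path])
```

Calling `time_hdfs_listing("/xxx")` on a directory with thousands of files, and comparing the result with the time of one training pass, gives a rough sense of whether I/O or computation dominates.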
