Large Scale Machine Learning Systems Tutorial

This document provides an overview of machine learning for computational advertising. It discusses how advertising companies use machine learning to predict click-through rates for ads based on features of the ads, users, and context. It describes the large scale of advertising data, with billions of samples and features, and how distributed file systems like Google File System and HDFS are used to store and access this data for training machine learning models.

Carnegie Mellon University

Data and Application

Tutorial of Parameter Server

Mu Li
CSD@CMU & IDL@Baidu
[email protected]

About me

- Ph.D. student working with Alex Smola and Dave Andersen
  - large-scale machine learning theory, algorithms, applications, and distributed systems
- Senior architect at Baidu
  - the largest search engine in China, >60% market share
  - working on distributed machine learning systems for computational ads

About this tutorial

- Focus on the design and implementation of large-scale machine learning systems
- Parallel and distributed algorithms
- Several coding exercises
- Real datasets and machines are provided

Application & Data

There is a lot of data

- Text
- Images
- Voice
- Videos

All about user activities: personalization

Data are sparse

- Most categories have only a few active examples
- Most users are not so active
  (slide annotation: "not true, indeed more than Alex :)")
- Simple statistical tools model the head well
- Machine learning models the tail: personalization

Online Advertising

- The major revenue source of internet search companies
- Example query: "flower delivery", with results from Baidu, Google, and Bing

Computational Advertising

- Search companies charge advertisers if their ads are clicked by users
- Display position is the scarce resource
- Ads are ranked by p(click | Ad, user, scene) x bid_price(Ad)
- Bid prices are given (studied by electronic mechanism design)
- Our goal: predict the click-through rate
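The ranking rule on this slide can be sketched in a few lines. This is only an illustration: the `Candidate` class, the ad names, and the probabilities and bids are made-up values, not data from the tutorial.

```python
from dataclasses import dataclass


@dataclass
class Candidate:
    """Hypothetical container for one ad competing for a display position."""
    ad_id: str
    p_click: float    # predicted p(click | Ad, user, scene)
    bid_price: float  # bid supplied by the advertiser


def rank_ads(candidates):
    """Sort ads by expected revenue per impression: p_click * bid_price."""
    return sorted(candidates, key=lambda c: c.p_click * c.bid_price, reverse=True)


candidates = [
    Candidate("ad_a", p_click=0.02, bid_price=1.00),  # score 0.020
    Candidate("ad_b", p_click=0.05, bid_price=0.30),  # score 0.015
    Candidate("ad_c", p_click=0.01, bid_price=3.00),  # score 0.030
]
ranked_ids = [c.ad_id for c in rank_ads(candidates)]
print(ranked_ids)  # ['ad_c', 'ad_a', 'ad_b']
```

Note that the ad with the highest click probability (ad_b) ranks last here, because the score also depends on the bid. This is why an accurate click-through-rate estimate matters for revenue.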

System architecture

(Figure: system architecture diagram, from Google Sibyl)

Machine Learning Approach

- Represent {Ad, user, scene} as a feature vector x, let y (clicked or not clicked) be the label, then model p(y|x)
- A common way, with y ∈ {-1, +1}:

  $$p(y \mid x, w) = \frac{1}{1 + \exp(-y \langle x, w \rangle)}$$

  then learn w by logistic regression
- Also increasing interest in deep learning
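A minimal sketch of logistic regression on sparse features, assuming labels y in {-1, +1} and plain stochastic gradient descent. The slide does not name an optimizer, so SGD is a choice made here for illustration. The feature names and the four training examples are made up.

```python
import math


def predict(w, x, y=1):
    """p(y | x, w) = 1 / (1 + exp(-y <x, w>)), with sparse x given as {feature: value}."""
    margin = y * sum(w.get(f, 0.0) * v for f, v in x.items())
    return 1.0 / (1.0 + math.exp(-margin))


def sgd_step(w, x, y, lr=0.1):
    """One SGD step on the loss log(1 + exp(-y <x, w>))."""
    # The gradient with respect to w is -y * (1 - p(y | x, w)) * x.
    g = -y * (1.0 - predict(w, x, y))
    for f, v in x.items():
        w[f] = w.get(f, 0.0) - lr * g * v


# Toy data: (sparse feature vector, label); +1 means clicked, -1 means not clicked.
data = [
    ({"kw:flower": 1.0, "user:u1": 1.0}, +1),
    ({"kw:flower": 1.0, "user:u2": 1.0}, -1),
    ({"kw:loan": 1.0, "user:u1": 1.0}, -1),
    ({"kw:flower": 1.0, "user:u1": 1.0}, +1),
]

w = {}  # only features that were actually seen get a weight
for _ in range(200):
    for x, y in data:
        sgd_step(w, x, y)

p_flower_u1 = predict(w, {"kw:flower": 1.0, "user:u1": 1.0})
p_loan_u1 = predict(w, {"kw:loan": 1.0, "user:u1": 1.0})
```

Storing w as a dictionary keeps memory proportional to the number of features seen, which matters when most features are inactive in any single example.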

Feature Engineering

- Feature engineering is the most effective way to improve model performance
  - still true even for deep learning
- An easy way to add domain knowledge to the model
- Features often come in multiple feature groups
  - three major sources: ads, users, advertisers

N-grams

- uni-gram: international, flower, delivery, ...
- bi-gram: international flower, flower delivery, ...
- tri-gram: international flower delivery, ...
- for short text, it is even desirable to generate all possible n-grams, then filter out the unimportant ones
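The generate-then-filter idea can be sketched as follows. The slide does not say how importance is measured, so filtering by a minimum count is an assumption made here. The helper names and the example queries are illustrative.

```python
from collections import Counter


def ngrams(tokens, n):
    """All contiguous n-grams of a token list, joined by spaces."""
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def all_ngrams(text, max_n=3):
    """Every n-gram of the text for n = 1 .. max_n."""
    tokens = text.lower().split()
    return [g for n in range(1, max_n + 1) for g in ngrams(tokens, n)]


def frequent_ngrams(queries, max_n=3, min_count=2):
    """Keep only n-grams seen at least min_count times (a simple stand-in filter)."""
    counts = Counter(g for q in queries for g in all_ngrams(q, max_n))
    return {g for g, c in counts.items() if c >= min_count}


features = all_ngrams("international flower delivery")
# ['international', 'flower', 'delivery', 'international flower',
#  'flower delivery', 'international flower delivery']

queries = ["international flower delivery", "flower delivery", "cheap flower delivery"]
kept = frequent_ngrams(queries)
# {'flower', 'delivery', 'flower delivery'}
```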

Style
Bold text

Layout
Images


Personalization

- User profile
  - gender, age, location, ...
- Advertiser profile
  - category, reputation, ...
- Session
  - a sequence of activities

Feature combination

- Given two feature groups
  - {(a, 1), (b, 0)}
  - {(A, 0), (B, 1)}
- Produce a combination group
  - and: {(aA, 0), (aB, 1), (bA, 0), (bB, 0)}
  - or: {(aA, 1), (aB, 1), (bA, 0), (bB, 1)}
- Approximates the polynomial kernel, but is much more efficient
- Guided by domain knowledge or heuristic search
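The and/or combination from this slide can be reproduced with a small helper. The function name `combine` and the dictionary representation of a feature group are choices made here for illustration.

```python
from itertools import product


def combine(group1, group2, op):
    """Cross two binary feature groups; each new feature is named by concatenation."""
    return {
        n1 + n2: op(v1, v2)
        for (n1, v1), (n2, v2) in product(group1.items(), group2.items())
    }


g1 = {"a": 1, "b": 0}
g2 = {"A": 0, "B": 1}

and_group = combine(g1, g2, lambda u, v: u & v)
or_group = combine(g1, g2, lambda u, v: u | v)

print(and_group)  # {'aA': 0, 'aB': 1, 'bA': 0, 'bB': 0}
print(or_group)   # {'aA': 1, 'aB': 1, 'bA': 0, 'bB': 1}
```

Crossing two groups of sizes m and n yields m * n features. This is why the combinations are chosen by domain knowledge or search rather than generated exhaustively.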

Data Scale of Ad-ctr

(Figure callout: 2 trillion ads in one year)

- One year of search logs alone produces 2 trillion examples
- Sub-sampling? It does not always work, because of personalization
- Feature size = #ngram + #users + #sessions + #combination
  - often at the same scale as #samples
- A training task some years ago

Industry Dataset Size

- 100 billion samples
- 10 billion features
- 1 TB to 1 PB of training data
- more than 5 years ago
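A quick back-of-the-envelope check of these numbers. Two things here are assumptions rather than statements from the slide: weights are stored as 4-byte floats, and the training data figure is read as a range from 1 TB to 1 PB.

```python
NUM_SAMPLES = 100 * 10**9   # 100 billion samples (from the slide)
NUM_FEATURES = 10 * 10**9   # 10 billion features (from the slide)
BYTES_PER_WEIGHT = 4        # assumption: one float32 weight per feature

# Memory needed just to hold one dense weight vector, in GB (10^9 bytes).
dense_model_gb = NUM_FEATURES * BYTES_PER_WEIGHT / 10**9

# Average bytes per sample at the two ends of the assumed 1 TB to 1 PB range.
bytes_per_sample_low = 10**12 // NUM_SAMPLES
bytes_per_sample_high = 10**15 // NUM_SAMPLES

print(dense_model_gb)          # 40.0
print(bytes_per_sample_low)    # 10
print(bytes_per_sample_high)   # 10000
```

Under these assumptions, the weight vector alone takes tens of gigabytes and the data is far larger than a single disk, which motivates the storage discussion that follows.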

Where to store the data

- Lots of disks
- They can fail at any time

Access patterns

- Files are large: 100MB to 10GB
- Sequential read and append

Google File System

- Data are replicated
  - a write succeeds only if all replicas are done
- Requesting data:
  - ask the master for the location
  - ask the chunk server for the data
- New generation: Colossus
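The two rules on this slide (a write succeeds only if every replica succeeds; a read first asks the master for the location, then a chunk server for the data) can be illustrated with a toy in-memory model. This is not the GFS API. All class and function names are hypothetical, and details such as leases, chunk versions, and re-replication are left out.

```python
class ChunkServer:
    """Toy chunk server: stores chunk data and can be marked as failed."""

    def __init__(self, name):
        self.name = name
        self.alive = True
        self.chunks = {}

    def write(self, chunk_id, data):
        if not self.alive:
            return False
        self.chunks[chunk_id] = data
        return True

    def read(self, chunk_id):
        return self.chunks[chunk_id]


class Master:
    """Toy master: holds only metadata (which servers store which chunk)."""

    def __init__(self, servers, replication=3):
        self.servers = servers
        self.replication = replication
        self.locations = {}

    def pick_replicas(self):
        # Simplification: always the first `replication` servers, alive or not.
        return self.servers[:self.replication]

    def locate(self, chunk_id):
        return self.locations[chunk_id]


def write_chunk(master, chunk_id, data):
    """The write is reported as successful only if all replicas accepted it."""
    replicas = master.pick_replicas()
    ok = all(server.write(chunk_id, data) for server in replicas)
    if ok:
        master.locations[chunk_id] = replicas
    return ok


def read_chunk(master, chunk_id):
    """Ask the master where the chunk lives, then read from a live chunk server."""
    for server in master.locate(chunk_id):
        if server.alive:
            return server.read(chunk_id)
    raise IOError("no live replica for " + chunk_id)


servers = [ChunkServer(f"cs{i}") for i in range(3)]
master = Master(servers)

first_ok = write_chunk(master, "chunk-0", b"training data")  # all replicas up
servers[0].alive = False                                      # one disk fails
data = read_chunk(master, "chunk-0")                          # served by cs1
second_ok = write_chunk(master, "chunk-1", b"more data")      # cs0 is down
```

In this sketch the read survives the failure because two replicas remain, while the second write is rejected because one replica could not be written.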

HDFS

- Open source implementation of GFS
- Operations:
  - hadoop fs -ls, -get, -put, -head, -cat, ...
  - libhdfs: C API
  - mount to the local filesystem
- A little bit slower than GFS (personal experience)
- Large delay
  - hadoop fs -ls /xxx (8000 files)
- Sometimes reading the training data takes more time than training
