Lecture 2 Access Patterns in Big Data

The document discusses the changing access patterns in big data, particularly focusing on media systems and their distribution models. It argues that traditional Zipf distributions do not accurately represent media access patterns, which instead follow a stretched exponential distribution. The findings suggest that scalable distributed systems are necessary for effective media content delivery, as media access patterns exhibit long lifespans and require significant storage capacity.

The Weakening and Delayed Effects of Long Tail Distributions in Big Data Accesses

Xiaodong Zhang
Ohio State University

Big Data and Power Law

[Figure: number of hits to each data object (vertical axis) versus popularity rank of each data object (horizontal axis)]

To the right (the yellow region) is the long tail of the lower 80% of objects; to the left are the few that dominate (the top 20% of objects). With limited space to store objects and limited ability to search a large volume of objects, most attention and hits go to the top 20% of objects, and the long tail is ignored.
The Change of Time (short search latency) and Space (unlimited storage capacity) for Big Data Creates Different Data Access Distributions

Traditional long tail distribution

Flattened distribution after the long tail can be easily accessed

• The head is lowered and the tail drops more and more slowly
• If the flattened distribution is no longer a power law, what is it?
Distribution Changes in DVDs in Netflix 2000 to 2011

[Figure label: 2011 (predicted)]

• The growth of Netflix selections (today: 30 million US users, 40 million users total, 1/3 of the streaming traffic of the Internet)
– 2000: 4,500 DVDs, 2005: 18,000 DVDs
– 2011: over 100,000 DVDs (the long tail drops even more slowly as demand grows)
– Note: “bricks and mortar retailers” are physical shops that sell face to face.
Amazon Case: Growth of Sales from the Changes of Time/Space
We Must Find the New Distribution for Big Data Accesses

• The Internet stores all kinds of huge big data sets
– The rapid growth and wide distribution of Internet media content is a representative case study of big data
– The media content is carried by scalable distributed systems
• We hope the distribution model we develop is
– General purpose for other applications of big data
– Scalable in nature for both data and systems
Zipf distribution is believed to be the general model of data access patterns
• Zipf distribution (power law): $y_i \propto i^{-\alpha}$, $\alpha$: 0.6~0.8
– $i$: rank of objects
– $y_i$: number of references
– Characterizes the property of scale invariance
– Heavy tailed, scale free
– In log-log scale ($\log y$ versus $\log i$): a straight line with slope $-\alpha$ and a heavy tail
• 80-20 rule
– Income distribution: 80% of social wealth is owned by 20% of people (Pareto law)
– Web traffic: 80% of Web requests access 20% of pages (Breslau, INFOCOM’99)
• System implications
– Objectively caching the working set in a proxy
– Significantly reduces network traffic
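The power-law rank model on this slide can be explored numerically. The following is a minimal sketch, not part of the original lecture: the function names, the number of objects, and the count of the top object are hypothetical, and only the exponent range (0.6 to 0.8) comes from the slide.

```python
import math


def zipf_counts(n_objects, alpha, top_count):
    """Reference counts y_i = top_count * i**(-alpha) for ranks i = 1..n_objects."""
    return [top_count * i ** (-alpha) for i in range(1, n_objects + 1)]


def top_share(counts, fraction):
    """Share of all references received by the top `fraction` of ranked objects."""
    k = max(1, int(len(counts) * fraction))
    return sum(counts[:k]) / sum(counts)


# Hypothetical workload: 10,000 objects, alpha at the upper end of the slide's range.
counts = zipf_counts(n_objects=10_000, alpha=0.8, top_count=1_000.0)
share = top_share(counts, 0.20)

# In log-log scale the slope between any two ranks is -alpha.
loglog_slope = math.log(counts[1] / counts[0]) / math.log(2)

print(f"top 20% of objects receive {share:.0%} of references")
print(f"log-log slope: {loglog_slope:.2f}")
```

How much of the traffic the top 20% captures depends on both the exponent and the number of objects, so the 80-20 split is a rule of thumb rather than a property of every Zipf curve.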
Does Internet media traffic follow Zipf’s law?

Web media systems
– Chesire, USITS’01: Zipf-like
– Cherkasova, NOSSDAV’02: non-Zipf

VoD media systems
– Acharya, MMCN’00: non-Zipf
– Yu, EUROSYS’06: Zipf-like

P2P media systems
– Gummadi, SOSP’03: non-Zipf
– Iamnitchi, INFOCOM’04: Zipf-like

Live streaming and IPTV systems
– Veloso, IMW’02: Zipf-like
– Sripanidkulchai, IMC’04: non-Zipf
Inconsistent media access pattern models
• Still based on the Zipf model (heuristic assumptions)
– Zipf with exponential cutoff
– Zipf-Mandelbrot distribution
– Generalized Zipf-like distribution
– Two-mode Zipf distribution
– Fetch-at-most-once effect
– Parabolic fractal distribution
– …
• All case studies
– Based on one or two workloads
– Different from or even in conflict with each other
• An insightful understanding is essential to
– Content delivery system design
– Internet resource provisioning
– Performance optimization
Research Objectives

• Find a general distribution model of Internet media access patterns as a case for big data
– Comprehensive measurements and experiments
– Rigorous mathematical analysis and modeling
– Insights into media system designs
Outline
• Motivation and objectives

• Stretched exponential model of Internet media traffic

• Dynamics of access patterns in media systems

• Caching implications and storage requirements

• Summary

• Other newly reported SE distributions in real world

Workload summary

• 16 workloads in different media systems (nearly all workloads available on the Internet)
– Web, VoD, P2P, and live streaming
– Both client side and server side
• Different delivery techniques (all major delivery techniques)
– Downloading, streaming, pseudo streaming
– Overlay multicast, P2P exchange, P2P swarming
• Data set characteristics (data sets of different scales)
– Workload duration: 5 days - two years
– Number of users: $10^3$ - $10^5$
– Number of requests: $10^4$ - $10^8$
– Number of objects: $10^2$ - $10^6$
Stretched exponential distribution
• Media reference rank follows stretched exponential distribution (passed Chi-square test)

Probability distribution: Weibull
$P(X \le x) = 1 - \exp\left[-\left(\frac{x}{x_0}\right)^c\right]$
$c$: stretch factor

Rank distribution:
• fat head and thin tail in log-log scale ($\log y$ versus $\log i$)
• straight line in $\log x$ - $y^c$ scale, with slope $-a$ and intercept $b$

$i$: rank of media objects ($N$ objects)
$y$: number of references
$P(y \ge y_i) = \frac{i}{N}$
$y_i^c = -a \log i + b \quad (1 \le i \le N,\ a = x_0^c)$
$b = 1 + a \log N$ (assuming $y_N = 1$)
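The rank equation above can be turned into a small generator of stretched exponential reference counts. This is an illustrative sketch only: the function name and the values of `a` and `n_objects` are hypothetical, while `c = 0.22` is borrowed from one of the lecture's plots.

```python
import math


def se_rank_counts(n_objects, c, a):
    """Counts from y_i^c = -a*log(i) + b, with b = 1 + a*log(N) so that y_N = 1."""
    b = 1.0 + a * math.log(n_objects)
    return [(b - a * math.log(i)) ** (1.0 / c) for i in range(1, n_objects + 1)]


c = 0.22   # stretch factor (value shown on a later slide)
a = 0.5    # hypothetical slope parameter
y = se_rank_counts(n_objects=1000, c=c, a=a)

# In the (log i, y^c) plane the points lie on a straight line of slope -a.
slope = (y[9] ** c - y[0] ** c) / (math.log(10) - math.log(1))

print(f"y_1 = {y[0]:.1f}, y_N = {y[-1]:.1f}, slope in SE scale = {slope:.3f}")
```

Plotting `y` against rank on log-log axes would show the fat head and thin tail described on the slide, while plotting `y**c` against `log i` gives the straight line.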
Evidence: Web media systems (server logs)

[Figure: three panels, *HPC-98 (14 MB), *HPLabs-99 (120 MB), ST-SVR-01 (15 MB); horizontal axis in log scale; vertical axes in log scale and in powered scale $y^c$; annotations: fat head, thin tail, c = 0.22, R2 ~ 1]

x: rank of media object, y: number of references to the object. Title: workload name (median file size)
Each panel shows the data in stretched exponential scale and the data in log-log scale
R2: coefficient of determination (1 means a perfect fit)
HPC-98: enterprise streaming media server logs of HP corporation (29 months)
HPLabs: logs of video streaming server for employees in HP Labs (21 months)
ST-SVR-01: an enterprise streaming media server log workload like HPC-98 (4 months)
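The slides report a stretch factor c and an R2 close to 1 for each workload but do not say how c was obtained. One plausible procedure, sketched here purely as an assumption, is to scan candidate values of c and keep the one for which y^c versus log i is closest to a straight line. The function names and the synthetic data are hypothetical.

```python
import math


def r_squared(xs, ys):
    """Coefficient of determination of the least-squares line through (xs, ys)."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    return (sxy * sxy) / (sxx * syy)


def fit_stretch_factor(counts, candidates):
    """Return (c, R2) for the candidate c that best linearizes y^c against log(rank)."""
    log_rank = [math.log(i) for i in range(1, len(counts) + 1)]
    best_c, best_r2 = None, -1.0
    for c in candidates:
        r2 = r_squared(log_rank, [y ** c for y in counts])
        if r2 > best_r2:
            best_c, best_r2 = c, r2
    return best_c, best_r2


# Synthetic counts generated from the SE rank equation with a known stretch factor.
true_c, a, n = 0.3, 2.0, 500
synthetic = [(1.0 + a * (math.log(n) - math.log(i))) ** (1.0 / true_c)
             for i in range(1, n + 1)]
best_c, best_r2 = fit_stretch_factor(synthetic, [k / 100 for k in range(5, 101, 5)])
print(f"recovered c = {best_c}, R2 = {best_r2:.6f}")
```

On real traces a finer grid or a proper optimizer would be needed, and the goodness of fit should be checked separately (the lecture mentions a Chi-square test).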
Evidence: Web media systems (req packets)

[Figure: three panels, PS-CLT-04 (1.5 MB), ST-CLT-04 (2 MB), ST-CLT-05 (4.5 MB); horizontal axis in log scale; vertical axes in log scale and in powered scale $y^c$; annotations: fat head, thin tail]

All collected from a large cable network hosted by a well-known ISP

PS-CLT-04: first IP packets of HTTP requests for media objects (downloading and pseudo streaming), 9 days
ST-CLT-04: RTSP/MMS streaming requests (on-demand media), 9 days
ST-CLT-05: RTSP/MMS streaming requests (on-demand media), 11 days
Evidence: VoD media systems

[Figure: four panels, *mMoD-98 (125 MB), *CTVoD-04 (300 MB), IFILM-06 (2.25 MB), YouTube-06 (3.4 MB); horizontal axis in log scale; vertical axes in log scale and in powered scale $y^c$; annotations: fat head, thin tail]

• mMoD-98: logs of a multicast Media-on-Demand video server, 194 days
• CTVoD-04: streaming server logs of a large VoD system by China Telecom, 219 days, reported as Zipf in EUROSYS’06
• IFILM-06: number of web page clicks to video clips in IFILM site, 16 weeks (one week for the figure)
• YouTube-06: cumulative number of requests to YouTube video clips, by crawling on web pages publishing the data
Evidence: P2P media systems

[Figure: three panels, *KaZaa-02 (300 MB), *KaZaa-03 (5 MB), BT-03 (636 MB)]

KaZaa-02: large video files (> 100 MB; files smaller than 100 MB are intentionally removed) transferred in the KaZaa network, collected in a campus network, 203 days.

KaZaa-03: music files, movie clips, and movie files downloaded in the KaZaa network, 5 days, reported as Zipf in INFOCOM’04.

BT-03: 48 days of BitTorrent file downloading (large video and DVD images) recorded by two tracker sites
Evidence: Live streaming and other systems

[Figure: three panels, Akamai-03, Movie-02, IMDB-06]

Akamai-03: server logs of live streaming media collected from the Akamai CDN, 3 months, reported as two-mode Zipf in IMC’04

Movie-02: US movie box office ticket sales of year 2002.

IMDB-06: cumulative number of votes for top 250 movies in Internet Movie Database web site
Why was Zipf observed in the past?

[Figure: ad server, cache proxy, media server]

• Media traffic is driven by user requests
• Intermediate systems may affect traffic pattern
– Effect of extraneous traffic
– Filtering effect due to caching
• Biased measurements may cause Zipf observation
Extraneous media traffic

[Figure: ads server, web server, and media server; a meta file links an ads clip, flag clips, video prog 1, and video prog 2 into the streaming video program]

Ad and flag video clips are pushed to clients mandatorily.
Effects of extraneous traffic on reference rank distributions
• Do not represent user access patterns
– High request rate (high popularity)
– High total number of requests
• Not necessarily Zipf with extraneous traffic
– Extraneous traffic changes
– Always SE without extraneous traffic
• Small object sizes, small traffic volume

[Figure: reference rates of prog, ads, and flag objects in 2004 and 2005; 2004: 2 objects, 2005: merged into 1 object]

Without extraneous traffic: SE (2004), SE (2005)
With extraneous traffic: Zipf (2004), Non-Zipf (2005)
Fundamental Differences between Zipf and SE

• “Rich-get-richer” phenomenon
– Pareto, power law, …
– The structure of WWW
• Web accesses are Zipf
– Popular pages can attract more users
– Pages update to keep popular
– Zipf-like for long duration
• Media accesses are big data based
– Popularity decreases with time exponentially
– Media objects are permanently stored
– Rich-get-richer not present
– Non-Zipf in long duration

[Figure: BitTorrent media file; CCDF of requests (log scale) versus time after object birth (day); raw data and linear fit]

[Figure: number of distinct weekly top N popular objects in 16 weeks, Web versus Video; number of distinct objects versus popularity rank]

Top 1 Web object never changes
Top 1 video object changes every week
Dynamics of Access Patterns in Media Systems

• Media reference rank distribution in log-log scale
– Different systems have different access patterns
– The distribution changes over time in a system (NOSSDAV’02)
• All follow stretched exponential distribution
– Stretch factor c
– Parameter a (the negative of the slope)
• Physical meanings
– Media file sizes
– Aging effects of media objects
– Deviation from the Zipf model

[Figure: $y^c$ versus $\log i$, a straight line with intercept b and slope -a; c: stretch factor]
Stretched exponential parameters
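• In a media system
– Constant request rate
– Constant object birth rate
– Constant median file size
• Stretch factor c is a time-invariant constant
• Parameter a increases with time

[Figure: $y^c$ versus $\log i$, a straight line with intercept b and slope -a]

[Equation: a expressed in terms of the request rate $\lambda_{req}$, the object birth rate $\lambda_{obj}$, $N(t)$, $t$, and $\Gamma(1 + \frac{1}{c})$; the recoverable limiting form for large $t$ is $a \approx \left[\frac{\lambda_{req}}{\lambda_{obj}\,\Gamma(1 + \frac{1}{c})}\right]^c$]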
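The long-run value of the parameter a can be estimated from the two rates. The sketch below is an assumption-laden illustration, not the lecture's derivation: it uses the standard fact that a Weibull distribution with scale x0 and shape c has mean x0·Γ(1 + 1/c), the slide's definition a = x0^c, and the assumption that the mean number of references per object approaches the request rate divided by the object birth rate. The function name and the example rates are hypothetical.

```python
import math


def se_slope_parameter(req_rate, obj_birth_rate, c):
    """Long-run estimate of a = x0**c under constant request and object birth rates."""
    # Assumption: mean references per object tends to req_rate / obj_birth_rate.
    mean_refs = req_rate / obj_birth_rate
    # Weibull mean = x0 * Gamma(1 + 1/c), solved for the scale x0.
    x0 = mean_refs / math.gamma(1.0 + 1.0 / c)
    return x0 ** c


# Hypothetical system: 10,000 requests and 100 new objects per day.
for c in (0.2, 0.5, 1.0):
    a = se_slope_parameter(req_rate=10_000, obj_birth_rate=100, c=c)
    print(f"c = {c}: a is about {a:.3f}")
```

With c = 1 the Weibull reduces to an ordinary exponential and a equals the mean number of references per object, which is a convenient sanity check.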
Huge capacity for long life-span accesses

[Figure annotations: 50% at 200 days; 50% at 150 days]

• Media objects have long lifespans
– Most requested objects were created a long time ago
– Most requests are for objects created a long time ago
• To achieve maximal concentration
– Very long time (months to years)
– Huge amount of storage
– Only large and scalable systems provide such a huge space over such a long time
Summary
• Media access patterns do not fit the Zipf model, and neither do big data accesses
• We give reasons why previous distributions were confusing
• Media access patterns are stretched exponential
• Our findings imply that
– Client-server proxy systems are not effective for delivering media contents
– Scalable distributed systems are suitable for this purpose
– The storage system in cloud systems must be very scalable
• We believe the stretched exponential model is sufficiently general for big data accesses.
Two Distribution Models are Based on Different Storage Requirements
• Accesses to big data (e.g. Internet media) follow stretched exponential distribution

The SE curve implies a wide-range access distribution over a long period of time.

Rank distribution:
• fat head and thin tail in log-log scale
• weak locality needs a huge and distributed storage

[Figure: SE rank distribution, $\log y$ versus $\log i$, with fat head and thin tail]

The sharp Zipf slope implies a concentrated access distribution on a small number of objects
• strong locality makes proxy cache very effective

[Figure: Zipf rank distribution, $\log y$ versus $\log i$, a straight line with slope -a]
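The link between locality and storage can be made concrete by asking how many of the most popular objects a cache must hold to serve a given share of requests. The helper below is a hypothetical sketch; the two count lists are made-up toy data chosen only to contrast a concentrated workload with a flat one.

```python
def objects_needed(counts, target_hit_ratio):
    """Smallest number of top-ranked objects whose references reach the target share.

    `counts` must be sorted by popularity rank (most referenced first).
    """
    total = sum(counts)
    covered = 0.0
    for k, y in enumerate(counts, start=1):
        covered += y
        if covered >= target_hit_ratio * total:
            return k
    return len(counts)


# Toy reference counts for 10 objects, 100 references in total (illustrative only).
concentrated = [50, 20, 10, 5, 5, 4, 3, 1, 1, 1]   # strong locality
flat = [10] * 10                                    # weak locality

print(objects_needed(concentrated, 0.75))  # 3 objects reach 75% of references
print(objects_needed(flat, 0.75))          # 8 objects are needed for the same share
```

The same function can be applied to counts generated from a Zipf or stretched exponential rank model, or to a measured trace, to compare the storage each needs for a target hit ratio.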
Other Reported Data Access Distributions Fitting SE (after PODC’08)
Internet Video/audio services
– IPTV, user channel selection distribution (SIGMETRICS’09)
– PPLive, P2P streaming request distribution (ICDCS’09)
– Access distribution in PPStream is converting from Zipf (2007) to stretched exponential (2009) (a report from Nanjing Statistical Institute)
– USTC-VOD, Shanghai Jiading TVOD: program request distributions (China National College Statistical Modeling Competition Outstanding award project, 09)
– User listening behavior of Bugs Music (http://www.bugs.co.kr) in Korea, 72K users, 400K songs, 15M log records (ICIS’10)
– BitTorrent Video-on-demand accesses (NSM 2010)
– VUCLIP (video service to mobile devices): access distributions in servers (INFOCOM’12)
– News-on-demand services from 6 Spanish newspapers (IJMA’12)
– Viewer access patterns to a large TV-on-demand system in Sweden (IMC’12)
– Mobile viewer access patterns to PPTV VoD system in China (IMC’12)
Social networks
– Wikipedia, Yahoo answers, social network posting distribution (KDD’09)
– digg.com (a discussion social network), comment distribution (ICMD’10)
– ireport.com (CNN discussion social network), comment distribution (ICMD’10)
– sina, tudou, i-baidu (social networks in China), access patterns (ICDCN’10)
– 20minutes.fr (a French news social network), access patterns (U. Paris, 12)
– YouKu (largest user generated video site in China), subscription and access patterns (ICCCN’12, ICPP’12)
– Yahoo HK blog, posting from SE to power law after being spammed (ICDCS’12)
– Facebook Photo Serving Stack (in the backend Haystack storage), access patterns show SE distributions (SOSP’13)

Bioinformatics
– Protein abundances (density vs structure space, Proteome Science, 2013)

General data accesses via Internet
– Web access patterns in American University of Nigeria in Africa (AMCIS’09)
– Access patterns to AmazingStore in China (http://www.amazingstore.org), a P2P storage for college students to access terabytes of files (TPDS, 2011)
– FS2You (online storage system in China), file request distribution (INFOCOM’09)
Global Research Collaborations Touch the Long Tail
(1998-2013)

Comparing distribution patterns between 1998 and 2013

First exercise:
– Finding the collaboration data from the link
• Writing a note on the data availability
– Making the figure US-centric (the US itself should not be in the figure)
– Plotting the number of collaborations (vertical bar), sorted by the countries collaborating with the US, and comparing the shape of the two figures
– Taking the logarithm of the data on both the vertical and horizontal axes, and comparing the slopes.
– Writing notes on your observations.
Another view of Big Data Access Patterns: Gini Coefficient

G = A/(A+B)
Area A shrinks:
• Accesses to big data are less concentrated
• The gap between rich and poor narrows
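The ratio G = A/(A+B) can be computed directly from per-object access counts. The sketch below uses the standard discrete formula for the Gini coefficient over sorted values, which corresponds to the area between the line of equality and the Lorenz curve. The function name and the sample inputs are hypothetical.

```python
def gini(counts):
    """Gini coefficient of non-negative counts: 0 means equal, near 1 means concentrated."""
    xs = sorted(counts)
    n = len(xs)
    total = sum(xs)
    # G = sum_i (2i - n - 1) * x_i / (n * total), with i = 1..n over ascending values.
    weighted = sum((2 * i - n - 1) * x for i, x in enumerate(xs, start=1))
    return weighted / (n * total)


print(gini([1, 1, 1, 1]))   # 0.0: every object is accessed equally
print(gini([0, 0, 0, 1]))   # 0.75: all accesses go to a single object
```

A flatter access distribution (a lowered head and a slowly dropping tail) yields a smaller G, which is the shrinking of area A described above.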
