ECS640U/ECS765P
Big Data Processing
Introduction to Big Data
Lecturer: Joseph Doyle
School of Electronic Engineering and Computer Science
Module Organization
Lectures: Monday 13:00 – 15:00
Labs (check timetables, subject to change):
● Monday 9:00 – 11:00
● Monday 15:00 – 17:00
Lecturer
● Joseph Doyle
Demonstrators
● Jianshu Qiao
Module Assessment
Exam 50%
● 4 Questions, short answers, practical exercises
Coursework: 50%
● Lab quizzes: 20%
1 quiz covering all the labs. Released in week 2 with a deadline in week 11.
● Individual Coursework: 30%
Deadline: End of week 10 (2nd December)
Contents
● What is Big Data?
● Parallelism
● The Big Data Pipeline
● The Word-Count Problem
What is Big Data?
Many different definitions
● “Big data is a term used to refer to data sets that are too large or complex for traditional data processing
application software to adequately deal with” (Wikipedia)
● “Big data is high-volume, high-velocity and/or high-variety information assets that demand cost-effective, innovative forms of information processing that enable enhanced insight, decision making and process automation.” (Gartner)
Example: Big Data at Netflix
Big Data at Netflix
● 100 million users
● 125+ million hours of video watched each day
● 4000 different devices
● 700+ billion events a day
● 60 petabytes of data
More information at Netflix Technology Blog https://medium.com/@NetflixTechBlog
Notable Open Datasets
● Kaggle – over 14,000 datasets
● Microsoft Academic Graph
● Data.gov – U.S. Government’s open data
● Pushshift.io – full Reddit dataset
● Common Crawl – 7 years of web page data
● YouTube-8M Dataset – a large-scale labelled video dataset consisting of millions of YouTube videos
● Data4Good.io – over 1TB of compressed network data
Big Data Parallel Computing
● Datasets so large that it is impractical to analyse them on one computer
● Thus, multiple computers are needed to analyse the data so that it can be examined in a reasonable time
● Need to consider the problems and challenges of parallel computing to make the most of the hardware
Contents
● What is Big Data?
● Parallelism
● The Big Data Pipeline
● The Word-Count Problem
Parallel Computing
● The use of a number of processors, working together, to perform a calculation or solve a problem
● The calculation is divided into tasks, which are sent to different processors
● Processors can be different cores in the same machine, and/or different machines linked by a network
● Processes need to be coordinated
Sequential Program Execution
● Basic model based on the Von Neumann architecture from the 1940s
● One instruction is fetched, decoded and executed at a time
● As processor speed increases, more instructions can be executed in the same time
Single Processor Limitations
● We have roughly reached the practical limitations in the amount of computing power a single processor
can have
● We cannot make processors much faster, or much bigger
● According to Moore’s law the number of transistors per chip will continue to increase
● More transistors are great, but if the clock speed goes much beyond 4 GHz the chips tend to overheat
● This means we now have chips containing an increasing number of processor cores
● Multicore Chips
Machine Cluster
Programming a cluster of machines
© Square-Enix, Platinum Games
Parallel Computing Challenges
There are several challenges to efficient parallel computing:
● Many algorithms are hard to divide into subtasks (or cannot be divided at all)
● The subtasks might use results from each other, so coordinating the different tasks might be difficult
Some problem areas are much easier than others to parallelise
Easy parallelization example
Image processing: sharpening an image
● We have a big photograph
● We divide it into tiles (square patches), and sharpen each tile
● Then we adjust the pixels at the edges so that they match up
This works because the edge pixels are a small fraction of the total, and adjusting them does not affect the other pixels
● Easy to divide up the work! (a minimal sketch follows below)
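A minimal sketch of the tiling idea in Python, assuming the image is held as a NumPy array and using a placeholder sharpen function (both are assumptions for illustration, not part of the original slides):

import numpy as np

def sharpen(tile):
    # Placeholder for a real sharpening kernel (hypothetical).
    return tile

def sharpen_in_tiles(image, tile_size=256):
    # Process the image as independent square tiles; each tile could be
    # handed to a different processor, since tiles only interact at their edges.
    result = np.empty_like(image)
    h, w = image.shape[:2]
    for y in range(0, h, tile_size):
        for x in range(0, w, tile_size):
            tile = image[y:y + tile_size, x:x + tile_size]
            result[y:y + tile_size, x:x + tile_size] = sharpen(tile)
    return result

In a parallel version, the two nested loops would be replaced by distributing the tiles across workers and then adjusting the tile borders at the end.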
Not so easy parallelisation
Path search: Get best route from A to B
● How to divide the task into smaller ones? Find intermediate points
● How to select these intermediate points?
● The processes need to communicate their intermediate results to each other
● How can we guarantee that the solution is the best one?
Examples of parallelism: Simulation
We often need to simulate physical events:
● What happens when lightning strikes a plane?
● This is important with modern planes, because they are made of plastic, and lightning goes through
them (whereas it goes round metal planes)
● Planes are too expensive to conduct a large number of tests with them, but we have to know that they
are safe to fly
● Simulation is a necessity: but it is computationally difficult
There are a lot of similar tasks (nuclear safety, for example) where direct testing is impossible or difficult
Examples of parallelism: Weather Prediction
Examples of parallelism: Prediction
● We would like to predict real-world events (such as the weather)
● This often involves very long and data intensive calculations, just like simulation does
● Challenging both for algorithm design and implementation
Examples of parallelism: Data Analysis
● Marketers analyse large amounts of data from Twitter in order to find out how their products are doing
● Google analyses lots of web pages in order to support Google Search
● The LHC produces huge amounts of data, which need analysis
● Biologists read large quantities of DNA, and they have to work out what it means
● All these applications are computationally demanding, and they also use large amounts of data
Examples of parallelism: Data Analysis (Netflix)
● Take Netflix and its approach to applying data analysis for maximizing the effectiveness of their video
catalogue
● Traditionally, there is a limited subset of movie genres (let’s say 100)
● Netflix manages more than 70,000 genres, and combines them with collected data about the millions of
users that actively consume content through the service
● With that information, they can provide recommendations that are considerably more effective than the previous standard
Contents
● What is Big Data?
● Parallelism
● The Big Data Pipeline
● The Word-Count Problem
The Big Data Pipeline
The Big Data Pipeline consists of a sequence of 3V (high-volume, high-velocity, high-variety) data actions:
• Data sources are the input to the pipeline.
• The output can be another big data pipeline, small data analysis, query system, visualisation or an
action.
(Pipeline diagram: Data Sources → Ingestion → Storage → Processing → Output)
Data Sources
A data source is any mechanism that produces data. Examples of data sources include:
● Mobile and web apps
● Websites
● IoT devices
● Databases
● Output of other big data pipelines
Ingestion
The ingestion stage gathers data from a variety of data sources and makes it available for further
stages.
• Ingested data can be stored or directly processed
• Data is moved from the original sources across a network, hence communication protocols such as HTTP, MQTT or FTP can be used during ingestion
• Data ingestion can include transformation operations on raw data (a rough sketch follows below)
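As a rough, hedged sketch of an HTTP-based ingestion step (the URL and the line-oriented format are purely illustrative assumptions), using only the Python standard library:

from urllib.request import urlopen

def ingest(url):
    # Pull raw data from a source over HTTP (one of the protocols above)
    # and apply a simple transformation before storage or processing.
    with urlopen(url) as response:
        raw = response.read().decode("utf-8")
    # Example transformation: drop blank lines and surrounding whitespace.
    records = [line.strip() for line in raw.splitlines() if line.strip()]
    return records

# Hypothetical source, for illustration only:
# records = ingest("http://example.com/events.log")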
Storage
After ingestion, data are usually stored ready for future processing.
• Big data storage needs solutions that operate with high volumes of data, hence distributed storage will be used
• Big data storage also requires flexibility and fast retrieval of high volumes of data, for which NoSQL solutions are appropriate
Processing
Processing runs algorithms on big data.
• Data can come directly from the ingestion stage or from storage
• The results of the processing stage can be stored back in the storage system or be used as the output of the pipeline
• Can be batch, stream or graph
Weeks 2-4: Apache Hadoop
During weeks 2 to 4, we will cover Apache Hadoop, our first Big Data solution, which consists of:
• Processing capabilities (based on MapReduce)
• Storage system (HDFS, Hadoop Distributed File System)
Week 5: Ingestion and Storage
Week 5 focuses on Big Data solutions for
• Ingestion. Specific ingestion solutions for Apache Hadoop will be presented
• Storage technologies, including distributed file systems and NoSQL data stores
Different types of data sources will also be discussed.
Weeks 6-11: Processing
The remainder of the module focuses on the Big Data processing stage, namely
• Weeks 6 and 7: Apache Spark
• Weeks 8 and 9: Stream processing
• Weeks 10 and 11: Graph processing
Contents
● What is Big Data?
● Parallelism
● The Big Data Pipeline
● The Word-Count Problem
Our first parallel program
● Task: count the number of occurrences of each word in one document
● Input: text document
● Output: sequence of: word, count
The 56
School 23
Queen 10
● Collection Stage: Not applicable in this case
● Ingestion Stage: Move the file to a data lake with an applicable protocol, e.g. HTTP/FTP
● Preparation Stage: Remove characters which might confuse the algorithm, e.g. quotation marks (a minimal sketch follows below)
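A minimal sketch of such a preparation step in Python (the exact set of characters to strip is an assumption for illustration):

import re

def prepare(text):
    # Preparation stage: replace any character that is not a letter,
    # digit or whitespace (e.g. quotation marks) with a space.
    return re.sub(r"[^\w\s]", " ", text)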
Program Input
QMUL has been ranked 9th among multi-faculty institutions in the UK, according to tables published today in the Times Higher
Education.
A total of 154 institutions were submitted for the exercise.
The 2008 RAE confirmed Queen Mary to be one of the rising stars of the UK research environment and the REF 2014 shows that this
upward trajectory has been maintained.
Professor Simon Gaskell, President and Principal of Queen Mary, said: “This is an outstanding result for Queen Mary. We have built
upon the progress that was evidenced by the last assessment exercise and have now clearly cemented our position as one of the UK’s
foremost research-led universities. This achievement is derived from the talent and hard work of our academic staff in all disciplines,
and the colleagues who support them.”
The Research Excellence Framework (REF) is the system for assessing the quality of research in UK higher education institutions.
Universities submit their work across 36 panels of assessment. Research is judged according to quality of output (65 per cent),
environment (15 per cent) and, for the first time, the impact of research (20 per cent).
How to solve the problem?
How to solve the problem on a single processor?
# input: text, a string containing the complete text
words = text.split()
count = dict()
for word in words:
    if word in count:
        count[word] = count[word] + 1
    else:
        count[word] = 1
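For reference, the same sequential count can be written more compactly with the Python standard library (an alternative to the loop above, not part of the original slide):

from collections import Counter

# count maps each word to its number of occurrences, as above
count = Counter(text.split())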
Parallelising the problem
Splitting the load into subtasks:
● Split sentences/lines into words
● Count all the occurrences of each word
…What do we do with the intermediate results?
● Merge into single collection
● Possibly requires parallelism too
● One model for executing this is MapReduce, which is the subject of next week's lecture (a rough sketch of the split-and-merge idea on a single machine follows below)
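As a preview of this split-then-merge structure, here is a hedged sketch of a parallel word count using Python's multiprocessing module. The chunking strategy and worker count are arbitrary choices, and this is the same idea on one machine rather than the MapReduce framework itself:

from collections import Counter
from functools import reduce
from multiprocessing import Pool

def count_chunk(lines):
    # Count word occurrences within one chunk of lines (the per-task step).
    return Counter(word for line in lines for word in line.split())

def merge(c1, c2):
    # Merge two partial counts into one (the combining step).
    return c1 + c2

def parallel_word_count(text, workers=4):
    lines = text.splitlines()
    # Split the input into chunks of roughly equal size.
    size = max(1, len(lines) // workers)
    chunks = [lines[i:i + size] for i in range(0, len(lines), size)]
    with Pool(workers) as pool:
        partial_counts = pool.map(count_chunk, chunks)
    return reduce(merge, partial_counts, Counter())

# Note: when run as a script, the call to parallel_word_count should sit
# under an `if __name__ == "__main__":` guard so worker processes start cleanly.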