ECS640U/ECS765P
Big Data Processing
Introduction to Big Data
Lecturer: Joseph Doyle
School of Electronic Engineering and Computer Science
Module Organization
Lectures: Monday 13:00 – 15:00
Labs (check timetables, subject to change):
● Monday 9:00 – 11:00
● Monday 15:00 – 17:00
Lecturer
● Joseph Doyle
Demonstrators
● Jianshu Qiao
Module Assessment
Exam 50%
● 4 Questions, short answers, practical exercises
Coursework: 50%
● Lab quizzes: 20%
1 quiz covering all the labs. Released in week 2 with a deadline in week 11.
● Individual Coursework: 30%
Deadline: End of week 10 (2nd December)
Contents
● What is Big Data?
● Parallelism
● The Big Data Pipeline
● The Word-Count Problem
What is Big Data?
Many different definitions
● “Big data is a term used to refer to data sets that are too large or complex for traditional data processing
application software to adequately deal with” (Wikipedia)
● “Big data is high-volume, high-velocity and/or high-variety information assets that demand cost-effective, innovative forms of information processing that enable enhanced insight, decision making and process automation.” (Gartner)
Example: Big Data at Netflix
Big Data at Netflix
● 100 million users
● 125+ million hours of video watched each day
● 4000 different devices
● 700+ billion events a day
● 60 petabytes of data
More information at Netflix Technology Blog https://medium.com/@NetflixTechBlog
Notable Open Datasets
● Kaggle – over 14,000 datasets
● Microsoft Academic Graph
● Data.gov – U.S. Government’s open data
● Pushshift.io – full Reddit dataset
● Common Crawl – 7 years of web page data
● YouTube-8M Dataset – a large-scale labelled video dataset consisting of millions of YouTube videos
● Data4Good.io – over 1TB of compressed network data
Big Data Parallel Computing
● Datasets so large that it is impractical to analyse them on one computer
● Thus, multiple computers are needed to analyse the data so that it can be examined in a reasonable time
● Need to consider the problems and challenges of parallel computing to make the most of the hardware
Contents
● What is Big Data?
● Parallelism
● The Big Data Pipeline
● The Word-Count Problem
Parallel Computing
● The use of a number of processors, working together, to perform a calculation or solve a problem
● The calculation is divided into tasks, which are sent to different processors
● Processors can be different cores in the same machine, and/or different machines linked by a network
● Processes need to be coordinated
Sequential Program Execution
● Basic model based on the Von Neumann architecture from the 1940s
● One instruction is fetched, decoded and executed at a time
● As processor speed increases, more instructions can be executed in the same time
Single Processor Limitations
● We have roughly reached the practical limitations in the amount of computing power a single processor
can have
● We cannot make processors much faster, or much bigger
● According to Moore’s law the number of transistors per chip will continue to increase
● More transistors are great, but if the clock speed goes much beyond 4 GHz the chips tend to overheat
● This means we now have chips containing an increasing number of processor cores
● Multicore Chips
Machine Cluster
Programming a cluster of machines
© Square-Enix, Platinum Games
Parallel Computing Challenges
There are several challenges to efficient parallel computing:
● Many algorithms are hard to divide into subtasks (or cannot be divided at all)
● The subtasks might use results from each other, so coordinating the different tasks might be difficult
Some problem areas are much easier than others to parallelise
Easy parallelization example
Image processing: sharpening an image
● We have a big photograph
● We divide it into tiles (square patches), and sharpen each tile
● Then we adjust the pixels at the edges so that they match up
This works because the edge pixels are a small fraction of the total, and adjusting them does not affect the other pixels
● Easy to divide up the work! (a minimal sketch follows below)
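A minimal sketch of the tiling idea in Python, assuming the image is held as a NumPy array and using a placeholder sharpen function (both are assumptions for illustration, not part of the original slides):

import numpy as np

def sharpen(tile):
    # Placeholder for a real sharpening kernel (hypothetical).
    return tile

def sharpen_in_tiles(image, tile_size=256):
    # Process the image as independent square tiles; each tile could be
    # handed to a different processor, since tiles only interact at their edges.
    result = np.empty_like(image)
    h, w = image.shape[:2]
    for y in range(0, h, tile_size):
        for x in range(0, w, tile_size):
            tile = image[y:y + tile_size, x:x + tile_size]
            result[y:y + tile_size, x:x + tile_size] = sharpen(tile)
    return result

In a parallel version, the two nested loops would be replaced by distributing the tiles across workers and then adjusting the tile borders at the end.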
Not so easy parallelisation
Path search: Get best route from A to B
● How to divide the task into smaller ones? Find intermediate points
● How to select these intermediate points?
● The processes need to communicate their intermediate results to each other
● How can we guarantee that the solution is the best one?
Examples of parallelism: Simulation
We often need to simulate physical events:
● What happens when lightning strikes a plane?
● This is important with modern planes, because they are made of plastic, and lightning goes through
them (whereas it goes round metal planes)
● Planes are too expensive to conduct a large number of tests with them, but we have to know that they
are safe to fly
● Simulation is a necessity: but it is computationally difficult
There are a lot of similar tasks (nuclear safety, for example) where direct testing is impossible or difficult
Examples of parallelism: Weather Prediction
Examples of parallelism: Prediction
● We would like to predict real-world events (such as the weather)
● This often involves very long and data intensive calculations, just like simulation does
● Challenging both for algorithm design and implementation
Examples of parallelism: Data Analysis
● Marketers analyse large amounts of data from Twitter in order to find out how their products are doing
● Google analyses lots of web pages in order to support Google Search
● The LHC produces huge amounts of data, which need analysis
● Biologists read large quantities of DNA, and they have to work out what it means
● All these applications are computationally demanding, and they also use large amounts of data
Examples of parallelism: Data Analysis (Netflix)
● Take Netflix and its approach to applying data analysis for maximizing the effectiveness of their video
catalogue
● Traditionally, there is a limited subset of movie genres (let’s say 100)
● Netflix manages more than 70,000 genres, and combines them with collected data about the millions of
users that actively consume content through the service
● With that information, they can provide recommendations that are considerably more effective than the previous standard
Contents
● What is Big Data?
● Parallelism
● The Big Data Pipeline
● The Word-Count Problem
The Big Data Pipeline
The Big Data Pipeline consists of a sequence of 3V (high-volume, high-velocity, high-variety) data actions:
• Data sources are the input to the pipeline.
• The output can be another big data pipeline, small data analysis, query system, visualisation or an
action.
(Pipeline diagram: Data Sources → Ingestion → Storage → Processing → Output)
Data Sources
A data source is any mechanism that produces data. Examples of data sources include:
● Mobile and web apps
● Websites
● IoT devices
● Databases
● Output of other big data pipelines
Ingestion
The ingestion stage gathers data from a variety of data sources and makes it available for further
stages.
• Ingested data can be stored or directly processed
• Data is moved from the original sources across a network, hence communication protocols such as HTTP, MQTT or FTP can be used during ingestion
• Data ingestion can include transformation operations on raw data (a rough sketch follows below)
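As a rough, hedged sketch of an HTTP-based ingestion step (the URL and the line-oriented format are purely illustrative assumptions), using only the Python standard library:

from urllib.request import urlopen

def ingest(url):
    # Pull raw data from a source over HTTP (one of the protocols above)
    # and apply a simple transformation before storage or processing.
    with urlopen(url) as response:
        raw = response.read().decode("utf-8")
    # Example transformation: drop blank lines and surrounding whitespace.
    records = [line.strip() for line in raw.splitlines() if line.strip()]
    return records

# Hypothetical source, for illustration only:
# records = ingest("http://example.com/events.log")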
Storage
After ingestion, data are usually stored ready for future processing.
• Big data storage needs solutions that operate with high volumes of data, hence distributed storage will be used
• Big data storage also requires flexibility and fast retrieval of high volumes of data, for which NoSQL solutions are appropriate
Processing
Processing runs algorithms on big data.
• Data can come directly from the ingestion stage or from storage
• The results of the processing stage can be stored back in the storage system or be used as the output of the pipeline
• Can be batch, stream or graph
Weeks 2-4: Apache Hadoop
During weeks 2 to 4, we will cover Apache Hadoop, our first Big Data solution, which consists of:
• Processing capabilities (based on MapReduce)
• Storage system (HDFS, Hadoop Distributed File System)
Week 5: Ingestion and Storage
Week 5 focuses on Big Data solutions for
• Ingestion. Specific ingestion solutions for Apache Hadoop will be presented
• Storage technologies, including distributed file systems and NoSQL data stores
Different types of data sources will also be discussed.
Weeks 6-11: Processing
The remainder of the module focuses on the Big Data processing stage, namely
• Weeks 6 and 7: Apache Spark
• Weeks 8 and 9: Stream processing
• Weeks 10 and 11: Graph processing
Contents
● What is Big Data?
● Parallelism
● The Big Data Pipeline
● The Word-Count Problem
Our first parallel program
● Task: count the number of occurrences of each word in one document
● Input: text document
● Output: sequence of: word, count
The 56
School 23
Queen 10
● Collection Stage: Not applicable in this case
● Ingestion Stage: Move the file to a data lake with an applicable protocol, e.g. HTTP/FTP
● Preparation Stage: Remove characters which might confuse the algorithm, e.g. quotation marks (a minimal sketch follows below)
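A minimal sketch of such a preparation step in Python (the exact set of characters to strip is an assumption for illustration):

import re

def prepare(text):
    # Preparation stage: replace any character that is not a letter,
    # digit or whitespace (e.g. quotation marks) with a space.
    return re.sub(r"[^\w\s]", " ", text)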
Program Input
QMUL has been ranked 9th among multi-faculty institutions in the UK, according to tables published today in the Times Higher
Education.
A total of 154 institutions were submitted for the exercise.
The 2008 RAE confirmed Queen Mary to be one of the rising stars of the UK research environment and the REF 2014 shows that this
upward trajectory has been maintained.
Professor Simon Gaskell, President and Principal of Queen Mary, said: “This is an outstanding result for Queen Mary. We have built
upon the progress that was evidenced by the last assessment exercise and have now clearly cemented our position as one of the UK’s
foremost research-led universities. This achievement is derived from the talent and hard work of our academic staff in all disciplines,
and the colleagues who support them.”
The Research Excellence Framework (REF) is the system for assessing the quality of research in UK higher education institutions.
Universities submit their work across 36 panels of assessment. Research is judged according to quality of output (65 per cent),
environment (15 per cent) and, for the first time, the impact of research (20 per cent).
How to solve the problem?
How to solve the problem on a single processor?
# input: text, a string containing the complete text
words = text.split()
count = dict()
for word in words:
    if word in count:
        count[word] = count[word] + 1
    else:
        count[word] = 1
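For reference, the same sequential count can be written more compactly with the Python standard library (an alternative to the loop above, not part of the original slide):

from collections import Counter

# count maps each word to its number of occurrences, as above
count = Counter(text.split())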
Parallelising the problem
Splitting the load into subtasks:
● Split sentences/lines into words
● Count all the occurrences of each word
…What do we do with the intermediate results?
● Merge into single collection
● Possibly requires parallelism too
● One model for executing this is MapReduce, which is the subject of next week's lecture (a rough sketch of the split-and-merge idea on a single machine follows below)
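As a preview of this split-then-merge structure, here is a hedged sketch of a parallel word count using Python's multiprocessing module. The chunking strategy and worker count are arbitrary choices, and this is the same idea on one machine rather than the MapReduce framework itself:

from collections import Counter
from functools import reduce
from multiprocessing import Pool

def count_chunk(lines):
    # Count word occurrences within one chunk of lines (the per-task step).
    return Counter(word for line in lines for word in line.split())

def merge(c1, c2):
    # Merge two partial counts into one (the combining step).
    return c1 + c2

def parallel_word_count(text, workers=4):
    lines = text.splitlines()
    # Split the input into chunks of roughly equal size.
    size = max(1, len(lines) // workers)
    chunks = [lines[i:i + size] for i in range(0, len(lines), size)]
    with Pool(workers) as pool:
        partial_counts = pool.map(count_chunk, chunks)
    return reduce(merge, partial_counts, Counter())

# Note: when run as a script, the call to parallel_word_count should sit
# under an `if __name__ == "__main__":` guard so worker processes start cleanly.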