Saad Khan
MSCS
2nd semester
Content
Introduction
What is Big Data.
Characteristics of Big Data.
Storing, Selecting and processing of Big Data
Why Big Data
How it is Different
Hive
Pig
Flume
Introduction
Big Data may well be the Next Big Thing in the IT world
The first organizations to embrace it were online and startup firms. Firms like
Google, eBay, LinkedIn, and Facebook were built around big data from the
beginning.
Big data burst upon the scene in the first decade of the 21st century
What is BIG DATA?
‘Big Data’ is similar to ‘small data’, but bigger in size.
It aims to solve new problems, or old problems in a
better way.
What is BIG DATA (Cont..)
Walmart handles more than 1 million customer transactions every hour.
Facebook handles 40 billion photos from its user base.
Characteristics of Big Data
Volume
A typical PC might have had 10 gigabytes of storage in 2000
Today, Facebook ingests 500 terabytes of new data every day
A Boeing 737 will generate 240 terabytes of flight data during a single
flight across the US.
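To put the Facebook figure in perspective, it can be converted to a per-second rate. The sketch below assumes decimal units (1 terabyte = 10^12 bytes) and a load spread evenly over the day; neither assumption comes from the slide.

```python
# Rough rate conversion; assumes decimal terabytes and a uniform daily load.
SECONDS_PER_DAY = 24 * 60 * 60
BYTES_PER_TB = 10 ** 12
BYTES_PER_GB = 10 ** 9


def daily_tb_to_gb_per_second(tb_per_day: float) -> float:
    """Convert a daily volume in terabytes to gigabytes per second."""
    return tb_per_day * BYTES_PER_TB / SECONDS_PER_DAY / BYTES_PER_GB


if __name__ == "__main__":
    rate = daily_tb_to_gb_per_second(500)
    print(f"500 TB/day is roughly {rate:.2f} GB/s")
```

Under these assumptions, 500 terabytes a day works out to a sustained rate of a little under 6 gigabytes per second.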
Velocity
Clickstreams and ad impressions capture user behavior at millions of events
per second.
High-frequency stock trading algorithms reflect market changes within
microseconds
Machine to machine processes exchange data between billions of devices
Variety
Big Data isn't just numbers, dates, and strings. Big Data is also geospatial
data, 3D data, audio and video, and unstructured text, including log files
and social media.
Traditional database systems were designed to address smaller volumes of
structured data, fewer updates, and a predictable, consistent data structure.
Storing Big Data
Selecting data source for analysis
Eliminating redundant data
Establishing the role of NoSQL
Selecting Big Data Stores
Choosing the correct data stores based on your data characteristics.
Moving code to data.
Implementing polyglot data store solutions
Processing Big Data
Mapping data to the programming framework
Connecting and extracting data from storage.
Transforming data for processing.
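The three processing steps above can be sketched in a few lines. This is a minimal illustration that assumes a MapReduce-style framework and uses an in-memory CSV string as a stand-in for real storage; the sample data and all function names are hypothetical.

```python
import csv
import io
from collections import defaultdict

# Hypothetical raw data standing in for a file read from a data store.
RAW = "user,page\nAlice ,/home\nbob,/cart\nalice,/cart\n"


def extract(raw_text):
    """Connect to the 'store' (here an in-memory string) and read rows."""
    return list(csv.DictReader(io.StringIO(raw_text)))


def transform(rows):
    """Normalize each row into the shape the framework expects."""
    return [
        {"user": r["user"].strip().lower(), "page": r["page"].strip()}
        for r in rows
    ]


def map_phase(rows):
    """Map each record to a (key, value) pair: one hit per page."""
    return [(r["page"], 1) for r in rows]


def reduce_phase(pairs):
    """Sum the values that share a key."""
    totals = defaultdict(int)
    for key, value in pairs:
        totals[key] += value
    return dict(totals)


def run(raw_text):
    """Extract, transform, then map and reduce."""
    return reduce_phase(map_phase(transform(extract(raw_text))))


if __name__ == "__main__":
    print(run(RAW))
```

In a real cluster the map and reduce phases would run in parallel on many machines; here they run in one process only to show how data is shaped for the framework.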
Why Big Data
Increase in storage capacities.
Increase in processing power.
Availability of data (different data types).
How is big data different?
Automatically generated by a machine
(e.g. Sensor embedded in an engine)
Typically an entirely new source of data
(e.g. Use of the internet)
Not designed to be user-friendly
(e.g. Text streams)
Hive
What is Hive?
Hive is a data warehouse infrastructure tool to process structured data in
Hadoop. It resides on top of Hadoop to summarize Big Data, and makes
querying and analyzing easy.
Initially Hive was developed by Facebook; later the Apache Software
Foundation took it up and developed it further as open source software under
the name Apache Hive.
Features of Hive
It stores the schema in a database and the processed data in HDFS (Hadoop
Distributed File System).
It is designed for OLAP.
It provides an SQL-like query language called HiveQL or HQL.
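To give a feel for the SQL-like, OLAP-style queries mentioned above, the sketch below runs an aggregate query using Python's built-in sqlite3 module. This is a stand-in only: it does not connect to Hive, and the sales table and its columns are hypothetical. A plain SELECT with GROUP BY like this one should look much the same in HiveQL.

```python
import sqlite3


def sales_by_region(rows):
    """Load (region, amount) rows and return the total amount per region."""
    conn = sqlite3.connect(":memory:")  # in-memory database, not Hive
    conn.execute("CREATE TABLE sales (region TEXT, amount REAL)")
    conn.executemany("INSERT INTO sales VALUES (?, ?)", rows)
    # A summarizing (OLAP-style) query rather than a row-level update.
    query = """
        SELECT region, SUM(amount) AS total
        FROM sales
        GROUP BY region
        ORDER BY region
    """
    result = conn.execute(query).fetchall()
    conn.close()
    return result


if __name__ == "__main__":
    sample = [("east", 10.0), ("west", 5.0), ("east", 2.5)]
    for region, total in sales_by_region(sample):
        print(region, total)
```

The difference in Hive is where the work happens: the table data would live in HDFS and the query would be turned into jobs that run on the Hadoop cluster.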
Architecture Of Hive
User Interface - Hive is data warehouse infrastructure software that
enables interaction between the user and HDFS. The user interfaces that Hive
supports are Hive Web UI, Hive command line, and Hive HD.
Metastore - Hive chooses respective database servers to store the schema
or metadata of tables, databases, columns in a table, their data types, and
the HDFS mapping.
Architecture Of Hive(Cont..)
HiveQL Process Engine - HiveQL is similar to SQL and is used to query the
schema information in the Metastore. It is one of the replacements for the
traditional approach of writing a MapReduce program.
HDFS or HBase - The Hadoop Distributed File System or HBase are the data
storage techniques used to store data in the file system.
Working of Hive
Get Plan - The driver takes the help of the query compiler, which parses
the query to check the syntax and the query plan or the requirements of
the query.
Get Metadata - The compiler sends a metadata request to the Metastore.
Send Metadata - The Metastore sends the metadata as a response to the
compiler.
Working of Hive(Cont..)
Send Plan - The compiler checks the requirements and sends the plan back to
the driver. Up to here, the parsing and compiling of the query is complete.
Execute Plan - The driver sends the execution plan to the
execution engine.
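The message flow in the steps above can be modeled as a toy simulation. The classes below are hypothetical stand-ins written for illustration, not Hive's actual components or API, and the query parsing is deliberately naive (it only finds the word after an upper-case FROM).

```python
class Metastore:
    """Holds table schemas (hypothetical stand-in for the Hive Metastore)."""

    def __init__(self, schemas):
        self.schemas = schemas

    def get_metadata(self, table):
        return self.schemas[table]


class Compiler:
    """Parses a query and builds a plan with help from the metastore."""

    def __init__(self, metastore, log):
        self.metastore = metastore
        self.log = log

    def get_plan(self, query):
        self.log.append("Get Plan")
        # Naive parse: take the first word after an upper-case FROM.
        table = query.split("FROM")[1].split()[0]
        self.log.append("Get Metadata")
        columns = self.metastore.get_metadata(table)
        self.log.append("Send Metadata")
        self.log.append("Send Plan")
        return {"table": table, "columns": columns}


class ExecutionEngine:
    """Pretends to run the plan and returns a description of the work."""

    def execute(self, plan):
        return f"scan {plan['table']} with columns {plan['columns']}"


class Driver:
    """Coordinates the compiler and the execution engine."""

    def __init__(self, compiler, engine, log):
        self.compiler = compiler
        self.engine = engine
        self.log = log

    def run(self, query):
        plan = self.compiler.get_plan(query)
        self.log.append("Execute Plan")
        return self.engine.execute(plan)


def run_query(query, schemas):
    """Wire the components together and return (step log, result)."""
    log = []
    driver = Driver(Compiler(Metastore(schemas), log), ExecutionEngine(), log)
    return log, driver.run(query)


if __name__ == "__main__":
    steps, result = run_query("SELECT name FROM users", {"users": ["id", "name"]})
    print(steps)
    print(result)
```

The recorded log lists the five steps in the same order as the slides: the plan cannot be sent back until the metadata has arrived, and execution starts only after the driver holds the plan.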
Pig
What is Pig?
A platform for analyzing large data sets that consists of a high-level
language for expressing data analysis programs
Compiles down to MapReduce jobs
Developed by Yahoo!
Pig Components
Two Main Components.
High Level Language (Pig Latin)
Set of Commands
Two Execution Modes
Local: Read/Write to local file system
MapReduce: connects to a Hadoop cluster and reads/writes to HDFS
Why Pig?
Common design patterns as keywords (joins, distinct, counts)
Data flow analysis
Avoids Java-level errors
Pig Language Features
Keywords
Load, Filter, Foreach Generate, Group By, Store, Join, Distinct, Order By,…
Aggregations
Count, Avg, Sum, Max, Min
Schema
Defined at query time, not when files are loaded
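The data-flow style behind these keywords can be imitated in plain Python. The sketch below is not Pig Latin and does not use any Pig API; the function names are hypothetical and simply mirror Load, Filter, Group By, and Foreach Generate with a Count aggregation. Note that the schema is passed in when the data is read by the query, echoing the point about schemas above.

```python
from collections import defaultdict


def load(lines, schema):
    """LOAD: split comma-separated lines and apply the schema at query time."""
    return [dict(zip(schema, line.split(","))) for line in lines]


def filter_rows(rows, predicate):
    """FILTER: keep only the rows that satisfy the predicate."""
    return [row for row in rows if predicate(row)]


def group_by(rows, key):
    """GROUP BY: collect rows that share the same value for `key`."""
    groups = defaultdict(list)
    for row in rows:
        groups[row[key]].append(row)
    return dict(groups)


def foreach_generate_count(groups):
    """FOREACH ... GENERATE COUNT: one count per group."""
    return {key: len(rows) for key, rows in groups.items()}


def successful_requests_per_user(lines):
    """Chain the steps as a data flow: load, filter, group, count."""
    rows = load(lines, schema=("user", "status"))
    ok = filter_rows(rows, lambda r: r["status"] == "200")
    return foreach_generate_count(group_by(ok, "user"))


if __name__ == "__main__":
    sample = ["alice,200", "bob,404", "alice,200", "carol,200"]
    print(successful_requests_per_user(sample))
```

Each step takes a data set and returns a new one, which is the shape of a Pig script: a sequence of named transformations rather than one nested query.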
Flume
What is Flume?
Apache Flume is a tool/service/data ingestion mechanism for collecting,
aggregating, and transporting large amounts of streaming data, such as log
files and events, from various sources to a centralized data store.
Flume is a highly reliable, distributed, and configurable tool. It is principally
designed to copy streaming data (log data) from various web servers to
HDFS.
Flume Architecture
Flume Event
An event is the basic unit of the data transported inside Flume.
Flume Agent.
Take a look at the following illustration. It shows the internal components of an
agent and how they collaborate with each other.
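As a rough model of an agent's internals, the sketch below assumes the three components described in the Flume documentation: a source that receives events, a channel that buffers them, and a sink that delivers them to the destination. The class names and methods are hypothetical and are not Flume's Java API.

```python
from collections import deque
from dataclasses import dataclass, field


@dataclass
class Event:
    """Basic unit of data: a byte payload plus optional string headers."""
    body: bytes
    headers: dict = field(default_factory=dict)


class Channel:
    """Buffers events between the source and the sink (FIFO)."""

    def __init__(self):
        self._queue = deque()

    def put(self, event):
        self._queue.append(event)

    def take(self):
        """Return the oldest event, or None when the channel is empty."""
        return self._queue.popleft() if self._queue else None


class Source:
    """Receives data from outside and writes it to the channel as events."""

    def __init__(self, channel):
        self.channel = channel

    def receive(self, line, **headers):
        self.channel.put(Event(line.encode("utf-8"), dict(headers)))


class Sink:
    """Takes events from the channel and delivers them to a destination."""

    def __init__(self, channel):
        self.channel = channel
        self.delivered = []  # stands in for HDFS or another store

    def drain(self):
        event = self.channel.take()
        while event is not None:
            self.delivered.append(event)
            event = self.channel.take()


if __name__ == "__main__":
    channel = Channel()
    source, sink = Source(channel), Sink(channel)
    source.receive("GET /home 200", host="web01")
    source.receive("GET /cart 404", host="web02")
    sink.drain()
    print([e.body for e in sink.delivered])
```

The channel is what decouples the two sides: the source can keep accepting log lines even when the sink is temporarily slower.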
Application of Flume
Assume an e-commerce web application wants to analyze the customer
behavior from a particular region.
To do so, they would need to move the available log data into Hadoop for
analysis. Here, Apache Flume comes to our rescue.
Flume is used to move the log data generated by application servers into
HDFS at high speed.
Features of Flume
Flume ingests log data from multiple web servers into a centralized store.
Using Flume, we can get the data from multiple servers immediately into
Hadoop.
Flume supports a large set of source and destination types.
Flume can be scaled horizontally.
Advantages of Flume
Using Apache Flume we can store the data in any of the centralized
stores (HBase, HDFS).
Flume provides the feature of contextual routing.
Flume is reliable, fault tolerant, scalable, manageable, and customizable
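Contextual routing means sending an event to a destination chosen from the event's own content or headers. The sketch below illustrates the idea with a header lookup; the function, the header name, and the channel names are hypothetical, and this is plain Python rather than Flume configuration.

```python
from collections import defaultdict


def route_events(events, header, mapping, default):
    """Group event bodies by the channel chosen from one header value.

    events  : list of dicts with "headers" and "body" keys
    header  : name of the header to inspect
    mapping : header value -> channel name
    default : channel used when the header is missing or unmapped
    """
    channels = defaultdict(list)
    for event in events:
        value = event["headers"].get(header)
        channels[mapping.get(value, default)].append(event["body"])
    return dict(channels)


if __name__ == "__main__":
    sample = [
        {"headers": {"region": "US"}, "body": "order 1"},
        {"headers": {"region": "EU"}, "body": "order 2"},
        {"headers": {}, "body": "order 3"},
    ]
    routes = {"US": "channel_us", "EU": "channel_eu"}
    print(route_events(sample, "region", routes, "channel_default"))
```

This ties back to the e-commerce example: events tagged with a region can be steered to a store dedicated to that region, while everything else falls through to a default.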
Any Questions?