Big Data Overview

The document provides an overview of big data concepts including what big data is, its key characteristics of volume, velocity and variety, and how big data is stored, selected and processed. It then discusses tools used for big data including Hive, Pig and Flume. Hive is described as a data warehouse infrastructure tool to process structured data in Hadoop using SQL-like queries. Pig is presented as a platform for analyzing large datasets using a high-level language called Pig Latin. Flume is defined as a tool for collecting, aggregating and transporting large amounts of streaming data from various sources to centralized data stores.

Uploaded by

noor khan


Saad Khan

MSCS
2nd semester
Content

 Introduction
 What is Big Data?
 Characteristics of Big Data
 Storing, Selecting and Processing Big Data
 Why Big Data
 How It Is Different
 Hive
 Pig
 Flume
Introduction

 Big Data may well be the Next Big Thing in the IT world

 The first organizations to embrace it were online and startup firms. Firms like
Google, eBay, LinkedIn, and Facebook were built around big data from the
beginning.

 Big data burst upon the scene in the first decade of the 21st century
What is BIG DATA?
 ‘Big Data’ is similar to ‘small data’, but bigger in size.

 It aims to solve new problems, or old problems in a better way.
What is BIG DATA (Cont..)

 Walmart handles more than 1 million customer transactions every hour.

 Facebook handles 40 billion photos from its user base.

Characteristics of Big Data
Volume

 A typical PC might have had 10 gigabytes of storage in 2000

 Today, Facebook ingests 500 terabytes of new data every day

 A Boeing 737 will generate 240 terabytes of flight data during a single flight across the US.
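
To put the volume figures above in proportion, here is a small arithmetic sketch. It uses only the two numbers quoted on the slide (10 GB for a PC in 2000, 500 TB ingested per day) and assumes decimal units; the function names are hypothetical and chosen for illustration.

```python
# Figures quoted on the slide; decimal units are an assumption (1 TB = 10**12 bytes).
PC_STORAGE_2000_BYTES = 10 * 10**9           # 10 GB
FACEBOOK_DAILY_INGEST_BYTES = 500 * 10**12   # 500 TB per day
SECONDS_PER_DAY = 24 * 60 * 60


def pcs_filled_per_day(daily_bytes: int, pc_bytes: int) -> int:
    """How many year-2000 PCs the daily ingest would fill completely."""
    return daily_bytes // pc_bytes


def average_ingest_rate_gb_per_s(daily_bytes: int) -> float:
    """Average ingest rate in gigabytes per second, spread evenly over a day."""
    return daily_bytes / SECONDS_PER_DAY / 10**9


if __name__ == "__main__":
    print(pcs_filled_per_day(FACEBOOK_DAILY_INGEST_BYTES, PC_STORAGE_2000_BYTES))
    print(round(average_ingest_rate_gb_per_s(FACEBOOK_DAILY_INGEST_BYTES), 2))
```

Under these assumptions the daily ingest equals the full storage of 50,000 such PCs, or an average of roughly 5.8 GB every second.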
Velocity

 Clickstreams and ad impressions capture user behavior at millions of events per second.

 High-frequency stock trading algorithms reflect market changes within microseconds.

 Machine-to-machine processes exchange data between billions of devices.

Variety

 Big Data isn't just numbers, dates, and strings. Big Data is also geospatial
data, 3D data, audio and video, and unstructured text, including log files
and social media.

 Traditional database systems were designed to address smaller volumes of structured data, fewer updates, and a predictable, consistent data structure.
Storing Big Data

 Selecting data source for analysis

 Eliminating redundant data

 Establishing the role of NoSQL

Selecting Big Data Stores

 Choosing the correct data stores based on your data characteristics.

 Moving code to data.

 Implementing polyglot data store solutions

Processing Big Data

 Mapping data to the programming framework

 Connecting and extracting data from storage.

 Transforming data for processing.
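
The three processing steps above (map data to a framework, extract it from storage, transform it) can be sketched with the map/shuffle/reduce pattern that Hadoop popularized. This is a plain-Python illustration running in a single process, not Hadoop code; all function names are hypothetical.

```python
from collections import defaultdict


def map_phase(line):
    """Map: turn one input line into (word, 1) pairs."""
    for word in line.lower().split():
        yield (word, 1)


def shuffle(pairs):
    """Shuffle: group all values that share the same key."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return dict(grouped)


def reduce_phase(key, values):
    """Reduce: collapse the values of one key into a single result."""
    return key, sum(values)


def word_count(lines):
    """Run map, shuffle, and reduce over an iterable of text lines."""
    pairs = (pair for line in lines for pair in map_phase(line))
    return dict(reduce_phase(k, v) for k, v in shuffle(pairs).items())


if __name__ == "__main__":
    print(word_count(["big data", "Big tools"]))
```

In a real cluster the map and reduce functions are shipped to the nodes that hold the data, which is the "moving code to data" idea mentioned earlier.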

Why Big Data

 Increase in storage capacities.

 Increase in processing power.

 Availability of data (in different data types).

How is big data different?

 Automatically generated by a machine

(e.g. Sensor embedded in an engine)

 Typically an entirely new source of data

(e.g. Use of the internet)

 Not designed to be friendly

(e.g. Text streams)
Hive
What is Hive?

 Hive is a data warehouse infrastructure tool to process structured data in Hadoop. It resides on top of Hadoop to summarize Big Data, and makes querying and analyzing easy.

 Hive was initially developed by Facebook; later the Apache Software Foundation took it up and developed it further as open source under the name Apache Hive.
Features of Hive

 It stores the schema in a database and the processed data in HDFS (Hadoop Distributed File System).

 It is designed for OLAP.

 It provides an SQL-like query language called HiveQL or HQL.

Architecture Of Hive

 User Interface - Hive is data warehouse infrastructure software that lets users interact with HDFS. The user interfaces that Hive supports are Hive Web UI, Hive command line, and Hive HD Insight.

 Meta Store - Hive uses a database server to store the schema or metadata of tables, databases, columns in a table, their data types, and HDFS mapping.
Architecture Of Hive(Cont..)

 HiveQL Process Engine - HiveQL is similar to SQL and queries data using the schema information in the Metastore. It replaces the traditional approach of writing a MapReduce program by hand.

 HDFS or HBase - The Hadoop Distributed File System or HBase is the storage layer that holds the data.
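
The split between Metastore and storage described above can be sketched as follows. This is not Hive code: the two dictionaries are hypothetical stand-ins for the Metastore database and for files in HDFS, and the table name, columns, and path are invented for illustration. The point is that the raw data carries no types; the schema is looked up separately and applied when the data is read.

```python
# Hypothetical stand-in for the Metastore: table name -> schema and file location.
METASTORE = {
    "page_views": {
        "columns": [("user_id", int), ("url", str), ("duration_s", float)],
        "delimiter": ",",
        "location": "/warehouse/page_views",  # invented path, for illustration only
    }
}

# Hypothetical stand-in for HDFS: location -> raw, untyped text lines.
STORAGE = {
    "/warehouse/page_views": ["1,/home,3.5", "2,/cart,10.0", "1,/cart,2.5"],
}


def read_table(name):
    """Look up the schema, then parse and type the raw lines at read time."""
    meta = METASTORE[name]
    rows = []
    for line in STORAGE[meta["location"]]:
        fields = line.split(meta["delimiter"])
        rows.append(
            {col: cast(raw) for (col, cast), raw in zip(meta["columns"], fields)}
        )
    return rows


if __name__ == "__main__":
    for row in read_table("page_views"):
        print(row)
```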
Working of Hive

 Get Plan - The driver takes the help of the query compiler, which parses the query to check its syntax and work out the query plan, that is, what the query requires.

 Get Metadata - The compiler sends a metadata request to the Metastore.

 Send Metadata - The Metastore sends the metadata as a response to the compiler.
Working of Hive(Cont..)

 Send Plan - The compiler checks the requirement and resends the plan to the driver. Up to here, the parsing and compiling of a query is complete.

 Execute Plan - The driver sends the execution plan to the execution engine.
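
The hand-offs listed above can be traced in a toy model. The class names mirror the roles on the slides (driver, compiler, Metastore, execution engine), but everything else is an assumption made for illustration: the classes are hypothetical, the only query shape understood is `SELECT <column> FROM <table>`, and the "plan" is just a dictionary. Real Hive plans are far richer.

```python
class Metastore:
    """Holds table schemas; answers metadata requests from the compiler."""

    def __init__(self, schemas):
        self._schemas = schemas

    def get_metadata(self, table):
        # "Get Metadata" request and "Send Metadata" response in one call.
        return self._schemas[table]


class Compiler:
    """Parses the query, checks it against metadata, and builds a plan."""

    def __init__(self, metastore):
        self.metastore = metastore

    def get_plan(self, query):
        tokens = query.strip().rstrip(";").split()
        if (len(tokens) != 4 or tokens[0].upper() != "SELECT"
                or tokens[2].upper() != "FROM"):
            raise ValueError("only 'SELECT <column> FROM <table>' is supported")
        column, table = tokens[1], tokens[3]
        columns = self.metastore.get_metadata(table)
        if column not in columns:
            raise ValueError(f"unknown column {column!r} in table {table!r}")
        return {"table": table, "column_index": columns.index(column)}


class ExecutionEngine:
    """Runs a plan against stored rows (a dict stands in for HDFS here)."""

    def __init__(self, storage):
        self.storage = storage

    def execute(self, plan):
        return [row[plan["column_index"]] for row in self.storage[plan["table"]]]


class Driver:
    """Coordinates the steps: get the plan, then send it for execution."""

    def __init__(self, compiler, engine):
        self.compiler = compiler
        self.engine = engine

    def run(self, query):
        plan = self.compiler.get_plan(query)  # Get Plan ... Send Plan
        return self.engine.execute(plan)      # Execute Plan


if __name__ == "__main__":
    metastore = Metastore({"users": ["id", "name"]})
    engine = ExecutionEngine({"users": [(1, "ana"), (2, "bo")]})
    driver = Driver(Compiler(metastore), engine)
    print(driver.run("SELECT name FROM users;"))
```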
Pig
What is Pig?

 A platform for analyzing large data sets that consists of a high-level language for expressing data analysis programs

 Compiles down to MapReduce jobs

 Developed by Yahoo!
Pig Components

 Two Main Components

 High-Level Language (Pig Latin)
 Set of Commands

 Two Execution Modes

 Local: reads/writes to the local file system
 MapReduce: connects to a Hadoop cluster and reads/writes to HDFS
Why Pig?

 Common design patterns as keywords (joins, distinct, counts)

 Data flow analysis

 Avoids Java-level errors

Pig Language Features

 Keywords
 Load, Filter, Foreach Generate, Group By, Store, Join, Distinct, Order By, …

 Aggregations
 Count, Avg, Sum, Max, Min

 Schema
 Defined at query time, not when files are loaded
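
The data-flow style behind these keywords can be imitated in plain Python. The sketch below is not Pig Latin and does not call Pig; each hypothetical function stands in for one keyword (Load, Filter, Group By, Foreach Generate with Count, Order By), and the sample log lines are invented. Note that the schema is supplied when the data is loaded by the query, not stored with the raw lines.

```python
from collections import defaultdict

# Invented sample input: tab-separated user name and HTTP status.
RAW_LINES = ["alice\t200", "bob\t404", "alice\t404", "carol\t200", "alice\t200"]


def load(lines, schema):
    """Load: apply a schema (field names) to raw tab-separated lines."""
    return [dict(zip(schema, line.split("\t"))) for line in lines]


def filter_rows(rows, predicate):
    """Filter: keep only the rows for which the predicate holds."""
    return [row for row in rows if predicate(row)]


def group_by(rows, key):
    """Group By: collect rows into bags keyed by one field."""
    groups = defaultdict(list)
    for row in rows:
        groups[row[key]].append(row)
    return dict(groups)


def foreach_generate_count(groups):
    """Foreach ... Generate with Count: one (key, count) pair per group."""
    return [(key, len(bag)) for key, bag in groups.items()]


def order_by_count_desc(pairs):
    """Order By: highest count first, ties broken by key."""
    return sorted(pairs, key=lambda pair: (-pair[1], pair[0]))


def run_pipeline(lines):
    records = load(lines, ("user", "status"))
    ok = filter_rows(records, lambda row: row["status"] == "200")
    return order_by_count_desc(foreach_generate_count(group_by(ok, "user")))


if __name__ == "__main__":
    print(run_pipeline(RAW_LINES))
```

Each step takes a relation and returns a new one, which is the chaining style that Pig compiles down to MapReduce jobs.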
Flume
What is Flume?

 Apache Flume is a tool/service/data ingestion mechanism for collecting, aggregating, and transporting large amounts of streaming data, such as log files and events, from various sources to a centralized data store.

 Flume is a highly reliable, distributed, and configurable tool. It is principally designed to copy streaming data (log data) from various web servers to HDFS.
Flume Architecture

 Flume Event
 An event is the basic unit of the data transported inside Flume.

 Flume Agent
 An agent is an independent process that receives events and forwards them onward. Internally it consists of a source, a channel, and a sink that collaborate with each other.
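
How the parts of an agent collaborate can be sketched in a few lines. This is a single-process Python model, not the Flume API: the class names are hypothetical, the channel is an in-memory queue, and the sink simply appends to a list where a real sink would write to HDFS or another store.

```python
from collections import deque
from dataclasses import dataclass, field


@dataclass
class Event:
    """Basic unit of data moved through the agent: a body plus headers."""
    body: bytes
    headers: dict = field(default_factory=dict)


class MemoryChannel:
    """Buffer between source and sink with a fixed capacity."""

    def __init__(self, capacity):
        self.capacity = capacity
        self._queue = deque()

    def put(self, event):
        if len(self._queue) >= self.capacity:
            raise OverflowError("channel is full")
        self._queue.append(event)

    def take(self):
        return self._queue.popleft() if self._queue else None


class LogLineSource:
    """Source: wraps each incoming log line in an event and puts it on the channel."""

    def __init__(self, channel):
        self.channel = channel

    def receive(self, line):
        self.channel.put(Event(body=line.encode("utf-8")))


class ListSink:
    """Sink: drains the channel and delivers events to the destination."""

    def __init__(self, channel):
        self.channel = channel
        self.delivered = []

    def drain(self):
        count = 0
        while (event := self.channel.take()) is not None:
            self.delivered.append(event.body.decode("utf-8"))
            count += 1
        return count


if __name__ == "__main__":
    channel = MemoryChannel(capacity=100)
    source, sink = LogLineSource(channel), ListSink(channel)
    source.receive("GET /home 200")
    source.receive("GET /cart 404")
    print(sink.drain(), sink.delivered)
```

The channel decouples the two sides: the source can keep accepting events while the sink is slow, up to the channel's capacity.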
Application of Flume

 Assume an e-commerce web application wants to analyze customer behavior from a particular region.

 To do so, it would need to move the available log data into Hadoop for analysis. Here, Apache Flume comes to the rescue.

 Flume is used to move the log data generated by application servers into HDFS at high speed.
Features of Flume

 Flume ingests log data from multiple web servers into a centralized store.

 Using Flume, we can get the data from multiple servers into Hadoop immediately.

 Flume supports a large set of source and destination types.

 Flume can be scaled horizontally.

Advantages of Flume

 Using Apache Flume, we can store the data in any of the centralized stores (HBase, HDFS).

 Flume provides the feature of contextual routing.

 Flume is reliable, fault tolerant, scalable, manageable, and customizable.
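
Contextual routing means that an event's destination is chosen from information carried with the event, typically a header value. The sketch below illustrates the idea in plain Python; it is not Flume configuration or API, and the function name, the `region` header, and the channel names are all hypothetical.

```python
from collections import defaultdict


def route_events(events, header, mapping, default_channel):
    """Send each (headers, body) event to the channel mapped to its header value.

    Events whose header is missing or unmapped go to default_channel.
    """
    channels = defaultdict(list)
    for headers, body in events:
        target = mapping.get(headers.get(header), default_channel)
        channels[target].append(body)
    return dict(channels)


if __name__ == "__main__":
    events = [
        ({"region": "eu"}, "order placed"),
        ({"region": "us"}, "cart viewed"),
        ({}, "health check"),
    ]
    mapping = {"eu": "eu_channel", "us": "us_channel"}
    print(route_events(events, "region", mapping, "other_channel"))
```

This matches the e-commerce scenario above: events tagged with a region can be steered to a store dedicated to that region's analysis.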

Any Questions?
