Config de Basic

The document is a password reset request from MCAA Academy, containing details about a publication titled 'Introduction to Big Data' edited by Sartaj Singh and authored by Dr. Rajni Bhalla. It outlines the characteristics, applications, and tools of Big Data, emphasizing its growing significance and the challenges associated with managing large datasets. The content includes various units covering topics such as data models, Hadoop, and predictive analytics.


Message-ID: <[anl_32]@[n_7]>

Content-Id: <[anu_12].[anu_13]@[an_10].google.com>
X-Feedback-ID: [n_7]:SG
Return-Path: bounce-[n_4]-[n_6]-[n_7]-live@click.[al_3].org
sender:[mailbox]
Content-Transfer-Encoding: 8bit
X-Feedback-ID: AR[n_1]-[a_1]-[a_4]:[an_3]-[a_1]-[a_6]:[an_3]-[a_1]-[a_6]:createSEND
X-Campaign: [n_8]/[n_9]/[n_12]
In-Reply-To: <abuse@[an_10].google.com>
Errors-To: abuse@[an_10].google.com
X-Complaints-To: abuse@[an_10].google.com
References: <[anl_22]@[al_6].com> <[anl_23]-[anl_21]+[anl_5]@mail.gmail.com>
From: MCAA Academy <[mailbox]>
To: [email]
Subject: Password reset request
Date: Wed, 02 Apr 2025 14:01:13 -0400
MIME-Version: 1.0
Content-Type: multipart/PARALLEL; boundary=---------[an_42]

-----------[an_42]
Content-Type: multipart/related; boundary="----------[n_24]"

------------[n_24]
Content-Type: text/html; charset=UTF-8
Content-Transfer-Encoding: 8bit

<p>Introduction to Big Data</p>


<p>Edited by</p>
<p>Sartaj Singh</p>
<p>DECAP456</p>
<p>Title: INTRODUCTION TO BIG DATA</p>
<p>Author’s Name: Dr. Rajni Bhalla</p>
<p>Published By : Lovely Professional University</p>
<p>Publisher Address: Lovely Professional University, Jalandhar Delhi GT road,
Phagwara - 144411 Printer Detail: Lovely Professional University</p>
<p>Edition Detail: (I)</p>
<p>ISBN: 978-93-94068-26-1</p>
<p>Copyrights@ Lovely Professional University</p>
<p>CONTENT</p>
<p>Unit 1: Introduction to Big Data</p>
<p>Unit 2: Foundations of Big Data</p>
<p>Unit 3: Data Models</p>
<p>Unit 4: NOSQL Management</p>
<p>Unit 5: Introduction to Hadoop</p>
<p>Unit 6: Hadoop Administration</p>
<p>Unit 7: Hadoop Architecture</p>
<p>Unit 8: Hadoop Master Slave Architecture</p>
<p>Unit 9: Hadoop Node Commands</p>
<p>Unit 10: Map Reduce Applications</p>
<p>Unit 11: Hadoop Ecosystem</p>
<p>Unit 12: Predictive Analytics</p>
<p>Unit 13: Data Analytics with R</p>
<p>Unit 14: Big Data Management using Splunk</p>
<p>Dr. Rajni Bhalla, Lovely Professional University</p>
<p>Unit 01: Introduction to Big Data</p>
<p>CONTENTS</p>
<p>Objectives</p>
<p>Introduction</p>
<p>1.1 What is Big Data</p>
<p>1.2 Characteristics of Big Data</p>
<p>1.3 Applications of BIG DATA</p>
<p>1.4 Tools used in BIG DATA</p>
<p>1.5 Challenges in BIG DATA</p>
<p>Summary</p>
<p>Keywords</p>
<p>Self Assessment</p>
<p>Answers for Self Assessment</p>
<p>Review Questions</p>
<p>Further Readings</p>
<p>Objectives</p>
<p>After studying this unit, you will be able to:</p>
<p>• understand what Big Data is</p>
<p>• understand the applications of Big Data</p>
<p>• learn the tools used in Big Data</p>
<p>• know the challenges in Big Data</p>
<p>Introduction</p>
<p>The quantity of data created by humans is increasing quickly every year as a
result of new technology, gadgets, and communication channels such as social
networking sites. Big data is a group of enormous datasets that cannot be handled
with typical computing methods.</p>
<p>It is no longer a single technique or tool; rather, it has evolved into a
comprehensive subject including a variety of tools, techniques, and frameworks.
Data, in this context, means the quantities, characters, or symbols on which a
computer performs operations, and which can be stored and communicated as
electrical signals and recorded on magnetic, optical, or mechanical media.</p>
<p>1.1 What is Big Data</p>
<p>Big Data is a massive collection of data that continues to grow dramatically
over time. It is a data set so large and complex that typical data management
technologies cannot store or process it effectively. Big data is similar to
regular data, except that it is much larger. Big data analytics is the application
of advanced analytic techniques to very large, heterogeneous data sets, which can
contain structured, semi-structured, and unstructured data from many sources, in
sizes ranging from terabytes to zettabytes.</p>
<p>Figure 1 Structured, semi-structured and unstructured data</p>
<p>Big data is a term for the massive amount of structured and unstructured data
that a company encounters on a daily basis.</p>
<p>Note</p>
<p>• It may be studied for insights that lead to better business decisions and
strategic moves.</p>
<p>• It is a collection of structured, semi-structured, and unstructured data that
may be mined for information and used in machine learning, predictive modelling,
and other advanced analytics initiatives.</p>
<p>Examples of Big Data</p>
<p>Figure 2 shows an example of big data. Every day, more than 500 terabytes of new
data are ingested into Facebook's systems. This data is generated mainly by photo
and video uploads, message exchanges, and comments.</p>
<p>In 30 minutes of flying time, a single jet engine may generate more than 10
gigabytes of data. With thousands of flights every day, the amount of data
generated can reach several petabytes. Every day, the New York Stock Exchange
generates around a terabyte of new trading data.</p>
<p>Figure 2: Example of Big Data</p>
<p>1.2 Characteristics of Big Data</p>
<p>Big data can be described by the following characteristics, as shown in Figure 3.</p>
<p>Figure 3 Characteristics of Big Data</p>
<p>Volume</p>
<p>The term "volume" refers to the sheer amount of data, and the name 'Big Data'
itself points to enormous size. The size of a data set plays a critical role in
determining its value, and it is also what determines whether the data set can be
classified as Big Data at all. As a result, volume is always one of the
characteristics to consider when dealing with Big Data.</p>
<p>Example:</p>
<p>In 2016, worldwide mobile traffic was predicted to be 6.2 exabytes (6.2 billion
GB) per month.</p>
<p>Furthermore, about 40,000 exabytes of data were predicted to exist by 2020.</p>
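<p>The unit arithmetic behind these figures can be checked with a short sketch. It assumes decimal (SI) units, where each step up is a factor of 1,000; the names UNITS and convert are illustrative and do not come from the text.</p>

```python
# Decimal (SI) storage units expressed in bytes (an assumption: binary
# units such as GiB would give slightly different numbers).
UNITS = {"GB": 10**9, "TB": 10**12, "PB": 10**15, "EB": 10**18, "ZB": 10**21}


def convert(value, from_unit, to_unit):
    """Convert a data size between two decimal storage units."""
    return value * UNITS[from_unit] / UNITS[to_unit]


# 6.2 exabytes of monthly mobile traffic, expressed in gigabytes.
mobile_traffic_gb = convert(6.2, "EB", "GB")

# 40,000 exabytes, expressed in zettabytes.
total_zb = convert(40000, "EB", "ZB")

print(mobile_traffic_gb, total_zb)
```

<p>Under this assumption, 6.2 exabytes is 6.2 billion gigabytes, and 40,000 exabytes is 40 zettabytes.</p>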
<p>Velocity</p>
<p>The term "velocity" refers to the speed at which data is generated and
collected. In Big Data, data flows in at a high rate from machines, networks,
social media, mobile phones, and other sources, and the influx is large and
continuous. Velocity determines the data's potential, that is, how quickly the
data must be created and processed in order to meet demand. Data sampling can help
in dealing with velocity. For instance, Google receives more than 3.5 billion
queries every day. In addition, the number of Facebook users is growing at a rate
of around 22% every year.</p>
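<p>One way to sample a fast stream is reservoir sampling, which keeps a fixed-size uniform random sample without storing the whole stream. The text does not name a sampling method, so this choice, and the function name reservoir_sample, are assumptions made for illustration.</p>

```python
import random


def reservoir_sample(stream, k, seed=None):
    """Return k items chosen uniformly at random from a stream of unknown length.

    Only k items are held in memory at any time (Algorithm R).
    """
    rng = random.Random(seed)
    sample = []
    for i, item in enumerate(stream):
        if i < k:
            # Fill the reservoir with the first k items.
            sample.append(item)
        else:
            # Keep the new item with probability k / (i + 1).
            j = rng.randint(0, i)  # both ends inclusive
            if j < k:
                sample[j] = item
    return sample


# Hypothetical stream of one million event identifiers.
events = range(1_000_000)
print(reservoir_sample(events, k=5, seed=42))
```

<p>The sample size stays constant no matter how quickly or for how long the data arrives, which is why this kind of technique suits high-velocity sources.</p>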
<p>Variety</p>
<p>The term "variety" refers to the different forms that data can take:
structured, semi-structured, and unstructured.</p>
<p>Structured data is data that has been organized. It usually refers to data with
a defined length and format.</p>
<p>Semi-structured data is partly organized data that does not follow the
traditional data structure. Log files are an example of this sort of data.</p>
<p>Unstructured data is data that has not been organized. It usually refers to
data that does not fit cleanly into the standard row and column structure of a
relational database. Text, pictures, and videos are examples of unstructured data,
which cannot be stored in the form of rows and columns.</p>
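<p>The three forms can be contrasted in a small sketch. All sample values below (the sales rows, the log event, and the review text) are made up for illustration, and a JSON log line is only one possible shape of semi-structured data.</p>

```python
import csv
import io
import json

# Structured: fixed columns, so every row can be read by column name.
structured = "id,name,amount\n1,Asha,250\n2,Ravi,400\n"
rows = list(csv.DictReader(io.StringIO(structured)))

# Semi-structured: a log event with named fields but no fixed table schema;
# fields may be nested or missing from one event to the next.
semi_structured = '{"level": "INFO", "msg": "login ok", "user": {"id": 7}}'
event = json.loads(semi_structured)

# Unstructured: free text with no fields at all; only generic operations
# such as counting words apply without further analysis.
unstructured = "Loved the new phone, but the battery drains far too quickly."
word_count = len(unstructured.split())

print(rows[1]["amount"], event["user"]["id"], word_count)
```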
<p>Benefits of Big Data Processing</p>
<p>The ability to process Big Data brings multiple benefits, such as:</p>
<p>1. Businesses can use outside intelligence when making decisions.</p>
<p>2. Access to social data from search engines and sites such as Facebook and
Twitter enables organizations to fine-tune their business strategies.</p>
<p>3. Improved customer service. Traditional customer feedback systems are being
replaced by new systems designed with Big Data technologies. In these new systems,
Big Data and natural language processing technologies are used to read and
evaluate consumer responses.</p>
<p>4. Early identification of risk to the product or services, if any.</p>
<p>5. Better operational efficiency.</p>
<p>Big Data technologies can be used for creating a staging area or landing zone
for new data before identifying what data should be moved to the data warehouse. In
addition, such integration of Big Data technologies and data warehouse helps an
organization to offload infrequently accessed data.</p>
<p>Why is Big Data Important?</p>
<p>• Cost Savings</p>
<p>Big data helps in providing business intelligence that can reduce costs and
improve the efficiency of operations. Processes like quality assurance and testing
can involve many complications, particularly in industries like biopharmaceuticals
and nanotechnologies.</p>
<p>• Time Reductions</p>
<p>Companies can collect data from a variety of sources using real-time in-memory
analytics. Tools like Hadoop enable businesses to evaluate data quickly, allowing
them to make swift decisions based on their findings.</p>
<p>• Understand the Market Conditions</p>
<p>Businesses can gain a better grasp of market conditions through big data
analysis. Analysing customer purchase behaviour, for example, enables businesses
to identify their most popular items and develop them accordingly. This allows
businesses to stay ahead of the competition.</p>
<p>• Social Media Listening</p>
<p>Companies can perform sentiment analysis using Big Data tools. These tools give
them feedback about their company, that is, who is saying what about the company.
Companies can also use Big Data tools to improve their online presence.</p>
<p>• Using Big Data Analytics to Boost Customer Acquisition and Retention</p>
<p>Customers are a crucial asset that every company relies on. Without a strong
customer base, no company can be successful. However, even with a strong customer
base, businesses cannot ignore market competition. It will be difficult for
businesses to succeed if they do not understand what their customers want; the
result is a loss of customers, which has a negative impact on business growth.
Businesses can use big data analytics to detect customer-related trends and
patterns. Customer behaviour analysis is the key to a successful business.</p>
<p>• Using Big Data Analytics to Solve Advertisers' Problems and Offer Marketing
Insights</p>
<p>Big data analytics shapes all company activities. It allows businesses to meet
customer expectations, helps them adjust their product range, and ensures that
marketing initiatives are effective.</p>
<p>• Big Data Analytics as a Driver of Innovations and Product Development</p>
<p>Companies can use big data to innovate and revamp their products.</p>
<p>1.3 Applications of BIG DATA</p>
<p>All of this data must be recorded and processed, which takes considerable
expertise, resources, and time. Data can be used creatively and meaningfully to
provide business benefits. There are three types of business applications, each
with a different degree of transformational potential, as shown in Figure 4.</p>
<p>Figure 4 Applications of Big Data</p>
<p>Monitoring and tracking applications</p>
<p>These are the first and most fundamental Big Data applications. They help
increase business efficiency in practically all industries. The following are a
few examples of specialised applications:</p>
<p>• Public health monitoring</p>
<p>The US government is encouraging all healthcare stakeholders to establish a
national platform for interoperability and data sharing standards. This would
enable secondary use of health data, which would advance Big Data analytics and
personalized, holistic precision medicine. It would be a broad-based platform like
Google Flu Trends.</p>
<p>Figure 5 Public health monitoring</p>
<p>• Consumer Sentiment Monitoring</p>
<p>Social media has become more powerful than advertising. Many companies have
moved the bulk of their advertising budgets from traditional media to social
media. They have set up Big Data listening platforms, where social media data
streams (including tweets, Facebook posts, and blog posts) are filtered and
analysed for certain keywords or sentiments, by demographic and by region.
Actionable information from this analysis is delivered to marketing professionals
for appropriate action, especially when the product is new to the market.</p>
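<p>The filtering step of such a listening platform can be sketched in a few lines. This is a toy illustration under stated assumptions: the word lists, the product name AcmePhone, the sample posts, and the functions score and monitor are all hypothetical, and real platforms use far richer sentiment models than keyword counting.</p>

```python
from collections import Counter

# Tiny hand-made sentiment lexicons (assumptions for illustration only).
POSITIVE = {"love", "great", "good"}
NEGATIVE = {"hate", "bad", "broken"}


def score(text):
    """Count positive words minus negative words in a post."""
    words = [w.strip(".,!?").lower() for w in text.split()]
    return sum(w in POSITIVE for w in words) - sum(w in NEGATIVE for w in words)


def monitor(posts, keyword):
    """Tally sentiment by region for posts that mention the keyword."""
    tally = {}
    for post in posts:
        if keyword.lower() not in post["text"].lower():
            continue  # filter out posts that do not mention the product
        s = score(post["text"])
        label = "positive" if s > 0 else "negative" if s < 0 else "neutral"
        tally.setdefault(post["region"], Counter())[label] += 1
    return tally


posts = [
    {"region": "north", "text": "I love the new AcmePhone, great camera!"},
    {"region": "north", "text": "AcmePhone screen arrived broken."},
    {"region": "south", "text": "Thinking about buying an AcmePhone."},
    {"region": "south", "text": "Great weather today."},
]
result = monitor(posts, "AcmePhone")
print(result)
```

<p>The per-region tallies are the kind of actionable summary that could be passed on to marketing staff.</p>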
<p>Three major types of business applications: 1. Monitoring and tracking applications; 2. Analysis and insight applications; 3. New product development.</p>
<p>Figure 6 Consumer Sentiment Monitoring</p>
<p>• Asset Tracking</p>
<p>Figure 7 Asset Tracking</p>
<p>The US Department of Defense is encouraging industry to devise a tiny RFID chip
that could prevent the counterfeiting of electronic parts that end up in avionics
or in circuit boards for other devices. Airplanes are among the heaviest users of
sensors, which track every aspect of the performance of every part of the plane.
The data can be displayed on a dashboard as well as stored for later detailed
analysis. Working with communicating devices, these sensors can produce a torrent
of data.</p>
<p>Theft by shoppers and employees is a major source of lost revenue for
retailers. All valuable items in the store can be assigned RFID tags, and the
gates of the store can be equipped with RF readers. This can help secure the
products and reduce leakage (theft) from the store.</p>
<p>• Supply chain monitoring</p>
<p>All containers on ships communicate their status and location using RFID tags.
Retailers and their suppliers can thus gain real-time visibility into inventory
throughout the global supply chain. Retailers can know exactly where the items are
in the warehouse, and so can bring them into the store at the right time. This is
particularly relevant for seasonal items that must be sold on time, or else they
will be sold at a discount. With item-level RFID tags, retailers also gain full
visibility of each item and can serve their customers better.</p>
<p>Figure 8 Supply chain monitoring</p>
<p>• Preventive machine maintenance</p>
<p>All machines, including cars and computers, tend to fail sometimes, because one
or more of their components may cease to function. As a preventive measure,
valuable equipment can be fitted with sensors. The continuous stream of data from
the sensors can be monitored and analysed to forecast the status of key components
and thus track the overall health of the machine. Preventive maintenance can
thereby reduce the cost of downtime.</p>
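<p>A minimal sketch of monitoring a sensor stream is shown below. It assumes a simple rule, raising an alert when the rolling mean of recent readings exceeds a limit; the function monitor_sensor, the window size, the limit, and the temperature values are all illustrative, and real systems typically use more sophisticated forecasting.</p>

```python
from collections import deque


def monitor_sensor(readings, window=5, limit=80.0):
    """Return (index, rolling_mean) pairs where the rolling mean exceeds limit."""
    recent = deque(maxlen=window)  # holds only the latest `window` readings
    alerts = []
    for i, value in enumerate(readings):
        recent.append(value)
        if len(recent) == window:
            mean = sum(recent) / window
            if mean > limit:
                alerts.append((i, mean))
    return alerts


# Hypothetical bearing temperatures that drift upward before a failure.
temps = [70, 71, 72, 71, 73, 78, 84, 90, 95, 99]
alerts = monitor_sensor(temps, window=3, limit=85.0)
print(alerts)
```

<p>Averaging over a window rather than reacting to single readings makes the alert less sensitive to one-off spikes, at the cost of a short delay.</p>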
<p>Figure 9 Preventive maintenance</p>
<p>Analysis and Insight Applications</p>
<p>These are the next generation of big data applications. They can improve
business effectiveness and have transformational potential. Big Data can be
organised and analysed to reveal trends and insights that can be used to improve
business. Examples include:</p>
<p>• Predictive Policing</p>
<p>• Winning political elections</p>
<p>• Personal Health</p>
<p>Predictive Policing</p>
<p>The notion of predictive policing was created by the Los Angeles Police
Department. The LAPD collaborated with UC Berkeley academics to examine its massive
database of 13 million crimes spanning 80 years and forecast the likelihood of
particular types of crime occurring at specific times and in specific areas. They
identified hotspots where crimes of certain categories had happened and were
likely to occur in the future. Crime patterns were modelled statistically,
starting from a basic insight borrowed from earthquakes and their aftershocks. The
model held that a crime occurring in a location represented a certain disturbance
in harmony, and would thus lead to a greater likelihood of a similar crime
occurring in the local vicinity soon. For each police beat, the model showed the
specific neighborhood blocks and time slots where crime was likely to occur. By
aligning the patrol schedules of its police cars with the model's predictions, the
LAPD could reduce crime by 12 percent to 26 percent for different categories of
crime. Recently, the San Francisco Police Department released its own crime data
covering more than 2 years, so that data analysts could model the data and help
prevent future crimes.</p>
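<p>The aftershock idea can be illustrated with a toy risk score: each past event adds a boost to nearby locations, and the boost fades with time. This is not the LAPD's actual model; the grid cells, the function risk, and the parameters background, boost, and decay are hypothetical values chosen only to show the shape of the idea.</p>

```python
import math


def risk(cell, now, events, background=0.1, boost=1.0, decay=0.5):
    """Background rate plus a decaying boost from each recent nearby event.

    cell   -- (x, y) grid coordinates of the block being scored
    events -- list of (x, y, time) tuples for past incidents
    """
    total = background
    for ex, ey, et in events:
        # "Nearby" here means the same grid cell or one of its 8 neighbours.
        near = abs(ex - cell[0]) <= 1 and abs(ey - cell[1]) <= 1
        if near and et <= now:
            total += boost * math.exp(-decay * (now - et))
    return total


events = [(2, 2, 0.0), (2, 3, 1.0), (9, 9, 1.5)]
near_risk = risk((2, 2), now=2.0, events=events)
far_risk = risk((5, 5), now=2.0, events=events)
print(near_risk, far_risk)
```

<p>A block close to recent incidents scores higher than a distant one, and its score falls back toward the background level as time passes without new events.</p>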
<p>Figure 10 Predictive policing</p>
<p>Winning political elections</p>
<p>The US president, Barack Obama, was the first major political candidate to use
big data in a significant way, in the 2008 elections. He is the first big data
president. His campaign gathered data about millions of people, including
supporters. They invented the mechanism to obtain small campaign contributions from
millions of supporters. They</p>
<p>Notes</p>
<p>• Preventive machine maintenance</p>
<p>All machines, including cars and computers, do tend to fail sometimes. This is
because one or more or their components may cease to function. As a preventive
measure, precious equipment could be equipped with sensors.The continuous stream of
data from the sensors could be monitored and analyzed to forecast the status of key
components, and thus, monitor the overall machine’s health. Preventive maintenance
can, thus, reduce the cost of downtime.</p>
<p>Figure 9 Preventive maintenance</p>
<p>Analysis and Insight Applications</p>
<p>These are the next generation of big data apps. They have the ability to improve
corporate effectiveness and have transformational potential.Big Data may be
organised and analysed to reveal trends and insights that can be utilised to
improve business. • Predictive Policing • Winning political elections</p>
<p>• Personal Health</p>
<p>Predictive Policing</p>
<p>The notion of predictive policing was created by the Los Angeles Police
Department. The LAPD collaborated with UC Berkeley academics to examine its massive
database of 13 million crimes spanning 80 years and to forecast the likelihood of
particular types of crime occurring at specific times and in specific areas. They
identified hotspots where crimes of certain categories had happened and were likely
to occur in the future. Starting from a basic insight derived from a metaphor of
earthquakes and their aftershocks, crime patterns were modelled statistically. The
model said that once a crime occurred in a location, it represented a certain
disturbance in harmony and would thus lead to a greater likelihood of a similar
crime occurring in the local vicinity soon. For each police beat, the model showed
the specific neighborhood blocks and time slots where crime was likely to occur. By
aligning the patrol schedule of its police cars with the model's predictions, the
LAPD could reduce crime by 12 percent to 26 percent for different categories of
crime. More recently, the San Francisco Police Department released its own crime
data covering more than two years, so that data analysts could model that data and
help prevent future crimes.</p>
<p>Figure 10 Predictive policing</p>
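<p>The aftershock metaphor can be expressed as a self-exciting model, in which every past event temporarily raises the expected rate of new events nearby. The sketch below is an illustration of that general idea only and is not the model used by the LAPD. The function name and the parameters (baseline, boost, decay) are assumptions chosen for readability.</p>

```python
import math


def crime_intensity(t, past_event_times, baseline=0.2, boost=0.5, decay=1.0):
    """Expected event rate at time t under a simple self-exciting model.

    Each earlier event adds a contribution that fades exponentially with
    the time elapsed since that event.
    """
    triggered = sum(
        boost * math.exp(-decay * (t - s))
        for s in past_event_times
        if s < t  # only events that have already happened count
    )
    return baseline + triggered


# Hypothetical example: one burglary recorded at time 1.0 (arbitrary units).
print(crime_intensity(1.1, [1.0]))  # shortly after the event
print(crime_intensity(5.0, [1.0]))  # much later, close to the baseline again
```

<p>The rate is highest immediately after an event and decays back towards the baseline, which mirrors the claim that a similar crime is more likely in the same vicinity soon afterwards.</p>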
<p>Winning political elections</p>
<p>US President Barack Obama was the first major political candidate to use big
data in a significant way, in the 2008 elections. He is the first big data
president. His campaign gathered data about millions of people, including
supporters. They invented the mechanism to obtain small campaign contributions from
millions of supporters. They created personal profiles of millions of supporters,
recording what they had done and could do for the campaign. Data was used to
identify undecided voters who could be converted to their side, and the phone
numbers of these undecided voters were provided to the volunteers.</p>
<p>The results of the calls were recorded in real time using interactive web
applications.</p>
<p>Obama himself used his Twitter account to communicate his message directly to
his millions of followers. After the elections, Obama converted his list of tens of
millions of supporters into an advocacy machine that would provide grassroots
support for the president's initiatives. Since then, almost all campaigns have used
big data.</p>
<p>Figure 11 Winning political elections</p>
<p>Senator Bernie Sanders used the same big data playbook to build an effective
national political machine powered entirely by small donors. Election analyst Nate
Silver created sophisticated predictive models using inputs from many political
polls and surveys, and won over pundits by successfully predicting the winners of
US elections. Silver was, however, unsuccessful in predicting Donald Trump's rise
and ultimate victory, which shows the limits of big data.</p>
<p>Personal health</p>
<p>Medical knowledge and technology are growing by leaps and bounds. IBM's Watson
system is a big data analytics engine designed to ingest and digest vast amounts of
medical information and then apply it intelligently to an individual situation.
Watson is intended to support a detailed medical diagnosis using current symptoms,
patient history, medical history, environmental trends and other parameters.
Similar products might be offered as an app to licensed doctors, and even to
individuals, to improve productivity and accuracy in health care.</p>
<p>New Product Development</p>
<p>These are completely new notions that did not exist previously. These
applications have the ability to disrupt whole sectors and provide organisations
with new revenue streams.</p>
<p>• Flexible Auto Insurance</p>
<p>• Location based retail promotion</p>
<p>• Recommendation service</p>
<p>Flexible Auto Insurance</p>
<p>An auto insurance company can use the GPS data from cars to calculate the risk
of accidents based on travel patterns. Automobile companies can use car sensor data
to track the performance of a car. Safer drivers can be rewarded and errant drivers
can be penalized.</p>
<p>Figure 12 GPS vehicle tracking system</p>
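<p>A usage-based risk score could be computed along the following lines. This is a hypothetical sketch: the <code>Trip</code> fields, the weights and the formula are assumptions for illustration, not an insurer's actual pricing method.</p>

```python
from dataclasses import dataclass


@dataclass
class Trip:
    distance_km: float
    hard_brakes: int       # count of sudden braking events (hypothetical sensor field)
    night_fraction: float  # share of the trip driven at night, from 0.0 to 1.0


def risk_score(trips, brake_weight=1.0, night_weight=2.0):
    """Combine braking events per 100 km and night driving into one score."""
    total_km = sum(t.distance_km for t in trips)
    if total_km == 0:
        return 0.0
    brakes_per_100km = 100 * sum(t.hard_brakes for t in trips) / total_km
    night_share = sum(t.distance_km * t.night_fraction for t in trips) / total_km
    return brake_weight * brakes_per_100km + night_weight * night_share


# Two made-up drivers who cover the same distance.
careful = [Trip(20, 0, 0.0), Trip(30, 1, 0.1)]
risky = [Trip(20, 4, 0.8), Trip(30, 5, 0.5)]
print(risk_score(careful), risk_score(risky))
```

<p>A lower score would translate into a reward such as a discount, and a higher score into a surcharge.</p>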
<p>Location based retail promotion</p>
<p>A retailer or a third-party advertiser can target customers with specific
promotions and coupons based on location data obtained through the Global
Positioning System (GPS), the time of day and the presence of stores nearby, and by
mapping these to the consumer preference data available from social media
databases. Advertisements and offers can be delivered through mobile apps, SMS and
email. These are examples of mobile applications.</p>
<p>Figure 13 Location based retail promotion</p>
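<p>Selecting offers by proximity can be sketched with the haversine formula, which gives the great-circle distance between two latitude and longitude points. The store records, coordinates, offers and the one-kilometre radius below are made up for illustration.</p>

```python
import math


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in kilometres between two points on Earth."""
    r = 6371.0  # mean Earth radius in km
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = math.radians(lat2 - lat1)
    dlmb = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2
    return 2 * r * math.asin(math.sqrt(a))


def nearby_offers(customer, stores, radius_km=1.0):
    """Return the offers of all stores within radius_km of the customer."""
    lat, lon = customer
    return [
        s["offer"]
        for s in stores
        if haversine_km(lat, lon, s["lat"], s["lon"]) <= radius_km
    ]


# Hypothetical customer position and store list.
customer = (28.6139, 77.2090)
stores = [
    {"name": "Cafe A", "lat": 28.6150, "lon": 77.2100, "offer": "10% off coffee"},
    {"name": "Shop B", "lat": 28.7041, "lon": 77.1025, "offer": "Free delivery"},
]
print(nearby_offers(customer, stores))
```

<p>In practice the result would be filtered further by time of day and by the customer's known preferences before an offer is sent.</p>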
<p>Recommendation service</p>
<p>Ecommerce has been a fast-growing industry over the last couple of decades. A
variety of products are sold and shared over the internet. Web users' browsing and
purchase history on ecommerce sites is used to learn about their preferences and
needs, and to advertise relevant product and pricing offers in real time. Amazon
uses a personalized recommendation engine to suggest additional products to
consumers based on the affinities of various products.</p>
<p>Figure 14 Recommendation Service</p>
<p>Netflix also uses a recommendation engine to suggest entertainment options to
its users. Big data is valuable across all industries.</p>
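<p>Product affinity can be approximated by counting how often two items appear in the same purchase. The sketch below is a deliberately small illustration of that idea and is not how Amazon or Netflix implement their engines. The basket data and function names are hypothetical.</p>

```python
from collections import Counter
from itertools import combinations


def build_affinities(baskets):
    """Count how often each pair of items is bought together."""
    pair_counts = Counter()
    for basket in baskets:
        for a, b in combinations(sorted(set(basket)), 2):
            pair_counts[(a, b)] += 1
    return pair_counts


def recommend(item, pair_counts, top_n=1):
    """Return the items most frequently bought together with `item`."""
    scores = Counter()
    for (a, b), count in pair_counts.items():
        if a == item:
            scores[b] += count
        elif b == item:
            scores[a] += count
    return [other for other, _ in scores.most_common(top_n)]


# Made-up purchase histories.
baskets = [
    ["camera", "memory card", "tripod"],
    ["camera", "memory card"],
    ["camera", "bag"],
    ["phone", "memory card"],
]
affinities = build_affinities(baskets)
print(recommend("camera", affinities))
```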
<p>There are three major types of data sources of big data: people-to-people
communication, people-machine communication and machine-machine communication.
Each type has many sources of data. There are also three types of applications:
the monitoring type, the analysis type and new product development. They have an
impact on efficiency, effectiveness and even disruption of industries.</p>
<p>1.4 Tools used in BIG DATA</p>
<p>A number of tools are used in big data. The most popular are described
below.</p>
<p>Apache Hadoop</p>
<p>The Apache Hadoop software library is a big data framework. It enables massive
data sets to be processed across clusters of computers in a distributed manner. It
is one of the most powerful big data technologies, with the ability to scale from a
single server to thousands of computers.</p>
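<p>Hadoop's classic processing model is MapReduce. The sketch below imitates its three stages (map, shuffle, reduce) in a single Python process to show the flow of data. It does not use the Hadoop API, and the function names are chosen only for this illustration. On a real cluster the map and reduce calls would run in parallel on different machines.</p>

```python
from collections import defaultdict


def map_phase(line):
    """Emit a (word, 1) pair for every word in one input line."""
    return [(word.lower(), 1) for word in line.split()]


def shuffle(pairs):
    """Group all emitted values by key, as the framework does between stages."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Combine all values for one key into a single result."""
    return key, sum(values)


def word_count(lines):
    mapped = [pair for line in lines for pair in map_phase(line)]
    return dict(reduce_phase(key, values) for key, values in shuffle(mapped).items())


print(word_count(["big data", "Big tools"]))
```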
<p>Features</p>
<p>• Improved authentication when using an HTTP proxy server</p>
<p>• Specification for the Hadoop Compatible Filesystem effort</p>
<p>• Support for POSIX-style filesystem extended attributes</p>
<p>• It offers a robust ecosystem of big data technologies and tools that is well
suited to the analytical needs of developers</p>
<p>• It brings flexibility to data processing</p>
<p>• It allows for faster data processing</p>
<p>HPCC</p>
<p>HPCC is a big data tool developed by LexisNexis Risk Solutions. It delivers data
processing on a single platform, with a single architecture and a single
programming language.</p>
<p>Features</p>
<p>• It is a highly efficient big data tool that accomplishes big data tasks with
far less code</p>
<p>• It offers high redundancy and availability</p>
<p>• It can be used for complex data processing on a Thor cluster</p>
<p>• Its graphical IDE simplifies development, testing and debugging</p>
<p>• It automatically optimizes code for parallel processing</p>
<p>• It provides enhanced scalability and performance</p>
<p>• ECL code compiles into optimized C++, and it can also be extended using C++
libraries</p>
<p>Apache Storm</p>
<p>Storm is a free, open-source big data computation system. It offers a
distributed, fault-tolerant processing system with real-time computation
capabilities.</p>
<p>Features</p>
<p>• It has been benchmarked as processing one million 100-byte messages per second
per node</p>
<p>• It uses parallel calculations that run across a cluster of machines</p>
<p>• It restarts automatically if a node dies; the worker is restarted on another
node</p>
<p>• Storm guarantees that each unit of data will be processed at least once or
exactly once</p>
<p>• Once deployed, Storm is one of the easiest tools to operate for big data
analysis</p>
<p>Qubole</p>
<p>Qubole Data is an autonomous big data management platform. It is a big data
open-source tool that is self-managed and self-optimizing, and allows the data team
to focus on business outcomes.</p>
<p>Features</p>
<p>• Single platform for every use case</p>
<p>• Open-source big data software with engines optimized for the cloud</p>
<p>• Comprehensive security, governance and compliance</p>
<p>• Provides actionable alerts, insights and recommendations to optimize
reliability, performance and costs</p>
<p>• Automatically enacts policies to avoid performing repetitive manual
actions</p>
<p>Apache Cassandra</p>
<p>The Apache Cassandra database is widely used today to manage large amounts of
data effectively.</p>
<p>Features</p>
<p>• Support for replicating across multiple data centers, providing lower latency
for users</p>
<p>• Data is automatically replicated to multiple nodes for fault tolerance</p>
<p>• It is well suited to applications that cannot afford to lose data, even when
an entire data center is down</p>
<p>• Support contracts and services for Cassandra are available from third
parties</p>
<p>Statwing</p>
<p>Statwing is an easy-to-use statistical tool. It was built by and for big data
analysts. Its modern interface chooses statistical tests automatically.</p>
<p>Features</p>
<p>• It can explore any data in seconds</p>
<p>• It helps to clean data, explore relationships and create charts in
minutes</p>
<p>• It allows creating histograms, scatterplots, heatmaps and bar charts that
export to Excel or PowerPoint</p>
<p>• It translates results into plain English, so that analysts unfamiliar with
statistical analysis can understand them</p>
<p>CouchDB</p>
<p>CouchDB stores data in JSON documents that can be accessed over the web or
queried using JavaScript. It offers distributed scaling with fault-tolerant
storage. It allows data to be accessed and synchronized through the Couch
Replication Protocol.</p>
<p>Features</p>
<p>• CouchDB is a single-node database that works like any other database</p>
<p>• It allows running a single logical database server on any number of
servers</p>
<p>• It makes use of the ubiquitous HTTP protocol and JSON data format</p>
<p>• Easy replication of a database across multiple server instances</p>
<p>• Easy interface for document insertion, updates, retrieval and deletion</p>
<p>• The JSON-based document format is translatable across different languages</p>
<p>Pentaho</p>
<p>Pentaho provides big data tools to extract, prepare and blend data. It offers
visualizations and analytics that change the way to run any business. This Big data
tool allows turning big data into big insights.</p>
<p>Features</p>
<p>• Data access and integration for effective data visualization</p>
<p>• It empowers users to architect big data at the source and stream it for
accurate analytics</p>
<p>• Seamlessly switch or combine data processing with in-cluster execution to get
maximum processing</p>
<p>• Allows checking data with easy access to analytics, including charts,
visualizations and reporting</p>
<p>• Supports a wide spectrum of big data sources by offering unique
capabilities</p>
<p>Apache Flink</p>
<p>Apache Flink is an open-source data analytics tool for stream processing of big
data. It supports distributed, high-performing, always-available and accurate data
streaming applications.</p>
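<p>Stream processors such as Flink commonly group records into windows according to the time at which each event occurred, rather than the time at which it arrived. The plain-Python sketch below illustrates fixed (tumbling) event-time windows. It does not use the Flink API, it ignores refinements such as watermarks, and the event tuples are made up.</p>

```python
from collections import defaultdict


def tumbling_window_counts(events, window_seconds=60):
    """Count events per fixed-size window, keyed by the window's start time.

    `events` is an iterable of (event_time_in_seconds, payload) tuples and
    may arrive in any order.
    """
    counts = defaultdict(int)
    for event_time, _payload in events:
        window_start = (event_time // window_seconds) * window_seconds
        counts[window_start] += 1
    return dict(sorted(counts.items()))


# The event stamped 30 arrives after the one stamped 70, yet it is still
# counted in the first window because assignment uses the event time.
events = [(5, "a"), (70, "b"), (30, "c"), (65, "d"), (125, "e")]
print(tumbling_window_counts(events))
```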
<p>Features</p>
<p>• Provides results that are accurate, even for out-of-order or late-arriving
data</p>
<p>• It is stateful and fault-tolerant and can recover from failures</p>
<p>• It can perform at a large scale, running on thousands of nodes</p>
<p>• Has good throughput and latency characteristics</p>
<p>• Supports stream processing and windowing with event time semantics</p>
<p>• Supports flexible windowing based on time, count or sessions, as well as
data-driven windows</p>
<p>• Supports a wide range of connectors to third-party systems for data sources
and sinks</p>
<p>Cloudera</p>
<p>Cloudera is a fast, easy-to-use and highly secure modern big data platform. It
allows anyone to get any data across any environment within a single, scalable
platform.</p>
<p>Features</p>
<p>• High-performance big data analytics software</p>
<p>• It offers provision for multi-cloud</p>
<p>• Deploy and manage Cloudera Enterprise across AWS, Microsoft Azure and Google
Cloud Platform</p>
<p>• Spin up and terminate clusters, and pay only for what is needed, when it is
needed</p>
<p>• Developing and training data models</p>
<p>• Reporting, exploring and self-servicing business intelligence</p>
<p>• Delivering real-time insights for monitoring and detection</p>
<p>• Conducting accurate model scoring and serving</p>
<p>OpenRefine</p>
<p>OpenRefine is a powerful big data tool. It helps to work with messy data,
cleaning it and transforming it from one format into another. It can also be
extended with web services and external data.</p>
<p>Features</p>
<p>• Helps you explore large data sets with ease</p>
<p>• Can be used to link and extend your dataset with various web services</p>
<p>• Imports data in various formats</p>
<p>• Explore datasets in a matter of seconds</p>
<p>• Apply basic and advanced cell transformations</p>
<p>• Allows dealing with cells that contain multiple values</p>
<p>• Create instantaneous links between datasets</p>
<p>• Use named-entity extraction on text fields to automatically identify
topics</p>
<p>• Perform advanced data operations with the help of the Refine Expression
Language</p>
<p>RapidMiner</p>
<p>RapidMiner is an open-source data analytics tool. It is used for data
preparation, machine learning and model deployment. It offers a suite of products
to build new data mining processes and set up predictive analysis.</p>
<p>Features</p>
<p>• Allows multiple data management methods</p>
<p>• GUI or batch processing</p>
<p>• Integrates with in-house databases</p>
<p>• Interactive, shareable dashboards</p>
<p>• Big Data predictive analytics</p>
<p>• Remote analysis processing</p>
<p>• Data filtering, merging, joining and aggregating</p>
<p>• Build, train and validate predictive models</p>
<p>• Store streaming data to numerous databases</p>
<p>• Reports and triggered notifications</p>
<p>DataCleaner</p>
<p>DataCleaner is a data quality analysis application and a solution platform. It
has a strong data profiling engine. It is extensible, and can thereby add data
cleansing, transformations, matching and merging.</p>
<p>Features</p>
<p>• Interactive and explorative data profiling</p>
<p>• Fuzzy duplicate record detection</p>
<p>• Data transformation and standardization</p>
<p>• Data validation and reporting</p>
<p>• Use of reference data to cleanse data</p>
<p>• Master the data ingestion pipeline in a Hadoop data lake</p>
<p>• Ensure that rules about the data are correct before users spend their time on
processing</p>
<p>• Find outliers and other devilish details in order to either exclude or fix the
incorrect data</p>
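<p>Fuzzy duplicate detection means finding records that are nearly, but not exactly, the same. The sketch below shows the general idea using the similarity ratio from Python's standard <code>difflib</code> module. It is not DataCleaner's own algorithm, and the threshold and sample records are assumptions.</p>

```python
from difflib import SequenceMatcher
from itertools import combinations


def normalize(text):
    """Lower-case the text and collapse repeated whitespace."""
    return " ".join(text.lower().split())


def fuzzy_duplicates(records, threshold=0.85):
    """Return index pairs of records whose normalized text is highly similar."""
    pairs = []
    for (i, a), (j, b) in combinations(enumerate(records), 2):
        ratio = SequenceMatcher(None, normalize(a), normalize(b)).ratio()
        if ratio >= threshold:
            pairs.append((i, j))
    return pairs


# Made-up company names with a spacing variant and a typing error.
records = ["Acme Corporation", "ACME  Corporation", "Acme Corporatoin", "Globex Ltd"]
print(fuzzy_duplicates(records))
```

<p>Comparing every pair is only practical for small data sets; tools that work at scale usually group candidate records first and compare within each group.</p>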
<p>Kaggle</p>
<p>Kaggle is the world's largest big data community. It helps organizations and
researchers to post their data and statistics. It is a popular place to analyze
data seamlessly.</p>
<p>Features</p>
<p>• A place to discover and seamlessly analyze open data</p>
<p>• Search box to find open datasets</p>
<p>• Contribute to the open data movement and connect with other data
enthusiasts</p>
<p>Apache Hive</p>
<p>Hive is an open-source big data software tool. It allows programmers to analyze
large data sets on Hadoop. It helps with querying and managing large datasets
quickly.</p>
<p>Features</p>
<p>• It supports an SQL-like query language for interaction and data modeling</p>
<p>• It compiles the language into two main tasks, map and reduce</p>
<p>• It allows defining these tasks using Java or Python</p>
<p>• Hive is designed for managing and querying only structured data</p>
<p>• Hive's SQL-inspired language separates the user from the complexity of
MapReduce programming</p>
<p>• It offers a Java Database Connectivity (JDBC) interface</p>
<p>1.5 Challenges in BIG DATA</p>
<p>Lack of proper understanding of Big Data</p>
<p>Companies fail in their Big Data initiatives due to insufficient understanding.
Employees may not know what data is, or understand its storage, processing,
importance and sources. Data professionals may know what is going on, but others
may not have a clear picture. For example, if employees do not understand the
importance of data storage, they might not keep backups of sensitive data or use
databases properly for storage. As a result, when this important data is required,
it cannot be retrieved easily.</p>
<p>Big Data workshops and seminars must be held at companies for everyone. Basic
training programs must be arranged for all employees who handle data regularly and
are part of Big Data projects. A basic understanding of data concepts must be
instilled at all levels of the organization.</p>
<p>Data growth issues</p>
<p>One of the most pressing challenges of Big Data is storing all these huge sets
of data properly.</p>
<p>The amount of data being stored in the data centers and databases of companies
is increasing rapidly. As these data sets grow exponentially with time, they become
extremely difficult to handle. Most of the data is unstructured and comes from
documents, videos, audio, text files and other sources. This means that it cannot
be found in databases.</p>
<p>Solution</p>
<p>In order to handle these large data sets, companies are opting for modern
techniques, such as compression, tiering, and deduplication. Compression is used
for reducing the number of bits in the data, thus reducing its overall size.
Deduplication is the process of removing duplicate and unwanted data from a data
set. Data tiering allows companies to store data in different storage tiers. It
ensures that the data is residing in the most appropriate storage space.</p>
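<p>Deduplication and compression can be illustrated with Python's standard library. The sketch below treats the data as a list of byte blocks, keeps one copy of each distinct block by comparing SHA-256 digests, and then compresses what remains with zlib. The block contents are made up, and production storage systems use far more elaborate schemes.</p>

```python
import hashlib
import zlib


def deduplicate(blocks):
    """Keep only the first copy of each distinct block, compared by SHA-256 digest."""
    seen = set()
    unique = []
    for block in blocks:
        digest = hashlib.sha256(block).hexdigest()
        if digest not in seen:
            seen.add(digest)
            unique.append(block)
    return unique


# Three hypothetical blocks, two of which are identical.
blocks = [b"sensor=42;" * 100, b"sensor=42;" * 100, b"sensor=43;" * 100]
unique = deduplicate(blocks)
stored = [zlib.compress(block) for block in unique]

print(len(blocks), "blocks reduced to", len(unique))
print(sum(map(len, unique)), "bytes compressed to", sum(map(len, stored)))
```

<p>Compression is lossless here: <code>zlib.decompress</code> returns each original block unchanged.</p>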
<p>Data tiers can be public cloud, private cloud, and flash storage, depending on
the data size and importance. Companies are also opting for Big Data tools, such as
Hadoop, NoSQL and other technologies. This leads us to the third Big Data
problem.</p>
<p>Confusion while Big Data tool selection</p>
<p>Companies often get confused while selecting the best tool for Big Data analysis
and storage. Is HBase or Cassandra the best technology for data storage? Is Hadoop
MapReduce good enough, or will Spark be a better option for data analytics and
storage? These questions bother companies, and sometimes they are unable to find
the answers. They end up making poor decisions and selecting an inappropriate
technology.</p>
<p>As a result, money, time, effort and work hours are wasted.</p>
<p>Solution</p>
<p>The best way to go about it is to seek professional help. You can either hire
experienced professionals who know much more about these tools, or go for Big Data
consulting. Here, consultants will recommend the best tools based on your company's
scenario. Based on their advice, you can work out a strategy and then select the
best tool for you.</p>
<p>Lack of data professionals</p>
<p>To run these modern technologies and Big Data tools, companies need skilled data
professionals. These professionals include data scientists, data analysts and data
engineers who are experienced in working with the tools and making sense of huge
data sets. Companies face a shortage of Big Data professionals. This is because
data handling tools have evolved rapidly, but in most cases the professionals have
not.</p>
<p>Actionable steps need to be taken in order to bridge this gap.</p>
<p>Solution</p>
<p>Companies are investing more money in the recruitment of skilled professionals.
They also have to offer training programs to existing staff to get the most out of
them. Another important step taken by organizations is the purchase of data
analytics solutions that are powered by artificial intelligence and machine
learning. These tools can be run by professionals who are not data science experts
but have basic knowledge. This step helps companies save a lot of money on
recruitment.</p>
<p>Securing data</p>
<p>• Securing these huge sets of data is one of the daunting challenges of Big
Data. Companies are often so busy understanding, storing and analyzing their data
sets that they push data security to later stages. This is not a smart move, as
unprotected data repositories can become breeding grounds for malicious
hackers.</p>
<p>• Companies can lose up to $3.7 million for a stolen record or a data
breach.</p>
<p>Solution</p>
<p>• Companies are recruiting more cybersecurity professionals to protect their
data. Other steps taken for securing data include:</p>
<p>• Data encryption</p>
<p>• Data segregation</p>
<p>• Identity and access control</p>
<p>• Implementation of endpoint security</p>
<p>• Real-time security monitoring</p>
<p>Integrating data from a variety of sources</p>
<p>• Data in an organization comes from a variety of sources, such as social media
pages, ERP applications, customer logs, financial reports, e-mails, presentations
and reports created by employees. Combining all this data to prepare reports is a
challenging task, and it is an area often neglected by firms. But data integration
is crucial for analysis, reporting and business intelligence, so it has to be done
correctly.</p>
<p>Solution</p>
<p>Companies have to solve their data integration problems by purchasing the right
tools. Some of the best data integration tools are mentioned below:</p>
<p>• Talend Data Integration</p>
<p>• Centerprise Data Integrator</p>
<p>• ArcESB</p>
<p>• IBM InfoSphere</p>
<p>• Xplenty</p>
<p>• Informatica PowerCenter</p>
<p>• CloverDX</p>
<p>• Microsoft SQL</p>
<p>• QlikView</p>
<p>• Oracle Data Service Integrator</p>
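<p>To make the integration problem concrete, the following is a minimal sketch of
combining records from two differently formatted sources into one view. It is not
how any of the tools listed above work internally; the sample data, the field names
(<code>customer_id</code>, <code>name</code>, <code>tickets</code>) and the function
<code>integrate</code> are all hypothetical and chosen only for illustration.</p>

```python
import csv
import io
import json

# Hypothetical sample data standing in for two separate sources:
# a CSV export from a CRM system and a JSON feed from a support desk.
CRM_CSV = """customer_id,name
101,Asha
102,Ravi
"""

SUPPORT_JSON = (
    '[{"customer_id": "101", "tickets": 3},'
    ' {"customer_id": "103", "tickets": 1}]'
)


def integrate(crm_csv: str, support_json: str) -> dict:
    """Merge records from both sources into one dict keyed by customer_id."""
    merged = {}
    # Source 1: tabular data, one row per customer.
    for row in csv.DictReader(io.StringIO(crm_csv)):
        merged.setdefault(row["customer_id"], {})["name"] = row["name"]
    # Source 2: a list of JSON objects that share the same key.
    for rec in json.loads(support_json):
        merged.setdefault(rec["customer_id"], {})["tickets"] = rec["tickets"]
    return merged


if __name__ == "__main__":
    for cid, record in sorted(integrate(CRM_CSV, SUPPORT_JSON).items()):
        print(cid, record)
```

<p>Even in this small case, some customers appear in only one source, so the merged
records are incomplete. Handling such mismatches at scale is part of what makes
data integration challenging.</p>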
<p>In order to put Big Data to the best use, companies have to start doing things
differently. This means hiring better staff, changing the management, and reviewing
existing business policies and the technologies being used. To enhance decision
making, they can hire a Chief Data Officer, a step that has been taken by many of
the Fortune 500 companies.</p>
<p>Summary</p>
<p>• Big data refers to massive, difficult-to-manage volumes of data, both
structured and unstructured, that inundate enterprises on a daily basis. Big data
can be analysed for insights that help people make better judgments and feel more
confident about key business decisions.</p>
<p>• These are the most basic Big Data applications. They help improve company
efficiency in almost every industry.</p>
<p>• These are the big data apps of the future. They have the potential to alter
businesses and boost corporate effectiveness. Big data may be organised and
analysed to uncover patterns and insights that can be used to boost corporate
performance.</p>
<p>• These are brand-new concepts that didn't exist before. These applications have
the potential to disrupt whole industries and generate new income streams for
businesses.</p>
<p>• Apache Hadoop is a set of open-source software tools for solving problems that
involve large volumes of data and computation using a network of many computers. It
provides a software framework for distributed storage and processing of massive
data based on the MapReduce programming model.</p>
<p>• Apache Cassandra is a distributed, wide-column store, NoSQL database
management system that is designed to handle massive volumes of data across many
commodity servers while maintaining high availability and avoiding single points of
failure.</p>
<p>• Cloudera, Inc. is a Santa Clara, California-based start-up that offers a
subscription-based enterprise data cloud. Cloudera's platform, which is based on
open-source technology, leverages analytics and machine learning to extract
insights from data through a secure connection.</p>
<p>• RapidMiner is a data science software platform built by the company of the
same name. It offers a unified environment for data preparation, machine learning,
deep learning, text mining, and predictive analytics.</p>
<p>• Kaggle, a Google LLC subsidiary, is an online community of data scientists and
machine learning experts.</p>
<p>• LexisNexis Risk Solutions created HPCC, also known as DAS, an open-source
data-intensive computing system platform. The HPCC platform is based on a software
architecture that runs on commodity computing clusters and provides
high-performance, data-parallel processing for big data applications.</p>
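<p>The MapReduce programming model mentioned in the summary can be illustrated with
a word count. The sketch below runs in a single Python process and does not use
Hadoop's API; the function names <code>map_phase</code>, <code>shuffle</code>,
<code>reduce_phase</code> and <code>word_count</code> are assumptions made for this
example. In a real cluster, the map and reduce steps would run in parallel on many
machines.</p>

```python
from collections import defaultdict


def map_phase(document: str):
    """Map: emit a (word, 1) pair for every word in one document."""
    for word in document.lower().split():
        yield word, 1


def shuffle(pairs):
    """Shuffle: group all emitted values by their key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Reduce: combine the values for one key into a single result."""
    return key, sum(values)


def word_count(documents):
    """Run map, shuffle and reduce over a list of documents."""
    pairs = (pair for doc in documents for pair in map_phase(doc))
    return dict(reduce_phase(k, v) for k, v in shuffle(pairs).items())


if __name__ == "__main__":
    print(word_count(["big data is big", "data is everywhere"]))
```

<p>Because each document is mapped independently and each key is reduced
independently, the work can be split across many computers, which is the property
that frameworks such as Hadoop rely on.</p>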
<p>Keywords</p>
<p>Big Data: Big data refers to massive, difficult-to-manage volumes of data, both
structured and unstructured, that inundate enterprises on a daily basis. But it's
not simply the type or quantity of data that matters; it's also what businesses do
with it. Big data can be analysed for insights that help people make better
judgments and feel more confident about key business decisions.</p>
<p>Volume: Data is collected from transactions, smart (IoT) devices, industrial
equipment, videos, photos, audio, social media, and other sources. Previously,
keeping all of that data would have been too expensive; now, cheaper storage
options such as data lakes, Hadoop, and the cloud have eased the burden.</p>
<p>Velocity: Data floods into organisations at an unprecedented rate as the
Internet of Things grows, and it must be handled quickly. RFID tags, sensors, and
smart meters are driving the need to cope with these floods of data in near-real
time.</p>
<p>Variety: Data comes in a variety of formats, from structured, numeric data in
traditional databases to unstructured text documents, emails, videos, audio, stock
ticker data, and financial transactions.</p>
<p>Variability: In addition to the rising velocity and variety of data, data flows
are unpredictable, changing often and varying greatly. It's difficult, but
companies must recognise when something is trending on social media and know how to
manage high data loads on a daily, seasonal, and event-triggered basis.</p>
<p>Veracity: Veracity refers to the quality of data. Information's tough to
link, match, cleanse,</p>