
UNIT - 1 INTRODUCTION TO BIG DATA

STRUCTURE

1.0 Learning Objectives

1.1 Introduction

1.2 Big Data

1.3 Distributed file system

1.4 Big Data and its importance

1.5 Four Vs

1.6 Drivers for Big data

1.7 Big data analytics

1.8 Big data applications

1.9 Algorithms using MapReduce

1.10 Summary

1.11 Keywords

1.12 Learning Activity

1.13 Unit End Questions

1.14 References

1.0 LEARNING OBJECTIVES

After studying this unit, you will be able to:

 Describe the concept of Big Data.

 Define the Distributed File System.

 Explain the Four Vs.

 Elucidate Big Data analytics.

 Describe algorithms using MapReduce.

1.1 INTRODUCTION

Big data analytics is the often-complex process of examining big data to uncover information
such as hidden patterns, correlations, market trends and customer preferences that can help
organizations make informed business decisions.

On a broad scale, data analytics technologies and techniques give organizations a way to
analyse data sets and gather new information. Business intelligence (BI) queries answer basic
questions about business operations and performance.

Big data analytics is a form of advanced analytics, which involves complex applications with elements such as predictive models, statistical algorithms and what-if analysis powered by analytics systems.

The concept of big data has been around for years; most organizations now understand that if they capture all the data that streams into their businesses, they can apply analytics and get significant value from it. But even in the 1950s, decades before anyone uttered the term "big data," businesses were using basic analytics (essentially numbers in a spreadsheet that were manually examined) to uncover insights and trends.

The new benefits that big data analytics brings to the table, however, are speed and
efficiency. Whereas a few years ago a business would have gathered information, run
analytics and unearthed information that could be used for future decisions, today that
business can identify insights for immediate decisions. The ability to work faster and stay
agile gives organizations a competitive edge they didn’t have before.

Few dispute that organizations have more data than ever at their disposal. But actually deriving meaningful insights from that data and converting knowledge into action is easier said than done. Six senior leaders from major organizations have spoken about the challenges and opportunities involved in adopting advanced analytics: Murli Buluswar, chief science officer at AIG; Vince Campisi, chief information officer at GE Software; Ash Gupta, chief risk officer at American Express; Zoher Karu, vice president of global customer optimization and data at eBay; Victor Nilson, senior vice president of big data at AT&T; and Ruben Sigala, chief analytics officer at Caesars Entertainment.

The world of business intelligence software shifted acutely over the past couple of decades.
While the overall goal to achieve smarter, optimized business has not changed, the methods
of doing so are like baseball players in the Steroid Era: they’ve grown immensely. Two areas
of business intelligence, big data, and business analytics are the very definition of this new
world of business data.

Companies have started adopting an optimised method for the optimal distribution of
resources to carve the path of a company’s growth rather than relying on a trial-and-error
method. The best method of implementation has been incorporating techniques of big data
analysis. The business data acquired by large corporations is too complex to be processed by
conventional data processing applications. There are better ways to extract useful information
which can support proper decision making and help uncover patterns in otherwise random-looking data. These techniques form the core of big data analytics. There are many ways in
which small and medium businesses are leveraging big data to obtain the best possible
outcomes for their firms.

Big data analytics is the use of advanced analytic techniques against very large, diverse big
data sets that include structured, semi-structured and unstructured data, from different
sources, and in different sizes from terabytes to zettabytes.

What is big data exactly? It can be defined as data sets whose size or type is beyond the
ability of traditional relational databases to capture, manage and process the data with low
latency. Characteristics of big data include high volume, high velocity, and high variety.
Sources of data are becoming more complex than those for traditional data because they are
being driven by artificial intelligence (AI), mobile devices, social media, and the Internet of
Things (IoT). For example, the different types of data originate from sensors, devices,
video/audio, networks, log files, transactional applications, web, and social media — much of
it generated in real time and at a very large scale.

1.2 BIG DATA

Big Data has the ability to change the nature of a business. In fact, there are many firms
whose sole existence is based upon their capability to generate insights that only Big Data
can deliver. Businesses need to understand that Big Data is not just about technology—it is
also about how these technologies can propel an organization forward.

Big data is a term that describes the large volume of data – both structured and unstructured –
that inundates a business on a day-to-day basis. But it’s not the amount of data that’s
important. It's what organizations do with the data that matters. Big data can be analysed for
insights that lead to better decisions and strategic business moves.

Big data is a field that treats ways to analyse, systematically extract information from, or otherwise deal with data sets that are too large or complex to be dealt with by traditional data-processing application software. Data with many cases (rows) offer greater statistical power, while data with higher complexity (more attributes or columns) may lead to a higher false discovery rate. Big data analysis challenges include capturing data, data storage, data analysis, search, sharing, transfer, visualization, querying, updating, information privacy, and data source. Big data was originally associated with three key concepts: volume, variety, and velocity. The analysis of big data presents challenges in sampling, where previously only observations and samples were possible. Therefore, big data often includes data with sizes that exceed the capacity of traditional software to process within an acceptable time and at an acceptable cost.

Current usage of the term big data tends to refer to the use of predictive analytics, user
behaviour analytics, or certain other advanced data analytics methods that extract value from
big data, and seldom to a particular size of data set. "There is little doubt that the quantities of
data now available are indeed large, but that's not the most relevant characteristic of this new
data ecosystem." Analysis of data sets can find new correlations to "spot business trends,
prevent diseases, combat crime and so on". Scientists, business executives, medical practitioners, advertisers, and governments alike regularly meet difficulties with large datasets in areas including Internet searches, fintech, healthcare analytics, geographic information systems, urban informatics, and business informatics. Scientists encounter limitations in e-Science work, including meteorology, genomics, connectomics, complex physics simulations, biology, and environmental research.

90% of the available data has been created in the last two years, and the term Big Data has been around since 2005, when it was launched by O'Reilly Media. However, the usage of
Big Data and the need to understand all available data has been around much longer.

In fact, the earliest records of using data to track and control businesses date back to 7,000 years ago, when accounting was introduced in Mesopotamia in order to record the growth of crops and herds. Accounting principles continued to improve, and in 1663, John Graunt recorded and examined all information about mortality rolls in London. He wanted to gain an understanding of, and build a warning system for, the ongoing bubonic plague. In the first recorded example of statistical data analysis, he gathered his findings in the book Natural and Political Observations Made upon the Bills of Mortality, which provides great insights into the causes of death in the seventeenth century. Because of his work, Graunt can be considered the father of statistics. From there on, accounting principles improved, but nothing spectacular happened until the information age started in the 20th century. The earliest milestone of the modern data era dates from 1887, when Herman Hollerith invented a computing machine that could read holes punched into paper cards in order to organize census data.

History of Big Data

The term “big data” refers to data that is so large, fast, or complex that it’s difficult or
impossible to process using traditional methods. The act of accessing and storing large
amounts of information for analytics has been around a long time.

 The 20th Century

The first major data project was created in 1937, ordered by Franklin D. Roosevelt's administration in the USA. After the Social Security Act became law in 1935, the government had to keep track of contributions from 26 million Americans and more than 3 million employers. IBM got the contract to develop a punch-card-reading machine for this massive bookkeeping project.

The first data-processing machine appeared in 1943 and was developed by the British to decipher Nazi codes during World War II. This device, named Colossus, searched for patterns in intercepted messages at a rate of 5,000 characters per second, thereby reducing the task from weeks to merely hours.

In 1952 the National Security Agency (NSA) was created, and within 10 years it contracted more than 12,000 cryptologists. They were confronted with information overload during the Cold War as they started collecting and processing intelligence signals automatically.

In 1965 the United States government decided to build the first data centre to store over 742 million tax returns and 175 million sets of fingerprints by transferring all those records onto magnetic computer tape to be stored in a single location. The project was later dropped out of fear of 'Big Brother', but it is generally accepted that it marked the beginning of the electronic data storage era.

In 1989 British computer scientist Tim Berners-Lee invented the World Wide Web. He wanted to facilitate the sharing of information via a 'hypertext' system. Little could he have known at the time the impact his invention would have.

As of the '90s the creation of data was spurred on as more and more devices were connected to the internet. In 1995 the first super-computer was built, able to do as much work in a second as a calculator operated by a single person could do in 30,000 years.

 The 21st Century

In 2005 Roger Mougalas from O’Reilly Media coined the term Big Data for the first
time, only a year after they created the term Web 2.0. It refers to a large set of data that is
almost impossible to manage and process using traditional business intelligence tools.

2005 is also the year that Hadoop was created at Yahoo!, built on top of Google's MapReduce. Its goal was to index the entire World Wide Web; nowadays the open-source Hadoop is used by a lot of organizations to crunch through huge amounts of data.

As more and more social networks start appearing and the Web 2.0 takes flight, more
and more data is created on a daily basis. Innovative start-ups slowly start to dig into this
massive amount of data and also governments start working on Big Data projects. In
2009 the Indian government decides to take an iris scan, fingerprint, and photograph of
all of its 1.2 billion inhabitants. All this data is stored in the largest biometric database in
the world.

In 2010 Eric Schmidt spoke at the Techonomy conference in Lake Tahoe, California, stating that "there were 5 exabytes of information created by the entire world between the dawn of civilization and 2003. Now that same amount is created every two days."

In 2011 the McKinsey report Big Data: The next frontier for innovation, competition, and productivity stated that by 2018 the USA alone would face a shortage of 140,000 to 190,000 data scientists as well as 1.5 million data managers.

In the past few years, there has been a massive increase in Big Data start-ups, all trying to deal with Big Data and helping organizations to understand it, and more and more companies are slowly adopting and moving towards Big Data. However, while it looks like Big Data has been around for a long time already, in fact Big Data is about where the internet was in 1993. The large Big Data revolution is still ahead of us, so a lot will change in the coming years. Let the Big Data era begin!

To illustrate this development over time, the evolution of Big Data can roughly be sub-
divided into three main phases. Each phase has its own characteristics and capabilities. In
order to understand the context of Big Data today, it is important to understand how each
phase contributed to the contemporary meaning of Big Data.

 Big Data phase 1.0

Data analysis, data analytics and Big Data originate from the longstanding domain of database management. They rely heavily on the storage, extraction, and optimization techniques that are common for data stored in Relational Database Management Systems (RDBMS).

Database management and data warehousing are considered the core components of Big Data Phase 1. They provide the foundation of modern data analysis as we know it today, using well-known techniques such as database queries, online analytical processing, and standard reporting tools.

 Big Data phase 2.0

Since the early 2000s, the Internet and the Web began to offer unique data collections
and data analysis opportunities. With the expansion of web traffic and online stores,
companies such as Yahoo, Amazon and eBay started to analyse customer behaviour by
analysing click-rates, IP-specific location data and search logs. This opened a whole new
world of possibilities.

From a data analysis, data analytics, and Big Data point of view, HTTP-based web traffic
introduced a massive increase in semi-structured and unstructured data.

Besides the standard structured data types, organizations now needed to find new approaches and storage solutions to deal with these new data types in order to analyse them effectively. The arrival and growth of social media data greatly amplified the need for tools, technologies and analytics techniques that were able to extract meaningful information out of this unstructured data.

 Big Data phase 3.0

Although web-based unstructured content is still the main focus for many organizations in data analysis, data analytics, and big data, new possibilities to retrieve valuable information are now emerging out of mobile devices.

Mobile devices not only give the possibility to analyse behavioural data (such as clicks
and search queries), but also give the possibility to store and analyse location-based data
(GPS-data). With the advancement of these mobile devices, it is possible to track
movement, analyse physical behaviour and even health-related data (number of steps you
take per day). This data provides a whole new range of opportunities, from transportation
to city design and health care.

Simultaneously, the rise of sensor-based internet-enabled devices is increasing data generation like never before. Famously coined as the 'Internet of Things' (IoT), millions of TVs, thermostats, wearables and even refrigerators are now generating zettabytes of data every day. And the race to extract meaningful and valuable information out of these new data sources has only just begun.

A summary of the three phases in Big Data is given in the table below:

Big Data Phase 1 (1970-2000): DBMS-based, structured content
 RDBMS & data warehousing
 Extract, Transform, Load (ETL)
 Online Analytical Processing (OLAP)
 Dashboards & scorecards
 Data mining & statistical analysis

Big Data Phase 2 (2000-2010): Web-based, unstructured content
 Information retrieval and extraction
 Opinion mining
 Question answering
 Web analytics and web intelligence
 Social media analytics
 Social network analysis
 Spatial-temporal analysis

Big Data Phase 3 (2010-present): Mobile and sensor-based content
 Location-aware analysis
 Person-centered analysis
 Context-relevant analysis
 Mobile visualization
 Human-Computer Interaction

Table 1.1 Three phases in Big Data

1.3 DISTRIBUTED FILE SYSTEM

The first storage mechanism used by computers to store data was punch cards. Each group of related punch cards (punch cards related to the same program) used to be stored as a file, and files were stored in file cabinets. This is very similar to what we still do today to archive papers in government institutions that use paperwork on a daily basis. This is where the term "File System" (FS) comes from. Computer systems evolved, but the concept remains the same. Instead of storing information on punch cards, we can now store information/data in a digital format on a digital storage device such as a hard disk, flash drive, etc. Related data are still categorized as files; related groups of files are stored in folders. Each file has a name, extension, and icon. The file name gives an indication of the content, while the file extension indicates the type of information stored in that file. For example, the EXE extension refers to executable files, and TXT refers to text files.

Fig 1.1 The file system

The file management system is used by the operating system to access the files and folders stored in a computer or on any external storage devices. Imagine the file management system as a big dictionary that contains information about file names, locations, and types. A file management system is capable of handling files within one computer. But what if we have many computers? This is where the DFS comes in.

A Distributed File System (DFS), as the name suggests, is a file system that is distributed across multiple file servers or multiple locations. It allows programs to access or store remote files just as they do local ones, allowing programmers to access files from any network or computer.

The main purpose of the Distributed File System (DFS) is to allow users of physically distributed systems to share their data and resources by using a common file system. A collection of workstations and mainframes connected by a Local Area Network (LAN) is a typical configuration for a Distributed File System. A DFS is executed as a part of the operating system. In DFS, a namespace is created, and this process is transparent for the clients.

As there has been exceptional growth in network-based computing, client/server-based applications have brought revolutions in the process of building distributed file systems. Sharing storage resources and information on a network is one of the key elements in both local area networks (LANs) and wide area networks (WANs). Different technologies like DFS have been developed to bring convenience and efficiency to sharing resources and files on a network, as networks themselves also evolve.

One process involved in implementing the DFS is giving access control and storage management controls to the client system in a centralized way. The servers involved must be able to serve data capably and with sufficient dexterity.

In Big Data, we often deal with multiple clusters (computers). One of the main advantages of Big Data is that it goes beyond the capabilities of one single, super powerful server with extremely high computing power. The whole idea of Big Data is to distribute data across multiple clusters and to make use of the computing power of each cluster (node) to process information.

A distributed file system is a system that can handle accessing data across multiple clusters (nodes). In the next section we will learn more about how it works.

Fig 1.2 Concept of distribution file system

DFS has two components:

 Location Transparency

Location transparency is achieved through the namespace component.


 Redundancy

Redundancy is done through a file replication component.

In the case of failure and heavy load, these components together improve data availability by
allowing the sharing of data in different locations to be logically grouped under one folder,
which is known as the “DFS root”.

It is not necessary to use both components of DFS together; it is possible to use the namespace component without the file replication component, and it is perfectly possible to use the file replication component without the namespace component between servers.

Features of DFS:

Transparency

 Structure transparency

There is no need for the client to know about the number or locations of file servers and
the storage devices. Multiple file servers should be provided for performance,
adaptability, and dependability.

 Access transparency

Both local and remote files should be accessible in the same manner. The file system should automatically locate the accessed file and send it to the client's side.

 Naming transparency

There should not be any hint of the file's location in its name. Once a name is given to the file, it should not be changed while being transferred from one node to another.

 Replication transparency

If a file is copied to multiple nodes, both the copies of the file and their locations should be hidden from the clients.

User mobility

It will automatically bring the user’s home directory to the node where the user logs in.

Performance

Performance is measured by the average amount of time needed to service client requests. This time covers CPU time plus the time taken to access secondary storage plus network access time. It is advisable that the performance of a Distributed File System be comparable to that of a centralized file system.

Simplicity and ease of use

The user interface of the file system should be simple, and the number of commands should be small.

High availability

A Distributed File System should be able to continue functioning in the face of partial failures such as a link failure, a node failure, or a storage drive crash.

A highly reliable and adaptable distributed file system should have different and independent file servers for controlling different and independent storage devices.

Scalability

Since growing the network by adding new machines or joining two networks together is routine, the distributed system will inevitably grow over time. As a result, a good distributed file system should be built to scale quickly as the number of nodes and users in the system grows. Service should not be substantially disrupted as the number of nodes and users grows.

High reliability

The likelihood of data loss should be minimized as much as feasible in a suitable distributed file system. That is, users should not feel forced to make backup copies of their files because of the system's unreliability. Rather, a file system should create backup copies of key files that can be used if the originals are lost. Many file systems employ stable storage as a high-reliability strategy.

Data integrity

Multiple users frequently share a file system. The integrity of data saved in a shared file must
be guaranteed by the file system. That is, concurrent access requests from many users who
are competing for access to the same file must be correctly synchronized using a concurrency
control method. Atomic transactions are a high-level concurrency management mechanism
for data integrity that is frequently offered to users by a file system.

Security

A distributed file system should be secure so that its users can trust that their data will be kept private. To safeguard the information contained in the file system from unwanted and unauthorized access, security mechanisms must be implemented.

Heterogeneity

Heterogeneity in distributed systems is unavoidable as a result of their huge scale. Users of heterogeneous distributed systems have the option of using multiple computer platforms for different purposes.

How Does a Distributed File System (DFS) Work?

A distributed file system works as follows:

Distribution: Blocks of data sets are distributed across multiple nodes. Each node has its own computing power, which gives DFS the ability to process data blocks in parallel.

Replication: A distributed file system will also replicate data blocks on different clusters by copying the same pieces of information onto multiple clusters on different racks. This helps to achieve the following:

Fault Tolerance: recover data blocks in case of node failure or rack failure.

High Concurrency: allow the same piece of data to be processed by multiple clients at the same time, using the computation power of each node to process data blocks in parallel.

The following figure shows how the data replication concept works:

Fig 1.3 Data replication concept
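
To make the replication idea concrete, the following is a minimal Python sketch of block placement, assuming a toy cluster with hypothetical block and node names; a real DFS such as HDFS also takes rack topology and node health into account:

import itertools

def place_blocks(blocks, nodes, replication=3):
    # Assign each block to `replication` distinct nodes, round-robin.
    # This is a toy placement policy, not the actual HDFS algorithm.
    ring = itertools.cycle(nodes)
    return {block: [next(ring) for _ in range(replication)]
            for block in blocks}

blocks = ["block-1", "block-2", "block-3"]
nodes = ["node-A", "node-B", "node-C", "node-D"]
for block, replicas in place_blocks(blocks, nodes).items():
    print(block, "->", replicas)

With a replication factor of 3, if node-A fails, every block it held still has two live replicas elsewhere, which is exactly the fault-tolerance property described above.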


What are the Advantages of Distributed File System (DFS)?

Distributed file system provides the following main advantages:

 Scalability: You can scale up your infrastructure by adding more racks or clusters to
your system.

 Fault Tolerance: Data replication will help to achieve fault tolerance in the following
cases:

i. Cluster is down

ii. Rack is down

iii. Rack is disconnected from the network.

iv. Job failure or restart.

 High Concurrency: utilize the compute power of each node to handle multiple client
requests (in a parallel way) at the same time.

 DFS allows multiple users to access or store the data.

 It allows the data to be shared remotely.

 It improves file availability, access time, and network efficiency.

 It improves the capacity to change the size of the data and also improves the ability to exchange the data.

 A Distributed File System provides transparency of data even if a server or disk fails.

Disadvantages

 In a Distributed File System, nodes and connections need to be secured; therefore we can say that security is at stake.

 There is a possibility of loss of messages and data in the network while moving from one node to another.

 Database connections in a Distributed File System are complicated.

 Handling of the database is also not easy in a Distributed File System compared to a single-user system.

 There are chances that overloading will take place if all nodes try to send data at once.

History

The server component of the Distributed File System was initially introduced as an add-on feature. It was added to Windows NT 4.0 Server and was known as "DFS 4.1". Later on it was included as a standard component in all editions of Windows 2000 Server. Client-side support has been included in Windows NT 4.0 and later versions of Windows.

Linux kernels 2.6.14 and later come with an SMB client VFS known as "cifs" which supports DFS. Mac OS X 10.7 (Lion) and later also support DFS.

Applications

NFS

NFS stands for Network File System. It is a client-server architecture that allows a computer
user to view, store, and update files remotely. The protocol of NFS is one of the several
distributed file system standards for Network-Attached Storage (NAS).

CIFS

CIFS stands for Common Internet File System. CIFS is a dialect of SMB; that is, CIFS is an implementation of the SMB protocol, designed by Microsoft.

SMB

SMB stands for Server Message Block. It is a file-sharing protocol invented by IBM. The SMB protocol was created to allow computers to perform read and write operations on files on a remote host over a Local Area Network (LAN). The directories present on the remote host can be accessed via SMB and are called "shares".

Hadoop

Hadoop is a group of open-source software services. It provides a software framework for distributed storage and processing of big data using the MapReduce programming model. The core of Hadoop consists of a storage part, known as the Hadoop Distributed File System (HDFS), and a processing part, which is the MapReduce programming model.
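
Since algorithms using MapReduce return in section 1.9, the word count below is a minimal sketch of the programming model in plain Python; the sample documents are illustrative only, and in a real Hadoop job the map phase would run on many nodes in parallel over HDFS blocks:

from collections import defaultdict

def map_phase(document):
    # Map: emit a (word, 1) pair for every word in an input split.
    for word in document.split():
        yield (word.lower(), 1)

def reduce_phase(pairs):
    # Shuffle/Reduce: group the pairs by word and sum the counts.
    counts = defaultdict(int)
    for word, count in pairs:
        counts[word] += count
    return dict(counts)

splits = ["big data is big", "data beats intuition"]
pairs = (pair for doc in splits for pair in map_phase(doc))
print(reduce_phase(pairs))
# {'big': 2, 'data': 2, 'is': 1, 'beats': 1, 'intuition': 1}

The map step is independent per split, so it parallelizes naturally across the nodes of a distributed file system, while the reduce step aggregates the intermediate results.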

NetWare

NetWare is a discontinued computer network operating system developed by Novell, Inc. It primarily used cooperative multitasking to run different services on a personal computer, using the IPX network protocol.

Working of DFS

There are two ways in which DFS can be implemented:

Standalone DFS namespace

It allows only for DFS roots that exist on the local computer and do not use Active Directory. A standalone DFS can only be accessed on the computer on which it is created. It does not provide any fault tolerance and cannot be linked to any other DFS. Standalone DFS roots are rarely encountered because of their limited advantage.

Domain-based DFS namespace

It stores the configuration of DFS in Active Directory, creating the DFS namespace root accessible at \\<domainname>\<dfsroot> or \\<FQDN>\<dfsroot>.
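
As a small illustration, assuming a hypothetical domain corp.example.com with a DFS root named public, a client can open a file through the namespace exactly as if it were local; the namespace hides which file server actually holds the data:

from pathlib import Path

# Hypothetical domain-based DFS path of the form \\<domainname>\<dfsroot>.
share = Path(r"\\corp.example.com\public\reports\summary.txt")
if share.exists():          # resolves through the DFS namespace
    print(share.read_text())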

Fig 1.4 Working of DFS

Benefits of DFS Models

The distributed file system brings with it some common benefits.

A DFS makes it possible to restrict access to the file system through access lists or capabilities on both the servers and the clients, depending on how the protocol is designed.

Also, since the server provides a single central point of access for data requests, it is thought to be fault-tolerant (as mentioned above) in that it will still function well if some of the nodes are taken offline.
This dovetails with some of the reasons that DFS was developed in the first place – the
system can still have that integrity if a few workstations get moved around.

DFS and Backup

Ironically enough, even though a DFS server is prized for being a single central point of access, another server may also be in play. However, that doesn't mean that there won't be that single central access point. The second server is for backup.

Because businesses invest in having one central DFS server, they worry that the server could be compromised somehow. Backing all of the data up at a separate location ensures the right kind of redundancy to make the system fully fault-tolerant, even if the primary server itself is toppled by something like a DDoS attack.

DFS systems, like other systems, continue to innovate. With new kinds of networking
controls and virtualization systems, modern DFS will often take advantage of logical
partitioning or other advances in hardware and software.

1.4 BIG DATA AND ITS IMPORTANCE

Fig 1.5 Importance of big data: cost saving, time saving, social media listening, customer acquisition, marketing insights, and innovation
Big Data initiatives were rated as "extremely important" by 93% of companies. Leveraging a Big Data analytics solution helps organizations to unlock strategic value and take full advantage of their assets.

It helps organizations as follows:

 To understand where, when and why their customers buy

 To protect the company's client base with improved loyalty programs

 To seize cross-selling and upselling opportunities

 To provide targeted promotional information

 To optimize workforce planning and operations

 To improve inefficiencies in the company's supply chain

 To predict market trends

 To predict future needs

 To make companies more innovative and competitive

 To discover new sources of revenue

Companies are using Big Data to learn what their customers want, who their best customers are, and why people choose different products. The more a company knows about its customers, the more competitive it becomes.

Companies can use historical and real-time data to assess evolving consumer preferences. This consequently enables businesses to improve and update their marketing strategies, making them more responsive to customer needs.

Big Data's importance doesn't revolve around the amount of data a company has. Its importance lies in how the company utilizes the gathered data.

Every company uses its collected data in its own way. The more effectively a company uses its data, the more rapidly it grows.

Companies in the present market need to collect data and analyse it because:

 Cost Savings

Big Data tools like Apache Hadoop, Spark, etc. bring cost-saving benefits to businesses
when they have to store large amounts of data. These tools help organizations in
identifying more effective ways of doing business.

 Time Saving

Real-time in-memory analytics helps companies to collect data from various sources. Tools like Hadoop help them to analyse data immediately, thus helping in making quick decisions based on the learnings.

 Understand the market conditions

Big Data analysis helps businesses to get a better understanding of market situations.

For example, analysis of customer purchasing behaviour helps companies to identify the products sold most and produce those products accordingly. This helps companies to get ahead of their competitors.

 Social Media Listening

Companies can perform sentiment analysis using Big Data tools. These enable them to
get feedback about their company, that is, who is saying what about the company.

Companies can use big data tools to improve their online presence.

 Boost Customer Acquisition and Retention

Customers are a vital asset on which any business depends. No business can achieve success without building a robust customer base. But even with a solid customer base, companies can't ignore the competition in the market.

If a company doesn't know what its customers want, its success will suffer. The result will be a loss of clientele, which creates an adverse effect on business growth.

Big data analytics helps businesses to identify customer-related trends and patterns. Customer behaviour analysis leads to a profitable business.

 Solve Advertisers' Problems and Offer Marketing Insights

Big data analytics shapes all business operations. It enables companies to fulfil customer
expectations. Big data analytics helps in changing the company’s product line. It ensures
powerful marketing campaigns.

 Driver of Innovation and Product Development

Big data makes companies capable of innovating and redeveloping their products.

Companies use big data in their systems to improve operations, provide better customer
service, create personalized marketing campaigns, and take other actions that, ultimately, can
increase revenue and profits. Businesses that use it effectively hold a potential competitive
advantage over those that don't because they're able to make faster and more informed
business decisions.

For example, big data provides valuable insights into customers that companies can use to
refine their marketing, advertising, and promotions in order to increase customer engagement
and conversion rates. Both historical and real-time data can be analysed to assess the
evolving preferences of consumers or corporate buyers, enabling businesses to become more
responsive to customer wants and needs.

Big data is also used by medical researchers to identify disease signs and risk factors and by
doctors to help diagnose illnesses and medical conditions in patients. In addition, a
combination of data from electronic health records, social media sites, the web and other
sources gives healthcare organizations and government agencies up-to-date information on
infectious disease threats or outbreaks.

Here are some more examples of how big data is used by organizations:

 In the energy industry, big data helps oil and gas companies identify potential drilling
locations and monitor pipeline operations; likewise, utilities use it to track electrical
grids.

 Financial services firms use big data systems for risk management and real-time analysis
of market data.

 Manufacturers and transportation companies rely on big data to manage their supply
chains and optimize delivery routes.

 Other government uses include emergency response, crime prevention and smart city
initiatives.

Real-Time Benefits of Big Data

Big Data analytics has expanded its roots in all the fields. This results in the use of Big Data
in a wide range of industries including Finance and Banking, Healthcare, Education,
Government, Retail, Manufacturing, and many more.

There are many companies like Amazon, Netflix, Spotify, LinkedIn, Swiggy, etc. which use big data analytics. The banking sector makes maximum use of Big Data analytics. The education sector is also using data analytics to enhance students' performance as well as to make teaching easier for instructors.

Big Data analytics help retailers from traditional to e-commerce to understand customer
behaviour and recommend products as per customer interest. This helps them in developing
new and improved products which help the firm enormously.

To get an answer to why you should learn Big Data, let's start with what industry leaders say about it:

 Gartner – Big Data is the new Oil.

 IDC – Its market will be growing 7 times faster than the overall IT market.

 IBM – It is not just a technology – it's a business strategy for capitalizing on information resources.

 IBM – Big Data is the biggest buzzword because technology makes it possible to analyse all the available data.

 McKinsey – There will be a shortage of 1,500,000 Big Data professionals by the end of 2018.

While the term "Big Data" is still quite new, we've been dealing with large data sets ever since the '60s and '70s. Since the days the first data centres were created, companies have been exploring spreadsheets and basic analytics to make informed decisions about the future.

Around 2005, when social media started to grow in popularity, people began to realise just how much data we generate every day. In fact, our data output is currently 2.5 quintillion bytes every day.

The information available online today can potentially provide useful insights about your
market and customer. In the past, companies needed to use long and expensive processes to
sort through that information manually. Today, machine learning and data analytics tools
make it easier to access business insights almost instantly.

To better understand the importance of Big Data, let’s look at how today’s organisations are
using it to drive success.

 Answering Important Customer Questions

The answers to all your most pressing business questions exist within the right data sets. Using Big Data and analytics, companies can learn:

 What customers want

 Where they're missing out on conversions

 Who their best customers are

 Why people choose different products

Every day, your organisation gathers more useful insights into your sales and marketing questions, so you can begin to adjust and optimise your campaigns. The more you learn about your customers, the more competitive your company becomes.

For instance, the U.S. company Target uses data analytics to predict customer pregnancies, to help them better offer promotions to their customers. You can even use Big Data combined with artificial intelligence to create marketing strategies based on predictions about your customer.

For instance, if you use Big Data to learn that your customers are more likely to up-sell at certain times in the month, you can send lead nurturing emails to prompt those conversions.

 Making Confident Decisions

All companies need to make complex decisions as they grow.

Big Data can help you to make those choices with confidence, based on an in-depth
analysis of what you know about your marketplace, industry, and customers.

One of the biggest benefits of Big Data is its accuracy. Big Data analytics gives you a
complete overview of everything you’ve learned so far as you’ve developed your
organisation. This means that you don’t have to guess whether you should launch a new
marketing campaign or try a new product. Instead, you can look back over the
information you have, and make focused decisions designed to generate the highest
possible ROI.

Add machine learning and AI to the mix, and your Big Data collections can form neural
networks that help your artificial intelligence to suggest positive changes for your
company.

 Optimising and Understanding Business Processes

Knowledge is power. That concept is at the heart of Big Data analytics.

Big Data technologies like cloud computing and machine learning help you to stay ahead of the curve by identifying inefficiencies and opportunities in your company practices.

For instance, your Big Data analytics can tell you that your email marketing strategy is working, but your social media profiles aren't reaching the right people. On the other hand, if you use Big Data internally, you can find out which parts of your company culture are having the right impact, and which may be causing turnover.

Using existing evidence to make quick decisions ensures that you spend more of your budget on the things that are helping your business grow, and less on strategies that don't work.

For instance, one area that is seeing a lot of growth with Big Data analytics is supply
chain optimisation. Geographic sensors can track goods and delivery vehicles and
optimise routes by providing live data and traffic information.

 Empowering the Next Generation

Finally, as a new generation of technology leaders enters the marketplace, Big Data delivers the agility and innovation that top-tier talent needs from their employer.

For instance, millennials are natural technology natives. The younger people in your
team will expect access to technology that allows them to make useful decisions rapidly.
By constantly collecting and analysing information, you can create an agile culture that’s
ready to evolve to suit the latest trends.

In the past, limited data sets, poor analytics processes and a lack of the right skills meant that businesses could only access a small amount of the information available to them. Now, companies can not only answer critical questions faster but empower their teams to accomplish more with the information they collect.

1.5 FOUR VS

When does regular data analysis become "Big" Data analysis? Although the answer to this question cannot be universally determined, there are a number of characteristics that define Big Data. How do you know if the data you have is considered Big Data? There are generally four characteristics that must be part of a dataset to qualify it as Big Data: volume, velocity, variety, and veracity.

Our world has become datafied. From data that shows activity, such as our Google searches and online shopping habits, to our communication and conversations through text, smartphones, and virtual assistants, and all the pictures and videos we take, to the sensor data collected by internet-of-things devices and more, there are 2.5 quintillion bytes of data created each day. The better companies and organizations manage and secure this data, the more successful they are likely to be. But in order for data to be useful to an organization, it must also create value, a critical fifth characteristic of Big Data that can't be overlooked. Big Data and analytics technologies enable an organisation to become more competitive and grow without limits. But if an organisation is capturing large amounts of data, it will need specific solutions for its analysis.

Let us understand the 4 Vs of Big Data:

Volume

You may have heard on more than one occasion that Big Data is nothing more than business
intelligence, but in a very large format. More data, however, does not necessarily mean it is
Big Data.

Obviously, Big Data needs a certain amount of data, but having a huge amount of data does not necessarily mean that you are working on Big Data.

It would also be a mistake to think that all areas of Big Data are business intelligence. Big Data is not limited or defined by the objectives sought with an initiative, but by the characteristics of the data itself.

The volume of data refers to the size of the data sets that need to be analysed and processed,
which are now frequently larger than terabytes and petabytes. The sheer volume of the data
requires distinct and different processing technologies than traditional storage and processing
capabilities. In other words, this means that the data sets in Big Data are too large to process
with a regular laptop or desktop processor. An example of a high-volume data set would be
all credit card transactions on a day within the country.

The main characteristic that makes data “big” is the sheer volume. It makes no sense to focus
on minimum storage units because the total amount of information is growing exponentially
every year. In 2010, Thomson Reuters estimated in its annual report that it believed the world
was “awash with over 800 exabytes of data and growing.”

For that same year, EMC, a hardware company that makes data storage devices, thought it
was closer to 900 exabytes and would grow by 50 percent every year. No one really knows
how much new data is being generated, but the amount of information being collected is
huge.

Today, every single minute we create the same amount of data that was created from the
beginning of time until the year 2000. We now use the terms terabytes and petabytes to
discuss the size of data that needs to be processed. The quantity of data is certainly an
important aspect of making it be classified as Big Data. As a result of the amount of data we
deal with daily, new technologies and strategies such as multitiered storage media have been
developed to securely collect, analyse, and store it properly.
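
To get a feel for these magnitudes, the following back-of-the-envelope calculation (assuming the oft-quoted figure of 2.5 quintillion bytes per day) shows why single-machine storage cannot keep up:

DAILY_BYTES = 2.5e18   # 2.5 quintillion bytes, i.e. about 2.5 exabytes
TERABYTE = 1e12
print(DAILY_BYTES / TERABYTE)   # 2500000.0 one-terabyte drives per day

Roughly 2.5 million one-terabyte drives would be filled every single day, which is why volume alone pushes organizations towards distributed storage and processing.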

Variety

Today, we can base our decisions on the prescriptive data obtained through Big Data. Thanks to this technology, every action of customers, competitors, suppliers, etc. generates prescriptive information that ranges from structured and easily managed data to unstructured information that is difficult to use for decision making.

Each piece of data, or core information, will require specific treatment. In addition, each type of data will have specific storage needs (the storage of an e-mail will be much less than that of a video).

Variety makes Big Data really big. Big Data comes from a great variety of sources and generally is one of three types: structured, semi-structured, and unstructured data. The variety in data types frequently requires distinct processing capabilities and specialist algorithms. An example of high variety data sets would be the CCTV audio and video files that are generated at various locations in a city.

Variety is one of the most interesting developments in technology as more and more information is digitized. Traditional data types (structured data) include things on a bank statement like date, amount, and time. These are things that fit neatly in a relational database.

Structured data is augmented by unstructured data, which is where things like Twitter feeds,
audio files, MRI images, web pages, web logs are put — anything that can be captured and
stored but doesn’t have a meta model (a set of rules to frame a concept or idea — it defines a
class of information and how to express it) that neatly defines it.

Unstructured data is a fundamental concept in Big Data. The best way to understand
unstructured data is by comparing it to structured data. Think of structured data as data that is
well defined in a set of rules. For example, money will always be numbers and have at least
two decimal points; names are expressed as text; and dates follow a specific pattern.

With unstructured data, on the other hand, there are no rules. A picture, a voice recording, a
tweet — they all can be different but express ideas and thoughts based on human
understanding. One of the goals of Big Data is to use technology to take this unstructured
data and make sense of it.
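
To make the contrast concrete, the short Python sketch below places a schema-bound record next to a free-text tweet and extracts a crude signal from the latter; the keyword lists and sample values are purely hypothetical:

# Structured: fields follow a fixed schema and map directly to
# columns in a relational database.
transaction = {"date": "2024-01-15", "amount": 49.99, "currency": "USD"}

# Unstructured: free text has no schema; meaning must be extracted.
tweet = "Loving the new phone, but the battery life is terrible!"

# A crude keyword-based sentiment check, purely illustrative.
POSITIVE = {"loving", "great", "excellent"}
NEGATIVE = {"terrible", "awful", "bad"}
words = {w.strip(",.!").lower() for w in tweet.split()}
score = len(words & POSITIVE) - len(words & NEGATIVE)
print("sentiment score:", score)   # 0: one positive and one negative hit

Real systems use far more sophisticated natural language processing, but the point stands: structured fields can be queried directly, while unstructured data must first be interpreted.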

Today, data is generally one of three types: unstructured, semi-structured and structured. The algorithms required to process the variety of data generated vary based on the type of data to be processed. In the past, data was nicely structured: think Excel spreadsheets or other relational databases. A key characteristic of Big Data is that it not only includes structured data but also text, images, videos, voice files and other unstructured data that doesn't fit easily into the framework of a spreadsheet. Unstructured data isn't bound by rules the way structured data is. Again, this variety has helped put the "big" in data. We are able to use technology to make sense of unstructured data today in a way that wasn't possible in the past. This ability has opened up a tremendous amount of data that was previously not accessible or useful.

Veracity

This V will refer to both data quality and availability. When it comes to traditional business
analytics, the source of the data is going to be much smaller in both quantity and variety.
However, the organization will have more control over them, and their veracity will be
greater.

When we talk about Big Data, variety is going to mean greater uncertainty about the quality
of that data and its availability. It will also have its implications in terms of the data sources
we may have.

Veracity refers to the quality of the data that is being analysed. High veracity data has many records that are valuable to analyse and that contribute in a meaningful way to the overall results. Low veracity data, on the other hand, contains a high percentage of meaningless data. The non-valuable data in these data sets is referred to as noise. An example of a high veracity data set would be data from a medical experiment or trial.

Data that is high volume, high velocity and high variety must be processed with advanced
tools (analytics and algorithms) to reveal meaningful information. Because of these
characteristics of the data, the knowledge domain that deals with the storage, processing, and
analysis of these data sets has been labelled Big Data.

Veracity refers to the trustworthiness of the data. Can the manager rely on the fact that the
data is representative? Every good manager knows that there are inherent discrepancies in all
the data collected.

The veracity of Big Data denotes the trustworthiness of the data. Is the data accurate and high-quality? When talking about Big Data that comes from a variety of sources, it's important to understand the chain of custody, metadata, and the context in which the data was collected in order to glean accurate insights. The higher the veracity of the data, the more valuable it is to analyse and the more it contributes to meaningful results for an organization.

Velocity

It is very possible that Variety and Veracity would not be so relevant, and would not create so much pressure when facing a Big Data initiative, if it were not for the high Volume of information that has to be handled and, above all, for the velocity at which the information has to be generated and managed.

The data will be an input for the technology area (it will be essential to be able to store and
digest large amounts of information). And the output part will be the decisions and reactions
that will later involve the corresponding departments. The important thing here is that they
are able to react with the necessary speed to boost the business area.

Velocity refers to the speed with which data is generated. High velocity data is generated at such a pace that it requires distinct (distributed) processing techniques. An example of data that is generated with high velocity would be Twitter messages or Facebook posts.

Velocity is the frequency of incoming data that needs to be processed. Think about how many SMS messages, Facebook status updates, or credit card swipes are being sent on a particular telecom carrier every minute of every day, and you'll have a good appreciation of velocity. A streaming application like Amazon Web Services Kinesis is an example of an application that handles the velocity of data.

When you send a text, check out your social media feed and react to posts on Facebook,
Instagram or Twitter or make a credit card purchase, these acts create data that need to be
processed instantaneously. Compound these activities by all the people in the world doing the
same and more and you can start to see how velocity is a key attribute of Big Data.
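
To make velocity tangible, here is a minimal Python sketch that measures the arrival rate of an event stream over a trailing one-second window; the timestamps are synthetic stand-ins for real events such as tweets or card swipes:

import time
from collections import deque

def rolling_rate(timestamps, window_seconds=1.0):
    # Count how many events fall inside the trailing window as each
    # new event arrives; a toy model of stream-rate monitoring.
    window = deque()
    for ts in timestamps:
        window.append(ts)
        while window and window[0] < ts - window_seconds:
            window.popleft()
        yield ts, len(window)

now = time.time()
stream = [now + t for t in (0.0, 0.2, 0.3, 0.9, 1.5)]
for ts, rate in rolling_rate(stream):
    print(f"t={ts - now:.1f}s  events in last second: {rate}")

A production system would run this logic over millions of events per second on a distributed streaming platform, but the windowing idea is the same.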

Big Data systems and analysis by organizations large and small are here to stay, and we need
to stay up to date with the technologies, analytics, and utilization of these types of systems as
they advance in the future.

1.6 DRIVERS FOR BIG DATA

Big Data emerged in the last decade from a combination of business needs and technology
innovations. A number of companies that have Big Data at the core of their strategy have
become very successful at the beginning of the 21st century. Famous examples include
Apple, Amazon, Facebook, and Netflix.

Big Data is no longer just a buzzword; it is a proven phenomenon and not likely to die away
soon. A recent IDC report predicts that the digital universe will be 44 times bigger in 2020
than it was in 2009, totalling a staggering 35 zettabytes.

Two factors have combined to make Big Data especially appealing now. One is that so many
potentially valuable data resources have come into existence.

These sources include the telemetry generated by today's smart devices, the digital footprints
left by people who are increasingly living their lives online, and the rich sources of
information commercially available from specialized data vendors. Add to this the
tremendous wealth of data — structured and unstructured, historical, and real-time — that
has come to reside in diverse systems across the enterprise, and it is clear that Big Data offers
hugely appealing opportunities to those who can unlock its secrets.

The other factor contributing to Big Data's appeal is the emergence of powerful technologies
for effectively exploiting it. IT organizations can now take advantage of tools such as
Hadoop, NoSQL and Gephi to rationalize, analyse and visualize Big Data in ways that enable
them to quickly separate the actionable insight from the massive chaff of raw input. As an
added bonus, many of these tools are available free under open-source licensing. This
promises to help keep the cost of Big Data implementation under control.

What is the business justification for starting a Big Data initiative? Basically, what drives the Big Data analytics strategy? Sketching this out can provide an initial compass if you are still not sure where to look for motivation for Big Data analytics projects and their drivers.

Let's look at these drivers through two different lenses: Business and Technology. A number of business drivers are at the core of this success and explain why Big Data has quickly risen to become one of the most coveted topics in the industry. Business entails the market, sales, and financial side of things, whereas Technology covers drivers targeted towards the technology and IT infrastructure side of things.

Some of the drivers for Big Data are:

The Digitization of Society

Big Data is largely consumer driven and consumer oriented. Most of the data in the world is
generated by consumers, who are nowadays ‘always-on’. Most people now spend 4-6 hours
per day consuming and generating data through a variety of devices and (social) applications.
With every click, swipe or message, new data is created in a database somewhere around the
world. Because everyone now has a smartphone in their pocket, the data creation sums to
incomprehensible amounts. Some studies estimate that 60% of data was generated within the
last two years, which is a good indication of the rate with which society has digitized.

The Plummeting of Technology Costs

Technology related to collecting and processing massive quantities of diverse (high variety)
data has become increasingly more affordable. The costs of data storage and processors keep
declining, making it possible for small businesses and individuals to become involved with
Big Data. For storage capacity, the often-cited Moore’s Law (strictly a statement about
transistor counts, but commonly extended to storage) still holds: storage density, and therefore
capacity, roughly doubles every two years. Besides the plummeting of the
storage costs, a second key contributing factor to the affordability of Big Data has been the
development of open-source Big Data software frameworks. The most popular software
framework (nowadays considered the standard for Big Data) is Apache Hadoop for
distributed storage and processing. Due to the high availability of these software frameworks
in open sources, it has become increasingly inexpensive to start Big Data projects in
organizations.

Connectivity through Cloud Computing

Cloud computing environments (where data is remotely stored in distributed storage systems)
have made it possible to quickly scale up or scale down IT infrastructure and facilitate a pay-
as-you-go model. This means that organizations that want to process massive quantities of
data (and thus have large storage and processing requirements) do not have to invest in large
quantities of IT infrastructure. Instead, they can license the storage and processing capacity
they need and only pay for the amounts they actually used. As a result, most of Big Data
solutions leverage the possibilities of cloud computing to deliver their solutions to
enterprises.

Increased Knowledge about Data Science

In the last decade, the term data science and data scientist have become tremendously
popular. In October 2012, Harvard Business Review called the data scientist “sexiest job of
the 21st century” and many other publications have featured this new job role in recent years.
The demand for data scientists (and similar job titles) has increased tremendously, and many
people have actively become engaged in the domain of data science. As a result, the
knowledge and education about data science has greatly professionalized and more
information becomes available every day. While statistics and data analysis mostly remained
an academic field previously, it is quickly becoming a popular subject among students and
the working population.

Social Media Applications

Everyone understands the impact that social media has on daily life. However, in the study of
Big Data, social media plays a role of paramount importance. Not only because of the sheer
volume of data that is produced every day through platforms such as Twitter, Facebook,
LinkedIn, and Instagram, but also because social media provides nearly real-time data about
human behaviour.

Social media data provides insights into the behaviours, preferences, and opinions of ‘the
public’ on a scale that has never been known before. Due to this, it is immensely valuable to
anyone who is able to derive meaning from these large quantities of data. Social media data
can be used to identify customer preferences for product development, target new customers
for future purchases, or even target potential voters in elections. Social media data might even
be considered one of the most important business drivers of Big Data.

The Upcoming Internet of Things (IoT)

The Internet of things (IoT) is the network of physical devices, vehicles, home appliances and
other items embedded with electronics, software, sensors, actuators, and network connectivity
which enable these objects to connect and exchange data. It is increasingly gaining popularity
as consumer goods providers start including ‘smart’ sensors in household appliances.
Whereas the average household in 2010 had around 10 devices that connected to the internet,
this number is expected to rise to 50 per household by 2020. Examples of these devices
include thermostats, smoke detectors, televisions, audio systems and even smart refrigerators.

Data Science as a Competitive Advantage

Businesses have consistently pushed to build Big Data capabilities that add to their
competitive advantage. With a proper data-driven framework, businesses can build sustainable
capabilities and further leverage them as a competitive edge. Businesses that master Big
Data-driven capabilities can even establish a secondary source of revenue by selling data
products and insights to other businesses.

Sustained Processes

A data-driven approach creates sustainable processes, which strongly endorses a Big Data
analytics strategy for enterprise adoption. Randomness hurts businesses and adds unpredictable
risk, while a data-driven strategy reduces risk by introducing statistical models whose
performance is measurable.

Cost Advantages of Commodity Hardware & Open-Source Software

Cost advantage is music to a CXO’s ears. IT departments can realize significant savings by
moving workloads to commodity hardware and leveraging open-source platforms as cost-effective
ways to achieve enterprise-level computation and beyond. There is no need to overpay for
premium hardware when similar or better analytical processing can be done using commodity and
open-source systems.

Quick Turnaround and Less Bench Times

Anyone who has dealt with a large IT organization knows the pattern: more and more people,
complex processes, and communication charters make it hard to reach someone who can get a task
done; things take a long time, cost a fortune, and often arrive with substandard quality. A
good Big Data and analytics strategy can reduce proof-of-concept time smoothly and
substantially. It reduces the burden on IT and delivers high-quality, fast, and cost-effective
solutions. The result is less time wasted waiting for analysis and insights, and more time
digging through more and more data for better insights and analyses than were previously
possible.

Automation to Backfill Redundant/Mundane Tasks

Consider the roughly 80% of project time that is typically spent on data cleaning and pre-
processing. A great deal of this can be automated, dramatically raising enterprise efficiency:
less manual time is spent on data preparation, and more time is spent on analysis with
substantial ROI compared with mundane, monotonous data prep tasks.

Optimize Workforce to Leverage High Talent Cost

This is an interesting area to watch. Businesses already have talent pools that can solve
pieces of the Big Data and data science puzzle: BI (Business Intelligence) specialists,
modellers, and IT staff already working together in some shape or form. A good Big Data and
analytics strategy ensures the current workforce is leveraged to its full potential in
handling enterprise Big Data, and that the right number of data scientists are involved, with
a clear view of their contribution and ROI.

Data Continues to Grow Exponentially

Whether you like it or not, data keeps increasing. One key technological push is this growing
volume and the threat of not being able to use the exploding enterprise data for insights. A
good strategy calms growing concerns about unutilized data.

Data Is Everywhere and in Many Formats

Besides having to sift through data in huge volumes, a stream of disparate data poses its own
challenges. Text, voice, video, logs, and other emerging formats make it harder to gain
insights using traditional tools, so businesses need a Big Data toolkit prepared for the
exploding variety of data types entering the corporate data DNA.

Alternate, Multiple Synchronous & Asynchronous Data Streams

Data arriving through multiple silos in real time creates problems in keeping up within
existing data systems. These multiple streams put pressure on businesses to have an effective
strategy for handling the sources. With tools available to handle such situations, it has
become important to acquire these capabilities before the competition does.

Low Barrier to Entry

As with any business, a low barrier to entry gives businesses great leverage to try different
technologies and arrive at the best strategy. Easy frameworks and paradigms have made
available many tools that are relatively easy to deploy and can deliver phenomenal computing
horsepower.

Traditional Solutions Failing to Catch Up with New Market Conditions

Big Data has given rise to exploding volume, velocity, and variety of data. These 3Vs are
difficult to handle and demand cutting-edge technologies. New requirements have emerged from
changing market dynamics that old tools cannot address and that call for new Big Data tools.
Hence the need for a Big Data and analytics strategy that embraces these tools before the
business becomes obsolete.

The main reason that Big Data is, well, so 'big' is that companies can gain several high-value
outcomes from it. These include:

 Discovery of new business insights: Big Data help companies improve marketing,
enhance customer experience, improve operational efficiencies, identify fraud and waste,
prevent compliance failures, and achieve other outcomes that directly affect top- and
bottom-line business performance.

 Reduction in technology implementation and ownership costs. The right partner can help
companies reduce investment costs in several ways: by expertly evaluating the various
technologies available, by recommending the right size or by hosting compute infrastructure,
or by architecting the end-to-end solution stack for reasonable, predictable total cost of
ownership (TCO).

 Repeatable success. While many companies are thrilled with the initial Big Data proof-
of-concept success, it is the long-term use of Big Data that will bring in the most
success. The right partner can help bring consistency and repeatability to successive Big
Data deliverables — providing economies of scale and accelerating time-to-benefit.

As people and businesses do more of what they do in an always-on digital environment, as a
growing number of intelligent devices capture and transmit a growing volume of useful data,
and as unstructured data becomes an increasingly rich and pervasive source of business
intelligence, Big Data will continue to play a more strategic role in enterprise IT. Companies
that recognize this reality — and that act on it in a technologically, operationally and
economically optimized way — will gain sustainable competitive advantages over those that
don't.

1.7 BIG DATA ANALYTICS

Fig 1.2 Big Data analytics


Big Data analytics is indeed a revolution in the field of information technology. The use of
data analytics by companies grows every year. Big Data has the properties of high variety,
volume, and velocity, and its analysis involves techniques like machine learning, data mining,
natural language processing, and statistics. With the help of Big Data tools, multiple
operations can be performed on a single platform: you can store terabytes of data, pre-process
it, analyse it, and visualize it.

Data is extracted, prepared, and blended to provide analysis for the businesses. Large
enterprises and multinational organizations use these techniques widely these days in
different ways.

Big Data analytics helps organizations work with their data efficiently and use that data to
identify new opportunities. Different techniques and algorithms can be applied to make
predictions from the data, and multiple business strategies can be formed for the future
success of the company, leading to smarter business moves, more efficient operations, and
higher profits.

The following are the three main reasons why Big Data is so important and effective.

 Cost reduction

Big Data technologies such as Hadoop and cloud-based analytics bring significant cost
advantages when it comes to storing large amounts of data.

 Faster, better decision making.

With the speed of Hadoop and in-memory analytics, combined with the ability to analyse
new sources of data, businesses are able to analyse information immediately and make
decisions based on what they’ve learned.

 New products and services.

With the ability to gauge customer needs and satisfaction through analytics comes the
power to give customers what they want.

Real-Time Benefits of Big Data Analytics

The use of Big Data analytics extends readily to other fields as well. With the use of Big
Data, there has been enormous growth across multiple industries. Some of them are:

 Banking

 Technology

 Consumer

 Manufacturing

Especially in the banking sector, Big Data tools have been integrated into core systems.
Multiple operations can be performed on transactional data; moreover, tools like Apache Hive
let users query their data and get results in a very short period of time, and the query
engine can be tuned for better query performance.
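
As a rough illustration, the short plain-Python sketch below reproduces, on hypothetical
in-memory records, the kind of per-account aggregation that a Hive query (for example, a SUM
with GROUP BY over a transactions table) would express at scale; the account IDs and amounts
are invented for the example:

# A minimal sketch (plain Python, hypothetical data) of the aggregation an
# Apache Hive query would express over transactional data, e.g.:
#   SELECT account_id, SUM(amount) FROM transactions GROUP BY account_id;
from collections import defaultdict

# Hypothetical transaction records: (account_id, amount)
transactions = [
    ("ACC-001", 250.00),
    ("ACC-002", 75.50),
    ("ACC-001", 120.25),
    ("ACC-003", 310.00),
    ("ACC-002", 42.75),
]

totals = defaultdict(float)
for account_id, amount in transactions:
    totals[account_id] += amount  # the GROUP BY / SUM step

for account_id, total in sorted(totals.items()):
    print(account_id, round(total, 2))

In Hive itself, the same logic runs as distributed jobs over data stored in HDFS, which is
what makes short query turnaround on very large transactional tables possible.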

The use of Big Data has also increased in the education sector. There are new options for
research and analysis using data analytics, and the insights provided by Big Data analytics
tools help in understanding the needs of customers better.

Job Opportunities and Big Data Analytics

Fig 1.3 Opportunities with Big Data analytics

With huge interest and investment in Big Data technologies, professionals carrying Big Data
analytics skills are in huge demand. Fields like data analytics and data engineering are among
the most valued today. IT executives, business analysts, and software developers are learning
Big Data tools and techniques to grow with the job market. Since some Big Data tools are based
on Python and Java, it is easier for programmers already working in these languages; likewise,
users who know how to pre-process data and have skills like data cleaning can easily learn Big
Data analysis tools and analytics. With the help of visualization tools like Power BI,
QlikView, Tableau, etc., a user can easily analyse the data and present a new marketing
strategy.

In different domains of industry, the nature of the job differs, and so do the requirements.
Since analytics is emerging in every field, the workforce needs are equally enormous. The job
titles may include Big Data Analyst, Big Data Engineer, Business
Intelligence Consultants, Solution Architect, etc.

The importance of Big Data analytics leads to intense competition and increased demand for
Big Data professionals. Data Science and Analytics is an evolving field with huge potential.
There are huge requirements and significance of Big Data analytics in different fields and
industries. Hence, it becomes essential for a professional to keep oneself aware of these
techniques. At the same time, the companies can gain a lot by using these analytics tools
correctly.

Industries today are searching for new and better ways to maintain their position and be prepared
for the future. According to experts, Big Data analytics provides leaders a path to capture
insights and ideas to stay ahead in the tough competition.

According to Gartner, Big Data is high-volume, high-velocity, and high-variety information
assets that demand innovative platforms for enhanced insight and decision making.

In Big Data: A Revolution, the authors explain it as a way to solve long-standing problems of
data management and handling that industry had previously simply lived with.

With Big Data analytics, you can also unlock hidden patterns and know the 360-degree view
of customers and better understand their needs.

1.8 BIG DATA APPLICATIONS

The primary goal of Big Data applications is to help companies make more informative
business decisions by analysing large volumes of data. It could include web server logs,
Internet click stream data, social media content and activity reports, text from customer
emails, mobile phone call details and machine data captured by multiple sensors.

Here is the list of the top 10 industries using Big Data applications:

 Banking and Securities

 Communications, Media, and Entertainment

 Healthcare Providers

 Education

 Manufacturing and Natural Resources

 Government

 Insurance

 Retail and Wholesale trade

 Transportation

 Energy and Utilities

Organizations from different domains are investing in Big Data applications to examine large
data sets and uncover hidden patterns, unknown correlations, market trends, customer
preferences, and other useful business information.

Big Data Applications: Healthcare

The level of data generated within healthcare systems is not trivial. Traditionally, the health
care industry lagged in using Big Data, because of limited ability to standardize and
consolidate data. The healthcare sector has access to huge amounts of data but has been
plagued by failures in utilizing the data to curb the cost of rising healthcare and by inefficient
systems that stifle faster and better healthcare benefits across the board.

But now Big Data analytics has improved healthcare by providing personalized medicine and
prescriptive analytics. Researchers are mining the data to see which treatments are more
effective for particular conditions, identify patterns related to drug side effects, and gain
other important information that can help patients and reduce costs.

With the added adoption of mHealth, eHealth and wearable technologies the volume of data
is increasing at an exponential rate. This includes electronic health record data, imaging data,
patient generated data, sensor data, and other forms of data.

By mapping healthcare data onto geographical data sets, it’s possible to predict diseases that
will escalate in specific areas. Based on such predictions, it’s easier to strategize
diagnostics and plan the stocking of serums and vaccines.

Some hospitals, like Beth Israel, are using data collected through a cell phone app from
millions of patients to allow doctors to practice evidence-based medicine, as opposed to
administering several medical/lab tests to every patient who goes to the hospital. A battery
of tests can be thorough, but it can also be expensive and often ineffective.

Free public health data and Google Maps have been used by the University of Florida to
create visual data that allows for faster identification and efficient analysis of healthcare
information, used in tracking the spread of chronic disease.

Big Data Applications: Manufacturing

Predictive manufacturing provides near-zero downtime and transparency. It requires an enormous
amount of data and advanced prediction tools for a systematic process that turns data into
useful information.

Major benefits of using Big Data applications in manufacturing industry are:

 Product quality and defects tracking

 Supply planning

 Manufacturing process defect tracking

 Output forecasting

 Increasing energy efficiency

 Testing and simulation of new manufacturing processes

 Support for mass-customization of manufacturing

Big Data Applications: Media & Entertainment

Various companies in the media and entertainment industry are facing new business models for
the way they create, market, and distribute their content. This is driven by today’s
consumers, who search for and expect access to content anywhere, any time, on any device.

Big Data provides actionable points of information about millions of individuals. Now,
publishing environments are tailoring advertisements and content to appeal to consumers.

These insights are gathered through various data-mining activities. Big Data applications
benefits media and entertainment industry by:

 Predicting what the audience wants

 Scheduling optimization

 Increasing acquisition and retention

 Ad targeting

 Content monetization and new product development.

 Collecting, analysing, and utilizing consumer insights

 Leveraging mobile and social media content

 Understanding patterns of real-time, media content usage

A case in point is the Wimbledon Championships, which leverage Big Data to deliver detailed
sentiment analysis on the tennis matches to TV, mobile, and web users in real time.

Spotify, an on-demand music service, uses Hadoop Big Data analytics, to collect data from
its millions of users worldwide and then uses the analysed data to give informed music
recommendations to individual users.

Amazon Prime, which is driven to provide a great customer experience by offering video,
music, and Kindle books in a one-stop-shop, also heavily utilizes Big Data.

Big Data providers in this industry include Infochimps, Splunk, Pervasive Software, and
Visible Measures.

Big Data Applications: Internet of Things (IoT)

Data extracted from IoT devices provides a mapping of device inter-connectivity. Such
mappings have been used by various companies and governments to increase efficiency. IoT
is also increasingly adopted as a means of gathering sensory data, and this sensory data is
used in medical and manufacturing contexts.

Big Data Applications: Government

The use and adoption of Big Data within governmental processes allows efficiencies in terms
of cost, productivity, and innovation. In government use cases, the same data sets are often
applied across multiple applications & it requires multiple departments to work in
collaboration.

Since government operates across all domains, it plays an important role in driving Big Data
applications in each and every one of them. Some of the major areas are:

 Cyber security & Intelligence

The federal government launched a cyber-security research and development plan that
relies on the ability to analyse large data sets in order to improve the security of U.S.
computer networks.

The National Geospatial-Intelligence Agency is creating a “Map of the World” that can
gather and analyse data from a wide variety of sources, such as satellite and social media
data. It contains a variety of data from classified, unclassified, and top-secret networks.

 Crime Prediction and Prevention

Police departments can leverage advanced, real-time analytics to provide actionable
intelligence that can be used to understand criminal behaviour, identify crime/incident
patterns, and uncover location-based threats.

 Pharmaceutical Drug Evaluation

According to a McKinsey report, Big Data technologies could reduce research and
development costs for pharmaceutical makers by $40 billion to $70 billion. The FDA
and the NIH use Big Data technologies to access large amounts of data to evaluate drugs and
treatments.

 Scientific Research

The National Science Foundation has initiated a long-term plan to:

 Implement new methods for deriving knowledge from data

 Develop new approaches to education

 Create a new infrastructure to “manage, curate, and serve data to
communities”.

 Weather Forecasting

The NOAA (National Oceanic and Atmospheric Administration) gathers data every minute of
every day from land, sea, and space-based sensors. Each day, NOAA uses Big Data tools to
analyse and extract value from over 20 terabytes of data.

 Tax Compliance

Big Data Applications can be used by tax organizations to analyse both unstructured and
structured data from a variety of sources in order to identify suspicious behaviour and
multiple identities. This would help in tax fraud identification.

 Traffic Optimization

Big Data helps in aggregating real-time traffic data gathered from road sensors, GPS
devices and video cameras. The potential traffic problems in dense areas can be
prevented by adjusting public transportation routes in real time.

Big Data Applications: Banking and Securities

The Securities and Exchange Commission (SEC) is using Big Data to monitor financial market
activity. It currently uses network analytics and natural language processing to catch
illegal trading activity in the financial markets.

Retail traders, Big banks, hedge funds, and other so-called ‘big boys’ in the financial markets
use Big Data for trade analytics used in high-frequency trading, pre-trade decision-support
analytics, sentiment measurement, Predictive Analytics, etc.

This industry also relies heavily on Big Data for risk analytics, including anti-money
laundering, enterprise risk management, “Know Your Customer” checks, and fraud mitigation.

Big Data providers specific to this industry include 1010data, Panopticon Software,
StreamBase Systems, NICE Actimize, and Quartet FS.

Big Data Applications: Education

From a technical point of view, a significant challenge in the education industry is to
incorporate Big Data from different sources and vendors and to utilize it on platforms that
were not designed for such varying data. From a practical point of view, staff and
institutions have to learn new data management and analysis tools.

On the technical side, there are challenges to integrating data from different sources on
different platforms and from different vendors that were not designed to work with one
another. Politically, an issue of privacy and personal data protection associated with Big Data
used for educational purposes is a challenge.

Big Data is used quite significantly in higher education. For example, the University of
Tasmania, an Australian university with over 26,000 students, has deployed a learning and
management system that tracks, among other things, when a student logs onto the system, how
much time is spent on different pages in the system, and the overall progress of a student
over time.

In another educational use case, Big Data is used to measure teachers’ effectiveness to ensure
a pleasant experience for both students and teachers. Teacher performance can be fine-tuned
and measured against student numbers, subject matter, student demographics, student
aspirations, behavioural classification, and several other variables.

On a governmental level, the Office of Educational Technology in the U.S. Department of
Education is using Big Data to develop analytics that help course-correct students who are
going astray while using online Big Data certification courses. Click patterns are also being
used to detect boredom.

Technologies Used

There are many technologies that solve the problem of Big Data storage and processing, such as
Apache Hadoop, Apache Spark, and Apache Kafka. Let’s take an overview of these technologies
one by one:

 Apache Hadoop

Big Data is creating a big impact on industries today, and around 50% of the world’s data
has already been moved to Hadoop.

It was predicted that by 2017 more than 75% of the world’s data would have moved to Hadoop,
making this technology one of the most in demand in the market.

 Apache Spark

Further enhancement of this technology has led to the evolution of Apache Spark, a
lightning-fast, general-purpose computation engine for large-scale data processing. It can
process data up to 100 times faster than MapReduce.
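
As a minimal sketch of this style of processing, the following PySpark word count assumes a
local Spark installation and an input file named input.txt (both assumptions made here purely
for illustration):

# A minimal PySpark word count, a sketch that assumes Spark is installed
# and that "input.txt" exists locally (illustrative assumptions).
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("WordCount").getOrCreate()
sc = spark.sparkContext

counts = (
    sc.textFile("input.txt")                # read lines as an RDD
      .flatMap(lambda line: line.split())   # split each line into words
      .map(lambda word: (word, 1))          # emit (word, 1) pairs
      .reduceByKey(lambda a, b: a + b)      # sum the counts per word
)

for word, count in counts.collect():
    print(word, count)

spark.stop()

Because intermediate results are kept in memory rather than written to disk between stages,
chains of transformations like this are where Spark's speed advantage over disk-based
MapReduce shows up.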

 Apache Kafka

Apache Kafka is another addition to this Big Data ecosystem: a high-throughput distributed
messaging system frequently used with Hadoop.
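
A minimal sketch of publishing events to Kafka with the kafka-python client library is shown
below; the broker address and the topic name "clickstream" are illustrative assumptions, not
anything prescribed by Kafka itself:

# Publishing high-velocity events to Kafka using the kafka-python client.
# Assumes a broker at localhost:9092 and a topic named "clickstream"
# (both hypothetical here).
from kafka import KafkaProducer

producer = KafkaProducer(bootstrap_servers="localhost:9092")

for event in [b"page_view:/home", b"page_view:/products", b"add_to_cart:42"]:
    producer.send("clickstream", value=event)  # asynchronous send

producer.flush()   # block until all buffered records are delivered
producer.close()

Downstream, a consumer (or a Hadoop/Spark job) reads the topic and processes the stream, which
is how Kafka typically feeds data into the rest of the ecosystem.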

IT organizations have started considering Big Data initiatives for managing their data better,
visualizing it, gaining insights from it as and when required, and finding new business
opportunities to accelerate their business growth.

Every CIO wants to transform the company, enhance its business model, and identify potential
revenue sources, whether in the telecom, banking, retail, or healthcare domain.

Such business transformation requires the right tools and the right people to ensure that the
right insights are extracted at the right time from the available data.

1.9 ALGORITHMS USING MAP REDUCE

The Map Reduce algorithm is mainly inspired by the functional programming model. It is used
for processing and generating Big Data, and the data sets can be processed simultaneously,
distributed across a cluster. A Map Reduce program mainly consists of a map procedure and a
reduce method that performs a summary operation such as counting or yielding results. The Map
Reduce system works on distributed servers that run in parallel and manages all communication
between the different systems. The model is a specialization of the split-apply-combine
strategy, which helps in data analysis. Mapping is done by the Mapper class and the reduce
task is done by the Reducer class. The Map Reduce algorithm contains two important tasks,
namely Map and Reduce.

 The map task is done by means of Mapper Class.

 The reduce task is done by means of Reducer Class.

The Mapper class takes the input, tokenizes it, and maps and sorts it. The output of the
Mapper class is used as input by the Reducer class, which in turn searches for matching pairs
and reduces them.

Fig 1.4 Algorithms using Map Reduce

Map Reduce implements various mathematical algorithms to divide a task into small parts
and assign them to multiple systems. In technical terms, Map Reduce algorithm helps in
sending the Map & Reduce tasks to appropriate servers in a cluster. Map Reduce Algorithm
mainly works in three steps:

 Map Function

 Shuffle Function

 Reduce Function

Map Function

This is the first step of the Map Reduce algorithm. It takes the input data set and divides it
into smaller sub-tasks. This is done in two steps: splitting and mapping. Splitting divides
the input data set into subsets, while mapping takes each subset and performs the required
action on it. The output of this function is a set of key-value pairs.
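
A minimal plain-Python sketch of the map step, using word count as the classic example (the
function name and sample lines are illustrative, not part of any Hadoop API):

# Map step of a word count: take one split of the input and emit a
# (key, value) pair, here (word, 1), for every word found.
def map_function(lines):
    for line in lines:
        for word in line.lower().split():
            yield (word, 1)

split_1 = ["big data needs big tools", "data drives decisions"]
mapped = list(map_function(split_1))
print(mapped)   # [('big', 1), ('data', 1), ('needs', 1), ('big', 1), ...]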

Shuffle Function

This is also known as combine function and includes merging and sorting. Merging combines
all key-value pairs. All of these will have the same keys. Sorting takes the input from the
merging step and sorts all the key-value pairs by making use of the keys. This step will also
return to key-value pairs. The output will be sorted.

Reduce Function

This is the last step of this algorithm. It takes the key-value pairs from the shuffle and reduces
the operation.
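
Completing the sketch, the reduce step collapses each key's list of values into a single
result; chaining the three functions gives a tiny single-machine model of a MapReduce word
count (on a real cluster, many mappers and reducers would run these steps in parallel on
different machines):

# Reduce step plus the full map -> shuffle -> reduce chain of the sketch.
from collections import defaultdict

def map_function(lines):
    for line in lines:
        for word in line.lower().split():
            yield (word, 1)

def shuffle_function(pairs):
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return sorted(grouped.items())

def reduce_function(key, values):
    return (key, sum(values))   # the summary operation: count occurrences

lines = ["big data needs big tools", "data drives decisions"]
for key, values in shuffle_function(map_function(lines)):
    print(reduce_function(key, values))
# ('big', 2) ('data', 2) ('decisions', 1) ('drives', 1) ('needs', 1) ('tools', 1)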

Map Reduce Algorithms Make Working Easy

Relational database systems have a centralized server that stores and processes the data. When
multiple large files come into the picture, processing becomes tedious and the centralized
server becomes a bottleneck. Map Reduce instead maps the data set, dividing the data into
tuples, and the reduce task takes the output of this step and combines the tuples into smaller
sets. It works in distinct phases and creates key-value pairs that can be distributed over
different systems.

Map Reduce can be used in a variety of applications: distributed pattern-based searching,
distributed sorting, web link-graph reversal, and web access log statistics. It can also help
in creating and working with multiple clusters, desktop grids, and volunteer computing
environments, as well as dynamic cloud environments, mobile environments, and high-performance
computing environments. Google used Map Reduce to regenerate its index of the World Wide Web;
with it, old ad hoc programs were updated, different kinds of analysis were run, and live
search results were integrated without rebuilding the complete index. All inputs and outputs
are stored in the distributed file system, while transient data is stored on local disk.

1.10 SUMMARY

 Big Data is a term that describes the large volume of data – both structured and
unstructured – that inundates a business on a day-to-day basis.

 Big Data analytics is a form of advanced analytics, which involves complex applications
with elements such as predictive models, statistical algorithms and what-if analysis
powered by analytics systems.

 Big Data helps companies to make informed decisions and understand their customers’
desires.

 Big Data analysis helps companies to achieve rapid growth by analysing the real-time
data. It allows companies to defeat their competitors and achieve success.

 Big Data technologies help us to understand inefficiencies and opportunities in our
company. They play a major role in shaping the organization’s growth.

 We can now store information / data in a digital format on a digital storage device
such as hard disk, flash drive…etc. Related data are still categorized as files; related
groups of files are stored in folders. Each file has a name, extension, and icon.

 File management system is used by the operating system to access the files and
folders stored in a computer or any external storage devices.

 Distributed File System (DFS) allows users of physically distributed systems to share
their data and resources by using a Common File System. A collection of
workstations and mainframes connected by a Local Area Network (LAN) is a
configuration on Distributed File System.

 One process involved in implementing the DFS is giving access control and storage
management controls to the client system in a centralized way. The servers involved must
be able to serve data capably and efficiently.

 There are generally four characteristics that must be part of a dataset to qualify it as
Big Data—volume, velocity, variety, and veracity.

 The volume of data refers to the size of the data sets that need to be analysed and
processed, which are now frequently larger than terabytes and petabytes. An example
of a high-volume data set would be all credit card transactions on a day within the
country.

 Variety makes Big Data really big. Big Data comes from a great variety of sources
and generally is one out of three types: structured, semi structured, and unstructured
data. An example of high variety data sets would be the CCTV audio and video files
that are generated at various locations in a city.

 Veracity refers to the quality of the data that is being analysed. High veracity data has
many records that are valuable to analyse and that contribute in a meaningful way to
the overall results. An example of a high veracity data set would be data from a
medical experiment or trial.

 Velocity refers to the speed with which data is generated. High velocity data is
generated with such a pace that it requires distinct (distributed) processing techniques.
An example of a data that is generated with high velocity would be Twitter messages
or Facebook posts.

 Big Data will continue to play a more strategic role in enterprise IT. Companies that
recognize this reality — and that act on it in a technologically, operationally, and
economically optimized way — will gain sustainable competitive advantages over
those that don't.

 The three main reasons why Big Data is so important and effective are cost reduction,
faster decision making, and the development of new products and services.

 The use of Big Data analytics extends readily to other fields as well. With the use of
Big Data, there has been enormous growth in multiple industries, including banking,
technology, consumer goods, and manufacturing.

 Big Data analytics has improved healthcare by providing personalized medicine and
prescriptive analytics. Researchers are mining the data to see which treatments are more
effective for particular conditions, identify patterns related to drug side effects, and
gain other important information that can help patients and reduce costs.

 Predictive manufacturing provides near-zero downtime and transparency. It requires an
enormous amount of data and advanced prediction tools for a systematic process that turns
data into useful information.

 Big Data provides actionable points of information about millions of individuals. Now,
publishing environments are tailoring advertisements and content to appeal to consumers.
These insights are gathered through various data-mining activities.

 Data extracted from IoT devices provides a mapping of device inter-connectivity. Such
mappings have been used by various companies and governments to increase efficiency.

 The use and adoption of Big Data within governmental processes allows efficiencies
in terms of cost, productivity, and innovation. In government use cases, the same data
sets are often applied across multiple applications & it requires multiple departments
to work in collaboration.

 The Securities and Exchange Commission (SEC) is using Big Data to monitor financial
market activity. It currently uses network analytics and natural language processing to
catch illegal trading activity in the financial markets.

 Big Data in education can be used to measure teachers’ effectiveness to ensure a pleasant
experience for both students and teachers. Teacher performance can be fine-tuned and
measured against student numbers, subject matter, student demographics, student
aspirations, behavioural classification, and several other variables.

 The Map Reduce algorithm is mainly inspired by the functional programming model. It is
used for processing and generating Big Data, and the data sets can be processed
simultaneously, distributed across a cluster.

1.11 KEYWORDS

 RDBMS - A relational database management system (RDBMS) is software that stores data in
tables made up of rows and columns and manages the relationships between those tables.
Data is defined, queried, and manipulated using Structured Query Language (SQL), and
transactions are kept reliable through properties such as atomicity, consistency,
isolation, and durability (ACID). Traditional RDBMSs scale vertically on a single server,
which is one reason they struggle with the volume, velocity, and variety of Big Data and
why distributed technologies such as Hadoop and NoSQL emerged.

 EXE extension - An EXE file contains an executable program for Windows. EXE is
short for "executable," and it is the standard file extension used by Windows
programs. For many Windows users, EXE files are synonymous with Windows
programs, making ".exe" one of the most recognizable file extensions. EXE files
contain binary machine code that has been compiled from source code. The machine
code is saved in such a way that it can be executed directly by the computer's CPU,
thereby "running" the program. EXE files may also contain resources, such as
graphics assets for the GUI, the program's icon, and other resources needed by the
program.

 Internet of things (IoT) - The internet of things, or IoT, is a system of interrelated
computing devices, mechanical and digital machines, objects, animals, or people that
are provided with unique identifiers (UIDs) and the ability to transfer data over a
network without requiring human-to-human or human-to-computer interaction.

A thing on the internet of things can be a person with a heart monitor implant, a farm
animal with a biochip transponder, an automobile that has built-in sensors to alert the
driver when tire pressure is low or any other natural or man-made object that can be
assigned an Internet Protocol (IP) address and is able to transfer data over a network

 BI (Business Intelligence) - Business intelligence (BI) refers to the procedural and
technical infrastructure that collects, stores, and analyses the data produced by a
company’s activities. BI is a broad term that encompasses data mining, process analysis,
performance benchmarking, and descriptive analytics. BI parses all the data generated by a
business and presents easy-to-digest reports, performance measures, and trends that inform
management decisions.

 HTTP - Hypertext Transfer Protocol (HTTP) is an application-layer protocol for
transmitting hypermedia documents, such as HTML. It was designed for communication between
web browsers and web servers, but it can also be used for other purposes. HTTP follows a
classical client-server model, with a client opening a connection to make a request, then
waiting until it receives a response. HTTP is a stateless protocol, meaning that the
server does not keep any data (state) between two requests. HTTP allows the fetching of
resources, such as HTML documents; it is the foundation of any data exchange on the Web,
and requests are initiated by the recipient, usually the Web browser. A complete document
is reconstructed from the different sub-documents fetched, for instance text, layout
description, images, videos, scripts, and more.

 Network-Attached Storage (NAS) - A term used to refer to storage devices that connect to
a network and provide file access services to computer systems. These devices generally
consist of an engine that implements the file services, and one or more devices on which
data is stored. NAS uses file access protocols such as NFS or CIFS. Designers of software
applications suited to smaller business environments tend to use file-based systems to
meet these goals, particularly those of flexibility, simplicity, and ease of management;
and there are a wide variety of easy-to-use tools to provide security, and robust backup
and recovery. NAS systems are popular with enterprise and small businesses in many
industries as effective, scalable, and low-cost storage solutions. They can be used to
support email systems, accounting databases, payroll, video recording and editing, data
logging, business analytics and more; a wide variety of other business applications are
underpinned by NAS systems.

1.12 LEARNING ACTIVITY

1. Carry out research on how Big Data is serving different business sectors.

___________________________________________________________________________
___________________________________________________________________________

2. Collect facts and find out how the implementation of DFS has affected the business
industry.

___________________________________________________________________________
___________________________________________________________________________

1.13 UNIT END QUESTIONS

A. Descriptive Questions

Short Questions

1. What is Big Data? Give suitable examples.

2. Describe how Big Data came up during the 21st century.

3. What are extensions in file names? Explain.

4. What are the two components of DFS?

5. How does the Distributed File System (DFS) work?

Long Questions

1. Explain how each phase contributed to the contemporary meaning of Big Data.

2. What are the different features of DFS? Explain.

3. Explain the importance of Big Data.

4. Enumerate the benefits of Big Data along with real life applications.

5. Identify and explain the different technologies used in Big Data.

B. Multiple Choice Questions

1. Which of the following forms of analytics does Big Data analytics represent, involving
complex applications with elements such as predictive models and statistical algorithms?

a. Advanced

b. Basic

c. Intermediate

d. None of these

2. Which of the following is part of the key concepts of Big Data, apart from volume and
variety?

a. Value

b. Velocity

c. Variance

d. Validity

3. In which year was Hadoop created at Yahoo?

a. 2001

b. 2002

c. 2005

d. 2008

4. Which of the following is the first storage mechanism used by computers to store
data?

a. Floppy

b. Pen drives.

c. Hard drives.

d. Punch Cards

5. What is the full form of NFS?

a. National Funds Scheme

b. Network File System

c. Natural File System

d. National File Scheme

Answers

1-a, 2-b, 3-c, 4-d, 5-b

1.14 REFERENCES

References

 Shabbir, M.Q., Gardezi, S.B.W. Application of Big Data analytics and organizational
performance: the mediating role of knowledge management practices. J Big Data 7, 47
(2020).

 Chen, H., Chiang, R., & Storey, V. (2012). Business Intelligence and Analytics:
From Big Data to Big Impact. MIS Quarterly, 36(4), 1165-1188.
doi:10.2307/41703503

 Jiwat Ram, C. Z. (2016). The Implications of Big Data Analytics on Business Intelligence:
A Qualitative Study in China. Procedia Computer Science, Volume 87, 221-226.

Textbooks

 Wamba SF, et al. Big Data analytics and firm performance: effects of dynamic
capabilities. J Bus Res. 2017; 70:356–65.

 Sagiroglu S, Sinanc D. Big Data: a review. In: 2013 international conference on
collaboration technologies and systems (CTS). IEEE. 2013.

 De Mauro A, Greco M, Grimaldi M. A formal definition of Big Data based on its essential
features. Libr Rev. 2016;65(3):122–35.

 W. Ahmad, B.S.M.K. Quadri, Big Data promises value: Is hardware technology taken on
board, Industrial Management & Data Systems, 115 (9) (2015).

Websites

 https://searchbusinessanalytics.techtarget.com/definition/big-data-analytics

 https://www.sas.com/en_us/insights/analytics/big-data-analytics.html

 https://www.mckinsey.com/business-functions/mckinsey-analytics/our-
insights/how-companies-are-using-big-data-and-analytics#

 https://www.selecthub.com/big-data-analytics/big-data-business-analytics/

 https://marutitech.com/big-data-analytics-will-play-important-role-businesses/

 https://www.ibm.com/analytics/hadoop/big-data-analytics

