Unit – III
Big Data
Definition:
Big data is a term used to describe large and complex sets of data that are difficult to
manage and analyze using traditional data processing tools. Big data is often
characterized by its volume, velocity, and variety. It can be structured, semi-structured,
or unstructured, and can come from a variety of sources, such as social media, email,
and mobile devices.
Big data is used in data science for machine learning, predictive modeling, and other
advanced analytics applications. It can help businesses make more accurate decisions
by providing insights that can be used to improve strategic business moves. For example,
big data analytics can help businesses understand product viability and keep up with
trends.
Evolution of Big Data:
Big Data technology has grown considerably over the last few decades. The main
milestones in its evolution are described below:
Page | 27
Lokesh Suryawanshi
1. Data Warehousing:
In the 1990s, data warehousing emerged as a solution to store and analyze large
volumes of structured data.
2. Hadoop:
Hadoop was introduced in 2006 by Doug Cutting and Mike Cafarella. It is an
open-source framework that provides distributed storage and processing of
large data sets.
3. NoSQL Databases:
Around 2009, the term NoSQL came into wide use for databases that provide a
flexible way to store and retrieve unstructured data.
4. Cloud Computing:
Cloud computing lets companies store their important data in remote data
centers, which reduces their infrastructure and maintenance costs.
5. Machine Learning:
Machine learning algorithms learn from large amounts of data and are used to
analyze that data for meaningful insights. This has led to the development of
artificial intelligence (AI) applications.
6. Data Streaming:
Data Streaming technology has emerged as a solution to process large volumes
of data in real time.
7. Edge Computing:
Edge Computing is a distributed computing paradigm that allows data
processing to be done at the edge of the network, closer to the source of the
data.
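The processing model associated with Hadoop (item 2 above) is MapReduce. The following is a minimal single-machine sketch of its three phases, written in plain Python rather than against the Hadoop API. The function names and the sample text are illustrative assumptions; in a real cluster each input split would be mapped on a different node.

```python
from collections import defaultdict


def map_phase(document):
    """Map: emit a (word, 1) pair for every word in one input split."""
    return [(word.lower(), 1) for word in document.split()]


def shuffle_phase(mapped_pairs):
    """Shuffle: group all emitted values by key before reducing."""
    grouped = defaultdict(list)
    for key, value in mapped_pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(grouped):
    """Reduce: sum the values for each key to get one count per word."""
    return {key: sum(values) for key, values in grouped.items()}


def word_count(documents):
    """Run map, shuffle and reduce over a list of input splits."""
    mapped = []
    for document in documents:  # each document stands in for one split
        mapped.extend(map_phase(document))
    return reduce_phase(shuffle_phase(mapped))


# Hypothetical input splits, invented for illustration.
splits = ["big data is big", "data streams in real time"]
print(word_count(splits))
```

Because the map step looks at one split at a time and the reduce step looks at one key at a time, both steps can be spread across many machines, which is the property that lets this model handle large data sets.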
Four V's of big data:
The four V's of big data are volume, velocity, variety, and veracity:
• Volume: The amount of data collected
• Velocity: How quickly data is generated, gathered, and analyzed
• Variety: The range of data types and sources (structured, semi-structured, and unstructured)
• Veracity: How reliable and accurate the data is
Drivers for Big Data:
Some drivers of big data include:
• The four Vs
Volume, variety, velocity, and value are the four key drivers of the big data revolution.
• Cloud computing
Cloud computing environments allow for quick scaling of IT infrastructure and a
pay-as-you-go model.
• Data-driven decision-making
There is an increasing demand for data-driven decision-making.
• Artificial intelligence and machine learning
The rise of artificial intelligence and machine learning in enterprise applications is a
major driver of the big data market.
• Software innovations
Innovations and developments in software for handling unstructured big data are a
major driver of the big data market.
• Elastic scalability
Big data architectures can be scaled horizontally, allowing the environment to be
adjusted to the size of each workload.
• Data quality
Big data should be stored and maintained properly to ensure it can be used by less
experienced data scientists and analysts.
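The horizontal scaling described under elastic scalability above can be sketched as hash partitioning: each record key is assigned to one of several nodes, so adding nodes spreads the same workload more thinly. This is a minimal sketch under stated assumptions; `assign_node`, `partition` and the `user-N` keys are hypothetical names, and production systems often use consistent hashing instead of the simple modulo shown here, because a plain modulo reassigns most keys whenever the node count changes.

```python
import zlib


def assign_node(key, num_nodes):
    """Map a record key to a node index using a stable hash (CRC32)."""
    return zlib.crc32(key.encode("utf-8")) % num_nodes


def partition(keys, num_nodes):
    """Distribute keys across num_nodes partitions, one list per node."""
    nodes = [[] for _ in range(num_nodes)]
    for key in keys:
        nodes[assign_node(key, num_nodes)].append(key)
    return nodes


# Hypothetical record keys, invented for illustration.
keys = [f"user-{i}" for i in range(1000)]
for num_nodes in (2, 4, 8):
    sizes = [len(part) for part in partition(keys, num_nodes)]
    print(num_nodes, "nodes ->", sizes)
```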
Big Data Analytics:
Big data analytics refers to the systematic processing and analysis of large amounts of
data and complex data sets, known as big data, to extract valuable insights. Big data
analytics allows for the uncovering of trends, patterns and correlations in large amounts
of raw data to help analysts make data-informed decisions. This process allows
organizations to leverage the exponentially growing data generated from diverse sources,
including internet-of-things (IoT) sensors, social media, financial transactions and smart
devices to derive actionable intelligence through advanced analytic techniques.
Four main data analysis methods
These are the four methods of data analysis at work within big data:
Descriptive analytics
The "what happened" stage of data analysis. Here, the focus is on summarizing and
describing past data to understand its basic characteristics.
Diagnostic analytics
The “why it happened” stage. By delving deep into the data, diagnostic analysis identifies
the root causes of the patterns and trends observed in descriptive analytics.
Predictive analytics
The “what will happen” stage. It uses historical data, statistical modeling and machine
learning to forecast trends.
Prescriptive analytics
The “what to do” stage, which goes beyond prediction to provide
recommendations for optimizing future actions based on insights derived from all
previous stages.
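The four stages above can be illustrated on a very small data set. The sketch below uses invented toy numbers (monthly ad spend and sales in arbitrary units) and hypothetical function names. The diagnostic step uses a Pearson correlation, which only hints at a relationship and does not prove a cause, and the predictive step is a simple least-squares line rather than a full machine learning model.

```python
from math import sqrt
from statistics import mean

# Hypothetical toy data: monthly ad spend and sales (arbitrary units).
ad_spend = [1, 2, 3, 4, 5]
sales = [12, 14, 16, 18, 20]


def describe(values):
    """Descriptive: summarize what happened."""
    return {"mean": mean(values), "min": min(values), "max": max(values)}


def correlation(xs, ys):
    """Diagnostic: Pearson correlation as a first hint at why values moved."""
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var_x = sum((x - mx) ** 2 for x in xs)
    var_y = sum((y - my) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


def fit_line(xs, ys):
    """Predictive: least-squares line y = intercept + slope * x."""
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    slope = cov / sum((x - mx) ** 2 for x in xs)
    return my - slope * mx, slope


def spend_for_target(target_sales, intercept, slope):
    """Prescriptive: invert the fitted line to recommend an action."""
    return (target_sales - intercept) / slope


intercept, slope = fit_line(ad_spend, sales)
print("descriptive:", describe(sales))
print("diagnostic:", correlation(ad_spend, sales))
print("predictive (spend = 6):", intercept + slope * 6)
print("prescriptive (target = 30):", spend_for_target(30, intercept, slope))
```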
Big Data Applications:
1. Tracking Customer Spending Habits and Shopping Behavior: In big retail stores (like
Amazon, Walmart, Big Bazaar, etc.), the management team has to keep data on customers'
spending habits (which products customers spend on, which brands they prefer, how
frequently they spend), shopping behavior, and most-liked products (so that those
products can be kept in the store). The production or procurement rate of a product is
then set based on how often it is searched for or sold.
2. Recommendation: By tracking customer spending habits and shopping behavior, big
retail stores provide recommendations to customers. E-commerce sites like Amazon,
Walmart, and Flipkart do product recommendation: they track which products a customer
is searching for and, based on that data, recommend similar products to that customer.
3. Smart Traffic System: Data about traffic conditions on different roads is collected
through roadside cameras, cameras at the entry and exit points of the city, and GPS
devices placed in vehicles (Ola, Uber cabs, etc.). All such data is analyzed, and jam-free
or less congested, faster routes are recommended. In this way, a smart traffic system
can be built in a city using big data analysis. A further benefit is that fuel consumption
can be reduced.
4. Secure Air Traffic System: Sensors are present at various places on an aircraft (such
as the propeller). These sensors capture data like flight speed, moisture, temperature,
and other environmental conditions. Based on analysis of such data, the environmental
parameters within the aircraft are set and adjusted.
5. Auto Driving Car: Big data analysis helps drive a car without human intervention.
Cameras and sensors placed at various spots on the car gather data such as the size of
surrounding vehicles, obstacles, and the distance to them. This data is analyzed, and
calculations are carried out, such as what angle to turn, what the speed should be, and
when to stop. These calculations help the car take action automatically.
6. Virtual Personal Assistant Tool: Big data analysis helps virtual personal assistant tools
(like Siri on Apple devices, Cortana on Windows, Google Assistant on Android) answer the
various questions asked by users. The tool tracks the user's location, local time, season,
and other data related to the question asked. By analyzing all such data, it provides an
answer.
7. IoT:
Manufacturing companies install IoT sensors in machines to collect operational data.
By analyzing such data, it can be predicted how long a machine will work without problems
and when it will require repair, so that the company can act before the machine develops
serious issues or breaks down completely. Thus, the cost of replacing the whole
machine can be saved.
In the healthcare field, big data is making a significant contribution. Using big data tools,
data regarding patient experience is collected and used by doctors to give better
treatment. IoT devices can sense symptoms of a probable oncoming disease in the human
body and help prevent it through early treatment. IoT sensors placed near a patient or a
newborn baby constantly keep track of health parameters such as heart rate and blood
pressure. Whenever any parameter crosses the safe limit, an alarm is sent to a doctor,
so that they can act remotely very quickly.
8. Education Sector: Organizations that conduct online educational courses use big data
to find candidates interested in those courses. If someone searches for a YouTube tutorial
video on a subject, online or offline course providers for that subject send that person
online ads about their courses.
9. Energy Sector: Smart electric meters read the power consumed every 15 minutes and
send the readings to a server, where the data is analyzed to estimate the times of day
when the power load is lowest across the city. Using this system, manufacturing units
and householders are advised to run their heavy machines at night, when the power load
is low, to get a lower electricity bill.
10. Media and Entertainment Sector: Media and entertainment service providers like
Netflix, Amazon Prime, and Spotify analyze data collected from their users. Data such as
what type of video or music users watch or listen to most and how long users spend on
the site is collected and analyzed to set the next business strategy.
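The recommendation application (item 2 above) can be sketched with a simple "bought together" count. This is only an illustrative assumption about how such a system might start, not a description of how Amazon, Walmart or Flipkart actually work. The basket data and the function names `build_cooccurrence` and `recommend` are hypothetical.

```python
from collections import Counter
from itertools import combinations


def build_cooccurrence(baskets):
    """Count how often each ordered pair of products shares a basket."""
    counts = Counter()
    for basket in baskets:
        for a, b in combinations(sorted(set(basket)), 2):
            counts[(a, b)] += 1
            counts[(b, a)] += 1
    return counts


def recommend(product, counts, top_n=2):
    """Return up to top_n products most often bought with `product`."""
    related = [(other, n) for (p, other), n in counts.items() if p == product]
    # Highest count first; ties are broken alphabetically for stable output.
    related.sort(key=lambda item: (-item[1], item[0]))
    return [other for other, _ in related[:top_n]]


# Hypothetical purchase history, invented for illustration.
baskets = [
    ["phone", "case", "charger"],
    ["phone", "case"],
    ["phone", "charger", "earbuds"],
    ["laptop", "mouse"],
    ["phone", "case", "earbuds"],
]

counts = build_cooccurrence(baskets)
print(recommend("phone", counts))
print(recommend("laptop", counts))
```

At big data scale the same counting step would be distributed across many machines, but the idea stays the same: past behavior of many customers drives what is suggested to the next one.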
Designing Data Architecture:
Data architecture design is important for creating a vision of interactions occurring
between data systems, like for example if data architect wants to implement data
integration, so it will need interaction between two systems and by using data
architecture the visionary model of data interaction during the process can be achieved.
Data architecture also describes the type of data structures applied to manage data and
it provides an easy way for data preprocessing. The data architecture is formed by dividing
into three essential models and then are combined :
• Conceptual model –
It is a business model that uses the Entity Relationship (ER) model to describe
entities, their attributes, and the relationships between them.
• Logical model –
It is a model in which the problem is represented in logical form, such as rows
and columns of data, classes, XML tags, and other DBMS techniques.
• Physical model –
The physical model holds the database design, such as which type of database
technology is suitable for the architecture.
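The three models above can be traced through one small example. The sketch below assumes a hypothetical business rule, "a customer places many orders", as the conceptual model. It expresses the logical model as Python dataclasses and the physical model as SQLite tables. The entity names, columns and sample values are invented for illustration, and SQLite is just one possible choice of database technology.

```python
import sqlite3
from dataclasses import dataclass

# Conceptual model (hypothetical): a Customer places many Orders.


# Logical model: entities as typed attributes and keys, still
# independent of any particular database product.
@dataclass
class Customer:
    customer_id: int
    name: str


@dataclass
class Order:
    order_id: int
    customer_id: int  # foreign key to Customer
    amount: float


# Physical model: the same design bound to a concrete technology.
DDL = """
CREATE TABLE customer (
    customer_id INTEGER PRIMARY KEY,
    name        TEXT NOT NULL
);
CREATE TABLE orders (
    order_id    INTEGER PRIMARY KEY,
    customer_id INTEGER NOT NULL REFERENCES customer(customer_id),
    amount      REAL NOT NULL
);
"""


def build_database():
    """Create an in-memory SQLite database with the physical schema."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(DDL)
    return conn


def total_spent(conn, customer_id):
    """Sum the order amounts for one customer (0 if there are none)."""
    row = conn.execute(
        "SELECT COALESCE(SUM(amount), 0) FROM orders WHERE customer_id = ?",
        (customer_id,),
    ).fetchone()
    return row[0]


conn = build_database()
customer = Customer(1, "Asha")
conn.execute(
    "INSERT INTO customer VALUES (?, ?)", (customer.customer_id, customer.name)
)
for order in [Order(10, 1, 250.0), Order(11, 1, 100.0)]:
    conn.execute(
        "INSERT INTO orders VALUES (?, ?, ?)",
        (order.order_id, order.customer_id, order.amount),
    )
print(total_spent(conn, 1))
```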
Little Data and Big Data:
Below is a table of differences between Small Data and Big Data:
Feature | Small Data | Big Data
Variety | Data is typically structured and uniform | Data is often unstructured and heterogeneous
Veracity | Data is generally high quality and reliable | Data quality and reliability can vary widely
Processing | Data can often be processed on a single machine or in-memory | Data requires distributed processing frameworks such as MapReduce or Spark
Technology | Traditional | Modern
Analytics | Traditional statistical techniques can be used to analyze data | Advanced analytics techniques such as machine learning are often required
Collection | Generally, it is obtained in an organized manner and then inserted into the database | Big Data collection is done using pipelines with queues such as AWS Kinesis or Google Pub/Sub to balance high-speed data
Volume | Data in the range of tens or hundreds of gigabytes | Size of data is more than terabytes
Analysis Areas | Data marts (Analysts) | Clusters (Data Scientists), Data marts (Analysts)
Quality | Contains less noise because the data is collected in a controlled manner | Usually, the quality of data is not guaranteed
Processing | It requires batch-oriented processing pipelines | It has both batch and stream processing pipelines
Database | SQL | NoSQL
Velocity | A regulated and constant flow of data; data aggregation is slow | Data arrives at extremely high speeds; large volumes of data are aggregated in a short time
Structure | Structured data in tabular format with fixed schema (Relational) | A wide variety of data sets including tabular data, text, audio, images, video, logs, JSON, etc. (Non-Relational)
Scalability | They are usually vertically scaled | They are mostly based on horizontally scaling architectures, which gives more versatility at a lower cost
Query Language | SQL only | Python, R, Java, SQL
Hardware | A single server is sufficient | Requires more than one server
Value | Business intelligence, analysis and reporting | Complex data mining techniques for pattern finding, recommendation, prediction, etc.
Optimization | Data can be optimized manually (human powered) | Requires machine learning techniques for data optimization
Storage | Storage within enterprises, local servers, etc. | Usually requires distributed storage systems on cloud or in external file systems
People | Data Analysts, Database Administrators and Data Engineers | Data Scientists, Data Analysts, Database Administrators and Data Engineers
Security | Security practices for Small Data include user privileges, data encryption, hashing, etc. | Securing Big Data systems is much more complicated. Best security practices include data encryption, cluster network isolation, strong access control protocols, etc.
Nomenclature | Database, Data Warehouse, Data Mart | Data Lake
Infrastructure | Predictable resource allocation, mostly vertically scalable hardware | More agile infrastructure with horizontally scalable hardware
Applications | Small-scale applications, such as personal or small business data management | Large-scale applications, such as enterprise-level data management, internet of things (IoT), and social media analysis
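The difference between batch and stream processing in the table above can be sketched with a single aggregate, the mean of a series of readings. This is a minimal illustration in plain Python, not tied to any particular streaming framework. `batch_mean`, `StreamingMean` and the sample readings are hypothetical.

```python
def batch_mean(readings):
    """Batch style: all data is available, so compute once at the end."""
    return sum(readings) / len(readings)


class StreamingMean:
    """Stream style: update the result as each record arrives.

    Only the count and the current mean are stored, so memory use
    does not grow with the number of records seen.
    """

    def __init__(self):
        self.count = 0
        self.mean = 0.0

    def update(self, value):
        """Fold one new reading into the running mean and return it."""
        self.count += 1
        self.mean += (value - self.mean) / self.count
        return self.mean


# Hypothetical sensor readings, invented for illustration.
readings = [10, 20, 30, 40]

stream = StreamingMean()
for value in readings:
    print("after", value, "->", stream.update(value))

print("batch result:", batch_mean(readings))
```

Both approaches reach the same final value, but the streaming version has an up-to-date answer after every record, which is what high-velocity data usually requires.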