Unit 1 BDT
Getting an Overview of Big Data: Introduction to Big Data, Structuring Big Data,
Elements of Big Data, Big Data Analytics. Exploring the use of Big Data in Business
Context: Use of Big Data in Social Networking, Use of Big Data Preventing Fraudulent
Activities, Use of Big Data in Retail Industry
Big Data is a collection of data that is huge in volume and yet grows exponentially with
time. Its size and complexity are so great that no traditional data
management tool can store or process it efficiently.
The New York Stock Exchange is an example of a Big Data source: it generates about one
terabyte of new trade data per day.
A single jet engine can generate more than 10 terabytes of data in 30 minutes of flight time.
With many thousands of flights per day, the data generated reaches
many petabytes.
Every minute, nearly 510 comments are posted, 293,000 statuses are updated, and
136,000 photos are uploaded on Facebook.
Every hour, Walmart, a global discount department store chain, handles more
than 1 million customer transactions.
Every day, consumers make around 11.5 million payments by using PayPal.
Hadoop by Apache is widely used for storing and managing Big Data. Analyzing Big
Data is a challenging task as it involves large distributed file systems.
The process of capturing or collecting Big Data is known as ‘datafication.’ Big Data is
‘datafied’ so that it can be used productively.
Healthcare Providers
Education
Government
Insurance
Transportation
Location Tracking
Precision Medicine
Advertising
Personalized marketing.
Big Data is the current stage in the evolution of data, driven by the enormous velocity, variety,
and volume of data. Velocity is the speed with which data flows into an
organization; variety refers to the varied forms of data, such as structured,
semi-structured, or unstructured; and volume is the amount of data an
organization has to deal with.
Structuring of data, in simple terms, means arranging the available data in such a manner
that it becomes easy to study, analyze, and derive conclusions from it.
Today, various sources generate a variety of data, such as images, text, audio, etc.
These different types of data can be structured only if they are sorted and organized in
some logical pattern. Thus, the process of structuring data requires one to first
understand the various types of data available today.
Types of Data
Data that comes from multiple sources, such as databases, Enterprise Resource
Planning (ERP) systems, weblogs, chat history, and GPS maps, varies in its format.
It can be classified into the following types:
Structured data
Unstructured data
Semi-structured data
In a real-world scenario, unstructured data is typically larger in volume than
structured and semi-structured data: approximately 70% to 80% of data is in
unstructured form. Figure below illustrates the types of data that comprise Big Data:
Structured Data
Any data that can be stored, accessed, and processed in a fixed format is
termed 'structured' data. Structured data can be defined as data that has a
defined repeating pattern. This pattern makes it easier for any program to sort, read,
and process the data. Processing structured data is much easier and faster than
processing data without any specific repeating pattern.
Structured data:
Flat files in the form of records (like comma separated values (csv) and tab-
separated files)
Table shows a sample of structured data in which the attribute data for every
customer is stored in the defined fields:
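As a minimal sketch of why a fixed, repeating pattern makes structured data easy to process, the Python snippet below parses a small comma-separated file with the standard csv module. The field names and the three customer records are hypothetical sample values chosen for illustration; they are not taken from the table referred to above.

```python
import csv
import io

# Hypothetical sample records: every row has the same three fields.
SAMPLE_CSV = """customer_id,name,city
101,Asha,Pune
102,Ravi,Delhi
103,Meera,Pune
"""

def load_customers(text):
    """Parse CSV text into a list of dictionaries keyed by field name."""
    return list(csv.DictReader(io.StringIO(text)))

def customers_in_city(rows, city):
    """Return the names of customers whose 'city' field matches."""
    return [row["name"] for row in rows if row["city"] == city]

customers = load_customers(SAMPLE_CSV)
pune_customers = customers_in_city(customers, "Pune")
```

Because every record exposes the same fields, a query such as "customers in Pune" needs only a field lookup; no interpretation of free text is required.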
Unstructured Data
Unstructured data is a set of data that might or might not have any logical or
repeating patterns. Any data whose form or structure is unknown is classified as
unstructured data.
Unstructured data:
Comprises inconsistent data, such as data obtained from files, social media
websites, satellites, etc.
Consists of data in different formats such as e-mails, text, audio, video, or images
Some sources of unstructured data include:
Working with unstructured data poses certain challenges, which are as follows:
Sorting, organizing, and arranging unstructured data in different sets and formats
Combining and linking unstructured data in a more structured format to derive any
logical conclusions out of the available information
Cost in terms of storage space and human resources (data analysts and
scientists) needed to deal with the exponential growth of unstructured data.
Semi-structured data
According to Gartner, data is growing at the rate of 59% every year. This growth can
be depicted in terms of the following four Vs:
Volume
Velocity
Variety
Veracity
Volume
The Internet alone generates a huge amount of data. The following figures give
an idea of the Internet traffic:
The Internet has around 14.3 trillion live Web pages; 48 billion Web pages are
indexed by Google Inc., and 14 billion Web pages are indexed by Microsoft Bing.
Total world-wide Internet traffic in the year 2013 was 43,639 petabytes.
Over 900,000 servers are owned by Google Inc., the largest number in the world.
Total data stored on the Internet is over 1 yottabyte.
Velocity
The term 'velocity' refers to the speed of generation of data. How fast the data is
generated and processed to meet demand determines the real potential of the data.
Big Data velocity deals with the speed at which data flows in from sources like
business processes, application logs, networks, social media sites, sensors,
mobile devices, etc. The flow of data is massive and continuous.
Social media, including Facebook posts, tweets, and other social media activities,
create huge amounts of data, which must be analyzed almost instantly
because its value degrades quickly with time.
Portable devices, including mobile phones, PDAs, etc., also generate data at a high speed.
Variety
We all know that data is being generated at a very fast pace. Now, this data is
generated from different types of sources, such as internal, external, social, and
behavioral, and comes in different formats, such as images, text, videos, etc. Even a
single source can generate data in varied formats, for example, GPS and social
networking sites, such as Facebook, produce data of all types, including text, images,
videos, etc.
Veracity
Veracity generally refers to the uncertainty of data, i.e., whether the obtained data is
correct or consistent. Out of the huge amount of data that is generated in almost
every process, only the data that is correct and consistent can be used for further
analysis. Data, when processed, becomes information; however, a lot of effort goes into
processing the data. Big Data, especially in the unstructured and semi-structured
forms, is messy in nature, and it takes a good amount of time and expertise to clean
that data and make it suitable for analysis.
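The cleaning step described above can be illustrated with a small Python sketch that keeps only complete and consistent records. The record layout (a 'sensor' name and a 'temperature' value) and the plausible range of -50 to 60 are assumptions made for this example only.

```python
def clean_readings(records):
    """Keep only records that are complete and internally consistent.

    A record is kept when it has a non-empty 'sensor' name and a
    'temperature' that parses as a number within a plausible range.
    The range (-50 to 60) is an assumed threshold for this example.
    """
    cleaned = []
    for record in records:
        sensor = (record.get("sensor") or "").strip()
        try:
            temperature = float(record.get("temperature"))
        except (TypeError, ValueError):
            continue  # missing or non-numeric value
        if sensor and -50.0 <= temperature <= 60.0:
            cleaned.append({"sensor": sensor, "temperature": temperature})
    return cleaned

# Hypothetical raw feed containing typical quality problems.
raw = [
    {"sensor": "s1", "temperature": "21.5"},
    {"sensor": "", "temperature": "19"},      # missing sensor name
    {"sensor": "s2", "temperature": "n/a"},   # non-numeric value
    {"sensor": "s3", "temperature": "999"},   # implausible value
    {"sensor": " s4 ", "temperature": 18},    # stray whitespace
]
usable = clean_readings(raw)
```

Only two of the five raw records survive, which shows in miniature why veracity checks consume time before any analysis can begin.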
The process of analyzing large volumes of diverse data sets using advanced
analytic techniques is referred to as Big Data Analytics.
These diverse data sets include structured, semi-structured, and unstructured data,
from different sources, and in different sizes from terabytes to zettabytes.
Different types of data require different approaches. These different approaches to
analytics give rise to the four different types of Big Data analytics.
Descriptive Analytics
An example of the use of descriptive analytics is the Dow Chemical Company. The
company utilized its past data to increase its facility utilization across its offices and
labs.
Predictive Analytics
Predictive Analytics, as can be discerned from the name itself, is concerned with
predicting future incidents. These future incidents can be market trends, consumer
trends, and many such market-related events.
This type of analytics makes use of historical and present data to predict future
events. This is the most commonly used form of analytics among businesses.
Predictive analytics doesn’t only work for the service providers but also for the
consumers. It keeps track of our past activities and based on them, predicts what we
may do next.
Predictive analytics uses techniques such as data mining, AI, and machine learning to
analyze current data and forecast what might happen in specific scenarios.
Examples of Predictive analytics include next best offers, churn risk, and renewal risk
analysis.
We can take the example of PayPal to understand how businesses use predictive
analytics.
The company determines the steps it needs to take to protect its
clients against fraudulent transactions. It uses all past payment data and user behavior data
to predict fraudulent activities.
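The general idea of using historical data to predict a future value can be sketched with a least-squares trend line. This is only one simple predictive technique, chosen here for illustration; it is not PayPal's method, and the monthly sales figures are hypothetical.

```python
def fit_line(xs, ys):
    """Fit y = intercept + slope * x by ordinary least squares."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept

def forecast(xs, ys, next_x):
    """Predict the value at next_x from the fitted trend line."""
    slope, intercept = fit_line(xs, ys)
    return intercept + slope * next_x

# Hypothetical history: sales for months 1 to 5.
months = [1, 2, 3, 4, 5]
sales = [100, 110, 120, 130, 140]
predicted_month_6 = forecast(months, sales, 6)
```

Real predictive models use many more variables and more robust methods, but the pattern is the same: learn from past observations, then extrapolate.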
Prescriptive Analytics
Prescriptive analytics is the most valuable yet underused form of analytics. It is the
next step in predictive analytics. The prescriptive analysis explores several possible
actions and suggests actions depending on the results of descriptive and predictive
analytics of a given dataset.
Prescriptive analytics is a combination of data and various business rules. The data
of prescriptive analytics can be both internal (organizational inputs) and external
(social media insights).
Examples of prescriptive analytics for customer retention are next best action and
next best offer analysis.
A use case of prescriptive analytics can be the Aurora Health Care system. It saved
$6 million by reducing the readmission rates by 10%.
Prescriptive analytics is well suited to the healthcare industry. It can be used to
enhance the process of drug development, find the right patients for clinical trials,
etc.
For example, if we have to find the best way of shipping goods from a factory to a
destination so as to minimize costs, we will use prescriptive analytics. Figure shows a
diagrammatic representation of the stages involved in prescriptive analytics:
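The shipping example can be sketched in Python as a small decision rule: among the candidate routes, recommend the cheapest one that still meets the delivery deadline. The route names, costs, and transit times are hypothetical, and a real prescriptive system would typically use an optimization solver over far more options.

```python
def cheapest_route(routes, max_days):
    """Return the lowest-cost route that arrives within max_days.

    Returns None when no route satisfies the deadline.
    """
    feasible = [route for route in routes if route["days"] <= max_days]
    if not feasible:
        return None
    return min(feasible, key=lambda route: route["cost"])

# Hypothetical options for moving goods from a factory to a destination.
routes = [
    {"name": "road", "cost": 1200, "days": 4},
    {"name": "rail", "cost": 900, "days": 6},
    {"name": "air", "cost": 3000, "days": 1},
]
recommended = cheapest_route(routes, max_days=5)
```

With a five-day deadline the rule recommends the road route; relaxing the deadline to six days would switch the recommendation to rail, which shows how the prescribed action depends on the business constraint.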
Table describes various analytical approaches typically associated with Big Data:
Advantages of Big Data Analytics:
The right analysis of the available data can improve major business processes in
various ways. For example, in a manufacturing unit, data analytics can improve the
functioning of the following processes:
Procurement—To find out which suppliers are more efficient and cost-effective in
delivering products on time
A closer look at some specific industries will help you to understand the application
of Big Data in these sectors.
Transportation
Big Data has greatly improved transportation services. The data containing traffic
information is analyzed to identify traffic jam areas. Suitable steps can then be taken,
on the basis of this analysis, to keep the traffic moving in such areas. Distributed
sensors are installed in handheld devices, on the roads and on vehicles to provide
real-time traffic information. This information is analyzed and
disseminated to commuters and also to the traffic control authority.
Education
Big Data has transformed modern-day education through innovative
approaches, such as e-learning, that let teachers analyze each student's ability to
comprehend and thus impart education effectively in accordance with each
student's needs. The analysis is done by studying the responses to questions,
recording the time consumed in attempting those questions, and analyzing other
behavioral signals of the students. Big Data also assists in analyzing the
requirements and finding easy and innovative ways of imparting education,
especially distance learning over vast geographical areas.
Travel
The travel industry also uses Big Data to conduct business. It maintains complete
details of all the customer records that are then analyzed to determine certain
behavioral patterns in customers. For example, in the airline industry, Big Data is
analyzed for identifying personal preferences or spotting which passengers like to
have window seats for short-haul flights and aisle seats for long-haul flights. This
helps airlines to offer similar seats to customers when they make a fresh
booking with the airline.
Big Data also helps airlines to track customers who regularly fly on specific
routes so that they can make the right cross-sell and up-sell offers. Some airlines also apply
analytics to pricing, inventory, and advertising for improving customer experiences,
leading to more customer satisfaction, and hence, more business. Some airlines
even go to the length of evaluating customers who tend to miss their flights. They try
to help such customers by delaying the flights or booking them on another flight.
Government
Big Data has come to play an important role in almost all the undertakings and
processes of government. According to the UK free market, “the UK government
could save up to £33 billion a year by using public Big Data more effectively.”
Analysis of Big Data promotes clarity and transparency in various government
processes and helps in:
Using budgets more judiciously and reducing unnecessary wastage and costs
Healthcare
In healthcare, the pharmacy and medical device companies use Big Data to improve
their research and development practices, while health insurance companies use it
to determine patient-specific treatment therapy modes that promise the best results.
Big Data also helps researchers to work towards eliminating healthcare-related
challenges before they become real problems. Big Data helps doctors to analyze the
requirement and medical history of every patient and provide individualistic services
to them, depending on their medical condition.
Telecom
The mobile revolution and the Internet usage on mobile phones have led to a
tremendous increase in the amount of data generated in the telecom sector.
Managing this huge pool of data has almost become a challenge for the telecom
industry.
A human being lives in a social environment and gains knowledge and experience
through communication. Today, communication is not restricted to meeting in
person. The affordable and convenient use of mobile phones and the Internet has made
communication and sharing data of all kinds possible across the globe. Some
popular social networking sites are Twitter, Facebook, and LinkedIn. These social
networking sites are also called social media.
Here, we analyze the effects of Big Data generated from social media on different
industries. Let's first understand the meaning of social network data.
Social network data refers to the data generated from people socializing on social
media. On a social networking site, you will find different people constantly adding
and updating comments, statuses, preferences, etc. All these activities generate
large amounts of data. Analyzing and mining such large volumes of data shows
business trends with respect to the wants, preferences, likes, and dislikes of a
wide audience.
This data can be segregated on the basis of different age groups, locations, and
genders for the purpose of analysis. Based on the information extracted,
organizations design products and services specific to people's needs.
Figure shows the social network data generated daily through various social media:
Figure: Social Network Data Generated Every Minute of the Day
Social Network Analysis (SNA) is the analysis performed on the data obtained from
social media. As the data generated is huge in volume, it results in the formation of a
Big Data pool.
Let’s understand the importance of social network data with the help of an example
of a Mobile Network Operator (MNO). The data captured by an MNO in a day, such as
the cell phone calls, text messages, and other related details of all its customers, is
huge in volume. This type of data is used daily for different purposes.
An MNO does not simply need to record and analyze the calls of a customer but the
entire network calls related to that customer. The company must study the data of
the people whom the customer called and also of the people in the customer’s
network who called back the customer. Such a network is called a social network.
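The idea of a customer's calling network can be sketched in Python by turning call records into an undirected graph and then listing the customer's direct contacts and the contacts of those contacts. The record format (caller, callee) and the sample calls are assumptions made for illustration.

```python
from collections import defaultdict

def build_call_network(call_records):
    """Build an undirected graph: each person maps to the set of people
    they have called or been called by."""
    network = defaultdict(set)
    for caller, callee in call_records:
        network[caller].add(callee)
        network[callee].add(caller)
    return network

def extended_network(network, customer):
    """Return (direct contacts, contacts of contacts) for a customer."""
    direct = set(network.get(customer, set()))
    second_level = set()
    for contact in direct:
        second_level |= network[contact]
    # Exclude the customer and anyone already counted as a direct contact.
    return direct, second_level - direct - {customer}

# Hypothetical call records for one day.
calls = [("A", "B"), ("B", "C"), ("C", "A"), ("B", "D"), ("E", "F")]
network = build_call_network(calls)
direct, second_level = extended_network(network, "A")
```

Here customer A has direct contacts B and C, and reaches D only through B. An operator analyzing A's behavior would study this whole neighborhood rather than A's calls alone.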
Some facts about big data and social media are listed as follows:
Facebook collects 500 times more data each day than the New York Stock
Exchange. (Source: BI Intelligence)
Twitter produces 12 times more data each day than the New York Stock Exchange.
(Source: BI Intelligence)
By 2016, there will be 18.9 billion network connections, i.e., 2.5 connections per
person. (Source: IBM Big Data Hub)
The following are the areas in which decision-making processes are influenced by
social network data:
Business intelligence
Marketing
Business Intelligence
Today, the preferences of consumers have changed due to their busy schedules.
They no longer have the time to read newspapers thoroughly, watch all the TV
commercials, or go through all the e-mails they receive in their inbox. Consumers can
now make their preferences clear and select the marketing messages they wish to
receive―when, where, and from whom. In today’s competitive scenario, marketers
aim to deliver what consumers want by using interactive communication across
digital channels such as e-mail, mobile, social, and the Web.
With the increasing popularity of social media and growing volume of data every
second, organizations competing to make it big in the market must not only identify
and extract the information relevant for their company, products, and services but
also comprehend and respond to the information on a continuous basis.
Credit card fraud—This type of fraud is quite common these days and is related to
the use of credit card facilities. In an online shopping transaction, the online retailer
cannot see the authentic user of the card and therefore, the valid owner of the card
cannot be verified. It is quite likely that a fake or a stolen card is used in the
transaction. In an online transaction, in spite of the security checks, such as address
verification or card security code, fraudsters manage to manipulate the loopholes in
the system.
Exchange or return policy fraud—An online retailer always has a policy allowing
the exchange and return of goods and sometimes, people take advantage of this
policy. These people buy a product online, use it, and then return it, claiming that they are
not satisfied with the product. Sometimes, they even report non-delivery of the
product and later attempt to sell it online. What leads to such a fraud is that retailers
encourage consumers to order products in bulk and later return the ones that they
don’t require. Such a fraud can be averted by charging a restocking fee on the
returned goods, getting customer’s signature on the delivery of the product, and
staying cautious of such customers who are known to commit such frauds.
Personal information fraud—In this type of fraud, people obtain the login
information of a customer and then log-in to the customer’s account, purchase a
product online, and then change the delivery address to a different location. The
actual customer keeps calling the retailer to refund the amount as he or she has not
made the transaction. Once the transaction is proved fraudulent, the retailer has to
refund the amount to the customer.
All these frauds can be prevented only by studying the customer’s ordering patterns
and keeping track of out-of-line orders. Other aspects should also be taken into
consideration such as any change in the shipping address, rush orders, sudden huge
orders, and suspicious billing addresses. By observing such precautions, the
frequency of the occurrence of such frauds can be reduced to a certain extent, but
cannot be completely eliminated.
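The checks listed above can be sketched as a simple rule-based screen in Python. The field names, the "three times the average" threshold for a sudden huge order, and the rule that a billing address differing from the shipping address counts as suspicious are all assumptions for this example, not rules stated in these notes.

```python
def order_flags(order, history):
    """Return a list of reasons why an order looks out of line
    compared with the customer's past orders."""
    flags = []
    known_addresses = {past["shipping_address"] for past in history}
    if history and order["shipping_address"] not in known_addresses:
        flags.append("new shipping address")
    if order.get("rush"):
        flags.append("rush order")
    if history:
        average = sum(past["amount"] for past in history) / len(history)
        if order["amount"] > 3 * average:  # assumed threshold
            flags.append("unusually large order")
    if order["billing_address"] != order["shipping_address"]:
        flags.append("billing and shipping addresses differ")
    return flags

# Hypothetical customer history and two incoming orders.
history = [
    {"amount": 50, "shipping_address": "12 Lake Rd"},
    {"amount": 70, "shipping_address": "12 Lake Rd"},
]
suspicious = {"amount": 400, "shipping_address": "9 Hill St",
              "billing_address": "12 Lake Rd", "rush": True}
normal = {"amount": 60, "shipping_address": "12 Lake Rd",
          "billing_address": "12 Lake Rd", "rush": False}
suspicious_flags = order_flags(suspicious, history)
normal_flags = order_flags(normal, history)
```

A flagged order is not proof of fraud; the flags only mark orders that deserve a closer look, which matches the point that such checks reduce fraud but cannot eliminate it.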
We have seen that one of the ways to prevent financial frauds is to study the
customer’s ordering pattern and other related data. However, this method works only
when the data to be analyzed is small in size. In order to deal with huge amounts of
data and gain meaningful business insights, organizations need to apply Big Data
analytics. Analyzing Big Data allows organizations to:
Identify new methods of fraud and add them to the list of fraud-prevention checks
Verify whether a product has actually been delivered to the valid recipient
Determine the location of the customer and the time when the product was
actually delivered
Check the listings of popular retail sites, such as eBay, to find
whether the product is up for sale somewhere else
Fraud Detection in Real Time
Big Data also helps to detect frauds in real time. It compares live transactions with
different data sources to validate the authenticity of online transactions. For
example, in an online transaction, Big Data would compare the incoming IP address
with the geo-data received from the customer’s smart phone apps. A valid match
between the two confirms the authenticity of the transaction.
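The location comparison can be sketched in Python by measuring the distance between the position implied by the IP address and the position reported by the phone. The coordinates are passed in directly because the lookup services are outside the scope of this sketch, and the 50 km tolerance is an assumed value.

```python
import math

def distance_km(lat1, lon1, lat2, lon2):
    """Great-circle distance between two points (haversine formula)."""
    radius_km = 6371.0  # mean Earth radius
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = math.radians(lat2 - lat1)
    dlambda = math.radians(lon2 - lon1)
    a = (math.sin(dphi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(dlambda / 2) ** 2)
    return 2 * radius_km * math.asin(math.sqrt(a))

def locations_match(ip_location, phone_location, max_km=50.0):
    """Treat the transaction as consistent when both locations are
    within max_km of each other (max_km is an assumed tolerance)."""
    return distance_km(*ip_location, *phone_location) <= max_km

# Approximate (latitude, longitude) pairs for two Indian cities.
mumbai = (19.0760, 72.8777)
delhi = (28.6139, 77.2090)
same_city = locations_match(mumbai, mumbai)
different_city = locations_match(mumbai, delhi)
```

A mismatch such as an IP address in one city and a phone in another would not block the transaction by itself, but it is a strong signal for further verification.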
Big Data also examines the entire historical data to track suspicious patterns of the
customer order. These patterns are then used to create checks for avoiding real-
time fraud. Big Data analysis is performed in real time by retailers to know the actual
time when the products were delivered to customers. Costly products often have
sensors attached to them that transmit their location information. When such
products are delivered to customers, the streaming data obtained from these
sensors provides location information to the retailer, thereby, preventing frauds.
Centralization of Big Data takes place through Massively Parallel Processing (MPP)
systems. Any organization that aims at improving its analytic scalability needs an MPP
system. With the continuous increase in the volume of data, it is not always possible to
move data as part of the analysis process except where it is absolutely required. MPP is
one of the most widely used techniques for storing and analyzing huge volumes of data.
Let us now understand what an MPP database is and what makes it so special and
preferred. An MPP database has several independent pieces of data stored on
multiple networks of connected computers. It eliminates the concept of one central
server having a single CPU and disk.
The data in an MPP database is divided into different disks managed by different
CPUs across different servers, as shown in Figure:
Figure: MPP System Data Storage
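The way an MPP database spreads data over independent servers can be sketched in Python. Hash partitioning on a key is used here as an assumed distribution scheme (the notes do not say which scheme is used), and each "node" is simulated as a plain list; the rows are hypothetical.

```python
import zlib

def node_for_key(key, node_count):
    """Pick a node for a key using a stable hash (CRC32)."""
    return zlib.crc32(str(key).encode("utf-8")) % node_count

def partition_rows(rows, key_field, node_count):
    """Distribute rows across nodes so each node holds its own slice."""
    nodes = [[] for _ in range(node_count)]
    for row in rows:
        nodes[node_for_key(row[key_field], node_count)].append(row)
    return nodes

def parallel_total(nodes, value_field):
    """Each node sums its own slice; the partial sums are then combined.
    In a real MPP system the per-node sums would run concurrently."""
    partial_sums = [sum(row[value_field] for row in node) for node in nodes]
    return sum(partial_sums), partial_sums

# Hypothetical sales rows spread over three nodes.
rows = [{"customer_id": i, "amount": 10 * i} for i in range(1, 11)]
nodes = partition_rows(rows, "customer_id", node_count=3)
total, partial_sums = parallel_total(nodes, "amount")
```

The key design point is that the query moves to the data: each node works only on its local slice, and only the small partial results travel over the network.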
Image analytics is another emerging field that can help detect frauds. It refers to the
process of analyzing image data with the help of digital processing of the image.
Examples include the use of bar codes and QR codes. Some other examples include
complex solutions such as facial recognition and position-and-movement analysis.
Today, images and videos contribute to 80 percent of unstructured data. Analytical
systems that deal with Big Data are designed to integrate and understand images,
videos, text, and numbers.
Big Data can also help in creating maps and graphs for comparisons that can be
used to analyze situations and take decisions. An analysis in the graphical form, for
example, can help identify the customers, areas, and products that display a high
fraud rate. Big Data can even show comparisons between products and regions,
which alert retailers as to where a greater probability of fraud exists. The retailer can
then take proper actions to mitigate the risk accordingly.
Big Data has huge potential for the retail industry as well. Considering the immense
number of transactions and their correlation, the retail industry offers a promising
space for Big Data to operate.
Seemingly simple questions, such as the following, are easy to answer when there is
a single retail location and a small customer base:
What else has customer X bought, and what kind of coupons can we send to
customer X?
Many times, extracting data in real time is not feasible as systems are affected
because of scaling issues. Suppose you want to know if a particular item is in stock
in another nearby store. This data cannot be found immediately and needs some
phone calls or other ways of accessing information and therefore, prevents the
immediate sale of the item. Even if access to the data is possible, there may not be
anything particularly rich or useful about it. Raw transactional data can only help a
company understand its sales but does not provide any relationships, patterns, or
other clues for deeper analysis. Also, the fact remains that most of the Big Data is
just not required and not useful either. Some information in a Big Data feed can have
a long-term strategic value while some information will be used immediately and
some information will not be used at all. The main part of taming Big Data is to
identify which portions fall into which category.
Radio Frequency Identification (RFID) technology enables better item tracking by
distinguishing items that are out of stock from those that are merely missing from the
shelves. For instance, if an item is not available on the shelves, it does not imply that
the item is not available anywhere in the store.
With the help of an RFID reader and a mobile computer, the inventory can be
immediately verified and stocks replenished, if required.
Various types of RFID tags are available for various environments such as cardboard
boxes, wooden, glass, or metal containers. Tags also come in various sizes and are
of varied capabilities, including read and write capability, memory, and power
requirements. They also have a wide range of durability. Some varieties are paper-
thin and are typically for one-time use and are called ‘smart labels.’ RFID tags can
also be customized and withstand heat, moisture, acids, and other extreme
conditions. Some RFID tags are also reusable, thus offering a Total Cost of
Ownership (TCO) benefit over bar code labels.
The use of RFIDs saves time, reduces labor, enhances the visibility of products
throughout the production-delivery cycle, and saves costs.
Organizations can tag all their capital assets, such as pallets, vehicles, and tools, in
order to trace them anytime and from any location. Readers fixed at specific
locations can observe and record all movements of the tagged assets with great
accuracy. This mechanism also works as a security check and alerts supervisors
and raises an alarm in case anyone tries to take the asset outside the authorized
area.
When containers are loaded for shipment, tracking pallets with RFIDs are included in
them. These RFIDs contain records of what is stored in the container. This helps
production managers to have a complete view of the inventory level and location of
containers. This information can be used to locate items and fulfil rush orders
without any waste of time.
Shipping containers, pallets, cylinders, and reusable plastic bottles having RFID tags
can be easily identified at the dock entry as they leave with an outbound
consignment. After the database is matched with the shipping information, the
manufacturers of the products create a log of each shipping container with its
details and develop a procedure for tracking their goods. This information can be
utilized to reduce the time required for documentation and can be of great value in
resolving disputes of lost and damaged goods.
Inventory Control:
One of the primary benefits of using RFID is inventory tracking, especially in areas
where tracking has not been done or was not possible before. RFID tags can be read
even if the contents are packed and are not in the direct line of sight. This means
that an entire pallet with an assortment of goods can be read without disturbing the
arrangement of goods in the pallet. RFID tags are resistant to temperature and
environmental variances such as dirt, moisture, heat, and contaminants. On the
other hand, bar codes cannot handle such conditions and are prone to damage or
errors.
Using an RFID tracking system can result in an optimized inventory level, and thus
reduce the overall cost of stocking and labor. RFID allows manufacturers to track
inventory for raw materials, work in progress, or finished goods. Readers installed
on shelves can update inventory automatically and raise alarms in case the
requirement for restocking arises.
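The automatic inventory update can be sketched in Python. The reader is assumed to report (tag ID, product) pairs, possibly reading the same tag more than once; the product names and minimum stock levels are hypothetical.

```python
from collections import Counter

def shelf_counts(tag_reads):
    """Count items per product from (tag_id, product) reads.

    A tag may be read several times, so reads are de-duplicated
    by tag ID before counting.
    """
    unique_tags = {tag_id: product for tag_id, product in tag_reads}
    return Counter(unique_tags.values())

def restock_alerts(counts, minimum_levels):
    """Return the products whose shelf count is below the minimum."""
    return sorted(product for product, minimum in minimum_levels.items()
                  if counts.get(product, 0) < minimum)

# Hypothetical reads from one shelf reader; tag t1 is seen twice.
reads = [("t1", "soap"), ("t2", "soap"), ("t1", "soap"), ("t3", "rice")]
minimum_levels = {"soap": 2, "rice": 3, "tea": 1}
counts = shelf_counts(reads)
alerts = restock_alerts(counts, minimum_levels)
```

Soap meets its minimum level, while rice is below it and tea is absent from the shelf, so both rice and tea are reported for restocking.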
RFID tags can also be used to trigger automated shipment tracking applications.
Manufacturers use the readings obtained from these tags for generating a shipment
manifest, which is used for many tasks including:
Nowadays, the Serial Shipping Container Code (SSCC) is widely used in shipping labels.
An SSCC can easily be encoded in an RFID tag in order to provide automatic handling
of shipments. The data contained in the RFID tag can be combined with the
shipment information, which can easily be read by the receiving organization to
simplify the receiving process and eliminate processing delays.
Regulatory Compliance:
The entire custody trail can be produced before regulatory bodies such as the Food
and Drug Administration (FDA), Department of Transportation (DOT), and
Occupational Safety and Health Administration (OSHA), and used to meet other regulatory
requirements, provided the RFID tag that travels with the material has been updated
with all the handling data. This could be of great use for companies that work with
hazardous items, food, pharmaceuticals, and other regulated materials.