UNIT ‐ I:
Introduction to big data: Data, Characteristics of data and Types of digital data:
Unstructured, Semi- structured and Structured - Sources of data. Big Data Evolution
-Definition of big data-Characteristics and Need of big data-Challenges of big data.
Big data analytics, Overview of business intelligence.
1.1 What is Data?
Data is defined as individual facts, such as numbers, words, measurements,
observations or just descriptions of things.
For example, data might include individual prices, weights, addresses, ages, names,
temperatures, dates, or distances.
There are two main types of data:
1. Quantitative data is provided in numerical form, like the weight, volume, or cost
of an item.
2. Qualitative data is descriptive, but non-numerical, like the name, sex, or eye
colour of a person.
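The distinction between the two types can be illustrated with a small Python sketch. The record and its field names are hypothetical, and treating "numeric value" as "quantitative" is a simplifying assumption (a numeric ID, for example, would not really be a measurement).

```python
# Hypothetical record; field names and values are illustrative only.
person = {"name": "Asha", "eye_colour": "brown", "age": 29, "weight_kg": 61.5}


def split_by_type(record):
    """Separate numeric (quantitative) fields from descriptive (qualitative) ones."""
    quantitative, qualitative = {}, {}
    for field, value in record.items():
        # bool is a subclass of int in Python, so exclude it explicitly.
        if isinstance(value, (int, float)) and not isinstance(value, bool):
            quantitative[field] = value
        else:
            qualitative[field] = value
    return quantitative, qualitative


quantitative, qualitative = split_by_type(person)
```

Here `age` and `weight_kg` end up in `quantitative`, while `name` and `eye_colour` end up in `qualitative`.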
1.2 Characteristics of Data
The six key characteristics of data are listed and discussed below:
1. Accuracy
2. Validity
3. Reliability
4. Timeliness
5. Relevance
6. Completeness
1. Accuracy
Data should be sufficiently accurate for the intended use and should be captured only
once, although it may have multiple uses. Data should be captured at the point of
activity.
BIG DATA ANALYTICS DEPT.OF INFORMATION TECHNOLOGY
2. Validity
Data should be recorded and used in compliance with relevant requirements, including
the correct application of any rules or definitions. This will ensure consistency between
periods and with similar organizations, measuring what is intended to be measured.
3. Reliability
Data should reflect stable and consistent data collection processes across collection
points and over time. Progress toward performance targets should reflect real changes
rather than variations in data collection approaches or methods. Source data is clearly
identified and readily available from manual, automated, or other systems and records.
4. Timeliness
Data should be captured as quickly as possible after the event or activity and must be
available for the intended use within a reasonable time period. Data must be available
quickly and frequently enough to support information needs and to influence service
or management decisions.
5. Relevance
Data captured should be relevant to the purposes for which it is to be used. This will
require a periodic review of requirements to reflect changing needs.
6. Completeness
Data requirements should be clearly specified based on the information needs of the
organization and data collection processes matched to these requirements.
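Some of these characteristics can be checked automatically. The following is a minimal sketch, assuming a hypothetical patient-temperature record; the field names, the valid temperature range, and the 24-hour freshness limit are illustrative assumptions, not requirements taken from these notes.

```python
from datetime import datetime, timedelta

# Hypothetical requirements: required fields and maximum age of a reading.
REQUIRED_FIELDS = ("patient_id", "temperature_c", "recorded_at")
MAX_AGE = timedelta(hours=24)


def check_record(record, now):
    """Return a list of data quality problems found in one record."""
    problems = []
    # Completeness: every required field must be present and non-empty.
    for field in REQUIRED_FIELDS:
        if record.get(field) in (None, ""):
            problems.append(f"missing {field}")
    # Validity: the value must obey the agreed rule (a plausible body temperature).
    temp = record.get("temperature_c")
    if isinstance(temp, (int, float)) and not (30.0 <= temp <= 45.0):
        problems.append("temperature out of range")
    # Timeliness: the reading must be recent enough to be useful.
    recorded = record.get("recorded_at")
    if isinstance(recorded, datetime) and now - recorded > MAX_AGE:
        problems.append("stale record")
    return problems


now = datetime(2024, 1, 2, 12, 0)
good = {"patient_id": "P1", "temperature_c": 36.8,
        "recorded_at": datetime(2024, 1, 2, 9, 0)}
bad = {"patient_id": "", "temperature_c": 52.0,
       "recorded_at": datetime(2023, 12, 30, 9, 0)}

good_problems = check_record(good, now)
bad_problems = check_record(bad, now)
```

The `good` record passes all three checks, while the `bad` record fails each of them. Accuracy, reliability, and relevance depend on how the data is collected and used, so they are harder to verify from a single record.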
1.3 Types of Digital Data
Digital data is the electronic representation of information in a format or
language that machines can read and understand.
In more technical terms, digital data is information that has been
converted into a binary, machine-readable format.
The power of digital data is that any analog inputs, from very simple text
documents to genome sequencing results, can be represented with the binary
system.
Types of Digital Data:
Structured data
Unstructured data
Semi-structured data
Structured Data:
Structured data refers to any data that resides in a fixed field within a record or
file.
It has a particular data model.
It is meaningful data.
The data is arranged in rows and columns.
Structured data has the advantage of being easily entered, stored, queried and
analysed.
E.g.: relational databases, spreadsheets.
Structured data is often managed using Structured Query Language (SQL).
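As a minimal sketch of structured data managed with SQL, the example below uses Python's built-in `sqlite3` module with an in-memory database. The `employee` table, its columns, and the sample rows are hypothetical and exist only for illustration.

```python
import sqlite3

# In-memory database; nothing is written to disk.
conn = sqlite3.connect(":memory:")

# The schema fixes the fields (columns) and their types in advance.
conn.execute(
    "CREATE TABLE employee ("
    "id INTEGER PRIMARY KEY, name TEXT NOT NULL, "
    "dept TEXT NOT NULL, salary REAL NOT NULL)"
)
conn.executemany(
    "INSERT INTO employee (name, dept, salary) VALUES (?, ?, ?)",
    [("Ravi", "IT", 52000.0), ("Meena", "IT", 61000.0), ("John", "HR", 45000.0)],
)

# Because every row has the same fields, aggregation is a one-line query.
rows = conn.execute(
    "SELECT dept, COUNT(*), AVG(salary) FROM employee GROUP BY dept ORDER BY dept"
).fetchall()
conn.close()
```

Every row fits the same fixed set of columns, which is what makes the data easy to enter, store, query, and analyse.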
Sources of Structured Data:
SQL Databases
Spreadsheets such as Excel
OLTP Systems
Online forms
Sensors such as GPS or RFID tags
Network and Web server logs
Medical devices
Advantages of Structured Data:
Easy to understand and use: Structured data has a well-defined schema or data
model, making it easy to understand and use. This allows for easy data
retrieval, analysis, and reporting.
Consistency: The well-defined structure of structured data ensures consistency
and accuracy in the data, making it easier to compare and analyze data across
different sources.
Efficient storage and retrieval: Structured data is typically stored in relational
databases, which are designed to efficiently store and retrieve large amounts
of data. This makes it easy to access and process data quickly.
Enhanced data security: Structured data can be more easily secured than
unstructured or semi-structured data, as access to the data can be controlled
through database security protocols.
Clear data lineage: Structured data typically has a clear lineage or history,
making it easy to track changes and ensure data quality.
Disadvantages of Structured Data:
Inflexibility: Structured data can be inflexible in terms of accommodating new
types of data, as any changes to the schema or data model require significant
changes to the database.
Limited complexity: Structured data is often limited in terms of the complexity
of relationships between data entities. This can make it difficult to model
complex real-world scenarios.
Limited context: Structured data often lacks the additional context and
information that unstructured or semi-structured data can provide, making it
more difficult to understand the meaning and significance of the data.
Expensive: Structured data requires the use of relational databases and related
technologies, which can be expensive to implement and maintain.
Data quality: The structured nature of the data can sometimes lead to missing
or incomplete data, or data that does not fit cleanly into the defined schema,
leading to data quality issues.
Unstructured Data:
Unstructured data cannot be readily classified or fitted into a neat box.
It is also called unclassified data.
It does not conform to any data model.
Business rules are not applied.
Indexing is not required.
E.g.: photos and graphic images, videos, streaming instrument data,
webpages, PDF files, PowerPoint presentations, emails, blog entries, wikis and
word processing documents.
Sources of Unstructured Data:
Web pages
Images (JPEG, GIF, PNG, etc.)
Videos
Memos
Reports
Word documents and PowerPoint presentations
Surveys
Advantages of Unstructured Data:
It supports data that lacks a proper format or sequence.
The data is not constrained by a fixed schema
Very Flexible due to absence of schema.
Data is portable
It is very scalable
It can deal easily with the heterogeneity of sources.
This type of data has a variety of business intelligence and analytics
applications.
Disadvantages of Unstructured Data:
It is difficult to store and manage unstructured data due to the lack of schema and
structure.
Indexing the data is difficult and error-prone because the structure is unclear and
there are no predefined attributes. As a result, search results are not very
accurate.
Securing the data is a difficult task.
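The indexing difficulty can be seen in a small sketch. One common way to make free text searchable is an inverted index that maps each word to the documents containing it; this technique is not described in these notes and is shown only as an illustration. The documents and their identifiers are hypothetical.

```python
import re
from collections import defaultdict

# Hypothetical free-text documents with no predefined attributes.
documents = {
    "memo1": "Quarterly sales report: revenue grew in the north region.",
    "email7": "Please review the sales forecast before Friday.",
    "blog3": "Our new region office opens next month.",
}


def build_index(docs):
    """Map every lowercase word to the set of document ids that contain it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for word in re.findall(r"[a-z]+", text.lower()):
            index[word].add(doc_id)
    return index


index = build_index(documents)
sales_hits = sorted(index["sales"])
region_hits = sorted(index["region"])
# An exact-word index misses variants: "reports" does not match "report".
has_plural = "reports" in index
```

The index finds exact words only. A search for "reports" returns nothing even though one memo contains "report", which shows why search over unstructured data tends to be less accurate than a query over well-defined fields.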
Semi-structured Data:
Self-describing data.
Metadata (Data about data).
Also called quiz data: data in between structured and unstructured.
It is a form of structured data that does not follow a formal data model.
Data which does not have a rigid structure.
E.g.: E-mails, word processing software.
XML and other markup languages are often used to manage semi-structured
data.
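The self-describing nature of semi-structured data can be sketched with Python's standard `json` and `xml.etree.ElementTree` modules. The records, tag names, and attribute names below are hypothetical examples.

```python
import json
import xml.etree.ElementTree as ET

# Two JSON records: the keys describe the data, but the records
# do not share one fixed schema.
raw = (
    '[{"name": "Ravi", "email": "ravi@example.com"},'
    ' {"name": "Meena", "phones": ["555-0100", "555-0101"], "city": "Chennai"}]'
)
people = json.loads(raw)
fields_per_record = [sorted(p.keys()) for p in people]

# An XML document: tags and attributes act as metadata about the values.
xml_doc = (
    "<order id='42'>"
    "<item sku='A1' qty='2'/><item sku='B7' qty='1'/>"
    "<note>Gift wrap</note>"
    "</order>"
)
order = ET.fromstring(xml_doc)
order_id = order.get("id")
total_qty = sum(int(item.get("qty")) for item in order.findall("item"))
note = order.findtext("note")
```

The two JSON records carry different sets of fields, yet each can be interpreted on its own because the field names travel with the data.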
Sources of semi-structured Data:
E-mails
XML and other markup languages
Binary executables
TCP/IP packets
Zipped files
Integration of data from different sources
Web pages
Advantages of Semi-structured Data:
The data is not constrained by a fixed schema
Flexible, i.e., the schema can be easily changed.
Data is portable.
It is possible to view structured data as semi-structured data.
It supports users who cannot express their needs in SQL.
It can deal easily with the heterogeneity of sources.
Flexibility: Semi-structured data provides more flexibility in terms of data
storage and management, as it can accommodate data that does not fit into a
strict, predefined schema. This makes it easier to incorporate new types of data
into an existing database or data processing pipeline.
Scalability: Semi-structured data is particularly well-suited for managing large
volumes of data, as it can be stored and processed using distributed computing
systems, such as Hadoop or Spark, which can scale to handle massive
amounts of data.
Faster data processing: Semi-structured data can be processed more quickly
than traditional structured data, as it can be indexed and queried in a more
flexible way. This makes it easier to retrieve specific subsets of data for analysis
and reporting.
Improved data integration: Semi-structured data can be more easily integrated
with other types of data, such as unstructured data, making it easier to combine
and analyze data from multiple sources.
Richer data analysis: Semi-structured data often contains more contextual
information than traditional structured data, such as metadata or tags. This can
provide additional insights and context that can improve the accuracy and
relevance of data analysis.
Disadvantages of Semi-structured Data:
The lack of a fixed, rigid schema makes the data difficult to store.
Interpreting the relationship between data is difficult as there is no separation
of the schema and the data.
Queries are less efficient as compared to structured data.
Complexity: Semi-structured data can be more complex to manage and
process than structured data, as it may contain a wide variety of formats, tags,
and metadata. This can make it more difficult to develop and maintain data
models and processing pipelines.
Lack of standardization: Semi-structured data often lacks the standardization
and consistency of structured data, which can make it more difficult to ensure
data quality and accuracy. This can also make it harder to compare and analyze
data across different sources.
Reduced performance: Processing semi-structured data can be more resource-
intensive than processing structured data, as it often requires more complex
parsing and indexing operations. This can lead to reduced performance and
longer processing times.
Limited tooling: While there are many tools and technologies available for
working with structured data, there are fewer options for working with semi-
structured data. This can make it more challenging to find the right tools and
technologies for a particular use case.
Data security: Semi-structured data can be more difficult to secure than
structured data, as it may contain sensitive information in unstructured or less-
visible parts of the data. This can make it more challenging to identify and
protect sensitive information from unauthorized access.
Overall, while semi-structured data offers many advantages in terms of
flexibility and scalability, it also presents some challenges and limitations that
need to be carefully considered when designing and implementing data
processing and analysis pipelines.
1.4 Big Data
Big Data is a collection of data that is huge in volume and still growing exponentially with
time. Its size and complexity are so great that traditional data management tools cannot
store or process it efficiently. In short, big data is still data, only at a much larger
scale.
What is an Example of Big Data?
Following are some of the Big Data examples-
New York Stock Exchange: The New York Stock Exchange generates about one terabyte of
new trade data per day.
Social Media: Statistics show that more than 500 terabytes of new data are ingested into
the databases of the social media site Facebook every day. This data is mainly
generated through photo and video uploads, message exchanges, comments,
etc.
Jet engine: A single jet engine can generate more than 10 terabytes of data in 30 minutes of
flight time. With many thousands of flights per day, the data generated reaches
many petabytes.
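A rough back-of-the-envelope calculation shows how the jet engine figure scales to petabytes. The rate of 10 terabytes per 30 minutes comes from the text; the number of flights, the average flight duration, and the single engine per flight are assumptions chosen only to make the arithmetic concrete.

```python
# From the text: one engine produces about 10 TB per 30 minutes of flight.
TB_PER_HALF_HOUR = 10

# Assumptions for illustration only (not figures from the source).
FLIGHTS_PER_DAY = 5000
AVG_FLIGHT_MINUTES = 120
ENGINES_PER_FLIGHT = 1

tb_per_flight = TB_PER_HALF_HOUR * (AVG_FLIGHT_MINUTES / 30) * ENGINES_PER_FLIGHT
tb_per_day = tb_per_flight * FLIGHTS_PER_DAY
pb_per_day = tb_per_day / 1000  # using 1 PB = 1000 TB
```

Under these assumptions each flight produces 40 TB, and 5000 flights produce 200,000 TB, or 200 PB, per day.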
1.5 Big Data Characteristics
Volume:
The name Big Data itself is related to an enormous size. Big Data is a vast ‘volume’ of
data generated from many sources daily, such as business processes, machines,
social media platforms, networks, human interactions, and many more.
Variety:
Big Data can be structured, unstructured, or semi-structured, and it is
collected from many different sources. In the past, data was collected only
from databases and spreadsheets, but today it arrives in a wide array of
forms such as PDFs, emails, audio, social media posts, photos, videos, etc.
Veracity
Veracity refers to how reliable the data is. Because data arrives from many sources, it
often has to be filtered or translated before it can be handled and managed efficiently.
Reliable data is also essential in business development.
Value
Value is an essential characteristic of big data. Merely processing or storing data does
not create value; the data that we store, process, and analyze must be valuable and reliable.
Velocity
Velocity plays an important role compared to the other characteristics. Velocity is the speed at
which data is created, often in real time. It covers the speed of incoming data sets,
the rate of change, and bursts of activity. A primary aim of Big Data systems is to
make data available rapidly when it is demanded.
Big data velocity deals with the speed at which data flows from sources like application
logs, business processes, networks, and social media sites, sensors, mobile
devices, etc.
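Velocity can be made concrete by measuring how many events arrive within a recent time window. The sketch below is one simple way to do this; the `RateMeter` class, the 10-second window, and the timestamps are hypothetical.

```python
from collections import deque


class RateMeter:
    """Counts the events that arrived in the last `window` seconds."""

    def __init__(self, window):
        self.window = window
        self.times = deque()

    def record(self, t):
        """Register an event at time `t` (seconds) and return the current count."""
        self.times.append(t)
        # Drop events that fall outside the interval (t - window, t].
        while self.times and self.times[0] <= t - self.window:
            self.times.popleft()
        return len(self.times)


meter = RateMeter(window=10)
# A slow trickle of events followed by a burst of activity.
timestamps = [0, 1, 2, 3, 11, 11, 11, 12, 12, 12]
counts = [meter.record(t) for t in timestamps]
events_per_second = counts[-1] / meter.window
```

The count drops when old events leave the window and climbs again during the burst, which is the kind of rate change and activity burst that velocity describes.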
1.6 Why Big Data?
Big Data initiatives were rated as “extremely important” by 93% of companies.
Leveraging a Big Data analytics solution helps organizations to unlock strategic
value and take full advantage of their assets.
It helps organizations to:
Understand where, when, and why their customers buy
Protect the company’s client base with improved loyalty programs
Seize cross-selling and upselling opportunities
Provide targeted promotional information
Optimize workforce planning and operations
Reduce inefficiencies in the company’s supply chain
Predict market trends
Predict future needs
Become more innovative and competitive
Discover new sources of revenue
Companies are using Big Data to learn what their customers want, who their best
customers are, and why people choose different products. The more a company knows about
its customers, the more competitive it becomes.
Big Data can be used with Machine Learning to create market strategies based on
predictions about customers. Leveraging big data makes companies customer-centric.
Companies can use historical and real-time data to assess consumers’ evolving
preferences. This enables businesses to improve and update their
marketing strategies, which makes them more responsive to customer needs.
Importance of big data
The importance of Big Data does not revolve around the amount of data a company has. Its
importance lies in how the company utilizes the gathered data.
Every company uses its collected data in its own way. The more effectively a company
uses its data, the more rapidly it grows.
Companies in the present market need to collect and analyze data for the following reasons:
1. Cost Savings
Big Data tools like Apache Hadoop, Spark, etc. bring cost-saving benefits to
businesses when they have to store large amounts of data. These tools help
organizations in identifying more effective ways of doing business.
2. Time-Saving
Real-time in-memory analytics helps companies to collect data from various sources.
Tools like Hadoop help them to analyze data immediately, which supports quick
decisions based on what is learned.
3. Understand the market conditions
Big Data analysis helps businesses to get a better understanding of market situations.
For example, analysis of customer purchasing behavior helps companies to identify
the best-selling products and to produce those products accordingly. This helps
companies to get ahead of their competitors.
4. Social Media Listening
Companies can perform sentiment analysis using Big Data tools. These enable them
to get feedback about their company, that is, who is saying what about the company.
Companies can use big data tools to improve their online presence.
5. Boost Customer Acquisition and Retention
Customers are a vital asset on which any business depends. No business
can succeed without building a robust customer base. But even with a solid
customer base, companies can’t ignore the competition in the market.
If a company doesn’t know what its customers want, its success will suffer. The
result is a loss of clientele, which has an adverse effect on business growth.
Big data analytics helps businesses to identify customer-related trends and patterns.
Customer behavior analysis leads to a profitable business.
6. Solve Advertisers’ Problems and Offer Marketing Insights
Big data analytics shapes all business operations. It enables companies to fulfill
customer expectations. Big data analytics helps in changing the company’s product
line. It ensures powerful marketing campaigns.
7. Drive Innovation and Product Development
Big data enables companies to innovate and redevelop their products.
1.7 Challenges of Big Data
When implementing a big data solution, here are some of the common challenges
your business might run into, along with solutions.
1. Managing massive amounts of data
It's in the name—big data is big. Most companies are increasing the amount of
data they collect daily. Eventually, the storage capacity a traditional data center
can provide will be inadequate, which worries many business leaders. Forty-three
percent of IT decision-makers in the technology sector worry about this data influx
overwhelming their infrastructure [2].
To handle this challenge, companies are migrating their IT infrastructure to the
cloud. Cloud storage solutions can scale dynamically as more storage is
needed. Big data software is designed to store large volumes of data that can
be accessed and queried quickly.
2. Integrating data from multiple sources
The data itself presents another challenge to businesses. There is a lot, but it is
also diverse because it can come from a variety of different sources. A business
could have analytics data from multiple websites, sharing data from social media,
user information from CRM software, email data, and more. None of this data is
structured the same but may have to be integrated and reconciled to gather
necessary insights and create reports.
To deal with this challenge, businesses use data integration software, ETL
software, and business intelligence software to map disparate data sources
into a common structure and combine them so they can generate accurate
reports.
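The idea of mapping disparate sources into a common structure can be sketched in a few lines. The two source formats, their field names, and the target structure below are hypothetical; real ETL and data integration tools handle far more cases than this.

```python
# Hypothetical exports from two systems describing the same customers
# with different field names and conventions.
crm_rows = [{"CustomerID": "C1", "FullName": "Ravi Kumar", "Email": "RAVI@EXAMPLE.COM"}]
web_rows = [{"user": "c1", "page_views": 14}, {"user": "c2", "page_views": 3}]


def integrate(crm, web):
    """Map both sources onto one common record structure keyed by customer id."""
    merged = {}
    for row in crm:
        key = row["CustomerID"].lower()  # reconcile the id format
        merged[key] = {
            "customer_id": key,
            "name": row["FullName"],
            "email": row["Email"].lower(),
            "page_views": 0,
        }
    for row in web:
        key = row["user"].lower()
        # Customers seen only on the web get a record with unknown name and email.
        record = merged.setdefault(
            key, {"customer_id": key, "name": None, "email": None, "page_views": 0}
        )
        record["page_views"] += row["page_views"]
    return [merged[k] for k in sorted(merged)]


customers = integrate(crm_rows, web_rows)
```

After integration every customer has the same four fields, so a single report can be generated across both sources.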
3. Ensuring data quality
Analytics and machine learning processes that depend on big data to run also
depend on clean, accurate data to generate valid insights and predictions. If the
data is corrupted or incomplete, the results may not be what you expect. But as
the sources, types, and quantity of data increase, it can be hard to determine if
the data has the quality you need for accurate insights.
Fortunately, there are solutions for this. Data governance applications will help
organize, manage, and secure the data you use in your big data projects while
also validating data sources against what you expect them to be and cleaning up
corrupted and incomplete data sets. Data quality software can also be used
specifically for the task of validating and cleaning your data before it is processed.
4. Keeping data secure
Many companies handle data that is sensitive, such as:
Company data that competitors could use to take a bigger market share
of the industry
Financial data that could give hackers access to accounts
Personal user information of customers that could be used for identity
theft
If a business handles sensitive data, it will become a target of hackers. To protect
this data from attack, businesses often hire cybersecurity professionals who keep
up to date on security best practices and techniques to secure their systems.
Whether you hire a consultant or keep it in-house, you need to ensure that data
is encrypted, so the data is useless without an encryption key. Add identity and
access authorization control to all resources so only the intended users can
access it. Implement endpoint protection software so malware can't infect the
system and real-time monitoring to stop threats immediately if they are detected.
5. Selecting the right big data tools
Fortunately, when a business decides to start working with data, there is no
shortage of tools to help them do it. At the same time, the wealth of options is also
a challenge. Big data software comes in many varieties, and their capabilities
often overlap. How do you make sure you are choosing the right big data tools?
Often, the best option is to hire a consultant who can determine which tools will
fit best with what your business wants to do with big data. A big data professional
can look at your current and future needs and choose an enterprise data
streaming or ETL solution that will collect data from all your data sources and
aggregate it. They can configure your cloud services and scale dynamically based
on workloads. Once your system is set up with big data tools that fit your needs,
the system will run seamlessly with very little maintenance.
6. Scaling systems and costs efficiently
If you start building a big data solution without a well-thought-out plan, you can
spend a lot of money storing and processing data that is either useless or not
exactly what your business needs. Big data is big, but it doesn't mean you have
to process all of your data.
When your business begins a data project, start with goals in mind and strategies
for how you will use the data you have available to reach those goals. The team
involved in implementing a solution needs to plan the type of data they need and
the schemas they will use before they start building the system so the project
doesn't go in the wrong direction. They also need to create policies for purging
old data from the system once it is no longer useful.
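A purging policy of this kind can be expressed as a simple retention rule. The sketch below assumes a hypothetical one-year retention period and hypothetical records with a `created` date; an actual policy would depend on business and legal requirements.

```python
from datetime import date, timedelta

# Hypothetical policy: keep records for 365 days after creation.
RETENTION = timedelta(days=365)


def purge_old(records, today):
    """Return the records still within retention and the number purged."""
    cutoff = today - RETENTION
    kept = [r for r in records if r["created"] >= cutoff]
    return kept, len(records) - len(kept)


records = [
    {"id": 1, "created": date(2022, 1, 10)},
    {"id": 2, "created": date(2023, 9, 1)},
    {"id": 3, "created": date(2024, 2, 20)},
]
kept, purged = purge_old(records, today=date(2024, 3, 1))
kept_ids = [r["id"] for r in kept]
```

Only the record created more than a year before the reference date is removed; the other two remain available for analysis.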
7. Lack of skilled data professionals
One of the big data problems that many companies run into is that their current
staff have never worked with big data before, and this is not the type of skill set
you build overnight. Working with untrained personnel can result in dead ends,
disruptions of workflow, and errors in processing.
There are a few ways to solve this problem. One is to hire a big data
specialist and have that specialist manage and train your data team until they
are up to speed. The specialist can either be hired on as a full-time employee or
as a consultant who trains your team and moves on, depending on your budget.
Another option, if you have time to prepare ahead, is to offer training to your
current team members so they will have the skills once your big data project is in
motion.
A third option is to choose one of the self-service analytics or business
intelligence solutions that are designed to be used by professionals who don't
have a data science background.
8. Organizational resistance
Another way people can be a challenge to a data project is when they resist
change. The bigger an organization is, the more resistant it is to change. Leaders
may not see the value in big data, analytics, or machine learning. Or they may
simply not want to spend the time and money on a new project.
This can be a hard challenge to tackle, but it can be done. You can start with a
smaller project and a small team and let the results of that project prove the value
of big data to other leaders and gradually become a data-driven business. Another
option is placing big data experts in leadership roles so they can guide your
business towards transformation.
1.8 What is Business Intelligence?
Business Intelligence (BI) is a set of processes, architectures, and technologies that
convert raw data into meaningful information that drives profitable business actions. It
is a suite of software and services to transform data into actionable intelligence and
knowledge.
BI has a direct impact on an organization’s strategic, tactical, and operational business
decisions.
BI supports fact-based decision making using historical data rather than assumptions
and gut feeling.
BI tools perform data analysis and create reports, summaries, dashboards, maps,
graphs, and charts to provide users with detailed intelligence about the nature of the
business.
Why is BI important?
Measurement: creating KPIs (Key Performance Indicators) based on historical
data.
Benchmarking: identifying and setting benchmarks for varied processes.
With BI systems, organizations can identify market trends and spot business
problems that need to be addressed.
BI helps with data visualization, which enhances data quality and thereby the
quality of decision making.
BI systems can be used not just by large enterprises but also by SMEs (Small and Medium
Enterprises).
How are Business Intelligence systems implemented?
Here are the steps:
Step 1) Raw data is extracted from corporate databases. The data could be spread
across multiple heterogeneous systems.
Step 2) The data is cleaned, transformed, and loaded into the data warehouse. Tables can
be linked, and data cubes are formed.
Step 3) Using the BI system, the user can ask queries, request ad-hoc reports, or conduct
any other analysis.
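The three steps can be sketched end to end with Python's built-in `sqlite3` module standing in for the data warehouse. The source rows, the `sales_fact` table, and the cleaning rules are hypothetical; a production BI system would use dedicated ETL and warehouse tools.

```python
import sqlite3

# Step 1: raw data extracted from two hypothetical source systems.
# Regions are spelled inconsistently and one amount is missing.
store_sales = [("2024-01-05", "north", "120.50"), ("2024-01-06", "North ", "80.00")]
online_sales = [("2024-01-05", "SOUTH", "200.00"), ("2024-01-07", "south", None)]


def clean(rows, channel):
    """Step 2a: drop incomplete rows and normalise region names and amounts."""
    for day, region, amount in rows:
        if amount is None:
            continue  # incomplete record, not loaded
        yield (day, region.strip().lower(), channel, float(amount))


# Step 2b: load the cleaned rows into one warehouse table.
warehouse = sqlite3.connect(":memory:")
warehouse.execute(
    "CREATE TABLE sales_fact (day TEXT, region TEXT, channel TEXT, amount REAL)"
)
warehouse.executemany(
    "INSERT INTO sales_fact VALUES (?, ?, ?, ?)",
    list(clean(store_sales, "store")) + list(clean(online_sales, "online")),
)

# Step 3: the user asks a query, here total sales per region.
report = warehouse.execute(
    "SELECT region, SUM(amount) FROM sales_fact GROUP BY region ORDER BY region"
).fetchall()
warehouse.close()
```

Once the data from both systems sits in one cleaned table, an ad-hoc report such as sales per region becomes a single query.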
Advantages of Business Intelligence
Here are some of the advantages of using Business Intelligence System:
1. Boost productivity
With a BI program, businesses can create reports with a single click,
which saves a lot of time and resources. It also allows employees to be more productive
in their tasks.
2. Improve visibility
BI also helps to improve the visibility of business processes and makes it possible to
identify any areas which need attention.
3. Fix Accountability
A BI system helps assign accountability in the organization, as someone must
take ownership of the organization’s performance against
its set goals.
4. It gives a bird’s eye view:
A BI system gives decision makers an overall bird’s eye view of the organization
through typical BI features like dashboards and scorecards.
5. It streamlines business processes:
BI reduces the complexity associated with business processes. It also automates
analytics by offering predictive analysis, computer modeling, benchmarking and other
methodologies.
6. It allows for easy analytics.
BI software has democratized its usage, allowing even nontechnical or non-analyst
users to collect and process data quickly. This puts the power of
analytics into the hands of many people.
BI System Disadvantages
1. Cost:
Business intelligence can prove costly for small as well as for medium-sized
enterprises. The use of such type of system may be expensive for routine business
transactions.
2. Complexity:
Another drawback of BI is the complexity of implementing a data warehouse. It can
be so complex that it makes business processes rigid and hard to deal with.
3. Limited use
Like many new technologies, BI was first established with the
buying power of rich firms in mind. Therefore, BI systems are still not affordable for many
small and medium-sized companies.
4. Time-Consuming Implementation
It can take almost one and a half years for a data warehousing system to be completely
implemented. Therefore, it is a time-consuming process.