
Week 1: Data Analytics – An Overview

Data Analytics
– An Overview

Contributor

Prof. Chandan Chakraborty


Professor
Department of Computer Science & Engineering


Learning Outcomes

On successful completion of this week, the learners will be able to

 Explore the concepts of data science and data analytics and the relationship between them.
 Develop an understanding of data, data types, and their various sources, with examples.
 Explore the data analytics workflow including data collection, methods of storing
data, data cleansing, data visualization and data security.
 Develop an insight into the types of data analytics as well as data modeling
approaches.

Contents:

1. Introduction to Data Analytics


2. Types of Data
3. Data Analytics workflow
a. Collection and Storing
b. Data Cleansing and Transformation
c. Data Visualization
d. Data Modelling
e. Data Security
4. Types of Data Analytics


1. Introduction to Data Analytics

The world has witnessed a digital revolution in almost all fields. The advent of information
technology has generated large volumes of data in various formats such as records, files,
documents, images, sound, videos, scientific data and many new data formats. Notably, the
volume of data stored in the world was projected to exceed 40 zettabytes (10^21 bytes) by
2020, as published by Hammad et al. (2015).

Fig. Data volume increasing trend in zettabytes across the globe [source: Hammad et al., 2015]

Due to the increased volume, velocity and variety of the data, such datasets are very often
referred to as 'big data'. Like other fields, a large amount of data is generated day by day in
educational settings through the increased use of e-learning resources, learning management
systems, student information systems, laboratory experiments, and other academic and
administrative resources. All these data collected from different sources require a proper
analytical framework in order to extract knowledge for strategic planning and better decision
making. Data Analytics, being the most impactful analytical platform, plays a crucial role in the
visualization, summarization and modelling of data for prediction and inference.

Data analytics refers to the science of exploring, analyzing and visualizing data from both
internal and external sources in order to draw inferences and make predictions that enable
innovation, provide competitive business advantage, and ultimately support strategic
decision-making. In fact, it has become an interdisciplinary research field that has adopted
aspects from many other scientific disciplines such as computer science, statistics, pattern
recognition, computational intelligence, and machine learning.

2. Types of Data

(a) According to scales and measurements

Data generated at various sources can be broadly categorized into two segments viz.,
quantitative vs. qualitative data.

Quantitative data are represented by numerical values, which makes them easy to describe
precisely and accurately. Such data are readily amenable to statistical manipulation and can be
represented by a wide variety of graphs and charts such as line graphs, bar graphs and scatter
plots. A few examples are cited as follows:

 Scores on class tests, e.g. 85, 67, 90.
 Room temperature of an ICU.
 Height and weight of a person.

Again, there are two general types of quantitative data: discrete and continuous. Discrete data
take distinct, countable values, whereas continuous data can be meaningfully divided into finer
levels; continuous data are measured on a scale or continuum and can take almost any numeric
value. Both discrete and continuous quantitative data are measurable at two scales – interval
and ratio scales.

An interval scale is one where there is order and the difference between two values is
meaningful. Examples include temperature (Fahrenheit), temperature (Celsius), pH, SAT
score (200-800), excellent grade (90-100), credit score (300-850), etc. A variable measured on a
ratio scale has all the properties of an interval scale and, in addition, has a true zero. For
example, enzyme activity, dose amount, reaction rate, flow rate, concentration, pulse, weight,
length and temperature in Kelvin (0.0 Kelvin really does mean "no heat") are measured on a
ratio scale. The following figure shows the classification of data types as per measurement scales.


Fig. Types of data and their measurement scales: quantitative data (discrete or continuous, each
measurable on interval or ratio scales) and qualitative data (measurable on nominal or ordinal
scales).

On the other hand, qualitative data consist of values that can be placed in nonnumeric
categories. Such data can be measured on nominal and ordinal scales. A nominal scale
provides only categories without any ordering or direction, whereas an ordinal scale measures
the ranking or ordering of values.
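As an illustration, the nominal versus ordinal distinction can be made explicit in software. The
following is a minimal sketch using Python's pandas library; the column names and values are
illustrative only.

import pandas as pd

# Illustrative records: a ratio-scale column (weight_kg), an interval-scale
# column (temp_celsius), a nominal column (blood_group) and an ordinal column (grade).
df = pd.DataFrame({
    "weight_kg": [61.5, 72.0, 58.3],           # ratio: true zero, ratios meaningful
    "temp_celsius": [36.6, 37.2, 38.1],        # interval: differences meaningful, no true zero
    "blood_group": ["A", "O", "B"],            # nominal: categories without order
    "grade": ["good", "excellent", "fair"],    # ordinal: categories with a ranking
})

# Nominal scale: unordered categorical.
df["blood_group"] = pd.Categorical(df["blood_group"])

# Ordinal scale: ordered categorical, so comparisons such as grade > "good" are meaningful.
df["grade"] = pd.Categorical(
    df["grade"], categories=["fair", "good", "excellent"], ordered=True
)

print(df.dtypes)
print(df["grade"] > "good")   # valid only because the scale is ordinal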

(b) According to structure of data

The advent of information technology has created huge repositories of digital data. According to
its various formats, this digital data can be broadly classified into structured, semi-structured and
unstructured data.

Fig. Categories of digital data: unstructured, semi-structured and structured data

i) Unstructured Data:

Unstructured data refer to data that do not conform to any pre-defined data model. They might
have some internal structure, but they are not organized via pre-defined data models. Put another
way, data with some structure may still be labelled unstructured if that structure does not help
the processing task at hand. Unstructured data may be textual or non-textual information and
may be either human-generated or machine-generated. Various sources of human- and
machine-generated information are given below as examples.


Table: Examples of human- and machine-generated unstructured data

Human-generated unstructured data:
 Text files: word processing documents, spreadsheets, presentations, emails, logs.
 Social media: Facebook, Twitter, WhatsApp, LinkedIn data.
 Websites: YouTube, Instagram.
 Mobile data: SMS, text messages, chats.
 Communications: chat, phone recordings, collaboration software.
 Media: MP3, audio and video files.
 Email: body of the mail.
 MOODLE: chat forums for discussions.

Machine-generated unstructured data:
 Satellite imagery: weather data, landforms, military movements.
 Scientific data: oil and gas exploration, space exploration, seismic imagery, atmospheric data.
 Digital surveillance: surveillance photos and video.
 Sensor data: traffic, weather, oceanographic sensors.

In today's scenario, the majority of enterprise data is unstructured, for a variety of reasons such
as easy multimedia communication and easy data storage and transfer. Such data cannot be
ignored, given its large and growing volume. In order to extract meaningful information, these
data need to be analyzed with the following analytical techniques.

Fig. Handling unstructured data: data mining, natural language processing, text analytics, and
noisy text analytics


Data Mining – It refers to the process of discovering interesting and meaningful patterns and
relationships from large volumes of data. It is also known as ‘knowledge discovery in data’
(KDD). Data mining has emerged as an important multi-disciplinary area integrating tools and
techniques from different disciplines such as artificial intelligence, machine learning, statistics
and probability, database management, data visualization etc.

Fig. Various disciplines integrated with Data Mining

From functionality point of view, it can be divided into two general components viz., descriptive
data mining and predictive data mining.

Example - Data mining is applied to a variety of real applications in areas such as defense,
biology, healthcare, industry, banking and education. In fact, educational data mining (EDM) has
emerged as an important analytics area that explores different data mining techniques to
analyze students' data in educational environments in order to improve performance.

Natural Language Processing – It is a branch of artificial intelligence (AI) that enables
computers to understand text and spoken words expressed in human (natural) language. It
integrates disciplines such as statistics, machine learning and deep learning.


Fig. Computational dimensions of natural language processing: statistics, machine learning and
deep learning

Example in education – With advances in AI, natural language processing has been contributing
significantly to improving the scientific process of teaching and learning. It also assists in
adopting advanced technologies for improving the educational system. For example, NLP is
applied in e-learning, where it assists in producing educational material. In the context of
teaching and learning, there are numerous electronic and online sources available in English
that give students and teachers access to e-materials. With such a large number of online
resource materials available, it becomes very important to verify the reliability of the resources.
This requires intelligent automatic processing to prevent the use of unreliable resources and to
promote the use of authentic ones. Applications of NLP in education also include mining,
information retrieval, quality assessment and academic performance improvement.

Text Analytics – Text analytics, or text mining, is the process of extracting high-quality,
meaningful information from text on the basis of patterns and trends, typically via statistical
pattern learning. It performs various tasks such as text categorization, text clustering, sentiment
analysis and concept analysis.

Example – Text analytics plays a crucial role in deriving useful information from the opinions
received from students, supporting different educational processes and decision-making
strategies. It helps to determine the qualities that undergraduates consider essential when
evaluating the teaching-learning process and the performance of teachers. There is an
unprecedented increase in the amount of text-based data generated by different activities within
educational processes, which can be leveraged to provide useful strategic intelligence and
improvement insights. Educators can apply the resulting methods and technologies, process
innovations and context-based information for better support and monitoring of the teaching-
learning process and decision making.

Noisy Text Analytics – This is the process of extracting structured and unstructured information
from noisy unstructured data such as chats, blogs, wikis, emails and text messages. Noisy
unstructured data usually comprise one or more of the following: spelling mistakes,
abbreviations, acronyms, non-standard words, missing punctuation, missing letter case and filler
words such as "uh" and "um".

Example – In academic institutions, students' feedback is usually collected through a
questionnaire at the end of the session in order to improve the quality of the instructors'
teaching. The responses include numerical answers to Likert-scale questions and textual
comments to open-ended questions. The textual comments are essentially suggestions about
how the instructor can further enhance the student learning experience. These suggestions
typically contain spelling mistakes, non-standard words, incomplete phrases, etc. It is a tough
task to manually go through all the textual (qualitative) comments and extract the suggestions.
Such noisy texts containing meaningful information are usually modelled by analytical methods
like rule-based methods and statistical classifiers for extracting and summarizing the explicit
suggestions.
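A minimal, hypothetical sketch of such a rule-based cleanup in Python is shown below; the
abbreviation dictionary and filler-word list are illustrative only, and a real system would combine
rules like these with statistical classifiers to extract the explicit suggestions.

import re

# A hypothetical rule-based cleanup of a noisy student comment.
ABBREVIATIONS = {"plz": "please", "u": "you", "lect": "lecture"}
FILLERS = {"uh", "um", "hmm"}

def clean_comment(text):
    tokens = re.findall(r"[a-z']+", text.lower())       # keep word-like tokens, drop punctuation
    tokens = [ABBREVIATIONS.get(t, t) for t in tokens]  # expand non-standard words
    tokens = [t for t in tokens if t not in FILLERS]    # remove filler words
    return " ".join(tokens)

print(clean_comment("Um, plz upload the lect slides soon!!"))
# -> 'please upload the lecture slides soon'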

ii) Semi-structured Data:
Semi-structured data is information that does not reside in a relational database but has
some organizational properties that make it easier to analyze. It is more flexible than
structured data but less flexible than unstructured data. Examples include email, XML and
other markup languages. It is also referred to as having a self-describing structure.
Semi-structured data falls in the middle between structured and unstructured data. It contains
certain aspects that are structured, and others that are not. For example, X-rays and other
large images consist largely of unstructured data – in this case, a great many pixels. It is
impossible to search and query these X-rays in the same way that a large relational database
can be searched, queried and analyzed. After all, all you are searching against are pixels
within an image. Fortunately, there is a way around this. Although the files themselves may
consist of no more than pixels, words or objects, most files include a small section known as
metadata.


iii) Structured Data:

Structured data refer to data that conform to a pre-defined schema or structure. These data are
highly organized and usually reside in relational databases. In terms of formats, examples of
such data include XML, Excel and ORACLE files. In real-life situations, enterprise resource
planning (ERP) data, transactional data, students' mark sheets and records are further
examples of structured data.

c) According to the sources of data

After defining the problem and its associated variables, the next most important stage is data
collection, where data come from different sources. Depending upon their availability, data can
be divided into the following:

 Primary Data
 Secondary Data

If the data are collected by the decision maker directly through a survey, interview or
experiment, they are called primary data. On the other hand, data collected from existing
sources (where somebody has already collected and stored them in some form of repository,
such as an MS file, database or cloud) are called secondary data.

3. Data Analytics Workflow

It is well established that data received from a variety of sources in different formats give
rise to enormously large datasets. Once the voluminous data have been collected as per the
defined problem, data analytics tools and techniques need to be explored in order to extract
meaningful information and support interpretation. The data analytics framework includes
several important steps that are executed systematically; the analytics flow is presented in the
diagram below. Most importantly, it involves transforming and inspecting data to infer the
inherent meaningful information.


The major steps involved in the above workflow are as follows:

(a) At first, the problem of interest is identified for solving. In fact, the problem statement will
guide about what types of data to gather and the important features that represent the
possible solutions. A lot of domain expertise is needed in data analytics, and a problem
space where expertise is accessible is almost mandatory.

(b) Once the problem is identified, appropriate data needs to be collected. The collected
data needs to be represented in a format that optimizes on space without losing
resolution in information. Enterprises now need to be aware of compliance and security.
Access to the data might need to be restricted to authorized personnel and data could
be confidential in some cases.

(c) The stored data needs to be cleansed for data preprocessing. Cleansing involves
removal of outliers, missing values, and bad records. The result of the data analysis
depends much on cleansing of the data. Data that is not cleansed might lead to skewed
analysis.

(d) The cleansed data is transformed into a representation that can be used for analysis.
One example of transformation is normalization of the data to a range between 0 and 1
(see the sketch after this list). Another example is changing the scale of the data to ease
computation.

(e) The transformed data is then analyzed using statistical and machine learning algorithms.
Statistical techniques are used for summarization, visualization and predictive modelling,
while machine learning (ML) algorithms refer to a class of learning algorithms mainly used
for clustering and classification of data for meaningful predictions.

(f) Validation step is also very important in order to validate or cross-validate the obtained
results in concurrence with the experts’ domain knowledge or a set of test users. Post-
validation, if the results can be used to make meaningful decisions, it ends the process
of analysis. Otherwise, the data scientist and associates get back to the drawing board
and tweak the parameters in the pipeline.

(g) The validated results are visually represented for the stakeholders (which could include
users) for easy understanding with various data visualization techniques.

(h) Finally, the validated results are used for decision making.
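As a small illustration of step (d), the following Python sketch applies min-max normalization to
an illustrative feature column; the values are made up.

import numpy as np

# Min-max normalization to the range [0, 1].
# In practice, the minimum and maximum learned from the training data
# would be reused for any new data.
scores = np.array([45.0, 67.0, 85.0, 90.0, 52.0])

lo, hi = scores.min(), scores.max()
normalized = (scores - lo) / (hi - lo)

print(normalized)   # all values now lie between 0 and 1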

Amongst the above, the following major components of data analytics are described briefly:

(a) Data Collection and Storing

Data collection refers to the acquisition of data from various sources in order to address the
problems defined by the decision maker. Broadly, there are two ways of collecting data:

 Quantitative Data Collection
 Qualitative Data Collection

Quantitative data collection methods are driven by (statistical) random sampling techniques, the
most popular of which are described below:

 Simple Random Sampling – A method by which a subset of data is drawn randomly from a
large data repository. The figure illustrates the random selection of 5 individuals out of 15,
each with an equal chance of inclusion.

 Stratified Random Sampling – A method by which the large dataset is first stratified into a set
of homogeneous groups and a sample is then drawn randomly from each group. Here, the 15
individuals are first divided into four homogeneous groups and individuals are then selected
from each group, so that all groups are represented and heterogeneity is attained in the final
sample.
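The two sampling schemes can be sketched in Python with pandas; the roster and group labels
below are illustrative.

import pandas as pd

# Illustrative roster of 15 individuals split across four groups (strata).
roster = pd.DataFrame({
    "student_id": range(1, 16),
    "group": ["A"] * 4 + ["B"] * 4 + ["C"] * 4 + ["D"] * 3,
})

# Simple random sampling: 5 individuals drawn with an equal chance of inclusion.
simple_sample = roster.sample(n=5, random_state=42)

# Stratified random sampling: draw the same fraction from every group so that
# all groups are represented in the final sample.
stratified_sample = (
    roster.groupby("group", group_keys=False).sample(frac=1 / 3, random_state=42)
)

print(simple_sample)
print(stratified_sample)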

In addition, there are many other methods for information gathering as follows:

o Questionnaire – Online (like Google Forms) and offline (paper-based).
o Interviews – Face to face; telephonic; email; chat/messaging.
o Direct observations – Experimental data (manual and digital).
o Documents / literature.
o Case studies, etc.

Data Storage

Data collected in the form of characters, images, sound or some other form are initially saved on
a computer and then converted into an appropriate format so that the digital data can be stored
and processed in a computing environment. Different digital data formats are given in the table:

DATA TYPE                        DIGITAL FORMATS
Text, Documentation, Scripts     XML, PDF/A, HTML, Plain Text
Still Image                      TIFF, JPEG 2000, PNG, JPEG/JFIF, DNG (digital negative), BMP, GIF
Geospatial                       Shapefile (SHP, DBF, SHX), GeoTIFF, NetCDF
Audio                            WAVE, AIFF, MP3, MXF, FLAC
Video                            MOV, MPEG-4, AVI, MXF
Database                         XML, CSV, TAB

Data in various digital formats are stored on the computer using different data storage methods,
the most important features of which are described below.

 File Storage – a data storage method following a hierarchical structure that enables fast
navigation as well as storage of complex files; however, it offers limited space scalability.
 Block Storage – data are first fragmented and stored in blocks, permitting them to be
freely configured and quickly recovered. Data management in this system, however,
requires more complex operations.


 Object Storage – data are stored in the form of objects, each having a unique identifier and
a rich metadata package (creation date, author, etc.). This solution provides cost savings
and security of stored data; however, it prevents free modification of the objects.

In addition, the three methods of data storage are also compared as follows:

File Storage – Amount of data: lower; Data location: LAN; Scalability of space: limited;
Typical applications: files on private computers.

Block Storage – Amount of data: large; Data location: beyond LAN; Scalability of space:
unlimited; Typical applications: extensive databases, business applications, virtual machines.

Object Storage – Amount of data: large; Data location: beyond LAN; Scalability of space:
unlimited; Typical applications: static data repositories, photos and videos, archives.

(b) Data Cleansing and Transformation

While data are being collected and stored for processing and analysis towards decision making,
it is very important to preprocess them first so that noise and outliers do not mislead the
decisions. Data cleansing, or data cleaning, refers to the process of detecting and correcting (or
removing) corrupt or inaccurate records from a data repository. It involves identifying
incomplete, incorrect, inaccurate or irrelevant parts of the data and then replacing, modifying or
deleting the dirty or coarse data. Moreover, improving data quality can eliminate problems such
as expensive processing errors, manual troubleshooting and incorrect invoices.

For example, Excel and Google Sheets spreadsheets, being the most common data-storing
tools, very often include misspelled words, stubborn trailing spaces, unwanted prefixes,
improper cases and nonprinting characters, all of which make a bad first impression. In that
case, the following basic steps may be followed in order to clean the data and improve its
quality:

 Import the data from an external data source.
 Create a backup copy of the original data in a separate workbook.


 Ensure that the data is in a tabular format of rows and columns with: similar data in each
column, all columns and rows visible, and no blank rows within the range. For best
results, use an Excel table.
 Do tasks that don't require column manipulation first, such as spell-checking or using
the Find and Replace dialog box.
 Next, do tasks that do require column manipulation. The general steps for manipulating
a column are:
o Insert a new column (B) next to the original column (A) that needs cleaning.
o Add a formula that will transform the data at the top of the new column (B).
o Fill down the formula in the new column (B). In an Excel table, a calculated
column is automatically created with values filled down.
o Select the new column (B), copy it, and then paste as values into the new column
(B).
o Remove the original column (A), which converts the new column from B to A.

Spell checking - A spell checker and grammar checker can be used not only to find misspelled
words, but also to find values that are not used consistently, such as product or company
names, by adding those values to a custom dictionary.

Removing duplicates - Duplicate rows are a common problem when you import data. It is
a good idea to filter for unique values first to confirm that the results are what you want before
you remove duplicate values.

Fixing numbers and number signs - There are two main issues with numbers that may require
you to clean the data: the number was inadvertently imported as text, and the negative sign
needs to be changed to the standard for your organization.

Fixing dates and timings - Because there are so many different date formats, and because
these formats may be confused with numbered part codes or other strings that contain slash
marks or hyphens, dates and times often need to be converted and reformatted.
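The same kind of cleansing can also be scripted. The following is a minimal Python/pandas
sketch on a hypothetical spreadsheet export, mirroring the steps above: trimming stray spaces,
fixing case, coercing numbers and dates that were imported as text, and removing duplicates.

import pandas as pd

# A hypothetical spreadsheet export with trailing spaces, inconsistent case,
# numbers and dates imported as text, and duplicate rows.
raw = pd.DataFrame({
    "product": [" Laptop", "laptop ", "Tablet", "Tablet"],
    "price": ["1200", "1200", "450", "450"],
    "sold_on": ["2023/01/05", "2023/01/05", "2023/01/07", "2023/01/07"],
})

clean = raw.copy()
clean["product"] = clean["product"].str.strip().str.title()          # trim spaces, fix improper case
clean["price"] = pd.to_numeric(clean["price"], errors="coerce")      # numbers imported as text
clean["sold_on"] = pd.to_datetime(clean["sold_on"], errors="coerce") # reformat date strings
clean = clean.drop_duplicates()                                      # remove duplicate rows

print(clean)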

(c) Data Visualization


While raw data are being gathered and a data repository developed, it is always desirable to
present these data in a more scientific way. Raw data do not provide much insight unless they
are processed and presented. The way data are presented has a huge impact on meaningful
analysis and interpretation. The human brain retains information contained in visuals better than
information conveyed by written or spoken words, and herein lies the importance of data
visualization.

Data visualization refers to the use of computer graphics to create visual images which aid in
the understanding of complex, often massive representations of data. There are majorly three
types of data visualization as follows:

 Scientific Visualization
 Information Visualization
 Visual Analytics

Scientific visualization mainly deals with structured data like seismic, medical, high-throughput
data etc. In this case, the digital data generated by the methods and systems are in fact
properly structured in nature and hence these are processed systematically for retrieving /
extracting meaningful information. For example, the following scientific images provide
information related to climate condition from seismic data, brain structure from MRI, and
molecular behavior from NMR spectra.



Fig. Scientific visualization: (a-b) seismic data visualization and its geo-mapping; (c) brain MRI;
(d) NMR for molecular profiling data.
In the case of information visualization, the data have no inherent structure; examples include
news, stock market feeds, Facebook, Twitter and WhatsApp data, which are basically
unstructured in nature.

On the other hand, visual analytics helps to understand and synthesize large amounts of
multimodal data – audio, video, text, images, networks of people, etc.


Statistical Tools for Data Visualization -

Statistics, being at the core of data analytics, plays a crucial role in data visualization as well.
The statistical tools most frequently used to represent data graphically and draw meaningful
interpretations are as follows:

Bar Chart – A pictorial representation of grouped data depicted as rectangular bars, where the
lengths of the bars are proportional to the measure of data in each category. These charts
may be displayed with vertical columns, horizontal bars, comparative bars (multiple bars to
show a comparison between values), or stacked bars (bars containing multiple types of
information). For example, a bar chart can be used to compare the performance of B.Tech.
students across semesters I-VIII.

Pie Chart – A circular graphical chart that reflects the proportions of data as divisions of the
whole circular angle, i.e., 360°. It can be used to represent the relative sizes of a variety of
data. There are many real-life examples of pie charts, such as the representation of marks
obtained by students in a class, or of the types of COVID-19 masks used by people in a month.

Fig. A schematic pie chart indicating favorite subjects, differentiated by colors


Bubble Chart – A kind of scatter plot that represents three dimensions of data. In general, each
point is a triplet (x, y, size), where (x, y) indicates location and the third value determines the
bubble size. It facilitates the understanding of various relationships – social, economic, medical
and other scientific relationships. In the example shown, the average life expectancy, GDP per
capita and population size for more than 100 countries are visualized in a bubble plot: the first
two parameters form the scatter plot, and a bigger bubble indicates a comparatively larger
population.

Fig. Bubble plot showing GDP per capita vs Life Expectancy [source: https://www.data-to-
viz.com/graph/bubble.html]

Scatter Plot – A graphical approach to visualize 2D or 3D data in order to better understand the
trend of the relationship between variables. It can give an impression of a positive relationship,
a negative relationship, or no relationship.


Fig. Scatter plots showing the relationship between variables

Histogram – A statistical graph in order to show distribution of the grouped data where x-axis
and y-axis indicate class interval and frequency respectively.

Box Plot – A graphical display of five statistical parameters in summarized form, depicting the
degree of spread and the symmetry of a data distribution. It includes the minimum, the maximum
and the three quartile values of a dataset. This graph is frequently used for outlier detection and
for comparing two or more populations.
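A minimal Python sketch using the matplotlib library draws four of the chart types described
above from illustrative marks data.

import numpy as np
import matplotlib.pyplot as plt

# Illustrative marks data for a class of 100 students.
rng = np.random.default_rng(0)
marks = rng.normal(70, 10, 100)                 # marks of 100 students
hours = marks / 10 + rng.normal(0, 1, 100)      # study hours, loosely related to marks

fig, axes = plt.subplots(2, 2, figsize=(8, 6))

axes[0, 0].bar(["Sem I", "Sem II", "Sem III"], [68, 72, 75])  # bar chart: average marks per semester
axes[0, 0].set_title("Bar chart")

axes[0, 1].hist(marks, bins=10)                               # histogram: distribution of marks
axes[0, 1].set_title("Histogram")

axes[1, 0].scatter(hours, marks)                              # scatter plot: hours vs. marks
axes[1, 0].set_title("Scatter plot")

axes[1, 1].boxplot(marks)                                     # box plot: quartiles, spread, outliers
axes[1, 1].set_title("Box plot")

plt.tight_layout()
plt.show()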


(d) Data Modelling

Data modeling basically refers to the process of creating a data model of either a whole
information system or parts of it to communicate connections between data points and
structures by applying certain formal techniques. This provides a common, consistent, and
predictable way of defining and managing data resources across an organization, or even
beyond. Ideally, data models are living documents that evolve along with changing business
needs. Data models can generally be divided into three categories, which vary according to their
degree of abstraction. The process will start with a conceptual model, progress to a logical
model and conclude with a physical model. Each type of data model is briefly discussed below:

Conceptual data models – These models are often called domain models and provide a broad
overall picture of what the system will contain, how it will be organized, and which business
rules are involved. Such models are usually derived as part of the process of gathering initial
project requirements. They include the entity classes describing the things that are important for
the business to represent in the data model, their characteristics and constraints, the
relationships between them, and the relevant security and data integrity requirements.

Logical data models – The conceptual model is translated into a logical data model, which
documents structures of the data that can be implemented in databases. Implementation of one
conceptual data model may require multiple logical data models. These models are less
abstract and provide greater detail about the concepts and relationships in the domain under
consideration. Logical data models don’t specify any technical system requirements.

Physical data models - The final step in data modeling is transforming the logical data model
into a physical model that organizes the data into tables and accounts for access, performance
and storage details. As these models are the least abstract, they offer a finalized design that can
be implemented as a relational database, including associative tables that capture the
relationships among entities as well as the primary keys and foreign keys that will be used to
maintain those relationships.
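As an illustration, a physical model for a hypothetical student-course domain can be expressed
directly as relational tables. The following Python sketch uses the built-in sqlite3 module; the
table and column names are assumptions for illustration only.

import sqlite3

# Entities become tables, and an associative table with a composite primary key
# and foreign keys captures the many-to-many enrolment relationship.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE student (
    student_id INTEGER PRIMARY KEY,
    name       TEXT NOT NULL
);
CREATE TABLE course (
    course_id INTEGER PRIMARY KEY,
    title     TEXT NOT NULL
);
CREATE TABLE enrolment (
    student_id INTEGER REFERENCES student(student_id),
    course_id  INTEGER REFERENCES course(course_id),
    grade      TEXT,
    PRIMARY KEY (student_id, course_id)
);
""")
conn.close()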


Fig. Transitional flow of data models: conceptual model (identifying business/program concepts)
→ logical model (defining the data structure and interconnections) → physical model.

(e) Data Security

Data security is the practice of protecting digital information from unauthorized access,
corruption, or theft throughout its entire lifecycle. In fact, it encompasses every aspect of
information security from the physical security of hardware and storage devices to
administrative and access controls, as well as the logical security of software
applications. There are various types of data security, discussed below:

Encryption – A method of transforming normal text into an unreadable format, called ciphertext,
so that only authorized users can read it. File and database encryption solutions serve as a final
line of defense for sensitive volumes by obscuring their contents through encryption or
tokenization. Most solutions also include security key management capabilities.
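A minimal encryption sketch in Python is shown below, assuming the third-party cryptography
package is installed; the payload string is illustrative.

from cryptography.fernet import Fernet

# Symmetric encryption: only holders of the secret key can read the data.
key = Fernet.generate_key()        # secret key, to be protected by key management
cipher = Fernet(key)

token = cipher.encrypt(b"student_id=1023, grade=A")  # unreadable ciphertext
plain = cipher.decrypt(token)                         # recoverable only with the key

print(token)
print(plain.decode("utf-8"))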

Data Erasure - More secure than standard data wiping, data erasure uses software to
completely overwrite data on any storage device. It verifies that the data is unrecoverable.

Data Masking - By masking data, organizations can allow teams to develop applications or train
people using real data. It masks personally identifiable information (PII) where necessary so that
development can occur in environments that are compliant.
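A minimal masking sketch using only Python's standard library is shown below; the mask_email
helper and the sample address are hypothetical.

import hashlib

# Replace an email address with a partial reveal plus a one-way hash, so that
# development and testing can use realistic but non-identifying records.
def mask_email(email):
    local, _, domain = email.partition("@")
    digest = hashlib.sha256(email.encode("utf-8")).hexdigest()[:8]
    return "{}***{}@{}".format(local[0], digest, domain)

print(mask_email("student@example.com"))   # e.g. 's***<hash>@example.com'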

With the advancement of data encryption, tokenization and key management methods for
protecting data across applications, transactions and storage, the following data security
solutions may be named:
 Cloud data security – Protection platform that allows the user to move to the
cloud securely while protecting data in cloud applications.
 Data encryption – Data-centric and tokenization security solutions that protect
data across enterprise, cloud, mobile and big data environments.


 Hardware security module - Hardware security module that protects financial


data and meets industry security and compliance requirements.
 Key management -- Solution that protects data and enables industry regulation
compliance.
 Enterprise Data Protection – Solution that provides an end-to-end data-centric
approach to enterprise data protection.
 Payments Security – Solution provides complete point-to-point
encryption and tokenization for retail payment transactions.
 Big Data, Hadoop and IoT data protection – Solution that protects sensitive data
in the data lake – including Hadoop, Teradata, Micro Focus Vertica, and other
big data platforms.
 Mobile App Security - Protecting sensitive data in native mobile apps while
safeguarding the data end-to-end.
 Web Browser Security - Protects sensitive data captured at the browser, from the
point the customer enters cardholder or personal data, and keeps it protected
through the ecosystem to the trusted host destination.
 Email Security – Solution that provides end-to-end encryption for email and
mobile messaging, keeping personally identifiable information and personal
health information secure and private.

4. Types of Data Analytics

Depending upon the phase of workflow and nature of analysis needed, data analytics
can be classified majorly in four types:

 Descriptive Analytics
 Diagnostic Analytics
 Predictive Analytics and
 Prescriptive Analytics.


Fig. Types of data analytics arranged by increasing complexity and value

In fact, the above mentioned data analytics types are based on the trade-off between degree of
complexity and its value. Each type is discussed below.

Descriptive Analytics

Descriptive analytics refers to the processing and analysis of historical data in order to provide
insights into the past leading up to the present, using descriptive statistics, interactive
exploration of the data, and data mining. It identifies relationships in data, often with the intent of
categorizing data into groups. It enables learning about what happened (the past) and assessing
how the past might influence future outcomes. Such an analytics mechanism can help to identify
the areas of strength and weakness in an organization. Very common examples of descriptive
analytics include year-over-year pricing changes, month-over-month sales growth, the number of
users, and the total revenue per subscriber.
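As a small illustration, month-over-month sales growth and basic summary statistics can be
computed with pandas; the sales figures below are made up.

import pandas as pd

# Illustrative monthly sales figures.
sales = pd.Series(
    [120, 135, 128, 150, 162],
    index=pd.period_range("2023-01", periods=5, freq="M"),
    name="sales",
)

summary = pd.DataFrame({
    "sales": sales,
    "mom_growth_pct": sales.pct_change() * 100,   # month-over-month sales growth
})

print(summary)
print(sales.describe())   # basic descriptive statistics of the series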

Diagnostic Analytics

It is a kind of data analytics that investigates why a given output or result happened, using
internal data. Basically, it monitors changes in the data or results by performing tasks such as
analyzing variance, computing historical performance and building reasonable forecasts. It
offers an in-depth insight into a given problem, provided enough data are available. It also
enables anomalies to be identified and causal relationships in the data to be determined. For
example, e-commerce giants like Amazon can drill sales and gross profit down to various
product categories, such as Amazon Echo, to find out why they missed their overall profit
margins. Diagnostic analytics also finds applications in education, for example in identifying the
pros and cons of online education during the Covid-19 pandemic across the states of the
country.

Predictive Analytics

Predictive analytics exploits patterns found in historical data to make predictions about future
behavior. It analyzes past data patterns and estimates trends more accurately. It makes it
possible to predict the likelihood of a future outcome by using various statistical and machine
learning algorithms. For example, in the context of education, it can be used to predict the
performance of a student based on his/her marks obtained in various examinations.
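A minimal predictive sketch, assuming scikit-learn is available, fits a linear regression that
predicts a final mark from earlier internal-exam marks; all numbers are illustrative, not real
student data.

import numpy as np
from sklearn.linear_model import LinearRegression

# Past internal-exam marks (features) and final marks (target).
past_marks = np.array([[62, 70], [75, 80], [58, 65], [90, 88], [70, 72]])
final_marks = np.array([68, 79, 60, 91, 73])

model = LinearRegression().fit(past_marks, final_marks)

new_student = [[80, 85]]                 # internal marks of a new student
predicted = model.predict(new_student)   # predicted final mark
print(round(float(predicted[0]), 1))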

Prescriptive Analytics

Prescriptive analytics suggests possible outcomes and recommends actions that are likely to
maximize the outcome. In fact, it is an advanced analytics concept based on –
 Optimization, which helps achieve the best outcomes.
 Stochastic optimization, which helps understand how to achieve the best outcome and
identify data uncertainties to make better decisions.
Simulating the future, under various sets of assumptions, allows scenario analysis - which when
combined with different optimization techniques, allows prescriptive analysis to be performed.
The prescriptive analysis explores several possible actions and suggests actions depending on
the results of descriptive and predictive analytics of a given dataset.

References:

 Hammad et al. (2015), Proc. of the 2015 International Conference on Operations Excellence
and Service Engineering, Orlando, Florida, USA, September 10-11, 2015.
 https://www.ibm.com/cloud/learn/data-modeling#toc-data-model-FyL4yPFQ
