Week 1: Data Analytics – An Overview
Learning Outcomes
Explore the concepts of data science and data analytics and the relationship between them.
Develop an understanding of data, data types, and their various sources, with examples.
Explore the data analytics workflow, including data collection, methods of storing data, data cleansing, data visualization and data security.
Develop an insight into the types of data analytics as well as data modeling approaches.
1. Introduction
The world has witnessed a digital revolution in almost all fields. The advent of information technology has generated large volumes of data in various formats such as records, files, documents, images, sound, videos, scientific data and many new data formats. It is worth mentioning that the volume of data stored in the world was projected to exceed 40 zettabytes (10^21 bytes) by 2020, as published by Hammad et al. (2015).
Fig. Increasing trend of data volume in zettabytes across the globe [source: Hammad et al. 2015]
Due to the increased volume, velocity and variety of data, such datasets are very often referred to as 'big data'. Like other fields, educational settings generate a large amount of data every day through e-learning resources, learning management systems, student information systems, laboratory experiments, and other academic and administrative resources. All these data collected from different sources require a proper analytical framework in order to extract knowledge for strategic planning and better decision making. Data analytics, being the most impactful analytical platform, plays a crucial role in the visualization, summarization and modelling of data for prediction and inference.
Data analytics refers to the science of exploring, analyzing and visualizing data from both internal and external sources to draw inferences and make predictions that enable innovation, confer competitive business advantage, and ultimately support strategic decision-making. In fact, it has become an interdisciplinary research field that has adopted aspects from many other scientific disciplines.
2. Types of Data
Data generated from various sources can be broadly categorized into two segments: quantitative and qualitative data.
Quantitative data are represented by numerical values, which makes them easy to describe precisely and accurately. Such data are readily amenable to statistical manipulation and can be represented by a wide variety of graphs and charts such as line graphs, bar graphs and scatter plots.
There are two general types of quantitative data: discrete and continuous. Discrete data can take only distinct, separate values (typically counts), whereas continuous data can be meaningfully divided into finer levels; it is measured on a scale or continuum and can take almost any numeric value. Both discrete and continuous quantitative data are measurable on two scales – interval and ratio scales.
An interval scale is one where there is order and the difference between two values is meaningful. Examples include temperature (Fahrenheit), temperature (Celsius), pH, SAT score (200-800), excellent grade (90-100), credit score (300-850), etc. A variable measured on a ratio scale has all the properties of an interval scale and, in addition, a true zero. For example, enzyme activity, dose amount, reaction rate, flow rate, concentration, pulse, weight, length and temperature in Kelvin (0.0 Kelvin really does mean "no heat") are measured on a ratio scale. The following figure shows the classification of data types by measurement scale.
Fig. Classification of data types by measurement scale: quantitative data (discrete or continuous) are measured on interval or ratio scales; qualitative data are measured on nominal or ordinal scales.
On the other hand, qualitative data consist of values that can be placed in nonnumeric categories. Such data can be measured on nominal and ordinal scales. A nominal scale provides only categories, without any ordering or direction, whereas an ordinal scale measures the ranking or ordering of values.
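To make the four scales concrete, here is a minimal sketch in Python (using pandas; all values are made-up) of how each scale might be represented and which operations each one supports:

```python
import pandas as pd

# Nominal: categories with no order or direction
blood_group = pd.Categorical(["A", "B", "O", "AB", "O"], ordered=False)

# Ordinal: categories with a meaningful ranking
grade = pd.Categorical(["Poor", "Good", "Excellent", "Good"],
                       categories=["Poor", "Good", "Excellent"], ordered=True)

# Interval: numeric, differences are meaningful but zero is arbitrary
temp_celsius = pd.Series([12.5, 18.0, 25.3])

# Ratio: numeric with a true zero, so ratios are meaningful
weight_kg = pd.Series([54.2, 61.0, 72.8])

print(grade.min(), grade.max())           # ordering is defined for ordinal data
print(weight_kg.max() / weight_kg.min())  # ratios make sense only at ratio scale
```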
The advent of information technology has produced an ever-growing repository of digital data. Based on its format, this digital data can be broadly classified into structured, semi-structured and unstructured data.
i) Unstructured Data:
Unstructured data refers to data that does not conform to any pre-defined model. It might have an internal structure, but it is not organized via pre-defined data models. Put another way, data with some structure may still be labeled unstructured if that structure does not help with the processing task at hand. It may be textual or non-textual information, either human-generated (e.g., emails, documents, social media posts) or machine-generated (e.g., sensor readings, satellite imagery, surveillance footage).
In today's scenario, the majority of enterprise data is unstructured, for a variety of reasons such as the ease of multimedia communication, data storage and transfer. Such data cannot be ignored, given its large and growing volume. In order to extract meaningful information, these data need to be analyzed with the following analytical techniques.
Fig. Analytical techniques for unstructured data: data mining, natural language processing, text analytics and noisy text analytics.
Data Mining – It refers to the process of discovering interesting and meaningful patterns and
relationships from large volumes of data. It is also known as ‘knowledge discovery in data’
(KDD). Data mining has emerged as an important multi-disciplinary area integrating tools and
techniques from different disciplines such as artificial intelligence, machine learning, statistics
and probability, database management, data visualization etc.
From a functionality point of view, it can be divided into two general components: descriptive data mining and predictive data mining.
Example – Data mining solves a variety of real applications in areas such as defense, biology, healthcare, industry, banking and education. In fact, educational data mining (EDM) has emerged as an important analytics area that explores different data mining techniques to analyze students' data in educational environments to improve performance.
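As a small illustration of descriptive data mining, the following sketch (Python with scikit-learn; the student records and feature names are entirely hypothetical) clusters synthetic student data with k-means:

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(42)
# Hypothetical features per student: [average exam score, weekly hours on the LMS]
students = np.vstack([
    rng.normal([85, 12], [5, 2], size=(30, 2)),   # high performers
    rng.normal([55, 4],  [8, 2], size=(30, 2)),   # struggling students
])

# Descriptive mining: discover and summarize groups hidden in the data
kmeans = KMeans(n_clusters=2, n_init=10, random_state=0).fit(students)
print(kmeans.cluster_centers_)  # the centroids summarize the discovered groups
```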
Natural Language Processing – It is a branch of artificial intelligence (AI) that enables computers to understand text and spoken words expressed in human (natural) language. It integrates disciplines such as statistics, machine learning and deep learning.
Fig. NLP at the intersection of statistics, machine learning and deep learning.
Example in education – With the advancement of AI, natural language processing has contributed significantly to improving the scientific process of teaching and learning. It also assists in adopting advanced technologies that improve the educational system. For example, NLP is applied in e-learning, where it assists in producing educational material. In the teaching-learning context, there are numerous electronic and online sources available in the English language that give students and teachers access to e-materials. With such a large number of online resource materials available, it becomes very important to verify their reliability. This requires intelligent automatic processing to prevent the use of unreliable resources and promote the use of authentic ones. Applications of NLP in education also extend to text mining, information retrieval, quality assessment and academic performance improvement.
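A minimal flavor of NLP preprocessing can be given in plain Python; the stop-word list and the sample learner answer below are made-up for illustration:

```python
import re
from collections import Counter

STOP_WORDS = {"the", "a", "an", "is", "are", "of", "to", "and", "in", "by", "which"}

def tokenize(text: str) -> list[str]:
    # Lowercase, then split on any run of non-letter characters
    return [t for t in re.split(r"[^a-z]+", text.lower()) if t]

answer = "Photosynthesis is the process by which plants convert light to energy."
tokens = [t for t in tokenize(answer) if t not in STOP_WORDS]
print(Counter(tokens).most_common(3))  # most frequent content words
```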
Text Analytics – Text analytics, or text mining, is the process of extracting high-quality, meaningful information from text on the basis of statistical pattern learning. It covers tasks such as text categorization, text clustering, sentiment analysis and concept analysis.
Example – Text analytics plays a crucial role in deriving useful information from opinions received from students, supporting educational processes and decision-making strategies. It helps determine the qualities that undergraduates consider essential when evaluating the teaching-learning process and the performance of teachers. There has been an unprecedented increase in the amount of text-based data generated by activities within educational processes, which can be leveraged to provide useful strategic intelligence and insights for improvement. Educators can apply the resulting methods, technologies, process innovations and contextual information to support and monitor teaching-learning processes and decision making.
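As a sketch of how such opinion mining might look in code, the following example (Python with scikit-learn; the feedback sentences and labels are fabricated) trains a tiny sentiment classifier on TF-IDF features:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Hypothetical student feedback with hand-assigned sentiment labels
feedback = ["The lectures were clear and engaging",
            "Too fast, I could not follow the examples",
            "Excellent course material and support",
            "Assignments were confusing and poorly explained"]
labels = ["positive", "negative", "positive", "negative"]

# TF-IDF turns text into numeric features; Naive Bayes classifies them
model = make_pipeline(TfidfVectorizer(), MultinomialNB())
model.fit(feedback, labels)
print(model.predict(["The explanations were clear"]))
```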
Noisy Text Analytics – This is the process of extracting structured and unstructured information from noisy unstructured data such as chats, blogs, wikis, emails and text messages. Noisy unstructured data usually contains one or more of the following: spelling mistakes, abbreviations, acronyms, non-standard words, missing punctuation, missing letter case, and filler words such as "uh" and "um".
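A minimal cleanup sketch in plain Python (the abbreviation table and chat message are hypothetical) shows how some of this noise might be handled:

```python
import re

ABBREVIATIONS = {"u": "you", "pls": "please", "asap": "as soon as possible"}
FILLERS = {"uh", "um", "like"}

def clean(message: str) -> str:
    tokens = message.lower().split()
    tokens = [ABBREVIATIONS.get(t, t) for t in tokens]  # expand abbreviations
    tokens = [t for t in tokens if t not in FILLERS]    # drop filler words
    return re.sub(r"\s+", " ", " ".join(tokens)).strip()

print(clean("um can u pls send the notes asap"))
# -> "can you please send the notes as soon as possible"
```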
ii) Semi-structured Data:
Semi-structured data is information that does not reside in a relational database but has some organizational properties that make it easier to analyze. It is more flexible than structured data but less flexible than unstructured data. Examples include email, XML and other markup languages. It is also referred to as having a self-describing structure.
Semi-structured data falls in the middle between structured and unstructured data: it contains certain aspects that are structured and others that are not. For example, X-rays and other large images consist largely of unstructured data – in this case, a great many pixels. It is impossible to search and query these X-rays in the way that a large relational database can be searched, queried and analyzed; after all, all you are searching against are pixels within an image. Fortunately, there is a way around this. Although the files themselves may consist of no more than pixels, words or objects, most files include a small structured section known as metadata.
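The following sketch (Python standard library; the XML schema is hypothetical) illustrates the point: the image pixels are opaque, but the metadata can be queried directly:

```python
import xml.etree.ElementTree as ET

# Hypothetical metadata accompanying an otherwise-opaque X-ray image
record = """
<xray>
  <patient_id>12345</patient_id>
  <body_part>chest</body_part>
  <date>2021-03-14</date>
</xray>
"""

root = ET.fromstring(record)
print(root.find("body_part").text)  # query the structured part: 'chest'
```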
iii) Structured Data:
Structured data refers to data that conforms to a pre-defined schema or structure. These data are highly organized and usually reside in relational databases. Format-wise, examples include Excel spreadsheets and Oracle database tables. In real-life situations, enterprise resource planning (ERP) data, transactional data, and student mark sheets and records are further examples of structured data.
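A minimal sketch of structured data in Python (using the standard sqlite3 module; the marks table is made-up) shows how such data conforms to a schema and supports querying:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# A pre-defined schema: every row must conform to these columns and types
conn.execute("CREATE TABLE marks (roll_no INTEGER, subject TEXT, score REAL)")
conn.executemany("INSERT INTO marks VALUES (?, ?, ?)",
                 [(1, "Maths", 88.0), (1, "Physics", 79.5), (2, "Maths", 92.0)])

# Structured data can be queried and aggregated directly
for row in conn.execute("SELECT subject, AVG(score) FROM marks GROUP BY subject"):
    print(row)
```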
After defining the problem and its associated variables, the next important stage is data collection, where data come from different sources. Depending upon availability, data can be divided into the following:
Primary Data
Secondary Data
If the data are collected by the decision maker directly through a survey, interview or experiment, they are called primary data. On the other hand, data collected from existing sources (where somebody has already collected and stored them in some form of repository such as a file, database or cloud storage) are called secondary data.
It is well established that data received from a variety of sources in different formats give rise to enormously large datasets. Once the voluminous data have been collected as per the defined problem, data analytics tools and techniques need to be explored in order to extract meaningful information for interpretation. In this context, the data analytics framework includes several important steps to be executed systematically. The analytics flow is presented in the diagram below. Most importantly, it involves transforming and inspecting data to infer the inherent meaningful information.
(a) At first, the problem of interest is identified. The problem statement guides what types of data to gather and which important features represent the possible solutions. A lot of domain expertise is needed in data analytics, and a problem space where expertise is accessible is almost mandatory.
(b) Once the problem is identified, appropriate data needs to be collected. The collected data needs to be represented in a format that optimizes space without losing information resolution. Enterprises also need to be aware of compliance and security: access to the data might need to be restricted to authorized personnel, and the data could be confidential in some cases.
(c) The stored data needs to be cleansed as part of preprocessing. Cleansing involves the removal of outliers, missing values and bad records. The result of the data analysis depends heavily on the cleansing of the data; data that is not cleansed might lead to skewed analysis.
(d) The cleansed data is transformed into a representation that can be used for analysis. One example of transformation is normalization of the data to a range between 0 and 1 (see the sketch after this list). Another example is changing the scale of the data to ease computation.
(e) The transformed data is then analyzed using statistical and machine learning algorithms. Statistical techniques are used for summarization, visualization, predictive modeling, etc., while machine learning (ML) algorithms refer to a class of learning algorithms mainly used for clustering and classification of data for meaningful predictions.
(f) The validation step is also very important in order to validate or cross-validate the obtained results against experts' domain knowledge or with a set of test users. Post-validation, if the results can be used to make meaningful decisions, the analysis process ends; otherwise, the data scientist and associates go back to the drawing board and tweak the parameters in the pipeline.
(g) The validated results are visually represented for the stakeholders (which could include users) for easy understanding, using various data visualization techniques.
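As referenced in step (d), here is a minimal normalization sketch (Python with NumPy; the scores are made-up):

```python
import numpy as np

scores = np.array([42.0, 55.0, 61.0, 78.0, 95.0])
# Min-max normalization: map the smallest value to 0 and the largest to 1
normalized = (scores - scores.min()) / (scores.max() - scores.min())
print(normalized)  # 42 -> 0.0, 95 -> 1.0
```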
Amongst the above, the following major components of data analytics are described briefly.
Data Collection refers to the acquisition of data from various sources in order to address the problems defined by the decision maker. Broadly, there are two ways of collecting data: quantitative and qualitative methods.
Quantitative data collection methods are driven by (statistical) random sampling techniques, the most popular of which are described below.
In stratified sampling, for example, the population is first divided into homogeneous groups (strata), and members are then selected from each group so that the final sample retains the signature of all groups and attains heterogeneity.
In addition, there are many other methods of information gathering.
Data Storage
Data collected in the form of characters, images, sound, or some other form is initially saved into the computer and then converted into an appropriate digital format so that it can be stored and processed in the computing environment. Data in various digital formats are stored in the computer using different data storage methods, the most important features of which are described here.
File Storage – a data storage method that follows a hierarchical structure, enabling fast navigation and the storage of complex files. It offers, however, limited scalability of space.
Block Storage – data are first fragmented and stored in blocks, permitting them to be freely configured and quickly retrieved. Data management in this system, however, requires more complex operations.
Object Storage – data are stored in the form of objects, each having a unique identifier and a rich metadata package (creation date, author, etc.). This solution provides cost savings and security for the stored data; however, it prevents free modification of the data.
Data Cleansing
While data are being collected and stored for processing and analysis towards decision making, it is very important to preprocess them first so that noise and outliers do not mislead the decisions. Data cleansing, or data cleaning, refers to the process of detecting and correcting (or removing) corrupt or inaccurate records from a data repository. It involves identifying incomplete, incorrect, inaccurate or irrelevant parts of the data and then replacing, modifying or deleting the dirty or coarse data. Moreover, improving data quality can eliminate problems such as expensive processing errors, manual troubleshooting and incorrect invoices.
For example, Excel and Google Sheets, the most common data storage packages, very often contain misspelled words, stubborn trailing spaces, unwanted prefixes, improper cases and nonprinting characters, all of which make a bad first impression. In that case, the basic steps below may be followed in order to clean the data and improve its quality.
Ensure that the data is in a tabular format of rows and columns, with similar data in each column, all columns and rows visible, and no blank rows within the range. For best results, use an Excel table.
Do tasks that don't require column manipulation first, such as spell-checking or using the Find and Replace dialog box.
Next, do tasks that do require column manipulation. The general steps for manipulating a column are:
o Insert a new column (B) next to the original column (A) that needs cleaning.
o Add a formula that will transform the data at the top of the new column (B).
o Fill down the formula in the new column (B). In an Excel table, a calculated column with the values filled down is created automatically.
o Select the new column (B), copy it, and then paste it as values into the new column (B).
o Remove the original column (A), which converts the new column from B to A.
Spell checking – the spelling and grammar checker can be used not only to find misspelled words, but also to find values that are not used consistently, such as product or company names, by adding those values to a custom dictionary.
Removing duplicate rows – duplicate rows are a common problem when you import data. It is a good idea to filter for unique values first, to confirm that the results are what you want, before you remove duplicate values.
Fixing numbers and number signs – there are two main issues with numbers that may require you to clean the data: the number was inadvertently imported as text, or the negative sign needs to be changed to the standard used by your organization.
Fixing dates and times – because there are so many different date formats, and because these formats may be confused with numbered part codes or other strings that contain slash marks or hyphens, dates and times often need to be converted and reformatted.
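The same kinds of fixes can be scripted. The following sketch (Python with pandas; the data frame is fabricated) strips stray spaces, deduplicates rows, coerces numbers imported as text, and parses dates:

```python
import pandas as pd

df = pd.DataFrame({
    "name":  ["Asha ", "Asha ", "Ravi"],            # trailing space, duplicate row
    "score": ["85", "85", "ninety"],                # numbers stored as text
    "date":  ["2021-01-05", "2021-01-05", "2021-01-09"],
})

df["name"] = df["name"].str.strip()                        # drop stray spaces
df = df.drop_duplicates()                                  # remove duplicate rows
df["score"] = pd.to_numeric(df["score"], errors="coerce")  # bad values become NaN
df["date"] = pd.to_datetime(df["date"])                    # normalize dates
print(df)
```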
Data Visualization
People grasp information more readily when it is contained in visuals rather than conveyed by written words or words spoken in a conversation, and herein lies the importance of data visualization.
Data visualization refers to the use of computer graphics to create visual images which aid in the understanding of complex, often massive, representations of data. There are three major types of data visualization:
Scientific Visualization
Information Visualization
Visual Analytics
Scientific visualization mainly deals with structured data such as seismic, medical and high-throughput data. In this case, the digital data generated by the methods and systems are properly structured in nature, and hence they can be processed systematically to retrieve or extract meaningful information. For example, the scientific images below provide information on climatic conditions from seismic data, brain structure from MRI, and molecular behavior from NMR spectra.
Fig. Scientific visualization: (a-b) seismic data visualization and its geo-mapping; (c) brain MRI; and (d) NMR molecular profiling data.
In the case of information visualization, the data has no inherent structure; examples include news, stock market data, and Facebook, Twitter and WhatsApp content, all of which are basically unstructured in nature.
On the other hand, visual analytics aims to understand and synthesize large amounts of multimodal data – audio, video, text, images, networks of people, and so on.
Statistics, being at the core of data analytics, plays a crucial role in data visualization as well. The statistical tools most frequently used to represent data graphically and draw meaningful interpretations are as follows.
Bar Chart – A pictorial representation of grouped data depicted as rectangular bars, where the length of each bar is proportional to the measure of the data in a category. These charts may be displayed with vertical columns, horizontal bars, comparative bars (multiple bars showing a comparison between values), or stacked bars (bars containing multiple types of information). For example, a bar chart can be used to compare the performance of B.Tech. students across semesters I-VIII.
Pie Chart – A circular graphical chart that reflects the proportions of data as divisions of the whole circular angle, i.e., 360°. It can be used to represent the relative sizes of a variety of data. There are many real-life examples of pie charts, such as the representation of marks obtained by students in a class, or of the types of COVID-19 masks used by people in a month.
Bubble Chart – A kind of scatter plot that represents three dimensions of data. In general, each point is a triplet (x, y, size), where (x, y) indicates location and the third value determines the size of the bubble. It facilitates the understanding of social, economic, medical and other scientific relationships. In the example below, the average life expectancy, GDP per capita and population size of more than 100 countries are visualized in a bubble plot: the first two parameters form the scatter plot, and a bigger bubble indicates a comparatively larger population.
Fig. Bubble plot showing GDP per capita vs life expectancy [source: https://www.data-to-viz.com/graph/bubble.html]
Histogram – A statistical graph showing the distribution of grouped data, where the x-axis and y-axis indicate the class intervals and frequencies respectively.
Box Plot – A graphical display summarizing five statistical parameters to depict the degree of spread and the symmetry of a data distribution. It includes the minimum, the maximum and three quartile values of a data set. This graph is frequently used for outlier detection and for comparing two or more populations.
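The following sketch (Python with matplotlib; the marks are randomly generated) draws three of the chart types described above:

```python
import matplotlib.pyplot as plt
import numpy as np

rng = np.random.default_rng(0)
marks = rng.normal(70, 12, size=200)  # hypothetical exam marks

fig, (ax1, ax2, ax3) = plt.subplots(1, 3, figsize=(12, 3))

ax1.bar(["Sem I", "Sem II", "Sem III"], [68, 74, 71])  # compare categories
ax1.set_title("Bar chart")

ax2.hist(marks, bins=15)                               # distribution of marks
ax2.set_title("Histogram")

ax3.boxplot(marks)                                     # spread and outliers
ax3.set_title("Box plot")

plt.tight_layout()
plt.show()
```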
Data Modeling
Data modeling basically refers to the process of creating a data model of either a whole
information system or parts of it to communicate connections between data points and
structures by applying certain formal techniques. This provides a common, consistent, and
predictable way of defining and managing data resources across an organization, or even
beyond. Ideally, data models are living documents that evolve along with changing business
needs. Data models can generally be divided into three categories, which vary according to their
degree of abstraction. The process will start with a conceptual model, progress to a logical
model and conclude with a physical model. Each type of data model is briefly discussed below:
Conceptual data models – Models of this type are usually called domain models; they provide an overall broad picture of what the system will contain, how it will be organized, and which business rules are involved. Such models are usually created as part of the process of gathering initial project requirements. They include entity classes describing the types of things that are important for the business to represent in the data model, their characteristics and constraints, the relationships between them, and the relevant security and data integrity requirements.
Logical data models – The conceptual model is translated into a logical data model, which documents the structures of the data that can be implemented in databases. Implementing one conceptual data model may require multiple logical data models. These models are less abstract and provide greater detail about the concepts and relationships in the domain under consideration. Logical data models do not specify any technical system requirements.
Physical data models – The final step in data modeling is transforming the logical data model into a physical model that organizes the data into tables and accounts for access, performance and storage details. As these models are the least abstract, they offer a finalized design that can be implemented as a relational database, including associative tables that capture the relationships among entities, as well as the primary keys and foreign keys that will be used to maintain those relationships.
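As an illustration, here is a minimal sketch of a physical model (Python's sqlite3 module with a hypothetical student/enrollment schema), including the primary and foreign keys that maintain the relationships:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE students (
    student_id INTEGER PRIMARY KEY,           -- primary key
    name       TEXT NOT NULL
);
CREATE TABLE enrollments (                    -- associative table between entities
    student_id INTEGER REFERENCES students(student_id),  -- foreign key
    course_id  TEXT,
    PRIMARY KEY (student_id, course_id)
);
""")
```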
Fig. Progression of data models: conceptual model (identifying business / program concepts) → logical model → physical model.
Data Security
Data security is the practice of protecting digital information from unauthorized access, corruption or theft throughout its entire lifecycle. It encompasses every aspect of information security, from the physical security of hardware and storage devices to administrative and access controls, as well as the logical security of software applications. Various types of data security are discussed below.
Data Erasure – More secure than standard data wiping, data erasure uses software to completely overwrite data on any storage device, and it verifies that the data is unrecoverable.
Data Masking – By masking data, organizations can allow teams to develop applications or train people using real data. It masks personally identifiable information (PII) where necessary so that development can occur in environments that are compliant.
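A minimal masking sketch (plain Python; the record is fabricated, and a production scheme would also use a secret salt or key) might look as follows:

```python
import hashlib

def mask(value: str) -> str:
    # Deterministic, irreversible token standing in for the original value
    # (a real scheme would also incorporate a secret salt or key)
    return hashlib.sha256(value.encode("utf-8")).hexdigest()[:12]

record = {"name": "Asha Verma", "email": "asha@example.com", "score": 88}
masked = {k: mask(v) if k in {"name", "email"} else v for k, v in record.items()}
print(masked)  # PII replaced, non-sensitive fields kept for development use
```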
With the advancement of data encryption, tokenization and key management methods to protect data across applications, transactions and storage, the following data security solutions may be named:
Cloud data security – protection platforms that allow users to move to the cloud securely while protecting data in cloud applications.
Data encryption – data-centric and tokenization security solutions that protect data across enterprise, cloud, mobile and big data environments.
Types of Data Analytics
Depending upon the phase of the workflow and the nature of the analysis needed, data analytics can be classified into four major types:
Descriptive Analytics
Diagnostic Analytics
Predictive Analytics and
Prescriptive Analytics.
Fig. Types of data analytics arranged by value versus complexity.
In fact, the data analytics types mentioned above reflect a trade-off between degree of complexity and value. Each type is discussed below.
Descriptive Analytics
Descriptive analytics refers to the processing and analysis of historical data in order to provide insights into the past leading up to the present, using descriptive statistics, interactive exploration of the data, and data mining. It identifies relationships in data, often with the intent to categorize data into groups. It enables learning about what happened in the past and assessing how the past might influence future outcomes. Such analytics can help to identify areas of strength and weakness in an organization. Common examples of descriptive analytics include year-over-year pricing changes, month-over-month sales growth, the number of users, and the total revenue per subscriber.
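For instance, month-over-month sales growth is a one-line computation. The sketch below (Python with pandas; the sales figures are made-up) summarizes the history and the growth rate:

```python
import pandas as pd

sales = pd.Series([120, 135, 128, 150],
                  index=["Jan", "Feb", "Mar", "Apr"], name="sales")
print(sales.describe())          # summary statistics of the past
print(sales.pct_change() * 100)  # month-over-month growth, in percent
```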
Diagnostic Analytics
Diagnostic analytics investigates why something (an output, result, etc.) happened, using internal data. Basically, it monitors changes in the data or results by performing tasks such as analyzing variance, computing historical performance and building reasonable forecasts. It offers an in-depth insight into a given problem, provided enough data is available. It also makes it possible to identify anomalies and determine causal relationships in data. For example, e-commerce giants like Amazon can drill sales and gross profit down to individual product categories, such as the Amazon Echo, to find out why they missed their overall profit margins. Diagnostic analytics also finds applications in education, for example in identifying the pros and cons of online education during the COVID-19 pandemic across the states of a country.
Predictive Analytics
Predictive analytics exploits patterns found in historical data to make predictions about future behavior. It analyzes past data patterns to estimate trends more accurately, and it enables prediction of the likelihood of a future outcome using various statistical and machine learning algorithms. For example, in the context of education, it can be used to predict the performance of a student based on his/her marks in previous examinations.
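A minimal predictive sketch (Python with scikit-learn; the marks are fabricated) fits a linear regression to earlier examination marks and predicts a final score:

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Hypothetical features: [mid-term 1, mid-term 2]; target: final exam score
X = np.array([[55, 60], [70, 68], [80, 85], [62, 58], [90, 92]])
y = np.array([58, 72, 86, 60, 94])

model = LinearRegression().fit(X, y)
print(model.predict([[75, 78]]))  # predicted final score for a new student
```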
Prescriptive Analytics
Prescriptive analytics suggests possible outcomes and recommends actions that are likely to maximize the outcome. It is an advanced analytics concept based on:
Optimization, which helps achieve the best outcomes.
Stochastic optimization, which helps understand how to achieve the best outcome under data uncertainty so as to make better decisions.
Simulating the future under various sets of assumptions allows scenario analysis, which, when combined with different optimization techniques, allows prescriptive analysis to be performed. Prescriptive analysis explores several possible actions and suggests actions depending on the results of the descriptive and predictive analytics of a given dataset, as in the sketch below.
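As a small illustration of prescriptive analytics as optimization, the sketch below (Python with SciPy; the coefficients and constraints are invented) chooses tutoring and lab hours to maximize an assumed score gain subject to a weekly time budget:

```python
from scipy.optimize import linprog

# Maximize 3*tutoring + 2*lab  ->  minimize the negated objective
c = [-3.0, -2.0]
A_ub = [[1.0, 1.0]]        # tutoring + lab hours ...
b_ub = [10.0]              # ... at most 10 hours per week
bounds = [(0, 6), (0, 6)]  # each activity capped at 6 hours

result = linprog(c, A_ub=A_ub, b_ub=b_ub, bounds=bounds)
print(result.x)            # suggested action: hours per activity
```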
References:
Hammad et al. (2015), Proc. of the 2015 International Conference on Operations Excellence and Service Engineering, Orlando, Florida, USA, September 10-11, 2015.
https://www.ibm.com/cloud/learn/data-modeling#toc-data-model-FyL4yPFQ