UNIT – II Data Analytics

Topics:
1. Introduction to Analytics
2. Introduction to Tools and Environment
3. Application of Modeling in Business
4. Databases & Types of Data and Variables
5. Data Modeling Techniques, Missing Imputations etc.
6. Need for Business Modeling.

1) INTRODUCTION TO ANALYTICS:

Data analytics is the process of collecting, processing, and interpreting data to support decision-making. It is the practice of examining raw data to identify trends, draw
conclusions, and extract meaningful information.

Benefits of Data Analytics:


 Improved Decision Making
 Efficient Operations
 Better Customer Service
 Effective Marketing
 Cost Reduction and Financial optimization

Types of Data Analytics:


There are four major types of data analytics:

a) Descriptive analytics
b) Diagnostic analytics
c) Predictive analytics
d) Prescriptive analytics

(a) Descriptive analytics:


Descriptive analytics is the process of using current and historical data to identify
trends and relationships. It looks at what has happened in the past. The data is presented in a
way that can be easily understood by a wide audience. The two main techniques used in
descriptive analytics are data aggregation and data mining— the data analyst first gathers the
data and presents it in a summarized format and then “mines” the data to discover patterns.
It’s important to note that descriptive analytics doesn’t try to explain the historical
data or establish cause-and-effect relationships. It is simply a case of determining and
describing the “what happened”.
The data can be split into categories such as gender, residency, age, and so on, and then summarized or grouped into a fixed set of figures that describes both the whole dataset and specific segments.
Such a summary requires no further analysis and is used directly to make sense of the information. A report of this kind is thus a typical product of descriptive analytics.
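
As a minimal illustration of data aggregation in descriptive analytics (assuming Python with the pandas library; the sales table and column names are made up for the example):

# Descriptive analytics sketch: aggregate raw records and summarize "what happened".
import pandas as pd

sales = pd.DataFrame({
    "month":   ["Jan", "Jan", "Feb", "Feb", "Mar", "Mar"],
    "region":  ["North", "South", "North", "South", "North", "South"],
    "revenue": [120, 95, 130, 90, 80, 70],
})

# Data aggregation: group the raw records and present them in summarized form.
summary = sales.groupby("month")["revenue"].agg(["sum", "mean"])
print(summary)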

(b) Diagnostic analytics:


While descriptive analytics looks at the “what happened”, diagnostic analytics
explores the “why it happened”. When running diagnostic analytics, data analysts will first
seek to identify anomalies within the data. For example: If the data shows that there was a
sudden drop in sales for the month of March, the data analyst will need to investigate the
cause.

At this stage, data analysts may use probability theory, regression analysis, filtering, and
time-series analysis. After applying diagnostic analytics to discover why an event
occurred, companies can use that knowledge to create solutions and develop predictive
models for the future. Diagnostic analytics is therefore helpful for making a better plan for the future.

For example,
1. A hospital examines readmission rates for a particular type of surgery. Diagnostic analytics might identify factors like inadequate post-surgical care instructions or lack of follow-up appointments.
2. A pharmaceutical company analyzes clinical trial data for a new drug and discovers higher-than-expected side effects in a specific patient subgroup. Diagnostic analytics can help identify the cause and potentially refine the drug’s target audience.
3. A social media influencer experiences a sudden drop in audience engagement. Diagnostic analytics can help identify the cause and increase the audience.
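
The anomaly-spotting step that diagnostic analytics begins with can be sketched in a few lines of Python (pandas assumed; the monthly figures are invented, with March as the odd month out):

# Diagnostic analytics sketch: flag anomalous values before investigating causes.
import pandas as pd

sales = pd.Series([100, 105, 40, 98, 102],
                  index=["Jan", "Feb", "Mar", "Apr", "May"])

# A simple z-score flags values far from the mean (here, the March drop).
z = (sales - sales.mean()) / sales.std()
print(sales[z.abs() > 1.5])   # the analyst would then investigate why March fell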

(c) Predictive analytics:


Predictive analytics tries to predict what is likely to happen in the future. Predictive
analytics uses machine learning algorithms (Linear Regression, Logistic Regression,
Decision Trees, Neural Networks etc.,) to process historical data to anticipate future events
or outcomes. A simple use case is extracting patterns and relationships from large datasets to
identify trends, patterns, and probabilities. Businesses apply such predictive analysis to make
data-driven decisions and take proactive measures to improve outcomes.
The operation of predictive analytics is based on mathematical models, historical data,
and current data. Both businesses and customers can benefit from predictive analytics.
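
As a minimal sketch (assuming scikit-learn and NumPy; the historical figures are made up), a regression model can be fitted on past observations and asked about a future period:

# Predictive analytics sketch: fit a model on historical data, predict ahead.
import numpy as np
from sklearn.linear_model import LinearRegression

X = np.array([[1], [2], [3], [4], [5], [6]])   # month number
y = np.array([100, 110, 118, 131, 140, 152])   # revenue observed each month

model = LinearRegression().fit(X, y)
print(model.predict([[7]]))    # anticipated revenue for month 7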

Applications of Predictive Analytics:

(d) Prescriptive analytics:

 Prescriptive analytics tries to answer “What do we need to do next to achieve the desired outcome?”.


 It uses machine learning to help businesses decide a course of action based on a
computer program’s predictions.
 By considering all relevant factors, this type of analysis yields recommendations for
next steps.
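
Prescriptive logic is typically layered on top of a prediction. A minimal, hypothetical sketch in Python (the thresholds and actions are invented business rules, not a standard API):

# Prescriptive analytics sketch: turn a predicted outcome into a recommended action.
def recommend_action(predicted_churn_probability):
    if predicted_churn_probability > 0.7:
        return "Offer a retention discount"
    if predicted_churn_probability > 0.4:
        return "Send a re-engagement email"
    return "No action needed"

print(recommend_action(0.85))   # -> Offer a retention discount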

Example of 4 types:

Applications of Analytics:

1. Energy
The energy sector's applications of data analytics focus on consumption analysis and
grid optimization. In an era of rising energy demands, efficient distribution and consumption
become paramount.
Through these analytics applications, energy distribution can be optimized, and
consumption patterns predicted.
2. Finance & Banking
In finance and banking, the applications of data analytics are primarily directed
towards fraud detection and risk management. Every transaction provides data that, when
analyzed, can reveal anomalies.
This usage of data analytics reduces fraudulent activities and helps manage risks
linked to loans and investments.
3. Government & Public Sector
Governments utilize the applications of data analytics in policy formation and
resource distribution. The vast administrative data provides insights into public needs and
requirements.
These analytics applications allow for policies that are more aligned with public
needs, ensuring resources are allocated wisely and public services improve.
4. Health Care
In the health care domain, data analytics applications play a pivotal role in diagnosis
and treatment optimization. Massive volumes of patient data are now analyzed to detect
patterns and correlations.
These analytics applications guide health care professionals in making decisions that
lead to enhanced patient outcomes and substantial reductions in medical expenses.
5. Manufacturing
Manufacturing industries utilize data analytics applications for quality control and
process efficiency. With complex machinery and operations, every stage provides vital data.
Predictive analytics helps in pre-empting manufacturing defects and refining
production workflows, leading to reduced waste and superior products.
6. Marketing & Advertising
Marketing professionals use analytics applications for precise customer segmentation and
to measure the effectiveness of their campaigns.
With the understanding gained from these data analytics applications, businesses can target
audiences more effectively and assess their marketing ROI.
7. Real Estate
The real estate sector's applications of data analytics involve property valuation and
tracking market trends. The fluctuating property market generates vast amounts of data.
Real estate professionals can more accurately price properties and anticipate market
movements.
8. Retail & E-Commerce
The retail and e-commerce sector taps into analytics applications to gain customer
insights and manage inventory.
With data analytics applications, retailers can discern customer preferences, hone
pricing strategies, personalize online shopping, and maintain optimal stock levels, translating to boosted
sales and cost savings.
9. Insurance
In the insurance sector, data analytics applications are crucial for risk assessment and
claim processing. With countless policyholders and claims, insurers rely on analytics to make
accurate predictions and decisions.
These analytics applications allow insurers to set premiums more accurately based on
risk, as well as expedite claim processes, which enhances customer satisfaction and
operational efficiency.
10. Transport & Logistics
In transport and logistics, the applications of data analytics involve route optimization
and demand prediction. The constant movement of goods provides a continuous stream of
data to be processed.

2) INTRODUCTION TO TOOLS AND ENVIRONMENT:

Tools and environments are designed to enhance productivity. Tools are programs or
applications that assist in development, testing and maintenance of software. The
environment is a collection of tools that allow users to perform specific tasks.
Environments used: RStudio, Microsoft Visual Studio, Eclipse, Jupyter
Tools Used:

1. R Programming

 R is one of the industry’s leading analytical tools, commonly used in data modeling and statistics. It is a free language and software environment for statistical computing and graphics.

 R allows data to be manipulated and presented in a variety of ways.

 R compiles and operates on many platforms, including macOS, Windows, and Linux.

 It offers the option to browse its 11,556 packages by category.

 R also offers tools to install all the packages automatically, which can be assembled to handle large amounts of information according to the user’s needs.

2. Tableau

Tableau is a powerful data analytics and visualization tool to explore and analyze data from
various sources.

 It offers drag-and-drop functionality to create interactive dashboards and reports, making it easy for non-technical users to work with data.

 Tableau supports many data sources and file formats, including spreadsheets, databases, cloud-based data, and big data platforms.

 It provides a range of advanced analytics features such as predictive modeling, forecasting, and statistical analysis to gain insights into data.

 Tableau also supports collaboration and sharing of data and reports among teams, making it an ideal tool for enterprise-level data analysis.

3. Python

Python is an object-oriented, user-friendly, open-source language developed by Guido van
Rossum in the late 1980s.

 It is a popular language for data analytics due to its simplicity, versatility, and extensive libraries such as NumPy, Pandas, and Scikit-learn.

 It offers various data visualization libraries like Matplotlib and Seaborn, which help create interactive charts and graphs to analyze data better.

 With Python’s capabilities in machine learning, it is widely used in predictive analytics, natural language processing, and image processing.

 It is an open-source language with a vast community offering support, documentation, and tutorials, making learning and implementing data analytics easy.
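
A minimal sketch of the workflow these libraries enable (pandas and Matplotlib assumed installed; the figures are invented):

# Summarize a small dataset with pandas, then chart it with Matplotlib.
import pandas as pd
import matplotlib.pyplot as plt

df = pd.DataFrame({"year": [2021, 2022, 2023], "sales": [250, 310, 295]})
print(df.describe())                 # quick numerical summary

df.plot(x="year", y="sales", kind="bar", legend=False)
plt.ylabel("sales")
plt.savefig("sales.png")             # or plt.show() in an interactive session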

4. Excel

Excel is a Microsoft software program that is part of the Microsoft Office productivity suite.

 It is a core and standard analytical tool used in almost every industry.

 Excel is essential when analysis of a customer’s internal information is required.

 It eases the complicated job of summarizing information, using pivot-table previews to filter data according to customer requirements.

 Excel has an advanced business analytics option that assists with modeling, with pre-created options such as automatic relationship detection, DAX measures, and time grouping.

 Typical uses of Excel involve cell calculations, pivot tables, and a variety of charting tools.

 For example, one can create a monthly budget in Excel, track business expenses, or sort and organize large amounts of data with an Excel table.
5. Power BI

 Power BI is a powerful and user-friendly business intelligence tool that allows you to
create interactive data visualizations and reports quickly and easily.

 It can connect to various data sources, including Excel spreadsheets, databases, and
cloud-based services like Azure and Salesforce.

 With Power BI, you can create custom dashboards, charts, and graphs that make it
easy to visualize and understand complex data sets.

 Power BI includes advanced analytics features like forecasting, clustering, and trend
analysis, which can help you gain deeper insights into your data.

 It also allows you to share your reports and dashboards with others, collaborate in real
time, and access your data from anywhere using the cloud-based Power BI service.

6. SAS

SAS stands for Statistical Analysis System and is a programming environment and language
for data management and analysis.

 SAS was developed in 1966 and further developed during the 1980s and 1990s.

 SAS is easy to manage, and information from all sources can be analyzed.

 In 2011, SAS launched a wide range of customer intelligence goods and many SAS
modules, commonly applied to client profiling and future opportunities for web,
social media, and marketing analytics.

 It can also be used to predict, manage, and optimize customer behavior.

 SAS uses in-memory and distributed processing to analyze enormous databases quickly.

 Moreover, SAS helps to model predictive information.

7. Apache Spark

Created in 2009 at the AMPLab of the University of California, Berkeley.

 Apache Spark is a large-scale data processing engine that runs applications up to 100 times quicker in memory and ten times faster on disk than Hadoop clusters.

 Spark is based on data science, and its design facilitates data science workflows.

 Apache Spark is also famous for developing data pipelines and machine learning models.

 Spark also has a library, MLlib, that supplies several machine learning tools for recurring methods in data science, such as regression, classification, clustering, collaborative filtering, etc.

 The Apache Software Foundation launched Spark to speed up the Hadoop software computing process.

8. RapidMiner

RapidMiner is an open-source data analytics tool for data preparation, machine learning, and
predictive modeling.

 It offers an intuitive graphical user interface, supports over 40 machine learning algorithms, and provides various data visualization options.

 RapidMiner supports various data types, including structured, semi-structured, and unstructured data, making it a versatile tool for multiple data analytics projects.

 The platform offers a drag-and-drop interface that enables users to quickly design and execute data processing, modeling, and visualization workflows without requiring extensive programming knowledge.

 RapidMiner also provides a community-driven marketplace where users can access hundreds of pre-built machine learning algorithms, data connectors, and extensions, and share their custom-built components with others in the community.
9. KNIME

A team of software engineers from the University of Konstanz developed KNIME in January 2004.

 It is an open-source workflow platform for information processing, building, and execution.

 KNIME utilizes nodes to build graphs that map information flow from input to output.

 With its modular pipeline idea, KNIME is a powerful, leading open-source reporting and built-in analytical tool to evaluate and model information through visual programming, integrating different data mining elements and machine learning.

 Every node carries out a single workflow job.

10. QlikView

QlikView has many distinctive characteristics, such as patented technology and in-memory
processing, which can quickly deliver results to end customers and store the
information in the document itself.

 Data associations are automatically retained in QlikView, and data can be compressed to almost 10% of its initial volume.

 Color visualization of information connections – a particular color for associated information and another for non-related information.

 As a self-service BI tool, QlikView is usually easy to adopt without unique data analysis or programming abilities for most company customers.

 It is often used in marketing, staffing, and sales departments, as well as in management dashboards to monitor general company transactions at the highest management level.

 Most organizations train company users before providing software access, while no
unique abilities are needed.

11. Splunk

Launched in 2004, Splunk gradually went viral among businesses, which began to purchase
its enterprise licenses.

 Splunk is a software technology used to monitor, search, analyze, and visualize machine-generated information in real time.

 It can track and read various log files and save the information on indexers as events.

 Splunk can display information on different types of dashboards with these tools.

12. IBM SPSS Modeler

IBM SPSS Modeler is a powerful data mining and predictive analytics software that helps
businesses and organizations identify patterns and relationships within their data to gain
insights and make informed decisions.

 The tool provides a range of advanced statistical and analytical techniques, such as clustering, decision trees, and neural networks, to help users uncover hidden patterns and trends in their data.

 With a user-friendly interface and drag-and-drop functionality, IBM SPSS Modeler allows users to quickly and easily build and deploy predictive models without a deep understanding of programming or data science.

 It contains a variety of sophisticated analytical techniques and algorithms.

 It is most potent when used to uncover strong patterns in continuing business processes and then capitalize on them by deploying business models to predict choices better and achieve optimum results.

13. Cloudera Impala

 Impala was introduced by Cloudera in October 2012 as a beta version, and its stable release was out in May 2013. Initially, Impala only supported data stored in HDFS (Hadoop Distributed File System) and HBase, but it later extended its support to data stored in Amazon S3 and other cloud storage systems.

 Cloudera Impala is an open-source, interactive, distributed SQL query engine that enables users to conduct real-time queries.

 Impala supports a broad spectrum of SQL syntax, including joins, nested queries, aggregate functions, and user-defined functions.

 The key features of Cloudera Impala include: scalability and flexibility, real-time interactive analysis of data, and integration with cloud storage systems.

 Impala follows a massively parallel processing architecture.

 Key advantages include faster query execution, efficient resource utilization, and cost efficiency. Impala supports various data formats and integrates with multiple data sources.

3) APPLICATION OF MODELING IN BUSINESS:

Business modeling can be described as creating a map of what happens within a business, detailing how tasks are carried out, by whom, and in what order. By modeling their processes, companies can see the full picture of how they operate, with transparency into even the most complex of processes.

For example, a business model determines what products make sense for a company to sell,
how it wants to promote its products, what type of people it should try to cater to, and what
revenue streams it may expect.

BUSINESS MODEL OF HOSPITAL

Applications:
 Improved data quality - It enables the concerned stakeholders to make data-driven
decisions.
 Clear representation - It makes it easier to analyse the data properly and get a quick overview of
the data, which can then be used by developers in varied applications.
 Gap analysis – Identifying business needs by comparing the current performance level
with the desired performance level
 SWOT (Strengths, Weaknesses, Opportunities and Threats) Analysis
 Faster performance
 Better documentation
 Reduced Cost

4) DATABASES & TYPES OF DATA AND VARIABLES:

DATABASE:

A database is a collection of data that is organized and stored electronically in a computer system. Databases can store any type of data, including words, numbers, images, videos, and files. They are designed to hold large amounts of information and allow multiple users to access and query the data.

Databases are used in various domains, including business, scientific, and government
organizations.
A DBMS is software that stores, retrieves, and edits data in a database. It acts as an
interface between the database and its users or programs. Some examples of DBMSs include
MySQL, Microsoft Access, and Microsoft SQL Server.

Some real-life examples of databases include eCommerce platforms, healthcare
systems, social media platforms, online banking systems, hotel booking systems, airline
reservation systems, HRMS, email services, ride-hailing applications, and online learning
platforms. A database could store a grocery store's inventory, or a pet store's customers,
products, and orders.

Types of databases:
There are many types of databases, including relational databases, object-oriented
databases, and NoSQL databases.
A relational database is a type of database that organizes data in tables, with each
table representing a relation, and each row and column representing a specific entity and its
attributes. The tables are linked together by a unique identifier or key in each row.
That is, relational databases store data in a structured format with rows and columns.
Example: MySQL, Oracle, PostgreSQL, Microsoft SQL Server

An object-oriented database (OODB) is a database that stores data as objects and
classes, similar to the way object-oriented programming (OOP) languages manage data.
OODBs are a powerful data management system that is especially useful for complex data
relationships, multimedia applications, and advanced database systems.
Example: ObjectDB, IBM Db2

NoSQL stands for "not only SQL" and refers to a type of database that stores data in a
non-tabular format, unlike relational databases. NoSQL databases are also known as
non-relational databases.
Example: MongoDB, Cassandra, Redis, Couchbase
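
A minimal sketch of the relational idea using Python's built-in sqlite3 module (the table and rows are hypothetical): data lives in a table with rows and columns, a key identifies each row, and SQL retrieves it.

# Relational database sketch using Python's built-in sqlite3 module.
import sqlite3

conn = sqlite3.connect(":memory:")   # throwaway in-memory database
conn.execute("CREATE TABLE student (id INTEGER PRIMARY KEY, name TEXT, age INTEGER)")
conn.execute("INSERT INTO student VALUES (1, 'Asha', 20), (2, 'Ravi', 22)")

# Rows are retrieved with SQL, the standard relational query language.
for row in conn.execute("SELECT name, age FROM student WHERE age > 21"):
    print(row)                       # ('Ravi', 22)
conn.close()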

TYPES OF DATA AND VARIABLES:

Qualitative or Categorical Data:
Qualitative data, also known as categorical data, describes data that fits into
categories. Qualitative data are not numerical. Categorical information involves
categorical variables that describe features such as a person's gender, home town, etc.

Sometimes categorical data can hold numerical values (quantitative values), but those
values do not have a mathematical sense. Examples of categorical data are birthdate,
favourite sport, and school pin code. Here, the birthdate and school pin code hold quantitative
values, but they do not give numerical meaning.

The main difference between nominal and ordinal data is that nominal data has no
intrinsic order, while ordinal data has a natural order or ranking.

If the question is "How old are you?" it's a nominal variable.

If the question is "What age range are you in?" it's an ordinal variable.

Nominal Data:

Nominal data is one of the types of qualitative information which helps to label the variables
without providing the numerical value. Nominal data is also called the nominal scale. It
cannot be ordered and measured. The nominal data are examined using the grouping
method. In this method, the data are grouped into categories.

For example,

Male/female, Single/married, Employed/unemployed, Comedy/drama, Residential/park, Car/truck/motorcycle

Nominal data can be expressed in words or numbers, but the labels cannot be ordered in a
meaningful way or used for arithmetic operations.

Ordinal Data:
Ordinal data/variable is a type of data that follows a natural order. A significant feature of
ordinal data is that the differences between the data values are not determined.

Examples of ordinal data are survey responses, level of education, income ranges (low income, middle income, high income), grades (poor, fair, good, excellent) and frequency of zoo visits (never, rarely, occasionally, regularly, daily).
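
The distinction can be made concrete with pandas categoricals (values invented): an ordered categorical supports ranking comparisons, while a nominal one carries labels only.

# Ordinal vs. nominal sketch with pandas categoricals.
import pandas as pd

grades = pd.Categorical(["Good", "Poor", "Excellent", "Fair"],
                        categories=["Poor", "Fair", "Good", "Excellent"],
                        ordered=True)        # ordinal: has a natural ranking
print(grades.min(), grades.max())            # Poor Excellent

colors = pd.Categorical(["red", "blue"])     # nominal: labels with no order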

Quantitative or Numerical Data:

Quantitative data is also known as numerical data which represents the numerical value (i.e.,
how much, how often, how many). Numerical data gives information about the quantities of a
specific thing. Some examples of numerical data are height, length, size, weight, and so on.
The quantitative data can be classified into two different types based on the data sets. The
two different classifications of numerical data are discrete data and continuous data. Discrete
data is countable. Continuous data is measurable.

Interval data:
Interval data is a type of quantitative data that has a consistent order and a consistent
difference between values, but lacks a true zero point.

For example, Temperature in Celsius or Fahrenheit, IQ scores.

Interval data is measured along a numerical scale that has equal intervals between adjacent
values.

Ratio Data:
Ratio data is a level of measurement that has equal intervals and a true zero point, allowing
for meaningful operations such as multiplication and division.

For example, Age, height, weight.

Ratio data is measured along a numerical scale that has equal distances between adjacent
values, and a true zero.

5) DATA MODELING TECHNIQUES, MISSING IMPUTATIONS ETC.

Data modeling is the process of creating a conceptual model of data and its relationships to
help organize and manage data. Data modeling is, in essence, the process of diagramming data flows.
When creating a new or alternate database structure, the designer starts with a diagram of
how data will flow into and out of the database.
 Data modeling can involve cleansing data, defining measures and dimensions, and
adding formulas.
 Data modeling can help to identify and resolve potential issues before they occur.
 A well-designed data model can help to create a logical and efficient database
structure.
 A clear and consistent data model can help ensure everyone is on the same page
regarding data standards and definitions.
 A well-designed data model can make it easier to add new features or data sources in
the future.

Data Modeling Techniques:

1. ER (Entity-Relationship) Model

The Entity-Relationship Model is a model for identifying the entities to be represented in the
database and representing how those entities are related. It represents the overall logical
structure of a database graphically. ER models are used to model real-world objects like a
person, a car, or a company, and the relations between these real-world objects.

An Entity may be an object with a physical existence – a particular person like a
student, a car, a house, or an employee – or it may be an object with a conceptual existence – a
company, a job, or a university course. It is represented by a rectangle.
Attributes are the properties that define the entity type. For example, Student ID,
Name, DOB, Age, Address, and Mobile_No are the attributes that define the entity type Student.
In an ER diagram, an attribute is represented by an oval.
A Relationship Type represents the association between entity types. For example,
'Study' is a relationship type that exists between the entity types Student and Course. In an ER
diagram, the relationship type is represented by a diamond.

2. Hierarchical Model

The hierarchical model was the first database management system model. It can
accurately represent many real-world relationships, such as website sitemaps and food
recipes. It supports one-to-many relationships. The hierarchical model uses a tree-like
structure to organize data.
The hierarchical database model mandates that each child record has only one parent,
whereas each parent record can have one or more child records. In order to retrieve data from
a hierarchical database, the whole tree needs to be traversed starting from the root node.
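
A minimal sketch of this structure in plain Python (the parent/child records are hypothetical): each child has exactly one parent, and retrieval walks the tree from the root.

# Hierarchical model sketch: one parent per child; traversal starts at the root.
tree = {
    "College": {
        "CSE Department": {"Student A": {}, "Student B": {}},
        "Library": {"Journal Section": {}},
    }
}

def traverse(node, depth=0):
    for name, children in node.items():
        print("  " * depth + name)    # visit each parent before its children
        traverse(children, depth + 1)

traverse(tree)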

3. Network Model:

The Network Model in a Database Management System (DBMS) is a data model that
allows the representation of many-to-many relationships in a more flexible and complex
structure compared to the Hierarchical Model. The network model offers a flexible way to
represent complex data relationships through its graph-based structure.
The Network Model in DBMS is one type of the hierarchical model that is used to
represent the many-to-many relationships among the database constraints. It is represented in
the form of a graph, hence it is a simple and easy-to-construct database model. The network
model in DBMS allows 1 : 1 (one-to-one), 1 : M (one-to-many), M : 1 (many-to-one) and
M : N (many-to-many) relationships among the entities or members.

In the above example, we can observe that the node Student has two parents, CSE Department
and Library. That is, a many-to-one relationship.

In the above example, we can observe both 1 : M (one-to-many) and M : 1 (many-to-one) relationships.
A many-to-many relationship exists when one or more items in one table can have a
relationship to one or more items in another table.
For example:
A student can take multiple courses, and a course can have multiple students.
A customer can place multiple orders, and an order can be placed by multiple customers.
A writer can write multiple books, and a book can be written by multiple writers.
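
A minimal sketch of a many-to-many relationship in plain Python (students and courses invented): each student maps to several courses, and inverting the mapping shows each course with several students.

# Many-to-many sketch: students <-> courses, one-to-many in both directions.
enrollments = {
    "Asha": ["DBMS", "Data Analytics"],
    "Ravi": ["DBMS", "Networks"],
}

by_course = {}                        # invert: each course has many students
for student, courses in enrollments.items():
    for course in courses:
        by_course.setdefault(course, []).append(student)

print(by_course)                      # {'DBMS': ['Asha', 'Ravi'], ...}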

4. Relational Model:
A relational database is defined as a group of independent tables which are linked to
each other using some common fields of each related table. This model can be represented as
a table with columns and rows. Each row is known as a tuple. Each column of the table has a
name, or attribute. The relational model is well known in database technology because it is usually used to
represent real-world objects and the relationships between them. Some popular relational
databases used nowadays are Oracle, Sybase, DB2, and MySQL Server.
5. Object-Oriented Database Model:
In the Object-Oriented Data Model, data and their relationships are contained in a single structure
which is referred to as an object. In this model, real-world problems are represented as
objects with different attributes, and objects have multiple relationships between them.

Objects:
An object is an abstraction of a real-world entity, or we can say it is an instance of a class.
Objects encapsulate data and code into a single unit, which provides data abstraction by
hiding the implementation details from the user.
For example: Instances of student, doctor, engineer in above figure.
Attribute:
An attribute describes the properties of an object.
For example: the object STUDENT has attributes such as Roll no and Branch in the
Student class.
Methods:
A method represents the behavior of an object. Basically, it represents a real-world action.
For example: finding a STUDENT's marks, as Setmarks() in the figure above.
Class:
A class is a collection of similar objects with shared structure i.e. attributes and behavior i.e.
methods. An object is an instance of class.
For example: Person, Student, Doctor, Engineer.
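
These ideas can be sketched in Python, reusing the names from the example above (the implementation details are hypothetical):

# Object-oriented sketch: a class bundles attributes and methods; each object
# is an instance of the class.
class Student:
    def __init__(self, roll_no, branch):
        self.roll_no = roll_no        # attributes describe the object
        self.branch = branch
        self.marks = None

    def setmarks(self, marks):        # a method models a real-world action
        self.marks = marks

s = Student(roll_no=101, branch="CSE")    # s is an object (instance of Student)
s.setmarks(87)
print(s.roll_no, s.branch, s.marks)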

Now let us consider another example, given below.

 Here Transport, Bus, Ship, and Plane are objects.


 Bus has Road Transport as the attribute.
 Ship has Water Transport as the attribute.
 Plane has Air Transport as the attribute.
 The Transport object is the base object and the Bus, Ship, and Plane objects derive
from it.

6. Object-Relational Model:
An object-relational model is a combination of an object-oriented database model
and a relational database model. So, it supports objects, classes, inheritance, etc., just like
object-oriented models, and has support for data types, tabular structures, etc., like the relational
data model.

Data Modeling Tools:

1. ER/Studio
ER/Studio is Idera's powerful data modeling tool, enabling efficient classification of
current data assets and sources across platforms. You can also create and share data models
and track data provenance end-to-end. With ER/Studio, organizations can quickly understand
the interactions between data, processes and people. Key Features of ER/Studio:

A. It accommodates both logical and physical design.


B. Ensure model and database consistency.
C. Scriptable and automated: run an impact analysis of new fixes at the database level.
D. HTML, PNG, JPEG, RTF, XML, Schema, and DTD are supported display formats.
2. DbSchema
DbSchema extends functionality to the JDBC driver and provides a complete GUI for
sorting complex data. It provides a great user experience for SQL and NoSQL databases and, in general,
efficient reverse engineering. DbSchema serves database users, administrators,
and programmers, and also provides visualization of the relationships between
tables and other data models.
3. HeidiSQL
This free open source software is one of the most popular data modeling tools for
MariaDB and MySQL worldwide. It also supports MS SQL, PostgreSQL and SQLite
database systems.
4. Toad Data Modeler
Toad Data Modeler is an ideal solution for cross-platform support for databases such
as SAP, SQL Server, DB2, Microsoft Access, and more. Toad Data Modeler also provides
developers with flexibility and offers easy customization and migration. This tool is also
useful for building complex logical and physical objects (both forward and reverse
engineering).
5. ERBuilder
ERBuilder Data Modeler allows developers to create data models using interactive diagrams
and can generate SQL scripts for the most popular databases. It also provides a beautiful visual design
environment, allowing developers to communicate information easily.
6. Lucidchart
A cloud-based diagramming tool that's easy to use and has collaborative editing
features.
7. Erwin Data Modeler
A tool that offers a range of features and capabilities for data modeling, data
governance, and metadata management.
8. Archi
A free, open-source modelling tool for creating ArchiMate models and sketches, widely used
for enterprise architecture.
9. MySQL Workbench
A visual tool for MySQL that provides data modeling, SQL development, and database
administration features.
10. ConceptDraw
A desktop tool that can be used to create business process diagrams, generate
documentation and reports, and improve business processes.
11. Draw.io
An online tool that allows users to create data models, export diagrams to various
formats, and share them online.
MISSING IMPUTATION:

Missing data imputation is a statistical technique that replaces missing values in a dataset with estimated values. It is an important step in data analysis because incomplete data can lead to biased results and make it harder to process and analyze the data.
Data imputation is a method for retaining the majority of the dataset's data and
information by substituting missing data with a different value. These methods are employed
because it would be impractical to remove data from a dataset each time a value is missing. Additionally, doing
so would substantially reduce the dataset's size, raising questions about bias and impairing
analysis.

 Imputation methods: The method you choose depends on the type of data, the amount of
missing data, and how the data is missing. Some common methods include:
 Mean, median, or mode: A simple method that works well for missing
numerical data, but may not be accurate if the data is skewed
 K-nearest neighbors (KNN) imputation: A more advanced method that
considers the relationships between variables to estimate missing values
 Hot deck imputation: A method that randomly selects a value from the same
feature or column to replace missing values

Data Imputation Techniques:

 Next or Previous Value

 K Nearest Neighbors

 Maximum or Minimum Value

 Missing Value Prediction

 Most Frequent Value

 Average or Linear Interpolation

 (Rounded) Mean or Moving Average or Median Value

 Fixed Value

1. Next or Previous Value:
For time-series data or ordered data, there are specific imputation techniques. These
techniques take into consideration the dataset's sorted structure, wherein nearby values are
likely more comparable than far-off ones. The next or previous value inside the time series
is typically substituted for the missing value, as part of a common method for imputing
incomplete data in a time series. This strategy is effective for both nominal and numerical
values.
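
A minimal sketch with pandas (the readings are invented): forward fill carries the previous value ahead, backward fill uses the next one.

# Next/previous value imputation sketch for ordered (time-series) data.
import numpy as np
import pandas as pd

readings = pd.Series([20.1, np.nan, 20.4, np.nan, 21.0])
print(readings.ffill())   # previous value replaces each missing entry
print(readings.bfill())   # next value replaces each missing entry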
2. K Nearest Neighbors:
The objective is to find the k nearest examples in the data where the value in the
relevant feature is not absent and then substitute the value of the feature that occurs most
frequently in the group.
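
scikit-learn's KNNImputer implements a numeric variant of this idea (it averages the neighbors' values rather than taking the most frequent one); a minimal sketch with made-up numbers:

# KNN imputation sketch: a missing entry is estimated from the k nearest rows.
import numpy as np
from sklearn.impute import KNNImputer

X = np.array([[1.0, 2.0],
              [2.0, np.nan],   # missing value to be imputed
              [3.0, 6.0],
              [2.1, 4.1]])

print(KNNImputer(n_neighbors=2).fit_transform(X))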
3. Maximum or Minimum Value:
You can use the minimum or maximum of the range as the replacement value for
missing values if you are aware that the data must fit within a specific range [minimum,
maximum].
4. Missing Value Prediction:
Depending on the type of feature, we can employ a machine learning model, such as a
regression or classification model. The algorithm is used to forecast the most likely value of
each missing value in all samples.
A basic imputation approach, such as the mean value, is used to temporarily impute
all missing values when there is missing data in more than one feature field. Then, one column's
values are set back to missing. After training, the model is used to fill in the missing
variables. In this manner, a model is trained for every feature that has a missing value, until
the model can impute all of the missing values.
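
scikit-learn's experimental IterativeImputer follows essentially this procedure, modeling each feature with missing values as a function of the other features. A minimal sketch (values invented):

# Model-based imputation sketch: each feature is predicted from the others.
import numpy as np
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import IterativeImputer

X = np.array([[1.0, 10.0],
              [2.0, np.nan],   # will be predicted from the first column
              [3.0, 30.0],
              [4.0, 40.0]])

print(IterativeImputer(max_iter=10).fit_transform(X))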
5. Most Frequent Value:
The most frequent value in the column is used to replace the missing values.
6. Average or Linear Interpolation:
The average or linear interpolation, which calculates between the previous and next
accessible value and substitutes the missing value, is similar to the previous/next value
imputation but only applicable to numerical data.
7. Mean or Median Value:
Median, mean, or rounded mean are further popular imputation techniques for
numerical features. The technique replaces the null values with the mean, rounded mean, or
median value determined across the whole dataset. It is advised to utilize the median rather
than the mean when the dataset has a significant number of outliers.
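
A minimal sketch with scikit-learn's SimpleImputer (values invented; the median is chosen because of the outlier):

# Mean/median imputation sketch: one statistic per column fills the gaps.
import numpy as np
from sklearn.impute import SimpleImputer

X = np.array([[5.0], [6.0], [np.nan], [7.0], [500.0]])   # 500 is an outlier

imputer = SimpleImputer(strategy="median")   # the median resists the outlier
print(imputer.fit_transform(X))              # missing value becomes 6.5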
8. Fixed Value:
Fixed value imputation is a universal technique that replaces the null data with a
fixed value and is applicable to all data types. You can impute the null values in a survey
using "not answered" as an example of using fixed imputation on nominal features and zero
on numerical features.

6) NEED FOR BUSINESS MODELING:

 Business modelling is an effective tool that has opened new doors for companies to
make informed decisions and plan for the future.
 Business modeling is essential for accurate and actionable business process analysis.
Without a clear picture of business processes, businesses can't confidently analyze and
make decisions on their processes and operations.
 The goal of data modeling is to produce high quality, consistent, structured data for
running business applications and achieving consistent results
 Business process modeling enables business analysts and business systems analysts to
coherently map and thoroughly analyze business requirements and identify and
specify the business systems requirements to support the business requirements.
 Business modeling identifies the products or services the business plans to sell, its
identified target market, and any anticipated expenses.
 Business Modeling is important because it helps streamline and optimize operations,
improve communications and collaboration, and facilitate continuous improvement
initiatives to help a company maintain an edge in the marketplace.
 Business Modeling is essential for entrepreneurs as it defines their value proposition,
revenue streams, target market, competitive advantage, and growth potential. It serves
as a strategic roadmap, guiding decision-making, resource allocation, and risk
management.
 Business modeling helps it identify its target customers and understand their needs
and preferences.
 Business modeling allows organizations to spell out specific details and requirements
for both the overall network of connected databases as well as the design of individual
databases. With a clear visual overview, it's much easier to identify any gaps or
opportunities before the blueprints go into development.
 Business modeling aids in decision-making and process optimization. It also enables
organizations to identify potential bottlenecks and inefficiencies, paving the way for
continuous improvement.
 It helps companies attract investment, recruit talent, and motivate management and
staff.
 Business modeling helps to establish standard data definitions and internal data
standards, often in connection with governance.
 Business models are a vital data architecture component, along with flow diagrams,
architectural blueprints, a unified vocabulary, and other artifacts.
 Business modeling makes it easier for developers, data architects, business analysts,
and other stakeholders to see and understand the relationships between data in a
database or data warehouse.
 It simplifies and accelerates the database design process at the conceptual, logical and
physical level.
 The goal of data modeling is to illustrate the types of data used and stored within the
system, the relationships among these data types, the ways the data can be grouped
and organized, and its formats and attributes.
 Business modeling provides the learner with a picture of the successful outcome as
well as the process leading to success.
 Business modeling helps create a simplified, logical database that eliminates
redundancy, reduces storage requirements, and enables efficient retrieval.
 Business modeling defines how a company will make money, describes the target
audience, value proposition, sales channels, and customer relationships.
 Business modeling is used to outline and plan a process or project. By using this tool,
you can help your employer meet their needs and implement effective problem-solving
solutions within their organisation.

