
UNIT 2

CONTENTS: Introduction to Data Analytics, Introduction to tools and environment, application of modeling in business, Data modeling techniques, Need for business modeling, Database and types of data variables, Missing imputations.

1. Introduction to Data Analytics


Definition: Data analytics is the science of extracting meaningful, valuable information from raw data.

Definition: Data analytics is the process of examining data sets to draw conclusions, predict trends and inform decision-making through various analytical techniques.

The goal of data analytics is to derive actionable insights from raw data, resulting in better decisions.

Data analysis is a key part of data analytics.

It involves scrutinizing existing data to gain insights and draw conclusions.

The data analytics lifecycle is a process that consists of six basic stages/phases:

1. Data discovery 2. Data preparation 3. Model planning

4. Model building 5. Communicate results 6. Operationalize

• The data analytics lifecycle provides a framework for performing each phase well, from the creation of the project until its completion.
Phase 1 - Data Discovery:

• Data discovery is the 1st phase, in which the project's objectives are set and ways to achieve a complete data analytics lifecycle are identified.

• The data discovery phase involves defining the purpose of the data and how that purpose will be achieved by the end of the data analytics lifecycle.

• The data discovery phase consists of identifying the critical objectives a business is trying to achieve by mapping out the data.

Phase 2 - Data Preparation:

• The data preparation phase of the data analytics lifecycle includes the steps to explore, preprocess and condition data prior to modeling and analysis.

• Data are loaded into the sandbox in three ways, namely ETL (Extract, Transform and Load), ELT (Extract, Load and Transform) and ETLT.
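
To make the ETL path concrete, here is a minimal Python sketch; the source file sales.csv, its amount column, and the sandbox.db SQLite file are all hypothetical:

```python
# Minimal ETL sketch: extract a CSV, transform it, load it into a
# SQLite "sandbox". File, column, and table names are hypothetical.
import sqlite3
import pandas as pd

# Extract: read raw data from a source file
df = pd.read_csv("sales.csv")

# Transform: clean and condition the data before analysis
df.columns = [c.strip().lower() for c in df.columns]
df = df.dropna(subset=["amount"])          # drop rows missing the key measure
df["amount"] = df["amount"].astype(float)  # enforce a numeric type

# Load: write the conditioned data into the analytics sandbox
with sqlite3.connect("sandbox.db") as conn:
    df.to_sql("sales", conn, if_exists="replace", index=False)
```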

Phase 3 - Model Planning:

• The 3rd phase of the lifecycle is model planning, where the data analytics team members plan the methods to be adopted and the workflows to be followed during the next phase of model building.

• Model planning is the phase in which the data analytics team members analyze the quality of data and find a suitable model for the project.

Phase 4 - Model Building:

• In this phase the team works on developing datasets for training and testing
as well as for production purposes.

• This phase is based on the planning made in the previous phase; the
execution of the model is carried out by the team.

• Model building is the process in which the team deploys the planned model in a real-time environment.

• The environment needed for the execution of the model is decided and
prepared so that if a more robust environment is required, it is accordingly
applied.
Phase 5 - Communicate Results:

• The 5th phase of the life cycle of data analytics checks the results of the
project to find whether it is a success or failure.

• The result is scrutinized by the entire team to draw inferences on the key
findings and summarize the entire work done.

Phase 6 - Operationalize:

• In the 6th phase, the team delivers the final reports, along with the briefings, source code and related technical documents.

• The operationalize phase also involves running the pilot project to implement the model and test it in a real-time environment.

• As soon as the team prepares the detailed report including the key findings, documents, and briefings, the data analytics life cycle comes to an end.

Types of Data Analytics


There are four major types of data analytics:
1. Predictive (forecasting)
2. Descriptive (business intelligence and data mining)
3. Prescriptive (optimization and simulation)
4. Diagnostic analytics
Predictive analytics:
Predictive analytics turns data into valuable, actionable information. It employs a variety of statistical techniques that analyze current and historical facts to make predictions about future events.
Techniques that are used for predictive analytics are:
 Linear Regression
 Time Series Analysis and Forecasting
 Data Mining
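
As a minimal sketch of predictive analytics, the following fits a linear regression with scikit-learn to made-up monthly sales figures and forecasts the next period (all numbers are illustrative):

```python
# Linear-regression sketch: fit a trend to made-up monthly sales
# and predict the next period.
import numpy as np
from sklearn.linear_model import LinearRegression

months = np.arange(1, 13).reshape(-1, 1)          # periods 1..12
sales = np.array([210, 215, 230, 228, 245, 250,
                  262, 270, 268, 285, 290, 301])  # illustrative values

model = LinearRegression().fit(months, sales)
print("Forecast for month 13:", model.predict([[13]])[0])
```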

Descriptive Analytics
Descriptive analytics looks at past performance and understands the
performance by mining historical data to understand the cause of success
or failure in the past. Almost all management reporting such as sales,
marketing, operations, and finance uses this type of analysis.
Prescriptive Analytics
Prescriptive analytics goes beyond predicting future outcomes by also suggesting actions to benefit from the predictions. Prescriptive analytics not only anticipates what will happen and when it will happen but also why it will happen. Further, prescriptive analytics can suggest decision options on how to take advantage of a future opportunity or mitigate a future risk.
Diagnostic Analytics
Diagnostic analytics generally uses historical data to answer the question of why something happened. It looks for dependencies and patterns in the historical data of the particular problem.
Common techniques used for Diagnostic Analytics are:
 Data discovery
 Data mining
 Correlations
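
As a minimal diagnostic sketch, the following computes a correlation matrix with Pandas over made-up historical data to look for dependencies between variables:

```python
# Diagnostic sketch: probe dependencies in historical data via
# correlations. All columns and values are made up.
import pandas as pd

history = pd.DataFrame({
    "ad_spend":  [10, 12, 9, 15, 14, 18],
    "discounts": [5, 4, 6, 3, 4, 2],
    "sales":     [200, 220, 190, 260, 250, 300],
})

# Pairwise correlation matrix: which factors move with sales?
print(history.corr())
```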

2. Introduction to tools and environment


Many tools are available to facilitate data analysis, ranging from simple
spreadsheets to advanced software with powerful machine-learning
capabilities. Some popular tools include:

1. Power BI
 Power BI provides interactive visualizations and business intelligence
capabilities with a user-friendly interface for creating reports and
dashboards.
 It's a data simplification powerhouse, connecting to numerous data
sources, and delivering visually engaging reports that translate into
meaningful business insights.
 Available as software as a service (SaaS), a desktop application, and a mobile app, Power BI offers a comprehensive view of business data, making team collaboration easy.

2. Excel
 It is a versatile tool for both professionals and beginners.
 Excel can handle large datasets, perform complex calculations, and
automate tasks using macros.
 It features tables for dynamic data summarization and a variety of charting tools for data visualization. Excel supports functions for statistical analysis, financial modeling, and data cleaning.

3. Tableau

 Tableau is a powerful data visualization tool that specializes in creating interactive and shareable dashboards. It converts raw data into an understandable format using advanced visualizations, making it easier to extract insights.
 Tableau supports drag-and-drop functionalities, real-time data
analysis, and integration with various data sources, including
databases, spreadsheets, and cloud services.
 Its analytical capabilities are enhanced by advanced chart types,
filtering, and dashboard actions, while its user-friendly interface
allows for quick learning and application.

4. SAS

 SAS (Statistical Analysis System) is a comprehensive software suite designed for advanced analytics, business intelligence, data management, and predictive analytics. It is known for its robust analytical power and extensive capabilities.
 SAS provides advanced data manipulation, extensive statistical analysis,
predictive modeling, and machine learning capabilities.
 It supports large-scale data analysis with highly customizable options,
allowing for detailed data preparation, reporting, and visualization. SAS
also offers specialized solutions for various industries, including
healthcare, finance, and retail.
5. R programming
 R is an open-source programming language and software
environment tailored for statistical computing and graphics.
 It is widely used for data analysis, modeling, and visualization,
particularly in academic and research settings.
 R excels in data manipulation, calculation, and graphical display.
 It offers a vast library of statistical and graphical techniques, including
linear and nonlinear modeling, classical statistical tests, time-series
analysis, and machine learning.
6. Python

 Python is a versatile, high-level programming language renowned for its readability and extensive libraries.
 It is widely used in data science, machine learning, and data analysis,
offering a powerful and flexible environment for various tasks.
 Python thrives in automating data manipulation tasks, performing complex statistical analyses, and creating visualizations. Libraries such as Pandas, NumPy, SciPy, and Matplotlib enhance Python's data analysis capabilities, while frameworks like TensorFlow, scikit-learn, and Keras enable sophisticated machine learning models.
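
As a small illustration of this workflow, the following sketch loads a hypothetical customers.csv with Pandas, prints summary statistics, and draws a histogram with Matplotlib (the file and its age column are assumptions):

```python
# Quick-analysis sketch with Pandas and Matplotlib. The file
# customers.csv and its "age" column are hypothetical.
import pandas as pd
import matplotlib.pyplot as plt

df = pd.read_csv("customers.csv")

# Summary statistics for the numeric columns
print(df.describe())

# Simple visualization: distribution of one numeric column
df["age"].hist(bins=20)
plt.xlabel("age")
plt.ylabel("count")
plt.show()
```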

3. Data modeling techniques


The main objective of data modeling is to provide a precise and well-organized framework for data organization and representation, since it enables efficient analysis and decision-making. By building models, analysts can discover trends, understand the connections between various data items, and make sure that data is stored efficiently and accurately.

What is a Data Model?


Data models are visual representations of an enterprise's data elements and the connections between them. Models help define and arrange data in the context of key business processes, thereby facilitating the creation of successful information systems.
Data Modeling Process
The practice of conceptually representing data items and their connections to one
another is known as data modeling.
1. Identifying Data Sources: The first stage is to identify and investigate the different sources of data, both inside and outside the company. Determining the sources of data is essential since it guarantees a thorough framework for data modeling.
2. Defining Entities and Attributes: This stage is all about identifying the entities (items or ideas) and the characteristics that go along with them. Entities constitute the subject matter of the data, whereas attributes specify the particular qualities of each entity. The definition of entities and attributes is the foundation of data modeling.
3. Mapping Relationships: Relationships show the connections or associations
between various things. Relationship mapping entails locating and
characterizing these linkages, indicating the nature and cardinality of every
relationship. In order to capture the interdependencies within the data, it is
essential to understand relationships. It improves the correctness of the model
by capturing the relationships between various data pieces that exist in the real
world.
4. Choosing a Model Type: The right data model type is selected based on the project needs and data properties. Choosing between conceptual, logical, or physical models, or going with a particular model like relational or object-oriented, may be part of this decision. The degree of abstraction and detail in the representation is determined by the model type that is selected.

5. Implementing and Maintaining: The process of implementation converts a physical or logical data model into a database schema. This entails establishing constraints, generating tables, and adding database-specific information. Updating the model to account for shifting technological or commercial needs is called maintenance.

Types of Data Modeling


These are the 5 different types of data models:
1. Hierarchical Model: The structure of the hierarchical model resembles a tree. There is only one root (parent) node, and the remaining child nodes are arranged in a certain sequence. However, the hierarchical approach is no longer widely applied. Real-world hierarchical relationships may be modeled using this approach.
2. Relational Model: The relational model represents data as rows and columns in tables and captures the links between those tables. It is frequently utilized in database design and is strongly related to relational database management systems (RDBMS).
3. Object-Oriented Data Model: In this model, data is represented as objects, similar to those used in object-oriented programming; creating objects with stored values is the object-oriented method. In addition to allowing data abstraction, inheritance, and encapsulation, the object-oriented architecture facilitates communication.
4. Network Model: The network model provides a versatile approach to representing objects and the relationships among them. One of its features is a schema, which is a graph representation of the data. An item is stored within a node, and a relationship between items is represented as an edge. This generalizes the model to maintain multiple parent and child records.
5. ER Model: The entity-relationship model (ER model) is a high-level relational model used to specify the data pieces and the relationships between the entities in a system. This conceptual design gives an easier-to-understand view of the data. An entity-relationship diagram, which is made up of entities, attributes, and relationships, is used in this model to depict the whole database.
A relationship between entities is called an association. Mapping cardinalities describe how many entities can take part in an association: one to one, one to many, many to one, and many to many. A small sketch of these relational ideas appears below.
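
The sketch below uses Python's built-in sqlite3 module to define two entities (customer and orders) whose one-to-many association is captured with a foreign key; all table and column names are illustrative:

```python
# Relational-model sketch with Python's built-in sqlite3 module:
# two entities linked by a one-to-many association via a foreign key.
# Table and column names are illustrative.
import sqlite3

conn = sqlite3.connect(":memory:")  # throwaway in-memory database
conn.executescript("""
CREATE TABLE customer (
    id   INTEGER PRIMARY KEY,   -- primary key identifies each row
    name TEXT NOT NULL
);
CREATE TABLE orders (
    id          INTEGER PRIMARY KEY,
    customer_id INTEGER NOT NULL REFERENCES customer(id),  -- foreign key
    amount      REAL
);
""")
conn.execute("INSERT INTO customer VALUES (1, 'Asha')")
conn.execute("INSERT INTO orders VALUES (1, 1, 99.50)")

# Join across the foreign key to recover the association
for row in conn.execute(
        "SELECT c.name, o.amount FROM customer c "
        "JOIN orders o ON o.customer_id = c.id"):
    print(row)   # ('Asha', 99.5)
```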

4. Application of modeling in business

Data modeling plays a crucial role in business data analytics. It enables organizations to transform raw data into actionable insights, fostering better decision-making and business growth.
Data modeling is applied in business data analytics as follows:

1. Organizing and understanding data


 Data modeling structures data in a logical and organized manner, making it easier to understand and manage.

 It provides a clear and standardized way of illustrating data structures, allowing analysts and stakeholders to effectively communicate about data requirements.

 Through this organization, analysts can identify gaps and inconsistencies in existing data structures and business rules.

2. Enhancing data quality and integrity


 Data modeling improves data quality by identifying and correcting errors and inconsistencies.

 It ensures data integrity and prevents anomalies through the enforcement of constraints and relationships.

 By minimizing redundancy, data modeling reduces unnecessary data duplication.

3. Optimizing for analytics and reporting
 Data modeling provides a structured framework for data analysis and reporting, aiding in the generation of insights.

 It facilitates efficient data retrieval, improving system performance and reducing query and report generation time.

 This optimization supports faster query execution and scalable analytics infrastructure, particularly with large datasets.

4. Supporting data-driven decision-making

 Data modeling enables organizations to make decisions based on empirical evidence.

 It helps optimize strategies for better results and competitiveness by analyzing past performance and predicting future outcomes.

 It empowers employees to make data-driven decisions by ensuring data is available and interpretable.

5. Improving customer experience and operational efficiency

 Data modeling and analytics provide insights into customer preferences and behaviors through the analysis of customer data and feedback.

 This enables personalized offerings and tailored marketing, enhancing customer experience.

 Analyzing operational data through data modeling allows businesses to streamline processes and allocate resources efficiently, increasing productivity and reducing costs.

6. Facilitating database design and application development

 Data modeling is a crucial step in designing efficient database structures, acting as a blueprint.

 It guides application development by defining data requirements upfront, reducing errors.
5. Need for business modeling
Business modeling is essential in data analytics for providing a structured approach
to understanding, analyzing, and improving business operations and decision-
making. It helps organizations define their goals, identify key performance
indicators (KPIs), and understand how data can be used to achieve those goals. By
creating business models, companies can ensure alignment between their data
strategies and overall business objectives.

Business modeling is crucial in data analytics for the following reasons:

1. Defining Business Objectives and Requirements:


 Business modeling helps clarify what the organization wants to achieve and
identifies the specific data needed to support those goals.
 It ensures that data analytics efforts are aligned with business needs and priorities,
preventing wasted resources on irrelevant data analysis.
2. Optimizing Business Processes:
 By modeling business processes, organizations can identify areas for improvement,
such as inefficiencies or bottlenecks.
 Data analytics can then be used to analyze these processes, identify root causes of
problems, and develop data-driven solutions.
3. Enhancing Decision-Making:
 Business models provide a framework for understanding the relationships between
different aspects of the business.
 This understanding allows for more informed and effective decision-making,
leading to better business outcomes.
4. Improving Data Quality and Accessibility:
 Data modeling techniques, often part of business modeling, ensure data
consistency, accuracy, and accessibility.
 This is crucial for reliable data analysis and reporting, preventing errors and
inconsistencies that can lead to poor decisions.
5. Facilitating Communication and Collaboration:
 Business models provide a common language and framework for communication
between business and technical teams.
 This facilitates collaboration and ensures that everyone is working towards the
same goals with a shared understanding of the business.
6. Supporting Predictive Analytics and Forecasting:
 Business models can be used to build predictive models that help organizations
anticipate future trends and make proactive decisions.
 This can include forecasting sales, predicting customer behavior, or identifying
potential risks.
7. Driving Innovation and Competitive Advantage:
 By understanding their business and leveraging data effectively, organizations can
identify new opportunities for innovation and gain a competitive edge.
 Data analytics, guided by business modeling, can help organizations develop new
products, services, or business models.

6. Database and types of data variables


A database is a structured collection of information, or data, stored electronically in
a computer system. It is designed for efficient storage, retrieval, and manipulation
of data, often managed using a Database Management System (DBMS).

1. Hierarchical Databases
Hierarchical databases organize data in a tree-like structure, where each
parent record can have multiple child records. This model works well for
scenarios where data follows a predefined hierarchical relationship, where
data is arranged in levels or ranks.
2. Network Databases
A network database builds on the hierarchical model but allows child records to be linked to multiple parent records, creating a web-like structure of interconnected data. This results in a more flexible structure, often referred to as a graph model, where entities can be connected in many different ways.

3. Object-Oriented Databases
Object-oriented databases are based on the principles of object-oriented
programming where data is stored as objects. These objects include
attributes (data) and methods (functions), making them easily referenced
and manipulated. These databases are designed to handle complex data
structures such as multimedia, graphics, and large files.
4. Relational Databases
Relational databases are the most widely used type of database today. They store data in tables, with rows representing records and columns representing attributes of the records. In these databases, pieces of information are related to one another: every row of data is identified by a primary key, and one table is linked to another using a foreign key.

5. Cloud Databases
A cloud database operates in a virtual environment hosted on cloud
computing platforms. It is designed for storing, managing, and executing
data over the internet, providing flexibility and scalability. Cloud databases
are widely used for applications requiring dynamic workloads, as they
eliminate the need for on-premises infrastructure.

6. Centralized Databases
A centralized database is a database stored and managed at a single
location, such as a central server or data center. It ensures higher security
and consistency as all data are maintained in one place, making it easier to
control and manage.
Users can access the database remotely to fetch or update information.
Centralized databases are commonly used in enterprise systems where data
consistency and security are critical. However, scalability and performance
limitations should be carefully considered.
7. Personal Databases
A personal database is a small-scale database designed for a single user,
typically used on personal computers or mobile devices. These databases
are ideal for managing individual data like contacts, budgets, notes, or
schedules. They are lightweight, easy to use, and require minimal database
administration, making them accessible for non-technical users.
8. Operational Databases
An operational database is designed to manage and process real-time data
for daily operations within organizations and businesses. It allows users to
create, update, and delete data efficiently, ensuring that the database
reflects current activities and transactions.
9. NoSQL Databases
A NoSQL database (short for "non-SQL" or "non-relational") provides a
mechanism for storing and retrieving data that does not rely on traditional
table-based relational models.
NoSQL databases are known for their simplicity of design, horizontal
scalability (adding more servers for scaling), and high availability. Unlike
relational databases, their data structures allow faster operations in certain
use cases. MongoDB, for instance, is a widely used document-based
NoSQL database.
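
As a minimal sketch of the document model, the following uses pymongo (it assumes a MongoDB server running locally on the default port; the shop database and products collection are hypothetical):

```python
# Document-store sketch with pymongo. Assumes a MongoDB server is
# running locally on the default port; names below are hypothetical.
from pymongo import MongoClient

client = MongoClient("mongodb://localhost:27017/")
db = client["shop"]

# Documents need no fixed schema: fields can vary per record
db.products.insert_one({"name": "lamp", "price": 25, "tags": ["home"]})

# Query by field value
print(db.products.find_one({"name": "lamp"}))
```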

Types of data

Data can be broadly classified into two main categories: qualitative and
quantitative, each further subdivided.

1. Qualitative (Categorical) data

This type describes qualities or characteristics and is typically non-numeric.

 Nominal Data: Labels variables without any order or numerical value, like hair
color or nationality.

 Ordinal Data: Presents a natural order or ranking but does not quantify the
differences between categories, such as customer satisfaction ratings.

2. Quantitative (Numerical) data

This data represents measurable quantities or numerical values that can be counted
or expressed using numbers.

 Discrete Data: Consists of countable, distinct values, usually whole numbers, like
the number of students in a class.

 Continuous Data: Can assume any value within a given range, such as height,
temperature, or time.

Variables: the building blocks of data


Understanding the types of data and variables is crucial for effective data analysis. Different data types require specific statistical methods for accurate interpretation.

A variable is any characteristic, number, or quantity that can be measured or counted.

Types of variables

 Independent Variable: The factor that is manipulated or changed in an experiment to observe its effect on the dependent variable.

 Dependent Variable: The outcome or effect that is measured in response to changes in the independent variable.

 Categorical Variables: Represent categories or groups, including nominal and ordinal variables.

 Continuous Variables: Quantitative variables capable of taking an infinite number of values within a range.

 Confounding Variables: Extraneous variables that can cause a false association between independent and dependent variables, potentially leading to incorrect conclusions.
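
A small Pandas sketch of these variable types, using made-up values: nominal, ordinal, discrete, and continuous columns side by side, with the ordinal column given an explicit category order:

```python
# Variable-type sketch in Pandas: nominal, ordinal, discrete, and
# continuous columns in one small made-up dataset.
import pandas as pd

df = pd.DataFrame({
    "hair_color":   ["brown", "black", "red"],    # nominal (no order)
    "satisfaction": ["low", "high", "medium"],    # ordinal (ranked)
    "children":     [0, 2, 1],                    # discrete (countable)
    "height_cm":    [162.5, 180.1, 175.0],        # continuous (any value)
})

# Encode the ordinal column with an explicit category order
df["satisfaction"] = pd.Categorical(
    df["satisfaction"], categories=["low", "medium", "high"], ordered=True)

print(df.dtypes)
```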

7. Missing imputations
Missing data is a common issue in data analysis and machine learning. It occurs
when data is not recorded for certain variables or participants, appearing as blank
cells, null values (like "NA" or "NaN"), or special symbols. Failing to address
missing data can negatively impact the accuracy and reliability of models and
analysis.

Imputation is a frequently used method to handle missing data by replacing absent values with substituted ones.

Imputation is important because it:

 Preserves data integrity: Replacing missing values helps retain valuable data
points without deleting rows or columns.

 Improves model accuracy: Addressing missing data helps reduce bias and
enhance model performance by training on more complete datasets.

 Reduces bias: Proper handling of missing data helps avoid biased results, especially when the missingness is not random.

 Enables use of machine learning algorithms: Many algorithms require complete datasets to function effectively.

Types of missing data


Identifying the type of missing data is essential for selecting the appropriate
imputation method.
 Missing Completely at Random (MCAR): Missingness is random and unrelated to other variables.

 Missing at Random (MAR): Missingness is related to other observed variables but not the missing values themselves.

 Missing Not at Random (MNAR): Missingness is related to the missing values themselves or to unobserved factors.

Imputation techniques
Common imputation techniques include:

 Mean/Median/Mode Imputation: Replacing missing values with the mean, median, or mode of the feature.

 Forward Fill and Backward Fill: Filling missing values with the last or next known value, often used for time-series data.

 K-Nearest Neighbors (KNN) Imputation: Replacing missing values based on similar data points.

 Regression Imputation: Using a regression model to predict missing values based on relationships in the data.

 Multiple Imputation: Creating multiple datasets with different imputed values to account for uncertainty.
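
As a minimal sketch of two of these techniques, the following applies mean imputation and KNN imputation with scikit-learn to a small made-up matrix containing missing values:

```python
# Imputation sketch with scikit-learn on a made-up matrix with NaNs.
import numpy as np
from sklearn.impute import SimpleImputer, KNNImputer

X = np.array([[1.0, 2.0],
              [np.nan, 3.0],
              [7.0, np.nan],
              [4.0, 5.0]])

# Mean imputation: replace each NaN with its column mean
print(SimpleImputer(strategy="mean").fit_transform(X))

# KNN imputation: replace each NaN using the most similar rows
print(KNNImputer(n_neighbors=2).fit_transform(X))
```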

Choosing the right method


Selecting the best imputation method depends on factors such as:

 Type of Data: Different data types may require different approaches.

 Missing Data Mechanism: Understanding the missing data type helps minimize
bias.

 Proportion of Missing Data: The amount of missing data can influence the
complexity of the needed method.

 Impact on Analysis: Consider how the chosen method will affect your results or
model performance.
