
SYMBIOSIS INTERNATIONAL (DEEMED UNIVERSITY)
Established under Section 3 of the UGC Act, 1956
Awarded Category - I by UGC

Symbiosis School for Online and Digital Learning
Gram: Lavale, Tal: Mulshi, Dist: Pune, Maharashtra, India, Pin: 412115

E-CONTENT
DATA ANALYSIS AND VISUALIZATION
M.Sc (Computer Application) SEM – 2

Dr. Baljit Kaur


MODULE 1 INTRODUCTION TO DATA ANALYSIS AND VISUALIZATION

Topics to be covered:

1.1. Learning Outcomes

1.2. Introduction

1.3. Review of probability, statistics and random processes.

1.4. Sources of Data, Data cleaning, data preparation, handling missing data

1.5. Practice code for Sources of Data, Data cleaning, data preparation, handling missing data

1.6. Summary

1.7. Keywords

1.8. Self-assessment Questions

1.1. Learning Outcomes


To provide students and professionals with the foundational knowledge and skills needed to
work effectively with data in various domains, including data science, analytics, and research.

To emphasize the importance of data visualization, statistical analysis, probability, and data
preparation in the data analysis and visualization process.

1.2. Introduction
Data visualization is crucial in the field of data analysis and decision-making across various
industries. Python, R, and Tableau are three popular tools for data visualization, each offering
its unique advantages.

Python's data visualization libraries, such as Matplotlib, Seaborn, and Plotly, offer flexibility
in creating a wide range of static and interactive visualizations. You can customize
visualizations to suit specific data and presentation needs. Python integrates seamlessly with
data manipulation libraries like pandas and NumPy, allowing for efficient data preprocessing
before visualization; this integration streamlines the entire data analysis pipeline. Python also
enables the automation of data visualization processes, making it easier to generate and update
visuals as new data becomes available, which is particularly valuable for data-driven
businesses. Libraries like Plotly and Bokeh support interactive visualizations, and interactivity
helps users explore data dynamically and gain deeper insights. Finally, Python's data
visualization capabilities complement its extensive machine learning libraries, making it easier
to visualize model outputs, feature importance, and decision boundaries.

R was specifically designed for statistical analysis and visualization. It provides a wide range
of statistical graphics tools for visualizing complex data relationships, making it invaluable for
statisticians and data scientists. R has a rich ecosystem of packages for creating a variety of
data visualizations, including ggplot2 for grammar-based graphics and lattice for multivariate
visualizations. R's script-based approach to data visualization enhances reproducibility: you
can save and rerun scripts to regenerate visualizations with updated data or parameters. R also
has a strong and active user community, which results in extensive documentation, tutorials,
and support for various visualization techniques.

Tableau, in contrast, is renowned for its user-friendly interface, making it accessible to
individuals with varying technical backgrounds. Users can create compelling visuals without
extensive coding knowledge. Tableau allows for quick prototyping of visualizations, which is
valuable for rapidly exploring data and generating insights in real time.

In general, data visualization simplifies complex data, making it more accessible and
understandable to a wider audience. Visualizations help users identify patterns, outliers, and
trends in data that may not be apparent in raw numbers or text. Visualizations enhance
communication by presenting data-driven insights in a visually appealing and intuitive manner.
They are particularly useful for conveying findings to non-technical stakeholders. Data
visualizations aid in data-driven decision-making by providing clear and concise information.
They empower organizations to make informed choices based on data. Visualizations can tell
a compelling data story, helping users connect with the information on a more emotional level.
In conclusion, data visualization with Python, R, and Tableau is essential for transforming raw
data into actionable insights, whether for statistical analysis, data exploration, business
intelligence, or reporting. The choice of tool depends on the specific requirements, data
complexity, and the skills and preferences of the users.

1.3. Review of probability, statistics and random processes.


Data analysis and visualization are important aspects of data science, statistics, and related
fields. This section provides an overview of key concepts and techniques used to analyze and
visualize data. Here is an explanation of some of the key components of this introduction:
Probability: This part of the course typically covers basic probability theory, including concepts
like probability distributions, conditional probability, and the law of large numbers. Probability
is crucial for understanding uncertainty and randomness in data.

Statistics: Students learn fundamental statistical concepts such as descriptive statistics (mean,
median, standard deviation), inferential statistics (hypothesis testing, confidence intervals), and
probability distributions (normal, binomial, etc.). These tools are essential for summarizing and
drawing conclusions from data.

Random Processes: Random processes involve understanding how data evolves over time or
in a sequence of events. This might include topics like time series analysis and stochastic
processes, which are important for handling data with a temporal component.

A review of probability, statistics, and random processes is fundamental for data analysis and
visualization, as these concepts provide the necessary foundation for understanding and
interpreting data. Here is a brief overview of each of these topics in the context of data analysis
and visualization:

1. Probability:

Probability is the measure of the likelihood of an event occurring. In data analysis, probability
theory helps us understand uncertainty and randomness. Key concepts include random
variables, probability distributions, and probability density functions. Probability distributions,
such as the normal distribution and binomial distribution, are often used to model and describe
the characteristics of data.

2. Statistics:

Statistics involves collecting, analyzing, interpreting, presenting, and organizing data. It
provides tools to extract meaningful insights from data. Descriptive statistics summarize data,
including measures of central tendency (e.g., mean, median) and measures of dispersion (e.g.,
variance, standard deviation). Inferential statistics allow us to make predictions or draw
conclusions about a population based on a sample of data. This includes hypothesis testing and
confidence intervals. Regression analysis helps us understand relationships between variables
and make predictions.

3. Random Processes:
Random processes refer to sequences of random variables that evolve over time. They are
essential for modeling and understanding phenomena that involve randomness. In data
analysis, time series analysis deals with data collected over time, where observations are often
dependent on previous observations. Stochastic processes, such as Markov chains, are used to
model systems that exhibit random behavior.
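
As a brief illustration of the three concepts above, the following sketch uses NumPy and SciPy; the distributions, parameters, and sample values are illustrative assumptions, not part of the course material:

import numpy as np
from scipy import stats

# Probability: distributions model uncertainty.
print(stats.norm.cdf(1.96))             # P(Z < 1.96) for a standard normal, about 0.975
print(stats.binom.pmf(7, n=10, p=0.5))  # P(exactly 7 heads in 10 fair coin tosses), about 0.117

# Statistics: descriptive summaries and a simple inferential test on a hypothetical sample.
sample = np.array([12.1, 11.8, 12.4, 12.0, 11.9, 12.3, 12.2, 11.7])
print(sample.mean(), np.median(sample), sample.std(ddof=1))
t_stat, p_value = stats.ttest_1samp(sample, popmean=12.0)   # H0: the true mean is 12.0
print(t_stat, p_value)

# Random processes: a random walk is a simple sequence of random variables evolving over time.
rng = np.random.default_rng(seed=42)
steps = rng.choice([-1, 1], size=100)   # each step is +1 or -1 with equal probability
walk = steps.cumsum()                   # position at each time step
print(walk[:10])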

The relevance of these concepts in data analysis and visualization can be summarized as
follows:

 Data Exploration: Probability helps us understand the likelihood of different data patterns
occurring, while statistics provide tools to summarize and visualize data distributions.

 Data Modeling: Probability distributions are used to model and fit data, and statistical
techniques, such as regression, help build predictive models.

 Uncertainty and Variability: Probability and statistics allow us to quantify and communicate
uncertainty in data, which is crucial when making decisions or drawing conclusions.

 Hypothesis Testing: Statistics play a significant role in hypothesis testing to determine if
observed patterns in data are statistically significant.

 Visual Representation: Visualization techniques, such as histograms, box plots, and
scatterplots, help depict data patterns and distributions, making it easier to communicate
insights.

 Time Series Analysis: Understanding random processes is essential for analyzing time
series data, which is common in fields like finance, economics, and climate science.

In summary, a strong grasp of probability, statistics, and random processes is essential for
anyone involved in data analysis and visualization. These concepts provide the necessary tools
to explore, model, and make informed decisions based on data.

1.4. Sources of Data, Data cleaning, data preparation, handling missing data
Sources of Data:

This section introduces various sources of data, which can include structured data (e.g.,
databases), unstructured data (e.g., text documents), and semi-structured data (e.g., JSON or
XML files). Understanding where data comes from is crucial for data acquisition and analysis.
Data analysis and visualization can involve a wide range of data sources, depending on the
specific domain and objectives of the analysis. Here are various sources of data commonly
used in data analysis and visualization:

1. Databases:

 Relational Databases: Data is stored in structured tables using database management
systems (e.g., MySQL, PostgreSQL, SQL Server).

 NoSQL Databases: Non-relational databases (e.g., MongoDB, Cassandra) are used for
storing unstructured or semi-structured data.

2. Data Warehouses:

 Centralized repositories that consolidate data from various sources, enabling complex
querying and analysis.

3. Spreadsheets:

 Data stored in software like Microsoft Excel, Google Sheets, or LibreOffice Calc.

4. Flat Files:

 Data in plain text or binary files, such as CSV, TSV, JSON, XML, or Parquet.

5. APIs (Application Programming Interfaces):

 Many web services and platforms provide APIs for accessing and retrieving data
programmatically (e.g., social media APIs, financial market data APIs).

6. Web Scraping:

 Extracting data from websites using web scraping tools or libraries (e.g., Beautiful Soup
in Python, rvest in R).

7. Sensor Data:

 Data collected from IoT (Internet of Things) devices, such as temperature sensors, GPS
devices, and wearables.

8. Logs and Event Data:

 Logs generated by servers, applications, or devices that record events, transactions, and
user interactions.

9. Surveys and Questionnaires:

 Data collected through surveys, questionnaires, and feedback forms.

10. Publicly Available Datasets:

 Various organizations and government agencies provide open datasets for research and
analysis (e.g., government census data, public health data, climate data).

11. Social Media Data:

 Data from social media platforms, including user-generated content, comments, and
interactions.

12. Geospatial Data:

 Geographic data, including maps, GPS coordinates, and spatial databases.

13. Financial Data:

 Data related to financial markets, stock prices, economic indicators, and transactions.

14. Scientific Data:

 Data generated in scientific research, experiments, or simulations (e.g., genomics data,
particle physics data).

15. Images and Videos:

 Multimedia data that can be analyzed using computer vision and image processing
techniques.

16. Audio Data:

 Audio recordings and speech data used for analysis, transcription, and sentiment analysis.

17. Text Data:

 Large collections of text documents, such as books, articles, emails, and social media text.

18. Machine-Generated Data:

 Data generated by machines and automated processes, including logs, sensor readings, and
machine learning model outputs.

19. External Data Providers:

 Third-party data vendors and subscription-based data services that offer specialized
datasets (e.g., market research data, consumer behavior data).

20. Custom Data Collection:

 Organizations and researchers collect their own data through experiments, field studies,
or user interactions with applications.

21. Hybrid Sources:

 Combining data from multiple sources to gain a more comprehensive understanding (e.g.,
combining sales data with customer demographics).

Selecting the appropriate data source depends on the research question, project goals, data
availability, and data quality considerations. Often, data analysts and data scientists work with
data from multiple sources to gain valuable insights and support decision-making processes.

Data Cleaning:

Data cleaning involves the process of identifying and rectifying errors, inconsistencies, and
inaccuracies in the dataset. Common tasks include handling missing values, dealing with
outliers, and correcting data entry mistakes. Clean data is essential for accurate analysis.

Data Preparation:

Data preparation focuses on transforming raw data into a format suitable for analysis. This can
involve data encoding (e.g., converting categorical variables into numerical representations),
feature scaling (e.g., normalizing data), and creating derived features that may be more
informative for analysis.
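
As an illustrative sketch of encoding and scaling with pandas (the column names and values here are hypothetical, not from the course datasets):

import pandas as pd

# Hypothetical raw data with one categorical and one numerical column
raw = pd.DataFrame({'city': ['Pune', 'Mumbai', 'Pune'],
                    'income': [42000, 58000, 51000]})

# Encoding: convert the categorical 'city' column into numeric indicator columns
encoded = pd.get_dummies(raw, columns=['city'])

# Feature scaling: min-max normalization of 'income' to the range [0, 1]
encoded['income_scaled'] = ((encoded['income'] - encoded['income'].min()) /
                            (encoded['income'].max() - encoded['income'].min()))
print(encoded)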

Handling Missing Data:


Missing data is a common issue in datasets, and students learn strategies for dealing with it.
This can include techniques like imputation (replacing missing values with estimated values),
deletion of records with missing data, or modeling methods that can handle missingness.

1.5. Practice code for Sources of Data, Data cleaning, data preparation, handling
missing data
Dear learners, you can use programming environments like Python or R to practice the concepts
of Data Analysis and Visualization. We will be using Python for implementation of the
concepts in this course.

As per the IBM data analytics team, 80% of a data scientist's time is spent preparing data for
analysis, which involves activities like getting the data, cleaning the data, and handling missing
data.

As a data analyst, you must be ready to deal with messy data, whether that means missing
values, inconsistent formatting, or malformed records.

In this section, we will use the pandas and NumPy libraries to clean data. I recommend using a
Jupyter notebook.

All the practice datasets used in this course can be found at the following link:

https://drive.google.com/drive/folders/13A0EVAXatP4CmHPWpTgky9rd2bWLeCdf?usp=sharing

We have used the following datasets for this module:

 BL-Flickr-Images-Book.csv – A CSV file containing information about books from the
British Library
 Train.csv – A CSV file containing Big Mart data
 Refer to the Python notebook: big_mart_sales_prediction_SLM.ipynb

Let’s start working on data by first importing the important libraries:
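
A minimal version of this step, assuming only pandas and NumPy are needed for what follows:

import pandas as pd
import numpy as np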


Dropping unnecessary variables from data:

Sometimes, data has many fields that are not part of the analysis. For example, you might
have a dataset containing student information (name, grade, standard, parents' names, and
address) but want to focus on analysing student grades. In this case, the address or parents'
names are not important to you. Keeping these unnecessary fields during analysis will occupy
space and will also impact execution time.

The pandas library provides the drop() function to remove a number of columns from a given
DataFrame.

Let us create a DataFrame out of the CSV file 'BL-Flickr-Images-Book.csv'.

You need to provide the correct path of the file.

Use the read_csv() function from the pandas library to get the DataFrame and the head()
function to see the initial five records of the DataFrame:

df = pd.read_csv('E:/Baljeet/Canada/SSODL/SLM Data Analysis and Visualization/Data/BL-Flickr-Images-Book.csv')
df.head()
Here we can see that several unwanted columns are present, such as: Edition Statement, Corporate
Author, Corporate Contributors, Former owner, Engraver, Issuance type and Shelfmarks.

We can prepare a list of all these columns and call the drop() function on our DataFrame,
passing in the inplace parameter as True and the axis parameter as 1. This tells pandas that we
want the changes to be made directly in our object and that it should look for the values to be
dropped in the columns of the object.

When we inspect the DataFrame again, we’ll see that the unwanted columns have been
removed:

to_drop = ['Edition Statement', 'Corporate Author', 'Corporate Contributors', 'Former owner',
           'Engraver', 'Contributors', 'Issuance type', 'Shelfmarks']

df.drop(to_drop, inplace=True, axis=1)

Let us use the head() function to see the changes in the data. We can see that the unwanted fields have been dropped.

Let us take another dataset: the BIG MART SALES prediction problem.

Names of the data files: Test.csv and Train.csv


BigMart Sales Prediction practice problem

We have a train data set (8,523 records) and a test data set (5,681 records); the train data set has
both input and output variable(s). We need to predict the sales for the test data set.

 Item_Identifier: Unique product ID

 Item_Weight: Weight of product

 Item_Fat_Content: Whether the product is low fat or not

 Item_Visibility: The % of total display area of all products in a store allocated to the
particular product

 Item_Type: The category to which the product belongs

 Item_MRP: Maximum Retail Price (list price) of the product

 Outlet_Identifier: Unique store ID

 Outlet_Establishment_Year: The year in which the store was established

 Outlet_Size: The size of the store in terms of ground area covered

 Outlet_Location_Type: The type of city in which the store is located

 Outlet_Type: Whether the outlet is just a grocery store or some sort of supermarket

 Item_Outlet_Sales: Sales of the product in the particular store. This is the outcome
variable to be predicted.

Let us read the dataset into a DataFrame and view the initial records using the head() function,
then look at the numerical and categorical variables using the info() function, as sketched below.
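
A minimal sketch, assuming Train.csv is in the working directory and is read into a DataFrame named data; describe() is included because the observations below refer to the 'count' values it reports (in a Jupyter notebook, run each line in its own cell to see the output):

data = pd.read_csv('Train.csv')

data.head()       # first five records
data.info()       # numerical and categorical variables, data types, non-null counts
data.describe()   # summary statistics; the 'count' row reveals columns with missing values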

It has 8,522 rows and 12 columns in total.


Some observations:

 Outlet_Establishment_Year values vary from 1985 to 2009. The values might not be apt in
this form; rather, if we can convert them to how old the particular store is, it should have a
better impact on sales.

 The lower 'count' of Item_Weight and Outlet_Size confirms the findings from the missing
value check.
To identify missing values, use the isnull() function:
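
For example, continuing with the DataFrame data assumed above:

data.isnull().sum()   # number of missing values in each column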

Item_Weight and Outlet_Size have missing values. Item_Weight is a numerical variable and
Outlet_Size is a categorical variable, so we need to handle the missing values for these variables
in different ways.

Handling the missing values of Item_Weight: Let us impute Item_Weight with the average
weight of the particular item. First, get the mean weight, item-wise. Then find the places where
Item_Weight is missing and replace the missing values with the mean Item_Weight for the
corresponding item, using Item_Identifier. Finally, verify that the missing values for
Item_Weight are filled:
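
A minimal sketch of these steps, assuming the DataFrame is named data as above (the course notebook may use different variable names):

# Mean weight of each item, computed item-wise
item_avg_weight = data.groupby('Item_Identifier')['Item_Weight'].mean()

# Boolean mask marking the rows where Item_Weight is missing
miss_bool = data['Item_Weight'].isnull()

# Replace the missing weights with the mean weight of the corresponding Item_Identifier
data.loc[miss_bool, 'Item_Weight'] = data.loc[miss_bool, 'Item_Identifier'].map(item_avg_weight)

# Verify that no missing values remain in Item_Weight
print(data['Item_Weight'].isnull().sum())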

Now let us handle the missing values for Outlet_Size, which is a categorical variable. Missing
categorical values are replaced with the corresponding mode value of the respective variable
rather than with mean values.
So, find the missing values and replace them with the corresponding mode value:
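
A sketch under the same assumptions; here the overall mode of Outlet_Size is used, though the course notebook may instead impute a mode per outlet type:

# Mode (most frequent value) of Outlet_Size
outlet_size_mode = data['Outlet_Size'].mode()[0]

# Boolean mask of the rows where Outlet_Size is missing
miss_bool = data['Outlet_Size'].isnull()

# Replace the missing values with the mode
data.loc[miss_bool, 'Outlet_Size'] = outlet_size_mode

# Verify that no missing values remain in Outlet_Size
print(data['Outlet_Size'].isnull().sum())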

Also, Outlet_Establishment_Year values vary from 1985 to 2009. The values might not be apt
in this form; rather, if we can convert them to how old the particular store is, it should have a
better impact on sales. Create a new variable, Outlet_Year, by subtracting
Outlet_Establishment_Year from the current year:
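
A minimal sketch, assuming the current year is taken from the system clock (any fixed reference year would also work):

import datetime

current_year = datetime.datetime.now().year
data['Outlet_Year'] = current_year - data['Outlet_Establishment_Year']
print(data['Outlet_Year'].describe())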

There are many more functions that can be applied to the dataset to prepare it for analysis.
Kindly refer to the Python notebook for more details and practice.
1.6. Summary
Data analysis and visualization are crucial because they form the foundation for more advanced
data science and machine learning techniques. Understanding probability and statistics is
fundamental for making data-driven decisions. Learning how to clean and prepare data is often
the most time-consuming part of data analysis, but it's essential for accurate results.
Mishandling data cleaning and preparation can lead to incorrect conclusions. Handling missing
data is also vital because incomplete data can lead to biased or unreliable results if not
addressed properly. Additionally, this introduction lays the groundwork for understanding how
data can be visualized to gain insights. Visualization is a powerful tool for conveying
information from data in a meaningful way.

In summary, "Introduction to Data Analysis and Visualization" provides students with the
foundational knowledge and skills needed to work with data effectively. It equips them with
the statistical tools to understand data, the ability to clean and prepare data for analysis, and
insights into handling common data-related challenges such as missing data. This knowledge
is essential for anyone working with data in fields such as data science, analytics, and research.
1.7. Keywords
Data: Data is information of different types, usually formatted in a particular manner.
Probability: Probability is simply how likely something is to happen.
Data Science: Data science combines math and statistics, specialized programming, advanced
analytics, artificial intelligence (AI), and machine learning with specific subject matter
expertise to uncover actionable insights hidden in an organization’s data.
Machine Learning: Machine learning is a branch of artificial intelligence (AI) and computer
science that focuses on the use of data and algorithms to imitate the way that humans learn,
gradually improving its accuracy.
1.8. Self-assessment Questions

Question 1: What is data analysis?

A) The process of collecting data


B) The process of cleaning and preparing data
C) The process of interpreting data to extract insights
D) The process of visualizing data

Answer 1: C) The process of interpreting data to extract insights

Question 2: Which of the following is NOT a common source of data for analysis and
visualization?

A) Social Media Data


B) Geospatial Data
C) Sensor Data
D) Machine Learning Models

Answer 2: D) Machine Learning Models

Question 3: Why is data cleaning important in data analysis?

A) It makes the data look nicer


B) It reduces the size of the dataset
C) It helps identify and correct errors in the data
D) It enhances data visualization

Answer 3: C) It helps identify and correct errors in the data

Question 4: Which measure of central tendency is calculated as the arithmetic average of all values in a dataset?

A) Mean
B) Median
C) Mode
D) Standard Deviation

Answer 4: A) Mean

Question 5: What is the primary purpose of data visualization?


A) To make data look more complex
B) To hide data patterns
C) To present data in a visually appealing and understandable way
D) To confuse the audience

Answer 5: C) To present data in a visually appealing and understandable way

Question 6: Which programming language is commonly used for data analysis and offers
libraries like Matplotlib and Seaborn for visualization?

A) Java
B) Python
C) R
D) SQL

Answer 6: B) Python

Question 7: What does probability measure?

A) The likelihood of an event occurring


B) The size of a dataset
C) The range of data values
D) The mean of a dataset

Answer 7: A) The likelihood of an event occurring

Question 8: Which term refers to the process of replacing missing values with estimated
values?

A) Data cleaning
B) Data preparation
C) Data encoding
D) Imputation

Answer 8: D) Imputation

Question 9: Which type of data source involves collecting data from IoT devices like
sensors?

A) Databases
B) Social Media Data
C) Sensor Data
D) Surveys and Questionnaires

Answer 9: C) Sensor Data

Question 10: What is the main benefit of using data visualization in data analysis?

A) To make data more complex


B) To hide data patterns
C) To present data in a meaningful and insightful way
D) To create artistic designs

Answer 10: C) To present data in a meaningful and insightful way


MODULE 2 : SHAPE OF DATA
Topics to be covered:

2.1 Learning Outcomes

2.2 Introduction

2.3 Shape of the Data

2.4 Population, Samples and Statistical interpretation from Data, Visualization Methods

2.5 Summary

2.6 Keywords

2.7 Self-assessment Questions

2.1 Learning Outcomes


To enable students to recognize, analyze, and interpret different types of data, understand data
distribution shapes, perform univariate data analysis, work with populations and samples, and
effectively use visualization methods to communicate insights derived from data. These skills
are fundamental for data-driven decision-making and analysis in various domains.

2.2 Introduction
In data analytics, understanding the types of data, the shape of the data, and univariate data
analysis are crucial steps in the data exploration process. Let's break down each of these
concepts:

Types of Data: Data can be categorized into different types based on its nature and
characteristics. The three primary types of data are:

1. Nominal Data: Nominal data, also known as categorical data, consists of categories or
labels for distinct items or groups, without any specific order or ranking. Examples include
colors, gender categories (male, female, other), and types of fruits. Nominal data is often
represented using text labels or numerical codes, but the numbers themselves do not carry
any quantitative meaning. It is essential to recognize that in nominal data, you cannot
perform meaningful mathematical operations like addition or multiplication on the
categories.

Examples of Nominal Data:

Colors: The colors of cars (e.g., red, blue, green) are nominal data. Each color
represents a distinct category, and there's no inherent order among the colors.

Gender: Male, female, and other are categories of gender. Gender is a classic example
of nominal data, as these categories have no inherent ranking.

Marital Status: Marital status categories such as single, married, divorced, and
widowed are nominal. They represent different marital statuses without any particular
order.

Types of Fruits: Categories like apple, banana, and orange are nominal data because
they represent different types of fruits with no inherent ranking.

Country Names: The names of countries, such as the United States, Canada, and
Mexico, are nominal data. Each country represents a separate category.

2. Ordinal Data: Ordinal data represents categories with a specific order or ranking. Unlike
nominal data, ordinal data's categories have a meaningful sequence or hierarchy, but the
intervals between the categories are not uniform or quantitatively meaningful. In other
words, you can compare ordinal categories to determine which is higher or lower, but you
cannot precisely measure the difference between them. Examples include education levels
(e.g., high school, bachelor's degree, master's degree) and customer satisfaction ratings
(e.g., poor, fair, good, excellent).

Examples of Ordinal Data:

Education Level: Categories like high school, bachelor's degree, master's degree, and
Ph.D. represent ordinal data. They have a clear order based on the level of education
attained, but the difference between these levels is not uniform.

Customer Satisfaction Ratings: Ratings such as poor, fair, good, very good, and
excellent are ordinal data. There's a meaningful order from the least to the most
satisfied, but the difference between each rating is subjective and not quantitatively
precise.

Likert Scale: The Likert scale, often used in surveys, includes responses like strongly
disagree, disagree, neutral, agree, and strongly agree. These responses are ordinal
because they have a clear order, but the interval between them is not uniformly defined.

Economic Status: Categories like lower class, middle class, and upper class represent
ordinal data. They indicate a hierarchical order in terms of economic status, but the
income or wealth gap between these categories is not uniform.

T-shirt Sizes: T-shirt sizes like small, medium, large, and extra-large are ordinal data.
There's an order based on size, but the difference in chest measurements between sizes
is not constant.

3. Numerical Data: Numerical data consists of numeric values that can be further
categorized into two subtypes:

 Interval Data: Interval data has a consistent interval or difference between values, but it
lacks a true zero point. Common examples include temperature in Celsius and IQ scores.
In interval data, it's not meaningful to say that one value is "twice" another value.

 Ratio Data: Ratio data has a consistent interval between values, and it includes a true zero
point, which indicates the absence of the quantity being measured. Common examples
include height, weight, age, and income. In ratio data, you can meaningfully say that one
value is "twice" or "three times" another value.

Based on the organization of data, data is also classified as Structured Data and Unstructured
Data.

Structured Data: Structured data is highly organized and formatted, making it easy to store,
search, and analyze using traditional databases and spreadsheet tools. Characteristics of
structured data include:

 Fixed Schema: Structured data follows a predefined and fixed schema, which means that
the data's format, data types, and relationships between data elements are well-defined in
advance.

 Tabular Format: Structured data is often organized into tables or spreadsheets, where rows
represent individual records or observations, and columns represent attributes or variables.

 Consistency: Structured data exhibits a high level of consistency. Data values within the
same column typically adhere to a specific data type, and there are rules governing how
data is entered.

 Easy to Query: Structured data can be queried using standard SQL (Structured Query
Language) or other query languages. This makes it straightforward to retrieve specific
information from structured databases.

Examples of Structured Data:

 Relational databases: Data stored in tables with predefined schemas.

 Excel spreadsheets: Organized data with rows and columns.

 CSV (Comma-Separated Values) files: Data stored in a tabular format.

 JSON or XML data with a well-defined structure.

 Inventory lists, customer databases, financial records.

Unstructured Data: Unstructured data is the opposite of structured data in terms of
organization and format. It lacks a fixed structure, making it more challenging to process using
traditional database management systems. Characteristics of unstructured data include:

 Lack of Structure: Unstructured data does not follow a predefined structure, schema, or
format. It can include text, images, audio, video, and other formats.

 Variability: Unstructured data can vary greatly in terms of content, length, and format. It
may contain natural language text, free-form notes, multimedia files, and more.

 No Fixed Schema: There is no fixed schema or model for unstructured data. Each piece of
unstructured data may have its own unique characteristics.

 Challenging to Process: Analyzing unstructured data can be challenging because traditional
database tools are not well-suited for this type of data. Specialized techniques, such as
natural language processing (NLP) for text data or computer vision for images, are often
required.
Examples of Unstructured Data:

 Text documents: Word documents, PDFs, emails, and text files.

 Social media posts: Tweets, Facebook updates, and user-generated content.

 Images and videos: Photos, videos, and multimedia content.

 Audio recordings: Voice recordings, podcasts, and phone call logs.

 Sensor data: Raw data from IoT devices.

 Web pages and web content: HTML pages with variable structures.

Semi-Structured Data: Semi-structured data falls between structured and unstructured data.
It has some structure but does not adhere to the strict schema of structured data. Examples of
semi-structured data include JSON and XML documents, which have a defined hierarchy but
allow flexibility in data representation.

Structured data is highly organized and suitable for traditional databases, while unstructured
data lacks a fixed structure and may require specialized techniques for analysis. Semi-
structured data exhibits some level of structure but allows for flexibility in data representation.
Understanding the type of data you are dealing with is crucial for selecting appropriate tools
and techniques for data storage, processing, and analysis.

2.3 Shape of the Data


The shape of the data refers to the distribution of values within a dataset. Understanding the
shape of the data is essential for making data-driven decisions and choosing appropriate
statistical techniques. Data distributions can take various forms:

1. Normal Distribution (Bell Curve): In a normal distribution, data is symmetrically
distributed around the mean, forming a bell-shaped curve. Many natural phenomena follow
a normal distribution. The mean, median, and mode are equal in a normal distribution.

2. Skewed Distribution: Skewed distributions are asymmetric, with a longer tail on one side.
There are two types of skewed distributions:

 Positive Skew (Right-skewed): The tail extends to the right, and the mean is greater than
the median.

 Negative Skew (Left-skewed): The tail extends to the left, and the mean is less than the
median.

3. Uniform Distribution: In a uniform distribution, all values are equally likely, and there is
no pronounced peak or skew.

4. Bimodal Distribution: Bimodal distributions have two distinct peaks, indicating the
presence of two different groups or patterns within the data.

5. Multimodal Distribution: Multimodal distributions have more than two peaks, suggesting
multiple patterns or groups in the data.

6. Exponential Distribution: Exponential distributions are often used to model the time
between events in a Poisson process. They are characterized by a rapid drop-off in values
after an initial peak.

Univariate Data Analysis: Univariate data analysis focuses on analyzing a single variable or
dataset one at a time. It involves the following key steps:

1. Descriptive Statistics: Calculate summary statistics to describe the central tendency (mean,
median, mode), dispersion (variance, standard deviation, range), and shape of the data
(skewness, kurtosis).

2. Data Visualization: Create visual representations of the data, such as histograms, box
plots, bar charts, and density plots, to better understand the distribution and identify any
outliers or patterns.

3. Frequency Distribution: Construct a frequency distribution table or chart to show the
frequency of each value or category within the dataset.

4. Measures of Central Tendency: Use measures like the mean, median, and mode to
describe the "typical" or central value of the data.

5. Measures of Dispersion: Calculate measures like the range, variance, and standard
deviation to understand the spread or variability of the data.

6. Skewness and Kurtosis: Assess skewness to determine if the data is skewed and kurtosis
to examine the shape of the distribution's tails.

7. Identify Outliers: Detect and deal with outliers, which are extreme values that can
significantly impact data analysis.

Univariate data analysis provides a foundational understanding of a single variable's
characteristics, which is often the starting point for more advanced multivariate analyses in
data analytics and statistics.
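
As a brief illustration of the steps above in Python, the sketch below uses a small hypothetical pandas Series named values (the numbers are invented for illustration and include one deliberate outlier):

import pandas as pd
import matplotlib.pyplot as plt

values = pd.Series([23, 25, 24, 31, 29, 25, 27, 90, 26, 24])   # hypothetical data with one outlier

# Descriptive statistics: central tendency, dispersion, and shape
print(values.describe())                 # count, mean, std, min, quartiles, max
print(values.median(), values.mode()[0])
print(values.skew(), values.kurt())      # skewness and (excess) kurtosis

# Frequency distribution of the values
print(values.value_counts())

# Visualization: histogram and box plot to inspect the distribution and spot outliers
values.plot(kind='hist', title='Histogram')
plt.show()
values.plot(kind='box', title='Box plot')
plt.show()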

2.4 Population, Samples and Statistical interpretation from Data, Visualization Methods
Population, samples, statistical interpretation, and visualization methods are fundamental
concepts in statistics and data analysis. Let's delve into each of these concepts in detail:

Population:

The population refers to the entire group of individuals or items about which you want to draw
conclusions in a statistical study. This group can be large or small and is the complete set you
are interested in studying.

For example, if you want to understand the average income of all adults in a specific country,
the population would consist of all adults in that country.

Samples:

A sample is a subset of the population. Sampling involves selecting a portion of the population
to represent the whole. Sampling is typically done because it's often impractical or impossible
to collect data from an entire population.

Samples should be selected randomly or in a way that minimizes bias to ensure that the sample
is representative of the population. Random sampling helps to reduce the likelihood of drawing
misleading conclusions.

For example, if you want to estimate the average income of adults in a country, you might
select a random sample of 1,000 adults from the entire population.

Statistical Interpretation from Data:

Statistical interpretation involves making sense of data collected from a sample or the entire
population. This is where statistical techniques and analysis come into play.

It includes calculating descriptive statistics like mean (average), median, and standard
deviation to summarize data. It also involves using inferential statistics to make inferences or
predictions about the population based on the sample data.
Statistical interpretation helps you draw conclusions, make predictions, or test hypotheses. For
example, from a sample's average income, you might infer the average income of the entire
population.

Visualization Methods:

Visualization methods are tools and techniques used to present data graphically. Visualizations
help in understanding data, spotting trends, and communicating findings effectively.

Common visualization methods include bar charts, line graphs, histograms, scatter plots, box
plots, and pie charts. These visuals can reveal patterns, outliers, and relationships within the
data that might not be apparent from raw numbers.

Data visualization is crucial for making data more accessible and interpretable. For instance, a
histogram can show the income distribution of a sample, allowing you to see if it's skewed or
normally distributed.
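
As a small illustrative sketch of these ideas (the "population" here is simulated with NumPy; a real study would of course collect actual data):

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

rng = np.random.default_rng(seed=0)

# Simulated population of adult incomes (right-skewed, as incomes often are)
population = pd.Series(rng.lognormal(mean=10.5, sigma=0.5, size=1_000_000))

# Random sample of 1,000 adults drawn from the population
sample = population.sample(n=1000, random_state=0)

# Statistical interpretation: the sample mean estimates the population mean
print(population.mean(), sample.mean())

# Visualization: a histogram of the sample's income distribution
sample.plot(kind='hist', bins=30, title='Income distribution of the sample')
plt.show()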

In practice, the process typically involves selecting a sample from the population, collecting
data from the sample, analyzing the data using statistical methods, and presenting the results
through data visualization. The goal is to make valid inferences about the population based on
the sample data.

It's important to note that the quality of the sample and the accuracy of the statistical
interpretation can greatly impact the reliability of conclusions drawn. Well-designed studies
with representative samples and appropriate statistical analysis techniques enhance the validity
of your findings. Data visualization adds another layer of clarity to help people understand the
results more easily.
2.5 Summary
In data analytics, understanding the types of data, the shape of the data, and univariate data
analysis are crucial steps in the data exploration process. Data can be categorized into different
types based on its nature and characteristics. The three primary types of data are: Nominal
Data, Ordinal Data, Numerical Data.

The shape of the data refers to the distribution of values within a dataset. Understanding the
shape of the data is essential for making data-driven decisions and choosing appropriate
statistical techniques. Data distributions can take various forms, including: Normal Distribution
(Bell Curve), Skewed Distribution, Uniform Distribution, Bimodal Distribution, Multimodal
Distribution, Exponential Distribution.

Univariate data analysis focuses on analyzing a single variable or dataset one at a time. It
involves the following key steps:

Descriptive Statistics: Calculate summary statistics to describe the central tendency (mean,
median, mode), dispersion (variance, standard deviation, range), and shape of the data
(skewness, kurtosis).

Data Visualization: Create visual representations of the data, such as histograms, box plots, bar
charts, and density plots, to better understand the distribution and identify any outliers or
patterns.

Frequency Distribution: Construct a frequency distribution table or chart to show the frequency
of each value or category within the dataset.

Measures of Central Tendency: Use measures like the mean, median, and mode to describe the
"typical" or central value of the data.

Measures of Dispersion: Calculate measures like the range, variance, and standard deviation to
understand the spread or variability of the data.

Skewness and Kurtosis: Assess skewness to determine if the data is skewed and kurtosis to
examine the shape of the distribution's tails.

Univariate data analysis provides a foundational understanding of a single variable's
characteristics, which is often the starting point for more advanced multivariate analyses in
data analytics and statistics. These concepts are essential for anyone working with data, as they
form the basis for data exploration and analysis, enabling data professionals to draw
meaningful insights and make informed decisions.
2.6 Keywords
1. Descriptive Statistics: Calculate summary statistics to describe the central tendency (mean,
median, mode), dispersion (variance, standard deviation, range), and shape of the data
(skewness, kurtosis).
2. Data Visualization: Create visual representations of the data, such as histograms, box plots,
bar charts, and density plots, to better understand the distribution and identify any outliers
or patterns.
3. Frequency Distribution: Construct a frequency distribution table or chart to show the
frequency of each value or category within the dataset.
4. Measures of Central Tendency: Use measures like the mean, median, and mode to describe
the "typical" or central value of the data.
5. Measures of Dispersion: Calculate measures like the range, variance, and standard
deviation to understand the spread or variability of the data.
6. Skewness and Kurtosis: Assess skewness to determine if the data is skewed and kurtosis to
examine the shape of the distribution's tails.
2.7 Self-assessment Questions

1. What type of data represents categories or labels without any specific order or ranking?

a. Ordinal data

b. Numerical data

c. Nominal data

d. Ratio data

Answer: c. Nominal data

2. Which of the following is an example of nominal data?

a. Temperature in Celsius

b. Customer satisfaction ratings

c. Types of fruits

d. Education levels

Answer: c. Types of fruits

3. In which type of distribution is data symmetrically distributed around the mean, forming a
bell-shaped curve?

a. Uniform distribution

b. Normal distribution

c. Skewed distribution

d. Exponential distribution

Answer: b. Normal distribution

4. What type of distribution has a longer tail on one side, with the mean greater than the
median?
a. Negative skew (Left-skewed)

b. Positive skew (Right-skewed)

c. Uniform distribution

d. Bimodal distribution

Answer: b. Positive skew (Right-skewed)

5. Which measure of central tendency is resistant to extreme outliers in a dataset?

a. Mean

b. Median

c. Mode

d. Range

Answer: b. Median

6. What is the purpose of constructing a frequency distribution table in univariate data analysis?

a. To summarize the central tendency of the data

b. To identify outliers in the dataset

c. To calculate the standard deviation

d. To show the frequency of each value or category within the dataset

Answer: d. To show the frequency of each value or category within the dataset

7. What is a sample in the context of data analysis?

a. The entire group of individuals or items under study

b. A subset of the population used to represent the whole

c. The mean value of the population


d. A statistical measure of data dispersion

Answer: b. A subset of the population used to represent the whole

8. Which of the following is NOT a common measure of central tendency used in statistical
interpretation?

a. Mean

b. Median

c. Mode

d. Variance

Answer: d. Variance

9. Which data visualization method is suitable for displaying the relationship between two
numerical variables?

a. Bar chart

b. Line graph

c. Scatter plot

d. Histogram

Answer: c. Scatter plot

10. What is the primary purpose of using data visualization methods in data analysis?

a. To hide data patterns and relationships

b. To make data less accessible

c. To summarize data without visual representation

d. To understand data, spot trends, and communicate findings effectively

Answer: d. To understand data, spot trends, and communicate findings effectively


MODULE 3 : DESCRIBING RELATIONSHIP BETWEEN DATA

Topics to be covered:

3.1 Learning Outcomes

3.2 Introduction: Describing Relationship, Multivariate Data

3.3 Relationship between variables, Covariance, Correlation Coefficients

3.4 Summary

3.5 Keywords

3.6 Self-assessment Questions

3.1 Learning Outcomes


To enable students to understand data and the relationships between variables, and to perform
univariate and bivariate analysis using the mean, variance, covariance, and correlation.

3.2 Introduction: Describing Relationship, Multivariate Data


Describing the relationship between variables involves understanding how one variable
changes concerning another. This can be a crucial step in data analysis to identify patterns or
dependencies. Multivariate data involves the analysis of datasets with more than two variables.
This can help uncover complex relationships between multiple factors. The relationship
between variables can be visualized using scatter plots, line plots, or other graphical
representations. Covariance measures how two variables change together. A positive
covariance indicates a positive relationship, while a negative covariance indicates a negative
relationship. Correlation coefficients, like Pearson correlation, quantify the strength and
direction of the linear relationship between two variables. It ranges from -1 to 1, where -1
indicates a perfect negative linear relationship, 1 indicates a perfect positive linear relationship,
and 0 indicates no linear relationship. Understanding these concepts and utilizing them in
Python can be beneficial for analyzing relationships within datasets. Always remember that
correlation does not imply causation, and additional statistical tests may be required for more
comprehensive analyses.
Let us discuss the relationship between variables in detail by first understanding a few basic
concepts, as below:

Descriptive Statistics: Descriptive statistics is a branch of statistics that involves the collection,
presentation, and interpretation of data. Its primary goal is to summarize and describe essential
features of a dataset, providing a clear and concise overview. Descriptive statistics can be
applied to one or more datasets or variables.

Univariate Analysis: In univariate analysis, you focus on describing and summarizing a
single variable. This involves measures of central tendency (mean, median, mode) and
measures of variability (variance, standard deviation).
Bivariate Analysis: Bivariate analysis explores statistical relationships between pairs of
variables. It examines how two variables change in relation to each other. Measures of
correlation, covariance, and joint variability are used in bivariate analysis.
Multivariate Analysis: Multivariate analysis involves the simultaneous analysis of multiple
variables. This approach provides a more comprehensive understanding of the relationships
and patterns within the data.

3.3 Relationship between variables


Central Tendency: Describes the center or average of a dataset.

Mean: The arithmetic average of all values in a dataset, calculated by summing all
values and dividing by the number of items.

Median: The middle value in a sorted dataset.

Mode: The most frequently occurring value in a dataset.

Variability:

Describes the spread or dispersion of data points.

Variance: A measure of how much each data point in a set differs from the mean.

Standard Deviation: The square root of the variance, providing a more interpretable
measure of dispersion.

Correlation or Joint Variability:

Describes the relationship between pairs of variables.

Covariance: Measures how two variables change together.


Correlation Coefficient: Scales covariance to a range of -1 to 1, indicating the strength
and direction of the relationship.

Population and Sample:

In statistics, a population is the complete set of elements or items of interest. Due to practical
constraints, statisticians often study a representative subset of the population, known as a
sample.

A well-chosen sample should accurately represent the essential statistical features of the entire
population. Outliers are data points that significantly deviate from the majority of the dataset.

Identifying and handling outliers is crucial for accurate analysis, as they can skew measures of
central tendency and variability.

Let us understand all these terms using the Python libraries pandas and NumPy.

Note: Refer to the Python notebook 'Describing Relationship in data (2).ipynb' at

https://drive.google.com/drive/folders/13A0EVAXatP4CmHPWpTgky9rd2bWLeCdf?usp=sharing

Mean: The sample mean, also called the sample arithmetic mean or simply the average, is the
arithmetic average of all the items in a dataset. The mean of a dataset 𝑥 is mathematically
expressed as Σᵢ𝑥ᵢ/𝑛, where 𝑖 = 1, 2, …, 𝑛. In other words, it's the sum of all the elements 𝑥ᵢ
divided by the number of items in the dataset 𝑥.

Following code uses pandas to calculate mean and covariance between two series of data:
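
A minimal sketch with two small, hypothetical lists of related values (the variable names x, y, x__, and y__ are assumptions made here for illustration):

import numpy as np
import pandas as pd

x = [20, 21, 22, 24, 25, 27, 29, 30]   # hypothetical data
y = [5, 6, 6, 8, 9, 11, 13, 14]

x__, y__ = pd.Series(x), pd.Series(y)  # pandas Series built from the lists

print(x__.mean(), y__.mean())          # sample means
print(x__.cov(y__))                    # sample covariance between the two series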

NumPy has the function cov() that returns the covariance matrix:
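
For example, using NumPy arrays built from the same lists (the names x_ and y_ are again assumptions):

x_, y_ = np.array(x), np.array(y)
cov_matrix = np.cov(x_, y_)   # 2x2 covariance matrix
print(cov_matrix)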
The upper-left element of the covariance matrix is the variance of x. Similarly, the lower-right
element is the variance of y.

As you can see, the variances of x and y are equal to cov_matrix[0, 0] and cov_matrix[1, 1],
respectively.
The other two elements of the covariance matrix are equal and represent the actual covariance
between x and y

pandas Series have the method .cov() that you can use to calculate the covariance:
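
For instance, with the Series x__ and y__ assumed earlier:

print(x__.cov(y__))   # covariance between the two series
print(y__.cov(x__))   # covariance is symmetric, so the order does not matter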

Here, you call .cov() on one Series object and pass the other object as the first argument.

Correlation Coefficient:
The correlation coefficient, or Pearson product-moment correlation coefficient, is denoted by
the symbol 𝑟. The coefficient is another measure of the correlation between data. You can think
of it as a standardized covariance. Here are some important facts about it:
The value 𝑟 > 0 indicates positive correlation.
The value 𝑟 < 0 indicates negative correlation.
The value r = 1 is the maximum possible value of 𝑟. It corresponds to a perfect positive linear
relationship between variables.
The value r = −1 is the minimum possible value of 𝑟. It corresponds to a perfect negative linear
relationship between variables.
The value r ≈ 0, or when 𝑟 is around zero, means that the correlation between variables is weak.

In the code examples that follow, the variable r represents the correlation coefficient.


scipy.stats has the routine pearsonr() that calculates the correlation coefficient and the 𝑝-value:
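
Continuing with the arrays x_ and y_ assumed above:

import scipy.stats

r, p = scipy.stats.pearsonr(x_, y_)   # correlation coefficient and p-value
print(r, p)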
Similar to the case of the covariance matrix, you can apply np.corrcoef() with x_ and y_ as the
arguments and get the correlation coefficient matrix:
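
For example, again with the assumed arrays:

corr_matrix = np.corrcoef(x_, y_)   # 2x2 correlation coefficient matrix
print(corr_matrix)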

The upper-left element is the correlation coefficient between x_ and x_. The lower-right
element is the correlation coefficient between y_ and y_. Their values are equal to 1.0. The
other two elements are equal and represent the actual correlation coefficient between x_ and
y_:
You can get the correlation coefficient with scipy.stats.linregress():
linregress() takes x_ and y_, performs linear regression, and returns the results. slope and
intercept define the equation of the regression line, while rvalue is the correlation coefficient.
To access particular values from the result of linregress(), including the correlation coefficient,
use dot notation:
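
A sketch with the same assumed arrays:

result = scipy.stats.linregress(x_, y_)
print(result.slope, result.intercept)   # equation of the regression line
print(result.rvalue)                    # the correlation coefficient, accessed with dot notation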

pandas Series have the method .corr() for calculating the correlation coefficient:
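
For example, with the Series x__ and y__ from earlier:

r = x__.corr(y__)   # Pearson correlation coefficient
print(r)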

Example: let us consider the Big Mart data to study correlation. Referring to the correlation
matrix sketched below, we can see a positive correlation between Item_Outlet_Sales and
Item_MRP:
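
A hedged sketch of how such a correlation matrix can be produced from the Big Mart training data, assuming Train.csv is available as in Module 1 and only the numerical columns are used (the seaborn heatmap is optional):

import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

data = pd.read_csv('Train.csv')

# Correlations between the numerical columns only
corr_matrix = data.select_dtypes(include='number').corr()
print(corr_matrix['Item_Outlet_Sales'].sort_values(ascending=False))

# Optional heatmap of the correlation matrix
sns.heatmap(corr_matrix, annot=True, cmap='coolwarm')
plt.show()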


3.4 Summary
In data analysis, understanding the relationship between variables is crucial for identifying
patterns and dependencies. This involves exploring univariate, bivariate, and multivariate
analyses.

Python libraries such as pandas and numpy are useful for implementing these concepts.
Functions like mean(), cov(), and corr() assist in calculating statistical measures. Visualization
libraries like Matplotlib can be used for graphical representations. Positive values of correlation
coefficient indicate positive correlation, negative values indicate negative correlation, and 0
indicates no linear relationship.

In summary, descriptive statistics and analyses of relationships between variables provide
valuable insights into datasets. Python, with its powerful libraries, facilitates the
implementation of these concepts, contributing to a deeper understanding of data patterns and
dependencies. Always remember that correlation does not imply causation, and additional
statistical tests may be needed for comprehensive analyses.
3.5 Keywords
Covariance: Covariance is a statistical measure that describes the degree to which two variables
change together. A positive covariance indicates that the variables tend to increase or decrease
together, while a negative covariance indicates an inverse relationship. It is calculated as the
average of the product of the differences between each variable's value and the mean of that
variable.

Correlation: Correlation is a standardized measure of the strength and direction of the linear
relationship between two variables. It ranges from -1 to 1, where 1 indicates a perfect positive
linear relationship, -1 indicates a perfect negative linear relationship, and 0 indicates no linear
relationship.

Correlation is less sensitive to the scale of the variables compared to covariance.

Pearson Coefficient: The Pearson correlation coefficient, often referred to as the Pearson r, is
a specific type of correlation coefficient. It measures the linear relationship between two
variables on a scale from -1 to 1. Calculated as the covariance of the variables divided by the
product of their standard deviations.
3.6 Self-assessment Questions

Multiple Choice Questions (MCQs):

What does descriptive statistics aim to achieve?

a. Predict future outcomes

b. Summarize and describe essential features of a dataset

c. Conduct hypothesis testing

d. Implement machine learning algorithms

Answer: b. Summarize and describe essential features of a dataset

In univariate analysis, what measures are used to describe a single variable?

a. Covariance and correlation

b. Variance and standard deviation

c. Mean, median, and mode

d. Scatter plots and line plots

Answer: c. Mean, median, and mode

What does covariance measure between two variables?

a. Strength of linear relationship

b. Spread or dispersion

c. Change in one variable concerning another

d. Relationship between more than two variables

Answer: c. Change in one variable concerning another

What is the range of the Pearson correlation coefficient (r)?


a. 0 to 1

b. -1 to 1

c. -∞ to ∞

d. 1 to ∞

Answer: b. -1 to 1

What is the primary goal of multivariate analysis?

a. Explore relationships between pairs of variables

b. Describe a single variable

c. Analyze multiple variables simultaneously

d. Summarize and describe essential features of a dataset

Answer: c. Analyze multiple variables simultaneously

Which of the following is a measure of the center or average of a dataset?

a. Variance

b. Standard Deviation

c. Median

d. Correlation Coefficient

Answer: c. Median

What does the correlation coefficient (r) value of -1 indicate?

a. No linear relationship

b. Perfect positive linear relationship

c. Perfect negative linear relationship


d. Weak correlation

Answer: c. Perfect negative linear relationship

What does the standard deviation measure in descriptive statistics?

a. Spread or dispersion of data points

b. Strength of linear relationship

c. Change in one variable concerning another

d. Center or average of a dataset

Answer: a. Spread or dispersion of data points

What is the primary purpose of conducting bivariate analysis?

a. Explore relationships between pairs of variables

b. Summarize and describe essential features of a dataset

c. Analyze multiple variables simultaneously

d. Predict future outcomes

Answer: a. Explore relationships between pairs of variables

What is the significance of outliers in data analysis?

a. They improve the accuracy of analysis

b. They indicate a perfect linear relationship

c. They significantly deviate from the majority and may affect results

d. They are not relevant in statistical analysis

Answer: c. They significantly deviate from the majority and may affect results
