Questions
1. Titanic Survival Dataset Analysis:
o Load the Titanic Survival dataset.
o Display 5 sample observations from the dataset.
o Check and display the dataset information.
o Count the number of survivors based on gender (sex-wise
survival count).
o Check if there are any null values in the dataset.
o Plot a count plot showing the survival count passenger-wise.
o Create a strip plot of Age vs. Sex with hue as Survival status.
Identify the key factors influencing survival.
o Plot a pie chart for survival status and identify the percentage of
survivors.
2. Dummy Variable Creation:
o Create a DataFrame in Python using the following dataset:
Item Color
Item1 Red
Item2 Green
Item3 Blue
Item4 Red
Item5 Green
o Generate dummy variables for the Color column.
o Display the resulting DataFrame with the dummy variables.
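One possible approach to the dummy-variable task is sketched below. It assumes pandas is available; the variable names df and dummies are illustrative, and dtype=int is chosen here only so the output shows 0/1 rather than True/False.

```python
import pandas as pd

# Build the DataFrame from the table given in the question.
df = pd.DataFrame({
    "Item": ["Item1", "Item2", "Item3", "Item4", "Item5"],
    "Color": ["Red", "Green", "Blue", "Red", "Green"],
})

# One dummy column per distinct colour; the original Color column is replaced.
dummies = pd.get_dummies(df, columns=["Color"], dtype=int)

print(dummies)
```

The new columns are named Color_Blue, Color_Green and Color_Red, in alphabetical order of the category values.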
3. Untidy to Tidy Data Transformation:
o Consider the following dataset:
Country  Year  Population  GDP
USA      2010  308         14992
USA      2011  311         15543
Canada   2010  34          1536
Canada   2011  35          1601
o Convert this untidy dataset into a tidy format such that the
variable names (Population and GDP) appear in one column and
their respective values are listed in another column.
o Display the tidy dataset.
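A minimal sketch of the reshaping step, assuming pandas and its melt method. The output column names Variable and Value are assumptions made for illustration; the question does not prescribe them.

```python
import pandas as pd

# The untidy (wide) dataset from the question.
wide = pd.DataFrame({
    "Country": ["USA", "USA", "Canada", "Canada"],
    "Year": [2010, 2011, 2010, 2011],
    "Population": [308, 311, 34, 35],
    "GDP": [14992, 15543, 1536, 1601],
})

# Keep Country and Year as identifiers; stack Population and GDP
# into a name column ("Variable") and a value column ("Value").
tidy = wide.melt(
    id_vars=["Country", "Year"],
    value_vars=["Population", "GDP"],
    var_name="Variable",
    value_name="Value",
)

print(tidy)
```

The result has eight rows: the four Population rows first, followed by the four GDP rows.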
4. Winsorization Method:
o For the given data [10, 15, 20, 25, 100, 150, 200], use the
Winsorization method to replace the outliers: values below the 5th
percentile are set to the 5th percentile, and values above the 95th
percentile are set to the 95th percentile.
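The Winsorization step could be sketched as follows, assuming NumPy and its default percentile definition (linear interpolation between neighbouring data points). Other tools use other conventions, for example replacing outliers with the nearest retained data value, so the exact bounds depend on that choice.

```python
import numpy as np

data = np.array([10, 15, 20, 25, 100, 150, 200], dtype=float)

# Percentile bounds under NumPy's default (linear interpolation) method.
lower = np.percentile(data, 5)
upper = np.percentile(data, 95)

# Values below the lower bound are raised to it; values above the
# upper bound are lowered to it. Everything in between is unchanged.
winsorized = np.clip(data, lower, upper)

print(lower, upper)
print(winsorized)
```

Under this definition the bounds are 11.5 and 185.0, so only the smallest value (10) and the largest value (200) are changed.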
5. Missing Value Imputation:
o For the given dataset:
Feature 1  Feature 2
5          12
7          NaN
3          8
NaN        15
8          6
10         9
6          NaN
NaN        5
9          11
o Replace the missing values with the mean of their respective
columns.
o Replace the missing values with the median of their respective
columns.
o Replace the missing values using the K-Nearest Neighbors (KNN)
imputation method.
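The mean and median parts of this question can be sketched with pandas alone, as below. The names df, mean_filled and median_filled are illustrative. KNN imputation is not shown; it would typically rely on a dedicated library such as scikit-learn's KNNImputer.

```python
import numpy as np
import pandas as pd

# The dataset from the question, with NaN marking the missing values.
df = pd.DataFrame({
    "Feature 1": [5, 7, 3, np.nan, 8, 10, 6, np.nan, 9],
    "Feature 2": [12, np.nan, 8, 15, 6, 9, np.nan, 5, 11],
})

# df.mean() and df.median() skip NaN by default and return one value
# per column; fillna then applies each value to its own column.
mean_filled = df.fillna(df.mean())
median_filled = df.fillna(df.median())

print(mean_filled)
print(median_filled)
```

Here the column means are 48/7 (about 6.86) and 66/7 (about 9.43), and the column medians are 7 and 9.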
Question: Exploratory Data Analysis (EDA) and Feature Engineering
a) Load the dataset Insurance Charges Prediction.csv into a DataFrame
and perform the following:
o Display the first 5 rows of the dataset.
o Display the dataset information.
o Provide the statistical summary of the dataset for numerical
features.
o Provide the statistical summary of the dataset for categorical
features.
b) Perform Univariate Analysis:
o Plot histograms for all numerical columns in the dataset.
c) Perform Bivariate Analysis:
o Visualize the distribution of charges based on:
Gender (sex) using a boxplot.
Region (region) using a boxplot.
Smoking status (smoker) using a boxplot.
o Plot a count plot to show the distribution of smoker status with
hue as sex.
o Create scatter plots for:
Age vs. Charges.
BMI vs. Charges.
d) Perform Correlation Analysis:
o Filter out the numerical columns.
o Calculate and display the correlation matrix for numerical
variables.
o Visualize the correlation matrix using a heatmap.
e) Filter the categorical variables:
o Extract only categorical features.
o Display the names of the categorical columns.
f) Perform Feature Engineering:
o Use the pd.get_dummies function to encode the categorical
variables into dummy variables.
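Parts d) to f) could be sketched as below. Because the CSV file itself is not reproduced here, the sketch uses a tiny hypothetical DataFrame: the column names follow those mentioned in the question (age, sex, bmi, smoker, region, charges), but the rows are invented purely for illustration and are not taken from the real dataset.

```python
import pandas as pd

# Hypothetical stand-in rows; in practice the data would come from
# pd.read_csv("Insurance Charges Prediction.csv").
df = pd.DataFrame({
    "age": [19, 33, 45, 52],
    "sex": ["female", "male", "male", "female"],
    "bmi": [27.9, 22.7, 30.1, 25.4],
    "smoker": ["yes", "no", "no", "yes"],
    "region": ["southwest", "northwest", "southeast", "northeast"],
    "charges": [16884.0, 4500.0, 8200.0, 24000.0],
})

# d) Correlation matrix over the numerical columns only.
corr = df.select_dtypes(include="number").corr()

# e) Names of the non-numerical (categorical) columns.
cat_cols = df.select_dtypes(exclude="number").columns.tolist()

# f) Encode the categorical columns as dummy variables.
encoded = pd.get_dummies(df, columns=cat_cols, dtype=int)

print(corr)
print(cat_cols)
print(encoded.head())
```

Selecting with exclude="number" rather than a specific text dtype is a design choice made here so that the sketch does not depend on how a given pandas version stores strings.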
Question: Linear Regression Model and Visualization
a) Given the following dataset:
x y
5 5
15 20
25 14
35 32
45 22
55 38
b) Plot a scatter plot to visualize the relationship between x and y.
c) Fit a Linear Regression model using x and y.
o Calculate the coefficient of determination (R²) to evaluate the
goodness of fit.
o Display the intercept and coefficient of the linear regression
model.
d) Predict the dependent variable (y) for the following new values of the
independent variable (x): 8, 15, and 35.
e) Plot the original data points and overlay the fitted regression line.
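The fitting and prediction steps in parts c) and d) can be sketched with NumPy's least-squares polynomial fit, used here in place of a dedicated regression library; an ordinary least-squares linear regression from any library should give the same coefficients. The plotting steps are left out, and all variable names are illustrative.

```python
import numpy as np

# The dataset from part a).
x = np.array([5, 15, 25, 35, 45, 55], dtype=float)
y = np.array([5, 20, 14, 32, 22, 38], dtype=float)

# Degree-1 least-squares fit; polyfit returns [slope, intercept].
slope, intercept = np.polyfit(x, y, 1)

# Coefficient of determination: 1 - SS_res / SS_tot.
y_hat = intercept + slope * x
ss_res = np.sum((y - y_hat) ** 2)
ss_tot = np.sum((y - y.mean()) ** 2)
r2 = 1 - ss_res / ss_tot

# Predictions for the new x values given in part d).
x_new = np.array([8, 15, 35], dtype=float)
y_new = intercept + slope * x_new

print("intercept:", intercept)
print("coefficient:", slope)
print("R^2:", r2)
print("predictions:", y_new)
```

For this data the slope is 0.54, the intercept is about 5.633, and R² is about 0.716. The fitted line for part e) is the set of points (x, y_hat).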