EXP.NO: 01
DATE: 23.01.2025 PERFORM DATA EXPLORATION AND PREPROCESSING
AIM:
To write a Python program that performs data exploration and preprocessing on the
uploaded dataset.
ALGORITHM:
Step 1: Start the program
Step 2: Import the necessary python libraries
Step 3: Load the dataset from the current file directory
Step 4: Perform data exploration and data preprocessing for the loaded dataset
Step 5: Display the output
Step 6: Stop the program
CODE:
import pandas as pd

# Show every row and column when printing DataFrames
pd.set_option("display.max_rows", None)
pd.set_option("display.max_columns", None)
pd.set_option("display.width", None)

file_path = '/content/traffic_accidects.csv'
df = pd.read_csv(file_path)

# --- Exploration ---
print("First few rows of the dataset:")
print(df.head())
print("\nSummary Statistics:")
print(df.describe(include="all"))
print("\nMissing Values:")
print(df.isnull().sum())

# --- Preprocessing ---
# Impute numeric columns: median for Age, mean for Salary
if 'Age' in df.columns:
    df['Age'] = df['Age'].fillna(df['Age'].median())
if 'Salary' in df.columns:
    df['Salary'] = df['Salary'].fillna(df['Salary'].mean())

# Parse dates; missing or unparseable values become NaT
if 'AccidentDate' in df.columns:
    df['AccidentDate'] = pd.to_datetime(df['AccidentDate'], errors='coerce')

# Encode Gender as 0/1 (any other value becomes NaN)
if 'Gender' in df.columns:
    df['Gender'] = df['Gender'].map({'Male': 0, 'Female': 1})

# Drop rows with no severity score; copy so later column assignments are safe
if 'SeverityScore' in df.columns:
    df = df.dropna(subset=['SeverityScore']).copy()

# Derived feature: calendar years elapsed since the accident
if 'AccidentDate' in df.columns:
    current_year = pd.Timestamp.now().year
    df['YearsSinceAccident'] = current_year - df['AccidentDate'].dt.year

# Remove Salary outliers with the 1.5 * IQR rule
if 'Salary' in df.columns:
    Q1 = df['Salary'].quantile(0.25)
    Q3 = df['Salary'].quantile(0.75)
    IQR = Q3 - Q1
    lower_bound = Q1 - 1.5 * IQR
    upper_bound = Q3 + 1.5 * IQR
    df = df[(df['Salary'] >= lower_bound) & (df['Salary'] <= upper_bound)]

print("\nCleaned Dataset:")
print(df.head().to_string())
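The 1.5 * IQR rule applied to Salary above can be checked on a small made-up sample. The following is a minimal sketch using NumPy; the helper name iqr_bounds and the sample values are hypothetical and are not taken from the experiment's dataset. np.percentile interpolates linearly by default, the same default used by the pandas quantile method.

```python
import numpy as np

def iqr_bounds(values, k=1.5):
    """Return the (lower, upper) fences of the k * IQR rule for a 1-D sample."""
    q1, q3 = np.percentile(values, [25, 75])
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr

# Hypothetical salaries; 250 is an obvious outlier
salaries = np.array([30, 32, 35, 36, 38, 40, 41, 250], dtype=float)
lower, upper = iqr_bounds(salaries)

# Keep only the values that fall inside the fences
kept = salaries[(salaries >= lower) & (salaries <= upper)]
print(lower, upper)
print(kept)
```

With this sample the fences are 25.25 and 49.25, so only the value 250 is dropped.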
OUTPUT:
Particulars            Marks Allotted   Marks Awarded
Program / Simulation   40
Program Execution      30
Result                 20
Viva Voce              10
Total                  100
RESULT:
Thus, a program for data exploration and preprocessing has been successfully
executed.
EXP.NO: 02 (a)
DATE: 30.01.2025 Implement linear and logistic regression
1). Linear regression:
a). Simple linear regression:
AIM:
To write a Python program that implements simple linear regression, fitting a
straight line through the data points.
ALGORITHM:
Step 1: Start the program
Step 2: Import the necessary python libraries
Step 3: Load the dataset and implement simple linear regression using the formula
Step 4: Display the output
Step 5: Stop the program
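The formula mentioned in Step 3 can be sketched in plain Python. For one feature, the least-squares line y = slope * x + intercept has slope = sum((x - mean_x) * (y - mean_y)) / sum((x - mean_x)^2) and intercept = mean_y - slope * mean_x. The function name and the data below are hypothetical and serve only as an illustration.

```python
def fit_simple_linear_regression(xs, ys):
    """Least-squares slope and intercept for y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sum of cross-deviations and sum of squared deviations of x
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept

# Hypothetical data lying exactly on y = 2x + 1
xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [3.0, 5.0, 7.0, 9.0, 11.0]
slope, intercept = fit_simple_linear_regression(xs, ys)
print(f"slope={slope:.2f}, intercept={intercept:.2f}")  # slope=2.00, intercept=1.00
```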
CODE:
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.linear_model import Ridge, Lasso
from sklearn.metrics import mean_squared_error, r2_score
import matplotlib.pyplot as plt
from sklearn.preprocessing import StandardScaler
file_path = "/content/datasets/house_prices.csv"
df = pd.read_csv(file_path)
df = df.drop(columns=['id', 'date'])
df['bedrooms'] = df['bedrooms'].fillna(df['bedrooms'].median())
df['bathrooms'] = df['bathrooms'].fillna(df['bathrooms'].median())
df['sqft_living'] = df['sqft_living'].fillna(df['sqft_living'].median())
df['sqft_lot'] = df['sqft_lot'].fillna(df['sqft_lot'].median())
df['waterfront'] = df['waterfront'].fillna(df['waterfront'].mode()[0])
df['view'] = df['view'].fillna(df['view'].mode()[0])
df['condition'] = df['condition'].fillna(df['condition'].mode()[0])
df['grade'] = df['grade'].fillna(df['grade'].mode()[0])
df['age_of_house'] = 2025 - df['yr_built']
# Houses that were never renovated have yr_renovated == 0; record 0 years for them
df['time_since_renovation'] = (2025 - df['yr_renovated']).where(df['yr_renovated'] != 0, 0)
df['total_sqft'] = df['sqft_living'] + df['sqft_basement']
df['bedrooms_sqft'] = df['bedrooms'] * df['sqft_living']
df = pd.get_dummies(df, columns=['waterfront', 'view', 'condition', 'zipcode'],
drop_first=True)
X = df.drop(columns=['price'])
y = df['price']
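# Split first, then fit the scaler on the training set only to avoid test-set leakage
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2,
                                                    random_state=42)
scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)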
ridge_model = Ridge(alpha=1.0)
lasso_model = Lasso(alpha=0.1)
ridge_model.fit(X_train, y_train)
lasso_model.fit(X_train, y_train)
y_pred_ridge = ridge_model.predict(X_test)
y_pred_lasso = lasso_model.predict(X_test)
mse_ridge = mean_squared_error(y_test, y_pred_ridge)
rmse_ridge = np.sqrt(mse_ridge)
r2_ridge = r2_score(y_test, y_pred_ridge)
mse_lasso = mean_squared_error(y_test, y_pred_lasso)
rmse_lasso = np.sqrt(mse_lasso)
r2_lasso = r2_score(y_test, y_pred_lasso)
print("Ridge Regression Model Evaluation:")
print(f"MSE: {mse_ridge:.2f}")
print(f"RMSE: {rmse_ridge:.2f}")
print(f"R-squared: {r2_ridge:.2f}")
print("\nLasso Regression Model Evaluation:")
print(f"MSE: {mse_lasso:.2f}")
print(f"RMSE: {rmse_lasso:.2f}")
print(f"R-squared: {r2_lasso:.2f}")
plt.figure(figsize=(10, 6))
plt.scatter(y_test, y_pred_ridge, color='blue', alpha=0.6, label="Ridge")
plt.plot([y_test.min(), y_test.max()], [y_test.min(), y_test.max()], 'r--', linewidth=2)
plt.title('Actual vs Predicted Housing Prices (Ridge Regression)', fontsize=16)
plt.xlabel('Actual Housing Price', fontsize=14)
plt.ylabel('Predicted Housing Price', fontsize=14)
plt.legend()
plt.grid()
plt.show()
plt.figure(figsize=(10, 6))
plt.scatter(y_test, y_pred_lasso, color='green', alpha=0.6, label="Lasso")
plt.plot([y_test.min(), y_test.max()], [y_test.min(), y_test.max()], 'r--', linewidth=2)
plt.title('Actual vs Predicted Housing Prices (Lasso Regression)', fontsize=16)
plt.xlabel('Actual Housing Price', fontsize=14)
plt.ylabel('Predicted Housing Price', fontsize=14)
plt.legend()
plt.grid()
plt.show()
OUTPUT:
b). Multiple linear regression:
AIM:
To write a Python program that implements multiple linear regression, fitting a
linear model to data points described by several features.
ALGORITHM:
Step 1: Start the program
Step 2: Import the necessary python libraries
Step 3: Load the dataset and implement multiple linear regression using the formula
Step 4: Display the output
Step 5: Stop the program
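With several features, the formula in Step 3 becomes a least-squares problem over a design matrix whose first column is all ones (for the intercept). The following is a minimal sketch using NumPy's np.linalg.lstsq; the function name and the synthetic data are assumptions made for illustration and are not the experiment's dataset.

```python
import numpy as np

def fit_multiple_linear_regression(X, y):
    """Least-squares coefficients [b0, b1, ..., bp] for y ~ b0 + X @ [b1..bp]."""
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)
    # Prepend a column of ones so the first coefficient is the intercept
    design = np.column_stack([np.ones(len(X)), X])
    coeffs, *_ = np.linalg.lstsq(design, y, rcond=None)
    return coeffs

# Hypothetical data generated exactly from y = 1 + 2*x1 + 3*x2
X = np.array([[1, 2], [2, 1], [3, 4], [4, 3], [5, 5]], dtype=float)
y = 1 + 2 * X[:, 0] + 3 * X[:, 1]
coeffs = fit_multiple_linear_regression(X, y)
print(np.round(coeffs, 4))
```

Because the data are noise-free, the recovered coefficients are approximately 1, 2 and 3.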
CODE:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, confusion_matrix
file_path = "/content/house_prices.csv"
df = pd.read_csv(file_path)
price_threshold = 500000
df['price_above_threshold'] = (df['price'] > price_threshold).astype(int)
categorical_columns = ['waterfront', 'view', 'condition', 'grade', 'zipcode']
df = pd.get_dummies(df, columns=categorical_columns, drop_first=True)
features = ['bedrooms', 'bathrooms', 'sqft_living', 'sqft_lot', 'floors', 'sqft_above',
            'sqft_basement', 'yr_built', 'yr_renovated', 'lat', 'long', 'sqft_living15',
            'sqft_lot15']
# Add the one-hot columns created by get_dummies above
features += [col for col in df.columns if col.startswith(tuple(categorical_columns))]
X = df[features]
y = df['price_above_threshold']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)
model = LogisticRegression(max_iter=1000)
model.fit(X_train_scaled, y_train)
y_pred = model.predict(X_test_scaled)
accuracy = accuracy_score(y_test, y_pred)
conf_matrix = confusion_matrix(y_test, y_pred)
plt.figure(figsize=(6, 4))
sns.heatmap(conf_matrix, annot=True, fmt='d', cmap='Blues', xticklabels=["Below",
"Above"], yticklabels=["Below", "Above"])
plt.xlabel("Predicted Label")
plt.ylabel("True Label")
plt.title(f'Confusion Matrix (Accuracy: {accuracy:.2f})')
plt.show()
coefficients = model.coef_[0]
intercept = model.intercept_[0]
coeff_df = pd.DataFrame({'Feature': features, 'Coefficient':
coefficients}).sort_values(by='Coefficient', ascending=False)
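# Print the results explicitly so they also appear when run as a script
print(f"Accuracy: {accuracy:.2f}")
print("Confusion matrix:\n", conf_matrix)
print("Top 10 coefficients:\n", coeff_df.head(10))
print(f"Intercept: {intercept:.4f}")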
OUTPUT:
EXP.NO: 02 (b)
DATE: 30.01.2025 Implement linear and logistic regression
2). Logistic regression:
AIM:
To write a Python program that implements logistic regression, fitting a sigmoid
curve that models the probability of class membership.
ALGORITHM:
Step 1: Start the program
Step 2: Import the necessary python libraries
Step 3: Load the dataset and implement logistic regression using the formula
Step 4: Display the output
Step 5: Stop the program
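The formula in Step 3 consists of the sigmoid, sigma(z) = 1 / (1 + e^(-z)), and the gradient of the mean log-loss, which for weight w is the average of (p - y) * x and for bias b is the average of (p - y). The following is a minimal one-feature sketch in plain Python; the function names, learning rate, iteration count and data are hypothetical choices made for illustration.

```python
import math

def sigmoid(z):
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    ez = math.exp(z)
    return ez / (1.0 + ez)

def fit_logistic(xs, ys, lr=0.1, iters=1000):
    """Batch gradient descent on the mean log-loss for p = sigmoid(w * x + b)."""
    w, b = 0.0, 0.0
    n = len(xs)
    for _ in range(iters):
        errors = [sigmoid(w * x + b) - y for x, y in zip(xs, ys)]
        grad_w = sum(e * x for e, x in zip(errors, xs)) / n
        grad_b = sum(errors) / n
        w -= lr * grad_w
        b -= lr * grad_b
    return w, b

# Hypothetical data: negative x belongs to class 0, positive x to class 1
xs = [-2.0, -1.0, 1.0, 2.0]
ys = [0, 0, 1, 1]
w, b = fit_logistic(xs, ys)
print(sigmoid(w * 2.0 + b) > 0.5, sigmoid(w * -2.0 + b) < 0.5)
```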
CODE:
import numpy as np
import matplotlib.pyplot as plt
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import PolynomialFeatures
iris = load_iris()
X = iris.data[:, 0].reshape(-1, 1)
y = (iris.target == 0).astype(int)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)
poly = PolynomialFeatures(degree=2)
X_train_poly = poly.fit_transform(X_train)
X_test_poly = poly.transform(X_test)
def sigmoid(z):
    return 1 / (1 + np.exp(-z))

def gradient_descent(X, y, theta, lr=0.01, iters=1000):
    # Batch gradient descent on the mean log-loss
    for _ in range(iters):
        theta -= lr * (X.T @ (sigmoid(X @ theta) - y)) / len(y)
    return theta
theta = np.zeros(X_train_poly.shape[1])
theta_optimal = gradient_descent(X_train_poly, y_train, theta)
predictions = sigmoid(X_test_poly @ theta_optimal) >= 0.5
accuracy = np.mean(predictions == y_test)
print(f"Accuracy: {accuracy * 100:.2f}%")
x_values = np.linspace(X_train.min(), X_train.max(), 100).reshape(-1, 1)
x_poly = poly.transform(x_values)
y_values = sigmoid(x_poly @ theta_optimal)  # predicted probability of Setosa
plt.scatter(X_train, y_train, color='blue', label='Training data')
plt.plot(x_values, y_values, color='red', label='Predicted probability')
plt.xlabel('Sepal Length (cm)')
plt.ylabel('Setosa (1) vs Not Setosa (0)')
plt.title('Logistic Regression with Curved Decision Boundary')
plt.legend()
plt.grid(True)
plt.show()
OUTPUT:
Particulars            Marks Allotted   Marks Awarded
Program / Simulation   40
Program Execution      30
Result                 20
Viva Voce              10
Total                  100
RESULT:
Thus, a program for linear and logistic regression has been successfully executed.
EXP.NO: 03
DATE: 06.02.2025 Naive Bayes classifier
AIM:
To write a Python program that implements the Naive Bayes classifier, which classifies
data based on probability distributions.
ALGORITHM:
Step 1: Start the program
Step 2: Import the necessary Python libraries
Step 3: Load the dataset and preprocess the data
Step 4: Compute the prior probabilities and likelihood using Bayes' theorem
Step 5: Build the Naïve Bayes classifier and train it on the dataset
Step 6: Use the trained model to make predictions
Step 7: Display the output
Step 8: Stop the program
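Step 4 can be illustrated on a toy problem before it is applied to a real dataset. The following is a minimal sketch in plain Python with one feature and two classes; it scores each class by log(prior) + log(Gaussian likelihood) and picks the larger score. The class labels, sample values and function names are hypothetical.

```python
import math

def gaussian_log_pdf(x, mean, var):
    """Log of the normal density N(x; mean, var)."""
    return -0.5 * math.log(2 * math.pi * var) - (x - mean) ** 2 / (2 * var)

def mean_and_variance(values):
    """Population mean and variance of a list of numbers."""
    m = sum(values) / len(values)
    v = sum((x - m) ** 2 for x in values) / len(values)
    return m, v

# Hypothetical one-feature training data for two classes
samples = {"low": [1.0, 2.0, 3.0], "high": [7.0, 8.0, 9.0]}
total = sum(len(values) for values in samples.values())

def predict(x):
    """Return the class with the largest log posterior (up to a constant)."""
    scores = {}
    for label, values in samples.items():
        prior = len(values) / total
        mean, var = mean_and_variance(values)
        scores[label] = math.log(prior) + gaussian_log_pdf(x, mean, var)
    return max(scores, key=scores.get)

print(predict(2.5))  # low
print(predict(7.5))  # high
```

Working with log densities directly, rather than taking the log of a computed density, avoids taking log(0) when a density underflows.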
CODE:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.metrics import accuracy_score, confusion_matrix
from sklearn.model_selection import train_test_split
class NaiveBayesClassifier:
    """Gaussian Naive Bayes: each feature is modelled as a per-class normal distribution."""

    def __init__(self):
        self.class_priors = {}
        self.means = {}
        self.variances = {}
        self.classes = None

    def fit(self, X, y):
        self.classes = np.unique(y)
        for c in self.classes:
            X_c = X[y == c]
            self.class_priors[c] = len(X_c) / len(X)
            self.means[c] = np.mean(X_c, axis=0)
            # Small constant avoids division by zero for constant features
            self.variances[c] = np.var(X_c, axis=0) + 1e-9

    def gaussian_pdf(self, x, mean, variance):
        coeff = 1 / np.sqrt(2 * np.pi * variance)
        exponent = np.exp(-((x - mean) ** 2) / (2 * variance))
        return coeff * exponent

    def predict(self, X):
        predictions = []
        for x in X:
            posteriors = {}
            for c in self.classes:
                # Work in log space: log prior + sum of per-feature log likelihoods
                prior = np.log(self.class_priors[c])
                likelihood = np.sum(np.log(self.gaussian_pdf(x, self.means[c], self.variances[c])))
                posteriors[c] = prior + likelihood
            predictions.append(max(posteriors, key=posteriors.get))
        return np.array(predictions)
df = pd.read_csv('/content/house_prices2.csv')
# Integer-encode every text column
for col in df.columns:
    if df[col].dtype == 'object':
        df[col] = pd.factorize(df[col])[0]
X = df.iloc[:, :-1].values
y = df.iloc[:, -1].values
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
nb = NaiveBayesClassifier()
nb.fit(X_train, y_train)
y_pred = nb.predict(X_test)
accuracy = accuracy_score(y_test, y_pred)
print(f"Accuracy: {accuracy:.2f}")
conf_matrix = confusion_matrix(y_test, y_pred)
plt.figure(figsize=(6, 5))
sns.heatmap(conf_matrix, annot=True, fmt='d', cmap='Blues', xticklabels=np.unique(y),
yticklabels=np.unique(y))
plt.xlabel("Predicted Label")
plt.ylabel("True Label")
plt.title("Confusion Matrix for Naïve Bayes Classifier")
plt.show()
test_sizes = np.linspace(0.1, 0.5, 5)
accuracies = []
for test_size in test_sizes:
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=test_size,
                                                        random_state=42)
    nb.fit(X_train, y_train)
    y_pred = nb.predict(X_test)
    accuracies.append(accuracy_score(y_test, y_pred))
plt.figure(figsize=(7, 5))
plt.plot(test_sizes, accuracies, marker='o', linestyle='-', color='m',
         label="Naïve Bayes Accuracy")
plt.xlabel("Test Size")
plt.ylabel("Accuracy")
plt.title("Naïve Bayes Accuracy vs. Test Size")
plt.legend()
plt.grid()
plt.show()
OUTPUT:
Particulars            Marks Allotted   Marks Awarded
Program / Simulation   40
Program Execution      30
Result                 20
Viva Voce              10
Total                  100
RESULT:
Thus, the required Naïve Bayes model has been executed successfully.
EXP.NO: 04
DATE: 13.03.2025 POWER BI
AIM:
To build an analytical dashboard for e-commerce data.
ALGORITHM:
STEP 1: Load Data - Import the dataset into Power BI using Get Data and load it into a table.
STEP 2: Preprocess Data - Handle missing values, encode Gender, compute Experience.
STEP 3: Define Variables - Create the binary target variable SalaryAbove50K, set X and y.
STEP 4: Split Data - Divide the features and target into training and testing sets.
STEP 5: Train Model - Train a Naïve Bayes model using Power BI AI Insights.
STEP 6: Evaluate and Visualize - Compute accuracy, generate the confusion matrix, plot the ROC curve.
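The accuracy and confusion matrix named in STEP 6 can be cross-checked outside Power BI with a few lines of plain Python. This is only a sketch: the function name is hypothetical, and the two label lists are made-up stand-ins for the actual and predicted values of the SalaryAbove50K target.

```python
def confusion_counts(y_true, y_pred):
    """Return (tn, fp, fn, tp) for binary labels coded as 0 and 1."""
    tn = fp = fn = tp = 0
    for actual, predicted in zip(y_true, y_pred):
        if actual == 1 and predicted == 1:
            tp += 1
        elif actual == 0 and predicted == 0:
            tn += 1
        elif actual == 0 and predicted == 1:
            fp += 1
        else:
            fn += 1
    return tn, fp, fn, tp

# Hypothetical actual and predicted labels
y_true = [1, 0, 1, 1, 0, 0, 1, 0]
y_pred = [1, 0, 0, 1, 0, 1, 1, 0]

tn, fp, fn, tp = confusion_counts(y_true, y_pred)
accuracy = (tp + tn) / len(y_true)
print(f"TN={tn} FP={fp} FN={fn} TP={tp}")  # TN=3 FP=1 FN=1 TP=3
print(f"Accuracy: {accuracy:.2f}")         # Accuracy: 0.75
```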
OUTPUT:
MARK ALLOCATION:
Particulars            Marks Allotted   Marks Awarded
Program / Simulation   40
Program Execution      30
Result                 20
Viva Voce              10
Total                  100
RESULT:
Thus, the Zomato sales dataset has been successfully visualized using a Power BI
dashboard.