1/22/25, 8:14 PM Finlatics project 2 .ipynb - Colab
FINLATICS Project 2
In this project we analyse the Wine Quality dataset.
# importing necessary libraries
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
# importing the data set
df = pd.read_csv('/content/wine_data.csv')
# data preprocessing
df.head()
   fixed acidity  volatile acidity  citric acid  residual sugar  chlorides  free sulfur dioxide  total sulfur dioxide  density    pH  sulphates  alcohol  q
0            7.4              0.70         0.00             1.9      0.076                 11.0                  34.0   0.9978  3.51       0.56      9.4
1            7.8              0.88         0.00             2.6      0.098                 25.0                  67.0   0.9968  3.20       0.68      9.8
2            7.8              0.76         0.04             2.3      0.092                 15.0                  54.0   0.9970  3.26       0.65      9.8
3           11.2              0.28         0.56             1.9      0.075                 17.0                  60.0   0.9980  3.16       0.58      9.8
4            7.4              0.70         0.00             1.9      0.076                 11.0                  34.0   0.9978  3.51       0.56      9.4
# Checking info about the dataset
df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1599 entries, 0 to 1598
Data columns (total 12 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 fixed acidity 1599 non-null float64
1 volatile acidity 1599 non-null float64
2 citric acid 1599 non-null float64
3 residual sugar 1599 non-null float64
4 chlorides 1599 non-null float64
5 free sulfur dioxide 1599 non-null float64
6 total sulfur dioxide 1599 non-null float64
7 density 1599 non-null float64
8 pH 1599 non-null float64
9 sulphates 1599 non-null float64
10 alcohol 1599 non-null float64
11 quality 1599 non-null int64
dtypes: float64(11), int64(1)
memory usage: 150.0 KB
# Checking for missing values and duplicates
print(df.isnull().sum())
print("checking duplicate rows")
print(df.duplicated().sum())
print("describing data")
print(df.describe())
fixed acidity 0
volatile acidity 0
citric acid 0
residual sugar 0
chlorides 0
free sulfur dioxide 0
total sulfur dioxide 0
density 0
pH 0
sulphates 0
alcohol 0
quality 0
dtype: int64
checking duplicate rows
240
describing data
fixed acidity volatile acidity citric acid residual sugar \
count 1599.000000 1599.000000 1599.000000 1599.000000
mean 8.319637 0.527821 0.270976 2.538806
std 1.741096 0.179060 0.194801 1.409928
min 4.600000 0.120000 0.000000 0.900000
25% 7.100000 0.390000 0.090000 1.900000
50% 7.900000 0.520000 0.260000 2.200000
75% 9.200000 0.640000 0.420000 2.600000
max 15.900000 1.580000 1.000000 15.500000
chlorides free sulfur dioxide total sulfur dioxide density \
count 1599.000000 1599.000000 1599.000000 1599.000000
mean 0.087467 15.874922 46.467792 0.996747
std 0.047065 10.460157 32.895324 0.001887
min 0.012000 1.000000 6.000000 0.990070
25% 0.070000 7.000000 22.000000 0.995600
50% 0.079000 14.000000 38.000000 0.996750
75% 0.090000 21.000000 62.000000 0.997835
max 0.611000 72.000000 289.000000 1.003690
pH sulphates alcohol quality
count 1599.000000 1599.000000 1599.000000 1599.000000
mean 3.311113 0.658149 10.422983 5.636023
std 0.154386 0.169507 1.065668 0.807569
min 2.740000 0.330000 8.400000 3.000000
25% 3.210000 0.550000 9.500000 5.000000
50% 3.310000 0.620000 10.200000 6.000000
75% 3.400000 0.730000 11.100000 6.000000
max 4.010000 2.000000 14.900000 8.000000
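The duplicate check above reports 240 fully duplicated rows, and the notebook keeps them. A minimal sketch of how such rows could be counted and dropped, using a small made-up frame (the name `toy` and its values are illustrative and not taken from wine_data.csv):

```python
import pandas as pd

# Toy frame standing in for the wine data (illustrative values only)
toy = pd.DataFrame({
    "alcohol": [9.4, 9.8, 9.4, 10.0],
    "quality": [5, 5, 5, 6],
})

# duplicated() flags every row that repeats an earlier row exactly
n_dupes = toy.duplicated().sum()

# drop_duplicates() keeps the first occurrence of each repeated row
deduped = toy.drop_duplicates().reset_index(drop=True)

print("duplicate rows:", n_dupes)
print("rows before:", len(toy), "rows after:", len(deduped))
```

Whether dropping is appropriate depends on whether the repeats are data-entry copies or distinct wines with identical measurements, which the dataset alone does not say.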
1. What is the most frequently occurring wine quality? What are the highest and the lowest values in the
quality column?
# Most frequently occurring wine quality
most_frequent_quality = df['quality'].mode()[0]
quality_count = df['quality'].value_counts()
# Highest and lowest values in the 'quality' column
highest_quality = df['quality'].max()
lowest_quality = df['quality'].min()
print("Most frequent wine quality : ",most_frequent_quality)
print("Frequency of each wine quality : ", quality_count)
print("Highest wine quality : " ,highest_quality)
print("Lowest wine quality : " ,lowest_quality)
Most frequent wine quality : 5
Frequency of each wine quality : quality
5 681
6 638
7 199
4 53
8 18
3 10
Name: count, dtype: int64
Highest wine quality : 8
Lowest wine quality : 3
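To put the counts above in proportion, the following sketch converts them to shares of the 1599 samples. The counts are copied from the printed output; on the real DataFrame, `df['quality'].value_counts(normalize=True)` should give the same shares directly.

```python
import pandas as pd

# Counts copied from the value_counts() output above
quality_count = pd.Series({5: 681, 6: 638, 7: 199, 4: 53, 8: 18, 3: 10}, name="count")

# Share of all samples that falls in each quality grade
share = (quality_count / quality_count.sum()).round(3)
print(share)

# Combined share of the two most common grades (5 and 6)
top_two_share = (quality_count[5] + quality_count[6]) / quality_count.sum()
print(f"grades 5 and 6 together: {top_two_share:.1%}")
```

Grades 5 and 6 together cover roughly 82% of the wines, so the quality column is strongly concentrated in the middle of its range.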
2. How is fixed acidity correlated to the quality of the wine? How does the alcohol content affect the quality?
How is the free Sulphur dioxide content correlated to the quality of the wine?
# finding correlations between given features
corr_fixed_acidity = df['fixed acidity'].corr(df['quality'])
corr_alcohol = df['alcohol'].corr(df['quality'])
corr_free_sulfur_dioxide = df['free sulfur dioxide'].corr(df['quality'])
print("correlation between fixed acidity and quality of wine : ",corr_fixed_acidity)
print("correlation between alcohol and quality of wine : ",corr_alcohol)
print("correlation between free sulfur dioxide and quality of wine : ",corr_free_sulfur_dioxide)
correlation between fixed acidity and quality of wine : 0.12405164911322428
correlation between alcohol and quality of wine : 0.4761663239995365
correlation between free sulfur dioxide and quality of wine : -0.0506560572442763
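The values above are Pearson coefficients, the default of `Series.corr`. Because quality is an ordinal score, a rank-based (Spearman) coefficient is a reasonable cross-check. The sketch below computes both for every feature at once on synthetic data; the frame `toy`, its columns, and the way `quality` is generated are assumptions made purely for illustration.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
n = 200

# Synthetic stand-in data: a score loosely driven by alcohol, plus an unrelated column
alcohol = rng.normal(10.4, 1.0, n)
raw_score = 5.6 + 0.4 * (alcohol - 10.4) + rng.normal(0, 0.7, n)
quality = np.clip(np.round(raw_score), 3, 8).astype(int)
toy = pd.DataFrame({"alcohol": alcohol, "noise": rng.normal(size=n), "quality": quality})

# Correlation of every feature with quality, linear (Pearson) and rank-based (Spearman)
pearson = toy.corr(method="pearson")["quality"].drop("quality")
spearman = toy.corr(method="spearman")["quality"].drop("quality")

print(pd.DataFrame({"pearson": pearson, "spearman": spearman}))
```

On the notebook's data, replacing `toy` with `df` would rank all eleven features by their association with quality in one call.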
# visualizing the given correlations
import seaborn as sns
# a wide figure gives each of the three side-by-side panels enough room
plt.figure(figsize=(15, 5))
# Fixed acidity vs Quality
plt.subplot(1, 3, 1)
sns.scatterplot(x='fixed acidity', y='quality', data=df, alpha=0.5)
plt.title('Fixed Acidity vs Quality')
plt.xlabel('Fixed Acidity')
plt.ylabel('Quality')
# Alcohol vs Quality
plt.subplot(1, 3, 2)
sns.scatterplot(x='alcohol', y='quality', data=df, alpha=0.5, color='orange')
plt.title('Alcohol vs Quality')
plt.xlabel('Alcohol')
plt.ylabel('Quality')
# Free Sulfur Dioxide vs Quality
plt.subplot(1, 3, 3)
sns.scatterplot(x='free sulfur dioxide', y='quality', data=df, alpha=0.5, color='green')
plt.title('Free Sulfur Dioxide vs Quality')
plt.xlabel('Free Sulfur Dioxide')
plt.ylabel('Quality')
plt.tight_layout()
plt.show()
3. What is the average residual sugar for the best quality wine and the lowest quality wine in the dataset?
# average residual sugar for the best quality wine and the lowest quality wine
residual_sugar_best_quality = df[df['quality'] == df['quality'].max()]['residual sugar'].mean()
residual_sugar_lowest_quality = df[df['quality'] == df['quality'].min()]['residual sugar'].mean()
print("Average residual sugar for the best quality wine : ",residual_sugar_best_quality)
print("Average residual sugar for the lowest quality wine : ",residual_sugar_lowest_quality)
Average residual sugar for the best quality wine : 2.5777777777777775
Average residual sugar for the lowest quality wine : 2.6350000000000002
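The two averages above come from the smallest groups in the data (18 wines at quality 8 and 10 at quality 3, per the frequency table), so it can help to see the mean next to the group size for every grade. A minimal sketch with `groupby`, on a made-up frame whose values are illustrative only:

```python
import pandas as pd

# Toy frame standing in for the wine data (illustrative values only)
toy = pd.DataFrame({
    "quality": [3, 3, 5, 5, 8, 8],
    "residual sugar": [2.0, 3.0, 2.5, 2.1, 1.8, 3.4],
})

# Mean residual sugar and number of wines for each quality grade
summary = toy.groupby("quality")["residual sugar"].agg(["mean", "count"])
print(summary)
```

With groups this small, a difference of a few hundredths between the two averages is within what chance alone could produce.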
4. Does volatile acidity have an effect on the quality of the wine samples in the dataset?
# correlation of volatile acidity and wine quality
corr_volatile_acidity = df['volatile acidity'].corr(df['quality'])
print("correlation between volatile acidity and wine quality : ",corr_volatile_acidity)
# Scatter plot to visualize the relationship
plt.figure(figsize=(8, 5))
sns.scatterplot(x='volatile acidity', y='quality', data=df, alpha=0.5, color='green')
plt.title('Volatile Acidity vs Wine Quality')
plt.xlabel('Volatile Acidity')
plt.ylabel('Wine Quality')
plt.show()
correlation between volatile acidity and wine quality : -0.390557780264007
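One way to read this coefficient is through its square, which for a simple linear fit is the fraction of variance in one variable that is linearly associated with the other. A short worked calculation using the value printed above:

```python
# Pearson coefficient printed above
r = -0.390557780264007

# r squared: share of the variance in quality linearly associated with volatile acidity
r_squared = r ** 2
print(f"r^2 = {r_squared:.3f}")  # prints r^2 = 0.153
```

So higher volatile acidity tends to go with lower quality, and it accounts for roughly 15% of the variation in the scores. A correlation of this kind describes an association and does not by itself show that volatile acidity causes the lower scores.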
5. Train a Decision Tree model and Random Forest Model separately to predict the Quality of the given samples
of wine. Compare the Accuracy scores for both models.
# to compare the models, train both on the same training set and score both on the same test set
# import the two classifiers, the train/test split helper, and the accuracy metric
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
# Splitting data into features (X) and target (y)
X = df.drop(columns=['quality'])
y = df['quality']
# Train-test split (80% train, 20% test)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=3)
# Decision Tree Model
dt_model = DecisionTreeClassifier(random_state=3)
# fitting and training model
dt_model.fit(X_train, y_train)
y_pred_dt = dt_model.predict(X_test)
# accuracy score of decision tree model
dt_accuracy = accuracy_score(y_test, y_pred_dt)
print("accuracy score of decision tree model for the wine data : ",dt_accuracy)
# Random Forest Model
rf_model = RandomForestClassifier(random_state=3)
# fitting and training model
rf_model.fit(X_train, y_train)
y_pred_rf = rf_model.predict(X_test)
# accuracy score of random forest model
rf_accuracy = accuracy_score(y_test, y_pred_rf)
print("accuracy score of random forest model for the wine data : ",rf_accuracy)
# comparing accuracy score of both models
print("for the given wine data")
print("accuracy score of decision tree model : ",dt_accuracy)
print("accuracy score of random forest model : ",rf_accuracy)
if dt_accuracy > rf_accuracy:
    print("Decision Tree model performs better.")
elif dt_accuracy < rf_accuracy:
    print("Random Forest model performs better.")
else:
    print("Both models have the same accuracy.")
accuracy score of decision tree model for the wine data : 0.675
accuracy score of random forest model for the wine data : 0.725
for the given wine data
accuracy score of decision tree model : 0.675
accuracy score of random forest model : 0.725
Random Forest model performs better.
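The comparison above rests on a single 80/20 split, so the gap between 0.675 and 0.725 may shift with a different `random_state`. A hedged sketch of a more stable comparison using stratified 5-fold cross-validation follows. It runs on synthetic data from `make_classification`; `X_toy`, `y_toy`, and the reduced `n_estimators=50` are assumptions chosen to keep the example small, not settings from the notebook.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the wine features and quality labels (illustration only)
X_toy, y_toy = make_classification(
    n_samples=300, n_features=11, n_informative=5, n_classes=3, random_state=3
)

# Stratified folds keep the class proportions similar in every fold
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=3)

models = {
    "decision tree": DecisionTreeClassifier(random_state=3),
    "random forest": RandomForestClassifier(n_estimators=50, random_state=3),
}

# One accuracy score per fold for each model
scores = {
    name: cross_val_score(model, X_toy, y_toy, cv=cv, scoring="accuracy")
    for name, model in models.items()
}

for name, fold_scores in scores.items():
    print(f"{name}: mean={fold_scores.mean():.3f} std={fold_scores.std():.3f}")
```

Passing the notebook's `X` and `y` in place of `X_toy` and `y_toy` would give a mean and spread for each model, which shows whether the Random Forest's advantage holds across folds.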