IIM Kashipur
Master of Business Administration
Analytics in Business, Term III, Academic Year 2024-26
AB Assignment 4
Section A
Submitted to: Prof. Rajiv Kumar
Submitted on: 09th February 2025
(Group 6)
MBA24040 – Yash Kulkarni
MBA24012 – Ashutosh Singh
MBA24019 – Devansh Rawat
MBA24307 – Rigzin Angmo
MBA24070 – Sumedh Inamdar
MBA24005 – Akash Bhanja
Q1. Clean the data (You may find 999 and 998 values, which need to be taken care of. You can
exclude those responses)
Code:
import pandas as pd
import matplotlib.pyplot as plt
file_path = "MBA Starting Salaries.xlsx"
xls = pd.ExcelFile(file_path)
df = pd.read_excel(xls, sheet_name="Salaries")
# Define columns to clean (assuming numerical values should not contain 999 or 998)
columns_to_clean = ["salary", "satis"] # You can add more columns if needed
# Remove rows where any of the selected columns contain 999 or 998
cleaned_df = df[~df[columns_to_clean].isin([999, 998]).any(axis=1)]
# Plotting the cleaned data
plt.figure(figsize=(8, 5))
plt.hist(cleaned_df['salary'], bins=20, color='orange', edgecolor='black')
plt.xlabel('Salary')
plt.ylabel('Frequency')
plt.title('Distribution of Cleaned Salaries')
plt.show()
Output:
[Figure: histogram "Distribution of Cleaned Salaries"; x-axis: Salary, y-axis: Frequency]
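The filter used above can be checked on a small made-up table before it is applied to the full file. The sketch below is an illustration only: the rows in `toy` are hypothetical values, not taken from the survey data, and the column names simply mirror those in the code above.

```python
import pandas as pd

# Hypothetical rows for illustration; not taken from "MBA Starting Salaries.xlsx".
toy = pd.DataFrame({
    "salary": [95000, 999, 105000, 998, 88000],
    "satis":  [6, 5, 998, 7, 5],
})

sentinels = [998, 999]      # codes treated as missing responses
cols = ["salary", "satis"]  # columns checked for sentinel codes

# True for any row holding a sentinel code in at least one checked column
has_sentinel = toy[cols].isin(sentinels).any(axis=1)
cleaned = toy[~has_sentinel]

print(f"Dropped {has_sentinel.sum()} of {len(toy)} rows")
print(cleaned)
```

Reporting how many rows were dropped makes it easier to confirm that the cleaning step removed only the coded non-responses.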
Q2. Is there any variable (e.g., age, gender, language spoken, work experience) that
affects how much a student can expect to make?
Code:
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
# Load and clean data
file_path = "MBA Starting Salaries.xlsx"
df = pd.read_excel(file_path, sheet_name="Salaries")
df = df[~df[["salary", "satis"]].isin([999, 998]).any(axis=1)]
# Correlation analysis
# numeric_only=True skips any non-numeric columns when computing correlations
salary_correlation = df.corr(numeric_only=True)["salary"].sort_values(ascending=False)
# Visualization: Boxplot of salary by gender
plt.figure(figsize=(8, 5))
sns.boxplot(x=df["sex"], y=df["salary"])
plt.xlabel("Gender (1=Male, 2=Female)")
plt.ylabel("Salary")
plt.title("Salary Distribution by Gender")
plt.show()
# Display correlation results
print(salary_correlation)
Output:
The correlation analysis reveals that most academic and experience-related factors do not
strongly influence starting salaries.
• Salary shows no significant correlation with GMAT scores, GPA, or work experience,
indicating that higher academic performance or prior experience does not
necessarily lead to higher pay.
• A weak but significant positive correlation is found between satisfaction and salary,
suggesting that students who earn higher salaries may feel more satisfied with the MBA
program.
• Work experience has a weak positive correlation with Spring GPA but a weak negative
correlation with GMAT scores, possibly indicating that students with more experience
scored slightly lower on GMAT but performed slightly better academically during the
program.
• GPA (Spring & Fall) shows a strong positive correlation, meaning students who perform
well in one semester tend to maintain their performance.
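The two checks used for this question, a comparison of salary across groups and a ranking of correlations with salary, can be sketched on a small made-up table. The values in `toy` are hypothetical and chosen only to make the output easy to read; the column names mirror those in the code above.

```python
import pandas as pd

# Hypothetical rows for illustration only; not taken from the survey file.
toy = pd.DataFrame({
    "sex":      [1, 1, 1, 2, 2, 2],   # 1 = Male, 2 = Female, as in the boxplot label
    "work_yrs": [2, 5, 3, 1, 4, 6],
    "salary":   [90000, 110000, 100000, 95000, 105000, 115000],
})

# Group comparison: numeric counterpart of the boxplot by gender
by_gender = toy.groupby("sex")["salary"].agg(["count", "mean", "median"])
print(by_gender)

# Correlation of every other numeric column with salary, strongest first
corr_with_salary = (
    toy.corr(numeric_only=True)["salary"]
    .drop("salary")
    .sort_values(ascending=False)
)
print(corr_with_salary)
```

Because `sex` is a category code rather than a quantity, the group summary is usually easier to interpret for that variable than its correlation coefficient.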
Q3. Do any correlations exist between the variables (e.g., GMAT Total Score,
Salary, Spring Grades, Years of Work Experience, Satisfaction with
program)?
Code:
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
import scipy.stats as st
# Load and clean data
df = pd.read_excel("MBA Starting Salaries.xlsx", sheet_name="Salaries").dropna()
df = df[~df.isin([999, 998]).any(axis=1)]
# Variable pairs for which the Pearson correlation and p-value are computed
corr_vars = [
('gmat_tot', 'satis'),
('salary', 'work_yrs'),
('s_avg', 'salary'),
('f_avg', 'salary'),
('gmat_tot', 'salary'),
('satis', 'salary'),
('work_yrs', 's_avg'),
('gmat_tot', 'work_yrs'),
('s_avg', 'f_avg'),
('satis', 'work_yrs')
]
for var1, var2 in corr_vars:
    corr, p_val = st.pearsonr(df[var1], df[var2])
    print(f"Correlation between {var1} and {var2}: r = {corr:.2f}, p-value = {p_val:.4f}")
# Correlation and visualization
sns.heatmap(df[["gmat_tot", "salary", "s_avg", "work_yrs", "satis"]].corr(), annot=True,
cmap="coolwarm", fmt=".2f")
plt.show()
Output:
• Satisfaction & Salary: A weak but significant positive correlation (r = 0.1564, p =
0.0298).
• Work Experience & Spring GPA: Weak positive correlation (r = 0.1591, p = 0.0270).
• GMAT & Work Experience: Weak negative correlation (r = -0.1737, p = 0.0157).
• No statistically significant correlation between GMAT Score & Salary or GPA & Salary.
Overall, the findings suggest that academic performance and work experience are at most
weakly associated with starting salaries.
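The pairwise results above can also be collected into one table instead of being printed line by line. The following is a minimal sketch on synthetic data: the columns `a`, `b`, and `c` are placeholders generated with a fixed seed, not survey variables, and the 0.05 threshold is the conventional cutoff assumed in the interpretation above.

```python
from itertools import combinations

import numpy as np
import pandas as pd
from scipy import stats

# Synthetic data for illustration only; "a", "b", "c" are placeholder names.
rng = np.random.default_rng(0)
n = 200
base = rng.normal(size=n)
toy = pd.DataFrame({
    "a": base,
    "b": base + rng.normal(scale=0.5, size=n),  # built to correlate with "a"
    "c": rng.normal(size=n),                    # generated independently
})

# Pearson r and two-sided p-value for every pair of columns
rows = []
for v1, v2 in combinations(toy.columns, 2):
    r, p = stats.pearsonr(toy[v1], toy[v2])
    rows.append({"var1": v1, "var2": v2, "r": r, "p_value": p,
                 "significant": bool(p < 0.05)})

results = pd.DataFrame(rows)
print(results.round(4))
```

With ten pairs tested, as in the code above, a few p-values just under 0.05 can arise by chance, so weak correlations of that kind are best read cautiously.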
Q4. Can starting salaries be predicted based on graduating data (i.e., GPA,
gender, work experience)?
Code:
import pandas as pd
import statsmodels.api as sm
# Load and clean data
df = pd.read_excel("MBA Starting Salaries.xlsx", sheet_name="Salaries").dropna()
df = df[~df.isin([999, 998]).any(axis=1)]
# Build the regression model
x = df[["s_avg", "sex", "work_yrs"]]
x = sm.add_constant(x)  # add the intercept term
print(x)
y = df[["salary"]]
print(y)
model = sm.OLS(y, x).fit()
print(model.summary())
Output:
The regression model aimed to predict starting salary based on GPA (Spring), gender, and
work experience, but the results indicate poor predictive power.
• Adjusted R² = -0.0014 → Once the number of predictors is penalized, the model is no
better than a baseline model (mean salary prediction) and explains virtually 0% of salary variation.
• P-values for GPA, gender, and work experience are all > 0.05, meaning none of these
factors has a statistically significant association with salary.
• Negative Adjusted R² suggests that including these variables adds no real explanatory
power and may even introduce unnecessary complexity.
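A negative Adjusted R² follows directly from the standard adjustment formula, 1 - (1 - R²)(n - 1)/(n - p - 1), whenever R² is small relative to the number of predictors p. The sketch below illustrates this; the inputs are hypothetical and are not the R² or sample size from the regression output above.

```python
def adjusted_r2(r2: float, n_obs: int, n_predictors: int) -> float:
    """Adjusted R^2 = 1 - (1 - R^2) * (n - 1) / (n - p - 1)."""
    return 1.0 - (1.0 - r2) * (n_obs - 1) / (n_obs - n_predictors - 1)


# Hypothetical inputs for illustration; not values from the model summary.
weak = adjusted_r2(r2=0.01, n_obs=200, n_predictors=3)
strong = adjusted_r2(r2=0.60, n_obs=200, n_predictors=3)

print(f"weak fit:   {weak:.4f}")    # the penalty outweighs the small R^2
print(f"strong fit: {strong:.4f}")  # the penalty barely changes a large R^2
```

The unadjusted R² of an OLS model with an intercept cannot fall below zero on the data it was fitted to, so the negative value comes from the penalty term alone.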
Conclusion
The regression analysis indicates that GPA, gender, and work experience do not significantly
predict starting salaries. The negative Adjusted R² suggests that salary variations are likely
driven by factors not included in this model, possibly industry trends, negotiation skills,
job location, or personal networking, none of which were tested here.
This aligns with the correlation analysis, reinforcing that traditional academic and experience
metrics are poor indicators of salary outcomes.