DS5460 Homework 5 q2 (8pts)
Your task: build a linear regression model with Spark MLlib and print the RMSE and r2 for your model.
How important are GRE scores?
The dataset "Admission_Predict.csv" was created for predicting graduate admissions from an
Indian perspective.
Content: The dataset contains several parameters that are considered important during the
application for Masters programs. The parameters included are:
+ GRE Scores (out of 340)
+ TOEFL Scores (out of 120)
+ University Rating (out of 5)
+ Statement of Purpose and Letter of Recommendation Strength (out of 5)
+ Undergraduate GPA (out of 10)
+ Research Experience (either 0 or 1)
+ Chance of Admit (ranging from 0 to 1)
Reference: Mohan S Acharya, Asfia Armaan, Aneeta S Antony: A Comparison of Regression Models
for Prediction of Graduate Admissions, IEEE International Conference on Computational Intelligence
in Data Science 2019
!pip install pyspark
!pip install -U -q PyDrive
!apt install openjdk-8-jdk-headless -qq
import os
os.environ["JAVA_HOME"] = "/usr/lib/jvm/java-8-openjdk-amd64"
Requirement already satisfied: pyspark in /usr/local/lib/python3.7/dist-packages (3.1.1)
Requirement already satisfied: py4j==0.10.9 in /usr/local/lib/python3.7/dist-packages (1
openjdk-8-jdk-headless is already the newest version (8u282-b08-0ubuntu1~18.04).
0 upgraded, 0 newly installed, 0 to remove and 29 not upgraded.
from google.colab import drive
drive.mount('/content/drive')
Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mour
%cd drive/MyDrive/Colab\ Notebooks
/content/drive/MyDrive/Colab Notebooks
%cd hw05/
/content/drive/My Drive/Colab Notebooks/hw05
from pyspark.sql import SparkSession
spark = SparkSession.builder.appName('hw05').getOrCreate()
data = spark.read.csv(os.getcwd() + '/Admission_Predict.csv', inferSchema=True, header=True)
data.printSchema()
root
 |-- Serial No.: integer (nullable = true)
 |-- GRE Score: integer (nullable = true)
 |-- TOEFL Score: integer (nullable = true)
 |-- University Rating: integer (nullable = true)
 |-- SOP: double (nullable = true)
 |-- LOR : double (nullable = true)
 |-- CGPA: double (nullable = true)
 |-- Research: integer (nullable = true)
 |-- Chance of Admit : double (nullable = true)
data.show(5)
|Serial No.|GRE Score|TOEFL Score|University Rating|SOP|LOR |CGPA|Research|Chance of Adr
only showing top 5 rows
data = data.withColumnRenamed("GRE Score", "GRE")\
    .withColumnRenamed("TOEFL Score", "TOEFL")\
    .withColumnRenamed("University Rating", "Rating")\
    .withColumnRenamed("Chance of Admit ", "Target")\
    .withColumnRenamed("LOR ", "LOR")\
    .withColumnRenamed("Serial No.", "Serial_No")
data.show(5)
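+---------+---+-----+------+---+---+----+--------+------+
|Serial_No|GRE|TOEFL|Rating|SOP|LOR|CGPA|Research|Target|
+---------+---+-----+------+---+---+----+--------+------+
|        1|337|  118|     4|4.5|4.5|9.65|       1|  0.92|
|        2|324|  107|     4|4.0|4.5|8.87|       1|  0.76|
|        3|316|  104|     3|3.0|3.5| 8.0|       1|  0.72|
|        4|322|  110|     3|3.5|2.5|8.67|       1|   0.8|
|        5|314|  103|     2|2.0|3.0|8.21|       0|  0.65|
+---------+---+-----+------+---+---+----+--------+------+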
only showing top 5 rows
from pyspark.ml.feature import RFormula
formula = RFormula(
    formula="Target ~ Research + CGPA + LOR + SOP + Rating + TOEFL + GRE")
output = formula.fit(data).transform(data)
output.select("features", "label").show(5)
|features|label|
only showing top 5 rows
train, test = output.randomSplit([0.75, 0.25])
from pyspark.ml.regression import LinearRegression
lin_reg = LinearRegression(featuresCol='features', labelCol='label')
linear_model = lin_reg.fit(train)
print("Coefficients: " + str(linear_model.coefficients))
print("\nIntercept: " + str(linear_model.intercept))
Coefficients: [0.021867689937870024, 0.11454791057840752, 0.018225951061264446, 0.00381000:
Intercept: -1.0836690504090625
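The coefficients and intercept printed above define the model's prediction: the intercept plus the dot product of the coefficient vector with the feature vector. The sketch below shows that arithmetic in plain Python, independent of Spark. The function name `predict_linear` and all numeric values are made-up placeholders for illustration, not the values fitted in this notebook.

```python
def predict_linear(coefficients, intercept, features):
    """Return intercept + sum(coef_i * x_i) for a single feature vector."""
    if len(coefficients) != len(features):
        raise ValueError("coefficients and features must have the same length")
    return intercept + sum(c * x for c, x in zip(coefficients, features))


# Hypothetical two-feature model, for illustration only.
example_coefficients = [0.5, 0.25]
example_intercept = -1.0
example_prediction = predict_linear(
    example_coefficients, example_intercept, [2.0, 4.0]
)
print(example_prediction)
```

In the fitted model the feature order follows the right-hand side of the RFormula string, so each coefficient lines up with the term in the same position.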
trainSummary = linear_model.summary
print("RMSE: %f" % trainSummary.rootMeanSquaredError)
print("\nr2: %f" % trainSummary.r2)
RMSE: 0.062458
r2: 0.793621
from pyspark.sql.functions import abs
from pyspark.ml.evaluation import RegressionEvaluator
predictions = linear_model.transform(test)
pred_evaluator = RegressionEvaluator(predictionCol="prediction", \
    labelCol="label", metricName="r2")
print("R Squared (R2) on test = %g" % pred_evaluator.evaluate(predictions))
R Squared (R2) on test = 0.810537
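To make the two reported metrics concrete, here is a minimal sketch of the usual definitions of RMSE and r2, written in plain Python so it runs without Spark. The helper names `rmse` and `r2` and the toy label and prediction lists are assumptions introduced for illustration. They are not data from this notebook, and this is not Spark's own implementation.

```python
import math


def rmse(labels, predictions):
    """Root mean squared error: sqrt of the mean squared residual."""
    squared_errors = [(y - p) ** 2 for y, p in zip(labels, predictions)]
    return math.sqrt(sum(squared_errors) / len(labels))


def r2(labels, predictions):
    """Coefficient of determination: 1 - SS_residual / SS_total."""
    mean_label = sum(labels) / len(labels)
    ss_residual = sum((y - p) ** 2 for y, p in zip(labels, predictions))
    ss_total = sum((y - mean_label) ** 2 for y in labels)
    return 1.0 - ss_residual / ss_total


# Toy values, made up for illustration only.
toy_labels = [1.0, 2.0, 3.0, 4.0]
toy_predictions = [1.0, 2.0, 3.0, 5.0]
toy_rmse = rmse(toy_labels, toy_predictions)
toy_r2 = r2(toy_labels, toy_predictions)
print("RMSE: %f" % toy_rmse)
print("r2: %f" % toy_r2)
```

RMSE is expressed in the units of the label (here, chance of admission on a 0 to 1 scale), while r2 is unitless and compares the model against always predicting the mean label.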
# importance of GRE
data.stat.corr("Target", "GRE")
0.8026104595903502
# A correlation of about 0.80 means GRE is strongly correlated with the target, which suggests it is an important predictor.
d = {}
for c in data.columns:
    if c not in ['Serial_No', 'Target']:
        d[c] = data.stat.corr(c, 'Target')
d
{'CGPA': 0.8732890993553003,
 'GRE': 0.8026104595903504,
 'LOR': 0.6698887920106943,
 'Rating': 0.7112502503917228,
 'Research': 0.5532021370190406,
 'SOP': 0.6757318583886724,
 'TOEFL': 0.7915939869351043}
# CGPA has the highest correlation with the target (about 0.87), so it appears to be the most important variable; GRE is second (about 0.80).
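The values returned by `data.stat.corr` are Pearson correlation coefficients. As a rough sketch of what that number measures, the plain-Python function below computes the same statistic for two lists. The name `pearson_corr` and the toy lists are hypothetical and chosen only so the result is easy to check by hand.

```python
import math


def pearson_corr(xs, ys):
    """Pearson correlation: covariance divided by the product of spreads."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    covariance = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = sum((x - mean_x) ** 2 for x in xs)
    spread_y = sum((y - mean_y) ** 2 for y in ys)
    return covariance / math.sqrt(spread_x * spread_y)


# Toy values, made up for illustration only.
toy_x = [1.0, 2.0, 3.0, 4.0]
toy_rising = [2.0, 4.0, 6.0, 8.0]    # increases linearly with toy_x
toy_falling = [8.0, 6.0, 4.0, 2.0]   # decreases linearly with toy_x
corr_rising = pearson_corr(toy_x, toy_rising)
corr_falling = pearson_corr(toy_x, toy_falling)
print(corr_rising, corr_falling)
```

A value near 1 or -1 indicates a strong linear relationship between one column and the target; it is computed one column at a time, so it does not account for overlap between predictors such as GRE and CGPA.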