DS5460 Homework 5 q2 (8 pts)

Your task: build a linear regression model from Spark MLlib and print the RMSE/r2 for your model. How important are GRE scores?

This dataset, 'Admission_Predict.csv', is created for prediction of Graduate Admissions from an Indian perspective.

Content

The dataset contains several parameters which are considered important during the application for Masters programs. The parameters included are:

+ GRE Scores (out of 340)
+ TOEFL Scores (out of 120)
+ University Rating (out of 5)
+ Statement of Purpose and Letter of Recommendation Strength (out of 5)
+ Undergraduate GPA (out of 10)
+ Research Experience (either 0 or 1)
+ Chance of Admit (ranging from 0 to 1)

Reference: Mohan S Acharya, Asfia Armaan, Aneeta S Antony: A Comparison of Regression Models for Prediction of Graduate Admissions, IEEE International Conference on Computational Intelligence in Data Science 2019.

!pip install pyspark
!pip install -U -q PyDrive
!apt install openjdk-8-jdk-headless -qq

import os
os.environ["JAVA_HOME"] = "/usr/lib/jvm/java-8-openjdk-amd64"

Requirement already satisfied: pyspark in /usr/local/lib/python3.7/dist-packages (3.1.1)
Requirement already satisfied: py4j==0.10.9 in /usr/local/lib/python3.7/dist-packages
openjdk-8-jdk-headless is already the newest version (8u282-b08-0ubuntu1~18.04).
0 upgraded, 0 newly installed, 0 to remove and 29 not upgraded.

from google.colab import drive
drive.mount('/content/drive')

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).

%cd drive/MyDrive/Colab\ Notebooks
/content/drive/MyDrive/Colab Notebooks

%cd hw05/
/content/drive/My Drive/Colab Notebooks/hw05

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName('hw05').getOrCreate()
data = spark.read.csv(os.getcwd() + '/Admission_Predict.csv', inferSchema=True, header=True)
data.printSchema()

root
 |-- Serial No.: integer (nullable = true)
 |-- GRE Score: integer (nullable = true)
 |-- TOEFL Score: integer (nullable = true)
 |-- University Rating: integer (nullable = true)
 |-- SOP: double (nullable = true)
 |-- LOR : double (nullable = true)
 |-- CGPA: double (nullable = true)
 |-- Research: integer (nullable = true)
 |-- Chance of Admit : double (nullable = true)

data.show(5)

|Serial No.|GRE Score|TOEFL Score|University Rating|SOP|LOR |CGPA|Research|Chance of Admit |
only showing top 5 rows

data = data.withColumnRenamed("GRE Score", "GRE")\
           .withColumnRenamed("TOEFL Score", "TOEFL")\
           .withColumnRenamed("University Rating", "Rating")\
           .withColumnRenamed("Chance of Admit ", "Target")\
           .withColumnRenamed("LOR ", "LOR")\
           .withColumnRenamed("Serial No.", "Serial_No")
data.show(5)

+---------+---+-----+------+---+---+----+--------+------+
|Serial_No|GRE|TOEFL|Rating|SOP|LOR|CGPA|Research|Target|
+---------+---+-----+------+---+---+----+--------+------+
|        1|337|  118|     4|4.5|4.5|9.65|       1|  0.92|
|        2|324|  107|     4|4.0|4.5|8.87|       1|  0.76|
|        3|316|  104|     3|3.0|3.5| 8.0|       1|  0.72|
|        4|322|  110|     3|3.5|2.5|8.67|       1|   0.8|
|        5|314|  103|     2|2.0|3.0|8.21|       0|  0.65|
+---------+---+-----+------+---+---+----+--------+------+
only showing top 5 rows

from pyspark.ml.feature import RFormula

formula = RFormula(formula="Target ~ Research + CGPA + LOR + SOP + Rating + TOEFL + GRE")
output = formula.fit(data).transform(data)
output.select("features", "label").show(5)

|features|label|
only showing top 5 rows

train, test = output.randomSplit([0.75, 0.25])
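A side note before fitting: the predictors sit on very different scales (GRE is out of 340, CGPA out of 10, Research is 0/1), so the raw coefficients printed below are not directly comparable as importance measures. The sketch below is not part of the original notebook; it shows a minimal alternative to the RFormula pipeline that assembles the renamed columns with VectorAssembler and standardizes them with StandardScaler so coefficient magnitudes can be compared across features. It assumes the `data` DataFrame after the column renames above; the variable names (`assembler`, `scaled_train`, `std_model`, etc.) are illustrative only.

from pyspark.ml.feature import VectorAssembler, StandardScaler
from pyspark.ml.regression import LinearRegression

# Assemble the raw predictor columns into a single vector, then standardize
# to zero mean / unit variance so coefficient magnitudes are comparable.
feature_cols = ["Research", "CGPA", "LOR", "SOP", "Rating", "TOEFL", "GRE"]
assembler = VectorAssembler(inputCols=feature_cols, outputCol="raw_features")
assembled = assembler.transform(data.withColumnRenamed("Target", "label"))

scaler = StandardScaler(inputCol="raw_features", outputCol="features",
                        withMean=True, withStd=True)
scaled = scaler.fit(assembled).transform(assembled)
scaled_train, scaled_test = scaled.randomSplit([0.75, 0.25], seed=42)

# Each coefficient of this model is in standard-deviation units of its feature,
# so a larger magnitude means a stronger association with the admission chance.
std_model = LinearRegression(featuresCol="features", labelCol="label").fit(scaled_train)
print(dict(zip(feature_cols, std_model.coefficients.toArray())))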
from pyspark.ml.regression import LinearRegression

lin_reg = LinearRegression(featuresCol='features', labelCol='label')
linear_model = lin_reg.fit(train)
print("Coefficients: " + str(linear_model.coefficients))
print("\nIntercept: " + str(linear_model.intercept))

Coefficients: [0.021867689937870024, 0.11454791057840752, 0.018225951061264446, 0.00381000...
Intercept: -1.0836690504090625

trainSummary = linear_model.summary
print("RMSE: %f" % trainSummary.rootMeanSquaredError)
print("\nr2: %f" % trainSummary.r2)

RMSE: 0.062458
r2: 0.793621

from pyspark.ml.evaluation import RegressionEvaluator

predictions = linear_model.transform(test)
pred_evaluator = RegressionEvaluator(predictionCol="prediction",
                                     labelCol="label", metricName="r2")
print("R Squared (R2) on test = %g" % pred_evaluator.evaluate(predictions))

R Squared (R2) on test = 0.810537

# importance of GRE
data.stat.corr("Target", "GRE")

0.8026104595903502

# GRE is highly correlated with the admission chance, and is therefore highly important.
d = {}
for c in data.columns:
    if c not in ['Serial_No', 'Target']:
        d[c] = data.stat.corr(c, 'Target')
d

{'CGPA': 0.8732890993553003,
 'GRE': 0.8026104595903504,
 'LOR': 0.6698887920106943,
 'Rating': 0.7112502503917228,
 'Research': 0.5532021370190406,
 'SOP': 0.6757318583886724,
 'TOEFL': 0.7915939869351043}

# CGPA is the most important feature, but GRE is the second most important variable.
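The assignment asks for RMSE as well as r2; the RMSE reported above comes from the training summary, while the test split was only scored with r2. Below is a minimal sketch for also reporting RMSE on the held-out test set and for ranking the correlations just computed; it reuses the `predictions` DataFrame and the dict `d` defined in the cells above, and its output is not from the original run.

from pyspark.ml.evaluation import RegressionEvaluator

# RMSE on the held-out test set, complementing the training-summary RMSE
# and the test-set R2 reported above.
rmse_evaluator = RegressionEvaluator(predictionCol="prediction",
                                     labelCol="label", metricName="rmse")
print("RMSE on test = %g" % rmse_evaluator.evaluate(predictions))

# Rank predictors by correlation with Target (all are positive here), which
# makes the CGPA > GRE > TOEFL ordering explicit.
for name, corr in sorted(d.items(), key=lambda kv: kv[1], reverse=True):
    print("%-8s %.4f" % (name, corr))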
