Data Mining Report

Faculty of Information and Communication Technology (FICT)

UCCB3224
Data Mining Techniques

Group Assignment

June 23 Trimester
Lecturer: Dr. Abdulkarim Kanaan Jebna

Assignment Marksheet

By signing below, we confirm that the work produced is original and based purely on our own
sentence construction. Should any plagiarism be detected, we agree to mark penalization on the
part(s) detected.

No  Name               ID       Practical Group  Program (IA/IB/DE)  Signature
1   Alex Tan Yong Xin  2002422  3                IB
2   Ang Jia Xuan       2100421  5                IB                  ANG
3   Beh Ling Huey      2106770  7                IB
4   Tan Le Je          2004995  7                IB
5   Wong Wei Jie       1904686  4                IB

Criteria (marks entered per member: Alex Tan Yong Xin, Ang Jia Xuan, Beh Ling Huey, Tan Le Jie, Wong Wei Jie)

Individual:
  Final Deliverables (Quality of Work)                 4%  (Scale*4/5)

Team:
  Report and Documentation                             3%  (Scale*3/5)
  Effort and Technical Capability (Overall as a Team)  3%  (Scale*3/5)
  Final Deliverables (Quality of System)               3%  (Scale*3/5)
  Innovativeness                                       2%  (Scale*2/5)

Total                                                 15%

Marking Scheme

Scale (0-5)  Description

5            Excellent work produced. Evidence of in-depth study and critical thought.
4            Good work produced. Evidence of in-depth study or critical thought.
3            Average work produced. Evidence of adequate study or thought/idea, although
2            Below average work produced. Evidence of some study or thought/idea, but not quite adequate.
1            Poor work performance or work not supported by any study/basis.
0            Not attempted.
Table of Contents
Executive Summary <Contribution by: All members>
Business Understanding
1.1 Background Information
1.2 Problem statement
1.3 Objective
1.4 Data Mining Goals
1.5 Success Criteria
1.6 Gantt Chart
1.7 Development Tools
Data Understanding <Contribution by: All Members>
2.1 Understanding the dataset’s structure and dimensions
2.2 Understanding the travel fee
2.3 Creating testset
2.4 Data visualization
2.5 Travel fee
2.6 Arrival Location
2.7 Departure location
2.8 Departure time count (original data)
2.9 Departure time count (time only)
2.10 Departure time vs average travel fee
2.11 Departure date
2.12 Departure time counts (include seconds)
2.13 Correlation
2.14 Query 1: Looking for range of departure_long and departure_lat
2.15 Query 2: Looking for range of arrival_long and arrival_lat
2.16 Query 3: Look for the most frequent departure time
2.17 Query 4
2.18 Query 5
Data Preparation <Contribution by: All members>
3.0 Data Preprocessing (missing value)
3.1 Data Cleaning (departure and arrival)
3.1.1 Before removing outliers
3.1.2 After removing outliers
3.2 Feature Engineering (distance_travelled) future improvement
3.3 Feature Engineering (travel_time_used) future improvement
3.4 Feature Engineering (time_of_day, is_weekend) future improvement
3.5 Correlation Test
3.6 Feature Selection (Dropping ID)
3.7 Feature Scaling (min max scaler, standard scaler)
3.8 Last preprocessing
Modeling <Contribution by: All members>
4.1 Train model and measure its performance
4.1.1 Decision Tree Regressor
4.1.2 Linear Regression
4.1.3 Non-linear regression
4.1.4 Random forest regressor
4.1.5 SVM Linear Regression
4.1.6 k-NN Regressor
4.1.7 Ridge Regression
4.1.8 Lasso Regression
4.1.9 Elastic Net Regression
4.1.10 Stochastic gradient descent
4.1.11 Gradient Boosting Regression
4.1.12 XGBoost Regression
4.1.13 LightGBM Regression
4.1.14 MLP
4.2 Short-listing the 5 most promising models
4.3 Fine tuning result
4.3.1 Decision tree regressor
4.3.2 XGBoost regression
4.3.3 KNN Regressor
4.3.4 Random forest regressor
4.3.5 Gradient Boosting Regression
4.4 Assess Model
Evaluation <Contribution by: All members>
5.0 Evaluation
Deployment <Contribution by: All members>
6.0 Deployment
6.1 Digital Infrastructure
6.2 Web application
6.3 Integrating the final model
6.4 API endpoint
6.5 Testing
6.6 Monitor
6.7 Maintenance
6.8 Documentation
Conclusion <Contribution by: All members>
7.0 Conclusion
Executive Summary <Contribution by: All members>

In this project, we expect to create models that can predict fair pricing for a taxi service with the help of CRISP-DM. First, the business understanding task introduces the background of the taxi service, identifies the problem statements and objectives, and defines the data mining goals. The second task, data understanding, requires us to understand the data we are going to work with, such as each feature's name, description, type and percentage of missing values. We then visualize the data with different types of charts (scatter plots, histograms, bar charts and line charts) and study the correlation of the features with the target variable travel_fee. The third task is data preparation: handling missing values and outliers and deciding which attributes are numerical and which are categorical with the help of SimpleImputer. In this stage, irrelevant features are dropped from the train set and new features are created from the existing features in the dataset. The fourth task, modeling, involves trying different algorithms: Decision Tree Regressor, Linear Regression, Non-linear Regression, Random Forest Regressor, SVM Linear Regression, k-NN Regressor, Ridge Regression, Lasso Regression, Elastic Net Regression, Stochastic Gradient Descent, Gradient Boosting Regression, XGBoost Regression, LightGBM Regression and MLP. In the end, five of the fourteen models are shortlisted: the decision tree regressor, XGBoost regression, k-NN regressor, random forest regressor and gradient boosting regression.

Business Understanding

1.1 Background Information

In this project, predicting fair pricing in the taxi service sector is the main interest of the whole
assignment, as fair pricing is the fundamental that helps both the passengers and the service provider
get what they want. To achieve that, data understanding and preparation are carried out before models
are trained to predict the fare. Fair pricing helps to ensure that the charges are reasonable and that
passengers are satisfied with their journeys, while transparency comes from the methods and models
used to predict and calculate the fares.

1.2 Problem statement

1.2.1 The overrated pricing for taxis [1]

Malaysian taxi services have had a bad reputation due to unstandardized conduct by drivers, such as
not using taximeters and choosing preferred destinations regardless of whether the passenger is a local
or a tourist, while operating poorly maintained, old taxis.

1.2.2 The dataset provided has outliers and missing values.

The dataset provided has features with outliers, such as arrival_long. If these outliers remain, the
train set will be distorted by the erroneous values, which will ultimately hurt performance on the
test set.

1.2.3 Datasets having underfitting or overfitting problems.

When a machine learning model underfits or overfits, the training and validation errors reveal it: an
overfitted model shows a low training error and a high validation error, while an underfitted model
shows high errors on both training and validation data, usually because the model is too simple or the
dataset is too small.

1.2.4 Looking for indicators that show high predictive accuracy.

Machine learning offers many different algorithms for predicting numerical values. Standardized
indicators should therefore be used so that all the algorithms can be compared fairly.

1.2.5 The existing features do not provide enough information for data exploration.

The existing features have low correlation values (far from one) with the target feature travel_fee.

1.3 Objective

The major objective of this project is to ensure that customers get a fair price every time they use
the taxi service, based on the insight gained from historical data. Customer satisfaction can be
improved for every ride by considering the time, the distance travelled and the occupancy of each
ride, which in turn helps retain more customers. Another objective is to attract new customers to the
taxi service, since customers can predict their travel fare and file a complaint if they receive an
unreasonable one. The aim is therefore to increase the overall sales of taxi fares by at least 15% and
to boost customer engagement by creating a pleasant ride experience, so that customers give good
feedback and promote the taxi service through word of mouth.

1.4 Data Mining Goals

In this assignment, we aim to achieve five goals in predicting the travel fee based on the departure
latitude and longitude, the destination latitude and longitude, and the occupancy.

1.4.1 To predict fair pricing for the trip.

Create a model to predict the fair price with the historical data of departure latitude and
longitude, destination latitude and longitude and occupancy.

1.4.2 To gain insight into the dataset, removing the outliers and filling in missing values.

Use different charts, such as histograms and scatter plots, to visualize some of the important
features that we would like to highlight.

1.4.3 To prevent underfitting and overfitting for the models

By creating models with different algorithms, we can look for the model with the lowest training and
validation errors and shortlist the most promising models.

1.4.4 To look for a model with the lowest mean squared error.

Models with low mean squared error values imply that the predictions are more reliable, which leads
to better decisions.

1.4.5 To create new features by using the existing features.

Creating new features whose correlation value is near to one can help to select features that have a
strong influence on the target variable travel_fee.

1.5 Success Criteria
1.5.1 Fair Pricing Prediction should be made.
1.5.2 Service costs should be reasonable and reflect the value provided.
1.5.3 Customers are not overcharged or subjected to arbitrary or discriminatory fares.

The success criteria above aim to increase customer satisfaction by producing accurate, not
overrated, travel fees.

1.6 Gantt Chart

1.7 Development Tools

1.7.1 Anaconda

It is a Python distribution with a graphical user interface (GUI) that supports applications such as
Jupyter Notebook.

1.7.2 Jupyter Notebook

It is a server-client program that enables different users to run and edit the same notebook files
through a web browser, which helps us in this assignment.

Data Understanding <Contribution by: All Members >

2.1 Understanding the dataset’s structure and dimensions

Figure 2.1 Dataset structure

Figure 2.2 Dataset Shape

The dataset has 8 columns and 1048575 rows of data. The columns are ID, travel_fee,
departure_time, departure_long, departure_lat, arrival_long, arrival_lat and occupancy.

2.2 Understanding the travel fee

Figure 2.3 Travel fee details

Through this, we can see the minimum, maximum, mean, standard deviation, 25th percentile, median and
75th percentile of the travel fee. Most of the travel fees are around 10 dollars, so the maximum
value of 450 might be an outlier or a special case.

2.3 Creating testset

Figure 2.4 Train set and test set


The train set and test set contain 80% and 20% of the data respectively. The departure_time column is
converted to datetime format so that it can be used during data preprocessing to create new features.
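As a rough illustration, the split and the datetime conversion described above could look like the following sketch; the file name and the random seed are assumptions rather than details taken from the notebook.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

df = pd.read_csv("taxi_fares.csv")  # hypothetical file name for the provided dataset

# Parse departure_time so that hour/minute features can be derived later.
df["departure_time"] = pd.to_datetime(df["departure_time"], errors="coerce")

# 80% train, 20% test; a fixed random_state keeps the split reproducible.
train_set, test_set = train_test_split(df, test_size=0.2, random_state=42)
print(train_set.shape, test_set.shape)
```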

2.4 Data visualization

2.4.1 Occupancy

Figure 2.5 Occupancy chart

Figure 2.6 Occupancy count

From figure 2.5, we can see that occupancy takes values from 0 to 9, with 1 being by far the most
common. A value of 0 looks abnormal, because there cannot be a ride without an occupant, so it may be
noise in the data; however, we need to explore the data further before deciding whether it is noise.
The occupancy value of 9 may also be an outlier, but it too needs more investigation.

2.4.1.1 Discovering occupancy 0 and 9

Figure 2.7 Occupancy 0 and 9 (1)

Figure 2.8 Occupancy 0 and 9 (2)

From figures 2.7 and 2.8, we can see that the records with occupancy 0 are noise, because they contain
many non-numeric entries and zero values. The records with occupancy 9, on the other hand, show no
issues. We therefore conclude that occupancy 0 is noise, while occupancy 9 is legitimate data.

2.5 Travel fee

Figure 2.9 Travel fee value count

Figure 2.10 Travel fee value count

From figure 2.9, we can see that most of the data lies around the 10 dollar mark, as shown in figure
2.3. There is still a considerable amount of data around the 50 dollar mark, but from there onwards
the count decreases. There is also data with negative values; this may be because the user applied a
discount or cashback voucher for the ride.

2.6 Arrival Location

Figure 2.11 Arrival latitude vs arrival longitude scatter plot

Figure 2.12 Arrival latitude and arrival longitude details

From figure 2.11, we can see that the minimum value of arrival_lat is around -750 and the maximum is
around 50, while the minimum value of arrival_long is around -76 and the maximum is around 74. Most
of the arrival_long and arrival_lat values are around -90 and 40 respectively, which suggests that
most of the rides end around the same destination.

2.7 Departure location

Figure 2.13 Departure latitude vs departure longitude scatter plot

From figure 2.13, we can also conclude that most of the data comes from around the same location,
roughly -60 longitude and 40 latitude.

2.8 Departure time count (original data)

Figure 2.14 Departure time count (original data)

From this figure, we can see that the distribution of departure times is quite even, with an average
count of around 8, but this does not help much in explaining the data.

2.9 Departure time count (time only)

Figure 2.15 Departure time count (time only)

From figure 2.15, we can see a trend in the data: the count increases from 5 AM to 2 PM, decreases
until 4:40 PM, then increases until 8 PM, and from there onwards the number of rides decreases.

2.10 Departure time vs average travel fee

Figure 2.16 Departure time vs average travel fee

From figure 2.16, we can also observe a trend between the travel fee and the time of day. Comparing it
with figure 2.15, we see that when the count of rides decreases, the average travel fare increases.

2.11 Departure date

Figure 2.17 Departure date count

From figure 2.17, we can see that the dates are quite evenly distributed. Since the data is in date
format and does not appear to help much with the prediction, we decided not to use the date for the
prediction.

2.12 Departure time counts (include seconds)

Figure 2.18 Departure time counts (include seconds)

From figure 2.18, we can observe a pattern similar to figure 2.15. Because the pattern is similar, we
decided to use only the hour and minute for the prediction.

2.13 Correlation

Figure 2.19 Correlation Scatter Matrix

Figure 2.20 Correlation Matrix Heatmap

Figure 2.21 Travel fee correlation

From figures 2.19 and 2.21, we can see that the departure and destination coordinates are strongly
correlated with each other, but none of them is strongly correlated with the travel fee.
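A minimal sketch of how such a correlation check could be produced with pandas, assuming the train_set DataFrame from section 2.3:

```python
# Correlation of each numeric column with the target, as summarized in figure 2.21.
numeric_cols = ["travel_fee", "departure_long", "departure_lat",
                "arrival_long", "arrival_lat", "occupancy"]
corr_matrix = train_set[numeric_cols].corr()
print(corr_matrix["travel_fee"].sort_values(ascending=False))
```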

2.14 Query 1: Looking for range of departure_long and departure_lat

Figure 2.22 Query 1

From the query, we can see that the departure_long may have outliers.

2.15 Query 2: Looking for range of arrival_long and arrival_lat

Figure 2.23 Query 2

From this query, we can see that the arrival_long may have outliers.

2.16 Query 3: Look for the most frequent departure time

From this query, we know that the most frequent departure time is around 8PM.

2.17 Query 4

From this query, we can conclude that the records with an occupancy of 0 need to be removed.

2.18 Query 5

From this query, we can see the range of the travel fee. A value of -52 is uncommon, but it may be
due to a voucher claim or a lost record.

Data Preparation <Contribution by: All members >

3.0 Data Preprocessing (missing value)

Figure 3.1 fillna to fill up the missing value.


Because the rows with missing values make up less than 50% of the overall data and the affected
column is important, fillna with the median value is used.
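A minimal sketch of this imputation, assuming the train_set DataFrame from section 2.3 and using travel_fee purely as an illustrative column (the report does not name the imputed column):

```python
# Median imputation with fillna; travel_fee is used purely as an illustration.
median_value = train_set["travel_fee"].median()
train_set["travel_fee"] = train_set["travel_fee"].fillna(median_value)
```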

3.1 Data Cleaning (departure and arrival)

Figure 3.2 Process of removing outliers.

3.1.1 Before removing outliers

Figure 3.3 Before removing outliers.

3.1.2 After removing outliers

Figure 3.4 After removing outliers.

Latitude is not supposed to exceed +-90 and longitude is not supposed to exceed +-180. After removing
the outliers, the data falls within the correct ranges.
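One way such a filter could be written is sketched below; note that the pipeline described later keeps values between -180 and 180 for all four columns, whereas this sketch applies the stricter valid range for each axis.

```python
# Keep only rows whose coordinates fall inside the valid geographic ranges.
valid_coords = (
    train_set["departure_lat"].between(-90, 90)
    & train_set["arrival_lat"].between(-90, 90)
    & train_set["departure_long"].between(-180, 180)
    & train_set["arrival_long"].between(-180, 180)
)
train_set = train_set[valid_coords]
```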

3.2 Feature Engineering (distance_travelled) future improvement

Figure 3.5 Feature Engineering (distance_travelled).

3.3 Feature Engineering (travel_time_used) future improvement

Figure 3.6 Feature Engineering (travel_time_used).

3.4 Feature Engineering (time_of_day, is_weekend) future improvement

Figure 3.7 Feature Engineering (time_of_day and is_weekend).

For the feature engineering task, new features named distance_travelled, travel_time_used,
time_of_day and is_weekend were created.
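A hedged sketch of how the computable features could be derived is shown below; the exact formulas in the notebook may differ, and travel_time_used is omitted because the columns listed in section 2.1 do not include an arrival time.

```python
import numpy as np

def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in kilometres between two (lat, lon) points."""
    lat1, lon1, lat2, lon2 = map(np.radians, (lat1, lon1, lat2, lon2))
    a = (np.sin((lat2 - lat1) / 2) ** 2
         + np.cos(lat1) * np.cos(lat2) * np.sin((lon2 - lon1) / 2) ** 2)
    return 2 * 6371.0 * np.arcsin(np.sqrt(a))

train_set["distance_travelled"] = haversine_km(
    train_set["departure_lat"], train_set["departure_long"],
    train_set["arrival_lat"], train_set["arrival_long"])

train_set["time_of_day"] = train_set["departure_time"].dt.hour           # 0-23
train_set["is_weekend"] = train_set["departure_time"].dt.dayofweek >= 5  # Saturday/Sunday
```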

3.5 Correlation Test

Figure 3.8 Correlation test after feature engineering.

After creating these new attributes, we tested the correlation to check whether any of them has a
stronger positive correlation with the target variable travel_fee.

3.6 Feature Selection (Dropping ID)

Figure 3.9 dropping ID.

Since the ID is not useful for predicting the travel fare, we dropped it when creating the train set.

3.7 Feature Scaling (min max scaler, standard scaler)

Figure 3.10 min max scaler.

Figure 3.11 standard scaler.

3.8 Last preprocessing

Figure 3.12 last preprocessing.

Figure 3.13 Dataset after preprocessing

Figure 3.13 shows the prepared training data; it includes departure_long, departure_lat,
arrival_long, arrival_lat, occupancy and departure_hour_min.

Figure 3.14 Preprocessing pipeline

For the numerical attributes, we first remove the noise, i.e. the records with 0 occupancy. We then
filter the outliers, retaining only coordinate values between -180 and 180, so that the outliers
observed in figures 2.22 and 2.23 are removed. Next, we use a SimpleImputer with the median strategy
to replace missing values; in this case there are none, but if missing values appear in the future,
the median will be used to fill them. Finally, because regression models are sensitive to the scale of
the data, we use a StandardScaler to scale all the numerical attributes so that a good model can be
trained.

For the categorical attributes, we use a SimpleImputer with the most_frequent strategy so that any
future missing values can be replaced with the mode.

Finally, we created a new attribute from the departure_time attribute. The new attribute,
departure_hour_min, contains the hour and minute of the departure, so the date and the seconds from
the original dataset are dropped. For example, 2013-07-02 19:54:00+00:00 is transformed to 19.90. We
also include an imputer with the median strategy to replace any future missing values, and we scale
the data so that the model can learn the problem more easily.
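A sketch of how the numerical and time parts of this pipeline could be assembled with scikit-learn is shown below; the categorical most_frequent imputer is omitted, and the noise and outlier filtering are assumed to have been applied to train_set beforehand.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import FunctionTransformer, StandardScaler

def to_hour_min(df):
    # Keep only hour and minute, e.g. 19:54 -> 19.90, as described above.
    t = pd.to_datetime(df["departure_time"])
    return (t.dt.hour + t.dt.minute / 60).to_frame("departure_hour_min")

num_cols = ["departure_long", "departure_lat", "arrival_long", "arrival_lat", "occupancy"]

num_pipeline = Pipeline([
    ("imputer", SimpleImputer(strategy="median")),  # fills any future missing values
    ("scaler", StandardScaler()),                   # regression models are scale-sensitive
])

time_pipeline = Pipeline([
    ("hour_min", FunctionTransformer(to_hour_min)),
    ("imputer", SimpleImputer(strategy="median")),
    ("scaler", StandardScaler()),
])

preprocessing = ColumnTransformer([
    ("num", num_pipeline, num_cols),
    ("time", time_pipeline, ["departure_time"]),
])

X_train = preprocessing.fit_transform(train_set.drop(columns=["ID", "travel_fee"]))
y_train = train_set["travel_fee"].copy()
```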

Modeling <Contribution by: All members >

4.1 Train model and measure its performance

4.1.1 Decision Tree Regressor

Figure 4.1 Decision tree regressor

Figure 4.2 Decision tree regressor training prediction and RMSE

Figure 4.3 Decision tree cross validation

Using the decision tree model, we obtained a training RMSE of 1.735. After performing cross-validation,
we obtained a mean validation RMSE of 5.847. The difference between the training error and the
validation error is quite small, hence we conclude that the model is close to optimal and neither
underfits nor overfits the data.
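A sketch of how the training RMSE and the cross-validation RMSE could be computed is shown below; the fold count and random seed are assumptions, and X_train and y_train come from the preprocessing sketch in section 3.8.

```python
import numpy as np
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeRegressor

tree_reg = DecisionTreeRegressor(random_state=42)
tree_reg.fit(X_train, y_train)

# Training RMSE on the data the model was fitted on.
train_rmse = np.sqrt(mean_squared_error(y_train, tree_reg.predict(X_train)))

# Cross-validation returns negative MSE scores, so negate before taking the square root.
scores = cross_val_score(tree_reg, X_train, y_train,
                         scoring="neg_mean_squared_error", cv=10)
cv_rmse = np.sqrt(-scores)
print(train_rmse, cv_rmse.mean())
```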

4.1.2 Linear Regression

Figure 4.4 Linear Regression

Figure 4.5 Linear regression prediction and RMSE

Figure 4.6 Linear regression cross validation

Using the linear regression model, we obtained a training RMSE of 10.805. After performing
cross-validation, we obtained a mean validation RMSE of 116.766. The difference between the training
error and the validation error is very large, hence we conclude that the model is not optimal and has
overfitted the data. This model is therefore not suitable for prediction, as it has memorized the
training data and provides inaccurate predictions.

4.1.3 Non-linear regression

Figure 4.7 Non-linear regression

Figure 4.8 Non-linear regression prediction and RMSE

Figure 4.9 Non-linear regression cross validation

Using the non-linear regression model, we obtained a training RMSE of 10.767. After performing
cross-validation, we obtained a mean validation RMSE of 263.416. The validation error of this model is
very high, and the difference between the training error and the validation error is very large, hence
we conclude that the model is not optimal and has overfitted the data. This model is therefore not
suitable for prediction, as it has memorized the training data and provides inaccurate predictions.

4.1.4 Random forest regressor

Figure 4.10 Random forest regressor

Figure 4.11 Random forest regressor prediction and RMSE

Figure 4.12 Random forest regressor cross validation

Using the random forest regressor, we obtained a training RMSE of 5.162. After performing
cross-validation, we obtained a mean validation RMSE of 29.1276. The validation error is higher than
the training error but still within an acceptable range, hence we conclude that the model neither
underfits nor overfits the data.

4.1.5 SVM Linear Regression

Figure 4.13 SVM Linear Regression

Figure 4.14 SVM linear regression prediction and RMSE

Figure 4.15 SVM linear regression cross validation

Using the SVM linear regression model, we obtained a training RMSE of 11.319. After performing
cross-validation, we obtained a mean validation RMSE of 127.860. The difference between the training
error and the validation error is very large, hence we conclude that the model is not optimal and has
overfitted the data.

4.1.6 k-NN Regressor

Figure 4.16 k-NN Regressor

Figure 4.17 k-NN Regressor prediction and RMSE

Figure 4.18 k-NN Regressor cross validation

Using the k-NN regressor, we obtained a training RMSE of 4.880. After performing cross-validation, we
obtained a mean validation RMSE of 37.236. The validation error is higher than the training error but
still within an acceptable range, hence we conclude that the model neither underfits nor overfits the
data, although it still needs to be fine-tuned to obtain a better result.

4.1.7 Ridge Regression

Figure 4.19 Ridge Regression

Figure 4.20 Ridge Regression prediction and RMSE

Figure 4.21 Ridge Regression cross validation

Using the ridge regression model, we obtained a training RMSE of 10.805. After performing
cross-validation, we obtained a mean validation RMSE of 116.766. The difference between the training
error and the validation error is very large, hence we conclude that the model is not optimal and has
overfitted the data.

4.1.8 Lasso Regression

Figure 4.22 Lasso regression

Figure 4.23 Lasso regression prediction and RMSE

Figure 4.24 Lasso regression cross validation

Using the lasso regression model, we obtained a training RMSE of 10.811. After performing
cross-validation, we obtained a mean validation RMSE of 116.878. The difference between the training
error and the validation error is very large, hence we conclude that the model is not optimal and has
overfitted the data.

4.1.9 Elastic Net Regression

Figure 4.25 Elastic net regression

Figure 4.26 Elastic net regression prediction and RMSE

Figure 4.27 Elastic net regression cross validation

Using the Elastic Net regression model, we obtained a training RMSE of 10.811. After performing
cross-validation, we obtained a mean validation RMSE of 116.878. The difference between the training
error and the validation error is very large, hence we conclude that the model is not optimal and has
overfitted the data.

4.1.10 Stochastic gradient descent

Figure 4.28 Stochastic gradient descent

Figure 4.29 Stochastic gradient descent prediction and RMSE

Figure 4.30 Stochastic gradient descent cross validation

Using the stochastic gradient descent regressor, we obtained a training RMSE of 10.812. After
performing cross-validation, we obtained a mean validation RMSE of 116.974. The difference between the
training error and the validation error is very large, hence we conclude that the model is not optimal
and has overfitted the data.

4.1.11 Gradient Boosting Regression

Figure 4.31 Gradient Boosting Regression

Figure 4.32 Gradient boosting regressor prediction and RMSE

Figure 4.33 Gradient boosting regressor cross validation

Using the gradient boosting regressor, we obtained a training RMSE of 5.295. After performing
cross-validation, we obtained a mean validation RMSE of 28.266. The validation error is higher than the
training error but still within an acceptable range, hence we conclude that the model neither underfits
nor overfits the data, although it still needs to be fine-tuned to obtain a better result.

4.1.12 XGBoost Regression

Figure 4.34 XGBoost Regression

Figure 4.35 XGBoost Regression prediction and RMSE

Figure 4.36 XGBoost Regression cross validation

4.1.13 LightGBM Regression

Figure 4.37 LightGBM Regression

Figure 4.38 LightGBM Regression prediction and RMSE

Figure 4.39 LightGBM Regression cross validation

Using the LightGBM regression model, we obtained a training RMSE of 4.669. After performing
cross-validation, we obtained a mean validation RMSE of 22.570. The validation error is higher than the
training error but still within an acceptable range, hence we conclude that the model neither underfits
nor overfits the data.

4.1.14 MLP

Figure 4.40 MLP Regressor

Figure 4.41 MLP Regressor prediction and RMSE

Figure 4.42 MLP Regressor cross validation

Using the MLP regressor, we obtained a training RMSE of 4.848. After performing cross-validation, we
obtained a mean validation RMSE of 44.763. The validation error is noticeably higher than the training
error and there are better-performing models, hence we conclude that this model is not optimal and has
overfitted the data.

4.2 Short-listing the 5 most promising models

Out of the 14 models we tested, we decided to choose the decision tree regressor, XGBoost regression,
KNN regressor, random forest regressor and gradient boosting regression. These were the most optimal
models among the candidates: they did not overfit or underfit the data and were able to provide a good
prediction of the travel fare. We will therefore fine-tune the shortlisted models in order to determine
the best model among them. For the fine-tuning, we use grid search to perform the hyper-parameter
optimisation. Due to the long running time, we decided to cut down the number of hyper-parameters used
for the optimisation, because when we tried to use many hyper-parameters, the results could not be
obtained even after running for more than 10 hours.

4.3 Fine tuning result

4.3.1 Decision tree regressor

Figure 4.43 Decision tree regressor fine tuning

For the decision tree regressor, the hyperparameters we chose are min_samples_split and
min_samples_leaf, using the values [2, 4, 6] and [1, 5, 10, 15, 20] respectively. After fine-tuning the
model, the best hyperparameters are min_samples_leaf = 20 and min_samples_split = 2. The best score we
obtained is -22.451.
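A sketch of this grid search is shown below. With scoring set to neg_mean_squared_error, GridSearchCV reports the best score as a negative MSE, which is why the scores in this section are negative; the fold count is an assumption.

```python
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeRegressor

param_grid = {
    "min_samples_split": [2, 4, 6],
    "min_samples_leaf": [1, 5, 10, 15, 20],
}

grid_search = GridSearchCV(DecisionTreeRegressor(random_state=42), param_grid,
                           scoring="neg_mean_squared_error", cv=5)
grid_search.fit(X_train, y_train)
print(grid_search.best_params_)   # best min_samples_leaf / min_samples_split combination
print(grid_search.best_score_)    # negative MSE of the best combination
```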

58
4.3.2 XGBoost regression

Figure 4.44 XGBoost regression fine tuning

For the XGBoost regression, the hyperparameters we chose are reg_alpha and reg_lambda; we chose them
because they help prevent overfitting by adding penalty terms. The best parameters are reg_alpha = 0.1
and reg_lambda = 0.1. The best score we obtained is -20.433.
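A similar sketch for the XGBoost search is shown below; the grid values other than 0.1 are illustrative rather than taken from the notebook.

```python
from sklearn.model_selection import GridSearchCV
from xgboost import XGBRegressor

param_grid = {
    "reg_alpha": [0.01, 0.1, 1.0],    # L1 penalty strength
    "reg_lambda": [0.01, 0.1, 1.0],   # L2 penalty strength
}

xgb_search = GridSearchCV(XGBRegressor(random_state=42), param_grid,
                          scoring="neg_mean_squared_error", cv=5)
xgb_search.fit(X_train, y_train)
final_model = xgb_search.best_estimator_   # tuned XGBoost regressor
print(xgb_search.best_params_, xgb_search.best_score_)
```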

59
4.3.3 KNN Regressor

Figure 4.45 KNN regressor fine tuning

For the KNN regressor, the hyperparameter we chose is n_neighbors, using the values [3, 5, 7, 10]. The
best parameter is n_neighbors = 3. The best score we obtained is -41.943, hence this is not a good
model compared to the previous two, as the score corresponds to a much larger error.

60
4.3.4 Random forest regressor

Figure 4.46 Random forest regressor fine tuning

For the random forest regressor, the hyperparameter we chose is max_features, because limiting the
number of features used for each split can prevent the model from focusing on irrelevant or noisy
features. The best parameter is 1.0, which is equivalent to auto. The best score we obtained is
-29.071, which is still considered good when compared to the other models.

61
4.3.5 Gradient Boosting Regression

Figure 4.47 Gradient Boosting Regression fine tuning

For the gradient boosting regression, the hyperparameter we chose is max_depth, using the values
[3, 5, 7]. The best parameter is max_depth = 7. The best score we obtained was -20.984.

4.4 Assess Model

After performing the fine-tuning on the 5 shortlisted models, the best model we obtained is XGBoost
with a best score of -20.433, followed by gradient boosting regression with -20.984, the decision tree
with -22.451, the random forest with -29.071, and the KNN regressor with -41.943. Based on the data
mining goals, we want to prevent overfitting and underfitting and look for the model with the lowest
mean squared error; hence we decided to use XGBoost as the final model, as it is the best among the 5
shortlisted models.

62
Figure 4.48 Final Model

63
Evaluation <Contribution by: All members >

5.0 Evaluation

Figure 5.1 Evaluation

To perform the evaluation on the test set, we first drop the ID and travel fee columns to get x_test
and then convert the departure time in x_test into datetime format. We then copy travel_fee from the
test set to get y_test. Using the final model and x_test, we predicted the travel fare and obtained a
mean squared error of 4.7269. We are satisfied with this result, as the model provides fairly accurate
predictions with a low error rate and has not underfitted or overfitted the training data. The model
has generalized well from the training data and makes accurate predictions on new, unseen data. Hence,
we have successfully trained a model that can help us predict the travel fare accurately.
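A sketch of these evaluation steps is shown below; preprocessing refers to the pipeline sketched in section 3.8, final_model to the tuned XGBoost regressor, and the exact code in the notebook may differ.

```python
import pandas as pd
from sklearn.metrics import mean_squared_error

x_test_raw = test_set.drop(columns=["ID", "travel_fee"]).copy()
x_test_raw["departure_time"] = pd.to_datetime(x_test_raw["departure_time"])
y_test = test_set["travel_fee"].copy()

# Reuse the preprocessing pipeline fitted on the train set, then predict.
x_test = preprocessing.transform(x_test_raw)
y_pred = final_model.predict(x_test)
print(mean_squared_error(y_test, y_pred))
```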

64
Deployment <Contribution by: All members>

6.0 Deployment

6.1 Digital Infrastructure


We plan to deploy the model using a cloud service, because this makes it easier for people to access
the program through a web server. The provider we chose is Google Cloud, because it offers a scalable,
adaptable cloud infrastructure with good security and data analytics capability.

6.2 Web application


We plan to develop a travel fare web application using Flask. We chose this web framework because it
is a simple and lightweight framework for developing the web application in Python. Moreover, our
application is just a small and simple program that takes input from the user and predicts the travel
fare, so a lightweight framework is all we need.

6.3 Integrating the final model


The final model, the XGBoost regressor, can be loaded into the web application using the joblib
library from the saved file on the cloud.

6.4 API endpoint


We need to define an API endpoint so that the user can submit their data, which includes the departure
and arrival longitude and latitude, the date and time, and the occupancy, so that the model can
predict the travel fare.
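A hedged sketch of such an endpoint is shown below; the route, field names and model file names are assumptions, not details of the actual deployment.

```python
import joblib
import pandas as pd
from flask import Flask, jsonify, request

app = Flask(__name__)
model = joblib.load("final_model.pkl")       # tuned XGBoost regressor (assumed file name)
pipeline = joblib.load("preprocessing.pkl")  # fitted preprocessing pipeline (assumed file name)

@app.route("/predict", methods=["POST"])
def predict():
    # Expects JSON with departure/arrival coordinates, departure_time and occupancy.
    payload = request.get_json()
    row = pd.DataFrame([payload])
    fare = float(model.predict(pipeline.transform(row))[0])
    return jsonify({"travel_fee": fare})

if __name__ == "__main__":
    app.run()
```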

6.5 Testing
Before deploying the application, we will test it to make sure that the program works as intended and
provides accurate travel fare predictions. We plan to use Postman to send sample data to the API
endpoint and check the accuracy of the response.

6.6 Monitor
After the testing is complete and the application has been deployed, we need to monitor the
performance of the application, including the traffic, the response rate and the error percentage. We
will also constantly collect feedback from users so that we can plan further improvements and
understand how our model performs. The actual travel fare is an important piece of data that we should
collect so that we can evaluate the model's performance.

6.7 Maintenance
We will perform regular model retraining so that the model stays up to date with the latest trends.
Data patterns change over time, so we will schedule retraining every 2 weeks, and also retrain based on
the feedback we receive. We will also make sure that the data we use is of good quality so that the
retrained model performs well.

6.8 Documentation
We will constantly update the documentation of our application and model, recording all changes to the
model as well as the problems faced and the error reports we receive.

Conclusion <Contribution by: All members >

7.0 Conclusion
In conclusion, data mining allowed us to make predictions using the available data. With the dataset
provided, we were able to train a model that predicts the travel fare quite accurately, with an RMSE of
just 4.727. With this model, people can predict their travel fare, which increases customer
satisfaction because they can know whether they are receiving a fair price. Based on our model, we can
see that the data is useful for predicting the travel fare: there are observable patterns between the
features and the travel fare, especially with the departure time, where there is a clear trend. The
main problems we faced during the assignment were the training and fine-tuning of the models. To
prevent our model from overfitting and underfitting, we decided to use the full training dataset
instead of randomly sampling a subset, which made the training time very long; to speed up the training
process, we therefore reduced the number of hyperparameters used for fine-tuning, which may have
prevented us from obtaining a better model. In the future, we plan to improve the preprocessing
pipeline. Feature engineering is one of the things we considered but were unable to complete: in our
notebook we created new attributes including distance travelled, travel duration, is weekend and is
holiday, but decided not to include them in the model because the model was already complicated, so we
leave them for future improvement. Furthermore, we still need to do more research on the travel fee
values to see whether they are affected by discounts, vouchers and so on; for example, the dataset
contains negative travel fees that might be caused by these issues. Finally, we would also like to try
more hyperparameters for the fine-tuning, but this would require investing in a better device so that
the training speed can be increased.

