Data Mining Report
UCCB3224
Data Mining Techniques
Group Assignment
June 23 Trimester
Lecturer: Dr. Abdulkarim Kanaan Jebna
Assignment Marksheet
By signing below, we confirm that the work produced is original and purely based on our own sentence construction. Should any plagiarism be detected, we agree to mark penalization on the part(s) detected.
Group members: Alex Tan Yong Xin, Ang Jia Xuan, Beh Ling Huey, Tan Le Jie (2004995), Wong Wei Jie
Criteria and weightage (marks are recorded per member against each criterion):
Individual: Final Deliverables (Quality of Work): 4% (Scale*4/5)
Team: Report and Documentation: 3% (Scale*3/5)
Team: Final Deliverables (Quality of System): 3% (Scale*3/5)
Team: Innovativeness: 2% (Scale*2/5)
Total: 15%
Marking Scheme
2: Below average work produced; evidence of some study or thought/idea, but not quite adequate
0: Not attempted
Table of Contents
Executive Summary <Contribution by: All members>
Business Understanding
1.1 Background Information
1.2 Problem statement
1.3 Objective
1.4 Data Mining Goals
1.5 Success Criteria
1.6 Gantt Chart
1.7 Development Tools
Data Understanding <Contribution by: All members>
2.1 Understanding the dataset’s structure and dimensions
2.2 Understanding the travel fee
2.3 Creating testset
2.4 Data visualization
2.5 Travel fee
2.6 Arrival Location
2.7 Departure location
2.8 Departure time count (original data)
2.9 Departure time count (time only)
2.10 Departure time vs average travel fee
2.11 Departure date
2.12 Departure time counts (include seconds)
2.13 Correlation
2.14 Query 1: Looking for range of departure_long and departure_lat
2.15 Query 2: Looking for range of arrival_long and arrival_lat
2.16 Query 3: Look for the most frequent departure time
2.17 Query 4
2.18 Query 5
Data Preparation <Contribution by: All members>
3.0 Data Preprocessing (missing value)
3.1 Data Cleaning (departure and arrival)
3.1.1 Before removing outliers
3.1.2 After removing outliers
3.2 Feature Engineering (distance_travelled) future improvement
3.3 Feature Engineering (travel_time_used) future improvement
3.4 Feature Engineering (time_of_day, is_weekend) future improvement
3.5 Correlation Test
3.6 Feature Selection (Dropping ID)
3.7 Feature Scaling (min max scaler, standard scaler)
3.8 Last preprocessing
Modeling <Contribution by: All members>
4.1 Train model and measure its performance
4.1.1 Decision Tree Regressor
4.1.2 Linear Regression
4.1.3 Non-linear regression
4.1.4 Random forest regressor
4.1.5 SVM Linear Regression
4.1.6 k-NN Regressor
4.1.7 Ridge Regression
4.1.8 Lasso Regression
4.1.9 Elastic Net Regression
4.1.10 Stochastic gradient descent
4.1.11 Gradient Boosting Regression
4.1.12 XGBoost Regression
4.1.13 LightGBM Regression
4.1.14 MLP
4.2 Short-listing the 5 most promising models
4.3 Fine tuning result
4.3.1 Decision tree regressor
4.3.2 XGBoost regression
4.3.3 KNN Regressor
4.3.4 Random forest regressor
4.3.5 Gradient Boosting Regression
Evaluation <Contribution by: All members>
5.0 Evaluation
Deployment <Contribution by: All members>
6.0 Deployment
6.1 Digital Infrastructure
6.2 Web application
6.3 Integrating the final model
6.4 API endpoint
6.5 Testing
6.6 Monitor
6.7 Maintenance
6.8 Documentation
Conclusion <Contribution by: All members>
7.0 Conclusion
Executive Summary <Contribution by: All members>
In this project, we aim to create models that can predict fair pricing for a taxi service, following the CRISP-DM process. First, task one, business understanding, introduces the background of taxi services, identifies the problem statements and objectives, and defines the data mining goals. This is followed by the second task, data understanding, which requires us to understand the data we will work with: each feature's name, description, type and percentage of missing values. We then visualize the data with different types of charts, such as scatter plots, histograms, bar charts and line charts, and study the correlation of the features with the target variable travel_fee. The third task is data preparation: handling missing values and outliers and deciding the numerical and categorical attributes with the help of SimpleImputer. In this stage, irrelevant features are dropped from the train set, and new features are created from the existing features in the dataset. The fourth task, modeling, involves using different algorithms: Decision Tree Regressor, Linear Regression, Non-linear Regression, Random Forest Regressor, SVM Linear Regression, k-NN Regressor, Ridge Regression, Lasso Regression, Elastic Net Regression, Stochastic Gradient Descent, Gradient Boosting Regression, XGBoost Regression, LightGBM Regression and MLP. In the end, five of the fourteen models are shortlisted: decision tree regressor, XGBoost regression, KNN regressor, random forest regressor and gradient boosting regression. After fine tuning, XGBoost regression is selected as the final model.
Business Understanding
1.1 Background Information
In this project, how fair pricing is predicted in the taxi service sector is the major interest of the whole assignment, as it is the fundamental that helps both the passengers and the service provider get what they want. To achieve that, data understanding and preparation will be done before models are trained to predict fair pricing. Fair pricing helps to ensure that the charges are reasonable and satisfying for the journeys, while transparency refers to the method or models by which the fares are predicted and calculated.
1.2 Problem statement
1.2.1 Malaysian taxi services have had a bad reputation due to unstandardized conduct by drivers, such as not using taximeters and choosing preferred destinations regardless of whether passengers are locals or tourists, while operating poorly maintained, old taxis.
1.2.2 The dataset provided has features with outliers, such as arrival_long. With these outliers present, the train set is affected by the mistreated values, which will in the end affect performance on the test set.
1.2.3 When a machine learning model underfits or overfits, the training and validation errors are affected: an overfitted model shows low training error and high validation error, while an underfitted model shows high error for both training and validation, usually because the model is too simple or the dataset is too small.
1.2.4 Machine learning has many different algorithms that can be applied to predict numerical values. Standardized indicators should be used so that comparisons can be made among all the algorithms.
1.2.5 The existing features do not provide enough information for data exploration. The existing features have correlation values with the target feature travel_fee that are far from one.
1.3 Objective
The major objective of this project is to ensure that customers get fair pricing every time they use the taxi service, with the insight gained from the historical data. Another aim is to improve customer satisfaction for all rides by considering the time, distance traveled and occupancy of each ride, and thus retain more customers. Moreover, we want to attract new customers to the taxi service, since customers can predict their travel fare and file a complaint if they receive an unreasonable one. Hence, the aim is to increase the overall sales of taxi fares by at least 15% and to boost customer engagement by creating a pleasant ride experience, so that customers provide good feedback and promote the taxi service through word of mouth.
1.4 Data Mining Goals
In this assignment, we will try to achieve 5 goals to predict the travel fee based on departure latitude and longitude, destination latitude and longitude, and occupancy.
1.4.1 Create a model to predict the fair price using the historical data of departure latitude and longitude, destination latitude and longitude, and occupancy.
1.4.2 To gain insight into the dataset, removing the outliers and filling in missing values.
1.4.3 Use different charts, such as histograms and scatter plots, to visualize some of the important features that we would like to highlight.
1.4.4 By creating different models with different algorithms, look for the model with the lowest test and validation error and shortlist the most promising models. Models with low mean squared error values imply that the predictions are more reliable, therefore leading to better decisions.
1.4.5 Create new features with correlation values near to one, which can help to select features that have a strong influence on the target variable travel_fee.
1.5 Success Criteria
1.5.1 Fair pricing predictions should be made.
1.5.2 Service costs should be reasonable and reflect the value provided.
1.5.3 Customers are not overcharged or subjected to arbitrary or discriminatory fares.
The success criteria above aim to increase customer satisfaction by producing accurate, fair travel fee predictions.
1.7 Development Tools
1.7.1 Anaconda
Anaconda is a distribution with a graphical user interface (GUI) that supports applications such as Jupyter Notebook.
1.7.2 Jupyter Notebook
Jupyter Notebook is a server-client program that enables different users to run and edit the same notebook files over a web browser, which helps us in our assignment.
Data Understanding <Contribution by: All Members>
2.1 Understanding the dataset’s structure and dimensions
The dataset has 8 columns and 1,048,575 rows. The columns are ID, travel_fee, departure_time, departure_long, departure_lat, arrival_long, arrival_lat and occupancy.
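The structural checks described here map to a few pandas calls. A minimal sketch, using a small illustrative frame with the same eight columns (in the actual notebook this would come from loading the full CSV, whose filename is not given in the report):

```python
import pandas as pd

# Small stand-in frame with the same columns as the report's dataset.
df = pd.DataFrame({
    "ID": [1, 2, 3],
    "travel_fee": [7.5, 10.0, 52.0],
    "departure_time": pd.to_datetime(
        ["2013-07-02 19:54:00", "2013-07-02 08:15:00", "2013-07-03 14:02:00"]),
    "departure_long": [-73.98, -73.97, -74.00],
    "departure_lat": [40.75, 40.76, 40.71],
    "arrival_long": [-73.96, -73.99, -73.95],
    "arrival_lat": [40.78, 40.73, 40.80],
    "occupancy": [1, 2, 1],
})

print(df.shape)          # (number of rows, number of columns)
print(list(df.columns))  # the 8 feature names
print(df.isna().mean())  # fraction of missing values per column
```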
2.2 Understanding the travel fee
Through this, we can understand the min, max, mean, standard deviation, 25th percentile, median and 75th percentile of the travel fee. We know that most of the travel fees are around 10 dollars, so the maximum value of 450 might be an outlier or a special case.
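These summary statistics come from a describe()-style call. A sketch with illustrative fee values (not the real dataset):

```python
import pandas as pd

# Illustrative fees: mostly around 10 dollars, with one extreme value.
fees = pd.Series([4.5, 7.0, 9.5, 10.0, 10.5, 12.0, 450.0], name="travel_fee")

# count, mean, std, min, 25%, 50% (median), 75%, max in one call.
stats = fees.describe()
print(stats)
print(stats["50%"])  # the median fee
```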
2.4 Data visualization
2.4.1 Occupancy
From figure 2.5, we can see that occupancy takes 10 different values, from 0 to 9, where 1 is the most common. From this figure, we suspect that 0 might be an abnormal case, because there cannot be a ride without an occupant, so it may be noise in the data; but we need to explore the data further before deciding whether it is noise or not. Occupancy 9 may also be an outlier, but it too needs more investigation.
2.4.1.1 Discovering occupancy 0 and 9
Figure 2.8 Occupancy 0 and 9 (2)
From figures 2.7 and 2.8, we can see that occupancy 0 is noise, because those rows contain a lot of non-numeric data as well as 0 values. The occupancy 9 rows, however, seem to have no issues. So we conclude that occupancy 0 is noise, while occupancy 9 is legitimate data.
2.5 Travel fee
Figure 2.10 Travel fee value count
From figure 2.9, we can see that most of the data is around the 10 dollar mark, as also shown in figure 2.3. There is still a considerable amount of data around the 50 dollar mark, but from there onwards the count decreases. There is also data on the negative side; this may be because the user applied a discount or cashback voucher to the ride.
2.6 Arrival Location
From figure 2.11, we can see that the minimum value of arrival_lat is around -750 and the maximum is around 50, while the minimum value of arrival_long is around -76 and the maximum is around 74. Most of the arrival_long and arrival_lat values are around -90 and 40 respectively, which suggests that most rides end in the same destination area.
From figure 2.12, we can also conclude that most of the data comes from the same location, at around -60 longitude and 40 latitude.
2.8 Departure time count (original data)
From this figure, we can see that the distribution of the departure time is quite even, with an average count of around 8, but this does not help much in explaining the data.
2.9 Departure time count (time only)
From figure 2.15, we can see a trend in the data: the count increases from 5AM to 2PM, decreases until 4.40PM, then increases until 8PM; from there onwards the number of rides decreases.
2.10 Departure time vs average travel fee
From figure 2.16, comparing with figure 2.15, we can also observe a relationship between travel fee and time: when the count of rides decreases, the travel fare increases.
2.11 Departure date
From figure 2.17, we can see that the dates are quite evenly distributed. Since the data is in date format and shows no clear pattern, we decided not to use the date for the prediction, as it would not help much.
2.12 Departure time counts (include seconds)
From figure 2.18, we can observe a pattern similar to figure 2.15. Since the pattern is similar, we decided to use only the hour and minute for the prediction.
2.13 Correlation
Figure 2.20 Correlation Matrix Heatmap
From figures 2.19 and 2.20, we can see that the departure and destination locations are highly correlated with each other, but not correlated with the travel fee.
2.14 Query 1: Looking for range of departure_long and departure_lat
From this query, we can see that departure_long may have outliers.
2.15 Query 2: Looking for range of arrival_long and arrival_lat
From this query, we can see that arrival_long may have outliers.
2.16 Query 3: Look for the most frequent departure time
From this query, we know that the most frequent departure time is around 8PM.
2.17 Query 4
From this query, we can conclude that occupancy with 0 needs to be removed.
2.18 Query 5
From this query, we can see the range of the travel fee. The value -52 is uncommon, but it may be due to a voucher claim or a lost record.
Data Preparation <Contribution by: All members>
Figure 3.2 Process of removing outliers.
3.1.2 After removing outliers
Latitude is not supposed to exceed ±90 and longitude is not supposed to exceed ±180. After removing the outliers, the data ranges between the correct values.
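The outlier filter described above can be sketched as a boolean mask that keeps only coordinates inside the valid ranges; the sample rows below are illustrative:

```python
import pandas as pd

df = pd.DataFrame({
    "departure_long": [-73.98, -800.0, -74.00],  # -800 is an invalid longitude
    "departure_lat": [40.75, 40.76, 500.0],      # 500 is an invalid latitude
    "arrival_long": [-73.96, -73.99, -73.95],
    "arrival_lat": [40.78, 40.73, 40.80],
})

# Keep rows whose longitudes lie in [-180, 180] and latitudes in [-90, 90].
lon_cols = ["departure_long", "arrival_long"]
lat_cols = ["departure_lat", "arrival_lat"]
mask = (
    df[lon_cols].apply(lambda c: c.between(-180, 180)).all(axis=1)
    & df[lat_cols].apply(lambda c: c.between(-90, 90)).all(axis=1)
)
clean = df[mask]
print(len(clean))  # only the first row survives
```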
3.4 Feature Engineering (time_of_day, is_weekend) future improvement
For the feature engineering tasks, new features named distance_travelled, travel_time_used, time_of_day and is_weekend were created.
3.5 Correlation Test
After creating the new attributes, we tested the correlation to check whether any of them have a stronger positive correlation with the target variable travel_fee.
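A correlation check like this can be sketched with pandas' corr(); the frame below uses made-up values purely for illustration:

```python
import pandas as pd

df = pd.DataFrame({
    "travel_fee": [5.0, 8.0, 12.0, 20.0],
    "distance_travelled": [1.0, 2.1, 3.9, 7.0],  # engineered feature
    "occupancy": [1, 3, 1, 2],
})

# Pearson correlation of every numeric feature with the target travel_fee,
# sorted so values nearest to 1 come first.
corr = df.corr(numeric_only=True)["travel_fee"].sort_values(ascending=False)
print(corr)
```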
3.6 Feature Selection (Dropping ID)
Since the ID is not useful for predicting the travel fare, we dropped it when creating the train set.
3.7 Feature Scaling (min max scaler, standard scaler)
3.8 Last preprocessing
Figure 3.13 Dataset after preprocessing
Figure 3.13 shows the prepared training data; it includes departure_long, departure_lat, arrival_long, arrival_lat, occupancy and departure_hour_min.
For the numerical attributes, we first remove the noise, i.e. the rows with 0 occupancy. Then we filter the outliers, retaining only data between -180 and 180, so the outliers observed in figures 2.22 and 2.23 are filtered out. Next we use a SimpleImputer with the median strategy to replace missing values; in this case there are none, but if missing values appear in the future, the median will replace them. Finally, because regression models are sensitive to the scale of the data, we use StandardScaler to scale all the data so that we can train a good model.
For the categorical attributes, we use a SimpleImputer with the most_frequent strategy so that, in the future, missing values can be replaced with the mode.
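The numerical branch described above (median imputation followed by standardization) can be sketched as a scikit-learn pipeline; the column names follow the report, while the sample rows are made up:

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

num_cols = ["departure_long", "departure_lat", "arrival_long",
            "arrival_lat", "occupancy"]

num_pipe = Pipeline([
    ("imputer", SimpleImputer(strategy="median")),  # fill future NaNs with the median
    ("scaler", StandardScaler()),                   # regression models are scale-sensitive
])

preprocess = ColumnTransformer([
    ("num", num_pipe, num_cols),
    # a categorical branch would use SimpleImputer(strategy="most_frequent")
])

X = pd.DataFrame({
    "departure_long": [-73.98, -73.97, np.nan],  # one missing value to impute
    "departure_lat": [40.75, 40.76, 40.71],
    "arrival_long": [-73.96, -73.99, -73.95],
    "arrival_lat": [40.78, 40.73, 40.80],
    "occupancy": [1, 2, 1],
})
X_prepared = preprocess.fit_transform(X)
print(X_prepared.shape)  # (3, 5): scaled numeric matrix ready for training
```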
Finally, we created a new attribute from the departure_time attribute, called departure_hour_min. It contains the hour and minute of the timestamp, so the date and seconds from the original dataset are dropped; for example, 2013-07-02 19:54:00+00:00 is transformed to 19.90. We also include an imputer with the median strategy to replace any future missing values with the median. Finally, we scale this attribute as well, so that the model can learn the problem more easily.
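The hour-and-minute transformation can be sketched as follows; it reproduces the 19:54 → 19.90 example from the text (minutes become a fraction of an hour, seconds are discarded):

```python
import pandas as pd

ts = pd.Series(pd.to_datetime(["2013-07-02 19:54:00+00:00",
                               "2013-07-03 06:30:15+00:00"]))

# Hour plus minutes-as-fraction-of-an-hour; date and seconds are dropped.
hour_min = ts.dt.hour + ts.dt.minute / 60.0
print(hour_min.round(2).tolist())  # [19.9, 6.5]
```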
Modeling <Contribution by: All members>
4.1 Train model and measure its performance
4.1.1 Decision Tree Regressor
Figure 4.3 Decision tree cross validation
Using the decision tree model, we got an RMSE of 1.735. After performing cross validation, we got a mean RMSE of 5.847. The difference between the training error and the validation error is quite small, hence we conclude that the model is optimal, neither underfitting nor overfitting the data.
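The training-versus-cross-validation comparison used throughout this section can be sketched as follows. The data here is synthetic (make_regression), so the numbers will not match the report's. Note that scikit-learn's scoring convention returns the negative MSE, which is also why the fine-tuning "best scores" later in the report are negative:

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeRegressor

# Synthetic stand-in for the prepared taxi features and fares.
X, y = make_regression(n_samples=500, n_features=6, noise=10.0, random_state=42)

model = DecisionTreeRegressor(random_state=42)
model.fit(X, y)

# Training RMSE: error on data the tree has already seen (a full tree fits it exactly).
train_rmse = np.sqrt(mean_squared_error(y, model.predict(X)))

# Cross-validated RMSE: negate the neg_mean_squared_error scores before the sqrt.
neg_mse = cross_val_score(model, X, y, cv=10, scoring="neg_mean_squared_error")
cv_rmse = np.sqrt(-neg_mse).mean()
print(train_rmse, cv_rmse)
```

A large gap between the two numbers is the overfitting signal the report uses to reject models.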
4.1.2 Linear Regression
Figure 4.6 Linear regression cross validation
Using the linear regression model, we got an RMSE of 10.805. After performing cross validation, we got a mean RMSE of 116.766. The difference between the training error and the validation error is very big, hence we conclude that the model is not optimal and has overfitted the data. This model is not good for predicting the data, as it has memorized the training data and provides inaccurate predictions.
4.1.3 Non-linear regression
Figure 4.9 Non-linear regression cross validation
Using the non-linear regression model, we got an RMSE of 10.767. After performing cross validation, we got a mean RMSE of 263.416. The validation error of this model is very high, and the difference between the training error and the validation error is very big, hence we conclude that the model is not optimal and has overfitted the data. This model is not good for predicting the data, as it has memorized the training data and provides inaccurate predictions.
4.1.4 Random forest regressor
Figure 4.12 Random forest regressor cross validation
Using the random forest regressor model, we got an RMSE of 5.162. After performing cross validation, we got a mean RMSE of 29.1276. The validation error is slightly higher than the training error but still within an acceptable range, hence we conclude that the model is optimal, neither underfitting nor overfitting the data.
4.1.5 SVM Linear Regression
Figure 4.15 SVM linear regression cross validation
Using the SVM linear regression, we got an RMSE of 11.319. After performing cross validation, we got a mean RMSE of 127.860. The difference between the training error and the validation error is very big, hence we conclude that the model is not optimal and has overfitted the data.
4.1.6 k-NN Regressor
Figure 4.18 k-NN Regressor cross validation
Using the k-NN regressor, we got an RMSE of 4.880. After performing cross validation, we got a mean RMSE of 37.236. The validation error is slightly higher than the training error but still within an acceptable range, hence we conclude that the model is optimal, neither underfitting nor overfitting the data. But we still need to fine tune the model to obtain a better result.
4.1.7 Ridge Regression
Figure 4.21 Ridge Regression cross validation
Using the ridge regression, we got an RMSE of 10.805. After performing cross validation, we got a mean RMSE of 116.766. The difference between the training error and the validation error is very big, hence we conclude that the model is not optimal and has overfitted the data.
4.1.8 Lasso Regression
Figure 4.24 Lasso regression cross validation
Using the lasso regression, we got an RMSE of 10.811. After performing cross validation, we got a mean RMSE of 116.878. The difference between the training error and the validation error is very big, hence we conclude that the model is not optimal and has overfitted the data.
4.1.9 Elastic Net Regression
Figure 4.27 Elastic net regression prediction and RMSE
Using the elastic net regression, we got an RMSE of 10.811. After performing cross validation, we got a mean RMSE of 116.878. The difference between the training error and the validation error is very big, hence we conclude that the model is not optimal and has overfitted the data.
4.1.10 Stochastic gradient descent
Figure 4.30 Stochastic gradient descent cross validation
Using the stochastic gradient descent regressor, we got an RMSE of 10.812. After performing cross validation, we got a mean RMSE of 116.974. The difference between the training error and the validation error is very big, hence we conclude that the model is not optimal and has overfitted the data.
4.1.11 Gradient Boosting Regression
Figure 4.32 Gradient boosting regressor cross validation
Using the gradient boosting regressor, we got an RMSE of 5.295. After performing cross validation, we got a mean RMSE of 28.266. The validation error is slightly higher than the training error but still within an acceptable range, hence we conclude that the model is optimal, neither underfitting nor overfitting the data. But we still need to fine tune the model to obtain a better result.
4.1.12 XGBoost Regression
Figure 4.35 XGBoost Regression cross validation
4.1.13 LightGBM Regression
Figure 4.35 LightGBM Regression cross validation
Using the LightGBM regression, we got an RMSE of 4.669. After performing cross validation, we got a mean RMSE of 22.570. The validation error is slightly higher than the training error but still within an acceptable range, hence we conclude that the model is optimal, neither underfitting nor overfitting the data.
4.1.14 MLP
Figure 4.36 MLP Regressor cross validation
Using the MLP regressor, we got an RMSE of 4.848. After performing cross validation, we got a mean RMSE of 44.763. The validation error is noticeably higher than the training error, and there are still better models, hence we conclude that the model is not optimal and has overfitted the data.
4.2 Short-listing the 5 most promising models
Out of the 14 models we tested, we decided to choose the decision tree regressor, XGBoost regression, KNN regressor, random forest regressor and gradient boosting regression. This is because they are the most optimal models among the candidates: they did not overfit or underfit the data and are able to provide a good prediction of the travel fare. We will fine tune the shortlisted models in order to determine the best model among them. For the fine tuning, we will use grid search to perform hyper-parameter optimisation. Due to the long running time, we decided to cut down the number of hyper-parameter values used for the optimization, because when we tried many hyper-parameters the results could not be obtained even after running for more than 10 hours.
4.3 Fine tuning result
4.3.1 Decision tree regressor
For the decision tree regressor, the hyperparameters we chose are min_samples_split and min_samples_leaf, using the values [2,4,6] and [1,5,10,15,20] respectively. After fine tuning the model, the best hyperparameters are min_samples_leaf=20 and min_samples_split=2. The best score we obtained is -22.451.
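This grid search can be sketched with GridSearchCV over the same min_samples_split and min_samples_leaf grids. Synthetic data stands in for the prepared taxi features, so the best parameters and score will differ from the report's; the negative best score comes from the neg_mean_squared_error scoring convention:

```python
from sklearn.datasets import make_regression
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeRegressor

# Synthetic stand-in for the prepared taxi features and fares.
X, y = make_regression(n_samples=300, n_features=6, noise=5.0, random_state=0)

param_grid = {
    "min_samples_split": [2, 4, 6],
    "min_samples_leaf": [1, 5, 10, 15, 20],
}
search = GridSearchCV(
    DecisionTreeRegressor(random_state=0),
    param_grid,
    scoring="neg_mean_squared_error",  # hence the negative "best score" values
    cv=5,
)
search.fit(X, y)
print(search.best_params_)
print(search.best_score_)  # negative MSE: closer to 0 is better
```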
4.3.2 XGBoost regression
For the XGBoost regression, the hyperparameters we chose are reg_alpha and reg_lambda; the reason we chose these hyperparameters is that they help prevent overfitting by adding penalty terms. The best parameters are reg_alpha=0.1 and reg_lambda=0.1. The best score we obtained is -20.433.
4.3.3 KNN Regressor
For the KNN regressor, the hyperparameter we chose is n_neighbors, using the values [3,5,7,10]. The best parameter is n_neighbors=3. The best score we obtained is -41.943; hence this is not a good model compared to the previous two, as its error is much larger.
4.3.4 Random forest regressor
For the random forest regressor, the hyperparameter we chose is max_features, because limiting the features considered at each split can prevent the model from focusing on irrelevant or noisy features. The best parameter is 1.0, which corresponds to auto. The best score we obtained is -29.071, which is still considered good compared to the other models.
4.3.5 Gradient Boosting Regression
For the gradient boosting regression, the hyperparameter we chose is max_depth, using the values [3,5,7]. The best parameter is max_depth=7. The best score we obtained was -20.984.
After fine tuning the 5 shortlisted models, the best model is XGBoost with a best score of -20.433, followed by gradient boosting regression with -20.984, decision tree with -22.451, random forest with -29.071 and KNN regressor with -41.943. Based on the data mining goals, we want to prevent overfitting and underfitting and look for the model with the lowest mean squared error; hence we decided to use XGBoost as the final model, as it is the best among the 5 shortlisted models.
Figure 4.42 Final Model
Evaluation <Contribution by: All members>
5.0 Evaluation
To evaluate the final model on the test set, we first drop the ID and travel_fee columns to get X_test, and convert the departure time of X_test into datetime format. Then we copy travel_fee from the test set to get y_test. Using the final model and X_test, we predicted the travel fares and obtained a root mean squared error of 4.7269. We are satisfied with this result, as the model provides quite accurate predictions with a low error and has not underfitted or overfitted the training data. The model has generalized well from the training data and makes accurate predictions on new, unseen data. Hence, we have successfully trained a model that can help us predict the travel fare accurately.
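The evaluation procedure can be sketched as follows. Synthetic data stands in for the taxi features, and scikit-learn's GradientBoostingRegressor stands in for the final XGBoost model (same fit/predict usage), so the RMSE will not match the report's 4.7269:

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the prepared features (X) and travel fares (y).
X, y = make_regression(n_samples=400, n_features=6, noise=5.0, random_state=1)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=1)

# Fit on the train split only, then score on the held-out test split.
final_model = GradientBoostingRegressor(random_state=1).fit(X_train, y_train)
y_pred = final_model.predict(X_test)
rmse = np.sqrt(mean_squared_error(y_test, y_pred))
print(rmse)
```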
Deployment <Contribution by: All members>
6.0 Deployment
6.5 Testing
Before deploying the application, we will test it to make sure that the program works as intended and provides accurate travel fare predictions. We plan to use Postman to send sample data to the API endpoint and check the accuracy of the response.
6.6 Monitor
After testing is complete and the application has been deployed, we need to monitor the performance of the application, including the traffic, response rate and error percentage. We will also continuously collect feedback from users so that we can plan further improvements and track the performance of our model. The actual travel fare values are important data to collect so that we can evaluate the model's performance.
6.7 Maintenance
We will perform regular model retraining so that the model stays up to date with the latest trends. Data patterns change over time, so we will schedule retraining every 2 weeks, and also retrain based on the feedback we receive. We will also make sure that the data we use is of good quality so that the trained model performs well.
6.8 Documentation
We will continuously update the documentation of our application and model, recording all changes to the model as well as all the problems faced and error reports received.
7.0 Conclusion
In conclusion, data mining allowed us to make predictions using the available data. With the dataset provided, we were able to train a model that predicts the travel fare quite accurately, with an RMSE of just 4.727. With this model, people can predict their travel fare, which increases customer satisfaction because they can know whether they are receiving fair pricing. Based on our model, we can see that the data is useful in predicting the travel fare: we can observe patterns between the data and the travel fare, especially with the departure time, where there is a clear trend. The main problems we faced during the assignment were the training and fine tuning of the models. To prevent our model from overfitting and underfitting, we decided to use the full training dataset instead of randomly sampling a subset; this made the training time very long, so in order to speed up the training process we reduced the number of hyperparameters used for fine tuning, which may have prevented us from obtaining a better model. In the future, we plan to improve the preprocessing pipeline. Feature engineering is one of the things we considered but were unable to complete: in our file, we created new attributes including distance traveled, travel duration, is weekend and is holiday, but decided not to include them in the model because the model is already complicated, so we leave them for future improvement and updates. Furthermore, we still need to do more research on the travel fee values to see whether they are affected by discounts, vouchers and so on; for example, the dataset we got has negative travel fees that might be caused by these issues. Finally, we would also like to try more hyperparameters for the fine tuning, but this would require investing in a better device so that the training speed can be increased.