Flight Price Prediction Using Machine Learning
Sai Bende1, Dr. A.D. Sawarkar2
1Department of Information Technology, Shri Guru Gobind Singhji Institute of Engineering and Technology
(SGGSIE&T), Nanded.
Abstract:
This research delves into an in-depth analysis of flight booking data sourced from the "Ease My
Trip" website, aiming to prognosticate ticket prices through a meticulous amalgamation of statistical
hypothesis tests and regression algorithms. Addressing the exigent need to unravel the myriad factors
underpinning ticket pricing dynamics, the study endeavors to furnish discerning insights tailored to
prospective passengers. Methodologically, a multifaceted approach encompassing data preprocessing,
visualization, and modeling is meticulously executed. Leveraging a comprehensive suite of techniques,
the dataset undergoes rigorous cleansing, feature transformation, and missing value imputation.
Through a kaleidoscopic lens of visualization, diverse facets of the dataset are illuminated, spanning
airline distributions, source cities, departure times, stops, arrival times, destination cities, flight classes,
duration, days left, and ticket prices.
The predictive mantle is assumed by a trio of regression algorithms: Linear Regression,
Decision Tree Regressor, and RandomForest Regressor, each meticulously calibrated to unlock the
underlying pricing patterns. Noteworthy endeavors in hyperparameter tuning via GridSearchCV bolster
the efficacy of Decision Tree and RandomForest models. Evaluation of model performance pivots on
the metric of R2 scores, culminating in the ascension of the RandomForest Regressor as the paragon
of predictive accuracy. Crucial insights gleaned from the analysis spotlight the pivotal role of flight class
as the linchpin feature dictating ticket prices, with Business class tickets emerging as the vanguard of
expense over their Economy counterparts. Remarkably, the preeminent RandomForest Regressor
clinches a staggering R2 score of 0.985, emblematic of its prodigious predictive acumen.
Keywords: Flight prediction, Machine learning regression, Linear regression, Feature selection, Flight
ticket prices.
I. Introduction
The advent of machine learning regression heralds a transformative era in predictive modeling,
reshaping industries through its data-centric approach to decision-making. At its essence, regression
analysis employs mathematical algorithms to delineate the intricate relationships between independent
and dependent variables, facilitating the nuanced prediction of continuous outcomes. In the dynamic
milieu of the airline sector, characterized by fluid ticket pricing influenced by a myriad of factors, the
application of regression techniques assumes paramount importance. By delving into historical flight
booking data, regression models unveil latent patterns and trends, furnishing invaluable insights for
crafting pricing strategies, optimizing revenue streams, and augmenting customer satisfaction levels.
Amidst the ever-evolving landscape of the airline industry, the accurate prognostication of flight
ticket prices stands as a formidable challenge. Conventional methodologies often falter in capturing the
intricate interplay of variables that govern pricing dynamics. Hence, there exists a compelling imperative
to harness advanced machine learning regression algorithms to construct robust predictive models
capable of precisely forecasting ticket prices.
The principal aim of this study is to deploy machine learning regression techniques to scrutinize
a flight booking dataset procured from the "Ease My Trip" website. Specifically, the research endeavors
to probe the correlations between diverse features such as airline, departure time, stops, and ticket
prices. Furthermore, it seeks to develop and meticulously evaluate regression models, including Linear
Regression, Decision Tree Regression, and Random Forest Regression, to accurately predict flight
ticket prices. The ultimate objective is to distill actionable insights from the analysis, thereby
empowering stakeholders to refine pricing strategies and optimize revenue generation within the airline
industry.
Structured methodically, the paper unfolds across several delineated sections to
comprehensively address the aforementioned objectives. Commencing with an introduction to the
Python libraries instrumental in data analysis and modeling, the paper proceeds to elucidate the
intricacies of data preprocessing, encompassing tasks such as data importation, cleansing, and feature
engineering. Subsequent sections delve into the detailed exploration of the dataset through an array of
visualization techniques. The methodology section navigates through the implementation of diverse
regression algorithms and expounds upon the evaluation metrics employed to gauge model
performance. Finally, the paper culminates with a discourse on key insights gleaned from the analysis,
accompanied by conclusive remarks and recommendations for future research trajectories.
Table: Summary of key studies exploring different regression techniques, their applications, and major findings.
Furthermore, with the proliferation of big data and advancements in computing technology, there is a
need for scalable regression techniques capable of handling large-scale datasets efficiently.
III. Methodology
Data Collection
The dataset used in this study is obtained from the "Ease My Trip" website, a platform for
booking flight tickets. It consists of flight booking data with various attributes such as airline, flight details,
source and destination cities, departure and arrival times, number of stops, class, duration, days left
until the trip, and price. The dataset contains 300,153 entries with 12 columns.
Data Preprocessing
In preparation for modeling, a series of preprocessing procedures were meticulously executed
to fortify the dataset's integrity and usability for subsequent analysis and modeling endeavors. The
sequential preprocessing actions undertaken were as follows:
Firstly, meticulous scrutiny was employed to detect any instances of missing data within the
dataset. Fortunately, a thorough examination revealed a pristine absence of null values, ensuring the
dataset's completeness and reliability. Subsequently, superfluous columns were systematically pruned
from the dataset to streamline its structure and enhance interpretability. Notably, the 'Unnamed: 0'
column, presumed to serve as an index, was judiciously discarded due to its redundancy in furnishing
pertinent analytical insights. To mitigate potential conflicts with reserved keywords in Python, strategic
measures were adopted to rename the 'class' column as 'flight_class,' thus circumventing any potential
ambiguities during subsequent coding endeavors. Moreover, proactive measures were undertaken to
optimize the handling of categorical variables. String representations within categorical columns such
as airline, source_city, departure_time, stops, arrival_time, destination_city, and flight_class were
systematically replaced with discerning integer counterparts, facilitating streamlined computational
operations during the modeling phase. Furthermore, cognizant of the imperative to standardize
numerical features, a robust normalization technique was judiciously implemented. Leveraging the
efficacy of min-max scaling, numerical attributes were meticulously transformed to conform to a uniform
scale ranging between 0 and 1. This normalization procedure ensures equitable feature contributions,
thereby averting potential biases and confounding effects during subsequent analytical assessments.
Model Selection:
For the modeling phase, a strategic selection of three distinct regression algorithms was
meticulously orchestrated to harness their unique strengths and capabilities:
Firstly, Linear Regression emerged as a pivotal choice, adeptly employed to delineate a
coherent linear relationship between the input features and the target variable, which in this context
pertains to the pricing aspect. By leveraging this algorithm, the aim was to unravel the nuanced interplay
between various predictors and the target, thereby facilitating insightful price predictions.
In tandem with Linear Regression, the Decision Tree Regressor was judiciously incorporated
into the modeling framework. Renowned for its capacity to construct predictive models grounded in
decision tree algorithms, this methodology sought to delineate intricate decision paths within the data
landscape, thereby enabling robust price forecasting capabilities.
Furthermore, the modeling repertoire was enriched with the inclusion of the Random Forest
Regressor, heralded for its prowess in engendering ensemble models comprising multiple decision
trees. By harnessing the collective predictive prowess of diverse decision trees, this ensemble approach
aimed to enhance predictive accuracy while safeguarding against the perils of overfitting, thus fostering
more resilient and reliable price predictions.
Through the strategic amalgamation of these three distinct regression algorithms, the modeling
endeavor aspired to unlock a spectrum of predictive insights, thereby empowering stakeholders with
actionable intelligence conducive to informed decision-making within the domain of interest.
Evaluation Metrics:
The performance of the regression models was evaluated using the R2 score (coefficient of
determination), which measures the proportion of the variance in the dependent variable (price) that is
predictable from the independent variables. Additionally, Mean Squared Error (MSE) was calculated to
assess the average squared differences between the predicted and actual prices.
The data distribution and relationships between features were visualized using various plots such
as pie charts and box plots.
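Stated explicitly for reference (standard definitions, with yi the observed price, ŷi the predicted price, ȳ the mean observed price, and n the number of samples):
R2 = 1 - Σ(yi - ŷi)² / Σ(yi - ȳ)²
MSE = (1/n) Σ(yi - ŷi)²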
Model Implementation:
The implementation of regression models, encompassing Linear Regression, Decision Tree
Regressor, and Random Forest Regressor, was orchestrated using the Python programming language
in conjunction with a diverse array of libraries and tools tailored for data manipulation, visualization, and
model training. The salient elements underpinning the implementation endeavor are outlined as follows:
Choice of Programming Language: Python was elected as the language of choice for model
implementation owing to its versatility and expansive repertoire of data science-centric libraries.
Software Toolkits:
● Pandas: Employed for proficient data manipulation and analysis tasks, encompassing dataset
importation and diverse preprocessing operations.
● NumPy: Leveraged for its prowess in numerical computing and adept handling of
multidimensional arrays, especially during the data preprocessing phase.
● Matplotlib and Seaborn: These visualization libraries were harnessed to craft insightful data
visualizations, including histograms, box plots, and pie charts, elucidating the dataset's intrinsic
distributions and inter-variable relationships.
● Scikit-learn (sklearn): A formidable arsenal for machine learning pursuits, furnishing an array of
tools for model training, evaluation, and hyperparameter optimization. Specific modules such
as LinearRegression, DecisionTreeRegressor, RandomForestRegressor, MinMaxScaler,
train_test_split, GridSearchCV, and various metrics were judiciously deployed.
● Computational Infrastructure: The implementation milieu was underpinned by a robust
computational environment meticulously tailored to accommodate the computational demands
inherent in executing the models and associated libraries. While granular specifics regarding
the computational setup are omitted from the discourse, it is presupposed that the hardware
infrastructure was suitably provisioned for the efficient execution of Python scripts.
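The following condensed sketch illustrates this workflow; it assumes the preprocessed DataFrame df from the earlier sketch and is illustrative rather than the authors' exact code:

from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.tree import DecisionTreeRegressor
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import r2_score, mean_squared_error

X = df.drop(columns=["price"])
y = df["price"]
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

models = {
    "Linear Regression": LinearRegression(),
    "Decision Tree": DecisionTreeRegressor(random_state=42),
    "Random Forest": RandomForestRegressor(random_state=42),
}
for name, model in models.items():
    model.fit(X_train, y_train)
    pred = model.predict(X_test)
    print(name, "R2:", r2_score(y_test, pred), "MSE:", mean_squared_error(y_test, pred))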
Hyperparameter Tuning
Hyperparameter tuning is a crucial step in optimizing the performance of machine learning
models. In this research, hyperparameter tuning was conducted to fine-tune the parameters of the
selected regression models for optimal performance. The process of hyperparameter tuning involved
the following steps:
Linear Regression:
● Hyperparameters: The primary hyperparameter in regularized Linear Regression is the regularization
strength (ordinary least squares itself has none; regularized variants such as Ridge or Lasso expose it).
● Tuning Process: A grid search approach was employed to tune the regularization parameter
using cross-validation.
● Grid Search Parameters: The grid search was performed over a range of regularization
parameter values to identify the optimal value that maximizes the model's performance.
The hyperparameter tuning process aimed to identify the optimal hyperparameters for each regression
model, thereby improving their predictive performance on the dataset.
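A sketch of this grid search follows. Because plain LinearRegression exposes no regularization parameter, the regularized variant Ridge is used to illustrate that step; all parameter grids are illustrative assumptions:

from sklearn.linear_model import Ridge
from sklearn.tree import DecisionTreeRegressor
from sklearn.model_selection import GridSearchCV

# Regularization strength for the regularized linear model
ridge_search = GridSearchCV(Ridge(), {"alpha": [0.01, 0.1, 1.0, 10.0]},
                            cv=5, scoring="r2")
ridge_search.fit(X_train, y_train)

# Illustrative grid for the Decision Tree Regressor
tree_search = GridSearchCV(
    DecisionTreeRegressor(random_state=42),
    {"max_depth": [5, 10, 20, None], "min_samples_split": [2, 10, 50]},
    cv=5, scoring="r2")
tree_search.fit(X_train, y_train)
print(tree_search.best_params_, tree_search.best_score_)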
Cross-Validation
Cross-validation is a robust technique used to assess the generalization performance of
machine learning models. In this research, cross-validation was employed to evaluate the performance
of the regression models and ensure their robustness. The methodology used for cross-validation
involved the following steps:
K-fold Cross-Validation:
● The dataset was divided into k subsets of approximately equal size.
● The regression model was trained k times, each time using k-1 subsets as training data and
the remaining subset as validation data.
● The performance metrics, such as R2 Score and Mean Squared Error, were computed for each
fold.
● The final performance metric was calculated by averaging the results obtained from all the folds.
● Number of Folds:
● The choice of the number of folds (k) in cross-validation is critical for obtaining reliable estimates
of model performance.
● Typically, k is set to a value between 5 and 10, balancing computational efficiency and statistical
robustness.
● In this research, k-fold cross-validation with k=10 was used to assess the performance of the
regression models.
By employing k-fold cross-validation, the research ensured that the performance estimates of
the regression models were not overly optimistic or pessimistic and provided a realistic assessment of
their generalization capabilities.
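A minimal sketch of this procedure, assuming the feature matrix X and target y defined earlier (the estimator choice is illustrative):

from sklearn.model_selection import cross_val_score
from sklearn.ensemble import RandomForestRegressor

scores = cross_val_score(RandomForestRegressor(random_state=42), X, y,
                         cv=10, scoring="r2")  # k = 10 folds, as stated above
print("Mean R2:", scores.mean(), "Std:", scores.std())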
Interpretation of Findings:
The insights gleaned from our regression models cast a revealing light on the multifaceted
determinants underpinning flight ticket prices. Central to our findings was the pivotal role played by flight
class, with business class fares commanding a premium over their economy counterparts. Moreover, subtle
nuances in ticket pricing dynamics were unveiled, with departure time, duration, and proximity
to the flight date emerging as influential factors. Armed with this nuanced understanding, passengers
and industry stakeholders alike are empowered to make informed decisions pertaining to pricing
strategies and travel planning.
Furthermore, the elucidation of interaction effects between flight class and other salient features
underscores the imperativeness of holistic consideration when predicting ticket prices. The discerned
interplay between flight class and variables such as the number of stops unraveled distinct pricing
paradigms across different classes, thereby underscoring the intricate tapestry of the aviation industry.
Comparative Insights:
A comparative analysis with prior studies not only unveiled consistencies but also unearthed novel
insights. While our findings align with established literature in identifying flight class and timing as pivotal
price determinants, our study propels the discourse forward by delving into interaction effects and
harnessing cutting-edge machine learning methodologies for prediction. The resounding success of the
Random Forest model serves as a testament to the efficacy of ensemble methods in unraveling
complex data patterns.
Moreover, our steadfast emphasis on robust feature engineering and meticulous preprocessing
techniques echoes contemporary trends in predictive modeling, accentuating the quintessence of data
quality and feature curation. By amalgamating domain expertise with avant-garde methodologies, our
study lends credence to the ongoing dialogue surrounding pricing dynamics in the aviation domain while
furnishing actionable insights for industry stakeholders.
In summation, our research endeavors furnish profound elucidations into the intricate nexus
between assorted factors and flight ticket prices, thereby arming both passengers and industry
practitioners with actionable insights. Through the judicious deployment of machine learning algorithms
and rigorous analytical scrutiny, we showcase the potential for precise price prognostication and
augmented decision-making prowess in the airline sector.
VI. Conclusion
Summary of Findings
In this study, we employed various machine learning regression techniques to predict flight
ticket prices using a dataset extracted from the "Ease My Trip" website. Our analysis revealed promising
outcomes, with the Random Forest Regressor model emerging as the most accurate predictor,
achieving an impressive R2 score of 0.985. Through feature engineering and ensemble learning
methods, we successfully captured complex nonlinear relationships within the dataset, demonstrating
the efficacy of machine learning algorithms in forecasting flight ticket prices.
Contributions
This research contributes significantly to the field of machine learning regression by showcasing
the practical application of advanced algorithms in the domain of airline ticket price prediction. By
leveraging ensemble techniques like Random Forest, we achieved high predictive accuracy, offering
valuable insights for both passengers and industry stakeholders. Moreover, the study underscores the
importance of feature selection and model complexity in enhancing prediction performance, thus
providing a roadmap for future research endeavors in similar domains.
Limitations
Despite the promising outcomes, this study has several limitations that warrant discussion.
Firstly, the dataset used in this research was extracted from a single platform, limiting the
generalizability of the findings to other booking websites or airlines. Additionally, the dataset may suffer
from biases or inconsistencies inherent to online booking systems, potentially affecting the model's
performance. Moreover, the analysis focused primarily on numerical features, overlooking the potential
impact of categorical variables such as airline reputation or route popularity. Addressing these
limitations and incorporating additional data sources could enhance the robustness and applicability of
future predictive models.
Future Directions
Building upon the findings of this study, several avenues for future research emerge. Firstly,
exploring alternative datasets from diverse booking platforms and airlines could provide a more
comprehensive understanding of flight price dynamics. Additionally, integrating sentiment analysis of
customer reviews or social media data may offer valuable insights into factors influencing ticket prices,
such as demand fluctuations or service quality. Furthermore, investigating advanced ensemble
techniques or deep learning architectures could further improve prediction accuracy and accommodate
the intricacies of the airline industry's pricing mechanisms. Finally, considering the evolving landscape
of travel preferences and economic factors, continuous monitoring and adaptation of predictive models
are essential to ensure their relevance and effectiveness in real-world applications.
In conclusion, this research sheds light on the potential of machine learning regression in
forecasting flight ticket prices, offering actionable insights and paving the way for future
advancements in the field.
Analysing Airline Passenger Satisfaction: A
Comparative Study of Classification Algorithms
Sai Bende1, Dr. A.D. Sawarkar2
1Department of Information Technology, Shri Guru Gobind Singhji Institute of Engineering and Technology
(SGGSIE&T), Nanded.
Abstract:
Airline passenger satisfaction is a crucial aspect of the aviation industry, directly impacting
customer loyalty and business sustainability. This research paper employs classification algorithms to
predict passenger satisfaction based on diverse factors, including demographic information, travel
preferences, and in-flight services. The dataset comprises responses from airline passengers,
encompassing variables such as gender, customer type, age, purpose of travel, class, flight distance,
and satisfaction ratings for various services.
Four classification algorithms, namely Logistic Regression, Random Forest Classifier, Decision
Tree Classifier, and K Neighbors Classifier, were implemented and evaluated for their predictive
performance. Each algorithm was trained on a subset of the dataset and tested on unseen data to
assess its accuracy, precision, recall, and F1 score. Among the models, the Random Forest Classifier
emerged as the most accurate, achieving an impressive accuracy score of 96% and an F1 score of
96%. This ensemble learning technique excelled in capturing complex relationships within the data and
exhibited robust performance in predicting passenger satisfaction levels. The Decision Tree Classifier
closely followed, achieving a commendable accuracy score of 95% and an F1 score of 95%.
These findings underscore the efficacy of machine learning algorithms in discerning patterns
and trends from large-scale datasets in the airline industry. The high predictive accuracy of the Random
Forest Classifier suggests its potential utility in informing strategic decisions aimed at enhancing
passenger satisfaction and optimising operational efficiency for airlines.
This research contributes to the growing body of literature on passenger satisfaction in aviation,
providing valuable insights into the factors influencing passenger experiences and preferences. Future
studies could explore additional features and incorporate advanced modelling techniques to further
refine predictive models and deepen our understanding of passenger behaviour in the airline domain.
I. Introduction
In the dynamic landscape of the aviation industry, ensuring passenger satisfaction stands as a
paramount objective for airlines worldwide. With the burgeoning volume of travellers and the intensifying
competition among carriers, understanding the intricate dynamics of passenger preferences and
experiences has become indispensable for fostering customer loyalty and sustaining profitability. In this
context, the integration of machine learning techniques, particularly classification algorithms, holds
immense promise in unravelling the underlying patterns and drivers of airline passenger satisfaction.
Machine learning, a subfield of artificial intelligence, empowers computers to learn from data
and make informed predictions or decisions without explicit programming. Within the realm of predictive
modelling, machine learning regression techniques play a pivotal role in extracting meaningful insights
from complex datasets and forecasting outcomes with remarkable accuracy. By leveraging historical
data and identifying intricate patterns, regression algorithms enable analysts to discern underlying
relationships between input variables and predict continuous outcomes, thereby facilitating informed
decision-making across diverse domains.
Problem Statement: The present research endeavours to address the quintessential question plaguing
the aviation industry: what factors contribute to airline passenger satisfaction, and how can airlines
leverage this knowledge to enhance customer experiences? Despite the plethora of factors influencing
passenger satisfaction, ranging from in-flight amenities to booking convenience, discerning the most
salient predictors amidst the noise remains a formidable challenge. Traditional analytical approaches
often fall short in capturing the nuanced interplay of these variables, necessitating a more sophisticated
and data-driven methodology to uncover actionable insights.
The overarching objective of this research is to harness the power of classification algorithms
to predict airline passenger satisfaction accurately. Specifically, the study aims to implement and
evaluate four classification algorithms (Logistic Regression, Random Forest Classifier, Decision Tree
Classifier, and K Neighbors Classifier), compare their predictive performance, and identify the most
accurate model for predicting passenger satisfaction.
This paper is organised into several sections, each contributing to a holistic understanding of
the research topic and its implications for the aviation industry. Following this introduction, the next
section provides a comprehensive review of relevant literature, elucidating prior studies and theoretical
frameworks pertinent to airline passenger satisfaction and predictive modelling. Subsequently, the
methodology section delineates the data collection process, feature engineering techniques,
and model
implementation strategies employed in this research. The results section presents a detailed analysis
of the predictive models' performance, including accuracy metrics, confusion matrices, and comparative
assessments. Finally, the discussion
and conclusion sections synthesise the findings, delineate
practical implications, and propose avenues for future research in this domain.
III. Methodology
This research paper aims to analyse airline passenger satisfaction using a comparative study
of classification algorithms. The dataset used in this study is collected from Kaggle and contains various
features related to airline services and passenger feedback. The objective is to build machine learning
models to predict passenger satisfaction based on different features. The methodology involves data
collection, preprocessing, model selection, and evaluation using appropriate metrics.
1. Data Collection:
The dataset used in this study is sourced from Kaggle, specifically the "Airline Passenger
Satisfaction" dataset. This dataset
contains information about airline passengers' demographics, travel
type, class, flight distance, and various aspects of service satisfaction, such as inflight WiFi, cleanliness,
and departure delay. The dataset consists of two main files: train.csv and test.csv. The training dataset
(train.csv) contains 103,904 entries, while the test dataset (test.csv) contains 25,976 entries. Each entry
consists of 25 attributes, including both numerical and categorical variables.
2. Data Preprocessing:
Data preprocessing is a crucial step to ensure the quality and relevance of data for model
building. The following preprocessing steps were performed:
The dataset contained missing values, particularly in the "Arrival Delay in Minutes" column.
These missing values were replaced with zeros, as it was assumed that missing values in this column
indicate no delay.
Two columns, namely "Unnamed: 0" and "id," were identified as unique identifiers and were
dropped from the dataset as they do not contribute to the analysis.
Data Type Conversion:
The data types of certain columns were converted to categorical variables to facilitate analysis.
Columns such as "Gender," "Customer Type," "Type of Travel," and "Class" were converted to
categorical variables.
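A minimal sketch of these preprocessing steps, using the column names quoted above (the file name follows the Kaggle dataset's convention; this is illustrative, not the authors' code):

import pandas as pd

train = pd.read_csv("train.csv")

# Missing arrival delays are assumed to indicate no delay
train["Arrival Delay in Minutes"] = train["Arrival Delay in Minutes"].fillna(0)

# Drop identifier columns that carry no predictive signal
train = train.drop(columns=["Unnamed: 0", "id"])

# Convert nominal columns to the categorical dtype
for col in ["Gender", "Customer Type", "Type of Travel", "Class"]:
    train[col] = train[col].astype("category")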
3. Model Selection:
In this study, various classification algorithms were considered for predicting passenger
satisfaction based on the provided features. The selection criteria for choosing the models include
performance metrics such as accuracy, precision, recall, and F1-score. Some of the classification
algorithms considered for model selection include:
● Logistic Regression
● Decision Trees
● Random Forest
The choice of models will be justified based on their performance on the evaluation metrics.
4. Evaluation Metrics:
The performance of the classification models will be evaluated using appropriate metrics to assess their
effectiveness in predicting passenger satisfaction. The following evaluation metrics will be utilized:
Accuracy: The overall accuracy of the model in predicting passenger satisfaction.
Accuracy = (TP + TN) / (TP + TN + FP + FN)
Precision: The proportion of true positive predictions out of all positive predictions made by the model.
Precision = TruePositives / (TruePositives + FalsePositives)
Recall: The proportion of true positive predictions out of all actual positive instances in the dataset.
Recall = True Positives / (True Positives + False Negatives)
F1-score: The harmonic mean of precision and recall, providing a balanced measure of the model's
performance.
F1 = 2 * (precision * recall) / (precision + recall)
By evaluating the models using these metrics, we aim to identify the most effective algorithm
for predicting passenger satisfaction based on the provided dataset.
This research paper presents a comprehensive methodology for analyzing airline passenger
satisfaction using machine learning classification algorithms. By leveraging the provided dataset and
following the outlined methodology, we aim to gain insights into factors influencing passenger
satisfaction and build predictive models to improve airline services and customer experience.
IV. Experimental Setup
In this section, we outline the experimental setup for our research paper aimed at predicting customer
satisfaction in air travel using machine learning techniques. We describe the dataset used, the
implementation of the classification models, hyperparameter tuning, and cross-validation methodology.
The dataset used in this research consists of customer feedback and attributes related to air
travel experiences. It includes information such as gender, age, type of travel, flight distance, various
service ratings, departure delay, arrival delay, and customer satisfaction. The dataset was
preprocessed to handle missing values, encode categorical variables, and scale numerical features.
Model Implementation
For implementing the classification models, we utilized the Python programming language along with
several software libraries including NumPy, Pandas, Matplotlib, Seaborn, and scikit-learn. These
libraries provide efficient tools for data manipulation, visualization, model building, and evaluation.
● Pandas: For data manipulation and preprocessing.
● NumPy: For numerical computations and transformations.
● Scikit-learn: For implementing machine learning algorithms, including the classification
models.
● Matplotlib and Seaborn: For data visualization.
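A condensed sketch of the comparative workflow built on these libraries follows; it assumes the features have already been encoded numerically into X_train/X_test with labels y_train/y_test, and that the positive class is labelled "satisfied" (an assumption about the dataset's label values):

from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score

models = {
    "Logistic Regression": LogisticRegression(max_iter=1000),
    "Random Forest": RandomForestClassifier(random_state=42),
    "Decision Tree": DecisionTreeClassifier(random_state=42),
    "K Neighbors": KNeighborsClassifier(),
}
for name, model in models.items():
    model.fit(X_train, y_train)
    pred = model.predict(X_test)
    print(name,
          "accuracy:", accuracy_score(y_test, pred),
          "precision:", precision_score(y_test, pred, pos_label="satisfied"),
          "recall:", recall_score(y_test, pred, pos_label="satisfied"),
          "F1:", f1_score(y_test, pred, pos_label="satisfied"))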
Hyperparameter Tuning:
Hyperparameter tuning is a crucial step in optimizing the performance of machine learning
models. In this research, hyperparameter tuning was conducted to fine-tune the parameters of the
selected classification models for optimal performance. The process of hyperparameter tuning involved
the following steps:
Hyperparameter tuning can be performed explicitly for each model as follows:
● Logistic Regression:
○ Hyperparameters: The logistic regression model may have hyperparameters like C
(inverse of regularization strength), penalty (type of regularization), solver (optimization
algorithm), etc.
○ Grid Search: Hyperparameter tuning can be performed using techniques like Grid
Search or Random Search. Grid Search exhaustively searches through a manually
specified subset of the hyperparameter space.
○ For example, a grid of hyperparameters to search through can be defined, such as
different values of C, penalty, and solver, with cross-validation then used to evaluate
each combination of hyperparameters and select the best performing one.
● Random Forest Classifier:
○ Hyperparameters: Hyperparameters for a random forest classifier include n_estimators
(number of trees in the forest), max_depth (maximum depth of the trees),
min_samples_split, min_samples_leaf, etc.
○ Random Search: Similar to Grid Search, Random Search can be performed; it
randomly samples from the hyperparameter space.
○ For example, ranges for n_estimators, max_depth, and other
hyperparameters can be defined and randomly sampled to find the combination
that yields the best performance.
● Decision Tree Classifier:
○ Hyperparameters: Hyperparameters for decision trees are similar to those of random
forests, including max_depth, min_samples_split, min_samples_leaf, etc.
○ Again, Grid Search or Random Search can be used to tune these hyperparameters.
● K Neighbors Classifier:
○ Hyperparameters: The K Neighbors Classifier has hyperparameters like n_neighbors
(number of neighbors), weights (weight function used in prediction), metric (distance
metric), etc.
○ As with the other models, Grid Search or Random Search can be used to tune these
hyperparameters.
The hyperparameter tuning process aimed to identify the optimal hyperparameters for each classification
model, thereby improving their predictive performance on the dataset.
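As an illustration, a randomized search over an assumed parameter space for the Random Forest Classifier might look as follows (grids and fold counts are illustrative):

from sklearn.model_selection import RandomizedSearchCV
from sklearn.ensemble import RandomForestClassifier

param_dist = {
    "n_estimators": [100, 200, 500],
    "max_depth": [None, 10, 20, 40],
    "min_samples_split": [2, 5, 10],
    "min_samples_leaf": [1, 2, 4],
}
search = RandomizedSearchCV(RandomForestClassifier(random_state=42),
                            param_dist, n_iter=20, cv=5,
                            scoring="accuracy", random_state=42)
search.fit(X_train, y_train)
print(search.best_params_, search.best_score_)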
Cross-Validation
Cross-validation is a robust technique used to assess the generalization performance of
machine learning models. In this research, cross-validation was employed to evaluate the performance
of the classification models and ensure their robustness. The methodology used for cross-validation
involved the following steps:
The aim was to ensure the robustness of the results by evaluating the model's performance on
different subsets of the data, since cross-validation is a standard technique in machine learning for
assessing how well a model generalizes to unseen data. Cross-validation is typically conducted as follows:
● K-Fold Cross-Validation:
○ The dataset is divided into K subsets or folds of approximately equal size.
○ The model is trained K times, each time using K-1 folds for training and the remaining
fold for validation.
This process results in K trained models and K validation scores.
● Performance Metric:
○ A performance metric is chosen
to evaluate the model's performance on each fold.
Common metrics include accuracy, precision, recall, F1 score, or area under the ROC
curve (AUC).
○ The choice of performance metric depends on the nature of the problem being
addressed in your research.
● Aggregation of Results:
○ The performance scores obtained from each fold are typically aggregated to compute
a single performance estimate.
○ Common aggregation methods include taking the mean or median of the scores across
folds.
● Evaluation on Test Set:
○ After cross-validation is complete and the final model is selected based on the
aggregated performance scores, it is evaluated on a separate test set that was not
used during training or cross-validation.
○ This step ensures that the model's performance estimates from cross-validation
generalize well to truly unseen data.
● Repeated Cross-Validation:
○ To further ensure robustness, repeated cross-validation can be performed. In repeated
cross-validation, the entire cross-validation process (including data splitting and model
training) is repeated multiple times with different random splits of the data.
○ The results from repeated cross-validation provide a more stable estimate of the
model's performance.
By using cross-validation, particularly with K-fold cross-validation and possibly repeated
iterations, this research ensures that the model's performance is robust and not heavily influenced by
the particular subset of data used for training and validation. This methodology helps to provide more
reliable estimates of the model's generalization performance.
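A brief sketch of repeated, stratified k-fold cross-validation as outlined above (fold and repeat counts are illustrative):

from sklearn.model_selection import RepeatedStratifiedKFold, cross_val_score
from sklearn.ensemble import RandomForestClassifier

cv = RepeatedStratifiedKFold(n_splits=5, n_repeats=3, random_state=42)
scores = cross_val_score(RandomForestClassifier(random_state=42), X, y,
                         cv=cv, scoring="accuracy")
print("Mean accuracy:", scores.mean(), "Std:", scores.std())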
Fig. Confusion Matrix for the Random Forest Classifier
3. Decision Tree Classifier:
● Accuracy: The Decision Tree Classifier achieved an accuracy of 95%, indicating a high level
of correctness in its predictions.
● Precision: With a precision score of 95%, the model exhibited a high level of correctness in
identifying satisfied and dissatisfied customers.
● Recall: The recall score of 95% signifies the model's ability to capture a high proportion of
actual positive instances.
● F1-score:
The F1-score of 95% suggests a balanced performance between precision and recall,
similar to the Logistic Regression model.
VI. Conclusion
This research has delved into the realm of airline customer satisfaction analysis through the lens of
machine learning classification techniques. By scrutinizing various factors influencing passenger contentment, we've
unearthed invaluable insights that can reshape airline operations and elevate the passenger experience.
Our findings unveil the pivotal role of factors such as inflight wifi service, cleanliness, departure delay,
and flight distance in shaping passenger satisfaction levels. Through meticulous analysis and model predictions,
we've identified these attributes as key determinants that airlines must prioritize to meet passenger expectations
effectively.
Moreover, our comparative analysis of different classification models, including logistic regression, random
forest classifiers, decision trees, and K-nearest neighbors, demonstrates the diverse approaches available for
predicting customer satisfaction. Each model offers unique advantages and insights, allowing airlines to tailor
their strategies based on specific operational requirements and data characteristics.
Contributions:
This study makes significant contributions to the field of machine learning classification, particularly in the context
of airline customer satisfaction analysis. By leveraging advanced analytical techniques, we've not only provided
airlines with actionable insights into passenger preferences but also showcased the efficacy of machine learning
in solving real-world business challenges.
Furthermore, our methodological rigor and transparent approach to data analysis set a precedent for future research
endeavors in this domain. By emphasizing the importance of cross-validation, hyperparameter optimization, and
model comparison, we've established best practices that can guide researchers and practitioners alike in
developing robust predictive models.
Limitations:
Despite the rigor and comprehensiveness of our study, it's essential to acknowledge certain limitations. Firstly,
the generalizability of our findings may be constrained by the specific dataset and variables analyzed. Future
research could expand the scope to include a broader range of airlines, regions, and passenger demographics to
ensure broader applicability.
Additionally, the predictive accuracy of our models, while commendable, may be further refined through the
inclusion of additional features or the exploration of more sophisticated algorithms. Moreover, the inherent
subjectivity of customer satisfaction metrics poses challenges in quantifying and interpreting passenger sentiments
accurately.
Future Directions:
Building on the insights gleaned from this study, future research endeavors could explore several promising
avenues. Firstly, longitudinal studies tracking changes in passenger preferences over time could provide deeper
insights into evolving trends and behaviors in the airline industry. Additionally, incorporating sentiment analysis
techniques to analyze unstructured data sources such as customer reviews and social media posts could enrich our
understanding of passenger sentiments.
Furthermore, investigating the impact of external
factors such as economic conditions, geopolitical events, and
public health crises on passenger satisfaction could yield valuable insights for airlines navigating turbulent market
environments.
Lastly, exploring innovative approaches such as deep learning and natural language processing
could unlock new possibilities for enhancing predictive accuracy and extracting actionable insights from complex
datasets.
In conclusion, this research represents a significant step forward in understanding and predicting airline
customer satisfaction. By leveraging machine learning regression techniques and conducting comprehensive
analyses, we've shed light on the factors driving passenger contentment and provided airlines with a roadmap for
delivering exceptional customer experiences in an increasingly competitive landscape.
Unveiling Workforce Dynamics: A k-Means
Clustering Approach to Understanding Salaried
Employees in Hyderabad
Sai Bende1, Dr. A.D. Sawarkar2
1Department of Information Technology, Shri Guru Gobind Singhji Institute of Engineering and Technology
(SGGSIE&T), Nanded.
Abstract:
This research paper investigates the clustering of salaried employees in Hyderabad, employing
the k-means algorithm on the "Hyderabad Salaried Employees Dataset [Clustering]". The study
addresses the imperative need to comprehend the diverse workforce dynamics prevalent in the region,
aiming to unravel underlying patterns and trends among employees based on key attributes such as
designation, experience, qualification, and salary.
The findings reveal distinct clusters within the salaried employee population, shedding light on
nuanced characteristics and groupings based on educational background, experience levels, and salary
brackets. Notably, the clustering analysis unveils insights into the distribution of employees across
different sectors and the prevalence of certain job roles within specific salary ranges. These findings
hold significant implications for talent management, organizational strategy, and HR policy formulation.
The research contributes to a deeper understanding of the workforce landscape in Hyderabad and
underscores the value of data-driven approaches in unraveling complex societal phenomena.
I. Introduction
In the realm of modern workforce analytics, understanding the intricate dynamics of salaried
employees has become paramount for businesses, policymakers, and researchers alike. Leveraging
the power of machine learning techniques, particularly k-means clustering,
holds promise in unravelling
the complex patterns within the workforce landscape. As such, this research endeavors to shed light
on the workforce dynamics of salaried employees in Hyderabad, India, employing a data-driven
approach. Machine learning, a subset of artificial intelligence, has emerged as a transformative tool in
uncovering hidden insights from vast datasets. Regression, a fundamental technique in predictive
modeling, plays a pivotal
role in understanding the relationships between variables and making
informed predictions. However,
in the context of workforce dynamics, traditional regression techniques
may fall short in capturing the multifaceted nature of the workforce. Hence, this paper turns to k-means
clustering, an unsupervised learning algorithm, to segment and analyze the diverse pool of salaried
professionals in Hyderabad.
Following this introduction, the paper will proceed with a literature review, providing an overview
of existing research on workforce dynamics and clustering techniques. Subsequently, the methodology
section will outline the data collection process, preprocessing steps, and the application of k-means
clustering. The results section will present the findings of the clustering analysis, followed by a
discussion of the implications and significance of the results. Finally, the paper will conclude with a
summary of key insights, limitations, and avenues for future research.
Despite the advancements in machine learning regression and clustering techniques, there are
notable gaps in existing literature that the current research aims to address. Firstly, while regression
analysis provides valuable insights into the factors influencing workforce dynamics, it often overlooks
the complex interactions and non-linear relationships present within the data. This limitation calls for
the exploration of more advanced regression techniques, such as polynomial regression and neural
networks, to capture the intricacies of workforce behavior.
Secondly, while clustering algorithms like k-means have been utilized to segment employee
populations, there is a dearth of research focusing specifically on salaried employees in urban contexts
like Hyderabad. Existing studies often generalize findings across diverse geographical regions and
employment sectors, overlooking the unique characteristics and challenges faced by salaried
professionals in specific urban settings. By focusing on Hyderabad, this research aims to fill this gap by
providing tailored insights into the workforce dynamics of salaried employees in a rapidly growing urban
center.
Furthermore, while case studies have explored the application of machine learning regression and
clustering in workforce analytics, there is a lack of comprehensive studies integrating these techniques
to gain holistic insights into workforce dynamics. This research seeks to bridge this gap by employing
a k-means clustering approach alongside regression analysis to uncover nuanced patterns and profiles
within the salaried workforce of Hyderabad, thereby contributing to a deeper understanding of urban
labor dynamics.
III. Methodology
Data Collection:
The dataset used in this study, titled "Hyderabad Salaried Employees Dataset," was obtained
from an undisclosed source. It comprises various attributes related to salaried employees working in
Hyderabad, India. The dataset contains information such as candidate name, company name,
designation, years of experience, current location, qualifications, and salary details. With over 28,000
entries and nine columns, the dataset provides a comprehensive view of the workforce dynamics in
Hyderabad's salaried sector.
Data Preprocessing:
Before performing any analysis, several preprocessing steps were undertaken to ensure data
quality and suitability for the study. Firstly, missing values were handled by dropping rows with missing
values in critical columns such as candidate name, company name, designation, and category.
Subsequently, irrelevant columns such as candidate name, category, and location were removed to
focus solely on pertinent features.
Fig. Visualizing the missing-value pattern
Moreover, data cleaning involved converting salary values from string format ("Rs. X lacs") to
numerical format and transforming experience duration from years and months to total months.
Furthermore, label encoding was applied to categorical variables to convert them into a numeric format
suitable for modeling. Lastly, the data was normalized to standardize the scale of numeric features,
ensuring that no single feature dominated the analysis due to its magnitude.
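A minimal sketch of these cleaning steps; the column names, the "Rs. X lacs" pattern, and the experience format are assumptions based on the description above, not the authors' code:

import pandas as pd
from sklearn.preprocessing import LabelEncoder, MinMaxScaler

df = pd.read_csv("hyderabad_salaried_employees.csv")   # hypothetical file name

df = df.dropna(subset=["Candidate Name", "Company Name", "Designation", "Category"])
df = df.drop(columns=["Candidate Name", "Category", "Location"])

# "Rs. 4.5 lacs" -> 4.5 (numeric, in lakhs)
df["Salary"] = df["Salary"].str.extract(r"Rs\.\s*([\d.]+)")[0].astype(float)

# e.g. "2 yrs 3 mon" -> 27 total months (pattern assumed)
yrs = df["Experience"].str.extract(r"(\d+)\s*yr")[0].astype(float).fillna(0)
mon = df["Experience"].str.extract(r"(\d+)\s*mon")[0].astype(float).fillna(0)
df["experience_months"] = yrs * 12 + mon
df = df.drop(columns=["Experience"])

# Label-encode the remaining categorical columns
for col in df.select_dtypes(include="object").columns:
    df[col] = LabelEncoder().fit_transform(df[col].astype(str))

# Normalize all features to a common [0, 1] scale
df[df.columns] = MinMaxScaler().fit_transform(df)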
Model Selection:
The choice of model for this study primarily revolved around the objective of understanding the
workforce dynamics through clustering analysis. Given the unsupervised nature of the problem, where
the aim was to group employees based on similarities in their attributes, the k-Means clustering
algorithm was deemed suitable. k-Means is a widely-used clustering algorithm known for its simplicity
and efficiency in partitioning data into clusters. It works by iteratively assigning data points to the nearest
cluster centroid and updating the centroids based on the mean of the assigned points. The algorithm's
simplicity makes it suitable for large datasets like the one used in this study, allowing for the identification
of natural groupings within the data without the need for labeled training examples. Additionally, the
choice of the number of clusters (k) was informed by statistical metrics such as the elbow method,
silhouette score, Calinski Harabasz score, and Davies Bouldin score, ensuring the optimal number of
clusters was selected for meaningful analysis.
Evaluation Metrics:
As the focus of this study is on clustering rather than regression, traditional regression evaluation
metrics such as mean squared error (MSE) or R-squared are not applicable. Instead, the evaluation of
the clustering algorithm's performance relies on metrics specific to clustering tasks. One such metric is
the silhouette score, which measures the cohesion and separation of clusters. A higher silhouette score
indicates better-defined clusters, with values closer to 1 indicating dense, well-separated clusters.
Additionally, the Calinski Harabasz score, also known as the variance ratio criterion, measures the ratio
of between-cluster dispersion to within-cluster dispersion. Higher Calinski Harabasz scores indicate
better-defined clusters with greater separation between clusters. Lastly, the Davies Bouldin score
measures the average similarity between each cluster and its most similar cluster, with lower scores
indicating better clustering. These metrics collectively provide a comprehensive evaluation of the
clustering algorithm's performance, helping assess the quality and interpretability of the resulting
clusters in understanding the salaried workforce dynamics in Hyderabad.
By adhering to this methodology, the study aims to uncover valuable insights into the workforce
dynamics of salaried employees in Hyderabad, leveraging the k-Means clustering approach and
rigorous evaluation metrics to ensure robust and meaningful analysis.
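As a sketch, fitting k-Means and computing the three validity metrics with scikit-learn (k = 5 follows the results reported later; X denotes the preprocessed feature matrix):

from sklearn.cluster import KMeans
from sklearn.metrics import (silhouette_score, calinski_harabasz_score,
                             davies_bouldin_score)

kmeans = KMeans(n_clusters=5, n_init=10, random_state=42)
labels = kmeans.fit_predict(X)

print("Silhouette:        ", silhouette_score(X, labels))
print("Calinski-Harabasz: ", calinski_harabasz_score(X, labels))
print("Davies-Bouldin:    ", davies_bouldin_score(X, labels))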
IV. Experimental Setup
Model Implementation:
The clustering pipeline and supporting analysis were implemented using the Python
programming language along with several libraries and frameworks. Specifically, the following libraries
were utilized:
1. Pandas: For data manipulation and preprocessing.
2. NumPy: For numerical computing and array operations.
3. Scikit-learn: A machine learning library that provides various regression algorithms, clustering
algorithms, and evaluation metrics.
4. Matplotlib and Seaborn: For data visualization and generating plots.
5. Missingno: For visualizing missing values in the dataset.
6. Jupyter Notebook: For interactive development and documentation of the code.
These libraries offer comprehensive functionalities for data handling, model implementation,
and evaluation, making them suitable for the research's computational requirements. The Python
programming language provides flexibility and ease of use, while the Scikit-learn library offers a wide
range of regression and clustering algorithms for experimentation.
Hyperparameter Tuning:
Hyperparameter tuning is a crucial step in optimizing the performance of machine learning
models. In this research, hyperparameters for the k-Means clustering algorithm were tuned to ensure
optimal clustering results. The process involved exploring different values for the number of clusters (k)
and selecting the value that maximized clustering quality metrics such as silhouette score, Calinski
Harabasz score, and Davies Bouldin score.
Elbow Method: This method involved plotting the sum of
squared distances (inertia) against the number of clusters
(k). The "elbow" point, where the inertia starts decreasing
at a slower rate, indicates the optimal number of clusters.
Fig. Silhouette Score vs. Values of k
Silhouette Score: Silhouette analysis was conducted to evaluate the compactness and separation of
clusters for different values of k. The silhouette score measures how similar an object is to its own
cluster compared to other clusters. A higher silhouette score indicates better-defined clusters.
Fig. Calinski Harabasz Score
Calinski Harabasz Score: The Calinski Harabasz score, also known as the Variance Ratio Criterion,
measures the ratio of between-cluster dispersion to within-cluster dispersion. Higher scores indicate
denser and more well-separated clusters.
Davies Bouldin Score: The Davies Bouldin score evaluates the average "similarity" between each
cluster and its most similar cluster. Lower scores indicate better clustering.
The tuning process was conducted iteratively, experimenting with different values of k and evaluating
the clustering performance using appropriate metrics. The final choice of hyperparameters was based
on achieving the best clustering results that effectively captured the underlying patterns in the data.
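A sketch of this iterative selection: scan candidate values of k, recording inertia for the elbow method alongside the silhouette score (the range of k is illustrative):

from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

for k in range(2, 11):
    km = KMeans(n_clusters=k, n_init=10, random_state=42).fit(X)
    print(k, "inertia:", km.inertia_,
          "silhouette:", silhouette_score(X, km.labels_))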
Cross-Validation:
Cross-validation is essential for assessing the generalization performance of machine learning models
and ensuring robustness of the results. In this research,
a methodology known as k-fold cross-validation
was employed. This technique involves splitting the dataset into k subsets (or folds), training the model
on k-1 subsets, and evaluating it on the remaining subset. This process is repeated k times, with each
subset serving as the test set exactly once. The average performance across all folds provides a more
reliable estimate of the model's performance compared to a single train-test split.
By using k-fold cross-validation, the research aims to mitigate the risk of overfitting and obtain more
accurate and stable estimates of model performance.
Overall, the experimental setup leverages the Python programming language and various libraries to
implement regression models and clustering algorithms, tune hyperparameters, and perform cross-
validation. This setup ensures the research is conducted rigorously and produces reliable results for
understanding the workforce dynamics of salaried employees in Hyderabad.
V. Results and Discussion
Presentation of Results:
The k-Means clustering algorithm was applied to the dataset consisting of salaried employees
in Hyderabad. After preprocessing the data and determining the optimal number of clusters using the
Elbow Method and statistical metrics such as Silhouette score, Calinski Harabasz Score, and Davies
Bouldin
Score, the dataset was clustered into five distinct groups. The clustering was performed based
on features derived from Principal Component Analysis (PCA) to reduce dimensionality and visualize
the clusters effectively. The results of the clustering algorithm were then visualized using a scatter plot,
where each point represented a data point (employee) and was color-coded according to the cluster it
belonged to.
Interpretation of Findings:
The clustering analysis revealed distinct groups among the salaried employees in Hyderabad based on
their attributes such as company name, designation, experience, qualifications, and salary. Each cluster
represents a group of employees who share similar characteristics within the dataset. For example,
Cluster 0 may consist of employees with mid-level experience and qualifications, working in a variety of
companies with moderate salary ranges. On the other hand, Cluster 1 might include senior-level
employees with extensive experience and higher qualifications, earning relatively higher salaries.
This segmentation of employees can provide valuable
insights for businesses and policymakers in
various ways. For instance, companies can tailor their human resource management strategies
according to the characteristics of each cluster, such as recruitment, training, and retention policies.
Additionally, policymakers can use this information to assess the overall employment landscape in
Hyderabad, identify areas of skill shortages or surpluses, and devise appropriate interventions to
address them.
Visualization:
Fig. Scatter plot of the clusters in PCA space, color-coded by cluster assignment.
VI. Conclusion
Summary of Findings:
In this research, we applied k-Means clustering to understand the dynamics of salaried
employees in Hyderabad using a dataset containing information such as company name, designation,
experience, qualifications, and salary. Through extensive data preprocessing, including cleaning,
transformation, and reduction, we prepared the dataset for clustering analysis. Utilizing techniques such
as Principal Component Analysis (PCA) for dimensionality reduction and statistical metrics for
determining the optimal number of clusters, we successfully segmented the employees into five distinct
groups.
The clustering analysis revealed meaningful insights into the workforce dynamics in Hyderabad,
highlighting the diversity and complexity of the salaried workforce in the region. Each cluster
represented a unique profile of employees with similar characteristics, allowing for a granular
understanding of the employment landscape.
Contributions:
This study makes several contributions to the fields of machine learning and
workforce analytics. Firstly, it demonstrates the applicability of clustering techniques, specifically k-
Means, in understanding salaried employees' characteristics and behaviors. By segmenting the
workforce into distinct groups, businesses and policymakers can tailor their strategies and interventions
to better meet the needs of different employee cohorts.
Secondly, the research showcases the importance of data preprocessing and dimensionality
reduction in preparing datasets for clustering analysis. Through techniques such as normalization, label
encoding, and PCA, we effectively managed and transformed the data, enhancing the clustering
algorithm's performance and interpretability.
Lastly, the visualization of clustering results provides a clear and intuitive representation of the
workforce segmentation, enabling stakeholders to grasp complex patterns and relationships within the
data easily. This visual approach enhances decision-making and facilitates communication of findings
to a wider audience.
Limitations:
Despite its contributions, this study has several limitations that warrant consideration. Firstly,
the analysis relies on the quality and representativeness of the dataset, which may be limited by factors
such as data completeness and sampling bias. Additionally, the clustering algorithm's effectiveness is
contingent upon the choice of features and parameters, which may influence the results' robustness.
Furthermore, the study's scope is limited to salaried employees in Hyderabad, which may not
generalize to other geographical regions or employment sectors. Future research should explore
broader datasets and consider additional variables to capture the full complexity of workforce dynamics.
Future Directions:
Building on the findings of this study, several avenues for future research emerge. Firstly,
investigating the longitudinal trends in workforce dynamics can provide insights into how employee
profiles evolve over time and adapt to changing economic and social conditions. Longitudinal studies
can also shed light on the effectiveness of interventions and policies aimed at improving workforce
outcomes.
Moreover, exploring advanced clustering techniques, such as hierarchical clustering or density-
based clustering, can offer alternative perspectives on employee segmentation and uncover latent
patterns within the data. Additionally, integrating external data sources, such as socioeconomic
indicators or labor market data, can enrich the analysis and provide a more comprehensive
understanding of workforce dynamics.