1. Introduction
In the service industry, every company, brand, or organization places the highest
importance on its customers' attitude towards it. Satisfaction can be defined as
achieving what you need or want, so the mission of service workers is to please their
customers. Vietnam Airlines is no exception and has set its own very ambitious goal:
"VIAGS endeavors to become one of the best ground services companies in Asia".
We use a dataset from Kaggle to make predictions based on the provided features. The
dataset is collected from passenger satisfaction surveys and is divided
into 24 attribute columns: ID, Age, Gender, Class, Customer Type, etc. Here we use
classification to predict the likelihood that a flyer reaches a high level of satisfaction.
The prediction ranges from 0% to 100%. If the predicted likelihood is greater than 80%,
the customer is likely satisfied with the airline's flight service. Otherwise, there is a
high possibility that the customer is not satisfied with the flight.
The input is taken from customer information data, including ID, age, gender, class, etc.
The output obtained from this data is the level of customer satisfaction; specifically,
the target attribute here is satisfaction.
This is a classification model. The predicted value falls into one of two main classes:
"neutral or dissatisfied" and "satisfied".
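The 80% decision rule described above can be sketched in a few lines of Python. This is only an illustration: the function name `label_from_probability` and the default `threshold` parameter are hypothetical, and the sketch assumes the model outputs a probability between 0 and 1 rather than a percentage.

```python
SATISFIED = "satisfied"
NOT_SATISFIED = "neutral or dissatisfied"


def label_from_probability(probability, threshold=0.80):
    """Map a predicted satisfaction probability (0.0 to 1.0) to a class label.

    The function name and the threshold parameter are illustrative assumptions.
    """
    if not 0.0 <= probability <= 1.0:
        raise ValueError("probability must be between 0 and 1")
    # Only values strictly greater than the threshold count as satisfied,
    # following the "greater than 80%" wording of the text.
    return SATISFIED if probability > threshold else NOT_SATISFIED


for p in (0.95, 0.80, 0.42):
    print(p, "->", label_from_probability(p))
```

Note that a prediction of exactly 80% falls into the "neutral or dissatisfied" class under this reading of the rule.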
2. Experience (Dataset Description section)
To achieve the standard output, we follow these steps:
First, understand the problem and the data.
Next, define and preprocess the data.
Third, select the training data fields, normalize the data, and split it into training and test sets.
Then, train the model using classification models.
Finally, evaluate the model.
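As an illustration of the split step in the list above, the following sketch shuffles the samples and holds out a fraction of them for testing. The helper name `split_train_test`, the 80/20 ratio, and the fixed seed are assumptions made for the example; the source does not state which ratio or tool was used.

```python
import random


def split_train_test(rows, test_ratio=0.2, seed=42):
    """Shuffle the rows and return (train, test) lists.

    Hypothetical helper; the ratio and seed are illustrative defaults.
    """
    if not 0.0 < test_ratio < 1.0:
        raise ValueError("test_ratio must be strictly between 0 and 1")
    shuffled = list(rows)                  # copy so the caller's data is untouched
    random.Random(seed).shuffle(shuffled)  # fixed seed gives a reproducible split
    n_test = int(round(len(shuffled) * test_ratio))
    return shuffled[n_test:], shuffled[:n_test]


samples = list(range(10))  # stand-in for ten passenger records
train, test = split_train_test(samples, test_ratio=0.2)
print(len(train), "training rows,", len(test), "test rows")
```

Shuffling before splitting avoids a test set that only contains the last rows of the file, which matters if the survey records are stored in some meaningful order.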
=> To better understand the dataset, we find its shape by counting the total number of
rows and columns, and we inspect the data type of each column and the memory
requirements. We also check for missing values in the data.
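The inspection described above (shape, column types, missing values) can be sketched with the Python standard library alone. The three-row sample below is made up for the example and only borrows column names mentioned in the text; `infer_type` and `describe` are hypothetical helpers, and the memory check is left out.

```python
import csv
import io

# Tiny made-up sample; only the column names come from the text.
SAMPLE_CSV = """ID,Age,Gender,Class,Arrival Delay in Minutes
1,25,Male,Eco,18
2,31,Female,Business,
3,,Female,Eco,0
"""


def infer_type(values):
    """Return 'int', 'float' or 'str' for a column, ignoring empty cells."""
    present = [v for v in values if v != ""]
    if not present:
        return "unknown"
    for name, cast in (("int", int), ("float", float)):
        try:
            for v in present:
                cast(v)
            return name
        except ValueError:
            continue
    return "str"


def describe(csv_text):
    """Summarise the shape, per-column type and per-column missing count."""
    rows = list(csv.DictReader(io.StringIO(csv_text)))
    columns = list(rows[0].keys()) if rows else []
    summary = {"shape": (len(rows), len(columns)), "columns": {}}
    for col in columns:
        values = [row[col] for row in rows]
        summary["columns"][col] = {
            "type": infer_type(values),
            "missing": sum(1 for v in values if v == ""),
        }
    return summary


report = describe(SAMPLE_CSV)
print("shape:", report["shape"])
for col, info in report["columns"].items():
    print(col, info)
```

If pandas is used instead (the source does not say which tool was used), the same checks are typically done with `DataFrame.shape`, `DataFrame.info()` and `DataFrame.isnull().sum()`.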
The numeric attributes are ID, Age, Flight Distance, Departure Delay in Minutes, and
Arrival Delay in Minutes, while the categorical attributes are Gender, Customer Type,
Type of Travel, and Class. The dataset contains no unstructured data.
If the numeric attributes share the same range, a value that falls outside that range can be treated as an outlier.
If the numeric attributes have very different ranges, we need to normalize them. A
basic way to normalize is to rescale each attribute to [0,1].
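A minimal sketch of rescaling to [0,1], assuming the usual min-max formula x' = (x - min) / (max - min) is what is meant. The function name `min_max_scale` and the sample values are made up for the example.

```python
def min_max_scale(values):
    """Rescale numbers linearly so the smallest maps to 0 and the largest to 1."""
    lo, hi = min(values), max(values)
    if hi == lo:
        # A constant column has no spread; map every value to 0.0.
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


ages = [20, 35, 50]            # made-up ages
distances = [200, 1200, 4200]  # made-up flight distances

scaled_ages = min_max_scale(ages)
scaled_distances = min_max_scale(distances)
print(scaled_ages)
print(scaled_distances)
```

After scaling, both attributes lie in the same [0,1] interval even though the raw flight distances are much larger than the raw ages. In practice the minimum and maximum would be computed on the training split only and then reused for the test split.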
Each data sample has a label, so this is a supervised learning problem.
-An evaluation of the given dataset does not reveal any noisy data.
This is a dataset collected from a Vietnamese airline (data from Vietnam is normally
very limited and often inaccurate).
-Compared to other customer clustering datasets, this dataset was collected in
Vietnam, and most people in it are between 20 and 35 years old.
-Compared to other satisfaction prediction datasets, this dataset mainly
highlights in-flight services, which are the aspect people are most interested in
assessing for satisfaction.
5. Conclusion
In this research, our team uses two recent models, XGBoost and LightGBM. After
training and evaluating the models, we see that XGBoost achieved the highest
accuracy (96.2065%), while LightGBM achieved the highest F1 score (95.511%). Both
algorithms stand out for their ability to handle large datasets
well. However, we acknowledge that hyperparameter tuning for XGBoost and
LightGBM may pose challenges.
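For reference, the two metrics reported above can be computed from true and predicted labels as in the sketch below. It assumes "satisfied" is treated as the positive class, which the source does not state; the helper name `accuracy_and_f1` and the eight sample labels are made up and unrelated to the reported results.

```python
def accuracy_and_f1(y_true, y_pred, positive="satisfied"):
    """Return (accuracy, F1) for binary labels with `positive` as the positive class."""
    if len(y_true) != len(y_pred) or not y_true:
        raise ValueError("y_true and y_pred must be non-empty and of equal length")
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    correct = sum(1 for t, p in pairs if t == p)
    accuracy = correct / len(pairs)
    # F1 = 2*TP / (2*TP + FP + FN); returned as 0.0 when that denominator is zero.
    denom = 2 * tp + fp + fn
    f1 = 2 * tp / denom if denom else 0.0
    return accuracy, f1


S, N = "satisfied", "neutral or dissatisfied"
y_true = [S, S, S, N, N, N, N, S]  # made-up ground truth
y_pred = [S, S, N, N, N, S, N, S]  # made-up predictions
acc, f1 = accuracy_and_f1(y_true, y_pred)
print("accuracy:", acc, "F1:", f1)
```

Accuracy counts every correct prediction equally, while F1 only looks at the positive class through true positives, false positives, and false negatives, which is why the two metrics can rank models differently.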
The risk of overfitting exists, particularly when the models are fine-tuned excessively or
when the dataset is unbalanced. It is crucial for practitioners to strike a
balance between model complexity and generalizability to avoid overfitting. The
performance and speed of LightGBM are competitive with many algorithms (Aziz et al., 2022).