Group 4 Ins1053 Ins105301

This report analyzes the relationship between career paths and employee salaries in the data science industry, focusing on key job skills, certifications, and educational qualifications that influence compensation. It utilizes a dataset from Kaggle, containing 1,342 observations related to job postings and salaries in California, and employs various statistical methods including regression and classification analysis to identify factors affecting salary levels. The findings aim to provide insights for professionals to enhance their earning potential and career growth in a competitive job market.

VIETNAM NATIONAL UNIVERSITY, HANOI

INTERNATIONAL SCHOOL

⁎⁎⁎

REPORT
Introduction to Business Data Analytics

Topic: From Credentials To Benefits: Unpacking the Relationship Between Career Paths and Employee Salary

Group 4’s Member Student ID No


Nguyễn Thị Phương Thảo 23070983
Trần Thị Yến 23070921
Nguyễn Ánh Nguyệt 23070990
Nguyễn Lê Kiều Trang 23070838
Matthew Veriel Malonzo 23071334

Lecturers: Th.S Trần Đức Quỳnh

Class: INS1053 - INS105301

Hanoi, June 17th, 2024


Contents
INTRODUCTION
1. Background and Motivations:
2. Objectives of the Analysis:
3. About the Dataset:
METHODOLOGY
CHAPTER 1: DATA PREPARATION AND CLEANING
1.1. Data Preprocessing:
1.2. Descriptive Statistics for Nominal Variables:
1.3. Descriptive Statistics for Quantitative Variables:
CHAPTER 2: EXPLORATORY DATA ANALYSIS
2.1. Overview of Industry and Corresponding Salaries
2.2. Relationship between Job and Salaries:
2.3. Factors affecting the salary level
a. The Average of Careers
b. The Impact of Position Level
c. The Accompanying Benefits
2.4. Analysis of Top Qualifications by Career
CHAPTER 3: PREDICTIVE MODELING
3.1. Feature Selection
3.2. Model Building and Evaluation
3.3. Interpreting Model Results
CHAPTER 4: INSIGHTS AND RECOMMENDATIONS
4.1. Key findings and implications:
4.2. Strategies for maximizing salaries and benefits:
CONCLUSION
APPENDIX
CONTRIBUTION

INTRODUCTION

1. Background and Motivations:

In today’s competitive job market, understanding the relationship between job skills and

salaries is crucial for career success and financial stability. This knowledge guides

professionals in making informed career decisions and strategically planning skill

development. By identifying high-demand skills that command higher salaries, individuals

can tailor their education and training efforts to enhance employability and job satisfaction.

This understanding also empowers workers during salary negotiations, enabling them to

confidently request compensation reflecting their expertise.

Comparing current salaries with industry standards helps ensure professionals are not underpaid, contributing to overall job satisfaction and financial well-being. Data science is experiencing significant growth, and roles such as data analysts, machine learning engineers, and AI specialists are highly sought after due to the increasing reliance on data-driven decision-making. Skills like programming (Python, R), statistical analysis, data visualization, and machine learning are particularly valued and lead to higher salaries.

Therefore, knowing which skills can enhance qualifications for high-paying roles in these

fields helps professionals prioritize their education and career paths for maximum earning

potential. Staying updated on technological advancements and market trends is essential for

securing high-paying roles and advancing careers effectively in today’s dynamic labor

market.

2. Objectives of the Analysis:

The primary objectives of this report are to thoroughly analyze and understand the

relationship between job skills and salaries in the data science industry. Firstly, it aims to

identify key job skills, certifications, and educational qualifications associated with higher

salaries. Secondly, the report seeks to analyze how factors like job title, education level,

experience, and technical skills impact salaries and benefits. It will develop a predictive

model using statistical and machine learning techniques to estimate expected salaries for

different data science profiles.

Secondary objectives include exploring regional salary variations, industry-specific trends

across sectors like finance and healthcare, and assessing the impact of gender and diversity

on salaries. The report will also evaluate the role of continuing education in career

advancement and examine how compensation affects job satisfaction and work-life balance

among data science professionals.

By addressing these objectives, the report aims to provide a comprehensive understanding of

salary determinants in data science, offering insights for professionals to maximize their

earning potential and career growth.

o Salary Trends in Each Industry: How do salaries in the data science industry

vary across sectors such as finance, healthcare, technology, and retail? Are there

any specific trends that can be observed?

o Impact of Education and Professional Development: How do educational

qualifications and ongoing skill development affect career progression and

salaries in the data science industry? What specific types of education and

professional development yield the greatest benefits?

o Relationship Between Salary and Job Satisfaction: How do salary and benefits

correlate with job satisfaction and work-life balance among professionals in the

data science industry? Is there any correlation between these factors?

3. About the Dataset:

The dataset used for this analysis is sourced from Kaggle and contains information on job

postings and salaries in the data science industry, primarily focusing on the state of

California. The dataset comprises three main components: company salary information, job

qualifications, and employee benefits. The dataset consists of 1,342 observations and key

variables such as job title, salary (USD/ year), skill requirements, educational qualifications,

and provided benefits.

(Data source: https://www.kaggle.com/datasets/michaelbryantds/california-salaries-in-data-science/data)

While the initial salaries are reported in US Dollars (USD), they will be converted to

Vietnamese Dong (VND) based on the exchange rate at the time of analysis to increase

relevance for Vietnamese readers (1 USD = 25,450.00 VND as of June 15, 2024 11:00 am).

Variables such as position, company, and ID are not used in the analysis as they are not

directly related to the research objectives.

• Company Salary Information: This dataset provides details on various data science roles, including job titles, salary ranges, and job levels (e.g., junior, senior, staff). It covers a wide range of positions such as Data Scientists, Data Analysts, Machine Learning Engineers, and Data Science Managers, among others. The salary information is a crucial aspect of this analysis, as it serves as the target variable for understanding the impact of skills and qualifications on compensation levels.

• Job Qualifications: The qualifications dataset contains a comprehensive list of skills, educational requirements, and technical proficiencies associated with different data science roles. It includes information on analytical skills, communication abilities, research experience, programming languages (such as R, Python, and SQL), machine learning expertise, and degrees (Bachelor's, Master's, or Doctoral). This dataset will be utilized to identify the most valuable skills and qualifications that contribute to higher salaries in the data science industry.

• Employee Benefits: The benefits dataset provides information on various perks and benefits offered by companies, such as health insurance, paid time off, retirement plans, stock options, and professional development opportunities. This data will be analyzed to understand the relationship between compensation packages, including benefits, and job satisfaction within the data science field.

This study provides an opportunity to evaluate the necessary job skills and competitive

advantages in the data science industry. It also explores the relationship between educational

qualifications, skills, salaries, job positions, and provided benefits. The findings from this

study will be highly beneficial for students in preparing for their future careers, helping them

gain a better understanding of the skills they need to develop, as well as the salaries and

benefits they can expect when joining the workforce in the data science field.

METHODOLOGY

• Regression Analysis: Identifying the Main Factors Influencing Salary Levels
Regression analysis is a statistical technique used to model the relationship between one or

more independent variables and a dependent variable. In this case, we aim to identify the

primary factors that influence salary levels in the data science industry. Multiple linear

regression can be employed to model the relationship between independent variables (such as

skills, educational qualifications, experience) and the dependent variable (salary). The results

can be evaluated using techniques such as coefficient significance testing, adjusted R-

squared, and residual analysis to determine the main factors contributing to higher salary

levels.
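
As a minimal sketch of the multiple linear regression described above, the example below fits salary on two illustrative predictors with ordinary least squares and computes R-squared and adjusted R-squared. The toy rows, the predictor names (`is_senior`, `has_python`) and the salary formula are assumptions made for illustration; they are not taken from the dataset or from the report's own model.

```python
import numpy as np

# Hypothetical toy data: each row is one job posting.
# Columns: is_senior (0/1), has_python (0/1). Not taken from the dataset.
X = np.array([
    [0, 0],
    [0, 1],
    [1, 0],
    [1, 1],
    [0, 0],
    [1, 1],
], dtype=float)

# Salaries built as 120000 + 30000*is_senior + 10000*has_python (no noise),
# so the fit should recover these three numbers.
y = 120_000 + 30_000 * X[:, 0] + 10_000 * X[:, 1]

# Add an intercept column and solve the least-squares problem.
X_design = np.column_stack([np.ones(len(X)), X])
coef, *_ = np.linalg.lstsq(X_design, y, rcond=None)

# Goodness of fit: R-squared and adjusted R-squared.
y_hat = X_design @ coef
ss_res = float(np.sum((y - y_hat) ** 2))
ss_tot = float(np.sum((y - y.mean()) ** 2))
r2 = 1.0 - ss_res / ss_tot
n, p = X.shape  # n observations, p predictors (intercept excluded)
adj_r2 = 1.0 - (1.0 - r2) * (n - 1) / (n - p - 1)

print("intercept, is_senior, has_python:", np.round(coef, 2))
print("R2:", round(r2, 4), "adjusted R2:", round(adj_r2, 4))
```

On real data the fit would not be exact, and the sign and size of each coefficient would be read together with its significance test and the residual plots.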

• Classification Analysis: Detailed Analysis of Factors Influencing Salary Levels

Classification analysis is used to separate employees with high salaries from those with lower

salaries and analyze the factors influencing this difference. A predictive model can be built

using classification algorithms such as logistic regression. The model can then be used to

predict the salary level of new employees based on their characteristics, and the most

influential factors can be identified through feature importance analysis. To analyze each

influencing factor in detail, salary rates can be calculated for different subgroups of

employees. Data visualization techniques such as box plots, scatter plots, and heat maps can

also be utilized to explore the relationship between each factor and salary levels.

Additionally, descriptive statistical analyses, such as calculating mean, median, and standard

deviation, can help elucidate the distribution of employee salary and benefit variables.

Exploratory data analysis using histograms and box plots can also help identify patterns,

trends, and outliers in the data.
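
The classification step can be sketched as a small logistic regression trained by batch gradient descent. Everything in this example is an assumption for illustration: the two binary features (`is_senior`, `has_masters`), the rule that "high salary" means above some cutoff such as the median, and the learning rate and epoch count. A library implementation would normally be used in practice.

```python
import math

# Hypothetical toy data: (is_senior, has_masters) -> 1 if salary is "high".
# "High" is assumed to mean above a chosen cutoff such as the median salary.
rows = [
    ((0, 0), 0),
    ((0, 1), 0),
    ((1, 0), 1),
    ((1, 1), 1),
    ((0, 0), 0),
    ((1, 1), 1),
]

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def fit_logistic(rows, lr=0.5, epochs=2000):
    """Fit weights and bias by batch gradient descent on the log loss."""
    n_features = len(rows[0][0])
    w = [0.0] * n_features
    b = 0.0
    for _ in range(epochs):
        grad_w = [0.0] * n_features
        grad_b = 0.0
        for x, label in rows:
            err = sigmoid(sum(wi * xi for wi, xi in zip(w, x)) + b) - label
            for j, xj in enumerate(x):
                grad_w[j] += err * xj
            grad_b += err
        w = [wi - lr * g / len(rows) for wi, g in zip(w, grad_w)]
        b -= lr * grad_b / len(rows)
    return w, b

def predict(x, w, b):
    """Return 1 (high salary) when the predicted probability is at least 0.5."""
    return 1 if sigmoid(sum(wi * xi for wi, xi in zip(w, x)) + b) >= 0.5 else 0

w, b = fit_logistic(rows)
predictions = [predict(x, w, b) for x, _ in rows]
print("weights:", [round(wi, 2) for wi in w], "bias:", round(b, 2))
print("predictions:", predictions)
```

In this toy data the label depends only on `is_senior`, so its weight ends up clearly positive; comparing weight magnitudes in this way is one simple form of the feature importance analysis mentioned above.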

CHAPTER 1: DATA PREPARATION AND CLEANING

1.1. Data Preprocessing:

Employee:

Figure 1: Employee data

Benefits:

Figure 2: Employee Benefits

Qualifications:

Figure 3: Employee Qualifications (Degree and Skills)

To preprocess the "Data science salaries" dataset, we performed several key steps.

Firstly, we checked for missing values in the dataset. We found that there were no

missing values in the data.

Secondly, we checked for duplicate values, and then dropped them from the dataset.

This ensured that the dataset only contained unique records, which made our analysis more

accurate.

Thirdly, we removed redundant columns such as "Number", "Location", "Company"

since they did not provide any valuable information for our analytical or predictive models.

Finally, we converted categorical variables such as "Career" and "Levels" into numerical values using one-hot encoding. This was necessary because most machine learning algorithms require all input variables to be numerical. By transforming non-numerical variables into numerical ones, we were able to use them in our analytical and predictive models. Our data is presented below.
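
The four preprocessing steps above can be sketched with pandas as follows. The sample rows are invented for illustration, and the column names follow the ones quoted in the text ("Number", "Location", "Company", "Career", "Levels"), with "Salary" assumed as the name of the salary column; the real files may be laid out differently.

```python
import pandas as pd

# Hypothetical sample rows; real column names and values may differ.
raw = pd.DataFrame({
    "Number":   [1, 2, 2, 3],
    "Company":  ["A", "B", "B", "C"],
    "Location": ["CA", "CA", "CA", "CA"],
    "Career":   ["Data Scientist", "Data Analyst", "Data Analyst", "Data Scientist"],
    "Levels":   ["Senior", "Junior", "Junior", "Regular"],
    "Salary":   [160000, 110000, 110000, 150000],
})

# Step 1: count missing values per column.
missing_per_column = raw.isna().sum()

# Step 2: drop exact duplicate rows.
deduped = raw.drop_duplicates()

# Step 3: drop columns that are not used in the analysis.
trimmed = deduped.drop(columns=["Number", "Location", "Company"])

# Step 4: one-hot encode the categorical columns.
encoded = pd.get_dummies(trimmed, columns=["Career", "Levels"])

print(missing_per_column)
print(encoded.columns.tolist())
```

Each category becomes its own 0/1 column (for example `Career_Data Scientist`), which is the form the correlation matrix and the models in later chapters work with.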

1.2. Descriptive Statistics for Nominal Variables:

Career | Count

Table 1. Career count and Percentage

Career                            Observations   Percentage
Data Scientist                    660            51.3%
Machine Learning Engineer         139            10.8%
Machine Learning Scientist        136            10.6%
Software Engineer                 78             6.06%
Data Engineer                     72             5.59%
Data Science Manager              63             4.9%
Data Analyst                      57             4.43%
Applied Scientist                 36             2.8%
Director of Data Science          22             1.71%
Data Architect                    8              0.622%
Statistician                      6              0.466%
Head of Data Science              6              0.466%
Vice president of Data Science    4              0.311%
Total                             1287           100%

Career rate | Figure 5

Observation: The bar chart clearly shows the disparity in the frequency distribution among

the occupations.

• The most frequent value is "Data Scientist" with 660 observations, accounting for approximately 51.3% of the total observations.

• Next is "Machine Learning Engineer" with 139 observations (10.8%), "Machine Learning Scientist" with 136 observations (10.6%), and "Software Engineer" with 78 observations (6.06%).

• Other values such as "Data Engineer" (72 observations, 5.59%), "Data Science Manager" (63 observations, 4.9%), "Data Analyst" (57 observations, 4.43%), "Applied Scientist" (36 observations, 2.8%), "Director of Data Science" (22 observations, 1.71%), ... appear with lower frequencies.
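
Each percentage in Table 1 is the career's count divided by the total number of observations. The short sketch below recomputes the shares from the counts listed in the table, which is a quick way to check the published figures; the variable names are illustrative.

```python
# Counts copied from Table 1.
career_counts = {
    "Data Scientist": 660,
    "Machine Learning Engineer": 139,
    "Machine Learning Scientist": 136,
    "Software Engineer": 78,
    "Data Engineer": 72,
    "Data Science Manager": 63,
    "Data Analyst": 57,
    "Applied Scientist": 36,
    "Director of Data Science": 22,
    "Data Architect": 8,
    "Statistician": 6,
    "Head of Data Science": 6,
    "Vice president of Data Science": 4,
}

total = sum(career_counts.values())

# Share of each career as a percentage of all observations.
percentages = {
    career: 100 * count / total for career, count in career_counts.items()
}

for career, count in career_counts.items():
    print(f"{career:<32}{count:>6}{percentages[career]:>9.3f}%")
print(f"{'Total':<32}{total:>6}")
```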

Level | Count

Table 2. Level Count and Percentage

Level                Observations   Percentage
Unknown (Regular)    573            44.5%
Senior               485            37.7%
Staff                80             6.2%
Junior               73             5.7%
Principal            41             3.2%
Lead                 34             2.6%
Distinguished        1              0.1%
Total                1287           100%

Level rate | Figure 6

Observation: The bar chart clearly shows the disparity in the frequency distribution among

the different levels/positions.

• The most frequent value is "Unknown" with 573 observations, accounting for approximately 44.5% of the total observations. For the sake of interpretation, we assume that records with a job title but an "Unknown" level are regular employees, ranked above Junior and Staff but below Senior.

• Next is "Sr." (Senior) with 485 observations (37.7%), "Staff" with 80 observations (6.2%), and "Jr." (Junior) with 73 observations (5.7%).

• Other values such as "Principal" (41 observations, 3.2%) and "Lead" (34 observations, 2.6%) appear with lower frequencies.

• There is only a single observation at the "Distinguished" level.

• The hierarchy of levels, from highest to lowest, is established as follows:

o Distinguished

o Principal

o Lead

o Senior

o Regular (Unknown)

o Staff

o Junior

1.3. Descriptive Statistics for Quantitative Variables:

Salary | Values

Table 3. Salary Statistics

Statistic       Value
Count           1209
Mean            150969.91
Std             33377.15
Min             0.0
25%             130000.0
50% (Median)    150000.0
75%             170000.0
Max             434000.0

Observations:
• There are a total of 1209 observations for the "Salary" variable.

• The mean salary is 150969.91 USD per year, which is around 3.8 billion VND per year or around 320 million VND per month. For comparison, the average wage for data scientists in Vietnam is 27.5 million to 33.5 million VND per month.

• The standard deviation of the salary is 33377.15 USD, indicating a significant dispersion of the data, equivalent to roughly 850 million VND per year.

• The minimum salary value is 0, which we attribute to people working for free, for example to gain experience. The number of those working without pay is very low.

• The salary at the 25th percentile (Q1) is 130000.00 USD, or around 3.3 billion VND per year (275 million VND per month).

• The median salary is 150000.00 USD, or around 3.8 billion VND per year (320 million VND per month).

• The salary at the 75th percentile (Q3) is 170000.00 USD, or around 4.3 billion VND per year (360 million VND per month).

• The maximum salary value is 434000.00 USD, or around 11 billion VND per year (920 million VND per month).
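
The VND figures above follow from multiplying each yearly USD value by the exchange rate stated earlier in the report (1 USD = 25,450 VND) and dividing by 12 for the monthly amount. A minimal sketch of that arithmetic, with illustrative function and variable names:

```python
USD_TO_VND = 25_450  # exchange rate quoted in the report (June 15, 2024)

# Yearly salary statistics in USD, copied from Table 3.
salary_stats_usd = {
    "mean": 150_969.91,
    "25%": 130_000.0,
    "50%": 150_000.0,
    "75%": 170_000.0,
    "max": 434_000.0,
}

def to_vnd(usd_per_year, rate=USD_TO_VND):
    """Convert a yearly USD amount to (yearly VND, monthly VND)."""
    yearly = usd_per_year * rate
    return yearly, yearly / 12

for name, usd in salary_stats_usd.items():
    yearly, monthly = to_vnd(usd)
    print(
        f"{name:>4}: {yearly / 1e9:.2f} billion VND/year, "
        f"{monthly / 1e6:.0f} million VND/month"
    )
```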

Career and Level Variables | Description

Statistic    Career            Levels
Count        1287              1287
Unique       13                7
Top          Data Scientist    Regular
Freq         660               573

Observations:

• Both the "Career" and "Levels" variables have 1287 observations.

• The "Career" variable has 13 unique values, with "Data Scientist" being the most frequent (660 times).

• The "Levels" variable has 7 unique values, with "Regular" being the most frequent (573 times).

CHAPTER 2: EXPLORATORY DATA ANALYSIS

2.1. Overview of Industry and Corresponding Salaries

In this section, we explore which features are correlated with each other. This helps us determine whether there is a relationship between two variables. The values in a correlation matrix range from -1 to 1. Values closer to 1 signify a stronger positive correlation between the respective variables, while values closer to -1 indicate a stronger negative correlation, as seen in Figure 7 below.
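
Each cell of such a matrix is a Pearson correlation coefficient between two columns. The sketch below computes one cell by hand, between a one-hot level indicator and salary. The four sample values are invented for illustration and are not taken from the dataset.

```python
import math

def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)

# Hypothetical sample: a one-hot "Senior" indicator and yearly salary in USD.
is_senior = [0, 0, 1, 1]
salary = [120_000, 130_000, 160_000, 170_000]

r = pearson(is_senior, salary)
print(f"correlation between Senior and Salary: {r:.2f}")
```

A full matrix repeats this calculation for every pair of columns; with a pandas DataFrame this is usually done in one call with `DataFrame.corr()`.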

Observations:

• Job Level and Salary Correlation: The Senior level has a positive correlation of 0.22 with salary, indicating that senior-level employees tend to have higher salaries. The Principal level also shows a positive correlation with salary (0.11), reinforcing that higher-level employees generally earn more.

• Career Path and Salary Correlation: Director of Data Science has the highest positive correlation with salary (0.33), suggesting that individuals in this role tend to earn higher salaries. Other notable correlations include Head of Data Science (0.25) and Vice President of Data Science (0.24), indicating that these high-level positions are associated with higher pay.

• Weak Correlations with Salary: Certain career paths have weak or negative correlations with salary, such as Data Analyst (-0.19) and Statistician (-0.09). This suggests that these roles may be less lucrative than others in the dataset. However, since many people start out in entry-level jobs like these, the comparison helps in deciding which field within data science to pursue.

• Inter-career Path Correlations: There are positive correlations between various managerial and senior roles. For example, Head of Data Science and Director of Data Science (0.41) show a significant overlap or similarity in responsibilities. Additionally, Machine Learning Engineer and Machine Learning Scientist have a positive correlation (0.35), indicating a close relationship between these positions, likely due to similar skill sets or career trajectories.

• Career and Job Level Correlations: Applied Scientist has a very low positive correlation with salary (0.01) and a moderate correlation with Data Scientist (0.23). This suggests a potential overlap in skills or responsibilities between these roles.

From these observations, salary prediction should focus on the variables with the highest correlations with salary: career paths such as Director of Data Science, Head of Data Science, and Vice President of Data Science, and job levels such as Senior and Principal, all of which correlate positively with salary.

Figure 8. Histogram of Salary Distribution

This chart is a histogram of the salary column. Salaries are concentrated between 120,000 and 180,000, where the majority of people surveyed fall. The 160,000 to 180,000 range contains the largest number of people, followed by 120,000 to 140,000, with 140,000 to 160,000 in third place. The remaining salary ranges account for a smaller percentage. Very few people are at the lowest salaries, such as 20,000 and below 60,000, and very few achieve salaries above 420,000.

Figure 9. Statistical Chart of Salary Correlation

The statistical chart shows the variables that are most highly correlated with salary. The most influential variable is Ontology, with a correlation level above 0.3, and the least influential variable is ARQL, with a correlation close to 0. The correlation level shows how strongly salary is related to each factor. It does not indicate whether the effect is positive or negative, only how large the effect of each variable on salary is likely to be.

2.2. Relationship between Job and Salaries:

Figure 10. Pivot Table

The table above is a Pivot table providing information on average salary by career and levels.

The results can be seen as follows:

• Applied Scientist: The highest salary is at the 'Principal' rank at 130,000 USD, followed by 'Senior' and 'Staff'.

• Data Analyst: Salaries at the 'Junior' and 'Regular' ranks are lower than at other ranks such as 'Lead' and 'Senior'.

• Data Architect: Salary data is available only for the 'Distinguished' and 'Senior' ranks, with 'Distinguished' the higher at 190,000 USD.

• Data Engineer: The 'Senior' salary is 139,310.34 USD and 'Regular' is 133,846.15 USD.

• Data Science Manager: Data is available only for the 'Regular' rank, with a salary of 160,365.08 USD.

• Data Scientist: Salary is evenly distributed across ranks, with 'Lead' the highest at 179,760 USD and 'Junior' the lowest at 126,736 USD.

• Director of Data Science: Data is available only for the 'Regular' rank, with a very high salary of 231,909.09 USD.

• Head of Data Science: Data is also available only for 'Regular', with a salary of 175,000 USD.

• Machine Learning Engineer: The highest salary is at the 'Principal' level with 200,000 USD and the lowest is at the 'Regular' level with 143,733.33 USD.

• Machine Learning Scientist: Salary is evenly distributed across ranks, with 'Principal' the highest at 170,000 USD and 'Lead' the lowest at 120,000 USD.

• Software Engineer: The highest salary is for 'Senior' at 174,285.71 USD, while 'Junior' is 151,379.31 USD.

• Statistician: Data is available only for 'Junior' and 'Senior', with 'Senior' being higher at 110,000 USD.
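
A pivot table like the one summarized above can be produced with pandas by averaging salary for each combination of career and level. The rows below are invented for illustration, and the column names "Career", "Levels" and "Salary" are assumed from the variable names used in this report.

```python
import pandas as pd

# Hypothetical sample rows; not taken from the dataset.
jobs = pd.DataFrame({
    "Career": [
        "Data Scientist", "Data Scientist", "Data Scientist",
        "Data Engineer", "Data Engineer",
    ],
    "Levels": ["Junior", "Junior", "Senior", "Regular", "Senior"],
    "Salary": [120_000, 130_000, 160_000, 135_000, 140_000],
})

# Average salary for each (career, level) pair.
# Combinations that do not occur in the data show up as NaN.
pivot = jobs.pivot_table(
    index="Career", columns="Levels", values="Salary", aggfunc="mean"
)

print(pivot)
```

Empty cells explain why some careers above report data for only one or two ranks: the combination simply does not occur in the data.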

• Salary statistics by each occupation:

- Data Scientist: Salaries mostly range from 150,000 to 170,000 USD, with over 50% of people at this level. More broadly, over 90% of respondents have salaries between 100,000 and 190,000 USD. Few people earn under 90,000 USD or over 200,000 USD. Notably, within the 100,000 to 190,000 USD range, very few people earn 130,000 USD.

- Data Analyst: Salaries are less differentiated than for Data Scientists. The 125,000 to 150,000 USD range is the salary level most people achieve. Next are those at 100,000 to 150,000 and 150,000 to 175,000, and the fewest are those with salaries below 100,000 or above 175,000 USD.

- Machine Learning industry: The salary distribution is highly concentrated, with over 70 people at 160,000 USD and a few people spread across the remaining salary levels. This is also the average level in the industry, which shows that the majority of survey participants have a salary equal to the average salary.

- Director of Data Science: Salaries split clearly into two groups. The majority of respondents are below 250,000 USD, with 200,000 USD the most common salary. No one falls in the range from 250,000 to 400,000 USD, and 3 people are above 400,000 USD. The number of people in this role is quite small in the data, so discrepancies in its figures can easily occur.

- Machine Learning Engineer: Salaries are relatively evenly distributed, ranging from 100,000 to 175,000 USD (accounting for over 90% of survey participants). The most common salary is 150,000 USD, with more than 50 people at this level.

- Data Engineer: This role has the most skewed salary distribution, with over 80% earning 100,000 to 150,000 USD and very few at the remaining levels.

- Software Engineer: Salaries are concentrated in the range of 130,000 to 180,000 USD. The spread is fairly wide: the majority of survey participants are concentrated in this range, with a few in the remaining ranges. Only a few participants have a salary of 200,000, 250,000 or 300,000 USD.

- Data Science Manager: Salaries are fairly evenly distributed and concentrated between 120,000 and 180,000 USD. Salaries of 120,000, 165,000 and 180,000 have a fairly similar number of survey participants, while salaries of 140,000, 200,000 and 240,000 are less common.

- Data Architect: The number of respondents in this role is relatively small, with two people earning 200,000 to 210,000 and one person in each of the ranges 170,000 to 180,000, 180,000 to 190,000 and 190,000 to 200,000. This gives a much smaller sample than for other jobs in the data science field.

- Applied Scientist: There are relatively fewer people in this role than in other fields of data science. Salaries average around 160,000 USD per annum, go as low as 110,000 USD, and peak at around 180,000 USD per annum.

- Statistician: A total of three people took part in the survey, two of whom were paid 85,000 to 90,000 and one around 105,000. The apparent gap between salary ranges is a result of the small sample size.

- Data science leadership positions: These roles tend to have quite good salaries, from around 120,000 USD at the lowest up to 220,000 USD at the peak. Because these roles represent and manage teams, a corresponding increase in salary is to be expected.

- Main Observation: The charts with high correlation are those where the number of survey participants is very small. In these charts, statistical significance is almost non-existent and the results are for reference only.

The dummy charts show the distribution of salary values in different occupations, in terms of averages, percentiles, and outliers:

• DS: The average salary is 150,000, with a concentrated distribution in the range of 130,000 to 170,000 and a smaller concentration in the remaining ranges, extending from 70,000 to 230,000. There are 2 outliers, at salary levels 0 and 340,000.

• MLE: The average salary reaches 170,000, with values concentrated from 140,000 to 180,000 and extending from 100,000 to 210,000. There are 2 salaries separate from the rest of the survey, at 80,000 and 250,000.

• MLS: The majority of salaries range from 140,000 to 170,000, with a concentration from 150,000 to 160,000. The number of survey participants in this profession is relatively small, so the observations are spread across many different salary levels.

• SE: The average salary is 150,000, with salaries concentrated from 130,000 to 180,000 and extending from around 90,000 to 240,000.

• DA: For those working in data analysis, the average salary is lower, at 130,000, with salaries concentrated from about 100,000 to 140,000 and extending from around 60,000 to 200,000. The exception is those in the salary range of 20,000.

• For the data science management roles (DSM, DDM and HDM), the average salary is recorded at 200,000 and is concentrated in the range of 190,000 to 240,000. Salaries extend from about 160,000 up to 250,000. There are only two exceptions at 140,000, while high salaries may reach up to 440,000 USD.
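
The averages, percentile ranges and outliers read off these charts correspond to a standard box-plot summary. The sketch below assumes the charts are box plots using the common 1.5 × IQR rule for outliers, which the report does not state explicitly; the sample salaries are invented for illustration.

```python
from statistics import quantiles

def box_summary(values):
    """Quartiles, whisker fences and outliers under the 1.5 * IQR rule."""
    q1, median, q3 = quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    low_fence = q1 - 1.5 * iqr
    high_fence = q3 + 1.5 * iqr
    outliers = [v for v in values if v < low_fence or v > high_fence]
    return {
        "q1": q1,
        "median": median,
        "q3": q3,
        "low_fence": low_fence,
        "high_fence": high_fence,
        "outliers": outliers,
    }

# Hypothetical yearly salaries (USD) for one occupation.
sample = [0, 130_000, 140_000, 150_000, 160_000, 170_000, 340_000]

summary = box_summary(sample)
print(summary)
```

Values outside the two fences are drawn as separate points, which matches how isolated salaries such as 0 or 340,000 are described above.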

Figure 11. Salary Distribution by Job Level

Salary Distribution Chart by Level

 Unknown/Regular: People who do not disclose information have an average salary

of 130,000 and are concentrated in the range of 120,000 to 140,000. Smaller numbers

range from 60,000 to 130,000 and 160,000 to 210,000. There is a small amount

scattered at the 0 mark, 10,000 and above 210,000.

 Senior: The average is around 160,000, concentrated from 140,000 to 170,000,

distributed from 100,000 to 210,000 and a few small values scattered around 90,000,

220,000, 250,000 and 350,000.

29
 Staff: The average is around 150,000, the majority is concentrated from 140,000 to

170,000 and over 90% concentrated is from 80,000 to 190,000, with an outlier at

250,000.

 Junior: The average is around 150,000 and concentrated from 130,000 to 160,000,

the majority of people at this level are distributed from 90,000 to 190,000 and there

are only 2 very small groups at 70,000 and nearly 90,000.

 Principal: The average is around 170,000 and is concentrated around 150,000 to

190,000, the majority of people at this level have salaries ranging from 100,000 to

220,000 and only a very small group is at the 80,000 and 220,000 marks.

 Lead: The average is around 170,000, with a concentrated distribution from 150,000 to 180,000. Over 95% of respondents fall between 110,000 and 200,000, with only 1 outlier at 240,000.
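The per-level readings above (average, concentration band, outliers) correspond to standard summary statistics. The following is a minimal sketch of how they could be computed; the salary values are made up for illustration and are not taken from the report's dataset, and the 1.5 × IQR outlier rule is an assumption, since the report does not state how outliers were identified.

```python
from statistics import mean, quantiles

# Hypothetical salaries per level (USD); illustrative values, not the report's data.
salaries_by_level = {
    "Senior": [140_000, 160_000, 170_000, 350_000],
    "Junior": [90_000, 130_000, 150_000, 160_000],
}

def summarize(values):
    """Return the mean, the interquartile band, and values outside the 1.5*IQR fences."""
    q1, _, q3 = quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    low, high = q1 - 1.5 * iqr, q3 + 1.5 * iqr
    outliers = [v for v in values if v < low or v > high]
    return {"mean": mean(values), "q1": q1, "q3": q3, "outliers": outliers}

summary = {level: summarize(vals) for level, vals in salaries_by_level.items()}
for level, stats in summary.items():
    print(level, stats)
```

Here the band between `q1` and `q3` plays the role of the "concentrated" range in the bullets, and the `outliers` list corresponds to the isolated values reported for each level.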

2.3. Factors affecting the salary level

a. The Average of Careers


The provided salary chart offers a detailed insight into the average salaries for various

roles within the Data Science and Information Technology fields. See Figure 12, Pivot Table

for Average Salary, below.

Observations:

 The Applied Scientist role commands an impressive average salary of $151,764.71,

reflecting the high demand for professionals who can apply scientific methods to

solve real-world problems.

 Data Analysts earn an average of $122,917.32, which is relatively lower compared to

other roles in the data science domain. This difference is likely due to the focus of

Data Analysts on data processing and analysis rather than developing complex

models.

 Data Architects have the second-highest average salary in the chart, at $192,000.00. This significant

figure can be attributed to their critical role in designing and maintaining an

organization’s data infrastructure.

 Data Engineers earn an average of $138,309.86, highlighting the importance of their

role in building and maintaining data systems.

 Data Science Managers have an average salary of $160,365.08. This role requires

both management skills and extensive experience in data science, which justifies the

higher salary compared to Data Scientists.

 Data Scientists have a solid average salary of $147,318.79. This popular role

demands deep analytical skills and the ability to build models, making it one of the

most sought-after positions in the field.

 The Director of Data Science position boasts an average salary of $231,909.09, the

highest in the chart. This role involves significant responsibility in strategic direction

and management of the data science department.

 The Head of Data Science earns an average of $175,000.00, underscoring the

leadership and high-level responsibilities associated with this position.

 Machine Learning Engineers have an average salary of $158,820.59, indicating the

high demand for skills in developing machine learning models.

 Machine Learning Scientists earn an average of $157,902.58, highlighting the

importance of research and development in the machine learning field.

 Software Engineers receive an average salary of $152,855.26. While this is a high

salary, it is somewhat lower compared to some specialized data science roles.

 Statisticians have the lowest average salary in the chart at $93,333.33. This may be

due to their focus on traditional statistical analysis rather than the more complex tasks

in data science.

 The Vice President of Data Science earns an average salary of $157,500.00,

reflecting the high-level leadership and strategic responsibilities involved.

Overall, the salary chart clearly indicates the value placed on roles within the Data Science

and IT fields, with leadership and highly specialized technical roles commanding the highest

salaries.

The provided bar chart below (Figure 13) illustrates the top 10 average salaries by career

within the Data Science and Information Technology fields, highlighting the highest and

lowest salaries and offering career recommendations.

Observations:

 The Director of Data Science boasts the highest average salary at approximately

$231,909. This makes it an ideal role for individuals seeking leadership positions with

significant responsibilities in strategy and management. The substantial compensation

reflects the critical nature of the role in guiding and overseeing the data science

initiatives within an organization.

 Following closely is the Data Architect, with an average salary of $192,000. This

role is essential for designing and maintaining data infrastructure, making it suitable

for those with strong skills in data management and architecture. The high salary

underscores the importance of robust data systems in supporting the analytical and

operational needs of businesses.

 The Data Scientist has the lowest average salary among the top 10 careers, at

$147,319. While it is the lowest in this elite group, it remains a highly lucrative

position. Data Scientists play a crucial role in extracting insights and building

predictive models from data, highlighting the significant demand and value of their

expertise. It also often serves as a first stepping stone into the field, as an entry-level

position in most companies.
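The average-salary pivot table and the top-N ranking described above could be produced along the following lines. This is a minimal sketch, assuming a pandas DataFrame with hypothetical column names `job_title` and `salary`; the rows are invented for illustration and the report's actual notebook code is not reproduced here.

```python
import pandas as pd

# Hypothetical rows; the column names `job_title` and `salary` are assumptions.
postings = pd.DataFrame({
    "job_title": ["Data Analyst", "Data Analyst", "Data Scientist",
                  "Data Scientist", "Data Architect"],
    "salary": [110_000, 130_000, 140_000, 150_000, 192_000],
})

# Average salary per job title, highest first (the pivot-table view).
avg_salary = (
    postings.pivot_table(index="job_title", values="salary", aggfunc="mean")
    .sort_values("salary", ascending=False)
)
print(avg_salary.round(2))

# Top-N careers by average salary, as a "top 10" bar chart would show (N = 2 here).
top_n = avg_salary.head(2)
print(top_n)
```

Sorting before taking `head(N)` is what makes the last entry of the top-N list the lowest-paid career within that group.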

b. The Impact of Position Level

The salary chart below provides information on the salary levels for different position levels

across Data Science and Information Technology fields.

Figure 14. Average Salary per Level

Observations:

 For Applied Scientists, Principal is the highest position level, yet it has the lowest average salary at $130,000.00; Senior averages $157,142.86, Staff $156,666.67, and Unknown/Regular $146,666.67.

 The Distinguished level is the highest position for Data Architects with an average

salary of $190,000.00, reflecting the critical responsibility in designing and

maintaining the organization's data infrastructure.

 The Principal level is the highest position for roles such as Data Scientist ($179,760.00), Machine Learning Scientist ($170,000.00), Data Analyst ($96,000.00), and Applied Scientist ($130,000.00). For Data Scientists and Machine Learning Scientists, the high salary reflects their extensive expertise and rich experience in the field.

 The Lead level also commands a relatively high salary, e.g., Data Scientist

($169,545.45) and Machine Learning Scientist ($120,000.00), commensurate with

their leadership roles and managerial responsibilities.

 The Senior level typically has a lower salary than the Principal and Lead levels but higher than the Staff and Junior levels (a Senior Data Scientist earns $165,748.32).

 The Junior level is usually the lowest, with more modest salaries such as $126,736.00 for a Junior Data Scientist and $154,000.00 for a Junior Machine Learning Engineer, reflecting their limited experience and capabilities.

 The Regular level has varying salaries depending on the position but is generally

lower than the specified levels.

The bar chart depicts the average salaries across 7 main position levels: Regular, Junior, Staff, Senior, Lead, Principal, and Distinguished, clearly showing the trend of salaries increasing with higher job levels.

Observations:

 The Distinguished level has the highest average salary, around $190,000, befitting the

top position and significant responsibilities in data-related fields.

 Next is the Principal level with an average salary of approximately $169,231,

considerably higher than the Lead ($154,309.09) and Senior ($154,093.79) levels.

 The Regular and Junior levels have the lowest average salaries, around $62,727.

 The salary gap between levels is substantial (the Principal level is around $15,000 higher than the Senior level), reflecting the premium placed on higher-level capabilities and experience.

Based on the chart, it is evident that the position level factor has a significant impact on

salaries in Data Science. Higher levels like Distinguished, Principal, and Lead command

considerably higher salaries compared to lower levels. This reflects the high regard for the

experience, expertise, and job responsibilities of employees in senior positions.
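The size of the gap between levels can be checked directly from the averages quoted in the observations. A short sketch using those reported figures (the Principal and Distinguished values are approximate in the text, so the differences are approximate as well):

```python
# Average salaries per level as quoted in the observations above (USD).
# Principal and Distinguished are given only approximately in the text.
avg_by_level = {
    "Senior": 154_093.79,
    "Lead": 154_309.09,
    "Principal": 169_231.00,
    "Distinguished": 190_000.00,
}

levels = list(avg_by_level)
# Gap between each level and the one listed before it.
gaps = {
    f"{lower} -> {upper}": round(avg_by_level[upper] - avg_by_level[lower], 2)
    for lower, upper in zip(levels, levels[1:])
}
principal_vs_senior = round(avg_by_level["Principal"] - avg_by_level["Senior"], 2)

print(gaps)
print(principal_vs_senior)
```

With these figures, the Principal level sits roughly $15,000 above the Senior level, while Senior and Lead are almost identical.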

However, the salary gap between levels varies somewhat across job roles. Observing the

table, one can see that the difference in Principal level salaries between Data Analyst

($96,000.00) and Data Scientist ($179,760.00) is quite large. While the salary differential

between levels within the same field is considerable, the gap appears smaller for Data

Engineers due to a lack of data for multiple levels. Additionally, the Regular level typically

has lower salaries, but there are exceptions, such as the Regular Data Architect earning more

than the Senior level.

c. The Accompanying Benefits

Figure 14. Top 10 Benefits

Observations:

 Companies prioritize offering health insurance, dental insurance, social security, paid

time off (PTO), and vision insurance as essential benefits for several compelling

reasons.

 Companies recognize the pivotal role of comprehensive employee benefits in

fostering workplace satisfaction and loyalty.

 Health insurance and dental coverage ensure employees' physical well-being,

reducing absenteeism and promoting productivity.

 Social security and disability insurance provide financial security, offering peace of

mind during unexpected challenges.

 Paid time off and vision insurance further contribute to employee wellness,

supporting both physical and mental health needs.

 Visa sponsorship opens doors for diverse talent, enriching teams with varied

perspectives and skills.

 Paid training and opportunities for advancement demonstrate a commitment to

professional growth, empowering employees to develop their careers within the

company.

 Additionally, 401(k) matching encourages long-term financial planning, reinforcing employee retention and loyalty. With 401(k) matching, the company matches the employee's own retirement savings contributions. This is similar to Vietnam’s Social Security System; however, it can offer a higher effective return, since a full employer match essentially doubles the amount put into the investment.

 Together, these benefits not only attract top talent but also nurture a motivated

workforce, driving organizational success through enhanced productivity, innovation,

and employee satisfaction.

2.4. Analysis of Top Qualifications by Career

Qualifications in the workplace are fundamental prerequisites that ensure individuals possess

the knowledge, skills, and competencies necessary to perform their roles effectively. These

qualifications serve several crucial purposes in fostering a productive and efficient work

environment.

Firstly, qualifications act as a benchmark for competency and proficiency. They provide

employers with a reliable means to assess an individual's suitability for a specific job role or

task. By requiring qualifications such as degrees, certifications, or licenses, employers can

reasonably expect that employees have acquired the necessary theoretical knowledge and

practical skills to handle job responsibilities competently.

Secondly, qualifications contribute to maintaining standards and quality within industries. In

fields such as healthcare, engineering, finance, and law, specific qualifications are often

mandated by regulatory bodies or professional associations to ensure adherence to industry

standards and ethical practices. This not only safeguards the integrity of services provided but

also instills trust and confidence among clients, customers, and stakeholders.

Moreover, qualifications serve as a pathway for continuous learning and professional

development. They encourage individuals to acquire new knowledge, update their skills, and

stay abreast of industry advancements. This ongoing learning process is vital in a rapidly

evolving global economy where technological innovations and market dynamics continually

reshape job requirements and expectations.

Furthermore, qualifications enhance career opportunities and mobility for individuals. They

open doors to higher-level positions, promotions, and expanded responsibilities within

organizations. Employers often prioritize candidates with relevant qualifications when

making hiring decisions, recognizing the added value and expertise these individuals can

bring to the workplace.

In conclusion, qualifications play a pivotal role in ensuring workforce readiness, maintaining

industry standards, promoting lifelong learning, and advancing career opportunities. They are

essential not only for individual career growth but also for organizational success,

contributing to a skilled, capable, and competitive workforce in today's dynamic and

demanding workplace environments.

We can separate qualifications into two categories: background of education and skills. We will first examine the data on background of education, followed by the skills category. The dataset provides the following overall breakdown of educational background.

Figure 15. Educational Background | Overall

Background of Education Observations:

 Doctor’s Degree - This is the largest group in the dataset, at 42.6% of the sample population. This shows that they are achievers in this field and have studied heavily to enter it. See the Figure below for their Job Titles.

 Doctor of Philosophy - The highest standard of education is also the most common among respondents, at 41% of the sample population. The remaining 1.6% did not disclose what type of Doctoral Degree they have.

Figure 16. Doctorate Level of Education | Job Titles

 Their Job Titles are as follows:

 Data Scientist - 50.4%

 Machine Learning Scientist - 15.9%

 Machine Learning Engineer - 9.2%

 Software Engineer - 6.8%

 Applied Scientist - 5.1%

 Data Engineer - 4.4%

 Data Science Manager (DSM), Director of Data Science (DDS), Data Analyst - Each at around 2%. People with Doctorate Degrees have been able to reach high levels like the DSM and DDS, having studied and worked hard to get that far.

 Master’s Degree - Overall, 20.4% of the samples in the dataset have achieved a Master’s Degree; 15.1% of the dataset did not disclose their type of Master’s degree. See the Figure below for their Job Titles.

 Master’s Degree of Science - They compose 4.8% of the dataset.

 Master’s Degree of Business Administration - They compose 0.5% of the dataset.

Figure 17. Master Level of Education | Job Titles

 Their Job Titles are as Follows:

 Data Scientist - 45%

 Machine Learning Scientist - 10.3%

 Machine Learning Engineer - 9.2%

 Data Engineer - 9.9%

 Software Engineer - 8.8%

 Data Analyst - 5.7%

 Data Science Manager (DSM) - 5.0%

 Applied Scientist - 1.5%

 Bachelor’s Degree: Overall, 14.7% of the samples in the dataset have achieved a Bachelor’s Degree; 10.4% of the dataset did not disclose their type of Bachelor’s degree. See the Figure below for their Job Titles.

 Bachelor’s Degree of Science - They compose 4.3% of the dataset.

Figure 17. Bachelor Level of Education | Job Titles

 Their Job Titles are as Follows:

 Data Scientist - 44.9%

 Machine Learning Engineer - 15.0%

 Machine Learning Scientist - 9.6%

 Data Science Manager (DSM) - 9.6%

 Data Engineer - 7.5%

 Software Engineer - 4.8%

 Data Analyst - 4.3%

 Data Architect - 1.6%

 Director of Data Science (DDS) - 0.5%

 None Applicable or Not Disclosed - They did not disclose their background of education. They compose 22.1% of the dataset.

Figure 17. Undisclosed Level of Education | Job Titles

 Their Job Titles are as Follows:

 Data Scientist - 63.5%

 Machine Learning Engineer - 9.9%

 Data Analyst - 8.2%

 Data Science Manager (DSM) - 6.4%

 Data Engineer - 2.8%

 Software Engineer - 2.5%

 Director of Data Science (DDS) - 2.1%

 Machine Learning Scientist - 1.4%

As we can see, the mix of Job Titles differs with each level of educational background. As we aim to finish this course with a Bachelor's Degree in Business Data Analytics, we have every opportunity to rise to similar Job Titles and the Salaries covered in Section 2.3.
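The education shares quoted above can be cross-checked with simple arithmetic: within each level, the sub-shares add up to the level's overall share, which indicates that every percentage is expressed relative to the whole dataset. A small sketch using only the figures from the text (the dictionary keys are shortened labels chosen for illustration):

```python
# Shares of the dataset (percent) as quoted in the observations above.
shares = {
    "Doctorate": {"Doctor of Philosophy": 41.0, "Type not disclosed": 1.6},
    "Master's": {
        "Master of Science": 4.8,
        "Master of Business Administration": 0.5,
        "Type not disclosed": 15.1,
    },
    "Bachelor's": {"Bachelor of Science": 4.3, "Type not disclosed": 10.4},
    "None applicable / not disclosed": {"Not disclosed": 22.1},
}

# Sum the sub-shares per level, then sum the levels.
level_totals = {level: round(sum(parts.values()), 1) for level, parts in shares.items()}
overall = round(sum(level_totals.values()), 1)

print(level_totals)
print(overall)
```

The level totals reproduce the 42.6%, 20.4%, 14.7% and 22.1% figures, and the overall sum falls just short of 100%, which is consistent with rounding in the reported percentages.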

Skills Observations:

 Programming Languages - SQL, C, R, C++, Python, Spark, and NoSQL

 These are the top skills in the data science field, as we need them to program and to assist in our data analysis. Based on the dataset, every person knows at least one programming language, which makes these skills essential in the field.

 Machine learning - Torch, PyTorch, TensorFlow, Deep Learning, Natural Language

Processing, Elasticsearch, Relational Databases, Apache, and Apache Hive

 Machine learning covers many different areas that we need to know and learn to use well. The skills listed span multiple applications, including libraries and data management tools, that allow an AI model to be built from and make use of the provided datasets.

 General Skills - Communication Skills, Analysis Skills, Research, Tableau,

Analytics, Data Visualization, System Design

 These skills are transferable between jobs and positions: some, like communication skills, allow you to work as a leader, while others, like research, analytics and data visualization, allow you to perform well even in different industries.

In conclusion, we must have skills in these three areas in order to succeed in our chosen fields.
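One way such a skill grouping could be tallied from job postings is sketched below. The postings are invented for illustration, and the category sets are abbreviated versions of the three groups listed above; the report's own counting code is not shown, so this is only an assumption about the approach.

```python
from collections import Counter

# Hypothetical postings, each with a list of required skills (not taken from the dataset).
postings = [
    ["Python", "SQL", "TensorFlow", "Communication Skills"],
    ["SQL", "R", "Tableau"],
    ["Python", "PyTorch", "SQL", "Research"],
]

# Category membership follows the grouping in the observations above (abbreviated).
categories = {
    "Programming Languages": {"SQL", "C", "R", "C++", "Python", "Spark", "NoSQL"},
    "Machine learning": {"Torch", "PyTorch", "TensorFlow", "Deep Learning"},
    "General Skills": {"Communication Skills", "Analysis Skills", "Research", "Tableau"},
}

# Count how often each individual skill appears across postings.
skill_counts = Counter(skill for posting in postings for skill in posting)

# Count skill mentions per category.
category_counts = Counter(
    category
    for posting in postings
    for skill in posting
    for category, members in categories.items()
    if skill in members
)

# Check whether every posting lists at least one programming language.
all_have_language = all(
    any(skill in categories["Programming Languages"] for skill in posting)
    for posting in postings
)

print(skill_counts.most_common(3))
print(category_counts)
print(all_have_language)
```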

CHAPTER 3: PREDICTIVE MODELING

3.1. Feature Selection

Linear Regression

To identify the factors that affect salary, we analyzed the training data using the Regression analysis tool in SPSS.

The standard regression equation is a statistical model used to explain the relationship between a dependent variable and independent variables. The equation takes the form:

Y = β₀ + β₁X₁ + β₂X₂ + … + βᵣXᵣ

Where: - Y is the value of the dependent variable

- X₁, X₂, …, Xᵣ are the values of the independent variables

- β₀ is the intercept coefficient of the regression equation, representing the value of Y when all independent variables are equal to 0.

- β₁, β₂, …, βᵣ are regression coefficients, which represent the degree of influence of each independent variable on the dependent variable.
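To make the roles of the intercept and the regression coefficients concrete, the sketch below fits this kind of model by ordinary least squares with NumPy. It uses synthetic data generated from known coefficients, so the variable names and values are placeholders; the report itself used SPSS, and this is not a reproduction of that analysis.

```python
import numpy as np

# Synthetic data generated from known coefficients; X1 and X2 are placeholder predictors.
rng = np.random.default_rng(0)
n = 200
X = rng.normal(size=(n, 2))              # two independent variables X1, X2
true_beta = np.array([1.5, 0.8, -0.3])   # beta0 (intercept), beta1, beta2
y = true_beta[0] + X @ true_beta[1:] + rng.normal(scale=0.1, size=n)

# Prepend a column of ones so the first fitted coefficient is the intercept beta0.
design = np.column_stack([np.ones(n), X])
beta_hat, *_ = np.linalg.lstsq(design, y, rcond=None)

print(beta_hat)
```

The fitted values in `beta_hat` should land close to `true_beta`, which illustrates how each coefficient captures the influence of one independent variable while the others are held fixed.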

3.2. Model Building and Evaluation

The linear regression equation for the effect on Salary provided by the model is

Salary = 0 + 0.243 Levels - 0.124 Careers - 0.05 Qualifications + 0.031 Benefits

The regression model results show that Levels has a positive effect on salary, while Careers has a negative effect. Qualifications and Benefits have p-values below the 0.05 threshold, but their level of influence is smaller. The standard deviation of the residuals shows that the model may not fully explain the variation in Salary.

R² = 0.75 indicates that the model explains 75% of the variance in Salary.
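As a rough illustration of how the fitted equation and the R-squared statistic are used, the sketch below evaluates the equation for given inputs and computes R-squared from actual and predicted values. The scale and encoding of the four variables are not stated in the report, so the inputs are assumed to be on the model's own scale, and the toy numbers in the R-squared check are made up.

```python
def predict_salary(levels, careers, qualifications, benefits):
    """Evaluate the fitted equation quoted above.

    Inputs are assumed to be on the model's (unstated) scale.
    """
    return 0 + 0.243 * levels - 0.124 * careers - 0.05 * qualifications + 0.031 * benefits

def r_squared(actual, predicted):
    """R^2 = 1 - SS_res / SS_tot."""
    mean_actual = sum(actual) / len(actual)
    ss_res = sum((a - p) ** 2 for a, p in zip(actual, predicted))
    ss_tot = sum((a - mean_actual) ** 2 for a in actual)
    return 1 - ss_res / ss_tot

# Toy check with made-up numbers: predictions close to the actual values give R^2 near 1.
actual = [1.0, 2.0, 3.0, 4.0]
predicted = [1.1, 1.9, 3.2, 3.8]
print(round(r_squared(actual, predicted), 3))

# One-unit increase in Levels, everything else at zero.
print(predict_salary(1, 0, 0, 0))
```

An R-squared of 0.75 would mean the residual sum of squares is one quarter of the total sum of squares around the mean salary.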

3.3. Interpreting Model Results

 Levels: Levels have the biggest influence on the salary of survey participants: the higher the level, the higher the respondent's salary. The coefficient table shows that the Levels, Qualifications and Benefits variables move in the same direction as the Salary variable, with positive coefficients, so when these factors increase, the salary of respondents also increases. The Levels variable has the largest positive impact, with coefficient = 0.643. The Benefits variable also has a positive impact, with coefficient = 0.308. In other words, the higher the level of work experience, the higher the salary, and the more benefits a company offers, the higher the salary of that company's employees.

 Career: In the regression equation, the relationship between Careers and Salary is weak (coefficient -0.124) and in the opposite direction, suggesting that salary tends to decrease when changing to another job. Read literally, this means that as a career progresses, salary tends to decrease, or conversely, that salary may increase when a career does not develop. This may reflect phenomena such as accepting a lower salary to gain career development opportunities, or people whose careers are not developing still receiving high salaries because of many years of working experience. Careers is the variable with a negative coefficient; this suggests that job and degree/certificate quality have an inverse impact on income level. A correlation coefficient of -0.624 between career and salary indicates a negative relationship between these two variables.

 Qualification: The relationship between Qualifications and Salary is relatively strong and in the same direction: as qualifications increase, income also increases, so the income of those who study to improve their qualifications can be expected to rise. Education level (qualification) has a correlation coefficient of 0.503 with salary, which shows a positive relationship between these two variables. That is, as the level of education increases, salary also tends to increase, and vice versa. This reflects the view that higher levels of education are often associated with greater productivity and job performance, leading to higher wages.

 Benefits: The correlation coefficient of 0.308 between benefits and salary shows a positive but not strong relationship. This means that as benefits increase, salaries also tend to increase slightly. However, this relationship is not strong enough to conclude that benefits are the main determinant of salary. In one published study, compensation systems were shown to have a positive influence on employee satisfaction by using motivation as a mediator. On the other hand, another study found that only salary, not benefits, significantly predicted job satisfaction among higher education employees. This shows that while benefits can have an influence on salary, they are not always the most important factor.
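The correlation coefficients cited in this section (for example 0.503 or 0.308) are Pearson correlations between an explanatory variable and salary. A minimal sketch of the calculation follows, using made-up values: `level_code` is a hypothetical ordinal encoding of job level, and `salary_k` is salary in thousands of USD; neither comes from the report's dataset.

```python
from math import sqrt

def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)

# Made-up encoded values: a hypothetical ordinal level code and salary in thousands of USD.
level_code = [1, 2, 3, 4, 5]
salary_k = [120, 150, 140, 170, 190]

r = pearson_r(level_code, salary_k)
print(round(r, 3))
```

Values near +1 or -1 indicate a strong linear relationship in the same or opposite direction, while values near 0 indicate a weak one; the sign matches the direction discussed for each variable above.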

CHAPTER 4: INSIGHTS AND RECOMMENDATIONS

4.1. Key findings and implications:

1. Work Experience and Salary:

o Insight: The model predicts that higher levels of work experience correlate with

higher salaries.

o Recommendation: Encourage individuals to accumulate work experience over time as

it tends to increase earning potential. Employers should recognize and reward

experience appropriately when setting salary levels.

2. Job Change and Salary:

o Insight: Changing jobs often leads to a decrease in salary.

o Recommendation: Advise individuals to carefully consider the financial implications

of changing jobs. Negotiating a competitive salary based on current market rates and

demonstrating the value of skills and experience can mitigate potential salary

decreases.

3. Education and Productivity/Salary:

o Insight: Higher levels of education are associated with greater productivity and lead to

higher wages.

o Recommendation: Promote lifelong learning and higher education as pathways to

increasing productivity and earning potential. Employers should consider education

levels as a factor in salary determination, aligning compensation with the value added

through enhanced skills and knowledge.

4. Benefits and Wages:

o Insight: Benefits tend to increase with wages, and higher wages lead to better benefits.

o Recommendation: Ensure that compensation packages are competitive and include

benefits that align with employees' needs and expectations. As wages increase, revisit

and adjust benefit packages to maintain a balanced and attractive compensation

offering.

Overall, these insights suggest a structured approach to career development and

compensation strategy:

 Encourage continuous professional growth and skill development.

 Advise strategic career decisions with consideration for salary implications.

 Recognize the value of education in enhancing productivity and earning potential.

 Design comprehensive compensation packages that reflect the interplay between

wages and benefits.

By aligning these recommendations with organizational goals and employee expectations,

companies can foster a motivated and productive workforce while remaining competitive in

the market.

4.2. Strategies for maximizing salaries and benefits:

1. Maximizing Salary Based on Work Experience:

 Strategy: Career Advancement and Skill Development

o Continuously seek opportunities to gain relevant work experience and skills that

are in demand.

o Consider lateral moves or promotions within your current organization to

capitalize on accumulated experience.

o Negotiate salary increases during performance reviews or when taking on

additional responsibilities that leverage your experience.

o Research and benchmark your salary against industry standards to ensure you are

being compensated fairly based on your experience level.

2. Mitigating Salary Decrease When Changing Jobs:

 Strategy: Negotiation and Market Awareness

o Research and understand current market salary trends and ranges for your

position and level of experience.

o Highlight your skills, achievements, and unique value proposition during job

interviews to justify maintaining or increasing your salary.

o Negotiate not just base salary but also consider other benefits and perks that can

offset a potential decrease in salary (e.g., signing bonuses, stock options, flexible

work arrangements).

o Evaluate the total compensation package, including benefits, to assess the overall

value of the offer compared to your current position.

3. Leveraging Education for Higher Wages:

 Strategy: Continuous Learning and Professional Development

o Pursue higher education or certifications that are relevant to your field and career

goals.

o Showcase your advanced skills and knowledge gained through education to

demonstrate increased productivity and capability.

o Seek out roles or projects that specifically value and require higher levels of

education.

o Negotiate salary adjustments or promotions that reflect the added value of your

educational achievements.

4. Maximizing Benefits with Wages and Vice Versa:

 Strategy: Comprehensive Compensation Package Assessment

o Understand how benefits are structured and linked to your salary level within

your current or prospective organization.

o Negotiate for benefits that align with your personal and financial needs, such as

health insurance, retirement contributions, and professional development

opportunities.

o Consider the long-term implications of benefits like retirement plans and stock

options, which can enhance your overall compensation beyond base salary.

o Monitor and review benefit offerings periodically to ensure they remain

competitive and meet your evolving needs.

By strategically approaching these recommendations, individuals can optimize their total

compensation package while navigating the dynamics of salary, benefits, experience, and

education as predicted by the model. It's crucial to tailor these strategies to individual career

goals, market conditions, and organizational contexts for the best outcomes.

CONCLUSION

The goal of this report is to analyze, understand, and decode the path from qualifications to benefits,

clarifying the relationship between career paths and employee salaries. Based on the analysis

of the "Data Science Job Postings and Salaries Dataset" regarding occupations, annual

salaries, average salaries, job levels, and required skills, we have performed data cleaning,

descriptive analysis, and regression modeling to provide valuable findings and

recommendations.

One of the most significant findings is the substantial impact of work experience on salary

levels. Our analysis revealed a positive correlation between work experience and employee

income. This highlights the importance of professional development and skill-building in

one's career path.

Additionally, the report shows that frequent job changes can lead to a decrease in salary. This

emphasizes the need for careful career decisions, considering the potential financial impacts

of job transitions. Individuals should be prepared to negotiate competitive salaries based on

their skills, experience, and value.

Furthermore, our analysis clarified the positive relationship between benefits and salaries,

indicating that higher salaries are often accompanied by better benefit packages. This

underscores the importance of considering both salary and benefits when evaluating job

opportunities and negotiating compensation.

Based on these significant findings, we also provide targeted strategies for individuals to

maximize their salaries and benefits, including skill development, effective negotiation

tactics, and leveraging educational qualifications to secure attractive compensation and

benefits.

In summary, this report provides a comprehensive exploration of the factors influencing

salaries and benefits, assessing the necessary job skills in the data science field. The findings

from this research will be invaluable for students preparing for their future careers, helping

them better understand the skills they need to develop, and the salaries and benefits they can

expect when entering the job market in the data science industry.

APPENDIX

 CODE 1: file:///C:/Users/Admin/Downloads/Nhapmon%20(2).html (PANDAS)

 CODE 2: file:///C:/Users/Admin/Downloads/Untitled15-1%20(2).html

(PANDAS)

 CODE 3: (POWER BI)

(Please click the link twice to be able to view the data)

CONTRIBUTION

Student’s name | Student ID | Contribution | Contribution evaluation

Nguyễn Thị Phương Thảo | 23070983 | Introduction: About the Dataset; Methodology; Chapter 1: Descriptive Statistics for Nominal Variables, Descriptive Statistics for Quantitative Variables; Chapter 2: Relationship between Job and Salaries, Factors affecting the salary level; Conclusion; Creating the visualizations with Power BI (Appendix: Code 3); Build outlines for report; Assign work; Editing the report, paper formatting; Slides | A

Nguyễn Lê Kiều Trang | 23070838 | Introduction: Background and Motivations, Objectives of the Analysis; Chapter 1: Data Preprocessing; Chapter 2: Overview of Industry and Corresponding Salaries, Factors affecting the salary level; Creating the visualizations with Python Pandas, coding (Appendix: Code 1); Slides | A

Trần Thị Yến | 23070921 | Chapter 2: Overview of Industry and Corresponding Salaries, Relationship between Job and Salaries; Chapter 3: Feature Selection, Model Building and Evaluation, Interpreting Model Results; Creating the visualizations with Python Pandas, coding (Appendix: Code 2) | B

Nguyễn Ánh Nguyệt | 23070990 | Chapter 2: Overview of Industry and Corresponding Salaries, Relationship between Job and Salaries; Chapter 3: Feature Selection, Model Building and Evaluation, Interpreting Model Results; Creating the visualizations with Python Pandas, coding (Appendix: Code 2) | B

Matthew Veriel Malonzo | 23071334 | Chapter 2: Factors affecting the salary level, Analysis of Top Qualifications by Career; Chapter 4: Key findings and implications, Strategies for maximizing salaries and benefits; Slides; Paper formatting; Present group’s work | A
