
DETECTION OF EMPLOYEE STRESS USING

MACHINE LEARNING
A Project report submitted in partial fulfillment of the requirements for the award of the
Degree of
BACHELOR OF TECHNOLOGY
IN
COMPUTER SCIENCE AND ENGINEERING

Submitted by

N. Chaitanya Venkata Sai 20B81A05B7


N. Sai Kumar 20B81A05B8
N. Swetha Kumari 20B81A05B9
N. Sasi Priya 20B81A05C0
N. Satya Harika 20B81A05C1

Under the Esteemed Guidance of


Mrs. K. Lakshmi Prasuna

Assistant Professor

DEPARTMENT OF COMPUTER SCIENCE AND ENGINEERING


SIR C R REDDY COLLEGE OF ENGINEERING
Approved by AICTE, Accredited by NBA
(Affiliated to Jawaharlal Nehru Technological University, Kakinada)

Eluru-534007
A.Y 2023-2024
SIR C R REDDY COLLEGE OF ENGINEERING
DEPARTMENT OF COMPUTER SCIENCE AND ENGINEERING

CERTIFICATE

This is to certify that the project work entitled DETECTION OF EMPLOYEE

STRESS USING MACHINE LEARNING has been successfully submitted by

N. Chaitanya Venkata Sai 20B81A05B7


N. Sai Kumar 20B81A05B8
N. Swetha Kumari 20B81A05B9
N. Sasi Priya 20B81A05C0
N. Satya Harika 20B81A05C1

in partial fulfillment for the award of the Degree of Bachelor of Technology in

Computer Science and Engineering during the academic year 2023-2024, as fulfilment of

the requirements for the completion of “PROJECT WORK” in COMPUTER SCIENCE AND

ENGINEERING.

Mrs. K. Lakshmi Prasuna Dr. A. Yesu Babu, M.Tech., Ph.D.

Assistant Professor Head of the Department

Department of CSE Department of CSE

External Examiner
DECLARATION

We hereby declare that the project entitled DETECTION OF EMPLOYEE STRESS

USING MACHINE LEARNING submitted for the B.Tech Degree is our original

work and the project has not formed the basis for the award of any degree,

associateship, fellowship or any other similar titles.

Place: Eluru N. Chaitanya Venkata Sai 20B81A05B7


Date: N. Sai Kumar 20B81A05B8
N. Swetha Kumari 20B81A05B9
N. Sasi Priya 20B81A05C0
N. Satya Harika 20B81A05C1

ACKNOWLEDGEMENT

We would like to take this opportunity to thank our management and beloved principal
Dr. K. Venkateswara Rao, M.Tech., Ph.D., for providing all the necessary facilities and great
support to us in completing the project work.
The present project work is the result of many days of study of the various aspects of project
development. During this effort, we received a great amount of help from our Head of the
Department, Dr. A. Yesubabu, M.Tech., Ph.D., whom we wish to acknowledge and thank from
the depth of our hearts.
We are deeply indebted to our project guide and our project coordinator, Dr. G. Nirmala, M.Tech.,
Ph.D., for providing this opportunity and for the constant encouragement given during this course.
We are grateful for the valuable guidance and suggestions during our project work.
Our parents have put us ahead of themselves. Because of their hard work and dedication, we had
opportunities beyond our wildest dreams. Finally, we express our thanks to all other faculty
members, classmates, friends and neighbours who helped us with the completion of this project;
without their infinite love and patience this would never have been possible.

N. Chaitanya Venkata Sai 20B81A05B7


N. Sai Kumar 20B81A05B8
N. Swetha Kumari 20B81A05B9
N. Sasi Priya 20B81A05C0
N. Satya Harika 20B81A05C1

ABSTRACT

Stress disorders are very common among employees working in the corporate sector. With
the changing nature of work and lifestyles, we can see a steady increase of stress in working
employees. Even though many corporate organisations provide a variety of schemes related
to mental health and try to reduce stress disorders in the working environment, the disorder
is far from being eliminated. In this project, we apply two machine learning techniques to
determine the amount of stress an employee working in the corporate sector is experiencing,
and we try to narrow down the issues that indicate the stress levels.

Keywords: Random Forests, Support Vector Machine

TABLE OF CONTENTS

SNO TITLE PAGE NO

1 INTRODUCTION 1
2 LITERATURE SURVEY 2
3 EXISTING SYSTEM 7
3.1 DISADVANTAGES
4 PROPOSED SYSTEM 8
4.1 SCOPE
4.2 OBJECTIVES
5 REQUIREMENT ANALYSIS 9
5.1 FUNCTIONAL REQUIREMENTS
5.2 NON-FUNCTIONAL REQUIREMENTS
5.3 SOFTWARE REQUIREMENTS
5.4 HARDWARE REQUIREMENTS
6 DESIGN AND METHODOLOGY 12
6.1 METHODOLOGY
6.2 SYSTEM DESIGN
7 IMPLEMENTATION 17
8 TESTING 23
8.1 TYPES OF TESTING
9 RESULTS AND DISCUSSION 26
10 CONCLUSION 31
11 REFERENCES 32

CHAPTER-1
INTRODUCTION

Stress disorders related to mental health are not rare among employees working in the
corporate sector, and earlier analyses have raised concern on this very issue. Based on work
done by the Associated Chambers of Commerce and Industry of India (Assocham), more
than 42% of professional employees working in India's private corporate sector suffer from
stress or common anxiety disorders because of late-night working hours and rigid timings.
This share is growing, as mentioned in a 2018 Economic Times article based on a survey
managed by Optum. That survey considered the replies of nearly eight lakh working
employees from more than seventy large companies, with each single company employing
more than 4,500 working professionals. A workplace that is free from stress must be given
utmost importance for higher productivity and happy living of the working employees.
There are many steps that can be taken to help employees cope with stress disorders and
support mental well-being, such as counselling assistance, career guidance, sessions on
stress management, and health-awareness programmes. Identifying the working employees
who need such help will definitely improve the success rate of such measures. We try to
make this happen by using machine learning techniques to build a model that predicts the
stress level of an employee. This approach will not only help company HR managers
understand their working professionals better, it will also help in taking proper precautions
to reduce the chances of stress in their working employees.

CHAPTER-2
LITERATURE SURVEY

2.1 Measuring Post Traumatic Stress Disorder in Twitter. Glen
Coppersmith, Mark Dredze, and Craig Harman. 2014.
Traditional mental health studies rely on information primarily collected through personal
contact with a health care professional. Recent work has shown the utility of social media
data for studying depression, but there have been limited evaluations of other mental health
conditions. We consider PTSD, a serious condition that affects millions worldwide, with
especially high rates in military veterans. We also present a novel method to obtain a PTSD
classifier for social media using simple searches of available Twitter data, a significant
reduction in training data cost compared to previous work. We demonstrate its utility by
examining differences in language use between PTSD and random individuals, building
classifiers to separate these two groups, and by detecting elevated rates of PTSD at and
around U.S. military bases using our classifiers.

Introduction. Mental health conditions affect a significant percentage of the U.S. adult
population each year, including depression (6.7%), eating disorders like anorexia and
bulimia (1.6%), bipolar disorder (2.6%) and post-traumatic stress disorder (PTSD) (3.5%).
PTSD and other mental illnesses are difficult to diagnose, with competing standards for
diagnosis based on self-reports and testimony from friends and relatives. In recent years,
several studies have turned to social media data to study mental health, since it provides an
unbiased collection of a person's language and behavior, which has been shown to be useful
in diagnosing conditions (De Choudhury 2013). Additionally, from a public health
standpoint, social media data and Web data in general have enabled large scale analyses of
a population's health status beyond what has previously been possible with traditional
methods (Ayers et al. 2013). While social media provides ample data for many types of
public health analysis (Paul and Dredze 2011), mental health studies still face serious
challenges. First, other health work in social media, such as disease surveillance
(Brownstein, Freifeld, and Madoff 2009; Chew and Eysenbach 2010; Lamb,
Paul, and Dredze 2013) and modeling (Sadilek, Kautz, and Silenzio 2012), rely on explicit
mentions of illness or health issues; if people are sick, they say so. In contrast, mental health
conditions largely display implicit changes in language and behavior, such as a switch in the
types of topics, a shift in word usage or a shift in frequency of posts. While De Choudhury
et al. (2013) find some examples of explicit depression mentions, the focus is on more subtle
changes in language (e.g., pronoun use). Second, obtaining labeled data for a mental health
condition is challenging since we are examining implicit features of language. De
Choudhury et al. (2013) rely on (crowdsourced) volunteers to take depression surveys and
offer their Twitter feed for research. While this yields reliable data, it is time-consuming
and challenging to build large data sets for a diverse set of mental health conditions.
Furthermore, the necessary mental health evaluations such as the DSM (Diagnostic and
Statistical Manual of Mental Disorders)3 , are difficult to perform as these evaluations
require a trained diagnostician and have been criticized as unscientific and subjective (Insel
2013). Thus, relying on data from crowdsourced volunteers to build datasets of users with
diverse mental health conditions is difficult, and perhaps untenable. We provide an alternate
method for gathering samples that partially ameliorates these problems – ideally to be used
in concert with existing methods. In this paper, we study PTSD in Twitter data, one of the
first studies to consider social media for a mental health condition beyond depression (De
Choudhury, Counts, and Horvitz 2013; De Choudhury et al. 2013; Rosenquist, Fowler, and
Christakis 2010). Rather than rely on traditional PTSD diagnostic tools (Foa 1995) for
finding data, we demonstrate that some PTSD users can be easily and automatically
identified by scanning for tweets expressing explicit diagnoses. While it is natural to be
suspicious of self-identified reporting, we find that self-identifying PTSD users have
demonstrably different language usage patterns from the random users, according to the
Linguistic Inquiry Word Count (LIWC), a psychometrically validated analysis tool
(Pennebaker et al. 2007). We demonstrate elsewhere (Coppersmith, Dredze, and Harman
2014) that data obtained in this way replicates analyses performed via LIWC on the
crowdsourced survey respondents of De Choudhury et al. (2013). We also demonstrate that
users who self-identify are measurably different from random users by learning a classifier
to discriminate between self-identified and random users. We further show how this data can
be used to train a classifier that detects elevated incidences of PTSD in tweets from U.S.
military bases as compared to the general U.S. population, with a further increase around
bases that deployed combat troops overseas. We intend for this initial finding (which is small,
but statistically significant) to be a demonstration of the types of analysis Twitter data
enables for public health. Given the small effect size, replication and further study are called for.

Data. We used an automated analysis to find potential PTSD users, and then refined the
list manually. First, we had access to a large multi-year historical collection from the Twitter
keyword streaming API, where keywords were selected to focus on health topics. We used
a case-insensitive regular expression (\Wptsd\W|\Wp\.t\.s\.d\.\W|post[- ]traumatic[- ]stress[- ]disorder) to search for statements where the user self-identifies as being
diagnosed with PTSD. The 477 matching tweets were manually reviewed to determine if
they indicated a genuine statement of a diagnosis for PTSD. Table 1 shows examples from
the 260 tweets that indicated a PTSD diagnosis. Next, we selected the username that
authored each of these tweets and retrieved up to the 3200 most recent tweets from that user
via the Twitter API. We then filtered out users with fewer than 25 tweets and those whose
tweets were not at least 75% in English (measured using an automated language ID system).
This filtering left us with 244 users as positive examples. We repeated this process for a
group of randomly selected users. We randomly selected 10,000 usernames from a list of
users who posted to our historical collection within a selected two-week window. We then
downloaded all tweets from these users. After filtering (as above) 5728 random users
remain, whose tweets were used as negative examples.

Methods. We use our positive and
negative PTSD data to train three classifiers: one unigram language model (ULM) examining
individual whole words, one character n-gram language model (CLM), and one from the
LIWC categories above. The LMs have been shown effective for Twitter classification tasks
(Bergsma et al. 2012) and LIWC has been previously used for analysis of

mental health in Twitter (De Choudhury et al. 2013). The language models measure the
probability that a word (ULM) or a string of characters (CLM) was generated by the same
underlying process as the training data. Here, one of each language model (clm+ and ulm+)
is trained from the tweets of PTSD users, and a second (clm− and ulm−) from the tweets
from random users. Each test tweet t is scored by comparing probabilities from each LM:

s = lm+(t) / lm−(t)   (1)

A threshold of 1 for s divides scores into positive and negative
classes. In a multi-class setting, the algorithm minimizes the cross entropy, selecting the
model with the highest probability. For each user, we calculate the proportion of tweets
scored positively by each LIWC category. These proportions are used as a feature vector in
a log-linear regression model (Pedregosa et al. 2011). Prior to training, we preprocess the text
of each tweet: we replaced all usernames with a single token (USER), lowercased all text,
and removed extraneous whitespace. We also excluded any tweet that contained a URL, as
these often pertain to events external to the user (e.g., national news stories). In total, we
used 463k PTSD tweets and sampled 463k non-PTSD tweets to create a balanced data set.
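
As an illustration of the preprocessing and the scoring rule in Equation (1), the following is a
minimal sketch, not the authors' actual implementation: it assumes the two unigram models are
given as word-probability dictionaries (lm_pos and lm_neg are hypothetical names) and uses a
small floor probability for unseen words.

import re

def preprocess(tweet):
    # Replace usernames with a single token, lowercase, strip punctuation
    tweet = re.sub(r'@\w+', 'USER', tweet)
    tweet = re.sub(r'[^\w\s]', ' ', tweet.lower())
    return tweet.split()

def score(tweet, lm_pos, lm_neg, floor=1e-6):
    # s = lm+(t) / lm-(t); s > 1 assigns the tweet to the positive class
    s = 1.0
    for w in preprocess(tweet):
        s *= lm_pos.get(w, floor) / lm_neg.get(w, floor)
    return s

# Toy probabilities, purely for illustration
lm_pos = {'nightmares': 0.010, 'anxious': 0.020}
lm_neg = {'nightmares': 0.001, 'anxious': 0.005}
print(score("@vet feeling anxious, nightmares again", lm_pos, lm_neg) > 1)  # True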

2.2 Stress Detection Using Low-Cost Heart Rate Sensors


Stress can also be detected using other, less common markers like accelerometer [15], key stroke
dynamics [16], or blinking [17]. It is also common to use a combination of several markers at
the expense of an increased system cost and user involvement. Fernandes et al. used GSR and
blood pressure (BP) markers [18] for determining stress. Sun et al. describe mental stress
detection using combined data from ECG, GSR, and accelerometer [19]. De Santos Sierra et al.
in [20] used GSR and HR. Rigas et al. used ECG, GSR, and respiration for detecting stress
while driving [21]. Wijsman et al. used ECG, respiration, GSR, and EMG of trapezius muscles
for mental stress detection [22]. Riera et al. combined EEG and EMG markers [23]. Singh and
Queyam used GSR, EMG, respiration, and HR [24] for detecting stress during driving. Pupil
diameter, ECG, and photoplethysmogram were used as markers by Mokhayeri et al [25]. Baltaci
and Gokcay used pupil diameter and temperature features in stress detection [26], while Choi
used HRV, respiration, GSR, EMG, acceleration, and geographical location [27].

New noncontact methods have also been developed recently to measure stress states. Some of
them are hyperspectral imaging technique [28], human voice [29, 30], pupil diameter [31],
visible spectrum camera [32], or using stereo thermal and visible sensors [33].

However, observing several markers for identifying stress requires an increasing number of
input sensors which in turn increases the overall price and lowers applicability. Prices for heart
rate meters range from $70 to $500 USD; GSR devices range from $100 to $500 USD, while
EMG devices have price ranges from $450 USD up to $1750 USD. Systems combining multiple
sensors are priced much higher. For such systems prices fall between $550 USD and $5700
USD, which already can be considered excessive for a mass telemedical lifestyle counseling
application. Therefore, in an ambient assisted living (AAL) system, the number of input sensors
should be kept minimal. In the rest of the paper, we focus on the simplest and most researched
sensor input, that is, the electrical activity of the heart.

As for the reliability of HRV sensors, there are still surprisingly few reviews reported in the
literature to date on the validation of the information content of low cost sensors compared to a
clinically accepted “gold standard” device. Some devices that were tested for validity are the
Sense Wear HR Armband [34], the Smart Health Watch [35], the Actiheart [36, 37], the
Equivital LifeMonitor [38], and the PulseOn [39]; and also the Bioharness multivariable
monitoring device from Zephyr has been tested for validity [40, 41] and reliability [41, 42]. In
all cases, a gold standard device was used simultaneously with the device under test as a method
for validating data. However, the validated devices above are high-end devices with a
considerable price which present an obstacle for the penetration of telemedicine. For example,
the Bioharness device has a price of around $550 USD, whereas the price of low-cost heart rate
meters varies from $70 USD to $100 USD. The lack of reliability tests of low cost devices was
our motivation for our device validation study.

For automated stress detection, several methods have been published which use only HRV. In
2008, Kim et al. collected HRV data from sixty-eight subjects [43]. HRV data were collected
during three different time periods. High stress
decreased HRV features. A maximum classification accuracy of 66.1% was achieved. Melillo
et al. in 2011 used nonlinear features of HRV for real-life stress detection [44]. HRV data were
collected two times, during university examination and after holidays, on 42 students. Most of
HRV features significantly decreased during stress period. Stress detection with classification
accuracy of 90% was reported using two Poincaré plot features and Approximate Entropy. One
year later, using the same data, they designed a classification tree for automatic stress detection
based on LF and pNN50 HRV features with sensitivity of 83.33% [45]. In 2013, Karthikeyan
et al. created stress detection classifiers from ECG signal and HRV features [46]. Vanitha and
Suresh used a hierarchical classifier to classify stress into four levels with a classification
efficiency of 92% [47] in 2014.
CHAPTER-3
EXISTING SYSTEM

Traditional methods for detecting employee stress include surveys and self-reporting, which
can be subjective and time-consuming. Other methods include physiological measures such as
heart rate variability and cortisol levels, which can be invasive and require specialized
equipment. These methods also require significant expertise to interpret the data accurately.
3.1 DISADVANTAGES
• The disadvantages of existing methods are that they can be time-consuming and subjective.
• Surveys and self-reporting methods rely on the employee's willingness and ability to accurately
report their stress levels, which can be influenced by factors such as social desirability bias or
lack of self-awareness.
• Physiological measures such as heart rate variability and cortisol levels can be invasive and
require specialized equipment and expertise to interpret the data accurately.

CHAPTER-4
PROPOSED SYSTEM

In this project, we detect employee stress using machine learning algorithms, namely SVM
and Random Forest. To detect stress, we use a social media dataset of tweets in which
employees share their views; by analysing these views, we can identify whether an employee
is in a relaxed or stressed mood. Analysing these views manually would take a great deal of
human effort, so we apply machine learning algorithms, and experiments with these
algorithms show stress detection accuracy of more than 90%.

4.1 SCOPE

Since this project addresses a social problem in an enormously growing field, its scope is
considerable: it can help society by identifying victims of stress, one of the most commonly
identified disorders among adolescents and working adults. The scope of detecting employee
stress from a Twitter dataset using the Support Vector Machine and Random Forest
algorithms is significant. SVM excels at classifying data by finding the optimal hyperplane
that separates different classes, while Random Forest utilizes an ensemble of decision trees
to make predictions. However, the effectiveness ultimately depends on the quality and
relevance of the input data and the implementation of the algorithms.

4.2 OBJECTIVE

The objective of using Support Vector Machine (SVM) and Random Forest algorithms for the
detection of employee stress is to develop predictive models that can analyse various features
or factors associated with employees and classify them into stressed or non-stressed categories.
These algorithms aim to accurately predict and identify employees who may be experiencing
stress, which can help organizations take proactive measures to address employee well-being.

CHAPTER-5

REQUIREMENT ANALYSIS

5.1 FUNCTIONAL REQUIREMENTS

1. Data Collection

2. Data Preprocessing

3. Training and Testing

4. Modeling

5. Predicting

5.1.1 Data Collection

Initially, we collect a dataset for our stress detection system. After the collection of the dataset,
we split the dataset into training data and testing data. The training dataset is used for prediction-model
learning and the testing data is used for evaluating the prediction model. For this project, 90% of the
data is used for training and 10% of the data is used for testing.
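
A minimal sketch of this split with scikit-learn (assuming the dataset has already been loaded into
a feature array X and a label array Y):

from sklearn.model_selection import train_test_split

# Hold out 10% of the records for testing; a fixed random_state keeps
# the split reproducible between runs
X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size=0.10, random_state=42)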

5.1.2 Data Preprocessing

Data pre-processing is an important step in the creation of a machine learning model. Initially, data may
not be clean or in the required format for the model, which can cause misleading outcomes. In
pre-processing, we transform the data into our required format. It is used to deal with noise, duplicates,
and missing values in the dataset. Data pre-processing includes activities like importing datasets, splitting
datasets, attribute scaling, etc. Pre-processing of data is required for improving the accuracy of the model.
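
A minimal sketch of these pre-processing steps on the tweet text (the column names 'tweet' and
'label' are hypothetical; the cleaning mirrors the steps used in Chapter 7):

import pandas as pd
from nltk.corpus import stopwords

stop_words = set(stopwords.words('english'))

df = pd.read_csv('stress_tweets.csv', encoding='iso-8859-1')
df = df.drop_duplicates().dropna()   # deal with duplicates and missing values

def clean(tweet):
    # Lowercase the tweet and drop stop words and very short tokens
    return " ".join(w for w in str(tweet).lower().split()
                    if len(w) > 2 and w not in stop_words)

df['tweet'] = df['tweet'].apply(clean)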

5.1.3 Training and Testing


Training a machine learning (ML) model is a process in which a machine learning algorithm is fed with
training data from which it can learn. Model training is the primary step in machine learning, resulting
in a working model that can then be validated, tested and deployed. Both the quality of the training data
and the choice of the algorithm are central to the model training phase. In most cases, training data is
split into two sets for training and then validation and testing. The type of training data that we provide
to the model is highly responsible for the model's accuracy and prediction ability. It means that the better
the quality of the training data, the better the performance of the model will be. Our training data is equal
to 90% of the total data.
Once we train the model with the training dataset, it's time to test the model with the test dataset.
This dataset evaluates the performance of the model and ensures that the model can generalize
well with the new or unseen dataset. Test data is a well-organized dataset that contains data for
each type of scenario for a given problem that the model would be facing when used in the real
world. The test dataset is 10% of the total original data for this project.

5.1.4 Modelling
Machine learning models are created by training algorithms with either labeled or unlabeled data, or a
mix of both. The two primary ways used here to train and produce a machine learning model are:

 Supervised learning: Supervised learning occurs when an algorithm is trained using “labelled
data”, or data that is tagged with a label so that an algorithm can successfully learn from it. Training
an algorithm with labelled data helps the eventual machine learning model know how to classify
data in the manner that the researcher desires.
 Unsupervised learning: Unsupervised learning uses unlabeled data to train an algorithm. In this
process, the algorithm finds patterns in the data itself and creates its own data clusters.
Unsupervised learning is helpful for researchers who are looking to find patterns in data that are
currently unknown to them.

5.1.5 Predicting
The trained model makes predictions when given new data. When new input or test
data is given to the trained model, it predicts the stress level of the employee based on
the given input data. The trained model predicts well since a large share of the data
(90%) is used for training. The choice of algorithm also plays a major role in making
the model predict well.
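
A minimal sketch of this prediction step (assuming the fitted tokenizer and trained model
produced in Chapter 7, where sequences are padded to length 83):

from keras.preprocessing.sequence import pad_sequences

new_tweet = "deadline pressure again, could not sleep"
seq = tokenizer.texts_to_sequences([new_tweet])
seq = pad_sequences(seq, maxlen=83, dtype='int32', value=0)  # same length as training
print("Stressed" if model.predict(seq)[0] == 1 else "Not Stressed")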

5.2 NON-FUNCTIONAL REQUIREMENTS


A NON-FUNCTIONAL REQUIREMENT (NFR) specifies a quality attribute of a software
system. NFRs judge the software system based on responsiveness, usability, security,
portability and other non-functional standards that are critical to the success of the software
system. An example of a non-functional requirement is “how fast does the website load?”
Failing to meet non-functional requirements can result in systems that fail to satisfy user
needs. Non-functional requirements allow you to impose constraints or restrictions on the
design of the system across the various agile backlogs, for example, “the site should load
within 3 seconds when the number of simultaneous users is greater than 10,000”. The
description of non-functional requirements is just as critical as that of functional requirements.

EXAMPLES OF NON-FUNCTIONAL REQUIREMENTS

 Users must upload the dataset.

 Privacy of information, the export of restricted technologies, intellectual property

rights, etc. should be audited.

5.3 SOFTWARE REQUIREMENTS

The functional requirements or the overall description documents include the product
perspective and features, operating system and operating environment, graphics
requirements, design constraints and user documentation.
The appropriation of requirements and implementation constraints gives the general
overview of the project in regards to what the areas of strength and deficit are and how to
tackle them.

 Python IDLE 3.7 (or)

 Anaconda 3.7 (or)

 Jupyter (or)

 Google Colab

5.4 HARDWARE REQUIREMENTS

Minimum hardware requirements are very dependent on the particular software being
developed by a given Enthought Python / Canopy / VS Code user. Applications that need to
store large arrays/objects in memory will require more RAM, whereas applications that
need to perform numerous calculations or tasks more quickly will require a faster
processor.
 Operating system: Windows, Linux

 Processor: minimum Intel i3

 RAM: minimum 4 GB

 Hard disk: minimum 250 GB

CHAPTER-6
DESIGN AND METHODOLOGY

6.1 METHODOLOGY

RANDOM FOREST ALGORITHM


Random Forest is a popular machine learning algorithm that belongs to the supervised
learning technique. It can be used for both Classification and Regression problems in ML. It
is based on the concept of ensemble learning, which is a process of combining multiple
classifiers to solve a complex problem and to improve the performance of the model.

As the name suggests, "Random Forest is a classifier that contains a number of decision
trees on various subsets of the given dataset and takes the average to improve the predictive
accuracy of that dataset." Instead of relying on one decision tree, the random forest takes
the prediction from each tree and, based on the majority vote of those predictions, outputs
the final class.

A greater number of trees in the forest leads to higher accuracy and helps prevent
the problem of overfitting.
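
A minimal sketch of this majority-vote behaviour with scikit-learn, on toy data
(n_estimators=20 matches the value used in Chapter 7):

from sklearn.ensemble import RandomForestClassifier

# Toy feature vectors and labels (1 = stressed, 0 = not stressed)
X = [[0, 1], [1, 1], [0, 0], [1, 0]]
y = [1, 1, 0, 0]

# Each tree is trained on a random bootstrap sample of the data; the forest
# predicts the class that receives the majority of the tree votes
rf = RandomForestClassifier(n_estimators=20, random_state=0)
rf.fit(X, y)
print(rf.predict([[1, 1]]))  # -> [1]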

SUPPORT VECTOR MACHINE

Support Vector Machine or SVM is one of the most popular Supervised Learning algorithms,
which is used for Classification as well as Regression problems. However, primarily, it is
used for Classification problems in Machine Learning.

The goal of the SVM algorithm is to create the best line or decision boundary that can
segregate n-dimensional space into classes so that we can easily put the new data point in
the correct category in the future. This best decision boundary is called a hyperplane.

SVM chooses the extreme points/vectors that help in creating the hyperplane. These extreme
cases are called support vectors, and hence the algorithm is termed the Support Vector
Machine. Consider a case in which there are two different categories that are
classified using a decision boundary or hyperplane.
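
A minimal sketch with scikit-learn, on toy two-dimensional data (the RBF kernel and C=2.0
match the settings used in Chapter 7):

from sklearn import svm

# Two linearly separable toy classes
X = [[0, 0], [0, 1], [2, 2], [2, 3]]
y = [0, 0, 1, 1]

clf = svm.SVC(C=2.0, kernel='rbf', gamma='scale')
clf.fit(X, y)
print(clf.support_vectors_)     # the extreme points that define the margin
print(clf.predict([[2, 2.5]]))  # -> [1]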

6.2 SYSTEM DESIGN

UML DIAGRAMS

The System Design Document describes the system requirements, operating environment,


system and subsystem architecture, files and database design, input formats, output layouts,
human-machine interfaces, detailed design, processing logic, and external interfaces.

6.2.1 CLASS DIAGRAM


In software engineering, a class diagram in the Unified Modeling Language (UML) is a type
of static structure diagram that describes the structure of a system by showing the system's
classes, their attributes, operations (or methods), and the relationships among the classes. It
explains which class contains information.

6.2.2 INTERACTION DIAGRAMS

This interactive behavior is represented in UML by two diagrams known as Sequence


diagram and Collaboration diagram. The basic purpose of both diagrams is similar. The
purpose of interaction diagrams is to visualize the interactive behavior of the system.
Visualizing the interaction is a difficult task. Hence, the solution is to use different types of
models to capture the different aspects of the interaction.

6.2.2.1 SEQUENCE DIAGRAM

A sequence diagram in Unified Modeling Language (UML) is a kind of interaction diagram


that shows how processes operate with one another and in what order. It is a construct of a
Message Sequence Chart. Sequence diagrams are sometimes called event diagrams, event
scenarios, and timing diagrams.

6.2.3 USE CASE DIAGRAM


A use case diagram in the Unified Modeling Language (UML) is a type of behavioral
diagram defined by and created from a Use-case analysis. Its purpose is to present a graphical
overview of the functionality provided by a system in terms of actors, their goals (represented
as use cases), and any dependencies between those use cases. The main purpose of a use
case diagram is to show what system functions are performed for which actor. Roles of the
actors in the system can be depicted.

6.2.3.1 CONTROL FLOW DIAGRAM

A control-flow diagram (CFD) is a diagram to describe the control flow of a business


process, process or review. Control diagrams are graphical notations specially designed to
represent event and control flows. Data flow is represented by a solid arrow, whereas control
flow is represented by a dashed or shaded arrow.

CHAPTER-7
IMPLEMENTATION

from tkinter import messagebox
from tkinter import *
from tkinter.filedialog import askopenfilename
from tkinter import simpledialog
import tkinter
import numpy as np
from tkinter import filedialog
import pandas as pd
import os
from sklearn.feature_extraction.text import CountVectorizer
from keras.preprocessing.text import Tokenizer
from keras.preprocessing.sequence import pad_sequences
import re
from sklearn.model_selection import train_test_split
from nltk.corpus import stopwords
from sklearn.metrics import accuracy_score
from sklearn.ensemble import RandomForestClassifier
from sklearn import svm
import matplotlib.pyplot as plt

# English stop words used to clean the tweets
stop_words = set(stopwords.words('english'))

# Main application window
main = tkinter.Tk()
main.title("Detection of Employee Stress Using Machine Learning")
main.geometry("1300x1200")

# Globals shared between the GUI callbacks
global model
global filename
global tokenizer
global X
global Y
global X_train, X_test, Y_train, Y_test
global XX
word_count = 0
global svm_acc, rf_acc

def upload():
    # Ask the user for the tweets CSV file and show the chosen path
    global filename
    filename = filedialog.askopenfilename(initialdir="Tweets")
    pathlabel.config(text=filename)
    textarea.delete('1.0', END)
    textarea.insert(END, 'tweets dataset loaded\n')

def preprocess():
    # Clean every tweet: lowercase, drop stop words and very short tokens
    global X
    global Y
    global word_count
    X = []
    Y = []
    textarea.delete('1.0', END)
    train = pd.read_csv(filename, encoding='iso-8859-1')
    word_count = 0
    words = []
    for i in range(len(train)):
        label = train.iloc[i, 2]   # iloc replaces the deprecated get_value()
        tweet = train.iloc[i, 1]
        tweet = tweet.lower()
        arr = tweet.split(" ")
        msg = ''
        for k in range(len(arr)):
            word = arr[k].strip()
            if len(word) > 2 and word not in stop_words:
                msg += word + " "
                if word not in words:
                    words.append(word)
        text = msg.strip()
        X.append(text)
        Y.append(int(label))
    X = np.asarray(X)
    Y = np.asarray(Y)
    word_count = len(words)
    textarea.insert(END, 'Total tweets found in dataset : ' + str(len(X)) + "\n")
    textarea.insert(END, 'Total words found in all tweets : ' + str(len(words)) + "\n\n")
    featureExtraction()

def featureExtraction():
    # Convert the cleaned tweets to padded integer sequences and split them
    global X
    global Y
    global XX
    global tokenizer
    global X_train, X_test, Y_train, Y_test
    max_features = word_count
    tokenizer = Tokenizer(num_words=max_features, split=' ')
    tokenizer.fit_on_texts(X)
    XX = tokenizer.texts_to_sequences(X)
    XX = pad_sequences(XX)
    # Shuffle records before splitting into train and test sets
    indices = np.arange(XX.shape[0])
    np.random.shuffle(indices)
    XX = XX[indices]
    Y = Y[indices]
    X_train, X_test, Y_train, Y_test = train_test_split(XX, Y, test_size=0.13, random_state=42)
    textarea.insert(END, 'Total features extracted from tweets are : ' + str(X_train.shape[1]) + "\n")
    textarea.insert(END, 'Total splitted records used for training : ' + str(len(X_train)) + "\n")
    textarea.insert(END, 'Total splitted records used for testing : ' + str(len(X_test)) + "\n")

def SVM():
    # Train an RBF-kernel SVM and report its accuracy on the test set
    textarea.delete('1.0', END)
    global svm_acc
    clf = svm.SVC(C=2.0, gamma='scale', kernel='rbf', random_state=2)
    clf.fit(X_train, Y_train)
    textarea.insert(END, "SVM Prediction Results\n")
    prediction_data = clf.predict(X_test)
    svm_acc = accuracy_score(Y_test, prediction_data) * 100
    textarea.insert(END, "SVM Accuracy : " + str(svm_acc) + "\n\n")

def RandomForest():
    # Train a Random Forest and keep it as the model used for prediction
    global rf_acc
    global model
    rfc = RandomForestClassifier(n_estimators=20, random_state=0)
    rfc.fit(X_train, Y_train)
    textarea.insert(END, "Random Forest Prediction Results\n")
    prediction_data = rfc.predict(X_test)
    rf_acc = accuracy_score(Y_test, prediction_data) * 100
    textarea.insert(END, "Random Forest Accuracy : " + str(rf_acc) + "\n")
    model = rfc
def predict():
    # Classify each tweet in a user-chosen test file as Stressed / Not Stressed
    textarea.delete('1.0', END)
    testfile = filedialog.askopenfilename(initialdir="Tweets")
    test = pd.read_csv(testfile, encoding='iso-8859-1')
    for i in range(len(test)):
        tweet = test.iloc[i, 0]
        arr = tweet.split(" ")
        msg = ''
        for j in range(len(arr)):
            word = arr[j].strip()
            if len(word) > 2 and word not in stop_words:
                msg += word + " "
        text = msg.strip()
        mytext = [text]
        # Encode the tweet exactly as the training data was encoded
        twts = tokenizer.texts_to_sequences(mytext)
        twts = pad_sequences(twts, maxlen=83, dtype='int32', value=0)
        stress = model.predict(twts)[0]
        if stress == 0:
            textarea.insert(END, text + ' === Prediction Result : Not Stressed\n\n')
        if stress == 1:
            textarea.insert(END, text + ' === Prediction Result : Stressed\n\n')

def graph():
    # Bar chart comparing SVM and Random Forest accuracy
    height = [svm_acc, rf_acc]
    bars = ('SVM ACC', 'Random Forest ACC')
    y_pos = np.arange(len(bars))
    plt.bar(y_pos, height)
    plt.xticks(y_pos, bars)
    plt.show()

font = ('times', 16, 'bold')

title = Label(main, text='Detection of Employee Stress Using Machine Learning')
title.config(bg='yellow green', fg='saddle brown')
title.config(font=font)
title.config(height=3, width=120)
title.place(x=0, y=5)

font1 = ('times', 14, 'bold')

# Renamed from 'upload' so the button does not shadow the upload() function
uploadButton = Button(main, text="Upload Tweets Dataset", command=upload)
uploadButton.place(x=780, y=100)
uploadButton.config(font=font1)

pathlabel = Label(main)
pathlabel.config(bg='royal blue', fg='rosy brown')
pathlabel.config(font=font1)
pathlabel.place(x=780, y=150)

preprocessButton = Button(main, text="Data Preprocessing & Features Extraction", command=preprocess)
preprocessButton.place(x=780, y=200)
preprocessButton.config(font=font1)

svmButton = Button(main, text="Run SVM Algorithm", command=SVM)
svmButton.place(x=780, y=250)
svmButton.config(font=font1)

rfButton = Button(main, text="Run Random Forest Algorithm", command=RandomForest)
rfButton.place(x=780, y=300)
rfButton.config(font=font1)

classifyButton = Button(main, text="Predict Stress", command=predict)
classifyButton.place(x=780, y=350)
classifyButton.config(font=font1)

modelButton = Button(main, text="Accuracy Graph", command=graph)
modelButton.place(x=780, y=400)
modelButton.config(font=font1)

font1 = ('times', 12, 'bold')

textarea = Text(main, height=30, width=90)
scroll = Scrollbar(textarea)
textarea.configure(yscrollcommand=scroll.set)
textarea.place(x=10, y=100)
textarea.config(font=font1)

main.config(bg='cadet blue')
main.mainloop()

CHAPTER-8
TESTING

Testing is the process of executing a program with the aim of finding errors. To make our
software perform well it should be error-free. Successful testing uncovers most of the errors
in the software.

8.1 TYPES OF TESTING

 White Box Testing

 Black Box Testing

 Unit testing

 Integration Testing

 Alpha Testing

 Beta Testing

 Performance Testing and so on

8.1.1 White Box Testing


A testing technique based on knowledge of the internal logic of an application's code; it
includes tests like coverage of code statements, branches, paths, and conditions. It is
performed by software developers.

8.1.2 Performance Testing


Functional testing conducted to evaluate the compliance of a system or component with
specified performance requirements. It is usually conducted by the performance engineer.

8.1.3 Black Box Testing


Black box testing is testing the functionality of an application without knowing the details of
its implementation, including internal program structure, data structures, etc. Test cases for
black box testing are created based on the requirement specifications. Therefore, it is also
called specification-based testing. The figure below represents black box testing:

Fig.: Black Box Testing


When applied to machine learning models, black box testing means testing machine
learning models without knowing internal details such as the features of the machine learning
model, the algorithm used to create the model, etc. The challenge, however, is to verify the test
outcome against the expected values that are known beforehand.

Fig.: Black Box Testing for Machine Learning algorithms

Input                                      Actual Output    Predicted Output

[16,6,324,0,0,0,22,0,0,0,0,0,0]            0                0

[16,7,263,7,0,2,700,9,10,1153,832,9,2]     1                1
The above figure represents the black box testing procedure for machine learning algorithms.
The model gives the correct output when the different inputs mentioned in the table are given.
Therefore, the program is said to execute as expected.
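
In code, such a black-box check compares the model's output against the expected output for
each recorded input without inspecting the model's internals. A minimal sketch (assuming a
trained classifier named model that accepts the 13-feature vectors shown in the table):

# Black-box test cases: (input feature vector, expected output)
test_cases = [
    ([16, 6, 324, 0, 0, 0, 22, 0, 0, 0, 0, 0, 0], 0),
    ([16, 7, 263, 7, 0, 2, 700, 9, 10, 1153, 832, 9, 2], 1),
]

for features, expected in test_cases:
    predicted = model.predict([features])[0]
    assert predicted == expected, f"expected {expected}, got {predicted}"
print("All black-box test cases passed")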

CHAPTER-9
RESULTS AND DISCUSSION

In the above screen, click on the ‘Upload Tweets Dataset’ button to load the dataset.

In the above screen, select the ‘stress_tweets.csv’ dataset and then click on the ‘Open’ button
to load the dataset and reach the screen below. The application window provides the
following buttons:
 Upload Tweets Dataset
 Data Preprocessing and Features Extraction
 Run Support Vector Machine
 Run Random Forest
 Predict Stress
 Accuracy Graph

In the above screen, click on the ‘Data Preprocessing & Features Extraction’ button to read

the dataset, clean it and extract features such as words, and to find the total number of records

in the dataset, the total number of words, and how many records the application uses for training and testing.

In the above screen, the dataset contains 10,314 tweets in total, the tweets contain 30,790

words altogether, and each tweet is encoded as a vector of 83 features; the application uses

8,973 records for training and 1,341 for testing. Now both train and test data are ready, so

click on the ‘Run SVM Algorithm’ button to train on the data using the SVM machine

learning algorithm.

In the above screen, SVM achieved 89.85% prediction accuracy on the test data. Now click

on the ‘Run Random Forest Algorithm’ button to calculate its accuracy.

In the above screen, Random Forest achieved 97.31% prediction accuracy. Now click on the

‘Predict Stress’ button and upload a test file containing tweets; by analysing those tweets, the

machine learning algorithm will predict whether each tweet contains any stress indicators or

not. Below are the screenshots of the test tweets uploaded in the next screen.

In the above screen, upload the ‘test’ file and then click on the ‘Open’ button to predict stress.

In the above screen, beside each tweet we can see the predicted result as Stressed or Not

Stressed. From the above screen we can see that the application detects stress successfully

from the messages. Now click on the ‘Accuracy Graph’ button to get the comparison graph below.

In the above graph, the x-axis represents the algorithm name and the y-axis represents the accuracy
of those algorithms; from the graph we can say that Random Forest performs better than the
Support Vector Machine.

CHAPTER-10
CONCLUSION

Gender, a family history of mental illness, and whether an employer provides mental health
benefits to its employees were found to have more significance than the other factors in
determining whether an employee may develop mental health related issues. From our study, we
found that people working in tech companies are at greater risk of developing stress, even when
their job role is not itself technical. These insights could be used by business companies to make
more suitable HR strategies for their working employees. An accuracy of 75% shows that the
application of two machine learning techniques (i.e., SVM and Random Forest) for predicting
stress and mental health conditions provides worthy results and could be explored further, and
thus the aim of this project is met.

CHAPTER-11

REFERENCES

[1] Detecting and characterizing Mental Health Related Self-Disclosure in social media.
Sairam Balani and Munmun De Choudhury. 2015. In Proceedings of the 33rd Annual ACM
Conference Extended Abstracts on Human Factors in Computing Systems – CHI EA '15, pages
1373–1378.
[2] Measuring Post Traumatic Stress Disorder in Twitter. Glen Coppersmith, Mark Dredze, and
Craig Harman. 2014.
[3] Role of social media in Tackling Challenges in Mental Health. Munmun De Choudhury. 2013.
[4] Bhattacharyya, R., & Basu, S. (2018). India Inc looks to deal with rising stress in employees.
Retrieved from 'The Economic Times'.
[5] Pedregosa, F., Varoquaux, G., Gramfort, A., Michel, V., Thirion, B., Grisel, O., & Vanderplas,
J. (2011). Scikit-learn: Machine learning in Python. Journal of Machine Learning Research, 12(Oct),
2825-2
[6] OSMI Mental Health in Tech Survey Dataset, 2017 from Kaggle.

[7] Van den Broeck, J., Cunningham, S. A., Eeckels, R., & Herbst, K. (2005). Data cleaning:
detecting, diagnosing, and editing data abnormalities. PLoS Medicine, 2(10), e267.
[8] Relationship between Job Stress and Self-Rated Health among Japanese Full Time Occupational
Physicians. Takashi Shimizu and Shoji Nagata. Academic Papers in Japanese, 2007.
[9] Tomar, D., & Agarwal, S. (2013). A survey on Data Mining approaches for healthcare.
International Journal of Bio-Science and Bio-Technology, 5(5), 241-266.
[10] Gender and Stress. (n.d.). Retrieved from APA press release 2010

[11] Julie Aitken Harris, Robert Slatestone and Maryann Fraboni (2000). “An Evaluation of the Job
Stress Questionnaire with a Sample of Entrepreneurs”.

[12] “Demographic and Workplace Characteristics which add to the Prediction of Stress and Job
Satisfaction within the Police Workplace”, Jeremy D. Davey, Patricia L. Obst, and Mary C.
Sheehan. 2015 IEEE 14th International Conference on Cognitive Informatics & Cognitive
Computing (ICCICC), 2015.

[13] Mario Salai, István Vassányi, and István Kósa, “Stress Detection Using Low-Cost Heart
Rate Sensors”, Journal of Healthcare Engineering, pp. 1–13, Hindawi Publishing Corporation, 2016.
[14] Shwetha, S., Sahil, A., Anant Kumar J. (2017). Predictive analysis using classification
techniques in healthcare domain. International Journal of Linguistics & Computing Research,
ISSN: 2456-8848, Vol. I, Issue I, June 2017.
[15] O. M. Mozos et al., “Stress detection using wearable physiological and sociometric sensors”,
International Journal of Neural Systems, vol. 27, issue 2, 2017.
