Video Exercise Recognition with GCN

Abstract—Researchers frequently employ artificial intelligence models to estimate human pose. Pose recognition can be used to monitor and supervise personal workout sessions. Unsupervised self-rehabilitation exercises and improper physical training can result in significant injury if the procedure is not followed correctly. The majority of prior work on exercise classification depends on sensors that are either worn or externally attached. However, these sensors are frequently unable to discern between similar workouts. Additionally, they are rarely ideal due to a variety of other issues, including connectivity, defective sensors, excessive sweating, expense, and improper alignment. This study classifies each person's workout before predicting whether or not the posture that corresponds to each exercise is correct. The person's pose sequence and ten novel features are used to train a Graph Convolutional Network (GCN) architecture to capture the relationship between the 2D coordinates of the body joints, their angles with key limbs, and a specific workout. We introduce a dataset with two different physical exercises (squats and lunges) to assess our methodology. Our method achieves recognition accuracies of 94.44% for correct squat postures and 98.65% for correct lunge postures.

Keywords—Exercise Recognition, Human Activity Recognition, Graph Convolution Networks, Human Computer Interaction

I. INTRODUCTION

Artificial intelligence techniques are widely used in the field of computer vision [1-5], and human activity recognition is one of its vital aspects. Researchers and many stakeholders have expressed a strong interest in comprehending human activity in images and videos. Consequently, human posture estimation [1, 2], a branch of computer vision, has attracted considerable attention from academics. It is utilised in a variety of applications, including gaming, photos, videos, animations [6, 7], and action recognition [8-13]. The goal of human pose estimation is to make the machine learn from the data: by categorising postures, it can determine whether a subject is moving, sitting, sleeping, or engaging in any other activity. Exercise recognition is an extension of human pose estimation [14]. Exercise is becoming a crucial component of living a healthy life, and many people appreciate being able to exercise without the guidance of a physical trainer, especially in the wake of the COVID-19 pandemic. However, ineffective supervision and guidance can have the opposite effect and, under extreme circumstances, result in catastrophic injuries such as muscle twists, ligament tears, back pain, and even bone fractures.

As a result, practitioners could benefit from a computer-assisted tool that can identify the proper exercise postures prior to training.

Wearable bands and electronic sensors are used in a number of research studies proposed so far to categorise exercises [14-16]. However, even though different exercises cause distinct muscle firing, such devices [15, 16] are not always precise in discriminating between activities. Additionally, wearable bands are rarely ideal due to a variety of other issues, including connectivity, defective sensors, excessive sweating, expense, and improper alignment.

Fig. 1. Proposed model architecture: AERN (Angle-based Exercise Recognition Network) model.

This paper provides a video-based solution to the problem at hand, using a model known as the "angle-based exercise recognition network" (AERN) to classify each individual's workout before predicting whether or not the posture that matches the exercise is appropriate. The only equipment required is a camera, which is readily available in low-cost electronic devices such as PCs and mobile phones.

A dataset of four hundred exercise video clips sourced from YouTube was developed for squats and lunges, and the video-based exercise recognition and posture estimation model AERN is proposed. To accomplish this goal, the study makes the following contributions:

• A dataset of 400 YouTube video clips is created for the squat and lunge exercises.

• AERN (Angle-based Exercise Recognition Network), a model for video-based exercise recognition and posture estimation, is proposed.

• Ten critical limbs are identified whose relationship with each joint at any point in time plays a crucial role in any form of exercise.

• Novel features based on the angle made by each joint with these critical limbs are designed to capture their relationship with the correctness of any exercise.

The remainder of the paper is structured as follows. Section 2 discusses research on human pose detection and exercise classification. Section 3 describes the proposed model and its components. Section 4 presents the experimental results, the network architecture and implementation, and a discussion of the results. Section 5 concludes the work.
II. LITERATURE SURVEY

A. Human Pose Detection
A deep neural network was utilised by Toshev [17] to resolve the regression problem of estimating human pose. A "spatial-temporal And-Or graph model" for action identification and posture estimation was introduced by Nie [18]. Their model primarily breaks down the action into poses, and the poses are then further broken down into various components and sub-components, yielding an accuracy of 85.50%. An approach for real-time multi-person human posture estimation using a deep neural network was put forth by Linna [19]. Their model is divided into two stages, pre-training and fine-tuning, and their model postures are constrained. Highly diversified input data are provided during pre-training, whereas domain-specific data are fed during fine-tuning. MoVNect is used by Hwang [20] to collect 3D joints using a single RGB camera. The 3D pose can be estimated by MoVNect from an RGB picture. MoVNect estimates 15 joints, and their model presumes low-power devices (e.g., smartphones). Real-time CNN output processing produces temporally reliable 3D skeletal data. They utilise the Human3.6M dataset and attain a 97.30% accuracy.
B. Exercise Recognition
With the aid of just one accelerometer, Koskimäki [15] demonstrated a case study based on 36 separate gym workouts that were recognised with an accuracy of 96%. The findings of this work were important for subsequent studies since they helped them reliably identify gym activities in the presence of diverse data. MyHealthAssistant [21] employs a smartphone as an aggregator in conjunction with embedded wearable sensors to recognise and count gym workouts. They employed a Bayesian classifier to attain an accuracy of 92% when identifying gym exercises using three accelerometers. The approach offered by Fine-Fit addresses the limitation of prior work in discriminating workouts that execute the same action but differ in the muscle being developed. For the simultaneous detection of both body movement and the associated muscle vibration, they employ a single source sensor (accelerometer), achieving an average accuracy of 91% [14]. To record numerous stimuli from sound, movement, and light, the authors of [22] utilise a device that has eight separate sensors inside it. They classified diverse activities using HMMs and achieved a 92% accuracy rate.
III. PROPOSED MODEL: AERN

In an end-to-end model for skeleton-based human activity recognition named "spatial temporal graph convolutional networks" (ST-GCN) [23], each joint is considered a node in a graph. This network exploits the connectivity of joints and uses motion features. However, it fails to sufficiently encode the relationship between the spatial positions of joints and an exercise posture. For the correctness of any exercise posture, the angle made by each joint with certain critical limbs plays a very crucial role.

In response, this paper proposes an "angle-based exercise recognition network" (AERN) model, shown in Fig. 1, to more precisely characterise the motion of certain limbs relative to different joints and thus help the network encode more semantic correlations for improved exercise recognition.

From YouTube, 400 fitness video clips with both appropriate and inappropriate workout postures were gathered for two exercises, squats and lunges. The RGB frames are processed with the OpenPose [24] library, as shown in Fig. 2. Using OpenPose, the model first determines the 2D coordinates of all 25 joints that make up the human skeleton in order to simulate the actions of a human instructor. Novel features based on the angles between critical limbs and each skeletal joint are designed, which determine the correctness of a posture. Ten critical limbs are identified for this purpose based on intuition; their relative position and angle with each joint play a crucial role in any exercise. These 10 novel features, along with the 2D coordinates of each joint of the human skeleton in a single frame, are passed to 8 layers of ST-GCN blocks, where each block comprises a spatial graph convolution and a temporal graph convolution. This network exploits the connectivity of joints spatially and uses motion features temporally to extract useful features. These are then passed to a fully connected layer to recognise the correctness of an exercise.

The major components of the model are:

• 400 fitness video clips are collected from YouTube to create a dataset for two exercises, squats and lunges.

• These videos are processed with the OpenPose [24] library to obtain the 2D coordinates of the 25 joints of a human skeleton.

• 10 critical limbs, whose relative positions play a crucial role in any kind of exercise posture, are identified in the human skeleton through intuition.

• Novel features based on the angle made by each joint with these critical limbs are designed to encode their relationship with the correctness of any exercise form.

• These features, along with the 2D spatial features, are passed to 8 layers of ST-GCN blocks to capture these relationships.

• The extracted features are then passed to a fully connected layer to classify the exercise and its correctness.

Each of these components is discussed in detail below.

A. Dataset Creation
From YouTube, 400 fitness videos with both appropriate and inappropriate postures were gathered for two exercises, squats and lunges. The videos were captured from various camera perspectives, and the subjects' heights differ greatly from one another. The instruction and action labels were assigned manually. Each RGB frame was processed using the OpenPose [24] package to obtain 2D coordinates for the 25 recognised skeletal joints. In a skeleton sequence, each body joint is represented by a joint id, a frame id, and its spatial coordinates in the form of well-structured data. A database of the 2D coordinates of the 25 skeletal joints in each frame and the action taking place in the video is created from the output JSON files. The nose is used as the reference point to normalise the 2D locations across all frames to the origin, as shown in the sketch below.

This dataset includes two different sorts of exercises, each performed repeatedly by a number of people, either correctly or incorrectly. The dataset contains a total of 177 squat and 223 lunge action samples, as summarised in Table I.
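To make this preprocessing concrete, the following is a minimal sketch of the parsing and normalisation step, assuming OpenPose's BODY_25 output layout (one JSON file per frame, with each person's pose_keypoints_2d flattened as x, y, confidence triplets); the function name and the single-person assumption are illustrative, not taken from the paper's implementation.

import json
import numpy as np

NOSE = 0  # joint 0 is the nose in OpenPose's BODY_25 layout

def load_skeleton_sequence(json_paths):
    """Parse per-frame OpenPose JSON files into a (frames, 25, 2) array
    of 2D joint coordinates, normalised so the nose sits at the origin."""
    frames = []
    for path in json_paths:
        with open(path) as f:
            data = json.load(f)
        if not data["people"]:                 # no detection in this frame
            continue
        kp = np.array(data["people"][0]["pose_keypoints_2d"]).reshape(25, 3)
        xy = kp[:, :2]                         # drop the confidence column
        xy = xy - xy[NOSE]                     # nose as reference point -> origin
        frames.append(xy)
    return np.stack(frames)                    # shape: (num_frames, 25, 2)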
Table I: Composition of the created dataset.

Exercise   Instruction label   Total samples (per instruction)   Total samples (per action)
Squats     Correct             70                                177
           Incorrect           107
Lunges     Correct             158                               223
           Incorrect           65
B. Body Keypoint Estimation
The RGB frames are processed with the OpenPose [24] library, as shown in Fig. 2, to generate whole-body key points for each detected person. The key points obtained include information from the face, torso, arms, hands, legs, and feet. Each body joint in a skeleton sequence is represented by a joint id, a frame id, and its spatial coordinates in the form of well-structured data. Because of advances in accurate posture estimation, sequences of human skeletons can now be generated rapidly [1]. Skeleton sequences have the advantage of communicating bodily movement in a condensed manner, which makes them more robust than RGB video to complex conditions such as viewpoint fluctuation, occlusion, and self-occlusion. A skeleton sequence is a simplification of the positions and motions of the human body at a higher level of abstraction. In addition, human beings are capable of classifying actions merely from the joint motions they observe [25].

Using the OpenPose [24] library, the model first determines the 2D coordinates of all 25 joints that make up the human skeleton in order to simulate the actions of a human instructor. Because of their minimal number of dimensions, these coordinates require very little computational effort. Following this, a novel method for feature enrichment, discussed in further detail below, is applied to these 2D coordinates.
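As a data structure, the well-structured records described above amount to flattening each clip into (frame id, joint id, x, y) tuples; a small illustrative helper (not from the paper), assuming the array produced by the parsing sketch in the dataset-creation subsection:

def to_records(seq_xy):
    """Flatten a (frames, 25, 2) skeleton sequence into
    (frame_id, joint_id, x, y) records."""
    return [(t, j, float(x), float(y))
            for t, frame in enumerate(seq_xy)
            for j, (x, y) in enumerate(frame)]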
C. Proposed Feature Enrichment
In order to detect whether or not a posture is correct, a novel method for feature enrichment is applied to the 2D coordinates of each joint. Novel features that depend on the angles between critical limbs and each skeletal joint have been devised. For this purpose, ten critical limbs have been selected on the basis of intuition; their relative position and angle with respect to each joint have a significant impact on the effectiveness of any workout. These include:

1. Neck (connecting joints 0 and 1)
2. Right shoulder (connecting joints 1 and 2)
3. Left shoulder (connecting joints 1 and 5)
4. Spine (connecting joints 1 and 8)
5. Right hip (connecting joints 8 and 9)
6. Left hip (connecting joints 8 and 12)
7. Right thigh (connecting joints 9 and 10)
8. Left thigh (connecting joints 12 and 13)
9. Right leg (connecting joints 10 and 11)
10. Left leg (connecting joints 13 and 14)

The critical limbs listed above are used to create angles (θ1, θ2, …, θ10), and ten new features are then added to each joint using those angles. A key limb (such as the neck, which joins joints 0 and 1) and any joint, such as the shoulder joint (3), combine to generate the angle θ1(0, 1, 3) for joint 3. Formally,

θ = atan2(y3 − y1, x3 − x1) − atan2(y2 − y1, x2 − x1)    (1)

where θ is the angle formed by a critical limb with a given joint, and (x1, y1), (x2, y2) and (x3, y3) are the 2D coordinates of the two joints connecting the limb (e.g., 0 and 1 here) and of the joint in question (e.g., 3), respectively.
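A minimal sketch of this feature computation is given below, using the BODY_25 joint pairs listed above. The paper's single worked example, θ1(0, 1, 3), leaves the vertex choice implicit, so the second joint of each limb pair is assumed to act as the vertex; the LIMBS table and function name are illustrative.

import numpy as np

# The ten critical limbs as (joint_a, joint_b) pairs in BODY_25 indexing.
# Following the worked example theta1(0, 1, 3), joint_b is assumed to be
# the vertex of the angle.
LIMBS = [(0, 1), (1, 2), (1, 5), (1, 8), (8, 9),
         (8, 12), (9, 10), (12, 13), (10, 11), (13, 14)]

def angle_features(xy):
    """xy: (25, 2) joint coordinates for one frame.
    Returns a (25, 10) array: eq. (1) evaluated for every joint
    against every critical limb."""
    feats = np.zeros((25, 10))
    for j in range(25):
        for k, (a, b) in enumerate(LIMBS):
            v_joint = xy[j] - xy[b]            # vertex -> query joint
            v_limb = xy[a] - xy[b]             # vertex -> limb's other end
            feats[j, k] = (np.arctan2(v_joint[1], v_joint[0])
                           - np.arctan2(v_limb[1], v_limb[0]))
    return feats

For instance, angle_features(xy)[3, 0] evaluates θ1(0, 1, 3) from the worked example above.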
D. ST-GCN
"Spatial temporal graph convolutional networks" (ST-GCN) [23] use graph convolutional networks for skeleton-based human activity recognition. Each joint is considered a node in the graph, and the network exploits the connectivity of joints and uses motion features. Each ST-GCN [23] block consists of a spatial graph convolution layer and a temporal graph convolution layer, as shown in Fig. 3.

Spatial graph convolution in ST-GCN [23] uses the basic human skeleton graph, in which the adjacent nodes of a joint are its spatial neighbours in the skeleton; intra-frame convolution is performed using the spatial adjacency matrix. Temporal graph convolution uses a temporal graph: the adjacent nodes of a joint in frame t are the same joint in the previous frame (t−1) and the next frame (t+1), and inter-frame convolution is performed using the temporal adjacency matrix. The proposed model uses 8 layers of ST-GCN blocks.
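A PyTorch-style sketch of one such block is shown below, assuming a normalised spatial adjacency matrix A over the 25 joints and input tensors of shape (batch, channels, frames, joints). This is a simplification for illustration: the published ST-GCN additionally uses partitioned adjacency matrices, a learnable edge-importance mask, and residual connections, and the layer names here are not taken from the paper.

import torch
import torch.nn as nn

class STGCNBlock(nn.Module):
    """One spatial-temporal block: graph convolution over the skeleton
    graph, followed by temporal convolution along the frame axis."""
    def __init__(self, in_ch, out_ch, A, kernel_t=9, stride=1):
        super().__init__()
        self.register_buffer("A", A)             # (25, 25) normalised adjacency
        self.spatial = nn.Conv2d(in_ch, out_ch, kernel_size=1)
        pad = (kernel_t - 1) // 2
        self.temporal = nn.Conv2d(out_ch, out_ch, kernel_size=(kernel_t, 1),
                                  stride=(stride, 1), padding=(pad, 0))
        self.bn = nn.BatchNorm2d(out_ch)
        self.relu = nn.ReLU()

    def forward(self, x):                        # x: (N, C, T, V)
        x = self.spatial(x)                      # mix channels at each node
        x = torch.einsum("nctv,vw->nctw", x, self.A)  # aggregate skeleton neighbours
        x = self.temporal(x)                     # mix the same joint across frames
        return self.relu(self.bn(x))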
Fig. 3. Proposed model architecture: AERN (Angle-based Exercise Recognition Network) model.
IV. RESULTS

The performance of the proposed work was evaluated on the created dataset, with 70% of the samples used for training and 30% for testing. Table II compares the performance with competing methods and clearly displays the advantages of the proposed model.

The proposed model achieves 94.44% recognition accuracy for squats and 98.65% for lunges, improving recognition accuracy by 4.44% for the correct squat exercise and by a substantial 32% for the lunge exercise. These results demonstrate the effectiveness of AERN's classification abilities.
Table II: Recognition accuracy of correct exercise postures.

Model              Squat     Lunge     Plank
3D Pose [26]       90%       66.7%     85.7%
Video Based [27]   -         -         97.01%
Proposed Model     94.44%    98.65%    -
A. Network Architecture and Implementation Details
Throughout the experiments, a three-dimensional input of size (12, 90, 25) is provided to the model for each video sample: the number of channels is 12, the number of frames is 90, and the number of nodes is 25. The novel feature-enrichment technique described in Section III-C is used to extract these 12 features for each joint. To normalise the data, the input skeletons are first passed to a batch normalisation layer. The AERN model contains a total of eight layers of spatial temporal graph convolution operators (ST-GCN blocks). The first four layers have 64 output channels, the next three layers have 128 output channels, and the final layer has 256 output channels. The size of the spatial kernels in each layer is set to 25 and the size of the temporal kernels to 21. To avoid overfitting, a random dropout of features with a probability of 0.1 is applied after each ST-GCN unit. The fifth and eighth temporal layers act as pooling layers, with their strides set to 2.

A global pooling operation is then performed on the resulting tensor to produce a 256-dimensional feature representation for each sequence, which is passed to a final SoftMax classifier. The models are trained using stochastic gradient descent with a learning rate of 0.1, decreased by 0.01% every 10 epochs. Cross-entropy loss is used, and the model is trained for a total of 120 epochs on a single Tesla T4 GPU. The recognition rate is computed as the number of correctly classified samples divided by the total number of samples, multiplied by 100.
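As a rough illustration of these hyperparameters, a model-and-training skeleton is sketched below, reusing the STGCNBlock from Section III-D. The 12 input channels (2 coordinates plus 10 angle features), channel schedule, strides, dropout, optimiser, and learning rate follow the text; the class count, the step-decay interpretation of "0.01% every 10 epochs", and the identity adjacency placeholder are assumptions.

import torch
import torch.nn as nn

# Channel schedule from the paper: four blocks of 64, three of 128, one of 256,
# with temporal stride 2 in the fifth and eighth blocks.
CHANNELS = [(12, 64, 1), (64, 64, 1), (64, 64, 1), (64, 64, 1),
            (64, 128, 2), (128, 128, 1), (128, 128, 1), (128, 256, 2)]

class AERN(nn.Module):
    def __init__(self, A, num_classes=4):    # e.g. {squat, lunge} x {correct, incorrect} (assumed)
        super().__init__()
        self.bn = nn.BatchNorm1d(12 * 25)     # input batch normalisation
        self.blocks = nn.ModuleList(
            [STGCNBlock(c_in, c_out, A, stride=s) for c_in, c_out, s in CHANNELS])
        self.drop = nn.Dropout(0.1)           # dropout after each ST-GCN unit
        self.fc = nn.Linear(256, num_classes)

    def forward(self, x):                     # x: (N, 12, 90, 25)
        n, c, t, v = x.shape
        x = self.bn(x.permute(0, 1, 3, 2).reshape(n, c * v, t)) \
                .reshape(n, c, v, t).permute(0, 1, 3, 2)
        for blk in self.blocks:
            x = self.drop(blk(x))
        x = x.mean(dim=(2, 3))                # global pooling -> 256-d per sequence
        return self.fc(x)                     # logits for the SoftMax classifier

# SGD at lr 0.1, cross-entropy loss, 120 epochs; the decay rule in the text is
# ambiguous, so a step schedule is assumed here.
model = AERN(A=torch.eye(25))                 # identity adjacency as a placeholder
opt = torch.optim.SGD(model.parameters(), lr=0.1)
sched = torch.optim.lr_scheduler.StepLR(opt, step_size=10, gamma=0.1)
loss_fn = nn.CrossEntropyLoss()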
B. Discussion
The proposed AERN model increased recognition accuracy by 4.44% for the correct squat exercise and by a substantial 32% for the lunge exercise, demonstrating the effectiveness of its classification abilities. The model makes use of, and learns from, the critical elements that are essential for all forms of exercise. The allowed range of angles that these critical limbs can form with each joint for a particular activity does not change with the body type or size of the individual performing the exercise. Over any period of time and for a variety of activities, the model learns the allowable ranges of angles that each joint must make with the critical limbs, together with the allowable set of normalised 2D coordinates. On the basis of these learnings, it then estimates its recognition probability. In this manner the model attempts to mimic the actions of a human instructor. In addition, the dataset is collected from YouTube and contains real-world data, making the model applicable to real-world environments, as opposed to 3D Pose [26], which is trained on data recorded in a controlled environment.

V. CONCLUSION

This work addresses the problem of video-based exercise recognition, which is the need of the hour for people following unsupervised exercise regimes, especially after the COVID-19 pandemic. A dataset of 400 video clips was created for squats and lunges, and skeleton-based techniques, which have gained popularity in action recognition, were adopted.

This work proposed an "angle-based exercise recognition network" (AERN) model to classify each person's workout before predicting whether or not the posture corresponding to each exercise is correct. Ten critical limbs are identified in the human skeleton whose relative position with each joint plays a crucial role in any kind of exercise posture. To extract more semantic associations, ten novel features are designed for each skeletal joint using the angles formed by the joints with these critical limbs. These angle-based features encode the relationship with the correctness of any exercise, which is captured by AERN. The proposed model achieves 94.44% recognition accuracy for squats and 98.65% for lunges.
The model tries to mimic a human instructor by learning the allowable ranges of angles that each joint must make with the critical limbs, and the set of normalised 2D coordinates, for any form of exercise. This work proposes an end-to-end model that can be extended to incorporate more exercises and provide a real-time solution to the problem at hand. In future, this model can be generalised to recognise any form of correct exercise.

ACKNOWLEDGEMENT: This work was supported by the Maulana Azad National Fellowship [grant number: 201819-MANF-2018-19-DEL-100030].
REFERENCES

[1] Cao Z, Simon T, Wei S-E, Sheikh Y (2017) Realtime multi-person 2D pose estimation using part affinity fields. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp 7291–7299
[2] Toshev A, Szegedy C (2014) DeepPose: Human pose estimation via deep neural networks. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp 1653–1660
[3] Aiman U, Vishwakarma VP (2017) Face recognition using modified deep learning neural network. In: Proceedings of the 8th International Conference on Computing, Communication and Networking Technologies (ICCCNT), pp 1–5
[4] Aiman U, Asrar N (2015) Genetic algorithm based solution to SAT-3 problem. Journal of Computer Sciences and Applications 3(2):33–39
[5] Cheng S-H, Sarwar MA, Daraghmi Y-A, Ik T-U, Li Y-L (2023) Periodic physical activity information segmentation, counting and recognition from video. IEEE Access 11:23019–23031
[6] Ke S-R, Zhu L, Hwang J-N, Pai H-I, Lan K-M, Liao C-P (2010) Real-time 3D human pose estimation from monocular view with applications to event detection and video gaming. In: Proceedings of the IEEE International Conference on Advanced Video and Signal Based Surveillance, pp 489–496
[7] Wang H, Ullah MM, Klaser A, Laptev I, Schmid C (2009) Evaluation of local spatio-temporal features for action recognition. In: BMVC 2009 – British Machine Vision Conference, BMVA Press, pp 124.1–124.11
[8] Dhar P, Alvarez JM, Roy PP (2017) Efficient framework for action recognition using reduced fisher vector encoding. In: Proceedings of the International Conference on Computer Vision and Image Processing, pp 343–354
[9] Kumar P, Mukherjee S, Saini R, Kaushik P, Roy PP, Dogra DP (2018) Multimodal gait recognition with inertial sensor data and video using evolutionary algorithm. IEEE Trans Fuzzy Syst 27(5):956–965
[10] Kumar P, Saini R, Roy PP, Dogra DP (2018) A position and rotation invariant framework for sign language recognition using Kinect. Multimed Tools Appl 77(7):8823–8846
[11] Kumar P, Saini R, Yadava M, Roy PP, Dogra DP, Balasubramanian R (2017) Virtual trainer with real-time feedback using Kinect sensor. In: Proceedings of the IEEE Region 10 Symposium (TENSYMP), pp 1–5
[12] Saini R, Kumar P, Kaur B, Roy PP, Dogra DP, Santosh KC (2019) Kinect sensor-based interaction monitoring system using the BLSTM neural network in healthcare. Int J Mach Learn Cybern 10(9):2529–2540
[13] Saini R, Kumar P, Roy PP, Dogra DP (2018) A novel framework of continuous human-activity recognition using Kinect. Neurocomputing 311:99–111
[14] Zhang Z, Wang N, Cui L (2018) Fine-Fit: A fine-grained gym exercises recognition system. In: Proceedings of the 24th Asia-Pacific Conference on Communications (APCC), pp 492–497
[15] Koskimäki H, Siirtola P (2014) Recognizing gym exercises using acceleration data from wearable sensors. In: Proceedings of the IEEE Symposium on Computational Intelligence and Data Mining (CIDM), pp 321–328
[16] Kunze K, Lukowicz P (2008) Dealing with sensor displacement in motion-based onbody activity recognition systems. In: Proceedings of the 10th International Conference on Ubiquitous Computing, pp 20–29
[17] Toshev A, Szegedy C (2014) DeepPose: Human pose estimation via deep neural networks. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp 1653–1660
[18] Xiaohan Nie B, Xiong C, Zhu S-C (2015) Joint action recognition and pose estimation from video. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp 1293–1301
[19] Linna M, Kannala J, Rahtu E (2018) Real-time human pose estimation with convolutional neural networks. In: Proceedings of VISIGRAPP, pp 335–342
[20] Hwang D-H, Kim S, Monet N, Koike H, Bae S (2020) Lightweight 3D human pose estimation network training using teacher-student learning. In: Proceedings of the IEEE Winter Conference on Applications of Computer Vision, pp 479–488
[21] Seeger C, Buchmann A, Van Laerhoven K (2011) MyHealthAssistant: a phone-based body sensor network that captures the wearer's exercises throughout the day. In: Proceedings of the 6th International Conference on Body Area Networks, Institute for Computer Sciences and Social-Informatics, pp 1–7
[22] Lester J, Choudhury T, Kern N, Borriello G, Hannaford B (2005) A hybrid discriminative/generative approach for modeling human activities. In: Proceedings of the 19th International Joint Conference on Artificial Intelligence, pp 766–772
[23] Yan S, Xiong Y, Lin D (2018) Spatial temporal graph convolutional networks for skeleton-based action recognition. Computer Vision and Pattern Recognition
[24] Cao Z, Hidalgo G, Simon T, Wei S-E, Sheikh Y (2019) OpenPose: realtime multi-person 2D pose estimation using part affinity fields. IEEE Trans Pattern Anal Mach Intell 43(1):172–186
[25] Zhang P, Lan C, Zeng W, Xue J, Zheng N (2019) Semantics-guided neural networks for efficient skeleton-based human action recognition. Computer Vision and Pattern Recognition
[26] Zhao Z, Kiciroglu S, Vincent H, Cheng Y, Katircioglu I, Salzmann M, Fua P (2022) 3D pose based feedback for physical exercises. In: Proceedings of the Asian Conference on Computer Vision, pp 1316–1332
[27] Rangari T, Kumar S, Roy PP, Dogra DP, Kim BG (2022) Video based exercise recognition and correct pose detection. Multimed Tools Appl:1–6