This thesis by Armin Šarić analyzes football performance in the Premier League of Bosnia and Herzegovina using machine learning techniques. It employs various data-driven methods to predict league outcomes and player contributions, revealing key performance metrics that influence team success. The study highlights the potential of machine learning in enhancing decision-making in less-studied football contexts, contributing valuable insights to the field of sports analytics.

DATA DRIVEN INSIGHTS INTO FOOTBALL PERFORMANCE: MACHINE LEARNING ANALYSIS OF THE PREMIER LEAGUE OF BOSNIA AND HERZEGOVINA

ARMIN ŠARIĆ

INTERNATIONAL UNIVERSITY OF SARAJEVO

2025

A thesis submitted in partial fulfillment of the requirements for the degree of
Master of Science in Computer Sciences and Engineering

Faculty of Engineering and Natural Sciences

International University of Sarajevo

June, 2025
APPROVAL PAGE

I certify that I have supervised and read this study and that in my opinion, it conforms to
acceptable standards of scholarly presentation and is fully adequate, in scope and quality,
as a master’s thesis for the degree of Master of Science in Computer Sciences and
Engineering.

…………………………………………..
Assist. Prof. Dr. Mohammed Saeed Jawad
Mentor
I certify that I have read this study and that in my opinion, it conforms to acceptable
standards of scholarly presentation and is fully adequate, in scope and quality, as a
master’s thesis for the degree of Master of Science in Computer Sciences and Engineering.

…………………………………………..
[Committee member’s academic title and full name]
Committee Member
…………………………………………..
[Committee member’s academic title and full name]
Committee Member
This master’s thesis was submitted in partial fulfillment of the requirements for the degree
of Master of Science in Computer Sciences and Engineering.

…………………………………………..
Assist. Prof. Dr. Amal Mersni
Program Coordinator
…………………………………………..
Assoc. Prof. Dr. Altijana Hromić-Jahjaefendić
Dean

DECLARATION

I hereby declare that all information in this document has been obtained and presented in
accordance with academic rules and ethical conduct. I also declare that, as required by
these rules and conduct, I have fully cited and referenced all material and results that are
not original to this work.

Name: Armin Šarić

Signature …………………. Date ………………….

INTERNATIONAL UNIVERSITY OF SARAJEVO

DECLARATION OF COPYRIGHT AND AFFIRMATION OF FAIR USE OF UNPUBLISHED WORK

Copyright © 2025 by Armin Šarić. All rights reserved.

DATA DRIVEN INSIGHTS INTO FOOTBALL PERFORMANCE: MACHINE LEARNING ANALYSIS OF THE PREMIER LEAGUE OF BOSNIA AND HERZEGOVINA

No part of this unpublished work may be reproduced, stored in a retrieval system, or
transmitted, in any form or by any means, electronic, mechanical, photocopying,
recording or otherwise without prior written permission of the copyright holder and
IUS Library.

Affirmed by Armin Šarić

……………………………… ……………………….
Signature Date

ACKNOWLEDGEMENTS

I would like to thank all the people who motivated, supported, and guided me through the
process of finishing this master's thesis. This research is not only an academic achievement
but a personal experience that was possible because of the relentless support of various
people.

First and foremost, I would like to offer my most heartfelt thanks to my family for their
unrelenting support, patience, and understanding through every stage of this academic
endeavor. Their belief in me, even during the darkest moments, has been a constant source
of strength and encouragement. Were it not for their love and encouragement, this triumph
would not have been achieved.

I would also like to extend my heartfelt thanks to my good friends, especially those who
accompanied me every step of the way with genuine interest, stimulating feedback, and
well-timed moral support. Your inspiring words and thoughtful discussions kept me
grounded and focused.

I am especially grateful to Professor Mohammed Saeed Jawad, my thesis supervisor and
mentor, for his valuable academic direction, constructive feedback, and clear guidelines
throughout the research and writing of this work. His thoughtful criticisms and persistent
encouragement have been invaluable in shaping the quality and depth of this project.

In addition, I would like to express thanks and appreciation to all the professors who taught
me throughout the duration of my study. Their instructional dedication, scholarly rigor,
and intellectual provocation laid the requisite foundation for this thesis.

All of you have had a lasting impact on my academic and personal growth, and I am truly
grateful to have been instructed by you.

To all my fellow colleagues, academic staff, and everyone who contributed in some form,
big or small, by offering insights, being helpful, or simply being a good listener: thank
you for being part of the journey.

This thesis is a result of teamwork, and I am humbled and grateful for the support I have
had all along.

LIST OF ABBREVIATIONS

ACC Accuracy
AI Artificial intelligence
API Application Programming Interface
AUC Area under the curve
BiH Bosnia and Herzegovina
CSV Comma-Separated Values
DI Disciplinary Index
EDA Exploratory Data Analysis
F1 Score Harmonic Mean of Precision and Recall
G/A Goals and Assists (Goal Involvement)
GD Goal Difference
JSON JavaScript Object Notation
KP Key Passes
KPI Key Performance Indicator
LR Logistic Regression
ML Machine Learning
PCA Principal Component Analysis
PLBiH Premier League of Bosnia and Herzegovina
PRE Precision
REC Recall
RF Random Forest
RMSE Root Mean Squared Error
ROC Receiver Operating Characteristic
SVM Support Vector Machine
xG Expected Goals
XGBoost Extreme Gradient Boosting

TABLE OF CONTENTS

APPROVAL PAGE
DECLARATION
DECLARATION OF COPYRIGHT AND AFFIRMATION OF FAIR USE OF UNPUBLISHED WORK
ACKNOWLEDGEMENTS
LIST OF ABBREVIATIONS
ABSTRACT
LIST OF TABLES
LIST OF FIGURES

1. INTRODUCTION
1.1. Background of study and motivation
1.2. Problem statement
1.3. Research questions
1.4. Objectives
1.5. Methodology overview
1.6. Thesis structure

2. LITERATURE REVIEW
2.1. Introduction
2.2. Football performance metrics and tactical evolution
2.3. Machine learning methods in football analytics

3. RESEARCH METHODOLOGY
3.1. Research design
3.2. Data collection and preprocessing

4. RESULTS AND ANALYSIS
4.1. Descriptive statistical analysis
4.2. Correlation analysis
4.3. Distribution analysis
4.4. Discipline analysis
4.5. Ball possession analysis
4.6. Penalty analysis
4.7. Offensive and defensive analysis
4.8. Relegation analysis
4.9. Clustering analysis
4.10. Feature importance analysis
4.11. Relegation prediction – league position performance-based modelling
4.12. Player-level defensive modelling
4.13. Midfielder performance modelling
4.14. Attacker-level performance prediction
4.15. Integrating player-level predictions with team performance

5. CONCLUSION

REFERENCES

ABSTRACT
DATA DRIVEN INSIGHTS INTO FOOTBALL PERFORMANCE: MACHINE
LEARNING ANALYSIS OF THE PREMIER LEAGUE OF BOSNIA AND
HERZEGOVINA

This study evaluates and predicts football performance in the 2023–2024 season of the
Premier League of Bosnia and Herzegovina using data-driven methods and machine learning
models. Based on structured match data downloaded from SofaScore, the research
integrates team-level statistics, such as goals scored and conceded, possession,
disciplinary record, and clean sheets, with player-level metrics for defenders,
midfielders, and forwards.

A multi-method strategy was followed, combining exploratory data analysis, correlation
analysis, clustering (K-means), and supervised machine learning models such as Logistic
Regression, Random Forest, and XGBoost classifiers. Models were applied to predict
league positions, relegation, and key player contributions (such as top foulers, assist
leaders, and top goal contributors). Feature engineering was used to enrich the raw
measures, and stratified cross-validation was used to improve generalizability on small
datasets.

The findings point to goals conceded, clean sheets, and interceptions as the strongest
predictors of team success, as confirmed by feature importance scores from XGBoost and
Random Forest. Relegation prediction models revealed significant differences in
defensive solidity, scoring efficiency, and disciplinary behavior between relegated and
non-relegated teams. In the player-level models, Logistic Regression consistently
outperformed more complex models, with high accuracy, precision, and explainability,
particularly in identifying key players by position.

This research confirms that machine learning can be applied in under-served, data-poor
football environments, where descriptive and predictive analysis together provide
strategic insight. It contributes to the emerging body of literature on football
analytics by illustrating that advanced data science techniques can effectively support
decision-making in lesser-known, under-examined leagues, with practically meaningful
applications for analysts, coaches, and football organizations in Bosnia and
Herzegovina and other similar contexts.

Keywords: Football analytics, Bosnia and Herzegovina Premier League, machine learning,
Random Forest, XGBoost, Logistic Regression, player performance, team performance,
predictive modeling, data-driven insights, sports data science

LIST OF TABLES

Table 1 Correlation matrix of team performance metrics
Table 2 Skewness values of key performance metrics
Table 3 Correlation matrix – Disciplinary metrics and team performance
Table 4 Correlation matrix – Ball possession and performance metrics
Table 5 Correlation matrix – Penalty metrics and team performance
Table 6 Correlation matrix – Goal difference and league position
Table 7 Average performance metrics of relegated vs. non-relegated teams
Table 8 Mean values of performance metrics per cluster
Table 9 Random Forest feature importance scores
Table 10 XGBoost feature importance scores
Table 11 Model evaluation summary
Table 12 Feature importance summary
Table 13 Table of predicted positions
Table 14 Comparison of model evaluation
Table 15 Classification results
Table 16 Comparison of Logistic Regression and Random Forest

LIST OF FIGURES

Figure 1 Data extracted in JSON format for Premier league of BiH
Figure 2 Data for one specific team
Figure 3 Data extracted in JSON format for midfield players
Figure 4 Data extracted in JSON format for defense players
Figure 5 Data extracted in JSON format for attack players
Figure 6 Example of data which was not cleaned
Figure 7 Example of cleaned and preprocessed data
Figure 8 Relationship between ball possession and goals scored
Figure 9 Relationship between ball possession and goals conceded
Figure 10 Relationship between ball possession and league position
Figure 11 Relationship between penalty goals and goals scored
Figure 12 Relationship between penalty goals and league position
Figure 13 Relationship between penalty success rate and league position
Figure 14 Relationship between goals scored and goals conceded
Figure 15 Goal difference of each team in the 2023/24 season
Figure 16 Relationship between goal difference and league position
Figure 17 Distribution of goals scored – relegated vs. non-relegated teams
Figure 18 Distribution of goals conceded – relegated vs. non-relegated teams
Figure 19 Distribution of yellow cards – relegated vs. non-relegated teams
Figure 20 Distribution of red cards – relegated vs. non-relegated teams
Figure 21 Average performance comparison – relegated vs. non-relegated teams
Figure 22 Elbow method for determining optimal number of clusters
Figure 23 Clusters based on goals scored and goals conceded
Figure 24 Cluster means across key performance indicators
Figure 25 Random Forest feature importance visualization
Figure 26 XGBoost feature importance visualization
Figure 27 Comparison of model performance (Logistic Regression, XGBoost, SVM)
Figure 28 Visual comparison of model performance (bar chart of metrics)
Figure 29 Comparison of model performance metrics

1. INTRODUCTION

1.1. Background of study and motivation

Football is not only the most popular sport in the world but also a cultural and social
phenomenon [1]. Beyond the pitch, it affects millions of people through national identity,
community cohesion, economic investment, and entertainment. With billions of fans
worldwide, the sport has evolved into a complex ecosystem in which scientists, analysts,
and data engineers work alongside players and coaches.

Over the past decade, football has attracted significant interest from the scientific
community, particularly as the connection between sport performance and digital
technologies has become more prominent. As AI, data analysis, and sports science
increasingly overlap, a new era of research and practice in football has begun. Football
decisions are now based more than ever on data rather than on gut feeling alone. Because
of the growing volume of detailed match data and increasingly sophisticated analytics,
the actions of players and teams can now be understood in far broader terms than
traditional statistics allow [2].

Traditionally, football decision-making was dominated by subjective opinion, coaching
experience, and simple measures such as goals scored, matches won, or possession ratio.
Such basic statistics often failed to capture the full richness of player interactions
and team strategies. This has changed with the arrival of match-data websites and online
platforms such as SofaScore, Opta, and Wyscout, which now offer analysts an abundance of
performance measures. Variables such as passes completed, successful dribbles, shot
accuracy, pressing intensity, and advanced measures such as expected goals (xG) have
added considerable objectivity and sophistication to the analysis of the game [3].

Meanwhile, machine learning (ML) has become a key component of modern sports analytics [4].
ML algorithms in football are used not only for post-match analysis but also to forecast
match outcomes, identify opponent strengths and weaknesses, estimate player valuations,
optimize team formations, and even suggest real-time tactical instructions. The ability
of ML models to detect hidden patterns in high-dimensional datasets makes them well
suited to making sense of the dynamic and complex nature of football. Leading European
football clubs such as Manchester City, Bayern Munich, and Liverpool already use these
tools, embedding data science in their scouting, training load optimization, and
opposition analysis processes [5].

While these developments have been taking place worldwide, many smaller national leagues,
such as the Premier League of Bosnia and Herzegovina, have yet to embrace data analysis
to any great extent, either in practice or in academic research. This is particularly
striking given the league's rich football heritage, enthusiastic fanbase, and
socio-cultural value. Although the passion for the game is there, along with the
willingness of coaches to use advanced analytical tools, there is little empirical
evidence that clubs or league administrators regularly use advanced data analytics to
guide their decision-making. This is a significant research gap and a clear opportunity
to explore how data science and machine learning can be applied to a league that remains
under-analyzed in the existing literature.

The aim of this study is to conduct a systematic analysis of performance statistics from
the Premier League of Bosnia and Herzegovina using machine learning models. The dataset,
acquired from SofaScore in JSON format, was cleaned and normalized into structured CSV
files suitable for statistical analysis and model construction. By analyzing a wide range
of team statistics, including goals, ball possession, disciplinary records, and defensive
statistics, the study seeks to uncover the key drivers of performance that determine
league success. In addition, the project employs predictive modeling to forecast league
position, estimate the probability of relegation, and provide insight into other
performance-based variables.
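
As an illustration of the JSON-to-CSV step described above, the following minimal Python sketch flattens nested team records into one CSV row per team. It is not the code used in the thesis: the field names (`team`, `statistics`, `goalsScored`, and so on) and the sample values are hypothetical, since the actual SofaScore payload structure is not described in this section.

```python
import csv
import io
import json

# Hypothetical raw payload; real SofaScore field names and values may differ.
RAW_JSON = """
[
  {"team": {"name": "Team A"},
   "statistics": {"goalsScored": 50, "goalsConceded": 20, "averageBallPossession": 55.2}},
  {"team": {"name": "Team B"},
   "statistics": {"goalsScored": 31, "goalsConceded": 44, "averageBallPossession": null}}
]
"""


def flatten(record):
    """Turn one nested team record into a flat row, one column per statistic."""
    row = {"team": record["team"]["name"]}
    for key, value in record["statistics"].items():
        # Missing values become empty cells so they can be handled explicitly later.
        row[key] = "" if value is None else value
    return row


def json_to_csv(raw_json):
    """Convert JSON text into CSV text with a header row."""
    rows = [flatten(record) for record in json.loads(raw_json)]
    # Collect column names in first-seen order so all rows share one header.
    columns = []
    for row in rows:
        for key in row:
            if key not in columns:
                columns.append(key)
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=columns, restval="", lineterminator="\n")
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()


csv_text = json_to_csv(RAW_JSON)
print(csv_text)
```

Keeping missing values as empty cells, rather than silently replacing them with zero, leaves the imputation decision to a later, explicit preprocessing step.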

By demonstrating the capability of ML-based analysis, this research aims to promote
further adoption of data science methods by football clubs, league officials, and
universities in Bosnia and Herzegovina. It also aims to contribute to the broader body of
football analytics by presenting a regional case study in which the use of such tools
remains in its infancy. The findings of this study can contribute to theoretical
advancement in sports data analysis and also have practical implications for stakeholders
who want to modernize football training in this league and in similar environments.

The study also extends its analytical focus beyond team-level data by incorporating
player-level modeling. It investigates how player performance metrics can be used to
inform play on the field and how they affect team performance. Three separate predictive
models are developed, one for each player role: attackers (involvement in scoring),
midfielders (assist percentage), and defenders (foul rates). These models allow a more
refined assessment of individual players' contributions to overall team performance. By
taking this two-level approach, at both team and player level, the research provides a
comprehensive framework for assessing football performance with machine learning methods.

1.2. Problem statement

In modern professional football, prediction and performance analysis have become an
important part of both competitive success and strategic decision-making. The top
European leagues, such as La Liga, the Bundesliga, and the English Premier League, have
already embraced machine learning (ML) and data analytics to enhance tactical
preparation, optimize player recruitment, and monitor in-game and season-long trends in
performance. These leagues lead the way in adopting data-driven technologies in nearly
every area of football management.

However, this transformation has not occurred uniformly. There are substantial
disparities between high-resource and low-resource leagues, particularly in the
theoretical and practical application of advanced analytics. The Premier League of Bosnia
and Herzegovina is one league that has yet to fully experience the impact of data-driven
methods. Despite its strong football tradition and growing online media presence, the use
of predictive analytics and systematic data analysis in this league remains marginal and
underdeveloped.

While platforms like SofaScore and Opta have made basic match data publicly available,
such as the number of goals, possession percentage, and disciplinary records, these
figures are reported without further analytical interpretation. Although clubs and
analysts do pay attention to such basic numbers, few subject them to higher-level
statistical or machine learning analysis that could uncover hidden trends, inform
evidence-driven coaching, or facilitate tactical optimization.

In high-resource leagues, a growing body of evidence shows machine learning successfully
predicting match outcomes, simulating relegation chances, examining player contributions,
and quantifying subtle tactical efficiencies in games [6]. Highly detailed performance
data such as pressing pressure, pass networks, xG, zone occupation, and defensive actions
are the typical inputs of these studies. By contrast, such multi-layered and
context-sensitive analysis has yet to be applied in depth to the Premier League of Bosnia
and Herzegovina, leaving a large gap in knowledge and considerable untapped potential [7].

This gap not only deprives clubs in lower-resource leagues of the strategic benefits of
data science but also limits the generalizability of findings from football analytics
research. Rein and Memmert contend that the disproportionate representation of elite
leagues in academic scholarship limits generalizability and overlooks contextual
differences in less high-profile leagues [8]. Without extending analysis to these
alternative environments, football scholarship remains incomplete, omitting
considerations of differing playing styles, resource constraints, and the cultural
framing of the game.

Furthermore, the absence of predictive modeling in such leagues limits the ability to
detect early indicators of success or failure during a season. When predictive models are
not used to identify which statistics are most strongly associated with winning
performance or relegation risk, stakeholders such as coaches, analysts, and club officials
are deprived of knowledge that could make their competitive planning more effective.
This inhibits evidence-based decision-making in training, match preparation, and
recruitment. Equally overlooked is the scope for individual player-level modeling in
lower-league contexts. Although analysis at the team level has gained some traction,
little work has examined how players in particular roles (forward, central midfielder, or
defender) contribute individually to overall team performance. Player-level
classification models offer a more detailed view of performance, enabling identification
of high-impact players, recurring tactical issues, or positional strengths and weaknesses
that might otherwise be lost in aggregate metrics.

Given these theoretical and applied gaps, there is a clear and pressing need for rigorous
research into the application of machine learning and statistical modeling to the Premier
League of Bosnia and Herzegovina. By applying advanced data science techniques to a
regional football setting, this study not only fills an important research gap but also
offers valuable insight to stakeholders operating in similarly underrepresented leagues.
The overall goal is to demonstrate the viability and value of ML-based analytics for
performance modeling even where data availability is moderate and technology
infrastructure is minimal.

1.3. Research questions

The purpose of this research is to thoroughly investigate team and player performance in
the Premier League of Bosnia and Herzegovina using both descriptive statistics and
advanced machine learning (ML) techniques. Although simple match statistics, such as
number of goals, possession share, or yellow cards, are routinely published, such
measures support little more than superficial examination. Beneath them lies considerable
potential to uncover higher-level trends, relationships, and decision-critical
information.

This study formulates several research questions intended to bridge the gap between mere
data presentation and substantive football intelligence. Each question corresponds to an
analytical task undertaken in the study and is supported by an extensive dataset derived
from SofaScore. The questions also draw inspiration from the existing body of research on
top leagues, where these approaches have proven worthwhile.

Q1: What do descriptive statistics and correlation analysis reveal regarding the most
important predictors of performance that distinguish successful from unsuccessful
teams in the BiH Premier League?

This question is answered by conducting exploratory data analysis using statistical
measures (mean, median, standard deviation) and correlation tables. It aims to reveal the
relationship between key performance measures (goals, clean sheets, accurate passes, and
average ball possession) and final league position and team ranking. Understanding these
patterns is a step towards quantifying what distinguishes high-performing teams from the
rest [9].
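
A minimal Python sketch of the two ingredients named above, summary statistics and a Pearson correlation coefficient, is given below. The numbers are illustrative only and are not taken from the thesis dataset; the function names `describe` and `pearson` are introduced here for the example.

```python
import math
import statistics

# Illustrative values only; not taken from the thesis dataset.
goals_scored = [62, 55, 48, 40, 33, 25]
league_position = [1, 2, 3, 4, 5, 6]  # 1 = champion, so lower is better


def describe(values):
    """Return the summary statistics used in exploratory data analysis."""
    return {
        "mean": statistics.mean(values),
        "median": statistics.median(values),
        "std": statistics.stdev(values),  # sample standard deviation
    }


def pearson(x, y):
    """Pearson correlation coefficient between two equally long sequences."""
    mean_x, mean_y = statistics.mean(x), statistics.mean(y)
    covariance = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    spread_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    spread_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    return covariance / (spread_x * spread_y)


summary = describe(goals_scored)
r = pearson(goals_scored, league_position)
print(summary)
print(round(r, 3))
```

Because a lower position number means a better finish, a strongly negative coefficient here indicates that teams scoring more goals tend to finish higher in the table.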

Q2: How does average ball possession contribute to team performance in terms of goals
scored, goals conceded, and final league position?

Possession is broadly accepted in the literature as a main indicator of a team's
dominance and control. This research examines how average possession is associated with
both attacking and defensive results, including its linear correlation with goals for,
goals against, and final league position. Correlation coefficients and regression
modelling are used to measure this effect [10].
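
The regression part of this analysis can be sketched as an ordinary least-squares line fitted to possession and goals. The example below is only an assumption about the general form of such a model; the data are synthetic and deliberately exactly linear so that the result can be checked by hand, and `fit_line` is a name introduced for this illustration.

```python
def fit_line(x, y):
    """Ordinary least-squares fit of y = slope * x + intercept."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((a - mean_x) ** 2 for a in x)
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


# Synthetic data constructed so that goals = possession - 15 exactly.
possession = [40.0, 45.0, 50.0, 55.0, 60.0]  # average possession in percent
goals = [25.0, 30.0, 35.0, 40.0, 45.0]       # goals scored over the season

slope, intercept = fit_line(possession, goals)
print(slope, intercept)
```

With real league data the points would scatter around the line, and the slope would be read as the expected change in goals per additional percentage point of possession.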

Q3: How does discipline, for example yellow and red cards, affect team performance
and league position?

Although discipline is sometimes regarded as a subjective matter, this study
quantitatively tests the hypothesis of a negative correlation between poor discipline
(cumulative yellow and red cards) and team success. By comparing these disciplinary
statistics with others such as goals conceded and league position, the study examines
whether teams with worse discipline perform worse on average.

Q4: What are the most important features for predicting league standing, and in what
order do machine learning models such as Random Forest and XGBoost rank them?

Machine learning is not only predictive but also explainable through measures of feature
importance. Here, Random Forest and XGBoost are trained to predict league position from
various match statistics. The models identify which features, such as goals conceded,
successful interceptions, accurate passes, or clean sheets, contribute most to
end-of-season team positions and rank them by importance.

Q5: Can machine learning algorithms predict final league position or relegation
status using team performance statistics?

Building on the above, the research employs both regression and binary classification
models to assess whether season-long performance measurements can accurately forecast
the position in which a team will finish and whether it is at risk of relegation. Model
performance is measured in terms of accuracy, AUC, and precision-recall to determine
usability in real-world applications.

Q6: In what ways are relegated team performance profiles different from non-
relegated teams, and how are trends revealed using comparative analysis?

The study compares relegated and non-relegated teams across all relevant parameters
using group-wise statistical comparisons. It highlights early warning indicators and
gameplay deficiencies that typically appear in teams that fail to stay up, and it puts
poor team performance into perspective relative to the league average.

Q7: Can unsupervised learning (clustering) be applied to classify teams by their
performance measures, and what is the relationship between the clusters and league
tables?

Unsupervised learning with k-means clustering is used to group teams into performance
types such as dominant, balanced, or underperforming. The study then investigates how
these empirically determined groups correspond to the actual league table, which serves
as an independent check on the performance types.

Q8: Can one utilize player-level stats to predict whether defenders will commit a lot
of fouls, midfielders will provide a lot of assists, or attackers will contribute heavily
in goal involvement?

After team-level modeling, this question shifts the attention toward player-level
contributions. ML classifiers are trained to predict whether a given player is likely to
exceed a given threshold of fouls (defenders), assists (midfielders), or goals/goal
involvement (attackers). Such prediction problems enable a more detailed understanding
of who contributes the most in their respective roles.

Q9: In what way are such individual player projections echoed at the team level in
performance and outcome, i.e., league standing, threat of goals, or defensive
strength?

Finally, this question probes the ripple effects of individual play. It asks if teams with high-
quality players in specific roles (e.g., great scorers or playmakers) tend to feature higher
in aggregate league statistics. The answers to this question help to bridge micro-level
player analysis to macro-level team success.

1.4. Objectives

The primary aim of this study is to apply data-driven and machine learning techniques to
examine and predict football team performance in the Premier League of Bosnia and
Herzegovina. Although these techniques have been applied extensively to major
European leagues, this study seeks to fill a gap in the literature by focusing on a
league that has received little attention in advanced statistical or predictive studies.

These goals align with broader trends across sports analytics, where machine learning and
performance modeling have already found value in real-time decision-making, player
evaluation, and tactical planning [11]. Specifically, the study will aim to extract useful
insight from match data, compare the efficacy of predictive models for forecasting league-
related outcomes such as final position or relegation, and establish the relevance of various
performance metrics. Specific goals are outlined below:

1. To examine descriptive and correlation-based trends among key performance
indicators, i.e., goals, passes, possession, and defensive play.
2. To explore the degree to which ball possession affects the overall performance of
a team, taking into account its correlation with goals for, goals against, and league
standings.
3. To ascertain the degree to which team performance and competitiveness are
impacted by discipline-related indexes (yellow and red cards).
4. To compare performance profiles for relegated and non-relegated teams with the
aim of determining the primary markers of poor performance.
5. To group teams into homogenous groups by their common performance profiles
using clustering analysis and analyze the relationship between the clusters and
league standings.
6. To use machine learning algorithms (XGBoost, Random Forest) to predict league
outcomes and make model performance comparisons using standard metrics.
7. To use feature importance analysis to determine which features have the most
impact on team performance and rank them accordingly.
8. To provide practical implications to clubs, analysts, and football decision-makers
for how data can be effectively used for measurement and improvement of
performance.
9. To train and test Random Forest and Logistic Regression classification models
for defenders, midfielders, and forwards to predict foul tendency, assist rate,
and goal involvement, respectively.
10. To investigate how key players affect overall team performance in the
Premier League of Bosnia and Herzegovina by combining individual player
modeling with team performance analysis.

1.5. Methodology overview

This study employs an evidence-based, data-driven methodology that combines
statistical analysis, exploratory data analysis, and machine learning techniques to
uncover performance patterns and construct forecasting models for football clubs
competing in the Premier League of Bosnia and Herzegovina. The methodological
strategy follows standard procedures in the contemporary sports analytics literature,
using structured performance data to explore latent determinants of success and derive
actionable insights.

The study starts with data collection, in which match-level and player-level data are
retrieved from the SofaScore website. SofaScore is a reputable provider of live and
historical football statistics and was chosen for its accessibility and level of detail.
Raw data, initially in JavaScript Object Notation (JSON) format, were converted into
Comma-Separated Values (CSV) format to enable easier handling within Python-based data
analysis environments. The data set contains a wide variety of numerical and categorical
variables that capture various dimensions of team and player performance. Examples
include, but are not limited to, goals scored and conceded, average possession, accurate
passes, yellow and red cards, clean sheets, interceptions, fouls, penalties, and other
contextual measures.

Once the data are gathered, an extensive preprocessing step ensures their quality and
integrity. This step is important for avoiding biased outcomes in model training and
statistical testing. Operations performed during this phase include the elimination of
duplicate records, handling of missing values through imputation or exclusion (as
required), data type conversion, and normalization where necessary. Feature engineering
is also used to develop new domain-specific variables that widen the scope of the
analysis. The salient engineered features are goal difference (goals scored minus goals
conceded), disciplinary index (a weighted summary of yellow and red cards), and penalty
success rate. The data are also checked for outliers, skewed distributions, and
multicollinearity, and the necessary adjustments are made to obtain a robust input space
for further modeling activities.
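The three engineered features named above can be sketched as follows. This is an illustrative sketch only: the field names, the example figures, and in particular the 1:3 weighting of yellow to red cards in the disciplinary index are assumptions, since the exact weights are not stated at this point in the text.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class TeamSeason:
    """Season totals for one team (hypothetical field names and values)."""
    name: str
    goals_scored: int
    goals_conceded: int
    yellow_cards: int
    red_cards: int
    penalties_taken: int
    penalties_scored: int


def goal_difference(t: TeamSeason) -> int:
    # Goals scored minus goals conceded.
    return t.goals_scored - t.goals_conceded


def disciplinary_index(t: TeamSeason, yellow_weight: float = 1.0,
                       red_weight: float = 3.0) -> float:
    # Weighted sum of cards; the default 1:3 weighting is an assumed example.
    return yellow_weight * t.yellow_cards + red_weight * t.red_cards


def penalty_success_rate(t: TeamSeason) -> Optional[float]:
    # Undefined when no penalties were taken, so return None rather than 0.
    if t.penalties_taken == 0:
        return None
    return t.penalties_scored / t.penalties_taken


team = TeamSeason("Example FC", 48, 30, 70, 4, 5, 4)
features = {
    "goal_difference": goal_difference(team),
    "disciplinary_index": disciplinary_index(team),
    "penalty_success_rate": penalty_success_rate(team),
}
print(features)
```

Returning None for teams without penalties keeps a missing rate distinguishable from a genuine 0% success rate, which matters for later imputation.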

The next phase is exploratory data analysis (EDA) and statistical analysis, whose aim
is to provide baseline insight into variable relationships. Descriptive statistics (e.g.,
means, medians, standard deviations) are computed, and visualisations (e.g., histograms,
heatmaps, scatter plots) are employed to review patterns between performance measures.
Correlation matrices are generated to inspect linear relationships among variables, in
particular their relationships with end-of-season position and relegation status. These
exploratory results help guide variable selection for machine learning and serve as an
initial diagnostic layer.
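As a minimal illustration of the descriptive statistics and linear correlation described above, the following sketch uses only the Python standard library. The possession and position figures are made up for six hypothetical teams and are not taken from the thesis dataset.

```python
import statistics
from math import sqrt

# Made-up season figures for six hypothetical teams (not thesis data).
possession = [58.0, 54.5, 51.0, 49.0, 46.5, 41.0]  # average possession, %
position = [1, 2, 3, 4, 5, 6]                      # final league position


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    mx, my = statistics.fmean(x), statistics.fmean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    var_x = sum((a - mx) ** 2 for a in x)
    var_y = sum((b - my) ** 2 for b in y)
    return cov / sqrt(var_x * var_y)


summary = {
    "mean": statistics.fmean(possession),
    "median": statistics.median(possession),
    "stdev": statistics.stdev(possession),  # sample standard deviation
}
r = pearson(possession, position)
print(summary, round(r, 3))
```

Because position 1 is the best finish, a negative coefficient here means that higher possession goes with a better final position.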

The core analysis phase applies supervised and unsupervised machine learning methods
to extract knowledge from the data. On the supervised side, Random Forest and XGBoost
models are applied to regression (predicting final league position) and classification
(predicting relegation) tasks. These models are chosen for their suitability for
small-to-medium datasets, strong predictive power, and support for feature importance
scoring. The models are validated with performance measures appropriate to each task:
R-squared and RMSE for regression, and accuracy, precision, recall, and F1-score for
classification. Cross-validation and grid search are used to tune hyperparameters and to
estimate generalization performance.

In parallel, unsupervised learning techniques, specifically k-means clustering, are
applied to identify natural groupings of teams based on season-long performance profiles.
The aim of this step is to sort teams into typologies such as dominant, balanced, or
struggling and thereby offer a simple, interpretable segmentation of the league. The
clustering result is cross-checked against actual league standings to check whether the
clusters reflect real differences in standing.
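A minimal sketch of this clustering step, assuming scikit-learn and three clusters, is given below. The six team profiles and their final positions are hypothetical; the choice of three features and of k = 3 is for illustration only.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

# Hypothetical season profiles: [goals scored, goals conceded, avg possession %].
profiles = np.array([
    [62, 20, 58.0], [58, 24, 56.0],  # strong sides
    [40, 38, 50.0], [38, 40, 49.0],  # mid-table sides
    [22, 60, 42.0], [20, 64, 41.0],  # struggling sides
])
final_position = np.array([1, 2, 3, 4, 5, 6])

# Standardise so that no single metric dominates the Euclidean distance.
scaled = StandardScaler().fit_transform(profiles)
labels = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(scaled)

# Cross-check: which final positions fall into each cluster, and their mean.
for cluster in np.unique(labels):
    members = final_position[labels == cluster]
    print(cluster, members.tolist(), members.mean())
```

The cluster numbers themselves are arbitrary labels; only the grouping and its relationship to final position carry meaning.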

The approach also extends to modeling at the player level, in which supervised
classifiers predict individual player tendencies across three positional groups:
defenders, midfielders, and forwards. For defenders, the objective is to predict the
likelihood of being a high-fouling player; for midfielders, whether they are likely to
record high assist rates; and for forwards, whether they are actively involved in goal
scoring. Each positional model is evaluated by AUC, precision, recall, and F1-score to
assess its ability to discriminate high-impact players from the rest. Player-level
predictions are also compared to team-level success to examine the connection between
individual and collective performance.

Lastly, the research includes a feature importance analysis module. This is crucial for
model explainability and for drawing practical insights that can be communicated to
coaches, analysts, and team managers. By ranking features according to their predictive
power in the Random Forest and XGBoost models, the study identifies which statistical
metrics contribute most to league performance. These findings are useful not only for
retrospective examination but also as inputs for future strategy development and club
performance monitoring across the region.
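The ranking idea can be illustrated with the impurity-based importances that a scikit-learn Random Forest exposes. The sketch below uses synthetic data in which, by construction, only the first feature drives the target; the feature names are borrowed from the text for readability and carry no real values.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(42)
feature_names = ["goals_conceded", "interceptions", "accurate_passes", "clean_sheets"]

# Synthetic data in which only the first feature drives the target.
X = rng.normal(size=(200, len(feature_names)))
y = 5.0 * X[:, 0] + rng.normal(scale=0.1, size=200)

model = RandomForestRegressor(n_estimators=100, random_state=0).fit(X, y)

# Pair each feature with its impurity-based importance, most important first.
ranking = sorted(zip(feature_names, model.feature_importances_),
                 key=lambda pair: pair[1], reverse=True)
for name, score in ranking:
    print(f"{name:16s} {score:.3f}")
```

Impurity-based importances sum to one and can be inflated for correlated or high-cardinality features, so in practice they are often cross-checked against another measure such as permutation importance.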

More broadly, the methodology reflects an integrated, multi-level style of analysis that
spans raw data transformation, statistical groundwork, machine learning modeling, and
interpretation, thereby contributing to the emerging body of football analytics in
under-studied league contexts.

1.6. Thesis structure

The five main chapters that make up this thesis each focus on a specific stage of the
research process and contribute to the main goal of understanding and predicting football
team performance through data analytics and machine learning. The structure aims to lead
the reader from the introduction and motivation to the method design and analysis, and
lastly, to the interpretation and conclusions from the results.

CHAPTER 1 – Introduction

The context and motivation for the research are presented in this first chapter, which
also identifies the lack of data-driven evidence on the Premier League of Bosnia and
Herzegovina and presents the main research questions and goals. It also provides an
overview of the thesis outline and the methodology.

CHAPTER 2 – Literature review

This chapter reviews the literature on football performance analysis, statistical and
machine learning applications in sport, and prior uses of predictive modelling in elite
and sub-elite leagues. In line with proposals for taking football analytics into
unexplored competitions, it focuses on important issues and gaps in the literature, which
this research aims to fill.

CHAPTER 3 – Methodology

Data collection, preprocessing, feature engineering, exploratory analysis, and machine
learning algorithm selection are all addressed in detail in this chapter. The choice of
particular algorithms (Random Forest, XGBoost, and k-means clustering) is justified with
reference to the methodological literature on sports analytics and model interpretability.

CHAPTER 4 – Data analysis and results

This chapter presents the results of the descriptive analysis, correlation analysis,
discipline impact analysis, and performance clustering. It also covers team-level
evaluation and individual modeling for defenders, midfielders, and attackers based on
machine learning classification tasks, together with feature importance rankings that
highlight team profiles and performance indicators.

CHAPTER 5 – Conclusion

This chapter covers the key findings, a summary of the research questions, and the
contribution of the study to football analytics. It also highlights areas for future
research and makes a few suggestions for BiH Premier League stakeholders.

Every chapter builds upon the one before it, giving a reasoned progression and a
comprehensive examination of the research problem. In line with best practice in
sport science and data analysis research, this organization supports both theoretical
investigation and applied findings.

2. LITERATURE REVIEW

2.1. Introduction

Over the last decade, data analysis and machine learning techniques have transformed
professional practice and research in football performance analysis [26]. While the field
relied heavily on human intuition, anecdotal evidence, and elementary statistics in the
past, modern analytics has turned it into a data-driven science characterized by
predictive models, tactical simulations, and real-time decision support systems. This
shift has made it possible to break down almost all elements of the game, ranging from
player positioning and ball recovery zones to changes in team shape and the
implementation of strategy in real match scenarios. Access to publicly available
performance data and purpose-designed analysis software has driven this change, enabling
analysts, coaches, and researchers to go far beyond the boundaries of conventional
performance reviews.

This review draws on this emerging literature by focusing on the role of performance
measures and machine learning models in football analysis. It attempts to bridge the gap
between the big leagues with extensive data infrastructures and small leagues where
data-driven methods are still at an early stage. Specifically, the review presents a
critical synthesis of the literature on performance analysis, tactical development,
supervised and unsupervised learning, and data limitations in low-resourced
environments. Particular emphasis is placed on studies that investigate lower-tier
leagues and aim to generalize machine learning findings so that they apply across
competitive arenas of varying maturity.

To be concise and readable, this chapter is composed of five closely related sections:

- Overview of football performance metrics and how they may be utilized within
team performance measurement;

- Overview of machine learning methods used in football analytics;

- Discussion of data availability and sources, problems, and preprocessing within
football research;

- Critical examination of existing attempts at predictive modeling;

- Identification of literature gap within the context of the Premier League of Bosnia
and Herzegovina.

This systematic presentation sets the stage for discussing study design choices and
highlighting how prior work guides the application of analytical tools to a
less-researched football environment.

2.2. Football performance metrics and tactical evolution

At the heart of football analytics is the collection and analysis of playing statistics, which
provide quantifiable measures for the way that games unfold. Traditional measures such
as goals, shots on target, and possession rates have been used for decades to summarize
match results and are still central to performance evaluation. Contemporary research has
significantly expanded the analysis domain by incorporating advanced statistics that allow
for deeper tactical interpretation.

Some of the most important modern metrics include expected goals (xG), pass network
centrality, pressing zones, and area-based coverage. These metrics go beyond reporting
results and help analysts understand why and how certain patterns developed within a
match [27]. This shift from descriptive to explanatory metrics has helped coaches and
analysts assess performance through process-based indicators, which in turn informs
training regimens, in-match adjustments, and post-match analysis.

The case against superficial measurement has been made clearly by authors such as Rein
and Memmert, who argue that tactical patterns and game intelligence can only be captured
by combining spatio-temporal modeling with a deeper understanding of observed
performance [28]. Similarly, Gudmundsson and Horton emphasized the need to utilize
tracking data so that researchers can uncover details on team formation, movement
strategies, and spatial control across various stages of play [29]. These developments
have enabled more precise tactical models, allowing for insight into positioning,
transitions, and strategic cohesion.

Even with these promising advances, however, academic and industry interest continues
to be largely centered on elite competitions such as the UEFA Champions League, La
Liga, and the English Premier League. As a result, leagues such as the Premier League
of Bosnia and Herzegovina receive essentially no exposure in large datasets or research
papers. This exclusion significantly limits the generalizability of football research,
as it rules out the discovery of performance attributes shaped by different economic
constraints, tactical styles, and player development pathways.

Earlier empirical research has consistently established that statistics such as possession
rate, accurately completed passes, interceptions won, and disciplinary offenses committed
are strongly associated with team performance. These variables are particularly relevant
in seasonal assessments, where longer-term trends better reflect strategic stability. The
influence of these variables, though, can be moderated by local contextual factors such
as refereeing quality, average match intensity, and climatic conditions. Consequently,
this research emphasizes contextualized examination of unexplored leagues like that of
Bosnia and Herzegovina to determine whether international results carry over to the
regional level or whether adjustments need to be made per league.

2.3. Machine learning methods in football analytics

Machine learning (ML) has become one of the most influential tools in football research,
enabling researchers to move beyond descriptive statistics to more complex tasks such as
predictive modeling, automatic classification, and behavioral pattern detection. ML
techniques have expanded the analytical toolkit by uncovering intricate, non-linear
relationships in data that are often too complex for conventional statistical approaches
to capture accurately.

Over the past decade, the use of ML in football has spanned a broad set of applications.
These range from match outcome prediction, player rating, and injury prediction to
tactical profiling and segmentation of teams by style of play [30]. Improved access to
structured data such as rich event logs, GPS data, and seasonally aggregated statistics
has allowed these models to move beyond proofs of concept to operational decision-making
systems for clubs and federations.

Among the most widely used families of ML algorithms in football are supervised learning
techniques, which learn mappings from input features to known target outputs. Here,
models learn from labeled historical data to predict outcomes like goal tallies, league
positions, or disciplinary incidents. Two of the best-performing supervised algorithms
are Random Forest and XGBoost, which have consistently shown strong performance in
football analytics [31]. Random Forest, an ensemble of decision trees, is well known for
its robustness to overfitting, capacity to capture complex feature interactions, and
ranking of feature importance. XGBoost, a gradient boosting algorithm, is widely praised
for its computational efficiency, scalability, and predictive power in high-dimensional
data environments.

Other widely used supervised learning algorithms include neural networks, logistic
regression, support vector machines (SVM), and k-nearest neighbors (KNN). Neural
networks and deep learning models in particular have gained popularity for tasks with
spatiotemporal input, such as action recognition from video or movement prediction from
GPS-based tracking data [34]. Although these models can be highly accurate, their
interpretability is low, which is a drawback in managerial use where explainability is
strongly required.

On the other hand, unsupervised learning techniques such as Principal Component
Analysis (PCA), hierarchical clustering, and k-means clustering are applied when there
are no clear outcome labels to use [35]. These models help reveal implicit patterns or
group entities such as players, teams, or matches based on similarity of performance. In
this work, k-means clustering is used to group the current Premier League of Bosnia and
Herzegovina teams based on season-level statistics in order to identify strategic
profiles such as defensive solidity, attacking dominance, or disciplinary inconsistency.

Another significant aspect of ML use in football is model evaluation, for which the
metric appropriate to the task at hand must be selected. For classification tasks (e.g.,
relegation prediction), common evaluation measures are precision, recall, F1-score,
accuracy, and Area Under the Receiver Operating Characteristic Curve (AUC). For
regression tasks (e.g., league position prediction), mean squared error (MSE) and
R-squared (R²) are used. Most studies use cross-validation and grid search for
hyperparameter optimization to limit overfitting and improve generalizability.
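For reference, the threshold-based classification measures and the regression measures named above can be written out directly from their standard definitions. The sketch below is a plain-Python illustration on toy labels and values (1 = relegated, 0 = safe); AUC is omitted because it requires predicted scores rather than hard labels.

```python
from math import sqrt


def classification_metrics(y_true, y_pred):
    """Accuracy, precision, recall and F1 for binary labels (1 = positive)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return {"accuracy": (tp + tn) / len(pairs), "precision": precision,
            "recall": recall, "f1": f1}


def regression_metrics(y_true, y_pred):
    """MSE, RMSE and R-squared for numeric targets."""
    n = len(y_true)
    mean_true = sum(y_true) / n
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    mse = ss_res / n
    return {"mse": mse, "rmse": sqrt(mse), "r2": 1 - ss_res / ss_tot}


# Toy examples, not thesis results.
clf = classification_metrics([1, 0, 0, 1, 0, 0], [1, 0, 1, 0, 0, 0])
reg = regression_metrics([1, 2, 3, 4], [1, 3, 3, 5])
print(clf)
print(reg)
```

In the toy classification example accuracy is higher than precision and recall, which shows why accuracy alone can be misleading when the positive class (relegation) is rare.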

While it has many advantages, implementing ML in football analysis is not problem-free
[32]. Black-box models are by their nature opaque, which complicates their adoption in
tactical strategy. Class imbalance remains an issue, particularly for tasks related to
rare events such as red cards or relegation. Furthermore, data limitations are severe in
lower leagues, where large datasets and live-tracking infrastructure are rarely
available.

For this reason, this study of the Premier League of Bosnia and Herzegovina uses
season-level performance data as input for ML modeling [33]. While such data cannot
describe micro-level behavior or in-game dynamics, they remain useful for uncovering
long-term trends and building predictive models that can guide strategic as well as
tactical decision-making. The strength of ML in this setting lies not merely in
anticipating outcomes but also in making analytic tools accessible to clubs with
limited technological capabilities.

3. RESEARCH METHODOLOGY

3.1. Research design

The study follows a quantitative, data-driven research methodology built on exploratory
data analysis (EDA) and predictive modeling. It applies machine learning techniques to
structured performance data from the 2023–2024 Premier League of Bosnia and
Herzegovina season in order to detect patterns, measure performance metrics, and create
models capable of predicting final league position or the probability of relegation.
The use of computational methods in a rigorous analytical setting is consistent with
best practice in the contemporary sports analytics literature, which emphasizes rigor,
interpretability, and reproducibility.

The research design unfolds through seven distinct steps, each of which is critical
within the overall investigative framework. These are discussed further below:

1. Data collection
Structured team performance data were gathered programmatically from the
SofaScore platform, which offers rich football data for a number of competitions.
The data, initially gathered as JSON, include a variety of season-level measures
such as goals, passes completed, average possession per game, fouls, yellow and
red cards, and clean sheets. The data were then programmatically mapped and
placed in a structured tabular format (CSV) to simplify statistical analysis and
machine learning modeling.

This structured data formed the groundwork of the research, both for descriptive
and predictive modeling. Using a publicly accessible source such as SofaScore
supports transparency and ease of replication of the research.

2. Data cleaning and preprocessing
To improve the accuracy and validity of the data, thorough preprocessing was
carried out. This included normalizing column structures, removing duplicate
rows, encoding categorical and numerical variables as needed, and handling
missing values.
Feature engineering was also a key aspect of this step. New performance measures
such as goal difference, disciplinary index, and penalty success rate were added to
enrich the dataset. Potential outliers were detected and corrected, and the
skewness and symmetry of distributions were tested to ensure that the input data
were appropriate for the various modeling algorithms. These processes, elaborated
in Section 3.2, added analytical rigour and improved the explainability of
subsequent results.

3. Exploratory data analysis (EDA)

The third component of the research design was exploratory data analysis, which
served as both a diagnostic and a hypothesis-generating phase. Descriptive
statistics, correlation matrices, histograms, and scatter plots were employed to
examine in detail the relations between various performance measures and outcome
variables (e.g., end-of-season league position and relegation status).
This stage enabled the identification of potentially useful features, visual
inspection of trends, and early identification of any multicollinearity issues or
irregularities in data distribution. It also provided the background for comparing
performance profiles across teams and for the visual analysis of team behavior in
possession, discipline, and goal-scoring metrics.

4. Unsupervised learning – clustering analysis

To identify hidden structure in the data, k-means clustering was applied. This
unsupervised learning technique grouped teams into performance-based clusters by
minimizing within-cluster variance. By assigning teams to similar strategic
types (dominant, balanced, or struggling sides), the clustering procedure
reduced dimensional complexity and generated a useful segmentation.
This is a valuable analytical step for profiling Premier League of Bosnia and
Herzegovina teams, where tactical style and available resources vary greatly.
Unsupervised learning can be applied here to reveal structural patterns
independent of outcome labels.

5. Supervised learning – predictive modelling

Machine learning models were subsequently trained to predict team performance.
The supervised learning algorithms Random Forest and XGBoost were selected for
their effectiveness with small and medium-sized tabular data, their ability to
identify non-linear relations, and their feature importance rankings.
The models were optimized to make predictions on both ordinal outcomes (e.g.,
end-of-season league position) and binary classification problems (e.g., relegation).
They were tuned with cross-validation and grid search, and they are fully
explained in later sections of this chapter.

6. Evaluation and interpretation

After model construction, quantitative performance metrics such as R-squared (R²)
for regression and accuracy, precision, recall, and F1-score for classification were
used to measure predictive performance. Furthermore, feature importance scores were
obtained from the models to understand which metrics, such as goals conceded,
possession, or disciplinary indicators, most strongly affected results.
The modeling outputs were interpreted pragmatically, translating statistical
results into actionable recommendations for analysts, coaches, and federation
administrators. This helped the research balance computational complexity with
real-world application.

7. Player-level predictive modelling
In addition to team-level modeling, the study also involved individual player
modeling. Individual player data were prepared and preprocessed separately for
attackers, midfielders, and defenders, and predictive models were trained to
classify players on binary outcomes: whether they had a high foul rate, a high
assist rate, or high goal involvement.
Logistic Regression and Random Forest classifiers were employed to make these
predictions, which were assessed by AUC, precision, recall, and F1-score. This
level of granularity allowed individual performance to be linked with more
general team success, adding another layer of information and further support
for the study's two-pronged approach to performance assessment.
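Step 7 above can be illustrated for the defender case with a short scikit-learn sketch. Everything here is assumed for illustration: the two input features, the synthetic data, and the use of the median foul rate as the cut-off for a "high-fouling" player.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(7)

# Synthetic defenders: columns are tackles and duels per 90 minutes (made up).
n_players = 300
X = rng.normal(loc=[2.0, 5.0], scale=[0.8, 1.5], size=(n_players, 2))
fouls_per_90 = 0.6 * X[:, 0] + 0.1 * X[:, 1] + rng.normal(scale=0.3, size=n_players)

# Binary target: 1 if the player is above the median foul rate (assumed threshold).
threshold = np.median(fouls_per_90)
y = (fouls_per_90 > threshold).astype(int)

# Stratify so both classes keep the same share in the train and test splits.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=0
)
clf = LogisticRegression().fit(X_train, y_train)

# AUC is computed from predicted probabilities of the positive class.
auc = roc_auc_score(y_test, clf.predict_proba(X_test)[:, 1])
print(round(auc, 3))
```

The same pattern would apply to midfielders and forwards by swapping the features and the target definition; a Random Forest classifier could replace the logistic regression without changing the evaluation code.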

In summary, the strategy is built on an integrated approach that incorporates data
engineering, exploratory analysis, supervised and unsupervised learning, and performance
measurement. The multi-level structure, extending to teams and individual players,
demonstrates how modern data science can be adapted to extract insights from
under-examined football environments, thereby adding to the overall body of sports
analytics.

3.2. Data collection and preprocessing

3.2.1 Data source and extraction

The data used in this study came from SofaScore, a highly rated sports statistics
website widely known for detailed live data and long-term statistics across many
football leagues. Structured team-level performance data for the 2023–2024 season of
the Premier League of Bosnia and Herzegovina were used. SofaScore's public API delivered
the data in JSON (JavaScript Object Notation) format, a widely used web data exchange
format valued for its flexibility, light weight, and support for hierarchical data [12].

JSON was chosen as the initial data structure because it effectively supports storage
and retrieval of nested data and neatly organizes statistics such as goals scored, fouls,
clean sheets, and possession. This structure proved helpful in preserving the
relationships between match statistics, player attributes, and teams.

Upon extraction, the raw JSON files were parsed into CSV (Comma-Separated Values)
and Excel (XLSX) formats using Python to make them compatible with mainstream data
manipulation tools such as Microsoft Excel and the pandas library. This made the data
tabular and easier to visualize, clean, and analyze, as is common in applied machine
learning workflows.
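The JSON-to-CSV conversion described above can be sketched with the Python standard library alone. The payload below is hypothetical and much simpler than a real export; its field layout is an assumption and does not reproduce SofaScore's actual schema.

```python
import csv
import io
import json

# Hypothetical payload; the real SofaScore field layout is not reproduced here.
raw = """
[
  {"team": {"name": "Example FC", "slug": "example-fc"},
   "statistics": {"goalsScored": 48, "fouls": 310}},
  {"team": {"name": "Sample United", "slug": "sample-united"},
   "statistics": {"goalsScored": 35, "fouls": 290}}
]
"""


def flatten(obj, prefix=""):
    """Turn nested dicts into one flat dict with dotted keys."""
    flat = {}
    for key, value in obj.items():
        name = f"{prefix}{key}"
        if isinstance(value, dict):
            flat.update(flatten(value, prefix=f"{name}."))
        else:
            flat[name] = value
    return flat


rows = [flatten(record) for record in json.loads(raw)]

# An in-memory buffer keeps the sketch self-contained; a real run would
# open a file with newline="" instead.
buffer = io.StringIO()
writer = csv.DictWriter(buffer, fieldnames=list(rows[0]))
writer.writeheader()
writer.writerows(rows)
csv_text = buffer.getvalue()
print(csv_text)
```

Flattening each record into a single row with dotted column names keeps one team per row, which is the shape that spreadsheet tools and pandas expect.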

The conversion from JSON to tabular form also paved the way for subsequent steps:
feature engineering, correlation analysis, clustering, and supervised learning. At this
stage, the dataset included team-level data points along with derived player-level
statistics, which were handled in separate modeling pipelines for each positional
group.

Figure 1 Data extracted in JSON format for Premier league of BiH

Figure 2 Data for one specific team

Figure 3 Data extracted in JSON format for midfield players

Figure 4 Data extracted in JSON format for defense players

Figure 5 Data extracted in JSON format for attack players

For thorough examination, effective preprocessing, and subsequent modeling, the
obtained JSON files were systematically transformed into tabular representations, first
to CSV and then to Microsoft Excel spreadsheets. This step was indispensable because
such representations are readily supported by most common data analysis tools, from
manual inspection in Microsoft Excel to programmatic manipulation with Python's pandas
library. The tabular structure provided a well-organized format for finding
inconsistencies in the data, carrying out preprocessing, and constructing machine
learning pipelines.
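
The JSON-to-CSV conversion described above can be illustrated with a minimal sketch. The payload below is hypothetical (the actual SofaScore response layout and values are not reproduced here), and the helper name flatten is introduced only for this example. Joining nested keys with a double underscore yields column names of the kind reported in this study, such as __team__name.

```python
import csv
import io
import json


def flatten(node, prefix=""):
    """Recursively flatten nested dicts into one level, joining keys with '__'."""
    flat = {}
    for key, value in node.items():
        name = f"{prefix}__{key}"
        if isinstance(value, dict):
            flat.update(flatten(value, name))
        else:
            flat[name] = value
    return flat


# Hypothetical payload: team names and all numbers are placeholders.
raw = json.loads("""
[
  {"team": {"name": "Team A", "slug": "team-a"},
   "statistics": {"goalsScored": 40, "fouls": 410, "matches": 33}},
  {"team": {"name": "Team B", "slug": "team-b"},
   "statistics": {"goalsScored": 28, "fouls": 455, "matches": 33}}
]
""")

rows = [flatten(record) for record in raw]

# Write the flattened rows as CSV (to an in-memory buffer for this example).
buffer = io.StringIO()
writer = csv.DictWriter(buffer, fieldnames=list(rows[0].keys()))
writer.writeheader()
writer.writerows(rows)
csv_text = buffer.getvalue()
```

In practice the same result could be obtained with a library routine for normalizing nested JSON; the hand-written version is shown only to make the flattening step explicit.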

3.2.2 Raw data overview

The initial dataset consisted of semi-structured, nested fields in three main
categories: team identifiers (name, slug, gender), performance statistics (goalsScored,
interceptions, fouls), and contextual metadata (entity type, national association, number
of matches). These were collected in JSON format through the SofaScore API. However,
when the hierarchical JSON structure was flattened into an Excel-readable format, the
resulting dataset contained structural anomalies. In particular, the conversion produced a
sparsely populated table with duplicate, alternating rows and missing (null) values, which
reduced readability and usability for further processing.

One of the most significant challenges was the irregular column naming and formatting.
Variables carried long hierarchical prefixes, for example __team__name,
__statistics__goalsScored, and __statistics__matches, which made direct interpretation
difficult. Nested attributes were also not automatically expanded into separate columns,
so the data had to be restructured and realigned. Many entries were duplicated across
rows, and merging and deduplication were required to maintain data integrity.

Figure 6 Example of data which was not cleaned

In addition to the team-level dataset, three player-level datasets were acquired and
manually structured in Excel. The datasets covered the key playing positions: defenders,
midfielders, and attackers. Each category contained domain-specific features
characteristic of the role on the pitch. For instance:

- Defenders were described by variables such as duels won, yellow/red cards,
and interceptions;
- Midfielders were described by variables such as assists, key passes, and pass
accuracy;
- Attackers were described by metrics such as total goals, shots, and overall goal
involvement.

These player-level data sets were subsequently used in binary classification problems to
determine if a player belonged to the high-performance group in their position. The
cleaned data allowed for more sophisticated, role-specific predictive modeling at later
analysis stages.

3.2.3 Data cleaning and preparation

To ensure the integrity, consistency, and analytical readiness of the dataset, a rigorous,
step-by-step data cleaning procedure was carried out. Because the input data were
semi-structured and derived from JSON, several preprocessing steps were needed to
transform the dataset into a clean tabular form suitable for statistical analysis and
machine learning.

The major operations at the cleaning stage were:

Duplicate and null row removal: The flattening process created duplicate rows and null
entries. All duplicate observations and structurally empty rows were identified and
eliminated to reduce noise and maintain data consistency.

Column name normalization: Field names inherited verbose and hierarchical identifiers
such as __team__name or __statistics__goalsScored, based on the nested structure of the
JSON format. These were replaced with more readable names such as team_name or
goals_scored for easier code-based manipulation.

Record merging and consolidation: When data for a given team were spread over
multiple rows, the records were merged into a single row per team. This ensured exactly
one observation per team, as is appropriate for a cross-sectional dataset covering a
single season.

Data type conversion: All the numeric fields (e.g., goals, red/yellow cards, passes, fouls)
were converted from generic object or string data types to correct numeric types (integer
or float). The reason for the conversion was to enable mathematical operations and
statistical modeling.

Categorical label standardization: Inconsistent labels in categorical variables, such as
team names, relegation status (e.g., yes, Yes, 1), and position categories, were normalized
to a uniform coding scheme. For example, the binary outcome variable "relegated" was
coded as 1 for relegated and 0 for non-relegated teams.

Logical order of columns: To aid interpretability and later analysis, variables were
reordered logically, progressing from team identifiers to core match statistics and then
to outcome measures (e.g., final league ranking, relegation).
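
The cleaning operations listed above can be sketched with pandas as follows. This is an illustrative reconstruction under stated assumptions, not the exact script used in this study: the input frame, the column __statistics__yellowCards, the renaming map, and the label map are all hypothetical.

```python
import pandas as pd

# Hypothetical flattened export: each team is split over two sparse rows,
# numbers arrive as strings, and the relegation label is inconsistent.
raw = pd.DataFrame({
    "__team__name": ["Team A", "Team A", "Team B", "Team B", None],
    "__statistics__goalsScored": ["40", None, "28", None, None],
    "__statistics__yellowCards": [None, "61", None, "80", None],
    "relegated": ["no", None, "Yes", None, None],
})

# 1. Drop structurally empty rows and exact duplicates.
df = raw.dropna(how="all").drop_duplicates()

# 2. Normalize column names (mapping chosen for this example).
df = df.rename(columns={
    "__team__name": "team_name",
    "__statistics__goalsScored": "goals_scored",
    "__statistics__yellowCards": "yellow_cards",
})

# 3. Merge split records: keep the first non-null value per column and team.
df = df.groupby("team_name", as_index=False).first()

# 4. Convert numeric fields from strings to numeric dtypes.
for column in ["goals_scored", "yellow_cards"]:
    df[column] = pd.to_numeric(df[column])

# 5. Standardize the relegation label to 1/0.
label_map = {"yes": 1, "1": 1, "no": 0, "0": 0}
df["relegated"] = df["relegated"].astype(str).str.lower().map(label_map).astype(int)

# 6. Reorder: identifiers, match statistics, outcome.
df = df[["team_name", "goals_scored", "yellow_cards", "relegated"]]
```

Merging with a group-wise "first non-null" rule assumes that the split rows for a team never disagree on a value; conflicting entries would need to be inspected by hand.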

Once the cleaning process was complete, the final structured dataset contained 32 clearly
defined variables across 12 records, corresponding to the 12 teams that competed in the
2023–2024 season of the Premier League of Bosnia and Herzegovina. The cleaned dataset
served as input to all the subsequent steps of descriptive analysis, clustering, and
predictive modeling performed in this study.

Figure 7 Example of cleaned and preprocessed data

3.2.4 Feature engineering

To further improve the predictive capability of the dataset and support more informative
machine learning applications, several new features were derived from the original raw
statistical variables. This process, known as feature engineering, transforms existing
data into more meaningful representations, in line with common practice in football
analytics research.

The newly engineered features were:

Goal difference: The difference between goals scored and goals conceded
(goal_difference = goals_scored - goals_conceded). This is a standard statistic used to
assess a team's attack-defense balance throughout the season.

Disciplinary index: An aggregate score of team discipline, which is computed by
assigning weighted points to yellow cards and red cards. Red cards are typically weighted
more heavily to reflect their greater impact on the match.

Penalty efficiency: The number of penalty goals over penalties taken (penalty_efficiency
= penalty_goals / penalties_taken). This measures how efficient a team is at converting
penalty chances.

Average possession (Standardized): Match possession percentages were standardized to a
common format to enable comparison between teams.

Binary relegation label: A binary target variable relegation was utilized, in which
relegation was labeled as 1 for relegated teams and 0 for non-relegated teams. The label
was required by the classification models that attempted to predict relegation chances.
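
As a minimal sketch, the derived features above can be computed for a single team record as shown below. The function name engineer_features, the input values, and in particular the card weights (1 for a yellow card, 3 for a red card) are assumptions made for illustration; the text above does not fix the weights used for the disciplinary index.

```python
def engineer_features(team, yellow_weight=1, red_weight=3):
    """Return derived features for one team record (a plain dict)."""
    taken = team["penalties_taken"]
    return {
        "goal_difference": team["goals_scored"] - team["goals_conceded"],
        # Weighted card count; the weights are illustrative assumptions.
        "disciplinary_index": yellow_weight * team["yellow_cards"]
        + red_weight * team["red_cards"],
        # Guard against teams that took no penalties.
        "penalty_efficiency": team["penalty_goals"] / taken if taken else None,
        "relegated": 1 if team["relegated"] else 0,
    }


# Hypothetical team record; none of these values come from the dataset.
team = {
    "goals_scored": 40, "goals_conceded": 35,
    "yellow_cards": 60, "red_cards": 2,
    "penalty_goals": 3, "penalties_taken": 4,
    "relegated": False,
}
features = engineer_features(team)
```

Returning None for penalty efficiency when no penalties were taken keeps the division well defined; such entries would then be handled by the missing-value step of the preprocessing.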

These features were selected on the basis of prior empirical findings that they are highly
relevant to football team performance and league outcome prediction [13]. Including them
in the final dataset allowed the analysis to cover a wider range of performance dynamics
and improved the interpretability of the models.

3.2.5 Preprocessing techniques

Before the dataset was fed into machine learning models, preprocessing steps were
applied to standardize it, improve model performance, and reduce the impact of noise.
These steps included:

Normalization: Numerical features such as number of passes, total shots, and possession
rates were normalized to a common numerical range where necessary. This step helps
algorithms, especially distance-based models, handle variables equally regardless of scale.

Missing value handling: Missing values within the dataset were either removed if found
to be non-critical or imputed using suitable techniques (e.g., median or mode imputation)
such that model training was not compromised or skewed in any manner.

Label encoding: The categorical target variable "relegated" was mapped into binary
format as 1 for "yes" and 0 for "no." The same was also applied for player-level targets
like high_fouler, high_assist_rate, and goal_involvement.

Feature selection: Irrelevant or uninformative features like jerseys, jersey numbers, and
URLs were removed to reduce dimensionality and prevent model overfitting.

The same preprocessing was applied for player-level datasets used for position-based
modeling (defenders, midfielders, forwards). The additional steps for those datasets were:

Class balance maintenance: Because of the limited sample size, threshold levels (e.g.,
what defines a "high-assist" player) were determined by median splits to balance records
evenly across classes.

Stratified k-Fold Cross-Validation: For robust and unbiased model estimation, stratified
k-fold cross-validation was applied. This process preserved class ratios across training and
validation sets, which is especially critical in imbalanced or sparse datasets.

The final result of these preprocessing steps was a sanitized, normalized dataset ready for
exploratory data analysis, clustering, and supervised machine learning. This was the
analytic basis for the empirical modeling framework of the study.
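
The median-split labelling and stratified k-fold procedure described above can be sketched in plain Python as follows. The assist values are hypothetical, and the helper names median_split and stratified_folds are introduced only for this example; in practice a library routine such as scikit-learn's StratifiedKFold would typically be used, and the text does not state which implementation was applied.

```python
import random
from statistics import median


def median_split(values):
    """Label each value 1 if it lies above the median, otherwise 0."""
    cutoff = median(values)
    return [1 if v > cutoff else 0 for v in values]


def stratified_folds(labels, k, seed=0):
    """Assign every sample to one of k folds, spreading each class evenly."""
    rng = random.Random(seed)
    fold_of = [None] * len(labels)
    for cls in sorted(set(labels)):
        members = [i for i, y in enumerate(labels) if y == cls]
        rng.shuffle(members)
        for position, index in enumerate(members):
            fold_of[index] = position % k
    return fold_of


# Hypothetical assist rates for ten midfielders.
assist_rate = [0.05, 0.30, 0.12, 0.22, 0.08, 0.40, 0.18, 0.02, 0.27, 0.15]
high_assist_rate = median_split(assist_rate)
folds = stratified_folds(high_assist_rate, k=5)

for fold in range(5):
    validation = [i for i, f in enumerate(folds) if f == fold]
    training = [i for i, f in enumerate(folds) if f != fold]
    # With five players per class and five folds, each validation fold
    # holds exactly one player from each class.
    print(fold, validation, [high_assist_rate[i] for i in validation])
```

Because the threshold is the median, the two classes are balanced by construction, and the fold assignment keeps that balance in every training and validation split.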

4. RESULTS AND ANALYSIS

This chapter presents the results of the machine learning and data analysis experiments
on player and team performance data from the Premier League of Bosnia and Herzegovina.
Statistical and computational procedures are used to identify significant patterns,
examine hypotheses, and assess the predictive capability of the chosen features.

The chapter is divided into several sections. The dataset is first explored using
descriptive statistical analysis, after which a correlation analysis is conducted to
determine relationships between important performance indicators. Teams are then grouped
according to their performance profiles using clustering techniques, which reveal
strategic trends within the league. The chapter subsequently models defensive behavior
at the player level, predicting foul tendencies with supervised machine learning and
comparing the performance of various classification algorithms. Each section includes
interpretation and visualizations that support the analytical outcomes and point to
their practical use.

The findings presented below form the empirical foundation upon which to answer the
research questions posed in Chapter 1 and frame the broader conclusions and implications
reported in Chapter 5.

4.1. Descriptive statistical analysis

To establish a preliminary picture of the dataset and its statistical properties, descriptive
statistical analysis was conducted on team performance statistics in the 2023–2024 season
of the Bosnia and Herzegovina Premier League. This procedure provides an indication of
the nature and distribution of the significant variables to identify trends, outliers, and
anomalies that can influence subsequent analysis.

Descriptive statistics such as the mean, median, standard deviation, and minimum and
maximum values were computed for all key numerical variables. These included, among
others, goals scored, goals conceded, average possession, fouls, yellow and red cards,
clean sheets, duels won, interceptions, and penalty goals. The interquartile range (IQR)
and the corresponding 25th and 75th percentiles were also computed to aid outlier
detection and to show variability across the league.
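
A minimal sketch of these summary statistics for a single variable is given below. The twelve values are hypothetical (only the range of 45 to 92 yellow cards is taken from this study), and the 1.5 × IQR rule is shown as one common convention for flagging outliers, not necessarily the rule applied here.

```python
from statistics import mean, median, quantiles, stdev

# Hypothetical yellow-card totals for 12 teams (only the 45-92 range is from the text).
yellow_cards = [45, 52, 58, 60, 63, 66, 70, 72, 75, 81, 88, 92]

q1, q2, q3 = quantiles(yellow_cards, n=4, method="inclusive")
iqr = q3 - q1

summary = {
    "mean": mean(yellow_cards),
    "median": median(yellow_cards),
    "std": stdev(yellow_cards),  # sample standard deviation
    "min": min(yellow_cards),
    "max": max(yellow_cards),
    "q1": q1,
    "q3": q3,
    "iqr": iqr,
}

# Values beyond 1.5 * IQR from the quartiles are flagged as potential outliers.
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr
outliers = [v for v in yellow_cards if v < lower or v > upper]
```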

Main observations from the descriptive summary are:

- FK Velež Mostar recorded the highest average possession rate (56.5%), which
suggests a strongly possession-based playing style.
- FK Željezničar received the most red cards (4), and NK GOŠK Gabela received
the most yellow cards (92), which could suggest disciplinary problems.
- HŠK Zrinjski Mostar was the most prolific attacking team with 76 goals, well
above the league average.
- Defensively, FK Igman Konjic and FK Zvijezda 09 were among the leading teams
in interceptions, with 1,472 and 1,359, respectively.

Further observations:

Disciplinary issues were more pronounced at certain clubs. Yellow cards ranged from 45
to 92 and red cards from 0 to 4, reflecting varying degrees of tactical fouling or
indiscipline.

Clean sheets, which indicate defensive solidity, ranged from 4 to 15. Notably, FK Borac
Banja Luka, FK Velež Mostar, and HŠK Zrinjski Mostar all had 15 clean sheets, indicative
of well-organized defenses. Penalty goals ranged from 1 to 11, with HŠK Zrinjski Mostar
leading again, which further indicates the strategic importance of set-piece conversion.

These summary results not only give initial insight into team tendencies and behaviors but
also justify the choice of features for the subsequent correlation analysis and machine
learning modeling. For example, teams with higher ball possession, such as FK Velež
Mostar, tend to play more offensively, whereas poor disciplinary records, as at NK GOŠK
Gabela, may hinder sustained team performance.

4.2. Correlation analysis

Correlation analysis was used to identify strong relationships among the core team
performance metrics and between those metrics and markers of success such as league
standing, goals scored, and clean sheets. By measuring linear relationships between
variables, the analysis provides a statistical foundation for selecting predictive
features when developing machine learning models. The Pearson correlation coefficient (r)
was used to test for linear relationships among continuous variables. This coefficient
ranges from:

• +1.00 (perfect positive correlation), through
• 0.00 (no linear correlation), to
• -1.00 (perfect negative correlation).

In this study, correlations with an absolute value above 0.70 were considered strong and
of practical relevance.

Feature 1                  Feature 2                  Correlation (r)

Accurate Passes            Average Ball Possession    0.94
Penalties Taken            Penalty Goals              0.91
Goals Scored               Shots                      0.91
Shots                      Corners                    0.90
Accurate Crosses           Corners                    0.79
Average Ball Possession    Shots                      0.79
Goals Conceded             Interceptions              0.78
Goals Scored               Corners                    0.78
Clean Sheets               Average Ball Possession    0.76
Clean Sheets               Accurate Crosses           0.76
Average Ball Possession    Accurate Crosses           0.75
Shots                      Accurate Passes            0.74
Table 1 Correlation matrix of team performance metrics
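
The way such a table of strong pairs can be produced is sketched below: the Pearson coefficient is computed for every pair of metrics and only pairs with |r| > 0.70 are kept. The metric values are hypothetical and cover six imaginary teams, so the resulting coefficients are not those reported in Table 1.

```python
from itertools import combinations
from math import sqrt


def pearson_r(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / sqrt(var_x * var_y)


# Hypothetical values for six teams, for illustration only.
metrics = {
    "shots": [310, 355, 402, 288, 450, 371],
    "corners": [120, 138, 160, 115, 181, 150],
    "interceptions": [1400, 1330, 1210, 1450, 1100, 1290],
}

strong_pairs = []
for a, b in combinations(metrics, 2):
    r = pearson_r(metrics[a], metrics[b])
    if abs(r) > 0.70:  # threshold used in this study
        strong_pairs.append((a, b, round(r, 2)))
```

With only 12 teams in the league, coefficients of this kind rest on few observations and are best read as descriptive rather than as evidence of causal effects.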

• Important negative correlations:

Clean sheets vs. Goals conceded: r = -0.94

This indicates that teams with more clean sheets concede fewer goals overall, an expected
trend that the data confirm.

Shots vs. Interceptions: r = -0.88

This implies that teams with more defensive interceptions tend to take fewer shots,
potentially because they play more reactively.

• Observations and implications:

Ball possession and passing accuracy are not only highly correlated with each other but
are also linked with greater shooting activity and attacking success. Teams that excel in
these areas tend to score more goals and keep more clean sheets.

Set pieces appear to matter: the close relationship between penalties taken and penalty
goals (r = 0.91) shows that teams which win penalties generally convert them, so play in
the final third that generates such opportunities can influence match outcomes. On the
defensive side, clean sheets correlate inversely with goals conceded (r = -0.94), whereas
interceptions correlate positively with goals conceded (r = 0.78), which suggests that a
high interception count reflects sustained defensive pressure rather than solidity.

Taken together, these findings confirm the strategic relevance of these measures. The
strong positive and negative correlations attest to the complexity of team performance,
in which both attacking creativity and defensive solidity are required for success. The
findings strongly support including these variables in the predictive modeling exercises
presented in later chapters.

4.3. Distribution analysis

To prepare the data for predictive modeling, the distributional properties of the
performance metrics need to be understood. Distribution analysis reveals issues such as
skewness, kurtosis, outliers, and non-normality, which can adversely affect model
assumptions and performance if not addressed.

4.3.1 Skewness measurement

Skewness is a measure of the asymmetry of a distribution about its mean. In football
analytics, skewness is used to check whether a feature has most of its values
concentrated on one side and a long tail on the other. Skewness is interpreted as follows:

Skewness = 0: Perfectly symmetric distribution

Skewness > 0: Right-skewed (longer right tail)

Skewness < 0: Left-skewed (longer left tail)

This study considers absolute skewness measures greater than 1 (|skewness| > 1) as signs
of extreme skewness, thus necessitating potential transformation to meet machine learning
objectives. The computed values of the skewness of performance measures are presented
in the following table:

Metric                     Skewness

Interceptions              -0.207
Red Cards                  0.570
Yellow Cards               -0.200
Clean Sheets               -0.231
Penalty Goals              2.605
Penalties Taken            2.605
Fouls                      -0.043
Corners                    0.033
Successful Dribbles        0.404
Shots                      0.771
Accurate Crosses           0.425
Accurate Long Balls        0.196
Accurate Passes            -0.051
Average Ball Possession    0.076
Goals Conceded             0.476
Goals Scored               1.260
Number of Matches          0.000
Awarded Matches            0.000
Table 2 Skewness values of key performance metrics
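
For reference, a skewness value of this kind can be computed as sketched below. The sketch uses the bias-adjusted Fisher-Pearson estimator, which is the form reported by pandas' Series.skew; the text does not state which estimator was used, so this choice is an assumption. The input values are hypothetical and do not reproduce the figures in Table 2.

```python
from math import sqrt


def sample_skewness(values):
    """Bias-adjusted Fisher-Pearson sample skewness."""
    n = len(values)
    mean = sum(values) / n
    m2 = sum((v - mean) ** 2 for v in values) / n
    m3 = sum((v - mean) ** 3 for v in values) / n
    if m2 == 0:
        return 0.0  # constant column, e.g. number of matches
    g1 = m3 / m2 ** 1.5
    return g1 * sqrt(n * (n - 1)) / (n - 2)


# Hypothetical penalty-goal counts for 12 teams: one club far ahead of the rest.
penalty_goals = [1, 1, 2, 2, 2, 3, 3, 3, 3, 4, 4, 11]
# Hypothetical constant column (every team plays the same number of matches).
matches = [33] * 12

for name, column in {"penalty_goals": penalty_goals, "matches": matches}.items():
    skew = sample_skewness(column)
    flag = "high" if abs(skew) > 1 else "ok"
    print(f"{name}: skewness={skew:.3f} ({flag})")
```

A constant column has no spread, which is why variables such as the number of matches appear with a skewness of 0.000.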

4.3.2 Interpretation of skewed features

Three variables show high positive skewness:

Penalty goals and Penalties taken (Skewness = 2.605):

These distributions are dominated by a few clubs, most notably HŠK Zrinjski Mostar, which
won a disproportionately large number of penalties and converted most of them.

Goals scored (Skewness = 1.260):

This indicates a small number of high-scoring clubs well above the league average, giving
a distribution with a long right tail. These skewed variables point to performance
imbalances in the league, with a few dominant teams enjoying an advantage in goal scoring
and set-piece play.

High skewness can be addressed before modeling by applying logarithmic or Box-Cox
transformations. These make distributions more symmetric, reduce the impact of outliers,
and help algorithms that are sensitive to the distribution of their inputs (e.g.,
logistic regression) perform better.
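
The effect of a logarithmic transformation on a right-skewed variable can be illustrated as follows. The goal totals are hypothetical, log1p (the logarithm of 1 + x) is used so that zero counts remain defined, and the skewness estimator is the same bias-adjusted form assumed elsewhere in this chapter. A Box-Cox transformation, available for example as scipy.stats.boxcox, requires strictly positive inputs and is not shown here.

```python
from math import log1p, sqrt


def sample_skewness(values):
    """Bias-adjusted Fisher-Pearson sample skewness."""
    n = len(values)
    mean = sum(values) / n
    m2 = sum((v - mean) ** 2 for v in values) / n
    m3 = sum((v - mean) ** 3 for v in values) / n
    if m2 == 0:
        return 0.0
    return (m3 / m2 ** 1.5) * sqrt(n * (n - 1)) / (n - 2)


# Hypothetical season goal totals for 12 teams with a long right tail.
goals_scored = [25, 28, 30, 31, 33, 34, 36, 38, 40, 45, 58, 76]

# log1p compresses large values more than small ones, shortening the right tail.
goals_log = [log1p(g) for g in goals_scored]

skew_before = sample_skewness(goals_scored)
skew_after = sample_skewness(goals_log)
print(f"before: {skew_before:.2f}, after: {skew_after:.2f}")
```

The transformation reduces the skewness in this example but does not remove it entirely, so the transformed distribution should still be inspected before modeling.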

4.4. Discipline analysis

This section investigates the impact of disciplinary actions (yellow and red cards) on
overall team performance during the 2023/24 season of the Premier League of Bosnia and
Herzegovina. The analysis examines the relationship between disciplinary behavior and core
measures of performance such as goals scored, goals conceded, and final league position.

To quantify these relationships, Pearson correlation coefficients were calculated between
the counts of red and yellow cards and selected performance metrics. Particular attention
was paid to outliers so that the correlation results could be illustrated with
representative team examples.

The resulting correlation matrix is the following:

Metric             Yellow Cards    Red Cards    Goals Conceded    Goals Scored    League Position

Yellow Cards       1.000           0.385        0.682             -0.605          0.702
Red Cards          0.385           1.000        0.116             -0.007          0.156
Goals Conceded     0.682           0.116        1.000             -0.516          0.946
Goals Scored       -0.605          -0.007       -0.516            1.000           -0.733
League Position    0.702           0.156        0.946             -0.733          1.000
Table 3 Correlation matrix – Disciplinary metrics and team performance

Two teams stood out in terms of disciplinary records:

• NK GOŠK Gabela recorded the highest number of yellow cards (92).

• FK Željezničar received the most red cards (4).

4.4.1 Interpretation of correlations

The correlation between yellow cards and goals conceded was r = 0.682, a moderately strong
positive correlation. Teams that received more yellow cards also conceded more goals,
possibly because booked players defend more cautiously or because defensive control was
already lacking.

The correlation between yellow cards and goals scored was r = -0.605, a moderate
negative correlation. This suggests that teams with more yellow cards scored fewer goals,
possibly because play was interrupted more often or key players were suspended.

Yellow cards and league position had a correlation of r = 0.702. Because a larger league
position number means a lower finish, this strong positive correlation indicates that
poorer disciplinary behavior (more yellow cards) is associated with worse overall league
performance.

The red cards exhibited weaker correlations:

• Red cards correlated with goals conceded at r = 0.116, a weak positive correlation.
• The correlation with goals scored was virtually zero (r = -0.007), suggesting no
meaningful relationship.
• The correlation with league position was r = 0.156, again weak and not statistically
significant.

4.4.2 Team-level insights

NK GOŠK Gabela, having received the most yellow cards, also finished low in the table
and conceded many goals. This is consistent with the statistical results and suggests that
frequent yellow cards can disrupt a team's rhythm and defensive organization.
FK Željezničar, who received the most red cards, fared comparatively better, which is
consistent with the weak overall correlation between red cards and team performance.

4.4.3 Conclusion

The results show that yellow cards are more strongly associated with team performance
than red cards. High yellow card counts correlate with more goals conceded, fewer goals
scored, and worse league positions. This highlights the importance of maintaining
discipline over the course of a season, not only to avoid temporary setbacks but also to
support overall team consistency and competitiveness. In comparison, red cards, while
consequential within a single match, appear too infrequent to show a meaningful
relationship at the season level.

4.5. Ball possession analysis

This section explores the relationship between team ball possession and team performance
throughout the Premier League of Bosnia and Herzegovina during the 2023/24 season.
Ball possession is a tactical aspect of contemporary football associated with control,
territorial acquisition, and attacking quality [14]. The research examines the way
possession correlates to three essential measures of performance: goals scored, goals
conceded, and final league position.

To quantify these relationships, Pearson correlation coefficients between average ball
possession and the selected performance measures were calculated. The teams with the
highest and lowest average possession were also identified to illustrate how possession
extremes relate to overall performance outcomes.

Metric                  Avg. Possession    Goals Scored    Goals Conceded    League Position

Avg. Ball Possession    1.000              0.673           -0.716            -0.770
Goals Scored            0.673              1.000           -0.516            -0.733
Goals Conceded          -0.716             -0.516          1.000             0.946
League Position         -0.770             -0.733          0.946             1.000
Table 4 Correlation matrix – Ball possession and performance metrics

4.5.1 Interpretation of correlations

The correlation between average possession and goals scored was r = 0.673, a moderately
strong positive relationship. Teams with larger possession percentages scored more goals,
which is consistent with the attack benefiting from retaining the ball.

The correlation between average possession and goals conceded was r = -0.716, a strong
negative relationship. The more possession teams had, the fewer goals they conceded,
quite likely because they limited the opposing team's time on the ball and thus their
own exposure.

Average possession and league position correlated at r = -0.770. As higher values of league
position correspond to lower table positions, this strong negative correlation means that
the teams with more possession also finished higher in the league table. These
correlations underline the significance of possession as a factor associated with both
attacking and defensive effectiveness.

4.5.2 Case examples

FK Velež Mostar had the highest average possession, at 56.5%. Such dominant possession
may have contributed to their success in creating attacking opportunities and maintaining
defensive structure, resulting in sound overall team performance. The lowest average
possession, 43.2%, was recorded by NK GOŠK Gabela. Lower possession reflects weaker match
control, which was associated with fewer goals scored, more goals conceded, and
consequently a poorer league finish.

4.5.3 Visual analysis

Three visual representations support the statistical findings:

Figure 8 Relationship between ball possession and goals scored

Figure 9 Relationship between ball possession and goals conceded

Figure 10 Relationship between ball possession and league position

These figures visually reinforce the strength and direction of the correlations, showing
clear linear trends consistent with the numerical analysis.

4.5.4 Conclusion

The result confirms that greater average ball possession is highly correlated with better
overall team performance. Teams that dominate possession have better odds of creating
scoring opportunities, stopping the opposing team from scoring, and achieving higher
league positions. The results concur with contemporary football strategies, which
emphasize the importance of dominance of the ball as the foundation for success. As such,
average possession is a crucial factor to be included in predictive models and performance
measurement systems.

4.6. Penalty analysis

This section investigates the impact of penalty statistics, i.e., penalty goals and
penalties taken, on overall team performance during the 2023/24 season of the Premier
League of Bosnia and Herzegovina [15]. The aim is to establish whether the ability to win
and convert penalties is related to key performance measures, including goals scored,
goals conceded, and final league position.

To study these relationships, Pearson correlation coefficients were calculated between
the penalty variables and selected performance measures. The teams that took and converted
the most penalties were also identified to put the findings into perspective.

The correlation matrix for these relationships is presented below:

Metric             Penalty Goals    Penalties Taken    Goals Scored    Goals Conceded    League Position

Penalty Goals      1.000            0.911              0.688           -0.347            -0.413
Penalties Taken    0.911            1.000              0.707           -0.433            -0.517
Goals Scored       0.688            0.707              1.000           -0.516            -0.733
Goals Conceded     -0.347           -0.433             -0.516          1.000             0.946
League Position    -0.413           -0.517             -0.733          0.946             1.000
Table 5 Correlation matrix – Penalty metrics and team performance

4.6.1 Interpretation of Correlations

The correlation between goals scored and penalty goals was r = 0.688, a moderately strong
positive correlation. This suggests that penalties account for a substantial share of
overall goal output for certain teams.

Figure 11 Relationship between penalty goals and goals scored

The correlation between goals conceded and penalty goals was r = -0.347, a weak-to-moderate
negative correlation. Teams that score more goals from penalties tend to concede fewer
goals, possibly due to superior tactical control and match dominance.

The correlation between penalty goals and league position was r = -0.413. Since a higher
league position number means a lower finish, this negative correlation indicates that
teams scoring more penalties tend to finish higher in the league table.

Figure 12 Relationship between penalty goals and league position

The correlation between penalties taken and goals scored was r = 0.707, a strong positive
correlation. This again points to the importance of penalties as a source of goals.

The correlation between penalties taken and goals conceded was r = -0.433, which shows
that teams that win more penalties also tend to concede fewer goals. This may be a
consequence of their capacity to maintain control in both attack and defense.

The correlation between penalties taken and league position was r = -0.517, a moderate
inverse relationship. Clubs that are awarded more penalties tend to achieve better final
rankings.

4.6.2 Case example: HŠK Zrinjski Mostar

HŠK Zrinjski Mostar is a prime example of the influence of penalty effectiveness. The
team took the most penalties in the league (12) and also scored the most penalty goals
(11), a conversion rate of about 92%. This high conversion rate contributed significantly
to the team's attacking output and overall league success. The ability both to win and to
convert penalties may therefore be regarded as an important competitive advantage.

Figure 13 Relationship between penalty success rate and league position

The figure indicates that teams with higher penalty success rates finish higher in the
league, again emphasizing the tactical importance of penalties in competitive football.

4.6.3 Conclusion

The evidence indicates that penalties play an important part in team success. The number
of penalties taken and the number converted are both positively correlated with goals
scored and negatively correlated with goals conceded and with league position number,
meaning that teams with more penalties tend to finish higher. These results indicate that
winning and converting penalties can be a key factor in league performance. Penalty
measures should therefore be considered as input variables in predictive modeling or
performance analysis of team results.

4.7. Offensive and defensive analysis

This section explores the relationship between offensive and defensive performance and
eventual team success in the 2023/24 Premier League of Bosnia and Herzegovina [16].
The focus is on three major indicators: goals scored (offensive capacity), goals conceded
(defensive solidity), and goal difference (net performance), with particular interest in
how their combined effect relates to final league position.

The analysis included:

• Teams with highest goals scored (best attack),

• Teams with lowest goals conceded (best defense), and
• Teams with highest and lowest goal differences.

The Pearson correlation coefficient was also used to measure the association between goal
difference and league position.

4.7.1 Key observations

• The top-scoring team was HŠK Zrinjski Mostar, with a total of 76 goals, the
strongest attacking output in the league.
• FK Borac Banja Luka had the best defense, having conceded just 26 goals
during the season.
• The best goal difference was again that of HŠK Zrinjski Mostar, at +49,
confirming their strength in both attack and defense.
• The side with the lowest goal difference was FK Zvijezda 09, at -39, which was in
line with their position near the bottom of the league table.

The correlation matrix below demonstrates the relationship between goal difference and
league position:

Metric             Goal Difference    League Position

Goal Difference    1.000              -0.976
League Position    -0.976             1.000
Table 6 Correlation matrix – Goal difference and league position

4.7.2 Interpretation

The correlation coefficient between goal difference and final league position was
r = -0.976, indicating a very strong negative correlation. That is, the larger the goal
difference (i.e., the more goals a team scores relative to those it concedes), the better
the final league position (i.e., the lower the numerical value of the league position).

This correlation highlights the importance of balancing attacking and defensive
performance. Teams with a positive goal difference are likely to be found in the top half
of the table, while teams with a negative goal difference are likely to be towards the
bottom or in the relegation places.

Figure 14 Relationship between goals scored and goals conceded

This figure illustrates the general trend that teams that score more goals tend to occupy
higher league positions, while teams that concede more goals usually finish towards the
bottom of the table.

Figure 15 Goal difference of each team in the 2023/24 season

The figure shows that teams with large positive goal differences finished higher up in the
table, whereas teams with negative goal differences struggled and tended to finish near
the relegation places.


Figure 16 Relationship between goal difference and league position

This visualization confirms the statistical analysis by clearly depicting the inverse
relationship between goal difference and final league ranking.

4.7.3 Conclusion

The findings of this analysis show that both attacking and defensive performance
contribute significantly to overall team success. Goal difference, which combines the two,
is a robust single measure of performance over the season. Teams that excel at both
scoring and defending achieve better league positions, while those unable to sustain both
are at a competitive disadvantage. Goal difference therefore emerges as an excellent
predictor for league performance modeling.

4.8. Relegation analysis

This section compares the performance indicators of relegated and non-relegated teams
during the 2023/24 season of the Premier League of Bosnia and Herzegovina. The aim is to
determine which key performance indicators are most directly linked with relegation risk,
and thereby which characteristics distinguish relegated teams from those that retain their
league status [17].

The following metrics were analyzed:

• Goals scored (offensive capability)

• Goals conceded (defensive effectiveness)

• Yellow cards and red cards (disciplinary behavior)

• Clean sheets (defensive consistency)

• Average ball possession (match control and dominance)

Individual average values of these measures were calculated separately for relegated and
non-relegated clubs. The results are given in the table below:

Metric Relegated Teams Non-Relegated Teams

Goals Scored 39.00 47.30
Goals Conceded 70.50 39.70
Yellow Cards 81.00 69.40
Red Cards 2.00 1.70
Clean Sheets 4.00 11.40
Average Ball Possession 47.10% 50.44%
Table 7 Average performance metrics of relegated vs. non-relegated teams

4.8.1 Key findings and interpretation

• Goals scored: Relegated teams scored considerably fewer goals on average (39.0)
than non-relegated clubs (47.3), indicating that a lack of attacking efficiency is
a primary factor in poor performance.

Figure 17 Distribution of goals scored – relegated vs. non-relegated teams

• Goals conceded: Relegated teams conceded far more goals on average (70.5) than
their non-relegated counterparts (39.7), demonstrating the role of defensive
vulnerability in relegation.

Figure 18 Distribution of goals conceded – relegated vs. non-relegated teams

• Disciplinary records: Relegated teams received more yellow cards on average
(81.0) than non-relegated teams (69.4), and marginally more red cards (2.0 versus
1.7). Although the difference in red cards is negligible, the difference in yellow
cards may reflect more frequent tactical fouling or poorer discipline among
players.

Figure 19 Distribution of yellow cards – relegated vs. non-relegated teams

Figure 20 Distribution of red cards – relegated vs. non-relegated teams

• Clean sheets: Relegated teams averaged only 4.0 clean sheets, compared to 11.4
for non-relegated teams. This large difference indicates that consistent defensive
solidity is one of the key determinants of survival.

• Ball possession: Relegated teams also had lower average ball possession (47.1%)
than non-relegated teams (50.44%). This suggests weaker match control and may
indicate an inability to maintain territorial or tactical control.

Figure 21 Average performance comparison – relegated vs. non-relegated teams

4.8.2 Conclusion

The comparison shows that relegated teams consistently underperformed on several
important performance metrics. They:

• Scored fewer goals,

• Conceded more goals,

• Accumulated more disciplinary sanctions,

• Maintained fewer clean sheets, and

• Controlled less possession overall.

These findings indicate that relegation most often results from cumulative attacking and
defensive weaknesses, typically compounded by disciplinary problems. The metrics
discussed here can serve as early warning indicators for teams at risk of relegation and
are especially applicable to predictive modeling and planning in subsequent work.

4.9. Clustering analysis

This section reports on a clustering analysis conducted on team performance data for the
2023/24 season of the Bosnia and Herzegovina Premier League. The aim was to cluster
teams into performance-based groups so that teams with similar playing styles and
statistical profiles could be identified. This unsupervised learning method sheds light on
structural patterns between teams and highlights the performance indicators that
differentiate successful and underperforming teams [18].

4.9.1 Methodology

Clustering was carried out using the K-means algorithm, a widely used type of
unsupervised learning that separates observations into k groups by feature similarity. The
following statistics were utilized in the analysis:

• Goals scored

• Goals conceded

• Yellow cards

• Red cards

• Clean sheets

• Average ball possession

• Accurate passes

• Interceptions

• Fouls

• Corners

Prior to clustering, the dataset was normalized so that all variables contributed equally
to the clustering process. The optimal number of clusters was determined using the
Elbow method, which identifies the point beyond which adding more clusters yields
progressively smaller reductions in within-cluster variance.

With this method, the optimal number of clusters was determined to be three, representing
distinct team performance profiles.
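The normalization and Elbow steps described above can be sketched as follows. This is a minimal, standard-library illustration of z-score scaling followed by plain K-means (Lloyd's algorithm), not the thesis implementation. The helper names standardize, sq_dist and kmeans, the random seeds, and the two-feature team values are all assumptions made for the example.

```python
import random
from statistics import mean, pstdev

def standardize(rows):
    """Z-score each column so every feature has mean 0 and unit variance."""
    cols = list(zip(*rows))
    mus = [mean(c) for c in cols]
    sds = [pstdev(c) or 1.0 for c in cols]  # guard against constant columns
    return [[(v - m) / s for v, m, s in zip(row, mus, sds)] for row in rows]

def sq_dist(a, b):
    """Squared Euclidean distance between two points."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def kmeans(points, k, seed=0, n_iter=100):
    """Plain Lloyd's algorithm; returns (labels, within-cluster sum of squares)."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)  # initialise from k distinct observations
    labels = []
    for _ in range(n_iter):
        new_labels = [
            min(range(k), key=lambda j: sq_dist(p, centroids[j])) for p in points
        ]
        if new_labels == labels:  # assignments stable: converged
            break
        labels = new_labels
        for j in range(k):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:  # keep the old centroid if a cluster goes empty
                centroids[j] = [mean(c) for c in zip(*members)]
    inertia = sum(sq_dist(p, centroids[lab]) for p, lab in zip(points, labels))
    return labels, inertia

# Hypothetical (goals scored, goals conceded) pairs; not the thesis data.
teams = [
    [72, 25], [70, 28],
    [48, 30], [46, 33], [47, 31],
    [38, 55], [36, 58], [40, 52], [35, 60],
]
X = standardize(teams)

# Elbow curve: best within-cluster sum of squares over a few random starts per k.
inertias = {
    k: min(kmeans(X, k, seed=s)[1] for s in range(10)) for k in range(1, 6)
}
for k, value in inertias.items():
    print(k, round(value, 3))
```

The number of clusters is then read off at the point where the printed values stop dropping sharply. In practice a library implementation such as scikit-learn's KMeans would normally be used instead of this hand-written loop.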

Figure 22 Elbow method for determining optimal number of clusters


4.9.2 Cluster results

The mean values of key performance metrics for each cluster are summarized below:

Metric Cluster 0 (Struggling) Cluster 1 (Balanced) Cluster 2 (Dominant)

Goals Scored 37.86 47.33 72.00
Goals Conceded 56.00 31.00 26.50
Yellow Cards 78.71 67.33 51.50
Red Cards 2.00 1.00 2.00
Clean Sheets 7.29 13.67 15.00
Average Ball Possession 46.99% 53.57% 54.50%
Accurate Passes 8,464.57 10,291.33 11,471.00
Interceptions 1,360.86 1,167.00 1,102.50
Fouls 509.71 508.00 456.50
Corners 134.29 166.33 177.50
Table 8 Mean values of performance metrics per cluster

4.9.3 Cluster interpretation

Cluster 0: Struggling teams

This cluster is characterized by low scoring output and poor defensive metrics. These
teams scored fewer goals, conceded more, and committed the most disciplinary
infractions. Low possession and fewer accurate passes reveal an inability to dominate
matches, whereas high foul and interception counts point to defensive or reactive tactics.
These tend to be lower-table or relegated teams.

• Goals scored: Low

• Goals conceded: High

• Ball possession: Low

• Accurate passes: Low

• Clean sheets: Few

• Disciplinary metrics: Poor (high yellow cards, fouls)

Cluster 1: Balanced teams

Teams in this cluster recorded intermediate values on most measures. They combined
good defensive records (few goals conceded, many clean sheets) with a moderate level of
attacking output. Ball possession and accurate passes were higher than in Cluster 0 but
lower than for the dominant teams. Such teams are usually mid-table sides with balanced
playing styles.

• Goals scored: Moderate

• Goals conceded: Low

• Clean sheets: High

• Ball possession: Moderate

• Disciplinary metrics: Acceptable

Cluster 2: Dominant teams

This cluster contains the high-performing clubs. These clubs excelled in both attack and
defense, scoring the most goals and conceding the fewest. They also recorded the most
accurate passes and the highest ball possession, indicating tactical dominance. In
addition, they committed few fouls and received few yellow cards, reflecting discipline
and control.

• Goals scored: High

• Goals conceded: Very Low

• Clean sheets: Very High

• Ball possession: High

• Accurate passes: High

• Fouls and cards: Low

Figure 23 Clusters based on goals scored and goals conceded

Figure 24 Cluster means across key performance indicators

4.9.4 Conclusion

Clustering analysis was successful in segmenting the teams into three performance
categories: struggling, balanced, and dominant. This segmentation provides significant
findings relating to league competitive structure and profiles of successful teams. Previous
analyses are reaffirmed: dominant teams balance offensive firepower with good defensive
organization, and struggling teams underperform along many dimensions, such as
possession, discipline, and goal efficiency. Clustering also offers a practical framework
for team benchmarking, allowing clubs to compare their performance with peers and
develop ways of addressing areas of poor performance.

4.10. Feature importance analysis

This section presents a feature importance analysis conducted to identify the most
influential performance metrics for predicting team league position in the 2023/24 season
of the Premier League of Bosnia and Herzegovina. Understanding which factors most
strongly influence success can inform strategic decisions and enhance the performance
evaluation framework for clubs [19].

Two machine learning models were employed for this purpose:

• Random Forest Regressor

• XGBoost Regressor

Both models are ensemble-based algorithms capable of ranking input features by their
relative importance in predicting a target variable. In this case, the target variable was final
league position, and the input features included a wide range of team performance
indicators, such as:

• Goals scored

• Goals conceded

• Clean sheets

• Interceptions

• Yellow and red cards

• Ball possession

• Accurate passes

• Shots

• Set-piece metrics (corners, penalty goals, penalties taken)

The models were trained and evaluated, and feature importance scores were extracted to
assess which variables had the greatest influence on league outcomes.

4.10.1 Results – Random Forest Feature Importances

The top-ranked features from the Random Forest model are presented below:

Figure 25 Random Forest feature importance visualization

Feature Importance Score
Goals Conceded 0.194
Clean Sheets 0.175
Interceptions 0.171
Average Ball Possession 0.085
Shots 0.056
Accurate Passes 0.055
Corners 0.048
Goals Scored 0.047
Successful Dribbles 0.041
Accurate Crosses 0.039
Fouls 0.028
Yellow Cards 0.026
Penalties Taken 0.016
Accurate Long Balls 0.010
Red Cards 0.005
Penalty Goals 0.004
Table 9 Random Forest feature importance scores

The results indicate that Goals Conceded is the most influential predictor of league
position, followed by Clean Sheets and Interceptions. These findings reinforce the
importance of defensive performance in determining overall team success. Features such
as average ball possession, shots, and accurate passes also contributed moderately,
suggesting that both control and offensive intent play supporting roles. In contrast,
disciplinary metrics (yellow/red cards) and penalty-related features had relatively low
importance in this model.

4.10.2 Results – XGBoost Feature Importances

The XGBoost model yielded the following top feature importance scores:

Figure 26 XGBoost feature importance visualization

Feature Importance Score

Goals Conceded 0.850
Yellow Cards 0.097
Goals Scored 0.052
Red Cards 0.052
Shots 0.000
Accurate Long Balls 0.000
Penalty Goals 0.000
Penalties Taken 0.000
Table 10 XGBoost feature importance scores

The XGBoost model further confirms the dominant influence of Goals Conceded, with an
exceptionally high importance score of 0.850. This again underscores the value of
defensive solidity. Interestingly, yellow and red cards had slightly more impact in this
model than in Random Forest, suggesting that disciplinary behavior may contribute to
performance variance. However, features such as penalties, shots, and long passes held
zero importance, indicating negligible contribution to the model's predictive accuracy.

4.10.3 Conclusion

Both models consistently highlight the central role of defensive performance in
determining league position. Goals conceded ranked first in both models, and clean sheets
and interceptions ranked next in the Random Forest model. Although offensive
performance (e.g., goals scored, accurate passes) and control metrics (e.g., ball
possession) play supportive roles, they appear secondary to defensive factors.

Features related to discipline and set pieces showed limited influence, which suggests that
while important in isolated contexts, they are not strong general predictors of season-long
team performance. These insights are valuable for strategic planning, talent acquisition,
and performance improvement initiatives at the team management level.

4.11. Relegation prediction – league position performance-based modelling

This section aims to forecast the final league position of teams competing in the
Premier League of Bosnia and Herzegovina (PLBiH) in the 2023/24 season. The key aim
was to measure how well performance-based metrics predict a team's final position using
machine learning regression techniques. League ranking forecasts are valuable to
coaching departments, club management, and analysts, offering evidence-based insights
that teams can use to assess their strategy. The target variable was the final league
position at the end of the season, treated as a continuous outcome, and regression models
were employed to estimate it from performance measures.

4.11.1 Methodology

• Data Preprocessing

A series of preprocessing steps were executed to ensure the quality and consistency of the
input data:

Missing values handling: Missing values were identified and imputed by the median
approach to avoid bias caused by outliers.

Feature selection: Based on prior domain expertise and preliminary checks over feature
importance, the following features were selected: goalsScored, goalsConceded,
interceptions, fouls, yellowCards, redCards, cleanSheets, averageBallPossession,
accuratePasses, shots, corners.

• Feature Engineering:

goalDifference: Key metric computed as the difference between goals scored and goals
conceded.

disciplineIndex: Aggregate score computed as a weighted combination of yellow and red
cards to measure team discipline.

Feature scaling: All quantitative inputs were standardized using StandardScaler to put
features on the same scale, which many machine learning models require to avoid
scale-induced bias.
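The engineered features and the scaling step can be sketched as follows. The thesis does not state the weights used for disciplineIndex, so RED_CARD_WEIGHT below is purely an assumption, as are the three hypothetical team records. The standardize helper mirrors the per-feature z-score that StandardScaler applies (population standard deviation).

```python
from statistics import mean, pstdev

# Hypothetical season totals for three teams (illustrative values only).
teams = [
    {"goalsScored": 70, "goalsConceded": 28, "yellowCards": 55, "redCards": 1},
    {"goalsScored": 47, "goalsConceded": 40, "yellowCards": 68, "redCards": 2},
    {"goalsScored": 36, "goalsConceded": 65, "yellowCards": 82, "redCards": 3},
]

RED_CARD_WEIGHT = 3  # assumption: the thesis does not state the weights used

for team in teams:
    # Net performance: goals scored minus goals conceded
    team["goalDifference"] = team["goalsScored"] - team["goalsConceded"]
    # Weighted card count; a red card counts RED_CARD_WEIGHT times a yellow
    team["disciplineIndex"] = (
        team["yellowCards"] + RED_CARD_WEIGHT * team["redCards"]
    )

def standardize(values):
    """Z-score a single feature using the population standard deviation."""
    mu = mean(values)
    sigma = pstdev(values)
    return [(v - mu) / sigma if sigma else 0.0 for v in values]

scaled_goal_difference = standardize([t["goalDifference"] for t in teams])
print([round(v, 3) for v in scaled_goal_difference])
```

After scaling, each feature has mean 0 and unit variance, so no single metric dominates merely because of its units.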

• Model Development

Three regression models were implemented and compared:

Linear Regression: A baseline model assuming a linear relationship between features and
target.

Random Forest: An ensemble model to address complex, non-linear relationships.

XGBoost: A state-of-the-art gradient boosting model for performance on structured data.

• Evaluation Metrics

All the models were evaluated on:

Mean absolute error (MAE): Average of the absolute errors in predictions.

Mean Squared Error (MSE): Average of the squared errors in predictions.

R-squared (R²): Proportion of the variance in the dependent variable explained by the
model.

• Hyperparameter Tuning

To improve the performance of the Random Forest Regressor, a grid search was conducted
over its most important hyperparameters. The selected values were:

n_estimators = 200

max_depth = None

min_samples_split = 2

This improved generalizability and the bias-variance tradeoff.

4.11.2 Results

Model MAE MSE R²

Linear Regression 0.899 1.468 0.492

Random Forest 1.203 1.466 0.492

XGBoost Regressor 1.251 1.765 0.389

RF (Tuned) 1.150 1.333 0.539

Table 11 Model evaluation summary

Although Linear Regression was a good baseline and achieved the lowest MAE (0.899),
the tuned Random Forest posted the best overall performance, with the highest R² (0.539)
and the lowest MSE (1.333) of any model tried, and a lower MAE (1.150) than the untuned
Random Forest (1.203). This suggests the tuned Random Forest best captured hidden
patterns in the data in spite of the very small dataset size.

4.11.3 Feature importance analysis

Feature importances were extracted from the tuned Random Forest model:

Feature Importance

Clean Sheets 0.1467

Goal Difference 0.1316

Goals Conceded 0.1294

Interceptions 0.0945

Goals Scored 0.0872

Corners 0.0811

Shots 0.0782

Avg. Ball Possession 0.0711

Accurate Passes 0.0600

Disciplinary Index 0.0595

Yellow Cards 0.0349

Fouls 0.0190

Red Cards 0.0069

Table 12 Feature importance summary

Interestingly, clean sheets and goal difference emerged as the two most important
predictors of final league position, underlining the importance of defensive solidity.
Goals conceded and interceptions also ranked high on the list, corroborating the finding
that defensive solidity tends to be accompanied by better league performance.

4.11.4 Test set prediction analysis

To determine the model's generalization capability, predictions were made for a sample of
teams that were not employed during training:

Team Actual Position Predicted Position

HŠK Posušje 5 6.19

FK Željezničar 6 7.25

HŠK Zrinjski Mostar 2 3.01

Table 13 Table of predicted positions
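As a worked check, the three evaluation metrics defined in Section 4.11.1 can be recomputed from the actual and predicted positions in Table 13. The sketch below uses only the Python standard library; the function names mae, mse and r2 are chosen for illustration. The results agree with the tuned Random Forest row of Table 11 (MAE 1.150, MSE 1.333, R² 0.539), which suggests that the reported metrics were computed on these three held-out teams.

```python
# Actual and predicted positions from Table 13
# (HŠK Posušje, FK Željezničar, HŠK Zrinjski Mostar).
actual = [5, 6, 2]
predicted = [6.19, 7.25, 3.01]

def mae(y, y_hat):
    """Mean absolute error."""
    return sum(abs(a - p) for a, p in zip(y, y_hat)) / len(y)

def mse(y, y_hat):
    """Mean squared error."""
    return sum((a - p) ** 2 for a, p in zip(y, y_hat)) / len(y)

def r2(y, y_hat):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    y_mean = sum(y) / len(y)
    ss_res = sum((a - p) ** 2 for a, p in zip(y, y_hat))
    ss_tot = sum((a - y_mean) ** 2 for a in y)
    return 1 - ss_res / ss_tot

print(round(mae(actual, predicted), 3))
print(round(mse(actual, predicted), 3))
print(round(r2(actual, predicted), 3))
```

With only three test observations, each metric is highly sensitive to any single prediction, so the values should be read as indicative rather than precise.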

• Interpretation of Decimal Values in Predictions

The decimal values for predicted positions (such as 6.19, 7.25, 3.01) arise because
regression models forecast continuous numeric variables. The fractional part indicates
how close a prediction lies to the adjacent league positions; it is not by itself a measure
of the model's confidence:

HŠK Posušje (6.19): The prediction is closest to 6th position, leaning slightly towards
7th; the actual finish was 5th.

FK Željezničar (7.25): The prediction is closest to 7th position, leaning somewhat
towards 8th; the actual finish was 6th.

HŠK Zrinjski Mostar (3.01): The prediction lies almost exactly on 3rd position; the actual
finish was 2nd.

In all three cases the model placed the team roughly one position lower than its actual
finish. Such continuous predictions can help analysts see how close a team sits to
neighbouring positions and guide strategic adjustments. For instance, a team with a
projected position of 6.19 might, with slight improvement in crucial statistics (e.g., clean
sheets or goal difference), secure 6th or even 5th place.

4.11.5 Interpretation and practical implications

Defensive strength as the key indicator: Clean sheets, goal difference, and goals conceded
top the importance ranking, reinforcing the central role of defense in achieving league
success.

Secondary offense role: Offensive metrics such as goals scored, corners, and shots are
relevant but ranked below the defensive metrics.

Discipline has limited predictive ability: While important within match dynamics,
disciplinary measures (yellow/red cards, disciplinary index) were poor predictors of final
league position in the models.

Strategic clarity for clubs: Continuous (decimal) predictions show clubs how close they
are to adjacent positions and help them target marginal gains that can translate into actual
improvements in league position.

4.11.6 Conclusion

This performance-based regression analysis demonstrated that team performance metrics
can be used to predict final league position. The best-performing model, the tuned
Random Forest Regressor (R² = 0.539), yielded interpretable league position predictions
that were typically within about one position of the actual finish. Some of the key findings
are as follows:

Defensive solidity, reflected in clean sheets and goal difference, is the most critical factor
in determining league success. Attacking metrics contribute to a moderate extent, and
disciplinary measures have comparatively poor predictive ability.

Continuous predictions indicate how close teams are to adjacent positions and can be used
to inform planning and resource allocation. The findings support the application of
machine learning in football analytics at the national level and illustrate the promise of
using structured data from less-studied leagues like PLBiH to inform team management,
performance evaluation, and decision-making. Future extensions should include temporal
dynamics at the match level and multi-season analysis in order to improve the model's
stability and temporal generalizability.

4.12. Player-level defensive modelling

4.11.1 Objective

The aim here was to determine whether individual defensive statistics can predict a
player's likelihood of committing a high number of fouls. To this end, a binary
classification model was developed to distinguish between high-fouling and low-fouling
players in the 2023/24 season of the Bosnia and Herzegovina Premier League.

The response variable, fouls_per_match, was binarized using the sample median of fouls
per match. Players whose fouls per match were above the median were labeled as high
foulers (1), and those at or below the median were labeled as low foulers (0) [20].
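The median split described above can be sketched as follows. The fouls_per_match values are hypothetical and chosen to show one side effect of the rule: when several players share the median value, they all fall into the low-fouler class, so the two classes need not be the same size.

```python
from statistics import median

# Hypothetical fouls-per-match values for eight players (illustrative only).
fouls_per_match = [0.4, 0.9, 1.2, 1.2, 1.2, 1.8, 2.1, 2.6]

threshold = median(fouls_per_match)

# 1 = high fouler (strictly above the median), 0 = low fouler (at or below it)
high_fouler = [1 if value > threshold else 0 for value in fouls_per_match]

print(threshold)
print(high_fouler)
print(sum(high_fouler), "high foulers out of", len(high_fouler))
```

Here three players are labeled high foulers and five low foulers, even though the threshold is the median.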

The hypothesis tested was whether easily accessible player-level data such as
appearances, duels won, yellow cards, and red cards have sufficient predictive power to
discriminate between players based on their fouling behavior. Despite the simplicity of
the features used, the aim was to test whether such discrimination is still attainable in a
low-dimensional feature space.

4.11.2 Model evaluation and comparison

All models were trained and evaluated with stratified 5-fold cross-validation to preserve
the class balance in each fold and to reduce the risk of overly optimistic performance
estimates. The following metrics were used to measure model performance:

Accuracy: Proportion of all predictions that are correct

Precision: Proportion of predicted positives that are true positives

Recall: Proportion of actual positives that are correctly identified

F1 score: Harmonic mean of precision and recall

ROC AUC: Area under the Receiver Operating Characteristic curve, representing the
model's ability to distinguish between the two classes

The test scores are shown below in Table 14:

Metric Logistic Regression XGBoost SVM

Accuracy 0.65 ± 0.11 0.58 ± 0.11 0.61 ± 0.11
Precision 0.65 ± 0.09 0.57 ± 0.10 0.61 ± 0.09
Recall 0.64 ± 0.21 0.66 ± 0.15 0.58 ± 0.19
F1 Score 0.63 ± 0.14 0.61 ± 0.11 0.58 ± 0.14
ROC AUC 0.70 ± 0.11 0.61 ± 0.13 0.70 ± 0.11
Table 14 Comparison of model evaluation

Logistic Regression performed best overall, with the highest or joint-highest scores on
most metrics, including accuracy, precision, F1 score, and AUC. This indicates a
well-balanced model with reasonable discriminative power in separating high and low
foulers.

XGBoost had the highest recall (0.66), indicating greater sensitivity to high-fouling
players, but at the expense of precision and AUC, which points to a bias towards false
positives. The model's tendency to flag more players as high foulers may prove beneficial
in certain uses (e.g., disciplinary risk detection), but it reduces its value as an impartial
predictor.

SVM matched Logistic Regression on AUC (0.70) but fell behind on F1 score and recall,
indicating a lower ability to detect high-fouling players than the other models. The linear
kernel probably prevented it from detecting more complicated patterns within the dataset.

4.11.3 Visual comparison of model performance

To support the quantitative results, Figure 27 offers a side-by-side graphical comparison
of the performance metrics of the three models. The visualization confirms the superior
balance achieved by Logistic Regression, particularly for precision, F1 score, and AUC,
which are of greatest importance in minimizing both false positives and false negatives.

Figure 27 Comparison of model performance (Logistic Regression, XGBoost, SVM)

The figure shows the better balance of the Logistic Regression model, particularly in
precision and F1 score. XGBoost, while relatively strong in recall, is more variable and
has a lower AUC, suggesting a less stable decision boundary.

4.11.4 Interpretation and limitations

The results suggest that fouling behavior can be predicted to a moderate degree from basic
defensive statistics alone. Logistic Regression was the strongest model, offering
interpretability, simplicity, and acceptable performance despite the low-dimensional
feature space.

Although XGBoost is typically stronger at modeling complex, non-linear interactions, its
performance here was hampered by several factors:

Limited richness of features – Critical context information such as playing position,
strength of opposition, and match events was missing.

Small size of dataset – With fewer examples, especially in the high-fouler class,
overfitting may occur.

Imbalanced thresholding – Binary splitting on the median can produce class imbalance
when several players share the median value, and it separates players near the threshold
who differ very little.

The lower precision and AUC of XGBoost mean that, while it captures more true high
foulers, it also misclassifies many low foulers, which may not be operationally desirable
if misclassification is costly.

The lower F1 score and recall of the SVM are likely attributable to its linear kernel, which
cannot capture more subtle distinctions between the classes in this data.

4.11.5 Conclusion

Despite moderate performance, the findings support the viability of player-level foul
prediction using basic defensive statistics. Logistic Regression, the top performer within
current constraints, generates stable predictions that are easy to interpret.

As recommendations for future work, the predictive power of more advanced models such
as XGBoost and SVM can be further improved by:

• Increasing the dataset size

• Adding additional features (such as playing position, minutes played, game state,
and other contextual variables)

• Engineering temporal or spatial features

• Applying more advanced resampling or thresholding techniques to more
accurately balance the classes

Lastly, this exercise in modeling is a proof of concept showing that even basic features
can provide insight into challenging behavior such as fouling, with further improvements
to be explored in future work.
4.12. Midfielder performance modelling

4.12.1 Objective

This section examines whether midfielders in the 2023/24 season of the Premier League
of Bosnia and Herzegovina can be effectively classified by their assist productivity using
machine learning. Specifically, it investigates whether a set of numerical performance
indicators (e.g., pass accuracy, key passes, chances created, and match appearances) can
be used to distinguish highly assist-productive midfielders from less assist-productive
ones.

To frame the task as binary classification, the target variable high_assist_rate was created
from the median assists per game across all midfielders. A value of 1 was assigned to
players whose assist rate exceeded the median, and 0 to those at or below the median
[21].

The underlying assumption is that assist-making ability, although affected by external
tactical and situational influences, can be explained at least in part by players' individual
technical statistics collected consistently across games.

4.12.2 Model development and evaluation

Two of the most common classification algorithms were used for model training and
testing:

• Logistic Regression

• Random Forest Classifier

These models were trained using stratified 5-fold cross-validation to ensure that each fold
preserved the balance between the high and low assist rate classes. The evaluation metrics
were:

Accuracy – overall correctness of classification

Precision – number of correctly predicted high-assist players over all predicted high-assist
players

Recall – ability to identify all high-assist players correctly

F1 Score – compromise between recall and precision

AUC – Area Under the ROC Curve, an indicator of separability of classes

The classification results are tabulated in Table 15:

Model Accuracy Precision Recall F1 Score AUC

Logistic Regression 1.00 1.00 1.00 1.00 1.00
Random Forest 0.83 0.88 0.83 0.83 0.94
Table 15 Classification results

Figure 28 Visual comparison of model performance (bar chart of metrics)

The perfect scores of the Logistic Regression model on every evaluation metric indicate
high linear separability of the features with respect to the target class. In contrast, the
Random Forest model missed some true positives, as indicated by its lower recall for
class 1 (high assist rate), despite its high AUC (0.94) and overall accuracy of 83%.

4.12.3 Interpretation and discussion

The perfect performance of Logistic Regression on all evaluation metrics (accuracy,
precision, recall, F1, and AUC) reflects a high degree of linear separability in the selected
feature space. In other words, midfielders with higher assist rates can be separated very
well using simple match statistics, and the assist outcome is strongly correlated with some
of these features.

Promising as they are, perfect evaluation scores, particularly on small datasets, warrant
caution about overfitting. The sample of midfielders in question is small (n = 12), and
this may artificially inflate performance because the model memorizes patterns in the data
rather than learning ones that generalize. Nevertheless, Logistic Regression remains an
attractive option due to its interpretability, computational efficiency, and robustness in
small-sample scenarios.

The Random Forest Classifier, while less accurate, had a strong AUC of 0.94, suggesting
that it remains highly capable of class discrimination. The lower recall, though, indicates
that it did not correctly label all of the high-assist midfielders, possibly because the
feature representation lacks depth. Owing to its ability to capture nonlinear relationships
and produce feature importances, Random Forest remains a viable means of identifying
which features are most indicative of midfielder creativity, even if it was not the best
classifier here.

4.13. Attacker-level performance prediction

4.13.1 Objective

The objective of this section was to investigate whether machine learning models can
reliably predict attacking player performance from a limited set of offensive metrics. The
data used for the analysis came from the 2023/24 season of the Bosnia and Herzegovina
Premier League. The main outcome variable was goal contribution per match, defined as
goals plus assists per appearance, as a measure of a player's direct contribution to scoring
[22].

To transform this into a classification problem, a binary target variable was created:

High performers (label 1): Players with goal participation per game greater than the
median of the dataset.

Low performers (label 0): Players at or below the median.

Two supervised classification models were applied to this task:

• Logistic Regression (linear classifier)

• Random Forest Classifier (ensemble decision tree model)

The goal was to determine which model best distinguishes high-performing attackers from
others based on readily available performance attributes.

4.13.2 Feature selection and modeling

A subset of 10 attacking attributes was used as input variables:

• Goals

• Assists

• Shots

• Shots on target

• Big Chances created

• Dribbles completed

• Key passes

• Accurate passes

• Minutes played

• Games played

Prior to model training, features were standardized using StandardScaler so
that all variables are on a comparable scale, which is especially
important for coefficient-sensitive algorithms such as Logistic Regression.
A 5-fold stratified cross-validation approach was used to maintain class balance across
folds and obtain more reliable performance estimates.

For both classifiers, the following metrics were computed:

• Accuracy

• Precision

• Recall

• F1 Score

• ROC AUC
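
The evaluation setup described above can be sketched with scikit-learn as follows. This is an illustrative sketch, not the thesis code: the data are synthetic, the hyperparameters (`max_iter`, `n_estimators`, `random_state`) are assumptions, and placing the scaler inside a pipeline is a design choice made here so that it is re-fit on each training fold. The thesis does not state how the scaler was fitted.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_validate
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the attacker table: 12 players x 10 features (not thesis data).
rng = np.random.default_rng(0)
y = np.array([0, 1] * 6)   # balanced labels, as a tie-free median split would give
X = rng.normal(size=(12, 10))
X[y == 1] += 1.5           # shift class 1 so the toy data carries some signal

models = {
    # The scaler sits inside the pipeline, so it is fitted on training folds only.
    "logistic_regression": make_pipeline(
        StandardScaler(), LogisticRegression(max_iter=1000)
    ),
    # Tree ensembles do not depend on feature scale, so no scaler is used here.
    "random_forest": RandomForestClassifier(n_estimators=200, random_state=0),
}

cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
scoring = ["accuracy", "precision", "recall", "f1", "roc_auc"]

summary = {}
for name, model in models.items():
    scores = cross_validate(model, X, y, cv=cv, scoring=scoring)
    # Average each metric over the five test folds.
    summary[name] = {m: float(scores[f"test_{m}"].mean()) for m in scoring}

for name, metrics in summary.items():
    print(name, {m: round(v, 2) for m, v in metrics.items()})
```

With 12 samples, each test fold holds only two or three players, so the per-fold metrics move in large steps and the averages should be read as rough indications.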

4.13.3 Results

The performance measures of the two models are provided in Table 16:

| Metric | Logistic Regression | Random Forest |
|---|---|---|
| Accuracy | 1.00 | 0.83 |
| Precision | 1.00 | 0.75 (Class 0), 1.00 (Class 1) |
| Recall | 1.00 | 1.00 (Class 0), 0.67 (Class 1) |
| F1 Score | 1.00 | 0.86 (Class 0), 0.80 (Class 1) |
| ROC AUC | 1.00 | 0.94 |

Table 16 Comparison of Logistic Regression and Random Forest

The Logistic Regression model achieved perfect scores on all metrics, which suggests that
the two classes are linearly separable in this sample. The Random Forest model, while still
good (AUC = 0.94), was unbalanced in its errors: it classified every low performer
(Class 0) correctly but misclassified some high performers, as shown by its lower recall and F1
score for Class 1.
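
As a worked check, the reported Random Forest figures are consistent with one specific confusion matrix. The counts below are a reconstruction, not values reported in the thesis; they assume a 6/6 class split among the 12 attackers (which a median split yields when no value ties with the median) and pooled predictions across folds.

```python
def per_class_metrics(tp, fp, fn):
    """Precision, recall, and F1 for one class treated as the positive class."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Assumed counts with Class 1 (high performers) as positive:
# all 6 low performers correct, 4 of 6 high performers correct.
tn, fp, fn, tp = 6, 0, 2, 4

accuracy = (tp + tn) / (tp + tn + fp + fn)
class1 = per_class_metrics(tp=tp, fp=fp, fn=fn)
# For Class 0 the roles swap: its true positives are tn and its false positives are fn.
class0 = per_class_metrics(tp=tn, fp=fn, fn=fp)

print("accuracy:", round(accuracy, 2))
print("class 0 (precision, recall, F1):", [round(v, 2) for v in class0])
print("class 1 (precision, recall, F1):", [round(v, 2) for v in class1])
```

Under these assumed counts, two high performers labeled as low performers are enough to produce every Random Forest value in the table, which shows how strongly a single prediction moves each metric at this sample size.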

4.13.4 Visual comparison

A bar chart comparison (Figure 29) was employed to illustrate the average performance
of the two models for Precision, Recall, and F1 Score. This revealed the superior and
balanced performance of Logistic Regression very clearly.

Figure 29 Comparison of model performance metrics

Even though the bar chart favors the linear model, such extreme values should be
interpreted with caution because the sample size is small (n = 12
attackers). A sample this small increases the risk of overfitting, even for simple
models that fit clean decision boundaries in low-dimensional data.
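
One way to make this caution concrete is a confidence interval for accuracy at n = 12. The sketch below uses the Wilson score interval, which is a technique chosen here for illustration and is not part of the thesis; it also treats the 12 predictions as independent trials, which is only an approximation for cross-validated predictions.

```python
from math import sqrt


def wilson_interval(successes, n, z=1.96):
    """Wilson score interval for a binomial proportion (z = 1.96 for about 95%)."""
    p = successes / n
    denom = 1 + z**2 / n
    centre = (p + z**2 / (2 * n)) / denom
    half_width = z * sqrt(p * (1 - p) / n + z**2 / (4 * n**2)) / denom
    return centre - half_width, centre + half_width


# 12 of 12 correct (accuracy 1.00) and 10 of 12 correct (accuracy 0.83).
perfect_low, perfect_high = wilson_interval(12, 12)
rf_low, rf_high = wilson_interval(10, 12)

print("12/12 correct:", (perfect_low, perfect_high))
print("10/12 correct:", (rf_low, rf_high))
```

Even for 12 correct predictions out of 12, the lower bound of this interval is roughly 0.76, so a perfect score on this sample is compatible with a noticeably lower accuracy on new data.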

4.13.5 Interpretation and discussion

These findings support the premise that goal involvement can be predicted
from basic match statistics such as shots, assists, and key passes. The success of Logistic
Regression, a linear method, suggests that attackers' offensive output
in this sample is approximately linearly related to these input features.

However, the perfect classification by Logistic Regression may be a sign of
overfitting rather than genuine generalizability. The model may have memorized patterns in
this extremely small and clean dataset rather than learning rules that transfer to unseen
data. In addition, the target is derived from goals, assists, and appearances, which also
appear among the inputs, so part of this separability is built into the features. In
practical application, such accuracy would most likely not hold up without further testing on
larger datasets.

The Random Forest model, although less accurate, may transfer better to new data and was very
effective at detecting low-performing attackers. Its ability to learn nonlinear feature
interactions and to output feature importance are strengths that make it a good alternative,
especially when data complexity is high.

4.13.6 Implications and applications

This experiment demonstrates that even a small number of performance attributes can
be enough for effective attacker evaluation through machine learning. This finding has
implications for real-world use by clubs and analysts:

• Recruitment and scouting: identifying attackers with high latent offensive production

• Analysis of opponents: highlighting players who are likely to affect the match outcome

• Decision support in a match: recommending substitutions and tactics

• Youth development tracking: following a player's offensive contribution over time

Moreover, attacker-level predictions can be linked to team-level attacking performance in
future research. Relating predicted top-performing attackers to their teams' goals-for
effectiveness might inform squad-building plans and the distribution of value across attackers.

4.13.7 Limitations and future work

The main limitations are:

• Small sample size – just 12 attackers, which limits statistical power and increases
overfitting risk

• Feature limitation – lack of spatial, contextual, or opponent-level features

• No hyperparameter tuning – the default settings may not be optimal for Random
Forest

Possible future improvements include:

• Using a larger sample across multiple seasons or leagues and, alongside the standard
metrics, adding more advanced metrics such as expected goals (xG), progressive
runs, or possession chain involvement
• Applying regularization in Logistic Regression to increase generalizability
• Using grid search or Bayesian optimization for hyperparameter tuning
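
The regularization and grid search suggestions can be combined, as in the following sketch. It is a minimal illustration on synthetic data, assuming scikit-learn's default L2 penalty and an arbitrary grid of `C` values (smaller `C` means stronger regularization); none of these settings come from the thesis.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, StratifiedKFold
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for a larger, multi-season player table (not thesis data).
rng = np.random.default_rng(1)
y = np.array([0, 1] * 20)
X = rng.normal(size=(40, 10))
X[y == 1, :3] += 1.0  # only the first three features carry signal

# LogisticRegression applies L2 regularization by default; C is its inverse strength.
pipeline = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
c_values = [0.01, 0.1, 1.0, 10.0]
param_grid = {"logisticregression__C": c_values}

search = GridSearchCV(
    pipeline,
    param_grid,
    cv=StratifiedKFold(n_splits=5, shuffle=True, random_state=0),
    scoring="roc_auc",
)
search.fit(X, y)

print("best C:", search.best_params_["logisticregression__C"])
print("best cross-validated AUC:", round(search.best_score_, 3))
```

The cross-validated score of the selected setting is optimistically biased because the same folds are used for selection and evaluation; a nested cross-validation or a held-out season would give a cleaner estimate.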

4.14. Integrating player-level predictions with team performance

This section seeks to close the gap between player-level classification and team-level
statistical modeling by analyzing how predictive information at both levels synthesizes to
account for general trends of success in the 2023–24 Premier League of Bosnia and
Herzegovina season. The synthesis of descriptive statistics and machine learning
predictions offers a multi-level platform for explaining how individual efforts combine to
affect team results [23].

4.14.1 Team-Level analysis and modeling

Team-level statistical analysis and predictive modeling identified the overall determinants
of league performance. The top-performing teams consistently exhibited the following
features:

• Higher ball possession – tactical dominance and more control of play

• Fewer goals conceded – defensive solidity

• Higher discipline – fewer yellow and red cards, less disruption

• More clean sheets – consistent defensive solidity

These descriptive observations were also supported by the feature importances of the Random
Forest and XGBoost regressors. The most predictive features of league position in
both models were goals conceded, clean sheets, and interceptions. For example, the
Random Forest model assigned goals conceded the highest importance value (0.194),
which aligns with the descriptive finding that successful teams defend well.
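
Feature importance values of this kind can be read from a fitted Random Forest regressor, as in the sketch below. It uses invented data for 12 hypothetical teams and three of the features named above; the target construction and the hyperparameters are assumptions for illustration, so the resulting numbers have no relation to the 0.194 reported in the thesis.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

feature_names = ["goals_conceded", "clean_sheets", "interceptions"]

# Invented season totals for 12 hypothetical teams (not thesis data).
rng = np.random.default_rng(42)
n_teams = 12
goals_conceded = rng.integers(20, 70, size=n_teams).astype(float)
clean_sheets = rng.integers(2, 18, size=n_teams).astype(float)
interceptions = rng.integers(250, 450, size=n_teams).astype(float)
X = np.column_stack([goals_conceded, clean_sheets, interceptions])

# Toy rule: fewer goals conceded and more clean sheets give a lower (better) score.
score = goals_conceded - 1.5 * clean_sheets + rng.normal(scale=2.0, size=n_teams)
position = score.argsort().argsort() + 1  # league positions 1..12, 1 = best

model = RandomForestRegressor(n_estimators=500, random_state=0).fit(X, position)

# Impurity-based importances are non-negative and sum to 1 across features.
importances = dict(zip(feature_names, model.feature_importances_))
for name, value in sorted(importances.items(), key=lambda kv: -kv[1]):
    print(f"{name}: {value:.3f}")
```

With only 12 rows, importances of this kind can change noticeably between random seeds, so repeating the fit with several seeds is a simple stability check.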

Furthermore, the strong negative correlation between league position and ball possession
(r = -0.77) indicates that teams with more possession tended to finish higher in the table
(position 1 being the best), which points to the value of possession-based play. This
agreement between statistical correlation and machine learning feature
importance scores corroborates the indicators identified and underscores
their strategic importance to clubs looking to enhance performance.
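
The sign of this correlation is easy to misread, so a small worked example may help. The sketch below computes Pearson's r from its definition on invented values for six teams; the numbers are hypothetical and are not intended to reproduce the reported r = -0.77.

```python
from math import sqrt


def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (spread_x * spread_y)


# Hypothetical table: position 1 is the champion, so a smaller number is better.
league_position = [1, 2, 3, 4, 5, 6]
possession_pct = [58, 55, 56, 50, 47, 45]

r = pearson_r(league_position, possession_pct)
print(round(r, 2))
```

Because a lower position number means a better finish, a negative r here means that higher possession goes with a better league finish.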

4.14.2 Player-level modeling and positional insights

Three independent modeling exercises were carried out, for defenders, midfielders,
and attackers, each using position-specific performance metrics:

• Defenders – Fouling behavior prediction:

Logistic Regression and XGBoost were used to predict high foulers from
appearances, duels, and discipline metrics. While model performance was moderate (AUC
up to 0.70), the results still offered useful insights. Defenders from high-performing teams like
Borac and Zrinjski were less likely to be labeled as high foulers, indicating a link between
disciplined defending and team success. These observations corroborate the overall team-level
trend that better discipline is linked with higher league position.

• Midfielders – Assist rate classification:

This exercise yielded perfect performance metrics under Logistic Regression (Precision,
Recall, F1, AUC = 1.00), indicating strong linear relationships between features like key
passes, pass accuracy, and chances created and the likelihood of being a high-assist player.
Typical high-assist providers were midfielders on teams like Sarajevo and Zrinjski, which
were consistently two of the most offensively proficient teams. This suggests the value of
creative midfield play in driving team-level offensive proficiency.

• Attackers – Goal involvement prediction:

Predictive modeling with Logistic Regression and Random Forest successfully identified
attackers with high goal involvement. Players from Zrinjski and Borac,
which led the scoring charts, were consistently predicted as high performers. The close
relationship between individual offensive production and team goal difference (an
important predictor of league position) underlines the importance of attacking contributions
to team success.

These position-wise analyses suggest that individual
player performance, when properly modeled, is in line with and helps explain many team-level
outcomes.

4.14.3 Connecting individual performance to team success

Integrating the player-level predictions with the team-level modeling reveals significant
synergies. Patterns at both levels point to a shared set of performance determinants,
including:

- Possession and passing accuracy – associated with dominant team play and
creative midfielders
- Scoring and assist efficiency – linked to high-scoring attackers and high league
positions
- Discipline and defensive metrics – shared among low-fouling defenders and low
team goals conceded

These overlaps suggest that individual performances are closely tied to team
outcomes. In the majority of cases, players whom the machine learning models predicted to
be among the top performers were part of teams that finished at the top of the league table.
This points to a link between micro-level (player) and macro-level (team)
performance, with the implication that good player modeling can serve as a proxy
for team success indicators and vice versa. This alignment supports data-driven
scouting, lineup optimization, and individualized training approaches in real-world
football operations.

4.14.4 Final observations

While most characteristics had consistent significance at both levels of analysis, there
were some intriguing divergences:

Disciplinary metrics (yellow and red cards) featured prominently in the descriptive analysis
but were given relatively low feature importance in the machine learning models. This may
suggest that their effect is more indirect or contextual, i.e., that they affect match
outcomes via suspensions rather than through accumulated performance loss.

Possession and passing accuracy were cross-cutting qualities, appearing regularly in
both the correlation analysis and the feature importance rankings. These features appear
particularly important for midfielders and team control, confirming their centrality to
tactical design.

The combination of statistical and algorithmic methods strengthens the interpretability and
validity of the findings. The two-tiered framework employed here demonstrates that
integrating human-interpretable analysis with machine-generated predictions can
produce actionable intelligence for football that management,
analysts, and coaches can use directly.

5. CONCLUSION

This thesis carried out an in-depth investigation into the use of machine learning and
data-driven analysis in the evaluation and prediction of football performance in the Premier
League of Bosnia and Herzegovina (PLBiH) for the 2023–2024 season. By combining
structured data collection, statistical exploration, and predictive modeling, the study
sought to produce actionable intelligence at both team and player levels. The ultimate aim
was not simply to assess how modern analytical techniques could be used within a regional
football league but also to apply them in a real-world context as tools for operational
analysis and strategic decision-making in lower-resource leagues with a lower international
profile.

Using statistics extracted from SofaScore, the research constructed a dataset of
32 variables across 12 teams, covering a wide range of performance indicators, from goals for
and against to disciplinary record, possession percentage, clean sheets, and passing
accuracy. This foundational dataset underpinned a multi-stage analytical pipeline that
consisted of:

• Descriptive and correlation analysis, for the identification of statistically significant
trends and outliers

• Clustering analysis, for team segmentation by performance profiles

• Feature importance modeling, to establish the most predictive features of team success

• Supervised machine learning models, to predict player-level outcome variables, e.g., foul
frequency, assist frequency, and total goal participation

The study results highlight the predictive potential of well-known football metrics:

At the team level, clean sheets, goals conceded, and interceptions were the best predictors of
end-of-season league standing. Key metrics were highlighted not only by correlation
coefficients (e.g., possession vs. league rank, r = -0.77), but also by feature
importance rankings in the Random Forest and XGBoost models.

At the player level, individual models were created for defenders, midfielders, and
attackers. Logistic Regression worked best in every position, indicating
very high separability of the features used and demonstrating that meaningful predictions
can be made even with basic performance statistics. Midfielders from offensively strong teams
(e.g., Sarajevo, Zrinjski) were correctly identified as having high
assist numbers, and attackers from highly ranked clubs were correctly identified using
goal involvement metrics.

These results indicate that machine learning models, especially interpretable
ones like Logistic Regression, can learn basic patterns in football performance from fairly
limited amounts of match-level data. This is particularly significant since it is widely
believed that successful football analytics requires very large amounts of player tracking
data or proprietary sensor feeds. The study illustrates that, in a data-scarce environment
like the PLBiH, open-source software and freely available statistics can be used to draw
meaningful and replicable conclusions.

However, the study also has some limitations. The team dataset contained only 12
observations, matching the number of teams in the league. This
naturally restricted model complexity and the ability to generalize findings. At the
player level, the sample in each position-specific dataset was also quite small (n=12 per
group), which again increases the risk of model instability and overfitting, especially with
more advanced models like XGBoost. Although stratified k-fold cross-validation mitigated
this to some extent, the models' generalizability across seasons, leagues, or
contexts remains limited. In addition, the binary target labels created through median
splits, while useful for maintaining class balance, may have reduced more nuanced player
behavior to simpler states.

Despite these weaknesses, the thesis provides a valid proof of concept for analytics-driven
football intelligence in under-researched leagues. It offers a replicable methodology for
other regional competitions seeking to leverage analytics without expensive infrastructure
or proprietary data feeds.

To build on the results and move beyond the current constraints, the
following directions for future research are proposed:

• Use multi-season datasets to support trend analysis between seasons and improve model
stability over time. This would allow enduring patterns in player behavior and
team strategy to be found across multiple competitive cycles.
• Incorporate match-by-match or event-level details, e.g., pass chains, sources of shots,
or player movement, to add temporal and tactical depth to the models.
• Apply explainable AI techniques like
SHAP (SHapley Additive exPlanations) or LIME to make black-box models more
interpretable and trustworthy to stakeholders, especially for management or coaching
decisions.
• Study ensemble and hybrid models that combine the predictive power of algorithms
like XGBoost with the explainability of logistic regression for more
nuanced player classification.
• Develop scouting tools or dashboards
from these models that can help clubs with recruitment, training priorities, or opponent
preparation, making the analytics actionable beyond academic research.

This research contributes to the growing intersection of machine learning and sports
analytics, focusing on a regional European football league that is under-represented in
current academic literature. By demonstrating viable predictive models for
team and individual performance, the thesis illustrates how data science can help
clubs, coaches, and analysts make more objective, informed, and tactical decisions.

More broadly, it advocates for the democratization of football analytics, showing that
one does not need world-class facilities or proprietary tracking data to obtain
illuminating performance insights. With prudent feature choice, sound validation, and
targeted modeling, clubs in Bosnia and Herzegovina and in similar leagues can
build sound analytical systems that improve competitiveness, scouting efficiency, and
tactical preparation.

By bridging current data science practices and practical football
performance measurement, this study lays the groundwork for wider
adoption of smart, data-intensive strategies at all levels of the sport across the
region. It is hoped that this study contributes to scholarly advancement as
much as to practical development in Bosnian football and beyond.

REFERENCES

[1] Antonini, V., Scriney, M., Mileo, A., & Roantree, M. (2024). A Framework for Spatio-
Temporal Graph Analytics in Field Sports.

[2] Antonini, V., Scriney, M., et al. (2024). Time-windowed graph analytics in sports.
arXiv.

[3] Beetz, M., Gedikli, S. (2009). Game state modeling frameworks.

[4] Beetz, M., Gedikli, S., et al. (2009). Automated game analysis via spatio-temporal
data. Int. J. Comput. Sci. Sport.

[5] Beetz, M., von Hoyningen-Huene, N., et al. (2009). ASPOGAMO: Automated sports
games analysis models. International Journal of Computer Science in Sport.

[6] Bialkowski, A., Lucey, P., Carr, P., Yue, Y., & Matthews, I. (2014). Win at home and
draw away: Automatic formation analysis highlighting differences in team
behaviors. Proceedings of the MIT Sloan Sports Analytics Conference.

[7] Breihofer, P., & Memmert, D. (2006). Optimales Taktiktraining im Leistungsfußball.
Zahlreiche Editionen.

[8] Breihofer, P., V., Memmert, D. (2006). Tactical training optimization in elite football.

[9] Carling, C., Kannekens, R., Sampaio, J., & Yiannakos, A. (2005). Tactical behavior in
elite football: formation and movement analysis. Journal of Sports Sciences.

[10] Forcher, F., Beckmann, T., et al. (2024). Prediction of defensive success in elite soccer
using machine learning – Tactical analysis of defensive play using tracking data
and explainable AI. Science and Medicine in Football.

[11] Franks, I. M., & Miller, G. (Year). Expert observers’ recollection of match events.

[12] Franks, I. M., Miller, G. (Year). Human observational error in sports analytics.

[13] Goes, F. R., Meerhoff, L. A., et al. (2021). Unlocking the potential of big data to
support tactical performance analysis in professional soccer: A systematic review.
European Journal of Sport Science.

[14] Gudmundsson, J., & Horton, M. (2017). Spatio-Temporal Analysis of Team Sports: A
Survey. ACM Computing Surveys.

[15] Gyarmati, L., & Anguera, X. (2015). Automatic Extraction of the Passing Strategies
of Soccer Teams.

[16] Gyarmati, L., & Anguera, X. (2015). DTW-based passing strategy detection. arXiv.

[17] Horton, M., & Gudmundsson, J. (2016). Flow Diagrams for State Sequences in Team
Sports. ACM Journal of Experimental Algorithmics.

[18] Horton, M., Chawla, S., Estephan, J. (2014). Computational geometry approaches to
pass classification. arXiv.

[19] Horton, M., Gudmundsson, J., Chawla, S., & Estephan, J. (2014). Classification of
Passes in Football Matches using Spatiotemporal Data.

[20] Memmert, D. (2015). Teaching Tactical Creativity in Sport. Routledge.

[21] Memmert, D. (2017). Match Analysis, Big Data and Tactics: Current Trends in Elite
Soccer. German Journal of Sports Medicine.

[22] Memmert, D. (2017). Spatio-temporal tracking and tactical patterns in elite football.

[23] Memmert, D., & Raabe, D. (2017). Revolution in professional football: data-driven
analysis 4.0.

[24] Memmert, D., & Raabe, D. (2019). Data analytics in football: positional data
collection, modelling and analysis. Routledge.

[25] Memmert, D., Lemmink, K., & Sampaio, J. (2017). Current approaches to tactical
performance analyses in soccer using position data. Sports Medicine.

[26] Memmert, D., Lemmink, K., & Sampaio, J. (2020). Collective team behavior and
positional spatio-temporal modeling. Sports Medicine.

[27] Memmert, D., Lemmink, K. A. P. M., & Sampaio, J. (2020). A systematic review of
collective tactical behaviours in football using positional data. Sports Medicine.

[28] Rein, R., & Memmert, D. (2016). Big data and tactical analysis in elite soccer:
Future challenges and opportunities for sports science. SpringerPlus.

[29] Rein, R., & Memmert, D. (2016). Future challenges for sports science in big data.
PubMed Central.

[30] Rein, R., & Memmert, D. (2016). Soccer big data models for tactical decision
making. German Journal of Sports Medicine.

[31] Rein, R., Raabe, D., & Memmert, D. (2017). Which pass is better? Novel approaches
to assess passing effectiveness in elite soccer matches. Human Movement Science.

[32] Rein, R., Raabe, D., et al. (2017). Tactical creativity via spatio-temporal tracking
data. Human Movement Science.

[33] Sampaio, J., & Maças, V. (2012). Measuring tactical behavior in football. Journal of
Sports Sciences.

[34] Teferi, G., & Endalew, D. (2020). Methods of Biomechanical Performance Analyses
in Sport: A Systematic Review. American Journal of Sports Science and Medicine.

[35] Teferi, G., & Endalew, D. (2020). Sports biomechanics systematic approaches. Am J
Sports Sci Med.
