ARMIN ŠARIĆ
2025
DATA DRIVEN INSIGHTS INTO FOOTBALL
PERFORMANCE: MACHINE LEARNING ANALYSIS OF THE
PREMIER LEAGUE OF BOSNIA AND HERZEGOVINA
BY
ARMIN ŠARIĆ
June, 2025
APPROVAL PAGE
I certify that I have supervised and read this study and that in my opinion, it conforms to
acceptable standards of scholarly presentation and is fully adequate, in scope and quality,
as a master’s thesis for the degree of Master of Science in Computer Sciences and
Engineering.
…………………………………………..
Assist. Prof. Dr. Mohammed Saeed Jawad
Mentor
I certify that I have read this study and that in my opinion, it conforms to acceptable
standards of scholarly presentation and is fully adequate, in scope and quality, as a
master’s thesis for the degree of Master of Science in Computer Sciences and Engineering.
…………………………………………..
[Committee member’s academic title and full name]
Committee Member
…………………………………………..
[Committee member’s academic title and full name]
Committee Member
This master’s thesis was submitted in partial fulfillment of the requirements for the degree
of Master of Science in Computer Sciences and Engineering.
…………………………………………..
Assist. Prof. Dr. Amal Mersni
Program Coordinator
…………………………………………..
Assoc. Prof. Dr. Altijana Hromić-Jahjaefendić
Dean
DECLARATION
I hereby declare that all information in this document has been obtained and presented in
accordance with academic rules and ethical conduct. I also declare that, as required by
these rules and conduct, I have fully cited and referenced all material and results that are
not original to this work.
INTERNATIONAL UNIVERSITY OF SARAJEVO
ACKNOWLEDGEMENTS
I would like to thank all the people who motivated, supported, and guided me through the
process of finishing this master's thesis. This research is not only an academic achievement
but a personal experience that was possible because of the relentless support of various
people.
First and foremost, I would like to offer my most heartfelt thanks to my family for their
unrelenting support, patience, and understanding through every stage of this academic
endeavor. Their belief in me, even during the darkest moments, has been a constant source
of strength and encouragement. Were it not for their love and encouragement, this triumph
would not have been achieved.
I would also like to extend my heartfelt thanks to my good friends, especially those who
accompanied me every step of the way with genuine interest, stimulating feedback, and
well-timed moral support. Your inspiring words and thoughtful discussions kept me
grounded and focused.
In addition, I would like to express thanks and appreciation to all the professors who taught
me throughout the duration of my study. Their instructional dedication, scholarly rigor,
and intellectual provocation laid the requisite foundation for this thesis.
All of you have had a lasting impact on my academic and personal growth, and I am truly
grateful to have been instructed by you.
To all my fellow colleagues, academic staff, and everyone who contributed in some form,
big or small, by offering insights, being helpful, or simply being a good listener: thank
you for being part of the journey.
This thesis is a result of teamwork, and I am humbled and grateful for the support I have
had all along.
LIST OF ABBREVIATIONS
ACC Accuracy
AI Artificial intelligence
API Application Programming Interface
AUC Area under the curve
BiH Bosnia and Herzegovina
CSV Comma-Separated Values
DI Disciplinary Index
EDA Exploratory Data Analysis
F1 Score Harmonic Mean of Precision and Recall
G/A Goals and Assists (Goal Involvement)
GD Goal Difference
JSON JavaScript Object Notation
KP Key Passes
KPI Key Performance Indicator
LR Logistic Regression
ML Machine Learning
PCA Principal Component Analysis
PLBiH Premier League of Bosnia and Herzegovina
PRE Precision
REC Recall
RF Random Forest
RMSE Root Mean Squared Error
ROC Receiver Operating Characteristic
SVM Support Vector Machine
xG Expected Goals
XGBoost Extreme Gradient Boosting
TABLE OF CONTENTS
DECLARATION
ACKNOWLEDGEMENTS
ABSTRACT
1. INTRODUCTION
3. RESEARCH METHODOLOGY
5. CONCLUSION
REFERENCES
ABSTRACT
DATA DRIVEN INSIGHTS INTO FOOTBALL PERFORMANCE: MACHINE
LEARNING ANALYSIS OF THE PREMIER LEAGUE OF BOSNIA AND
HERZEGOVINA
This study evaluates and predicts football performance in the 2023–2024 season of the
Premier League of Bosnia and Herzegovina using data-driven methods and machine learning
models. Based on structured match data downloaded from SofaScore, the research integrates
team-level statistics, such as goals scored and conceded, possession, disciplinary record,
and clean sheets, with player-level metrics for defenders, midfielders, and forwards.
A multi-method strategy was followed with a mix of exploratory data analysis, correlation
analysis, clustering (K-means), and supervised machine learning models like Logistic
Regression, Random Forest, and XGBoost classifiers. Models were applied to predict
league positions, relegation, and key player contributions (like top foulers, assist leaders,
and top goal contributors). Feature engineering was used to enrich the raw measures, and
stratified cross-validation was used to improve generalizability on small datasets.
The findings point to goals conceded, clean sheets, and interceptions as the strongest
predictors of team success, as confirmed by feature importance scores from XGBoost and
Random Forest. Relegation prediction models revealed significant differences in
defensive solidity, scoring efficiency, and disciplinary behavior between relegated and
non-relegated teams. In the player-level models, Logistic Regression consistently
outperformed more complex models, offering high accuracy, precision, and explainability,
particularly in identifying key players by position.
LIST OF TABLES
LIST OF FIGURES
Figure 1 Data extracted in JSON format for Premier League of BiH
Figure 13 Relationship between penalty success rate and league position
1. INTRODUCTION
Football is not only the most popular sport in the world but also a cultural and social
phenomenon [1]. Beyond the pitch, football affects millions through national identity,
community cohesion, economic investment, and entertainment. With billions of fans
worldwide, the sport has evolved into a complex ecosystem in which scientists, analysts,
and data engineers work alongside players and coaches.
Over the past decade, football has attracted significant interest from the scientific
community, particularly as the connection between sport performance and digital
technologies has become more prominent. As AI, data analysis, and sports science
increasingly overlap, a new era of research and practice in football has begun. Football
decisions are now based on data more than ever, rather than on gut feeling alone. Thanks
to the growing volume of detailed match data and sophisticated analytics, the activities
of players and teams can be understood in much broader terms than traditional statistics
allow [2].
Meanwhile, machine learning (ML) has become a key component of modern sports analytics [4].
ML algorithms in football are used not just for post-match analysis but also to
forecast match outcomes, identify opponent strengths and weaknesses, assign player
valuations, optimize team formations, and even suggest real-time tactical
instructions. The ability of ML models to detect hidden patterns in high-dimensional data
sets makes them well suited to making sense of the dynamic and complex nature of football.
Leading European football clubs, such as Manchester City, Bayern Munich, and Liverpool,
already use these tools, embedding data science in their scouting, training load
optimization, and opposition analysis processes [5].
While these developments have been unfolding worldwide, many smaller national leagues,
such as the Premier League of Bosnia and Herzegovina, have yet to embrace data analysis
to any great extent, either in practice or in academic research. This is particularly
striking given the league's rich football heritage, enthusiastic fanbase, and
socio-cultural value. Although the passion for the game is there, along with the
willingness of coaches to use advanced analytical tools, there is little empirical
evidence that clubs or league administrators regularly use advanced data analytics to
guide their decision-making. This lack represents a significant research gap and opens a
clear window of opportunity to explore how data science and machine learning can be
applied to a league that remains under-analyzed in the existing literature.
The aim of this study is to conduct a systematic analysis of performance statistics from
the Premier League of Bosnia and Herzegovina using machine learning models. The dataset,
acquired from SofaScore in JSON format, was cleaned and normalized into structured CSV
files suitable for statistical analysis and model construction. By analyzing a wide range
of team statistics, including goals, ball possession, disciplinary records, and defensive
statistics, the study seeks to uncover the key drivers of performance that determine
league success. In addition, the project uses predictive modeling to forecast league
position, estimate the probability of relegation, and provide insight into other
performance-related variables.
football analysis by demonstration of a regional case study in which usage of such tools
remains in its infancy. The findings of this study can not only contribute to theoretical
advancement in sport data analysis but also offer practical value for stakeholders
who want to modernize football training in this league and in similar environments.
The study also extends its analytical focus beyond team-level data by incorporating
player-level modeling. It investigates how player performance metrics can be used to
inform play on the field and how they affect team performance. Three separate predictive
models are developed, one for each player role: attackers (involvement in scoring),
midfielders (assist rate), and defenders (foul rates). These models allow for a more
nuanced assessment of individual players' contributions to overall team performance. By
taking this two-level approach, at both team and player levels, the research provides a
comprehensive framework for assessing football performance with machine learning methods.
However, this transformation has not occurred uniformly everywhere. There are substantial
disparities between high-resource and low-resource leagues, particularly in the theoretical
and practical application of advanced analytics. The Premier League of Bosnia and
Herzegovina is one such league that has yet to fully experience the impact of data-driven
methods. Despite its strong football tradition and growing online media presence, the use
of predictive analytics and systematic data analysis in this league remains marginal and
underdeveloped.
While platforms like SofaScore and Opta have made basic match data publicly available,
such as the number of goals, possession percentage, and disciplinary record, these
figures are reported without further analytical interpretation. Although clubs and
analysts do pay attention to such basic numbers, the data rarely undergo higher-level
statistical or machine learning analysis that could uncover hidden trends, inform
evidence-driven coaching, or facilitate tactical optimization.
In high-resource leagues, there is a growing body of evidence that machine learning can
successfully predict match outcomes, simulate relegation chances, assess player
contribution, and quantify subtle tactical efficiencies in games [6]. Highly detailed
performance data, such as pressing pressure, pass networks, xG, zone occupation, and
defensive actions, are the typical inputs used in these studies. By contrast, such
multi-layered and context-sensitive analysis has yet to be applied in depth to the
Premier League of Bosnia and Herzegovina, leaving a large knowledge gap and considerable
untapped potential [7].
This deficiency not only deprives clubs in less prominent leagues of the strategic
benefits of data science but also narrows the evidence base of football analytics
research. Rein and Memmert contend that the disproportionate representation of elite
leagues in academic scholarship limits generalizability and overlooks contextual
differences in less high-profile leagues [8]. Without extending analysis to these
alternative environments, football scholarship remains incomplete, omitting
considerations of different playing styles, resource shortages, and the cultural framing
of the game.
Furthermore, the absence of predictive modeling in such leagues limits the ability to
detect early indicators of success or failure over the course of a season. Without
predictive models to identify which statistics are most strongly associated with winning
performance or relegation risk, stakeholders (coaches, analysts, and club officials) are
deprived of knowledge that could make their competitive planning more effective.
This inhibits evidence-based decision-making in training, match preparation, and
recruitment. Equally overlooked is the scope for individual player-level modeling in
lower-league contexts. Although analysis at the team level has gained some traction,
little work has examined how players in particular roles (forward, central midfielder, or
defender) contribute individually to overall team performance. Player-level
classification models offer a more detailed view of performance, enabling identification
of high-impact players, recurring tactical issues, or positional strengths and weaknesses
that might otherwise be lost in aggregate metrics.
In view of these theoretical and applied gaps, there is a clear and pressing need for
dedicated research into the application of machine learning and statistical modeling in
the Premier League of Bosnia and Herzegovina. By applying advanced data science
techniques to a regional football setting, this study not only fills an important
research gap but also offers valuable insight to stakeholders who operate within
similarly underrepresented leagues. The overall goal is to demonstrate the viability and
value of ML-based analytics in performance modeling even where data availability is
moderate and technological infrastructure is minimal.
The purpose of this research is to thoroughly investigate team and player performance
in the Premier League of Bosnia and Herzegovina using both descriptive statistics
and advanced machine learning (ML) techniques. Although basic match statistics, e.g.,
number of goals, possession share, or yellow cards, are commonly released to the public,
such measurements support little more than superficial examination. Beneath them lies
considerable potential to uncover higher-level trends, relationships, and
decision-critical information.
This study formulates several research questions intended to bridge the gap between mere
data presentation and substantive football intelligence. Each question corresponds to an
analytical task undertaken in the study and is supported by an extensive dataset derived
from SofaScore. Additionally, the questions draw inspiration from the existing body of
research on top leagues, where these approaches have proven worthwhile.
Q1: What do descriptive statistics and correlation analysis reveal about the most
important predictors of performance that distinguish successful from unsuccessful
teams in the BiH Premier League?
Q2: How does average ball possession contribute to team performance in terms of goals
scored, goals conceded, and finishing position within the league?
Q3: How does discipline, for example yellow and red cards, affect team performance
and league position?
Although discipline is sometimes regarded as a subjective issue, this study quantitatively
tests the hypothesis that poor discipline (cumulative yellow and red cards) is negatively
correlated with team success. By comparing these disciplinary statistics with others, such
as goals conceded and league position, the study examines whether teams with worse
discipline perform worse on average.
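As an illustration of how such a hypothesis could be checked, the sketch below computes a
Spearman rank correlation between card totals and final position. It is a minimal example
under stated assumptions: the six team values are placeholders rather than data from this
study, and the helper names (`ranks`, `pearson`, `spearman`) are hypothetical. Because
position 1 is the best finish, a positive coefficient between cards and position would be
consistent with poorer discipline accompanying a worse finish.

```python
def ranks(values):
    """Return 1-based ranks; tied values share their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    start = 0
    while start < len(order):
        end = start
        # Extend the block while the next value ties with the current one.
        while end + 1 < len(order) and values[order[end + 1]] == values[order[start]]:
            end += 1
        average_rank = (start + end) / 2 + 1
        for k in range(start, end + 1):
            result[order[k]] = average_rank
        start = end + 1
    return result


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / (var_x * var_y) ** 0.5


def spearman(x, y):
    """Spearman rank correlation: the Pearson correlation of the ranks."""
    return pearson(ranks(x), ranks(y))


# Placeholder values for six hypothetical teams (not data from the thesis).
total_cards = [55, 60, 62, 70, 75, 81]   # yellow + red cards in a season
league_position = [1, 3, 2, 4, 6, 5]     # final position, 1 = champion

rho = spearman(total_cards, league_position)
print(f"Spearman rho between cards and position: {rho:.3f}")
```

A rank-based coefficient is used here because league position is ordinal; with only a
handful of teams, any such coefficient should be read as descriptive rather than
conclusive.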
Q4: What are the important features to predict the league standing, and in what
order do machine learning classifiers such as Random Forest and XGBoost rank
them?
Machine learning is not only predictive but can also be made explainable through measures
of feature importance. In this case, Random Forest and XGBoost are trained to predict
league position from various match statistics. The models identify which features, such
as goals conceded, successful interceptions, accurate passes, or clean sheets, contribute
most to end-of-season team positions and rank them in order of importance.
Q5: Can machine learning algorithms predict ultimate league position or relegation
destiny using team performance statistics?
Building on the above, the research employs both binary classification and regression
models to assess whether season-long performance measurements can accurately forecast
the position in which a team will finish and whether it is at risk of relegation. The
performance of the models is measured in terms of accuracy, AUC, and precision-recall
to determine their usability in real-world applications.
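The evaluation described above can be sketched as follows. This is an illustrative
example only: the data are synthetic (generated with scikit-learn's `make_classification`),
Logistic Regression stands in for whichever classifier is being assessed, and the fold
count and 0.5 decision threshold are assumptions rather than settings taken from this
study. Stratified folds keep the share of the minority class (here standing in for
"relegated") similar in every split, which matters when positives are rare.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import (
    accuracy_score,
    average_precision_score,
    precision_score,
    recall_score,
    roc_auc_score,
)
from sklearn.model_selection import StratifiedKFold, cross_val_predict
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic, imbalanced stand-in for team-season records (1 = "relegated").
X, y = make_classification(
    n_samples=60, n_features=6, n_informative=3,
    weights=[0.8, 0.2], flip_y=0.0, random_state=0,
)

# Scaling lives inside the pipeline so it is refit on each training fold.
model = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)

# Out-of-fold probabilities: each sample is scored by a model that never saw it.
proba = cross_val_predict(model, X, y, cv=cv, method="predict_proba")[:, 1]
pred = (proba >= 0.5).astype(int)

metrics = {
    "accuracy": accuracy_score(y, pred),
    "precision": precision_score(y, pred, zero_division=0),
    "recall": recall_score(y, pred),
    "roc_auc": roc_auc_score(y, proba),
    "average_precision": average_precision_score(y, proba),
}
for name, value in metrics.items():
    print(f"{name}: {value:.3f}")
```

With a single league season the number of team-level observations is far smaller than in
this sketch, so fold counts would need to be reduced accordingly.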
Q6: In what ways are relegated team performance profiles different from non-
relegated teams, and how are trends revealed using comparative analysis?
The study contrasts relegated and non-relegated teams across all relevant parameters
using group-wise statistical comparisons. It highlights early warning indicators and
gameplay deficiencies that typically appear in teams that fail to stay up. It also puts
poor team performance into perspective relative to the league average.
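A group-wise comparison of this kind might look like the sketch below. The team records
and the two metrics are placeholder values chosen for illustration, and `group_means` is a
hypothetical helper; the actual comparison in this study covers more parameters.

```python
from statistics import mean

# Placeholder season totals for hypothetical teams (not data from the thesis).
teams = [
    {"team": "Team A", "relegated": False, "goals_conceded": 25, "clean_sheets": 15},
    {"team": "Team B", "relegated": False, "goals_conceded": 35, "clean_sheets": 11},
    {"team": "Team C", "relegated": True, "goals_conceded": 58, "clean_sheets": 4},
    {"team": "Team D", "relegated": True, "goals_conceded": 62, "clean_sheets": 6},
]
metric_names = ["goals_conceded", "clean_sheets"]


def group_means(rows, names):
    """Mean of each metric for relegated teams, non-relegated teams, and the league."""
    groups = {
        "relegated": [r for r in rows if r["relegated"]],
        "non_relegated": [r for r in rows if not r["relegated"]],
        "league": rows,
    }
    return {
        label: {name: mean(r[name] for r in subset) for name in names}
        for label, subset in groups.items()
    }


summary = group_means(teams, metric_names)
for label, values in summary.items():
    print(label, values)
```

Including the league-wide mean alongside the two groups gives the reference point needed
to judge how far relegated teams fall from the average.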
Q7: Can unsupervised learning (clustering) be applied to classify teams by their
performance measures, and what is the relationship between the clusters and league
tables?
Unsupervised learning with k-means clustering is used to group teams into performance
types such as dominant, balanced, or underperforming. The study then investigates how
these empirically determined groups correspond to the actual league table, providing an
independent check of the performance types.
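The clustering step can be sketched as follows. The feature matrix is a small set of
placeholder values for nine hypothetical teams, and the choice of three clusters and of
these three features is an assumption made for illustration. Features are standardized
first because k-means relies on Euclidean distance and would otherwise be dominated by
the variable with the largest scale.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

# Placeholder per-team features: [goals scored, goals conceded, avg possession %].
features = np.array([
    [70, 25, 58.0], [65, 30, 56.5], [62, 28, 55.0],   # strong profiles
    [45, 44, 50.0], [42, 46, 49.5], [40, 43, 51.0],   # mid-table profiles
    [25, 62, 44.0], [28, 65, 43.5], [22, 60, 45.0],   # weak profiles
])
final_position = np.arange(1, 10)  # hypothetical final table, 1 = champion

scaled = StandardScaler().fit_transform(features)
kmeans = KMeans(n_clusters=3, n_init=10, random_state=0)
labels = kmeans.fit_predict(scaled)

# Compare each cluster with the league table through its members' positions.
for cluster in np.unique(labels):
    members = final_position[labels == cluster]
    print(f"cluster {cluster}: positions {members.tolist()}, mean {members.mean():.1f}")
```

Cluster numbers are arbitrary identifiers; names such as "dominant" or "underperforming"
have to be assigned afterwards by inspecting each cluster's average profile.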
Q8: Can one utilize player-level stats to predict whether defenders will commit a lot
of fouls, midfielders will provide a lot of assists, or attackers will contribute heavily
in goal involvement?
After team-level modeling, this question shifts attention toward player-level
contributions. ML classifiers are trained to predict whether a given player is likely to
exceed a certain threshold of fouls (defenders), assists (midfielders), or goals/goal
involvement (attackers). Such prediction tasks enable a more detailed understanding
of who contributes the most in their respective roles.
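Turning a count such as fouls into a binary classification target requires a threshold.
The sketch below shows one possible construction for defenders. Both the per-90-minute
normalization and the 75th-percentile cut-off are assumptions made for illustration, not
the thresholds used in this study, and the player records are placeholder values.

```python
from statistics import quantiles

# Placeholder defender records (hypothetical values, not data from the thesis).
defenders = [
    {"player": "D1", "fouls": 18, "minutes": 1800},
    {"player": "D2", "fouls": 45, "minutes": 2700},
    {"player": "D3", "fouls": 12, "minutes": 900},
    {"player": "D4", "fouls": 60, "minutes": 2700},
    {"player": "D5", "fouls": 27, "minutes": 1350},
    {"player": "D6", "fouls": 8, "minutes": 1440},
    {"player": "D7", "fouls": 33, "minutes": 1800},
    {"player": "D8", "fouls": 22, "minutes": 1980},
]


def per_90(count, minutes):
    """Scale a season count to a rate per 90 minutes played."""
    return 90 * count / minutes


rates = [per_90(d["fouls"], d["minutes"]) for d in defenders]

# Assumed cut-off: the third quartile of the observed foul rates.
threshold = quantiles(rates, n=4)[2]
labels = [int(rate > threshold) for rate in rates]

high_foulers = [d["player"] for d, label in zip(defenders, labels) if label]
print(f"threshold = {threshold:.2f} fouls per 90, high foulers: {high_foulers}")
```

The same pattern would apply to assists for midfielders and goal involvement for
attackers, with the resulting labels serving as the target for the classifiers.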
Q9: In what way are such individual player predictions reflected at the team level in
performance and outcome, i.e., league standing, goal threat, or defensive
strength?
Finally, this question probes the ripple effects of individual play. It asks whether teams
with high-quality players in specific roles (e.g., prolific scorers or playmakers) tend to
rank higher in aggregate league statistics. The answers help to link micro-level player
analysis to macro-level team success.
1.4. Objectives
The primary aim of this study is to apply data-driven and machine learning techniques to
examine and predict football team performance in the Premier League of Bosnia and
Herzegovina. Although these techniques have been applied extensively on major
European leagues, this study will try to fill a knowledge gap in the literature by focusing
on a league that has not received a lot of attention in terms of advanced statistical or
predictive studies.
These goals align with broader trends across sports analytics, where machine learning and
performance modeling have already found value in real-time decision-making, player
evaluation, and tactical planning [11]. Specifically, the study will aim to extract useful
insight from match data, compare the efficacy of predictive models for forecasting league-
related outcomes such as final position or relegation, and establish the relevance of various
performance metrics. Specific goals are outlined below:
7. To use feature importance analysis to determine which features have the most
impact on team performance and rank them accordingly.
8. To provide practical implications to clubs, analysts, and football decision-makers
for how data can be effectively used for measurement and improvement of
performance.
9. To train and test classification models using Random Forest and Logistic Regression
for defenders, midfielders, and forwards to predict foul tendency, assist rate,
and goal involvement, respectively.
10. To investigate how key players affect overall team performance in the
Premier League of Bosnia and Herzegovina by combining individual player
modeling with team performance analysis.
The study starts with data collection, in which match-level and player-level data are
retrieved from the SofaScore website. SofaScore is a reputable provider of live and
historical football statistics and was chosen for its accessibility and the level of
detail it provides. Raw data, initially in JavaScript Object Notation (JSON) format, were
converted into Comma-Separated Values (CSV) format to enable easier handling within
Python-based data analysis environments. The data set contains a wide variety of
numerical and categorical variables that capture various dimensions of team and player
performance. Examples include, but are not limited to, goals scored and conceded, average
possession, accurate passes, yellow and red cards, clean sheets, interceptions, fouls,
penalties, and other contextual measures.
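The JSON-to-CSV conversion described above can be sketched with the Python standard
library. The payload below is hypothetical: its keys (`teams`, `name`, `statistics`,
`goalsScored`, and so on) are invented for illustration and are not claimed to match the
structure returned by SofaScore. The `flatten` helper is likewise an assumption about how
nested records might be turned into flat CSV columns.

```python
import csv
import io
import json

# Hypothetical JSON payload; the real SofaScore structure may differ.
raw = """
{"teams": [
  {"name": "Team A",
   "statistics": {"goalsScored": 55, "goalsConceded": 30, "yellowCards": 61}},
  {"name": "Team B",
   "statistics": {"goalsScored": 38, "goalsConceded": 47, "yellowCards": 74}}
]}
"""


def flatten(record, parent_key="", sep="_"):
    """Flatten nested dictionaries into a single level with joined keys."""
    flat = {}
    for key, value in record.items():
        full_key = f"{parent_key}{sep}{key}" if parent_key else key
        if isinstance(value, dict):
            flat.update(flatten(value, full_key, sep))
        else:
            flat[full_key] = value
    return flat


rows = [flatten(team) for team in json.loads(raw)["teams"]]
fieldnames = sorted({key for row in rows for key in row})

# An in-memory buffer keeps the sketch self-contained; a real run would use
# open("teams.csv", "w", newline="") instead.
buffer = io.StringIO()
writer = csv.DictWriter(buffer, fieldnames=fieldnames)
writer.writeheader()
writer.writerows(rows)
csv_text = buffer.getvalue()
print(csv_text)
```

Collecting the union of keys across all rows before writing ensures that a statistic
missing for one team produces an empty cell rather than an error.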
Once the data are gathered, an extensive preprocessing step ensures data quality and
integrity. This step is essential to avoid biased outcomes in model training and
statistical testing. Operations performed during this phase include the elimination of
duplicate records, handling of missing values through imputation or exclusion (as
required), data type conversion, and normalization where necessary. Feature engineering
is also used to develop new domain-specific variables that broaden the scope of the
analysis. The main engineered features are goal difference (goals scored minus goals
conceded), disciplinary index (a weighted summary of yellow and red cards), and penalty
success rate. The data are also checked for outliers, distribution skewness, and
multicollinearity, and the necessary adjustments are made to provide a robust input space
for subsequent modeling.
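The three engineered features can be written as small functions, as in the sketch below.
The card weights (1 for a yellow card, 3 for a red card) are an assumption made for
illustration, since the weighting is not specified at this point in the text, and the
field names of the sample record are hypothetical.

```python
def goal_difference(goals_scored, goals_conceded):
    """Goals scored minus goals conceded."""
    return goals_scored - goals_conceded


def disciplinary_index(yellow_cards, red_cards, yellow_weight=1.0, red_weight=3.0):
    """Weighted sum of cards; the default weights are illustrative assumptions."""
    return yellow_weight * yellow_cards + red_weight * red_cards


def penalty_success_rate(penalties_scored, penalties_taken):
    """Share of penalties converted, or None when no penalty was taken."""
    if penalties_taken == 0:
        return None
    return penalties_scored / penalties_taken


# Placeholder season totals for one hypothetical team (not data from the thesis).
team = {
    "goals_scored": 55, "goals_conceded": 30,
    "yellow_cards": 61, "red_cards": 4,
    "penalties_scored": 5, "penalties_taken": 8,
}

engineered = {
    "goal_difference": goal_difference(team["goals_scored"], team["goals_conceded"]),
    "disciplinary_index": disciplinary_index(team["yellow_cards"], team["red_cards"]),
    "penalty_success_rate": penalty_success_rate(
        team["penalties_scored"], team["penalties_taken"]
    ),
}
print(engineered)
```

Returning `None` for teams that took no penalties avoids a division by zero and leaves the
choice of how to treat the missing value to the preprocessing step.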
The next phase is exploratory data analysis (EDA) and statistical analysis, whose aim
is to provide baseline insight into variable relationships. Descriptive statistics (e.g.,
means, medians, standard deviations) are computed, and visualisations (e.g., histograms,
heatmaps, scatter plots) are employed to review patterns among performance measures.
Correlation matrices are generated to inspect linear relationships among variables, in
particular their relationships with end-of-season position and relegation status. These
exploratory results help guide variable selection for machine learning and serve as an
initial diagnostic layer.
The core analysis phase applies supervised and unsupervised machine learning methods
to extract knowledge from the data. In the supervised learning pathway, Random Forest
and XGBoost models are applied to regression (predicting final league position) and
classification (predicting relegation) tasks. These models are chosen for their
suitability for small-to-medium datasets, strong predictive power, and support for
feature importance scoring. The models are validated with performance measures
appropriate to the task: R-squared and RMSE for regression problems, and accuracy,
precision, recall, and F1-score for classification. Cross-validation and grid search are
used for hyperparameter tuning and to assess generalization performance.
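A minimal sketch of grid search with cross-validation for the regression task is given
below. It uses synthetic data from scikit-learn's `make_regression`, and the parameter
grid, fold count, and hold-out split are assumptions chosen to keep the example small;
they are not the settings used in this study. The tuned model is then scored on held-out
data with RMSE and R-squared.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.model_selection import GridSearchCV, KFold, train_test_split

# Synthetic stand-in for team statistics (X) and a numeric target (y).
X, y = make_regression(
    n_samples=80, n_features=6, n_informative=4, noise=10.0, random_state=0
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0
)

# Small illustrative grid; every combination is scored by 5-fold CV.
param_grid = {"n_estimators": [50, 100], "max_depth": [3, None]}
search = GridSearchCV(
    RandomForestRegressor(random_state=0),
    param_grid,
    cv=KFold(n_splits=5, shuffle=True, random_state=0),
    scoring="neg_root_mean_squared_error",
)
search.fit(X_train, y_train)

# The best configuration is refit on all training data and scored on the hold-out set.
pred = search.predict(X_test)
rmse = float(np.sqrt(mean_squared_error(y_test, pred)))
r2 = float(r2_score(y_test, pred))

print("best parameters:", search.best_params_)
print(f"hold-out RMSE = {rmse:.2f}, R-squared = {r2:.3f}")
```

Keeping a hold-out set separate from the folds used for tuning gives an estimate of
generalization that is not biased by the hyperparameter search itself.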
The approach also extends to modeling at the player level, in which supervised
classifiers predict individual player tendencies across three positional groups:
defenders, midfielders, and forwards. For defenders, the objective is to predict the
likelihood of being a high-fouling player; for midfielders, whether they are likely to
record high assist rates; and for forwards, whether they are actively involved in goal
scoring. Each positional model is evaluated by AUC, precision, recall, and F1-score to
assess its ability to discriminate high-impact players from average performers.
Player-level predictions are also compared with team-level success to examine the
connection between individual and collective performance.
Lastly, the research includes a feature importance analysis. This is crucial for
model explainability and for drawing practical insights that can be communicated to
coaches, analysts, and team managers. By ranking features according to their predictive
power in Random Forest and XGBoost models, the study identifies which statistical
metrics contribute most to league performance. These findings are useful not only for
retrospective examination but also as inputs for future strategy development and club
performance monitoring across the region.
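The ranking step can be illustrated with a Random Forest's impurity-based importances, as
in the sketch below. The data are synthetic, and the feature names are borrowed from the
text purely as labels: the target is constructed so that the first column matters most,
which says nothing about the real league. The sample size and forest size are also
assumptions.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(0)
feature_names = ["goals_conceded", "clean_sheets", "interceptions", "noise_feature"]

# Synthetic features; the target depends strongly on column 0, less on 1 and 2,
# and not at all on column 3.
X = rng.normal(size=(200, 4))
y = 3.0 * X[:, 0] - 1.5 * X[:, 1] + 0.5 * X[:, 2] + rng.normal(scale=0.1, size=200)

model = RandomForestRegressor(n_estimators=200, random_state=0).fit(X, y)

# Impurity-based importances sum to 1; sort from most to least important.
ranking = sorted(
    zip(feature_names, model.feature_importances_),
    key=lambda item: item[1],
    reverse=True,
)
for name, importance in ranking:
    print(f"{name:>15}: {importance:.3f}")
```

Impurity-based importances can be unstable when features are strongly correlated, so on
real match statistics they are best read alongside the correlation analysis.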
The five main chapters that make up this thesis each focus on a specific stage of the
research process and contribute to the main goal of understanding and predicting football
team performance through data analytics and machine learning. The structure aims to lead
the reader from the introduction and motivation through the method design and analysis to
the interpretation and conclusions drawn from the results.
CHAPTER 1 – Introduction
The context and motivation for the research are presented in this first chapter, which also
identifies the lack of data-driven evidence within the Premier League of Bosnia and
Herzegovina and presents the main research questions and goals. It also provides an
overview of the thesis outline and the methodology.
CHAPTER 2 – Literature review
This chapter reviews the literature on football performance analysis, statistical and
machine learning applications in sport, and prior uses of predictive modelling in elite
and sub-elite leagues. In line with proposals for taking football analytics into
unexplored competitions, it focuses on important issues and gaps in the literature, which
this research aims to fill.
CHAPTER 3 – Methodology
CHAPTER 4 – Data analysis and results
CHAPTER 5 – Conclusion
Key findings, a summary of the answers to the research questions, and the contribution of
the study to football analytics are all covered in Chapter 5. It also highlights areas for
future research and makes a few suggestions for BiH Premier League stakeholders.
Every chapter builds upon the one before it, providing a reasoned progression and a
comprehensive examination of the research problem. In line with best practice in
sport science and data analysis research, this organization supports both theoretical
investigation and applied findings.
2. LITERATURE REVIEW
2.1. Introduction
The last decade has revolutionized professional practice and research in football
performance analysis through the introduction of data analysis and machine learning
techniques [26]. Whereas the field once relied heavily on human intuition, anecdotal
evidence, and elementary statistics, modern analytics has transformed it into a
data-driven discipline characterized by predictive models, tactical simulations, and
real-time decision support systems. This revolution has made it possible to break down
almost every element of the game, ranging from player positioning and ball recovery zones
to changes in team shape and the introduction of strategies in real match scenarios.
Access to publicly available performance data and purpose-designed analysis software has
driven this revolution, enabling analysts, coaches, and researchers to go far beyond the
boundaries of conventional performance reviews.
The present study draws on this emerging literature by focusing on the role of
performance measures and machine learning models in football analysis. It attempts to
bridge the gap between the big leagues with extensive data infrastructures and small
leagues where data-driven methods are still emerging. Specifically, the review
presents a critical synthesis of the literature on performance analysis, tactical
development, supervised and unsupervised learning, and data limitations in low-resource
environments. Particular emphasis is placed on studies that investigate lower-tier
leagues and aim to generalize machine learning-generated insights across competitive
arenas of varying maturity.
To be concise and readable, this chapter is composed of five closely related sections:
- Overview of football performance metrics and how they may be utilized within
team performance measurement;
- Overview of machine learning methods used in football analytics;
- Identification of literature gap within the context of the Premier League of Bosnia
and Herzegovina.
This systematic presentation sets the stage for discussing study design choices and
highlighting how prior work guides the application of analytical tools to a
less-researched football environment.
At the heart of football analytics is the collection and analysis of playing statistics, which
provide quantifiable measures for the way that games unfold. Traditional measures such
as goals, shots on target, and possession rates have been used for decades to summarize
match results and are still central to performance evaluation. Contemporary research has
significantly expanded the analysis domain by incorporating advanced statistics that allow
for deeper tactical interpretation.
Some of the most important modern metrics include expected goals (xG), pass network
centrality, pressing zones, and area-based coverage. These metrics go beyond reporting
results and help analysts understand why and how certain patterns developed
within a match [27]. This shift from descriptive to explanatory metrics has helped coaches
and analysts assess performance through process-based indicators, which in turn informs
training regimens, in-match adjustments, and post-match analysis.
The case against relying on superficial measurements has been made clearly by authors
such as Rein and Memmert, who argue that tactical patterns and game intelligence can only
be captured by combining spatio-temporal modeling with a deeper understanding of
performance observation [28]. Similarly, Gudmundsson and Horton emphasized the need to use
tracking data so that researchers can uncover details of team formation, movement
strategies, and spatial control across various phases of play [29]. These developments
have enabled more precise tactical models, allowing for insight into positioning,
transitions, and strategic cohesion.
Even with these promising advances, however, academic and industry interest continues
to be largely centered on elite competitions such as the UEFA Champions League, La
Liga, and the English Premier League. As a result, leagues such as the Premier League
of Bosnia and Herzegovina receive essentially no exposure in large datasets or research
papers. This exclusion significantly limits the generalizability of football
research, as it removes the opportunity to discover performance attributes shaped by
different economic constraints, tactical styles, and player development pathways.
Earlier empirical research has consistently found that statistics such as possession
rate, accurate passes, interceptions, and disciplinary offenses
are strongly associated with team performance. These variables are particularly relevant
in seasonal assessments, where longer-term trends better reflect strategic stability. The
influence of these variables, though, can be moderated by local contextual factors such
as refereeing quality, average match intensity, and climatic conditions. Consequently,
this research emphasizes contextualized examination of under-researched leagues such as
the Premier League of Bosnia and Herzegovina, to determine whether international findings
transfer to the regional level or whether adjustments need to be made per league.
Machine learning (ML) has become one of the most influential tools in football research,
enabling researchers to move beyond descriptive statistics and into more
complex activities such as predictive modeling, automatic classification, and behavioral
pattern detection. ML techniques have transformed the analytical toolkit by uncovering
intricate, non-linear relationships in data that are often too complicated for conventional
statistical approaches to capture accurately.
Over the past decade, the use of ML in football has spanned a broad set of applications.
These range from match outcome prediction, player rating, and injury prediction to tactical
profiling and segmentation of teams by style of play [30]. Increasing access to
structured data such as rich event logs, GPS data, and seasonally aggregated statistics has
allowed these models to move beyond proofs of concept to operational decision-making
systems for clubs and federations.
Among the most widely used families of ML algorithms in football are supervised learning
techniques, which learn mappings from input features to known target outputs.
Here, models learn from labeled historical data to predict outcomes such as goal tallies,
league positions, or disciplinary incidents. Two of the best-performing supervised
algorithms are Random Forest and XGBoost, which have consistently shown strong
performance in football analytics [31]. Random Forest, an ensemble of decision trees,
is well known for its robustness to overfitting, its capacity to capture complex
feature interactions, and its ranking of feature importance. XGBoost, a gradient boosting
algorithm, is widely praised for its computational efficiency, scalability, and
predictive power in high-dimensional data environments.
Other widely used supervised learning algorithms include neural networks, logistic
regression, support vector machines (SVM), and k-nearest neighbors (KNN). Neural
networks and deep learning models, in particular, have gained popularity in spatiotemporal
tasks such as action recognition from video or movement prediction from GPS-based
tracking data [34]. Although these models can be highly accurate, their interpretability is
low, which is a drawback in managerial settings where explainability is required.
Another important aspect of ML use in football is model evaluation, for which a
metric appropriate to the task at hand must be selected. For classification tasks (e.g., relegation
prediction), common evaluation measures are precision, recall, F1-score, accuracy, and
Area Under the Receiver Operating Characteristic Curve (AUC). For regression tasks
(e.g., league position prediction), mean squared error (MSE) and R-squared (R²) are used.
Most studies also use cross-validation and grid search for hyperparameter optimization
to avoid overfitting and ensure generalizability.
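The evaluation measures named above can be illustrated with a minimal sketch in plain Python. The label and prediction lists are hypothetical values chosen only for illustration, not results from this study, and AUC is omitted for brevity.

```python
def classification_metrics(y_true, y_pred):
    """Precision, recall, F1 and accuracy for binary labels (1 = positive)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    accuracy = (tp + tn) / len(pairs)
    return {"precision": precision, "recall": recall,
            "f1": f1, "accuracy": accuracy}


def regression_metrics(y_true, y_pred):
    """Mean squared error and R-squared for numeric targets."""
    n = len(y_true)
    mean_true = sum(y_true) / n
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    return {"mse": ss_res / n, "r2": 1 - ss_res / ss_tot}


# Hypothetical relegation labels (1 = relegated) and model predictions.
clf = classification_metrics([0, 0, 0, 1, 1, 0], [0, 0, 1, 1, 0, 0])

# Hypothetical true and predicted league positions.
reg = regression_metrics([1, 2, 3, 4], [1, 3, 3, 5])

print(clf)
print(reg)
```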
In this study, season-level performance data from the Premier League of Bosnia and
Herzegovina is therefore employed as input for ML modeling [33]. While such data
may not be capable of describing micro-level behavior or in-game dynamics, it remains
useful for uncovering long-term trends and creating predictive models that can guide
strategic as well as tactical decision-making. The strength of ML in this instance lies not
merely in anticipating outcomes but also in making analytic tools accessible
to clubs with limited technological capabilities.
3. RESEARCH METHODOLOGY
The research methodology unfolds through seven distinct steps, each of
which is critical to the overall investigative framework. These are discussed further
below:
1. Data collection
Structured team performance data were gathered programmatically
from the SofaScore platform, which offers rich football data for a number
of competitions. The data, initially collected as JSON, include a variety of
season-level measures such as goals, passes completed, average possession per game,
fouls, yellow and red cards, and clean sheets. The data were then
programmatically mapped into a structured tabular format (CSV) to
simplify statistical analysis and machine learning modeling.
This structured data formed the groundwork of the research, both for descriptive
and predictive modeling. Using a publicly accessible source such as
SofaScore supports transparency and ease of replication of the research.
2. Data cleaning and preprocessing
To enhance the accuracy and validity of the data, thorough
preprocessing was conducted. This included normalizing column
structures, removing duplicate rows, encoding categorical variables
as needed, and handling missing values.
Feature engineering was also a key aspect of this step. New performance measures
such as goal difference, disciplinary index, and penalty success rate were added to
enrich the dataset. Potential outliers were detected and corrected, and
the skewness and symmetry of distributions were tested, to ensure that
the input data were appropriate for the various modeling algorithms.
These processes, elaborated in Section 3.2, added analytical rigor and
improved the explainability of subsequent results.
strategic types (dominant, balanced, or struggling sides), the clustering procedure
reduced dimensional complexity and generated useful segmentation data.
This is a valuable step for profiling teams in the Premier League of Bosnia and
Herzegovina, where tactical style and available resources vary
greatly. Unsupervised learning can be applied here to reveal structural patterns
independent of outcome labels.
7. Player-level predictive modelling
In addition to team-level modeling, the study also involved individual player
modeling. Player data were prepared and preprocessed separately for
attackers, midfielders, and defenders, and predictive models were trained to classify
players on binary outcomes: whether they had a high foul rate, a high assist
rate, or high goal involvement.
Logistic Regression and Random Forest classifiers were employed to make these
predictions, which were assessed by AUC, precision, recall, and F1-score. This
level of granularity allowed individual performance to be linked with more
general team success, adding another layer of information and further support
to the study's two-pronged approach to performance assessment.
The data used in this study came from SofaScore, a widely used sports analytics
website known for providing detailed live data and
long-term statistics for many football leagues. In this case, team-level performance data in
a structured format for the 2023–2024 season of the Premier League of Bosnia and
Herzegovina were employed. SofaScore's public API delivered data in JSON (JavaScript Object
Notation) format, a widely used web data exchange format valued for its flexibility,
light weight, and support for hierarchical data [12].
JSON was chosen as the initial data structure because it effectively supports storage
and retrieval of nested data and neatly organizes statistics such as goals scored, fouls,
clean sheets, and possession. This structure proved helpful in
preserving the relationships between match statistics, player attributes, and
teams.
Upon extraction, the raw JSON files were parsed into CSV (Comma-Separated Values)
and Excel (XLSX) formats using Python to make them compatible with mainstream data
manipulation software such as Microsoft Excel and Python's pandas library. This
made the data tabular and easier to visualize, clean, and analyze, as is
common in applied machine learning workflows.
The conversion from JSON to tabular form also paved the way for subsequent steps:
feature engineering, correlation analysis, clustering, and supervised learning. At this stage,
the dataset included team-level data points along with derived player-level
statistics, which were handled in separate modeling pipelines for the different
playing positions.
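The JSON-to-table conversion described above can be sketched as follows. This is a minimal illustration using only the Python standard library; the JSON structure, team names, and values are hypothetical and do not reproduce the actual SofaScore response. The double-underscore column names mirror the flattened names reported later in this chapter.

```python
import csv
import io
import json


def flatten(record, prefix=""):
    """Flatten nested dictionaries into one level, joining keys with '__'."""
    flat = {}
    for key, value in record.items():
        name = f"{prefix}__{key}"
        if isinstance(value, dict):
            flat.update(flatten(value, name))
        else:
            flat[name] = value
    return flat


# Hypothetical payload; the real API response is assumed to be richer.
raw = """[
  {"team": {"name": "Team A", "slug": "team-a"},
   "statistics": {"goalsScored": 40, "matches": 33}},
  {"team": {"name": "Team B", "slug": "team-b"},
   "statistics": {"goalsScored": 55, "matches": 33}}
]"""

rows = [flatten(item) for item in json.loads(raw)]

# Write the flattened rows as CSV text (an in-memory buffer stands in for a file).
buffer = io.StringIO()
writer = csv.DictWriter(buffer, fieldnames=list(rows[0]))
writer.writeheader()
writer.writerows(rows)
csv_text = buffer.getvalue()
print(csv_text)
```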
Figure 1 Data extracted in JSON format for Premier league of BiH
Figure 3 Data extracted in JSON format for midfield players
Figure 5 Data extracted in JSON format for attack players
To support thorough examination, effective preprocessing, and modeling, the
obtained JSON files were systematically transformed into tabular representations, first
to CSV (Comma-Separated Values) and then to Microsoft Excel spreadsheets. This was
necessary because such representations are readily supported by most common tools for
data analysis, from manual examination in Microsoft Excel to Python's pandas library
for programmatic data manipulation. The tabular structure provided a convenient
format for finding inconsistencies in the data, preprocessing, and
constructing machine learning pipelines.
The initial dataset consisted of semi-structured, nested columns in three main
categories: team identifiers (name, slug, gender), performance statistics (goalsScored,
interceptions, fouls), and contextual metadata (entity type, national association, number
of matches). These were collected in JSON format through the SofaScore API. When
flattened from the hierarchical JSON structure into an Excel-readable format, however, the
dataset contained structural anomalies. In particular, the conversion produced a
sparsely populated table with duplicated alternating rows and missing (null) values,
reducing readability and usability for further processing.
One of the most significant challenges was the irregularity in column naming
conventions and formatting. Variables, for example, carried long prefixes such as
__team__name, __statistics__goalsScored, and __statistics__matches, which made
direct interpretation difficult. Nested attributes were also not automatically expanded and
split into independent columns, so further restructuring and realignment of the data
had to be undertaken. Many entries were duplicated across rows, and merging
and deduplication were required to maintain data integrity.
In addition to the team-level dataset, three player-level datasets were acquired and
manually structured in Excel. The datasets covered the key playing positions: defenders,
midfielders, and attackers. Each category contained domain-specific features
characteristic of the role on the pitch. For instance:
- Defenders were described by variables such as duels won, yellow/red cards,
and interceptions;
- Midfielders were described by variables such as assists, key passes, and pass
accuracy;
- Attackers were described by metrics such as total goals, shots, and overall goal involvement.
These player-level data sets were subsequently used in binary classification problems to
determine if a player belonged to the high-performance group in their position. The
cleaned data allowed for more sophisticated, role-specific predictive modeling at later
analysis stages.
To ensure the integrity, consistency, and analytical readiness of the dataset, a rigorous,
step-by-step data cleaning procedure was undertaken. Because the input data were
semi-structured and derived from JSON, several preprocessing steps were needed to transform
the dataset into a clean tabular form for statistical analysis and machine learning.
Duplicate and null row removal: The flattening process created duplicate rows and null
entries. All duplicate observations and structurally empty rows were identified and
eliminated to reduce noise and maintain data consistency.
Column name normalization: Field names inherited verbose and hierarchical identifiers
such as __team__name or __statistics__goalsScored, based on the nested structure of the
JSON format. These were replaced with more readable names such as team_name or
goals_scored for easier code-based manipulation.
Record merging and consolidation: When data for a given team were spread over
multiple rows, the records were merged into a single row per
team. This ensured one observation per team, which is standard practice for
season-level tabular data.
Data type conversion: All the numeric fields (e.g., goals, red/yellow cards, passes, fouls)
were converted from generic object or string data types to correct numeric types (integer
or float). The reason for the conversion was to enable mathematical operations and
statistical modeling.
Categorical label standardization: Inconsistent labels in categorical variables such as team
names, relegation status (e.g., yes, Yes, 1), and position categories were normalized to a
uniform coding system. For example, the binary outcome variable "relegated" was coded
as 1 for relegated and 0 for non-relegated teams.
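The cleaning steps above can be sketched in plain Python as follows. This is an illustrative sketch under stated assumptions rather than the exact procedure used in the study: the sample rows are hypothetical, and the rule that drops the "statistics" prefix while keeping the "team" prefix is inferred from the example names team_name and goals_scored.

```python
import re


def normalize_column(name):
    """Map '__statistics__goalsScored' to 'goals_scored' and '__team__name' to 'team_name'."""
    parts = [p for p in name.split("__") if p]
    if len(parts) > 1 and parts[0] == "statistics":
        parts = parts[1:]  # assumed rule: drop the generic 'statistics' prefix
    snake = [re.sub(r"(?<=[a-z0-9])(?=[A-Z])", "_", p).lower() for p in parts]
    return "_".join(snake)


def to_binary(label):
    """Standardize labels such as 'yes', 'Yes' or 1 to the integer 1, anything else to 0."""
    return 1 if str(label).strip().lower() in {"yes", "1", "true"} else 0


def clean_rows(rows):
    cleaned, seen = [], set()
    for row in rows:
        row = {normalize_column(k): v for k, v in row.items()}
        if all(v in (None, "") for v in row.values()):
            continue  # structurally empty row
        key = tuple(sorted(row.items()))
        if key in seen:
            continue  # exact duplicate
        seen.add(key)
        row["goals_scored"] = int(row["goals_scored"])   # string to integer
        row["relegated"] = to_binary(row["relegated"])   # label standardization
        cleaned.append(row)
    return cleaned


# Hypothetical flattened rows, including one duplicate and one empty row.
raw_rows = [
    {"__team__name": "Team A", "__statistics__goalsScored": "40", "relegated": "Yes"},
    {"__team__name": "Team A", "__statistics__goalsScored": "40", "relegated": "Yes"},
    {"__team__name": None, "__statistics__goalsScored": None, "relegated": None},
    {"__team__name": "Team B", "__statistics__goalsScored": "55", "relegated": "no"},
]
cleaned = clean_rows(raw_rows)
print(cleaned)
```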
Once the cleaning process was complete, the final structured dataset contained 32
well-defined variables across 12 records, one for each of the 12 teams that competed in the
2023–2024 season of the Premier League of Bosnia and Herzegovina. The cleaned dataset
served as input to all the subsequent steps of descriptive analysis, clustering, and
predictive modeling performed in this study.
To further improve the predictive capability of the dataset and allow for more informative
machine learning applications, some new features were derived from the original raw
statistical variables. This process, known as feature engineering, transforms
existing data into more meaningful representations, in line with football analytics
research.
The newly engineered features were:
Goal difference: The difference between goals scored and goals conceded
(goal_difference = goals_scored - goals_conceded). This is a standard statistic used to
assess a team's attack-defense balance throughout the season.
Penalty efficiency: The number of penalty goals over penalties taken (penalty_efficiency
= penalty_goals / penalties_taken). This measures how efficient a team is at converting
penalty chances.
Binary relegation label: A binary target variable, relegation, was defined and
labeled as 1 for relegated teams and 0 for non-relegated teams. The label
was required by the classification models that attempted to predict relegation chances.
These features were selected because prior empirical findings indicate that they are highly
relevant to football team performance and league outcome prediction [13]. Including them in
the final dataset allowed the analysis to cover a wider range of performance dynamics and
improved the interpretability of the model.
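The engineered features listed above can be expressed as a short sketch. The team record is hypothetical, and returning None when no penalties were taken is an assumption made here to avoid division by zero; the source does not state how that case was handled.

```python
def add_features(team):
    """Return a copy of the team record with the engineered features added."""
    out = dict(team)
    out["goal_difference"] = team["goals_scored"] - team["goals_conceded"]
    taken = team["penalties_taken"]
    # Assumption: undefined efficiency (no penalties taken) is stored as None.
    out["penalty_efficiency"] = team["penalty_goals"] / taken if taken else None
    out["relegation"] = 1 if team["relegated"] else 0
    return out


# Hypothetical season totals for one team.
example = add_features({
    "team_name": "Team A",
    "goals_scored": 50,
    "goals_conceded": 30,
    "penalty_goals": 3,
    "penalties_taken": 4,
    "relegated": False,
})
print(example["goal_difference"], example["penalty_efficiency"], example["relegation"])
```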
Before the dataset was fed into machine learning models, preprocessing steps were
applied to standardize it, improve model performance, and
reduce the impact of noise. Preprocessing steps included:
Normalization: Numerical features such as number of passes, total shots, and possession
rates were normalized to a common numerical range where necessary. This step helps
algorithms, especially distance-based models, handle variables equally regardless of scale.
Missing value handling: Missing values within the dataset were either removed if found
to be non-critical or imputed using suitable techniques (e.g., median or mode imputation)
such that model training was not compromised or skewed in any manner.
Label encoding: The categorical target variable "relegated" was mapped into binary
format as 1 for "yes" and 0 for "no." The same was also applied for player-level targets
like high_fouler, high_assist_rate, and goal_involvement.
Feature selection: Irrelevant or uninformative features like jerseys, jersey numbers, and
URLs were removed to reduce dimensionality and prevent model overfitting.
The same preprocessing was applied for player-level datasets used for position-based
modeling (defenders, midfielders, forwards). The additional steps for those datasets were:
Class balance maintenance: Because of the limited sample size, threshold levels (e.g.,
what defines a "high-assist" player) were determined by median splits to balance records
evenly across classes.
Stratified k-Fold Cross-Validation: For robust and unbiased model estimation, stratified
k-fold cross-validation was applied. This process preserved class ratios across training and
validation sets, which is especially critical in imbalanced or sparse datasets.
The final result of these preprocessing steps was a sanitized, normalized dataset ready for
exploratory data analysis, clustering, and supervised machine learning. This was the
analytic basis for the empirical modeling framework of the study.
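Three of the preprocessing steps described above (normalization, median-split labeling, and stratified k-fold splitting) can be sketched with the standard library alone. The input values are hypothetical; min-max scaling is one possible choice of normalization and is an assumption here, since the exact scaling method is not specified, and the fold assignment is a simplified stand-in for a library routine such as scikit-learn's StratifiedKFold.

```python
import random
from statistics import median


def min_max(values):
    """Scale values linearly to the range [0, 1] (assumed normalization method)."""
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) for v in values]


def median_split(values):
    """Label values above the median as 1 and the rest as 0."""
    cut = median(values)
    return [1 if v > cut else 0 for v in values]


def stratified_folds(labels, k, seed=0):
    """Assign indices to k folds so each fold keeps roughly the same class ratio."""
    rng = random.Random(seed)
    folds = [[] for _ in range(k)]
    for cls in sorted(set(labels)):
        idx = [i for i, y in enumerate(labels) if y == cls]
        rng.shuffle(idx)
        for pos, i in enumerate(idx):
            folds[pos % k].append(i)
    return folds


# Hypothetical per-player assist counts and average possession values.
assists = [0, 1, 1, 2, 3, 4, 5, 7]
labels = median_split(assists)          # "high-assist" target
folds = stratified_folds(labels, k=4)
scaled = min_max([42.0, 50.0, 58.0])

print(labels)
print(folds)
print(scaled)
```

Ties at the median fall into class 0 in this sketch, so heavily tied data would not split evenly.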
4. RESULTS AND ANALYSIS
This chapter presents the results of the machine learning and data analysis experiments
on player and team performance data from the Premier League of Bosnia and Herzegovina.
Statistical and computational procedures are used to identify significant
patterns, examine hypotheses, and analyze the predictive capability of selected features.
The findings presented below form the empirical foundation for answering the
research questions posed in Chapter 1 and frame the broader conclusions and implications
reported in Chapter 5.
To establish a preliminary picture of the dataset and its statistical properties, descriptive
statistical analysis was conducted on team performance statistics in the 2023–2024 season
of the Bosnia and Herzegovina Premier League. This procedure provides an indication of
the nature and distribution of the significant variables to identify trends, outliers, and
anomalies that can influence subsequent analysis.
Descriptive statistics such as mean, median, standard deviation, minimum and maximum
values were computed for all the key numerical variables. These included, among others,
goals scored, goals conceded, average possession, fouls, yellow and red cards,
clean sheets, duels won, interceptions, and penalty goals. The interquartile range (IQR) and
the corresponding percentiles (25th and 75th) were also computed to aid outlier detection and
to describe variability in the league.
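A minimal sketch of these descriptive statistics, using Python's statistics module, is given below. The goal totals are hypothetical, and the 1.5 × IQR fence is a common convention assumed here for illustration; the outlier rule actually applied is not stated.

```python
from statistics import mean, median, quantiles, stdev


def describe(values):
    """Summary statistics, with quartiles computed by linear interpolation."""
    q1, _, q3 = quantiles(values, n=4, method="inclusive")
    return {
        "mean": mean(values),
        "median": median(values),
        "std": stdev(values),        # sample standard deviation
        "min": min(values),
        "max": max(values),
        "q1": q1,
        "q3": q3,
        "iqr": q3 - q1,
    }


def iqr_outliers(values, factor=1.5):
    """Values outside [q1 - factor*IQR, q3 + factor*IQR] (assumed rule)."""
    s = describe(values)
    low = s["q1"] - factor * s["iqr"]
    high = s["q3"] + factor * s["iqr"]
    return [v for v in values if v < low or v > high]


# Hypothetical goals-scored totals for a 12-team league.
goals = [30, 32, 35, 36, 38, 40, 41, 43, 45, 47, 50, 76]
summary = describe(goals)
outliers = iqr_outliers(goals)
print(summary)
print(outliers)
```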
- FK Velež Mostar recorded the highest average possession rate at 56.5%, which
suggests a strongly possession-based playing style.
- FK Željezničar received the most red cards (4), and NK GOŠK Gabela received
the most yellow cards (92), which could suggest disciplinary problems.
- HŠK Zrinjski Mostar was the most attack-minded team with 76 goals, well above
the league average.
Further observations:
Disciplinary issues were most pronounced in certain clubs. Yellow cards ranged from 45
to 92 and red cards from 0 to 4, reflecting varying degrees of tactical fouling or
indiscipline.
Clean sheets, which indicate defensive solidity, ranged from 4 to 15. Notably, FK Borac
Banja Luka, FK Velež Mostar, and HŠK Zrinjski Mostar all had 15 clean sheets, indicative
of well-organized defenses. Penalty goals ranged from 1 to 11, with HŠK
Zrinjski Mostar leading again, further indicating the importance of set-piece
conversion.
These summary results not only give initial insight into team tendencies and behaviors but
also provide a diagnostic basis for the choice of features in the subsequent correlation
analysis and machine learning modeling. For example, teams with higher ball
possession, such as FK Velež Mostar, tend to play more offensively, whereas poor
disciplinary records in teams such as NK GOŠK Gabela may hinder
sustained team performance.
Correlation analysis was used to identify significant relationships among the core team
performance metrics and between these metrics and markers of success such as league standing,
goals scored, and clean sheets. By measuring linear relationships between variables, the
analysis provides a statistical foundation for selecting predictive features when developing
machine learning models. The Pearson correlation coefficient (r) was used to test for linear
relationships among continuous variables. This coefficient ranges from -1 (perfect negative
linear relationship) through 0 (no linear relationship) to +1 (perfect positive linear
relationship).
For our study, correlations with an absolute value of over 0.70 were considered strong and
of practical relevance.
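As a sketch of how the coefficient and the 0.70 threshold can be computed, the following plain-Python example uses hypothetical possession and league-position values; only the threshold and the two coefficients passed to is_strong at the end are taken from this chapter.

```python
from math import sqrt


def pearson_r(x, y):
    """Pearson correlation coefficient between two equally long sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    spread_x = sqrt(sum((a - mean_x) ** 2 for a in x))
    spread_y = sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (spread_x * spread_y)


def is_strong(r, threshold=0.70):
    """Apply the study's rule: |r| above 0.70 counts as a strong correlation."""
    return abs(r) > threshold


# Hypothetical average possession (%) and final league position for five teams.
possession = [57, 52, 51, 46, 44]
position = [1, 2, 3, 4, 5]
r = pearson_r(possession, position)
print(round(r, 3), is_strong(r))

# Coefficients reported later in this chapter for yellow cards.
print(is_strong(0.682), is_strong(0.702))
```

Because position 1 is the best finish, a negative coefficient in this example means that more possession goes with a better league position.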
• Important negative correlations:
Clean sheets and goals conceded: teams with more clean sheets tend to concede fewer goals, an
expected trend that is here statistically confirmed.
Interceptions and shots: teams with more defensive interceptions tend to take fewer shots,
potentially because they play more reactively.
Ball possession and passing accuracy are not only highly correlated with each other but are
also linked with greater shooting activity and attacking success. Teams strong in these areas
tend to score more goals and defend better.
Set pieces appear to matter: the close relation between penalties taken and penalties scored
suggests that disciplined, direct play in the final third of the pitch generates set-play
opportunities that significantly influence match outcomes. Defensive structure, quantified
by clean sheets and interceptions, correlates inversely with goals conceded, affirming their
role as key indicators of team solidity and match control.
Taken together, these findings confirm the strategic relevance of these measures. Strong
positive and negative correlations attest to the complexity of team performance, in which
both creativity in attack and solidity in defense are required for success. These
findings strongly support the inclusion of these variables in the predictive
modeling exercises presented in later chapters.
4.3. Distribution analysis
To prepare data for predictive modeling, the distributional properties of performance metrics
need to be understood. Distribution analysis reveals issues such as skewness, kurtosis,
outliers, and non-normality, which can adversely affect model assumptions and
performance if not addressed.
This study treats absolute skewness values greater than 1 (|skewness| > 1) as signs
of strong skewness that may call for transformation before machine learning
modeling. The computed skewness values of the performance measures are presented
in the following table:
Metric                     Skewness
Interceptions              -0.207
Red Cards                   0.570
Yellow Cards               -0.200
Clean Sheets               -0.231
Penalty Goals               2.605
Penalties Taken             2.605
Fouls                      -0.043
Corners                     0.033
Successful Dribbles         0.404
Shots                       0.771
Accurate Crosses            0.425
Accurate Long Balls         0.196
Accurate Passes            -0.051
Average Ball Possession     0.076
Goals Conceded              0.476
Goals Scored                1.260
Number of Matches           0.000
Awarded Matches             0.000
Table 2 Skewness values of key performance metrics
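A sketch of how such skewness values can be computed and screened against the |skewness| > 1 rule is shown below. The estimator used here is the adjusted Fisher-Pearson sample skewness, which is the default in pandas; whether the study used this exact estimator is an assumption. The three screened values are taken from Table 2, while the sample lists are hypothetical.

```python
from math import sqrt


def sample_skewness(values):
    """Adjusted Fisher-Pearson sample skewness (requires at least 3 values)."""
    n = len(values)
    m = sum(values) / n
    m2 = sum((v - m) ** 2 for v in values) / n
    m3 = sum((v - m) ** 3 for v in values) / n
    if m2 == 0:
        return 0.0  # constant column, e.g. number of matches
    g1 = m3 / m2 ** 1.5
    return sqrt(n * (n - 1)) / (n - 2) * g1


def strongly_skewed(skew_by_metric, threshold=1.0):
    """Names of metrics whose absolute skewness exceeds the threshold."""
    return [name for name, s in skew_by_metric.items() if abs(s) > threshold]


# Hypothetical samples: symmetric, long right tail, and constant.
print(sample_skewness([1, 2, 3, 4, 5]))
print(sample_skewness([1, 1, 1, 1, 10]))
print(sample_skewness([33, 33, 33, 33]))

# Three values from Table 2 screened against the |skewness| > 1 rule.
flagged = strongly_skewed({"Penalty Goals": 2.605, "Goals Scored": 1.260, "Shots": 0.771})
print(flagged)
```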
Two groups of variables exceed this threshold:
Penalty goals and penalties taken (skewness = 2.605): these distributions are heavily
influenced by a few clubs, most notably HŠK Zrinjski Mostar, which took a
disproportionate number of penalties and converted most of them.
Goals scored (skewness = 1.260): this indicates a couple of high-scoring clubs well above
the league average, giving a distribution with a long right tail.
These skewed variables signal performance imbalances in the league, with a few dominant
teams enjoying an advantage in goal scoring and set-piece play.
This section investigates the impact of disciplinary actions (yellow and red cards) on overall
team performance during the 2023/24 season of the Premier League of Bosnia and
Herzegovina. The study examines the relationship between disciplinary behavior and core
measures of performance such as goals scored, goals conceded, and final league position.
The correlation between yellow cards and goals conceded was r = 0.682, a moderately strong
positive correlation. Teams that received more yellow cards also conceded more goals,
possibly due to reduced defensive control or more cautious play once booked.
The correlation between yellow cards and goals scored was r = -0.605, a moderate
negative one. This suggests that teams with more yellow cards scored fewer goals, possibly
because play was interrupted or key players were suspended.
Yellow cards and league standing had a correlation of r = 0.702. Because a larger league
standing number denotes a lower finish, this strong positive correlation means
that poorer disciplinary behavior (more yellow cards) is associated with
worse overall league performance.
• Red cards correlated with goals conceded at r = 0.116, a weak positive correlation.
• The correlation with goals scored was virtually zero (r = -0.007), suggesting no
meaningful relationship.
• The correlation with league position was r = 0.156, again weak and not statistically
significant.
NK GOŠK Gabela, having received the most yellow cards, also finished lower in the table
and conceded many goals. This further corroborates the statistical finding
that a high number of yellow cards can disrupt a team's rhythm and defensive organization.
FK Željezničar, who were shown the most red cards, did not perform comparably
poorly, which is consistent with the overall weak correlation between red card
statistics and team performance.
4.4.3 Conclusion
The results show that yellow cards are more strongly associated with team
performance than red cards. Higher yellow card counts correlate with more goals conceded, fewer
goals scored, and worse league positions. This highlights the importance of maintaining
discipline over the course of a season, not just to avoid temporary setbacks but also to
support overall team consistency and competitiveness. In comparison, red cards,
while consequential within a single match, appear too uncommon to be statistically important
at the season level.
4.5. Ball possession analysis
This section explores the relationship between team ball possession and team performance
throughout the Premier League of Bosnia and Herzegovina during the 2023/24 season.
Ball possession is a tactical aspect of contemporary football associated with control,
territorial acquisition, and attacking quality [14]. The research examines the way
possession correlates to three essential measures of performance: goals scored, goals
conceded, and final league position.
The correlation between average possession and goals scored was r = 0.673, a moderately strong
positive relationship. Teams with larger possession percentages scored more goals, suggesting
that the attack benefited from retaining possession.
The correlation between average possession and goals conceded was r = -0.716, a strong
negative relationship. The more possession teams had,
the fewer goals they conceded, quite likely because they could limit the opposing team's
possession and reduce their own exposure.
Average possession and league position correlated at r = -0.770. As higher values of league
position correspond to lower table positions, this strong negative correlation means that
the teams with more possession also had better finishes in the league table. These
correlations reflect the significance of possession as a primary contributor to both
attacking and defending effectiveness.
FK Velež Mostar had the highest average possession, at 56.5%. Dominant
possession of this kind may have contributed to their success in creating attacking
opportunities and maintaining defensive structure, resulting in sound overall team performance.
The lowest average possession was 43.2%, by NK GOŠK Gabela. The lower possession
reflects weaker match control, which was associated with both fewer goals scored and more
goals conceded, contributing to poorer league performance.
These figures visually reinforce the strength and direction of the correlations, showing
clear linear trends consistent with the numerical analysis.
4.5.4 Conclusion
The result confirms that greater average ball possession is highly correlated with better
overall team performance. Teams that dominate possession have better odds of creating
scoring opportunities, stopping the opposing team from scoring, and achieving higher
league positions. The results concur with contemporary football strategies, which
emphasize the importance of dominance of the ball as the foundation for success. As such,
average possession is a crucial factor to be included in predictive models and performance
measurement systems.
This section investigates the impact of penalty statistics, i.e., penalty goals and penalty
attempts, on overall team performance during the 2023/24 season of the Premier League
of Bosnia and Herzegovina [15]. The aim is to establish whether the ability to score and
convert penalties influences key performance measures, including goals for, goals against,
and final-season league position.
League Position -0.413 -0.517 -0.733 0.946 1.000
Table 5 Correlation matrix – Penalty metrics and team performance
The correlation between goals scored and penalty goals was r = 0.688, a moderately strong
positive correlation. This suggests that penalties account for a sizeable share of overall
goal output for certain teams.
The correlation between goals conceded and penalty goals was r = -0.347, a weak-to-moderate
negative correlation. Teams scoring more penalty goals tend to concede fewer goals,
possibly due to superior tactical control and match dominance.
The correlation between penalty goals and league position was r = -0.413. Since a higher
league position number means a lower finish, this negative correlation indicates
that teams that score more penalties tend to finish higher in the league table.
Figure 12 Relationship between penalty goals and league position
The correlation between penalties taken and goals scored was r = 0.707, a
strong positive correlation. This again points to the importance of
penalties as a source of goals.
The correlation between penalties taken and goals conceded was r = -0.433, which shows that
teams that win more penalties also concede fewer goals. This might be a consequence of their
capacity to maintain control in both attack and defense.
The correlation between penalties taken and league position was r = -0.517, a moderate
inverse relationship. Clubs that take more penalties tend to
achieve better final rankings.
HŠK Zrinjski Mostar is a prime example of the influence of penalty effectiveness. The
team took the most penalties in the league (12) and also scored the
most penalty goals (11). This high conversion rate contributed
significantly to the team's attacking output and overall league success. The ability
to both win and convert penalties may therefore be described as an important
competitive advantage.
The visual evidence confirms that teams that score more penalties tend to finish higher
in the league, again emphasizing the tactical importance of penalties to success in
competitive football.
4.6.3 Conclusion
The evidence indicates that penalties play an important part in team success. The number
of penalties taken and the number converted are both positively associated with goals
scored and negatively associated with goals conceded and with the numerical league
position (that is, they go with better final standings). These results indicate that earning
and converting penalty chances can be a key factor in determining league performance.
Penalty measures therefore merit consideration as input variables in predictive modeling
or performance analysis of team results.
4.7. Offensive and defensive analysis
This section explores the relationship between offensive and defensive performance and
eventual team success in the 2023/24 Premier League of Bosnia and Herzegovina [16].
The focus is on three major indicators: goals scored (offensive capacity), goals conceded
(defensive solidity), and goal difference (net performance), with particular interest in how
their combined effect influences final league position.
The Pearson correlation coefficient was also used to measure the association between goal
difference and league position.
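For reference, the standard definition of the Pearson coefficient is given below. The
notation is introduced here for illustration and does not come from the thesis: x_i is
taken to be the goal difference of team i, y_i its final league position, and n the number
of teams.

```latex
r_{xy} = \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}
              {\sqrt{\sum_{i=1}^{n} (x_i - \bar{x})^2}\,
               \sqrt{\sum_{i=1}^{n} (y_i - \bar{y})^2}}
```

The coefficient lies between -1 and 1. Since a smaller position number denotes a better
finish, a value close to -1 means that larger goal differences go with better standings.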
• The top goal-scoring team was HŠK Zrinjski Mostar, with a total of 76 goals,
reflecting the most potent attacking output in the league.
• FK Borac Banja Luka had the most impressive defense, having let in just 26 goals
during the season.
• The top goal difference was once again that of HŠK Zrinjski Mostar, standing at
+49, affirming their prowess in attack and defense.
• The side with the lowest goal difference was FK Zvijezda 09, at -39, which was in
line with their position near the bottom of the league table.
The correlation matrix below demonstrates the relationship between goal difference and
league position:
4.7.2 Interpretation
The correlation coefficient between goal difference and final league position was
r = -0.976, indicating a very strong negative correlation. That is, the larger the goal
difference (i.e., the more goals a team scores relative to those it concedes), the better the
final league position (i.e., the lower the numerical value of the league position).
This correlation highlights the importance of balancing attacking and defensive
performance. Teams with a positive goal difference are likely to be found in the top half
of the table, while teams with a negative goal difference are likely to be towards the
bottom or in the relegation places.
This figure illustrates the general trend that teams that score most tend to occupy higher
league positions, while teams who concede more goals usually finish towards the bottom
of the table.
The figure visually represents how teams with big goal differences naturally finished
higher up in the table. Teams with negative goal differences, however, struggled and
tended to finish near relegation.
Figure 16 Relationship between goal difference and league position
This visualization confirms the statistical analysis by clearly depicting the inverse
relationship between goal difference and final league ranking.
4.7.3 Conclusion
The findings from this analysis indicate that both attacking and defensive performance
contribute significantly to overall team performance. However, goal difference, which
combines the two, is a robust single measure of performance over the season. Teams that
excel in both scoring and defending achieve better league positions, while those unable to
sustain both are at a competitive disadvantage. Goal difference therefore emerges as an
excellent predictor for league performance modeling.
4.8. Relegation analysis
This section contrasts the performance indicators of relegated and non-relegated teams
during the 2023/24 season of the Premier League of Bosnia and Herzegovina. The aim is to
determine which key performance indicators are most directly linked with relegation risk,
providing information about which characteristics distinguish lower-ranked teams from
those that are able to retain their league status [17].
Average values of these measures were calculated separately for relegated and
non-relegated clubs. The results are given in the table below:
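The group averages described above can be computed as in the following minimal sketch.
It assumes one record per team with a boolean relegation flag. The four teams, the field
names and the values are hypothetical and serve only to show the calculation.

```python
from statistics import mean

# Hypothetical team records (NOT the thesis data).
teams = [
    {"team": "A", "relegated": False, "goals_scored": 60, "goals_conceded": 30, "clean_sheets": 14},
    {"team": "B", "relegated": False, "goals_scored": 40, "goals_conceded": 40, "clean_sheets": 10},
    {"team": "C", "relegated": True,  "goals_scored": 38, "goals_conceded": 72, "clean_sheets": 5},
    {"team": "D", "relegated": True,  "goals_scored": 40, "goals_conceded": 68, "clean_sheets": 3},
]


def group_means(rows, metrics):
    """Average each metric separately for relegated (True) and other (False) teams."""
    result = {}
    for flag in (True, False):
        group = [row for row in rows if row["relegated"] is flag]
        result[flag] = {metric: mean(row[metric] for row in group) for metric in metrics}
    return result


summary = group_means(teams, ["goals_scored", "goals_conceded", "clean_sheets"])
for flag, label in ((True, "relegated"), (False, "non-relegated")):
    print(label, summary[flag])
```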
4.8.1 Key findings and interpretation
• Goals scored: Relegated teams scored considerably fewer goals on average (39.0)
than non-relegated clubs (47.3), indicating that a lack of attacking efficiency
is a primary factor in poor performance.
• Goals conceded: Relegated teams conceded far more goals on average (70.5) than
their non-relegated counterparts (39.7), demonstrating the role of defensive
vulnerabilities in relegation.
Figure 18 Distribution of goals conceded – relegated vs. non-relegated teams
Figure 19 Distribution of yellow cards – relegated vs. non-relegated teams
• Clean sheets: Relegated teams averaged only 4.0 clean sheets, compared with 11.4
for non-relegated teams. This large difference indicates that consistent defensive
solidity is one of the key determinants of survival.
• Ball possession: Relegated teams also had lower average ball possession (47.1%)
than non-relegated teams (50.44%). This reflects weaker match control and
may indicate an inability to maintain territorial or tactical possession.
4.8.2 Conclusion
These findings highlight that relegation is most often the result of cumulative offensive
and defensive frailties, typically compounded by disciplinary problems. The metrics
discussed here can serve as early-warning indicators for teams at risk of relegation and
are especially applicable in predictive modeling and planning contexts in subsequent
work.
This section reports on a clustering analysis conducted on team performance data for the
2023/24 season of the Bosnia and Herzegovina Premier League. The aim was to cluster
teams into performance-based groups so that teams with similar playing styles and
statistical profiles could be identified. This unsupervised learning method sheds light on
structural patterns between teams and highlights the performance indicators that
differentiate successful and underperforming teams [18].
4.9.1 Methodology
Clustering was carried out using the K-means algorithm, a widely used type of
unsupervised learning that separates observations into k groups by feature similarity. The
following statistics were utilized in the analysis:
• Goals scored
• Goals conceded
• Yellow cards
• Red cards
• Clean sheets
• Accurate passes
• Interceptions
• Fouls
• Corners
Prior to clustering, the dataset was normalized so that all variables contributed equally
to the clustering process. The optimal number of clusters was determined using the
Elbow method, which identifies the point beyond which adding more clusters yields
progressively smaller reductions in within-cluster variance.
Figure 22 Elbow method for determining optimal number of clusters
Based on this approach, the optimal number of clusters was identified as three,
representing distinct team performance profiles.
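The Elbow procedure can be sketched as follows. This is an illustrative example under
stated assumptions rather than the thesis code: it uses scikit-learn, two of the listed
features only, and twelve synthetic teams drawn around three hypothetical profiles.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)

# Synthetic stand-in data: columns are goals scored and goals conceded.
# Three hypothetical profiles (struggling, balanced, dominant), four teams each.
centers = np.array([[30.0, 65.0], [45.0, 42.0], [70.0, 28.0]])
X = np.vstack([center + rng.normal(scale=2.0, size=(4, 2)) for center in centers])

# Standardize so that both features contribute equally to the distances.
X_scaled = StandardScaler().fit_transform(X)

# Within-cluster sum of squares (inertia) for k = 1..6.
inertias = {}
for k in range(1, 7):
    model = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X_scaled)
    inertias[k] = model.inertia_

for k, value in inertias.items():
    print(k, round(value, 3))

# Final assignment with the number of clusters chosen at the elbow.
labels = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(X_scaled)
```

The elbow is the value of k after which the printed inertia stops dropping sharply. With
real season data the bend is usually less clear than in this synthetic case, so the choice
of k remains partly a judgment call.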
The mean values of key performance metrics for each cluster are summarized below:
4.9.3 Cluster interpretation
This cluster is characterised by low scoring output and poor defensive metrics. These
teams scored fewer goals, conceded more, and committed the most disciplinary infractions.
Low possession and few accurate passes reveal an inability to dominate matches, whereas
their fouls and interceptions point to defensive or reactive tactics. These tend to be
lower-table or relegated teams.
Teams in this cluster performed moderately on most measures. They had decent defensive
records (low goals against, numerous clean sheets) and a reasonable level of offensive
output. Ball possession and passing accuracy were higher than in Cluster 0 but lower than
for the dominant teams. Such teams are usually mid-table teams with balanced playing
styles.
• Disciplinary metrics: Acceptable
This is a cluster of high-performing clubs. These clubs performed well in both attack and
defence, scoring the most goals and conceding the fewest. They also had high passing
accuracy and ball possession, showing tactical dominance. They committed few fouls and
received few yellow cards, showing discipline and control.
Figure 23 Clusters based on goals scored and goals conceded
4.9.4 Conclusion
Clustering analysis was successful in segmenting the teams into three performance
categories: struggling, balanced, and dominant. This segmentation provides significant
findings relating to league competitive structure and profiles of successful teams. Previous
analyses are reaffirmed: dominant teams balance offensive firepower with good defensive
organization, and struggling teams underperform along many dimensions, such as
possession, discipline, and goal efficiency. Clustering also offers a practical framework
for team benchmarking, allowing clubs to compare their performance with peers and
develop ways of addressing areas of poor performance.
This section presents a feature importance analysis conducted to identify the most
influential performance metrics for predicting team league position in the 2023/24 season
of the Premier League of Bosnia and Herzegovina. Understanding which factors most
strongly influence success can inform strategic decisions and enhance the performance
evaluation framework for clubs [19].
• Random Forest Regressor
• XGBoost Regressor
Both models are ensemble-based algorithms capable of ranking input features by their
relative importance in predicting a target variable. In this case, the target variable was final
league position, and the input features included a wide range of team performance
indicators, such as:
• Goals scored
• Goals conceded
• Clean sheets
• Interceptions
• Ball possession
• Accurate passes
• Shots
The models were trained and evaluated, and feature importance scores were extracted to
assess which variables had the greatest influence on league outcomes.
The top-ranked features from the Random Forest model are presented below:
Feature                    Importance Score
Goals Conceded             0.194
Clean Sheets               0.175
Interceptions              0.171
Average Ball Possession    0.085
Shots                      0.056
Accurate Passes            0.055
Corners                    0.048
Goals Scored               0.047
Successful Dribbles        0.041
Accurate Crosses           0.039
Fouls                      0.028
Yellow Cards               0.026
Penalties Taken            0.016
Accurate Long Balls        0.010
Red Cards                  0.005
Penalty Goals              0.004
Table 9 Random Forest feature importance scores
The results indicate that Goals Conceded is the most influential predictor of league
position, followed by Clean Sheets and Interceptions. These findings reinforce the
importance of defensive performance in determining overall team success. Features such
as average ball possession, shots, and accurate passes also contributed moderately,
suggesting that both control and offensive intent play supporting roles. In contrast,
disciplinary metrics (yellow/red cards) and penalty-related features had relatively low
importance in this model.
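A ranking of this kind can be obtained from a fitted Random Forest as sketched below. The
example assumes scikit-learn and uses synthetic data in which league position is driven
mostly by goals conceded. The feature names, the data-generating rule and the sample size
are assumptions made for illustration and do not reproduce the thesis setup.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(42)
feature_names = ["goals_conceded", "clean_sheets", "red_cards"]

# Synthetic data (NOT the thesis data); a real season has far fewer rows.
n_teams = 60
goals_conceded = rng.uniform(25, 75, n_teams)
clean_sheets = 20 - 0.2 * goals_conceded + rng.normal(0, 1.5, n_teams)
red_cards = rng.integers(0, 8, n_teams).astype(float)  # unrelated to the target here
X = np.column_stack([goals_conceded, clean_sheets, red_cards])

# Assumed rule: position worsens (number grows) as goals conceded increase.
position = 1 + 11 * (goals_conceded - 25) / 50 + rng.normal(0, 0.5, n_teams)

model = RandomForestRegressor(n_estimators=200, random_state=0).fit(X, position)

# Impurity-based importances are non-negative and sum to 1 across features.
ranking = sorted(
    zip(feature_names, model.feature_importances_),
    key=lambda pair: pair[1],
    reverse=True,
)
for name, score in ranking:
    print(f"{name:16s} {score:.3f}")
```

When two inputs are strongly correlated, as goals conceded and clean sheets are in this
sketch, the forest tends to credit whichever it splits on first, so a low score for one of
them does not by itself show that it is uninformative.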
4.10.2 Results – XGBoost Feature Importances
The XGBoost model yielded the following top feature importance scores:
The XGBoost model further confirms the dominant influence of Goals Conceded, with an
exceptionally high importance score of 0.850. This again underscores the value of
defensive solidity. Interestingly, yellow and red cards had slightly more impact in this
model than in Random Forest, suggesting that disciplinary behavior may contribute to
performance variance. However, features such as penalties, shots, and long passes held
zero importance, indicating negligible contribution to the model's predictive accuracy.
4.10.3 Conclusion
Features related to discipline and set pieces showed limited influence, which suggests that
while important in isolated contexts, they are not strong general predictors of season-long
team performance. These insights are valuable for strategic planning, talent acquisition,
and performance improvement initiatives at the team management level.
This section aims to forecast the final league position of teams competing in the
Premier League of Bosnia and Herzegovina (PLBiH) in the 2023–2024 season. The key
aim was to measure how well performance-based metrics forecast a team's final
position using machine learning regression techniques. League ranking forecasts are
valuable to coaching staff, club management, and analysts, offering evidence-based
insights that teams can use to assess their strategy.
The target variable was the final league position at the end of the season, treated as a
continuous outcome, and regression models were employed to predict position from
performance measures.
4.11.1 Methodology
• Data Preprocessing
A series of preprocessing steps were executed to ensure the quality and consistency of the
input data:
Missing values handling: Missing values were identified and imputed with the column
median, which is less sensitive to outliers than the mean.
Feature selection: Based on prior domain expertise and preliminary checks over feature
importance, the following features were selected: goalsScored, goalsConceded,
interceptions, fouls, yellowCards, redCards, cleanSheets, averageBallPossession,
accuratePasses, shots, corners.
• Feature Engineering:
disciplineIndex: Aggregated score weighted towards yellow and red cards to measure
team discipline.
Feature scaling: All quantitative inputs were standardized using StandardScaler to put
features on the same scale. This matters for scale-sensitive models and prevents features
with large numeric ranges from dominating.
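The preprocessing steps above can be sketched as follows, assuming scikit-learn. The
thesis does not state the weights behind disciplineIndex, so the values 1.0 and 3.0 used
here are purely hypothetical, as are the three columns and four rows of data.

```python
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler

# Hypothetical rows: yellowCards, redCards, goalsConceded (NOT the thesis data).
raw = np.array([
    [60.0, 3.0, 30.0],
    [75.0, np.nan, 45.0],
    [90.0, 6.0, np.nan],
    [82.0, 4.0, 70.0],
])

# Step 1: replace each missing value with the median of its column.
imputed = SimpleImputer(strategy="median").fit_transform(raw)

# Step 2: engineered feature. The weights are assumptions, not the thesis values.
YELLOW_WEIGHT, RED_WEIGHT = 1.0, 3.0
discipline_index = YELLOW_WEIGHT * imputed[:, 0] + RED_WEIGHT * imputed[:, 1]

# Step 3: standardize every column to zero mean and unit variance.
features = np.column_stack([imputed, discipline_index])
scaled = StandardScaler().fit_transform(features)

print(imputed)
print(discipline_index)
print(scaled.round(3))
```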
• Model Development
Random Forest: An ensemble model to address complex, non-linear relationships.
• Evaluation Metrics
R-squared (R²): Proportion of the variance in the dependent variable explained by the
model.
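The results in this section report MAE and MSE alongside R². As a minimal sketch, the
three metrics can be computed from their standard definitions as shown below. The actual
and predicted positions are made-up values used only to make the arithmetic easy to check.

```python
def regression_metrics(y_true, y_pred):
    """Return (MAE, MSE, R^2) computed from their standard definitions."""
    n = len(y_true)
    errors = [t - p for t, p in zip(y_true, y_pred)]
    mae = sum(abs(e) for e in errors) / n          # mean absolute error
    ss_res = sum(e * e for e in errors)            # residual sum of squares
    mse = ss_res / n                               # mean squared error
    mean_true = sum(y_true) / n
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    r2 = 1 - ss_res / ss_tot                       # share of variance explained
    return mae, mse, r2


# Hypothetical actual and predicted league positions.
actual = [1, 2, 3, 4]
predicted = [1.5, 2.0, 2.5, 5.0]

mae, mse, r2 = regression_metrics(actual, predicted)
print(mae, mse, r2)
```

Here the absolute errors are 0.5, 0, 0.5 and 1, so MAE is 0.5, MSE is 0.375, and R² is
1 - 1.5/5 = 0.7.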
• Hyperparameter Tuning
To improve the performance of the Random Forest Regressor, a grid search was conducted
across the most important hyperparameters:
n_estimators = 200
max_depth = None
min_samples_split = 2
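A grid search over these three hyperparameters might look like the sketch below, which
assumes scikit-learn. The candidate values, the number of folds, the scoring metric and
the synthetic data are all assumptions, since the thesis lists only the three values above.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import GridSearchCV

rng = np.random.default_rng(1)

# Synthetic regression data (NOT the thesis data).
X = rng.normal(size=(40, 4))
y = 3 * X[:, 0] - 2 * X[:, 1] + rng.normal(scale=0.3, size=40)

# Hypothetical candidate grid for the three hyperparameters named in the text.
param_grid = {
    "n_estimators": [100, 200],
    "max_depth": [None, 5],
    "min_samples_split": [2, 4],
}

search = GridSearchCV(
    RandomForestRegressor(random_state=0),
    param_grid,
    cv=3,                                # assumed number of folds
    scoring="neg_mean_absolute_error",   # assumed metric; higher is better
)
search.fit(X, y)

best_params = search.best_params_
print(best_params, round(search.best_score_, 3))
```

With only a dozen or so teams in a season, the cross-validated scores that drive the
selection are noisy, so differences between neighbouring grid points should be read with
care.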
4.11.2 Results
Although Linear Regression was a good baseline, the tuned Random Forest posted the best
overall performance, with R² increasing to 0.539 and the lowest MAE (1.150) and
MSE (1.333) of any model tried. This suggests that Random Forest best captured the
underlying patterns in the data in spite of the very small dataset size.
Feature importances can be extracted from the tuned Random Forest model:
Feature Importance
Interceptions 0.0945
Corners 0.0811
Shots 0.0782
Fouls 0.0190
Interestingly, clean sheets and goal difference emerged as the two most important
predictors of final league position, underlining the importance of defensive solidity.
Interceptions and goals conceded also ranked highly, corroborating the finding
that defensive solidity tends to be accompanied by better league performance.
To determine the model's generalization capability, predictions were made for a sample of
teams that were not employed during training:
FK Željezničar 6 7.25
The decimal values for predicted positions (such as 6.19, 7.25, 3.01) arise because
regression models forecast continuous numeric values rather than discrete ranks. They are
point estimates rather than formal probabilities, but their distance from the nearest
whole number can be read as an informal indication of how firmly a team sits at a position:
HŠK Posušje (6.19): Suggests a position close to 6th, with a small lean towards 7th.
HŠK Zrinjski Mostar (3.01): Indicates a prediction almost exactly at 3rd position, with
very little lean towards 4th.
These readings can help analysts gauge the uncertainty around team positions and guide
strategic adjustments. For instance, a team with a projected position of 6.19 might, with a
slight improvement in crucial statistics (e.g., clean sheets or goal difference), find itself
securely in 6th or even 5th place.
Defensive power as the key indicator: Clean sheets, goal difference, and goals conceded
dominate the feature rankings, reinforcing defense's central role in achieving league
success.
Secondary offense role: Offensive metrics such as goals scored, corners, and shots are
relevant but ranked below the defensive metrics.
Discipline has limited predictive ability: While important within match dynamics,
disciplinary measures (yellow/red cards, disciplinary index) were poor predictors of
final league position in the models.
Strategic clarity for clubs: The decimal precision of the predictions allows clubs to gauge
their positioning uncertainty and target marginal gains that can translate into actual
league position improvements.
4.11.6 Conclusion
This performance-based regression analysis demonstrated that team performance metrics
can be used to predict final league position. The top-performing model, the tuned Random
Forest Regressor (R² = 0.539), yielded interpretable and reasonably accurate league
position predictions. Some of the key findings are as follows:
Defensive solidity, reflected in clean sheets and goal difference, is the most
critical factor in determining league success. Attacking metrics contribute to a moderate
extent, and disciplinary metrics have comparatively poor predictive ability.
Decimal predictions provide fine-grained estimates that can be used to inform planning
and resource allocation. The findings support the application of machine learning in
football analytics at the national level and illustrate the promise of using structured data
from less prominent leagues like PLBiH to inform team management, performance
evaluation, and decision-making. Future extensions should include temporal dynamics at
the match level and multi-season analysis in order to improve the model's stability and
temporal generalizability.
4.11.1 Objective
The aim of this analysis was to determine whether individual defensive statistics are
suited to predicting a player's likelihood of committing a high number of fouls. To this
end, a binary classification model was developed to distinguish between high-fouling and
low-fouling players for the 2023/24 season of the Bosnia and Herzegovina Premier League.
The response variable, fouls_per_match, was converted to binary form using the sample
median of fouls per match as the threshold. Players whose fouls per match were above the
median were labelled as high foulers (1), and those at or below the median were labelled
as low foulers (0) [20].
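The median split can be expressed in a few lines. The sketch below uses five hypothetical
players and made-up fouls-per-match values. Only the rule itself (above the median gives
1, at or below gives 0) comes from the text.

```python
from statistics import median

# Hypothetical fouls per match for five players (NOT the thesis data).
fouls_per_match = {
    "player_a": 0.8,
    "player_b": 1.4,
    "player_c": 1.4,
    "player_d": 2.1,
    "player_e": 0.5,
}

threshold = median(fouls_per_match.values())

# Strictly above the median -> high fouler (1); at or below -> low fouler (0).
high_fouler = {
    player: int(value > threshold) for player, value in fouls_per_match.items()
}

print(threshold)
print(high_fouler)
```

In this example two players sit exactly on the median and are therefore labelled 0, which
leaves only one player in the positive class. Ties at the threshold are one way in which a
median split can produce unequal class sizes.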
The hypothesis tested was whether readily available player-level data such as
appearances, won duels, yellow cards, and red cards possess sufficient predictive power
to discriminate between players based on their fouling behavior. Despite the
simplicity of the features used, the aim was to test whether such discrimination is
attainable in a low-dimensional feature space.
Stratified 5-fold cross-validation was used to train and evaluate all models, preserving
the class proportions in each fold and reducing the risk of overly optimistic performance
estimates. The following metrics were used to measure model performance:
ROC AUC: Area under the Receiver Operating Characteristic curve, representing the
model's ability to distinguish between both classes
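The evaluation loop can be sketched as follows, assuming scikit-learn. The data is
synthetic (generated with make_classification) and the model settings are assumptions. The
sketch only illustrates how stratified folds and a per-fold ROC AUC are obtained.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import StratifiedKFold

# Synthetic binary classification data standing in for the player table.
X, y = make_classification(
    n_samples=100,
    n_features=4,
    n_informative=3,
    n_redundant=1,
    n_clusters_per_class=1,
    random_state=0,
)

cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
fold_aucs = []
fold_positive_share = []

for train_idx, test_idx in cv.split(X, y):
    model = LogisticRegression(max_iter=1000).fit(X[train_idx], y[train_idx])
    # Predicted probability of the positive class for the held-out fold.
    scores = model.predict_proba(X[test_idx])[:, 1]
    fold_aucs.append(roc_auc_score(y[test_idx], scores))
    # Stratification keeps this share close to the overall class share.
    fold_positive_share.append(float(y[test_idx].mean()))

mean_auc = float(np.mean(fold_aucs))
print([round(a, 3) for a in fold_aucs], round(mean_auc, 3))
print(fold_positive_share)
```

Reporting the spread of the fold scores together with their mean gives a sense of how
stable the estimate is, which matters when each fold holds only a handful of players.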
Logistic Regression performed best overall, with the highest results on the majority of
the metrics, including accuracy, precision, F1 score, and AUC. This indicates a
well-balanced model with good discriminative power in separating high and low foulers.
XGBoost had the higher recall (0.66), indicating greater sensitivity to high-fouling
players, but at the expense of precision and overall AUC, indicating a bias towards false
positives. The model's tendency to flag more high foulers may prove beneficial in
certain uses (e.g., risk detection for discipline) but reduces its value as an
impartial predictor.
SVM performed nearly as well as Logistic Regression on AUC but fell behind on F1
score and recall, indicating a lower ability to detect high foulers compared to the
other models. The linear kernel probably limited it from detecting more complicated
patterns within the dataset.
Figure 27 Comparison of model performance (Logistic Regression, XGBoost, SVM)
The figure indicates the better balance of the Logistic Regression model, particularly in
precision and F1 score. XGBoost's performance, while relatively good in recall, is
more variable and has a lower AUC, revealing a less stable decision boundary.
The results suggest that it is feasible to predict fouling behavior based only on basic
defensive statistics. Logistic Regression was the strongest model, offering
interpretability, simplicity, and acceptable performance despite the low-dimensional
feature space.
Small dataset size – With few examples, especially in the high-fouler class,
overfitting may occur.
Imbalanced thresholding – Binary splitting on the median can cause class imbalance,
particularly when many players lie at or near the threshold.
Lower precision and AUC for XGBoost mean that while it captures more true high foulers,
it also misclassifies many low foulers, which may not be operationally desirable if
misclassification is expensive.
The lower F1 and recall for SVM are to be expected because its linear kernel cannot
capture more subtle distinctions between classes in this data.
4.11.5 Conclusion
Despite moderate performance, the findings support the viability of player-level foul
prediction using basic defensive statistics. Logistic Regression, the top performer within
current constraints, generates stable predictions with readily interpretable results.
As recommendations for future work, the predictive power of more advanced models such
as XGBoost and SVM can be further improved by:
Lastly, this exercise in modeling is a proof of concept showing that even basic features
can provide insight into challenging behavior such as fouling, with further improvements
to be explored in future work.
4.12. Midfielder performance modelling
4.12.1 Objective
This section examines whether midfielders from the 2023/24 season of the Premier League
of Bosnia and Herzegovina can be effectively classified by their assist productivity using
machine learning. Specifically, it investigates whether a set of numerical performance
indicators (e.g., pass accuracy, key passes, chances created, and match appearances) can
be used to distinguish highly assist-productive from less assist-productive midfielders.
To frame this as a binary classification problem, the target variable high_assist_rate was
formed using the median assists per game across all midfielders. A value of 1 was
assigned to those whose assist rate exceeded the median, and 0 to those at or below the
median [21].
The underlying assumption is that assist-making ability, although affected by external
tactical and situational influences, can be explained at least in part by players' individual
technical statistics collected consistently across games.
Two of the most common classification algorithms were used for model training and
testing:
• Logistic Regression
• Random Forest Classifier
These models were trained using stratified 5-fold cross-validation to ensure that each fold
preserved the balance between the high and low assist rate classes. The evaluation
metrics were:
Precision – number of correctly predicted high-assist players over all predicted high-assist
players
The perfect scores of the Logistic Regression model on every evaluation metric indicate a
high degree of linear separability of the features with respect to the target class. In
contrast, the Random Forest model missed some true positives, as indicated by its lower
recall for class 1 (high assist rate), in spite of its high AUC score (0.94) and overall
accuracy of 83%.
The perfect performance of Logistic Regression on all the evaluation metrics (accuracy,
precision, recall, F1, and AUC) reflects a high degree of linear separability in the selected
feature space. In other words, midfielders with higher assist rates can be separated very
well using simple match statistics, and the assist outcome is strongly correlated with
some of the selected features.
The Random Forest Classifier, while less accurate, had a strong AUC of 0.94,
suggesting that it remains highly capable of class discrimination. The lower recall,
though, indicates that it did not correctly label all of the high-assist midfielders, possibly
because the feature representation lacked depth. Given its ability to capture nonlinear
relationships and produce feature importances, Random Forest remains a viable means of
finding which features are most indicative of midfielder creativity, even if it is not the
best tool for classification here.
4.13. Attacker-level performance prediction
4.13.1 Objective
The objective of this section was to investigate whether machine learning models can
reliably predict attacking player performance based on an adequate but limited set of
offensive metrics. The data used for the analysis came from the 2023/24 season of the
Bosnia and Herzegovina Premier League. The main outcome variable was goal
contribution per match, defined as goals plus assists per appearance, as a
measure of a player's direct contribution to scoring [22].
To transform this into a classification problem, a binary target variable was created:
High performers (label 1): Players with goal contribution per match greater than the
median of the dataset.
The goal was to determine which model best separates high-performing attackers from
others based on readily available performance attributes.
• Goals
• Assists
• Shots
• Shots on target
• Dribbles completed
• Key passes
• Accurate passes
• Minutes played
• Games played
Prior to model training, features were standardized using StandardScaler so that all
variables contribute on a comparable scale, which is especially important for
distance-based or coefficient-sensitive algorithms such as Logistic Regression.
A 5-fold stratified cross-validation approach was used to maintain class balance across
folds and obtain reliable estimates of performance.
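One common way to combine scaling and stratified cross-validation is to place the scaler
and the classifier in a single pipeline, so that the scaler is refitted on the training part of
each fold. The sketch below assumes scikit-learn and synthetic data. Whether the thesis
fitted the scaler inside or outside the folds is not stated, so this is an illustration of the
technique rather than a reconstruction of the original code.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_validate
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the attacker table (NOT the thesis data).
X, y = make_classification(
    n_samples=60,
    n_features=5,
    n_informative=3,
    n_redundant=1,
    n_clusters_per_class=1,
    random_state=3,
)

# The scaler is fitted on each training fold only, then applied to the test fold.
pipeline = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=3)

results = cross_validate(
    pipeline,
    X,
    y,
    cv=cv,
    scoring=["accuracy", "precision", "recall", "f1", "roc_auc"],
)

# Average each metric over the five folds.
summary = {
    name[len("test_"):]: float(scores.mean())
    for name, scores in results.items()
    if name.startswith("test_")
}
print(summary)
```

Fitting the scaler inside the pipeline keeps information from the held-out fold out of the
training step, which avoids a mild form of data leakage.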
• Accuracy
• Precision
• Recall
• F1 Score
• ROC AUC
4.13.3 Results
The performance measures of the two models are provided in Table 13:
The Logistic Regression model achieved perfect scores on all metrics, implying linear
separability of the data. The Random Forest model, while still good (AUC = 0.94), was less
balanced in its predictions. It classified low performers (Class 0) more accurately and
misclassified some high performers, as shown by its lower recall and F1 score for
Class 1.
A bar chart comparison (Figure 29) was employed to illustrate the average performance
of the two models for Precision, Recall, and F1 Score. This revealed the superior and
balanced performance of Logistic Regression very clearly.
Figure 29 Comparison of model performance metrics
Even though the bar chart indicates superiority for the linear model, caution should be
exercised when interpreting such extreme values since the sample size is small (n = 12
attackers). Such a setup increases the danger of overfitting, since even simple models
can fit clean decision boundaries in small, low-dimensional data.
These findings support the premise that goal involvement can be predicted from basic
match statistics such as shots, assists, and key passes. The success of Logistic Regression,
a linear method, suggests that the offensive performance of attackers in this sample is
linearly related to these input features.
However, the perfect classification by Logistic Regression may be a sign of overfitting
rather than genuine generalizability. The model likely memorized patterns in this
extremely small and clean dataset rather than learning rules that transfer to unseen data.
In practical application, such accuracy would most likely not hold without additional
testing or more training data.
The Random Forest model, although less precise, is arguably more transferable and was
found to be very effective at detecting poorly performing attackers. Its ability to learn
nonlinear feature interactions and to output feature importances are strengths that make
it a good alternative, especially when data complexity is high.
This experiment demonstrates that even a small number of performance attributes is
enough to allow effective attacker evaluation through machine learning. This finding has
implications for real-world use by clubs and analysts:
Analysis of opponents: Highlighting players who are likely to affect match outcome;
4.13.7 Limitations and future work
Small sample size – just 12 attackers, which limits statistical power and increases
overfitting risk
No hyperparameter tuning – the default settings may not be optimal for Random Forest
• Using a larger sample across multiple seasons or leagues and, in addition to
standard metrics, including more advanced metrics such as expected goals (xG),
progressive runs, or possession chain involvement
• Applying regularization in Logistic Regression to increase generalizability
• Grid search or Bayesian optimization for hyperparameter tuning
This section seeks to close the gap between player-level classification and team-level
statistical modeling by analyzing how predictive information at both levels synthesizes to
account for general trends of success in the 2023–24 Premier League of Bosnia and
Herzegovina season. The synthesis of descriptive statistics and machine learning
predictions offers a multi-level platform for explaining how individual efforts combine to
affect team results [23].
4.14.1 Team-Level analysis and modeling
These descriptive observations were also supported by the feature importances of both
the Random Forest and XGBoost regressors. The most predictive features of league
position in both models were goals conceded, clean sheets, and interceptions. For
example, the Random Forest model rated goals conceded with the highest importance
value (0.194), which aligns with the descriptive finding that successful teams defend well.
Furthermore, the strong negative correlation between league position and ball possession
(r = -0.77) demonstrates the importance of possession-based strategy in competitive
achievement. This agreement between statistical correlation and machine learning feature
importance scores corroborates the indicators identified and underscores
their strategic importance to clubs looking to enhance performance.
Three independent modeling exercises were carried out for defenders, midfielders,
and attackers, respectively, each using position-specific performance metrics:
• Defenders – Fouling behavior prediction:
Using logistic regression and XGBoost, the aim was to predict high foulers based on
appearances, duels, and discipline metrics. While model performance was moderate (AUC
up to 0.70), there were notable insights. Defenders from high-performing teams like
Borac and Zrinjski were less likely to be labeled as high foulers, indicating a link between
disciplined defending and team success. These observations corroborate the overall
team-level trend that better discipline is linked with higher league position.
• Midfielders – Assist prediction:
This exercise yielded perfect performance metrics under logistic regression (Precision,
Recall, F1, AUC = 1.00), suggesting strong linear relationships between features such as
key passes, pass accuracy, and chances created and the likelihood of being a high-assist
player. Typical high-assist providers were midfielders from teams such as Sarajevo and
Zrinjski, which were consistently two of the most offensively proficient teams. This
points to the value of creative midfield play in driving team-level offensive proficiency.
• Attackers – Goal involvement prediction:
Predictive modeling with logistic regression and Random Forest successfully identified
attackers with high goal involvement. Attackers from Zrinjski and Borac, the clubs that
led the scoring charts, were consistently predicted as high performers. The close
relationship between individual offensive production and team goal difference (an
important predictor of league position) underlines the importance of attacking
contributions to team success.
These position-wise analyses indicate that individual player performance, when properly
modeled, is consistent with and helps explain many team-level outcomes.
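The evaluation metrics quoted for the position-specific classifiers (precision, recall,
F1, and AUC) can be illustrated with a short standard-library sketch. The labels and
predicted probabilities are hypothetical, and the helper names precision_recall_f1 and
roc_auc are assumptions made for this illustration; they are not the thesis code. AUC is
computed here as the share of positive/negative pairs in which the positive player
receives the higher score.

```python
def precision_recall_f1(y_true, y_pred):
    """Precision, recall and F1 for the positive class (label 1)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1

def roc_auc(y_true, scores):
    """AUC as the fraction of (positive, negative) pairs ranked correctly; ties count half."""
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

# Hypothetical labels and predicted probabilities for 12 players (illustrative only).
y_true = [1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
scores = [0.9, 0.8, 0.75, 0.6, 0.55, 0.4, 0.5, 0.35, 0.3, 0.2, 0.15, 0.1]
y_pred = [1 if s >= 0.5 else 0 for s in scores]  # 0.5 decision threshold (assumed)

precision, recall, f1 = precision_recall_f1(y_true, y_pred)
auc = roc_auc(y_true, scores)
print(precision, recall, f1, auc)
```

With six players per class there are only 36 positive/negative pairs, so one misordered
pair already moves the AUC by almost 0.03, and a score of 1.00 simply means that no pair
was misordered in this small sample.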
4.14.3 Connecting individual performance to team success
- Possession and passing accuracy – associated with dominant team play and
creative midfielders;
- Scoring and assist efficiency – linked to high-scoring attackers and high league
positions;
- Discipline and defensive metrics – shared among low-fouling defenders and low
team goals conceded.
These cross-sections indicate that individual performances are closely tied to team
outcomes. In the majority of cases, players whom the machine learning models predicted to
be among the top performers belonged to teams that finished at the top of the league table.
This suggests a feedback loop between micro-level (player) and macro-level (team)
performance, implying that good player modeling can serve as a proxy for team success
indicators and vice versa. This alignment supports data-driven scouting, lineup
optimization, and individualized training approaches in real-world football operations.
While most features were consistently significant at both levels of analysis, there
were some intriguing divergences:
Disciplinary measures (yellow and red cards) featured prominently in the descriptive
analysis but received relatively low feature importance in the machine learning models.
This may suggest that their effect is more indirect or contextual, i.e., they affect match
outcomes through suspensions rather than through accumulated performance loss.
Possession and passing accuracy were cross-cutting features, appearing regularly in
both the correlation analysis and the feature importance rankings. These features appear
particularly important for midfielders and for team control, confirming their centrality
to tactical design.
The combination of statistical and algorithmic methods reinforces the interpretability and
validity of the findings. The two-tiered framework employed here demonstrates that
integrating human-interpretable analysis with machine-generated predictions can produce
actionable insights for football that are ready to be used by management, analysts, and
coaches.
5. CONCLUSION
This thesis investigated the use of machine learning and data-driven analysis in the
evaluation and prediction of football performance in the Premier League of Bosnia and
Herzegovina (PLBiH) for the 2023–2024 season. By combining structured data collection,
statistical exploration, and predictive modeling, the study sought to produce actionable
intelligence at both the team and player levels. The ultimate aim was not simply to assess
how modern analytical techniques could be used within a regional football league but also
to implement them in a real-world context as tools for operational analysis and strategic
decision-making in lower-resource or less internationally prominent leagues.
Using statistics extracted from SofaScore, the research constructed a dataset of
32 variables across 12 teams, covering a wide range of performance indicators, from goals
for and against to disciplinary record, possession percentage, clean sheets, and passing
accuracy. This foundational dataset underpinned a multi-stage analytical pipeline that
consisted of:
- Descriptive and correlation analysis, to identify statistically significant trends
and outliers;
- Feature importance modeling, to establish the most predictive features of team success;
- Supervised machine learning models, to predict player-level outcome variables, e.g.,
foul frequency, assist frequency, and total goal participation.
The study results highlight the predictive potential of well-known football metrics:
At the team level, clean sheets, goals conceded, and interceptions were the best
predictors of end-of-season league standing. The relevance of team-level indicators was
supported not only by correlation coefficients (e.g., r = -0.77 between possession and
league rank) but also by feature importance rankings in the Random Forest and XGBoost
models.
At the player level, separate models were created for defenders, midfielders, and
attackers. Logistic Regression performed best for every position, indicating very high
separability of the features used and demonstrating that meaningful predictions can be
made even with basic performance statistics. Midfielders from offensively strong teams
(e.g., Sarajevo, Zrinjski) were correctly identified as having high assist numbers, and
attackers from highly ranked clubs were correctly identified using goal involvement
metrics.
These results strongly indicate that machine learning models, especially interpretable
ones like Logistic Regression, can learn basic patterns in football performance from fairly
limited amounts of match-level data. This is particularly significant because it is widely
believed that successful football analytics requires very large amounts of player tracking
data or proprietary sensor feeds. The study illustrates that, in a data-scarce environment
like the PLBiH, open-source software and freely available statistics can be used to draw
meaningful and replicable conclusions.
However, the study also has limitations. The team dataset contained only 12
observations, matching the number of teams playing in the league. This naturally
restricted model complexity and the ability to generalize findings. At the player level,
the sample in each position-specific dataset was also quite small (n=12 per group), which
again increases the risk of model instability and overfitting, especially with more
advanced models like XGBoost. Although stratified k-fold cross-validation mitigated this
risk to some extent, the models' generalizability across seasons, leagues, or contexts
remains limited. In addition, the binary target labels created through median splits,
while useful for maintaining class balance, may have reduced more nuanced player behavior
to simpler states.
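The two design choices named above, median-split labels and stratified folds, can be
sketched with the standard library alone. The fouls-per-90 values, the choice of three
folds, and the helper names median_split and stratified_folds are assumptions made for
illustration; the thesis does not state the exact fold count or implementation.

```python
from statistics import median

def median_split(values):
    """Label a value 1 if it lies strictly above the median, otherwise 0."""
    cutoff = median(values)
    return [1 if v > cutoff else 0 for v in values]

def stratified_folds(labels, k):
    """Deal the sample indices of each class round-robin into k folds."""
    folds = [[] for _ in range(k)]
    for cls in sorted(set(labels)):
        members = [i for i, y in enumerate(labels) if y == cls]
        for pos, idx in enumerate(members):
            folds[pos % k].append(idx)
    return folds

# Hypothetical fouls-per-90 values for 12 defenders (illustrative only).
fouls_per_90 = [0.8, 1.9, 1.1, 2.4, 0.6, 1.5, 2.0, 0.9, 1.7, 1.2, 2.2, 1.0]

labels = median_split(fouls_per_90)   # balanced classes by construction
folds = stratified_folds(labels, k=3)  # k=3 is an assumed fold count

for fold_no, test_idx in enumerate(folds, start=1):
    positives = sum(labels[i] for i in test_idx)
    print(f"fold {fold_no}: {len(test_idx)} test players, {positives} high foulers")
```

The sketch also makes the limitation visible: each test fold holds only a handful of
players, so every per-fold metric rests on very few predictions, and a player just above
the median is treated the same as the most extreme one.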
Despite these weaknesses, the thesis provides a valid proof of concept for analytics-driven
football intelligence in under-researched leagues. It offers a replicable methodology for
other regional competitions seeking to leverage analytics without expensive infrastructure
or proprietary data feeds.
To build on the results and move beyond the current constraints, the following
directions for future research are proposed:
- Use multi-season datasets to support trend analysis between seasons and improve model
stability over time. This would allow enduring patterns in player behavior and team
strategy to be found across multiple competitive cycles.
- Incorporate match-by-match or event-level details, e.g., pass chains, sources of shots,
or player movement, to give the models temporal and tactical depth.
- Apply explainable AI techniques such as SHAP (SHapley Additive exPlanations) or LIME to
make black-box models more interpretable and trustworthy to stakeholders, especially for
management or coaching decisions.
- Study ensemble and hybrid models that combine the predictive power of algorithms like
XGBoost with the explainability of logistic regression for more nuanced player
classification applications.
- Develop scouting resources or dashboards from these models that can help clubs with
recruitment, training priorities, or opponent preparation, making the analytics
actionable beyond academic research.
This research contributes to the growing intersection of machine learning and sports
analytics, focusing on a regional European football league that is under-represented in
current academic literature. By demonstrating viable predictive models for team and
individual performance, the thesis illustrates how data science can help clubs, coaches,
and analysts make more objective, informed, and tactical decisions.
By bridging current data science practices and practical football performance
measurement, this study lays the groundwork for wider adoption of smart, data-intensive
strategies at all levels of the sport across the region. It is hoped that this study
contributes as much to academic progress as to practical development in Bosnian football
and beyond.
REFERENCES
[1] Antonini, V., Scriney, M., Mileo, A., & Roantree, M. (2024). A Framework for Spatio-
Temporal Graph Analytics in Field Sports.
[2] Antonini, V., Scriney, M., et al. (2024). Time-windowed graph analytics in sports.
arXiv.
[4] Beetz, M., Gedikli, S., et al. (2009). Automated game analysis via spatio-temporal
data. International Journal of Computer Science in Sport.
[5] Beetz, M., von Hoyningen-Huene, N., et al. (2009). ASPOGAMO: Automated sports
games analysis models. International Journal of Computer Science in Sport.
[6] Bialkowski, A., Lucey, P., Carr, P., Yue, Y., & Matthews, I. (2014). Win at home and
draw away: Automatic formation analysis highlighting differences in team
behaviors. Proceedings of the MIT Sloan Sports Analytics Conference.
[8] Breihofer, P., V., Memmert, D. (2006). Tactical training optimization in elite football.
[9] Carling, C., Kannekens, R., Sampaio, J., & Yiannakos, A. (2005). Tactical behavior in
elite football: formation and movement analysis. Journal of Sports Sciences.
[10] Forcher, F., Beckmann, T., et al. (2024). Prediction of defensive success in elite soccer
using machine learning – Tactical analysis of defensive play using tracking data
and explainable AI. Science and Medicine in Football.
[11] Franks, I. M., & Miller, G. (Year). Expert observers’ recollection of match events.
[12] Franks, I. M., & Miller, G. (Year). Human observational error in sports analytics.
[13] Goes, F. R., Meerhoff, L. A., et al. (2021). Unlocking the potential of big data to
support tactical performance analysis in professional soccer: A systematic review.
European Journal of Sport Science.
[14] Gudmundsson, J., & Horton, M. (2017). Spatio-Temporal Analysis of Team Sports: A
Survey. ACM Computing Surveys.
[15] Gyarmati, L., & Anguera, X. (2015). Automatic Extraction of the Passing Strategies
of Soccer Teams.
[16] Gyarmati, L., & Anguera, X. (2015). DTW-based passing strategy detection. arXiv.
[17] Horton, M., & Gudmundsson, J. (2016). Flow Diagrams for State Sequences in Team
Sports. ACM Journal of Experimental Algorithmics.
[18] Horton, M., Chawla, S., & Estephan, J. (2014). Computational geometry approaches to
pass classification. arXiv.
[19] Horton, M., Gudmundsson, J., Chawla, S., & Estephan, J. (2014). Classification of
Passes in Football Matches using Spatiotemporal Data.
[21] Memmert, D. (2017). Match Analysis, Big Data and Tactics: Current Trends in Elite
Soccer. German Journal of Sports Medicine.
[22] Memmert, D. (2017). Spatio-temporal tracking and tactical patterns in elite football.
[23] Memmert, D., & Raabe, D. (2017). Revolution in professional football: data-driven
analysis 4.0.
[24] Memmert, D., & Raabe, D. (2019). Data analytics in football: positional data
collection, modelling and analysis. Routledge.
[25] Memmert, D., Lemmink, K., & Sampaio, J. (2017). Current approaches to tactical
performance analyses in soccer using position data. Sports Medicine.
[26] Memmert, D., Lemmink, K., & Sampaio, J. (2020). Collective team behavior and
positional spatio-temporal modeling. Sports Medicine.
[27] Memmert, D., Lemmink, K. A. P. M., & Sampaio, J. (2020). A systematic review of
collective tactical behaviours in football using positional data. Sports Medicine.
[28] Rein, R., & Memmert, D. (2016). Big data and tactical analysis in elite soccer:
Future challenges and opportunities for sports science. SpringerPlus.
[29] Rein, R., & Memmert, D. (2016). Future challenges for sports science in big data.
PubMed Central.
[30] Rein, R., & Memmert, D. (2016). Soccer big data models for tactical decision
making. German Journal of Sports Medicine.
[31] Rein, R., Raabe, D., & Memmert, D. (2017). Which pass is better? Novel approaches
to assess passing effectiveness in elite soccer matches. Human Movement Science.
[32] Rein, R., Raabe, D., et al. (2017). Tactical creativity via spatio-temporal tracking
data. Human Movement Science.
[33] Sampaio, J., & Maças, V. (2012). Measuring tactical behavior in football. Journal of
Sports Sciences.
[34] Teferi, G., & Endalew, D. (2020). Methods of Biomechanical Performance Analyses
in Sport: A Systematic Review. American Journal of Sports Science and Medicine.
[35] Teferi, G., & Endalew, D. (2020). Sports biomechanics systematic approaches.
American Journal of Sports Science and Medicine.