
Adaptive Average Exploration in Multi-Agent Reinforcement Learning
Garrett Hall, Southwest Research Institute, San Antonio, Texas, [email protected]
Ken Holladay, Southwest Research Institute, San Antonio, Texas, [email protected]

Abstract— The objective of this research project was to improve Multi-Agent Reinforcement Learning performance in the StarCraft II environment with respect to faster training times, greater stability, and higher win ratios by 1) creating an adaptive action selector we call Adaptive Average Exploration, 2) using experiences previously learned by a neural network via Transfer Learning, and 3) updating the network simultaneously with its random action selector epsilon. We describe how agents interact with the StarCraft II environment and the QMIX algorithm used to test our approaches. We compare our AAE action selection approach with the default epsilon greedy method used by QMIX. These approaches are used to train Transfer Learning (TL) agents under a variety of test cases. We evaluate our TL agents using a predefined set of metrics. Finally, we demonstrate the effects of updating the neural networks and epsilon together more frequently on network performance.

Keywords—multi-agent reinforcement learning, exploration and exploitation, micromanagement, StarCraft II

I. INTRODUCTION
Reinforcement Learning (RL) is an iterative process in which an agent learns to complete tasks in an environment based on experience. At each training timestep, the agent takes an action, makes an observation of the state of the environment, collects a reward signal for the action related to the environmental change, and updates its behavioral policy to maximize the reward signal. Multi-Agent Reinforcement Learning (MARL) is a branch of RL where a team of agents works together to maximize their rewards as a group. Training together cooperatively allows for the emergence of behaviors otherwise not found during single agent RL.

Video games provide an excellent MARL research environment. StarCraft II (SC2) [1] is the second title in a series of real-time strategy games released by Blizzard Entertainment. Each player must build an army and defeat their opponent. This is accomplished by acquiring resources, building structures, training combat units, and developing advanced technologies to improve combat capabilities. The three races in SC2 (Terran, Protoss, and Zerg) have a variety of combat units, each with their own strengths and weaknesses. Winning gameplay requires strategy with respect to force composition, deployment timing, and unit cooperation. Developing an Artificial Intelligence (AI) algorithm capable of defeating the best human players was considered until recently to be one of the grand challenges in the AI community [2].

With its rich mixture of combat unit capabilities and environments, SC2 provides an ideal laboratory environment for RL. In 2019, DeepMind trained an AI capable of playing each of the three StarCraft II races [3]. This AI, AlphaStar, was able to win against high level players in online games. DeepMind publicly released their Python API as PySC2 [4] to encourage RL research.

Oxford's WhiRL laboratory took full advantage of the SC2 API release and developed their own StarCraft II Multi-Agent Reinforcement Learning Challenge (SMAC) [5] and algorithms (PyMARL) [6]. The framework and general architecture are illustrated in Fig. 1.

Fig. 1 - General MARL Network Architecture

The challenge focuses on small scale skirmishes between a MARL algorithm and the handcrafted SC2 AI. In this context, each combat unit is an agent. Each team consists of a homogeneous or heterogeneous mixture of combat units from that team's chosen race. The goal for the MARL algorithm is to learn appropriate strategies for defeating the opponent within a specific time period. For example, the agents are rewarded for inflicting damage on enemy units during game play, and if all the enemy units are killed, the team is rewarded with a large bonus at the end of the round. Thus the agents learn from their interactions with the environment.

MARL algorithms can be categorized as either learning competitively or cooperatively. In competitive learning, each agent receives its own reward. In cooperative learning, agents share a common reward. Independent Q-Learning (IQL) [7] is an example of a competitive learning algorithm. In IQL, each agent takes individual actions to improve its own Q-value. Learning is fully decentralized. The IQL agents treat other agents on their team as part of the environment and do not communicate with them. This is computationally efficient, but it does not provide an avenue for cooperative behavior.
The goal of a cooperative learning algorithm is to encourage the emergence of team based behavior. This is more applicable to real world scenarios where agents take actions based on the actions of other agents on their team, as well as observations from the environment. Several different cooperative MARL algorithms have been applied to the SMAC environment using PyMARL. These include Counterfactual Multi-Agent Policy Gradients (COMA) [8] and Monotonic Value Function Factorization for Deep MARL (QMIX) [9]. A key difference between cooperative MARL algorithms is how they provide agents with rewards for both their individual actions and for their team actions. Attributing rewards to specific actions is known as credit assignment. COMA solves the problem of credit assignment by comparing each agent's actions against a common baseline. The individual agents are provided with a reward directly related to the action the agent took. It is considered fully centralized learning. While this approach does solve the credit assignment problem, the action space increases exponentially with each agent added.

QMIX is a balance between COMA and IQL. Each agent acts based on its own independent partial observation of the environment. However, during centralized training, the agents have full access to all state information of the environment. Since the agents are trained together using the same state information, they learn to act cooperatively. After training, they are able to act independently based on their own observations while still maximizing the team's reward. During training, it is essential to fully explore the action space for the agents to have the best possible selection of optimal actions for any situation.
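To make the centralized-training idea concrete, the sketch below shows the kind of monotonic mixing network QMIX [9] uses to combine per-agent Q-values into a single team value; the layer sizes and class name are illustrative assumptions, not the exact configuration used in this work.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MonotonicMixer(nn.Module):
    """Illustrative QMIX-style mixer that combines per-agent Q-values into Q_tot.

    Hypernetworks conditioned on the global state produce the mixing weights;
    taking their absolute value keeps Q_tot monotonic in each agent's Q-value.
    Sizes are assumptions for illustration, not the paper's configuration.
    """
    def __init__(self, n_agents: int, state_dim: int, embed_dim: int = 32):
        super().__init__()
        self.n_agents, self.embed_dim = n_agents, embed_dim
        self.hyper_w1 = nn.Linear(state_dim, n_agents * embed_dim)
        self.hyper_b1 = nn.Linear(state_dim, embed_dim)
        self.hyper_w2 = nn.Linear(state_dim, embed_dim)
        self.hyper_b2 = nn.Sequential(nn.Linear(state_dim, embed_dim),
                                      nn.ReLU(),
                                      nn.Linear(embed_dim, 1))

    def forward(self, agent_qs: torch.Tensor, state: torch.Tensor) -> torch.Tensor:
        # agent_qs: (batch, n_agents), state: (batch, state_dim)
        batch = agent_qs.size(0)
        w1 = torch.abs(self.hyper_w1(state)).view(batch, self.n_agents, self.embed_dim)
        b1 = self.hyper_b1(state).view(batch, 1, self.embed_dim)
        hidden = F.elu(torch.bmm(agent_qs.view(batch, 1, self.n_agents), w1) + b1)
        w2 = torch.abs(self.hyper_w2(state)).view(batch, self.embed_dim, 1)
        b2 = self.hyper_b2(state).view(batch, 1, 1)
        return (torch.bmm(hidden, w2) + b2).view(batch, 1)  # Q_tot for the team
```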
While training, agents are encouraged to explore their action space by selecting random actions periodically. A commonly used exploration technique is the epsilon greedy approach, where agents select either a random exploratory action (with a probability of ε) or the learned optimal greedy action (with a probability of 1 − ε). During the training process, the value of ε is decreased using a linear or exponential decay function.
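As a concrete illustration of this rule, the snippet below selects one agent's action epsilon-greedily over its legal actions; the inputs are placeholders rather than PyMARL structures.

```python
import numpy as np

def epsilon_greedy(q_values: np.ndarray, avail_actions: np.ndarray,
                   epsilon: float, rng: np.random.Generator) -> int:
    """Pick a random available action with probability epsilon,
    otherwise the available action with the highest Q-value."""
    available = np.flatnonzero(avail_actions)            # indices of legal actions
    if rng.random() < epsilon:
        return int(rng.choice(available))                # explore
    masked_q = np.where(avail_actions == 1, q_values, -np.inf)
    return int(np.argmax(masked_q))                      # exploit

# Example: 9 discrete actions (no-op, stop, 4 moves, 3 attack targets)
rng = np.random.default_rng(0)
q = rng.normal(size=9)
avail = np.array([1, 1, 1, 1, 0, 0, 1, 1, 0])
action = epsilon_greedy(q, avail, epsilon=0.05, rng=rng)
```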
To improve the win rate of our agents against the SC2 AI, we propose using Adaptive Average Exploration (AAE). Traditionally, MARL algorithms only look at the rewards to select their actions. This approach is limited in that it does not account for the overall success of the team. In order to form this relationship, we use the win ratio of the agents to derive the epsilon value.

In addition to AAE, we also investigate Transfer Learning (TL), where agents are trained to perform a specific task and are then retrained to complete a new task while using their previous knowledge. The intent is to increase the rate at which agents learn and thus decrease training time, as illustrated in Fig. 2.

Fig. 2 - Performance Metrics for Transfer Learning [10]

Finally, we explore the relationship between updating the exploration rate epsilon and the agents simultaneously at specific intervals.

The remainder of this paper is broken down as follows: Experimental Setup, Results, and Conclusion.

II. EXPERIMENTAL SETUP

A. StarCraft II Environment

The StarCraft II Multi-Agent Challenge focuses on MARL in the SC2 environment [5]. Each agent is controlled by its own individual neural network. Each agent has a limited, radial field of view (FoV) and actions are based on local observations. Agents do not track units that are outside of their FoV. Partially observable environmental information is provided to the agents by making state observations. These local observations include distance, relative x position, relative y position, health, shield, and unit type. Global observations include coordinates of all agents relative to the center of the map, all agents' local observations, and a history of all agents' last actions. The discrete action space consists of move (north, south, east, west), attack, stop, and no operation. For the agents to learn, a reward signal is provided. This reward signal is based on hit-point damage dealt to opponent units and opponent units killed. There is a large overall bonus for eliminating all opponent units. We used PyMARL [5], which conforms to the challenges outlined in SMAC, to develop our algorithms. Our method uses QMIX as the baseline MARL algorithm.

Fig. 3 - Agents training in SC2 with 3 marines versus 3 marines

We build upon exploration and exploitation with an adaptive technique. Each team has 3 marines with equivalent stats and attributes. They start each episode in the same locations and within visible range of each other. Each episode in the 3 marines vs 3 marines scenario lasts either 60 timesteps or until one team is completely defeated (Fig. 3). During each timestep, agents take an action, receive a reward, and make observations. The games are played at 8x the normal game speed. Each training session is 500,000 timesteps. A single experiment consists of 5 training sessions.
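For reference, a minimal interaction loop with the SMAC environment, here with a random policy standing in for the trained agents, looks roughly like the following; it assumes the open-source smac Python package and its 3m (3 marines versus 3 marines) map.

```python
import numpy as np
from smac.env import StarCraft2Env

env = StarCraft2Env(map_name="3m")          # 3 marines vs 3 marines
info = env.get_env_info()
n_agents, n_actions = info["n_agents"], info["n_actions"]

for episode in range(3):                    # a few illustrative episodes
    env.reset()
    terminated, episode_reward = False, 0.0
    while not terminated:
        obs = env.get_obs()                 # per-agent local observations
        state = env.get_state()             # global state (centralized training only)
        actions = []
        for agent_id in range(n_agents):
            avail = env.get_avail_agent_actions(agent_id)
            actions.append(np.random.choice(np.nonzero(avail)[0]))
        reward, terminated, step_info = env.step(actions)
        episode_reward += reward            # shared team reward
    print(f"episode {episode}: reward={episode_reward:.1f}, "
          f"won={step_info.get('battle_won', False)}")
env.close()
```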

The entire training and testing process proceeds as follows. Agents train for 200 episodes. During this time the value of epsilon varies depending on whether AAE or Greedy Epsilon is selected. Once 200 episodes are reached, the neural network trains on the 200 episodes using batch sizes of 32. This neural network is frozen and used to select actions for the next 200 episodes. During the neural network update, the target network gives the current network a stationary point of reference for convergence. After 200 episodes the current network is updated to be the frozen network. This cycle repeats until training is complete.

For the testing phase, after every 10,000 timesteps the training is paused and epsilon is set to zero. Subsequently, agents select optimal actions based on their local observations for 32 episodes. The percentage of episodes won is referred to as the test win ratio.
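The cycle described above can be summarized in schematic Python; the helper functions and objects (run_episode, train_on_block, the networks, and the epsilon schedule) are placeholders for the corresponding PyMARL components, while the constants mirror the values in the text.

```python
TRAIN_BLOCK = 200      # episodes collected before each network update
BATCH_SIZE = 32        # minibatch size used when training on a block
TEST_EPISODES = 32     # greedy episodes used to compute the test win ratio

def training_cycle(env, current_net, frozen_net, epsilon_schedule,
                   run_episode, train_on_block, total_timesteps=500_000):
    timesteps = 0
    while timesteps < total_timesteps:
        # 1) Collect 200 episodes, selecting actions with the frozen network.
        block = []
        for _ in range(TRAIN_BLOCK):
            episode = run_episode(env, frozen_net, epsilon_schedule.value)
            block.append(episode)
            timesteps += episode.length
        # 2) Train the current network on the block in batches of 32 episodes;
        #    the frozen (target) network provides a stationary reference.
        train_on_block(current_net, frozen_net, block, batch_size=BATCH_SIZE)
        # 3) Replace the frozen network with the newly trained one and repeat.
        frozen_net.load_state_dict(current_net.state_dict())
        epsilon_schedule.update(block)   # AAE or greedy-epsilon annealing

def evaluate(env, net, run_episode):
    # Testing phase: epsilon is set to zero, so only greedy actions are taken.
    wins = sum(run_episode(env, net, 0.0).won for _ in range(TEST_EPISODES))
    return wins / TEST_EPISODES          # the test win ratio
```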
B. Greedy Epsilon Annealing Rates

QMIX uses (1) to define the linear decay of ε over each timestep:

ε_t+1 = ε_t − Δε          (1)

Δε = (ε_start − ε_finish) / t_anneal

where ε_start and ε_finish are the initial and final exploration rates and t_anneal is the number of timesteps over which ε is annealed.
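In code, the schedule in (1) reduces to one subtraction per timestep, clipped at the final value; the parameter names are ours, with defaults taken from the rapid- and slow-decay cases described later.

```python
def linear_epsilon(t: int, eps_start: float = 1.0, eps_finish: float = 0.05,
                   anneal_time: int = 100_000) -> float:
    """Epsilon after t environment timesteps under the linear schedule in (1)."""
    delta = (eps_start - eps_finish) / anneal_time
    return max(eps_finish, eps_start - delta * t)

# Case 1 (rapid decay) and Case 2 (slow decay) from the experiments:
eps_case1 = linear_epsilon(5_000, anneal_time=10_000)     # 0.525
eps_case2 = linear_epsilon(5_000, anneal_time=100_000)    # 0.9525
```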
C. Adaptive Average Exploration

In our AAE experiments, we update ε based on performance during a specific number of episodes. The three experiments update their values of epsilon using a running average every 5, 20, or 100 episodes, based on the number of episodes the agent has won during training. ε is directly proportional to the win ratio and expressed within the range of [0,1]. We refer to this method as the Adaptive Average Epsilon (AAE) method, listed as (2):

ε = (episodes won in the last N episodes) / N          (2)

where N is the update window of 5, 20, or 100 episodes.

The agents were trained in a homogeneous battle space of 3 marines versus 3 marines. The duration of each episode was 60 timesteps. Units that survived longer than 60 timesteps self-destructed. After an episode is complete, each team respawns its units in the same location, and the new episode begins. Our agents train for 500,000 timesteps. This number was reduced from the over 2 million timesteps used in the original QMIX work in an effort to reduce training time. In total, five networks are trained since the network performance is stochastic in nature. These results are later used to examine the variance of the networks.
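A sketch of how the update in (2) could be implemented is shown below, assuming a simple sliding window over recent episode outcomes; this reflects our reading of the description rather than the authors' exact code.

```python
from collections import deque

class AdaptiveAverageEpsilon:
    """Sets epsilon to the running-average win ratio over the last N episodes,
    per (2). N = 5, 20, or 100 in the experiments described above."""
    def __init__(self, window: int = 20, initial_epsilon: float = 1.0):
        self.outcomes = deque(maxlen=window)    # 1 = episode won, 0 = lost
        self.epsilon = initial_epsilon          # starting value is an assumption

    def record(self, won: bool) -> None:
        self.outcomes.append(1 if won else 0)

    def update(self) -> float:
        # Called every `window` episodes in the AAE experiments.
        if self.outcomes:
            self.epsilon = sum(self.outcomes) / len(self.outcomes)
        return self.epsilon
```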
D. Transfer Learning

We train our agents in four scenarios for our TL experiment: 3 marines vs 3 Zerglings, 3 marines vs 3 marines, 3 marines vs 3 Zealots, and 3 marines vs 3 Hydralisks. Once the neural networks for the 3 marines have trained for 500,000 timesteps, they are saved. These neural networks then continue their training for 500,000 more timesteps against 3 marines controlled by SC2's native AI to complete the TL training.

We compare our results against the average performance of Marines who have not used transfer learning. The initial TL Marine agents are used as a baseline to compare against the three other experiments. The TL Zergling agents train in a simple scenario where the marines are much more powerful than the Zerglings. The TL Zealots are more powerful melee combat units and the TL Hydralisks are more powerful ranged attack units. We look for how behaviors and high scores are related during the initial testing and during TL testing. We use a set of performance metrics outlined in [11] to determine the success of our agents.
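The hand-off between the two 500,000-timestep phases can be as simple as saving and reloading the agent network weights; the sketch below uses generic PyTorch calls, a stand-in network, and a hypothetical file name rather than PyMARL's checkpointing machinery.

```python
import torch
import torch.nn as nn

# Stand-in for the per-agent network; the real architecture lives in PyMARL.
agent_net = nn.Sequential(nn.Linear(30, 64), nn.ReLU(), nn.Linear(64, 9))
target_net = nn.Sequential(nn.Linear(30, 64), nn.ReLU(), nn.Linear(64, 9))

# Phase 1: after training in the source scenario, save the learned weights.
torch.save(agent_net.state_dict(), "marines_source_scenario.pt")  # hypothetical path

# Phase 2: start the TL run from those weights and continue training
# for another 500,000 timesteps against the SC2-controlled marines.
agent_net.load_state_dict(torch.load("marines_source_scenario.pt"))
target_net.load_state_dict(agent_net.state_dict())  # keep the frozen copy in sync
```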
E. Epsilon Update Rates

For the previous experiments, the agent network was updated every 200 episodes. This Epsilon Update Rate experiment uses the AAE method in combination with network updates occurring at 32, 96, and 192 episodes. These update rates were selected to evenly accommodate the network's default batch size of 32 episodes. AAE has been adjusted so epsilon is only updated when the network updates. We expect that updating the network more frequently will lead to faster performance increases.

III. RESULTS

We provide our results in the order listed in the Experimental Setup section. Each full training session lasted 500,000 timesteps. We observe average Win Ratios (WR) over time and how epsilon changes the agent WR. Training time was approximately 24 hours using an Intel(R) Xeon(R) E5-2690 v2 @ 3.00GHz CPU and an NVIDIA Tesla K80 GPU, with five sessions run in parallel.

In the graphs, the Test Win Ratio is the total number of games won divided by the number of episodes played. An episode lasts either 60 timesteps or until an entire team is defeated. The number of timesteps in each experiment is a maximum of 500,000. The single dark blue line represents epsilon and how it changes during training. During testing, epsilon is set to zero so that only optimal actions are selected.

The performance average is denoted by a solid line surrounded by a shaded region. The shaded region spans two standard deviations (2σ) around the mean, representing our 95% confidence interval.
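Such a band can be reproduced from the five training sessions with a few lines of NumPy and Matplotlib; the array shapes below are assumptions about how the logged win-ratio curves are stored.

```python
import numpy as np
import matplotlib.pyplot as plt

# win_ratios: one row per training session, one column per evaluation point.
win_ratios = np.random.rand(5, 50)            # placeholder for the logged curves
timesteps = np.linspace(0, 500_000, 50)

mean = win_ratios.mean(axis=0)
band = 2 * win_ratios.std(axis=0)             # two standard deviations (~95% CI)

plt.plot(timesteps, mean, label="mean test win ratio")
plt.fill_between(timesteps, mean - band, mean + band, alpha=0.3, label="±2σ")
plt.xlabel("timesteps")
plt.ylabel("test win ratio")
plt.legend()
plt.show()
```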
A. Results for Greedy Epsilon Annealing Rates

We discuss the two decay experiments seen in Fig. 4 in terms of their training and testing. In Case 1 (rapid decay), epsilon decreases from 1.0 to 0.05 in 10,000 timesteps. In Case 2 (slow decay), epsilon decreases over 100,000 timesteps. During training, agents start increasing their WR as epsilon nears its lower bound of 5%. In Case 1, this occurs almost instantaneously, while in Case 2 it occurs much later. Even though Case 1 starts increasing the WR earlier, it does so at a lower rate than the agents in Case 2. Case 1 displays a higher variance than Case 2.

For testing, Case 1 achieves a steadily increasing WR while agents perform only the optimal action. Case 2 experiences a higher WR initially, but it decreases until 110,000 timesteps before it starts increasing again. During this time (10,000-190,000 timesteps) there is large variance in both Cases. After 250,000 timesteps Case 2 experiences reduced variance while Case 1 experiences more variance than Case 2. The average episode length was 23 timesteps.

Fig. 4 - Linear decay rates. Rapid decay on left side and slow decay on right.

B. Results for Adaptive Average Exploration

During training, epsilon follows the average WR as described by our AAE algorithm. This causes the agents to never reach above a WR of 27% during training. However, during testing the agents successfully apply tactics they learned during training. The two metrics of success for the AAE algorithm are the WR and variance. Both the 20 and 100 episode cases outperform the 5 episode case in both metrics, as seen in Fig. 5. In terms of learning rate, the 100 episode case achieves a higher learning rate more quickly than the 20 episode case. In terms of variance, the 20 episode case performs better over time, with a significant reduction in variance by the end of training. The average episode length was 24 timesteps, indicating that for most games, one team destroyed the other.

Fig. 5 - Adjusting how many episodes before network updates while using the Adaptive Average Epsilon method

C. Results for Transfer Learning

We look at our Transfer Learning results by grouping the results into the categories of combat types, power ranking, Jump Start Metric, Time to Threshold Metric, and Asymptotic Performance Metric. Please note the Hydralisk training was performed after all experiments were complete. We drew our conclusions and implemented those in an agent as proof of concept. The TL Hydralisk agents trained for 32 episodes before their network and epsilon values were updated. This 32 episode number is discussed in the Results for Epsilon Update Rates sub-section.

The two different combat types present are melee and ranged. Melee is close range with either claws for the Zerglings or psionic swords for the Zealots. Ranged consists of machine guns for the Marines and spike projectiles for the Hydralisks.

During training in the left hand column of Fig. 6 for the Zerglings, the Marine agents instantly achieve a high WR and maintain it for the course of the training. For training against the Zealots, the Marine agents are unable to win a single match. Further review of gameplay footage reveals they are not capable of even eliminating one enemy Zealot. This is similar to the Marines training against the Hydralisks. The preferred strategy for fighting Zealots and Hydralisks is to take one step towards the enemy and continually attack until the episode is complete. The Marines that train with Marines are able to increase their WR early.

The power rankings can be broken down from weakest to strongest as follows: Zergling, Marine, Hydralisk, Zealot. The Marines have no trouble winning against weaker foes and other Marines. It is when they are overpowered by either the Zealots or the Hydralisks that they are unable to land a single kill on the enemy team.

In terms of Jump Start, we look at the TL agents' performance in the second column in Fig. 6. These metrics are captured in TABLE I. Only agents trained against the similar combat type showed an initial improved combat performance. The TL Marines start off at a 100% WR, which sharply drops to 25% before increasing to 90% and holding steady for the rest of the training. The TL Hydralisk agents take a quick initial dip in WR performance but bounce back quickly. The TL Marines outperform the TL Hydralisk agents marginally. Both groups of ranged agents keep a 95% WR for the majority of the training after 100,000 timesteps. In doing so, both TL agents outperform the Regular Marine agents, who lack a consistently high WR later in testing.

For the Time to Threshold (ToT) of 80%, we see similar performance trends to the Jump Start metric. The TL Marines and TL Hydralisks both outpace the Regular Marines by roughly 150,000 timesteps. The TL Marines reach their ToT 25,000 timesteps before the TL Hydralisks. The TL Zergling agents see a slight ToT improvement, reaching the threshold 25,000 timesteps faster than their Regular Marine counterpart. These TL Zergling agents' testing WR mirrors the Regular Marines' testing almost identically. The TL Zealots, while unable to win during training, showed a surprisingly good ToT of 60,000 timesteps. The TL Zealots also kept a consistently high WR similar to the TL Marine and TL Hydralisk agents.

The Asymptotic Performance was very slight in these experiments. The TL Marines, Zealots, and Hydralisks were able to consistently keep their WR above the Regular Marine counterpart. While most of them did converge on a WR of 95% and greater, there was no WR decrease during testing. The TL Zergling agents did not exhibit any Asymptotic Performance.

TABLE I. TRANSFER LEARNING PERFORMANCE METRICS

              Jump Start            Time to Threshold (80%)
              Regular     TL        Regular     TL
Marines       0%          25%       190k        26k
Zerglings     0%          0%        200k        175k
Zealots       0%          0%        175k        111k
Hydralisks    0%          25%       200k        50k

Fig. 6 - Top Row: Agent Win Ratio, Bottom Row: Average Win Ratio of TL agents against Marine agents

D. Results for Epsilon Update Rates

Fig. 7 shows the impact of updating epsilon at set intervals when training updates occur. Updating every 32 episodes shows the best performance. The time to an 80% WR threshold for the 32 UR is 150,000 timesteps, for the 96 UR is 195,000 timesteps, and for the 192 UR is 210,000 timesteps. For the 32 and 96 UR, epsilon converges to 27%. For the 192 UR, epsilon does not converge and continues to increase while oscillating.

As the update rate increases, the variance increases as well. The 32 UR had the smallest variance at the end of training. The 96 UR had three performance drops at 320,000 timesteps, 425,000 timesteps, and 490,000 timesteps that resulted in high variances around those timesteps. The 192 UR gets close to converging to the 90% WR, but as epsilon starts to oscillate and diverge, the WR begins to decrease with a large increase in variance, reducing the performance to 27%.

Fig. 7 - Updating epsilon and agents in unison

IV. CONCLUSION

This paper presented the effects of annealing rates on the epsilon greedy method used by QMIX and an Adaptive Average Exploration approach for selecting epsilon in the QMIX algorithm, applied transfer learning to four scenarios, and used simultaneous update rates to improve model performance. Our results show that AAE performs on par with QMIX's greedy epsilon method, that TL agents see performance improvements when testing against similar combat types, and that frequent simultaneous updates of epsilon and network weights help stabilize model performance. Future research will extend these ground based methods to aerial combat scenarios.

A. Conclusion for Greedy Epsilon Annealing Rates

In terms of consistency, Case 1 (rapid decay) outperforms Case 2 (slow decay). The WR follows a mostly positive trend. We believe this is due to the limited action space in the 3v3 scenarios. It is easy for the agents to improve their WR when they choose their optimal actions more frequently.

The issue with this is that it leads to an incomplete search of the action space in Case 1 compared to Case 2. The prolonged action space search leads to a higher variance in Case 1. Since both Case 1 and Case 2 converge to 95% at 500,000 timesteps, we would have to select the variance metric to state that Case 2 outperforms Case 1.

B. Conclusion for Adaptive Average Epsilon

Successful and robust models are determined by WR and variance. We look for models that learn quickly over time and can consistently hold a high WR during testing. Both the 5 episode and 100 episode cases exhibit poor update rates. Switching epsilon every 5 episodes does not give the agents enough time to learn from the previous behaviors. Doing so causes the high variance to be carried throughout testing. For the 100 episode case, agents are basing their action selection on actions from 100 episodes ago. This does not give the agent accurate feedback on their behaviors. Of the three experiments, 20 episodes is the interval at which agents can best connect their actions to what was learned most recently. This allows a significantly lower variance in the 20 episode case after the 250,000 timestep mark than in the 5 and 100 episode cases.

C. Conclusion for Transfer Learning

The primary conclusion from the TL experiments was the importance of similarity between the combat types learned. Ranged combat units outperformed both the Zerglings and Zealots. This was due to the fact that for TL, the Marine agents were ranged based. When training against the Zerglings, the agents would evade to retain health for a higher score while attacking the Zerglings. This learned behavior did not transfer well. We believe this to be the reason that during the TL Zergling testing the agents did not show much increase in performance and essentially followed the testing curve of the Regular Marine agents.

While most of the results were intuitive, the performance improvement of the TL Zealots was surprising. Even though the TL agents were unable to win during training, they outperformed the TL Zerglings in terms of both WR and variance. This shows that the combat technique developed during training was more important than winning during training. We expected that the agents would learn how to attack and defeat one unit in order to get the kill bonus. However, the tactic learned during the Zealot and Hydralisk training for fighting stronger adversaries, locking into place and attacking for the duration of the episode, proved more beneficial than increasing the score in the Zergling training. This shows that in some cases it is better to focus on the behaviors rather than the scores.

The lack of ability to kill a single Zealot or Hydralisk could occur for a few reasons. One reason would be that the Marines are not engaging in team-based attacks on a single unit to finalize the kill. The other possibility is that the larger effective hit point totals of the Zealots (150 hit points) and Hydralisks (90 hit points) are more than the Marines can deplete before being defeated. Marines only have 40 hit points. On top of the lower health, Marines only deal 7 damage per second while Zealots deal 18.6 per second and Hydralisks deal 12 damage per second. In one second, a group of three Zealots can kill a Marine, while the group of Marines deals only 21 damage. Given that the average episode lasts 24 timesteps, the group of Marines would only be able to deal a maximum of 144 hit points worth of damage if all they did was attack. This number is actually less because the number of Marines decreases as the episode progresses. This is not enough damage to kill a single Zealot but might be enough to kill a Hydralisk. It is possible that if the TL Hydralisk agents explored the action space more thoroughly during training they would be able to defeat the Hydralisks. Further training would be needed for verification. This would require modification of the AAE algorithm to explore more when the WR is zero.

D. Conclusion for Epsilon Updates

Updating agents and epsilon more frequently results in more stable long term performance. Larger batches may increase the agents' ability to win during training, but this has a negative impact on agents during testing.

ACKNOWLEDGMENT

This work was sponsored by the Southwest Research Institute Advisory Committee for Research (ACR) under project R6010.

REFERENCES

[1] "StarCraft II Official Game Site." [Online]. Available: https://starcraft2.com/en-us/. [Accessed: 25-Jun-2020].
[2] "AlphaStar: Mastering the Real-Time Strategy Game StarCraft II | DeepMind." [Online]. Available: https://deepmind.com/blog/article/alphastar-mastering-real-time-strategy-game-starcraft-ii. [Accessed: 25-Jun-2020].
[3] O. Vinyals et al., "Grandmaster level in StarCraft II using multi-agent reinforcement learning," Nature, vol. 575, no. 7782, pp. 350-354, 2019.
[4] O. Vinyals et al., "StarCraft II: A New Challenge for Reinforcement Learning," 2017.
[5] M. Samvelyan et al., "The StarCraft Multi-Agent Challenge," 2019.
[6] M. Samvelyan et al., "The StarCraft Multi-Agent Challenge Extended Abstract," Proc. of the 18th Int. Conf. Auton. Agents Multiagent Syst. (AAMAS 2019), pp. 2186-2188, 2019.
[7] M. Tan, "Multi-Agent Reinforcement Learning: Independent vs. Cooperative Agents," Mach. Learn. Proc. 1993, pp. 330-337, 1993.
[8] J. N. Foerster, G. Farquhar, T. Afouras, N. Nardelli, and S. Whiteson, "Counterfactual multi-agent policy gradients," 32nd AAAI Conf. Artif. Intell. AAAI 2018, pp. 2974-2982, 2018.
[9] T. Rashid, M. Samvelyan, C. S. De Witt, G. Farquhar, J. Foerster, and S. Whiteson, "QMIX: Monotonic value function factorisation for deep multi-agent reinforcement learning," 35th Int. Conf. Mach. Learn. ICML 2018, vol. 10, pp. 6846-6859, 2018.
[10] P. Hernandez-Leal, B. Kartal, and M. E. Taylor, "A Survey and Critique of Multiagent Deep Reinforcement Learning," pp. 1-49, 2018.
[11] S. Albrecht and P. Stone, "Multiagent Learning Foundations and Recent Trends," 2017.

