Generalization of Heterogeneous Multi-Robot Policies
1 Introduction
Heterogeneous robot teams have the potential to address complex real-world challenges that arise
in a wide range of domains, such as precision agriculture, defense, warehouse automation, supply
chain optimization, and environmental monitoring. However, a key hurdle in realizing such potential
is the challenge of ensuring effective communication, coordination, and control.
Existing approaches to address the challenges of multi-robot systems can be crudely categorized into
two groups. First, classical approaches use well-understood controllers with simple local interaction
rules, giving rise to complex global emergent behavior [1]. Indeed, such controllers have proven ex-
traordinarily useful in diverse domains. However, designing them requires both significant technical
expertise and considerable domain knowledge. Second, recent learning-based approaches alleviate
the need for expertise and domain knowledge by leveraging advances in learning frameworks and
computational resources. Learning has been successful in many domains, such as video games [2],
autonomous driving [3], disaster response [4], and manufacturing [5].
However, learning approaches are not without their fair share of limitations. First, the majority of ex-
isting methods focus on homogeneous teams and, as such, cannot handle heterogeneous multi-robot
teams. Second, and more importantly, even existing methods designed for heterogeneous teams are
often solely concerned with the challenge of learning to coordinate a given team, entirely ignoring
the challenge of generalizing the learned behavior to new teams. Given the potentially prohibitive
cost of retraining coordination policies after deployment in real-world settings, it is imperative that
multi-robot policies generalize learned behaviors to inevitable changes to the team.
In this work, we focus on the challenge of generalizing multi-robot policies to team changes. In
particular, we focus on generalization of trained policies to teams of new compositions, sizes, and
Figure 1: We investigate the role of capability awareness and communication in generalizing decentralized
heterogeneous multi-robot coordination policies to teams of new composition, size, and robots.
robots that are not encountered in training (see Fig. 1). We refer to such generalization as adaptive
teaming, wherein the learned policy can readily handle changes to the team without additional
training. To this end, we need policies that can reason about how a group of diverse robots can
collectively achieve a common goal, without assigning rigid specialized roles to individual robots.
We investigate the role of robot capabilities in generalization to new teams. Our key insight is that
adaptive teaming requires the understanding of how a team’s diverse capabilities combine to dic-
tate the behavior of individual robots. For instance, consider an autonomous heterogeneous team
responding to multiple concurrent wildfires. Effective coordination in such situations requires rea-
soning about the opportunities and constraints introduced by the robots’ individual and relative ca-
pabilities, such as speed, water capacity, and battery range. In general, robots must learn how their
individual capabilities relate to those of others to determine their role in achieving shared objectives.
We develop a policy architecture that can explicitly reason about robot capabilities when selecting
actions. Our architecture has four key properties: i) capability awareness: our design enables
actions to be conditioned on continuous capabilities in addition to observations, ii) capability
communication: we leverage graph networks to learn how robots must communicate their capabilities,
iii) robot-agnostic: we utilize parameter sharing and learn policies that are not tied to individual
robots, and iv) decentralized: our trained policies can be deployed in a decentralized manner.
Together, these four properties provide the potential to generalize to new teams. One can view this
design as an extension of agent identification techniques [6] to the metric space of capabilities. As
such, capabilities do not merely serve to distinguish between agents during training to enable be-
havioral heterogeneity [7], but also to provide a more general means to encode how individual and
relative capabilities influence collective behavior.
We evaluate the utility of capability awareness and communication in two heterogeneous multi-robot
tasks in sim and real. Our results reveal that both awareness and communication of capabilities can
enable adaptive teaming, outperforming policies that lack either one or both of these features in terms
of average returns and task-specific metrics. Further, capability-based policies achieve better zero-
shot generalization than existing agent identification-based techniques, while ensuring comparable
performance on the training set.
2 Related Work
Learning for multi-robot teams: Recent advances in deep learning are providing promising ap-
proaches that circumvent the challenges associated with classical control of multi-robot systems.
Multi-agent reinforcement learning (MARL), in particular, has been shown to be capable of solving
a wide variety of tasks, including simple tasks in the multi-agent particle environments (MPE) [8],
complex tasks under partial observability [9], coordinating an arbitrary number of agents in video
games [2], and effective predictive modeling of multi-agent systems [10]. These approaches are
driven by popular MARL algorithms like QMIX [11], MADDPG [8], and MAPPO [12] – nontrivial
extensions of their single agent counterparts DQN [13], DDPG [14], and PPO [15], respectively. We
adopt a PPO-based learning framework given its proven benefits despite its simplicity [12]. Cen-
tralized training, decentralized execution (CTDE) is a commonly used framework in which decen-
tralized agents learn to take actions based on local observations while a centralized critic provides
feedback based on global information [16, 17]. We use the CTDE paradigm as it lends itself natu-
rally to multi-robot teams since observation and communication are often restricted. However, it is
important to note that our approach is agnostic to the specific learning algorithm.
Learning for heterogeneous teams: Many MARL algorithms were originally designed for use in
homogeneous multi-agent teams. However, truly homogeneous multi-robot teams are rare, if only
because of manufacturing differences, wear and tear, or task requirements. Most real-world multi-
robot problems such as search & rescue, agriculture, and surveillance require a diverse set of ca-
pabilities aggregated from heterogeneous robots [18–20]. While many MARL approaches consider
heterogeneity, they either tend to focus on differences in behavior exhibited by physically identi-
cal robots [21], or identical behavior exhibited by physically-different robots [22, 23]. A common
strategy used to elicit heterogeneous behavior from shared models is referred to as agent identifi-
cation or behavioral typing, in which the agents’ observations are appended with an agent-specific
index [24, 25]. While these methods have been shown to be highly effective, recent investigations
have revealed issues with scalability [26], and robustness to observation noise [7]. While capability-
awareness is similar in spirit to existing identification-based techniques, it does not require assigning
indices to individual robots and can thus generalize to teams with new robots. Further, most existing
methods do not simultaneously handle teams with physical and behavioral differences. According
to a recent heterogeneous multi-robot taxonomy [7], our work falls under the category consisting
of physically-different robots that differ in behavior, but share the same objective. Two recent ap-
proaches belong to this same category [4, 7]. However, one is limited to a discrete set of robot
types [4] and the other learned decentralized robot-specific policies that cannot handle the addition
of new robots and might not generalize to new compositions [7].
Generalization: For applications to real-world multi-robot systems, it is essential to consider the
generalization capabilities of learned control policies. In our formulation, there are two axes of gen-
eralization in heterogeneous multi-robot teams: combinatorial generalization (new team sizes and
new compositions of the same robots) and individual capability generalization (new robots). Prior
works reliant on feed forward or recurrent networks tend to be limited to teams of static size [8, 27].
Combinatorial generalization for homogeneous teams can be achieved with graph network-based
policies [27, 28]. However, existing methods tend to struggle with generalization in the presence
of heterogeneity [29]. While methods that employ agent identification [24, 25] might be able to
achieve combinatorial generalization by reusing the IDs from training, it is unclear how they can
handle new robots. In stark contrast, capability-based policies are robot-agnostic and can take the
capabilities of the new robot as an input feature to determine its actions.
We model teams of N heterogeneous robots as a graph G = (V, E), where each node v_i ∈ V is a
robot, and each edge e_ij = (v_i, v_j) ∈ E is a communication link. We use z_i to denote the obser-
vations of the ith robot, which include its capabilities and its sensor readings of the environment.
We assume that the robots' heterogeneity can be captured by their capabilities. We represent the
capabilities of the ith robot by a real-valued vector c_i ∈ C ⊆ R^C_+, where C is the C-dimensional
space of all capabilities of the robots. An example of a multi-dimensional capability is a vector with
elements representing payload, speed, and sensing radius. When robot i does not possess the kth
capability, the kth element of c_i is set to zero.
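For concreteness, the following is a minimal sketch (illustrative only; the class and variable names are ours, not part of any released implementation) of how such a team graph and its capability vectors could be encoded, with absent capabilities zeroed out:

```python
import numpy as np

# An assumed capability space C with C = 3 dimensions: [payload, speed, sensing_radius].
CAPABILITY_DIMS = ["payload", "speed", "sensing_radius"]

class Robot:
    """A node v_i in the team graph, carrying a non-negative capability vector c_i."""
    def __init__(self, name, capabilities):
        self.name = name
        # Capabilities the robot does not possess are encoded as zeros, as described above.
        self.c = np.array([capabilities.get(k, 0.0) for k in CAPABILITY_DIMS])

# A three-robot heterogeneous team; edges are communication links (fully connected here).
team = [
    Robot("uav", {"speed": 1.0, "sensing_radius": 0.4}),             # no payload capability
    Robot("ugv", {"payload": 0.8, "speed": 0.3}),                    # no sensing capability
    Robot("amr", {"payload": 0.5, "speed": 0.6, "sensing_radius": 0.2}),
]
edges = [(i, j) for i in range(len(team)) for j in range(len(team)) if i != j]

# Stacking the capability vectors yields an N x C matrix that a shared policy can consume.
capability_matrix = np.stack([r.c for r in team])   # shape (3, 3)
```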
3.2 Problem Description
We are interested in learning a decentralized control policy that can i) effectively coordinate a team
of heterogeneous robots to achieve the task objectives, and ii) generalize readily to teams of both
novel compositions and novel robots that are not encountered during training. Our problem can
be viewed as a multi-agent reinforcement learning problem that can be formalized as a decentral-
ized partially-observable Markov Decision Process (Dec-POMDP) [30]. We expand on the Dec-
POMDP formulation to incorporate the capabilities of heterogeneous robots and arrive at the tuple
⟨D, S, {A_i}, {Z_i}, C, T, R, O⟩, where D is the set of N robots, S is the set of global states, {A_i}
is a set of action spaces across all robots, {Z_i} is a set of joint observations across all robots, C is
the multi-dimensional space of capabilities, R is the global reward function, and T and O are the
joint state transition and observation models, respectively. Our objective is to learn decentralized
action policies that control each robot to maximize the expected return E[ Σ_{t=0}^{T_h} r_t ] over the task
horizon T_h. The decentralized policy of the ith robot π(a_i | õ_i) defines the probability that Robot i
takes Action a_i given its effective observation õ_i. The effective observation õ_i of the ith robot is a
function of both its individual observation and that of others in its neighborhood.
We utilize parameter sharing to train a single action policy that is shared by all of the robots. Pa-
rameter sharing is known to improve learning efficiency by limiting the number of parameters that
must be learned. More importantly, parameter sharing is required for our problem so policies can
transfer to new robots without the need for training new policies or assigning robots to already
trained policies. Additionally, we believe that parameter sharing serves a secondary role in enabling
generalization. Sharing parameters enables the policy to learn
generalized coordination strategies that, when conditioned on robot capabilities and local observa-
tions, can be adapted to specific robots and contexts.
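To make the role of parameter sharing concrete, here is a minimal sketch (our illustration; the class name SharedPolicy and the layer sizes are assumptions, not the exact architecture used in this work) of one policy network applied independently to every robot, conditioned on that robot's local observation and capability vector:

```python
import torch
import torch.nn as nn

class SharedPolicy(nn.Module):
    """One set of parameters shared by all robots; behavior differs only through the inputs."""
    def __init__(self, obs_dim, cap_dim, n_actions, hidden=64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(obs_dim + cap_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, n_actions),
        )

    def forward(self, obs, cap):
        # Conditioning on capabilities lets the same parameters produce robot-specific
        # behavior and transfer to robots with unseen capability values at test time.
        logits = self.net(torch.cat([obs, cap], dim=-1))
        return torch.distributions.Categorical(logits=logits)

policy = SharedPolicy(obs_dim=10, cap_dim=2, n_actions=5)
team_obs = torch.randn(4, 10)                  # 4 robots, local observations
team_cap = torch.rand(4, 2)                    # 4 robots, capability vectors
actions = policy(team_obs, team_cap).sample()  # one action per robot, for any team size
```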
We employ a centralized training, decentralized execution (CTDE) paradigm to train the action
policy. We apply an actor-critic model, and train using proximal policy optimization (PPO). The
actor-critic model is composed of a decentralized actor network (i.e., shared action policy) that
maps robots' observations to control actions, and a centralized critic network [12], which estimates
the value of the team’s current state based on centralized information about the environment and
robots aggregated from individual observations. Finally, we trained our policies on multiple teams
until they converged, with the teams changing every 10 episodes to stabilize training.
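The sketch below illustrates the CTDE structure described above together with PPO's clipped surrogate objective; the module names, the mean-pooling aggregation, and the loss bookkeeping are our simplifications rather than the precise implementation.

```python
import torch
import torch.nn as nn

class CentralizedCritic(nn.Module):
    """Estimates the value of the team's state from information aggregated across robots."""
    def __init__(self, per_robot_dim, hidden=64):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(per_robot_dim, hidden), nn.ReLU(),
                                 nn.Linear(hidden, 1))

    def forward(self, per_robot_features):        # (N, per_robot_dim)
        # Mean pooling keeps the critic agnostic to team size N; the critic is only used in training.
        return self.net(per_robot_features.mean(dim=0))

def ppo_actor_loss(new_logp, old_logp, advantage, clip=0.2):
    # Standard PPO clipped surrogate. Advantages come from the centralized critic,
    # while the decentralized actor only ever consumes local (effective) observations.
    ratio = torch.exp(new_logp - old_logp)
    clipped = torch.clamp(ratio, 1.0 - clip, 1.0 + clip)
    return -torch.min(ratio * advantage, clipped * advantage).mean()
```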
4 Experimental Design
We conducted detailed experiments to evaluate how capability awareness and communication impact
generalization to: i) new team sizes and compositions, and ii) new robots with unseen capabilities.
Environments: We designed two heterogeneous multi-robot tasks for experimentation:
• Heterogeneous Material Transport (HMT): A team of robots with different material car-
rying capacities for lumber and concrete (denoted by c_i ∈ R^2 for the ith robot) must transport
materials from lumber and concrete depots to a construction site to fulfill a pre-specified quota
while minimizing over-provisioning. We implemented this environment in the Multi-Agent Particle
Environment (MPE) framework [32] and leverage the infrastructure of EPyMARL [33].
• Heterogeneous Sensor Network (HSN): A robot team must form a single fully-connected
sensor network while maximizing the collective coverage area. The ith robot's capability c_i ∈ R
corresponds to its sensing radius. We implemented this environment using the MARBLER [34]
framework which enables hardware experimentation in the Robotarium [35], a well-established
multi-robot test bed.
Policy architectures: In order to systematically examine the impact of capability awareness and
communication, we consider the following policy architectures:
• ID(MLP): Robot ID-based MLP
• ID(GNN): Robot ID-based GNN
• CA(MLP): Capability-aware MLP
• CA(GNN): Capability-aware GNN without communication of capabilities
• CA+CC(GNN): Capability-aware GNN with both awareness and communication of capabilities
The ID-based baselines stand in for SOTA approaches that employ behavioral typing to handle het-
erogeneous teams [24, 25] and, as such, let us question the need for capabilities. The MLP-based
baselines help us investigate the need for communication. Finally, CA(GNN) enables communication
of observations but does not communicate capability information.
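The essential difference between the ID-based and capability-aware inputs can be summarized by the following sketch (our illustration; the function names are hypothetical):

```python
import numpy as np

def id_based_input(obs, robot_index, n_train_robots):
    """ID(MLP)/ID(GNN): append a one-hot agent ID; undefined for robots unseen in training."""
    one_hot = np.zeros(n_train_robots)
    one_hot[robot_index] = 1.0
    return np.concatenate([obs, one_hot])

def capability_aware_input(obs, capability_vector):
    """CA(MLP)/CA(GNN)/CA+CC(GNN): append the continuous capability vector;
    a new robot simply supplies its own capabilities, so no re-indexing is needed."""
    return np.concatenate([obs, capability_vector])

obs = np.random.rand(8)
print(id_based_input(obs, robot_index=2, n_train_robots=4).shape)   # (12,)
print(capability_aware_input(obs, np.array([0.35])).shape)          # (9,), e.g., an HSN sensing radius
```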
Metrics: For both environments, we compare the above policies using Average Return: the average
joint reward received over the task horizon (higher is better). Additionally, we use environment-
specific metrics. In HMT, we terminate the episodes when the quotas for both materials are met.
Therefore, we consider Average Steps taken to meet the quota (lower is better). For HSN, we
consider Pairwise Overlap: sum of pairwise overlapping area of robots’ coverage areas (lower is
better).
Training: For each environment, we used five teams with four robots each during training. We
selected the training teams to ensure diverse compositions and degrees of heterogeneity. For HSN,
we sampled robots' sensing radii from the uniform distribution U(0.2, 0.6). For HMT, we sampled
robots' lumber and concrete carrying capacities from the uniform distribution U(0, 1.0). We also
assigned each robot a one-hot ID to train ID-based policies. We trained each policy with 3 random
seeds. We resampled robot teams every 10 episodes to stabilize training.
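As a sketch of this training setup (illustrative; the cycling scheme and function names are our simplification), teams can be pre-sampled from the stated distributions and swapped out every 10 episodes:

```python
import numpy as np

rng = np.random.default_rng(0)

def sample_hsn_robot():
    # Sensing radius drawn from U(0.2, 0.6), as in training.
    return {"sensing_radius": rng.uniform(0.2, 0.6)}

def sample_hmt_robot():
    # Lumber and concrete carrying capacities drawn from U(0, 1.0).
    return {"lumber_capacity": rng.uniform(0.0, 1.0),
            "concrete_capacity": rng.uniform(0.0, 1.0)}

def sample_team(sample_robot, team_size=4):
    return [sample_robot() for _ in range(team_size)]

training_teams = [sample_team(sample_hsn_robot) for _ in range(5)]   # five 4-robot teams
for episode in range(100):
    team = training_teams[(episode // 10) % len(training_teams)]     # switch team every 10 episodes
    # ... roll out one training episode with `team` ...
```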
(a) HMT: Average return (b) HMT: Average steps (c) HSN: Average return (d) HSN: Overlap area
Figure 2: When evaluated on teams seen during training, capability-aware policies performed comparably to
ID-based policies in terms of both average return (higher is better) and task-specific metrics (lower is better).
5 Results
Below, we report i) performance on the training team, ii) zero-shot generalization to new teams, and
iii) zero-shot generalization to new robots with unseen values of capabilities for each environment.
(a) 3 Robots (b) 4 Robots (c) 5 Robots
Figure 3: When generalizing to new team compositions and sizes in HMT, capability-based policies consistently
outperformed ID-based policies in terms of average steps taken to meet the quota (lower is better).
Additional results: We provide additional results for HMT by reporting more task-specific metrics
and evaluations on significantly larger teams in Appendix A. The results on task-specific metrics
further support the claim that capability-aware methods generalize better than ID-based methods.
We also find that these benefits extend to teams consisting of 8, 10, and 15 robots. Taken together, the
above results suggest that reasoning about capabilities (rather than assigned IDs) improves adaptive
teaming, likely due to the ability to map capabilities to implicit roles.
ID-based policies perform well on the training teams, but tend to generalize poorly to new team compositions and sizes. This is likely because ID-based
policies fail to reason about robot heterogeneity, and instead overfit the relationships between robot
IDs and behavior in the training set. Further, CA+CC(GNN) in particular consistently outperformed
all other policies across metrics and variations, suggesting that both capability awareness and com-
munication are necessary to enable generalization in HSN.
Zero-shot generalization to new robots: We evaluated the trained policies’ ability to generalize to
teams of different sizes which are composed of entirely new robots whose sensing radii are different
from those encountered in training. Similar to HMT, we cannot evaluate ID-based policies on teams
with new robots since there is no obvious way to assign IDs to the new robots. In Fig. 6, we report
the performance of all three capability-based policies in terms of average return. Both GNN-based
policies (CA(GNN) and CA+CC(GNN)) considerably outperform the CA(MLP) policy, underscoring
the importance of communication in generalization to teams with new robots. However, we also see
that communication of observations alone is insufficient, as evidenced by the fact that CA+CC(GNN)
(which communicates both observations and capabilities) consistently outperforms CA(GNN) (which
only communicates observations).
Real-robot demonstrations: We also deployed the trained policies of CA+CC(GNN), CA(MLP), and
CA(GNN) on the physical Robotarium (see Section B.1 for further details and snapshots). Overall,
we find that the benefits reported above extend to physical robot teams. We find that CA+CC(GNN)
and CA(GNN) policies generalize to physical robots and successfully build a sensor network while
minimizing sensing overlap for teams of 3 and 4 robots. The CA(MLP) policy resulted in signif-
icantly worse performance, where robots’ executed paths provoked significant engagement of the
Robotarium’s barrier certificates due to potential collisions.
Additional results: We provide additional results for HSN by reporting more task-specific metrics
in Appendix B. Much like the results for HMT, the results on task-specific metrics further support
the claim that capability-aware methods show superior adaptive teaming ability compared with ID-
based methods. The additional results also support our claim in this section that communication of
capabilities is essential for success on this task.
6 Limitations
While our framework could reason about many different capabilities simultaneously, our experi-
ments only involved variations in 1-D and 2-D capabilities. We also only consider generalization
to new values for capabilities; we do not consider generalization to new types of capabilities. Ad-
ditionally, our work only considers representations of robots' capabilities that we can quantify.
Handling implicit capabilities and communication thereof may benefit from additional meta-learning
mechanisms, uncovering a more general relationship between robots’ learned behaviors and capa-
bilities. Further, we do not consider high-level planning and task-allocation and rely solely on the
learning framework to perform implicit assignments to sub-tasks within the macro task. Future
work can investigate appropriate abstractions and interfaces for considering both learning-based
low-level policies and efficient algorithms for higher-level coordination. Lastly, we only consid-
(a) 3 Robots (b) 4 Robots (c) 5 Robots
Figure 6: When generalizing to teams comprised of new robots in HSN, combining awareness and communi-
cation of capabilities (CA+CC(GNN)) achieves higher average returns than baselines that are merely aware of
capabilities, irrespective of whether they communicate observations (CA(GNN)) or not (CA(MLP)).
ered fully-connected communication graphs in our evaluations for simplicity. While graph networks
are known to effectively share local observations for global state estimations in partially-connected
teams [27, 36], it is unclear if such ability will translate to the communication of capabilities.
7 Conclusion
We investigated the utility of awareness and communication of robot capabilities in the general-
ization of heterogeneous multi-robot policies to new teams. We developed a graph network-based
decentralized policy architecture based on parameter sharing that enables robots to reason about and
communicate their observations and capabilities to achieve adaptive teaming. Our detailed experi-
ments involving two heterogeneous multi-robot tasks unambiguously illustrate the importance and
the need for reasoning about capabilities as opposed to agent IDs.
References
[1] J. Cortés and M. Egerstedt. Coordinated control of multi-robot systems: A survey. SICE
Journal of Control, Measurement, and System Integration, 2017.
[4] E. Seraj, Z. Wang, R. Paleja, D. Martin, M. Sklar, A. Patel, and M. Gombolay. Learning
efficient diverse communication for cooperative heterogeneous teaming. In Proceedings of the
21st International Conference on Autonomous Agents and Multiagent Systems, 2022.
[5] Z. Wang, C. Liu, and M. Gombolay. Heterogeneous graph attention networks for scalable
multi-robot scheduling with temporospatial constraints. Autonomous Robots, 2022.
[6] C. Li, T. Wang, C. Wu, Q. Zhao, J. Yang, and C. Zhang. Celebrating Diversity in Shared
Multi-Agent Reinforcement Learning, 2021.
[8] R. Lowe, Y. Wu, A. Tamar, J. Harb, P. Abbeel, and I. Mordatch. Multi-agent actor-critic for
mixed cooperative-competitive environments. Neural Information Processing Systems (NIPS),
2017.
[9] S. Omidshafiei, J. Pazis, C. Amato, J. P. How, and J. Vian. Deep decentralized multi-task
multi-agent reinforcement learning under partial observability. In International Conference on
Machine Learning, 2017.
[10] Y. Hoshen. Vain: Attentional multi-agent predictive modeling. In Advances in Neural Infor-
mation Processing Systems, 2017.
[12] C. Yu, A. Velu, E. Vinitsky, Y. Wang, A. M. Bayen, and Y. Wu. The surprising effectiveness
of MAPPO in cooperative, multi-agent games. CoRR, 2021.
[14] T. P. Lillicrap, J. J. Hunt, A. Pritzel, N. Heess, T. Erez, Y. Tassa, D. Silver, and D. Wierstra.
Continuous control with deep reinforcement learning, 2019.
[15] J. Schulman, F. Wolski, P. Dhariwal, A. Radford, and O. Klimov. Proximal policy optimization
algorithms, 2017.
[16] V. Egorov and A. Shpilman. Scalable multi-agent model-based reinforcement learning, 2022.
[18] T. Dang, M. Tranzatto, S. Khattak, F. Mascarich, K. Alexis, and M. Hutter. Graph-based sub-
terranean exploration path planning using aerial and legged robots. Journal of Field Robotics,
2020.
[19] T. Dang, M. Tranzatto, S. Khattak, F. Mascarich, K. Alexis, and M. Hutter. Graph-based sub-
terranean exploration path planning using aerial and legged robots. Journal of Field Robotics,
2020.
[20] G. Gil, D. E. Casagrande, L. P. Cortés, and R. Verschae. Why the low adoption of robotics in the
farms? challenges for the establishment of commercial agricultural robots. Smart Agricultural
Technology, 2023.
[21] C. Li, T. Wang, C. Wu, Q. Zhao, J. Yang, and C. Zhang. Celebrating diversity in shared multi-
agent reinforcement learning. Advances in Neural Information Processing Systems, 2021.
[22] J. K. Terry, N. Grammel, S. Son, and B. Black. Parameter sharing for heterogeneous agents in
multi-agent reinforcement learning, 2022.
[23] C. Wakilpoor, P. J. Martin, C. Rebhuhn, and A. Vu. Heterogeneous multi-agent reinforcement
learning for unknown environment mapping. arXiv preprint arXiv:2010.02663, 2020.
[24] J. Foerster, I. A. Assael, N. De Freitas, and S. Whiteson. Learning to communicate with deep
multi-agent reinforcement learning. Advances in neural information processing systems, 2016.
[25] J. K. Gupta, M. Egorov, and M. Kochenderfer. Cooperative multi-agent control using deep
reinforcement learning. In Autonomous Agents and Multiagent Systems: AAMAS 2017 Work-
shops, Best Papers, São Paulo, Brazil, May 8-12, 2017, Revised Selected Papers 16. Springer,
2017.
[26] F. Christianos, G. Papoudakis, M. A. Rahman, and S. V. Albrecht. Scaling multi-agent rein-
forcement learning with selective parameter sharing. In International Conference on Machine
Learning. PMLR, 2021.
[27] A. Agarwal, S. Kumar, and K. Sycara. Learning transferable cooperative behavior in multi-
agent teams, 2019.
[28] Q. Li, W. Lin, Z. Liu, and A. Prorok. Message-aware graph attention networks for large-scale
multi-robot path planning, 2021.
[29] A. Mahajan, M. Samvelyan, T. Gupta, B. Ellis, M. Sun, T. Rocktäschel, and S. Whiteson.
Generalization in cooperative multi-agent systems, 2022.
[30] F. A. Oliehoek and C. Amato. A Concise Introduction to Decentralized POMDPs. Springer-
Briefs in Intelligent Systems, 2016.
[31] A. Khan, E. Tolstaya, A. Ribeiro, and V. Kumar. Graph policy gradients for large scale robot
control, 2019.
[32] I. Mordatch and P. Abbeel. Emergence of grounded compositional language in multi-agent
populations. CoRR, 2017. URL http://arxiv.org/abs/1703.04908.
[33] G. Papoudakis, F. Christianos, L. Schäfer, and S. V. Albrecht. Benchmarking multi-agent
deep reinforcement learning algorithms in cooperative tasks. In Proceedings of the Neural
Information Processing Systems Track on Datasets and Benchmarks (NeurIPS), 2021.
[34] R. Torbati, S. Lohiya, S. Singh, M. S. Nigam, and H. Ravichandar. Marbler: An open platform
for standardized evaluation of multi-robot reinforcement learning algorithms, 2023.
[35] S. Wilson, P. Glotfelter, L. Wang, S. Mayya, G. Notomista, M. Mote, and M. Egerstedt.
The robotarium: Globally impactful opportunities, challenges, and lessons learned in remote-
access, distributed control of multirobot systems. IEEE Control Systems Magazine, 2020.
[36] Q. Li, F. Gama, A. Ribeiro, and A. Prorok. Graph neural networks for decentralized multi-
robot path planning. CoRR, 2019.
[37] J. Schulman, F. Wolski, P. Dhariwal, A. Radford, and O. Klimov. Proximal Policy Optimization
Algorithms. 2017.
A Heterogeneous Material Transport (HMT) Additional Results
This section provides additional results for the HMT environment. Specifically, we provide additional
task-specific metrics for the generalization experiments, and new generalization results for robot
teams of significantly larger sizes (i.e. team sizes of 8, 10, and 15 robots).
The task-specific metrics defined below evaluate the rate at which each policy contributes to fulfilling
the total quota, and the individual quotas for lumber and concrete:
• % of episodes by which the total quota was filled.
• % of lumber quota remaining.
• % of concrete quota remaining.
For both generalization to new teams (see Fig. 7) and new robots (see Fig. 8), capability-aware
methods filled the total quota in fewer episode steps compared to the ID-based methods, while gen-
erally better preventing over-provisioning of both lumber and concrete. This result further supports
our claim that capability awareness improves generalization performance. Observing Fig. 9, we find
that these benefits of capability-based policies extend to considerably larger team sizes.
Figure 7: Policies with capability awareness outperform agent ID methods at meeting the material quota with a
minimal number of steps when generalizing to new team compositions. Capability-awareness methods without
communication of capabilities (i.e. CA(MLP) & CA(GNN)) outperform methods with capability communication
for this task.
(a) 3 robots (b) 4 robots (c) 5 robots
Figure 8: Capability-aware policies without communication of capabilities (i.e., CA(MLP) and CA(GNN)) outperformed
the policy with communication of capabilities (CA+CC(GNN)) on task-specific metrics when generalizing to new
robots with capabilities not seen during training.
• Pairwise overlap: The sum of pairwise overlap in coverage area among robots (lower is better).
• % of fully connected teams (by episode step): Percentage of teams that managed to form a sensor
network that connects all of the robots (higher is better).
In Fig. 12, we report the training performance (i.e., training teams only) for each policy. The train-
ing curves suggest that capability-aware and ID-based methods perform comparably during learning.
Notably, the communication models ID(GNN) and CA+CC(GNN) converge faster and achieve higher
overall returns than other methods. This result suggests that communication between robots signifi-
cantly assists in learning collaborative behavior.
Capability-aware policies again demonstrate superior performance when generalizing to new teams
(see Fig. 10) and new robots (see Fig. 11) on task-specific metrics, highlighting the importance of
capability awareness for generalization to robot teams with new robots, team sizes, and team com-
positions. Notably, CA+CC(GNN) results in significantly lower pairwise overlap for robot teams of
size 3 and 4 robots, and marginally lower pairwise overlap for robot teams of size 5, compared to
ID-based methods. This suggests the communication-enabled policy effectively learns to commu-
nicate capabilities and that such communication of capabilities is essential for generalization in this
task.
In this section, we present visual representations derived from actual robot demonstrations of the
trained capability-aware communication policy. Videos of the robot demonstrations can be found
at: https://sites.google.com/view/cap-comm.
Figure 9: Experiments evaluating the generalization of policies to significantly larger teams (size 8, 10, and 15)
compared to the training team size (size 4). Policies with capability awareness outperform agent ID methods
at meeting the total material quota with a minimal number of steps when generalizing to new, large team com-
positions. Capability-awareness methods without communication of capabilities (i.e. CA(MLP) & CA(GNN))
outperform methods with capability communication for this task.
(a) 3 Robots (b) 4 Robots (c) 5 Robots
Figure 10: Capability-based policy architectures consistently outperform ID-based baselines both in terms of
average return and task performance metrics when generalizing to new team compositions and sizes. Further,
combining awareness and communication of capabilities results in the best generalization performance.
Figure 11: The policy architecture that combines awareness and communication of capabilities (CA+CC(GNN)) out-
performs both other baselines (CA(GNN) in terms of % fully connected and CA(MLP) in terms of average return
and pairwise overlap) when generalizing to teams comprised of new robots.
Figure 12: For the HSN environment, capability-aware policies perform comparably to ID-based policies in
terms of training efficiency (first) and in terms of task-specific metrics when evaluating the trained policy on
the training set.
Figure 13: Demonstrations of CA+CC(GNN) policy deployed to real robot teams in the Robotarium testbed for
the HSN task. See https://sites.google.com/view/cap-comm for videos of deployment to the Robotar-
ium.
C Environment Specifications
C.1 Heterogeneous Material Transport
This section describes additional details about the heterogeneous material transport (HMT) environ-
ment.
Figure 14: In the Heterogeneous Material Transport (HMT) environment, each agent’s color is a mixture of blue
and red, which represents its bias towards its carrying capacity for either lumber (red) or concrete (blue). The
objective of the team is to fill the lumber and concrete quota at the construction site without delivering excess.
The lumber and concrete quotas for the HMT environment are each randomly initialized to an integer
value between 0.5 × n_agents and 2.0 × n_agents.
Robots have five available actions: they can move left, right, up, down, or stop. At the beginning
of each episode, all of the robots begin at a random position in the construction site zone (see Fig.
14). The observation space for a robot is the combination of the robot's state and the environment's
state: specifically, it is composed of the robot's position, velocity, the amount of lumber and concrete
it is carrying, its distance to each depot, the total lumber quota, the total concrete quota, the total
amount of lumber delivered, and the total amount of concrete delivered. Robots' observations do
not contain state information about other robots. Finally, we append to each robot's observation its
unique ID for the ID baseline methods, or its maximum lumber and concrete carrying capacities for
the CA methods.
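A rough sketch of this observation layout (the field names and the exact ordering are our assumptions for illustration) is:

```python
import numpy as np

def hmt_observation(robot, env, method="CA"):
    """Assemble one robot's HMT observation from its own state and the environment state."""
    obs = np.array([
        *robot["position"],                 # (x, y)
        *robot["velocity"],                 # (vx, vy)
        robot["lumber_load"],               # lumber currently carried
        robot["concrete_load"],             # concrete currently carried
        robot["dist_to_lumber_depot"],
        robot["dist_to_concrete_depot"],
        env["lumber_quota"],
        env["concrete_quota"],
        env["lumber_delivered"],
        env["concrete_delivered"],
    ])
    if method == "CA":
        # Capability-aware methods append the maximum carrying capacities.
        extra = [robot["lumber_capacity"], robot["concrete_capacity"]]
    else:
        # ID-based baselines append a one-hot robot ID instead.
        extra = robot["one_hot_id"]
    return np.concatenate([obs, extra])
```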
The reward received by each robot in HMT is the team total, computed by summing the individual
rewards of all robots. Robots are rewarded when they make progress toward meeting the lumber and
concrete quotas and are penalized when they exceed them. If a robot enters the lumber depot or
concrete depot, the robot is empty (i.e., not loaded with any lumber or concrete), and the quota has
yet to be filled, then the robot receives a pickup reward of 0.25. If the robot is loaded with material,
then the robot is rewarded or penalized when it drops off the material at the construction site.
Specifically, when a robot delivers a material and the quota for that material has yet to be filled,
then the robot receives a positive dropoff reward of 0.75. However, if the robot delivers a material
and goes over the quota, then the robot incurs a surplus penalty proportional to the amount of
surplus: −0.10 × surplus material. Finally, robots receive a small time penalty of −0.005 for each
episode step in which the total quota is not filled; this encourages the robots to finish the task as
quickly as possible.
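The individual reward terms above can be summarized in the following sketch (our encoding of the events; the constants are those stated in the text):

```python
PICKUP_REWARD = 0.25
DROPOFF_REWARD = 0.75
SURPLUS_PENALTY = -0.10   # per unit of surplus material delivered
TIME_PENALTY = -0.005     # per step while the total quota remains unfilled

def hmt_individual_reward(event, total_quota_filled, material_quota_filled, surplus=0.0):
    """Reward for one robot at one step. `event` is 'pickup_while_empty', 'dropoff', or None."""
    r = 0.0
    if event == "pickup_while_empty" and not material_quota_filled:
        r += PICKUP_REWARD
    elif event == "dropoff":
        if not material_quota_filled:
            r += DROPOFF_REWARD
        if surplus > 0.0:
            r += SURPLUS_PENALTY * surplus
    if not total_quota_filled:
        r += TIME_PENALTY
    return r

def hmt_shared_reward(individual_rewards):
    # Every robot receives the team total: the sum of all individual rewards.
    return sum(individual_rewards)
```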
C.2 Heterogeneous Sensor Network
This section describes additional details about the heterogeneous sensor network (HSN) environment.
Robots have five available actions: they can move left, right, up, down, or stop. After selecting an
action, the robots move in their selected direction for slightly less than a second before selecting
a new action. The robots start at random locations at least 30 cm apart from each other, move at
∼21 cm/second, and utilize barrier certificates [35] that take effect at 17 cm to ensure they do
not collide when running in the physical Robotarium.
The reward from the heterogeneous sensor network environment is a shared reward. We describe
the reward below:
D(i, j) = ||p(i) − p(j)|| − (c_i + c_j)

r(i, j) = −0.9 |D(i, j)| + 0.05,  if D(i, j) < 0
r(i, j) = −1.1 |D(i, j)| − 0.05,  otherwise

R = Σ_{i<j} r(i, j)

where i and j are robots, p(i) is the position of robot i, c_i is the (capability) sensing radius of robot
i, and R is the cumulative team reward shared by all the robots. The above reward is designed
to reward the team when robots connect their sensing regions while minimizing overlap so as to
maximize the total sensing area.
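Transcribed into code, the shared reward above can be computed as follows (a direct sketch of the equations; the variable names are ours):

```python
import itertools
import numpy as np

def hsn_team_reward(positions, radii):
    """R = sum over robot pairs i < j of r(i, j), with D(i, j) = ||p(i) - p(j)|| - (c_i + c_j)."""
    R = 0.0
    for i, j in itertools.combinations(range(len(positions)), 2):
        D = np.linalg.norm(positions[i] - positions[j]) - (radii[i] + radii[j])
        if D < 0:   # sensing regions overlap: small bonus, minus a penalty on excess overlap
            R += -0.9 * abs(D) + 0.05
        else:       # sensing regions disconnected: penalize the gap
            R += -1.1 * abs(D) - 0.05
    return R

positions = np.array([[0.0, 0.0], [0.5, 0.0], [0.2, 0.4]])
radii = np.array([0.30, 0.25, 0.40])
print(hsn_team_reward(positions, radii))
```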
D.2 Heterogeneous Sensor Network
To design these training teams, we first binned robot capabilities into small, medium, and large
sensing radii with bin ranges [0.2m, 0.33m], [0.33m, 0.46m], and [0.46m, 0.60m] respectively. We
then generated all possible four-robot combinations (with replacement) of small, medium, and
large robots, for a total of 15 teams. Each robot assigned to one of the bins had its capability
(i.e., sensing radius) uniformly sampled within the corresponding bin range. From these 15 teams,
we hand-selected 5 sufficiently diverse teams to be the training teams. The resulting training
teams are given in Table 2.
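A sketch of this team-generation procedure (illustrative; the hand-selection step is omitted) is:

```python
import itertools
import numpy as np

rng = np.random.default_rng(0)
BINS = {"small": (0.20, 0.33), "medium": (0.33, 0.46), "large": (0.46, 0.60)}  # sensing radius ranges (m)

# All 4-robot combinations (with replacement) of the three size classes: C(3 + 4 - 1, 4) = 15 teams.
team_templates = list(itertools.combinations_with_replacement(BINS.keys(), 4))
assert len(team_templates) == 15

def realize_team(template):
    # Each robot's sensing radius is sampled uniformly within its bin's range.
    return [rng.uniform(*BINS[size]) for size in template]

candidate_teams = [realize_team(t) for t in team_templates]
# Five sufficiently diverse teams were then hand-selected from these candidates (Table 2).
```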
The evaluation robot teams were sampled differently for the different experimental evaluations per-
formed. In the training evaluation experiment, the teams were the same as the training teams in
Table 2. Teams for the generalization experiment to new team compositions, but not new robots,
were sampled randomly from the 20 robots from the training teams (with replacement). Each robot
from the pool of 20 robots was sampled with equal probability. In contrast, teams for the gener-
alization experiment to new robots were generated by randomly sampling new robots, where each
robot's sensing radius was sampled independently from the uniform distribution U(0.2 m, 0.6 m). For
the two generalization experiments, 100 total teams were sampled. Each algorithm was evaluated
on the same set of sampled teams by fixing the pseudo random number generator’s seed.
We first focus on the training curves and subsequent evaluations conducted on the training set,
without considering generalization. The goal of this experiment is to ensure that introducing capabilities
does not negatively impact training. The learning curves in terms of average return are shown in
Fig. 12. All models achieved convergence within 20 million environment steps, with ID(GNN) and
CA+CC(GNN) exhibiting both the fastest convergence and the highest returns. These results suggest
that communication of individual robot features, whether based on IDs or capabilities, improves
learning efficiency and performance for heterogeneous coordination.
E Policy Details
E.1 Graph Neural Networks
We employ a graph convolutional network (GCN) architecture for the decentralized policy π_i, which
enables robots to communicate for coordination according to the robot communication graph G.
A GCN is composed of L layers of graph convolutions, each followed by a non-linearity. In this work,
we consider a single graph convolution layer, which, applied to node i, is given by

h_i^{(l)} = σ( Σ_{j ∈ N(i) ∪ {i}} φ_θ( h_j^{(l−1)} ) )

where h_j^{(l−1)} ∈ R^F is the node feature of node j, N(i) = {j | (v_i, v_j) ∈ E} is the set of nodes
connected to i, φ_θ is a node feature transformation function with parameters θ, σ is a non-linearity
(e.g., ReLU), and h_i^{(l)} ∈ R^G is the output node feature.
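A minimal sketch of this layer (our implementation for illustration, with φ_θ realized as a 2-layer MLP as described below) is:

```python
import torch
import torch.nn as nn

class GraphConvLayer(nn.Module):
    """h_i^{(l)} = sigma( sum_{j in N(i) ∪ {i}} phi_theta(h_j^{(l-1)}) )."""
    def __init__(self, in_dim, out_dim, hidden=64):
        super().__init__()
        self.phi = nn.Sequential(nn.Linear(in_dim, hidden), nn.ReLU(),
                                 nn.Linear(hidden, out_dim))

    def forward(self, h, adj):
        # h: (N, F) node features; adj: (N, N) adjacency matrix with self-loops included.
        transformed = self.phi(h)           # phi_theta applied node-wise
        aggregated = adj @ transformed      # sum over N(i) ∪ {i}
        return torch.relu(aggregated)       # sigma

N, F = 4, 16
h = torch.randn(N, F)
adj = torch.ones(N, N)                      # fully connected communication graph with self-loops
print(GraphConvLayer(F, 32)(h, adj).shape)  # torch.Size([4, 32])
```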
Each of the graph neural networks in the GNN-based policy architectures evaluated is composed of
an input encoder network, a message passing network, and an action output network. The encoder
network is a 2-layer MLP with hidden dimensions of size 64. For the message passing network, we use
a single graph convolution layer composed of 2-layer MLPs with ReLU non-linear activations. The
action output network is also a 2-layer MLP with hidden dimensions of size 64. The learning
rate is 0.005.
ID(MLP)/CA(MLP): The MLP architectures are composed of a 4-layer multi-layer perceptron with 64
hidden units at each layer and ReLU non-linearities.
CA(GNN)/CA+CC(GNN)/ID(GNN): Each of the graph neural networks is composed of an input "encoder"
network, a message passing network, and an action output network. The encoder network and the
action output network are multi-layer perceptrons with hidden layers of size 64, ReLU non-linear
activations, and with one and two hidden layers respectively. The message passing network is a
graph convolution layer wherein the transformation of node features (i.e., observations) is done
by a 2-layer MLP with ReLU non-linear activations and 64-dimensional hidden units, followed by a
summation of the transformed neighboring node features. The output node features are concatenated
with the output feature from the encoder network. This concatenated feature is the input to the
action output network. The CA(GNN) network does not communicate the robot's capabilities through
the graph convolution layer. Rather, the capabilities are appended to the output of the encoder
network and to the output node features of the graph convolution layer just before the action
network. Thus, the action network is the only part of this model that is conditioned on robot
capabilities.
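The following sketch pulls the pieces above together and shows where capabilities enter for CA+CC(GNN) versus CA(GNN); the exact layer widths and wiring are simplified assumptions rather than a faithful reproduction of the trained models.

```python
import torch
import torch.nn as nn

class CapabilityGNNPolicy(nn.Module):
    """Encoder -> graph convolution -> action-output composition.
    communicate_capabilities=True approximates CA+CC(GNN): capabilities enter before message passing.
    communicate_capabilities=False approximates CA(GNN): capabilities are appended only at the end."""
    def __init__(self, obs_dim, cap_dim, n_actions, hidden=64, communicate_capabilities=True):
        super().__init__()
        self.communicate_capabilities = communicate_capabilities
        enc_in = obs_dim + (cap_dim if communicate_capabilities else 0)
        self.encoder = nn.Sequential(nn.Linear(enc_in, hidden), nn.ReLU())
        self.phi = nn.Sequential(nn.Linear(hidden, hidden), nn.ReLU(), nn.Linear(hidden, hidden))
        act_in = 2 * hidden + (0 if communicate_capabilities else 2 * cap_dim)
        self.action_net = nn.Sequential(nn.Linear(act_in, hidden), nn.ReLU(),
                                        nn.Linear(hidden, hidden), nn.ReLU(),
                                        nn.Linear(hidden, n_actions))

    def forward(self, obs, cap, adj):
        x = torch.cat([obs, cap], dim=-1) if self.communicate_capabilities else obs
        enc = self.encoder(x)                    # per-robot encoding
        msg = torch.relu(adj @ self.phi(enc))    # summed messages from neighbors (and self)
        if not self.communicate_capabilities:
            # CA(GNN): capabilities are appended to the encoder output and the graph layer
            # output just before the action network, so only it is conditioned on them.
            enc = torch.cat([enc, cap], dim=-1)
            msg = torch.cat([msg, cap], dim=-1)
        return self.action_net(torch.cat([enc, msg], dim=-1))   # per-robot action logits
```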
We detail the hyperparameters used to train each of the policies using proximal policy optimization
(PPO) [37] in Table 3.
Hyperparameter Value
Action Selection (Training) soft action selection
Action Selection (Testing) hard action selection
Critic Network Update Interval 200 steps
Learning Rate 0.0005
Entropy Coefficient 0.01
Epochs 4
Clip 0.2
Q Function Steps 5
Buffer Length 64
Number of training steps 40 × 106 (HMT), 20 × 106 (HSN)
Table 3: Hyperparameters used to train each of the policies with PPO.