@Lancetnik (Member) commented Aug 20, 2025

AG2 Remote Agents

Introduction

The current AG2 multi-agent model is based on the assumption that all agents run in the same process.

Basic multi-agent communication looks like this:

from autogen import GroupChat, GroupChatManager, LLMConfig, ConversableAgent

llm_config = LLMConfig({"model": "gpt-4o-mini"})

planner_message = "Create lesson plans for a 4th grade class."

planner = ConversableAgent(
    name="planner_agent",
    system_message=planner_message,
    description="Creates lesson plans",
    llm_config=llm_config,
)

reviewer = ConversableAgent(
    name="reviewer_agent",
    system_message="Review lesson plans against 4th grade curriculum. Provide max 3 changes.",
    description="Reviews lesson plans",
    llm_config=llm_config,
)

teacher = ConversableAgent(
    name="teacher_agent",
    system_message="Choose topics and work with planner and reviewer. Say DONE! when finished.",
    llm_config=llm_config,
)

groupchat = GroupChat(
  agents=[teacher, planner, reviewer],
  speaker_selection_method="auto",
)

# Create manager
# At each turn, the manager will check if the message contains DONE! and end the chat if so
# Otherwise, it will select the next appropriate agent using its LLM
manager = GroupChatManager(
    name="group_manager",
    groupchat=groupchat,
    llm_config=llm_config,
    is_termination_msg=lambda x: "DONE!" in (x.get("content", "") or "").upper(),
)

# Start the conversation
response = teacher.run(
    recipient=manager,
    message="Let's teach the kids about the solar system.",
    summary_method="reflection_with_llm",
)

response.process()

So, "agents" in our case are just Python objects with methods. This works well enough for simple cases, but it does not scale well. Therefore, the main problems of the current model are:

  • agents are referred to by direct Python object references
  • communication between agents is done using direct method calls

Remote Agents' motivation

The suggested Remote Agents concept is pretty close to microservices. Here are some features we can get:

  • Use different local setups for different agents (OS, Python version, libraries, local tools, etc.)
  • Separate release cycles for each agent. A remote agent can be updated without updating the entire system
  • Reuse existing agents in different multi-agent systems

Introducing the Remote Agent concept would let us split agent development across different teams. This would improve the quality of individual agents, grow the feature set, and make multi-agent systems less coupled. Like microservices, remote agents allow scaling an AG2 multi-agent system to enterprise level.

What we would like to achieve (Goals)

An inter-agent communication model that

  • does not rely on direct Python references and method calls
  • can be split across process (and machine) boundaries
  • lends itself to the elegant implementation of current AG2 flow patterns
  • can incorporate existing AG2 agents
  • allows users to connect non-AG2 agents (even non-LLM agents) to the system

Remote Agents API

RemoteAgent

First of all, I suggest introducing a RemoteAgent class that is compatible with the regular ConversableAgent. This way, we can migrate existing AG2 applications to Remote Agents without any other changes. For example, the application above could look like this:

from autogen import GroupChat, GroupChatManager, LLMConfig, ConversableAgent
from autogen.remote import RemoteAgent

llm_config = LLMConfig({"model": "gpt-4o-mini"})

# planner is a local agent for current process (as an example)
planner_message = "Create lesson plans for a 4th grade class."

planner = ConversableAgent(
    name="planner_agent",
    system_message=planner_message,
    description="Creates lesson plans",
    llm_config=llm_config,
)

# reviewer is a remote agent, running in another process
reviewer = RemoteAgent("reviewer_agent", ...)

# teacher is a remote agent, running in another process
teacher = RemoteAgent("teacher_agent", ...)

# regular group chat
groupchat = GroupChat(
  agents=[teacher, planner, reviewer],  # mix of local and remote agents
  speaker_selection_method="auto",
)

# group chat manager is a local agent for current process
manager = GroupChatManager(
    name="group_manager",
    groupchat=groupchat,
    llm_config=llm_config,
    is_termination_msg=lambda x: "DONE!" in (x.get("content", "") or "").upper(),
)

response = teacher.run(
    recipient=manager,
    message="Let's teach the kids about the solar system.",
    summary_method="reflection_with_llm",
)

response.process()

This way, the reviewer and teacher are moved to separate processes. They can:

  • be developed by different teams
  • be updated independently
  • be reused in other multi-agent systems.
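
The constructor arguments elided above (`...`) are intentionally left undesigned. As a strawman only, a RemoteAgent might be addressed by its registered name plus a transport-specific location; the url and timeout parameters below are assumptions, not a settled API:

from autogen.remote import RemoteAgent

# Hypothetical constructor shape for illustration only; the real
# RemoteAgent API (transport, auth, timeouts) is still to be designed.
reviewer = RemoteAgent(
    "reviewer_agent",  # must match the name registered in the remote runtime
    url="http://reviewer.internal:8000",  # assumed HTTP address of the remote AgentBus
    timeout=30,  # assumed per-request timeout, in seconds
)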

AgentBus runtime

Then, we should make our ConversableAgent able to process remote calls. So, we need to introduce the AgentBus, which would be responsible for:

  • receiving events from other agents
  • selecting agents to be activated by an event
  • performing actions
  • emitting events to other agents

Suggested API:

from autogen import LLMConfig, ConversableAgent
from autogen.remote import AgentBus

llm_config = LLMConfig({"model": "gpt-4o-mini"})

# regular agent we already have
reviewer = ConversableAgent(
    name="reviewer_agent",
    system_message="Review lesson plans against 4th grade curriculum. Provide max 3 changes.",
    description="Reviews lesson plans",
    llm_config=llm_config,
)

# runtime should consume all calls from `RemoteAgent("reviewer_agent", ...)`
# process them and return results back to the conversation
runtime = AgentBus(
    agents=[reviewer],
)

if __name__ == "__main__":
    runtime.run()

The open question is which methods such a runtime should implement; see the protocol methods list at the end of this proposal.

Conversation State

Currently, ConversableAgent knows about conversation state - it has a message history and knows about each agent's messages and actions. Distributed conversation state should also be available for each participant (remote or local). Here we have two options:

  • each agent has a local copy of the conversation state, which is simple, but requires a way to synchronize the state between agents
  • real distributed conversation state, which is more complex, but allows for more accurate conversation states

I think we should start with the first option. Local context synchronization is not a problem, as each agent sends a message to all chat participants so they can update their local states in real-time. The question is about message order guarantees, but most of the time there is a single speaker, and in other cases, agents can talk concurrently without any order guarantee.

These reasons are enough to choose the first option. A real distributed conversation state is a more complex solution that requires a lot of refactoring, so we will provide it as another interface later. Finally, users will be able to choose between using a local copy or a distributed implementation.
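
To make option one concrete, here is a minimal sketch of the local-copy model: every participant keeps an append-only history per chat and applies each broadcast message to it. All names here are illustrative, not a proposed API.

from collections import defaultdict
from typing import Any

class LocalConversationState:
    """Per-process copy of each conversation this agent participates in."""

    def __init__(self) -> None:
        # chat_id -> ordered list of messages seen so far
        self._histories: dict[str, list[dict[str, Any]]] = defaultdict(list)

    def apply(self, chat_id: str, message: dict[str, Any]) -> None:
        # Called for every send_message event broadcast to the chat.
        self._histories[chat_id].append(message)

    def history(self, chat_id: str) -> list[dict[str, Any]]:
        # Passed to the LLM as conversation context when this agent speaks.
        return list(self._histories[chat_id])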

Sync or async communication

By design, multi-agent communication is a sequence of messages with no guarantee of order. One agent can send a message to another, the second can send a reply to the third, and the third can say something to all of them again, then another random agent can send another message, and so on.

Therefore, RemoteAgent().send_message doesn't require a response, so our distributed runtime must be asynchronous by design.

Asynchronous HTTP Communication

Asynchronous communication can be implemented over plain HTTP. RemoteAgent().send_message becomes an HTTP request that is acknowledged with an empty 200 response; to deliver an actual reply, the remote agent makes another HTTP request (a webhook) to the relevant agent. However, we cannot update the local state of the other chat participants this way. Additionally, RemoteAgent needs to know all chat participants and their addresses in order to send a message to an arbitrary agent instead of simply responding to the original request. Therefore, our solution could be:

  • Notify all remote agents of all participants: when a chat starts, the conversation initiator sends a start_chat event to all remote agents. This event includes:
    • the chat ID
    • the list of participants (and their addresses)
    • the initial conversation context
  • Each agent then sends every message as an HTTP request to all participants so they can update their local state.

Pros:

  • Easy to implement
  • Easy to understand
  • Easily integrates with other systems

Cons:

  • Each agent should know all participants and their addresses
  • Each agent needs to send a message to all participants to update local state
  • Each agent listens for message events from all participants
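
To illustrate the HTTP flow above, here is a minimal sketch of the two endpoints a remote agent would expose. FastAPI is used purely as an example; the proposal does not prescribe an HTTP framework, and the payload shapes are assumptions based on the start_chat description above.

from fastapi import FastAPI
from pydantic import BaseModel

app = FastAPI()

class StartChatEvent(BaseModel):
    chat_id: str
    participants: dict[str, str]  # agent name -> HTTP address
    context: dict

class MessageEvent(BaseModel):
    chat_id: str
    sender: str
    content: str

@app.post("/start_chat")
def start_chat(event: StartChatEvent) -> None:
    # Remember the participants and seed the local conversation state.
    ...  # store event.participants and event.context under event.chat_id

@app.post("/message")
def message(event: MessageEvent) -> None:
    # Acknowledge with an empty 200 immediately; any reply this agent
    # produces is sent later as its own /message request to all participants.
    ...  # append the event to the local history for event.chat_id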

In the future, when we have a truly distributed conversation state, this HTTP implementation would effectively turn into a hand-written message broker with a self-written queue holding the state. For this reason, I suggest using an off-the-shelf message broker such as RabbitMQ, Kafka, NATS, or Redis instead.

Asynchronous Broker-based Communication

We can use message brokers like RabbitMQ, Kafka, NATS, or Redis to implement asynchronous communication. This way, RemoteAgent().send_message() just publishes a message to the broker in the conversation's context, and an agent's response is simply another such message published by that agent. With this approach, we don't need to know the participants' addresses: we publish to a chat_id topic, and all agents listen to it. However, we still need to notify all participants of the chat start so they can subscribe to the topic. So, the implementation might look like this:

  • At the start of a chat, the conversation initiator sends a start_chat event to all remote agents. This event includes:
    • the chat ID
    • the initial conversation context
  • Each agent then publishes its messages on the chat_id topic to share them with all participants

I suggest using NATS at the start. It is simple, lightweight, and easy to integrate with other systems. Benefits of using it include:

  • non-persistent messages (by default), which suits our use case, because missed messages aren't a problem
  • native, zero-cost request-reply when a synchronous call is needed
  • pattern-based subjects make flexible routing easy, for example:
    • [chat_id].* - listen for all messages in a chat
    • *.[agent_id] - listen for direct messages from a specific agent
    • [chat_id].[agent_id] - listen only for messages from the specified agent in the chat

As a client, I suggest using FastStream. We have a lot of experience with this framework; it's simple, lightweight, and easy to integrate with other systems. It has native observability features (OTEL, metrics, healthchecks), and it supports multiple message brokers, so we can switch later if needed (or implement alternative runtimes users can choose).
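
To sketch what a broker-based runtime could look like with FastStream and NATS: the subject layout ([chat_id].[agent_id]) follows the routing patterns above, while the payload shape and agent names are assumptions.

from faststream import FastStream
from faststream.nats import NatsBroker

broker = NatsBroker("nats://localhost:4222")
app = FastStream(broker)

@broker.subscriber("chat123.*")  # every message in chat "chat123", any sender
async def on_chat_event(event: dict) -> None:
    # Update the local conversation state with the incoming message and,
    # if this agent is the chosen speaker, publish its reply on the same
    # chat subject so every participant receives it.
    # (A real implementation would skip events published by this agent itself.)
    await broker.publish(
        {"sender": "reviewer_agent", "content": "..."},
        subject="chat123.reviewer_agent",
    )

Such an app is started with the FastStream CLI (faststream run module:app) next to the agent process.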

Stateful or stateless

By design, conversations are stateful. However, a specific agent can choose to be stateless. REST-API-like agents can be stateless, as they don't need to be aware of the conversation state in order to respond to a request. They simply consume a single request and provide a response to the conversation. Stateful agents, on the other hand, need to know the entire history of the conversation in order to make a decision.

I suggest allowing users to choose which agents should be stateful and which should be stateless.

The suggested API:

from autogen import LLMConfig, ConversableAgent
from autogen.remote import AgentBus, AgentWrapper

llm_config = LLMConfig({"model": "gpt-4o-mini"})

# regular agents we already have
agent1 = ConversableAgent(name="agent1", llm_config=llm_config)
agent2 = ConversableAgent(name="agent2", llm_config=llm_config)
agent3 = ConversableAgent(name="agent3", llm_config=llm_config)

runtime = AgentBus(
    agents=[
        agent3,  # stateful by default
        AgentWrapper(agent1, stateful=True),
        AgentWrapper(agent2, stateful=False),
    ],
)

AgentWrapper implementation details:

from __future__ import annotations

from collections.abc import Iterable

from autogen import ConversableAgent


class AgentBus:
    def __init__(self, agents: Iterable[AgentWrapper | ConversableAgent]):
        self.agents = [AgentWrapper.ensure_wrapper(agent) for agent in agents]


class AgentWrapper:
    def __init__(self, agent: ConversableAgent, stateful: bool = True):
        self.agent = agent
        self.stateful = stateful

    @classmethod
    def ensure_wrapper(cls, agent: ConversableAgent | AgentWrapper) -> AgentWrapper:
        if isinstance(agent, AgentWrapper):
            return agent
        return AgentWrapper(agent)

Stateful agents emulate a local conversation driven by incoming message requests. For this reason, a stateful agent needs a local mock copy of each chat participant, and the start_chat event must contain information about all participants so the agent can build its local copy of the conversation state. Stateful remote agents are therefore more difficult to implement, so it is recommended to start with stateless ones.

Migration from stateless defaults to stateful can be done by introducing AgentWrapper and deprecating the stateless default; then we can change the default behavior to stateful.

Conversation Distributed Context

https://docs.ag2.ai/latest/docs/user-guide/advanced-concepts/orchestration/group-chat/context-variables/

AG2 supports context variables that can be passed to conversations. Such variables should definitely be passed to remote agents. I suggest passing them as part of the start_chat event. The final information about the chat would look like this:

{
  "chat_id": "123",
  "participants": ["agent1", "agent2", "agent3"],
  "context": {
    "variable1": "value1",
    "variable2": "value2"
  }
}

This message should be sent to all participants, and they should update their local contexts accordingly. Likewise, any update to the context variables (if that's a valid case) should be broadcast to all participants. For this reason, we need to add an update_context event to the protocol, which every participant in the chat should listen for.

{
  "context": {
    "variable1": "value1",
    "variable2": "value2"
  }
}
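
On the receiving side, applying an update_context event can be a shallow, last-writer-wins merge into the agent's local copy, consistent with the local-state model above. The handler name and storage here are illustrative.

# chat_id -> local copy of the chat's context variables
local_contexts: dict[str, dict] = {}

def on_update_context(chat_id: str, event: dict) -> None:
    # Shallow merge: incoming values overwrite local ones.
    local_contexts.setdefault(chat_id, {}).update(event["context"])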

Conversation Manager

The conversation manager is a special agent responsible for managing the conversation. It sends a dedicated event to determine the next speaker, so we need to add this new event to the protocol.

choose_next_speaker event:

{
  "next_speaker": "agent1"
}

The conversation manager runs on the chat initiator's side. It listens to all incoming messages and sends the choose_next_speaker event to all participants. The selected agent processes this event and sends a message to the conversation; all other participants ignore it.

Such synchronous, initiator-driven coordination is required because a RemoteAgent is a REST-like service that knows nothing about the conversation except its state. Conversation management rules must therefore be defined by the initiator at conversation startup, and the conversation manager must live on the chat initiator's side.

In the future, start_chat will have information about conversation management rules. This will allow agents to choose the next speaker themselves, but the first implementation could be synchronous.
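
On each participant, handling choose_next_speaker reduces to a name check; everything else is the agent's regular reply logic. The helper names below (local_history, publish_message) are illustrative, not an existing API.

def on_choose_next_speaker(agent, chat_id: str, event: dict) -> None:
    if event["next_speaker"] != agent.name:
        return  # not our turn: ignore the event
    history = agent.local_history(chat_id)  # local copy of the chat state
    reply = agent.generate_reply(history)   # regular ConversableAgent logic
    agent.publish_message(chat_id, reply)   # emitted as a send_message event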

Tool calling

MCP Tools and functions can be declared on either side of a remote conversation.

Chat initiator side:

from autogen import GroupChat, GroupChatManager, LLMConfig, ConversableAgent
from autogen.remote import RemoteAgent

llm_config = LLMConfig({"model": "gpt-4o-mini"})

planner_message = "Create lesson plans for a 4th grade class."

planner = ConversableAgent(
    name="planner_agent",
    system_message=planner_message,
    description="Creates lesson plans",
    llm_config=llm_config,
    tools=[...],  # planner tools
)

reviewer = RemoteAgent("reviewer_agent", ...)

groupchat = GroupChat(
  agents=[planner, reviewer],
  speaker_selection_method="auto",
)

manager = GroupChatManager(
    name="group_manager",
    groupchat=groupchat,
    llm_config=llm_config,
    is_termination_msg=lambda x: "DONE!" in (x.get("content", "") or "").upper(),
)

response = planner.run(
    recipient=manager,
    message="Let's teach the kids about the solar system.",
    summary_method="reflection_with_llm",
)

response.process()

Remote agent side:

from autogen import LLMConfig, ConversableAgent
from autogen.remote import AgentBus

llm_config = LLMConfig({"model": "gpt-4o-mini"})

reviewer = ConversableAgent(
    name="reviewer_agent",
    system_message="Review lesson plans against 4th grade curriculum. Provide max 3 changes.",
    description="Reviews lesson plans",
    llm_config=llm_config,
    tools=[...]  # reviewer tools
)

runtime = AgentBus(agents=[reviewer])

I think an agent should only be able to use its own tools, so we don't need to add a new event to the protocol.

Human in the Loop

Human on the edge

A multi-agent conversation should be able to handle a human in the loop. This logic is closely related to the conversation manager and the chat initiator, so I suggest adding a new protocol event called ask_human_input, which an agent sends to the chat before waiting for a human response from the conversation initiator's side.

Human in the middle

In some cases, agents may require human input for their actions beyond what the conversation design provides. For example, a client initiates a distributed conversation, and a remote agent needs administrator approval for an action. The conversation's end user is not the one who should approve or reject that action, so this is not a case for the ask_human_input event.

So, we need a mechanism to handle these cases. I suggest adding an additional hook to AgentBus, like:

...
runtime = AgentBus(
  agents=[reviewer],
  ask_user_input=lambda: ...,
)

This hook should process the user input and return the result to the agent. The same mechanism can also fall back to the ask_human_input event when necessary. Hooks make it possible to implement different input channels, such as a messenger, email, or SMS: you call some code, wait for the user's response, and return it as the function's result.
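
For example, a console-based approval hook might look like the sketch below. The hook signature (a single prompt string, synchronous return) is an assumption; the final AgentBus API may pass richer context.

def console_approval(prompt: str) -> str:
    # Block until an administrator answers on the runtime's console.
    return input(f"{prompt} [approve/reject]: ")

runtime = AgentBus(
    agents=[reviewer],
    ask_user_input=console_approval,
)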

RemoteAgent inaccessibility

Conversation correctness strongly depends on each agent's availability. So, I suggest adding a dedicated protocol event called ping. The conversation manager sends the ping event, with a timeout, to all remote participants to check their availability. If an agent becomes unavailable, the conversation manager notifies the other participants with a mark_dead event. If the agent becomes reachable again for any participant, that participant notifies the other chat members with a mark_alive event.

Additionally, we could add special markers to the RemoteAgent API to allow users to take action when an agent goes offline:

agent = RemoteAgent(
  "agent1",
  # on_dead=InaccessibilityPolicy.STOP_CONVERSATION(),
  # on_dead=InaccessibilityPolicy.IGNORE(),
  on_dead=InaccessibilityPolicy.RESTART(
    wait_for_restart=True,
    restart_timeout=10,
    restart_interval=1,
    restart_max_attempts=3,
    if_failed_to_restart=InaccessibilityPolicy.STOP_CONVERSATION,
  ),
)

If a dead agent is ignored, the conversation manager should respect this information in the choose_next_speaker decision.
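
A sketch of the availability loop on the conversation manager's side; send_ping, broadcast, and chat_is_running are assumed helpers, not an existing API.

import time

def watch_agents(manager, agents: list[str], interval: float = 5.0, timeout: float = 2.0) -> None:
    alive = {name: True for name in agents}
    while manager.chat_is_running():
        for name in agents:
            ok = manager.send_ping(name, timeout=timeout)  # the ping protocol event
            if ok != alive[name]:
                alive[name] = ok
                event = "mark_alive" if ok else "mark_dead"
                manager.broadcast(event, {"agent_id": name})
        time.sleep(interval)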

Also, we should correctly handle the release of a new agent version while a conversation is ongoing.

Observability

All remote agents should support all observability features, including:

  • OpenTelemetry metrics
  • health checks (at protocol and service levels)
  • logs (bound to the conversation)

Most of this tooling is already available in the underlying runtime (the HTTP server or FastStream), so we can reuse it.

Messages serialization

Some remote agents may require a specific incoming message format or use a strict message structure for their responses. We can respect these requirements at the protocol level and automatically retry incorrectly formatted remote requests. To notify chat participants about these requirements, I suggest adding a new event to the protocol called i_am. This event should be sent to the chat immediately after the start_chat event by each participant, providing all participants with information about an agent's requirements and features so they can adjust their behavior accordingly.

Authentication

Authentication is a very important requirement for distributed systems. Remote agents should support authentication at runtime. We can adopt Basic / Digest authentication for HTTP and integrate it with Keycloak or implement tokens ourselves. In broker cases, we can delegate authentication to the message broker. This part doesn't differ from regular microservice authentication.
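
For the HTTP runtime, a first cut can reuse standard HTTP auth directly on the client; the sketch below uses httpx's built-in Basic auth support. Whether (and how) RemoteAgent exposes auth configuration is an open design question.

import httpx

client = httpx.Client(auth=("agent_user", "s3cret"))  # Basic auth
response = client.post(
    "http://reviewer.internal:8000/message",  # assumed remote AgentBus address
    json={"chat_id": "123", "sender": "teacher_agent", "content": "..."},
)
response.raise_for_status()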

Interoperability with non-AG2 agents

We should be able to connect non-AG2 agents as long as they implement the AG2 RemoteAgent protocol. The AG2 framework should provide suitable interfaces and pre-built helpers so users can write their own protocol implementations for non-AG2 agents. This feature is not needed at the start, so it lands late in the plan below (P4).

Implementation plan

P0: First implementation

The simplest implementation could be a plain HTTP runtime with stateless remote agents that can only respond to questions. They do not maintain any conversation state.

A conversation then becomes an HTTP request/response to a stateless agent instead of the local calls we have now. The chat manager on the initiator's side decides who speaks next, calls that agent, and so on.

This is a simple but effective implementation. It also lets us prototype the RemoteAgent availability mechanisms.

  • Implement HTTPAgentBus
  • stateless remote agents (no conversation state; they just answer requests from the conversation)
  • ? Remote Agents availability

P1: Conversation State support

In this phase, we should make RemoteAgents stateful.

  1. The chat initiator (with the Conversation Manager) sends a start_chat event to all remote participants.
  2. RemoteAgent creates a local mock copy of the conversation state for all participants
  3. Each message sent to the Conversation Manager is broadcast to all participants so they can update their local states.

Also, at this stage, I suggest implementing human-in-the-loop and context variable support.

  • stateful agents
  • human in the loop
  • context variables

Therefore, all Remote Agents become full-featured, with conversation state support. At this stage, we can also provide an alternative AgentBus implementation based on a message broker to avoid broadcasting messages from the Conversation Manager to all participants.

P2: Remote Agents features

Here we should support distributed features like:

  • authentication
  • Remote Agents availability
  • human in the middle

It makes our distributed systems safe and reliable.

P3: Real Distribution

Here we should go beyond local copies of the real state and make it truly distributed.

So, any remote agent should be able to decide on the next speaker on its own. This allows us to make the protocol more flexible and reliable, and we don't need a chat initiator anymore.

Also, we should make the conversation state truly distributed, which avoids any inconsistency between participants and makes communication much more reliable.

  • distributed Conversation Manager
  • distributed Conversation State

P4: additional features

  • messages serialization
  • observability
  • interoperability with non AG2 agents
  • ? distributed tools

Current codebase problems

  1. Chat startup doesn't notify chat participants, so we can't hook it to send a notification to all remote agents.
  2. ConversableAgent.send has no information about the current chat identifier, so we can't bind a published message to a specific conversation.
  3. ConversableAgent is strongly bound to a specific conversation and can't be reused across conversations.

Protocol methods

  • start_chat (to all participants)
    {
      "chat_id": "123",
      "participants": ["agent1", "agent2", "agent3"],
      "context": {
        "variable1": "value1",
        "variable2": "value2"
      }
    }
  • stop_chat (to chat)
  • update_context (to chat)
    {
      "context": {
        "variable1": "value1",
        "variable2": "value2"
      }
    }
  • ask_human_input (to chat)
  • ping (to specific agent)
  • mark_dead (to chat)
    {
      "agent_id": "agent1"
    }
  • mark_alive (to chat)
    {
      "agent_id": "agent1"
    }
  • send_message (to chat)
    • message
    • sender id
  • choose_next_speaker (to chat)
    • next speaker id
  • i_am (to chat or as answer)
    • agent description
    • request expectations
    • response format

@Lancetnik Lancetnik marked this pull request as draft August 20, 2025 15:20
@Lancetnik Lancetnik changed the title feat: RemoteAgent refactor: polish ConversableAgent logic Aug 20, 2025

@Lancetnik Lancetnik changed the title refactor: polish ConversableAgent logic feat: RemoteAgents Aug 24, 2025
@marklysze (Collaborator) commented:

Thanks @Lancetnik! I like the progressive approach.

I have a few questions:

  1. Is agent_id the name of the agent or a unique identifier (if name, we need to be careful of nested chats that have agents with the same name)

  2. What is the lifecycle of a remote agent?

  3. What are the limitations of an agent that is stateless?

  4. What is the benefit of having a stateless agent?

  5. Conversation Manager with run_group_chat:

  • There are context conditions, llm conditions, and after work conditions that run after an agent creates a reply to determine the next agent or terminate (and upcoming option to run a function) - where will these be run - on the Conversation Manager or on the remote agent? I assume the Conversation Manager, so we need to make sure that the context variables are updated first as some of these rely on the context variables to be evaluated.
  • Tool calls will run a function, and that function can return a string as its “message” or a ReplyResult, which contains a message, updated context variables, and the next target (an agent or terminate) - if this is handled by the Conversation Manager, we need to be able to send this back.
  • Going to the next agent: Transition Targets determine the next agent or termination - apart from the TerminateTarget, these should resolve to an agent so I think that’s okay - but we will allow developers to build their own targets so may need to ensure these resolve to agents. One of the members on our team is considering a FunctionTarget that will run a Python function and then return a ReplyResult, this function may need to run on the agent itself. Target info and FunctionTarget PR
  6. Dependency injection for a tool’s function’s parameters - as the agent is executing the tool I assume this is okay, but the agent will need to be able to populate the values of these injected parameters, so it will need access to them. Dependency Injection

  7. RemoteAgent inaccessibility: We need to be mindful that some transitions (OnContextCondition) may depend on an agent being available (e.g. tech_agent.handoffs.set_after_work(AgentTarget(that_agent)))

@qingyun-wu (Collaborator) commented Aug 25, 2025

Thanks Nikita! This reads like a very neat proposal as a start! A couple of questions:

How does the AgentBus runtime work with different conversation patterns in AG2, including:
a. Different variants of GroupChat
b. Patterns beyond GroupChat, e.g., sequential chat

@Lancetnik (Member, Author) replied:

Thanks Nikita! This reads like a very neat proposal as a start! A couple of questions:

  1. How does the AgentBus runtime work with different conversation patterns in AG2, including
    a. Different variants of GroupChat
    b. Patterns beyond GroupChat, e.g., sequential chat

Well, the current implementation is designed around the idea of keeping all ConversationManager strategies local and giving the manager the ability to call remote agents to support distributed conversations. Therefore, all next-speaker-selection logic (group chat, sequential chat, etc.) is implemented on the client side. The AgentBus runtime is currently just a stateful HTTP server that stores chat history (to pass it as conversation context to the LLM) and responds to remote calls.

@Lancetnik Lancetnik marked this pull request as ready for review October 17, 2025 20:06
@Lancetnik (Member, Author) commented:

@claude


claude bot commented Oct 17, 2025

Claude finished @Lancetnik's task.


PR Review: RemoteAgents Implementation (PR #2055)

Executive Summary

This is an excellent and comprehensive PR that introduces distributed agent communication to AG2 through both HTTP and A2A protocol support. The implementation demonstrates strong architectural design, extensive test coverage, and thorough documentation. With 4,457 additions across 64 files, this represents a significant feature addition that enables true distributed multi-agent systems.

Recommendation: Approve with Minor Suggestions

The code is production-ready, but I've identified some areas for improvement in error handling, resource management, and edge case coverage.


🎯 Strengths

1. Excellent Architecture

  • Clean separation of concerns: autogen.remote (HTTP) and autogen.a2a (A2A protocol) are well-isolated
  • Protocol-agnostic design: RemoteService interface enables future transport implementations
  • Backward compatibility: Seamless integration with existing ConversableAgent via reply function replacement
  • Proper abstraction: Factory patterns (ClientFactory, HttpxClientFactory) for testability

2. Comprehensive Implementation

  • Full A2A protocol support: Streaming and polling modes with fallback
  • Stateful conversation management: Context variables and message history synchronization
  • Retry mechanisms: Configurable retry policies with reconnection support
  • Tool execution: Both local and client-side tools properly handled

3. Strong Test Coverage

  • 392 test cases for message parsing/serialization
  • Integration tests for group chat, sequential chat, and two-agent conversations
  • Parametrized tests for edge cases
  • Mock-based testing with TestAgent helper

4. Quality Documentation

  • Comprehensive user guides in website/docs/user-guide/a2a/
  • Clear API documentation with docstrings
  • Usage examples for both server and client
  • Interoperability documentation (Pydantic AI)

🔍 Issues & Recommendations

Critical Issues

1. Infinite Polling Loop Risk - autogen/a2a/client.py:184-203

Issue: The polling loop has no timeout mechanism and could run indefinitely:

while connection_attemps < self._max_reconnects:
    while True:  # ⚠️ No timeout!
        try:
            task = await client.get_task(TaskQueryParams(id=started_task.id))
        except httpx.ConnectError as e:
            # reconnection logic
        else:
            if _is_task_completed(task):
                return response_message_from_a2a_artifacts(task.artifacts)
            await asyncio.sleep(self._polling_interval)

Impact: Tasks that never complete will cause the client to poll forever, consuming resources.

Recommendation: Add a configurable timeout:

max_polling_time = 300  # 5 minutes default
start_time = time.time()

while connection_attemps < self._max_reconnects:
    while True:
        if time.time() - start_time > max_polling_time:
            raise A2aClientError(f"Task {started_task.id} timed out after {max_polling_time}s")
        # ... rest of logic

2. Task ID Loss Recovery Loop - autogen/remote/agent.py:76-103

Issue: If the server repeatedly loses the task (404 response), the client will recreate it indefinitely:

task_id: Any = None
with self._httpx_client_factory.make_sync() as client:
    while True:  # ⚠️ No limit on task recreation attempts
        with retry_policy:
            if task_id is None:
                task_id = self._process_create_remote_task_response(...)

Impact: Could lead to infinite loop if server has persistent issues.

Recommendation: Add max task recreation attempts:

max_task_recreations = 3
task_recreations = 0

while True:
    with retry_policy:
        if task_id is None:
            if task_recreations >= max_task_recreations:
                raise RemoteAgentError(f"Failed to create task after {max_task_recreations} attempts")
            task_id = self._process_create_remote_task_response(...)
            task_recreations += 1

High Priority

3. Exponential Backoff Missing - autogen/a2a/client.py:203

Issue: Fixed polling interval doesn't adapt to server load:

await asyncio.sleep(self._polling_interval)  # Always 0.5s

Recommendation: Implement exponential backoff:

backoff_factor = 1.5
max_interval = 10.0
current_interval = self._polling_interval

while True:
    # ... polling logic ...
    await asyncio.sleep(min(current_interval, max_interval))
    current_interval *= backoff_factor

4. Error Message Information Disclosure - autogen/a2a/client.py:147,165,180,272

Issue: Full model dumps in error messages could expose sensitive information:

raise A2aClientError(f"Failed to connect to the agent: {pformat(self.__agent_card.model_dump())}")

Recommendation: Sanitize error messages:

raise A2aClientError(f"Failed to connect to agent '{self.name}' at {self.__agent_card.url}")

5. Resource Cleanup - autogen/a2a/client.py:109-119

Issue: Client creation pattern may not guarantee cleanup on exceptions:

self._client_config.httpx_client = self._httpx_client_factory()
async with self._client_config.httpx_client:
    agent_client = A2AClientFactory(self._client_config).create(self.__agent_card)

Recommendation: Use try-finally for cleanup guarantee:

self._client_config.httpx_client = self._httpx_client_factory()
try:
    async with self._client_config.httpx_client:
        # ... logic ...
finally:
    self._client_config.httpx_client = None

Medium Priority

6. Missing Input Validation - autogen/a2a/client.py:88-107

Issue: No validation that messages list is non-empty before API call.

Recommendation: Add validation:

if not messages:
    raise ValueError("Cannot generate reply with empty message list")

7. KeyError Risk - autogen/remote/agent.py:72,123

Issue: Direct dictionary access without checking key existence:

messages = self._oai_messages[sender]  # Could raise KeyError

Recommendation: Use .get() with default:

messages = self._oai_messages.get(sender, [])
if not messages:
    raise ValueError(f"No message history found for sender {sender}")

8. Retry Module Untested - autogen/remote/retry.py

Issue: The entire retry module (102 lines) has zero test coverage.

Recommendation: Add tests for:

  • Successful retry after N attempts
  • Exhausting retry count
  • Async vs sync behavior differences
  • Sleep timing verification

Minor Issues

9. TODO Comments Without Tracking - autogen/remote/agent_service.py:37,104,123

Issue: Multiple TODOs without GitHub issues:

  • Line 37: "catch ask user input event"
  • Line 104: "inject ChatContext to tool"
  • Line 123: "catch handoffs"

Recommendation: Create tracking issues and reference them in comments.

10. Inconsistent Naming - autogen/a2a/httpx_client_factory.py:130

Issue: Function named like a class (PascalCase):

def EmptyClientFactory() -> MockClient:  # Should be snake_case

Recommendation: Rename to create_empty_client() or make it an actual class.


🔒 Security Analysis

✅ Good Practices

  1. No hardcoded credentials
  2. SSL verification enabled by default in httpx clients
  3. Proper exception handling prevents information leakage in most cases
  4. Type validation with Pydantic models

⚠️ Recommendations

  1. Agent Card Validation: The _get_agent_card method fetches remote configs. Add validation to prevent malicious configurations
  2. Rate Limiting: Consider adding configurable rate limits to prevent abuse
  3. Timeout Configuration: Expose timeout settings in user-facing APIs
  4. Authentication: While mentioned in PR description as P2, consider adding basic auth support before production use

📊 Test Coverage Assessment

Well Tested ✅

  • Message serialization/deserialization (392 test cases)
  • Round-trip conversions
  • Mock client interactions
  • Agent service functionality
  • Group chat integration

Needs More Tests ⚠️

  1. Retry module - 0% coverage
  2. Timeout scenarios - No tests for polling timeouts
  3. Connection failure recovery - Limited edge case coverage
  4. Concurrent requests - No tests for multiple simultaneous remote calls
  5. Context variable updates - Limited coverage of distributed context sync
  6. A2A streaming edge cases - Connection loss during streaming

Recommendation: Add integration test suite:

# test/a2a/test_integration.py
@pytest.mark.asyncio
async def test_polling_timeout():
    """Test that polling stops after timeout"""
    
@pytest.mark.asyncio  
async def test_task_recreation_limit():
    """Test that task recreation has a limit"""
    
@pytest.mark.asyncio
async def test_concurrent_remote_agents():
    """Test multiple agents calling same remote agent"""

🚀 Performance Considerations

Current Implementation

  1. Fixed polling interval: Could be inefficient for both fast and slow tasks
  2. Client recreation: New httpx client on each request (line 109) - consider connection pooling
  3. Message history size: No limit on conversation history - could cause memory issues

Suggestions

  1. Connection pooling: Reuse httpx clients across requests
  2. Adaptive polling: Implement exponential backoff as noted above
  3. History limits: Add configurable max message count for remote agents
  4. Webhook support: Consider webhook-based notifications to avoid polling overhead

📝 Documentation Quality

Strengths ✅

  • Clear architecture explanation in PR description
  • User-facing docs for both client and server
  • Good examples with code snippets
  • Interoperability documentation

Suggestions

  1. Error handling guide: Add examples of handling remote agent failures
  2. Performance tuning: Document polling intervals, timeouts, retry policies
  3. Migration guide: Help users migrate existing local agents to remote
  4. Security best practices: Document authentication, SSL, rate limiting

🎯 Comparison with Previous Reviews

The previous Claude reviews (Oct 15) identified similar issues:

  • Polling without backoff - still present (issue #3 above)
  • Missing timeout handling - still present (issue #1 above)
  • Task ID regeneration - still present (issue #2 above)
  • ⚠️ Some concerns addressed: error messages improved, but still verbose

This indicates these are known issues that should be prioritized for resolution.


🔧 Action Items

Must Fix Before Merge

  1. ⚠️ Add timeout to polling loop (issue #1 above)
  2. ⚠️ Add max task recreation limit (issue #2 above)
  3. ⚠️ Sanitize error messages (issue #4 above)

Should Fix

  1. Implement exponential backoff (issue #3 above)
  2. Add retry module tests (issue #8 above)
  3. Fix resource cleanup pattern (issue #5 above)
  4. Add input validation (issue #6 above)

Nice to Have

  1. Create GitHub issues for TODOs (issue #9 above)
  2. Add integration test suite
  3. Document performance tuning
  4. Add security best practices guide

✅ Conclusion

This PR represents exceptional work that brings distributed agent capabilities to AG2. The implementation is well-architected, thoroughly tested (1,058 total test lines), and comprehensively documented. The main concerns are around robustness (timeouts, infinite loops) and operational safety (error messages, resource cleanup).

Final Score: 9/10

The critical issues are straightforward fixes that don't require architectural changes. Once addressed, this will be a production-ready feature that significantly expands AG2's capabilities.

Outstanding work, @Lancetnik! 🎉 This feature will enable exciting new use cases for distributed multi-agent systems.


Branch: feat/remote-agents

@Lancetnik (Member, Author) commented:

@marklysze seems like everything is fine now. I'm just waiting for your review.

@marklysze (Collaborator) left a comment:

Amazing work @Lancetnik!

@marklysze marklysze enabled auto-merge October 21, 2025 20:14
@marklysze marklysze disabled auto-merge October 21, 2025 20:34
@marklysze marklysze enabled auto-merge October 21, 2025 20:34
@marklysze marklysze added this pull request to the merge queue Oct 21, 2025
Merged via the queue into main with commit fa2be6f Oct 21, 2025
20 of 21 checks passed
@marklysze marklysze deleted the feat/remote-agents branch October 21, 2025 20:45

codecov bot commented Oct 21, 2025

Codecov Report

❌ Patch coverage is 75.84541% with 200 lines in your changes missing coverage. Please review.

| Files with missing lines | Patch % | Lines |
| --- | --- | --- |
| autogen/a2a/client.py | 58.90% | 51 Missing and 9 partials ⚠️ |
| autogen/remote/agent.py | 51.19% | 36 Missing and 5 partials ⚠️ |
| autogen/remote/agent_service.py | 68.67% | 12 Missing and 14 partials ⚠️ |
| autogen/remote/retry.py | 50.00% | 26 Missing ⚠️ |
| autogen/agentchat/conversable_agent.py | 86.76% | 4 Missing and 5 partials ⚠️ |
| autogen/agentchat/groupchat.py | 14.28% | 4 Missing and 2 partials ⚠️ |
| autogen/remote/runtime.py | 83.78% | 3 Missing and 3 partials ⚠️ |
| autogen/a2a/agent_executor.py | 84.84% | 4 Missing and 1 partial ⚠️ |
| autogen/a2a/httpx_client_factory.py | 82.60% | 2 Missing and 2 partials ⚠️ |
| autogen/a2a/utils.py | 94.11% | 2 Missing and 2 partials ⚠️ |

... and 8 more
| Files with missing lines | Coverage | Δ |
| --- | --- | --- |
| autogen/a2a/errors.py | 100.00% <100.00%> | (ø) |
| autogen/agentchat/__init__.py | 100.00% <100.00%> | (ø) |
| autogen/agentchat/agent.py | 100.00% <ø> | (ø) |
| autogen/interop/pydantic_ai/pydantic_ai.py | 86.56% <100.00%> | (+44.77%) ⬆️ |
| autogen/llm_config/client.py | 100.00% <100.00%> | (ø) |
| autogen/oai/bedrock.py | 27.34% <ø> | (+0.62%) ⬆️ |
| autogen/oai/client.py | 55.35% <100.00%> | (-3.98%) ⬇️ |
| autogen/oai/cohere.py | 34.80% <ø> | (+0.01%) ⬆️ |
| autogen/remote/errors.py | 100.00% <100.00%> | (ø) |
| autogen/remote/protocol.py | 100.00% <100.00%> | (ø) |

... and 20 more

... and 41 files with indirect coverage changes

