@Lancetnik (Member) commented Aug 20, 2025

AG2 Remote Agents

Introduction

The current AG2 multi-agent model is based on the assumption that all agents run in the same process.

Basic multi-agent communication looks like this:

from autogen import GroupChat, GroupChatManager, LLMConfig, ConversableAgent

llm_config = LLMConfig({"model": "gpt-4o-mini"})

planner_message = "Create lesson plans for a 4th grade class."

planner = ConversableAgent(
    name="planner_agent",
    system_message=planner_message,
    description="Creates lesson plans",
    llm_config=llm_config,
)

reviewer = ConversableAgent(
    name="reviewer_agent",
    system_message="Review lesson plans against 4th grade curriculum. Provide max 3 changes.",
    description="Reviews lesson plans",
    llm_config=llm_config,
)

teacher = ConversableAgent(
    name="teacher_agent",
    system_message="Choose topics and work with planner and reviewer. Say DONE! when finished.",
    llm_config=llm_config,
)

groupchat = GroupChat(
  agents=[teacher, planner, reviewer],
  speaker_selection_method="auto",
)

# Create manager
# At each turn, the manager will check if the message contains DONE! and end the chat if so
# Otherwise, it will select the next appropriate agent using its LLM
manager = GroupChatManager(
    name="group_manager",
    groupchat=groupchat,
    llm_config=llm_config,
    is_termination_msg=lambda x: "DONE!" in (x.get("content", "") or "").upper(),
)

# Start the conversation
response = teacher.run(
    recipient=manager,
    message="Let's teach the kids about the solar system.",
    summary_method="reflection_with_llm",
)

response.process()

So, "agents" in our case are just Python objects with methods. This works well enough for simple cases, but it does not scale well. Therefore, the main problems of the current model are:

  • agents are referred to by direct Python object references
  • communication between agents is done using direct method calls

Remote Agents' motivation

The suggested Remote Agents concept is pretty close to microservices. Here are some features we can get:

  • Use different local setups for different agents (OS, Python version, libraries, local tools, etc.)
  • Separate release cycles for each agent. A remote agent can be updated without updating the entire system
  • Reuse existing agents in different multi-agent systems

Introducing the Remote Agent concept would let us split agent development across different teams. This would improve the quality of individual agents, grow the feature set, and make multi-agent systems less coupled. Like microservices, remote agents allow scaling an AG2 multi-agent system to enterprise level.

What we would like to achieve (Goals)

An inter-agent communication model that

  • does not rely on direct Python references and method calls
  • can be split across process (and machine) boundaries
  • lends itself to the elegant implementation of current AG2 flow patterns
  • can incorporate existing AG2 agents
  • allows users to connect non-AG2 agents (even non-LLM agents) to the system

Remote Agents API

RemoteAgent

First of all, I suggest introducing a RemoteAgent class that is compatible with the regular ConversableAgent. This way, we can migrate existing AG2 applications to Remote Agents without any other changes. For example, the application above could look like this:

from autogen import GroupChat, GroupChatManager, LLMConfig, ConversableAgent
from autogen.remote import RemoteAgent

llm_config = LLMConfig({"model": "gpt-4o-mini"})

# planner is a local agent for current process (as an example)
planner_message = "Create lesson plans for a 4th grade class."

planner = ConversableAgent(
    name="planner_agent",
    system_message=planner_message,
    description="Creates lesson plans",
    llm_config=llm_config,
)

# reviewer is a remote agent, running in another process
reviewer = RemoteAgent("reviewer_agent", ...)

# teacher is a remote agent, running in another process
teacher = RemoteAgent("teacher_agent", ...)

# regular group chat
groupchat = GroupChat(
  agents=[teacher, planner, reviewer],  # mix of local and remote agents
  speaker_selection_method="auto",
)

# group chat manager is a local agent for current process
manager = GroupChatManager(
    name="group_manager",
    groupchat=groupchat,
    llm_config=llm_config,
    is_termination_msg=lambda x: "DONE!" in (x.get("content", "") or "").upper(),
)

response = teacher.run(
    recipient=manager,
    message="Let's teach the kids about the solar system.",
    summary_method="reflection_with_llm",
)

response.process()

This way, the reviewer and teacher are moved to separate processes. They can:

  • be developed by different teams
  • be updated independently
  • be reused in other multi-agent systems.
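
The constructor arguments elided above (`...`) are intentionally left undesigned. As a strawman only, a RemoteAgent might be addressed by its registered name plus a transport-specific location; the url and timeout parameters below are assumptions, not a settled API:

from autogen.remote import RemoteAgent

# Hypothetical constructor shape for illustration only; the real
# RemoteAgent API (transport, auth, timeouts) is still to be designed.
reviewer = RemoteAgent(
    "reviewer_agent",  # must match the name registered in the remote runtime
    url="http://reviewer.internal:8000",  # assumed HTTP address of the remote AgentBus
    timeout=30,  # assumed per-request timeout, in seconds
)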

AgentBus runtime

Then, we should make our ConversableAgent able to process remote calls. So, we need to introduce the AgentBus, which would be responsible for:

  • receiving events from other agents
  • selecting agents to be activated by an event
  • performing actions
  • emitting events to other agents

Suggested API:

from autogen import LLMConfig, ConversableAgent
from autogen.remote import AgentBus

llm_config = LLMConfig({"model": "gpt-4o-mini"})

# regular agent we already have
reviewer = ConversableAgent(
    name="reviewer_agent",
    system_message="Review lesson plans against 4th grade curriculum. Provide max 3 changes.",
    description="Reviews lesson plans",
    llm_config=llm_config,
)

# runtime should consume all calls from `RemoteAgent("reviewer_agent", ...)`
# process them and return results back to the conversation
runtime = AgentBus(
    agents=[reviewer],
)

if __name__ == "__main__":
    runtime.run()

The open question is which methods such a runtime should implement; see the protocol methods list at the end of this proposal.

Conversation State

Currently, ConversableAgent knows about conversation state - it has a message history and knows about each agent's messages and actions. Distributed conversation state should also be available for each participant (remote or local). Here we have two options:

  • each agent has a local copy of the conversation state, which is simple, but requires a way to synchronize the state between agents
  • real distributed conversation state, which is more complex, but allows for more accurate conversation states

I think we should start with the first option. Local context synchronization is not a problem, as each agent sends a message to all chat participants so they can update their local states in real-time. The question is about message order guarantees, but most of the time there is a single speaker, and in other cases, agents can talk concurrently without any order guarantee.

These reasons are enough to choose the first option. A real distributed conversation state is a more complex solution that requires a lot of refactoring, so we will provide it as another interface later. Finally, users will be able to choose between using a local copy or a distributed implementation.
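
To make option one concrete, here is a minimal sketch of the local-copy model: every participant keeps an append-only history per chat and applies each broadcast message to it. All names here are illustrative, not a proposed API.

from collections import defaultdict
from typing import Any

class LocalConversationState:
    """Per-process copy of each conversation this agent participates in."""

    def __init__(self) -> None:
        # chat_id -> ordered list of messages seen so far
        self._histories: dict[str, list[dict[str, Any]]] = defaultdict(list)

    def apply(self, chat_id: str, message: dict[str, Any]) -> None:
        # Called for every send_message event broadcast to the chat.
        self._histories[chat_id].append(message)

    def history(self, chat_id: str) -> list[dict[str, Any]]:
        # Passed to the LLM as conversation context when this agent speaks.
        return list(self._histories[chat_id])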

Sync or async communication

By design, multi-agent communication is a sequence of messages with no guarantee of order. One agent can send a message to another, the second can send a reply to the third, and the third can say something to all of them again, then another random agent can send another message, and so on.

Therefore, RemoteAgent().send_message doesn't require a response, so our distributed runtime must be asynchronous by design.

Asynchronous HTTP Communication

Asynchronous communication can be implemented over plain HTTP. RemoteAgent().send_message becomes an HTTP request that is acknowledged with an empty 200 response; to deliver an actual reply, the remote agent makes another HTTP request (a webhook) to the relevant agent. However, we cannot update the local state of the other chat participants this way. Additionally, RemoteAgent needs to know all chat participants and their addresses in order to send a message to an arbitrary agent instead of simply responding to the original request. Therefore, our solution could be:

  • Notify all remote agents of all participants: when a chat starts, the conversation initiator sends a start_chat event to all remote agents. This event includes:
    • the chat ID
    • the list of participants (and their addresses)
    • the initial conversation context
  • Each agent then sends every message as an HTTP request to all participants so they can update their local state.

Pros:

  • Easy to implement
  • Easy to understand
  • Easily integrates with other systems

Cons:

  • Each agent should know all participants and their addresses
  • Each agent needs to send a message to all participants to update local state
  • Each agent listens for message events from all participants
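
To illustrate the HTTP flow above, here is a minimal sketch of the two endpoints a remote agent would expose. FastAPI is used purely as an example; the proposal does not prescribe an HTTP framework, and the payload shapes are assumptions based on the start_chat description above.

from fastapi import FastAPI
from pydantic import BaseModel

app = FastAPI()

class StartChatEvent(BaseModel):
    chat_id: str
    participants: dict[str, str]  # agent name -> HTTP address
    context: dict

class MessageEvent(BaseModel):
    chat_id: str
    sender: str
    content: str

@app.post("/start_chat")
def start_chat(event: StartChatEvent) -> None:
    # Remember the participants and seed the local conversation state.
    ...  # store event.participants and event.context under event.chat_id

@app.post("/message")
def message(event: MessageEvent) -> None:
    # Acknowledge with an empty 200 immediately; any reply this agent
    # produces is sent later as its own /message request to all participants.
    ...  # append the event to the local history for event.chat_id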

In the future, when we have a truly distributed conversation state, this HTTP implementation would effectively turn into a hand-written message broker with a self-written queue holding the state. For this reason, I suggest using an off-the-shelf message broker such as RabbitMQ, Kafka, NATS, or Redis instead.

Asynchronous Broker-based Communication

We can use message brokers like RabbitMQ, Kafka, NATS, or Redis to implement asynchronous communication. This way, RemoteAgent().send_message() just publishes a message to the broker in the conversation's context, and an agent's response is simply another such message published by that agent. With this approach, we don't need to know the participants' addresses: we publish to a chat_id topic, and all agents listen to it. However, we still need to notify all participants of the chat start so they can subscribe to the topic. So, the implementation might look like this:

  • At the start of a chat, the conversation initiator sends a start_chat event to all remote agents. This event includes:
    • the chat ID
    • the initial conversation context
  • Each agent then publishes its messages on the chat_id topic to share them with all participants

I suggest using NATS at the start. It is simple, lightweight, and easy to integrate with other systems. Benefits of using it include:

  • non-persistent messages (by default), which suits our use case, because missed messages aren't a problem
  • native, zero-cost request-reply when a synchronous call is needed
  • pattern-based subjects make flexible routing easy, for example:
    • [chat_id].* - listen for all messages in a chat
    • *.[agent_id] - listen for direct messages from a specific agent
    • [chat_id].[agent_id] - listen only for messages from the specified agent in the chat

As a client, I suggest using FastStream. We have a lot of experience with this framework; it's simple, lightweight, and easy to integrate with other systems. It has native observability features (OTEL, metrics, healthchecks), and it supports multiple message brokers, so we can switch later if needed (or implement alternative runtimes users can choose).
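
To sketch what a broker-based runtime could look like with FastStream and NATS: the subject layout ([chat_id].[agent_id]) follows the routing patterns above, while the payload shape and agent names are assumptions.

from faststream import FastStream
from faststream.nats import NatsBroker

broker = NatsBroker("nats://localhost:4222")
app = FastStream(broker)

@broker.subscriber("chat123.*")  # every message in chat "chat123", any sender
async def on_chat_event(event: dict) -> None:
    # Update the local conversation state with the incoming message and,
    # if this agent is the chosen speaker, publish its reply on the same
    # chat subject so every participant receives it.
    # (A real implementation would skip events published by this agent itself.)
    await broker.publish(
        {"sender": "reviewer_agent", "content": "..."},
        subject="chat123.reviewer_agent",
    )

Such an app is started with the FastStream CLI (faststream run module:app) next to the agent process.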

Stateful or stateless

By design, conversations are stateful. However, a specific agent can choose to be stateless. REST-API-like agents can be stateless, as they don't need to be aware of the conversation state in order to respond to a request. They simply consume a single request and provide a response to the conversation. Stateful agents, on the other hand, need to know the entire history of the conversation in order to make a decision.

I suggest allowing users to choose which agents should be stateful and which should be stateless.

The suggested API:

from autogen import LLMConfig, ConversableAgent
from autogen.remote import AgentBus, AgentWrapper

llm_config = LLMConfig({"model": "gpt-4o-mini"})

# regular agents we already have
agent1 = ConversableAgent(name="agent1", llm_config=llm_config)
agent2 = ConversableAgent(name="agent2", llm_config=llm_config)
agent3 = ConversableAgent(name="agent3", llm_config=llm_config)

runtime = AgentBus(
    agents=[
        agent3,  # stateful by default
        AgentWrapper(agent1, stateful=True),
        AgentWrapper(agent2, stateful=False),
    ],
)

AgentWrapper implementation details:

from __future__ import annotations

from collections.abc import Iterable

from autogen import ConversableAgent


class AgentBus:
    def __init__(self, agents: Iterable[AgentWrapper | ConversableAgent]):
        self.agents = [AgentWrapper.ensure_wrapper(agent) for agent in agents]


class AgentWrapper:
    def __init__(self, agent: ConversableAgent, stateful: bool = True):
        self.agent = agent
        self.stateful = stateful

    @classmethod
    def ensure_wrapper(cls, agent: ConversableAgent | AgentWrapper) -> AgentWrapper:
        if isinstance(agent, AgentWrapper):
            return agent
        return AgentWrapper(agent)

Stateful agents emulate a local conversation driven by incoming message requests. For this reason, a stateful agent needs a local mock copy of each chat participant, and the start_chat event must contain information about all participants so the agent can build its local copy of the conversation state. Stateful remote agents are therefore more difficult to implement, so it is recommended to start with stateless ones.

Migration from stateless defaults to stateful can be done by introducing AgentWrapper and deprecating the stateless default; then we can change the default behavior to stateful.

Conversation Distributed Context

https://docs.ag2.ai/latest/docs/user-guide/advanced-concepts/orchestration/group-chat/context-variables/

AG2 supports context variables that can be passed to conversations. Such variables should definitely be passed to remote agents. I suggest passing them as part of the start_chat event. The final information about the chat would look like this:

{
  "chat_id": "123",
  "participants": ["agent1", "agent2", "agent3"],
  "context": {
    "variable1": "value1",
    "variable2": "value2"
  }
}

This message should be sent to all participants, and they should update their local contexts accordingly. Likewise, any update to the context variables (if that's a valid case) should be broadcast to all participants. For this reason, we need to add an update_context event to the protocol, which every participant in the chat should listen for.

{
  "context": {
    "variable1": "value1",
    "variable2": "value2"
  }
}
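
On the receiving side, applying an update_context event can be a shallow, last-writer-wins merge into the agent's local copy, consistent with the local-state model above. The handler name and storage here are illustrative.

# chat_id -> local copy of the chat's context variables
local_contexts: dict[str, dict] = {}

def on_update_context(chat_id: str, event: dict) -> None:
    # Shallow merge: incoming values overwrite local ones.
    local_contexts.setdefault(chat_id, {}).update(event["context"])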

Conversation Manager

The conversation manager is a special agent responsible for managing the conversation. It sends a dedicated event to determine the next speaker, so we need to add this new event to the protocol.

choose_next_speaker event:

{
  "next_speaker": "agent1"
}

The conversation manager runs on the chat initiator's side. It listens to all incoming messages and sends the choose_next_speaker event to all participants. The selected agent processes this event and sends a message to the conversation; all other participants ignore it.

Such synchronous, initiator-driven coordination is required because a RemoteAgent is a REST-like service that knows nothing about the conversation except its state. Conversation management rules must therefore be defined by the initiator at conversation startup, and the conversation manager must live on the chat initiator's side.

In the future, start_chat will have information about conversation management rules. This will allow agents to choose the next speaker themselves, but the first implementation could be synchronous.
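
On each participant, handling choose_next_speaker reduces to a name check; everything else is the agent's regular reply logic. The helper names below (local_history, publish_message) are illustrative, not an existing API.

def on_choose_next_speaker(agent, chat_id: str, event: dict) -> None:
    if event["next_speaker"] != agent.name:
        return  # not our turn: ignore the event
    history = agent.local_history(chat_id)  # local copy of the chat state
    reply = agent.generate_reply(history)   # regular ConversableAgent logic
    agent.publish_message(chat_id, reply)   # emitted as a send_message event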

Tool calling

MCP Tools and functions can be declared on either side of a remote conversation.

Chat initiator side:

from autogen import GroupChat, GroupChatManager, LLMConfig, ConversableAgent
from autogen.remote import RemoteAgent

llm_config = LLMConfig({"model": "gpt-4o-mini"})

planner_message = "Create lesson plans for a 4th grade class."

planner = ConversableAgent(
    name="planner_agent",
    system_message=planner_message,
    description="Creates lesson plans",
    llm_config=llm_config,
    tools=[...],  # planner tools
)

reviewer = RemoteAgent("reviewer_agent", ...)

groupchat = GroupChat(
  agents=[planner, reviewer],
  speaker_selection_method="auto",
)

manager = GroupChatManager(
    name="group_manager",
    groupchat=groupchat,
    llm_config=llm_config,
    is_termination_msg=lambda x: "DONE!" in (x.get("content", "") or "").upper(),
)

response = planner.run(
    recipient=manager,
    message="Let's teach the kids about the solar system.",
    summary_method="reflection_with_llm",
)

response.process()

Remote agent side:

from autogen import LLMConfig, ConversableAgent
from autogen.remote import AgentBus

llm_config = LLMConfig({"model": "gpt-4o-mini"})

reviewer = ConversableAgent(
    name="reviewer_agent",
    system_message="Review lesson plans against 4th grade curriculum. Provide max 3 changes.",
    description="Reviews lesson plans",
    llm_config=llm_config,
    tools=[...]  # reviewer tools
)

runtime = AgentBus(agents=[reviewer])

I think an agent should only be able to use its own tools, so we don't need to add a new event to the protocol.

Human in the Loop

Human on the edge

A multi-agent conversation should be able to handle a human in the loop. This logic is closely related to the conversation manager and the chat initiator, so I suggest adding a new protocol event called ask_human_input, which an agent sends to the chat before waiting for a human response from the conversation initiator's side.

Human in the middle

In some cases, agents may require human input for their actions beyond what the conversation design provides. For example, a client initiates a distributed conversation, and a remote agent needs administrator approval for an action. The conversation's end user is not the one who should approve or reject that action, so this is not a case for the ask_human_input event.

So, we need a mechanism to handle these cases. I suggest adding an additional hook to AgentBus, like:

...
runtime = AgentBus(
  agents=[reviewer],
  ask_user_input=lambda: ...,
)

This hook should process the user input and return the result to the agent. The same mechanism can also fall back to the ask_human_input event when necessary. Hooks make it possible to implement different input channels, such as a messenger, email, or SMS: you call some code, wait for the user's response, and return it as the function's result.
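
For example, a console-based approval hook might look like the sketch below. The hook signature (a single prompt string, synchronous return) is an assumption; the final AgentBus API may pass richer context.

def console_approval(prompt: str) -> str:
    # Block until an administrator answers on the runtime's console.
    return input(f"{prompt} [approve/reject]: ")

runtime = AgentBus(
    agents=[reviewer],
    ask_user_input=console_approval,
)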

RemoteAgent inaccessibility

Conversation correctness strongly depends on each agent's availability. So, I suggest adding a dedicated protocol event called ping. The conversation manager sends the ping event, with a timeout, to all remote participants to check their availability. If an agent becomes unavailable, the conversation manager notifies the other participants with a mark_dead event. If the agent becomes reachable again for any participant, that participant notifies the other chat members with a mark_alive event.

Additionally, we could add special markers to the RemoteAgent API to allow users to take action when an agent goes offline:

agent = RemoteAgent(
  "agent1",
  # on_dead=InaccessibilityPolicy.STOP_CONVERSATION(),
  # on_dead=InaccessibilityPolicy.IGNORE(),
  on_dead=InaccessibilityPolicy.RESTART(
    wait_for_restart=True,
    restart_timeout=10,
    restart_interval=1,
    restart_max_attempts=3,
    if_failed_to_restart=InaccessibilityPolicy.STOP_CONVERSATION,
  ),
)

If a dead agent is ignored, the conversation manager should respect this information in the choose_next_speaker decision.
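
A sketch of the availability loop on the conversation manager's side; send_ping, broadcast, and chat_is_running are assumed helpers, not an existing API.

import time

def watch_agents(manager, agents: list[str], interval: float = 5.0, timeout: float = 2.0) -> None:
    alive = {name: True for name in agents}
    while manager.chat_is_running():
        for name in agents:
            ok = manager.send_ping(name, timeout=timeout)  # the ping protocol event
            if ok != alive[name]:
                alive[name] = ok
                event = "mark_alive" if ok else "mark_dead"
                manager.broadcast(event, {"agent_id": name})
        time.sleep(interval)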

Also, we should correctly handle the release of a new agent version while a conversation is ongoing.

Observability

All remote agents should support all observability features, including:

  • OpenTelemetry metrics
  • health checks (at protocol and service levels)
  • logs (bound to the conversation)

Most of this tooling is already available in the underlying runtime (the HTTP server or FastStream), so we can reuse it.

Messages serialization

Some remote agents may require a specific incoming message format or use a strict message structure for their responses. We can respect these requirements at the protocol level and automatically retry incorrectly formatted remote requests. To notify chat participants about these requirements, I suggest adding a new event to the protocol called i_am. This event should be sent to the chat immediately after the start_chat event by each participant, providing all participants with information about an agent's requirements and features so they can adjust their behavior accordingly.

Authentication

Authentication is a very important requirement for distributed systems. Remote agents should support authentication at runtime. We can adopt Basic / Digest authentication for HTTP and integrate it with Keycloak or implement tokens ourselves. In broker cases, we can delegate authentication to the message broker. This part doesn't differ from regular microservice authentication.
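
For the HTTP runtime, a first cut can reuse standard HTTP auth directly on the client; the sketch below uses httpx's built-in Basic auth support. Whether (and how) RemoteAgent exposes auth configuration is an open design question.

import httpx

client = httpx.Client(auth=("agent_user", "s3cret"))  # Basic auth
response = client.post(
    "http://reviewer.internal:8000/message",  # assumed remote AgentBus address
    json={"chat_id": "123", "sender": "teacher_agent", "content": "..."},
)
response.raise_for_status()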

Interoperability with non-AG2 agents

We should be able to connect non-AG2 agents as long as they implement the AG2 RemoteAgent protocol. The AG2 framework should provide suitable interfaces and pre-built helpers so users can write their own protocol implementations for non-AG2 agents. This feature is not needed at the start, so it lands late in the plan below (P4).

Implementation plan

P0: First implementation

The simplest implementation could be a plain HTTP runtime with stateless remote agents that can only respond to questions. They do not maintain any conversation state.

A conversation then becomes an HTTP request/response to a stateless agent instead of the local calls we have now. The chat manager on the initiator's side decides who speaks next, calls that agent, and so on.

This is a simple but effective implementation. It also lets us prototype the RemoteAgent availability mechanisms.

  • Implement HTTPAgentBus
  • stateless remote agents (no conversation state; they just answer requests from the conversation)
  • ? Remote Agents availability

P1: Conversation State support

In this phase, we should make RemoteAgents stateful.

  1. The chat initiator (with the Conversation Manager) sends a start_chat event to all remote participants.
  2. RemoteAgent creates a local mock copy of the conversation state for all participants
  3. Each message sent to the Conversation Manager is broadcast to all participants so they can update their local states.

Also, at this stage, I suggest implementing human-in-the-loop and context variable support.

  • stateful agents
  • human in the loop
  • context variables

Therefore, all Remote Agents become full-featured, with conversation state support. At this stage, we can also provide an alternative AgentBus implementation based on a message broker to avoid broadcasting messages from the Conversation Manager to all participants.

P2: Remote Agents features

Here we should support distributed features like:

  • authentication
  • Remote Agents availability
  • human in the middle

It makes our distributed systems safe and reliable.

P3: Real Distribution

Here we should go beyond local copies of the real state and make it truly distributed.

So, any remote agent should be able to decide on the next speaker on its own. This allows us to make the protocol more flexible and reliable, and we don't need a chat initiator anymore.

Also, we should make the conversation state truly distributed, which avoids any inconsistency between participants and makes communication much more reliable.

  • distributed Conversation Manager
  • distributed Conversation State

P4: additional features

  • messages serialization
  • observability
  • interoperability with non AG2 agents
  • ? distributed tools

Current codebase problems

  1. Chat startup doesn't notify chat participants, so we can't hook it to send a notification to all remote agents.
  2. ConversableAgent.send has no information about the current chat identifier, so we can't bind a published message to a specific conversation.
  3. ConversableAgent is strongly bound to a specific conversation and can't be reused across conversations.

Protocol methods

  • start_chat (to all participants)
    {
      "chat_id": "123",
      "participants": ["agent1", "agent2", "agent3"],
      "context": {
        "variable1": "value1",
        "variable2": "value2"
      }
    }
  • stop_chat (to chat)
  • update_context (to chat)
    {
      "context": {
        "variable1": "value1",
        "variable2": "value2"
      }
    }
  • ask_human_input (to chat)
  • ping (to specific agent)
  • mark_dead (to chat)
    {
      "agent_id": "agent1"
    }
  • mark_alive (to chat)
    {
      "agent_id": "agent1"
    }
  • send_message (to chat)
    • message
    • sender id
  • choose_next_speaker (to chat)
    • next speaker id
  • i_am (to chat or as answer)
    • agent description
    • request expectations
    • response format

@Lancetnik Lancetnik marked this pull request as draft August 20, 2025 15:20
@Lancetnik Lancetnik changed the title feat: RemoteAgent refactor: polish ConversableAgent logic Aug 20, 2025

@Lancetnik Lancetnik changed the title refactor: polish ConversableAgent logic feat: RemoteAgents Aug 24, 2025
@marklysze (Collaborator) commented:

Thanks @Lancetnik! I like the progressive approach.

I have a few questions:

  1. Is agent_id the name of the agent or a unique identifier (if name, we need to be careful of nested chats that have agents with the same name)

  2. What is the lifecycle of a remote agent?

  3. What are the limitations of an agent that is stateless?

  4. What is the benefit of having a stateless agent?

  5. Conversation Manager with run_group_chat:

  • There are context conditions, llm conditions, and after work conditions that run after an agent creates a reply to determine the next agent or terminate (and upcoming option to run a function) - where will these be run - on the Conversation Manager or on the remote agent? I assume the Conversation Manager, so we need to make sure that the context variables are updated first as some of these rely on the context variables to be evaluated.
  • Tool calls will run a function, and that function can return a string as its “message” or a ReplyResult, which contains a message, updated context variables, and the next target (an agent or terminate) - if this is handled by the Conversation Manager, we need to be able to send this back.
  • Going to the next agent: Transition Targets determine the next agent or termination - apart from the TerminateTarget, these should resolve to an agent so I think that’s okay - but we will allow developers to build their own targets so may need to ensure these resolve to agents. One of the members on our team is considering a FunctionTarget that will run a Python function and then return a ReplyResult, this function may need to run on the agent itself. Target info and FunctionTarget PR
  6. Dependency injection for a tool’s function’s parameters - as the agent is executing the tool I assume this is okay, but the agent will need to be able to populate the values of these injected parameters, so it will need access to them. Dependency Injection

  7. RemoteAgent inaccessibility: We need to be mindful that some transitions (OnContextCondition) may depend on an agent being available (e.g. tech_agent.handoffs.set_after_work(AgentTarget(that_agent)))

@qingyun-wu (Collaborator) commented Aug 25, 2025

Thanks Nikita! This reads like a very neat proposal as a start! A couple of questions:

How does the AgentBus runtime work with different conversation patterns in AG2, including:
a. Different variants of GroupChat
b. Patterns beyond GroupChat, e.g., sequential chat

@Lancetnik (Member, Author) replied:

Thanks Nikita! This reads like a very neat proposal as a start! A couple of questions:

  1. How does the AgentBus runtime work with different conversation patterns in AG2, including
    a. Different variants of GroupChat
    b. Patterns beyond GroupChat, e.g., sequential chat

Well, the current implementation is designed around the idea of keeping all ConversationManager strategies local and giving the manager the ability to call remote agents to support distributed conversations. Therefore, all next-speaker-selection logic (group chat, sequential chat, etc.) is implemented on the client side. The AgentBus runtime is currently just a stateful HTTP server that stores chat history (to pass it as conversation context to the LLM) and responds to remote calls.

@Lancetnik Lancetnik marked this pull request as ready for review October 17, 2025 20:06
@Lancetnik (Member, Author) commented:

@claude


claude bot commented Oct 17, 2025

Claude finished @Lancetnik's task.


PR Review: RemoteAgents Implementation (PR #2055)

Executive Summary

This is an excellent and comprehensive PR that introduces distributed agent communication to AG2 through both HTTP and A2A protocol support. The implementation demonstrates strong architectural design, extensive test coverage, and thorough documentation. With 4,457 additions across 64 files, this represents a significant feature addition that enables true distributed multi-agent systems.

Recommendation: Approve with Minor Suggestions

The code is production-ready, but I've identified some areas for improvement in error handling, resource management, and edge case coverage.


🎯 Strengths

1. Excellent Architecture

  • Clean separation of concerns: autogen.remote (HTTP) and autogen.a2a (A2A protocol) are well-isolated
  • Protocol-agnostic design: RemoteService interface enables future transport implementations
  • Backward compatibility: Seamless integration with existing ConversableAgent via reply function replacement
  • Proper abstraction: Factory patterns (ClientFactory, HttpxClientFactory) for testability

2. Comprehensive Implementation

  • Full A2A protocol support: Streaming and polling modes with fallback
  • Stateful conversation management: Context variables and message history synchronization
  • Retry mechanisms: Configurable retry policies with reconnection support
  • Tool execution: Both local and client-side tools properly handled

3. Strong Test Coverage

  • 392 test cases for message parsing/serialization
  • Integration tests for group chat, sequential chat, and two-agent conversations
  • Parametrized tests for edge cases
  • Mock-based testing with TestAgent helper

4. Quality Documentation

  • Comprehensive user guides in website/docs/user-guide/a2a/
  • Clear API documentation with docstrings
  • Usage examples for both server and client
  • Interoperability documentation (Pydantic AI)

🔍 Issues & Recommendations

Critical Issues

1. Infinite Polling Loop Risk - autogen/a2a/client.py:184-203

Issue: The polling loop has no timeout mechanism and could run indefinitely:

while connection_attemps < self._max_reconnects:
    while True:  # ⚠️ No timeout!
        try:
            task = await client.get_task(TaskQueryParams(id=started_task.id))
        except httpx.ConnectError as e:
            # reconnection logic
        else:
            if _is_task_completed(task):
                return response_message_from_a2a_artifacts(task.artifacts)
            await asyncio.sleep(self._polling_interval)

Impact: Tasks that never complete will cause the client to poll forever, consuming resources.

Recommendation: Add a configurable timeout:

max_polling_time = 300  # 5 minutes default
start_time = time.time()

while connection_attemps < self._max_reconnects:
    while True:
        if time.time() - start_time > max_polling_time:
            raise A2aClientError(f"Task {started_task.id} timed out after {max_polling_time}s")
        # ... rest of logic

2. Task ID Loss Recovery Loop - autogen/remote/agent.py:76-103

Issue: If the server repeatedly loses the task (404 response), the client will recreate it indefinitely:

task_id: Any = None
with self._httpx_client_factory.make_sync() as client:
    while True:  # ⚠️ No limit on task recreation attempts
        with retry_policy:
            if task_id is None:
                task_id = self._process_create_remote_task_response(...)

Impact: Could lead to infinite loop if server has persistent issues.

Recommendation: Add max task recreation attempts:

max_task_recreations = 3
task_recreations = 0

while True:
    with retry_policy:
        if task_id is None:
            if task_recreations >= max_task_recreations:
                raise RemoteAgentError(f"Failed to create task after {max_task_recreations} attempts")
            task_id = self._process_create_remote_task_response(...)
            task_recreations += 1

High Priority

3. Exponential Backoff Missing - autogen/a2a/client.py:203

Issue: Fixed polling interval doesn't adapt to server load:

await asyncio.sleep(self._polling_interval)  # Always 0.5s

Recommendation: Implement exponential backoff:

backoff_factor = 1.5
max_interval = 10.0
current_interval = self._polling_interval

while True:
    # ... polling logic ...
    await asyncio.sleep(min(current_interval, max_interval))
    current_interval *= backoff_factor

4. Error Message Information Disclosure - autogen/a2a/client.py:147,165,180,272

Issue: Full model dumps in error messages could expose sensitive information:

raise A2aClientError(f"Failed to connect to the agent: {pformat(self.__agent_card.model_dump())}")

Recommendation: Sanitize error messages:

raise A2aClientError(f"Failed to connect to agent '{self.name}' at {self.__agent_card.url}")

5. Resource Cleanup - autogen/a2a/client.py:109-119

Issue: Client creation pattern may not guarantee cleanup on exceptions:

self._client_config.httpx_client = self._httpx_client_factory()
async with self._client_config.httpx_client:
    agent_client = A2AClientFactory(self._client_config).create(self.__agent_card)

Recommendation: Use try-finally for cleanup guarantee:

self._client_config.httpx_client = self._httpx_client_factory()
try:
    async with self._client_config.httpx_client:
        # ... logic ...
finally:
    self._client_config.httpx_client = None

Medium Priority

6. Missing Input Validation - autogen/a2a/client.py:88-107

Issue: No validation that messages list is non-empty before API call.

Recommendation: Add validation:

if not messages:
    raise ValueError("Cannot generate reply with empty message list")

7. KeyError Risk - autogen/remote/agent.py:72,123

Issue: Direct dictionary access without checking key existence:

messages = self._oai_messages[sender]  # Could raise KeyError

Recommendation: Use .get() with default:

messages = self._oai_messages.get(sender, [])
if not messages:
    raise ValueError(f"No message history found for sender {sender}")

8. Retry Module Untested - autogen/remote/retry.py

Issue: The entire retry module (102 lines) has zero test coverage.

Recommendation: Add tests for:

  • Successful retry after N attempts
  • Exhausting retry count
  • Async vs sync behavior differences
  • Sleep timing verification

Minor Issues

9. TODO Comments Without Tracking - autogen/remote/agent_service.py:37,104,123

Issue: Multiple TODOs without GitHub issues:

  • Line 37: "catch ask user input event"
  • Line 104: "inject ChatContext to tool"
  • Line 123: "catch handoffs"

Recommendation: Create tracking issues and reference them in comments.

10. Inconsistent Naming - autogen/a2a/httpx_client_factory.py:130

Issue: Function named like a class (PascalCase):

def EmptyClientFactory() -> MockClient:  # Should be snake_case

Recommendation: Rename to create_empty_client() or make it an actual class.


🔒 Security Analysis

✅ Good Practices

  1. No hardcoded credentials
  2. SSL verification enabled by default in httpx clients
  3. Proper exception handling prevents information leakage in most cases
  4. Type validation with Pydantic models

⚠️ Recommendations

  1. Agent Card Validation: The _get_agent_card method fetches remote configs. Add validation to prevent malicious configurations
  2. Rate Limiting: Consider adding configurable rate limits to prevent abuse
  3. Timeout Configuration: Expose timeout settings in user-facing APIs
  4. Authentication: While mentioned in PR description as P2, consider adding basic auth support before production use

📊 Test Coverage Assessment

Well Tested ✅

  • Message serialization/deserialization (392 test cases)
  • Round-trip conversions
  • Mock client interactions
  • Agent service functionality
  • Group chat integration

Needs More Tests ⚠️

  1. Retry module - 0% coverage
  2. Timeout scenarios - No tests for polling timeouts
  3. Connection failure recovery - Limited edge case coverage
  4. Concurrent requests - No tests for multiple simultaneous remote calls
  5. Context variable updates - Limited coverage of distributed context sync
  6. A2A streaming edge cases - Connection loss during streaming

Recommendation: Add integration test suite:

# test/a2a/test_integration.py
@pytest.mark.asyncio
async def test_polling_timeout():
    """Test that polling stops after timeout"""
    
@pytest.mark.asyncio  
async def test_task_recreation_limit():
    """Test that task recreation has a limit"""
    
@pytest.mark.asyncio
async def test_concurrent_remote_agents():
    """Test multiple agents calling same remote agent"""

🚀 Performance Considerations

Current Implementation

  1. Fixed polling interval: Could be inefficient for both fast and slow tasks
  2. Client recreation: New httpx client on each request (line 109) - consider connection pooling
  3. Message history size: No limit on conversation history - could cause memory issues

Suggestions

  1. Connection pooling: Reuse httpx clients across requests
  2. Adaptive polling: Implement exponential backoff as noted above
  3. History limits: Add configurable max message count for remote agents
  4. Webhook support: Consider webhook-based notifications to avoid polling overhead

📝 Documentation Quality

Strengths ✅

  • Clear architecture explanation in PR description
  • User-facing docs for both client and server
  • Good examples with code snippets
  • Interoperability documentation

Suggestions

  1. Error handling guide: Add examples of handling remote agent failures
  2. Performance tuning: Document polling intervals, timeouts, retry policies
  3. Migration guide: Help users migrate existing local agents to remote
  4. Security best practices: Document authentication, SSL, rate limiting

🎯 Comparison with Previous Reviews

The previous Claude reviews (Oct 15) identified similar issues:

  • Polling without backoff - still present (issue #3 above)
  • Missing timeout handling - still present (issue #1 above)
  • Task ID regeneration - still present (issue #2 above)
  • ⚠️ Some concerns addressed: error messages improved, but still verbose

This indicates these are known issues that should be prioritized for resolution.


🔧 Action Items

Must Fix Before Merge

  1. ⚠️ Add timeout to polling loop (issue #1 above)
  2. ⚠️ Add max task recreation limit (issue #2 above)
  3. ⚠️ Sanitize error messages (issue #4 above)

Should Fix

  1. Implement exponential backoff (issue #3 above)
  2. Add retry module tests (issue #8 above)
  3. Fix resource cleanup pattern (issue #5 above)
  4. Add input validation (issue #6 above)

Nice to Have

  1. Create GitHub issues for TODOs (issue #9 above)
  2. Add integration test suite
  3. Document performance tuning
  4. Add security best practices guide

✅ Conclusion

This PR represents exceptional work that brings distributed agent capabilities to AG2. The implementation is well-architected, thoroughly tested (1,058 total test lines), and comprehensively documented. The main concerns are around robustness (timeouts, infinite loops) and operational safety (error messages, resource cleanup).

Final Score: 9/10

The critical issues are straightforward fixes that don't require architectural changes. Once addressed, this will be a production-ready feature that significantly expands AG2's capabilities.

Outstanding work, @Lancetnik! 🎉 This feature will enable exciting new use cases for distributed multi-agent systems.


Branch: feat/remote-agents

@Lancetnik (Member, Author) commented:

@marklysze seems like everything is fine now. I'm just waiting for your review.

@marklysze (Collaborator) left a comment:

Amazing work @Lancetnik!

@marklysze marklysze enabled auto-merge October 21, 2025 20:14
@marklysze marklysze disabled auto-merge October 21, 2025 20:34
@marklysze marklysze enabled auto-merge October 21, 2025 20:34
@marklysze marklysze added this pull request to the merge queue Oct 21, 2025
Merged via the queue into main with commit fa2be6f Oct 21, 2025
20 of 21 checks passed
@marklysze marklysze deleted the feat/remote-agents branch October 21, 2025 20:45

codecov bot commented Oct 21, 2025

Codecov Report

❌ Patch coverage is 75.84541% with 200 lines in your changes missing coverage. Please review.

| Files with missing lines | Patch % | Lines |
| --- | --- | --- |
| autogen/a2a/client.py | 58.90% | 51 Missing and 9 partials ⚠️ |
| autogen/remote/agent.py | 51.19% | 36 Missing and 5 partials ⚠️ |
| autogen/remote/agent_service.py | 68.67% | 12 Missing and 14 partials ⚠️ |
| autogen/remote/retry.py | 50.00% | 26 Missing ⚠️ |
| autogen/agentchat/conversable_agent.py | 86.76% | 4 Missing and 5 partials ⚠️ |
| autogen/agentchat/groupchat.py | 14.28% | 4 Missing and 2 partials ⚠️ |
| autogen/remote/runtime.py | 83.78% | 3 Missing and 3 partials ⚠️ |
| autogen/a2a/agent_executor.py | 84.84% | 4 Missing and 1 partial ⚠️ |
| autogen/a2a/httpx_client_factory.py | 82.60% | 2 Missing and 2 partials ⚠️ |
| autogen/a2a/utils.py | 94.11% | 2 Missing and 2 partials ⚠️ |

... and 8 more
| Files with missing lines | Coverage | Δ |
| --- | --- | --- |
| autogen/a2a/errors.py | 100.00% <100.00%> | (ø) |
| autogen/agentchat/__init__.py | 100.00% <100.00%> | (ø) |
| autogen/agentchat/agent.py | 100.00% <ø> | (ø) |
| autogen/interop/pydantic_ai/pydantic_ai.py | 86.56% <100.00%> | (+44.77%) ⬆️ |
| autogen/llm_config/client.py | 100.00% <100.00%> | (ø) |
| autogen/oai/bedrock.py | 27.34% <ø> | (+0.62%) ⬆️ |
| autogen/oai/client.py | 55.35% <100.00%> | (-3.98%) ⬇️ |
| autogen/oai/cohere.py | 34.80% <ø> | (+0.01%) ⬆️ |
| autogen/remote/errors.py | 100.00% <100.00%> | (ø) |
| autogen/remote/protocol.py | 100.00% <100.00%> | (ø) |

... and 20 more

... and 41 files with indirect coverage changes

