raj-gupta1/ComputerUse-WebAgent

Raj's Web Agent

1. This project contains a modular web agent designed to operate within the AGI SDK REAL benchmark.
2. The agent uses a dynamic prompt-routing system to select the best strategy for a given task.

System Architecture Diagram:

Current repo contains

  • High-level project architecture design.
  • An orchestrator that coordinates the prompts, agent, config, prompt_selector, and main modules.
  • Memory design and integration with MongoDB.
  • Prompt routing.
  • LLM as a judge.
  • Prompting techniques from the literature, such as Chain-of-Thought prompting.
  • Prompts for Omnizon and NetworkIn.
  • Model testing.
  • Setup and evaluation on realeval for Omnizon and some NetworkIn tasks.
Possible Improvements

  • Integrating DSPy for better prompt creation and handling.
  • Using RL for post-training (for example GRPO or PPO) and continual learning.
  • Testing with stronger LLMs, choosing a different LLM for each role, and optimising cost.
  • Building on better frameworks such as browser-use and Nova Act, and fine-tuning parts of a multimodal LLM.

Future improvements

  • The screenshot-capture logic and BrowserGym's HighLevelActionSet feature do not sync properly.
  • We can build a better map for button tasks, element IDs (bid), and the action space by fine-tuning prompts or by using more dedicated prompts that explain each website's workflow.
  • Integrating with more agentic frameworks for cost and speed optimisation.

Cost & Model Limitations

  • I currently use the inexpensive gpt-4o-mini for everything, but models can be changed in config.py, and using multimodal reasoning models should significantly improve performance.
  • Post-training, or using GRPO together with DSPy, could improve performance significantly.
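To illustrate how per-role model selection could look in a configuration data class, here is a minimal sketch. The class name `AgentConfig` and the fields `agent_model`, `router_model`, and `judge_model` are hypothetical; the actual config.py may be organised differently.

```python
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class AgentConfig:
    """Hypothetical config: one model name per role (field names are assumptions)."""
    agent_model: str = "gpt-4o-mini"   # main agent that plans and acts
    router_model: str = "gpt-4o-mini"  # small, fast model for prompt routing
    judge_model: str = "gpt-4o-mini"   # LLM-as-a-judge evaluation


default_config = AgentConfig()

# Upgrade only the main agent while keeping the cheaper models for other roles.
upgraded_config = replace(default_config, agent_model="gpt-4o")
print(upgraded_config)
```

Keeping the roles as separate fields makes it possible to spend more on the main agent while the router and judge stay on a cheaper model.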

🌟 Key Features

  • Modular Architecture: The agent's logic is separated into distinct components: a high-level Orchestrator (the project manager) and a focused Agent (the LLM specialist).
  • Dynamic Prompt Routing: Uses a small, fast LLM to analyze the task goal and dynamically load the correct "instruction manual" (prompt file) for the specific website (e.g., Omnizon, DashDish).
  • Chain-of-Thought Planning: The agent performs a "self-verification" step after creating a plan, critically reviewing it for logical flaws (like missing navigation steps) before execution begins.
  • Advanced Self-Correction & Recovery: The agent can detect when it's stuck in a repetitive failure loop (e.g., endless scrolling, trying to click a blocked element) and will change its strategy to recover.
  • Planning: Breaks down complex user goals into the smallest possible, single-action steps, which dramatically improves reliability, especially for multi-part UI interactions like selecting options in a dropdown.
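The self-correction feature above depends on noticing a repetitive failure loop. One simple way to detect it is sketched below; this is not the repository's actual logic. The `LoopDetector` class, its `window` parameter, and the `progressed` flag are all assumptions made for illustration.

```python
from collections import deque


class LoopDetector:
    """Flags a loop when the last `window` steps repeat one action without progress."""

    def __init__(self, window: int = 3):
        self.window = window
        self.recent = deque(maxlen=window)  # keeps only the newest `window` steps

    def record(self, action: str, progressed: bool) -> bool:
        """Store a step and return True if the agent looks stuck."""
        self.recent.append((action, progressed))
        if len(self.recent) < self.window:
            return False
        first_action = self.recent[0][0]
        return all(a == first_action and not ok for a, ok in self.recent)


detector = LoopDetector(window=3)
results = [detector.record("scroll(0, 500)", progressed=False) for _ in range(3)]
print(results)  # the third identical, unproductive step trips the detector
```

When `record` returns True, an orchestrator could ask the LLM for a different strategy instead of retrying the same action.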

📂 Project Structure

The agent's source code is located entirely within the agiwebagent/ directory.

agiwebagent/
├── main.py                      # The main entry point to launch the agent.
├── requirements.txt             # Python dependencies.
└── agent_src/                   # The core source code for the agent.
    ├── __init__.py
    ├── agent.py                 # The "Specialist": Communicates with the LLM.
    ├── config.py                # Simple configuration data class.
    ├── memory.py                # Stores the history of actions for each step.
    ├── orchestrator.py          # The "Manager": Oversees the entire task lifecycle.
    ├── prompt_selector.py       # The "Strategy Advisor": Chooses the correct prompt file.
    ├── utils.py                 # Helper functions (e.g., image conversion).
    └── prompts/                 # Directory containing all specialized "brains".
        ├── __init__.py
        ├── dashdish_prompts.py  # The brain for the DashDish food delivery site.
        └── omnizon_prompts.py   # The brain for the Omnizon e-commerce site.
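As a rough illustration of what choosing a prompt file involves, the sketch below maps a task ID to a prompt module. It is a deliberately simplified stand-in: the README describes routing with a small LLM that reads the task goal, whereas this example only parses the task ID. The function name `select_prompt_module` and the dotted module paths are assumptions derived from the directory tree above.

```python
from typing import Optional

# Assumed module paths, inferred from the directory layout (not verified).
PROMPT_MODULES = {
    "omnizon": "agent_src.prompts.omnizon_prompts",
    "dashdish": "agent_src.prompts.dashdish_prompts",
}


def select_prompt_module(task_name: str) -> Optional[str]:
    """Return the prompt module for a task ID such as 'webclones.omnizon-3'.

    Returns None when no site-specific prompt is registered, so the caller
    can fall back to a generic prompt or to an LLM-based router.
    """
    site = task_name.split(".")[-1].rsplit("-", 1)[0].lower()
    return PROMPT_MODULES.get(site)


print(select_prompt_module("webclones.dashdish-2"))
print(select_prompt_module("webclones.unknownsite-1"))
```

A table like this could also serve as a cheap first pass, with the LLM router consulted only when the lookup returns None.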

🛠️ Setup Instructions

Follow these steps from the root directory of the project (agiinc/).
  1. Create and Activate a Virtual Environment
    It's highly recommended to use a virtual environment to manage dependencies.

    python -m venv agienv
    source agienv/bin/activate

    (On Windows, use agienv\Scripts\activate)

  2. Install Dependencies
    Install all the required Python packages from both the root requirements.txt and the agent's specific requirements.txt.

    pip install -r requirements.txt
    pip install -r agiwebagent/requirements.txt

  3. Set Up Your API Key
    The agent requires an OpenAI API key to function. Create a file named .env in the root agiinc/ directory and add your API key to it:

    OPENAI_API_KEY="sk-YourSecretAPIKeyHere"
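To show how a .env entry in this format can be read, here is a small standard-library sketch. The helpers `parse_env` and `require_api_key` are hypothetical; the project itself may load the file differently (for example through a dotenv library), so treat this only as an illustration of the file format.

```python
import os


def parse_env(text: str) -> dict:
    """Parse KEY=VALUE lines, skipping blanks and '#' comments."""
    values = {}
    for raw_line in text.splitlines():
        line = raw_line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        # Drop surrounding quotes so "sk-..." and sk-... are treated alike.
        values[key.strip()] = value.strip().strip('"').strip("'")
    return values


def require_api_key(env: dict) -> str:
    """Return the key from the parsed file, else from the process environment."""
    key = env.get("OPENAI_API_KEY") or os.environ.get("OPENAI_API_KEY", "")
    if not key:
        raise RuntimeError("OPENAI_API_KEY is not set; add it to the .env file")
    return key


sample = 'OPENAI_API_KEY="sk-YourSecretAPIKeyHere"\n# comments are ignored\n'
print(require_api_key(parse_env(sample)))
```

Failing early with a clear error is usually friendlier than letting the first LLM call fail with an authentication error.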

🚀 Running the Agent

All commands should be run from the root agiinc/ directory. The main script is located at agiwebagent/main.py.

Running a Single Task
To run a specific, named task, use the --task_name argument. This is perfect for debugging.

Example (networkin):

python agiwebagent/main.py --task_name webclones.networkin-3 --no-cache --headless true

Running a Full Task Suite
To run all tasks for a specific website (like all 10 omnizon tasks), use the --task_type argument. This is ideal for benchmarking.

Example (Run all networkin tasks):

python agiwebagent/main.py --task_type networkin --no-cache --headless true

Example (Run all Omnizon tasks):

python agiwebagent/main.py --task_type omnizon --no-cache --headless true

| Argument | Description | Example |
| --- | --- | --- |
| --task_name | Runs a single, specific task by its full ID. | webclones.dashdish-2 |
| --task_type | Runs all tasks belonging to a specific benchmark suite. | dashdish, omnizon |
| --headless | true or false. Runs the browser in the background (true) or shows the UI (false). Default is false. | --headless true |
| --no-cache | Disables caching and forces the agent to re-run the task from scratch. Highly recommended for testing changes. | --no-cache |
| --model | Specifies the OpenAI model to use for the main agent. | --model gpt-4o |
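
The arguments above could be parsed with argparse roughly as follows. This is a sketch based only on the documented flags, not the actual contents of main.py; the helper `str_to_bool`, the function `build_parser`, and the default value for --model are assumptions.

```python
import argparse


def str_to_bool(value: str) -> bool:
    """Convert 'true'/'false' strings so that '--headless true' parses to a bool."""
    lowered = value.lower()
    if lowered in ("true", "1", "yes"):
        return True
    if lowered in ("false", "0", "no"):
        return False
    raise argparse.ArgumentTypeError(f"expected true or false, got {value!r}")


def build_parser() -> argparse.ArgumentParser:
    parser = argparse.ArgumentParser(description="Run the web agent on REAL tasks.")
    parser.add_argument("--task_name", help="Full ID of a single task.")
    parser.add_argument("--task_type", help="Suite name; runs all of its tasks.")
    parser.add_argument("--headless", type=str_to_bool, default=False)
    parser.add_argument("--no-cache", action="store_true")  # stored as args.no_cache
    parser.add_argument("--model", default="gpt-4o-mini")   # assumed default
    return parser


args = build_parser().parse_args(
    ["--task_name", "webclones.networkin-3", "--no-cache", "--headless", "true"]
)
print(args.task_name, args.headless, args.no_cache, args.model)
```

Because --headless takes an explicit value rather than acting as a switch, a small converter like `str_to_bool` is needed; passing `type=bool` would treat any non-empty string, including "false", as True.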
