Raj's Web Agent

1. This project contains a modular web agent designed to operate within the AGI SDK REAL benchmark.
2. The agent uses a dynamic prompt-routing system to select the best strategy for a given task.

System Architecture Diagram:

https://miro.com/app/board/uXjVI86Rj0o=/?share_link_id=809744597370

Current repo contains

Designed the high-level project architecture

Orchestrator routing to prompts, agent, config, prompt_selector, main.

Designed Memory and integration with MongoDB

Prompt Routing

LLM as a Judge

Implemented prompting techniques from the literature, such as Chain-of-Thought prompting.

Prompt Creation for OMNIZON and NetworkIn.

Model testing

Setup and evaluation on realeval for OMNIZON and some NetworkIn tasks.

Possible Improvements

Integrating DSPy for better prompt creation and handling.

Using RL for post-training (for example GRPO or PPO) and continuous learning.

Testing with stronger LLMs, choosing a different LLM for each role, and optimising cost.

Building on better frameworks such as browser-use and Nova-act, and fine-tuning parts of a multimodal LLM.

Future improvements

The screenshot-capture algorithm and BrowserGym’s HighLevelActionSet feature do not sync properly.

We can build a better mapping between button tasks, element bids, and the action space by fine-tuning the prompts or by using more dedicated prompts that explain each website's workflow.

Integrating with more agentic frameworks for cost and speed optimisation.

Cost & Model Limitations

I am using the inexpensive gpt-4o-mini for everything, but models can be changed through config.py, and using multimodal reasoning models should significantly improve performance.

Post-training, or using GRPO with DSPy, could also improve performance significantly.
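
As a rough illustration of choosing a model per role, the sketch below shows one way a configuration data class could look. The class name `AgentConfig` and all of its field names are assumptions made for illustration; the actual contents of config.py may differ.

```python
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class AgentConfig:
    """Hypothetical config: one model name per role (field names are assumed)."""

    agent_model: str = "gpt-4o-mini"     # main agent that plans and acts
    selector_model: str = "gpt-4o-mini"  # small model that routes prompts
    judge_model: str = "gpt-4o-mini"     # LLM-as-a-judge evaluator


default_config = AgentConfig()
# Upgrade only the main agent and keep the cheaper model for the other roles.
stronger_config = replace(default_config, agent_model="gpt-4o")
```

Keeping the roles as separate fields lets a more expensive model be used only where it is likely to matter, which is one way to approach the cost trade-off mentioned above.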

🌟 Key Features

Modular Architecture: The agent's logic is separated into distinct components: a high-level Orchestrator (the project manager) and a focused Agent (the LLM specialist).
Dynamic Prompt Routing: Uses a small, fast LLM to analyze the task goal and dynamically load the correct "instruction manual" (prompt file) for the specific website (e.g., Omnizon, DashDish).
Chain-of-Thought Planning: The agent performs a "self-verification" step after creating a plan, critically reviewing it for logical flaws (like missing navigation steps) before execution begins.
Advanced Self-Correction & Recovery: The agent can detect when it's stuck in a repetitive failure loop (e.g., endless scrolling, trying to click a blocked element) and will change its strategy to recover.
Planning: Breaks down complex user goals into the smallest possible, single-action steps, which dramatically improves reliability, especially for multi-part UI interactions like selecting options in a dropdown.
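
The stuck-loop detection described in the features above can be approximated with a simple check over recent actions. This is a minimal sketch, not the repository's implementation: the function name `is_stuck`, the `window` parameter, and the action strings are all hypothetical.

```python
from typing import Sequence


def is_stuck(actions: Sequence[str], window: int = 3) -> bool:
    """Return True when the last `window` actions are all identical.

    Assumption: each action is recorded as a string such as "scroll(0, 500)".
    """
    if window < 2 or len(actions) < window:
        return False
    recent = actions[-window:]
    return all(action == recent[0] for action in recent)


history = ["click('12')", "scroll(0, 500)", "scroll(0, 500)", "scroll(0, 500)"]
if is_stuck(history):
    # A real agent would switch strategy here, e.g. ask the LLM to re-plan.
    print("Repetitive loop detected; requesting a new plan.")
```

An exact-match window only catches the simplest loops; alternating patterns (click, scroll, click, scroll) would need a different check.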

📂 Project Structure

The agent's source code is located entirely within the agiwebagent/ directory.

agiwebagent/
├── main.py                       # The main entry point to launch the agent.
├── requirements.txt              # Python dependencies.
└── agent_src/                    # The core source code for the agent.
    ├── __init__.py
    ├── agent.py                  # The "Specialist": Communicates with the LLM.
    ├── config.py                 # Simple configuration data class.
    ├── memory.py                 # Stores the history of actions for each step.
    ├── orchestrator.py           # The "Manager": Oversees the entire task lifecycle.
    ├── prompt_selector.py        # The "Strategy Advisor": Chooses the correct prompt file.
    ├── utils.py                  # Helper functions (e.g., image conversion).
    └── prompts/                  # Directory containing all specialized "brains".
        ├── __init__.py
        ├── dashdish_prompts.py   # The brain for the DashDish food delivery site.
        └── omnizon_prompts.py    # The brain for the Omnizon e-commerce site.
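
To illustrate how prompt_selector.py might map a task to one of the prompt modules in the tree, here is a sketch that replaces the LLM call with a plain keyword match so that it runs without an API key. `select_prompt_module`, `PROMPT_MODULES`, the dotted import paths, and the `None` fallback are hypothetical; only the file names dashdish_prompts.py and omnizon_prompts.py come from the tree.

```python
from typing import Optional

# Maps a site keyword to a prompt module (import paths are assumed).
PROMPT_MODULES = {
    "dashdish": "agent_src.prompts.dashdish_prompts",
    "omnizon": "agent_src.prompts.omnizon_prompts",
}


def select_prompt_module(task_name: str) -> Optional[str]:
    """Pick a prompt module from a task ID such as "webclones.dashdish-2".

    Keyword matching stands in for the LLM-based routing the agent uses.
    Returns None when no known site appears in the task name.
    """
    lowered = task_name.lower()
    for site, module in PROMPT_MODULES.items():
        if site in lowered:
            return module
    return None


print(select_prompt_module("webclones.dashdish-2"))
```

An LLM-based selector can route from a free-text goal rather than a task ID, which a keyword table like this cannot do.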

🛠️ Setup Instructions

Follow these steps from the root directory of the project (agiinc/).

Create and Activate a Virtual Environment
It's highly recommended to use a virtual environment to manage dependencies.

python -m venv agienv

source agienv/bin/activate

(On Windows, use agienv\Scripts\activate)

Install Dependencies
Install all the required Python packages from both the root requirements.txt and the agent's specific requirements.txt.

pip install -r requirements.txt

pip install -r agiwebagent/requirements.txt

Set Up Your API Key
The agent requires an OpenAI API key to function. Create a file named .env in the root agiinc/ directory.

Add your API key to this file:

OPENAI_API_KEY="sk-YourSecretAPIKeyHere"
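
How the agent reads this file is not shown here; a common approach is the python-dotenv package. The standard-library sketch below illustrates the idea with a hypothetical helper, `load_env_file`, that handles only simple KEY="value" lines.

```python
import os
from pathlib import Path


def load_env_file(path: str = ".env") -> dict:
    """Parse simple KEY=value lines and export them to os.environ.

    Hypothetical helper: skips blank lines and comments, strips optional
    quotes, and does not overwrite variables that are already set.
    """
    loaded = {}
    env_path = Path(path)
    if not env_path.is_file():
        return loaded
    for raw_line in env_path.read_text(encoding="utf-8").splitlines():
        line = raw_line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        key, value = key.strip(), value.strip().strip("\"'")
        loaded[key] = value
        os.environ.setdefault(key, value)
    return loaded


load_env_file()
if not os.environ.get("OPENAI_API_KEY"):
    print("OPENAI_API_KEY is not set; check the .env file in agiinc/.")
```

Checking for the key at startup gives a clear message before any task is launched, instead of a failure on the first LLM call.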

🚀 Running the Agent

All commands should be run from the root agiinc/ directory. The main script is located at agiwebagent/main.py.

Running a Single Task
To run a specific, named task, use the --task_name argument. This is perfect for debugging.

Example (networkin):

python agiwebagent/main.py --task_name webclones.networkin-3 --no-cache --headless true

Running a Full Task Suite
To run all tasks for a specific website (like all 10 omnizon tasks), use the --task_type argument. This is ideal for benchmarking.

Example (Run all networkin tasks):

python agiwebagent/main.py --task_type networkin --no-cache --headless true

Example (Run all Omnizon tasks):

python agiwebagent/main.py --task_type omnizon --no-cache --headless true

| Argument | Description | Example |
| --- | --- | --- |
| `--task_name` | Runs a single, specific task by its full ID. | `webclones.dashdish-2` |
| `--task_type` | Runs all tasks belonging to a specific benchmark suite. | `dashdish`, `omnizon` |
| `--headless` | `true` or `false`. Runs the browser in the background (`true`) or shows the UI (`false`). Default is `false`. | `--headless true` |
| `--no-cache` | Disables caching and forces the agent to re-run the task from scratch. Highly recommended for testing changes. | `--no-cache` |
| `--model` | Specifies the OpenAI model to use for the main agent. | `--model gpt-4o` |
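
The arguments listed above could be declared with argparse roughly as follows. This is a sketch reconstructed from the argument descriptions only, not the actual main.py: `build_parser`, `parse_bool`, and the default model value are assumptions.

```python
import argparse


def parse_bool(value: str) -> bool:
    """Convert the strings "true"/"false" (any case) to a bool."""
    lowered = value.lower()
    if lowered not in ("true", "false"):
        raise argparse.ArgumentTypeError(f"expected true or false, got {value!r}")
    return lowered == "true"


def build_parser() -> argparse.ArgumentParser:
    """Build a parser mirroring the documented flags (defaults are assumed)."""
    parser = argparse.ArgumentParser(description="Launch the web agent.")
    parser.add_argument("--task_name", help="Full task ID, e.g. webclones.dashdish-2")
    parser.add_argument("--task_type", help="Task suite, e.g. dashdish or omnizon")
    parser.add_argument("--headless", type=parse_bool, default=False)
    parser.add_argument("--no-cache", action="store_true", help="Re-run from scratch")
    parser.add_argument("--model", default="gpt-4o-mini", help="Model for the main agent")
    return parser


args = build_parser().parse_args(
    ["--task_type", "omnizon", "--no-cache", "--headless", "true"]
)
print(args.task_type, args.headless, args.no_cache, args.model)
```

Because `--headless` takes an explicit `true` or `false` value rather than acting as a switch, it needs a converter such as `parse_bool`; passing `type=bool` would treat any non-empty string, including "false", as true.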
