2. The agent uses a dynamic prompt-routing system to select the best strategy for a given task.
- Designed high level project architecture
- Orchestrator routing to prompts, agent, config, prompt_selector, main.
- Designed Memory and integration with MongoDB
- Prompt Routing
- LLM as a Judge
- Implemented prompting techniques from papers, such as Chain-of-Thought prompting.
- Prompt Creation for OMNIZON and NetworkIn.
- Model testing
- Set up and ran evaluation on realeval for OMNIZON and some NetworkIn tasks.
- Integrating DSPy for better prompt creation and handling.
- Using RL for post-training (possibly GRPO or PPO) and continuous learning.
- Testing with stronger LLMs and choosing a different LLM for each role to optimise cost.
- Building on better frameworks such as browser-use and Nova-act, and fine-tuning some parts of a multimodal LLM.
- The algorithm for capturing screenshots and BrowserGym's HighLevelActionSet feature don't sync properly.
- We can create a better mapping between button tasks, element bids, and the action space by fine-tuning prompts or by using a more dedicated prompt that explains the website workflow.
- Integrating with more agentic frameworks for cost and speed optimisation.
- I am using the inexpensive gpt-4o-mini for everything, but models can be changed through config.py; using multimodal reasoning models should significantly improve performance.
- Post-training, or using GRPO with DSPy, could improve performance significantly.
- Modular Architecture: The agent's logic is separated into distinct components: a high-level Orchestrator (the project manager) and a focused Agent (the LLM specialist).
- Dynamic Prompt Routing: Uses a small, fast LLM to analyze the task goal and dynamically load the correct "instruction manual" (prompt file) for the specific website (e.g., Omnizon, DashDish).
- Chain-of-Thought Planning: The agent performs a "self-verification" step after creating a plan, critically reviewing it for logical flaws (like missing navigation steps) before execution begins.
- Advanced Self-Correction & Recovery: The agent can detect when it's stuck in a repetitive failure loop (e.g., endless scrolling, trying to click a blocked element) and will change its strategy to recover.
- Planning: Breaks down complex user goals into the smallest possible, single-action steps, which dramatically improves reliability, especially for multi-part UI interactions like selecting options in a dropdown.
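
The dynamic prompt routing described above can be sketched roughly as follows. This is a minimal illustration, not the project's actual prompt_selector.py: the registry `PROMPT_MODULES`, the `DEFAULT_SITE` fallback, and the function names are all assumptions, and a simple keyword matcher stands in for the small, fast LLM so the example runs without an API key.

```python
from typing import Callable, Dict

# Hypothetical registry: site key -> prompt module path. The module names
# follow the prompts/ files in this README; the mapping itself is assumed.
PROMPT_MODULES: Dict[str, str] = {
    "omnizon": "agent_src.prompts.omnizon_prompts",
    "dashdish": "agent_src.prompts.dashdish_prompts",
}
DEFAULT_SITE = "omnizon"  # assumed fallback, not taken from the project


def keyword_classifier(goal: str) -> str:
    """Stand-in for the small LLM: return the first site key found in the goal."""
    lowered = goal.lower()
    for site in PROMPT_MODULES:
        if site in lowered:
            return site
    return ""


def select_prompt_module(
    goal: str, classify: Callable[[str], str] = keyword_classifier
) -> str:
    """Map a task goal to a prompt module path, falling back to the default."""
    site = classify(goal).strip().lower()
    # Guard against a classifier (e.g. an LLM) returning an unknown label.
    return PROMPT_MODULES.get(site, PROMPT_MODULES[DEFAULT_SITE])
```

In a real setup, `classify` would be replaced by a function that sends the goal to the routing model and returns its label; the guard on unknown labels keeps a malformed model reply from breaking the run.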
agiwebagent/
├── main.py                    # The main entry point to launch the agent.
├── requirements.txt           # Python dependencies.
└── agent_src/                 # The core source code for the agent.
    ├── __init__.py
    ├── agent.py               # The "Specialist": Communicates with the LLM.
    ├── config.py              # Simple configuration data class.
    ├── memory.py              # Stores the history of actions for each step.
    ├── orchestrator.py        # The "Manager": Oversees the entire task lifecycle.
    ├── prompt_selector.py     # The "Strategy Advisor": Chooses the correct prompt file.
    ├── utils.py               # Helper functions (e.g., image conversion).
    └── prompts/               # Directory containing all specialized "brains".
        ├── __init__.py
        ├── dashdish_prompts.py    # The brain for the DashDish food delivery site.
        └── omnizon_prompts.py     # The brain for the Omnizon e-commerce site.
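
memory.py is described as storing the history of actions, and the feature list mentions detecting repetitive failure loops. One way an action history could support that check is sketched below. The class name `ActionMemory`, the `window` threshold, and the "identical consecutive actions" rule are assumptions for illustration; the project's real detection logic may differ.

```python
from collections import deque
from typing import Deque


class ActionMemory:
    """Keeps the most recent actions and flags repetitive loops."""

    def __init__(self, window: int = 4) -> None:
        # `window` is a hypothetical threshold: how many identical
        # consecutive actions count as being stuck.
        self.window = window
        self.recent: Deque[str] = deque(maxlen=window)

    def record(self, action: str) -> None:
        """Store the latest action; the oldest one drops out automatically."""
        self.recent.append(action)

    def is_stuck(self) -> bool:
        """True only when the window is full and every action is identical."""
        return len(self.recent) == self.window and len(set(self.recent)) == 1
```

An orchestrator could call `record()` after each step and, when `is_stuck()` returns True, ask the LLM for a different strategy instead of repeating the same action.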
- Create and Activate a Virtual Environment
It's highly recommended to use a virtual environment to manage dependencies.
python -m venv agienv
source agienv/bin/activate
(On Windows, use agienv\Scripts\activate)
- Install Dependencies
Install all the required Python packages from both the root requirements.txt and the agent's specific requirements.txt.
pip install -r requirements.txt
pip install -r agiwebagent/requirements.txt
- Set Up Your API Key
The agent requires an OpenAI API key to function. Create a file named .env in the root agiinc/ directory.
Add your API key to this file:
OPENAI_API_KEY="sk-YourSecretAPIKeyHere"

Running a Single Task
To run a specific, named task, use the --task_name argument. This is perfect for debugging.
Example (networkin):
python agiwebagent/main.py --task_name webclones.networkin-3 --no-cache --headless true

Running a Full Task Suite
To run all tasks for a specific website (like all 10 omnizon tasks), use the --task_type argument. This is ideal for benchmarking.
Example (Run all networkin tasks):
python agiwebagent/main.py --task_type networkin --no-cache --headless true
Example (Run all Omnizon tasks):
python agiwebagent/main.py --task_type omnizon --no-cache --headless true

| Argument | Description | Example |
|---|---|---|
| --task_name | Runs a single, specific task by its full ID. | webclones.dashdish-2 |
| --task_type | Runs all tasks belonging to a specific benchmark suite. | dashdish, omnizon |
| --headless | true or false. Runs the browser in the background (true) or shows the UI (false). Default is false. | --headless true |
| --no-cache | Disables caching and forces the agent to re-run the task from scratch. Highly recommended for testing changes. | --no-cache |
| --model | Specifies the OpenAI model to use for the main agent. | --model gpt-4o |
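
For reference, the arguments above could be parsed with Python's standard argparse module along these lines. This is a sketch under assumptions, not the contents of main.py: the helper `str_to_bool`, the mutually exclusive grouping of --task_name and --task_type, and the gpt-4o-mini default for --model are illustrative choices.

```python
import argparse


def str_to_bool(value: str) -> bool:
    """Parse the 'true'/'false' strings accepted by --headless."""
    lowered = value.lower()
    if lowered in ("true", "1", "yes"):
        return True
    if lowered in ("false", "0", "no"):
        return False
    raise argparse.ArgumentTypeError(f"expected true or false, got {value!r}")


def build_parser() -> argparse.ArgumentParser:
    parser = argparse.ArgumentParser(description="Launch the web agent.")
    # Assumption: exactly one of --task_name / --task_type must be given.
    target = parser.add_mutually_exclusive_group(required=True)
    target.add_argument("--task_name", help="Full task ID, e.g. webclones.dashdish-2")
    target.add_argument("--task_type", help="Benchmark suite, e.g. omnizon")
    # Default is false (browser UI shown), as stated in the table.
    parser.add_argument("--headless", type=str_to_bool, default=False)
    # argparse stores this flag as args.no_cache.
    parser.add_argument("--no-cache", action="store_true")
    # Assumed default model; the real default lives in config.py.
    parser.add_argument("--model", default="gpt-4o-mini")
    return parser
```

With this sketch, `build_parser().parse_args(["--task_type", "omnizon", "--no-cache", "--headless", "true"])` yields a namespace whose `task_type`, `no_cache`, and `headless` fields mirror the command-line examples above.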