Enable AI agents to control your desktop (Windows, MacOS, Linux), mobile (Android, iOS) and HMI devices
Join the AskUI Discord.
- Introduction
- Installation
- Quickstart
- Further Documentation
- Contributing
- License
AskUI Vision Agent is a powerful automation framework that enables you and AI agents to control your desktop, mobile, and HMI devices and automate tasks, with support for multiple AI models, multi-platform compatibility, and enterprise-ready features.
(Demo video: AskUI Vision Agents for Enterprise)
Key Features
- Support for Windows, Linux, MacOS, Android and iOS device automation (Citrix supported)
- Support for single-step UI automation commands (RPA-like) as well as agentic intent-based instructions
- In-background automation on Windows machines (agent can create a second session; you do not have to watch it take over mouse and keyboard)
- Flexible model use (hot swap of models, see the sketch below) and infrastructure for reteaching of models (available on-premise)
- Secure deployment of agents in enterprise environments
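For example, the model hot-swapping mentioned in the list above can be done per command. The sketch below is only an illustration: the `model` keyword argument and the placeholder model names are assumptions here; see the Using Models guide linked further down for what is actually supported.

```python
from askui import VisionAgent

with VisionAgent() as agent:
    # Placeholder model identifiers; look up the supported names in the
    # "Using Models" guide before copying this.
    agent.click("Login button", model="<locator-model-name>")
    agent.act("Fill in the login form with the test credentials", model="<agentic-model-name>")
```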
```
pip install askui[all]
```

Requires Python >=3.10.
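To double-check that the package ended up in the environment you expect, a plain pip query is enough (standard pip command, nothing AskUI-specific):

```
pip show askui
```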
Agent OS is a device controller that allows agents to take screenshots, move the mouse, click, and type on the keyboard across any operating system. It is installed on a desktop OS but can also control connected mobile and HMI devices.
It offers powerful features like
- multi-screen support,
- support for all major operating systems (incl. Windows, MacOS and Linux),
- process visualizations,
- real Unicode character typing (see the sketch after this list),
- and more. Exciting features like application selection, in-background automation, and video streaming will be released soon.
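As a concrete illustration of the Unicode typing feature, the same `type` call used in the quickstart further down accepts non-ASCII text directly. The target element in this sketch is hypothetical:

```python
from askui import VisionAgent

with VisionAgent() as agent:
    # Hypothetical target element; the point is that the text below is typed
    # as real Unicode characters rather than keyboard-layout approximations.
    agent.click("Message input field")
    agent.type("Grüße aus München, 价格: 25 €, done ✓")
```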
Linux

AMD64:

```
curl -L -o /tmp/AskUI-Suite-Latest-User-Installer-Linux-AMD64-Web.run https://files.askui.com/releases/Installer/Latest/AskUI-Suite-Latest-User-Installer-Linux-AMD64-Web.run
bash /tmp/AskUI-Suite-Latest-User-Installer-Linux-AMD64-Web.run
```

ARM64:

```
curl -L -o /tmp/AskUI-Suite-Latest-User-Installer-Linux-ARM64-Web.run https://files.askui.com/releases/Installer/Latest/AskUI-Suite-Latest-User-Installer-Linux-ARM64-Web.run
bash /tmp/AskUI-Suite-Latest-User-Installer-Linux-ARM64-Web.run
```

MacOS

```
curl -L -o /tmp/AskUI-Suite-Latest-User-Installer-MacOS-ARM64-Web.run https://files.askui.com/releases/Installer/Latest/AskUI-Suite-Latest-User-Installer-MacOS-ARM64-Web.run
bash /tmp/AskUI-Suite-Latest-User-Installer-MacOS-ARM64-Web.run
```

Double-click wherever the cursor currently is:
```python
from askui import VisionAgent

with VisionAgent() as agent:
    agent.click(button="left", repeat=2)
```

By default, the agent works within the context of the selected display, which defaults to the primary display.
Run the script with `python <file path>`, e.g. `python test.py`, to see if it works.
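If you work with multiple screens, you can point the agent at a specific one. The `display` argument below is an assumption based on the display-selection behavior described above; verify the exact parameter name in the API reference.

```python
from askui import VisionAgent

# Assumption: the constructor accepts a display index (1 = primary display).
# Check the API reference for the exact parameter name before relying on it.
with VisionAgent(display=2) as agent:
    agent.click(button="left", repeat=2)
```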
In order to let AI agents control your devices, you need to be able to connect to an AI model (provider). We host some models ourselves and support several others, e.g., Anthropic, OpenRouter, and Hugging Face, out of the box. If you want to use a model provider or model that is not supported, you can easily plug in your own (see Custom Models).
For this example, we will use AskUI as the model provider to get started quickly.
Sign up at hub.askui.com to:
- Activate your free trial by signing up (no credit card required)
- Get your workspace ID and access token
Linux & MacOS

```
export ASKUI_WORKSPACE_ID=<your-workspace-id-here>
export ASKUI_TOKEN=<your-token-here>
```

Windows PowerShell

```
$env:ASKUI_WORKSPACE_ID="<your-workspace-id-here>"
$env:ASKUI_TOKEN="<your-token-here>"
```
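If you prefer to load credentials at runtime (for example from a secrets manager), you can set the same environment variables from Python before constructing the agent. This is plain Python, not an AskUI-specific API:

```python
import os

# Placeholder values; in practice these would come from your secrets store.
# They must be set before the agent is constructed.
os.environ["ASKUI_WORKSPACE_ID"] = "<your-workspace-id-here>"
os.environ["ASKUI_TOKEN"] = "<your-token-here>"
```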
```python
from askui import VisionAgent

with VisionAgent() as agent:
    # Give complex instructions to the agent (may have problems with virtual displays out
    # of the box, so make sure there is no browser opened on a virtual display that the
    # agent may not see)
    agent.act(
        "Look for a browser on the current device (checking all available displays, "
        "making sure window has focus),"
        " open a new window or tab and navigate to https://docs.askui.com"
        " and click on 'Search...' to open search panel. If the search panel is already "
        "opened, empty the search field so I can start a fresh search."
    )
    agent.type("Introduction")
    # Locates elements by text (you can also use images, natural language descriptions,
    # coordinates, etc. to describe what to click on)
    agent.click(
        "Documentation > Tutorial > Introduction",
    )
    first_paragraph = agent.get(
        "What does the first paragraph of the introduction say?"
    )
    print("\n--------------------------------")
    print("FIRST PARAGRAPH:\n")
    print(first_paragraph)
    print("--------------------------------\n\n")
```

Run the script with `python <file path>`, e.g. `python test.py`.
If you see a lot of logs and the first paragraph of the introduction in the console, congratulations! You've successfully let AI agents control your device to automate a task! If you have any issues, please check the documentation or join our Discord for support.
Aside from our official documentation, we also have some additional guides and examples under the docs folder that you may find useful, for example:
- Chat - How to interact with agents through a chat
- Direct Tool Use - How to use the tools, e.g., clipboard, the Agent OS etc.
- Extracting Data - How to extract data from the screen and documents
- MCP - How to use MCP servers to extend the capabilities of an agent
- Observability - Logging and reporting
- Telemetry - Which data we gather and how to disable it
- Using Models - How to use different models including how to register your own custom models
We'd love your help! Contributions, ideas, and feedback are always welcome. A proper contribution guide is coming soon; stay tuned!
This project is licensed under the MIT License - see the LICENSE file for details.