Conversational Computer Use and Task Execution
Bridging natural language dialogue with system action
Abstract
The gap between conversational AI and computer use is large: one answers questions, the other performs actions. We are researching how to bridge this gap by enabling an AI system to understand natural language task descriptions, formulate multi-step plans, execute actions on live systems, and handle failures through dialogue.
Problem Statement
Current systems separate dialogue and action. A user might describe a task in natural language, but must then switch to GUI automation or explicit API calls to carry it out. The AI cannot clarify ambiguous instructions mid-task, handle failures by asking questions, or adapt plans based on observed system state. The interaction model is brittle and non-conversational.
Approach
We use a cognitive architecture with three components: a dialogue manager for natural language interaction, a planner that generates and revises task plans, and an executor that performs actions through tool use (GUI automation, API calls, file operations). All three share a common working memory that represents the current task state.
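As a rough illustration of how three components might share one working memory, consider the minimal sketch below. The class names, fields, and the placeholder planning logic are all assumptions made for illustration and are not taken from our implementation.

```python
from __future__ import annotations

from dataclasses import dataclass, field
from typing import Any


@dataclass
class WorkingMemory:
    """Shared task state (hypothetical layout) read and written by all components."""
    goal: str = ""
    plan: list[str] = field(default_factory=list)
    observations: list[dict[str, Any]] = field(default_factory=list)


class DialogueManager:
    def __init__(self, memory: WorkingMemory) -> None:
        self.memory = memory

    def record_goal(self, utterance: str) -> None:
        # A real dialogue manager would parse and clarify the utterance first.
        self.memory.goal = utterance


class Planner:
    def __init__(self, memory: WorkingMemory) -> None:
        self.memory = memory

    def make_plan(self) -> None:
        # Placeholder plan; a real planner would search or query a learned policy.
        goal = self.memory.goal
        self.memory.plan = [f"inspect: {goal}", f"act: {goal}"]


class Executor:
    def __init__(self, memory: WorkingMemory) -> None:
        self.memory = memory

    def step(self) -> bool:
        """Run the next planned action; return False once the plan is empty."""
        if not self.memory.plan:
            return False
        action = self.memory.plan.pop(0)
        # Stand-in for a real tool call: record what was "done".
        self.memory.observations.append({"action": action, "status": "ok"})
        return True


# Usage: all three components hold a reference to the same memory object.
memory = WorkingMemory()
dialogue, planner, executor = DialogueManager(memory), Planner(memory), Executor(memory)
dialogue.record_goal("archive old logs")
planner.make_plan()
while executor.step():
    pass
```

Because every component mutates the same object, a change made by one (for example, a new observation from the executor) is immediately visible to the others without message passing.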
Task understanding
Natural language task descriptions are often ambiguous or incomplete. The dialogue manager identifies missing information through clarification questions before planning begins. For example, 'move these files' triggers 'which files?' and 'to where?'. The clarified intent is compiled into a formal task representation.
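One simple way to picture the clarification step is slot filling: the task representation has required fields, and each unfilled field maps to a question. The sketch below assumes a hypothetical `MoveTask` record and question table; the actual task representation may be considerably richer.

```python
from __future__ import annotations

from dataclasses import dataclass
from typing import Optional


@dataclass
class MoveTask:
    """Hypothetical formal representation of a 'move files' request."""
    sources: Optional[list[str]] = None
    destination: Optional[str] = None


# One clarification question per required slot, asked in this order.
QUESTIONS = {
    "sources": "which files?",
    "destination": "to where?",
}


def clarification_questions(task: MoveTask) -> list[str]:
    """Return the questions for every slot that is still unfilled."""
    return [q for slot, q in QUESTIONS.items() if getattr(task, slot) is None]


# Usage: 'move these files' fills neither slot, so both questions are asked.
task = MoveTask()
pending = clarification_questions(task)
```

Planning would begin only once `clarification_questions` returns an empty list.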
Plan generation and repair
The planner generates action sequences using a combination of learned policies and explicit search. Plans are hierarchical: high-level goals decompose into subgoals, which in turn decompose into primitive actions. When execution fails (for example, a GUI element is not found or an API call returns an error), the planner can repair the plan by substituting alternative actions or by backtracking to a previous subgoal.
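The sketch below illustrates hierarchical decomposition with repair by substitution only; it does not model learned policies or backtracking to an earlier subgoal. The goal names, the `METHODS` and `ALTERNATIVES` tables, and the `try_action` callback are hypothetical.

```python
from __future__ import annotations

from typing import Callable

# Hypothetical method table: a compound goal maps to an ordered list of subgoals.
METHODS = {
    "send_report": ["export_report", "deliver_report"],
}

# Hypothetical alternatives: a leaf subgoal lists primitive actions to try in order.
ALTERNATIVES = {
    "export_report": ["click_export_button", "call_export_api"],
    "deliver_report": ["send_email_api"],
}


def run_goal(
    goal: str,
    try_action: Callable[[str], bool],
    trace: list[tuple[str, bool]],
) -> bool:
    """Decompose `goal` recursively; at the leaves, substitute alternatives on failure."""
    if goal in METHODS:
        # all() short-circuits, so execution stops at the first failed subgoal.
        return all(run_goal(sub, try_action, trace) for sub in METHODS[goal])
    for action in ALTERNATIVES.get(goal, [goal]):
        ok = try_action(action)
        trace.append((action, ok))
        if ok:
            return True
    return False


# Usage: the GUI export fails, so the planner substitutes the API call.
failing = {"click_export_button"}
trace: list[tuple[str, bool]] = []
succeeded = run_goal("send_report", lambda a: a not in failing, trace)
```

The `trace` list records each attempted primitive action with its outcome, which is the kind of information a repair step or a later dialogue turn could draw on.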
Execution and observation
Primitive actions use a tool abstraction layer: GUI actions (click, type, scroll), API calls, file operations, and shell commands. After each action, the executor captures observations: screenshots, API responses, file contents. These observations feed back into working memory and may trigger plan revisions.
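A tool abstraction layer of this kind can be pictured as a registry that dispatches named actions and wraps every result, including failures, in a uniform observation. The `Observation` and `ToolRegistry` names below are assumptions for illustration; real GUI or API tools would be registered the same way as the file tool shown here.

```python
from __future__ import annotations

from dataclasses import dataclass
from typing import Any, Callable


@dataclass
class Observation:
    """Uniform record of what happened after one primitive action."""
    tool: str
    ok: bool
    payload: Any  # result on success, error description on failure


class ToolRegistry:
    def __init__(self) -> None:
        self._tools: dict[str, Callable[..., Any]] = {}
        self.history: list[Observation] = []  # stand-in for working memory

    def register(self, name: str, fn: Callable[..., Any]) -> None:
        self._tools[name] = fn

    def execute(self, name: str, **kwargs: Any) -> Observation:
        """Run a tool and capture its result or its error as an observation."""
        try:
            obs = Observation(name, True, self._tools[name](**kwargs))
        except Exception as exc:  # unknown tool, bad arguments, or tool failure
            obs = Observation(name, False, repr(exc))
        self.history.append(obs)
        return obs


def read_file(path: str) -> str:
    """Example file-operation tool."""
    with open(path, encoding="utf-8") as handle:
        return handle.read()


registry = ToolRegistry()
registry.register("read_file", read_file)
```

Turning exceptions into failed observations rather than letting them propagate means the planner sees errors as ordinary state that it can react to.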
Safety and sandboxing
Unconstrained system access is dangerous. We implement a capability model in which each task has an explicit permission set: which directories and APIs it may touch, and whether it may perform destructive operations or is limited to read-only access. High-risk actions require user confirmation. All actions are logged and, where possible, reversible.
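To make the permission check concrete, the sketch below covers file paths only and returns one of three decisions. The `Permissions` record, the decision strings, and the treatment of every destructive action as high-risk are assumptions; the sketch also normalizes paths lexically and does not resolve symbolic links, which a real sandbox would need to handle.

```python
from __future__ import annotations

import posixpath
from dataclasses import dataclass


@dataclass(frozen=True)
class Permissions:
    """Hypothetical per-task permission set (file access only)."""
    allowed_dirs: tuple[str, ...]
    allow_destructive: bool = False


def _inside(root: str, path: str) -> bool:
    """True if `path` lies under `root` after lexical normalization."""
    root_n, path_n = posixpath.normpath(root), posixpath.normpath(path)
    if not (posixpath.isabs(root_n) and posixpath.isabs(path_n)):
        return False  # refuse relative paths rather than guess a base directory
    return posixpath.commonpath([root_n, path_n]) == root_n


def check_file_action(perms: Permissions, path: str, destructive: bool) -> str:
    """Return 'allow', 'confirm' (ask the user first), or 'deny'."""
    if not any(_inside(root, path) for root in perms.allowed_dirs):
        return "deny"
    if destructive:
        return "confirm" if perms.allow_destructive else "deny"
    return "allow"


# Usage: a task scoped to one directory, with destructive actions permitted.
perms = Permissions(allowed_dirs=("/home/u/docs",), allow_destructive=True)
decision = check_file_action(perms, "/home/u/docs/report.txt", destructive=True)
```

Normalizing before comparison matters: without it, a path such as `/home/u/docs/../.ssh/key` would appear to sit inside the permitted directory.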
Failure recovery dialogue
When autonomous recovery fails, the system falls back to dialogue. It describes the failure, suggests alternatives, and asks for guidance. This hybrid approach, autonomous when possible and collaborative when needed, balances efficiency with safety and user control.
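The fallback can be sketched as a loop that exhausts autonomous options before composing a message for the user. The `Escalation` record, the fixed option list, and the `attempt` callback (which returns `None` on success or an error string) are hypothetical.

```python
from __future__ import annotations

from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class Escalation:
    """What the dialogue manager would present to the user."""
    message: str
    options: list[str]


def run_with_fallback(
    action: str,
    repairs: list[str],
    attempt: Callable[[str], Optional[str]],
) -> Optional[Escalation]:
    """Try the action, then each repair; return None on success, else an Escalation."""
    errors: list[str] = []
    for candidate in [action, *repairs]:
        error = attempt(candidate)
        if error is None:
            return None  # recovered autonomously, no dialogue needed
        errors.append(f"{candidate}: {error}")
    return Escalation(
        message=(
            "I could not complete this step. Attempts: "
            + "; ".join(errors)
            + ". How would you like to proceed?"
        ),
        options=["retry", "skip this step", "abort the task"],
    )


# Usage: every attempt fails, so the system escalates to the user.
escalation = run_with_fallback(
    "click_save", ["press_ctrl_s"], lambda a: "element not found"
)
```

Including the attempted alternatives in the message lets the user see what was already tried before deciding how to proceed.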