R06 · Pilot system

Conversational Computer Use and Task Execution

Bridging natural language dialogue with system action

agents, computer-use, planning, tool-use

Abstract

Conversational AI and computer use remain far apart: one answers questions, the other performs actions. We are researching how to bridge this gap by enabling an AI system to understand natural language task descriptions, formulate multi-step plans, execute actions on live systems, and handle failures through dialogue.

Problem Statement

Current systems separate dialogue and action. A user might describe a task in natural language, but then must switch to GUI automation or explicit API calls. The AI cannot clarify ambiguous instructions mid-task, handle failures by asking questions, or adapt plans based on observed system state. The interaction model is brittle and non-conversational.

Approach

We use a cognitive architecture with three components: a dialogue manager for natural language interaction, a planner that generates and revises task plans, and an executor that performs actions through tool use (GUI automation, API calls, file operations). All three share a common working memory that represents the current task state.
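
The three-component layout can be illustrated with a minimal sketch. All class, field, and method names here (WorkingMemory, DialogueManager, Planner, Executor, and their methods) are assumptions made for illustration and are not taken from the system itself; the planner and executor bodies are placeholders that only show how the components communicate through shared state.

```python
from dataclasses import dataclass, field
from typing import Any


@dataclass
class WorkingMemory:
    """Hypothetical shared task state read and written by all components."""
    goal: str = ""
    plan: list[str] = field(default_factory=list)
    observations: list[Any] = field(default_factory=list)


class DialogueManager:
    def __init__(self, memory: WorkingMemory) -> None:
        self.memory = memory

    def receive(self, utterance: str) -> None:
        # Record the user's request as the current goal.
        self.memory.goal = utterance


class Planner:
    def __init__(self, memory: WorkingMemory) -> None:
        self.memory = memory

    def plan(self) -> None:
        # Placeholder: a real planner would decompose the goal into steps.
        self.memory.plan = [f"step for: {self.memory.goal}"]


class Executor:
    def __init__(self, memory: WorkingMemory) -> None:
        self.memory = memory

    def step(self) -> bool:
        """Run the next planned action; return False when the plan is empty."""
        if not self.memory.plan:
            return False
        action = self.memory.plan.pop(0)
        # Placeholder: a real executor would call a tool here.
        self.memory.observations.append(f"done: {action}")
        return True


memory = WorkingMemory()
dialogue, planner, executor = DialogueManager(memory), Planner(memory), Executor(memory)
dialogue.receive("archive old logs")
planner.plan()
executor.step()
```

No component holds a reference to another in this sketch; each one reads and writes only the shared memory object, which is one simple way to keep the task state in a single place.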

Task understanding

Natural language task descriptions are often ambiguous or incomplete. The dialogue manager identifies missing information through clarification questions before planning begins. For example, 'move these files' triggers 'which files?' and 'to where?'. The clarified intent is compiled into a formal task representation.
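
One simple way to picture the clarification step is slot filling: the task representation has required fields, and each empty field maps to a question. The sketch below follows the 'move these files' example; the MoveFilesTask type, its field names, and the question table are hypothetical and only illustrate the idea.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class MoveFilesTask:
    """Hypothetical formal representation of a 'move files' request."""
    source: Optional[str] = None
    destination: Optional[str] = None


# Assumed mapping from each required slot to its clarification question.
QUESTIONS = {
    "source": "Which files?",
    "destination": "To where?",
}


def clarification_questions(task: MoveFilesTask) -> list[str]:
    """Return one question per slot that is still unfilled."""
    return [q for slot, q in QUESTIONS.items() if getattr(task, slot) is None]


task = MoveFilesTask()
print(clarification_questions(task))   # both slots are missing
task.source = "~/Downloads/*.pdf"
print(clarification_questions(task))   # only the destination is missing
```

Planning would begin only once the returned list is empty.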

Plan generation and repair

The planner generates action sequences using a combination of learned policies and explicit search. Plans are hierarchical: high-level goals decompose into subgoals, which in turn decompose into primitive actions. When execution fails (for example, a GUI element is not found or an API call returns an error), the planner can repair the plan by substituting alternative actions or by backtracking to a previous subgoal.
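
The hierarchical structure and the substitution form of repair can be sketched as a small tree walk. This is an illustrative sketch only: the Goal and Primitive types and the execute function are assumed names, actions are stubbed as callables that return True on success, and backtracking to an earlier subgoal is left out for brevity.

```python
from dataclasses import dataclass, field
from typing import Callable, Union


@dataclass
class Primitive:
    """A leaf action, with optional substitutes to try if it fails."""
    name: str
    run: Callable[[], bool]  # returns True on success
    alternatives: list["Primitive"] = field(default_factory=list)


@dataclass
class Goal:
    """An inner node that succeeds when all children succeed, in order."""
    name: str
    children: list[Union["Goal", Primitive]]


def execute(node: Union[Goal, Primitive], log: list[str]) -> bool:
    if isinstance(node, Primitive):
        # Repair by substitution: try the action, then each alternative.
        for candidate in [node, *node.alternatives]:
            if candidate.run():
                log.append(f"ok: {candidate.name}")
                return True
            log.append(f"failed: {candidate.name}")
        return False
    # all() short-circuits, so execution stops at the first failed child.
    return all(execute(child, log) for child in node.children)


log: list[str] = []
save = Primitive(
    "click Save button",
    run=lambda: False,  # simulate "element not found"
    alternatives=[Primitive("press Ctrl+S", run=lambda: True)],
)
plan = Goal("save document", [Goal("open file", [Primitive("open", lambda: True)]), save])
succeeded = execute(plan, log)
```

In this run the simulated click fails, the keyboard shortcut is substituted, and the top-level goal still succeeds.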

Execution and observation

Primitive actions use a tool abstraction layer: GUI actions (click, type, scroll), API calls, file operations, and shell commands. After each action, the executor captures observations: screenshots, API responses, file contents. These observations feed back into working memory and may trigger plan revisions.
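
A tool abstraction layer of this kind can be sketched as a common interface that every tool implements, with each call returning an observation. The Tool protocol, Observation record, ReadFile tool, and Executor below are hypothetical names chosen for this example; only a file-read tool is shown, and GUI, API, and shell tools would implement the same interface.

```python
from dataclasses import dataclass
from pathlib import Path
from typing import Any, Protocol


@dataclass
class Observation:
    """What the executor records after each action."""
    tool: str
    ok: bool
    data: Any


class Tool(Protocol):
    name: str

    def run(self, **kwargs: Any) -> Observation: ...


class ReadFile:
    name = "read_file"

    def run(self, **kwargs: Any) -> Observation:
        try:
            return Observation(self.name, True, Path(kwargs["path"]).read_text())
        except OSError as exc:
            # Failures become observations too, so the planner can react.
            return Observation(self.name, False, str(exc))


class Executor:
    def __init__(self, tools: list[Tool]) -> None:
        self.tools = {tool.name: tool for tool in tools}
        self.observations: list[Observation] = []

    def act(self, tool_name: str, **kwargs: Any) -> Observation:
        obs = self.tools[tool_name].run(**kwargs)
        self.observations.append(obs)  # stands in for the working-memory update
        return obs


executor = Executor([ReadFile()])
result = executor.act("read_file", path="does-not-exist.txt")
```

Returning failures as data rather than raising exceptions is a design choice of this sketch: it keeps the feedback path to the planner uniform for successful and failed actions.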

Safety and sandboxing

Unconstrained system access is dangerous. We implement a capability model where each task has an explicit permission set (which directories, which APIs, destructive vs read-only). High-risk actions require user confirmation. All actions are logged and reversible where possible.
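
As a rough illustration of a per-task permission set, the sketch below checks a file action against allowed directories and a destructive-action flag. The Capabilities type, the check_file_action function, and the three-way allow/confirm/deny result are assumptions for this example; API permissions and action logging are omitted.

```python
import posixpath
from dataclasses import dataclass
from pathlib import PurePosixPath


@dataclass(frozen=True)
class Capabilities:
    """Hypothetical permission set attached to a single task."""
    allowed_dirs: tuple[str, ...] = ()
    allow_destructive: bool = False  # False means read-only access


def check_file_action(caps: Capabilities, path: str, destructive: bool) -> str:
    """Return 'allow', 'confirm' (ask the user first), or 'deny'."""
    # Collapse '..' segments so a path cannot escape an allowed directory.
    # Symlinks are not resolved here; a real check would need to handle them.
    target = PurePosixPath(posixpath.normpath(path))
    if not any(target.is_relative_to(d) for d in caps.allowed_dirs):
        return "deny"
    if destructive:
        # Destructive actions are high-risk: never run without confirmation.
        return "confirm" if caps.allow_destructive else "deny"
    return "allow"


caps = Capabilities(allowed_dirs=("/home/u/docs",), allow_destructive=True)
decision = check_file_action(caps, "/home/u/docs/report.txt", destructive=True)
```

Here a delete inside the permitted directory yields 'confirm' rather than 'allow', mirroring the rule that high-risk actions require user confirmation.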

Failure recovery dialogue

When autonomous recovery fails, the system falls back to dialogue. It describes the failure, suggests alternatives, and asks the user for guidance. This hybrid approach, autonomous when possible and collaborative when needed, balances efficiency with safety and user control.
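
The three parts of the fallback message (describe the failure, suggest alternatives, ask for guidance) can be sketched as a small report object. The FailureReport type, its fields, and the numbered-reply convention in resolve are hypothetical; a real system would presumably interpret free-form replies through the dialogue manager instead.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class FailureReport:
    step: str
    error: str
    alternatives: list[str]

    def to_message(self) -> str:
        """Describe the failure, list the options, and ask for guidance."""
        lines = [f"I could not complete '{self.step}': {self.error}."]
        if self.alternatives:
            lines.append("Options:")
            lines += [f"  {i}. {alt}" for i, alt in enumerate(self.alternatives, 1)]
        lines.append("How would you like to proceed?")
        return "\n".join(lines)


def resolve(report: FailureReport, reply: str) -> Optional[str]:
    """Map a numeric reply to an alternative; None means 'ask again'."""
    reply = reply.strip()
    if reply.isdecimal() and 1 <= int(reply) <= len(report.alternatives):
        return report.alternatives[int(reply) - 1]
    return None


report = FailureReport(
    step="click Save",
    error="element not found",
    alternatives=["press Ctrl+S", "use the File menu"],
)
print(report.to_message())
chosen = resolve(report, "2")
```

The chosen alternative would then be handed back to the planner as the next action to try.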

Related Research