Multimodal Interaction

Human-AI interaction that combines multiple input channels: visual context (what the user sees/points at), natural language (voice or text), and gestural input (cursor position, selection).

Modes

| Mode | Description |
| --- | --- |
| Visual | Screenshot, screen region, or pointer target, captured automatically |
| Semantic | Structured data (entities) extracted from visual elements |
| Speech/text | Natural language commands, often shorthand |
| Gestural | Cursor position, selection, and hover as implicit context signals |

Show and Tell Pattern

From ai-pointer-deepmind-2026: combining a pointing gesture with a shorthand phrase (“Fix this”) provides richer context than a text-only prompt. The visual and gestural channels replace parts of the verbal prompt.
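One way the pattern can be realized is deictic grounding: resolve the word “this” to the on-screen element under the cursor. A minimal sketch, assuming hypothetical helper names and a simple bounding-box representation (none of these are from the source):

```python
def element_at(cursor, elements):
    """Return the name of the element whose bounding box contains the cursor."""
    x, y = cursor
    for name, (left, top, width, height) in elements.items():
        if left <= x < left + width and top <= y < top + height:
            return name
    return None

def ground_utterance(utterance, cursor, elements):
    """Replace the deictic 'this' with the element the user is pointing at."""
    target = element_at(cursor, elements)
    if target and "this" in utterance:
        return utterance.replace("this", target)
    return utterance

# "Fix this" + cursor over the submit button -> "Fix button#submit"
elements = {"button#submit": (400, 300, 80, 24)}  # (left, top, width, height)
print(ground_utterance("Fix this", (420, 310), elements))
```

The grounded utterance is what a text-only prompt would have had to spell out explicitly; the gesture supplies it for free.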