Multimodal Interaction
Human-AI interaction that combines multiple input channels: visual context (what the user sees or points at), natural language (voice or text), and gestural input (cursor position, selection). A fourth, derived channel is semantic: structured data extracted from the visual context.
Modes
| Mode | Description |
|---|---|
| Visual | Screenshot, screen region, or pointer target, captured automatically |
| Semantic | Structured data extracted from visual elements (entities) |
| Speech/text | Natural language commands, often shorthand |
| Gestural | Cursor position, selection, hover as implicit context signals |
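As a rough illustration, the four modes in the table can be pictured as fields of a single per-turn context record. This is a minimal sketch: the class name `InteractionContext`, its field names, and the `active_modes` helper are all hypothetical and chosen for illustration, not taken from any particular system.

```python
from __future__ import annotations

from dataclasses import dataclass, field
from typing import Optional


@dataclass
class InteractionContext:
    """Hypothetical container for one multimodal turn (illustrative names only)."""

    # Visual: reference to an automatically captured screenshot or screen region
    screenshot_ref: Optional[str] = None
    # Semantic: structured entities extracted from the visual elements
    entities: list[dict] = field(default_factory=list)
    # Speech/text: the natural language command, often shorthand
    utterance: str = ""
    # Gestural: cursor position and current selection as implicit signals
    cursor_xy: Optional[tuple[int, int]] = None
    selection: Optional[str] = None

    def active_modes(self) -> list[str]:
        """Return the names of the modes that carry any signal in this turn."""
        modes = []
        if self.screenshot_ref is not None:
            modes.append("visual")
        if self.entities:
            modes.append("semantic")
        if self.utterance.strip():
            modes.append("speech/text")
        if self.cursor_xy is not None or self.selection is not None:
            modes.append("gestural")
        return modes
```

For example, `InteractionContext(utterance="Fix this", cursor_xy=(10, 20)).active_modes()` reports the speech/text and gestural modes as active, which is the combination the shorthand-plus-pointing case relies on.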
Show and Tell Pattern
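From ai-pointer-deepmind-2026: pointing at something while giving a shorthand phrase (“Fix this”) provides richer context than a text-only prompt. The visual and gestural channels replace parts of the verbal prompt: the pointer supplies the referent of “this”, so the user does not have to describe it in words.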
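The substitution described above can be sketched as a small function that fills a deictic word ("this", "that", "it") with a description of the pointer target. This is only an assumed illustration of the idea, not the method from the cited source: the function name `expand_command`, the word list, and the bracketed output format are all hypothetical.

```python
import re
from typing import Optional

# Deictic words that the pointer target can stand in for (illustrative list).
DEICTIC = re.compile(r"\b(this|that|it)\b", re.IGNORECASE)


def expand_command(utterance: str, pointer_target: Optional[str]) -> str:
    """Merge a shorthand utterance with a description of what the user points at.

    Hypothetical helper: replaces the first deictic word with the target,
    or appends the target if the utterance contains no deictic word.
    """
    if pointer_target is None:
        # No gestural context: the utterance has to stand on its own.
        return utterance
    # A function replacement avoids backslash escapes in the target being interpreted.
    expanded, count = DEICTIC.subn(lambda _m: f"[{pointer_target}]", utterance, count=1)
    if count == 0:
        return f"{utterance} (target: {pointer_target})"
    return expanded


print(expand_command("Fix this", "function parse_date in utils.py"))
```

Without the pointer target, the same intent would need to be spelled out in the prompt itself; with it, two words are enough.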
Related Concepts
- ambient-ai-interfaces — multimodal interaction is a common mechanism for ambient AI