Multimodal Interaction

Human-AI interaction that combines multiple input channels: visual context (what the user sees/points at), natural language (voice or text), and gestural input (cursor position, selection).

Modes

| Mode | Description |
| --- | --- |
| Visual | Screenshot, screen region, or pointer target, captured automatically |
| Semantic | Structured data (entities) extracted from visual elements |
| Speech/text | Natural language commands, often shorthand |
| Gestural | Cursor position, selection, and hover as implicit context signals |

Show and Tell Pattern

From ai-pointer-deepmind-2026: combining a pointing gesture with a shorthand phrase (“Fix this”) provides richer context than a text-only prompt. The visual and gestural channels replace parts of the verbal prompt.
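One way the pattern can be realized is deictic grounding: resolve the word “this” to the on-screen element under the cursor. A minimal sketch, assuming hypothetical helper names and a simple bounding-box representation (none of these are from the source):

```python
def element_at(cursor, elements):
    """Return the name of the element whose bounding box contains the cursor."""
    x, y = cursor
    for name, (left, top, width, height) in elements.items():
        if left <= x < left + width and top <= y < top + height:
            return name
    return None

def ground_utterance(utterance, cursor, elements):
    """Replace the deictic 'this' with the element the user is pointing at."""
    target = element_at(cursor, elements)
    if target and "this" in utterance:
        return utterance.replace("this", target)
    return utterance

# "Fix this" + cursor over the submit button -> "Fix button#submit"
elements = {"button#submit": (400, 300, 80, 24)}  # (left, top, width, height)
print(ground_utterance("Fix this", (420, 310), elements))
```

The grounded utterance is what a text-only prompt would have had to spell out explicitly; the gesture supplies it for free.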