LLM Model Routing

The practice of dynamically selecting which LLM (or LLM provider) handles a given request at inference time, rather than hard-coding a single model. A router sits as a proxy between the application and the model providers.

Why It Matters

A single capable model (e.g. GPT-4o, Claude Opus) is overkill for simple queries. Routing low-complexity requests to cheaper models can cut inference costs by 50–70% with no quality degradation on those tasks. The theoretical ideal is the Oracle router: for each query, pick the cheapest model that still meets the quality threshold. Real routers approximate this oracle.

Two Fundamental Strategies

Research distinguishes two distinct approaches that are often conflated:

Routing — select a single model per query upfront, based on predicted suitability. One decision, one model call.

Cascading — try models sequentially, starting from the cheapest. Evaluate the response quality after each call; escalate to a stronger model only if the threshold isn't met. Multiple calls are possible, but usually only cheap ones.

A 2025 ICML paper (arxiv 2410.10347) develops a unified optimal strategy that integrates both: it proves the optimality of an existing routing strategy and derives a novel optimal cascading strategy, framing the two as special cases of the same cost–quality trade-off problem.
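The "one decision, one call" character of routing can be sketched in a few lines. The model table and quality scores below are purely illustrative (not from any cited paper); a real router predicts the required quality from the prompt rather than receiving it directly.

```python
# Illustrative per-model table: cost per call and an assumed probability
# of producing an acceptable answer. Values are made up, sorted by cost.
MODELS = [
    ("small",  1.0,  0.70),
    ("medium", 4.0,  0.85),
    ("large",  20.0, 0.97),
]

def route(required_quality: float) -> str:
    """Routing: one upfront decision. Pick the cheapest model whose
    assumed quality meets the requirement (the Oracle router knows this
    exactly; real routers estimate it from prompt features)."""
    for name, _cost, quality in MODELS:    # already sorted by cost
        if quality >= required_quality:
            return name
    return MODELS[-1][0]                    # nothing qualifies: use strongest
```

Cascading differs precisely in deferring this decision: it pays for a cheap call first and only escalates after judging the result.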

Theoretical Framing: Cost-Quality Space

RouterBench (arxiv 2403.12031) formalises routing as a problem in a 2D cost–quality space. Each model and each routing strategy is a point or curve in this space. Key construct: the non-decreasing convex hull — the Pareto-optimal frontier of cost-quality pairs. A good router should push the operating point onto this hull. RouterBench evaluates against 405,000 inference outcomes across 64 tasks and 11+ models.
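The Pareto-optimal frontier underlying this framing is easy to compute for a fixed set of (cost, quality) points. A minimal sketch (the non-decreasing convex hull in RouterBench is additionally the convex upper envelope of this set, which randomised mixing between strategies can attain; this sketch stops at the dominated-point filter):

```python
def pareto_frontier(points):
    """Points are (cost, quality) pairs. Keep those not dominated by any
    other point, i.e. no rival that is at most as expensive and at least
    as good, with a strict improvement on one axis."""
    frontier = []
    for c, q in points:
        dominated = any(
            (c2 <= c and q2 >= q) and (c2 < c or q2 > q)
            for c2, q2 in points
        )
        if not dominated:
            frontier.append((c, q))
    return sorted(frontier)
```

A router operating at a dominated point (e.g. a mid-cost model beaten on both axes by a cheaper one) is leaving either money or quality on the table.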

Routing Signals

Routers score requests on dimensions including:

  • Task complexity — does this need deep reasoning or is retrieval/classification sufficient?
  • Specificity — narrow factual vs. broad generative
  • Task type — coding, math, summarisation, creative writing each favour different models
  • Reasoning patterns — detected from prompt structure
  • Custom metadata — caller-supplied HTTP headers to force or hint model selection
  • Cost budget — per-agent or per-period spend limits

manifest-llm-router evaluates 23 dimensions per prompt. LLMRank (arxiv 2510.01234) uses human-readable prompt features — task type, reasoning patterns, complexity indicators — trained on RouterBench. Research (arxiv 2505.12601) shows that simple kNN over these features can outperform complex learned routers.
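The kNN result above can be illustrated with a toy feature-space router. The feature vectors, history, and model labels below are invented for the sketch; a real implementation would extract features like those listed (task type, complexity indicators) and train on a benchmark such as RouterBench.

```python
import math
from collections import Counter

# Hypothetical routing history: a feature vector per past query
# (e.g. [complexity score, is_code, is_math]) and the cheapest model
# that solved it. Illustrative values only.
HISTORY = [
    ([0.2, 0, 0], "small"),
    ([0.3, 0, 0], "small"),
    ([0.7, 1, 0], "medium"),
    ([0.8, 0, 1], "large"),
    ([0.9, 1, 1], "large"),
]

def knn_route(features, k=3):
    """Route to the model favoured by the k nearest historical queries."""
    by_dist = sorted(HISTORY, key=lambda ex: math.dist(features, ex[0]))
    votes = Counter(model for _, model in by_dist[:k])
    return votes.most_common(1)[0][0]
```

No gradient training, no learned router: just a distance metric over interpretable features, which is part of why this baseline is hard to beat.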

Seminal Work: FrugalGPT

FrugalGPT (arxiv 2305.05176) is the foundational routing paper. It introduces a generation judger: a model that scores the quality of a cheaper LLM's response and decides whether to escalate. Models are invoked sequentially until the quality threshold is met. This is the prototype of modern cascading.
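The judger-driven loop can be sketched as follows. Here `models` and `judge` are hypothetical stand-ins: each model is a callable returning a response string, and the judge returns a quality score in [0, 1] (FrugalGPT trains a small scorer for this role).

```python
def frugal_cascade(models, judge, prompt, threshold=0.8):
    """FrugalGPT-style cascade sketch: invoke models cheapest-first and
    return the first response whose judged quality clears the threshold.
    The last (strongest) model's answer is returned unconditionally."""
    response = None
    for name, call in models:              # ordered cheapest -> priciest
        response = call(prompt)
        if judge(prompt, response) >= threshold:
            return name, response
    return models[-1][0], response
```

Note that the judge sees only the prompt and the candidate response; it never needs to know what a stronger model would have said.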

Related: Speculative Decoding

Not routing, but related: in speculative decoding, a small, fast "drafter" model predicts a sequence of future tokens, and the large target model verifies them in parallel, so multiple tokens are generated per large-model step. This reduces latency rather than routing queries to different models.

Speculative Cascades (Google Research) combines both: it runs speculative decoding within a cascade, using a deferral rule to decide token by token whether to keep the small model's output or defer to the large model. The result is better quality-per-cost than either technique alone.
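The draft-and-verify loop behind these techniques can be sketched as a toy greedy variant. This is a simplification: production speculative decoding uses an accept/reject sampling rule that preserves the target model's output distribution, and runs the verification in a single batched forward pass. `draft_next` and `target_next` are hypothetical next-token functions.

```python
def speculative_step(draft_next, target_next, prefix, k=4):
    """One greedy speculative-decoding step (toy sketch): the drafter
    proposes k tokens; the target checks them (in parallel on real
    hardware, sequentially here) and accepts the longest agreeing
    prefix, emitting its own token at the first disagreement."""
    # Drafter proposes k tokens autoregressively.
    drafted, ctx = [], list(prefix)
    for _ in range(k):
        tok = draft_next(ctx)
        drafted.append(tok)
        ctx.append(tok)
    # Target verifies: one large-model "pass" yields up to k tokens here
    # (real implementations also get a bonus token from the verify pass).
    accepted, ctx = [], list(prefix)
    for tok in drafted:
        expected = target_next(ctx)
        if tok != expected:
            accepted.append(expected)       # target's correction
            break
        accepted.append(tok)
        ctx.append(tok)
    return accepted
```

When drafter and target agree often, each expensive target pass advances the output by several tokens instead of one.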

Proxy Architecture

The standard deployment pattern: the router exposes an OpenAI- (or Anthropic-) compatible API endpoint. Existing applications connect to it unchanged; the router forwards requests to the selected backend. Zero-friction adoption for apps already using those SDKs.
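The pass-through nature of the proxy can be sketched with plain dictionaries. The backend names, URLs, and model identifiers below are hypothetical; the point is that only the routing decision and the `model` field change, while the rest of the OpenAI-shaped payload is forwarded untouched.

```python
# Hypothetical backend registry: tier -> (endpoint URL, concrete model).
BACKENDS = {
    "cheap":    ("https://api.example-mini.com/v1/chat/completions", "mini-8b"),
    "frontier": ("https://api.example-big.com/v1/chat/completions",  "big-pro"),
}

def forward(payload: dict, tier: str) -> tuple[str, dict]:
    """Return (backend URL, rewritten payload). The application sent a
    standard OpenAI-style request; the router rewrites only `model`."""
    url, model = BACKENDS[tier]
    out = dict(payload)
    out["model"] = model
    return url, out
```

Because the request and response shapes are unchanged, applications built on the OpenAI or Anthropic SDKs only need their base URL pointed at the router.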

Provider Tiers

Modern routers support a spectrum:

  • Frontier API: OpenAI, Anthropic, Google, xAI
  • Mid-tier API: Mistral, DeepSeek, Qwen
  • Local inference: Ollama, LM Studio, llama.cpp (GGUF)
  • Subscription reuse: ChatGPT Plus, Claude Max

Fallback Chains

Routers maintain ordered fallback lists. If the primary model is unavailable or rate-limited, the next model in the chain takes the request — providing resilience without application-level retry logic. Distinct from cascading: fallbacks trigger on failure, cascading triggers on insufficient quality.
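The failure-triggered (not quality-triggered) nature of fallback can be made concrete. `backends` is a hypothetical ordered list of (name, callable) pairs; any exception (rate limit, outage, timeout) moves the request down the chain.

```python
def call_with_fallback(backends, prompt):
    """Try each backend in order; advance only on failure, never on low
    quality. This is what distinguishes a fallback chain from a cascade."""
    last_err = None
    for name, call in backends:
        try:
            return name, call(prompt)
        except Exception as err:            # rate limited, down, timed out...
            last_err = err
    raise RuntimeError("all backends failed") from last_err
```

The application sees a single successful call; the retry logic lives entirely in the router.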

Cost Controls

Enforcement mechanisms: spend limits (token- or dollar-denominated) per agent and per time window, with either alerting (e.g. email) or hard blocking (HTTP 429). This prevents runaway spend in agentic loops.
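A windowed spend limiter of this kind is small to sketch. The class below is a hypothetical illustration (dollar-denominated, sliding window); in a real router, `BudgetExceeded` would surface as an HTTP 429 response to the caller.

```python
import time

class BudgetExceeded(Exception):
    """Maps to an HTTP 429 response in a real router."""

class SpendLimiter:
    def __init__(self, limit_usd, window_s=86_400):
        self.limit, self.window = limit_usd, window_s
        self.events = []                    # (timestamp, cost) pairs

    def charge(self, cost_usd, now=None):
        """Record a charge, or raise if it would breach the window cap."""
        now = time.time() if now is None else now
        # Drop events that have aged out of the sliding window.
        self.events = [(t, c) for t, c in self.events if now - t < self.window]
        if sum(c for _, c in self.events) + cost_usd > self.limit:
            raise BudgetExceeded("spend limit reached for this window")
        self.events.append((now, cost_usd))
```

Calling `charge` before each model invocation gives hard blocking; calling it after, with alerting on the exception, gives the softer email-style policy.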

Key Tools and Frameworks

  • manifest-llm-router — open-source router, 23-dimension scoring, OpenAI + Anthropic compatible, cloud + self-hosted
  • RouteLLM (github.com/lm-sys/RouteLLM) — LM-sys framework for serving and evaluating routers
  • RouterBench — benchmark dataset and evaluation framework for comparing routing strategies

Relationship to Agentic Workflows

In multi-step agentic pipelines, different steps have very different complexity profiles — a web search summarisation step vs. a multi-hop reasoning step. Routing lets each step use the right model without the developer manually branching on model selection. The cost savings compound across long agent runs.

See also: agentic-workflows-mcp