latent routing in small web agents

most turns a web agent handles are dull. a click, a lookup, a tidy little answer. you do not need the biggest model in the world to do that, and paying for it on every turn is how budgets quietly bleed out.

the seam

we put a small, fast pass in front. it answers what it can and, when a turn looks hard, it hands off to the larger model. the trick is keeping the handoff invisible: the user sees one assistant, not two.
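
a minimal sketch of that seam, not our actual code. everything named here is hypothetical: the `Assistant` class, the convention that the small pass returns `None` to ask for a handoff, and the two stand-in models at the bottom. the point is only that the caller gets one reply with no trace of which model wrote it.

```python
from dataclasses import dataclass
from typing import Callable, Optional

# hypothetical types: the small pass may decline by returning None,
# the large model always answers.
SmallPass = Callable[[str], Optional[str]]
LargeModel = Callable[[str], str]


@dataclass
class Assistant:
    small: SmallPass
    large: LargeModel

    def reply(self, turn: str) -> str:
        # try the fast pass first; None is our assumed "too hard" signal
        answer = self.small(turn)
        if answer is None:
            answer = self.large(turn)
        # plain text either way, so the handoff never shows
        return answer


# stand-in models for illustration only
def tiny(turn: str) -> Optional[str]:
    return "done" if len(turn) < 20 else None


def big(turn: str) -> str:
    return "long answer"


assistant = Assistant(small=tiny, large=big)
```

returning `None` is one choice among several. a richer signal, such as a reason for the handoff, would work too, as long as it never reaches the user.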

what counts as hard:

the turn needs a tool the small pass cannot drive
the question spans more context than the small pass holds well
the small pass is not confident, and says so
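
the three checks above can be sketched as one predicate. this is an illustration under assumptions, not the production rule: the `Turn` fields, the tool names, and both thresholds are made up for the example, and the confidence score is assumed to be self-reported by the small pass.

```python
from dataclasses import dataclass, field

# all three constants are placeholder assumptions, not measured values
SMALL_TOOLS = frozenset({"click", "lookup"})  # tools the small pass can drive
SMALL_CONTEXT_LIMIT = 4_000                   # tokens the small pass holds well
MIN_CONFIDENCE = 0.7                          # below this, the small pass defers


@dataclass
class Turn:
    text: str
    tools_needed: set[str] = field(default_factory=set)
    context_tokens: int = 0
    confidence: float = 1.0  # assumed to come from the small pass itself


def is_hard(turn: Turn) -> bool:
    # 1. the turn needs a tool the small pass cannot drive
    needs_other_tool = not turn.tools_needed <= SMALL_TOOLS
    # 2. the question spans more context than the small pass holds well
    too_much_context = turn.context_tokens > SMALL_CONTEXT_LIMIT
    # 3. the small pass is not confident, and says so
    unsure = turn.confidence < MIN_CONFIDENCE
    return needs_other_tool or too_much_context or unsure
```

any one check is enough to escalate. that errs toward the large model, which costs money on a false alarm but protects quality.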

what it bought us

median latency fell, because most turns never wait on the big model
cost per session fell with it, since escalation is the exception not the rule
quality held, because the hard turns still go to the model that can do them
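
the cost claim is simple arithmetic, sketched here under one assumption: every turn pays for the small pass, and only escalated turns pay for the large model on top. the function names are hypothetical and the numbers in the usage lines are placeholders, not our measurements.

```python
def expected_cost_per_turn(small_cost: float, large_cost: float,
                           escalation_rate: float) -> float:
    """average cost of a turn when a fraction of turns escalate."""
    if not 0.0 <= escalation_rate <= 1.0:
        raise ValueError("escalation_rate must be between 0 and 1")
    # every turn pays the small pass; escalated turns also pay the large model
    return small_cost + escalation_rate * large_cost


def break_even_rate(small_cost: float, large_cost: float) -> float:
    """escalation rate above which routing costs more than always going large."""
    # solve small + p * large = large for p
    return 1.0 - small_cost / large_cost


# placeholder numbers: small pass costs 1 unit, large model costs 20 units
routed = expected_cost_per_turn(1.0, 20.0, 0.25)  # 6.0
always_large = 20.0
```

with those placeholder costs, routing stays cheaper until roughly 95% of turns escalate. once the small pass is cheap relative to the large one, the saving is mostly a question of how rare escalation is.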

no magic. just routing the easy turns away from the expensive path, and being honest about which turns are easy.