most turns a web agent handles are dull. a click, a lookup, a tidy little answer. you do not need the biggest model in the world to do that, and paying for it on every turn is how budgets quietly bleed out.
the seam
we put a small fast pass in front. it answers what it can and, when a turn looks hard, hands off to the larger model. the trick is making the handoff invisible: the user sees one assistant, not two.
what counts as hard:
- the turn needs a tool the small pass cannot drive
- the question spans more context than the small pass holds well
- the small pass is not confident, and says so
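a minimal sketch of those three checks as a router, in python. every name and threshold here (`SMALL_TOOLS`, `SMALL_CONTEXT_TOKENS`, `MIN_CONFIDENCE`, `Turn`, `Draft`, `answer_turn`) is hypothetical, made up for illustration, and the two models are passed in as plain callables. it is one way the seam could look, not a record of how ours is built.

```python
from __future__ import annotations

from dataclasses import dataclass, field
from typing import Callable

# all values below are assumptions for this sketch, not real settings
SMALL_TOOLS = {"click", "lookup"}   # tools the small pass can drive
SMALL_CONTEXT_TOKENS = 4_000        # context the small pass holds well
MIN_CONFIDENCE = 0.7                # below this, the small pass defers


@dataclass
class Turn:
    text: str
    tools_needed: set[str] = field(default_factory=set)
    context_tokens: int = 0


@dataclass
class Draft:
    answer: str
    confidence: float  # self-reported by the small pass, 0.0 to 1.0


def answer_turn(
    turn: Turn,
    small: Callable[[Turn], Draft],
    big: Callable[[Turn], str],
) -> str:
    """return one answer string, whichever model produced it."""
    # 1. the turn needs a tool the small pass cannot drive
    if turn.tools_needed - SMALL_TOOLS:
        return big(turn)
    # 2. the turn spans more context than the small pass holds well
    if turn.context_tokens > SMALL_CONTEXT_TOKENS:
        return big(turn)
    # 3. the small pass tries, and escalates if it is not confident
    draft = small(turn)
    if draft.confidence < MIN_CONFIDENCE:
        return big(turn)
    return draft.answer
```

the caller only ever gets a string back, which is what keeps the handoff invisible: nothing in the return value says which model answered.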
what it bought us
- median latency fell, because most turns never wait on the big model
- cost per session fell with it, since escalation is the exception, not the rule
- quality held, because the hard turns still go to the model that can do them
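the cost claim is simple arithmetic, sketched below under one assumption: the small pass runs on every turn and the big model runs only on the escalated fraction. the function names and the numbers in the usage note are hypothetical, picked to show the shape of the tradeoff, not measured from anything.

```python
def expected_cost_per_turn(
    small_cost: float, big_cost: float, escalation_rate: float
) -> float:
    """average cost of a turn when the small pass always runs first.

    escalation_rate is the fraction of turns handed to the big model.
    """
    if not 0.0 <= escalation_rate <= 1.0:
        raise ValueError("escalation_rate must be between 0 and 1")
    return small_cost + escalation_rate * big_cost


def break_even_rate(small_cost: float, big_cost: float) -> float:
    """escalation rate at which routing costs the same as always using big."""
    return 1.0 - small_cost / big_cost
```

with made-up costs of 1 unit for the small pass and 20 for the big model, escalating a quarter of turns averages 6 units per turn against 20 for sending everything to the big model. routing only stops paying once escalation climbs to the break-even rate, which in this example is 95% of turns.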
no magic. just routing the easy turns away from the expensive path, and being honest about which turns are easy.