Latency isn’t just model time. It’s tool calls, retries, serialization, and the “thinking” loop you accidentally created with your prompting. Agents that feel fast are almost never the ones with the fastest model; they’re the ones with the fewest unnecessary steps.
Where the time actually goes
A typical agent interaction involves several stages: parsing the user input, retrieving relevant context, calling the LLM for reasoning, executing tool calls, and formatting the response. Each stage has its own latency profile.
Model inference is the stage most teams focus on, but it’s often not the bottleneck. A well-optimized model call takes 1–3 seconds. But if the agent makes three sequential tool calls at 500ms each, plus a context retrieval at 300ms, you’ve already added 1.8 seconds on top of the model time. And if any of those calls fails and retries, its cost doubles.
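To make the arithmetic concrete, here is the same math in integer milliseconds, using the figures above:

```python
# Latency added around the model call, per the figures above (milliseconds).
tool_calls_ms = 3 * 500      # three sequential tool calls at 500ms each
retrieval_ms = 300           # one context retrieval
overhead_ms = tool_calls_ms + retrieval_ms
assert overhead_ms == 1800   # 1.8s on top of the 1-3s model call

# A single failed-and-retried tool call pays its cost a second time.
overhead_with_retry_ms = overhead_ms + 500
assert overhead_with_retry_ms == 2300
```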
Set a time budget
Start by budgeting time per step. If your user expects a response in under 5 seconds, work backward: 2 seconds for model inference, 1 second for tool calls, 500ms for context retrieval, 500ms for everything else. That’s your budget, and the unallocated second is headroom for retries and variance.
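A budget like this can live in code as a simple sketch. The stage names and figures here mirror the example above and are illustrative, not prescriptive:

```python
# Hypothetical per-stage time budget (milliseconds), working backward
# from a 5-second end-to-end target.
TARGET_MS = 5000
BUDGET_MS = {
    "model_inference": 2000,
    "tool_calls": 1000,
    "context_retrieval": 500,
    "everything_else": 500,
}

# The 1000ms left unallocated is slack for retries and variance.
assert sum(BUDGET_MS.values()) <= TARGET_MS
```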
If your budget allows three tool calls, decide what happens when the agent hits that limit. Your system should degrade gracefully instead of timing out unpredictably. Helix handles auto-scaling and health checks so latency stays within budget even under load, but you still need to enforce the per-request budget within your agent logic.
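One way to enforce a per-request budget is a deadline object the agent checks before each step. This is a minimal sketch; the 100ms cutoff and the `Deadline` helper are assumptions, not part of any particular framework:

```python
import time

class Deadline:
    """Tracks a per-request time budget so the agent can degrade gracefully."""

    def __init__(self, budget_s: float):
        self.expires_at = time.monotonic() + budget_s

    def remaining(self) -> float:
        return max(0.0, self.expires_at - time.monotonic())

def run_tools(deadline: Deadline, tool_calls):
    """Run tool calls until the budget runs out, then return partial results."""
    results = []
    for call in tool_calls:
        if deadline.remaining() < 0.1:  # hypothetical cutoff: not enough time left
            break                       # degrade: answer from what we have so far
        results.append(call())
    return results
```

Stopping early with partial results lets the agent answer with what it has, rather than failing the whole request at an unpredictable point.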
Reduce steps, not just speed
The best optimization is often fewer steps. A slightly smarter plan up front beats five extra tool calls later.
Consider three strategies. First, precompute context so the agent doesn’t need to look it up at runtime; Cortex auto-injects relevant memory before your agent runs, eliminating a retrieval step. Second, batch tool calls when possible: if the agent needs data from three sources, call them in parallel instead of sequentially. Third, reduce the number of LLM round-trips by writing instructions that help the model get it right on the first try.
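The parallel-call strategy is easy to sketch with `asyncio.gather`. The three fetchers below are hypothetical stand-ins for real tool calls, with `asyncio.sleep` standing in for network latency:

```python
import asyncio

# Hypothetical tool calls against three independent data sources.
async def fetch_profile(user_id: str) -> dict:
    await asyncio.sleep(0.5)  # stand-in for a 500ms network call
    return {"user_id": user_id}

async def fetch_orders(user_id: str) -> list:
    await asyncio.sleep(0.5)
    return []

async def fetch_preferences(user_id: str) -> dict:
    await asyncio.sleep(0.5)
    return {}

async def gather_context(user_id: str):
    # Sequential: ~1.5s total. Parallel: ~0.5s, the slowest single call.
    return await asyncio.gather(
        fetch_profile(user_id),
        fetch_orders(user_id),
        fetch_preferences(user_id),
    )

profile, orders, prefs = asyncio.run(gather_context("u-123"))
```

This only works when the calls are truly independent; if one call’s input depends on another’s output, they must stay sequential.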
Streaming changes the perception
Raw latency and perceived latency are different things. A 4-second response that appears all at once feels slower than a 4-second response that streams token by token. Streaming gives users a signal that the agent is working, which significantly improves the experience.
For tool calls that take time, provide status updates. “Searching the database…” or “Processing your request…” keeps the user informed and patient.
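Interleaving status events with streamed tokens might look like the sketch below. `stream_llm` and `send` are hypothetical stand-ins for your model client and transport, not real APIs:

```python
from typing import Callable, Iterator

def stream_llm(prompt: str) -> Iterator[str]:
    # Hypothetical stand-in for a streaming model call.
    yield from ["Here ", "is ", "the ", "answer."]

def respond(prompt: str, send: Callable[[dict], None]) -> None:
    # Status events keep the user informed during slow, non-streaming stages.
    send({"type": "status", "text": "Searching the database…"})
    # ... tool call happens here ...
    send({"type": "status", "text": "Generating response…"})
    for token in stream_llm(prompt):
        send({"type": "token", "text": token})  # user sees output immediately

events: list[dict] = []
respond("example", events.append)
```

The key design point is that the client receives *something* at every stage, so there is never a multi-second silence.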
Cold starts vs. warm requests
Auto-scaling introduces cold start latency. When a new replica spins up, it needs to load the model, initialize connections, and warm caches. Helix minimizes cold starts to under 50ms, but it’s still a factor in your budget.
For latency-sensitive applications, keep at least one replica warm at all times. For cost-sensitive applications, accept occasional cold starts and optimize the cold start path instead.
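Optimizing the cold start path usually means moving expensive initialization to startup instead of the first request. A minimal sketch, where `load_model` and `connect_db` are hypothetical stand-ins for your real initialization:

```python
import time

def load_model():
    time.sleep(0.05)  # hypothetical stand-in for a heavy model load
    return object()

def connect_db():
    return object()  # hypothetical stand-in for opening a connection pool

_model = None
_db = None

def warm_up() -> None:
    """Run once at replica startup so the first request pays no init cost."""
    global _model, _db
    _model = load_model()
    _db = connect_db()

def handle_request(payload: str) -> str:
    # The request path assumes warm_up() already ran: no lazy init here.
    assert _model is not None and _db is not None
    return f"handled {payload}"

warm_up()
```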
Measure latency at every stage
You can’t optimize what you don’t measure. Instrument every stage of your agent’s execution: input parsing, context retrieval, model inference, tool execution, and response formatting. Track p50, p95, and p99 latencies for each stage. The aggregate number hides the bottleneck; the per-stage breakdown reveals it.
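A per-stage timer can be as small as the sketch below. The nearest-rank percentile is a rough approximation for illustration; in production you would feed these samples to a real metrics library instead:

```python
import time
from collections import defaultdict
from contextlib import contextmanager

class StageTimer:
    """Records per-stage latencies so you can see where each turn spends time."""

    def __init__(self):
        self.samples = defaultdict(list)  # stage name -> list of durations (s)

    @contextmanager
    def stage(self, name: str):
        start = time.monotonic()
        try:
            yield
        finally:  # record even when the stage raises
            self.samples[name].append(time.monotonic() - start)

    def percentile(self, name: str, p: float) -> float:
        # Nearest-rank percentile: fine for a sketch, crude for production.
        xs = sorted(self.samples[name])
        idx = min(len(xs) - 1, int(round(p / 100 * (len(xs) - 1))))
        return xs[idx]

timer = StageTimer()
with timer.stage("context_retrieval"):
    time.sleep(0.01)  # stand-in for the real stage
with timer.stage("model_inference"):
    time.sleep(0.02)
```

Wrapping each stage in a context manager keeps the instrumentation out of the business logic while still capturing the per-stage breakdown.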