The Latency Illusion: Building Responsive AI Agents with Async & Streaming
There’s a failure mode that kills AI products faster than hallucinations or bad UI: the Spinning Wheel of Death.
When a user asks a chatbot a question, they start a mental timer:
- 0.1 seconds: instant
- 1.0 second: it’s thinking
- 10.0 seconds: it’s broken (and they leave)
Those numbers aren’t random. They’re classic UX response-time thresholds (Jakob Nielsen’s 0.1/1/10-second limits). But in enterprise AI—where agents query databases, search documents, and check policies—10-second responses are common.
If you present that as a 10-second loading spinner, your product will lose users.
The good news: people are surprisingly patient when they can see progress.
That’s the Latency Illusion. You don’t always need the agent to finish faster. You need it to start responding sooner.
Here’s how Product Managers and Engineers can work together to build responsive AI agents on Databricks.
The Metric That Matters: TTFT
If you want the system to feel fast, stop measuring only total latency (time to finish). Start measuring Time-to-First-Token (TTFT).
TTFT is the time between the user pressing Enter and the first visible text appearing.
Scenario A: The AI thinks for 10 seconds, then dumps a 500-word paragraph.
User reaction: “It’s broken… oh, there it is. That was annoying.”
Scenario B: The AI starts typing after 0.5 seconds, and finishes 10 seconds later.
User reaction: “It’s answering me immediately.”
Same total time. Completely different experience.
Scenario B feels like a conversation. Scenario A feels like a batch job.
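The difference is easy to make concrete with a small stdlib-only simulation (the timings are hypothetical stand-ins, not real model latencies): both scenarios deliver the same text, but their TTFT differs by an order of magnitude.

```python
import time

def batch_response(text, think_s=0.3):
    """Scenario A: 'think' for the full duration, then emit everything at once."""
    time.sleep(think_s)
    yield text

def streamed_response(text, first_token_s=0.01, per_token_s=0.01):
    """Scenario B: start emitting almost immediately, one token at a time."""
    time.sleep(first_token_s)
    for token in text.split():
        yield token + " "
        time.sleep(per_token_s)

def measure_ttft(gen):
    """Return (time_to_first_token, total_time) for a token generator."""
    start = time.perf_counter()
    first = None
    for _ in gen:
        if first is None:
            first = time.perf_counter() - start
    total = time.perf_counter() - start
    return first, total

ttft_a, total_a = measure_ttft(batch_response("token " * 20))
ttft_b, total_b = measure_ttft(streamed_response("token " * 20))
print(f"A: TTFT={ttft_a:.2f}s, total={total_a:.2f}s")
print(f"B: TTFT={ttft_b:.2f}s, total={total_b:.2f}s")
```

Run it and the two totals come out comparable, while Scenario B’s first token arrives almost instantly.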
The Technical Bottleneck: The Traffic Jam
Why do agents feel slow? Because many are written sequentially.
Imagine an agent answering a customer’s order question. To respond, it must:
- Retrieve policy: vector search for return policy (2s)
- Check status: SQL query for order status (3s)
- Reason: the LLM thinks (1s)
- Generate: the LLM writes the answer (4s)
Total: 2 + 3 + 1 + 4 = 10 seconds.
In a standard architecture, the user watches a spinner for the entire 10 seconds. That’s the traffic jam: the SQL query can’t start until retrieval finishes, even though the two calls don’t depend on each other.
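The addition is easy to verify with a sketch that simulates the four stages as blocking calls (the sleeps are scaled to 0.1× the article’s numbers and stand in for real I/O):

```python
import time

def retrieve_policy():
    time.sleep(0.2)  # stands in for the ~2 s vector search
    return "returns accepted within 30 days"

def check_order_status():
    time.sleep(0.3)  # stands in for the ~3 s SQL query
    return "shipped"

def llm_reason():
    time.sleep(0.1)  # stands in for the ~1 s of LLM reasoning

def llm_generate():
    time.sleep(0.4)  # stands in for the ~4 s of answer generation
    return "Your order has shipped; returns are accepted within 30 days."

# Sequential: each step blocks the next, so the delays add up
start = time.perf_counter()
policy = retrieve_policy()
status = check_order_status()
llm_reason()
answer = llm_generate()
elapsed = time.perf_counter() - start
print(f"sequential pipeline: {elapsed:.2f}s")  # ~1.0s, i.e. 2+3+1+4 scaled down
```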
The Solution: Non-Blocking Architectures
To get a “Scenario B” experience on Databricks, you typically combine two patterns:
- Asynchronous tools to reduce real wait time
- Token streaming to reduce perceived wait time
1) Async Tools: Doing Two Things at Once
Async I/O lets the agent do useful work while external systems respond. Instead of waiting for vector search to finish before starting SQL, you fire both requests at the same time.
Conceptual pattern (inside an async function; retrieve_docs_sync and query_sql_sync are your existing blocking tools):
import asyncio

async def run_tools(query: str):
    # Fire both requests at the same time; to_thread keeps the
    # blocking tool calls off the event loop
    docs_task = asyncio.to_thread(retrieve_docs_sync, query)
    sql_task = asyncio.to_thread(
        query_sql_sync, f"SELECT * FROM orders WHERE q='{query}'"
    )  # use bound parameters in production, not f-strings
    # Wait for the slowest one, not the sum of both
    docs, sql_result = await asyncio.gather(docs_task, sql_task)
    return docs, sql_result
Now the “tool phase” becomes:
Max(2s, 3s) instead of 2s + 3s.
New total (example): Max(2s, 3s) + 1s + 4s = 8 seconds.
That’s a 20% improvement without buying more compute—just better orchestration.
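A runnable sketch of the same idea, with asyncio.sleep standing in for the two tool calls (again scaled to 0.1× the article’s numbers):

```python
import asyncio
import time

async def retrieve_docs(query: str) -> list:
    await asyncio.sleep(0.2)  # stands in for the ~2 s vector search
    return ["return-policy doc"]

async def query_sql(query: str) -> dict:
    await asyncio.sleep(0.3)  # stands in for the ~3 s SQL lookup
    return {"order_status": "shipped"}

async def tool_phase(query: str):
    # Fire both coroutines concurrently; gather waits for the slowest one
    return await asyncio.gather(retrieve_docs(query), query_sql(query))

start = time.perf_counter()
docs, sql_result = asyncio.run(tool_phase("where is my order?"))
elapsed = time.perf_counter() - start
print(f"tool phase: {elapsed:.2f}s")  # ~max(0.2, 0.3), not 0.2 + 0.3
```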
2) Streaming: The “Typewriter Effect”
Now we address the remaining 8 seconds.
Even if the agent still needs time to finish, you don’t want users staring at a blank screen. You want the response to start flowing immediately.
On Databricks Model Serving, you do this with token streaming: keep the connection open and send chunks to the UI as the model produces them.
Client-side streaming pattern (requires the databricks-sdk and openai packages):
from databricks.sdk import WorkspaceClient

w = WorkspaceClient()
client = w.serving_endpoints.get_open_ai_client()

stream = client.responses.create(
    model="YOUR_AGENT_ENDPOINT",
    input="Explain TTFT.",
    stream=True,  # <--- the critical flag
)

for event in stream:
    # Show the user new text the moment it arrives
    if event.type == "response.output_text.delta":
        print(event.delta, end="", flush=True)
Now TTFT drops below 1 second, and the user sees progress right away. The remaining “thinking time” is hidden behind the typewriter effect.
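On the serving side, streaming is conceptually just a generator: emit each chunk the moment the model produces it instead of buffering the full answer. A minimal sketch (hypothetical helper names; fake_llm_tokens stands in for a real model stream) that frames tokens as Server-Sent Events:

```python
import json
from typing import Iterator

def fake_llm_tokens() -> Iterator[str]:
    # Stand-in for tokens coming back from a model; replace with a real stream
    yield from ["TTFT ", "is ", "time-", "to-", "first-", "token."]

def sse_events(tokens: Iterator[str]) -> Iterator[str]:
    """Wrap each token as a Server-Sent Events 'data:' frame the browser can read."""
    for tok in tokens:
        yield f"data: {json.dumps({'delta': tok})}\n\n"
    yield "data: [DONE]\n\n"

frames = list(sse_events(fake_llm_tokens()))
print(frames[0])  # the first frame is ready before the answer is complete
```

The key property: the first frame exists as soon as the first token does, so the UI can start rendering immediately.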
Business Impact: Speed Is a Feature
When you prioritize responsiveness, the product metrics follow:
- Lower abandonment: fewer users assume the bot froze
- Higher completion: people stay long enough to get value
- More trust: streaming feels conversational and human
The Everstone Approach
Everstone AI’s approach is to treat user experience as an architectural constraint. Speed isn’t something you “tune” at the end. It’s something you design in from the start.
I build agents that are non-blocking by default, using async patterns for tool execution and streaming for delivery—so enterprise AI feels as responsive as a consumer app.
Managerial takeaway: if your AI feels slow, don’t just blame the model. Ask whether the system is using async tooling and streaming responses. Very often, the bottleneck isn’t the AI—it’s the architecture around it.