The “Self-Healing” Agent: Closing the Loop Between Operations and Development
There’s a fundamental difference between traditional software and AI software.
When traditional software breaks, it’s loud. You get a “500 Internal Server Error,” an alert fires at 2 a.m., and an engineer deploys a fix.
When AI software breaks, it’s often silent. The chatbot returns a fluent, confident answer that happens to be wrong. The user doesn’t open a ticket. They just lose trust—and quietly stop using the product.
That kind of “silent failure” compounds over time. New data enters the system. User questions evolve. The world changes. And your AI drifts. I call this AI Rot.
Most organizations try to solve AI Rot with anecdotes. They hold weekly meetings and trade stories like:
“Someone in Marketing said the bot was wrong about Q3 sales.”
That’s not engineering. It’s hearsay.
If you want an AI product that lasts, you need a system where production failures automatically become regression tests. You need an agent that can heal—not by magic, but through disciplined feedback loops.
Here’s how to architect a Trace-to-Test pipeline on Databricks that closes the loop between what users experience and what developers ship.
The Broken Feedback Loop
Every modern chatbot has a “Thumbs Up / Thumbs Down” button. But in most organizations, that button is basically a placebo.
The typical workflow looks like this:
- A user clicks “Thumbs Down.”
- The event lands in a database table or a spreadsheet.
- Once a quarter, someone glances at it, feels overwhelmed, and moves on.
The issue isn’t the lack of data. It’s the lack of action. Operations collects signals, but Engineering has no reliable mechanism to turn those signals into improvements.
The Solution: The “Trace-to-Test” Pipeline
We fix this by wiring observability directly into the software lifecycle.
Instead of treating feedback as a passive log, we build an active pipeline where a negative interaction in production triggers a chain of events that leads to a real engineering improvement.
Phase 1: Capture (Contextualizing the Failure)
A “Thumbs Down” is useless without context.
To fix the problem, you need to know what actually happened:
- What documents did the AI retrieve?
- What tools did it call?
- How long did each step take?
- Which version of the app produced the output?
On Databricks, we use MLflow Tracing to capture the full execution path.
Just as importantly, we enrich traces with business context—user and session IDs—so the failure is searchable later.
```python
import os

import mlflow

@mlflow.trace
def answer(user_id: str, session_id: str, message: str) -> str:
    # Tag the trace with business metadata so we can find it later
    mlflow.update_current_trace(
        client_request_id=os.getenv("REQUEST_ID"),
        metadata={
            "user_id": user_id,
            "session_id": session_id,
            "environment": "production",
        },
    )
    # ... The agent logic (retrieval + tools + LLM) runs here ...
    return generate_response(message)
```
Linking Feedback to Reality
When a user gives feedback, we don’t just record a rating. We attach that rating directly to the Trace ID. That ties the user’s sentiment to the exact execution path the system took.
```python
import mlflow
from mlflow.entities import AssessmentSource

def submit_feedback(trace_id: str, is_correct: bool, comment: str, user_id: str):
    mlflow.log_feedback(
        trace_id=trace_id,
        name="user_feedback.is_correct",
        value=is_correct,
        rationale=comment,
        source=AssessmentSource(source_type="HUMAN", source_id=user_id),
    )
```
Now feedback isn’t just a row in a spreadsheet. It becomes queryable, debuggable data attached to real system behavior.
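For this to work, the trace ID has to make the round trip: the API response carries it to the client, and the thumbs-up/down handler echoes it back. A minimal sketch of that contract (the class and field names are illustrative, not an MLflow API):

```python
from dataclasses import dataclass

@dataclass
class ChatResponse:
    answer: str
    trace_id: str  # the frontend stores this and returns it with any rating

def on_thumbs_down(response: ChatResponse, comment: str, user_id: str) -> dict:
    """What the feedback endpoint receives when the user clicks thumbs-down."""
    return {
        "trace_id": response.trace_id,  # links the rating to the exact execution
        "is_correct": False,
        "comment": comment,
        "user_id": user_id,
    }

payload = on_thumbs_down(ChatResponse("Q3 sales were $5M.", "tr-123"), "Wrong number", "u-42")
```

The feedback endpoint then passes this payload straight to `submit_feedback`, so the rating lands on the exact trace that produced the answer.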
Phase 2: Curate (The “Hard Examples” Dataset)
A “Thumbs Down” is a signal—not a test case.
To make it useful, we convert it into a structured record developers can actually use. We run a nightly scheduled job that:
- Queries traces with negative feedback
- Extracts inputs (question) and outputs (bad answer)
- Promotes them into a “Hard Examples” dataset in Unity Catalog
This dataset becomes your “gold standard” of failure: the edge cases your system currently cannot handle.
Technical note: programmatic search
```python
from mlflow import MlflowClient

client = MlflowClient()

# Find all traces where the user said "This is wrong"
bad_traces = client.search_traces(
    experiment_ids=[experiment_id],
    filter_string="assessments.`user_feedback.is_correct`.value = 'false'",
    max_results=1000,
)
```
In a mature setup, these examples flow into a Review App where a domain expert (for example, a senior support agent) labels the correct answer. That’s the moment feedback becomes a true regression test.
Phase 3: Test (Regression Protection)
This is where the loop closes.
We update the CI/CD pipeline (GitHub Actions) to pull from the “Hard Examples” dataset during testing.
Now, when a developer opens a pull request, the system runs the new version against historical failures. If the change doesn’t fix old errors—or introduces new ones—the deployment gets blocked.
The gate itself is simple. The effect is enormous:
```python
import mlflow
from mlflow.genai.scorers import Correctness, Safety

# Run the new app version against the "Hard Examples" dataset
results = mlflow.genai.evaluate(
    data=eval_dataset.to_pandas(),
    predict_fn=predict_fn,
    scorers=[Correctness(), Safety()],
)

# Block deployment if correctness falls below 90%
assert results.metrics["correctness/mean"] >= 0.90
```
The outcome is exactly what engineering teams want: production failures become permanent regression protection. Every release is forced to prove it’s better.
Turning Failures into Features (Optimization)
You can push this even further.
If you’re using optimization tools like DSPy, these “Hard Examples” become training data. Instead of manually rewriting prompts for every edge case, you feed the dataset into an optimizer like MIPROv2.
The optimizer analyzes the failures and proposes improvements:
- rewritten prompt instructions, and/or
- better few-shot examples targeted at the real problems
The system doesn’t just detect failures. It helps generate fixes.
(For a deeper look, see my article on Smart Downsizing with DSPy.)
Managerial Takeaway: Logs Are Assets, Not Exhaust
A common leadership mistake is treating production logs as exhaust fumes—stored for compliance, then ignored.
In GenAI, production logs are different. They’re valuable intellectual property. They’re the best map you have of what users actually want—and where the system is failing them.
If you only monitor uptime (“Is the API online?”), you’re flying blind. You also need to monitor learning.
One metric to watch is Fix Rate: the share of negative-feedback traces that end up covered by a passing regression test.
If your Fix Rate is low, you don’t have an AI development process. You have a support queue.
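Fix Rate can be computed directly from data the pipeline already produces: the set of traces users flagged, and the subset now covered by a passing regression test. A minimal sketch (the helper and its inputs are illustrative):

```python
def fix_rate(negative_trace_ids: set, fixed_trace_ids: set) -> float:
    """Share of negative-feedback traces now covered by a passing regression test."""
    if not negative_trace_ids:
        return 1.0  # nothing to fix
    return len(negative_trace_ids & fixed_trace_ids) / len(negative_trace_ids)

# Four flagged traces, two of which now pass as regression tests
rate = fix_rate({"tr-1", "tr-2", "tr-3", "tr-4"}, {"tr-1", "tr-3"})
# → 0.5
```

Tracked per release, this number tells leadership whether feedback is actually being converted into fixes or just accumulating.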
With a Trace-to-Test pipeline on Databricks, you stop guessing what to build next. Your users show you exactly what’s broken—and your infrastructure ensures you fix it permanently.