The 'Cold Start' Fix: Generating Synthetic Golden Sets with Unity Catalog

There is a moment in nearly every RAG project that I call the Evaluation Deadlock.

The engineering team has built a chatbot. It works. They’re ready to test it.

Then someone asks the obvious question:

“What should we test it against?”

The room goes quiet.

To measure quality, you need a Golden Dataset—at least 100 realistic user questions paired with accurate, ground-truth answers. But early on, you don’t have users yet. Which means you don’t have questions.

And your Subject Matter Experts—the senior lawyers, engineers, or doctors who could write those answers—are far too busy billing $500 an hour to spend days in Excel creating test cases.

This is the deadlock:
You can’t deploy without testing.
You can’t test without data.

The solution isn’t to hire more humans.
It’s to build an Evaluation Factory.

Here’s how to use Databricks Unity Catalog and synthetic data generation to bootstrap a high-quality test suite overnight—without consuming a single hour of SME time.


The Architecture: Seed Synthetic, Evolve with Reality

A mature evaluation strategy follows a simple progression:

  1. Seed an initial dataset synthetically from your knowledge base.
  2. Evaluate your agent systematically to establish a baseline.
  3. Enrich the dataset over time using real production logs and expert labels.

Databricks provides built-in tools that automate the hardest part: Step 1—generating realistic test cases directly from your documents.


Step 1: Ingesting the Knowledge Base

Your raw knowledge—PDFs, HTML, Markdown files, manuals—should live in Unity Catalog Volumes. This is the governed home for unstructured data.

We read these files and convert them into clean text using ai_parse_document. This function handles messy, real-world formats like PDFs and Word documents and prepares them for downstream analysis.

Technical implementation pattern:

from pyspark.sql.functions import expr

# Read raw files from a Unity Catalog Volume
df_files = spark.read.format("binaryFile").load("/Volumes/main/rag/raw_docs/*")

# Parse PDFs and Word documents into structured text
df_parsed = df_files.select(
  expr("path").alias("doc_uri"),
  expr("ai_parse_document(content, map('version','2.0'))").alias("parsed")
)


At this point, your knowledge base is ready to be turned into evaluation data.
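Before the next step, you typically collapse the parsed struct into one plain-text column per document. The exact shape of ai_parse_document's output varies by version, so the `document`/`pages`/`content` field names in this sketch are assumptions to verify against your environment:

```python
# Sketch only: flatten a parsed-document struct into a single text string.
# The 'document' -> 'pages' -> 'content' field names are assumptions about
# the parser output schema, not a guaranteed contract.
def flatten_parsed_doc(parsed: dict) -> str:
    """Concatenate per-page content from a parsed-document struct."""
    pages = parsed.get("document", {}).get("pages", [])
    return "\n\n".join(p.get("content", "") for p in pages)

sample = {"document": {"pages": [{"content": "Refund policy: 30 days."},
                                 {"content": "Contact support for exceptions."}]}}
print(flatten_parsed_doc(sample))
```

The same logic can run as a UDF over df_parsed to produce the `content` column the generation step consumes.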


Step 2: The Generation Factory

Instead of hand-writing prompts that ask the model to “come up with questions,” we use the Databricks Mosaic AI Agent Evaluation toolkit.

The function generate_evals_df reads your documents and reverse-engineers the kinds of questions real users would ask. For each test case, it generates:

  • The question — realistic, user-style queries
  • Expected facts — what a correct answer must include
  • Expected context — which specific document the answer should come from
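To make that concrete, here is roughly what one generated row looks like, plus a minimal sanity filter you might run before persisting. The field names follow the Agent Evaluation input schema, but the specific values and the `is_complete` helper are illustrative assumptions:

```python
# Illustrative shape of one generated test case. Field names follow the
# Agent Evaluation schema; the values are made up for this example.
record = {
    "request": "How many days do I have to request a refund?",
    "expected_facts": ["Refunds must be requested within 30 days."],
    "expected_retrieved_context": [{"doc_uri": "/Volumes/main/rag/raw_docs/refund_policy.pdf"}],
}

def is_complete(rec: dict) -> bool:
    """A usable eval row needs a question, at least one fact, and a source doc."""
    return (bool(rec.get("request"))
            and bool(rec.get("expected_facts"))
            and bool(rec.get("expected_retrieved_context")))

print(is_complete(record))
```

A quick completeness pass like this catches degenerate rows before they pollute your baseline.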

from databricks.agents.evals import generate_evals_df

# docs_df needs a `content` column (document text) and a `doc_uri` column.
# Generate 500 test cases automatically
evals = generate_evals_df(
  docs_df,
  num_evals=500,
  agent_description="A helpful RAG assistant for internal policy questions.",
  question_guidelines="Include realistic edge cases."
)


This becomes your cold-start Golden Set—built entirely from the material you already own, with coverage across your full document corpus.


Step 3: Governance and Persistence

A Golden Dataset should never be a random CSV sitting on someone’s laptop.

It needs to be durable, versioned, and governed.

That’s why we store it as an MLflow Evaluation Dataset inside Unity Catalog. This gives you lineage, access controls, and the ability to evolve the dataset over time without losing history.

import mlflow.genai.datasets as ds

# Persist the Golden Set to Unity Catalog
dataset = ds.create_dataset(name="main.eval.rag_golden_v1")
dataset = dataset.merge_records(evals)


From this point on, every agent change can be measured against the same governed, versioned standard.


Step 4: The Quality Gate

Once the data exists, evaluation becomes repeatable.

You plug the dataset into mlflow.evaluate() and measure your agent across the metrics that actually matter:

  • Retrieval accuracy — did it pull the right document?
  • Answer quality — did it respond correctly and completely?

import mlflow

results = mlflow.evaluate(
  model="endpoints:/YOUR_AGENT_ENDPOINT",
  data=dataset.to_df(),
  model_type="databricks-agent"
)
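A "gate" implies a pass/fail decision, so in CI you would threshold the returned metrics. The metric keys and floors below are placeholders to adapt to whatever mlflow.evaluate actually reports for your agent:

```python
# Minimal CI-style quality gate over evaluation metrics. The metric names
# and thresholds are placeholders -- substitute the keys your evaluation
# run actually produces.
def passes_gate(metrics: dict, thresholds: dict) -> bool:
    """Fail if any thresholded metric is missing or below its floor."""
    return all(metrics.get(name, 0.0) >= floor for name, floor in thresholds.items())

thresholds = {"retrieval/precision": 0.8, "response/correctness": 0.85}
metrics = {"retrieval/precision": 0.9, "response/correctness": 0.87}
print(passes_gate(metrics, thresholds))
```

Wiring `passes_gate(results.metrics, thresholds)` into your deployment pipeline is what turns an evaluation report into an actual release blocker.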


This turns evaluation into a true quality gate, not a subjective demo-day exercise.


Managerial Takeaway: AI Testing AI

In practice, the biggest bottleneck in GenAI adoption isn’t compute.
It’s data readiness.

An automated evaluation factory delivers three immediate wins:

  • Speed: from zero tests to 500 tests in a single night
  • Cost savings: hundreds of hours of expensive SME time avoided
  • Coverage: every document is read, including obscure edge cases humans often miss

Don’t let missing test data block deployment.
Use AI to test AI.

