Your Data Lake is a Swamp: Building the Semantic Layer for Agents

Most companies believe their data is AI-ready.

They have a data lake. They have dashboards. They have tables with millions of rows.

Then they point an AI agent at the warehouse and ask a simple question:

“What was our churn rate last month?”

The agent replies:

“I cannot find a column named churn.”

Or worse, it finds a column called CUST_STAT_CD, guesses that 0 means “churned,” and confidently reports a number that’s off by 50%.

The problem isn’t that the AI is stupid.

The problem is that your data is cryptic.

For the last 20 years, we built data warehouses for human analysts—experts who rely on tribal knowledge to know that T_SALES_FINAL_V2 really means revenue.

AI agents don’t have tribal knowledge. They only know what’s explicitly written in the schema.

When metadata is missing, your data lake isn’t a lake.

It’s a swamp.

To fix this, you need a semantic layer. Here’s how to use Databricks Unity Catalog and Genie to teach your data to speak human.


The Metadata Gap: Why Agents Get Lost

When an AI writes SQL, it reads your schema like a map.

Good map: monthly_revenue_usd
Bad map: m_rev_01

Most enterprise schemas are bad maps. They’re full of acronyms, legacy codes, and undocumented logic.

So the goal isn’t just “add AI.”
The goal is to make meaning explicit.


Step 1: Reduce the Surface Area

Before you document anything, shrink the problem.

Agents perform best when the data they see is small and purposeful. Don’t expose 200 raw tables and hope for the best. Instead, curate a business-ready layer:

  • Hide irrelevant columns
  • Pre-join common relationships into clean views
  • Keep only what business users actually ask for

This work isn’t glamorous. But it’s the difference between a demo and a production system.
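As a sketch, that curation might look like a single business-ready view (the table and column names here are hypothetical):

```sql
-- Hypothetical curated view: expose only what business users ask for.
-- Pre-joins customers to subscriptions and hides legacy/internal columns.
CREATE OR REPLACE VIEW main.gold.active_subscriptions AS
SELECT
  c.customer_id,
  c.customer_name,
  s.plan_name,
  s.monthly_revenue_usd,   -- descriptive name, not m_rev_01
  s.start_date
FROM main.sales.customers AS c
JOIN main.sales.subscriptions AS s
  ON c.customer_id = s.customer_id
WHERE s.is_active = true;  -- agents never see cancelled noise rows
```

Point the agent at `main.gold`, not at the 200 raw tables underneath it.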


Step 2: Annotate What You Already Have

Start with the simplest, highest-leverage move: document your tables and columns in Unity Catalog.

Databricks lets you attach rich comments directly to the data:

COMMENT ON TABLE main.sales.customers IS
'Master customer list for churn and active user analysis.';

COMMENT ON COLUMN main.sales.customers.status_code IS
'Account status code.
0 = Active
1 = Paused
99 = Churned (Cancelled subscription).
Exclude Free Trial users when calculating churn.';


Pro tip:

Don’t write comments for engineers (“integer, primary key”).
Write them for the AI (“use this column to identify churned customers”).

Databricks can now generate draft comments automatically, which is a great way to get started. Just don’t skip human review—especially for critical tables.
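To see how much annotation work is left, you can audit coverage through the catalog's information schema. A sketch, assuming a `sales` schema in the `main` catalog:

```sql
-- List columns in the sales schema that still lack a comment.
SELECT table_name, column_name
FROM main.information_schema.columns
WHERE table_schema = 'sales'
  AND (comment IS NULL OR comment = '')
ORDER BY table_name, ordinal_position;
```

Run it weekly and treat a shrinking result set as a real metric of AI-readiness.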


Step 3: Define Standard Metrics (Metric Views)

Some concepts don’t fit in a comment.

Metrics like Net Revenue Retention, Active Daily Users, or Churn Rate involve non-trivial logic. If you let the AI guess the formula, it will eventually guess wrong.

This is where Metric Views come in. They let you define a KPI once—governed, reusable, and consistent—and allow the AI to consume it safely.

measures:
  - name: churn_rate
    expr: churned_customers / NULLIF(active_customers + churned_customers, 0)
    display_name: "Churn Rate"
    format: "percent"
    synonyms: ["cancellation rate", "attrition"]


Those synonyms matter. When a user asks about attrition, the AI now knows they mean churn.

You’re no longer asking the agent to infer business logic. You’re giving it the answer key.
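Once the metric view exists, anyone (human or agent) queries the measure by name with `MEASURE()`, and the formula is never repeated. A sketch, assuming the view lives at `main.gold.customer_metrics` with a `region` dimension:

```sql
-- The churn formula lives in the metric view, not in this query.
SELECT
  region,
  MEASURE(churn_rate) AS churn_rate
FROM main.gold.customer_metrics
GROUP BY region;
```

If finance later changes the churn definition, you change the YAML once and every consumer, including the agent, picks it up.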


Step 4: Curate Knowledge in Databricks Genie

The final step is bringing everything together in a Genie Space.

A Genie Space isn’t just a chatbot. It’s a curated environment with a Knowledge Store that guides how the AI reasons about your data.

Instead of long instruction prompts, you give Genie trusted assets:

  • Example SQL queries: “This is the approved logic for MRR.”
  • Verified functions: “Always use get_fiscal_calendar() for date filtering.”

These assets can be marked as Trusted, which tells the AI to prefer official business logic over improvisation.

At that point, you’re no longer prompting an AI.
You’re onboarding a new analyst—with guardrails.
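A trusted asset can be as simple as a governed SQL function. The `get_fiscal_calendar()` helper mentioned above might be registered along these lines (the signature and the February fiscal-year start are placeholder assumptions; adapt to your calendar):

```sql
-- Hypothetical body for the verified function referenced above.
-- Assumes the fiscal year starts 1 February.
CREATE OR REPLACE FUNCTION main.finance.get_fiscal_calendar(d DATE)
RETURNS STRING
COMMENT 'Verified fiscal-period lookup. Use for all fiscal date filtering.'
RETURN CONCAT(
  'FY', YEAR(ADD_MONTHS(d, -1)),
  '-Q', CAST(CEIL(MONTH(ADD_MONTHS(d, -1)) / 3.0) AS INT)
);
```

Once this is marked as Trusted in the Genie Space, a question like "churn last fiscal quarter" resolves through governed logic instead of the model improvising a calendar.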


Managerial Takeaway: Documentation Is Training Data

For decades, documentation was a “nice to have.”

In the GenAI era, documentation is training data.

If you want agents that behave like reliable analysts, you need:

  • Curated, agent-friendly datasets
  • Explicit business meaning in metadata
  • Standardized metrics through metric views
  • Guided reasoning via Genie’s knowledge store

That’s the difference between a swamp and a semantic layer.

The strategic shift:
Stop treating metadata as admin work. Treat it as product development for your internal AI capabilities.

 
