Databricks Builds a Smarter Enterprise Search Agent

Most enterprise RAG (retrieval-augmented generation) systems are designed for a specific type of search task. That works fine until a different type of query appears. A model that is good at summarizing reports may struggle to find specific entities. A system built for simple lookups may fail when it needs to reason across multiple documents.

This is a common problem inside companies where information is scattered across meeting notes, customer records, internal documents and product discussions. The model often fails silently, giving incomplete or incorrect answers.

Databricks believes it has a solution.

The company has built a new agent called KARL (Knowledge Agents via Reinforcement Learning) that is designed to handle multiple enterprise search behaviors at the same time.

According to Databricks, KARL can match the performance of Claude Opus 4.6 on a custom enterprise benchmark while delivering:
• 33% lower cost per query
• 47% lower latency

The model was trained entirely on synthetic data generated by the agent itself, without human labeling.

Why enterprise search is harder than it looks

Most real business questions do not have a single clear answer.

Examples include:
• Combining insights from multiple product manager meeting notes
• Reconstructing outcomes of past competitive deals
• Understanding a customer’s history when details are spread across many systems
• Building internal sales battle cards from unstructured data

In these situations, the answer is not stored in one document. The model has to retrieve information, reason through it, and synthesize a conclusion.
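To make the retrieve-reason-synthesize pattern concrete, here is a toy sketch (not Databricks' code — the corpus, the keyword "retrieval", and the sorting step are all stand-ins for illustration) showing an answer assembled from facts scattered across several documents:

```python
# Toy retrieve-reason-synthesize loop: no single document holds the answer.
CORPUS = {
    "meeting_notes_q1": "Customer Acme asked for SSO support in January.",
    "meeting_notes_q2": "Acme renewed their contract in April.",
    "deal_log": "The Acme renewal closed at a 12% discount.",
}

def retrieve(query: str) -> list[str]:
    """Naive keyword match standing in for a real vector search."""
    return [text for text in CORPUS.values() if query.lower() in text.lower()]

def answer(entity: str) -> str:
    facts = retrieve(entity)      # 1. retrieve scattered evidence
    ordered = sorted(facts)       # 2. "reason": put the evidence in order
    return " ".join(ordered)      # 3. synthesize a single answer

print(answer("Acme"))
```

A real agent would replace the keyword match with embedding search and the sort with model reasoning, but the shape of the task — gather, organize, conclude — is the same.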

Jonathan Frankle, Chief AI Scientist at Databricks, describes this challenge as working with tasks that are not strictly verifiable. Unlike math problems or coding tasks, there is rarely a clear right or wrong answer.


The generalization problem in RAG systems

Traditional RAG pipelines are often optimized for a single behavior. Databricks tested this and found that models trained on one type of search task performed poorly when tested on other types.

To measure this problem, the team created a benchmark called KARLBench, which evaluates six enterprise search behaviors:

• Constraint-driven entity search
• Cross-document report synthesis
• Long-document navigation with numerical reasoning
• Exhaustive entity retrieval
• Procedural reasoning over technical documentation
• Fact aggregation from internal company notes

Training on just one of these tasks did not generalize well.

However, multi-task reinforcement learning allowed the model to adapt across all tasks, even ones it had never seen before.


Grounded reasoning and large scale retrieval

KARL performs what Databricks calls grounded reasoning.

Instead of relying only on the language model’s internal knowledge, the system continuously retrieves facts and anchors every reasoning step to real data.

In some tasks, the agent performs up to 200 sequential vector database searches, refining queries, verifying details and cross-checking documents before producing an answer.
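The mechanics of budgeted, sequential search with query refinement can be sketched roughly as follows (a hypothetical illustration — KARL's search policy is learned via RL, not hand-coded, and `vector_search` here is a trivial substring match standing in for a vector DB call):

```python
def vector_search(query: str, index: dict[str, str]) -> list[str]:
    """Stand-in for a vector DB query: returns ids of matching documents."""
    return [doc_id for doc_id, text in index.items() if query in text]

def agentic_search(seed_query: str, index: dict[str, str], budget: int = 200) -> set[str]:
    """Sequential searches under a hard budget, refining queries as evidence accumulates."""
    seen: set[str] = set()
    frontier = [seed_query]
    for _ in range(budget):            # cap on sequential search calls
        if not frontier:
            break                      # agent decides it has enough evidence
        query = frontier.pop()
        for doc_id in vector_search(query, index):
            if doc_id not in seen:
                seen.add(doc_id)
                # refine: follow up on salient terms from the new document
                frontier.extend(t for t in index[doc_id].split() if len(t) > 6)
    return seen

index = {"a": "onboarding checklist", "b": "checklist escalation", "c": "escalation runbook"}
print(agentic_search("onboarding", index))
```

Starting from one seed query, the loop hops document to document — "onboarding" leads to a checklist, the checklist leads to an escalation doc — which is the same chaining behavior, scaled down, as 200 refined searches over an enterprise index.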

This is far more complex than the typical RAG pipeline used in many enterprise tools today.


The reinforcement learning engine behind KARL

Training the system required a new reinforcement learning approach called OAPL (Optimal Advantage-based Policy Optimization with Lagged inference policy).

Traditional RL methods assume the model generating training data and the model being updated are synchronized. In large distributed training systems, that assumption breaks down.

OAPL is designed to handle this reality by allowing training to remain stable even when models are far out of sync. In experiments, it achieved similar results to standard RL methods while using roughly one-third as many training samples.
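OAPL's exact update rule is not public, but the general idea of learning from a lagged (stale) inference policy is usually handled with an importance-sampling correction. A generic, clipped version of that correction — offered here only as an illustration of the technique, not as OAPL itself — looks like this:

```python
import math

def corrected_advantage(adv: float, logp_current: float, logp_behavior: float,
                        clip: float = 2.0) -> float:
    """Reweight an advantage from a sample generated by a stale policy.

    ratio > 1 means the updated policy now favors this action more than the
    lagged policy that generated it; clipping bounds the variance this adds.
    Generic off-policy correction -- not OAPL's published update.
    """
    ratio = math.exp(logp_current - logp_behavior)
    return min(ratio, clip) * adv

# A rollout from an out-of-date policy still yields a usable, bounded
# training signal instead of being discarded.
print(corrected_advantage(1.0, logp_current=-0.5, logp_behavior=-2.0))
```

Corrections of this family are what let asynchronous trainers keep consuming rollouts from inference workers that lag several updates behind, which matches the synchronization problem the article describes.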

This efficiency kept the total training run within a few thousand GPU hours, making the project feasible for enterprise scale development.


How KARL handles large context and memory

Enterprise queries often require pulling information from huge internal databases. The context window of an LLM cannot hold all of that information.

Databricks approached this problem with a layered architecture:

• Vector database with millions of records at the base
• Compression and caching layers in the middle
• The LLM context window at the top
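The layering above can be sketched as a tiered lookup — check a cache before hitting the expensive base store, and promote only retrieved snippets into the LLM context. This is a hypothetical illustration of the architecture's shape, not Databricks' implementation:

```python
class LayeredStore:
    """Three-tier lookup: vector DB at the base, cache in the middle, LLM context on top."""

    def __init__(self, vector_db: dict[str, str]):
        self.vector_db = vector_db        # millions of records in practice
        self.cache: dict[str, str] = {}   # compression/caching middle layer
        self.context: list[str] = []      # the only tier the LLM actually sees

    def fetch(self, key: str) -> str:
        if key not in self.cache:             # cache miss -> expensive base lookup
            self.cache[key] = self.vector_db[key]
        snippet = self.cache[key]
        self.context.append(snippet)          # promote into the LLM context window
        return snippet
```

The point of the design is that the context window, the scarcest resource, only ever holds what the lower tiers have already retrieved and filtered.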

Instead of building a separate summarization system, the team allowed KARL to learn how to compress its own context during reinforcement learning.

When context becomes too large, the agent compresses earlier information and continues reasoning. Removing this learned compression reduced benchmark accuracy from 57% to 39%, showing how critical this capability is.
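The trigger-and-compress behavior can be illustrated with a minimal sketch. The budget check and the first-sentence "summarizer" below are placeholder mechanics — KARL learns when and how to compress during RL rather than following a fixed rule:

```python
def compress(messages: list[str]) -> str:
    """Stand-in summarizer: keep only the first sentence of each message."""
    return " ".join(m.split(".")[0] + "." for m in messages)

def add_to_context(context: list[str], new_msg: str, budget_chars: int = 200) -> list[str]:
    """Append a step; if the context exceeds budget, fold history into a summary."""
    context = context + [new_msg]
    if sum(len(m) for m in context) > budget_chars:
        # fold everything but the latest step into one compressed summary
        context = [compress(context[:-1]), context[-1]]
    return context
```

Once the character budget is exceeded, earlier reasoning steps collapse into a short summary while the newest step stays verbatim, so the agent can keep reasoning indefinitely inside a fixed window.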


Where the system still struggles

Despite the progress, KARL is not perfect.

The model struggles most with highly ambiguous questions where multiple valid answers exist. In these cases, the system may not know whether the question is open-ended or simply difficult.

The agent also sometimes stops early on queries without producing a final answer. Databricks argues that this is sometimes the correct choice, since the most complex queries are often the ones that produce wrong answers anyway.

The current system is also limited to vector search. It does not yet handle:

• SQL queries
• File system search
• Python based calculations

Those capabilities are planned for future versions.


What this means for enterprise AI teams

The work behind KARL highlights three important takeaways for companies building internal AI search tools.

  1. Pipeline design matters
    Systems optimized for a single search behavior may fail on other types of queries. Training models across multiple retrieval patterns leads to better generalization.

  2. Reinforcement learning changes the model’s behavior
    Supervised fine-tuning improved performance on tasks the model had already seen. But only reinforcement learning helped the system adapt to new tasks.

  3. Efficient search reduces cost and errors
    A model trained to search effectively performs fewer retrieval steps, stops early on impossible queries, explores alternative search paths and compresses its own context when needed.


The bigger takeaway is that enterprise search is not just about better language models. It is about training systems that know how to retrieve, reason and navigate complex internal data environments. Databricks believes purpose built agents like KARL could become a key part of that future.