
Unstructured retrieval AI agent tools

Preview

This feature is in Public Preview.

This article shows how to create AI agent tools for unstructured data retrieval using the Mosaic AI Agent Framework. Unstructured retrievers enable agents to query unstructured data sources, such as a document corpus, using vector search indexes.

To learn more about agent tools, see AI agent tools.

Locally develop Vector Search retriever tools with AI Bridge

The easiest way to start developing a Databricks Vector Search retriever tool is locally. Use Databricks AI Bridge packages like databricks-langchain and databricks-openai to quickly add retrieval capabilities to an agent and experiment with query parameters. This approach enables fast iteration during initial development.

Once your local tool is ready, you can directly productionize it as part of your agent code, or migrate it to a Unity Catalog function, which provides better discoverability and governance but has certain limitations. See Vector Search retriever tool with Unity Catalog functions.

note

To use an external vector index hosted outside of Databricks, see Vector Search retriever using a vector index hosted outside of Databricks.

The following code prototypes a retriever tool and binds it to an LLM locally so you can chat with the agent to test its tool-calling behavior.

Install the latest version of databricks-langchain, which includes Databricks AI Bridge.

Bash
%pip install --upgrade databricks-langchain

The following example queries a hypothetical vector search index that fetches content from Databricks product documentation.

Provide a clear and descriptive tool_description. The agent LLM uses the tool_description to understand the tool and determine when to invoke the tool.

Python
from databricks_langchain import VectorSearchRetrieverTool, ChatDatabricks

# Initialize the retriever tool.
vs_tool = VectorSearchRetrieverTool(
    index_name="catalog.schema.my_databricks_docs_index",
    tool_name="databricks_docs_retriever",
    tool_description="Retrieves information about Databricks products from official Databricks documentation."
)

# Run a query against the vector search index locally for testing
vs_tool.invoke("Databricks Agent Framework?")

# Bind the retriever tool to your Langchain LLM of choice
llm = ChatDatabricks(endpoint="databricks-meta-llama-3-1-70b-instruct")
llm_with_tools = llm.bind_tools([vs_tool])

# Chat with your LLM to test the tool calling functionality
llm_with_tools.invoke("Based on the Databricks documentation, what is Databricks Agent Framework?")
note

When initializing the VectorSearchRetrieverTool, the text_column and embedding arguments are required for Delta Sync Indexes with self-managed embeddings and Direct Vector Access Indexes. See options for providing embeddings.

For additional details, see the API docs for VectorSearchRetrieverTool.

Python
from databricks_langchain import VectorSearchRetrieverTool, DatabricksEmbeddings

embedding_model = DatabricksEmbeddings(
    endpoint="databricks-bge-large-en",
)

vs_tool = VectorSearchRetrieverTool(
    index_name="catalog.schema.index_name",  # Index name in the format 'catalog.schema.index'
    num_results=5,  # Max number of documents to return
    columns=["primary_key", "text_column"],  # List of columns to include in the search
    filters={"text_column LIKE": "Databricks"},  # Filters to apply to the query
    query_type="ANN",  # Query type ("ANN" or "HYBRID")
    tool_name="name of the tool",  # Used by the LLM to determine when to invoke the tool
    tool_description="Purpose of the tool",  # Used by the LLM to understand the purpose of the tool
    text_column="text_column",  # Text column for embeddings. Required for direct-access index or delta-sync index with self-managed embeddings.
    embedding=embedding_model  # The embedding model. Required for direct-access index or delta-sync index with self-managed embeddings.
)

Vector Search retriever tool with Unity Catalog functions

The following example creates a retriever tool using a Unity Catalog function to query data from a Mosaic AI Vector Search index.

The Unity Catalog function databricks_docs_vector_search queries a hypothetical Vector Search index containing Databricks documentation. This function wraps the Databricks SQL function vector_search() and aligns its output with the MLflow retriever schema by using the page_content and metadata aliases.

note

To conform to the MLflow retriever schema, any additional metadata columns must be added to the metadata column using the SQL map function, rather than as top-level output keys.
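For illustration, a row returned by a conforming function has only two top-level keys. The following sketch (plain Python, with hypothetical column values) shows the shape the SQL function above produces:

```python
# One row in the MLflow retriever schema. Extra columns such as doc_uri and
# chunk_id live inside the metadata map, not as top-level keys.
row = {
    "page_content": "Mosaic AI Agent Framework lets you build production agents...",
    "metadata": {
        "doc_uri": "https://docs.databricks.com/generative-ai/agent-framework.html",
        "chunk_id": "chunk-0042",
    },
}
```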

Run the following code in a notebook or SQL editor to create the function:

SQL
CREATE OR REPLACE FUNCTION main.default.databricks_docs_vector_search (
  -- The agent uses this comment to determine how to generate the query string parameter.
  query STRING
  COMMENT 'The query string for searching Databricks documentation.'
) RETURNS TABLE
-- The agent uses this comment to determine when to call this tool. It describes the types of documents and information contained within the index.
COMMENT 'Executes a search on Databricks documentation to retrieve text documents most relevant to the input query.' RETURN
SELECT
  chunked_text AS page_content,
  map('doc_uri', url, 'chunk_id', chunk_id) AS metadata
FROM
  vector_search(
    -- Specify your Vector Search index name here
    index => 'catalog.schema.databricks_docs_index',
    query => query,
    num_results => 5
  )

To use this retriever tool in your AI agent, wrap it with UCFunctionToolkit. This enables automatic tracing through MLflow.

MLflow Tracing captures detailed execution information for gen AI applications. It logs inputs, outputs, and metadata for each step, helping you debug issues and analyze performance.

When using UCFunctionToolkit, retrievers automatically generate RETRIEVER span types in MLflow logs if their output conforms to the MLflow retriever schema. See MLflow Tracing Schema.

For more information about UCFunctionToolkit, see the Unity Catalog documentation.

Python
from unitycatalog.ai.langchain.toolkit import UCFunctionToolkit

toolkit = UCFunctionToolkit(
    function_names=[
        "main.default.databricks_docs_vector_search"
    ]
)
tools = toolkit.tools

This retriever tool has the following caveats:

  • SQL clients may limit the maximum number of rows or bytes returned. To stay within those limits, truncate large column values inside the UDF. For example, use substring(chunked_text, 0, 8192) to reduce the size of large content columns and avoid row truncation during execution.
  • Since this tool is a wrapper for the vector_search() function, it is subject to the same limitations as the vector_search() function. See Limitations.
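The same truncation advice applies if you post-process retriever output in Python instead of in SQL. A minimal stdlib-only sketch (the helper name and the 8192-character limit are illustrative, matching the substring example above):

```python
def truncate_page_content(rows, max_chars=8192):
    """Truncate large page_content values so SQL client row/byte limits
    don't silently cut off retriever output mid-row.

    Mirrors the SQL advice substring(chunked_text, 0, 8192); each row is a
    dict in the MLflow retriever schema (page_content + metadata keys).
    """
    return [
        {**row, "page_content": row["page_content"][:max_chars]}
        for row in rows
    ]
```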

Vector Search retriever using a vector index hosted outside of Databricks

If your vector index is hosted outside of Databricks, you can create a Unity Catalog Connection to the external service and use the connection in agent code. For more information, see Connect AI agent tools to external services.

The following example creates a Vector Search retriever that calls a vector index hosted outside of Databricks for a PyFunc-flavored agent.

  1. Create a Unity Catalog Connection to the external service, in this case, Azure.

    SQL
    CREATE CONNECTION ${connection_name}
    TYPE HTTP
    OPTIONS (
      host 'https://example.search.windows.net',
      base_path '/',
      bearer_token secret ('<secret-scope>','<secret-key>')
    );
  2. Define the retriever tool in agent code using the Unity Catalog Connection you created. This example uses MLflow decorators to enable agent tracing.

    note

    To conform to the MLflow retriever schema, the retriever function should return a Document type and use the metadata field in the Document class to add additional attributes to the returned document, like doc_uri and similarity_score.

    Python
    import mlflow
    import json

    from mlflow.entities import Document
    from typing import List, Any
    from dataclasses import asdict

    class VectorSearchRetriever:
        """
        Class using Databricks Vector Search to retrieve relevant documents.
        """

        def __init__(self):
            self.azure_search_index = "hotels_vector_index"

        @mlflow.trace(span_type="RETRIEVER", name="vector_search")
        def __call__(self, query_vector: List[Any], score_threshold=None) -> List[Document]:
            """
            Performs vector search to retrieve relevant chunks.

            Args:
                query_vector: Embedding vector for the search query.
                score_threshold: Minimum score a result must meet to be returned.

            Returns:
                List of retrieved Documents.
            """
            from databricks.sdk import WorkspaceClient
            from databricks.sdk.service.serving import ExternalFunctionRequestHttpMethod

            payload = {
                "count": True,
                "select": "HotelId, HotelName, Description, Category",
                "vectorQueries": [
                    {
                        "vector": query_vector,
                        "k": 7,
                        "fields": "DescriptionVector",
                        "kind": "vector",
                        "exhaustive": True,
                    }
                ],
            }

            response = (
                WorkspaceClient()
                .serving_endpoints.http_request(
                    conn=connection_name,
                    method=ExternalFunctionRequestHttpMethod.POST,
                    path=f"indexes/{self.azure_search_index}/docs/search?api-version=2023-07-01-Preview",
                    json=payload,
                )
                .text
            )

            # The HTTP response body is a JSON string; parse it before processing.
            documents = self.convert_vector_search_to_documents(
                json.loads(response), score_threshold
            )
            return [asdict(doc) for doc in documents]

        @mlflow.trace(span_type="PARSER")
        def convert_vector_search_to_documents(
            self, vs_results, score_threshold
        ) -> List[Document]:
            docs = []

            for item in vs_results.get("value", []):
                score = item.get("@search.score", 0)

                if score_threshold is None or score >= score_threshold:
                    metadata = {
                        "score": score,
                        "HotelName": item.get("HotelName"),
                        "Category": item.get("Category"),
                    }

                    doc = Document(
                        page_content=item.get("Description", ""),
                        metadata=metadata,
                        id=item.get("HotelId"),
                    )
                    docs.append(doc)

            return docs
  3. Run the following Python code to invoke the retriever. You can optionally include Vector Search filters in the request to filter results.

    Python
    retriever = VectorSearchRetriever()
    query = [0.01944167, 0.0040178085, ..., -0.017496133]  # embedding vector trimmed for brevity
    results = retriever(query, score_threshold=0.1)

Set retriever schema

If the trace returned from the retriever (or the span with span_type="RETRIEVER") does not conform to MLflow’s standard retriever schema, you must manually map the returned schema to MLflow’s expected fields. This ensures that MLflow can properly trace your retriever and render the traces correctly in downstream applications.

To set the retriever schema manually, call mlflow.models.set_retriever_schema when you define your agent. Use set_retriever_schema to map the column names in the returned table to MLflow’s expected fields such as primary_key, text_column, and doc_uri.

Python
# Define the retriever's schema by providing your column names
mlflow.models.set_retriever_schema(
    name="vector_search",
    primary_key="chunk_id",
    text_column="text_column",
    doc_uri="doc_uri",
    # other_columns=["column1", "column2"],
)

You can also specify additional columns in your retriever’s schema by providing a list of column names with the other_columns field.

If you have multiple retrievers, you can define multiple schemas by using unique names for each retriever schema.

The retriever schema set during agent creation affects downstream applications and workflows, such as the review app and evaluation sets. Specifically, the doc_uri column serves as the primary identifier for documents returned by the retriever.

  • The review app displays the doc_uri to help reviewers assess responses and trace document origins. See Review App UI.
  • Evaluation sets use doc_uri to compare retriever results against predefined evaluation datasets to determine the retriever’s recall and precision. See Evaluation sets.
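To make the recall and precision computation concrete, the following stdlib-only sketch scores a single query by comparing retrieved doc_uri values against a ground-truth set. The function and data are illustrative, not Agent Evaluation's actual implementation:

```python
def score_retrieval(retrieved_doc_uris, expected_doc_uris):
    """Compute (recall, precision) for one query, matching on doc_uri.

    recall    = fraction of expected documents that were retrieved
    precision = fraction of retrieved documents that were expected
    """
    retrieved = set(retrieved_doc_uris)
    expected = set(expected_doc_uris)
    hits = retrieved & expected
    recall = len(hits) / len(expected) if expected else 0.0
    precision = len(hits) / len(retrieved) if retrieved else 0.0
    return recall, precision

# Two of three expected docs retrieved; two of three retrieved docs relevant.
recall, precision = score_retrieval(
    retrieved_doc_uris=["docs/a.html", "docs/b.html", "docs/x.html"],
    expected_doc_uris=["docs/a.html", "docs/b.html", "docs/c.html"],
)
```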

Trace the retriever

MLflow tracing adds observability by capturing detailed information about your agent’s execution. It provides a way to record the inputs, outputs, and metadata associated with each intermediate step of a request, enabling you to pinpoint the source of bugs and unexpected behaviors quickly.

This example uses the @mlflow.trace decorator to create a trace for the retriever and parser. For other options for setting up trace methods, see MLflow Tracing for agents.

The decorator creates a span that starts when the function is invoked and ends when it returns. MLflow automatically records the function’s input and output and any exceptions raised.

note

LangChain, LlamaIndex, and OpenAI library users can use MLflow auto logging instead of manually defining traces with the decorator. See Use autologging to add traces to your agents.

Python
...
@mlflow.trace(span_type="RETRIEVER", name="vector_search")
def __call__(self, query: str) -> List[Document]:
    ...

To ensure downstream applications such as Agent Evaluation and the AI Playground render the retriever trace correctly, make sure the decorator meets the following requirements:

  • Use span_type="RETRIEVER" and ensure the function returns a List[Document] object. See Retriever spans.
  • The trace name and the retriever_schema name must match to configure the trace correctly.

Next steps

After you create a Unity Catalog function agent tool, add the tool to an AI agent. See Add Unity Catalog tools to agents.