
Foundation model REST API reference

This article provides general API information for Databricks Foundation Model APIs and the models they support. The Foundation Model APIs are designed to be similar to OpenAI’s REST API to make migrating existing projects easier. Both the pay-per-token and provisioned throughput endpoints accept the same REST API request format.

Endpoints

Foundation Model APIs support pay-per-token endpoints and provisioned throughput endpoints.

A preconfigured endpoint is available in your workspace for each pay-per-token supported model, and users can interact with these endpoints using HTTP POST requests. See Pay-per-token for supported models.

Provisioned throughput endpoints can be created using the API or the Serving UI. These endpoints support multiple models per endpoint for A/B testing, as long as all served models expose the same API format (for example, both must be chat models). See POST /api/2.0/serving-endpoints for endpoint configuration parameters.

Requests and responses use JSON; the exact JSON structure depends on an endpoint's task type. Chat and completion endpoints support streaming responses.
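
As a minimal sketch of the request flow, the following Python snippet sends an HTTP POST request to a serving endpoint. The workspace URL, endpoint name, and token handling here are illustrative assumptions, not fixed values:

```python
import os
import requests

# Illustrative values; substitute your own workspace URL and endpoint name.
WORKSPACE_URL = "https://<your-workspace>.cloud.databricks.com"
ENDPOINT_NAME = "<your-chat-endpoint>"  # assumed pay-per-token or provisioned throughput endpoint

response = requests.post(
    f"{WORKSPACE_URL}/serving-endpoints/{ENDPOINT_NAME}/invocations",
    headers={"Authorization": f"Bearer {os.environ['DATABRICKS_TOKEN']}"},
    json={"messages": [{"role": "user", "content": "Hello!"}]},
)
response.raise_for_status()
print(response.json())
```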

Usage

Responses include a usage sub-message which reports the number of tokens in the request and response. The format of this sub-message is the same across all task types.

| Field | Type | Description |
| --- | --- | --- |
| completion_tokens | Integer | Number of generated tokens. Not included in embedding responses. |
| prompt_tokens | Integer | Number of tokens from the input prompt(s). |
| total_tokens | Integer | Number of total tokens. |

For models like Meta-Llama-3.3-70B-Instruct, a user prompt is transformed using a prompt template before being passed into the model. For pay-per-token endpoints, a system prompt might also be added. prompt_tokens includes all text added by our server.
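
As a quick illustration of the shape (all values invented), the usage sub-message can be read like any other JSON field:

```python
import json

# An invented, illustrative response body showing where usage appears.
body = json.loads("""
{
  "id": "chatcmpl-123",
  "object": "chat.completions",
  "usage": {"prompt_tokens": 27, "completion_tokens": 112, "total_tokens": 139}
}
""")

usage = body["usage"]
# prompt_tokens counts the input including any server-added template/system text.
assert usage["total_tokens"] == usage["prompt_tokens"] + usage["completion_tokens"]
```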

Chat task

Chat tasks are optimized for multi-turn conversations with a model. The model response provides the next assistant message in the conversation. See POST /serving-endpoints/{name}/invocations for querying endpoint parameters.

Chat request

| Field | Default | Type | Description |
| --- | --- | --- | --- |
| messages | | ChatMessage list | Required. A list of messages representing the current conversation. |
| max_tokens | null | null, which means no limit, or an integer greater than zero | The maximum number of tokens to generate. |
| stream | true | Boolean | Stream responses back to a client to allow partial results for requests. If this parameter is included in the request, responses are sent using the Server-sent events standard. |
| temperature | 1.0 | Float in [0,2] | The sampling temperature. 0 is deterministic and higher values introduce more randomness. |
| top_p | 1.0 | Float in (0,1] | The probability threshold used for nucleus sampling. |
| top_k | null | null, which means no limit, or an integer greater than zero | The number of k most likely tokens to use for top-k filtering. Set this value to 1 to make outputs deterministic. |
| stop | [] | String or List[String] | The model stops generating further tokens when any one of the sequences in stop is encountered. |
| n | 1 | Integer greater than zero | The API returns n independent chat completions when n is specified. Recommended for workloads that generate multiple completions on the same input for additional inference efficiency and cost savings. Only available for provisioned throughput endpoints. |
| tool_choice | none | String or ToolChoiceObject | Used only in conjunction with the tools field. Supports the keyword strings auto, required, and none. auto lets the model decide which (if any) tool is relevant to use; if the model doesn't believe any of the tools in tools are relevant, it generates a standard assistant message instead of a tool call. required means the model picks the most relevant tool in tools and must generate a tool call. none means the model does not generate any tool calls and instead must generate a standard assistant message. To force a tool call with a specific tool defined in tools, use a ToolChoiceObject. If the tools field is populated, tool_choice defaults to "auto"; otherwise it defaults to "none". |
| tools | null | ToolObject | A list of tools that the model can call. Currently, function is the only supported tool type, and a maximum of 32 functions is supported. |
| response_format | null | ResponseFormatObject | An object specifying the format that the model must output. Accepted types are text, json_schema, or json_object. Setting { "type": "json_schema", "json_schema": {...} } enables structured outputs, which ensures the model follows your supplied JSON schema. Setting { "type": "json_object" } ensures the responses the model generates are valid JSON, but does not ensure that they follow a specific schema. |
| logprobs | false | Boolean | Whether to provide the log probability of a token being sampled. |
| top_logprobs | null | Integer | The number of most likely token candidates to return log probabilities for at each sampling step. Can be 0-20. logprobs must be true if this field is used. |
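
To make the parameter table concrete, here is a sketch of a chat request exercising several of the fields above. The workspace URL and endpoint name are assumptions:

```python
import os
import requests

# Assumed workspace URL and endpoint name; substitute your own.
url = "https://<your-workspace>.cloud.databricks.com/serving-endpoints/<chat-endpoint>/invocations"

payload = {
    "messages": [
        {"role": "system", "content": "You are a terse assistant."},
        {"role": "user", "content": "Name three uses for embeddings."},
    ],
    "max_tokens": 256,   # cap generation length
    "temperature": 0.2,  # near-deterministic sampling
    "top_p": 0.95,
    "stop": ["\n\n"],    # stop at the first blank line
    "stream": False,     # request a single, non-streamed response
}

resp = requests.post(
    url,
    headers={"Authorization": f"Bearer {os.environ['DATABRICKS_TOKEN']}"},
    json=payload,
)
resp.raise_for_status()
print(resp.json()["choices"][0]["message"]["content"])
```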

ChatMessage

| Field | Type | Description |
| --- | --- | --- |
| role | String | Required. The role of the author of the message. Can be "system", "user", "assistant", or "tool". |
| content | String | The content of the message. Required for chat tasks that do not involve tool calls. |
| tool_calls | ToolCall list | The list of tool_calls that the model generated. The role must be "assistant" and the content field must not be specified. |
| tool_call_id | String | When role is "tool", the ID associated with the ToolCall that the message is responding to. Must be empty for other role options. |

The system role can only be used once, as the first message in a conversation. It overrides the model’s default system prompt.

ToolCall

A tool call suggested by the model. See Function calling on Databricks.

| Field | Type | Description |
| --- | --- | --- |
| id | String | Required. A unique identifier for this tool call suggestion. |
| type | String | Required. Only "function" is supported. |
| function | FunctionCallCompletion | Required. A function call suggested by the model. |

FunctionCallCompletion

| Field | Type | Description |
| --- | --- | --- |
| name | String | Required. The name of the function the model recommended. |
| arguments | Object | Required. Arguments to the function as a serialized JSON dictionary. |

ToolChoiceObject

See Function calling on Databricks.

| Field | Type | Description |
| --- | --- | --- |
| type | String | Required. The type of the tool. Currently, only "function" is supported. |
| function | Object | Required. An object defining which tool to call, of the form {"type": "function", "function": {"name": "my_function"}}, where "my_function" is the name of a FunctionObject in the tools field. |

ToolObject

See Function calling on Databricks.

| Field | Type | Description |
| --- | --- | --- |
| type | String | Required. The type of the tool. Currently, only function is supported. |
| function | FunctionObject | Required. The function definition associated with the tool. |

FunctionObject

| Field | Type | Description |
| --- | --- | --- |
| name | String | Required. The name of the function to be called. |
| description | Object | Required. The detailed description of the function. The model uses this description to understand the relevance of the function to the prompt and to generate tool calls with higher accuracy. |
| parameters | Object | The parameters the function accepts, described as a valid JSON schema object. If the tool is called, the tool call is fit to the JSON schema provided. Omitting parameters defines a function without any parameters. The number of properties is limited to 15 keys. |
| strict | Boolean | Whether to enable strict schema adherence when generating the function call. If set to true, the model follows the exact schema defined in the schema field. Only a subset of JSON schema is supported when strict is true. |
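
Putting ToolObject, FunctionObject, and tool_choice together, a tool-calling request body might look like the following sketch. The get_weather function is a made-up example, not a built-in tool:

```python
# A sketch of a tool-calling request body; get_weather is hypothetical.
payload = {
    "messages": [{"role": "user", "content": "What's the weather in Paris?"}],
    "tools": [
        {
            "type": "function",
            "function": {
                "name": "get_weather",
                "description": "Get the current weather for a city.",
                "parameters": {
                    "type": "object",
                    "properties": {
                        "city": {"type": "string", "description": "City name"},
                    },
                    "required": ["city"],
                },
            },
        }
    ],
    # Force a call to this specific tool via a ToolChoiceObject;
    # omit tool_choice to get the default "auto" behavior when tools is set.
    "tool_choice": {"type": "function", "function": {"name": "get_weather"}},
}
```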

ResponseFormatObject

See Structured outputs on Databricks.

| Field | Type | Description |
| --- | --- | --- |
| type | String | Required. The type of response format being defined. Either text for unstructured text, json_object for unstructured JSON objects, or json_schema for JSON objects adhering to a specific schema. |
| json_schema | JsonSchemaObject | Required if type is set to json_schema. The JSON schema to adhere to. |

JsonSchemaObject

See Structured outputs on Databricks.

| Field | Type | Description |
| --- | --- | --- |
| name | String | Required. The name of the response format. |
| description | String | A description of what the response format is for, used by the model to determine how to respond in the format. |
| schema | Object | Required. The schema for the response format, described as a JSON schema object. |
| strict | Boolean | Whether to enable strict schema adherence when generating the output. If set to true, the model follows the exact schema defined in the schema field. Only a subset of JSON schema is supported when strict is true. |
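
As a sketch, a response_format value requesting structured outputs could look like this; the event schema itself is an invented example:

```python
# A sketch of a structured-outputs request body; the schema is invented.
payload = {
    "messages": [{"role": "user", "content": "Extract the event details."}],
    "response_format": {
        "type": "json_schema",
        "json_schema": {
            "name": "event",
            "description": "An extracted calendar event.",
            "schema": {
                "type": "object",
                "properties": {
                    "title": {"type": "string"},
                    "date": {"type": "string"},
                },
                "required": ["title", "date"],
            },
            "strict": True,  # enforce exact schema adherence
        },
    },
}
```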

Chat response

For non-streaming requests, the response is a single chat completion object. For streaming requests, the response is a text/event-stream where each event is a completion chunk object. The top-level structure of completion and chunk objects is almost identical: only choices has a different type.

| Field | Type | Description |
| --- | --- | --- |
| id | String | Unique identifier for the chat completion. |
| choices | List[ChatCompletionChoice], or List[ChatCompletionChunk] for streaming | List of chat completion texts. n choices are returned if the n parameter is specified. |
| object | String | The object type. Equal to either "chat.completions" for non-streaming or "chat.completion.chunk" for streaming. |
| created | Integer | The Unix timestamp (in seconds) when the chat completion was generated. |
| model | String | The model version used to generate the response. |
| usage | Usage | Token usage metadata. Might not be present on streaming responses. |

ChatCompletionChoice

| Field | Type | Description |
| --- | --- | --- |
| index | Integer | The index of the choice in the list of generated choices. |
| message | ChatMessage | A chat completion message returned by the model. The role will be assistant. |
| finish_reason | String | The reason the model stopped generating tokens. |

ChatCompletionChunk

| Field | Type | Description |
| --- | --- | --- |
| index | Integer | The index of the choice in the list of generated choices. |
| delta | ChatMessage | A chat completion message part of generated streamed responses from the model. Only the first chunk is guaranteed to have role populated. |
| finish_reason | String | The reason the model stopped generating tokens. Only the last chunk has this populated. |
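
Because streamed responses arrive as server-sent events, a client assembles the delta fields chunk by chunk. Below is a minimal sketch; the "data: " line prefix and "[DONE]" end-of-stream sentinel are assumptions carried over from OpenAI's SSE conventions, and the workspace URL and endpoint name are placeholders:

```python
import json
import os
import requests

# Assumed workspace URL and endpoint name; substitute your own.
url = "https://<your-workspace>.cloud.databricks.com/serving-endpoints/<chat-endpoint>/invocations"

with requests.post(
    url,
    headers={"Authorization": f"Bearer {os.environ['DATABRICKS_TOKEN']}"},
    json={"messages": [{"role": "user", "content": "Tell me a story."}], "stream": True},
    stream=True,
) as resp:
    resp.raise_for_status()
    for line in resp.iter_lines():
        if not line:
            continue  # SSE events are separated by blank lines
        data = line.decode("utf-8").removeprefix("data: ")
        if data == "[DONE]":  # assumed OpenAI-style end-of-stream sentinel
            break
        chunk = json.loads(data)
        delta = chunk["choices"][0]["delta"]
        print(delta.get("content", ""), end="", flush=True)
```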

Completion task

Text completion tasks are for generating responses to a single prompt. Unlike Chat, this task supports batched inputs: multiple independent prompts can be sent in one request. See POST /serving-endpoints/{name}/invocations for querying endpoint parameters.

Completion request

| Field | Default | Type | Description |
| --- | --- | --- | --- |
| prompt | | String or List[String] | Required. The prompt(s) for the model. |
| max_tokens | null | null, which means no limit, or an integer greater than zero | The maximum number of tokens to generate. |
| stream | true | Boolean | Stream responses back to a client to allow partial results for requests. If this parameter is included in the request, responses are sent using the Server-sent events standard. |
| temperature | 1.0 | Float in [0,2] | The sampling temperature. 0 is deterministic and higher values introduce more randomness. |
| top_p | 1.0 | Float in (0,1] | The probability threshold used for nucleus sampling. |
| top_k | null | null, which means no limit, or an integer greater than zero | The number of k most likely tokens to use for top-k filtering. Set this value to 1 to make outputs deterministic. |
| error_behavior | "error" | "truncate" or "error" | For timeouts and context-length-exceeded errors. One of "truncate" (return as many tokens as possible) or "error" (return an error). This parameter is only accepted by pay-per-token endpoints. |
| n | 1 | Integer greater than zero | The API returns n independent completions when n is specified. Recommended for workloads that generate multiple completions on the same input for additional inference efficiency and cost savings. Only available for provisioned throughput endpoints. |
| stop | [] | String or List[String] | The model stops generating further tokens when any one of the sequences in stop is encountered. |
| suffix | "" | String | A string that is appended to the end of every completion. |
| echo | false | Boolean | Returns the prompt along with the completion. |
| use_raw_prompt | false | Boolean | If true, pass the prompt directly into the model without any transformation. |
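
A sketch of a batched completion request, exercising the list-of-prompts form and a few of the parameters above (workspace URL and endpoint name are assumptions):

```python
import os
import requests

# Assumed workspace URL and endpoint name for a completions-task endpoint.
url = "https://<your-workspace>.cloud.databricks.com/serving-endpoints/<completion-endpoint>/invocations"

payload = {
    "prompt": [                # batched input: two independent prompts
        "The capital of France is",
        "The capital of Japan is",
    ],
    "max_tokens": 16,
    "temperature": 0.0,        # deterministic
    "stop": ["."],
    "echo": False,             # do not repeat the prompt in the output
    "stream": False,
}

resp = requests.post(
    url,
    headers={"Authorization": f"Bearer {os.environ['DATABRICKS_TOKEN']}"},
    json=payload,
)
resp.raise_for_status()
for choice in resp.json()["choices"]:
    print(choice["index"], choice["text"])
```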

Completion response

| Field | Type | Description |
| --- | --- | --- |
| id | String | Unique identifier for the text completion. |
| choices | CompletionChoice | A list of text completions. For every prompt passed in, n choices are generated if n is specified. Default n is 1. |
| object | String | The object type. Equal to "text_completion". |
| created | Integer | The Unix timestamp (in seconds) when the completion was generated. |
| usage | Usage | Token usage metadata. |

CompletionChoice

| Field | Type | Description |
| --- | --- | --- |
| index | Integer | The index of the prompt in the request. |
| text | String | The generated completion. |
| finish_reason | String | The reason the model stopped generating tokens. |

Embedding task

Embedding tasks map input strings into embedding vectors. Many inputs can be batched together in each request. See POST /serving-endpoints/{name}/invocations for querying endpoint parameters.

Embedding request

| Field | Type | Description |
| --- | --- | --- |
| input | String or List[String] | Required. The input text to embed. Can be a string or a list of strings. |
| instruction | String | An optional instruction to pass to the embedding model. |

Instructions are optional and highly model specific. For instance, the BGE authors recommend no instruction when indexing chunks and recommend using the instruction "Represent this sentence for searching relevant passages:" for retrieval queries. Other models, like Instructor-XL, support a wide range of instruction strings.
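
A sketch of a batched embedding request with the optional instruction, reusing the BGE retrieval-query instruction quoted above (workspace URL and endpoint name are assumptions):

```python
import os
import requests

# Assumed workspace URL and endpoint name for an embeddings-task endpoint.
url = "https://<your-workspace>.cloud.databricks.com/serving-endpoints/<embedding-endpoint>/invocations"

payload = {
    "input": ["What is nucleus sampling?", "How do stop sequences work?"],
    # Optional and model specific; this is the BGE query instruction cited above.
    "instruction": "Represent this sentence for searching relevant passages:",
}

resp = requests.post(
    url,
    headers={"Authorization": f"Bearer {os.environ['DATABRICKS_TOKEN']}"},
    json=payload,
)
resp.raise_for_status()
# Assuming data comes back as a list with one EmbeddingObject per input string.
for item in resp.json()["data"]:
    print(item["index"], len(item["embedding"]))  # e.g. 1024 dimensions for BGE-Large
```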

Embeddings response

| Field | Type | Description |
| --- | --- | --- |
| id | String | Unique identifier for the embedding. |
| object | String | The object type. Equal to "list". |
| model | String | The name of the embedding model used to create the embedding. |
| data | EmbeddingObject | The embedding object. |
| usage | Usage | Token usage metadata. |

EmbeddingObject

| Field | Type | Description |
| --- | --- | --- |
| object | String | The object type. Equal to "embedding". |
| index | Integer | The index of the embedding in the list of embeddings generated by the model. |
| embedding | List[Float] | The embedding vector. Each model returns a fixed-size vector (1024 for BGE-Large). |

Additional resources