API Examples

This topic contains a range of examples that demonstrate how to use the Databricks API.

Requirements

Before you try these examples, review the Authentication topic. When you use cURL, we assume that you store your Databricks API credentials in .netrc or use Bearer authentication. In the following examples, replace <your-token> with your Databricks personal access token.

In the following examples, replace <databricks-instance> with the <ACCOUNT>.cloud.databricks.com domain name of your Databricks deployment.

Use jq to parse API output

Sometimes it can be useful to parse out parts of the JSON output. In these cases, we recommend that you use the utility jq. For more information, see the jq Manual. You can install jq on macOS using Homebrew with brew install jq.

Invoke a GET

While most API calls require that you specify a JSON body, for GET calls you can specify a query string. For example, to get the details for a cluster, run:

curl -n https://<databricks-instance>/api/2.0/clusters/get?cluster_id=<cluster-id>

To list the contents of the DBFS root, run:

curl -n https://<databricks-instance>/api/2.0/dbfs/list?path=/ | jq
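If you prefer Python over cURL, the same GET calls can be made with the requests library. The following is a minimal sketch, assuming Bearer authentication with a personal access token; it passes the cluster ID as a query-string parameter, just like the first cURL example above.

import requests

DOMAIN = '<databricks-instance>'
TOKEN = '<your-token>'

# GET calls take a query string instead of a JSON body
response = requests.get(
    'https://%s/api/2.0/clusters/get' % DOMAIN,
    headers={'Authorization': 'Bearer %s' % TOKEN},
    params={'cluster_id': '<cluster-id>'}
)
print(response.json())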

Get a gzipped list of clusters

curl -n -H "Accept-Encoding: gzip" https://<databricks-instance>/api/2.0/clusters/list > clusters.gz
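A Python sketch of the same request follows. Note that requests decompresses gzip responses by default, so this sketch streams the raw (still compressed) bytes to disk; it is one way to keep the compressed payload, not the only approach.

import requests

DOMAIN = '<databricks-instance>'
TOKEN = '<your-token>'

# Request a gzipped response and save the compressed bytes as-is
response = requests.get(
    'https://%s/api/2.0/clusters/list' % DOMAIN,
    headers={'Authorization': 'Bearer %s' % TOKEN, 'Accept-Encoding': 'gzip'},
    stream=True
)
with open('clusters.gz', 'wb') as f:
    f.write(response.raw.read())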

Upload a big file into DBFS

The amount of data that can be uploaded by a single API call cannot exceed 1 MB. To upload a file larger than 1 MB to DBFS, use the streaming API, which is a combination of create, addBlock, and close.

Here is an example of how to perform this action using Python.

import json
import base64
import requests

DOMAIN = '<databricks-instance>'
TOKEN = '<your-token>'
BASE_URL = 'https://%s/api/2.0/dbfs/' % (DOMAIN)

def dbfs_rpc(action, body):
    """A helper function to make a DBFS API request; the request/response is encoded/decoded as JSON."""
    response = requests.post(
        BASE_URL + action,
        headers={"Authorization": "Bearer %s" % TOKEN},
        json=body
    )
    return response.json()

# Create a handle that will be used to add blocks
handle = dbfs_rpc("create", {"path": "/temp/upload_large_file", "overwrite": True})['handle']
with open('/a/local/file', 'rb') as f:
    while True:
        # A block can be at most 1 MB
        block = f.read(1 << 20)
        if not block:
            break
        # The data field must be base64 encoded
        data = base64.standard_b64encode(block).decode()
        dbfs_rpc("add-block", {"handle": handle, "data": data})
# Close the handle to finish uploading
dbfs_rpc("close", {"handle": handle})

Create a Python 3 cluster

The following example shows how to launch a Python 3 cluster using the Databricks REST API and the popular requests Python HTTP library:

import requests

DOMAIN = '<databricks-instance>'
TOKEN = '<your-token>'

response = requests.post(
  'https://%s/api/2.0/clusters/create' % (DOMAIN),
  headers={'Authorization': 'Bearer %s' % TOKEN},
  json={
    "spark_version": "4.0.x-scala2.11",
    "node_type_id": "r3.xlarge",
    "num_workers": 1,
    "spark_env_vars": {
      "PYSPARK_PYTHON": "/databricks/python3/bin/python3"
    }
  }
)

if response.status_code == 200:
  print(response.json()['cluster_id'])
else:
  print("Error launching cluster: %s: %s" % (response.json()["error_code"], response.json()["message"]))

Jobs API examples

This section shows how to create Python, spark-submit, and JAR jobs, and how to run the JAR job and view its output.

Create a Python job

This example shows how to create a Python job. It uses the Apache Spark SparkPi estimation example written in Python.

  1. Download the Python file containing the example and upload it to your Databricks instance using the Databricks File System (DBFS).

    dbfs cp pi.py dbfs:/docs/pi.py
    
  2. Create the job. (A Python sketch of the same request follows these steps.)

curl -n -H "Content-Type: application/json" -X POST -d @- https://<databricks-instance>/api/2.0/jobs/create <<JSON
{
  "name": "SparkPi Python job",
  "new_cluster": {
    "spark_version": "4.0.x-scala2.11",
    "node_type_id": "i3.xlarge",
    "num_workers": 2
  },
  "spark_python_task": {
    "python_file": "dbfs:/docs/pi.py",
    "parameters": [
      "10"
    ]
  }
}
JSON
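As mentioned in step 2, the same jobs/create request can be made from Python. This is a sketch, assuming Bearer authentication and the DOMAIN and TOKEN placeholders used earlier; the request body matches the cURL example.

import requests

DOMAIN = '<databricks-instance>'
TOKEN = '<your-token>'

response = requests.post(
    'https://%s/api/2.0/jobs/create' % DOMAIN,
    headers={'Authorization': 'Bearer %s' % TOKEN},
    json={
        "name": "SparkPi Python job",
        "new_cluster": {
            "spark_version": "4.0.x-scala2.11",
            "node_type_id": "i3.xlarge",
            "num_workers": 2
        },
        "spark_python_task": {
            "python_file": "dbfs:/docs/pi.py",
            "parameters": ["10"]
        }
    }
)
print(response.json())  # contains the job_id of the new job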

Create a spark-submit job

This example shows how to create a spark-submit job. It uses the Apache Spark SparkPi example.

  1. Download the JAR containing the example and upload the JAR to your Databricks instance using the Databricks File System (DBFS).

    dbfs cp SparkPi-assembly-0.1.jar dbfs:/docs/sparkpi.jar
    
  2. Create the job.

curl -n \
-X POST -H 'Content-Type: application/json' \
-d '{
      "name": "SparkPi spark-submit job",
      "new_cluster": {
        "spark_version": "4.0.x-scala2.11",
        "node_type_id": "r3.xlarge",
        "aws_attributes": {"availability": "ON_DEMAND"},
        "num_workers": 2
        },
     "spark_submit_task": {
        "parameters": [
          "--class",
          "org.apache.spark.examples.SparkPi",
          "dbfs:/docs/sparkpi.jar",
          "10"
          ]
        }
}' https://<databricks-instance>/api/2.0/jobs/create

Create and run a JAR job

This example shows how to create and run a JAR job. It uses the Apache Spark SparkPi example.

  1. Download the JAR containing the example.

  2. Upload the JAR to your Databricks instance using the API:

    curl -n \
    -F filedata=@"SparkPi-assembly-0.1.jar" \
    -F path="/docs/sparkpi.jar" \
    -F overwrite=true \
    https://<databricks-instance>/api/2.0/dbfs/put
    

    A successful call returns {}. Otherwise you will see an error message.

  3. Get a list of all Spark versions prior to creating your job.

    curl -n https://<databricks-instance>/api/2.0/clusters/spark-versions
    

    This example uses version 4.0.x-scala2.11. See Databricks Runtime Versions for more information about Spark cluster versions.

  4. Create the job. The JAR is specified as a library and the main class name is referenced in the Spark JAR task.

    curl -n \
    -X POST -H 'Content-Type: application/json' \
    -d '{
          "name": "SparkPi JAR job",
          "new_cluster": {
            "spark_version": "4.0.x-scala2.11",
            "node_type_id": "r3.xlarge",
            "aws_attributes": {"availability": "ON_DEMAND"},
            "num_workers": 2
            },
         "libraries": [{"jar": "dbfs:/docs/sparkpi.jar"}],
         "spark_jar_task": {
            "main_class_name": "org.apache.spark.examples.SparkPi",
            "parameters": ["10"]
            }
    }' https://<databricks-instance>/api/2.0/jobs/create
    

    This returns a job-id that you can then use to run the job.

  5. Run the job using run now:

    curl -n \
    -X POST -H 'Content-Type: application/json' \
    -d '{ "job_id": <job-id> }' https://<databricks-instance>/api/2.0/jobs/run-now
    
  6. Navigate to https://<databricks-instance>/#job/<job-id> and you’ll be able to see your job running.

  7. You can also check the run status from the API using the run_id returned by the previous request. (A Python sketch that polls until the run finishes follows these steps.)

    curl -n https://<databricks-instance>/api/2.0/jobs/runs/get?run_id=<run-id> | jq
    

    This should return something like:

    {
      "job_id": 35,
      "run_id": 30,
      "number_in_job": 1,
      "original_attempt_run_id": 30,
      "state": {
        "life_cycle_state": "TERMINATED",
        "result_state": "SUCCESS",
        "state_message": ""
      },
      "task": {
        "spark_jar_task": {
          "jar_uri": "",
          "main_class_name": "org.apache.spark.examples.SparkPi",
          "parameters": [
            "10"
          ],
          "run_as_repl": true
        }
      },
      "cluster_spec": {
        "new_cluster": {
          "spark_version": "4.0.x-scala2.11",
          "node_type_id": "<node-type>",
          "enable_elastic_disk": false,
          "num_workers": 1
        },
        "libraries": [
          {
            "jar": "dbfs:/docs/sparkpi.jar"
          }
        ]
      },
      "cluster_instance": {
        "cluster_id": "0412-165350-type465",
        "spark_context_id": "5998195893958609953"
      },
      "start_time": 1523552029282,
      "setup_duration": 211000,
      "execution_duration": 33000,
      "cleanup_duration": 2000,
      "trigger": "ONE_TIME",
      "creator_user_name": "...",
      "run_name": "SparkPi JAR job",
      "run_page_url": "<databricks-instance>/?o=3901135158661429#job/35/run/1",
      "run_type": "JOB_RUN"
    }
    
  8. To view the job output, visit the job run details page.

    Executing command, time = 1523552263909.
    Pi is roughly 3.13973913973914
    
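As noted in step 7, you can poll the run from code instead of refreshing the page. The following Python sketch, assuming Bearer authentication, calls jobs/runs/get until the run reaches a terminal life_cycle_state and then prints the result_state; the field names match the sample response shown above.

import time
import requests

DOMAIN = '<databricks-instance>'
TOKEN = '<your-token>'
RUN_ID = 30  # the run_id returned by jobs/run-now

# Poll until the run reaches a terminal life cycle state
while True:
    response = requests.get(
        'https://%s/api/2.0/jobs/runs/get' % DOMAIN,
        headers={'Authorization': 'Bearer %s' % TOKEN},
        params={'run_id': RUN_ID}
    )
    state = response.json()['state']
    if state['life_cycle_state'] in ('TERMINATED', 'SKIPPED', 'INTERNAL_ERROR'):
        print(state.get('result_state'), state.get('state_message'))
        break
    time.sleep(30)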

Enable table access control example

To create a cluster enabled for table access control, specify the following spark_conf property in your request body:

curl -n -X POST -H 'Content-Type: application/json' https://<databricks-instance>/api/2.0/clusters/create -d'
{
  "cluster_name": "my-cluster",
  "spark_version": "4.0.x-scala2.11",
  "node_type_id": "i3.xlarge",
  "spark_conf": {
    "spark.databricks.acl.dfAclsEnabled":true,
    "spark.databricks.repl.allowedLanguages": "python,sql"
  },
  "aws_attributes": {
    "availability": "SPOT",
    "zone_id": "us-west-2a"
  },
  "num_workers": 1,
  "custom_tags":{
     "costcenter":"Tags",
     "applicationname":"Tags1"
  }
}'

Cluster log delivery examples

While you can view the Spark driver and executor logs in the Spark UI, Databricks can also deliver the logs to a DBFS or S3 destination. We provide several examples below.

Create a cluster with logs delivered to a DBFS location

The following cURL command creates a cluster named “cluster_log_dbfs” and requests Databricks to send its logs to dbfs:/logs with the cluster ID as the path prefix.

curl -n -H "Content-Type: application/json" -X POST -d @- https://<databricks-instance>/api/2.0/clusters/create <<JSON
{
  "cluster_name": "cluster_log_dbfs",
  "spark_version": "4.0.x-scala2.11",
  "node_type_id": "i3.xlarge",
  "num_workers": 1,
  "cluster_log_conf": {
    "dbfs": {
      "destination": "dbfs:/logs"
    }
  }
}
JSON

The response should contain the cluster ID:

{"cluster_id":"1111-223344-abc55"}

After cluster creation, Databricks syncs log files to the destination every 5 minutes. It uploads driver logs to dbfs:/logs/1111-223344-abc55/driver and executor logs to dbfs:/logs/1111-223344-abc55/executor.

Create a cluster with logs delivered to an S3 location

Databricks also supports delivering logs to an S3 location using cluster IAM roles. The following command creates a cluster named “cluster_log_s3” and requests Databricks to send its logs to s3://my-bucket/logs using the IAM role associated with the specified instance profile.

curl -n -H "Content-Type: application/json" -X POST -d @- https://<databricks-instance>/api/2.0/clusters/create <<JSON
{
  "cluster_name": "cluster_log_s3",
  "spark_version": "4.0.x-scala2.11",
  "node_type_id": "i3.xlarge",
  "aws_attributes": {
    "availability": "SPOT",
    "zone_id": "us-west-2c",
    "instance_profile_arn": "arn:aws:iam::12345678901234:instance-profile/YOURIAM"
  },
  "num_workers": 1,
  "cluster_log_conf": {
    "s3": {
      "destination": "s3://my-bucket/logs",
      "region": "us-west-2"
    }
  }
}
JSON

Databricks delivers the logs to the S3 destination using the corresponding IAM role. We support encryption with both Amazon S3-Managed Keys (SSE-S3) and AWS KMS-Managed Keys (SSE-KMS). See the Clusters API documentation for more details.

Note

Make sure the IAM role has permission to upload logs to the S3 destination and to read them afterward. Otherwise, by default only the AWS account owner of the S3 bucket can access the logs. Use canned_acl in the API request to change the default permission.

Check log delivery status

You can retrieve cluster information, including log delivery status, via the API:

curl -n -H "Content-Type: application/json" -X GET -d @- https://<databricks-instance>/api/2.0/clusters/get <<JSON
{
  "cluster_id": "1111-223344-abc55"
}
JSON

If the latest batch of log uploads was successful, the response contains only the timestamp of the last attempt:

{
  "cluster_log_status": {
    "last_attempted": 1479338561
  }
}

If there are errors, the error message appears in the response:

{
  "cluster_log_status": {
    "last_attempted": 1479338561,
    "last_exception": "Exception: Access Denied ..."
  }
}
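The same status check can be made from Python. This is a sketch, assuming Bearer authentication; it reads the cluster_log_status field shown in the responses above and handles its absence with a default.

import requests

DOMAIN = '<databricks-instance>'
TOKEN = '<your-token>'

response = requests.get(
    'https://%s/api/2.0/clusters/get' % DOMAIN,
    headers={'Authorization': 'Bearer %s' % TOKEN},
    params={'cluster_id': '1111-223344-abc55'}
)
log_status = response.json().get('cluster_log_status', {})
if 'last_exception' in log_status:
    print('Log delivery failed:', log_status['last_exception'])
else:
    print('Last delivery attempt:', log_status.get('last_attempted'))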

Workspace API examples

Here are some examples of using the Workspace API to list, get information about, create, delete, export, and import notebooks and folders.

List a notebook or a folder

The following cURL command lists a path in the workspace.

curl -n -H "Content-Type: application/json" -X GET -d @- https://<databricks-instance>/api/2.0/workspace/list <<JSON
{
  "path": "/Users/user@example.com/"
}
JSON

The response should contain a list of statuses:

{
  "objects": [
    {
      "object_type": "DIRECTORY",
      "path": "/Users/user@example.com/folder"
    },
    {
      "object_type": "NOTEBOOK",
      "language": "PYTHON",
      "path": "/Users/user@example.com/notebook1"
    },
    {
      "object_type": "NOTEBOOK",
      "language": "SCALA",
      "path": "/Users/user@example.com/notebook2"
    }
  ]
}

If the path is a notebook, the response contains an array with the status of the input notebook.
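Because workspace/list returns only one level at a time, walking an entire folder tree means recursing into each DIRECTORY entry. Here is a minimal Python sketch of that pattern, assuming Bearer authentication:

import requests

DOMAIN = '<databricks-instance>'
TOKEN = '<your-token>'

def list_workspace(path):
    """Recursively print the paths of all objects under the given workspace folder."""
    response = requests.get(
        'https://%s/api/2.0/workspace/list' % DOMAIN,
        headers={'Authorization': 'Bearer %s' % TOKEN},
        params={'path': path}
    )
    for obj in response.json().get('objects', []):
        print(obj['path'])
        if obj['object_type'] == 'DIRECTORY':
            list_workspace(obj['path'])

list_workspace('/Users/user@example.com/')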

Get information about a notebook or a folder

The following cURL command gets the status of a path in the workspace.

curl -n -H "Content-Type: application/json" -X GET -d @- https://<databricks-instance>/api/2.0/workspace/get-status <<JSON
{
  "path": "/Users/user@example.com/"
}
JSON

The response should contain the status of the input path:

{
  "object_type": "DIRECTORY",
  "path": "/Users/user@example.com"
}

Create a folder

The following cURL command creates a folder in the workspace. It creates the folder recursively like mkdir -p. If the folder already exists, it will do nothing and succeed.

curl -n -H "Content-Type: application/json" -X POST -d @- https://<databricks-instance>/api/2.0/workspace/mkdirs <<JSON
{
  "path": "/Users/user@example.com/new/folder"
}
JSON

If the request succeeds, an empty JSON string will be returned.

Delete a notebook or folder

The following cURL command deletes a notebook or folder in the workspace. You can enable recursive to recursively delete a non-empty folder.

curl -n -H "Content-Type: application/json" -X POST -d @- https://<databricks-instance>/api/2.0/workspace/delete <<JSON
{
  "path": "/Users/user@example.com/new/folder",
  "recursive": "false"
}
JSON

If the request succeeds, an empty JSON string is returned.

Export a notebook or folder

The following cURL command exports a notebook in the workspace. Notebooks can be exported in the following formats: SOURCE, HTML, JUPYTER, DBC. Note that a folder can only be exported as DBC.

curl -n -H "Content-Type: application/json" -X GET -d @- https://<databricks-instance>/api/2.0/workspace/export <<JSON
{
  "path": "/Users/user@example.com/notebook",
  "format": "SOURCE"
}
JSON

The response contains the base64-encoded notebook content.

{
  "content": "Ly8gRGF0YWJyaWNrcyBub3RlYm9vayBzb3VyY2UKcHJpbnQoImhlbGxvLCB3b3JsZCIpCgovLyBDT01NQU5EIC0tLS0tLS0tLS0KCg=="
}
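To turn the response into a local file, decode the content field. The following is a Python sketch, assuming Bearer authentication; the local file name notebook.scala is only an example that matches the Scala import example later in this topic.

import base64
import requests

DOMAIN = '<databricks-instance>'
TOKEN = '<your-token>'

response = requests.get(
    'https://%s/api/2.0/workspace/export' % DOMAIN,
    headers={'Authorization': 'Bearer %s' % TOKEN},
    params={'path': '/Users/user@example.com/notebook', 'format': 'SOURCE'}
)
# The content field is base64 encoded; decode it before writing to disk
with open('notebook.scala', 'wb') as f:
    f.write(base64.b64decode(response.json()['content']))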

Alternatively, you can download the exported notebook directly.

curl -n -X GET "https://<databricks-instance>/api/2.0/workspace/export?format=SOURCE&direct_download=true&path=/Users/user@example.com/notebook"

The response will be the exported notebook content.

Import a notebook or directory

The following cURL command imports a notebook into the workspace. Multiple formats (SOURCE, HTML, JUPYTER, DBC) are supported. If the format is SOURCE, you must specify language. The content parameter contains the base64-encoded notebook content. You can enable overwrite to overwrite an existing notebook.

curl -n -H "Content-Type: application/json" -X POST -d @- https://<databricks-instance>/api/2.0/workspace/import <<JSON
{
  "path": "/Users/user@example.com/new-notebook",
  "format": "SOURCE",
  "language": "SCALA",
  "content": "Ly8gRGF0YWJyaWNrcyBub3RlYm9vayBzb3VyY2UKcHJpbnQoImhlbGxvLCB3b3JsZCIpCgovLyBDT01NQU5EIC0tLS0tLS0tLS0KCg==",
  "overwrite": "false"
}
JSON

If the request succeeds, an empty JSON string is returned.

Alternatively, you can import a notebook via a multipart form POST.

curl -n -X POST https://<databricks-instance>/api/2.0/workspace/import \
     -F path="/Users/user@example.com/new-notebook" -F format=SOURCE -F language=SCALA -F overwrite=true -F content=@notebook.scala
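The same import can also be done from Python by base64-encoding a local file and posting the JSON body shown earlier. This is a sketch, assuming Bearer authentication and a local notebook.scala:

import base64
import requests

DOMAIN = '<databricks-instance>'
TOKEN = '<your-token>'

# Read the local notebook and base64 encode it for the content field
with open('notebook.scala', 'rb') as f:
    content = base64.standard_b64encode(f.read()).decode()

response = requests.post(
    'https://%s/api/2.0/workspace/import' % DOMAIN,
    headers={'Authorization': 'Bearer %s' % TOKEN},
    json={
        "path": "/Users/user@example.com/new-notebook",
        "format": "SOURCE",
        "language": "SCALA",
        "content": content,
        "overwrite": True
    }
)
print(response.json())  # {} on success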