Libraries

To make third-party or locally-built code available to execution environments running on your clusters, you create a library. Libraries can be written in Python, Java, Scala, and R.

To share a library with all users in a Workspace, create the library in the Shared folder. To make a library available to a single user, create the library in that user's folder.

You can create and manage libraries using the UI, the Databricks CLI, or the Libraries API. This topic focuses on performing library tasks using the UI. For the other methods, see Databricks CLI and Libraries API.
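
For example, here is a minimal sketch of installing a library on a cluster with the Databricks CLI; the cluster ID and package version are illustrative, and you can run databricks libraries install --help to confirm the flags in your CLI version:

    databricks libraries install --cluster-id 1234-567890-abc123 --pypi-package pandas==0.17.1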

Library lifecycle

Libraries can be created, attached to a cluster, detached from a cluster, and deleted.

When you create a library, you either upload the library package or install it from a package repository. Packages that you upload or install using Maven are stored in the FileStore under FileStore/jars. Databricks installs Python packages in the Spark container using pip install.
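
For example, after you upload a JAR, you can verify that it is stored under FileStore/jars by listing that directory from a notebook:

    display(dbutils.fs.ls("dbfs:/FileStore/jars"))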

To use a library, you first attach it to a cluster. To use the library in a notebook that was attached to the cluster before the library was attached, you must detach the notebook and reattach it to the cluster.

There are two steps to permanently delete a library:

  1. Move the library to the Trash folder.
  2. Either permanently delete the library in the Trash folder or empty the Trash folder.

When you move a library to the Trash folder, the library is not marked for deletion and remains available on any cluster it is attached to. When you permanently delete a library, each cluster to which the library is attached marks the library for deletion. The following screenshot illustrates the delete and detach indications:

../_images/library-delete-detach.png

As indicated in the screenshot, when you detach a library from a cluster or permanently delete a library previously attached to the cluster, you must restart the cluster.

Create a library

You can create Java, Scala, and Python libraries to run on Spark clusters, or point to external packages in PyPI, Maven, and CRAN repositories. To create a library:

  1. Right-click the folder where you want to store the library.

  2. Select Create > Library.

    New Library

Upload a Java JAR or Scala JAR

  1. In the Source drop-down list, select Upload Java/Scala JAR.

  2. Enter a library name.

  3. Click and drag your JAR to the JAR File text box.

    Upload Jar

  4. Click Create Library. The library detail screen displays.

  5. In the Attach column, select clusters to attach the library to.

  6. Optionally select the Attach automatically to all clusters checkbox and click Confirm.

Upload a Python PyPI package or Python Egg

  1. In the Source drop-down list, select Upload Python Egg or PyPI.

    • PyPI package - Enter a PyPI package name and click Install Library. The library detail screen displays.

      Tip

      PyPI supports installing a specific version of a library. For example, to install version 0.17.1 of pandas, specify the library as: pandas==0.17.1.

    • Python egg:

      1. Enter a library name.
      2. Click and drag the egg and optionally the documentation egg to the Egg File text box.
      3. Click Create Library. The library detail screen displays.
  2. In the Attach column, select clusters to attach the library to.

  3. Optionally select the Attach automatically to all clusters checkbox and click Confirm.

Upload a Maven package or Spark package

  1. In the Source drop-down list, select Maven Coordinate.

    ../_images/maven-library.png
    • In the Coordinate field, enter the Maven coordinate of the library to install. Maven coordinates are in the form groupId:artifactId:version; for example, com.databricks:spark-avro_2.10:1.0.0.
    • If you don’t know the exact coordinate, enter the library name and click Search Spark Packages and Maven Central. A list of matching packages displays. To display details about a package, click its name. You can sort packages by name, organization, and rating. You can also filter the results by writing a query in the search bar. The results refresh automatically.
    1. Select Spark Packages or Maven Central in the drop-down list at the top right.

      ../_images/spark-packages.png
      ../_images/maven-central.png
    2. Optionally select the package version in the Releases column.

    3. Click + Select next to a package. The Coordinate field is filled in with the selected package and version.

  2. Optionally click Advanced Options to set up a custom Maven URL and to exclude certain dependencies.

    • Enter the Repository URL if your coordinate is in a different Maven repository; for example, https://oss.sonatype.org/content/repositories.

      Note

      Internal Maven repositories are not supported.

    • In the Excludes box, provide the groupId and the artifactId of the dependencies that you want to exclude; for example, log4j:log4j.

  3. Click Create Library. The library detail screen displays.

  4. In the Attach column, select clusters to attach the library to. The dependencies resolve and the library installs in a couple of minutes.

  5. Optionally select the Attach automatically to all clusters checkbox and click Confirm.

Upload a CRAN library

Note

You can use CRAN libraries on clusters running Databricks Runtime 3.2 and above.

  1. In the Source drop-down list, select R Library.

    ../_images/cran-library.png
  2. In the Install from drop-down list, CRAN-like Repository is the only option and is selected by default. This option covers CRAN and Bioconductor repositories.

  3. In the Repository field, enter the CRAN repository URL.

  4. In the Package field, enter the name of the package.

  5. Click Create Library. The library detail screen displays.

  6. In the Attach column, select clusters to attach the library to. When the library is attached to a cluster, the dependencies resolve and the library installs.

  7. Optionally select the Attach automatically to all clusters checkbox and click Confirm.

View library details

  1. Go to the folder containing the library.
  2. Click the library name.

The library details page shows the running clusters and whether the library is attached to the clusters. If the library is installed, the page contains a link to the package host. If the library is uploaded, the page displays a link to the uploaded package file.

Attach a library to a cluster

  1. Go to the folder containing the library.

  2. Click the library name.

  3. In the Attach column, select the cluster to attach the library to.

    ../_images/library-attach.png
  4. To configure the library to be attached to all clusters, optionally select the Attach automatically to all clusters checkbox and click Confirm.

Detach a library from a cluster

  1. Go to the folder containing the library.

  2. Click the library name.

  3. In the Attach column, deselect the cluster the library is attached to.

    ../_images/library-attach.png
  4. Restart the cluster.

View the libraries attached to a cluster

  1. Click the clusters icon Clusters Icon in the sidebar.
  2. Click the cluster name.
  3. Click the Libraries tab. For each library, the tab displays the library name and version, whether the library has been deleted, and the library location.

Move a library

  1. Go to the library location in the Workspace.
  2. Click the drop-down arrow Menu Dropdown to the right of the library name and select Move. A folder browser displays.
  3. Click the destination folder.
  4. Click Select.
  5. Click Confirm and Move.

Delete a library

You can move a library to the Trash folder and permanently delete the library. For details, see Delete an object.

You can also move a library to the Trash folder by clicking Move to Trash on the library details page.

Note

When you move a library to the Trash folder, the library is not marked for deletion and remains available on any cluster it is attached to. To make it unavailable, you must permanently delete the library from the Trash folder or empty the Trash folder.

Update a library

To update a library, delete the old version of the library and create a new version. The requirements for using the new version are the union of the requirements for deleting and uploading a library: you must restart the cluster and reattach to the cluster any notebooks that use the library.

Notebook-scoped Python libraries

Databricks Utilities support installing Python libraries from within a notebook. Libraries installed through this API are scoped to the notebook session and do not appear in the cluster Libraries tab. For more information, see Library utilities.
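
For example, here is a minimal sketch using the library utilities (dbutils.library), available in Databricks Runtime 5.1 and above; the package and version are illustrative:

    # Install a specific package version, scoped to this notebook session
    dbutils.library.installPyPI("pandas", version="0.24.2")
    # Restart the Python process so the installed version takes effect
    dbutils.library.restartPython()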

Install libraries using init scripts

Some libraries require custom configuration. To install these libraries, you can configure a cluster with a shell script (an init script) that runs at cluster creation time.

Custom R Package

This example creates an init script that installs a custom R package archive, custom_r_package_v0.1.gz, on a cluster.

  1. Upload the custom package archive to a DBFS location like dbfs:/FileStore/packages/ using the REST API or CLI.
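
    For example, using the Databricks CLI (this assumes the archive is in your current working directory):

    databricks fs cp custom_r_package_v0.1.gz dbfs:/FileStore/packages/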

  2. Create the base directory you want to store the init script in if it does not exist. This example uses dbfs:/databricks/<directory>.

    dbutils.fs.mkdirs("dbfs:/databricks/<directory>/")
    
  3. Create the init script.

    dbutils.fs.put("/databricks/<directory>/install-packages.sh","""
    #!/bin/bash
    # Install the uploaded archive from DBFS as an R source package
    R -e "install.packages('/dbfs/FileStore/packages/custom_r_package_v0.1.gz', repos=NULL, type='source')"
    """, True)
    
  4. Check that the init script exists.

    display(dbutils.fs.ls("dbfs:/databricks/<directory>/install-packages.sh"))
    
  5. Create a cluster configuring the script as a cluster-scoped init script.
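
    For example, if you create the cluster through the Clusters API, you can reference a cluster-scoped init script stored in DBFS with a fragment like the following in the cluster specification (the surrounding cluster settings are omitted):

    "init_scripts": [
      { "dbfs": { "destination": "dbfs:/databricks/<directory>/install-packages.sh" } }
    ]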

Package with Linux dependency

This example creates an init script that installs the R package topicmodels, which has a required Linux dependency, on a cluster. If the package you plan to use has one or more Linux dependencies, install them using sudo apt-get -y install.

Important

To install Python packages, use the Databricks pip binary located at /databricks/python/bin/pip to ensure that Python packages install into the Databricks Python virtual environment rather than the system Python environment. For example, /databricks/python/bin/pip install <packagename>.
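
For example, an init script that installs a Python package could contain a line like the following (the package name is illustrative):

    /databricks/python/bin/pip install beautifulsoup4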

  1. Create the base directory you want to store the init script in if it does not exist. This example uses dbfs:/databricks/<directory>.

    dbutils.fs.mkdirs("dbfs:/databricks/<directory>/")
    
  2. Create the init script.

    dbutils.fs.put("/databricks/<directory>/install-packages.sh","""
    #!/bin/bash
    # Install the Linux dependency first, then the R package from CRAN
    sudo apt-get -y install libgsl-dev
    R -e "install.packages('topicmodels', repos='http://cran.us.r-project.org')"
    """, True)
    
  3. Check that the init script exists.

    display(dbutils.fs.ls("dbfs:/databricks/<directory>/install-packages.sh"))
    
  4. Create a cluster configuring the script as a cluster-scoped init script.