This section provides instructions and examples of how to install, configure, and run some of the most popular third-party ML tools in Databricks. Databricks provides these examples on a best-effort basis. Because they are external libraries, they may change in ways that are not easy to predict. If you need additional support for third-party tools, consult the documentation, mailing lists, forums, or other support options provided by the library vendor or maintainer.
The H2O Flow UI provides a user-friendly, clickable interface to machine learning, along with visualizations for viewing your jobs. To enable H2O Flow on a Databricks cluster, you must set up an ssh tunnel to the Spark driver. After H2O starts, run the following from your local machine, replacing <hostname> with the hostname of your Spark driver and <private-key> with the path to your private key:
ssh ubuntu@<hostname> -p 2200 -i <private-key> -L 54321:localhost:54321
You should be able to access H2O Flow on localhost:54321.
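If the page does not load, it can help to confirm that the forwarded port is accepting connections before troubleshooting H2O itself. The following is a minimal sketch using only the Python standard library; the helper name is_port_open is hypothetical and is not part of H2O or Databricks. It is meant to be run on the machine where the tunnel was opened.

```python
import socket


def is_port_open(host="localhost", port=54321, timeout=2.0):
    """Return True if a TCP connection to host:port succeeds, else False."""
    try:
        # create_connection raises OSError (refused, timed out, ...) on failure.
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


if __name__ == "__main__":
    # 54321 is the local port forwarded by the ssh command above.
    print("H2O Flow port reachable:", is_port_open())
```

A result of False usually means that the tunnel is not running or that H2O has not started on the driver yet.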
Databricks Runtime for Machine Learning installs XGBoost, which conflicts with the XGBoost packaged in PySparkling. To use PySparkling on Databricks Runtime ML, you must first remove XGBoost using this command:
scikit-learn, a well-known Python machine learning library, is included in Databricks Runtime. See Databricks Runtime Release Notes for the scikit-learn library version included with your cluster’s runtime.
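To check the installed version from a notebook rather than from the release notes, something like the following sketch may be useful. It relies on importlib.metadata from the Python standard library (Python 3.8 or later); the helper name installed_version is an assumption made for illustration.

```python
from importlib import metadata


def installed_version(package_name):
    """Return the installed version string of a package, or None if absent."""
    try:
        return metadata.version(package_name)
    except metadata.PackageNotFoundError:
        return None


# "scikit-learn" is the distribution name; the import name is "sklearn".
print("scikit-learn:", installed_version("scikit-learn"))
```

When scikit-learn is known to be installed, importing sklearn and reading sklearn.__version__ gives the same information.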
The DataRobot modeling engine is a commercial product that supports massively parallel modeling applications, building and optimizing models of many different types, and evaluating and ranking their relative performance. This modeling engine exists in a variety of implementations, some cloud-based, accessed via the Internet, and others residing in customer-specific on-premises computing environments. Read more at DataRobot.
XGBoost is a popular machine learning library designed for training gradient-boosted decision trees; it can also train random forests. You can train XGBoost models on individual machines or in a distributed fashion. Read more in the XGBoost documentation.
There are two versions of XGBoost: a Python version, which is not distributed, and a Scala-based Spark version, which supports distributed training.
To install the non-distributed Python version, run:
/databricks/python/bin/pip install xgboost --pre
This Python version supports only single-node training workloads.
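As an illustration of a single-node workload, the sketch below trains a small binary classifier with the XGBoost Python package. The toy data, the parameter values, and the helper names make_toy_data and train_single_node are assumptions made for this example, not recommended settings. xgboost and numpy are imported inside the training function so that the rest of the snippet loads even where they are not installed.

```python
import random


def make_toy_data(n_rows=200, seed=0):
    """Generate a tiny synthetic binary-classification set (illustrative only)."""
    rng = random.Random(seed)
    features = [[rng.random(), rng.random()] for _ in range(n_rows)]
    # Label is 1 when the two features sum to more than 1.0.
    labels = [1 if x0 + x1 > 1.0 else 0 for x0, x1 in features]
    return features, labels


def train_single_node(features, labels, num_rounds=10):
    """Train an XGBoost booster on one machine and return it."""
    # Lazy imports: both packages must be installed on the driver.
    import numpy as np
    import xgboost as xgb

    dtrain = xgb.DMatrix(np.array(features), label=np.array(labels))
    # Example parameters only; tune them for real workloads.
    params = {"objective": "binary:logistic", "max_depth": 3, "eta": 0.3}
    return xgb.train(params, dtrain, num_boost_round=num_rounds)
```

Calling train_single_node(*make_toy_data()) returns a Booster object. All of the training runs in the Python process on the driver, which is why this version does not scale beyond one node.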
XGBoost is included in Databricks Runtime ML, a machine learning runtime that provides a ready-to-go environment for machine learning and data science. Instead of installing XGBoost using the instructions below, you can simply create a cluster using Databricks Runtime ML. See Overview of Databricks Runtime for Machine Learning.
You install XGBoost as a Databricks library, using xgboost-linux64 as the Spark Package name.