Distributed Deep Learning

Distributed deep learning (DL) involves training a deep neural network in parallel across multiple machines.

When possible, Databricks recommends training neural networks on a single machine; distributed training is more complex than single-machine training and slower due to communication overhead. However, you should consider distributed training if your model or your data are too large to fit in memory on a single machine.

A typical workflow for distributed deep learning has the following components:

Prepare the data space

Data loading and model checkpointing are crucial to distributed DL workloads. To run distributed DL, you need to prepare a shared storage space, exposed through a FUSE mount, for data loading, model checkpointing, and logging.

Databricks recommends using an init script to mount an S3 bucket as a file system with Goofys. Goofys is a high-performance, POSIX-ish Amazon S3 file system written in Go. For information about Goofys, see the Goofys GitHub website.

The example notebook below demonstrates how to mount an S3 bucket on Databricks.
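As a minimal sketch of what such an init script might look like: the bucket name my-training-bucket, the mount point /mnt/training-data, and the init script path are all hypothetical, and it assumes goofys is already installed at /usr/local/bin/goofys and that the cluster's instance profile grants access to the bucket.

```python
# Run once from a Databricks notebook to register an init script.
# Assumptions (hypothetical): goofys is installed at /usr/local/bin/goofys,
# the bucket "my-training-bucket" exists, and the cluster's instance
# profile has read/write access to it.
dbutils.fs.put(
    "dbfs:/databricks/init/mount-s3-goofys.sh",
    """#!/bin/bash
mkdir -p /mnt/training-data
# Mount the bucket as a POSIX-ish file system; -o allow_other lets
# non-root processes (such as your training job) read and write it.
/usr/local/bin/goofys -o allow_other my-training-bucket /mnt/training-data
""",
    True,  # overwrite if the script already exists
)
```

After the cluster restarts with this script attached, every node sees the bucket's contents under /mnt/training-data, so training data, checkpoints, and logs written there are shared across workers.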

Perform distributed training

For more information about distributed training, see the following guides, which explain how to run TensorFlow- and Keras-based distributed deep learning workflows on Databricks with these frameworks:

  • HorovodEstimator: Supports TensorFlow workflows
  • TensorFlowOnSpark: Supports multi-machine TensorFlow workloads
  • dist-keras: Supports multi-machine Keras workloads

Databricks recommends HorovodEstimator for TensorFlow workloads because of its ease of use in multi-GPU settings.
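To give a flavor of the underlying pattern, the sketch below uses the plain horovod.tensorflow.keras API rather than the HorovodEstimator API itself; the checkpoint path reuses the hypothetical /mnt/training-data FUSE mount from the init script above, and the model and data are stand-ins.

```python
# Sketch of a Horovod-based Keras training script (not the
# HorovodEstimator wrapper). Paths reuse the hypothetical FUSE mount
# from the init script above.
import numpy as np
import tensorflow as tf
import horovod.tensorflow.keras as hvd

hvd.init()  # one process per GPU; ranks are assigned by the launcher

# Pin each process to a single local GPU, if any are available.
gpus = tf.config.experimental.list_physical_devices("GPU")
if gpus:
    tf.config.experimental.set_visible_devices(gpus[hvd.local_rank()], "GPU")

model = tf.keras.Sequential([
    tf.keras.layers.Dense(64, activation="relu", input_shape=(784,)),
    tf.keras.layers.Dense(10, activation="softmax"),
])

# Scale the learning rate by the number of workers, then wrap the
# optimizer so gradients are averaged across workers on each step.
opt = hvd.DistributedOptimizer(tf.keras.optimizers.SGD(0.01 * hvd.size()))
model.compile(loss="sparse_categorical_crossentropy", optimizer=opt)

callbacks = [
    # Broadcast initial weights from rank 0 so all workers start identically.
    hvd.callbacks.BroadcastGlobalVariablesCallback(0),
]
if hvd.rank() == 0:
    # Checkpoint only from rank 0, to the shared FUSE mount.
    callbacks.append(
        tf.keras.callbacks.ModelCheckpoint(
            "/mnt/training-data/checkpoints/ckpt-{epoch}.h5"
        )
    )

# Random stand-in data; in practice, load training data from the shared mount.
x_train = np.random.rand(1024, 784).astype("float32")
y_train = np.random.randint(0, 10, size=1024)
model.fit(x_train, y_train, batch_size=128, epochs=2, callbacks=callbacks)
```

A script like this is typically launched with Horovod's runner, for example `horovodrun -np 4 python train.py`; HorovodEstimator wraps the same pattern in an MLlib-style estimator API so that training runs directly on a Databricks cluster.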

Distributed deep learning guides