Distributed deep learning involves training a deep neural network in parallel across multiple machines. A typical workflow has three components that run concurrently: model training, model evaluation (on a held-out validation set), and monitoring.
When possible, we recommend training neural networks on a single machine: distributed training code is more complex than single-machine code, and communication overhead between machines slows it down. However, you should consider distributed training if your model or your data are too large to fit in memory on a single machine.
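To make the communication overhead concrete, the following is a minimal sketch of synchronous data-parallel training, simulated in a single process with the Python standard library only. It is an illustration of the general idea, not the API of any framework listed on this page. The "workers" are plain lists, and the names `local_gradient`, `all_reduce_mean`, and `train` are hypothetical helpers introduced here. The sketch assumes equally sized data shards, so that averaging per-worker gradients reproduces the full-batch gradient.

```python
# Simulated synchronous data-parallel SGD for a one-parameter model y = w * x.
# Everything here is illustrative: the "workers" are plain Python lists in one
# process, and all_reduce_mean stands in for the network communication step.

def local_gradient(w, shard):
    """Gradient of mean squared error for y = w * x over one worker's shard."""
    return sum(2.0 * (w * x - y) * x for x, y in shard) / len(shard)

def all_reduce_mean(values):
    """Stand-in for an all-reduce: every worker would receive this mean."""
    return sum(values) / len(values)

def train(shards, w=0.0, lr=0.01, steps=200):
    """Run gradient descent where each step averages gradients across shards."""
    for _ in range(steps):
        # On a real cluster, each worker computes its gradient in parallel.
        grads = [local_gradient(w, shard) for shard in shards]
        # One communication round per step: this is the overhead that
        # single-machine training does not pay.
        w -= lr * all_reduce_mean(grads)
    return w

# Data generated from y = 3x, split into two equal shards (two "workers").
data = [(float(x), 3.0 * x) for x in range(1, 9)]
shards = [data[:4], data[4:]]

w_distributed = train(shards)   # two simulated workers
w_single = train([data])        # single-machine baseline
```

Both runs should converge to approximately the same weight, because the averaged shard gradients equal the full-batch gradient. The difference in a real deployment is that every step waits on a network exchange, which is why single-machine training is preferable when the model and data fit.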
For more information about distributed training, see the guides below, which explain how to run TensorFlow and Keras-backed distributed deep learning workflows on Databricks using the following frameworks:
- Horovod: Supports single-machine and multi-machine TensorFlow and Keras workloads
- TensorFlowOnSpark: Supports multi-machine TensorFlow workloads
- dist-keras: Supports multi-machine Keras workloads
We recommend Horovod for both TensorFlow and Keras-backed workloads because it is easy to use in both single-machine multi-GPU and multi-machine multi-GPU settings.
Distributed deep learning guides: