Horovod

Horovod is an open-source distributed deep learning training framework, originally developed at Uber, for TensorFlow, Keras, PyTorch, and Apache MXNet. It is designed to make distributed training fast and easy to adopt.

Use it when

  • Scaling single-GPU scripts: When you have working single-GPU training code that needs to scale to multiple GPUs.
  • Multi-framework support needed: When you work with TensorFlow, Keras, PyTorch, or MXNet and need a consistent approach to distributed training.
  • High scaling efficiency required: When you need to achieve near-linear scaling across hundreds of GPUs.
  • Minimal code changes preferred: When you want to add distributed training with minimal modifications to existing code.
  • Established ecosystem integration: When using platforms like AWS, Azure, Databricks, or NVIDIA GPU Cloud that have built-in Horovod support.
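
The "minimal code changes" point above can be illustrated with a short PyTorch sketch. The Horovod calls used here (hvd.init, hvd.local_rank, hvd.size, hvd.rank, hvd.DistributedOptimizer, hvd.broadcast_parameters, hvd.broadcast_optimizer_state) belong to Horovod's PyTorch API. Everything else is an assumption made for illustration: the placeholder model, the synthetic batches, the base learning rate, and the helper name scale_learning_rate. Scaling the learning rate by the number of workers is a common convention, not a requirement.

```python
"""Sketch: adapting a single-GPU PyTorch loop to Horovod (illustrative only)."""


def scale_learning_rate(base_lr: float, world_size: int) -> float:
    """Scale the single-worker learning rate linearly with the worker count."""
    if world_size < 1:
        raise ValueError("world_size must be at least 1")
    return base_lr * world_size


def train(base_lr: float = 0.01, steps: int = 10) -> None:
    # Imported here so the helper above stays usable without Horovod installed.
    import torch
    import horovod.torch as hvd

    hvd.init()  # 1. start Horovod; typically one process per GPU
    use_cuda = torch.cuda.is_available()
    if use_cuda:
        torch.cuda.set_device(hvd.local_rank())  # 2. pin this process to its GPU
    device = torch.device("cuda" if use_cuda else "cpu")

    model = torch.nn.Linear(10, 1).to(device)  # placeholder model
    optimizer = torch.optim.SGD(
        model.parameters(), lr=scale_learning_rate(base_lr, hvd.size())
    )
    # 3. wrap the optimizer so gradients are averaged across workers
    optimizer = hvd.DistributedOptimizer(
        optimizer, named_parameters=model.named_parameters()
    )
    # 4. start every worker from rank 0's weights and optimizer state
    hvd.broadcast_parameters(model.state_dict(), root_rank=0)
    hvd.broadcast_optimizer_state(optimizer, root_rank=0)

    for step in range(steps):
        x = torch.randn(32, 10, device=device)  # synthetic batch
        y = torch.randn(32, 1, device=device)
        optimizer.zero_grad()
        loss = torch.nn.functional.mse_loss(model(x), y)
        loss.backward()
        optimizer.step()
        if hvd.rank() == 0:  # log from a single worker only
            print(f"step {step}: loss {loss.item():.4f}")
```

To try it, a script (here hypothetically named train.py) would call train() and be launched with something like: horovodrun -np 4 python train.py. The four numbered comments mark the only lines that differ from an ordinary single-GPU loop.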

Watch out

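  • MPI complexity: Requires careful MPI configuration. Open MPI 3.1.3 has a known issue that can cause hangs (downgrading to 3.1.2 or upgrading to 4.0.0 avoids it), and mpirun needs specific flags such as -bind-to none -map-by slot.
  • Network interface issues: Non-routed network interfaces such as docker0 can cause Open MPI to hang and must be excluded explicitly (for example with -mca btl_tcp_if_exclude lo,docker0).
  • Installation dependencies: Building Horovod requires CMake and a sufficiently recent C++ compiler (g++-5 or above for TensorFlow and PyTorch builds).
  • Version compatibility: Horovod must be reinstalled after upgrading or downgrading TensorFlow, Keras, or PyTorch, because it is compiled against the installed framework version.
  • Synchronization overhead: Workers synchronize gradients on every step, which adds communication cost that varies with cluster configuration.
  • TCP vs RDMA tradeoffs: Open MPI's default RDMA transport does not work well with multithreading, so MPI communication has to be forced over TCP in some cases (-mca pml ob1 -mca btl ^openib).
  • Environment-specific issues: In Databricks/Spark environments, a common problem is that TensorFlow objects cannot be found or pickled on the workers.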
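
The MPI-related items above come down to a handful of mpirun flags. The sketch below assembles such a command line in Python so the flags can be seen in one place. The function name build_mpirun_command, the host string, and the script name train.py are hypothetical; the flags themselves are the ones commonly recommended for Horovod with Open MPI, and whether each is needed depends on the cluster.

```python
"""Sketch: assembling an mpirun command line for a Horovod job."""
import shlex
from typing import List, Sequence


def build_mpirun_command(
    num_procs: int,
    hosts: str,
    script: Sequence[str],
    exclude_ifaces: Sequence[str] = ("lo", "docker0"),
) -> List[str]:
    """Return an argv list for mpirun; nothing is executed here."""
    excluded = ",".join(exclude_ifaces)
    command = [
        "mpirun",
        "-np", str(num_procs),
        "-H", hosts,
        # Let Horovod place processes instead of binding them to cores.
        "-bind-to", "none",
        "-map-by", "slot",
        # Force TCP for MPI traffic rather than Open MPI's RDMA transport.
        "-mca", "pml", "ob1",
        "-mca", "btl", "^openib",
        # Skip non-routed interfaces that can make Open MPI hang.
        "-mca", "btl_tcp_if_exclude", excluded,
        # Tell NCCL to avoid the same interfaces.
        "-x", f"NCCL_SOCKET_IFNAME=^{excluded}",
        # Forward the library and executable search paths to every worker.
        "-x", "LD_LIBRARY_PATH",
        "-x", "PATH",
    ]
    return command + list(script)


command = build_mpirun_command(4, "localhost:4", ["python", "train.py"])
print(shlex.join(command))  # shell-quoted form, ready to paste into a terminal
```

Building the command as a list keeps each flag and its value separate, which avoids shell-quoting mistakes around values such as ^openib if the list is later passed to subprocess.run.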

Available in stages

Runtime Engine

Installation

pip install horovod
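
Because Horovod is compiled against the frameworks present at install time, a framework upgrade usually means rebuilding it. The sketch below prepares the commands for such a rebuild without running them. The helper name horovod_reinstall_plan is hypothetical; the HOROVOD_WITH_* variables are Horovod build switches that make the build fail if the named framework cannot be found, and --no-cache-dir keeps pip from reusing a previously built wheel.

```python
"""Sketch: preparing commands that rebuild Horovod after a framework change."""
import os
import sys
from typing import Dict, List, Sequence, Tuple

# Build-time switches read by Horovod's installer.
FRAMEWORK_FLAGS = {
    "tensorflow": "HOROVOD_WITH_TENSORFLOW",
    "pytorch": "HOROVOD_WITH_PYTORCH",
    "mxnet": "HOROVOD_WITH_MXNET",
}


def horovod_reinstall_plan(
    frameworks: Sequence[str],
) -> Tuple[Dict[str, str], List[List[str]]]:
    """Return (environment, commands) for a clean Horovod rebuild."""
    env = dict(os.environ)
    for name in frameworks:
        if name not in FRAMEWORK_FLAGS:
            raise ValueError(f"unknown framework: {name!r}")
        env[FRAMEWORK_FLAGS[name]] = "1"  # require support for this framework
    pip = [sys.executable, "-m", "pip"]
    commands = [
        pip + ["uninstall", "-y", "horovod"],
        pip + ["install", "--no-cache-dir", "horovod"],
    ]
    return env, commands


env, commands = horovod_reinstall_plan(["pytorch"])
for cmd in commands:
    print(" ".join(cmd))
```

Passing each command to subprocess.run(cmd, env=env, check=True) would perform the rebuild; the sketch only prints the commands so they can be reviewed first.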
