Horovod is an open-source distributed deep learning training framework, originally developed at Uber, for TensorFlow, Keras, PyTorch, and Apache MXNet. Its goal is to make distributed training fast and easy to use.
Use it when
•Scaling single-GPU scripts: When you have working single-GPU training code that needs to scale to multiple GPUs.
•Multi-framework support needed: When you work with TensorFlow, Keras, PyTorch, or MXNet and need a consistent distributed training approach across them.
•High scaling efficiency required: When you need to achieve near-linear scaling across hundreds of GPUs.
•Minimal code changes preferred: When you want to add distributed training with minimal modifications to existing code.
•Established ecosystem integration: When using platforms like AWS, Azure, Databricks, or NVIDIA GPU Cloud that have built-in Horovod support.
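"Near-linear scaling" in the list above can be checked with a simple ratio: measured multi-GPU throughput divided by the ideal throughput (single-GPU throughput times the number of GPUs). The sketch below is only an illustration; the function name and the throughput figures are hypothetical and not taken from any Horovod benchmark.

```python
def scaling_efficiency(single_gpu_throughput, multi_gpu_throughput, num_gpus):
    """Return the fraction of ideal (linear) speedup achieved on num_gpus GPUs.

    Throughputs can be in any consistent unit, e.g. images per second.
    """
    if num_gpus < 1:
        raise ValueError("num_gpus must be at least 1")
    if single_gpu_throughput <= 0:
        raise ValueError("single_gpu_throughput must be positive")
    # Ideal case: every added GPU contributes a full single-GPU throughput.
    ideal_throughput = single_gpu_throughput * num_gpus
    return multi_gpu_throughput / ideal_throughput


# Hypothetical measurements: 200 images/s on 1 GPU, 23040 images/s on 128 GPUs.
efficiency = scaling_efficiency(200.0, 23040.0, 128)
print(f"scaling efficiency: {efficiency:.0%}")
```

A value close to 1.0 means scaling is close to linear; a value that drops as GPUs are added usually points at communication or input-pipeline bottlenecks.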
Watch out
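⚠MPI complexity: Requires careful MPI configuration. Open MPI 3.1.3 has a known issue that can cause hangs (the Horovod documentation recommends 3.1.2 or 4.0.0 instead), and mpirun needs specific flags such as -bind-to none and -map-by slot.
⚠Network interface issues: Non-routed interfaces (like docker0) can cause Open MPI to hang. Exclude them explicitly with -mca btl_tcp_if_exclude lo,docker0, and with NCCL_SOCKET_IFNAME=^lo,docker0 when NCCL is used.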
⚠Installation dependencies: Requires CMake and specific compiler versions (g++-5 or above for TensorFlow/PyTorch).
⚠Version compatibility: Horovod must be reinstalled after upgrading or downgrading TensorFlow, Keras, or PyTorch, because it is built against the installed framework version.
⚠Synchronization overhead: Distributed training adds gradient synchronization cost at every step; how large it is depends on the cluster configuration.
⚠TCP vs RDMA tradeoffs: Open MPI's default RDMA transport doesn't work well with multithreading, so in some cases MPI communication has to be forced onto TCP (-mca pml ob1 -mca btl ^openib).
⚠Environment-specific issues: In Databricks/Spark environments, pickling TensorFlow objects is a common source of errors.
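The MPI-related warnings above mostly come down to passing the right flags to mpirun. As a minimal sketch, the helper below assembles a launch command using the flags shown in Horovod's Open MPI examples (-bind-to none, -map-by slot, forcing TCP with pml ob1 and btl ^openib, and excluding lo and docker0). The helper name build_mpirun_command, the host string, and the script name train.py are hypothetical, and the right interface list depends on your cluster.

```python
import shlex


def build_mpirun_command(num_procs, hosts, script, exclude_ifaces=("lo", "docker0")):
    """Assemble an mpirun argument list for launching a Horovod training script.

    hosts uses Open MPI's "host:slots" form, e.g. "server1:4,server2:4".
    """
    excluded = ",".join(exclude_ifaces)
    return [
        "mpirun",
        "-np", str(num_procs),
        "-H", hosts,
        # Leave process placement to the framework rather than pinning to cores.
        "-bind-to", "none",
        "-map-by", "slot",
        # Forward the environment variables the workers need.
        "-x", "LD_LIBRARY_PATH",
        "-x", "PATH",
        # Keep NCCL off non-routed interfaces.
        "-x", f"NCCL_SOCKET_IFNAME=^{excluded}",
        # Force MPI communication over TCP instead of RDMA.
        "-mca", "pml", "ob1",
        "-mca", "btl", "^openib",
        # Keep Open MPI's TCP transport off non-routed interfaces.
        "-mca", "btl_tcp_if_exclude", excluded,
        "python", script,
    ]


# Hypothetical single-host launch with 4 processes.
command = build_mpirun_command(4, "localhost:4", "train.py")
print(shlex.join(command))
```

Building the command as a list keeps each flag and its value explicit, and shlex.join quotes the ^-prefixed values so the printed line can be pasted into a shell.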