Exploring Distributed Data Parallel (DDP) Training on Consumer GPUs

Keith Cagney
September 13, 2022

Our team is currently developing utility applications for Salad Container Engine (SCE), a fully managed container orchestration platform designed to leverage the heterogeneous hardware that makes up our uniquely decentralized network.

One of the most exciting use cases we're exploring is distributed data parallelism (DDP), a deep-learning paradigm in which networked GPUs train models by performing discrete operations on distributed subsets of larger data sets.

What Is Distributed Data Parallel (DDP) Training?

Unlike single-process data parallelism, where training operations occur sequentially on a single accelerator, distributed data parallelism allows developers to instantiate and coordinate processes across a cohort of networked machines. Each of those devices is then responsible for performing operations on some volume of data as a discrete worker process.

The main advantage of a DDP training configuration is reduction in the length of a typical epoch. By distributing discrete segments of inputted data to multiple hardware accelerators (such as GPUs or TPUs), developers can effectively increase the batch size for a given operational phase (which in fact encompass numerous, identical processes hosted on different workers).

DDP training modules can also be configured for use with model parallelized systems in which distributed workers perform different operations on the same data subsets, or to execute more complex computational graphs that incorporate aspects of both approaches. Though the latter design pattern is less common, distributing both data and operations across distinct worker layers can be ideal for advanced applications requiring sophisticated parallel computation.

Constraints of DDP Training

No matter whether the training modules are constructed in PyTorch, TensorFlow, or some other machine learning framework, DDP configurations generally improve the efficiency of iterations, unburden process overhead, and reduce epoch length—but both data parallelism and model parallelism introduce unique performance constraints that cannot be overcome paradigmatically.

In data-parallel applications, learning must be slowed to a uniform rate in order to ensure smooth reconciliation between all workers on the network. By contrast, computational overhead can be significantly greater in model parallelism and complex graphs because they require persistent communication between nodes.

We're investigating ways to leverage Salad's intelligent workload-distribution plane to optimize DDP workflows executed in Salad Container Engine.

SaladCloud Architecture

For four years, our team has perfected a platform to turn latent consumer compute resources into available cloud infrastructure. As the world's largest decentralized computing cluster, the Salad network has achieved twice the peak processing rate of the recently unveiled Polaris supercomputer. We owe this success to our voluntary computesharing community1.

Salad's control plane was designed to access, index, and distribute compute tasks2 to target node cohorts. Salad Container Engine works in tandem with these systems to instantiate container replicas, secure and manage simultaneous processes, and dynamically access machine-level GPU resources when required by a given workload.

With our current systems (and an ever-growing corpus heterogeneous hardware), operations can be distributed to optimal hardware workers based on the anticipated degree of computational expense, while Salad's node selection matrix can minimize latency and timeouts between worker processes by isolating geographically adjacent hosts. Salad Container Engine will soon be able to distribute tensors to the most trustworthy nodes and maintain healthy training populations at runtime.

Our team is now investigating how to synchronize buffers and data gradients to facilitate DDP modeling across a worldwide cloud network. The next step is designing deep-learning containers to efficiently manage backward computations between Salad nodes and our customers' APIs. In time, we hope to teach Salad to asynchronously reconcile fast and slow learning processes for a variety of multi-GPU applications using PyTorch and TensorFlow.

We look forward to sharing more with our developer community in the months ahead. To stay up to date with the latest SaladCloud features and product updates, please consider subscribing to our newsletter.

  1. The Salad desktop application rewards voluntary participants for their contributions to a set of baseline processing workloads. Internal trust rating systems further reward "virtuous" nodes (i.e., those with performant hardware, reliable connections, longer availability durations, or ideally all three) by elevating their priority ranking in queues for more profitable workloads.
  2. Our baseline workloads include cryptomining libraries and other open-source executables that rely on trustless blockchain protocols or cryptographic equations. By providing built-in participation incentives, these seminal workloads allowed us to grow and persist our network while designing our state-of-the-art orchestration systems.