This post is the first in a three-part series on distributed training of neural networks.
In Part 1, we’ll look at how the training of deep learning models can be significantly accelerated with distributed computing on GPUs, as well as discuss some of the challenges and examine current research on the topic. We’ll also consider when distributed training of neural networks is - and isn’t - appropriate for particular use cases.
In Part 2, we’ll take a hands-on look at Deeplearning4j’s implementation of network training on Apache Spark, and provide an end-to-end example of how to perform training in practice.
Finally, in Part 3 we’ll peek under the hood of Deeplearning4j’s Spark implementation, and discuss some of the performance and design challenges involved with maximizing training performance with Apache Spark. We’ll also look at how Spark interacts with the native high-performance computing libraries and off-heap memory management that Deeplearning4j utilizes.
Introduction
Modern neural network architectures trained on large data sets can obtain impressive performance across a wide variety of domains, from speech and image recognition to natural language processing, and to industry-focused applications such as fraud detection and recommendation systems. But training these neural network models is computationally demanding. Although significant advances have been made in recent years in GPU hardware, network architectures and training methods, the fact remains that network training can take an impractically long time on a single machine. Fortunately, we are not restricted to a single machine: a significant amount of work and research has been conducted on enabling the efficient distributed training of neural networks.
We’ll start by considering two approaches to parallelizing/distributing our training computation.
In model parallelism, different machines in the distributed system are responsible for the computations in different parts of a single network - for example, each layer in the neural network may be assigned to a different machine.
In data parallelism, different machines have a complete copy of the model; each machine simply gets a different portion of the data, and results from each are somehow combined.
Of course, these approaches are not mutually exclusive. Consider a cluster of multi-GPU systems. We could use model parallelism (model split across GPUs) for each machine, and data parallelism between machines.
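As a rough illustration of the distinction, here is a toy Java sketch (not Deeplearning4j code): for model parallelism, the layers of a single network are assigned across devices; for data parallelism, each worker keeps a full model copy and only the data is partitioned. The layer indices and example arrays are hypothetical placeholders.

```java
import java.util.ArrayList;
import java.util.List;

public class ParallelismSketch {

    // Model parallelism: one network, its layers assigned round-robin to devices.
    static List<List<Integer>> assignLayersToDevices(int numLayers, int numDevices) {
        List<List<Integer>> perDevice = new ArrayList<>();
        for (int d = 0; d < numDevices; d++) perDevice.add(new ArrayList<>());
        for (int layer = 0; layer < numLayers; layer++) {
            perDevice.get(layer % numDevices).add(layer); // device (layer % numDevices) computes this layer
        }
        return perDevice;
    }

    // Data parallelism: every worker holds the whole model; only the data is partitioned.
    static List<List<double[]>> shardData(List<double[]> examples, int numWorkers) {
        List<List<double[]>> shards = new ArrayList<>();
        for (int w = 0; w < numWorkers; w++) shards.add(new ArrayList<>());
        for (int i = 0; i < examples.size(); i++) {
            shards.get(i % numWorkers).add(examples.get(i)); // each worker sees a different shard
        }
        return shards;
    }
}
```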
While model parallelism can work well in practice, data parallelism is arguably the preferred approach for distributed systems and has been the focus of more research. For one thing, implementation, fault tolerance and good cluster utilization are easier to achieve with data parallelism than with model parallelism. Model parallelism in the context of distributed systems is interesting and does have some benefits (such as scalability to large models), but here we will be focusing on data parallelism.
Data Parallelism
Data parallel approaches to distributed training keep a copy of the entire model on each worker machine, processing different subsets of the training data set on each. Data parallel training approaches all require some method of combining results and synchronizing the model parameters between the workers. A number of different approaches have been discussed in the literature, and the primary differences between them are:
• Synchronous vs. asynchronous methods
• Centralized vs. distributed synchronization
Deeplearning4j’s current Spark implementation is a synchronous parameter averaging approach, in which the Spark driver and reduction operations take the place of a parameter server.
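To make the “driver plus reduction operation as parameter server” idea concrete, here is a minimal sketch in plain Spark Java, not Deeplearning4j’s actual API: each element of the RDD stands for one worker’s parameter vector, and a reduce produces their element-wise average on the driver.

```java
import java.util.Arrays;
import java.util.List;
import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;

public class SparkAveragingSketch {
    public static void main(String[] args) {
        SparkConf conf = new SparkConf().setAppName("param-averaging-sketch").setMaster("local[*]");
        JavaSparkContext sc = new JavaSparkContext(conf);

        // Pretend these are the parameter vectors produced by 3 workers after local training.
        List<double[]> workerParams = Arrays.asList(
                new double[]{1.0, 2.0}, new double[]{3.0, 4.0}, new double[]{5.0, 6.0});
        JavaRDD<double[]> paramsRdd = sc.parallelize(workerParams);

        long n = paramsRdd.count();
        // Element-wise sum computed on the cluster; division by the worker count happens on the driver.
        double[] averaged = paramsRdd.reduce((a, b) -> {
            double[] out = new double[a.length];
            for (int i = 0; i < a.length; i++) out[i] = a[i] + b[i];
            return out;
        });
        for (int i = 0; i < averaged.length; i++) averaged[i] /= n;

        System.out.println("Averaged parameters: " + Arrays.toString(averaged)); // [3.0, 4.0]
        sc.close();
    }
}
```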
Parameter Averaging
Parameter averaging is the conceptually simplest approach to data parallelism. With parameter averaging, training proceeds as follows:
1. Initialize the network parameters randomly based on the model configuration
2. Distribute a copy of the current parameters to each worker
3. Train each worker on a subset of the data
4. Set the global parameters to the average of the parameters from each worker
5. While there is more data to process, go to step 2
Steps 2 through 4 are demonstrated in the image below. In this diagram, W represents the parameters (weights, biases) in the neural network. Subscripts index the version of the parameters over time and, where necessary, the individual worker machines.
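To make the procedure concrete, below is a schematic single-process sketch of steps 1 through 5, using plain SGD (no updater) and a toy per-example loss. The gradient() function and the data layout are hypothetical placeholders for illustration; this is not Deeplearning4j’s implementation.

```java
import java.util.Arrays;
import java.util.Random;

public class ParameterAveragingLoop {

    // Toy per-example gradient: pulls the parameters towards the example vector.
    static double[] gradient(double[] params, double[] example) {
        double[] g = new double[params.length];
        for (int i = 0; i < params.length; i++) g[i] = params[i] - example[i];
        return g;
    }

    public static void main(String[] args) {
        int numWorkers = 4, examplesPerWorker = 8, paramDim = 3;
        double learningRate = 0.1;
        Random rng = new Random(12345);

        // Step 1: initialise the parameters (randomly here).
        double[] globalParams = new double[paramDim];
        for (int i = 0; i < paramDim; i++) globalParams[i] = rng.nextGaussian();

        // Toy data shards, one per worker.
        double[][][] shards = new double[numWorkers][examplesPerWorker][paramDim];
        for (double[][] shard : shards)
            for (double[] example : shard)
                for (int i = 0; i < paramDim; i++) example[i] = rng.nextGaussian();

        for (int pass = 0; pass < 10; pass++) {                 // step 5: repeat while data remains
            double[] average = new double[paramDim];
            for (int w = 0; w < numWorkers; w++) {
                double[] local = globalParams.clone();          // step 2: copy current parameters
                for (double[] example : shards[w]) {            // step 3: local SGD on this shard
                    double[] g = gradient(local, example);
                    for (int i = 0; i < paramDim; i++) local[i] -= learningRate * g[i];
                }
                for (int i = 0; i < paramDim; i++) average[i] += local[i] / numWorkers;
            }
            globalParams = average;                             // step 4: average worker parameters
        }
        System.out.println("Final parameters: " + Arrays.toString(globalParams));
    }
}
```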
In fact, it’s straightforward to prove that a restricted version of parameter averaging is mathematically identical to training on a single machine; these restrictions are: parameter averaging after each minibatch, no updater (i.e., no momentum etc., just multiplication of the gradient by the learning rate), and an identical number of examples processed by each worker. For the mathematically inclined, the proof is as follows.
Consider the case of a cluster with n workers, where each worker processes m examples, for a total of nm examples processed between averagings. If we process all nm examples on a single machine with learning rate α, our weight update rule is given by:
$$W_{i+1} = W_i - \frac{\alpha}{nm} \sum_{j=1}^{nm} \frac{\partial L_j}{\partial W_i}$$
Now, if we instead perform learning on m examples in each of the n workers (where worker 1 gets examples 1, ..., m, worker 2 gets examples m + 1, ..., 2m and so on), we have:
$$W_{i+1} = \frac{1}{n} \sum_{w=1}^{n} W_{i+1,w} = \frac{1}{n} \sum_{w=1}^{n} \left[ W_i - \frac{\alpha}{m} \sum_{j=(w-1)m+1}^{wm} \frac{\partial L_j}{\partial W_i} \right]$$