2. Distributed Computing
========================

This `code `_ has been tested on a `NimbleBox.ai `_ instance with two Nvidia T4 cards. In this section we will see how you can train 1Bn+ parameter models directly from ``gperc`` using ``GPipe``, which enables model parallelism while avoiding the bottleneck that comes with large batches by splitting the compute for each batch into micro-batches, so devices have higher utilisation.

.. image:: assets/gpipe.png
   :alt: GPipe pipeline parallelism: model partitions spread across devices, each processing micro-batches

Some reference papers:

1. `torch GPipe `_: this paper extends the original `Gpipe `_ to torch.
2. The code from the above paper was added into ``pytorch`` itself, and `this blog `_ has a tutorial for it.

Using ``GPipe``
---------------

I won't go into the details of why ``GPipe`` was built or how ``pytorch`` handles it internally, but rather go over the coding decisions. A minimal end-to-end sketch of the full setup follows after this list.

1. All the modules take dedicated keyword arguments in the ``forward()`` method, because sending in a tuple that can then be split requires serious modification of the ``pytorch`` source code, which can be a tiresome and tedious process. Example:

   .. code-block:: python

      # forward method for the Embeddings module
      def forward(self, input_array, attention_mask=None, output_array=None):
          ...

2. Now ``attention_mask`` and ``output_array`` can be ``None``, but when using ``Pipe``, ``pytorch`` consumes only tensors, so I have added this weird quirk: you can send in tensors with the same first dimension, filled with the value ``-69`` to ignore ``attention_mask`` and ``-420`` to ignore ``output_array``. Yes, very childish, I know. So in the script you will see code like this:

   .. code-block:: python

      # output_array needs to be None, so we pass a tensor filled with -420
      model_input = (
          inputs.cuda(0),
          attn_mask.cuda(0),
          torch.tensor([-420. for _ in range(inputs.shape[0])]).cuda(0)
      )

3. You will need to experiment with the values of ``chunks`` (the total number of micro-batches to break the input into) and ``partition_len`` (the number of modules placed on each chip). The returned attentions list is also chunked, i.e.:

   .. code-block:: python

      chunks = 16; batch_size = 32
      output, attentions = model(*model_input)

      # number of attention entries == number of chunks
      len(attentions)  # 16

      # batch size of each attention chunk is batch_size / chunks
      attentions[0][0].shape[0] == 2
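To tie these pieces together, here is a minimal sketch of the ``Pipe`` setup on two GPUs. The ``Block`` module, its dimensions, and the block count below are hypothetical stand-ins for the actual ``gperc`` modules; ``torch.distributed.pipeline.sync.Pipe`` and the RPC initialisation it needs are the standard ``pytorch`` calls covered in the blog linked above.

.. code-block:: python

   import torch
   import torch.nn as nn
   from torch.distributed import rpc
   from torch.distributed.pipeline.sync import Pipe

   # Pipe requires the RPC framework to be initialised,
   # even in a single-process setup
   rpc.init_rpc("worker", rank=0, world_size=1)

   # hypothetical stand-in for the real gperc modules: each block
   # consumes and returns plain tensors, as required by Pipe
   class Block(nn.Module):
       def __init__(self, dim):
           super().__init__()
           self.ff = nn.Linear(dim, dim)

       def forward(self, x):
           return torch.relu(self.ff(x))

   dim, n_blocks, partition_len = 128, 8, 4
   blocks = [Block(dim) for _ in range(n_blocks)]

   # place partition_len modules on each chip: with two T4s,
   # the first 4 blocks go to cuda:0 and the last 4 to cuda:1
   partitions = [b.cuda(i // partition_len) for i, b in enumerate(blocks)]
   model = nn.Sequential(*partitions)

   # chunks = number of micro-batches each input batch is split into
   model = Pipe(model, chunks=16)

   x = torch.randn(32, dim).cuda(0)   # input lives on the first device
   out = model(x)                     # Pipe returns an RRef
   print(out.local_value().shape)     # torch.Size([32, 128]), on cuda:1

The trade-off when tuning ``chunks`` is the same one shown in point 3 above: more chunks means better overlap between the devices, but smaller per-device micro-batches and a longer list of chunked attention outputs.

More 🍰 on the way
------------------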