gperc Architecture¶
Perceiver Model¶
This file contains the code for the neural network of the Perceiver architecture. gperc.models.Perceiver sits at
the heart of this project. Use Perceiver for everyday use of the model; when you want
to train really large models with model parallelism, read here.
Distributed¶
gperc can handle distributed model-parallel training out of the box with
get_distributed_model(). During distributed training
and inference with torch.distributed.pipeline.sync.Pipe
(read tutorial) the model has to be an
nn.Sequential object.
Documentation¶
- gperc.models.build_position_encoding(position_encoding_type, config, num_index_items, emb_dim)[source]¶
Get the positional encoding matrix. If position_encoding_type == "trainable" then a random normal matrix is returned; if it is "sinusoid" then a fixed sinusoidal encoding matrix is returned.
- Parameters
position_encoding_type (str) – type of embedding, should be one of "trainable", "sinusoid"
config – gperc.PerceiverConfig object
num_index_items (int) – number of items in the embedding, e.g. vocab_size
emb_dim (int) – embedding dimension
- Returns
Item that can be used as a parameter in a torch.nn.Embedding
- Return type
torch.nn.Parameter
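For reference, a minimal usage sketch (assuming PerceiverConfig can be constructed with defaults; the sizes are arbitrary):

```python
import torch
from gperc import PerceiverConfig
from gperc.models import build_position_encoding

config = PerceiverConfig()  # assumption: default construction works
pos_emb = build_position_encoding(
    "trainable", config, num_index_items=512, emb_dim=64
)

# the returned torch.nn.Parameter can back an embedding table
emb = torch.nn.Embedding(512, 64)
emb.weight = pos_emb
```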
- class gperc.models.Block(*args: Any, **kwargs: Any)[source]¶
Bases: torch.nn.Module
Generic block with Attention and MLP layers
- Parameters
kv_dim (int) – dimension of the key-value embeddings
q_dim (int) – dimension of the query embeddings
num_heads (int) – number of heads in the multihead attention
ffw_dim (int) – dimension of the feed-forward layer
dropout (float, optional) – dropout rate
add_residual (bool, optional) – whether to add residual to the query
- forward(kv, q, attn_mask=None)[source]¶
Forward pass of the block that takes in a key-value tensor and a query tensor and performs the attention and MLP layers. Since it consumes kv and q separately, the blocks are responsible for cross-attention-like features. Returns a tuple of the output tensor and the attention matrix.
- Parameters
kv (torch.Tensor) – tensor to extract information from
q (torch.Tensor) – tensor for querying the information
- Returns
tuple of output Tensor and Attention matrix
- Return type
Tuple[torch.Tensor, torch.Tensor]
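A small sketch of driving a Block as a cross-attention layer, assuming the documented parameters above are accepted as keyword arguments and that batch-first [batch, seq, dim] tensors are expected:

```python
import torch
from gperc.models import Block

# cross-attention: 16 latent queries of dim 32 read from 100 inputs of dim 64
block = Block(kv_dim=64, q_dim=32, num_heads=4, ffw_dim=128,
              dropout=0.1, add_residual=True)

kv = torch.randn(2, 100, 64)  # [batch, input_len, kv_dim]
q = torch.randn(2, 16, 32)    # [batch, latent_len, q_dim]
out, attn = block(kv, q)      # (output tensor, attention matrix)
```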
- class gperc.models.Embeddings(*args: Any, **kwargs: Any)[source]¶
Bases: torch.nn.Module
- forward(input_array, attention_mask=None, output_array=None)[source]¶
Takes in either the input_array or a tuple with 3 items (input_array, attention_mask, output) and returns a tuple with 4 values (input_array, attention_mask, latent_array, output_array). If configured, input_array can contain tokens, which will be automatically embedded.
Note
When using GPipe you need to send in tensors, because it will try to send items as microbatches to each GPU. That requires all the inputs to be tensors, so here I have written a simple heuristic that sets attention_mask and output_array to None if the average of the values in those tensors is -69 and -420 respectively.
An image classification task does not require any attention_mask, so you can pass it as a tensor with values attention_mask = torch.tensor([-69. for _ in range(batch_size)]), and similarly you can send output_array as a tensor with values output_array = torch.tensor([-420. for _ in range(batch_size)]).
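For example, the sentinel tensors can be built directly (torch.full is equivalent to the list comprehensions above):

```python
import torch

batch_size = 8
# sentinel values the heuristic above interprets as None
attention_mask = torch.full((batch_size,), -69.0)
output_array = torch.full((batch_size,), -420.0)
```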
- class gperc.models.EncoderBlock(*args: Any, **kwargs: Any)[source]¶
Bases: torch.nn.Module
Encoder Block with positional embeddings
- class gperc.models.ProcessorBlock(*args: Any, **kwargs: Any)[source]¶
Bases: torch.nn.Module
Processor Block without positional embeddings
- class gperc.models.Perceiver(*args: Any, **kwargs: Any)[source]¶
Bases: torch.nn.Module
Unassuming Perceiver architecture that sits at the heart of this project. In practice this is a nice wrapper around the model returned by get_sequential_from_config that automatically handles different types of input in a simple fashion. This is a great approach when using a single GPU or performing Data Parallel training on multiple GPUs. When using this for Model Parallel training, you will need to build the nn.Sequential pipeline yourself; read the Distributed section above for more details.
- Parameters
config – gperc.PerceiverConfig object
- num_parameters(include_non_trainable: bool = True)[source]¶
Function that returns the number of parameters in the model
- Parameters
include_non_trainable (bool, optional) – If True, includes tensors that have requires_grad=False as well
- Returns
number of parameters in the model
- Return type
int
- forward(input_array, attention_mask=None, output_array=None, return_attentions=False)[source]¶
Performs the forward pass of the Perceiver.
- Parameters
input_array (torch.Tensor) – Input array to the Perceiver, read paper for reference
attention_mask (torch.Tensor, optional) – Mask for the decoder, attends at locations with value 1
output_array (torch.Tensor, optional) – Output array to the Perceiver, read paper for reference
return_attentions (bool, optional) – If true returns the attentions as a list
- Returns
The output of the Perceiver and the attention matrices
- Return type
Tuple[torch.Tensor, List[torch.Tensor]] if return_attentions is True, else torch.Tensor
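A minimal end-to-end sketch; the config construction and tensor shapes here are assumptions for illustration, see gperc.PerceiverConfig for the real options:

```python
import torch
from gperc import PerceiverConfig
from gperc.models import Perceiver

config = PerceiverConfig()  # assumption: default construction works
model = Perceiver(config)
print(model.num_parameters())       # all parameters
print(model.num_parameters(False))  # trainable parameters only

x = torch.randn(2, 256, 64)  # [batch, input_len, input_dim], shapes assumed
out = model(x)
out, attentions = model(x, return_attentions=True)
```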
- gperc.models.get_distributed_model(config)[source]¶
This function returns the model that is used for distributed training. This is not a wrapper around Perceiver but instead returns a Pipe object.
Note
When using GPipe you need to send in tensors, because it will try to send items as microbatches to each GPU. That requires all the inputs to be tensors, so here I have written a simple heuristic that sets attention_mask and output_array to None if the average of the values in those tensors is -69 and -420 respectively.
An image classification task does not require any attention_mask, so you can pass it as a tensor with values attention_mask = torch.tensor([-69. for _ in range(batch_size)]), and similarly you can send output_array as a tensor with values output_array = torch.tensor([-420. for _ in range(batch_size)]).
- Parameters
config (PerceiverConfig) – Configuration object for the Perceiver
- Returns
Model that can be used in place of Perceiver, but note that it can only take in torch.Tensor objects and not None.
- Return type
torch.distributed.pipeline.sync.Pipe
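A hedged sketch of driving the returned Pipe object; the RPC setup and the call signature below follow the generic torch.distributed.pipeline.sync.Pipe API and are assumptions, not gperc specifics:

```python
import os
import torch
import torch.distributed.rpc as rpc
from gperc import PerceiverConfig
from gperc.models import get_distributed_model

# Pipe needs the RPC framework initialised, even for a single process
os.environ.setdefault("MASTER_ADDR", "localhost")
os.environ.setdefault("MASTER_PORT", "29500")
rpc.init_rpc("worker", rank=0, world_size=1)

config = PerceiverConfig()  # assumption: default construction works
pipe_model = get_distributed_model(config)

batch_size = 4
x = torch.randn(batch_size, 256, 64)  # shapes are illustrative
# Pipe cannot take None, so pass the sentinel tensors described in the note
attention_mask = torch.full((batch_size,), -69.0)
output_array = torch.full((batch_size,), -420.0)

out = pipe_model(x, attention_mask, output_array).to_here()  # Pipe returns an RRef
rpc.shutdown()
```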