gperc Architecture

Perceiver Model

This file contains the code for the neural network of the Perceiver architecture. gperc.models.Perceiver sits at the heart of this project. Use Perceiver for everyday use of the model; when you want to train really large models with model parallelism, read the Distributed section below.

Distributed

gperc can handle distributed model-parallel training out of the box with get_distributed_model(). During distributed training and inference with torch.distributed.pipeline.sync.Pipe (read the tutorial), the input has to be an nn.Sequential object.

Documentation

gperc.models.build_position_encoding(position_encoding_type, config, num_index_items, emb_dim)[source]

Get the positional encoding matrix. If position_encoding_type == "trainable" then a random normal matrix is returned; if it is "sinusoid" then a fixed sinusoidal matrix is returned.

Parameters
  • position_encoding_type (str) – type of embedding, should be one of “trainable”, “sinusoid”

  • config (gperc.PerceiverConfig) – configuration object for the Perceiver

  • num_index_items (int) – number of items in the embedding, eg. vocab_size

  • emb_dim (int) – embedding dimension

Returns

Item that can be used as a parameter in a torch.nn.Embedding

Return type

torch.nn.Parameter
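
For intuition, here is a minimal sketch of what such a matrix could look like, assuming the standard trainable (random normal) and sinusoidal constructions; the actual gperc initialisation may differ:

    import math
    import torch

    def position_encoding_sketch(position_encoding_type, num_index_items, emb_dim):
        # Hypothetical re-creation of build_position_encoding; assumes even emb_dim.
        if position_encoding_type == "trainable":
            # random normal matrix, learned during training
            return torch.nn.Parameter(torch.randn(num_index_items, emb_dim))
        elif position_encoding_type == "sinusoid":
            # fixed sinusoidal matrix as in "Attention Is All You Need"
            pos = torch.arange(num_index_items, dtype=torch.float32).unsqueeze(1)
            div = torch.exp(
                torch.arange(0, emb_dim, 2, dtype=torch.float32) * (-math.log(10000.0) / emb_dim)
            )
            pe = torch.zeros(num_index_items, emb_dim)
            pe[:, 0::2] = torch.sin(pos * div)
            pe[:, 1::2] = torch.cos(pos * div)
            return torch.nn.Parameter(pe, requires_grad=False)
        raise ValueError(f"unknown position_encoding_type: {position_encoding_type}")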

class gperc.models.Block(*args: Any, **kwargs: Any)[source]

Bases: torch.nn.Module

Generic block with Attention and MLP layers

Parameters
  • kv_dim (int) – dimension of the key-value embeddings

  • q_dim (int) – dimension of the query embeddings

  • num_heads (int) – number of heads in the multihead attention

  • ffw_dim (int) – dimension of the feed-forward layer

  • dropout (float, optional) – dropout rate

  • add_residual (bool, optional) – whether to add residual to the query

forward(kv, q, attn_mask=None)[source]

Forward pass of the block that takes in a key-value tensor and a query tensor and performs the attention and MLP layers. Since it consumes kv and q separately, the blocks are responsible for cross-attention-like features. Returns a tuple of the output tensor and the attention matrix.

Parameters
  • kv (torch.Tensor) – tensor to extract information from

  • q (torch.Tensor) – tensor for querying the information

  • attn_mask (torch.Tensor, optional) – optional mask applied inside the attention

Returns

tuple of output Tensor and Attention matrix

Return type

Tuple[torch.Tensor, torch.Tensor]
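
As a rough mental model (not the actual gperc code), a cross-attention block of this shape can be sketched with torch's built-in multihead attention; normalisation layers are omitted for brevity:

    import torch
    import torch.nn as nn

    class CrossAttentionBlockSketch(nn.Module):
        # Hypothetical sketch of Block: cross-attention from q over kv, then an MLP.
        def __init__(self, kv_dim, q_dim, num_heads, ffw_dim, dropout=0.0, add_residual=True):
            super().__init__()
            self.attn = nn.MultiheadAttention(
                embed_dim=q_dim, num_heads=num_heads,
                kdim=kv_dim, vdim=kv_dim,
                dropout=dropout, batch_first=True,
            )
            self.mlp = nn.Sequential(
                nn.Linear(q_dim, ffw_dim), nn.GELU(), nn.Linear(ffw_dim, q_dim),
            )
            self.add_residual = add_residual

        def forward(self, kv, q, attn_mask=None):
            out, attn = self.attn(q, kv, kv, attn_mask=attn_mask)
            if self.add_residual:
                out = out + q          # residual on the query stream
            out = out + self.mlp(out)  # MLP with its own residual
            return out, attn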

class gperc.models.Embeddings(*args: Any, **kwargs: Any)[source]

Bases: torch.nn.Module

forward(input_array, attention_mask=None, output_array=None)[source]

Takes in either the input_array alone or a tuple with 3 items (input_array, attention_mask, output_array) and returns a tuple with 4 values (input_array, attention_mask, latent_array, output_array). If configured, input_array can contain tokens and will be embedded automatically.

Note

When using GPipe you need to send in tensors, because it will try to send items as micro-batches to each GPU. That requires all the inputs to be tensors, so a basic heuristic is used here: attention_mask and output_array are set to None if the average of the values in those tensors is -69 and -420 respectively.

An image classification task does not require any attention_mask, so you can pass it as a tensor with values attention_mask = torch.tensor([-69. for _ in range(batch_size)]); similarly you can send output_array as a tensor with values output_array = torch.tensor([-420. for _ in range(batch_size)]).
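
The note above in code form; model here stands for any module built from this file:

    import torch

    batch_size = 8
    # Sentinel tensors whose averages (-69 and -420) tell the forward pass to
    # treat attention_mask and output_array as None, since GPipe can only move
    # tensors between pipeline stages.
    attention_mask = torch.tensor([-69.0 for _ in range(batch_size)])
    output_array = torch.tensor([-420.0 for _ in range(batch_size)])

    # hypothetical call; `model` is a module built from this file
    # logits = model(input_array, attention_mask, output_array)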

class gperc.models.EncoderBlock(*args: Any, **kwargs: Any)[source]

Bases: torch.nn.Module

Encoder Block with positional embeddings

forward(input_array, attention_mask, latent_array, output_array)[source]

Takes in a tuple with 4 values (input_array, attention_mask, latent_array, output_array) and returns a tuple with 3 items (latent_array, output_array, attentions)

class gperc.models.ProcessorBlock(*args: Any, **kwargs: Any)[source]

Bases: torch.nn.Module

Processor Block without positional embeddings

forward(latent_array, output_array, attentions)[source]

Takes in a tuple with 3 values (latent_array, output_array, attentions) and returns a tuple with 3 items (latent_array, output_array, attentions)

class gperc.models.DecoderBlock(*args: Any, **kwargs: Any)[source]

Bases: torch.nn.Module

forward(input_array, latent_array, output_array, attentions)[source]

Takes in a tuple with 4 values (input_array, latent_array, output_array, attentions), matching the signature above, and returns a tuple with 2 items (output_logits, attentions)
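
These blocks are designed to chain tuple-to-tuple: each stage's return value becomes the positional arguments of the next, which is what get_sequential_from_config (mentioned under Perceiver below) and torch's Pipe rely on. A sketch, assuming each block takes the config in its constructor and the processor depth lives on the config under an attribute such as num_layers:

    import torch.nn as nn
    from gperc.models import Embeddings, EncoderBlock, ProcessorBlock, DecoderBlock

    def sequential_sketch(config):
        # Each stage emits the tuple the next stage consumes, so the whole
        # model can live in one nn.Sequential (as Pipe requires).
        stages = [Embeddings(config), EncoderBlock(config)]
        stages += [ProcessorBlock(config) for _ in range(config.num_layers)]  # attribute name assumed
        stages += [DecoderBlock(config)]
        return nn.Sequential(*stages)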

class gperc.models.Perceiver(*args: Any, **kwargs: Any)[source]

Bases: torch.nn.Module

Unassuming Perceiver architecture that sits at the heart of this project. In practice this is a nice wrapper around the model returned by get_sequential_from_config that automatically handles different types of input in a simple fashion. This is a great approach when running on a single GPU or performing Data Parallel training on multiple GPUs. When using this for Model Parallel training you will need to build the pipeline yourself; read the Distributed section above for more details.

Parameters

config (gperc.PerceiverConfig) – configuration object

num_parameters(include_non_trainable: bool = True)[source]

Returns the number of parameters in the model.

Parameters

include_non_trainable (bool, optional) – If True, includes tensors that have requires_grad=False as well

Returns

number of parameters in the model

Return type

int
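
Assuming num_parameters() counts torch parameters in the usual way, these one-liners should agree with it:

    n_total = sum(p.numel() for p in model.parameters())                           # include_non_trainable=True
    n_trainable = sum(p.numel() for p in model.parameters() if p.requires_grad)    # include_non_trainable=False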

save(path: str)[source]

Saves the model to a file.

Parameters

path – path to save the model to

forward(input_array, attention_mask=None, output_array=None, return_attentions=False)[source]

Performs the forward pass of the Perceiver.

Parameters
  • input_array (torch.Tensor) – Input array to the Perceiver, read paper for reference

  • attention_mask (torch.Tensor, optional) – Mask for the decoder; attends at locations with value 1

  • output_array (torch.Tensor, optional) – Output array to the Perceiver, read paper for reference

  • return_attentions (bool, optional) – If True, returns the attentions as a list

Returns

The output of the Perceiver and the attention matrices

Return type

Tuple[torch.Tensor, List[torch.Tensor]] if return_attentions is True else torch.Tensor
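
A minimal usage sketch; the import path and the input shape below are illustrative guesses, the real shape is fixed by the fields of gperc.PerceiverConfig:

    import torch
    from gperc import PerceiverConfig   # import path assumed from the references above
    from gperc.models import Perceiver

    config = PerceiverConfig()          # set task-specific sizes on the config here
    model = Perceiver(config)

    # (batch, sequence, feature) shape is an illustrative guess
    input_array = torch.randn(4, 256, 64)

    out = model(input_array)                                      # torch.Tensor
    out, attentions = model(input_array, return_attentions=True)  # with attention matrices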

gperc.models.get_distributed_model(config)[source]

This function returns the model that is used for distributed training. This is not a wrapper around Perceiver but instead returns a Pipe object.

Note

When using GPipe all inputs must be tensors; see the note under gperc.models.Embeddings.forward() for the sentinel-tensor convention (attention_mask filled with -69. and output_array filled with -420. stand in for None).

Parameters

config (PerceiverConfig) – Configuration object for the Perceiver

Returns

Model that can be used in place of Perceiver, but note that it can only take torch.Tensor objects and not None.

Return type

torch.distributed.pipeline.sync.Pipe
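
A sketch of what calling the Pipe object involves, under the assumptions that torch's pipeline parallelism needs the RPC framework initialised (even in a single process) and that Pipe.forward returns an RRef:

    import torch
    import torch.distributed.rpc as rpc
    from gperc import PerceiverConfig   # import path assumed
    from gperc.models import get_distributed_model

    # Pipe requires the RPC framework, even for a single-process run
    rpc.init_rpc("worker", rank=0, world_size=1)

    config = PerceiverConfig()
    model = get_distributed_model(config)   # a torch.distributed.pipeline.sync.Pipe

    batch_size = 8
    attention_mask = torch.full((batch_size,), -69.0)   # sentinel standing in for None
    output_array = torch.full((batch_size,), -420.0)    # sentinel standing in for None

    # Pipe.forward returns an RRef to the output:
    # out = model(input_array, attention_mask, output_array).local_value()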

class gperc.models.PerceiverMLM(*args: Any, **kwargs: Any)[source]

Bases: torch.nn.Module

num_parameters()[source]

forward(x)[source]

class gperc.models.PerceiverImage(*args: Any, **kwargs: Any)[source]

Bases: torch.nn.Module

num_parameters()[source]

forward(x)[source]