gperc Architecture

Perceiver Model

This file contains the neural network code for the Perceiver architecture. gperc.models.Perceiver sits at the heart of this project. Use Perceiver for everyday use of the model. If you want to train really large models with model parallelism, read the Distributed section.

Distributed

Out of the box, gperc can handle distributed model parallel training with get_distributed_model(). During distributed training and inference with torch.distributed.pipeline.sync.Pipe (read the tutorial), the model passed to Pipe has to be an nn.Sequential object.

Documentation

gperc.models.build_position_encoding(position_encoding_type, config, num_index_items, emb_dim)

Get the positional encoding matrix. If position_encoding_type == "trainable", a random normal matrix is returned; if it is "sinusoid", a sinusoidal encoding matrix is returned.

Parameters

position_encoding_type (str) – type of embedding, should be one of “trainable”, “sinusoid”
config – gperc.PerceiverConfig
num_index_items (int) – number of items in the embedding, e.g. vocab_size
emb_dim (int) – embedding dimension

Returns

Item that can be used as a parameter in a torch.nn.Embedding

Return type

torch.nn.Parameter
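The two encoding types can be illustrated with a minimal sketch. This is not the gperc implementation: sketch_position_encoding is a hypothetical stand-in that drops the config argument, assumes an even emb_dim, uses the standard sine/cosine table for "sinusoid", and assumes that table is frozen.

```python
import math

import torch
from torch import nn


def sketch_position_encoding(position_encoding_type: str, num_index_items: int, emb_dim: int) -> nn.Parameter:
    """Hypothetical stand-in for build_position_encoding (no config argument)."""
    if position_encoding_type == "trainable":
        # Random normal matrix that the optimizer is free to update.
        return nn.Parameter(torch.randn(num_index_items, emb_dim))
    if position_encoding_type == "sinusoid":
        # Standard sine/cosine table; emb_dim is assumed to be even.
        position = torch.arange(num_index_items, dtype=torch.float32).unsqueeze(1)
        div_term = torch.exp(
            torch.arange(0, emb_dim, 2, dtype=torch.float32) * (-math.log(10000.0) / emb_dim)
        )
        table = torch.zeros(num_index_items, emb_dim)
        table[:, 0::2] = torch.sin(position * div_term)  # even columns
        table[:, 1::2] = torch.cos(position * div_term)  # odd columns
        # Assumption: the sinusoid table is kept fixed during training.
        return nn.Parameter(table, requires_grad=False)
    raise ValueError(f"unknown position_encoding_type: {position_encoding_type!r}")


sinusoid = sketch_position_encoding("sinusoid", num_index_items=16, emb_dim=8)
trainable = sketch_position_encoding("trainable", num_index_items=16, emb_dim=8)
```

Both calls return a matrix of shape (num_index_items, emb_dim), so either result could be looked up by position index in the same way.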

class gperc.models.Block(*args: Any, **kwargs: Any)

Bases: torch.nn.Module

Generic block with Attention and MLP layers

Parameters

kv_dim (int) – dimension of the key-value embeddings
q_dim (int) – dimension of the query embeddings
num_heads (int) – number of heads in the multihead attention
ffw_dim (int) – dimension of the feed-forward layer
dropout (float, optional) – dropout rate
add_residual (bool, optional) – whether to add residual to the query

forward(kv, q, attn_mask=None)

Forward pass of the block. It takes in a key-value tensor and a query tensor and applies the attention and MLP layers. Since it consumes kv and q separately, the blocks are responsible for cross-attention-like features. Returns a tuple of the output tensor and the attention matrix.

Parameters

kv (torch.Tensor) – tensor to extract information from
q (torch.Tensor) – tensor for querying the information

Returns

tuple of output Tensor and Attention matrix

Return type

Tuple[torch.Tensor, torch.Tensor]
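A block with this interface can be sketched as follows. This is an illustration under assumptions, not the gperc code: SketchBlock is a hypothetical name, the layer ordering and the use of torch.nn.MultiheadAttention are choices made here, and attn_mask is simply forwarded with torch's semantics.

```python
import torch
from torch import nn


class SketchBlock(nn.Module):
    """Hypothetical cross-attention block: q attends over kv, then an MLP."""

    def __init__(self, kv_dim, q_dim, num_heads, ffw_dim, dropout=0.0, add_residual=True):
        super().__init__()
        # kdim/vdim let keys and values have a different width than the queries.
        self.attn = nn.MultiheadAttention(
            embed_dim=q_dim, num_heads=num_heads, kdim=kv_dim, vdim=kv_dim,
            dropout=dropout, batch_first=True,
        )
        self.norm_attn = nn.LayerNorm(q_dim)
        self.mlp = nn.Sequential(
            nn.Linear(q_dim, ffw_dim), nn.GELU(), nn.Linear(ffw_dim, q_dim), nn.Dropout(dropout)
        )
        self.norm_mlp = nn.LayerNorm(q_dim)
        self.add_residual = add_residual

    def forward(self, kv, q, attn_mask=None):
        # weights has shape (batch, q_len, kv_len), averaged over heads.
        attended, weights = self.attn(q, kv, kv, attn_mask=attn_mask, need_weights=True)
        x = q + attended if self.add_residual else attended
        x = self.norm_attn(x)
        x = self.norm_mlp(x + self.mlp(x))
        return x, weights


kv = torch.randn(2, 50, 32)  # (batch, kv_len, kv_dim)
q = torch.randn(2, 8, 16)    # (batch, q_len, q_dim)
block = SketchBlock(kv_dim=32, q_dim=16, num_heads=4, ffw_dim=64)
out, attn = block(kv, q)
```

The output keeps the shape of the query, which is what lets a small latent array summarize a much longer input array.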

class gperc.models.Embeddings(*args: Any, **kwargs: Any)

Bases: torch.nn.Module

forward(input_array, attention_mask=None, output_array=None)

Takes in either the input_array or a tuple with 3 items (input_array, attention_mask, output_array) and returns a tuple with 4 values (input_array, attention_mask, latent_array, output_array). If configured, input_array can contain tokens, which will be embedded automatically.

Note

When using GPipe you need to send in tensors, because it will try to split the inputs into microbatches for each GPU, and that requires all the inputs to be tensors. So here I have written a basic heuristic that sets attention_mask and output_array to None if the average of the values in those tensors is -69 and -420 respectively.

An image classification task does not require any attention_mask, so you can pass it as a tensor with values attention_mask = torch.tensor([-69. for _ in range(batch_size)]). Similarly, you can send output_array as a tensor with values output_array = torch.tensor([-420. for _ in range(batch_size)]).
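The heuristic in the note can be sketched roughly as below. The helper name sentinel_to_none and the two constants are hypothetical; only the values -69 and -420 and the "average of the values" check come from the note.

```python
import torch

# Sentinel averages named in the note above.
ATTENTION_MASK_SENTINEL = -69.0
OUTPUT_ARRAY_SENTINEL = -420.0


def sentinel_to_none(tensor, sentinel, tol=1e-6):
    """Return None if the tensor's mean equals the sentinel, else the tensor itself."""
    if tensor is None:
        return None
    if abs(tensor.float().mean().item() - sentinel) < tol:
        return None
    return tensor


batch_size = 4
placeholder_mask = torch.tensor([-69.0 for _ in range(batch_size)])
real_mask = torch.ones(batch_size, 10)

decoded_placeholder = sentinel_to_none(placeholder_mask, ATTENTION_MASK_SENTINEL)
decoded_real = sentinel_to_none(real_mask, ATTENTION_MASK_SENTINEL)
```

Here decoded_placeholder is None while decoded_real is the original mask, so a downstream layer can treat the placeholder exactly like a missing argument.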

class gperc.models.EncoderBlock(*args: Any, **kwargs: Any)

Bases: torch.nn.Module

Encoder Block with positional embeddings

forward(input_array, attention_mask, latent_array, output_array)

Takes in a tuple with 4 values (input_array, attention_mask, latent_array, output_array) and returns a tuple with 3 items (latent_array, output_array, attentions).

class gperc.models.ProcessorBlock(*args: Any, **kwargs: Any)

Bases: torch.nn.Module

Processor Block without positional embeddings

forward(latent_array, output_array, attentions)

Takes in a tuple with 3 values (latent_array, output_array, attentions) and returns a tuple with 3 items (latent_array, output_array, attentions).

class gperc.models.DecoderBlock(*args: Any, **kwargs: Any)

Bases: torch.nn.Module

forward(input_array, latent_array, output_array, attentions)

Takes in a tuple with 3 values (latent_array, output_array, attentions) and returns a tuple with 2 items (output_logits, attentions).

class gperc.models.Perceiver(*args: Any, **kwargs: Any)

Bases: torch.nn.Module

Unassuming Perceiver architecture that sits at the heart of this project. In practice this is a thin wrapper around the model returned by get_sequential_from_config that automatically handles the different types of input. This approach works well on a single GPU or for data parallel training on multiple GPUs. For model parallel training you will need to write your own list etc.; read the story on distributed training for more details.

Parameters: config – gperc.PerceiverConfig object

num_parameters(include_non_trainable: bool = True)

Returns the number of parameters in the model.

Parameters: include_non_trainable (bool, optional) – If true includes tensors that have requires_grad=False as well
Returns: number of parameters in the model
Return type: int
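The effect of include_non_trainable can be shown with a small sketch. count_parameters is a hypothetical free function written for illustration; it only assumes the standard torch.nn.Module.parameters() iterator, not how gperc implements the method.

```python
from torch import nn


def count_parameters(model: nn.Module, include_non_trainable: bool = True) -> int:
    """Count parameter elements, optionally skipping tensors with requires_grad=False."""
    return sum(
        p.numel() for p in model.parameters() if include_non_trainable or p.requires_grad
    )


layer = nn.Linear(4, 3)            # 4 * 3 weights + 3 biases
layer.bias.requires_grad_(False)   # freeze the bias

total = count_parameters(layer)
trainable_only = count_parameters(layer, include_non_trainable=False)
```

With the bias frozen, total counts all 15 elements while trainable_only counts only the 12 weight elements.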

save(path: str)

Saves the model to a file.

Parameters: path – path to save the model to

forward(input_array, attention_mask=None, output_array=None, return_attentions=False)

Performs the forward pass of the Perceiver.

Parameters

input_array (torch.Tensor) – Input array to the Perceiver, read paper for reference
attention_mask (torch.Tensor, optional) – Mask for the decoder; attends at locations with value 1
output_array (torch.Tensor, optional) – Output array to the Perceiver, read paper for reference
return_attentions (bool, optional) – If true returns the attentions as a list

Returns

The output of the Perceiver and the attention matrices

Return type

Tuple[torch.Tensor, List[torch.Tensor]] if return_attentions is True else torch.Tensor

gperc.models.get_distributed_model(config)

This function returns the model that is used for distributed training. This is not a wrapper around Perceiver but instead returns a Pipe object.

Note

When using GPipe you need to send in tensors, because it will try to split the inputs into microbatches for each GPU, and that requires all the inputs to be tensors. So here I have written a basic heuristic that sets attention_mask and output_array to None if the average of the values in those tensors is -69 and -420 respectively.

An image classification task does not require any attention_mask, so you can pass it as a tensor with values attention_mask = torch.tensor([-69. for _ in range(batch_size)]). Similarly, you can send output_array as a tensor with values output_array = torch.tensor([-420. for _ in range(batch_size)]).

Parameters: config (PerceiverConfig) – Configuration object for the Perceiver
Returns: Model that can be used in place of Perceiver, but note that it can only take in torch.Tensor objects and not None.
Return type: torch.distributed.pipeline.sync.Pipe
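The placeholder inputs carry one value per sample, and a short sketch shows why that matters. It assumes the pipeline splits each tensor along the batch dimension, which is imitated here with Tensor.chunk. The shapes, batch_size, and chunks values are hypothetical, and no Pipe object is constructed.

```python
import torch

batch_size, chunks = 8, 4

# Hypothetical input: (batch, sequence length, channels).
input_array = torch.randn(batch_size, 1024, 3)

# Placeholder tensors standing in for None, one value per sample.
attention_mask = torch.full((batch_size,), -69.0)
output_array = torch.full((batch_size,), -420.0)

# Imitate micro-batching by splitting along the batch dimension.
micro_masks = attention_mask.chunk(chunks)
micro_outputs = output_array.chunk(chunks)

# Every micro-batch keeps the sentinel average, so each stage can still
# recognise the placeholder after the split.
mask_means = [m.mean().item() for m in micro_masks]
output_means = [o.mean().item() for o in micro_outputs]
```

Because every sample holds the same sentinel value, the average check gives the same answer for any micro-batch size.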

class gperc.models.PerceiverMLM(*args: Any, **kwargs: Any)

Bases: torch.nn.Module

num_parameters()

forward(x)

class gperc.models.PerceiverImage(*args: Any, **kwargs: Any)

Bases: torch.nn.Module

num_parameters()

forward(x)