gperc Architecture
Perceiver Model
This file contains the neural network code for the Perceiver architecture. gperc.models.Perceiver sits at the heart of this project. Use Perceiver for everyday use of the model; when you want to train really large models with model parallelism, read the notes on distributed training.
Distributed
Out of the box, gperc can handle distributed model-parallel training with get_distributed_model(). During distributed training and inference with torch.distributed.pipeline.sync.Pipe (read tutorial), the model passed to Pipe has to be an nn.Sequential object.
Documentation
- gperc.models.build_position_encoding(position_encoding_type, config, num_index_items, emb_dim)
Get the positional encoding matrix. If position_encoding_type == "trainable", a random normal matrix is returned; if it is "sinusoid", a sinusoidal encoding matrix is returned.
- Parameters
position_encoding_type (str) – type of embedding, should be one of “trainable”, “sinusoid”
config –
gperc.PerceiverConfig
num_index_items (int) – number of items in the embedding, e.g. vocab_size
emb_dim (int) – embedding dimension
- Returns
Item that can be used as a parameter in a
torch.nn.Embedding
- Return type
torch.nn.Parameter
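As a rough illustration of the "sinusoid" option, here is a minimal sketch in plain Python so the arithmetic stays visible. It assumes the standard Transformer formulation (sine on even dimensions, cosine on odd ones, base 10000); gperc's actual implementation may differ. The name sinusoid_table is hypothetical and not part of gperc.

```python
import math


def sinusoid_table(num_index_items, emb_dim):
    """Return a num_index_items x emb_dim nested list of sinusoidal encodings.

    Hypothetical helper: assumes the standard Transformer formulation,
    which is not confirmed to be what gperc uses internally.
    """
    table = []
    for pos in range(num_index_items):
        row = []
        for i in range(emb_dim):
            # Dimensions (2k, 2k + 1) share one frequency.
            angle = pos / (10000 ** (2 * (i // 2) / emb_dim))
            # Even dimensions use sine, odd dimensions use cosine.
            row.append(math.sin(angle) if i % 2 == 0 else math.cos(angle))
        table.append(row)
    return table


table = sinusoid_table(num_index_items=4, emb_dim=6)
```

To obtain something of the documented return type, such a table could be wrapped as torch.nn.Parameter(torch.tensor(table)); the "trainable" case would correspond to torch.nn.Parameter(torch.randn(num_index_items, emb_dim)).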
- class gperc.models.Block(*args: Any, **kwargs: Any)
Bases:
torch.nn.Module
Generic block with Attention and MLP layers
- Parameters
kv_dim (int) – dimension of the key-value embeddings
q_dim (int) – dimension of the query embeddings
num_heads (int) – number of heads in the multihead attention
ffw_dim (int) – dimension of the feed-forward layer
dropout (float, optional) – dropout rate
add_residual (bool, optional) – whether to add residual to the query
- forward(kv, q, attn_mask=None)
Forward pass of the block, which takes in a key-value tensor and a query tensor and applies the attention and MLP layers. Since it consumes kv and q separately, the blocks are responsible for cross-attention-like features. Returns a tuple of the output tensor and the attention matrix.
- Parameters
kv (torch.Tensor) – tensor to extract information from
q (torch.Tensor) – tensor for querying the information
- Returns
tuple of output Tensor and Attention matrix
- Return type
Tuple[torch.Tensor, torch.Tensor]
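The idea of a block that consumes kv and q separately can be sketched with PyTorch's built-in torch.nn.MultiheadAttention. This is a simplified stand-in, not gperc's Block: the class name CrossAttentionBlock, the GELU MLP, and the absence of layer normalisation are all assumptions made for illustration.

```python
import torch
from torch import nn


class CrossAttentionBlock(nn.Module):
    """Hypothetical stand-in for Block: q attends over kv, then an MLP is applied."""

    def __init__(self, kv_dim, q_dim, num_heads, ffw_dim, dropout=0.0, add_residual=True):
        super().__init__()
        # kdim/vdim let keys and values have a different width than the queries.
        self.attn = nn.MultiheadAttention(
            q_dim, num_heads, dropout=dropout, kdim=kv_dim, vdim=kv_dim, batch_first=True
        )
        self.mlp = nn.Sequential(
            nn.Linear(q_dim, ffw_dim), nn.GELU(), nn.Linear(ffw_dim, q_dim)
        )
        self.add_residual = add_residual

    def forward(self, kv, q, attn_mask=None):
        # Queries come from q; keys and values both come from kv.
        out, attn = self.attn(q, kv, kv, attn_mask=attn_mask)
        if self.add_residual:
            out = out + q
        out = out + self.mlp(out)
        return out, attn


block = CrossAttentionBlock(kv_dim=32, q_dim=16, num_heads=4, ffw_dim=64)
kv = torch.randn(2, 50, 32)  # (batch, kv_length, kv_dim)
q = torch.randn(2, 8, 16)    # (batch, q_length, q_dim)
out, attn = block(kv, q)
```

The output keeps the shape of q, while the attention matrix has one row per query position and one column per key-value position. Passing the same tensor as both kv and q would turn the same block into self-attention.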
- class gperc.models.Embeddings(*args: Any, **kwargs: Any)
Bases:
torch.nn.Module
- forward(input_array, attention_mask=None, output_array=None)
Takes in either the input_array or a tuple with 3 items (input_array, attention_mask, output) and returns a tuple with 4 values (input_array, attention_mask, latent_array, output_array). If configured, input_array can contain tokens, which will be embedded automatically.
Note
When using GPipe you need to send in tensors, because it will try to send items as microbatches for each GPU. That requires all the inputs to be tensors, so here I have written a basic heuristic that sets attention_mask and output_array to None if the average of the values in those tensors is -69 and -420 respectively.
An image classification task does not require any attention_mask, so you can pass it as a tensor with values attention_mask = torch.tensor([-69. for _ in range(batch_size)]), and similarly you can send output_array as a tensor with values output_array = torch.tensor([-420. for _ in range(batch_size)]).
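The receiving side of this heuristic might look roughly like the sketch below. The function name decode_placeholder is hypothetical and the exact comparison gperc performs is not shown in this documentation; the sketch only assumes that a tensor whose mean equals the sentinel value is treated as None.

```python
import torch


def decode_placeholder(tensor, sentinel):
    """Return None if the tensor's mean equals the sentinel, else the tensor itself.

    Hypothetical helper illustrating the documented heuristic; not gperc's code.
    """
    if tensor.float().mean().item() == sentinel:
        return None
    return tensor


batch_size = 4
# Placeholder tensors built exactly as the note above describes.
attention_mask = torch.tensor([-69.0 for _ in range(batch_size)])
output_array = torch.tensor([-420.0 for _ in range(batch_size)])
# A genuine mask is passed through unchanged.
real_mask = torch.ones(batch_size, 10)

decoded_mask = decode_placeholder(attention_mask, -69.0)
decoded_output = decode_placeholder(output_array, -420.0)
decoded_real = decode_placeholder(real_mask, -69.0)
```

One consequence of a mean-based check is that a real tensor whose values happen to average to the sentinel would also be dropped, which is why the sentinels are values unlikely to occur in practice.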
- class gperc.models.EncoderBlock(*args: Any, **kwargs: Any)
Bases:
torch.nn.Module
Encoder Block with positional embeddings
- class gperc.models.ProcessorBlock(*args: Any, **kwargs: Any)
Bases:
torch.nn.Module
Processor Block without positional embeddings
- class gperc.models.Perceiver(*args: Any, **kwargs: Any)
Bases:
torch.nn.Module
Unassuming Perceiver architecture that sits at the heart of this project. In practice this is a thin wrapper around the model returned by get_sequential_from_config that automatically handles different types of input in a simple fashion. This is a good approach when running on a single GPU or performing data-parallel training on multiple GPUs. When using this for model-parallel training, you will need to write your own list etc.; read the story on distributed for more details.
- Parameters
config –
gperc.PerceiverConfig
object
- num_parameters(include_non_trainable: bool = True)
Returns the number of parameters in the model.
- Parameters
include_non_trainable (bool, optional) – If true, also includes tensors that have requires_grad=False
- Returns
number of parameters in the model
- Return type
int
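Counting parameters this way is commonly done by summing numel() over model.parameters(). The sketch below shows that idiom on a toy model; count_parameters is a hypothetical free function, and whether num_parameters is implemented exactly like this is an assumption.

```python
import torch
from torch import nn


def count_parameters(model, include_non_trainable=True):
    """Sum the element counts of a module's parameters.

    Hypothetical helper: frozen parameters (requires_grad=False) are skipped
    when include_non_trainable is False.
    """
    return sum(
        p.numel()
        for p in model.parameters()
        if include_non_trainable or p.requires_grad
    )


model = nn.Sequential(nn.Linear(10, 20), nn.Linear(20, 5))
# Freeze the first layer so the two counts differ.
for p in model[0].parameters():
    p.requires_grad = False

total = count_parameters(model)                                   # 220 + 105
trainable = count_parameters(model, include_non_trainable=False)  # second layer only
```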
- forward(input_array, attention_mask=None, output_array=None, return_attentions=False)
Performs the forward pass of the Perceiver.
- Parameters
input_array (torch.Tensor) – Input array to the Perceiver, read paper for reference
attention_mask (torch.Tensor, optional) – Mask for the decoder; attention is applied at locations with value 1
output_array (torch.Tensor, optional) – Output array to the Perceiver, read paper for reference
return_attentions (bool, optional) – If true returns the attentions as a list
- Returns
The output of the Perceiver and the attention matrices
- Return type
Tuple[torch.Tensor, List[torch.Tensor]] if
return_attentions
is True else torch.Tensor
- gperc.models.get_distributed_model(config)
This function returns the model that is used for distributed training. This is not a wrapper around Perceiver but instead returns a Pipe object.
Note
When using GPipe you need to send in tensors, because it will try to send items as microbatches for each GPU. That requires all the inputs to be tensors, so here I have written a basic heuristic that sets attention_mask and output_array to None if the average of the values in those tensors is -69 and -420 respectively.
An image classification task does not require any attention_mask, so you can pass it as a tensor with values attention_mask = torch.tensor([-69. for _ in range(batch_size)]), and similarly you can send output_array as a tensor with values output_array = torch.tensor([-420. for _ in range(batch_size)]).
- Parameters
config (PerceiverConfig) – Configuration object for the Perceiver
- Returns
Model that can be used in place of Perceiver, but note that it can only take in torch.Tensor objects and not None.
- Return type
torch.distributed.pipeline.sync.Pipe
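On the calling side, preparing all-tensor inputs for an image classification batch could look like the sketch below. The helper make_placeholder_inputs and the image shape are assumptions for illustration (gperc may expect a different input layout), and the chunk call only imitates microbatch splitting rather than invoking Pipe itself.

```python
import torch


def make_placeholder_inputs(images):
    """Pair a batch of images with sentinel tensors standing in for None.

    Hypothetical helper; the sentinel values -69 and -420 come from the note above.
    """
    batch_size = images.shape[0]
    attention_mask = torch.tensor([-69.0 for _ in range(batch_size)])
    output_array = torch.tensor([-420.0 for _ in range(batch_size)])
    return images, attention_mask, output_array


images = torch.randn(8, 3, 32, 32)  # assumed shape, for illustration only
inputs = make_placeholder_inputs(images)

# Imitate splitting every input along the batch dimension into 4 microbatches.
microbatches = [t.chunk(4, dim=0) for t in inputs]
```

Giving each placeholder one entry per batch item means every microbatch still receives a slice whose mean is the sentinel, so the heuristic keeps working after the split.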