Input Embeddings in Transformers



The two main components of a Transformer, namely the encoder and the decoder, are built from various mechanisms and sub-layers. In the Transformer architecture, the first of these sublayers is the input embedding.

Input embeddings are a crucial component that serves as the initial representation of the input data. Before text is fed into the model for processing, the embedding step converts raw text units such as words or subwords into numerical vectors that the model can work with.

Read this chapter to understand what input embeddings are, why they are important, and how they are implemented in Transformers, with Python examples to illustrate these concepts.

What are Input Embeddings?

Input embeddings are vector representations of discrete tokens such as words, subwords, or characters. These vectors capture the semantic meaning of the tokens, which enables the model to understand and manipulate the text data effectively.

The role of the input embedding sublayer in the Transformer is to map the input tokens into a high-dimensional vector space of dimension $\mathrm{d_{model}} = 512$, where similar tokens have similar vector representations.
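
The following is a minimal sketch of this mapping, assuming PyTorch is available; the vocabulary size and token indices are hypothetical. The multiplication by the square root of d_model follows the scaling used in the original Transformer paper.

import math
import torch
import torch.nn as nn

d_model = 512        # embedding dimension used in the original Transformer
vocab_size = 10000   # hypothetical vocabulary size

# The embedding sublayer: a learnable lookup table of shape (vocab_size, d_model)
embedding = nn.Embedding(vocab_size, d_model)

# A batch containing one sequence of four hypothetical token indices
token_ids = torch.tensor([[5, 42, 7, 999]])

# Map the token indices into the d_model-dimensional space and apply the
# sqrt(d_model) scaling from the original Transformer
x = embedding(token_ids) * math.sqrt(d_model)
print(x.shape)   # torch.Size([1, 4, 512])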

Importance of Input Embeddings in Transformers

Let's now understand why input embeddings are important in Transformers −

Semantic Representation

Input embeddings capture semantic similarities between words in the input text. For example, words like "king" and "queen", or "cat" and "dog", will have vectors that are close to each other in the embedding space.
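
A minimal sketch of how such closeness can be measured with cosine similarity is given below; the embedding values are made up for illustration, whereas a real model learns them during training.

import numpy as np

# Hypothetical, hand-picked embeddings (a real model learns these values)
embeddings = {
   "king":  np.array([0.80, 0.65, 0.10, 0.05]),
   "queen": np.array([0.78, 0.70, 0.12, 0.04]),
   "cat":   np.array([0.10, 0.05, 0.85, 0.70]),
}

def cosine_similarity(a, b):
   return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

print(cosine_similarity(embeddings["king"], embeddings["queen"]))   # close to 1.0
print(cosine_similarity(embeddings["king"], embeddings["cat"]))     # much lower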

Dimensionality Reduction

Traditional one-hot encoding represents each token as a sparse binary vector whose length equals the vocabulary size, with a single non-zero entry, which requires a large amount of space. Input embeddings, on the other hand, reduce this computational burden by providing a compact, dense representation.
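
The following sketch compares the two representations for a hypothetical vocabulary of 50,000 tokens; the sizes are assumptions chosen only to illustrate the difference.

import numpy as np

vocab_size = 50000   # hypothetical vocabulary size
embed_dim = 512      # typical embedding dimension

# One-hot: a 50,000-dimensional vector that is almost entirely zeros
one_hot = np.zeros(vocab_size)
one_hot[1234] = 1.0   # hypothetical token index

# Dense embedding: a 512-dimensional vector for the same token
embedding_matrix = np.random.rand(vocab_size, embed_dim)
dense_vector = embedding_matrix[1234]

print(one_hot.shape)        # (50000,)
print(dense_vector.shape)   # (512,)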

Enhanced Learning

Embeddings also capture contextual relationships, which improves the model's ability to learn patterns and relationships in the data. This in turn leads to better performance on NLP tasks.

Working of the Input Embedding Sublayer

The input embedding sublayer works like that of other standard transduction models and includes the following steps −

Step 1: Tokenization

Before the input tokens can be embedded, the raw text data must be tokenized. Tokenization is the process of splitting the text into smaller units such as words or subwords.

Let’s see both kinds of tokenization −

  • Word-level tokenization − As the name implies, it splits the text into individual words.
  • Subword-level tokenization − As the name implies, it splits the text into smaller units, which can be parts of a word. This kind of tokenization is used in models like BERT and GPT to handle rare, out-of-vocabulary, or misspelled words. For example, a word-level tokenizer applied to the text "Transformers revolutionized the field of NLP" produces the tokens ["Transformers", "revolutionized", "the", "field", "of", "NLP"], while a subword-level tokenizer might further split "revolutionized" into pieces such as "revolution" and "ized", as shown in the sketch below.
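
The sketch below contrasts the two approaches. The subword part assumes that the Hugging Face transformers library and the bert-base-uncased tokenizer are available; the exact subword pieces depend on the tokenizer's vocabulary.

from transformers import AutoTokenizer

text = "Transformers revolutionized the field of NLP"

# Word-level tokenization: simply split on whitespace
print(text.split())
# ['Transformers', 'revolutionized', 'the', 'field', 'of', 'NLP']

# Subword-level tokenization: BERT's WordPiece tokenizer may break words into pieces
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
print(tokenizer.tokenize(text))
# Possible output (pieces marked with "##" continue the previous token):
# ['transformers', 'revolution', '##ized', 'the', 'field', 'of', 'nl', '##p']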

Step 2: Embedding Layer

The second step is the embedding layer, which is essentially a lookup table that maps each token to a dense vector of fixed dimension. This process involves the following two concepts −

  • Vocabulary − It is a set of unique tokens recognized by the model.
  • Embedding Dimension − It represents the size of the vector space in which tokens are represented, for example a size of 512.

When a token is passed to the embedding layer, it returns the corresponding dense vector from the embedding matrix.

How are Input Embeddings Implemented in a Transformer?

Given below is a Python example to illustrate how input embeddings are implemented in a Transformer −

Example

import numpy as np

# Example text and tokenization
text = "Transformers revolutionized the field of NLP"
tokens = text.split()

# Creating a vocabulary
vocab = {word: idx for idx, word in enumerate(tokens)}

# Example input (sequence of token indices)
input_indices = np.array([vocab[word] for word in tokens])

print("Vocabulary:", vocab)
print("Input Indices:", input_indices)

# Parameters
vocab_size = len(vocab)
embed_dim = 512  # Dimension of the embeddings

# Initialize the embedding matrix with random values
embedding_matrix = np.random.rand(vocab_size, embed_dim)

# Get the embeddings for the input indices
input_embeddings = embedding_matrix[input_indices]

print("Embedding Matrix:\n", embedding_matrix)
print("Input Embeddings:\n", input_embeddings)

Output

The above Python script first splits the text into tokens and creates a vocabulary that maps each word to a unique index. It then initializes an embedding matrix with random values, where each row corresponds to the embedding of one word, and looks up the rows for the input indices. We use an embedding dimension of 512. Since the matrix is initialized randomly, the exact values will differ on every run; a sample output is shown below −

Vocabulary: {'Transformers': 0, 'revolutionized': 1, 'the': 2, 'field': 3, 'of': 4, 'NLP': 5}
Input Indices: [0 1 2 3 4 5]
Embedding Matrix:
 [[0.29083668 0.70830247 0.22773598 ... 0.62831348 0.90910366 0.46552784]
 [0.01269533 0.47420163 0.96242738 ... 0.38781376 0.33485277 0.53721033]
 [0.62287977 0.09313765 0.54043664 ... 0.7766359  0.83653342 0.75300144]
 [0.32937143 0.51701913 0.39535506 ... 0.60957358 0.22620172 0.60341522]
 [0.65193484 0.25431826 0.55643452 ... 0.76561879 0.24922971 0.96247851]
 [0.78385765 0.58940282 0.71930539 ... 0.61332926 0.24710099 0.5445185 ]]
Input Embeddings:
 [[0.29083668 0.70830247 0.22773598 ... 0.62831348 0.90910366 0.46552784]
 [0.01269533 0.47420163 0.96242738 ... 0.38781376 0.33485277 0.53721033]
 [0.62287977 0.09313765 0.54043664 ... 0.7766359  0.83653342 0.75300144]
 [0.32937143 0.51701913 0.39535506 ... 0.60957358 0.22620172 0.60341522]
 [0.65193484 0.25431826 0.55643452 ... 0.76561879 0.24922971 0.96247851]
 [0.78385765 0.58940282 0.71930539 ... 0.61332926 0.24710099 0.5445185 ]]

Conclusion

The input embedding sublayer converts raw text units such as words or subwords into a format that the model can process. We also explained why input embeddings are important for the successful working of the Transformer.

By capturing semantic similarities between words and reducing computational complexity through a compact, dense representation, this sublayer ensures that the model can effectively learn patterns and relationships in the data.

We also provided a Python example that implements the fundamental steps required to transform raw text data into a format suitable for further processing in a Transformer model.

Understanding and implementing the input embedding sublayer is crucial for effectively using the Transformer models for NLP tasks.
