RNN-based encoder-decoder architecture explained

RNN-based Encoder-Decoder

Encoder
An encoder transforms the input data into a different representation, usually a fixed-size context vector. The input data x can be a sequence or a set of features. The encoder maps this input to a context vector c, which is a condensed representation of the input data. Mathematically, this can be represented as:

\boldsymbol{c = f(x)}
 c \text{ - context vector, } x \text{ - input data}

In the case of a sequence, such as a sentence in a language translation task, the encoder might process each element of the sequence (e.g., each word) sequentially. If the encoder is a recurrent neural network (RNN), the transformation f can involve updating the hidden state h at each step:

\boldsymbol{h_t = f(h_{t-1}, x_t)}
 h_t \text{ - hidden state at time } t
 x_t \text{ - input at time } t
 h_{t-1} \text{ - hidden state at time } t-1

The final hidden state h_T can be used as the context vector c for the entire input sequence.
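
To make this concrete, here is a minimal sketch of such an encoder (an illustration only, assuming PyTorch; the class name Encoder and the layer sizes are arbitrary choices, not from the original post):

```python
import torch
import torch.nn as nn

class Encoder(nn.Module):
    """Minimal GRU encoder: maps a token sequence x to a context vector c."""
    def __init__(self, vocab_size, emb_dim=64, hidden_dim=128):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, emb_dim)
        self.rnn = nn.GRU(emb_dim, hidden_dim, batch_first=True)

    def forward(self, x):
        # x: (batch, seq_len) tensor of token ids
        embedded = self.embedding(x)         # (batch, seq_len, emb_dim)
        # the GRU applies h_t = f(h_{t-1}, x_t) step by step internally
        _, h_T = self.rnn(embedded)          # h_T: (1, batch, hidden_dim), final hidden state
        c = h_T.squeeze(0)                   # context vector c = h_T
        return c
```

Here c is the fixed-size summary of the whole input sequence that the decoder consumes.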

Decoder
The decoder takes the context vector c and generates the output data y. In many applications, the output is also a sequence, and the decoder generates it one element at a time. Writing h'_t for the decoder's hidden state (to distinguish it from the encoder's h_t), the decoder's operation can be represented as:

\boldsymbol{y_t = g(y_{t-1}, h'_t, c)}
y_t \text{ - output at time } t
h'_t \text{ - decoder hidden state at time } t

In many sequence-to-sequence models, the decoder is also an RNN, and its hidden state is updated at each step by its own transition function:

\boldsymbol{h'_t = f_{dec}(h'_{t-1}, y_{t-1}, c)}
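
A matching single decoder step can be sketched the same way (again illustrative; DecoderStep is a hypothetical name, and concatenating the context vector with the embedded previous token is just one common way to condition on c):

```python
class DecoderStep(nn.Module):
    """One step of a GRU decoder conditioned on the context vector c."""
    def __init__(self, vocab_size, emb_dim=64, hidden_dim=128):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, emb_dim)
        # the GRU cell sees [embedded previous token ; context vector]
        self.cell = nn.GRUCell(emb_dim + hidden_dim, hidden_dim)
        self.out = nn.Linear(hidden_dim, vocab_size)

    def forward(self, y_prev, h_prev, c):
        # y_prev: (batch,) previous token ids; h_prev, c: (batch, hidden_dim)
        emb = self.embedding(y_prev)                            # (batch, emb_dim)
        h_t = self.cell(torch.cat([emb, c], dim=-1), h_prev)    # h'_t = f_dec(h'_{t-1}, y_{t-1}, c)
        logits = self.out(h_t)                                  # one score per vocabulary word
        return logits, h_t                                      # y_t is chosen from the logits
```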


The encoder-decoder framework, particularly in the context of sequence-to-sequence models, is designed to handle sequences of variable length on both the input and output sides.

Output Generation (Decoder)

Initial State: The decoder is initialized with the context vector c as its initial state:

h'_0 = c

Start Token: The decoder receives a start-of-sequence token SOS as its first input y_0.
Decoding Loop: At each step t, the decoder generates an output token y_t and updates its hidden state h'_t.
Variable Length Output: The decoder continues to generate tokens one at a time until it produces an end-of-sequence token EOS. The length of the output sequence Y = (y_1, y_2, …, y_m) is not fixed and can be different from the input length n. The process is as follows:

y_t = Decode(h'_{t-1}, y_{t-1})
h'_t = UpdateState(h'_{t-1}, y_{t-1})
\text{for t = 1 to m, where m can be different from n}

Stopping Criterion: The loop stops when the EOS token is generated, or when the output reaches the maximum allowed length.
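
Putting these pieces together, a greedy decoding loop might look like the sketch below (sos_id, eos_id and max_len are assumed hyperparameters; DecoderStep is the hypothetical module from the previous sketch):

```python
def greedy_decode(decoder_step, c, sos_id, eos_id, max_len=50):
    """Generate a variable-length output, one token at a time, until EOS or max_len."""
    h = c                                                        # h'_0 = c
    y_prev = torch.full((c.size(0),), sos_id, dtype=torch.long)  # y_0 = SOS
    generated = []
    for _ in range(max_len):                                     # stopping criterion: max length
        logits, h = decoder_step(y_prev, h, c)                   # y_t, h'_t from (y_{t-1}, h'_{t-1}, c)
        y_t = logits.argmax(dim=-1)                              # greedy choice of the next token
        generated.append(y_t)
        if (y_t == eos_id).all():                                # stopping criterion: EOS generated
            break
        y_prev = y_t
    return torch.stack(generated, dim=1)                         # (batch, m), with m <= max_len
```

Initializing h'_0 = c directly works in this sketch because the encoder and decoder share the same hidden size; otherwise a small projection layer would be needed.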

The decoder can also be represented using the probability distribution of the next token given the previous tokens and the context vector c from the encoder:

p(y_t | y_{<t}, c)

The full sequence probability is the product of the individual token probabilities: the decoder generates the sequence token by token, so by the chain rule of probability the probability of the sequence Y given the context vector c is:

p(Y|c) = p(y_1|c) \cdot p(y_2|y_1, c) \cdot \ldots \cdot p(y_m|y_{<m}, c)
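
In code, the chain rule corresponds to summing per-step log-probabilities while feeding the decoder the known target tokens (a sketch under the same assumptions as above; working in log space avoids numerical underflow of the product):

```python
import torch.nn.functional as F

def sequence_log_prob(decoder_step, c, target):
    """log p(Y|c) = sum_t log p(y_t | y_{<t}, c); `target` starts with the SOS token."""
    h = c
    log_prob = torch.zeros(c.size(0))
    for t in range(1, target.size(1)):
        logits, h = decoder_step(target[:, t - 1], h, c)         # condition on y_{<t}
        step_log_probs = F.log_softmax(logits, dim=-1)           # log p(. | y_{<t}, c)
        log_prob += step_log_probs.gather(1, target[:, t:t + 1]).squeeze(1)
    return log_prob                                              # exp(log_prob) is the chain-rule product
```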

How do we obtain these conditional probabilities?
– For each time step t from 1 to m (where m is not known in advance):

• The decoder takes the previous hidden state h'_{t-1} and the previously generated token y_{t-1} as inputs.
• The function f_{\theta_{dec}}, parametrized by the decoder's weights \theta_{dec}, computes the current hidden state h'_t and the logit vector l_t, from which the probability distribution for the next token is obtained:

(h'_{t-1}, y_{t-1}) \xrightarrow{f_{\theta_{dec}}} (l_t, h'_t)

• The logit vector is computed by multiplying the decoder's output representation y'_t by the transposed word embedding matrix, plus a bias:

l_t = W_e y'_t + b

• The logit vector l_t is passed through a softmax layer to obtain the probability distribution for the next token y_t:

p(y_t | y_{<t}, c) = \text{Softmax}(l_t)
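
As a concrete illustration of the last two bullets, here is one way the logits and the softmax can be computed when the output layer is tied to the word embedding matrix (a sketch with made-up dimensions; weight tying is a common choice but not the only one):

```python
import torch

vocab_size, hidden_dim = 1000, 128
W_e = torch.randn(vocab_size, hidden_dim)   # word embedding matrix, reused as the output layer
b = torch.zeros(vocab_size)
y_out = torch.randn(1, hidden_dim)          # decoder output representation y'_t for one step

logits = y_out @ W_e.T + b                  # l_t = W_e y'_t + b (row-vector form): one logit per word
probs = torch.softmax(logits, dim=-1)       # p(y_t | y_{<t}, c) = Softmax(l_t), sums to 1
```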

Token Generation:

• A token is sampled from the probability distribution p(y_t | y_{<t}, c), and it becomes the next token y_t in the sequence.
• This token is then used as the input for the next time step.

Sequence Continuation:
– This process repeats, with the decoder generating one token at a time, updating its hidden state, and adjusting the probability distributions for subsequent tokens based on the current sequence.

Stopping Criterion:
– The loop continues until the decoder generates an EOS token, indicating the end of the sequence, or until it reaches a predefined maximum sequence length.
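
The whole procedure, with sampling instead of the greedy argmax used earlier, can be sketched as follows (illustrative only, reusing the hypothetical DecoderStep from above):

```python
def sample_sequence(decoder_step, c, sos_id, eos_id, max_len=50):
    """Token-by-token generation: sample y_t ~ p(y_t | y_{<t}, c) until EOS or max_len."""
    h = c                                                        # h'_0 = c
    y_prev = torch.full((c.size(0),), sos_id, dtype=torch.long)  # y_0 = SOS
    generated = []
    for _ in range(max_len):
        logits, h = decoder_step(y_prev, h, c)                   # (h'_{t-1}, y_{t-1}) -> (l_t, h'_t)
        probs = torch.softmax(logits, dim=-1)                    # p(y_t | y_{<t}, c) = Softmax(l_t)
        y_t = torch.multinomial(probs, num_samples=1).squeeze(1) # sample the next token
        generated.append(y_t)
        if (y_t == eos_id).all():                                # EOS ends the sequence
            break
        y_prev = y_t                                             # feed y_t back in at the next step
    return torch.stack(generated, dim=1)
```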
