The difference between an RNN's output and h_n


I was quite confused while doing a homework assignment on implementing Luong attention, because it says that the decoder is an RNN that takes \(y_{t-1}\) and \(s_{t-1}\) as input and outputs \(s_t\), i.e., \(s_t = RNN(y_{t-1}, s_{t-1})\).

But the PyTorch RNN API is \(outputs, hidden\_last = RNN(inputs, hidden\_init)\): it takes in a whole sequence of elements, processes them serially, and outputs a sequence as well.
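For concreteness, here is a minimal sketch of that call with `torch.nn.RNN` (the sizes are arbitrary; the shapes are the thing to watch):

```python
import torch
import torch.nn as nn

batch, seq_len, input_size, hidden_size, num_layers = 2, 5, 8, 16, 3

rnn = nn.RNN(input_size, hidden_size, num_layers, batch_first=True)
inputs = torch.randn(batch, seq_len, input_size)
h0 = torch.zeros(num_layers, batch, hidden_size)  # hidden_init

outputs, h_n = rnn(inputs, h0)

print(outputs.shape)  # torch.Size([2, 5, 16]): one vector per time step
print(h_n.shape)      # torch.Size([3, 2, 16]): one vector per layer
```

So `outputs` has one entry per time step, while `h_n` has one entry per layer, which is exactly the ambiguity discussed below.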

So I was confused about which of these is \(s_t\). Is it the \(outputs\), or the \(hidden\_last\)?

This picture was very helpful:

The \(outputs\) here is the hidden states of the last layer at every element of the sequence (every time step), while \(h_n, c_n = hidden\_last\) is the hidden states of the last time step across all layers.

The former is \(H\), the collection of hidden states, which can be used in subsequent calculations such as attention scores; the latter is the hidden state that can be fed directly into the next iteration (e.g., as the initial hidden state of the decoder).
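Tying this back to Luong attention: the encoder's \(H\) supplies the vectors to score against, while the decoder's own hidden state plays the role of \(s_t\). A minimal sketch with a dot-product score (all sizes invented for illustration; a real Luong implementation would typically use the "general" score with a learned matrix):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

torch.manual_seed(0)
hidden_size = 8

encoder = nn.RNN(input_size=4, hidden_size=hidden_size, batch_first=True)
decoder = nn.RNN(input_size=4, hidden_size=hidden_size, batch_first=True)

src = torch.randn(1, 6, 4)          # source sequence
H, h_n = encoder(src)               # H: (1, 6, hidden) — hidden state collection

y_prev = torch.randn(1, 1, 4)       # previous target embedding y_{t-1}
s_t, _ = decoder(y_prev, h_n)       # one decoder step: s_t = RNN(y_{t-1}, s_{t-1})

# Dot-product score between s_t and every encoder hidden state in H
scores = torch.bmm(s_t, H.transpose(1, 2))  # (1, 1, 6)
weights = F.softmax(scores, dim=-1)
context = torch.bmm(weights, H)             # (1, 1, hidden)
print(context.shape)  # torch.Size([1, 1, 8])
```

Note that running the decoder one step at a time (sequence length 1) is what makes the single-step formula \(s_t = RNN(y_{t-1}, s_{t-1})\) line up with PyTorch's sequence-in, sequence-out API.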