I was quite confused when doing a homework assignment on implementing Luong attention, because it says the decoder is an RNN that takes $$y_{t-1}$$ and $$s_{t-1}$$ as input and outputs $$s_t$$, i.e., $$s_t = \mathrm{RNN}(y_{t-1}, s_{t-1})$$.

But the PyTorch RNN API is $$outputs, hidden\_last = \mathrm{RNN}(inputs, hidden\_init)$$, which takes in a whole sequence of elements, processes them sequentially, and also outputs a sequence.
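The two views can be reconciled by driving the PyTorch RNN one time step at a time, i.e., with a sequence of length 1. A minimal sketch (the sizes and a GRU cell here are arbitrary choices for illustration):

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
emb_size, hidden_size = 4, 8
rnn = nn.GRU(emb_size, hidden_size)  # single layer

# One decoder step: feed y_{t-1} with seq_len=1 and s_{t-1} as the initial hidden state
y_prev = torch.randn(1, 1, emb_size)        # (seq_len=1, batch=1, emb_size)
s_prev = torch.zeros(1, 1, hidden_size)     # (num_layers=1, batch=1, hidden_size)

output, s_t = rnn(y_prev, s_prev)

# With seq_len=1 and a single layer, `output` and the new hidden state
# carry the same vector, so s_t = RNN(y_{t-1}, s_{t-1}) in either form.
assert torch.allclose(output[0], s_t[0])
```

So the homework's $$s_t = \mathrm{RNN}(y_{t-1}, s_{t-1})$$ is just the single-step special case of the sequence API.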

So I was confused about what $$s_t$$ is. Is it the $$outputs$$, or the $$hidden\_last$$?

Here is a very helpful picture:

The $$outputs$$ here are the hidden states of the *last layer* at every time step in the sequence, while $$h_n, c_n = hidden\_last$$ are the hidden states of the *last time step* across all layers (the cell state $$c_n$$ is present only for an LSTM).
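This shape difference can be checked directly. A small sketch with a 2-layer LSTM (all sizes are arbitrary):

```python
import torch
import torch.nn as nn

seq_len, batch, input_size, hidden_size, num_layers = 5, 3, 4, 8, 2
lstm = nn.LSTM(input_size, hidden_size, num_layers)

x = torch.randn(seq_len, batch, input_size)
outputs, (h_n, c_n) = lstm(x)

# `outputs`: last layer, every time step -> (seq_len, batch, hidden_size)
# `h_n`:     every layer, last time step -> (num_layers, batch, hidden_size)
print(outputs.shape)  # torch.Size([5, 3, 8])
print(h_n.shape)      # torch.Size([2, 3, 8])

# The two overlap in exactly one place: the last time step of `outputs`
# equals the last layer of `h_n`.
assert torch.allclose(outputs[-1], h_n[-1])
```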

The former is $$H$$, the collection of hidden states, which is used in subsequent calculations such as attention weights or scores; the latter is the hidden state that is fed directly into the next iteration (the next decoder step).
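As a sketch of how $$H$$ feeds into a Luong-style attention step, here is the "dot" scoring variant, $$\mathrm{score}(s_t, h_i) = s_t^\top h_i$$ (the tensor names and sizes are hypothetical):

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
src_len, batch, hidden = 6, 2, 8
H = torch.randn(src_len, batch, hidden)   # encoder `outputs`: all time steps, last layer
s_t = torch.randn(batch, hidden)          # current decoder hidden state

# Dot-product score of s_t against every encoder hidden state h_i
scores = torch.einsum('bh,sbh->bs', s_t, H)        # (batch, src_len)
weights = F.softmax(scores, dim=-1)                # attention distribution over source
context = torch.einsum('bs,sbh->bh', weights, H)   # context vector: weighted sum of H
```

Note that the whole matrix $$H$$ is needed here, which is why the sequence-valued $$outputs$$ (and not just $$hidden\_last$$) must be kept around from the encoder.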