Obviously, this approach fails to consider the temporal order of frames and gives the same input to the decoder at every time step (as shown in the figure above). We can easily improve this model by replacing the arithmetic mean with a weighted average. That's exactly the method in Describing Videos by Exploiting Temporal Structure. Here is a picture of their model.
So how do we choose those weights? We can calculate them automatically! Let's check the structure above. When you want to calculate the weight for a single CNN output, what should the input factors be? We know that the feature itself holds the information about that frame, and the hidden state holds the information of the RNN. As the LSTM has memory over time steps, the information is gathered into the latest hidden state (i.e. the hidden state of the previous time step). Thus, it comes naturally to define the weight as a function of the feature and the previous hidden state.
\[e_{i}^{(t)} = f(v_i, h^{(t-1)})\]In this equation, \(v_i\) represents a feature vector, and \(h^{(t-1)}\) is the previous hidden state of the RNN unit that the weighted feature will be fed into. Assuming \(N\) is the number of frames and \(T\) is the total number of time steps in the RNN, we have \(i\in [1,N]\) and \(t\in [1,T]\). (We can define \(h^{(0)}\) as the \(\mathbf{0}\) vector so that \(t=1\) makes sense.)
But this doesn't guarantee that the \(e_{i}^{(t)}\) sum to 1 over all features (i.e. \(\sum_{i=1}^Ne_i^{(t)}\neq 1\) in general; remember we are doing a weighted average!). So we apply a softmax over them as follows.
\[\alpha_{i}^{(t)} = \frac{\exp\{e_{i}^{(t)}\}}{\sum_{i=1}^N\exp\{e_{i}^{(t)}\}}\]This gives \(\sum_{i=1}^N\alpha_{i}^{(t)}=1\), so we can calculate a single feature for each time step as
\[\phi^{(t)}=\sum_{i=1}^N\alpha_{i}^{(t)}v_i\]So how do we define \(f(v_i,h^{(t-1)})\)? It's much the same as what we do in a node of a neural network. (If you know nothing about neural networks and machine learning, I would highly recommend this tutorial and Andrew Ng's machine learning course, which takes weeks of study but is really worth the time.) Just take a linear function over \(v_i\) and \(h^{(t-1)}\) and add a non-linear activation function. Here we use \(\tanh\) as the activation function.
\[e_{i}^{(t)} = w^T\tanh(W_ah^{(t-1)}+U_av_i+b_a)\]Here \(w,W_a,U_a,b_a\) are all learnable parameters. Assuming \(r\) is the size of the RNN hidden state, \(m\) is the encoding size of the CNN, and \(a\) is the attention matrix size (you can play with this parameter), we can list the sizes of these variables as follows (I always find it useful to list the sizes of the variables!)
\[v_i\in \mathbb{R}^{m} \\ h^{(t)}\in \mathbb{R}^{r}\\ W_a\in \mathbb{R}^{a\times r}\\ U_a\in \mathbb{R}^{a\times m}\\ b_a, w \in \mathbb{R}^{a}\]So this gives \(e_i^{(t)}\) as a scalar value.
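To make these shapes concrete, here is a minimal NumPy sketch of a single attention step (the dimension values and the random initialization are just placeholders for illustration; a real model would learn these parameters):

```python
import numpy as np

N, m, r, a = 28, 2048, 512, 512   # frames, CNN feature size, RNN hidden size, attention size

# Learnable parameters (randomly initialized here only for illustration)
W_a = np.random.randn(a, r) * 0.01
U_a = np.random.randn(a, m) * 0.01
b_a = np.zeros(a)
w   = np.random.randn(a) * 0.01

def attend(V, h_prev):
    """One attention step: V is (N, m) frame features, h_prev is the (r,) previous hidden state."""
    # e_i^{(t)} = w^T tanh(W_a h^{(t-1)} + U_a v_i + b_a), computed for all i at once
    e = np.tanh(V @ U_a.T + W_a @ h_prev + b_a) @ w      # shape (N,)
    alpha = np.exp(e - e.max())
    alpha /= alpha.sum()                                 # softmax over the N frames
    phi = alpha @ V                                      # shape (m,), the weighted feature
    return phi, alpha
```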
Another detail is that we share these learnable parameters across different \(i\) (features) and different \(t\) (time steps), just like what an RNN does. (If you know little about RNNs, I would highly recommend this series of tutorials.) The mechanism can be implemented inside the RNN unit so that a unit like the LSTM can take a sequence as input. And since the equations are all differentiable, we can learn the parameters using backpropagation with any optimization method!
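Since the same \(w, W_a, U_a, b_a\) are reused at every decoding step, plugging the attend function from the sketch above into a (very simplified) decoding loop looks roughly like this; rnn_step here is only a placeholder for whatever LSTM/GRU cell you actually use:

```python
V = np.random.randn(N, m)            # CNN features for one video
h = np.zeros(r)                      # h^{(0)} is the zero vector
T = 10                               # number of decoding time steps

W_x = np.random.randn(r, m) * 0.01   # toy input-to-hidden weights for the placeholder cell

def rnn_step(x, h):
    # Placeholder for a real LSTM/GRU cell; only the interface matters here.
    return np.tanh(W_x @ x + h)

for t in range(1, T + 1):
    phi, alpha = attend(V, h)        # same attention parameters at every t
    h = rnn_step(phi, h)             # the decoder sees a different phi^{(t)} each step
```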
This mechanism can definitely be used in other problems, as long as you can transform the problem into a sequence representation. Theoretically, it should perform at least as well as feeding in a single item of the sequence, because that case can be described as a one-hot weight (i.e. a weight vector over the sequence in which a single position is 1 and all others are 0). However, if the attention matrix size \(a\) is too small, intuitively, the model won't be able to express all the weightings needed to minimize the cost function, so the choice of \(a\) is important. I haven't found a best practice on how to choose it; currently I just set it equal to \(r\) (the size of the RNN hidden state). Tell me if you have some experience!
Here is an easy implementation of this temporal attention mechanism in Torch, also shown as follows:
Notice that there is also a `seq_per_video` parameter, for the case where there are multiple ground-truth sentences for one video clip. This causes the hidden state to have size `seq_per_video * rnn_size`, so we need to replicate \(U_av_i\) to match the size of \(W_ah^{(t-1)}\).
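As a rough illustration of that replication (reusing the names from the NumPy sketch above, with `seq_per_video` chosen arbitrarily; this is not the Torch code itself), the expansion of \(U_av_i\) is just a broadcast:

```python
seq_per_video = 5
H_prev = np.zeros((seq_per_video, r))            # one previous hidden state per sentence

Uv = V @ U_a.T                                   # (N, a), identical for every sentence of the video
Uv_rep = np.broadcast_to(Uv[None], (seq_per_video, N, a))   # replicate across sentences
Wh = H_prev @ W_a.T                              # (seq_per_video, a)

e = np.tanh(Uv_rep + Wh[:, None, :] + b_a) @ w   # (seq_per_video, N)
alpha = np.exp(e - e.max(axis=1, keepdims=True))
alpha /= alpha.sum(axis=1, keepdims=True)        # softmax over frames, per sentence
Phi = alpha @ V                                  # (seq_per_video, m) weighted features
```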
I'm doing experiments on some models using the attention mechanism. I may update my blog with a comparison of different models in a coming post (as long as I cure my procrastination :-).
The author would like to thank Ryan Szeto and Daiqi Gao for their careful review of this post.