In "Large-scale Video Classification with Convolutional Neural Networks" by
Karpathy et al., can anyone explain the architecture, the inputs, and the training process of
Slow Fusion in more detail?
Here is one more reference, but it only restates what the paper already explains.
I cannot understand how the information is merged at the different stages, or how the images below should be interpreted.