I am currently developing an LSTM RNN for time series forecasting. I understand that it is common to clip the gradients of the RNN when they exceed a certain threshold. However, I am not entirely clear on whether this includes the output layer.
If we call the hidden layer of the RNN h, then the output is sigmoid(connected_weight * h + bias). I know that the gradients for the weights used to compute the hidden layer are clipped, but is the same true for the output layer? In other words, are the gradients for connected_weight also clipped during gradient clipping?
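To make the question concrete, here is a minimal sketch of the two options as I understand them, assuming clipping by the joint L2 norm of the gradients. The helper clip_by_global_norm, the name recurrent_weight, the threshold, and the gradient values are all made up for illustration and are not taken from any particular library.

```python
import math

def clip_by_global_norm(grads, threshold):
    """Rescale all gradients by one common factor if their joint L2 norm exceeds threshold."""
    total_norm = math.sqrt(sum(g * g for values in grads.values() for g in values))
    if total_norm <= threshold:
        # Norm is already within bounds: return an unchanged copy.
        return {name: list(values) for name, values in grads.items()}
    scale = threshold / total_norm
    return {name: [g * scale for g in values] for name, values in grads.items()}

# Hypothetical gradients, flattened to one list per parameter.
grads = {
    "recurrent_weight": [3.0, 4.0],  # weights used to compute the hidden layer h
    "connected_weight": [12.0],      # output-layer weights
    "bias": [0.0],                   # output-layer bias
}

# Option A: the output layer is included, so every parameter is rescaled together.
clip_all = clip_by_global_norm(grads, threshold=1.0)

# Option B: only the recurrent weights are clipped; output-layer gradients stay as they are.
clip_recurrent = dict(grads)
clip_recurrent.update(
    clip_by_global_norm({"recurrent_weight": grads["recurrent_weight"]}, threshold=1.0)
)
```

Under option A the gradient for connected_weight shrinks along with everything else, while under option B it is left untouched. My question is which of these two is the usual practice.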