I’m currently reading the book “Grokking Deep Learning” by Andrew W. Trask, and I’m having trouble understanding the gradient calculation portion of the code, where he computes the deltas for the weights.
The code is here:
```python
import sys, numpy as np
from keras.datasets import mnist

(x_train, y_train), (x_test, y_test) = mnist.load_data()

images = x_train[0:1000].reshape(1000, 28*28) / 255
labels = y_train[0:1000]

one_hot_labels = np.zeros((len(labels), 10))
for i, l in enumerate(labels):
    one_hot_labels[i][l] = 1
labels = one_hot_labels

test_images = x_test.reshape(len(x_test), 28*28) / 255
test_labels = np.zeros((len(y_test), 10))
for i, l in enumerate(y_test):
    test_labels[i][l] = 1

np.random.seed(1)
relu = lambda x: (x > 0) * x
relu2derive = lambda x: x >= 0

alpha, iterations, hidden_size, pixels_per_image, num_labels = (0.005, 350, 40, 784, 10)

weights_0_1 = 0.2 * np.random.random((pixels_per_image, hidden_size)) - 0.1
weights_1_2 = 0.2 * np.random.random((hidden_size, num_labels)) - 0.1

for j in range(iterations):
    error, correct_cnt = (0.0, 0)
    for i in range(len(images)):
        layer_0 = images[i:i+1]
        layer_1 = relu(np.dot(layer_0, weights_0_1))
        layer_2 = np.dot(layer_1, weights_1_2)

        error += np.sum((labels[i:i+1] - layer_2) ** 2)
        correct_cnt += int(np.argmax(layer_2) == np.argmax(labels[i:i+1]))

        layer_2_delta = (labels[i:i+1] - layer_2)
        layer_1_delta = layer_2_delta.dot(weights_1_2.T) * relu2derive(layer_1)  # <-- how did he come up with this formula?

        weights_1_2 += alpha * layer_1.T.dot(layer_2_delta)
        weights_0_1 += alpha * layer_0.T.dot(layer_1_delta)
```
Can someone show me the math behind these formulas? I’ve been looking at this for weeks and I can’t make sense of it, because on paper, the gradient for the weights is just the partial derivative of the cost function with respect to the weights in question. I can work that out on paper, but it doesn’t give me the same formula he’s using above.
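For reference, here is the on-paper derivation I keep getting (my own notation, with $\odot$ for the elementwise product), which is where I fail to match his code:

$$
\begin{aligned}
\text{layer\_1} &= \mathrm{relu}(\text{layer\_0}\,W_{0,1}), \qquad
\text{layer\_2} = \text{layer\_1}\,W_{1,2}, \qquad
E = \lVert y - \text{layer\_2}\rVert^{2}, \\[4pt]
\frac{\partial E}{\partial W_{1,2}} &= \text{layer\_1}^{T}\big(-2\,(y - \text{layer\_2})\big), \\[4pt]
\frac{\partial E}{\partial W_{0,1}} &= \text{layer\_0}^{T}\Big[\big(-2\,(y - \text{layer\_2})\big)\,W_{1,2}^{T} \odot \mathrm{relu}'(\text{layer\_1})\Big].
\end{aligned}
$$

His `layer_2_delta = (labels - layer_2)` looks like $-\tfrac{1}{2}\,\partial E/\partial\,\text{layer\_2}$, so I can see how the minus sign gets absorbed by using `+=` in the update, but I don’t see where the factor of 2 went, or how the $W_{1,2}^{T}$ and $\mathrm{relu}'$ terms in `layer_1_delta` fall out of the chain rule.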