I’m currently reading the book “Grokking Deep Learning” by Andrew W. Trask, and I’m having trouble understanding the gradient calculation in the code, where he computes the deltas for the weights.

Here is the code:

```
import sys, numpy as np
from keras.datasets import mnist

# Load MNIST; train on the first 1000 images, flattened and scaled to [0, 1]
(x_train, y_train), (x_test, y_test) = mnist.load_data()
images = x_train[0:1000].reshape(1000, 28*28) / 255
labels = y_train[0:1000]

# One-hot encode the training labels
one_hot_labels = np.zeros((len(labels), 10))
for i, l in enumerate(labels):
    one_hot_labels[i][l] = 1
labels = one_hot_labels

# Same preprocessing for the test set
test_images = x_test.reshape(len(x_test), 28*28) / 255
test_labels = np.zeros((len(y_test), 10))
for i, l in enumerate(y_test):
    test_labels[i][l] = 1

np.random.seed(1)
relu = lambda x: (x > 0) * x
relu2derive = lambda x: x >= 0
alpha, iterations, hidden_size, pixels_per_image, num_labels = (0.005, 350, 40, 784, 10)

# Weights initialised uniformly in [-0.1, 0.1)
weights_0_1 = 0.2*np.random.random((pixels_per_image, hidden_size)) - 0.1
weights_1_2 = 0.2*np.random.random((hidden_size, num_labels)) - 0.1

for j in range(iterations):
    error, correct_cnt = (0.0, 0)
    for i in range(len(images)):
        # Forward pass for a single image (shape 1 x 784)
        layer_0 = images[i:i+1]
        layer_1 = relu(np.dot(layer_0, weights_0_1))
        layer_2 = np.dot(layer_1, weights_1_2)

        error += np.sum((labels[i:i+1] - layer_2) ** 2)
        correct_cnt += int(np.argmax(layer_2) == np.argmax(labels[i:i+1]))

        # Backward pass
        layer_2_delta = (labels[i:i+1] - layer_2)
        layer_1_delta = layer_2_delta.dot(weights_1_2.T) * relu2derive(layer_1)  # <-- how did he come up with this formula?

        # Weight updates
        weights_1_2 += alpha * layer_1.T.dot(layer_2_delta)
        weights_0_1 += alpha * layer_0.T.dot(layer_1_delta)
```

Can someone show me the math behind these formulas? I’ve been looking at this for weeks and I can’t make sense of it. On paper, the gradient for the weights is just the partial derivative of the cost function with respect to the weights in question. I can work that out, but it doesn’t give me the same formula he’s using above.
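One way to narrow down where a paper derivation and the code diverge is a finite-difference gradient check. Below is a minimal sketch, not the book's code: the tiny 3-2-2 network, its hand-picked values, and the helper names `loss`, `numerical_grad`, and `relu_deriv` are all hypothetical. It also assumes a strict `x > 0` for the ReLU derivative instead of the `x >= 0` in the listing, because `x >= 0` is true for every post-ReLU value and would never mask anything.

```python
import numpy as np

relu = lambda x: (x > 0) * x
relu_deriv = lambda x: x > 0  # assumption: strict > 0, unlike the listing's >= 0

# Hypothetical toy network: 3 inputs, 2 hidden units, 2 outputs.
layer_0 = np.array([[1.0, 0.5, 0.2]])
label = np.array([[1.0, 0.0]])
weights_0_1 = np.array([[0.1, -0.2],
                        [0.2, -0.1],
                        [0.05, 0.1]])
weights_1_2 = np.array([[0.3, -0.1],
                        [0.2, 0.4]])

def loss():
    """Squared error for the single example, using the current weights."""
    layer_1 = relu(layer_0.dot(weights_0_1))
    layer_2 = layer_1.dot(weights_1_2)
    return np.sum((label - layer_2) ** 2)

def numerical_grad(w, eps=1e-5):
    """Central-difference estimate of d(loss)/dw, perturbing w in place."""
    grad = np.zeros_like(w)
    for idx in np.ndindex(*w.shape):
        original = w[idx]
        w[idx] = original + eps
        loss_plus = loss()
        w[idx] = original - eps
        loss_minus = loss()
        w[idx] = original  # restore the weight before moving on
        grad[idx] = (loss_plus - loss_minus) / (2 * eps)
    return grad

# The same delta formulas as in the listing (without alpha).
layer_1 = relu(layer_0.dot(weights_0_1))
layer_2 = layer_1.dot(weights_1_2)
layer_2_delta = label - layer_2
layer_1_delta = layer_2_delta.dot(weights_1_2.T) * relu_deriv(layer_1)
update_1_2 = layer_1.T.dot(layer_2_delta)
update_0_1 = layer_0.T.dot(layer_1_delta)

grad_1_2 = numerical_grad(weights_1_2)
grad_0_1 = numerical_grad(weights_0_1)

# Compare each update with -1/2 times the numerical gradient.
print(np.allclose(update_1_2, -0.5 * grad_1_2, atol=1e-6))
print(np.allclose(update_0_1, -0.5 * grad_0_1, atol=1e-6))
```

If both comparisons hold, the updates in the code are the negative gradient of the squared error scaled by 1/2. That would suggest the difference from a paper derivation is only the sign (the code uses `labels - layer_2` with `+=`) and the factor of 2 (which can be folded into `alpha`), not a different formula.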