output classes = (1, 10)
number of hidden nodes = 50, 20, 30
Activation Function : Sigmoid ($\sigma $)
Loss Function : Cross Entropy
Forward Pass
Layer 1
$$ Z^{[1]} = X \bullet W^{[1]} + b^{[1]} \\ A^{[1]} = \sigma{(Z^{[1]})} $$
On layer 1, input feature size = (1, 400), so $$ X \in \mathbb{R}^{1 \times 400}, \; W^{[1]} \in \mathbb{R}^{400 \times 50}, \; b^{[1]} \in \mathbb{R}^{1 \times 50} $$
$$ Z^{[1]} \in \mathbb{R}^{1 \times 50}, \; A^{[1]} \in \mathbb{R}^{1 \times 50} $$
Layer 2
$$ Z^{[2]} = A^{[1]} \bullet W^{[2]} + b^{[2]} \\ A^{[2]} = \sigma{(Z^{[2]})} $$
On layer 2, input feature = $ A^{[1]} $ $$ A^{[1]} \in \mathbb{R}^{1 \times 50}, \; W^{[2]} \in \mathbb{R}^{50 \times 20}, \; b^{[2]} \in \mathbb{R}^{1 \times 20} $$
$$ Z^{[2]} \in \mathbb{R}^{1 \times 20}, \; A^{[2]} \in \mathbb{R}^{1 \times 20} $$
Layer 3
$$ Z^{[3]} = A^{[2]} \bullet W^{[3]} + b^{[3]} \\ A^{[3]} = \sigma{(Z^{[3]})} $$
On layer 3, input feature = $ A^{[2]} $ $$ A^{[2]} \in \mathbb{R}^{1 \times 20}, \; W^{[3]} \in \mathbb{R}^{20 \times 30}, \; b^{[3]} \in \mathbb{R}^{1 \times 30} $$
$$ Z^{[3]} \in \mathbb{R}^{1 \times 30}, \; A^{[3]} \in \mathbb{R}^{1 \times 30} $$
Layer 4
$$ Z^{[4]} = A^{[3]} \bullet W^{[4]} + b^{[4]} \\ O = softmax{(Z^{[4]})} $$
On layer 4, input feature = $ A^{[3]} $ $$ A^{[3]} \in \mathbb{R}^{1 \times 30}, \; W^{[4]} \in \mathbb{R}^{30 \times 10}, \; b^{[4]} \in \mathbb{R}^{1 \times 10} $$
$$ Z^{[4]} \in \mathbb{R}^{1 \times 10}, \; O \in \mathbb{R}^{1 \times 10} $$
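The forward pass above can be sketched in NumPy to confirm the shapes at each layer. This is a minimal sketch, not the original implementation: the helper names (`init_params`, `forward`), the random initialization, and the random input are all assumptions made for illustration.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def softmax(z):
    # Subtract the row-wise max for numerical stability.
    e = np.exp(z - z.max(axis=1, keepdims=True))
    return e / e.sum(axis=1, keepdims=True)

def init_params(sizes, rng):
    # Hypothetical initialization: small Gaussian weights, zero biases.
    return [(rng.normal(0.0, 0.1, (n_in, n_out)), np.zeros((1, n_out)))
            for n_in, n_out in zip(sizes[:-1], sizes[1:])]

def forward(x, params):
    # Returns [X, A1, A2, A3, O]; sigmoid on hidden layers, softmax on the last.
    acts = [x]
    for i, (w, b) in enumerate(params):
        z = acts[-1] @ w + b
        acts.append(softmax(z) if i == len(params) - 1 else sigmoid(z))
    return acts

rng = np.random.default_rng(0)
params = init_params([400, 50, 20, 30, 10], rng)
x = rng.random((1, 400))          # one example with 400 features
acts = forward(x, params)
shapes = [a.shape for a in acts]
print(shapes)
```

The printed shapes should match the dimensions listed for X, A1, A2, A3 and O, and the softmax output sums to 1.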
After softmax
$$ J(\theta) = - \frac{1}{m} \sum_{i=1}^{m} \sum_{y} P_{data}(y|x^{(i)}) \log(P_{model}(y|x^{(i)}; \theta)) $$
$$ J(\theta) = - \frac{1}{m} \sum_{i=1}^{m} \text{one-hot}(y^{(i)})^T \bullet \log(h_\theta(x^{(i)})) $$
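Note : since the label y is given, each class has probability 0 or 1, so the label can be written as a one-hot vector. The model's predicted probability for input $x^{(i)}$ under parameters $\theta$ is written $h_\theta(x^{(i)})$.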
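As a small illustration of the one-hot form of the loss, here is a sketch in NumPy. The function names, the `eps` guard against `log(0)`, and the example probabilities are assumptions for illustration only.

```python
import numpy as np

def one_hot(labels, num_classes):
    # labels: integer class indices, shape (m,)
    out = np.zeros((len(labels), num_classes))
    out[np.arange(len(labels)), labels] = 1.0
    return out

def cross_entropy(probs, labels):
    # probs: (m, K) model probabilities h_theta(x); labels: (m,) class indices
    y = one_hot(labels, probs.shape[1])
    eps = 1e-12  # assumed guard so that log(0) never occurs
    return -np.mean(np.sum(y * np.log(probs + eps), axis=1))

# Made-up predictions for m = 2 examples and K = 3 classes
probs = np.array([[0.7, 0.2, 0.1],
                  [0.1, 0.8, 0.1]])
labels = np.array([0, 1])
loss = cross_entropy(probs, labels)
print(loss)
```

Because the one-hot vector is zero everywhere except the true class, only the log-probability of the true class contributes to each example's term.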
- Backpropagation: the difference between using Binary Cross Entropy and using plain Cross Entropy. Binary Cross Entropy for a single label $ y \in \{0, 1\} $ is $$ J(y, \hat{y}) = -y \bullet \log(\hat{y}) - (1-y) \bullet \log(1 - \hat{y}) $$
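CE derivative: treat the cases y = 1 and y = 0 separately. With $ z_i $ the input to the softmax that produces $ \hat{y}_i $, $$ CE(y, \hat{y}) = -\sum_{i} y_i \bullet \log(\hat{y}_i) \rightarrow \frac{\partial CE(y, \hat{y})}{\partial z_i} = \hat{y}_i - y_i $$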
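To make the relationship between the two losses concrete, the sketch below evaluates both on the same two-class example. The function names and the numbers are illustrative assumptions; the point is only that for two classes, cross entropy over a one-hot label gives the same value as binary cross entropy.

```python
import numpy as np

def binary_cross_entropy(y, y_hat):
    # y in {0, 1}; y_hat is the predicted probability of class 1
    return -(y * np.log(y_hat) + (1 - y) * np.log(1 - y_hat))

def categorical_cross_entropy(y_onehot, probs):
    # y_onehot and probs are vectors over all classes
    return -np.sum(y_onehot * np.log(probs))

y, y_hat = 1.0, 0.8
bce = binary_cross_entropy(y, y_hat)

# Two-class view of the same example: [P(class 0), P(class 1)]
ce = categorical_cross_entropy(np.array([0.0, 1.0]),
                               np.array([1 - y_hat, y_hat]))
print(bce, ce)
```

With more than two classes the binary form no longer applies directly, and the sum over all classes in the cross entropy is needed.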
Backpropagation
Layer 4
Goal : $ \frac{\partial J(\theta)}{\partial W^{[4]}} $
By the chain rule, $$ \frac{\partial J(\theta)}{\partial W^{[4]}} = \frac{\partial J(\theta)}{\partial O} \frac{\partial O}{\partial Z^{[4]}} \frac{\partial Z^{[4]}}{\partial W^{[4]}} $$
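For a single example, $ J = -\sum_k y_k \log O_k $, so $ \frac{\partial J(\theta)}{\partial O_k} = -\frac{y_k}{O_k} $
The derivative of softmax resembles that of the sigmoid: the diagonal of its Jacobian has the same form as the sigmoid derivative, and there are additional off-diagonal terms.
Reference link : https://www.derivative-calculator.net
$ \sigma(z) = \frac{1}{1 + e^{-z}} , \; \frac{\partial \sigma(z)}{\partial z} = \sigma(z)(1 - \sigma(z)) \\ \frac{\partial O_k}{\partial Z^{[4]}_j} = O_k (1 - O_k) \; (j = k), \quad \frac{\partial O_k}{\partial Z^{[4]}_j} = -O_k O_j \; (j \neq k) $
Summing over $ k $ and using $ \sum_k y_k = 1 $: $ \frac{\partial J(\theta)}{\partial Z^{[4]}_j} = \sum_k \left( -\frac{y_k}{O_k} \right) \frac{\partial O_k}{\partial Z^{[4]}_j} = O_j - y_j $, so $ \delta^{[4]} = \frac{\partial J(\theta)}{\partial O} \frac{\partial O}{\partial Z^{[4]}} = O - y \in \mathbb{R}^{1 \times 10} $
$ \frac{\partial Z^{[4]}}{\partial W^{[4]}} = \frac{ \partial (A^{[3]}W^{[4]} + b^{[4]})}{\partial W^{[4]}} = A^{[3]} $
$ \therefore \frac{\partial J(\theta)}{\partial W^{[4]}} = \frac{\partial J(\theta)}{\partial O} \frac{\partial O}{\partial Z^{[4]}} \frac{\partial Z^{[4]}}{\partial W^{[4]}} = \delta^{[4]} \times A^{[3]} $
A transpose arises here so that the shapes match: $ A^{[3]} \rightarrow (A^{[3]})^T $, giving $ \frac{\partial J(\theta)}{\partial W^{[4]}} = (A^{[3]})^T \delta^{[4]} \in \mathbb{R}^{30 \times 10} $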
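A finite-difference check is a simple way to test a layer 4 gradient. The sketch below assumes a softmax output with cross-entropy loss on a single example, in which case the combined gradient with respect to $ Z^{[4]} $ works out to $ O - y $, and compares $ (A^{[3]})^T (O - y) $ with a numerical gradient. The random values and variable names are assumptions for illustration.

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max(axis=1, keepdims=True))
    return e / e.sum(axis=1, keepdims=True)

def loss(a3, w4, b4, y):
    # Cross entropy of softmax(A3 W4 + b4) against a one-hot y
    o = softmax(a3 @ w4 + b4)
    return -np.sum(y * np.log(o))

rng = np.random.default_rng(0)
a3 = rng.random((1, 30))                  # stand-in for A^[3]
w4 = rng.normal(0.0, 0.1, (30, 10))
b4 = np.zeros((1, 10))
y = np.zeros((1, 10))
y[0, 3] = 1.0                             # arbitrary true class

o = softmax(a3 @ w4 + b4)
delta4 = o - y                            # (1, 10)
grad_w4 = a3.T @ delta4                   # (30, 1) @ (1, 10) -> (30, 10)

# Central-difference estimate of dJ/dW4, one entry at a time
eps = 1e-6
num = np.zeros_like(w4)
for i in range(w4.shape[0]):
    for j in range(w4.shape[1]):
        wp = w4.copy(); wp[i, j] += eps
        wm = w4.copy(); wm[i, j] -= eps
        num[i, j] = (loss(a3, wp, b4, y) - loss(a3, wm, b4, y)) / (2 * eps)

max_err = np.abs(num - grad_w4).max()
print(grad_w4.shape, max_err)
```

The transpose on $ A^{[3]} $ is what makes the result the same shape as $ W^{[4]} $, which is required for a gradient-descent update.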
- Analysis of why the transpose arises: link
Layer 3
Goal : $ \frac{\partial J(\theta)}{\partial W^{[3]}} $
By the chain rule, $$ \frac{\partial J(\theta)}{\partial W^{[3]}} = \frac{\partial J(\theta)}{\partial O} \frac{\partial O}{\partial Z^{[4]}} \frac{\partial Z^{[4]}}{\partial A^{[3]}} \frac{\partial A^{[3]}}{\partial Z^{[3]}} \frac{\partial Z^{[3]}}{\partial W^{[3]}} $$
Since $ \frac{\partial J(\theta)}{\partial O} \frac{\partial O}{\partial Z^{[4]}} = \delta^{[4]} $,
$ \frac{\partial J(\theta)}{\partial W^{[3]}} = \delta^{[4]} \; \frac{\partial Z^{[4]}}{\partial A^{[3]}} \frac{\partial A^{[3]}}{\partial Z^{[3]}} \frac{\partial Z^{[3]}}{\partial W^{[3]}} $
$ \frac{\partial Z^{[4]}}{\partial A^{[3]}} = \frac{\partial (A^{[3]}W^{[4]} + b^{[4]})}{\partial A^{[3]}} = W^{[4]} $
$ \frac{\partial A^{[3]}}{\partial Z^{[3]}} = \frac{\partial (\frac{1}{1 + e^{-Z^{[3]}}})}{\partial Z^{[3]}} = A^{[3]} \bullet (1-A^{[3]}) $
$ \frac{\partial Z^{[3]}}{\partial W^{[3]}} = \frac{\partial(A^{[2]}W^{[3]} + b^{[3]})}{\partial W^{[3]}} = A^{[2]} $
$ \frac{\partial J(\theta)}{\partial W^{[3]}} = \frac{\partial J(\theta)}{\partial O} \frac{\partial O}{\partial Z^{[4]}} \frac{\partial Z^{[4]}}{\partial A^{[3]}} \frac{\partial A^{[3]}}{\partial Z^{[3]}} \frac{\partial Z^{[3]}}{\partial W^{[3]}} $
$ \therefore \frac{\partial J(\theta)}{\partial W^{[3]}} = \delta^{[4]} \frac{\partial Z^{[4]}}{\partial A^{[3]}} \frac{\partial A^{[3]}}{\partial Z^{[3]}} \frac{\partial Z^{[3]}}{\partial W^{[3]}} = \delta^{[4]} \times W^{[4]} \times A^{[3]} \bullet (1 - A^{[3]}) \times A^{[2]} $
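A transpose arises here so that the shapes match: $ A^{[2]} \rightarrow (A^{[2]})^T, \; W^{[4]} \rightarrow (W^{[4]})^T $. With $ \bullet $ as the element-wise product, $ \delta^{[3]} = (\delta^{[4]} (W^{[4]})^T) \bullet A^{[3]} \bullet (1 - A^{[3]}) \in \mathbb{R}^{1 \times 30} $ and $ \frac{\partial J(\theta)}{\partial W^{[3]}} = (A^{[2]})^T \delta^{[3]} \in \mathbb{R}^{20 \times 30} $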
Layer 2
Goal : $ \frac{\partial J(\theta)}{\partial W^{[2]}} $
By the chain rule, $$ \frac{\partial J(\theta)}{\partial W^{[2]}} = \frac{\partial J(\theta)}{\partial O} \frac{\partial O}{\partial Z^{[4]}} \frac{\partial Z^{[4]}}{\partial A^{[3]}} \frac{\partial A^{[3]}}{\partial Z^{[3]}} \frac{\partial Z^{[3]}}{\partial A^{[2]}} \frac{\partial A^{[2]}}{\partial Z^{[2]}} \frac{\partial Z^{[2]}}{\partial W^{[2]}} $$
Since $ \frac{\partial J(\theta)}{\partial O} \frac{\partial O}{\partial Z^{[4]}} \frac{\partial Z^{[4]}}{\partial A^{[3]}} \frac{\partial A^{[3]}}{\partial Z^{[3]}} = \delta^{[3]} $,
$ \frac{\partial J(\theta)}{\partial W^{[2]}} = \delta^{[3]} \; \frac{\partial Z^{[3]}}{\partial A^{[2]}} \frac{\partial A^{[2]}}{\partial Z^{[2]}} \frac{\partial Z^{[2]}}{\partial W^{[2]}} $
$ \frac{\partial Z^{[3]}}{\partial A^{[2]}} = \frac{\partial (A^{[2]}W^{[3]} + b^{[3]})}{\partial A^{[2]}} = W^{[3]} $
$ \frac{\partial A^{[2]}}{\partial Z^{[2]}} = \frac{\partial (\frac{1}{1 + e^{-Z^{[2]}}})}{\partial Z^{[2]}} = A^{[2]} \bullet (1-A^{[2]}) $
$ \frac{\partial Z^{[2]}}{\partial W^{[2]}} = \frac{\partial(A^{[1]}W^{[2]} + b^{[2]})}{\partial W^{[2]}} = A^{[1]} $
$ \frac{\partial J(\theta)}{\partial W^{[2]}} = \frac{\partial J(\theta)}{\partial O} \frac{\partial O}{\partial Z^{[4]}} \frac{\partial Z^{[4]}}{\partial A^{[3]}} \frac{\partial A^{[3]}}{\partial Z^{[3]}} \frac{\partial Z^{[3]}}{\partial A^{[2]}} \frac{\partial A^{[2]}}{\partial Z^{[2]}} \frac{\partial Z^{[2]}}{\partial W^{[2]}} $
$ \therefore \frac{\partial J(\theta)}{\partial W^{[2]}} = \delta^{[3]} \frac{\partial Z^{[3]}}{\partial A^{[2]}} \frac{\partial A^{[2]}}{\partial Z^{[2]}} \frac{\partial Z^{[2]}}{\partial W^{[2]}} = \delta^{[3]} \times W^{[3]} \times A^{[2]} \bullet (1 - A^{[2]}) \times A^{[1]} $
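A transpose arises here so that the shapes match: $ A^{[1]} \rightarrow (A^{[1]})^T, \; W^{[3]} \rightarrow (W^{[3]})^T $, so $ \delta^{[2]} = (\delta^{[3]} (W^{[3]})^T) \bullet A^{[2]} \bullet (1 - A^{[2]}) \in \mathbb{R}^{1 \times 20} $ and $ \frac{\partial J(\theta)}{\partial W^{[2]}} = (A^{[1]})^T \delta^{[2]} \in \mathbb{R}^{50 \times 20} $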
Layer 1
Goal : $ \frac{\partial J(\theta)}{\partial W^{[1]}} $
By the chain rule, $$ \frac{\partial J(\theta)}{\partial W^{[1]}} = \frac{\partial J(\theta)}{\partial O} \frac{\partial O}{\partial Z^{[4]}} \frac{\partial Z^{[4]}}{\partial A^{[3]}} \frac{\partial A^{[3]}}{\partial Z^{[3]}} \frac{\partial Z^{[3]}}{\partial A^{[2]}} \frac{\partial A^{[2]}}{\partial Z^{[2]}} \frac{\partial Z^{[2]}}{\partial A^{[1]}} \frac{\partial A^{[1]}}{\partial Z^{[1]}} \frac{\partial Z^{[1]}}{\partial W^{[1]}} $$
Since $ \frac{\partial J(\theta)}{\partial O} \frac{\partial O}{\partial Z^{[4]}} \frac{\partial Z^{[4]}}{\partial A^{[3]}} \frac{\partial A^{[3]}}{\partial Z^{[3]}} \frac{\partial Z^{[3]}}{\partial A^{[2]}} \frac{\partial A^{[2]}}{\partial Z^{[2]}} = \delta^{[2]} $,
$ \frac{\partial J(\theta)}{\partial W^{[1]}} = \delta^{[2]} \; \frac{\partial Z^{[2]}}{\partial A^{[1]}} \frac{\partial A^{[1]}}{\partial Z^{[1]}} \frac{\partial Z^{[1]}}{\partial W^{[1]}} $
$ \frac{\partial Z^{[2]}}{\partial A^{[1]}} = \frac{\partial (A^{[1]}W^{[2]} + b^{[2]})}{\partial A^{[1]}} = W^{[2]} $
$ \frac{\partial A^{[1]}}{\partial Z^{[1]}} = \frac{\partial (\frac{1}{1 + e^{-Z^{[1]}}})}{\partial Z^{[1]}} = A^{[1]} \bullet (1-A^{[1]}) $
$ \frac{\partial Z^{[1]}}{\partial W^{[1]}} = \frac{\partial(X \ W^{[1]} + b^{[1]})}{\partial W^{[1]}} = X $
$ \frac{\partial J(\theta)}{\partial W^{[1]}} = \frac{\partial J(\theta)}{\partial O} \frac{\partial O}{\partial Z^{[4]}} \frac{\partial Z^{[4]}}{\partial A^{[3]}} \frac{\partial A^{[3]}}{\partial Z^{[3]}} \frac{\partial Z^{[3]}}{\partial A^{[2]}} \frac{\partial A^{[2]}}{\partial Z^{[2]}} \frac{\partial Z^{[2]}}{\partial A^{[1]}} \frac{\partial A^{[1]}}{\partial Z^{[1]}} \frac{\partial Z^{[1]}}{\partial W^{[1]}} $
$ \therefore \frac{\partial J(\theta)}{\partial W^{[1]}} = \delta^{[2]} \frac{\partial Z^{[2]}}{\partial A^{[1]}} \frac{\partial A^{[1]}}{\partial Z^{[1]}} \frac{\partial Z^{[1]}}{\partial W^{[1]}} = \delta^{[2]} \times W^{[2]} \times A^{[1]} \bullet (1 - A^{[1]}) \times X $
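A transpose arises here so that the shapes match: $ X \rightarrow X^T, \; W^{[2]} \rightarrow (W^{[2]})^T $, so $ \delta^{[1]} = (\delta^{[2]} (W^{[2]})^T) \bullet A^{[1]} \bullet (1 - A^{[1]}) \in \mathbb{R}^{1 \times 50} $ and $ \frac{\partial J(\theta)}{\partial W^{[1]}} = X^T \delta^{[1]} \in \mathbb{R}^{400 \times 50} $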
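The four layer-wise results follow one recurrence: start from the output delta, compute the weight gradient as the transposed layer input times the delta, then push the delta back through the transposed weights and the sigmoid derivative. The sketch below implements that recurrence and spot-checks it against finite differences. It assumes a single example, softmax with cross entropy at the output (so the output delta is $ O - y $), and made-up random parameters; the function names are hypothetical.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def softmax(z):
    e = np.exp(z - z.max(axis=1, keepdims=True))
    return e / e.sum(axis=1, keepdims=True)

def forward(x, params):
    # Returns [X, A1, A2, A3, O]
    acts = [x]
    for i, (w, b) in enumerate(params):
        z = acts[-1] @ w + b
        acts.append(softmax(z) if i == len(params) - 1 else sigmoid(z))
    return acts

def loss(x, y, params):
    return -np.sum(y * np.log(forward(x, params)[-1]))

def backward(x, y, params):
    acts = forward(x, params)
    delta = acts[-1] - y                     # delta for the output layer
    grads = [None] * len(params)
    for l in range(len(params) - 1, -1, -1):
        grads[l] = acts[l].T @ delta         # (layer input)^T times delta
        if l > 0:
            w = params[l][0]
            a = acts[l]                      # sigmoid output of the previous layer
            delta = (delta @ w.T) * a * (1.0 - a)
    return grads

rng = np.random.default_rng(0)
sizes = [400, 50, 20, 30, 10]
params = [(rng.normal(0.0, 0.1, (n_in, n_out)), np.zeros((1, n_out)))
          for n_in, n_out in zip(sizes[:-1], sizes[1:])]
x = rng.random((1, 400))
y = np.zeros((1, 10))
y[0, 7] = 1.0                                # arbitrary true class

grads = backward(x, y, params)

# Spot-check a few random entries of each weight matrix with central differences
eps = 1e-6
max_err = 0.0
for l, (w, b) in enumerate(params):
    for k in range(5):
        i = rng.integers(w.shape[0])
        j = rng.integers(w.shape[1])
        old = w[i, j]
        w[i, j] = old + eps
        lp = loss(x, y, params)
        w[i, j] = old - eps
        lm = loss(x, y, params)
        w[i, j] = old                        # restore the original weight
        max_err = max(max_err, abs((lp - lm) / (2 * eps) - grads[l][i, j]))

print([g.shape for g in grads], max_err)
```

Each gradient has the same shape as its weight matrix, which is the practical reason the transposes appear in the derivation.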