[Module 6] Deep Learning: Training Neural Networks

juooo1117

Hyo__ni 2024. 1. 12. 11:42

Part 2. Training Neural Networks

Training Neural Networks via Gradient Descent

Find the parameters 𝑊 that minimize the loss function: min 𝐿(𝑊). Gradient descent repeats the update 𝑊 ≔ 𝑊 − 𝛼 (𝑑𝐿(𝑊) / 𝑑𝑊), where 𝛼 is the learning rate.

Suppose the loss function is steep vertically but shallow horizontally:

→ Progress is very slow along the flat direction and jitters along the steep one. The updates zigzag back and forth, so the process is very inefficient: it reaches the minimum point of the loss function only after many more iterations than should be necessary.
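The zigzag behaviour can be reproduced on a toy problem. The sketch below is only an illustration: the quadratic loss, its curvatures (1 and 25), the starting point, and the learning rate 0.07 are all assumed values, not taken from the lecture.

```python
def grad(w1, w2, a=1.0, b=25.0):
    """Gradient of the toy loss L(w1, w2) = 0.5 * (a * w1**2 + b * w2**2).

    w1 is the shallow (horizontal) direction, w2 the steep (vertical) one.
    """
    return a * w1, b * w2


def gradient_descent(w1, w2, lr=0.07, steps=20):
    """Apply W := W - lr * dL/dW repeatedly and record every iterate."""
    path = [(w1, w2)]
    for _ in range(steps):
        g1, g2 = grad(w1, w2)
        w1, w2 = w1 - lr * g1, w2 - lr * g2
        path.append((w1, w2))
    return path


path = gradient_descent(w1=10.0, w2=1.0)
for step, (p1, p2) in enumerate(path[:6]):
    # w2 flips sign on every step (jitter); w1 shrinks by only 7% per step.
    print(f"step {step}: w1 = {p1:8.4f}, w2 = {p2:8.4f}")
```

With these assumed values, the steep coordinate overshoots the minimum on every step and changes sign, while the shallow coordinate is still far from zero after 20 steps. A smaller learning rate would remove the jitter but slow the shallow direction even further.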

Backpropagation to Compute Gradients in a Neural Network

- Given an input data item, compute the loss function value via Forward Propagation.

- Afterwards, compute the gradient with respect to each neural network parameter via Backpropagation.

- Finally, update the parameters using the gradient descent algorithm.
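The three steps above can be sketched on a very small network. Everything concrete here is an assumption made for illustration: a scalar input, one sigmoid hidden unit, a linear output, a squared-error loss, and the initial parameter values and learning rate.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def forward(params, x, target):
    """Forward propagation: compute the loss and cache intermediate values."""
    w1, b1, w2, b2 = params
    h = sigmoid(w1 * x + b1)           # hidden activation
    y = w2 * h + b2                    # network output
    loss = 0.5 * (y - target) ** 2     # squared-error loss
    return loss, (h, y)


def backward(params, x, target, cache):
    """Backpropagation: apply the chain rule from the loss back to each parameter."""
    w1, b1, w2, b2 = params
    h, y = cache
    dy = y - target                    # dL/dy
    dw2, db2 = dy * h, dy              # output-layer gradients
    dz = dy * w2 * h * (1.0 - h)       # through the sigmoid: sigma'(z) = h * (1 - h)
    dw1, db1 = dz * x, dz              # hidden-layer gradients
    return [dw1, db1, dw2, db2]


def train_step(params, x, target, lr=0.5):
    """One forward pass, one backward pass, one gradient descent update."""
    loss, cache = forward(params, x, target)
    grads = backward(params, x, target, cache)
    new_params = [p - lr * g for p, g in zip(params, grads)]
    return new_params, loss


params = [0.3, -0.1, 0.5, 0.0]         # assumed initial values of w1, b1, w2, b2
losses = []
for _ in range(50):
    params, loss = train_step(params, x=1.5, target=1.0)
    losses.append(loss)
print(f"first loss = {losses[0]:.4f}, last loss = {losses[-1]:.6f}")
```

In a real network the same pattern applies layer by layer, with vectors and matrices in place of scalars; frameworks automate the backward pass.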

Activation Function

Sigmoid

A single neuron (perceptron) originally produced its output by applying a hard threshold to the linear combination of its inputs. The sigmoid replaces that hard threshold with a smooth function.

- Maps real numbers in (−∞, ∞) into the range (0, 1)

- Gives a probabilistic interpretation

- Historically, the sigmoid activation function gives a nice interpretation of the saturating firing rate of a neuron.

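However, saturated neurons kill the gradients.

→ The local gradient of the sigmoid, σ(x)(1 − σ(x)), is at most 1/4 and approaches 0 when the neuron saturates. It shrinks the gradient at every layer during backpropagation, i.e., it causes the vanishing gradient problem.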
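A quick numeric check makes the saturation visible. This is a minimal sketch; the sample inputs and the 10-layer depth are arbitrary choices, and the weight factors that also appear in a real backward pass are ignored.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def sigmoid_grad(x):
    """Derivative of the sigmoid: sigma(x) * (1 - sigma(x))."""
    s = sigmoid(x)
    return s * (1.0 - s)


for x in (0.0, 2.0, 5.0, 10.0):
    print(f"x = {x:5.1f}  sigmoid = {sigmoid(x):.5f}  gradient = {sigmoid_grad(x):.5f}")

# Even in the best case (every unit at x = 0, local gradient 0.25),
# multiplying the local gradients of 10 stacked sigmoids gives 0.25 ** 10.
best_case_10_layers = sigmoid_grad(0.0) ** 10
print(f"product of 10 best-case local gradients: {best_case_10_layers:.2e}")
```

The gradient is largest at x = 0 and falls off quickly on both sides, so units driven far from zero pass almost no gradient to the layers below them.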

To mitigate the vanishing gradient problem, other activation functions are used:

Tanh

tanh(x) = 2 × sigmoid(2x) − 1 → squashes numbers to the range (−1, 1)

The outputs are centered on 0 (zero-centered, the average is 0), which helps training proceed somewhat faster.

However, it still kills gradients when saturated, i.e., it still causes the vanishing gradient problem.
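The relation between tanh and the sigmoid, and the saturation of tanh, can be checked numerically. This is a small sketch; the sample points are arbitrary, and `tanh_via_sigmoid` and `tanh_grad` are hypothetical helper names introduced here.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def tanh_via_sigmoid(x):
    """tanh written as a rescaled, shifted sigmoid: 2 * sigmoid(2x) - 1."""
    return 2.0 * sigmoid(2.0 * x) - 1.0


def tanh_grad(x):
    """Derivative of tanh: 1 - tanh(x) ** 2."""
    return 1.0 - math.tanh(x) ** 2


for x in (-3.0, -0.5, 0.0, 0.5, 3.0):
    print(
        f"x = {x:5.1f}  tanh = {math.tanh(x):8.5f}  "
        f"via sigmoid = {tanh_via_sigmoid(x):8.5f}  gradient = {tanh_grad(x):.5f}"
    )
```

The outputs are symmetric around 0, but the gradient still drops towards 0 for large |x|, which is the saturation noted above.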

ReLU(Rectified Linear Unit)

𝑓(𝑥) = max(0,𝑥)

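Slope of the function: 1 for 𝑥 > 0 (bypass: the gradient passes through unchanged), 0 for 𝑥 < 0 (gating: the gradient is blocked); at 𝑥 = 0 the slope is set by convention.

- Does not saturate in the positive (+) region

- Very computationally efficient

- Converges much faster than sigmoid/tanh in practice (the benefit is largest when many layers are stacked)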
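The bypass/gating behaviour can be written out directly. In this sketch the upstream gradient value 0.8 and the sample inputs are arbitrary, and returning 0 for the slope at x = 0 is an assumed convention.

```python
def relu(x):
    """ReLU: f(x) = max(0, x)."""
    return max(0.0, x)


def relu_grad(x):
    """Slope of ReLU: 1 for x > 0, 0 otherwise (0 at x == 0 is an assumed convention)."""
    return 1.0 if x > 0 else 0.0


upstream = 0.8  # hypothetical gradient arriving from the layer above
for x in (-2.0, -0.1, 0.5, 100.0):
    downstream = upstream * relu_grad(x)
    print(f"x = {x:6.1f}  relu = {relu(x):6.1f}  gradient passed down = {downstream:.1f}")
```

For any positive input, however large, the upstream gradient is passed through unchanged, which is why ReLU does not saturate in the positive region. For negative inputs it is blocked completely.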

Batch Normalization

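Motivation for Batch Norm → With random initialization, activations can land where the activation function is saturated and the gradients are close to zero. The parameters are then barely updated, so the network is hard to optimize (especially in the saturated region, which the original figure marks in red).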

Batch Norm process

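Consider a batch of activations at some layer, and normalize them so that each dimension is unit Gaussian (zero mean, unit variance).

→ Compute the empirical mean 𝔼[𝑥^(𝑘)] and variance Var[𝑥^(𝑘)] independently for each dimension 𝑘, then normalize: x̂^(𝑘) = (𝑥^(𝑘) − 𝔼[𝑥^(𝑘)]) / √Var[𝑥^(𝑘)]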

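The batch normalization layer is usually inserted after a fully connected layer (the linear combination) and right before the activation function.

BUT, forcing the mean and variance to 0 and 1 discards the original mean and variance, which may carry important information, so this step can throw away information that the neural network had extracted well.

→ Batch Norm includes a step that lets the neural network restore the lost information!

To the values whose mean and variance have been set to 0 and 1, Batch Norm applies the transformation y = ax + b, where a and b are parameters optimized by gradient descent during training. This additional layer is the second step of Batch Norm.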

- Improves gradient flow through the network.

- Reduces the strong dependence on initialization.
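The two steps of Batch Norm (normalize each dimension, then apply y = ax + b) can be sketched in plain Python. This is a training-time forward pass only. The names `scale` and `shift` stand for a and b, and the small constant `eps`, which guards against division by zero, is an assumption that does not appear in the notes above. The sample batch is made up.

```python
import math


def batch_norm(batch, scale, shift, eps=1e-5):
    """Batch Norm forward pass over a list of equally sized activation vectors.

    Step 1: normalize each dimension k to zero mean and unit variance over the batch.
    Step 2: apply the learnable transformation y = scale[k] * x_hat + shift[k].
    """
    n = len(batch)
    dims = len(batch[0])
    out = [[0.0] * dims for _ in range(n)]
    for k in range(dims):
        column = [row[k] for row in batch]
        mean = sum(column) / n
        var = sum((v - mean) ** 2 for v in column) / n
        for i, v in enumerate(column):
            x_hat = (v - mean) / math.sqrt(var + eps)
            out[i][k] = scale[k] * x_hat + shift[k]
    return out


# Two dimensions with very different ranges (made-up values).
batch = [[1.0, 200.0], [2.0, 220.0], [3.0, 240.0], [4.0, 260.0]]
normalized = batch_norm(batch, scale=[1.0, 1.0], shift=[0.0, 0.0])
for row in normalized:
    print([round(v, 4) for v in row])
```

With `scale` set to the original standard deviation and `shift` to the original mean, the second step approximately undoes the first, which is how the network can recover the discarded statistics if training finds that useful.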
