
[Module 6] Deep Learning: Transformer


Part 5. How the Transformer Model Works

The attention module itself can serve as the sequence encoder and the decoder in seq2seq with attention.

In other words, RNNs or CNNs are no longer necessary; attention modules are all we need.

 

Transformer → solves the long-term dependency problem

Scaled Dot-product attention

  -  As d_k gets large, the variance of qᵀk increases. (Since the dot products are taken between higher-dimensional vectors, their variance grows.)

  -  Some values inside the softmax get large. (Because inputs with larger variance are fed into the softmax, the resulting attention weights become concentrated on a single peak.)

  -  The softmax gets very peaked.

  -  Hence its gradient gets smaller. (In terms of training, gradients no longer flow well.)

 

→  Solution: scale by the length (dimension) of the query/key vectors, i.e., divide by √d_k (see the sketch below).

(By dividing each query/key similarity score by the square root of the dimension, we add an extra mechanism that keeps the variance of the softmax inputs constant.)
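As an illustration, here is a minimal NumPy sketch of scaled dot-product attention; the function name, shapes, and toy inputs are my own assumptions, not from the lecture.

```python
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    """Q: (n_q, d_k), K: (n_k, d_k), V: (n_k, d_v)."""
    d_k = Q.shape[-1]
    # Dividing the raw scores q^T k by sqrt(d_k) keeps their variance
    # roughly constant as the dimension grows, so the softmax does not
    # become overly peaked.
    scores = Q @ K.T / np.sqrt(d_k)
    # Row-wise softmax over the key positions gives the attention weights.
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    # The output is a weighted sum of the value vectors.
    return weights @ V

# Toy usage: 4 positions, d_k = d_v = 8.
rng = np.random.default_rng(0)
Q, K, V = (rng.normal(size=(4, 8)) for _ in range(3))
out = scaled_dot_product_attention(Q, K, V)  # shape (4, 8)
```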

 

 

Multi-head Attention

Performing attention with several sets of linear transformations.

  -  The input word vectors can be the queries, keys and values.

  -  In other words, the word vectors themselves select one another.

 

Problem: with a single attention, there is only one way for words to interact with one another.

Multi-head attention maps Q, K, and V into h lower-dimensional spaces via W matrices. Afterwards, attention is applied in each space, and the h outputs are concatenated and piped through a final linear layer.
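A rough sketch of this multi-head computation, reusing the scaled_dot_product_attention function sketched above; the parameter names and shapes are illustrative assumptions.

```python
import numpy as np

def multi_head_attention(X, W_q, W_k, W_v, W_o):
    """X: (n, d_model); W_q, W_k, W_v: lists of h matrices of shape
    (d_model, d_k); W_o: (h * d_v, d_model) output projection."""
    heads = []
    for Wq_i, Wk_i, Wv_i in zip(W_q, W_k, W_v):
        # Map Q, K, V into the i-th lower-dimensional space.
        Q, K, V = X @ Wq_i, X @ Wk_i, X @ Wv_i
        heads.append(scaled_dot_product_attention(Q, K, V))
    # Concatenate the h head outputs and pipe them through a linear layer.
    return np.concatenate(heads, axis=-1) @ W_o
```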


Quadratic Memory Complexity

However, a memory problem arises. (Since every query vector must be dotted with every key vector, computing and storing the attention matrix requires memory proportional to the square of the sequence length.)
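A back-of-the-envelope calculation makes this quadratic cost concrete; the sequence length, head count, and float size below are assumptions chosen only for illustration.

```python
# Each head stores an n x n attention score matrix.
n = 1024                # sequence length (assumed)
h = 8                   # number of heads (assumed)
bytes_per_float = 4     # float32
score_matrix_bytes = h * n * n * bytes_per_float
print(score_matrix_bytes / 2**20, "MiB per layer")  # 32.0 MiB
# Doubling the sequence length to 2048 quadruples this to 128 MiB.
```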

 

Block-Based Model

Each block has two sub-layers

  -  Multi-head attention

  -  Two-layer feed-forward NN (with ReLU)

Each of these two sub-layers also has

  -  A residual connection and layer normalization: LayerNorm(x + sublayer(x)), as in the block sketch below.
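A minimal sketch of one such block under these conventions; the sub-layers are passed in as callables to keep the example short, and layer_norm is sketched in the Layer Normalization section below. This is an illustration of the structure, not the lecture's implementation.

```python
def transformer_block(x, self_attention, feed_forward, layer_norm):
    # Sub-layer 1: multi-head attention, wrapped in a residual connection
    # and layer normalization: LayerNorm(x + sublayer(x)).
    x = layer_norm(x + self_attention(x))
    # Sub-layer 2: two-layer feed-forward NN (with ReLU), wrapped the same way.
    x = layer_norm(x + feed_forward(x))
    return x
```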

Layer Normalization

It consists of two steps, sketched in code below.

  -  Normalization of each word vector to have zero mean and unit variance.

  -  Affine transformation of each sequence vector with learnable parameters.

 

*Affine transformation: a transformation that, given a shape, adds to one dimension a value proportional to its length along another dimension. Because the translation along the x- and y-axes is applied independently, the relative ratios of distances between points are maintained and parallel lines are preserved.
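A minimal NumPy sketch of these two steps; the per-feature gamma and beta parameters and the epsilon value are standard assumptions, not taken from the lecture slides.

```python
import numpy as np

def layer_norm(x, gamma, beta, eps=1e-5):
    """x: (n, d) word vectors; gamma, beta: learnable (d,) parameters."""
    # Step 1: normalize each word vector to zero mean and unit variance.
    mu = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    x_hat = (x - mu) / np.sqrt(var + eps)
    # Step 2: learnable affine transformation (scale and shift per feature).
    return gamma * x_hat + beta
```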

 

Masked Self-attention

  -  Words that have not yet been generated cannot be accessed at inference time.

  -  Masking and renormalizing the softmax output prevents the model from accessing words that have not yet been generated (see the sketch below).
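The lecture phrases this as renormalizing the softmax output; an equivalent and common implementation trick is to set the masked scores to negative infinity before the softmax, as in this sketch (names and shapes are my own assumptions).

```python
import numpy as np

def masked_self_attention(Q, K, V):
    """Causal (decoder-side) attention: position i may attend only to j <= i."""
    n, d_k = Q.shape
    scores = Q @ K.T / np.sqrt(d_k)
    # Mask out future positions so not-yet-generated words cannot be accessed.
    future = np.triu(np.ones((n, n), dtype=bool), k=1)
    scores = np.where(future, -np.inf, scores)
    # The softmax renormalizes the remaining scores so each row sums to one.
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ V
```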

 

Recent Trends

The Transformer model and its self-attention block have become a general-purpose sequence (or set) encoder in recent NLP applications as well as in other areas.

Training deeply stacked Transformer models via a self-supervised learning framework has significantly advanced various NLP tasks via transfer learning, e.g., BERT, GPT-2, GPT-3, XLNet, ALBERT, RoBERTa, Reformer, T5, ...

Other applications are rapidly adopting the self-attention architecture and self-supervised learning settings, e.g., computer vision, recommender systems, drug discovery, and so on.

As for natural language generation, self-attention models still require greedy decoding of words one at a time.