
[Module 6] Deep Learning: Transformer


Part 5. How the Transformer Model Works

The attention module itself can serve as the sequence encoder and the decoder in seq2seq with attention.

In other words, RNNs or CNNs are no longer necessary; attention modules are all we need.

 

Transformer → solves the long-term dependency problem

Scaled Dot-product attention

  -  As d_k gets large, the variance of qᵀk increases. (Since the dot products are taken between higher-dimensional vectors, their variance grows.)

  -  Some values inside the softmax get large. (Because inputs with larger variance are fed into the softmax, the resulting attention weights become concentrated on a single peak.)

  -  The softmax gets very peaked.

  -  Hence its gradient gets smaller. (In terms of training, gradients no longer flow well.)

 

→  Solution: scale by the length (dimension) of the query/key vectors, i.e., divide by √d_k (see the sketch below).

(By dividing each query/key similarity score by the square root of the dimension, we add an extra mechanism that keeps the variance of the softmax inputs constant.)
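As an illustration, here is a minimal NumPy sketch of scaled dot-product attention; the function name, shapes, and toy inputs are my own assumptions, not from the lecture.

```python
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    """Q: (n_q, d_k), K: (n_k, d_k), V: (n_k, d_v)."""
    d_k = Q.shape[-1]
    # Dividing the raw scores q^T k by sqrt(d_k) keeps their variance
    # roughly constant as the dimension grows, so the softmax does not
    # become overly peaked.
    scores = Q @ K.T / np.sqrt(d_k)
    # Row-wise softmax over the key positions gives the attention weights.
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    # The output is a weighted sum of the value vectors.
    return weights @ V

# Toy usage: 4 positions, d_k = d_v = 8.
rng = np.random.default_rng(0)
Q, K, V = (rng.normal(size=(4, 8)) for _ in range(3))
out = scaled_dot_product_attention(Q, K, V)  # shape (4, 8)
```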

 

 

Multi-head Attention

Performing attention with several sets of linear transformations.

  -  The input word vectors can be the queries, keys and values.

  -  In other words, the word vectors themselves select one another.

 

Problem: with a single attention, there is only one way for words to interact with one another.

Multi-head attention maps Q, K, and V into h lower-dimensional spaces via W matrices. Afterwards, attention is applied in each space, and the h outputs are concatenated and piped through a final linear layer.
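A rough sketch of this multi-head computation, reusing the scaled_dot_product_attention function sketched above; the parameter names and shapes are illustrative assumptions.

```python
import numpy as np

def multi_head_attention(X, W_q, W_k, W_v, W_o):
    """X: (n, d_model); W_q, W_k, W_v: lists of h matrices of shape
    (d_model, d_k); W_o: (h * d_v, d_model) output projection."""
    heads = []
    for Wq_i, Wk_i, Wv_i in zip(W_q, W_k, W_v):
        # Map Q, K, V into the i-th lower-dimensional space.
        Q, K, V = X @ Wq_i, X @ Wk_i, X @ Wv_i
        heads.append(scaled_dot_product_attention(Q, K, V))
    # Concatenate the h head outputs and pipe them through a linear layer.
    return np.concatenate(heads, axis=-1) @ W_o
```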


Quadratic Memory Complexity

However, a memory problem arises. (Since every query vector must be dotted with every key vector, computing and storing the attention matrix requires memory proportional to the square of the sequence length.)
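A back-of-the-envelope calculation makes this quadratic cost concrete; the sequence length, head count, and float size below are assumptions chosen only for illustration.

```python
# Each head stores an n x n attention score matrix.
n = 1024                # sequence length (assumed)
h = 8                   # number of heads (assumed)
bytes_per_float = 4     # float32
score_matrix_bytes = h * n * n * bytes_per_float
print(score_matrix_bytes / 2**20, "MiB per layer")  # 32.0 MiB
# Doubling the sequence length to 2048 quadruples this to 128 MiB.
```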

 

Block-Based Model

Each block has two sub-layers

  -  Multi-head attention

  -  Two-layer feed-forward NN (with ReLU)

Each of these two sub-layers also has

  -  A residual connection and layer normalization: LayerNorm(x + sublayer(x)), as in the block sketch below.
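A minimal sketch of one such block under these conventions; the sub-layers are passed in as callables to keep the example short, and layer_norm is sketched in the Layer Normalization section below. This is an illustration of the structure, not the lecture's implementation.

```python
def transformer_block(x, self_attention, feed_forward, layer_norm):
    # Sub-layer 1: multi-head attention, wrapped in a residual connection
    # and layer normalization: LayerNorm(x + sublayer(x)).
    x = layer_norm(x + self_attention(x))
    # Sub-layer 2: two-layer feed-forward NN (with ReLU), wrapped the same way.
    x = layer_norm(x + feed_forward(x))
    return x
```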

Layer Normalization

It consists of two steps, sketched in code below.

  -  Normalization of each word vector to have zero mean and unit variance.

  -  Affine transformation of each sequence vector with learnable parameters.

 

*Affine transformation: a transformation that, given a shape, adds to one dimension a value proportional to its length along another dimension. Because the translation along the x- and y-axes is applied independently, the relative ratios of distances between points are maintained and parallel lines are preserved.
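A minimal NumPy sketch of these two steps; the per-feature gamma and beta parameters and the epsilon value are standard assumptions, not taken from the lecture slides.

```python
import numpy as np

def layer_norm(x, gamma, beta, eps=1e-5):
    """x: (n, d) word vectors; gamma, beta: learnable (d,) parameters."""
    # Step 1: normalize each word vector to zero mean and unit variance.
    mu = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    x_hat = (x - mu) / np.sqrt(var + eps)
    # Step 2: learnable affine transformation (scale and shift per feature).
    return gamma * x_hat + beta
```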

 

Masked Self-attention

  -  Words that have not yet been generated cannot be accessed at inference time.

  -  Masking and renormalizing the softmax output prevents the model from accessing words that have not yet been generated (see the sketch below).
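The lecture phrases this as renormalizing the softmax output; an equivalent and common implementation trick is to set the masked scores to negative infinity before the softmax, as in this sketch (names and shapes are my own assumptions).

```python
import numpy as np

def masked_self_attention(Q, K, V):
    """Causal (decoder-side) attention: position i may attend only to j <= i."""
    n, d_k = Q.shape
    scores = Q @ K.T / np.sqrt(d_k)
    # Mask out future positions so not-yet-generated words cannot be accessed.
    future = np.triu(np.ones((n, n), dtype=bool), k=1)
    scores = np.where(future, -np.inf, scores)
    # The softmax renormalizes the remaining scores so each row sums to one.
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ V
```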

 

Recent Trends

The Transformer model and its self-attention block have become a general-purpose sequence (or set) encoder in recent NLP applications as well as in other areas.

Training deeply stacked Transformer models via a self-supervised learning framework has significantly advanced various NLP tasks via transfer learning, e.g., BERT, GPT-2, GPT-3, XLNet, ALBERT, RoBERTa, Reformer, T5, ...

Other applications are rapidly adopting the self-attention architecture and self-supervised learning settings, e.g., computer vision, recommender systems, drug discovery, and so on.

As for natural language generation, self-attention models still require greedy decoding of words one at a time.