Chapter 3: Transformer Architecture Deep Dive
3.1 Transformer Design Principles
3.1.1 Core Concepts of Self-Attention Mechanism
Self-Attention is the core innovation of Transformer, fundamentally changing the way sequence modeling is performed.
Problems with Traditional Sequence Modeling:
- Sequential Dependency: RNN/LSTM must process sequentially, cannot be parallelized
- Long-range Dependency Decay: Information gradually lost in long sequences
- Computational Bottleneck: Hidden states become bottlenecks for information transfer
Self-Attention Solutions:
- Direct Relationship Modeling: Any two positions can directly compute correlation
- Parallel Computation: All positions can be processed simultaneously, greatly improving efficiency
- Dynamic Weights: Dynamically assign attention weights based on content
Mathematical Definition of Self-Attention:
Attention(Q, K, V) = softmax(QK^T / √d_k) V
Where:
- Q (Query): Query matrix, representing “what I want to attend to”
- K (Key): Key matrix, representing “what information I can provide”
- V (Value): Value matrix, representing “what content I actually contain”
- d_k: Dimension of the key vectors; dividing by √d_k keeps the dot products from growing so large that softmax saturates and gradients vanish
Detailed Computation Process:
1. Linear Transformation: Input X is projected by three different weight matrices to obtain Q, K, V
Q = XW_Q, K = XW_K, V = XW_V
2. Similarity Computation: Calculate the similarity between each query and all keys
scores = QK^T / √d_k
3. Normalization: Use softmax to convert the similarities into a probability distribution
weights = softmax(scores)
4. Weighted Sum: Compute the weighted average of the values using these weights
output = weights × V
(A minimal code sketch of these steps follows.)
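A minimal NumPy sketch of these four steps; the sequence length, d_model, and random weight matrices below are purely illustrative:
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)  # subtract max for numerical stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(X, W_Q, W_K, W_V):
    # Step 1: linear transformations produce queries, keys, values
    Q, K, V = X @ W_Q, X @ W_K, X @ W_V
    d_k = K.shape[-1]
    # Step 2: similarity scores, scaled by sqrt(d_k)
    scores = Q @ K.T / np.sqrt(d_k)
    # Step 3: softmax turns each row of scores into a probability distribution
    weights = softmax(scores)
    # Step 4: weighted sum of the value vectors
    return weights @ V, weights

# Toy example (sizes are arbitrary): 4 tokens, d_model = 8
rng = np.random.default_rng(0)
X = rng.normal(size=(4, 8))
W_Q, W_K, W_V = (rng.normal(size=(8, 8)) for _ in range(3))
output, attn_weights = self_attention(X, W_Q, W_K, W_V)
print(output.shape, attn_weights.shape)  # (4, 8) (4, 4)
The attention weights form an n × n matrix, which is where the O(n²) cost discussed next comes from.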
Advantages of Self-Attention:
- Global Receptive Field: Each position can directly access any position in the sequence
- Computational Complexity: For sequence length n, self-attention costs O(n²·d) per layer versus O(n·d²) for an RNN; it is quadratic in n but fully parallelizable, whereas an RNN requires n sequential steps
- Interpretability: Attention weights provide intuitive interpretability
- Position Independence: Breaks the mandatory sequential constraint of positions
3.1.2 Advantages of Parallel Processing
Transformer’s parallelization capability is a major advantage over RNN/LSTM.
Serial Computation Limitations of RNN:
h1 = f(x1, h0)
h2 = f(x2, h1) # Must wait for h1 computation to complete
h3 = f(x3, h2) # Must wait for h2 computation to complete
...
Parallel Computation of Transformer:
# All positions can be computed simultaneously
output1, output2, ..., outputn = Attention(Q, K, V)
Specific Advantages of Parallelization:
- Training Acceleration: Can fully utilize GPU’s parallel computing capabilities
- No Hidden-State Chain: Information does not have to be carried step by step through recurrent hidden states as in an RNN
- Gradient Flow: Gradients from all positions can backpropagate directly, avoiding gradient vanishing
- Hardware Friendly: Matrix operations are more suitable for modern GPU architectures
Computational Efficiency Comparison:
- RNN Training Time: O(n) sequential steps for a length-n sequence (strictly serial)
- Transformer Training Time: O(1) sequential depth per layer (all positions in parallel, limited mainly by memory)
- Actual Speedup: Can achieve 10-100x training acceleration on modern GPUs
3.1.3 Encoder-Decoder Architecture
The original Transformer adopts the classic Encoder-Decoder architecture, providing a flexible framework for different tasks.
Overall Architecture Overview:
Input → Encoder → Context Representation → Decoder → Output
Role of Encoder:
- Feature Extraction: Encode input sequences into high-dimensional representations
- Context Modeling: Capture dependencies within input sequences
- Multi-layer Stacking: Progressively abstract features through multiple encoder layers
Role of Decoder:
- Sequence Generation: Generate target sequences based on encoder outputs
- Autoregressive Generation: Each generation step depends on previously generated content
- Conditional Generation: Combine encoder information for conditional generation
Information Flow in Encoder-Decoder:
- Encoding Phase: Encoder processes complete input sequence
- Interaction Phase: Decoder accesses Encoder outputs through Cross-Attention
- Decoding Phase: Decoder autoregressively generates output sequence
Application Scenarios:
- Machine Translation: Source language → Target language
- Text Summarization: Long text → Summary
- Question Answering: Question + Document → Answer
- Code Generation: Natural language → Code
3.2 Key Component Analysis
3.2.1 Multi-Head Attention
Multi-Head Attention is an enhanced version of Self-Attention that learns different types of relationships through multiple “heads” in parallel.
Limitations of Single-Head Attention:
- Limited Expressiveness: Single attention mechanism may only capture one type of relationship
- Information Bottleneck: All information passes through the same attention channel
Multi-Head Design Philosophy:
- Parallel Multi-heads: Use h different attention heads working simultaneously
- Division of Labor: Each head can focus on learning different types of dependency relationships
- Information Fusion: Concatenate outputs from multiple heads and apply linear transformation
Mathematical Representation:
MultiHead(Q, K, V) = Concat(head_1, head_2, ..., head_h) W_O
Where: head_i = Attention(Q W_i^Q, K W_i^K, V W_i^V)
Specific Computation Steps:
1. Projection and Division: Project Q, K, V into h subspaces
Q_i = Q W_i^Q, K_i = K W_i^K, V_i = V W_i^V (dimensions: d_model → d_model/h per head)
2. Parallel Computation: Each head independently computes attention
head_i = Attention(Q_i, K_i, V_i)
3. Concatenation and Fusion: Concatenate the outputs of all heads
MultiHead = Concat(head_1, ..., head_h)
4. Output Projection: Obtain the final output through a linear layer
Output = MultiHead × W_O
(A compact code sketch of these steps follows.)
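A compact NumPy sketch of these steps; the head count, dimensions, and random weights are illustrative, and the softmax helper mirrors the one in the single-head sketch above:
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def multi_head_attention(X, W_Q, W_K, W_V, W_O, h):
    n, d_model = X.shape
    d_k = d_model // h
    Q, K, V = X @ W_Q, X @ W_K, X @ W_V                 # (n, d_model)
    # 1. Projection and division: split the last dimension into h heads
    def split(M):
        return M.reshape(n, h, d_k).transpose(1, 0, 2)  # (h, n, d_k)
    Q, K, V = split(Q), split(K), split(V)
    # 2. Parallel computation: every head attends independently
    scores = Q @ K.transpose(0, 2, 1) / np.sqrt(d_k)    # (h, n, n)
    heads = softmax(scores, axis=-1) @ V                # (h, n, d_k)
    # 3. Concatenation: stitch the heads back together into d_model
    concat = heads.transpose(1, 0, 2).reshape(n, d_model)
    # 4. Output projection
    return concat @ W_O

rng = np.random.default_rng(0)
n, d_model, h = 4, 16, 4
X = rng.normal(size=(n, d_model))
W_Q, W_K, W_V, W_O = (rng.normal(size=(d_model, d_model)) for _ in range(4))
print(multi_head_attention(X, W_Q, W_K, W_V, W_O, h).shape)  # (4, 16)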
Examples of Different Head Specializations:
- Head 1: Focus on syntactic relationships (subject-verb-object structure)
- Head 2: Focus on semantic relationships (synonyms, antonyms)
- Head 3: Focus on long-range dependencies (pronoun references)
- Head 4: Focus on local patterns (phrase structures)
Hyperparameter Selection:
- Number of Heads (h): Usually 8 or 16, too many leads to parameter redundancy
- Head Dimension (dk): Usually set to d_model/h, maintaining parameter balance
- Rule of Thumb: h × dk = d_model, ensuring parameter efficiency
3.2.2 Position Encoding
Since Self-Attention is inherently position-agnostic, additional mechanisms are needed to model positional information.
Importance of Positional Information:
- Word Order Sensitivity: Natural language meaning heavily depends on word order
- Syntactic Structure: Positional information is crucial for understanding grammatical relationships
- Temporal Modeling: Many tasks require understanding the temporal order of events
Absolute Position Encoding (Original Transformer): Uses trigonometric functions to encode absolute positions:
PE(pos, 2i) = sin(pos/10000^(2i/d_model))
PE(pos, 2i+1) = cos(pos/10000^(2i/d_model))
(A code sketch of this encoding appears after the list below.)
Design Advantages:
- Fixed Pattern: No learning required, reduces parameter count
- Extrapolation Capability: Can handle sequence lengths not seen during training
- Relative Positions: Encodings of different positions have fixed linear relationships
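A short NumPy sketch of the sinusoidal encoding above, assuming an even d_model; max_len and d_model are arbitrary illustrative values:
import numpy as np

def sinusoidal_position_encoding(max_len, d_model):
    pos = np.arange(max_len)[:, None]                  # (max_len, 1)
    i = np.arange(d_model // 2)[None, :]               # (1, d_model/2)
    angle = pos / np.power(10000, 2 * i / d_model)     # pos / 10000^(2i/d_model)
    pe = np.zeros((max_len, d_model))
    pe[:, 0::2] = np.sin(angle)                        # even dimensions: sine
    pe[:, 1::2] = np.cos(angle)                        # odd dimensions: cosine
    return pe

pe = sinusoidal_position_encoding(max_len=50, d_model=16)
print(pe.shape)   # (50, 16)
# Additive injection (original method): token embeddings + position encodings
# embeddings = token_embeddings + pe[:seq_len]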
Learned Position Encoding:
- Learnable Parameters: Position encodings as trainable parameters
- Strong Adaptability: Can learn task-specific positional patterns
- Limitation: Difficult to extrapolate to longer sequences
Relative Position Encoding: Modern variants increasingly use relative position encoding:
- Relative Relationships: Focus on relative distance between position i and position j
- Rotary Position Embedding (RoPE): Efficient encoding method used by LLaMA, GPT-NeoX, and many other recent models
- ALiBi: Linear bias position encoding method
Position Encoding Injection Methods:
- Additive Injection: PE + Token Embedding (original method)
- Concatenative Injection: Concat(Token Embedding, PE)
- Multiplicative Injection: Token Embedding ⊗ PE
3.2.3 Feed-Forward Networks
Each Transformer layer contains a feed-forward network responsible for non-linear transformations and feature extraction.
FFN Structure:
FFN(x) = max(0, xW_1 + b_1)W_2 + b_2
I.e.: Linear layer → ReLU activation → Linear layer (a minimal sketch follows the implementation notes below)
Specific Implementation:
- Dimension Expansion: d_model → d_ff (usually d_ff = 4 × d_model)
- Activation Function: ReLU or GELU introduces non-linearity
- Dimension Compression: d_ff → d_model
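A minimal NumPy sketch of this expand-activate-compress structure with ReLU; the dimensions and random weights are placeholders:
import numpy as np

def feed_forward(x, W1, b1, W2, b2):
    # Expansion d_model -> d_ff, ReLU non-linearity, compression d_ff -> d_model
    hidden = np.maximum(0, x @ W1 + b1)   # max(0, xW1 + b1)
    return hidden @ W2 + b2

rng = np.random.default_rng(0)
d_model, d_ff = 16, 64                    # d_ff = 4 * d_model
x = rng.normal(size=(4, d_model))
W1, b1 = rng.normal(size=(d_model, d_ff)), np.zeros(d_ff)
W2, b2 = rng.normal(size=(d_ff, d_model)), np.zeros(d_model)
print(feed_forward(x, W1, b1, W2, b2).shape)  # (4, 16)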
Role of FFN:
- Non-linear Transformation: Introduces complex non-linear transformation capabilities
- Feature Extraction: Learns position-specific feature transformations
- Information Processing: Further processes outputs from attention mechanism
- Expression Enhancement: Increases model’s expressive power and capacity
Activation Function Selection:
- ReLU: Used in original Transformer, simple and efficient
- GELU: Commonly used in modern LLMs, smoother activation function
- SwiGLU: Gated activation used in recent models such as LLaMA and PaLM, often giving better quality at similar cost
Dimension Selection Principles:
- Expansion Ratio: d_ff is usually 4 times d_model
- Computational Balance: Balance between expressive power and computational cost
- Rule of Thumb: Larger d_ff can improve model capacity but increases computational cost
3.2.4 Layer Normalization
Layer Normalization is a key component for stable Transformer training.
Necessity of Normalization:
- Gradient Stability: Prevent gradient explosion or vanishing
- Training Acceleration: Speed up convergence
- Numerical Stability: Maintain activation values within reasonable ranges
Layer Norm vs Batch Norm:
- Batch Norm: Normalizes across batch dimension, suitable for CV tasks
- Layer Norm: Normalizes across feature dimension, suitable for NLP tasks
- Advantages: Independent of batch size, more stable in sequence modeling
Layer Norm Computation:
μ = (1/d) × Σxi
σ² = (1/d) × Σ(xi - μ)²
LN(x) = γ × (x - μ)/√(σ² + ε) + β
Parameter Explanation (a from-scratch sketch follows these notes):
- μ, σ²: Mean and variance of the layer
- γ, β: Learnable scaling and shifting parameters
- ε: Small constant to prevent division by zero (usually 1e-5 or 1e-6)
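A from-scratch NumPy sketch of this computation; γ and β are initialized to ones and zeros here, though in a real model they are learned:
import numpy as np

def layer_norm(x, gamma, beta, eps=1e-6):
    # Normalize each token vector over its feature dimension
    mu = x.mean(axis=-1, keepdims=True)           # μ
    var = x.var(axis=-1, keepdims=True)           # σ²
    return gamma * (x - mu) / np.sqrt(var + eps) + beta

rng = np.random.default_rng(0)
x = rng.normal(size=(4, 16))
gamma, beta = np.ones(16), np.zeros(16)           # learnable parameters in practice
y = layer_norm(x, gamma, beta)
print(y.mean(axis=-1).round(6), y.std(axis=-1).round(2))  # ≈0 mean, ≈1 std per token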
Pre-Norm vs Post-Norm:
- Post-Norm (Original): Sublayer → Add & Norm
- Pre-Norm (Modern): Norm → Sublayer → Add
- Pre-Norm Advantages: More stable training, especially in deep networks
3.2.5 Residual Connections
Residual connections are a key technique for training deep Transformers.
Principle of Residual Connections:
output = SubLayer(input) + input
Problems Solved:
- Gradient Vanishing: Provides direct backpropagation paths for gradients
- Information Loss: Ensures input information is not completely lost
- Training Stability: Makes deep networks easier to train
Application in Transformer:
1. Multi-Head Attention sub-layer:
x' = x + MultiHeadAttention(LayerNorm(x))
2. Feed-Forward sub-layer:
x'' = x' + FFN(LayerNorm(x'))
(A minimal sketch of this pre-norm block wiring follows.)
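A minimal sketch of this pre-norm wiring in NumPy; the attention and ffn callables here are simple stand-in projections (not real sub-layers) used only to show how the residual connections and normalization compose:
import numpy as np

def layer_norm(x, eps=1e-6):
    # Parameter-free LayerNorm (γ = 1, β = 0) to keep the sketch short
    mu = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mu) / np.sqrt(var + eps)

def pre_norm_block(x, attention, ffn):
    # Residual connection 1: x' = x + Attention(LayerNorm(x))
    x = x + attention(layer_norm(x))
    # Residual connection 2: x'' = x' + FFN(LayerNorm(x'))
    x = x + ffn(layer_norm(x))
    return x

rng = np.random.default_rng(0)
d = 16
W_attn = rng.normal(size=(d, d)) * 0.1   # stand-in for the attention sub-layer
W_ffn = rng.normal(size=(d, d)) * 0.1    # stand-in for the feed-forward sub-layer
x = rng.normal(size=(4, d))
out = pre_norm_block(x, attention=lambda h: h @ W_attn, ffn=lambda h: h @ W_ffn)
print(out.shape)  # (4, 16): same shape as the input
Because the output shape matches the input shape, such blocks can be stacked to arbitrary depth.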
Challenges of Deep Networks:
- Degradation Problem: Performance actually decreases when networks get deeper
- Optimization Difficulty: Without shortcut connections, deep networks are hard to optimize even to the level of shallower ones
- Residual Solution: Let networks learn residuals rather than direct mappings
Design Considerations:
- Dimension Matching: Residual connections require input and output dimensions to match
- Initialization Strategy: Weight initialization of residual paths is important
- Gradient Flow: Ensure gradients can flow smoothly to early layers
3.3 Transformer Variants
3.3.1 GPT Series (Decoder-only)
GPT adopts a Decoder-only architecture, focusing on autoregressive language modeling.
Architectural Features:
- Unidirectional Attention: Can only see content before the current position
- Causal Masking: Uses masks to ensure no “looking into the future”
- Autoregressive Generation: Generates next token one by one
Masking Mechanism:
Mask Matrix (lower-triangular; 1 = may attend, 0 = blocked):
[1, 0, 0, 0]
[1, 1, 0, 0]
[1, 1, 1, 0]
[1, 1, 1, 1]
Evolution of GPT:
- GPT-1: 117 million parameters, proved effectiveness of pretraining + fine-tuning
- GPT-2: 1.5 billion parameters, demonstrated powerful text generation capabilities
- GPT-3: 175 billion parameters, showed emergent few-shot (in-context) learning capabilities
- GPT-4: Parameters undisclosed, significantly improved multimodal capabilities and reasoning
Training Objective:
- Next Word Prediction: Given context, predict probability distribution of next word
- Mathematical Representation: maximize Σ_i log P(w_i | w_1, w_2, ..., w_{i-1}), as illustrated in the sketch below
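A small NumPy sketch that combines the lower-triangular causal mask shown earlier with the next-word-prediction objective; the vocabulary size, logits, and targets are toy placeholders:
import numpy as np

def causal_mask(n):
    # Lower-triangular matrix: position i may only attend to positions <= i
    return np.tril(np.ones((n, n)))

def apply_causal_mask(scores, mask):
    # Blocked positions get -inf before softmax, so their attention weight becomes 0
    return np.where(mask == 1, scores, -np.inf)

def next_token_loss(logits, targets):
    # maximize Σ log P(w_i | w_1, ..., w_{i-1})  <=>  minimize the negative log-likelihood
    logits = logits - logits.max(axis=-1, keepdims=True)
    log_probs = logits - np.log(np.exp(logits).sum(axis=-1, keepdims=True))
    return -log_probs[np.arange(len(targets)), targets].mean()

rng = np.random.default_rng(0)
n, vocab = 4, 10
mask = causal_mask(n)
print(mask)                                       # the 4x4 lower-triangular mask shown above
print(apply_causal_mask(rng.normal(size=(n, n)), mask))
logits = rng.normal(size=(n, vocab))              # model outputs at each position (toy values)
targets = rng.integers(0, vocab, size=n)          # the "next token" at each position (toy values)
print(next_token_loss(logits, targets))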
Application Advantages:
- Text Generation: Excellent at generating coherent long texts
- Dialogue Systems: Natural dialogue generation capabilities
- Creative Writing: Novel, poetry, and script creation
- Code Generation: Understanding requirements and generating code
3.3.2 BERT Series (Encoder-only)
BERT adopts an Encoder-only architecture, focusing on language understanding tasks.
Architectural Features:
- Bidirectional Attention: Can simultaneously see left and right context
- No Generation Capability: Cannot perform autoregressive generation
- Deep Understanding: Focuses on deep understanding and representation of text
Pretraining Tasks:
1. Masked Language Model (MLM):
- Randomly mask 15% of the input tokens (a masking sketch follows this list)
- Predict masked words
- Learn bidirectional context representations
2. Next Sentence Prediction (NSP):
- Judge whether two sentences are consecutive
- Learn relationships between sentences
- (Proven less important in subsequent variants)
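A hedged sketch of MLM-style input corruption; the token IDs, the [MASK] id (103), and the -100 "ignore" label convention are illustrative simplifications (BERT's actual scheme also sometimes keeps the original token or substitutes a random one instead of masking):
import numpy as np

def mlm_corrupt(token_ids, mask_token_id, mask_prob=0.15, rng=None):
    rng = rng or np.random.default_rng()
    token_ids = np.asarray(token_ids)
    selected = rng.random(len(token_ids)) < mask_prob   # choose ~15% of positions
    corrupted = np.where(selected, mask_token_id, token_ids)
    labels = np.where(selected, token_ids, -100)        # -100 = "ignore this position" label
    return corrupted, labels

rng = np.random.default_rng(0)
tokens = [101, 2023, 2003, 1037, 7099, 6251, 102]       # illustrative token IDs only
corrupted, labels = mlm_corrupt(tokens, mask_token_id=103, rng=rng)
print(corrupted)   # selected positions replaced by the [MASK] id
print(labels)      # original IDs at masked positions, -100 everywhere else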
BERT Variants:
- RoBERTa: Removes NSP, optimizes training strategies
- ALBERT: Parameter sharing, reduces model size
- DeBERTa: Decoupled attention mechanism, improves performance
- DistilBERT: Lightweight version through knowledge distillation
Application Scenarios:
- Text Classification: Sentiment analysis, topic classification
- Named Entity Recognition: Identifying person names, place names, organization names
- Question Answering: Reading comprehension and knowledge QA
- Similarity Computation: Text similarity and retrieval
3.3.3 T5 Series (Encoder-Decoder)
T5 maintains the complete Encoder-Decoder architecture, unifying all NLP tasks as text-to-text conversion.
“Text-to-Text” Philosophy:
- Unified Framework: All tasks converted to text generation problems
- Task Prefixes: Specify specific task types through prefixes
- Examples (a usage sketch follows this list):
- Translation: “translate English to German: Hello” → “Hallo”
- Summarization: “summarize: [long text]” → “[summary]”
- Classification: “classify: [text]” → “positive/negative”
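As an illustration, a minimal usage sketch with the Hugging Face transformers library, assuming it and the t5-small checkpoint are available; the exact output depends on the checkpoint:
from transformers import T5ForConditionalGeneration, T5Tokenizer

tokenizer = T5Tokenizer.from_pretrained("t5-small")
model = T5ForConditionalGeneration.from_pretrained("t5-small")

# The task is selected purely by the text prefix
inputs = tokenizer("translate English to German: Hello", return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=20)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))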
Architectural Advantages:
- Flexibility: Same model can handle multiple tasks
- Transfer Learning: Tasks can mutually promote learning
- Unified Interface: Simplifies model usage and deployment
T5 Variants:
- T5-small/base/large/3B/11B: Models of different scales
- mT5: Multilingual version
- UL2: Unified Language Learner framework
- Flan-T5: Instruction-tuned version of T5 trained on a large collection of tasks
Training Strategies:
- Denoising Autoencoding: Corrupt input text, train model to recover
- Multi-task Learning: Train simultaneously on multiple tasks
- Prefix LM: Combines advantages of autoencoding and autoregressive approaches
Application Scenarios:
- Multi-task Processing: Scenarios requiring multiple NLP tasks
- Few-shot Learning: Quickly adapting to new tasks
- Conditional Generation: Generating text based on specific conditions
- Cross-task Transfer: Leveraging correlations between tasks