Chapter 4: Core Technical Components of LLMs
4.1 Text Preprocessing and Tokenization
4.1.1 The Importance of Tokenization
Tokenization is the first step in LLM text processing, converting raw text into numerical representations that models can understand.
Limitations of Traditional Word Segmentation:
- Vocabulary Explosion: Full word tokenization leads to excessively large vocabularies, especially for morphologically rich languages
- Out-of-Vocabulary Problem: Words not seen during training cannot be processed
- Low-Frequency Word Handling: Many low-frequency words occupy vocabulary space but receive insufficient training
- Cross-Language Consistency: Different languages have vastly different tokenization standards
Advantages of Subword Segmentation:
- Balancing Expressiveness and Efficiency: Finding a balance between character-level and word-level representation
- Handling Unknown Words: Can represent unseen words through subword combinations
- Alleviating Data Sparsity: Reduces the number of low-frequency words
- Cross-Language Unification: Provides consistent representation methods for multilingual models
4.1.2 Main Subword Segmentation Algorithms
Byte Pair Encoding (BPE):
- Core Idea: Iteratively merge the most frequently occurring character or character sequence pairs
- Algorithm Flow (a minimal code sketch follows this list):
1. Initialize: Break all words into character sequences
2. Count the frequency of adjacent character pairs
3. Merge the most frequent character pair
4. Repeat steps 2-3 until reaching the preset vocabulary size
5. Apply learned merge rules to process new text
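The merge-learning loop above can be written in a few lines of Python. The following is a minimal, self-contained sketch on a toy word list (not a production tokenizer); ties between equally frequent pairs are broken arbitrarily:
from collections import Counter

def learn_bpe(words, num_merges):
    # Represent each word as a tuple of symbols (initially single characters)
    vocab = Counter(tuple(w) for w in words)
    merges = []
    for _ in range(num_merges):
        # Count adjacent symbol pairs across the corpus
        pairs = Counter()
        for symbols, freq in vocab.items():
            for a, b in zip(symbols, symbols[1:]):
                pairs[(a, b)] += freq
        if not pairs:
            break
        best = max(pairs, key=pairs.get)   # most frequent pair
        merges.append(best)
        # Apply the merge to every word in the corpus
        new_vocab = Counter()
        for symbols, freq in vocab.items():
            merged, i = [], 0
            while i < len(symbols):
                if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == best:
                    merged.append(symbols[i] + symbols[i + 1])
                    i += 2
                else:
                    merged.append(symbols[i])
                    i += 1
            new_vocab[tuple(merged)] += freq
        vocab = new_vocab
    return merges, vocab

merges, vocab = learn_bpe(["low", "lower", "newest", "widest"], num_merges=6)
print(merges)   # learned merge rules, in the order they were applied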
BPE Example:
Initial vocabulary: ["low", "lower", "newest", "widest"]
Character level: ["l o w", "l o w e r", "n e w e s t", "w i d e s t"]
Iteration process:
1. Most frequent pair: "e s" → merge to "es"
2. Most frequent pair: "es t" → merge to "est"
3. Most frequent pair: "l o" → merge to "lo"
...
Final result: ["lo w", "lo w er", "new est", "wid est"]
WordPiece Algorithm:
- Core Improvement: Merges based on language model probability rather than simple frequency
- Merge Criterion: Choose character pairs that maximize training data likelihood
- Formula:
score(x,y) = count(xy) / (count(x) × count(y))
- Advantages: Better preserves linguistic semantic integrity
- Application: Used in Google’s BERT model
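To make the scoring rule concrete, the snippet below computes the pair score from counts (this illustrates only the merge criterion, not the full WordPiece training procedure; the counts are hypothetical):
def wordpiece_pair_score(count_xy, count_x, count_y):
    # score(x, y) = count(xy) / (count(x) * count(y))
    # A frequent pair built from rare units scores higher than a pair of
    # units that are each very common on their own.
    return count_xy / (count_x * count_y)

# Hypothetical counts: same pair frequency, different unit frequencies
print(wordpiece_pair_score(count_xy=20, count_x=30, count_y=25))    # ≈ 0.027
print(wordpiece_pair_score(count_xy=20, count_x=500, count_y=400))  # = 0.0001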
SentencePiece Algorithm:
- Language Independence: Does not rely on pre-tokenization, directly processes raw text
- Unified Processing: Treats spaces as special characters, achieving true end-to-end tokenization
- Multilingual Support: Particularly suitable for languages without explicit space separation like Chinese and Japanese
- Implementation:
Original: "Hello world"
SentencePiece: ["▁Hello", "▁wor", "ld"]  (▁ represents the original space position)
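In practice this workflow maps onto the sentencepiece Python package. The sketch below is a minimal usage example, assuming the package is installed and a reasonably large plain-text file corpus.txt is available; the exact subword splits depend on the corpus and vocabulary size:
import sentencepiece as spm

# Train directly on raw text; no pre-tokenization is required
spm.SentencePieceTrainer.train(
    input="corpus.txt", model_prefix="spm_demo",
    vocab_size=8000, model_type="bpe",
)

sp = spm.SentencePieceProcessor(model_file="spm_demo.model")
print(sp.encode("Hello world", out_type=str))  # e.g. ['▁Hello', '▁wor', 'ld']
print(sp.encode("Hello world", out_type=int))  # corresponding token IDs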
4.1.3 Vocabulary Construction Strategies
Vocabulary Size Selection:
- Small Vocabulary (8K-16K):
- Advantages: Fewer model parameters, faster training
- Disadvantages: Increased sequence length, limited expressiveness
- Medium Vocabulary (32K-64K):
- Balance Point: Achieves balance between efficiency and expressiveness
- Mainstream Choice: Scale adopted by most LLMs
- Large Vocabulary (128K+):
- Advantages: Better expressiveness, shorter sequences
- Disadvantages: Large embedding layer parameters, high computational overhead
Multilingual Vocabulary Design:
- Language Balance: Ensure adequate representation for all languages
- Script Coverage: Cover different writing systems (Latin, Chinese, Arabic, etc.)
- Sampling Strategy: Adjust training data sampling ratios based on target language distribution
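A common way to realize such a sampling strategy is temperature-based resampling of per-language data fractions (exponentiate by α < 1 to up-weight low-resource languages). The corpus sizes below are purely illustrative:
def language_sampling_probs(corpus_sizes, alpha=0.3):
    # Raw data fractions, exponentiated by alpha and renormalized
    total = sum(corpus_sizes.values())
    raw = {lang: n / total for lang, n in corpus_sizes.items()}
    scaled = {lang: p ** alpha for lang, p in raw.items()}
    z = sum(scaled.values())
    return {lang: p / z for lang, p in scaled.items()}

# Hypothetical corpus sizes (number of documents)
sizes = {"en": 1_000_000, "zh": 200_000, "sw": 10_000}
print(language_sampling_probs(sizes, alpha=0.3))   # low-resource languages gain weight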
4.1.4 Special Token Handling
Basic Special Tokens:
- [CLS]: Classification token, usually at sequence beginning, used for classification tasks
- [SEP]: Separator token, used to separate different text segments
- [PAD]: Padding token, used to align sequences in batches to the same length
- [UNK]: Unknown token, represents out-of-vocabulary words
- [MASK]: Mask token, used for masked language model training
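For a concrete look at these tokens, a BERT-style tokenizer from the Hugging Face transformers library inserts [CLS], [SEP], and [PAD] automatically (this assumes transformers is installed and can download bert-base-uncased; the exact token split may vary by model):
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")

# Encode a sentence pair: [CLS]/[SEP] are added automatically,
# and [PAD] fills the sequence up to max_length
enc = tokenizer("How are you?", "I am fine.",
                padding="max_length", max_length=16)
print(tokenizer.convert_ids_to_tokens(enc["input_ids"]))
# Typically: ['[CLS]', 'how', 'are', 'you', '?', '[SEP]', 'i', 'am', 'fine', '.',
#             '[SEP]', '[PAD]', '[PAD]', '[PAD]', '[PAD]', '[PAD]']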
Generation Task Special Tokens:
- [BOS]/[SOS]: Beginning of sequence token
- [EOS]: End of sequence token
- [EOD]: End of document token
Formatting Tokens:
<|system|>: System prompt information
<|user|>: User input
<|assistant|>: Assistant response
<|endoftext|>: End of text
4.1.5 Differences Between English and Chinese Tokenization
English Tokenization Characteristics:
- Natural Separation: Spaces naturally provide word boundary information
- Morphological Changes: Need to handle word form variations (run/running/ran)
- Compound Words: Need to handle compound word segmentation
- Contraction Handling: Expand contracted forms (don’t → do not)
Chinese Tokenization Challenges:
- No Clear Separation: No natural separators like spaces
- Ambiguous Segmentation: The same sentence may have multiple reasonable tokenization methods
Example: 研究生命的起源 (studying the origin of life)
Segmentation 1: 研究 / 生命 / 的 / 起源 (study / life / of / origin)
Segmentation 2: 研究生 / 命 / 的 / 起源 (graduate student / life / of / origin)
- Blurred Word Boundaries: Unclear boundaries between words and phrases
- Constantly Emerging New Words: Rapid emergence of internet slang and technical terms
Multilingual Unified Solution:
- SentencePiece Unified Processing: Independent of pre-tokenization, unified processing of various languages
- Character-Level Fallback: For difficult-to-recognize parts, fall back to character-level processing
- Context-Aware: Use contextual information to resolve tokenization ambiguity
4.2 Embedding Layers
4.2.1 Basic Concepts of Word Embeddings
Necessity of Embeddings:
- Sparse Representation Problem: One-hot encoding leads to high-dimensional sparse vectors with low computational efficiency
- Semantic Loss: One-hot encoding cannot capture semantic relationships between words
- Dimension Explosion: Vocabulary size directly determines vector dimension, difficult to scale
Advantages of Embeddings:
- Dense Representation: Maps sparse vocabulary to dense low-dimensional space
- Semantic Modeling: Similar words are closer in embedding space
- Parameter Efficiency: Significantly reduces model parameter count
- Transfer Ability: Pre-trained embeddings can transfer to downstream tasks
Mathematical Representation of Embeddings:
one-hot: [0, 0, 1, 0, ..., 0] ∈ ℝ^V (V is vocabulary size)
embedding: [0.2, -0.1, 0.8, ..., 0.3] ∈ ℝ^d (d is embedding dimension)
Embedding matrix: E ∈ ℝ^(V×d)
Embedding lookup: embedding = E[token_id]
4.2.2 Token Embedding Implementation
Embedding Layer Design:
- Lookup Table Mechanism: Essentially a large lookup table where each token ID corresponds to a vector
- Parameter Sharing: Input and output embedding layers can share parameters to reduce total parameters
- Initialization Strategies:
# Common initialization methods (second argument is the variance)
Xavier initialization: weight ~ N(0, 1/d)
He initialization: weight ~ N(0, 2/d)
Uniform distribution: weight ~ U(-sqrt(3/d), sqrt(3/d))
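A minimal PyTorch sketch of an embedding layer with explicit initialization and input/output weight tying (one common way to realize the parameter sharing mentioned above); the vocabulary size, dimensions, and token IDs are illustrative:
import torch
import torch.nn as nn

vocab_size, d_model = 32_000, 768   # illustrative sizes

embedding = nn.Embedding(vocab_size, d_model)
# Xavier-style initialization: standard deviation 1/sqrt(d_model)
nn.init.normal_(embedding.weight, mean=0.0, std=d_model ** -0.5)

# Output projection tied to the input embedding (weight sharing)
lm_head = nn.Linear(d_model, vocab_size, bias=False)
lm_head.weight = embedding.weight

token_ids = torch.tensor([[101, 2023, 2003, 102]])   # (batch=1, seq_len=4)
hidden = embedding(token_ids)                        # (1, 4, d_model)
logits = lm_head(hidden)                             # (1, 4, vocab_size)
print(hidden.shape, logits.shape)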
Embedding Layer Training:
- End-to-End Learning: Embedding weights are trained together as model parameters
- Gradient Updates: Only embedding vectors corresponding to used tokens receive gradient updates
- Regularization Techniques:
- Dropout: Randomly zero out some embedding dimensions
- Weight Decay: L2 regularization to prevent overfitting
- Gradient Clipping: Prevent gradient explosion
4.2.3 Detailed Analysis of Position Embedding
Importance of Position Information:
- Order Sensitivity: Natural language is highly dependent on word order
- Grammatical Structure: Position information is crucial for parsing grammatical relationships
- Semantic Differences: Same words at different positions may have different meanings
Deep Dive into Absolute Position Encoding:
Sinusoidal Position Encoding:
PE(pos, 2i) = sin(pos / 10000^(2i/d_model))
PE(pos, 2i+1) = cos(pos / 10000^(2i/d_model))
Design Principle Analysis:
- Frequency Decay: Different dimensions use different frequencies, from high to low
- Uniqueness Guarantee: Each position has a unique encoding representation
- Relative Position Relationship: Using trigonometric properties, relative positions have fixed linear relationships
- Extrapolation Capability: Can handle inputs longer than training sequences
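The two formulas above translate directly into code. A minimal sketch that precomputes the sinusoidal table (assuming an even d_model):
import torch

def sinusoidal_position_encoding(max_len, d_model):
    # PE[pos, 2i]   = sin(pos / 10000^(2i/d_model))
    # PE[pos, 2i+1] = cos(pos / 10000^(2i/d_model))
    pos = torch.arange(max_len, dtype=torch.float32).unsqueeze(1)   # (max_len, 1)
    i = torch.arange(0, d_model, 2, dtype=torch.float32)            # (d_model/2,)
    angle = pos / (10000 ** (i / d_model))                          # (max_len, d_model/2)
    pe = torch.zeros(max_len, d_model)
    pe[:, 0::2] = torch.sin(angle)
    pe[:, 1::2] = torch.cos(angle)
    return pe

pe = sinusoidal_position_encoding(max_len=512, d_model=64)
print(pe.shape)   # torch.Size([512, 64])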
Learned Position Encoding Comparison:
Advantages:
- Adapts to task-specific position patterns
- Can learn complex position relationships
- Usually better performance within limited length
Disadvantages:
- Cannot extrapolate to longer sequences
- Requires additional parameter storage
- Longer training time
Modern Position Encoding Schemes:
Rotary Position Embedding (RoPE):
- Core Idea: Encode position information through rotation matrices in complex space
- Mathematical Representation:
f_q(x_m, m) = (W_q x_m) ⊗ e^(imθ)
f_k(x_n, n) = (W_k x_n) ⊗ e^(inθ)
- Advantages: Naturally encodes relative position relationships with good extrapolation performance
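In real-valued code, the complex rotation is usually applied by rotating consecutive dimension pairs of the query and key vectors. A minimal sketch (interleaved-pair convention, head dimension assumed even; other implementations split the vector into halves instead):
import torch

def apply_rope(x, base=10000.0):
    # x: (seq_len, d) with d even; rotate each (x[2i], x[2i+1]) pair by an angle
    # proportional to its position, so relative offsets become pure rotations
    seq_len, d = x.shape
    pos = torch.arange(seq_len, dtype=torch.float32).unsqueeze(1)           # (seq_len, 1)
    inv_freq = base ** (-torch.arange(0, d, 2, dtype=torch.float32) / d)    # (d/2,)
    angle = pos * inv_freq                                                  # (seq_len, d/2)
    cos, sin = angle.cos(), angle.sin()
    x1, x2 = x[:, 0::2], x[:, 1::2]
    out = torch.empty_like(x)
    out[:, 0::2] = x1 * cos - x2 * sin
    out[:, 1::2] = x1 * sin + x2 * cos
    return out

q = torch.randn(8, 64)   # (seq_len, d_head), illustrative sizes
q_rot = apply_rope(q)    # applied to Q and K before the dot product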
ALiBi (Attention with Linear Biases):
- Mechanism: Directly adds linear bias to attention scores
- Calculation:
attention_score = QK^T + bias_matrix
bias_matrix[i,j] = -m × |i-j|
- Benefits: Simple and efficient with excellent length extrapolation capability
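A minimal sketch of the bias construction (the per-head slope m is a fixed hyperparameter, typically drawn from a geometric sequence; here it is passed in explicitly):
import torch

def alibi_bias(seq_len, slope):
    # bias[i, j] = -slope * |i - j|, added to the raw attention scores before softmax
    pos = torch.arange(seq_len)
    return -slope * (pos[:, None] - pos[None, :]).abs().float()

bias = alibi_bias(seq_len=6, slope=0.5)
# scores = Q @ K.T / sqrt(d_k) + bias
print(bias)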
4.2.4 Embedding Dimension Selection Principles
Dimension Selection Trade-offs:
- Expressiveness vs Computational Efficiency: Higher dimensions provide stronger expressiveness but increase computational cost
- Overfitting Risk: Excessively high dimensions may lead to overfitting, especially on small datasets
- Downstream Task Requirements: Different tasks have different requirements for representation capability
Empirical Rules:
- Small Models (<100M parameters): 128-512 dimensions
- Medium Models (100M-1B parameters): 512-1024 dimensions
- Large Models (>1B parameters): 1024-4096 dimensions or higher
Relationship Between Dimension and Other Model Components:
Common settings:
d_model = d_embedding
d_ff = 4 × d_model (Feed-Forward hidden layer dimension)
d_head = d_model / num_heads (dimension per attention head)
4.3 Detailed Attention Mechanisms
4.3.1 Complete Self-Attention Computation Process
Step 1: Linear Transformations
# Input: X ∈ ℝ^(seq_len × d_model)
Q = X @ W_Q # Query matrix
K = X @ W_K # Key matrix
V = X @ W_V # Value matrix
# Weight matrices: W_Q, W_K, W_V ∈ ℝ^(d_model × d_model)
Step 2: Attention Score Calculation
# Calculate attention scores
scores = Q @ K.T # ∈ ℝ^(seq_len × seq_len)
# Scaling factor
scaled_scores = scores / sqrt(d_model)  # more generally sqrt(d_k); here d_k = d_model (single head)
Step 3: Mask Application (if needed)
# Causal mask (for GPT and other decoder-only models)
mask = torch.triu(torch.ones(seq_len, seq_len), diagonal=1)
masked_scores = scaled_scores.masked_fill(mask == 1, float('-inf'))
Step 4: Softmax Normalization
attention_weights = softmax(masked_scores, dim=-1)
# Ensure each row sums to 1, representing a probability distribution
Step 5: Weighted Aggregation
output = attention_weights @ V  # ∈ ℝ^(seq_len × d_model)
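Putting the five steps together, the following is a runnable single-head sketch in PyTorch; the weights are randomly initialized purely for illustration:
import math
import torch

def causal_self_attention(X, W_Q, W_K, W_V):
    # X: (seq_len, d_model); weight matrices: (d_model, d_model)
    seq_len, d_model = X.shape
    Q, K, V = X @ W_Q, X @ W_K, X @ W_V                        # Step 1: linear transformations
    scores = Q @ K.T / math.sqrt(d_model)                      # Step 2: scaled attention scores
    mask = torch.triu(torch.ones(seq_len, seq_len), diagonal=1)
    scores = scores.masked_fill(mask == 1, float("-inf"))      # Step 3: causal mask
    weights = torch.softmax(scores, dim=-1)                    # Step 4: softmax normalization
    return weights @ V, weights                                # Step 5: weighted aggregation

seq_len, d_model = 5, 16
X = torch.randn(seq_len, d_model)
W_Q, W_K, W_V = (torch.randn(d_model, d_model) * d_model ** -0.5 for _ in range(3))
out, attn = causal_self_attention(X, W_Q, W_K, W_V)
print(out.shape, attn.shape)   # torch.Size([5, 16]) torch.Size([5, 5])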
4.3.2 Design Principles of Scaled Dot-Product Attention
Necessity of Scaling Factor:
- Gradient Stability: Prevents softmax function from entering saturation region causing gradient vanishing
- Numerical Stability: Avoids numerical overflow caused by excessively large dot products
- Theoretical Analysis:
Assume the components of q_i and k_j are i.i.d. with mean 0 and variance 1.
Then the dot product q_i · k_j has variance d_k.
Dividing by √d_k restores unit variance, maintaining numerical stability.
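This is easy to verify empirically; the sketch below samples random vectors with unit-variance components and compares the dot-product variance with and without the √d_k scaling:
import torch

d_k, n = 64, 100_000
q = torch.randn(n, d_k)   # components ~ N(0, 1)
k = torch.randn(n, d_k)

dots = (q * k).sum(dim=-1)
print(dots.var().item())                  # ≈ d_k (here ≈ 64)
print((dots / d_k ** 0.5).var().item())   # ≈ 1 after scaling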
Properties of Attention Distribution:
- Sparsity: After softmax, attention typically concentrates on a few relevant positions
- Smoothness: Temperature parameter (here √d_k) controls distribution smoothness
- Interpretability: Attention weights provide intuitive explanation of model decisions
4.3.3 Parallel Implementation of Multi-Head Attention
Parallel Computation Implementation Tricks:
# Traditional implementation (serial: one projection and attention call per head)
outputs = []
for head in range(num_heads):
    Q_h = X @ W_Q[head]
    K_h = X @ W_K[head]
    V_h = X @ W_V[head]
    output_h = attention(Q_h, K_h, V_h)
    outputs.append(output_h)
concat_output = concat(outputs)

# Efficient implementation (parallel)
# Concatenate the weight matrices of all heads
W_Q_all = concat([W_Q[0], W_Q[1], ..., W_Q[h-1]], dim=1)
Q_all = X @ W_Q_all  # A single matrix multiplication produces all heads' Q
# Reshape to multi-head format
Q_heads = Q_all.reshape(batch_size, seq_len, num_heads, d_head)
Q_heads = Q_heads.transpose(1, 2)  # (batch, num_heads, seq_len, d_head)
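A compact runnable version of the parallel formulation, using one fused projection per Q/K/V followed by the reshape into heads; module and variable names are illustrative:
import torch
import torch.nn as nn

class MultiHeadSelfAttention(nn.Module):
    def __init__(self, d_model, num_heads):
        super().__init__()
        assert d_model % num_heads == 0
        self.num_heads, self.d_head = num_heads, d_model // num_heads
        # One fused projection per Q/K/V covers all heads in a single matmul
        self.W_Q = nn.Linear(d_model, d_model, bias=False)
        self.W_K = nn.Linear(d_model, d_model, bias=False)
        self.W_V = nn.Linear(d_model, d_model, bias=False)
        self.W_O = nn.Linear(d_model, d_model, bias=False)

    def forward(self, x):   # x: (batch, seq_len, d_model)
        b, L, _ = x.shape
        def split(t):       # -> (batch, num_heads, seq_len, d_head)
            return t.view(b, L, self.num_heads, self.d_head).transpose(1, 2)
        q, k, v = split(self.W_Q(x)), split(self.W_K(x)), split(self.W_V(x))
        scores = q @ k.transpose(-2, -1) / self.d_head ** 0.5
        attn = torch.softmax(scores, dim=-1)
        out = (attn @ v).transpose(1, 2).reshape(b, L, -1)   # concatenate heads
        return self.W_O(out)

mha = MultiHeadSelfAttention(d_model=64, num_heads=8)
print(mha(torch.randn(2, 10, 64)).shape)   # torch.Size([2, 10, 64])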
Memory Efficiency Optimizations:
- Flash Attention: Reduces memory usage through block computation
- Gradient Checkpointing: Trades computation for memory by recomputing forward pass
- Mixed Precision Training: Uses FP16 to reduce memory footprint
4.3.4 Attention Weight Visualization and Interpretation
Attention Heatmaps:
- X-axis: Token positions in input sequence
- Y-axis: Token positions in output sequence (or same sequence for self-attention)
- Color Intensity: Magnitude of attention weights
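A minimal plotting sketch with matplotlib; the attention matrix here is random, standing in for weights extracted from a real model:
import matplotlib.pyplot as plt
import torch

tokens = ["The", "cat", "sat", "on", "the", "mat"]
attn = torch.softmax(torch.randn(len(tokens), len(tokens)), dim=-1)   # placeholder weights

fig, ax = plt.subplots()
im = ax.imshow(attn.numpy(), cmap="viridis")        # color intensity = attention weight
ax.set_xticks(range(len(tokens)), labels=tokens)    # keys (attended-to positions)
ax.set_yticks(range(len(tokens)), labels=tokens)    # queries
fig.colorbar(im)
plt.show()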
Attention Patterns in Different Layers:
- Shallow Layers: Mainly focus on local grammatical relationships (adjacent words)
- Middle Layers: Capture medium-distance semantic relationships
- Deep Layers: Model long-distance dependencies and high-level semantic relationships
Attention Head Specialization: Analyses of trained models reveal:
- Syntactic Heads: Focus on subject-verb-object grammatical relationships
- Coreference Heads: Handle pronoun reference and noun mention
- Positional Heads: Mainly attend to relative position information
- Semantic Heads: Capture semantic similarity and topic relevance
Limitations of Attention Interpretation:
- Causality Issues: High attention weights don’t necessarily indicate causal relationships
- Multi-Head Aggregation: Single head interpretation may be incomplete
- Non-Linear Transformations: Subsequent FFN layers further transform attention output
- Training Dynamics: Attention patterns constantly change during training
Practical Attention Analysis Methods:
# Attention weight statistics
def analyze_attention_patterns(attention_weights):
    # attention_weights: (batch, num_heads, seq_len, seq_len)
    # Calculate attention locality (degree of focus on adjacent tokens)
    locality_score = compute_locality_bias(attention_weights)
    # Calculate attention dispersion
    entropy = compute_attention_entropy(attention_weights)
    # Identify key attention connections
    important_connections = find_important_attention_links(attention_weights)
    return {
        'locality': locality_score,
        'entropy': entropy,
        'key_connections': important_connections
    }
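The helper functions above are hypothetical names rather than part of any library. One possible implementation, consistent with the statistics described, might look like this:
import torch

def compute_locality_bias(attention_weights):
    # Expected distance |i - j| under each query's attention distribution;
    # lower values mean more local attention (averaged over batch and heads)
    b, h, L, _ = attention_weights.shape
    pos = torch.arange(L, dtype=attention_weights.dtype)
    dist = (pos[:, None] - pos[None, :]).abs()       # (L, L)
    return (attention_weights * dist).sum(-1).mean()

def compute_attention_entropy(attention_weights, eps=1e-9):
    # Shannon entropy of each attention row; higher means more dispersed attention
    p = attention_weights.clamp_min(eps)
    return -(p * p.log()).sum(-1).mean()

def find_important_attention_links(attention_weights, top_k=5):
    # Top-k strongest (query, key) connections, averaged over batch and heads
    avg = attention_weights.mean(dim=(0, 1))         # (L, L)
    values, flat_idx = avg.flatten().topk(top_k)
    L = avg.size(-1)
    rows, cols = flat_idx // L, flat_idx % L
    return list(zip(rows.tolist(), cols.tolist(), values.tolist()))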