
Chapter 4: Core Technical Components of LLMs

4.1 Text Preprocessing and Tokenization

4.1.1 The Importance of Tokenization

Tokenization is the first step in LLM text processing, converting raw text into numerical representations that models can understand.

Limitations of Traditional Word Segmentation:

  • Vocabulary Explosion: Full word tokenization leads to excessively large vocabularies, especially for morphologically rich languages
  • Out-of-Vocabulary Problem: Words not seen during training cannot be processed
  • Low-Frequency Word Handling: Many low-frequency words occupy vocabulary space but receive insufficient training
  • Cross-Language Consistency: Different languages have vastly different tokenization standards

Advantages of Subword Segmentation:

  • Balancing Expressiveness and Efficiency: Finding a balance between character-level and word-level representation
  • Handling Unknown Words: Can represent unseen words through subword combinations
  • Alleviating Data Sparsity: Reduces the number of low-frequency words
  • Cross-Language Unification: Provides consistent representation methods for multilingual models

4.1.2 Main Subword Segmentation Algorithms

Byte Pair Encoding (BPE):

  • Core Idea: Iteratively merge the most frequently occurring character or character sequence pairs
  • Algorithm Flow:
    1. Initialize: Break all words into character sequences
    2. Count the frequency of adjacent character pairs
    3. Merge the most frequent character pair
    4. Repeat steps 2-3 until reaching the preset vocabulary size
    5. Apply learned merge rules to process new text

BPE Example:

Initial vocabulary: ["low", "lower", "newest", "widest"]
Character level: ["l o w", "l o w e r", "n e w e s t", "w i d e s t"]
Iteration process:
  1. Most frequent pair: "e s" → merge to "es"
  2. Most frequent pair: "es t" → merge to "est"
  3. Most frequent pair: "l o" → merge to "lo"
  ...
Final result: ["lo w", "lo w er", "new est", "wid est"]
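
A minimal Python sketch of the merge loop described above (illustrative only; production tokenizers count pairs over word frequencies from a large corpus and store the learned merge rules for encoding new text):

from collections import Counter

def train_bpe(words, num_merges):
    # Toy BPE trainer: learn merge rules from a list of words.
    corpus = Counter(tuple(w) for w in words)         # each word as a tuple of symbols
    merges = []
    for _ in range(num_merges):
        pair_counts = Counter()
        for symbols, freq in corpus.items():          # count adjacent symbol pairs
            for pair in zip(symbols, symbols[1:]):
                pair_counts[pair] += freq
        if not pair_counts:
            break
        best = max(pair_counts, key=pair_counts.get)  # most frequent pair
        merges.append(best)
        new_corpus = Counter()                        # apply the merge everywhere
        for symbols, freq in corpus.items():
            merged, i = [], 0
            while i < len(symbols):
                if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == best:
                    merged.append(symbols[i] + symbols[i + 1])
                    i += 2
                else:
                    merged.append(symbols[i])
                    i += 1
            new_corpus[tuple(merged)] += freq
        corpus = new_corpus
    return merges, corpus

merges, corpus = train_bpe(["low", "lower", "newest", "widest"], num_merges=3)
print(merges)   # merge order depends on tie-breaking when pair counts are equal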

WordPiece Algorithm:

  • Core Improvement: Merges based on language model probability rather than simple frequency
  • Merge Criterion: Choose character pairs that maximize training data likelihood
  • Formula:
    score(x,y) = count(xy) / (count(x) × count(y))
  • Advantages: Better preserves linguistic semantic integrity
  • Application: Used in Google’s BERT model
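
A tiny illustration of the merge criterion, using hypothetical counts (not taken from the text above); the score favors pairs whose co-occurrence is high relative to the individual symbol frequencies:

# Hypothetical subword counts from a toy corpus
count = {"e": 120, "s": 90, "t": 150, "es": 60, "st": 30}

def wordpiece_score(x, y):
    return count[x + y] / (count[x] * count[y])

print(wordpiece_score("e", "s"))   # 60 / (120 * 90) ≈ 0.0056
print(wordpiece_score("s", "t"))   # 30 / (90 * 150) ≈ 0.0022 → ("e", "s") would be merged first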

SentencePiece Algorithm:

  • Language Independence: Does not rely on pre-tokenization, directly processes raw text
  • Unified Processing: Treats spaces as special characters, achieving true end-to-end tokenization
  • Multilingual Support: Particularly suitable for languages without explicit space separation like Chinese and Japanese
  • Implementation:
    Original: "Hello world"
    SentencePiece: ["▁Hello", "▁wor", "ld"]
    (▁ represents the original space position)
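
A brief usage sketch with the sentencepiece library; "corpus.txt" is a hypothetical raw-text file (one sentence per line), and the resulting pieces depend entirely on the training corpus:

import sentencepiece as spm

# Train directly on raw text, with no pre-tokenization
spm.SentencePieceTrainer.train(
    input="corpus.txt",
    model_prefix="sp_demo",
    vocab_size=8000,
    model_type="bpe",            # "unigram" is the library default
    character_coverage=0.9995,   # keep rare characters, useful for Chinese/Japanese
)

sp = spm.SentencePieceProcessor(model_file="sp_demo.model")
print(sp.encode("Hello world", out_type=str))   # pieces such as ['▁Hello', '▁world'], corpus-dependent
print(sp.encode("Hello world"))                 # the corresponding integer ids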

4.1.3 Vocabulary Construction Strategies

Vocabulary Size Selection:

  • Small Vocabulary (8K-16K):
    • Advantages: Fewer model parameters, faster training
    • Disadvantages: Increased sequence length, limited expressiveness
  • Medium Vocabulary (32K-64K):
    • Balance Point: Achieves balance between efficiency and expressiveness
    • Mainstream Choice: Scale adopted by most LLMs
  • Large Vocabulary (128K+):
    • Advantages: Better expressiveness, shorter sequences
    • Disadvantages: Large embedding layer parameters, high computational overhead

Multilingual Vocabulary Design:

  • Language Balance: Ensure adequate representation for all languages
  • Script Coverage: Cover different writing systems (Latin, Chinese, Arabic, etc.)
  • Sampling Strategy: Adjust training data sampling ratios based on target language distribution

4.1.4 Special Token Handling

Basic Special Tokens:

  • [CLS]: Classification token, usually at sequence beginning, used for classification tasks
  • [SEP]: Separator token, used to separate different text segments
  • [PAD]: Padding token, used to align sequences in batches to the same length
  • [UNK]: Unknown token, represents out-of-vocabulary words
  • [MASK]: Mask token, used for masked language model training

Generation Task Special Tokens:

  • [BOS]/[SOS]: Beginning of sequence token
  • [EOS]: End of sequence token
  • [EOD]: End of document token

Formatting Tokens:

<|system|>: System prompt information
<|user|>: User input
<|assistant|>: Assistant response
<|endoftext|>: End of text
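
A sketch of how these formatting tokens might frame a single conversation turn before tokenization; the helper function and exact layout are illustrative, since each model family defines its own chat template:

def format_chat(system, user, assistant=None):
    # Illustrative chat template built from the tokens listed above
    text = f"<|system|>\n{system}\n<|user|>\n{user}\n<|assistant|>\n"
    if assistant is not None:
        text += f"{assistant}<|endoftext|>"
    return text

print(format_chat(system="You are a helpful assistant.",
                  user="Summarize BPE in one sentence."))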

4.1.5 Differences Between English and Chinese Tokenization

English Tokenization Characteristics:

  • Natural Separation: Spaces naturally provide word boundary information
  • Morphological Changes: Need to handle word form variations (run/running/ran)
  • Compound Words: Need to handle compound word segmentation
  • Contraction Handling: Expand or normalize contracted forms (don’t → do not)

Chinese Tokenization Challenges:

  • No Clear Separation: No natural separators like spaces
  • Ambiguous Segmentation: The same sentence may have multiple reasonable tokenization methods
    Example: 研究生命的起源 (studying the origin of life)
    Segmentation 1: 研究 / 生命 / 的 / 起源 (study / life / of / origin)
    Segmentation 2: 研究生 / 命 / 的 / 起源 (graduate student / life / of / origin)
  • Blurred Word Boundaries: Unclear boundaries between words and phrases
  • Constantly Emerging New Words: Rapid emergence of internet slang and technical terms

Multilingual Unified Solution:

  • SentencePiece Unified Processing: Independent of pre-tokenization, unified processing of various languages
  • Character-Level Fallback: For difficult-to-recognize parts, fall back to character-level processing
  • Context-Aware: Use contextual information to resolve tokenization ambiguity

4.2 Embedding Layers

4.2.1 Basic Concepts of Word Embeddings

Necessity of Embeddings:

  • Sparse Representation Problem: One-hot encoding leads to high-dimensional sparse vectors with low computational efficiency
  • Semantic Loss: One-hot encoding cannot capture semantic relationships between words
  • Dimension Explosion: Vocabulary size directly determines vector dimension, difficult to scale

Advantages of Embeddings:

  • Dense Representation: Maps sparse vocabulary to dense low-dimensional space
  • Semantic Modeling: Similar words are closer in embedding space
  • Parameter Efficiency: Significantly reduces model parameter count
  • Transfer Ability: Pre-trained embeddings can transfer to downstream tasks

Mathematical Representation of Embeddings:

one-hot: [0, 0, 1, 0, ..., 0] ∈ ℝ^V (V is vocabulary size)
embedding: [0.2, -0.1, 0.8, ..., 0.3] ∈ ℝ^d (d is embedding dimension)

Embedding matrix: E ∈ ℝ^(V×d)
Embedding lookup: embedding = E[token_id]
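
The lookup can be expressed directly with a trainable embedding table; a minimal PyTorch sketch, with illustrative vocabulary size and dimension:

import torch
import torch.nn as nn

V, d = 32000, 768                  # illustrative vocabulary size and embedding dimension
embedding = nn.Embedding(V, d)     # the embedding matrix E ∈ ℝ^(V×d)

token_ids = torch.tensor([[15, 2047, 9]])   # (batch=1, seq_len=3) token ids
vectors = embedding(token_ids)              # lookup: (1, 3, 768) dense vectors
print(vectors.shape)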

4.2.2 Token Embedding Implementation

Embedding Layer Design:

  • Lookup Table Mechanism: Essentially a large lookup table where each token ID corresponds to a vector
  • Parameter Sharing: Input and output embedding layers can share parameters to reduce the total parameter count (see the sketch after this list)
  • Initialization Strategies:
    # Common initialization methods (variance shown for the normal distributions)
    Xavier initialization: weight ~ N(0, 1/d)                  # std = 1/sqrt(d)
    He initialization: weight ~ N(0, 2/d)                      # std = sqrt(2/d)
    Uniform distribution: weight ~ U(-sqrt(3/d), sqrt(3/d))
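
The parameter sharing mentioned above (tying the input embedding and the output projection) can be sketched as follows; the sizes are illustrative:

import torch.nn as nn

vocab_size, d_model = 32000, 768
tok_emb = nn.Embedding(vocab_size, d_model)
lm_head = nn.Linear(d_model, vocab_size, bias=False)
lm_head.weight = tok_emb.weight    # output projection reuses the embedding matrix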

Embedding Layer Training:

  • End-to-End Learning: Embedding weights are trained together as model parameters
  • Gradient Updates: Only embedding vectors corresponding to used tokens receive gradient updates
  • Regularization Techniques:
    • Dropout: Randomly zero out some embedding dimensions
    • Weight Decay: L2 regularization to prevent overfitting
    • Gradient Clipping: Prevent gradient explosion

4.2.3 Detailed Analysis of Position Embedding

Importance of Position Information:

  • Order Sensitivity: Natural language is highly dependent on word order
  • Grammatical Structure: Position information is crucial for parsing grammatical relationships
  • Semantic Differences: Same words at different positions may have different meanings

Deep Dive into Absolute Position Encoding:

Sinusoidal Position Encoding:

PE(pos, 2i) = sin(pos / 10000^(2i/d_model))
PE(pos, 2i+1) = cos(pos / 10000^(2i/d_model))

Design Principle Analysis:

  • Frequency Decay: Different dimensions use different frequencies, from high to low
  • Uniqueness Guarantee: Each position has a unique encoding representation
  • Relative Position Relationships: By trigonometric identities, PE(pos + k) is a fixed linear transformation of PE(pos), so relative offsets are easy to represent
  • Extrapolation Capability: Can handle inputs longer than training sequences
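
A minimal PyTorch sketch of the sinusoidal encoding defined above (assumes an even d_model):

import torch

def sinusoidal_position_encoding(seq_len, d_model):
    # Returns PE ∈ ℝ^(seq_len × d_model) following the formulas above
    pos = torch.arange(seq_len, dtype=torch.float32).unsqueeze(1)   # (seq_len, 1)
    i = torch.arange(0, d_model, 2, dtype=torch.float32)            # even dimension indices
    angle = pos / (10000 ** (i / d_model))                          # (seq_len, d_model/2)
    pe = torch.zeros(seq_len, d_model)
    pe[:, 0::2] = torch.sin(angle)
    pe[:, 1::2] = torch.cos(angle)
    return pe

pe = sinusoidal_position_encoding(seq_len=128, d_model=512)
print(pe.shape)   # torch.Size([128, 512])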

Learned Position Encoding Comparison:

Advantages:
  - Adapts to task-specific position patterns
  - Can learn complex position relationships
  - Usually performs better within the trained length range

Disadvantages:
  - Cannot extrapolate to longer sequences
  - Requires additional parameter storage
  - Longer training time

Modern Position Encoding Schemes:

Rotary Position Embedding (RoPE):

  • Core Idea: Encode position information through rotation matrices in complex space
  • Mathematical Representation:
    f_q(x_m, m) = (W_q x_m) ⊗ e^(i m θ)
    f_k(x_n, n) = (W_k x_n) ⊗ e^(i n θ)
  • Advantages: Naturally encodes relative position relationships with good extrapolation performance
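
A simplified sketch of applying rotary embeddings to a tensor of queries or keys; the channel-pairing convention varies between implementations, and this version rotates interleaved channel pairs:

import torch

def apply_rope(x, base=10000.0):
    # x: (batch, seq_len, d) with even d; rotate each channel pair by a position-dependent angle
    batch, seq_len, d = x.shape
    theta = base ** (-torch.arange(0, d, 2, dtype=torch.float32) / d)             # per-pair frequencies
    angles = torch.outer(torch.arange(seq_len, dtype=torch.float32), theta)       # (seq_len, d/2)
    cos, sin = angles.cos(), angles.sin()
    x1, x2 = x[..., 0::2], x[..., 1::2]
    out = torch.empty_like(x)
    out[..., 0::2] = x1 * cos - x2 * sin
    out[..., 1::2] = x1 * sin + x2 * cos
    return out

q = torch.randn(2, 16, 64)
print(apply_rope(q).shape)   # (2, 16, 64); applied to Q and K before the dot product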

ALiBi (Attention with Linear Biases):

  • Mechanism: Directly adds linear bias to attention scores
  • Calculation:
    attention_score = QK^T + bias_matrix
    bias_matrix[i, j] = -m × |i - j|
  • Benefits: Simple and efficient with excellent length extrapolation capability
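
A small sketch of building the bias matrix described above; the head slopes follow a simple geometric schedule (the exact slope recipe in the ALiBi paper assumes power-of-two head counts):

import torch

def alibi_bias(seq_len, num_heads):
    # One linear distance penalty per head, added to the raw attention scores before softmax
    slopes = torch.tensor([2.0 ** (-8.0 * (h + 1) / num_heads) for h in range(num_heads)])
    pos = torch.arange(seq_len)
    distance = (pos[None, :] - pos[:, None]).abs()        # |i - j|, shape (seq_len, seq_len)
    return -slopes[:, None, None] * distance              # (num_heads, seq_len, seq_len)

bias = alibi_bias(seq_len=8, num_heads=4)
# scores = q @ k.transpose(-2, -1) / sqrt(d_k) + bias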

4.2.4 Embedding Dimension Selection Principles

Dimension Selection Trade-offs:

  • Expressiveness vs Computational Efficiency: Higher dimensions provide stronger expressiveness but increase computational cost
  • Overfitting Risk: Excessively high dimensions may lead to overfitting, especially on small datasets
  • Downstream Task Requirements: Different tasks have different requirements for representation capability

Empirical Rules:

  • Small Models (<100M parameters): 128-512 dimensions
  • Medium Models (100M-1B parameters): 512-1024 dimensions
  • Large Models (>1B parameters): 1024-4096 dimensions or higher

Relationship Between Dimension and Other Model Components:

Common settings:
  d_model = d_embedding
  d_ff = 4 × d_model (feed-forward hidden layer dimension)
  d_head = d_model / num_heads (dimension per attention head)
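
For example, with d_model = 4096 and num_heads = 32, these conventions give d_ff = 4 × 4096 = 16384 and d_head = 4096 / 32 = 128 (illustrative numbers, not drawn from any particular released model).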

4.3 Detailed Attention Mechanisms

4.3.1 Complete Self-Attention Computation Process

Step 1: Linear Transformations

# Input: X ∈ ℝ^(seq_len × d_model)
Q = X @ W_Q   # Query matrix
K = X @ W_K   # Key matrix
V = X @ W_V   # Value matrix
# Weight matrices: W_Q, W_K, W_V ∈ ℝ^(d_model × d_model)

Step 2: Attention Score Calculation

# Calculate attention scores
scores = Q @ K.T   # ∈ ℝ^(seq_len × seq_len)
# Scaling factor (d_k = d_model in this single-head setting)
scaled_scores = scores / sqrt(d_model)

Step 3: Mask Application (if needed)

# Causal mask (for GPT and other decoder-only models)
mask = torch.triu(torch.ones(seq_len, seq_len), diagonal=1)
masked_scores = scaled_scores.masked_fill(mask == 1, float('-inf'))

Step 4: Softmax Normalization

attention_weights = softmax(masked_scores, dim=-1)
# Each row sums to 1, forming a probability distribution over positions

Step 5: Weighted Aggregation

output = attention_weights @ V # ∈ ℝ^(seq_len × d_model)
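
Putting the five steps together, a minimal single-head sketch in PyTorch (weight initialization and shapes are illustrative):

import math
import torch
import torch.nn.functional as F

def self_attention(X, W_Q, W_K, W_V, causal=True):
    # X: (seq_len, d_model); W_Q/W_K/W_V: (d_model, d_model)
    Q, K, V = X @ W_Q, X @ W_K, X @ W_V                      # step 1: linear projections
    d_k = K.shape[-1]
    scores = (Q @ K.T) / math.sqrt(d_k)                      # step 2: scaled dot products
    if causal:                                               # step 3: causal mask
        seq_len = X.shape[0]
        mask = torch.triu(torch.ones(seq_len, seq_len), diagonal=1).bool()
        scores = scores.masked_fill(mask, float('-inf'))
    weights = F.softmax(scores, dim=-1)                      # step 4: normalize each row
    return weights @ V, weights                              # step 5: weighted aggregation

X = torch.randn(6, 64)
W = [torch.randn(64, 64) * 0.02 for _ in range(3)]
out, attn = self_attention(X, *W)
print(out.shape, attn.shape)   # torch.Size([6, 64]) torch.Size([6, 6])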

4.3.2 Design Principles of Scaled Dot-Product Attention

Necessity of Scaling Factor:

  • Gradient Stability: Prevents softmax function from entering saturation region causing gradient vanishing
  • Numerical Stability: Avoids numerical overflow caused by excessively large dot products
  • Theoretical Analysis:
    Assume the components of q_i and k_j are i.i.d. with mean 0 and variance 1.
    Then the dot product q_i · k_j has variance d_k.
    Dividing by √d_k restores unit variance, keeping the softmax inputs numerically stable.
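
A quick numeric check of this argument (illustrative; samples drawn from a standard normal):

import torch

d_k = 512
q = torch.randn(10000, d_k)          # components ~ N(0, 1)
k = torch.randn(10000, d_k)
dots = (q * k).sum(dim=-1)           # 10,000 sampled dot products
print(dots.var())                    # ≈ d_k = 512
print((dots / d_k ** 0.5).var())     # ≈ 1 after scaling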

Properties of Attention Distribution:

  • Sparsity: After softmax, attention typically concentrates on a few relevant positions
  • Smoothness: Temperature parameter (here √d_k) controls distribution smoothness
  • Interpretability: Attention weights provide intuitive explanation of model decisions

4.3.3 Parallel Implementation of Multi-Head Attention

Parallel Computation Implementation Tricks:

# Traditional implementation (serial over heads)
outputs = []
for head in range(num_heads):
    Q_h = Q @ W_Q[head]
    K_h = K @ W_K[head]
    V_h = V @ W_V[head]
    output_h = attention(Q_h, K_h, V_h)
    outputs.append(output_h)
concat_output = concat(outputs)

# Efficient implementation (parallel)
# Concatenate the weight matrices of all heads
W_Q_all = concat([W_Q[0], W_Q[1], ..., W_Q[h-1]], dim=1)
Q_all = X @ W_Q_all   # a single matrix multiplication yields Q for all heads
# Reshape to multi-head format
Q_heads = Q_all.reshape(batch_size, seq_len, num_heads, d_head)
Q_heads = Q_heads.transpose(1, 2)   # (batch, num_heads, seq_len, d_head)

Memory Efficiency Optimizations:

  • Flash Attention: Reduces memory usage through block computation
  • Gradient Checkpointing: Trades computation for memory by recomputing forward pass
  • Mixed Precision Training: Uses FP16 to reduce memory footprint
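
As a point of reference, PyTorch 2.x exposes a fused attention entry point, torch.nn.functional.scaled_dot_product_attention, which dispatches to FlashAttention-style or memory-efficient kernels when they are available; a minimal usage sketch with illustrative shapes:

import torch
import torch.nn.functional as F

device = "cuda" if torch.cuda.is_available() else "cpu"
batch, num_heads, seq_len, d_head = 2, 8, 1024, 64
q = torch.randn(batch, num_heads, seq_len, d_head, device=device)
k = torch.randn_like(q)
v = torch.randn_like(q)

# is_causal=True applies the causal mask inside the kernel, so the full
# (seq_len × seq_len) score matrix never has to be materialized explicitly.
out = F.scaled_dot_product_attention(q, k, v, is_causal=True)
print(out.shape)   # torch.Size([2, 8, 1024, 64])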

4.3.4 Attention Weight Visualization and Interpretation

Attention Heatmaps:

  • X-axis: Token positions in input sequence
  • Y-axis: Token positions in output sequence (or same sequence for self-attention)
  • Color Intensity: Magnitude of attention weights

Attention Patterns in Different Layers:

  • Shallow Layers: Mainly focus on local grammatical relationships (adjacent words)
  • Middle Layers: Capture medium-distance semantic relationships
  • Deep Layers: Model long-distance dependencies and high-level semantic relationships

Attention Head Specialization: Analyses of trained models reveal:

  • Syntactic Heads: Focus on subject-verb-object grammatical relationships
  • Coreference Heads: Resolve pronoun references and link them to noun mentions
  • Positional Heads: Mainly attend to relative position information
  • Semantic Heads: Capture semantic similarity and topic relevance

Limitations of Attention Interpretation:

  • Causality Issues: High attention weights don’t necessarily indicate causal relationships
  • Multi-Head Aggregation: Single head interpretation may be incomplete
  • Non-Linear Transformations: Subsequent FFN layers further transform attention output
  • Training Dynamics: Attention patterns constantly change during training

Practical Attention Analysis Methods:

# Attention weight statistics
def analyze_attention_patterns(attention_weights):
    # attention_weights: (batch, num_heads, seq_len, seq_len)

    # Attention locality (degree of focus on adjacent tokens)
    locality_score = compute_locality_bias(attention_weights)

    # Attention dispersion
    entropy = compute_attention_entropy(attention_weights)

    # Key attention connections
    important_connections = find_important_attention_links(attention_weights)

    return {
        'locality': locality_score,
        'entropy': entropy,
        'key_connections': important_connections,
    }
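
One possible (purely illustrative) implementation of the entropy helper referenced in the sketch above:

import torch

def compute_attention_entropy(attention_weights, eps=1e-9):
    # attention_weights: (batch, num_heads, seq_len, seq_len); each row is a probability distribution
    entropy = -(attention_weights * (attention_weights + eps).log()).sum(dim=-1)
    return entropy.mean(dim=(0, 2))   # average over batch and query positions → one score per head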