
Chapter 5: LLM Training Process

5.1 Pretraining

5.1.1 Goals and Significance of Pretraining

Core Goals of Pretraining:

  • Language Representation Learning: Learn general language representations from large amounts of unlabeled text
  • Knowledge Acquisition: Encode human knowledge into model parameters
  • Pattern Recognition: Learn statistical patterns and structures of language
  • Transfer Capability Development: Provide powerful initialization parameters for downstream tasks

Important Significance of Pretraining:

Breaking Data Bottlenecks:

  • Abundant Unlabeled Data: Vast amounts of unlabeled text data available on the internet
  • Reduced Annotation Dependency: Avoids the cost of separately annotating large amounts of data for each task
  • Strong Generalizability: A single pretraining run can serve many downstream tasks

Knowledge Accumulation and Transfer:

  • World Knowledge Learning: Learn factual knowledge from Wikipedia, news, books, etc.
  • Common Sense Reasoning: Learn common sense knowledge from daily life
  • Professional Knowledge: Exposure to professional terminology and concepts from various domains
  • Cross-Task Transfer: Transfer learned knowledge to new tasks

Emergence of Language Capabilities:

  • Grammar Understanding: Learn grammatical rules and syntactic structures of language
  • Semantic Understanding: Understand meanings of words, phrases, and sentences
  • Context Modeling: Learn long-distance dependencies and contextual relationships
  • Reasoning Abilities: Reasoning capabilities emerge during large-scale pretraining

5.1.2 Language Modeling Tasks

Autoregressive Language Modeling:

Basic Principle:

P(x1, x2, ..., xn) = ∏(i=1 to n) P(xi | x1, x2, ..., xi-1)

Training Objective:

  • Next Word Prediction: Given context, predict the probability distribution of the next word
  • Maximum Likelihood Estimation: Maximize the log-likelihood of training data
  • Mathematical Representation:
L = -∑(i=1 to N) log P(xi | x<i; θ)
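
As a minimal sketch of this objective (assuming logits of shape (batch, seq_len, vocab) from any causal LM), the loss is cross-entropy against the input sequence shifted by one position:

import torch.nn.functional as F

def causal_lm_loss(logits, input_ids):
    # Position i predicts token i+1: drop the last logit, drop the first label
    shift_logits = logits[:, :-1, :].contiguous()
    shift_labels = input_ids[:, 1:].contiguous()
    return F.cross_entropy(
        shift_logits.view(-1, shift_logits.size(-1)),
        shift_labels.view(-1),
    )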

GPT Series Approach:

  • Causal Masking: Ensures the model can only see information before the current position
  • Teacher Forcing: Conditions on ground-truth tokens during training; generation at inference is autoregressive
  • Advantages: Naturally suitable for text generation tasks
  • Limitations: Can only utilize unidirectional context information

Bidirectional Language Modeling:

BERT’s Masked Language Model (MLM):

  • Basic Idea: Randomly mask some tokens in the input and predict the masked content
  • Masking Strategy:
Randomly select 15% of tokens (see the masking sketch below):
  - 80%: replace with [MASK]
  - 10%: replace with a random token
  - 10%: keep unchanged
  • Training Objective:
L = -∑(i∈M) log P(xi | x\M; θ)

where M is the set of masked positions
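
A minimal sketch of the 80/10/10 strategy referenced above (mask_token_id and vocab_size are placeholders for the tokenizer's real values; unselected positions get label -100 so the loss ignores them):

import torch

def mask_tokens(input_ids, mask_token_id, vocab_size, mask_prob=0.15):
    # Operates in place; pass input_ids.clone() to keep the originals
    labels = input_ids.clone()
    selected = torch.bernoulli(torch.full(input_ids.shape, mask_prob)).bool()
    labels[~selected] = -100                     # only predict selected positions
    # 80% of selected positions -> [MASK]
    masked = torch.bernoulli(torch.full(input_ids.shape, 0.8)).bool() & selected
    input_ids[masked] = mask_token_id
    # 10% -> random token (half of the remaining 20%)
    rand_pos = torch.bernoulli(torch.full(input_ids.shape, 0.5)).bool() & selected & ~masked
    input_ids[rand_pos] = torch.randint(vocab_size, input_ids.shape)[rand_pos]
    # the final 10% keep their original token
    return input_ids, labels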

Advantages and Limitations:

  • Advantages: Can utilize bidirectional context information, better language understanding capability
  • Limitations: Inconsistent input forms between pretraining and fine-tuning stages (pretraining-fine-tuning gap)

5.1.3 Improvements to Masked Language Models

ELECTRA’s Replaced Token Detection:

  • Core Idea: Instead of predicting masked tokens, judge whether each token has been replaced
  • Generator-Discriminator Architecture:
Generator: a small MLM that proposes replacement tokens
Discriminator: judges whether each position's token is the original
  • Advantages: All positions participate in training, improving sample efficiency

RoBERTa’s Optimization Strategies:

  • Remove NSP Task: Found that Next Sentence Prediction provides limited performance improvement
  • Dynamic Masking: Use different masking patterns for each epoch
  • Longer Training: Use more data and longer training time
  • Larger Batches: Increase batch size to improve training stability

DeBERTa’s Decoupled Attention:

  • Position Information Separation: Process content and position information separately
  • Relative Position Encoding: Use relative positions instead of absolute positions
  • Enhanced Mask Decoder: Better integrate position information during fine-tuning

5.1.4 Training Data Scale and Quality Requirements

Data Scale Requirements:

Relationship Between Parameter Scale and Data Volume:

  • Empirical Rule: Data volume is typically 20-100 times the parameter count
  • Specific Examples:
GPT-3 (175B parameters): ~300B tokens
GPT-4 (estimated >1T parameters): >13T tokens
PaLM (540B parameters): ~780B tokens

Scaling Laws:

  • Chinchilla Rule: For a given compute budget, should balance model size and training data volume
  • Optimal Ratio: Each parameter needs approximately 20 tokens of training data
  • Compute Budget Allocation:
Optimal model size ∝ (compute budget)^0.5
Optimal data volume ∝ (compute budget)^0.5
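
Combining the compute approximation C ≈ 6ND with the ~20 tokens-per-parameter rule gives a quick compute-optimal estimate; a minimal sketch (the example budget of ~5.88×10^23 FLOPs is roughly Chinchilla's own):

def chinchilla_optimal(compute_flops, tokens_per_param=20):
    # C ≈ 6ND and D ≈ 20N  =>  N ≈ sqrt(C / 120), D = 20N
    n_params = (compute_flops / (6 * tokens_per_param)) ** 0.5
    return n_params, tokens_per_param * n_params

n, d = chinchilla_optimal(5.88e23)
print(f"params ≈ {n/1e9:.0f}B, tokens ≈ {d/1e12:.1f}T")  # ≈ 70B / 1.4T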

Data Quality Requirements:

Data Source Diversity:

  • Web Text: Common Crawl and similar web scrapes (60-70%)
  • Books and Literature: Books1, Books2, Project Gutenberg (10-15%)
  • Academic Literature: ArXiv, PubMed papers (5-10%)
  • News Articles: News websites, RSS feeds (5-10%)
  • Reference Materials: Wikipedia, encyclopedias (2-5%)
  • Code Repositories: GitHub, StackOverflow (5-10%)

Data Cleaning Strategies:

Content Filtering:

# Common data cleaning steps
def clean_text_data(raw_text):
    # 1. Language detection and filtering
    if not is_target_language(raw_text):
        return None
    # 2. Quality scoring
    quality_score = compute_quality_score(raw_text)
    if quality_score < threshold:
        return None
    # 3. Deduplication
    if is_duplicate(raw_text):
        return None
    # 4. Privacy (PII) filtering
    filtered_text = remove_pii(raw_text)
    # 5. Harmful content filtering
    if contains_harmful_content(filtered_text):
        return None
    return filtered_text

Quality Metrics:

  • Language Quality: Grammatical correctness, spelling error rate
  • Information Density: Proportion of meaningful content
  • Duplication Level: Avoid large amounts of repetitive content
  • Diversity: Diversity in topics, styles, and sources

5.1.5 Computational Resources and Hardware Requirements

Hardware Architecture Selection:

GPU Clusters:

  • Mainstream Choice: NVIDIA A100, H100 GPUs
  • Memory Requirements:
Parameter storage: 4 bytes per parameter (FP32) or 2 bytes (FP16)
Gradient storage: same size as the parameters
Optimizer state: Adam requires 2-3x the parameter size
Total training memory ≈ parameter count × 8-16 bytes (depending on precision)
  • Practical Examples:
7B parameter model: ~56-112GB
175B parameter model: ~1.4-2.8TB
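
A back-of-the-envelope estimator matching these figures (a sketch; activation memory, which depends on batch size and sequence length, is ignored here):

def training_memory_gb(n_params, bytes_per_param=16):
    # Mixed-precision Adam: ~2 (fp16 weights) + 2 (fp16 grads)
    # + 12 (fp32 master weights + two Adam moment buffers) ≈ 16 bytes/param
    return n_params * bytes_per_param / 1e9

print(training_memory_gb(7e9))    # ~112 GB (upper end of the 7B estimate)
print(training_memory_gb(175e9))  # ~2800 GB ≈ 2.8 TB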

Distributed Training Strategies:

Data Parallelism:

  • Principle: Each GPU maintains complete model copy, processes different data batches
  • Communication Requirements: Gradients must be synchronized every step; traffic scales with the parameter count
  • Applicable Scenarios: When model fits in single GPU

Model Parallelism:

  • Tensor Parallelism: Distribute single operations across multiple GPUs (a runnable sketch follows this list)
# Tensor parallelism for a linear layer
# Original: Y = XW
# Split W column-wise into W1, W2: Y = [XW1, XW2]
  • Pipeline Parallelism: Distribute different layers across different GPUs
  • Applicable Scenarios: When model is too large for single GPU
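
A minimal column-parallel sketch of the comment above, run on one device for clarity (in a real system such as Megatron-LM, each shard lives on a different GPU and the outputs are all-gathered):

import torch

def column_parallel_linear(x, W, n_shards=2):
    # Split W's output (column) dimension across shards: Y = [XW1, XW2]
    shards = W.chunk(n_shards, dim=1)    # one shard per GPU in practice
    partial = [x @ w for w in shards]    # independent matmuls, no communication
    return torch.cat(partial, dim=-1)    # gather along the feature dimension

x, W = torch.randn(4, 8), torch.randn(8, 6)
assert torch.allclose(column_parallel_linear(x, W), x @ W, atol=1e-5)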

3D Parallelism:

  • Combine Three Parallel Strategies: Data parallelism + Tensor parallelism + Pipeline parallelism
  • DeepSpeed ZeRO: Zero Redundancy Optimizer, reduces memory usage
  • Megatron-LM: NVIDIA’s large-scale parallel training framework

Training Time Estimation:

Computational Analysis:

FLOPs per token ≈ 6 × N  (N = parameter count)
Total FLOPs ≈ 6 × N × D  (D = number of training tokens)
Training time ≈ Total FLOPs / (GPU count × per-GPU FLOP/s × utilization)

Real-world Example:

GPT-3 training:
  - Parameters: 175B
  - Training data: 300B tokens
  - Total compute: ~3.14×10^23 FLOPs
  - Hardware: ~1,000 A100-class GPUs (per published estimates)
  - Training time: ~34 days
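
Plugging these numbers into the formula above roughly reproduces the figures; a minimal sketch, where the 35% utilization and the per-GPU peak (312 TFLOP/s, an A100 at BF16) are assumed values:

def training_days(n_params, n_tokens, n_gpus=1024, flops_per_gpu=312e12, utilization=0.35):
    total_flops = 6 * n_params * n_tokens                 # ≈ 6ND
    seconds = total_flops / (n_gpus * flops_per_gpu * utilization)
    return total_flops, seconds / 86400

total, days = training_days(175e9, 300e9)
print(f"{total:.2e} FLOPs, ~{days:.0f} days")  # 3.15e+23 FLOPs, ~33 days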

Cost Estimation:

  • Cloud Computing Cost: AWS p4d.24xlarge ~$32/hour
  • GPT-3 Level Model: Estimated training cost $4-10M
  • Open Source Alternatives: Use academic resources and open source tools to reduce costs

5.2 Fine-tuning

5.2.1 Supervised Fine-tuning

Basic Principles of Fine-tuning:

  • Parameter Initialization: Use pretrained model parameters as initialization
  • Task Adaptation: Continue training on task-specific data
  • Learning Rate Setting: Usually use smaller learning rates to avoid catastrophic forgetting
  • Freezing Strategy: Can choose to freeze some layers, only train top layers

Full Parameter Fine-tuning vs Parameter-Efficient Fine-tuning:

Full Parameter Fine-tuning:

# Full fine-tuning pseudocode: every parameter is updated
for batch in task_dataloader:
    # Forward pass
    logits = model(batch.input_ids)
    loss = criterion(logits, batch.labels)
    # Backward pass, update all parameters
    loss.backward()
    optimizer.step()       # updates all model parameters
    optimizer.zero_grad()

Parameter-Efficient Fine-tuning Methods:

LoRA (Low-Rank Adaptation):

  • Core Idea: Add low-rank matrices on top of pretrained parameters
  • Mathematical Representation:
W' = W + BA
where B ∈ ℝ^(d×r), A ∈ ℝ^(r×k), r ≪ min(d, k)
  • Advantages: Only a small fraction of parameters is trained (usually <1%), significantly reducing compute and storage requirements
  • Implementation:
# LoRA layer implementation
class LoRALayer(nn.Module):
    def __init__(self, in_features, out_features, rank=4, alpha=16):
        super().__init__()
        self.lora_A = nn.Linear(in_features, rank, bias=False)   # A: d -> r
        self.lora_B = nn.Linear(rank, out_features, bias=False)  # B: r -> k
        nn.init.zeros_(self.lora_B.weight)  # zero init: W' = W at the start
        self.scaling = alpha / rank         # standard LoRA scaling α/r

    def forward(self, x):
        # Low-rank update (BA)x, to be added to the frozen layer's output
        return self.lora_B(self.lora_A(x)) * self.scaling
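
In use, the update is added to the output of the frozen pretrained projection (hypothetical wiring; base_linear stands for the original nn.Linear with requires_grad disabled):

h = base_linear(x) + lora_layer(x)  # W'x = Wx + (BA)x · scaling; only A, B train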

Adapter Methods:

  • Structure: Insert small feedforward networks between Transformer layers
  • Design Principle:
Adapter(x) = x + Up(ReLU(Down(LayerNorm(x))))
  • Advantages: Keep original model unchanged, only train adapter parameters
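
A minimal module following this design (bottleneck_dim is an assumed hyperparameter, much smaller than the hidden size):

import torch.nn as nn

class Adapter(nn.Module):
    def __init__(self, hidden_dim, bottleneck_dim=64):
        super().__init__()
        self.norm = nn.LayerNorm(hidden_dim)
        self.down = nn.Linear(hidden_dim, bottleneck_dim)  # down-projection
        self.act = nn.ReLU()
        self.up = nn.Linear(bottleneck_dim, hidden_dim)    # up-projection

    def forward(self, x):
        # Residual connection preserves the frozen pretrained representation
        return x + self.up(self.act(self.down(self.norm(x))))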

Prefix Tuning:

  • Method: Only train input prefix embeddings, freeze model parameters
  • Applicable Scenarios: Particularly suitable for generation tasks
  • Simple Implementation: Add trainable prefix tokens before input
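
A simplified sketch of the idea (true prefix tuning injects trainable key/value prefixes at every attention layer; prepending trainable embeddings to the input, as below, is the closely related prompt-tuning variant):

import torch
import torch.nn as nn

class SoftPrefix(nn.Module):
    def __init__(self, prefix_len, hidden_dim):
        super().__init__()
        self.prefix = nn.Parameter(torch.randn(prefix_len, hidden_dim) * 0.02)

    def forward(self, input_embeds):
        # (batch, seq, hidden) -> (batch, prefix_len + seq, hidden)
        prefix = self.prefix.unsqueeze(0).expand(input_embeds.size(0), -1, -1)
        return torch.cat([prefix, input_embeds], dim=1)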

5.2.2 Task-Specific Data Preparation

Data Formatting Strategies:

Classification Task Data Format:

# Text classification data example
{
  "text": "This movie is really great, the actors performed excellently, and the plot is engaging.",
  "label": "positive",
  "metadata": {
    "domain": "movie_review",
    "source": "imdb"
  }
}

Generation Task Data Format:

# QA task data example
{
  "instruction": "Please answer the following question",
  "input": "What is machine learning?",
  "output": "Machine learning is a branch of artificial intelligence that enables computers to automatically learn and improve from data through algorithms, without being explicitly programmed.",
  "task_type": "qa"
}

Multi-turn Dialogue Data Format:

# Dialogue data example
{
  "conversation": [
    {"role": "user", "content": "How's the weather today?"},
    {"role": "assistant", "content": "I cannot access real-time weather information. I recommend checking weather apps or websites for accurate weather forecasts."},
    {"role": "user", "content": "Can you recommend some good weather apps?"}
  ]
}

Data Augmentation Techniques:

Text Data Augmentation:

  • Back Translation: Text→Foreign Language→Back Translation, increase expression diversity
  • Synonym Replacement: Use synonym dictionaries for vocabulary substitution
  • Sentence Restructuring: Change sentence structure while preserving semantics
  • Noise Injection: Add spelling errors, punctuation changes, etc.

Quality Control:

def quality_check(sample):
    checks = [
        is_language_correct(sample['text']),
        is_label_valid(sample['label']),
        is_length_appropriate(sample['text']),
        is_content_appropriate(sample['text']),
    ]
    return all(checks)

5.2.3 Learning Rate Scheduling and Hyperparameter Optimization

Learning Rate Scheduling Strategies:

Linear Decay Scheduling:

def linear_schedule(current_step, total_steps, peak_lr, warmup_steps, min_lr=0.0):
    if current_step < warmup_steps:
        # Warmup phase: ramp linearly from 0 to peak_lr
        return peak_lr * current_step / warmup_steps
    # Linear decay phase: interpolate from peak_lr down to min_lr
    progress = (current_step - warmup_steps) / (total_steps - warmup_steps)
    return peak_lr * (1 - progress) + min_lr * progress

Cosine Annealing Scheduling:

import math

def cosine_schedule(current_step, total_steps, peak_lr, warmup_steps, min_lr=0.0):
    if current_step < warmup_steps:
        return peak_lr * current_step / warmup_steps
    progress = (current_step - warmup_steps) / (total_steps - warmup_steps)
    return min_lr + (peak_lr - min_lr) * (1 + math.cos(math.pi * progress)) / 2

AdamW Optimizer Configuration:

# Recommended AdamW configuration (torch.optim.AdamW)
optimizer = AdamW(
    model.parameters(),
    lr=5e-5,             # base learning rate
    betas=(0.9, 0.999),  # momentum parameters
    eps=1e-8,            # numerical stability
    weight_decay=0.01,   # decoupled weight decay
)

Key Hyperparameter Selection:

Learning Rate Range:

  • Full Parameter Fine-tuning: 1e-5 to 5e-5
  • LoRA Fine-tuning: 1e-4 to 1e-3
  • Selection Principle: Larger models use smaller learning rates

Batch Size:

  • Gradient Accumulation: Use when GPU memory is insufficient (see the sketch after this list)
  • Effective Batch Size: batch_size × gradient_accumulation_steps
  • Recommended Range: 16-128 samples
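
A sketch of the accumulation loop mentioned above (dataloader, model, and optimizer are assumed to exist, and model is assumed to return an object with a .loss field, as Hugging Face models do):

accum_steps = 8  # effective batch size = per-device batch size × 8

optimizer.zero_grad()
for step, batch in enumerate(dataloader):
    loss = model(**batch).loss / accum_steps  # average across accumulated steps
    loss.backward()                           # gradients add up in .grad buffers
    if (step + 1) % accum_steps == 0:
        optimizer.step()
        optimizer.zero_grad()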

Training Epochs:

  • Small Datasets: 10-50 epochs
  • Large Datasets: 2-5 epochs
  • Early Stopping Strategy: Monitor validation set performance to prevent overfitting

5.2.4 Overfitting Prevention Strategies

Regularization Techniques:

Dropout Strategy:

# Dropout rate settings for different components
config = {
    "attention_dropout": 0.1,   # attention weight dropout
    "hidden_dropout": 0.1,      # hidden layer dropout
    "embedding_dropout": 0.1,   # embedding layer dropout
    "classifier_dropout": 0.3,  # classifier head dropout (usually higher)
}

Weight Decay:

  • L2 Regularization: Add penalty term for weight squared sum
  • Selective Application: Usually not applied to bias and LayerNorm parameters
  • Decay Rate Selection: 0.01-0.1 range

Data-Related Strategies:

Validation Set Split:

from torch.utils.data import random_split

# Dataset split example (80/10/10)
train_size = int(0.8 * len(dataset))
val_size = int(0.1 * len(dataset))
test_size = len(dataset) - train_size - val_size
train_data, val_data, test_data = random_split(
    dataset, [train_size, val_size, test_size]
)

Cross Validation:

  • K-Fold Cross Validation: Particularly suitable for small datasets
  • Stratified Sampling: Ensure uniform distribution of categories across folds
  • Time Series Data: Use time-sensitive split methods

Early Stopping and Checkpoints:

# Early stopping implementation (assumes a higher val_score is better)
class EarlyStopping:
    def __init__(self, patience=7, min_delta=0.0):
        self.patience = patience
        self.min_delta = min_delta
        self.counter = 0
        self.best_score = None

    def __call__(self, val_score):
        if self.best_score is None:
            self.best_score = val_score
        elif val_score < self.best_score + self.min_delta:
            self.counter += 1            # no sufficient improvement
            if self.counter >= self.patience:
                return True              # stop training
        else:
            self.best_score = val_score
            self.counter = 0
        return False

5.3 Advanced Training Techniques

5.3.1 Reinforcement Learning from Human Feedback (RLHF)

Complete RLHF Pipeline:

Stage 1: Supervised Fine-tuning (SFT):

  • Goal: Supervised learning on high-quality instruction-response pairs
  • Data Requirements: High-quality dialogue data with human annotations
  • Training Objective: Maximize conditional probability P(response|instruction)

Stage 2: Reward Model Training:

  • Data Collection: Let SFT model generate multiple responses, human ranking
  • Model Architecture: Usually use same model as SFT, but output scalar reward scores
  • Training Objective:
L_reward = -E[log σ(r(x, y_chosen) - r(x, y_rejected))]

where r(x, y) is the reward model's scalar output and σ is the sigmoid function

Stage 3: PPO Reinforcement Learning:

  • Objective Function:
L_PPO = E[min(r(θ)A, clip(r(θ), 1-ε, 1+ε)A)]
where r(θ) = π_θ(y|x) / π_θ_old(y|x)
  • KL Divergence Constraint: Prevent new policy from deviating too far from original model
  • Complete Objective:
L = L_PPO - β × KL(π_θ || π_SFT)
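
A minimal sketch of the clipped term (per-token log-probabilities and advantages are assumed to be precomputed):

import torch

def ppo_clip_loss(logp_new, logp_old, advantages, eps=0.2):
    ratio = torch.exp(logp_new - logp_old)                       # r(θ)
    unclipped = ratio * advantages
    clipped = torch.clamp(ratio, 1 - eps, 1 + eps) * advantages
    # PPO maximizes the min of the two; negate for a gradient-descent minimizer
    return -torch.min(unclipped, clipped).mean()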

Key Technical Details in RLHF:

Reward Model Design:

class RewardModel(nn.Module):
    def __init__(self, base_model):
        super().__init__()
        self.base_model = base_model
        self.reward_head = nn.Linear(base_model.config.hidden_size, 1)

    def forward(self, input_ids, attention_mask):
        outputs = self.base_model(input_ids, attention_mask)
        # Use the last token's hidden state as the sequence summary
        last_hidden_state = outputs.last_hidden_state[:, -1, :]
        reward = self.reward_head(last_hidden_state)
        return reward
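
With the model above, the Stage-2 pairwise ranking loss can be sketched as follows (batch tensors for the chosen and rejected responses are assumed):

import torch.nn.functional as F

def reward_loss(reward_model, chosen_ids, chosen_mask, rejected_ids, rejected_mask):
    r_chosen = reward_model(chosen_ids, chosen_mask)
    r_rejected = reward_model(rejected_ids, rejected_mask)
    # Bradley-Terry ranking: push the chosen reward above the rejected one
    return -F.logsigmoid(r_chosen - r_rejected).mean()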

PPO Training Challenges:

  • Reward Hacking: Model may learn to deceive the reward model
  • Training Instability: RL training is more unstable than supervised learning
  • High Computational Overhead: Need to maintain multiple model copies simultaneously

5.3.2 Instruction Tuning

Core Philosophy of Instruction Tuning:

  • Unified Format: Unify various tasks into instruction-input-output format
  • Generalization Capability: Improve model’s understanding and execution of new instructions
  • Zero-shot Performance: Execute new tasks without task-specific fine-tuning

Instruction Data Construction:

Task Diversity:

# Instruction template examples
instruction_templates = {
    "classification": [
        "Classify the following text into {categories}",
        "Which category does this text belong to: {categories}",
        "Please determine the category of the following content",
    ],
    "generation": [
        "Generate {content_type} based on the following description",
        "Please create a {content_type} about {topic}",
        "Continue the following content",
    ],
    "qa": [
        "Answer the following question",
        "Answer the question based on given information",
        "Please explain {concept}",
    ],
}

Negative Sample Construction:

  • Refusal to Answer: Learn to refuse inappropriate or beyond-capability requests
  • Clarification Queries: Learn to seek clarification for ambiguous instructions
  • Safety Boundaries: Learn to identify and refuse harmful requests

Chain-of-Thought (CoT) Training:

# CoT training data example
{
  "instruction": "Solve this math problem",
  "input": "Tom has 15 apples, gave 3 to Mary, then bought 8 more. How many does he have now?",
  "output": "Let me calculate step by step:\n1. Tom initially had 15 apples\n2. After giving 3 to Mary: 15 - 3 = 12 apples\n3. After buying 8 more: 12 + 8 = 20 apples\nSo Tom now has 20 apples."
}

5.3.3 Alignment Techniques

Multiple Dimensions of Alignment:

Helpfulness:

  • Task Completion Capability: Accurately understand and execute user instructions
  • Information Accuracy: Provide correct, up-to-date information
  • Response Completeness: Give comprehensive and relevant answers

Harmlessness:

  • Content Safety: Avoid generating harmful, violent, discriminatory content
  • Privacy Protection: Do not leak personal privacy information
  • Legal Compliance: Comply with relevant laws and regulations

Honesty:

  • Knowledge Boundaries: Acknowledge what is not known
  • Uncertainty Expression: Appropriately express uncertainty
  • Avoid Hallucination: Reduce generation of false information

Constitutional AI Method:

  • Self-Critique: Let model evaluate its own output
  • Self-Correction: Improve responses based on critique
  • Recursive Improvement: Multi-round self-improvement process

5.3.4 In-Context Learning (ICL)

ICL Working Mechanism:

Few-shot Learning:

# Few-shot prompting example
prompt = """
Please translate the following sentences to English:

Chinese: 今天天气很好。
English: The weather is nice today.

Chinese: 我喜欢读书。
English: I like reading books.

Chinese: 这个问题很复杂。
English: """

Key Factors in ICL:

Example Selection Strategies:

  • Similarity Selection: Choose examples most similar to the target task (a selection sketch follows this list)
  • Diversity Balance: Ensure examples cover different situations
  • Quality Control: Use high-quality examples
  • Order Effects: Order of examples affects performance
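
A sketch of similarity-based selection (embed is an assumed function mapping text to a vector, e.g. any sentence-embedding model; each pool entry is assumed to be a dict with an "input" field):

import numpy as np

def select_examples(query, pool, embed, k=3):
    # Rank candidate demonstrations by cosine similarity to the query
    q = embed(query)
    q = q / np.linalg.norm(q)
    vecs = np.stack([embed(ex["input"]) for ex in pool])
    vecs = vecs / np.linalg.norm(vecs, axis=1, keepdims=True)
    order = np.argsort(vecs @ q)[::-1]
    return [pool[i] for i in order[:k]]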

Prompt Engineering Techniques:

# Structured prompt template
def create_prompt(task_description, examples, query):
    prompt = f"Task: {task_description}\n\n"
    for i, (input_text, output_text) in enumerate(examples, 1):
        prompt += f"Example {i}:\nInput: {input_text}\nOutput: {output_text}\n\n"
    prompt += f"Now please process:\nInput: {query}\nOutput: "
    return prompt

Theoretical Understanding of ICL:

  • Gradient Update Simulation: ICL may implicitly simulate a gradient-descent-like update
  • Pattern Matching: Learn input-output mapping patterns through examples
  • Meta-Learning: Learn the ability to learn during pretraining