Chapter 5: LLM Training Process
5.1 Pretraining
5.1.1 Goals and Significance of Pretraining
Core Goals of Pretraining:
- Language Representation Learning: Learn general language representations from large amounts of unlabeled text
- Knowledge Acquisition: Encode human knowledge into model parameters
- Pattern Recognition: Learn statistical patterns and structures of language
- Transfer Capability Development: Provide powerful initialization parameters for downstream tasks
Important Significance of Pretraining:
Breaking Data Bottlenecks:
- Abundant Unlabeled Data: Vast amounts of unlabeled text data available on the internet
- Reduced Annotation Dependency: Avoids the cost of separately annotating large amounts of data for each task
- Strong Generalizability: One pretraining can serve multiple downstream tasks
Knowledge Accumulation and Transfer:
- World Knowledge Learning: Learn factual knowledge from Wikipedia, news, books, etc.
- Common Sense Reasoning: Learn common sense knowledge from daily life
- Professional Knowledge: Exposure to professional terminology and concepts from various domains
- Cross-Task Transfer: Transfer learned knowledge to new tasks
Emergence of Language Capabilities:
- Grammar Understanding: Learn grammatical rules and syntactic structures of language
- Semantic Understanding: Understand meanings of words, phrases, and sentences
- Context Modeling: Learn long-distance dependencies and contextual relationships
- Reasoning Abilities: Multi-step reasoning capabilities emerge as pretraining scale grows
5.1.2 Language Modeling Tasks
Autoregressive Language Modeling:
Basic Principle:
P(x1, x2, ..., xn) = ∏(i=1 to n) P(xi | x1, x2, ..., xi-1)
Training Objective:
- Next Word Prediction: Given context, predict the probability distribution of the next word
- Maximum Likelihood Estimation: Maximize the log-likelihood of training data
- Mathematical Representation:
L = -∑(i=1 to N) log P(xi | x<i; θ)
GPT Series Approach:
- Causal Masking: Ensures the model can only see information before the current position
- Teacher Forcing: Uses true sequences during training, autoregressive generation during inference (see the sketch after this list)
- Advantages: Naturally suitable for text generation tasks
- Limitations: Can only utilize unidirectional context information
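To make the objective concrete, here is a minimal PyTorch sketch of the autoregressive loss with teacher forcing; `model` is assumed to return per-token logits of shape (batch, seq_len, vocab_size):
import torch
import torch.nn.functional as F

def causal_lm_loss(model, input_ids):
    # Teacher forcing: the target at position i is the token at position i+1
    logits = model(input_ids)           # (batch, seq_len, vocab_size)
    shift_logits = logits[:, :-1, :]    # predictions for the next token
    shift_labels = input_ids[:, 1:]     # ground-truth next tokens
    # Negative log-likelihood averaged over all predicted positions
    return F.cross_entropy(
        shift_logits.reshape(-1, shift_logits.size(-1)),
        shift_labels.reshape(-1)
    )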
Bidirectional Language Modeling:
BERT’s Masked Language Model (MLM):
- Basic Idea: Randomly mask some tokens in the input and predict the masked content
- Masking Strategy:
Randomly select 15% of tokens:
- 80% replace with [MASK]
- 10% replace with random token
- 10% keep unchanged
- Training Objective:
L = -∑(i∈M) log P(xi | x\M; θ), where M is the set of masked positions
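A minimal sketch of the 80/10/10 masking strategy above, assuming `mask_token_id` and `vocab_size` come from the tokenizer:
import torch

def mask_tokens(input_ids, mask_token_id, vocab_size, mask_prob=0.15):
    labels = input_ids.clone()
    # Randomly select 15% of positions as prediction targets
    masked = torch.rand(input_ids.shape) < mask_prob
    labels[~masked] = -100  # ignore unmasked positions in the loss
    # 80% of selected positions -> [MASK]
    replace = masked & (torch.rand(input_ids.shape) < 0.8)
    input_ids[replace] = mask_token_id
    # 10% -> random token (half of the remaining 20%)
    random_pos = masked & ~replace & (torch.rand(input_ids.shape) < 0.5)
    random_ids = torch.randint(vocab_size, input_ids.shape)
    input_ids[random_pos] = random_ids[random_pos]
    # The final 10% keep the original token
    return input_ids, labels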
Advantages and Limitations:
- Advantages: Can utilize bidirectional context information, better language understanding capability
- Limitations: Inconsistent input forms between pretraining and fine-tuning stages (pretraining-fine-tuning gap)
5.1.3 Improvements to Masked Language Models
ELECTRA’s Replaced Token Detection:
- Core Idea: Instead of predicting masked tokens, judge whether each token has been replaced
- Generator-Discriminator Architecture:
Generator: Small MLM model that generates replacement tokens
Discriminator: Judges whether each position's token is the original token
- Advantages: All positions participate in training, improving sample efficiency (see the loss sketch below)
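A sketch of the replaced-token-detection loss: binary cross-entropy over every position, where `disc_logits` are the discriminator's per-token scores and the labels mark positions the generator actually changed (names are illustrative):
import torch.nn.functional as F

def rtd_loss(disc_logits, original_ids, corrupted_ids):
    # Label = 1 where the generator replaced the original token
    is_replaced = (corrupted_ids != original_ids).float()
    # Unlike MLM, every position contributes to the loss
    return F.binary_cross_entropy_with_logits(disc_logits, is_replaced)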
RoBERTa’s Optimization Strategies:
- Remove NSP Task: Found that Next Sentence Prediction provides limited performance improvement
- Dynamic Masking: Use different masking patterns for each epoch
- Longer Training: Use more data and longer training time
- Larger Batches: Increase batch size to improve training stability
DeBERTa’s Decoupled Attention:
- Position Information Separation: Process content and position information separately
- Relative Position Encoding: Use relative positions instead of absolute positions
- Enhanced Mask Decoder: Better integrate position information during fine-tuning
5.1.4 Training Data Scale and Quality Requirements
Data Scale Requirements:
Relationship Between Parameter Scale and Data Volume:
- Empirical Rule: Training token counts range from roughly 2x the parameter count (GPT-3 era) to 20x or more (Chinchilla-optimal and newer models)
- Specific Examples:
GPT-3 (175B parameters): ~300B tokens
GPT-4 (estimated >1T parameters): >13T tokens
PaLM (540B parameters): ~780B tokens
Scaling Laws:
- Chinchilla Rule: For a given compute budget, should balance model size and training data volume
- Optimal Ratio: Each parameter needs approximately 20 tokens of training data
- Compute Budget Allocation:
Optimal model size ∝ (compute budget)^0.5
Optimal data volume ∝ (compute budget)^0.5
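A back-of-the-envelope calculator for these heuristics, assuming total compute ≈ 6ND and the ~20 tokens-per-parameter ratio:
import math

def chinchilla_optimal(compute_budget_flops, tokens_per_param=20):
    # C ≈ 6 * N * D with D ≈ 20 * N  =>  N ≈ sqrt(C / 120)
    n_params = math.sqrt(compute_budget_flops / (6 * tokens_per_param))
    n_tokens = tokens_per_param * n_params
    return n_params, n_tokens

# Example: a 1e23 FLOP budget suggests roughly a 29B-parameter model
# trained on roughly 580B tokens
n, d = chinchilla_optimal(1e23)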
Data Quality Requirements:
Data Source Diversity:
- Web Text: Common Crawl and other web crawls (60-70%)
- Books and Literature: Books1, Books2, Project Gutenberg (10-15%)
- Academic Literature: ArXiv, PubMed papers (5-10%)
- News Articles: News websites, RSS feeds (5-10%)
- Reference Materials: Wikipedia, encyclopedias (2-5%)
- Code Repositories: GitHub, StackOverflow (5-10%)
Data Cleaning Strategies:
Content Filtering:
# Common data cleaning steps (sketch; helpers such as is_target_language,
# compute_quality_score, is_duplicate, remove_pii, and
# contains_harmful_content are assumed to be implemented elsewhere)
def clean_text_data(raw_text, threshold=0.5):
    # 1. Language detection and filtering
    if not is_target_language(raw_text):
        return None
    # 2. Quality scoring
    if compute_quality_score(raw_text) < threshold:
        return None
    # 3. Deduplication
    if is_duplicate(raw_text):
        return None
    # 4. Privacy information (PII) filtering
    filtered_text = remove_pii(raw_text)
    # 5. Harmful content filtering
    if contains_harmful_content(filtered_text):
        return None
    return filtered_text
Quality Metrics:
- Language Quality: Grammatical correctness, spelling error rate
- Information Density: Proportion of meaningful content
- Duplication Level: Avoid large amounts of repetitive content
- Diversity: Diversity in topics, styles, and sources
5.1.5 Computational Resources and Hardware Requirements
Hardware Architecture Selection:
GPU Clusters:
- Mainstream Choice: NVIDIA A100, H100 GPUs
- Memory Requirements:
Parameter storage: 4 bytes per parameter (FP32) or 2 bytes (FP16)
Gradient storage: Same size as parameters
Optimizer state: Adam requires 2-3x parameter size
Total memory requirement ≈ parameter count × (8-16) bytes
- Practical Examples:
7B parameter model: requires ~56-112GB memory
175B parameter model: requires ~1.4-2.8TB memory
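A quick estimator matching the accounting above; the per-component byte counts are rough assumptions for FP32 training with Adam, and activations are ignored:
def training_memory_gb(n_params, bytes_params=4, bytes_grads=4, bytes_optim=8):
    # Adam keeps two moment estimates -> ~8 bytes/param assumed here;
    # mixed-precision setups shift these numbers around
    total_bytes = n_params * (bytes_params + bytes_grads + bytes_optim)
    return total_bytes / 1e9

print(training_memory_gb(7e9))    # ~112 GB for a 7B model
print(training_memory_gb(175e9))  # ~2800 GB (2.8 TB) for a 175B model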
Distributed Training Strategies:
Data Parallelism:
- Principle: Each GPU maintains complete model copy, processes different data batches
- Communication Requirements: Need to synchronize gradients, communication volume equals parameter count
- Applicable Scenarios: When model fits in single GPU
Model Parallelism:
- Tensor Parallelism: Distribute single operations across multiple GPUs (see the sketch after this list)
# Tensor parallelism example for linear layer
# Original: Y = XW
# Split W into W1, W2 column-wise: Y = [XW1, XW2]
- Pipeline Parallelism: Distribute different layers across different GPUs
- Applicable Scenarios: When model is too large for single GPU
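A conceptual sketch of column-wise tensor parallelism for a linear layer, simulated on a single device for clarity; real implementations (e.g., Megatron-LM) place each shard on its own GPU and add communication collectives:
import torch

def column_parallel_linear(x, W, n_shards=2):
    # Split the weight matrix column-wise across "devices"
    shards = torch.chunk(W, n_shards, dim=1)
    # Each shard computes its slice of the output independently
    partial_outputs = [x @ w for w in shards]
    # In a real distributed setting an all-gather assembles the slices
    return torch.cat(partial_outputs, dim=-1)

x = torch.randn(4, 16)
W = torch.randn(16, 32)
assert torch.allclose(column_parallel_linear(x, W), x @ W, atol=1e-5)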
3D Parallelism:
- Combine Three Parallel Strategies: Data parallelism + Tensor parallelism + Pipeline parallelism
- DeepSpeed ZeRO: Zero Redundancy Optimizer; partitions optimizer states, gradients, and parameters across GPUs to reduce memory usage
- Megatron-LM: NVIDIA’s large-scale parallel training framework
Training Time Estimation:
Computational Analysis:
FLOPs per token ≈ 6 × N (N is parameter count)
Total FLOPs = 6 × N × D (D is training data token count)
Training time = Total FLOPs / (GPU count × per-GPU FLOP/s × utilization)
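The formulas above translate directly into code; the 40% utilization figure is an assumption (30-50% is common in practice):
def training_days(n_params, n_tokens, n_gpus, gpu_flops, utilization=0.4):
    total_flops = 6 * n_params * n_tokens        # ~6 FLOPs per parameter per token
    effective_rate = n_gpus * gpu_flops * utilization
    return total_flops / effective_rate / 86400  # seconds -> days

# GPT-3-scale run on 1000 A100s (~312 TFLOP/s dense BF16), 40% utilization
print(training_days(175e9, 300e9, 1000, 312e12))  # ~29 days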
Real-world Example:
GPT-3 training:
- Parameters: 175B
- Training data: 300B tokens
- Total compute: ~3.14×10^23 FLOPs
- Hardware: ~1000 V100 GPUs
- Training time: ~34 days
Cost Estimation:
- Cloud Computing Cost: AWS p4d.24xlarge ~$32/hour
- GPT-3 Level Model: Estimated training cost $4-10M
- Open Source Alternatives: Use academic resources and open source tools to reduce costs
5.2 Fine-tuning
5.2.1 Supervised Fine-tuning
Basic Principles of Fine-tuning:
- Parameter Initialization: Use pretrained model parameters as initialization
- Task Adaptation: Continue training on task-specific data
- Learning Rate Setting: Usually use smaller learning rates to avoid catastrophic forgetting
- Freezing Strategy: Can choose to freeze some layers, only train top layers
Full Parameter Fine-tuning vs Parameter-Efficient Fine-tuning:
Full Parameter Fine-tuning:
# Full fine-tuning sketch: every parameter receives gradient updates
for batch in task_dataloader:
    # Forward pass
    logits = model(batch.input_ids)
    loss = criterion(logits, batch.labels)
    # Backward pass, update all parameters
    loss.backward()
    optimizer.step()       # Update all model parameters
    optimizer.zero_grad()
Parameter-Efficient Fine-tuning Methods:
LoRA (Low-Rank Adaptation):
- Core Idea: Add low-rank matrices on top of pretrained parameters
- Mathematical Representation:
W' = W + BA
where B ∈ ℝ^(d×r), A ∈ ℝ^(r×k), r ≪ min(d, k)
- Advantages: Only a small fraction of parameters is trained (usually <1%), significantly reducing computation and storage requirements
- Implementation:
# LoRA layer sketch (scaling follows the common alpha/rank convention)
class LoRALayer(nn.Module):
    def __init__(self, in_features, out_features, rank=4, alpha=8):
        super().__init__()
        self.lora_A = nn.Linear(in_features, rank, bias=False)
        self.lora_B = nn.Linear(rank, out_features, bias=False)
        nn.init.zeros_(self.lora_B.weight)  # so W' = W at the start of training
        self.scaling = alpha / rank

    def forward(self, x):
        # Low-rank update added to the frozen base layer's output
        return self.lora_B(self.lora_A(x)) * self.scaling
Adapter Methods:
- Structure: Insert small feedforward networks between Transformer layers
- Design Principle:
Adapter(x) = x + Up(ReLU(Down(LayerNorm(x))))
- Advantages: Keep original model unchanged, only train adapter parameters
Prefix Tuning:
- Method: Only train input prefix embeddings, freeze model parameters
- Applicable Scenarios: Particularly suitable for generation tasks
- Simple Implementation: Add trainable prefix tokens before the input (sketched below)
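A simplified sketch of the prefix idea: a small set of trainable embeddings is prepended to the frozen input embeddings (the original method additionally injects prefixes into every layer's attention keys and values):
import torch
import torch.nn as nn

class PrefixTuning(nn.Module):
    def __init__(self, hidden_size, prefix_len=10):
        super().__init__()
        # The only trainable parameters: one embedding per prefix position
        self.prefix = nn.Parameter(torch.randn(prefix_len, hidden_size) * 0.02)

    def forward(self, input_embeds):
        # Prepend the learned prefix to every sequence in the batch
        batch_size = input_embeds.size(0)
        prefix = self.prefix.unsqueeze(0).expand(batch_size, -1, -1)
        return torch.cat([prefix, input_embeds], dim=1)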
5.2.2 Task-Specific Data Preparation
Data Formatting Strategies:
Classification Task Data Format:
# Text classification data example
{
    "text": "This movie is really great, the actors performed excellently, and the plot is engaging.",
    "label": "positive",
    "metadata": {
        "domain": "movie_review",
        "source": "imdb"
    }
}
Generation Task Data Format:
# QA task data example
{
    "instruction": "Please answer the following question",
    "input": "What is machine learning?",
    "output": "Machine learning is a branch of artificial intelligence that enables computers to automatically learn and improve from data through algorithms, without being explicitly programmed.",
    "task_type": "qa"
}
Multi-turn Dialogue Data Format:
# Dialogue data example
{
    "conversation": [
        {
            "role": "user",
            "content": "How's the weather today?"
        },
        {
            "role": "assistant",
            "content": "I cannot access real-time weather information. I recommend checking weather apps or websites for accurate weather forecasts."
        },
        {
            "role": "user",
            "content": "Can you recommend some good weather apps?"
        }
    ]
}
Data Augmentation Techniques:
Text Data Augmentation:
- Back Translation: Text→Foreign Language→Back Translation, increase expression diversity
- Synonym Replacement: Use synonym dictionaries for vocabulary substitution (see the sketch after this list)
- Sentence Restructuring: Change sentence structure while preserving semantics
- Noise Injection: Add spelling errors, punctuation changes, etc.
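As an illustration of the simplest of these techniques, a toy synonym-replacement function; the dictionary is a stand-in for a real thesaurus such as WordNet:
import random

# Toy thesaurus for illustration only
SYNONYMS = {"great": ["excellent", "wonderful"], "fast": ["quick", "rapid"]}

def synonym_replace(text, prob=0.3):
    words = text.split()
    for i, word in enumerate(words):
        if word.lower() in SYNONYMS and random.random() < prob:
            words[i] = random.choice(SYNONYMS[word.lower()])
    return " ".join(words)

print(synonym_replace("This model is great and fast"))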
Quality Control:
# Quality-control sketch; the is_* helper checks are assumed to exist elsewhere
def quality_check(sample):
    checks = [
        is_language_correct(sample['text']),
        is_label_valid(sample['label']),
        is_length_appropriate(sample['text']),
        is_content_appropriate(sample['text'])
    ]
    return all(checks)
5.2.3 Learning Rate Scheduling and Hyperparameter Optimization
Learning Rate Scheduling Strategies:
Linear Decay Scheduling:
def linear_schedule(current_step, total_steps, peak_lr, warmup_steps, min_lr=0):
    if current_step < warmup_steps:
        # Warmup phase: ramp linearly from 0 to peak_lr
        return peak_lr * current_step / warmup_steps
    # Linear decay phase: interpolate from peak_lr down to min_lr
    progress = (current_step - warmup_steps) / (total_steps - warmup_steps)
    return peak_lr * (1 - progress) + min_lr * progress
Cosine Annealing Scheduling:
import math

def cosine_schedule(current_step, total_steps, peak_lr, warmup_steps, min_lr=0):
    if current_step < warmup_steps:
        # Warmup phase: ramp linearly from 0 to peak_lr
        return peak_lr * current_step / warmup_steps
    # Cosine decay from peak_lr down to min_lr
    progress = (current_step - warmup_steps) / (total_steps - warmup_steps)
    return min_lr + (peak_lr - min_lr) * (1 + math.cos(math.pi * progress)) / 2
AdamW Optimizer Configuration:
# Recommended AdamW configuration (correct_bias is an argument of the
# transformers AdamW; torch.optim.AdamW always applies bias correction)
optimizer = AdamW(
    model.parameters(),
    lr=5e-5,              # Base learning rate
    betas=(0.9, 0.999),   # Momentum parameters
    eps=1e-8,             # Numerical stability
    weight_decay=0.01,    # Weight decay
    correct_bias=True     # Bias correction
)
Key Hyperparameter Selection:
Learning Rate Range:
- Full Parameter Fine-tuning: 1e-5 to 5e-5
- LoRA Fine-tuning: 1e-4 to 1e-3
- Selection Principle: Larger models use smaller learning rates
Batch Size:
- Gradient Accumulation: Use when GPU memory is insufficient (see the sketch after this list)
- Effective Batch Size: batch_size × gradient_accumulation_steps
- Recommended Range: 16-128 samples
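A sketch of gradient accumulation, reusing the names from the fine-tuning loop above; scaling the loss makes the accumulated gradient match one large batch:
accumulation_steps = 8  # effective batch = batch_size × 8

optimizer.zero_grad()
for step, batch in enumerate(task_dataloader):
    loss = criterion(model(batch.input_ids), batch.labels)
    # Scale so the summed gradients equal those of one large batch
    (loss / accumulation_steps).backward()
    if (step + 1) % accumulation_steps == 0:
        optimizer.step()
        optimizer.zero_grad()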
Training Epochs:
- Small Datasets: 10-50 epochs
- Large Datasets: 2-5 epochs
- Early Stopping Strategy: Monitor validation set performance to prevent overfitting
5.2.4 Overfitting Prevention Strategies
Regularization Techniques:
Dropout Strategy:
# Dropout rate settings for different components
config = {
"attention_dropout": 0.1, # Attention weight dropout
"hidden_dropout": 0.1, # Hidden layer dropout
"embedding_dropout": 0.1, # Embedding layer dropout
"classifier_dropout": 0.3 # Classifier dropout (usually higher)
}Weight Decay:
- L2 Regularization: Add penalty term for weight squared sum
- Selective Application: Usually not applied to bias and LayerNorm parameters (see the sketch after this list)
- Decay Rate Selection: 0.01-0.1 range
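Selective application is typically implemented with optimizer parameter groups; a sketch following the common Hugging Face pattern:
# Exclude biases and LayerNorm weights from weight decay
no_decay = ["bias", "LayerNorm.weight"]
grouped_params = [
    {"params": [p for n, p in model.named_parameters()
                if not any(nd in n for nd in no_decay)],
     "weight_decay": 0.01},
    {"params": [p for n, p in model.named_parameters()
                if any(nd in n for nd in no_decay)],
     "weight_decay": 0.0},
]
optimizer = AdamW(grouped_params, lr=5e-5)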
Data-Related Strategies:
Validation Set Split:
# Dataset split example (random_split comes from torch.utils.data)
from torch.utils.data import random_split

train_size = int(0.8 * len(dataset))
val_size = int(0.1 * len(dataset))
test_size = len(dataset) - train_size - val_size
train_data, val_data, test_data = random_split(
    dataset, [train_size, val_size, test_size]
)
Cross Validation:
- K-Fold Cross Validation: Particularly suitable for small datasets
- Stratified Sampling: Ensure uniform distribution of categories across folds (see the sketch after this list)
- Time Series Data: Use time-sensitive split methods
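A sketch of stratified k-fold with scikit-learn, assuming `texts` and `labels` are parallel lists:
from sklearn.model_selection import StratifiedKFold

skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
for fold, (train_idx, val_idx) in enumerate(skf.split(texts, labels)):
    # Each fold preserves the overall label distribution
    print(f"Fold {fold}: {len(train_idx)} train / {len(val_idx)} val samples")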
Early Stopping and Checkpoints:
# Early stopping implementation (assumes higher val_score is better)
class EarlyStopping:
    def __init__(self, patience=7, min_delta=0):
        self.patience = patience
        self.min_delta = min_delta
        self.counter = 0
        self.best_score = None

    def __call__(self, val_score):
        # Returns True when training should stop
        if self.best_score is None:
            self.best_score = val_score
        elif val_score < self.best_score + self.min_delta:
            self.counter += 1
            if self.counter >= self.patience:
                return True
        else:
            self.best_score = val_score
            self.counter = 0
        return False
5.3 Advanced Training Techniques
5.3.1 Reinforcement Learning from Human Feedback (RLHF)
Complete RLHF Pipeline:
Stage 1: Supervised Fine-tuning (SFT):
- Goal: Supervised learning on high-quality instruction-response pairs
- Data Requirements: High-quality dialogue data with human annotations
- Training Objective: Maximize conditional probability P(response|instruction)
Stage 2: Reward Model Training:
- Data Collection: Let SFT model generate multiple responses, human ranking
- Model Architecture: Usually use same model as SFT, but output scalar reward scores
- Training Objective:
L_reward = -E[log σ(r(x, y_chosen) - r(x, y_rejected))]
where r(x,y) is the reward model output and σ is the sigmoid function
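In code, this pairwise objective is a log-sigmoid over the score difference; `chosen_rewards` and `rejected_rewards` are the reward model's scalar outputs for the two responses:
import torch.nn.functional as F

def reward_loss(chosen_rewards, rejected_rewards):
    # Maximize the margin between preferred and rejected responses
    return -F.logsigmoid(chosen_rewards - rejected_rewards).mean()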
Stage 3: PPO Reinforcement Learning:
- Objective Function:
L_PPO = E[min(r(θ)A, clip(r(θ), 1-ε, 1+ε)A)]
where r(θ) = π_θ(y|x) / π_θ_old(y|x)
- KL Divergence Constraint: Prevent new policy from deviating too far from original model
- Complete Objective:
L = L_PPO - β × KL(π_θ || π_SFT), where the KL term penalizes drift from the SFT model
Key Technical Details in RLHF:
Reward Model Design:
class RewardModel(nn.Module):
    def __init__(self, base_model):
        super().__init__()
        self.base_model = base_model
        # Scalar reward head on top of the base model's hidden states
        self.reward_head = nn.Linear(base_model.config.hidden_size, 1)

    def forward(self, input_ids, attention_mask):
        outputs = self.base_model(input_ids, attention_mask=attention_mask)
        # Use the last token's hidden state (with right-padding, a real
        # implementation selects the last non-pad position instead)
        last_hidden_state = outputs.last_hidden_state[:, -1, :]
        reward = self.reward_head(last_hidden_state)
        return reward
PPO Training Challenges:
- Reward Hacking: Model may learn to deceive the reward model
- Training Instability: RL training is more unstable than supervised learning
- High Computational Overhead: Need to maintain multiple model copies simultaneously
5.3.2 Instruction Tuning
Core Philosophy of Instruction Tuning:
- Unified Format: Unify various tasks into instruction-input-output format
- Generalization Capability: Improve model’s understanding and execution of new instructions
- Zero-shot Performance: Execute new tasks without task-specific fine-tuning
Instruction Data Construction:
Task Diversity:
# Instruction template examples
instruction_templates = {
    "classification": [
        "Classify the following text into {categories}",
        "Which category does this text belong to: {categories}",
        "Please determine the category of the following content"
    ],
    "generation": [
        "Generate {content_type} based on the following description",
        "Please create a {content_type} about {topic}",
        "Continue the following content"
    ],
    "qa": [
        "Answer the following question",
        "Answer the question based on given information",
        "Please explain {concept}"
    ]
}
Negative Sample Construction:
- Refusal to Answer: Learn to refuse inappropriate or beyond-capability requests
- Clarification Queries: Learn to seek clarification for ambiguous instructions
- Safety Boundaries: Learn to identify and refuse harmful requests
Chain-of-Thought (CoT) Training:
# CoT training data example
{
    "instruction": "Solve this math problem",
    "input": "Tom has 15 apples, gave 3 to Mary, then bought 8 more. How many does he have now?",
    "output": "Let me calculate step by step:\n1. Tom initially had 15 apples\n2. After giving 3 to Mary: 15 - 3 = 12 apples\n3. After buying 8 more: 12 + 8 = 20 apples\nSo Tom now has 20 apples."
}
5.3.3 Alignment Techniques
Multiple Dimensions of Alignment:
Helpfulness:
- Task Completion Capability: Accurately understand and execute user instructions
- Information Accuracy: Provide correct, up-to-date information
- Response Completeness: Give comprehensive and relevant answers
Harmlessness:
- Content Safety: Avoid generating harmful, violent, discriminatory content
- Privacy Protection: Do not leak personal privacy information
- Legal Compliance: Comply with relevant laws and regulations
Honesty:
- Knowledge Boundaries: Acknowledge what is not known
- Uncertainty Expression: Appropriately express uncertainty
- Avoid Hallucination: Reduce generation of false information
Constitutional AI Method:
- Self-Critique: Let model evaluate its own output
- Self-Correction: Improve responses based on critique
- Recursive Improvement: Multi-round self-improvement process
5.3.4 In-Context Learning (ICL)
ICL Working Mechanism:
Few-shot Learning:
# Few-shot prompting example
prompt = """
Please translate the following sentences to English:
Chinese: 今天天气很好。
English: The weather is nice today.
Chinese: 我喜欢读书。
English: I like reading books.
Chinese: 这个问题很复杂。
English: """Key Factors in ICL:
Example Selection Strategies:
- Similarity Selection: Choose examples most similar to target task (see the sketch after this list)
- Diversity Balance: Ensure examples cover different situations
- Quality Control: Use high-quality examples
- Order Effects: Order of examples affects performance
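A sketch of similarity-based selection using sentence embeddings; the sentence-transformers library and model name are assumptions, and any encoder producing comparable vectors would work:
from sentence_transformers import SentenceTransformer, util

encoder = SentenceTransformer("all-MiniLM-L6-v2")

def select_examples(query, candidate_pool, k=3):
    # Rank candidate examples by cosine similarity to the query
    query_emb = encoder.encode(query, convert_to_tensor=True)
    pool_embs = encoder.encode([c["input"] for c in candidate_pool],
                               convert_to_tensor=True)
    scores = util.cos_sim(query_emb, pool_embs)[0]
    top_idx = scores.argsort(descending=True)[:k]
    return [candidate_pool[int(i)] for i in top_idx]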
Prompt Engineering Techniques:
# Structured prompt template
def create_prompt(task_description, examples, query):
    prompt = f"Task: {task_description}\n\n"
    for i, (input_text, output_text) in enumerate(examples, 1):
        prompt += f"Example {i}:\nInput: {input_text}\nOutput: {output_text}\n\n"
    prompt += f"Now please process:\nInput: {query}\nOutput: "
    return prompt
Theoretical Understanding of ICL:
- Gradient Update Simulation: ICL may simulate gradient descent process
- Pattern Matching: Learn input-output mapping patterns through examples
- Meta-Learning: Learn the ability to learn during pretraining