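[NLP] 27. Language Model Training and Model Selection: From Pre-training to Downstream Tasks

Language Model Training: From Pre-training to Downstream Tasks

This article explains how large language models (LLMs) are trained. It covers the three model types (encoder, decoder, encoder-decoder) and the principles, trade-offs, and typical use cases of the main pre-training tasks, with the aim of giving you a complete picture of language modeling.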

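I. Three Mainstream Language Model Architectures

Language models (LLMs) fall into three main architectures, which differ in how they are trained, what they can do, and where they are applied:

| Type | Representative models | Input handling | Output form | Typical uses |
|---|---|---|---|---|
| Encoder | BERT | Whole sentence (with masked tokens) | Token-level representation vectors | NER, classification, matching |
| Decoder | GPT | Preceding text | Autoregressively generated continuation | Generation, continuation, dialogue |
| Encoder-decoder | T5, BART | Encode the whole input → decode the target | Text to text | Translation, question answering, summarization |

**Note:** Today's mainstream large models such as GPT-4 and Claude are generally described as decoder-only.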

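II. The Basic Idea Behind Training a Language Model

Whatever the architecture, training follows the same core recipe:

Take raw text, modify it slightly (for example by masking, replacing, or deleting tokens), and train the model to recover or identify what was modified.

This forces the model to learn the semantic relations between words as well as context and sentence structure.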

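III. Common Training Tasks: Details and Comparison

1️⃣ Next Token Prediction

Used by: decoder models (e.g. GPT)
Mechanism: given the beginning of a text, predict the next token
Modeling objective:

$\hat{y}_{t} = \arg\max_{y_t} P(y_t \mid y_{<t})$

Example:
Input: The weather is → Output: sunny
Pros: directly supports text generation
Cons: unidirectional context; the model sees only the preceding text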
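The objective above can be made concrete with a toy example. The sketch below first builds the (context, target) pairs that next-token training uses, then approximates the argmax with a count-based bigram model. The bigram simplification (conditioning on only the previous token instead of the full prefix), the function names, and the three-sentence corpus are all assumptions made for illustration.

```python
from collections import Counter, defaultdict


def next_token_examples(tokens):
    """Return one (context, target) pair per position after the first."""
    return [(tokens[:t], tokens[t]) for t in range(1, len(tokens))]


def train_bigram(corpus):
    """Count how often each token follows each previous token."""
    counts = defaultdict(Counter)
    for sentence in corpus:
        for prev, nxt in zip(sentence, sentence[1:]):
            counts[prev][nxt] += 1
    return counts


def predict_next(counts, prev):
    """Greedy argmax over the counted continuations of `prev` (None if unseen)."""
    followers = counts.get(prev)
    if not followers:
        return None
    return followers.most_common(1)[0][0]


# Hypothetical toy corpus, already tokenized.
corpus = [
    ["the", "weather", "is", "sunny"],
    ["the", "sky", "is", "sunny"],
    ["the", "weather", "is", "cold"],
]

for context, target in next_token_examples(corpus[0]):
    print(context, "->", target)

counts = train_bigram(corpus)
print(predict_next(counts, "is"))
```

A real decoder replaces the count table with a neural network that conditions on the whole prefix, but the training pairs have the same shape.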

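2️⃣ Masked Token Prediction

Used by: encoder models (e.g. BERT)
Mechanism: randomly mask some tokens in the input sentence; the model predicts the original token at each masked position

Example:

Input: The capital of France is [MASK]
Output: Paris

Note: unmasked positions do not contribute to the loss
Compared with GPT: the model sees both left and right context (bidirectional), but it is not suited to free-form text generation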
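A minimal sketch of how masked inputs and their labels might be prepared, assuming whitespace tokens and an `IGNORE` marker (a name chosen here, not taken from any library) for positions that are excluded from the loss:

```python
import random

IGNORE = None  # assumed marker: positions carrying it do not enter the loss


def mask_tokens(tokens, mask_prob=0.15, seed=0):
    """Mask each token independently with probability `mask_prob`.

    Returns (inputs, labels). A label holds the original token at masked
    positions and IGNORE everywhere else.
    """
    rng = random.Random(seed)
    inputs, labels = [], []
    for tok in tokens:
        if rng.random() < mask_prob:
            inputs.append("[MASK]")
            labels.append(tok)
        else:
            inputs.append(tok)
            labels.append(IGNORE)
    return inputs, labels


tokens = "The capital of France is Paris".split()
inputs, labels = mask_tokens(tokens, mask_prob=0.3, seed=1)
print(inputs)
print(labels)
```

Only the positions whose label is not `IGNORE` would be scored during training, which is the "unmasked positions do not contribute to the loss" rule in code form.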

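3️⃣ Span Prediction (SpanBERT)

Difference: masks a contiguous run of tokens rather than individual tokens
Purpose: strengthens the model's handling of spans, which suits tasks such as question answering

Example:

Input: Chocolate is [MASK] [MASK] [MASK] good
Output: incredibly irresistibly tasty

The task is harder than single-token masking, which pushes the model to learn stronger representations.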
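Span masking can be sketched as follows. This is a simplified illustration: the span length is fixed by the caller rather than sampled, and the `mask_span` name and its `start` parameter are assumptions introduced for this example.

```python
import random


def mask_span(tokens, span_len, start=None, seed=0):
    """Replace `span_len` consecutive tokens with [MASK].

    If `start` is None, a start index is drawn at random. Returns the masked
    input, the list of original tokens in the span, and the start index.
    """
    if not 0 < span_len <= len(tokens):
        raise ValueError("span_len must be between 1 and len(tokens)")
    if start is None:
        start = random.Random(seed).randrange(len(tokens) - span_len + 1)
    end = start + span_len
    inputs = tokens[:start] + ["[MASK]"] * span_len + tokens[end:]
    return inputs, tokens[start:end], start


tokens = "Chocolate is incredibly irresistibly tasty good".split()
inputs, target, _ = mask_span(tokens, span_len=3, start=2)
print(" ".join(inputs))
print(" ".join(target))
```

Because neighbouring masked positions cannot lean on each other, the model has to infer the whole span from the context around it.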

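4️⃣ Random Token Correction (BERT)

Mechanism: randomly replace some tokens in the sentence; the model must work out which token is wrong and recover the original

Example:

I like MLP (originally NLP)
Output: MLP is flagged as the wrong token

Challenge: the model has to understand the meaning of the whole sentence rather than rely on surface word frequency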

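5️⃣ Token Edit Detection (ELECTRA)

Procedure:
- A small generator replaces some tokens with plausible fakes
- A discriminator decides, for every token, whether it was replaced
Advantage: every token contributes to the training signal, which makes training more sample-efficient than BERT's masked-token objective
Training objective:

$\text{Output}_i = \begin{cases} S, & \text{token } i \text{ is original} \\ E, & \text{token } i \text{ was generated} \end{cases}$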
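The labelling rule above can be sketched in a few lines. In this illustration a uniform random draw from a small vocabulary stands in for the generator (in ELECTRA the generator is itself a small masked language model), and the function name and `replace_prob` parameter are assumptions.

```python
import random


def corrupt_and_label(tokens, vocab, replace_prob=0.15, seed=0):
    """Replace some tokens and emit one S/E label per position.

    S = the token at this position equals the original,
    E = the token at this position differs from the original.
    """
    rng = random.Random(seed)
    corrupted, labels = [], []
    for tok in tokens:
        if rng.random() < replace_prob:
            new_tok = rng.choice(vocab)  # stand-in for a generator sample
        else:
            new_tok = tok
        corrupted.append(new_tok)
        # A sampled token that happens to equal the original counts as original.
        labels.append("S" if new_tok == tok else "E")
    return corrupted, labels


tokens = "I like NLP very much".split()
vocab = ["MLP", "cat", "like", "NLP"]  # hypothetical toy vocabulary
corrupted, labels = corrupt_and_label(tokens, vocab, replace_prob=0.5, seed=2)
print(corrupted)
print(labels)
```

Every position receives a label, which is why the discriminator gets a training signal from the whole sequence rather than from a masked subset.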

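6️⃣ Combination (BERT's extended task)

Combines several corruptions: masking + replacement + keeping the original token. In BERT, 15% of tokens are selected for prediction; of those, 80% become [MASK], 10% are replaced with a random token, and 10% are left unchanged.
Effect: the model learns semantic structure more thoroughly and becomes more robust to perturbations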
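A minimal sketch of the combined corruption, assuming the 15% selection rate and the 80/10/10 split described in the BERT paper. The function name, the toy vocabulary, and the use of `None` for unscored positions are assumptions for this example.

```python
import random


def bert_corrupt(tokens, vocab, select_prob=0.15, seed=0):
    """Select tokens for prediction, then mask, replace, or keep each one.

    Returns (inputs, labels); labels are None at positions that are not scored.
    """
    rng = random.Random(seed)
    inputs, labels = [], []
    for tok in tokens:
        if rng.random() >= select_prob:
            inputs.append(tok)      # not selected: unchanged and unscored
            labels.append(None)
            continue
        labels.append(tok)          # selected: the model must predict the original
        r = rng.random()
        if r < 0.8:
            inputs.append("[MASK]")            # 80%: mask
        elif r < 0.9:
            inputs.append(rng.choice(vocab))   # 10%: random replacement
        else:
            inputs.append(tok)                 # 10%: keep the original token
    return inputs, labels


tokens = "I like NLP because language is fascinating".split()
vocab = ["MLP", "cat", "tree"]  # hypothetical toy vocabulary
inputs, labels = bert_corrupt(tokens, vocab, select_prob=0.5, seed=1)
print(inputs)
print(labels)
```

Keeping or randomly replacing some selected tokens means the model cannot assume that a visible token is always correct, which narrows the gap between pre-training (where [MASK] appears) and fine-tuning (where it does not).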

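7️⃣ Next Sentence Prediction (BERT)

Mechanism: decide whether the second sentence directly follows the first in the original text

Example:

A: I like NLP
B: I like MLP
Output: is B the next sentence after A?

Later findings: the task adds little; RoBERTa dropped it and matched or improved downstream performance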
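Training pairs for this task might be built as in the sketch below. It assumes a single list of distinct sentences and draws negatives from the same list, which is a simplification chosen for this example; the function name and the 50/50 split parameterization are likewise assumptions.

```python
import random


def make_nsp_pairs(sentences, seed=0):
    """Build (sentence_a, sentence_b, is_next) triples.

    Roughly half of the pairs use the true next sentence; the rest use a
    randomly chosen sentence that is neither A nor A's true successor.
    """
    if len(sentences) < 3:
        raise ValueError("need at least three sentences to draw negatives")
    rng = random.Random(seed)
    pairs = []
    for i in range(len(sentences) - 1):
        if rng.random() < 0.5:
            pairs.append((sentences[i], sentences[i + 1], True))
        else:
            candidates = [s for j, s in enumerate(sentences) if j not in (i, i + 1)]
            pairs.append((sentences[i], rng.choice(candidates), False))
    return pairs


doc = ["I like NLP.", "It studies language.", "Models learn from text.", "Training takes time."]
for a, b, is_next in make_nsp_pairs(doc, seed=3):
    print(a, "|", b, "|", is_next)
```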

IV. Training Tasks Specific to Encoder-Decoder Models (e.g. T5, BART)

✅ 1. Masked Sequence Prediction

Input:
```
I attended [MASK] at [MASK]
```
Target output:
```
I attended a workshop at Google
```

✅ 2. Deleted Sequence Prediction

Input:
```
I watched yesterday
```
Target output:
```
I watched a movie on Netflix yesterday
```
Example of a wrong answer (which the model should avoid):
```
I watched a presentation at work yesterday
```

✅ 3. Deleted Span Prediction

Input:
```
She submitted the assignment
```
Target output (the deleted spans, filled in one by one):
```
<X>: yesterday evening, <Y>: on Canvas
```
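In T5's version of this objective, each removed span is replaced in the input by a sentinel token, and the target lists every sentinel followed by the tokens it stands for. The sketch below illustrates that pairing. The `span_corrupt` name, the `<X>`/`<Y>`/`<Z>` sentinel strings, and the hand-picked spans are assumptions for this example; real implementations sample the spans and use dedicated vocabulary entries as sentinels.

```python
def span_corrupt(tokens, spans, sentinels=("<X>", "<Y>", "<Z>")):
    """Replace each (start, end) span with a sentinel and build the target.

    `spans` must be sorted, non-overlapping index pairs with `end` exclusive.
    """
    if len(spans) > len(sentinels):
        raise ValueError("not enough sentinel tokens for the requested spans")
    inputs, targets = [], []
    cursor = 0
    for sentinel, (start, end) in zip(sentinels, spans):
        inputs.extend(tokens[cursor:start])   # text kept before the span
        inputs.append(sentinel)               # placeholder in the input
        targets.append(sentinel)              # same placeholder in the target
        targets.extend(tokens[start:end])     # followed by the removed tokens
        cursor = end
    inputs.extend(tokens[cursor:])
    return inputs, targets


tokens = "I submitted the report to my supervisor".split()
inputs, targets = span_corrupt(tokens, [(2, 4), (5, 7)])
print(" ".join(inputs))
print(" ".join(targets))
```

The target is much shorter than the original sentence, so the decoder only has to produce the missing pieces rather than copy the whole input.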

✅ 4. Permuted Sequence Prediction

Input:
```
Netflix the on movie watched I
```
Target output:
```
I watched the movie on Netflix
```

✅ 5. Rotated Sequence Prediction

Input:
```
in the conference I presented a paper
```
Target output:
```
I presented a paper in the conference
```
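The deletion, permutation, and rotation corruptions can each be written as a small function over a token list. This is a sketch at the word level to match the examples in this section; the function names and parameters are assumptions, and the training target in every case is simply the original, uncorrupted sequence.

```python
import random


def delete_tokens(tokens, positions):
    """Drop the tokens at the given indices, leaving no placeholder behind."""
    drop = set(positions)
    return [tok for i, tok in enumerate(tokens) if i not in drop]


def permute(tokens, seed=0):
    """Return a shuffled copy of the tokens."""
    shuffled = list(tokens)
    random.Random(seed).shuffle(shuffled)
    return shuffled


def rotate(tokens, pivot):
    """Rotate the sequence so that it starts at index `pivot`."""
    return tokens[pivot:] + tokens[:pivot]


watched = "I watched a movie on Netflix yesterday".split()
print(" ".join(delete_tokens(watched, [2, 3, 4, 5])))

presented = "I presented a paper in the conference".split()
print(" ".join(rotate(presented, 4)))
print(" ".join(permute(presented, seed=1)))
```

Rotation keeps the relative order of the tokens and only moves the starting point, whereas permutation destroys the order entirely, so the two tasks ask the model for different kinds of reconstruction.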

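✅ 6. Infilling Prediction

Input:
```
She [MASK] the [MASK] before [MASK]
```

Target output:

She completed the report before midnight

The encoder-decoder architecture handles "input → output" tasks more flexibly and suits structural transformations. Its training tasks are correspondingly more varied: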

| Task type | Example model | Input | Output (target) |
|---|---|---|---|
| Masked Sequence | BART | I submitted [MASK] to [MASK] | I submitted the report to my supervisor |
| Deleted Sequence | BART | I submitted to my supervisor | I submitted the report to my supervisor |
| Span Mask | T5 | I submitted to | `<X>: the report, <Y>: my supervisor` |
| Permuted Sequence | BART | My supervisor to the report submitted I | I submitted the report to my supervisor |
| Rotated Sequence | BART | To my supervisor I submitted the report | I submitted the report to my supervisor |
| Infilling Prediction | BART | I [MASK] the [MASK] to [MASK] | I submitted the report to my supervisor |

These tasks strengthen the model's ability to handle inputs with variable structure and improve its generalization on tasks such as translation and summarization.

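V. Pre-training vs. Fine-tuning

| Dimension | Pre-training | Fine-tuning |
|---|---|---|
| How often | Done once | Once per task, independently |
| Training time | Long (weeks to months) | Varies from short to long |
| Compute | Usually a large GPU cluster | Can run on a small number of GPUs |
| Data source | Raw text (e.g. Wikipedia, BooksCorpus) | Task-specific data (classification, question answering, etc.) |
| Learning objective | General language understanding (semantics, context, relations) | Best performance on the specific task |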

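VI. How Do You Use a Language Model for a Task?

Mode 1: Use the language model as a feature extractor

Extract word or sentence vectors → feed them into a downstream task model
Similar to the earlier practice of using Word2Vec or GloVe vectors
BERT is particularly well suited to this mode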
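One common way to turn token-level vectors into a single sentence vector is mean pooling. The sketch below assumes the encoder outputs are already available as plain lists of floats; the 2-dimensional `TOY_VECTORS` table is a hypothetical stand-in for real encoder outputs, which would have hundreds of dimensions.

```python
# Hypothetical 2-dimensional token vectors standing in for encoder outputs.
TOY_VECTORS = {
    "good": [1.0, 0.0],
    "movie": [0.0, 1.0],
}


def mean_pool(token_vectors):
    """Average a list of equal-length vectors into one sentence vector."""
    if not token_vectors:
        raise ValueError("need at least one token vector")
    n = len(token_vectors)
    dim = len(token_vectors[0])
    return [sum(vec[d] for vec in token_vectors) / n for d in range(dim)]


sentence_vector = mean_pool([TOY_VECTORS[tok] for tok in ["good", "movie"]])
print(sentence_vector)  # [0.5, 0.5]
```

The resulting vector would then be passed to a separate downstream model, such as a classifier, while the language model itself stays fixed.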

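Mode 2: Cast the task directly as language modeling

Recast the task as a generation problem
GPT-style models commonly work this way

Examples:

Question answering: Q: Who discovered gravity? A: → Isaac Newton
Translation: Translate: Hello → Bonjour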
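Casting a task as language modeling mostly comes down to formatting the input as a text prefix that the model is asked to continue. A minimal sketch using the two prompt formats from the examples; the helper names are assumptions, and no model is called here.

```python
def qa_prompt(question):
    """Format a question so that the answer is the natural continuation."""
    return f"Q: {question} A:"


def translation_prompt(text):
    """Format a translation request as a text prefix."""
    return f"Translate: {text}"


print(qa_prompt("Who discovered gravity?"))
print(translation_prompt("Hello"))
```

A decoder would be given each string as its prefix, and whatever it generates next is read off as the answer.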

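Comparing Task Formulations Across Architectures

📘 Sentiment analysis (a mock course evaluation)

| Architecture | Input | Output |
|---|---|---|
| Encoder | The student submitted the assignment on time. Rating: [MASK] | 5 |
| Decoder | The student submitted the assignment on time. Rating: | 5 |
| Enc-Dec | The student submitted the assignment on time. Rating: | `<X>: 5` |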

📘 Named entity recognition (NER)

| Architecture | Input | Output |
|---|---|---|
| Encoder | The student submitted the assignment on time | [O, O, O, O, B-TASK, B-TIME, I-TIME] |
| Decoder | The student submitted the assignment on time | assignment: Task, on time: Time |
| Enc-Dec | The student submitted the assignment on time | assignment → Task on time → Time |
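The encoder's per-token tags and the generative models' span outputs carry the same information in different forms. The sketch below converts one into the other, assuming a BIO tagging scheme in which "on time" is tagged `B-TIME I-TIME`; the function name is an assumption for this example.

```python
def bio_to_spans(tokens, tags):
    """Group BIO-tagged tokens into (text, label) spans.

    An I- tag whose label does not continue the open span is treated like O.
    """
    spans = []
    current_tokens, current_label = [], None

    def close():
        if current_tokens:
            spans.append((" ".join(current_tokens), current_label))

    for tok, tag in zip(tokens, tags):
        if tag.startswith("B-"):
            close()
            current_tokens, current_label = [tok], tag[2:]
        elif tag.startswith("I-") and current_label == tag[2:]:
            current_tokens.append(tok)
        else:
            close()
            current_tokens, current_label = [], None
    close()
    return spans


tokens = "The student submitted the assignment on time".split()
tags = ["O", "O", "O", "O", "B-TASK", "B-TIME", "I-TIME"]
print(bio_to_spans(tokens, tags))
# [('assignment', 'TASK'), ('on time', 'TIME')]
```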

📘 Coreference resolution (sentence rewritten to introduce a pronoun)

Example sentence: The student told the lecturer that he was late.

| Architecture | Input | Output |
|---|---|---|
| Encoder | The student told the lecturer that he was late | he → student / lecturer (requires structured decoding) |
| Decoder | The student told the lecturer that he was late | "he" refers to the student |
| Enc-Dec | The student told the lecturer that he was late | he → student |

📘 Text summarization (expanded long sentence → summary)

Long sentence: The student, after days of hard work and late nights, finally submitted the assignment well before the deadline.

| Architecture | Input | Output |
|---|---|---|
| Decoder | The student, after days of hard work… | The student submitted early |
| Enc-Dec | Same as above | Submitted the assignment |

📘 Translation (English → French)

| Architecture | Input | Output |
|---|---|---|
| Decoder | Translate: The student submitted the assignment on time. | L’étudiant a rendu le devoir à temps. |
| Enc-Dec | Translate English to French: The student submitted… | L’étudiant a rendu le devoir à temps. |

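Summary: Matching Architectures to Tasks

| Task type | Encoder (e.g. BERT) | Decoder (e.g. GPT) | Encoder-decoder (e.g. T5, BART) |
|---|---|---|---|
| Classification / regression | ✅ Very well suited | ✅ Can also be modeled | ✅ Flexible; fits a text → label format |
| Named entity recognition | ✅ Standard approach (token classification) | ⚠️ Requires sequence generation | ✅ Can generate spans |
| Coreference resolution | ⚠️ Usually needs external processing | ⚠️ Output is ambiguous | ✅ Can generate structured output |
| Summarization | ❌ Cannot generate | ✅ Capable | ✅ Best suited (input and output are decoupled) |
| Translation | ❌ Not feasible | ✅ Possible (prompt-based) | ✅ Best (the standard encoder-decoder application) |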