Advanced LLM Notes

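Prologue: Large models are all the rage lately, and I'm definitely aware I'm late to the party. Since graduating in 2022 I have rarely followed any one domain systematically, so this time I am committing to tracking the technical details and implementation of large models properly, rather than being an AI poseur.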

This article follows three main threads: (1) text generation; (2) multimodality; (3) conversational large language models.

summary

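The world's data comes in essentially four modalities: images, video, text, and audio. The goal of today's large models is to pre-train a world model, that is, a single model able to handle all four modalities. Images and video belong to CV and text belongs to NLP; I have not worked with audio yet, so I don't know how to classify it for now. Current large models can be divided into single-modal and multimodal ones.
Single-modal: text-to-text models such as the ChatGPT series.
Multimodal (the hottest right now being Vision-Language Models): the Grok series, for example.
World models: Sora, for example.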

Large Language Model timeline

Transformer(2017，open-source)

Built entirely on self-attention, with far more parameters than the CNN-based ResNet.

Core Algorithms

BPE

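BPE (Byte Pair Encoding) is a common way to convert a corpus into tokens during both NLP training and inference. The core of BPE is building a vocabulary by repeatedly merging the most frequent pair of adjacent symbols.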
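A minimal sketch of how a BPE vocabulary can be learned, assuming whitespace-split words and an end-of-word marker `</w>`. The function names (`get_pair_counts`, `merge_pair`, `learn_bpe`) are illustrative and not taken from any particular tokenizer library; production tokenizers add byte-level handling and many optimizations.

```python
from collections import Counter


def get_pair_counts(words):
    """Count adjacent symbol pairs, weighted by how often each word occurs."""
    pairs = Counter()
    for symbols, freq in words.items():
        for a, b in zip(symbols, symbols[1:]):
            pairs[(a, b)] += freq
    return pairs


def merge_pair(pair, words):
    """Rewrite every word so that each occurrence of `pair` becomes one symbol."""
    merged = {}
    for symbols, freq in words.items():
        out, i = [], 0
        while i < len(symbols):
            if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
                out.append(symbols[i] + symbols[i + 1])
                i += 2
            else:
                out.append(symbols[i])
                i += 1
        key = tuple(out)
        merged[key] = merged.get(key, 0) + freq
    return merged


def learn_bpe(corpus, num_merges):
    """Return the learned merge rules and the corpus re-segmented with them."""
    # Start from single characters plus an end-of-word marker (an assumption).
    words = dict(Counter(tuple(w) + ("</w>",) for w in corpus.split()))
    merges = []
    for _ in range(num_merges):
        pairs = get_pair_counts(words)
        if not pairs:
            break
        best = max(pairs, key=pairs.get)  # most frequent adjacent pair
        words = merge_pair(best, words)
        merges.append(best)
    return merges, words


merges, words = learn_bpe("aab aab aac", num_merges=1)
print(merges)
print(words)
```

The final vocabulary is the set of base characters plus one new symbol per merge, so `num_merges` directly controls the vocabulary size.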

BERT(2018.10， open-source)

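A self-supervised language pre-training method. Its vocabulary is built with a BPE variant (WordPiece), and it uses two proxy tasks:
(1) mask 15% of the tokens and predict them;
(2) predict whether one sentence is the next sentence after another.
Both tasks construct artificial targets, after which the model is fine-tuned on downstream tasks. Pre-training uses only the Transformer encoder; a segment embedding layer is added to support the next-sentence task.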
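The first proxy task can be sketched as follows. This is a simplified illustration: `mask_tokens`, the `IGNORE` marker, and the fixed seed are assumptions made for the example. The original BERT recipe also replaces some selected tokens with random tokens or leaves them unchanged, which is omitted here.

```python
import random

MASK = "[MASK]"
IGNORE = None  # hypothetical marker: positions that do not contribute to the loss


def mask_tokens(tokens, mask_prob=0.15, seed=0):
    """Hide roughly `mask_prob` of the tokens and return (inputs, labels)."""
    rng = random.Random(seed)
    n_mask = max(1, round(len(tokens) * mask_prob))
    positions = set(rng.sample(range(len(tokens)), n_mask))
    # The model sees [MASK] at the chosen positions ...
    inputs = [MASK if i in positions else tok for i, tok in enumerate(tokens)]
    # ... and is trained to recover the original token only at those positions.
    labels = [tok if i in positions else IGNORE for i, tok in enumerate(tokens)]
    return inputs, labels


tokens = [f"tok{i}" for i in range(20)]
inputs, labels = mask_tokens(tokens)
print(inputs)
print(labels)
```

Because the labels come from the text itself, no human annotation is needed, which is what makes the task self-supervised.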

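Questions:
(1) Where does BERT's bidirectionality show up?
In models such as GPT-1, the attention computation is masked so that the model cannot use later information: the upper-right part of the attention matrix is discarded, which makes the model unidirectional. BERT applies no such mask.
(2) Why does BERT use only the encoder?
BERT is mainly used for understanding and representation tasks such as text classification and sequence labeling. It does not need to generate text, so it uses only the encoder.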
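The difference between bidirectional and unidirectional attention can be sketched in a few lines. This is a toy illustration in pure Python, not either model's actual implementation; `attention_weights` is a hypothetical helper that applies a row-wise softmax to a square matrix of attention scores.

```python
import math


def attention_weights(scores, causal):
    """Row-wise softmax over attention scores, optionally with a causal mask."""
    n = len(scores)
    weights = []
    for i, row in enumerate(scores):
        # With a causal mask, token i may only attend to positions j <= i,
        # i.e. everything above the diagonal of the matrix is discarded.
        visible = [j for j in range(n) if not causal or j <= i]
        m = max(row[j] for j in visible)  # subtract the max for stability
        exps = {j: math.exp(row[j] - m) for j in visible}
        total = sum(exps.values())
        weights.append([exps.get(j, 0.0) / total for j in range(n)])
    return weights


scores = [[0.0] * 4 for _ in range(4)]  # uniform toy scores for 4 tokens
bidirectional = attention_weights(scores, causal=False)   # BERT-style
unidirectional = attention_weights(scores, causal=True)   # GPT-style

for row in unidirectional:
    print([round(w, 2) for w in row])
```

In the bidirectional case every token spreads its attention over all four positions; in the causal case the first token can only attend to itself, the second to the first two positions, and so on.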

GPT-1 (open-source, 117M parameters)

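Its vocabulary is built with BPE. Because it has to generate text, it uses the decoder, and only the decoder. The decoder is modified, though: with no encoder present, the layer in the original decoder that consumes the encoder's output (the cross-attention layer) is removed.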

Here is the answer from Grok3:
Why Not Encoder-Only?
An encoder-only Transformer (e.g., BERT) is bidirectional: it processes the entire input sequence at once, attending to both past and future tokens. This suits tasks like classification or masked language modeling (e.g., filling in blanks), but not autoregressive generation. GPT-2’s goal is to generate coherent text, not analyze a fixed input, so a bidirectional encoder wouldn’t work—it’d “see the future” and break the generation process.

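GPT-1 is pre-trained in a self-supervised way: the input and its pseudo-labels are the same sequence offset by one position. This is the Causal Language Model (CLM) objective, which every model in the GPT series uses for pre-training.

Under this objective the model is trained to predict the next token given the preceding text, so it learns both the sequential structure of text and the process of generating language. Pre-training needs no annotation; transferring to downstream tasks does.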
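The "offset by one" relationship between inputs and pseudo-labels can be sketched like this. It is a toy illustration on word-level tokens; `make_clm_example` is a hypothetical helper, and a real pipeline would operate on BPE token ids.

```python
def make_clm_example(tokens):
    """Build a causal-LM training pair: labels are the inputs shifted by one."""
    inputs = tokens[:-1]  # everything except the last token
    labels = tokens[1:]   # everything except the first token
    return inputs, labels


tokens = ["The", "cat", "sat", "on", "the", "mat"]
inputs, labels = make_clm_example(tokens)

# At position i the model sees inputs[: i + 1] and must predict labels[i].
for i, target in enumerate(labels):
    print(" ".join(inputs[: i + 1]), "->", target)
```

Every position in the sequence yields one training signal, and all labels come from the raw text, so no annotation is required.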

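What is the difference between generative and discriminative models?
I used to think the difference was simply whether the model generates content. But the original Transformer was built for machine translation, which inherently involves generating content, and yet we don't seem to treat the original Transformer as a generative model. So here is the answer from Grok3: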
Whether a translation model is considered a generative model depends on how it’s designed and what it does under the hood. Let’s break it down:
Translation Models: Generative or Not?
Translation involves taking text in one language (source) and producing text in another language (target). This process inherently involves generating new text, which suggests a generative nature. However, the classification depends on the model’s architecture and training objective:
（1）Traditional Rule-Based or Statistical Translation
Older translation systems (e.g., phrase-based statistical machine translation) weren’t strictly generative in the modern machine learning sense. They relied on rules or statistical alignments (e.g., word/phrase mappings) rather than learning a full data distribution.
Verdict: Not generative. These are more like engineered pipelines than models learning $P(\text{target}|\text{source})$ in a probabilistic sense.
（2）Neural Machine Translation (NMT) - Encoder-Decoder Models
Most modern translation systems (e.g., Google Translate’s backbone, models like Transformer-based seq2seq) use an encoder-decoder architecture:
Encoder: Processes the source language into a representation (e.g., a context vector or hidden state).
Decoder: Generates the target language text, word by word or token by token, conditioned on the encoder’s output.
These models are trained to maximize the likelihood of the target sentence given the source sentence, i.e., $P(\text{target}|\text{source})$.
Why Generative?: The decoder produces new sequences of text, sampling from a learned distribution of possible translations. It’s not just picking from a fixed set—it can generate novel outputs.
Examples: Transformer models (like those in PyTorch’s Fairseq or Hugging Face’s MarianMT).
Verdict: Yes, these are generative models because they generate new data (target text) based on an input distribution.
（3）Autoregressive vs. Non-Autoregressive Translation
Autoregressive (AR): Most NMT models (e.g., Transformer with a decoder) generate text sequentially, predicting each word based on the previous ones. This is a hallmark of generative models (e.g., GPT-style generation).
Verdict: Generative.
Non-Autoregressive (NAR): Some newer translation models (e.g., NAT, or Non-Autoregressive Transformers) predict all words in the target sentence simultaneously, often for speed. These still generate text but don’t model the sequential dependency as strongly.
Verdict: Still generative, but less “classically” so, as they don’t rely on step-by-step sampling.
（4）Comparison to Discriminative Models
A discriminative model in this context might classify something about the text (e.g., “Is this sentence in English or Spanish?”) rather than producing a new sentence. Translation models don’t classify—they create—so they don’t fit the discriminative mold.
Key Insight: Generative Nature of Translation
Translation models, especially neural ones, are generative because:
They learn a conditional distribution $P(\text{target}|\text{source})$.
They output new, coherent sequences of text, not just decisions or labels.
They can produce multiple valid translations for the same input (e.g., “I love you” → “Je t’aime” or “Je vous aime” in French, depending on context), which reflects sampling from a learned distribution.
Examples in Practice
Google Translate: Uses Transformer-based NMT, which is generative.
T5 (Text-to-Text Transfer Transformer): Treats translation as a text generation task (“translate to French: Hello” → “Bonjour”). Generative.
Older SMT (Statistical Machine Translation): Less generative in the ML sense, more like a hybrid of statistical mapping and generation.
My Take as Grok
If you’re asking about modern translation models (like those powering apps today), they’re overwhelmingly generative, especially the neural ones. They’re designed to produce text, not just analyze or classify it. That said, their generative nature is conditional—they generate based on an input (source text)—unlike fully unconditional generative models (e.g., a GAN making random images).
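To make the conditional distribution in the answer above concrete, here is the standard autoregressive factorization. The notation is an assumption added for illustration: $x$ is the source sentence and $y_1, \dots, y_T$ are the target tokens.

$$P(y \mid x) = \prod_{t=1}^{T} P(y_t \mid y_{<t}, x)$$

Each factor corresponds to one decoding step, which is why an autoregressive translation model produces the target token by token. A non-autoregressive model roughly replaces each factor with $P(y_t \mid x)$, dropping the dependence on previously generated tokens so that all positions can be predicted in parallel.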

GPT-2 (2019.2, 1.5B parameters)

GPT-2's parameter count grew rapidly, so the model is much larger. Pre-training is still self-supervised, with the same simple next-token pseudo-labels as GPT-1. The training set is 40GB of text collected from the internet.
Grok3's answer on GPT-2's network architecture: it is based entirely on the decoder-only portion of the Transformer, adapted for autoregressive tasks like text generation.

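At launch, OpenAI did not release the 1.5B-parameter GPT-2 model, only a smaller model for experimental validation; the full model followed later in 2019 as part of a staged release.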

Context length: 1024 tokens (doubled from GPT-1).

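GPT-2 and GPT-1 share essentially the same network structure; GPT-2 is just larger and deeper. The main difference is the training data. GPT-1 was trained on novels, whose content is fairly uniform, so the model is not very robust and its long generations lack coherence. GPT-2 was trained on web data, which is more diverse and closer to the real data distribution, so the resulting model is more robust.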

GPT-3 (closed-source, 2020.5, 175B parameters)

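GPT-3 swept the text generation field when it came out: it could answer questions, translate, write articles, and even do arithmetic. GPT-3 introduced the concept of "in-context learning", in which a large language model solves a variety of tasks from a few examples placed in the prompt, removing the need to fine-tune for each new task. GPT-3 is also described as using more efficient training strategies, including more refined gradient descent techniques and improved regularization, which help the model generalize better and avoid overfitting.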
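In-context learning boils down to putting a few solved examples in the prompt and letting the model continue the pattern, with no weight updates. The sketch below only builds such a prompt; `build_few_shot_prompt` and the `Input:` / `Output:` layout are assumptions chosen for illustration, not a format prescribed by GPT-3.

```python
def build_few_shot_prompt(task, examples, query):
    """Assemble a few-shot prompt: task description, solved examples, new query."""
    lines = [task, ""]
    for source, target in examples:
        lines.append(f"Input: {source}")
        lines.append(f"Output: {target}")
        lines.append("")
    # The final example is left unanswered for the model to complete.
    lines.append(f"Input: {query}")
    lines.append("Output:")
    return "\n".join(lines)


prompt = build_few_shot_prompt(
    task="Translate English to French.",
    examples=[("cheese", "fromage"), ("good morning", "bonjour")],
    query="thank you",
)
print(prompt)
```

Switching to a different task only requires changing the task description and the examples; the model itself stays untouched.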

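GPT-3's source code and model weights were never published, but Meta reproduced GPT-3 under the name OPT and open-sourced the code, weights, and deployment tooling.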

WuDao 1.0 (2021.3, Beijing Academy of Artificial Intelligence)

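WuDao 1.0 is China's first super-large-scale intelligent model system. It contains four models aimed at different domains:
1. WuDao·WenYuan: a large-scale pre-trained model centered on Chinese, with 2.6B parameters. It offers memorization, comprehension, retrieval, numerical computation, and multilingual abilities, covers 20 mainstream Chinese NLP tasks such as open-domain question answering, grammar correction, and sentiment analysis, and is claimed to be on par with GPT-3.
2. WuDao·WenLan: a super-large-scale multimodal pre-trained model with 1B parameters. Trained on 50 million image-text pairs collected from public sources, it is the first publicly released general-purpose Chinese image-text multimodal pre-trained model, and it targets the theoretical problems of pre-training on multimodal data that combines images, text, and video.
3. WuDao·WenHui: a cognition-oriented, super-large-scale pre-trained model of a new type, with 11.3B parameters. With fine-tuning it can write poetry, generate images, produce video, perform image-text generation, and carry out complex reasoning. It aims to study the essential problems of general artificial intelligence from a cognitive perspective.
4. WuDao·WenSu: a super-large-scale pre-trained model for protein sequence prediction.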

WuDao 2.0 (2021.6, 1.75T parameters)

M6-1T (Alibaba DAMO Academy, MM, 2021.6, 1T parameters)

MultiModality-to-MultiModality Multitask Mega-transformer (sparse model)

M6-10T (Alibaba DAMO Academy, MM, 2021.10, 10T parameters)

ERNIE Bot (文心一言) 3.0 (260M parameters, Baidu)

GPT-3.5(2022.11 openAI)

Llama(2023.2 MetaAI)

65B parameters

GPT-4(2023.3 openAI)

1.76T parameters (an unofficial estimate; OpenAI has not disclosed the size)

Llama 2（2023.7 MetaAI）

70B parameters

ERNIE Bot (文心一言) 4.0 (2023.10, Baidu)

Grok-1 （2023.11 xAI）

314B parameters

Gemini（2023.12 Google DeepMind）

Gemini 1.5（2024.2 Google DeepMind）

Llama 3 70B(2024.4 MetaAI)

Llama 3.1 405B(2024.7 MetaAI)

Grok-2 （2024.8 xAI）

o1 (2024.9, OpenAI)

A new series of reasoning models for solving hard problems.

Llama 3.3 70B (2024.12, MetaAI)

Gemini 2.0 (2024.12, Google DeepMind)

o3(2024.12 openAI)

DeepSeek v3(DeepSeek-AI)

685B

DeepSeek R1(2025.1 DeepSeek-AI)

Grok-3 (2025.2, xAI)

GPT-4.5(2025.2 openAI)

chat LLM timeline

Meena (2020.1, Google, 2.6B parameters)

LaMDA (137B parameters, Google, 2021.5)

The full name is LaMDA (Language Model for Dialogue Applications), a chat-oriented chatbot. Its predecessor is Meena (a chatbot developed by Google). Later, to counter the rise of OpenAI's ChatGPT, Google built Bard (now known as Gemini) on top of LaMDA.

From Wikipedia: LaMDA is a decoder-only Transformer language model. It is pre-trained on a text corpus that includes both documents and dialogs consisting of 1.56 trillion words, and is then trained with fine-tuning data generated by manually annotated responses for “sensibleness, interestingness, and safety”.

LaMDA 2 (137B parameters, Google, 2022.5)

ChatGPT (2022.11)

ChatGPT is fine-tuned from a model in the GPT‑3.5 series, which finished training in early 2022. ChatGPT and GPT‑3.5 were trained on an Azure AI supercomputing infrastructure.