【happy-llm】How to build your own tokenizer from scratch


https://huggingface.co/learn/llm-course/chapter6/8

Getting your corpus

To train our new tokenizer, we will use a small text corpus (so the examples run quickly). The steps for acquiring the corpus are similar to the ones at the beginning of this chapter, but this time we will use the WikiText-2 dataset:

from datasets import load_dataset
dataset = load_dataset("wikitext", name="wikitext-2-raw-v1", split="train")

def get_training_corpus():
    for i in range(0, len(dataset), 1000):
        yield dataset[i : i + 1000]["text"]

The get_training_corpus() function is a generator that yields batches of 1,000 texts, which we will use to train the tokenizer. 🤗 Tokenizers can also be trained on text files directly. Here is how to generate a text file containing all the texts from WikiText-2 that we can use locally:

with open("wikitext-2.txt", "w", encoding="utf-8") as f:
    for i in range(len(dataset)):
        f.write(dataset[i]["text"] + "\n")

Building a WordPiece tokenizer from scratch

To build a tokenizer with the 🤗 Tokenizers library, we start by instantiating a Tokenizer object with a model, then set its normalizer, pre_tokenizer, post_processor, and decoder attributes to the values we want. For this example, we will create a Tokenizer with a WordPiece model:

from tokenizers import (
    decoders,
    models,
    normalizers,
    pre_tokenizers,
    processors,
    trainers,
    Tokenizer,
)

tokenizer = Tokenizer(models.WordPiece(unk_token="[UNK]"))

We have to specify the unk_token so the model knows what to return when it encounters characters it has never seen before. Other arguments we could set here include the vocab of our model (we are going to train the model, so we don't need to set it) and max_input_chars_per_word, which specifies a maximum length for each word (words longer than the value passed will be split).
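
For illustration only, here is what passing that optional length limit would look like (the value 100 is arbitrary and this object is not used in the rest of the walkthrough):

# Not used below; shown only to illustrate the optional max_input_chars_per_word argument.
long_word_model = models.WordPiece(unk_token="[UNK]", max_input_chars_per_word=100)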

Step 1: Normalization

The first step of tokenization is normalization, so let's start there. Since BERT is so widely used, 🤗 Tokenizers provides a BertNormalizer with the classic options we can set for BERT:
lowercase: required for uncased models; it reduces the vocabulary size and improves generalization
strip_accents: recommended — removes accents and diacritics so characters have a uniform representation
clean_text: cleans control characters and extra whitespace — removes control characters (except \t, \n, \r), collapses multiple spaces into one, and strips invalid Unicode characters
handle_chinese_chars: places spaces around Chinese characters
These options mirror the configuration of the specific bert-base-uncased checkpoint:

# 1. All text is lowercased
# 2. Vocabulary size: 30,522 tokens
# 3. Tokenization: WordPiece
# 4. Maximum sequence length: 512
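
For reference, you can check these numbers against the published checkpoint by loading it with 🤗 Transformers (this needs network access and is not required for the rest of the walkthrough):

from transformers import AutoTokenizer

reference = AutoTokenizer.from_pretrained("bert-base-uncased")
print(reference.vocab_size)        # 30522
print(reference.model_max_length)  # 512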

We apply it like this:

tokenizer.normalizer = normalizers.BertNormalizer(lowercase=True)

Generally speaking, however, when building a new tokenizer you won't have access to such a handy normalizer already implemented in the 🤗 Tokenizers library — so let's see how to build the BERT normalizer by hand:

tokenizer.normalizer = normalizers.Sequence(
    [normalizers.NFD(), normalizers.Lowercase(), normalizers.StripAccents()]
)
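
To check the effect of the normalizer on a given string, we can call its normalize_str() method (the example string is just an illustration):

print(tokenizer.normalizer.normalize_str("Héllò hôw are ü?"))
# hello how are u?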

Step 2: Pre-tokenization

Note that the Whitespace pre-tokenizer splits on whitespace and on every character that is not a letter, digit, or underscore, so in effect it splits on whitespace and punctuation:

tokenizer.pre_tokenizer = pre_tokenizers.Whitespace() # build from scratch
tokenizer.pre_tokenizer.pre_tokenize_str("Let's test my pre-tokenizer.")
# output
# [('Let', (0, 3)),("'", (3, 4)),('s', (4, 5)),('test', (6, 10)),('my', (11, 13)),
#	('pre', (14, 17)),('-', (17, 18)),('tokenizer', (18, 27)),('.', (27, 28))]

If you only want to split on whitespace (and keep punctuation attached to words), use the WhitespaceSplit pre-tokenizer instead:

pre_tokenizer = pre_tokenizers.WhitespaceSplit()
pre_tokenizer.pre_tokenize_str("Let's test my pre-tokenizer.")
# output: [("Let's", (0, 5)), ('test', (6, 10)), ('my', (11, 13)), ('pre-tokenizer.', (14, 28))]

Step 3: Running the inputs through the model

We specified our model at initialization, but we still need to train it, which requires a WordPieceTrainer. The main thing to remember when instantiating a trainer in 🤗 Tokenizers is that you need to pass it all the special tokens you intend to use — otherwise it won't add them to the vocabulary, since they don't appear in the training corpus:

special_tokens = ["[UNK]", "[PAD]", "[CLS]", "[SEP]", "[MASK]"]
trainer = trainers.WordPieceTrainer(vocab_size=25000, special_tokens=special_tokens)

To train the model on the iterator we defined earlier, we just run:

tokenizer.train_from_iterator(get_training_corpus(), trainer=trainer)

We can also train the model on text files, reinitializing it with an empty WordPiece first:

tokenizer.model = models.WordPiece(unk_token="[UNK]")
tokenizer.train(["wikitext-2.txt"], trainer=trainer)

In both cases, we can then test the tokenizer on a text by calling the encode() method:

encoding = tokenizer.encode("Let's test this tokenizer.")
print(encoding.tokens)
# ['let', "'", 's', 'test', 'this', 'tok', '##eni', '##zer', '.']

Step 4: Post-processing

We need to add the [CLS] token at the beginning and the [SEP] token at the end (or after each sentence, if we encode a pair of sentences). First, we look up the IDs of those tokens in the vocabulary:

cls_token_id = tokenizer.token_to_id("[CLS]")
sep_token_id = tokenizer.token_to_id("[SEP]")
print(cls_token_id, sep_token_id)
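# With the special-token order passed to the trainer above, this should print: 2 3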

We then write a template for the TemplateProcessing: we have to specify how to treat a single sentence and a pair of sentences. For both, we write the special tokens we want to use; the first (or single) sentence is represented by $A, while the second sentence (when encoding a pair) is represented by $B. For each of these (special tokens and sentences), we also specify the corresponding token type ID after a colon:

tokenizer.post_processor = processors.TemplateProcessing(
    single=f"[CLS]:0 $A:0 [SEP]:0",
    pair=f"[CLS]:0 $A:0 [SEP]:0 $B:1 [SEP]:1",
    special_tokens=[("[CLS]", cls_token_id), ("[SEP]", sep_token_id)],
)

Note that we need to pass along the IDs of the special tokens, so the tokenizer can properly convert them to their IDs. Once this is added, going back to our previous example will give:

encoding = tokenizer.encode("Let's test this tokenizer.")
print(encoding.tokens)
# ['[CLS]', 'let', "'", 's', 'test', 'this', 'tok', '##eni', '##zer', '.', '[SEP]']
encoding = tokenizer.encode("Let's test this tokenizer...", "on a pair of sentences.")
print(encoding.tokens)
print(encoding.type_ids)
# ['[CLS]', 'let', "'", 's', 'test', 'this', 'tok', '##eni', '##zer', 
#	'...', '[SEP]', 'on', 'a', 'pair', 'of', 'sentences', '.', '[SEP]']
# [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1]

Step 5: Decoder

tokenizer.decoder = decoders.WordPiece(prefix="##")
tokenizer.decode(encoding.ids)  # reconstruct the text from the encoded IDs
# "let's test this tokenizer... on a pair of sentences."

Saving and using the tokenizer

tokenizer.save("tokenizer.json")
new_tokenizer = Tokenizer.from_file("tokenizer.json")

To use this tokenizer in 🤗 Transformers, we have to wrap it in a PreTrainedTokenizerFast.

  • We can use the generic PreTrainedTokenizerFast class or, if our tokenizer corresponds to an existing model, the class for that model (here, BertTokenizerFast).
  • If you are applying this lesson to build a brand-new tokenizer, you have to use the first option.

To wrap the tokenizer in a PreTrainedTokenizerFast, we can either pass the tokenizer we built as tokenizer_object or pass the tokenizer file we saved as tokenizer_file. The key thing to remember is that we have to set all the special tokens manually, since the class cannot infer from the tokenizer object which token is the mask token, which is the [CLS] token, and so on:

from transformers import PreTrainedTokenizerFast

wrapped_tokenizer = PreTrainedTokenizerFast(
    tokenizer_object=tokenizer,
    # tokenizer_file="tokenizer.json", # You can load from the tokenizer file, alternatively
    unk_token="[UNK]",
    pad_token="[PAD]",
    cls_token="[CLS]",
    sep_token="[SEP]",
    mask_token="[MASK]",
)

Or, if the tokenizer corresponds to a specific model class:

from transformers import BertTokenizerFast
wrapped_tokenizer = BertTokenizerFast(tokenizer_object=tokenizer)
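
As a quick sanity check (a minimal sketch — the exact IDs depend on your training run), the wrapped tokenizer now behaves like any other fast tokenizer in 🤗 Transformers, including batching and padding:

batch = wrapped_tokenizer(
    ["Let's test this tokenizer.", "on a pair of sentences."],
    padding=True,
)
print(batch["input_ids"])
print(wrapped_tokenizer.convert_ids_to_tokens(batch["input_ids"][0]))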

Building a BPE tokenizer from scratch

Let’s now build a GPT-2 tokenizer. Like for the BERT tokenizer, we start by initializing a Tokenizer with a BPE model:

tokenizer = Tokenizer(models.BPE())

Also like for BERT, we could initialize this model with a vocabulary if we had one (we would need to pass the vocab and merges in this case), but since we will train from scratch, we don’t need to do that. We also don’t need to specify an unk_token because GPT-2 uses byte-level BPE, which doesn’t require it.
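
If you did already have a vocabulary, initializing from the usual GPT-2 style files would look roughly like this (the file names here are hypothetical; skip this when training from scratch):

# Hypothetical file names — only relevant if you already have a trained vocabulary.
# tokenizer = Tokenizer(models.BPE.from_file("vocab.json", "merges.txt"))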

GPT-2 does not use a normalizer, so we skip that step and go directly to the pre-tokenization:

tokenizer.pre_tokenizer = pre_tokenizers.ByteLevel(add_prefix_space=False)

The option we added to ByteLevel here is to not add a space at the beginning of a sentence (which is the default otherwise). We can have a look at the pre-tokenization of an example text like before:

tokenizer.pre_tokenizer.pre_tokenize_str("Let's test pre-tokenization!")# [('Let', (0, 3)), ("'s", (3, 5)), ('Ġtest', (5, 10)), ('Ġpre', (10, 14)), ('-', (14, 15)),
#	 ('tokenization', (15, 27)), ('!', (27, 28))]

Next is the model, which needs training. For GPT-2, the only special token is the end-of-text token:

trainer = trainers.BpeTrainer(vocab_size=25000, special_tokens=["<|endoftext|>"])
tokenizer.train_from_iterator(get_training_corpus(), trainer=trainer)

Like with the WordPieceTrainer, as well as the vocab_size and special_tokens, we can specify the min_frequency if we want to, or if we have an end-of-word suffix (like </w>), we can set it with end_of_word_suffix.
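
For illustration only, those optional arguments would look like this (the values are arbitrary and this trainer is not used in the rest of the walkthrough):

# Not used below; values are illustrative.
custom_trainer = trainers.BpeTrainer(
    vocab_size=25000,
    special_tokens=["<|endoftext|>"],
    min_frequency=2,            # only merge pairs seen at least 2 times
    end_of_word_suffix="</w>",  # only if you want an explicit end-of-word marker
)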

This tokenizer can also be trained on text files:

tokenizer.model = models.BPE()
tokenizer.train(["wikitext-2.txt"], trainer=trainer)

Let’s have a look at the tokenization of a sample text:

encoding = tokenizer.encode("Let's test this tokenizer.")
print(encoding.tokens)
# ['L', 'et', "'", 's', 'Ġtest', 'Ġthis', 'Ġto', 'ken', 'izer', '.']

We apply the byte-level post-processing for the GPT-2 tokenizer as follows:

tokenizer.post_processor = processors.ByteLevel(trim_offsets=False)

The trim_offsets = False option indicates to the post-processor that we should leave the offsets of tokens that begin with ‘Ġ’ as they are: this way the start of the offsets will point to the space before the word, not the first character of the word (since the space is technically part of the token). Let’s have a look at the result with the text we just encoded, where 'Ġtest' is the token at index 4:

sentence = "Let's test this tokenizer."
encoding = tokenizer.encode(sentence)
start, end = encoding.offsets[4]
sentence[start:end]
# ' test'

Finally, we add a byte-level decoder:

tokenizer.decoder = decoders.ByteLevel()

and we can double-check it works properly:

tokenizer.decode(encoding.ids)
"Let's test this tokenizer."

Great! Now that we’re done, we can save the tokenizer like before, and wrap it in a PreTrainedTokenizerFast or GPT2TokenizerFast if we want to use it in 🤗 Transformers:

from transformers import PreTrainedTokenizerFast

wrapped_tokenizer = PreTrainedTokenizerFast(
    tokenizer_object=tokenizer,
    bos_token="<|endoftext|>",
    eos_token="<|endoftext|>",
)

or:

from transformers import GPT2TokenizerFast

wrapped_tokenizer = GPT2TokenizerFast(tokenizer_object=tokenizer)
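
One practical note beyond the course material: GPT-2 style tokenizers have no padding token, so if you want to pad batches a common workaround (an assumption about your use case, not something required here) is to reuse the end-of-text token:

wrapped_tokenizer.pad_token = wrapped_tokenizer.eos_token
batch = wrapped_tokenizer(["Let's test this tokenizer.", "on a pair"], padding=True)
print(batch["input_ids"])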

Building a Unigram tokenizer from scratch

Let’s now build an XLNet tokenizer. Like for the previous tokenizers, we start by initializing a Tokenizer with a Unigram model:

tokenizer = Tokenizer(models.Unigram())

Again, we could initialize this model with a vocabulary if we had one.
For the normalization, XLNet uses a few replacements (which come from SentencePiece):

from tokenizers import Regex

tokenizer.normalizer = normalizers.Sequence(
    [
        normalizers.Replace("``", '"'),
        normalizers.Replace("''", '"'),
        normalizers.NFKD(),
        normalizers.StripAccents(),
        normalizers.Replace(Regex(" {2,}"), " "),
    ]
)

This replaces `` and '' with ", replaces any sequence of two or more spaces with a single space, and removes accents from the texts to tokenize. The pre-tokenizer to use for any SentencePiece tokenizer is Metaspace:

tokenizer.pre_tokenizer = pre_tokenizers.Metaspace()

We can have a look at the pre-tokenization of an example text like before:

tokenizer.pre_tokenizer.pre_tokenize_str("Let's test the pre-tokenizer!")[("▁Let's", (0, 5)), ('▁test', (5, 10)), ('▁the', (10, 14)), ('▁pre-tokenizer!', (14, 29))]

Next is the model, which needs training. XLNet has quite a few special tokens:

special_tokens = ["<cls>", "<sep>", "<unk>", "<pad>", "<mask>", "<s>", "</s>"]
trainer = trainers.UnigramTrainer(
    vocab_size=25000, special_tokens=special_tokens, unk_token="<unk>"
)
tokenizer.train_from_iterator(get_training_corpus(), trainer=trainer)

A very important argument not to forget for the UnigramTrainer is the unk_token. We can also pass along other arguments specific to the Unigram algorithm, such as the shrinking_factor for each step where we remove tokens (defaults to 0.75) or the max_piece_length to specify the maximum length of a given token (defaults to 16).
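
For illustration, here is what those optional arguments would look like spelled out (the values shown are the defaults quoted above, and this trainer is not used below):

# Not used below; the values are the defaults mentioned above.
custom_trainer = trainers.UnigramTrainer(
    vocab_size=25000,
    special_tokens=special_tokens,
    unk_token="<unk>",
    shrinking_factor=0.75,  # controls how many tokens are removed at each pruning step
    max_piece_length=16,    # maximum length of a single token
)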

This tokenizer can also be trained on text files:

tokenizer.model = models.Unigram()
tokenizer.train(["wikitext-2.txt"], trainer=trainer)

Let’s have a look at the tokenization of a sample text:

encoding = tokenizer.encode("Let's test this tokenizer.")
print(encoding.tokens)
# ['▁Let', "'", 's', '▁test', '▁this', '▁to', 'ken', 'izer', '.']

A peculiarity of XLNet is that it puts the <cls> token at the end of the sentence, with a type ID of 2 (to distinguish it from the other tokens). As a result, it pads on the left. We can deal with all the special tokens and token type IDs with a template, like for BERT, but first we have to get the IDs of the <cls> and <sep> tokens:

cls_token_id = tokenizer.token_to_id("<cls>")
sep_token_id = tokenizer.token_to_id("<sep>")
print(cls_token_id, sep_token_id)
# 0 1

The template looks like this:

tokenizer.post_processor = processors.TemplateProcessing(single="$A:0 <sep>:0 <cls>:2",pair="$A:0 <sep>:0 $B:1 <sep>:1 <cls>:2",special_tokens=[("<sep>", sep_token_id), ("<cls>", cls_token_id)],
)

And we can test it works by encoding a pair of sentences:

encoding = tokenizer.encode("Let's test this tokenizer...", "on a pair of sentences!")
print(encoding.tokens)
print(encoding.type_ids)
# ['▁Let', "'", 's', '▁test', '▁this', '▁to', 'ken', 'izer', '.', '.', '.', '<sep>', '▁', 'on', '▁', 'a', '▁pair', '▁of', '▁sentence', 's', '!', '<sep>', '<cls>']
# [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 2]

Finally, we add a Metaspace decoder:

tokenizer.decoder = decoders.Metaspace()

and we’re done with this tokenizer! We can save the tokenizer like before, and wrap it in a PreTrainedTokenizerFast or XLNetTokenizerFast if we want to use it in 🤗 Transformers. One thing to note when using PreTrainedTokenizerFast is that on top of the special tokens, we need to tell the 🤗 Transformers library to pad on the left:

from transformers import PreTrainedTokenizerFast

wrapped_tokenizer = PreTrainedTokenizerFast(
    tokenizer_object=tokenizer,
    bos_token="<s>",
    eos_token="</s>",
    unk_token="<unk>",
    pad_token="<pad>",
    cls_token="<cls>",
    sep_token="<sep>",
    mask_token="<mask>",
    padding_side="left",
)

Or alternatively:

from transformers import XLNetTokenizerFast

wrapped_tokenizer = XLNetTokenizerFast(tokenizer_object=tokenizer)
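
As a quick check of the left-padding behaviour (a minimal sketch — the exact IDs depend on your training run), padding a batch should place the pad token IDs at the start of the shorter sequence:

batch = wrapped_tokenizer(
    ["A short sentence.", "A noticeably longer sentence that forces some padding."],
    padding=True,
)
print(batch["input_ids"][0])  # pad token IDs appear on the left of the shorter sequence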

Now that you have seen how the various building blocks are used to build existing tokenizers, you should be able to write any tokenizer you want with the 🤗 Tokenizers library and be able to use it in 🤗 Transformers.
