Hugging Face notes: AutoClass (quick tour section)

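An AutoClass is a shortcut that automatically retrieves the architecture of a pretrained model from its name or path. You only need to select the appropriate AutoClass for your task, together with its associated preprocessing class.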

1 AutoTokenizer

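A tokenizer preprocesses text into the arrays of numbers that a model takes as input. Several rules govern the tokenization process, including how to split a word and at what level the split happens (for example whole words, subwords, or characters).
Instantiate the tokenizer with the same model name as the model itself, so that the tokenization rules match the ones the model was pretrained with.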
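
To make "at what level to split" concrete, here is a minimal pure-Python sketch that contrasts word-level and character-level splitting and then maps tokens to ids. It is a toy illustration only: the helper names (`word_level`, `char_level`, `encode`) and the tiny vocabulary are hypothetical, and real tokenizers such as the one loaded below rely on a learned subword vocabulary instead.

```python
# Toy illustration only; not how a pretrained tokenizer is implemented.

def word_level(text):
    """Split on whitespace: one token per word."""
    return text.split()

def char_level(text):
    """Split into individual characters, dropping whitespace."""
    return [ch for ch in text if not ch.isspace()]

def encode(tokens, vocab, unk_id=0):
    """Map tokens to integer ids; tokens missing from vocab fall back to unk_id."""
    return [vocab.get(tok, unk_id) for tok in tokens]

text = "we are happy"
word_tokens = word_level(text)
char_tokens = char_level(text)

# Hypothetical vocabulary built from this one sentence; id 0 is reserved for unknown tokens.
vocab = {tok: i for i, tok in enumerate(word_tokens, start=1)}

word_ids = encode(word_tokens, vocab)
# A word the vocabulary has never seen ("sad") maps to the unknown id.
unseen_ids = encode(word_level("we are sad"), vocab)
```

The unknown-id fallback hints at why the tokenizer and the model must come from the same checkpoint: ids only mean something relative to the vocabulary they were produced with.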

1.1 Loading a tokenizer with AutoTokenizer

from transformers import AutoTokenizer

model_name = "nlptown/bert-base-multilingual-uncased-sentiment"
# Load the tokenizer that belongs to this checkpoint
tokenizer = AutoTokenizer.from_pretrained(model_name)

# Tokenize a single sentence
encoding = tokenizer("We are very happy to show you the 🤗 Transformers library.")
encoding
'''
{'input_ids': [101, 11312, 10320, 12495, 19308, 10114, 11391, 10855, 10103, 100, 58263, 13299, 119, 102], 
'token_type_ids': [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0], 
'attention_mask': [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]}
'''

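input_ids: the numerical representation of the tokens.
attention_mask: indicates which tokens the model should attend to (1 for real tokens, 0 for padding).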

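1.2 The tokenizer accepts a list of inputs

The tokenizer can also accept a list of inputs, padding and truncating the texts so that it returns a batch of uniform length.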

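tokenizer(
    ["We are very happy to show you the Transformers library.", "We hope you don't hate it."],
    padding=True,         # pad shorter inputs to the longest one in the batch
    truncation=True,      # cut inputs that exceed max_length
    max_length=512,
    return_tensors="pt",  # return PyTorch tensors (requires torch)
)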
'''
{'input_ids': tensor([[  101, 11312, 10320, 12495, 19308, 10114, 11391, 10855, 10103, 58263,13299,   119,   102],[  101, 11312, 18763, 10855, 11530,   112,   162, 39487, 10197,   119,102,     0,     0]]), 
'token_type_ids': tensor([[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0],[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]]), 
'attention_mask': tensor([[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1],[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0]])}
'''
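
The padding visible in the output above can be reproduced with a small pure-Python sketch. It assumes right-padding with pad id 0, which matches the output shown here; the function name `pad_and_truncate` is hypothetical, and the truncation is deliberately naive (a real tokenizer keeps the closing special token when it truncates).

```python
def pad_and_truncate(sequences, max_length, pad_id=0):
    """Truncate each sequence to max_length, then right-pad to the longest one left."""
    truncated = [seq[:max_length] for seq in sequences]
    target = max(len(seq) for seq in truncated)
    input_ids, attention_mask = [], []
    for seq in truncated:
        n_pad = target - len(seq)
        input_ids.append(seq + [pad_id] * n_pad)
        # 1 marks a real token, 0 marks padding the model should ignore
        attention_mask.append([1] * len(seq) + [0] * n_pad)
    return {"input_ids": input_ids, "attention_mask": attention_mask}

# Token ids copied from the batch output above, without the padding
first = [101, 11312, 10320, 12495, 19308, 10114, 11391, 10855, 10103, 58263, 13299, 119, 102]
second = [101, 11312, 18763, 10855, 11530, 112, 162, 39487, 10197, 119, 102]

batch = pad_and_truncate([first, second], max_length=512)
```

Here `batch["input_ids"][1]` ends in two pad ids and `batch["attention_mask"][1]` ends in two zeros, mirroring the second row of the tensors above.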
