Hugging Face NLP: Key Concepts and a Detailed Usage Guide

1. Install the Hugging Face dependencies

pip install transformers
pip install datasets
pip install torch

pip install tokenizers
pip install diffusers
pip install accelerate
pip install evaluate
pip install optimum

pip install pillow
pip install requests
pip install gradio

[Figure: Transformers and related libraries]

2. Set the cache directory for Hugging Face model and dataset downloads:

Set the cache environment variable: HF_HOME=C:\.cache\huggingface
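To make the effect of HF_HOME concrete, here is a minimal sketch of how the cache locations can be derived from the environment. It assumes the documented defaults (HF_HOME falls back to ~/.cache/huggingface, and the hub cache is the hub subfolder unless HF_HUB_CACHE is set). The function name resolve_hf_cache is made up for this post, and the real library handles more cases (for example XDG_CACHE_HOME), so treat it as an illustration only.

```python
import os
from pathlib import Path


def resolve_hf_cache(env=None):
    """Return (hf_home, hub_cache) as this simplified sketch resolves them."""
    env = os.environ if env is None else env
    # Assumed default when HF_HOME is not set.
    default_home = Path.home() / ".cache" / "huggingface"
    hf_home = Path(env.get("HF_HOME", default_home))
    # HF_HUB_CACHE, if present, overrides the "hub" subfolder of HF_HOME.
    hub_cache = Path(env.get("HF_HUB_CACHE", hf_home / "hub"))
    return hf_home, hub_cache


home, hub = resolve_hf_cache()
print("HF_HOME   :", home)
print("hub cache :", hub)
```

Running this before and after setting HF_HOME is a quick way to check that the variable is actually visible to your Python process.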

3. Common Hugging Face commands:

Download a dataset:
huggingface-cli download lansinuote/ChnSentiCorp --repo-type dataset

Download a model:
huggingface-cli download bert-base-chinese

Upgrade (or downgrade) transformers to a specific version:
pip uninstall transformers
pip install transformers==4.48.0

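4. Hugging Face NLP task types:

text-classification: assign a label to a piece of text
feature-extraction: represent a piece of text as a vector
fill-mask: mask out parts of a text and let the model fill in the blanks
ner (named entity recognition): identify named entities such as person and place names in a text
question-answering: given a passage and a question about it, extract the answer from the passage
summarization: generate a short summary of a long text
text-generation: given a prompt, let the model continue the text
translation: translate text from one language into another
conversational: a chatbot that responds to user input and carries on a dialogue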

[Figure: the stages of natural language processing]

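5. Key Hugging Face concepts:

tokenizer (data preprocessing)
transformer
model invocation
pipeline
fine-tuning

The transformer workflow is as follows:

A typical transformer workflow has three parts: 1. Tokenizer, 2. Model, 3. Post-processing

The tokenizer converts the input text into input IDs;
the input IDs are fed to the model, which returns raw predictions (logits);
the predictions are passed to post-processing, which turns them into human-readable output.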
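The three stages above can be sketched in plain Python. Everything here is a toy: the vocabulary, the per-token weights, and the label names are invented for illustration and have nothing to do with a real pretrained model. The point is only to show how text becomes IDs, IDs become logits, and logits become a readable label.

```python
import math

# Hypothetical toy vocabulary and weights, invented for this illustration.
VOCAB = {"[UNK]": 0, "good": 1, "great": 2, "bad": 3, "awful": 4, "movie": 5}
TOKEN_SCORES = {1: 1.5, 2: 2.0, 3: -1.5, 4: -2.0}
LABELS = ["NEGATIVE", "POSITIVE"]


def tokenize(text):
    """Stage 1: map whitespace-separated words to input IDs."""
    return [VOCAB.get(word, VOCAB["[UNK]"]) for word in text.lower().split()]


def model(input_ids):
    """Stage 2: a stand-in 'model' that returns two logits."""
    score = sum(TOKEN_SCORES.get(i, 0.0) for i in input_ids)
    return [-score, score]  # logits for NEGATIVE, POSITIVE


def postprocess(logits):
    """Stage 3: softmax over the logits, then pick the most likely label."""
    top = max(logits)
    exps = [math.exp(x - top) for x in logits]
    total = sum(exps)
    probs = [e / total for e in exps]
    best = max(range(len(probs)), key=probs.__getitem__)
    return {"label": LABELS[best], "score": probs[best]}


def toy_pipeline(text):
    return postprocess(model(tokenize(text)))


print(toy_pipeline("great movie"))
print(toy_pipeline("awful movie"))
```

A real pipeline follows the same shape, with a trained tokenizer and a neural network in place of the lookup tables.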

6. Hugging Face basic usage, with annotated code:

Downloading models via the configured environment variable (code)

import os
import re
from datasets import load_dataset
from transformers import BertTokenizer, BertForSequenceClassification
from transformers import Trainer, TrainingArguments
from sklearn.metrics import accuracy_score

# Step 1. Environment setup: pip install torch transformers datasets scikit-learn

# Step 2. Load the Chinese BERT pretrained model and tokenizer
# print("HF_HOME:", os.environ['HF_HOME'])
# print('HUGGINGFACE_HUB_CACHE:', os.environ['HUGGINGFACE_HUB_CACHE'])
# print('TRANSFORMERS_CACHE:', os.environ['TRANSFORMERS_CACHE'])
# print('HF_DATASETS_CACHE:', os.environ['HF_DATASETS_CACHE'])

# Custom download location: on Windows, pretrained models are cached by default under
# C:/Users/<username>/.cache/huggingface/hub. To change this, set the environment
# variable HF_HOME, e.g. HF_HOME=C:/.cache/huggingface.
# The first download of a dataset or model is slow; afterwards the local copy is used.
tokenizer = BertTokenizer.from_pretrained('bert-base-chinese')
# ChnSentiCorp is a binary sentiment dataset (labels 0 and 1), so use two labels
model = BertForSequenceClassification.from_pretrained('bert-base-chinese', num_labels=2)

# Step 3. Load the ChnSentiCorp dataset and clean it
# Dataset page: https://huggingface.co/datasets/lansinuote/ChnSentiCorp
dataset = load_dataset('lansinuote/ChnSentiCorp')

# Data cleaning function
def clean_text(text):
    text = re.sub(r'[^\w\s]', '', text)  # remove punctuation
    text = text.strip()  # remove leading and trailing whitespace
    return text

# Step 4. Preprocessing: clean, tokenize, and encode the text
def tokenize_function(examples):
    texts = [clean_text(t) for t in examples['text']]
    return tokenizer(texts, padding='max_length', truncation=True, max_length=128)

encoded_dataset = dataset.map(tokenize_function, batched=True)

# Step 5. Train the model
# Evaluation function; it must be passed to the Trainer to be reported
def compute_metrics(p):
    preds = p.predictions.argmax(-1)
    return {"accuracy": accuracy_score(p.label_ids, preds)}

# Training arguments
training_args = TrainingArguments(
    output_dir='./results',  # directory for checkpoints and other output files
    num_train_epochs=1,  # number of training epochs
    per_device_train_batch_size=1,  # training batch size per device (e.g. per GPU)
    per_device_eval_batch_size=1,  # evaluation batch size per device
    eval_strategy='epoch',  # evaluate after every epoch (named evaluation_strategy in older transformers releases)
    logging_dir='./logs',  # directory for training logs
)

# Train with the Trainer
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=encoded_dataset['train'],
    eval_dataset=encoded_dataset['validation'],
    compute_metrics=compute_metrics,
)
trainer.train()

# Step 6. Evaluate the model on the test set
results = trainer.evaluate(encoded_dataset['test'], metric_key_prefix="eval")
print(results)  # a dict such as {'eval_loss': 0.2, 'eval_accuracy': 0.85, ...}

# eval_loss (0.2 in the example above) is the loss on the test set.
# The loss measures the gap between the model's predictions and the true labels;
# a lower loss usually means the predictions are closer to the true labels.
# eval_accuracy (0.85 in the example above) is the accuracy on the test set,
# i.e. the fraction of samples that were predicted correctly;
# 0.85 means 85% of the test samples were classified correctly.

# Step 7. Export the model: save the model and the tokenizer
model.save_pretrained('./saved_model')
tokenizer.save_pretrained('./saved_model')
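To see what eval_loss and eval_accuracy actually measure, here is a small pure-Python sketch that computes both from a handful of made-up logits and labels. It is not the Trainer's implementation, just the same two definitions (mean cross-entropy over softmax probabilities, and the fraction of correct argmax predictions) written out by hand.

```python
import math


def cross_entropy(logits, labels):
    """Mean negative log-probability of the true label under softmax(logits)."""
    total = 0.0
    for row, y in zip(logits, labels):
        top = max(row)
        # log-sum-exp, shifted by the max for numerical stability
        log_sum_exp = top + math.log(sum(math.exp(x - top) for x in row))
        total += log_sum_exp - row[y]
    return total / len(labels)


def accuracy(logits, labels):
    """Fraction of rows whose largest logit sits at the true label."""
    preds = [max(range(len(row)), key=row.__getitem__) for row in logits]
    return sum(p == y for p, y in zip(preds, labels)) / len(labels)


# Made-up example: three samples, two classes.
logits = [[2.0, 0.0], [0.0, 1.0], [1.0, 3.0]]
labels = [0, 1, 0]
print({"eval_loss": cross_entropy(logits, labels),
       "eval_accuracy": accuracy(logits, labels)})
```

The third sample is predicted as class 1 although its label is 0, so it lowers the accuracy and contributes most of the loss.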

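Downloading a model with the Hugging Face Hub API (code)

"""
1. Install: pip install huggingface_hub
2. Decide which files to download. Most models need at least two kinds of file:
   model weights (e.g. .bin or .pt files) and configuration files (e.g. .json files).
"""
from huggingface_hub import hf_hub_download

# Download a model with the Hugging Face Hub API
# Repository name of the model
repo_name = "jonatasgrosman/wav2vec2-large-xlsr-53-english"
# Files to download. You need to know the file names in the repository, e.g. "pytorch_model.bin", "config.json"
file_names = ["pytorch_model.bin", "config.json"]
for file_name in file_names:
    # Download the file to the local cache
    file_path = hf_hub_download(repo_id=repo_name, filename=file_name)
    print(f"Downloaded {file_name} to {file_path}")

# With this approach, downloads of large files may hang or time out;
# in that case, fetching the model repository with git clone is recommended.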

Manually downloading a model from Hugging Face into your own project directory (code)

import os.path
from transformers import BertTokenizer, BertModel
"""
Manually download the model from Hugging Face, place it under the project directory HuggingfaceModels, and load it from there
"""
PATH = r"HuggingfaceModels/"
modelPath = os.path.join(PATH, 'bert-base-chinese')

# 1. Load the pretrained BERT tokenizer
tokenizer = BertTokenizer.from_pretrained(modelPath)
# 2. Load the pretrained BertModel
model = BertModel.from_pretrained(modelPath)
text = "虽然今天下雨了，但我拿到了心意的offer,很开心！"
# 3. Tokenize and encode the text
encode_input = tokenizer(text, return_tensors="pt")
# 4. Feed the encoded text to the pretrained BERT model
output = model(**encode_input)
print("output:", output)

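Using the Chinese word segmentation library jieba:
pip install jieba

Code:

"""
jieba, a Chinese word segmentation library. Accurate mode tries to cut the sentence as precisely as possible and is suited to text analysis.
"""
import jieba

content = "无线电法国别研究"
# jieba.lcut returns a list directly
words = jieba.lcut(content, cut_all=False)  # cut_all defaults to False (accurate mode)
print(words)
# Search-engine mode: splits long words again on top of accurate mode
search_words = jieba.lcut_for_search(content)
print(search_words)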
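For intuition about dictionary-based segmentation, here is a sketch of forward maximum matching, one of the simplest approaches. To be clear, this is not how jieba works internally (jieba builds a word graph from a prefix dictionary, picks the most probable path with dynamic programming, and uses an HMM for unknown words). The mini dictionary below is made up for this example.

```python
def forward_max_match(text, dictionary, max_word_len=4):
    """Greedy segmentation: at each position take the longest dictionary word."""
    words = []
    pos = 0
    while pos < len(text):
        # Try the longest candidate first and shrink until there is a hit.
        for size in range(min(max_word_len, len(text) - pos), 0, -1):
            candidate = text[pos:pos + size]
            # A single character is always accepted as a fallback.
            if size == 1 or candidate in dictionary:
                words.append(candidate)
                pos += size
                break
    return words


toy_dict = {"无线电", "无线", "法国", "国别", "研究"}  # hypothetical mini dictionary
print(forward_max_match("无线电法国别研究", toy_dict))
```

Because the method is greedy, it commits to 法国 and leaves 别 on its own; a probabilistic segmenter can weigh alternatives such as 国别 instead, which is exactly why this sentence is a popular test case.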


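Using the vocabulary and the tokenizer:

from transformers import BertTokenizer
"""
1. Using the vocabulary and the tokenizer
"""
# Load the pretrained vocabulary and tokenization method (i.e. load the tokenizer)
tokenizer = BertTokenizer.from_pretrained(
    pretrained_model_name_or_path='bert-base-chinese',
    cache_dir=None,
    force_download=False,
)

# Prepare the corpus
sents = [
    '选择珠江花园的原因是方便。',
    '笔记本的键盘确实是好。',
    '房间太小。其他的一般。',
    '今天才知道这本书还有第6卷，真有点郁闷.',
    '机器背面似乎被撕两张什么标签，残胶还在.',
]
print(tokenizer, sents)

# 1. Encode a sentence pair with the simple encoding function tokenizer.encode()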
# out=tokenizer.encode(
#     text=sents[0],
#     text_pair=sents[1],
#     truncation=True,  # truncate when the input is longer than max_length
#     padding='max_length',  # always pad with [PAD] up to max_length
#     add_special_tokens=True,
#     max_length=30,
#     return_tensors=None,
# )
# print('Encoded sentence pair, out:', out)
# print(tokenizer.decode(out))

# 2. Use the extended encoding function tokenizer.encode_plus()
# out=tokenizer.encode_plus(
#     text=sents[0],
#     text_pair=sents[1],
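#     truncation=True,  # truncate when the input is longer than max_length
#     padding='max_length',  # always pad up to max_length
#     max_length=30,
#     add_special_tokens=True,
#     return_tensors=None,  # one of tf, pt, np; plain lists are returned by default
#     return_token_type_ids=True,  # return token_type_ids
#     return_attention_mask=True,  # return attention_mask
#     return_special_tokens_mask=True,  # return special_tokens_mask, which flags special tokens
#     # return_offsets_mapping=True,  # return offsets_mapping (start and end of each token); only supported by BertTokenizerFast
#     return_length=True,  # return the sequence length
# )

# Fields in the extended encoding result:
# input_ids: the encoded token ids
# token_type_ids: 0 for [CLS], the first sentence and its [SEP]; 1 for the second sentence and its closing [SEP]
# special_tokens_mask: 1 at special tokens, 0 elsewhere
# attention_mask: 0 at [PAD] positions, 1 elsewhere
# length: the length of the encoded sequence

# for k,v in out.items():
#     print(k,",",v)
#
# print(tokenizer.decode(out['input_ids']))

# 3. Batch-encode sentences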
# out=tokenizer.batch_encode_plus(
#     batch_text_or_text_pairs=[sents[0],sents[1]],
#     add_special_tokens=True,
#     truncation=True,  # truncate when the input is longer than max_length
#     padding='max_length',  # always pad up to max_length
#     max_length=15,
#     return_tensors=None,  # one of tf, pt, np; plain lists are returned by default
#     return_token_type_ids=True,  # return token_type_ids
#     return_attention_mask=True,  # return attention_mask
#     return_special_tokens_mask=True,  # return special_tokens_mask, which flags special tokens
#     #return_offsets_mapping=True,
#     return_length=True,  # return the sequence length
# )
#
#
# for k,v in out.items():
#     print(k,",",v)
#
# print(tokenizer.decode(out['input_ids'][0]), tokenizer.decode(out['input_ids'][1]))

# 4. Batch-encode sentence pairs
# out=tokenizer.batch_encode_plus(
#     batch_text_or_text_pairs=[(sents[0],sents[1]),(sents[2],sents[3])],
#     add_special_tokens=True,
#     truncation=True,  # truncate when the input is longer than max_length
#     padding='max_length',  # always pad up to max_length
#     max_length=15,
#     return_tensors=None,  # one of tf, pt, np; plain lists are returned by default
#     return_token_type_ids=True,  # return token_type_ids
#     return_attention_mask=True,  # return attention_mask
#     return_special_tokens_mask=True,  # return special_tokens_mask, which flags special tokens
#     #return_offsets_mapping=True,
#     return_length=True,  # return the sequence length
# )
#
# for k,v in out.items():
#     print(k,",",v)
#
# print(tokenizer.decode(out['input_ids'][0]))

"""
Vocabulary operations
"""
# Get the vocabulary
zidian = tokenizer.get_vocab()
print(type(zidian), len(zidian), '月光' in zidian)

# Add new words
tokenizer.add_tokens(new_tokens=['月光', '希望'])
# Add a new special token
tokenizer.add_special_tokens({'eos_token': '[EOS]'})

zidian = tokenizer.get_vocab()
print(type(zidian), len(zidian), zidian['月光'], zidian['[EOS]'])

# Encode text that contains the new words
out = tokenizer.encode(
    text='月光的新希望[EOS]',
    text_pair=None,
    truncation=True,
    padding='max_length',
    add_special_tokens=True,
    max_length=8,
    return_tensors=None,
)
print("out:", out)
print(tokenizer.decode(out))
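The fields returned by the extended encoding functions are easier to remember once you build them by hand. The sketch below lays out a sentence pair as [CLS] A [SEP] B [SEP], truncates, pads, and produces the four masks. It works on lists of integer ids rather than text. The special-token ids are assumed to match bert-base-chinese, the function name encode_pair is made up, and the truncation rule (drop one token at a time from the longer sentence) only approximates what the library does.

```python
CLS, SEP, PAD = 101, 102, 0  # assumed special-token ids for bert-base-chinese


def encode_pair(ids_a, ids_b, max_length):
    """Toy version of pair encoding with truncation and max_length padding."""
    if max_length < 3:
        raise ValueError("max_length must leave room for [CLS] and two [SEP]")
    a, b = list(ids_a), list(ids_b)
    # Truncate: drop tokens from the longer sentence until everything fits.
    while len(a) + len(b) + 3 > max_length:
        (a if len(a) >= len(b) else b).pop()
    input_ids = [CLS] + a + [SEP] + b + [SEP]
    token_type_ids = [0] * (len(a) + 2) + [1] * (len(b) + 1)
    special_tokens_mask = [1] + [0] * len(a) + [1] + [0] * len(b) + [1]
    attention_mask = [1] * len(input_ids)
    pad = max_length - len(input_ids)
    return {
        "input_ids": input_ids + [PAD] * pad,
        "token_type_ids": token_type_ids + [0] * pad,
        "special_tokens_mask": special_tokens_mask + [1] * pad,
        "attention_mask": attention_mask + [0] * pad,
    }


for key, value in encode_pair([11, 12, 13], [21, 22], max_length=10).items():
    print(key, ",", value)
```

Every list has length max_length, which is what lets a batch of such encodings be stacked into one tensor.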


Using the GPT-2 model:

import os.path
from transformers import GPT2Tokenizer, GPT2Model, AutoTokenizer, AutoModel
"""
Using the GPT-2 model
"""
# The Auto classes are generic loaders that pick the right tokenizer and model class for a checkpoint
PATH = r"HuggingfaceModels/"
modelPath = os.path.join(PATH, 'gpt2')

# tokenizer = GPT2Tokenizer.from_pretrained(modelPath)
# model = GPT2Model.from_pretrained(modelPath)
tokenizer = AutoTokenizer.from_pretrained(modelPath)
model = AutoModel.from_pretrained(modelPath)
text = "i love dog,dog is cute."
encoded_input = tokenizer(text, return_tensors="pt")
output = model(**encoded_input)
print("output:", output)


Using Hugging Face pipelines:

from transformers import pipeline, GPT2Tokenizer
"""
Using Hugging Face pipelines
"""
# 1. Sentiment classification (text classification)
# classifier= pipeline('sentiment-analysis')
#
# result=classifier("I hate you")[0]
# print("result:",result)
#
# result=classifier("I love you")[0]
# print("result:",result)

# 2. Reading comprehension (extractive question answering)
# question_answerer= pipeline('question-answering')
# context=r"""
# Extractive Question Answering is the task of extracting an answer from question  answering dataset is the SQuAD dataset,
# which is entirely bring a model  on a SQuAD task,you may leverage the example/pytorch/question
# """
# result=question_answerer(question="What is extractive question answering?",context=context)
# print("result:",result)
#
# result=question_answerer(question="What is a good example of a  question answering dataset?",context=context)
# print("result:",result)

# 3. Text generation
# text_generator=pipeline('text-generation',model='gpt2')
# sample =text_generator("As far as I am concerned,I will",
#                max_length=50,
#                truncation=True,
#                do_sample=False,
#                pad_token_id=text_generator.tokenizer.eos_token_id,
#                )
# print("sample:",sample)

# 4. Named entity recognition
# ner_pipe= pipeline("ner")
# sequence="""
# Hugging face Inc. is a company based in New York City.therefore very close to the Manhattan Bridge which is visible from t
# """
# for entity in ner_pipe(sequence):
#     print("entity:",entity)

# 5. Text summarization
# summarizer = pipeline("summarization")
# ARTICLE = """
# New York (CNN) When Liana  was 22 years old, A year later, she got married again in Westchester Country,but to a
# only 18 days after that marriage,she got hitched yet again.how many time did she  marriage?
# """
# print(summarizer(ARTICLE,max_length=40,min_length=30,do_sample=False,num_return_sequences=1))

# 6. Translation
translator=pipeline("translation_en_to_de")
sentence="I love china,do you like china?"
out=translator(sentence,max_length=40)
print("out:",out)


Binary classification demo with a pretrained BERT model

import os.path
import torch
from torch import nn
"""
Binary classification demo with a pretrained BERT model
"""
from transformers import BertTokenizer, BertModel, AutoTokenizer, AutoModel

PATH = r"HuggingfaceModels/"
modelPath = os.path.join(PATH, 'bert-base-chinese')

# 1. Load the BERT model and tokenizer
# tokenizer = BertTokenizer.from_pretrained(modelPath)
# model = BertModel.from_pretrained(modelPath)
tokenizer = AutoTokenizer.from_pretrained(modelPath)
model = AutoModel.from_pretrained(modelPath)

# 2. Define the sentence classifier
class BertSentenceClassifier(nn.Module):
    def __init__(self, bert_model, num_classes):
        super(BertSentenceClassifier, self).__init__()
        self.bert = bert_model
        self.classifier = nn.Linear(bert_model.config.hidden_size, num_classes)

    def forward(self, input_ids, attention_mask):
        # Run BERT
        outputs = self.bert(input_ids=input_ids, attention_mask=attention_mask)
        # Pooled representation of the [CLS] token
        pooler_output = outputs.pooler_output
        # Feed it to the classification layer
        logits = self.classifier(pooler_output)
        return logits

# Example text
text = "虽然今天下雨了，但我拿到了心意的offer,太糟糕了！"
# Convert the text into BERT's input format
encode_input = tokenizer(text, return_tensors="pt")
# Initialize the classifier, assuming two labels (e.g. positive and negative)
classifier = BertSentenceClassifier(model, num_classes=2)
# Get the classification result
logits = classifier(encode_input['input_ids'], encode_input['attention_mask'])
# Convert the logits into probabilities
probabilities = torch.softmax(logits, dim=-1)
# Print the classification result
print("Logits:", logits)
print("Probabilities:", probabilities)

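In the author's run, the output indicated a negative rather than a positive attitude. Note that the linear classification layer in this demo is randomly initialized and has not been trained, so the probabilities only become meaningful after the layer is trained on labeled data.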
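The step from logits to probabilities to a label can be written out in a few lines of plain Python. The logits and the label order below are made up for illustration (the demo itself does not define which index means negative or positive), so this is only a sketch of the mechanics.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    top = max(logits)
    exps = [math.exp(x - top) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


labels = ["negative", "positive"]  # assumed label order, not defined by the demo
logits = [0.3, -0.1]               # made-up logits for one sentence

probs = softmax(logits)
pred = labels[probs.index(max(probs))]
print("Probabilities:", probs)
print("Prediction:", pred)
```

With logits this close together the probabilities stay near 0.5, which is typical of what an untrained classification layer produces.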

Hugging Face Chinese text classification demo (binary classification)

import os.path
import torch
from torch.optim import AdamW
from datasets import load_dataset
from transformers import BertTokenizer, BertModel
"""
Hugging Face Chinese text classification demo (binary classification)
"""
# 1. Define the dataset
class Dataset(torch.utils.data.Dataset):
    def __init__(self, split):
        self.dataset = load_dataset(os.path.join(r"HuggingfaceModels/", 'ChnSentiCorp'), split=split, trust_remote_code=True)

    def __len__(self):
        return len(self.dataset)

    def __getitem__(self, i):
        text = self.dataset[i]["text"]
        label = self.dataset[i]["label"]
        return text, label

dataset = Dataset('train')
print(len(dataset), dataset[0])

# 2. Load the tokenizer (vocabulary and tokenization method)
token = BertTokenizer.from_pretrained('bert-base-chinese')
print(token)

# 3. Define the batch collation function
def collate_fn(data):
    sents = [i[0] for i in data]
    labels = [i[1] for i in data]
    # Encode
    data = token.batch_encode_plus(batch_text_or_text_pairs=sents, truncation=True, padding="max_length",
                                   max_length=500, return_tensors="pt", return_length=True)
    # input_ids: the token ids after encoding
    # attention_mask: 0 at padded positions, 1 elsewhere
    input_ids = data['input_ids']
    attention_mask = data['attention_mask']
    token_type_ids = data['token_type_ids']
    labels = torch.LongTensor(labels)
    # print(data['length'], data['length'].max())
    return input_ids, attention_mask, token_type_ids, labels

# 4. Define the data loader
loader = torch.utils.data.DataLoader(dataset=dataset, batch_size=16, collate_fn=collate_fn, shuffle=True, drop_last=True)
# Grab one batch to inspect the shapes
for i, (input_ids, attention_mask, token_type_ids, labels) in enumerate(loader):
    break
print(len(loader))
print(input_ids.shape, attention_mask.shape, token_type_ids.shape, labels.shape)

# 5. Load the pretrained Chinese BERT model
pretrained = BertModel.from_pretrained('bert-base-chinese')
# BERT itself is not trained here, so no gradients are needed
for param in pretrained.parameters():
    param.requires_grad_(False)
# Trial run of the model
out = pretrained(input_ids=input_ids, attention_mask=attention_mask, token_type_ids=token_type_ids)
print(out.last_hidden_state.shape)

# 6. Define the downstream model: a simple neural network
class Model(torch.nn.Module):
    def __init__(self):
        super().__init__()
        self.fc = torch.nn.Linear(768, 2)

    def forward(self, input_ids, attention_mask, token_type_ids):
        with torch.no_grad():
            out = pretrained(input_ids=input_ids, attention_mask=attention_mask, token_type_ids=token_type_ids)
        # Use the features of token 0 ([CLS]) and return raw logits;
        # torch.nn.CrossEntropyLoss applies log-softmax itself
        return self.fc(out.last_hidden_state[:, 0])

model = Model()
print(model(input_ids=input_ids, attention_mask=attention_mask, token_type_ids=token_type_ids).shape)

# 7. Train the downstream model
optimizer = AdamW(model.parameters(), lr=5e-4)
criterion = torch.nn.CrossEntropyLoss()

model.train()
for i, (input_ids, attention_mask, token_type_ids, labels) in enumerate(loader):
    out = model(input_ids=input_ids, attention_mask=attention_mask, token_type_ids=token_type_ids)
    loss = criterion(out, labels)
    loss.backward()
    optimizer.step()  # gradient descent step
    optimizer.zero_grad()
    if i % 5 == 0:
        out = out.argmax(dim=-1)
        accuracy = (out == labels).sum().item() / len(labels)
        print(i, loss.item(), accuracy)
    if i == 3:  # stop after a few batches in this demo
        break

# 8. Test
def test():
    model.eval()
    correct = 0
    total = 0
    loader_test = torch.utils.data.DataLoader(dataset=Dataset('validation'), batch_size=32, collate_fn=collate_fn, shuffle=True, drop_last=True)
    for i, (input_ids, attention_mask, token_type_ids, labels) in enumerate(loader_test):
        if i == 5:  # evaluate on the first 5 batches only
            break
        print(i)
        with torch.no_grad():
            out = model(input_ids=input_ids, attention_mask=attention_mask, token_type_ids=token_type_ids)
        out = out.argmax(dim=-1)
        correct += (out == labels).sum().item()
        total += len(labels)
    print(correct / total)

test()
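One detail worth knowing when pairing a model head with torch.nn.CrossEntropyLoss: the loss expects raw logits and applies log-softmax internally. If the model already applies softmax, the loss effectively sees softmax twice and can never get close to zero. The pure-Python sketch below reproduces that effect for a two-class case with made-up numbers; it mimics the loss definition by hand rather than calling PyTorch.

```python
import math


def softmax(row):
    top = max(row)
    exps = [math.exp(x - top) for x in row]
    total = sum(exps)
    return [e / total for e in exps]


def cross_entropy_single(row, label):
    """Cross-entropy for one example, treating `row` as logits."""
    return -math.log(softmax(row)[label])


logits = [8.0, -8.0]  # a very confident, correct prediction for class 0
label = 0

loss_from_logits = cross_entropy_single(logits, label)
# Feeding probabilities instead of logits means softmax is applied twice.
loss_from_probs = cross_entropy_single(softmax(logits), label)
# Lowest loss reachable when the inputs are confined to [0, 1].
floor = math.log(1 + math.exp(-1))

print("loss on logits       :", loss_from_logits)
print("loss on probabilities:", loss_from_probs)
print("floor with 2 classes :", floor)
```

However confident the model becomes, the second loss cannot drop below the floor, which weakens the training signal. Returning logits from the model and applying softmax only when probabilities are needed avoids this.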

Hugging Face Chinese fill-in-the-blank (mask) demo

import os.path
import torch
from torch.optim import AdamW
from datasets import load_dataset
from transformers import BertTokenizer, BertModel
"""
Hugging Face Chinese fill-in-the-blank (mask) demo
"""
# 1. Define the dataset
class Dataset(torch.utils.data.Dataset):
    def __init__(self, split):
        path = r"HuggingfaceModels/"
        dataSetPath = os.path.join(path, 'ChnSentiCorp')
        dataset = load_dataset(dataSetPath, split=split)

        # Keep only texts longer than 30 characters
        def f(data):
            return len(data['text']) > 30

        self.dataset = dataset.filter(f)

    def __len__(self):
        return len(self.dataset)

    def __getitem__(self, i):
        text = self.dataset[i]['text']
        return text

dataset = Dataset("train")
print(len(dataset), dataset[0])

# 2. Load the tokenizer
token = BertTokenizer.from_pretrained('bert-base-chinese')
print(token)

# 3. Define the batch collation function
def collate_fn(data):
    # Each item is already a text string
    sents = list(data)
    # Encode
    data = token.batch_encode_plus(batch_text_or_text_pairs=sents, truncation=True, padding="max_length",
                                   max_length=500, return_tensors="pt", return_length=True)
    # input_ids: the token ids after encoding
    # attention_mask: 0 at padded positions, 1 elsewhere
    input_ids = data['input_ids']
    attention_mask = data['attention_mask']
    token_type_ids = data['token_type_ids']
    # Always replace the token at position 15 with [MASK]; the original token is the label
    labels = input_ids[:, 15].reshape(-1).clone()
    input_ids[:, 15] = token.get_vocab()[token.mask_token]
    return input_ids, attention_mask, token_type_ids, labels

# 4. Define the data loader
loader = torch.utils.data.DataLoader(dataset=dataset, batch_size=16, collate_fn=collate_fn, shuffle=True, drop_last=True)
# Grab one batch to inspect it
for i, (input_ids, attention_mask, token_type_ids, labels) in enumerate(loader):
    break
print(len(loader))
print(token.decode(input_ids[0]))
print(token.decode(labels[0]))
print(input_ids.shape, attention_mask.shape, token_type_ids.shape, labels.shape)

# 5. Load the pretrained Chinese BERT model
pretrained = BertModel.from_pretrained('bert-base-chinese')
# BERT itself is not trained here, so no gradients are needed
for param in pretrained.parameters():
    param.requires_grad_(False)
# Trial run of the model
out = pretrained(input_ids=input_ids, attention_mask=attention_mask, token_type_ids=token_type_ids)
print(out.last_hidden_state.shape)

# 6. Define the downstream model: a simple neural network
class Model(torch.nn.Module):
    def __init__(self):
        super().__init__()
        self.decoder = torch.nn.Linear(768, token.vocab_size, bias=False)
        self.bias = torch.nn.Parameter(torch.zeros(token.vocab_size))
        self.decoder.bias = self.bias

    def forward(self, input_ids, attention_mask, token_type_ids):
        with torch.no_grad():
            out = pretrained(input_ids=input_ids, attention_mask=attention_mask, token_type_ids=token_type_ids)
        out = self.decoder(out.last_hidden_state[:, 15])  # features of the token at position 15
        return out

model = Model()
print(model(input_ids=input_ids, attention_mask=attention_mask, token_type_ids=token_type_ids).shape)

# 7. Train the downstream model
optimizer = AdamW(model.parameters(), lr=5e-4)
criterion = torch.nn.CrossEntropyLoss()

model.train()
for epoch in range(5):
    for i, (input_ids, attention_mask, token_type_ids, labels) in enumerate(loader):
        out = model(input_ids=input_ids, attention_mask=attention_mask, token_type_ids=token_type_ids)
        loss = criterion(out, labels)
        loss.backward()
        optimizer.step()  # gradient descent step
        optimizer.zero_grad()
        if i % 50 == 0:
            out = out.argmax(dim=-1)
            accuracy = (out == labels).sum().item() / len(labels)
            print(epoch, i, loss.item(), accuracy)

# 8. Test
def test():
    model.eval()
    correct = 0
    total = 0
    loader_test = torch.utils.data.DataLoader(dataset=Dataset('test'), batch_size=32, collate_fn=collate_fn, shuffle=True, drop_last=True)
    for i, (input_ids, attention_mask, token_type_ids, labels) in enumerate(loader_test):
        if i == 15:
            break
        print(i)
        with torch.no_grad():
            out = model(input_ids=input_ids, attention_mask=attention_mask, token_type_ids=token_type_ids)
        out = out.argmax(dim=-1)
        correct += (out == labels).sum().item()
        total += len(labels)
        print(token.decode(input_ids[0]))
        # True token vs. predicted token for the first sample
        print(token.decode(labels[0]), token.decode(out[0]))
    print(correct / total)

test()
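The masking step in the collate function is the heart of this demo, so here it is in isolation on plain Python lists. The id 103 for [MASK] is an assumption about the bert-base-chinese vocabulary (in real code, read it from the tokenizer, e.g. token.mask_token_id), and mask_position is a name made up for this sketch.

```python
MASK_ID = 103  # assumed id of [MASK] in bert-base-chinese


def mask_position(batch_input_ids, position, mask_id=MASK_ID):
    """Hide the token at `position` in every row; the hidden token is the label."""
    labels = [row[position] for row in batch_input_ids]
    masked = [row[:position] + [mask_id] + row[position + 1:]
              for row in batch_input_ids]
    return masked, labels


batch = [
    [101, 5, 6, 7, 102],
    [101, 8, 9, 10, 102],
]
masked, labels = mask_position(batch, position=2)
print(masked)
print(labels)
```

Note the order of operations: the labels are copied out before the position is overwritten, which is why the tensor version in the collate function calls clone() first.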

Hugging Face Chinese sentence-relation inference demo

import random
import torch
from torch.optim import AdamW
from datasets import load_dataset
import os.path
from transformers import BertTokenizer, BertModel
"""
Hugging Face Chinese sentence-relation inference demo
"""
# 1. Define the dataset
class Dataset(torch.utils.data.Dataset):
    def __init__(self, split):
        path = r"HuggingfaceModels/"
        dataSetPath = os.path.join(path, 'ChnSentiCorp')
        dataset = load_dataset(dataSetPath, split=split)

        # Keep only texts longer than 40 characters
        def f(data):
            return len(data['text']) > 40

        self.dataset = dataset.filter(f)

    def __len__(self):
        return len(self.dataset)

    def __getitem__(self, i):
        text = self.dataset[i]['text']
        # Split the text into a first half and a second half
        sentence1 = text[:20]
        sentence2 = text[20:40]
        label = 0
        # With probability one half, replace the second half with an unrelated sentence
        if random.randint(0, 1) == 0:
            j = random.randint(0, len(self.dataset) - 1)
            sentence2 = self.dataset[j]['text'][20:40]
            label = 1
        return sentence1, sentence2, label

dataset = Dataset("train")
sentence1, sentence2, label = dataset[0]
print(len(dataset), sentence1, sentence2, label)

# 2. Load the tokenizer
token = BertTokenizer.from_pretrained('bert-base-chinese')
print(token)

# 3. Define the batch collation function
def collate_fn(data):
    sents = [i[:2] for i in data]  # (sentence1, sentence2) pairs
    labels = [i[2] for i in data]
    # Encode
    data = token.batch_encode_plus(batch_text_or_text_pairs=sents, truncation=True, padding="max_length",
                                   max_length=45, return_tensors="pt", return_length=True,
                                   add_special_tokens=True)
    # input_ids: the token ids after encoding
    # attention_mask: 0 at padded positions, 1 elsewhere
    input_ids = data['input_ids']
    attention_mask = data['attention_mask']
    token_type_ids = data['token_type_ids']
    labels = torch.LongTensor(labels)
    return input_ids, attention_mask, token_type_ids, labels

# 4. Define the data loader
loader = torch.utils.data.DataLoader(dataset=dataset, batch_size=8, collate_fn=collate_fn, shuffle=True, drop_last=True)
# Grab one batch to inspect it
for i, (input_ids, attention_mask, token_type_ids, labels) in enumerate(loader):
    break
print(len(loader))
print(token.decode(input_ids[0]))
print(input_ids.shape, attention_mask.shape, token_type_ids.shape, labels.shape)

# 5. Load the pretrained Chinese BERT model
pretrained = BertModel.from_pretrained('bert-base-chinese')
# BERT itself is not trained here, so no gradients are needed
for param in pretrained.parameters():
    param.requires_grad_(False)
# Trial run of the model
out = pretrained(input_ids=input_ids, attention_mask=attention_mask, token_type_ids=token_type_ids)
print(out.last_hidden_state.shape)

# 6. Define the downstream model: a simple neural network
class Model(torch.nn.Module):
    def __init__(self):
        super().__init__()
        self.fc = torch.nn.Linear(768, 2)

    def forward(self, input_ids, attention_mask, token_type_ids):
        with torch.no_grad():
            out = pretrained(input_ids=input_ids, attention_mask=attention_mask, token_type_ids=token_type_ids)
        # Use the features of token 0 ([CLS]) and return raw logits;
        # torch.nn.CrossEntropyLoss applies log-softmax itself
        return self.fc(out.last_hidden_state[:, 0])

model = Model()
print(model(input_ids=input_ids, attention_mask=attention_mask, token_type_ids=token_type_ids).shape)

# 7. Train the downstream model
optimizer = AdamW(model.parameters(), lr=5e-4)
criterion = torch.nn.CrossEntropyLoss()

model.train()
for i, (input_ids, attention_mask, token_type_ids, labels) in enumerate(loader):
    out = model(input_ids=input_ids, attention_mask=attention_mask, token_type_ids=token_type_ids)
    loss = criterion(out, labels)  # compute the loss
    loss.backward()
    optimizer.step()  # gradient descent step
    optimizer.zero_grad()
    if i % 5 == 0:
        out = out.argmax(dim=-1)
        accuracy = (out == labels).sum().item() / len(labels)
        print(i, loss.item(), accuracy)
    if i == 300:
        break

# 8. Test
def test():
    model.eval()
    correct = 0
    total = 0
    loader_test = torch.utils.data.DataLoader(dataset=Dataset('test'), batch_size=32, collate_fn=collate_fn, shuffle=True, drop_last=True)
    for i, (input_ids, attention_mask, token_type_ids, labels) in enumerate(loader_test):
        if i == 15:
            break
        print(i)
        with torch.no_grad():
            out = model(input_ids=input_ids, attention_mask=attention_mask, token_type_ids=token_type_ids)
        pred = out.argmax(dim=-1)
        correct += (pred == labels).sum().item()
        total += len(labels)
    print(correct / total)

test()
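The way training pairs are built in this demo can be tried out without any model or dataset. The sketch below repeats the same idea on two made-up strings: split a text into two 20-character halves, and with probability one half swap in the second half of a randomly chosen text and set the label to 1. The function name make_pair and the sample texts are invented for illustration.

```python
import random


def make_pair(texts, index, rng, half=20):
    """Return (first_half, second_half, label); label 1 means the halves were mixed."""
    text = texts[index]
    first, second, label = text[:half], text[half:2 * half], 0
    if rng.randint(0, 1) == 0:
        other = rng.randint(0, len(texts) - 1)
        second = texts[other][half:2 * half]
        label = 1
    return first, second, label


# Made-up texts, each longer than 40 characters as the dataset filter requires.
texts = ["a" * 20 + "b" * 25, "c" * 20 + "d" * 25]
rng = random.Random(0)  # seeded so that runs are repeatable
for _ in range(4):
    print(make_pair(texts, 0, rng))
```

One caveat that carries over from the demo: the random index can land on the original text, in which case a pair labeled 1 is actually contiguous. With a large dataset this is rare, but it does add a little label noise.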