Fine-tuning BERT for Text Classification

Introduction

This tutorial records an example of fine-tuning BERT (and other Transformer models) for text classification on a dataset of your choice, using the Hugging Face Transformers library.
The fine-tuning dataset used here is the 20 newsgroups text dataset.

The difference between fine-tuning and pre-training

  • Pre-training: in that setting a BERT model is trained from scratch, using the cc_news dataset and a masked language modeling (MLM) objective. This counts as pre-training because the model is trained on a general-purpose corpus to learn basic language representations.
  • Fine-tuning: taking an already pre-trained BERT model and training it further on a task-specific dataset (for example text classification or named entity recognition) is fine-tuning. Fine-tuning builds on pre-training, and its goal is to adapt the model to a specific downstream task (see the sketch after this list).
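A minimal sketch of the distinction (not from the original post; the checkpoint name and num_labels are placeholders): the same encoder weights can be loaded behind either head.

from transformers import BertForMaskedLM, BertForSequenceClassification

# pre-training objective: masked language modeling, no task-specific head
mlm_model = BertForMaskedLM.from_pretrained("bert-base-uncased")

# fine-tuning for classification: reuses the encoder, adds a freshly
# initialized linear layer with one output per class
clf_model = BertForSequenceClassification.from_pretrained("bert-base-uncased", num_labels=20)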

Model training code

Import the required libraries

import torch 
from transformers.file_utils import is_tf_available, is_torch_available
from transformers import BertTokenizerFast, BertForSequenceClassification
from transformers import Trainer, TrainingArguments
import numpy as np
import random
from sklearn.datasets import fetch_20newsgroups
from sklearn.model_selection import train_test_split

Set the seed

Setting a seed makes different runs produce the same results; the code is as follows:

def set_seed(seed: int):
    random.seed(seed)
    np.random.seed(seed)
    if is_torch_available():
        torch.manual_seed(seed)
        torch.cuda.manual_seed_all(seed)
        # ^^ safe to call this function even if cuda is not available
    if is_tf_available():
        import tensorflow as tf

        tf.random.set_seed(seed)

set_seed(1)

Choose the pretrained weights

We will use the pretrained bert-base-uncased weights (in a previous exercise we ran the from-scratch pre-training task on cc_news).

model_name = "bert-base-uncased"
max_length = 512

max_length is the maximum sequence length. We keep only the first 512 tokens of each document or post; you can change it to any value you like, but increasing it requires enough memory.

Load the tokenizer and dataset

tokenizer = BertTokenizerFast.from_pretrained(model_name, do_lower_case=True)

If you hit an SSLError while loading, it can be fixed in the following ways:

  • Download the files directly from https://huggingface.co/google-bert/bert-base-uncased/tree/main and save the folder inside your project directory, as shown below (see the sketch after this list for loading from a local path).
    [screenshot: project directory layout]
    Here my_train.py is the training script and inference.py is the inference script used to query the model.
  • Fix for a ConnectionError when downloading a dataset from the Hugging Face hub:
    open your Anaconda installation, find \envs\<name>\Lib\urllib\request.py, apply the modification shown below, and save.
    [screenshot: modification to request.py]
    After the change, turn on a VPN and the download works; on an AutoDL server you can enable the built-in network accelerator:
    source /etc/network_turbo
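If you went the manual-download route, from_pretrained also accepts a local directory. A minimal sketch, assuming the files were saved to a folder named ./bert-base-uncased next to the training script (the path is an assumption):

from transformers import BertTokenizerFast, BertForSequenceClassification

local_path = "./bert-base-uncased"  # folder with config.json, vocab.txt and the model weights
tokenizer = BertTokenizerFast.from_pretrained(local_path, do_lower_case=True)
model = BertForSequenceClassification.from_pretrained(local_path, num_labels=20)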
Download and load the 20newsgroups dataset from scikit-learn, and split it into a training set and a test set.
def read_20newsgroups(test_size=0.2):
    dataset = fetch_20newsgroups(subset="all", shuffle=True, remove=("headers", "footers", "quotes"))
    documents = dataset.data
    labels = dataset.target
    return train_test_split(documents, labels, test_size=test_size), dataset.target_names

(train_texts, valid_texts, train_labels, valid_labels), target_names = read_20newsgroups()
Here train_test_split returns the training texts, validation texts, training labels, and validation labels, while target_names holds the class names. The 20newsgroups dataset is a newsgroup classification problem: each text belongs to one of several categories such as sports, technology, or politics.
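As a quick check (hypothetical snippet; the printed values are recalled from the scikit-learn documentation, so verify locally), target_names holds the 20 category names:

print(len(target_names))  # 20
print(target_names[:3])   # ['alt.atheism', 'comp.graphics', 'comp.os.ms-windows.misc']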

Tokenize the dataset with the loaded tokenizer

train_encodings = tokenizer(train_texts, truncation=True, padding=True, max_length=max_length)
valid_encodings = tokenizer(valid_texts, truncation=True, padding=True, max_length=max_length)

We set truncation to True so that tokens beyond max_length are dropped, and padding to True so that documents shorter than max_length are filled up with padding tokens.
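A small sanity check may help here (hypothetical snippet, not part of the original code): each encoding carries input_ids, token_type_ids and attention_mask, where the mask marks real tokens with 1 and padding with 0.

first_ids = train_encodings["input_ids"][0]
first_mask = train_encodings["attention_mask"][0]
print(len(first_ids))                   # padded length, at most max_length (512)
print(sum(first_mask))                  # number of non-padding tokens in this document
print(tokenizer.decode(first_ids[:8]))  # "[CLS]" followed by the first words of the post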

Wrap the tokenized text data in a torch Dataset

class NewsGroupsDataset(torch.utils.data.Dataset):
    def __init__(self, encodings, labels):
        self.encodings = encodings
        self.labels = labels

    def __getitem__(self, idx):
        item = {k: torch.tensor(v[idx]) for k, v in self.encodings.items()}
        item["labels"] = torch.tensor([self.labels[idx]])
        return item

    def __len__(self):
        return len(self.labels)

# convert our tokenized data into a torch Dataset
train_dataset = NewsGroupsDataset(train_encodings, train_labels)
valid_dataset = NewsGroupsDataset(valid_encodings, valid_labels)

Load our BERT model with its pretrained weights

model = BertForSequenceClassification.from_pretrained(model_name, num_labels=len(target_names)).to("cuda")

Compute metrics

from sklearn.metrics import accuracy_score

def compute_metrics(pred):
    labels = pred.label_ids
    preds = pred.predictions.argmax(-1)
    # calculate accuracy using sklearn's function
    acc = accuracy_score(labels, preds)
    return {
        'accuracy': acc,
    }

We only added accuracy here, but you can also add precision, recall, and other metrics, as sketched below.
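A hedged sketch of an extended compute_metrics using scikit-learn's precision_recall_fscore_support (the macro averaging and the label flattening are assumptions; adjust them to your needs):

from sklearn.metrics import accuracy_score, precision_recall_fscore_support

def compute_metrics(pred):
    labels = pred.label_ids.reshape(-1)  # flatten in case labels carry an extra dimension
    preds = pred.predictions.argmax(-1)
    precision, recall, f1, _ = precision_recall_fscore_support(labels, preds, average="macro")
    return {
        "accuracy": accuracy_score(labels, preds),
        "precision": precision,
        "recall": recall,
        "f1": f1,
    }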
Use the TrainingArguments class to specify the training parameters, such as the number of epochs, the batch size, and a few others:

training_args = TrainingArguments(
    output_dir='./results',          # output directory
    num_train_epochs=1,              # number of training epochs
    per_device_train_batch_size=15,  # training batch size per device
    per_device_eval_batch_size=15,   # evaluation batch size per device
    # warmup_steps=30,               # warmup steps: the learning rate ramps up from a small value to the configured maximum during the first steps of training, which helps the model converge
    weight_decay=0.01,               # strength of weight decay
    logging_dir='./logs',            # directory for storing training logs
    load_best_model_at_end=True,     # reload the checkpoint with the best metric (eval loss by default) at the end of training
    # but you can specify `metric_for_best_model` to use accuracy or another metric instead
    logging_steps=100,               # log training information every 100 steps
    save_steps=100,                  # save a model checkpoint every 100 steps
    eval_strategy="steps",           # evaluate every `logging_steps` steps
)

trainer = Trainer(
    model=model,                         # the instantiated Transformers model to be trained
    args=training_args,                  # training arguments, defined above
    train_dataset=train_dataset,         # training dataset
    eval_dataset=valid_dataset,          # evaluation dataset
    compute_metrics=compute_metrics,     # the callback that computes metrics of interest
)
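As the comment above notes, metric_for_best_model can switch checkpoint selection from eval loss to another metric. A hedged sketch (the two extra arguments are an assumption, not part of the original configuration):

from transformers import TrainingArguments

training_args = TrainingArguments(
    output_dir='./results',
    num_train_epochs=1,
    per_device_train_batch_size=15,
    per_device_eval_batch_size=15,
    weight_decay=0.01,
    logging_dir='./logs',
    load_best_model_at_end=True,
    metric_for_best_model="accuracy",  # must match a key returned by compute_metrics
    greater_is_better=True,            # accuracy is maximized, unlike loss
    logging_steps=100,
    save_steps=100,
    eval_strategy="steps",
)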

Next, train the model, evaluate it after training, and save the fine-tuned model and tokenizer:

# train the model
trainer.train()

# evaluate the current model after training
trainer.evaluate()

# saving the fine tuned model & tokenizer
model_path = "20newsgroups-bert-base-uncased"
model.save_pretrained(model_path)
tokenizer.save_pretrained(model_path)

If you get the error: ImportError: Using the Trainer with PyTorch requires accelerate>=0.20.1: Please run pip install transformers[torch] or pip install accelerate -U
Fix: run pip install accelerate -U

Model inference code

from transformers import BertForSequenceClassification, BertTokenizerFast
from sklearn.model_selection import train_test_split
from sklearn.datasets import fetch_20newsgroups

model_path = "20newsgroups-bert-base-uncased"
max_length = 512

def read_20newsgroups(test_size=0.2):
    dataset = fetch_20newsgroups(subset="all", shuffle=True, remove=("headers", "footers", "quotes"))
    documents = dataset.data
    labels = dataset.target
    return train_test_split(documents, labels, test_size=test_size), dataset.target_names


(train_texts, valid_texts, train_labels, valid_labels), target_names = read_20newsgroups()

model = BertForSequenceClassification.from_pretrained(model_path, num_labels=len(target_names)).to("cuda")
tokenizer = BertTokenizerFast.from_pretrained(model_path)

def get_prediction(text):
    # prepare our text into tokenized sequence
    inputs = tokenizer(text, padding=True, truncation=True, max_length=max_length, return_tensors="pt").to("cuda")
    # perform inference to our model
    outputs = model(**inputs)
    # get output probabilities by doing softmax
    probs = outputs[0].softmax(1)
    # executing argmax function to get the candidate label
    return target_names[probs.argmax()]

# Example #1
text = """With the pace of smartphone evolution moving so fast, there's always something waiting in the wings.
No sooner have you spied the latest handset, that there's anticipation for the next big thing.
Here we look at those phones that haven't yet launched, the upcoming phones for 2021.
We'll be updating this list on a regular basis, with those device rumours we think are credible and exciting."""
print(get_prediction(text))
# Example #2
text = """
A black hole is a place in space where gravity pulls so much that even light can not get out.
The gravity is so strong because matter has been squeezed into a tiny space. This can happen when a star is dying.
Because no light can get out, people can't see black holes.
They are invisible. Space telescopes with special tools can help find black holes.
The special tools can see how stars that are very close to black holes act differently than other stars.
"""
print(get_prediction(text))

# Example #3
text = """
Coronavirus disease (COVID-19) is an infectious disease caused by a newly discovered coronavirus.
Most people infected with the COVID-19 virus will experience mild to moderate respiratory illness and recover without requiring special treatment.
Older people, and those with underlying medical problems like cardiovascular disease, diabetes, chronic respiratory disease, and cancer are more likely to develop serious illness.
"""
print(get_prediction(text))
