Semantic Similarity with BERT in TensorFlow 2.10: A Walkthrough

Introduction

This article walks through computing text similarity with BERT + BiLSTM on top of tensorflow-gpu. The main configuration is:

tensorflow-gpu == 2.10.0
python == 3.10
transformers == 4.26.1

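Data Processing

We start by importing the libraries used in later steps (NumPy, Pandas, TensorFlow, and Transformers) and setting a few key parameters. max_length is the maximum length of the input text in tokens, batch_size is the number of samples per training batch, epochs is the number of passes over the training set, and the labels list holds the three class labels: contradiction, entailment, and neutral.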

import numpy as np
import pandas as pd
import tensorflow as tf
import transformers
max_length = 128   
batch_size = 32
epochs = 2
labels = ["contradiction", "entailment", "neutral"]

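Next, Pandas reads the training, validation, and test splits of the SNLI dataset; only the first 300,000 rows of the training set are loaded. The code prints the size of each split and then three training samples, each consisting of two sentences and a class label.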

train_df = pd.read_csv("SNLI_Corpus/snli_1.0_train.csv", nrows=300000)
valid_df = pd.read_csv("SNLI_Corpus/snli_1.0_dev.csv")
test_df = pd.read_csv("SNLI_Corpus/snli_1.0_test.csv")
print(f"Training samples: {train_df.shape[0]}")
print(f"Validation samples: {valid_df.shape[0]}")
print(f"Test samples: {test_df.shape[0]}")
print()
print(f"Sentence 1: {train_df.loc[5, 'sentence1']}")
print(f"Sentence 2: {train_df.loc[5, 'sentence2']}")
print(f"Similarity: {train_df.loc[5, 'similarity']}")
print()
print(f"Sentence 1: {train_df.loc[3, 'sentence1']}")
print(f"Sentence 2: {train_df.loc[3, 'sentence2']}")
print(f"Similarity: {train_df.loc[3, 'similarity']}")
print()
print(f"Sentence 1: {train_df.loc[4, 'sentence1']}")
print(f"Sentence 2: {train_df.loc[4, 'sentence2']}")
print(f"Similarity: {train_df.loc[4, 'similarity']}")

Output:

Training samples: 300000
Validation samples: 10000
Test samples: 10000

Sentence 1: Children smiling and waving at camera
Sentence 2: The kids are frowning
Similarity: contradiction

Sentence 1: Children smiling and waving at camera
Sentence 2: They are smiling at their parents
Similarity: neutral

Sentence 1: Children smiling and waving at camera
Sentence 2: There are children present
Similarity: entailment

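First, dropna removes rows with missing values from the training set. Then rows whose label is "-" are removed from the training, validation, and test sets ("-" marks SNLI pairs for which the annotators reached no consensus). Each split is then shuffled with sample and re-indexed with reset_index. Finally, the size of each split after processing is printed.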

train_df.dropna(axis=0, inplace=True)
train_df = ( train_df[train_df.similarity != "-"].sample(frac=1.0, random_state=30).reset_index(drop=True) )
valid_df = ( valid_df[valid_df.similarity != "-"].sample(frac=1.0, random_state=30).reset_index(drop=True) )
test_df  = ( test_df[test_df.similarity != "-"].sample(frac=1.0, random_state=30).reset_index(drop=True) )
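print(f"Training samples after cleaning: {train_df.shape[0]}")
print(f"Validation samples after cleaning: {valid_df.shape[0]}")
print(f"Test samples after cleaning: {test_df.shape[0]}")

Output:

Training samples after cleaning: 299616
Validation samples after cleaning: 9842
Test samples after cleaning: 9824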
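To make the filter, shuffle, and re-index chain easier to follow, here is a minimal sketch on a tiny hand-made DataFrame. The toy rows are invented for illustration and are not SNLI data; only the column names and the chain of calls mirror the code above.

```python
import pandas as pd

# Hypothetical toy rows; one of them carries the "-" label.
toy = pd.DataFrame({
    "sentence1": ["a", "b", "c", "d"],
    "sentence2": ["w", "x", "y", "z"],
    "similarity": ["neutral", "-", "entailment", "contradiction"],
})

cleaned = (
    toy[toy.similarity != "-"]           # drop rows without a usable label
    .sample(frac=1.0, random_state=30)   # shuffle all remaining rows
    .reset_index(drop=True)              # renumber rows 0..n-1
)

print(len(cleaned), list(cleaned.index))
```

Without reset_index(drop=True), the shuffled frame would keep its old row labels, and later positional lookups by index would no longer line up with row order.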

Here the class labels of the training, validation, and test sets are converted to integers and then to one-hot vectors. Specifically, apply maps "contradiction" to 0, "entailment" to 1, and "neutral" to 2, and to_categorical turns those integers into one-hot encodings. The resulting one-hot labels are stored in y_train, y_val, and y_test.

train_df["label"] = train_df["similarity"].apply(lambda x: 0 if x == "contradiction" else 1 if x == "entailment" else 2)
y_train = tf.keras.utils.to_categorical(train_df.label, num_classes=3)
valid_df["label"] = valid_df["similarity"].apply(lambda x: 0 if x == "contradiction" else 1 if x == "entailment" else 2)
y_val = tf.keras.utils.to_categorical(valid_df.label, num_classes=3)
test_df["label"] = test_df["similarity"].apply(lambda x: 0 if x == "contradiction" else 1 if x == "entailment" else 2)
y_test = tf.keras.utils.to_categorical(test_df.label, num_classes=3)
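As a rough illustration of what the label mapping and to_categorical produce, the sketch below reproduces the same encoding with plain NumPy. The names LABEL_TO_ID and one_hot are hypothetical helpers introduced here, not part of the original code.

```python
import numpy as np

# Explicit label-to-index mapping, in the same order as the `labels` list.
LABEL_TO_ID = {"contradiction": 0, "entailment": 1, "neutral": 2}

def one_hot(label_names, num_classes=3):
    """Map label strings to integer ids, then to one-hot rows."""
    ids = np.array([LABEL_TO_ID[name] for name in label_names])
    # Row i of the identity matrix is the one-hot vector for class i.
    return np.eye(num_classes, dtype="float32")[ids]

encoded = one_hot(["neutral", "contradiction", "entailment"])
print(encoded)
```

One difference worth noting: the dictionary lookup raises a KeyError for an unexpected label, whereas the nested conditional in the lambda silently maps anything that is not "contradiction" or "entailment" to class 2.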

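Building the Model

Here we define BertSemanticDataGenerator, a class that inherits from tf.keras.utils.Sequence and produces the data needed to train the BERT model.

The constructor takes an array of sentence pairs, sentence_pairs, and the corresponding labels. It also accepts the batch size batch_size, a shuffle flag that controls whether the data is shuffled, and an include_targets flag that controls whether labels are returned. The class also creates a BERT tokenizer from the bert-base-uncased pretrained model.

The class implements three methods: __len__, __getitem__, and on_epoch_end. __len__ returns the number of full batches of size batch_size in the dataset (any trailing partial batch is dropped). __getitem__ uses the batch indexes to fetch a batch from self.sentence_pairs, encodes the sentence pairs with the tokenizer so they are suitable as BERT input, and returns the inputs together with the labels. on_epoch_end runs after each epoch and shuffles the dataset if shuffle is set.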

class BertSemanticDataGenerator(tf.keras.utils.Sequence):
    def __init__( self, sentence_pairs, labels, batch_size=batch_size, shuffle=True, include_targets=True ):
        self.sentence_pairs = sentence_pairs
        self.labels = labels
        self.shuffle = shuffle
        self.batch_size = batch_size
        self.include_targets = include_targets
        self.tokenizer = transformers.BertTokenizer.from_pretrained("bert-base-uncased", do_lower_case=True )
        self.indexes = np.arange(len(self.sentence_pairs))
        self.on_epoch_end()
    def __len__(self):
        return len(self.sentence_pairs) // self.batch_size
    def __getitem__(self, idx):
        indexes = self.indexes[idx * self.batch_size : (idx + 1) * self.batch_size]
        sentence_pairs = self.sentence_pairs[indexes]
        # padding= and truncation= replace the deprecated pad_to_max_length flag.
        encoded = self.tokenizer.batch_encode_plus( sentence_pairs.tolist(), add_special_tokens=True,
            max_length=max_length, return_attention_mask=True, return_token_type_ids=True,
            padding="max_length", truncation=True, return_tensors="tf")
        input_ids = np.array(encoded["input_ids"], dtype="int32")
        attention_masks = np.array(encoded["attention_mask"], dtype="int32")
        token_type_ids = np.array(encoded["token_type_ids"], dtype="int32")
        if self.include_targets:
            labels = np.array(self.labels[indexes], dtype="int32")
            return [input_ids, attention_masks, token_type_ids], labels
        else:
            return [input_ids, attention_masks, token_type_ids]
    def on_epoch_end(self):
        if self.shuffle:
            np.random.RandomState(30).shuffle(self.indexes)
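To give a feel for the three arrays that __getitem__ returns, here is a deliberately simplified sketch of BERT's sentence-pair layout. It is an assumption-laden toy: encode_pair is a hypothetical helper, it splits on whitespace instead of using WordPiece, keeps tokens as strings rather than vocabulary ids, and does not implement truncation.

```python
def encode_pair(sentence1, sentence2, max_length=12):
    """Toy illustration of the [CLS] A [SEP] B [SEP] layout with padding."""
    tokens_a = sentence1.lower().split()
    tokens_b = sentence2.lower().split()
    tokens = ["[CLS]"] + tokens_a + ["[SEP]"] + tokens_b + ["[SEP]"]

    # Segment 0 covers [CLS], sentence 1 and the first [SEP];
    # segment 1 covers sentence 2 and the final [SEP].
    token_type_ids = [0] * (len(tokens_a) + 2) + [1] * (len(tokens_b) + 1)
    # 1 marks real tokens, 0 will mark padding.
    attention_mask = [1] * len(tokens)

    pad = max_length - len(tokens)
    if pad < 0:
        raise ValueError("this toy sketch does not implement truncation")

    return {
        "tokens": tokens + ["[PAD]"] * pad,
        "attention_mask": attention_mask + [0] * pad,
        "token_type_ids": token_type_ids + [0] * pad,
    }

example = encode_pair("Children smiling", "The kids are frowning")
for key, value in example.items():
    print(key, value)
```

The real tokenizer returns integer input_ids in place of the token strings, but the attention mask and segment ids follow the same pattern.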

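Here we use TensorFlow 2 and the Transformers library to build a BERT-based text classification model. The main steps of the code are as follows.

First, three input tensors are defined: input_ids, attention_masks, and token_type_ids. Each has shape (max_length,), where max_length is the maximum length of the preprocessed text sequence.

Next, a BERT model, bert_model, is created by calling TFBertModel.from_pretrained, which loads the weights of the pretrained BERT model. bert_model.trainable is set to False so that BERT's parameters are not updated during this training stage.

Then input_ids, attention_masks, and token_type_ids are passed to bert_model to obtain bert_output. BERT's last hidden state (last_hidden_state) is taken as sequence_output, the input to the LSTM layer.

sequence_output is then processed by a Bi-LSTM layer with 64 units per direction that returns the full sequence, so each time step carries 2 × 64 = 128 features. Global average pooling and global max pooling are applied to the Bi-LSTM output to give avg_pool and max_pool, each of dimension 128. Concatenating the two yields a 256-dimensional vector (not 128), which passes through a Dropout layer and then a Dense layer that outputs the final classification.

Finally, tf.keras.models.Model defines the complete network with input_ids, attention_masks, and token_type_ids as inputs and output as the output. model.compile then compiles the model with the Adam optimizer, the categorical_crossentropy loss, and acc as the evaluation metric.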
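The shapes in the pooling and concatenation step can be checked with a small NumPy sketch. The random array below only stands in for a real Bi-LSTM output, and the batch size of 2 is arbitrary; the point is the shape arithmetic, not the values.

```python
import numpy as np

batch, steps, lstm_units = 2, 128, 64
rng = np.random.default_rng(0)

# Stand-in for the Bi-LSTM output: forward and backward states are
# concatenated, so each time step carries 2 * 64 = 128 features.
bi_lstm_out = rng.normal(size=(batch, steps, 2 * lstm_units))

avg_pool = bi_lstm_out.mean(axis=1)   # average over time steps -> (batch, 128)
max_pool = bi_lstm_out.max(axis=1)    # maximum over time steps -> (batch, 128)
concat = np.concatenate([avg_pool, max_pool], axis=-1)  # -> (batch, 256)

print(avg_pool.shape, max_pool.shape, concat.shape)
# (2, 128) (2, 128) (2, 256)
```

Both pooling operations collapse the time axis, which is why the classifier head sees a fixed-size 256-dimensional vector regardless of sentence length.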

input_ids = tf.keras.layers.Input(shape=(max_length,), dtype=tf.int32, name="input_ids")
attention_masks = tf.keras.layers.Input(shape=(max_length,), dtype=tf.int32, name="attention_masks")
token_type_ids = tf.keras.layers.Input(shape=(max_length,), dtype=tf.int32, name="token_type_ids")
bert_model = transformers.TFBertModel.from_pretrained("bert-base-uncased")
bert_model.trainable = False
bert_output = bert_model.bert(input_ids, attention_mask=attention_masks, token_type_ids=token_type_ids)
sequence_output = bert_output.last_hidden_state
bi_lstm = tf.keras.layers.Bidirectional(tf.keras.layers.LSTM(64, return_sequences=True))(sequence_output)
avg_pool = tf.keras.layers.GlobalAveragePooling1D()(bi_lstm)
max_pool = tf.keras.layers.GlobalMaxPooling1D()(bi_lstm)
concat = tf.keras.layers.concatenate([avg_pool, max_pool])
dropout = tf.keras.layers.Dropout(0.5)(concat)
output = tf.keras.layers.Dense(3, activation="softmax")(dropout)
model = tf.keras.models.Model(inputs=[input_ids, attention_masks, token_type_ids], outputs=output)
model.compile( optimizer=tf.keras.optimizers.Adam(), loss="categorical_crossentropy", metrics=["acc"],)
model.summary()

The printed model summary shows that all of BERT's parameters are frozen:

Model: "model"
__________________________________________________________________________________________________
 Layer (type)                   Output Shape         Param #     Connected to                     
==================================================================================================
 input_ids (InputLayer)         [(None, 128)]        0           []                               
 attention_masks (InputLayer)   [(None, 128)]        0           []                               
 token_type_ids (InputLayer)    [(None, 128)]        0           []                               
 bert (TFBertMainLayer)         TFBaseModelOutputWi  109482240   ['input_ids[0][0]',              
                                thPoolingAndCrossAt               'attention_masks[0][0]',        
                                tentions(last_hidde               'token_type_ids[0][0]']         
                                n_state=(None, 128,                                               
                                 768),                                                            
                                 pooler_output=(Non                                               
                                e, 768),                                                          
                                 past_key_values=No                                               
                                ne, hidden_states=N                                               
                                one, attentions=Non                                               
                                e, cross_attentions                                               
                                =None)                                                            
 bidirectional (Bidirectional)  (None, 128, 128)     426496      ['bert[0][0]']                   
 global_average_pooling1d (Glob  (None, 128)         0           ['bidirectional[0][0]']          
 alAveragePooling1D)                                                                              
 global_max_pooling1d (GlobalMa  (None, 128)         0           ['bidirectional[0][0]']          
 xPooling1D)                                                                                      
 concatenate (Concatenate)      (None, 256)          0           ['global_average_pooling1d[0][0]'
                                                                 , 'global_max_pooling1d[0][0]']  
 dropout_37 (Dropout)           (None, 256)          0           ['concatenate[0][0]']            
 dense (Dense)                  (None, 3)            771         ['dropout_37[0][0]']             
==================================================================================================
Total params: 109,909,507
Trainable params: 427,267
Non-trainable params: 109,482,240
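The trainable parameter counts in the summary can be reproduced by hand. The sketch below uses the standard parameter formulas for an LSTM layer and a Dense layer; lstm_params and dense_params are hypothetical helper names introduced for this illustration.

```python
def lstm_params(input_dim, units):
    # Four gates, each with an input kernel, a recurrent kernel and a bias.
    return 4 * (units * (input_dim + units) + units)

def dense_params(input_dim, units):
    # Weight matrix plus one bias per output unit.
    return input_dim * units + units

bert_hidden = 768   # width of BERT's last_hidden_state
lstm_units = 64     # units per LSTM direction

bi_lstm = 2 * lstm_params(bert_hidden, lstm_units)   # forward + backward
dense = dense_params(2 * 2 * lstm_units, 3)          # 256 pooled features -> 3 classes

print(bi_lstm, dense, bi_lstm + dense)
# 426496 771 427267
```

These match the bidirectional and dense rows of the summary, and their sum matches the reported trainable parameter count while BERT is frozen.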

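Training the Model

First, the training and validation sets are wrapped in BertSemanticDataGenerator objects to create a training data generator, train_data, and a validation data generator, valid_data. The model is then trained by calling model.fit() with train_data as the training data and valid_data as the validation data. The use_multiprocessing and workers arguments are meant to control how many processes load data during training, to speed it up. Note that the Keras documentation describes workers as a maximum number of processes and does not document a special meaning for -1, so a positive value may be needed to actually get parallel loading.

Finally, the training history is stored in the history variable and can be used to analyze how training went.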

train_data = BertSemanticDataGenerator( train_df[["sentence1", "sentence2"]].values.astype("str"), y_train, batch_size=batch_size, shuffle=True)
valid_data = BertSemanticDataGenerator( valid_df[["sentence1", "sentence2"]].values.astype("str"), y_val, batch_size=batch_size, shuffle=False)
history = model.fit( train_data, validation_data=valid_data, epochs=epochs, use_multiprocessing=True,  workers=-1 )

Output:

Epoch 1/2
11/9363 [..............................] - ETA: 16:31 - loss: 1.1949 - acc: 0.3580
31/9363 [..............................] - ETA: 13:51 - loss: 1.1223 - acc: 0.3831
...
Epoch 2/2
...
9363/9363 [==============================] - ETA: 0s - loss: 0.5691 - acc: 0.7724
9363/9363 [==============================] - 791s 84ms/step - loss: 0.5691 - acc: 0.7724 - val_loss: 0.4635 - val_acc: 0.8226
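The 9363 steps per epoch in the log follow directly from the generator's __len__, which uses floor division. A small sketch of that arithmetic, using the cleaned dataset sizes printed earlier (batches_and_leftover is a hypothetical helper name):

```python
def batches_and_leftover(num_samples, batch_size=32):
    """Return (number of full batches, samples dropped from the last partial batch)."""
    return divmod(num_samples, batch_size)

sizes = {"train": 299616, "valid": 9842, "test": 9824}
result = {name: batches_and_leftover(n) for name, n in sizes.items()}

for name, (full, leftover) in result.items():
    print(f"{name}: {full} batches, {leftover} samples left over")
# train: 9363 batches, 0 samples left over
# valid: 307 batches, 18 samples left over
# test: 307 batches, 0 samples left over
```

With batch_size = 32, the training and test sets divide evenly, while the last 18 validation samples never reach the model because __len__ discards the partial batch.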

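Fine-tuning the Model

This step fine-tunes the model trained above, BERT included, so that it adapts to the new task. Setting bert_model.trainable to True allows BERT's parameters to be updated during fine-tuning. The model is then recompiled (required for the change to trainable to take effect) with tf.keras.optimizers.Adam(1e-5), so fine-tuning uses a small learning rate. The loss is again categorical_crossentropy, which measures the difference between the predicted distribution and the true label distribution. Finally, model.summary() shows the model structure and parameter counts; all parameters are now trainable.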

bert_model.trainable = True
model.compile( optimizer=tf.keras.optimizers.Adam(1e-5), loss="categorical_crossentropy",  metrics=["accuracy"] )
model.summary()

Output:

Model: "model"
__________________________________________________________________________________________________
 Layer (type)                   Output Shape         Param #     Connected to                     
==================================================================================================
 input_ids (InputLayer)         [(None, 128)]        0           []                               
 attention_masks (InputLayer)   [(None, 128)]        0           []                               
 token_type_ids (InputLayer)    [(None, 128)]        0           []                               
 bert (TFBertMainLayer)         TFBaseModelOutputWi  109482240   ['input_ids[0][0]',              
                                thPoolingAndCrossAt               'attention_masks[0][0]',        
                                tentions(last_hidde               'token_type_ids[0][0]']         
                                n_state=(None, 128,                                               
                                 768),                                                            
                                 pooler_output=(Non                                               
                                e, 768),                                                          
                                 past_key_values=No                                               
                                ne, hidden_states=N                                               
                                one, attentions=Non                                               
                                e, cross_attentions                                               
                                =None)                                                            
 bidirectional (Bidirectional)  (None, 128, 128)     426496      ['bert[0][0]']                   
 global_average_pooling1d (Glob  (None, 128)         0           ['bidirectional[0][0]']          
 alAveragePooling1D)                                                                              
 global_max_pooling1d (GlobalMa  (None, 128)         0           ['bidirectional[0][0]']          
 xPooling1D)                                                                                      
 concatenate (Concatenate)      (None, 256)          0           ['global_average_pooling1d[0][0]'
                                                                 , 'global_max_pooling1d[0][0]']  
 dropout_37 (Dropout)           (None, 256)          0           ['concatenate[0][0]']            
 dense (Dense)                  (None, 3)            771         ['dropout_37[0][0]']             
==================================================================================================
Total params: 109,909,507
Trainable params: 109,909,507
Non-trainable params: 0
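For intuition about the categorical_crossentropy loss used in both compile calls, here is a minimal pure-Python sketch for a single sample. The probability vectors are made-up values for illustration, and the clipping constant eps is an assumption chosen here rather than taken from the original code.

```python
import math

def categorical_crossentropy(y_true, y_pred, eps=1e-7):
    """Cross-entropy between a one-hot label and a predicted distribution."""
    # Clip probabilities to avoid taking log(0).
    clipped = [min(max(p, eps), 1.0 - eps) for p in y_pred]
    return -sum(t * math.log(p) for t, p in zip(y_true, clipped))

entailment = [0, 1, 0]  # one-hot label for class 1
confident = categorical_crossentropy(entailment, [0.05, 0.90, 0.05])
unsure = categorical_crossentropy(entailment, [0.30, 0.40, 0.30])

print(f"confident: {confident:.4f}, unsure: {unsure:.4f}")
```

The loss shrinks as more probability mass lands on the correct class, which is the quantity reported as loss in the training logs.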

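Continuing from the model above, we run the fine-tuning training. Validation accuracy improves over the frozen-BERT run, from 0.8226 to 0.8974.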

history = model.fit( train_data, validation_data=valid_data, epochs=epochs, use_multiprocessing=True, workers=-1,)

Output:

Epoch 1/2
7/9363 [..............................] - ETA: 24:41 - loss: 0.5716 - accuracy: 0.7946
...
Epoch 2/2
...
9363/9363 [==============================] - 1500s 160ms/step - loss: 0.3201 - accuracy: 0.8845 - val_loss: 0.2933 - val_accuracy: 0.8974

Evaluating the Model

The model's performance is evaluated on the test data.

test_data = BertSemanticDataGenerator(  test_df[["sentence1", "sentence2"]].values.astype("str"), y_test, batch_size=batch_size, shuffle=False)
model.evaluate(test_data, verbose=1)

Output:

307/307 [==============================] - 18s 57ms/step - loss: 0.2916 - accuracy: 0.8951
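The accuracy figure reported by evaluate is the fraction of samples whose highest-probability class matches the one-hot label. A minimal NumPy sketch of that computation, using four made-up predictions rather than real model output:

```python
import numpy as np

def categorical_accuracy(y_true_one_hot, y_prob):
    """Fraction of rows where the predicted class equals the true class."""
    true_ids = np.argmax(y_true_one_hot, axis=-1)
    pred_ids = np.argmax(y_prob, axis=-1)
    return float(np.mean(true_ids == pred_ids))

# Hypothetical labels and predicted probabilities for four samples.
y_true = np.array([[1, 0, 0], [0, 1, 0], [0, 0, 1], [0, 1, 0]])
y_prob = np.array([
    [0.7, 0.2, 0.1],   # predicted 0, correct
    [0.1, 0.6, 0.3],   # predicted 1, correct
    [0.2, 0.5, 0.3],   # predicted 1, wrong (true class is 2)
    [0.3, 0.4, 0.3],   # predicted 1, correct
])

acc = categorical_accuracy(y_true, y_prob)
print(acc)  # 0.75
```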

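Inference Test

Here we define a function, check_similarity, that checks the semantic relationship between two sentences. It takes two sentences, sentence1 and sentence2. The sentences are first combined into an np.array for convenience, then a BertSemanticDataGenerator is created to produce test data in the format the model expects. The trained model returns the predicted probabilities for the sentence pair, and the class with the highest probability is taken as the prediction.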

def check_similarity(sentence1, sentence2):
    # Wrap the pair in a 2-D array, the shape BertSemanticDataGenerator expects.
    sentence_pairs = np.array([[str(sentence1), str(sentence2)]])
    test_data = BertSemanticDataGenerator( sentence_pairs, labels=None, batch_size=1, shuffle=False, include_targets=False )
    # test_data[0] is the single batch; [0] selects its only row of probabilities.
    proba = model.predict(test_data[0])[0]
    idx = np.argmax(proba)
    # proba[idx] is a probability in [0, 1], so it is reported without a "%" sign.
    proba = f"{proba[idx]:.2f}"
    pred = labels[idx]
    return pred, proba
sentence1 = "Male in a blue jacket decides to lay in the grass"
sentence2 = "The guy wearing a blue jacket is laying on the green grass"
check_similarity(sentence1, sentence2)

Output:

('entailment', '0.51')
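Since the softmax output is a probability between 0 and 1, it can also be shown as a percentage using Python's % format specifier, which multiplies by 100 before formatting. The sketch below is a standalone illustration: format_prediction is a hypothetical helper and the probability vector is made up, not real model output.

```python
labels = ["contradiction", "entailment", "neutral"]

def format_prediction(proba):
    """Return the most likely label and its probability as a percentage string."""
    idx = max(range(len(proba)), key=lambda i: proba[i])
    # The "%" presentation type scales by 100 and appends the percent sign.
    return labels[idx], f"{proba[idx]:.2%}"

prediction = format_prediction([0.12, 0.51, 0.37])
print(prediction)  # ('entailment', '51.00%')
```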
