Text Classification - 07 ELMo

Posted by yifanrensheng

tags:


Contents

  1. Overview
  2. Dataset
  3. Main code

1. Overview

This text-classification series will consist of roughly eight articles. The code can be downloaded directly from GitHub and the training data from Baidu Netdisk; import the project into PyCharm and it is ready to use. It covers text classification based on word2vec pretrained embeddings as well as classification based on recent pretrained models (ELMo, BERT, etc.). The full series is:

word2vec pretrained word vectors

textCNN model

charCNN model

Bi-LSTM model

Bi-LSTM + Attention model

Transformer model

ELMo pretrained model

BERT pretrained model

2. Dataset

The dataset is the IMDB movie-review corpus. There are three data files under /data/rawData: unlabeledTrainData.tsv, labeledTrainData.tsv and testData.tsv. Text classification requires the labelled data (labeledTrainData), but when training the word2vec word-vector model (unsupervised learning) the unlabelled data can be used as well.

Training data: https://pan.baidu.com/s/1-XEwx1ai8kkGsMagIFKX_g (extraction code: rtz8)
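As a quick look at the raw data (a minimal sketch, assuming the Kaggle IMDB files are tab-separated with an id / sentiment / review header and stored under ../data/rawData as described above):

import pandas as pd

# labeledTrainData.tsv is tab-separated, hence sep="\t" (assumption based on the Kaggle IMDB format)
df = pd.read_csv("../data/rawData/labeledTrainData.tsv", sep="\t")
print(df.shape)
print(df.head())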


  The ELMo model uses a BiLM (bidirectional language model) to pretrain word-vector representations, and the word vectors can then be generated dynamically from our own training set. The ELMo pretrained model comes from the paper "Deep contextualized word representations". For a detailed introduction to the ELMo model itself, see this article.
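For reference, the paper combines the biLM layer outputs into a single task-specific vector per token as a learned weighted sum, where L is the number of biLM layers, the s_j are softmax-normalised layer weights and gamma is a task-specific scale:

\mathrm{ELMo}_k^{task} = \gamma^{task} \sum_{j=0}^{L} s_j^{task}\, h_{k,j}^{LM}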

  Before using it we also need to download the pretrained model parameters. Open https://allennlp.org/elmo; under the "Pre-trained ELMo Models" section there are four model sizes to choose from, and here we pick the Small model. Two files need to be downloaded: an "options" json file, which stores the model's configuration parameters, and a "weights" hdf5 file, which stores the model structure and weight values (you can open it with h5py to have a look).
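A minimal sketch for inspecting the two downloaded files; the paths below are the ones used in the config later in this post and may need adjusting to wherever you saved the files:

import json
import h5py

# read the model hyper-parameters from the options file
with open("../data/elmodata/elmo_options.json", "r") as f:
    options = json.load(f)
print(options)

# list the groups/datasets stored in the weights file
with h5py.File("../data/elmodata/elmo_weights.hdf5", "r") as f:
    f.visit(print)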


3. Main code

3.1 Training configuration: parameter_config.py

  Here we need to fill in the paths for optionFile, vocabFile, weightsFile and tokenEmbeddingFile. One other thing to watch out for is that embeddingSize must equal the size of the ELMo word vectors. The functions and classes from the bilm folder also need to be imported (this is done in the later scripts).


# Author:yifan
# _*_ coding:utf-8 _*_
# All required imports are collected here for reuse; they can be used directly after converting to jupyter.
# 1 Training configuration

class TrainingConfig(object):
    epoches = 5
    evaluateEvery = 100
    checkpointEvery = 100
    learningRate = 0.001

class ModelConfig(object):
    embeddingSize = 256  # must match the output size of the ELMo model

    hiddenSizes = [128]  # number of units in each LSTM layer

    dropoutKeepProb = 0.5
    l2RegLambda = 0.0

class Config(object):
    sequenceLength = 200  # roughly the mean of all sequence lengths
    batchSize = 128

    dataSource = "../data/preProcess/labeledTrain.csv"

    stopWordSource = "../data/english"

    optionFile = "../data/elmodata/elmo_options.json"
    weightFile = "../data/elmodata/elmo_weights.hdf5"
    vocabFile = "../data/elmodata/vocab.txt"
    tokenEmbeddingFile = '../data/elmodata/elmo_token_embeddings.hdf5'

    numClasses = 2

    rate = 0.8  # proportion of the data used for training

    training = TrainingConfig()

    model = ModelConfig()

3.2 Obtaining the training data: get_train_data.py

1) Read in the data.

2) Generate the vocabFile from the training set (one token per line; see the sketch after this list).

3) Call dump_token_embeddings from the bilm folder to generate the initial word-vector representations and save them as an hdf5 file whose key is "embedding".

4) Fix the input sequences to a constant length.

5) Split the data into a training set and a test set.
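For reference, the generated vocab.txt is expected to hold one token per line with the three special tokens first; the word order below is only an illustration, the real file is sorted by frequency of the training data:

<S>
</S>
<UNK>
the
and
of
...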


# _*_ coding:utf-8 _*_
# Author:yifan
import json
from collections import Counter
import gensim
import pandas as pd
import numpy as np
import parameter_config
# dump_token_embeddings comes from the bilm package (the same import is used in mode_trainning.py)
from bilm import dump_token_embeddings

# 2 Data preprocessing class: builds the training set and the evaluation set
class Dataset(object):
    def __init__(self, config):
        self._dataSource = config.dataSource
        self._stopWordSource = config.stopWordSource
        self._optionFile = config.optionFile
        self._weightFile = config.weightFile
        self._vocabFile = config.vocabFile
        self._tokenEmbeddingFile = config.tokenEmbeddingFile

        self._sequenceLength = config.sequenceLength  # every input sequence is fixed to this length
        self._embeddingSize = config.model.embeddingSize
        self._batchSize = config.batchSize
        self._rate = config.rate

        self.trainReviews = []
        self.trainLabels = []

        self.evalReviews = []
        self.evalLabels = []

    def _readData(self, filePath):
        """Read the dataset from the csv file."""
        df = pd.read_csv(filePath)
        labels = df["sentiment"].tolist()
        review = df["review"].tolist()
        reviews = [line.strip().split() for line in review]
        return reviews, labels

    def _genVocabFile(self, reviews):
        """Build a vocabulary file from our training data and add the three special tokens."""
        allWords = [word for review in reviews for word in review]
        wordCount = Counter(allWords)  # count word frequencies
        sortWordCount = sorted(wordCount.items(), key=lambda x: x[1], reverse=True)
        words = [item[0] for item in sortWordCount]
        allTokens = ['<S>', '</S>', '<UNK>'] + words
        with open(self._vocabFile, 'w', encoding='UTF-8') as fout:
            fout.write('\n'.join(allTokens))

    def _fixedSeq(self, reviews):
        """Truncate reviews longer than 200 tokens to a length of 200."""
        return [review[:self._sequenceLength] for review in reviews]

    def _genElmoEmbedding(self):
        """
        Call dump_token_embeddings from the ELMo source code to build word-level vectors from the
        character-based representations and save them as an hdf5 file. The value under the
        "embedding" key is the vector representation of every word in the vocabulary file; these
        vectors are later used as the initial input to the BiLM.
        """
        dump_token_embeddings(
            self._vocabFile, self._optionFile, self._weightFile, self._tokenEmbeddingFile)

    def _genTrainEvalData(self, x, y, rate):
        """Split the data into a training set and an evaluation set."""
        y = [[item] for item in y]
        trainIndex = int(len(x) * rate)

        trainReviews = x[:trainIndex]
        trainLabels = y[:trainIndex]

        evalReviews = x[trainIndex:]
        evalLabels = y[trainIndex:]

        return trainReviews, trainLabels, evalReviews, evalLabels

    def dataGen(self):
        """Initialise the training and evaluation sets."""
        # read the dataset
        reviews, labels = self._readData(self._dataSource)
        # self._genVocabFile(reviews)   # generate vocabFile (only needs to run once)
        # self._genElmoEmbedding()      # generate elmo_token_embeddings (only needs to run once)
        reviews = self._fixedSeq(reviews)
        # build the training and evaluation sets
        trainReviews, trainLabels, evalReviews, evalLabels = self._genTrainEvalData(reviews, labels, self._rate)
        self.trainReviews = trainReviews
        self.trainLabels = trainLabels

        self.evalReviews = evalReviews
        self.evalLabels = evalLabels
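After _genVocabFile and _genElmoEmbedding have been run once (they are commented out in dataGen above because they only need to run once), a quick sanity check of the generated file might look like this (a sketch; the "embedding" key is the one mentioned above, and the path is the one from the config):

import h5py

# the hdf5 file produced by dump_token_embeddings stores one vector per vocabulary token
with h5py.File("../data/elmodata/elmo_token_embeddings.hdf5", "r") as f:
    print(f["embedding"].shape)  # (number of vocabulary tokens, token-embedding size)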

3.3 Building the model: mode_structure.py


# Author:yifan
# _*_ coding:utf-8 _*_
import tensorflow as tf
import parameter_config
config = parameter_config.Config()

# 3 Build the ELMo classification model
class ELMo(object):
    """Text classification model: ELMo vectors -> Bi-LSTM -> Attention -> sigmoid output."""
    def __init__(self, config):
        # model inputs
        self.inputX = tf.placeholder(tf.float32, [None, config.sequenceLength, config.model.embeddingSize], name="inputX")
        self.inputY = tf.placeholder(tf.float32, [None, 1], name="inputY")
        self.dropoutKeepProb = tf.placeholder(tf.float32, name="dropoutKeepProb")

        # l2 loss accumulator
        l2Loss = tf.constant(0.0)

        with tf.name_scope("embedding"):
            embeddingW = tf.get_variable(
                "embeddingW",
                shape=[config.model.embeddingSize, config.model.embeddingSize],
                initializer=tf.contrib.layers.xavier_initializer())
            reshapeInputX = tf.reshape(self.inputX, shape=[-1, config.model.embeddingSize])

            self.embeddedWords = tf.reshape(tf.matmul(reshapeInputX, embeddingW), shape=[-1, config.sequenceLength, config.model.embeddingSize])
            self.embeddedWords = tf.nn.dropout(self.embeddedWords, self.dropoutKeepProb)

        # stacked bidirectional LSTM layers
        with tf.name_scope("Bi-LSTM"):
            for idx, hiddenSize in enumerate(config.model.hiddenSizes):
                with tf.name_scope("Bi-LSTM" + str(idx)):
                    # forward LSTM cell
                    lstmFwCell = tf.nn.rnn_cell.DropoutWrapper(tf.nn.rnn_cell.LSTMCell(num_units=hiddenSize, state_is_tuple=True),
                                                               output_keep_prob=self.dropoutKeepProb)
                    # backward LSTM cell
                    lstmBwCell = tf.nn.rnn_cell.DropoutWrapper(tf.nn.rnn_cell.LSTMCell(num_units=hiddenSize, state_is_tuple=True),
                                                               output_keep_prob=self.dropoutKeepProb)

                    # dynamic rnn supports variable sequence lengths; if none is given, the full length is used
                    # outputs_ is a tuple (output_fw, output_bw); each element has shape [batch_size, max_time, hidden_size]
                    # (the forward and backward hidden sizes are the same)
                    # self.current_state is the final state, a tuple (state_fw, state_bw); state_fw = [batch_size, s], where s is a tuple (h, c)
                    outputs_, self.current_state = tf.nn.bidirectional_dynamic_rnn(lstmFwCell, lstmBwCell,
                                                                                   self.embeddedWords, dtype=tf.float32,
                                                                                   scope="bi-lstm" + str(idx))

                    # concatenate the fw and bw outputs -> [batch_size, time_step, hidden_size * 2], fed into the next Bi-LSTM layer
                    self.embeddedWords = tf.concat(outputs_, 2)

        # split the output of the last Bi-LSTM layer back into the forward and backward parts
        outputs = tf.split(self.embeddedWords, 2, -1)

        # as in the Bi-LSTM + Attention paper, add the forward and backward outputs
        with tf.name_scope("Attention"):
            H = outputs[0] + outputs[1]
            # attention output
            output = self._attention(H)
            outputSize = config.model.hiddenSizes[-1]

        # fully connected output layer
        with tf.name_scope("output"):
            outputW = tf.get_variable(
                "outputW",
                shape=[outputSize, 1],
                initializer=tf.contrib.layers.xavier_initializer())

            outputB = tf.Variable(tf.constant(0.1, shape=[1]), name="outputB")
            l2Loss += tf.nn.l2_loss(outputW)
            l2Loss += tf.nn.l2_loss(outputB)
            self.predictions = tf.nn.xw_plus_b(output, outputW, outputB, name="predictions")
            self.binaryPreds = tf.cast(tf.greater_equal(self.predictions, 0.0), tf.float32, name="binaryPreds")

        # binary cross-entropy loss
        with tf.name_scope("loss"):
            losses = tf.nn.sigmoid_cross_entropy_with_logits(logits=self.predictions, labels=self.inputY)
            self.loss = tf.reduce_mean(losses) + config.model.l2RegLambda * l2Loss

    def _attention(self, H):
        """Use an attention mechanism to obtain a sentence-level representation."""
        # number of units in the last LSTM layer
        hiddenSize = config.model.hiddenSizes[-1]

        # trainable attention weight vector
        W = tf.Variable(tf.random_normal([hiddenSize], stddev=0.1))

        # nonlinear transform of the Bi-LSTM output
        M = tf.tanh(H)

        # matrix product of M and W; M = [batch_size, time_step, hidden_size] is reshaped to [batch_size * time_step, hidden_size] first
        # newM = [batch_size * time_step, 1]: each time step's output vector is reduced to a single score
        newM = tf.matmul(tf.reshape(M, [-1, hiddenSize]), tf.reshape(W, [-1, 1]))

        # reshape newM to [batch_size, time_step]
        restoreM = tf.reshape(newM, [-1, config.sequenceLength])

        # softmax normalisation -> [batch_size, time_step]
        self.alpha = tf.nn.softmax(restoreM)

        # weighted sum of H using alpha, done directly as a matrix product
        r = tf.matmul(tf.transpose(H, [0, 2, 1]), tf.reshape(self.alpha, [-1, config.sequenceLength, 1]))

        # squeeze from 3 dims to 2: sequeezeR = [batch_size, hidden_size]
        sequeezeR = tf.squeeze(r)

        sentenceRepren = tf.tanh(sequeezeR)

        # apply dropout to the attention output
        output = tf.nn.dropout(sentenceRepren, self.dropoutKeepProb)

        return output
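For reference, the _attention method above implements (up to the final dropout) the usual formulation from the Bi-LSTM + Attention paper, where H sums the forward and backward LSTM outputs and w is the trainable weight vector:

M = \tanh(H), \qquad \alpha = \mathrm{softmax}(w^{\top} M), \qquad r = H\alpha^{\top}, \qquad h^{*} = \tanh(r)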

3.4 Training the model: mode_trainning.py


# Author:yifan
# _*_ coding:utf-8 _*_
import os
import datetime
import numpy as np
import tensorflow as tf
import parameter_config
import get_train_data
import mode_structure
from sklearn.metrics import roc_auc_score, accuracy_score, precision_score, recall_score
# TokenBatcher cannot be imported from bilm directly, because its internal open call
# has to be changed to: with open(filename, encoding="utf8") as f:
from data import TokenBatcher
from bilm import BidirectionalLanguageModel, weight_layers, dump_token_embeddings, Batcher

# get the config and data from the previous modules
config = parameter_config.Config()
data = get_train_data.Dataset(config)
data.dataGen()

# 4 generate batches
def nextBatch(x, y, batchSize):
    # generate batches, yielded by a generator
    # note: do not shuffle x and y separately (np.random.shuffle on each would break the pairing);
    # an alternative is to build one permutation with np.arange(len(x)) and index both arrays with it
    midVal = list(zip(x, y))
    np.random.shuffle(midVal)
    x, y = zip(*midVal)
    x = list(x)
    y = list(y)
    numBatches = len(x) // batchSize

    for i in range(numBatches):
        start = i * batchSize
        end = start + batchSize
        batchX = np.array(x[start: end])
        batchY = np.array(y[start: end])
        yield batchX, batchY

# 5 helpers for computing the performance metrics
def mean(item):
    return sum(item) / len(item)

def genMetrics(trueY, predY, binaryPredY):
    """Compute acc, auc, precision and recall."""
    auc = roc_auc_score(trueY, predY)
    accuracy = accuracy_score(trueY, binaryPredY)
    precision = precision_score(trueY, binaryPredY)
    recall = recall_score(trueY, binaryPredY)

    return round(accuracy, 4), round(auc, 4), round(precision, 4), round(recall, 4)

# 6 train the model
# training and evaluation sets
trainReviews = data.trainReviews
trainLabels = data.trainLabels
evalReviews = data.evalReviews
evalLabels = data.evalLabels

# build the computation graph
with tf.Graph().as_default():

    session_conf = tf.ConfigProto(allow_soft_placement=True, log_device_placement=False)
    session_conf.gpu_options.allow_growth = True
    session_conf.gpu_options.per_process_gpu_memory_fraction = 0.9  # cap the GPU memory fraction
    sess = tf.Session(config=session_conf)

    # the session
    with sess.as_default():
        elmoMode = mode_structure.ELMo(config)

        # Instantiate the BiLM object. It must live at this global level rather than inside the
        # elmo() function, otherwise the tensorflow nodes would be created repeatedly.
        with tf.variable_scope("bilm", reuse=True):
            bilm = BidirectionalLanguageModel(
                config.optionFile,
                config.weightFile,
                use_character_inputs=False,
                embedding_weight_file=config.tokenEmbeddingFile
            )
        inputData = tf.placeholder('int32', shape=(None, None))

        # call bilm's __call__ method to build the op
        inputEmbeddingsOp = bilm(inputData)

        # ELMo vector representation
        elmoInput = weight_layers('input', inputEmbeddingsOp, l2_coef=0.0)

        globalStep = tf.Variable(0, name="globalStep", trainable=False)
        # optimiser with the configured learning rate
        optimizer = tf.train.AdamOptimizer(config.training.learningRate)
        # compute gradients for all variables
        gradsAndVars = optimizer.compute_gradients(elmoMode.loss)
        # apply the gradients to build the training op
        trainOp = optimizer.apply_gradients(gradsAndVars, global_step=globalStep)

        # summaries for tensorBoard
        gradSummaries = []
        for g, v in gradsAndVars:
            if g is not None:
                tf.summary.histogram("{}/grad/hist".format(v.name), g)
                tf.summary.scalar("{}/grad/sparsity".format(v.name), tf.nn.zero_fraction(g))

        outDir = os.path.abspath(os.path.join(os.path.curdir, "summarys"))
        print("Writing to {}\n".format(outDir))

        lossSummary = tf.summary.scalar("loss", elmoMode.loss)
        summaryOp = tf.summary.merge_all()

        trainSummaryDir = os.path.join(outDir, "train")
        trainSummaryWriter = tf.summary.FileWriter(trainSummaryDir, sess.graph)

        evalSummaryDir = os.path.join(outDir, "eval")
        evalSummaryWriter = tf.summary.FileWriter(evalSummaryDir, sess.graph)

        # saver for all variables
        saver = tf.train.Saver(tf.global_variables(), max_to_keep=5)
        savedModelPath = "../model/ELMo/savedModel"
        if os.path.exists(savedModelPath):
            os.rmdir(savedModelPath)

        # one way to save the model: export a pb file
        builder = tf.saved_model.builder.SavedModelBuilder(savedModelPath)

        sess.run(tf.global_variables_initializer())

        def elmo(reviews):
            """Dynamically generate the ELMo word vectors for each input batch."""
            # tf.reset_default_graph()
            # TokenBatcher turns tokenised sentences into batches of token ids
            batcher = TokenBatcher(config.vocabFile)
            # build the batch of token ids
            inputDataIndex = batcher.batch_sentences(reviews)
            # run the graph to compute the ELMo vectors
            elmoInputVec = sess.run(
                [elmoInput['weighted_op']],
                feed_dict={inputData: inputDataIndex}
            )
            return elmoInputVec

        def trainStep(batchX, batchY):
            """One training step."""
            feed_dict = {
                elmoMode.inputX: elmo(batchX)[0],  # feed the dynamically generated ELMo vectors directly as inputX
                elmoMode.inputY: np.array(batchY, dtype="float32"),
                elmoMode.dropoutKeepProb: config.model.dropoutKeepProb
            }
            _, summary, step, loss, predictions, binaryPreds = sess.run(
                [trainOp, summaryOp, globalStep, elmoMode.loss, elmoMode.predictions, elmoMode.binaryPreds],
                feed_dict)
            timeStr = datetime.datetime.now().isoformat()
            acc, auc, precision, recall = genMetrics(batchY, predictions, binaryPreds)
            print("{}, step: {}, loss: {}, acc: {}, auc: {}, precision: {}, recall: {}".format(timeStr, step, loss, acc, auc, precision, recall))
            trainSummaryWriter.add_summary(summary, step)

        def devStep(batchX, batchY):
            """One evaluation step."""
            feed_dict = {
                elmoMode.inputX: elmo(batchX)[0],
                elmoMode.inputY: np.array(batchY, dtype="float32"),
                elmoMode.dropoutKeepProb: 1.0
            }
            summary, step, loss, predictions, binaryPreds = sess.run(
                [summaryOp, globalStep, elmoMode.loss, elmoMode.predictions, elmoMode.binaryPreds],
                feed_dict)

            acc, auc, precision, recall = genMetrics(batchY, predictions, binaryPreds)

            evalSummaryWriter.add_summary(summary, step)

            return loss, acc, auc, precision, recall

        for i in range(config.training.epoches):
            # train the model
            print("start training model")
            for batchTrain in nextBatch(trainReviews, trainLabels, config.batchSize):
                trainStep(batchTrain[0], batchTrain[1])

                currentStep = tf.train.global_step(sess, globalStep)
                if currentStep % config.training.evaluateEvery == 0:
                    print("\nEvaluation:")

                    losses = []
                    accs = []
                    aucs = []
                    precisions = []
                    recalls = []

                    for batchEval in nextBatch(evalReviews, evalLabels, config.batchSize):
                        loss, acc, auc, precision, recall = devStep(batchEval[0], batchEval[1])
                        losses.append(loss)
                        accs.append(acc)
                        aucs.append(auc)
                        precisions.append(precision)
                        recalls.append(recall)

                    time_str = datetime.datetime.now().isoformat()
                    print("{}, step: {}, loss: {}, acc: {}, auc: {}, precision: {}, recall: {}".format(time_str, currentStep, mean(losses),
                                                                                                       mean(accs), mean(aucs), mean(precisions),
                                                                                                       mean(recalls)))

                if currentStep % config.training.checkpointEvery == 0:
                    # the other way to save the model: checkpoint files
                    path = saver.save(sess, "../model/ELMo/model/my-model", global_step=currentStep)
                    print("Saved model checkpoint to {}\n".format(path))

        inputs = {"inputX": tf.saved_model.utils.build_tensor_info(elmoMode.inputX),
                  "keepProb": tf.saved_model.utils.build_tensor_info(elmoMode.dropoutKeepProb)}

        outputs = {"binaryPreds": tf.saved_model.utils.build_tensor_info(elmoMode.binaryPreds)}

        prediction_signature = tf.saved_model.signature_def_utils.build_signature_def(inputs=inputs, outputs=outputs,
                                                                                      method_name=tf.saved_model.signature_constants.PREDICT_METHOD_NAME)
        legacy_init_op = tf.group(tf.tables_initializer(), name="legacy_init_op")
        builder.add_meta_graph_and_variables(sess, [tf.saved_model.tag_constants.SERVING],
                                             signature_def_map={"predict": prediction_signature}, legacy_init_op=legacy_init_op)

        builder.save()
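As a usage note, the exported SavedModel can later be loaded for inference roughly as follows (a minimal sketch, not part of the original post; the tensor names come from the placeholders defined in mode_structure.py, and inputX expects ELMo vectors, so new reviews must first be passed through the same elmo() pipeline and vocabulary):

import tensorflow as tf

with tf.Session(graph=tf.Graph()) as sess:
    # load the pb model exported by the SavedModelBuilder above
    tf.saved_model.loader.load(sess, [tf.saved_model.tag_constants.SERVING], "../model/ELMo/savedModel")
    graph = sess.graph
    inputX = graph.get_tensor_by_name("inputX:0")
    dropoutKeepProb = graph.get_tensor_by_name("dropoutKeepProb:0")
    binaryPreds = graph.get_tensor_by_name("binaryPreds:0")
    # preds = sess.run(binaryPreds, feed_dict={inputX: elmoVectors, dropoutKeepProb: 1.0})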

Training results:

(The screenshot of the training log is not reproduced in this text version.)

The full code is available at: https://github.com/yifanhunter/NLP_textClassifier

Main reference:

[1] https://home.cnblogs.com/u/jiangxinyang/
