Keras Data Preprocessing: Converting Text to Vectors & Text Preprocessing (In-Depth Guide)

Posted 2021-10-22 ZSYL

Converting text to vectors & text preprocessing

Worked examples
Module details

Worked examples

from keras.preprocessing.text import Tokenizer  # text tokenization and vectorization
from keras.preprocessing import sequence  # sequence length normalization (padding)
 
text1 = "学习keras的Tokenizer"
text2 = "就是这么简单"
texts = [text1, text2]
 
"""
# num_words: how many words are used to build the vocabulary.
# char_level: if True, every character will be treated as a token.
# oov_token: the out-of-vocabulary token, used in place of anything missing from the vocabulary.
"""
tokenizer = Tokenizer(num_words=5000, char_level=True, oov_token='UNK')
tokenizer.fit_on_texts(texts)
 
# The vocabulary can also be set manually
# tokenizer.word_index = {'UNK': 1, '学': 2, '习': 3}
 
# How many times each token occurs
print(tokenizer.word_counts)
# How many documents each token appears in
print(tokenizer.word_docs)
# Number of documents processed
print(tokenizer.document_count)
# Index assigned to each token (dict mapping)
print(tokenizer.word_index)
# mode: one of 'binary', 'count', 'tfidf', 'freq'; defaults to 'binary'
# Returns a numpy array of shape (len(texts), num_words)
print(tokenizer.texts_to_matrix(texts))
# List of integer sequences
print(tokenizer.texts_to_sequences(texts))
texts = tokenizer.texts_to_sequences(texts)
texts = sequence.pad_sequences(texts, maxlen=30, padding='post', truncating='post')
print(texts)

OrderedDict([('学', 1), ('习', 1), ('k', 2), ('e', 3), ('r', 2), ('a', 1), ('s', 1), ('的', 1), ('t', 1), ('o', 1), ('n', 1), ('i', 1), ('z', 1), ('就', 1), ('是', 1), ('这', 1), ('么', 1), ('简', 1), ('单', 1)])
defaultdict(<class 'int'>, {'t': 1, 'o': 1, 'r': 1, 'n': 1, '习': 1, '学': 1, 's': 1, 'k': 1, 'a': 1, 'z': 1, 'e': 1, '的': 1, 'i': 1, '么': 1, '这': 1, '简': 1, '就': 1, '单': 1, '是': 1})
2
{'UNK': 1, 'e': 2, 'k': 3, 'r': 4, '学': 5, '习': 6, 'a': 7, 's': 8, '的': 9, 't': 10, 'o': 11, 'n': 12, 'i': 13, 'z': 14, '就': 15, '是': 16, '这': 17, '么': 18, '简': 19, '单': 20}
[[0. 0. 1. ... 0. 0. 0.]
 [0. 0. 0. ... 0. 0. 0.]]
[[5, 6, 3, 2, 4, 7, 8, 9, 10, 11, 3, 2, 12, 13, 14, 2, 4], [15, 16, 17, 18, 19, 20]]
[[ 5  6  3  2  4  7  8  9 10 11  3  2 12 13 14  2  4  0  0  0  0  0  0  0
   0  0  0  0  0  0]
 [15 16 17 18 19 20  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0
   0  0  0  0  0  0]]
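
The index assignment in the output above can be reproduced with a few lines of plain Python. This is only a sketch of the ordering rule as it appears in that output (the OOV token first, then characters by descending frequency, ties in first-seen order), not the Keras implementation; `build_char_index` and `encode` are hypothetical helper names introduced for illustration.

```python
from collections import Counter

def build_char_index(texts, oov_token="UNK"):
    """Map each character to an integer id (hypothetical helper, not a Keras function)."""
    counts = Counter()
    for text in texts:
        counts.update(text.lower())  # Tokenizer lowercases by default
    # Stable sort by descending count; ties keep first-seen order.
    ordered = sorted(counts.items(), key=lambda kv: kv[1], reverse=True)
    vocab = [oov_token] + [ch for ch, _ in ordered]
    # Ids start at 1, which leaves 0 free for padding.
    return {ch: i for i, ch in enumerate(vocab, start=1)}

def encode(text, index, oov_token="UNK"):
    """Replace every character by its id; unseen characters get the OOV id."""
    return [index.get(ch, index[oov_token]) for ch in text.lower()]

texts = ["学习keras的Tokenizer", "就是这么简单"]
char_index = build_char_index(texts)
print(encode(texts[0], char_index))
# [5, 6, 3, 2, 4, 7, 8, 9, 10, 11, 3, 2, 12, 13, 14, 2, 4]
print(encode("深度学习", char_index))
# [1, 1, 5, 6]
```

The second call shows what oov_token is for: '深' and '度' never appeared during fitting, so both map to the id of 'UNK'.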

from keras.preprocessing.text import text_to_word_sequence
from keras.preprocessing.text import Tokenizer
from keras.preprocessing.sequence import pad_sequences
 
text1 = "今天 北京 下 暴雨 了"
text2 = "我 今天 打车 回家"
texts = [text1, text2]
 
print(text_to_word_sequence(text1))  # split the text on whitespace
# ['今天', '北京', '下', '暴雨', '了']
 
tokenizer = Tokenizer(num_words=5000)
tokenizer.fit_on_texts(texts)
print(tokenizer.document_count) # number of documents processed
# 2
print(tokenizer.word_counts) # word-frequency dict, in order of first appearance
# OrderedDict([('今天', 2), ('北京', 1), ('下', 1), ('暴雨', 1), ('了', 1), ('我', 1), ('打车', 1), ('回家', 1)])
print(tokenizer.word_docs) # number of documents each word appears in
# {'了': 1, '暴雨': 1, '北京': 1, '下': 1, '今天': 2, '打车': 1, '回家': 1, '我': 1}
print(tokenizer.word_index) # unique id for each word
# {'今天': 1, '北京': 2, '下': 3, '暴雨': 4, '了': 5, '我': 6, '打车': 7, '回家': 8}
print(tokenizer.index_docs) # number of documents each word id appears in
# {5: 1, 4: 1, 2: 1, 3: 1, 1: 2, 7: 1, 8: 1, 6: 1}
print(tokenizer.texts_to_matrix(texts))
# [[0. 1. 1. ... 0. 0. 0.]
# [0. 1. 0. ... 0. 0. 0.]]
# shape = (2, 5000)
print(tokenizer.texts_to_sequences(texts))
# [[1, 2, 3, 4, 5],
#  [6, 1, 7, 8] ] 
 
 
# Pad sequences to length maxlen
print(pad_sequences([[1,2,3],[4,5,6]],maxlen=10,padding='pre')) # pad at the start of each sequence
# [[0 0 0 0 0 0 0 1 2 3]
# [0 0 0 0 0 0 0 4 5 6]]
print(pad_sequences([[1,2,3],[4,5,6]],maxlen=10,padding='post')) # pad at the end of each sequence
# [[1 2 3 0 0 0 0 0 0 0]
# [4 5 6 0 0 0 0 0 0 0]]
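
To make the texts_to_matrix output above less opaque, here is a rough plain-Python sketch of the 'binary' and 'count' modes. `sequences_to_matrix` here is a hypothetical stand-alone helper written for this illustration, and `num_words=10` is an arbitrary small width chosen so that whole rows fit on screen.

```python
def sequences_to_matrix(sequences, num_words, mode="binary"):
    """Turn id sequences into fixed-width rows (illustrative helper, not the Keras API)."""
    matrix = [[0.0] * num_words for _ in sequences]
    for row, seq in zip(matrix, sequences):
        for idx in seq:
            if idx >= num_words:
                continue  # ids beyond the vocabulary limit are dropped
            if mode == "count":
                row[idx] += 1.0  # how many times the word occurs
            else:
                row[idx] = 1.0   # "binary": mark presence only
    return matrix

sequences = [[1, 2, 3, 4, 5], [6, 1, 7, 8]]  # ids from the example above
for row in sequences_to_matrix(sequences, num_words=10):
    print(row)
# [0.0, 1.0, 1.0, 1.0, 1.0, 1.0, 0.0, 0.0, 0.0, 0.0]
# [0.0, 1.0, 0.0, 0.0, 0.0, 0.0, 1.0, 1.0, 1.0, 0.0]
```

Column 0 is always zero because word ids start at 1. Each row has the same width regardless of sentence length, which is why this representation needs no padding.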

Module details

Padding data with pad_sequences
from keras.preprocessing.sequence import pad_sequences

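keras.preprocessing.sequence.pad_sequences(sequences, maxlen=None, dtype='int32', padding='pre', truncating='pre', value=0.)

maxlen sets the maximum sequence length: longer sequences are truncated and shorter ones are padded.

Why normalize sequence lengths?

RNNs are prone to vanishing and exploding gradients during backpropagation, so in practice they are fed sequences of bounded length.

To keep the implementation simple, Keras expects the sequences in a batch to share one length, so that they can be stacked into a single array.

So when the sequences have uneven lengths, use pad_sequences().

The function returns a new, padded copy of the sequences.

For example, whether you need the padding function depends on how the text is split. If you split a text into sentences at each '。', the sentences come out with different lengths, and the function is needed to bring them to a common length. A smarter approach is to cut the text by character count, so that every piece already has the same length and no padding is required.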
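
The two splitting strategies can be compared directly. In this sketch the sample text is made up for illustration, and `chunk_size = 4` is an arbitrary choice.

```python
text = "今天北京下暴雨了。我今天打车回家。路上很堵。"  # made-up sample text

# Splitting on the full stop gives sentences of different lengths.
sentences = [s for s in text.split("。") if s]
print([len(s) for s in sentences])
# [8, 7, 4]

# Cutting every chunk_size characters gives equal-length pieces
# (only the final piece can be shorter).
chunk_size = 4  # arbitrary value chosen for illustration
chunks = [text[i:i + chunk_size] for i in range(0, len(text), chunk_size)]
print([len(c) for c in chunks])
# [4, 4, 4, 4, 4, 2]
```

The trade-off is that fixed-size chunks ignore sentence boundaries, so a chunk may start or end in the middle of a sentence.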

Finally, convert the tokens into distributed representations (embeddings) before feeding them to the RNN.
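
Conceptually, a distributed representation is a table lookup: each token id selects one row of floats. The sketch below uses random numbers in place of learned weights, and the sizes are arbitrary assumptions. In a real Keras model this role is normally played by a trainable layer such as keras.layers.Embedding.

```python
import random

random.seed(0)  # fixed seed so the sketch is repeatable
vocab_size, embed_dim = 9, 4  # 8 words plus id 0 for padding; dimension is arbitrary

# One row of floats per token id; row 0 (padding) is kept at zeros.
embedding_table = [[0.0] * embed_dim] + [
    [random.uniform(-0.05, 0.05) for _ in range(embed_dim)]
    for _ in range(vocab_size - 1)
]

padded_ids = [6, 1, 7, 8, 0, 0]  # a padded id sequence of length 6
vectors = [embedding_table[i] for i in padded_ids]
print(len(vectors), len(vectors[0]))
# 6 4
```

Each padded sequence of length 6 therefore becomes a 6 x 4 grid of floats, which is the shape of input an RNN layer consumes per sample.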

Signature:

keras.preprocessing.sequence.pad_sequences(sequences, maxlen=None, dtype='int32', padding='pre', truncating='pre', value=0.)

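Description:

Converts a list of nb_samples sequences (each a sequence of scalars) into a 2D numpy array of shape (nb_samples, nb_timesteps).

If maxlen is given, nb_timesteps = maxlen; otherwise it is the length of the longest sequence. Sequences shorter than nb_timesteps are padded with value (0 by default) until they reach that length, and sequences longer than nb_timesteps are truncated to fit. Where padding and truncation happen is controlled by padding and truncating respectively; both default to 'pre', that is, the start of the sequence.

Arguments:

sequences: a two-level nested list of floats or integers
maxlen: None or an integer, the maximum sequence length. Longer sequences are truncated; shorter ones are padded with value.
dtype: data type of the returned numpy array
padding: 'pre' or 'post', whether to pad at the start or the end of each sequence
truncating: 'pre' or 'post', whether to remove values from the start or the end of a sequence that is too long
value: float, the fill value used for padding (default 0)
Returns: a 2D numpy array of shape (nb_samples, nb_timesteps)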
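
The interaction of padding and truncating is easiest to see in a small re-implementation. The `pad` function below is a hypothetical plain-Python sketch of the rule described above, returning nested lists rather than a numpy array; it is not the Keras source.

```python
def pad(sequences, maxlen=None, padding="pre", truncating="pre", value=0):
    """Plain-Python sketch of the padding/truncation rule (illustrative helper)."""
    if maxlen is None:
        maxlen = max(len(seq) for seq in sequences)
    padded = []
    for seq in sequences:
        seq = list(seq)
        if len(seq) > maxlen:
            # 'pre' drops items from the start, 'post' drops them from the end.
            seq = seq[len(seq) - maxlen:] if truncating == "pre" else seq[:maxlen]
        fill = [value] * (maxlen - len(seq))
        padded.append(fill + seq if padding == "pre" else seq + fill)
    return padded

data = [[1, 2, 3], [4, 5, 6, 7, 8]]
print(pad(data, maxlen=4))
# [[0, 1, 2, 3], [5, 6, 7, 8]]
print(pad(data, maxlen=4, padding="post", truncating="post"))
# [[1, 2, 3, 0], [4, 5, 6, 7]]
```

With the defaults, the zeros go in front and the oldest items are dropped, so the end of each sequence is what survives. Switching both options to 'post' keeps the beginning instead.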

Example:

from tensorflow.keras.preprocessing.sequence import pad_sequences

a = [[1, 2, 3], [4, 5, 6, 7]]

# Pad at the front so that both sequences have length 4
bs_packed = pad_sequences(a, maxlen=4, padding='pre', truncating='pre', value=0)

print(bs_packed)
# [[0 1 2 3]
#  [4 5 6 7]]