How to convert predicted sequence back to text in keras?
Posted: 2017-06-17 17:11:42

I have a sequence-to-sequence learning model which works fine and is able to predict some output. The problem is that I have no idea how to convert the output back to a text sequence.
Here is my code.
from keras.preprocessing.text import Tokenizer,base_filter
from keras.preprocessing.sequence import pad_sequences
from keras.models import Sequential
from keras.layers import Dense
txt1="""What makes this problem difficult is that the sequences can vary in length,
be comprised of a very large vocabulary of input symbols and may require the model
to learn the long term context or dependencies between symbols in the input sequence."""
#txt1 is used for fitting
tk = Tokenizer(nb_words=2000, filters=base_filter(), lower=True, split=" ")
tk.fit_on_texts(txt1)
#convert text to sequence
t= tk.texts_to_sequences(txt1)
#padding to feed the sequence to keras model
t=pad_sequences(t, maxlen=10)
model = Sequential()
model.add(Dense(10,input_dim=10))
model.add(Dense(10,activation='softmax'))
model.compile(loss='categorical_crossentropy', optimizer='adam',metrics=['accuracy'])
#predicting new sequence
pred=model.predict(t)
#Convert predicted sequence to text
pred=??
Comments:
Still no answer? — @BenUsman Did you find a way to solve this problem? I am having the same issue. — @TVH7 See the posted answers. — @Eka Maybe you should accept one of the answers to close the post.

Answer 1: You can directly use the inverse function tokenizer.sequences_to_texts:
text = tokenizer.sequences_to_texts(<list-of-integer-equivalent-encodings>)
I have tested the above and it works as expected.
PS: Pay special attention to passing the argument as a list of integer encodings, not a list of one-hot encodings.
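A minimal round-trip sketch (the sample sentences are invented here, and this assumes a Keras version where Tokenizer.sequences_to_texts is available):

from keras.preprocessing.text import Tokenizer
from keras.preprocessing.sequence import pad_sequences

texts = ["the quick brown fox", "jumped over the lazy dog"]  # made-up sample sentences

tokenizer = Tokenizer()
tokenizer.fit_on_texts(texts)  # fit on a list of strings, not a single string

# Encode, pad, then strip the padding again before decoding:
# index 0 is reserved for padding and has no entry in the word index.
seqs = tokenizer.texts_to_sequences(texts)
padded = pad_sequences(seqs, maxlen=10)
unpadded = [[i for i in seq if i != 0] for seq in padded]
print(tokenizer.sequences_to_texts(unpadded))
# ['the quick brown fox', 'jumped over the lazy dog']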
Comments:
This seems like the most direct answer; if you need to see what it does, try the following line: print(tokenizer.sequences_to_texts([[1]]))
Before running sequences_to_texts, be sure to strip the padding (i.e. remove the padding encodings that were used) as well as any boolean encodings from the <list-of-integer-equivalent-encodings>.

Answer 2:
Here is the solution I found:
reverse_word_map = dict(map(reversed, tokenizer.word_index.items()))
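For instance, decoding one made-up padded sequence with this map (skipping index 0, which never appears in word_index):

seq = [0, 0, 3, 7, 1]  # hypothetical padded sequence
words = [reverse_word_map[i] for i in seq if i in reverse_word_map]
text = ' '.join(words)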
Answer 3: I had to solve the same problem, so this is what I ended up doing (inspired by @Ben Usman's reversed dictionary).
# Importing library
from keras.preprocessing.text import Tokenizer
# My texts
texts = ['These are two crazy sentences', 'that I want to convert back and forth']
# Creating a tokenizer
tokenizer = Tokenizer(lower=True)
# Building word indices
tokenizer.fit_on_texts(texts)
# Tokenizing sentences
sentences = tokenizer.texts_to_sequences(texts)
>sentences
>[[1, 2, 3, 4, 5], [6, 7, 8, 9, 10, 11, 12, 13]]
# Creating a reverse dictionary
reverse_word_map = dict(map(reversed, tokenizer.word_index.items()))
# Function takes a tokenized sentence and returns the words
def sequence_to_text(list_of_indices):
    # Looking up words in dictionary
    words = [reverse_word_map.get(letter) for letter in list_of_indices]
    return words
# Creating texts
my_texts = list(map(sequence_to_text, sentences))
>my_texts
>[['these', 'are', 'two', 'crazy', 'sentences'], ['that', 'i', 'want', 'to', 'convert', 'back', 'and', 'forth']]
Comments:
Just an alternative snippet for reversing the word_index: reverse_word_index = dict([(value, key) for (key, value) in word_index.items()])
Answer 4:
You can make a dictionary that maps the indices back to characters.
index_word = {v: k for k, v in tk.word_index.items()}  # map back
seqs = tk.texts_to_sequences(txt1)
words = []
for seq in seqs:
    if len(seq):
        words.append(index_word.get(seq[0]))
    else:
        words.append(' ')
print(''.join(words))  # output
>>> 'what makes this problem difficult is that the sequences can vary in length
>>> be comprised of a very large vocabulary of input symbols and may require the model
>>> to learn the long term context or dependencies between symbols in the input sequence '
However, in the question you are trying to use a sequence of characters to predict an output of 10 classes, which is not a sequence-to-sequence model. In that case you cannot simply turn the prediction (or pred.argmax(axis=1)) back into a character sequence.
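If the model really is a plain 10-class classifier, the most you can do is map each predicted class index to a label of your own; a short sketch with invented label names:

class_labels = ['class_%d' % i for i in range(10)]  # hypothetical label names
pred_classes = pred.argmax(axis=1)                  # pred comes from model.predict(t)
predicted_labels = [class_labels[c] for c in pred_classes]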
Answer 5:
import numpy as np  # needed for np.where and np.random.choice

p_test = model.predict(data_test).argmax(axis=1)

# Show some misclassified examples
misclassified_idx = np.where(p_test != Ytest)[0]
print(len(misclassified_idx))  # how many examples were misclassified
i = np.random.choice(misclassified_idx)
print(i)
print(df_test[i])
print('True label %s Predicted label %s' % (Ytest[i], p_test[i]))
Here df_test is the original text and data_test is the sequence of integers.
Comments:
Please be sure to describe the code you post.