Huggingface Bert: Output Printing
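【Title】: Huggingface Bert: Output Printing
【Posted】: 2020-09-24 03:47:00
【Question】: I am new to coding and could use some guidance on why this prints so strangely. Although the problem is related to NLP, I believe it can most likely be explained by someone with more coding knowledge than I have. I hope this is the right place to ask. Thanks for your help!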
from transformers import AutoTokenizer, AutoModelWithLMHead
import torch

tokenizer = AutoTokenizer.from_pretrained("bert-large-cased-whole-word-masking")
model = AutoModelWithLMHead.from_pretrained("bert-large-cased-whole-word-masking")

sentence = """While United States [MASK] heed human rights,"""

token_ids = tokenizer.encode(sentence, return_tensors='pt')
# print(token_ids)
token_ids_tk = tokenizer.tokenize(sentence, return_tensors='pt')
print(token_ids_tk)

masked_position = (token_ids.squeeze() == tokenizer.mask_token_id).nonzero()
masked_pos = [mask.item() for mask in masked_position]
print(masked_pos)

with torch.no_grad():
    output = model(token_ids)

last_hidden_state = output[0].squeeze()

print("\n\n")
print("sentence :", sentence)
print("\n")

list_of_list = []
for mask_index in masked_pos:
    mask_hidden_state = last_hidden_state[mask_index]
    idx = torch.topk(mask_hidden_state, k=25, dim=0)[1]
    words = [tokenizer.decode(i.item()).strip() for i in idx]
    list_of_list.append(words)
    print(words)

best_guess = ""
for j in list_of_list:
    best_guess = best_guess + " " + j[0]

print("\nBest guess for fill in the blank :::", best_guess)
Output:
['While', 'United', 'States', '[MASK]', 'he', '##ed', 'human', 'rights', ',']
[4]
sentence : While United States [MASK] heed human rights,
['m u s t', 'c i t i z e n s', 's h o u l d', 'c a n n o t', 'l a w s', 'd o e s', 'g e n e r a l l y', 'd i d', 'a l w a y s', 'l a w', ',', 'g o v e r n m e n t', 'd o', 'p o l i t i c i a n s', 'm a y', 'd e f e n d e r s', 'c o u n t r i e s', 'c a n', 'o f f i c i a l s', 'g o v e r n m e n t s', 'w i l l', 'G o v e r n m e n t', 'v a l u e s', 'C o n s t i t u t i o n', 'p e o p l e']
Best guess for fill in the blank ::: m u s t
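One detail in this output is easy to misread: `[MASK]` sits at index 3 of the printed token list, yet the mask position is reported as `[4]`. That is consistent with `tokenizer.encode` adding special tokens around the sentence by default, while `tokenizer.tokenize` does not. A minimal sketch in plain Python, where the token list is copied from the output above and the `[CLS]`/`[SEP]` wrapping is written by hand as an assumption about what `encode` does for this BERT tokenizer:

```python
# Tokens as printed by tokenizer.tokenize(sentence) in the output above.
tokens = ['While', 'United', 'States', '[MASK]', 'he', '##ed', 'human', 'rights', ',']

# Assumption: encode() wraps the sequence as [CLS] ... [SEP],
# which shifts every token index by one.
encoded_tokens = ['[CLS]'] + tokens + ['[SEP]']


def find_positions(sequence, target):
    """Return every index at which `target` occurs in `sequence`."""
    return [index for index, item in enumerate(sequence) if item == target]


print(find_positions(tokens, '[MASK]'))          # [3]
print(find_positions(encoded_tokens, '[MASK]'))  # [4]
```

`find_positions` is a hypothetical helper that plays the role of the `(token_ids.squeeze() == tokenizer.mask_token_id).nonzero()` lookup in the question, without needing the model or tokenizer.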
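【Comments】:
Can you show the output of j? Also, BERT uses a WordPiece tokenizer, which works very differently from ordinary tools.
【Answer 1】: The first thing you need to understand is the tokenized output that BERT gives.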
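If you look at the output, each decoded word is already split into characters separated by spaces (I have added some print statements that make this clear).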
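One plausible way such spacing can arise, offered here as an assumption rather than something established in this thread: a routine that joins a sequence of tokens with spaces will, if handed a single string instead of a list, iterate over that string character by character. The sketch below uses a hypothetical helper `join_pieces` (not part of transformers) to show the effect in plain Python:

```python
def join_pieces(pieces):
    """Join pieces with single spaces (hypothetical helper, not a transformers API)."""
    return " ".join(piece for piece in pieces)


# A list holding one token is joined as a whole word.
print(join_pieces(["must"]))  # must

# A bare string is itself iterable, so each character becomes its own piece.
print(join_pieces("must"))    # m u s t
```

If something like this happens inside `decode` in the installed version, passing a one-element list, as in `tokenizer.decode([i.item()])`, may avoid the spacing without any string cleanup. That has not been verified in this thread.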
If you just want clean output, change the lines where I have added comments:
# Notebook shell command; outside a notebook, run "pip3 install transformers" in a terminal.
!pip3 install transformers

# Note: newer transformers releases deprecate AutoModelWithLMHead
# in favor of AutoModelForMaskedLM for BERT-style models.
from transformers import AutoTokenizer, AutoModelWithLMHead
import torch

tokenizer = AutoTokenizer.from_pretrained("bert-large-cased-whole-word-masking")
model = AutoModelWithLMHead.from_pretrained("bert-large-cased-whole-word-masking")

sentence = """While United States [MASK] heed human rights,"""

token_ids = tokenizer.encode(sentence, return_tensors='pt')
# print(token_ids)
token_ids_tk = tokenizer.tokenize(sentence, return_tensors='pt')
print(token_ids_tk)

# Positions of every [MASK] token in the encoded input
masked_position = (token_ids.squeeze() == tokenizer.mask_token_id).nonzero()
masked_pos = [mask.item() for mask in masked_position]
print(masked_pos)

with torch.no_grad():
    output = model(token_ids)

# Despite the name, for an LM-head model output[0] holds the
# prediction scores over the vocabulary for each input position.
last_hidden_state = output[0].squeeze()

print("\n\n")
print("sentence :", sentence)
print("\n")

list_of_list = []
for mask_index in masked_pos:
    mask_hidden_state = last_hidden_state[mask_index]
    # topk returns (values, indices); [1] keeps the 25 highest-scoring token ids
    idx = torch.topk(mask_hidden_state, k=25, dim=0)[1]
    for i in idx:
        print(i, tokenizer.decode(i.item()).strip())
    words = [tokenizer.decode(i.item()).strip().replace(" ", "") for i in idx]  ## REMOVING ANY SPACES ADDED WHILE COMBINING TOKENS
    list_of_list.append(words)
    print("WORDS", words)

print('list_of_list', list_of_list)

best_guess = ""
for j in list_of_list:
    print('j', j)
    best_guess = (best_guess + " " + j[0]).strip()  ## ADD THIS TO REMOVE THE LEADING SPACE

print("\nBest guess for fill in the blank :::", best_guess)
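The two commented changes can be isolated from the model and checked on their own. A minimal sketch, assuming the decoded strings look like the ones in the question's output; `clean_piece` and `first_choices` are hypothetical names introduced only for illustration:

```python
def clean_piece(decoded):
    """Collapse a character-spaced string such as 'm u s t' back into 'must'."""
    return decoded.strip().replace(" ", "")


def first_choices(list_of_list):
    """Join the top candidate for each mask, with no leading or trailing space."""
    return " ".join(words[0] for words in list_of_list)


# A few decoded strings copied from the question's output.
raw = ['m u s t', 'c i t i z e n s', 's h o u l d', ',']
words = [clean_piece(piece) for piece in raw]

print(words)                   # ['must', 'citizens', 'should', ',']
print(first_choices([words]))  # must
```

Note that `replace(" ", "")` removes every space, which suits single-token predictions but would also merge words if a decoded string legitimately contained more than one. Using `" ".join(...)` sidesteps the leading space that the accumulate-then-strip pattern has to remove. Another option, untested here, is `tokenizer.convert_ids_to_tokens(idx.tolist())`, which returns the vocabulary strings for the ids directly.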