How to get a probability distribution over tokens in a huggingface model?

Posted: 2022-01-14 20:19:22

【Question】

I'm following this tutorial on how to make predictions for masked words. The reason I'm using it is that it seems to handle several masked words at once, whereas the other approaches I tried could only work with one masked word at a time.

Code:

from transformers import RobertaTokenizer, RobertaForMaskedLM
import torch

tokenizer = RobertaTokenizer.from_pretrained('roberta-base')
model = RobertaForMaskedLM.from_pretrained('roberta-base')

sentence = "Tom has fully ___ ___ ___ illness."


def get_prediction(sent):
    # Encode the sentence and find the positions of the <mask> tokens
    token_ids = tokenizer.encode(sent, return_tensors='pt')
    masked_position = (token_ids.squeeze() == tokenizer.mask_token_id).nonzero()
    masked_pos = [mask.item() for mask in masked_position]

    with torch.no_grad():
        output = model(token_ids)

    # output[0] holds the logits over the vocabulary for every position
    last_hidden_state = output[0].squeeze()

    list_of_list = []
    for index, mask_index in enumerate(masked_pos):
        # Take the logits at this mask position and keep the top-5 token ids
        mask_hidden_state = last_hidden_state[mask_index]
        idx = torch.topk(mask_hidden_state, k=5, dim=0)[1]
        words = [tokenizer.decode(i.item()).strip() for i in idx]
        list_of_list.append(words)
        print("Mask", index + 1, "Guesses:", words)

    # Join the top guess for each mask into a single string
    best_guess = ""
    for j in list_of_list:
        best_guess = best_guess + " " + j[0]

    return best_guess


print ("Original Sentence: ",sentence)
sentence = sentence.replace("___","<mask>")
print ("Original Sentence replaced with mask: ",sentence)
print ("\n")

predicted_blanks = get_prediction(sentence)
print ("\nBest guess for fill in the blank :::",predicted_blanks)

How can I get a probability distribution over the 5 tokens instead of just their indices? That is, something like this approach (which I used before, but which gave me an error once I switched to multiple mask tokens), where the score is part of the output:

from transformers import pipeline

# Initialize MLM pipeline
mlm = pipeline('fill-mask')

# Get mask token
mask = mlm.tokenizer.mask_token

# Get result for particular masked phrase
phrase = f'Read the rest of this {mask} to understand things in more detail'
result = mlm(phrase)

# Print result
print(result)

[{
    'sequence': 'Read the rest of this article to understand things in more detail',
    'score': 0.35419148206710815,
    'token': 1566,
    'token_str': ' article'
}, ...]
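For what it's worth, whether the pipeline accepts several masks depends on the transformers version: older releases raise an error for more than one mask token, while recent ones return one candidate list per mask position. A hedged sketch of the multi-mask case (the top_k argument caps the candidates per mask):

from transformers import pipeline

mlm = pipeline('fill-mask', model='roberta-base')
mask = mlm.tokenizer.mask_token

# Recent transformers versions return one list of candidate dicts per
# mask position; older versions raise an error for multiple masks.
phrase = f'Tom has fully {mask} {mask} {mask} illness.'
for i, candidates in enumerate(mlm(phrase, top_k=5)):
    print(f"Mask {i + 1}:")
    for c in candidates:
        print(f"  {c['token_str']!r}: {c['score']:.4f}")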

【Comments】

【Answer 1】

The variable last_hidden_state[mask_index] holds the logits for the prediction at that masked position. So to get token probabilities, you can apply a softmax over it, i.e.

probs = torch.nn.functional.softmax(last_hidden_state[mask_index], dim=-1)

You can then get the probabilities of the top-k tokens with

word_probs = [probs[i] for i in idx]
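Equivalently, since softmax preserves the ordering of the logits, you can run torch.topk on the probabilities directly and get the top scores and their indices in one call. A minimal sketch reusing mask_hidden_state and tokenizer from the question's code:

probs = torch.nn.functional.softmax(mask_hidden_state, dim=-1)
top = torch.topk(probs, k=5)

# top.values are the probabilities, top.indices the token ids
words = [tokenizer.decode(i.item()).strip() for i in top.indices]
print(list(zip(words, top.values.tolist())))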

PS: I assume you know that you should use <mask> rather than ___, i.e. sent = "Tom has fully <mask> <mask> <mask> illness.". With that, I get the following:

Mask 1 Guesses: ['recovered', 'returned', 'cleared', 'recover', 'healed']

[tensor(0.9970), tensor(0.0007), tensor(0.0003), tensor(0.0003), tensor(0.0002)]

Mask 2 Guesses: ['from', 'his', 'with', 'to', 'the']

[tensor(0.5066), tensor(0.2048), tensor(0.0684), tensor(0.0513), tensor(0.0399)]

Mask 3 Guesses: ['his', 'the', 'mental', 'serious', 'this']

[tensor(0.5152), tensor(0.2371), tensor(0.0407), tensor(0.0257), tensor(0.0199)]
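Putting the two pieces together, a variant of get_prediction that returns the top guesses together with their probabilities for every mask could look like this. This is a sketch along the lines of the question's code, not a drop-in replacement; the k parameter and the returned list of (word, probability) pairs are choices made here:

def get_prediction_with_probs(sent, k=5):
    token_ids = tokenizer.encode(sent, return_tensors='pt')
    masked_pos = [m.item() for m in
                  (token_ids.squeeze() == tokenizer.mask_token_id).nonzero()]

    with torch.no_grad():
        logits = model(token_ids)[0].squeeze()

    results = []
    for index, mask_index in enumerate(masked_pos):
        # Softmax turns the logits at this mask position into probabilities
        probs = torch.nn.functional.softmax(logits[mask_index], dim=-1)
        top = torch.topk(probs, k=k)
        words = [tokenizer.decode(i.item()).strip() for i in top.indices]
        pairs = list(zip(words, top.values.tolist()))
        print("Mask", index + 1, "Guesses:", pairs)
        results.append(pairs)
    return results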

【Discussion】
