The CHRF Evaluation Metric

Posted 2023-01-02 雨宙


CHRF

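CHRF evaluates translation quality at the character level, which lets it capture some morpho-syntactic phenomena. Compared with other metrics it is also simple: it needs no additional tools or knowledge sources, and it is fully independent of both the language and the tokenization process.
The CHRF formula:
$\mathrm{chrF}_\beta=\left(1+\beta^2\right) \frac{\mathrm{chrP} \cdot \mathrm{chrR}}{\beta^2 \cdot \mathrm{chrP}+\mathrm{chrR}}$
- $\mathrm{chrP}$ is precision: the fraction of character n-grams in the hypothesis (the translated sentence) that also occur in the reference
- $\mathrm{chrR}$ is recall: the fraction of character n-grams in the reference that also occur in the hypothesis
- $\beta$ controls the relative weight of recall and precision (recall counts $\beta$ times as much as precision); with $\beta=1$ the two are equally important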
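To get a feel for how $\beta$ shifts the balance, the formula can be sketched as a small Python function. The name `f_beta` and the sample values 0.4 and 0.8 are illustrative assumptions, not taken from any library.

```python
def f_beta(precision: float, recall: float, beta: float = 3.0) -> float:
    """F-beta score as in the chrF formula; returns 0.0 when both inputs are 0."""
    beta_sq = beta ** 2
    denom = beta_sq * precision + recall
    if denom == 0:
        return 0.0
    return (1 + beta_sq) * precision * recall / denom


# Illustrative values only: a hypothesis with low precision but high recall.
p, r = 0.4, 0.8
balanced = f_beta(p, r, beta=1.0)       # harmonic mean of p and r
recall_heavy = f_beta(p, r, beta=3.0)   # pulled toward the recall value
print(round(balanced, 4), round(recall_heavy, 4))
```

With these values the score rises from about 0.53 at $\beta=1$ to about 0.73 at $\beta=3$, because the larger $\beta$ lets the higher recall dominate.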
Computing CHRF with nltk
- With n-gram length 1 (minimum and maximum n-gram length both 1, $\beta=3$)
```
from nltk.translate.chrf_score import sentence_chrf

ref = 'the cat is on the mat'.split()
hyp = 'the the the the the the the'.split()
sentence_chrf(ref, hyp, min_len=1, max_len=1, beta=3.0)
# 0.48484848484848486
```
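  - TP: the matching 1-grams are t, h and e, 8 matches in total (t matches 4 times, h and e twice each; whitespace is ignored)
  - TP + FP: the hypothesis contains 21 characters
  - TP + FN: the reference contains 16 characters
  - $\mathrm{chrP}=8/21,\ \mathrm{chrR}=8/16$
    $\mathrm{chrF}=(1 + 3^2)\frac{\frac{8}{21}\cdot\frac{1}{2}}{3^2\cdot\frac{8}{21}+\frac{1}{2}}=\frac{16}{33}\approx 0.484848$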
- Extending to max_len=2: the minimum n-gram length is now 1, the maximum is 2, and $\beta=3$
```
from nltk.translate.chrf_score import sentence_chrf
ref = 'the cat is on the mat'.split()
hyp = 'the the the the the the the'.split()
print(sentence_chrf(ref, hyp, min_len=1, max_len=2, beta=3.0))
# 0.37145650048875856
```
  - 1-grams: the F-score is 0.48484848484848486 (the same calculation as in the previous example)
  - 2-grams:
    - TP: the matching 2-grams are th and he, 4 matches in total
    - TP + FP: the hypothesis yields 20 character 2-grams
    - TP + FN: the reference yields 15 character 2-grams
    - $\mathrm{chrP}=4/20,\ \mathrm{chrR}=4/15$
      $\mathrm{chrF}_{2\text{-gram}}=(1 + 3^2)\frac{\frac{1}{5}\cdot\frac{4}{15}}{3^2\cdot\frac{1}{5}+\frac{4}{15}}=\frac{8}{31}\approx 0.258065$
  - Overall CHRF: the arithmetic mean of the per-order F-scores
    $\mathrm{chrF}=\frac{0.48484848484848486+0.25806451612903225}{2}=0.37145650048875856$
- Corpus-level CHRF (everything above is sentence-level): the basic idea is to compute CHRF for each sentence and then take the arithmetic mean
```
from nltk.translate.chrf_score import corpus_chrf, sentence_chrf

ref1 = 'It is a guide to action that ensures that the military will forever heed Party commands'.split()
ref2 = 'It is the guiding principle which guarantees the military forces always being under the command of the Party'.split()
hyp1 = 'It is a guide to action which ensures that the military always obeys the commands of the party'.split()
hyp2 = 'It is to insure the troops forever hearing the activity guidebook that party direct'.split()

# Corpus-level score (default min_len, max_len and beta)
corpus_chrf([ref1, ref2], [hyp1, hyp2])
# 0.4166529443281564

# Same value as the mean of the two sentence-level scores
(sentence_chrf(ref1, hyp1) + sentence_chrf(ref2, hyp2)) / 2
# 0.4166529443281564
```
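The hand calculation for `min_len=1, max_len=2` can be double-checked without nltk. The sketch below is not nltk's implementation: the helper names `char_ngrams` and `order_fscore` are made up for illustration, and it assumes whitespace is dropped before n-grams are extracted, as in the worked example.

```python
from collections import Counter
from fractions import Fraction


def char_ngrams(text: str, n: int) -> Counter:
    """Count character n-grams after removing all whitespace."""
    chars = "".join(text.split())
    return Counter(chars[i:i + n] for i in range(len(chars) - n + 1))


def order_fscore(hyp: str, ref: str, n: int, beta: int = 3) -> Fraction:
    """Exact F-beta for a single n-gram order (0 if nothing matches)."""
    hyp_counts, ref_counts = char_ngrams(hyp, n), char_ngrams(ref, n)
    # Clipped matches: each n-gram counts at most as often as it appears in both.
    tp = sum((hyp_counts & ref_counts).values())
    if tp == 0:
        return Fraction(0)
    prec = Fraction(tp, sum(hyp_counts.values()))
    rec = Fraction(tp, sum(ref_counts.values()))
    return (1 + beta ** 2) * prec * rec / (beta ** 2 * prec + rec)


hyp = "the the the the the the the"
ref = "the cat is on the mat"
f1 = order_fscore(hyp, ref, 1)
f2 = order_fscore(hyp, ref, 2)
# Average the per-order F-scores, as in the walkthrough above.
print(f1, f2, float((f1 + f2) / 2))
```

Using `Fraction` keeps the intermediate values exact, so the per-order results can be compared directly with the fractions 16/33 and 8/31 derived by hand.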

Computing CHRF with sacrebleu

Sentence-level CHRF

```
import sacrebleu

# Character 1-grams only; sacrebleu reports scores on a 0-100 scale
print(sacrebleu.sentence_chrf(hypothesis='the the the the the the the',
                              references=['the cat is on the mat'],
                              char_order=1, word_order=0, beta=3, remove_whitespace=True).score)
# 48.484848484848484
# Character 1-grams and 2-grams
print(sacrebleu.sentence_chrf(hypothesis='the the the the the the the',
                              references=['the cat is on the mat'],
                              char_order=2, word_order=0, beta=3, remove_whitespace=True).score)
# 37.145882975906794
```
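The order-2 score here (37.1459) differs slightly from nltk's 0.37145650048875856 for the same sentences. One explanation consistent with the calculations in this article is where the averaging happens: the nltk walkthrough averages the per-order F-scores, while sacrebleu (as described for the corpus-level case) averages precision and recall first and then computes a single F-score. The sketch below compares the two strategies using the counts from the worked example; the function and variable names are illustrative.

```python
def f_beta(prec: float, rec: float, beta: float = 3.0) -> float:
    """F-beta from precision and recall (0.0 if both are 0)."""
    denom = beta ** 2 * prec + rec
    return (1 + beta ** 2) * prec * rec / denom if denom else 0.0


# (hypothesis n-grams, reference n-grams, matches) for orders 1 and 2,
# taken from the worked example for 'the the ...' vs 'the cat is on the mat'.
stats = [(21, 16, 8), (20, 15, 4)]
precs = [m / h for h, r, m in stats]
recs = [m / r for h, r, m in stats]

# Strategy 1: F-score per order, then average the F-scores.
mean_of_f = sum(f_beta(p, r) for p, r in zip(precs, recs)) / len(stats)

# Strategy 2: average precision and recall over orders, then one F-score.
f_of_means = f_beta(sum(precs) / len(stats), sum(recs) / len(stats))

print(f"{mean_of_f:.7f} {f_of_means:.7f}")
```

The two strategies agree to five decimal places and diverge after that, which is the same size of gap as between the nltk and sacrebleu outputs.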

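Corpus-level CHRF

Unlike nltk, sacrebleu does not compute CHRF for each sentence and then take the arithmetic mean.
When computing the precision and recall for i-grams, sacrebleu sums three quantities over the whole corpus: the number of hypothesis i-grams, the number of reference i-grams, and the number of i-grams matched between hypothesis and reference. In other words, numerators are added to numerators and denominators to denominators, whereas nltk adds the per-sentence scores directly. This mirrors the way sacrebleu computes corpus-level BLEU.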

```
import sacrebleu

ref1 = 'It is a guide to action that ensures that the military will forever heed Party commands'
ref2 = 'It is the guiding principle which guarantees the military forces always being under the command of the Party'
hyp1 = 'It is a guide to action which ensures that the military always obeys the commands of the party'
hyp2 = 'It is to insure the troops forever hearing the activity guidebook that party direct'
# references is a list of reference streams; here one stream with one reference per hypothesis
print(sacrebleu.corpus_chrf(hypotheses=[hyp1, hyp2], references=[[ref1, ref2]], char_order=6, word_order=0, beta=3).score)
# 39.364938843711016
```

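Taking the code above as an example, sacrebleu first computes, for each sentence and each n-gram order, the number of hypothesis n-grams, the number of reference n-grams, and the number of n-grams matched between the two, as shown below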
```
[[77, 72, 65, 76, 71, 50, 75, 70, 44, 74, 69, 40, 73, 68, 36, 72, 67, 33], [70, 91, 60, 69, 90, 28, 68, 89, 12, 67, 88, 4, 66, 87, 1, 65, 86, 0]]
```
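The first item in the list corresponds to the first sentence and the second item to the second sentence. Taking the first sentence as an example, its item has 18 elements: the number of hypothesis 1-grams (the hypothesis length), the number of reference 1-grams (the reference length), the number of matched 1-grams, and so on up to 6-grams.
Adding the corresponding entries gives the following result (equivalent to adding numerators to numerators and denominators to denominators)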
```
[147, 163, 125, 145, 161, 78, 143, 159, 56, 141, 157, 44, 139, 155, 37, 137, 153, 33]
```
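The element-wise addition is a one-liner in Python. The sketch below uses the two per-sentence lists quoted above; the variable names are illustrative.

```python
# Per-sentence [hyp, ref, match] counts for character n-gram orders 1..6,
# copied from the sacrebleu statistics shown above.
sentence_stats = [
    [77, 72, 65, 76, 71, 50, 75, 70, 44, 74, 69, 40, 73, 68, 36, 72, 67, 33],
    [70, 91, 60, 69, 90, 28, 68, 89, 12, 67, 88, 4, 66, 87, 1, 65, 86, 0],
]

# zip(*...) groups the i-th entries of every sentence so they can be summed.
corpus_stats = [sum(column) for column in zip(*sentence_stats)]
print(corpus_stats)
```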

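Next, compute the precision and recall for each n-gram order, take the arithmetic mean of the precisions and of the recalls (sum divided by 6), and use the averaged precision and recall to compute the final CHRF. The basic logic is shown below (hand-written in imitation of sacrebleu; it may not cover some special cases, such as a zero denominator)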

```
# Summed [hyp, ref, match] counts for character n-gram orders 1..6
data_list = [147, 163, 125, 145, 161, 78, 143, 159, 56, 141, 157, 44, 139, 155, 37, 137, 153, 33]
sum_prec, sum_rec = 0, 0
for i in range(0, 6):
    index = 3 * i
    n_hyp = data_list[index]
    n_ref = data_list[index + 1]
    n_match = data_list[index + 2]

    # Precision and recall for this n-gram order
    n_prec = n_match / n_hyp
    n_rec = n_match / n_ref

    sum_prec += n_prec
    sum_rec += n_rec

# F-score (beta=3, so beta^2=9) from the precision and recall averaged over the 6 orders
print((1 + 9) * (sum_prec / 6) * (sum_rec / 6) / (9 * (sum_prec / 6) + (sum_rec / 6)))
# 0.39364938843711017
```

CHRF++

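Earlier work showed that, for poorly rated sentences, the standard deviations of the CHRF and WORDF scores are similar: both metrics assign comparably low scores. For sentences that humans rate highly, however, the deviation of CHRF is much lower than that of WORDF, and the higher the human rating, the larger the gap between the WORDF and CHRF deviations. These results suggest that CHRF is preferable to WORDF, especially on segments with high translation quality.
However, CHRF scores may be overly optimistic, so CHRF and WORDF are combined to obtain CHRF++.
The CHRF++ score is obtained by adding the word n-grams to the character n-grams and averaging. For this combination, the best character n-gram length is n=6, the same as the best length for CHRF itself, and the best word n-gram length is n=1 or n=2.
Computing CHRF++ with sacrebleu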
```
import sacrebleu

# word_order=1 adds word 1-grams on top of the character 1-grams
print(sacrebleu.sentence_chrf(hypothesis='the the the the the the the',
                              references=['the cat is on the mat'],
                              char_order=1, word_order=1, beta=3, remove_whitespace=True).score)
# 40.65040650406503
```
- To keep the calculation simple, 1-grams are used for both characters and words here, which gives the following statistics
```
[21, 16, 8, 7, 6, 2]
```
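- The first three numbers are the character-level 1-gram statistics and the last three are the word-level 1-gram statistics (in each group: the number of hypothesis 1-grams, the number of reference 1-grams, and the number of matched 1-grams)
- Compute the precision and recall for each group and average them (sacrebleu actually divides by self.order, which here equals the list length divided by 3, i.e. $6/3 = 2$) to obtain the final precision and recall, then compute CHRF++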
```
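# Character plus word precision (8/21 + 2/7), averaged over the 2 orders
prec = 0.6666666666666666 / 2
# Character plus word recall (8/16 + 2/6), averaged over the 2 orders
rec = 0.8333333333333333 / 2
(1 + 9) * prec * rec / (9 * prec + rec)
# 0.40650406504065034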
```
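The whole CHRF++ walkthrough can be reproduced from the raw strings. This is a sketch of the steps described above, not sacrebleu's code: the helper names are invented, words are obtained with a plain `str.split()` (sacrebleu's own word handling, for example of punctuation, may differ), and every order is assumed to have at least one hypothesis and one reference n-gram.

```python
from collections import Counter


def ngram_counts(items, n):
    """Count n-grams over a sequence (a string of chars or a list of words)."""
    return Counter(tuple(items[i:i + n]) for i in range(len(items) - n + 1))


def order_stats(hyp_items, ref_items, n):
    """Return [hyp n-grams, ref n-grams, clipped matches] for one order."""
    hyp_c, ref_c = ngram_counts(hyp_items, n), ngram_counts(ref_items, n)
    return [sum(hyp_c.values()), sum(ref_c.values()), sum((hyp_c & ref_c).values())]


def chrf_pp(hyp, ref, char_order=1, word_order=1, beta=3):
    """Return the flat statistics list and the score on a 0-1 scale."""
    stats = []
    # Character n-grams are taken after removing whitespace.
    hyp_chars, ref_chars = "".join(hyp.split()), "".join(ref.split())
    for n in range(1, char_order + 1):
        stats += order_stats(hyp_chars, ref_chars, n)
    # Word n-grams use a plain whitespace split (an assumption of this sketch).
    for n in range(1, word_order + 1):
        stats += order_stats(hyp.split(), ref.split(), n)
    # Assumes no order has zero hypothesis or reference n-grams.
    order = len(stats) // 3
    prec = sum(stats[3 * i + 2] / stats[3 * i] for i in range(order)) / order
    rec = sum(stats[3 * i + 2] / stats[3 * i + 1] for i in range(order)) / order
    denom = beta ** 2 * prec + rec
    return stats, ((1 + beta ** 2) * prec * rec / denom if denom else 0.0)


stats, score = chrf_pp("the the the the the the the", "the cat is on the mat")
print(stats, score)
```

Returning the statistics alongside the score makes it easy to compare the intermediate counts with the list `[21, 16, 8, 7, 6, 2]` before looking at the final value.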
