Two Metrics for Text Segmentation

Posted by 狂徒归来


  1. Pk
    def pk(ref, hyp, k=None, boundary="1"):
        """
        Compute the Pk metric for a pair of segmentations. A segmentation
        is any sequence over a vocabulary of two items (e.g. "0", "1"),
        where the specified boundary value is used to mark the edge of a
        segmentation.

        >>> '%.2f' % pk('0100'*100, '1'*400, 2)
        '0.50'
        >>> '%.2f' % pk('0100'*100, '0'*400, 2)
        '0.50'
        >>> '%.2f' % pk('0100'*100, '0100'*100, 2)
        '0.00'

        :param ref: the reference segmentation
        :type ref: str or list
        :param hyp: the segmentation to evaluate
        :type hyp: str or list
        :param k: window size, if None, set to half of the average reference segment length
        :type k: int
        :param boundary: boundary value
        :type boundary: str or int or bool
        :rtype: float
        """

        # Default window: half of the average reference segment length.
        if k is None:
            k = int(round(len(ref) / (ref.count(boundary) * 2.)))

        err = 0
        # Slide a window of width k over both segmentations.
        for i in range(len(ref) - k + 1):
            # Does the window contain at least one boundary?
            r = ref[i:i + k].count(boundary) > 0
            h = hyp[i:i + k].count(boundary) > 0
            # Count the window as an error when the two disagree.
            if r != h:
                err += 1
        # Normalize by the number of windows.
        return err / (len(ref) - k + 1.)
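The default window size is the part I found easiest to misread, so here is a minimal sketch that isolates it. The names `default_k` and `pk_sketch` are my own, made up for illustration; they are not part of nltk, and the body simply mirrors the listing above.

```python
def default_k(ref, boundary="1"):
    """Hypothetical helper: half of the average reference segment length."""
    # len(ref) / count is the average segment length; the factor 2 halves it.
    return int(round(len(ref) / (ref.count(boundary) * 2.0)))


def pk_sketch(ref, hyp, k=None, boundary="1"):
    """Illustrative re-implementation of the Pk listing above."""
    if k is None:
        k = default_k(ref, boundary)
    windows = len(ref) - k + 1
    err = 0
    for i in range(windows):
        ref_has = ref[i:i + k].count(boundary) > 0
        hyp_has = hyp[i:i + k].count(boundary) > 0
        if ref_has != hyp_has:
            err += 1
    return err / windows


ref = "000100000010"   # 12 units, 2 boundaries
hyp = "000010000100"   # each boundary shifted by one position

print(default_k(ref))                  # 12 / (2 * 2) = 3
print("%.2f" % pk_sketch(ref, hyp))    # 3 of 10 windows disagree: 0.30
```

One thing to watch for: when `k` is left as None and the reference contains no boundary at all, the division in the default-`k` step raises `ZeroDivisionError`, so in that case pass `k` explicitly.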

       

  2. WindowDiff
    def windowdiff(seg1, seg2, k, boundary="1", weighted=False):
        """
        Compute the windowdiff score for a pair of segmentations.  A
        segmentation is any sequence over a vocabulary of two items
        (e.g. "0", "1"), where the specified boundary value is used to
        mark the edge of a segmentation.

            >>> s1 = "000100000010"
            >>> s2 = "000010000100"
            >>> s3 = "100000010000"
            >>> '%.2f' % windowdiff(s1, s1, 3)
            '0.00'
            >>> '%.2f' % windowdiff(s1, s2, 3)
            '0.30'
            >>> '%.2f' % windowdiff(s2, s3, 3)
            '0.80'

        :param seg1: a segmentation
        :type seg1: str or list
        :param seg2: a segmentation
        :type seg2: str or list
        :param k: window width
        :type k: int
        :param boundary: boundary value
        :type boundary: str or int or bool
        :param weighted: use the weighted variant of windowdiff
        :type weighted: boolean
        :rtype: float
        """

        if len(seg1) != len(seg2):
            raise ValueError("Segmentations have unequal length")
        if k > len(seg1):
            raise ValueError("Window width k should be smaller than or equal to the segmentation length")
        wd = 0
        # Slide a window of width k over both segmentations.
        for i in range(len(seg1) - k + 1):
            # Difference in the number of boundaries inside the window.
            ndiff = abs(seg1[i:i + k].count(boundary) - seg2[i:i + k].count(boundary))
            if weighted:
                # Weighted variant: penalize by the size of the difference.
                wd += ndiff
            else:
                # Standard variant: each differing window counts once.
                wd += min(1, ndiff)
        # Normalize by the number of windows.
        return wd / (len(seg1) - k + 1.)
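The `weighted` flag only changes the result when some window differs by more than one boundary. A small sketch to show that; `windowdiff_sketch` is my own illustrative name (not an nltk function), and the inputs are made up for the example.

```python
def windowdiff_sketch(seg1, seg2, k, boundary="1", weighted=False):
    """Illustrative re-implementation of the WindowDiff listing above."""
    if len(seg1) != len(seg2):
        raise ValueError("Segmentations have unequal length")
    windows = len(seg1) - k + 1
    wd = 0
    for i in range(windows):
        ndiff = abs(seg1[i:i + k].count(boundary) - seg2[i:i + k].count(boundary))
        # Weighted: add the full difference; unweighted: at most 1 per window.
        wd += ndiff if weighted else min(1, ndiff)
    return wd / windows


seg1 = "000110000"   # two adjacent boundaries
seg2 = "000000000"   # no boundaries at all

# Four of the seven windows differ, two of them by 2 boundaries.
print("%.2f" % windowdiff_sketch(seg1, seg2, 3))                 # 4/7 -> 0.57
print("%.2f" % windowdiff_sketch(seg1, seg2, 3, weighted=True))  # 6/7 -> 0.86
```

Because the weighted variant adds the raw difference, its score is no longer capped at 1 once windows differ by several boundaries.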

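      In the literature these two metrics come across as a bit of black magic! Fortunately, I found the corresponding implementations in nltk (the `nltk.metrics.segmentation` module), and they are extremely simple and clear!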
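Read side by side, the two implementations are the same sliding window with a different comparison rule: Pk only asks whether each window contains a boundary, while WindowDiff compares how many it contains. Here is a rough sketch of that view; the helper names `window_counts` and `compare` and the example strings are my own assumptions, not anything from nltk.

```python
def window_counts(ref, hyp, k, boundary="1"):
    """Yield (ref_count, hyp_count) for every window of width k."""
    for i in range(len(ref) - k + 1):
        yield ref[i:i + k].count(boundary), hyp[i:i + k].count(boundary)


def compare(ref, hyp, k):
    """Return (pk_score, windowdiff_score) computed from the same windows."""
    pairs = list(window_counts(ref, hyp, k))
    # Pk rule: error when only one side has any boundary in the window.
    pk_errors = sum((r > 0) != (h > 0) for r, h in pairs)
    # WindowDiff rule (unweighted): error when the boundary counts differ.
    wd_errors = sum(min(1, abs(r - h)) for r, h in pairs)
    return pk_errors / len(pairs), wd_errors / len(pairs)


ref = "000100000000"   # one true boundary
hyp = "000110000000"   # same boundary plus a spurious one right after it

pk_score, wd_score = compare(ref, hyp, k=3)
print("%.2f %.2f" % (pk_score, wd_score))   # 0.10 0.30
```

In this example, the windows that cover both the true and the spurious boundary look fine to Pk (both sides "have a boundary") but count as errors for WindowDiff, which is why the second score is higher.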
