A Look at Open-Source Chinese Word Segmenters: Stanford CoreNLP

Posted 2020-10-23 en-heng

CoreNLP is an open-source Java NLP toolkit from Stanford University. It provides part-of-speech (POS) tagging, named entity recognition (NER), sentiment analysis, and other functions.

1. Introduction

CoreNLP's Chinese word segmenter is based on a CRF (conditional random field) model:

\[ P_w(y|x) = \frac{\exp \left( \sum_i w_i f_i(x,y) \right)}{Z_w(x)} \]

Here \(Z_w(x)\) is the normalization factor, \(w\) is the vector of model parameters, and \(f_i(x,y)\) are the feature functions.
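To make the formula concrete, here is a minimal Java sketch that evaluates P_w(y|x) for a toy problem with a handful of candidate outputs y. The class name `CrfProbabilitySketch`, its method names, and the toy numbers are illustrative assumptions, not CoreNLP code. A real linear-chain CRF would compute Z_w(x) over all label sequences with the forward algorithm rather than by enumerating candidates.

```java
import java.util.Arrays;

final class CrfProbabilitySketch {

    /** Unnormalized log-score of one candidate y: sum_i w_i * f_i(x, y). */
    static double score(double[] w, double[] f) {
        double s = 0.0;
        for (int i = 0; i < w.length; i++) {
            s += w[i] * f[i];
        }
        return s;
    }

    /**
     * P_w(y|x) for every candidate y, where featuresPerCandidate[k] holds the
     * feature values f_i(x, y_k). Z_w(x) is the sum of exp(score) over candidates.
     */
    static double[] probabilities(double[] w, double[][] featuresPerCandidate) {
        double[] scores = new double[featuresPerCandidate.length];
        double max = Double.NEGATIVE_INFINITY;
        for (int k = 0; k < scores.length; k++) {
            scores[k] = score(w, featuresPerCandidate[k]);
            max = Math.max(max, scores[k]);
        }
        // Subtract the max score before exponentiating for numerical stability;
        // the shift cancels between the numerator and Z_w(x).
        double z = 0.0;
        double[] p = new double[scores.length];
        for (int k = 0; k < p.length; k++) {
            p[k] = Math.exp(scores[k] - max);
            z += p[k];
        }
        for (int k = 0; k < p.length; k++) {
            p[k] /= z;
        }
        return p;
    }

    public static void main(String[] args) {
        double[] w = {1.0, -0.5};                // toy weights
        double[][] f = {{1.0, 0.0}, {0.0, 1.0}}; // toy feature values for two candidates
        System.out.println(Arrays.toString(probabilities(w, f)));
    }
}
```

The candidate with the larger weighted feature sum receives the larger probability, and the returned values sum to 1.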

2. Breakdown

The source-code analysis below is based on version 3.7.0; for a segmentation example, see the SegDemo class.

Model

There are two main model files. The first is the dictionary file dict-chris6.ser.gz:

// dict-chris6.ser.gz deserializes to a dictionary stored as an array of 7 Sets
// Word counts per set: 0+7323+125336+142252+82139+26907+39243 = 423200
ChineseDictionary::loadDictionary(String serializePath) {
    Set<String>[] dict = new HashSet[MAX_LEXICON_LENGTH + 1];
    for (int i = 0; i <= MAX_LEXICON_LENGTH; i++) {
        dict[i] = Generics.newHashSet();
    }
    dict = IOUtils.readObjectFromURLOrClasspathOrFileSystem(serializePath);
    return dict;
}

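The array index is the word length: set 0 is empty, set 1 holds one-character words, and set 6 holds six-character entries. The entries in set 6 are word fragments rather than complete words, for example “《双峰》（电”, “８０年国家领”, and “１８２４年英”.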
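The length-indexed layout can be sketched as follows. This is a minimal illustration, not the CoreNLP `ChineseDictionary` class: the class name `LengthIndexedDictionarySketch`, its methods, and the choice to reject entries longer than six characters are all assumptions made for the example. Lengths are counted in Java `char` units, which matches character counts for the common Chinese characters used here.

```java
import java.util.HashSet;
import java.util.Set;

final class LengthIndexedDictionarySketch {
    // Assumed cap on entry length, mirroring the 7-slot array (indices 0..6) described above.
    static final int MAX_LEXICON_LENGTH = 6;

    private final Set<String>[] dict;

    @SuppressWarnings("unchecked")
    LengthIndexedDictionarySketch() {
        dict = new HashSet[MAX_LEXICON_LENGTH + 1];
        for (int i = 0; i <= MAX_LEXICON_LENGTH; i++) {
            dict[i] = new HashSet<>();
        }
    }

    /** Stores the word in the set for its length; returns false if it is not stored. */
    boolean add(String word) {
        int len = word.length();
        if (len < 1 || len > MAX_LEXICON_LENGTH) {
            return false; // outside the lengths this sketch keeps
        }
        return dict[len].add(word);
    }

    /** A lookup only has to consult the one set that matches the word's length. */
    boolean contains(String word) {
        int len = word.length();
        return len >= 1 && len <= MAX_LEXICON_LENGTH && dict[len].contains(word);
    }

    int size(int length) {
        return dict[length].size();
    }

    public static void main(String[] args) {
        LengthIndexedDictionarySketch d = new LengthIndexedDictionarySketch();
        d.add("\u7684");       // one-character word
        d.add("\u4e2d\u6587"); // two-character word
        System.out.println(d.contains("\u4e2d\u6587"));
        System.out.println(d.size(0)); // set 0 never receives entries
    }
}
```

Indexing by length keeps each membership test to a single hash lookup, which suits a feature extractor that probes substrings of a known length around each character.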

The second is the trained CRF model file /ctb.gz:

CRFClassifier::loadClassifier(ObjectInputStream ois, Properties props) {
    Object o = ois.readObject();
    if (o instanceof List) {
        labelIndices = (List<Index<CRFLabel>>) o; // label indices
    }
    classIndex = (Index<String>) ois.readObject(); // sequence-labeling labels
    featureIndex = (Index<String>) ois.readObject(); // features
    flags = (SeqClassifierFlags) ois.readObject(); // model configuration

    Object featureFactory = ois.readObject(); // feature template, used to generate features
    // (an earlier branch is elided in this excerpt)
    if (featureFactory instanceof FeatureFactory) {
        featureFactories = Generics.newArrayList();
        featureFactories.add((FeatureFactory<IN>) featureFactory);
    }

    windowSize = ois.readInt(); // window size is 2
    weights = (double[][]) ois.readObject(); // weights for each (feature, label) pair

    Set<String> lcWords = (Set<String>) ois.readObject(); // the Set is empty
    // (an earlier branch is elided in this excerpt; this is the fallback path)
    knownLCWords = new MaxSizeConcurrentHashSet<>(lcWords);

    reinit();
}

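Unlike other segmenters, which use the four labels B, M, E, and S, CoreNLP's Chinese segmenter uses only two. "1" means the current character attaches to the previous character as part of the same word; "0" means the current character starts a new word, in other words the previous character ends the preceding word.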
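Turning such a label sequence back into words takes a single pass over the characters. The following is a minimal sketch of that step; the class `BinaryLabelDecoderSketch` and its `decode` method are hypothetical names for illustration and do not come from CoreNLP.

```java
import java.util.ArrayList;
import java.util.List;

final class BinaryLabelDecoderSketch {

    /**
     * Splits text into words given one label per character:
     * '1' = attach to the previous character, '0' = start a new word.
     * The first character always starts a word, whatever its label.
     */
    static List<String> decode(String text, String labels) {
        if (text.length() != labels.length()) {
            throw new IllegalArgumentException("need exactly one label per character");
        }
        List<String> words = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        for (int i = 0; i < text.length(); i++) {
            if (i > 0 && labels.charAt(i) == '0') {
                // A '0' closes the word built so far and opens a new one.
                words.add(current.toString());
                current.setLength(0);
            }
            current.append(text.charAt(i));
        }
        if (current.length() > 0) {
            words.add(current.toString());
        }
        return words;
    }

    public static void main(String[] args) {
        // Four characters: "I", "love", and the two characters of "Beijing".
        // Labels 0,0,0,1 keep the last two characters together as one word.
        System.out.println(decode("\u6211\u7231\u5317\u4eac", "0001"));
    }
}
```

The two-label scheme carries the same boundary information as B/M/E/S: a character labeled "0" is a B or an S, and a character labeled "1" is an M or an E, depending on the label that follows it.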

class CRFClassifier {
    classIndex: class edu.stanford.nlp.util.HashIndex
      ["1","0"]
}

// Annotation class that holds the Chinese segmentation label
public static class AnswerAnnotation implements CoreAnnotation<String>{}

Features

A sample of CoreNLP's features:

class CRFClassifier {
    // features
    featureIndex: class edu.stanford.nlp.util.HashIndex
        size = 3408491
        0=的膀cc2|C
        1=身也pc|C
        44=LSSLp2spscsc2s|C
        45=科背p2p|C
        46=迪。cc2|C
        ...
        =球-行pc2|CnC
        =音非cc2|CpC
    
    // weights
    weights: double[3408491][2]
        [[2.2114868426005005E-5, -2.2114868091546352E-5]...]
}

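Feature names carry one of only three suffixes, C, CpC, or CnC, which mark the three broad feature classes; all are generated by the feature template: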

// List of feature templates
featureFactories: ArrayList<FeatureFactory>
    0 = Gale2007ChineseSegmenterFeatureFactory

// The concrete feature template
Gale2007ChineseSegmenterFeatureFactory::getCliqueFeatures() {
    if (clique == cliqueC) {
        addAllInterningAndSuffixing(features, featuresC(cInfo, loc), "C");
    } else if (clique == cliqueCpC) {
        addAllInterningAndSuffixing(features, featuresCpC(cInfo, loc), "CpC");
        addAllInterningAndSuffixing(features, featuresCnC(cInfo, loc - 1), "CnC");
    }
}

The feature template uses only two cliques, cliqueC and cliqueCpC: cliqueC is implemented by featuresC(), and cliqueCpC by featuresCpC() together with featuresCnC():

Gale2007ChineseSegmenterFeatureFactory::featuresC() {
    if (flags.useWord1) {
        // Unigram features
        features.add(charc +"::c"); // c[0]
        features.add(charc2+"::c2"); // c[1]
        features.add(charp +"::p"); // c[-1]
        features.add(charp2 +"::p2"); // c[-2]
    
        // Bigram features
        features.add(charc +charc2  +"::cn"); // c[0]c[1]
        features.add(charc +charc3  +"::cn2"); // c[0]c[2]
        features.add(charp +charc  +"::pc"); // c[-1]c[0]
        features.add(charp +charc2  +"::pn"); // c[-1]c[1]
        features.add(charp2 +charp  +"::p2p"); // c[-2]c[-1]
        features.add(charp2 +charc  +"::p2c"); // c[-2]c[0]
        features.add(charc2 +charc  +"::n2c"); // c[1]c[0]
    }
    
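    // Dictionary features: the LBeginAnnotation, LMiddleAnnotation, and LEndAnnotation labels of the three characters c[-1], c[0], c[1]
    // The resulting features end in one of nine suffixes: "-lb", "-lm", "-le", "-plb", "-plm", "-ple", "-c2lb", "-c2lm", "-c2le"
    // flags.dictionary is null; flags.serializedDictionary is ".../models/segmenter/chinese/dict-chris6.ser.gz"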
    if (flags.dictionary != null || flags.serializedDictionary != null) {
        dictionaryFeaturesC(CoreAnnotations.LBeginAnnotation.class,
                CoreAnnotations.LMiddleAnnotation.class,
                CoreAnnotations.LEndAnnotation.class,
                "", features, p, c, c2);
    }

    // Features c[-2]c[-1] and c[-2]
    if (flags.useFeaturesC4gram || flags.useFeaturesC5gram || flags.useFeaturesC6gram) {
        features.add(charp2 + charp + "p2p");
        features.add(charp2 + "p2");
    }

    // UnicodeType trigram feature
    if (flags.useUnicodeType || flags.useUnicodeType4gram || flags.useUnicodeType5gram) {
        features.add(uTypep + "-" + uTypec + "-" + uTypec2 + "-uType3");
    }

    // UnicodeType 4-gram feature
    if (flags.useUnicodeType4gram || flags.useUnicodeType5gram) {
        features.add(uTypep2 + "-" + uTypep + "-" + uTypec + "-" + uTypec2 + "-uType4");
    }

    // UnicodeBlock feature
    if (flags.useUnicodeBlock) {
        features.add(p.getString(CoreAnnotations.UBlockAnnotation.class) + "-" + c.getString(CoreAnnotations
                .UBlockAnnotation.class) + "-" + c2.getString(CoreAnnotations.UBlockAnnotation.class) + "-uBlock");
    }

    // Shape features
    if (flags.useShapeStrings) {
        if (flags.useShapeStrings1) {
            features.add(p.getString(CoreAnnotations.ShapeAnnotation.class) + "ps");
            features.add(c.getString(CoreAnnotations.ShapeAnnotation.class) + "cs");
            features.add(c2.getString(CoreAnnotations.ShapeAnnotation.class) + "c2s");
        }
        if (flags.useShapeStrings3) {
            features.add(p.getString(CoreAnnotations.ShapeAnnotation.class) + c.getString(CoreAnnotations
                    .ShapeAnnotation.class) + c2.getString(CoreAnnotations.ShapeAnnotation.class) + "pscsc2s");
        }
        if (flags.useShapeStrings4) {
            features.add(p2.getString(CoreAnnotations.ShapeAnnotation.class) + p.getString(CoreAnnotations
                    .ShapeAnnotation.class) + c.getString(CoreAnnotations.ShapeAnnotation.class) + c2.getString
                    (CoreAnnotations.ShapeAnnotation.class) + "p2spscsc2s");
        }
        if (flags.useShapeStrings5) {
            features.add(p2.getString(CoreAnnotations.ShapeAnnotation.class) + p.getString(CoreAnnotations
                    .ShapeAnnotation.class) + c.getString(CoreAnnotations.ShapeAnnotation.class) + c2.getString
                    (CoreAnnotations.ShapeAnnotation.class) + c3.getString(CoreAnnotations.ShapeAnnotation.class)
                    + "p2spscsc2sc3s");
        }
    }
}

Gale2007ChineseSegmenterFeatureFactory::featuresCpC() {}

Gale2007ChineseSegmenterFeatureFactory::featuresCnC() {}
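The character unigram and bigram part of featuresC() can be sketched as a small standalone program. This is an illustration under stated assumptions, not the CoreNLP implementation: the class `CharNgramFeatureSketch` is hypothetical, the `#` padding symbol for positions outside the sentence is invented for the example, and the suffix is assumed to be appended after a `|` separator, as the dumped feature names suggest.

```java
import java.util.ArrayList;
import java.util.List;

final class CharNgramFeatureSketch {
    // Hypothetical padding symbol for positions outside the sentence.
    private static final char PAD = '#';

    private static char at(String s, int i) {
        return (i >= 0 && i < s.length()) ? s.charAt(i) : PAD;
    }

    /** Unigram and bigram features around position loc, each suffixed with "|C". */
    static List<String> featuresC(String sentence, int loc) {
        char p2 = at(sentence, loc - 2); // c[-2]
        char p = at(sentence, loc - 1);  // c[-1]
        char c = at(sentence, loc);      // c[0]
        char c2 = at(sentence, loc + 1); // c[1]
        char c3 = at(sentence, loc + 2); // c[2]

        List<String> raw = new ArrayList<>();
        // Unigram features
        raw.add(c + "::c");
        raw.add(c2 + "::c2");
        raw.add(p + "::p");
        raw.add(p2 + "::p2");
        // Bigram features; the leading "" forces string concatenation,
        // because char + char would otherwise be integer addition in Java.
        raw.add("" + c + c2 + "::cn");
        raw.add("" + c + c3 + "::cn2");
        raw.add("" + p + c + "::pc");
        raw.add("" + p + c2 + "::pn");
        raw.add("" + p2 + p + "::p2p");
        raw.add("" + p2 + c + "::p2c");
        raw.add("" + c2 + c + "::n2c");

        // Tag every feature with the clique it belongs to.
        List<String> features = new ArrayList<>(raw.size());
        for (String f : raw) {
            features.add(f + "|C");
        }
        return features;
    }

    public static void main(String[] args) {
        for (String f : featuresC("abcde", 2)) {
            System.out.println(f);
        }
    }
}
```

Each position therefore fires a fixed number of string-valued features, and every distinct string that occurs in training becomes one row of the weight matrix.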

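The three feature classes end in "|C" (32 in total), "|CpC" (37 in total), and "|CnC" (9 in total), for 78 kinds of features altogether. In my view, the features CoreNLP defines are overly complex, and most of them contribute little.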

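The rest of the CoreNLP pipeline is the same as in other segmenters: sum the weights of the active features for each label, run Viterbi decoding to find the highest-probability path, and parse the label sequence into the segmentation result. CoreNLP segments very slowly and its accuracy is only average; its results on the PKU and MSR test sets are as follows: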

Test set	Segmenter	Precision	Recall	F1
PKU	thulac4j	0.948	0.936	0.942
	CoreNLP	0.901	0.894	0.897
MSR	thulac4j	0.866	0.896	0.881
	CoreNLP	0.822	0.859	0.840
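The Viterbi decoding step mentioned before the table can be sketched as follows. This is a generic textbook version, not CoreNLP's decoder: the class `ViterbiSketch` is hypothetical, the node scores stand in for the per-label sums of feature weights, and a single shared transition matrix is a simplifying assumption (in CoreNLP the pairwise scores come from the CpC and CnC features and vary by position).

```java
import java.util.Arrays;

final class ViterbiSketch {

    /**
     * @param node  node[t][y]: score of label y at position t
     * @param trans trans[a][b]: score of moving from label a to label b
     * @return the highest-scoring label sequence
     */
    static int[] decode(double[][] node, double[][] trans) {
        int n = node.length;
        if (n == 0) {
            return new int[0];
        }
        int k = node[0].length;
        double[][] best = new double[n][k]; // best[t][y]: best score of any path ending in y at t
        int[][] back = new int[n][k];       // back[t][y]: previous label on that best path
        System.arraycopy(node[0], 0, best[0], 0, k);

        for (int t = 1; t < n; t++) {
            for (int y = 0; y < k; y++) {
                int arg = 0;
                double max = Double.NEGATIVE_INFINITY;
                for (int prev = 0; prev < k; prev++) {
                    double s = best[t - 1][prev] + trans[prev][y];
                    if (s > max) {
                        max = s;
                        arg = prev;
                    }
                }
                best[t][y] = max + node[t][y];
                back[t][y] = arg;
            }
        }

        // Pick the best final label, then follow the back-pointers.
        int last = 0;
        for (int y = 1; y < k; y++) {
            if (best[n - 1][y] > best[n - 1][last]) {
                last = y;
            }
        }
        int[] path = new int[n];
        path[n - 1] = last;
        for (int t = n - 1; t > 0; t--) {
            path[t - 1] = back[t][path[t]];
        }
        return path;
    }

    public static void main(String[] args) {
        double[][] node = {{1.0, 0.0}, {0.0, 1.0}, {0.2, 0.0}}; // toy scores, two labels
        double[][] trans = {{0.0, 0.0}, {0.0, 0.0}};            // no transition preference
        System.out.println(Arrays.toString(decode(node, trans)));
    }
}
```

With only two labels, the inner loop compares just two predecessors per position, so decoding is linear in sentence length; the cost of segmentation lies mostly in feature extraction and scoring.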

3. References

[1] Tseng, Huihsin, et al. "A conditional random field word segmenter for SIGHAN bakeoff 2005." Proceedings of the Fourth SIGHAN Workshop on Chinese Language Processing. 2005.
[2] Chang, Pi-Chuan, Michel Galley, and Christopher D. Manning. "Optimizing Chinese word segmentation for machine translation performance." Proceedings of the third workshop on statistical machine translation. Association for Computational Linguistics, 2008.
