Stanford CoreNLP: a custom word-segmentation class
Posted by 春文秋武
Stanford CoreNLP's Chinese word segmentation is sometimes unsatisfactory, so we can implement a custom segmenter class and get full control over tokenization (including all kinds of dictionary interventions). The previous article, 《IKAnalyzer》, covered how flexible IKAnalyzer is; this article shows how to use IKAnalyzer as CoreNLP's segmenter.

The earlier post 《Stanford CoreNLP's TokensRegex》 mentioned CoreNLP's configuration file CoreNLP-chinese.properties, in which the property customAnnotatorClass.segment specifies the segmenter class. All we need to do is imitate ChineseSegmenterAnnotator, implement our own Annotator, and point the configuration file at it. The default setting is:
customAnnotatorClass.segment = edu.stanford.nlp.pipeline.ChineseSegmenterAnnotator
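To wire in a custom class instead, the relevant lines of CoreNLP-chinese.properties might look like the sketch below. The package and class name `com.example.nlp.IKSegmenterAnnotator` are placeholders for wherever you put your own implementation, and the annotator list is illustrative; keep whatever annotators your pipeline already uses.

```properties
# point the segment annotator at your own class (hypothetical package name)
customAnnotatorClass.segment = com.example.nlp.IKSegmenterAnnotator

# the segment annotator must come first, before sentence splitting and tagging
annotators = segment, ssplit, pos
```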
Below is my implementation:
```java
import java.io.IOException;
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import java.util.Properties;

import org.wltea.analyzer.core.IKSegmenter;
import org.wltea.analyzer.core.Lexeme;

import edu.stanford.nlp.ling.ChineseCoreAnnotations; // package may vary across CoreNLP versions
import edu.stanford.nlp.ling.CoreAnnotations;
import edu.stanford.nlp.ling.CoreLabel;
import edu.stanford.nlp.pipeline.ChineseSegmenterAnnotator;
import edu.stanford.nlp.util.CoreMap;

public class IKSegmenterAnnotator extends ChineseSegmenterAnnotator {

    public IKSegmenterAnnotator() { super(); }

    public IKSegmenterAnnotator(boolean verbose) { super(verbose); }

    public IKSegmenterAnnotator(String segLoc, boolean verbose) { super(segLoc, verbose); }

    public IKSegmenterAnnotator(String segLoc, boolean verbose, String serDictionary, String sighanCorporaDict) {
        super(segLoc, verbose, serDictionary, sighanCorporaDict);
    }

    public IKSegmenterAnnotator(String name, Properties props) { super(name, props); }

    // run IK's smart segmentation over the text and collect the lexeme strings
    private List<String> splitWords(String str) {
        try {
            List<String> words = new ArrayList<String>();
            IKSegmenter ik = new IKSegmenter(new StringReader(str), true);
            Lexeme lex = null;
            while ((lex = ik.next()) != null) {
                words.add(lex.getLexemeText());
            }
            return words;
        } catch (IOException e) {
            // on failure, fall back to treating the whole string as a single token
            System.out.println(e);
            List<String> words = new ArrayList<String>();
            words.add(str);
            return words;
        }
    }

    @Override
    public void runSegmentation(CoreMap annotation) {
        String text = annotation.get(CoreAnnotations.TextAnnotation.class);
        List<CoreLabel> sentChars = annotation.get(ChineseCoreAnnotations.CharactersAnnotation.class);
        List<CoreLabel> tokens = new ArrayList<CoreLabel>();
        annotation.set(CoreAnnotations.TokensAnnotation.class, tokens);

        // replace the stock segmenter call (segmenter.segmentString(text)) with IK
        List<String> words = splitWords(text);

        // debug output
        System.err.println(text);
        System.err.println("--->");
        System.err.println(words);

        // map each segmented word back onto the per-character CoreLabels
        // so that every token carries correct character offsets
        int pos = 0;
        for (String w : words) {
            CoreLabel fl = sentChars.get(pos);
            fl.set(CoreAnnotations.ChineseSegAnnotation.class, "1");
            if (w.length() == 0) {
                continue;
            }
            CoreLabel token = new CoreLabel();
            token.setWord(w);
            token.set(CoreAnnotations.CharacterOffsetBeginAnnotation.class,
                      fl.get(CoreAnnotations.CharacterOffsetBeginAnnotation.class));
            pos += w.length();
            fl = sentChars.get(pos - 1);
            token.set(CoreAnnotations.CharacterOffsetEndAnnotation.class,
                      fl.get(CoreAnnotations.CharacterOffsetEndAnnotation.class));
            tokens.add(token);
        }
    }
}
```
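The offset bookkeeping in runSegmentation is easy to get wrong, so here is the `pos` arithmetic in isolation. This sketch has no CoreNLP dependency and the class and method names are illustrative only: it walks a list of segmented words and assigns each a `begin-end` character range, consuming `w.length()` characters per word exactly as the loop above does.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class OffsetDemo {

    // return one "word:begin-end" string per token, with end exclusive
    static List<String> offsets(List<String> words) {
        List<String> out = new ArrayList<>();
        int pos = 0;                        // index of the next unconsumed character
        for (String w : words) {
            if (w.length() == 0) continue;  // skip empty segments, as the annotator does
            int begin = pos;                // first character of this word
            pos += w.length();              // consume the word's characters
            int end = pos;                  // one past the last character
            out.add(w + ":" + begin + "-" + end);
        }
        return out;
    }

    public static void main(String[] args) {
        System.out.println(offsets(Arrays.asList("我", "爱", "北京")));
    }
}
```

In the real annotator, `begin` and `end` are not raw string indices but are read from the per-character CoreLabels in `sentChars`, which is what keeps offsets correct even when CoreNLP has normalized the text.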
Outside the annotator, initialize IKAnalyzer's dictionary, specifying the extension dictionary and the words to remove:
```java
// initialize IK's dictionary, then remove interfering words
Dictionary.initial(DefaultConfig.getInstance());
String delDic = System.getProperty(READ_IK_DEL_DIC, null);
// try-with-resources so the reader is closed even on error
try (BufferedReader reader = new BufferedReader(new FileReader(delDic))) {
    String line = null;
    List<String> delWords = new ArrayList<>();
    while ((line = reader.readLine()) != null) {
        delWords.add(line);
    }
    Dictionary.getSingleton().disableWords(delWords);
}
```
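The file-reading part of the snippet above can be exercised on its own. The sketch below (class and method names are mine, not IK's) reads one dictionary entry per line from any Reader, skipping blank lines; a StringReader stands in for the delete-dictionary file.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;

public class DelDicDemo {

    // read newline-separated dictionary entries, trimming and skipping blanks
    static List<String> readEntries(Reader src) {
        List<String> words = new ArrayList<>();
        try (BufferedReader reader = new BufferedReader(src)) {
            String line;
            while ((line = reader.readLine()) != null) {
                String w = line.trim();
                if (!w.isEmpty()) words.add(w);
            }
        } catch (IOException e) {
            // a StringReader never throws here; a real FileReader might
            throw new RuntimeException(e);
        }
        return words;
    }

    public static void main(String[] args) {
        System.out.println(readEntries(new StringReader("的\n了\n\n是")));
    }
}
```

The resulting list is what gets handed to `Dictionary.getSingleton().disableWords(...)`, which removes those entries from IK's in-memory dictionary so they stop influencing segmentation.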