Parallel Word Segmentation with jieba
Posted by 张乐乐章
The source file is tab-separated and has four columns: id, title, content, and label.
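import os

import jieba
import pandas as pd
import yaml
from joblib import Parallel, delayed

# config.yaml must define 'train_cut', the path of the segmented output file.
with open('config.yaml', 'r') as f:
    config = yaml.safe_load(f)


def read_df(trainfile):
    # Tab-separated input with four columns and no header row.
    data = pd.read_csv(trainfile, sep='\t', header=None, nrows=60000, encoding='utf-8',
                       names=['id', 'title', 'content', 'label'])
    return data


def word_cut(row):
    # Segment the title and content of one row, then append the result to the output file.
    line = '\t'.join([str(row[0]), ' '.join(jieba.cut(row[1])),
                      ' '.join(jieba.cut(row[2])), str(row[3])])
    # Write the whole line in one call so concurrent workers are less likely to interleave output.
    with open(config['train_cut'], 'a+', encoding='utf-8') as f:
        f.write(line + '\n')


def applyParallel(content, func, n_thread):
    # Workers append to the file concurrently, so the output order is not guaranteed.
    with Parallel(n_jobs=n_thread) as parallel:
        parallel(delayed(func)(c) for c in content)


def main():
    overwrite = True
    if overwrite and os.path.exists(config['train_cut']):
        os.remove(config['train_cut'])
    trainfile = 'data/train_fusai.tsv'
    df = read_df(trainfile)
    content = df.values  # one array per row: [id, title, content, label]
    applyParallel(content, word_cut, 22)


if __name__ == '__main__':
    main()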
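Because every worker in the script above appends to the same file, lines can arrive out of order. One possible alternative, sketched here under stated assumptions, is to let the workers return the segmented line and have only the parent process write. This sketch swaps joblib for the standard library's `concurrent.futures`. The names `cut_row` and `cut_rows_parallel`, the `cut` parameter, and the default worker and chunk sizes are hypothetical and not part of the original post.

```python
from concurrent.futures import ProcessPoolExecutor


def cut_row(row, cut=None):
    """Segment the title and content of one (id, title, content, label) row."""
    if cut is None:
        # Assumption: jieba is installed; it is imported lazily inside the worker.
        import jieba
        cut = jieba.cut
    doc_id, title, content, label = row
    return '\t'.join([str(doc_id),
                      ' '.join(cut(str(title))),
                      ' '.join(cut(str(content))),
                      str(label)])


def cut_rows_parallel(rows, out_path, n_workers=4, chunksize=256):
    """Segment rows in worker processes; only the parent writes the file."""
    with ProcessPoolExecutor(max_workers=n_workers) as pool, \
            open(out_path, 'w', encoding='utf-8') as f:
        # map() yields results in input order, so the output keeps the row order.
        for line in pool.map(cut_row, rows, chunksize=chunksize):
            f.write(line + '\n')
```

With the script above, this could be called from inside the `if __name__ == '__main__':` guard as `cut_rows_parallel(df.values, config['train_cut'], n_workers=22)`. Opening the file with mode `'w'` also makes the separate overwrite check unnecessary.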