Fixing the error raised when loading a corpus with gensim.models.word2vec.LineSentence: UnicodeDecodeError: 'utf-8' codec can't
Posted jiangxinyang
On Windows, loading the (already tokenized) Chinese Wikipedia corpus with gensim.models.word2vec.LineSentence fails with the following error:
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xca in position 0: invalid continuation byte
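The same exception can be reproduced without gensim. The sketch below assumes (the original does not say this) that the corpus was saved in a legacy Chinese encoding such as GBK, which is common on Windows; the sample text is purely illustrative.

```python
# Assumption: the corpus file is GBK-encoded; the sample text is illustrative.
sample = "是 一个 例子"
raw = sample.encode("gbk")  # the bytes as they would sit in a GBK file

try:
    raw.decode("utf-8")  # what a strict UTF-8 decode attempts
    decoded_ok = True
except UnicodeDecodeError as exc:
    decoded_ok = False
    print(type(exc).__name__, "-", exc.reason)

# Decoding with the encoding the bytes were written in succeeds.
roundtrip = raw.decode("gbk")
print(roundtrip == sample)
```

If the real file turns out to use a different encoding, the same check applies with that encoding name substituted for "gbk".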
Encoding problems like this are a real headache. They typically surface at a call such as xxx.decode("utf-8"), so let's look at the relevant gensim source:
    class LineSentence(object):
        """Iterate over a file that contains sentences: one line = one sentence.
        Words must be already preprocessed and separated by whitespace.

        """
        def __init__(self, source, max_sentence_length=MAX_WORDS_IN_BATCH, limit=None):
            """

            Parameters
            ----------
            source : string or a file-like object
                Path to the file on disk, or an already-open file object (must support `seek(0)`).
            limit : int or None
                Clip the file to the first `limit` lines. Do no clipping if `limit is None` (the default).

            Examples
            --------
            .. sourcecode:: pycon

                >>> from gensim.test.utils import datapath
                >>> sentences = LineSentence(datapath('lee_background.cor'))
                >>> for sentence in sentences:
                ...     pass

            """
            self.source = source
            self.max_sentence_length = max_sentence_length
            self.limit = limit

        def __iter__(self):
            """Iterate through the lines in the source."""
            try:
                # Assume it is a file-like object and try treating it as such
                # Things that don't have seek will trigger an exception
                self.source.seek(0)
                for line in itertools.islice(self.source, self.limit):
                    line = utils.to_unicode(line).split()
                    i = 0
                    while i < len(line):
                        yield line[i: i + self.max_sentence_length]
                        i += self.max_sentence_length
            except AttributeError:
                # If it didn't work like a file, use it as a string filename
                with utils.smart_open(self.source) as fin:
                    for line in itertools.islice(fin, self.limit):
                        line = utils.to_unicode(line).split()
                        i = 0
                        while i < len(line):
                            yield line[i: i + self.max_sentence_length]
                            i += self.max_sentence_length
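As the source shows, the __iter__ method is what makes LineSentence an iterable object, and all of the file-reading logic lives inside __iter__. The source argument we pass is usually a file path (that is, a string), so inside the try block self.source.seek(0) raises an AttributeError because a str has no seek method. The code that actually runs is therefore the except branch.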
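The path-or-file dispatch described above can be sketched with the standard library alone. This is a simplified illustration, not gensim's code: the helper name `iter_lines` is hypothetical, and the built-in `open` stands in for `utils.smart_open`.

```python
import io
import os
import tempfile


def iter_lines(source):
    """Yield raw lines from a file-like object or, failing that, from a path."""
    try:
        source.seek(0)  # a str has no .seek, so this raises AttributeError
    except AttributeError:
        # Not file-like: treat it as a filename and read it in binary mode.
        with open(source, "rb") as fin:
            yield from fin
    else:
        yield from source


data = b"a b\nc d\n"

# Case 1: an already-open, seekable object takes the file-like branch.
from_object = list(iter_lines(io.BytesIO(data)))

# Case 2: a path string falls through to the except branch.
with tempfile.NamedTemporaryFile(delete=False) as tmp:
    tmp.write(data)
from_path = list(iter_lines(tmp.name))
os.remove(tmp.name)

print(from_object == from_path)
```

Both calls yield the same byte lines, which is why any decoding fix has to be applied in the branch that actually handles the input, here the path branch.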
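There are two ways to fix the problem:
1) from gensim import utils
utils.smart_open(url, mode="rb", **kw)
The source opens the file with utils.smart_open(), which defaults to binary mode; change mode="rb" to mode="r". In text mode the lines arrive as already-decoded str, so utils.to_unicode no longer has to decode raw bytes as UTF-8.
2) from gensim import utils
utils.to_unicode(text, encoding='utf8', errors='strict')
When the source decodes with "utf8", errors defaults to "strict"; change it to errors="ignore", that is, utils.to_unicode(line, errors="ignore"). Note that this silently drops any bytes that are not valid UTF-8.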
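The two options do not return the same text. The sketch below uses only the built-in bytes.decode and again assumes a GBK-encoded corpus (an assumption, not something stated in the original): errors="ignore" suppresses the exception by discarding whatever it cannot decode, while decoding with the file's real encoding preserves the tokens.

```python
# Assumption: the line comes from a GBK-encoded file; the content is illustrative.
line = "数学 模型".encode("gbk")

# Option 2 in spirit: decode as UTF-8 but skip undecodable bytes.
ignored = line.decode("utf-8", errors="ignore")

# Decoding with the encoding the file was written in.
correct = line.decode("gbk")

print(repr(ignored))
print(repr(correct))
print(ignored == correct)
```

Under this assumption errors="ignore" trains the model on damaged tokens, so it is worth checking the file's actual encoding before settling on that option.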
However, rather than editing the library source directly, it is better to copy the class into your own code, for example:
    import logging
    import itertools

    import gensim
    from gensim.models import word2vec
    from gensim import utils

    logging.basicConfig(format='%(asctime)s : %(levelname)s : %(message)s', level=logging.INFO)


    class LineSentence(object):
        """Iterate over a file that contains sentences: one line = one sentence.
        Words must be already preprocessed and separated by whitespace.

        """
        def __init__(self, source, max_sentence_length=10000, limit=None):
            """

            Parameters
            ----------
            source : string or a file-like object
                Path to the file on disk, or an already-open file object (must support `seek(0)`).
            limit : int or None
                Clip the file to the first `limit` lines. Do no clipping if `limit is None` (the default).

            Examples
            --------
            .. sourcecode:: pycon

                >>> from gensim.test.utils import datapath
                >>> sentences = LineSentence(datapath('lee_background.cor'))
                >>> for sentence in sentences:
                ...     pass

            """
            self.source = source
            self.max_sentence_length = max_sentence_length
            self.limit = limit

        def __iter__(self):
            """Iterate through the lines in the source."""
            try:
                # Assume it is a file-like object and try treating it as such
                # Things that don't have seek will trigger an exception
                self.source.seek(0)
                for line in itertools.islice(self.source, self.limit):
                    line = utils.to_unicode(line).split()
                    i = 0
                    while i < len(line):
                        yield line[i: i + self.max_sentence_length]
                        i += self.max_sentence_length
            except AttributeError:
                # If it didn't work like a file, use it as a string filename.
                # The only change from the library version: open in text mode.
                with utils.smart_open(self.source, mode="r") as fin:
                    for line in itertools.islice(fin, self.limit):
                        line = utils.to_unicode(line).split()
                        i = 0
                        while i < len(line):
                            yield line[i: i + self.max_sentence_length]
                            i += self.max_sentence_length


    our_sentences = LineSentence("./zhwiki_token.txt")

    # Large corpus: use CBOW (the default) and increase the number of iterations somewhat
    model = gensim.models.Word2Vec(our_sentences, size=200, iter=30)

    # model.save(save_model_file)  # saves the full model; needed if training is to be continued later

    # Save the word vectors in binary word2vec format (vectors only; training cannot be resumed from this file)
    model.wv.save_word2vec_format("./mathWord2Vec" + ".bin", binary=True)
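An alternative that avoids copying library code is a small iterable of your own, since Word2Vec accepts any restartable iterable of token lists. The sketch below is not gensim's implementation: the class name `EncodedLineSentence` and its `encoding` parameter are hypothetical, and GBK is an assumed encoding for the temporary demo file.

```python
import itertools
import os
import tempfile


class EncodedLineSentence:
    """Stream whitespace-tokenized sentences from a text file (hypothetical helper)."""

    def __init__(self, path, encoding="utf-8", max_sentence_length=10000, limit=None):
        self.path = path
        self.encoding = encoding
        self.max_sentence_length = max_sentence_length
        self.limit = limit

    def __iter__(self):
        # Text mode with an explicit encoding, so decoding does not depend on the locale.
        with open(self.path, "r", encoding=self.encoding) as fin:
            for line in itertools.islice(fin, self.limit):
                tokens = line.split()
                # Split overly long lines into chunks of max_sentence_length tokens.
                for i in range(0, len(tokens), self.max_sentence_length):
                    yield tokens[i:i + self.max_sentence_length]


# Demo on a temporary GBK file (assumed encoding, illustrative content).
with tempfile.NamedTemporaryFile(delete=False) as tmp:
    tmp.write("数学 是 科学\n模型 训练\n".encode("gbk"))

sentences = EncodedLineSentence(tmp.name, encoding="gbk", max_sentence_length=2)
first_pass = list(sentences)
second_pass = list(sentences)  # restartable: each pass reopens the file
os.remove(tmp.name)

print(first_pass == second_pass)
```

An instance of such a class could then be passed to gensim.models.Word2Vec in place of LineSentence, with the encoding stated explicitly instead of relying on the platform default.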