UnicodeDecodeError: 'ascii' codec can't decode byte in Textranking code [duplicate]

Posted: 2017-09-02 09:59:25

Question:

When I run the following code:

import networkx as nx
import numpy as np
from nltk.tokenize.punkt import PunktSentenceTokenizer
from sklearn.feature_extraction.text import TfidfTransformer, CountVectorizer

def textrank(document):
    sentence_tokenizer = PunktSentenceTokenizer()
    sentences = sentence_tokenizer.tokenize(document)

    bow_matrix = CountVectorizer().fit_transform(sentences)
    normalized = TfidfTransformer().fit_transform(bow_matrix)

    similarity_graph = normalized * normalized.T

    nx_graph = nx.from_scipy_sparse_matrix(similarity_graph)
    scores = nx.pagerank(nx_graph)
    return sorted(((scores[i],s) for i,s in enumerate(sentences)), reverse=True)

fp = open("QC")    
txt = fp.read()
sents = textrank(txt)
print sents

I get the following error:

Traceback (most recent call last):
  File "Textrank.py", line 44, in <module>
    sents = textrank(txt)
  File "Textrank.py", line 10, in textrank
    sentences = sentence_tokenizer.tokenize(document)
  File "/usr/local/lib/python2.7/dist-packages/nltk/tokenize/punkt.py", line 1237, in tokenize
    return list(self.sentences_from_text(text, realign_boundaries))
  File "/usr/local/lib/python2.7/dist-packages/nltk/tokenize/punkt.py", line 1285, in sentences_from_text
    return [text[s:e] for s, e in self.span_tokenize(text, realign_boundaries)]
  File "/usr/local/lib/python2.7/dist-packages/nltk/tokenize/punkt.py", line 1276, in span_tokenize
    return [(sl.start, sl.stop) for sl in slices]
  File "/usr/local/lib/python2.7/dist-packages/nltk/tokenize/punkt.py", line 1316, in _realign_boundaries
    for sl1, sl2 in _pair_iter(slices):
  File "/usr/local/lib/python2.7/dist-packages/nltk/tokenize/punkt.py", line 311, in _pair_iter
    for el in it:
  File "/usr/local/lib/python2.7/dist-packages/nltk/tokenize/punkt.py", line 1291, in _slices_from_text
    if self.text_contains_sentbreak(context):
  File "/usr/local/lib/python2.7/dist-packages/nltk/tokenize/punkt.py", line 1337, in text_contains_sentbreak
    for t in self._annotate_tokens(self._tokenize_words(text)):
  File "/usr/local/lib/python2.7/dist-packages/nltk/tokenize/punkt.py", line 1472, in _annotate_second_pass
    for t1, t2 in _pair_iter(tokens):
  File "/usr/local/lib/python2.7/dist-packages/nltk/tokenize/punkt.py", line 310, in _pair_iter
    prev = next(it)
  File "/usr/local/lib/python2.7/dist-packages/nltk/tokenize/punkt.py", line 577, in _annotate_first_pass
    for aug_tok in tokens:
  File "/usr/local/lib/python2.7/dist-packages/nltk/tokenize/punkt.py", line 542, in _tokenize_words
    for line in plaintext.split('\n'):
UnicodeDecodeError: 'ascii' codec can't decode byte 0xe2 in position 9: ordinal not in range(128)

I am running the code on Ubuntu. For the text, I used this site: https://uwaterloo.ca/institute-for-quantum-computing/quantum-computing-101. I created a file named QC (not QC.txt) and copy-pasted the data into it paragraph by paragraph. Please help me resolve the error. Thank you.
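For context, the 0xe2 byte in the traceback is the first byte of a UTF-8 multi-byte sequence (web pages like the one being copied typically contain curly quotes and similar punctuation), which Python 2's default 'ascii' codec cannot decode. A minimal sketch of the failure and the fix, using a hypothetical sample string rather than the actual QC file:

```python
# Hypothetical sample: a curly apostrophe (U+2019) encodes to the
# three UTF-8 bytes e2 80 99 -- the 0xe2 seen in the traceback.
data = "A qubit isn\u2019t a classical bit.".encode("utf-8")

try:
    data.decode("ascii")       # what Python 2's implicit decode attempts
except UnicodeDecodeError as err:
    print("ascii failed:", err.reason)

text = data.decode("utf-8")    # decoding with the correct codec succeeds
print(text)
```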

Comments:

Welcome to Stack Overflow! Please see ***.com/help/how-to-ask. Also, please search Google or elsewhere for the problem before posting a question.

Sorry, I was unable to understand the existing solutions. Edit: I just re-checked the link and it makes a bit more sense now, but I cannot figure out how to adapt that solution here. I am new to Python and have jumped straight into NLP, so I get overwhelmed easily. Please forgive me.

Answer 1:

Please try the following and see if it works for you.

import networkx as nx
import numpy as np
import sys

# Switch Python 2's default codec from 'ascii' to 'utf8' so that
# implicit str/unicode conversions stop raising UnicodeDecodeError.
reload(sys)
sys.setdefaultencoding('utf8')

from nltk.tokenize.punkt import PunktSentenceTokenizer
from sklearn.feature_extraction.text import TfidfTransformer, CountVectorizer

def textrank(document):
    sentence_tokenizer = PunktSentenceTokenizer()
    sentences = sentence_tokenizer.tokenize(document)

    bow_matrix = CountVectorizer().fit_transform(sentences)
    normalized = TfidfTransformer().fit_transform(bow_matrix)

    similarity_graph = normalized * normalized.T

    nx_graph = nx.from_scipy_sparse_matrix(similarity_graph)
    scores = nx.pagerank(nx_graph)
    return sorted(((scores[i],s) for i,s in enumerate(sentences)), reverse=True)

fp = open("QC")    
txt = fp.read()
sents = textrank(txt.encode('utf-8'))
print sents
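As an aside (not part of the answer above): changing the interpreter-wide default encoding is often discouraged, and an alternative is to decode the file once, at read time, with `io.open`, which accepts an explicit `encoding` on both Python 2 and 3. A sketch, where the temporary file is a hypothetical stand-in for the QC file:

```python
import io
import os
import tempfile

# Hypothetical stand-in for the "QC" file, containing a UTF-8
# curly apostrophe like text copied from a web page would.
fd, path = tempfile.mkstemp()
with os.fdopen(fd, "wb") as f:
    f.write("A qubit isn\u2019t a classical bit.".encode("utf-8"))

# io.open decodes with an explicit codec as the file is read, so the
# rest of the pipeline only ever sees text, never raw bytes.
with io.open(path, encoding="utf-8") as fp:
    txt = fp.read()

print(txt)
os.remove(path)
```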

Comments:

Thank you very much. Once I get the sentences, I print them with `for s in sents: st = str(s[1]); print st`. When I print `sents` directly it is full of unicode-type items, but when I convert them to strings those go away. Why is that?
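Regarding the comment above: printing a list shows each element's repr (the quoted, `u''`-prefixed form in Python 2), while printing an element by itself converts it with str, so the prefix and quotes disappear. A Python 3 analogue of the same repr-vs-str distinction:

```python
s = "It\u2019s a qubit"

print(repr(s))  # quoted form, as seen when printing a whole list of strings
print(str(s))   # the characters themselves, no quotes
```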
