Can't find ghostscript in NLTK?
【Posted】2016-08-17 23:30:00 【Question】I was playing around with NLTK when I tried to use the chunk module:
import nltk as nk
Sentence = "Betty Botter bought some butter, but she said the butter is bitter, I f I put it in my batter, it will make my batter bitter."
tokens = nk.word_tokenize(Sentence)
tagged = nk.pos_tag(tokens)
entities = nk.chunk.ne_chunk(tagged)
The code runs fine, but when I type
>> entities
I get the following error message:
Out[2]: Tree('S', [Tree('PERSON', [('Betty', 'NNP')]), Tree('PERSON', [('Botter', 'NNP')]), ('bought', 'VBD'), ('some', 'DT'), ('butter', 'NN'), (',', ','), ('but', 'CC'), ('she', 'PRP'), ('said', 'VBD'), ('the', 'DT'), ('butter', 'NN'), ('is', 'VBZ'), ('bitter', 'JJ'), (',', ','), ('I', 'PRP'), ('f', 'VBP'), ('I', 'PRP'), ('put', 'VBD'), ('it', 'PRP'), ('in', 'IN'), ('my', 'PRP$'), ('batter', 'NN'), (',', ','), ('it', 'PRP'), ('will', 'MD'), ('make', 'VB'), ('my', 'PRP$'), ('batter', 'NN'), ('bitter', 'NN'), ('.', '.')])
Traceback (most recent call last):
File "C:\Users\QP19\AppData\Local\Continuum\Anaconda2\lib\site-packages\IPython\core\formatters.py", line 343, in __call__
return method()
File "C:\Users\QP19\AppData\Local\Continuum\Anaconda2\lib\site-packages\nltk\tree.py", line 726, in _repr_png_
subprocess.call([find_binary('gs', binary_names=['gswin32c.exe', 'gswin64c.exe'], env_vars=['PATH'], verbose=False)] +
File "C:\Users\QP19\AppData\Local\Continuum\Anaconda2\lib\site-packages\nltk\internals.py", line 602, in find_binary
binary_names, url, verbose))
File "C:\Users\QP19\AppData\Local\Continuum\Anaconda2\lib\site-packages\nltk\internals.py", line 596, in find_binary_iter
url, verbose):
File "C:\Users\QP19\AppData\Local\Continuum\Anaconda2\lib\site-packages\nltk\internals.py", line 567, in find_file_iter
raise LookupError('\n\n%s\n%s\n%s' % (div, msg, div))
LookupError:
===========================================================================
NLTK was unable to find the gs file!
Use software specific configuration paramaters or set the PATH environment variable.
===========================================================================
According to this post, the solution is to install Ghostscript, since the chunker is trying to use it to display a parse tree and is looking for one of 3 binaries:
file_names=['gs', 'gswin32c.exe', 'gswin64c.exe']
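But even though I installed ghostscript and can now find the binary with a Windows search, I still get the same error.
What do I need to fix or update?
Additional path information:
import os; print(os.environ['PATH'])
returns: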
C:\Users\QP19\AppData\Local\Continuum\Anaconda2\Library\bin;C:\Users\QP19\AppData\Local\Continuum\Anaconda2\Library\bin;C:\Users\QP19\AppData\Local\Continuum\Anaconda2;C:\Users\QP19\AppData\Local\Continuum\Anaconda2\Scripts;C:\Users\QP19\AppData\Local\Continuum\Anaconda2\Library\bin;C:\Users\QP19\AppData\Local\Continuum\Anaconda2\Library\bin;C:\Program Files (x86)\Parallels\Parallels Tools\Applications;C:\WINDOWS\system32;C:\WINDOWS;C:\WINDOWS\System32\Wbem;C:\WINDOWS\System32\WindowsPowerShell\v1.0\;C:\WINDOWS\System32\WindowsPowerShell\v1.0\;C:\Oracle\RPAS14.1\RpasServer\bin;C:\Oracle\RPAS14.1\RpasServer\applib;C:\Program Files (x86)\Java\jre7\bin;C:\Program Files (x86)\Java\jre7\bin\client;C:\Program Files (x86)\Java\jre7\lib;C:\Program Files (x86)\Java\jre7\jre\bin\client;C:\Users\QP19\AppData\Local\Continuum\Anaconda2;C:\Users\QP19\AppData\Local\Continuum\Anaconda2\Scripts;C:\Users\QP19\AppData\Local\Continuum\Anaconda2\Library\bin;
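【Comments】:
You may be missing the models. Try running import nltk; nltk.download('all', halt_on_error=False), then rerun your script.
@alvas That didn't fix it.
Where did you install ghostscript? Where is the ghostscript .exe file located?
OK, now this is getting a little interesting =)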
@alvas C:\Program Files\gs\gs9.19\bin
【Answer 1】:
In short:
Instead of >>> entities, do the following:
>>> print(repr(entities))
Or:
>>> import os
>>> from nltk import word_tokenize, pos_tag, ne_chunk
>>> path_to_gs = r"C:\Program Files\gs\gs9.19\bin"  # raw string, so "\b" is not read as a backspace
>>> os.environ['PATH'] += os.pathsep + path_to_gs
>>> sent = "Betty Botter bought some butter, but she said the butter is bitter, I f I put it in my batter, it will make my batter bitter."
>>> entities = ne_chunk(pos_tag(word_tokenize(sent)))
>>> entities
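In long:
The problem lies in trying to display the output of ne_chunk, which is an nltk.tree.Tree object. Displaying it fires ghostscript to produce both the string and the drawing representation of the NE-tagged sentence, so ghostscript is required before the widget can visualize the tree.
Let's walk through this step by step.
First, when you use ne_chunk, you can import it directly at the top level: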
from nltk import ne_chunk
It is also advisable to import the names you need into your namespace, i.e.:
from nltk import word_tokenize, pos_tag, ne_chunk
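When you use ne_chunk, it comes from https://github.com/nltk/nltk/blob/develop/nltk/chunk/__init__.py
It is not obvious what kind of object the pickle loads, but after some inspection we find that only one built-in NE chunker is not rule-based, and since the name of the pickle binary says maxent, we can assume it is a statistical chunker. So it most probably comes from the NEChunkParser object in https://github.com/nltk/nltk/blob/develop/nltk/chunk/named_entity.py. That file also has ACE data API functions, as the name of the pickle binary suggests.
Now, whenever you use the ne_chunk function, it is actually calling the NEChunkParser.parse() function, which returns an nltk.tree.Tree object: https://github.com/nltk/nltk/blob/develop/nltk/chunk/named_entity.py#L118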
class NEChunkParser(ChunkParserI):
    """
    Expected input: list of pos-tagged words
    """
    def __init__(self, train):
        self._train(train)

    def parse(self, tokens):
        """
        Each token should be a pos-tagged word
        """
        tagged = self._tagger.tag(tokens)
        tree = self._tagged_to_parse(tagged)
        return tree

    def _train(self, corpus):
        # Convert to tagged sequence
        corpus = [self._parse_to_tagged(s) for s in corpus]
        self._tagger = NEChunkParserTagger(train=corpus)

    def _tagged_to_parse(self, tagged_tokens):
        """
        Convert a list of tagged tokens to a chunk-parse tree.
        """
        sent = Tree('S', [])
        for (tok,tag) in tagged_tokens:
            if tag == 'O':
                sent.append(tok)
            elif tag.startswith('B-'):
                sent.append(Tree(tag[2:], [tok]))
            elif tag.startswith('I-'):
                if (sent and isinstance(sent[-1], Tree) and
                    sent[-1].label() == tag[2:]):
                    sent[-1].append(tok)
                else:
                    sent.append(Tree(tag[2:], [tok]))
        return sent
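The grouping logic in _tagged_to_parse can be tried out without NLTK or its models. The following is a minimal sketch, not NLTK's implementation: Chunk is a hypothetical stand-in for nltk.tree.Tree, the function name bio_to_chunks is made up, tokens are plain strings rather than (word, POS) pairs, and the sample tags are invented for illustration.

```python
from collections import namedtuple

# Hypothetical stand-in for nltk.tree.Tree: a label plus a list of tokens.
Chunk = namedtuple('Chunk', ['label', 'tokens'])


def bio_to_chunks(tagged_tokens):
    """Group (token, BIO-tag) pairs into a flat list of tokens and Chunks."""
    sent = []
    for tok, tag in tagged_tokens:
        if tag == 'O':
            # Outside any entity: keep the bare token.
            sent.append(tok)
        elif tag.startswith('B-'):
            # Beginning of an entity: always open a new chunk.
            sent.append(Chunk(tag[2:], [tok]))
        elif tag.startswith('I-'):
            last = sent[-1] if sent else None
            if isinstance(last, Chunk) and last.label == tag[2:]:
                # Continues the chunk that was just opened.
                last.tokens.append(tok)
            else:
                # Stray I- tag: treat it as the start of a new chunk.
                sent.append(Chunk(tag[2:], [tok]))
    return sent


# Invented tags, only to exercise each branch.
sample = [('Betty', 'B-PERSON'), ('Botter', 'I-PERSON'),
          ('bought', 'O'), ('butter', 'O'), ('Paris', 'I-GPE')]
chunks = bio_to_chunks(sample)
```

With these invented tags, 'Betty' and 'Botter' end up in one PERSON chunk and the stray I-GPE tag opens its own chunk, which mirrors the else branch above.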
If we take a look at the nltk.tree.Tree object, that is where the ghostscript problem appears, when the _repr_png_ function is called: https://github.com/nltk/nltk/blob/develop/nltk/tree.py#L702:
    def _repr_png_(self):
        """
        Draws and outputs in PNG for ipython.
        PNG is used instead of PDF, since it can be displayed in the qt console and
        has wider browser support.
        """
        import os
        import base64
        import subprocess
        import tempfile
        from nltk.draw.tree import tree_to_treesegment
        from nltk.draw.util import CanvasFrame
        from nltk.internals import find_binary
        _canvas_frame = CanvasFrame()
        widget = tree_to_treesegment(_canvas_frame.canvas(), self)
        _canvas_frame.add_widget(widget)
        x, y, w, h = widget.bbox()
        # print_to_file uses scrollregion to set the width and height of the pdf.
        _canvas_frame.canvas()['scrollregion'] = (0, 0, w, h)
        with tempfile.NamedTemporaryFile() as file:
            in_path = '{0:}.ps'.format(file.name)
            out_path = '{0:}.png'.format(file.name)
            _canvas_frame.print_to_file(in_path)
            _canvas_frame.destroy_widget(widget)
            subprocess.call([find_binary('gs', binary_names=['gswin32c.exe', 'gswin64c.exe'], env_vars=['PATH'], verbose=False)] +
                            '-q -dEPSCrop -sDEVICE=png16m -r90 -dTextAlphaBits=4 -dGraphicsAlphaBits=4 -dSAFER -dBATCH -dNOPAUSE -sOutputFile={0:} {1:}'
                            .format(out_path, in_path).split())
            with open(out_path, 'rb') as sr:
                res = sr.read()
            os.remove(in_path)
            os.remove(out_path)
            return base64.b64encode(res).decode()
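But note that it is strange for the Python interpreter to fire _repr_png_ instead of __repr__ when you type >>> entities at the prompt (see Purpose of Python's __repr__). That cannot be how the native CPython interpreter works when it prints the representation of an object, so we take a look at IPython.core.formatters, and we see that it allows _repr_png_ to be fired at https://github.com/ipython/ipython/blob/master/IPython/core/formatters.py#L725: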
class PNGFormatter(BaseFormatter):
    """A PNG formatter.

    To define the callables that compute the PNG representation of your
    objects, define a :meth:`_repr_png_` method or use the :meth:`for_type`
    or :meth:`for_type_by_name` methods to register functions that handle
    this.

    The return value of this formatter should be raw PNG data, *not*
    base64 encoded.
    """
    format_type = Unicode('image/png')

    print_method = ObjectName('_repr_png_')

    _return_type = (bytes, unicode_type)
We also see that when IPython initializes a DisplayFormatter object, it tries to activate all formatters: https://github.com/ipython/ipython/blob/master/IPython/core/formatters.py#L66
    def _formatters_default(self):
        """Activate the default formatters."""
        formatter_classes = [
            PlainTextFormatter,
            HTMLFormatter,
            MarkdownFormatter,
            SVGFormatter,
            PNGFormatter,
            PDFFormatter,
            JPEGFormatter,
            LatexFormatter,
            JSONFormatter,
            JavascriptFormatter
        ]
        d = {}
        for cls in formatter_classes:
            f = cls(parent=self)
            d[f.format_type] = f
        return d
Note that outside of IPython, in the native CPython interpreter, it will only call __repr__ and not _repr_png_:
>>> from nltk import word_tokenize, pos_tag, ne_chunk
>>> sentence = "Betty Botter bought some butter, but she said the butter is bitter, I f I put it in my batter, it will make my batter bitter."
>>> entities = ne_chunk(pos_tag(word_tokenize(sentence)))
>>> entities
Tree('S', [Tree('PERSON', [('Betty', 'NNP')]), Tree('PERSON', [('Botter', 'NNP')]), ('bought', 'VBD'), ('some', 'DT'), ('butter', 'NN'), (',', ','), ('but', 'CC'), ('she', 'PRP'), ('said', 'VBD'), ('the', 'DT'), ('butter', 'NN'), ('is', 'VBZ'), ('bitter', 'JJ'), (',', ','), ('I', 'PRP'), ('f', 'VBP'), ('I', 'PRP'), ('put', 'VBD'), ('it', 'PRP'), ('in', 'IN'), ('my', 'PRP$'), ('batter', 'NN'), (',', ','), ('it', 'PRP'), ('will', 'MD'), ('make', 'VB'), ('my', 'PRP$'), ('batter', 'NN'), ('bitter', 'NN'), ('.', '.')])
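The difference between the two prompts can be imitated without NLTK or IPython. This is only a rough sketch: FakeTree and rich_display are hypothetical names, the LookupError merely mimics the ghostscript failure, and IPython's real machinery in IPython.core.formatters is considerably more involved.

```python
class FakeTree(object):
    """Hypothetical stand-in for nltk.tree.Tree; it records which hooks run."""

    def __init__(self):
        self.calls = []

    def __repr__(self):
        self.calls.append('__repr__')
        return "FakeTree()"

    def _repr_png_(self):
        self.calls.append('_repr_png_')
        raise LookupError("gs not found")  # mimics the ghostscript failure


def rich_display(obj, hooks=('_repr_png_',)):
    """Rough imitation of a rich front end: plain repr plus optional extra hooks."""
    results = {'text/plain': repr(obj)}
    for name in hooks:
        method = getattr(obj, name, None)
        if method is None:
            continue  # the object does not offer this representation
        try:
            results[name] = method()
        except Exception as exc:
            # Report the failure instead of aborting the whole display.
            results[name] = 'failed: {0}'.format(exc)
    return results


tree = FakeTree()
plain = repr(tree)              # all that a plain CPython prompt relies on
plain_calls = list(tree.calls)  # only '__repr__' has run at this point
rich = rich_display(tree)       # a rich front end additionally tries _repr_png_
```

In this sketch the plain path never touches _repr_png_, so a missing ghostscript binary cannot affect it; only the rich path reaches the failing hook.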
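So now the solutions:
Solution 1:
To print the string output of ne_chunk, you can use
>>> print(repr(entities))
IPython should then explicitly call only __repr__, instead of calling all possible formatters as it does for a bare >>> entities.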
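Solution 2:
If you really need to use _repr_png_ to visualize the Tree object, then we need to figure out how to make the ghostscript binary visible through the environment variables that NLTK checks.
In your case, it seems that the default nltk.internals is unable to find the binary. More specifically, we are referring to https://github.com/nltk/nltk/blob/develop/nltk/internals.py#L599
If we go back to https://github.com/nltk/nltk/blob/develop/nltk/tree.py#L726, we see that it is trying to look for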
env_vars=['PATH']
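And when NLTK tries to initialize its environment variables, it is looking at os.environ, see https://github.com/nltk/nltk/blob/develop/nltk/internals.py#L495
Note that find_binary calls find_binary_iter, which calls find_file_iter, which tries to look up the env_vars by fetching them from os.environ.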
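To check by hand whether any of the expected binary names is reachable through a PATH-style string, a lookup along these lines can help. It is a simplified sketch of the idea only, not NLTK's find_file_iter; the function name find_on_path and the throwaway demo directory are made up for illustration.

```python
import os
import shutil
import tempfile


def find_on_path(binary_names, path_value, sep=os.pathsep):
    """Return the first existing file among binary_names in a PATH-style string."""
    for directory in path_value.split(sep):
        if not directory:
            continue  # skip empty entries such as a trailing separator
        for name in binary_names:
            candidate = os.path.join(directory, name)
            if os.path.isfile(candidate):
                return candidate
    return None


names = ['gs', 'gswin32c.exe', 'gswin64c.exe']

# Demo with a throwaway directory holding an empty file named like the binary.
demo_dir = tempfile.mkdtemp()
missing_dir = os.path.join(demo_dir, 'missing')  # never created
open(os.path.join(demo_dir, 'gswin64c.exe'), 'w').close()

found = find_on_path(names, os.pathsep.join([missing_dir, demo_dir]))
not_found = find_on_path(names, missing_dir)

shutil.rmtree(demo_dir)  # clean up the demo directory
```

Calling find_on_path(names, os.environ.get('PATH', '')) on the affected machine would show whether the ghostscript directory is visible to the running Python process, which is what matters here, rather than whether Windows search can find the file.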
So if we add to the path:
>>> import os
>>> from nltk import word_tokenize, pos_tag, ne_chunk
>>> path_to_gs = r"C:\Program Files\gs\gs9.19\bin"  # raw string, so "\b" is not read as a backspace
>>> os.environ['PATH'] += os.pathsep + path_to_gs
Now this should work in IPython:
>>> import os
>>> from nltk import word_tokenize, pos_tag, ne_chunk
>>> path_to_gs = r"C:\Program Files\gs\gs9.19\bin"  # raw string, so "\b" is not read as a backspace
>>> os.environ['PATH'] += os.pathsep + path_to_gs
>>> sent = "Betty Botter bought some butter, but she said the butter is bitter, I f I put it in my batter, it will make my batter bitter."
>>> entities = ne_chunk(pos_tag(word_tokenize(sent)))
>>> entities
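Because += appends the directory again every time the cell is re-run, a small guard can keep PATH from growing. The helper below is a hedged sketch: append_to_path is a made-up name, the comparison is an exact string match (Windows itself treats paths case-insensitively), and the demo runs on a plain dict with an explicit ';' separator so the real environment is left untouched.

```python
import os


def append_to_path(directory, environ, sep=os.pathsep):
    """Append directory to environ['PATH'] unless it is already listed."""
    entries = [p for p in environ.get('PATH', '').split(sep) if p]
    if directory not in entries:
        entries.append(directory)
    environ['PATH'] = sep.join(entries)
    return environ['PATH']


# Install location reported in the comments on the question.
gs_dir = r"C:\Program Files\gs\gs9.19\bin"

# Demo on a plain dict with Windows-style entries and separator.
env = {'PATH': r"C:\WINDOWS\system32;C:\WINDOWS;"}
append_to_path(gs_dir, env, sep=';')
append_to_path(gs_dir, env, sep=';')  # second call changes nothing
```

In a real session the call would be append_to_path(gs_dir, os.environ), which relies on the default os.pathsep of the platform.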
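【Comments】:
Worked after changing the backslashes to forward slashes in the path_to_gs assignment. Thanks
Could you edit your question to add the output of os.environ['PATH']? It will help in the future if others have the same problem =) Thanks!
As @predictorx mentions below, "\b" can get mapped to "\x08". To avoid this, either escape each backslash (replace "C:\Program Files\gs\gs9.19\bin" with "C:\\Program Files\\gs\\gs9.19\\bin") or use forward slashes in the path. More generally, on Windows, add that path to the user's Path environment variable so that these lines are not needed in every script.
【Answer 2】: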
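In my case, when I added the path using the same code as alvas, with single backslashes in a regular string literal, the result was: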
'C:\\Program Files\\gs\\gs9.27\x08in'
which is incorrect, so I changed it to: path_to_gs = 'C:/Program Files/gs/gs9.27/bin' and it works.
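The "\x08" in that result comes from Python's string escapes: in a regular string literal, "\b" is the backspace character. The short sketch below compares the three spellings on a shortened, illustrative fragment of the path, and uses the standard ntpath module (Windows path rules, importable on any platform) to show that the forward-slash form names the same location.

```python
import ntpath  # Windows path handling, available on every platform

plain = "gs9.19\bin"     # "\b" is read as a single backspace character
escaped = "gs9.19\\bin"  # each backslash escaped
raw = r"gs9.19\bin"      # raw string literal, backslashes kept as typed

has_backspace = "\x08" in plain       # True: the path is silently corrupted
same_text = (escaped == raw)          # True: both spell a real backslash
lengths = (len(plain), len(raw))      # the corrupted string is one character shorter

# Forward slashes avoid escapes entirely; normpath shows the backslash form.
normalized = ntpath.normpath("C:/Program Files/gs/gs9.19/bin")
```

Either the raw string or the forward-slash spelling avoids the problem; which one to use is mostly a matter of taste.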
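【Comments】:
【Answer 3】:Download gs.exe from "https://www.ghostscript.com/download/gsdnld.html" and add its path to the Environment Variables.
The path is probably stored under
C:\Program Files\
(on my system it looks like "C:\Program Files\gs\gs9.21\bin")
To add it to the environment variables:
Control Panel -> System and Security -> System -> Advanced system settings -> Environment Variables -> (scroll down in System variables and double-click Path) ->
then add the copied path
(in my case "C:\Program Files\gs\gs9.21\bin")
P.S.: Don't forget to add a semicolon (;) before pasting the path, and don't delete the existing entries and simply put yours in their place; otherwise you may run into trouble and need to restore from a backup :)
【Comments】:
【Answer 4】:Adding to @predictorx's comment, what worked for me was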
import os
path_to_gs = "C:\\Program Files\\gs\\gs9.53.3\\bin"
os.environ['PATH'] += os.pathsep + path_to_gs
【Comments】: