Programmatically install NLTK corpora / models, i.e. without the GUI downloader?
【Posted】: 2011-08-16 04:11:15
【Question】: My project uses NLTK. How can I list the project's corpus and model requirements so that they can be installed automatically? I don't want to click through the nltk.download() GUI and install the packages one by one.
Also, is there any way to freeze that same list of requirements (like pip freeze)?
【Answer 1】: The NLTK site does list a command-line interface for downloading packages and collections, at the bottom of this page:
http://www.nltk.org/data
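The command-line usage varies with the Python version you are using. On my Python 2.6 install I noticed that I was missing the "spanish_grammars" model, and this worked fine: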
python -m nltk.downloader spanish_grammars
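You mention listing the project's corpus and model requirements. I'm not sure of a way to do that automatically, but I figured I would at least share this.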
【Answer 2】: To install all NLTK corpora and models:
python -m nltk.downloader all
Alternatively, on Linux, you can use:
sudo python -m nltk.downloader -d /usr/local/share/nltk_data all
If you only want the most popular corpora and models, replace all with popular.
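For a build or deployment script, the same command line can be assembled from Python. The following is only a minimal sketch: the helper names build_downloader_command and run_downloader are hypothetical, and it assumes that the downloader accepts several package identifiers in a single invocation, in addition to the -d option shown above.

```python
import subprocess
import sys


def build_downloader_command(package_ids, download_dir=None):
    """Build the argument list for `python -m nltk.downloader`.

    Hypothetical helper: it only assembles the command, it does not run it.
    """
    # sys.executable makes sure the same interpreter (and virtualenv) is used.
    command = [sys.executable, "-m", "nltk.downloader"]
    if download_dir is not None:
        # -d selects the target directory, as in the Linux example above.
        command += ["-d", download_dir]
    command += list(package_ids)
    return command


def run_downloader(package_ids, download_dir=None):
    """Run the downloader as a subprocess and return its exit status.

    Whether a failed download yields a non-zero status is not verified here.
    """
    return subprocess.run(build_downloader_command(package_ids, download_dir)).returncode
```

Calling run_downloader(["popular"]) would then correspond to running python -m nltk.downloader popular by hand.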
You can also browse the corpora and models from the command line:
mlee@server:/scratch/jjylee/tests$ sudo python -m nltk.downloader
[sudo] password for jjylee:
NLTK Downloader
---------------------------------------------------------------------------
d) Download l) List u) Update c) Config h) Help q) Quit
---------------------------------------------------------------------------
Downloader> d
Download which package (l=list; x=cancel)?
Identifier> l
Packages:
[ ] averaged_perceptron_tagger_ru Averaged Perceptron Tagger (Russian)
[ ] basque_grammars..... Grammars for Basque
[ ] bllip_wsj_no_aux.... BLLIP Parser: WSJ Model
[ ] book_grammars....... Grammars from NLTK Book
[ ] cess_esp............ CESS-ESP Treebank
[ ] chat80.............. Chat-80 Data Files
[ ] city_database....... City Database
[ ] cmudict............. The Carnegie Mellon Pronouncing Dictionary (0.6)
[ ] comparative_sentences Comparative Sentence Dataset
[ ] comtrans............ ComTrans Corpus Sample
[ ] conll2000........... CONLL 2000 Chunking Corpus
[ ] conll2002........... CONLL 2002 Named Entity Recognition Corpus
[ ] conll2007........... Dependency Treebanks from CoNLL 2007 (Catalan
and Basque Subset)
[ ] crubadan............ Crubadan Corpus
[ ] dependency_treebank. Dependency Parsed Treebank
[ ] europarl_raw........ Sample European Parliament Proceedings Parallel
Corpus
[ ] floresta............ Portuguese Treebank
[ ] framenet_v15........ FrameNet 1.5
Hit Enter to continue:
[ ] framenet_v17........ FrameNet 1.7
[ ] gazetteers.......... Gazeteer Lists
[ ] genesis............. Genesis Corpus
[ ] gutenberg........... Project Gutenberg Selections
[ ] hmm_treebank_pos_tagger Treebank Part of Speech Tagger (HMM)
[ ] ieer................ NIST IE-ER DATA SAMPLE
[ ] inaugural........... C-Span Inaugural Address Corpus
[ ] indian.............. Indian Language POS-Tagged Corpus
[ ] jeita............... JEITA Public Morphologically Tagged Corpus (in
ChaSen format)
[ ] kimmo............... PC-KIMMO Data Files
[ ] knbc................ KNB Corpus (Annotated blog corpus)
[ ] large_grammars...... Large context-free and feature-based grammars
for parser comparison
[ ] lin_thesaurus....... Lin's Dependency Thesaurus
[ ] mac_morpho.......... MAC-MORPHO: Brazilian Portuguese news text with
part-of-speech tags
[ ] machado............. Machado de Assis -- Obra Completa
[ ] masc_tagged......... MASC Tagged Corpus
[ ] maxent_ne_chunker... ACE Named Entity Chunker (Maximum entropy)
[ ] moses_sample........ Moses Sample Models
Hit Enter to continue: x
Download which package (l=list; x=cancel)?
Identifier> conll2002
Downloading package conll2002 to
/afs/mit.edu/u/m/mlee/nltk_data...
Unzipping corpora/conll2002.zip.
---------------------------------------------------------------------------
d) Download l) List u) Update c) Config h) Help q) Quit
---------------------------------------------------------------------------
Downloader>
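【Answer 3】: In addition to the command-line option already mentioned, you can install NLTK data programmatically from a Python script by passing an argument to the download() function.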
See the help(nltk.download) text, specifically:
Individual packages can be downloaded by calling the ``download()``
function with a single argument, giving the package identifier for the
package that should be downloaded:

    >>> download('treebank') # doctest: +SKIP
    [nltk_data] Downloading package 'treebank'...
    [nltk_data] Unzipping corpora/treebank.zip.
I can confirm that this works for downloading one package at a time, or when passed a list or tuple.
>>> import nltk
>>> nltk.download('wordnet')
[nltk_data] Downloading package 'wordnet' to
[nltk_data] C:\Users\_my-username_\AppData\Roaming\nltk_data...
[nltk_data] Unzipping corpora\wordnet.zip.
True
You can also try to download a package that has already been downloaded without any problems:
>>> nltk.download('wordnet')
[nltk_data] Downloading package 'wordnet' to
[nltk_data] C:\Users\_my-username_\AppData\Roaming\nltk_data...
[nltk_data] Package wordnet is already up-to-date!
True
Also, the function appears to return a boolean that you can use to check whether the download succeeded:
>>> nltk.download('not-a-real-name')
[nltk_data] Error loading not-a-real-name: Package 'not-a-real-name'
[nltk_data] not found in index
False
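Building on that return value, the requirements list asked about in the question could be kept in a plain text file and installed in a loop. This is a minimal sketch rather than an NLTK feature: the file format (one package identifier per line, # for comments) and the helper names parse_requirements and install_packages are assumptions made for illustration.

```python
def parse_requirements(text):
    """Return the package identifiers listed in `text`, one per line.

    Blank lines and anything after a '#' are ignored (an assumed format).
    """
    package_ids = []
    for line in text.splitlines():
        line = line.split("#", 1)[0].strip()
        if line:
            package_ids.append(line)
    return package_ids


def install_packages(package_ids, download=None):
    """Download each package and return the identifiers that failed.

    `download` defaults to nltk.download; it can be replaced for testing.
    """
    if download is None:
        import nltk  # imported lazily so that parsing alone does not need NLTK
        download = nltk.download
    # nltk.download returned False for unknown packages in the session above.
    return [package_id for package_id in package_ids if not download(package_id)]
```

With a hypothetical file nltk_requirements.txt, the call would be install_packages(parse_requirements(open("nltk_requirements.txt").read())); an empty result means every download reported success.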
【Answer 4】: I have managed to install the corpora and models in a custom directory using the following code:
import nltk
nltk.download(info_or_id="popular", download_dir="/path/to/dir")
nltk.data.path.append("/path/to/dir")
This installs the "popular" collection of corpora/models in /path/to/dir and tells NLTK where to find it (data.path.append).
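You cannot "freeze" the data in a requirements file, but you can add this code to your __init__, along with code that checks whether the files are already there.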
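The "check whether the files are already there" step could look roughly like the following. It is a sketch under stated assumptions: ensure_resources is a hypothetical helper, the caller must supply the resource path for each package (for example "corpora/wordnet"), and it relies on nltk.data.find raising LookupError when a resource is missing.

```python
def ensure_resources(resources, download_dir=None, find=None, download=None):
    """Download only the packages whose resource cannot be found.

    `resources` maps a package identifier to its resource path, e.g.
    {"wordnet": "corpora/wordnet"}. Returns the identifiers that were
    downloaded. `find` and `download` default to nltk.data.find and
    nltk.download; they can be replaced for testing.
    """
    if find is None or download is None:
        import nltk
        # Make sure NLTK also searches the custom directory, as in the answer.
        if download_dir is not None and download_dir not in nltk.data.path:
            nltk.data.path.append(download_dir)
        find = find or nltk.data.find
        download = download or nltk.download

    newly_downloaded = []
    for package_id, resource_path in resources.items():
        try:
            find(resource_path)  # raises LookupError if the data is missing
        except LookupError:
            if not download(package_id, download_dir=download_dir):
                raise RuntimeError(f"could not download NLTK package {package_id!r}")
            newly_downloaded.append(package_id)
    return newly_downloaded
```

Called from __init__ as ensure_resources({"wordnet": "corpora/wordnet"}, download_dir="/path/to/dir"), it would skip the download on every run after the first.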