Python re.split() vs nltk word_tokenize and sent_tokenize

Posted 2023-02-23


Asked: 2016-05-22 14:36:29

Question:

I was going through this question.

I am just wondering whether NLTK would be faster than regex for word/sentence tokenization.

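Comments:

... What's stopping you from trying it? Run an example and time it with timeit.

I'm still new to Python, and especially to nltk. I just noticed that re.split() and s.split() were faster when I switched away from nltk. I used to use this: sentence = sent_tokenize(txt), and now this: sentence = re.split(r'(?

Could it be that wordnet has to be loaded at run time, and that is what makes nltk slow?

Answer 1: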

The default nltk.word_tokenize() uses the Treebank tokenizer, which emulates the tokenizer from the Penn Treebank.

Note that str.split() does not produce tokens in the linguistic sense, e.g.:

>>> sent = "This is a foo, bar sentence."
>>> sent.split()
['This', 'is', 'a', 'foo,', 'bar', 'sentence.']
>>> from nltk import word_tokenize
>>> word_tokenize(sent)
['This', 'is', 'a', 'foo', ',', 'bar', 'sentence', '.']

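It is usually used to split a string on a specified delimiter. For example, in a tab-separated file you can use str.split('\t'), and when your text file has one sentence per line you can split the string on the newline character \n.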
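To make the delimiter case concrete, here is a minimal sketch. The sample row, the two-line document, and the `TOKEN_RE` pattern are assumptions made up for illustration; in particular, the regex is only a rough stand-in and does not reproduce NLTK's Treebank rules.

```python
import re

# A tab-separated record: str.split('\t') recovers the fields.
row = "id42\tThis is a foo, bar sentence.\ten"
fields = row.split("\t")

# One sentence per line: splitting on '\n' recovers the sentences.
doc = "First sentence.\nSecond sentence."
lines = doc.split("\n")

# A rough regex word tokenizer (hypothetical, not NLTK's rules):
# runs of word characters, or single non-space punctuation marks.
TOKEN_RE = re.compile(r"\w+|[^\w\s]")
tokens = TOKEN_RE.findall(fields[1])

print(fields)
print(lines)
print(tokens)
```

For this one sentence the regex happens to yield the same tokens as word_tokenize() above, but it would split a contraction such as "don't" into "don", "'", "t", which is not what the Treebank tokenizer does.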

Let's do some benchmarking in Python 3:

import time
import urllib.request

from nltk import word_tokenize

# Download the test corpus once and decode it to text.
url = 'https://raw.githubusercontent.com/Simdiva/DSL-Task/master/data/DSLCC-v2.0/test/test.txt'
response = urllib.request.urlopen(url)
data = response.read().decode('utf8')

# Time 10 passes of whitespace splitting over every line.
for _ in range(10):
    start = time.time()
    for line in data.split('\n'):
        line.split()
    print('str.split():\t', time.time() - start)

# Time 10 passes of word_tokenize() over every line.
for _ in range(10):
    start = time.time()
    for line in data.split('\n'):
        word_tokenize(line)
    print('word_tokenize():\t', time.time() - start)

[out]:

str.split():     0.05451083183288574
str.split():     0.054320573806762695
str.split():     0.05368804931640625
str.split():     0.05416440963745117
str.split():     0.05299568176269531
str.split():     0.05304527282714844
str.split():     0.05356955528259277
str.split():     0.05473494529724121
str.split():     0.053118228912353516
str.split():     0.05236077308654785
word_tokenize():     4.056122779846191
word_tokenize():     4.052812337875366
word_tokenize():     4.042144775390625
word_tokenize():     4.101543664932251
word_tokenize():     4.213029146194458
word_tokenize():     4.411528587341309
word_tokenize():     4.162556886672974
word_tokenize():     4.225975036621094
word_tokenize():     4.22914719581604
word_tokenize():     4.203172445297241
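The same kind of measurement can also be taken with the standard-library timeit module, which handles the repetition for you. The sketch below is an assumption-laden variant: it uses a small synthetic corpus instead of the downloaded file so that it runs offline, it compares str.split() against a compiled re.split() rather than NLTK, and the helper names are made up for illustration.

```python
import re
import timeit

# Synthetic stand-in for the downloaded corpus (an assumption, so the sketch runs offline).
data = "\n".join(["This is a foo, bar sentence."] * 2000)

WHITESPACE_RE = re.compile(r"\s+")

def split_with_str(text):
    # Whitespace splitting with the built-in string method.
    return [line.split() for line in text.split("\n")]

def split_with_re(text):
    # The same splitting done through a precompiled regex.
    return [WHITESPACE_RE.split(line) for line in text.split("\n")]

# repeat() returns one total per run; the minimum is the least noisy estimate.
str_time = min(timeit.repeat(lambda: split_with_str(data), number=5, repeat=3))
re_time = min(timeit.repeat(lambda: split_with_re(data), number=5, repeat=3))

print(f"str.split(): {str_time:.4f}s")
print(f"re.split():  {re_time:.4f}s")
```

The absolute numbers depend on the machine and the corpus, so they are only meaningful relative to each other.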

If we try another tokenizer available in bleeding-edge NLTK, which comes from https://github.com/jonsafari/tok-tok/blob/master/tok-tok.pl:

import time
import urllib.request

from nltk.tokenize import ToktokTokenizer

# Download the test corpus once and decode it to text.
url = 'https://raw.githubusercontent.com/Simdiva/DSL-Task/master/data/DSLCC-v2.0/test/test.txt'
response = urllib.request.urlopen(url)
data = response.read().decode('utf8')

# Bind the tokenize method once so the loop does not re-create the tokenizer.
toktok = ToktokTokenizer().tokenize

# Time 10 passes of the Toktok tokenizer over every line.
for _ in range(10):
    start = time.time()
    for line in data.split('\n'):
        toktok(line)
    print('toktok:\t', time.time() - start)

[out]:

toktok:  1.5902607440948486
toktok:  1.5347232818603516
toktok:  1.4993178844451904
toktok:  1.5635688304901123
toktok:  1.5779635906219482
toktok:  1.8177132606506348
toktok:  1.4538452625274658
toktok:  1.5094449520111084
toktok:  1.4871931076049805
toktok:  1.4584410190582275

(Note: the text file comes from https://github.com/Simdiva/DSL-Task)

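If we look at the native Perl implementation, the Python and Perl timings for ToktokTokenizer are comparable. In the Python implementation the regexes are pre-compiled, while in the Perl script they are not, but the proof is still in the pudding: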

alvas@ubi:~$ wget https://raw.githubusercontent.com/jonsafari/tok-tok/master/tok-tok.pl
--2016-02-11 20:36:36--  https://raw.githubusercontent.com/jonsafari/tok-tok/master/tok-tok.pl
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 185.31.17.133
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|185.31.17.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 2690 (2.6K) [text/plain]
Saving to: ‘tok-tok.pl’

100%[===============================================================================================================================>] 2,690       --.-K/s   in 0s      

2016-02-11 20:36:36 (259 MB/s) - ‘tok-tok.pl’ saved [2690/2690]

alvas@ubi:~$ wget https://raw.githubusercontent.com/Simdiva/DSL-Task/master/data/DSLCC-v2.0/test/test.txt
--2016-02-11 20:36:38--  https://raw.githubusercontent.com/Simdiva/DSL-Task/master/data/DSLCC-v2.0/test/test.txt
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 185.31.17.133
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|185.31.17.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 3483550 (3.3M) [text/plain]
Saving to: ‘test.txt’

100%[===============================================================================================================================>] 3,483,550    363KB/s   in 7.4s   

2016-02-11 20:36:46 (459 KB/s) - ‘test.txt’ saved [3483550/3483550]

alvas@ubi:~$ time perl tok-tok.pl < test.txt > /tmp/null

real    0m1.703s
user    0m1.693s
sys 0m0.008s
alvas@ubi:~$ time perl tok-tok.pl < test.txt > /tmp/null

real    0m1.715s
user    0m1.704s
sys 0m0.008s
alvas@ubi:~$ time perl tok-tok.pl < test.txt > /tmp/null

real    0m1.700s
user    0m1.686s
sys 0m0.012s
alvas@ubi:~$ time perl tok-tok.pl < test.txt > /tmp/null

real    0m1.727s
user    0m1.700s
sys 0m0.024s
alvas@ubi:~$ time perl tok-tok.pl < test.txt > /tmp/null

real    0m1.734s
user    0m1.724s
sys 0m0.008s

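(Note: when timing tok-tok.pl, we had to pipe the output into a file, so the timings here include the time the machine takes to write the output to a file, whereas the nltk.tokenize.ToktokTokenizer timings do not include any time for writing to a file.)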
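To illustrate what "pre-compiled regexes" means on the Python side, here is a minimal sketch of a rule-based tokenizer that compiles its patterns once and reuses them for every line. The `RULES` list and the `tokenize` function are hypothetical and deliberately tiny; they are not the actual tok-tok rules.

```python
import re

# Hypothetical rules for illustration only; these are NOT the actual tok-tok rules.
# Each pattern is compiled once at import time and reused for every line.
RULES = [
    (re.compile(r'([,;:!?()"])'), r" \1 "),  # pad common punctuation with spaces
    (re.compile(r"\.\s*$"), " ."),           # split off a line-final period
]

def tokenize(line):
    # Apply every substitution in order, then split on whitespace.
    for pattern, replacement in RULES:
        line = pattern.sub(replacement, line)
    return line.split()

print(tokenize("This is a foo, bar sentence."))
```

Keeping the compiled patterns in a module-level list means the per-line cost is only the substitutions themselves, which is the relevant cost when tokenizing a multi-megabyte file line by line.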

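As for sent_tokenize(), it is a little different, and comparing speed benchmarks without considering accuracy is a bit quirky.

Consider this:

If a regex splits a text file/paragraph into 1 sentence, then the speed is almost instantaneous, i.e. 0 work is done. But that would be a horrible sentence tokenizer...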

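If the sentences in a file are already separated by \n, then it is simply a matter of comparing str.split('\n') with re.split('\n'), and nltk would have nothing to do with the sentence tokenization ;P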
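Both degenerate cases are easy to reproduce. In this small sketch the sample text and the `@@@` pattern are made up for illustration:

```python
import re

text = "First sentence. Second one!\nThird sentence?"

# A pattern that never matches in this text does no real work:
# the whole text comes back as a single "sentence".
one_chunk = re.split(r"@@@", text)

# With one sentence per line, both calls return the same list,
# so neither is doing any actual sentence tokenization.
by_str = text.split("\n")
by_re = re.split("\n", text)

print(len(one_chunk), by_str == by_re)
```

Either variant would look extremely fast in a benchmark, which is why speed alone says little about a sentence tokenizer.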

For information on how sent_tokenize() works in NLTK, see:

training data format for nltk punkt
Use of PunktSentenceTokenizer in NLTK

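So to effectively compare sent_tokenize() against other regex-based methods (not str.split('\n')), one would also have to evaluate accuracy, and have a dataset with sentences humanly evaluated in a tokenized format.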
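Such an evaluation could look roughly like the sketch below. Everything in it is an assumption for illustration: the naive regex rule, the order-insensitive exact-match scoring, and the three-sentence gold standard are made up, not taken from any real dataset.

```python
import re

def regex_sent_split(text):
    # Naive rule (hypothetical): split on whitespace that follows '.', '!' or '?'.
    return [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]

def precision_recall(predicted, gold):
    # Order-insensitive exact-match scoring against human-segmented sentences.
    pred_set, gold_set = set(predicted), set(gold)
    true_pos = len(pred_set & gold_set)
    precision = true_pos / len(pred_set) if pred_set else 0.0
    recall = true_pos / len(gold_set) if gold_set else 0.0
    return precision, recall

gold = ["Dr. Smith arrived.", "He was late!", "Why?"]
text = "Dr. Smith arrived. He was late! Why?"

predicted = regex_sent_split(text)
precision, recall = precision_recall(predicted, gold)
print(predicted)
print(precision, recall)
```

Here the naive rule wrongly breaks after the abbreviation "Dr.", so it produces four pieces of which only two match the gold sentences. A speed benchmark would never reveal that.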

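Consider this task: https://www.hackerrank.com/challenges/from-paragraphs-to-sentences

Given the text:

In the third category he included those Brothers (the majority) who saw nothing in Freemasonry but the external forms and ceremonies, and prized the strict performance of these forms without troubling about their purport or significance. Such were Willarski and even the Grand Master of the principal lodge. Finally, to the fourth category also a great many Brothers belonged, particularly those who had lately joined. These according to Pierre's observations were men who had no belief in anything, nor desire for anything, but joined the Freemasons merely to associate with the wealthy young Brothers who were influential through their connections or rank, and of whom there were very many in the lodge.Pierre began to feel dissatisfied with what he was doing. Freemasonry, at any rate as he saw it here, sometimes seemed to him based merely on externals. He did not think of doubting Freemasonry itself, but suspected that Russian Masonry had taken a wrong path and deviated from its original principles. And so toward the end of the year he went abroad to be initiated into the higher secrets of the order.What is to be done in these circumstances? To favor revolutions, overthrow everything, repel force by force?No! We are very far from that. Every violent reform deserves censure, for it quite fails to remedy evil while men remain what they are, and also because wisdom needs no violence. "But what is there in running across it like that?" said Ilagin's groom. "Once she had missed it and turned it away, any mongrel could take it," Ilagin was saying at the same time, breathless from his gallop and his excitement.

We want to get this: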

In the third category he included those Brothers (the majority) who saw nothing in Freemasonry but the external forms and ceremonies, and prized the strict performance of these forms without troubling about their purport or significance.
Such were Willarski and even the Grand Master of the principal lodge.
Finally, to the fourth category also a great many Brothers belonged, particularly those who had lately joined.
These according to Pierre's observations were men who had no belief in anything, nor desire for anything, but joined the Freemasons merely to associate with the wealthy young Brothers who were influential through their connections or rank, and of whom there were very many in the lodge.
Pierre began to feel dissatisfied with what he was doing.
Freemasonry, at any rate as he saw it here, sometimes seemed to him based merely on externals.
He did not think of doubting Freemasonry itself, but suspected that Russian Masonry had taken a wrong path and deviated from its original principles.
And so toward the end of the year he went abroad to be initiated into the higher secrets of the order.
What is to be done in these circumstances?
To favor revolutions, overthrow everything, repel force by force?
No!
We are very far from that.
Every violent reform deserves censure, for it quite fails to remedy evil while men remain what they are, and also because wisdom needs no violence.
"But what is there in running across it like that?" said Ilagin's groom.
"Once she had missed it and turned it away, any mongrel could take it," Ilagin was saying at the same time, breathless from his gallop and his excitement.

So simply doing str.split('\n') will give you nothing. Even without considering the order of the sentences, you will get 0 positive results:

>>> text = """In the third category he included those Brothers (the majority) who saw nothing in Freemasonry but the external forms and ceremonies, and prized the strict performance of these forms without troubling about their purport or significance. Such were Willarski and even the Grand Master of the principal lodge. Finally, to the fourth category also a great many Brothers belonged, particularly those who had lately joined. These according to Pierre's observations were men who had no belief in anything, nor desire for anything, but joined the Freemasons merely to associate with the wealthy young Brothers who were influential through their connections or rank, and of whom there were very many in the lodge.Pierre began to feel dissatisfied with what he was doing. Freemasonry, at any rate as he saw it here, sometimes seemed to him based merely on externals. He did not think of doubting Freemasonry itself, but suspected that Russian Masonry had taken a wrong path and deviated from its original principles. And so toward the end of the year he went abroad to be initiated into the higher secrets of the order.What is to be done in these circumstances? To favor revolutions, overthrow everything, repel force by force?No! We are very far from that. Every violent reform deserves censure, for it quite fails to remedy evil while men remain what they are, and also because wisdom needs no violence. "But what is there in running across it like that?" said Ilagin's groom. "Once she had missed it and turned it away, any mongrel could take it," Ilagin was saying at the same time, breathless from his gallop and his excitement. """
>>> answer = """In the third category he included those Brothers (the majority) who saw nothing in Freemasonry but the external forms and ceremonies, and prized the strict performance of these forms without troubling about their purport or significance.
... Such were Willarski and even the Grand Master of the principal lodge.
... Finally, to the fourth category also a great many Brothers belonged, particularly those who had lately joined.
... These according to Pierre's observations were men who had no belief in anything, nor desire for anything, but joined the Freemasons merely to associate with the wealthy young Brothers who were influential through their connections or rank, and of whom there were very many in the lodge.
... Pierre began to feel dissatisfied with what he was doing.
... Freemasonry, at any rate as he saw it here, sometimes seemed to him based merely on externals.
... He did not think of doubting Freemasonry itself, but suspected that Russian Masonry had taken a wrong path and deviated from its original principles.
... And so toward the end of the year he went abroad to be initiated into the higher secrets of the order.
... What is to be done in these circumstances?
... To favor revolutions, overthrow everything, repel force by force?
... No!
... We are very far from that.
... Every violent reform deserves censure, for it quite fails to remedy evil while men remain what they are, and also because wisdom needs no violence.
... "But what is there in running across it like that?" said Ilagin's groom.
... "Once she had missed it and turned it away, any mongrel could take it," Ilagin was saying at the same time, breathless from his gallop and his excitement."""
>>> 
>>> output = text.split('\n')
>>> sum(1 for sent in text.split('\n') if sent in answer)
0
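For contrast, a regex rule can recover some of these sentences but not all of them. The sketch below uses an abridged excerpt of the paragraph (keeping its missing space after "force?") and a hypothetical boundary rule; it is only meant to show how such a rule would be scored, not to suggest a good sentence tokenizer.

```python
import re

# Abridged excerpt of the paragraph above, including its missing space after "force?".
excerpt = (
    'What is to be done in these circumstances? To favor revolutions, '
    'overthrow everything, repel force by force?No! We are very far from that. '
    '"But what is there in running across it like that?" said Ilagin\'s groom.'
)
gold = [
    "What is to be done in these circumstances?",
    "To favor revolutions, overthrow everything, repel force by force?",
    "No!",
    "We are very far from that.",
    '"But what is there in running across it like that?" said Ilagin\'s groom.',
]

# Hypothetical rule: break after '.', '!' or '?' when an uppercase letter follows,
# with or without whitespace in between (empty-match splits need Python 3.7+).
SENT_BOUNDARY = re.compile(r"(?<=[.!?])\s*(?=[A-Z])")

predicted = SENT_BOUNDARY.split(excerpt)
hits = sum(1 for sent in predicted if sent in gold)
print(hits, "of", len(gold))
```

The rule handles the missing space but fails on the sentence that opens with a quotation mark, so the last two gold sentences come back glued together. That kind of error is what an accuracy evaluation is meant to catch.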

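Comments:

Great answer. I like that it includes some simple benchmarks. I think the question was referring to sentence segmentation rather than word tokenization.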
