PHP sentence boundary detection [duplicate]

Posted 2023-03-13


【Title】: php sentence boundaries detection [duplicate] 【Posted】: 2011-06-29 06:24:39 【Question】:

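I want to split text into sentences in PHP. I'm currently using a regex, which gives about 95% accuracy, and I'd like to improve on that with a better approach. I've seen NLP tools that do this in Perl, Java, and C, but nothing suitable for PHP. Do you know of such a tool?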

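【Comments on the question】:

What regex are you using? NLP in PHP sounds like it would be painful for you.

"Painful" because it's slower than, say, C? This is the regex I'm using: preg_split("/(?<!\..)([\?\!\.]+)\s(?!.\.)/",$text,-1, PREG_SPLIT_DELIM_CAPTURE); What approach would you recommend?

Would the github.com/bigwhoop/sentence-breaker library work for you?

【Answer 1】: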

Enhanced regex solution

Assuming you do care about handling abbreviations such as Mr. and Mrs., the following single-regex solution works pretty well:

<?php // test.php Rev:20160820_1800
$split_sentences = '%(?#!php/i split_sentences Rev:20160820_1800)
    # Split sentences on whitespace between them.
    # See: http://***.com/a/5844564/433790
    (?<=          # Sentence split location preceded by
      [.!?]       # either an end of sentence punct,
    | [.!?][\'"]  # or end of sentence punct and quote.
    )             # End positive lookbehind.
    (?<!          # But don\'t split after these:
      Mr\.        # Either "Mr."
    | Mrs\.       # Or "Mrs."
    | Ms\.        # Or "Ms."
    | Jr\.        # Or "Jr."
    | Dr\.        # Or "Dr."
    | Prof\.      # Or "Prof."
    | Sr\.        # Or "Sr."
    | T\.V\.A\.   # Or "T.V.A."
                 # Or... (you get the idea).
    )             # End negative lookbehind.
    \s+           # Split on whitespace between sentences,
    (?=\S)        # (but not at end of string).
    %xi';  // End $split_sentences.

$text = 'This is sentence one. Sentence two! Sentence thr'.
        'ee? Sentence "four". Sentence "five"! Sentence "'.
        'six"? Sentence "seven." Sentence \'eight!\' Dr. '.
        'Jones said: "Mrs. Smith you have a lovely daught'.
        'er!" The T.V.A. is a big project! '; // Note ws at end.

$sentences = preg_split($split_sentences, $text, -1, PREG_SPLIT_NO_EMPTY);
for ($i = 0; $i < count($sentences); ++$i) 
    printf("Sentence[%d] = [%s]\n", $i + 1, $sentences[$i]);

?>

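Note that you can easily add or remove abbreviations from the expression. Given the following test paragraph:

This is sentence one. Sentence two! Sentence three? Sentence "four". Sentence "five"! Sentence "six"? Sentence "seven." Sentence 'eight!' Dr. Jones said: "Mrs. Smith you have a lovely daughter!" The T.V.A. is a big project!

Here is the output of the script:

Sentence[1] = [This is sentence one.]
Sentence[2] = [Sentence two!]
Sentence[3] = [Sentence three?]
Sentence[4] = [Sentence "four".]
Sentence[5] = [Sentence "five"!]
Sentence[6] = [Sentence "six"?]
Sentence[7] = [Sentence "seven."]
Sentence[8] = [Sentence 'eight!']
Sentence[9] = [Dr. Jones said: "Mrs. Smith you have a lovely daughter!"]
Sentence[10] = [The T.V.A. is a big project!]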

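Basic regex solution

The author of the question commented that the above solution "overlooks many options" and is not generic enough. I'm not sure what that means, but the essence of the above expression is about as clean and simple as it gets. Here it is: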

$re = '/(?<=[.!?]|[.!?][\'"])\s+(?=\S)/';
$sentences = preg_split($re, $text, -1, PREG_SPLIT_NO_EMPTY);

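Note that both solutions correctly identify sentences that end with a quotation mark after the closing punctuation. If you don't care about matching sentences that end in a quotation mark, the regex can be simplified to just: /(?<=[.!?])\s+(?=\S)/.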
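As a rough illustration of how the basic expression behaves, here is a minimal sketch. The helper name split_basic and the sample text are assumptions made for this example, not part of the original answer. It also shows why the enhanced version carries an abbreviation list: the basic pattern splits after "Dr.".

```php
<?php
// Hypothetical helper wrapping the basic expression from the answer.
function split_basic(string $text): array
{
    // Split on whitespace that follows ., ! or ? (optionally plus a quote)
    // and that is followed by a non-whitespace character.
    $re = '/(?<=[.!?]|[.!?][\'"])\s+(?=\S)/';
    return preg_split($re, $text, -1, PREG_SPLIT_NO_EMPTY);
}

// Illustrative input only.
$sample = 'He left. "Why?" she asked. Dr. Jones stayed.';
$parts = split_basic($sample);
print_r($parts);
// The pieces are: He left. / "Why?" / she asked. / Dr. / Jones stayed.
```

The stray "Dr." piece is the kind of false split that the negative lookbehind in the enhanced solution is meant to suppress.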

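Edit: 20130820_1000 Added T.V.A. (another punctuated term to be ignored) to the regex and test string. (To answer PapyRef's comment question.)

Edit: 20130820_1800 Tidied and renamed the regex and added a shebang. Also fixed the regexes to prevent splitting text on trailing whitespace.

【Comments】:

This is still a very straightforward approach. I'm looking for something generic, built through a learning process. Your solution overlooks many options.

@giorgio79: Yes, if the "ellipsis" consists of three consecutive dots. If you are talking about the single Unicode character that represents an ellipsis, you will need to add that Unicode character to the character class for this regex to work.

@Noam - If you specifically want a machine-learning-based solution, please update your question.

With this enhanced regex solution, how can I detect the word "T.V.A"? I tried this: | [t|T]\.[v|V]\.[a|A]\. # or "T.V.A", but it doesn't work.

@PapyRef - Yes, easy. Take a look at the regex. See the list of exceptions, i.e. Mr\.|Mrs\.|Ms\.|etc...? Just add your T\.V\.A\. term to this list, separating it from the other terms with the or | operator. (Don't forget that you need to escape the dots.)

【Answer 2】: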

A slight improvement on someone else's work:

$re = '/# Split sentences on whitespace between them.
(?<=                # Begin positive lookbehind.
  [.!?]             # Either an end of sentence punct,
| [.!?][\'"]        # or end of sentence punct and quote.
)                   # End positive lookbehind.
(?<!                # Begin negative lookbehind.
  Mr\.              # Skip either "Mr."
| Mrs\.             # or "Mrs.",
| Ms\.              # or "Ms.",
| Jr\.              # or "Jr.",
| Dr\.              # or "Dr.",
| Prof\.            # or "Prof.",
| Sr\.              # or "Sr.",
| \s[A-Z]\.         # or initials ex: "George W. Bush" (the i flag makes this match lowercase too),
                    # or... (you get the idea).
)                   # End negative lookbehind.
\s+                 # Split on whitespace between sentences.
/ix';

$sentences = preg_split($re, $story, -1, PREG_SPLIT_NO_EMPTY);
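To see what the added initials rule does, here is a compact sketch. It is not the author's exact expression: the abbreviation list is shortened, and the i flag is deliberately dropped (an assumption of this example) so that [A-Z] only matches uppercase initials. The sample text is made up.

```php
<?php
// Shortened variant of the pattern above, without the i flag,
// so \s[A-Z]\. only protects uppercase initials such as "W.".
$re = '/(?<=[.!?]|[.!?][\'"])(?<!Mr\.|Mrs\.|Dr\.|\s[A-Z]\.)\s+(?=\S)/';

// Illustrative input only.
$story = 'George W. Bush met Dr. Smith. Then he chose plan a. It worked.';
$sentences = preg_split($re, $story, -1, PREG_SPLIT_NO_EMPTY);
print_r($sentences);
// Three pieces: the split after "W." and "Dr." is suppressed,
// while "plan a." still ends a sentence because "a" is lowercase.
```

With the i flag, as in the answer's pattern, " a." would also count as an initial, and the second and third sentences would stay joined.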

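【Comments】:

Would you mind explaining what you actually improved?

【Answer 3】:

As a low-tech approach, you might want to consider using a series of explode calls in a loop, with ., ! and ? as your needles. This would be very memory- and processor-intensive (as most text processing is). You would end up with a bunch of temporary arrays and one master array in which all the sentences found are numerically indexed in the right order.

You would also have to check for common exceptions (such as the . in titles like Mr. and Dr.), but with everything in arrays, these kinds of checks shouldn't be too bad.

I'm not sure whether this is any better than regex in terms of speed and scaling, but it would be worth a shot. How large are the blocks of text you want to break into sentences?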
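One possible shape for this low-tech idea is sketched below. It is a variation rather than the exact procedure described: it calls explode once on spaces and regroups the words, instead of exploding on each punctuation mark in turn. The function name split_low_tech and the abbreviation list are assumptions made for illustration.

```php
<?php
// Hypothetical helper: explode on spaces, then regroup words into sentences.
function split_low_tech(string $text, array $abbreviations): array
{
    $sentences = [];
    $current = [];
    foreach (explode(' ', $text) as $word) {
        if ($word === '') {
            continue; // empty pieces come from repeated spaces
        }
        $current[] = $word;
        $last = substr($word, -1);
        $ends = ($last === '.' || $last === '!' || $last === '?');
        // Close the sentence unless the word is a known abbreviation.
        if ($ends && !in_array($word, $abbreviations, true)) {
            $sentences[] = implode(' ', $current);
            $current = [];
        }
    }
    if ($current !== []) {
        $sentences[] = implode(' ', $current); // text without end punctuation
    }
    return $sentences;
}

$abbr = ['Mr.', 'Mrs.', 'Dr.']; // illustrative list
$result = split_low_tech('Dr. Jones is here. Is Mrs. Smith in? Yes!', $abbr);
print_r($result);
```

This sketch ignores closing quotes after the punctuation and any whitespace other than plain spaces, so it is less accurate than the regex solutions.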

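【Comments】:

This doesn't answer my question, since I'm looking for a library that does this for me. But can you explain the difference between using explode and preg_split?

@Noam: explode() splits on a simple string match and does no regex processing. The implication of the answer is that your use case should be simple enough to handle without regex, i.e. just exploding on each of the common punctuation marks. But I agree, it doesn't really answer your question, or even address the problem you're asking about. Your goal is accuracy, which isn't what he was focusing on at all. (But if you do take this approach, I think strtok() is a better solution than explode(), because multiple punctuation characters are involved.)

【Answer 4】:

I'm using this regex: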

preg_split('/(?<=[.?!])\s(?=[A-Z"\'])/', $text);

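It won't work for sentences that start with a number, but it should also produce very few false positives. Of course, what you are doing matters as well. My program now uses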

explode('.',$text);

because I decided that speed was more important than accuracy.

【Comments】:

【Answer 5】:

Build a list of abbreviations like this
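The trade-off between the two calls can be seen on a small input. The sample text below is made up for illustration. It includes a decimal number and a sentence that starts with a digit, to show both the regex limitation mentioned above and what plain explode does.

```php
<?php
// Illustrative input only.
$text = 'It costs 3.50 dollars. Call me "Al" later. 2 people came.';

// Regex from the answer: split on one whitespace character that follows
// ., ? or ! and precedes an uppercase letter or a quote.
$by_regex = preg_split('/(?<=[.?!])\s(?=[A-Z"\'])/', $text);

// Plain explode: cuts at every dot, including the one inside "3.50".
$by_explode = explode('.', $text);

print_r($by_regex);   // 2 pieces: the sentence starting with "2" is not split off
print_r($by_explode); // 5 pieces, the last one empty, with the dots removed
```

The explode variant also drops the punctuation and leaves leading spaces on the pieces, so it needs trim() and some clean-up afterwards.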

$skip_array = array(
    'Jr', 'Mr', 'Mrs', 'Ms', 'Dr', 'Prof', 'Sr', // etc.
);

Compile them into an expression

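$skip = '';
foreach ($skip_array as $abbr) {
    // Each alternative: one whitespace character, the abbreviation, then ., ! or ?
    $skip = $skip . (empty($skip) ? '' : '|') . '\s{1}' . $abbr . '[.!?]';
}

Finally, run this preg_split to break the text into sentences.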

$lines = preg_split ("/(?<!$skip)(?<=[.?!])\s+(?=[^a-z])/",
                     $txt, -1, PREG_SPLIT_NO_EMPTY);

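If you are processing HTML, watch out for tags being removed in a way that eliminates the space between sentences (for example <p></p>). If you have situations.Like this where.They stick together, it becomes very difficult to parse.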
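One way to avoid fused sentences is to replace each tag with a space before splitting, rather than simply deleting the tags. The sketch below assumes simple, well-formed markup. The tag pattern is a rough approximation and not an HTML parser, and the sample markup is made up.

```php
<?php
// Illustrative input only.
$html = '<p>If you have situations.</p><p>Like this where.</p><p>They stick together.</p>';

// Replace each tag with a space so adjacent sentences do not fuse.
$spaced = preg_replace('/<[^>]+>/', ' ', $html);
// Collapse the whitespace runs this creates and trim the ends.
$plain = trim(preg_replace('/\s+/', ' ', $spaced));

// Same split expression as above, minus the abbreviation lookbehind.
$lines = preg_split('/(?<=[.?!])\s+(?=[^a-z])/', $plain, -1, PREG_SPLIT_NO_EMPTY);
print_r($lines);
```

A side effect of this design is that inline tags inside a word (for example a single bolded letter) introduce an unwanted space, so it suits block-level markup best.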

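【Comments】:

Explode just breaks a string into pieces based on a delimiter. If you say explode(" ", "Where are my suspenders?"), the delimiter is " " (a space). Whenever PHP encounters a space, it splits your string at that point. In this case, the result is four words stored in an array, with keys [0-3]. The delimiter can be anything: &, #, -, :, etc. preg_split is a more sophisticated exploder that incorporates multiple metacharacters, switches, functions and expressions, as in the example above.

【Answer 6】:

@ridgerunner I wrote your PHP code in C#

I get 2 sentences as the result:

Mr. J. Dujardin régle sa T.V.
A. en esp. uniquement

The correct result should be the single sentence: Mr. J. Dujardin régle sa T.V.A. en esp. uniquement

And with our test paragraph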

string sText = "This is sentence one. Sentence two! Sentence three? Sentence \"four\". Sentence \"five\"! Sentence \"six\"? Sentence \"seven.\" Sentence 'eight!' Dr. Jones said: \"Mrs. Smith you have a lovely daughter!\" The T.V.A. is a big project!";

the result is

index: 0 sentence: This is sentence one.
index: 22 sentence: Sentence two!
index: 36 sentence: Sentence three?
index: 52 sentence: Sentence "four".
index: 69 sentence: Sentence "five"!
index: 86 sentence: Sentence "six"?
index: 102 sentence: Sentence "seven.
index: 118 sentence: " Sentence 'eight!'
index: 136 sentence: ' Dr. Jones said: "Mrs. Smith you have a lovely daughter!
index: 193 sentence: " The T.V.
index: 203 sentence: A. is a big project!

C# code:

                string sText = "Mr. J. Dujardin régle sa T.V.A. en esp. uniquement";
                Regex rx = new Regex(@"(\S.+?
                                       [.!?]               # Either an end of sentence punct,
                                       | [.!?]['""]         # or end of sentence punct and quote.
                                       )
                                       (?<!                 # Begin negative lookbehind.
                                          Mr.                   # Skip either Mr.
                                        | Mrs.                  # or Mrs.,
                                        | Ms.                   # or Ms.,
                                        | Jr.                   # or Jr.,
                                        | Dr.                   # or Dr.,
                                        | Prof.                 # or Prof.,
                                        | Sr.                   # or Sr.,
                                        | \s[A-Z].              # or initials ex: George W. Bush,
                                        | T\.V\.A\.             # or "T.V.A."
                                       )                    # End negative lookbehind.
                                       (?=|\s+|$)", 
                                       RegexOptions.CultureInvariant | RegexOptions.IgnorePatternWhitespace | RegexOptions.Compiled);
                foreach (Match match in rx.Matches(sText))
                {
                    Console.WriteLine("index: {0}  sentence: {1}", match.Index, match.Value);
                }

【Comments】:
