Matching a high number of different sentences (using regexp pattern parsing)

Posted 2023-03-13


Question (original title: "matching high number of different sentences (using regexp patterns parsing)", posted 2019-10-30 04:20:46):

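I want to build a text sentence classifier (for chatbot natural language processing) using regexps.

I have a large number (e.g. >> 100) of different kinds of text sentences to match against regexp patterns.

When a sentence matches a regexp (say, an intent), a specific action (a function handler) is activated.

I preset a specific regexp to match each distinct set of sentences, for example: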

    // I have a long list of regexps (possibly several regexps per intent)
    const regexps = [
      /from (?<fromCity>.+)/,  // ---> actionOne()
      /to (?<toCity>.+)/,      // ---> actionTwo()
      /.../,                   // ---> anotherAction()
      /.../                    // ---> yetAnotherAction()
    ]

    // I have a long list of actions (function handlers),
    // stored as references, not called here
    const actions = [
      actionOne,
      actionTwo,
      // ...
    ]

How can I build the fastest (multi-regexp) classifier (in JavaScript)?

My current quick and dirty solution is to check each regexp in sequence:

    // at run time
    ...
    sentence = 'from Genova'
    ...

    if (sentence.match(/from (?<fromCity>.+)/))
      actionOne()

    else if (sentence.match(/to (?<toCity>.+)/))
      actionTwo()

    else if ...
    else if ...
    else
      fallback()

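The if-then sequence approach above doesn't scale well and, above all, it is slow (even if ordering the regexps by how frequently they match might help).

Another way to improve performance might be to create a single (big) regexp made of an alternation of named groups (one per matcher regexp)?

In a minimal example: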

    const regexp = /(?<one>from (?<fromCity>.+))|(?<two>to (?<toCity>.+))/

So I simply create the regexp classifier like this (please take the code below as JavaScript pseudo-code):

    // at build time

    // I collect all possible regexps, each one as a named group
    const intents = [
      '(?<one>from (?<fromCity>.+))',
      '(?<two>to (?<toCity>.+))',
      '...',
      '...'
    ]

    const classifier = new RegExp(intents.join('|'))

    // collection of function handlers, one for each named group
    const Actions = {
      'one': actionOne,
      'two': actionTwo,
      // ...
    }

    // at run time

    const match = sentence.match(classifier)

    // match.groups holds every named group; only the groups that took part
    // in the match are defined, so pick the first defined name that has a handler
    const name = match &&
      Object.keys(Actions).find(key => match.groups[key] !== undefined)

    if (name)
      Actions[name]()
    else
      fallback() // no match

Does this make sense? Any suggestions for a better approach?

Comments:

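One improvement is to create a functionMapper and call the function based on the matched group name, instead of writing lots of if/else.

Right. I updated the code.

I'm voting to close this question because it belongs on codereview.stackexchange.com.

I disagree. I did propose a solution, but the question is about alternative proposals (and/or validation of my draft proposal).

Answer 1: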

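This most likely depends on many things, such as each individual RegExp (e.g. how many capture groups it has), the actual size of the list, and the length of the input.

But when testing with a very large number of RegExps (10,000 simple ones), every variant of the big combined RegExp was much slower than simply executing the individual ones one after another (JSPerf).

Given this, and the fact that separate RegExps keep the code simpler overall, I would advise against the big-RegExp approach.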
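The JSPerf link did not survive here, so the following is only a rough sketch of how such a comparison could be reproduced locally. The pattern shape (`cmd0 (.+)`, `cmd1 (.+)`, ...), the list size `N`, and the helper names `classifySeparate`, `classifyCombined` and `time` are all assumptions made for illustration, not the original benchmark. Timings will vary by engine and input, so no numbers are claimed.

```javascript
// Hypothetical micro-benchmark; names and sizes are illustrative only.
const N = 1000;

// Build N simple patterns: "cmd0 (.+)", "cmd1 (.+)", ...
const sources = Array.from({ length: N }, (_, i) => `cmd${i} (.+)`);

// Approach 1: one RegExp per intent, tested in order.
const separate = sources.map((src) => new RegExp(`^${src}$`));

// Approach 2: one big alternation, one named group per intent.
const combined = new RegExp(
  sources.map((src, i) => `(?<i${i}>^${src}$)`).join('|')
);

// Returns the index of the first matching intent, or -1.
function classifySeparate(sentence) {
  return separate.findIndex((re) => re.test(sentence));
}

// Returns the index encoded in the name of the group that matched, or -1.
function classifyCombined(sentence) {
  const match = combined.exec(sentence);
  if (!match) return -1;
  const name = Object.keys(match.groups).find(
    (key) => match.groups[key] !== undefined
  );
  return Number(name.slice(1));
}

// Runs fn over all sentences `rounds` times and returns elapsed milliseconds.
function time(fn, sentences, rounds = 20) {
  const start = performance.now();
  for (let r = 0; r < rounds; r++) {
    for (const s of sentences) fn(s);
  }
  return performance.now() - start;
}

const sentences = ['cmd0 a', `cmd${N / 2} b`, `cmd${N - 1} c`, 'no match'];
console.log('separate ms:', time(classifySeparate, sentences));
console.log('combined ms:', time(classifyCombined, sentences));
```

Both functions return the same intent index for the same sentence, so only the elapsed time differs. Try it with your own patterns and realistic sentences before drawing conclusions.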

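To make things easier to maintain, I would suggest storing each trigger and its action in the same place, for example in an array of objects. This also lets you add more to these objects later if needed (such as a name for the intent):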

    const intents = [
        { regexp: /from (?<fromCity>.+)/, action: fromCity },
        { regexp: /to (?<toCity>.+)/, action: toCity },
        { regexp: /.../, action: anotherAction },
    ];

    // We use find to stop as soon as we've got a result
    let result = intents.find(intent => {
        let match = sentence.match(intent.regexp);
        if (match) {
            // You can include a default action in case the action is not specified in the intent object
            // Decide what you send to your action function here
            (intent.action || defaultAction)(match, sentence, intent);
        }
        return match;
    });
    if (!result) {
        fallback();
    }
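Since the question mentions having several regexps per intent, here is a minimal sketch of how the array-of-objects layout might be extended to cover that. The intent names, the `patterns` property, the `leaving` pattern and the `dispatch` helper are hypothetical and not part of the answer above.

```javascript
// Hypothetical intents: each one may list several patterns.
const intents = [
  {
    name: 'departure',
    patterns: [/from (?<city>.+)/, /leaving (?<city>.+)/],
    action: (groups) => `departing from ${groups.city}`,
  },
  {
    name: 'arrival',
    patterns: [/to (?<city>.+)/],
    action: (groups) => `arriving in ${groups.city}`,
  },
];

// Returns the action's result for the first pattern that matches, or null.
function dispatch(sentence) {
  for (const intent of intents) {
    for (const pattern of intent.patterns) {
      const match = sentence.match(pattern);
      if (match) return intent.action(match.groups, sentence, intent);
    }
  }
  return null; // caller runs its own fallback
}

console.log(dispatch('leaving Genova')); // "departing from Genova"
console.log(dispatch('hello'));          // null
```

Because every pattern is its own RegExp object, the same group name (`city` here) can be reused freely across patterns. That is harder to guarantee inside one big combined regexp.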

Comments:

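Great answer! Let me read your JSPerf code carefully; it shows that the big regexp is the slowest solution! Thanks for your work. I fully agree with the array-of-objects approach. Sequential evaluation is also smart if the regexps are sorted by probability/frequency (intents[0].regexp -> the most likely). :)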
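The frequency-ordering idea from this comment could be sketched roughly as follows. The `hits` counter, the `help` intent and the `classify` / `reorderByHits` helpers are assumptions added for illustration; nothing in the thread prescribes them.

```javascript
// Hypothetical intents, each with a hit counter; for illustration only.
const intents = [
  { name: 'from', regexp: /from (?<fromCity>.+)/, hits: 0 },
  { name: 'to', regexp: /to (?<toCity>.+)/, hits: 0 },
  { name: 'help', regexp: /^help$/, hits: 0 },
];

// Tests intents in their current order and counts a hit for the first match.
function classify(sentence) {
  for (const intent of intents) {
    const match = sentence.match(intent.regexp);
    if (match) {
      intent.hits += 1;
      return { name: intent.name, groups: match.groups || {} };
    }
  }
  return null; // caller decides on the fallback
}

// Moves the most frequently matched intents to the front.
function reorderByHits() {
  intents.sort((a, b) => b.hits - a.hits);
}

classify('help');
classify('help');
classify('to Rome');
reorderByHits();
console.log(intents.map((intent) => intent.name)); // [ 'help', 'to', 'from' ]
```

One caveat: reordering changes which intent wins when a sentence matches more than one pattern (for example 'from Genova to Rome'), so this is only safe when the patterns are mutually exclusive or the order does not matter.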
