Javascript regexp with unicode and punctuation

【Title】Javascript regexp with unicode and punctuation 【Posted】2017-02-03 06:49:58 【Question】:

I have the following test cases for splitting text into Unicode words, but I don't know how to implement this in JavaScript.

const assert = require('assert');
// `segmentation` is the function under test; it is not defined in the question.

describe("garden: utils", () => {
  it("should split correctly", () => {
    assert.deepEqual(segmentation('Hockey is a popular sport in Canada.'), [
      'Hockey', 'is', 'a', 'popular', 'sport', 'in', 'Canada', '.'
    ]);

    assert.deepEqual(segmentation('How many provinces are there in Canada?'), [
      'How', 'many', 'provinces', 'are', 'there', 'in', 'Canada', '?'
    ]);

    assert.deepEqual(segmentation('The forest is on fire!'), [
      'The', 'forest', 'is', 'on', 'fire', '!'
    ]);

    assert.deepEqual(segmentation('Emily Carr, who was born in 1871, was a great painter.'), [
      'Emily', 'Carr', ',', 'who', 'was', 'born', 'in', '1871', ',', 'was', 'a', 'great', 'painter', '.'
    ]);

    assert.deepEqual(segmentation('This is David\'s computer.'), [
      'This', 'is', 'David', '\'', 's', 'computer', '.'
    ]);

    assert.deepEqual(segmentation('The prime minister said, "We will win the election."'), [
      'The', 'prime', 'minister', 'said', ',', '"', 'We', 'will', 'win', 'the', 'election', '.', '"'
    ]);

    assert.deepEqual(segmentation('There are three positions in hockey: goalie, defence, and forward.'), [
      'There', 'are', 'three', 'positions', 'in', 'hockey', ':', 'goalie', ',', 'defence', ',', 'and', 'forward', '.'
    ]);

    assert.deepEqual(segmentation('The festival is very popular; people from all over the world visit each year.'), [
      'The', 'festival', 'is', 'very', 'popular', ';', 'people', 'from', 'all', 'over', 'the', 'world',
      'visit', 'each', 'year', '.'
    ]);

    assert.deepEqual(segmentation('Mild, wet, and cloudy - these are the characteristics of weather in Vancouver.'), [
      'Mild', ',', 'wet', ',', 'and', 'cloudy', '-', 'these', 'are', 'the', 'characteristics', 'of', 'weather',
      'in', 'Vancouver', '.'
    ]);

    assert.deepEqual(segmentation('sweet-smelling'), [
      'sweet', '-', 'smelling'
    ]);
  });

  it("should not split unicoded words", () => {
    assert.deepEqual(segmentation('hacer a propósito'), [
      'hacer', 'a', 'propósito'
    ]);

    assert.deepEqual(segmentation('nhà em có con mèo'), [
      'nhà', 'em', 'có', 'con', 'mèo'
    ]);
  });

  it("should group periods", () => {
    assert.deepEqual(segmentation('So are ... the fishes.'), [
      'So', 'are', '...', 'the', 'fishes', '.'
    ]);

    assert.deepEqual(segmentation('So are ...... the fishes.'), [
      'So', 'are', '......', 'the', 'fishes', '.'
    ]);

    assert.deepEqual(segmentation('arriba arriba ja....'), [
      'arriba', 'arriba', 'ja', '....'
    ]);
  });
});

Here is the equivalent expression in Python:

class Segmentation(BaseNLPProcessor):
    # Words, runs of two or more dots, or a single punctuation character.
    pattern = re.compile('((?u)\w+|\.{2,}|[%s])' % string.punctuation)

    @classmethod
    def ignore_value(cls, value):
        # type: (str) -> bool
        return negate(compose(is_empty, string.strip))(value)

    def split(self):
        # type: () -> List[str]
        return filter(self.ignore_value, self.pattern.split(self.value()))

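I want to write a JavaScript function equivalent to the Python one above: it should split on Unicode words and punctuation, and keep runs of multiple dots grouped together ...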

Segmentation("Hockey is a popular sport in Canada.").split()

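【Comments】:

【Answer 1】:

This is fairly complicated, because JavaScript RegExp has no negative lookbehind assertions, and Unicode support is not official yet (currently it is only supported in Firefox behind a flag). This uses a library (XRegExp) to handle the Unicode classes. If you need the complete plain regular expression, it is huge. Just leave a comment to let me know, and I will update the answer to use plain RegExp statements with the Unicode ranges expanded.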

const rxLetterToOther = XRegExp('(\\pL)((?!\\s)\\PL)','g');
const rxOtherToLetter = XRegExp('((?!\\s)\\PL)(\\pL)','g');
const rxNumberToOther = XRegExp('(\\pN)((?!\\s)\\PN)','g');
const rxOtherToNumber = XRegExp('((?!\\s)\\PN)(\\pN)','g');
const rxPuctToPunct = XRegExp('(\\pP)(\\pP)','g');
const rxSep = XRegExp('\\s+','g');

function segmentation(s) {
  // Insert a space at every letter/number/punctuation boundary,
  // then split on whitespace.
  return s
    .replace(rxLetterToOther, '$1 $2')
    .replace(rxOtherToLetter, '$1 $2')
    .replace(rxNumberToOther, '$1 $2')
    .replace(rxOtherToNumber, '$1 $2')
    .replace(rxPuctToPunct, '$1 $2')
    .split(rxSep);
}

Here it is passing all the test cases!

Embedded test run: https://fiddle.jshell.net/a3tf68ae/14/show/

Edit: updated the test cases to print the huge RegExp source below the test results. Run the snippet to see the embedded test cases.
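For readers on a newer engine: ES2018 added Unicode property escapes (`\p{...}` with the `u` flag) to JavaScript, so the Unicode classes no longer need a library. The following is only a minimal sketch under that assumption, not the method used in this answer. The name `segmentUnicode` is made up for illustration, and it matches tokens directly instead of inserting spaces and splitting.

```javascript
// Hypothetical helper, not part of the original answer.
// Assumes an engine with ES2018 Unicode property escapes (the `u` flag).
// Alternatives, in order:
//   1. a run of letters, combining marks, digits or underscores (a "word")
//   2. a run of two or more dots, kept together
//   3. any single character that is neither whitespace nor part of a word
const TOKEN = /[\p{L}\p{M}\p{N}_]+|\.{2,}|[^\s\p{L}\p{M}\p{N}_]/gu;

function segmentUnicode(text) {
  // match() with a global regex returns every token, or null if there is none.
  return text.match(TOKEN) || [];
}

console.log(segmentUnicode('nhà em có con mèo, ja....'));
```

Including `\p{M}` keeps accented words together even when the accents arrive as separate combining characters, which is an assumption about the input rather than something the test cases require.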

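【Comments】:

Added a small hack to the answer to run the test cases on jsfiddle. The expanded regular expressions appear below the test results.

【Answer 2】:

I found an answer, but it is complicated. Does anyone have a simpler answer for this?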

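module.exports = (string) => {
  // Split on runs of two or more dots, single punctuation characters, or spaces.
  // The capturing group keeps the matched delimiters in the result.
  const segs = string.split(/(\.{2,}|!|"|#|\$|%|&|'|\(|\)|\*|\+|,|-|\.|\/|:|;|<|=|>|\?|¿|@|\[|\]|\\|\^|_|`|\{|\||\}|~| )/);

  // Drop empty and whitespace-only segments.
  return segs.filter((seg) => seg.trim() !== "");
};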
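Two details of the split-and-filter approach above are easy to miss, so here is a small sketch of them. The helper name `tokenize` and the shortened patterns are made up for illustration. First, `split` with a capturing group returns the delimiters too, plus empty strings between adjacent delimiters, which is why the filter is needed. Second, the `\.{2,}` alternative has to come before the single `\.`, otherwise a run of dots is split one dot at a time.

```javascript
// Hypothetical helper for illustration: split with a capturing pattern,
// then drop empty and whitespace-only pieces.
function tokenize(text, pattern) {
  return text.split(pattern).filter((seg) => seg.trim() !== '');
}

// The capturing group keeps the delimiters; adjacent delimiters leave '' between them.
const raw = 'wet, and'.split(/(,| )/);
console.log(raw); // [ 'wet', ',', '', ' ', 'and' ]

// Alternation order matters: the first alternative that matches wins.
const dotsGrouped = tokenize('ja....', /(\.{2,}|\.)/);
const dotsSplit = tokenize('ja....', /(\.|\.{2,})/);
console.log(dotsGrouped); // [ 'ja', '....' ]
console.log(dotsSplit);   // [ 'ja', '.', '.', '.', '.' ]
```

Because this approach only names the delimiters, accented words such as 'propósito' stay intact without any Unicode-aware word class.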

【Comments】:

Your syntax has some errors... are you sure you pasted it correctly? The test cases are plugged in here: jsfiddle.net/9u0javhg/17
