Repair sentences that have line breaks in the middle of them: Python is \n is fun

【Posted】: 2015-12-30 01:37:24 【Question】:

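I am currently using Apache Tika to extract text from PDFs, and NLTK for named entity recognition and other tasks. The problem is that sentences in the PDF documents are extracted with line breaks in the middle of them. For example,

I am a sentence that has a python line \nbreak in the middle of it.

The pattern is usually a space followed by a line break, <space>\n, or sometimes <space>\n<space>. I want to repair these sentences so that I can run a sentence tokenizer on them.

I am trying to use the regex pattern (.+?)(?:\r\n|\n)(.+[.!?]+[\s|$]) to replace the \n with an empty string.

Problems:

    A sentence that starts on the same line where another sentence ends is not matched.

    How do I match sentences that are broken across more than two lines? In other words, how do I allow (?:\r\n|\n) to occur multiple times?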

    text = """
    Random Data, Company
    2015
    
    This is a sentence that has line 
    break in the middle of it due to extracting from a PDF.
    
    How do I support
    3 line sentence 
    breaks please?
    
    HEADER HERE
    
    The first sentence will 
    match. However, this line will not match
    for some reason 
    that I cannot figure out.
    
    Portfolio: 
    http://DoNotMatchMeBecauseIHaveAPeriodInMe.com 
    
    Full Name 
    San Francisco, CA  
    94000
    
    1500 testing a number as the first word in
    a broken sentence.
    
    Match sentences with capital letters on the next line like 
    Wi-Fi.
    
    This line has 
    trailing spaces after exclamation mark!       
    """
    import re
    # The replacement is a raw string so that \g<1> and \g<2> are not treated as string escapes.
    new_text = re.sub(pattern=r'(.+?)(?:\r\n|\n)(.+[.!?]+[\s|$])', repl=r'\g<1>\g<2>', string=text, flags=re.MULTILINE)
    print(new_text)
    
    expected_result = """
    Random Data, Company
    2015
    
    This is a sentence that has line break in the middle of it due to extracting from a PDF.
    
    How do I support 3 line sentence breaks please?
    
    HEADER HERE
    
    The first sentence will match. However, this line will not match for some reason that I cannot figure out.
    
    Portfolio: 
    http://DoNotMatchMeBecauseIHaveAPeriodInMe.com 
    
    Full Name 
    San Francisco, CA  
    94000
    
    1500 testing a number as the first word in a broken sentence.
    
    Match sentences with capital letters on the next line like Wi-Fi.
    
    This line has trailing spaces after exclamation mark!       
    """
    

Demo at regex101.com

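【Comments】:

It looks like you are trying to do sentence splitting with a (single) regex. There are tools for this, e.g. nltk.tokenize.PunktSentenceTokenizer, which supports (unsupervised) training to learn which abbreviations exist (these are the hardest part of sentence splitting in languages such as English). I would assume that a sentence splitter does not care about line breaks or other whitespace, wherever they occur in the sentence. 【Answer 1】: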

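The regex does not match lines that have trailing spaces, which is the case for the sentence that is broken across 3 lines. As a result, that sentence is not merged into one line.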
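
To see the behavior on the three-line sentence, the pattern from the question can be run on that fragment alone. This is a minimal sketch: `fragment` and `joined` are illustrative names, while the pattern and flag are copied from the question. In this run only the last line break is removed, since each match spans a single line break and the text after it must contain `.`, `!` or `?` followed by whitespace, which the middle line does not.

```python
import re

# Pattern copied from the question; the replacement drops the line break.
pattern = r'(.+?)(?:\r\n|\n)(.+[.!?]+[\s|$])'

# The three-line sentence from the sample text (line 2 ends with a space).
fragment = "How do I support\n3 line sentence \nbreaks please?\n"

joined = re.sub(pattern, r'\g<1>\g<2>', fragment, flags=re.MULTILINE)
print(repr(joined))
# 'How do I support\n3 line sentence breaks please?\n'
```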

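Here is an alternative regex that merges all lines between two empty lines into one, making sure there is exactly one space between the joined lines: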

# The new regex
(\S)[ \t]*(?:\r\n|\n)[ \t]*(\S)
# The replacement string: \1 \2

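Explanation: this searches for a non-whitespace character \S, followed by optional spaces or tabs, a line break, optional spaces or tabs again, and then another \S. The line break and the whitespace between the two \S characters are replaced with a single space. Spaces and tabs are listed explicitly because \s would also match line breaks.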
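
In Python, the regex and replacement string above could be applied roughly as follows. This is a sketch rather than a definitive implementation: `JOIN_LINES`, `join_broken_lines` and `sample` are names chosen here for illustration and do not appear in the original answer.

```python
import re

# Pattern and replacement taken from the answer above.
JOIN_LINES = re.compile(r'(\S)[ \t]*(?:\r\n|\n)[ \t]*(\S)')


def join_broken_lines(text):
    """Replace each single line break (and adjacent spaces/tabs) with one space."""
    return JOIN_LINES.sub(r'\1 \2', text)


# A small excerpt of the sample text from the question.
sample = (
    "How do I support\n"
    "3 line sentence \n"
    "breaks please?\n"
    "\n"
    "HEADER HERE\n"
)

print(join_broken_lines(sample))
```

Blank lines survive because the second `(\S)` cannot match a line break, so paragraphs stay separated. Note that every single line break inside a block is joined, including short lines such as the address in the sample text. Also, since both `\S` characters are consumed by a match, a line made of a single character can take part in only one join per pass; a lookaround form such as `(?<=\S)[ \t]*(?:\r\n|\n)[ \t]*(?=\S)` replaced by one space would avoid that, though it is not part of the answer.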
