Better way to use re.sub

Posted 2023-02-23

【Title】Better way to use re.sub 【Posted】: 2014-06-24 18:28:02 【Question】:

I am cleaning up a list of sources from a Twitter stream. Here is a sample of the data:

source = ['<a href="https://twitter.com/download/android" rel="nofollow">Twitter for Android Tablets</a>', 
          '<a href="https://twitter.com/download/android" rel="nofollow">Twitter for  Android</a>',
          '<a href="http://foursquare.com" rel="nofollow">foursquare</a>', 'web',
          '<a href="http://twitter.com/download/iphone" rel="nofollow">Twitter for iPhone</a>',
          '<a href="http://blackberry.com/twitter" rel="nofollow">Twitter for BlackBerry</a>']


import re
for i in source:
    re.sub('<.*?>', '', re.sub(r'(<.*?>)(Twitter for)(\s+)', r'', i))

### This would be the expected output ###
'Android Tablets'
'Android'
'foursquare'
'web'
'iPhone'
'BlackBerry'

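The code above is what I have: it does the job, but it looks awful. I am hoping there is a better way to do this, either with re.sub() or with some other function that may be more appropriate.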

【Comments】:

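s[s.index('>')+1:s.rindex('<')]. By the way, I would use [^>]* rather than .*?.
@Bakuriu Thanks for the comment. What does the [^>]* pattern do?
See my answer: it matches any character that is not >, which in your context means everything inside the tag.

【Answer 1】: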

Another option, using the BeautifulSoup HTML parser:

>>> from bs4 import BeautifulSoup
>>> for link in source:
...     print(BeautifulSoup(link, 'html.parser').text.replace('Twitter for', '').strip())
... 
Android Tablets
Android
foursquare
web
iPhone
BlackBerry
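
If installing bs4 is not an option, the same parse-then-strip idea can be sketched with the standard library's html.parser module instead. This swaps in a different parser from the one the answer uses; the class name TextOnly and the helper parse_and_clean are made up for illustration, and the sample list is a subset of the question's data.

```python
from html.parser import HTMLParser


class TextOnly(HTMLParser):
    """Collects text nodes only, so every tag is dropped."""

    def __init__(self):
        super().__init__()
        self.parts = []

    def handle_data(self, data):
        self.parts.append(data)


def parse_and_clean(raw):
    # Hypothetical helper: parse, join the text nodes, drop the prefix.
    parser = TextOnly()
    parser.feed(raw)
    parser.close()  # flush any text still buffered by the parser
    return ''.join(parser.parts).replace('Twitter for', '').strip()


samples = [
    '<a href="http://twitter.com/download/iphone" rel="nofollow">Twitter for iPhone</a>',
    '<a href="http://foursquare.com" rel="nofollow">foursquare</a>',
    'web',
]
print([parse_and_clean(s) for s in samples])
# ['iPhone', 'foursquare', 'web']
```

As in the BeautifulSoup version, a plain string such as 'web' passes through unchanged because it contains no tags to drop.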

【Comments】:

【Answer 2】:

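Here are some suggestions for improving your code:

Compile the regex, so that it is not processed again every time you apply it. Use a raw string, so that Python does not interpret anything inside the regex string. Inside the tag, match anything except the closing tag character (>). You do not need to repeat the substitution, because by default it replaces every occurrence on the line.

Here is a simpler and better result: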

>>> import re
>>> r = re.compile(r'<[^>]+>')
>>> for it in source:
...     r.sub('', it)
... 
'Twitter for Android Tablets'
'Twitter for  Android'
'foursquare'
'web'
'Twitter for iPhone'
'Twitter for BlackBerry'
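
The output above still carries the "Twitter for" prefix that the question wants removed. One possible way to finish the job, sketched here under the assumption that the prefix only ever appears at the start of the text, is a second compiled pattern applied after the tags are gone. The names TAG, PREFIX and strip_tags_and_prefix are illustrative, not part of the answer.

```python
import re

TAG = re.compile(r'<[^>]+>')             # any complete tag
PREFIX = re.compile(r'^Twitter for\s+')  # assumed: prefix only at the start


def strip_tags_and_prefix(raw):
    # Hypothetical helper: remove the tags first, then the leading prefix.
    return PREFIX.sub('', TAG.sub('', raw))


samples = [
    '<a href="https://twitter.com/download/android" rel="nofollow">Twitter for  Android</a>',
    '<a href="http://foursquare.com" rel="nofollow">foursquare</a>',
    'web',
]
for s in samples:
    print(repr(strip_tags_and_prefix(s)))
# 'Android'
# 'foursquare'
# 'web'
```

Because \s+ consumes all the whitespace after the prefix, the double space in the second Android entry does not leak into the result.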

Note: the best solution for your use case is @bakuriu's suggestion:

 >>> for it in source:
 ...     it[it.index('>')+1:it.rindex('<')]
'Twitter for Android Tablets'
'Twitter for  Android'
'foursquare'
'Twitter for iPhone'
'Twitter for BlackBerry'

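It adds no significant overhead and uses basic, fast string operations. However, this solution only takes what is between the tags rather than removing them, which can have side effects if there are tags nested between <a> and </a>, or if there are no tags at all; that is, it does not work for the web string, where index('>') raises ValueError. A solution that also handles strings with no tags at all: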

 >>> for it in source:
 ...     if '>' in it and '<' in it:
 ...         it[it.index('>')+1:it.rindex('<')]
 ...     else:
 ...         it
 'Twitter for Android Tablets'
 'Twitter for  Android'
 'foursquare'
 'web'
 'Twitter for iPhone'
 'Twitter for BlackBerry'
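
The same guard can be written with str.find and str.rfind, which return -1 instead of raising when the character is missing. This is only a sketch of that variation; the helper name between_tags is made up here.

```python
def between_tags(raw):
    """Hypothetical helper: text between the first '>' and the last '<'."""
    start = raw.find('>')   # -1 when absent, where index() would raise ValueError
    end = raw.rfind('<')
    if start == -1 or end == -1 or end < start:
        return raw          # no enclosing tag: return the input unchanged
    return raw[start + 1:end]


print(between_tags('<a href="http://foursquare.com" rel="nofollow">foursquare</a>'))  # foursquare
print(between_tags('web'))  # web

# For comparison, the unguarded slice fails on a string without tags:
try:
    'web'.index('>')
except ValueError as exc:
    print('index() raised ValueError:', exc)
```

The extra end < start check is an assumption about what should happen with malformed input such as 'a>b<c>d<'-style fragments reversed; it simply returns the string untouched rather than an empty slice.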

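【Comments】:

+1 for the regex solution. bakuriu's version does not work because of the 'web' case: that string contains no angle brackets. Still, it was interesting to hear about, since I am new to Python. I will use the following: r = re.compile(r'(<[^>]+>)|(Twitter for\s+)') so that the Twitter part is removed as well.

【Answer 3】: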

If you are going to do a lot of this, use a library designed to handle (X)HTML. lxml works well, but I am more familiar with the BeautifulSoup wrapper.

from bs4 import BeautifulSoup

source = ['<a href="https://twitter.com/download/android" rel="nofollow">Twitter for Android Tablets</a>', 
      '<a href="https://twitter.com/download/android" rel="nofollow">Twitter for  Android</a>',
      '<a href="http://foursquare.com" rel="nofollow">foursquare</a>', 'web',
      '<a href="http://twitter.com/download/iphone" rel="nofollow">Twitter for iPhone</a>',
      '<a href="http://blackberry.com/twitter" rel="nofollow">Twitter for BlackBerry</a>']

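soup = BeautifulSoup('\n'.join(source), 'html.parser')  # name the parser explicitly
for tag in soup.find_all('a'):  # only <a> tags are visited, so the bare 'web' entry is skipped
    print(tag.text)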

For your use case, though, this may be overkill.

【Comments】:

【Answer 4】:

If the text really is in such a consistent format, one option is to use plain string operations instead of a regex:

source = ['<a href="https://twitter.com/download/android" rel="nofollow">Twitter for Android Tablets</a>', 
          '<a href="https://twitter.com/download/android" rel="nofollow">Twitter for  Android</a>',
          '<a href="http://foursquare.com" rel="nofollow">foursquare</a>', 'web',
          '<a href="http://twitter.com/download/iphone" rel="nofollow">Twitter for iPhone</a>',
          '<a href="http://blackberry.com/twitter" rel="nofollow">Twitter for BlackBerry</a>']

for i in source:
    print(i.partition('>')[-1].rpartition('<')[0])

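This code finds the first '>' in the string, takes everything after it, then finds the last '<' in what remains and keeps everything before it, which gives you the text between the first '>' and the last '<'.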

@Bakuriu also has a more compact version in the comments, which is probably better than mine!
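
To see what the partition/rpartition chain does on a string without tags, the steps can be run one at a time. The helper names inner_text and inner_text_or_raw are made up for this sketch, and the fallback variant rests on the assumption that an empty result always means "no tag was present".

```python
def inner_text(raw):
    # Same chain as in the loop above, wrapped in a hypothetical helper.
    return raw.partition('>')[-1].rpartition('<')[0]


def inner_text_or_raw(raw):
    # Assumed fallback: an empty result means there was no tag, so keep the input.
    return inner_text(raw) or raw


print('web'.partition('>'))            # ('web', '', '')
print(repr(inner_text('web')))         # ''
print(repr(inner_text_or_raw('web')))  # 'web'
print(repr(inner_text('<a href="http://foursquare.com" rel="nofollow">foursquare</a>')))  # 'foursquare'
```

When the separator is missing, partition puts the whole string in the first slot, so taking the last slot yields an empty string rather than an error. That is why the plain chain prints an empty line for 'web'.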

【Comments】:

【Answer 5】:

This looks less ugly to me and should work just as well:

import re
for i in source:
    print(re.sub(r'(<.*?>)|(Twitter for\s+)', '', i))
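
The comments on the question suggest [^>]* in place of .*?. On the single-line data in this thread both give the same result; a difference shows up when a tag spans a line break, because . does not match a newline unless re.DOTALL is set. The multi-line input below is made up to illustrate this and does not come from the question's data.

```python
import re

lazy = re.compile(r'<.*?>')       # lazy dot: cannot cross a newline by default
negated = re.compile(r'<[^>]*>')  # negated class: any character except '>'

multiline = '<a\nhref="http://foursquare.com">foursquare</a>'  # hypothetical input

print(repr(lazy.sub('', multiline)))     # '<a\nhref="http://foursquare.com">foursquare'
print(repr(negated.sub('', multiline)))  # 'foursquare'
```

With the lazy dot, the opening tag is never matched because the match attempt stops at the newline, so only the closing </a> is removed.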

【Comments】:
