Python: isolating re.search results

Posted 2023-02-23

【Title】: Python: isolating re.search results 【Posted】: 2015-09-09 14:05:49 【Question】:

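So I have this code (probably very inefficient, but that's another story) that extracts URLs from a blog's HTML. I have the HTML in a .csv file, which I load into Python and then run a regex over to get the URLs. Here is the code: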

import csv, re  # required imports

strings = []  # list to read the rows into

with open('Book1.csv', 'rt', newline='') as infile:  # open the csv file
    reader = csv.reader(infile)  # read the csv file
    for row in reader:  # loop over all the rows in the csv file
        strings += row  # put every cell into the list

link_list = []  # list that all the links will be put in
for i in strings:  # loop over the list to access each string (re.search takes a string, not a list)
    links = re.search(r'((https?|ftp)://|www\.)[^\s/$.?#].[^\s]*', i)  # regex to find the links
    if links is not None:  # if it finds a link..
        link_list.append(links)  # put it into the list!

for link in link_list:  # print one link per line for a nice column format
    print(link)

However, when I print the results, they come out in this form:

<_sre.SRE_Match object; span=(49, 80), match='http://buy.tableausoftware.com"'>
<_sre.SRE_Match object; span=(29, 115), match='https://c.velaro.com/visitor/requestchat.aspx?sit>
<_sre.SRE_Match object; span=(34, 117), match='https://www.tableau.com/about/blog/2015/6/become->
<_sre.SRE_Match object; span=(32, 115), match='https://www.tableau.com/about/blog/2015/6/become->
<_sre.SRE_Match object; span=(76, 166), match='https://www.tableau.com/about/blog/2015/6/become->
<_sre.SRE_Match object; span=(9, 34), match='http://twitter.com/share"'>

Is there a way to pull just the links out of all the other junk around them? Also, is this simply how a regex search behaves? Thanks!

【Question comments】:

【Answer 1】:

The problem here is that re.search returns a match object rather than the matched string. To get the text you want, call the match object's group() method.

If you want all of the captured groups, use the groups() method; for one particular group, pass that group's number to group().
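As a small illustration of the difference, the sketch below applies the question's pattern to a made-up sample string (the string is an assumption, not data from the question) and reads the match in three ways.

```python
import re

# Same pattern as in the question; it contains two capturing groups.
pattern = re.compile(r'((https?|ftp)://|www\.)[^\s/$.?#].[^\s]*')

# Hypothetical sample text, used only for illustration.
m = pattern.search('see http://twitter.com/share for details')

whole = m.group(0)     # the entire matched text
captured = m.groups()  # tuple holding every capturing group
scheme = m.group(2)    # only the second capturing group

print(whole)     # http://twitter.com/share
print(captured)  # ('http://', 'http')
print(scheme)    # http
```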

In this case you seem to want the whole match, so you can use group(0):

for i in strings:  # loop over the list to access each string (re.search takes a string, not a list)
    links = re.search(r'((https?|ftp)://|www\.)[^\s/$.?#].[^\s]*', i)  # regex to find the links
    if links is not None:  # if it finds a link..
        link_list.append(links.group(0))  # store the matched text, not the match object
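For a version that can be run without the CSV file, here is a minimal sketch of the same loop wrapped in a function. The name first_links and the sample strings are hypothetical stand-ins for the cells read from Book1.csv.

```python
import re

# Same pattern as in the question, compiled once for reuse.
URL_PATTERN = re.compile(r'((https?|ftp)://|www\.)[^\s/$.?#].[^\s]*')

def first_links(strings):
    """Return the first URL-like match in each string, as plain text."""
    found = []
    for text in strings:
        match = URL_PATTERN.search(text)
        if match is not None:
            # group(0) is the matched text rather than the match object
            found.append(match.group(0))
    return found

# Hypothetical sample rows standing in for the CSV cells.
sample = [
    'Share on http://twitter.com/share today',
    'no link in this cell',
    'visit www.tableau.com now',
]
links = first_links(sample)
for link in links:
    print(link)
```

Because the pattern's final `[^\s]*` runs until the next whitespace, a URL taken from an HTML attribute keeps its closing quote, as the `match='http://twitter.com/share"'` line in the question's output shows. Stripping that character is a separate clean-up step.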

group([group1, ...])

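Returns one or more subgroups of the match. If there is a single argument, the result is a single string; if there are multiple arguments, the result is a tuple with one item per argument. Without arguments, group1 defaults to zero (the whole match is returned). If a groupN argument is zero, the corresponding return value is the entire matching string; if it is in the inclusive range [1..99], it is the string matching the corresponding parenthesized group. If a group number is negative or larger than the number of groups defined in the pattern, an IndexError exception is raised. If a group is contained in a part of the pattern that did not match, the corresponding result is None. If a group is contained in a part of the pattern that matched multiple times, the last match is returned.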
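The edge cases in that description can be checked with a short sketch. The toy patterns below are chosen only for illustration and are not part of the original question.

```python
import re

# Alternation: only one of the two groups takes part in the match.
m = re.search(r'(a)|(b)', 'b')
unmatched = m.group(1)  # group 1 did not participate, so this is None
pair = m.group(2, 0)    # several arguments give a tuple, one item each

# A group inside a repeated part keeps only its last match.
rep = re.search(r'(\w)+', 'abc')
last = rep.group(1)

# Asking for a group the pattern does not define raises IndexError.
try:
    m.group(3)
    raised = False
except IndexError:
    raised = True

print(unmatched, pair, last, raised)  # None ('b', 'b') c True
```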

【Comments】:
