How to get image url from html string using regex

Posted 2023-02-23


【Asked】: 2016-06-05 19:32:07

【Question】:

I am trying to get the image URL from an HTML string using a regex. This is what I have so far:

extension String {
    func regex (pattern: String) -> [String] {
        do {
            let regex = try NSRegularExpression(pattern: pattern, options: NSRegularExpressionOptions(rawValue: 0))
            let nsstr = self as NSString
            let all = NSRange(location: 0, length: nsstr.length)
            var matches : [String] = [String]()
            regex.enumerateMatchesInString(self, options: NSMatchingOptions(rawValue: 0), range: all) {
                (result : NSTextCheckingResult?, _, _) in
                if let r = result {
                    let result = nsstr.substringWithRange(r.range) as String
                    matches.append(result)
                }
            }
            return matches
        } catch {
            return [String]()
        }
    }
}

The pattern is: <img[^>]+src\\s*=\\s*['\']([^'\"]+)['\"][^>]*>

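I still cannot get the image URL with it: the function returns an empty array, even though my HTML string does contain an image. I do not want to use UIWebView because of resizing problems inside a UITableView, so I need to extract the image URL from the HTML and display it in a UIImageView using AlamofireImage.

Any help? There is only one URL I need to get.

Here is my tag: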

<img src="https://en.wikipedia.org/wiki/File:BH_LMC.png"/>

Expected result:

https://en.wikipedia.org/wiki/File:BH_LMC.png

【Comments】:

Good to see this. Could you provide the URL you want to get this result from?
I have edited the question. Take a look.

【Answer 1】:

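Description

<img\b(?=\s)(?=(?:[^>=]|='[^']*'|="[^"]*"|=[^'"][^\s>]*)*?\ssrc=['"]([^"]*)['"]?)(?:[^>=]|='[^']*'|="[^"]*"|=[^'"\s]*)*"\s?\/?>

This regular expression will do the following:

- Capture the entire IMG tag
- Place the src attribute value into capture group 1, without quotes if they exist
- Allow attributes to have single, double, or no quotes
- Can be modified to validate any number of other attributes
- Avoid difficult edge cases that make parsing HTML hard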
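The answer does not show how to read capture group 1 from Swift, so here is a minimal sketch of one way to do it with NSRegularExpression. The helper name `captureGroup1(of:in:)` and the variable names are assumptions made for illustration; only the pattern comes from the answer. A raw string literal is used so the pattern can be pasted without extra escaping.

```swift
import Foundation

// The pattern from this answer, written as a raw string literal.
let imgPattern = #"<img\b(?=\s)(?=(?:[^>=]|='[^']*'|="[^"]*"|=[^'"][^\s>]*)*?\ssrc=['"]([^"]*)['"]?)(?:[^>=]|='[^']*'|="[^"]*"|=[^'"\s]*)*"\s?\/?>"#

// Hypothetical helper: returns the text of capture group 1 for every match.
// An invalid pattern yields an empty array.
func captureGroup1(of pattern: String, in text: String) -> [String] {
    guard let expression = try? NSRegularExpression(pattern: pattern, options: []) else {
        return []
    }
    let whole = NSRange(text.startIndex..<text.endIndex, in: text)
    return expression.matches(in: text, options: [], range: whole).compactMap { match -> String? in
        // range(at: 1) is the first capture group; range(at: 0) would be the whole tag.
        guard let range = Range(match.range(at: 1), in: text) else { return nil }
        return String(text[range])
    }
}

// Two of the simpler tags from the sample text.
let sampleHTML = """
<img src="http://website/someurl.jpeg" onload="img_onload(this);" />
more text
<img src="https://en.wikipedia.org/wiki/File:BH_LMC.png"/>
"""
let urls = captureGroup1(of: imgPattern, in: sampleHTML)
```

With this input, `urls` should hold the two src values in document order. The function in the question reads the whole match range instead, which is why it can only ever return complete tags.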

Example

Live demo

https://regex101.com/r/qW9nG8/1

Sample text

Note the difficult edge case in the first line, where we are looking for a specific droid.

<img onmouseover=' if ( 6 > 3  funSwap(" src="NotTheDroidYourLookingFor.jpg", 6 > 3 )  ; ' src="http://website/ThisIsTheDroidYourLookingFor.jpeg" onload="img_onload(this);" onerror="img_onerror(this);" data-pid="jihgfedcba" data-imagesize="ppew" />
some text

<img src="http://website/someurl.jpeg" onload="img_onload(this);" />
more text
<img src="https://en.wikipedia.org/wiki/File:BH_LMC.png"/>

Sample matches

- Capture group 0 gets the entire IMG tag
- Capture group 1 gets just the src attribute value

[0][0] = <img onmouseover=' funSwap(" src='NotTheDroidYourLookingFor.jpg", data-pid) ; ' src="http://website/ThisIsTheDroidYourLookingFor.jpeg" onload="img_onload(this);" onerror="img_onerror(this);" data-pid="jihgfedcba" data-imagesize="ppew" />
[0][1] = http://website/ThisIsTheDroidYourLookingFor.jpeg

[1][0] = <img src="http://website/someurl.jpeg" onload="img_onload(this);" />
[1][1] = http://website/someurl.jpeg

[2][0] = <img src="https://en.wikipedia.org/wiki/File:BH_LMC.png"/>
[2][1] = https://en.wikipedia.org/wiki/File:BH_LMC.png

Explanation

NODE                     EXPLANATION
----------------------------------------------------------------------
  <img                     '<img'
----------------------------------------------------------------------
  \b                       the boundary between a word char (\w) and
                           something that is not a word char
----------------------------------------------------------------------
  (?=                      look ahead to see if there is:
----------------------------------------------------------------------
    \s                       whitespace (\n, \r, \t, \f, and " ")
----------------------------------------------------------------------
  )                        end of look-ahead
----------------------------------------------------------------------
  (?=                      look ahead to see if there is:
----------------------------------------------------------------------
    (?:                      group, but do not capture (0 or more
                             times (matching the least amount
                             possible)):
----------------------------------------------------------------------
      [^>=]                    any character except: '>', '='
----------------------------------------------------------------------
     |                        OR
----------------------------------------------------------------------
      ='                       '=\''
----------------------------------------------------------------------
      [^']*                    any character except: ''' (0 or more
                               times (matching the most amount
                               possible))
----------------------------------------------------------------------
      '                        '\''
----------------------------------------------------------------------
     |                        OR
----------------------------------------------------------------------
      ="                       '="'
----------------------------------------------------------------------
      [^"]*                    any character except: '"' (0 or more
                               times (matching the most amount
                               possible))
----------------------------------------------------------------------
      "                        '"'
----------------------------------------------------------------------
     |                        OR
----------------------------------------------------------------------
      =                        '='
----------------------------------------------------------------------
      [^'"]                    any character except: ''', '"'
----------------------------------------------------------------------
      [^\s>]*                  any character except: whitespace (\n,
                               \r, \t, \f, and " "), '>' (0 or more
                               times (matching the most amount
                               possible))
----------------------------------------------------------------------
    )*?                      end of grouping
----------------------------------------------------------------------
    \s                       whitespace (\n, \r, \t, \f, and " ")
----------------------------------------------------------------------
    src=                     'src='
----------------------------------------------------------------------
    ['"]                     any character of: ''', '"'
----------------------------------------------------------------------
    (                        group and capture to \1:
----------------------------------------------------------------------
      [^"]*                    any character except: '"' (0 or more
                               times (matching the most amount
                               possible))
----------------------------------------------------------------------
    )                        end of \1
----------------------------------------------------------------------
    ['"]?                    any character of: ''', '"' (optional
                             (matching the most amount possible))
----------------------------------------------------------------------
  )                        end of look-ahead
----------------------------------------------------------------------
  (?:                      group, but do not capture (0 or more times
                           (matching the most amount possible)):
----------------------------------------------------------------------
    [^>=]                    any character except: '>', '='
----------------------------------------------------------------------
   |                        OR
----------------------------------------------------------------------
    ='                       '=\''
----------------------------------------------------------------------
    [^']*                    any character except: ''' (0 or more
                             times (matching the most amount
                             possible))
----------------------------------------------------------------------
    '                        '\''
----------------------------------------------------------------------
   |                        OR
----------------------------------------------------------------------
    ="                       '="'
----------------------------------------------------------------------
    [^"]*                    any character except: '"' (0 or more
                             times (matching the most amount
                             possible))
----------------------------------------------------------------------
    "                        '"'
----------------------------------------------------------------------
   |                        OR
----------------------------------------------------------------------
    =                        '='
----------------------------------------------------------------------
    [^'"\s]*                 any character except: ''', '"',
                             whitespace (\n, \r, \t, \f, and " ") (0
                             or more times (matching the most amount
                             possible))
----------------------------------------------------------------------
  )*                       end of grouping
----------------------------------------------------------------------
  "                        '"'
----------------------------------------------------------------------
  \s?                      whitespace (\n, \r, \t, \f, and " ")
                           (optional (matching the most amount
                           possible))
----------------------------------------------------------------------
  \/?                      '/' (optional (matching the most amount
                           possible))
----------------------------------------------------------------------
  >                        '>'
----------------------------------------------------------------------


【Answer 2】:

Here is the regex pattern you should use:

(http[^\s]+(jpg|jpeg|png|tiff)\b)

Translation:

- Find everything that starts with "http"
- No whitespace allowed
- Ends with one of jpg, jpeg, png, tiff

Regex function:

import Foundation

// Returns the full text of every match of `regex` in `text`.
func matches(for regex: String, in text: String) -> [String] {
    do {
        let regex = try NSRegularExpression(pattern: regex, options: [])
        let nsString = text as NSString
        let results = regex.matches(in: text, options: [], range: NSMakeRange(0, nsString.length))
        return results.map { nsString.substring(with: $0.range) }
    } catch {
        // The pattern could not be compiled.
        print("invalid regex: \(error.localizedDescription)")
        return []
    }
}

Usage:

var matched = matches(for: "(http[^\\s]+(jpg|jpeg|png|tiff)\\b)", in: String(htmlStr))

Notes:

- matched is returned as an array of the matched strings
- This code was written for Swift 3
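Since the question only needs a single URL, a shorter route may be enough. The sketch below uses `String.range(of:options:)` with `.regularExpression`, which returns the range of the first match. The function name `firstMatch(of:in:)` and the variable names are assumptions for illustration, not part of this answer.

```swift
import Foundation

// Hypothetical helper: returns the first substring matching `pattern`,
// or nil when nothing matches (or the pattern is invalid).
func firstMatch(of pattern: String, in text: String) -> String? {
    guard let range = text.range(of: pattern, options: .regularExpression) else {
        return nil
    }
    return String(text[range])
}

// The tag from the question and the pattern from this answer (raw string, so no double escaping).
let questionTag = "<img src=\"https://en.wikipedia.org/wiki/File:BH_LMC.png\"/>"
let urlPattern = #"http[^\s]+(jpg|jpeg|png|tiff)\b"#
let firstURL = firstMatch(of: urlPattern, in: questionTag)
```

Keep in mind that this pattern matches any http URL ending in one of the listed extensions, whether or not it sits inside an img tag, so it is only a reasonable fit when the HTML is known to contain just the image link.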


【Answer 3】:

A small tweak to the pattern is all it takes:

let string = "some text and other text <img src=\"en.wikipedia.org/wiki/File:BH_LMC.png\"/>;and then more text and more text"

let matches = string.regex("<img[^>]+src*=\".*?\"['/']>")

This returns an array with one match.
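The `regex` extension from the question uses Swift 2 era API names, so the call above will not compile as written on a current toolchain. As a rough sketch, the extension could be restated for Swift 4 or later like this. The unlabeled parameter and the use of `Range(_:in:)` are my choices, not the asker's.

```swift
import Foundation

extension String {
    /// Returns the full text of every match of `pattern`,
    /// or an empty array if the pattern cannot be compiled.
    func regex(_ pattern: String) -> [String] {
        guard let expression = try? NSRegularExpression(pattern: pattern, options: []) else {
            return []
        }
        let whole = NSRange(startIndex..<endIndex, in: self)
        return expression.matches(in: self, options: [], range: whole).compactMap { match -> String? in
            // match.range covers the whole match, i.e. the complete img tag here.
            guard let range = Range(match.range, in: self) else { return nil }
            return String(self[range])
        }
    }
}

let string = "some text and other text <img src=\"en.wikipedia.org/wiki/File:BH_LMC.png\"/>;and then more text and more text"
let tagMatches = string.regex("<img[^>]+src*=\".*?\"['/']>")
```

Note that the single element in `tagMatches` is the entire img tag rather than the bare URL. Getting only the URL would need a capture group around the src value and a read of that group's range.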


【Answer 4】:

An elegant Swift 5 solution that extracts image, audio, and video src values from an htmlString

let filesList = matches(for: "src=\"(.*?)\"", in: htmlString)
for j in 0..<filesList.count {
    var fileName = filesList[j].replacingOccurrences(of: "src=", with: "", options: .literal, range: nil)
    fileName = fileName.replacingOccurrences(of: "\"", with: "", options: .literal, range: nil)
}

where htmlString is a String variable.
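On Swift 5.7 or later, the two `replacingOccurrences` calls can probably be avoided by reading the capture group directly. The sketch below assumes the standard library `Regex` type with an extended-delimiter literal (`#/.../#`), which also needs a macOS 13 / iOS 16 or newer runtime on Apple platforms. The function name `srcValues(in:)` is hypothetical.

```swift
// Hypothetical helper: returns every double-quoted src attribute value.
// Requires Swift 5.7+; no Foundation import is needed.
func srcValues(in html: String) -> [String] {
    // The literal's output is a tuple: .0 is the whole match, .1 is the capture group.
    let srcAttribute = #/src="(.*?)"/#
    return html.matches(of: srcAttribute).map { String($0.output.1) }
}

let mediaHTML = "<img src=\"a.png\"/><audio src=\"b.mp3\"></audio>"
let files = srcValues(in: mediaHTML)
```

Like the pattern in this answer, the sketch only handles double-quoted src attributes.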
