How to get image url from html string using regex
【Posted】: 2016-06-05 19:32:07
【Question】: I am trying to get the image URL out of an HTML string with a regex. This is what I am working with at the moment:
    extension String {
        func regex(pattern: String) -> [String] {
            do {
                let regex = try NSRegularExpression(pattern: pattern, options: NSRegularExpressionOptions(rawValue: 0))
                let nsstr = self as NSString
                let all = NSRange(location: 0, length: nsstr.length)
                var matches: [String] = [String]()
                regex.enumerateMatchesInString(self, options: NSMatchingOptions(rawValue: 0), range: all) {
                    (result: NSTextCheckingResult?, _, _) in
                    if let r = result {
                        let result = nsstr.substringWithRange(r.range) as String
                        matches.append(result)
                    }
                }
                return matches
            } catch {
                return [String]()
            }
        }
    }
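The pattern is: <img[^>]+src\\s*=\\s*['\']([^'\"]+)['\"][^>]*>
I still cannot get the image URL with it: the function returns an empty array, even though my HTML string does contain an image. I do not want to use UIWebView because of a UITableView resizing issue, so I need to pull the image URL out of the HTML and display it in a UIImageView using AlamofireImage.
Any help? There is only one URL I need to get.
This is my tag: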
<img src="https://en.wikipedia.org/wiki/File:BH_LMC.png"/>
and this is what I want to get from it:
https://en.wikipedia.org/wiki/File:BH_LMC.png
【Discussion】:
Good to see this. Could you provide the URL you want to get this result from?
I have edited the question. Take a look.
【Answer 1】: Description
<img\b(?=\s)(?=(?:[^>=]|='[^']*'|="[^"]*"|=[^'"][^\s>]*)*?\ssrc=['"]([^"]*)['"]?)(?:[^>=]|='[^']*'|="[^"]*"|=[^'"\s]*)*"\s?\/?>
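This regex will do the following:
- Capture the entire IMG tag
- Put the src attribute value into capture group 1, without the quotes if they exist
- Allow attributes to have single quotes, double quotes, or no quotes
- Can be modified to validate any number of other attributes
- Avoid the difficult edge cases that tend to make parsing HTML hard
Example
Live demo
https://regex101.com/r/qW9nG8/1
Sample text
Note the difficult edge case in the first line, where we are looking for a specific droid.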
<img onmouseover=' if ( 6 > 3 funSwap(" src="NotTheDroidYourLookingFor.jpg", 6 > 3 ) ; ' src="http://website/ThisIsTheDroidYourLookingFor.jpeg" onload="img_onload(this);" onerror="img_onerror(this);" data-pid="jihgfedcba" data-imagesize="ppew" />
some text
<img src="http://website/someurl.jpeg" onload="img_onload(this);" />
more text
<img src="https://en.wikipedia.org/wiki/File:BH_LMC.png"/>
Sample matches
Capture group 0 gets the entire IMG tag
Capture group 1 gets just the src attribute value
[0][0] = <img onmouseover=' funSwap(" src='NotTheDroidYourLookingFor.jpg", data-pid) ; ' src="http://website/ThisIsTheDroidYourLookingFor.jpeg" onload="img_onload(this);" onerror="img_onerror(this);" data-pid="jihgfedcba" data-imagesize="ppew" />
[0][1] = http://website/ThisIsTheDroidYourLookingFor.jpeg
[1][0] = <img src="http://website/someurl.jpeg" onload="img_onload(this);" />
[1][1] = http://website/someurl.jpeg
[2][0] = <img src="https://en.wikipedia.org/wiki/File:BH_LMC.png"/>
[2][1] = https://en.wikipedia.org/wiki/File:BH_LMC.png
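The answer gives the pattern but no Swift code for it. The following is a minimal sketch of how the pattern might be applied with NSRegularExpression to read capture group 1. It assumes Swift 5 or later (for the raw string literal); the names imgPattern and imageSources(in:) are made up for this illustration and do not come from the answer.

```swift
import Foundation

// The pattern from this answer, written as a raw string literal (Swift 5+)
// so that backslashes and quotes need no extra escaping.
let imgPattern = #"<img\b(?=\s)(?=(?:[^>=]|='[^']*'|="[^"]*"|=[^'"][^\s>]*)*?\ssrc=['"]([^"]*)['"]?)(?:[^>=]|='[^']*'|="[^"]*"|=[^'"\s]*)*"\s?\/?>"#

// Hypothetical helper: returns the src value (capture group 1) of every
// <img> tag that the pattern matches in `html`.
func imageSources(in html: String) -> [String] {
    guard let regex = try? NSRegularExpression(pattern: imgPattern) else { return [] }
    let fullRange = NSRange(html.startIndex..., in: html)
    return regex.matches(in: html, range: fullRange).compactMap { match in
        // Group 0 is the whole tag; group 1 is the src attribute value.
        Range(match.range(at: 1), in: html).map { String(html[$0]) }
    }
}

let tag = "<img src=\"https://en.wikipedia.org/wiki/File:BH_LMC.png\"/>"
print(imageSources(in: tag))
```

Reading range(at: 1) rather than the match's full range is what separates the URL from the surrounding tag; the extension in the question appends the full match range instead.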
Explanation
NODE EXPLANATION
----------------------------------------------------------------------
<img '<img'
----------------------------------------------------------------------
\b the boundary between a word char (\w) and
something that is not a word char
----------------------------------------------------------------------
(?= look ahead to see if there is:
----------------------------------------------------------------------
\s whitespace (\n, \r, \t, \f, and " ")
----------------------------------------------------------------------
) end of look-ahead
----------------------------------------------------------------------
(?= look ahead to see if there is:
----------------------------------------------------------------------
(?: group, but do not capture (0 or more
times (matching the least amount
possible)):
----------------------------------------------------------------------
[^>=] any character except: '>', '='
----------------------------------------------------------------------
| OR
----------------------------------------------------------------------
=' '=\''
----------------------------------------------------------------------
[^']* any character except: ''' (0 or more
times (matching the most amount
possible))
----------------------------------------------------------------------
' '\''
----------------------------------------------------------------------
| OR
----------------------------------------------------------------------
=" '="'
----------------------------------------------------------------------
[^"]* any character except: '"' (0 or more
times (matching the most amount
possible))
----------------------------------------------------------------------
" '"'
----------------------------------------------------------------------
| OR
----------------------------------------------------------------------
= '='
----------------------------------------------------------------------
[^'"] any character except: ''', '"'
----------------------------------------------------------------------
[^\s>]* any character except: whitespace (\n,
\r, \t, \f, and " "), '>' (0 or more
times (matching the most amount
possible))
----------------------------------------------------------------------
)*? end of grouping
----------------------------------------------------------------------
\s whitespace (\n, \r, \t, \f, and " ")
----------------------------------------------------------------------
src= 'src='
----------------------------------------------------------------------
['"] any character of: ''', '"'
----------------------------------------------------------------------
( group and capture to \1:
----------------------------------------------------------------------
[^"]* any character except: '"' (0 or more
times (matching the most amount
possible))
----------------------------------------------------------------------
) end of \1
----------------------------------------------------------------------
['"]? any character of: ''', '"' (optional
(matching the most amount possible))
----------------------------------------------------------------------
) end of look-ahead
----------------------------------------------------------------------
(?: group, but do not capture (0 or more times
(matching the most amount possible)):
----------------------------------------------------------------------
[^>=] any character except: '>', '='
----------------------------------------------------------------------
| OR
----------------------------------------------------------------------
=' '=\''
----------------------------------------------------------------------
[^']* any character except: ''' (0 or more
times (matching the most amount
possible))
----------------------------------------------------------------------
' '\''
----------------------------------------------------------------------
| OR
----------------------------------------------------------------------
=" '="'
----------------------------------------------------------------------
[^"]* any character except: '"' (0 or more
times (matching the most amount
possible))
----------------------------------------------------------------------
" '"'
----------------------------------------------------------------------
| OR
----------------------------------------------------------------------
= '='
----------------------------------------------------------------------
[^'"\s]* any character except: ''', '"',
whitespace (\n, \r, \t, \f, and " ") (0
or more times (matching the most amount
possible))
----------------------------------------------------------------------
)* end of grouping
----------------------------------------------------------------------
" '"'
----------------------------------------------------------------------
\s? whitespace (\n, \r, \t, \f, and " ")
(optional (matching the most amount
possible))
----------------------------------------------------------------------
\/? '/' (optional (matching the most amount
possible))
----------------------------------------------------------------------
> '>'
----------------------------------------------------------------------
【Answer 2】: Here is the regex pattern you should use:
(http[^\s]+(jpg|jpeg|png|tiff)\b)
Translation:
- Find everything that starts with "http"
- No whitespace allowed
- Ends with one of the following: jpg, jpeg, png, tiff
Regex function:
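    import Foundation

    // Returns every substring of `text` that matches `regex`.
    // Early Swift 3 previews called the class RegularExpression;
    // released toolchains call it NSRegularExpression.
    func matches(for regex: String, in text: String) -> [String] {
        do {
            let regex = try NSRegularExpression(pattern: regex, options: [])
            let nsString = text as NSString
            let results = regex.matches(in: text, range: NSMakeRange(0, nsString.length))
            return results.map { nsString.substring(with: $0.range) }
        } catch let error as NSError {
            print("invalid regex: \(error.localizedDescription)")
            return []
        }
    }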
Usage:
var matched = matches(for: "(http[^\\s]+(jpg|jpeg|png|tiff)\\b)", in: String(htmlStr))
Note:
- matched is returned as an array of the matched strings
- This code was written for Swift 3
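Since the answer targets Swift 3, here is a sketch of how the same function might look in current Swift. It assumes Swift 5 and converts between NSRange and String ranges with NSRange(_:in:) and Range(_:in:) instead of bridging to NSString. The sample html string is made up for the illustration, built around the tag from the question.

```swift
import Foundation

// Returns every substring of `text` that matches `pattern`,
// or an empty array if the pattern does not compile.
func matches(for pattern: String, in text: String) -> [String] {
    do {
        let regex = try NSRegularExpression(pattern: pattern)
        let fullRange = NSRange(text.startIndex..., in: text)
        return regex.matches(in: text, range: fullRange).compactMap { result in
            // Convert the NSRange of the whole match back to a String range.
            Range(result.range, in: text).map { String(text[$0]) }
        }
    } catch {
        print("invalid regex: \(error.localizedDescription)")
        return []
    }
}

// Illustrative input only.
let html = "<p>some text <img src=\"https://en.wikipedia.org/wiki/File:BH_LMC.png\"/> more text</p>"
let urls = matches(for: #"(http[^\s]+(jpg|jpeg|png|tiff)\b)"#, in: html)
print(urls)
```

Keep in mind that this pattern only finds URLs that start with "http" and end in one of the listed extensions, so an image served from an extensionless or relative URL would be missed.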
【Answer 3】: A small tweak to the pattern is all it takes:
let string = "some text and other text <img src=\"en.wikipedia.org/wiki/File:BH_LMC.png\"/>;and then more text and more text"
let matches = string.regex("<img[^>]+src*=\".*?\"['/']>")
This returns an array of the matches.
【Answer 4】: An elegant Swift 5 solution for extracting the image, audio, and video src values from htmlString:
    let filesList = matches(for: "src=\"(.*?)\"", in: htmlString)
    for j in 0..<filesList.count {
        // Each match still looks like src="..."; strip the attribute name and the quotes.
        var fileName = filesList[j].replacingOccurrences(of: "src=", with: "", options: .literal, range: nil)
        fileName = fileName.replacingOccurrences(of: "\"", with: "", options: .literal, range: nil)
    }
where htmlString is a String variable.
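The pattern above already wraps the value in a capture group, so the two replacingOccurrences calls could arguably be avoided by reading group 1 directly. A minimal sketch under that assumption follows; the helper name srcValues(in:) is hypothetical and is not part of the answer.

```swift
import Foundation

// Hypothetical helper: collects the contents of capture group 1,
// i.e. the text between the quotes of every src="..." attribute.
func srcValues(in htmlString: String) -> [String] {
    guard let regex = try? NSRegularExpression(pattern: "src=\"(.*?)\"") else { return [] }
    let fullRange = NSRange(htmlString.startIndex..., in: htmlString)
    var values: [String] = []
    for match in regex.matches(in: htmlString, range: fullRange) {
        // range(at: 1) is the capture group; range(at: 0) would be the whole match.
        if let range = Range(match.range(at: 1), in: htmlString) {
            values.append(String(htmlString[range]))
        }
    }
    return values
}

print(srcValues(in: "<img src=\"a.png\"><video src=\"b.mp4\"></video>"))
```

Like the original, this only handles double-quoted src attributes.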