How do I extract one or multiple URLs from a text (NVARCHAR(MAX)) column
【Posted】: 2014-05-04 13:39:53
【Question】: My table has a data column in which each row can contain zero, one, or several URLs mixed in with other text. I want to extract those URLs into a new result set that contains only the URLs.
Why? Because I want to add some of these URLs to a blocklist in my database to prevent spam.
For example, one row of the data column contains text like this:
hmaruqbtufcvdlfu, <a href="httx://portugal-forex.com/">Day forex signal strategy trading</a>, KzxiIIO, [url=httx://portugal-forex.com/]Forex Broker[/url], mtNZQDi, httx://portugal-forex.com/ The best forex broker, IBWlBzg, <a href="httx://phen375treatment.com/">Avantage inconveniant phen 375</a>, ApEuXTp, [url=httx://phen375treatment.com/]Phen375[/url], QDVLpSn, httx://phen375treatment.com/ Where to buy phen 375, Fnwpugj, <a href="httx://priligy2000.org/">Priligy t</a>, zwRZhIC, [url=httx://priligy2000.org/]Order priligy[/url], FBgSaWs, httx://priligy2000.org/ Priligy buy online, FsemWnW, <a href="httx://ossorio.org/">Online Casino</a>, aOBtTaK, [url=httx://ossorio.org/]Online Casino[/url], oMMMacf, httx://ossorio.org/ Free online casino bounuses, occFLyZ, <a href="httx://paroxetine247.com/">Paroxetine adema</a>, xvrIdnq, [url=httx://paroxetine247.com/]Paroxetine depression[/url], MLSRAXX, httx://paroxetine247.com/ Paroxetine dark skin, GLYTcZY, <a href="httx://resolvedisputes.org/">Fioricet prescription online</a>, PmEMaMA, [url=httx://resolvedisputes.org/]Fioricet wcodiene for headache[/url], vPlKLhq, httx://resolvedisputes.org/ Online pharmacy fioricet, fxfhRcV.
From that text I want all the URLs:
httx://portugal-forex.com/
httx://phen375treatment.com/
httx://priligy2000.org/
And so on.
I really don't know where to start doing this in SQL.
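【Comments】:
Do you only need the main domain, e.g. httx://portugal-forex.com/, or could it also be something like httx://portugal-forex.com/xxx?Page=2
The main domain is enough.
【Answer 1】: Here is an example. I search the string from 'httx://' to the first '/' that follows it.
In any case, you have to go through the rows one at a time.
Put the code into a function: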
CREATE FUNCTION Temporary.getLinksFromText (@Tekstas NVARCHAR(MAX))
RETURNS @Data TABLE (TheLink NVARCHAR(500))
AS
BEGIN
    DECLARE @FirstIndexOfChar INT,
            @LastIndexOfChar INT,
            @LengthOfStringBetweenChars INT,
            @String NVARCHAR(MAX); -- NVARCHAR, to match the type of the input text

    -- Position of the first 'httx://' prefix (0 when there is none)
    SET @FirstIndexOfChar = CHARINDEX('httx://', @Tekstas, 0);

    WHILE @FirstIndexOfChar > 0
    BEGIN
        -- The first '/' after the 7-character 'httx://' prefix ends the main domain
        SET @LastIndexOfChar = CHARINDEX('/', @Tekstas, @FirstIndexOfChar + 7);

        -- No closing '/': stop here rather than call SUBSTRING with a negative length
        IF @LastIndexOfChar = 0 BREAK;

        SET @LengthOfStringBetweenChars = @LastIndexOfChar - @FirstIndexOfChar + 1;
        SET @String = SUBSTRING(@Tekstas, @FirstIndexOfChar, @LengthOfStringBetweenChars);
        INSERT INTO @Data (TheLink) VALUES (@String);

        -- Drop everything before the closing '/' and search the remainder
        SET @Tekstas = SUBSTRING(@Tekstas, @LastIndexOfChar, LEN(@Tekstas));
        SET @FirstIndexOfChar = CHARINDEX('httx://', @Tekstas, 0);
    END

    RETURN;
END
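To see the index arithmetic the function relies on in isolation, here is a minimal sketch of a single extraction step on one made-up string. The example.com host and the variable names @Text, @Start and @End are assumptions for illustration and do not come from the post.

```sql
-- Hypothetical sample text; only the 'httx://' prefix convention comes from the post
DECLARE @Text NVARCHAR(MAX) = N'see httx://example.com/page now';

-- 1-based position where the prefix starts
DECLARE @Start INT = CHARINDEX('httx://', @Text);

-- First '/' found after skipping the 7 characters of 'httx://'
DECLARE @End INT = CHARINDEX('/', @Text, @Start + 7);

-- Inclusive length is @End - @Start + 1
SELECT @Start AS StartPos,                                  -- 5
       @End   AS EndPos,                                    -- 23
       SUBSTRING(@Text, @Start, @End - @Start + 1) AS MainDomain; -- httx://example.com/
```

The prefix starts at position 5, the first '/' after it is at position 23, and the 19 characters from one to the other are the main domain. The function repeats this step on whatever text remains after each match.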
Create some test data:
CREATE TABLE #Data(weLink NVARCHAR(MAX));
INSERT INTO #Data VALUES
('hmaruqbtufcvdlfu, <a href="httx://portugal-forex.com/">Day forex signal strategy trading</a>, KzxiIIO, [url=httx://portugal-forex.com/]Forex Broker[/url], mtNZQDi, httx://portugal-forex.com/ The best forex broker, IBWlBzg, <a href="httx://phen375treatment.com/">Avantage inconveniant phen 375</a>, ApEuXTp, [url=httx://phen375treatment.com/]Phen375[/url], QDVLpSn, httx://phen375treatment.com/ Where to buy phen 375, Fnwpugj, <a href="httx://priligy2000.org/">Priligy t</a>, zwRZhIC, [url=httx://priligy2000.org/]Order priligy[/url], FBgSaWs, httx://priligy2000.org/ Priligy buy online, FsemWnW, <a href="httx://ossorio.org/">Online Casino</a>, aOBtTaK, [url=httx://ossorio.org/]Online Casino[/url], oMMMacf, httx://ossorio.org/ Free online casino bounuses, occFLyZ, <a href="httx://paroxetine247.com/">Paroxetine adema</a>, xvrIdnq, [url=httx://paroxetine247.com/]Paroxetine depression[/url], MLSRAXX, httx://paroxetine247.com/ Paroxetine dark skin, GLYTcZY, <a href="httx://resolvedisputes.org/">Fioricet prescription online</a>, PmEMaMA, [url=httx://resolvedisputes.org/]Fioricet wcodiene for headache[/url], vPlKLhq, httx://resolvedisputes.org/ Online pharmacy fioricet, fxfhRcV.'),
('hmaruqbtufcvdlfu, <a href="httx://portugal-forex.com/">Day forex signal strategy trading</a>, KzxiIIO, [url=httx://portugal-forex.com/]Forex Broker[/url], mtNZQDi, httx://portugal-forex.com/ The best forex broker, IBWlBzg, <a href="httx://phen375treatment.com/">Avantage inconveniant phen 375</a>, ApEuXTp, [url=httx://phen375treatment.com/]Phen375[/url], QDVLpSn, httx://phen375treatment.com/ Where to buy phen 375, Fnwpugj, <a href="httx://priligy2000.org/">Priligy t</a>, zwRZhIC, [url=httx://priligy2000.org/]Order priligy[/url], FBgSaWs, httx://priligy2000.org/ Priligy buy online, FsemWnW, <a href="httx://ossorio.org/">Online Casino</a>, aOBtTaK, [url=httx://ossorio.org/]Online Casino[/url], oMMMacf, httx://ossorio.org/ Free online casino bounuses, occFLyZ, <a href="httx://paroxetine247.com/">Paroxetine adema</a>, xvrIdnq, [url=httx://paroxetine247.com/]Paroxetine depression[/url], MLSRAXX, httx://paroxetine247.com/ Paroxetine dark skin, GLYTcZY, <a href="httx://resolvedisputes.org/">Fioricet prescription online</a>, PmEMaMA, [url=httx://resolvedisputes.org/]Fioricet wcodiene for headache[/url], vPlKLhq, httx://resolvedisputes.org/ Online pharmacy fioricet, fxfhRcV.')
You can run it like this (without a cursor):
SELECT allLinks.*
FROM #Data AS D
OUTER APPLY Temporary.getLinksFromText (D.weLink) AS allLinks
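For the blocklist use case, the same APPLY pattern can probably be pointed at the real table. The sketch below assumes a table Paste with columns Data and CaptchaOK, names taken only from the query the asker quotes in the comments, and it assumes Temporary.getLinksFromText already exists. CROSS APPLY is used in place of OUTER APPLY so that rows without any URL are left out, and DISTINCT collapses repeated domains.

```sql
-- Assumed schema: Paste(Data NVARCHAR(MAX), CaptchaOK ...), as quoted by the asker
SELECT DISTINCT L.TheLink
FROM Paste AS P
CROSS APPLY Temporary.getLinksFromText(P.Data) AS L  -- one output row per URL found
WHERE P.CaptchaOK = 0;                               -- only the unverified pastes
```

The function still loops inside each row, but the outer query is a single set-based statement, so no cursor over Paste is needed.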
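【Comments】:
This solution works, but I had to create a cursor to process all the rows, which is not optimal. Can it be done without a cursor, set-based instead? =) I want to process the statement "SELECT Data FROM Paste WHERE paste.CaptchaOK = 0".
What do you mean by paste.CaptchaOK = 0?
I think you have to go through them one by one in any case. Check the updated answer.
"paste.CaptchaOK = 0" is just my selection criterion, nothing to worry about. =)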