SQL: Find rows where Column contains all of the given words
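【Posted】: 2011-04-17 19:42:12
【Question】: I have a column EntityName, and I want users to be able to search names by entering words separated by spaces. The space is implicitly treated as an "AND" operator, meaning that the returned rows must contain all of the specified words, not necessarily in the given order.
For example, if we have rows like these:
1. abba nina pretty ballerina
2. acdc you shook me all night long
3. you are me
4. dream theater it's all about you
When the user enters me you, or you me (the results must be the same), the result contains rows 2 and 3.
I know I can do it like this: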
WHERE Col1 LIKE '%' + word1 + '%'
AND Col1 LIKE '%' + word2 + '%'
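But I was wondering whether there is a more optimal solution.
CONTAINS would require a full-text index, which (for various reasons) is not an option.
Maybe Sql2008 has some built-in, semi-hidden solution for these cases?
【Question comments】:
I was just wondering why the full-text index solution is off the table. It is certainly the way I would want to go here.
Sorry for the late reply. It does not suit us because it does not support searches like the one I used as the example in the question ('%term%': the search is not limited to delimited words, it must also match words that merely contain the term). Also, the SqlServer runs on clustered machines with a shared network drive, and any additional installation is frozen (we would need to install full-text search, since the admins did not include it at installation time). They assured us that doing extra installations on the nodes is hell... so that is why it is off the table...
【Answer 1】: The only thing I can think of is to write a CLR function that does the LIKE comparisons. This should be many times faster.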
Update: now that I think about it, it makes sense that CLR would not help. Two other ideas:
1 - Try indexing Col1 and doing this:
WHERE (Col1 LIKE word1 + '%' or Col1 LIKE '%' + word1 + '%')
AND (Col1 LIKE word2 + '%' or Col1 LIKE '%' + word2 + '%')
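Depending on which kind of search is most common (starts-with vs. substring), this may be an improvement.
2 - Add your own full-text indexing table, where each word is a row in the table. Then you can index it properly.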
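Idea 2 could be sketched roughly as follows. This is only an illustration under stated assumptions: the table variables @Entity, @EntityWord and @Search and their columns are hypothetical names, and STRING_SPLIT assumes SQL Server 2016 or later (on Sql2008 a custom split function would be needed instead). Note that a word table of this kind helps with whole-word and prefix searches (LIKE 'term%'); it does not by itself speed up the '%term%' searches described in the question.

```sql
-- Hypothetical schema: one row per (word, entity). Names are illustrative only.
DECLARE @Entity TABLE (EntityId int PRIMARY KEY, EntityName varchar(200) NOT NULL);
DECLARE @EntityWord TABLE (
    Word     varchar(200) NOT NULL,
    EntityId int          NOT NULL,
    PRIMARY KEY (Word, EntityId)   -- keyed on Word first, so prefix searches may seek
);

INSERT INTO @Entity VALUES
    (1, 'abba nina pretty ballerina'),
    (2, 'acdc you shook me all night long'),
    (3, 'you are me'),
    (4, 'dream theater it''s all about you');

-- Populate the word table (STRING_SPLIT requires SQL Server 2016+).
INSERT INTO @EntityWord (Word, EntityId)
SELECT DISTINCT s.value, e.EntityId
FROM @Entity AS e
CROSS APPLY STRING_SPLIT(e.EntityName, ' ') AS s
WHERE s.value <> '';

-- Search words, one row each.
DECLARE @Search TABLE (Word varchar(200) PRIMARY KEY);
INSERT INTO @Search VALUES ('me'), ('you');

-- An entity qualifies when every search word is a prefix of at least one of its words.
SELECT e.EntityId, e.EntityName
FROM @Entity AS e
JOIN @EntityWord AS w ON w.EntityId = e.EntityId
JOIN @Search AS s ON w.Word LIKE s.Word + '%'
GROUP BY e.EntityId, e.EntityName
HAVING COUNT(DISTINCT s.Word) = (SELECT COUNT(*) FROM @Search);
```

With the sample rows from the question, this should return entities 2 and 3. In a real database the word table would be a permanent, indexed table kept in sync with the source column (for example by a trigger or a batch job), which is a design decision this sketch leaves open.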
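【Comments】:
Although I was against it at first, this seems to be the best solution so far...
I wanted to add an update after trying it: it is just VERY slow... If the LIKE approach finishes in 10 seconds, this CLR function takes... well, I don't know, I stopped it after 20 minutes... so this solution is off the table too...
1. It does not cover the case where the row does not start with the search term (but it is faster, since it can use the index in that case). 2. No problem, we are already considering it. Thanks for the update!
@veljkoz: That is incorrect; #1 does cover substring matching, see the OR clause.
【Answer 2】: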
http://msdn.microsoft.com/en-us/magazine/cc163473.aspx
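【Comments】:
【Answer 3】: You will end up doing a full table scan no matter what.
Collation can apparently make a big difference. Kalen Delaney, in the book "Microsoft SQL Server 2008 Internals", says:
Collation can make a huge difference when SQL Server has to look at almost all the characters in the strings. For instance, look at the following: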
SELECT COUNT(*) FROM tbl WHERE longcol LIKE '%abc%'
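With a binary collation this may execute 10 times faster or more than with a nonbinary Windows collation. And for varchar data, a SQL collation executes seven to eight times faster than a Windows collation.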
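One way to try this without changing the column definition is to apply a COLLATE clause in the predicate itself. The following is a minimal sketch, assuming Latin1_General_BIN2 as the binary collation and a small table variable standing in for tbl; it makes no claim about the speedup on any particular data set. Binary comparison is case- and accent-sensitive, so the results can differ from those of the column's default collation.

```sql
-- Stand-in for tbl/longcol from the quote above; the sample data is illustrative.
DECLARE @tbl TABLE (longcol varchar(200) NOT NULL);
INSERT INTO @tbl VALUES ('xxabcxx'), ('xxABCxx'), ('no match');

-- Binary comparison: only the exact-case row 'xxabcxx' matches.
SELECT COUNT(*) AS BinaryMatches
FROM @tbl
WHERE longcol COLLATE Latin1_General_BIN2 LIKE '%abc%';

-- If case-insensitive matching is still needed, one option is to fold case
-- on both sides before the binary comparison; here both 'abc' rows match.
SELECT COUNT(*) AS FoldedMatches
FROM @tbl
WHERE UPPER(longcol) COLLATE Latin1_General_BIN2 LIKE '%' + UPPER('abc') + '%';
```

Whether the case folding eats up the gain from the binary collation would have to be measured on the real table.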
【Comments】:
【Answer 4】:
WITH Tokens AS (SELECT 'you' AS Token UNION ALL SELECT 'me')
SELECT ...
FROM YourTable AS t
WHERE (SELECT COUNT(*) FROM Tokens WHERE t.Col1 LIKE '%' + Tokens.Token + '%')
      =
      (SELECT COUNT(*) FROM Tokens);
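The same "all tokens must match" condition can also be phrased as "no token fails to match". The sketch below is an illustrative variant, not the answerer's query: the table variable @YourTable and its Id column are assumptions, filled with the sample rows from the question.

```sql
-- Sample data from the question; @YourTable and Id are hypothetical names.
DECLARE @YourTable TABLE (Id int PRIMARY KEY, Col1 varchar(200) NOT NULL);
INSERT INTO @YourTable VALUES
    (1, 'abba nina pretty ballerina'),
    (2, 'acdc you shook me all night long'),
    (3, 'you are me'),
    (4, 'dream theater it''s all about you');

WITH Tokens AS (SELECT 'you' AS Token UNION ALL SELECT 'me')
SELECT t.Id, t.Col1
FROM @YourTable AS t
WHERE NOT EXISTS (
    -- A row is rejected as soon as one token is not found in Col1.
    SELECT 1
    FROM Tokens
    WHERE t.Col1 NOT LIKE '%' + Tokens.Token + '%'
);
```

For the sample data this should return rows 2 and 3, the same rows as the COUNT(*) comparison. Either form still evaluates LIKE '%...%' against every row, so it changes how the condition is expressed, not the need for a scan.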
【Comments】:
【Answer 5】: Function
CREATE FUNCTION [dbo].[fnSplit] ( @sep CHAR(1), @str VARCHAR(512) )
RETURNS TABLE AS
RETURN (
WITH Pieces(pn, start, stop) AS (
SELECT 1, 1, CHARINDEX(@sep, @str)
UNION ALL
SELECT pn + 1, stop + 1, CHARINDEX(@sep, @str, stop + 1)
FROM Pieces
WHERE stop > 0
)
SELECT
pn AS Id,
SUBSTRING(@str, start, CASE WHEN stop > 0 THEN stop - start ELSE 512 END) AS Data
FROM
Pieces
)
Query
DECLARE @FilterTable TABLE (Data VARCHAR(512))
INSERT INTO @FilterTable (Data)
SELECT DISTINCT S.Data
FROM fnSplit(' ', 'word1 word2 word3') S -- Contains words
SELECT DISTINCT
T.*
FROM
MyTable T
INNER JOIN @FilterTable F1 ON T.Col1 LIKE '%' + F1.Data + '%'
LEFT JOIN @FilterTable F2 ON T.Col1 NOT LIKE '%' + F2.Data + '%'
WHERE
F2.Data IS NULL
Source: SQL SELECT WHERE field contains words
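One detail the query above does not handle is that LIKE treats %, _ and [ inside a search word as wildcards. A possible way to deal with that, sketched here under assumptions, is to bracket-escape each word before building the pattern. The sample data and the @search value are made up for illustration, and STRING_SPLIT (SQL Server 2016+) is used in place of fnSplit only to keep the example self-contained.

```sql
-- Illustrative data; @MyTable stands in for MyTable.
DECLARE @MyTable TABLE (Col1 varchar(200) NOT NULL);
INSERT INTO @MyTable VALUES
    ('100% cotton shirt'),
    ('100 cotton shirts'),
    ('cotton 50% blend');

DECLARE @search varchar(512) = '100% cotton';

SELECT t.Col1
FROM @MyTable AS t
WHERE NOT EXISTS (
    SELECT 1
    FROM STRING_SPLIT(@search, ' ') AS s
    WHERE s.value <> ''
      -- Escape '[' first, then '%' and '_', so each is matched literally.
      AND t.Col1 NOT LIKE '%'
          + REPLACE(REPLACE(REPLACE(s.value, '[', '[[]'), '%', '[%]'), '_', '[_]')
          + '%'
);
```

With the escaping in place only '100% cotton shirt' should qualify; without it, the word 100% would also match '100 cotton shirts', because the trailing % would act as a wildcard.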
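【Comments】:
【Answer 6】: Ideally, this should be done with the help of full-text search, as mentioned above. BUT, if you do not have full-text search configured for your database, here is a performance-intensive solution that does a prioritized string search.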
-- table to search in
drop table if exists dbo.myTable;
go
CREATE TABLE dbo.myTable
(
myTableId int NOT NULL IDENTITY (1, 1),
code varchar(200) NOT NULL,
description varchar(200) NOT NULL -- this column contains the values we are going to search in
) ON [PRIMARY]
GO
-- function to split space separated search string into individual words
drop function if exists [dbo].[fnSplit];
go
CREATE FUNCTION [dbo].[fnSplit] (@StringInput nvarchar(max),
@Delimiter nvarchar(1))
RETURNS @OutputTable TABLE (
id nvarchar(1000)
)
AS
BEGIN
DECLARE @String nvarchar(100);
WHILE LEN(@StringInput) > 0
BEGIN
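-- take everything up to the first delimiter (or the whole string if there is none)
SET @String = LEFT(@StringInput,
    ISNULL(NULLIF(CHARINDEX(@Delimiter, @StringInput) - 1, -1), LEN(@StringInput)));
-- drop the extracted word and its delimiter from the front of the input
SET @StringInput = SUBSTRING(@StringInput,
    ISNULL(NULLIF(CHARINDEX(@Delimiter, @StringInput), 0), LEN(@StringInput)) + 1,
    LEN(@StringInput));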
INSERT INTO @OutputTable (id)
VALUES (@String);
END;
RETURN;
END;
GO
-- this is the search script which can be optionally converted to a stored procedure /function
declare @search varchar(max) = 'infection upper acute genito'; -- enter your search string here
-- the searched string above should give rows containing the following
-- infection in upper side with acute genitointestinal tract
-- acute infection in upper teeth
-- acute genitointestinal pain
if (len(trim(@search)) = 0) -- if search string is empty, just return records ordered alphabetically
begin
select 1 as Priority ,myTableid, code, Description from myTable order by Description
return;
end
declare @splitTable Table(
wordRank int Identity(1,1), -- individual words are assigned a priority order (in order of occurrence/position)
word varchar(200)
)
declare @nonWordTable Table( -- table to trim out auxiliary verbs, prepositions etc. from the search
id varchar(200)
)
insert into @nonWordTable values
('of'),
('with'),
('at'),
('in'),
('for'),
('on'),
('by'),
('like'),
('up'),
('off'),
('near'),
('is'),
('are'),
(','),
(':'),
(';')
insert into @splitTable
select id from dbo.fnSplit(@search,' '); -- this function returns one row per space-separated word of the search string; for this example the output is:
-- id
-------------
-- infection
-- upper
-- acute
-- genito
delete s from @splitTable s join @nonWordTable n on s.word = n.id; -- trimming out non-words here
declare @countOfSearchStrings int = (select count(word) from @splitTable); -- count of space separated words for search
declare @highestPriority int = POWER(@countOfSearchStrings,3);
with plainMatches as
(
select myTableid, @highestPriority as Priority from myTable where Description like @search -- exact matches have highest priority
union
select myTableid, @highestPriority-1 as Priority from myTable where Description like @search + '%' -- then with something at the end
union
select myTableid, @highestPriority-2 as Priority from myTable where Description like '%' + @search -- then with something at the beginning
union
select myTableid, @highestPriority-3 as Priority from myTable where Description like '%' + @search + '%' -- then if the word falls somewhere in between
),
splitWordMatches as( -- give each searched word a rank based on its position in the searched string
-- and calculate its char index in the field to search
select myTable.myTableid, (@countOfSearchStrings - s.wordRank) as Priority, s.word,
wordIndex = CHARINDEX(s.word, myTable.Description) from myTable join @splitTable s on myTable.Description like '%'+ s.word + '%'
-- and not exists(select myTableid from plainMatches p where p.myTableId = myTable.myTableId) -- need not look into myTables that have already been found in plainmatches as they are highest ranked
-- this one takes a long time though, so commenting it, will have no impact on the result
),
matchingRowsWithAllWords as (
select myTableid, count(myTableid) as myTableCount from splitWordMatches group by(myTableid) having count(myTableid) = @countOfSearchStrings
)
, -- trim off the CTE here if you don't care about the ordering of words to be considered for priority
wordIndexRatings as( -- reverse the char indexes retrieved above so that words occurring earlier get a higher weight
-- and then normalize them to sequential values
select s.myTableid, Priority, word, ROW_NUMBER() over (partition by s.myTableid order by wordindex desc) as comparativeWordIndex
from splitWordMatches s join matchingRowsWithAllWords m on s.myTableId = m.myTableId
)
,
wordIndexSequenceRatings as ( -- need to do this to ensure that if the same set of words from search string is found in two rows,
-- their sequence in the field value is taken into account for higher priority
select w.myTableid, w.word, (w.Priority + w.comparativeWordIndex + coalesce(sequncedPriority ,0)) as Priority
from wordIndexRatings w left join
(
select w1.myTableid, w1.priority, w1.word, w1.comparativeWordIndex, count(w1.myTableid) as sequncedPriority
from wordIndexRatings w1 join wordIndexRatings w2 on w1.myTableId = w2.myTableId and w1.Priority > w2.Priority and w1.comparativeWordIndex>w2.comparativeWordIndex
group by w1.myTableid, w1.priority,w1.word, w1.comparativeWordIndex
)
sequencedPriority on w.myTableId = sequencedPriority.myTableId and w.Priority = sequencedPriority.Priority
),
prioritizedSplitWordMatches as ( -- this calculates the cumulative priority for a field value
select w1.myTableId, sum(w1.Priority) as OverallPriority from wordIndexSequenceRatings w1 join wordIndexSequenceRatings w2 on w1.myTableId = w2.myTableId
where w1.word <> w2.word group by w1.myTableid
),
completeSet as (
select myTableid, priority from plainMatches -- get plain matches which should be highest ranked
union
select myTableid, OverallPriority as priority from prioritizedSplitWordMatches -- get ranked split word matches (which are ordered based on word rank in search string and sequence)
),
maximizedCompleteSet as( -- set the priority of a field value = maximum priority for that field value
select myTableid, max(priority) as Priority from completeSet group by myTableId
)
select priority, myTable.myTableid , code, Description from maximizedCompleteSet m join myTable on m.myTableId = myTable.myTableId
order by Priority desc, Description -- order by priority desc to get highest rated items on top
--offset 0 rows fetch next 50 rows only -- optional paging
【Comments】: