SQL: Find rows where Column contains all of the given words

[Posted]: 2011-04-17 19:42:12

[Question]:

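I have a column EntityName, and I want users to be able to search names by entering words separated by spaces. The space is implicitly treated as an "AND" operator, meaning that the returned rows must contain all of the specified words, not necessarily in the given order.

For example, if we have rows like this: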

    abba nina pretty ballerina
    acdc you shook me all night long
    you are me
    dream theater it's all about you

When the user enters: me you, or you me (the results must be the same), the result contains rows 2 and 3.

I know I can do it like this:

WHERE Col1 LIKE '%' + word1 + '%'
  AND Col1 LIKE '%' + word2 + '%'

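But I'd like to know whether there is a more optimal solution.

CONTAINS requires a full-text index, which (for various reasons) is not an option.

Maybe SQL Server 2008 has some built-in, semi-hidden solution for these cases?

[Comments]: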

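I'm just wondering why the full-text index solution is off the table. It is certainly the way I would go here.

Sorry for the late reply. It doesn't suit us because it doesn't support searches like the one I used as the example in the question ('%term%': the search is not limited to whole words, it also has to match words that merely contain the term). In addition, SQL Server runs on clustered machines with a shared network drive, and any additional installation is frozen (we would need to install Full-Text Search, because the admins didn't include it during setup). They assured us that additional installations on the nodes are hell... so that's why it's off the table...

[Answer 1]: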

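The only thing I can think of is to write a CLR function that does the LIKE comparison. It should be many times faster.

Update: Now that I think about it, it makes sense that CLR didn't help. Two other ideas:

1 - Try indexing Col1 and doing the following: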

WHERE (Col1 LIKE word1 + '%' or Col1 LIKE '%' + word1 + '%')
  AND (Col1 LIKE word2 + '%' or Col1 LIKE '%' + word2 + '%')

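Depending on which kind of search is most common (starts-with vs. substring), this may offer an improvement.

2 - Add your own full-text index table, where each word is a row in the table. Then you can index it properly.

[Comments]: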
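A minimal sketch of what idea 2 could look like, under stated assumptions. The names dbo.Entity, dbo.EntityWord, and @SearchWords are hypothetical and do not come from the answer, and keeping the word table in sync with EntityName (for example from application code or a trigger) is left out. Note that an index on the word only helps prefix matches (LIKE 'term%'), so this does not by itself cover the '%term%' requirement from the question.

```sql
-- Hypothetical names throughout: dbo.Entity, dbo.EntityWord, @SearchWords.
CREATE TABLE dbo.Entity (
    EntityId   int IDENTITY(1, 1) PRIMARY KEY,
    EntityName varchar(200) NOT NULL
);

-- One row per distinct (word, entity) pair; it has to be maintained
-- whenever EntityName changes.
CREATE TABLE dbo.EntityWord (
    EntityId int          NOT NULL REFERENCES dbo.Entity (EntityId),
    Word     varchar(200) NOT NULL,
    PRIMARY KEY (Word, EntityId)   -- keyed on Word first, so a prefix LIKE can seek
);

DECLARE @SearchWords TABLE (Word varchar(200) PRIMARY KEY);
INSERT INTO @SearchWords (Word) VALUES ('you'), ('me');

-- Keep the entities for which no search word is left unmatched,
-- i.e. every search word is a prefix of at least one stored word.
SELECT e.EntityId, e.EntityName
FROM dbo.Entity AS e
WHERE NOT EXISTS (
    SELECT 1
    FROM @SearchWords AS s
    WHERE NOT EXISTS (
        SELECT 1
        FROM dbo.EntityWord AS w
        WHERE w.EntityId = e.EntityId
          AND w.Word LIKE s.Word + '%'
    )
);
```

The double NOT EXISTS expresses "all words must match" without counting, so repeated search words do not change the result.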

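Although I was against it at first, this seems to be the best solution so far...

After trying it out, I'd like to add an update: it is just very slow... If the 'like' approach finishes in 10 seconds, this CLR function takes... well, I don't know, I just stopped it after 20 minutes... So this solution is off the table too...

1. Doesn't cover the case where the row doesn't start with the search term (but it is faster, because in that case it can use the index). 2. No problem, we're already considering it. Thanks for the update!

@veljkoz: That is incorrect, #1 does cover substring matching, see the OR clause.

[Answer 2]: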

http://msdn.microsoft.com/en-us/magazine/cc163473.aspx

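[Comments]:

[Answer 3]:

Either way, you will end up doing a full table scan.

The collation can apparently make a big difference. Kalen Delaney says in the book "Microsoft SQL Server 2008 Internals":

Collation can make a huge difference when SQL Server has to look at almost all of the characters in the strings. For example, look at the following: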

SELECT COUNT(*) FROM tbl WHERE longcol LIKE '%abc%'

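This may execute 10 times faster or more with a binary collation than with a nonbinary Windows collation. And with varchar data, it executes seven to eight times faster with a SQL collation than with a Windows collation.

[Comments]:

[Answer 4]: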
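As a rough illustration of the point above, a collation can be applied to a single comparison with the COLLATE clause. This is only a sketch: tbl and longcol are the names from the quoted example, Latin1_General_BIN2 is one binary collation picked for illustration, and the upper-casing in the second query is an assumption about how case-insensitive behavior might be approximated. No speed-up is claimed here beyond what the quote states.

```sql
-- tbl and longcol are the names used in the quoted example.
-- A binary collation compares code points, so this match is
-- case-sensitive and accent-sensitive.
SELECT COUNT(*)
FROM tbl
WHERE longcol COLLATE Latin1_General_BIN2 LIKE '%abc%';

-- Assumption: folding the column and the search term to upper case
-- is an acceptable stand-in for a case-insensitive match.
SELECT COUNT(*)
FROM tbl
WHERE UPPER(longcol) COLLATE Latin1_General_BIN2 LIKE '%ABC%';
```

Because the binary comparison changes which rows match, it is worth checking the results against the original collation before relying on it.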
WITH Tokens AS (SELECT 'you' AS Token UNION ALL SELECT 'me')
SELECT ...
FROM YourTable AS t
-- a row qualifies when the number of tokens it contains equals the total number of tokens
WHERE (SELECT COUNT(*) FROM Tokens WHERE t.Col1 LIKE '%' + Tokens.Token + '%')
    = (SELECT COUNT(*) FROM Tokens);
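To try the token-counting idea on the sample rows from the question, something like the following self-contained script could be used. The table variable @YourTable and its Id column are assumptions made for the demo, and the sample rows are translated from the question.

```sql
-- Demo data: @YourTable and Id are hypothetical, the rows come from the question.
DECLARE @YourTable TABLE (Id int PRIMARY KEY, Col1 varchar(200) NOT NULL);
INSERT INTO @YourTable (Id, Col1) VALUES
    (1, 'abba nina pretty ballerina'),
    (2, 'acdc you shook me all night long'),
    (3, 'you are me'),
    (4, 'dream theater it''s all about you');

-- A row qualifies when it contains as many of the tokens as there are tokens.
WITH Tokens AS (SELECT 'you' AS Token UNION ALL SELECT 'me')
SELECT t.Id, t.Col1
FROM @YourTable AS t
WHERE (SELECT COUNT(*) FROM Tokens WHERE t.Col1 LIKE '%' + Tokens.Token + '%')
    = (SELECT COUNT(*) FROM Tokens);
```

With these rows the query returns Id 2 and Id 3, and the order of the tokens does not matter. Every row is still compared against every token, so this tidies the syntax rather than avoiding the scan.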

[Comments]:

[Answer 5]:

Function

 CREATE FUNCTION [dbo].[fnSplit] ( @sep CHAR(1), @str VARCHAR(512) )
 RETURNS TABLE AS
 RETURN (
           WITH Pieces(pn, start, stop) AS (
           SELECT 1, 1, CHARINDEX(@sep, @str)
           UNION ALL
           SELECT pn + 1, stop + 1, CHARINDEX(@sep, @str, stop + 1)
           FROM Pieces
           WHERE stop > 0
      )

      SELECT
           pn AS Id,
           SUBSTRING(@str, start, CASE WHEN stop > 0 THEN stop - start ELSE 512 END) AS Data
      FROM
           Pieces
 )

Query

 DECLARE @FilterTable TABLE (Data VARCHAR(512))

 INSERT INTO @FilterTable (Data)
 SELECT DISTINCT S.Data
 FROM fnSplit(' ', 'word1 word2 word3') S -- Contains words

 SELECT DISTINCT
      T.*
 FROM
      MyTable T
      INNER JOIN @FilterTable F1 ON T.Col1 LIKE '%' + F1.Data + '%'
      LEFT JOIN @FilterTable F2 ON T.Col1 NOT LIKE '%' + F2.Data + '%'
 WHERE
      F2.Data IS NULL

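Source: SQL SELECT WHERE field contains words

[Comments]:

[Answer 6]:

Ideally, this should be done with the help of full-text search, as mentioned above. However, if you don't have full text configured for your database, here is a performance-intensive solution that performs a prioritized string search.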

-- table to search in
drop table if exists dbo.myTable;
go
CREATE TABLE dbo.myTable
    (
    myTableId int NOT NULL IDENTITY (1, 1),
    code varchar(200) NOT NULL, 
    description varchar(200) NOT NULL -- this column contains the values we are going to search in 
    )  ON [PRIMARY]
GO

-- function to split space separated search string into individual words
drop function if exists [dbo].[fnSplit];
go
CREATE FUNCTION [dbo].[fnSplit] (@StringInput nvarchar(max),
@Delimiter nvarchar(1))
RETURNS @OutputTable TABLE (
  id nvarchar(1000)
)
AS
BEGIN
  DECLARE @String nvarchar(100);

  WHILE LEN(@StringInput) > 0
  BEGIN
    SET @String = LEFT(@StringInput, ISNULL(NULLIF(CHARINDEX(@Delimiter, @StringInput) - 1, -1),
    LEN(@StringInput)));
    SET @StringInput = SUBSTRING(@StringInput, ISNULL(NULLIF(CHARINDEX
    (
    @Delimiter, @StringInput
    ),
    0
    ), LEN
    (
    @StringInput)
    )
    + 1, LEN(@StringInput));

    INSERT INTO @OutputTable (id)
      VALUES (@String);
  END;

  RETURN;
END;
GO

-- this is the search script which can be optionally converted to a stored procedure /function


declare @search varchar(max) = 'infection upper acute genito'; -- enter your search string here
-- the searched string above should give rows containing the following
-- infection in upper side with acute genitointestinal tract
-- acute infection in upper teeth
-- acute genitointestinal pain

if (len(trim(@search)) = 0) -- if search string is empty, just return records ordered alphabetically
begin
 select 1 as Priority ,myTableid, code, Description from myTable order by Description 
 return;
end

declare @splitTable Table(
wordRank int Identity(1,1), -- individual words are assigned priority order (in order of occurrence/position)
word varchar(200)
)
declare @nonWordTable Table( -- table to trim out auxiliary verbs, prepositions etc. from the search
id varchar(200)
)

insert into @nonWordTable values
('of'),
('with'),
('at'),
('in'),
('for'),
('on'),
('by'),
('like'),
('up'),
('off'),
('near'),
('is'),
('are'),
(','),
(':'),
(';')

insert into @splitTable
select id from dbo.fnSplit(@search,' '); -- this function gives you a table with rows containing all the space separated words of the search like in this e.g., the output will be -
--  id
-------------
-- infection
-- upper
-- acute
-- genito

delete s from @splitTable s join @nonWordTable n  on s.word = n.id; -- trimming out non-words here
declare @countOfSearchStrings int = (select count(word) from @splitTable);  -- count of space separated words for search
declare @highestPriority int = POWER(@countOfSearchStrings,3);

with plainMatches as
(
select myTableid, @highestPriority as Priority from myTable where Description like @search  -- exact matches have highest priority
union                                      
select myTableid, @highestPriority-1 as Priority from myTable where Description like  @search + '%'  -- then with something at the end
union                                      
select myTableid, @highestPriority-2 as Priority from myTable where Description like '%' + @search -- then with something at the beginning
union                                      
select myTableid, @highestPriority-3 as Priority from myTable where Description like '%' + @search + '%' -- then if the word falls somewhere in between
),
splitWordMatches as( -- give each searched word a rank based on its position in the searched string
                     -- and calculate its char index in the field to search
select myTable.myTableid, (@countOfSearchStrings - s.wordRank) as Priority, s.word,
wordIndex = CHARINDEX(s.word, myTable.Description)  from myTable join @splitTable s on myTable.Description like '%'+ s.word + '%'
-- and not exists(select myTableid from plainMatches p where p.myTableId = myTable.myTableId) -- need not look into myTables that have already been found in plainmatches as they are highest ranked
                                                                              -- this one takes a long time though, so commenting it, will have no impact on the result
),
matchingRowsWithAllWords as (
 select myTableid, count(myTableid) as myTableCount from splitWordMatches group by(myTableid) having count(myTableid) = @countOfSearchStrings
)
, -- trim off the CTE here if you don't care about the ordering of words to be considered for priority
wordIndexRatings as( -- reverse the char indexes retrieved above so that words occurring earlier have higher weightage
                     -- and then normalize them to sequential values
select s.myTableid, Priority, word, ROW_NUMBER() over (partition by s.myTableid order by wordindex desc) as comparativeWordIndex 
from splitWordMatches s join matchingRowsWithAllWords m on s.myTableId = m.myTableId
)
,
wordIndexSequenceRatings as ( -- need to do this to ensure that if the same set of words from search string is found in two rows,
                              -- their sequence in the field value is taken into account for higher priority
    select w.myTableid, w.word, (w.Priority + w.comparativeWordIndex + coalesce(sequncedPriority ,0)) as Priority
    from wordIndexRatings w left join 
    (
     select w1.myTableid, w1.priority, w1.word, w1.comparativeWordIndex, count(w1.myTableid) as sequncedPriority
     from wordIndexRatings w1 join wordIndexRatings w2 on w1.myTableId = w2.myTableId and w1.Priority > w2.Priority and w1.comparativeWordIndex>w2.comparativeWordIndex
     group by w1.myTableid, w1.priority,w1.word, w1.comparativeWordIndex
    ) 
    sequencedPriority on w.myTableId = sequencedPriority.myTableId and w.Priority = sequencedPriority.Priority
),
prioritizedSplitWordMatches as ( -- this calculates the cumulative priority for a field value
select  w1.myTableId, sum(w1.Priority) as OverallPriority from wordIndexSequenceRatings w1 join wordIndexSequenceRatings w2 on w1.myTableId =  w2.myTableId 
where w1.word <> w2.word group by w1.myTableid 
),
completeSet as (
select myTableid, priority from plainMatches -- get plain matches which should be highest ranked
union
select myTableid, OverallPriority as priority from prioritizedSplitWordMatches -- get ranked split word matches (which are ordered based on word rank in search string and sequence)
),
maximizedCompleteSet as( -- set the priority of a field value = maximum priority for that field value
select myTableid, max(priority) as Priority  from completeSet group by myTableId
)
select priority, myTable.myTableid , code, Description from maximizedCompleteSet m join myTable  on m.myTableId = myTable.myTableId 
order by Priority desc, Description -- order by priority desc to get highest rated items on top
--offset 0 rows fetch next 50 rows only -- optional paging

[Comments]:
