SQL Server best method to match word phrases and order relevance

【Posted】: 2011-09-23 12:41:52 【Question】:

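What is the best method to rank a SQL varchar column by the number (count)/match of words in a parameter, with four distinct criteria? This is likely not a trivial question, but I am challenged to order the rows by "best match" using my criteria.

Column: Description varchar(100)
Parameter: @MyParameter varchar(100)

Output with this order preference:

Exact match (entire string matches) - always first
Begins with (descending, based on the length of the parameter that matches)
Count of words matched, with contiguous words ranking higher for the same count of matching words
Word(s) matched anywhere (not contiguous)

Words MAY not match exactly, as partial matches of a word are allowed and likely. A lesser value should be applied to partial words for ranking, but this is not critical (pot would match each of: pot, potter, potholder, depot, depotting, for instance). A "begins with" match that also has other word matches should rank higher than one with no subsequent matches, but that is not a deal killer/super important.

I would like a method to rank where the column "begins with" the value in the parameter. Say I have the following string: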

'This is my value string as a test template to rank on.'

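First, I want a ranking where the greatest count of words is present in the column/row.

And second, a ranking based on occurrence (best match) at the start, as: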

'This is my string as a test template to rank on.' - first
'This is my string as a test template to rank on even though not exact.'-second
'This is my string as a test template to rank' - third
'This is my string as a test template to' - next
'This is my string as a test template' - next etc.

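SECOND: (possibly a second set/group of data after the first (begins with) group; this is desired)

I want to rank (sort) the rows by the count of the words in @MyParameter that occur in the description column, with contiguous words ranking higher than the same count of separated words.

Thus, for the example string above, 'is my string as shown' would rank higher than 'is not my other string as', because the contiguous string (words together) is a "better match" for the same count of words. Rows with a higher match (count of words that occur) would be ordered descending, best match first.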
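The ordering described above can be made concrete outside SQL. The following Python is only a sketch under stated assumptions: the names `words` and `rank_key`, the tokenizer, and the reading of "contiguous" as adjacent words in the description that all occur in the parameter are hypothetical, and partial-word matching is left out.

```python
import re

def words(text):
    # Assumed tokenizer: lower-cased runs of letters and digits.
    return re.findall(r"[a-z0-9]+", text.lower())

def rank_key(description, parameter):
    """Return a tuple that sorts better matches higher (hypothetical scoring)."""
    d, p = words(description), words(parameter)
    wanted = set(p)

    # Criterion 1: the whole string matches.
    exact = description.strip().lower() == parameter.strip().lower()

    # Criterion 2 ("begins with"): number of leading words shared with the parameter.
    leading = 0
    for a, b in zip(d, p):
        if a != b:
            break
        leading += 1

    # Criterion 3: count of description words found anywhere in the parameter.
    matched = sum(1 for w in d if w in wanted)

    # Tie-breaker: longest run of adjacent description words that all occur
    # in the parameter (assumed meaning of "contiguous").
    longest = run = 0
    for w in d:
        run = run + 1 if w in wanted else 0
        longest = max(longest, run)

    return (exact, leading, matched, longest)

parameter = "This is my value string as a test template to rank on."
rows = ["is not my other string as", "is my string as shown"]
rows.sort(key=lambda r: rank_key(r, parameter), reverse=True)
print(rows[0])  # is my string as shown
```

Both sample rows match four words, so the run length (4 versus 2) decides the order, which is the behavior asked for in the example.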

If possible, I would like to do this in a single query.

No row should appear more than once in the result.

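For performance considerations, there will be no more than 10,000 rows in the table.

The values in the table are fairly static, with little change, but not totally.

I cannot change the structure at this time but would consider that later (such as a word/phrase table).

To make this slightly more complicated, the word list is in two tables. I can create a view for that, but the results from one table (the smaller list) should appear before the results from the second, larger dataset given the same match. There will be duplicates between these tables as well as within a table, and I only want distinct values. SELECT DISTINCT is not easy here, because I want to return one column (sourceTable) that will very likely make the rows distinct; in that case I want only the row from the first (smaller) table, with all the other columns DISTINCT (do not consider that column in the "distinct" evaluation).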
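The "distinct, but ignore sourceTable and prefer the smaller table" requirement can be illustrated with a small Python sketch. The function name `distinct_preferring_first`, the dictionary row shape, and the sample values are assumptions made for illustration; they are not from the original tables.

```python
def distinct_preferring_first(small_rows, large_rows, ignored=("sourceTable",)):
    """Keep the first occurrence of each row, comparing every column except `ignored`."""
    seen = set()
    result = []
    # Rows from the smaller table come first, so they win over later duplicates.
    for row in list(small_rows) + list(large_rows):
        key = tuple(sorted((k, v) for k, v in row.items() if k not in ignored))
        if key not in seen:
            seen.add(key)
            result.append(row)
    return result

# Hypothetical sample rows.
small = [{"procedureCode": "A1", "description": "Gastric", "sourceTable": "small"}]
large = [
    {"procedureCode": "A1", "description": "Gastric", "sourceTable": "large"},
    {"procedureCode": "B2", "description": "Lavage", "sourceTable": "large"},
]
result = distinct_preferring_first(small, large)
print([r["sourceTable"] for r in result])  # ['small', 'large']
```

In SQL the same idea is usually expressed by numbering the rows within each group of duplicate columns and keeping only the first, ordered by the source table.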

Pseudo columns in the table:

procedureCode   VARCHAR(50),
description VARCHAR(100), -- this is the sort/evaluation column
category    VARCHAR(50),
relvu       VARCHAR(50),
charge  VARCHAR(15),
active  bit,
sourceTable   VARCHAR(50) -- just shows which of the two tables the row comes from

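No unique index, such as an ID column, exists.

Matches that are in a third table are to be excluded:

SELECT * FROM (
    select * from tableone where procedureCode not in (select procedureCode from tablethree)
    UNION ALL
    select * from tabletwo where procedureCode not in (select procedureCode from tablethree)
) AS combined

EDIT: In an attempt to solve this, I have created a table-valued parameter like so: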

0       Gastric Intubation & Aspiration/Lavage, Treatmen
1       Gastric%Intubation%Aspiration%Lavage%Treatmen
2       Gastric%Intubation%Aspiration%Lavage
3       Gastric%Intubation%Aspiration
4       Gastric%Intubation
5       Gastric
6       Intubation%Aspiration%Lavage%Treatmen
7       Intubation%Aspiration%Lavage
8       Intubation%Aspiration
9       Intubation
10      Aspiration%Lavage%Treatmen
11      Aspiration%Lavage
12      Aspiration
13      Lavage%Treatmen
14      Lavage
15      Treatmen

where the actual phrase is in row 0.
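The rows above appear to be every contiguous run of the phrase's words, joined with `%`, longest first for each starting word. A minimal Python sketch that builds the same list follows; the function name `search_phrases` and the rule of splitting on anything that is not a letter or digit are assumptions made for illustration.

```python
import re

def search_phrases(phrase):
    """Return the original phrase followed by all contiguous word runs joined with '%'."""
    # Assumed tokenizer: keep runs of letters and digits, drop punctuation such as & / ,
    tokens = re.findall(r"[A-Za-z0-9]+", phrase)
    rows = [phrase]  # row 0 is the phrase exactly as entered
    for start in range(len(tokens)):
        # For each starting word, emit the longest run first, then shorter ones.
        for end in range(len(tokens), start, -1):
            rows.append("%".join(tokens[start:end]))
    return rows

rows = search_phrases("Gastric Intubation & Aspiration/Lavage, Treatmen")
for number, row in enumerate(rows):
    print(number, row)
```

A phrase of n words produces n(n+1)/2 patterns plus row 0, so the list grows quadratically and may need a cap for long descriptions.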

Here is my current attempt:

CREATE PROCEDURE [GetProcedureByDescription]
(   
        @IncludeMaster  BIT,
        @ProcedureSearchPhrases CPTFavorite READONLY

)
AS

    DECLARE @myIncludeMaster    BIT;

    SET @myIncludeMaster = @IncludeMaster;

    CREATE TABLE #DistinctMatchingCpts
    (
    procedureCode   VARCHAR(50),
    description     VARCHAR(100),
    category        VARCHAR(50),
    rvu     VARCHAR(50),
    charge      VARCHAR(15),
    active      VARCHAR(15),
    sourceTable   VARCHAR(50),
    sequenceSet VARCHAR(2)
    )

    IF @myIncludeMaster = 0
        BEGIN -- Excluding master from search   
          INSERT INTO #DistinctMatchingCpts (sourceTable, procedureCode, description    ,   category  ,charge, active, rvu, sequenceSet
) 
      SELECT DISTINCT sourceTable, procedureCode, description, category ,charge, active, rvu, sequenceSet
          FROM (
                  SELECT TOP 1
                      LTRIM(RTRIM(CPT.[CODE])) AS procedureCode, 
                      LTRIM(RTRIM(CPT.[LEVEL])) AS description, 
                      LTRIM(RTRIM(CPT.[COMBO])) AS category,
                      LTRIM(RTRIM(CPT.[CHARGE])) AS charge,
                      ''True'' AS active,
                      LTRIM(RTRIM([RVU])) AS rvu,
                      ''0CPTMore'' AS sourceTable,
                      ''01'' AS sequenceSet
                  FROM 
                    @ProcedureSearchPhrases PP
                    INNER JOIN  [CPTMORE] AS CPT
                      ON CPT.[LEVEL] = PP.[LEVEL]
                  WHERE 
                      (CPT.[COMBO] IS NULL OR CPT.[COMBO] NOT IN (''Editor'',''MOD'',''CATEGORY'',''Types'',''Bundles''))
                      AND CPT.[CODE] IS NOT NULL
                      AND CPT.[CODE] NOT IN (''0'', '''')
                    AND CPT.[CODE] NOT IN (SELECT CPTE.[CODE] FROM CPT AS CPTE WHERE CPTE.[CODE] IS NOT NULL)
                  ORDER BY PP.CODE

          UNION ALL

                  SELECT 
                      LTRIM(RTRIM(CPT.[CODE])) AS procedureCode, 
                      LTRIM(RTRIM(CPT.[LEVEL])) AS description, 
                      LTRIM(RTRIM(CPT.[COMBO])) AS category,
                      LTRIM(RTRIM([CHARGE])) AS charge,
                      ''True'' AS active,
                      LTRIM(RTRIM([RVU])) AS rvu,
                      ''0CPTMore'' AS sourceTable, 
                      ''02'' AS sequenceSet
                  FROM 
                    @ProcedureSearchPhrases PP
                    INNER JOIN  [CPTMORE] AS CPT
                      ON CPT.[LEVEL] LIKE PP.[LEVEL] + ''%''
                  WHERE 
                      (CPT.[COMBO] IS NULL OR CPT.[COMBO] NOT IN (''Editor'',''MOD'',''CATEGORY'',''Types'',''Bundles''))
                      AND CPT.[CODE] IS NOT NULL
                      AND CPT.[CODE] NOT IN (''0'', '''')
                    AND CPT.[CODE] NOT IN (SELECT CPTE.[CODE] FROM CPT AS CPTE WHERE CPTE.[CODE] IS NOT NULL)

          UNION ALL

            SELECT 
                      LTRIM(RTRIM(CPT.[CODE])) AS procedureCode, 
                      LTRIM(RTRIM(CPT.[LEVEL])) AS description, 
                      LTRIM(RTRIM(CPT.[COMBO])) AS category,
                      LTRIM(RTRIM(CPT.[CHARGE])) AS charge,
                      ''True'' AS active,
                      LTRIM(RTRIM([RVU])) AS rvu,
                      ''0CPTMore'' AS sourceTable,
                      ''03'' AS sequenceSet
                  FROM 
                    @ProcedureSearchPhrases PP
                    INNER JOIN  [CPTMORE] AS CPT
                      ON CPT.[LEVEL] LIKE ''%'' + PP.[LEVEL] + ''%''
                  WHERE 
                      (CPT.[COMBO] IS NULL OR CPT.[COMBO] NOT IN (''Editor'',''MOD'',''CATEGORY'',''Types'',''Bundles''))
                      AND CPT.[CODE] IS NOT NULL
                      AND CPT.[CODE] NOT IN (''0'', '''')
                    AND CPT.[CODE] NOT IN (SELECT CPTE.[CODE] FROM CPT AS CPTE WHERE CPTE.[CODE] IS NOT NULL)

            ) AS CPTS
            ORDER BY 
                 procedureCode, sourceTable, [description]
        END -- Excluded master from search
    ELSE
        BEGIN -- Including master in search, but present favorites before master for each code
            -- Get matching procedures, ordered by code, source (favorites first), and description.
            -- There probably will be procedures with duplicated code+description, so we will filter
            -- duplicates shortly.
      INSERT INTO #DistinctMatchingCpts (sourceTable, procedureCode, description    ,   category  ,charge, active, rvu, sequenceSet) 
      SELECT DISTINCT sourceTable, procedureCode, description, category ,charge, active, rvu, sequenceSet
          FROM (
                  SELECT TOP 1
                      LTRIM(RTRIM(CPT.[CODE])) AS procedureCode, 
                      LTRIM(RTRIM(CPT.[LEVEL])) AS description, 
                      LTRIM(RTRIM(CPT.[COMBO])) AS category,
                      LTRIM(RTRIM(CPT.[CHARGE])) AS charge,
                      ''True'' AS active,
                      LTRIM(RTRIM([RVU])) AS rvu,
                      ''0CPTMore'' AS sourceTable,
                      ''00'' AS sequenceSet
                FROM 
                    @ProcedureSearchPhrases PP
                    INNER JOIN  [CPTMORE] AS CPT
                      ON CPT.[LEVEL] = PP.[LEVEL]
                  WHERE 
                      (CPT.[COMBO] IS NULL OR CPT.[COMBO] NOT IN (''Editor'',''MOD'',''CATEGORY'',''Types'',''Bundles''))
                      AND CPT.[CODE] IS NOT NULL
                      AND CPT.[CODE] NOT IN (''0'', '''')
                    AND CPT.[CODE] NOT IN (SELECT CPTE.[CODE] FROM CPT AS CPTE WHERE CPTE.[CODE] IS NOT NULL)
                  ORDER BY PP.CODE

                  UNION ALL

                  SELECT TOP 1
                      LTRIM(RTRIM(CPT.[CODE])) AS procedureCode, 
                      LTRIM(RTRIM(CPT.[LEVEL])) AS description, 
                      LTRIM(RTRIM(CPT.[CATEGORY])) AS category,
                      LTRIM(RTRIM(CPT.[CHARGE])) AS charge,
                      COALESCE(CASE [ACTIVE] WHEN 1 THEN ''True'' WHEN 0 THEN ''False'' WHEN '''' THEN ''False'' ELSE ''False'' END,''True'') AS active,
                      LTRIM(RTRIM([RVU])) AS rvu,
                      ''2MasterCPT'' AS sourceTable,
                      ''00'' AS sequenceSet
                  FROM 
                    @ProcedureSearchPhrases PP
                    INNER JOIN  [MASTERCPT] AS CPT
                      ON CPT.[LEVEL] = PP.[LEVEL]
                  WHERE 
                      CPT.[CODE] IS NOT NULL
                      AND CPT.[CODE] NOT IN (''0'', '''')
                    AND CPT.[CODE] NOT IN (SELECT CPTE.[CODE] FROM CPT AS CPTE WHERE CPTE.[CODE] IS NOT NULL)
                  ORDER BY PP.CODE

                  UNION ALL

                  SELECT 
                      LTRIM(RTRIM(CPT.[CODE])) AS procedureCode, 
                      LTRIM(RTRIM(CPT.[LEVEL])) AS description, 
                      LTRIM(RTRIM(CPT.[COMBO])) AS category,
                      LTRIM(RTRIM(CPT.[CHARGE])) AS charge,
                      ''True'' AS active,
                      LTRIM(RTRIM([RVU])) AS rvu,
                      ''0CPTMore'' AS sourceTable,
                      ''01'' AS sequenceSet
                FROM 
                    @ProcedureSearchPhrases PP
                    INNER JOIN  [CPTMORE] AS CPT
                      ON CPT.[LEVEL] = PP.[LEVEL]
                  WHERE 
                      (CPT.[COMBO] IS NULL OR CPT.[COMBO] NOT IN (''Editor'',''MOD'',''CATEGORY'',''Types'',''Bundles''))
                      AND CPT.[CODE] IS NOT NULL
                      AND CPT.[CODE] NOT IN (''0'', '''')
                    AND CPT.[CODE] NOT IN (SELECT CPTE.[CODE] FROM CPT AS CPTE WHERE CPTE.[CODE] IS NOT NULL)

                  UNION ALL

                  SELECT 
                      LTRIM(RTRIM(CPT.[CODE])) AS procedureCode, 
                      LTRIM(RTRIM(CPT.[LEVEL])) AS description, 
                      LTRIM(RTRIM(CPT.[CATEGORY])) AS category,
                      LTRIM(RTRIM(CPT.[CHARGE])) AS charge,
                      COALESCE(CASE [ACTIVE] WHEN 1 THEN ''True'' WHEN 0 THEN ''False'' WHEN '''' THEN ''False'' ELSE ''False'' END,''True'') AS active,
                      LTRIM(RTRIM([RVU])) AS rvu,
                      ''2MasterCPT'' AS sourceTable,
                      ''01'' AS sequenceSet
                  FROM 
                    @ProcedureSearchPhrases PP
                    INNER JOIN  [MASTERCPT] AS CPT
                      ON CPT.[LEVEL] = PP.[LEVEL]
                  WHERE 
                      CPT.[CODE] IS NOT NULL
                      AND CPT.[CODE] NOT IN (''0'', '''')
                    AND CPT.[CODE] NOT IN (SELECT CPTE.[CODE] FROM CPT AS CPTE WHERE CPTE.[CODE] IS NOT NULL)

                  UNION ALL

                  SELECT TOP 1
                      LTRIM(RTRIM(CPT.[CODE])) AS procedureCode, 
                      LTRIM(RTRIM(CPT.[LEVEL])) AS description, 
                      LTRIM(RTRIM(CPT.[COMBO])) AS category,
                      LTRIM(RTRIM(CPT.[CHARGE])) AS charge,
                      ''True'' AS active,
                      LTRIM(RTRIM([RVU])) AS rvu,
                      ''0CPTMore'' AS sourceTable,
                      ''02'' AS sequenceSet
                FROM 
                    @ProcedureSearchPhrases PP
                    INNER JOIN  [CPTMORE] AS CPT
                      ON CPT.[LEVEL] LIKE PP.[LEVEL] + ''%''
                  WHERE 
                      (CPT.[COMBO] IS NULL OR CPT.[COMBO] NOT IN (''Editor'',''MOD'',''CATEGORY'',''Types'',''Bundles''))
                      AND CPT.[CODE] IS NOT NULL
                      AND CPT.[CODE] NOT IN (''0'', '''')
                    AND CPT.[CODE] NOT IN (SELECT CPTE.[CODE] FROM CPT AS CPTE WHERE CPTE.[CODE] IS NOT NULL)
                  ORDER BY PP.CODE

                  UNION ALL

                  SELECT TOP 1
                      LTRIM(RTRIM(CPT.[CODE])) AS procedureCode, 
                      LTRIM(RTRIM(CPT.[LEVEL])) AS description, 
                      LTRIM(RTRIM(CPT.[CATEGORY])) AS category,
                      LTRIM(RTRIM(CPT.[CHARGE])) AS charge,
                      COALESCE(CASE [ACTIVE] WHEN 1 THEN ''True'' WHEN 0 THEN ''False'' WHEN '''' THEN ''False'' ELSE ''False'' END,''True'') AS active,
                      LTRIM(RTRIM([RVU])) AS rvu,
                      ''2MasterCPT'' AS sourceTable,
                      ''02'' AS sequenceSet
                  FROM 
                    @ProcedureSearchPhrases PP
                    INNER JOIN  [MASTERCPT] AS CPT
                      ON CPT.[LEVEL] LIKE PP.[LEVEL] + ''%''
                  WHERE 
                      CPT.[CODE] IS NOT NULL
                      AND CPT.[CODE] NOT IN (''0'', '''')
                    AND CPT.[CODE] NOT IN (SELECT CPTE.[CODE] FROM CPT AS CPTE WHERE CPTE.[CODE] IS NOT NULL)
                  ORDER BY PP.CODE

                  UNION ALL

                  SELECT 
                      LTRIM(RTRIM(CPT.[CODE])) AS procedureCode, 
                      LTRIM(RTRIM(CPT.[LEVEL])) AS description, 
                      LTRIM(RTRIM(CPT.[COMBO])) AS category,
                      LTRIM(RTRIM(CPT.[CHARGE])) AS charge,
                      ''True'' AS active,
                      LTRIM(RTRIM([RVU])) AS rvu,
                      ''0CPTMore'' AS sourceTable,
                      ''03'' AS sequenceSet
                FROM 
                    @ProcedureSearchPhrases PP
                    INNER JOIN  [CPTMORE] AS CPT
                      ON CPT.[LEVEL] LIKE PP.[LEVEL] + ''%''
                  WHERE 
                      (CPT.[COMBO] IS NULL OR CPT.[COMBO] NOT IN (''Editor'',''MOD'',''CATEGORY'',''Types'',''Bundles''))
                      AND CPT.[CODE] IS NOT NULL
                      AND CPT.[CODE] NOT IN (''0'', '''')
                    AND CPT.[CODE] NOT IN (SELECT CPTE.[CODE] FROM CPT AS CPTE WHERE CPTE.[CODE] IS NOT NULL)

                  UNION ALL

                  SELECT 
                      LTRIM(RTRIM(CPT.[CODE])) AS procedureCode, 
                      LTRIM(RTRIM(CPT.[LEVEL])) AS description, 
                      LTRIM(RTRIM(CPT.[CATEGORY])) AS category,
                      LTRIM(RTRIM(CPT.[CHARGE])) AS charge,
                      COALESCE(CASE [ACTIVE] WHEN 1 THEN ''True'' WHEN 0 THEN ''False'' WHEN '''' THEN ''False'' ELSE ''False'' END,''True'') AS active,
                      LTRIM(RTRIM([RVU])) AS rvu,
                      ''2MasterCPT'' AS sourceTable,
                      ''03'' AS sequenceSet
                  FROM 
                    @ProcedureSearchPhrases PP
                    INNER JOIN  [MASTERCPT] AS CPT
                      ON CPT.[LEVEL] LIKE PP.[LEVEL] + ''%''
                  WHERE 
                      CPT.[CODE] IS NOT NULL
                      AND CPT.[CODE] NOT IN (''0'', '''')
                    AND CPT.[CODE] NOT IN (SELECT CPTE.[CODE] FROM CPT AS CPTE WHERE CPTE.[CODE] IS NOT NULL)

                  UNION ALL

                  SELECT 
                      LTRIM(RTRIM(CPT.[CODE])) AS procedureCode, 
                      LTRIM(RTRIM(CPT.[LEVEL])) AS description, 
                      LTRIM(RTRIM(CPT.[COMBO])) AS category,
                      LTRIM(RTRIM(CPT.[CHARGE])) AS charge,
                      ''True'' AS active,
                      LTRIM(RTRIM([RVU])) AS rvu,
                      ''0CPTMore'' AS sourceTable,
                      ''04'' AS sequenceSet
                FROM 
                    @ProcedureSearchPhrases PP
                    INNER JOIN  [CPTMORE] AS CPT
                      ON CPT.[LEVEL] LIKE ''%'' + PP.[LEVEL] + ''%''
                  WHERE 
                      (CPT.[COMBO] IS NULL OR CPT.[COMBO] NOT IN (''Editor'',''MOD'',''CATEGORY'',''Types'',''Bundles''))
                      AND CPT.[CODE] IS NOT NULL
                      AND CPT.[CODE] NOT IN (''0'', '''')
                    AND CPT.[CODE] NOT IN (SELECT CPTE.[CODE] FROM CPT AS CPTE WHERE CPTE.[CODE] IS NOT NULL)

                  UNION ALL

                  SELECT 
                      LTRIM(RTRIM(CPT.[CODE])) AS procedureCode, 
                      LTRIM(RTRIM(CPT.[LEVEL])) AS description, 
                      LTRIM(RTRIM(CPT.[CATEGORY])) AS category,
                      LTRIM(RTRIM(CPT.[CHARGE])) AS charge,
                      COALESCE(CASE [ACTIVE] WHEN 1 THEN ''True'' WHEN 0 THEN ''False'' WHEN '''' THEN ''False'' ELSE ''False'' END,''True'') AS active,
                      LTRIM(RTRIM([RVU])) AS rvu,
                      ''2MasterCPT'' AS sourceTable,
                      ''04'' AS sequenceSet
                  FROM 
                    @ProcedureSearchPhrases PP
                    INNER JOIN  [MASTERCPT] AS CPT
                      ON CPT.[LEVEL] LIKE ''%'' + PP.[LEVEL] + ''%''
                  WHERE 
                      CPT.[CODE] IS NOT NULL
                      AND CPT.[CODE] NOT IN (''0'', '''')
                    AND CPT.[CODE] NOT IN (SELECT CPTE.[CODE] FROM CPT AS CPTE WHERE CPTE.[CODE] IS NOT NULL)

             ) AS CPTS 

            ORDER BY 
                 sequenceSet, sourceTable, [description]

        END

        /* Final select - uses artificial ordering from the insertion ORDER BY */
        SELECT procedureCode, description,  category, rvu, charge, active FROM
        ( 
        SELECT TOP 500 *-- procedureCode, description,  category, rvu, charge, active
        FROM #DistinctMatchingCpts
        ORDER BY sequenceSet, sourceTable, description

        ) AS CPTROWS

        DROP TABLE #DistinctMatchingCpts

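However, this does not meet the criterion of best match on word count (as in the row 1 value in the sample), which should match the best (greatest) count of words found from that row.

I have full control over the form/format of the table-valued parameter, if that makes a difference.

I am returning this result to a C# program, if that is useful.

【Comments】:

Do any of these answer your question?

Several answers and some ideas, but none quite sufficient to get a complete result set that meets the list of criteria. At the moment I am prototyping an algorithm that seems to do what I want; once I have fully vetted it, I will determine whether it is a viable solution that meets these goals.

【Answer 1】:

You need to be able to split strings to solve this. I prefer the number table approach to split a string in TSQL.

For my code below (and my split function) to work, you need to do this one-time table setup: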

SELECT TOP 10000 IDENTITY(int,1,1) AS Number
    INTO Numbers
    FROM sys.objects s1
    CROSS JOIN sys.objects s2
ALTER TABLE Numbers ADD CONSTRAINT PK_Numbers PRIMARY KEY CLUSTERED (Number)

Once the Numbers table is set up, create this split function:

CREATE FUNCTION [dbo].[FN_ListToTable]
(
     @SplitOn  char(1)      --REQUIRED, the character to split the @List string on
    ,@List     varchar(8000)--REQUIRED, the list to split apart
)
RETURNS TABLE
AS
RETURN 
(

    ----------------
    --SINGLE QUERY-- --this will not return empty rows
    ----------------
    SELECT
        ListValue
        FROM (SELECT
                  LTRIM(RTRIM(SUBSTRING(List2, number+1, CHARINDEX(@SplitOn, List2, number+1)-number - 1))) AS ListValue
                  FROM (
                           SELECT @SplitOn + @List + @SplitOn AS List2
                       ) AS dt
                      INNER JOIN Numbers n ON n.Number < LEN(dt.List2)
                  WHERE SUBSTRING(List2, number, 1) = @SplitOn
             ) dt2
        WHERE ListValue IS NOT NULL AND ListValue!=''

);
GO 

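You are free to create your own split function, but you will still need the Numbers table for my solution to work.

You can now easily split a CSV string into a table and join on it: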

select * from dbo.FN_ListToTable(',','1,2,3,,,4,5,6777,,,')

Output:

ListValue
-----------------------
1
2
3
4
5
6777

(6 row(s) affected)

Now try this:

DECLARE @BaseTable table (RowID int primary key, RowValue varchar(100))
set nocount on
INSERT @BaseTable VALUES ( 1,'The cows came home empty handed')
INSERT @BaseTable VALUES ( 2,'This is my string as a test template to rank')                           -- third
INSERT @BaseTable VALUES ( 3,'pencil pen paperclip eraser')
INSERT @BaseTable VALUES ( 4,'wow')
INSERT @BaseTable VALUES ( 5,'no dice here')
INSERT @BaseTable VALUES ( 6,'This is my string as a test template to rank on even though not exact.') -- second
INSERT @BaseTable VALUES ( 7,'apple banana pear grape lemon orange kiwi strawberry peach watermellon')
INSERT @BaseTable VALUES ( 8,'This is my string as a test template')                                   -- 5th
INSERT @BaseTable VALUES ( 9,'rat cat bat mat sat fat hat pat ')
INSERT @BaseTable VALUES (10,'house home pool roll')
INSERT @BaseTable VALUES (11,'This is my string as a test template to')                                -- 4th
INSERT @BaseTable VALUES (12,'talk wisper yell scream sing hum')
INSERT @BaseTable VALUES (13,'This is my string as a test template to rank on.')                       -- first
INSERT @BaseTable VALUES (14,'aaa bbb ccc ddd eee fff ggg hhh')
INSERT @BaseTable VALUES (15,'three twice three once twice three')
set nocount off

DECLARE @SearchValue varchar(100)
SET @SearchValue='This is my value string as a test template to rank on.'

;WITH SplitBaseTable AS --expand each @BaseTable row into one row per word
(SELECT
     b.RowID, b.RowValue, s.ListValue
     FROM @BaseTable b
         CROSS APPLY  dbo.FN_ListToTable(' ',b.RowValue) AS s
)
, WordMatchCount AS --for each @BaseTable row that has a word in common with the search string, get the count of matching words
(SELECT
     s.RowID,COUNT(*) AS CountOfWordMatch
     FROM dbo.FN_ListToTable(' ',@SearchValue) v
         INNER JOIN SplitBaseTable             s ON v.ListValue=s.ListValue
     GROUP BY s.RowID
     HAVING COUNT(*)>0
)
, SearchLen AS --get one row for each possible length of the search string
(
SELECT
    n.Number,SUBSTRING(@SearchValue,1,n.Number) AS PartialSearchValue
    FROM Numbers n
    WHERE n.Number<=LEN(@SearchValue)
)
, MatchLen AS --for each @BaseTable row, get the max starting length that matches the search string
(
 SELECT
     b.RowID,MAX(l.Number) MatchStartLen
     FROM @BaseTable                 b
         LEFT OUTER JOIN SearchLen   l ON LEFT(b.RowValue,l.Number)=l.PartialSearchValue
     GROUP BY b.RowID
)
SELECT --return the final search results
    b.RowValue,w.CountOfWordMatch,m.MatchStartLen
    FROM @BaseTable                     b
        LEFT OUTER JOIN WordMatchCount  w ON b.RowID=w.RowID
        LEFT OUTER JOIN MatchLen        m ON b.RowID=m.RowID
    WHERE w.CountOfWordMatch>0
    ORDER BY w.CountOfWordMatch DESC,m.MatchStartLen DESC,LEN(b.RowValue) DESC,b.RowValue ASC

Output:

RowValue                                                                CountOfWordMatch MatchStartLen
----------------------------------------------------------------------- ---------------- -------------
This is my string as a test template to rank on.                        11               11
This is my string as a test template to rank on even though not exact.  10               11
This is my string as a test template to rank                            10               11
This is my string as a test template to                                 9                11
This is my string as a test template                                    8                11

(5 row(s) affected)

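It handles the words matching at the start of the string a little differently than requested, because it looks at the number of characters that match at the start of the string rather than the number of words.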
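To see where the two numbers in the output come from, the two measures can be reproduced in a few lines of Python. This is an illustration only, not part of the T-SQL solution: the function names are hypothetical, and case-insensitive comparison is an assumption meant to mimic a typical case-insensitive SQL Server collation.

```python
def word_match_count(row_value, search_value):
    """Count equal (search word, row word) pairs, like the join in WordMatchCount."""
    row_words = row_value.casefold().split()
    search_words = search_value.casefold().split()
    return sum(row_words.count(w) for w in search_words)

def match_start_len(row_value, search_value):
    """Number of leading characters the row shares with the search string."""
    n = 0
    for a, b in zip(row_value.casefold(), search_value.casefold()):
        if a != b:
            break
        n += 1
    return n

search = "This is my value string as a test template to rank on."
rows = [
    "This is my string as a test template to rank",
    "This is my string as a test template to rank on even though not exact.",
    "This is my string as a test template to rank on.",
]
# Same ordering as the ORDER BY: word count, start length, row length, then value.
ranked = sorted(
    rows,
    key=lambda r: (-word_match_count(r, search), -match_start_len(r, search), -len(r), r),
)
for r in ranked:
    print(word_match_count(r, search), match_start_len(r, search), r)
```

Every sample row shares only the 11 characters "This is my " with the search string, which is why MatchStartLen is 11 throughout, and "on" does not equal "on." because punctuation stays attached to the word when splitting on spaces.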

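Once you get this working, you could try to optimize it by creating some static indexed tables for SplitBaseTable, possibly maintained by a trigger on your @BaseTable.

【Discussion】:

This is an interesting idea. The current challenge is that there is no single word separator: characters like :)(,/=%-][ all occur inside double quotes, and they are somewhat significant in phrases within double quotes: "Level > 9.0%" or "LDL -C

Regarding your "crazy" word/phrase split rules, you have three options: 1) Write a CLR split routine to handle all the necessary logic. 2) Insert a character like PRINT CHAR(182) into your strings to clearly identify the splits. 3) Redesign the tables so that each "phrase" is already split into its own row and can be reconstructed from an ID and a sequence number. As for the primary key, add an identity column like this: blog.sqlauthority.com/2009/05/03/… and make it the PK.

Nice solution! I am using this approach and it works well. Thanks! But is it also possible to show results with similar words? I am not talking about Levenshtein distance. The LIKE operator would do it for me... but I don't know how to extend this solution.

【Answer 2】: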

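It sounds like you are looking for a matching algorithm, which may be difficult to create without using a stored procedure. From past experience, edit distance algorithms (such as Levenshtein) are very useful for determining similarity. These return a number, sometimes the number of differences between the strings, on which you can build your own weighted equation to give a score. You can then create a ranking or a threshold on the scores to reduce false negatives/positives.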
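For reference, the Levenshtein distance can be computed with a short dynamic-programming routine. The Python below is a generic sketch, not code from this answer; because it only compares elements for equality, passing lists of words instead of strings gives a word-level distance.

```python
def levenshtein(a, b):
    """Minimum number of insertions, deletions and substitutions turning a into b."""
    # prev[j] holds the distance between the processed prefix of a and b[:j].
    prev = list(range(len(b) + 1))
    for i, x in enumerate(a, 1):
        cur = [i]  # distance from a[:i] to the empty sequence
        for j, y in enumerate(b, 1):
            cur.append(min(
                prev[j] + 1,             # delete x
                cur[j - 1] + 1,          # insert y
                prev[j - 1] + (x != y),  # substitute, free when equal
            ))
        prev = cur
    return prev[-1]

print(levenshtein("kitten", "sitting"))  # 3
print(levenshtein("is my string as".split(), "is my other string as".split()))  # 1
```

A smaller distance means a closer match, so results would be ordered by this value ascending.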

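【Discussion】:

Sort of, but the terms are very specific and only a limited set of words is kept, all of them specific, so "exact match" is sufficient for my purposes. I just need to get several exact "groups" of matches in the priority described. Good suggestion though.

I can also use a stored procedure, or whatever I need, to get the correct result set.

Good suggestion with the Levenshtein distance, but it works on letters, for comparing terms.

Now this is the funny part: this is the correct answer, Levenshtein distance. What you are trying to achieve is the Levenshtein distance, only with words instead of letters. I suggest building a CLR assembly to compute this distance if possible. A word-Levenshtein distance of 1 means that one word (not letter) is out of place (it should not be there, or it is missing). So you can easily order by this distance.

【Answer 3】: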

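I had a similar question a while back. What I was trying to find out was how many words matched between two different columns, ranked by the highest percentage of matching words. It was way over my head, but I got a brilliant answer from Martin.

Take a look at his answer to my question here.

【Discussion】:

【Answer 4】:

One answer to all your problems: use Sphinx http://sphinxsearch.com and don't solve this in SQL.

Sphinx is open source and works with all databases and all operating systems.

It is what craigslist uses.

It is the best external full-text search system as of this posting. It will order your results by relevance as you have asked, and you do not need fancy SQL tables or SQL procedures. Try it.

【Discussion】:

Sometimes, for better or worse, you have to retrieve the records using SQL (for example, if you are also filtering them on other criteria, and if you are linking them to related tables).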
