Noise Words in Sql Server 2005 Full Text Search
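【Title】: Noise Words in Sql Server 2005 Full Text Search
【Posted】: 2009-06-02 06:31:51
【Question】:
I'm trying to use full-text search across a list of names in my database. This is my first attempt at using full-text search. Currently I take the search string that was entered and put a NEAR condition between each term (i.e. the entered phrase "Kings of Leon" becomes "Kings NEAR of NEAR Leon").
Unfortunately, I've found that this tactic produces false negatives, because SQL Server drops the word "of" when it builds the index, since it is a noise word. As a result, "Kings Leon" matches correctly, but "Kings of Leon" does not.
My co-worker suggests taking all of the noise words defined in MSSQL\FTData\noiseENG.txt and placing them in .Net code, so that noise words can be stripped before the full-text search is executed.
Is this the best solution? Is there some auto-magic setting I can change in SQL Server to do this for me? Or maybe just a better solution that doesn't feel as hacky?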
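To make the co-worker's suggestion concrete, here is a minimal C# sketch of building the NEAR condition while skipping noise words. The class name, method name, and the four-word noise list are hypothetical and only illustrate the idea; a real list would have to mirror the server's noise file.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public static class NearQueryBuilder
{
    // Hypothetical subset of noise words, for illustration only.
    private static readonly HashSet<string> NoiseWords =
        new HashSet<string>(new[] { "of", "the", "and", "a" },
                            StringComparer.OrdinalIgnoreCase);

    // Splits the input on whitespace, drops noise words,
    // and joins the remaining terms with NEAR.
    public static string BuildNearCondition(string input)
    {
        IEnumerable<string> terms = input
            .Split((char[])null, StringSplitOptions.RemoveEmptyEntries)
            .Where(term => !NoiseWords.Contains(term));
        return string.Join(" NEAR ", terms);
    }
}
```

With this sketch, "Kings of Leon" becomes "Kings NEAR Leon". If every term is a noise word the result is an empty string, so the caller would need to handle that case before running the query.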
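【Comments】:
On a previous project we used SQL Server full-text search and removed the noise words in C#.
【Answer 1】:
Full-text search works off the search criteria you provide to it. You can remove the noise words from the noise file, but doing so risks bloating the index size. Robert Cain has a lot of good information about this on his blog: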
http://arcanecode.com/2008/05/29/creating-and-customizing-noise-words-in-sql-server-2005-full-text-search/
To save some time, you can look at how this method removes them and copy the code and the word list:
public string PrepSearchString(string sOriginalQuery)
{
    // Each entry keeps its surrounding spaces, so only whole words are matched.
    // Note: a noise word at the very start or end of the query has no space
    // on one side and is therefore not removed unless the query is padded first.
    string strNoiseWords = @" 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 0 | $ | ! | @ | # | $ | % | ^ | & | * | ( | ) | - | _ | + | = | [ | ] | | | about | after | all | also | an | and | another | any | are | as | at | be | because | been | before | being | between | both | but | by | came | can | come | could | did | do | does | each | else | for | from | get | got | has | had | he | have | her | here | him | himself | his | how | if | in | into | is | it | its | just | like | make | many | me | might | more | most | much | must | my | never | now | of | on | only | or | other | our | out | over | re | said | same | see | should | since | so | some | still | such | take | than | that | the | their | them | then | there | these | they | this | those | through | to | too | under | up | use | very | want | was | way | we | well | were | what | when | where | which | while | who | will | with | would | you | your | a | b | c | d | e | f | g | h | i | j | k | l | m | n | o | p | q | r | s | t | u | v | w | x | y | z ";
    string[] arrNoiseWord = strNoiseWords.Split("|".ToCharArray());
    foreach (string noiseword in arrNoiseWord)
    {
        // Replace each noise word (with its spaces) by a single space.
        sOriginalQuery = sOriginalQuery.Replace(noiseword, " ");
    }
    // Collapse any runs of spaces left behind by the replacements.
    while (sOriginalQuery.Contains("  "))
    {
        sOriginalQuery = sOriginalQuery.Replace("  ", " ");
    }
    return sOriginalQuery.Trim();
}
However, I would probably use Regex.Replace for this, which should be much faster than looping. I just don't have a quick example to post.
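For illustration, a Regex.Replace version might look like the sketch below. It is not the answerer's code: the class name, method name, and the short noise list are hypothetical, and whether it is actually faster than the loop would need to be measured. Entries are passed through Regex.Escape so that symbols such as "$" are treated literally, and the lookarounds require whitespace or a string boundary on both sides, which also covers noise words at the start or end of the query.

```csharp
using System;
using System.Linq;
using System.Text.RegularExpressions;

public static class NoiseWordStripper
{
    // Hypothetical subset of the noise list, for illustration only.
    private static readonly string[] NoiseWords = { "of", "the", "and", "a", "$" };

    // One alternation of escaped entries. (?<!\S) and (?!\S) match only when
    // the entry is bounded by whitespace or by the start/end of the string,
    // so "of" is not removed from inside "often".
    private static readonly Regex NoiseRegex = new Regex(
        @"(?<!\S)(?:" + string.Join("|", NoiseWords.Select(Regex.Escape)) + @")(?!\S)",
        RegexOptions.IgnoreCase | RegexOptions.Compiled);

    public static string Strip(string query)
    {
        // Replace every noise word with a space, then collapse whitespace runs.
        string withoutNoise = NoiseRegex.Replace(query, " ");
        return Regex.Replace(withoutNoise, @"\s+", " ").Trim();
    }
}
```

Calling NoiseWordStripper.Strip("The Kings of Leon") returns "Kings Leon". Because the match is case-insensitive and boundary-aware, the query does not need to be padded with spaces first.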
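【Comments】:
It works correctly after adding the following line at the start of the method: sOriginalQuery = " " + sOriginalQuery + " "; This is needed so that noise words that are the first or last word of the search phrase can be matched.
【Answer 2】:
Here is a working function. The file noiseENU.txt is copied as-is from \Program Files\Microsoft SQL Server\MSSQL.1\MSSQL\FTData.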
Imports System.Text.RegularExpressions

Public Function StripNoiseWords(ByVal s As String) As String
    ' ReadFile is the author's own helper (not shown); it is assumed to return the file contents as a string.
    Dim NoiseWords As String = ReadFile("/Standard/Core/Config/noiseENU.txt").Trim
    Dim NoiseWordsRegex As String = Regex.Replace(NoiseWords, "\s+", "|") ' about|after|all|also etc.
    ' {0} is the String.Format placeholder for the alternation built above
    NoiseWordsRegex = String.Format("\s?\b(?:{0})\b\s?", NoiseWordsRegex)
    Dim Result As String = Regex.Replace(s, NoiseWordsRegex, " ", RegexOptions.IgnoreCase) ' replace each noise word with a space
    Result = Regex.Replace(Result, "\s+", " ") ' eliminate any multiple spaces
    Return Result
End Function
【Comments】: