SQLite full-text search relevance ranking
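【Posted】: 2011-11-08 11:28:25
【Question】: I am using the fts4 extension of sqlite3 to enable full-text indexing and searching of text data. This works well, but I noticed that the results are not relevance-ranked at all. I guess I am too used to Lucene. I have seen some brief suggestions to write a custom ranking method using the results of matchinfo(), but it is not clear to me how this is done, or whether any more sophisticated examples exist. How have others dealt with this?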
【Question comments】:
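【Answer 1】: The documentation contains a complete example; look at the end of appendix A. You will need to do more work to get good relevance ranking, because the function provided there is only a starting point. For example, matchinfo(table, 'pcnalx') returns enough information to implement Okapi BM25.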
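As a rough illustration of that last point, here is a minimal Python sketch that decodes a matchinfo(table, 'pcnalx') blob and computes a BM25 score from it. The function names parse_matchinfo and bm25 are made up for this example, and k1 = 1.2 and b = 0.75 are just the conventional defaults. The blob layout (p, c, n, then one 'a' and one 'l' value per column, then three 'x' values per phrase/column pair) follows the FTS3/FTS4 documentation. Treat it as a sketch under those assumptions, not a tuned implementation.

```python
import math
import struct


def parse_matchinfo(blob):
    """Split a matchinfo(tbl, 'pcnalx') blob into its named parts."""
    # The blob is an array of 32-bit unsigned integers in native byte order.
    ints = struct.unpack(f"={len(blob) // 4}I", blob)
    phrases, columns, total_docs = ints[0], ints[1], ints[2]
    avg_len = ints[3:3 + columns]                # 'a': average tokens per column
    doc_len = ints[3 + columns:3 + 2 * columns]  # 'l': tokens in this row, per column
    hits = ints[3 + 2 * columns:]                # 'x': 3 ints per (phrase, column)
    return phrases, columns, total_docs, avg_len, doc_len, hits


def bm25(blob, k1=1.2, b=0.75):
    """Okapi BM25 score for one row; higher means more relevant."""
    phrases, columns, total_docs, avg_len, doc_len, hits = parse_matchinfo(blob)
    score = 0.0
    for p in range(phrases):
        for c in range(columns):
            base = 3 * (c + p * columns)
            tf = hits[base]                  # occurrences in this row
            docs_with_term = hits[base + 2]  # rows containing the phrase
            if tf == 0 or avg_len[c] == 0:
                continue
            idf = math.log((total_docs - docs_with_term + 0.5)
                           / (docs_with_term + 0.5))
            norm = tf + k1 * (1 - b + b * doc_len[c] / avg_len[c])
            # Clamp negative summands (very common terms) at zero.
            score += max(0.0, idf * tf * (k1 + 1) / norm)
    return score
```

Registered as a one-argument SQL function, this would be used as ORDER BY bm25(matchinfo(tbl, 'pcnalx')) DESC, since larger scores are better here.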
【Comments】:
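Are there any public implementations of Okapi BM25 on top of matchinfo?
Not that I know of. I implemented it myself, but the code is not public.
In case anyone comes across this question, there are some tutorials on BM25/BM25F - irthoughts.wordpress.com/2011/08/03/…
That is true, but it is not the quality I would want for something I make public. It is not hard, though; it is basically a 1:1 translation of the *** equations, which look worse than they really are. I did find that clamping the summands at zero worked best for my case, YMMV.
Did you implement it inside SQLite?
【Answer 2】: There seems to be a distinct lack of documentation on how to implement Okapi BM25 in C, and it seems to be taken for granted that the implementation is left as an exercise for the user.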
I found that a programmer, Radford 'rads' Smith, had put this up on GitHub:
https://github.com/rads/sqlite-okapi-bm25
It only implements BM25, although I am now looking into tweaks for BM25F....
....and here they are:
https://github.com/neozenith/sqlite-okapi-bm25
【Comments】:
Running make with gcc 4.8.4 gives gcc: error: unrecognized command line option ‘-bundle’. I know this answer is quite old and your last commit was 2 years ago, but could you tell me which version of the gcc compiler you used to build the C file?
configured with: --prefix=/Applications/Xcode.app/Contents/Developer/usr --with-gxx-include-dir=/usr/include/c++/4.2.1 Apple LLVM version 7.0.2 (clang-700.1.81) Target: x86_64-apple-darwin14.5.0 Thread model: posix
Sorry, that was in an OS X development environment. I have not tested any other build chains.
【Answer 3】:
For FTS5, according to the SQLite FTS5 Extension documentation, FTS5 has no matchinfo().
FTS5 supports ORDER BY rank, which is very simple to use, for example:
SELECT * FROM email WHERE email MATCH 'fts5' ORDER BY rank;
This works without DESC.
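The reason no DESC is needed is that FTS5's built-in bm25() returns its score multiplied by -1, so better matches get numerically lower values and sort first in ascending order. The following Python sketch tries this on a throwaway in-memory table. The sample rows and the function name search_by_rank are invented for illustration, and it assumes the SQLite library behind Python's sqlite3 module was compiled with FTS5.

```python
import sqlite3


def search_by_rank(term):
    """Return (subject, rank) rows matching `term`, best match first."""
    conn = sqlite3.connect(":memory:")
    # Raises sqlite3.OperationalError if this SQLite build lacks FTS5.
    conn.execute("CREATE VIRTUAL TABLE email USING fts5(subject, body)")
    conn.executemany(
        "INSERT INTO email (subject, body) VALUES (?, ?)",
        [
            ("dense", "fts5 fts5 fts5"),
            ("sparse", "a long message that mentions fts5 only once "
                       "among many other words"),
            ("other1", "nothing relevant in this message"),
            ("other2", "also unrelated text here"),
            ("other3", "still no match at all"),
        ],
    )
    # Ascending order on the hidden rank column puts the best match first.
    rows = conn.execute(
        "SELECT subject, rank FROM email WHERE email MATCH ? ORDER BY rank",
        (term,),
    ).fetchall()
    conn.close()
    return rows
```

Calling search_by_rank("fts5") should list the short, term-dense row ahead of the long row that mentions the term once, with both rank values negative.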
【Comments】:
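Some further details... "All FTS5 tables feature a special hidden column named "rank". If the current query is not a full-text query (i.e. if it does not include a MATCH operator), the value of the "rank" column is always NULL. Otherwise, in a full-text query, column rank contains by default the same value as would be returned by executing the bm25() auxiliary function with no trailing arguments."
Exactly what I needed! Thank you very much. SQLite amazes me more every day.
【Answer 4】: Here is an implementation of Okapi BM25. Using it together with the suggestions at SQLite.org will help you produce relevance-ranked MATCH queries. It is all written in VB.Net, and the queries are run through System.Data.SQLite. As long as the SQL is executed through System.Data.SQLite, the custom SQLiteFunction at the end can be called from SQL without any problems.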
Imports System.Collections.Generic
Imports System.Data.SQLite

Public Class MatchInfo
    ' Parsed view of a matchinfo(tbl, 'pcnalsx') blob.
    Property matchablePhrases As Integer
    Property userDefinedColumns As Integer
    Property totalDocuments As Integer
    Private _int32HitData As List(Of Integer)
    Private _longestSubsequencePhraseMatches As New List(Of Integer)
    Private _tokensInDocument As New List(Of Integer)
    Private _averageTokensInDocument As New List(Of Integer)
    Private _max_hits_this_row As Integer?
    Public ReadOnly Property max_hits_this_row As Integer
        Get
            If _max_hits_this_row Is Nothing Then
                _max_hits_this_row = 0
                For p = 0 To matchablePhrases - 1
                    For c = 0 To userDefinedColumns - 1
                        Dim myHitsThisRow As Integer = hits_this_row(p, c)
                        If myHitsThisRow > _max_hits_this_row Then
                            _max_hits_this_row = myHitsThisRow
                        End If
                    Next
                Next
            End If
            Return _max_hits_this_row
        End Get
    End Property
    Private _max_hits_all_rows As Integer?
    Public ReadOnly Property max_hits_all_rows As Integer
        Get
            If _max_hits_all_rows Is Nothing Then
                _max_hits_all_rows = 0
                For p = 0 To matchablePhrases - 1
                    For c = 0 To userDefinedColumns - 1
                        Dim myHitsAllRows As Integer = hits_all_rows(p, c)
                        If myHitsAllRows > _max_hits_all_rows Then
                            _max_hits_all_rows = myHitsAllRows
                        End If
                    Next
                Next
            End If
            Return _max_hits_all_rows
        End Get
    End Property
    Private _max_docs_with_hits As Integer?
    Public ReadOnly Property max_docs_with_hits As Integer
        Get
            If _max_docs_with_hits Is Nothing Then
                _max_docs_with_hits = 0
                For p = 0 To matchablePhrases - 1
                    For c = 0 To userDefinedColumns - 1
                        Dim myDocsWithHits As Integer = docs_with_hits(p, c)
                        If myDocsWithHits > _max_docs_with_hits Then
                            _max_docs_with_hits = myDocsWithHits
                        End If
                    Next
                Next
            End If
            Return _max_docs_with_hits
        End Get
    End Property
    Private _BM25Rank As Double?
    Public ReadOnly Property BM25Rank As Double
        Get
            If _BM25Rank Is Nothing Then
                _BM25Rank = 0
                'calculate BM25 Rank
                'http://en.wikipedia.org/wiki/Okapi_BM25
                'k1 calibrates the document term frequency scaling. Having k1 as 0 corresponds to a binary model (no term frequency). Increasing k1 lets repeated occurrences of a term keep raising the score for longer.
                'b calibrates the scaling by document length, and can take values from 0 to 1, where 0 means no length normalization and 1 corresponds to fully scaling the term weight by the document length.
                Dim k1 As Double = 1.2
                Dim b As Double = 0.75
                For column = 0 To userDefinedColumns - 1
                    For phrase = 0 To matchablePhrases - 1
                        'IDF uses the number of documents that contain the phrase (docs_with_hits),
                        'not the total hit count, which can exceed totalDocuments and make Math.Log return NaN.
                        Dim docsWithPhrase As Integer = docs_with_hits(phrase, column)
                        Dim IDF As Double = Math.Log((totalDocuments - docsWithPhrase + 0.5) / (docsWithPhrase + 0.5))
                        Dim score As Double = (IDF * ((hits_this_row(phrase, column) * (k1 + 1)) / (hits_this_row(phrase, column) + k1 * (1 - b + b * _tokensInDocument(column) / _averageTokensInDocument(column)))))
                        'clamp negative summands (very common phrases) at zero
                        If score < 0 Then
                            score = 0
                        End If
                        _BM25Rank += score
                    Next
                Next
            End If
            Return _BM25Rank.Value
        End Get
    End Property
    Public Sub New(raw_pcnalsx_MatchInfo As Byte())
        'the blob is an array of 32-bit integers; ToInt32 matches List(Of Integer)
        Dim int32_pcsx_MatchInfo As New List(Of Integer)
        For i = 0 To raw_pcnalsx_MatchInfo.Length - 1 Step 4
            int32_pcsx_MatchInfo.Add(BitConverter.ToInt32(raw_pcnalsx_MatchInfo, i))
        Next
        'take the raw data and parse it out
        Me.matchablePhrases = int32_pcsx_MatchInfo(0)
        int32_pcsx_MatchInfo.RemoveAt(0)
        Me.userDefinedColumns = int32_pcsx_MatchInfo(0)
        int32_pcsx_MatchInfo.RemoveAt(0)
        Me.totalDocuments = int32_pcsx_MatchInfo(0)
        int32_pcsx_MatchInfo.RemoveAt(0)
        'remember that the columns are 0-based
        For i = 0 To userDefinedColumns - 1
            _averageTokensInDocument.Add(int32_pcsx_MatchInfo(0))
            int32_pcsx_MatchInfo.RemoveAt(0)
        Next
        For i = 0 To userDefinedColumns - 1
            _tokensInDocument.Add(int32_pcsx_MatchInfo(0))
            int32_pcsx_MatchInfo.RemoveAt(0)
        Next
        For i = 0 To userDefinedColumns - 1
            _longestSubsequencePhraseMatches.Add(int32_pcsx_MatchInfo(0))
            int32_pcsx_MatchInfo.RemoveAt(0)
        Next
        'whatever remains is the 'x' data: 3 integers per (phrase, column) pair
        _int32HitData = New List(Of Integer)(int32_pcsx_MatchInfo)
    End Sub
    Public Function hits_this_row(phrase As Integer, column As Integer) As Integer
        Return _int32HitData(3 * (column + phrase * userDefinedColumns) + 0)
    End Function
    Public Function hits_all_rows(phrase As Integer, column As Integer) As Integer
        Return _int32HitData(3 * (column + phrase * userDefinedColumns) + 1)
    End Function
    Public Function docs_with_hits(phrase As Integer, column As Integer) As Integer
        Return _int32HitData(3 * (column + phrase * userDefinedColumns) + 2)
    End Function
End Class
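'Scalar SQL function Rank(blob); pass it matchinfo(tbl, 'pcnalsx'), the layout the MatchInfo constructor parses.
<SQLiteFunction("Rank", 1, FunctionType.Scalar)>
Public Class Rank
    Inherits SQLiteFunction
    Public Overrides Function Invoke(args() As Object) As Object
        Return New MatchInfo(DirectCast(args(0), Byte())).BM25Rank
    End Function
End Class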
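For readers not on .NET, the same wiring (a scalar function that receives the matchinfo blob and is used in ORDER BY) can be sketched in Python. To keep it short, the score here is deliberately a simple stand-in, the total number of phrase hits in the row, and not BM25. The names total_hits and ranked_docids and the sample rows are hypothetical, and the sketch assumes the underlying SQLite build includes FTS4.

```python
import sqlite3
import struct


def total_hits(blob):
    """Stand-in score: total phrase hits in this row, from a 'pcx' blob."""
    # The blob is an array of 32-bit unsigned integers in native byte order.
    ints = struct.unpack(f"={len(blob) // 4}I", blob)
    phrases, columns = ints[0], ints[1]
    # 'x' data starts at index 2; every third value is "hits in this row".
    return sum(ints[2 + 3 * i] for i in range(phrases * columns))


def ranked_docids(term):
    """Return docids matching `term`, highest score first."""
    conn = sqlite3.connect(":memory:")
    # Python counterpart of the SQLiteFunction attribute used above.
    conn.create_function("rank", 1, total_hits)
    # Raises sqlite3.OperationalError if this SQLite build lacks FTS4.
    conn.execute("CREATE VIRTUAL TABLE docs USING fts4(body)")
    conn.executemany(
        "INSERT INTO docs (docid, body) VALUES (?, ?)",
        [
            (1, "sqlite mentioned once"),
            (2, "sqlite sqlite sqlite"),
            (3, "unrelated text"),
        ],
    )
    # Larger scores are better for this function, so DESC is needed here.
    rows = conn.execute(
        "SELECT docid FROM docs WHERE docs MATCH ? "
        "ORDER BY rank(matchinfo(docs, 'pcx')) DESC",
        (term,),
    ).fetchall()
    conn.close()
    return [docid for (docid,) in rows]
```

Swapping total_hits for a BM25 function only requires changing the format string to the layout that function expects, for example 'pcnalsx' for the VB.Net class above.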
【Comments】: