SQLite full-text search relevance ranking

Posted: 2011-11-08 11:28:25

Question:

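I am using sqlite3's fts4 extension to enable full-text indexing and searching of text data. This works well, but I have noticed that the results are not relevance-ranked at all. I guess I am too used to Lucene. I have seen some brief suggestions to write a custom ranking method using the matchinfo() results, but it is not clear to me how this is done, or whether any sophisticated examples exist. How have others dealt with this?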

Answer 1:

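The documentation contains a complete example; look at the end of Appendix A. You will need to do more work to get good relevance ranking, because the function provided there is only a starting point. For example, matchinfo(table, 'pcnalx') returns enough information to implement Okapi BM25.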
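As a rough illustration of that last point, here is a minimal Python sketch that decodes a matchinfo(table, 'pcnalx') blob and computes a BM25 score from it. The function name `bm25_from_matchinfo` and the defaults k1 = 1.2 and b = 0.75 are assumptions chosen for the example, not part of SQLite. The layout assumed is the documented one: 32-bit unsigned integers in machine byte order, holding p, c, n, then c values for 'a', c values for 'l', and three values per (phrase, column) pair for 'x'.

```python
import math
import struct


def bm25_from_matchinfo(blob, k1=1.2, b=0.75):
    """Score one row from a matchinfo(table, 'pcnalx') blob (illustrative sketch)."""
    # The blob is an array of 32-bit unsigned integers in machine byte order.
    ints = struct.unpack(f"={len(blob) // 4}I", blob)
    p, c, n = ints[0], ints[1], ints[2]   # phrases, columns, total rows
    avg_len = ints[3:3 + c]               # 'a': average token count per column
    doc_len = ints[3 + c:3 + 2 * c]       # 'l': token count of this row per column
    x = ints[3 + 2 * c:]                  # 'x': three values per (phrase, column)

    score = 0.0
    for phrase in range(p):
        for col in range(c):
            base = 3 * (col + phrase * c)
            tf = x[base]                  # hits for the phrase in this row
            df = x[base + 2]              # rows with at least one hit
            if tf == 0 or avg_len[col] == 0:
                continue
            idf = math.log((n - df + 0.5) / (df + 0.5))
            norm = tf + k1 * (1 - b + b * doc_len[col] / avg_len[col])
            # Floor each term at zero so very common phrases cannot lower the score.
            score += max(0.0, idf * tf * (k1 + 1) / norm)
    return score
```

With Python's sqlite3 module, a function like this could be registered through `conn.create_function("bm25", 1, bm25_from_matchinfo)` (the SQL name `bm25` is again hypothetical) and used as `ORDER BY bm25(matchinfo(t, 'pcnalx')) DESC` on an FTS4 table `t`. Here a larger score means a better match, so DESC is required.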

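Comments:

Is there any public implementation of Okapi BM25 on top of matchinfo?

Not that I know of. I implemented it myself, but the code is not public.

In case anybody comes across this, there are some tutorials on BM25/BM25F - irthoughts.wordpress.com/2011/08/03/…

I did, but it is not the quality of code I would want to make public. It is not that hard though, basically a 1:1 translation of the *** equation; it looks worse than it is. I did find that flooring the sum at zero worked best for my case, YMMV.

Did you implement it in SQLite?

Answer 2: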

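There seems to be a distinct lack of documentation on how to implement Okapi BM25 in C, and it seems to be taken for granted that the implementation is left as an exercise for the user.

I found a bro of a programmer, Radford 'rads' Smith, who put this up on GitHub: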

https://github.com/rads/sqlite-okapi-bm25

It only implements BM25, although I am now looking into tweaks for BM25F....

....and here it is.

https://github.com/neozenith/sqlite-okapi-bm25

Comments:

Running make with gcc 4.8.4 gives gcc: error: unrecognized command line option ‘-bundle’. I know this answer is quite old and your last commit was 2 years ago, but can you tell me which version of the gcc compiler you used to build the C file?

Configured with: --prefix=/Applications/Xcode.app/Contents/Developer/usr --with-gxx-include-dir=/usr/include/c++/4.2.1 Apple LLVM version 7.0.2 (clang-700.1.81) Target: x86_64-apple-darwin14.5.0 Thread model: posix. This was in an OSX dev environment, sorry. I have not tested other build chains.

Answer 3:

For FTS5, according to the SQLite FTS5 Extension documentation:

FTS5 has no matchinfo(). FTS5 supports ORDER BY rank.

So it is very simple, something like:

    SELECT * FROM email WHERE email MATCH 'fts5' ORDER BY rank;

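It works without DESC: in FTS5 a better match has a numerically smaller rank, so ascending order lists the best matches first.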
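A small Python sketch of the same query, run through the standard sqlite3 module. The `email` table layout and the sample rows are made up for the example, and the function returns None when the linked SQLite library was built without FTS5, which does happen on some systems.

```python
import sqlite3


def search_by_rank(query):
    """Return (subject, rank) rows for `query`, best match first (illustrative)."""
    conn = sqlite3.connect(":memory:")
    try:
        # Hypothetical table, created only for this demonstration.
        conn.execute("CREATE VIRTUAL TABLE email USING fts5(subject, body)")
    except sqlite3.OperationalError:
        conn.close()
        return None  # this SQLite build has no FTS5 module
    conn.executemany(
        "INSERT INTO email(subject, body) VALUES (?, ?)",
        [
            ("fts5 notes", "fts5 ranks rows with bm25 and fts5 exposes it as rank"),
            ("lunch", "unrelated text that mentions fts5 once among many other words"),
            ("misc", "nothing relevant here"),
            ("travel", "train tickets for next week"),
            ("invoice", "payment due at the end of the month"),
        ],
    )
    # rank carries the bm25() value; smaller means a better match,
    # so a plain ascending ORDER BY puts the best match first.
    rows = conn.execute(
        "SELECT subject, rank FROM email WHERE email MATCH ? ORDER BY rank",
        (query,),
    ).fetchall()
    conn.close()
    return rows
```

Calling `search_by_rank("fts5")` returns only the rows that contain the term, each with its rank value, already sorted from best to worst.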

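Comments:

Some further details... "All FTS5 tables feature a special hidden column named "rank". If the current query is not a full-text query (i.e. if it does not include a MATCH operator), the value of the "rank" column is always NULL. Otherwise, in a full-text query, column rank contains by default the same value as would be returned by executing the bm25() auxiliary function with no trailing arguments."

Exactly what I needed! Thanks a lot. SQLite amazes me more every day.

Answer 4: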

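Here is an implementation of Okapi BM25. Using it in combination with the suggestions at SQLite.org will help you generate a relevance-ranked MATCH query. It is all written in VB.Net, and the query is called through the System.Data.SQLite functions. The custom SQLiteFunction at the end can be called from SQL code without any problem, as long as that SQL code is executed through System.Data.SQLite.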
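For orientation, the classic Okapi BM25 score that such an implementation translates can be written as follows. The mapping onto matchinfo fields is my reading of the 'pcnalsx' layout rather than something stated in the answer: N is the 'n' value, |D| and avgdl are the 'l' and 'a' values of a column, f(q_i, D) is the first value of each 'x' triple, and n(q_i) is the third.

```latex
\mathrm{score}(D, Q) = \sum_{i=1}^{p} \mathrm{IDF}(q_i) \cdot
  \frac{f(q_i, D)\,(k_1 + 1)}
       {f(q_i, D) + k_1 \left(1 - b + b\,\frac{|D|}{\mathrm{avgdl}}\right)}

\mathrm{IDF}(q_i) = \ln \frac{N - n(q_i) + 0.5}{n(q_i) + 0.5}
```

In this form the IDF turns negative for a phrase that occurs in more than half of the rows, which is why the code floors each term at zero. The code also repeats the sum for every column of the table.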

    Imports System.Collections.Generic
    Imports System.Data.SQLite

    Public Class MatchInfo
        Property matchablePhrases As Integer
        Property userDefinedColumns As Integer
        Property totalDocuments As Integer
        Private _int32HitData As List(Of Integer)
        Private _longestSubsequencePhraseMatches As New List(Of Integer)
        Private _tokensInDocument As New List(Of Integer)
        Private _averageTokensInDocument As New List(Of Integer)

        Private _max_hits_this_row As Integer?
        Public ReadOnly Property max_hits_this_row As Integer
            Get
                If _max_hits_this_row Is Nothing Then
                    _max_hits_this_row = 0
                    For p = 0 To matchablePhrases - 1
                        For c = 0 To userDefinedColumns - 1
                            Dim myHitsThisRow As Integer = hits_this_row(p, c)
                            If myHitsThisRow > _max_hits_this_row Then
                                _max_hits_this_row = myHitsThisRow
                            End If
                        Next
                    Next
                End If

                Return _max_hits_this_row
            End Get
        End Property

        Private _max_hits_all_rows As Integer?
        Public ReadOnly Property max_hits_all_rows As Integer
            Get
                If _max_hits_all_rows Is Nothing Then
                    _max_hits_all_rows = 0
                    For p = 0 To matchablePhrases - 1
                        For c = 0 To userDefinedColumns - 1
                            Dim myHitsAllRows As Integer = hits_all_rows(p, c)
                            If myHitsAllRows > _max_hits_all_rows Then
                                _max_hits_all_rows = myHitsAllRows
                            End If
                        Next
                    Next
                End If

                Return _max_hits_all_rows
            End Get
        End Property

        Private _max_docs_with_hits As Integer?
        Public ReadOnly Property max_docs_with_hits As Integer
            Get
                If _max_docs_with_hits Is Nothing Then
                    _max_docs_with_hits = 0
                    For p = 0 To matchablePhrases - 1
                        For c = 0 To userDefinedColumns - 1
                            Dim myDocsWithHits As Integer = docs_with_hits(p, c)
                            If myDocsWithHits > _max_docs_with_hits Then
                                _max_docs_with_hits = myDocsWithHits
                            End If
                        Next
                    Next
                End If

                Return _max_docs_with_hits
            End Get
        End Property

        Private _BM25Rank As Double?
        Public ReadOnly Property BM25Rank As Double
            Get
                If _BM25Rank Is Nothing Then
                    _BM25Rank = 0
                    'calculate BM25 Rank
                    'http://en.wikipedia.org/wiki/Okapi_BM25

                    'k1 calibrates term-frequency saturation. k1 = 0 corresponds to a binary model (no term frequency); a larger k1 lets repeated occurrences of a phrase keep adding to the score.
                    'b calibrates the scaling by document length and can take values from 0 to 1, where 0 means no length normalization and 1 means the term weight is fully scaled by the document length.

                    Dim k1 As Double = 1.2
                    Dim b As Double = 0.75

                    For column = 0 To userDefinedColumns - 1
                        For phrase = 0 To matchablePhrases - 1
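                            'IDF is based on the number of rows containing the phrase (docs_with_hits), not on the total hit count
                            Dim IDF As Double = Math.Log((totalDocuments - docs_with_hits(phrase, column) + 0.5) / (docs_with_hits(phrase, column) + 0.5))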
                            Dim score As Double = (IDF * ((hits_this_row(phrase, column) * (k1 + 1)) / (hits_this_row(phrase, column) + k1 * (1 - b + b * _tokensInDocument(column) / _averageTokensInDocument(column)))))
                            If score < 0 Then
                                score = 0
                            End If
                            _BM25Rank += score
                        Next
                    Next

                End If

                Return _BM25Rank
            End Get
        End Property

        Public Sub New(raw_pcnalsx_MatchInfo As Byte())
            Dim int32_pcsx_MatchInfo As New List(Of Integer)
            For i = 0 To raw_pcnalsx_MatchInfo.Length - 1 Step 4
                int32_pcsx_MatchInfo.Add(BitConverter.ToInt32(raw_pcnalsx_MatchInfo, i))
            Next

            'take the raw data and parse it out
            Me.matchablePhrases = int32_pcsx_MatchInfo(0)
            int32_pcsx_MatchInfo.RemoveAt(0)

            Me.userDefinedColumns = int32_pcsx_MatchInfo(0)
            int32_pcsx_MatchInfo.RemoveAt(0)

            Me.totalDocuments = int32_pcsx_MatchInfo(0)
            int32_pcsx_MatchInfo.RemoveAt(0)

            'remember that the columns are 0-based
            For i = 0 To userDefinedColumns - 1
                _averageTokensInDocument.Add(int32_pcsx_MatchInfo(0))
                int32_pcsx_MatchInfo.RemoveAt(0)
            Next

            For i = 0 To userDefinedColumns - 1
                _tokensInDocument.Add(int32_pcsx_MatchInfo(0))
                int32_pcsx_MatchInfo.RemoveAt(0)
            Next

            For i = 0 To userDefinedColumns - 1
                _longestSubsequencePhraseMatches.Add(int32_pcsx_MatchInfo(0))
                int32_pcsx_MatchInfo.RemoveAt(0)
            Next

            _int32HitData = New List(Of Integer)(int32_pcsx_MatchInfo)

        End Sub

        Public Function hits_this_row(phrase As Integer, column As Integer) As Integer
            Return _int32HitData(3 * (column + phrase * userDefinedColumns) + 0)
        End Function

        Public Function hits_all_rows(phrase As Integer, column As Integer) As Integer
            Return _int32HitData(3 * (column + phrase * userDefinedColumns) + 1)
        End Function

        Public Function docs_with_hits(phrase As Integer, column As Integer) As Integer
            Return _int32HitData(3 * (column + phrase * userDefinedColumns) + 2)
        End Function
    End Class

    <SQLiteFunction("Rank", 1, FunctionType.Scalar)>
    Public Class Rank
        Inherits SQLiteFunction

        Public Overrides Function Invoke(args() As Object) As Object
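            'matchinfo() arrives as a BLOB, which System.Data.SQLite passes in as a Byte array
            Return New MatchInfo(DirectCast(args(0), Byte())).BM25Rank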
        End Function

    End Class
