How can I compare sequences using SQL?

Posted: 2014-06-03 20:59:56

Question:

This may be a bit convoluted, but I don't know how to explain it more simply.

I have two tables of sequences:

t1:

+-------+----------+-------+-------+
| state | sequence | gie   | match |
+-------+----------+-------+-------+
| a     |        1 | fna   |       |
| c     |        2 | fna   |       |
| b     |        3 | fna   |       |
| d     |        1 | dmc   |       |
| c     |        2 | dmc   |       |
| c     |        3 | dmc   |       |
+-------+----------+-------+-------+

t2:

+-------+----------+-------+-------+
| state | sequence | gie   | match |
+-------+----------+-------+-------+
| a     |        1 | fna   |       |
| d     |        2 | fna   |       |
| c     |        3 | fna   |       |
| b     |        4 | fna   |       |
| d     |        1 | dmc   |       |
| c     |        2 | dmc   |       |
+-------+----------+-------+-------+

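For each sequence within a given group (GIE), I want to find all records from t2 that don't fit the t1 sequence, and vice versa. In t1, the unmatched record is GIE dmc, sequence 3; in t2, the unmatched record is GIE fna, sequence 2.

I can't figure out how to find the non-matches with SQL, because it isn't clear what I should join on. I tried the following in VBA: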

'assumes both recordsets are ordered by GIE,sequence
Sub findNonMatch(rs_base As DAO.Recordset, rs_compare As DAO.Recordset)

rs_base.MoveFirst
rs_compare.MoveFirst

While Not rs_base.EOF
    If rs_compare.EOF Then
        updateRS rs_base, False
    'separated into different if-clauses because checking rs_compare!GIE will throw error if rs_compare.eof
    Else
        If rs_compare!gie < rs_base!gie Then
        While rs_compare!gie < rs_base!gie
            rs_compare.MoveNext
        Wend
        End If

        While (rs_compare!gie = rs_base!gie And rs_compare!state <> rs_base!state And (Not rs_compare.EOF))
            rs_compare.MoveNext
        Wend
        If (rs_compare!state = rs_base!state And rs_compare!gie = rs_base!gie) Then
            updateRS rs_base, True
            rs_compare.MoveNext
        End If
    End If
    rs_base.MoveNext

Wend

End Sub

Sub updateRS(rs As DAO.Recordset, status As Boolean)
rs.Edit
rs!Match = status
rs.update
End Sub

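This works if the sequence in rs_compare has an extra value that is not in rs_base. It fails if rs_compare is missing a value that is in rs_base: the function tries to find that value and runs to the end of the rs_compare sequence, so no later rs_base value can be found (the rs_compare cursor has already passed every record in that sequence).

Is there a simpler way to find these sequence differences? Perhaps some SQL approach I haven't thought of, especially since this algorithm won't scale well to larger datasets?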

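Comments:

Hi, you wrote: "In t2, the unmatched record is GIE fna, sequence 2." Did you mean sequence 4?

@VBlades, no, it's sequence 2; the value is "d", and in table t1 there is no "d" after the "a" at sequence 1.

OK, thanks. Will think about it some more. Interesting problem.

Hey, I think I have a solution (whether it's a good one, not sure, haha), but I'd like to test it with another set of data. Could you provide another set of mock data? Thanks.

Answer 1: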

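Take a look at this way of implementing MINUS in MS Access: How can I implement SQL INTERSECT and MINUS operations in MS Access.

What you want to do is left outer join t1 to t2 on sequence, state, and GIE, and select all rows where t2.id is null.

Then you can union that query with a second query that left outer joins t2 to t1 on sequence, state, and GIE, and selects all rows where t1.id is null.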

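Comments:

I don't think this will work, because if t2 has an extra record in the middle of a sequence, every t2.sequence from that point on is thrown off, and I won't be able to match them with the corresponding t1.sequence.

Then how do we know that two sequences are the same?

I'm not just trying to match the sequences exactly. I think user3260813's answer correctly characterizes this as a longest common substring problem.

Answer 2: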

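What you have is a longest common subsequence problem for comparing two strings (strictly speaking a subsequence rather than a substring, since the matching elements need not be contiguous).

See the VBA code at http://thydzik.com/longest-common-subsequence-implemented-in-vba-visual-basic-for-applications/

You would need to get string1 and string2 out of Access somehow and pass them to the function from the link. So for your example: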

  String1 = acb
  String2 = adcb

See the example function in the link for how to use it. The output of the "getDiff" function will be

 =+==

So the difference is at position 2. The + means that inserting 'd' into string 1 makes the strings equal.
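To make the idea concrete, here is a minimal Python sketch of a longest-common-subsequence diff. It is not the linked VBA code; the function names lcs_table and diff_marks are invented for this example, and the '-' mark (a character present only in string 1) is an assumption about how a deletion would be shown, since the answer only demonstrates '=' and '+'.

```python
def lcs_table(a, b):
    """table[i][j] = length of the longest common subsequence of a[i:] and b[j:]."""
    table = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i in range(len(a) - 1, -1, -1):
        for j in range(len(b) - 1, -1, -1):
            if a[i] == b[j]:
                table[i][j] = table[i + 1][j + 1] + 1
            else:
                table[i][j] = max(table[i + 1][j], table[i][j + 1])
    return table


def diff_marks(a, b):
    """Return one mark per step: '=' match, '+' only in b, '-' only in a."""
    table = lcs_table(a, b)
    i = j = 0
    marks = []
    while i < len(a) and j < len(b):
        if a[i] == b[j]:
            marks.append("=")
            i += 1
            j += 1
        elif table[i + 1][j] >= table[i][j + 1]:
            marks.append("-")   # skipping a[i] keeps the LCS at least as long
            i += 1
        else:
            marks.append("+")   # skipping b[j] keeps the LCS longer
            j += 1
    marks.extend("-" * (len(a) - i))   # leftovers of a
    marks.extend("+" * (len(b) - j))   # leftovers of b
    return "".join(marks)


print(diff_marks("acb", "adcb"))  # the fna group from the question
print(diff_marks("dcc", "dc"))    # the dmc group from the question
```

For the fna group the '+' falls on the second character of string 2, and for the dmc group the '-' falls on the third character of string 1, which lines up with the two unmatched records named in the question.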

Answer 3:

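I mocked up some test data myself and it seems to behave as expected, so I figured I'd post. I wanted SQL to do the heavy lifting, and it does, but there is still some code to run. If you just want to try it out, I've put the accdb (Access 2007) file here: http://www.sendspace.com/file/eqm5vh. If you do, just enter your data into t1 and t2, then open Module1 and run RunSequences; the subs should take care of the rest.

My code is not as succinct as yours, sigil, and it needs more helper objects. That said, it will probably scale better than a purely cursor-based solution, since it only has to process one row for each item in each table (more or less, depending on the number of duplicates across the tables, if any). My idea was to number the rows of each table (like ROW_NUMBER in SQL Server) so that I'd have an absolute position to compare against. I did this by inserting all the data from both tables into a temp table with an AutoNumber field, then using the old DCount trick to get the row ID. The rest builds on this dataset. I won't explain it to death; try it and see whether it works. I'm posting my code below in case anyone wants to look.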
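The row-numbering idea can be sketched outside Access as well. The example below is an assumption-laden illustration, not the answer's actual queries: it uses SQLite via Python's sqlite3 module, an AUTOINCREMENT key stands in for the AutoNumber field, a correlated COUNT(*) subquery stands in for DCount, and the function name sequenced_rows is hypothetical.

```python
import sqlite3

# Rows already in the load order the answer uses:
# SourceTable, then GIE descending, then Sequence.
ORDERED_ROWS = [
    ("t1", "a", 1, "fna"), ("t1", "c", 2, "fna"), ("t1", "b", 3, "fna"),
    ("t1", "d", 1, "dmc"), ("t1", "c", 2, "dmc"), ("t1", "c", 3, "dmc"),
    ("t2", "a", 1, "fna"), ("t2", "d", 2, "fna"), ("t2", "c", 3, "fna"),
    ("t2", "b", 4, "fna"), ("t2", "d", 1, "dmc"), ("t2", "c", 2, "dmc"),
]


def sequenced_rows(source):
    """Return (SequenceID, ID, JoinValue) for one source table."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE tblTemp ("
        "ID INTEGER PRIMARY KEY AUTOINCREMENT, "
        "SourceTable TEXT, State TEXT, Sequence INTEGER, GIE TEXT)")
    conn.executemany(
        "INSERT INTO tblTemp (SourceTable, State, Sequence, GIE) "
        "VALUES (?, ?, ?, ?)", ORDERED_ROWS)
    query = """
        SELECT (SELECT COUNT(*)            -- rows of the same table so far
                FROM tblTemp AS prev
                WHERE prev.SourceTable = cur.SourceTable
                  AND prev.ID <= cur.ID) AS SequenceID,
               cur.ID,
               cur.State || '_' || cur.GIE AS JoinValue
        FROM tblTemp AS cur
        WHERE cur.SourceTable = ?
        ORDER BY cur.ID
    """
    rows = conn.execute(query, (source,)).fetchall()
    conn.close()
    return rows


for row in sequenced_rows("t2"):
    print(row)
```

Each table gets its own running position starting at 1, regardless of where its rows sit in the shared temp table, which is what makes the positions of the two tables comparable.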

Tables:

Queries:

qryT1T2_Ordered_INSERT:

INSERT INTO tblTemp
SELECT *
FROM (SELECT "t1" AS SourceTable, t1.State, t1.Sequence, t1.GIE, t1.Match
FROM t1

UNION ALL

SELECT "t2" AS SourceTable, t2.State, t2.Sequence, t2.GIE, t2.Match
FROM t2)  AS [%$##@_Alias]
ORDER BY SourceTable, GIE DESC , Sequence;

qryT1_Sequenced:

SELECT DCount("*","tblTemp","[SourceTable] = 't1' AND [ID] <= " & [ID]) AS SequenceID, tblTemp.ID, tblTemp.State, tblTemp.Sequence, tblTemp.GIE, tblTemp.Match, [State] & "_" & [GIE] AS JoinValue
FROM tblTemp
WHERE tblTemp.SourceTable="t1";

qryT1_Compare:

SELECT qryT1_Sequenced.SequenceID AS MySequenceID, qryT2_Sequenced.SequenceID AS OtherSequenceID, qryT1_Sequenced.ID AS MyID, qryT2_Sequenced.ID AS OtherID, qryT2_Sequenced.JoinValue
FROM qryT1_Sequenced LEFT JOIN qryT2_Sequenced ON qryT1_Sequenced.JoinValue = qryT2_Sequenced.JoinValue
ORDER BY qryT1_Sequenced.SequenceID, qryT2_Sequenced.ID;

qryT2_Sequenced:

SELECT DCount("*","tblTemp","[SourceTable] = 't2' AND [ID] <= " & [ID]) AS SequenceID, tblTemp.ID, tblTemp.State, tblTemp.Sequence, tblTemp.GIE, tblTemp.Match, [State] & "_" & [GIE] AS JoinValue
FROM tblTemp
WHERE tblTemp.SourceTable="t2";

qryT2_Compare:

SELECT qryT2_Sequenced.SequenceID AS MySequenceID, qryT1_Sequenced.SequenceID AS OtherSequenceID, qryT2_Sequenced.ID AS MyID, qryT1_Sequenced.ID AS OtherID, qryT2_Sequenced.JoinValue
FROM qryT2_Sequenced LEFT JOIN qryT1_Sequenced ON qryT2_Sequenced.JoinValue=qryT1_Sequenced.JoinValue
ORDER BY qryT2_Sequenced.SequenceID, qryT1_Sequenced.ID;

qryT1T2_Compared_FINAL:

SELECT tblTemp.SourceTable, tblTemp.State, tblTemp.Sequence, tblTemp.GIE, tblTemp.Match
FROM tblTemp
WHERE tblTemp.Match="No"
ORDER BY tblTemp.SourceTable, tblTemp.GIE DESC , tblTemp.Sequence;

Module:

Public Sub RunSequences()
On Error GoTo ErrorHandler

    DoCmd.SetWarnings False

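    'db was never declared in the original; declare it and use it below.
    Dim db As DAO.Database
    Set db = CurrentDb()

    'Do our setup:
    '1. Clear our temp table (dbFailOnError raises an error if the delete fails).
    db.Execute "DELETE * FROM [tblTemp]", dbFailOnError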

    '2. Insert data from t1 and t2 into temp table.
    DoCmd.OpenQuery "qryT1T2_Ordered_INSERT"

    '3. Now process the sequence.
    ReportSequences "qryT1_Compare"
    ReportSequences "qryT2_Compare"

    '4. Open non-matched report.
    DoCmd.OpenQuery "qryT1T2_Compared_FINAL"

ExitMe:
    DoCmd.SetWarnings True

    Exit Sub
ErrorHandler:
    Debug.Print Err.Number & ": " & Err.Description
    GoTo ExitMe
End Sub

'----

Public Sub ReportSequences(strSourceQuery As String)
On Error GoTo ErrorHandler

    Dim db As DAO.Database
    Dim rst As DAO.Recordset
    Dim intLastOtherSequenceID As Integer
    Dim dicMasterSequenceIDs As New Scripting.Dictionary
    Dim dicComparedSequenceIDs As New Scripting.Dictionary
    Dim strSQL_UpdateYes As String
    Dim strSQL_UpdateNo As String

    'Running all my updates inline, but you can break this out.
    strSQL_UpdateYes = "UPDATE [tblTemp] SET [Match] = 'Yes' WHERE [ID] = @ID"
    strSQL_UpdateNo = "UPDATE [tblTemp] SET [Match] = 'No' WHERE [ID] = @ID"

    Set db = CurrentDb()
    Set rst = db.OpenRecordset(strSourceQuery, dbOpenDynaset)

    With rst
        Do Until .EOF
            'Need this to keep track of Master Sequence IDs (MyID) we've processed
            'successfully.
            'If there is more than one match for MyID, we want only to take the first
            'match that fulfills the condition of being next in the sequence,
            'not jump ahead.
            If dicMasterSequenceIDs.Exists(.Fields("MyID").Value) = True Then
                If dicMasterSequenceIDs(.Fields("MyID").Value) = "Done" Then
                    GoTo MoveNext
                End If
            Else
                dicMasterSequenceIDs.Add .Fields("MyID").Value, ""
            End If

            Select Case IsNull(.Fields("OtherID"))
                Case True
                    'If OtherID is null, it means no match in other table, so Match is
                    'automatically no.
                    db.Execute Replace(strSQL_UpdateNo, "@ID", .Fields("MyID"))
                Case False
                    'Check to see if current OtherSequenceID is greater than the old
                    'one...
                    '(If it is, it is in sequence).
                    If intLastOtherSequenceID < CInt(.Fields("OtherSequenceID")) Then
                        'Use the dictionary to keep track of distinct OtherSequenceIDs we've already added.
                        If dicComparedSequenceIDs.Exists(.Fields("OtherSequenceID").Value) = False Then
                            dicComparedSequenceIDs.Add .Fields("OtherSequenceID").Value, ""
                            db.Execute Replace(strSQL_UpdateYes, "@ID", .Fields("MyID"))
                            dicMasterSequenceIDs(.Fields("MyID").Value) = "Done"
                        'If it's a dupe, means the sequence is broken.
                        Else
                            db.Execute Replace(strSQL_UpdateNo, "@ID", .Fields("MyID"))
                        End If
                    Else
                        'If the old one is equal or greater, means sequence is broken.
                        db.Execute Replace(strSQL_UpdateNo, "@ID", .Fields("MyID"))
                    End If

                    intLastOtherSequenceID = .Fields("OtherSequenceID")
            End Select

MoveNext:
            .MoveNext
        Loop
    End With

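ExitMe:
    'Close the recordset only if it was actually opened.
    If Not rst Is Nothing Then rst.Close
    Set dicMasterSequenceIDs = Nothing
    Set dicComparedSequenceIDs = Nothing
    Set rst = Nothing
    Set db = Nothing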

    Exit Sub
ErrorHandler:
    Debug.Print Err.Number & ": " & Err.Description
    GoTo ExitMe

End Sub

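Anyway, I hope it works for you. If not, hopefully it gives you some more ideas.

Edit: I realized there was a problem with the logic in the ReportSequences sub. If there are several matches in the other sequence, we only want to take the earliest match in that sequence that satisfies the condition. Added. The new accdb is here: http://www.sendspace.com/file/hcdxvp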
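The "take the earliest match that is still in sequence" rule can be illustrated with a much simplified Python sketch. This is not a port of ReportSequences: it drops the temp table, the queries, and the dictionaries, and the function name flag_unmatched is invented for the example. It also assumes both lists are already ordered by GIE and then sequence. Being greedy, it is not guaranteed to find the longest possible alignment on every input, although it does reproduce the expected result on the question's data.

```python
# Records as (state, sequence, gie), ordered by group and then sequence.
T1 = [("a", 1, "fna"), ("c", 2, "fna"), ("b", 3, "fna"),
      ("d", 1, "dmc"), ("c", 2, "dmc"), ("c", 3, "dmc")]
T2 = [("a", 1, "fna"), ("d", 2, "fna"), ("c", 3, "fna"), ("b", 4, "fna"),
      ("d", 1, "dmc"), ("c", 2, "dmc")]


def flag_unmatched(mine, other):
    """Return the records of `mine` with no in-order partner in `other`."""
    unmatched = []
    next_pos = 0  # earliest position in `other` that is still available
    for state, sequence, gie in mine:
        # Earliest remaining record in `other` with the same state and group.
        found = next(
            (pos for pos in range(next_pos, len(other))
             if (other[pos][0], other[pos][2]) == (state, gie)),
            None)
        if found is None:
            unmatched.append((state, sequence, gie))
        else:
            next_pos = found + 1  # later matches must come after this one
    return unmatched


print(flag_unmatched(T1, T2))
print(flag_unmatched(T2, T1))
```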
