How can I compare sequences using SQL?
【Posted】: 2014-06-03 20:59:56
【Question】: This might be a bit convoluted, but I don't know how to explain it more simply.
I have two tables of sequences:
t1:
+-------+----------+-------+-------+
| state | sequence | gie | match |
+-------+----------+-------+-------+
| a | 1 | fna | |
| c | 2 | fna | |
| b | 3 | fna | |
| d | 1 | dmc | |
| c | 2 | dmc | |
| c | 3 | dmc | |
+-------+----------+-------+-------+
t2:
+-------+----------+-------+-------+
| state | sequence | gie | match |
+-------+----------+-------+-------+
| a | 1 | fna | |
| d | 2 | fna | |
| c | 3 | fna | |
| b | 4 | fna | |
| d | 1 | dmc | |
| c | 2 | dmc | |
+-------+----------+-------+-------+
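For each sequence within a given gie group, I want to find all records in t2 that do not fit the t1 sequence, and vice versa. In t1, the unmatched record is GIE dmc, sequence 3; in t2, the unmatched record is GIE fna, sequence 2.
I can't figure out how to find the non-matches with SQL, because it isn't clear what I should join on. I tried the following in VBA: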
'assumes both recordsets are ordered by GIE,sequence
Sub findNonMatch(rs_base As DAO.Recordset, rs_compare As DAO.Recordset)
    rs_base.MoveFirst
    rs_compare.MoveFirst
    While Not rs_base.EOF
        If rs_compare.EOF Then
            updateRS rs_base, False
        'separated into different if-clauses because checking rs_compare!GIE will throw error if rs_compare.eof
        Else
            If rs_compare!gie < rs_base!gie Then
                While rs_compare!gie < rs_base!gie
                    rs_compare.MoveNext
                Wend
            End If
            While (rs_compare!gie = rs_base!gie And rs_compare!state <> rs_base!state And (Not rs_compare.EOF))
                rs_compare.MoveNext
            Wend
            If (rs_compare!state = rs_base!state And rs_compare!gie = rs_base!gie) Then
                updateRS rs_base, True
                rs_compare.MoveNext
            End If
        End If
        rs_base.MoveNext
    Wend
End Sub

Sub updateRS(rs As DAO.Recordset, status As Boolean)
    rs.Edit
    rs!Match = status
    rs.Update
End Sub
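This approach works if the sequence in rs_compare has an extra value that is not in rs_base. It fails if rs_compare is missing a value that is in rs_base: the function tries to find that value and runs to the end of the rs_compare sequence, so none of the following rs_base values will be found (because the rs_compare cursor has already moved past every record in that sequence).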
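The failure mode described above can be reproduced with a small Python model. This is a simplified sketch of a single forward-only cursor over one group, not a translation of the VBA; the function name `greedy_match` and the use of plain strings for the state sequences are assumptions made for illustration.

```python
def greedy_match(base, compare):
    """Flag each item of `base` as matched or not, using one forward-only cursor."""
    flags = []
    j = 0  # cursor into `compare`; it never moves backwards
    for item in base:
        # Skip ahead until the cursor sits on an equal item or runs off the end.
        while j < len(compare) and compare[j] != item:
            j += 1
        if j < len(compare):
            flags.append(True)
            j += 1  # consume the matched item
        else:
            # The cursor is exhausted, so every later item will also fail.
            flags.append(False)
    return flags


# An extra value in `compare` is harmless:
print(greedy_match("acb", "adcb"))
# A value missing from `compare` drags the cursor to the end:
print(greedy_match("adcb", "acb"))
```

In the second call only the leading `a` is flagged as matched: searching for the missing `d` consumes `c` and `b`, so they are never found afterwards.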
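Is there a simpler way to find these sequence differences? Perhaps some SQL approach I haven't thought of, especially since this algorithm doesn't scale well to larger data sets?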
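【Comments】:
Hi, you wrote: "In t2, the unmatched record is GIE fna, sequence 2." Did you mean sequence 4?
@VBlades, no, it is sequence 2; the value is "d", and in table t1 no "d" follows the "a" at sequence 1.
OK, thanks. Will think about it some more. Interesting problem.
Hey, I think I have a solution (whether it's a good one, I'm not sure, haha), but I'd like to test it against another set of data. Could you provide another set of mock data? Thanks.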
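【Answer 1】:
Take a look at this way of implementing MINUS in MS Access: How can I implement SQL INTERSECT and MINUS operations in MS Access.
What you would do is left outer join t1 to t2 on sequence, state, and GIE, and select all rows where t2.id is null.
Then you can union that query with a second query that left outer joins t2 to t1 on sequence, state, and GIE, and selects all rows where t1.id is null.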
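A minimal sketch of that double anti-join, run against the question's sample data. It uses Python's `sqlite3` module as a stand-in for Access, so the SQL dialect is SQLite rather than Access SQL. The sample tables have no id column, so the null test is done on `state` instead; that substitution, the `src` label column, and the function name `positional_mismatches` are assumptions.

```python
import sqlite3

# (state, sequence, gie) rows copied from the question's tables.
T1 = [("a", 1, "fna"), ("c", 2, "fna"), ("b", 3, "fna"),
      ("d", 1, "dmc"), ("c", 2, "dmc"), ("c", 3, "dmc")]
T2 = [("a", 1, "fna"), ("d", 2, "fna"), ("c", 3, "fna"), ("b", 4, "fna"),
      ("d", 1, "dmc"), ("c", 2, "dmc")]

ANTI_JOIN = """
SELECT 't1' AS src, a.state, a.sequence, a.gie
FROM t1 AS a LEFT JOIN t2 AS b
  ON a.sequence = b.sequence AND a.state = b.state AND a.gie = b.gie
WHERE b.state IS NULL
UNION ALL
SELECT 't2' AS src, b.state, b.sequence, b.gie
FROM t2 AS b LEFT JOIN t1 AS a
  ON a.sequence = b.sequence AND a.state = b.state AND a.gie = b.gie
WHERE a.state IS NULL
"""


def positional_mismatches():
    """Return rows that have no partner with the same sequence, state and gie."""
    con = sqlite3.connect(":memory:")
    for name, rows in (("t1", T1), ("t2", T2)):
        con.execute(f"CREATE TABLE {name} (state TEXT, sequence INTEGER, gie TEXT)")
        con.executemany(f"INSERT INTO {name} VALUES (?, ?, ?)", rows)
    result = sorted(con.execute(ANTI_JOIN).fetchall())
    con.close()
    return result


mismatches = positional_mismatches()
for row in mismatches:
    print(row)
```

On this data the query reports six rows rather than the two the question expects. Joining on the sequence number compares positions exactly, so every fna record after the extra "d" in t2 is shifted by one and no longer lines up with its t1 counterpart.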
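【Comments】:
I don't think this will work, because if t2 has an extra record in the middle of a sequence, every t2.sequence from that point on is thrown off and I won't be able to match it to the corresponding t1.sequence.
Then how do we know that two sequences are the same?
I'm not just trying to match sequences exactly. I think user3260813's answer correctly characterizes this as the longest common subsequence problem.
【Answer 2】:
What you have is the longest common subsequence problem for comparing two strings.
See the VBA code at http://thydzik.com/longest-common-subsequence-implemented-in-vba-visual-basic-for-applications/
You would need to get string1 and string2 out of Access somehow and use the function from the link. So for your example: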
String1 = acb
String2 = adcb
See the sample function in the link for how to use it. The output of the "getDiff" function would be
=+==
So the difference is at position 2. The + means that inserting 'd' into string 1 makes the strings equal.
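The idea can be sketched in Python with a standard longest-common-subsequence table. This is not the linked VBA implementation: the function name `diff_ops` is hypothetical, and the `-` symbol for an item that exists only in the first sequence is an assumption (the answer only shows `=` and `+`).

```python
def diff_ops(s1, s2):
    """Describe how s2 differs from s1: '=' match, '+' only in s2, '-' only in s1."""
    n, m = len(s1), len(s2)
    # lcs[i][j] = length of the longest common subsequence of s1[i:] and s2[j:]
    lcs = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(n - 1, -1, -1):
        for j in range(m - 1, -1, -1):
            if s1[i] == s2[j]:
                lcs[i][j] = 1 + lcs[i + 1][j + 1]
            else:
                lcs[i][j] = max(lcs[i + 1][j], lcs[i][j + 1])

    # Walk forward from the start, always taking the earliest possible match.
    ops = []
    i = j = 0
    while i < n and j < m:
        if s1[i] == s2[j]:
            ops.append("=")
            i += 1
            j += 1
        elif lcs[i + 1][j] >= lcs[i][j + 1]:
            ops.append("-")  # s1[i] has no partner in s2
            i += 1
        else:
            ops.append("+")  # s2[j] has no partner in s1
            j += 1
    ops.extend("-" * (n - i))  # leftovers at the end of s1
    ops.extend("+" * (m - j))  # leftovers at the end of s2
    return "".join(ops)


print(diff_ops("acb", "adcb"))  # the fna group: t1 vs t2
print(diff_ops("dcc", "dc"))    # the dmc group: t1 vs t2
```

For the fna group this yields `=+==`, and for the dmc group `==-`, which flags the third t1 record as the one without a partner. Because the function only indexes and compares items, it also accepts lists of state values, so states longer than one character would not need to be packed into a string.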
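【Answer 3】:
I mocked up some test data myself and it seems to behave as expected, so I thought I'd post. I wanted SQL to do the heavy lifting, and it does, but there is still some code to run. If you just want to try it out, I've put the accdb (Access 2007) file here: http://www.sendspace.com/file/eqm5vh. If you do, just enter data into t1 and t2, then open Module1 and run RunSequences; the subs should take care of the rest.
My code is not as concise as yours, sigil, and it needs more supporting objects. That said, it may scale better than a purely cursor-based solution, since it only has to process one row per item in each table (more or less, depending on how many duplicates there are across the tables, if any). My idea was to number the rows of each table consecutively (like ROW_NUMBER in SQL Server) so that I'd have an absolute position to compare. I did this by inserting all the data from both tables into a temp table with an AutoNumber field, then using the old DCount trick to get row IDs. The rest builds on this data set. I won't explain it to death; I'll let you try it and see whether it works, but I'll post my code below in case anyone wants to look.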
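The numbering step can be sketched as follows, again using Python's `sqlite3` module as a stand-in for Access. `INTEGER PRIMARY KEY AUTOINCREMENT` plays the role of the AutoNumber field, and a correlated `COUNT(*)` subquery plays the role of the DCount call; both substitutions, and the function name `numbered_rows`, are assumptions. The rows are entered already sorted by SourceTable, GIE descending, Sequence, which is the order the insert query in this answer requests.

```python
import sqlite3

# (SourceTable, State, Sequence, GIE), pre-sorted by SourceTable, GIE DESC, Sequence.
ROWS = [
    ("t1", "a", 1, "fna"), ("t1", "c", 2, "fna"), ("t1", "b", 3, "fna"),
    ("t1", "d", 1, "dmc"), ("t1", "c", 2, "dmc"), ("t1", "c", 3, "dmc"),
    ("t2", "a", 1, "fna"), ("t2", "d", 2, "fna"), ("t2", "c", 3, "fna"),
    ("t2", "b", 4, "fna"), ("t2", "d", 1, "dmc"), ("t2", "c", 2, "dmc"),
]


def numbered_rows():
    """Return (SourceTable, SequenceID, ID, State, GIE) for every temp-table row."""
    con = sqlite3.connect(":memory:")
    # ID is assigned automatically in insertion order, like an Access AutoNumber.
    con.execute(
        "CREATE TABLE tblTemp ("
        "ID INTEGER PRIMARY KEY AUTOINCREMENT, "
        "SourceTable TEXT, State TEXT, Sequence INTEGER, GIE TEXT)"
    )
    con.executemany(
        "INSERT INTO tblTemp (SourceTable, State, Sequence, GIE) VALUES (?, ?, ?, ?)",
        ROWS,
    )
    # Count the rows of the same source table with an ID at or below this one.
    # That count is the row's 1-based position within its own table.
    result = con.execute(
        "SELECT t.SourceTable, "
        "       (SELECT COUNT(*) FROM tblTemp AS x "
        "         WHERE x.SourceTable = t.SourceTable AND x.ID <= t.ID) AS SequenceID, "
        "       t.ID, t.State, t.GIE "
        "FROM tblTemp AS t ORDER BY t.ID"
    ).fetchall()
    con.close()
    return result


for row in numbered_rows():
    print(row)
```

The t2 rows receive IDs 7 through 12 in the temp table but SequenceIDs 1 through 6, so each table ends up with its own gap-free numbering that does not depend on the original per-group sequence values.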
Tables:
Queries:
qryT1T2_Ordered_INSERT:
INSERT INTO tblTemp
SELECT *
FROM (SELECT "t1" AS SourceTable, t1.State, t1.Sequence, t1.GIE, t1.Match
FROM t1
UNION ALL
SELECT "t2" AS SourceTable, t2.State, t2.Sequence, t2.GIE, t2.Match
FROM t2) AS [%$##@_Alias]
ORDER BY SourceTable, GIE DESC , Sequence;
qryT1_Sequenced:
SELECT DCount("*","tblTemp","[SourceTable] = 't1' AND [ID] <= " & [ID]) AS SequenceID, tblTemp.ID, tblTemp.State, tblTemp.Sequence, tblTemp.GIE, tblTemp.Match, [State] & "_" & [GIE] AS JoinValue
FROM tblTemp
WHERE tblTemp.SourceTable="t1";
qryT1_Compare:
SELECT qryT1_Sequenced.SequenceID AS MySequenceID, qryT2_Sequenced.SequenceID AS OtherSequenceID, qryT1_Sequenced.ID AS MyID, qryT2_Sequenced.ID AS OtherID, qryT2_Sequenced.JoinValue
FROM qryT1_Sequenced LEFT JOIN qryT2_Sequenced ON qryT1_Sequenced.JoinValue = qryT2_Sequenced.JoinValue
ORDER BY qryT1_Sequenced.SequenceID, qryT2_Sequenced.ID;
qryT2_Sequenced:
SELECT DCount("*","tblTemp","[SourceTable] = 't2' AND [ID] <= " & [ID]) AS SequenceID, tblTemp.ID, tblTemp.State, tblTemp.Sequence, tblTemp.GIE, tblTemp.Match, [State] & "_" & [GIE] AS JoinValue
FROM tblTemp
WHERE tblTemp.SourceTable="t2";
qryT2_Compare:
SELECT qryT2_Sequenced.SequenceID AS MySequenceID, qryT1_Sequenced.SequenceID AS OtherSequenceID, qryT2_Sequenced.ID AS MyID, qryT1_Sequenced.ID AS OtherID, qryT2_Sequenced.JoinValue
FROM qryT2_Sequenced LEFT JOIN qryT1_Sequenced ON qryT2_Sequenced.JoinValue=qryT1_Sequenced.JoinValue
ORDER BY qryT2_Sequenced.SequenceID, qryT1_Sequenced.ID;
qryT1T2_Compared_FINAL:
SELECT tblTemp.SourceTable, tblTemp.State, tblTemp.Sequence, tblTemp.GIE, tblTemp.Match
FROM tblTemp
WHERE tblTemp.Match="No"
ORDER BY tblTemp.SourceTable, tblTemp.GIE DESC , tblTemp.Sequence;
Module:
Public Sub RunSequences()
    On Error GoTo ErrorHandler
    'Declared explicitly so the module also compiles under Option Explicit.
    Dim db As DAO.Database
    DoCmd.SetWarnings False
    Set db = CurrentDb()
    'Do our setup:
    '1. Clear our temp table.
    db.Execute "DELETE * FROM [tblTemp]"
    '2. Insert data from t1 and t2 into temp table.
    DoCmd.OpenQuery "qryT1T2_Ordered_INSERT"
    '3. Now process the sequence.
    ReportSequences "qryT1_Compare"
    ReportSequences "qryT2_Compare"
    '4. Open non-matched report.
    DoCmd.OpenQuery "qryT1T2_Compared_FINAL"
ExitMe:
    DoCmd.SetWarnings True
    Set db = Nothing
    Exit Sub
ErrorHandler:
    Debug.Print Err.Number & ": " & Err.Description
    GoTo ExitMe
End Sub
'----
Public Sub ReportSequences(strSourceQuery As String)
    On Error GoTo ErrorHandler
    Dim db As DAO.Database
    Dim rst As DAO.Recordset
    Dim intLastOtherSequenceID As Integer
    'Scripting.Dictionary requires a reference to Microsoft Scripting Runtime.
    Dim dicMasterSequenceIDs As New Scripting.Dictionary
    Dim dicComparedSequenceIDs As New Scripting.Dictionary
    Dim strSQL_UpdateYes As String
    Dim strSQL_UpdateNo As String
    'Running all my updates inline, but you can break this out.
    strSQL_UpdateYes = "UPDATE [tblTemp] SET [Match] = 'Yes' WHERE [ID] = @ID"
    strSQL_UpdateNo = "UPDATE [tblTemp] SET [Match] = 'No' WHERE [ID] = @ID"
    Set db = CurrentDb()
    Set rst = db.OpenRecordset(strSourceQuery, dbOpenDynaset)
    With rst
        Do Until .EOF
            'Need this to keep track of Master Sequence IDs (MyID) we've processed
            'successfully.
            'If there is more than one match for MyID, we want only to take the first
            'match that fulfills the condition of being next in the sequence,
            'not jump ahead.
            If dicMasterSequenceIDs.Exists(.Fields("MyID").Value) = True Then
                If dicMasterSequenceIDs(.Fields("MyID").Value) = "Done" Then
                    GoTo MoveNext
                End If
            Else
                dicMasterSequenceIDs.Add .Fields("MyID").Value, ""
            End If
            Select Case IsNull(.Fields("OtherID"))
                Case True
                    'If OtherID is null, it means no match in other table, so Match is
                    'automatically no.
                    db.Execute Replace(strSQL_UpdateNo, "@ID", .Fields("MyID"))
                Case False
                    'Check to see if current OtherSequenceID is greater than the old
                    'one...
                    '(If it is, it is in sequence).
                    If intLastOtherSequenceID < CInt(.Fields("OtherSequenceID")) Then
                        'Use the dictionary to keep track of distinct OtherSequenceIDs we've already added.
                        If dicComparedSequenceIDs.Exists(.Fields("OtherSequenceID").Value) = False Then
                            dicComparedSequenceIDs.Add .Fields("OtherSequenceID").Value, ""
                            db.Execute Replace(strSQL_UpdateYes, "@ID", .Fields("MyID"))
                            dicMasterSequenceIDs(.Fields("MyID").Value) = "Done"
                        'If it's a dupe, means the sequence is broken.
                        Else
                            db.Execute Replace(strSQL_UpdateNo, "@ID", .Fields("MyID"))
                        End If
                    Else
                        'If the old one is equal or greater, means sequence is broken.
                        db.Execute Replace(strSQL_UpdateNo, "@ID", .Fields("MyID"))
                    End If
                    intLastOtherSequenceID = .Fields("OtherSequenceID")
            End Select
MoveNext:
            .MoveNext
        Loop
    End With
ExitMe:
    Set dicComparedSequenceIDs = Nothing
    Set rst = Nothing
    Set db = Nothing
    Exit Sub
ErrorHandler:
    Debug.Print Err.Number & ": " & Err.Description
    GoTo ExitMe
End Sub
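Anyway, I hope it works for you. If not, hopefully it gives you some more ideas.
EDIT: I realized there was a problem with the logic in the sub ReportSequences. If there are several matches in the other sequence, we only want to take the earliest match in that sequence that meets the condition. Added. The new accdb is here: http://www.sendspace.com/file/hcdxvp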