Algorithm for filtering a list

Posted 2023-02-27

Published: 2018-10-22 14:05:54

Question:

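I've implemented what I think is a pretty rubbish way of filtering a System.Collections.ArrayList in VBA. The code takes a list and an item/comparison value to filter out. It loops through the list and removes the first matching item it finds. It then starts the loop over again (because you can't For Each and .Remove at the same time).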

Public Sub Filter(ByVal testValue As Object, ByVal dataSet As ArrayList)
'testValue and the items in `dataSet` all Implement IComparable from mscorlib.dll
'This allows comparing objects for equality
'i.e. obj1.CompareTo(obj2) = 0 is equivalent to obj1 = obj2
    Dim item As IComparable
    Dim repeat As Boolean
    repeat = False
    For Each item In dataSet
        If item.CompareTo(testValue) = 0 Then   'or equiv; If item = testValue
            dataSet.Remove item
            repeat = True
            Exit For
        End If
    Next item
    If repeat Then Filter testValue, dataSet 
End Sub

Why is it rubbish?

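Suppose the list is X elements long and contains Y items that match the filter criterion, with X > Y. As far as I can tell, the best case is O(X), when all the Ys are clustered at the start. The worst case is when all the Ys are clustered at the end. In that case the algorithm needs about (X-Y)*Y lookup operations, which is largest when Y = X/2, so O(X^2).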
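The worst-case estimate can be written out as a short derivation. This is a sketch that counts only the comparisons made inside the For Each loop; it assumes every removing pass stops at the first match, and it ignores whatever work .Remove does internally.

```latex
% Worst case: all Y matching items sit at the end of a list of X items.
% Each of the Y removing passes compares the X - Y non-matching items
% plus the first match; one final pass finds no match at all.
\begin{align*}
L(Y) &= \underbrace{Y\,(X - Y + 1)}_{\text{removing passes}}
        + \underbrace{(X - Y)}_{\text{final pass}}
      = Y\,(X - Y) + X \\
\frac{dL}{dY} &= X - 2Y = 0 \;\Longrightarrow\; Y = \frac{X}{2} \\
L\!\left(\tfrac{X}{2}\right) &= \frac{X^2}{4} + X = O(X^2)
\end{align*}
```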

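That is bad compared with a simple O(X) algorithm that walks the list once, removes each match as it reaches it, and does not break out of the loop. But I don't know how to implement that. Is there any way to improve the performance of this filter?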

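Comments:

Nice question, although it does feel like a Code Review style question. Also, what is the data type in the ArrayList?

Isn't this VB.NET?

@QHarr The list contains objects that Implement IComparable, which is an interface from mscorlib.dll. However, they could just as easily be strings or numbers, in which case the check would be = rather than .CompareTo() = 0. PS Thanks, although this is only a condensed snippet of my code and I felt it fit better here than on CR. I thought CR was only for long things?

@DisplayName No, I'm doing this in Excel-VBA, referencing mscorlib.dll like this

CR wants the whole code (it's not about size but about context: the code in situ) rather than an MCVE. So yes... maybe a snippet is better here.

Answer 1: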

Could you not do something like the following, which I believe is O(n):

Option Explicit

Public Sub RemItems()

    Const TARGET_VALUE As String = "dd"
    Dim myList As Object
    Set myList = CreateObject("System.Collections.ArrayList")

    myList.Add "a"
    myList.Add "dd"
    myList.Add "a"
    myList.Add "a"
    myList.Add "a"
    myList.Add "dd"
    myList.Add "a"
    myList.Add "a"
    myList.Add "dd"
    myList.Add "a"
    myList.Add "a"

    Dim i As Long
    For i = myList.Count - 1 To 0 Step -1
        If myList(i) = TARGET_VALUE Then myList.Remove myList(i)
    Next i

End Sub
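The same reverse loop can be wrapped up as a reusable filter. The sketch below is a variation on it, not the code that was timed in this answer: FilterInPlace is a hypothetical name, the items are assumed to be strings or numbers that compare with =, and it calls RemoveAt instead of Remove. ArrayList.Remove searches from the front for the first equal item, while RemoveAt deletes the element at the index the loop is already looking at.

```vba
Option Explicit

' Hypothetical helper (not part of the original post): removes every item
' equal to testValue from dataSet, a late-bound System.Collections.ArrayList.
' Assumes the items are strings or numbers; for IComparable objects the
' test would be dataSet(i).CompareTo(testValue) = 0 instead.
Public Sub FilterInPlace(ByVal testValue As Variant, ByVal dataSet As Object)
    Dim i As Long
    ' Walk backwards so that deleting index i never shifts an unvisited item.
    For i = dataSet.Count - 1 To 0 Step -1
        If dataSet(i) = testValue Then dataSet.RemoveAt i
    Next i
End Sub
```

For the list built in RemItems above, the call would be FilterInPlace "dd", myList. Each RemoveAt still shifts the elements behind the deleted index, so this makes a single pass over the list but is not guaranteed to be O(n) overall.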

For complexity information, see this discussion:

Asymptotic complexity of .NET collection classes

If this (.NET-Big-O-Algorithm-Complexity-Cheat-Sheet) is to be believed:

Note: I rendered the HTML using https://htmledit.squarefree.com/

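Edit:

Caveat: I'm not a CS graduate. This is me playing around. I'm sure there are arguments to be had about the data types being handled, their distribution and so on... improvements welcome.

The .NET table above shows that removal from a HashTable is O(1) on average, versus O(n) for an ArrayList, so I randomly generated 100,000 rows from the values "a","b","c". I then used that as a fixed test set for the results below.

Test run code (be gentle!)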
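Since each removal from an ArrayList is O(n) according to that table, another option is to avoid removing altogether and copy the items worth keeping into a fresh list. The following is a minimal sketch of that idea, not something benchmarked in this answer: FilteredCopy is a hypothetical name, and it assumes the items are strings or numbers that can be compared with <>.

```vba
Option Explicit

' Hypothetical helper (not part of the original post): returns a new
' ArrayList holding every item of dataSet that is not equal to testValue.
' dataSet itself is left untouched. Assumes string or numeric items.
Public Function FilteredCopy(ByVal testValue As Variant, ByVal dataSet As Object) As Object
    Dim result As Object
    Set result = CreateObject("System.Collections.ArrayList")

    Dim item As Variant
    ' One pass over the source; nothing is removed, so nothing is shifted.
    For Each item In dataSet
        If item <> testValue Then result.Add item
    Next item

    Set FilteredCopy = result
End Function
```

Usage would look like Set aList = FilteredCopy("a", aList). The trade-off is extra memory for the second list in exchange for a single pass with only appends.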

Option Explicit

Private Declare PtrSafe Function getFrequency Lib "kernel32" _
Alias "QueryPerformanceFrequency" (cyFrequency As Currency) As Long
Private Declare PtrSafe Function getTickCount Lib "kernel32" _
Alias "QueryPerformanceCounter" (cyTickCount As Currency) As Long

Public Sub TestingArrayList()
    Const TARGET_VALUE = "a"
    Dim aList As Object
    Set aList = CreateObject("System.Collections.ArrayList")

    Dim arr()
    arr = ThisWorkbook.Worksheets("Sheet1").Range("A1").CurrentRegion.Value '<== Reads in 100000 value

    Dim i As Long
    For i = 1 To UBound(arr, 1) '50000
        aList.Add arr(i, 2)
    Next i

    Debug.Print aList.Contains(TARGET_VALUE)

    Dim StartTime As Double

    StartTime = MicroTimer()

    For i = aList.Count - 1 To 0 Step -1
       If aList(i) = TARGET_VALUE Then aList.Remove aList(i)
    Next i

    Debug.Print "Removal from array list took: " & Round(MicroTimer - StartTime, 3) & " seconds"
    Debug.Print aList.Contains(TARGET_VALUE)

End Sub

Public Sub TestingHashTable()
    Const TARGET_VALUE = "a"
    Dim hTable As Object
    Set hTable = CreateObject("System.Collections.HashTable")

    Dim arr()
    arr = ThisWorkbook.Worksheets("Sheet1").Range("A1").CurrentRegion.Value '<== Reads in 100000 value

    Dim i As Long
    For i = 1 To UBound(arr, 1) '50000
        hTable.Add i, arr(i, 2)
    Next i

    Debug.Print hTable.ContainsValue(TARGET_VALUE)

    Dim StartTime As Double

    StartTime = MicroTimer()

    For i = hTable.Count To 1 Step -1
       If hTable(i) = TARGET_VALUE Then hTable.Remove i
    Next i

    Debug.Print "Removal from hash table took: " & Round(MicroTimer - StartTime, 3) & " seconds"
    Debug.Print hTable.ContainsValue(TARGET_VALUE)

End Sub

Public Function MicroTimer() As Double

    Dim cyTicks1 As Currency
    Static cyFrequency As Currency

    MicroTimer = 0

    If cyFrequency = 0 Then getFrequency cyFrequency

    getTickCount cyTicks1

    If cyFrequency Then MicroTimer = cyTicks1 / cyFrequency
End Function

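The above appears to be O(1).

Looking at the removal step alone (with the other factors stripped out), the results are less conclusive, but again, my coding may be a factor!

Revised code (other factors removed):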

Option Explicit

Public Sub TestingComparison()

    Const RUN_COUNT As Long = 4

    Dim hTable As Object
    Dim aList As Object
    Dim i As Long, j As Long, k As Long, rowCount As Long
    Dim results() As Double

    Set hTable = CreateObject("System.Collections.HashTable")
    Set aList = CreateObject("System.Collections.ArrayList")

    Dim testSizes()
    testSizes = Array(100, 1000, 10000, 100000)  ', 1000000)
    ReDim results(0 To RUN_COUNT * (UBound(testSizes) + 1) - 1, 0 To 4)

    Application.ScreenUpdating = False

    With ThisWorkbook.Worksheets("Sheet5")

        For i = LBound(testSizes) To UBound(testSizes)

            For k = 1 To RUN_COUNT

                For j = 1 To testSizes(i)
                    hTable.Add j, 1
                    aList.Add 1
                Next j

                Dim StartTime As Double, completionTime As Double

                StartTime = MicroTimer()

                For j = hTable.Count To 1 Step -1
                    hTable.Remove j
                Next j

                results(rowCount, 3) = Round(MicroTimer - StartTime, 3)
                results(rowCount, 0) = testSizes(i)
                results(rowCount, 1) = k

                StartTime = MicroTimer()

                For j = aList.Count - 1 To 0 Step -1
                    aList.Remove aList(j)
                Next j

                results(rowCount, 2) = Round(MicroTimer - StartTime, 3)

                hTable.Clear
                aList.Clear
                rowCount = rowCount + 1
            Next k

        Next i

        .Range("A2").Resize(UBound(results, 1) + 1, UBound(results, 2)) = results

    End With

    Application.ScreenUpdating = True
End Sub

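Discussion:

I take it you deliberately avoided using a SortedList?

Yes, that looks like a better approach. It assumes that enumerating the array list (For...Each) and indexing it (myList(i)) have the same algorithmic complexity; do you know whether they do? My gut feeling is that both are O(n^0), i.e. independent of the size of the array, with indexing carrying slightly more overhead (if ArrayLists work like Collections). As n->large the overhead becomes negligible. Why do you think I should use a SortedList? The docs don't show any built-in Filter method. What can I do with a SortedList that I can't do with an ArrayList?

Forget it, I think a sorted list doesn't allow duplicate keys.

Oh, this is interesting; removal is O(n). But I'm still not sure: are For Each item In myList: Next item and For i = myList.Count - 1 To 0 Step -1: Set item = myList(i): Next i both O(n), or is there some hidden complexity in enumeration that I don't know about?

According to your source, indexing is definitely O(1), so the latter approach is definitely O(n).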
