What is the most efficient way of selecting a random element from a long (and reasonably) sparse vector?

Posted 2023-02-21


【Posted】: 2017-08-21 06:04:59

【Question】:

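I have a long, fairly sparse boolean vector from which I want to repeatedly select random elements, and I would like to know the most efficient way of doing so.

The vector can contain up to about 100,000 elements, and roughly 1 in every 20 elements is "true" at any given time.

Selecting one of these elements will occasionally make other elements available for selection, so I can't just make one initial pass over the boolean vector to collect the indices of all available elements, then shuffle that vector and pop elements, because the list of available elements changes.

I have come up with a few ideas but can't decide which is best, so any insight would be much appreciated.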

Method 1:

given input boolean vector A
create boolean vector B    // to store previously selected elements
create int vector C        // to store currently available element indices 
while stopping condition not met:
    clear C
    for each element a in A:
        if a is "true":
            append index of a to C
    generate random integer i in [0, length of C)
    set element C[i] of A to "false"
    set element C[i] of B to "true"
    compute any new "true" values of A

Method 2:

given input boolean vector A
create boolean vector B    // to store previously selected elements
create int vector C        // to store currently available element indices 
for each element a in A:
    if a is "true":
        append index of a to C
shuffle C
while stopping condition not met:
    pop index i from back of C
    set A[i] to "false"
    set B[i] to "true"
    compute any new "true" values of A
    if new values in A computed:
        append index of new available element to C 
        shuffle C

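Because not every selection from A changes the set of available elements, I think Method 2 might be better than Method 1, except that I'm not sure how much work shuffling a long vector involves.

Method 3: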

given input boolean vector A
create boolean vector B    // to store previously selected elements
while stopping condition not met:
    generate random integer i in [0, length of A)
    if A[i] is "true":
        set A[i] to "false"
        set B[i] to "true"
        compute any new "true" values of A

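This last method seems a bit naive and simple, but I figure that if roughly 1 in every 20 elements is true (except for the final set of elements, when no more can be added to those available for selection), then it would take only about 20 attempts on average to find a selectable element. That might actually be less work than a full pass over the input vector or shuffling the vector of available indices (especially if the vector in question is fairly long). Finding the last few will be very hard, but I can keep track of how many have been selected, and once the number remaining drops below some level I can change how the final batch is selected.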
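The retry loop of Method 3 could be sketched in C++ roughly as follows. This is only an illustration under stated assumptions: the function name `pick_by_rejection`, the `max_tries` cap, and the choice of returning `a.size()` as a "not found" sentinel are all hypothetical and not part of the question. With a fraction p of the elements set, each probe succeeds with probability p, so about 1/p probes are expected (about 20 for 1 in 20).

```cpp
#include <cassert>
#include <cstddef>
#include <random>
#include <vector>

// Hypothetical helper: Method 3 as rejection sampling.
// Probes uniformly random positions of `a` until one holds "true".
// Returns that index, or a.size() if nothing was hit within max_tries
// probes (the caller would then switch to another selection method).
std::size_t pick_by_rejection(const std::vector<bool>& a,
                              std::mt19937& rng,
                              std::size_t max_tries) {
    assert(!a.empty());
    std::uniform_int_distribution<std::size_t> dist(0, a.size() - 1);
    for (std::size_t t = 0; t < max_tries; ++t) {
        const std::size_t i = dist(rng);
        if (a[i]) return i;  // found an available element
    }
    return a.size();         // sentinel: no available element found
}
```

The cap is what makes the "last few elements" case manageable: once probes start failing, the caller can fall back to a full scan or to an explicit list of available indices.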

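Does anyone know which is likely to be more efficient? If it makes any difference, the implementation will be in C++.

Thanks for your help

【Comments】:

If your vector is sparse, why not consider changing the representation? Perhaps attach a vector of the "true" indices and update it with every operation. That would make random selection almost O(1) while keeping the cost of the other operations.

【Answer 1】: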

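You can change the representation of the sparse vector to the following -

Primary Vector (the vector you have now)
True Vector (a list of all the "true" indices)

Your operations now become -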

Insert:   
    check if i in Primary Vector
    if false, set to true and add to True Vector

Delete:
    check if i in Primary Vector
    if true, set to false and remove from True Vector by swapping
    with last element and reducing size

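(For this, you need pointers from the Primary Vector to the True Vector, i.e. for each "true" index, its current position in the True Vector.)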

Random:
    Generate random index j from size of (True Vector)
    return True Vector[j]

All of your operations can then be done in O(1) time.
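The three operations above might look something like this in C++. It is a minimal sketch, not a definitive implementation: the class name `SampleableSet` and the member names `primary_`, `pos_` and `trues_` are made up for illustration, and the "pointers" from the Primary Vector to the True Vector are assumed to be stored as a plain array of positions.

```cpp
#include <cassert>
#include <cstddef>
#include <random>
#include <vector>

// Hypothetical class: Primary Vector + True Vector + back-pointers.
class SampleableSet {
public:
    explicit SampleableSet(std::size_t n) : primary_(n, false), pos_(n, 0) {}

    // Insert: O(1). Does nothing if i is already "true".
    void insert(std::size_t i) {
        if (primary_[i]) return;
        primary_[i] = true;
        pos_[i] = trues_.size();  // remember where i lives in trues_
        trues_.push_back(i);
    }

    // Delete: O(1) by swapping with the last element and shrinking.
    void erase(std::size_t i) {
        if (!primary_[i]) return;
        const std::size_t slot = pos_[i];
        const std::size_t last = trues_.back();
        trues_[slot] = last;      // move the last index into the hole
        pos_[last] = slot;        // update the moved index's back-pointer
        trues_.pop_back();
        primary_[i] = false;
    }

    // Random: O(1). Requires at least one "true" element.
    template <class Rng>
    std::size_t random(Rng& rng) const {
        assert(!trues_.empty());
        std::uniform_int_distribution<std::size_t> dist(0, trues_.size() - 1);
        return trues_[dist(rng)];
    }

    bool contains(std::size_t i) const { return primary_[i]; }
    std::size_t size() const { return trues_.size(); }

private:
    std::vector<bool> primary_;       // Primary Vector
    std::vector<std::size_t> pos_;    // pos_[i] = slot of i in trues_ (valid only if primary_[i])
    std::vector<std::size_t> trues_;  // True Vector: indices currently "true"
};
```

In the question's setting, one step would be `i = s.random(rng); s.erase(i);` followed by `s.insert(j)` for every element `j` that the selection newly makes available, so no rescan or reshuffle is needed.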

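【Comments】:

Ah, brilliant, I hadn't thought of swapping the element to be removed with the last one and reducing the size; I was worried that removing a random element from the True Vector would be expensive, but this makes far more sense than shuffling every time. Thanks a lot!

@guskenny83 Yes, but remember that when you swap you also have to update the pointer from the Primary Vector to the True Vector. It is still O(1), though.

【Answer 2】:

This sounds like a case for a Van Emde Boas tree.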

Space   O(M)
Search  O(log log M)
Insert  O(log log M)
Delete  O(log log M)

Annotate the aux arrays with member counts to make it easier to find a random element.
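A full Van Emde Boas tree is too long to show here, so the following sketch swaps in a much simpler structure to illustrate only the count-annotation idea: a two-level layout where each fixed-size block stores how many of its bits are set. It is explicitly not a Van Emde Boas tree and does not have its O(log log M) bounds; the class name `BlockCountedBits` and its methods are hypothetical. A random element is found by drawing a random rank k and using the counts to skip whole blocks.

```cpp
#include <cassert>
#include <cstddef>
#include <random>
#include <vector>

// Hypothetical two-level structure (NOT a van Emde Boas tree):
// per-block member counts let us jump to the block holding the k-th set bit.
class BlockCountedBits {
public:
    BlockCountedBits(std::size_t n, std::size_t block_size)
        : bits_(n, false), block_(block_size), total_(0) {
        assert(block_size > 0);
        counts_.assign((n + block_size - 1) / block_size, 0);
    }

    void set(std::size_t i, bool value) {
        if (bits_[i] == value) return;
        bits_[i] = value;
        if (value) { ++counts_[i / block_]; ++total_; }
        else       { --counts_[i / block_]; --total_; }
    }

    // Index of the k-th set bit (0-based); requires k < total().
    std::size_t select(std::size_t k) const {
        assert(k < total_);
        std::size_t b = 0;
        while (k >= counts_[b]) { k -= counts_[b]; ++b; }  // skip whole blocks
        for (std::size_t i = b * block_;; ++i) {            // scan inside block b
            if (bits_[i] && k-- == 0) return i;
        }
    }

    // Uniformly random set bit; requires total() > 0.
    template <class Rng>
    std::size_t random(Rng& rng) const {
        assert(total_ > 0);
        std::uniform_int_distribution<std::size_t> dist(0, total_ - 1);
        return select(dist(rng));
    }

    std::size_t total() const { return total_; }

private:
    std::vector<bool> bits_;
    std::size_t block_;                 // bits per block
    std::vector<std::size_t> counts_;   // set bits in each block
    std::size_t total_;                 // set bits overall
};
```

With a block size near the square root of the vector length, a selection touches on the order of that many counters and bits; a tree with counts at every level would apply the same rank-descent idea across more levels.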

【Comments】:
