排序向量查找的更快版本 (MATLAB)

Posted 2023-03-05

技术标签:

【中文标题】排序向量查找的更快版本 (MATLAB)【英文标题】：Faster version of find for sorted vectors (MATLAB) 【发布时间】：2013-12-08 15:08:48 【问题描述】：

我在 MATLAB 中有以下类型的代码：

indices = find([1 2 2 3 3 3 4 5 6 7 7] == 3)

这将返回 4,5,6 - 数组中元素的索引等于 3。现在。我的代码用很长的向量来做这种事情。向量总是排序的。

因此，我想要一个函数，它用 O(log n) 替换 find 的 O(n) 复杂度，但代价是必须对数组进行排序。

我知道 ismember，但据我所知，它不会返回所有项目的索引，只返回最后一个（我需要所有项目）。

出于可移植性的原因，我需要仅 MATLAB 的解决方案（没有编译的 mex 文件等）

【问题讨论】：

不确定在 MATLAB 中是否可行，但如果您实现自己的二进制搜索，应该可以。 【参考方案1】：

Daniel 的方法很聪明，他的 myFind2 函数肯定很快，但在边界条件附近或上下限产生的范围超出传入集合的情况下会出现错误/错误。

此外，正如他在对他的回答的评论中指出的那样，他的实施存在一些可以改进的低效率。我实现了他的代码的改进版本，它运行得更快，同时还能正确处理边界条件。此外，此代码包含更多 cmets 来解释正在发生的事情。我希望这可以帮助某人，就像丹尼尔的代码在这里帮助我一样！

function [lower_index,upper_index] = myFindDrGar(x,LowerBound,UpperBound)
% fast O(log2(N)) computation of the range of indices of x that satify the
% upper and lower bound values using the fact that the x vector is sorted
% from low to high values. Computation is done via a binary search.
%
% Input:
%
% x-            A vector of sorted values from low to high.       
%
% LowerBound-   Lower boundary on the values of x in the search
%
% UpperBound-   Upper boundary on the values of x in the search
%
% Output:
%
% lower_index-  The smallest index such that
%               LowerBound<=x(index)<=UpperBound
%
% upper_index-  The largest index such that
%               LowerBound<=x(index)<=UpperBound

if LowerBound>x(end) || UpperBound<x(1) || UpperBound<LowerBound
    % no indices satify bounding conditions
    lower_index = [];
    upper_index = [];
    return;
end

lower_index_a=1;
lower_index_b=length(x); % x(lower_index_b) will always satisfy lowerbound
upper_index_a=1;         % x(upper_index_a) will always satisfy upperbound
upper_index_b=length(x);

%
% The following loop increases _a and decreases _b until they differ 
% by at most 1. Because one of these index variables always satisfies the 
% appropriate bound, this means the loop will terminate with either 
% lower_index_a or lower_index_b having the minimum possible index that 
% satifies the lower bound, and either upper_index_a or upper_index_b 
% having the largest possible index that satisfies the upper bound. 
%
while (lower_index_a+1<lower_index_b) || (upper_index_a+1<upper_index_b)

    lw=floor((lower_index_a+lower_index_b)/2); % split the upper index

    if x(lw) >= LowerBound
        lower_index_b=lw; % decrease lower_index_b (whose x value remains \geq to lower bound)   
    else
        lower_index_a=lw; % increase lower_index_a (whose x value remains less than lower bound)
        if (lw>upper_index_a) && (lw<upper_index_b)
            upper_index_a=lw;% increase upper_index_a (whose x value remains less than lower bound and thus upper bound)
        end
    end

    up=ceil((upper_index_a+upper_index_b)/2);% split the lower index
    if x(up) <= UpperBound
        upper_index_a=up; % increase upper_index_a (whose x value remains \leq to upper bound) 
    else
        upper_index_b=up; % decrease upper_index_b
        if (up<lower_index_b) && (up>lower_index_a)
            lower_index_b=up;%decrease lower_index_b (whose x value remains greater than upper bound and thus lower bound)
        end
    end
end

if x(lower_index_a)>=LowerBound
    lower_index = lower_index_a;
else
    lower_index = lower_index_b;
end
if x(upper_index_b)<=UpperBound
    upper_index = upper_index_b;
else
    upper_index = upper_index_a;
end

请注意，Daniels searchFor 函数的改进版本现在很简单：

function [lower_index,upper_index] = mySearchForDrGar(x,value)

[lower_index,upper_index] = myFindDrGar(x,value,value);

多年后编辑：最后两个 if/else 块中出现错误，已修复。

【讨论】：

请注意，它不会返回包含索引（即 LowerBound[li, ui] = mySearchForDrGar(1:10, 5, 6) 验证我知道这是旧的，但对于其他读者，这个函数在我的测试中给出 [lower_index-1, upper_index+1]。 @JeanMercat 最后两个 if/else 块有一个小错误，我刚刚更正了。【参考方案2】：

我需要这样的功能。感谢@Daniel 的帖子！

我用它做了一些工作，因为我需要在同一个数组中找到多个索引。我想避免arrayfun（或类似）的开销或多次调用该函数。因此，您可以在range 中传递一堆值，您将获得数组中的索引。

function idx = findInSorted(x,range)
% Author Dídac Rodríguez Arbonès (May 2018)
% Based on Daniel Roeske's solution:
%   Daniel Roeske <danielroeske.de>
%   https://github.com/danielroeske/danielsmatlabtools/blob/master/matlab/data/findinsorted.m

    range = sort(range);
    idx = nan(size(range));
    for i=1:numel(range)
        idx(i) = aux(x, range(i));
    end
end

function b = aux(x, lim)
    a=1;
    b=numel(x);
    if lim<=x(1)
       b=a;
    end
    if lim>=x(end)
       a=b;
    end

    while (a+1<b)
        lw=(floor((a+b)/2));
        if (x(lw)<lim)
            a=lw;
        else
            b=lw;
        end
    end
end

我猜你可以改用parfor 或arrayfun。不过，我还没有测试过自己在多大的range 上会得到回报。

另一个可能的改进是使用先前找到的索引（如果range 已排序）来减少搜索空间。由于 O(log n) 运行时，我对它节省 CPU 的潜力持怀疑态度。

最终函数最终运行得稍微快了一点。我为此使用了@randomatlabuser 的框架：

N = 5e6;    % length of vector
p = 0.99;    % probability
KK = 100;    % number of instances
rntm1 = zeros(KK, 1);    % runtime with ismember
rntm2 = zeros(KK, 1);    % runtime with ismembc
rntm3 = zeros(KK, 1);    % runtime with Daniel's function
for kk = 1:KK
    x = cumsum(rand(N, 1) > p);
    searchfor = x(ceil(4*N/5));

    tic
    range = sort(searchfor);
    idx = nan(size(range));
    for i=1:numel(range)
        idx(i) = aux(x, range(i));
    end

    rntm1(kk) = toc;

    tic
    a=1;
    b=numel(x);
    c=1;
    d=numel(x);
    while (a+1<b||c+1<d)
        lw=(floor((a+b)/2));
        if (x(lw)<searchfor)
            a=lw;
        else
            b=lw;
        end
        lw=(floor((c+d)/2));
        if (x(lw)<=searchfor)
            c=lw;
        else
            d=lw;
        end
    end
    inds3 = (b:c)';
    rntm2(kk) = toc;

end

%%

function b = aux(x, lim)

a=1;
b=numel(x);
if lim<=x(1)
   b=a;
end
if lim>=x(end)
   a=b;
end

while (a+1<b)
    lw=(floor((a+b)/2));
    if (x(lw)<lim)
        a=lw;
    else
        b=lw;
    end
end

end

这不是一个很大的改进，但它很有帮助，因为我需要运行数千次搜索。

% Mean of running time
mean([rntm1 rntm2])
% 9.9624e-05  5.6303e-05

% Percentiles of running time
prctile([rntm1 rntm2], [0 25 50 75 100])
% 3.0435e-05  1.0524e-05
% 3.4133e-05  1.2231e-05
% 3.7262e-05  1.3369e-05
% 3.9111e-05  1.4507e-05
%  0.0027426   0.0020301

我希望这可以帮助某人。

编辑

如果完全匹配的可能性很大，在调用函数之前使用非常快速的内置ismember 是值得的：

[found, idx] = ismember(range, x);
idx(~found) = arrayfun(@(r) aux(x, r), range(~found));

【讨论】：

【参考方案3】：

这是一个使用二分搜索的快速实现。这个文件也可以在github上找到

function [b,c]=findInSorted(x,range)
%findInSorted fast binary search replacement for ismember(A,B) for the
%special case where the first input argument is sorted.
%   
%   [a,b] = findInSorted(x,s) returns the range which is equal to s. 
%   r=a:b and r=find(x == s) produce the same result   
%  
%   [a,b] = findInSorted(x,[from,to]) returns the range which is between from and to
%   r=a:b and r=find(x >= from & x <= to) return the same result
%
%   For any sorted list x you can replace
%   [lia] = ismember(x,from:to)
%   with
%   [a,b] = findInSorted(x,[from,to])
%   lia=a:b
%
%   Examples:
%
%       x  = 1:99
%       s  = 42
%       r1 = find(x == s)
%       [a,b] = myFind(x,s)
%       r2 = a:b
%       %r1 and r2 are equal
%
%   See also FIND, ISMEMBER.
%
% Author Daniel Roeske <danielroeske.de>

A=range(1);
B=range(end);
a=1;
b=numel(x);
c=1;
d=numel(x);
if A<=x(1)
   b=a;
end
if B>=x(end)
    c=d;
end
while (a+1<b)
    lw=(floor((a+b)/2));
    if (x(lw)<A)
        a=lw;
    else
        b=lw;
    end
end
while (c+1<d)
    lw=(floor((c+d)/2));
    if (x(lw)<=B)
        c=lw;
    else
        d=lw;
    end
end
end

【讨论】：

也许这段代码可以进一步改进。对于大多数迭代，a 等于 c，b 等于 d。不要以为它会变得更快！这是您的标准二进制搜索算法。最坏情况的复杂度是O(log n)，这是你能得到的最快的！顺便说一句，干得好:) @rayryeng：你说得对，O(log n) 是下界。我认为可以将速度提高约 30%，但我错了。我试过了，我取得的最好成绩是同样的速度。这个答案有一些错误 - 请参阅 DrGar 的答案，它修复了这些问题。 @carl：你能给我举一个结果不正确的例子吗？我发现的唯一问题是，如果请求的范围不包含在列表中，则下限和上限将反转。例如，myFind2([1:3 7:10], 4, 6) 将生成 4 和 3 而不是 3 和 4。【参考方案4】：

如果您查看第一个输出，ismember 将为您提供所有索引：

>> x = [1 2 2 3 3 3 4 5 6 7 7];
>> [tf,loc]=ismember(x,3);
>> inds = find(tf)

inds =

 4     5     6

您只需要使用正确的输入顺序。

请注意，ismember 使用的辅助函数可以直接调用：

% ISMEMBC  - S must be sorted - Returns logical vector indicating which 
% elements of A occur in S

tf = ismembc(x,3);
inds = find(tf);

使用ismembc 将节省计算时间，因为ismember 首先调用issorted，但这将省略检查。

请注意，较新版本的 matlab 有一个由 builtin('_ismemberoneoutput',a,b) 调用的内置函数，具有相同的功能。

由于ismember 等的上述应用有些倒退（在第二个参数中搜索x 的每个元素而不是相反），因此代码比必要的要慢得多。正如 OP 指出的那样，不幸的是，[~,loc]=ismember(3,x) 仅提供了 3 在x 中第一次出现的位置，而不是全部。但是，如果您有最新版本的 MATLAB（我认为是 R2012b+），您可以使用更多未记录的内置函数来获取第一个和最后一个索引！这些是ismembc2 和builtin('_ismemberfirst',searchfor,x)：

firstInd = builtin('_ismemberfirst',searchfor,x);  % find first occurrence
lastInd = ismembc2(searchfor,x);                   % find last occurrence
% lastInd = ismembc2(searchfor,x(firstInd:end))+firstInd-1; % slower
inds = firstInd:lastInd;

仍然比 Daniel R. 出色的 MATLAB 代码慢，但它（rntmX 添加到 randomatlabuser 的基准测试中）只是为了好玩：

mean([rntm1 rntm2 rntm3 rntmX])    
ans =
   0.559204323050486   0.263756852283128   0.000017989974213   0.000153682125682

以下是ismember.m 中这些函数的一些文档：

% ISMEMBC2 - S must be sorted - Returns a vector of the locations of
% the elements of A occurring in S.  If multiple instances occur,
% the last occurrence is returned

% ISMEMBERFIRST(A,B) - B must be sorted - Returns a vector of the
% locations of the elements of A occurring in B.  If multiple
% instances occur, the first occurence is returned.

实际上引用了一个 ISMEMBERLAST 内置函数，但它似乎不存在（还没有？）。

【讨论】：

@chappjc: rntmX是_ismemberfirst的执行时间？还不错 - 可能是内部检查减慢了速度？ @randomatlabuser 是的，rntmX 是执行了包括_ismemberfirst 和ismembc2 在内的三行代码。【参考方案5】：

这不是答案 - 我只是比较 chappjc 和 Daniel R 建议的三种解决方案的运行时间。

N = 5e7;    % length of vector
p = 0.99;    % probability
KK = 100;    % number of instances
rntm1 = zeros(KK, 1);    % runtime with ismember
rntm2 = zeros(KK, 1);    % runtime with ismembc
rntm3 = zeros(KK, 1);    % runtime with Daniel's function
for kk = 1:KK
    x = cumsum(rand(N, 1) > p);
    searchfor = x(ceil(4*N/5));

    tic
    [tf,loc]=ismember(x, searchfor);
    inds1 = find(tf);
    rntm1(kk) = toc;

    tic
    tf = ismembc(x, searchfor);
    inds2 = find(tf);
    rntm2(kk) = toc;

    tic
    a=1;
    b=numel(x);
    c=1;
    d=numel(x);
    while (a+1<b||c+1<d)
        lw=(floor((a+b)/2));
        if (x(lw)<searchfor)
            a=lw;
        else
            b=lw;
        end
        lw=(floor((c+d)/2));
        if (x(lw)<=searchfor)
            c=lw;
        else
            d=lw;
        end
    end
    inds3 = (b:c)';
    rntm3(kk) = toc;

end

Daniel 的二分查找非常快。

% Mean of running time
mean([rntm1 rntm2 rntm3])
% 0.631132275892504   0.295233981447746   0.000400786666188

% Percentiles of running time
prctile([rntm1 rntm2 rntm3], [0 25 50 75 100])
% 0.410663611685559   0.175298784336465   0.000012828868032
% 0.429120717937665   0.185935198821797   0.000014539383770
% 0.582281366154709   0.268931132925888   0.000019243302048
% 0.775917520641649   0.385297304740352   0.000026940622867
% 1.063753914942895   0.592429428396956   0.037773746662356

【讨论】：

非常有趣！使用 MATLAB 解决方案的 JIT 编译器和 C 函数的外部函数调用开销，我们得到了令人惊讶的结果。给丹尼尔的道具。与其他builtin 相比如何？不幸的是我没有内置的。 Daniel 的解决方案最多需要 ceil(log2(N)) 循环：当 N = 5e7 时，循环最多只有 26 个。我猜chappjc's answer 的意思是rntm2 中，find(tf) 调用几乎占用了所有时间，而ismembc(..) 调用本身非常快。刚刚意识到我的函数是如此之快，以至于您注意到函数调用和内联我的代码之间的区别。

以上是关于排序向量查找的更快版本 (MATLAB)的主要内容，如果未能解决你的问题，请参考以下文章