Finding the smallest n values that are at least x apart in a list

Posted 2023-03-11

Asked: 2021-06-24 07:18:03

Question:

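I am trying to find the smallest n values in a list whose positions are at least x apart, taking duplicates into account. For example, the smallest 5 values that are at least 2 positions away from each other.

Simple example: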

values = [-9995, -82, -659, -1006, -2009, 2062, -10107, -12, 13]
result: [-10107, -9995, -2009, -659, 13]

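More complex example:

values = [-9995, -83, -82, -82, -1006, -2009, 18, 2062, -659 ,-9995, -9995]

For example, in the list above:

-9995 is the smallest value.
-9995 occurs again, and is at least 2 away from the first one.
The remaining -9995 is ignored because it is only 1 away from the previous one.
-2009 is the third smallest value.
-1006 is not considered because it is only 1 away from the previous value.
So we take the next smallest value, -659, because it is at least 2 away from the previous values (assuming we take the first and last -9995 and ignore the second to last).
-83 is not considered because it is only 1 away from -9995.
So we take -82.
We have reached 5 numbers, so we stop.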

result: [-9995, -9995, -2009, -659, -82]

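The lists I am working with are about 1,000,000 elements long, and I have about 1,000 of them. I generate these lists from a pandas DataFrame (by iterating over a groupby), so a numpy/pandas way of optimizing this computation would be very helpful.

My attempt so far produces the right result as long as there are no duplicates: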


def smallest_values(list_of_numbers: list, n_many: int, x_apart: int):

    sorted_values = sorted(list_of_numbers)
    small_val, small_val_loc = [], []

    for val in sorted_values:
        if len(small_val) < n_many:  # keep collecting until n_many values are found
            # index() always returns the first occurrence, so later duplicates are never seen
            ind = list_of_numbers.index(val)
            within_x = [i for i in range(ind-(x_apart-1), ind+x_apart)]
            if not any(i in small_val_loc for i in within_x):
                small_val_loc.append(ind)
                small_val.append(val)

    return small_val

values_simple = [-9995, -82, -659, -1006, -2009, 2062, -10107, -12, 13]
values_complex = [-9995, -83, -82, -82, -1006, -2009, 18, 2062, -659 ,-9995, -9995]
d = 2
n = 5
smallest_values(values_simple, n, d) # [-10107, -9995, -2009, -659, 13] CORRECT
smallest_values(values_complex, n, d) # [-9995, -2009, -659, -82] INCORRECT

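Comments:

I'm not sure it is clear what you are trying to optimize. Consider the list [1, 0, 1, 300, 500, 400] with n=3. If you start from zero you get [0, 300, 400], but if you start from one you get [1, 1, 400]. The first has the smallest number, but the second has the smallest total. Which is the correct answer?

[0, 300, 400] would be the correct result for my problem. I am not looking for the smallest sum; I want the smallest n numbers that are at least x apart. Thanks. The algorithm would start from the smallest number and iteratively add the next smallest number, subject to the constraint that it is at least x away in the list from the numbers added before.

So you are saying you always take the next smallest number, even if that choice forces you to take larger numbers later? So, adding -1 to the above and making n=4, for [-1, 100, 1, 0, 1, 300, 500, 400] the correct answer is [-1, 0, 300, 400] and not [-1, 1, 1, 400]?

Yes, exactly.

What is the logic for taking the last -9995 value in the more complex example rather than the second to last (which is the same value, but earlier in the list)?

Answer 1: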

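This is a complicated job. We want to build an index list whose first entry is the index of the smallest value in list_of_numbers, and in which each following entry points to the next larger value in list_of_numbers. Working on that kind of list is easier and more efficient. We can build it like this: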

index_map=dict()
for i in range(len(list_of_numbers)):
    value=list_of_numbers[i]
    if value in index_map:
        index_map[value]+=[i]
    else:
        index_map[value]=[i]
sorted_values = sorted(index_map)

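Now we have a dictionary that maps every unique value in list_of_numbers to all the indices where it occurs. We also have a list of the unique values ordered from smallest to largest. We can now build our index_list: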

index_list=[]
for value in sorted_values:
    index_list+=index_map[value]

del index_map, sorted_values

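All that is left is to iterate over index_list from left to right and find the first combination of indices with the proper gaps. This is easier and faster to compute.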
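The same index_list can also be produced in a single step. The sketch below is an assumption on my part rather than part of the answer: it relies on Python's sorted() being stable, so equal values keep their ascending position order, exactly as in the dictionary-based construction. The helper name build_index_list is hypothetical.

```python
def build_index_list(list_of_numbers):
    """Return positions ordered by value; ties keep their original order."""
    # sorted() is stable, so positions holding equal values stay in
    # ascending order, matching the dictionary-based construction.
    return sorted(range(len(list_of_numbers)), key=list_of_numbers.__getitem__)


values_complex = [-9995, -83, -82, -82, -1006, -2009, 18, 2062, -659, -9995, -9995]
index_list = build_index_list(values_complex)
print(index_list)  # [0, 9, 10, 5, 4, 8, 1, 2, 3, 6, 7]
```

This costs one O(n log n) sort and avoids the intermediate dictionary.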

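Unfortunately, the time complexity cannot be less than O(n), because you need to inspect every entry in list_of_numbers to find the smallest ones.

I did this with a recursive function, but you can definitely optimize it and make the algorithm smarter: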

def gap_selecter(numlist, n_many, gap):

    if numlist is None:        # Fast exit if recursion fails
        return None

    x = numlist[0]
    speudolist = numlist[1:]

    if n_many == 1:            # base case
        return [x]

    else:
        for i in range(len(speudolist)):

            # recursive step occurs here; note that the gap is only
            # checked against x, the index chosen immediately before
            if abs(x - speudolist[i]) >= gap:

                recursion_list = gap_selecter(speudolist[i:], n_many-1, gap)

                if recursion_list is not None:
                    return [x] + recursion_list

    return None                # if we find no possible list we return None

Here is everything put together.

def smallest_values(list_of_numbers: list, n_many: int, x_apart: int):

    # map every unique value to all the indices where it occurs
    index_map = dict()
    for i in range(len(list_of_numbers)):
        value = list_of_numbers[i]
        if value in index_map:
            index_map[value] += [i]
        else:
            index_map[value] = [i]
    sorted_values = sorted(index_map)

    # indices ordered by the value they point to, smallest first
    index_list = []
    for value in sorted_values:
        index_list += index_map[value]

    del index_map, sorted_values

    final_indices = gap_selecter(index_list, n_many, x_apart)
    if final_indices is None:
        return None

    final_numbers = []
    for i in final_indices:
        final_numbers += [list_of_numbers[i]]

    return final_numbers

values_simple = [-9995, -82, -659, -1006, -2009, 2062, -10107, -12, 13]
values_complex = [-9995, -83, -82, -82, -1006, -2009, 18, 2062, -659 ,-9995, -9995]
d = 2
n = 5

test_simple = smallest_values(values_simple, n, d)       # [-10107, -9995, -2009, -659, -12]
test_complex = smallest_values(values_complex, n, d)     # [-9995, -9995, -2009, -659, -83]
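A result like the ones above can be checked against the spacing constraint with a small helper. This is only a sketch: respects_gap is a hypothetical name, and recovering positions with list.index() assumes the list has no duplicates, which holds for values_simple but not for values_complex.

```python
from itertools import combinations


def respects_gap(positions, gap):
    """True if every pair of chosen positions is at least `gap` apart."""
    return all(abs(a - b) >= gap for a, b in combinations(positions, 2))


values_simple = [-9995, -82, -659, -1006, -2009, 2062, -10107, -12, 13]
picked = [-10107, -9995, -2009, -659, -12]

# values_simple has no duplicates, so list.index() recovers each position.
positions = [values_simple.index(v) for v in picked]
print(positions)                   # [6, 0, 4, 2, 7]
print(respects_gap(positions, 2))  # False: positions 6 and 7 are adjacent
```

Checking every pair matters here, because a test against only the most recently chosen position would not notice that -12 sits right next to -10107.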

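Comments:

test_complex fails, doesn't it? It should not contain -83 (if I understand the question correctly).

Yes, sorry. In the simple example it should be [-10107, -9995, -2009, -659, 13], since -12 is next to -10107, and as perl mentioned, -83 should not be chosen.

Answer 2: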

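// Edit: Ah, now I get it. The key phrase is (assuming we take the first and last -9995 and ignore the second to last)

The big problem is that you cannot just pick any of the duplicate values (in your example, the second to last -9995 in the list). Instead, you want to choose among the duplicates so that the last element (or the sum of the result list?) is minimal, right?

To me this sounds like a constrained optimization problem. I am not even sure it yields the same result regardless of what you define as "best" (the sum, the last element, or something else...).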

Answer 3:

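The key issue here is breaking ties between duplicate values, such as -9995 in the example. We basically need to try picking them in different orders and check which order yields the sequence with the lower next value (or, if the next value is the same, the one after that, and so on).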
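To see why the tie matters, consider a naive greedy pass that visits positions in ascending value order and breaks ties by position. This is a sketch for illustration only, not part of the answer, and greedy_first_come is a hypothetical name.

```python
def greedy_first_come(values, n, d):
    """Naive greedy: visit positions by ascending value, ties by position."""
    order = sorted(range(len(values)), key=values.__getitem__)
    chosen = []  # positions accepted so far
    for pos in order:
        if len(chosen) == n:
            break
        # accept a position only if it is at least d away from every earlier pick
        if all(abs(pos - c) >= d for c in chosen):
            chosen.append(pos)
    return [values[p] for p in chosen]


values_complex = [-9995, -83, -82, -82, -1006, -2009, 18, 2062, -659, -9995, -9995]
print(greedy_first_come(values_complex, 5, 2))
# [-9995, -9995, -2009, -82, 2062]
```

The pass accepts the -9995 at position 9 before it looks at position 10, which blocks -659 at position 8. Choosing positions 0 and 10 instead gives the expected [-9995, -9995, -2009, -659, -82].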

One way to do this is a recursive search:

from collections import defaultdict

# Find the next smallest value and return all locations of that value
# that can still be used (i.e. not within d of a previously used location).
def get_next(vs, vd, d, skip):
    for v in vs:
        os = []
        for l in vd[v]:
            if not any(x - d < l < x + d for x in skip):
                os.append((l, v))
        if os:
            return os
    return []  # no value can be placed any more

# Recursive search: branch on every usable location of the next smallest
# value and keep the lexicographically smallest result.
def r(vs, vd, n, d, skip, out):
    if len(out) >= n:
        return out

    candidates = get_next(vs, vd, d, skip)
    if not candidates:
        return out  # fewer than n values satisfy the spacing
    os = [r(vs, vd, n, d, skip + [l], out + [v]) for (l, v) in candidates]
    # Pad short results with infinity so a complete sequence always wins.
    return min(os, key=lambda o: o + [float('inf')] * (n - len(o)))

# main func
def smallest_values(values, n, d):
    vd = defaultdict(list)
    for l, v in enumerate(values):
        vd[v].append(l)
    vs = sorted(vd.keys())
    return r(vs, vd, n, d, [], [])

Testing on the provided examples:

values_simple = [-9995, -82, -659, -1006, -2009, 2062, -10107, -12, 13]
values_complex = [-9995, -83, -82, -82, -1006, -2009, 18, 2062, -659 ,-9995, -9995]

print('simple:  ', smallest_values(values_simple, 5, 2))
print('complex: ', smallest_values(values_complex, 5, 2))

Output:

simple:   [-10107, -9995, -2009, -659, 13]
complex:  [-9995, -9995, -2009, -659, -82]

Timing test on a list of 1,000,000 values (800 ms, so 1,000 lists would take about 800 s, or roughly 13 minutes, single-threaded):

%%time
import numpy as np
vs = np.random.randint(0, 1000000, 1000000)
smallest_values(vs, 5, 2)

Output:

CPU times: user 780 ms, sys: 20.8 ms, total: 800 ms
Wall time: 800 ms
[3, 5, 6, 7, 8]

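P.S. This finds the sequence that has the lowest values earliest in the sequence. For example, it prefers [1, 2, 100] over [1, 3, 4] (both have 1 in position 1, but the first sequence has the smaller value, 2, in position 2). I guess the logic could be worded as: is there a selection of the previously chosen values that allows the next smallest value to be chosen?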
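This preference follows from how Python compares lists: min() on a list of lists compares them element by element from the left. A minimal illustration, with a sum-based choice shown only for contrast:

```python
candidates = [[1, 3, 4], [1, 2, 100]]

# Lists compare lexicographically: the first differing element decides.
best = min(candidates)
print(best)  # [1, 2, 100]

# Minimizing the total instead would pick the other sequence.
smallest_sum = min(candidates, key=sum)
print(smallest_sum)  # [1, 3, 4]
```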

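Comments:

Your approach is clever, thank you very much. Your understanding of the problem statement is correct; [1,2,100] should be preferred.

Cool, glad it helped! It turned out to be a really interesting problem :)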
