Finding if there are n data points in a row that are less than a certain number

Posted 2023-02-25

【Title】: Finding if there are n data points in a row that are less than a certain number 【Posted】: 2021-07-15 14:46:30 【Question】:

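I am working with a spectrum in Python, and I have fitted a line to that spectrum. I want code that can detect whether there are 10 data points in a row on the spectrum that are less than the fitted line. Does anyone know a simple and quick way to do this?

I currently have something like this: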

count = 0
for i in range(lowerbound, upperbound):
    if spectrum[i] < fittedline[i]:
        count += 1
    if count > 15:
        pass  # do whatever

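If I change the first if statement to:

if spectrum[i] < fittedline[i] and spectrum[i+1] < fittedline[i+1]:  # and so on

I am sure the algorithm would work, but is there a smarter way to automate this if I want the user to input how many data points in a row must be less than the fitted line?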
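The chained comparison described above can be generalized to any run length by testing a whole window with `all()`. This is only a minimal sketch, assuming `spectrum` and `fittedline` are equal-length sequences and `n` is a positive integer. The function name `has_run_below` and its parameters are hypothetical and not part of the original post.

```python
def has_run_below(spectrum, fittedline, n, lowerbound=0, upperbound=None):
    """Return True if n consecutive points in [lowerbound, upperbound) lie below the fit."""
    if upperbound is None:
        upperbound = len(spectrum)
    # Try every window of length n that fits inside the bounds.
    for start in range(lowerbound, upperbound - n + 1):
        if all(spectrum[j] < fittedline[j] for j in range(start, start + n)):
            return True
    return False


spectrum = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
fittedline = [1, 2, 10, 10, 10, 10, 10, 8, 9, 10]
print(has_run_below(spectrum, fittedline, 5))
```

This re-checks overlapping windows, so it does roughly `n` comparisons per starting index. A single pass with a running counter avoids that repeated work.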

【Comments】:

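I'm sure plenty of people know a quick and easy way to do this. However, SO is not a free online coding service, and "implement this feature for me" is off-topic for this site. Please take the tour, read what's on-topic here, How to Ask, and the question checklist, and provide a minimal reproducible example. You must make an honest attempt and then ask a specific question about your algorithm or technique.

Hey Pranav, I'm not asking anyone to write this function specifically for me. I made an honest attempt, but I'm struggling to work out the "in a row" part, and I'm asking here whether anyone knows a smart way to do it.

Share the code you are struggling with. Ask a specific question related to that code. People will use what they can get from your code to write answers that make sense to you. If your code is completely useless, people will tell you how to proceed. Including your code in the question lets people see which variables you are using and what your data looks like, and gives them a starting point for writing their answers.

【Answer 1】: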

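Your attempt is very close to working! For consecutive points, all you need to do is reset the count whenever a point does not satisfy your condition.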

num_points = int(input("How many points must be less than the fitted line? "))

count = 0
for i in range(lowerbound, upperbound):
    if spectrum[i] < fittedline[i]:
        count += 1
    else: # If the current point is NOT below the threshold, reset the count
        count = 0

    if count >= num_points:
        print(f"{count} consecutive points found at location {i-count+1}-{i}!")

Let's test it:

lowerbound = 0
upperbound = 10

num_points = 5

spectrum = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
fittedline = [1, 2, 10, 10, 10, 10, 10, 8, 9, 10]

Running the code with these values gives:

5 consecutive points found at location 2-6!
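The reset-the-counter loop above can also be wrapped in a function that returns the location of the first qualifying run instead of printing it. This is a sketch only. The name `find_first_run` and its default bounds are assumptions for illustration, not part of the original answer.

```python
def find_first_run(spectrum, fittedline, num_points, lowerbound=0, upperbound=None):
    """Return (start, end) of the first run of num_points points below the fit, or None."""
    if upperbound is None:
        upperbound = len(spectrum)
    count = 0
    for i in range(lowerbound, upperbound):
        if spectrum[i] < fittedline[i]:
            count += 1
            if count >= num_points:
                # The run covers indices i-count+1 through i, inclusive.
                return i - count + 1, i
        else:
            # A point at or above the fit breaks the run.
            count = 0
    return None


spectrum = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
fittedline = [1, 2, 10, 10, 10, 10, 10, 8, 9, 10]
print(find_first_run(spectrum, fittedline, 5))  # (2, 6)
```

Returning as soon as the run is long enough avoids reporting the same run again for every further point below the line.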

【Comments】:

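I don't think using loops and ifs this way is best practice.

@gilgorio Please elaborate. How would you do it? I built on the OP's code. IMO slicing and zipping spectrum and fittedline involves two slice operations that take up new memory, so I think this is an acceptable way to do what the OP wants.

【Answer 2】: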

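My suggestion is to research and use existing libraries before developing ad hoc functionality.

In this case, some super smart people developed the numerical Python library numpy. This library is widely used in scientific projects and provides a large number of useful functions that are already tested and optimized.

Your need can be met with the following line: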

number_of_points = (np.array(spectrum) < np.array(fittedline)).sum()

But let's go step by step:

spectrum = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
fittedline = [1, 2, 10, 10, 10, 10, 10, 8, 9, 10]

# Import numerical python module
import numpy as np

# Convert your lists to numpy arrays
spectrum_array = np.array(spectrum)
fittedline_array = np.array(fittedline)

# Subtract the fitted line from the spectrum
difference = spectrum_array - fittedline_array
#>>> array([ 0,  0, -7, -6, -5, -4, -3,  0,  0,  0])

# Identify points where condition is met
condition_check_array = difference < 0.0
# >>> array([False, False,  True,  True,  True,  True,  True, False, False, False])

# Get the number of points where condition is met
number_of_points = condition_check_array.sum()
# >>> 5

# Get index of points where condition is met
index_of_points = np.where(difference < 0)
# >>> (array([2, 3, 4, 5, 6], dtype=int64),)

print(f"{number_of_points} points found at location {index_of_points[0][0]}-{index_of_points[0][-1]}!")

# Now same functionality in a simple function
def get_point_count(spectrum, fittedline):  
    return (np.array(spectrum) < np.array(fittedline)).sum()

get_point_count(spectrum, fittedline)

Now let's consider that your spectrum has not 10 points but 10M. Code efficiency is a key issue to consider, and numpy can help there:

import time

number_of_samples = 1000000
spectrum = [1] * number_of_samples
# >>> [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...]
fittedline = [0] * number_of_samples
fittedline[2:7] =[2] * 5
# >>> [0, 0, 2, 2, 2, 2, 2, 0, 0, 0, ...]

# With numpy
start_time = time.time()
number_of_points = (np.array(spectrum) < np.array(fittedline)).sum()
numpy_time = time.time() - start_time
print("--- %s seconds ---" % (numpy_time))


# With ad hoc loop and ifs
start_time = time.time()
count=0
for i in range(0, len(spectrum)):
    if spectrum[i] < fittedline[i]:
        count += 1
    else: # If the current point is NOT below the threshold, reset the count
        count = 0
adhoc_time = time.time() - start_time
print("--- %s seconds ---" % (adhoc_time))

print("Ad hoc is {:3.1f}% slower".format(100 * (adhoc_time / numpy_time - 1)))

>>>--- 0.20999646186828613 seconds ---
>>>--- 0.28800177574157715 seconds ---
>>>Ad hoc is 37.1% slower
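Note that the timing above includes the list-to-array conversion in the numpy measurement, and the loop resets its counter, so the two snippets do not compute the same quantity. A fairer comparison could look roughly like the sketch below. It assumes the data already lives in numpy arrays and that both versions simply count the points below the fit. The helper names `count_with_numpy` and `count_with_loop` and the smaller sample size are hypothetical choices, and timings will vary by machine.

```python
import timeit

import numpy as np

number_of_samples = 100_000
spectrum = [1] * number_of_samples
fittedline = [0] * number_of_samples
fittedline[2:7] = [2] * 5

# Convert once, outside the timed region.
spectrum_array = np.array(spectrum)
fittedline_array = np.array(fittedline)


def count_with_numpy():
    # Vectorized comparison, then count the True entries.
    return int((spectrum_array < fittedline_array).sum())


def count_with_loop():
    # Pure-Python count of the same condition.
    return sum(1 for s, f in zip(spectrum, fittedline) if s < f)


# Take the best of 3 repeats, each running the function 10 times.
numpy_time = min(timeit.repeat(count_with_numpy, number=10, repeat=3))
loop_time = min(timeit.repeat(count_with_loop, number=10, repeat=3))
print(f"numpy: {numpy_time:.4f} s, loop: {loop_time:.4f} s for 10 calls each")
```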

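【Comments】:

1. The OP gave no indication that they are using numpy. If they are not already using numpy and they have only a few points, using numpy is overkill. 2. The OP is looking for consecutive points. Your algorithm does not do that. Consider the input spectrum = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]; fittedline = [1, 2, 10, 10, 0, 0, 10, 8, 9, 10]. Your code outputs 3 when only two consecutive points satisfy the condition.

I don't see a "consecutive" requirement in qwertie's question. Regarding numpy, in my opinion learning how to use a widely supported, state-of-the-art library is not overkill but a way to improve as a developer.

They even bolded that part in their question: ...in a row that are less than... Of course, learning how to use it is perfectly valid. The decision whether to use a library does not depend only on "I want to learn it".

You are right on both points; I understood the word "row" to mean a multidimensional spectrum with different "rows". Let me retract my downvote. I still think numpy is the way to go for this kind of operation.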
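For readers who do want a numpy version that respects the "in a row" requirement raised in these comments, one possible sketch is to locate run boundaries with `np.diff` on the boolean mask. This is an illustration under assumptions, not the answer author's method. The function name `longest_run_below` is hypothetical.

```python
import numpy as np


def longest_run_below(spectrum, fittedline):
    """Return the length of the longest run of consecutive points below the fit."""
    below = np.asarray(spectrum) < np.asarray(fittedline)
    # Pad with False on both sides so every run has a clear start and end.
    padded = np.concatenate(([False], below, [False]))
    # +1 marks the index where a run starts, -1 marks one past where it ends.
    edges = np.diff(padded.astype(int))
    starts = np.flatnonzero(edges == 1)
    ends = np.flatnonzero(edges == -1)
    if starts.size == 0:
        return 0
    return int((ends - starts).max())


spectrum = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
fittedline = [1, 2, 10, 10, 0, 0, 10, 8, 9, 10]
print(longest_run_below(spectrum, fittedline))  # 2
```

Checking for `n` points in a row then reduces to `longest_run_below(spectrum, fittedline) >= n`.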
