Find and count all occurrences and position of numbers in a range in a list

Posted 2023-03-12


【Title】: Find and count all occurrences and position of numbers in a range in a list 【Posted】: 2020-11-19 16:17:24 【Question】:

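I have a list of 6-number sets, and I want to find how many times each number occurs at each index position. I don't know in advance which numbers will appear, only that they fall in the range 0-99.

Example list: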

data = [['22', '45', '6', '72', '1', '65'], ['2', '65', '67', '23', '98', '1'], ['13', '45', '98', '4', '12', '65']]

Eventually I will put the resulting counts into a pandas DataFrame, like this:

num numofoccurances position numoftimesinposition
01         02            04            01
01         02            05            01
02         01            00            01
04         02            03            01
06         01            02            01
12         01            04            01
13         01            00            01
and so on...

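Because num is repeated each time it appears at a different index position, the real data will look slightly different, but hopefully this helps you understand what I'm looking for.

Here is what I have started with so far: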

data = json.load(f)
numbers = []
contains = []

'''
This section is simply taking the data from the json file and putting it all into a list of lists containing the 6 elements I need in each list
'''
for i in data['data']:
    item = [i[9], i[10]]
#   print(item)
    item = [words for segments in item for words in segments.split()]
    numbers.append(item)

'''
This is my attempt to count to number of occurrences for each number in the range then add it to a list.
'''
x = range(1,99)
for i in numbers:
    if x in i and not contains:
        contains.append(x)
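
A note on why the attempt above never collects anything, as a minimal sketch that assumes `numbers` ends up shaped like the sample `data` (lists of numeric strings). The names `range_found`, `valid` and `in_range` are illustrative only and do not appear in the original code.

```python
# Sample data from the question; the numbers are stored as strings.
numbers = [['22', '45', '6', '72', '1', '65'],
           ['2', '65', '67', '23', '98', '1'],
           ['13', '45', '98', '4', '12', '65']]

x = range(1, 99)

# `x in row` asks whether the range object itself is an element of the row.
# The rows only hold strings, so the test is False for every row.
range_found = [x in row for row in numbers]
print(range_found)

# Testing each element instead: convert it to int, then check membership.
# range(0, 100) covers 0-99 inclusive; range(1, 99) would miss 0 and 99.
valid = range(0, 100)
in_range = [[int(n) in valid for n in row] for row in numbers]
print(in_range)
```

Since the membership test has to be applied per element anyway, counting the `(number, position)` pairs directly is usually simpler than looping over the whole 0-99 range.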

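【Comments】:

Can you describe what each of these 3 columns is? The first value in numofoccurrances is 22 .. is that just a number taken directly from the data? What is num.. is it 0-99? Or is it the index into the length-6 lists?

@AkshaySehgal num is the number that occurs in the list. numofoccurrences is the total number of times that number occurs. position is the index position at which the number occurs. numoftimesinpostion is the number of times the number occurs at that specific index position.

Could you edit the table so that it matches the data provided?

@Onyambu The DataFrame example is arbitrary. It is only meant to show what I will do with the data once I have it. I can create the DataFrame; I really just need a way to get the data for it.

What is the difference between numoccurance and numtimesinposition?

【Answer 1】: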

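import pandas as pd

# Pair every number with the index position it occupies in its row
num_pos = [(num, pos) for row in data for pos, num in enumerate(row)]
df = pd.DataFrame(num_pos, columns=['number', 'position']).assign(numoftimesinposition=1)
# Convert the string numbers to int, then count rows per (number, position) pair
df = df.astype(int).groupby(['number', 'position']).count().reset_index()

# Sum the per-position counts to get each number's total, then merge it back
df1 = (df.groupby('number').numoftimesinposition.sum().reset_index()
         .rename(columns={'numoftimesinposition': 'numofoccurences'})
         .merge(df, on='number'))

print(df1)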
    number  numofoccurences  position  numoftimesinposition
0        1                2         4                     1
1        1                2         5                     1
4        2                1         0                     1
7        4                1         3                     1
9        6                1         2                     1
2       12                1         4                     1
3       13                1         0                     1
5       22                1         0                     1
6       23                1         3                     1
8       45                2         1                     2
10      65                3         1                     1
11      65                3         5                     2
12      67                1         2                     1
13      72                1         3                     1
14      98                2         2                     1
15      98                2         4                     1

If the code above feels slow, use Counter from collections instead:

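import pandas as pd
from collections import Counter

# Pair every number (as int) with the index position it occupies in its row
num_pos = [(int(num), pos) for row in data for pos, num in enumerate(row)]

# One row per (number, position) pair together with its count
count_data = [(num, pos, occurence) for (num, pos), occurence in Counter(num_pos).items()]

df = pd.DataFrame(count_data, columns=['num', 'pos', 'occurence']).sort_values(by='num')

# Total per number, counted over the raw pairs. Counting the rows of df
# would only give the number of distinct positions (1 instead of 2 for 45).
totals = Counter(num for num, _ in num_pos)
df['total_occurence'] = [totals[num] for num in df.num]
print(df)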
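
The Counter version is shown without output, so here is a pandas-free cross-check of the counts. It is a sketch that assumes the sample `data` from the question; `pos_counts`, `totals` and `rows` are names chosen for illustration, and the tuple layout follows the column order the question asks for (num, total occurrences, position, times in position).

```python
from collections import Counter

data = [['22', '45', '6', '72', '1', '65'],
        ['2', '65', '67', '23', '98', '1'],
        ['13', '45', '98', '4', '12', '65']]

# Count each (number, position) pair across all rows.
pos_counts = Counter((int(num), pos) for row in data for pos, num in enumerate(row))

# Count each number regardless of the position it appears in.
totals = Counter(int(num) for row in data for num in row)

# One tuple per pair: (num, total occurrences, position, times in position),
# sorted by number and then by position.
rows = [(num, totals[num], pos, count)
        for (num, pos), count in sorted(pos_counts.items())]

for row in rows:
    print(row)
```

The `rows` list can be passed straight to `pd.DataFrame(rows, columns=[...])` if a DataFrame is needed at the end.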

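【Discussion】:

So, when I run this, I only get 3 columns in the DataFrame: number, position and numoftimesinposition. Is there another step I need to take?

I got it working, but I like the edit too. In the earlier version the DataFrame was not updated with the additional column. I just added df= to the df.groupby part and it works. I will also try Counter and see how it performs.

【Answer 2】: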

This should solve your query. On larger data it should also be faster than the very slow groupby (you need 2 of them) and the other pandas operations:

import numpy as np
import pandas as pd

# get the list of lists into a 2D numpy array of ints
dd = np.array(data).astype(int)

# get vocab of all unique numbers
vocab = np.unique(dd.flatten())

# loop through vocab and sum the occurrences at each index position
df = pd.DataFrame([[i] + list(np.sum((dd == i).astype(int), axis=0)) for i in vocab])

# rename cols
df.columns = ['num', 0, 1, 2, 3, 4, 5]

# total occurrences of each number
df['numoccurances'] = df.iloc[:, 1:].sum(axis=1)

# stack the position counts and rename cols
stats = pd.DataFrame(df.set_index(['num', 'numoccurances']).\
                     stack()).reset_index().\
                     set_axis(['num', 'numoccurances', 'position', 'numtimesinposition'], axis=1)

# keep only rows with occurrences
stats = stats[stats['numtimesinposition'] > 0].reset_index(drop=True)
stats

    num  numoccurances  position  numtimesinposition
0     1              2         4                   1
1     1              2         5                   1
2     2              1         0                   1
3     4              1         3                   1
4     6              1         2                   1
5    12              1         4                   1
6    13              1         0                   1
7    22              1         0                   1
8    23              1         3                   1
9    45              2         1                   2
10   65              3         1                   1
11   65              3         5                   2
12   67              1         2                   1
13   72              1         3                   1
14   98              2         2                   1
15   98              2         4                   1

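As the results show:

1 occurs 2 times in total in the sample data you shared, once each in the 5th and 6th positions (index positions 4 and 5). Similarly, 2 occurs 1 time in total, in the 1st position (index position 0).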
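
This reading of the table can be spot-checked for a single number without pandas or numpy. The helper `position_counts` below is hypothetical (it is not part of the answer) and assumes the sample `data` from the question.

```python
data = [['22', '45', '6', '72', '1', '65'],
        ['2', '65', '67', '23', '98', '1'],
        ['13', '45', '98', '4', '12', '65']]

def position_counts(data, target):
    """Return how often `target` occurs at each index position (0-based)."""
    # zip(*data) regroups the rows into columns, one per index position.
    return [sum(int(num) == target for num in column) for column in zip(*data)]

counts_for_1 = position_counts(data, 1)
counts_for_2 = position_counts(data, 2)
print(counts_for_1)  # [0, 0, 0, 0, 1, 1]
print(counts_for_2)  # [1, 0, 0, 0, 0, 0]
```

Each list has one entry per index position, so the sum of a list is that number's total count and the nonzero entries are the rows that survive the final filter in the answer.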

【Discussion】:

This is perfect! Thank you!
