Index list of values in a column based on length of other column

【Title】: Index list of values in a column based on length of other column
【Posted】: 2020-08-30 02:59:37
【Question】:

I have a DataFrame like the following:

len  scores
5      [0.45814112124905954, 0.34974337172257086, 0.042586941883761324, 0.042586941883761324, 0.33509446692807404, 0.01202741856859997, 0.01202741856859997, 0.031149023579740857, 0.031149023579740857, 0.9382029832667171]
4      [0.1289882974831455, 0.17069367229950574, 0.03518847270370917, 0.3283517918439753, 0.41119171582425107, 0.5057528742869354]

3      [0.22345885572316307, 0.1366147609256035, 0.09309687010700848]
2      [0.4049920770888036]

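I want to slice each list in the scores column according to the value in the len column, so that every original row becomes multiple rows: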

len    scores
5       [0.45814112124905954, 0.34974337172257086, 0.042586941883761324, 0.042586941883761324]
5       [0.33509446692807404, 0.01202741856859997, 0.01202741856859997]
5       [0.031149023579740857, 0.031149023579740857]
5       [0.9382029832667171]
5       
4       [0.1289882974831455, 0.17069367229950574, 0.03518847270370917]
4       [0.3283517918439753, 0.41119171582425107]
4       [0.5057528742869354]
4
3       [0.22345885572316307, 0.1366147609256035]
3       [0.09309687010700848]
3
2       [0.4049920770888036]
2

I tried:

d = []
for x in df['len']:
    col = df['scores'][:(x-1)]
    d.append(col)

But this only gives me the first slice of each row:

len  scores
5      [0.45814112124905954, 0.34974337172257086, 0.042586941883761324, 0.042586941883761324]
4      [0.1289882974831455, 0.17069367229950574, 0.03518847270370917]
3      [0.22345885572316307, 0.1366147609256035]
2      [0.4049920770888036]

How can I get the remaining slices for each row, as described above?

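【Comments】:

【Answer 1】:

Assuming the len column is related to the length of the list in the scores column, as in your example, you can use apply to reshape each list into a nested list of pieces of decreasing length, and then explode, like this: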

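import numpy as np
import pandas as pd

# define function to create nested list
def create_nested_list(x):
    # slice boundaries for pieces of length len-1, len-2, ..., 0
    l_idx = [0] + np.cumsum(np.arange(x['len'])[::-1]).tolist()
    return [x['scores'][i:j] for i, j in zip(l_idx[:-1], l_idx[1:])]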

#apply row-wise
s = df.apply(create_nested_list, axis=1)
#change index to keep the value in len
s.index=df['len']
#explode and reset_index
df_f = s.explode().reset_index(name='scores')

print (df_f)
    len                                             scores
0     5  [0.45814112124905954, 0.34974337172257086, 0.0...
1     5  [0.33509446692807404, 0.01202741856859997, 0.0...
2     5       [0.031149023579740857, 0.031149023579740857]
3     5                               [0.9382029832667171]
4     5                                                 []
5     4  [0.1289882974831455, 0.17069367229950574, 0.03...
6     4          [0.3283517918439753, 0.41119171582425107]
7     4                               [0.5057528742869354]
8     4                                                 []
9     3          [0.22345885572316307, 0.1366147609256035]
10    3                              [0.09309687010700848]
11    3                                                 []
12    2                               [0.4049920770888036]
13    2                                                 []
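
To make the slice boundaries used above easier to follow, here is a minimal sketch of the same cumulative-sum idea on a plain Python list. The helper name `slice_bounds` and the stand-in data are hypothetical and only serve as illustration; the arithmetic mirrors `l_idx` in the answer.

```python
import numpy as np

def slice_bounds(n):
    """Hypothetical helper: boundaries for pieces of length n-1, n-2, ..., 0."""
    # np.arange(n)[::-1] is [n-1, ..., 1, 0]; its cumulative sum gives the end offsets
    return [0] + np.cumsum(np.arange(n)[::-1]).tolist()

bounds = slice_bounds(5)
print(bounds)  # [0, 4, 7, 9, 10, 10]

values = list(range(10))  # stand-in for a scores list with 10 entries
chunks = [values[i:j] for i, j in zip(bounds[:-1], bounds[1:])]
print(chunks)  # [[0, 1, 2, 3], [4, 5, 6], [7, 8], [9], []]
```

The last two boundaries are equal, which is why every group ends with an empty list.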

Edit: if you cannot use explode, try this instead:

#define function to create a series from nested lists
def create_nested_list_s (x):
    l_idx = [0]+np.cumsum(np.arange(x['len'])[::-1]).tolist()
    return pd.Series([x['scores'][i:j] for i, j in zip(l_idx[:-1], l_idx[1:])])

df_f = (df.apply(create_nested_list_s, axis=1)
          .set_index(df['len'])
          .stack()
          .reset_index(name='scores')
          .drop('level_1', axis=1))
print(df_f)
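
If neither explode nor the stack-based variant is convenient, the same reshaping can also be sketched with plain Python loops, which should not depend on the pandas version. This is an alternative illustration and not part of the original answer; the helper name `chunk_decreasing` is hypothetical, and the sample frame uses only the last two rows of the question's data.

```python
import pandas as pd

# Last two rows of the question's data, used as a small sample
df = pd.DataFrame({
    "len": [3, 2],
    "scores": [
        [0.22345885572316307, 0.1366147609256035, 0.09309687010700848],
        [0.4049920770888036],
    ],
})

def chunk_decreasing(scores, n):
    """Hypothetical helper: split scores into pieces of length n-1, n-2, ..., 0."""
    chunks, start = [], 0
    for size in range(n - 1, -1, -1):
        chunks.append(scores[start:start + size])
        start += size
    return chunks

# One output row per piece, repeating the len value for each piece
rows = [
    (n, chunk)
    for n, scores in zip(df["len"], df["scores"])
    for chunk in chunk_decreasing(scores, n)
]
df_f = pd.DataFrame(rows, columns=["len", "scores"])
print(df_f)
```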

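【Comments】:

For me, when I try to explode and reset the index as you mentioned, it gives me the error "AttributeError: 'Series' object has no attribute 'explode'".

@gamyanaidu explode was added in pandas 0.25; can you upgrade your version?

@gamyanaidu See my edit; it should work with earlier versions of pandas.

【Answer 2】: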

df.explode() does exactly what you want.

Example:

import pandas as pd

df = pd.DataFrame({'A': [[1, 2, 3], 'foo', [], [3, 4]], 'B': 1})
df.explode('A')
#Output
#      A  B
# 0    1  1
# 0    2  1
# 0    3  1
# 1  foo  1
# 2  NaN  1
# 3    3  1
# 3    4  1
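
As a hedged illustration of how this relates to the question's data: calling explode directly on a column of flat lists yields one row per list element, so the lists likely need to be nested into pieces first (as in Answer 1) to obtain the rows the question asks for. The sketch below assumes pandas 0.25 or later and uses the last two rows of the question's data, with the nesting written out by hand.

```python
import pandas as pd

# Flat lists: explode produces one row per score
flat_df = pd.DataFrame({
    "len": [3, 2],
    "scores": [
        [0.22345885572316307, 0.1366147609256035, 0.09309687010700848],
        [0.4049920770888036],
    ],
})
flat = flat_df.explode("scores")
print(flat)

# Lists already nested into pieces: explode produces one row per piece
nested_df = pd.DataFrame({
    "len": [3, 2],
    "scores": [
        [[0.22345885572316307, 0.1366147609256035], [0.09309687010700848], []],
        [[0.4049920770888036], []],
    ],
})
chunked = nested_df.explode("scores").reset_index(drop=True)
print(chunked)
```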

【Comments】:
