Index list of values in a column based on length of other column
【Title】Index list of values in a column based on length of other column 【Posted】2020-08-30 02:59:37 【Question】I have a DataFrame like the following:
len scores
5 [0.45814112124905954, 0.34974337172257086, 0.042586941883761324, 0.042586941883761324, 0.33509446692807404, 0.01202741856859997, 0.01202741856859997, 0.031149023579740857, 0.031149023579740857, 0.9382029832667171]
4 [0.1289882974831455, 0.17069367229950574, 0.03518847270370917, 0.3283517918439753, 0.41119171582425107, 0.5057528742869354]
3 [0.22345885572316307, 0.1366147609256035, 0.09309687010700848]
2 [0.4049920770888036]
I want to slice the lists in the scores column based on the value in the len column and get multiple rows:
len scores
5 [0.45814112124905954, 0.34974337172257086, 0.042586941883761324, 0.042586941883761324]
5 [0.33509446692807404, 0.01202741856859997, 0.01202741856859997]
5 [0.031149023579740857, 0.031149023579740857]
5 [0.9382029832667171]
5
4 [0.1289882974831455, 0.17069367229950574, 0.03518847270370917]
4 [0.3283517918439753, 0.41119171582425107]
4 [0.5057528742869354]
4
3 [0.22345885572316307, 0.1366147609256035]
3 [0.09309687010700848]
3
2 [0.4049920770888036]
2
I tried:
d = []
for x in df['len']:
    col = df['scores'][:(x-1)]
    d.append(col)
but this only gives me the first slice for each row:
len scores
5 [0.45814112124905954, 0.34974337172257086, 0.042586941883761324, 0.042586941883761324]
4 [0.1289882974831455, 0.17069367229950574, 0.03518847270370917]
3 [0.22345885572316307, 0.1366147609256035]
2 [0.4049920770888036]
How can I get the remaining slices as required?
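【Answer 1】: Assuming the column len is related to the length of the lists in the column scores, as in your example, you can use apply to reshape each list into a nested list of sublists of decreasing length, then explode, like this: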
import numpy as np
import pandas as pd

# define a function to create the nested list
def create_nested_list(x):
    # slice boundaries for chunks of length len-1, len-2, ..., 1, 0
    l_idx = [0] + np.cumsum(np.arange(x['len'])[::-1]).tolist()
    return [x['scores'][i:j] for i, j in zip(l_idx[:-1], l_idx[1:])]

# apply row-wise
s = df.apply(create_nested_list, axis=1)
# change the index to keep the value in len
s.index = df['len']
# explode and reset_index
df_f = s.explode().reset_index(name='scores')
print(df_f)
len scores
0 5 [0.45814112124905954, 0.34974337172257086, 0.0...
1 5 [0.33509446692807404, 0.01202741856859997, 0.0...
2 5 [0.031149023579740857, 0.031149023579740857]
3 5 [0.9382029832667171]
4 5 []
5 4 [0.1289882974831455, 0.17069367229950574, 0.03...
6 4 [0.3283517918439753, 0.41119171582425107]
7 4 [0.5057528742869354]
8 4 []
9 3 [0.22345885572316307, 0.1366147609256035]
10 3 [0.09309687010700848]
11 3 []
12 2 [0.4049920770888036]
13 2 []
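To see where the slice boundaries come from, here is a minimal sketch for len = 5. The helper name slice_bounds and the stand-in list of integers are assumptions for illustration only; they are not part of the answer above.

```python
import numpy as np

def slice_bounds(n):
    """Cumulative slice boundaries for chunk sizes n-1, n-2, ..., 1, 0 (hypothetical helper)."""
    sizes = np.arange(n)[::-1]              # n=5 -> [4, 3, 2, 1, 0]
    return [0] + np.cumsum(sizes).tolist()  # n=5 -> [0, 4, 7, 9, 10, 10]

bounds = slice_bounds(5)
scores = list(range(10))                    # stand-in for a 10-element scores list
# pair each boundary with the next one to get (start, stop) for every chunk
chunks = [scores[i:j] for i, j in zip(bounds[:-1], bounds[1:])]
print(bounds)
print(chunks)
```

The last two boundaries are equal, which is why every group ends with an empty list in the output above.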
Edit: if you can't use explode, try this:
# define a function to create a Series from the nested lists
def create_nested_list_s(x):
    l_idx = [0] + np.cumsum(np.arange(x['len'])[::-1]).tolist()
    return pd.Series([x['scores'][i:j] for i, j in zip(l_idx[:-1], l_idx[1:])])

df_f = (df.apply(create_nested_list_s, axis=1)
          .set_index(df['len'])
          .stack()
          .reset_index(name='scores')
          .drop('level_1', axis=1))
print(df_f)
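Another option that does not depend on explode or stack is to build the output rows in plain Python and construct a new DataFrame from them. This is only a sketch, not the method from the answer above: the toy data and the helper name split_decreasing are made up for illustration, and it assumes a row with len = n holds n*(n-1)/2 scores, as in the question.

```python
import numpy as np
import pandas as pd

# Toy data (made up for illustration): a row with len = n holds n*(n-1)/2 scores
df = pd.DataFrame({'len': [3, 2],
                   'scores': [[0.1, 0.2, 0.3], [0.4]]})

def split_decreasing(scores, n):
    """Split scores into chunks of sizes n-1, n-2, ..., 1, 0 (hypothetical helper)."""
    bounds = [0] + np.cumsum(np.arange(n)[::-1]).tolist()
    return [scores[i:j] for i, j in zip(bounds[:-1], bounds[1:])]

# one (len, chunk) pair per output row
rows = [(n, chunk)
        for n, scores in zip(df['len'], df['scores'])
        for chunk in split_decreasing(scores, n)]
df_f = pd.DataFrame(rows, columns=['len', 'scores'])
print(df_f)
```

Like the approaches above, this keeps the trailing empty list for each group; filter out empty chunks in the list comprehension if you don't want those rows.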
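【Comments】:
When I try to explode and reset the index as you described, I get the error "AttributeError: 'Series' object has no attribute 'explode'".
@gamyanaidu explode is new as of pandas 0.25; can you upgrade your version?
@gamyanaidu see my edit, it should work with earlier versions of pandas.
【Answer 2】: df.explode() does exactly what you want.
Example: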
import pandas as pd
df = pd.DataFrame({'A': [[1, 2, 3], 'foo', [], [3, 4]], 'B': 1})
df.explode('A')
#Output
# A B
# 0 1 1
# 0 2 1
# 0 3 1
# 1 foo 1
# 2 NaN 1
# 3 3 1
# 3 4 1