How to make a list of lists from a pandas DataFrame?
【Original title】: How to make list of list from dataframe pandas?
【Posted】: 2018-03-11 23:51:16
【Question】: I have a Pandas dataframe with words and tags:
words tags
0 I WW
1 am XX
2 newbie YY
3 . ZZ
4 You WW
5 are XX
6 cool YY
7 . ZZ
Is there any way to create a list of lists from the dataframe, like this:
[[('I', 'WW'), ('am', 'XX'), ('newbie', 'YY'), ('.','ZZ')],
[('You', 'WW'), ('are', 'XX'), ('cool', 'YY'), ('.','ZZ')]]
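It is a list of lists of tuples. Each inner list ends with ('.','ZZ'), which marks the end of a sentence.
I could iterate over every row of the dataframe, build a list, and append it whenever the condition is true, but is there a "pandas" way to solve this?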
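For reference, the row-by-row approach mentioned in the question can be sketched as follows. This is only an illustrative baseline: the frame is rebuilt from the table above, and the names `sep`, `sentences`, and `current` are introduced here and do not come from the original post.

```python
import pandas as pd

# Rebuild the example frame shown in the question.
df = pd.DataFrame({
    'words': ['I', 'am', 'newbie', '.', 'You', 'are', 'cool', '.'],
    'tags':  ['WW', 'XX', 'YY', 'ZZ', 'WW', 'XX', 'YY', 'ZZ'],
})

sep = ('.', 'ZZ')          # the pair that closes a sentence
sentences, current = [], []

# itertuples(index=False, name=None) yields plain (word, tag) tuples.
for pair in df.itertuples(index=False, name=None):
    current.append(pair)
    if pair == sep:        # sentence finished: store it and start a new one
        sentences.append(current)
        current = []

# Keep any trailing words that were not closed by a separator.
if current:
    sentences.append(current)

print(sentences)
```

The answers below replace this explicit loop with grouping or splitting operations.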
【Answer 1】: Here is one way:
In [5149]: dft = df.apply(tuple, 1)
In [5150]: parts = (dft == ('.', 'ZZ')).shift().cumsum().bfill()
# parts = (dft.shift() == ('.', 'ZZ')).cumsum() from Alexander's
In [5151]: [x.values.tolist() for _, x in dft.groupby(parts)]
Out[5151]:
[[('I', 'WW'), ('am', 'XX'), ('newbie', 'YY'), ('.', 'ZZ')],
[('You', 'WW'), ('are', 'XX'), ('cool', 'YY'), ('.', 'ZZ')]]
Or,
In [5152]: dft.groupby(parts).apply(list).tolist()
Out[5152]:
[[('I', 'WW'), ('am', 'XX'), ('newbie', 'YY'), ('.', 'ZZ')],
[('You', 'WW'), ('are', 'XX'), ('cool', 'YY'), ('.', 'ZZ')]]
Or,
In [5165]: list(dft.groupby(parts).apply(list))
Out[5165]:
[[('I', 'WW'), ('am', 'XX'), ('newbie', 'YY'), ('.', 'ZZ')],
[('You', 'WW'), ('are', 'XX'), ('cool', 'YY'), ('.', 'ZZ')]]
Details:
In [5153]: parts
Out[5153]:
0 0.0
1 0.0
2 0.0
3 0.0
4 1.0
5 1.0
6 1.0
7 1.0
dtype: float64
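On recent pandas versions, comparing a Series of tuples against a tuple may be interpreted element-wise rather than as a scalar comparison, so the first lines of the session above may not run unchanged. A minimal sketch of the same shift-cumsum idea, built from column comparisons instead (the names `is_end` and `result` are introduced here for illustration and are not part of the original answer):

```python
import pandas as pd

df = pd.DataFrame({
    'words': ['I', 'am', 'newbie', '.', 'You', 'are', 'cool', '.'],
    'tags':  ['WW', 'XX', 'YY', 'ZZ', 'WW', 'XX', 'YY', 'ZZ'],
})

# True on every row that closes a sentence.
is_end = (df['words'] == '.') & (df['tags'] == 'ZZ')

# Shift by one row so the *next* row starts a new group, then count the starts.
# fill_value=False keeps the dtype boolean, so no bfill is needed.
parts = is_end.shift(fill_value=False).cumsum()

# One list of (word, tag) tuples per group id.
result = [list(zip(g['words'], g['tags'])) for _, g in df.groupby(parts)]
print(result)
```

Here `parts` holds integer group ids (0 for the first sentence, 1 for the second) instead of the floats shown in the details above.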
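【Answer 2】: The first part (df.groupby((df.shift().values == ['.', 'ZZ']).all(axis=1).cumsum())) groups the dataframe into consecutive runs of rows, each run ending at (and including) a period in the 'words' column whose 'tags' value is also 'ZZ'. This is a variation of the shift-cumsum pattern (search SO for pandas shift cumsum and you should find plenty of variants).
The second part (.apply(lambda group: list(zip(group['words'], group['tags'])))) builds the (word, tag) tuples for each group. In Python 3, zip returns a lazy iterator, so it is wrapped in list. For example: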
0 [(I, WW), (am, XX), (newbie, YY), (., ZZ)]
1 [(You, WW), (are, XX), (cool, YY), (., ZZ)]
dtype: object
The last part (.values.tolist()) converts the resulting Series into the desired list of lists.
>>> df.groupby((df.shift().values == ['.', 'ZZ']).all(axis=1).cumsum()).apply(
...     lambda group: list(zip(group['words'], group['tags']))).values.tolist()
[[('I', 'WW'), ('am', 'XX'), ('newbie', 'YY'), ('.', 'ZZ')],
[('You', 'WW'), ('are', 'XX'), ('cool', 'YY'), ('.', 'ZZ')]]
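The three parts described above can be inspected one step at a time. The following is a sketch rather than the answerer's exact code: it uses `to_numpy(dtype=object)` and an object-dtype separator array to keep the comparison purely element-wise, and the names `shifted`, `starts`, and `keys` are illustrative.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'words': ['I', 'am', 'newbie', '.', 'You', 'are', 'cool', '.'],
    'tags':  ['WW', 'XX', 'YY', 'ZZ', 'WW', 'XX', 'YY', 'ZZ'],
})
sep = np.array(['.', 'ZZ'], dtype=object)

# Step 1: each row now holds the values of the previous row (first row is missing).
shifted = df.shift().to_numpy(dtype=object)

# Step 2: a row starts a new sentence when the previous row was ('.', 'ZZ').
starts = (shifted == sep).all(axis=1)

# Step 3: the running count of sentence starts is the group key.
keys = starts.cumsum()

# Step 4: group on the key and zip the two columns inside each group.
result = [list(zip(g['words'], g['tags'])) for _, g in df.groupby(keys)]

print(starts.tolist())
print(keys.tolist())
print(result)
```

Requiring both columns to match via `.all(axis=1)` means a '.' with a different tag would not split the sentence.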
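【Comments】:
I think Alex assumed that, since these are NLP tags, '.' would always be tagged 'ZZ'.
But I modified it to fit the specific requirement.
【Answer 3】: You can also use np.array_split, i.e.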
li = list(filter(None,[i.apply(tuple,1).values.tolist() \
for i in np.array_split(df,df[(df['words'] == '.') & (df['tags'] == 'ZZ')].index+1)]))
Or:
x = df.apply(tuple,1)
li = [ i.tolist() for i in np.array_split(x,x[x==('.','ZZ')].index+1) if len(i.tolist())>1]
Output:
[[('I', 'WW'), ('am', 'XX'), ('newbie', 'YY'), ('.', 'ZZ')],
[('You', 'WW'), ('are', 'XX'), ('cool', 'YY'), ('.', 'ZZ')]]
【Answer 4】: If performance matters, you can first create tuples from all the values and then split them into sublists:
from itertools import groupby
L = list(zip(df['words'], df['tags']))
print (L)
[('I', 'WW'), ('am', 'XX'), ('newbie', 'YY'),
('.', 'ZZ'), ('You', 'WW'), ('are', 'XX'),
('cool', 'YY'), ('.', 'ZZ')]
sep = ('.','ZZ')
new_L = [list(g) + [sep] for k, g in groupby(L, lambda x: x==sep) if not k]
print (new_L)
[[('I', 'WW'), ('am', 'XX'), ('newbie', 'YY'), ('.', 'ZZ')],
[('You', 'WW'), ('are', 'XX'), ('cool', 'YY'), ('.', 'ZZ')]]
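To see why the comprehension above appends `[sep]`, it may help to look at what `itertools.groupby` yields for the key `x == sep`. The short list below is made up for illustration, and the names `runs` and `sentences` are not from the original answer.

```python
from itertools import groupby

sep = ('.', 'ZZ')
L = [('I', 'WW'), ('.', 'ZZ'), ('You', 'WW'), ('are', 'XX'), ('.', 'ZZ')]

# groupby yields one (key, run) pair per stretch of consecutive equal keys:
# key False for ordinary words, key True for separator rows.
runs = [(k, list(g)) for k, g in groupby(L, key=lambda x: x == sep)]
for k, items in runs:
    print(k, items)

# Only the False runs are kept, so the separator that closed each one
# has to be added back by hand.
sentences = [items + [sep] for k, items in runs if not k]
print(sentences)
```

A consequence of this design is that the approach assumes every sentence ends with exactly one separator: trailing words with no closing separator would still get one appended, and consecutive separators would collapse into a single run that is dropped.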
Timings:
df = pd.concat([df]*1000).reset_index(drop=True)
def zero(df):
    dft = df.apply(tuple, 1)
    return ([x.values.tolist() for _, x in dft.groupby((dft == ('.', 'ZZ')).shift().cumsum().bfill())])
In [55]: %timeit ([list(g) + [('.','ZZ')] for k, g in groupby(list(zip(df['words'], df['tags'])), lambda x: x==('.','ZZ')) if not k] )
100 loops, best of 3: 4.14 ms per loop
def pir(df):
    v = df.values
    return ([list(map(tuple, x)) for x in np.split(v, np.where((v == ['.', 'ZZ']).all(1)[:-1])[0] + 1)])
In [68]: %timeit (pir(df))
10 loops, best of 3: 21.9 ms per loop
In [56]: %timeit (zero(df))
1 loop, best of 3: 328 ms per loop
In [57]: %timeit (df.groupby((df.shift().values == ['.', 'ZZ']).all(axis=1).cumsum()).apply(lambda group: list(zip(group['words'], group['tags']))).values.tolist())
1 loop, best of 3: 286 ms per loop
In [58]: %timeit (list(filter(None,[i.apply(tuple,1).values.tolist() for i in np.array_split(df,df[(df['words'] == '.') & (df['tags'] == 'ZZ')].index+1)])))
1 loop, best of 3: 1.31 s per loop
I posted a separate question about building the sublists; you can see the solutions here:
def jez_coldspeed(df):
    L = list(zip(df['words'], df['tags']))
    L2 = []
    for i in L[::-1]:
        if i == ('.','ZZ'):
            L2.append([])
        L2[-1].append(i)
    return [x[::-1] for x in L2[::-1]]
def jez_coldspeed1(df):
    L = list(zip(df['words'], df['tags']))
    L2 = []
    sep = ('.','ZZ')
    for i in reversed(L):
        if i == sep:
            L2.append([])
        L2[-1].append(i)
    return [x[::-1] for x in reversed(L2)]
In [74]: %timeit (jez_coldspeed(df))
100 loops, best of 3: 2.96 ms per loop
In [75]: %timeit (jez_coldspeed1(df))
100 loops, best of 3: 2.95 ms per loop
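def jez_theBuzzyCoder(df):
    L = list(zip(df['words'], df['tags']))
    a = list()
    start = 0
    sep = ('.', 'ZZ')
    # Note: list.index raises ValueError when sep is not found (it never
    # returns -1), so this loop assumes that L ends with sep.
    while start < len(L) and (L.index(sep, start) != -1):
        end = L.index(sep, start) + 1
        a.append(L[start:end])
        start = end
    return a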
print (jez_theBuzzyCoder(df))
In [81]: %timeit (jez_theBuzzyCoder(df))
100 loops, best of 3: 3.16 ms per loop
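The `%timeit` lines above are IPython magics. Outside IPython, a comparable check can be sketched with the standard `timeit` module. The function names `by_groupby` and `by_reverse_loop` are placeholders for two of the approaches above, the repeat count of 20 is arbitrary, and the numbers obtained will depend on the machine and library versions, so they are not expected to match the figures quoted in this answer.

```python
import timeit
from itertools import groupby

import pandas as pd

base = pd.DataFrame({
    'words': ['I', 'am', 'newbie', '.', 'You', 'are', 'cool', '.'],
    'tags':  ['WW', 'XX', 'YY', 'ZZ', 'WW', 'XX', 'YY', 'ZZ'],
})
# Same enlargement as in the timings above: 8000 rows, 2000 sentences.
df = pd.concat([base] * 1000).reset_index(drop=True)
SEP = ('.', 'ZZ')


def by_groupby(df):
    """itertools.groupby variant from this answer."""
    pairs = list(zip(df['words'], df['tags']))
    return [list(g) + [SEP] for k, g in groupby(pairs, lambda x: x == SEP) if not k]


def by_reverse_loop(df):
    """Reverse-iteration variant (same logic as jez_coldspeed1)."""
    pairs = list(zip(df['words'], df['tags']))
    out = []
    for item in reversed(pairs):
        if item == SEP:
            out.append([])
        out[-1].append(item)
    return [chunk[::-1] for chunk in reversed(out)]


# Check that both variants agree before comparing their speed.
assert by_groupby(df) == by_reverse_loop(df)

for fn in (by_groupby, by_reverse_loop):
    seconds = timeit.timeit(lambda: fn(df), number=20) / 20
    print(f"{fn.__name__}: {seconds * 1e3:.2f} ms per call")
```

Verifying that the candidates return identical results first avoids timing a variant that is fast only because it does less work.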
【Comments】:
This method is definitely the fastest.
Really fast.
Aha! Really fast indeed!
Wow! Mind-blowing. Thanks, everyone! (Especially to you, jezrael xD)
【Answer 5】:
v = df.values
[
list(map(tuple, x))
for x in np.split(v, np.where((v == ['.', 'ZZ']).all(1)[:-1])[0] + 1)
]
[[('I', 'WW'), ('am', 'XX'), ('newbie', 'YY'), ('.', 'ZZ')],
[('You', 'WW'), ('are', 'XX'), ('cool', 'YY'), ('.', 'ZZ')]]
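The `[:-1]` in the expression above is easy to overlook. A small sketch of what it does, assuming the same example frame (the names `is_end`, `all_cuts`, and `inner_cuts` are illustrative, and an object-dtype array is used here to keep the comparison element-wise):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'words': ['I', 'am', 'newbie', '.', 'You', 'are', 'cool', '.'],
    'tags':  ['WW', 'XX', 'YY', 'ZZ', 'WW', 'XX', 'YY', 'ZZ'],
})
v = df.to_numpy(dtype=object)
sep = np.array(['.', 'ZZ'], dtype=object)

# True on the rows that close a sentence.
is_end = (v == sep).all(axis=1)

# Cutting after *every* separator also cuts after the last row,
# which leaves an empty trailing piece.
all_cuts = np.where(is_end)[0] + 1
with_tail = np.split(v, all_cuts)

# Ignoring the last row's flag (the [:-1]) avoids that empty piece.
inner_cuts = np.where(is_end[:-1])[0] + 1
without_tail = np.split(v, inner_cuts)

result = [list(map(tuple, x)) for x in without_tail]
print(len(with_tail), len(without_tail))
print(result)
```

If the data did not end with a separator, the `[:-1]` would simply drop a False flag and the final piece would contain the unterminated words.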