如何使用来自另一个数据帧的索引创建考拉数据帧?
Posted
技术标签:
【中文标题】如何使用来自另一个数据帧的索引创建考拉数据帧?【英文标题】:How to create a koalas dataframe with index from another dataframe? 【发布时间】:2022-01-21 09:48:18 【问题描述】:我可以在熊猫中做到这一点,但我正在努力在考拉中实现同样的目标。以下是我迄今为止的尝试:
from databricks import koalas as pd
import pandas
熊猫(作品):
dft = pandas.DataFrame('a':[1,2,3],'b':[0,1,0],index=[11,12,13])
dft1 = pandas.DataFrame('a':[2,21,31],'c':[3,4,5], index=dft.index)
考拉(因错误而失败):
dft = pd.DataFrame('a':[1,2,3],'b':[0,1,0],index=[11,12,13])
dft1 = pd.DataFrame('a':[2,21,31],'c':[3,4,5], index=dft.index)
output:
ValueError Traceback (most recent call last)
/tmp/ipykernel_2826623/2112004205.py in <module>
1 dft = pd.DataFrame('a':[1,2,3],'b':[0,1,0],index=[11,12,13])
----> 2 dft1 = pd.DataFrame('a':[2,21,31],'c':[3,4,5], index=dft.index)
~/miniconda3/envs/pyspark/lib/python3.9/site-packages/pyspark/pandas/frame.py in __init__(self, data, index, columns, dtype, copy)
517 pdf = data
518 else:
--> 519 pdf = pd.DataFrame(data=data, index=index, columns=columns, dtype=dtype, copy=copy)
520 internal = InternalFrame.from_pandas(pdf)
521
~/miniconda3/envs/pyspark/lib/python3.9/site-packages/pandas/core/frame.py in __init__(self, data, index, columns, dtype, copy)
612 elif isinstance(data, dict):
613 # GH#38939 de facto copy defaults to False only in non-dict cases
--> 614 mgr = dict_to_mgr(data, index, columns, dtype=dtype, copy=copy, typ=manager)
615 elif isinstance(data, ma.MaskedArray):
616 import numpy.ma.mrecords as mrecords
~/miniconda3/envs/pyspark/lib/python3.9/site-packages/pandas/core/internals/construction.py in dict_to_mgr(data, index, columns, dtype, typ, copy)
462 # TODO: can we get rid of the dt64tz special case above?
463
--> 464 return arrays_to_mgr(
465 arrays, data_names, index, columns, dtype=dtype, typ=typ, consolidate=copy
466 )
~/miniconda3/envs/pyspark/lib/python3.9/site-packages/pandas/core/internals/construction.py in arrays_to_mgr(arrays, arr_names, index, columns, dtype, verify_integrity, typ, consolidate)
119 index = _extract_index(arrays)
120 else:
--> 121 index = ensure_index(index)
122
123 # don't force copy because getting jammed in an ndarray anyway
~/miniconda3/envs/pyspark/lib/python3.9/site-packages/pandas/core/indexes/base.py in ensure_index(index_like, copy)
6334 else:
6335
-> 6336 return Index(index_like, copy=copy)
6337
6338
~/miniconda3/envs/pyspark/lib/python3.9/site-packages/pandas/core/indexes/base.py in __new__(cls, data, dtype, copy, name, tupleize_cols, **kwargs)
482 data = list(data)
483
--> 484 if data and all(isinstance(e, tuple) for e in data):
485 # we must be all tuples, otherwise don't construct
486 # 10697
~/miniconda3/envs/pyspark/lib/python3.9/site-packages/pyspark/pandas/indexes/base.py in __bool__(self)
2605
2606 def __bool__(self) -> bool:
-> 2607 raise ValueError(
2608 "The truth value of a 0 is ambiguous. "
2609 "Use a.empty, a.bool(), a.item(), a.any() or a.all().".format(self.__class__.__name__)
ValueError: The truth value of a Int64Index is ambiguous. Use a.empty, a.bool(), a.item(), a.any() or a.all().
熊猫(作品):
dft = pandas.DataFrame('a':[1,2,3],'b':[0,1,0],index=[11,12,13])
dft1 = pandas.DataFrame('a':[2,21,31],'c':[3,4,5])
dft1.index=dft.index
print(dft1)
output:
a c
11 2 3
12 21 4
13 31 5
考拉(失败无错误):
dft = pd.DataFrame('a':[1,2,3],'b':[0,1,0],index=[11,12,13])
dft1 = pd.DataFrame('a':[2,21,31],'c':[3,4,5])
dft1.index=dft.index
print(dft1)
output:
a c
0 2 3
1 21 4
2 31 5
print(dft1.index)
output: Int64Index([0, 1, 2], dtype='int64')
【问题讨论】:
【参考方案1】:我现在已经制定了一个 hacky 解决方案。如果有人有更好的解决方案,请告诉我:
dft = dft = pd.DataFrame('a':[1,2,3],'b':[0,1,0],index=[11,12,13])
dft1 = dft1 = pd.DataFrame('a':[2,21,31],'c':[3,4,5])
index = dft.index
index = index.to_series()
index = index.reset_index(drop=True)
pd.set_option('compute.ops_on_diff_frames',True)
dft1['r'] = index
dft1 = dft1.set_index('r',drop=True)
dft1.index.name = dft.index.name
pd.reset_option('compute.ops_on_diff_frames')
dft1
output:
a c
11 2 3
12 21 4
13 31 5
【讨论】:
以上是关于如何使用来自另一个数据帧的索引创建考拉数据帧?的主要内容,如果未能解决你的问题,请参考以下文章
熊猫,我怎样才能避免使用 iterrow (如何根据来自另一个数据帧的值将值分配给数据帧中的新列)
Pyspark - 如何将多个数据帧的列连接成一个数据帧的列