Concat two pandas dataframes and reorder columns

Posted 2023-02-24

【Title】: Concat two pandas dataframes and reorder columns 【Posted】: 2019-12-28 17:18:39 【Question】:

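I have two dataframes (df1 and df2, shown below) whose columns differ in both order and count. I need to append these two dataframes to an Excel file, where the column order must follow the one specified in Col_list below.

df1 is: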

 durable_medical_equipment    pcp  specialist  diagnostic  imaging  generic  formulary_brand  non_preferred_generic  emergency_room  inpatient_facility  medical_deductible_single  medical_deductible_family  maximum_out_of_pocket_limit_single  maximum_out_of_pocket_limit_family plan_name      pdf_name
0                      False  False       False       False    False    False            False                  False           False               False                      False                      False                               False                               False   ABCBCBC  adjnajdn.pdf

... df2 is:

   pcp  specialist  generic  formulary_brand  emergency_room  urgent_care  inpatient_facility  durable_medical_equipment  medical_deductible_single  medical_deductible_family  maximum_out_of_pocket_limit_single  maximum_out_of_pocket_limit_family plan_name      pdf_name
0  True        True    False            False            True         True                True                       True                       True                       True                                True                                True   ABCBCBC  adjnajdn.pdf

I am creating a list of columns in the same order as the columns in the Excel file.

Col_list = ['durable_medical_equipment', 'pcp', 'specialist', 'diagnostic',
            'imaging', 'generic', 'formulary_brand', 'non_preferred_generic',
            'emergency_room', 'inpatient_facility', 'medical_deductible_single',
            'medical_deductible_family', 'maximum_out_of_pocket_limit_single', 'maximum_out_of_pocket_limit_family',
            'urgent_care', 'plan_name', 'pdf_name']

I am trying to use concat() to reorder my dataframe according to Col_list. For columns that do not exist in a dataframe, the value can be NaN.

result = pd.concat([df, pd.DataFrame(columns=list(Col_list))])

This does not work properly. How can I achieve this reordering?

I tried the following:

result = pd.concat([df_repo, pd.DataFrame(columns=list(Col_list))], sort=False, ignore_index=True)
print(result.to_string())

The output I get is:

 durable_medical_equipment    pcp specialist diagnostic imaging generic formulary_brand non_preferred_generic emergency_room inpatient_facility medical_deductible_single medical_deductible_family maximum_out_of_pocket_limit_single maximum_out_of_pocket_limit_family plan_name      pdf_name urgent_care
0                     False  False      False      False   False   False           False                 False          False              False                     False                     False                              False                              False   ABCBCBC  adjnajdn.pdf         NaN
    pcp specialist generic formulary_brand emergency_room urgent_care inpatient_facility durable_medical_equipment medical_deductible_single medical_deductible_family maximum_out_of_pocket_limit_single maximum_out_of_pocket_limit_family plan_name      pdf_name diagnostic imaging non_preferred_generic
0  True       True   False           False           True        True               True                      True                      True                      True                               True                               True   ABCBCBC  adjnajdn.pdf        NaN     NaN                   NaN
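
The output above is consistent with how concat aligns columns: with sort=False it keeps column names in order of first appearance across the inputs, so an empty frame built from Col_list can only contribute the names the first frame lacks, and those land at the end. A minimal sketch of that behavior, assuming a hypothetical three-column frame (small and col_list are illustrative names, not from the question):

```python
import pandas as pd

# Hypothetical stand-in for one of the frames; columns are deliberately
# not in the target order.
small = pd.DataFrame({"pcp": [False], "plan_name": ["ABCBCBC"], "generic": [False]})
col_list = ["generic", "pcp", "urgent_care", "plan_name"]

# Concatenating with an empty frame only adds the missing column names;
# the existing columns keep the order they had in `small`.
attempt = pd.concat([small, pd.DataFrame(columns=col_list)],
                    sort=False, ignore_index=True)
attempt_columns = list(attempt.columns)
print(attempt_columns)
```

The missing column (urgent_care here) is appended after the existing ones and filled with NaN, which matches the placement of the NaN columns in the question's output.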

【Comments】:

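Using concat instead of merge seems to be a mistake, since your dataframes share many common columns (pcp, specialist, generic). Do you really want those columns to show up twice in the output?

With concat, it does not give me duplicates.

When you want to combine 2+ dataframes that share columns, use merge rather than concat: Difference(s) between merge() and concat() in pandas

【Answer 1】: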

If you need to change the column order according to the values in a list, apply DataFrame.reindex to each DataFrame and pass the results to concat:

df = pd.concat([df1.reindex(Col_list, axis=1), 
                df2.reindex(Col_list, axis=1)], sort=False, ignore_index=True)
print(df)
   durable_medical_equipment    pcp  specialist  diagnostic  imaging  generic  \
0                      False  False       False         0.0      0.0    False   
1                       True   True        True         NaN      NaN    False   

   formulary_brand  non_preferred_generic  emergency_room  inpatient_facility  \
0            False                    0.0           False               False   
1            False                    NaN            True                True   

   medical_deductible_single  medical_deductible_family  \
0                      False                      False   
1                       True                       True   

   maximum_out_of_pocket_limit_single  maximum_out_of_pocket_limit_family  \
0                               False                               False   
1                                True                                True   

   urgent_care plan_name      pdf_name  
0          NaN   ABCBCBC  adjnajdn.pdf  
1          1.0   ABCBCBC  adjnajdn.pdf
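
For readers who want to run the approach without the full data, here is a self-contained sketch. The two frames are reduced, hypothetical versions of df1 and df2 with only a handful of the question's columns, and col_list is a shortened stand-in for Col_list:

```python
import pandas as pd

# Reduced, hypothetical versions of df1 and df2 from the question.
df1 = pd.DataFrame({"pcp": [False], "diagnostic": [False],
                    "plan_name": ["ABCBCBC"], "pdf_name": ["adjnajdn.pdf"]})
df2 = pd.DataFrame({"urgent_care": [True], "pcp": [True],
                    "plan_name": ["ABCBCBC"], "pdf_name": ["adjnajdn.pdf"]})
col_list = ["pcp", "diagnostic", "urgent_care", "plan_name", "pdf_name"]

# reindex(..., axis=1) puts each frame into the target column order and
# fills columns that the frame does not have with NaN.
frames = [frame.reindex(col_list, axis=1) for frame in (df1, df2)]

# Both frames now share identical columns, so concat simply stacks the rows.
combined = pd.concat(frames, sort=False, ignore_index=True)
print(combined)
```

Because every frame is reindexed before concatenation, the result's column order does not depend on which frame comes first.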

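【Discussion】:

I tried result = pd.concat([df_repo, pd.DataFrame(columns=list(Col_list))], sort=False, ignore_index=True). It did not give me the correct output. I have updated my question with the output I got, but the order is still not the same. What I actually need is to change the order of the dataframe according to the list I defined, because after that I will append that df to my Excel file in a loop.

@user1896796 - This is my second solution; the first one has now been deleted.

Yes, I was able to do it with reindex. I did something like the following -

result = pd.concat([df_repo, pd.DataFrame(columns=list(Col_list))], sort=False, ignore_index=True)
result = result.reindex(Col_list, axis=1)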
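
A variant of the same idea is to stack the frames first and reindex the combined result once. This is only a sketch under assumptions: the frames below are hypothetical reductions of df1 and df2, and the scratch column is invented to show what happens to names that are not in the list.

```python
import pandas as pd

# Hypothetical reduced frames; "scratch" is an invented column that is
# deliberately absent from the target list.
df1 = pd.DataFrame({"pcp": [False], "diagnostic": [False], "plan_name": ["ABCBCBC"]})
df2 = pd.DataFrame({"urgent_care": [True], "pcp": [True],
                    "plan_name": ["ABCBCBC"], "scratch": [1]})
col_list = ["pcp", "diagnostic", "urgent_care", "plan_name"]

# Stack the rows first; the columns become the union of both frames.
stacked = pd.concat([df1, df2], sort=False, ignore_index=True)

# Impose the target order in one step. Columns missing from a row stay NaN,
# and columns not named in col_list (here "scratch") are dropped.
result = stacked.reindex(col_list, axis=1)
print(result)
```

Note that reindex silently discards any column that is not in the list, so the list should name every column that has to reach the Excel file.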
