How to combine two dataframes and naturally sort on a column of mixed alphanumeric types?

Posted 2023-03-12

【Title】: How to combine two dataframes and naturally sort on a column of mixed alphanumeric types? 【Posted】: 2021-01-06 21:19:06 【Question】:

I have a dataframe df1:

   QID Questions    B Answer1 Answer2 Answer3  F  G  H  I  J
0    3         a  4.0       a       a       a  a  e  g  i  l
1    4         b  5.0       b       b       b  a  r  h  m  p
2    5         d  5.0     NaN       e       d  b  u  e  i  z
3    6         e  5.0       d       h       r  b  c  z  i  3

I would like to add another one, new_dataframe, between the rows of df1.

   QID Questions    B Answer1 Answer2 Answer3  F  G  H  I  J
2  4_1         z  5.0       b       k       b  a  r  h  m  p
3  4_2         w  4.0       b       k       b  c  r  h  m  p

Specifically, I would like to get:

   QID Questions    B Answer1 Answer2 Answer3  F  G  H  I  J
0    3         a  4.0       a       a       a  a  e  g  i  l
1    4         b  5.0       b       b       b  a  r  h  m  p
2  4_1         z  5.0       b       k       b  a  r  h  m  p
3  4_2         w  4.0       b       k       b  c  r  h  m  p
4    5         d  5.0     NaN       e       d  b  u  e  i  z
5    6         e  5.0       d       h       r  b  c  z  i  3

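So I would like to merge in the rows of the second dataframe, new_dataframe, whose QID consists of a number and a sub-number, placing them after the matching rows of df1. For example, the rows of new_dataframe with QID 4_1, 4_2, ... should be inserted after 4.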

So far, I have tried the following:

# Now I would like to join this new dataframe with dataframe_1, respecting the index, sort it and so on.
for i, row in df1.iterrows():
    qid = row['QID']
    # test if there is such a QID in new_dataframe
    repeated_question = new_dataframe[new_dataframe['QID'].split("_")[0] == qid]
    # insert them

Update

I tried Trenton McKinney's updated answer with my actual data and got an exception:

from natsort import index_natsorted as ins
import numpy as np
import pandas as pd

# read the files
df1 = pd.read_csv("/content/drive/My Drive/Auspex/QuestionBank_06082020_QGrid_and_CINT.csv", dtype={'QID': str}, low_memory=False)
df1.drop(columns=['Unnamed: 0'], inplace=True)

df2 = pd.read_csv('/content/drive/My Drive/Auspex/new_df.csv', dtype={'QID': str})
df2.drop(columns=['Unnamed: 0'], inplace=True)

# concat them
df = pd.concat([df1, df2])

# sort the values using the key parameter in sort_values
df.sort_values(by='QID', key=lambda col: np.argsort(ins(col))).reset_index(drop=True)

I get:

---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
<ipython-input-18-d9f848d1f6bf> in <module>()
     14 
     15 # sort the values using the key parameter in sort_values
---> 16 df.sort_values(by='QID', key=lambda col: np.argsort(ins(col))).reset_index(drop=True)

TypeError: sort_values() got an unexpected keyword argument 'key'

【Answer 1】:

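Update: using the data files and `natsort`

When the column holds strings that mix digits and letters, a plain sort does not put the values in numeric order, although related values do end up together. For example, 100 sits next to 100_A. To fix this, use natsort, which can be installed with pip or conda. The approach relies on the key parameter of .sort_values, which is only available in pandas 1.1.0 and later.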

from natsort import index_natsorted as ins
import numpy as np
import pandas as pd

# read the files
df1 = pd.read_csv('data/QuestionBank_06082020_QGrid_and_CINT - QuestionBank_06082020_QGrid_and_CINT.csv', dtype={'QID': str}, low_memory=False)
df1.drop(columns=['Unnamed: 0'], inplace=True)

df2 = pd.read_csv('data/new_df.csv', dtype={'QID': str})
df2.drop(columns=['Unnamed: 0'], inplace=True)

# concat them
df = pd.concat([df1, df2])

# sort the values using the key parameter in sort_values
df.sort_values(by='QID', key=lambda col: np.argsort(ins(col))).reset_index(drop=True)

Sample result

# display(df.iloc[:60, :1])

     QID
0    NaN
1    NaN
2    NaN
3    NaN
4      0
5     0R
6      1
7      2
8      4
9      5
10    5R
11     6
12     7
13     8
14     9
15    10
16    11
17    12
18    13
19    14
20    15
21    16
22    17
23    18
24    19
25    20
26  20_1
27  20_2
28  20_3
29  20_4
30  20_5
31  20_6
32  20_7
33    21
34    22
35    23
36    24
37   24R
38    25
39    26
40    27
41    28
42    29
43    30
44    31
45    32
46    33
47    34
48    35
49    36
50  36_1
51  36_2
52  36_3
53  36_4
54  36_5
55  36_6
56  36_7
57  36_8
58  36_9
59    37
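Where upgrading pandas is not an option, the same ordering can be approximated without the key parameter by sorting on temporary numeric columns. The following is only a sketch under stated assumptions: the two small frames are made-up stand-ins for the real files, the helper column names `_main` and `_sub` are hypothetical, and every QID is assumed to look like `N` or `N_M`. Values such as 0R, or missing QIDs as seen in the sample result above, do not match that pattern and would need a rule of their own.

```python
import pandas as pd

# Hypothetical stand-ins for the real files, with QID already read as str.
df1 = pd.DataFrame({"QID": ["3", "4", "5", "10"], "Questions": ["a", "b", "d", "e"]})
df2 = pd.DataFrame({"QID": ["4_1", "4_2"], "Questions": ["z", "w"]})

df = pd.concat([df1, df2], ignore_index=True)

# Assumption: every QID is "N" or "N_M", where N and M consist of digits only.
parts = df["QID"].str.extract(r"^(\d+)(?:_(\d+))?$")

# Temporary numeric sort columns (hypothetical names).
df["_main"] = pd.to_numeric(parts[0])
# A QID without a sub-number gets -1 so it sorts before its sub-questions.
df["_sub"] = pd.to_numeric(parts[1]).fillna(-1)

result = (
    df.sort_values(["_main", "_sub"])
      .drop(columns=["_main", "_sub"])
      .reset_index(drop=True)
)
print(result["QID"].tolist())
```

With this sample the QIDs come out as 3, 4, 4_1, 4_2, 5, 10, so 10 lands after 5 rather than before 3.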

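Original answer: using the test data

Combine the dataframes with pandas.concat. Based on the sample, df1.QID is probably an int column, while df2.QID will be of str type. Once the DataFrames are combined, the 'QID' column has to be converted to str with .astype, because it will otherwise hold mixed types and .sort_values will not work. If the 'QID' column has mixed types, the result is TypeError: '<' not supported between instances of 'str' and 'int'.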

# concat the dataframes
df = pd.concat([df1, df2])

# convert the column to strings
df.QID = df.QID.astype(str)

# sort and reset
df = df.sort_values('QID').reset_index(drop=True)

# display(df)
   QID Questions    B Answer1 Answer2 Answer3  F  G  H  I  J
0    3         a  4.0       a       a       a  a  e  g  i  l
1    4         b  5.0       b       b       b  a  r  h  m  p
2  4_1         z  5.0       b       k       b  a  r  h  m  p
3  4_2         w  4.0       b       k       b  c  r  h  m  p
4    5         d  5.0     NaN       e       d  b  u  e  i  z
5    6         e  5.0       d       h       r  b  c  z  i  3
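The plain string sort gives the desired order here because every QID in the test data starts with a single digit. A minimal, pandas-free sketch of what changes once a two-digit ID appears is shown below. The helper `natural_key` is a hypothetical name introduced for illustration and is not part of pandas or natsort; it only mimics the idea of natural ordering by comparing runs of digits as integers.

```python
import re

def natural_key(text):
    """Split text into alternating non-digit and digit chunks; digits become ints."""
    return [int(tok) if tok.isdecimal() else tok for tok in re.split(r"(\d+)", text)]

# Made-up QIDs that include a two-digit value.
qids = ["10", "4_2", "3", "4", "4_1"]

plain = sorted(qids)                      # character-by-character comparison
natural = sorted(qids, key=natural_key)   # digit runs compared as numbers

print(plain)
print(natural)
```

The plain sort places 10 before 3 because the character 1 compares lower than 3, whereas the natural key yields 3, 4, 4_1, 4_2, 10.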

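【Comments】:

Thank you very much! I tried it but got a TypeError exception: sort_values() got an unexpected keyword argument 'key'. I thought this exception would only be raised if it were a pandas Series? I have updated my question with my attempt.

@RevolucionforMonica You need to update pandas. The key parameter is only available from version 1.1.0 onward.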
