Python: compare strings of a column between two excel sheets and find the matching columns and write to another data frame

Posted 2023-03-12

【Title】: Python: compare strings of a column between two excel sheets and find the matching columns and write to another data frame 【Posted】: 2021-09-08 19:09:43 【Question】:

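I need to read two spreadsheets (say SS1 and SS2). For each entry in the Product_Description column of SS1, I then have to search SS2 for a similar description with the most recent date. The output must be written to another spreadsheet containing each unique matched description together with its latest date and price.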

Sample input and output for reference. SS1:

S.No	Product	Product_Description
1	Shirt	Monte Carlo Men Shirt
2	Shirt	Belmonte Shirt Cotton
3	Shirt	US Polo tshirt men
4	Shirt	Monte Carlo tshirt
5	Shirt	Monte Carlo Men Shirt
6	Suit	Louis Philippe wrinkle free
7	Suit	Park Avenue
8	Suit	Van Heusen
9	Watches	Titan Men Wrist Type
10	Watches	Casio
11	Watches	Titan Women Wrist Type
12	Watches	Rolex
13	Watches	Casio

SS2:

S.No	Product	Product_Description	Purchase_Date	Quantity	Price	Net Value
1	Watches	Casio	Jan-19	10	5000	50000
2	Watches	Rolex	May-20	2	500000	1000000
3	Shirt	Monte Carlo tshirt	Feb-20	20	2000	40000
4	Suit	Raymond	Jan-20	50	10000	500000
5	Watches	Lois Moinet	May-21	3	60000	180000
6	Shirt	Peter England	Apr-21	40	1800	72000
7	Watches	Casio	Mar-19	30	5500	165000
8	Shirt	Monte Carlo Men Shirt	Jun-19	10	3000	30000
9	Shirt	Monte Carlo Men Shirt	Apr-20	12	3100	37200
10	Watches	Rolex	Dec-20	4	505000	2020000
11	Suit	Louis Philippe wrinkle free suit	Jun-21	9	20000	180000
12	Suit	Allen Solly	Jan-21	12	4000	48000
13	Shirt	Monte Carlo tshirt	Apr-21	15	2500	37500

Output:

S.No	Product	Product_Description	Purchase_Date	Price
1	Shirt	Monte Carlo Men Shirt	Apr-20	3100
2	Shirt	Monte Carlo tshirt	Apr-21	2500
3	Suit	Louis Philippe wrinkle free suit	Jun-21	20000
4	Watches	Casio	Mar-19	5500
5	Watches	Rolex	Dec-20	505000

【Comments】:

What have you tried so far?

【Answer 1】:

The following should work:

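temp = SS1.merge(SS2, on=['Product', 'Product_Description'])[['S.No_x', 'Product', 'Product_Description', 'Purchase_Date', 'Price']]  # SS1, SS2: DataFrames loaded from the two sheets; exact-match inner join

# Sort, then keep the last row per description (Purchase_Date is still compared as a string here)
res = temp.sort_values(['Product_Description', 'Purchase_Date']).drop_duplicates('Product_Description', keep='last')
res = res.rename(columns={'S.No_x': 'S.No'})  # restore the SS1 serial-number column name
res = res.sort_values('S.No')
res.reset_index(drop=True, inplace=True)

print(res)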

Output:

   S.No  Product    Product_Description Purchase_Date   Price
0     4    Shirt     Monte Carlo tshirt        Feb-20    2000
1     5    Shirt  Monte Carlo Men Shirt        Jun-19    3000
2    12  Watches                  Rolex        May-20  500000
3    13  Watches                  Casio        Mar-19    5500

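If you want the ACTUAL LATEST DATE (the chronologically latest one, not merely the one that sorts last as text), insert the following line as the second line of the code above, right after temp is created. It needs import pandas as pd, and the format is %b-%y because values such as Jan-19 are month and two-digit year:

temp['Purchase_Date'] = pd.to_datetime(temp['Purchase_Date'], format='%b-%y')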
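
A small self-contained check of the date handling, assuming the Purchase_Date strings are an abbreviated month plus a two-digit year (so the matching format is %b-%y). The data is a hand-copied subset of SS2, and the names ss2, Parsed_Date and latest are illustrative only.

```python
import pandas as pd

# Hand-copied subset of SS2 (illustrative, not read from the real workbook).
ss2 = pd.DataFrame({
    "Product_Description": ["Monte Carlo Men Shirt", "Monte Carlo Men Shirt",
                            "Rolex", "Rolex"],
    "Purchase_Date": ["Jun-19", "Apr-20", "May-20", "Dec-20"],
    "Price": [3000, 3100, 500000, 505000],
})

# As plain strings, "Jun-19" sorts after "Apr-20", so the older row would win.
# Parsing with %b-%y (abbreviated month, two-digit year) gives real timestamps.
ss2["Parsed_Date"] = pd.to_datetime(ss2["Purchase_Date"], format="%b-%y")

# Sort chronologically and keep the last (newest) row per description.
latest = (
    ss2.sort_values("Parsed_Date")
       .drop_duplicates("Product_Description", keep="last")
       .sort_values("Product_Description")
       .reset_index(drop=True)
)

print(latest[["Product_Description", "Purchase_Date", "Price"]])
```

Keeping the parsed values in a separate column leaves the original Mon-YY text available for writing back to the output sheet.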

【Discussion】:

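Thanks, it works well, but not completely. I got /** raise KeyError(f"not_found not in index") **/ because I used more columns from ss2. It seems to be asking me to use loc or iloc. Am I right?

@JoaTzimas, with the same merge function, how can I use position-based parameters to avoid "raise KeyError(f"not_found not in index")"?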
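
One plausible cause of that KeyError (a guess, since the comment does not show the code): columns that exist in both frames but are not merge keys, such as S.No, come out of merge with _x/_y suffixes, so selecting them by their original name fails. The sketch below uses made-up miniature frames. Printing the merged column names before selecting is usually enough to spot the mismatch, and loc/iloc are not required.

```python
import pandas as pd

# Made-up miniature frames; only the column layout matters here.
ss1 = pd.DataFrame({"S.No": [1, 2], "Product_Description": ["Casio", "Rolex"]})
ss2 = pd.DataFrame({
    "S.No": [7, 10],
    "Product_Description": ["Casio", "Rolex"],
    "Quantity": [30, 4],
    "Net Value": [165000, 2020000],
})

merged = ss1.merge(ss2, on="Product_Description")
# S.No exists in both frames, so it is renamed to S.No_x and S.No_y.
print(list(merged.columns))

# Check the wanted labels against what the merge actually produced.
wanted = ["S.No", "Product_Description", "Quantity", "Net Value"]
missing = [c for c in wanted if c not in merged.columns]
print(missing)

# Selecting a label that is absent raises KeyError ("... not in index").
try:
    merged[wanted]
    raised = False
except KeyError:
    raised = True

# Select the suffixed name instead, then rename it back.
subset = merged[["S.No_x", "Product_Description", "Quantity", "Net Value"]]
subset = subset.rename(columns={"S.No_x": "S.No"})
print(subset)
```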
