How to extract a keyword (string) from a column in a pandas DataFrame in Python

Posted 2023-03-11


【Title】: How to extract a keyword (string) from a column in a pandas DataFrame in Python 【Posted】: 2019-06-23 17:33:53 【Question】:

I have a DataFrame df that looks like this:

         id                        Type                        agent_id  created_at
0       44525   Stunning 6 bedroom villa in New Delhi               184  2018-03-09
1       44859   Villa for sale in Amritsar                          182  2017-02-19
2       45465   House in Faridabad                                  154  2017-04-17
3       50685   5 Hectre land near New Delhi                        113  2017-09-01
4      130728   Duplex in Mumbai                                    157  2017-02-07
5      130856   Large plot with fantastic views in Mumbai           137  2018-01-16
6      130857   Modern Design Penthouse in Bangalore                199  2017-03-24

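I am trying to clean this tabular data by extracting keywords from the Type column and building a new DataFrame with new columns from them. The keyword lists are: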

Apartment  = ['apartment', 'penthouse', 'duplex']
House      = ['house', 'villa', 'country estate']
Plot       = ['plot', 'land']
Location   = ['New Delhi','Mumbai','Bangalore','Amritsar']

So the desired DataFrame should look like this:

         id      Type        Location    agent_id  created_at
0       44525   House       New Delhi         184  2018-03-09
1       44859   House        Amritsar         182  2017-02-19
2       45465   House       Faridabad         154  2017-04-17
3       50685   Plot        New Delhi         113  2017-09-01
4      130728   Apartment      Mumbai         157  2017-02-07
5      130856   Plot           Mumbai         137  2018-01-16
6      130857   Apartment   Bangalore         199  2017-03-24

This is what I have tried so far:

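import pandas as pd
df = pd.read_csv('test_data.csv')

# I can extract these keywords one by one using for loops, but how
# can I do this in pandas with the fewest possible lines of code?

# Series.items() replaces iteritems(), which was removed in pandas 2.0
for index, value in df['Type'].items():
    for keyword in Apartment:
        # lower-case the text so 'Duplex' matches the keyword 'duplex'
        if keyword in value.lower():
            print(keyword)

df_new = pd.DataFrame(df['id'])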

Can anyone tell me how to solve this?

【Question comments】:

【Answer 1】:

First create the Location column with str.extract, joining the keywords with | (regex OR):

pat = '|'.join(r"\b{}\b".format(x) for x in Location)
df['Location'] = df['Type'].str.extract('('+ pat + ')', expand=False)
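
The extraction step can be sketched on its own with a few rows copied from the question. Two things here are assumptions rather than part of the answer: the sample frame is built inline instead of being read from test_data.csv, and re.escape is added so that a keyword containing regex metacharacters would still be matched literally.

```python
import re
import pandas as pd

Location = ['New Delhi', 'Mumbai', 'Bangalore', 'Amritsar']

# Sample rows copied from the question's data
df = pd.DataFrame({
    'id': [44525, 45465, 130728],
    'Type': ['Stunning 6 bedroom villa in New Delhi',
             'House in Faridabad',
             'Duplex in Mumbai'],
})

# \b anchors each keyword at word boundaries; | is regex OR.
# re.escape is an addition, not in the original answer.
pat = '|'.join(r"\b{}\b".format(re.escape(x)) for x in Location)

# A single capture group with expand=False returns a Series
location = df['Type'].str.extract('(' + pat + ')', expand=False)
print(location.tolist())
```

The second row comes back as a missing value because 'Faridabad' is not in the Location list.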

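Then build a dictionary from the other lists, swap the keys with the values, and in a loop set the values through a boolean mask created by str.contains with the parameter case=False: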

d = {'Apartment' : Apartment,
     'House' : House,
     'Plot' : Plot}

# invert the mapping: keyword -> category name
d1 = {k: oldk for oldk, oldv in d.items() for k in oldv}

for k, v in d1.items():
    df.loc[df['Type'].str.contains(k, case=False), 'Type'] = v

print (df)
       id       Type  agent_id  created_at   Location
0   44525      House       184  2018-03-09  New Delhi
1   44859      House       182  2017-02-19   Amritsar
2   45465      House       154  2017-04-17        NaN
3   50685       Plot       113  2017-09-01  New Delhi
4  130728  Apartment       157  2017-02-07     Mumbai
5  130856       Plot       137  2018-01-16     Mumbai
6  130857  Apartment       199  2017-03-24  Bangalore
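
As an alternative to the loop, the Type mapping can be sketched without a Python-level loop: extract the first matching keyword case-insensitively, then map it through d1. This is a swapped-in technique, not the answer's method. It also behaves differently in two ways: a title with keywords from more than one category gets the leftmost match, and a title with no keyword becomes NaN instead of keeping its original text. The last sample row is hypothetical.

```python
import re
import pandas as pd

Apartment = ['apartment', 'penthouse', 'duplex']
House = ['house', 'villa', 'country estate']
Plot = ['plot', 'land']

d = {'Apartment': Apartment, 'House': House, 'Plot': Plot}
d1 = {k: oldk for oldk, oldv in d.items() for k in oldv}

df = pd.DataFrame({'Type': [
    'Stunning 6 bedroom villa in New Delhi',
    '5 Hectre land near New Delhi',
    'Modern Design Penthouse in Bangalore',
    'Office space in Pune',  # hypothetical row containing no keyword
]})

# One alternation over every keyword, wrapped in a single capture group
pat = '(' + '|'.join(re.escape(k) for k in d1) + ')'

# Extract the matched keyword (any case), normalise it, look up its category
keyword = df['Type'].str.extract(pat, flags=re.IGNORECASE, expand=False)
category = keyword.str.lower().map(d1)
print(category.tolist())
```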

【Comments】:

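Thanks for the help. What happens if the list has no keyword for a row's location? Will it put 'NaN' there? @jezrael

@astroluv - Yes, exactly: if the value does not exist, a missing value is created. If necessary, the last step should be df['Location'] = df['Location'].fillna('not exist location') to replace NaN with a string.

【Answer 2】: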

    106     if isna(key).any():
--> 107         raise ValueError('cannot index with vector containing '
    108                          'NA / NaN values')
    109     return False

ValueError: cannot index with vector containing NA / NaN values

I get the error above.
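
The post does not show the code that raised this error, so the following is only a guess at one common cause. If the Type column itself contains missing values, str.contains can return NaN for those rows, and such a mask cannot be used for boolean indexing in df.loc. Passing na=False, an existing parameter of str.contains, treats missing rows as non-matches. The sample frame below is hypothetical.

```python
import numpy as np
import pandas as pd

# Hypothetical frame with a missing value in the Type column
df = pd.DataFrame({'Type': ['Villa for sale in Amritsar',
                            np.nan,
                            'Duplex in Mumbai']})

# na=False makes rows with a missing Type count as "no match",
# so the mask is purely boolean and safe to pass to df.loc
safe_mask = df['Type'].str.contains('villa', case=False, na=False)
df.loc[safe_mask, 'Type'] = 'House'
print(df['Type'].tolist())
```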

【Comments】:

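Hi Avani! If you have a question about the accepted answer, you can ask for more information in that answer's comment section, or you can post a new question directly on Stack Overflow.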
