Create new indicator columns based on values in another column

Posted 2023-02-23

【Title】: Create new indicator columns based on values in another column 【Posted】: 2022-01-20 13:02:23 【Question】:

I have some data that looks like this:

import pandas as pd

fruits = ['apple', 'pear', 'peach']

df = pd.DataFrame({'col1': ['i want an apple', 'i hate pears', 'please buy a peach and an apple', 'I want squash']})

print(df.head())

                              col1
0                  i want an apple
1                     i hate pears
2  please buy a peach and an apple
3                    I want squash

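I need a solution that creates one column for each item in fruits and fills it with 1 or 0 to indicate whether col1 contains that value. Ideally, the output would look like this: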

goal_df = pd.DataFrame({'col1': ['i want an apple', 'i hate pears', 'please buy a peach and an apple', 'I want squash'],
                        'apple': [1, 0, 1, 0],
                        'pear': [0, 1, 0, 0],
                        'peach': [0, 0, 1, 0]})

print(goal_df.head())


                              col1  apple  pear  peach
0                  i want an apple      1     0      0
1                     i hate pears      0     1      0
2  please buy a peach and an apple      1     0      1
3                    I want squash      0     0      0

I tried this, but it didn't work:

for i in fruits:
    if df['col1'].str.contains(i):
        df[i] = 1
    else:
        df[i] = 0

【Comments】:

【Answer 1】:

You can use the following for the apple column, and do the same for the other fruits:

def has_apple(st):
    if "apple" in st.lower():
        return 1
    return 0
df['apple'] = df['col1'].apply(has_apple)
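
"Doing the same for the others" does not have to mean one function per fruit. A minimal sketch of the same idea generalized with a loop, assuming the df and fruits from the question; the helper name has_word is made up for illustration:

```python
import pandas as pd

fruits = ['apple', 'pear', 'peach']
df = pd.DataFrame({'col1': ['i want an apple', 'i hate pears',
                            'please buy a peach and an apple', 'I want squash']})

def has_word(text, word):
    """Return 1 if word occurs in text (case-insensitive substring match), else 0."""
    return int(word.lower() in text.lower())

# Build one indicator column per fruit; extra keyword arguments
# given to Series.apply are forwarded to the function.
for fruit in fruits:
    df[fruit] = df['col1'].apply(has_word, word=fruit)

print(df)
```

Because this is a plain substring test, 'pears' counts as a match for 'pear', which is what the expected output in the question asks for.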

【Comments】:

【Answer 2】:

items = ['apple', 'pear', 'peach']
for it in items:
    df[it] = df['col1'].str.contains(it, case=False).astype(int)

Output:

>>> df
                              col1  apple  pear  peach
0                  i want an apple      1     0      0
1                     i hate pears      0     1      0
2  please buy a peach and an apple      1     0      1
3                    I want squash      0     0      0
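
For context on why the loop in the question fails while this one works: str.contains returns a boolean Series with one value per row, so it cannot serve as a single `if` condition, but it can be cast to 0/1 in one step. A small sketch, assuming the sample data from the question:

```python
import pandas as pd

df = pd.DataFrame({'col1': ['i want an apple', 'i hate pears',
                            'please buy a peach and an apple', 'I want squash']})

# One boolean per row, not a single True/False.
mask = df['col1'].str.contains('apple', case=False)
print(mask.tolist())

# Using the whole Series as an `if` condition raises ValueError,
# because pandas cannot reduce many booleans to one truth value.
error_message = None
try:
    if mask:
        pass
except ValueError as exc:
    error_message = str(exc)
print(error_message)

# Casting the booleans gives the 0/1 indicator column directly.
indicator = mask.astype(int)
print(indicator.tolist())
```

case=False makes the match ignore letter case, so a row containing 'Apple' would also be flagged.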

【Comments】:

This is great. I tried something similar, but without case=False and without .astype(int).

【Answer 3】:

Use str.extractall to extract the words, then pd.crosstab:

# One capture group listing every fruit: "(apple|pear|peach)"
pattern = f"({'|'.join(fruits)})"
s = df['col1'].str.extractall(pattern)
df[fruits] = (pd.crosstab(s.index.get_level_values(0), s[0].values)
                .reindex(index=df.index, columns=fruits, fill_value=0)
             )

Output:

                              col1  apple  pear  peach
0                  i want an apple      1     0      0
1                     i hate pears      0     1      0
2  please buy a peach and an apple      1     0      1
3                    I want squash      0     0      0
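
To make the two steps easier to follow, here is a sketch that keeps the intermediate results in separate variables. It assumes the df and fruits from the question; the final clip call is an addition of mine, not part of the answer, for the case where the same fruit appears more than once in a row.

```python
import pandas as pd

fruits = ['apple', 'pear', 'peach']
df = pd.DataFrame({'col1': ['i want an apple', 'i hate pears',
                            'please buy a peach and an apple', 'I want squash']})

pattern = f"({'|'.join(fruits)})"   # "(apple|pear|peach)"

# extractall returns one row per match, indexed by
# (original row label, match number), with column 0 holding the captured word.
s = df['col1'].str.extractall(pattern)
print(s)

# crosstab counts how often each word was found in each original row.
# Rows without any match (row 3 here) are missing at this point.
counts = pd.crosstab(s.index.get_level_values(0), s[0].values)

# reindex restores the missing rows and the requested column order.
counts = counts.reindex(index=df.index, columns=fruits, fill_value=0)

# crosstab yields counts, so cap at 1 to get a pure 0/1 indicator.
indicators = counts.clip(upper=1)
print(indicators)
```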

【Comments】:

【Answer 4】:

Try using np.where from the numpy library:

import numpy as np

fruit = ['apple', 'pear', 'peach']
for i in fruit:
    df[i] = np.where(df.col1.str.contains(i), 1, 0)

【Comments】:

【Answer 5】:

Try str.extractall, pd.get_dummies, and join:

matches = pd.get_dummies(df["col1"].str.extractall(f"({'|'.join(fruits)})")[0].droplevel(1, 0))
output = df.join(matches.groupby(level=0).sum()).fillna(0)

>>> output
                              col1  apple  peach  pear
0                  i want an apple    1.0    0.0   0.0
1                     i hate pears    0.0    0.0   1.0
2  please buy a peach and an apple    1.0    1.0   0.0
3                    I want squash    0.0    0.0   0.0
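
The columns come out as floats because the join leaves NaN in rows without a match before fillna(0) runs. If integer 0/1 columns are preferred, one option is to cast afterwards. A sketch under that assumption, using the data from the question; the variable name per_row is only for illustration:

```python
import pandas as pd

fruits = ['apple', 'pear', 'peach']
df = pd.DataFrame({'col1': ['i want an apple', 'i hate pears',
                            'please buy a peach and an apple', 'I want squash']})

# One dummy row per match, indexed by the original row label.
found = df['col1'].str.extractall(f"({'|'.join(fruits)})")[0].droplevel(1)
matches = pd.get_dummies(found)

# Collapse several matches in the same row into one row of counts.
per_row = matches.groupby(level=0).sum()

# The join introduces NaN for row 3, which turns the columns into floats.
output = df.join(per_row).fillna(0)

# Cast only the indicator columns back to integers.
output = output.astype({col: int for col in per_row.columns})
print(output)
```

The indicator columns still appear in alphabetical order (apple, peach, pear), since that is how get_dummies orders them.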

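【Comments】:

I was about to post a similar solution! :) Haha, great minds ;) Note that you should be able to simplify .str.extractall(f"({'|'.join(fruits)})") to .str.findall('|'.join(items))

.str.findall returns a list of matched values for each row, which cannot be passed directly to get_dummies

【Answer 6】: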

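I thought of another, completely different one-liner:

# str.get_dummies returns its columns in alphabetical order, so reindex to the order of items before assigning
df[items] = df['col1'].str.findall('|'.join(items)).str.join('|').str.get_dummies('|').reindex(columns=items, fill_value=0)

Output:

>>> df
                              col1  apple  pear  peach
0                  i want an apple      1     0      0
1                     i hate pears      0     1      0
2  please buy a peach and an apple      1     0      1
3                    I want squash      0     0      0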

【Comments】:
