pandas - create true/false column if column header is substring of another column

Posted 2023-03-29

Original question: pandas - create true/false column if column header is substring of another column (asked 2020-07-13 15:18:23)

Question:

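I am following this post to create several True/False columns based on whether a substring is present in another column.

Before using the code from that post, I looked at a field called LANGUAGES, which contains values such as "ENG, SPA, CZE" or "ENG, SPA". Unfortunately the data is a comma-separated string rather than a list, but that is not a problem: in one line I can get a list of the 25 unique values.

Once I have the list of unique values, I want to create a new column for each one, e.g. df[ENG], df[SPA], and so on. Each column should be True/False depending on whether its header is a substring of the original LANGUAGES column.

Following the post, I used df.apply(lambda x: language in df.LANGUAGES, axis = 1). However, when I check the column values (the value counts in the last for loop), every value is False.

How can I create True/False columns based on whether the column's header is a substring of another column?

My code: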

import json
import pandas as pd
import requests

url  = r"https://data.hud.gov/Housing_Counselor/search?AgencyName=&City=&State=&RowLimit=&Services=&Languages="

response = requests.get(url)

if response.status_code == 200:
    res = response.json()
    df = pd.DataFrame(res)
    df.columns = [str(h).upper() for h in list(df)]
    #
    # the below line is confusing but it creates a sorted list of all unique languages
    #
    languages = [str(s) for s in sorted(list(set((",".join(list(df["LANGUAGES"].unique()))).split(","))))]
    for language in languages:
        print(language)
        df[language] = df.apply(lambda x: language in df.LANGUAGES, axis = 1)
    for language in languages:
        print(df[language].value_counts())
        print("\n")
else:
    print("\nConnection was unsuccessful: {0}".format(response.status_code))
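
The unique-language line in the code above could perhaps be written more readably with pandas string methods. This is only a sketch on a toy frame that mirrors the sample data (the real API response is not reproduced here), and unlike the original line it also strips the whitespace around each code, which is an assumption about what is wanted:

```python
import pandas as pd

# Toy stand-in for the API response; only the LANGUAGES column matters here.
df = pd.DataFrame({"LANGUAGES": ["ENG, OTH, RUS", "ENG", "ENG, CZE, SPA"]})

# Split each string on commas, put one code per row, trim spaces,
# then collect the distinct codes in sorted order.
languages = sorted(
    df["LANGUAGES"].str.split(",").explode().str.strip().unique()
)
print(languages)
```

Without the strip step, " SPA" and "SPA" would count as two different languages.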

Edit: raw input data and expected output were requested. Here is what the column looks like:

+-------+-----------------+
| Index |    LANGUAGES    |
+-------+-----------------+
|     0 | 'ENG, OTH, RUS' |
|     1 | 'ENG'           |
|     2 | 'ENG, CZE, SPA' |
+-------+-----------------+

Here is the expected output:

+-------+-----------------+------+-------+-------+-------+-------+
| Index |    LANGUAGES    | ENG  |  CZE  |  OTH  |  RUS  |  SPA  |
+-------+-----------------+------+-------+-------+-------+-------+
|     0 | 'ENG, OTH, RUS' | TRUE | FALSE | TRUE  | TRUE  | FALSE |
|     1 | 'ENG'           | TRUE | FALSE | FALSE | FALSE | FALSE |
|     2 | 'ENG, CZE, SPA' | TRUE | TRUE  | FALSE | FALSE | TRUE  |
+-------+-----------------+------+-------+-------+-------+-------+

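Comments:

Could you post your raw input data together with your expected output in a code block? @Datanovice Done.

Answer 1:

Two steps.

First, we explode your list and create a cross-tabulation to join back to your original df on the index.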

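# Strip the quote characters, split on commas, and put one language per row.
# str.strip() drops the space left after each comma.
s = df['LANGUAGES'].str.replace("'", '').str.split(',').explode().str.strip().to_frame()

# Unique languages in order of first appearance.
cols = s['LANGUAGES'].drop_duplicates(keep='first').tolist()

# Count languages per original row, then map the 1/0 counts to True/False.
df2 = pd.concat([df, pd.crosstab(s.index, s["LANGUAGES"])[cols]], axis=1).replace(
    {1: True, 0: False}
)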
print(df2)
         LANGUAGES   ENG    OTH    RUS    CZE    SPA
0  'ENG, OTH, RUS'  True   True   True  False  False
1            'ENG'  True  False  False  False  False
2  'ENG, CZE, SPA'  True  False  False   True   True
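
As a possible alternative to the explode-and-crosstab steps above, Series.str.get_dummies can build the indicator columns in one call. This is a sketch rather than the answer's own method, and it assumes the values carry no literal quote characters and that spaces inside a value can be discarded:

```python
import pandas as pd

# Toy frame mirroring the sample input (assumed to be without literal quotes).
df = pd.DataFrame({"LANGUAGES": ["ENG, OTH, RUS", "ENG", "ENG, CZE, SPA"]})

# Remove spaces so the separator is a bare comma, build 0/1 indicator
# columns (one per language code), then convert them to booleans.
flags = df["LANGUAGES"].str.replace(" ", "").str.get_dummies(sep=",").astype(bool)

# Attach the indicator columns to the original frame by index.
df2 = pd.concat([df, flags], axis=1)
print(df2)
```

The columns come out in sorted order here, not in order of first appearance as in the crosstab version.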

Answer 2:

Following this post, I replaced this line of code:

df[language] = df.apply(lambda x: language in df.LANGUAGES, axis = 1)

with the following two lines:

    criteria = lambda row : language in row["LANGUAGES"]
    df[language] = df.apply(criteria, axis =1)

And it works.

import json
import pandas as pd
import requests

url  = r"https://data.hud.gov/Housing_Counselor/search?AgencyName=&City=&State=&RowLimit=&Services=&Languages="

response = requests.get(url)

if response.status_code == 200:
    res = response.json()
    df = pd.DataFrame(res)
    df.columns = [str(h).upper() for h in list(df)]
    #
    # the below line is confusing but it creates a sorted list of all unique languages
    #
    languages = [str(s) for s in sorted(list(set((",".join(list(df["LANGUAGES"].unique()))).split(","))))]
    for language in languages:
        criteria = lambda row : language in row["LANGUAGES"]
        df[language] = df.apply(criteria, axis =1)
    for language in languages:
        print(df[language].value_counts())
        print("\n")
else:
    print("\nConnection was unsuccessful: {0}".format(response.status_code))
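
A small sketch of why the change matters, using a toy frame in place of the API response (an assumption for illustration). The `in` operator applied to a whole Series checks the index labels rather than the values, so the original lambda returned False for every row:

```python
import pandas as pd

# Toy frame mirroring the sample input.
df = pd.DataFrame({"LANGUAGES": ["ENG, OTH, RUS", "ENG", "ENG, CZE, SPA"]})

# `in` on a Series tests the index labels (0, 1, 2), not the values.
in_series = "ENG" in df.LANGUAGES   # no index label equals "ENG"
label_hit = 1 in df.LANGUAGES       # 1 is an index label

# Indexing the row passed to the lambda yields a plain string,
# so `in` becomes an ordinary substring test.
eng = df.apply(lambda row: "ENG" in row["LANGUAGES"], axis=1)
cze = df.apply(lambda row: "CZE" in row["LANGUAGES"], axis=1)

print(in_series, label_hit)
print(eng.tolist())
print(cze.tolist())
```

The original lambda also ignored its argument x entirely, so every row received the same result.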

This replacement for the loop also works:

for language in languages:
    df[language] = df.LANGUAGES.apply(lambda x: 'True' if language in x else 'False')
    print("{}: {}".format(language, df[df[language] == 'True'].shape[0]))

Comments:

It is best to avoid for loops in pandas, since they work against the API; use vectorized methods that leverage the underlying C/Cython code.
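
One way to read that suggestion is to replace the row-wise apply with Series.str.contains, which tests a whole column at once. This is a sketch on a toy frame with a hard-coded language list (both are assumptions for illustration); it still iterates over the handful of language codes, but not over rows:

```python
import pandas as pd

# Toy frame and language list standing in for the real data.
df = pd.DataFrame({"LANGUAGES": ["ENG, OTH, RUS", "ENG", "ENG, CZE, SPA"]})
languages = ["CZE", "ENG", "OTH", "RUS", "SPA"]

# regex=False asks for a literal substring match on every row of the column.
flags = {
    lang: df["LANGUAGES"].str.contains(lang, regex=False) for lang in languages
}

# Add all boolean columns in one step.
df = df.assign(**flags)
print(df)
```

Because this is a substring test, a code that happens to be contained in another code would also match; splitting on the separator first avoids that.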
