How to count the number of rows containing both a value in a set of columns and another value in another column in a Pandas dataframe?

Posted 2023-04-18


【Title】: How to count the number of rows containing both a value in a set of columns and another value in another column in a Pandas dataframe? 【Asked】: 2020-11-24 13:54:16 【Question】:

# import packages, set nan
import pandas as pd
import numpy as np
nan = np.nan

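Problem

I have a dataframe with a certain number of observations as columns and measurements as rows. The outcome of an observation is one of A, B, C, D, ... The dataframe also has a category column, which gives the category of each measurement; the categories are a, b, c, d, ... If a column contains nan in a given row, that observation was not carried out during that measurement (so nan is not an observation, it is the lack of one). An MRE: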

data = {'observation0': ['A','A','A','A','B'],'observation1': ['B','B','B','C',nan], 'category': ['a', 'b', 'c','a','b']}
df = pd.DataFrame.from_dict(data)

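df looks like this:

  observation0 observation1 category
0            A            B        a
1            A            B        b
2            A            B        c
3            A            C        a
4            B          NaN        b

I would like to count how many times each observation (i.e. A, B, C, D, ...) was made using each category of measurement (i.e. a, b, c, d, ...).

I would like to get: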

obs_A_in_cat_a    2
obs_A_in_cat_b    1
obs_A_in_cat_c    1
obs_B_in_cat_a    1
obs_B_in_cat_b    2
obs_B_in_cat_c    1
obs_C_in_cat_a    1
obs_C_in_cat_b    0
obs_C_in_cat_c    0

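Observation A appears in the rows with index 0 and 3 (see df above) where the measurement category is a, so obs_A_in_cat_a is 2. Observation A appears only once (row index 1) in a measurement of category b, so obs_A_in_cat_b is 1, and so on.

My solution

First I collect the observations, taking care not to include nans: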

observations = pd.unique(pd.concat([df[col] for col in df.columns if 'observation' in col]).dropna())

And the distinct categories they belong to:

categories = pd.unique(df['category'])

Then I iterate over the observations and, for each one, over the categories:

for observation in observations:
    for category in categories:
        df['obs_'+observation+'_in_cat_'+category]=\
        df.apply(lambda row: int(observation in [row[col]
                                                 for col in df.columns
                                                 if 'observation' in col]
                                 and row['category'] == category),axis=1)

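The lambda function checks whether observation is present in each row and whether the measurement belongs to the category currently considered in the iteration. New columns are created with the header obs_OBSERVATION_in_cat_CATEGORY, where OBSERVATION is A, B, C, D, ... and CATEGORY is a, b, c, d, ... If observationX was made during a measurement of categoryY, then obs_OBSERVATIONX_in_cat_CATEGORYY holds 1 in the row corresponding to that measurement, and 0 otherwise.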
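The per-row check described above can also be written without apply. The following is a minimal sketch, assuming the MRE data from the question; the helper name indicator and its parameters are hypothetical and not part of the original post.

```python
import numpy as np
import pandas as pd

# MRE from the question
data = {'observation0': ['A', 'A', 'A', 'A', 'B'],
        'observation1': ['B', 'B', 'B', 'C', np.nan],
        'category': ['a', 'b', 'c', 'a', 'b']}
df = pd.DataFrame(data)

obs_cols = [col for col in df.columns if 'observation' in col]


def indicator(frame, observation_columns, observation, category):
    """Hypothetical helper: 1 where a row holds `observation` in any of the
    observation columns and its category equals `category`, else 0."""
    # NaN never compares equal to a label, so missing observations yield False
    has_obs = frame[observation_columns].eq(observation).any(axis=1)
    in_cat = frame['category'].eq(category)
    return (has_obs & in_cat).astype(int)


col_A_a = indicator(df, obs_cols, 'A', 'a')
print(col_A_a.tolist())  # one 0/1 flag per measurement row
print(col_A_a.sum())     # number of rows with observation A in category a
```

This builds the same 0/1 column as the lambda, but with whole-column comparisons instead of a Python-level call per row.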

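The resulting df (partially) looks like this:

The values of the newly created columns are then added up with sum(), selecting the columns with a conditional list comprehension: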

df[[col for col in df.columns if '_in_cat_' in col]].sum()

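This gives the desired output shown above.

Question

This approach seems to work, but it is too slow to be used comfortably on real-life data. How can I make it faster? I am looking for something like: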

how_many_times_each_observation_was_made_using_each_category_of_measurement(
df,
list_of_observation_columns,
category_column)


【Answer 1】:

A solution that counts the values with DataFrame.melt and GroupBy.size, then uses Series.reindex with a MultiIndex to add 0 for the missing combinations:

s = df.melt('category').groupby(['value','category']).size()
s = s.reindex(pd.MultiIndex.from_product(s.index.levels), fill_value=0)
print (s)
value  category
A      a           2
       b           1
       c           1
B      a           1
       b           2
       c           1
C      a           1
       b           0
       c           0
dtype: int64
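To see why the two lines above work, it may help to inspect the intermediate long-format frame. This is only an illustrative sketch on the MRE data; the names long_df and counts are introduced here for illustration.

```python
import numpy as np
import pandas as pd

# MRE from the question
df = pd.DataFrame({'observation0': ['A', 'A', 'A', 'A', 'B'],
                   'observation1': ['B', 'B', 'B', 'C', np.nan],
                   'category': ['a', 'b', 'c', 'a', 'b']})

# melt keeps 'category' as an identifier and stacks every other column,
# giving one row per (measurement, observation column) pair
long_df = df.melt('category')
print(long_df.columns.tolist())  # identifier column plus 'variable' and 'value'
print(len(long_df))              # 5 measurements x 2 observation columns

# groupby drops NaN keys by default, so the missing observation is not counted
counts = long_df.groupby(['value', 'category']).size()
print(counts)
```

Combinations that never occur, such as C in category b, are simply absent from counts at this point, which is why the reindex step with fill_value=0 is needed.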

Finally, the index can be flattened with f-strings:

s.index = s.index.map(lambda x: f'obs_{x[0]}_in_cat_{x[1]}')
print (s)
obs_A_in_cat_a    2
obs_A_in_cat_b    1
obs_A_in_cat_c    1
obs_B_in_cat_a    1
obs_B_in_cat_b    2
obs_B_in_cat_c    1
obs_C_in_cat_a    1
obs_C_in_cat_b    0
obs_C_in_cat_c    0
dtype: int64
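The steps of this answer can be wrapped into a function with the kind of signature the question asks for. This is a sketch under assumptions, not a definitive implementation: the function name count_observations_per_category is hypothetical, and it assumes that no existing column is already called 'observation' or 'variable'.

```python
import numpy as np
import pandas as pd


def count_observations_per_category(df, observation_columns, category_column):
    """Hypothetical wrapper: count how often each observation value occurs
    in each category, with 0 for combinations that never occur."""
    # Long format restricted to the listed observation columns
    long_df = df.melt(id_vars=category_column,
                      value_vars=observation_columns,
                      value_name='observation')
    # NaN observations are dropped by groupby, so they are not counted
    counts = long_df.groupby(['observation', category_column]).size()
    # Add the missing (observation, category) combinations as 0
    full_index = pd.MultiIndex.from_product(counts.index.levels)
    counts = counts.reindex(full_index, fill_value=0)
    # Flatten the two index levels into a single label
    counts.index = [f'obs_{obs}_in_cat_{cat}' for obs, cat in counts.index]
    return counts


df = pd.DataFrame({'observation0': ['A', 'A', 'A', 'A', 'B'],
                   'observation1': ['B', 'B', 'B', 'C', np.nan],
                   'category': ['a', 'b', 'c', 'a', 'b']})
result = count_observations_per_category(
    df, ['observation0', 'observation1'], 'category')
print(result)
```

Passing value_vars explicitly means other columns in a real dataframe are not mistaken for observations.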


【Answer 2】:

You can combine melt with crosstab to get your output:

s = df.melt("category")
s = pd.crosstab(s.value, s.category).stack()
s.index = [f"obs_{first}_in_cat_{last}" for first, last in s.index]

s

obs_A_in_cat_a    2
obs_A_in_cat_b    1
obs_A_in_cat_c    1
obs_B_in_cat_a    1
obs_B_in_cat_b    2
obs_B_in_cat_c    1
obs_C_in_cat_a    1
obs_C_in_cat_b    0
obs_C_in_cat_c    0
dtype: int64
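One caveat of the melt-based approaches: if the same observation value appears in more than one observation column of a single row, it is counted once per column, whereas the loop in the question counts each row at most once. The MRE has no such row, so the results agree there. The sketch below uses a small made-up dataframe (not from the original post) to show one possible way of counting each row only once per observation.

```python
import numpy as np
import pandas as pd

# Made-up data: row 0 holds 'A' in both observation columns
df = pd.DataFrame({'observation0': ['A', 'A', 'B'],
                   'observation1': ['A', 'B', np.nan],
                   'category': ['a', 'a', 'b']})

# Keep the original row label as a column so repeats within a row can be found
long_df = df.reset_index().melt(id_vars=['index', 'category'])
# Drop missing observations, then repeats of the same value within one row
long_df = (long_df.dropna(subset=['value'])
                  .drop_duplicates(subset=['index', 'value']))

counts = pd.crosstab(long_df['value'], long_df['category'])
print(counts)
```

Without the drop_duplicates step, A in category a would be counted three times here instead of twice.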


【Answer 3】:

You can do it as follows:

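# Count each (observation, category) pair separately for every observation column
parts = []
for colName in ['observation0','observation1']:
    df1 = df.groupby([colName,'category'])['category'].count().to_frame()
    df1.columns = ['count']
    df1 = df1.reset_index()
    df1['label'] = 'obs_'+df1[colName]+'_in_cat_'+df1['category']
    parts.append(df1.loc[:,['label','count']])

# Stack the per-column counts and add up labels found in more than one column;
# combinations that never occur are absent rather than 0
dfT = pd.concat(parts,axis=0).groupby('label',as_index=False)['count'].sum()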
