How to find unique combinations in a one-hot encoded dataframe?

【Posted】: 2019-02-19 20:01:31

【Question】:

I have a dataframe called test that looks like this:

+-------+---------+---------+---------+------------+
|       | Term 1  | Term 2  | Term 3  | Final Exam |
+-------+---------+---------+---------+------------+
| 1288  |      0  |      0  |      1  |          1 |
| 1290  |      1  |      1  |      1  |          1 |
| 1294  |      0  |      0  |      1  |          1 |
| 1296  |      1  |      1  |      1  |          1 |
| 1297  |      1  |      1  |      1  |          1 |
| 1304  |      0  |      1  |      1  |          1 |
| 1308  |      0  |      0  |      1  |          1 |
| 1324  |      1  |      1  |      1  |          1 |
| 1325  |      1  |      1  |      1  |          1 |
| 1332  |      1  |      1  |      1  |          1 |
+-------+---------+---------+---------+------------+

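I want a summary table listing every unique combination of columns whose value is 1, together with the number of rows in which exactly that combination occurs: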

+------------------------------------+-----------+
|            Combination             | Frequency |
+------------------------------------+-----------+
| Term 3, Final Exam                 |         3 |
| Term 2, Term 3, Final Exam         |         1 |
| Term 1, Term 2, Term 3, Final Exam |         6 |
+------------------------------------+-----------+

I tried mlxtend's apriori, but it returns every set of columns that occur together, including subsets of larger combinations:

from mlxtend.frequent_patterns import apriori

# apriori works on one-hot columns; cast the 0/1 values to bool
results = apriori(test.astype(bool), min_support=0.00001, use_colnames=True)
results['length'] = results['itemsets'].apply(len)
numberofcases = test.shape[0]
# support is the fraction of rows that contain the itemset
results['Frequency'] = results['support'] * numberofcases
# join the column names of each frozenset into a readable string
results['Terms'] = results['itemsets'].apply(', '.join)
results[results['length'] > 1][['Terms', 'Frequency']]

Result set:

+-----+-------------------------------------+-----------+
|     |               Terms                 | Frequency |
+-----+-------------------------------------+-----------+
|  4  | Term 2, Term 1                      |       6.0 |
|  5  | Term 3, Term 1                      |       6.0 |
|  6  | Final Exam, Term 1                  |       6.0 |
|  7  | Term 2, Term 3                      |       7.0 |
|  8  | Term 2, Final Exam                  |       7.0 |
|  9  | Term 3, Final Exam                  |      10.0 |
| 10  | Term 2, Term 3, Term 1              |       6.0 |
| 11  | Term 2, Final Exam, Term 1          |       6.0 |
| 12  | Term 3, Final Exam, Term 1          |       6.0 |
| 13  | Term 2, Term 3, Final Exam          |       7.0 |
| 14  | Term 2, Term 3, Final Exam, Term 1  |       6.0 |
+-----+-------------------------------------+-----------+
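The frequencies in this result are support counts: a row is counted for an itemset whenever all of that itemset's columns are 1, whatever the remaining columns hold. The sketch below uses plain pandas (no mlxtend) on data rebuilt from the table in the question to contrast that count with an exact match for one pair of columns. The variable names are illustrative only.

```python
import pandas as pd

# Data rebuilt from the table in the question.
test = pd.DataFrame(
    {
        "Term 1":     [0, 1, 0, 1, 1, 0, 0, 1, 1, 1],
        "Term 2":     [0, 1, 0, 1, 1, 1, 0, 1, 1, 1],
        "Term 3":     [1, 1, 1, 1, 1, 1, 1, 1, 1, 1],
        "Final Exam": [1, 1, 1, 1, 1, 1, 1, 1, 1, 1],
    },
    index=[1288, 1290, 1294, 1296, 1297, 1304, 1308, 1324, 1325, 1332],
)

pair = ["Term 3", "Final Exam"]
others = [c for c in test.columns if c not in pair]

# Support-style count: both columns are 1, the other columns are ignored.
contains = test[pair].eq(1).all(axis=1)
# Exact-combination count: additionally require every other column to be 0.
exact = contains & test[others].eq(0).all(axis=1)

contains_count = int(contains.sum())
exact_count = int(exact.sum())
print(contains_count, exact_count)
```

The first number corresponds to the apriori row for Term 3, Final Exam, and the second to the frequency wanted in the summary table.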

Is there an apriori parameter that produces the desired result, or some other way to do this?

【Answer 1】:

Use dot and value_counts (df here is the one-hot encoded dataframe):

df.dot(df.columns+',').str[:-1].value_counts()
Out[419]: 
Term1,Term2,Term3,FinalExam    6
Term3,FinalExam                3
Term2,Term3,FinalExam          1
dtype: int64
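As a self-contained sketch of the same idea on the question's data: the frame below is rebuilt from the table in the question, and the ", " separator and the `Combination` / `Frequency` column names are chosen here to match the desired summary table rather than taken from the answer.

```python
import pandas as pd

# Data rebuilt from the table in the question.
test = pd.DataFrame(
    {
        "Term 1":     [0, 1, 0, 1, 1, 0, 0, 1, 1, 1],
        "Term 2":     [0, 1, 0, 1, 1, 1, 0, 1, 1, 1],
        "Term 3":     [1, 1, 1, 1, 1, 1, 1, 1, 1, 1],
        "Final Exam": [1, 1, 1, 1, 1, 1, 1, 1, 1, 1],
    },
    index=[1288, 1290, 1294, 1296, 1297, 1304, 1308, 1324, 1325, 1332],
)

# For each row, dot() sums value * (column name + ", ").
# 1 * "Term 3, " keeps the label and 0 * "Term 3, " is "", so the sum
# concatenates the names of the columns that are 1, in column order.
labels = test.dot(test.columns + ", ").str[:-2]  # drop the trailing ", "

# Count how often each exact combination occurs.
summary = (
    labels.value_counts()
    .rename_axis("Combination")
    .reset_index(name="Frequency")
)
print(summary)
```

Because each row is reduced to a single label before counting, a row contributes to exactly one combination, which is what separates this from the apriori support counts.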
