How to find unique combinations in one hot encoded dataframe?
Posted: 2019-02-19 20:01:31
Question: I have a dataframe named test that looks like this:
+-------+---------+---------+---------+------------+
| | Term 1 | Term 2 | Term 3 | Final Exam |
+-------+---------+---------+---------+------------+
| 1288 | 0 | 0 | 1 | 1 |
| 1290 | 1 | 1 | 1 | 1 |
| 1294 | 0 | 0 | 1 | 1 |
| 1296 | 1 | 1 | 1 | 1 |
| 1297 | 1 | 1 | 1 | 1 |
| 1304 | 0 | 1 | 1 | 1 |
| 1308 | 0 | 0 | 1 | 1 |
| 1324 | 1 | 1 | 1 | 1 |
| 1325 | 1 | 1 | 1 | 1 |
| 1332 | 1 | 1 | 1 | 1 |
+-------+---------+---------+---------+------------+
I want a summary table listing every unique combination of columns whose value is 1, along with the number of rows in which that exact combination occurs:
+------------------------------------+-----------+
| Combination | Frequency |
+------------------------------------+-----------+
| Term 3, Final Exam | 3 |
| Term 2, Term 3, Final Exam | 1 |
| Term 1, Term 2, Term 3, Final Exam | 6 |
+------------------------------------+-----------+
I tried mlxtend's apriori, but it counts every subset of columns that occur together, not only the exact combination found in each row:
from mlxtend.frequent_patterns import apriori
# Mine all frequent itemsets; the tiny min_support keeps every itemset
results = apriori(test, min_support=0.00001, use_colnames=True)
results['length'] = results['itemsets'].apply(len)
numberofcases = test.shape[0]
# support is a fraction of the rows, so scale it back to a row count
results['Frequency'] = results['support'] * numberofcases
# Render each frozenset of column names as a comma-separated string
results['Terms'] = results['itemsets'].apply(', '.join)
results[results['length'] > 1][['Terms', 'Frequency']]
Result set:
+-----+-------------------------------------+-----------+
| | Terms | Frequency |
+-----+-------------------------------------+-----------+
| 4 | Term 2, Term 1 | 6.0 |
| 5 | Term 3, Term 1 | 6.0 |
| 6 | Final Exam, Term 1 | 6.0 |
| 7 | Term 2, Term 3 | 7.0 |
| 8 | Term 2, Final Exam | 7.0 |
| 9 | Term 3, Final Exam | 10.0 |
| 10 | Term 2, Term 3, Term 1 | 6.0 |
| 11 | Term 2, Final Exam, Term 1 | 6.0 |
| 12 | Term 3, Final Exam, Term 1 | 6.0 |
| 13 | Term 2, Term 3, Final Exam | 7.0 |
| 14 | Term 2, Term 3, Final Exam, Term 1 | 6.0 |
+-----+-------------------------------------+-----------+
Is there an apriori parameter that produces the desired result, or some other way to do this?
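The gap between the two tables may be easier to see in a small standard-library sketch. It assumes the ten rows shown in the question; the helper names support_count and exact_count are hypothetical and are not part of mlxtend. The sketch contrasts "rows that contain an itemset" (which is what a support value measures) with "rows whose active columns are exactly that itemset" (which is what the desired table counts).

```python
# Rows from the question's table, in the column order below.
COLS = ["Term 1", "Term 2", "Term 3", "Final Exam"]
rows = [
    (0, 0, 1, 1), (1, 1, 1, 1), (0, 0, 1, 1), (1, 1, 1, 1), (1, 1, 1, 1),
    (0, 1, 1, 1), (0, 0, 1, 1), (1, 1, 1, 1), (1, 1, 1, 1), (1, 1, 1, 1),
]

# For each row, the set of columns whose value is 1.
active = [frozenset(c for c, v in zip(COLS, row) if v) for row in rows]


def support_count(itemset):
    """Hypothetical helper: rows that CONTAIN every column in itemset."""
    wanted = frozenset(itemset)
    return sum(1 for a in active if wanted <= a)


def exact_count(itemset):
    """Hypothetical helper: rows whose active columns EQUAL itemset."""
    wanted = frozenset(itemset)
    return sum(1 for a in active if a == wanted)


pair = ["Term 3", "Final Exam"]
print(support_count(pair))  # 10, matching the apriori result set
print(exact_count(pair))    # 3, matching the desired summary table
```

Under this reading, no support threshold alone would turn the first count into the second, because every row with more active columns still contains the smaller itemset.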
Answer 1: Use dot and value_counts
df.dot(df.columns+',').str[:-1].value_counts()
Out[419]:
Term1,Term2,Term3,FinalExam 6
Term3,FinalExam 3
Term2,Term3,FinalExam 1
dtype: int64
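As a self-contained sketch of the same idea, the snippet below rebuilds the question's test frame by hand (the literal values are copied from the table in the question) and applies dot followed by value_counts. The ", " separator, the variable names combos and summary, and the Combination / Frequency column labels are choices made here for illustration, picked to match the desired output table.

```python
import pandas as pd

# The question's data, typed in by hand for a runnable example.
test = pd.DataFrame(
    {
        "Term 1": [0, 1, 0, 1, 1, 0, 0, 1, 1, 1],
        "Term 2": [0, 1, 0, 1, 1, 1, 0, 1, 1, 1],
        "Term 3": [1] * 10,
        "Final Exam": [1] * 10,
    },
    index=[1288, 1290, 1294, 1296, 1297, 1304, 1308, 1324, 1325, 1332],
)

# In each row, 1 * "name, " keeps the label and 0 * "name, " gives "",
# so the dot product concatenates the labels of the active columns.
# The final slice drops the trailing ", " separator.
combos = test.dot(test.columns + ", ").str[:-2]

# Count identical label strings and shape the result as a two-column table.
summary = (
    combos.value_counts()
    .rename_axis("Combination")
    .reset_index(name="Frequency")
)
print(summary)
```

The trick relies on the frame holding only 0/1 integers; a frame with other values would repeat labels, so it may be worth casting with astype(bool).astype(int) first if that is not guaranteed.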