Apriori algorithm explanation
【Title】: Apriori algorithm explanation 【Posted】: 2011-05-12 15:37:42 【Question】: I found an implementation of the Apriori algorithm on the Internet, but there are parts of it I cannot understand. I hope someone can help me.
#region Apriori-gen
// Generates candidate itemsets of size k+1 from the k-itemsets in L
static ArrayList AprioriGen(ArrayList L)
{
    ArrayList Lk = new ArrayList(); // list to store generated candidate itemsets
    Regex r = new Regex(",");
    for (int i = 0; i < L.Count; i++)
    {
        string[] subL1 = r.Split(L[i].ToString());
        for (int j = i + 1; j < L.Count; j++)
        {
            string[] subL2 = r.Split(L[j].ToString());
            // Build the union of the two itemsets in temp: start from L[j],
            // then append every item of L[i] that L[j] does not contain
            string temp = L[j].ToString();
            for (int m = 0; m < subL1.Length; m++)
            {
                bool subL1mInsubL2 = false;
                for (int n = 0; n < subL2.Length; n++)
                    if (subL1[m] == subL2[n]) subL1mInsubL2 = true;
                if (!subL1mInsubL2) temp = temp + "," + subL1[m];
            }
            // Keep temp only if it has exactly one item more than the inputs
            // and no candidate with the same items is already in Lk
            string[] subTemp = r.Split(temp);
            if (subTemp.Length == subL1.Length + 1)
            {
                bool isExists = false;
                for (int m = 0; m < Lk.Count; m++)
                {
                    bool isContained = true;
                    // Note: Contains is a substring test, so item "1" also matches "11"
                    for (int n = 0; n < subTemp.Length; n++)
                        if (!Lk[m].ToString().Contains(subTemp[n])) isContained = false;
                    if (isContained) isExists = true;
                }
                if (!isExists) Lk.Add(temp);
            }
        }
    }
    return Lk;
}
#endregion Apriori-gen
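I know that in the Apriori-gen procedure we join itemsets together to form larger itemsets, but I cannot see how the code above implements that. Why do we use temp? How do isExists and isContained help us? What exactly happens in these two parts of the code?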
【Answer 1】: First, there are two loops:
for (int i = 0; i < L.Count; i++)
for (int j = i + 1; j < L.Count; j++)
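These loops compare every pair of itemsets of a given size. The first thing I notice about this Apriori implementation is that it is not efficient: if the itemsets are kept in lexical order, you do not need to compare every itemset with every other one. Itemsets that share a prefix sit next to each other, so the inner loop can stop as soon as the prefix no longer matches. This code does not have that optimization.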
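The early stop mentioned above can be sketched in Python as follows. This is only an illustration under stated assumptions: itemsets are tuples whose items are already sorted, all input itemsets have the same size, and the function name `apriori_gen_sorted` is hypothetical. It shows the join step only and leaves out the usual pruning of candidates that have an infrequent subset.

```python
def apriori_gen_sorted(frequent):
    """Join step over lexicographically sorted k-itemsets (tuples of sorted items)."""
    frequent = sorted(frequent)
    candidates = []
    for i, a in enumerate(frequent):
        for b in frequent[i + 1:]:
            # Sorted order keeps itemsets with the same (k-1)-prefix adjacent,
            # so after the first mismatch no later itemset can match either.
            if a[:-1] != b[:-1]:
                break
            # a sorts before b, so appending b's last item keeps the tuple sorted.
            candidates.append(a + (b[-1],))
    return candidates


if __name__ == "__main__":
    level3 = [("A", "B", "C"), ("A", "B", "D"), ("A", "C", "D"), ("B", "C", "D")]
    print(apriori_gen_sorted(level3))  # [('A', 'B', 'C', 'D')]
```

Because each pair is joined only when the first k-1 items agree, every candidate is produced once, and no duplicate check against the output list is needed.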
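The second big problem I see in this code is that candidates are stored as strings. Storing them as arrays of integers would be more efficient. Storing itemsets as strings containing "," and splitting them into individual items over and over is a very poor design decision that wastes memory and execution time. For a data mining algorithm, the implementation should be as efficient as possible. To me, this suggests that the code you are looking at was written by a novice.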
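A small Python sketch of the difference, assuming items are integer IDs (the helper names here are hypothetical). Besides the cost of repeated splitting, a substring test on the string form, like the `Contains` call in the question's code, can report an item that is not actually in the itemset:

```python
def parse_itemset(text):
    """Convert a comma-separated string such as "11,12,13" into a sorted tuple of ints."""
    return tuple(sorted(int(token) for token in text.split(",")))


def contains_item_string(itemset_text, item):
    """Substring test on the string form: can give false positives."""
    return str(item) in itemset_text


def contains_item_ints(itemset, item):
    """Exact membership test on the integer form."""
    return item in itemset


if __name__ == "__main__":
    text = "11,12,13"
    print(contains_item_string(text, 1))              # True, although item 1 is absent
    print(contains_item_ints(parse_itemset(text), 1)) # False
```

Parsing each itemset once and keeping the integer tuple avoids both the repeated string splitting and the false match.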
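As for your question, the variable "temp" is used to store the new candidate. As a reminder, a candidate is the combination (union) of two itemsets. To combine two itemsets, you need to check that they share all items except one. For example, the two itemsets ABC and ABD generate a new candidate, ABCD. But if two itemsets differ in more than one item, they should not be combined. That is what the code you showed is trying to do.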
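The same logic can be sketched in Python with sets, which may make the roles of temp, isContained and isExists easier to see. This is a hedged rewrite for illustration, not the original author's code; it assumes all input itemsets have the same size, and the function name `apriori_gen` is hypothetical.

```python
def apriori_gen(frequent):
    """Join k-itemsets pairwise, keeping unions of size k+1 without duplicates."""
    frequent = [frozenset(itemset) for itemset in frequent]
    candidates = []
    for i, first in enumerate(frequent):
        for second in frequent[i + 1:]:
            # Role of `temp`: the union of the two itemsets being compared.
            temp = first | second
            # The union has k+1 items only if the pair differs in exactly one item.
            if len(temp) != len(first) + 1:
                continue
            # Role of isContained / isExists: skip a candidate that is already stored.
            if temp not in candidates:
                candidates.append(temp)
    return candidates


if __name__ == "__main__":
    print(apriori_gen(["ABC", "ABD", "ACD"]))  # one candidate: the itemset {A, B, C, D}
    print(apriori_gen(["AB", "CD"]))           # [] because the pair differs in two items
```

In the question's code, isContained checks whether one stored candidate already holds every item of temp, and isExists records whether any stored candidate did, which is what the `temp not in candidates` test does here.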
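If you want to see some efficient Apriori implementations, you can check my website (http://www.philippe-fournier-viger.com/spmf/), where I provide efficient Java implementations. If you want efficient C++ implementations, see http://fimi.ua.ac.be/src/.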
【Answer 2】: Description: a simple Python implementation of the Apriori algorithm
Usage:
$python apriori.py -f DATASET.csv -s minSupport -c minConfidence
$python apriori.py -f DATASET.csv -s 0.15 -c 0.6
import sys
from itertools import chain, combinations
from collections import defaultdict
from optparse import OptionParser


def subsets(arr):
    """Returns non-empty subsets of arr"""
    return chain(*[combinations(arr, i + 1) for i, a in enumerate(arr)])


def returnItemsWithMinSupport(itemSet, transactionList, minSupport, freqSet):
    """Calculates the support for items in the itemSet and returns a subset
    of the itemSet each of whose elements satisfies the minimum support"""
    _itemSet = set()
    localSet = defaultdict(int)

    for item in itemSet:
        for transaction in transactionList:
            if item.issubset(transaction):
                freqSet[item] += 1
                localSet[item] += 1

    for item, count in localSet.items():
        support = float(count) / len(transactionList)
        if support >= minSupport:
            _itemSet.add(item)

    return _itemSet


def joinSet(itemSet, length):
    """Join a set with itself and returns the n-element itemsets"""
    return set([i.union(j) for i in itemSet for j in itemSet if len(i.union(j)) == length])


def getItemSetTransactionList(data_iterator):
    transactionList = list()
    itemSet = set()
    for record in data_iterator:
        transaction = frozenset(record)
        transactionList.append(transaction)
        for item in transaction:
            itemSet.add(frozenset([item]))  # Generate 1-itemSets
    return itemSet, transactionList


def runApriori(data_iter, minSupport, minConfidence):
    """
    Run the apriori algorithm. data_iter is a record iterator.
    Return both:
     - items (tuple, support)
     - rules ((pretuple, posttuple), confidence)
    """
    itemSet, transactionList = getItemSetTransactionList(data_iter)

    freqSet = defaultdict(int)
    largeSet = dict()
    # Global dictionary which stores (key=n-itemSets,value=support)
    # which satisfy minSupport

    assocRules = dict()
    # Dictionary which stores Association Rules

    oneCSet = returnItemsWithMinSupport(itemSet,
                                        transactionList,
                                        minSupport,
                                        freqSet)

    currentLSet = oneCSet
    k = 2
    while currentLSet != set([]):
        largeSet[k - 1] = currentLSet
        currentLSet = joinSet(currentLSet, k)
        currentCSet = returnItemsWithMinSupport(currentLSet,
                                                transactionList,
                                                minSupport,
                                                freqSet)
        currentLSet = currentCSet
        k = k + 1

    def getSupport(item):
        """Local function which returns the support of an item"""
        return float(freqSet[item]) / len(transactionList)

    toRetItems = []
    for key, value in largeSet.items():
        toRetItems.extend([(tuple(item), getSupport(item))
                           for item in value])

    toRetRules = []
    # Skip the 1-itemsets: a rule needs at least two items
    for key, value in list(largeSet.items())[1:]:
        for item in value:
            _subsets = map(frozenset, [x for x in subsets(item)])
            for element in _subsets:
                remain = item.difference(element)
                if len(remain) > 0:
                    confidence = getSupport(item) / getSupport(element)
                    if confidence >= minConfidence:
                        toRetRules.append(((tuple(element), tuple(remain)),
                                           confidence))
    return toRetItems, toRetRules


def printResults(items, rules):
    """Prints the generated itemsets sorted by support and the confidence rules sorted by confidence"""
    for item, support in sorted(items, key=lambda pair: pair[1]):
        print("item: %s , %.3f" % (str(item), support))
    print("\n------------------------ RULES:")
    for rule, confidence in sorted(rules, key=lambda pair: pair[1]):
        pre, post = rule
        print("Rule: %s ==> %s , %.3f" % (str(pre), str(post), confidence))


def dataFromFile(fname):
    """Function which reads from the file and yields a generator"""
    with open(fname, 'r') as file_iter:
        for line in file_iter:
            line = line.strip().rstrip(',')  # Remove trailing comma
            record = frozenset(line.split(','))
            yield record


if __name__ == "__main__":
    optparser = OptionParser()
    optparser.add_option('-f', '--inputFile',
                         dest='input',
                         help='filename containing csv',
                         default=None)
    optparser.add_option('-s', '--minSupport',
                         dest='minS',
                         help='minimum support value',
                         default=0.15,
                         type='float')
    optparser.add_option('-c', '--minConfidence',
                         dest='minC',
                         help='minimum confidence value',
                         default=0.6,
                         type='float')

    (options, args) = optparser.parse_args()

    inFile = None
    if options.input is None:
        inFile = sys.stdin
    elif options.input is not None:
        inFile = dataFromFile(options.input)
    else:
        print('No dataset filename specified, system will exit\n')
        sys.exit('System will exit')

    minSupport = options.minS
    minConfidence = options.minC

    items, rules = runApriori(inFile, minSupport, minConfidence)

    printResults(items, rules)
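To make the two thresholds concrete, here is a small worked example of the support and confidence formulas that the program applies (support is the fraction of transactions containing an itemset; confidence is the support of the whole itemset divided by the support of the rule's left-hand side). The four transactions and the helper names are made up for illustration only.

```python
# Hypothetical toy dataset: four shopping baskets.
transactions = [
    frozenset(["bread", "milk"]),
    frozenset(["bread", "butter"]),
    frozenset(["bread", "milk", "butter"]),
    frozenset(["milk"]),
]


def support(itemset):
    """Fraction of transactions that contain every item of itemset."""
    itemset = frozenset(itemset)
    hits = sum(1 for t in transactions if itemset <= t)
    return hits / len(transactions)


def confidence(antecedent, consequent):
    """support(antecedent and consequent together) / support(antecedent)."""
    return support(set(antecedent) | set(consequent)) / support(antecedent)


if __name__ == "__main__":
    print(support(["bread"]))                          # 0.75 (3 of 4 baskets)
    print(support(["bread", "milk"]))                  # 0.5  (2 of 4 baskets)
    print(round(confidence(["bread"], ["milk"]), 3))   # 0.667
```

With the defaults from the usage line (-s 0.15 -c 0.6), the itemset {bread, milk} would pass the support threshold and the rule bread ==> milk would pass the confidence threshold on this toy data.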