Efficient correlation calculation between a large number of records
Posted: 2013-12-27 19:00:49

Question: I am reading a book (A Programmer's Guide to Data Mining) that ships with the BX-Dump dataset: ratings from 100k users, each of whom has rated some books. I want to move the whole dataset into a pandas DataFrame, which loads roughly 10x faster than the author's implementation. The author's loader:
%time r.loadBookDB('/Users/danialt/Downloads/BX-Dump/')
1700018
CPU times: user 16.1 s, sys: 373 ms, total: 16.5 s
Wall time: 16.5 s
And mine:
# Mine
%time ratings = pd.read_csv('/Users/danialt/BX-CSV-Dump/BX-Book-Ratings.csv', sep=";", quotechar="\"", escapechar="\\")
%time books = pd.read_csv('/Users/danialt/BX-CSV-Dump/BX-Books.csv', sep=";", quotechar="\"", escapechar="\\")
%time users = pd.read_csv('/Users/danialt/BX-CSV-Dump/BX-Users.csv', sep=";", quotechar="\"", escapechar="\\")
#Output[5]
CPU times: user 484 ms, sys: 73.3 ms, total: 557 ms
Wall time: 567 ms
CPU times: user 1.28 s, sys: 138 ms, total: 1.41 s
Wall time: 1.45 s
CPU times: user 148 ms, sys: 25.7 ms, total: 173 ms
Wall time: 178 ms
#/Output
Now, to compute the correlations with ratings.corr(), I need the users on the index, the books on the columns, and the ratings as the values:
ratings_piv = ratings.pivot(index='User-ID', columns='ISBN', values='Book-Rating')
But this fails, because it would materialize a 100k x 400k matrix!
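For scale, a rough back-of-the-envelope estimate of the dense pivot, assuming float64 cells:
# Rough size of a dense 100k x 400k float64 matrix (an estimate, not measured)
n_users, n_books = 100_000, 400_000
print(n_users * n_books * 8 / 1024**3)   # ~298 GiB, far beyond typical RAM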
Is there a better/more elegant way to compute the correlation for this very sparse matrix, without iterating over every row?
Sample code (do not run the last line, it will eat all your RAM):
import numpy as np
import pandas as pd
import codecs
from math import sqrt
users = "Angelica": "Blues Traveler": 3.5, "Broken Bells": 2.0,
"Norah Jones": 4.5, "Phoenix": 5.0,
"Slightly Stoopid": 1.5,
"The Strokes": 2.5, "Vampire Weekend": 2.0,
"Bill":"Blues Traveler": 2.0, "Broken Bells": 3.5,
"Deadmau5": 4.0, "Phoenix": 2.0,
"Slightly Stoopid": 3.5, "Vampire Weekend": 3.0,
"Chan": "Blues Traveler": 5.0, "Broken Bells": 1.0,
"Deadmau5": 1.0, "Norah Jones": 3.0, "Phoenix": 5,
"Slightly Stoopid": 1.0,
"Dan": "Blues Traveler": 3.0, "Broken Bells": 4.0,
"Deadmau5": 4.5, "Phoenix": 3.0,
"Slightly Stoopid": 4.5, "The Strokes": 4.0,
"Vampire Weekend": 2.0,
"Hailey": "Broken Bells": 4.0, "Deadmau5": 1.0,
"Norah Jones": 4.0, "The Strokes": 4.0,
"Vampire Weekend": 1.0,
"Jordyn": "Broken Bells": 4.5, "Deadmau5": 4.0,
"Norah Jones": 5.0, "Phoenix": 5.0,
"Slightly Stoopid": 4.5, "The Strokes": 4.0,
"Vampire Weekend": 4.0,
"Sam": "Blues Traveler": 5.0, "Broken Bells": 2.0,
"Norah Jones": 3.0, "Phoenix": 5.0,
"Slightly Stoopid": 4.0, "The Strokes": 5.0,
"Veronica": "Blues Traveler": 3.0, "Norah Jones": 5.0,
"Phoenix": 4.0, "Slightly Stoopid": 2.5,
"The Strokes": 3.0
class recommender:
def __init__(self, data, k=1, metric='pearson', n=5):
""" initialize recommender
currently, if data is dictionary the recommender is initialized
to it.
For all other data types of data, no initialization occurs
k is the k value for k nearest neighbor
metric is which distance formula to use
n is the maximum number of recommendations to make"""
self.k = k
self.n = n
self.username2id = {}
self.userid2name = {}
self.productid2name = {}
# for some reason I want to save the name of the metric
self.metric = metric
if self.metric == 'pearson':
self.fn = self.pearson
#
# if data is dictionary set recommender data to it
#
if type(data).__name__ == 'dict':
self.data = data
def convertProductID2name(self, id):
"""Given product id number return product name"""
if id in self.productid2name:
return self.productid2name[id]
else:
return id
def userRatings(self, id, n):
"""Return n top ratings for user with id"""
print ("Ratings for " + self.userid2name[id])
ratings = self.data[id]
print(len(ratings))
ratings = list(ratings.items())
ratings = [(self.convertProductID2name(k), v)
for (k, v) in ratings]
# finally sort and return
ratings.sort(key=lambda artistTuple: artistTuple[1],
reverse = True)
ratings = ratings[:n]
for rating in ratings:
print("%s\t%i" % (rating[0], rating[1]))
def loadBookDB(self, path=''):
"""loads the BX book dataset. Path is where the BX files are
located"""
self.data = {}
i = 0
#
# First load book ratings into self.data
#
f = codecs.open(path + "BX-Book-Ratings.csv", 'r', 'utf8')
for line in f:
i += 1
#separate line into fields
fields = line.split(';')
user = fields[0].strip('"')
book = fields[1].strip('"')
rating = int(fields[2].strip().strip('"'))
if user in self.data:
currentRatings = self.data[user]
else:
currentRatings = {}
currentRatings[book] = rating
self.data[user] = currentRatings
f.close()
#
# Now load books into self.productid2name
# Books contains isbn, title, and author among other fields
#
f = codecs.open(path + "BX-Books.csv", 'r', 'utf8')
for line in f:
i += 1
#separate line into fields
fields = line.split(';')
isbn = fields[0].strip('"')
title = fields[1].strip('"')
author = fields[2].strip().strip('"')
title = title + ' by ' + author
self.productid2name[isbn] = title
f.close()
#
# Now load user info into both self.userid2name and
# self.username2id
#
f = codecs.open(path + "BX-Users.csv", 'r', 'utf8')
for line in f:
i += 1
#print(line)
#separate line into fields
fields = line.split(';')
userid = fields[0].strip('"')
location = fields[1].strip('"')
if len(fields) > 3:
age = fields[2].strip().strip('"')
else:
age = 'NULL'
if age != 'NULL':
value = location + ' (age: ' + age + ')'
else:
value = location
self.userid2name[userid] = value
self.username2id[location] = userid
f.close()
print(i)
def pearson(self, rating1, rating2):
sum_xy = 0
sum_x = 0
sum_y = 0
sum_x2 = 0
sum_y2 = 0
n = 0
for key in rating1:
if key in rating2:
n += 1
x = rating1[key]
y = rating2[key]
sum_xy += x * y
sum_x += x
sum_y += y
sum_x2 += pow(x, 2)
sum_y2 += pow(y, 2)
if n == 0:
return 0
# now compute denominator
denominator = (sqrt(sum_x2 - pow(sum_x, 2) / n)
* sqrt(sum_y2 - pow(sum_y, 2) / n))
if denominator == 0:
return 0
else:
return (sum_xy - (sum_x * sum_y) / n) / denominator
def computeNearestNeighbor(self, username):
"""creates a sorted list of users based on their distance to
username"""
distances = []
for instance in self.data:
if instance != username:
distance = self.fn(self.data[username],
self.data[instance])
distances.append((instance, distance))
# sort based on distance -- closest first
distances.sort(key=lambda artistTuple: artistTuple[1],
reverse=True)
return distances
def recommend(self, user):
"""Give list of recommendations"""
recommendations = {}
# first get list of users ordered by nearness
nearest = self.computeNearestNeighbor(user)
#
# now get the ratings for the user
#
userRatings = self.data[user]
#
# determine the total distance
totalDistance = 0.0
for i in range(self.k):
totalDistance += nearest[i][1]
# now iterate through the k nearest neighbors
# accumulating their ratings
for i in range(self.k):
# compute slice of pie
weight = nearest[i][1] / totalDistance
# get the name of the person
name = nearest[i][0]
# get the ratings for this person
neighborRatings = self.data[name]
# get the name of the person
# now find bands neighbor rated that user didn't
for artist in neighborRatings:
if not artist in userRatings:
if artist not in recommendations:
recommendations[artist] = (neighborRatings[artist]
* weight)
else:
recommendations[artist] = (recommendations[artist]
+ neighborRatings[artist]
* weight)
# now make list from dictionary
recommendations = list(recommendations.items())
recommendations = [(self.convertProductID2name(k), v)
for (k, v) in recommendations]
# finally sort and return
recommendations.sort(key=lambda artistTuple: artistTuple[1],
reverse = True)
# Return the first n items
return recommendations[:self.n]
r = recommender(users)
# The author's implementation
r.loadBookDB('/Users/danialt/Downloads/BX-Dump/')
# The alternative loading
ratings = pd.read_csv('/Users/danialt/BX-CSV-Dump/BX-Book-Ratings.csv', sep=";", quotechar="\"", escapechar="\\")
books = pd.read_csv('/Users/danialt/BX-CSV-Dump/BX-Books.csv', sep=";", quotechar="\"", escapechar="\\")
users = pd.read_csv('/Users/danialt/BX-CSV-Dump/BX-Users.csv', sep=";", quotechar="\"", escapechar="\\")
pivot_rating = ratings.pivot(index='User-ID', columns='ISBN', values='Book-Rating')
Comments:
Why two downvotes but no constructive comments? It seems clear enough to me, +1.
@DanAllan I am wondering the same thing.

Answer 1:
Be careful with benchmarks like this. pandas may be using lazy loading, i.e. the call may return before the data has actually been read, in which case the measured wall time is worthless. Try performing some simple operation over all of the data to make sure it has really been read.
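A minimal check along those lines (a sketch of my own, reusing the DataFrames loaded in the question) might be:
# Touch every value so the timing cannot hide behind any deferred work.
%time n_ratings = len(ratings)
%time mean_rating = ratings['Book-Rating'].mean()
print(n_ratings, mean_rating)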
As for the correlation itself: your input matrix may be sparse, but the correlation matrix most likely will not be, because there is usually at least some small correlation between any two things.
Note that the correlation matrix will be square: with 100k users, the user-user correlation matrix is 100k x 100k, about 80 GB as dense float64 (symmetry lets you store only half, but that does not help much).
If you want to speed the computation up, think about whether you really need all of the data, whether you need full precision, and whether you can exploit the in-memory data layout. Covariance (which correlation is built on), for example, is easy to accelerate by exploiting sparsity and visiting only the non-zero entries of each column rather than whole rows.
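As a sketch of that direction (my own code, not from the answer): build the user x book matrix as a scipy.sparse matrix straight from the ratings DataFrame, and compute similarities one user at a time so the full user-user matrix is never materialized. Note that this uses cosine similarity rather than the book's Pearson measure, and it treats unrated books as zeros, which is a simplification:
import numpy as np
from scipy import sparse

# Sparse users x books matrix built from the long-format ratings table
# (columns 'User-ID', 'ISBN', 'Book-Rating' as in the question).
user_cat = ratings['User-ID'].astype('category')
book_cat = ratings['ISBN'].astype('category')
X = sparse.csr_matrix(
    (ratings['Book-Rating'].astype(np.float64).values,
     (user_cat.cat.codes.values, book_cat.cat.codes.values)),
    shape=(user_cat.cat.categories.size, book_cat.cat.categories.size))

row_norms = np.sqrt(np.asarray(X.multiply(X).sum(axis=1)).ravel())

def similarities_for(user_index):
    """Cosine similarity of one user against all users via one sparse product."""
    dots = np.asarray((X @ X[user_index].T).todense()).ravel()   # one value per user
    denom = row_norms * row_norms[user_index]
    with np.errstate(divide='ignore', invalid='ignore'):
        return np.where(denom > 0, dots / denom, 0.0)

sims = similarities_for(0)                 # neighbours of the first user
print(np.argsort(-sims)[:6])               # closest users (index 0 itself included)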
To get really fast, though, you will have to stop thinking in terms of matrices altogether. Think instead about hashing and indexing structures that avoid comparing everything with everything else (a cost that is inherently quadratic). And whenever you do think in matrices, think of sparse matrices that never actually exist in memory or on disk.
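One concrete reading of that advice (again my own sketch, not the answerer's code) is random-hyperplane locality-sensitive hashing: users whose rating vectors fall on the same side of a handful of random hyperplanes share a signature and land in the same bucket, so candidate neighbours are found without the quadratic all-pairs pass. It assumes the sparse matrix X from the sketch above:
from collections import defaultdict
import numpy as np

rng = np.random.default_rng(0)
n_planes = 16                                   # 16 signature bits per user
planes = rng.standard_normal((X.shape[1], n_planes))

# One bit per hyperplane: which side of the plane the user's rating vector lies on.
proj = X @ planes                               # n_users x n_planes, dense but narrow
bits = (proj > 0)

buckets = defaultdict(list)
for user_index, signature in enumerate(map(tuple, bits)):
    buckets[signature].append(user_index)

# Candidate neighbours of user 0: only the users sharing its signature.
# Real LSH would use several independent tables to improve recall.
candidates = buckets[tuple(bits[0])]
print(len(candidates))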
Discussion:
Can you elaborate on your last paragraph? Do you mean hashing as in google.com/trends/correlate/nnsearch.pdf? Say I have 300k users and 30k items, how should I proceed?
I have not read that paper. I was referring to hashing in the sense of "sketching and hashing" and "locality-sensitive hashing".