What's the best way to calculate similarity between rows in a table based on association?
【Posted】: 2010-06-13 01:15:22
【Question】: Suppose each person has a set of favorite books.
So I have these tables:
Person
Book
The association between Person and Book (an MxN join table)
I want to find the people most similar to Person1 based on how much their favorite books overlap. That is: the more books they have in common, the more similar they are.
I don't have to solve this with SQL alone; I can also use code. I'm using SQL Server 2008 and C#.
What solution would you experts use?
【Answer 1】: This may not be the most efficient approach, but it is relatively simple:
WITH SimilarBookPrefs(person_id, similar_person_id, booksInCommon) AS
(
    SELECT p1.person_id, p2.person_id AS similar_person_id,
        /* Find the number of books p1 and p2 have in common */
        (SELECT COUNT(*)
         FROM PersonBook pb1
         JOIN PersonBook pb2 ON pb1.book_id = pb2.book_id
         WHERE pb1.person_id = p1.person_id AND pb2.person_id = p2.person_id) AS booksInCommon
    FROM Person p1 CROSS JOIN Person p2
    /* Exclude pairing a person with themselves, which would always rank first */
    WHERE p1.person_id <> p2.person_id
)
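This gives you, for each person, a list of the other people and the number of books they have in common.
To get the most similar person, add (in the same query):
SELECT TOP 1 similar_person_id FROM SimilarBookPrefs
WHERE person_id = <person_to_match>
ORDER BY booksInCommon DESC;
The first part doesn't have to be a CTE (i.e. WITH ...); it could be a view or even a derived table. It is written as a CTE here for brevity.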
【Answer 2】: If I were doing this in C#, I would probably approach it like this:
var query = from personBook in personBooks
            where personBook.PersonId != basePersonId // skip the person being matched
            join bookbase in personBooks
                on personBook.BookId equals bookbase.BookId
            where bookbase.PersonId == basePersonId // keep only books the base person likes
            join person in persons
                on personBook.PersonId equals person.Id
            group person by person into bookgroup
            select new
            {
                Person = bookgroup.Key,
                BooksInCommon = bookgroup.Count()
            };
This could be done through Entity Framework or LINQ to SQL, or translated directly into SQL.
Full sample code:
using System;
using System.Collections.Generic;
using System.Linq;

class CommonBooks
{
    static void Main()
    {
        List<Person> persons = new List<Person>()
        {
            new Person(1, "Jane"), new Person(2, "Joan"), new Person(3, "Jim"), new Person(4, "John"), new Person(5, "Jill")
        };
        List<Book> books = new List<Book>()
        {
            new Book(1), new Book(2), new Book(3), new Book(4), new Book(5)
        };
        // Each entry is (personId, bookId): one row of the join table.
        List<PersonBook> personBooks = new List<PersonBook>()
        {
            new PersonBook(1,1), new PersonBook(1,2), new PersonBook(1,3), new PersonBook(1,4), new PersonBook(1,5),
            new PersonBook(2,2), new PersonBook(2,3), new PersonBook(2,5),
            new PersonBook(3,2), new PersonBook(3,4), new PersonBook(3,5),
            new PersonBook(4,1), new PersonBook(4,4),
            new PersonBook(5,1), new PersonBook(5,3), new PersonBook(5,5)
        };
        int basePersonId = 4; // person to match likeness
        var query = from personBook in personBooks
                    where personBook.PersonId != basePersonId
                    join bookbase in personBooks
                        on personBook.BookId equals bookbase.BookId
                    where bookbase.PersonId == basePersonId
                    join person in persons
                        on personBook.PersonId equals person.Id
                    group person by person into bookgroup
                    select new
                    {
                        Person = bookgroup.Key,
                        BooksInCommon = bookgroup.Count()
                    };
        // Print each matching person's name and the number of shared books.
        foreach (var item in query)
        {
            Console.WriteLine("{0}\t{1}", item.Person.Name, item.BooksInCommon);
        }
        Console.Read();
    }
}

class Person
{
    public int Id { get; set; }
    public string Name { get; set; }
    public Person(int id, string name) { Id = id; Name = name; }
}

class Book
{
    public int Id { get; set; }
    public Book(int id) { Id = id; }
}

class PersonBook
{
    public int PersonId { get; set; }
    public int BookId { get; set; }
    public PersonBook(int personId, int bookId) { PersonId = personId; BookId = bookId; }
}
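【Answer 3】: The problem you describe is commonly known as "collaborative filtering" and is solved with "recommender systems". Searching for either term should turn up plenty of useful information.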
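As a rough illustration of where that search leads, one simple similarity measure often used in collaborative filtering is the Jaccard index: the number of shared books divided by the number of distinct books across both people. The answer does not name this measure, so treat it as an assumption rather than the recommended method. The sketch below is in Python for brevity, the function name `rank_similar` is hypothetical, and the data is the same person/book association used in the C# sample in this thread.

```python
from collections import defaultdict


def rank_similar(person_books, base_person_id):
    """Rank other people by Jaccard similarity of their book sets to the base person's.

    person_books is an iterable of (person_id, book_id) pairs, i.e. join-table rows.
    Returns a list of (person_id, books_in_common, jaccard) tuples, best match first.
    """
    # Build person -> set of favorite book ids.
    favorites = defaultdict(set)
    for person_id, book_id in person_books:
        favorites[person_id].add(book_id)

    base = favorites.get(base_person_id, set())
    scores = []
    for person_id, books in favorites.items():
        if person_id == base_person_id:
            continue  # do not compare the person with themselves
        common = len(base & books)
        union = len(base | books)
        jaccard = common / union if union else 0.0
        scores.append((person_id, common, jaccard))

    # Highest similarity first; ties broken by person id for a stable order.
    scores.sort(key=lambda s: (-s[2], s[0]))
    return scores


# Assumed sample data: the (person_id, book_id) pairs from the C# example.
PERSON_BOOKS = [
    (1, 1), (1, 2), (1, 3), (1, 4), (1, 5),
    (2, 2), (2, 3), (2, 5),
    (3, 2), (3, 4), (3, 5),
    (4, 1), (4, 4),
    (5, 1), (5, 3), (5, 5),
]

if __name__ == "__main__":
    for person_id, common, jaccard in rank_similar(PERSON_BOOKS, 4):
        print(f"{person_id}\t{common}\t{jaccard:.2f}")
```

Unlike a raw count of shared books, the Jaccard index penalizes people who simply like a lot of books, which may or may not be what you want; a plain count, as in the SQL and LINQ answers, is the simpler choice when list sizes are similar.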