How to extract user ratings from a movie dataset

Posted 2023-03-12


【Title】: How to extract user ratings from a movie dataset 【Posted】: 2020-10-20 02:47:53 【Question】:

This screenshot is a sample of the merged MovieLens dataset. I have two questions:

List only

Any guidance would be highly appreciated.

【Comments】:

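FYI: answering a question thoroughly is time-consuming. If your question is solved, please say thanks by accepting the solution that best fits your needs. The ✔ is below the ▲/▼ arrows at the top left of the answer. A new solution can be accepted if a better one appears. If your reputation is above 15, you can also vote on an answer's usefulness with the ▲/▼ arrows. If a solution does not answer the question, please leave a comment. See "What should I do when someone answers my question?". Thanks.

【Answer 1】: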

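Create a dictionary of dataframes, one per unique user

Use the regular expression '\((\d+)\)' to extract the digits (\d) between parentheses and assign the values to movies['Year'].

Instead of repeatedly filtering the dataframe for different users, add each user as a key in a dictionary, where the value is the dataframe filtered for that user.

There are 162541 unique userId values, so instead of using df.userId.unique(), use a list of the specific userId values you are interested in.

See also the answer to "How to rotate seaborn barplot x-axis tick labels", which relates to this MovieLens dataset.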

# question 1: create a column for the year extracted from the title
# extracts the digits between parentheses
# does not change the title column
# raw string avoids invalid-escape warnings; expand=False returns a Series
df['Year'] = df.title.str.extract(r'\((\d+)\)', expand=False)

# create dict of dataframes for each user
userid_movies = dict()
for user in [10, 15, 191]:  # df.userId.unique() = 162541 unique users
    data = df[df.userId == user]
    userid_movies[user] = data

# get data for user 191; assumes ids are int. if not, use '191'
userid_movies[191]  # if you're using jupyter, don't use print
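
If more than a handful of users are needed, one possible variation is to filter once with isin and let groupby split the result, rather than filtering the full dataframe inside a loop. The sketch below runs on a small made-up dataframe; the rows and the `wanted` list are illustrative and not taken from MovieLens.

```python
import pandas as pd

# Tiny illustrative stand-in for the merged dataframe (values are made up)
df = pd.DataFrame({
    'userId': [10, 15, 191, 191, 7],
    'title': ['Toy Story (1995)', 'Jumanji (1995)', 'Up (2009)',
              '17 Again (2009)', 'Heat (1995)'],
    'rating': [4.0, 3.5, 4.0, 3.0, 5.0],
})

wanted = [10, 15, 191]  # hypothetical list of user ids of interest

# Filter once, then split into one dataframe per user
subset = df[df.userId.isin(wanted)]
userid_movies = {
    user: data.reset_index(drop=True)
    for user, data in subset.groupby('userId')
}

print(userid_movies[191])
```

The result has the same shape as the loop version: a dict keyed by userId whose values are per-user dataframes.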

Example

Given movie.csv & rating.csv from Kaggle: MovieLens 20M Dataset

import pandas as pd

# load movies
movies = pd.read_csv('data/ml-25m/movies.csv')

# extract year
movies['Year'] = movies.title.str.extract(r'\((\d+)\)', expand=False)

# display head
   movieId                               title                                       genres  Year
0        1                    Toy Story (1995)  Adventure|Animation|Children|Comedy|Fantasy  1995
1        2                      Jumanji (1995)                   Adventure|Children|Fantasy  1995
2        3             Grumpier Old Men (1995)                               Comedy|Romance  1995
3        4            Waiting to Exhale (1995)                         Comedy|Drama|Romance  1995
4        5  Father of the Bride Part II (1995)                                       Comedy  1995

# load ratings
ratings = pd.read_csv('data/ml-25m/ratings.csv')

# merge on movieId
df = pd.merge(movies, ratings, on='movieId').reset_index(drop=True)

# display df
   movieId             title                                       genres  Year  userId  rating   timestamp
0        1  Toy Story (1995)  Adventure|Animation|Children|Comedy|Fantasy  1995       2     3.5  1141415820
1        1  Toy Story (1995)  Adventure|Animation|Children|Comedy|Fantasy  1995       3     4.0  1439472215
2        1  Toy Story (1995)  Adventure|Animation|Children|Comedy|Fantasy  1995       4     3.0  1573944252
3        1  Toy Story (1995)  Adventure|Animation|Children|Comedy|Fantasy  1995       5     4.0   858625949
4        1  Toy Story (1995)  Adventure|Animation|Children|Comedy|Fantasy  1995       8     4.0   890492517

# dict of dataframes
# there are 162541 unique userId values, so instead of using df.userId.unique()
# use a list of the specific Id values you're interested in
userid_movies = dict()
for user in [10, 15, 191]:
    data = df[df.userId == user].reset_index(drop=True)
    userid_movies[user] = data

# display(userid_movies[191].head())
   movieId                                  title                                              genres  Year  userId  rating   timestamp
0    68135                        17 Again (2009)                                        Comedy|Drama  2009     191     3.0  1473704208
1    68791            Terminator Salvation (2009)                    Action|Adventure|Sci-Fi|Thriller  2009     191     5.0  1473704167
2    68954                              Up (2009)                  Adventure|Animation|Children|Drama  2009     191     4.0  1473703994
3    69406                   Proposal, The (2009)                                      Comedy|Romance  2009     191     4.0  1473704198
4    69644  Ice Age: Dawn of the Dinosaurs (2009)  Action|Adventure|Animation|Children|Comedy|Romance  2009     191     1.5  1473704242
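
The timestamp column in the output above is an integer. The MovieLens documentation describes it as seconds since midnight UTC on January 1, 1970, so it records when the rating was made. A minimal sketch of converting it to a datetime, using three values copied from the output above (the `rated_at` column name is made up for illustration):

```python
import pandas as pd

# Timestamps copied from the user 191 rows shown above
ratings = pd.DataFrame({
    'movieId': [68135, 68791, 68954],
    'timestamp': [1473704208, 1473704167, 1473703994],
})

# Interpret the integers as seconds since the Unix epoch (UTC)
ratings['rated_at'] = pd.to_datetime(ratings['timestamp'], unit='s')

print(ratings[['movieId', 'rated_at']])
```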

【Comments】:

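What does the timestamp actually mean?

@KaruneshPalekar It is just that, a timestamp. However, since I posted this answer, the Kaggle file now has a regular datetime stamp. Given the example in this answer, df.timestamp = pd.to_datetime(df.timestamp, unit='s') will convert the column to a datetime dtype.

I meant, what does it represent? As I understand it, it describes the time that has passed since the user rated a particular movie. Am I right?

@KaruneshPalekar Yes, that is my understanding.

Thanks for your help.

【Answer 2】: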

For the first part of the question, you can filter the dataframe.

user191 = df.loc[df['userId']==191]

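For the second part of your question, the year always seems to appear at the end, so you can take the last part of the string and remove the parenthesis.

# regex=False treats ')' as a literal character on any pandas version
df['Year'] = df['title'].str[-5:].str.replace(')', '', regex=False)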
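
Slicing the last five characters depends on every title ending exactly in (YYYY). As a hedged illustration on made-up titles, trailing whitespace or a missing year breaks the slice, while a regex anchored at the end of the string returns NaN instead of a wrong value. The `\s*$` anchor is an assumption about how such titles might look, not something stated in the answer.

```python
import pandas as pd

# Made-up titles: a clean one, one with a trailing space, one without a year
titles = pd.Series(['Up (2009)', 'Heat (1995) ', 'Untitled Project'])

# Slice approach: last five characters with ')' removed
sliced = titles.str[-5:].str.replace(')', '', regex=False)

# Regex alternative: four digits in parentheses at the end of the title
anchored = titles.str.extract(r'\((\d{4})\)\s*$', expand=False)

print(pd.DataFrame({'title': titles, 'sliced': sliced, 'anchored': anchored}))
```

Only the first title yields a clean year from the slice; the anchored pattern recovers the year from the second title and leaves the third as missing.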

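【Comments】:

【Answer 3】:

First question: use boolean selection

# the column is named userId and holds ints in the merged data; use '191' only if the ids were read as strings
df[df['userId'] == 191]

Second question: use a regular expression to extract the phrase between parentheses

df['Year'] = df.title.str.extract(r'\((.*?)\)', expand=False)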
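
Some titles carry an alternate name in parentheses before the year. The sketch below, on two illustrative titles, compares a pattern that captures any parenthesized phrase with a digits-only pattern. str.extract returns the first match, so the two can differ.

```python
import pandas as pd

# Illustrative titles; the second has an alternate name before the year
titles = pd.Series([
    'Toy Story (1995)',
    'Twelve Monkeys (a.k.a. 12 Monkeys) (1995)',
])

# First parenthesized phrase, whatever it contains
any_phrase = titles.str.extract(r'\((.*?)\)', expand=False)

# First parenthesized run of exactly four digits
digits_only = titles.str.extract(r'\((\d{4})\)', expand=False)

print(any_phrase.tolist())   # ['1995', 'a.k.a. 12 Monkeys']
print(digits_only.tolist())  # ['1995', '1995']
```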

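【Comments】:

Would this return both the phrase that appears in movieId 32 and the year?

What do you mean? The phrase stays in the title column; my solution creates a new column named Year that holds the numeric year. Once the year is extracted, it can no longer sit in the same column as the phrase, can it?

Oh, so your regex only returns a value if the parentheses contain nothing but digits? I am not very familiar with regex, but this solution is great!

The regex approach worked perfectly, but the other solutions did not produce the expected output. Thank you very much for the quick reply.

I guess they are int, not str.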
