Movie Data Visualization Project: Data Cleaning
Posted by 天天学点数据分析
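Questions posed:
1. How have movie genres changed over time?
2. How do Universal Pictures and Paramount Pictures compare?
3. How do adapted movies compare with original movies? (Judged by whether the keywords variable contains "based on novel".)
Import the required libraries
import pandas as pd
import numpy as np
import re
Load the CSV data into a pandas data frame
movies = pd.read_csv("movies.csv")
movies.head()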
  | id | imdb_id | popularity | budget | revenue | original_title | cast | homepage | director | tagline | ... | overview | runtime | genres | production_companies | release_date | vote_count | vote_average | release_year | budget_adj | revenue_adj
0 | 135397 | tt0369610 | 32.985763 | 150000000 | 1513528810 | Jurassic World | Chris Pratt|Bryce Dallas Howard|Irrfan Khan|Vi... | http://www.jurassicworld.com/ | Colin Trevorrow | The park is open. | ... | Twenty-two years after the events of Jurassic ... | 124 | Action|Adventure|Science Fiction|Thriller | Universal Studios|Amblin Entertainment|Legenda... | 2015-06-09 | 5562 | 6.5 | 2015 | 1.379999e+08 | 1.392446e+09
1 | 76341 | tt1392190 | 28.419936 | 150000000 | 378436354 | Mad Max: Fury Road | Tom Hardy|Charlize Theron|Hugh Keays-Byrne|Nic... | http://www.madmaxmovie.com/ | George Miller | What a Lovely Day. | ... | An apocalyptic story set in the furthest reach... | 120 | Action|Adventure|Science Fiction|Thriller | Village Roadshow Pictures|Kennedy Miller Produ... | 2015-05-13 | 6185 | 7.1 | 2015 | 1.379999e+08 | 3.481613e+08
2 | 262500 | tt2908446 | 13.112507 | 110000000 | 295238201 | Insurgent | Shailene Woodley|Theo James|Kate Winslet|Ansel... | http://www.thedivergentseries.movie/#insurgent | Robert Schwentke | One Choice Can Destroy You | ... | Beatrice Prior must confront her inner demons ... | 119 | Adventure|Science Fiction|Thriller | Summit Entertainment|Mandeville Films|Red Wago... | 2015-03-18 | 2480 | 6.3 | 2015 | 1.012000e+08 | 2.716190e+08
3 | 140607 | tt2488496 | 11.173104 | 200000000 | 2068178225 | Star Wars: The Force Awakens | Harrison Ford|Mark Hamill|Carrie Fisher|Adam D... | http://www.starwars.com/films/star-wars-episod... | J.J. Abrams | Every generation has a story. | ... | Thirty years after defeating the Galactic Empi... | 136 | Action|Adventure|Science Fiction|Fantasy | Lucasfilm|Truenorth Productions|Bad Robot | 2015-12-15 | 5292 | 7.5 | 2015 | 1.839999e+08 | 1.902723e+09
4 | 168259 | tt2820852 | 9.335014 | 190000000 | 1506249360 | Furious 7 | Vin Diesel|Paul Walker|Jason Statham|Michelle ... | http://www.furious7.com/ | James Wan | Vengeance Hits Home | ... | Deckard Shaw seeks revenge against Dominic Tor... | 137 | Action|Crime|Thriller | Universal Pictures|Original Film|Media Rights ... | 2015-04-01 | 2947 | 7.3 | 2015 | 1.747999e+08 | 1.385749e+09
5 rows × 21 columns
Data cleaning
Question 1: How have movie genres changed over time?
Build a subset data frame, movies_genres
movies_genres = movies[['id', 'original_title', 'genres']].reset_index(drop=True)
movies_genres.head()
  | id | original_title | genres
0 | 135397 | Jurassic World | Action|Adventure|Science Fiction|Thriller
1 | 76341 | Mad Max: Fury Road | Action|Adventure|Science Fiction|Thriller
2 | 262500 | Insurgent | Adventure|Science Fiction|Thriller
3 | 140607 | Star Wars: The Force Awakens | Action|Adventure|Science Fiction|Fantasy
4 | 168259 | Furious 7 | Action|Crime|Thriller
Use the split function to spread genres across separate columns
movies_genres[['genres1', 'genres2', 'genres3', 'genres4', 'genres5']] = movies_genres['genres'].str.split('|', expand=True)
del movies_genres['genres']
movies_genres.head()
  | id | original_title | genres1 | genres2 | genres3 | genres4 | genres5
0 | 135397 | Jurassic World | Action | Adventure | Science Fiction | Thriller | None
1 | 76341 | Mad Max: Fury Road | Action | Adventure | Science Fiction | Thriller | None
2 | 262500 | Insurgent | Adventure | Science Fiction | Thriller | None | None
3 | 140607 | Star Wars: The Force Awakens | Action | Adventure | Science Fiction | Fantasy | None
4 | 168259 | Furious 7 | Action | Crime | Thriller | None | None
Unpivot (melt) the five new genres columns
movies_genres = pd.melt(movies_genres, id_vars=['id', 'original_title'], value_name='genres', var_name='genre_n')
movies_genres.dropna(axis=0, subset=['genres'], inplace=True)
movies_genres.head()
  | id | original_title | genre_n | genres
0 | 135397 | Jurassic World | genres1 | Action
1 | 76341 | Mad Max: Fury Road | genres1 | Action
2 | 262500 | Insurgent | genres1 | Adventure
3 | 140607 | Star Wars: The Force Awakens | genres1 | Action
4 | 168259 | Furious 7 | genres1 | Action
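The split-then-melt steps above fix the number of genre columns at five. As a hedged alternative, the sketch below reaches a similar long format with `str.split` followed by `DataFrame.explode`, which does not need a fixed column count. The `toy` frame and the `genres_long` name are illustrative assumptions, not part of the original project, and the result has no `genre_n` column and a different row order than the melted frame.

```python
import pandas as pd

# Toy stand-in for the movies data (assumption); the real frame comes from movies.csv.
toy = pd.DataFrame({
    "id": [135397, 262500, 1],
    "original_title": ["Jurassic World", "Insurgent", "Untitled"],
    "genres": [
        "Action|Adventure|Science Fiction|Thriller",
        "Adventure|Science Fiction|Thriller",
        None,  # a movie with no genres recorded
    ],
})

# Turn each pipe-delimited string into a list, then emit one row per list element.
genres_long = (
    toy.assign(genres=toy["genres"].str.split("|"))
       .explode("genres")
       .dropna(subset=["genres"])   # drop movies that had no genres
       .reset_index(drop=True)
)
print(genres_long)
```

Each movie keeps its rows together here, whereas the melted frame lists every movie's first genre before any second genre.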
Delete the original genres column from movies, then merge the data frames
del movies['genres']
movies_cleaned_genres = pd.merge(movies, movies_genres, how='right', left_on=['id', 'original_title'], right_on=['id', 'original_title'])
movies_cleaned_genres.head()
  | id | imdb_id | popularity | budget | revenue | original_title | cast | homepage | director | tagline | ... | runtime | production_companies | release_date | vote_count | vote_average | release_year | budget_adj | revenue_adj | genre_n | genres
0 | 135397 | tt0369610 | 32.985763 | 150000000 | 1513528810 | Jurassic World | Chris Pratt|Bryce Dallas Howard|Irrfan Khan|Vi... | http://www.jurassicworld.com/ | Colin Trevorrow | The park is open. | ... | 124 | Universal Studios|Amblin Entertainment|Legenda... | 2015-06-09 | 5562 | 6.5 | 2015 | 1.379999e+08 | 1.392446e+09 | genres1 | Action
1 | 135397 | tt0369610 | 32.985763 | 150000000 | 1513528810 | Jurassic World | Chris Pratt|Bryce Dallas Howard|Irrfan Khan|Vi... | http://www.jurassicworld.com/ | Colin Trevorrow | The park is open. | ... | 124 | Universal Studios|Amblin Entertainment|Legenda... | 2015-06-09 | 5562 | 6.5 | 2015 | 1.379999e+08 | 1.392446e+09 | genres2 | Adventure
2 | 135397 | tt0369610 | 32.985763 | 150000000 | 1513528810 | Jurassic World | Chris Pratt|Bryce Dallas Howard|Irrfan Khan|Vi... | http://www.jurassicworld.com/ | Colin Trevorrow | The park is open. | ... | 124 | Universal Studios|Amblin Entertainment|Legenda... | 2015-06-09 | 5562 | 6.5 | 2015 | 1.379999e+08 | 1.392446e+09 | genres3 | Science Fiction
3 | 135397 | tt0369610 | 32.985763 | 150000000 | 1513528810 | Jurassic World | Chris Pratt|Bryce Dallas Howard|Irrfan Khan|Vi... | http://www.jurassicworld.com/ | Colin Trevorrow | The park is open. | ... | 124 | Universal Studios|Amblin Entertainment|Legenda... | 2015-06-09 | 5562 | 6.5 | 2015 | 1.379999e+08 | 1.392446e+09 | genres4 | Thriller
4 | 76341 | tt1392190 | 28.419936 | 150000000 | 378436354 | Mad Max: Fury Road | Tom Hardy|Charlize Theron|Hugh Keays-Byrne|Nic... | http://www.madmaxmovie.com/ | George Miller | What a Lovely Day. | ... | 120 | Village Roadshow Pictures|Kennedy Miller Produ... | 2015-05-13 | 6185 | 7.1 | 2015 | 1.379999e+08 | 3.481613e+08 | genres1 | Action
5 rows × 22 columns
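The merged output above repeats every movie-level value (budget, revenue, and so on) once per genre. The sketch below reproduces that one-to-many behaviour on toy data; `movies_toy` and `genres_toy` are illustrative assumptions. It uses `on=` as shorthand for identical `left_on`/`right_on` keys, and adds pandas' `validate="one_to_many"` check, which the original code does not use, to raise an error if a movie appears more than once on the left side.

```python
import pandas as pd

# Toy frames (assumptions) mirroring the shape of movies and movies_genres.
movies_toy = pd.DataFrame({
    "id": [1, 2],
    "original_title": ["A", "B"],
    "budget": [10, 20],
})
genres_toy = pd.DataFrame({
    "id": [1, 1, 2],
    "original_title": ["A", "A", "B"],
    "genres": ["Action", "Thriller", "Drama"],
})

# Right join: one output row per (movie, genre) pair.
merged = pd.merge(
    movies_toy,
    genres_toy,
    how="right",
    on=["id", "original_title"],
    validate="one_to_many",  # fails loudly if movies_toy has duplicate keys
)
print(merged)
```

Because budget appears once per genre, summing it over the merged frame counts multi-genre movies several times; totals per movie are safer to compute from the unmerged frame.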
Save the file after processing genres
movies_cleaned_genres.to_csv('movies_cleaned_genres.csv', index=False)
Question 2: How do the attributes differ between Universal Pictures and Paramount Pictures?
Build a subset data frame, movies_production_companies
movies_production_companies = movies[['id', 'original_title', 'production_companies']].reset_index(drop=True)
movies_production_companies.head()
  | id | original_title | production_companies
0 | 135397 | Jurassic World | Universal Studios|Amblin Entertainment|Legenda...
1 | 76341 | Mad Max: Fury Road | Village Roadshow Pictures|Kennedy Miller Produ...
2 | 262500 | Insurgent | Summit Entertainment|Mandeville Films|Red Wago...
3 | 140607 | Star Wars: The Force Awakens | Lucasfilm|Truenorth Productions|Bad Robot
4 | 168259 | Furious 7 | Universal Pictures|Original Film|Media Rights ...
Use regular expressions to match Universal and Paramount
def find_Universal(production_company):
    # Return the first company name containing "Universal", or None if there is none
    try:
        match = re.search(r"(\|{0,1}[\w\s]*Universal[\w\s]*\|{0,1})", production_company)
        if match:
            return match.group(0)
        else:
            return None
    except TypeError:
        # Missing values (NaN) are not strings, so re.search raises TypeError
        return None
filtering_Universal = set()
for text in movies_production_companies['production_companies'].tolist():
    results = find_Universal(text)
    filtering_Universal.add(results)
print(filtering_Universal)
{'Universal Pictures International ', '|Universal Music', '|Universal Studios Home Entertainment|', '|Universal|', 'Universal Studios', 'Universal Studios Home Entertainment', 'Universal Cartoon Studios|', 'Universal Productions France S', '|Universal Home Video', '|Universal Family and Home Entertainment', '|Universal City Studios|', None, 'Universal Pictures', '|Universal Pictures International ', '|Universal City Studios', 'Universal', 'NBC Universal Television', '|Universal CGI|', 'Universal Studios Home Entertainment Family Productions|', 'Universal Home Entertainment', 'Universal Pictures Corporation', 'Universal Pictures Germany GmbH', '|Universal Television', '|NBCUniversal Global Networks|', 'Universal TV|', '|Universal International Pictures ', 'Universal Studios|', '|Universal Pictures', 'Universal TV', 'Universal Cable Productions|', '|Universal Network Television|', '|Universal Pictures|', '|Universal Studios Home Entertainment', 'Universal Pictures|', '|Universal Studios Sound Facilities', '|Universal Home Entertainment', 'Universal Cartoon Studios', '|Universal 1440 Entertainment|', 'NBC Universal Television|', '|Universal 1440 Entertainment', 'Universal 1440 Entertainment', '|Universal Cartoon Studios|'}
def find_Paramount(production_company):
    # Return the first company name containing "Paramount", or None if there is none
    try:
        match = re.search(r"(\|{0,1}[\w\s]*Paramount[\w\s]*\|{0,1})", production_company)
        if match:
            return match.group(0)
        else:
            return None
    except TypeError:
        # Missing values (NaN) are not strings, so re.search raises TypeError
        return None
filtering_Paramount = set()
for text in movies_production_companies['production_companies'].tolist():
    results = find_Paramount(text)
    filtering_Paramount.add(results)
print(filtering_Paramount)
{'|Paramount Classics|', '|Paramount Pictures', 'Paramount Pictures|', None, 'Paramount Pictures', '|Paramount Pictures Digital Entertainment|', '|Paramount Television|', '|Paramount Classics', 'Paramount|', 'Paramount Famous Productions', '|Paramount Home Entertainment', 'Paramount Home Entertainment', 'Paramount Vantage', '|Paramount Vantage|', 'Paramount Vantage|', 'Paramount Pictures Digital Entertainment', '|Paramount Vantage', '|Paramount Animation', 'Paramount Classics'}
Transform the contents of the production_companies column
def modified_as_Universal_or_Paramount(production_company):
    # Label a movie 'Universal' or 'Paramount'; Universal wins if both appear
    try:
        Universal = re.search(r"(\|{0,1}[\w\s]*Universal[\w\s]*\|{0,1})", production_company)
        Paramount = re.search(r"(\|{0,1}[\w\s]*Paramount[\w\s]*\|{0,1})", production_company)
        if Universal:
            return 'Universal'
        elif Paramount:
            return 'Paramount'
        else:
            return None
    except TypeError:
        return None
# Call the function directly through the data frame
movies['production_companies'] = movies['production_companies'].apply(modified_as_Universal_or_Paramount)
movies.head()
  | id | imdb_id | popularity | budget | revenue | original_title | cast | homepage | director | tagline | keywords | overview | runtime | production_companies | release_date | vote_count | vote_average | release_year | budget_adj | revenue_adj
0 | 135397 | tt0369610 | 32.985763 | 150000000 | 1513528810 | Jurassic World | Chris Pratt|Bryce Dallas Howard|Irrfan Khan|Vi... | http://www.jurassicworld.com/ | Colin Trevorrow | The park is open. | monster|dna|tyrannosaurus rex|velociraptor|island | Twenty-two years after the events of Jurassic ... | 124 | Universal | 2015-06-09 | 5562 | 6.5 | 2015 | 1.379999e+08 | 1.392446e+09
1 | 76341 | tt1392190 | 28.419936 | 150000000 | 378436354 | Mad Max: Fury Road | Tom Hardy|Charlize Theron|Hugh Keays-Byrne|Nic... | http://www.madmaxmovie.com/ | George Miller | What a Lovely Day. | future|chase|post-apocalyptic|dystopia|australia | An apocalyptic story set in the furthest reach... | 120 | None | 2015-05-13 | 6185 | 7.1 | 2015 | 1.379999e+08 | 3.481613e+08
2 | 262500 | tt2908446 | 13.112507 | 110000000 | 295238201 | Insurgent | Shailene Woodley|Theo James|Kate Winslet|Ansel... | http://www.thedivergentseries.movie/#insurgent | Robert Schwentke | One Choice Can Destroy You | based on novel|revolution|dystopia|sequel|dyst... | Beatrice Prior must confront her inner demons ... | 119 | None | 2015-03-18 | 2480 | 6.3 | 2015 | 1.012000e+08 | 2.716190e+08
3 | 140607 | tt2488496 | 11.173104 | 200000000 | 2068178225 | Star Wars: The Force Awakens | Harrison Ford|Mark Hamill|Carrie Fisher|Adam D... | http://www.starwars.com/films/star-wars-episod... | J.J. Abrams | Every generation has a story. | android|spaceship|jedi|space opera|3d | Thirty years after defeating the Galactic Empi... | 136 | None | 2015-12-15 | 5292 | 7.5 | 2015 | 1.839999e+08 | 1.902723e+09
4 | 168259 | tt2820852 | 9.335014 | 190000000 | 1506249360 | Furious 7 | Vin Diesel|Paul Walker|Jason Statham|Michelle ... | http://www.furious7.com/ | James Wan | Vengeance Hits Home | car race|speed|revenge|suspense|car | Deckard Shaw seeks revenge against Dominic Tor... | 137 | Universal | 2015-04-01 | 2947 | 7.3 | 2015 | 1.747999e+08 | 1.385749e+09
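Because every part of the pattern around the studio name is optional, `re.search` succeeds exactly when the string contains "Universal" (or "Paramount"). The labelling can therefore be sketched with vectorized substring tests instead of `apply`. This is a minimal sketch on a toy Series; `companies`, `is_universal`, `is_paramount` and `labels` are illustrative names, not from the original code.

```python
import pandas as pd

# Toy stand-in for movies['production_companies'] (assumption).
companies = pd.Series([
    "Universal Studios|Amblin Entertainment",
    "Village Roadshow Pictures",
    "Paramount Pictures|Skydance",
    None,  # missing value
])

# na=False treats missing values as "no match" instead of propagating NaN.
is_universal = companies.str.contains("Universal", na=False)
is_paramount = companies.str.contains("Paramount", na=False)

# Start with all-missing labels, then fill in matches.
labels = pd.Series(None, index=companies.index, dtype=object)
labels[is_paramount] = "Paramount"
labels[is_universal] = "Universal"  # assigned last so Universal wins when both match
print(labels)
```

As in the function above, a movie credited to both studios ends up labelled Universal; whether that tie-break suits the comparison is a judgment call.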
Add the computed columns profit, profit_rate and profit_adj
movies['profit'] = movies['revenue'] - movies['budget']
movies['profit_rate'] = (movies['revenue'] - movies['budget']) * 100 / movies['budget']
movies['profit_adj'] = movies['revenue_adj'] - movies['budget_adj']
movies.head()
  | id | imdb_id | popularity | budget | revenue | original_title | cast | homepage | director | tagline | ... | production_companies | release_date | vote_count | vote_average | release_year | budget_adj | revenue_adj | profit | profit_rate | profit_adj
0 | 135397 | tt0369610 | 32.985763 | 150000000 | 1513528810 | Jurassic World | Chris Pratt|Bryce Dallas Howard|Irrfan Khan|Vi... | http://www.jurassicworld.com/ | Colin Trevorrow | The park is open. | ... | Universal | 2015-06-09 | 5562 | 6.5 | 2015 | 1.379999e+08 | 1.392446e+09 | 1363528810 | 909.019207 | 1.254446e+09
1 | 76341 | tt1392190 | 28.419936 | 150000000 | 378436354 | Mad Max: Fury Road | Tom Hardy|Charlize Theron|Hugh Keays-Byrne|Nic... | http://www.madmaxmovie.com/ | George Miller | What a Lovely Day. | ... | None | 2015-05-13 | 6185 | 7.1 | 2015 | 1.379999e+08 | 3.481613e+08 | 228436354 | 152.290903 | 2.101614e+08
2 | 262500 | tt2908446 | 13.112507 | 110000000 | 295238201 | Insurgent | Shailene Woodley|Theo James|Kate Winslet|Ansel... | http://www.thedivergentseries.movie/#insurgent | Robert Schwentke | One Choice Can Destroy You | ... | None | 2015-03-18 | 2480 | 6.3 | 2015 | 1.012000e+08 | 2.716190e+08 | 185238201 | 168.398365 | 1.704191e+08
3 | 140607 | tt2488496 | 11.173104 | 200000000 | 2068178225 | Star Wars: The Force Awakens | Harrison Ford|Mark Hamill|Carrie Fisher|Adam D... | http://www.starwars.com/films/star-wars-episod... | J.J. Abrams | Every generation has a story. | ... | None | 2015-12-15 | 5292 | 7.5 | 2015 | 1.839999e+08 | 1.902723e+09 | 1868178225 | 934.089113 | 1.718723e+09
4 | 168259 | tt2820852 | 9.335014 | 190000000 | 1506249360 | Furious 7 | Vin Diesel|Paul Walker|Jason Statham|Michelle ... | http://www.furious7.com/ | James Wan | Vengeance Hits Home | ... | Universal | 2015-04-01 | 2947 | 7.3 | 2015 | 1.747999e+08 | 1.385749e+09 | 1316249360 | 692.762821 | 1.210949e+09
5 rows × 23 columns
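The profit_rate formula divides by budget, so any row whose budget is 0 would produce an infinite or undefined rate. The source does not say whether movies.csv contains such rows, so the following is only a sketch of one possible guard, assuming zeros stand for "budget unknown". The `toy` frame is an illustrative assumption.

```python
import numpy as np
import pandas as pd

# Toy data (assumption): the first row uses the Jurassic World figures shown above,
# the second row has a zero budget.
toy = pd.DataFrame({
    "budget": [150000000, 0],
    "revenue": [1513528810, 500],
})

# Treat a zero budget as missing so the rate becomes NaN rather than inf.
budget = toy["budget"].replace(0, np.nan)
toy["profit_rate"] = (toy["revenue"] - budget) * 100 / budget
print(toy)
```

Rows with a NaN rate can then be excluded explicitly when plotting, instead of distorting averages.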
#movies.to_csv('movies_production_company.csv')
Question 3: How have movies based on novels performed relative to movies not based on novels? (Judged by whether the keywords variable contains "based on novel".)
Build a subset data frame, movies_novels
movies_novels = movies[['id', 'original_title', 'keywords', 'tagline']].reset_index(drop=True)
movies_novels.head()
  | id | original_title | keywords | tagline
0 | 135397 | Jurassic World | monster|dna|tyrannosaurus rex|velociraptor|island | The park is open.
1 | 76341 | Mad Max: Fury Road | future|chase|post-apocalyptic|dystopia|australia | What a Lovely Day.
2 | 262500 | Insurgent | based on novel|revolution|dystopia|sequel|dyst... | One Choice Can Destroy You
3 | 140607 | Star Wars: The Force Awakens | android|spaceship|jedi|space opera|3d | Every generation has a story.
4 | 168259 | Furious 7 | car race|speed|revenge|suspense|car | Vengeance Hits Home
Use regular expressions to find the records containing the string "novel" in the keywords and tagline columns
def find_novel_from_keywords(keywords):
    try:
        match = re.search(r'(\|{0,1}[\w\s]*novel[\w\s]*\|{0,1})', keywords)
        if match:
            return match.group(0)
        else:
            return None
    except TypeError:
        # Used to handle errors: missing values (NaN) are not strings
        return None
words_with_novel = set()
for text in movies_novels['keywords'].tolist():
    results = find_novel_from_keywords(text)
    words_with_novel.add(results)
print(words_with_novel)
{'|stolen novel|', 'based on novel|', '|tell all novel|', '|based on graphic novel', '|novelist', None, '|novelist|', '|based on graphic novel|', 'based on novel', 'based on graphic novel|', '|inspired by novel', '|based on novel|', '|based on novel'}
def find_novel_based_movie(keywords):
    # True if the keywords mention "based ... novel" or "inspired ... novel"
    try:
        match1 = re.search(r'(\|{0,1}based[\w\s]*novel\|{0,1})', keywords)
        match2 = re.search(r'(\|{0,1}inspired[\w\s]*novel\|{0,1})', keywords)
        if match1 or match2:
            return True
        else:
            return False
    except TypeError:
        return None
movies_novels['based_on_novel_0'] = movies_novels['keywords'].apply(find_novel_based_movie)
movies_novels.head()
  | id | original_title | keywords | tagline | based_on_novel_0
0 | 135397 | Jurassic World | monster|dna|tyrannosaurus rex|velociraptor|island | The park is open. | False
1 | 76341 | Mad Max: Fury Road | future|chase|post-apocalyptic|dystopia|australia | What a Lovely Day. | False
2 | 262500 | Insurgent | based on novel|revolution|dystopia|sequel|dyst... | One Choice Can Destroy You | True
3 | 140607 | Star Wars: The Force Awakens | android|spaceship|jedi|space opera|3d | Every generation has a story. | False
4 | 168259 | Furious 7 | car race|speed|revenge|suspense|car | Vengeance Hits Home | False
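The two keyword patterns can also be folded into one and applied with the vectorized `Series.str.contains`. The sketch below is a hedged alternative on toy keywords, not the original author's method; `keywords`, `pattern` and `flag` are illustrative names. The optional leading and trailing pipes are dropped because they do not change whether a search succeeds.

```python
import pandas as pd

# Toy stand-in for movies_novels['keywords'] (assumption).
keywords = pd.Series([
    "monster|dna|island",
    "based on novel|revolution|dystopia",
    "author|novelist",            # mentions a novelist, but is not an adaptation
    "space|inspired by novel",
    None,                         # missing keywords
])

# "based" or "inspired", then only word characters/whitespace, then "novel".
pattern = r"(?:based|inspired)[\w\s]*novel"
flag = keywords.str.contains(pattern)
print(flag)
```

Missing keywords stay missing in `flag`, mirroring the `None` returned by the function above; pass `na=False` to `str.contains` if they should count as not adapted.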
def find_novel_from_tagline(tagline):
    try:
        match = re.search(r'(.*novel.*)', tagline)
        if match:
            return match.group(0)
        else:
            return None
    except TypeError:
        return None
words_with_novel_tagline = set()
for text in movies_novels['tagline'].tolist():
    results = find_novel_from_tagline(text)
    words_with_novel_tagline.add(results)
print(words_with_novel_tagline)
{'Based on the novel by Henry James', 'Based on the novel of Chico Xavier', None, 'The #1 novel of the year - now a motion picture!', 'Based on the best-selling novel'}
def find_novel_based_movie_1(tagline):
    # True if the tagline mentions "novel" anywhere
    try:
        match = re.search(r'(.*novel.*)', tagline)
        if match:
            return True
        else:
            return False
    except TypeError:
        return None
movies_novels['based_on_novel_1'] = movies_novels['tagline'].apply(find_novel_based_movie_1)
movies_novels.head()
  | id | original_title | keywords | tagline | based_on_novel_0 | based_on_novel_1
0 | 135397 | Jurassic World | monster|dna|tyrannosaurus rex|velociraptor|island | The park is open. | False | False
1 | 76341 | Mad Max: Fury Road | future|chase|post-apocalyptic|dystopia|australia | What a Lovely Day. | False | False
2 | 262500 | Insurgent | based on novel|revolution|dystopia|sequel|dyst... | One Choice Can Destroy You | True | False
3 | 140607 | Star Wars: The Force Awakens | android|spaceship|jedi|space opera|3d | Every generation has a story. | False | False
4 | 168259 | Furious 7 | car race|speed|revenge|suspense|car | Vengeance Hits Home | False | False
Combine the keywords and tagline results by adding the two boolean columns
movies_novels['based_on_novel'] = movies_novels['based_on_novel_0'] + movies_novels['based_on_novel_1']
movies_novels.drop(['keywords', 'tagline', 'based_on_novel_0', 'based_on_novel_1'], axis=1, inplace=True)
movies_novels.head()
  | id | original_title | based_on_novel
0 | 135397 | Jurassic World | 0
1 | 76341 | Mad Max: Fury Road | 0
2 | 262500 | Insurgent | 1
3 | 140607 | Star Wars: The Force Awakens | 0
4 | 168259 | Furious 7 | 0
print(movies_novels['based_on_novel'].unique())
[0 1 nan 2]
movies_novels[movies_novels['based_on_novel'] == 2]
  | id | original_title | based_on_novel
10660 | 10671 | Airport | 2
movies.loc[movies["original_title"]=="Airport",["original_title","keywords","tagline"]]
  | original_title | keywords | tagline
10660 | Airport | bomb|based on novel|airport|desperation|snow s... | The #1 novel of the year - now a motion picture!
Convert to boolean values
def convert_to_bool(value):
    # 1 or 2 means at least one column matched; 0 and NaN become False
    if value == 1 or value == 2:
        return True
    else:
        return False
movies_novels['based_on_novel'] = movies_novels['based_on_novel'].apply(convert_to_bool)
movies_novels.head()
  | id | original_title | based_on_novel
0 | 135397 | Jurassic World | False
1 | 76341 | Mad Max: Fury Road | False
2 | 262500 | Insurgent | True
3 | 140607 | Star Wars: The Force Awakens | False
4 | 168259 | Furious 7 | False
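Adding the two flag columns and then mapping 1 and 2 back to True amounts to a logical OR in which missing values count as False. A minimal sketch of that OR written directly, on a toy frame (`flags` and `combined` are illustrative names, not from the original code):

```python
import pandas as pd

# Toy stand-in for the two intermediate flag columns (assumption).
flags = pd.DataFrame({
    "based_on_novel_0": [False, True, True, None],
    "based_on_novel_1": [False, False, True, None],
})

# .eq(True) yields a plain boolean Series in which None/NaN compare as False,
# so the element-wise OR never sees missing values.
combined = flags["based_on_novel_0"].eq(True) | flags["based_on_novel_1"].eq(True)
print(combined.tolist())
```

This skips the intermediate 0/1/2/NaN column; the sum-based version has the side benefit of exposing rows, such as Airport, where both columns matched.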
Merge movies and movies_novels, dropping the keywords and tagline columns before the merge
movies.drop(['keywords', 'tagline'], axis=1, inplace=True)
movies = pd.merge(movies, movies_novels, how='left', left_on=['id', 'original_title'], right_on=['id', 'original_title'])
movies.head()
  | id | imdb_id | popularity | budget | revenue | original_title | cast | homepage | director | overview | ... | release_date | vote_count | vote_average | release_year | budget_adj | revenue_adj | profit | profit_rate | profit_adj | based_on_novel
0 | 135397 | tt0369610 | 32.985763 | 150000000 | 1513528810 | Jurassic World | Chris Pratt|Bryce Dallas Howard|Irrfan Khan|Vi... | http://www.jurassicworld.com/ | Colin Trevorrow | Twenty-two years after the events of Jurassic ... | ... | 2015-06-09 | 5562 | 6.5 | 2015 | 1.379999e+08 | 1.392446e+09 | 1363528810 | 909.019207 | 1.254446e+09 | False
1 | 76341 | tt1392190 | 28.419936 | 150000000 | 378436354 | Mad Max: Fury Road | Tom Hardy|Charlize Theron|Hugh Keays-Byrne|Nic... | http://www.madmaxmovie.com/ | George Miller | An apocalyptic story set in the furthest reach... | ... | 2015-05-13 | 6185 | 7.1 | 2015 | 1.379999e+08 | 3.481613e+08 | 228436354 | 152.290903 | 2.101614e+08 | False
2 | 262500 | tt2908446 | 13.112507 | 110000000 | 295238201 | Insurgent | Shailene Woodley|Theo James|Kate Winslet|Ansel... | http://www.thedivergentseries.movie/#insurgent | Robert Schwentke | Beatrice Prior must confront her inner demons ... | ... | 2015-03-18 | 2480 | 6.3 | 2015 | 1.012000e+08 | 2.716190e+08 | 185238201 | 168.398365 | 1.704191e+08 | True
3 | 140607 | tt2488496 | 11.173104 | 200000000 | 2068178225 | Star Wars: The Force Awakens | Harrison Ford|Mark Hamill|Carrie Fisher|Adam D... | http://www.starwars.com/films/star-wars-episod... | J.J. Abrams | Thirty years after defeating the Galactic Empi... | ... | 2015-12-15 | 5292 | 7.5 | 2015 | 1.839999e+08 | 1.902723e+09 | 1868178225 | 934.089113 | 1.718723e+09 | False
4 | 168259 | tt2820852 | 9.335014 | 190000000 | 1506249360 | Furious 7 | Vin Diesel|Paul Walker|Jason Statham|Michelle ... | http://www.furious7.com/ | James Wan | Deckard Shaw seeks revenge against Dominic Tor... | ... | 2015-04-01 | 2947 | 7.3 | 2015 | 1.747999e+08 | 1.385749e+09 | 1316249360 | 692.762821 | 1.210949e+09 | False
5 rows × 22 columns
movies.to_csv('movies_cleaned.csv',index= False)
References
1. Tidy data in Python http://www.jeannicholashould.com/tidy-data-in-python.html
2. Udacity
3. https://www.youtube.com/watch?v=2CwzOjYbi-w&list=PLXbU-2B80FvCKj0aqdpudCqpif2vNuING&index=1
4. Original data link https://d17h27t6h515a5.cloudfront.net/topher/2017/January/587e7057_movies/movies.csv