Scraping various features in twitter data
[Title]: Scraping various features in twitter data [Posted]: 2021-12-27 08:59:43 [Question]: I am trying to extract Twitter data but am running into an error. I am using tweepy to extract the following features:
'retweeted_status', 'hashtags', 'text', 'urls', 'user_mentions', 'screen_name', 'id', 'created_at', 'country', 'state', 'place', 'hashtag_count', 'url_count', 'mention_count', 'possibly_sensitive', 'favorite_count', 'favorited', 'retweet_count', 'retweeted', 'user.statuses_count', 'user.favourites_count', 'user.followers_count', 'user_description', 'user_location', 'user_time_zone'
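It would be very helpful if someone could help me debug the error below, or suggest an alternative approach in Python for extracting the features above.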
%matplotlib inline
import numpy as np
import scipy as sp
import matplotlib as mpl
import matplotlib.cm as cm
import matplotlib.pyplot as plt
import pandas as pd
pd.set_option('display.width', 500)
pd.set_option('display.max_columns', 100)
pd.set_option('display.notebook_repr_html', True)
import seaborn as sns
sns.set_style("whitegrid")
sns.set_context("poster")
import csv
from collections import Counter
import ast
import tweepy
import json
from tweepy import OAuthHandler
consumer_key = 'xxxxxxxxx'
consumer_secret = 'xxxxxxxxx'
access_key= 'xxxxxxxxx'
access_secret = 'xxxxxxxxx'
auth = OAuthHandler(consumer_key, consumer_secret)
auth.set_access_token(access_key, access_secret)
api = tweepy.API(auth)
from tweepy import Stream
#from tweepy.streaming import StreamListener
# get retweet status
def try_retweet(status, attribute):
    try:
        if getattr(status, attribute):
            return True
    except AttributeError:
        return None
# get country status
def try_country(status, attribute):
    if getattr(status, attribute) != None:
        place = getattr(status, attribute)
        return place.country
    return None
# get city status
def try_city(status, attribute):
    if getattr(status, attribute) != None:
        place = getattr(status, attribute)
        return place.full_name
    return None
# function that tries to get attribute from object
def try_get(status, attribute):
    try:
        return getattr(status, attribute).encode('utf-8')
    except AttributeError:
        return None
# open csv file
csvFile = open('originalsample.csv', 'a')
# create csv writer
csvWriter = csv.writer(csvFile)
class MyListener(Stream):
    def on_status(self, status):
        try:
            # if this represents a retweet
            if try_retweet(status,'retweeted_status'):
                status = status.retweeted_status
            # get and sanitize hashtags
            hashtags = status.entities['hashtags']
            hashtag_list = []
            for el in hashtags:
                hashtag_list.append(el['text'])
            hashtag_count = len(hashtag_list)
            # get and sanitize urls
            urls = status.entities['urls']
            url_list = []
            for el in urls:
                url_list.append(el['url'])
            url_count = len(url_list)
            # get and sanitize user_mentions
            user_mentions = status.entities['user_mentions']
            mention_list = []
            for el in user_mentions:
                mention_list.append(el['screen_name'])
            mention_count = len(mention_list)
            # save it all as a tweet
            tweet = [status.id, status.created_at, try_country(status, 'place'), try_city(status, 'place'), status.text.encode('utf-8'), status.lang,
                     hashtag_list, url_list, mention_list,
                     hashtag_count, url_count, mention_count,
                     try_get(status, 'possibly_sensitive'),
                     status.favorite_count, status.favorited, status.retweet_count, status.retweeted,
                     status.user.statuses_count,
                     status.user.favourites_count,
                     status.user.followers_count,
                     try_get(status.user, 'description'),
                     try_get(status.user, 'location'),
                     try_get(status.user, 'time_zone')]
            # write to csv
            csvWriter.writerow(tweet)
        except BaseException as e:
            print("Error on_data: %s" % str(e))
        return True
    # tell us if there's an error
    def on_request_error(self, status):
        print(status)
        return True
twitter_stream = Stream(auth, MyListener())
twitter_stream.sample()
The expected output format is as follows:
Row 0, one column per line:
id: 669227044996124673
created_at: 2015-11-24 18:52:15
country: NaN
city: NaN
text: Yo ???????????????????? ' '
lang: und
hashtags: []
urls: []
user_mentions: []
hashtag_count: 0
url_count: 0
mention_count: 0
possibly_sensitive: NaN
favorite_count: 270
favorited: False
retweet_count: 288
retweeted: False
user_statuses_count: 10726
user_favorites_count: 18927
user_follower_count: 24429
user_description: NaN
user_location: Yucatán, México
user_timezone: Mexico City
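A row with these columns can be assembled from a status object in one place. The following is only a minimal sketch, not the asker's code: `status_to_row` is a hypothetical helper, and the `SimpleNamespace` object merely stands in for a tweepy status so that the example runs without network access. It assumes the attribute names used in the question (`entities`, `place`, `user`, and so on).

```python
from types import SimpleNamespace


def status_to_row(status):
    """Flatten a status-like object into the 23 columns listed above.

    Hypothetical helper: getattr() with a default stands in for the
    try/except wrappers, so optional fields come back as None.
    """
    entities = getattr(status, "entities", None) or {}
    hashtags = [h["text"] for h in entities.get("hashtags", [])]
    urls = [u["url"] for u in entities.get("urls", [])]
    mentions = [m["screen_name"] for m in entities.get("user_mentions", [])]
    place = getattr(status, "place", None)  # None when the tweet has no place
    user = status.user
    return {
        "id": status.id,
        "created_at": status.created_at,
        "country": place.country if place else None,
        "city": place.full_name if place else None,
        "text": status.text,
        "lang": getattr(status, "lang", None),
        "hashtags": hashtags,
        "urls": urls,
        "user_mentions": mentions,
        "hashtag_count": len(hashtags),
        "url_count": len(urls),
        "mention_count": len(mentions),
        "possibly_sensitive": getattr(status, "possibly_sensitive", None),
        "favorite_count": status.favorite_count,
        "favorited": status.favorited,
        "retweet_count": status.retweet_count,
        "retweeted": status.retweeted,
        "user_statuses_count": user.statuses_count,
        "user_favorites_count": user.favourites_count,
        "user_follower_count": user.followers_count,
        "user_description": getattr(user, "description", None),
        "user_location": getattr(user, "location", None),
        "user_timezone": getattr(user, "time_zone", None),
    }


# Made-up stand-in for a tweepy status; every value is illustrative only.
fake_status = SimpleNamespace(
    id=1, created_at="2021-12-27 08:59:43", text="hello #tweepy", lang="en",
    entities={"hashtags": [{"text": "tweepy"}], "urls": [], "user_mentions": []},
    place=None, favorite_count=0, favorited=False,
    retweet_count=0, retweeted=False,
    user=SimpleNamespace(statuses_count=10, favourites_count=2,
                         followers_count=3, location="Somewhere"),
)

row = status_to_row(fake_status)
print(row["hashtags"], row["hashtag_count"], row["country"])
```

Returning a dict keyed by column name would also let `csv.DictWriter` emit a header row, which the positional list in the question cannot do.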
It shows the following error:
---------------------------------------------------------------------------
TypeError Traceback (most recent call last)
<ipython-input-8-c016fb9faa9c> in <module>
92 return True
93
---> 94 twitter_stream = Stream(auth, MyListener())
95 twitter_stream.sample()
TypeError: __init__() missing 4 required positional arguments: 'consumer_key', 'consumer_secret', 'access_token', and 'access_token_secret'
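One way to read this traceback: Python evaluates the inner call `MyListener()` before the outer `Stream(...)` call, and `MyListener` inherits its constructor from `Stream`, so the call with no arguments appears to be the one that raises (all four parameters are reported missing). The sketch below reproduces the same kind of error without tweepy. `FakeStream` is a hypothetical stand-in whose constructor only mirrors the four parameter names from the error message.

```python
class FakeStream:
    """Stand-in with the four required parameters named in the traceback."""

    def __init__(self, consumer_key, consumer_secret,
                 access_token, access_token_secret):
        self.credentials = (consumer_key, consumer_secret,
                            access_token, access_token_secret)


class MyListener(FakeStream):
    pass  # no __init__ of its own, so FakeStream.__init__ is used


try:
    MyListener()  # same shape as the inner call in Stream(auth, MyListener())
except TypeError as exc:
    error_message = str(exc)
    print(error_message)

# Supplying all four values satisfies the inherited constructor.
listener = MyListener("key", "secret", "token", "token_secret")
```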
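[Comments]:
Please do not change your question into an entirely new error. Instead, post a new question, or add a new section to this one showing how your debugging has progressed.
[Answer 1]: StreamListener was merged into Stream in Tweepy v4.0.0 (see the "Where did StreamListener go?" section of the documentation). You now need to subclass Stream instead, and on_error was changed to on_request_error.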
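As a rough sketch of the change this answer describes, assuming Tweepy 4.x where `tweepy.Stream` is available: the callbacks live directly on a `Stream` subclass, and the error hook is named `on_request_error`. `StatusHandlers` and `build_listener_class` are hypothetical names chosen for this example. The callbacks sit in a plain mixin and tweepy is imported lazily, so the callback logic can be exercised without the library or a network connection.

```python
class StatusHandlers:
    """Callbacks named as the answer says Tweepy 4.x's Stream expects."""

    def on_status(self, status):
        # Called for each incoming tweet.
        self.last_status_id = status.id
        print(status.id, status.text)

    def on_request_error(self, status_code):
        # Takes over from the old StreamListener.on_error, per the answer.
        self.last_error = status_code
        print(f"Stream request failed with HTTP status {status_code}")


def build_listener_class():
    """Combine the callbacks with tweepy.Stream.

    Assumption: a Tweepy 4.x release that still ships tweepy.Stream
    is installed; nothing here runs until this function is called.
    """
    import tweepy  # lazy import: only needed when actually streaming

    class MyListener(StatusHandlers, tweepy.Stream):
        pass

    return MyListener
```

With real credentials, something like `build_listener_class()(consumer_key, consumer_secret, access_token, access_token_secret).sample()` should start the sample stream the question is after.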
[Comments]:
@D Malan I am new to this; could you please help me modify the parts that need to change? Thanks.
Replace StreamListener with Stream, and on_error with on_request_error.
@D Malan I have done that and now get the following error: ------------------------------ TypeError Traceback (most recent call last)
Pass consumer_key and the other credentials to Stream rather than to OAuthHandler. You can read the streaming documentation here and here.
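To illustrate that last comment, here is a minimal sketch of handing the four credentials to the stream class itself rather than to `OAuthHandler`. The environment variable names and the helpers `load_credentials` and `make_stream` are assumptions made for this example and are not part of tweepy. `listener_cls` is expected to be a `tweepy.Stream` subclass under Tweepy 4.x.

```python
import os

# Hypothetical variable names, in the order the traceback lists the parameters.
CREDENTIAL_VARS = (
    "TWITTER_CONSUMER_KEY",
    "TWITTER_CONSUMER_SECRET",
    "TWITTER_ACCESS_TOKEN",
    "TWITTER_ACCESS_TOKEN_SECRET",
)


def load_credentials(env=None):
    """Return the four credentials as a tuple, failing early if any is unset."""
    env = os.environ if env is None else env
    missing = [name for name in CREDENTIAL_VARS if not env.get(name)]
    if missing:
        raise KeyError("missing credentials: " + ", ".join(missing))
    return tuple(env[name] for name in CREDENTIAL_VARS)


def make_stream(listener_cls, env=None):
    """Instantiate the listener with the credentials as positional arguments.

    listener_cls is assumed to accept (consumer_key, consumer_secret,
    access_token, access_token_secret), as the error message indicates.
    """
    return listener_cls(*load_credentials(env))
```

Under these assumptions, `make_stream(MyListener).sample()` would take the place of the two lines `Stream(auth, MyListener())` and `twitter_stream.sample()` from the question.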