Scraping various features in twitter data

Posted: 2021-12-27 08:59:43

Question:

I am trying to extract Twitter data but am running into an error. I am using tweepy to extract the following features:

'retweeted_status', 'hashtags', 'text', 'urls', 'user_mentions', 'screen_name', 'id', 'created_at', 'country', 'state', 'place', 'hashtag_count', 'url_count', 'mention_count', 'possibly_sensitive', 'favorite_count', 'favorited', 'retweet_count', 'retweeted', 'user.statuses_count', 'user.favourites_count', 'user.followers_count', 'user_description', 'user_location', 'user_time_zone'

Help debugging the error below, or an alternative way in Python to extract the features above, would be much appreciated.

%matplotlib inline
import numpy as np
import scipy as sp
import matplotlib as mpl
import matplotlib.cm as cm
import matplotlib.pyplot as plt
import pandas as pd
pd.set_option('display.width', 500)
pd.set_option('display.max_columns', 100)
pd.set_option('display.notebook_repr_html', True)
import seaborn as sns
sns.set_style("whitegrid")
sns.set_context("poster")
import csv


from collections import Counter
import ast

import tweepy
import json
from tweepy import OAuthHandler

consumer_key =    'xxxxxxxxx'       
consumer_secret = 'xxxxxxxxx'       
access_key=       'xxxxxxxxx'         
access_secret =   'xxxxxxxxx' 

auth = OAuthHandler(consumer_key, consumer_secret)
auth.set_access_token(access_key, access_secret)

api = tweepy.API(auth)

from tweepy import Stream
#from tweepy.streaming import StreamListener



# get retweet status
def try_retweet(status, attribute):
    try:
        if getattr(status, attribute):
            return True
    except AttributeError:
        return None

# get country status
def try_country(status, attribute):
    if getattr(status, attribute) != None:
        place = getattr(status, attribute)
        return place.country
    return None

# get city status
def try_city(status, attribute):
    if getattr(status, attribute) != None:
        place = getattr(status, attribute)
        return place.full_name
    return None

# function that tries to get attribute from object
def try_get(status, attribute):
    try:
        return getattr(status, attribute).encode('utf-8')
    except AttributeError:
        return None

# open csv file
csvFile = open('originalsample.csv', 'a')

# create csv writer
csvWriter = csv.writer(csvFile)

class MyListener(Stream):

    def on_status(self, status):
        try:
            # if this represents a retweet
            if try_retweet(status,'retweeted_status'):
                status = status.retweeted_status

                # get and sanitize hashtags 
                hashtags = status.entities['hashtags']
                hashtag_list = []
                for el in hashtags:
                    hashtag_list.append(el['text'])
                hashtag_count = len(hashtag_list)

                # get and sanitize urls
                urls = status.entities['urls']
                url_list = []
                for el in urls:
                    url_list.append(el['url'])
                url_count = len(url_list)

                # get and sanitize user_mentions
                user_mentions = status.entities['user_mentions']
                mention_list = []
                for el in user_mentions:
                    mention_list.append(el['screen_name'])
                mention_count = len(mention_list)

                # save it all as a tweet
                tweet = [status.id, status.created_at, try_country(status, 'place'), try_city(status, 'place'), status.text.encode('utf-8'), status.lang,
                  hashtag_list, url_list, mention_list, 
                  hashtag_count, url_count, mention_count, 
                  try_get(status, 'possibly_sensitive'),
                  status.favorite_count, status.favorited, status.retweet_count, status.retweeted, 
                  status.user.statuses_count, 
                  status.user.favourites_count, 
                  status.user.followers_count,
                  try_get(status.user, 'description'),
                  try_get(status.user, 'location'),
                  try_get(status.user, 'time_zone')]

                # write to csv
                csvWriter.writerow(tweet)
        except BaseException as e:
            print("Error on_data: %s" % str(e))
        return True

    # tell us if there's an error
    def on_request_error(self, status):
        print(status)
        return True

twitter_stream = Stream(auth, MyListener())
twitter_stream.sample()

The expected output format is as follows:

                      id    created_at       country    city    text                                   lang   hashtags  urls user_mentions  hashtag_count   url_count   mention_count   possibly_sensitive  favorite_count  favorited   retweet_count   retweeted   user_statuses_count user_favorites_count    user_follower_count user_description    user_location   user_timezone   
     0  669227044996124673  2015-11-24 18:52:15  NaN     NaN     Yo ???????????????????? '               '    und []  []  []                0          0             0                NaN                      270      False               288   False                   10726                 18927          24429                      NaN                 Yucatán, México Mexico City

It shows the following error:

    ---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
<ipython-input-8-c016fb9faa9c> in <module>
     92         return True
     93 
---> 94 twitter_stream = Stream(auth, MyListener())
     95 twitter_stream.sample()

TypeError: __init__() missing 4 required positional arguments: 'consumer_key', 'consumer_secret', 'access_token', and 'access_token_secret'

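Comments:

Please don't change your question into an entirely new error. Instead, create a new question or add a new section to the question to show how your debugging has progressed.

Answer 1: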

StreamListener was merged into Stream in Tweepy v4.0.0 (see "Where did StreamListener go?" in the documentation).

You now need to subclass Stream, and on_error has been renamed to on_request_error.
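A minimal sketch of what that change can look like, assuming Tweepy v4.x, where `tweepy.Stream` is subclassed directly and calls `on_status` and `on_request_error` on the subclass. The names `status_to_row` and `build_listener_class` are made up for this illustration, and the import is deferred into the function only so the extraction logic can be tried without Tweepy installed. The row is a shortened subset of the question's columns and, unlike the question's code, it is also built for tweets that are not retweets.

```python
def status_to_row(status):
    """Flatten a status object into a list of fields (subset of the question's columns)."""
    # Use the original tweet when this status is a retweet.
    status = getattr(status, "retweeted_status", None) or status
    entities = getattr(status, "entities", None) or {}
    hashtags = [h["text"] for h in entities.get("hashtags", [])]
    urls = [u["url"] for u in entities.get("urls", [])]
    mentions = [m["screen_name"] for m in entities.get("user_mentions", [])]
    place = getattr(status, "place", None)
    return [
        status.id,
        place.country if place else None,
        place.full_name if place else None,
        status.text,
        hashtags, urls, mentions,
        len(hashtags), len(urls), len(mentions),
        getattr(status, "possibly_sensitive", None),
    ]


def build_listener_class(handle_row):
    """Return a tweepy.Stream subclass that passes each flattened status to handle_row."""
    import tweepy  # deferred import; Tweepy v4.x is assumed

    class MyListener(tweepy.Stream):
        def on_status(self, status):
            handle_row(status_to_row(status))

        # This callback was named on_error before v4.0.0; it receives the HTTP status code.
        def on_request_error(self, status_code):
            print("Stream request failed with HTTP status", status_code)

    return MyListener
```

In the question's setup, `handle_row` could simply be `csvWriter.writerow`.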

Comments:

@D Malan I'm new to this. Could you please help me modify the parts that need to change? Thanks.

Replace StreamListener with Stream, and on_error with on_request_error.

@D Malan I have done that and now get the following error at line 94, twitter_stream = Stream(auth, MyListener()): TypeError: __init__() missing 4 required positional arguments: 'consumer_key', 'consumer_secret', 'access_token', and 'access_token_secret'

You now need to pass consumer_key etc. to Stream instead of to OAuthHandler. You can read more in the Tweepy streaming documentation.
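The last comment can be sketched as follows, assuming Tweepy v4.x, whose `Stream` constructor takes the four credentials named in the traceback. `start_sample_stream` and its `stream_cls` parameter are hypothetical names used only for this illustration; in real use `stream_cls` would be a `tweepy.Stream` subclass such as `MyListener`.

```python
def start_sample_stream(stream_cls, consumer_key, consumer_secret,
                        access_token, access_token_secret):
    """Create the stream with the credentials passed directly, then start sampling."""
    # No OAuthHandler here: in Tweepy v4.x the Stream itself takes the four keys.
    stream = stream_cls(consumer_key, consumer_secret,
                        access_token, access_token_secret)
    stream.sample()  # with a real tweepy.Stream this blocks while connected
    return stream
```

With the question's variables, the call would be `start_sample_stream(MyListener, consumer_key, consumer_secret, access_key, access_secret)`, which takes the place of `Stream(auth, MyListener())`.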
