Parse This Custom Twitter Capture Data With Python and Create Statistics

Posted: 2011-07-21 00:49:51

Question:

I am trying to gather Twitter statistics from a specific dataset that was provided to me. I have no control over how the data is formatted before it is given to me, so I am stuck with this mess.

I would like some suggestions on how to build a Python program that parses this sort of input and outputs something more along the lines of a CSV file, with the field titles as a header row and the values below.

I want to use Python because eventually I would like to use some statistical tools that I have already put together.

Additionally, CSV-style output is preferred because I may feed it into something like SPSS for statistical verification.

Here is a sample of a single post from the dataset:

{"text":"A gente todos os dias arruma os cabelos: por que não o coração?","contributors":null,"geo":null,"retweeted":false,"in_reply_to_screen_name":null,"truncated":false,"entities":{"urls":[],"hashtags":[],"user_mentions":[]},"in_reply_to_status_id_str":null,"id":50270714498002945,"source":"web","in_reply_to_user_id_str":null,"favorited":false,"in_reply_to_status_id":null,"created_at":"Tue Mar 22 19:00:46 +0000 2011","in_reply_to_user_id":null,"retweet_count":0,"id_str":"50270714498002945","place":null,"user":{"location":"Brasil, Recife-PE","statuses_count":16,"profile_background_tile":true,"lang":"en","profile_link_color":"867c5f","id":59154474,"following":null,"favourites_count":0,"protected":false,"profile_text_color":"91957f","verified":false,"contributors_enabled":false,"description":"","profile_sidebar_border_color":"eae2bc","name":"Natalia Aráujo","profile_background_color":"eae2bc","created_at":"Wed Jul 22 15:27:15 +0000 2009","followers_count":10,"geo_enabled":false,"profile_background_image_url":"http://a3.twimg.com/profile_background_images/220796682/music-2.png","follow_request_sent":null,"url":null,"utc_offset":-10800,"time_zone":"Brasilia","notifications":null,"profile_use_background_image":true,"friends_count":18,"profile_sidebar_fill_color":"eae2bc","screen_name":"nat_araujo","id_str":"59154474","show_all_inline_media":false,"profile_image_url":"http://a0.twimg.com/profile_images/1247378890/154254_normal.JPG","listed_count":1,"is_translator":false},"coordinates":null}

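The dataset is one continuous line, with no line returns between posts. The only delimiter between the actual posts is that every post begins with

{"text":

and ends with

null}

Any suggestions would be appreciated, and I would of course be glad to share my results with everyone.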


EDIT

Based on what everyone has said, I have started with the following:

import sys
import json
from pprint import pprint

if len(sys.argv) != 2:
    print 'To Use: twitterjson2cvs.py (path/filename)'
    sys.exit()

inputfile = open(sys.argv[1])
jsondatain = json.load(inputfile)
pprint(jsondatain)
inputfile.close()

It outputs something a bit cleaner, in the form of:

{u'contributors': None,
 u'coordinates': None,
 u'created_at': u'Tue Mar 22 19:00:46 +0000 2011',
 u'entities': {u'hashtags': [], u'urls': [], u'user_mentions': []},
 u'favorited': False,
 u'geo': None,
 u'id': 50270714498002945L,
 u'id_str': u'50270714498002945',
 u'in_reply_to_screen_name': None,
 u'in_reply_to_status_id': None,
 u'in_reply_to_status_id_str': None,
 u'in_reply_to_user_id': None,
 u'in_reply_to_user_id_str': None,
 u'place': None,
 u'retweet_count': 0,
 u'retweeted': False,
 u'source': u'web',
 u'text': u'A gente todos os dias arruma os cabelos: por que n\xe3o o cora\xe7\xe3o?',
 u'truncated': False,
 u'user': {u'contributors_enabled': False,
           u'created_at': u'Wed Jul 22 15:27:15 +0000 2009',
           u'description': u'',
           u'favourites_count': 0,
           u'follow_request_sent': None,
           u'followers_count': 10,
           u'following': None,
           u'friends_count': 18,
           u'geo_enabled': False,
           u'id': 59154474,
           u'id_str': u'59154474',
           u'is_translator': False,
           u'lang': u'en',
           u'listed_count': 1,
           u'location': u'Brasil, Recife-PE',
           u'name': u'Natalia Ar\xe1ujo',
           u'notifications': None,
           u'profile_background_color': u'eae2bc',
           u'profile_background_image_url': u'http://a3.twimg.com/profile_background_images/220796682/music-2.png',
           u'profile_background_tile': True,
           u'profile_image_url': u'http://a0.twimg.com/profile_images/1247378890/154254_normal.JPG',
           u'profile_link_color': u'867c5f',
           u'profile_sidebar_border_color': u'eae2bc',
           u'profile_sidebar_fill_color': u'eae2bc',
           u'profile_text_color': u'91957f',
           u'profile_use_background_image': True,
           u'protected': False,
           u'screen_name': u'nat_araujo',
           u'show_all_inline_media': False,
           u'statuses_count': 16,
           u'time_zone': u'Brasilia',
           u'url': None,
           u'utc_offset': -10800,
           u'verified': False}}

EDIT

I have added to the previous code in an attempt to output to a CSV file:

import sys
import json
#from pprint import pprint
import csv

if len(sys.argv) != 2:
    print 'To Use: twitterjson2cvs.py (path/filename)'
    sys.exit()

inputfile = open(sys.argv[1])
jsondatain = json.load(inputfile)

f=csv.writer(open("test.csv","wb+"))
f.writerow(["contributors","coordinates","created_at","entities","hashtags","urls","user_mentions","favorited","geo","id","id_str","in_reply_to_screen_name","in_reply_to_status_id","in_reply_to_status_id_str","in_reply_to_user_id","in_reply_to_user_id_str","place","retweet_count","retweeted","source","text","truncated","user","contributors_enabled","created_at","description","favourites_count","follow_request_sent","followers_count","following","friends_count","geo_enabled","id","id_str","is_translator","lang","listed_count","location","name","notifications","profile_background_color","profile_background_image_url","profile_background_tile","profile_image_url","profile_link_color","profile_sidebar_border_color","profile_sidebar_fill_color","profile_text_color","profile_use_background_image","protected","screen_name","show_all_inline_media","statuses_count","time_zone","url","utc_offset","verified"])
for x in jsondatain:
    f.writerow([x["contributors"],x["fields"]["coordinates"],x["fields"]["created_at"],x["fields"]["entities"],x["fields"]["hashtags"],x["fields"]["urls"],x["fields"]["user_mentions"],x["fields"]["favorited"],x["fields"]["geo"],x["fields"]["id"],x["fields"]["id_str"],x["fields"]["in_reply_to_screen_name"],x["fields"]["in_reply_to_status_id"],x["fields"]["in_reply_to_status_id_str"],x["fields"]["in_reply_to_user_id"],x["fields"]["in_reply_to_user_id_str"],x["fields"]["place"],x["fields"]["retweet_count"],x["fields"]["retweeted"],x["fields"]["source"],x["fields"]["text"],x["fields"]["truncated"],x["fields"]["user"],x["fields"]["contributors_enabled"],x["fields"]["created_at"],x["fields"]["description"],x["fields"]["favourites_count"],x["fields"]["follow_request_sent"],x["fields"]["followers_count"],x["fields"]["following"],x["fields"]["friends_count"],x["fields"]["geo_enabled"],x["fields"]["id"],x["fields"]["id_str"],x["fields"]["is_translator"],x["fields"]["lang"],x["fields"]["listed_count"],x["fields"]["location"],x["fields"]["name"],x["fields"]["notifications"],x["fields"]["profile_background_color"],x["fields"]["profile_background_image_url"],x["fields"]["profile_background_tile"],x["fields"]["profile_image_url"],x["fields"]["profile_link_color"],x["fields"]["profile_sidebar_border_color"],x["fields"]["profile_sidebar_fill_color"],x["fields"]["profile_text_color"],x["fields"]["profile_use_background_image"],x["fields"]["protected"],x["fields"]["screen_name"],x["fields"]["show_all_inline_media"],x["fields"]["statuses_count"],x["fields"]["time_zone"],x["fields"]["url"],x["fields"]["utc_offset"],x["fields"]["verified"]])
#pprint(jsondatain)
inputfile.close()

However, when I run it I get:

  File "twitterjson2cvs.py", line 28, in <module>
    f.writerow([x["contributors"],x["fields"]["coordinates"],x["fields"]["created_at"],x["fields"]["entities"],x["fields"]["hashtags"],x["fields"]["urls"],x["fields"]["user_mentions"],x["fields"]["favorited"],x["fields"]["geo"],x["fields"]["id"],x["fields"]["id_str"],x["fields"]["in_reply_to_screen_name"],x["fields"]["in_reply_to_status_id"],x["fields"]["in_reply_to_status_id_str"],x["fields"]["in_reply_to_user_id"],x["fields"]["in_reply_to_user_id_str"],x["fields"]["place"],x["fields"]["retweet_count"],x["fields"]["retweeted"],x["fields"]["source"],x["fields"]["text"],x["fields"]["truncated"],x["fields"]["user"],x["fields"]["contributors_enabled"],x["fields"]["created_at"],x["fields"]["description"],x["fields"]["favourites_count"],x["fields"]["follow_request_sent"],x["fields"]["followers_count"],x["fields"]["following"],x["fields"]["friends_count"],x["fields"]["geo_enabled"],x["fields"]["id"],x["fields"]["id_str"],x["fields"]["is_translator"],x["fields"]["lang"],x["fields"]["listed_count"],x["fields"]["location"],x["fields"]["name"],x["fields"]["notifications"],x["fields"]["profile_background_color"],x["fields"]["profile_background_image_url"],x["fields"]["profile_background_tile"],x["fields"]["profile_image_url"],x["fields"]["profile_link_color"],x["fields"]["profile_sidebar_border_color"],x["fields"]["profile_sidebar_fill_color"],x["fields"]["profile_text_color"],x["fields"]["profile_use_background_image"],x["fields"]["protected"],x["fields"]["screen_name"],x["fields"]["show_all_inline_media"],x["fields"]["statuses_count"],x["fields"]["time_zone"],x["fields"]["url"],x["fields"]["utc_offset"],x["fields"]["verified"]])
TypeError: string indices must be integers

The error has something to do with how the fields are formatted, but I am not seeing it.


EDIT

I updated the code to reflect your formatting suggestion, as follows:

import sys
import json
import csv

if len(sys.argv) != 2:
    print 'To Use: twitterjson2cvs.py (path/filename)'
    sys.exit()

inputfile = open(sys.argv[1])
jsondatain = json.load(inputfile)

f=csv.writer(open("test.csv","wb+"))
f.writerow(["contributors","coordinates","created_at","entities","hashtags","urls","user_mentions","favorited","geo","id","id_str","in_reply_to_screen_name","in_reply_to_status_id","in_reply_to_status_id_str","in_reply_to_user_id","in_reply_to_user_id_str","place","retweet_count","retweeted","source","text","truncated","user","contributors_enabled","created_at","description","favourites_count","follow_request_sent","followers_count","following","friends_count","geo_enabled","id","id_str","is_translator","lang","listed_count","location","name","notifications","profile_background_color","profile_background_image_url","profile_background_tile","profile_image_url","profile_link_color","profile_sidebar_border_color","profile_sidebar_fill_color","profile_text_color","profile_use_background_image","protected","screen_name","show_all_inline_media","statuses_count","time_zone","url","utc_offset","verified"])
for x in jsondatain:
    f.writerow(
        (
            x['contributors'],
            x['coordinates'],
            x['created_at'],
            x['entities']['hashtags'],
            x['entities']['urls'],
            x['entities']['user_mentions'],
            x['favorited'],
            x['geo'],
            x['id'],
            x['id_str'],
            x['in_reply_to_screen_name'],
            x['in_reply_to_status_id'],
            x['in_reply_to_status_id_str'],
            x['in_reply_to_user_id'],
            x['in_reply_to_user_id_str'],
            x['place'],
            x['retweet_count'],
            x['retweeted'],
            x['source'],
            x['text'].encode('utf8'),
            x['truncated'],
            x['user']['contributors_enabled'],
            x['user']['created_at'],
            x['user']['description'],
            x['user']['favourites_count'],
            x['user']['follow_request_sent'],
            x['user']['followers_count'],
            x['user']['following'],
            x['user']['friends_count'],
            x['user']['geo_enabled'],
            x['user']['id'],
            x['user']['id_str'],
            x['user']['is_translator'],
            x['user']['lang'],
            x['user']['listed_count'],
            x['user']['location'],
            x['user']['name'].encode('utf8'),
            x['user']['notifications'],
            x['user']['profile_background_color'],
            x['user']['profile_background_image_url'],
            x['user']['profile_background_tile'],
            x['user']['profile_image_url'],
            x['user']['profile_link_color'],
            x['user']['profile_sidebar_border_color'],
            x['user']['profile_sidebar_fill_color'],
            x['user']['profile_text_color'],
            x['user']['profile_use_background_image'],
            x['user']['protected'],
            x['user']['screen_name'],
            x['user']['show_all_inline_media'],
            x['user']['statuses_count'],
            x['user']['time_zone'],
            x['user']['url'],
            x['user']['utc_offset'],
            x['user']['verified']
        )
    )
inputfile.close()

I am still getting the following error:

twitterjson2cvs.py TweetFile1300820340639.tcm.online
Traceback (most recent call last):
  File "workspace/coalmine-datafilter/src/twitterjson2csv.py", line 30, in <module>
    x['contributors'],
TypeError: string indices must be integers

EDIT

Everything now works well for a single JSON-formatted input file. Feeding the sample JSON string above into this program:

import sys
import json
import csv

if len(sys.argv) != 2:
    print 'To Use: twitterjson2cvs.py (path/filename)'
    sys.exit()

inputfile = open(sys.argv[1])
jsonindata = json.load(inputfile)

f=csv.writer(open("test.csv","wb+"))
f.writerow(["contributors","coordinates","created_at","entities","hashtags","urls","user_mentions","favorited","geo","id","id_str","in_reply_to_screen_name","in_reply_to_status_id","in_reply_to_status_id_str","in_reply_to_user_id","in_reply_to_user_id_str","place","retweet_count","retweeted","source","text","truncated","user","contributors_enabled","created_at","description","favourites_count","follow_request_sent","followers_count","following","friends_count","geo_enabled","id","id_str","is_translator","lang","listed_count","location","name","notifications","profile_background_color","profile_background_image_url","profile_background_tile","profile_image_url","profile_link_color","profile_sidebar_border_color","profile_sidebar_fill_color","profile_text_color","profile_use_background_image","protected","screen_name","show_all_inline_media","statuses_count","time_zone","url","utc_offset","verified"])
f.writerow(
    (
        jsonindata['contributors'],
        jsonindata['coordinates'],
        jsonindata['created_at'],
        jsonindata['entities']['hashtags'],
        jsonindata['entities']['urls'],
        jsonindata['entities']['user_mentions'],
        jsonindata['favorited'],
        jsonindata['geo'],
        jsonindata['id'],
        jsonindata['id_str'],
        jsonindata['in_reply_to_screen_name'],
        jsonindata['in_reply_to_status_id'],
        jsonindata['in_reply_to_status_id_str'],
        jsonindata['in_reply_to_user_id'],
        jsonindata['in_reply_to_user_id_str'],
        jsonindata['place'],
        jsonindata['retweet_count'],
        jsonindata['retweeted'],
        jsonindata['source'],
        jsonindata['text'].encode('utf8'),
        jsonindata['truncated'],
        jsonindata['user']['contributors_enabled'],
        jsonindata['user']['created_at'],
        jsonindata['user']['description'],
        jsonindata['user']['favourites_count'],
        jsonindata['user']['follow_request_sent'],
        jsonindata['user']['followers_count'],
        jsonindata['user']['following'],
        jsonindata['user']['friends_count'],
        jsonindata['user']['geo_enabled'],
        jsonindata['user']['id'],
        jsonindata['user']['id_str'],
        jsonindata['user']['is_translator'],
        jsonindata['user']['lang'],
        jsonindata['user']['listed_count'],
        jsonindata['user']['location'],
        jsonindata['user']['name'].encode('utf8'),
        jsonindata['user']['notifications'],
        jsonindata['user']['profile_background_color'],
        jsonindata['user']['profile_background_image_url'],
        jsonindata['user']['profile_background_tile'],
        jsonindata['user']['profile_image_url'],
        jsonindata['user']['profile_link_color'],
        jsonindata['user']['profile_sidebar_border_color'],
        jsonindata['user']['profile_sidebar_fill_color'],
        jsonindata['user']['profile_text_color'],
        jsonindata['user']['profile_use_background_image'],
        jsonindata['user']['protected'],
        jsonindata['user']['screen_name'],
        jsonindata['user']['show_all_inline_media'],
        jsonindata['user']['statuses_count'],
        jsonindata['user']['time_zone'],
        jsonindata['user']['url'],
        jsonindata['user']['utc_offset'],
        jsonindata['user']['verified']
    )
)
inputfile.close()

produces nicely formatted output, ready for tools such as SPSS, as follows:

contributors,coordinates,created_at,entities,hashtags,urls,user_mentions,favorited,geo,id,id_str,in_reply_to_screen_name,in_reply_to_status_id,in_reply_to_status_id_str,in_reply_to_user_id,in_reply_to_user_id_str,place,retweet_count,retweeted,source,text,truncated,user,contributors_enabled,created_at,description,favourites_count,follow_request_sent,followers_count,following,friends_count,geo_enabled,id,id_str,is_translator,lang,listed_count,location,name,notifications,profile_background_color,profile_background_image_url,profile_background_tile,profile_image_url,profile_link_color,profile_sidebar_border_color,profile_sidebar_fill_color,profile_text_color,profile_use_background_image,protected,screen_name,show_all_inline_media,statuses_count,time_zone,url,utc_offset,verified
,,Tue Mar 22 19:00:46 +0000 2011,[],[],[],False,,50270714498002945,50270714498002945,,,,,,,0,False,web,A gente todos os dias arruma os cabelos: por que não o coração?,False,False,Wed Jul 22 15:27:15 +0000 2009,,0,,10,,18,False,59154474,59154474,False,en,1,"Brasil, Recife-PE",Natalia Aráujo,,eae2bc,http://a3.twimg.com/profile_background_images/220796682/music-2.png,True,http://a0.twimg.com/profile_images/1247378890/154254_normal.JPG,867c5f,eae2bc,eae2bc,91957f,True,False,nat_araujo,False,16,Brasilia,,-10800,False

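The only remaining problem is that my input files contain multiple JSON strings inline with one another, all on one continuous line. When I try to run the same program on those files, I get the following error:

Traceback (most recent call last):
  File "workspace/coalmine-datafilter/src/twitterjson2cvs.py", line 22, in <module>
    jsonindata = json.load(inputfile)
  File "/usr/lib/python2.6/json/__init__.py", line 267, in load
    parse_constant=parse_constant, **kw)
  File "/usr/lib/python2.6/json/__init__.py", line 307, in loads
    return _default_decoder.decode(s)
  File "/usr/lib/python2.6/json/decoder.py", line 322, in decode
    raise ValueError(errmsg("Extra data", s, end, len(s)))
ValueError: Extra data: line 1 column 1514 - line 2 column 1 (char 1514 - 2427042)

The input file is very large (many thousands of Twitter posts), and I do not know whether the problem is the number of posts or the fact that the file has multiple {"...."} {"...."} all on the same line. Any ideas? Do I perhaps need to add a line return after each post somehow?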
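The "Extra data" message is what the json module reports when a second JSON document follows the first in the same input, so the layout of the file rather than its size is the likely cause. A minimal sketch of one way to read such a file without inserting line returns, written for Python 3 and assuming every post is a complete JSON object placed directly after the previous one: json.JSONDecoder.raw_decode parses one value and returns the index where it stopped. The helper name iter_json_objects and the two-post sample are hypothetical.

```python
import json


def iter_json_objects(text):
    """Yield each JSON value from a string of back-to-back JSON documents."""
    decoder = json.JSONDecoder()
    pos = 0
    end = len(text)
    while pos < end:
        # raw_decode does not skip leading whitespace, so step over any
        # spaces or newlines that separate two documents.
        while pos < end and text[pos].isspace():
            pos += 1
        if pos >= end:
            break
        # Parse one value starting at pos; the second item returned is the
        # index just past that value.
        obj, pos = decoder.raw_decode(text, pos)
        yield obj


# Hypothetical two-post sample standing in for the real capture file.
sample = '{"text": "first", "user": {"id": 1}}{"text": "second", "user": {"id": 2}}\n'
posts = list(iter_json_objects(sample))
print(len(posts), posts[1]["text"])
```

For a real capture, the string would come from reading the whole input file, which should be workable for a file of a few megabytes; each yielded dict can then be written out as one CSV row.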

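Comments:

@user672387: After your update, what is your expected output?

@Johnsyweb: The expected output should be location statistics based on ('location') correlated with ('time_zone'), just as a gauge. Then the message ('text') and the message entropy. Then a cluster analysis using entropy, time, location and so on.

@securemindorg: That information will come from the tools you say you have already written. I don't understand what your question is now.

No question, I was just clarifying. I am now working on a solution that takes the input JSON file and outputs it as CSV while also doing the data analysis. I will share it when it is done.

No question? Please take another look at the FAQ.

Answer 1:

The input here is JSON. Python has a JSON module. Happily, it also has a CSV module. So that is your input and output taken care of!

Update

You're nearly there!

Your call to writerow() needs to look more like this (and not be inside a for loop):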

f.writerow( 
        (
            jsonindata['contributors'],
            jsonindata['coordinates'],
            jsonindata['created_at'],
            jsonindata['entities']['hashtags'],
            jsonindata['entities']['urls'],
            jsonindata['entities']['user_mentions'],
            jsonindata['favorited'],
            jsonindata['geo'],
            jsonindata['id'],
            jsonindata['id_str'],
            jsonindata['in_reply_to_screen_name'],
            jsonindata['in_reply_to_status_id'],
            jsonindata['in_reply_to_status_id_str'],
            jsonindata['in_reply_to_user_id'],
            jsonindata['in_reply_to_user_id_str'],
            jsonindata['place'],
            jsonindata['retweet_count'],
            jsonindata['retweeted'],
            jsonindata['source'],
            jsonindata['text'].encode('utf8'),
            jsonindata['truncated'],
            jsonindata['user']['contributors_enabled'],
            jsonindata['user']['created_at'],
            jsonindata['user']['description'],
            jsonindata['user']['favourites_count'],
            jsonindata['user']['follow_request_sent'],
            jsonindata['user']['followers_count'],
            jsonindata['user']['following'],
            jsonindata['user']['friends_count'],
            jsonindata['user']['geo_enabled'],
            jsonindata['user']['id'],
            jsonindata['user']['id_str'],
            jsonindata['user']['is_translator'],
            jsonindata['user']['lang'],
            jsonindata['user']['listed_count'],
            jsonindata['user']['location'],
            jsonindata['user']['name'].encode('utf8'),
            jsonindata['user']['notifications'],
            jsonindata['user']['profile_background_color'],
            jsonindata['user']['profile_background_image_url'],
            jsonindata['user']['profile_background_tile'],
            jsonindata['user']['profile_image_url'],
            jsonindata['user']['profile_link_color'],
            jsonindata['user']['profile_sidebar_border_color'],
            jsonindata['user']['profile_sidebar_fill_color'],
            jsonindata['user']['profile_text_color'],
            jsonindata['user']['profile_use_background_image'],
            jsonindata['user']['protected'],
            jsonindata['user']['screen_name'],
            jsonindata['user']['show_all_inline_media'],
            jsonindata['user']['statuses_count'],
            jsonindata['user']['time_zone'],
            jsonindata['user']['url'],
            jsonindata['user']['utc_offset'],
            jsonindata['user']['verified']
        )
    )
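As for why the looped version raised "string indices must be integers": iterating over the dict returned by json.load yields its keys, so each x was a key string rather than a tweet. A minimal sketch in Python 3 syntax, using a hypothetical two-field tweet:

```python
import json

# Hypothetical cut-down tweet; the real post has many more fields.
tweet = json.loads('{"contributors": null, "text": "hello"}')

# Iterating over a dict yields its keys, which are plain strings.
keys = [x for x in tweet]
print(keys)

error = None
try:
    for x in tweet:
        x["contributors"]  # x is a key such as "contributors", not a tweet
except TypeError as exc:
    error = exc
    print("TypeError:", exc)

# Indexing the dict itself works as intended.
print(tweet["text"])
```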

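Also consider using DictWriter, but bear in mind that Python's CSV module deals badly with Unicode, hence the use of .encode('utf8') on a couple of elements of the tuple.

Comments:

Wow! Thank you for the prompt reply. I am not a skilled programmer by any means; statistics are more my thing. But yes, I see you are right: it is JSON, and I do see that there is a CSV module. I will give it a try and let you know how it goes.

I have edited my earlier post; I am still getting the "TypeError: string indices must be integers" error.

@securemindorg: Ah, yes... you are trying to index a string. I have updated my answer to make this clearer. There is no need for the for loop.

Thanks for the clarification, it makes more sense now. As noted in the edit above, I have just one last problem: I run into trouble when I try to run it on a large file containing multiple JSON-formatted Twitter posts.

@Securemindorg: Now you are trying to read multiple JSON objects in one call. I suspect this will not be your last error, as you have no error handling. The scope of your question has changed since you asked "how do I read and write CSV?", which is a little unfair. Please try to work through these problems yourself, and if you cannot overcome them, ask a new question on ***

Answer 2:

This should get you started... you will need to take care of the nested objects.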
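A minimal sketch of the DictWriter route, written for Python 3, where the csv module works on text strings, so the .encode('utf8') calls are not needed there. The row is a hypothetical, already flattened subset of the sample post, and the column names user_name and user_location are assumptions made for illustration, not names from the original program.

```python
import csv
import io

# Hypothetical, already flattened tweet; only a few columns are shown.
row = {
    "id_str": "50270714498002945",
    "text": "A gente todos os dias arruma os cabelos: por que não o coração?",
    "user_name": "Natalia Aráujo",
    "user_location": "Brasil, Recife-PE",
}

fieldnames = ["id_str", "text", "user_name", "user_location"]

# io.StringIO stands in for a real file opened with
# open("test.csv", "w", newline="", encoding="utf-8").
buffer = io.StringIO(newline="")
writer = csv.DictWriter(buffer, fieldnames=fieldnames)
writer.writeheader()   # header row taken from fieldnames
writer.writerow(row)   # values are looked up by column name

csv_text = buffer.getvalue()
print(csv_text)
```

Because values are matched to columns by name, the header and the data cannot drift out of step the way two long hand-written lists can.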

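import json
import csv

# Python 2 code, like the rest of this thread.
f = open('test.json', 'r')
data = json.load(f)
f.close()

result = []
for k, v in data.iteritems():
    print k, v
    # Nested objects such as 'user' and 'entities' still need flattening.
    # The Python 2 csv module needs unicode text encoded as bytes.
    result.append(v.encode('utf8') if isinstance(v, unicode) else v)

f = open('output.csv', 'wb')
writer = csv.writer(f)
writer.writerow(result)  # one tweet becomes one row
f.close()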
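One possible way to take care of the nested objects is to flatten each post into a single-level dict before writing it. The sketch below is written for Python 3 and is only an illustration: the function name flatten and the underscore-joined key scheme are assumptions, and the tweet is a hypothetical cut-down version of the sample post.

```python
def flatten(obj, prefix=""):
    """Return a flat dict whose keys join nested dict keys with underscores."""
    flat = {}
    for key, value in obj.items():
        name = prefix + key
        if isinstance(value, dict):
            # Recurse into nested objects such as "user" and "entities".
            flat.update(flatten(value, name + "_"))
        else:
            # Lists and scalar values are kept as they are.
            flat[name] = value
    return flat


# Hypothetical cut-down tweet with the same nesting as the sample post.
tweet = {
    "id": 50270714498002945,
    "entities": {"urls": [], "hashtags": []},
    "user": {"id": 59154474, "screen_name": "nat_araujo"},
}
flat = flatten(tweet)
print(sorted(flat))
```

Prefixing nested keys also keeps the tweet-level id and the user's id in separate columns (id and user_id), which plain field names would not.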
