Twint Module and creating a dataframe from tweets
[Posted]: 2019-12-18 02:12:32
[Question]: I am having trouble converting twint results into a dataframe. I cannot take the tweet results and store them in a dataframe. Every time I set c.Pandas=True I get an error. Any ideas on how to fix this?
I know I could always store the results as json/csv and then re-import them, but I would like to avoid that.
The code I am using:
import twint
from datetime import datetime, timedelta
import nest_asyncio
import pandas as pd
nest_asyncio.apply()
c = twint.Config()
c.Limit=10
c.Username='ProtonMail'
c.Store_object=True
c.Pandas=True
twint.run.Search(c)
The error log is as follows:
Traceback (most recent call last):
File "<ipython-input-39-e0414b83fe16>", line 17, in <module>
twint.run.Search(c)
File "c:\users\xx\appdata\local\programs\python\python37-32\lib\site-packages\twint\run.py", line 292, in Search
run(config, callback)
File "c:\users\xx\appdata\local\programs\python\python37-32\lib\site-packages\twint\run.py", line 213, in run
get_event_loop().run_until_complete(Twint(config).main(callback))
File "c:\users\xx\appdata\local\programs\python\python37-32\lib\site-packages\nest_asyncio.py", line 61, in run_until_complete
return f.result()
File "c:\users\xx\appdata\local\programs\python\python37-32\lib\asyncio\futures.py", line 178, in result
raise self._exception
File "c:\users\xx\appdata\local\programs\python\python37-32\lib\asyncio\tasks.py", line 251, in __step
result = coro.throw(exc)
File "c:\users\xx\appdata\local\programs\python\python37-32\lib\site-packages\twint\run.py", line 154, in main
await task
File "c:\users\xx\appdata\local\programs\python\python37-32\lib\asyncio\futures.py", line 260, in __await__
yield self # This tells Task to wait for completion.
File "c:\users\xx\appdata\local\programs\python\python37-32\lib\asyncio\tasks.py", line 318, in __wakeup
future.result()
File "c:\users\xx\appdata\local\programs\python\python37-32\lib\asyncio\futures.py", line 178, in result
raise self._exception
File "c:\users\xx\appdata\local\programs\python\python37-32\lib\asyncio\tasks.py", line 249, in __step
result = coro.send(None)
File "c:\users\xx\appdata\local\programs\python\python37-32\lib\site-packages\twint\run.py", line 198, in run
await self.tweets()
File "c:\users\xx\appdata\local\programs\python\python37-32\lib\site-packages\twint\run.py", line 145, in tweets
await output.Tweets(tweet, self.config, self.conn)
File "c:\users\xx\appdata\local\programs\python\python37-32\lib\site-packages\twint\output.py", line 142, in Tweets
await checkData(tweets, config, conn)
File "c:\users\xx\appdata\local\programs\python\python37-32\lib\site-packages\twint\output.py", line 116, in checkData
panda.update(tweet, config)
File "c:\users\xx\appdata\local\programs\python\python37-32\lib\site-packages\twint\storage\panda.py", line 67, in update
day = weekdays[strftime("%A", localtime(Tweet.datetime))]
OSError: [Errno 22] Invalid argument
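The last frame of the traceback calls localtime(Tweet.datetime). One possible explanation, which this thread does not confirm, is that the value is an epoch timestamp in milliseconds rather than seconds; on Windows, time.localtime raises OSError [Errno 22] for values outside the range it supports. The sketch below is not twint code. weekday_from_epoch is a hypothetical helper, and its 1e11 threshold is an assumed heuristic for telling the two units apart.

```python
from datetime import datetime, timezone


def weekday_from_epoch(value):
    """Return the weekday name for an epoch timestamp in seconds or milliseconds."""
    # Assumed heuristic: 1e11 seconds would be a date past the year 5000,
    # so anything larger is treated as milliseconds and scaled down.
    seconds = value / 1000 if value > 1e11 else value
    # Using an explicit UTC timezone keeps the result independent of the machine's locale zone.
    return datetime.fromtimestamp(seconds, tz=timezone.utc).strftime("%A")


# The same instant expressed in milliseconds and in seconds.
print(weekday_from_epoch(1576635152000))
print(weekday_from_epoch(1576635152))
```

If this is the cause, dividing the value by 1000 before the localtime call would avoid the error, but that would mean changing the installed library file and should be checked against the twint version in use.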
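[Answer 1]: I had the same problem. Removing c.Store_object and c.Pandas = True and replacing them with the code below (c.Store_csv, c.Custom_csv) worked for me. You should also give the full path for the output file.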
import twint
import nest_asyncio
nest_asyncio.apply()
# Configure
c = twint.Config()
c.Search = "data science"
c.Store_csv = True
c.Custom_csv = ["id", "user_id", "username", "tweet"]
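# Raw string so the backslashes in the Windows path are not read as escape sequences
c.Output = r"C:\Users\name\Downloads\tweet.csv"
# Run the search and write the results to the CSV file
twint.run.Search(c)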
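Once the CSV file exists, it can be loaded into a dataframe with pandas. The sketch below assumes a comma-separated file with a header row containing the four columns from c.Custom_csv; that layout is an assumption about the file twint writes, not something stated in this answer. load_tweets_csv is a hypothetical helper, and the demo file with made-up rows is written only so the example runs on its own.

```python
import csv
import os
import tempfile

import pandas as pd


def load_tweets_csv(path, columns):
    """Read a tweets CSV into a DataFrame, keeping only the requested columns."""
    # dtype=str keeps long numeric IDs as text instead of converting them to numbers.
    return pd.read_csv(path, usecols=columns, dtype=str)


# Stand-in file with made-up rows (not real twint output).
columns = ["id", "user_id", "username", "tweet"]
demo_path = os.path.join(tempfile.mkdtemp(), "tweets.csv")
with open(demo_path, "w", newline="", encoding="utf-8") as handle:
    writer = csv.writer(handle)
    writer.writerow(columns)
    writer.writerow(["1", "100", "example_user", "hello, world"])
    writer.writerow(["2", "100", "example_user", "second tweet"])

tweets_df = load_tweets_csv(demo_path, columns)
print(tweets_df.shape)
```

With a real run, the path passed to load_tweets_csv would be the same one assigned to c.Output.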
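[Answer 2]: To use twint.run.Search(c), you first need to define c.Search = "" with the text you want to search for. However, if you are interested in scraping tweets from ProtonMail's profile, you should run twint.run.Profile(c) instead. Different options can be run depending on the kind of data you need (read more in the reference on GitHub).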
[Answer 3]: You are on the right track. You just need to retrieve the saved search from twint.storage.panda.Tweets_df and store it in a variable.
import twint
import pandas
c = twint.Config()
c.Pandas = True
c.Lang = 'en'
c.Username='ProtonMail'
c.Limit=10
twint.run.Search(c)
test_df = twint.storage.panda.Tweets_df
See https://github.com/twintproject/twint/issues/173 for more information.
In case it helps, I am using twint version 2.1.21 on Python 3.7, installed by running the command
pip install git+https://github.com/twintproject/twint.git@origin/master#egg=twint
at the Anaconda prompt.
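Because Tweets_df is read as an attribute of the twint.storage.panda module, it seems prudent to take an independent copy before running another search; that a later search replaces the stored frame is an assumption here, not something this answer states. The sketch below uses a stand-in DataFrame with made-up rows in place of twint.storage.panda.Tweets_df, and snapshot is a hypothetical helper.

```python
import pandas as pd


def snapshot(df, columns):
    """Return an independent copy of the selected columns."""
    # .copy() detaches the result, so later edits do not touch the source frame.
    return df.loc[:, columns].copy()


# Stand-in for twint.storage.panda.Tweets_df; the rows are made up.
tweets_df = pd.DataFrame({
    "id": ["1", "2"],
    "username": ["example_user", "example_user"],
    "tweet": ["first tweet", "second tweet"],
})

subset = snapshot(tweets_df, ["username", "tweet"])
subset.loc[0, "tweet"] = "edited"

# The original frame still holds the unmodified text.
print(tweets_df.loc[0, "tweet"])
```

In a real session the call would be snapshot(twint.storage.panda.Tweets_df, [...]) with whichever column names the retrieved frame actually contains.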