If BeautifulSoup finds no data, how do I make f.write() write NA?

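My goal is to scrape some specific data from multiple Khan Academy profile pages and save it to a CSV file.

Here is the code that scrapes one specific profile page and writes the result to a CSV file: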

from bs4 import BeautifulSoup
from requests_html import HTMLSession

session = HTMLSession()
r = session.get('https://www.khanacademy.org/profile/DFletcher1990/')
r.html.render(sleep=5)  # run the page's JavaScript, then wait 5 seconds

soup = BeautifulSoup(r.html.html, 'html.parser')

user_info_table = soup.find('table', class_='user-statistics-table')

# Take the second cell of each table row: date, points, videos
dates, points, videos = [tr.find_all('td')[1].text for tr in user_info_table.find_all('tr')]

user_socio_table = soup.find_all('div', class_='discussion-stat')

# Map each category name (the <span> text) to the number just before it
data = {}
for gettext in user_socio_table:
    category = gettext.find('span')
    category_text = category.text.strip()
    number = category.previous_sibling.strip()
    data[category_text] = number

filename = "khanscraptry1.csv"
f = open(filename, "w")
headers = "date, points, videos, questions, votes, answers, flags, project_request, project_replies, comments, tips_thx\n"
f.write(headers)
f.write(dates + "," + points.replace(",", "") + "," + videos + "," + data['questions'] + "," + data['votes'] + "," + data['answers'] + "," + data['flags raised'] + "," + data['project help requests'] + "," + data['project help replies'] + "," + data['comments'] + "," + data['tips and thanks'] + "\n")
f.close()

This code works fine with this particular link ('https://www.khanacademy.org/profile/DFletcher1990/').

Now, when I change the link to another Khan Academy profile, for example 'https://www.khanacademy.org/profile/Kkasparas/',

I get this error:

KeyError: 'project help requests'

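This is expected, because the profile "https://www.khanacademy.org/profile/Kkasparas/" has no project help requests value (and no project help replies either).

So data['project help requests'] and data['project help replies'] do not exist and cannot be written to the CSV file.

My goal is to run this script on many profile pages. So I would like to know how to put NA wherever a variable has no data, and then write that NA to the CSV file.

In other words: I want my script to work for any kind of user profile page.

Thanks a lot for your contributions :)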

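Answer

Before writing to the file, you can define a list containing all the possible keys and set the value of every key that is missing from data to "NA".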

full_data_keys=['questions','votes','answers','flags raised','project help requests','project help replies','comments','tips and thanks']
for header_value in full_data_keys:
    if header_value not in data.keys():
        data[header_value]='NA'
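
As an alternative to filling in the dictionary, dict.get with a default value looks up each key and falls back to "NA" in one step. The following is only a minimal sketch: the sample dictionary stands in for a scraped profile with no project-help statistics, and the helper name values_or_na is made up for this illustration.

```python
# Hypothetical sample: what `data` might hold for a profile that has
# no "project help" statistics.
data = {"questions": "25", "votes": "100", "answers": "2",
        "flags raised": "0", "comments": "0", "tips and thanks": "0"}

full_data_keys = ["questions", "votes", "answers", "flags raised",
                  "project help requests", "project help replies",
                  "comments", "tips and thanks"]


def values_or_na(scraped, keys, missing="NA"):
    """Return the values for `keys` in order, using `missing` for absent keys."""
    return [scraped.get(key, missing) for key in keys]


row = values_or_na(data, full_data_keys)
print(",".join(row))  # prints 25,100,2,0,NA,NA,0,0
```

Unlike the loop above, this leaves data unchanged, so it still shows afterwards which values were really present on the page.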

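Also, a gentle reminder to include the complete code in your question. user_socio_table was not defined in the question, so I had to look at your previous question.

The complete code would be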

from bs4 import BeautifulSoup
from requests_html import HTMLSession
session = HTMLSession()
r = session.get('https://www.khanacademy.org/profile/Kkasparas/')
r.html.render(sleep=5)
soup = BeautifulSoup(r.html.html, 'html.parser')
user_info_table = soup.find('table', class_='user-statistics-table')
dates, points, videos = [tr.find_all('td')[1].text for tr in user_info_table.find_all('tr')]
data = {}
user_socio_table = soup.find_all('div', class_='discussion-stat')
for gettext in user_socio_table:
    category = gettext.find('span')
    category_text = category.text.strip()
    number = category.previous_sibling.strip()
    data[category_text] = number
# Fill in 'NA' for every expected key the profile page did not provide
full_data_keys = ['questions', 'votes', 'answers', 'flags raised', 'project help requests', 'project help replies', 'comments', 'tips and thanks']
for header_value in full_data_keys:
    if header_value not in data:
        data[header_value] = 'NA'
filename = "khanscraptry1.csv"
f = open(filename, "w")
headers = "date, points, videos, questions, votes, answers, flags, project_request, project_replies, comments, tips_thx\n"
f.write(headers)
f.write(dates + "," + points.replace(",", "") + "," + videos + "," + data['questions'] + "," + data['votes'] + "," + data['answers'] + "," + data['flags raised'] + "," + data['project help requests'] + "," + data['project help replies'] + "," + data['comments'] + "," + data['tips and thanks'] + "\n")
f.close()

Output - khanscraptry1.csv

date, points, videos, questions, votes, answers, flags, project_request, project_replies, comments, tips_thx
6 years ago,1527829,1123,25,100,2,0,NA,NA,0,0
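
The same kind of row can also be produced with the standard library's csv module instead of concatenating strings by hand. csv.DictWriter takes a restval argument that is written for any field missing from a row, which covers the NA case directly. This is a sketch under assumptions: the field names, the write_rows helper, and the sample record are illustrative, and an in-memory buffer replaces the real file so the example runs without scraping anything.

```python
import csv
import io

# Column names chosen to match the keys of the scraped dictionary.
FIELDNAMES = ["date", "points", "videos", "questions", "votes", "answers",
              "flags raised", "project help requests",
              "project help replies", "comments", "tips and thanks"]


def write_rows(out, rows):
    """Write a header and one CSV line per dict; missing keys become 'NA'."""
    writer = csv.DictWriter(out, fieldnames=FIELDNAMES, restval="NA")
    writer.writeheader()
    writer.writerows(rows)


# Hypothetical record for a profile without the two project-help values.
sample = {"date": "6 years ago", "points": "1,527,829", "videos": "1123",
          "questions": "25", "votes": "100", "answers": "2",
          "flags raised": "0", "comments": "0", "tips and thanks": "0"}

buffer = io.StringIO()
write_rows(buffer, [sample])
print(buffer.getvalue())
```

The writer quotes any value that contains a comma, so the points value would not need points.replace(",", "") to keep the columns aligned. When writing to a real file, open it with newline="" as the csv documentation recommends.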

If user_info_table does not exist (soup.find returns None), replace that line with the following

if user_info_table is not None:
    dates,points,videos=[tr.find_all('td')[1].text for tr in user_info_table.find_all('tr')]
else:
    dates=points=videos='NA'
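
The three-way unpacking would also raise a ValueError if the table is found but does not contain exactly three rows. Whether that ever happens on a profile page is an assumption and is not confirmed here. If it does, a small padding helper keeps the unpacking safe. The name pad_to and the sample cell texts below are hypothetical.

```python
def pad_to(values, length, missing="NA"):
    """Return exactly `length` items: drop extras, pad a shortfall with `missing`."""
    values = list(values)[:length]
    return values + [missing] * (length - len(values))


# Simulated cell texts; in the script these would come from the list
# comprehension over the rows of user_info_table.
cells = ["6 years ago", "1,527,829"]  # only two rows found

dates, points, videos = pad_to(cells, 3)
print(dates, points, videos)  # prints 6 years ago 1,527,829 NA
```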
