Web scraping reviews from only the past year
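I am trying to extract only the past year's reviews from TripAdvisor for a specific airline, SpiceJet. Link: https://www.tripadvisor.com/Airline_Review-d8728949-Reviews-or60-SpiceJet#REVIEWS
However, the review dates are stored inconsistently. Some appear as the text of the span:
<span class="ratingDate">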
Reviewed October 22, 2018
</span>
Others are in the title attribute:
<span class="ratingDate relativeDate" title="October 23, 2018">
Reviewed 5 weeks ago
</span>
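I want to extract the date and set a condition so that only reviews from the past year are kept. I am having trouble handling the two date formats, so how should I compare them?
Code: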
date = items.find(class_="ratingDate").get("title")
date = dt.strptime(date, "%B %d, %Y")
if (date > dt.strptime(('November 26 2017'),"%B %d %Y")):
date = items.find('span', class_='ratingDate')['title']
Output:
“It's manageable”
('October 23, 2018',)
<ipython-input-72-3d5de04a2794> in get_info()
6 for items in soup.find_all(class_="innerBubble"):
7 date = items.find(class_="ratingDate").get("title")
----> 8 date = dt.strptime(date, "%B %d, %Y")
9 if (date > dt.strptime(('November 26 2017'),"%B %d %Y")):
10 print("===========================================")
TypeError: strptime() argument 1 must be str, not None
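You could do a lot of work, or you could trace where the data comes from and tweak the source request a little until it returns something nicer. Here it looks like the data is loaded from: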
https://www.tripadvisor.com/AirlineTips
As you pointed out, the markup is ugly.
The exact call it made for me was:
https://www.tripadvisor.com/AirlineTips?d=8728949&inline=true
which returned:
<div class="page page1">
<div class="tip">
<div class="memberOverlayLink" id="UID_-SRC_635739734" onmouseover="requireCallIfReady('members/memberOverlay', 'initMemberOverlay', event, this, this.id, 'Reviews', 'user_name_photo');" data-anchorWidth="30">
<div class="circularAvWrap smallCircularAvWrap profile_UID_-SRC_635739734">
<img src="https://media-cdn.tripadvisor.com/media/photo-l/01/2e/70/85/avatar006.jpg" class="avatar" width="28" height="28"/>
</div>
</div> <div class="tipText">
<blockquote>“Value for Money”</blockquote>
<span class="ui_bubble_rating bubble_4" alt="4.0 of 5 bubbles"></span>
Santhoshpp, 2 days ago
<span class="pipe">|</span> <a href="/ShowUserReviews-g1-d8728949-r635739734-SpiceJet-World.html" onclick="ta.trackEventOnPage('Tab Content', 'read_review', 'Read Review');">Read review</a> </div> </div>
<div class="tip">
<div class="memberOverlayLink" id="UID_-SRC_635711432" onmouseover="requireCallIfReady('members/memberOverlay', 'initMemberOverlay', event, this, this.id, 'Reviews', 'user_name_photo');" data-anchorWidth="30">
<div class="circularAvWrap smallCircularAvWrap profile_UID_-SRC_635711432">
<img src="https://media-cdn.tripadvisor.com/media/photo-l/01/2e/70/99/avatar025.jpg" class="avatar" width="28" height="28"/>
</div>
</div> <div class="tipText">
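A fragment like the one above could be reduced to title, author, and relative age with a small regular expression. This is only a rough sketch: `TIP_PATTERN` and `extract_tips` are hypothetical names, the pattern assumes the markup keeps exactly the shape shown here, and an HTML parser such as BeautifulSoup would be more robust against layout changes.

```python
import re

# Assumes each tip looks like the fragment above:
# <blockquote>title</blockquote> ... </span> author, age <span class="pipe">
TIP_PATTERN = re.compile(
    r'<blockquote>(.*?)</blockquote>'   # review title
    r'.*?</span>\s*'                    # skip the bubble-rating span
    r'([^<,]+),\s*([^<]+?)\s*'          # "author, relative age"
    r'<span class="pipe">',
    re.S,
)

def extract_tips(html):
    """Return one dict per tip found in the HTML fragment."""
    return [
        {"title": title.strip("“”"), "author": author.strip(), "age": age.strip()}
        for title, author, age in TIP_PATTERN.findall(html)
    ]

# Sample copied from the response shown above (attributes shortened)
sample = '''<div class="tipText">
<blockquote>“Value for Money”</blockquote>
<span class="ui_bubble_rating bubble_4" alt="4.0 of 5 bubbles"></span>
Santhoshpp, 2 days ago
<span class="pipe">|</span> <a href="/ShowUserReviews-g1-d8728949-r635739734-SpiceJet-World.html">Read review</a> </div>'''
tips = extract_tips(sample)
```

Note that this endpoint only exposes relative ages such as "2 days ago", so a date filter would still have to convert them to approximate dates.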
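As far as I can tell, you do not need to compare the two date values, because both refer to the same date. So for each review, check whether the span text date or the title date exists. If both exist, just check one of them. The check can be done with strptime.
For the title date (assumed here to be the relative form, such as "5 weeks ago"), you will need timedelta.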
import datetime

span_date = None
title_date = None
# Cutoff: one year (365 days) before today
one_year_ago_date = datetime.date.today() - datetime.timedelta(days=365)
# ADD CODE HERE to get date strings for span_date and title_date
review_date = None
# Assume span_date = "October 22, 2018"
if span_date is not None:
    review_date = datetime.datetime.strptime(span_date, "%B %d, %Y").date()
# Assume title_date = "5 weeks ago"
elif title_date is not None:
    amount, unit = title_date.split()[:2]
    # timedelta expects plural keywords such as "days" or "weeks"
    unit = unit if unit.endswith("s") else unit + "s"
    # Note: timedelta does not accept "months" or "years"
    delta = datetime.timedelta(**{unit: float(amount)})
    review_date = datetime.date.today() - delta
# Keep the review only if it is no older than one year
if review_date is not None and review_date >= one_year_ago_date:
    print("Save this review")
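The approach above could be wrapped into a helper that accepts both date forms. The following is a minimal sketch, not a definitive implementation: `parse_review_date`, `within_last_year`, and `DAYS_PER_UNIT` are hypothetical names, and months and years are approximated as 30 and 365 days because `timedelta` has no such units.

```python
import re
from datetime import date, datetime, timedelta

# Approximate day counts per unit (assumption: months = 30 days, years = 365)
DAYS_PER_UNIT = {"day": 1, "week": 7, "month": 30, "year": 365}

def parse_review_date(text, title=None, today=None):
    """Return the review date from the title attribute if given, else from the span text."""
    today = today or date.today()
    if title:
        # relativeDate spans carry the absolute date in their title attribute
        return datetime.strptime(title.strip(), "%B %d, %Y").date()
    cleaned = text.strip().replace("Reviewed ", "", 1)
    match = re.fullmatch(r"(\d+|an?)\s+(day|week|month|year)s?\s+ago", cleaned)
    if match:
        amount = 1 if match.group(1) in ("a", "an") else int(match.group(1))
        return today - timedelta(days=amount * DAYS_PER_UNIT[match.group(2)])
    # Otherwise expect an absolute date such as "October 22, 2018"
    return datetime.strptime(cleaned, "%B %d, %Y").date()

def within_last_year(review_date, today=None):
    """True if review_date is no more than 365 days before today."""
    today = today or date.today()
    return review_date >= today - timedelta(days=365)
```

Passing `today` explicitly keeps the function testable; in a scraper it can simply be omitted.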
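You can take advantage of how CSS matches on classes and use the class selector .ratingDate to retrieve all review dates. It matches both .ratingDate and .ratingDate.relativeDate. You will find that the len of the matched element's class list is 2 when the date is in the element's title attribute, i.e. for elements with the classes ratingDate relativeDate.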
<span class="ratingDate relativeDate" title="October 26, 2018">Reviewed 4 weeks ago
</span>
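You can also get the review text with a class selector. Zip the two together and convert the result to a list.
Below is an outline without date filtering. Either filter the dates before this point (but then you will need an index to link the lists so that dates and review texts match up) or filter from here onwards. The dates are all in a consistent format.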
import requests
from bs4 import BeautifulSoup
url = 'https://www.tripadvisor.com/Airline_Review-d8728949-Reviews-or60-SpiceJet#REVIEWS'
data = requests.get(url).content
soup = BeautifulSoup(data,'lxml')
dateStrings = soup.select('.ratingDate')
reviewStrings = soup.select('.partial_entry')
reviewDates = [date['title'].strip() if len(date['class']) == 2 else date.text.strip().replace('Reviewed ','') for date in dateStrings]
reviews = [review.text.strip() for review in reviewStrings]
allInfo = list(zip(reviewDates, reviews))