How to tell Python to continue extracting even if the number of reviews isn't reached?
Posted: 2022-01-23 01:08:45
【Question】: For a project, I am extracting hotel reviews with Python. I have a list of more than 100 hotels, and I extract 1,500 reviews from each of them. The problem is that some hotels don't have that many reviews. When the 1,500 isn't reached, the loop stops with an error.
Here is my code:
# The number of reviews to obtain per hotel
reviewsToGet = 1500
# Loop for all hotels
for index, row in hotelsToScrap.iterrows():
    # Present feedback on which hotel is being processed
    print("Processing hotel", index)
    # Reset counter per hotel
    reviewsExtracted = 0
    # Loop until it extracts the pre-defined number of reviews
    while reviewsExtracted<reviewsToGet:
        # Define URL to use based on the number of reviews extracted so far
        urlToUse = row['URL']
        if reviewsExtracted>0:
            repText = "-Reviews-or"+str(reviewsExtracted)+"-"
            urlToUse = urlToUse.replace("-Reviews-",repText)
        # Open and read the web page content
        soup = openPageReadHTML(urlToUse)
        # Process web page
        hotelReviews = processPage(soup, index, hotelReviews)
        # Update counter
        reviewsExtracted = reviewsExtracted + 5
        # Present feedback on the number of extracted reviews
        print("Extracted ",reviewsExtracted,"/",reviewsToGet)
# Save the extracted reviews data frame to an Excel file
hotelReviews.to_excel("ExtractedReviewsComplete.xlsx")
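What can I do so that the extraction carries on even when the 1,500 isn't reached?
【Comments】:
Use try and except to handle the error when it occurs.
Maybe it would be easier to just use a for loop with a try/except statement. Let the try be what you are already doing, and let the except save the reviews. Don't forget to break out of the for loop once you have reached the number of reviews you want.
Which website is this? Depending on the site, you might be able to use position().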
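【Answer 1】:
It is hard to tell without knowing the exact error, but I suppose that while the loop condition is still true, i.e. reviewsExtracted is less than reviewsToGet, the hotel has already run out of reviews, and part of your code keeps trying to extract reviews that don't exist. Here:
repText = "-Reviews-or"+str(reviewsExtracted)+"-"
urlToUse = urlToUse.replace("-Reviews-",repText)
you simply build a new URL, urlToUse, from a reviewsExtracted value that may not exist on the website.
So I would either add another condition that checks, before extracting, whether there are any reviews left to extract, or use try and except to catch the exact error, which I think occurs on this line:
soup = openPageReadHTML(urlToUse)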
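One way to read the "check before extracting" idea is to work out in advance which page offsets actually exist. The sketch below assumes the total number of reviews for a hotel can be obtained somehow (the `total_available` argument is hypothetical, the question does not show where it would come from) and that pages hold 5 reviews, as the counter in the question suggests. The function names and the example URL are illustrative only.

```python
def page_offsets(total_available, reviews_to_get=1500, page_size=5):
    """Return the offsets worth requesting, capped by what the hotel really has."""
    target = min(reviews_to_get, total_available)
    return list(range(0, target, page_size))


def build_url(base_url, offset):
    """Same URL rewrite as in the question: insert "or<offset>-" after "-Reviews-"."""
    if offset == 0:
        return base_url
    return base_url.replace("-Reviews-", "-Reviews-or" + str(offset) + "-")


# Hypothetical hotel URL with only 12 reviews available
base = "https://example.com/Hotel_Review-g1-d2-Reviews-Some_Hotel.html"
for offset in page_offsets(total_available=12):
    print(build_url(base, offset))
```

With this shape the loop never requests a page past the last review, so no error handling is needed for that particular case.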
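【Answer 2】: Based on your post, I take 1500 to be an upper bound, e.g. "This is the maximum number of reviews I want for each hotel. If a hotel has fewer than 1500 reviews, get all of them and move on."
I would suggest changing your approach: instead of relying on the 1500-review maximum as the loop condition, use it as a signal that tells you when to stop, and then move on to the next hotel.
However, if you just want to get your current implementation running, do what several other commenters have said: implement exception handling so that your process can continue when it hits an error.
For example, using your code: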
reviewsToGet = 1500
# Loop for all hotels
for index, row in hotelsToScrap.iterrows():
    print("Processing hotel", index)
    reviewsExtracted = 0
    try:
        # Loop until it extracts the pre-defined number of reviews
        while reviewsExtracted<reviewsToGet:
            # Define URL to use based on the number of reviews extracted so far
            urlToUse = row['URL']
            if reviewsExtracted>0:
                repText = "-Reviews-or"+str(reviewsExtracted)+"-"
                urlToUse = urlToUse.replace("-Reviews-",repText)
            # Open and read the web page content
            soup = openPageReadHTML(urlToUse)
            # Process web page
            hotelReviews = processPage(soup, index, hotelReviews)
            # Update counter
            reviewsExtracted = reviewsExtracted + 5
            # Present feedback on the number of extracted reviews
            print("Extracted ",reviewsExtracted,"/",reviewsToGet)
    except Exception as err:
        print("[+] Exception encountered! Most likely because hotel has too few reviews, but check stack trace. Continuing anyways.")
        # Show the error so that unexpected failures are not silently hidden
        print(err)
# Save the extracted reviews data frame to an Excel file
hotelReviews.to_excel("ExtractedReviewsComplete.xlsx")
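The first suggestion in this answer, treating 1500 as a cap rather than as the only way out of the loop, could be sketched roughly as follows. This is a minimal illustration, not the asker's scraper: `fetch_page` is a hypothetical callable that returns the list of reviews found at a given offset (empty when the hotel has no more), standing in for `openPageReadHTML` plus `processPage`.

```python
def collect_reviews(fetch_page, reviews_to_get=1500, page_size=5):
    """Collect at most reviews_to_get reviews, stopping early on an empty page."""
    collected = []
    offset = 0
    while len(collected) < reviews_to_get:
        page = fetch_page(offset)  # hypothetical: list of reviews at this offset
        if not page:
            # The hotel has run out of reviews: stop and move on to the next one
            break
        collected.extend(page)
        offset += page_size
    # Trim in case the last page pushed the total over the cap
    return collected[:reviews_to_get]


# Stand-in data source for a hotel that has only 12 reviews
fake_reviews = [f"review {i}" for i in range(12)]

def fake_fetch(offset):
    return fake_reviews[offset:offset + 5]

reviews = collect_reviews(fake_fetch)
print(len(reviews))
```

The loop now has two exits, the cap and the empty page, so a hotel with few reviews ends normally instead of through an error.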
【Answer 3】: The simplest way is to put the body of the while loop inside a try-except block, like this:
# The number of reviews to obtain per hotel
reviewsToGet = 1500
# Loop for all hotels
for index, row in hotelsToScrap.iterrows():
    # Present feedback on which hotel is being processed
    print("Processing hotel", index)
    # Reset counter per hotel
    reviewsExtracted = 0
    # Loop until it extracts the pre-defined number of reviews
    while reviewsExtracted<reviewsToGet:
        try:
            # Define URL to use based on the number of reviews extracted so far
            urlToUse = row['URL']
            if reviewsExtracted>0:
                repText = "-Reviews-or"+str(reviewsExtracted)+"-"
                urlToUse = urlToUse.replace("-Reviews-",repText)
            # Open and read the web page content
            soup = openPageReadHTML(urlToUse)
            # Process web page
            hotelReviews = processPage(soup, index, hotelReviews)
            # Update counter
            reviewsExtracted = reviewsExtracted + 5
            # Present feedback on the number of extracted reviews
            print("Extracted ",reviewsExtracted,"/",reviewsToGet)
        except Exception:
            # Stop paging this hotel and move on to the next one. Using
            # continue here would retry the same URL forever, because the
            # counter is not advanced when the request fails.
            break
# Save the extracted reviews data frame to an Excel file
hotelReviews.to_excel("ExtractedReviewsComplete.xlsx")
PS: You should catch only the specific exception you actually want to handle, rather than catching every exception.
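To illustrate the PS, here is a small sketch that catches one named exception and lets everything else propagate. The question does not show what `openPageReadHTML` raises, so `NoMoreReviewsError` and `fetch_page` are hypothetical stand-ins; in a real scraper the exception to catch would be whichever one appears in the traceback.

```python
class NoMoreReviewsError(Exception):
    """Hypothetical error raised when a review page does not exist."""


def fetch_page(offset, available=12, page_size=5):
    """Hypothetical stand-in for openPageReadHTML + processPage."""
    if offset >= available:
        raise NoMoreReviewsError(f"no reviews at offset {offset}")
    last = min(offset + page_size, available)
    return [f"review {i}" for i in range(offset, last)]


def scrape_hotel(reviews_to_get=1500, page_size=5):
    """Page through one hotel, stopping at the cap or at the first missing page."""
    reviews = []
    # A for loop over the offsets cannot spin forever, whatever happens inside
    for offset in range(0, reviews_to_get, page_size):
        try:
            reviews.extend(fetch_page(offset))
        except NoMoreReviewsError:
            # Only the expected "ran out of reviews" case ends the loop;
            # any other error still surfaces with its traceback
            break
    return reviews


reviews = scrape_hotel()
print(len(reviews))
```

Because only one exception type is handled, a typo or a network problem still fails loudly instead of being mistaken for a hotel with few reviews.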