How to tell Python to continue extracting even if the number of reviews isn't reached?

Posted: 2022-01-23 01:08:45

Question:

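For a project, I am extracting hotel reviews with Python. I have a list of more than 100 hotels, and I extract 1,500 reviews from each. The problem is that some hotels don't have that many reviews. When 1,500 is not reached, the loop stops with an error.

Here is my code: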

    # The number of reviews to obtain per hotel
    reviewsToGet = 1500

    # Loop for all hotels
    for index, row in hotelsToScrap.iterrows():

        # Present feedback on which hotel is being processed
        print("Processing hotel", index)

        # Reset counter per hotel
        reviewsExtracted = 0    

        # Loop until it extracts the pre-defined number of reviews
        while reviewsExtracted<reviewsToGet:

            # Define URL to use based on the number of reviews extracted so far
            urlToUse = row['URL']
            if reviewsExtracted>0:
                repText = "-Reviews-or"+str(reviewsExtracted)+"-"
                urlToUse = urlToUse.replace("-Reviews-",repText)

            # Open and read the web page content
            soup = openPageReadhtml(urlToUse)

            # Process web page
            hotelReviews = processPage(soup, index, hotelReviews)

            # Update counter
            reviewsExtracted = reviewsExtracted + 5

            # Present feedback on the number of extracted reviews
            print("Extracted ",reviewsExtracted,"/",reviewsToGet)
         

    # Save the extracted reviews data frame to an Excel file
    hotelReviews.to_excel("ExtractedReviewsComplete.xlsx")

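What can I do so that the extraction carries on even when 1,500 is not reached?

Comments:

Use try/except to handle the error when it occurs.

It may be easier to just use a for loop with a try/except statement. Let the try be what you are already doing, and have the except save the reviews. Don't forget to break out of the for loop once you have the number of reviews you want.

Which website is this? You may be able to use position().

Answer 1: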

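It is hard to tell without knowing the exact error, but my guess is that while the loop condition is still true, i.e. reviewsExtracted is less than reviewsToGet, some of your code keeps trying to extract reviews when there are none left. Here:

    repText = "-Reviews-or"+str(reviewsExtracted)+"-"
    urlToUse = urlToUse.replace("-Reviews-",repText)

you build the new URL urlToUse from the value of reviewsExtracted, and that offset may not exist on the site.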

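So I would either add one more condition that checks whether there are any reviews left before extracting, or use try/except to catch the exact error, which I think is raised on this line:

    soup = openPageReadHTML(urlToUse)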
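To illustrate the first option, here is a minimal sketch of such a check. It assumes you can find out how many reviews a hotel has (for example by reading the count from its first page); `total_available`, `build_page_url` and `page_offsets` are hypothetical names that are not part of the original code, and the example URL is made up.

```python
REVIEWS_PER_PAGE = 5  # the question's code advances its counter by 5 per page


def build_page_url(base_url, offset):
    """Return the URL of the review page starting at `offset` (0 = first page)."""
    if offset == 0:
        return base_url
    return base_url.replace("-Reviews-", "-Reviews-or" + str(offset) + "-")


def page_offsets(reviews_to_get, total_available):
    """Offsets of the pages worth requesting, limited by both numbers."""
    limit = min(reviews_to_get, total_available)
    return list(range(0, limit, REVIEWS_PER_PAGE))


# Hypothetical hotel with only 12 reviews, although 1500 were requested
offsets = page_offsets(1500, 12)
urls = [build_page_url("https://example.com/Hotel-Reviews-Foo.html", o)
        for o in offsets]
for url in urls:
    print(url)
```

With this, the loop only ever requests pages that should exist, so the request for a missing page is never made.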

Answer 2:

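Based on your post, I take 1500 to be an upper limit, e.g. "this is the maximum number of reviews I want per hotel. If a hotel has fewer than 1500 reviews, get all of them and move on."

I would suggest changing your approach: instead of relying on the maximum of 1500 reviews as the only loop condition, treat it as one signal for when to stop, and move on to the next hotel as soon as there is nothing left to extract.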
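A minimal sketch of that idea, assuming a page past the last review comes back empty (how the real site behaves there is not known from the question). `collect_reviews`, `fetch_page` and the fake hotel are hypothetical stand-ins, not part of the original code:

```python
def collect_reviews(fetch_page, cap, page_size=5):
    """Collect up to `cap` reviews, stopping early when a page is empty."""
    collected = []
    offset = 0
    while len(collected) < cap:
        page = fetch_page(offset)
        if not page:
            # No reviews left for this hotel: stop and let the caller move on
            break
        collected.extend(page)
        offset += page_size
    return collected[:cap]


# Stand-in for a hotel that has only 12 reviews
fake_hotel = ["review %d" % i for i in range(12)]


def fake_fetch(offset):
    """Return the (at most 5) reviews starting at `offset`."""
    return fake_hotel[offset:offset + 5]


reviews = collect_reviews(fake_fetch, cap=1500)
print(len(reviews), "reviews collected")
```

Here the cap only decides when to stop early; running out of reviews ends the loop normally instead of raising an error.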

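However, if you just want to get your current implementation running, do what several other commenters have said: add exception handling so that your process can continue when it hits an error.

For example, using your code: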

    reviewsToGet = 1500

    # Loop for all hotels
    for index, row in hotelsToScrap.iterrows():

        print("Processing hotel", index)
        reviewsExtracted = 0

        try:
            # Loop until it extracts the pre-defined number of reviews
            while reviewsExtracted < reviewsToGet:

                # Define URL to use based on the number of reviews extracted so far
                urlToUse = row['URL']
                if reviewsExtracted > 0:
                    repText = "-Reviews-or" + str(reviewsExtracted) + "-"
                    urlToUse = urlToUse.replace("-Reviews-", repText)

                # Open and read the web page content
                soup = openPageReadHTML(urlToUse)

                # Process web page
                hotelReviews = processPage(soup, index, hotelReviews)

                # Update counter
                reviewsExtracted = reviewsExtracted + 5

                # Present feedback on the number of extracted reviews
                print("Extracted ", reviewsExtracted, "/", reviewsToGet)
        except Exception as err:
            # Report the error, then carry on with the next hotel
            print("[+] Exception encountered! Most likely because hotel has too few reviews, but check the error below. Continuing anyways.")
            print(err)

    # Save the extracted reviews data frame to an Excel file
    hotelReviews.to_excel("ExtractedReviewsComplete.xlsx")

Answer 3:

The simplest way is to wrap the body of the while loop in a try-except block and leave the loop when an error occurs, like this:

    # The number of reviews to obtain per hotel
    reviewsToGet = 1500

    # Loop for all hotels
    for index, row in hotelsToScrap.iterrows():

        # Present feedback on which hotel is being processed
        print("Processing hotel", index)

        # Reset counter per hotel
        reviewsExtracted = 0

        # Loop until it extracts the pre-defined number of reviews
        while reviewsExtracted < reviewsToGet:
            try:
                # Define URL to use based on the number of reviews extracted so far
                urlToUse = row['URL']
                if reviewsExtracted > 0:
                    repText = "-Reviews-or" + str(reviewsExtracted) + "-"
                    urlToUse = urlToUse.replace("-Reviews-", repText)

                # Open and read the web page content
                soup = openPageReadHTML(urlToUse)

                # Process web page
                hotelReviews = processPage(soup, index, hotelReviews)

                # Update counter
                reviewsExtracted = reviewsExtracted + 5

                # Present feedback on the number of extracted reviews
                print("Extracted ", reviewsExtracted, "/", reviewsToGet)
            except Exception as err:
                # Leave the while loop and go on to the next hotel.
                # A `continue` here would retry the same failing URL forever,
                # because the counter is not updated when an error occurs.
                print("Stopping at", reviewsExtracted, "reviews:", err)
                break

    # Save the extracted reviews data frame to an Excel file
    hotelReviews.to_excel("ExtractedReviewsComplete.xlsx")

PS: You should catch only the specific exceptions you expect, rather than using a bare except or catching every exception.
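As a hedged illustration of catching only a specific exception: the sketch below assumes the page-opening helper is built on `urllib`, so a missing page surfaces as `urllib.error.URLError` (or its subclass `HTTPError`). The question does not show that helper, so this is an assumption; `fetch_or_none` and `fake_open` are hypothetical names.

```python
from urllib.error import HTTPError, URLError


def fetch_or_none(fetch, url):
    """Call fetch(url); return None only for URL/HTTP errors."""
    try:
        return fetch(url)
    except URLError as err:  # HTTPError is a subclass of URLError
        print("Skipping", url, "because of:", err)
        return None


def fake_open(url):
    """Stand-in for the real page opener: page 'or10' does not exist."""
    if "or10" in url:
        raise HTTPError(url, 404, "Not Found", None, None)
    return "<html>reviews</html>"


print(fetch_or_none(fake_open, "Hotel-Reviews-or5-Foo.html"))
print(fetch_or_none(fake_open, "Hotel-Reviews-or10-Foo.html"))
```

Any other exception, such as a bug in the parsing code, still propagates, so real problems are not silently hidden.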
