Scraping Amazon Customer Reviews

Posted: 2017-07-28 01:41:11

Question:

I am using R to scrape Amazon customer reviews and have run into an error I hope someone can shed some light on.

I've noticed that R fails to scrape the specified node (found with SelectorGadget) from all of the reviews. Each time I run the script I retrieve a different number of reviews, but never all of them. This is very frustrating, because the goal is to scrape the reviews and compile them into a csv file that can be manipulated in R later. Essentially, if a product has 200 reviews, sometimes I get 150 when I run the script, sometimes 75, and so on, but never the full 200. The problem seems to appear after I have scraped repeatedly.

I also run into occasional timeout errors, specifically 'Error in open.connection(x, "rb") : Timeout was reached'.

How can I work around this and keep scraping? I'm a beginner, but any help or insight is greatly appreciated!

 url <- "https://www.amazon.com/Match-Mens-Wild-Cargo-Pants/product-reviews/B009HLOZ9U/ref=cm_cr_arp_d_show_all?ie=UTF8&reviewerType=all_reviews&pageNumber="

N_pages <- 204
A <- NULL
for (j in 1: N_pages)
   pant <- read_html(paste0(url, j)) 
   B <- cbind(pant %>% html_nodes(".review-text") %>%     html_text()     )
   A <- rbind(A,B)
 
tail(A)


print(j) 
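For the 'Error in open.connection(x, "rb") : Timeout was reached' failures, one common workaround is to retry a failed page load after a short pause instead of letting the loop die on the first timeout. A minimal sketch, assuming rvest is loaded; the read_html_retry helper is hypothetical and not part of the original question:

read_html_retry <- function(page_url, tries = 3, wait = 5) {              # hypothetical helper name
  for (attempt in seq_len(tries)) {
    page <- tryCatch(read_html(page_url), error = function(e) NULL)       # NULL on timeout or any other error
    if (!is.null(page)) return(page)                                      # success: return the parsed page
    Sys.sleep(wait)                                                       # pause before retrying
  }
  NULL                                                                    # give up after all attempts fail
}

Inside the loop, pant <- read_html_retry(paste0(url, j)) would then tolerate transient timeouts; a page that still returns NULL after all retries would have to be skipped or re-queued rather than crashing the run.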

Comments:

Answer 1:

Doesn't this work for you?

url <- "https://www.amazon.com/Match-Mens-Wild-Cargo-Pants/product-reviews/B009HLOZ9U/ref=cm_cr_arp_d_paging_btm_2?ie=UTF8&reviewerType=avp_only_reviews&sortBy=recent&pageNumber="

N_pages <- 204
A <- NULL
for (j in 1:N_pages) {                                              # braces so the whole body runs on every page
  pant <- read_html(paste0(url, j))
  B <- cbind(pant %>% html_nodes(".review-text") %>% html_text())
  A <- rbind(A, B)
}

tail(A)
        [,1]                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                     
[1938,] "This is really a good item to get. Trendy, probably you can choose a different color, it fits good but I wouldn't say perfect."                                                                                                                                                                                                                                                                                                                                                                                                                                                                                         
[1939,] "I don't write reviews for most products, but I felt the need to do so for these pants for a couple reasons.  First, they are great pants!  Solid material, well-made, and they fit great.  Second, I want to echo those who say you need to go up in size when you order.  I wear anywhere from 32-34, depending on the brand.  I ordered these in a 36 and they fit like a 33 or 34.  I really love the look and feel of these, and will be ordering more!"                                                                                                                                                            
[1940,] "I bought the green one before, it is good quality and looks nice, than I purchased the similar one, but the  khaki color, but received absolutely different product, different material. really disappointed."                                                                                                                                                                                                                                                                                                                                                                                                          
[1941,] "These pants are great!  I have been looking to update my wardrobe with a more edgy style; these cargo pants deliver on that.  Paired with some casual sneakers or a decent nubuck leather boot completes the look from the waist down.  The lazy-casual look is great when traveling, as are the many pockets.  I wore these pants on a recent day trip to NYC and traveled comfortably with essential items contained in the 8 pockets.  I placed a second order shortly after my first pair arrived because I like them so much.  Shipping and delivery is also fairly fast, considering these pants ship from China!"
[1942,] "Pants are awesome, just like the picture. The size runs small, so if you order them I would order them bigger than normal. I usually wear a 34inch waist because i dont like my pants snug, these pants fit more like a 32 inch waist.Other than that i love them!"                                                                                                                                                                                                                                                                                                                                                     
[1943,] "the good:Pants are made from the durable cotton that has a nice feel; have a lot of useful features and roomy well placed pockets; durable stitching.the bad:Pants will shrink and drier/hot water is not recommended. Would have been better if the cotton was pretreated to prevent shrinking. I would gladly gave up the belt if I wouldn't have to wary about how to wash these pants.the ugly:faux pocket with a zipper. useless feature. on my pair came with a bright gold zipper, unlike a silver in a picture." 

Discussion:

Thank you so much for your input, I really appreciate it! However, this product seems to have 2,038 reviews, while your code produced 1,943 (and even starting from the second page there still seems to be a shortfall of roughly 100 reviews)?

Ah, I only pulled the verified reviews! If you want all reviews, you need to change the type in the URL, e.g. "Type=all_reviews" instead of "Type=avp_only_reviews".

Oh, okay! Surprising that the error was that simple! Sometimes I start scraping from the second page and still get this incomplete-scrape problem, because I do remember being taught to try the second page (though I never knew why). Also, I experimented with the number of pages scraped, and it seems to work if I increase that number beyond the number of pages that actually contain reviews? Again, thank you so much for taking the time to help me!! I've been struggling with this!

It's hard to assess a problem that can't be reproduced. It doesn't make sense to me that arbitrarily increasing N_pages would be useful. My best idea is to scrape the value of N_pages rather than resetting it each time; that way you can be sure your process is the same on every run.

Do you think Amazon might be actively trying to block scraping?
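Putting the suggestions from this discussion together (request all reviews rather than verified-only, don't hard-code the page count, and slow down in case Amazon is throttling the scraper), a hedged sketch of one way to structure the loop; stopping when a page returns no reviews and the two-second delay are assumptions, not something confirmed in the thread:

library(rvest)

url <- "https://www.amazon.com/Match-Mens-Wild-Cargo-Pants/product-reviews/B009HLOZ9U/ref=cm_cr_arp_d_show_all?ie=UTF8&reviewerType=all_reviews&pageNumber="

all_reviews <- character(0)
j <- 1
repeat {
  page    <- read_html(paste0(url, j))                             # fetch review page j
  reviews <- page %>% html_nodes(".review-text") %>% html_text()
  if (length(reviews) == 0) break                                  # assumed: an empty page marks the end of the reviews
  all_reviews <- c(all_reviews, reviews)
  j <- j + 1
  Sys.sleep(2)                                                     # assumed delay between requests to avoid hammering the site
}

write.csv(data.frame(review = all_reviews), "amazon_reviews.csv", row.names = FALSE)   # compile into a csv, as the question intends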

The above covers scraping Amazon customer reviews. If it did not solve your problem, see the following related questions:

Amazon API - *** customer reviewers [duplicate]

NLP analysis of Amazon bestsellers - recommender systems, review classification, and topic modeling

Amazon Product Advertising API - getting review rank

Scraping Amazon product data with PHP using a URL [duplicate]

Scrapy - unable to scrape rating data on Amazon

Cookies randomly fail to retain location information when scraping Amazon prices by location?