不能写出三个链接的所有结果,而只能写最后一个链接的结果

Posted

技术标签:

【中文标题】不能写出三个链接的所有结果,而只能写最后一个链接的结果【英文标题】:Can't write all the results from three links instead of the results from the last link only 【发布时间】:2022-01-15 18:06:45 【问题描述】:

我编写了一个宏来从website 的三页中抓取一些字段。我使用Array()来存储和写入结果,以使执行更快一点。

只要关注单个页面的内容,脚本就可以正常运行。但是,当我使用列表中的三个链接时出现问题。 To be specific, the script overwrites previous results。例如,我应该在执行后得到 150 个结果。相反,我从最后一个链接得到 50 个结果。

到目前为止我已经写了:

Public Sub FetchData()
    Dim Xhr As Object, html As HTMLDocument, Ws As Worksheet
    Dim Link As Variant, Links As Variant, LeadInfo() As String
    Dim I&, HtmlDoc As HTMLDocument, Listings As Object, Headers()
    Dim URLS(), N As Variant
    
    Links = Array( _
        "https://***.com/questions/tagged/web-scraping?tab=newest&page=1&pagesize=50", _
        "https://***.com/questions/tagged/web-scraping?tab=newest&page=2&pagesize=50", _
        "https://***.com/questions/tagged/web-scraping?tab=newest&page=3&pagesize=50" _
    )
    
    Set Ws = ThisWorkbook.Worksheets("Sheet1")
    Set Xhr = CreateObject("MSXML2.XMLHTTP")
    Set Html = New HTMLDocument
    Set HtmlDoc = New HTMLDocument

    Headers = Array("Title", "URL", "User", "Asked")

    For Each Link In Links
        With Xhr
             .Open "GET", Link, False
             .setRequestHeader "User-Agent", "Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/88.0.4324.104 Safari/537.36"
             .send
             Html.body.innerHTML = .responseText
         End With

         Set Listings = Html.querySelectorAll(".summary")
         ReDim LeadInfo(1 To Listings.Length, 1 To 4)

         On Error Resume Next
         For I = 0 To Listings.Length - 1
             HtmlDoc.body.innerHTML = Listings.item(I).innerHTML
             LeadInfo(I + 1, 1) = HtmlDoc.querySelector(".question-hyperlink").innerText
             LeadInfo(I + 1, 2) = HtmlDoc.querySelector(".question-hyperlink").getAttribute("href")
             LeadInfo(I + 1, 3) = HtmlDoc.querySelector(".user-details > a").innerText
             LeadInfo(I + 1, 4) = HtmlDoc.querySelector(".user-action-time > span.relativetime").innerText
         Next I
         On Error GoTo 0

         If IsEmpty(Ws.Cells(1, 1).Value) Then Ws.Cells(1, 1).Resize(1, UBound(Headers) + 1) = Headers
         Ws.Cells(2, 1).Resize(UBound(LeadInfo, 1), UBound(LeadInfo, 2)) = LeadInfo
     Next Link
End Sub

我怎样才能写出三个链接的所有结果,而不是只写最后一个链接的结果?

【问题讨论】:

使用Ws.Cells(2, 1) 中的变量,即Ws.Cells(k+2, 1)。在 UBound(LeadInfo) 之后增加 k 请查看编辑@CDP1802。 我的评论看不到任何变化 你说的很对。你的建议确实解决了问题。感谢一万亿。 【参考方案1】:

您已经有了页面大小(即每页的最大结果)、页数和标题大小。只需对一个数组进行尺寸标注以存储结果并将其写出一次。比重复 ReDim 更有效,ReDim 复制一个数组,然后写出会产生 I/O。

使用变量来跟踪要填充到数组中的行。

将你的写作移到循环之外。

声明Listings As MSHTML.IHTMLDOMChildrenCollection 以便提供更新的 Excel 版本(至少向后兼容 2010)。

Option Explicit

Public Sub FetchData()
    Dim Xhr As Object, Html As MSHTML.HTMLDocument, Ws As Worksheet
    Dim Link As Variant, Links() As Variant, LeadInfo() As String
    Dim I As Long, HtmlDoc As MSHTML.HTMLDocument, Listings As MSHTML.IHTMLDOMChildrenCollection
    Dim Headers() As Variant
    
    Links = Array( _
            "https://***.com/questions/tagged/web-scraping?tab=newest&page=1&pagesize=50", _
            "https://***.com/questions/tagged/web-scraping?tab=newest&page=2&pagesize=50", _
            "https://***.com/questions/tagged/web-scraping?tab=newest&page=3&pagesize=50" _
            )
    
    Set Ws = ThisWorkbook.Worksheets("Sheet1")
    Set Xhr = CreateObject("MSXML2.XMLHTTP")
    Set Html = New HTMLDocument
    Set HtmlDoc = New HTMLDocument

    Headers = Array("Title", "URL", "User", "Asked")

    ReDim LeadInfo(1 To (UBound(Links) + 1) * 50, 1 To UBound(Headers) + 1) 'size according to headers and page size
    
    Dim rowNumber As Long
    
    For Each Link In Links
        With Xhr
            .Open "GET", Link, False
            .setRequestHeader "User-Agent", "Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/88.0.4324.104 Safari/537.36"
            .send
            Html.body.innerHTML = .responseText
        End With

        Set Listings = Html.querySelectorAll(".summary")

        On Error Resume Next
        For I = 0 To Listings.Length - 1
            rowNumber = rowNumber + 1
            HtmlDoc.body.innerHTML = Listings.Item(I).innerHTML
            LeadInfo(rowNumber, 1) = HtmlDoc.querySelector(".question-hyperlink").innerText
            LeadInfo(rowNumber, 2) = Replace$(HtmlDoc.querySelector(".question-hyperlink").href, "about:", "https://***.com")
            LeadInfo(rowNumber, 3) = HtmlDoc.querySelector(".user-details > a").innerText
            LeadInfo(rowNumber, 4) = HtmlDoc.querySelector(".user-action-time > span.relativetime").innerText
        Next I
        On Error GoTo 0

    Next Link
     
    If IsEmpty(Ws.Cells(1, 1).Value) Then Ws.Cells(1, 1).Resize(1, UBound(Headers) + 1) = Headers
    Ws.Cells(2, 1).Resize(UBound(LeadInfo, 1), UBound(LeadInfo, 2)) = LeadInfo
    
End Sub

【讨论】:

以上是关于不能写出三个链接的所有结果,而只能写最后一个链接的结果的主要内容,如果未能解决你的问题,请参考以下文章

noip提高组2012 借教室(luogu 1083)

请教大虾JSP链接数据库的代码怎样写,最好可以把代码写出来给我看看,我驱动已经连好了

不能只删除链接列表中的最后一个元素! (所有其他具有特定值的元素都会被删除)

如何用JS点击超链接弹出对话框

Centos7.3 进入救援模式,解决虚拟机开机引导只能看到一个横杠

网页上的复制链接按钮不管用,只能用右键复制,怎么回事?