Simple HTML DOM Parser - Scraping HTML content that has no id or class

Posted 2023-03-05

[Title]: Simple HTML DOM Parser - Scraping HTML content that has no id or class [Posted]: 2015-12-04 05:58:30 [Question]:

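I am scraping values from a web page and storing them in an array. At the moment I can extract all the td.Place values, because that cell has a class.

Note: I am using Simple HTML DOM Parser.

My current working code: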

<?php

include('simple_html_dom.php');
$html = file_get_html('http://www...');

// initialize empty array to store the data array from each row
$theData3 = array();

// loop over the table rows (the row selector was lost from the post; 'tr' is assumed)
foreach ($html->find('tr') as $row) {

    // initialize array to store the cell data from each row
    $rowData3 = array();

    foreach ($row->find('td.Place') as $cell) {
        // push the cell's text to the array
        $rowData3[] = $cell->innertext;
    }

    // push the row's data array to the 'big' array
    $theData3[] = $rowData3;
}

print_r($theData3);
?>

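What is the problem?

I want to extract the values 100 and -3, which sit in the first two td cells after the td with class="Grad". Because these two td cells have no id or class, I am finding it difficult.

This is the HTML I am currently scraping: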

<tr class="PersonrRow odd">
        <td></td>
        <td class="place">T9</td>
        <td>
        <span class="rank"></span>16</td>
        <td class="Grad">-7
        </td>
        <td>
        100
        </td>
        <td>
        -3
        </td>
        <td>
        712
        </td>
        <td>
        682
        </td>
        <td>
        702
        </td>
        <td>
        68
        </td>
        <td class="person large"></td>
        <td style="">
        277
        </td>
    </tr>

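[Discussion]:

I'm confused about whether you are using PHP or jQuery to scrape this, because I can't see any jQuery in this question.

Apologies Mark, I am using PHP. I will remove jquery from the tags on my question. Thanks.

The question is hard to understand, so please make it clearer. You could also post your full code here. I have built a scraper before, and if I can see your full code I will be able to help you quickly.

Thanks Mark, I have added more code.

[Answer 1]:

OK, so after doing some research and digging through my old files, this is what I came up with for you. You don't need any fancy plugins or anything, just PHP's DOMDocument: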

<?php
    $thedata3 = array();
    $rowdata3 = array();
    $DOM = new DOMDocument();
    $DOM->loadHTMLFile("file path or url");

    // get the actual table itself
    $xpath = new DOMXPath($DOM);
    $table = $xpath->query('//table[@id="tableID"]')->item(0);

    $rows = $table->getElementsByTagName("tr");

    for ($i = 0; $i < $rows->length; $i++) {
        $cols = $rows->item($i)->getElementsByTagName("td");
        for ($j = 0; $j < $cols->length; $j++) {
            // to keep a single column, change $cols->item($j) to
            // $cols->item(<column number>); that gives you the column you're after
            array_push($rowdata3, $cols->item($j)->nodeValue);
        }
        array_push($thedata3, $rowdata3);
        $rowdata3 = array(); // empty the $rowdata3 array for fresh results
    }
?>

This is the best I can do with what you have provided, but I hope it helps in some way. Let me know if you need more help.
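The question asks specifically for the two unlabeled cells that follow the td with class="Grad". One way to target them, as a rough sketch rather than a definitive solution, is the XPath `following-sibling` axis. The snippet below inlines the first cells of the sample row from the question (wrapped in a `table` element so it parses as a row); the variable names `$afterGrad` and `$pair` are made up for illustration, and the query assumes the class attribute is exactly "Grad".

```php
<?php
// First cells of the sample row from the question, wrapped in a <table>.
$html = <<<HTML
<table>
  <tr class="PersonrRow odd">
    <td></td>
    <td class="place">T9</td>
    <td><span class="rank"></span>16</td>
    <td class="Grad">-7
    </td>
    <td>
    100
    </td>
    <td>
    -3
    </td>
    <td>
    712
    </td>
  </tr>
</table>
HTML;

$dom = new DOMDocument();
libxml_use_internal_errors(true); // collect parser warnings instead of printing them
$dom->loadHTML($html);
libxml_clear_errors();

$xpath = new DOMXPath($dom);

$afterGrad = array();
// For every td whose class attribute is exactly "Grad" (assumption) ...
foreach ($xpath->query('//td[@class="Grad"]') as $grad) {
    $pair = array();
    // ... take the first two td siblings that come after it in the same row.
    foreach ($xpath->query('following-sibling::td[position() <= 2]', $grad) as $cell) {
        $pair[] = trim($cell->nodeValue); // strip the surrounding whitespace and newlines
    }
    $afterGrad[] = $pair;
}

print_r($afterGrad);
```

On a full page, each row that contains a td.Grad cell would contribute one pair to `$afterGrad`, so the cells do not need an id or class of their own.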

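For ease of access and readability, I suggest putting everything into an associative array, as you planned. Then, once you have scraped all the data, manipulate the array and extract what you want from it. That should be easier.

References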
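The associative-array suggestion could look roughly like the sketch below. The column names in `$columns` are hypothetical, since the table in the question does not name its columns, and `$scraped` stands in for rows already collected as plain numeric arrays (here, the first six cells of the sample row).

```php
<?php
// Hypothetical column names; adjust them to the real table.
$columns = array('blank', 'place', 'rank', 'grad', 'first', 'second');

// Stand-in for already scraped rows: the first six cells of the question's sample row.
$scraped = array(
    array('', 'T9', '16', '-7', '100', '-3'),
);

$records = array();
foreach ($scraped as $cells) {
    // array_combine pairs each column name with the cell at the same position.
    $records[] = array_combine($columns, $cells);
}

// After everything is collected, pull out only the fields of interest.
$wanted = array();
foreach ($records as $record) {
    $wanted[] = array(
        'first'  => $record['first'],
        'second' => $record['second'],
    );
}

print_r($wanted);
```

Naming the cells up front means later code reads `$record['first']` instead of a bare index such as `$cells[4]`, which is easier to follow when a row has many unlabeled columns. Note that `array_combine` expects both arrays to have the same number of elements.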

PHP.net DOMDocument http://php.net/manual/en/class.domdocument.php

PHP.net DOMXPath http://php.net/manual/en/class.domxpath.php

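The links above contain the full reference for the DOMDocument and DOMXPath classes. They have everything you need to get started!

[Discussion]:

Thanks Mark, just a quick question: the HTML page I am parsing is large. Don't you have to point to class="Grad" in getElementsByTagName, so that I can be sure I am getting the right values from that area?

Helena, I have added code to address this. Let me know if you need anything else. If it is the solution you were looking for, please accept this answer and upvote it.

Great answer, very useful. I seem to be getting an error when I run the script: PHP Fatal error: Cannot use [] for reading in /testen/tester.php on line 22

What version of PHP are you using?

PHP Version => 5.5.12-2ubuntu4.6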
