Scraping a large HTML file with Simple HTML DOM

Posted 2023-03-05

[Title]: simple html dom scraping large html file [Posted]: 2013-07-30 03:10:23 [Question]:

I need to scrape a large HTML file (for example: http://www.indianrail.gov.in/mail_express_trn_list.html) using Simple HTML DOM. I started with a simple script:

<?php
require "simple_html_dom.php";
echo file_get_html('http://www.indianrail.gov.in/mail_express_trn_list.html')->plaintext;
?>

Nothing is displayed, just a blank page, and the Apache error.log file contains these messages:

 PHP Notice:  Trying to get property of non-object in /var/www/index.php on line 3
 PHP Notice:  Trying to get property of non-object in /var/www/index.php on line 3

Meanwhile, all other pages (for example: http://www.indianrail.gov.in/special_trn_list.html) work fine with the same script.

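[Comments]:

Have you tried using file_get_contents instead of file_get_html? php.net/manual/en/function.file-get-contents.php

I can reproduce the issue. I'll dig deeper and let you know.

@Fred I tried that, but I get the same error.. @DevZer0 waiting for your reply.. thanks a lot :)

@krizna These answers on SO may help: ***.com/a/6006379/1415724 and ***.com/a/6519443/1415724

[Answer 1]: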

The problem appears to be the MAX_FILE_SIZE constant defined in simple_html_dom.

You can adjust it by editing the define('MAX_FILE_SIZE', 600000); line in the simple_html_dom.php file.
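To illustrate why an oversized page leads to the "Trying to get property of non-object" notice, here is a minimal sketch. It assumes that file_get_html() returns false instead of a DOM object when the downloaded content is longer than MAX_FILE_SIZE, which is consistent with the error log in the question. The helpers exceeds_size_limit() and plaintext_or_null() are hypothetical names introduced for illustration; they are not part of simple_html_dom.

```php
<?php
// Hypothetical helper (not part of simple_html_dom): mirrors the kind of
// length check that is assumed to make file_get_html() give up on large pages.
function exceeds_size_limit(string $contents, int $limit): bool
{
    return strlen($contents) > $limit;
}

// Hypothetical helper: read ->plaintext only when the parser actually
// returned an object, so a false return value does not trigger the notice.
function plaintext_or_null($dom): ?string
{
    return is_object($dom) ? $dom->plaintext : null;
}

$limit = 600000;                        // default quoted in the answer
$page  = str_repeat('x', $limit + 1);   // stand-in for an oversized download

if (exceeds_size_limit($page, $limit)) {
    echo "Page is larger than the limit; raise MAX_FILE_SIZE.\n";
}

if (plaintext_or_null(false) === null) {
    echo "Parser returned a non-object; nothing to print.\n";
}
```

In the original script, the same guard would amount to storing the result of file_get_html() in a variable and checking it before reading ->plaintext, rather than chaining the call directly.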

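[Comments]:

I tried define('MAX_FILE_SIZE', 6000000000000000000); .. but no luck.. still the same error.. thanks

Define a realistic number. I set it to 12600000 and it seems to work, but now I get a different error.. exit signal Segmentation fault (11)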
