Removing DOMDocument warnings while parsing page content
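【Title】: Removing DOMDocument warnings while parsing page content
【Posted】: 2013-09-27 10:42:53
【Question】: I am trying to extract the text content of an arbitrary URL. The output should not contain any HTML markup. This works fine, but reading the content at the URL below produces a bunch of warnings. How can I get rid of these warnings?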
<?php
$url= 'http://***.com/questions/12097352/how-can-i-parse-dynamic-content-from-a-web-page';
$doc = new DOMDocument();
$doc->loadHTMLFile($url);
$xpath = new DOMXPath($doc);
// Remove every <script> element so its source does not end up in the text output
foreach ($xpath->query("//script") as $script) {
    $script->parentNode->removeChild($script);
}
$textContent = $doc->textContent; // inherited from DOMNode
echo $textContent;
?>
Warnings:
content-from-a-web-page, line: 255 in /opt/lampp/htdocs/FB/ec2/test.php on line 13
Warning: DOMDocument::loadHTMLFile(): htmlParseEntityRef: expecting ';' in http://***.com/questions/12097352/how-can-i-parse-dynamic-content-from-a-web-page, line: 255 in /opt/lampp/htdocs/FB/ec2/test.php on line 13
Warning: DOMDocument::loadHTMLFile(): htmlParseEntityRef: expecting ';' in http://***.com/questions/12097352/how-can-i-parse-dynamic-content-from-a-web-page, line: 273 in /opt/lampp/htdocs/FB/ec2/test.php on line 13
Warning: DOMDocument::loadHTMLFile(): htmlParseEntityRef: expecting ';' in http://***.com/questions/12097352/how-can-i-parse-dynamic-content-from-a-web-page, line: 273 in /opt/lampp/htdocs/FB/ec2/test.php on line 13
Warning: DOMDocument::loadHTMLFile(): htmlParseEntityRef: expecting ';' in http://***.com/questions/12097352/how-can-i-parse-dynamic-content-from-a-web-page, line: 412 in /opt/lampp/htdocs/FB/ec2/test.php on line 13
Warning: DOMDocument::loadHTMLFile(): htmlParseEntityRef: expecting ';' in http://***.com/questions/12097352/how-can-i-parse-dynamic-content-from-a-web-page, line: 412 in /opt/lampp/htdocs/FB/ec2/test.php on line 13
Warning: DOMDocument::loadHTMLFile(): htmlParseEntityRef: expecting ';' in http://***.com/questions/12097352/how-can-i-parse-dynamic-content-from-a-web-page, line: 551 in /opt/lampp/htdocs/FB/ec2/test.php on line 13
Warning: DOMDocument::loadHTMLFile(): htmlParseEntityRef: expecting ';' in http://***.com/questions/12097352/how-can-i-parse-dynamic-content-from-a-web-page, line: 551 in /opt/lampp/htdocs/FB/ec2/test.php on line 13
Warning: DOMDocument::loadHTMLFile(): ID display-name already defined in http://***.com/questions/12097352/how-can-i-parse-dynamic-content-from-a-web-page, line: 731 in /opt/lampp/htdocs/FB/ec2/test.php on line 13
【Comments】:
Possible duplicate of DOMDocument::loadHTML error
【Answer 1】:
You can use libxml_use_internal_errors() and do the following:
libxml_use_internal_errors(true);
$doc->loadHTMLFile($url);
libxml_clear_errors();
As PeeHaa points out in the comments below, it is a good idea to restore the previous error-handling state afterwards. You can do it like this:
$errors = libxml_use_internal_errors(true); //store
$doc->loadHTMLFile($url);
libxml_clear_errors();
libxml_use_internal_errors($errors); //reset back to previous state
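Here is how it works:
1. libxml_use_internal_errors(true) tells libxml to handle errors and warnings internally instead of outputting them to the browser. Its return value, the previous setting, is stored in a variable.
2. The HTML file is then loaded with loadHTMLFile().
3. libxml_clear_errors() empties the error buffer.
4. The final libxml_use_internal_errors($errors) call restores the previous setting.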
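The steps above can be tried without a network request. The following is a minimal sketch, not the answerer's code: the inline `$html` string is a made-up stand-in for the remote page, and it uses loadHTML() instead of loadHTMLFile() so it runs offline. The bare `&` in the link and the repeated `id` are the kind of markup that produced the warnings in the question, although which errors are reported depends on the installed libxml2 version. It also shows libxml_get_errors(), which lets you inspect the buffered errors before discarding them.

```php
<?php
// Hypothetical malformed markup standing in for the remote page.
$html = '<html><body>'
      . '<a id="display-name" href="page.php?a=1&b=2">first</a>'
      . '<span id="display-name">second</span>'
      . '</body></html>';

// Buffer libxml errors internally and remember the previous setting.
$previous = libxml_use_internal_errors(true);

$doc = new DOMDocument();
$doc->loadHTML($html); // parser problems are buffered, not printed

// Copy the buffered errors (an array of LibXMLError objects), then empty the buffer.
$errors = libxml_get_errors();
libxml_clear_errors();

// Restore whatever setting was active before.
libxml_use_internal_errors($previous);

// Optional: report what the parser complained about.
foreach ($errors as $error) {
    printf("line %d: %s\n", $error->line, trim($error->message));
}

echo $doc->textContent, "\n";
```

If the errors are of no interest, the libxml_get_errors() call and the reporting loop can simply be dropped, which leaves the four-line pattern from the answer.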
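【Comments】:
Note that it is considered good practice to store the current state of libxml_use_internal_errors and reset it afterwards.
@PeeHaa: Good idea. I have added it to the answer :)
@AmalMurali: Thank you very much. Could you explain the difference between the two code snippets?
@Karimkhan: I have added an explanation to my answer.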
If you call libxml_use_internal_errors again afterwards, you do not need to call libxml_clear_errors.
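The last comment can be checked with a small experiment. This is a sketch under assumptions: the `$html` string is invented for illustration, and the behavior relied on is the one described in the PHP manual, namely that disabling internal error handling also clears any buffered libxml errors. Note that this only helps when the call actually disables buffering; if the restored state is `true`, the buffer is presumably left alone, so calling libxml_clear_errors() explicitly remains the safer habit.

```php
<?php
// Hypothetical markup; the bare "&" is only there to try to provoke a parser error.
$html = '<html><body><a href="x.php?a=1&b=2">link</a></body></html>';

libxml_use_internal_errors(true);  // start buffering errors

$doc = new DOMDocument();
$doc->loadHTML($html);

// Number of errors buffered by the parse (may be 0 on some libxml2 versions).
$buffered = count(libxml_get_errors());

libxml_use_internal_errors(false); // switch buffering off again

// Number of errors still available after disabling.
$remaining = count(libxml_get_errors());

printf("buffered: %d, remaining after disabling: %d\n", $buffered, $remaining);
```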