How to extract title and meta description using PHP Simple HTML DOM Parser?

Posted 2023-03-05

【Title】: How to extract title and meta description using PHP Simple HTML DOM Parser? 【Posted】: 2012-07-08 07:17:35 【Question】:

How can I extract a page's title and meta description using PHP Simple HTML DOM Parser?

I only need the page's title and keywords as plain text.

【Comments】:

The PHP library at simplehtmldom.sourceforge.net, I guess?

【Answer 1】:

I just took a look at the HTML DOM Parser. Try:

$html = new simple_html_dom();
$html->load_file('xxx'); //put url or filename in place of xxx
$title = $html->find('title');
echo $title->plaintext;

$descr = $html->find('meta[description]');
echo $descr->plaintext;

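【Comments】:

This code did not work for me (anymore? The answer is older than the latest version of the library), because find can return multiple elements. To make it work, I had to pass a second argument of 0 to find: $html->find('title', 0)->plaintext;

The answer is incorrect. See my tested answer below.

【Answer 2】: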

$html = new simple_html_dom();
$html->load_file('xxx'); 
//put url or filename in place of xxx
$title = array_shift($html->find('title'))->innertext;
echo $title;
$descr = array_shift($html->find("meta[name='description']"))->content;
echo $descr;

【Comments】:

Yes! I tested this code. See my blog.

I get a warning: Strict Standards: Only variables should be passed by reference

【Answer 3】:

$html = new simple_html_dom();
$html->load_file('http://www.google.com'); 
$title = $html->find('title',0)->innertext;

$html->find('title') returns an array, so you should use $html->find('title', 0). The same applies to meta[description].

【Comments】:

【Answer 4】:

$html = new simple_html_dom();
$html->load_file('some_url'); 

//To get Meta Title
$meta_title = $html->find("meta[name='title']", 0)->content;

//To get Meta Description
$meta_description = $html->find("meta[name='description']", 0)->content;

//To get Meta Keywords
$meta_keywords = $html->find("meta[name='keywords']", 0)->content;

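Note: the names of meta tags are case-sensitive!

【Comments】:

Correct. +1 for selecting by tag and attribute.

【Answer 5】:

You can do this with plain PHP, which is easy to follow. Like this: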

$result = 'site.com';
$tags = get_meta_tags("html/".$result);
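For reference, here is a minimal sketch of how PHP's built-in get_meta_tags() behaves on its own, without Simple HTML DOM. The sample markup and the temporary file are assumptions made for illustration; the function can also take a URL when allow_url_fopen is enabled. It only reads <meta name="..." content="..."> tags in the head, so it does not give you the <title>.

```php
<?php
// Hypothetical sample page, written to a temporary file for the demo.
$sample = <<<HTML
<html>
<head>
<title>Example page</title>
<meta name="Description" content="A short summary of the page.">
<meta name="keywords" content="php, parsing, meta">
</head>
<body><p>Hello</p></body>
</html>
HTML;

$path = tempnam(sys_get_temp_dir(), 'meta');
file_put_contents($path, $sample);

// get_meta_tags() returns an array keyed by the lowercased name attribute.
$tags = get_meta_tags($path);
unlink($path);

// Fall back to an empty string when a tag is missing.
$description = $tags['description'] ?? '';
$keywords    = $tags['keywords'] ?? '';

echo $description, PHP_EOL;
echo $keywords, PHP_EOL;
```

Because the keys are lowercased, a tag written as name="Description" is still found under 'description'.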

【Comments】:

This function fails badly in some cases.

【Answer 6】:

The correct answer is:

$html = str_get_html($html);
$descr = $html->find("meta[name=description]", 0);
$description = $descr->content;

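The code above turns the HTML into an object. The find method then looks for a meta tag whose name is description. Finally, you need to return the value of the meta tag's content attribute, not innertext or plaintext as others have said.

This has been tested and is used in live code. Best regards.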
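The difference between element text and the content attribute can also be seen without the library. The following is only a sketch that swaps in PHP's bundled DOM extension (DOMDocument) for Simple HTML DOM, and the markup string is an assumed stand-in for a fetched page.

```php
<?php
// Hypothetical markup standing in for a fetched page.
$source = '<html><head><title>Example page</title>'
        . '<meta name="description" content="A short summary of the page.">'
        . '</head><body></body></html>';

$doc = new DOMDocument();
// Suppress the warnings libxml raises for imperfect real-world HTML.
@$doc->loadHTML($source);

$title = '';
$titleNodes = $doc->getElementsByTagName('title');
if ($titleNodes->length > 0) {
    // The title is stored as the element's text.
    $title = trim($titleNodes->item(0)->textContent);
}

$description = '';
foreach ($doc->getElementsByTagName('meta') as $meta) {
    if ($meta->getAttribute('name') === 'description') {
        // The description is stored in the content attribute;
        // a <meta> element has no text of its own.
        $description = trim($meta->getAttribute('content'));
        break;
    }
}

echo $title, PHP_EOL;
echo $description, PHP_EOL;
```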

【Comments】:

Which variable should hold the web URL?

【Answer 7】:

Adapted from LeiXC's solution above. You need to use the simple html dom class:

$dom = new simple_html_dom();
$dom->load_file( 'websiteurl.com' );// put your own url in here for testing
$html = str_get_html($dom);
$descr = $html->find("meta[name=description]", 0);
$description = $descr->content;
echo $description;

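I have tested this code, and yes, it is case-sensitive (some meta tags spell Description with a capital D).

Here is a check that handles both spellings: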

if( is_object( $html->find("meta[name=description]", 0)) )
    echo $html->find("meta[name=description]", 0)->content;
elseif( is_object( $html->find("meta[name=Description]", 0)) )
    echo $html->find("meta[name=Description]", 0)->content;
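Checking each spelling by hand only covers the variants you think of. One possible alternative is to compare the name attribute case-insensitively. The sketch below uses PHP's bundled DOMDocument rather than Simple HTML DOM, and find_meta_content() is a hypothetical helper name, not part of either library.

```php
<?php
/**
 * Hypothetical helper: returns the content of the first <meta> tag whose
 * name matches $name regardless of letter case, or null if none is found.
 */
function find_meta_content(string $source, string $name): ?string
{
    $doc = new DOMDocument();
    // Suppress the warnings libxml raises for imperfect real-world HTML.
    @$doc->loadHTML($source);

    foreach ($doc->getElementsByTagName('meta') as $meta) {
        // strcasecmp() returns 0 when the strings are equal ignoring case.
        if (strcasecmp($meta->getAttribute('name'), $name) === 0) {
            return trim($meta->getAttribute('content'));
        }
    }
    return null;
}

// Assumed sample page that spells the name with a capital D.
$page = '<html><head>'
      . '<meta name="Description" content="Capital D works too.">'
      . '</head><body></body></html>';

echo find_meta_content($page, 'description'), PHP_EOL;
var_dump(find_meta_content($page, 'keywords'));
```

Returning null for a missing tag lets the caller tell "no such tag" apart from an empty content value.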

【Comments】:

【Answer 8】:

$html->find('meta[name=keywords]',0)->attr['content'];
$html->find('meta[name=description]',0)->attr['content'];

【Comments】:

【Answer 9】:

I found a simple way to get the description:

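$html = new simple_html_dom();
$html->load_file('your_url');
// find() with index 0 returns the first matching element instead of an array
$title = $html->find('title', 0)->plaintext; //<title>**Text from here**</title>
// the description is stored in the content attribute, not in the element's text
$description = $html->find("meta[name='description']", 0)->content; //<meta name="description" content="**Text from here**">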

If your strings contain extra whitespace, try this:

$title = trim($title);
$description = trim($description);

【Comments】:
