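Extracting text tags in order - How can this be done?
【Posted】: 2020-11-25 22:07:15
【Question】: I am trying to find all the text in some HTML together with the parent tag of each piece of text. In the example below, the variable named html holds the sample HTML from which I try to extract the tags and the text.
This works fine and gives the tags and text as expected.
Here I use cheerio to traverse the DOM; cheerio works exactly like jQuery.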
const cheerio = require("cheerio");
const html = `
<html>
<head></head>
<body>
<p>
Regular bail is the legal procedure through which a court can direct
release of persons in custody under suspicion of having committed an offence,
usually on some conditions which are designed to ensure
that the person does not flee or otherwise obstruct the course of justice.
These conditions may require executing a “personal bond”, whereby a person
pledges a certain amount of money or property which may be forfeited if
there is a breach of the bail conditions. Or, a court may require
executing a bond “with sureties”, where a person is not seen as
reliable enough and may have to present
<em>other persons</em> to vouch for her,
and the sureties must execute bonds pledging money / property which
may be forfeited if the accused person breaches a bail condition.
</p>
</body>
</html>
`;
// Collect every node of the given nodeType (3 = text) found under el.
const getNodeType = function (renderedHTML, el, nodeType) {
  const $ = cheerio.load(renderedHTML);
  return $(el)
    .find(":not(iframe)")
    .addBack()
    .contents()
    .filter(function () {
      return this.nodeType == nodeType;
    });
};

let allTextPairs = [];
const $ = cheerio.load(html);
// Record [parent tag name, trimmed text] for each text node found.
getNodeType(html, $("html"), 3).map((i, node) => {
  const parent = node.parentNode.tagName;
  const nodeValue = node.nodeValue.trim();
  allTextPairs.push([parent, nodeValue]);
});
console.log(allTextPairs);
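The output is shown in the screenshot.
The problem is that the extracted text nodes come out in the wrong order. If you look at the screenshot, "other persons" is reported out of place, although it should come before "to vouch for her ...". Why does this happen, and how can I prevent it?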
【Answer 1】: You probably just want to walk the tree in depth-first order. The walk function is courtesy of this gist.
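// Call fn on el, then on each descendant, depth first (document order).
function walk(el, fn, parents = []) {
  fn(el, parents);
  (el.children || []).forEach((child) => walk(child, fn, parents.concat(el)));
}

walk(cheerio.load(html).root()[0], (node, parents) => {
  // Skip whitespace-only text nodes; print the parent tag name and the text.
  if (node.type === "text" && node.data.trim()) {
    console.log(parents[parents.length - 1].name, node.data);
  }
});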
This prints the pairs out, but you could just as well push them into that array of yours.
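To collect the pairs instead of printing them, the callback can push into an array. The following is a minimal sketch, assuming nodes shaped like the ones the answer's code reads (type, name, data, children). The small hand-built tree stands in for a parsed document so the example runs without cheerio, and collectTextPairs is a hypothetical helper name, not part of any library.

```javascript
// Depth-first walk, same shape as the walk function in the answer.
function walk(el, fn, parents = []) {
  fn(el, parents);
  (el.children || []).forEach((child) => walk(child, fn, parents.concat(el)));
}

// Hypothetical helper: returns [parentTagName, trimmedText] pairs
// in the order the text nodes are visited by walk.
function collectTextPairs(root) {
  const pairs = [];
  walk(root, (node, parents) => {
    if (node.type === "text" && node.data.trim()) {
      pairs.push([parents[parents.length - 1].name, node.data.trim()]);
    }
  });
  return pairs;
}

// Hand-built stand-in (assumption) for:
// <p>may have to present <em>other persons</em> to vouch for her</p>
const tree = {
  type: "root",
  children: [
    {
      type: "tag",
      name: "p",
      children: [
        { type: "text", data: "may have to present " },
        { type: "tag", name: "em", children: [{ type: "text", data: "other persons" }] },
        { type: "text", data: " to vouch for her" },
      ],
    },
  ],
};

const allTextPairs = collectTextPairs(tree);
console.log(allTextPairs);
```

With cheerio, the same helper would presumably be called on the root node used in the answer, i.e. collectTextPairs(cheerio.load(html).root()[0]).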
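【Comments】:
Hmm, I don't understand. How is this different from getNodeType? Could you explain it in this context?
The recursive walk function ensures that the DOM tree is traversed in document order. I think the combination of .addBack() and the rest breaks the order of the Cheerio traversal.
Any suggestions for improving the same function so that it keeps the order?
Honestly, no. :)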
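One possible way to keep a gather-then-filter function and still end up with document order is to sort the gathered nodes by their position in a depth-first pass. The sketch below is a simplified model on plain objects, not Cheerio itself: gatherChildrenPerElement imitates collecting the children of each matched element one element at a time (an assumption about why the order is lost), and documentOrder and sortIntoDocumentOrder are hypothetical helpers.

```javascript
// Depth-first list of every node under root, i.e. document order.
function documentOrder(root) {
  const out = [];
  (function visit(node) {
    out.push(node);
    (node.children || []).forEach(visit);
  })(root);
  return out;
}

// Simplified model (assumption): take every element in document order,
// then append each element's children as one batch per element.
function gatherChildrenPerElement(root) {
  const elements = documentOrder(root).filter((n) => n.children);
  return elements.flatMap((el) => el.children);
}

// Hypothetical fix: sort the gathered nodes by their depth-first index.
function sortIntoDocumentOrder(root, nodes) {
  const index = new Map(documentOrder(root).map((n, i) => [n, i]));
  return [...nodes].sort((a, b) => index.get(a) - index.get(b));
}

// Hand-built stand-in (assumption) for:
// <p>may have to present <em>other persons</em> to vouch for her</p>
const p = {
  type: "tag",
  name: "p",
  children: [
    { type: "text", data: "may have to present " },
    { type: "tag", name: "em", children: [{ type: "text", data: "other persons" }] },
    { type: "text", data: " to vouch for her" },
  ],
};

const isText = (n) => n.type === "text";
const gathered = gatherChildrenPerElement(p).filter(isText);
const sorted = sortIntoDocumentOrder(p, gathered);

console.log(gathered.map((n) => n.data.trim())); // batch order: em's text comes last
console.log(sorted.map((n) => n.data.trim()));   // document order restored
```

In this model the text inside em lands after both of p's own text nodes, because p's children are appended as a batch before em's children; sorting by depth-first index puts it back between them.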
Why is node.type a string (i.e. "text")? Isn't it usually a number? See developer.mozilla.org/en-US/docs/Web/API/Node/nodeType
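On that last question: the answer's code reads a string type on each node, while the question's code compares a numeric nodeType, so the two snippets use different conventions for the same information. A small sketch of translating between them follows. The string names in the table are assumptions (only "text" appears in this thread; the others are names commonly used by HTML parsers), the numbers are the standard DOM nodeType constants described on the MDN page linked above, and toDomNodeType is a hypothetical helper.

```javascript
// Standard DOM nodeType constants (see the MDN page on Node.nodeType).
const DOM_NODE_TYPE = {
  ELEMENT: 1,
  TEXT: 3,
  CDATA_SECTION: 4,
  COMMENT: 8,
  DOCUMENT: 9,
};

// Assumed mapping from parser-style string types to DOM numbers.
const STRING_TYPE_TO_DOM = {
  tag: DOM_NODE_TYPE.ELEMENT,
  script: DOM_NODE_TYPE.ELEMENT,
  style: DOM_NODE_TYPE.ELEMENT,
  text: DOM_NODE_TYPE.TEXT,
  cdata: DOM_NODE_TYPE.CDATA_SECTION,
  comment: DOM_NODE_TYPE.COMMENT,
  root: DOM_NODE_TYPE.DOCUMENT,
};

// Hypothetical helper: prefer a numeric nodeType when the node has one,
// otherwise translate the string type; unknown types give undefined.
function toDomNodeType(node) {
  if (typeof node.nodeType === "number") return node.nodeType;
  return STRING_TYPE_TO_DOM[node.type];
}

console.log(toDomNodeType({ type: "text", data: "other persons" }));
console.log(toDomNodeType({ type: "tag", name: "em", children: [] }));
```

A check such as toDomNodeType(node) === 3 then selects text nodes regardless of which of the two conventions the node object follows.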