Jsoup fails to parse elements in rare cases
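I recently migrated the RSS parsing in my application from rome to jsoup. When parsing a feed, Jsoup occasionally fails to parse a < and a > correctly, which leaves &lt; and &gt; in the retrieved Document and in turn causes problems when I try to use Document::select.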
MCVE
import org.jsoup.Jsoup;
import org.jsoup.nodes.Element;
import org.jsoup.parser.Parser;

import java.io.IOException;
import java.util.Collection;

public class MCVE {
    public static void main(final String[] args) throws IOException {
        // Fetch the feed and parse it with the XML parser rather than the default HTML parser.
        Jsoup.connect("https://rss.packetstormsecurity.com/files/page18")
                .parser(Parser.xmlParser())
                .get()
                .select("item")
                .stream()
                // Print the text of every pubDate element found in each item.
                .map(e -> e.select("pubDate"))
                .flatMap(Collection::stream)
                .map(Element::text)
                .forEach(System.out::println);
    }
}
The code above currently prints the following (the RSS feed is updated constantly, and the problem does not occur with local files):
Wed, 22 Nov 2017 15:29:54 GMT
Wed, 22 Nov 2017 15:29:43 GMT
Wed, 22 Nov 2017 15:29:36 GMT
Wed, 22 Nov 2017 15:29:28 GMT
Wed, 22 Nov 2017 15:29:22 GMT
Wed, 22 Nov 2017 15:27:23 GMT
Tue, 21 Nov 2017 23:23:23 GMT
Tue, 21 Nov 2017 19:21:38 GMT
Tue, 21 Nov 2017 19:20:12 GMT
Tue, 21 Nov 2017 19:18:15 GMT
Tue, 21 Nov 2017 19:16:17 GMT
Tue, 21 Nov 2017 19:14:37 GMT
Tue, 21 Nov 2017 19:13:34 GMT
Tue, 21 Nov 2017 19:11:33 GMT
Tue, 21 Nov 2017 19:07:49 GMT
Tue, 21 Nov 2017 19:06:56 GMT
Tue, 21 Nov 2017 19:04:19 GMT
Tue, 21 Nov 2017 19:03:57 GMT
Tue, 21 Nov 2017 10:11:11 GMT
Tue, 21 Nov 2017 04:54:00 GMT
Tue, 21 Nov 2017 04:04:00 GMT</pubDate> Ubuntu Security Notice 3483-2 - USN-3483-1 fixed a vulnerability in procmail. This update provides the corresponding update for Ubuntu 12.04 ESM. Jakub Wilk discovered that the formail tool incorrectly handled certain malformed mail messages. An attacker could use this flaw to cause formail to crash, resulting in a denial of service, or possibly execute arbitrary code. Various other issues were also addressed.
Mon, 20 Nov 2017 22:22:00 GMT
Mon, 20 Nov 2017 16:16:00 GMT
Mon, 20 Nov 2017 16:15:00 GMT
Mon, 20 Nov 2017 16:14:00 GMT
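The one malformed entry in the output above is easy to spot by eye, but it can also be flagged programmatically. The following is a minimal sketch, assuming every pubDate in this feed follows the RFC 1123 style seen in the output. PubDateCheck and isWellFormed are hypothetical names introduced for illustration; they are not part of Jsoup or of the MCVE.

```java
import java.time.ZonedDateTime;
import java.time.format.DateTimeFormatter;
import java.time.format.DateTimeParseException;

// Hypothetical helper for illustration only; not part of Jsoup or the original MCVE.
final class PubDateCheck {

    // Returns true only if the entire string is an RFC 1123 date such as
    // "Tue, 21 Nov 2017 04:54:00 GMT". Any trailing markup or description
    // text makes parsing fail, so the affected entry is reported as malformed.
    static boolean isWellFormed(final String text) {
        try {
            ZonedDateTime.parse(text.trim(), DateTimeFormatter.RFC_1123_DATE_TIME);
            return true;
        } catch (final DateTimeParseException e) {
            return false;
        }
    }

    public static void main(final String[] args) {
        final String[] samples = {
            "Tue, 21 Nov 2017 04:54:00 GMT",
            "Tue, 21 Nov 2017 04:04:00 GMT</pubDate> Ubuntu Security Notice 3483-2"
        };
        for (final String sample : samples) {
            System.out.println(isWellFormed(sample) + " <- " + sample);
        }
    }
}
```

A check like this only detects the symptom in the extracted text; it does not repair the Document that Jsoup returned.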
This is a snippet of the Document that Jsoup returned to me.
<item>
<title>Ubuntu Security Notice USN-3483-2</title>
<link>
https://packetstormsecurity.com/files/145055/USN-3483-2.txt
</link>
<guid isPermaLink="true">
https://packetstormsecurity.com/files/145055/USN-3483-2.txt
</guid>
<comments>
https://packetstormsecurity.com/files/145055/Ubuntu-Security-Notice-USN-3483-2.html
</comments>
<pubDate>
Tue, 21 Nov 2017 04:04:00 GMT&lt;/pubDate&gt; <!-- the affected line -->
<description>
Ubuntu Security Notice 3483-2 - USN-3483-1 fixed a vulnerability in procmail. This update provides the corresponding update for Ubuntu 12.04 ESM. Jakub Wilk discovered that the formail tool incorrectly handled certain malformed mail messages. An attacker could use this flaw to cause formail to crash, resulting in a denial of service, or possibly execute arbitrary code. Various other issues were also addressed.
</description>
<category></category>
</pubDate>
</item>
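Here some characters are parsed incorrectly, even though the XML served by the website is well formed.
When the same URL is used with a trailing slash (https://rss.packetstormsecurity.com/files/page18/), the problem does not occur on that page, but it does occur on other pages.
Because the feed is active, the page on which the problem occurs changes over time. If the problem no longer reproduces on page 18, I will update the question with a new page. The problem also does not occur if the file is downloaded separately and then parsed with Jsoup::parse.
The Jsoup version is 1.11.2.
Additional MCVE
This MCVE shows that the problem appears only when the response is parsed by Jsoup; the XML that is actually downloaded is fine: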
import org.jsoup.Connection;
import org.jsoup.Jsoup;

import java.io.IOException;

public class MCVE {
    public static void main(final String[] args) throws IOException {
        final Connection.Response response = Jsoup.connect("https://rss.packetstormsecurity.com/files/page18").execute();
        // Well formed XML
        System.out.println(response.body());
        // Malformed XML
        System.out.println(response.parse());
    }
}
This appears to be a bug in org.jsoup.helper.HttpConnection::get and org.jsoup.helper.HttpConnection.Response::parse. I have filed a corresponding GitHub issue and published a repo that reproduces the bug.
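The observations above (the response body is well formed, and parsing a separately downloaded file with Jsoup::parse works) suggest a possible workaround: read the whole body into a String first and only then hand it to the XML parser. The following is a sketch under that assumption and has not been verified against the affected feed. BodyThenParse and pubDates are hypothetical names introduced for illustration.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.parser.Parser;

import java.io.IOException;
import java.util.List;
import java.util.stream.Collectors;

// Hypothetical workaround sketch; assumes that parsing the fully read body
// string avoids the problem seen with get() and Response::parse.
final class BodyThenParse {

    // Parses already-downloaded XML text and returns the text of every
    // pubDate element nested inside an item element.
    static List<String> pubDates(final String xml, final String baseUri) {
        final Document document = Jsoup.parse(xml, baseUri, Parser.xmlParser());
        return document.select("item pubDate")
                .stream()
                .map(Element::text)
                .collect(Collectors.toList());
    }

    public static void main(final String[] args) throws IOException {
        final String url = "https://rss.packetstormsecurity.com/files/page18";
        // Read the complete response body as a String instead of calling get() or parse().
        final String body = Jsoup.connect(url).execute().body();
        pubDates(body, url).forEach(System.out::println);
    }
}
```

Keeping the parsing step in a method that takes a String also makes it possible to exercise it against a saved copy of the feed without any network access.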
This will be fixed in Jsoup 1.11.3.