Jsoup fails to parse elements in rare cases

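I recently migrated the RSS parsing in my application to Jsoup. When parsing a feed, Jsoup sometimes fails to parse < and > correctly, leaving &lt; and &gt; in the retrieved Document, which in turn causes problems when using Document::select.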

MCVE

import org.jsoup.Jsoup;
import org.jsoup.nodes.Element;
import org.jsoup.parser.Parser;

import java.io.IOException;
import java.util.Collection;

public class MCVE {
    public static void main(final String[] args) throws IOException {
        Jsoup.connect("https://rss.packetstormsecurity.com/files/page18")
             .parser(Parser.xmlParser())
             .get()
             .select("item")
             .stream()
             .map(e -> e.select("pubDate"))
             .flatMap(Collection::stream)
             .map(Element::text)
             .forEach(System.out::println);
    }
}

The code above currently prints the following (the RSS feed is updated continuously; the problem does not occur with a local file):

Wed, 22 Nov 2017 15:29:54 GMT
Wed, 22 Nov 2017 15:29:43 GMT
Wed, 22 Nov 2017 15:29:36 GMT
Wed, 22 Nov 2017 15:29:28 GMT
Wed, 22 Nov 2017 15:29:22 GMT
Wed, 22 Nov 2017 15:27:23 GMT
Tue, 21 Nov 2017 23:23:23 GMT
Tue, 21 Nov 2017 19:21:38 GMT
Tue, 21 Nov 2017 19:20:12 GMT
Tue, 21 Nov 2017 19:18:15 GMT
Tue, 21 Nov 2017 19:16:17 GMT
Tue, 21 Nov 2017 19:14:37 GMT
Tue, 21 Nov 2017 19:13:34 GMT
Tue, 21 Nov 2017 19:11:33 GMT
Tue, 21 Nov 2017 19:07:49 GMT
Tue, 21 Nov 2017 19:06:56 GMT
Tue, 21 Nov 2017 19:04:19 GMT
Tue, 21 Nov 2017 19:03:57 GMT
Tue, 21 Nov 2017 10:11:11 GMT
Tue, 21 Nov 2017 04:54:00 GMT
Tue, 21 Nov 2017 04:04:00 GMT</pubDate> Ubuntu Security Notice 3483-2 - USN-3483-1 fixed a vulnerability in procmail. This update provides the corresponding update for Ubuntu 12.04 ESM. Jakub Wilk discovered that the formail tool incorrectly handled certain malformed mail messages. An attacker could use this flaw to cause formail to crash, resulting in a denial of service, or possibly execute arbitrary code. Various other issues were also addressed.
Mon, 20 Nov 2017 22:22:00 GMT
Mon, 20 Nov 2017 16:16:00 GMT
Mon, 20 Nov 2017 16:15:00 GMT
Mon, 20 Nov 2017 16:14:00 GMT

This is the relevant fragment of the Document that Jsoup returned to me.

<item> 
 <title>Ubuntu Security Notice USN-3483-2</title> 
 <link>
  https://packetstormsecurity.com/files/145055/USN-3483-2.txt
 </link> 
 <guid isPermaLink="true">
  https://packetstormsecurity.com/files/145055/USN-3483-2.txt
 </guid> 
 <comments>
  https://packetstormsecurity.com/files/145055/Ubuntu-Security-Notice-USN-3483-2.html
 </comments> 
 <pubDate>
  Tue, 21 Nov 2017 04:04:00 GMT&lt;/pubDate&gt; <!-- the affected line -->
  <description>
   Ubuntu Security Notice 3483-2 - USN-3483-1 fixed a vulnerability in procmail. This update provides the corresponding update for Ubuntu 12.04 ESM. Jakub Wilk discovered that the formail tool incorrectly handled certain malformed mail messages. An attacker could use this flaw to cause formail to crash, resulting in a denial of service, or possibly execute arbitrary code. Various other issues were also addressed.
  </description> 
  <category></category> 
 </pubDate>
</item>

Here, some characters are parsed incorrectly, even though the XML served by the website is well-formed.
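As a rough way to spot affected entries programmatically, each pubDate text can be checked against the RFC 1123 date format that the feed uses. The sketch below relies only on the JDK; the class name PubDateCheck and its methods are hypothetical helpers for illustration, not part of Jsoup, and it assumes every healthy pubDate is a single RFC 1123 date.

```java
import java.time.ZonedDateTime;
import java.time.format.DateTimeFormatter;
import java.time.format.DateTimeParseException;
import java.util.Arrays;
import java.util.List;
import java.util.stream.Collectors;

// Hypothetical helper: flags pubDate texts that are not a clean RFC 1123 date.
public class PubDateCheck {

    // True when the whole text parses as one RFC 1123 date (e.g. "Tue, 21 Nov 2017 04:04:00 GMT").
    public static boolean isValidPubDate(final String text) {
        try {
            ZonedDateTime.parse(text.trim(), DateTimeFormatter.RFC_1123_DATE_TIME);
            return true;
        } catch (final DateTimeParseException e) {
            // Trailing text such as "</pubDate> Ubuntu Security Notice ..." ends up here.
            return false;
        }
    }

    // Returns only the entries that failed the check, in their original order.
    public static List<String> findSuspicious(final List<String> pubDates) {
        return pubDates.stream()
                       .filter(text -> !isValidPubDate(text))
                       .collect(Collectors.toList());
    }

    public static void main(final String[] args) {
        final List<String> texts = Arrays.asList(
                "Tue, 21 Nov 2017 04:54:00 GMT",
                "Tue, 21 Nov 2017 04:04:00 GMT</pubDate> Ubuntu Security Notice 3483-2",
                "Mon, 20 Nov 2017 22:22:00 GMT");
        findSuspicious(texts).forEach(System.out::println);
    }
}
```

In the MCVE, the strings produced by Element::text could be collected into a list and passed to findSuspicious, so that a swallowed closing tag shows up as a flagged entry instead of silently corrupting later select calls.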


When the same URL is used with a trailing slash (https://rss.packetstormsecurity.com/files/page18/), the problem does not occur on that page, but it does occur on other pages.

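Because the feed is live, the page on which the problem occurs changes over time. If page 18 stops showing the problem before this is resolved, I will update the question with a new page. The problem also does not occur when the file is downloaded separately and then parsed with Jsoup::parse.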

The Jsoup version is 1.11.2.

Additional MCVE

This MCVE shows that the problem only appears when the response is parsed by Jsoup; the XML that is actually downloaded is fine:

import org.jsoup.Connection;
import org.jsoup.Jsoup;

import java.io.IOException;

public class MCVE {
    public static void main(final String[] args) throws IOException {
        final Connection.Response response = Jsoup.connect("https://rss.packetstormsecurity.com/files/page18").execute();

        // Well formed XML
        System.out.println(response.body());

        // Malformed XML
        System.out.println(response.parse());
    }
}
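To double-check that the downloaded body really is well-formed, independently of Jsoup, the string can be run through the JDK's own XML parser. This is only a sketch under assumptions: the class name WellFormedCheck and the method pubDates are hypothetical, the input is assumed to be the string returned by response.body(), and DOCTYPE declarations are rejected on the assumption that the feed does not use one.

```java
import org.w3c.dom.Document;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;

import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import java.io.IOException;
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;

// Hypothetical cross-check: parses the feed with the JDK DOM parser instead of Jsoup.
public class WellFormedCheck {

    // Returns the text of every pubDate element; throws SAXException if the XML is not well-formed.
    public static List<String> pubDates(final String xml)
            throws ParserConfigurationException, SAXException, IOException {
        final DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
        // The feed is untrusted input, so refuse DOCTYPE declarations (assumption: the feed has none).
        factory.setFeature("http://apache.org/xml/features/disallow-doctype-decl", true);

        final Document document = factory.newDocumentBuilder()
                                         .parse(new InputSource(new StringReader(xml)));
        final NodeList nodes = document.getElementsByTagName("pubDate");

        final List<String> result = new ArrayList<>();
        for (int i = 0; i < nodes.getLength(); i++) {
            result.add(nodes.item(i).getTextContent().trim());
        }
        return result;
    }

    public static void main(final String[] args) throws Exception {
        // Stand-in for response.body(); a real run would pass the downloaded feed here.
        final String xml = "<rss><channel><item>"
                + "<pubDate>Tue, 21 Nov 2017 04:04:00 GMT</pubDate>"
                + "<description>Ubuntu Security Notice 3483-2</description>"
                + "</item></channel></rss>";
        pubDates(xml).forEach(System.out::println);
    }
}
```

If this parser accepts the body and returns clean dates while response.parse() still yields a pubDate that swallows its closing tag, the fault lies in how Jsoup reads the response rather than in the feed itself.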
Answer

This appears to be a bug in org.jsoup.helper.HttpConnection::get / org.jsoup.helper.HttpConnection.Response::parse. I have opened a corresponding GitHub issue and published a repo that reproduces the bug.

This will be fixed in Jsoup 1.11.3
