Jsoup doesn't fully fetch the raw HTML code

Posted 2023-03-05

【Title】: Jsoup doesn't fully fetch the raw html code 【Posted】: 2021-04-28 14:18:28 【Question】:

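I'm trying to fetch some lyrics from genius.com (I know they have an API; I'm doing it manually), but I don't seem to get the same HTML string every time. In fact, I put the code below in a for loop, and it only seems to work about 50% of the time.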

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import java.io.IOException;

public class Fetch_lyrics {
    public static void testing() {
        try {
            String urll = "https://genius.com/In-mourning-debris-lyrics";
            Document doc = Jsoup.connect(urll).maxBodySize(0).get();
            // first() returns null when the page contains no <p> element
            String text = doc.select("p").first().toString();
            System.out.println(text);
        } catch (IOException e) {
            e.printStackTrace();
        }
    }
}

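I printed the raw HTML from the doc variable, and about 50% of the time the raw HTML string doesn't contain the <p> element that holds the lyrics (I don't know whether it's called a class or something else). Thanks in advance.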

【Comments】:

Possibly related: Page content is loaded with javascript and Jsoup doesn't see it

【Answer 1】:

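It looks like the Genius site returns different content for new users. I got two different versions of the page: one on my first visit, and another when I cleared the cookies in my browser (Chrome) and visited again.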

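I suggest using two selectors to get the information you need, falling back to the second one when the first finds nothing.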

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

import java.io.IOException;

class Outer {

    public static void main(String[] args) {
        try {
            String urll = "https://genius.com/In-mourning-debris-lyrics";
            Document doc = Jsoup.connect(urll).maxBodySize(0).get();
            // First page variant: the lyrics are in the first <p> element.
            Element first = doc.selectFirst("p");
            if (first == null) {
                // Second page variant: the lyrics are in a div whose class
                // starts with "Lyrics__Container".
                first = doc.selectFirst("div[class^=Lyrics__Container]");
            }
            if (first != null) {
                System.out.println(first.text());
            }
        } catch (IOException e) {
            e.printStackTrace();
        }
    }
}
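
If the site turns out to serve more than two page variants, the same idea can be extended to a list of selectors tried in order. The following is only a minimal sketch under stated assumptions: `SelectorFallback` and `firstMatch` are hypothetical names (not part of Jsoup), and a plain map stands in for a parsed page so that the example runs without network access or the Jsoup dependency.

```java
import java.util.Map;
import java.util.Optional;
import java.util.function.Function;

public class SelectorFallback {

    /**
     * Tries each selector in order and returns the first non-null result.
     * `lookup` is any function that maps a selector to a match, or to null
     * when nothing matches (assumption: with Jsoup this could be
     * selector -> doc.selectFirst(selector)).
     */
    public static <T> Optional<T> firstMatch(Function<String, T> lookup, String... selectors) {
        for (String selector : selectors) {
            T match = lookup.apply(selector);
            if (match != null) {
                return Optional.of(match);
            }
        }
        return Optional.empty();
    }

    public static void main(String[] args) {
        // Hypothetical stand-in for a parsed page where only the second variant exists.
        Map<String, String> fakePage = Map.of("div[class^=Lyrics__Container]", "lyrics text");

        Optional<String> found = firstMatch(fakePage::get, "p", "div[class^=Lyrics__Container]");
        System.out.println(found.orElse("no lyrics container found"));
    }
}
```

With a real Jsoup document, the call would presumably look like `firstMatch(selector -> doc.selectFirst(selector), "p", "div[class^=Lyrics__Container]")`, since `selectFirst` returns null when nothing matches. Returning an `Optional` keeps the "neither variant matched" case explicit instead of risking a `NullPointerException`.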
