Help scraping HTML with JSoup

Posted 2023-03-12

【Title】: Help scraping HTML with JSoup 【Posted】: 2011-10-21 22:46:18 【Question】:

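A bit of a beginner here. I'm working on a personal project to lay out my school's course offerings in an easy-to-read table, but I'm stuck on the first step: scraping the data from the website.

I just added the JSoup library to my project in Eclipse, and now I can't initialize a connection while following Jsoup's documentation.

Eventually I want to grab each class's name, time, and description, but for now I just want the names. The source site's HTML looks like this: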

<td class='CourseNum'><img src='images/minus.gif' class='ICS3330 SW' onclick="toggledetails('CS3330')

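My first guess is to call getElementsByTag("td"), then query those elements for either the onclick= argument or the value of the 'class' attribute, and clean it up by stripping the leading "I" and the trailing " SW", leaving the name "CS3330".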
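The cleanup step described above can be sketched in plain Java. This is only an illustration: the class and method names are made up here, and it assumes every class value follows the single pattern visible in the snippet (a leading "I", the course code, then " SW").

```java
// Hypothetical helper, not part of the original post.
class CourseNameCleaner {

    // Turns an img class value such as "ICS3330 SW" into "CS3330".
    static String clean(String classValue) {
        String s = classValue.trim();
        // Drop the trailing "SW" marker (and the space before it) if present.
        if (s.endsWith("SW")) {
            s = s.substring(0, s.length() - 2).trim();
        }
        // Drop the leading "I" prefix if present.
        if (s.startsWith("I")) {
            s = s.substring(1);
        }
        return s;
    }

    public static void main(String[] args) {
        System.out.println(clean("ICS3330 SW"));
    }
}
```

If the real page uses other prefixes or suffixes, the two checks would need adjusting.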

Now on to the actual implementation:

Document doc = Jsoup.parse("UTF-8", "http://rabi.phys.virginia.edu/mySIS/CS2/page.php?Semester=1118&Type=Group&Group=CompSci").get();
Elements td = doc.getElementsByTag("td");

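At this point I'm already running into problems (even though I haven't strayed far from the examples in the documentation), and I'd appreciate some guidance on getting my code working!

Edit: Got it! Thanks, everyone!

【Comments】:

What is the problem? If there is an error, please post the actual error message in your original post above.
Just downloaded the JSoup library and tried it on your site. It works like a charm! Very cool! 1+

【Answer 1】:

According to the documentation, you should do this: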

Document doc = Jsoup.connect(url).get();

The parse() method is for files.
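For context, here is a minimal sketch of how that call might sit in a small class. It assumes the jsoup jar is on the classpath; the class and method names are invented for illustration. The `parseInline` helper uses `Jsoup.parse(String)`, which parses HTML already held in memory, so the element lookup can be tried without a network connection.

```java
import java.io.IOException;

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document; // jsoup's Document, not the Swing class of the same name
import org.jsoup.select.Elements;

// Hypothetical wrapper class, not part of the original answer.
class FetchSketch {

    // Fetches the page over HTTP and parses it, as the answer suggests.
    static Document fetch(String url) throws IOException {
        return Jsoup.connect(url).get();
    }

    // Parses an HTML string that is already in memory (no network needed).
    static Document parseInline(String html) {
        return Jsoup.parse(html);
    }

    // Returns every <td> element in the document.
    static Elements cells(Document doc) {
        return doc.getElementsByTag("td");
    }

    public static void main(String[] args) {
        // Made-up sample markup for trying the lookup offline.
        String html = "<table><tr>"
                + "<td class='CourseNum'>CS 3330</td>"
                + "<td class='CourseName'>Computer Architecture</td>"
                + "</tr></table>";
        Elements td = cells(parseInline(html));
        System.out.println(td.size());
        System.out.println(td.get(0).text());
    }
}
```

Against the live site, `fetch(url)` would replace `parseInline(html)`.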

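【Comments】:

Type mismatch: cannot convert from org.jsoup.nodes.Document to javax.swing.text.Document
@asolanki: That's a bug in your code: you're trying to use javax.swing.text.Document rather than org.jsoup.nodes.Document. In other words, don't use the Swing Document; use the Document class that ships with JSoup. Again, Vlad is correct, and I suggest you up-vote him as well.
Fixed it. It seems I imported Document from the wrong package. Damn IDE, making things look simpler while muddling my understanding!
Exactly. It really is that simple; then just use Firebug to see which ids and classes to read or manipulate.

【Answer 2】:

I just downloaded JSoup and tried it on your school's website, and got this output: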

Unit: Computer Science
   CS 1010: Introduction to Information Technology
   CS 1110: Introduction to Programming
   CS 1111: Introduction to Programming
   CS 1112: Introduction to Programming
   CS 1120: From Ada and Euclid to Quantum Computing and the World Wide Web
   CS 2102: Discrete Mathematics I
   CS 2110: Software Development Methods
   CS 2150: Program and Data Representation
   CS 2220: Engineering Software
   CS 2330: Digital Logic Design
   CS 2501: Special Topics in Computer Science
   CS 3102: Theory of Computation
   CS 3330: Computer Architecture
   CS 4102: Algorithms
   CS 4240: Principles of Software Design
   CS 4414: Operating Systems
   CS 4444: Introduction to Parallel Computing
   CS 4457: Computer Networks
   CS 4501: Special Topics in Computer Science
   CS 4753: Electronic Commerce Technologies
   CS 4810: Introduction to Computer Graphics
   CS 4993: Independent Study
   CS 4998: Distinguished BA Majors Research
   CS 6161: Design and Analysis of Algorithms
   CS 6190: Computer Science Perspectives
   CS 6354: Computer Architecture
   CS 6444: Introduction to Parallel Computing
   CS 6501: Special Topics in Computer Science
   CS 6610: Programming Languages
   CS 7457: Computer Networks
   CS 7993: Independent Study
   CS 7995: Supervised Project Research
   CS 8501: Special Topics in Computer Science
   CS 8524: Topics in Software Engineering
   CS 8897: Graduate Teaching Instruction
   CS 8999: Thesis
   CS 9999: Dissertation

Very cool! Vlad is right: use the connect(...) method. 1+ to Vlad.

Other suggestions and tips: these are the constants I used in my little program:

   private static final String URL = "http://rabi.phys.virginia.edu/mySIS/CS2/" +
        "page.php?Semester=1118&Type=Group&Group=CompSci";
   private static final String TD_TAG = "td";
   private static final String CLASS_ATTRIB = "class";
   private static final String CLASS_ATTRIB_UNIT_NAME = "UnitName";
   private static final String CLASS_ATTRIB_COURSE_NUM = "CourseNum";
   private static final String CLASS_ATTRIB_COURSE_NAME = "CourseName";

These are the variables I used in my scraping method:

     String unitName = "";
     List<String> courseNumbNameList = new ArrayList<String>();
     String courseNumbName = "";

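Edit 1: Based on your recent comments, I think you're overthinking this a bit. What worked for me was this simple algorithm:

1. Create the 3 variables listed above.
2. Get the Document as Vlad suggests.
3. Create a td Elements variable and assign to it all elements that have a td tag.
4. Use a for loop with int i starting at 0, calling td.get(i) to fetch each Element, element.
5. Inside the loop, check element's class attribute.
6. If the attribute String equals the CLASS_ATTRIB_UNIT_NAME String (see above), get the element's text and use it to set the unitName variable.
7. If the attribute String equals CLASS_ATTRIB_COURSE_NUM, set courseNumbName to the element's text.
8. If the attribute String equals CLASS_ATTRIB_COURSE_NAME, append the element's text to the courseNumbName String, add the String to the array list, and set courseNumbName = "".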
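The steps above can be sketched as follows. To keep the example runnable without the library, a hypothetical `Cell` class stands in for a jsoup td Element; with jsoup, its two fields would presumably come from `element.attr("class")` and `element.text()`. The `Result` holder and the ": " separator are also assumptions, the latter inferred from the sample output.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch of the loop; not the answerer's actual program.
class CourseListSketch {
    static final String CLASS_ATTRIB_UNIT_NAME = "UnitName";
    static final String CLASS_ATTRIB_COURSE_NUM = "CourseNum";
    static final String CLASS_ATTRIB_COURSE_NAME = "CourseName";

    // Stand-in for a <td> element: its class attribute and its text.
    static final class Cell {
        final String cssClass;
        final String text;
        Cell(String cssClass, String text) {
            this.cssClass = cssClass;
            this.text = text;
        }
    }

    // Holds the unit name and the collected "number: name" strings.
    static final class Result {
        String unitName = "";
        final List<String> courseNumbNameList = new ArrayList<String>();
    }

    static Result scrape(List<Cell> td) {
        Result result = new Result();
        String courseNumbName = "";
        for (int i = 0; i < td.size(); i++) {
            Cell element = td.get(i);
            String attrib = element.cssClass;
            if (CLASS_ATTRIB_UNIT_NAME.equals(attrib)) {
                result.unitName = element.text;
            } else if (CLASS_ATTRIB_COURSE_NUM.equals(attrib)) {
                courseNumbName = element.text;
            } else if (CLASS_ATTRIB_COURSE_NAME.equals(attrib)) {
                // Finish the entry, store it, and reset for the next course.
                courseNumbName += ": " + element.text;
                result.courseNumbNameList.add(courseNumbName);
                courseNumbName = "";
            }
        }
        return result;
    }

    public static void main(String[] args) {
        List<Cell> td = new ArrayList<Cell>();
        td.add(new Cell("UnitName", "Computer Science"));
        td.add(new Cell("CourseNum", "CS 1010"));
        td.add(new Cell("CourseName", "Introduction to Information Technology"));
        td.add(new Cell("CourseNum", "CS 3330"));
        td.add(new Cell("CourseName", "Computer Architecture"));
        Result r = scrape(td);
        System.out.println("Unit: " + r.unitName);
        for (String line : r.courseNumbNameList) {
            System.out.println("   " + line);
        }
    }
}
```

Cells whose class matches none of the three constants are simply skipped.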

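【Comments】:

This is very much what I wanted! Here is where I am now: Document doc = Jsoup.connect(URL).get(); Elements table = doc.getElementsByTag(TD_TAG); Now, say the HTML in question is: Introduction to Information Technology Any pointers on A) targeting that table element and B) cleaning up the innerHTML so it becomes a readable CSXXXX?
@asolanki: Post this code as an edit to your original question rather than in a comment, since it doesn't format well in comments.
@asolanki: Please see Edit 1 in my answer above.
Thanks so much for your patience; it looks like the task at hand is nearly done. I've re-edited the OP with what I hope is my last question.
@asolanki: Congratulations on making progress. Again, don't forget to up-vote Vlad for his contribution.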
