Common jsoup methods for web scraping, with an asynchronous request example

Posted 2020-09-05


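jsoup is a Java HTML parser. It can parse a URL, a file, or a string of HTML, and it lets you extract and manipulate data through DOM traversal, CSS selectors, and methods that resemble JavaScript and jQuery.

Main features of jsoup:

1. Parse HTML from a URL, a file, or a string

2. Find and extract data using DOM methods or CSS/jQuery-like selectors

3. Manipulate HTML elements, attributes, and text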

Parsing with jsoup:

The Jsoup class provides a set of static parse methods that return a Document object:

static Document parse(File in, String charsetName)

static Document parse(File in, String charsetName, String baseUri)

static Document parse(InputStream in, String charsetName, String baseUri)

static Document parse(String html)

static Document parse(String html, String baseUri)

static Document parse(URL url, int timeoutMillis)

static Document parseBodyFragment(String bodyHtml)

static Document parseBodyFragment(String bodyHtml, String baseUri)

Notes: 1. baseUri is the base URL against which relative URLs found in the document are resolved

2. charsetName is the character set used to decode the input
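
As a rough illustration of what baseUri does, the sketch below parses a string that contains a relative link and resolves it with absUrl. The class name, method names, and sample HTML are made up for this example; only the jsoup calls themselves come from the library.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

public class ParseSketch {
    // Parses a full document and resolves the first link against baseUri.
    static String resolveFirstLink(String html, String baseUri) {
        Document doc = Jsoup.parse(html, baseUri);
        Element link = doc.select("a").first();
        // absUrl returns an empty string when the attribute is missing
        // or cannot be turned into an absolute URL
        return link == null ? "" : link.absUrl("href");
    }

    // Parses a fragment; jsoup wraps it in an html/body shell,
    // so the fragment ends up inside body().
    static String fragmentBody(String fragment) {
        return Jsoup.parseBodyFragment(fragment).body().html();
    }

    public static void main(String[] args) {
        // prints http://www.example.com/docs/a.html
        System.out.println(resolveFirstLink("<a href='/docs/a.html'>A</a>", "http://www.example.com/"));
        System.out.println(fragmentBody("<p>hello</p>"));
    }
}
```

Without a baseUri, the same absUrl call has nothing to resolve against and returns an empty string, which is why the File and InputStream overloads accept one.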

***********************************************************************************************************************

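Connection provides methods for fetching the content of a web page; I mostly use it to scrape data from pages.

Note: a Connection is created with the static method Jsoup.connect(String url); the url must be http or https

Connection cookie(String name, String value) sets a cookie to send with the request
Connection data(Map<String,String> data) adds request parameters
Connection data(String... keyvals) adds request parameters as alternating key/value pairs
Document get() sends the request as a GET and parses the result
Document post() sends the request as a POST and parses the result
Connection userAgent(String userAgent) sets the User-Agent request header
Connection header(String name, String value) adds a request header
Connection referrer(String referrer) sets the Referer request header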

***********************************************************************************************************************

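jsoup provides JavaScript-like DOM methods for finding HTML elements

getElementById(String id) finds an element by id
getElementsByTag(String tag) finds elements by tag name
getElementsByClass(String className) finds elements by class name
getElementsByAttribute(String key) finds elements that have the given attribute

It also provides the following methods for getting sibling elements:

siblingElements(), firstElementSibling(), lastElementSibling(), nextElementSibling(), previousElementSibling()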
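
A small sketch of these lookup and sibling methods on an in-memory document. The class name and the sample HTML are invented for illustration.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

public class DomLookupSketch {
    // Hypothetical sample markup: a list with three items.
    static Document sample() {
        return Jsoup.parse("<ul id='menu'>"
                + "<li class='item'>one</li>"
                + "<li class='item' data-x='1'>two</li>"
                + "<li>three</li>"
                + "</ul>");
    }

    public static void main(String[] args) {
        Document doc = sample();
        Element menu = doc.getElementById("menu");

        System.out.println(menu.getElementsByTag("li").size());          // 3
        System.out.println(doc.getElementsByClass("item").size());       // 2
        System.out.println(doc.getElementsByAttribute("data-x").size()); // 1

        Element first = menu.child(0);
        System.out.println(first.nextElementSibling().text());  // two
        System.out.println(first.lastElementSibling().text());  // three
        System.out.println(first.siblingElements().size());     // 2
        // The first child has no previous sibling, so this is null
        System.out.println(first.previousElementSibling());
    }
}
```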

***********************************************************************************************************************

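Getting and setting element data in jsoup

attr(String key) gets an attribute value; attr(String key, String value) sets an attribute value
attributes() gets all attributes
id(), className(), classNames() get the id and class values
text() gets the text content
text(String value) sets the text content
html() gets the inner HTML
html(String value) sets the inner HTML
outerHtml() gets the outer HTML (the element's own tag plus its contents)
data() gets the data content (for example, the contents of a script or style tag)
tag() gets the Tag object; tagName() gets the tag name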
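
The difference between text(), html(), outerHtml() and data() is easiest to see side by side. The following is a minimal sketch; the class name and sample HTML are assumptions made for the example.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

public class ElementDataSketch {
    // Hypothetical sample markup: a div with a link, followed by a script.
    static Document sample() {
        return Jsoup.parse("<div id='box' class='card big'><a href='/x'>Go <b>now</b></a></div>"
                + "<script>var n = 1;</script>");
    }

    public static void main(String[] args) {
        Document doc = sample();
        Element box = doc.getElementById("box");
        Element link = box.getElementsByTag("a").first();

        System.out.println(box.id());           // box
        System.out.println(box.className());    // card big
        System.out.println(box.classNames());   // a set containing "card" and "big"
        System.out.println(link.attr("href"));  // /x
        System.out.println(link.text());        // Go now
        System.out.println(link.html());        // inner HTML: the contents of <a>
        System.out.println(link.outerHtml());   // outer HTML: includes the <a> tag itself
        System.out.println(link.tagName());     // a
        // script content is data, not text
        System.out.println(doc.getElementsByTag("script").first().data()); // var n = 1;

        // The setters return the element, so they can be chained
        link.attr("href", "/y").text("Stop");
        System.out.println(link.attr("href"));  // /y
        System.out.println(link.text());        // Stop
    }
}
```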

***********************************************************************************************************************

Manipulating HTML elements:

append(String html), prepend(String html)
appendText(String text), prependText(String text)
appendElement(String tagName), prependElement(String tagName)
html(String value)
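
A short sketch of how these manipulation methods change an element's children. The class name, method name, and starting HTML are invented for illustration.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Element;

public class ManipulateSketch {
    // Starts from <div id='box'><p>middle</p></div> and edits it in place.
    static Element build() {
        Element box = Jsoup.parseBodyFragment("<div id='box'><p>middle</p></div>")
                .getElementById("box");
        box.prepend("<h2>title</h2>");              // HTML inserted before existing children
        box.append("<p>end</p>");                   // HTML inserted after existing children
        box.prependText("head");                    // plain text node at the start
        box.appendText("tail");                     // plain text node at the end
        box.appendElement("span").text("new");      // new empty element, then set its text
        box.prependElement("em").text("first");     // new element at the very start
        return box;
    }

    public static void main(String[] args) {
        Element box = build();
        System.out.println(box.children().size());        // 5 element children
        System.out.println(box.child(0).tagName());       // em
        System.out.println(box.outerHtml());

        // html(String) replaces everything inside the element
        box.html("<i>replaced</i>");
        System.out.println(box.children().size());        // 1
    }
}
```

Note that appendText and prependText add text nodes, not elements, so they do not change children().size().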

***********************************************************************************************************************

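jsoup also provides jQuery-like selectors

Use selectors to find data

tagname    finds elements by tag name, e.g. a
ns|tag    finds elements by tag in a namespace, e.g. fb|name finds <fb:name> elements
#id    finds elements by id, e.g. #logo
.class    finds elements by class name, e.g. .head

*    matches all elements

[attribute] finds elements that have the attribute, e.g. [href] selects every element with an href attribute
[^attr] finds elements with an attribute name prefix, e.g. [^data-] finds elements with HTML5 dataset attributes
[attr=value] finds elements by attribute value, e.g. [width=500] selects every element whose width attribute is 500
[attr^=value], [attr$=value], [attr*=value] find elements whose attribute value starts with, ends with, or contains value
[attr~=regex] filters attribute values with a regular expression, e.g. img[src~=(?i)\.(png|jpe?g)]
These are the basic selector forms; they can also be combined

Combinations
el#id    finds an element of a given tag with a given id, e.g. a#logo -> <a id=logo href= … >
el.class finds elements of a given tag with a given class, e.g. div.head -> <div class="head">xxxx</div>
el[attr] finds elements of a given tag that have the attribute, e.g. a[href]
Any combination of the three above    e.g. a[href]#logo, a[name].outerlink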
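
To make the selector forms above concrete, the sketch below counts matches against a small document. The class name, the count helper, and the sample HTML are assumptions for this example; the selector strings follow the syntax listed above.

```java
import org.jsoup.Jsoup;

public class SelectorSketch {
    // Hypothetical sample markup with two links and two images.
    static final String HTML =
            "<a id='logo' href='/'>Home</a>"
            + "<div class='head'><img src='a.PNG' width='500' data-id='7'><img src='b.gif'></div>"
            + "<a name='out' class='outerlink' href='http://example.org/x.pdf'>PDF</a>";

    // Returns how many elements in the sample match the CSS query.
    static int count(String css) {
        return Jsoup.parse(HTML).select(css).size();
    }

    public static void main(String[] args) {
        System.out.println(count("a"));              // 2
        System.out.println(count("#logo"));          // 1
        System.out.println(count(".head"));          // 1
        System.out.println(count("[href]"));         // 2
        System.out.println(count("[^data-]"));       // 1
        System.out.println(count("[width=500]"));    // 1
        System.out.println(count("[href^=http]"));   // 1
        System.out.println(count("[href$=.pdf]"));   // 1
        System.out.println(count("[src*=gif]"));     // 1
        // (?i) makes the regex case-insensitive, so a.PNG matches
        System.out.println(count("img[src~=(?i)\\.(png|jpe?g)]")); // 1
        System.out.println(count("a[href]#logo"));       // 1
        System.out.println(count("a[name].outerlink"));  // 1
    }
}
```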

***********************************************************************************************************************
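Besides the basic syntax and its combinations, jsoup also supports pseudo-selectors for filtering elements

:lt(n)    elements whose sibling index is less than n (zero-based), e.g. td:lt(3) selects the first three cells of a row
:gt(n)    elements whose sibling index is greater than n, e.g. div p:gt(2) selects p elements inside a div whose sibling index is greater than 2
:eq(n)    elements whose sibling index equals n, e.g. form input:eq(1) selects an input inside a form whose sibling index is 1
:has(selector)    elements that contain a match for the selector, e.g. div:has(p) selects div elements that contain a p element
:not(selector)    elements that do not match the selector, e.g. div:not(.logo) selects all div elements that do not have class="logo"
:contains(text)    elements that contain the given text, case-insensitive, e.g. p:contains(oschina)
:containsOwn(text)    elements whose own text (not the text of their descendants) contains the given text
:matches(regex)    elements whose text matches the regular expression, e.g. div:matches((?i)login)

:matchesOwn(regex)    elements whose own text matches the regular expression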
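
The index-based and text-based pseudo-selectors are easy to misread, so here is a sketch that runs each one against a small document. The class name, helper, and sample HTML are made up for this example.

```java
import org.jsoup.Jsoup;

public class PseudoSelectorSketch {
    // Hypothetical sample: one table row with five cells, plus two divs.
    static final String HTML =
            "<table><tr><td>a</td><td>b</td><td>c</td><td>d</td><td>e</td></tr></table>"
            + "<div class='logo'><p>Login here</p></div>"
            + "<div id='plain'>own text <span>oschina</span></div>";

    // Returns how many elements in the sample match the CSS query.
    static int count(String css) {
        return Jsoup.parse(HTML).select(css).size();
    }

    public static void main(String[] args) {
        System.out.println(count("td:lt(3)"));   // 3: sibling indexes 0, 1, 2
        System.out.println(count("td:gt(2)"));   // 2: sibling indexes 3, 4
        // :eq(1) is the second cell because indexes start at 0
        System.out.println(Jsoup.parse(HTML).select("td:eq(1)").text()); // b

        System.out.println(count("div:has(p)"));       // 1: the .logo div
        System.out.println(count("div:not(.logo)"));   // 1: the #plain div

        // :contains looks at descendants too and ignores case
        System.out.println(count("div:contains(OSCHINA)"));     // 1
        // :containsOwn only looks at the element's own text
        System.out.println(count("div:containsOwn(oschina)"));  // 0
        System.out.println(count("span:containsOwn(oschina)")); // 1

        System.out.println(count("p:matches((?i)login)"));  // 1
        System.out.println(count("div:matchesOwn(own)"));   // 1
    }
}
```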

***********************************************************************************************************************

Using jsoup

// A URL as the input source

Document doc = Jsoup.connect("http://www.example.com").timeout(60000).get();

// A File as the input source

File input = new File("/tmp/input.html");
Document doc = Jsoup.parse(input, "UTF-8", "http://www.example.com/");

// A String as the input source

Document doc = Jsoup.parse(htmlStr);


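For example:

Document doc = Jsoup.connect("http://example.com")
        .data("key1", "value1") // parameters sent with the asynchronous request; several can be added
        .data("key2", "value2")
        .userAgent("Mozilla")
        .cookie("cookie1", "cookieValue1") // several cookies can be sent
        .cookie("cookie2", "cookieValue2")
        .timeout(3000) // timeout in milliseconds
        .post(); // request method; use .get() instead for a GET request

Note: when reproducing an asynchronous request, be sure to work out exactly which cookies it needs. You can inspect the cookies in the Application tab of Chrome's DevTools (F12).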
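
One way to reproduce a page's asynchronous (AJAX) call is sketched below, under several assumptions: the endpoint URL, cookie name, and form field are hypothetical; the X-Requested-With header is a common convention that some servers check, not something jsoup requires; and the response is assumed to be JSON or other non-HTML content, which is why ignoreContentType(true) and execute().body() are used instead of post(). The jsoup call itself still blocks until the response arrives.

```java
import java.io.IOException;
import java.util.Collections;
import java.util.Map;

import org.jsoup.Connection;
import org.jsoup.Jsoup;

public class AjaxStyleRequestSketch {
    // Prepares the request without sending it.
    static Connection build(String url, Map<String, String> cookies, Map<String, String> form) {
        return Jsoup.connect(url)
                .cookies(cookies)                                // cookies copied from the browser session
                .data(form)                                      // form parameters of the AJAX call
                .userAgent("Mozilla/5.0")
                .header("X-Requested-With", "XMLHttpRequest")    // assumption: server expects this header
                .ignoreContentType(true)                         // accept JSON and other non-HTML responses
                .timeout(3000)                                   // milliseconds
                .method(Connection.Method.POST);
    }

    // Sends the prepared request and returns the raw response body.
    static String fetchBody(Connection conn) throws IOException {
        return conn.execute().body();
    }

    public static void main(String[] args) throws IOException {
        // Hypothetical endpoint and values, for illustration only
        Connection conn = build("http://example.com/api/list",
                Collections.singletonMap("sid", "abc"),
                Collections.singletonMap("page", "1"));
        System.out.println(conn.request().method());   // POST
        System.out.println(conn.request().cookies());
        // Pass any argument to actually send the request
        if (args.length > 0) {
            System.out.println(fetchBody(conn));
        }
    }
}
```

Keeping the request preparation separate from execute() makes it possible to compare the cookies and headers with what the browser's network panel shows before anything is sent.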
