Learning Web Scraping: XPath

Posted 2021-01-07 php-linux

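1. Using XPath
XPath offers several ways to locate an element (slightly more than CSS selectors do):
a. Locate an element by absolute path (not recommended!)
WebElement ele = driver.findElement(By.xpath("/html/body/div/form/input"));
b. Locate an element by relative path
WebElement ele = driver.findElement(By.xpath("//input"));
c. Locate an element by index
WebElement ele = driver.findElement(By.xpath("//input[4]")); // an input that is the 4th input child of its parent; indexes start at 1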
d. Locate an element by XPath and attribute value
WebElement ele = driver.findElement(By.xpath("//input[@id='fuck']"));
// Predicates can also be chained or combined with and/or
WebElement ele = driver.findElement(By.xpath("//input[@type='submit'][@name='fuck']")); // both conditions must hold
WebElement ele = driver.findElement(By.xpath("//input[@type='submit' and @name='fuck']")); // both conditions must hold
WebElement ele = driver.findElement(By.xpath("//input[@type='submit' or @name='fuck']")); // either condition is enough
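e. Locate an element by XPath and attribute name
   Common attributes: @id, @name, @type, @class, @title
// Matches every input element that has a type attribute; findElement returns the first match
WebElement ele = driver.findElement(By.xpath("//input[@type]"));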
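f. Partial attribute value matching
WebElement ele = driver.findElement(By.xpath("//input[starts-with(@id,'fuck')]")); // matches ids that start with "fuck", e.g. id='fuckyou'
// ends-with() is an XPath 2.0 function and is not available in the XPath 1.0 that browsers implement, so the suffix test is written with substring()
WebElement ele = driver.findElement(By.xpath("//input[substring(@id, string-length(@id) - string-length('fuck') + 1) = 'fuck']")); // matches ids that end with "fuck", e.g. id='youfuck'
WebElement ele = driver.findElement(By.xpath("//input[contains(@id,'fuck')]")); // matches ids that contain "fuck", e.g. id='youfuckyou'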
g. Match any attribute with a wildcard
WebElement ele = driver.findElement(By.xpath("//input[@*='fuck']")); // matches input elements that have any attribute whose value is "fuck"
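
The attribute predicates above can be tried without a browser. The following is a minimal sketch that assumes the JDK's built-in XPath 1.0 engine (javax.xml.xpath) in place of Selenium; the form markup, the class name AttributeXPathDemo and the helpers count and endsWith are hypothetical and exist only for illustration. It also shows one way to emulate ends-with, which XPath 1.0 does not provide.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

final class AttributeXPathDemo {
    // Hypothetical form markup, written as well-formed XML so the JDK parser accepts it.
    static final String FORM =
        "<form>"
        + "<input id='userName' type='text' name='user'/>"
        + "<input id='loginUser' type='text' name='alias'/>"
        + "<input id='go' type='submit' name='login'/>"
        + "<input type='hidden' value='userName'/>"
        + "</form>";

    /** Returns how many nodes the XPath 1.0 expression selects in the given XML. */
    static int count(String xml, String expression) {
        try {
            Document doc = DocumentBuilderFactory.newInstance()
                .newDocumentBuilder()
                .parse(new InputSource(new StringReader(xml)));
            NodeList nodes = (NodeList) XPathFactory.newInstance().newXPath()
                .evaluate(expression, doc, XPathConstants.NODESET);
            return nodes.getLength();
        } catch (Exception e) {
            throw new IllegalStateException("cannot evaluate " + expression, e);
        }
    }

    /** Builds an XPath 1.0 predicate that behaves like ends-with(@attr, suffix). */
    static String endsWith(String attr, String suffix) {
        // The suffix is pasted into the expression, so it must not contain a single quote.
        return "substring(@" + attr + ", string-length(@" + attr + ") - "
            + suffix.length() + " + 1) = '" + suffix + "'";
    }

    public static void main(String[] args) {
        String[] expressions = {
            "//input[@type='text'][@name='user']",
            "//input[@type='submit' or @name='user']",
            "//input[@id]",
            "//input[starts-with(@id,'user')]",
            "//input[contains(@id,'ser')]",
            "//input[" + endsWith("id", "User") + "]",
            "//input[@*='userName']",
        };
        // Print the number of matches next to each expression.
        for (String expression : expressions) {
            System.out.println(count(FORM, expression) + "  " + expression);
        }
    }
}
```

In this sample, starts-with(@id,'user') matches only userName, the emulated ends-with matches only loginUser, and @*='userName' matches two inputs, because one carries that value in id and another in value.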
Summary of element locators

// Note: this series covers the Java bindings only; each By method takes a String argument, omitted below
// By id
WebElement ele = driver.findElement(By.id());
// By name
WebElement ele = driver.findElement(By.name());
// By className
WebElement ele = driver.findElement(By.className());
// By tagName
WebElement ele = driver.findElement(By.tagName());
// By linkText
WebElement ele = driver.findElement(By.linkText());
// By partialLinkText
WebElement ele = driver.findElement(By.partialLinkText()); // locates a link by part of its text
// By cssSelector
WebElement ele = driver.findElement(By.cssSelector());
// By XPath
WebElement ele = driver.findElement(By.xpath());

=================================Examples=====================================

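1. @id: matches on the value of the id attribute

2. starts-with: matches an attribute value by its leading characters (fuzzy match)

3. contains: matches an attribute value that contains the given substring (fuzzy match)

4. text(): locates an element by its text

5. last(): locates an element by its position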

<input id="su" class="bg s_btn btnhover" value="百度一下" type="submit"/>
//*[@id='su']      selects the element whose id attribute is 'su'
or
//input[contains(@class,'bg s_btn')]
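
Because contains(@class, ...) is a plain substring test, it depends on the order of the class names and can also match longer class names. The sketch below illustrates this with the JDK's XPath 1.0 engine rather than Selenium; the second input element, the class ClassMatchDemo and the helpers ids and hasClass are hypothetical, added only to show a false positive and one common token-based alternative.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

final class ClassMatchDemo {
    // The first input is modeled on the button above; the second is made up for the demo.
    static final String PAGE =
        "<div>"
        + "<input id='su' class='bg s_btn btnhover' value='Search' type='submit'/>"
        + "<input id='other' class='big s_btn_wide' type='button'/>"
        + "</div>";

    /** Returns the id attribute of every element the expression selects, in document order. */
    static List<String> ids(String xml, String expression) {
        try {
            Document doc = DocumentBuilderFactory.newInstance()
                .newDocumentBuilder()
                .parse(new InputSource(new StringReader(xml)));
            NodeList nodes = (NodeList) XPathFactory.newInstance().newXPath()
                .evaluate(expression, doc, XPathConstants.NODESET);
            List<String> ids = new ArrayList<>();
            for (int i = 0; i < nodes.getLength(); i++) {
                ids.add(((Element) nodes.item(i)).getAttribute("id"));
            }
            return ids;
        } catch (Exception e) {
            throw new IllegalStateException("cannot evaluate " + expression, e);
        }
    }

    /** Builds a predicate that is true when @class contains the token as a whole word. */
    static String hasClass(String token) {
        // Padding both sides with spaces lets the token match only between separators.
        return "contains(concat(' ', normalize-space(@class), ' '), ' " + token + " ')";
    }

    public static void main(String[] args) {
        System.out.println(ids(PAGE, "//*[@id='su']"));
        System.out.println(ids(PAGE, "//input[contains(@class,'bg s_btn')]"));
        System.out.println(ids(PAGE, "//input[contains(@class,'s_btn')]"));
        System.out.println(ids(PAGE, "//input[" + hasClass("s_btn") + "]"));
    }
}
```

Here contains(@class,'s_btn') also selects the made-up element whose class is s_btn_wide, while the whole-word predicate selects only the su button.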

<a class="lb" href="https://passport.baidu.com/v2/?login&tpl=mn&u=http%3A%2F%2Fwww.baidu.com%2F" name="tj_login" onclick="return false;">登录</a>
//a[starts-with(@name,'tj_lo')]     fuzzy match on an attribute
//a[contains(@name,'tj_lo')]     fuzzy match on an attribute

<a href="http://www.baidu.com">百度搜索</a>
//a[text()='百度搜索']
or
//a[contains(text(),"搜索")]    -- fuzzy match on text
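
Exact text matching is sensitive to surrounding whitespace, which is easy to overlook. The sketch below compares text()=, contains(text(), ...) and normalize-space(.) using the JDK's XPath 1.0 engine instead of Selenium. The markup uses ASCII link text and invented href values, and the class TextMatchDemo and its helper hrefs are hypothetical.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

final class TextMatchDemo {
    // Hypothetical links; the second one has padding whitespace around its text.
    static final String PAGE =
        "<div>"
        + "<a href='http://www.baidu.com'>Baidu Search</a>"
        + "<a href='/help'>  Search help\n</a>"
        + "<a href='/about'>About</a>"
        + "</div>";

    /** Returns the href attribute of every element the expression selects, in document order. */
    static List<String> hrefs(String xml, String expression) {
        try {
            Document doc = DocumentBuilderFactory.newInstance()
                .newDocumentBuilder()
                .parse(new InputSource(new StringReader(xml)));
            NodeList nodes = (NodeList) XPathFactory.newInstance().newXPath()
                .evaluate(expression, doc, XPathConstants.NODESET);
            List<String> hrefs = new ArrayList<>();
            for (int i = 0; i < nodes.getLength(); i++) {
                hrefs.add(((Element) nodes.item(i)).getAttribute("href"));
            }
            return hrefs;
        } catch (Exception e) {
            throw new IllegalStateException("cannot evaluate " + expression, e);
        }
    }

    public static void main(String[] args) {
        // Exact match on the text node.
        System.out.println(hrefs(PAGE, "//a[text()='Baidu Search']"));
        // Substring match on the text node.
        System.out.println(hrefs(PAGE, "//a[contains(text(),'Search')]"));
        // Exact match fails when the text is padded with whitespace.
        System.out.println(hrefs(PAGE, "//a[text()='Search help']"));
        // normalize-space trims and collapses whitespace before comparing.
        System.out.println(hrefs(PAGE, "//a[normalize-space(.)='Search help']"));
    }
}
```

With this sample the third expression selects nothing, while the normalize-space variant finds the /help link.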

<a id="setf" href="//www.baidu.com/cache/sethelp/help.html" onmousedown="return ns_c({'fm':'behs','tab':'favorites','pos':0})" target="_blank">把百度设为主页</a>

//a[text()='把百度设为主页']

/A/B/C[last()]   selects the last C child of the B element under the root A (one C per B if there are several B elements)
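
Positional predicates such as [last()] or [4] apply within each parent, not across the whole document, unless the path is wrapped in parentheses first. The sketch below shows the difference using the JDK's XPath 1.0 engine; the sample document, the class PositionDemo and the helper ids are hypothetical.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

final class PositionDemo {
    // Hypothetical document with two B elements, each holding several C children.
    static final String DOC =
        "<A>"
        + "<B><C id='c1'/><C id='c2'/></B>"
        + "<B><C id='c3'/><C id='c4'/><C id='c5'/></B>"
        + "</A>";

    /** Returns the id attribute of every element the expression selects, in document order. */
    static List<String> ids(String xml, String expression) {
        try {
            Document doc = DocumentBuilderFactory.newInstance()
                .newDocumentBuilder()
                .parse(new InputSource(new StringReader(xml)));
            NodeList nodes = (NodeList) XPathFactory.newInstance().newXPath()
                .evaluate(expression, doc, XPathConstants.NODESET);
            List<String> ids = new ArrayList<>();
            for (int i = 0; i < nodes.getLength(); i++) {
                ids.add(((Element) nodes.item(i)).getAttribute("id"));
            }
            return ids;
        } catch (Exception e) {
            throw new IllegalStateException("cannot evaluate " + expression, e);
        }
    }

    public static void main(String[] args) {
        // Last C within each B.
        System.out.println(ids(DOC, "/A/B/C[last()]"));
        // Last C in the whole result set.
        System.out.println(ids(DOC, "(/A/B/C)[last()]"));
        // Second C within each parent.
        System.out.println(ids(DOC, "//C[2]"));
        // Second C in the whole document.
        System.out.println(ids(DOC, "(//C)[2]"));
    }
}
```

For this document /A/B/C[last()] selects c2 and c5, whereas (/A/B/C)[last()] selects only c5. The same distinction applies to //input[4] and (//input)[4].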
