BeautifulSoup: a beautiful bowl of soup with a hidden pitfall


Four parsing helpers are commonly used in Python web scraping: regular expressions (re), etree XPath, Scrapy XPath, and BeautifulSoup. (etree XPath and Scrapy XPath differ considerably in usage, so they are counted separately.) This post covers a little-known BeautifulSoup pitfall. See the examples below.

Example 1 (the sample looks odd; pardon the Khmer text, which shows up here as question marks):

    from bs4 import BeautifulSoup

    content = """
    <html>
    <body>
    <div class="td-post-content td-pb-padding-side">
    <p>
    <img class="alignnone size-full wp-image-122426" data-recalc-dims="1" height="352" src="https://i2.wp.com/img.postnews.com.kh/2017/01/Anal-Itching.jpg?resize=630%2C352&amp;ssl=1" width="630"/>
    </p>
    <p>
    <img class="alignnone size-full wp-image-122427" data-recalc-dims="1" height="473" src="https://i1.wp.com/img.postnews.com.kh/2017/01/Anal-Itching1.jpg?resize=630%2C473&amp;ssl=1" width="630"/>
    </p>
    <p>
    ????????????????? ????????????????????????????? ????????????????????????????????????????????????????????????????? ????????????????????????? ??????????????????????????
    </p>
    <p>
    <img class="alignnone size-full wp-image-122427" data-recalc-dims="1" height="473" src="https://i1.wp.com/img.postnews.com.kh/2017/01/Anal-Itching1.jpg?resize=630%2C473&amp;ssl=1" width="630"/>
    </p>
    <p>
    <img class="alignnone size-full wp-image-122428" data-recalc-dims="1" height="473" src="https://i2.wp.com/img.postnews.com.kh/2017/01/Anal-Itching2.jpg?resize=630%2C473&amp;ssl=1" width="630"/>
    <br/>
    <em>
    <br/>
    ??????
    </em>
    ??????????????????????? ???????????? ?????????????????? ?????????????????????????????????????
    </p>
    </div>
    </body>
    </html>
    """

    # Naming the parser explicitly avoids the "no parser specified" warning.
    soup = BeautifulSoup(content, "html.parser")

    # Select only the img tags that carry a src attribute.
    inner_src_list = soup.find_all('img', src=True)
    for img in inner_src_list:
        # Manually put the entity back to compare with the raw value below.
        url = img["src"].replace("&ssl", "&amp;ssl")
        print(url)

    print(soup.prettify())
    # content = soup.prettify()  # the printed src values are the same either way

    img_tags = soup.find_all('img')
    for img in img_tags:
        print(img['src'])

The console output is as follows:

![](http://i2.51cto.com/images/blog/201810/19/f709eed65fc5ebf49e98cc7cb67e6b91.png?x-oss-process=image/watermark,size_16,text_QDUxQ1RP5Y2a5a6i,color_FFFFFF,t_100,g_se,x_10,y_10,shadow_90,type_ZmFuZ3poZW5naGVpdGk=)

![](http://i2.51cto.com/images/blog/201810/19/3bda9857b63335670b3dcac69903aa74.png?x-oss-process=image/watermark,size_16,text_QDUxQ1RP5Y2a5a6i,color_FFFFFF,t_100,g_se,x_10,y_10,shadow_90,type_ZmFuZ3poZW5naGVpdGk=)

![](http://i2.51cto.com/images/blog/201810/19/9e41161d11fb22a9f01ec2868e870ead.png?x-oss-process=image/watermark,size_16,text_QDUxQ1RP5Y2a5a6i,color_FFFFFF,t_100,g_se,x_10,y_10,shadow_90,type_ZmFuZ3poZW5naGVpdGk=)

How can this be? Where did the 'amp;' in the text go?

Explanation: when BeautifulSoup parses the markup, it automatically decodes the entity '&amp;' in attribute values to '&', so img['src'] returns the decoded URL, while prettify() escapes it again on output. When parsing web pages, what you see in the source is not always what you get back. This is not specific to BeautifulSoup: etree XPath and Scrapy XPath behave the same way.

Example 2 (same text as above):

    soup = BeautifulSoup(content, "html.parser")

    inner_src_list = soup.find_all('img', src=True)  # compare this call ...
    for img in inner_src_list:
        url = img["src"].replace("&ssl", "&amp;ssl")
        print(url)

    inner_src_list = soup.find_all('img', attr={'src': True})  # ... with this one
    for img in inner_src_list:
        url = img["src"].replace("&ssl", "&amp;ssl")
        print(url)

The output is not reproduced here; the behavior is simply this: the first loop prints normally, and the second loop prints nothing. Why?

Explanation: the first find_all treats src=True as "img tags that have a src attribute". In the second call the keyword is misspelled: the parameter that find_all defines is attrs, not attr. An unrecognized keyword such as attr is interpreted as a filter on an HTML attribute literally named attr, and no img tag in this document has one, so nothing matches. Written correctly as attrs={'src': True}, the call returns the same tags as src=True.

If anything above is incorrect, corrections are welcome. Thanks!
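The two pitfalls above can be reproduced in a few lines. The following is a minimal sketch rather than the author's code: it assumes the bs4 package is installed, uses the built-in html.parser, and swaps the article's markup for a shortened, hypothetical snippet with example.com URLs. html.escape is used here as one possible way to get the escaped form back when that is really what is needed.

```python
from html import escape

from bs4 import BeautifulSoup

# Hypothetical, shortened stand-in for the article's markup.
html_doc = (
    '<div>'
    '<img src="https://example.com/a.jpg?resize=630%2C352&amp;ssl=1" width="630"/>'
    '<img src="https://example.com/b.jpg?resize=630%2C473&amp;ssl=1" width="630"/>'
    '<img width="630"/>'  # no src attribute on purpose
    '</div>'
)

soup = BeautifulSoup(html_doc, "html.parser")

# Pitfall 1: attribute values come back with entities already decoded.
first = soup.find("img", src=True)
decoded = first["src"]                       # contains a bare "&"
re_escaped = escape(decoded, quote=False)    # "&" turned back into "&amp;"
serialized = str(first)                      # serialization escapes "&" again
print(decoded)
print(re_escaped)
print(serialized)

# Pitfall 2: the keyword is `attrs`; a misspelled `attr` is taken as a
# filter on an HTML attribute literally named "attr".
by_keyword = soup.find_all("img", src=True)
by_attrs = soup.find_all("img", attrs={"src": True})
misspelled = soup.find_all("img", attr={"src": True})
print(len(by_keyword), len(by_attrs), len(misspelled))
```

If the URL is going to be fetched, the decoded form with a bare '&' is the one to use; the escaped form only belongs inside HTML markup.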
