How to scrape only the text content inside multiple divs [duplicate]

Posted 2023-02-23

Title: How to scrape only textual content inside multiple div [duplicate]
Asked: 2016-01-02 06:14:58
Question:

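I need to scrape only the text content that appears under the References h3 heading at the URL. I am trying the code below, but I cannot get the text in the same order in which it appears on the HTML page.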

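    # `tree` is assumed to be the lxml tree parsed from the page
    testo_reference = []
    i = 43
    while tree.xpath('/html/body/form/table[3]/tr/td/table[5]/tr/td/table[1]/tr/td[2]//p['+str(i)+']/a/text()') != []:
        # text nodes that are direct children of the i-th <p>
        reference = tree.xpath('/html/body/form/table[3]/tr/td/table[5]/tr/td/table[1]/tr/td[2]//p['+str(i)+']/text()')
        # text of the links inside the same <p>
        link_ref = tree.xpath('/html/body/form/table[3]/tr/td/table[5]/tr/td/table[1]/tr/td[2]//p['+str(i)+']//a/text()')
        # the link text is added before the surrounding text, so page order is lost
        testo_reference = testo_reference + [link_ref[0]] + reference
        i = i + 1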

I want to return an array containing every line under References, with no HTML tags, only the text content.

Comments:

Use BeautifulSoup.

Answer 1:

As suggested in the comments, BeautifulSoup makes this very easy:

In [2]: from bs4 import BeautifulSoup

In [3]: import urllib2

In [4]: url = "http://www.dlib.org/dlib/november14/***/11***.html"

In [5]: soup = BeautifulSoup(urllib2.urlopen(url))

In [6]: for h3 in soup.find_all("h3"):
   ...:     print(h3.text)
   ...:     
D-Lib Magazine
The Social, Political and Legal Aspects of Text and Data Mining (TDM)
Abstract
1. Introduction
2. Copyright, database right, licences and TDM
3. Recent changes to UK law
4. What can politicians and policy makers do? 
5. Publishers are not embracing opportunities of TDM
6. How can publishers help TDM researchers?
7. Awareness among academics and a technological gap 
8. Conclusion
Notes
References
About the Authors
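
The session above lists the h3 headings but stops short of collecting the text under References. One possible way to do that is sketched below. It is a sketch under assumptions and was not tested against the real page: `SAMPLE_HTML` is a made-up stand-in for the article, the helper name `references_text` is hypothetical, and the code assumes the reference paragraphs are siblings of the References h3, which may not hold for the actual layout. It is written for Python 3 and passes an explicit `"html.parser"` to BeautifulSoup.

```python
from bs4 import BeautifulSoup

# Hypothetical stand-in for the article page; the real markup may differ.
SAMPLE_HTML = """
<html><body>
<h3>Notes</h3>
<p>1 A note.</p>
<h3>References</h3>
<p>[1] First author. <a href="http://example.org/a">Linked title</a>. 2014.</p>
<p>[2] Second author. <a href="http://example.org/b">Another title</a>, pp. 1-9.</p>
<h3>About the Authors</h3>
<p>Author biography.</p>
</body></html>
"""


def references_text(html):
    """Return the text of each element between the References h3 and the next h3."""
    soup = BeautifulSoup(html, "html.parser")
    heading = soup.find(
        lambda tag: tag.name == "h3" and tag.get_text(strip=True) == "References"
    )
    if heading is None:
        return []
    lines = []
    for sibling in heading.find_next_siblings():
        if sibling.name == "h3":  # stop at the next section heading
            break
        # get_text() returns link text and surrounding text in document order;
        # split/join collapses runs of whitespace into single spaces
        text = " ".join(sibling.get_text().split())
        if text:
            lines.append(text)
    return lines


references = references_text(SAMPLE_HTML)
for line in references:
    print(line)
```

Because `get_text()` walks each element in document order, the link text stays where it appears in the reference instead of being moved to the front, which is the ordering problem described in the question.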
